Scylla Enterprise Release 2018.1.9

The Scylla team is pleased to announce the release of Scylla Enterprise 2018.1.9, a production-ready Scylla Enterprise minor release. Scylla Enterprise 2018.1.9 is a bug fix release for the 2018.1 branch, the latest stable branch of Scylla Enterprise. Scylla Enterprise customers are encouraged to upgrade to Scylla Enterprise 2018.1.9 in coordination with the Scylla support team.

This release fixes the two issues listed below, with open source references where they exist:

  • Scylla aborted with an “Assertion `end >= _stream_position' failed” exception. This occurred when querying a partition with no clustering ranges (as happens on counter tables with no live rows) that also had no static columns. #3304
  • Monitoring: latency values reported by Prometheus might be wrong #3827


Wide Partitions in Apache Cassandra 3.11

This article was originally published on Backblaze.

 

Wide Partitions in Cassandra can put tremendous pressure on the Java heap and garbage collector, impact read latencies, and cause issues ranging from load shedding and dropped messages to crashed and downed nodes.

While the theoretical limit on the number of cells per Partition has always been two billion cells, the reality has been quite different, as the impacts of heap pressure show. To mitigate these problems, the community has offered a standard recommendation for Cassandra users to keep Partitions under 400MB, and preferably under 100MB.

However, in version 3 many improvements were made that affected how Cassandra handles wide Partitions: Memtables, caches, and SSTable components were moved off-heap, the storage engine was rewritten in CASSANDRA-8099, and Robert Stupp made a number of other improvements listed under CASSANDRA-11206.

Working with Backblaze and operating a Cassandra version 3.11 cluster, we had the opportunity to test and validate how Cassandra actually handles Partitions with this latest version. We will demonstrate that well-designed data models can go beyond the existing 400MB recommendation without nodes crashing through heap pressure.

Below, we walk through how Cassandra writes Partitions to disk in 3.11, look at how wide Partitions impact read latencies, and then present our testing and verification of wide Partition impacts on the cluster, using the work we did with Backblaze.

The Art and Science of Writing Wide Partitions to Disk

First we need to understand what a Partition is and how Cassandra writes Partitions to disk in version 3.11.
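As a quick refresher, a Partition is the set of Rows that share a partition key and are stored together, ordered by their clustering columns. Below is a minimal sketch of the kind of data model that naturally produces wide Partitions (the table and column names are hypothetical, used only for illustration):

CREATE TABLE sensor_readings (
    sensor_id uuid,
    reading_time timestamp,
    value double,
    PRIMARY KEY (sensor_id, reading_time)   -- sensor_id is the partition key, reading_time the clustering key
);

-- Every reading for a given sensor_id lands in the same Partition, so a
-- long-lived or very busy sensor accumulates a wide Partition over time.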

Each SSTable contains a set of files, and the (-Data.db) file contains numerous Partitions.

The layout of a Partition in the -Data.db file has three components: a header, followed by zero or one static Row, followed by zero or more ordered Clusterable objects. Each Clusterable object may be either a Row or a RangeTombstone that deletes data; a wide Partition contains many such Clusterable objects. For an excellent in-depth examination of this, see Aaron’s blog post on the Cassandra 3.x Storage Engine.

The -Index.db file stores offsets for the Partitions, as well as the serialized IndexInfo objects for each Partition. These indices facilitate locating the data on disk within the -Data.db file. Stored Partition offsets are represented by a subclass of RowIndexEntry. The subclass is chosen by the ColumnIndex and depends on the size of the Partition:

  • RowIndexEntry is used when there are no Clusterable objects in the Partition, such as when there is only a static Row. In this case there are no IndexInfo objects to store and so the parent RowIndexEntry class is used rather than a subclass.

  • The IndexedEntry subclass holds the IndexInfo objects in memory until the Partition has finished writing to disk. It is used for Partitions where the total serialized size of the IndexInfo objects is less than the column_index_cache_size_in_kb configuration setting (which defaults to 2KB).

  • The ShallowIndexedEntry subclass serializes IndexInfo objects to disk as they are created and references these objects using only their position in the file. It is used for Partitions where the total serialized size of the IndexInfo objects is more than the column_index_cache_size_in_kb configuration setting.

These IndexInfo objects provide a sampling of positional offsets for Rows within a Partition, creating an index. Each object specifies the offset the page starts at, the first Row and the last Row.

So, in general, the bigger the Partition, the more IndexInfo objects need to be created when writing to disk - and if they are held in memory until the Partition is fully written to disk they can cause memory pressure. This is why the column_index_cache_size_in_kb setting was added in Cassandra 3.6 and the objects are now serialized as they are created.

The relationship between Partition size and the number of objects was quantified by Robert Stupp in his presentation, Myths of Big Partitions:

IndexInfo numbers from Robert Stupp

How Wide Partitions Impact Read Latencies

Cassandra’s key cache is an optimization that is enabled by default and helps to improve the speed and efficiency of the read path by reducing the amount of disk activity per read.

Each key cache entry is identified by a combination of the keyspace, table name, SSTable, and the Partition key. The value of the key cache is a RowIndexEntry or one of its subclasses - either IndexedEntry or the new ShallowIndexedEntry. The size of the key cache is limited by the key_cache_size_in_mb configuration setting.

When a read operation in the storage engine gets a cache hit, it avoids having to access the -Summary.db and -Index.db SSTable components, which reduces that read request’s latency. Wide Partitions, however, can decrease the efficiency of this key cache optimization because fewer hot Partitions will fit into the allocated cache size.

Indeed, before the ShallowIndexedEntry was added in Cassandra version 3.6, a single wide Row could fill the key cache, reducing the hit rate efficiency. When applied to multiple Rows, this will cause greater churn of additions and evictions of cache entries.

For example, if the IndexedEntry for a 512MB Partition contains 100K+ IndexInfo objects, and these IndexInfo objects total 1.4MB, then the key cache would only be able to hold 140 entries.

The introduction of ShallowIndexedEntry objects changed how the key cache can hold data. The ShallowIndexedEntry contains a list of file pointers referencing the serialized IndexInfo objects and can binary search through this list, rather than having to deserialize the entire list of IndexInfo objects. Thus, when the ShallowIndexedEntry is used, no IndexInfo objects exist within the key cache. This increases the storage efficiency of the key cache, allowing it to hold more entries, but it does still require that the IndexInfo objects be binary searched and deserialized from the -Index.db file on a cache hit.

In short, on wide Partitions a key cache miss still results in two additional disk reads, as it did before Cassandra 3.6, but now a key cache hit incurs a disk read to the -Index.db file where it did not before Cassandra 3.6.

Object Creation and Heap Behavior with Wide Partitions in 2.2.13 vs 3.11.3

Introducing the ShallowIndexedEntry into Cassandra version 3.6 created a measurable improvement in the performance of wide Partitions. To test the effects of this and the other performance enhancements introduced in version 3, we compared how Cassandra 2.2.13 and 3.11.3 performed when one hundred thousand, one million, or ten million Rows were written to a single Partition.

The results and accompanying screenshots help illustrate the impact of object creation and heap behavior when inserting Rows into wide Partitions. While version 2.2.13 crashed repeatedly during this test, 3.11.3 was able to write over 30 million Rows to a single Partition before Cassandra crashed with an Out-of-Memory error. The test and results are reproduced below.

Both Cassandra versions were started as single-node clusters with default configurations, except for the heap customization in cassandra-env.sh:

    MAX_HEAP_SIZE="1G"
    HEAP_NEWSIZE="600M"

In Cassandra, only the configured concurrency of memtable flushes and compactors determines how many Partitions are processed by a node, and thus how much pressure is put on its heap, at any one time. Based on this known concurrency limitation, profiling can be done by inserting data into one Partition against one Cassandra node with a small heap. These results extrapolate to production environments.

The tlp-stress tool inserted data in three separate profiling passes against both versions of Cassandra, creating wide Partitions of one hundred thousand (100K), one million (1M), or ten million (10M) Rows.

A tlp-stress profile for wide Partitions was written, as no suitable profile existed. The read to write ratio used the default setting of 1:100.

The tlp-stress tool was then run with the following command lines:

# To write 100000 rows into one partition
tlp-stress run Wide --replication "{'class':'SimpleStrategy','replication_factor': 1}" -n 100K

# To write 1M rows into one partition
tlp-stress run Wide --replication "{'class':'SimpleStrategy','replication_factor': 1}" -n 1M

# To write 10M rows into one partition
tlp-stress run Wide --replication "{'class':'SimpleStrategy','replication_factor': 1}" -n 10M

Each time tlp-stress executed it was immediately followed by a command to ensure the full count of specified Rows passed through the memtable flush and were written to disk:

    nodetool flush

The graphs in the sections below, taken from the Apache NetBeans Profiler, illustrate how the ShallowIndexedEntry in Cassandra version 3.11 avoids keeping IndexInfo objects in memory.

Notably, the IndexInfo objects are instantiated far more often, but are referenced for much shorter periods of time. The Garbage Collector is more effective at removing short-lived objects, as illustrated by the GC pause times being barely present in the Cassandra 3.11 graphs compared to Cassandra 2.2 where GC pause times overwhelm the JVM.

Wide Partitions in Cassandra 2.2

Benchmarks were run against Cassandra 2.2.13.

One Partition with 100K Rows (2.2.13)

The following three screenshots show the number of IndexInfo objects instantiated during the write benchmark and during compaction, along with a heap profile.

The partition grew to be ~40MB.

Objects created during tlp-stress

Objects created during subsequent major compaction

Heap profiled during tlp-stress and major compaction

The above diagrams do not have their x-axis expanded to the full width, but still encompass the startup, stress test, flush, and compaction periods of the benchmark.

When stress testing starts with tlp-stress, the CPU Time and Surviving Generations start to climb. During this time the heap also starts to increase and decrease more frequently as it fills up and the Garbage Collector then cleans it out. In these diagrams the garbage collection intervals are easy to identify and isolate from one another.

One Partition with 1M Rows (2.2.13)

Here, the first two screenshots show the number of IndexInfo objects instantiated during the write benchmark and during the subsequent compaction process. The third screenshot shows the CPU & GC Pause Times and the heap profile from the time writes started through when the compaction was completed.

The partition grew to be ~400MB.

Already at this size, the Cassandra JVM is thrashing in GC and has occasionally crashed with Out-of-Memory errors.

Objects created during tlp-stress

Objects created during subsequent major compaction

Heap profiled during tlp-stress and major compaction

The above diagrams display a longer running benchmark, with the quiet period during the startup barely noticeable on the very left-hand side of each diagram. The number of garbage collection intervals and the oscillations in heap size are far more frequent. The GC Pause Time during the stress testing period is now consistently higher and comparable to the CPU Time. It only dissipates when the benchmark performs the flush and compaction.

One Partition with 10M Rows (2.2.13)

In this final test of Cassandra version 2.2.13, the results were difficult to reproduce reliably, as more often than not the test crashed with an Out-of-Memory error from GC heap pressure.

The first two screenshots show the number of IndexInfo objects instantiated during the write benchmark and during the subsequent compaction process. The third screenshot shows the GC Pause Time and the heap profile from the time writes started until compaction was completed.

The partition grew to be ~4GB.

Objects created during tlp-stress

Objects created during subsequent major compaction

Heap profiled during tlp-stress and major compaction

The above diagrams display consistently very high GC Pause Time compared to CPU Time. Any Cassandra node under this much duress from garbage collection is not healthy. It is suffering from high read latencies, could become blacklisted by other nodes due to its lack of responsiveness, and even crash altogether from Out-of-Memory errors (as it did often during this benchmark).

Wide Partitions in Cassandra 3.11.3

Benchmarks were run against Cassandra 3.11.3.

In this series, the graphs demonstrate how IndexInfo objects are created either from memtable flushes or from deserialization off disk. The ShallowIndexedEntry is used in Cassandra 3.11.3 when deserializing the IndexInfo objects from the -Index.db file.

Neither form of IndexInfo object resides long in the heap, and thus the GC Pause Time is barely visible in comparison to Cassandra 2.2.13, despite the additional number of IndexInfo objects created via deserialization.

One Partition with 100K Rows (3.11.3)

As with the test of this size on the earlier version, the first two screenshots show the number of IndexInfo objects instantiated during the write benchmark and during the subsequent compaction process. The third screenshot shows the CPU & GC Pause Time and the heap profile from the time writes started until the compaction was completed.

The partition grew to be ~40MB, the same as with Cassandra 2.2.13.

Objects created during tlp-stress

Objects created during subsequent major compaction

Heap profiled during tlp-stress and major compaction

The diagrams above are roughly comparable to the first diagrams presented under Cassandra 2.2.13, except here the x-axis is expanded to full width. Note there are significantly more instantiated IndexInfo objects, but barely any noticeable GC Pause Time.

One Partition with 1M Rows (3.11.3)

Again, the first two screenshots show the number of IndexInfo objects instantiated during the write benchmark and during the subsequent compaction process. The third screenshot shows the CPU & GC Pause Time and the heap profile from the time writes started until the compaction was completed.

The partition grew to be ~400MB, the same as with Cassandra 2.2.13.

Objects created during tlp-stress

Objects created during subsequent major compaction

Heap profiled during tlp-stress and major compaction

The above diagrams show a wildly oscillating heap as many IndexInfo objects are created, and many garbage collection intervals, yet the GC Pause Time remains low, if noticeable at all.

One Partition with 10M Rows (3.11.3)

Here again, the first two screenshots show the number of IndexInfo objects instantiated during the write benchmark and during the subsequent compaction process. The third screenshot shows the CPU & GC Pause Time and the heap profile from the time writes started until the compaction was completed.

The partition grew to be ~4GB, the same as with Cassandra 2.2.13.

Objects created during tlp-stress

Objects created during subsequent major compaction

Heap profiled during tlp-stress and major compaction

Unlike the corresponding profile in 2.2.13, the cluster remained stable, as it did when running 1M Rows per Partition. The above diagrams display an oscillating heap as IndexInfo objects are created, and many garbage collection intervals, yet the GC Pause Time remains low, if noticeable at all.

Maximum Rows in 1GB Heap (3.11.3)

In an attempt to push Cassandra 3.11.3 to the limit, we ran a test to see how much data could be written to a single Partition before Cassandra Out-of-Memory crashed.

The result was 30M+ rows, which is ~12GB of data on disk.

This is similar to the limit of 17GB of data written to a single Partition that Robert Stupp found in CASSANDRA-9754 when using a 5GB Java heap.

Maximum Rows in 1GB Heap

What about Reads?

The following graph reruns the benchmark on Cassandra version 3.11.3 over a longer period of time with a read to write ratio of 10:1. It illustrates that reads of wide Partitions do not create the heap pressure that writes do.

Reading from wide Partitions

Conclusion

While the 400MB community recommendation for Partition size is clearly appropriate for version 2.2.13, version 3.11.3 shows that performance improvements have created a tremendous ability to handle wide Partitions, which can easily be an order of magnitude larger than in earlier versions of Cassandra without nodes crashing through heap pressure.

The trade-off for better supporting wide Partitions in Cassandra 3.11.3 is increased read latency, as Row offsets now need to be read off disk. However, modern SSDs and kernel pagecaches take advantage of larger configurations of physical memory, providing enough IO improvement to compensate for the read latency trade-off.

The improved stability, along with the ability to fall back on better hardware to deal with the read latency issue, allows Cassandra operators to worry less about how to store massive amounts of data in different schemas and about unexpected data growth patterns on those schemas.

With CASSANDRA-9754, custom B+ tree structures will be used to more effectively look up the deserialized Row offsets and to further avoid the deserialization and instantiation of short-lived, unused IndexInfo objects.

JSON Support in Scylla

Beginning with version 2.3, Scylla Open Source supports the JavaScript Object Notation (JSON) format. That includes inserting JSON documents, retrieving data in JSON, and providing helper functions to transform native CQL types into JSON and vice versa.

Also note that schemas are still enforced for all operations — one cannot just insert random JSON documents into a table. The new API is simply a convenient way of working with JSON without having to convert everything back and forth client-side.

JSON support consists of CQL statements and functions, described here, one by one, with examples.

You can use the following code snippet to build a sample restaurant menu. This example will serve as the basis for the following sections. The snippet also contains a second table, based on collections, which holds additional information about the served dishes.

CREATE KEYSPACE restaurant WITH REPLICATION = { 'class' : 'SimpleStrategy', 'replication_factor' : 1 };
use restaurant;

CREATE TABLE menu (category text, position int, name text, price float, PRIMARY KEY(category, position));

INSERT INTO menu (category, position, name, price) VALUES ('starters', 1, 'foie gras', 10.50);
INSERT INTO menu (category, position, name, price) VALUES ('starters', 2, 'steak tartare', 9.50);
INSERT INTO menu (category, position, name, price) VALUES ('starters', 3, 'taco de pulpo', 8.00);
INSERT INTO menu (category, position, name, price) VALUES ('soups', 1, 'sour rye soup', 12);
INSERT INTO menu (category, position, name, price) VALUES ('soups', 2, 'sorrel soup', 8);
INSERT INTO menu (category, position, name, price) VALUES ('soups', 3, 'beef tripe soup', 11.20);
INSERT INTO menu (category, position, name, price) VALUES ('main courses', 1, 'red-braised pork belly', 24.90);
INSERT INTO menu (category, position, name, price) VALUES ('main courses', 2, 'boknafisk', 19);


CREATE TABLE info (category text PRIMARY KEY, calories map<text, int>, vegan set<text>, ranking list<text>);
INSERT INTO info (category, calories, vegan, ranking) VALUES ('soups', {'sour rye soup': 500, 'sorrel soup': 290}, {'sorrel soup'}, ['sour rye soup', 'sorrel soup']);

SELECT JSON

Selecting data in JSON format can be performed with the SELECT JSON statement. Its syntax is almost identical to a regular CQL SELECT.

In order to extract all data and see what the restaurant serves, try:

SELECT JSON * from menu;

Named columns can also be specified to narrow down the results. So, if we’re only interested in names and prices:

SELECT JSON name, price from menu;

As in regular CQL SELECT, it’s of course possible to restrict the query. Extracting soup info from the database can be achieved like this:

SELECT JSON name, price from menu WHERE category='soups';
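With the sample data above, this restricted query returns one JSON document per matching Row, along the lines of the following (exact float formatting may vary slightly):

 [json]
--------------------------------------------
 {"name": "sour rye soup", "price": 12.0}
 {"name": "sorrel soup", "price": 8.0}
 {"name": "beef tripe soup", "price": 11.2}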

Since the data underneath is still structured by our schema, it’s possible to apply filtering too. So, if our meal is reimbursed anyway and we don’t want to ruin it by spending too little money:

SELECT JSON name, price from menu WHERE price > 10 ALLOW FILTERING;

Note that the results always consist of one column named [json]. This column contains the requested information in JSON format, properly typed – as string, int, float, or boolean. Of course, (nested) collections are supported too!

SELECT JSON * FROM info;
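For the info table populated above, the collection columns come back as JSON objects and arrays, roughly as follows (key ordering may differ):

 [json]
---------------------------------------------------------------------------------------------------------------------------------------------------
 {"category": "soups", "calories": {"sorrel soup": 290, "sour rye soup": 500}, "ranking": ["sour rye soup", "sorrel soup"], "vegan": ["sorrel soup"]}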

INSERT JSON

Inserting JSON data is also very similar to a regular INSERT statement. Still, note that even though JSON documents can contain many arbitrary fields, the ones inserted into Scylla will be validated against the table’s schema. Let’s add another soup to the menu:

INSERT INTO menu JSON '{"category": "soups", "position": 4, "name": "red borscht", "price": 11}';

That’s it – not complicated at all. What happens if we try to sneak some out-of-schema data into the statement?

INSERT INTO menu JSON '{"category": "soups", "position": 4, "name": "red borscht", "price": 11, "comment": "filling and delicious"}';

Not possible – schema rules cannot be ignored. What if some columns are missing from our JSON?

INSERT INTO menu JSON '{"category": "soups", "position": 4, "price": 16}';

SELECT * from menu;

Works fine; the omitted column just defaults to null. But there’s more to the topic.

DEFAULT NULL/DEFAULT UNSET

By default, omitted columns are treated as null values. If, instead, the user wants to leave a column’s existing value unchanged, the DEFAULT UNSET flag can be used. So, if our red borscht sells well and we want to boost the price in order to increase revenue:

INSERT INTO menu JSON '{"category": "soups", "position": 4, "price": 16}' DEFAULT UNSET;

We can see that our soup name was left intact, but the price changed:

SELECT * FROM menu WHERE category='soups';

fromJson()

fromJson() is a functional equivalent of INSERT JSON for a single value. The easiest way to explain its usage is with an example:

INSERT INTO menu (category, position, name, price) VALUES (fromJson('"soups"'), fromJson('1'), 'sour rye soup', 12);

The function works fine with collections too.

INSERT INTO info (category, calories) VALUES ('starters', fromJson('{"foie gras": 550}'));

SELECT * FROM info WHERE category = 'starters';

toJson()

toJson() is a counterpart of the fromJson() function (yes, really!) and can be used to convert single values to JSON format.

SELECT toJson(category), toJson(name) FROM menu;

SELECT category, toJson(calories), toJson(vegan), toJson(ranking) FROM info;

Types

The mapping of CQL types to JSON is well defined and usually intuitive. A full reference table of the corresponding types can be found below; a short example follows the table. Note that some CQL types (e.g. decimal) will be implicitly converted to others, with possibly different precision (e.g. float), when returning JSON values.

CQL type   | INSERT JSON accepted type | SELECT JSON returned type
-----------|---------------------------|--------------------------
ascii      | string                    | string
bigint     | integer, string           | integer
blob       | string                    | string
boolean    | boolean, string           | boolean
date       | string                    | string
decimal    | integer, string, float    | float
double     | integer, string, float    | float
float      | integer, string, float    | float
inet       | string                    | string
int        | integer, string           | integer
list       | list, string              | list
map        | map, string               | map
smallint   | integer, string           | integer
set        | list, string              | list
text       | string                    | string
time       | string                    | string
timestamp  | integer, string           | string
timeuuid   | string                    | string
tinyint    | integer, string           | integer
tuple      | list, string              | list
uuid       | string                    | string
varchar    | string                    | string
varint     | integer, string           | integer
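As a quick illustration of the mappings above, a float column accepts its value as an integer, a string, or a float in INSERT JSON, while SELECT JSON always returns it as a float. Here is a minimal sketch against the menu table from earlier (the 'gazpacho' row is hypothetical, added only for this example):

INSERT INTO menu JSON '{"category": "soups", "position": 5, "name": "gazpacho", "price": "7.50"}';

SELECT JSON name, price FROM menu WHERE category='soups' AND position=5;

-- Expected to return something like: {"name": "gazpacho", "price": 7.5}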

We do JSON. How about you?

JSON support in Scylla permits a variety of novel designs and implementations. If you are currently using JSON in your own Scylla deployment, or are planning to use this feature in your own development, we’d love to hear from you.

Five Great Use Cases for a Hybrid Cloud Database

In a data-driven world, leading enterprises are moving to hybrid cloud databases that deliver the agility, reliability, and availability they require.

But “hybrid cloud” can span many functions.

How, specifically, are organizations actually using hybrid cloud databases to stay ahead of the competition and stay agile?

Let’s take a look at five of the more popular enterprise use cases.

1. Always-on applications

Many organizations today rely on teams distributed across the world.

Today’s workers are on the go—not to mention the average smartphone users who are glued to their phones as they go about their day.

Hybrid cloud databases powerfully support always-on web, mobile, and desktop applications. Due to their distributed architecture, applications are always available, ensuring consistent user experiences across devices and geographies.

Employees can always access the apps they need to do their work, so productivity increases. Customers can also use apps as they’re designed to be used, regardless of whether they’re on a mobile device or a laptop.

2. Fraud detection

In 2016, consumers lost $16 billion to identity theft and fraud. On average, financial firms with over $10 million in revenue lose 2.79% to fraud each year.

Since those figures aren’t acceptable, financial services organizations are increasingly deploying hybrid cloud databases that are able to quickly detect, at scale, anomalies in the relationships between data and transactions to identify fraud as it occurs—or even sooner.

This helps organizations such as banks reduce the occurrence of fraud significantly, increasing profitability and protecting customers.

3. Personalization

Personalization is increasingly important to both employees and customers.

Unfortunately, traditional relational database management systems (RDBMS) lack the ability to quickly process the data needed to serve up personalized experiences.

Distributed NoSQL hybrid cloud databases, on the other hand, deliver the scalability, availability, and speed required to deliver personalized experiences in every interaction.

All of a sudden, employees and customers feel more at home on your apps and engage with them more.

4. Customer experience

By 2020, the customer experience will overtake price and product as the key differentiator between brands, according to the Customers 2020 report.

Today’s customers expect a seamless experience as they hop across channels.

This is why companies are increasingly using hybrid cloud databases. Versatile, multimodel databases store data in an efficient, value-maximizing format (e.g., key-value, document or JSON). Unlike traditional RDBMS, they also enable applications to access all relevant customer data—regardless of where it lives or which other applications are using it—thereby ensuring consistent customer experiences.

5. Real-time data processing

Making great decisions starts with having the ability to store, process, and analyze complex data sets in real time.

Hybrid cloud databases provide the agility enterprises need to respond to opportunities immediately—reducing costs and increasing efficiency along the way.

Thanks to its NoSQL foundation, the same database can be used for both operational and analytical purposes. The same can’t be said for most traditional databases.

As you can see, hybrid databases are incredibly versatile.

From powering always-on applications and detecting fraud in real time to supporting personalization, delivering positive user experiences, and processing data in real time, hybrid databases enable organizations to supercharge their operations, increasing productivity while mitigating potential risks.

To learn more about how your organization can get to the next level by moving to a hybrid database, check out our eBook, The Power of an Active Everywhere Database.

Ingesting Data from Relational Databases to Cassandra with StreamSets

I know what some of you are thinking: write and deploy some code. And maybe the code can utilize a framework like Apache Spark. That's what I would have thought a few years ago. But it often turns out that's not as easy as expected.

Don't get me wrong, writing and deploying code makes sense for some folks. But for many others, writing and deploying custom code may require significant time and resources.

Scylla Summit 2018 Tech Talks Now Online

There was so much happening at our Scylla Summit 2018 late last fall. We held more than three dozen sessions, including multiple keynotes and concurrent breakout tracks. Our Tech Talks page has been updated with the videos and slides from Scylla Summit 2018. Now you can see what you missed — whether or not you were able to attend our user conference.

Dor Laor at Scylla Summit 2018

In the weeks ahead we’ll showcase some of the best talks from our conference, but there’s no need to wait. You can browse the whole catalog of talks from Scylla Summit 2018 today!

You can see the YouTube videos and all the SlideShare presentations in one place. All the keynotes, both by ScyllaDB executives and some of our most prominent customers. All the use case presentations by our incredible groundbreaking community members. All the tech talks from our engineers on current and upcoming features, best practices, tips and tricks.

We’ll point out one good intro to many of the engineering talks at Scylla Summit 2018: ScyllaDB CTO and Co-Founder Avi Kivity presented on our near-term and longer-term initiatives. This is quite timely, too, as the release of Scylla Open Source 3.0 is right around the corner!



If you have any questions or comments after watching these presentations, or if you’d like to share your own experience with Scylla, please feel free to contact us!

Scylla Enterprise Release 2018.1.8

The Scylla team is pleased to announce the release of Scylla Enterprise 2018.1.8, a production-ready Scylla Enterprise minor release. Scylla Enterprise 2018.1.8 is a bug fix release for the 2018.1 branch, the latest stable branch of Scylla Enterprise. Scylla Enterprise customers are encouraged to upgrade to Scylla Enterprise 2018.1.8 in coordination with the Scylla support team.

This release includes memory management improvements, first introduced in the Scylla Open Source 2.3 release and now graduated into Scylla Enterprise. These changes allow Scylla to free large contiguous areas of memory more reliably and with less effort, and they improve the performance of workloads that have large blobs or collections. #2480

Additional issues fixed in this release, with open source references where they exist:

  • In extreme out-of-memory situations, failure to allocate an internal data structure used for synchronizing memtable flushes can result in a segmentation fault and crash. #3931

  • Auditing: the audit configuration syntax in scylla.yaml was changed from audit-categories and audit-tables to audit_categories and audit_tables. Note that the upgrade procedure does *not* update an existing scylla.yaml automatically. More on Scylla Enterprise auditing and auditing configuration here

Scylla Summit 2018 Keynote: Four Years of Scylla

Dor Laor at Scylla Summit 2018
Now that the dust has settled from our Scylla Summit 2018 user conference, we’re glad for the chance to share the content with those who couldn’t make the trip to the Bay Area. We’ll start with the keynote from our CEO and Co-founder, Dor Laor, who kicked off the event with his talk about the past, present and future of Scylla.

Watch the video in full:

Browse the slides:

Dor began with an overview of the trends in the industry, first and foremost digital transformation. Sticking to the down-to-earth, practical culture at ScyllaDB, Dor covered real-life customer examples from all around us: starting in space with satellite-enabled services like GPS Insight and TellusLabs, to the rain forest plant-based beauty products of Brazil’s Natura, to the GE Predix-powered Industrial Internet of Things (IIoT) platform, Comcast X1’s on-demand services, as well as automotive applications from Faraday Future and Nauto.

In the beginning…

Scylla and Charybdis

Dor shared our company’s origins. He recalled first announcing the Seastar framework in February 2015, and leaving stealth mode in September of that year. ScyllaDB CTO Avi Kivity presented at that year’s Cassandra Summit on how a new database, Scylla, could deliver 1,000,000 CQL operations per server.

Scylla 1.0 Release Graphic

Over the ensuing three years we made a great deal of progress. We released Scylla 1.0 at the end of March 2016. That year also saw the first Scylla Summit in September. The following year, in March 2017, Scylla unveiled its first Enterprise software release.

Scylla Enterprise Release Graphic

While Scylla was blazing its own path in the world of NoSQL, Dor also remarked on the successes of others in the industry, including MongoDB’s public offering in October 2017 and the September 2018 IPO of Elastic. These events serve as validation of the growing Big Data market, as the hunger for data increases, fed by the growing appetite of modern, planet-scale software. Not only do most enterprises now trust the operational capabilities of distributed NoSQL databases, but the new world’s requirements cannot be met by traditional relational models.

State of the Art

Moving to the present, Dor announced Scylla Open Source 3.0. With this release, Scylla was finally achieving feature parity with Cassandra and, in some cases, taking the lead. For storage, the SSTable 3.0 format (mc) would reduce the data footprint on disk. Production-ready Materialized Views (MV) and Global Secondary Indexes (GSI) will help users access only the data they need. Lightweight Transactions (LWT) remain the last major feature needed to achieve full feature parity with Cassandra.

Dor also announced that our cloud managed database, Scylla Cloud, was available as early access. Running on Amazon Web Services (AWS), Scylla Cloud lets users launch a fully managed, single-tenant, self-service Scylla cluster in minutes.

Scylla Cloud Graphic

As much as we talk about Cassandra, we are shifting gears and want to be competitive with the best-of-breed NoSQL databases, with DynamoDB as a leading example.

Scylla vs. Dynamo Graphic

Dor shared results from a head-to-head YCSB comparison of Scylla versus Amazon DynamoDB. We recently published the comparative benchmark results. Our test results show you can achieve 1/4th the latency and spend only 1/7th the cost with Scylla, compared to DynamoDB, for similar throughput. (Scylla Cloud is 4-6X less expensive than DynamoDB.)

However, the real performance difference occurred in Zipfian distributions. You can read the blog in full as to why this is an important real-world consideration. Analogous test results were found for Bigtable, and CosmosDB was expected to perform similarly.

OLTP vs. OLAP Graphic

Another key feature introduced for the first time at Scylla Summit 2018 was our unique ability to support per-user SLAs, allowing system managers to limit database resource utilization. With this, Scylla customers can use the same Scylla cluster to service both transaction processing (mixed read-write, or write-heavy loads) as well as analytics (read-only/mostly) requests. Glauber Costa would host a full session on this, entitled OLAP or OLTP: Why not both?

Per-user SLAs build on three years of development of SLA guarantees for real-time operations over distributed database background operations such as compaction, repair, and streaming. This is a point-in-time evolution towards a perfectly multi-tenant database.

Dor then enumerated a list of noteworthy accomplishments and the challenges we still have before us. For example, while he was proud of our Mutant Monitoring System (MMS), there is still work to be done on our Knowledgebase, as well as our upcoming launch of Scylla University. And while performance is good, and compactions are relatively smooth compared to other offerings, there are still more optimizations to be done. And while he was proud of the work we’ve done to integrate with Apache Spark, there’s a lot more to do to align Scylla with Kubernetes.

The Shape of Things to Come

 

To conclude, Dor gave a glimpse into the future of Scylla. Finishing up Cassandra parity features, especially Lightweight Transactions. Fleshing out Scylla Cloud. Making Scylla itself a stronger offering, with new tiered storage options, improvements in performance and additional drivers. And finally, making Scylla even easier to manage.

It has been a remarkable journey over the past four years. From all of us at ScyllaDB, thank you for following us on our journey, and for a wonderful 2018. 

Looking ahead, 2019 is sure to be another amazing year of pioneering achievements in the world of Big Data, both for Scylla as well as our users and customers. We’re looking forward to all that we will accomplish together!
