Dissecting Scylla packets with Wireshark

Investigating problems in a distributed system may be a tedious task, but it becomes much easier with the right tools for the job. One of such tools is Wireshark, a well known utility that offers snooping on all kinds of network protocols – ranging from Ethernet to HTTP and beyond. From now on, it’s also possible to dissect Scylla’s internal protocol, used to communicate between nodes. This includes reading and writing rows, exchanging schema information, gossiping and repairs.


Using a packet dissector to investigate issues in distributed systems is a great asset. It’s extremely valuable to be able to see which nodes communicated with each other, which requests arrived from the clients at which time, and more. However, instead of examining raw bytes that happened to go through our network interfaces, it’s useful to get them parsed to a human-readable form first – assuming we know which protocols were used. With Wireshark, it’s possible to dissect the communication between Scylla nodes and their clients – both via the legacy Thrift protocol and the current standard, CQL (for which we contributed as well!). In order to examine CQL packets, simply use a “cql” filter in Wireshark:

With the “cql” filter, only CQL packets are shown

Filters can also limit the output to specific packets we’re interested in, e.g. only queries and their results:

Only CQL requests and their results are shown – other types (auth, events, etc.) are not visible

That’s all very cool, but Scylla is much more than its client-server protocol! Nodes also need to communicate with each other to fulfill various tasks. These are for example:

  • sending user-provided data from coordinators to replicas
  • reading data from other replicas in order to fulfill the consistency level
  • generating materialized view updates
  • updating schema information – announcing new tables, keyspaces, etc.
  • sharing cluster state via gossiping – which nodes are down, which are still bootstrapping, etc.
  • performing PAXOS rounds for lightweight transactions

and many more.

While it’s very useful to be able to dissect and examine what happens between clients and Scylla nodes, it’s often not enough. We know that a write request reached the coordinator node, but was data replicated to other nodes? Were all needed view updates generated and sent to appropriate replicas? Dissecting the network traffic between Scylla nodes can help answer these kinds of tough questions.

Now, since Scylla’s internal protocol is based on TCP and usually happens on port 7000, we can try to manually interpret the packets and check if they match our expectations.

For instance, the image below clearly represents a mutation, since the ninth byte of the TCP payload equals to 0x01, which is Scylla’s internal opcode for mutation:

Raw TCP payload inspection

Easy, right? Just kidding – parsing the flow manually (or even with custom shell scripts) based on raw TCP packets is a tedious process. After all, couldn’t Wireshark decode opcodes for us in a neat manner, just like it does for CQL? Well, now it can — an initial implementation of Scylla internal protocol dissector landed in Wireshark.

Quick Start

Scylla dissector was merged to Wireshark master only recently (March 30th 2020), so if it hasn’t become part of any official Wireshark release, it would first need to be built from source.

Example script:

git clone git@github.com:wireshark/wireshark.git
cd wireshark
mkdir build
cd build
cmake .. -DCMAKE_BUILD_TYPE=Release
make -j16

Now that Wireshark is built and ready to launch, one last thing to do is to ensure that Scylla packets will get recognized. Since port 7000 is not officially IANA registered as ScyllaDB protocol (https://www.iana.org/protocols), we need to configure wireshark to look for Scylla at port 7000. That can be done via:

  1. Wireshark GUI: `Analyze > Decode As > set Scylla as default for TCP port 7000`
  2. a config entry in Wireshark decoding config:
    echo "decode_as_entry: tcp.port,7000,Scylla,Scylla" >>

That’s it! Wireshark should now be able to properly dissect Scylla inter-node messages.


Let’s take another look at the network traffic after the Scylla dissector was added. It’s now much easier to determine which packets represent a mutation — rows sent from the coordinator to replicas, in order to be saved on disk:

A dissected mutation – carrying data expected to be written to a replica

The reading process also became more transparent – here we can see that one node received a read request, while another one was asked only for a digest, which will be used later to compare checksums and ensure that the consistency level requested by a user was reached:

A dissected read request – asking for data from a specific table

The ScyllaDB dissector is still in development, so most of the packets are not fully dissected — but it’s always possible to see what kind of operations they are responsible for.

It’s easy to notice that the majority of the traffic comes from exchanging gossip information between all nodes. Wireshark filters come in really handy when it comes to extract only interesting information from the stream of packets. Here are some examples:

  1. All packets except gossip information:
    scylla and (not scylla.verb == GOSSIP_DIGEST_ACK and not scylla.verb == GOSSIP_DIGEST_ACK2 and not scylla.verb == GOSSIP_DIGEST_SYN)
  2. Write operations only:
    scylla.verb == MUTATION
  3. All write operations for a specific partition key in a specific table:
    scylla.mut.table_id == ea:11:0f:44:50:9e:ef:aa:00:00:00:00:00:00:23:8a and scylla.mut.pkey == 68:65:68

Naturally, it’s possible to dissect multiple protocols at a time in order to get a complete picture of what happened in the distributed environment. For instance, here’s a comprehensive example of a complete flow: inserting a row with a lightweight transaction, and then trying to insert it again, which does not succeed, since it already exists. In order words, a following CQL payload is executed on a `map` table (defined as `CREATE TABLE map(id int PRIMARY KEY, country text, capital text)`:

INSERT INTO map(id, country, capital) values (17, 'Germany', 'Berlin') IF NOT EXISTS;
INSERT INTO map(id, country, capital) values (17, 'Germany', 'Berlin') IF NOT EXISTS;

Note that the example includes both client-Scylla communication and Scylla inter-node messages, which together form a complete picture of what happened in the cluster. The following Wireshark filter was used:

(scylla and (not scylla.verb == GOSSIP_DIGEST_ACK and not scylla.verb == GOSSIP_DIGEST_ACK2 and not scylla.verb == GOSSIP_DIGEST_SYN)) or cql

Wireshark allows visualizing communication as a flow graph, so let’s take advantage of this capability:

A flow graph for packets involved in a lightweight transaction

The dissector can also visualize the throughput overhead induced by materialized views and secondary indexes. Here’s how a query which uses a secondary index looks like — it needs to read rows not only from the base table, but also from the underlying index, which may reside on other nodes:

A flow graph for an indexed read operation with consistency level TWO

and here’s a corresponding query, which simply specifies a primary key to read from, with consistency level TWO. It means that one node will be queried for data, and another will send only a digest:

A flow graph for a regular read operation (without any indexes involved), consistency level TWO

Similarly, here’s how writing to a table looks like with replication factor 2. No surprises here – two mutations need to be sent to appropriate replicas:

A flow graph for a regular write operation with replication factor 2

and here’s a corresponding write operation, but after adding a secondary index to this table – besides writing to the base table, view updates are also generated and sent:

A flow graph for a write operation involving materialized view updates, replication factor 2

The trace above shows another interesting property of materialized view updates – the fact that they are executed in the background. Two mutations were acknowledged before returning a response to the client – these are the writes to the base table. Conversely, it can be observed that view updates were acknowledged only after the client received a response.


Protocol dissectors are a very powerful tool to investigate issues in distributed systems. Snooping on internal communication between Scylla nodes became much easier with Wireshark, now that it’s officially able to dissect Scylla inter-node protocol. Give it a try!

The post Dissecting Scylla packets with Wireshark appeared first on ScyllaDB.

Persosa Connects TV with Real-time Digital Experiences Using Scylla Cloud

Persosa helps media and entertainment companies create connected experiences for their viewers. Founded in 2016, Persosa is a Phoenix-based startup company. In 2019, the company won the Arizona Innovation Challenge award.

Viewers today consume a vast array of television programming via streaming platforms that contain commercials and advertising, along with product placements. Persosa connects those viewing experiences with the advertisers’ content on other platforms, creating a consistent and personalized brand experience.

Web personalization can increase conversions by 350%. Persosa’s goal is to extend those personalization capabilities to work with media and entertainment.

The Challenge

To connect TV and digital experiences in real time, Persosa needs to track and segment viewers, determine what they should see, and then serve them that content and render it in real time. Google Analytics, for example, has the luxury of taking hours before displaying data. In contrast, Persosa needs to make decisions in milliseconds.

As such, the first page-view of a session from a brand new visitor is Persosa’s most expensive request, with the highest latency. Those requests typically take two to three hundred milliseconds response time. Subsequent requests often fall in the 25 to 35 millisecond range. Persosa’s assets load on the page before the brand’s other assets.

For this reason, speed and latency are a very high concern. Persosa needed to find the right database to meet these requirements. Databases in the traditional stack just weren’t going to cut it. Relational databases, for example, were discounted immediately.

Kirk Morales, a co-founder and the CTO at Persosa, explained their thinking: “Even with something like MongoDB, which I’ve personally had a lot of experience with, the latency was too high. Having had some experience in the past with Cassandra, I started looking into Cassandra and from there, into Bigtable. Then, in researching that, came across Scylla.”

The Solution

Ultimately, the decision boiled down to Scylla versus Bigtable. Morales decided to go with Scylla for a few reasons. “First of all, Scylla’s consistently low latency reads were the most important factor for us,” he explained. “Having used Cassandra in the past, there was really no learning curve for me when I was building. And also when you’re looking at defining a schema and figuring out what your data model looks like, Scylla is a lot easier to understand than Bigtable. Ultimately that led to the decision.”

“Our experience with Scylla Cloud has been great. We’ve been meeting and actually exceeding our expectations on our request latency.”

– Kirk Morales, Chief Technology Officer, Persosa

As a startup, Persosa was not equipped to manage their own deployment. They initially used the only available hosted option, which at the time was Scylla on IBM Compose. “We could start at a couple hundred bucks a month and then it would grow with us. And at the time, everything worked great. It was easy to connect, fast deployment, and a pretty reliable service for a while.”

Eventually, Morales ran into a few hiccups and support delays. It was also clear that scaling would become expensive; the entry-level price point was a lot cheaper but at scale, it was actually not an affordable solution.

Scylla Cloud image

Morales took a look at Scylla Cloud, but was concerned that there were only deployments on AWS. Persosa was running on Google Cloud Platform (GCP). After talking with Scylla and running tests, Morales found out that, even with cross provider latency, Scylla Cloud on AWS was actually faster than Compose on GCP.

“I dedicated a few days to building some tests and actually running them from our GCP environment to simulate what our reads and writes looked like. And the performance with Scylla Cloud came back consistently better every single time.”

“Our experience with Scylla Cloud has been great. It’s met all our expectations. It’s easy to manage the account. It’s been consistently fast. So we’ve been meeting and actually exceeding our expectations on our request latency. Our requests are actually faster than they were before, which is just great for the end user experience. We’re seeing our requests come in at about 10% of what they were — from 80 milliseconds on server side, and that includes Scylla reads and everything else, down to about 9 milliseconds on average. Most of that improvement is due to Scylla’s consistently fast reads.”

Persosa also saw greatly improved support from Scylla. “To have that great experience as a client, but then also know that behind it we’re getting a great enterprise offering, that’s really the best of both worlds because, usually, you’re only getting one or the other.”

The post Persosa Connects TV with Real-time Digital Experiences Using Scylla Cloud appeared first on ScyllaDB.

Scylla Enterprise Release 2019.1.8

The ScyllaDB team announces the release of Scylla Enterprise 2019.1.8, a production-ready Scylla Enterprise patch release. As always, Scylla Enterprise customers are encouraged to upgrade to Scylla Enterprise 2019.1.8 in coordination with the Scylla support team.

The focus of Scylla Enterprise 2019.1.8 is improving stability and bug fixes. More below.

Related Links

Fixed issues in this release are listed below, with open source references, if present:

  • Stability: in some rare cases, SSTable metadata in memory is not correctly evicted, causing memory bloat #4951
  • CQL: a restricted column that is not in a select clause returns a wrong value #5708
  • Stability: Node shutdown may exit when using encryption in transit #5759, seastar #727
  • Performance: scylla doesn’t enforce the use of TSC clocksource #4474
  • Stability: wrong hinted handoff logic for detecting a destination node in DN state #4461
  • Stability: commit log exception in shutdown #4394
  • Stability: potential race condition when creating a table with the same name as a deleted one #4382
  • Install: scylla_setup does not present virtio-blk devices correctly on interactive RAID setup #4066
  • Stability: malformed replication factor is not blocked in time, causing an error when running DESCRIBE SCHEMA later #3801
  • Stability: In rare cases, and under heavy load, for example, during repair, Scylla Enterprise might OOM and exit with an error such as “compaction_manager - compaction failed: std::bad_alloc (std::bad_alloc)”. #3717
  • Stability: possible abort when using reverse queries that read too much data #5804
  • Stability: writes inserted into memtable may be interpreted using incorrect schema version on schema ALTER of a counter table #5095
  • Stability: possible query failure, crash if the number of columns in a clustering key or partition key is more than 16 #5856
  • Stability: When speculative read is configured a write may fail even though enough replicas are alive #6123
  • Performance: Allow tweaking of slow repairs due to redundant writes for tables with materialized views
  • Stability, hinted_handoff: all nodes in the cluster become overloaded (CPU 100% loaded on all shards) after node finishes the “replace node” procedure
  • Monitoring: streaming: Make total_incoming_bytes and total_outgoing_bytes metrics monotonic #5915

The post Scylla Enterprise Release 2019.1.8 appeared first on ScyllaDB.

Zeotap: A Graph of Twenty Billion IDs Built on Scylla and JanusGraph

Zeotap is a customer intelligence platform that enables brands to better understand their customers and better predict their behavior. They integrate and correlate data across a number of social media, adtech, martech and CRM platforms. Headquartered in Germany, the company emphasizes consumer privacy and data security, with multiple certifications (such as GDPR) to ensure end-to-end compliance across markets.

At our Scylla Summit, Zeotap’s Vice President of Engineering, Sathish K S, and Principal Data Engineer Saurabh Verma shared how Zeotap addresses the challenge of identity resolution and integration. They presented what Saurabh described as the “world’s largest native JanusGraph database backed by Scylla” to meet the needs of tracking and correlating 20 billion IDs for users and their devices.

Identity Resolution at Scale

Zeotap uses Scylla and JanusGraph as part of their Connect product, a deterministic identity solution that can correlate offline customer CRM data to online digital identifiers. Sourcing data from over 80 partners, Zeotap distributes that data out to over forty destinations.

As an example of identity resolution, Sathish gave the example of a person who might have one or two mobile phones, a tablet, an IP television, and may belong to any number of loyalty programs or clubs. Each of these devices and each of these programs is going to generate an identifier for you, plus cookie identifiers across all your devices for each of the sites you visit.

Sathish emphasized, “The linkages, or the connections, between the identities are more valuable than the individual identities.” The business case behind this, the match test and the ability to export these IDs, are what Zeotap can monetize. For example, a customer can present a set of one million email addresses and ask what percentage of those can be matched against Zeotap’s known user data. They can also make a transitive link to other identifiers that represent that known user. The user can then export that data to their own systems.

On top of this, Zeotap maintains full reporting and compliance mechanisms, including supporting opt-outs. They also have to ensure identity quality and integrate their data with multiple third parties, all while maintaining a short service level agreement for “freshness” of data. “That means you need to have real quick ingestion as well as real quick query performance.”

Their original MVP implementation used Apache Spark batch processing to correlate all of their partner data, with output results stored to s3. Users then accessed data via Amazon Athena queries and other Apache Spark jobs against an Amazon Redshift data warehouse to produce match tests, exports and reports. But when Zeotap started crossing the three billion ID scale, they saw their processing going towards greater than a full day (twenty four hours), while their SLAs were often in the two to six hour range.

Zeotap looked at their requirements. They needed a system that allowed for both high levels of reads and writes simultaneously. Because even as data is being analyzed you can still have more data pouring into the system. Their SLA, back at a time when they had three billion IDs, was 50k writes per second at a minimum. They were also looking for a database solution that provided Time To Live (TTL) for data right out of the box.

For the read workload, Zeotap needed to do a lot of analytics to match IDs and retrieve (export) data on linked IDs, often based on conditions such as ID types (Android, iOS, website cookies), or properties (recency, quality score, or country).

Sathish then spoke about their business model. Zeotap doesn’t buy data from their partners. They provide revenue sharing agreements instead. So all data has to be marked to note which data partner provided it. As a partner’s data is accessed they get a portion of the revenue that Zeotap earns. Counters are essential to calculate that partner revenue from data utilization.

Depth filters are a method to calculate how far to traverse a graph before ending a line of linkage. So their customers can set limits — stop scanning after an indirection depth of four, or five, or ten.

“It was time to change. If you look at it what I have been talking about, parlances like ‘transitivity,’ ‘depth,’ ‘connections,’ ‘linkages’ — these are all directly what you are studying in college in graph theory. So we thought, ‘is there an ID graph that can directly replace my storage and give all of this out of the box?'” This is where Zeotap began the process of evaluating graph databases.

Zeotap relies upon Apache Spark to process data coming in from their partners, as well as to provide analytics for their clients based upon sets of IDs to test for matches, and to export based on its linkages. The question they had was whether a graph database would be able to scale to meet their demanding workloads.

Searching for the Right Graph Database

Saurabh then described the process to find a native graph database. Their solution needed to provide four properties:

  • Low latency (subsecond) neighborhood traversal
  • Faster data ingestion
  • Linkages are first-class citizens
  • Integration with Apache analytics projects: Spark, Hadoop and Giraph

Again, Saurabh emphasized that the linkages (edges) were more important than the vertices. The linkages had a lot of metadata. How recent is the link? Which data partner gave us this information? Quality scores are also link attributes. This was why they needed the linkages to be first class citizens.

Further, looking at their data analysts team, they wanted something that would be understandable to them. This is why they looked at Apache Gremlin/Tinkerpop. In the past, doing SQL queries meant that they had to do more complex and heavyweight JOINs the deeper they wished to traverse. With Gremlin the queries are much similar and more comprehensible.

SQL Query Gremlin Query
(select * from idmvp
where id1 = '75d630a9-2d34-433e-b05f-2031a0342e42' and idtype1 = 'id_mid_13'
and id2 = '5c557df3-df47-4603-64bc-5a9a63f22245' and idtype2 = 'id_mid_4') // depth = 1

(select * from idmvp t1, idmvp t2
where t1.id1 = '75d630a9-2d34-433e-b05f-2031a0342e42' and t1.idtype1 = 'id_mid_13'
and t2.id2 = '5c557df3-df47-4603-64bc-5a9a63f22245' and t2.idtype2 = 'id_mid_4') // depth = 2

(select * from idmvp t1, idmvp t2, idmvp t3
where t1.id1 = '75d630a9-2d34-433e-b05f-2031a0342e42' and t1.idtype1 = 'id_mid_13'
and t3.id2 = '5c557df3-df47-4603-64bc-5a9a63f22245' and t3.idtype2 = 'id_mid_4') // depth = 3?

.has('id','75d630a9-2d34-433e-b05f-2031a0342e42').has('type', 'id_mid_13')






A side-by-side comparison of a SQL query to a Gremlin query. The SQL query weighs in at 629 characters, whereas the Gremlin query is only 229 characters. The SQL query would be significantly longer if more levels of query depth were required.

Zeotap then began a proof of concept (POC) starting in August 2018 considering a range of open source databases on appropriate underlying AWS EC2 server clusters:

  • JanusGraph on Scylla (3x i3.2xlarge)
  • Aerospike (3x i3.2xlarge)
  • OrientDB (3x it.2xlarge)
  • DGraph (3x r4.16xlarge)

For the POC, all of these clusters had a replication factor of only 1. They also created a client cluster configuration on 3x c5.18xlarge instances.

While Aerospike is not a native graph database, Zeotap was familiar with it, and included it as a “Plan B” option, utilizing UDFs in Aerospike.

Their aim was to test a workload representative of three times their then-existing production workload, to be able to allow them to scale based on current expectations of growth. However, Saurabh reminded the audience again that they have since grown to over 15 billion IDs.

Their work aligned well with the Label Property Graph (LPG) model, as opposed to the Resource Description Framework (RDF) model. (If you are not familiar with the difference, you can read more about them here.) The LPG mode met their expectations and lowered the cardinality of data.

Store Benchmarking: 3 Billion IDs, 1 Billion Edges

with ScyllaDB
Aerospike OrientDB DGraph
Sharded, Distributed
Storage Model LPG Custom LPG RDF
Cost of ETL, before Ingestion Lower Lower Lower Higher
Native Graph DB
Node/Edge Schema Change without downtime?
Benchmark dataset load completed?
Acceptable Query Performance?
Production Setup Running Cost Lower Higher
Production Setup Operational Management (based on our experience with AS in production) Higher Lower

Zeotap looked at the cost of the entire system. An Aerospike solution would have been more costly because of its high memory (RAM) usage. In the end, they decided upon JanusGraph backed by ScyllaDB on SSD because of its affordability.

How JanusGraph on Scylla met Zeotap’s requirements

Low latency neighborhood traversal (OLTP) – Lookup & Retrieve
  • Graph traversal modeled as iterative low-latency lookups in the Scylla K,V store
  • Runtime proportional to the client data set & overlap percentage
Lower Data Ingestion SLAs
  • Ingestion modeled as UPSERT operations
  • Aligned with Streaming & Differential data ingestions
  • Economically lower footprint to run in production
Linkages are first-class citizen
  • Linkages have properties and traversals can leverage these properties
  • On the fly path computation
Analytics Stats on the Graph, Clustering (OLAP)
  • Bulk export and massive parallel processing available with GraphComputer integration with Spark, Hadoop, Giraph

The Data Model

“A graph DB data model promises two things: one is the extreme flexibility in terms of schema evolution, and another thing is the expressiveness of your domain,” Sathish interjected. Then to prove the point Saurabh shared this example of Zeotap’s data model:

Each vertex (white box) represents a user’s digital identity, stored with its properties in JanusGraph. They can include the device’s operating system (android or ios), the country (in the above there are data points both in Spain and Italy), and so on.

The edges (lines joining the boxes) represent the linkages provided by Zeotap’s partners. In these you can see ids for the data partner that shared that particular linkage (dp1, dp2, etc.) at a particular time (t1, t2, etc.). They also assign a quality score.

Zeotap’s business requirements included being able to model three main points.

  • The first is transitive links. Zeotap does not store pre-computed linkages. Every query is processed at the time of traversal.
  • The second is to filter traversals based on metadata stored in the vertices and edges. For example, filtering based on country is vital to maintain GDPR compliance. The same could also be done based on quality score. A customer might want to get results with only high quality linkages.
  • The third was extensibility of the data model. They cannot have a database that requires them to stop and restart the database depending on changes to the data model. All changes need to be able to be done on the fly. In the example above, “linkSource” was added to one of the edges. They could use this in case it was not a data partner, but Zeotap’s own heuristic that provided the linkage.

JanusGraph and Scylla in Production

In production, Zeotap runs on AWS i3.4xlarge nodes in multiple regions, with three or four nodes per region.

Zeotap began with a batch ingestion process in their MVP, but changed to a mixed streaming and batch ingestion into JanusGraph using Apache Kafka after first passing through a de-duplicating and enriching process for their data. This removed a lot of painful failures from their original batch-only processes and cut back dramatically on the time needed to ingest hundreds of millions of new data points per day.

They also found splitting the vertex and edge loads into separate Kafka streams was important to achieve a higher level of queries per second (QPS), because write behavior was different. A vertex was a single write, whereas the edge was a lookup as well as a write (read-before-write).

The Zeotap engineers also began to measure not just per-node performance, but drilled down to see CPU utilization. They observed 5k transactions per second (TPS) per server core.

For traversals, Sathish warned users to be on the lookout for their “supernodes,” which he broke down into two classes. The first case is a node that is connected to many other nodes. A second is one where its depth keeps increasing.

Another piece of advice Sathish offered was to play with your compaction strategies. With LevelTiered, Zeotap was able to obtain 2.5x more queries per second, and concurrent clients were better handled.

He also showed how filtering data on the server side may be causing latencies. A query filtering for certain conditions could take 2.3 seconds, whereas just fetching all of the data unfiltered might take only 1 second. So for them filtering based on attributes (properties and depths) is instead done at the application layer.

Saurabh then spoke about the quality scoring that Zeotap developed. It is based on what they call “AD scoring,” which is a ratio of edge agreement (A) divided by edge disagreement (D). They also score based on recency, which is then modified by an event rarity adjustment to derive the final quality score.

Zeotap then stores these scores as edge metadata as shown in this example:

Zeotap also has across-graph traversals. This is where, beyond their own data, they may also search through third party partner data stored in separate graph databases. Zeotap is able to save this 3rd party data as separate keyspaces in Scylla. Queries against these are chained together not in the database, but instead on the application layer.

Lastly, Zeotap runs ID graph quality analytics, the results of which are fed back into their OLTP and OLAP systems to consistently ensure the highest quality data.

Watch the Full Video

You can find out more details, as well as learn Zeotap’s key takeaways, by watching the video in full below or checking out their slides on our Tech Talks page. If you have more questions about how JanusGraph on Scylla may work in your own environment, we encourage you to contact us, or to join our Slack to ask your peers in our online community.


The post Zeotap: A Graph of Twenty Billion IDs Built on Scylla and JanusGraph appeared first on ScyllaDB.

Comparing CQL and the DynamoDB API

Six years ago, a few of us were busy hacking on a new unikernel, OSv, which we hoped would speed up any Linux application. One of the applications which we wanted to speed up was Apache Cassandra — a popular and powerful open-source NoSQL database. We soon realized that although OSv does speed up Cassandra a bit, we could achieve much better performance by rewriting Cassandra from scratch.

The result of this rewrite was Scylla — a new open-source distributed database. Scylla kept compatibility with Cassandra’s APIs and file formats, and as hoped outperformed Cassandra — achieving higher throughput per node, lower tail latencies, and vertical scalability (many-core nodes) in addition to Cassandra’s already-famous horizontal scalability (many-node clusters).

As part of Scylla’s compatibility with Cassandra, Scylla adopted Cassandra’s CQL (Cassandra Query Language) API. This choice of API had several consequences:

  • CQL, as its acronym says, is a query language. This language was inspired by the popular SQL but differs from it in syntax and features in many important ways.
  • CQL is also a protocol — telling clients how to communicate with which Scylla node.
  • CQL is also a data model — database rows are grouped together in a wide row, called a partition. The rows inside the partition are sorted by a clustering key.
    Oddly enough, this data model does not have an established name, and is often referred to as a wide column store.

Recently, Scylla announced Project Alternator, which added support for a second NoSQL API to Scylla — the API of Amazon’s DynamoDB. DynamoDB is another popular NoSQL database, whose popularity has been growing steadily in recent years thanks to its ease of use and its backing by Amazon. DynamoDB was designed based on lessons learned from Cassandra (as well as Amazon’s earlier Dynamo work), so its data model is sufficiently similar to that of Cassandra to make supporting both Cassandra’s API and DynamoDB’s API in one database an approachable effort. However, while the two data models are similar, the other aspects of the two APIs — the protocol and the query language — are very different.

After implementing both APIs — CQL and DynamoDB — we, the Scylla developers, are in a unique position to be able to provide an unbiased technical comparison between the two APIs. We have implemented both APIs in Scylla, and have no particular stake in either. The goal of this post is to explain some of the more interesting differences between the two APIs, and how these differences affect users and implementers of these APIs. However, this post will not cover all the differences between the two APIs.

Transport: proprietary vs. standard

A very visible difference between the CQL and DynamoDB APIs is the transport protocol used to send requests and responses. CQL uses its own binary protocol for transporting SQL-like requests, while DynamoDB uses JSON-formatted requests and responses over HTTP or HTTPS.

DynamoDB’s use of popular standards — JSON and HTTP — has an obvious benefit: They are familiar to most users, and many programming languages have built-in tools for them. This, at least in theory, makes it easier to write client code for this protocol compared to CQL’s non-standard protocol, which requires special client libraries (a.k.a. “client drivers”). Moreover, the generality of JSON and HTTP also makes it easy to add more features to this API over time, unlike rigid binary protocols that need to change and further complicate the client libraries. In the last 8 years, the CQL binary protocol underwent four major changes (it is now in version 5), while the DynamoDB API added features gradually without changing the protocol at all.

However, closer inspection reveals that special client libraries are needed for DynamoDB as well. The problem is that beyond the standard JSON and HTTP, several other non-standard pieces are needed to fully utilize this API. Among other things, clients need to sign their requests using a non-standard signature process, and DynamoDB’s data types are encoded in JSON in non-obvious ways.

So DynamoDB users cannot really use JSON or HTTP directly. Just like CQL users, they also use a client library that Amazon or the AWS community provided. For example, the Python SDK for AWS (and for DynamoDB in particular) is called boto3. The bottom line is that both CQL and DynamoDB ecosystems have client libraries for many programming languages, with a slight numerical advantage to CQL. For users who use many AWS services, the ability to use the same client library to access many different services — not just the database — is a bonus. But in this post we want to limit ourselves just to the database, and DynamoDB’s use of standardized protocols ends up being of no real advantage to the user.

At the same time, there are significant drawbacks to using the standard JSON and HTTP as the DynamoDB API does:

  1. DynamoDB supports only HTTP/1.1 — it does not support HTTP/2. One of the features missing in HTTP/1.1 is support for multiplexing many requests (and their responses) over the same connection. As a result, clients that need high throughput, and therefore high concurrency, need to open many connections to the server. These extra connections consume significant networking, CPU, and memory resources.
    At the same time the CQL binary protocol allows sending multiple (up to 32,768) requests asynchronously and potentially receiving their responses in a different order.
    The DynamoDB API had to work around its transport’s lack of multiplexing by adding special multi-request operations, namely BatchGetItem and BatchWriteItem — each can perform multiple operations and return reordered responses. However, these workarounds are more limited and less convenient than generic support for multiplexing.
    Had DynamoDB supported HTTP/2, it would have gained multiplexing as well, and would even gain an advantage over CQL for large requests or responses: The CQL protocol cannot break up a single request or response into chunks, but HTTP/2 does do it, avoiding large latencies for small requests which happen to follow a large request on the same connection. We predict that in the future, DynamoDB will support HTTP/2. Some other Amazon services already do.
  2. DynamoDB’s requests and responses are encoded in JSON. The JSON format is verbose and textual, which is convenient for humans but longer and slower to parse than CQL’s binary protocol. Because there isn’t a direct mapping of DynamoDB’s types to JSON, the JSON encoding is quite elaborate. For example, the string “hello” is represented in JSON as {"S": "hello"},  and the number 3 is represented as {"N": "3"}. Binary blobs are encoded using base64, which makes them 33% longer than necessary. The request parsing code then needs to parse JSON, decode all these conventions and deal with all the error cases when a request deviates from these conventions. This is all significantly slower than parsing a binary protocol.
  3. DynamoDB, like other HTTP-based protocols, treats each request as a self-contained message that can be sent over any HTTP connection. In particular, each request is signed individually using a hash-based message authentication code (HMAC) which proves the request’s source as well as its integrity. But these signatures do not encrypt the requests, which are still visible to prying eyes on the network — so most DynamoDB users use encrypted HTTPS (a.k.a. TLS, or SSL) instead of unencrypted HTTP as the transport. But TLS also has its own mechanisms to ensure message integrity, duplicating some of the effort done by the DynamoDB signature mechanism. In contrast, the CQL protocol also uses TLS — but relies on TLS for both message secrecy and integrity. All that is left for the CQL protocol is to authenticate the client (i.e., verify which client connected), but that can be done only once per connection — and not once per request.

Protocol: cluster awareness vs. one endpoint

A distributed database is, well… distributed. It is a cluster of many different servers, each responsible for part of the data. When a client needs to send a request, the CQL and DynamoDB APIs took very different approaches on where the client should send its request:

  1. In CQL, clients are “topology aware” — they know the cluster’s layout. A client knows which Scylla or Cassandra nodes exist, and is expected to balance the load between them. Moreover, the client usually knows which node holds which parts of the data (this is known as “token awareness”), so it can send a request to read or write an individual item directly to a node holding this item. Scylla clients can even be “shard aware,” meaning they can send a request directly to the specific CPU (in a node with many CPUs) that is responsible for the data.
  2. In the DynamoDB API, clients are not aware of the cluster’s layout. They are not even aware how many nodes there are in the cluster. All that a client gets is one “endpoint,” a URL like https://dynamodb.us-east-1.amazonaws.com — and directs all requests from that data-center to this one endpoint. DynamoDB’s name server can then direct this one domain name to multiple IP addresses, and load balancers can direct the different IP addresses to any number of underlying physical nodes.

Again, one benefit of the DynamoDB approach is the simplicity of the clients. In contrast, CQL clients need to learn the cluster’s layout and remain aware of it as it changes, which adds complexity to the client library implementation. But since these client libraries already exist, and both CQL and DynamoDB API users need to use client libraries, the real advantage to application developers is minimal.

But CQL’s more elaborate approach comes with an advantage — reducing the number of network hops, and therefore reducing both latency and cost: In CQL, a request can often be sent on a socket directed to a specific node or even a specific CPU core that can answer it. With the DynamoDB API, the request needs to be sent at a minimum through another hop — e.g., in Amazon’s implementation the request first arrives at a front-end server, which then directs the request to the appropriate back-end node. If the front-end servers are themselves load-balanced, this adds yet another hop (and more cost). These extra hops add to the latency and cost of a DynamoDB API implementation compared to a cluster-aware CQL implementation.

On the other hand, the “one endpoint” approach of the DynamoDB API also has an advantage: It makes it easier for the implementation to hide from users unnecessary details on the cluster, and makes it easier to change the cluster without users noticing or caring.

Query language: prepared statements

We already noted above how the DynamoDB API uses the HTTP transport in the traditional way: Even if requests are sent one after another on the same connection (i.e., HTTP Keep-Alive), the requests are independent, signed independently and parsed independently.

In many intensive (and therefore, expensive) use cases, the client sends a huge number of similar requests with slightly different parameters. In the DynamoDB API, the server needs to parse these very similar requests again and again, wasting valuable CPU cycles.

The CQL query language offers a workaround: prepared statements. The user asks to prepare a parameterized statement — e.g., “insert into tb (key, val) values (?, ?)”  with two parameters marked with question-marks. This statement is parsed only once, and further requests to the same node can reuse the pre-parsed statement, passing just the two missing values. The benefit of prepared statements are especially pronounced in small and quick requests, such as reads that can frequently be satisfied from cache, or writes. In this blog post we demonstrated in one case that using prepared statements can increase throughput by 50% and lower latency by 33%.

Adding prepared-statement support to the DynamoDB API would be possible, but will require some changes to the protocol:

First, we would need to allow a client to send an identifier of a previously-sent query instead of the full query. In addition to the query identifier, the client will need to send the query’s parameters – luckily the notion of query parameters is already supported by the DynamoDB API, as ExpressionAttributeValues.

Second, DynamoDB API clients are not aware of which of the many nodes in the cluster will get their request, so the client doesn’t know if the node that will receive its request has already parsed this request in the past. So we would need a new “prepare” request that will parse this request and make its identifier available across the entire cluster.

Write model: CRDT vs. read-modify-write

Amazon DynamoDB came out in 2012, two years after Cassandra. DynamoDB’s data model was inspired by Cassandra’s, which in turn was inspired by the data model Google’s BigTable (you can read a brief history of how the early NoSQL databases influenced each other here). In both DynamoDB and Cassandra, database rows are grouped together in a partition by a partition key, with the rows inside the partition sorted by a clustering key. The DynamoDB API and CQL use different names for these same concepts, though:

CQL Term DynamoDB term
Smallest addressable content row item
Sub-entities column attribute
Unit of distribution partition (many rows) partition (many items)
Key for distribution partition key partition key (also hash key)
Key within partition clustering key sort key (also range key)

Despite this basic similarity in how the data is organized — and therefore in how data can be efficiently retrieved — there is a significant difference in how DynamoDB and CQL model writes. The two APIs offer different write primitives and make different guarantees on the performance of these primitives:

Cassandra is optimized for write performance. Importantly, ordinary writes do not require a read of the old item before the write: A write just records the change, and only read operations (and background operations known as compaction and repair) will later reconcile the final state of the row. This approach, known as Conflict-free Replicated Data Types (CRDT), has implications on Cassandra’s implementation, including its data replication and its on-disk storage in a set of immutable files called SSTables (comprising an LSM). But more importantly for this post, the wish that ordinary writes should not read also shaped Cassandra’s API, CQL:

  • Ordinary write operations in CQL — both UPDATE and INSERT requests — just set the value of some columns in a given row. These operations cannot check whether the row already existed beforehand, and the new values cannot depend on the old values. A write cannot return the old pre-modification values, nor can it return the entire row when just a part of it was modified.
  • These ordinary writes are not always serialized. They usually are: CQL assigns a timestamp (with microsecond resolution) to each write and the latest timestamp wins. However two concurrent writes which happen to receive exactly the same timestamp may get mixed up, with values from both writes being present in the result.
    For more information on this problem, see this Jepsen post.
  • CQL offers a small set of additional CRDT data structures for which writes just record the change and do not need to involve a read of the prior value. These include collections (maps, sets and lists) and counters.
  • Additionally, CQL offers special write operations that do need to read a row’s previous state before a write. These include materialized views and lightweight transactions (LWT). However, these write operations are special; Users know that they are significantly slower than ordinary writes because they involve additional synchronization mechanisms – so users use them with forethought, and often sparingly.

Contrasting with CQL’s approach, DynamoDB’s approach is that every write starts with a read of the old value of the item. In other words, every write is a read-modify-write operation. Again, this decision had implications on how DynamoDB was implemented. DynamoDB uses leader-based replication so that a single node (the leader for the item) can know the most recent value of an item and serialize write operations on it. DynamoDB stores data on disk using B-trees which allow reads together with writes. But again, for the purpose of this post, the interesting question is how making every write a read-modify-write operation shaped DynamoDB’s API:

  • In the DynamoDB API, every write operation can be conditional: A condition expression can be added to a write operation, and uses the previous value of the item to determine whether the write should occur.
  • Every write operation can return the full content of the item before the write (ReturnValues).
  • In some update operations (UpdateItem), the new value of the item may be based on the previous value and an update expression. For example, “a = a + 1” can increment the attribute a (there isn’t a need for a special “counter” type as in CQL).
  • The value of an item attribute may be a nested JSON-like type, and an update may modify only parts of this nested structure. DynamoDB can do this by reading the old nested structure, modifying it, and finally writing the structure back.
    CQL could have supported this ability as well without read-modify-write by adding a CRDT implementation of a nested JSON-like structure (see, for example, this paper) — but currently doesn’t and only supports single-level collections.
  • Since all writes involve a read of the prior value anyway, DynamoDB charges for writes the same amount whether or not they use any of the above features (condition, returnvalues, update expression). Therefore users do not need to use these features sparingly.
  • All writes are isolated from each other; two concurrent writes to the same row will be serialized.

This comparison demonstrates that the write features that the DynamoDB and CQL APIs provide are actually very similar: Much of what DynamoDB can do with its various read-modify-write operations can be done in CQL with the lightweight transactions (LWT) feature or by simple extensions of it.  In fact, this is what Scylla does internally: We implemented CQL’s LWT feature first, and then used the same implementation for the write operations of the DynamoDB API.

However, despite the similar write features, the two APIs are very different in their performance characteristics:

  • CQL makes an official distinction between write-only updates and read-modify-write updates (LWT), and the two types of writes must not be done concurrently to the same row. This allows a CQL implementation to implement write-only updates more efficiently than read-modify-write updates.
  • A DynamoDB API implementation, on the other hand, cannot do this optimization: With the DynamoDB API, write-only updates and (for example) conditional updates can be freely mixed and need to be isolated from one another.

This requirement has led Scylla’s implementation of the DynamoDB API to use the heavyweight (Paxos-based) LWT mechanism for every write, even simple ones that do not require reads, making them significantly slower than the same write performed via the CQL API. We will likely optimize this implementation in the future, perhaps by adopting a leader model (like DynamoDB), but an important observation is that in every implementation of the DynamoDB API, writes will be significantly more expensive than reads. Amazon prices write operations at least 5 times more expensive than reads (this ratio reaches1 40 for weakly-consistent reads of 4 KB items).

Scylla provides users with an option to do DynamoDB-API write-only updates (updates without a condition, update expression, etc.) efficiently, without LWT. However, this option is not officially part of the DynamoDB API, and as explained above — cannot be made safe without adding an assumption that write-only updates and read-modify-write updates to the same item cannot be mixed.

Data items: schema vs. (almost) schema-less

Above we noted that both APIs — DynamoDB and CQL — model their data in the same way: as a set of partitions indexed by a partition key, and each partition is a list of items sorted by a sort key.

Nevertheless, the way that the two APIs model data inside each item is distinctly different:

  1. CQL is schema-full: A table has a schema — a list of column names and their types – and each item must conform with this schema. The item must have values of the expected types for all or some of these columns.
  2. DynamoDB is (almost) schema-less: Each item may have a different set of attribute names, and the same attribute may have values of different types2. DynamoDB is “almost” schema-less because the items’ key attributes — the partition key and sort key — do have a schema. When a table is created, one must specify the name and type of its partition-key attribute and its optional sort-key attribute.

It might appear, therefore, that DynamoDB’s schema-less data model is more general and more powerful than CQL’s schema-full data model. But this isn’t quite the case. When discussing CRDT above, we already mentioned that CQL supports collection columns. In particular, a CQL table can have a map column, and an application can store in that map any attributes that aren’t part of its usual schema. CQL’s map does have one limitation, though: The values stored in a map must all have the same type. But even this limitation can be circumvented, by serializing values of different types into one common format such as a byte array. This is exactly how Scylla implemented the DynamoDB API’s schema-less items on top of its existing schema-full CQL implementation.

So it turns out that the different approaches that the two APIs took to schemas have no real effect on the power or the generality of the two APIs. Still, this difference does influence what is more convenient to do in each API:

  • CQL’s schema makes it easier to catch errors in a complex application ahead of time – in the same way that compile-time type checking in a programming language can catch some errors before the program is run.
  • The DynamoDB API makes it natural to store different kinds of data in the same database. Many DynamoDB best-practice guides recommend a single-table design, where all the application’s data resides in just a single table (plus its indexes). In contrast, in CQL the more common approach is to have multiple tables each with a different schema. CQL introduces the notion of “keyspace” to organize several related tables together.
  • The schema-less DynamoDB API makes it somewhat easier to change the data’s structure as the application evolves. CQL does allow modifying the schema of an existing table via an “ALTER TABLE” statement, but it is more limited in what it can modify. For example, it is not possible to start storing strings in a column that used to hold integers (however, it is possible to create a second column to store these strings).


In this post, we surveyed some of the major differences between the Amazon DynamoDB API and Apache Cassandra’s CQL API — both of which are now supported in one open-source database, Scylla. Despite a similar data model and design heritage, these two APIs differ in the query language, in the protocol used to transport requests and responses, and somewhat even in the data model (e.g., schema-full vs. schema-less items).

We showed that in general, it is easier to write client libraries for the DynamoDB API because it is based on standardized building blocks such as JSON and HTTP. Nevertheless, we noted that this difference does not translate to a real benefit to users because such client libraries already exist for both APIs, in many programming languages. But the flip side is that the simplicity of the DynamoDB API costs in performance: The DynamoDB API requires extra network hops (and therefore increased costs and latency) because clients do not need to be aware of which data belongs to which node. The DynamDB API also requires more protocol and parsing work and more TCP connections.

We discussed some possible directions in which the DynamoDB API could be improved (such as adopting HTTP/2 and a new mechanism for prepared statements) to increase its performance. We don’t know when, if ever, such improvements might reach Amazon DynamoDB. But as an experiment, we could implement these improvements in the open-source Scylla and measure what kinds of benefits an improved API might bring. When we do, look forward to a follow-up on this post detailing these improvements.

We also noted the difference in how the two APIs handle write operations. The DynamoDB API doesn’t make a distinction between writes that need the prior value of the item and writes that do not. This makes it easier to write consistent applications with this API, but also means that writes with the DynamoDB API are always significantly more expensive than reads. Conversely, the CQL API does make this distinction — and does forbid mixing the two types of writes. Therefore, the CQL API allows write-only updates to be optimized, while the read-modify-write updates remain slower. For some applications, this can be a big performance boost.

The final arbiter in weighing these APIs is subjective in answer to the question: “Which will be better for your use case?” Due to many of the points made above, and due to other features which at this point are only available through Scylla’s CQL implementation, ScyllaDB currently advises users to adopt the CQL interface for all use cases apart from “lift and shift” migrations from DynamoDB.



1 Amazon prices a write unit (WCU) at 5 times the cost of a read unit (RCU). Additionally, for large items, each 1 KB is counted as a write unit, while a read unit is 4 KB. Moreover, eventually-consistent reads are counted as half a read unit. So writing a large item is 40 (=5*4*2) times more expensive than reading it with an eventually-consistent read.

2 The value types supported by the two APIs also differ, and notably the DynamoDB API supports nested documents. But these are mostly superficial differences and we won’t elaborate on this further here.

The post Comparing CQL and the DynamoDB API appeared first on ScyllaDB.

Introducing Scylla Open Source 4.0

With the release of Scylla Open Source 4.0, we’ve introduced a set of noteworthy production-ready new features, including a DynamoDB-compatible API that lets you to take your locked-in DynamoDB workloads and run them anywhere, a more efficient implementation of Lightweight Transactions (LWT), plus improved and new experimental features such as Change Data capture (CDC), which uses standard CQL tables.

4.0 is the most expansive release we’ve ever done. After all, it’s rare to release a whole new database API in General Availability (GA), and our Alternator project has finally graduated. In 4.0 we also reach full feature parity with the database of our roots: LWT graduated from experimental mode and is now ready for prime time. Just yesterday, Scylla was tested with Kong, the cloud native API gateway, which relies upon LWT, and it worked as flawlessly as we hoped. This single Github issue of Kong integration shows the long journey we’ve completed to reach full feature parity. The issue was opened in 2015, just as we launched our company out of stealth mode. From that point forward, integration waited for counters, ALTER TABLE, materialized views, indexes, Cassandra Cluster Manager (CCM) integration and finally LWT.

In this blog, I’d like to look at the enhancements we’ve made to our technology from 3.0 to 4.0, compare where we are today versus Cassandra 4.0 and DynamoDB, and touch on the future.

Journey from 3.0 to 4.0

We released 3.0 on January 17, 2019. The team had been burning the midnight oil to deliver many vital features such as the “mc” SSTable format. After that we put in extensive stabilization work and we relaxed our upstream master branch testing, which caused us a great deal of stabilization time before we delivered our 3.1 release on October 15, 2019. After that, we shifted more focus upstream, so more testing, unlocking the debug release mode, which turned out to be instrumental to catching bugs early (thanks Benny Halevy and Rafael Avila de Espindola) on our nightly master branch. Releases 3.2 (Jan 21), 3.3 (March 24) and now 4.0 were subsequently published. Today we’re able to meet our goal of a monthly release cadence (with minimum misses). Apart from achieving better release predictability, our new processes increase the quality of our releases and reduce the bug hunt churn.

During the 3.0 to 4.0 development period, 6,242 commits were contributed to Scylla by 80 unique developers. That’s a pace of 13 commits per day. This pace, both in the quality and quantity, was what allowed us to reach parity with Cassandra, a very good project with many adopters but, surprisingly, far fewer core committers. A git log analysis over the past 5 years verifies this, as you’ll see below. Top is Cassandra, bottom is Scylla, the number of commits is on the left, and number of authors on the right. (Note that the charts are not to the same scale. Cassandra peaks at just over 500 commits and now averages less than 100 a month; the peak for Scylla was over 900 and averages between 200 to 400 a month.)

We invested in a mix of performance, functionality and usability features. As our adoption continues to grow, we’re moving our emphasis from performance to usability and functionality.

Some of our key functionality enhancements over this period included:

  • Lightweight transactions (LWT)
  • Local secondary indexes (next to our global indexes)
  • Change Data Capture (CDC, in experimental mode)
  • User Defined Functions (using Lua, in experimental mode)
  • IPv6 support
  • Various query improvements: CQL per partition limit, allow filtering enhancements, CQL LIKE operator.

Usability enhancements include:

  • Large Cell / Collection Detector
  • Nodetool TopPartitions
  • Relocatable packages

Performance enhancements include:

  • Row-level repair
  • Support for AWS i3en servers with up to 60TB/node
  • Faster network polling – IOCB_CMD_POLL
    (Another contribution by ScyllaDB to the Linux kernel)
  • Improved CQL server admission control

Outside of Scylla CQL core our 4.0 release also includes other game changing new features. I’ll take a bit of time to describe them here.

DynamoDB-Compatible API (Project Alternator)

Project Alternator is an open-source implementation for an Amazon DynamoDB™-compatible API. The goal of this project was to deliver an open source alternative to Amazon’s DynamoDB, deployable wherever a user would want: on-premises, on other public clouds like Microsoft Azure or Google Cloud Platform, or still on AWS (for users who wish to take advantage of other aspects of Amazon’s market-leading cloud ecosystem, such as the high-density i3en instances).
As this is a full database API, I won’t cover it here. Go give it a try on Docker, Kubernetes or Scylla Cloud.

Scylla Operator for Kubernetes (Beta)

Kubernetes has become the go-to technology for the cloud devops community. It allows for the automated provisioning of cloud-native containerized applications, supporting the entire lifecycle of deployment, scaling, and management. Scylla Operator, created by Yannis Zarkadas and championed by Henrik Johansson, is our extension to enable the management of Scylla clusters. It currently supports deploying multi-zone clusters, scaling up or adding new racks, scaling down, and monitoring Scylla clusters with Prometheus and Grafana.

Scylla and the Rest of the Pack

Now that we have feature parity with Cassandra, and our DynamoDB-compatible API is production ready, let’s compare value.

We’ll start by repeating one of our previous benchmarks that compares Scylla 2.2 vs Cassandra 3.11. In this benchmark, we defined an application with an SLA of 10 msec for its P99 reads and ran a mixed workload of 100k, 200k, 300k OPS and compared the cost and SLA violations. We have repeated this comparison with Scylla 4.0. Previously, 4 Scylla servers of i3.metal met the SLA for 100k, 200k and were slightly above the latency to meet 300k OPS.

40 C* 3.11 servers running on i3.4xl were used in our initial benchmark. We selected i3.4xl for Cassandra since Cassandra doesn’t scale vertically well, thus it would have been a waste of machine time and money to use machines that were too large. Back then, the 40 i3.4xl servers, which cost 2.5x more than the 4 i3.metal servers, could meet the SLA only in the 100k case.

Not only did Scylla 4.0 easily make the grade with 300k OPS with P99 read of 7msec, we tried 600k and achieved P99 of 12msec. So Scylla 4.0 is twice as cost effective as Scylla 2.2. It means that Cassandra, even with the most appropriate machine type, will be 5x more expensive, its latency will be 10x worse, and it will be much harder to manage.

We tested Cassandra 4.0 but had issues; naturally, it is still in alpha. However, a smaller scale comparison between Cassandra 3.11 and Cassandra 4.0 did not reveal performance differences. We will share those results in depth at a later date.

Elasticity Results

In order to be agile and budget friendly, a horizontally scalable database needs to expand on demand. We added a 5th node to our 4-node i3.metal cluster and measured the time it took it to stream the data to the new node. Each of the four nodes had 7.5TB, and it’s quite a challenge to stream the data as fast as possible while still meeting a 10 msec P99 read latency goal. Scylla managed to complete the streaming in just over an hour while handling 300k OPS of mixed workload. Streaming pace was 1.5GB/s (12gbps).

If you dig into our Grafana dashboards, you will observe all sorts of activities: The commitlog bandwidth, streaming scheduler share, query class IO bandwidth, number of active compactions (tens to hundreds), queue sizes, and more. Our CPU and IO schedulers are responsible for prioritizing queries over streaming and compactions and keeping these balance/queue sizes dynamic.

It is possible to observe that the new joining node is capped by CPU load, originated from compaction. Future releases of Scylla will automatically halt the compaction while streaming, since more and more data arrives to the node, and will further accelerate streaming.

As the 40 nodes of Cassandra suffered from P99 latency of 131 msec with 300k OPS, we estimate that Cassandra will need around 60-80 nodes of i3.4xl to meet the latency target for 300k OPS. If you need to grow the cluster by 25% as Scylla has, that’s an additional 15 machines you’ll need. Since Cassandra (and Scylla, too, but not for long) adds a single node at a time, it will take 1-2 hours per node multiplied by 15 nodes. That’s over 24 hours! Not only is it 24x slower than Scylla, you wouldn’t have the capacity in place to handle variable spikes. That means you’d have to overprovision your Cassandra cluster from the get-go, thus incurring even greater costs.

Scylla 4.0 has a new experimental feature called repair-based-node-operation. Asias He implemented streaming for node addition and node decommission using our repair algorithm (better than a merkle tree). Repair is the process of syncing our replicas. It can run on different sets of replicas and is supposed to run iteratively. This allows us to offer restartable node operations. That means if you started to add/remove a node, and as an admin, and you want to pause, restart the server, etc., it’s now possible to do so without starting the streaming from scratch.

How DynamoDB Compares with Scylla 4.0

DynamoDB is known for its elasticity, ease of use and also its high price. Let’s look at how much a 600k OPS mixed workload costs on DynamoDB and assume the way to scale there is simple (although it is not the case, as many blogs indicate).

Scylla Cloud running 4 i3.metal servers costs $48/hour on demand and around $32/hour with an annual commitment.

DynamoDB has multiple pricing options. Let’s fit the best one for the right use cases. If you have a completely spiky workload, very unexpected, like this pattern:

The best fit would be the provisioned capacity since DynamoDB’s autoscale cannot meet this speed and, to be frank, no database can deal with it without a reservation. We’ve taken the scenario from the AWS documentation itself. In the example, AWS recommended to use the more expensive on-demand mode (explained below), but also reported that DynamoDB needs time to scale, cannot double its capacity more often than every half an hour, and is complex to pre-warm.

In such a case, a write per second capacity unit costs $0.00065 and a read unit costs $0.00013. Storage costs $0.25 GB/month. If we multiply it by 300k reads, 300k writes, then add 25% spare (otherwise Dynamo will throttle your requests) and add 10TB of storage it looks like this:

Throughput Additional spare
Total read/sec 300,000 375,000
Total writes/sec 300,000 375,000
Read Unit $0.00013 $48.75000
Write Unit $0.00065 $243.75
Storage $3.47
Total (per hour) $296

That’s 6-9 times the Scylla cost.

Now we can look at another scenario, with a more classic, bell curve of usage, also shared by the AWS team. In this case, the data enables automatic scaling, which uses the on-demand pricing.

The on-demand cost, with $1.25 for every million writes (amount, not pace), would be $2,028/hour for a full hour with this load — 40x higher than Scylla. Lucky for the user, the scenario is such that the load nicely grows and fades, thus there is no need to run 24×7 at full capacity. However, on average, you’d need around 50% of the load, thus you’ll be paying a shocking price of 20x the cost of the statically provisioned Scylla cluster (which can easily scale too, but not automatically).

Even in cases where the spikes wouldn’t be that high and most of the time, like the first diagram where only 3 hours of usage are needed, on-demand mode is very expensive. A single hour would cover more than a day of full provisioning of Scylla.

Lastly, the AWS documentation recommends combining the provisioned and the on-demand workloads to reduce costs.

Reserved capacity, with a full 1-year commitment is the cheapest option. 100 write units (per second) cost $0.0128 PLUS $150 upfront. Reads are 5 times cheaper. We’ll need 3,000 of both. When adjusting the price per hour, it will cost $138. That’s still 3x-4.5x more than Scylla, and this is without having the expensive auto-scale price mode.

Just the sheer complexity of tracking the cost, and making bets on consumption, makes it horribly challenging for the end user. We haven’t touched the hot partition issue that can make DynamoDB even more complicated and expensive. Scylla doesn’t ‘like’ hot partitions either but we deal with them better. Plus this year we added a top-partitions enhancement, which allows you to discover your most-accessed partitions.

We’re happy to report that we have early adopters for Scylla Alternator. Some use it not instead of AWS DynamoDB but as a surprising extension — in case you are used to the DynamoDB HTTP API but you want another deployment option, in this case on-premise, Scylla’s Alternator is a great choice. We expect to see multi-cloud usage before we’ll gain a hybrid mode. It is also possible to just wet your feet and let Alternator duplicate a few of your data centers while you keep the lights on for some time using the original DynamoDB.

Open source software allows you to run within a Docker container, there’s no need to pay for dev & test clusters. When you are ready to go live, you can deploy anAlternator cluster via Kubernetes. Stop paying by the query and also enjoy a fully managed, affordable and highly available service on Scylla Cloud. All of our DynamoDB-compatible tables are global, multi-DC and multi-zone and it’s easy to change at any time.

What’s to Expect from Scylla 5.0

The team will kill me for pivoting so quickly to 5.0… they deserve a real vacation once COVID-19 is over. We had to cancel our yearly developer meeting, which was planned in Poland this year. Luckily so far none of our team has gotten sick and the vast majority of our developers were already working from home.

Since we’ve reached feature parity, we now plan on investing our efforts in two main areas. The first is to continue to improve the core Scylla promise, keep raising the bar with awesome performance out of the box, verifying that no matter what happens, P99 latency is the lowest. Scylla tracks latency violations and we still have a few to iron out. Repair speed should continue to be improved and we’d like to invest in workload shedding so Scylla will be able to sustain all types of bursts, including ones with unbounded parallelism.

Another important core enhancement is around elasticity. We want to make it easy, fast and reliable to add and remove nodes, multiple nodes at a time, with minimal interruptions from compactions and simpler operations (further automating repairs, nodetool cleanup and so forth).

The second major area is ease of use. Now that we’ve got drivers such as GocqlX and sharded Go, Java and Python drivers, it is time to simplify the onboarding experience with best practices, to simplify migration, to add a component called Scylla Advisor, which will automatically indicate where there may be configuration issues or performance challenges. We’ll add machine images on GCP and Azure, productize our k8s operator, make Scylla Cloud more elastic and add more self service operations. The list goes on.

For now, as my friends in India like to say, happy upgradation!

Release Notes

You can read more details about Scylla Open Source 4.0 in the Release Notes.


The post Introducing Scylla Open Source 4.0 appeared first on ScyllaDB.

Scylla Monitoring Stack 3.3

The Scylla team is pleased to announce the release of Scylla Monitoring Stack 3.3

Scylla Monitoring Stack is an open-source stack for monitoring Scylla Enterprise and Scylla Open Source, based on Prometheus and Grafana. Scylla Monitoring Stack 3.3 supports:

  • Scylla Open Source versions 3.1, 3.2, 3.3 and the upcoming 4.0
  • Scylla Enterprise versions 2018.x and 2019.x
  • Scylla Manager 1.4.x, 2.0.x and the upcoming Scylla Manager 2.1

Related Links

New in Scylla Monitoring Stack 3.3

  • Scylla OS 4.0 dashboards #871
  • Scylla Manager dashboard 2.1 #899
  • Shows speculative execution in official detail dashboard #880
    Speculative Retry is a mechanism that the client or server retries when it speculates that the request would fail even before a timeout occurred. In a high load, this can overload the system and would cause performance issues.

  • Adding LWT panels #875
    The LWT section in the Detailed dashboard has a new section to help identify performance issues:
    • LWT Condition-Not-Met: An LWT INSERT, UPDATE or DELETE command that involves a condition will be rejected if the condition is not met. Based on scylla_storage_proxy_coordinator_cas_write_condition_not_met.
    • LWT Write Contention: Number of times some INSERT, UPDATE or DELETE request with conditions had to retry because there was a concurrent conditional statement against the same key. Each retry is performed after a randomized sleep interval, so it can lead to statement timing out completely. Based on scylla_storage_proxy_coordinator_cas_write_contention_count.
    • LWT Read Contention: Number of times some SELECT with SERIAL consistency had to retry because there was a concurrent conditional statement against the same key. Each retry is performed after a randomized sleep interval, so it can lead to statement timing out completely. Based on scylla_storage_proxy_coordinator_cas_read_contention_count.
    • LWT Write Timeout Due to Uncertainty: Number of partially succeeded conditional statements. These statements were not committed by the coordinator, due to some replicas responding with errors or timing out. The coordinator had to propagate the error to the client. However, the statement succeeded on a minority of replicas, so may later be propagated to the rest during repair. Based on scylla_storage_proxy_coordinator_cas_write_timeout_due_to_uncertainty.
    • LWT Write Unavailable: Number of times a INSERT, UPDATE, or DELETE with conditions failed after being unable to contact enough replicas to match the consistency level. Based on scylla_storage_proxy_coordinator_cas_write_unavailable.
    • LWT Read Unavailable: Number of times a SELECT with SERIAL consistency failed after being unable to contact enough replicas to match the consistency level. Based on scylla_storage_proxy_coordinator_cas_read_unavailable.
    • LWT Write Unfinished – Repair Attempts: Number of Paxos-repairs of INSERT, UPDATE, or DELETE with conditions. Based on scylla_storage_proxy_coordinator_cas_write_unfinished_commit.
    • LWT Read Unfinished – Repair Attempts: Number of Paxos-repairs of SELECT statement with SERIAL consistency. Based on scylla_storage_proxy_coordinator_cas_read_unfinished_commit.
    • Failed Read-Round Optimization: Normally, a PREPARE Paxos-round piggy-backs the previous value along with the PREPARE response. When the coordinator is unable to obtain the previous value (or its digest) from some of the participants, or when the digests did not match, a separate repair round has to be performed. Based on scylla_storage_proxy_coordinator_cas_failed_read_round_optimization.
    • LWT Prune: Number of pruning requests. Based on scylla_storage_proxy_coordinator_cas_prune.
    • LWT Dropped Prune: Number of Dropped pruning requests. Based on scylla_storage_proxy_coordinator_cas_dropped_prune.
  • More on LWT here

The CQL Dashboard has a new LWT section with panels for Conditional Insert, Delete, Update and Batches.

  • CPU and IO Dashboard reorganization #857A reorganization of the CPU and the IO dashboards collect the relevant information in rows.

  • Add CDC panels #819
    CDC operations: The number of CDC operations per seconds. Based on scylla_cdc_operations_total
    Failed CDC operations: The number of Failed CDC operations per seconds. Based on scylla_cdc_operations_failed

  • Default Latency Alerts #803. You can read more about Alerts here.
    Two new Latency alerts were added out of the box, one when the average latency is above 10ms and one when the 95% latency is above 100ms.
    Latency is system specific, and it is best if you set the value to match your system and expected latency.
  • Start the monitoring with an Alternator option #858
    Alternator users can start the monitoring stack with alternator-start-all.sh to show only the alternator related dashboards: alternator, CPU, IO, OS and Errors.
  • Graphs’ hover-tooltip sort order #861.
    The hover-tooltip shows the current value of each of the graphs. It is now sorted according the the value/

Bug Fixes

  • CQL Optimization dashboard gauges can be off for various time frames #881
  • Read timeout metrics are misleading #876
  • Support manager agents when using Scylla-Manager Consul integration #888
  • Hints derived-class metrics are not being displayed using irate #865

The post Scylla Monitoring Stack 3.3 appeared first on ScyllaDB.

How io_uring and eBPF Will Revolutionize Programming in Linux

Things will never be the same again after the dust settles. And yes, I’m talking about Linux.

As I write this, most of the world is in lockdown due to COVID-19. It’s hard to say how things will look when this is over (it will be over, right?), but one thing is for sure: the world is no longer the same. It’s a weird feeling: it’s as if we ended 2019 in one planet and started 2020 in another.

While we all worry about jobs, the economy and our healthcare systems, one other thing that has changed dramatically may have escaped your attention: the Linux kernel.

That’s because every now and then something shows up that replaces evolution with revolution. The black swan. Joyful things like the introduction of the automobile, which forever changed the landscape of cities around the world. Sometimes it’s less joyful things, like 9/11 or our current nemesis, COVID-19.

I’ll put what happened to Linux in the joyful bucket. But it’s a sure revolution, one that most people haven’t noticed yet. That’s because of two new, exciting interfaces: eBPF (or BPF for short) and io_uring, the latter added to Linux in 2019 and still in very active development. Those interfaces may look evolutionary, but they are revolutionary in the sense that they will — we bet — completely change the way applications work with and think about the Linux Kernel.

In this article, we will explore what makes these interfaces special and so powerfully transformational, and dig deeper into our experience at ScyllaDB with io_uring.

How Did Linux I/O System Calls Evolve?

In the old days of the Linux you grew to know and love, the kernel offered the following system calls to deal with file descriptors, be they storage files or sockets:

Those system calls are what we call blocking system calls. When your code calls them it will sleep and be taken out of the processor until the operation is completed. Maybe the data is in a file that resides in the Linux page cache, in which case it will actually return immediately, or maybe it needs to be fetched over the network in a TCP connection or read from an HDD.

Every modern programmer knows what is wrong with this: As devices continue to get faster and programs more complex, blocking becomes undesirable for all but the simplest things. New system calls, like select() and poll() and their more modern counterpart, epoll() came into play: once called, they will return a list of file descriptors that are ready. In other words, reading from or writing to them wouldn’t block. The application can now be sure that blocking will not occur.

It’s beyond our scope to explain why, but this readiness mechanism really works only for network sockets and pipes — to the point that epoll() doesn’t even accept storage files. For storage I/O, classically the blocking problem has been solved with thread pools: the main thread of execution dispatches the actual I/O to helper threads that will block and carry the operation on the main thread’s behalf.

As time passed, Linux grew even more flexible and powerful: it turns out database software may not want to use the Linux page cache. It then became possible to open a file and specify that we want direct access to the device. Direct access, commonly referred to as Direct I/O, or the O_DIRECT flag, required the application to manage its own caches — which databases may want to do anyway, but also allow for zero-copy I/O as the application buffers can be sent to and populate from the storage device directly.

As storage devices got faster, context switches to helper threads became even less desirable. Some devices in the market today, like the Intel Optane series have latencies in the single-digit microsecond range — the same order of magnitude of a context switch. Think of it this way: every context switch is a missed opportunity to dispatch I/O.

With Linux 2.6, the kernel gained an Asynchronous I/O (linux-aio for short) interface. Asynchronous I/O in Linux is simple at the surface: you can submit I/O with the io_submit system call, and at a later time you can call io_getevents and receive back events that are ready. Recently, Linux even gained the ability to add epoll() to the mix: now you could not only submit storage I/O work, but also submit your intention to know whether a socket (or pipe) is readable or writable.

Linux-aio was a potential game-changer. It allows programmers to make their code fully asynchronous. But due to the way it evolved, it fell short of these expectations. To try and understand why, let’s hear from Mr. Torvalds himself in his usual upbeat mood, in response to someone trying to extend the interface to support opening files asynchronously:

So I think this is ridiculously ugly.

AIO is a horrible ad-hoc design, with the main excuse being “other, less gifted people, made that design, and we are implementing it for compatibility because database people — who seldom have any shred of taste — actually use it”.

— Linus Torvalds (on lwn.net)

First, as database people ourselves, we’d like to take this opportunity to apologize to Linus for our lack of taste. But also expand on why he is right. Linux AIO is indeed rigged with problems and limitations:

  • Linux-aio only works for O_DIRECT files, rendering it virtually useless for normal, non-database applications.
  • The interface is not designed to be extensible. Although it is possible — we did extend it — every new addition is complex.
  • Although the interface is technically non-blocking, there are many reasons that can lead it to blocking, often in ways that are impossible to predict.

We can clearly see the evolutionary aspect of this: interfaces grew organically, with new interfaces being added to operate in conjunction with the new ones. The problem of blocking sockets was dealt with with an interface to test for readiness. Storage I/O gained an asynchronous interface tailored-fit to work with the kind of applications that really needed it at the moment and nothing else. That was the nature of things. Until… io_uring came along.

What Is io_uring?

io_uring is the brainchild of Jens Axboe, a seasoned kernel developer who has been involved in the Linux I/O stack for a while. Mailing list archaeology tells us that this work started with a simple motivation: as devices get extremely fast, interrupt-driven work is no longer as efficient as polling for completions — a common theme that underlies the architecture of performance-oriented I/O systems.

But as the work evolved, it grew into a radically different interface, conceived from the ground up to allow fully asynchronous operation. It’s a basic theory of operation is close to linux-aio: there is an interface to push work into the kernel, and another interface to retrieve completed work.

But there are some crucial differences:

  • By design, the interfaces are designed to be truly asynchronous. With the right set of flags, it will never initiate any work in the system call context itself and will just queue work. This guarantees that the application will never block.
  • It works with any kind of I/O: it doesn’t matter if they are cached files, direct-access files, or even blocking sockets. That is right: because of its async-by-design nature, there is no need for poll+read/write to deal with sockets. One submits a blocking read, and once it is ready it will show up in the completion ring.
  • It is flexible and extensible: new opcodes are being added at a rate that leads us to believe that indeed soon it will grow to re-implement every single Linux system call.

The io_uring interface works through two main data structures: the submission queue entry (sqe) and the completion queue entry (cqe). Instances of those structures live in a shared memory single-producer-single-consumer ring buffer between the kernel and the application.

The application asynchronously adds sqes to the queue (potentially many) and then tells the kernel that there is work to do. The kernel does its thing, and when work is ready it posts the results in the cqe ring. This also has the added advantage that system calls are now batched. Remember Meltdown? At the time I wrote about how little it affected our Scylla NoSQL database, since we would batch our I/O system calls through aio. Except now we can batch much more than just the storage I/O system calls, and this power is also available to any application.

The application, whenever it wants to check whether work is ready or not, just looks at the cqe ring buffer and consumes entries if they are ready. There is no need to go to the kernel to consume those entries.

Here are some of the operations that io_uring supports: read, write, send, recv, accept, openat, stat, and even way more specialized ones like fallocate.

This is not an evolutionary step. Although io_uring is slightly similar to aio, its extensibility and architecture are disruptive: it brings the power of asynchronous operations to anyone, instead of confining it to specialized database applications.

Our CTO, Avi Kivity, made the case for async at the Core C++ 2019 event. The bottom line is this; in modern multicore, multi-CPU devices, the CPU itself is now basically a network, the intercommunication between all the CPUs is another network, and calls to disk I/O are effectively another. There are good reasons why network programming is done asynchronously, and you should consider that for your own application development too.

It fundamentally changes the way Linux applications are to be designed: Instead of a flow of code that issues syscalls when needed, that have to think about whether or not a file is ready, they naturally become an event-loop that constantly add things to a shared buffer, deals with the previous entries that completed, rinse, repeat.

So, what does that look like? The code block below is an example on how to dispatch an entire array of reads to multiple file descriptors at once down the io_uring interface:

At a later time, in an event-loop manner, we can check which reads are ready and process them. The best part of it is that due to its shared-memory interface, no system calls are needed to consume those events. The user just has to be careful to tell the io_uring interface that the events were consumed.

This simplified example works for reads only, but it is easy to see how we can batch all kinds of operations together through this unified interface. A queue pattern also goes very well with it: you can just queue operations at one end, dispatch, and consume what’s ready at the other end.

Advanced Features

Aside from the consistency and extensibility of the interface, io_uring offers a plethora of advanced features for specialized use cases. Here are some of them:

  • File registration: every time an operation is issued for a file descriptor, the kernel has to spend cycles mapping the file descriptor to its internal representation. For repeated operations over the same file, io_uring allows you to pre-register those files and save on the lookup.
  • Buffer registration: analogous to file registration, the kernel has to map and unmap memory areas for Direct I/O. io_uring allows those areas to be pre-registered if the buffers can be reused.
  • Poll ring: for very fast devices, the cost of processing interrupts is substantial. io_uring allows the user to turn off those interrupts and consume all available events through polling.
  • Linked operations: allows the user to send two operations that are dependent on each other. They are dispatched at the same time, but the second operation only starts when the first one returns.

And as with other areas of the interface, new features are also being added quickly.


As we said, the io_uring interface is largely driven by the needs of modern hardware. So we would expect some performance gains. Are they here?

For users of linux-aio, like ScyllaDB, the gains are expected to be few, focused in some particular workloads and come mostly from the advanced features like buffer and file registration and the poll ring. This is because io_uring and linux-aio are not that different as we hope to have made clear in this article: io_uring is first and foremost bringing all the nice features of linux-aio to the masses.

We have used the well-known fio utility to evaluate 4 different interfaces: synchronous reads, posix-aio (which is implemented as a thread pool), linux-aio and io_uring. In the first test, we want all reads to hit the storage, and not use the operating system page cache at all. We then ran the tests with the Direct I/O flags, which should be the bread and butter for linux-aio. The test is conducted on NVMe storage that should be able to read at 3.5M IOPS. We used 8 CPUs to run 72 fio jobs, each issuing random reads across four files with an iodepth of 8. This makes sure that the CPUs run at saturation for all backends and will be the limiting factor in the benchmark. This allows us to see the behavior of each interface at saturation. Note that with enough CPUs all interfaces would be able to at some point achieve the full disk bandwidth. Such a test wouldn’t tell us much.

backend IOPS context switches IOPS ±% vs io_uring
sync 814,000 27,625,004 -42.6%
posix-aio (thread pool) 433,000 64,112,335 -69.4%
linux-aio 1,322,000 10,114,149 -6.7%
io_uring (basic) 1,417,000 11,309,574
io_uring (enhanced) 1,486,000 11,483,468 4.9%

Table 1: performance comparison of 1kB random reads at 100% CPU utilization using Direct I/O, where data is never cached: synchronous reads, posix-aio (uses a thread pool), linux-aio, and the basic io_uring as well as io_uring using its advanced features.

We can see that as we expect, io_uring is a bit faster than linux-aio, but nothing revolutionary. Using advanced features like buffer and file registration (io_uring enhanced) gives us an extra boost, which is nice, but nothing that justifies changing your entire application, unless you are a database trying to squeeze out every operation the hardware can give. Both io_uring and linux-aio are around twice as fast as the synchronous read interface, which in turn is twice as fast as the thread pool approach employed by posix-aio, which is surprisingly at first.

The reason why posix-aio is the slowest is easy to understand if we look at the context switches column at Table 1: every event in which the system call would block, implies one additional context switch. And in this test, all reads will block. The situation is just worse for posix-aio. Now not only there is the context switch between the kernel and the application for blocking, the various threads in the application have to go in and out the CPU.

But the real power of io_uring can be understood when we look at the other side of the scale. In a second test, we preloaded all the memory with the data in the files and proceeded to issue the same random reads. Everything is equal to the previous test, except we now use buffered I/O and expect the synchronous interface to never block — all results are coming from the operating system page cache, and none from storage.

Backend IOPS context switches IOPS ±% vs io_uring
sync 4,906,000  105,797 -2.3%
posix-aio (thread pool) 1,070,000 114,791,187 -78.7%
linux-aio 4,127,000 105,052 -17.9%
io_uring 5,024,000 106,683

Table 2: comparison between the various backends. Test issues 1kB random reads using buffered I/O files with preloaded files and a hot cache. The test is run at 100% CPU.

We don’t expect a lot of difference between synchronous reads and io_uring interface in this case because no reads will block. And that’s indeed what we see. Note, however, that in real life applications that do more than just read all the time there will be a difference, since io_uring supports batching many operations in the same system call.

The other two interfaces, however, suffer a big penalty: the large number of context switches in the posix-aio interface due to its thread pool completely destroys the benchmark performance at saturation. Linux-aio, which is not designed for buffered I/O, at all, actually becomes a synchronous interface when used with buffered I/O files. So now we pay the price of the asynchronous interface — having to split the operation in a dispatch and consume phase, without realizing any of the benefits.

Real applications will be somewhere in the middle: some blocking, some non-blocking operations. Except now there is no longer the need to worry about what will happen. The io_uring interface performs well in any circumstance. It doesn’t impose a penalty when the operations would not block, is fully asynchronous when the operations would block, and does not rely on threads and expensive context switches to achieve its asynchronous behavior. And what’s even better: although our example focused on random reads, io_uring will work for a large list of opcodes. It can open and close files, set timers, transfer data to and from network sockets. All using the same interface.

ScyllaDB and io_uring

Because Scylla scales up to 100% of server capacity before scaling out, it relies exclusively on Direct I/O and we have been using linux-aio since the start.

In our journey towards io_uring, we have initially seen results as high as 50% better in some workloads. At closer inspection, that made clear that this is because our implementation of linux-aio was not as good as it could be. This, in my view, highlights one usually underappreciated aspect of performance: how easy it is to achieve it. As we fixed our linux-aio implementation according to the deficiencies io_uring shed light into, the performance difference all but disappeared. But that took effort, to fix an interface we have been using for many years. For io_uring, achieving that was trivial.

However, aside from that, io_uring can be used for much more than just file I/O (as already mentioned many times throughout this article). And it comes with specialized high performance interfaces like buffer registration, file registration, and a poll interface with no interrupts.

When io_uring’s advanced features are used, we do see a performance difference: we observed a 5% speedup when reading 512-byte payloads from a single CPU in an Intel Optane device, which is consistent with the fio results in Tables 1 and 2. While that doesn’t sound like a lot, that’s very valuable for databases trying to make the most out of the hardware.


  Throughput         :      330 MB/s
  Lat average        :     1549 usec
  Lat quantile=  0.5 :     1547 usec
  Lat quantile= 0.95 :     1694 usec
  Lat quantile= 0.99 :     1703 usec
  Lat quantile=0.999 :     1950 usec
  Lat max            :     2177 usec
io_uring, with buffer and file registration and poll:
  Throughput         :      346 MB/s
  Lat average        :     1470 usec
  Lat quantile= 0.5  :     1468 usec
  Lat quantile= 0.95 :     1558 usec
  Lat quantile= 0.99 :     1613 usec
  Lat quantile=0.999 :     1674 usec
  Lat max            :     1829 usec
Reading 512-byte buffers from an Intel Optane device from a single CPU. Parallelism of 1000 in-flight requests. There is very little difference between linux-aio and io_uring for the basic interface. But when advanced features are used, a 5% difference is seen.

The io_uring interface is advancing rapidly. For many of its features to come, it plans to rely on another earth-shattering new addition to the Linux Kernel: eBPF.

What Is eBPF?

eBPF stands for extended Berkeley Packet Filter. Remember iptables? As the name implies, the original BPF allows the user to specify rules that will be applied to network packets as they flow through the network. This has been part of Linux for years.

But when BPF got extended, it allowed users to add code that is executed by the kernel in a safe manner in various points of its execution, not only in the network code.

I will suggest the reader to pause here and read this sentence again, to fully capture its implications: You can execute arbitrary code in the Linux kernel now. To do essentially whatever you want.

eBPF programs have types, which determine what they will attach to. In other words, which events will trigger their execution. The old-style packet filtering use case is still there. It’s a program of the BPF_PROG_TYPE_SOCKET_FILTER type.

But over the past decade or so, Linux has been accumulating an enormous infrastructure for performance analysis, that adds tracepoints and probe points almost everywhere in the kernel. You can attach a tracepoint, for example, to a syscall — any syscall — entry or return points. And through the BPF_PROG_TYPE_KPROBE and BPF_PROG_TYPE_TRACEPOINT types, you can attach bpf programs essentially anywhere.

The most obvious use case for this is performance analysis and monitoring. A lot of those tools are being maintained through the bcc project. It’s not like it wasn’t possible to attach code into those tracepoints before: Tools like systemtap allowed the user to do just that. But previously all that such probes could do was pass on information to the application in raw form which would then constantly switch to and from the kernel, making it unusably slow.

Because eBPF probes run in kernel space, they can do complex analysis, collect large swathes of information, and then only return to the application with summaries and final conclusions. Pandora’s box has been opened.

Here are some examples of what those tools can do:

  • Trace how much time an application spends sleeping, and what led to those sleeps. (wakeuptime)
  • Find all programs in the system that reached a particular place in the code (trace)
  • Analyze network TCP throughput aggregated by subnet (tcpsubnet)
  • Measure how much time the kernel spent processing softirqs (softirqs)
  • Capture information about all short-lived files, where they come from, and for how long they were opened (filelife)

The entire bcc collection is a gold mine and I strongly recommend the reader to take a look. But the savvy reader had already noticed by now that the main point is not that there are new tools. Rather, the point is that these tools are built upon an extensible framework that allows them to be highly specialized.

We could always measure network bandwidth in Linux. But now we can split it per subnet, because that’s just a program that was written and injected into the kernel. Which means if you ever need to insert specifics of your own network and scenario, now you can too.

Having a framework to execute code in the kernel goes beyond just performance analysis and debugging. We can’t know for sure how the marriage between io_uring and bpf will look like, but here are some interesting things that can happen:

io_uring supports linking operations, but there is no way to generically pass the result of one system call to the next. With a simple bpf program, the application can tell the kernel how the result of open is to be passed to read — including the error handling, which then allocates its own buffers and keeps reading until the entire file is consumed and finally closed: we can checksum, compress, or search an entire file with a single system call.

Where Is All This Going?

These two new features, io_uring and eBPF, will revolutionize programming in Linux. Now you can design apps that can truly take advantage of these large multicore multiprocessor systems like the Amazon i3en “meganode” systems, or take advantage of µsecond-scale storage I/O latencies of Intel Optane persistent memory.

You’re also going to be able to run arbitrary code in the kernel, which is hugely empowering for tooling and debugging. For those who have only done synchronous or POSIX aio thread pooling, there’s now a lot of new capabilities to take advantage of. These are exciting developments — even for those of you who are not database developers like us.

Originally published in The New Stack.

The post How io_uring and eBPF Will Revolutionize Programming in Linux appeared first on ScyllaDB.