Apache Cassandra Graviton2, gp3, and Vertical Scaling

Instaclustr Now Offering AWS Graviton2 Instances With gp3 volumes for Apache Cassandra®

Instaclustr has released AWS Graviton2 instances and gp3 volumes for Apache Cassandra®, providing customers with both performance and financial benefits.

Instaclustr is pleased to announce the general availability of AWS instances on Graviton2 processors, combined with gp3 volumes for Instaclustr managed Apache Cassandra. AWS Graviton2 infrastructure has better compute performance than other equivalent instances using the same number of vCPUs and the same amount of RAM, so provides notable performance improvement at a reduced cost.

Instaclustr managed Apache Cassandra on AWS Graviton2 has also incorporated the new generation of AWS Elastic Block Storage (EBS) gp3 disks that have a higher IO throughput compared to EBS gp2 disks. Previous IOPS constraints on smaller instances have been eliminated with the new generation of AWS gp3 volumes that come with at least 3000 IOPS no matter the disk size, as well as the ability to specify additional IOPS and throughput independent of disk size. This is good news for our customers that previously had to provision larger than necessary disks for the purpose of achieving sufficient throughput. Through load testing, Instaclustr has matched the provisioned IOPS and throughput of the gp3 disks with the capabilities of the associated instance to achieve an overall balanced infrastructure configuration for optimum price/performance.

The post Apache Cassandra Graviton2, gp3, and Vertical Scaling appeared first on Instaclustr.

Benchmarking Apache Cassandra (40 Nodes) vs ScyllaDB (4 Nodes)

Since the NoSQL revolution in database management systems kicked off over a decade ago, organizations of all sizes have benefitted from a key feature: massive scale using relatively inexpensive commodity hardware. It has enabled organizations to deploy architectures that would have been prohibitively expensive and impossible to scale with traditional relational database systems.

Commodity hardware itself has undergone a transformation over the same decade, but most modern software doesn’t take advantage of modern computing resources. Most frameworks that scale out for data-intensive applications don’t scale up. They aren’t able to take advantage of the resources offered by large nodes, such as the added CPU, memory and solid-state drives (SSDs), nor can they store large amounts of data on disk efficiently. Managed runtimes, like Java, are further constrained by heap size. Multi-threaded
code, with its locking overhead and lack of attention for non-uniform memory architecture (NUMA), imposes a significant performance penalty against modern hardware architectures.

Software’s inability to keep up with hardware advancements has led to the widespread belief that running database infrastructure on many small nodes is the optimal architecture for scaling massive workloads. The alternative, using small clusters of large nodes, is often viewed with skepticism. A few common concerns are that large nodes won’t be fully utilized, that they have a hard time streaming data when scaling out and, finally, they might have a catastrophic effect on recovery times.

Based on our experiences with ScyllaDB,– a fast and scalable database that takes full advantage of modern infrastructure and networking capabilities, we were confident that scaling up beats scaling out. So, we put it to the test.

ScyllaDB is API-compatible with Apache Cassandra (and DynamoDB compatible too); it provides the same Cassandra Query Language (CQL) interface and queries, the same drivers, even the same on-disk SSTable format. Our core Apache Cassandra 4.0 performance benchmarking used identical three-node hardware for both ScyllaDB and Cassandra (TL;DR, ScyllaDB performed 3x-8x better). Since ScyllaDB scales well vertically, we executed what we are calling a “4×40” test, with a large-scale setup where we used node sizes optimal for each database. Since ScyllaDB’s architecture takes full advantage of extremely large nodes, we compared a setup of four i3.metal machines (288 vCPUs in total) vs. 40 i3.4xlarge Cassandra machines (640 vCPUs in total, almost 2.5 times ScyllaDB’s resources).

In terms of CPU count, RAM volume or cluster topology, this would appear to be an apples-to-oranges comparison. You might wonder why anyone would ever consider such a test. After all, we’re comparing four machines to 40 very different machines. Due to ScyllaDB’s shard-per-core architecture and custom memory management, we know that ScyllaDB can utilize very powerful hardware. Meanwhile, Cassandra and its JVM’s garbage collectors are optimized to be heavily distributed with many smaller nodes.

The true purpose of this test was to see whether both CQL solutions could perform similarly in this duel, even with Cassandra requiring about 2.5 times more hardware, for 2.5x the cost. What’s really at stake here is a reduction in administrative burden: Could a DBA be maintaining just four servers instead of 40?

The 4 x 40 Node Setup

We set up clusters on Amazon EC2 in a single Availability Zone within us-east-2 data center. The ScyllaDB cluster consisted of four i3.metal VMs and the competing Cassandra cluster consisted of 40 i3.4xlarge VMs. Servers were initialized with clean machine images (AMIs) of Ubuntu 20.04 (Cassandra 4.0) or CentOS 7.9 (ScyllaDB 4.4).

Apart from the cluster, 15 loader machines were used to run cassandra-stress to insert data, and – later – to provide background load at CL=QUORUM to mess with the administrative operations.


ScyllaDB Cassandra Loaders
EC2 Instance type i3.metal i3.4xlarge c5n.9xlarge
Cluster size 4 40 15
Storage (total) 8x 1.9 TB NVMe in RAID0

(60.8 TB)

2x 1.9 TB NVMe in RAID0

(152 TB)

Not important for a loader (EBS-only)
Network 25 Gbps Up to 10 Gbps 50 Gbps
vCPUs (total) 72 (288) 16 (640) 36 (540)
RAM (total) 512 (2048) GiB 122 (4880) GiB 96 (1440) GiB


Once up and running, both databases were loaded with random data at RF=3 until the cluster’s total disk usage reached approximately 40 TB. This translated to 1 TB of data per Cassandra node and 10 TB of data per ScyllaDB node. After loading was done, we flushed the data and waited until the compactions finished, so we could start the actual benchmarking.

Spoiler: We found that a ScyllaDB cluster can be 10 times smaller in node count and run on a cluster 2.5x less expensively, yet maintain the equivalent performance of Cassandra 4. Here’s how it played out.

Throughput and Latencies

UPDATE Queries

The following shows the 90- and 99-percentile latencies of UPDATE queries, as measured on:

  • Four-node ScyllaDB cluster (4 x i3.metal, 288 vCPUs in total)
  • 40-node Cassandra cluster (40 x i3.4xlarge, 640 vCPUs in total)

The workload was uniformly distributed — every partition in the multi-terabyte dataset had an equal chance of being selected/updated.

Under low load, Cassandra slightly outperformed ScyllaDB. The reason is that ScyllaDB runs more compaction automatically when it is idle and the default scheduler tick of 0.5 milliseconds hurts the P99 latency. (Note, there is a parameter that controls this, but we wanted to provide out-of-the-box results with zero custom tuning or configuration.)

Under high load with P99 latency <10 milliseconds, ScyllaDB’s throughput on four nodes was 33% higher than Cassandra’s on 40 nodes.

SELECT Queries

The following shows the 99-percentile latencies of SELECT queries, as measured on:

  • Four-node ScyllaDB cluster (4 x i3.metal, 288 vCPUs in total)
  • 40-node Cassandra cluster (40 x i3.4xlarge, 640 vCPUs in total

The workload was uniformly distributed — every partition in the multi-terabyte dataset had an equal chance of being selected/updated. Under low load Cassandra slightly outperformed ScyllaDB. Under high load with P99 latency <10 milliseconds, ScyllaDB’s throughput on four nodes was again 33% higher than Cassandra’s on 40 nodes.

Scaling up the Cluster by 25%

In this benchmark, we increase the capacity of the cluster by 25%:

  • By adding a single ScyllaDB node to the cluster (from four nodes to five)
  • By adding 10 Cassandra nodes to the cluster (from 40 nodes to 50 nodes)


ScyllaDB 4.4.3 Cassandra 4.0 ScyllaDB 4.4 vs. Cassandra 4.0
Add 25% capacity 1 hour, 29 minutes 16 hours, 54 minutes 11x faster

Performing Major Compaction

In this benchmark, we measure the throughput of a major compaction. To compensate for Cassandra having 10 times more nodes (each having 1/10th of the data), this benchmark measures the throughput of a single ScyllaDB node performing major compaction and the collective throughput of 10 Cassandra nodes performing major compactions concurrently.

Here, ScyllaDB ran on a single i3.metal machine (72 vCPUs) and competed with a 10-node cluster of Cassandra 4 (10x i3.4xlarge machines; 160 vCPUs in total). ScyllaDB can split this problem across CPU cores, which Cassandra cannot. ScyllaDB performed 32x better in this case.

The Bottom Line

The bottom line is that a ScyllaDB cluster can be 10 times smaller and 2.5 times less expensive, yet still outperform Cassandra 4.0 by 42%. In this use case, choosing ScyllaDB over Cassandra 4.0 would result in hundreds of thousands annual savings in hardware costs alone, without factoring in reduced administration costs or environmental impact. Scaling the cluster is 11 times faster, and ScyllaDB provides additional features, from change data capture (CDC) to cache bypass and native Prometheus support and more. That’s why teams at companies such as Discord, Expedia, Fanatics and Rakuten have switched.

For more details on how this test was configured and how to replicate it, read the complete benchmark.

Distributed Databases Compared

The past decade saw the rise of fully distributed databases. Not just local clustering to enable basic load balancing and provide high availability — with attributes such as rack awareness within a datacenter. truly distributed systems that could be spanned across the globe, designed to work within public clouds — across availability zones, regions, and, with orchestration technologies, even across multiple cloud providers and on-premises hybrid cloud deployments.

Likewise, this past decade has seen a plethora of new database systems designed specifically for distributed database deployments, and others that had distributed architectural components added to their original designs.

DB-Engines.com Top 100 Databases

For everyone who has never visited this site, I wanted to draw your attention to DB-engines.com. It’s the “Billboard Charts” of databases. It keeps a rough popularity index of all the databases you can imagine, weighted using an algorithm that tracks things like the number of mentions on websites and the Google trends of searches, to discussions on Stack Overflow or mention in Tweets, to jobs postings asking for these as technical skills, to the number of profiles that mention these technologies by name in their LinkedIn profile.

Top 100 Databases on DB-Engines.com as of May 2022

While there are hundreds of different databases that it tracks — 394 in total as of May 2022 — let’s narrow down to just looking at the top 100 listings. It reveals a great deal about the current state of the market.

Relational Database Management Systems (RDBMS), traditional SQL systems, remain the biggest category: 47% of the listings.

Another 25% are NoSQL systems spanning a number of different types: document databases like MongoDB, key-value systems like Redis, wide column databases like ScyllaDB, and graph databases like Neo4j.

There’s also a fair-sized block (11%) of databases listed as “multi-model.” These include hybrids that support both SQL and NoSQL in the same system — such as Microsoft Cosmos DB or ArangoDB — or databases that support more than one type of NoSQL data model, such as DynamoDB, which lists itself as both a NoSQL key-value system and a document store.

Lastly, there’s a few slices of the pie comprised of miscellaneous special-purpose databases from search engines to time series databases, and others those that don’t easily fall into simple “SQL vs. NoSQL” delineations.

But are all of these top databases “distributed” databases? What does that term even mean?

What Defines a Distributed Database?

SQL is formally standardized as per ANSI/ISO/IEC 9075:2016. It hasn’t changed in six years. But what has changed over that time is how people have architected distributed RDBMS systems compliant with SQL. Those continue to evolve. Distributed SQL such as PostgreSQL. Or “NewSQL” systems such as CockroachDB.

Conversely there’s no ANSI or ISO or IETF or W3C definition of what a “NoSQL database” is. Each is proprietary, or at best uses some de facto standard, such as the Cassandra Query Language (CQL) for wide column NoSQL databases, or, say, the Gremlin/Tinkerpop query methods for graph databases.

Yet these are just query protocols. They don’t define how the data gets distributed across these databases. That’s an architectural issue query languages don’t and won’t address.

So whether SQL or NoSQL, there’s no standard or protocol or consensus on what a “distributed database” is.

Thus I took some time to write up my own definition. I’ll freely admit this is more of a layman’s pragmatic view than a computer science professor’s.

In brief, you have to decide how you define a cluster, and distribute data across it.

Next, you have to determine what the roles of each of the nodes of the cluster are. Is every node a peer, or are some nodes in a more superior leader position and others are more followers.

And then, based on these roles, how do you deal with failover?

Lastly, you have to figure out based on this how you replicate and shard your data as evenly and easily as possible.

And this isn’t attempting to be exhaustive. You can add your own specific criteria.

The Short List: Systems of Interest

So with all this in mind, let’s go into our Top 100 and find five examples to see how they compare when measured against those attributes. I’ve chosen two SQL systems and three NoSQL systems.

PostgreSQL MongoDB
CockroachDB Redis
ScyllaDB (Cassandra)

Postgres and CockroachDB represent the best of distributed SQL. CockroachDB is referred to as “NewSQL” — designed specifically for this world of distributed databases. (If you are specifically interested in CockroachDB, check out this article where we go into a deeper comparison.)

MongoDB, Redis and ScyllaDB are my choices for distributed NoSQL. The first as an example of a document database. The second as a key-value store, and ScyllaDB as a wide-column database — also called a “key-key-value” database.

Also note that, for the most part, what is true for ScyllaDB is, in many cases, also true of Apache Cassandra and other Cassandra-compatible systems.

I apologize to all those whose favorite systems weren’t selected. But I hope if you have another system in mind you compare what we say for these systems to others you have in mind.

For now, I will presume you already have professional experience, and you’re somewhat familiar with the differences of SQL vs. NoSQL. If not, check out our primer on SQL vs NoSQL. Basically, if you need a table JOIN, stick with SQL and an RDBMS. If you can denormalize your data, then maybe NoSQL might be a good fit. We’re not going to argue whether one or the other is “better” as a data structure or query language. We are here to see if any of these are better as a distributed database.

Multi-Datacenter Clustering

Let’s see how our options compare in terms of clustering. Now, all of them are capable of clustering and even multi-datacenter operations. But in the cases of PostgreSQL, MongoDB, and Redis — these designs predate multi-datacenter design as an architectural requirement. They were designed in the world of single datacenter local clustering to begin with.

Postgres, first released in 1986, totally predates the concepts of cloud computing. But over time it evolved to allow for these advances and capabilities to be bolted on to its design.

CockroachDB, part of the NewSQL revolution, was designed from the ground up with global distribution in mind.

MongoDB, released at the dawn of the public cloud, was initially designed with a single datacenter cluster in mind, but now has added support for quite a lot of different topologies. And with MongoDB Atlas, you can deploy to multiple regions pretty easily.

Redis, because of its low-latency design assumptions, is generally deployed on a single datacenter. But it has enterprise features that allow for multi-datacenter deployments.

ScyllaDB, like Cassandra, was designed with multi-datacenter deployments in mind from the get-go.


How you do replication and sharding is also dependent upon how hierarchical or homogeneous your database architecture is.

For example, in MongoDB there is a single primary server; the rest are replicas of that primary. This is referred to as a replica set. You can only make writes to this primary copy of the database. The replicas are read-only. You can’t update them directly. Instead, you write to the primary, and it updates the replicas. So nodes are heterogenous; not homogenous.

This helps distribute traffic in a read-heavy workload, but in a mixed or write-heavy workload it’s not doing you much good at all. The primary can become a bottleneck.

As well, what happens if the primary goes down? You’ll have to hold up writes entirely until the cluster elects a new primary, and write operations are shunted over to it. It’s a concerning single point of failure.

Instead, if you look at ScyllaDB, or Cassandra, or any other leaderless, peer-to-peer system — these are known as “active-active,” because clients can read from or write to any node. There’s no single point of failure. Nodes are far more homogenous.

And each node can and will update any other replicas of the data in the cluster. So if you have a replication factor of 3, and three nodes, each node will get updated based on any writes to the other two nodes.

Active-active is inherently more difficult to do computationally, but once you solve for how the servers keep each other in sync, you end up with a system that can load balance mixed or write-heavy workloads far better, because each node can serve reads or writes.

So how do our various examples stack up in regards to primary-replica or active-active peer-to-peer?

CockroachDB and ScyllaDB (and Cassandra) began with a peer-to-peer active-active design in mind.

Whereas in Postgres there’s a few optional ways you can do it, but it’s not built in.

Also, active-active is not officially supported in MongoDB, but there have been some stabs at how to do it.

And with Redis, an active-active model is possible with Conflict-free replicated data types — CRDTs — in Redis Enterprise.

Otherwise Postgres, MongoDB and Redis all default to a primary-replica data distribution model.


Distributed systems designs also affect how you might distribute data across the different racks or datacenters you’ve deployed to. For example, given a primary-replica system, only the datacenter with the primary can serve any write workloads. Other datacenters can only serve as a read-only copy.

In a peer-to-peer system that supports multi-datacenter clustering, each node in the overall cluster can accept reads or writes. This allows for better geographic workload distribution.

With ScyllaDB you can decide, for example, to have the same or even different replication factors per site. Here I’ve shown the possibility of having three replicas of data at one datacenter and two replicas at another.

Operations then can have different levels of consistency. You might have a local quorum read or write at the three node datacenter — requiring two of three nodes to be updated for local quorum. Or you might have a cluster-wide quorum, requiring any three nodes across either or both datacenters to be updated for an operation to be successful. Tunable consistency, combined with multi-datacenter topology awareness, basically gives you a lot more flexibility to customize workloads.

Topology Awareness

Local clustering was the way distributed databases began, allowing more than one system to share the load. This was important if you wanted to allow for sharding your database across multiple nodes, or replicating data if you wanted to have high availability by ensuring the same data was available on multiple nodes.

But if all your nodes were installed in the same rack, and if that rack went down, that’s no good. So topology awareness was added so that you could be rack aware within the same datacenter. This ensures you spread your data across multiple racks of that datacenter, thus minimizing outages if power or connectivity is lost to one rack or another. That’s the barest-bones form of topology awareness you’d want.

Some databases do better than this, and allow for multiple copies of the database to be running in different datacenters, with some sort of cross-cluster updating mechanism. Each of these databases is running autonomously. Their synchronization mechanism could be unidirectional — one datacenter updating a downstream replica — or it could bidirectional or multi-directional.

This geographic distribution can minimize latencies by allowing connections closer to users. Spanning a database across availability zones or regions also ensures that no single datacenter disaster means you’ve lost part or all of your database. That actually happened to one of our customers last year, but because they were deployed across three different datacenters, they lost zero data.

Cross-cluster updates were at first implemented in a sort of gross, batch level. Ensuring your datacenters got in sync at least once a day. Ugh. That didn’t cut it for long. So people started ensuring more active, transaction-level updates.

The problem there was, if you were running strongly consistent databases, you were limited based on the real-time propagation delay of the speed of light. So eventual consistency was implemented to allow for multi-datacenter, per operation updates, with the understanding and tradeoff that in the short term it might take time before your data was consistent across all your datacenters.

So how do our exemplars stack up in terms of topology awareness?

So, again, with CockroachDB and ScyllaDB this comes built in.

Topology awareness was also made part of MongoDB starting circa 2015. So, not since its launch in 2009, but certainly they have years of experience with it.

Postgres and Redis were originally designed to be single datacenter solutions, thus dealing with multi-datacenter latencies are sort of an anti-pattern for both. Now, you can add-on topology awareness, like you can add on active-active system features, but it doesn’t come out of the box.

So let’s review what we’ve gone over by looking at these databases individually versus these attributes.


“Postgres” is one of the most popular implementations of SQL these days. It offers local clustering out of the box.

However, Postgres, as far as I know, is still working on its cross-cluster and multi-datacenter clustering. You may have to put some effort into getting it working.

Because SQL is grounded in a strongly consistent transactional mindset, it doesn’t lend itself well to spanning a cluster across a wide geography. Each query would be held up by long latency delays between all the relevant datacenters.

Also, Postgres relies upon a primary-replica model. One node in the cluster is the leader, and the others are replicas. And while there are load balancers for it, or active-active add-ons those are also beyond the base offering.

Finally, sharding in Postgres still remains manual for the most part, though they are making advances in developing auto-sharding which are, again, beyond the base offering.


CockroachDB bills itself as “NewSQL” — a SQL database designed in mind for distribution. This is a SQL designed to be survivable. Hence the name.

Note that CockroachDB uses the Postgres wire protocol, and borrows heavily from many concepts pioneered in Postgres. However, it doesn’t limit itself to the Postgres architecture.

Multi-datacenter clustering and peer-to-peer leaderless topology is built-in from the get-go.

So is auto-sharding and data replication.

And it has datacenter-awareness built in, and you can add rack-awareness too.

The only caveat to CockroachDB — and you may see it as a strength or a weakness — is that it requires strong consistency on all its transactions. You don’t have the flexibility of eventual consistency nor tunable consistency. Which will lower throughput and require high baseline latencies in any cross-datacenter deployment.


MongoDB is the venerable leader of the NoSQL pack. So over time as it developed a lot of distributed database capabilities were added. It’s come a long way from its origins. Now MongoDB is capable of multi-datacenter clustering. It still follows a primary-replica model for the most part, but there are ways to make it peer-to-peer active-active.


Next up is Redis, a key-value store designed to act as an in-memory cache or datastore. While it can persist data, it suffers from a huge performance penalty if the dataset doesn’t fit into RAM.

Because of that, it was designed with local clustering in mind. Because if you can’t afford to wait five milliseconds to get data off an SSD, you probably can’t wait 145 milliseconds to make the network round trip time from San Francisco to London.

However, there are enterprise features that do allow multi-datacenter Redis clusters for those who do need geographic distribution.

Redis operates for the most part as a primary-replica model. Which is appropriate for a read-heavy caching server. But what it means is that the primary is where the data needs to get written to first, which will then fan out to the replicas to help balance their caching load.

There is an enterprise feature to allow peer-to-peer active-active clusters.

Redis does auto-shard and replicate data, but its topology awareness is limited to rack-awareness as an enterprise feature.


Finally we get to ScyllaDB. It was patterned after the distributed database model found in Apache Cassandra. And so it comes, by default, with multi-datacenter clustering. A leaderless active-active topology.

It automatically shards and has tunable consistency per operation, and, if you want stronger consistency, even supports lightweight transactions to provide linearizability of writes.

As far as topology awareness, ScyllaDB of course supports rack-awareness and datacenter-awareness. It even supports token-awareness and shard-awareness to know not only which node data will be stored in, but even down to which CPU is associated with that data.

All the goodness you need from a distributed database.


While there is no industry standard for what a distributed database is, we can see that many of the leading SQL and NoSQL databases support a core set of features or attributes to some degree or other. Some of these features are built in, and some are considered value-added packages or third-party options.

Of the five exemplar distributed database systems analyzed herein, CockroachDB offers the most comprehensive combination of features and attributes out-of-the-box for SQL databases and ScyllaDB offers the most comprehensive combination for NoSQL systems.

This analysis should be considered a point-in-time survey. Each of these systems is constantly evolving given the relentless demands of this next tech cycle. Database-as-a-Service offerings. Serverless options. Elasticity. The industry is not standing still.

The great news for users is that with each year more advances are being made to distributed databases to make them ever more flexible, more performant, more resilient, and more scalable.

If you’d like to stay on top of the advances being made in ScyllaDB, we encourage you to join our user community in Slack.


5 Tips to Optimize Your CQL Queries

You’ve finalized the development of your twitter-killer app. You heard of NoSQL databases and decided to use ScyllaDB for its sub-millisecond writes and high availability. The app looks also great! But then, you get an email from your colleague telling you there are issues with the load tests.

ScyllaDB Monitoring Dashboard: CQL

The above screenshot is the ScyllaDB Monitoring dashboard, more specifically from the Scylla CQL dashboard. We can clearly see red gages indicating there are a few issues in the app.

How can you further optimize your code to face millions of operations? Here are five ways to get the most out of your CQL queries.

1. Prepare Your Statements

We can see from the ScyllaDB CQL Dashboard that 99% of my queries during my test were non-prepared statements. To understand what this means and how it impacts performance, let’s talk about how queries are executed in ScyllaDB.

ScyllaDB CQL Dashboard Non-Prepared Statements

You can run a query using the execute function like so:

rows = session.execute(‘SELECT name, age, email FROM users’)
for user_row in rows:
print user_row.name, user_row.age, user_row.email

The above query is an example of a Simple Statement. When executed, ScyllaDB will parse the query string again, without the use of a cache. This is inefficient if you run the same queries often.

If you often execute the same query, consider using Prepared Statements instead.

user_lookup_stmt = session.prepare(“SELECT * FROM users WHERE user_id=?”)users = []
for user_id in user_ids_to_query:
user = session.execute(user_lookup_stmt, [user_id])

When you prepare the statement, ScyllaDB will parse the query string, cache the result and return a unique identifier. When you execute the prepared statement, the driver will only send the identifier, which allows skipping the parsing phase. Additionally, you’re guaranteed your query will be executed by the node holding the data.

Step 1: Parse query and return statement id

Step 2: Send id and values

2. Page Your Queries

We can observe from the graph below that only a tiny fraction of my queries are non-paged. However, if we consider the fact that every query triggers the scan of an entire table and that the client might not need the entire data, we can understand how this is not efficient.

ScyllaDB Dashboard Non-Paged CQL Reads

This one might sound obvious. If your users query an entire table on a consistent basis, paging could improve latency tremendously. To do so, you can add the fetch_size argument to the statement.

query = "SELECT * FROM users"
statement = SimpleStatement(query, fetch_size=10)
for user_row in session.execute(statement):

3. Avoid Allow Filtering

The below query selects all users with a particular name.

query = "SELECT * FROM users WHERE name=%s ALLOW FILTERING"

Let’s see what the above does in more detail.

From the dashboard, we can see that we only have about 100 reads per second, which in theory can be considered negligible.


However, the query triggers a scan of the entire table. This means that the query reads the entire users’ table just to return a few, which is inefficient.

Because of the way ScyllaDB is designed, it’s far more efficient to use an index for lookups. Thus, your query will always find the node holding the data without scanning the entire table.

query = "SELECT * FROM users WHERE id=%s"

If you feel that is not enough, you might consider revisiting your schema.

4. Bypass Cache

For low latency purposes, ScyllaDB looks for the result of your query in the cache first. In case the data is not present in the cache, the database will read from the disk. We use BYPASS CACHE to avoid unnecessary lookups in the cache and get the data straight from the disk.

ScyllaDB CQL Dashboard Range Scans without BYPASS CACHE

You can use BYPASS CACHE for rare range scans to inform the database that the data is unlikely to be in memory and need to be fetched directly from the disk instead. This will avoid an unnecessary lookup in the cache.

SELECT name, occupation FROM users WHERE userid IN (199, 200, 207) BYPASS CACHE;
SELECT * FROM users WHERE birth_year = 1981 AND country = 'US' ALLOW FILTERING BYPASS CACHE;


CL stands for Consistency Level. To better understand what it is, let’s review the journey of an INSERT query.

ScyllaDB is a distributed database. The cluster is formed by a group of nodes (or machines) that communicate with each other. When the client sends an insert query, it is first sent to a random node (the coordinator) that will pass the data to other nodes to save copies of it. The number of nodes the data is copied to is called the Replication Factor.

ScyllaDB cluster with Replication Factor = 3 and Consistency Level = Quorum

QUORUM: it provides better availability than ONE and better latency than ALL.

While the client awaits the database’s response, the coordinator also needs a response from the nodes replicating the data. The number of nodes the coordinator waits for is called the Consistency Level.

Let’s now get back to QUORUM. QUORUM means that the coordinator requires the majority of the nodes to send a response before it sends itself a response to the client. The majority is (Replication Factor / 2 )+ 1 . In the above case, it would 2.

Why should you use QUORUM? Because it provides better availability than ONE and better latency than ALL.

With CL=ONE, the coordinator sends a response to the client as fast as it inserts the data, without waiting for other nodes. In the case that the nodes are down, the data is not replicated and therefore not highly available.

With CL=ALL, the coordinator needs a response of all nodes in order to respond to the client, which increases latency.

Consistency Level ONE (left) vs ALL (right)


An app that performs well in development might face more issues in production. The ScyllaDB Monitoring Dashboard describes the health of your cluster’s nodes and gives you a good indication of the quality of your CQL queries and how they perform at scale. Consider using it at an early stage, for development or testing, to capture potential issues in your code and fix them before they ever get to your users.

ScyllaDB on the New AWS EC2 I4i Instances: Twice the Throughput & Lower Latency

As you might have heard, AWS recently released a new AWS EC2 instance type perfect for data-intensive storage and IO-heavy workloads like ScyllaDB: the Intel-based I4i. According to the AWS I4i description, “Amazon EC2 I4i instances are powered by 3rd generation Intel Xeon Scalable processors and feature up to 30 TB of local AWS Nitro SSD storage. Nitro SSDs are NVMe-based and custom-designed by AWS to provide high I/O performance, low latency, minimal latency variability, and security with always-on encryption.”

Now that the I4i series is officially available, we can share benchmark results that demonstrate the impressive performance we achieved on them with ScyllaDB (a high-performance NoSQL database that can tap the full power of high-performance cloud computing instances).

We observed up to 2.7x higher throughput per vCPU on the new I4i series compared to I3 instances for reads. With an even mix of reads and writes, we observed 2.2x higher throughput per vCPU on the new I4i series, with a 40% reduction in average latency than I3 instances.

We are quite excited about the incredible performance and value that these new instances will enable for our customers going forward.

How the I4i Compares: CPU and Memory

For some background, the new I4i instances, powered by “Ice Lake” processors, have a higher CPU frequency (3.5 GHz) vs. the I3 (3.0 GHz) and I3en (3.1 GHz) series.

Moreover, the i4i.32xlarge is a monster in terms of processing power, capable of packing in up to 128 vCPUs. That’s 33% more than the i3en.metal, and 77% greater than the i3.metal.

We correctly predicted ScyllaDB should be able to support a high number of transactions on these huge machines and set out to test just how fast the new I4i was in practice. ScyllaDB really shines on machines with many CPUs because it scales linearly with the number of cores thanks to our unique shard-per-core architecture. Most other applications cannot take full advantage of this large number of cores. As a result, the performance of other databases might remain the same, or even drop, as the number of cores increases.

In addition to more CPUs, these new instances are also equipped with more RAM. A third more than the i3en.metal, and twice that of the i3.metal.

The storage density on the i4i.32xlarge (TB storage / GB RAM) is similar in proportion to the i3.metal, while the i3en.metal has more. This is as expected. In total storage, the i3.metal maxes out at 15.2 TB, the i3en.metal can store a whopping 60 TB, while the i4i.32xlarge is perfectly nestled about halfway between both, at 30 TB storage — twice the i3.metal, and half the i3en.metal. So if storage density per server is paramount to you, the I3en series still has a role to play. Otherwise, in terms of CPU count and clock speed, memory and overall raw performance, the I4i excels. Now let’s get into the details.

EC2 I4i Benchmark Results

The performance of the new I4i instances is truly impressive. AWS worked hard to improve storage performance using the new Nitro SSDs, and that work clearly paid off. Here’s how the I4i’s performance stacked up against the I3’s.

Operations per Second (OPS) throughput results on i4i.16xlarge (64 vCPU servers) vs i3.16xlarge with 50% Reads / 50% Writes (higher is better)

P99 latency results on i4i.16xlarge (64 vCPU servers) vs i3.16xlarge with 50% Reads / 50% Writes – latency with 50% of the max throughput (lower is better)

On a similar kind of server with the same number of cores, we achieved more than twice the throughput on the I4i – with better P99 latency.

Yes. Read that again. The long-tail latency is lower even though the throughput has more than doubled. This doubling applies to both the workloads we tested. We are really excited to see this, and look forward to seeing what an impact this makes for our customers.

Note the above results are presented per server, assuming a data replication factor of 3 (RF=3).

High cache hit rate performance results on i4i.16xlarge (64 vCPU servers) vs i3.16xlarge with 50% Reads / 50% Writes (3 node cluster) – latency with 50% of the max throughput

Just three I4i.16xlarge nodes support well over a million requests per second –  with a realistic workload. With the higher-end i4i.32xlarge, we’re expecting at least twice that number of requests per second.

“Essentially, if you have the I4i available in your region, use it for ScyllaDB”

Essentially, if you have the I4i available in your region, use it for ScyllaDB. It provides superior performance – in terms of both throughput and latency – over the previous generation of EC2 instances.




A Graph Data System Powered by ScyllaDB and JanusGraph

Today’s blog is a lesson taken from ScyllaDB University course tS110: The Mutant Monitoring System (MMS)  A Graph Data System Powered by ScyllaDB and JanusGraph. Feel free to log in or register for ScyllaDB University to get credit for the course, and to see all the rest of the NoSQL database courses available. 


This lesson will teach you how to deploy a Graph Data System, using JanusGraph and ScyllaDB as the underlying data storage layer.

A graph data system (or graph database) is a database that uses a graph structure with nodes and edges to represent data. Edges represent relationships between nodes, and these relationships allow the data to be linked and for the graph to be visualized. It’s possible to use different storage mechanisms for the underlying data, and this choice affects the performance, scalability, ease of maintenance, and cost.

Some common use cases for graph databases are knowledge graphs, recommendation applications, social networks, and fraud detection.

JanusGraph is a scalable open-source graph database that’s optimized for storing and querying graphs containing hundreds of billions of vertices and edges distributed across a multi-machine cluster. It stores graphs in adjacency list format, which means that a graph is stored as a collection of vertices with their edges and properties.

JanusGraph natively supports the graph traversal language Gremlin. Gremlin is a part of Apache TinkerPop and is developed and maintained separately from JanusGraph. Many graph databases support it, and by using it, users avoid vendor lock-in as their applications can be migrated to other graph databases. You can learn more about the architecture on the JanusGraph Documentation page.

JanusGraphs supports ElasticSearch as an indexing backend. ElasticSearch is a search and analytics engine based on Apache Lucene. JanusGraph server (see Running the JanusGraph server) automatically starts a local ElasticSearch instance. It’s also possible to manually start a local ElasticSearch instance, however, this is not in the scope of this lesson. (However, see the documentation here.)


This lesson presents a step-by-step, hands-on example for deploying JanusGraph and performing some basic operations. Here are the main steps in the lesson:

  1. Spinning up a virtual machine on AWS
  2. Installing the prerequisites
  3. Running the JanusGraph server (using Docker)
  4. Running a Gremlin Console to connect to the new server (also in Docker)
  5. Spinning up a three-node Scylla Cluster and setting it as the data storage for the JanusGraph server
  6. Performing some basic graph operations

Since there are many moving parts to setting up JanusGraph and lots of options, the lesson goes over each step in the setup process. It uses AWS, but you can follow through with the example with another cloud provider or with a local machine.

Note that this example is for training purposes only and in production, things should be done differently. For example, each Scylla node should be run on a single server.

Launch an AWS EC2 Instance

Start by spinning up a t3.large Amazon Linux machine. From the EC2 console, select Instances and Launch Instances. In the first step choose the “Amazon Linux 2 AMI (HVM) – Kernel 5.10, SSD Volume Type – ami-008e1e7f1fcbe9b80 (64-bit x86)” AMI. In the second step, select t3.2xlarge. The reason for using an 2xlarge machine is that we’re running a few docker instances on the same server and it requires some resources. Change the storage to 16gb.

Once the machine is running, connect to it using SSH. In the command below, replace the path to your key pair and the public DNS name of your instance.

ssh -i ~/Downloads/aws/guy-janus.pem ec2-user@ec2-3-21-28-125.us-east-2.compute.amazonaws.com

Next, make sure all your packages are updated:

sudo yum update

Install Java

JanusGraph requires Java. To install Java and its prerequisites, run:

sudo yum install java

Run the following command to make sure the installation succeeded:

java -version

JanusGraph required the $JAVA_HOME environment variable to be set, and it also needs to be added to the $PATH. This isn’t automatically set when installing Java. So you’ll set it manually.

Locate the Java installation location by running:

sudo update-alternatives --config java

Copy the path to the Java Installation. Now, edit the .bashrc file found in the home directory of the ec2-user user, and add the two lines below to the file, replacing the path with the one you copied, without the /bin/java at the end.

export JAVA_HOME="/usr/lib/jvm/java-17-amazon-corretto.x86_64"

Save the file and run the following command:

source .bashrc

Make sure the environment variables were set:

echo $PATH

Docker Installation

Next, since you’ll be running JanusGraph and Scylla in Docker containers, install docker and its prerequisites:

sudo yum install docker

Start the Docker service:

sudo service docker start

and (optionally) to ensure that the Docker daemon starts after each system reboot:

sudo systemctl enable docker

Add the current user (ec2-user) to the docker group so you can execute Docker commands without using sudo:

sudo usermod -a -G docker ec2-user

To activate the changes to the groups, run:

newgrp docker

Finally, make sure that the Docker engine was successfully installed by running the hello-world image:

docker run hello-world

Deploying JanusGraph

There are a few options for how to deploy JanusGraph. In this lesson, you’ll deploy JanusGraph in a docker container to simplify the process. To learn about other ways of deploying JanusGraph read the Installation Guide.

Start a docker container for the JanusGraph server:

docker run --name janusgraph-default janusgraph/janusgraph:latest

ScyllaDB Data Storage Backend, Transactional and Analytical Workloads

In this part, you’ll spin up a three-node Scylla cluster which you’ll then use as a data storage backend for JanusGraph. The data storage layer for JanusGraph is pluggable. Some options for the data storage layer are Apache HBase, Google Cloud Bigtable, Oracle Berkeley DB Java Edition, Apache Cassandra, and ScyllaDB. The storage backend you use for your application is extremely important. By using ScyllaDB, some of the advantages you’ll get are low and consistent latency, high availability, up to x10 throughput, ease of use, and a highly scalable system.

A group at IBM compared using ScyllaDB as the JanusGraph storage backend vs. Apache Cassandra and HBase. They found that Scylla displayed nearly 35% higher throughput when inserting vertices than HBase and almost 3X Cassandra’s throughput. When inserting edges, Scylla’s throughput was 160% better than HBase and more than 4X that of Cassandra. In a query performance test, Scylla performed 72% better than Cassandra and nearly 150% better than HBase.

With graph data systems, just like with any data system, we can separate our workloads into two categories – transactional and analytical. JanusGraph follows the Apache TinkerPop project’s approach to graph computation. The Gremlin graph traversal language allows us to traverse a graph, traveling from vertex to vertex via the connecting edges. We can use the same approach for both OLTP and OLAP workloads.

Transactional workloads begin with a small number of vertices (found with the help of an index) and then traverse across a reasonably small number of edges and vertices to return a result or add a new graph element. We can describe these transactional workloads as graph local traversals. Our goal with these traversals is to minimize latency.

Analytical workloads require traversing a substantial portion of the vertices and edges in the graph to find our answer. Many classic analytical graph algorithms fit into this bucket. We can describe these as graph global traversals. Our goal with these traversals is to maximize throughput.

With our JanusGraph – Scylla graph data system, we can blend both capabilities. Backed by the high-IO performance of Scylla, we can achieve scalable, single-digit millisecond responses for transactional workloads. We can also leverage Spark to handle large-scale analytical workloads.

With the server running in the foreground in your previous terminal, open a new terminal window and connect to your VM like before:

ssh -i ~/Downloads/aws/guy-janus.pem ec2-user@ec2-3-21-28-125.us-east-2.compute.amazonaws.com

Using Docker, you’ll spin up a three-node Scylla cluster:

Start by setting up a Docker container with one node, called Node_X:

docker run --name Node_X -d scylladb/scylla:4.5.0 --smp 1 --overprovisioned 1 --memory 750M

Create two more nodes, Node_Y and Node_Z, and add them to the cluster of Node_X. The command “$(docker inspect –format='{{ .NetworkSettings.IPAddress }}’ Node_X)” translates to the IP address of Node_X:

docker run --name Node_Y -d scylladb/scylla:4.5.0 --seeds="$(docker inspect --format='{{ .NetworkSettings.IPAddress }}' Node_X)" --smp 1 --overprovisioned 1 --memory 750M

docker run --name Node_Z -d scylladb/scylla:4.5.0 --seeds="$(docker inspect --format='{{ .NetworkSettings.IPAddress }}' Node_X)" --smp 1 --overprovisioned 1 --memory 750M

Wait a minute or so and check the node status:

docker exec -it Node_Z nodetool status

Once all nodes are in up status (UN), continue to the next step.

Gremlin Console and Testing

Gremlin is a graph traversal query language developed by Apache TinkerPop. It works for OLTP and OLAP type traversals and supports multiple graph systems. JanusGraph, Neo4j, and Hadoop are some examples.

Run the Gremlin console, and link it to the already running server container:

docker run --rm --link janusgraph-default:janusgraph -e GREMLIN_REMOTE_HOSTS=janusgraph  -it janusgraph/janusgraph:latest ./bin/gremlin.sh

Now in the gremlin console, connect to the server:

:remote connect tinkerpop.server conf/remote.yaml

Next, set Scylla as the data storage and instantiate our JanusGraph object in the console with a JanusGraphFactory. Make sure to replace the IP in the command below with the IP of one of your Scylla cluster nodes:

JanusGraph graph = JanusGraphFactory.build().set("storage.backend", "cql").set("storage.hostname", "").open();
g = TinkerGraph.open().traversal()

Now we can perform some basic queries. Make sure the graph is empty to begin with:


Add two vertices

g.addV('person').property('name', 'guy')
g.addV('person').property('name', 'john')

And make sure they were added:



In this lesson, you learned how to deploy JanusGraph with Scylla as the underlying database. There are different ways to deploy JanusGraph and Scylla. To simplify the lesson, it uses Docker. The main steps were: Spinning up a virtual machine on AWS, running the JanusGraph server, running a Gremlin Console to connect to the new server, spinning up a three-node Scylla cluster, and setting it as the data storage for the JanusGraph server, and performing some basic graph operations.

You can learn more about using JanusGraph with Scylla from this real-world Cybersecurity use case. Another use case is Zeotap, which built a graph of twenty billion IDs on Scylla and JanusGraph, which they use for their customer intelligence platform. In terms of performance, a group at IBM compared using ScyllaDB as the JanusGraph storage backend vs. Apache Cassandra and HBase. They found that when inserting vertices, Scylla displayed nearly 35% higher throughput than HBase and almost 3X Cassandra’s throughput. Scylla’s throughput was 160% better than HBase and more than 4X that of Cassandra when inserting edges. Scylla performed 72% better than Cassandra in a query performance test and nearly 150% better than HBase.

These ScyllaDB benchmarks compare its performance to Cassandra, DynamoDB, Bigtable, and CockroachDB.


How to Build a Low-Latency IoT App with ScyllaDB

In this tutorial, you will learn how to build an IoT sensor simulator application that streams data to a ScyllaDB cluster hosted on ScyllaDB Cloud.

Low Latency Requirements

In general, latency refers to the time it takes for data to be processed, written, or read. Imagine a sensor sending data to a database connected to another server that reads the data through HTTP in an IoT use case.

There are two different classes of latency when you are designing an IoT application. The first is end-to-end latency, usually a full round trip time (RTT) from the IoT device itself, up into your application (and database), and then back to the IoT device itself. This often depends on your geographical (real-world) topology, and the kind of networking and computing hardware you are using.

Achieving lower end-to-end latency may require improving the form of network connectivity (say, shifting from 4G to 5G infrastructure), upgrading IoT sensor packages, hardware tuning, and other elements out of control of a backend app developer. So we’ll leave those aside for now.

The second are latencies within the application or even within the database itself. At ScyllaDB we are particularly concerned about optimizing the latter. How long does your database take to internally process updates or queries, especially when real-time IoT devices may be depending on the reply before responding to users, or changing their state or behavior?

Achieving lower latency within such an application requires decreasing processing and IO times across the server and the database behind it. Let’s go through that exercise and imagine a chocolate factory with thousands of devices measuring and recording fifty critical data points per second.

Does it Matter?

Not every performance gain is worth the investment. You’ll need to know your system’s requirements and constraints and ask yourself questions such as:

  • How often is data recorded?
  • How critical is it?
  • In case of a production failure, what’s the tolerated response time?

For example, my chocolate factory might not need real-time data. In this case, we can decrease the sampling rate — the measurement’s polling frequency. Do I need to check every device every second? Every five seconds? Every minute? Longer? By backing off the sampling rate, I may lose “real time” understanding, but may greatly decrease the amount of data ingestion. Or perhaps I can batch up all the local samples — the device might poll every second, but will only send a batch update every minute to the server, decreasing the number of operations by 60x, but increasing the size of the payload by the same rate. Thus, if I have moved the polling time to one minute, doing a lot of work to improve a sensors’ processing capability over thousands of devices for a 100µs gain might not be the highest priority.

That long polling time might be appropriate for, say, a machine health algorithm. But what if I need my IoT system for actual manufacturing device controls? In that case, the application may need millisecond-scale or even sub-millisecond response times. Or who knows how many chocolates I may ruin if the system takes a minute to respond to real-time changing conditions on the manufacturing floor!

You need to design a system that is scoped and scaled to the specific workload you have.

IoT and ScyllaDB, a Match Made in Heaven

ScyllaDB provides highly available data and sub-millisecond writes, meeting the requirement to design an IoT system at scale.

ScyllaDB’s shard-per-core architecture allows queries to connect directly not just to the right node, but to the exact CPU that handles the data. This approach minimizes intracluster hops, as well as CPU-to-CPU communication.

Sensor Simulator

To populate the database, we created a sensor app that generates random data and pushes it to the database, thus simulating the sensor’s activity. We created the simulator in three steps:

  1. Create random sensors and random measures
  2. Connect to ScyllaDB
  3. Insert batch measures

The app runs indefinitely, creates data every second, and then sends a batch query to the database every 10 seconds.


  1. NodeJS
  2. ScyllaDB Cloud account
  3. A cluster created on ScyllaDB Cloud

Create Random Sensor and Measure Data

Let’s start by creating a new NodeJS project with the following commands:

mkdir sensor
cd sensor
npm init -y
npm install cassandra-driver casual dotenv

Create an index.js file and run the following commands:

mkdir src
touch src/index.js

Add the following code in index.js

function main() {
// Generate random sensors

// Generate random measurements


In the package.json file, remove the “test” script and add the following:

"scripts": {

   "dev": "node src/index.js"

We can run the index.js primary function using npm run dev.

Add a new file named model.js in the src folder like this:

touch src/model.js

As the name suggests, we define our model in model.js. Let’s first define the Sensor class.

We use the casual package to generate random data. Let’s install with the following command:

npm I casual

Add the code below to model.js:

// model.js
const casual = require('casual');

const SENSOR_TYPES = ['Temperature', 'Pulse', 'Location', 'Respiration'];

casual.define('sensor', function () {
  return {
    sensor_id: casual.uuid,
    type: casual.random_element(SENSOR_TYPES),

class Sensor {
  constructor() {
    const { sensor_id, type } = casual.sensor;
    this.sensor_id = sensor_id;
    this.type = type;

module.exports = { Sensor };

The casual.sensor function creates a random sensor_id and type provided in the SENSOR_TYPES array a new owner and generates random data using the casual library.

Let’s test this out by adding the below code to index.js:

cconst casual = require('casual');
const { Sensor } = require('./model');

function main() {
// Generate random sensors
const sensors = new Array(casual.integer(1, 4))
  .map(() => new Sensor());

// Save sensors
for (let sensor of sensors) {
  console.log(`New sensor # ${sensor.sensor_id}`);
// Generate random measurements

The next step is to create a function runSensorData that generates random measures. For that, we need to define two constants:

  • INTERVAL: the interval between two measures taken by the sensor.
  • BUFFER_INTERVAL: the gap between two database insertions.
// ...
const INTERVAL = 1000;
const BUFFER_INTERVAL = 5000;

We also need to import the moment package:

const moment = require('moment');

The runSensorData function looks like the following:

function runSensorData(sensors) {
  while (true) {
    const measures = [];
    const last = moment();
    while (moment().diff(last) < BUFFER_INTERVAL) {
      await delay(INTERVAL);
        ...sensors.map((sensor) => {
          const measure = new Measure(
            `New measure: sensor_id: ${measure.sensor_id} type: ${sensor.type} value: ${measure.value}`
          return measure;

The code above creates a new measure using the Measure class and randomSensorData function and pushes it to the array.

The Measure class should look like this:

// model.js
class Measure {
  constructor(sensor_id, ts, value) {
    this.sensor_id = sensor_id;
    this.ts = ts;
    this.value = value;

We’ll create a new file helpers.js and add the randomSensorData and delay:

// helpers.js
const casual = require('casual');

function randomSensorData(sensor) {
  switch (sensor.type) {
    case 'Temperature':
      return 101 + casual.integer(0, 10);
    case 'Pulse':
      return 101 + casual.integer(0, 40);
    case 'Location':
      return 35 + casual.integer(0, 5);
    case 'Respiration':
      return 10 * casual.random;

function delay(time) {
  return new Promise((resolve) => setTimeout(resolve, time));

module.exports = { randomSensorData, delay };

Connect to the Database

You can create a free ScyllaDB Cloud account on https://cloud.scylladb.com. Follow this blog post to learn how to create a cluster.

Once the cluster is created, connect to your database using CQLSH:

docker run -it --rm --entrypoint cqlsh scylladb/scylla -u [USERNAME] -p [PASSWORD] [NODE_IP_ADDRESS]

Note that you will need to replace the USERNAME, PASSWORD, and NODE_IP_ADDRESS with your own.

Create a keyspace:

CREATE KEYSPACE iot WITH replication = {'class': 'NetworkTopologyStrategy', 'AWS_US_EAST_1' : 3} AND durable_writes = true;

Use iot

  sensor_id UUID,
  type TEXT,
  PRIMARY KEY (sensor_id)

  sensor_id UUID,
  ts    TIMESTAMP,
  value FLOAT,
  PRIMARY KEY (sensor_id, ts)
) WITH compaction = { 'class' : 'TimeWindowCompactionStrategy' };

Let’s create a getClient function that returns an Apache Cassandra client. NODE_1 represents the IP address of one of the nodes. ScyllaDB Cloud is a distributed database that runs with a minimum of three nodes.

const cassandra = require('cassandra-driver');


function getClient() {
  const client = new cassandra.Client({
    contactPoints: [NODE_1],
    authProvider: new cassandra.auth.PlainTextAuthProvider(USERNAME, PASSWORD),
    localDataCenter: DATA_CENTER,
    keyspace: KEYSPACE,

  return client;

function insertQuery(table) {
  const tableName = table.tableName;
  const values = table.columns.map(() => '?').join(', ');
  const fields = table.columns.join(', ');
  return `INSERT INTO ${tableName} (${fields}) VALUES (${values})`;

module.exports = { getClient, insertQuery };

We also added an insertQuery function that will help insert data to the tables. For that to work, we need to add to getters in the Sensor and Measure classes:

class Sensor {
// ...
  static get tableName() {
    return 'sensor';

  static get columns() {
    return ['sensor_id', 'type'];

class Measure {
// ...
  static get tableName() {
    return 'measurement';

  static get columns() {
    return ['sensor_id', 'ts', 'value'];

Let’s add getClient function to index.js, add the sensors to the database:

async function main() {
  // Create a session and connect to the database
  const client = getClient();
  // Generate random owner, pet and sensors
// ...
    .map(() => new Sensor());

  // Save sensors
  for (let sensor of sensors) {
    await client.execute(insertQuery(Sensor), sensor, { prepare: true });
    console.log(`New sensor # ${sensor.sensor_id}`);
  // Generate random measurements
  await runSensorData(client, sensors);

  return client;

Insert batch measures

We use batch to avoid inserting data after every measure. Instead, we push the measures to an array and then create a batch query that we execute using the client.batch function.

async function runSensorData(client, sensors) {
  while (true) {
    // ...

    console.log('Pushing data');

    const batch = measures.map((measure) => ({
      query: insertQuery(Measure),
      params: measure,

    await client.batch(batch, { prepare: true });


In this post, you saw how to create a sensor simulator that generates random data and sends it to the database. We created a ScyllaDB Cloud cluster and a keyspace, connected to it, then ran batch queries to insert the data.

I hope you find this post interesting. If you’d like to learn more about ScyllaDB, I highly recommend getting an account in ScyllaDB University. It’s completely free and all online. Let me know in the comments if you have any suggestions.



Shaving 40% Off Google’s B-Tree Implementation with Go Generics

There are many reasons to be excited about generics in Go. In this blog post I’m going to show how, using the generics, we got a 40% performance gain in an already well optimized package, the Google B-Tree implementation.

A B-Tree is a kind of self-balancing tree. For the purpose of this blog post it’s sufficient to say that it is a collection. You can add, remove, get or iterate over its elements. The Google B-Tree is well optimized, measures are taken to make sure memory consumption is correct. There is a benchmark for every exported method. The benchmark results show that there are zero allocations in the B-Tree code for all operations but cloning. Probably it would be hard to further optimize using traditional techniques.

ScyllaDB and the University of Warsaw

We have had a longstanding relationship with the Computer Science department at the University of Warsaw. You may remember some of our original projects, including those integrating Parquet, an async userspace filesystem, or a Kafka client for Seastar. Or more recent ones like a system for linear algebra in ScyllaDB or a design for a new Rust driver.

This work was also part of our ongoing partnership with the University of Warsaw.

Making Faster B-Trees with Generics

While working on a new Scylla Go Driver with students of University of Warsaw, we ported the B-tree code to generics. (If you’re not familiar with generics in Go, check out this tutorial.).

The initial result: the generics code is faster by 20 to 30 percent according to the Google benchmarks (link to issue we opened). Below a full benchmark comparison done with benchstat.

This is great but within those numbers there is a troubling detail. The zero allocations is not something that you would normally see given that the functions accept an interface as a parameter.

For the rest of the blog post we’ll focus on benchmarking the ReplaceOrInsert function responsible for ingesting data. Let’s consider a simplified benchmark.

The results show even greater improvement: 31% vs. 27%, and allocations drop from 1, in case of the interface based implementation, to 0 in the case of generics.

Let’s try to understand what happens here.

The Additional Allocation

The Google benchmarks operate on a B-tree of integers hidden by an Item interface. They use pre-generated random data in a slice. When an Item is passed to the ReplaceOrInsert function the underlying integer is already on the heap, technically we are copying a pointer. This is not the case when a plain integer needs to be converted to an Item interface — the parameter values start “escaping to heap”.

Go has a feature of deciding if a variable you initialized should live in the function’s stack or in the heap. Traditionally the compiler was very “conservative” and when it saw a function like func bind(v interface{}) anything you wanted to pass as v would have to go to heap first. This is referred to as variable escaping to the heap. Over the years the compiler has gotten smarter, and calls to local functions or functions in other packages in your project can be optimized, preventing the variables from escaping. You can check for yourself by running  go build -gcflags="-m" . in a Go package.

In the below example Go can figure out that it’s safe to take a pointer to the main functions stack.

As you can see the compiler informs us that variables do not escape to heap.

By changing the ToString implementation to

we see that the variables and literal values do start escaping.

In practical examples, when calling a function that accepts an interface as a parameter, the value almost always escapes to heap. When this happens it not only slows down the function call by the allocation, but also increases the GC pressure. Why is this important? The generics approach enables a truly zero allocation API, with zero GC pressure added as we will learn in the remainder of this blog post.

Why is it faster?

The B-tree, being a tree, consists of nodes. Each node holds a list of items.

When the Item is a pre-generics plain old interface, the value it holds must live separately somewhere on the heap. The compiler is not able to tell what is the size of an Item. From the runtime perspective an interface value is an unsafe pointer to data (word), a pointer to its type definition (typ), a pointer to interface definition (ityp); see definitions in the reflect package. It’s easier to digest than the runtime package. In that case we have items as a slice of int pointers.

On the other hand, with generic interface

and a generic type definition

items are a slice of ints — this reduces the number of small heap objects by a factor of 32.

Enough of theory, let’s try to examine a concrete usage. For the purpose of this blog I wrote a test program that is a scaled up version of my benchmark code.

We are adding 100 million integers, and the degree of the B-tree (number of items in a node) is 1k. There are two versions of this program: one uses generics, the other plain old interface. The difference in code is minimal, it’s literally changing btree.New(degree) to btree.New[btree.Int](degree) in line 13. Let’s compare data gathered by running both versions under `/usr/bin/time -l -p`.

generics interface delta
real 11.49 16.59 -30.74%
user 11.27 18.61 -39.44%
sys 0.24 0.6 -60.00%
maximum resident set size 2334212096 6306217984 -62.99%
average shared memory size 0 0
average unshared data size 0 0
average unshared stack size 0 0
page reclaims 142624 385306 -62.98%
page faults 0 0
swaps 0 0
block input operations 0 0
block output operations 0 0
messages sent 0 0
messages received 0 0
signals received 600 843 -28.83%
voluntary context switches 25 48 -47.92%
involuntary context switches 1652 2943 -43.87%
instructions retired 204760684966 288827272312 -29.11%
cycles elapsed 37046278867 60503503105 -38.77%
peak memory footprint 2334151872 6308147904 -63.00%
HeapObjects 236884 50255826 -99.53%
HeapAlloc 2226292560 6043893088 -63.16%

Here using generics solves a version of N+1 problem for slices of interfaces. Instead of one slice and N integers in heap we now can have just the slice of ints. The results are profound, the new code behaves better in every aspect. The wall time duration is down by 40%, context switches are down by 40%, system resources utilization is down by 60% — all thanks to a 99.53% reduction of small heap objects.

I’d like to conclude by taking a look at top CPU utilization.

go tool pprof -top cpu.pprof

generics interface
Type: cpu
Time: Apr 5, 2022 at 10:23am (CEST)
Duration: 11.61s, Total samples = 11.05s (95.18%)<
Showing nodes accounting for 10.77s, 97.47% of 11.05s total
Dropped 52 nodes (cum <= 0.06s)
flat  flat%   sum%        cum   cum%
4.96s 44.89% 44.89%      4.96s 44.89%  runtime.madvise
4.61s 41.72% 86.61%      4.61s 41.72%  runtime.memclrNoHeapPointers
0.64s  5.79% 92.40%      0.64s  5.79%  github.com/google/btree.items[...].find.func1
0.19s  1.72% 94.12%      0.83s  7.51%  sort.Search
0.08s  0.72% 94.84%      5.82s 52.67%  github.com/google/btree..insert
0.08s  0.72% 95.57%      0.08s  0.72%  runtime.mmap
0.07s  0.63% 96.20%      0.90s  8.14%  github.com/google/btree.items[...].find
0.05s  0.45% 96.65%      5.88s 53.21%  github.com/google/btree..ReplaceOrInsert
0.05s  0.45% 97.10%      4.19s 37.92%  github.com/google/btree..insertAt (inline)
0.04s  0.36% 97.47%      0.61s  5.52%  github.com/google/btree..maybeSplitChild
0     0% 97.47%      0.57s  5.16%  github.com/google/btree..split
Type: cpu
Time: Apr 5, 2022 at 10:31am (CEST)
Duration: 16.69s, Total samples = 18.65s (111.74%)
Showing nodes accounting for 17.94s, 96.19% of 18.65s total
Dropped 75 nodes (cum <= 0.09s)
flat  flat%   sum%        cum   cum%
9.53s 51.10% 51.10%      9.53s 51.10%  runtime.madvise
2.62s 14.05% 65.15%      2.62s 14.05%  runtime.memclrNoHeapPointers
1.09s  5.84% 70.99%      1.31s  7.02%  github.com/google/btree.items.find.func1
0.93s  4.99% 75.98%      2.73s 14.64%  runtime.scanobject
0.67s  3.59% 79.57%      0.67s  3.59%  runtime.heapBits.bits (inline)
0.44s  2.36% 81.93%      1.75s  9.38%  sort.Search
0.30s  1.61% 83.54%      0.30s  1.61%  runtime.markBits.isMarked (inline)
0.27s  1.45% 84.99%      2.03s 10.88%  github.com/google/btree.items.find
0.27s  1.45% 86.43%      3.35s 17.96%  runtime.mallocgc
0.26s  1.39% 87.83%      0.26s  1.39%  runtime.(*mspan).refillAllocCache
0.25s  1.34% 89.17%      0.60s  3.22%  runtime.greyobject
0.24s  1.29% 90.46%      0.26s  1.39%  runtime.heapBits.next (inline)
0.23s  1.23% 91.69%      0.23s  1.23%  github.com/google/btree.Int.Less
0.20s  1.07% 92.76%      0.20s  1.07%  runtime.memmove
0.20s  1.07% 93.83%      0.20s  1.07%  runtime.mmap
0.15s   0.8% 94.64%      2.47s 13.24%  github.com/google/btree.(*items).insertAt (inline)
0.12s  0.64% 95.28%      0.27s  1.45%  runtime.findObject
0.08s  0.43% 95.71%      5.44s 29.17%  github.com/google/btree.(*node).insert
0.03s  0.16% 95.87%      5.48s 29.38%  github.com/google/btree.(*BTree).ReplaceOrInsert
0.02s  0.11% 95.98%      0.84s  4.50%  github.com/google/btree.(*node).maybeSplitChild
0.02s  0.11% 96.09%      0.45s  2.41%  runtime.convT64
0.01s 0.054% 96.14%      9.83s 52.71%  runtime.(*mheap).allocSpan
0.01s 0.054% 96.19%      2.82s 15.12%  runtime.gcDrain
0     0% 96.19%      0.78s  4.18%  github.com/google/btree.(*node).split

You can literally see how messy the interface profile is, how gc starts kicking in killing it… It’s even more evident when we focus on gc.

go tool pprof -focus gc -top cpu.pprof

generics interface
Type: cpu
Time: Apr 5, 2022 at 10:23am (CEST)
Duration: 11.61s, Total samples = 11.05s (95.18%)
Active filters:
Showing nodes accounting for 0.29s, 2.62% of 11.05s total
flat  flat%   sum%        cum   cum%
0.19s  1.72%  1.72%      0.19s  1.72%  runtime.memclrNoHeapPointers
0.02s  0.18%  1.90%      0.02s  0.18%  runtime.(*mspan).refillAllocCache
0.01s  0.09%  1.99%      0.02s  0.18%  runtime.(*fixalloc).alloc
0.01s  0.09%  2.08%      0.01s  0.09%  runtime.(*mheap).allocNeedsZero
0.01s  0.09%  2.17%      0.01s  0.09%  runtime.(*mspan).init (inline)
0.01s  0.09%  2.26%      0.01s  0.09%  runtime.heapBits.bits (inline)
0.01s  0.09%  2.35%      0.01s  0.09%  runtime.markrootSpans
0.01s  0.09%  2.44%      0.01s  0.09%  runtime.recordspan
0.01s  0.09%  2.53%      0.02s  0.18%  runtime.scanobject
0.01s  0.09%  2.62%      0.01s  0.09%  runtime.stkbucket
Type: cpu
Time: Apr 5, 2022 at 10:31am (CEST)
Duration: 16.69s, Total samples = 18.65s (111.74%)
Active filters:
Showing nodes accounting for 6.06s, 32.49% of 18.65s total
Dropped 27 nodes (cum <= 0.09s)
flat  flat%   sum%        cum   cum%
2.62s 14.05% 14.05%      2.62s 14.05%  runtime.memclrNoHeapPointers
0.93s  4.99% 19.03%      2.73s 14.64%  runtime.scanobject
0.67s  3.59% 22.63%      0.67s  3.59%  runtime.heapBits.bits (inline)
0.30s  1.61% 24.24%      0.30s  1.61%  runtime.markBits.isMarked (inline)
0.27s  1.45% 25.68%      3.35s 17.96%  runtime.mallocgc
0.26s  1.39% 27.08%      0.26s  1.39%  runtime.(*mspan).refillAllocCache
0.25s  1.34% 28.42%      0.60s  3.22%  runtime.greyobject
0.24s  1.29% 29.71%      0.26s  1.39%  runtime.heapBits.next (inline)
0.12s  0.64% 30.35%      0.27s  1.45%  runtime.findObject
0.08s  0.43% 30.78%      0.08s  0.43%  runtime.spanOf (inline)
0.06s  0.32% 31.10%      0.06s  0.32%  runtime.(*mspan).base (inline)
0.06s  0.32% 31.42%      0.06s  0.32%  runtime.(*mspan).init (inline)
0.06s  0.32% 31.74%      0.06s  0.32%  runtime.heapBitsSetType
0.04s  0.21% 31.96%      0.04s  0.21%  runtime.(*mSpanStateBox).get (inline)
0.04s  0.21% 32.17%      0.04s  0.21%  runtime.pthread_kill
0.04s  0.21% 32.39%      0.04s  0.21%  runtime.usleep
0.01s 0.054% 32.44%      0.10s  0.54%  runtime.(*mheap).allocSpan
0.01s 0.054% 32.49%      2.82s 15.12%  runtime.gcDrain

The generic version spent 0.29s (2.62%) in GC while the interface version spent 6.06s accounting for, hold your breath, 32.49% of the total time!

Generics: CPU profile flame focused on GC related function


Interface: CPU profile flame focused on GC related functions


By shifting the implementation from one using interfaces, to one using generics, we were able to significantly improve performance, minimize garbage collection time, and minimize CPU and other resource utilization, such as heap size. Particularly with heap size, we were able to reduce HeapObjects by 99.53%.

The future of Go generics is bright especially in the domain of slices.

Want to be a ScyllaDB Monster?

We’re definitely proud of the work we do with the students at the University of Warsaw. Yet ScyllaDB is a growing company with a talented workforce drawn from all over the world. If you enjoy writing high performance generic Go code, come join us. Or if you specialize in other languages or talents, check out our full list of careers at ScyllaDB: