Medusa 0.16 was released

The k8ssandra team is happy to announce the release of Medusa for Apache Cassandra™ v0.16. This is a special release, as we did a major overhaul of the storage classes implementation. We now have less code (and fewer dependencies) while providing much faster and more resilient storage communications.

Back to ~basics~ official SDKs

Medusa has been an open source project for about 4 years now, and a private one for a few more. Over such a long time, other software it depends upon (or doesn’t) evolves as well. More specifically, the SDKs of the major cloud providers evolved a lot. We decided to check if we could replace a lot of custom code doing asynchronous parallelisation and calls to the cloud storage CLI utilities with the official SDKs.

Our storage classes so far relied on two different ways of interacting with the object storage backends:

  • Apache Libcloud, which provided a Python API for abstracting ourselves from the different protocols. It was convenient and fast for uploading a lot of small files, but very inefficient for large transfers.
  • Cloud vendor CLIs, which were much more efficient with large file transfers, but had to be invoked through subprocesses. This created an overhead that made them inefficient for small file transfers. Relying on subprocesses also made the implementation much more brittle, which led the community to open a lot of issues we’ve been struggling to fix.

To cut a long story short, we did it!

  • We started by looking at S3, where we went for the official boto3. As it turns out, boto3 does all the chunking, throttling, retries and parallelisation for us. Yay!
  • Next we looked at GCP. Here we went with TalkIQ’s gcloud-aio-storage. It works very well for everything, including the large files. The only thing missing is the throughput throttling.
  • Finally, we used Azure’s official SDK to cover Azure compatibility. Sadly, this still works without throttling as well.

Right after finishing these replacements, we spotted the following improvements:

  • The integration tests duration against the storage backends dropped from ~45 min to ~15 min.
    • This means Medusa became far more efficient.
    • There is now much less time spent managing storage interaction thanks to it being asynchronous to the core.
  • The Medusa uncompressed image size we bundle into k8ssandra dropped from ~2GB to ~600MB and its build time went from 2 hours to about 15 minutes.
    • Aside from giving us much faster feedback loops when working on k8ssandra, this should help k8ssandra itself move a little bit faster.
  • The file transfers are now much faster.
    • We observed up to several hundreds of MB/s per node when moving data from a VM to blob storage within the same provider. The available network speed is the limit now.
    • We are also aware that consuming the whole network throughput is not great. That’s why we now have proper throttling for S3 and are working on a solution for this in other backends too.
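To illustrate the kind of asynchronous, parallel transfer the SDKs now handle for us, here is a minimal sketch using only the Python standard library. It is not Medusa’s actual code: `fake_upload`, `upload_all`, `run_demo` and the concurrency limit of 4 are hypothetical names and values chosen for illustration, and the upload itself is simulated with a short sleep rather than real network I/O.

```python
import asyncio
from typing import Dict


async def fake_upload(name: str, size_bytes: int) -> int:
    """Hypothetical stand-in for an SDK upload call; sleeps instead of doing I/O."""
    await asyncio.sleep(0.01)
    return size_bytes


async def upload_all(files: Dict[str, int], max_concurrency: int = 4) -> int:
    """Upload every file concurrently, with at most max_concurrency in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(name: str, size_bytes: int) -> int:
        # The semaphore bounds how many uploads run at the same time.
        async with semaphore:
            return await fake_upload(name, size_bytes)

    results = await asyncio.gather(*(guarded(n, s) for n, s in files.items()))
    return sum(results)


def run_demo() -> int:
    """Upload ten fake files of 1 KiB, 2 KiB, ... 10 KiB and return total bytes."""
    files = {f"sstable-{i}.db": 1024 * (i + 1) for i in range(10)}
    return asyncio.run(upload_all(files))
```

Bounding concurrency with a semaphore keeps many small transfers in flight without launching a subprocess per file, which is the overhead the CLI-based approach suffered from.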

The only compromise we had to make was to drop Python 3.6 support, because the asyncio features we rely on are only available from Python 3.7 onwards.

The other good stuff

Even though we are happiest about the storage backends, there are a number of other changes that should not go without mention:

  • We fixed a bug with hierarchical storage containers in Azure. This flavor of blob storage works more like a regular file system, meaning it has a concept of directories. None of the other backends do this (including the vanilla Azure ones), and Medusa was not dealing gracefully with this.
  • We are now able to build Medusa images for multiple architectures, including the arm64 one.
  • Medusa can now purge backups of nodes that have been decommissioned, meaning they are no longer present in the most recent backups. Use the new medusa purge-decommissioned command to trigger such a purge.

Upgrade now

We encourage all Medusa users to upgrade to version 0.16 to benefit from all these storage improvements, which make it much faster and more reliable.

Medusa v0.16 is the default version in the newly released k8ssandra-operator v1.9.0, and it can be used with previous releases by setting the .spec.medusa.containerImage.tag field in your K8ssandraCluster manifests.

Introducing the New JSON API for Astra DB: Develop AI Applications in JavaScript with Ease

One of our goals at DataStax is to enable every developer—regardless of the language they build in—to deliver AI applications to production as fast as possible.  We recently added vector search capabilities to DataStax Astra DB, our database-as-a-service built on open source Apache Cassandra, to...

5 More Intriguing ScyllaDB Capabilities You Might Have Overlooked

You asked for it! Here are another 5 unique ScyllaDB capabilities for power users!

We were very happy (and grateful!) to receive TONS of positive feedback when 5 Intriguing ScyllaDB Capabilities You Might Have Overlooked was published. Some readers were inspired to try out cool features they weren’t previously aware of. Others asked why some specific capability wasn’t covered. And others just wanted to discover more of these “hidden gems.” 🙂

When we first decided to write a “5 intriguing capabilities” blog, we simply featured the first 5 capabilities that came to mind. It was by no means a “top 5” ranking. Of course, there are many more little-known capabilities that are providing value to teams just like yours. Here’s another round of cool features you might not know about.

Load and Stream

The nodetool refresh command has several purposes: migrate Apache Cassandra SSTables to ScyllaDB, restore a cluster via ScyllaDB Manager, mirror a ScyllaDB cluster to another, and even rollback to a previous release when things go wrong. In a nutshell, a refresh simply loads newly placed SSTables into the system.

One of the problems with nodetool refresh is that, by default, it doesn’t account for the actual ownership of the data. That means that when you try to restore data to a cluster with different topology, replication settings, or even a different token ownership, you would need to load your data multiple times (since every refresh would discard the data not owned by the receiving node).

For example, assume you are refreshing data from a 3-node cluster to a single-node cluster. Also assume that the replication factor is 2. The correct way to move data around would be to copy the SSTables from all 3 source nodes, and – one at a time – refresh them at the destination. Failure to do so could potentially result in data being lost during the process, as we can’t be sure that the data from all 3 source nodes is in sync.

It gets worse if your destination is a 3-node cluster with different token ownership. Instead of copying and loading the data 3 times (as in the previous example), you’d have to run it 9 times in total (3 source nodes x 3 destination nodes) to properly migrate with no data loss. See the problem now?

That’s what Load and Stream addresses. The concept is really simple: Rather than ignoring and dropping the data the node is not responsible for, just stream it over to its rightful owners! This simple change greatly speeds up and simplifies the process of moving data around, and is much less error prone.

We introduced Load and Stream in ScyllaDB 5.1 and extended the nodetool refresh command syntax. Note that the default behavior remains, so you must adjust the command syntax accordingly to tell your refresh command to also stream data.
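As a rough sketch of what the adjusted invocation can look like, the helper below builds the argument list for a refresh with or without streaming. It assumes the `--load-and-stream` flag of ScyllaDB’s nodetool refresh; verify it against the documentation for your version. The function name `refresh_command` and the keyspace and table names are hypothetical.

```python
from typing import List


def refresh_command(keyspace: str, table: str, load_and_stream: bool = False) -> List[str]:
    """Build a nodetool refresh command line as an argv list."""
    cmd = ["nodetool", "refresh", keyspace, table]
    if load_and_stream:
        # Assumed flag: stream data the node does not own to its owners
        # instead of dropping it (the default behavior).
        cmd.append("--load-and-stream")
    return cmd
```

The resulting list can be passed to `subprocess.run(..., check=True)` on a node where the SSTables have already been placed in the table’s upload directory.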

Timeout per Operation

Sometimes you know beforehand that you are about to run a query that may take considerably longer than your average. Perhaps you are running an aggregation, a full table scan, or maybe your workload simply has distinct SLAs for different types of queries.

One Apache Cassandra limitation is that timeouts are fixed on the server side, which makes it extremely inflexible for processing queries that are known to take longer than usual. By default, the coordinator times out a write taking longer than 2 seconds, and a read taking longer than 5 seconds.

Although the defaults are a reasonable amount of time for writes and reads, it is worth noting that those defaults are unacceptable for many low latency use cases. Therefore, many use cases often configure stricter server-side timeout settings, and even configure context timeouts within their applications in order to prevent requests from taking too long.

In Apache Cassandra, lowering your server-side timeout settings effectively forbids you from running any kinds of queries past your configured settings. And increasing it too much makes you subject to retry storms. We needed a solution: How to prevent long-running queries from timing out while avoiding retry storms and complying with aggressive P99 latency requirements?

The answer is USING TIMEOUT, a ScyllaDB CQL extension that allows you to dynamically adjust – on a per query basis – the coordinator timeout. This allows you to notify ScyllaDB that your query should not stick with the server defaults and then the coordinator will use your provided value instead.

We introduced the USING TIMEOUT extension in ScyllaDB 4.4, and it is extremely useful when applied in conjunction with the BYPASS CACHE option discussed in the first installment of this blog.
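A small sketch of how a client might attach these clauses to a query string before sending it through a driver. The helper name `with_timeout` and the example table are hypothetical; the clause order (BYPASS CACHE before USING TIMEOUT) and the duration format (such as `120s`) follow the ScyllaDB CQL grammar as we understand it, so check the documentation for your release.

```python
def with_timeout(select_cql: str, timeout: str, bypass_cache: bool = False) -> str:
    """Append ScyllaDB's BYPASS CACHE and USING TIMEOUT clauses to a SELECT."""
    # Drop trailing whitespace and an optional terminating semicolon.
    statement = select_cql.rstrip().rstrip(";")
    if bypass_cache:
        statement += " BYPASS CACHE"
    return f"{statement} USING TIMEOUT {timeout};"
```

For example, `with_timeout("SELECT count(*) FROM ks.events;", "120s", bypass_cache=True)` yields a full scan that neither pollutes the cache nor is bound by the server default timeout.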

Live Updateable Parameters

Let’s be honest: No one likes to run a cluster rolling restart whenever you have to change configuration parameters. The larger your deployment gets, the more nodes you will have to restart – and the longer it takes for a parameter to become active cluster-wide.

ScyllaDB solves this problem with live updateable parameters: following a Unix-like convention, you signal the node to re-read its configuration file (/etc/scylla/scylla.yaml) and refresh its configuration.

The ability for ScyllaDB to reload its runtime configuration was introduced 4 years ago and, over time, we have greatly extended its capabilities to support more features. We know it’s definitely a challenge to find out on which release every single option got introduced. To help, here’s a list of (currently) supported parameters that can get changed at database runtime. Consult the output of scylla --help for an explanation of each:

To live update a parameter, simply update its value in ScyllaDB’s configuration, and SIGHUP the ScyllaDB PID. For example:

# echo 'compaction_static_shares: 100' >> /etc/scylla/scylla.yaml
# kill -HUP $(pidof scylla)

And you should see in the logs:

INFO 2023-08-07 21:55:34,515 [shard 0] init - re-reading configuration file
INFO 2023-08-07 21:55:34,524 [shard 0] compaction_manager - Updating static shares to 100
INFO 2023-08-07 21:55:34,526 [shard 0] init - completed re-reading configuration file


Guardrails

A ScyllaDB deployment is very flexible. You can create single-node clusters, use a replication factor of 1, and even write data using a timestamp in the far future to be deleted by the time robots take over human civilization. We won’t prevent you from choosing whatever setting makes sense for your deployment, as we believe you know what you are doing. 🙂

Guardrails are settings made to prevent you (or others) from shooting yourselves in the foot. Just like the live updateable parameters we’ve just seen, they are part of a continuous process to give the database administrator more control over which restrictions make sense for a given deployment.

Some guardrails that are enabled by default restrict, for example, the use of the DateTiered compaction strategy (which has long been deprecated in favor of the Time Window Compaction Strategy) and of the deprecated Thrift API.

The easiest way to find out which Guardrails parameters are available for use is to simply filter by the guardrails label in our GitHub project. There, you will find all options you might be interested in, as well as which ones our Engineering team is planning to implement in the foreseeable future.

Compaction & Streaming Rate-Limiting

We previously discussed how ScyllaDB achieves performance isolation of its components and self-tunes to the workload in question. This means that under normal and ideal temperature and pressure conditions, you ideally should not have to bother with throttling background activities, such as compaction or streaming.

However, there may be some situations where you still want to throttle how much I/O each class is allowed to consume – either for a specific period of time or simply because you suspect a scheduler problem may be affecting your latencies.

ScyllaDB 5.1 introduced two new live updateable parameters to address these situations:

  • compaction_throughput_mb_per_sec – Throttles compaction to the specified total throughput across the entire system
  • stream_io_throughput_mb_per_sec – Throttles streaming I/O (including repairs, repair-based node operations and legacy topology operations)

Note that it is not recommended to change these values from their defaults (0 – disable any throttling), as ScyllaDB should dynamically be able to choose the best bandwidth to consume according to your system load.

P99 CONF 23 Agenda: Take a Peek

A first look at the agenda for P99 CONF, the technical conference for engineers who obsess over high-performance, low-latency applications.

The wait is over. Speed on over to the P99 CONF agenda page for a first look at 50+ low-latency tech talks jam-packed across two half days. As always, the open-source-focused conference is free, fully virtual, and highly interactive.

See the agenda – and save your spot

Expanding the P99 CONF Core

P99 CONF began as a wild idea in 2021. With the goal of hosting a highly technical conference on low-latency engineering strategies, we reached out to our friends across the ScyllaDB and kernel engineering communities. The response was overwhelming. With two half days of tech talks on low-latency Rust, data systems, and infrastructure, the speakers rose to the challenge – and the 5K attendees loved it.

As word got out, we doubled the number of tech talks and attendees for P99 CONF 2022. With keynotes from Gil Tene, Liz Rice, Charity Majors, Armin Ronacher, Dor Laor, Avi Kivity, and Bryan Cantrill, we touched on 99th percentlies (not a typo), eBPF, observability, event processing, the primacy of toolmaking, and a look inside (quite literally) high performance databases. We also added a second agenda track, so we could feature even more talks on “all things performance,” including sessions on Go, Kubernetes, and front-end performance optimization.

This year, join ~15K of your peers for 50+ talks across 3 parallel tracks, as well as an Instant Access library that allows you to binge-watch sessions throughout the 2 days – including some extensive profiling training sessions and other deep dives you won’t want to miss. Some new topics on the agenda for 2023 include Zig, low-latency C++, NATS, edge, and AI/ML. And then there are the speakers. Top speakers from both P99 CONF 2021 and 2022 are returning with new talks they’ve been saving for the P99 CONF community. And we hope you share our excitement about the host of new speakers we’re featuring this year. We’ve been trying to get many of them on the P99 CONF virtual stage since 2021. This year, the planets finally aligned.

Do take a few minutes to look at the agenda. You’ll find an all-star lineup including fresh perspectives from a few companies you might have heard of – for example, Netflix, Google, Meta, Lyft, Uber, TikTok, Square, Bloomberg, Postman, RazorPay, ShareChat, Jetbrains, Grafana, Microsoft, and Red Hat.

Can’t Miss Keynotes

As you can see in the agenda, each day is anchored around multiple keynote blocks. Having access to all these brilliant minds at a single event is an unparalleled opportunity, and we hope you’ll be there on October 18th and 19th to experience it, chat with them, and see what happens when they all interact.

Here are some inside scoops on a few of the keynotes:

  • Gwen Shapira: Performance engineering approaches that are effective at the “enterprise” level (e.g., Oracle, Cloudera, Confluent) just don’t translate well to startups. Gwen will be sharing stories and lessons learned about solving performance challenges at a startup, when everything is needed ASAP but time and resources are constantly stretched thin. Her keynote covers the decisions that worked well on her personal journey from specialist to startup co-founder, the mistakes made, and the tools that were worth far beyond their cost.
  • Bryan Cantrill: About a decade ago, Bryan delivered a passionate talk on corporate OSS antipatterns and promised to revisit the topic in a decade to explore any new mistakes that happened to emerge. Well, recent developments on the “corporate OSS mistakes” front reminded him about this promise … and the moment is ripe. Bryan is eager to address this, and he’s saving the topic for the P99 CONF community. We’re expecting a rather lively chat for this one!
  • Jon Haddad: For over 20 years, Jon has been responsible for the continued health and performance of some of the most massive distributed systems on the planet (Netflix, Apple…). As you can imagine, this inflicted a fair share of battle scars and resulted in quite a library of lessons learned. Jon will be sharing his pro tips, including the tools and processes he applies and the best way to spend your time for the biggest impact.
  • Dor Laor: Most software isn’t architected to take advantage of modern hardware. How does a shard-per-core and shared-nothing architecture help – and exactly what impact can it make? Dor will examine technical opportunities and tradeoffs, as well as disclose the results of a new benchmark study.
  • Avi Kivity: Nobody is infallible. Not all architectural decisions stand the test of time. So, what happens when you realize you need to change your architecture? How do you minimize the impact on performance when making substantial changes to critical software? Avi will take you through this journey, exploring how ScyllaDB evolved its architecture to take advantage of tablets.
  • Paul McKenney: Paul, Dor, and Avi go back to the KVM hypervisor days. While Dor and Avi have since transformed into sea monsters, Paul is now on the Meta kernel team. His keynote focuses on the Linux Kernel Memory Model (LKMM), a powerful tool for developing highly concurrent Linux Kernel code. Not all that familiar with LKMM? Perfect. Paul’s goal is to help you get the most out of it, without the otherwise steep learning curve.

O’Reilly as a Media Sponsor

As we were building out the agenda, we noticed that our speaker lineup included many authors of books hosted on the O’Reilly platform (which includes technical books by Apress, Manning, Pragmatic Bookshelf, and more in addition to the ones sporting the distinctive O’Reilly animal covers). One of our speakers (thanks Cary Millsap!) made a connection, we chatted a bit, and now we’re honored to have O’Reilly on board as a P99 CONF media sponsor.

What does this mean for you?

  • 30-day access to the complete O’Reilly platform
  • Hard-copy book bundle giveaways throughout the event
  • Sneak peeks at what over a dozen top technical authors are working on next

Register Now

NoSQL Data Modeling Mistakes that Hurt Performance

See how NoSQL data modeling anti-patterns actually manifest themselves in practice, and how to diagnose/eliminate them.

In NoSQL Data Modeling Guidelines for Avoiding Performance Issues, we explored general data modeling guidelines that have helped many teams optimize the performance of their distributed databases. We introduced what imbalances are, and covered some well-known imbalance types – such as uneven data distribution and uneven access patterns.

Let’s now dive deeper, and understand how some anti-patterns actually manifest themselves in practice. We will use ScyllaDB throughout to provide concrete examples, although the majority of the strategies and challenges are also applicable to other NoSQL databases.

To help you follow along, we have created a GitHub repository. It includes sample code that can be used to observe and reproduce most of these scenarios, plus build additional (and more complex) ones.

As you play along and reproduce hot and large partitions, large collections, large tombstone runs, etc., be sure to observe the results along with the ScyllaDB Monitoring Stack in order to understand how the anti-pattern may affect your cluster stability in practice.

Not Addressing Large Partitions

Large partitions commonly emerge as teams scale their distributed databases. Large partitions are partitions that grow too big, up to the point when they start introducing performance problems across the cluster’s replicas.

One of the questions that we hear often – at least once a month – is what constitutes a large partition. Well, it depends. Some things to consider:

  • Latency expectations:  The larger your partition grows, the longer it will take to be retrieved. Consider your page size and the number of client-server round trips needed to fully scan a partition.
  • Average payload size:  Larger payloads generally lead to higher latency. They require more server-side processing time for serialization and deserialization, and also incur a higher network data transmission overhead.
  • Workload needs: Some workloads organically require larger payloads than others. For instance, I’ve worked with a web3 blockchain company that would store several transactions as BLOBs under a single key, and every key could easily get past 1MB in size.
  • How you read from these partitions:  For example, a time-series use case will typically have a timestamp clustering component. In that case, reading from a specific time-window will retrieve much less data than if you were to scan the entire partition.

The following table illustrates the impact of large partitions under different payload sizes, such as 1, 2 and 4KB.

As you can see, the higher your payload gets under the same row count, the larger your partition is going to be. However, if your use case frequently requires scanning partitions as a whole, then be aware that databases have limits to prevent unbounded memory consumption. For example, ScyllaDB cuts off pages at every 1MB to prevent the system from potentially running out of memory. Other databases (even relational ones) have similar protection mechanisms to prevent an unbounded bad query from starving the database resources.  To retrieve a payload size of 4KB and 10K rows with ScyllaDB, you would need to retrieve at least 40 pages to scan the partition with a single query. This may not seem a big deal at first. However, as you scale over time,  it could impact your overall client-side tail latency.
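The page arithmetic above can be reproduced with a few lines of Python. This is only a back-of-the-envelope sketch: it assumes every row carries exactly the stated payload and ignores per-row overhead, and the names `PAGE_LIMIT_BYTES` and `pages_to_scan` are made up for illustration.

```python
PAGE_LIMIT_BYTES = 1024 * 1024  # the 1MB page cut-off described above


def pages_to_scan(payload_bytes: int, rows: int, page_limit: int = PAGE_LIMIT_BYTES) -> int:
    """Minimum number of pages needed to read a whole partition."""
    partition_bytes = payload_bytes * rows
    # Ceiling division: a partially filled last page still costs a round trip.
    return -(-partition_bytes // page_limit)
```

With 10K rows, payloads of 1KB, 2KB and 4KB need at least 10, 20 and 40 pages respectively, and each page is one more client-server round trip.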

Another consideration: with databases like ScyllaDB and Cassandra, data written to the database is stored in the commitlog and under an in-memory data structure called a “memtable.”

The commitlog is a write-ahead log that is never really read from, except when there’s a server crash or a service interruption. Since the memtable lives in memory, it eventually gets full. In order to free up memory space, the database flushes memtables to disk. That process results in SSTables (which is how your data gets persisted).

What does all this have to do with large partitions? Well, SSTables have specific components that need to be held in memory when the database starts. This ensures that reads are always efficient and minimizes wasting storage disk I/O when looking for data. When you have extremely large partitions (for example, we recently had a user with a 2.5TB partition in ScyllaDB), these SSTable components introduce heavy memory pressure, therefore shrinking the database’s room for caching and further constraining your latencies.

Introducing Hot Spots

Hot spots can be a side effect of large partitions. If you have a large partition (storing a large portion of your data set), it’s quite likely that your application access patterns will hit that partition more frequently than others. In that case, it also becomes a hot spot.

Hot spots occur whenever a problematic data access pattern causes an imbalance in how data is accessed in your cluster. One culprit: when the application doesn’t impose any limits on the client side and allows tenants to potentially spam a given key. For example, think about bots in a messaging app frequently spamming messages in a channel. Hot spots could also be introduced by erratic client-side configurations in the form of retry storms. That is, a client attempts to query specific data, times out before the database does, and retries the query while the database is still processing the previous one.

Monitoring dashboards should make it simple for you to find hot spots in your cluster. For example, this dashboard shows that shard 20 is overwhelmed with reads.

For another example, the following graph shows 3 shards with higher utilization, which correlates to the replication factor of 3 configured for the keyspace in question.

Here, shard 7 introduces a much higher load due to the spamming.

How do you address hot spots such as hot partitions? With ScyllaDB, you can use the nodetool toppartitions command to gather a sample of the partitions being hit over a period of time on the affected nodes.  You can also use tracing, such as probabilistic tracing, to analyze which queries are hitting which shards and then act from there.

Misusing Collections

Teams don’t always use collections, but when they do … they often use them incorrectly. Collections are meant for storing/denormalizing a relatively small amount of data. They’re essentially stored in a single cell, which can make serialization/deserialization extremely expensive.

When you use collections, you can define whether the field in question is frozen or non-frozen. A frozen collection can only be written as a whole; you can not append or remove elements from it. A non-frozen collection can be appended to, and that’s exactly the type of collection that people most misuse. To make things worse, you can even have nested collections, such as a map which contains another map, which includes a list, and so on.

Misused collections will introduce performance problems much sooner than large partitions, for example. If you care about performance, collections can’t be very large at all. For example, if we create a simple key:value table, where our key is a sensor_id and our value a collection of samples recorded over time, as soon as we start ingesting data our performance is going to be suboptimal.

        -- {table} is a template placeholder for the table name
        CREATE TABLE IF NOT EXISTS {table} (
                sensor_id uuid PRIMARY KEY,
                events map<timestamp, FROZEN<map<text, int>>>
        );

The following monitoring snapshots show what happens when you try to append several items to a collection at once. 

We can see that while the throughput decreases, the p99 latency increases. Why does this occur?

  • Collection cells are stored in memory as sorted vectors
  • Adding elements requires a merge of two collections (old & new)
  • Adding an element has a cost proportional to the size of the entire collection
  • Trees (instead of vectors) would improve the performance, BUT…
  • Trees would make small collections less efficient!
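To make the cost argument concrete, here is a simplified model of appending to a collection stored as a sorted vector. It is not ScyllaDB’s implementation: `merge_sorted` and `total_copies` are hypothetical helpers that only count how many elements get copied when every append merges the old collection with the new element.

```python
from typing import List, Tuple


def merge_sorted(old: List[int], new: List[int]) -> Tuple[List[int], int]:
    """Merge two sorted vectors into a fresh one; return it plus elements copied."""
    merged: List[int] = []
    i = j = 0
    while i < len(old) and j < len(new):
        if old[i] <= new[j]:
            merged.append(old[i])
            i += 1
        else:
            merged.append(new[j])
            j += 1
    merged.extend(old[i:])
    merged.extend(new[j:])
    return merged, len(merged)


def total_copies(appends: int) -> int:
    """Elements copied in total when appending one item at a time."""
    collection: List[int] = []
    copied = 0
    for ts in range(appends):
        collection, cost = merge_sorted(collection, [ts])
        copied += cost
    return copied
```

In this model, n single-element appends copy n(n+1)/2 elements in total, so 1,000 appends already move about half a million elements. That quadratic growth is why throughput drops and p99 latency climbs as the collection gets larger.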

Returning to that same example, the solution would be to move the timestamp to a clustering key and transform the map into a FROZEN collection (since you no longer need to append data to it). These very simple changes will greatly improve the performance of the use case.

        CREATE TABLE IF NOT EXISTS {table} (
                sensor_id uuid,
                record_time timestamp,
                events FROZEN<map<text, int>>,
                PRIMARY KEY (sensor_id, record_time)
        );

Using Low Cardinality Indexes & Views

Creating a low cardinality view is in fact a nice way to win the lottery and introduce hotspots, data imbalances and large partitions all at the same time. The table below shows approximately how many large partitions you will end up creating depending on the type of data you chose as either a base table key or as a view key:

A boolean column has 2 possible values, which means that you will end up with 2 large partitions (100%). If you decide to restrict by country, then there are a total of 195 countries in the world. However, countries such as the US, China, and India are likely to become large partitions. And if we decide to select from a field such as “Status” (which might include active, inactive, suspended, etc.) it is likely that a good portion of the data set will result in large partitions.
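The effect of key cardinality can be seen with a quick simulation. The sketch below groups a synthetic data set by different candidate keys and reports how many partitions result and how much of the data lands in the largest one; the helper names and the 1,000-row data set are invented for illustration.

```python
from collections import Counter
from typing import Dict, Iterable


def partition_sizes(keys: Iterable[str]) -> Dict[str, int]:
    """Count how many rows end up under each partition key value."""
    return dict(Counter(keys))


def largest_partition_share(keys: Iterable[str]) -> float:
    """Fraction of all rows stored in the single largest partition."""
    sizes = partition_sizes(keys)
    return max(sizes.values()) / sum(sizes.values())


# 1,000 synthetic rows keyed two different ways.
by_boolean = ["true" if i % 2 == 0 else "false" for i in range(1000)]
by_user_id = [f"user-{i}" for i in range(1000)]
```

Keying by the boolean gives 2 partitions holding 50% of the rows each, while keying by a unique user ID gives 1,000 partitions of one row each. Real data is rarely this evenly split, which usually makes the low cardinality case even worse.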

Misuse of Tombstones (for LSM-Tree-Based Databases)

When you delete a record in a database with a Log-Structured Merge tree storage engine, then you need to know that your deletes are actually writes. Tombstones are a write-marker that tells the database that your data should be eventually deleted from disk.

On top of that, there are different types of tombstones (since when running DELETE operations, there are several ways to delete data):

  • Cell-level tombstones, which delete single columns
  • Range tombstones, which delete a whole range of information (such as all records since last week)
  • Row tombstones, which delete an entire row
  • Partition tombstones, which delete an entire partition

When deleting data – especially if you have a delete-heavy use case – always prefer to delete partitions entirely because that’s much more performant from the database perspective. If you delete a lot and create many tombstones, you may end up with elevated latencies on your read path.

In order to understand why tombstones may slow down your read path, let’s revisit the ScyllaDB and Cassandra write path:

  1. Data is written to an in-memory data structure, known as the memtable.
  2. The memtable eventually gets flushed to disk, and results in an SSTable, which is how your data gets persisted.
  3. This cycle repeats
  4. Over time, these SSTables will accumulate

As SSTables accumulate, this introduces a problem: The more SSTables a partition contains, the more reads need to scan through these SSTables to retrieve the data we are looking for, which effectively increases the read latency. Then, with tombstones, what happens is that the database may need to scan through a large amount of data that shouldn’t be returned back to clients. By the time it manages to fetch only the live rows you need, it may have already spent too much time walking through a large tombstone run.
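The following toy model shows why a tombstone run hurts reads. It treats a partition as an ordered list of entries in which `None` marks a tombstone, and counts how many entries must be scanned before the requested number of live rows is found. This is a deliberate simplification (no SSTable merging, no range tombstones), and `Entry` and `read_live_rows` are hypothetical names.

```python
from typing import List, Optional, Tuple

# (clustering key, value); a value of None marks a tombstone.
Entry = Tuple[int, Optional[str]]


def read_live_rows(partition: List[Entry], limit: int) -> Tuple[List[Entry], int]:
    """Return up to `limit` live rows and the number of entries scanned."""
    live: List[Entry] = []
    scanned = 0
    for entry in partition:
        scanned += 1
        if entry[1] is not None:
            live.append(entry)
            if len(live) == limit:
                break
    return live, scanned


# 9,000 deleted rows followed by a single live one.
partition: List[Entry] = [(i, None) for i in range(9000)] + [(9000, "live")]
```

Reading one live row from this partition scans 9,001 entries to return a single result, and all of that work is invisible to the client except as latency.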

Over time, ScyllaDB and Cassandra will merge your SSTables via a process known as compaction, which will improve the latencies. However, depending on how many tombstones are present in the output, latency may still suffer. In the following example, it took 6 seconds for the database to reply back with a single live row in a partition with around 9 million range tombstones.

In the following extreme example, the partition in question was simply unreadable, and the database would timeout whenever trying to read from it.

cqlsh> TRACING ON; SELECT * FROM masterclass.tombstones WHERE pk=0;
ReadTimeout: Error from server: code=1200 [Coordinator node timed out waiting for replica nodes' responses] message="Operation timed out for masterclass.tombstones - received only 0 responses from 1 CL=ONE." info={'received_responses': 0, 'required_responses': 1, 'consistency': ONE}

Diagnosing and Preventing Data Modeling Mistakes with ScyllaDB

Finally, let’s talk about some of the ways to prevent or diagnose these mistakes because… trust me… you might end up running into them. Many before you already did. 🙂  To provide specific examples, we’ll use ScyllaDB, but you can accomplish many of the same things with other databases.

Finding Large Partitions / Cells / Collections

ScyllaDB has several system tables that record whenever large partitions, cells or collections are found. We even have a large partition hunting guide you can consult if you ever stumble upon that problem. Below, you can see an example of the output of a populated large_partitions table:

Addressing Hot Spots

As mentioned previously, there are multiple ways to identify and address hot spots. When in doubt of which partition may be causing you problems, run the nodetool toppartitions command under one of the affected nodes to sample which keys are most frequently hit during your sampling period:
nodetool toppartitions <keyspace> <table> <sample in ms>
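
The idea behind sampling the most frequently hit keys can be sketched in a few lines of Python. This is only an illustration of the concept, not how nodetool toppartitions works internally; the `top_partitions` helper and the sample keys are hypothetical.

```python
from collections import Counter


def top_partitions(accessed_keys, limit=3):
    """Return the `limit` most frequently accessed partition keys with their hit counts."""
    return Counter(accessed_keys).most_common(limit)


# Hypothetical partition keys observed during a sampling window.
sample = ["user:42", "user:7", "user:42", "user:42", "user:13", "user:7"]
hot = top_partitions(sample, limit=2)
print(hot)
```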

If hot partitions are often a problem for you, ScyllaDB allows you to specify per-partition rate limits, after which the database will reject any queries that hit that same partition. And remember that retry storms may cause hot spots, so ensure that your client-side timeouts are higher than the server-side timeouts to keep clients from retrying queries before the server has a chance to process them.
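
As a rough illustration of what a per-partition rate limit does, here is a minimal fixed-window limiter in Python. It is a sketch under stated assumptions, not ScyllaDB's implementation: the `PerPartitionRateLimiter` class, its one-second window and the explicit `now` argument are all hypothetical.

```python
from collections import defaultdict


class PerPartitionRateLimiter:
    """Fixed one-second-window limiter keyed by partition (illustrative only)."""

    def __init__(self, max_per_second):
        self.max_per_second = max_per_second
        self.window_start = defaultdict(float)
        self.count = defaultdict(int)

    def allow(self, partition_key, now):
        # Start a new window once a full second has passed for this partition.
        if now - self.window_start[partition_key] >= 1.0:
            self.window_start[partition_key] = now
            self.count[partition_key] = 0
        if self.count[partition_key] >= self.max_per_second:
            return False  # reject: this partition is over its limit
        self.count[partition_key] += 1
        return True


limiter = PerPartitionRateLimiter(max_per_second=2)
results = [limiter.allow("pk=0", now=10.0) for _ in range(3)]
print(results)
```

Only the overloaded partition is rejected; requests to other partitions are counted separately and keep flowing.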

Addressing Hot Shards

Remember per-shard imbalances that may introduce contention in your cluster? You can use monitoring to identify whether the affected shards are on the coordinator or the replica side.  If the imbalance is just on the coordinator side, then it likely means that your drivers are not configured correctly. However, if it affects the replica side, then it means that there is a data access pattern problem that you should review.

Depending on how large the imbalance is, you may configure ScyllaDB to shed requests past a specific concurrency limit. ScyllaDB will shed any queries that hit that specific shard past the number you specify. To do this, you use:
--max-concurrent-requests-per-shard <n>
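
Conceptually, shedding past a concurrency limit looks like the following Python sketch. The `ShardLoadShedder` class is hypothetical and only models the behavior described above; it does not reflect ScyllaDB's internals.

```python
class ShardLoadShedder:
    """Rejects new requests once a shard has too many requests in flight (illustrative only)."""

    def __init__(self, max_concurrent):
        self.max_concurrent = max_concurrent
        self.in_flight = 0

    def try_start(self):
        # Shed the request if the shard is already at its concurrency limit.
        if self.in_flight >= self.max_concurrent:
            return False
        self.in_flight += 1
        return True

    def finish(self):
        # Called when a previously admitted request completes.
        if self.in_flight > 0:
            self.in_flight -= 1


shard = ShardLoadShedder(max_concurrent=2)
admitted = [shard.try_start() for _ in range(3)]
shard.finish()
admitted_after_finish = shard.try_start()
print(admitted, admitted_after_finish)
```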

And of course, make use of tracing. You can choose from user-defined tracing, probabilistic tracing, slow query logging, and several other tracing options; try them to find the best fit for your problem.

Using Tombstone Eviction

Ensure that your tombstone eviction is efficient to avoid slowing down your read path, especially when your use case heavily relies on deletes!  Everything starts with selecting the right compaction strategy for your use case, and reviewing your DELETE patterns. Remember: Deleting a partition is much more performant than deleting a row, or a cell, for example.

ScyllaDB’s Repair-based Tombstone Garbage Collection feature allows you to tell the database how to evict tombstones. We also introduced the concept of empty replica pages. This allows ScyllaDB to hint to the driver to wait for a longer period of time before receiving live data (while it may be scanning through a potentially long tombstone run).

Learn More – On-Demand NoSQL Data Modeling Masterclass

Want to learn more about NoSQL data modeling best practices for performance? Catch our NoSQL data modeling masterclass – 3 hours of expert instruction, now on-demand (and free).

Designed for organizations migrating from SQL to NoSQL or optimizing any NoSQL data model, this masterclass will assist practitioners looking to advance their understanding of NoSQL data modeling. Pascal Desmarets (Hackolade), Tzach Livyatan (ScyllaDB) and I team up to cover a range of topics, from building a solid foundation on NoSQL to correcting your course if you’re heading down a dangerous path.

You will learn how to:

  • Analyze your application’s data usage patterns and determine what data modeling approach will be most performant for your specific usage patterns.
  • Select the appropriate data modeling options to address a broad range of technical challenges, including benefits and tradeoffs of each option
  • Apply common NoSQL data modeling strategies in the context of a sample application
  • Identify signs that indicate your data modeling is at risk of causing hot spots, timeouts, and performance degradation – and how to recover
  • Avoid the data modeling mistakes that have troubled some of the world’s top engineering teams in production

Access the Data Modeling Masterclass On-Demand (It’s Free)

5 NoSQL Data Modeling Guidelines for Avoiding Performance Issues

NoSQL data modeling initially appears quite simple. Unlike with the traditional relational model, there is no longer a need to reason around relationships across different entities, and the focus shifts to satisfying the required application queries and their access patterns more efficiently. However, the seeming simplicity of NoSQL data modeling can often lead to several hiccups that may inflict pain immediately or come back to haunt you later.

With NoSQL data modeling, it’s essential to consider edge cases and to understand and know your access patterns very well. Otherwise, you could end up with performance issues stemming from imbalances across your cluster.

This blog explores general data modeling guidelines that have helped many teams optimize the performance of their distributed databases. It introduces what imbalances are, and covers some well-known imbalance types – such as uneven data distribution and uneven access patterns. For a more detailed explanation of these guidelines, a look at data modeling mistakes in action, and tips for diagnosing problems in your own deployment, see our free NoSQL data modeling masterclass.

Exploring Imbalances

An imbalance is defined as an uneven and abnormal access pattern that introduces heavier pressure in some specific replicas on your distributed database. An imbalance causes some of your replicas to receive more traffic than others. These replicas are overutilized while the remaining cluster members serve less traffic and are underutilized. If the imbalance is not corrected, the busier replicas will inevitably become slower over time, up to a point where latencies will rise to a level that impacts user experience. Eventually, you could reach a point where data cannot be read – or your database stops working altogether.

What exactly do imbalances look like and how do you find them? Let’s take a look at two examples: uneven access patterns and uneven data distribution.

Uneven Access Patterns

The following image shows how reads are balanced across the replicas of a 3-node cluster. Two of the nodes appear balanced, but the green one does not.

Uneven Access Patterns

Writes are always replicated to the number of nodes specified by your replication factor, but reads are sent only to the number of replicas required by your consistency level. Here, one node is receiving less traffic than the others. That typically means that there is a problem with the access patterns from an application’s perspective.
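
As a quick reference for how the consistency level translates into a replica count, here is a small Python sketch. The `replicas_required` helper is hypothetical and covers only a few common consistency levels; QUORUM is computed as a majority of the replication factor.

```python
def replicas_required(consistency_level, replication_factor):
    """Number of replicas that must respond for a request at the given consistency level."""
    levels = {
        "ONE": 1,
        "TWO": 2,
        "QUORUM": replication_factor // 2 + 1,  # a majority of the replicas
        "ALL": replication_factor,
    }
    return levels[consistency_level]


for cl in ("ONE", "QUORUM", "ALL"):
    print(cl, replicas_required(cl, replication_factor=3))
```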

For some use cases, it may be hard or nearly impossible to achieve perfectly balanced traffic, but that’s what you should ideally be striving for. Here’s an example of what the ideal state looks like from the perspective of a monitoring dashboard.

A Perfect Distribution

Here, queries are evenly balanced and distributed across all nodes in a cluster, ensuring that the cluster is functioning smoothly.

Uneven Data Distribution

Here’s another example of a dashboard showing an imbalance:

Uneven Data Distribution

Here, the imbalance is related to data distribution. The dashboard shows a 9-node cluster, where 3 of the nodes have much higher data utilization than the others. This is a strong indication that the cluster in question has large partitions. The fact that the imbalance affects exactly 3 nodes, which equals this user’s replication factor, is yet another indicator of large partitions.
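
If you export per-node disk usage from your monitoring, a check like the following Python sketch can flag the outliers. The `overloaded_nodes` helper, the 1.5x threshold and the sample figures are assumptions for illustration; they are not taken from the dashboard above.

```python
from statistics import median


def overloaded_nodes(disk_usage_gb, threshold=1.5):
    """Return nodes whose data size exceeds `threshold` times the cluster median."""
    typical = median(disk_usage_gb.values())
    return sorted(node for node, used in disk_usage_gb.items() if used > threshold * typical)


# Hypothetical 9-node cluster in which 3 nodes hold far more data than the rest.
usage = {
    "node1": 100, "node2": 98, "node3": 103,
    "node4": 300, "node5": 310, "node6": 290,
    "node7": 101, "node8": 100, "node9": 102,
}
suspects = overloaded_nodes(usage)
print(suspects)
```

When the number of flagged nodes matches the replication factor, as it does here, large partitions are a likely explanation.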

5 Guidelines for Avoiding Imbalances and Other Performance Issues

So how do you avoid such imbalances? Here are 5 guidelines, along with resources that will help you put them into action:

  • Follow a query-driven design approach
  • Select an appropriate primary key
  • Use high cardinality for even data distribution
  • Avoid bad access patterns
  • Don’t overlook monitoring

Follow a query-driven design approach

With NoSQL data modeling, you’ll always want to follow a query-driven design approach, rather than follow the traditional entity-relationship model commonly seen in relational databases. Think about the queries you need to run first, then switch over to the schema.

Select an appropriate primary key for your application’s access patterns

Assuming that you’ve successfully mapped your queries (per the previous guideline), you should end up with a logical primary key selection. The primary key determines how your data will be distributed across the cluster, and it will dictate which queries your clients will be allowed to execute.

Carefully select partition keys to avoid imbalances and therefore ensure that you fully utilize your cluster, rather than end up introducing performance bottlenecks to a few replicas. With any distributed database, the overall cluster performance is always tied to the speed of its slowest replica.

Use high cardinality for even data distribution

Select a primary key – specifically a partition key – with high enough cardinality to ensure that 1) your data gets distributed as evenly as possible and 2) the load and processing are spread across all members of your cluster.
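
The effect of cardinality on distribution can be sketched with a toy hash-based placement in Python. This is a simplification: real clusters map partition keys to token ranges with their own partitioner, whereas the `node_for` helper below simply takes an MD5 hash modulo the node count. The key formats are hypothetical.

```python
import hashlib
from collections import Counter


def node_for(partition_key, node_count):
    """Map a partition key to a node index with a stable hash (toy placement)."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % node_count


def rows_per_node(partition_keys, node_count):
    """Count how many rows land on each node."""
    return Counter(node_for(key, node_count) for key in partition_keys)


# Low cardinality: 10,000 rows but only 2 distinct partition keys.
low = rows_per_node([f"country:{i % 2}" for i in range(10_000)], node_count=6)
# High cardinality: 10,000 rows with 10,000 distinct partition keys.
high = rows_per_node([f"user:{i}" for i in range(10_000)], node_count=6)

print("low cardinality:", dict(low))
print("high cardinality:", dict(high))
```

With only two distinct keys, at most two of the six nodes can ever hold data, no matter how many rows are written. With many distinct keys, the rows spread across all nodes.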

Avoid bad access patterns

Avoid hot access patterns, such as hot partitions, which similarly cause an imbalance in how data is accessed in your cluster. Here, it is worth noting that a few access imbalances are normal, and databases like ScyllaDB often come with a specialized cache to ensure that “hot items” are always served efficiently. Still, you should watch for the “hot partitions” effect as it may undermine your latencies in the long run.

Don’t overlook monitoring

Your database monitoring solution is your main ally for identifying erratic access patterns, troubleshooting imbalances, identifying bottlenecks, predicting pattern changes, and alerting you when things go wrong. Ensure that you install and configure your database vendor’s monitoring solution, as well as any other monitoring, observability, or APM tools you generally rely on.

Integrating ScyllaDB with a Presto Server for Data Analytics

Learn how to integrate ScyllaDB with PrestoDB and Metabase, unlocking powerful data analytics capabilities. 

The integration of ScyllaDB, Presto, and Metabase provides an open-source data analytics solution for high-speed querying and user-friendly visualization. This blog shows a hands-on example of how it works and why this integration is worth exploring further.

We’ll cover how to:

  • Unlock real-time insights: Merge ScyllaDB’s speed with Presto’s processing power to gain real-time insights from your data without delays.
  • Simplify data exploration: Metabase’s intuitive UI makes data exploration and visualization accessible to everyone, regardless of technical skills.
  • Break data silos: Seamlessly query and analyze data from ScyllaDB alongside other sources, breaking down data silos and enhancing overall efficiency.

Here are some of the benefits of choosing this solution:

  • High performance, big data: Leverage ScyllaDB’s performance and Presto’s scalability for swift data processing, ideal for handling substantial data volumes.
  • Adaptability: Accommodate evolving data structures with ScyllaDB and Presto, ensuring your analytics are never limited by rigid schemas.
  • Empowering visualization: Metabase’s user-centric approach empowers business users to create impactful visualizations and reports, fostering data-driven decisions.

In short, integrating ScyllaDB, Presto, and Metabase creates a dynamic solution that supercharges data utilization, delivering speed, flexibility, availability and accessibility.

Database Schema

In the following hands-on exercise, you will use a simple DB schema for an online merchandise store with users, goods, and transactions.

Here is the example db schema:

And here is the architecture:

It includes:

  • ScyllaDB: A high-throughput low-latency NoSQL database (available as open source).
  • Presto: An open-source distributed SQL query engine.
  • Metabase: An open-source business intelligence and data visualization tool.

About Presto

In this lesson, you will be working with Presto, an open-source distributed SQL query engine designed to run interactive analytic queries against various data sources. It was originally developed at Facebook and is now actively maintained by the Presto Foundation.

Presto consists of the following components:

  • Coordinator: The Coordinator node is responsible for receiving user queries, parsing and analyzing them, and creating a query plan. It coordinates the execution of queries across multiple worker nodes.
  • Worker: Worker nodes are responsible for executing the tasks assigned to them by the Coordinator. They perform data retrieval, processing, and aggregation tasks as part of the distributed query execution.
  • Catalog: The Catalog represents the metadata about the available data sources and their tables. It helps Presto understand how to access and query the data stored in various data stores like ScyllaDB.
  • Connector: Connectors are plugins that allow Presto to communicate with different data sources. Each connector understands how to translate SQL queries into the respective data source’s query language and how to fetch the data.
  • SQL Parser: This component parses the SQL queries submitted by users and transforms them into an internal representation that the Coordinator can understand.
  • Query Optimizer: The Query Optimizer analyzes the query plan and tries to find the most efficient way to execute the query. It aims to minimize data movement and processing overhead to improve performance.
  • Execution Engine: The Execution Engine executes the query plan generated by the Coordinator. It coordinates the execution across multiple worker nodes, ensuring the data processing tasks are distributed efficiently.

By integrating a ScyllaDB cluster with Presto, you can leverage the power of SQL queries to access and analyze data stored in ScyllaDB using the Presto SQL dialect, making it easier to gain insights and perform analytics on large datasets. Additionally, using Metabase as a visualization tool, you can create interactive dashboards and visualizations based on the data queried through Presto, enhancing your data exploration capabilities.

About Metabase

Metabase is an open-source business intelligence and data visualization tool. It is designed to make it easy for non-technical users to explore, analyze, and visualize data without complex SQL queries or programming skills.

Metabase consists of the following main components:

  • Dashboard: The Dashboard is the user interface of Metabase, where users interact with the tool to explore data and create visualizations. It provides a user-friendly interface that allows users to build custom dashboards and visualizations using a drag-and-drop approach.
  • Query Builder: The Query Builder is an essential part of Metabase that enables users to create queries and retrieve data from various data sources. It offers a simple and intuitive way to construct SQL-like queries without requiring users to write raw SQL code.
  • Data Model: Metabase maintains a data model that represents the metadata about the available data sources and the underlying database schemas. This data model helps users understand the structure of the data and aids in creating meaningful visualizations.
  • Visualization Engine: Metabase comes equipped with a powerful Visualization Engine that allows users to create a wide range of charts, graphs, and other visual representations of their data. It supports various visualization types, such as bar charts, line charts, pie charts, and more.
  • Database Drivers: Metabase uses database drivers to connect to different data sources, including relational databases, NoSQL databases, and other data storage systems. These drivers facilitate data retrieval and enable Metabase to interact with various databases.
  • Permissions and Access Control: Metabase provides a robust system for managing user permissions and access control. Administrators can control which users can access specific data sources, dashboards, and features, ensuring data security and privacy.

Using Metabase alongside Presto and the integrated ScyllaDB cluster, you will create insightful visualizations and interactive dashboards based on the data queried through Presto. This combination allows for a seamless end-to-end data analysis and visualization process, making it easier to gain actionable insights from your data and share them with stakeholders effectively.

Services Setup With Docker

In this lab, you’ll use Docker.

First, ensure that your environment meets the following prerequisites:

  • Docker for Linux, Mac, or Windows. Please note that running ScyllaDB in Docker is only recommended to evaluate and try ScyllaDB. For best performance, a regular install is recommended.
  • 4GB of RAM or greater for ScyllaDB, Presto, and Metabase services
  • Git

Once Docker is installed, you can run the whole infrastructure with a single docker compose command.

Check out ScyllaDB’s code samples and go to the presto directory, which contains all the required assets.

git clone
cd scylla-code-samples/presto

Run docker compose. It creates a docker network “presto” and creates and starts containers with a single node ScyllaDB server, a Presto server and a Metabase server:

docker compose up -d

It takes about 40s for ScyllaDB to initialize. You can check the status of Presto and Metabase by running:

docker compose ps

As long as it is still initializing the STATUS will show (health: starting), and once initialized it will show (healthy):

Check the status of the ScyllaDB cluster by running:

docker exec -it presto-scylladb-1 nodetool status

The UN status shows that the ScyllaDB node is up and running.

Presto UI

Browse to the Presto UI at localhost:8080/ui

When queries are executed, it displays their execution result and some useful Presto server metrics:

The UI is helpful for debugging errors. You can see some query details along with any failures. In the example above, all queries are successful. Below is an example of a query that fails:

Metabase UI

Browse to the Metabase UI at localhost:3000

Testing the Integration

Data Preparation

Let’s create a namespace and generate a schema with sample data for our store:

make prepare_data

This command will connect to ScyllaDB, and invoke cqlsh to fill the cluster with the schema and sample data from the data.txt file, which is located in the root of the presto directory.

Connect Metabase to Presto

After this part is complete and Metabase has started, continue with the setup process in the Metabase UI.

During the “Add your data” step, choose Presto.

Then fill in all required fields except “password”. Leave it empty; the field will still show grayed-out masked dots.

Here are the details of each field:

  • Set the host to “presto”. This is the name of the Presto service in docker-compose.
  • Set the port to 8080. This is the default port for Presto; we did not override it.
  • Set the catalog to scylladb. This name comes from the catalog file added to the presto/resources/etc/catalog directory.
  • Set the schema name to merch_store, which comes from the sample data.
  • The username is a free-form string.
  • Leave the password empty. Metabase still displays masked placeholder dots in the field.

After onboarding is complete, Metabase will show the home dashboard with a small hint in the bottom-right corner saying that it is syncing tables from our Presto installation.

Here you have insights about the test data that Metabase added for a demo. You can delete the Sample database from admin settings -> Databases -> Sample database -> Remove database. After deleting the Sample database and waiting a minute or two, Metabase will generate insights based on the schema and data that we prepared. The dashboard looks like this:

Next, open the “A summary at Transactions” insight. It shows some insights into the transaction data.

Example Insights

Here’s an example of how the integration lets you get insights, such as creating a dashboard to show the sum of transactions per day:

  1. Click on the “Browse Data” section in the menu on the left.
  2. Choose the “Presto” database.
  3. Choose the “Transactions” table.

    You will see transactions as a table with all available fields.
  4. In the right corner of the UI, click on “Summarize” to start building the report.
  5. Click on the green “Count” and select “Sum of…”.
    Metabase will propose a column on which to sum.
    Select “Total Amount”. Once you select the column, Metabase calculates the total of all transactions and displays it as a single number.
  6. In the Group by section, choose the “Transaction Date” column to calculate the sum by day.

    Metabase will display the sum of transactions per day as a line chart.
  7. Click on “Save” to persist the created report to the existing dashboard or to a new one.
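
For readers who want to see what the report computes, the aggregation built in the steps above (sum of “Total Amount” grouped by “Transaction Date”) can be sketched in plain Python. The `sum_by_day` helper and the sample transactions are hypothetical; amounts are kept in cents to avoid floating-point rounding.

```python
from collections import defaultdict
from datetime import date


def sum_by_day(transactions):
    """Sum transaction amounts per day; `transactions` is a list of (date, amount) pairs."""
    totals = defaultdict(int)
    for day, amount in transactions:
        totals[day] += amount
    return dict(sorted(totals.items()))


# Hypothetical transactions: (transaction date, total amount in cents).
transactions = [
    (date(2023, 9, 1), 1250),
    (date(2023, 9, 2), 400),
    (date(2023, 9, 1), 750),
]
daily_totals = sum_by_day(transactions)
print(daily_totals)
```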

See it in Action

Watch the following video to see:

  • The above example (transaction sums),
  • A second example, which demonstrates how to create a table of the most valuable users by joining the transactions table with the users table
  • How to create a custom dashboard



Clean-up is as easy as startup. Just use

docker compose down --volumes

This stops all running docker containers and removes the created volumes.


In this lesson, you learned how to integrate a ScyllaDB cluster with a Presto server, unlocking powerful data analytics capabilities. ScyllaDB’s high performance complements Presto’s distributed SQL query engine. Leveraging ScyllaDB’s capabilities with Presto enables lightning-fast data retrieval and processing, which is ideal for handling large-scale data workloads. The integration empowers users to explore and analyze vast datasets effortlessly, making data-driven decisions more accessible and efficient. Combined with Metabase’s intuitive visualization tools, this integration forms a robust end-to-end solution for deriving valuable insights from your data, empowering businesses to stay competitive in a data-driven world.

For more information, see the ScyllaDB docs on integrating with Presto.

Learn more at ScyllaDB University

Share Your Database Experiences and Insights at ScyllaDB Summit

Join the ranks of distinguished ScyllaDB Summit speakers like Discord, Disney+ Hotstar, Epic Games, Palo Alto Networks, ShareChat, Strava and more.

If you’ve tackled some interesting database challenges recently, we invite you to share your experiences and lessons learned at ScyllaDB Summit 2024. We just opened the call for speakers – so tell us about your ideas!

Become a ScyllaDB Summit Speaker

ScyllaDB Summit is a free + virtual event where a global community of database professionals – especially those obsessed with performance – gather to share strategies, learn from their peers, and discover the latest trends and innovations. If you’re selected to speak, you’ll join a rather impressive list of past speakers: Discord, Disney+ Hotstar, Zillow, Strava, Epic Games, ShareChat, Expedia, Palo Alto Networks, Ticketmaster… just to name a few.

What should you talk about? Here are some ideas:

  • What interesting distributed data challenges you’ve been tackling
  • How you’re using ScyllaDB (Open Source, Cloud, or Enterprise) to build something amazing
  • What best practices you’ve developed for working with high-throughput, low-latency applications – database and beyond
  • What technologies you’re using in concert with ScyllaDB, or how you’re extending ScyllaDB to meet your project’s specialized needs

We’d love to hear from new users as well as power users. The focus is on sharing authentic, technical experiences with your peers. What challenges were you wrestling with that prompted you to consider ScyllaDB? How else did you consider addressing the challenge? What path forward did you select, and why? How’s it going so far? What have you learned – and what’s next?

Also, note that talks don’t need to be specifically about ScyllaDB. For example, ScyllaDB Summit 23 featured broader distributed database talks such as:

  • The Consistency vs. Throughput Tradeoff in Distributed Databases
  • Everything in its Place: Putting Code and Data Where They Belong
  • Solving the Issue of Mysterious Database Benchmarking Results
  • The Trends that are Transforming Your Database Infrastructure Forever

Speaking slots are just 15-20 minutes, so consider it your TED Talk debut. 🙂 We welcome a broad range of speakers, including first-time speakers. If you’re selected, our team can help you craft a compelling presentation. It’s a great opportunity to share your insights and achievements with thousands of your peers – no travel required!

The conference is designed to be fully virtual and highly interactive. All sessions will be pre-recorded at your convenience; during the live conference, speakers focus on chatting with attendees, conference hosts, and fellow speakers. The deadline for submissions is October 20th. The event is February 14 and 15 (two half days).

For inspiration, take a look at the blog Dor Laor (ScyllaDB Co-Founder and CEO) wrote, sharing his key takeaways from ScyllaDB Summit 2023’s user talks.