Apache Cassandra® 5.0: Improving performance with Unified Compaction Strategy

Introduction

Unified Compaction Strategy (UCS), introduced in Apache Cassandra 5.0, is a versatile compaction framework that not only unifies the benefits of Size-Tiered (STCS) and Leveled (LCS) Compaction Strategies, but also introduces new capabilities like shard parallelism, density-aware SSTable organization, and safer incremental compaction, all of which deliver more predictable performance at scale. By utilizing a flexible scaling model, UCS allows operators to tune compaction behavior to match evolving workloads, spanning from write-heavy to read-heavy, without requiring disruptive strategy migrations in most cases.

In the past, operators had to pick a rigid strategy and accept its significant trade-offs. UCS changes this paradigm, allowing the system to adapt efficiently to changing workloads with tunable configurations that can be altered mid-flight and even applied differently across compaction levels based on data density.

Why compaction matters

Compaction is the critical process that determines a cluster’s long-term health and cost-efficiency. When executed correctly, it produces denser nodes with highly organized SSTables, allowing each server to store more data without sacrificing speed. This efficiency translates to a smaller infrastructure footprint, which can lower cloud costs and resource usage.

Conversely, inefficient compaction is a primary driver of performance degradation. Poorly managed SSTables lead to fragmented data, forcing the system to work harder for every request. This overhead consumes excessive CPU and I/O, often pushing teams to add more nodes (horizontal scaling) just to keep up with background maintenance.

Key concepts and terminology

To understand how UCS optimizes a cluster, it is necessary to understand the fundamental trade-offs it balances:

  • Read amplification: Occurs when the database must consult multiple SSTables to answer a single query. High read amplification acts as a “latency tax,” forcing extra I/O to reconcile data fragments.
  • Write amplification: A metric that quantifies the overhead of background processes (such as compactions). It represents the ratio between total data written to disk and the amount of data originally sent by an application. High write amplification wears out SSDs and steals throughput.
  • Space amplification: The ratio of disk space used to the actual size of the “live” data. It tracks data such as tombstones or overwritten rows that haven’t been purged yet.
  • Fan factor: The “growth dial” for the cluster data hierarchy. It defines how many files of a similar size must accumulate before they are merged into a larger tier. For example, with a fan factor of 4, roughly four similar-sized SSTables collect on a level before being merged into one larger SSTable on the next.
  • Sharding: UCS splits data into smaller, independent token ranges (shards), allowing the system to run multiple compactions in parallel across CPU cores.

Performance gains by design

UCS provides baseline architectural improvements that were not available in older strategies:

Improved compaction parallelism

Older strategies often serialized large merges onto a single thread. UCS sharding allows a server to use its full processing power, which significantly reduces the likelihood of compaction storms and keeps tail latencies (p99) predictable.

Reduced disk space amplification

Because UCS operates on smaller shards, it doesn’t need to double the entire disk space of a node to perform a major merge. This greatly reduces the risk of nodes running out of space during heavy maintenance cycles.

Density-based SSTable organization

UCS measures SSTables by density (size relative to the token range they cover). This mitigates the “huge SSTable” problem, where a single massive file becomes too large to compact and drags down read performance indefinitely.

Scaling parameter

The scaling parameter (denoted as W) is a configurable setting that determines the size ratio between compaction tiers. It helps balance write amplification and read performance by controlling how much data is rewritten during compaction operations. A lower scaling parameter value results in more frequent, smaller compactions, whereas a higher value leads to larger compaction groups.
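For intuition, here is a minimal sketch (in Rust, purely illustrative and not Cassandra code) of the mapping described in the UCS documentation between the scaling parameter and an effective compaction mode plus fan factor:

    #[derive(Debug)]
    enum Mode {
        Tiered { fan_factor: i32 },  // behaves like STCS
        Leveled { fan_factor: i32 }, // behaves like LCS
    }

    fn mode_for(w: i32) -> Mode {
        if w >= 0 {
            // Positive values behave like STCS: up to fan_factor similar-sized
            // SSTables accumulate on a level before they are merged.
            Mode::Tiered { fan_factor: w + 2 }
        } else {
            // Negative values behave like LCS: levels are kept tightly organized,
            // trading extra background writes for fewer SSTables per read.
            Mode::Leveled { fan_factor: 2 - w }
        }
    }

    fn main() {
        // W = 2 corresponds to a T4-style setting, W = -8 to an L10-style setting,
        // and W = 0 is the middle ground where the two behaviors coincide.
        for w in [-8, 0, 2] {
            println!("W = {:>2} -> {:?}", w, mode_for(w));
        }
    }

In other words, moving W up makes UCS behave more like STCS, and moving it down makes it behave more like LCS.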

The strategy engine: tuning and parameters

UCS acts as a strategy engine: by adjusting the scaling parameter (W), it can mimic, and often outperform, its predecessors STCS and LCS.

At a high level, the scaling parameter influences the effective fan-out behavior at each compaction level. Tiered-style settings such as T4 allow more SSTables to accumulate before merging, favoring write efficiency, while leveled-style settings such as L10 keep SSTables more tightly organized, reducing read amplification at the cost of additional background work.

The numbers below are illustrative and not prescriptive:

UCS configuration guide

Workload type | Strategy target | Scaling (W) | Primary benefit
Heavy writes / IoT | STCS-like (tiered) | Positive (T values, e.g., T4) | Lowest write amplification
Heavy reads | LCS-like (leveled) | Negative (L values, e.g., L10) | Lowest read amplification
Balanced | Hybrid | Zero (0) | Balanced performance for general apps
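For a rough sense of why these settings pull in opposite directions, here is a back-of-envelope sketch (illustrative Rust, based on standard LSM-tree reasoning rather than on Cassandra internals):

    /// Approximate number of compaction levels for a data size, flush size, and fan factor.
    fn levels(total_bytes: f64, flush_bytes: f64, fan_factor: f64) -> f64 {
        ((total_bytes / flush_bytes).ln() / fan_factor.ln()).ceil()
    }

    fn main() {
        let f = 4.0;
        let l = levels(1e12, 1e9, f); // ~1 TB of data with ~1 GB flushes and F = 4: about 5 levels
        // Tiered-style: data is rewritten roughly once per level, but a read may have to
        // consult up to F sorted runs on each level.
        println!("tiered:  write amp ~{}, read amp ~{}", l, f * l);
        // Leveled-style: promoting data rewrites roughly F SSTables per level, but a read
        // touches about one sorted run per level.
        println!("leveled: write amp ~{}, read amp ~{}", f * l, l);
    }

The absolute numbers are meaningless; the point is that the same fan factor trades write amplification for read amplification depending on whether levels are kept tiered or leveled.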

Practical example

UCS allows operators to mix behaviors across the data lifecycle.

'scaling_parameters': 'T4, T4, L10'

Note that scaling_parameters takes a string value, which allows per-level tuning: you can list one parameter for each compaction level.

This example instructs a cluster: “Use tiered compaction for the first two levels to keep up with the high write volume, but once data reaches the third level, reorganize it into a leveled structure so reads stay fast.”

Here’s a fuller, illustrative example of how one might structure their CQL to change the compaction strategy.

ALTER TABLE keyspace_name.table_name
  WITH compaction = {
    'class': 'UnifiedCompactionStrategy',
    'scaling_parameters': 'T4,T4,L10'
  };

Operational evolution: moving beyond major compactions

In older strategies and in Apache Cassandra versions prior to 5.0, operators often felt forced to run a major compaction to reclaim disk space or fix performance. This was a critical event that could impact a node’s I/O for extended periods of time and required substantial free disk space to complete.

Because UCS is density-aware and sharded, it effectively performs compactions constantly and granularly, so major compactions are rarely needed. It identifies overlapping data within specific token ranges (shards) and cleans it up incrementally. Operators no longer have to choose between a fragmented disk and a risky, resource-heavy manual compaction; UCS keeps data density more uniform across the cluster over time.

The migration advantage: “in-place” adoption

One of the key performance features of a UCS migration is in-place adoption: when a table is switched to UCS, the strategy does not immediately force a massive data rewrite. Instead, it looks at the existing SSTables, calculates their density, and maps them into its new sharding structure.

This allows operators to move from STCS or LCS to UCS with significantly less I/O overhead than a traditional strategy change.

Conclusion

UCS is an operational shift toward simplicity and predictability. By removing the need to choose between compaction trade-offs, UCS allows organizations to scale with confidence. Whether handling a massive influx of IoT data or serving high-speed user profiles, UCS helps clusters remain performant, cost-effective, and ready for the future.

On a newly deployed NetApp Instaclustr Apache Cassandra 5 cluster, UCS is already the default strategy (while Apache Cassandra 5.0 has STCS set as the default).

Ready to experience this new level of Cassandra performance for yourself? Try it with a free 30-day trial today!


The Deceptively Simple Act of Writing to Disk

Tracking down a mysterious write throughput degradation From a high-level perspective, writing a file seems like a trivial operation: open, write data, close. Modern programming languages abstract this task into simple, seemingly instantaneous function calls. However, beneath this thin veneer of simplicity lies a complex, multi-layered gauntlet of technical challenges, especially when dealing with large files and high-performance SSDs. For the uninitiated, the path from application buffer to persistent storage is fraught with performance pitfalls and unexpected challenges. If your goal is to master the art of writing large files efficiently on modern hardware, understanding all the details under the hood is essential. This article will walk you through a case study of fixing a throughput performance issue. We’ll get into the intricacies of high-performance disk I/O, exploring the essential technical questions and common oversights that can dramatically affect reliability, speed, and efficiency. It’s part 2 of a 3-part series. Read part 1 When lots of work leads to a performance regression If you haven’t yet read part 1 (When bigger instances don’t scale), now’s a great time to do so. It will help you understand the origin of the problem we’re focusing on here. TL;DR: In that blog post, we described how we managed to figure out why a new class of highly performant machines didn’t scale as expected when instance sizes increased. We discovered a few bugs in our Seastar IO Scheduler (stick around a bit, I’ll give a brief description of that below). That helped us measure scalable bandwidth numbers. At the time, we believed these new NVMes were inclined to perform better with 4K requests than with 512 byte requests. We later discovered that the latter issue was not related to the scheduler at all. We were actually chasing a firmware bug in the SSD controller itself. These disks do, in fact, perform better with 4K requests. What we initially thought was a problem in our measurement tool (IOTune) turned out to be something else entirely. IOTune wasn’t misdetecting the disk’s physical sector size (this is the request size at which a disk can achieve the best IOPS). Instead, the disk firmware was reporting it incorrectly. It was reporting it as 512 bytes. However, in reality, it was 4K. We worked around the IOPS issue since the cloud provider wasn’t willing to fix the firmware bug due to backward-compatibility concerns. We also deployed the IO Scheduler fixes and our measured disk models (io-properties) with IOTune scaled nicely with the size of the instance. Still, in real workload tests, ScyllaDB didn’t like it. Performance results of some realistic workloads showed a write throughput degradation of around 10% on some instances provisioned with quite new and very fast SSDs. While this wasn’t much, it was alarming because we were kind of expecting an improvement after the last series of fixes. These first two charts give us a good indication of how well ScyllaDB utilizes the disk. In short, we’re looking for both of them to be as stable as possible and as close to 100% as possible. The “I/O Group consumption” metric tracks the amount of shared capacity currently taken by in-flight operations from the group (reads, writes, compaction, etc.). It’s expressed as a percentage of the configured disk capacity. The “I/O Group Queue flow ratio” metric in ScyllaDB measures the balance between incoming I/O request rates and the dispatch rate from the I/O queue for a given I/O group. 
It should be as close as possible to 1.0, because requests cannot accumulate in disk. If it jumps up, it means one of two things. The reactor might be constantly falling behind and not kicking the I/O queue in a timely manner. Or, it can mean that the disk is slower than we told ScyllaDB it was – and the scheduler is overloading it with requests. The spikes here indicate that the IO Scheduler doesn’t provide a very good QoS. That led us to believe the disk was overloaded with requests, so we ended up not saturating the throughput. The following throughput charts for commitlog, memtable, and compaction groups reinforce this claim. The 4 NVMe RAID array we were testing against was capable of around 14.5GB/s throughput. We expected that at any point in time during the test, the sum of the bandwidths for those three groups would get close to the configured disk capacity. Please note that according to the formula described in the section below on IO Scheduler, bandwidth and IOPS have a competing relationship. It’s not possible to reach the maximum configured bandwidth because that would leave you with 0 space for IOPS. The reverse holds true: You cannot reach the maximum IOPS because that would mean your request size got so low that you’re most likely not getting any bandwidth from the disk. At the end of the chart, we would expect an abrupt drop in throughput for the commitlog and memtable groups because the test ends, with the compaction group rapidly consuming most of the 14.5GB/s of disk throughput. That was indeed the case, except that these charts are very spiky as well. In many cases, summing up the bandwidth for the three groups shows that they consume around 12.5GB/s of the disk total throughput. The Seastar IO Scheduler Seastar uses an I/O scheduler to coordinate shards for maximizing the disk’s bandwidth while still preserving great IOPS. Basically, the scheduler lets Seastar saturate a disk and at the same time fit all requests within a latency goal configured on the library (usually 0.5ms). A detailed explanation of how the IO Scheduler works can be found in this blog post. But here’s a summary of where max bandwidth/IOPS values come from and where they go to within the ScyllaDB IO Scheduler. I believe it will connect the problem description above with the rest of the case study below. The IOTune tool is a disk benchmarking tool that ships with Seastar. When you run this tool on a disk, it will output 4 values corresponding to read/write IOPS and read/write bandwidth. These 4 values end up in a file called io-properties.yaml. When provided with these values, the Seastar IO Scheduler will build a model of your disk. This model then helps ScyllaDB maximize the drive’s performance. The IO Scheduler models the disk based on the IOPS and bandwidth properties using a formula which looks something like: read_bw/read_bw_max + write_bw/write_bw_max + read_iops/read_iops_max + write_iops/write_iops_max <= 1 The internal mechanics of how the IO Scheduler works are detailed in the blog post linked above. Peak and sustained throughput We observed some bursty behavior in the SSDs under test. It wasn’t much; iops/bandwidth would be lower with around 5% than the values measured by running the benchmark for the default 2 minutes. The iops/bandwidth values would start stabilizing at around 30 minutes and that’s what we call sustained io-properties. We thought our IOTune runs might have recorded peak disk io-properties (i.e., we ran the tool for too short of a duration – the default is 120 seconds). 
With a realistic workload, we are actually testing the disks at their sustained throughput, so the IO Scheduler builds an inflated model of the disk and ends up sending more requests than the disk can actually handle. This would cause the overload we saw in the charts. We then tested with newly measured sustained io-properties (with the IOTune duration configured to run 30 minutes, then 60 minutes). However, there wasn’t any noticeable improvement in the throughput degradation problem…and the charts were still spiky. Disk saturation length The disk saturation length, as defined and measured by IOTune, is the smallest request length that’s needed to achieve the maximum disk throughput. All the disks that we’ve seen so far had a measured saturation length of 128K. This means that it should be perfectly possible to achieve maximum throughput with 128K requests. We noticed something quite odd while running tests on these performant NVMes: the Seastar IOTune tool would report a saturation length of 1MB. We immediately panicked because there are a few important assumptions that rely on us being able to saturate disks with 128K requests. The issue matched the symptoms we were chasing. A disk model built with the assumption that saturation length is 1MB would trick the IO Scheduler into allowing a higher number of 128K requests (the length Seastar uses for sequential writes) than the disk controller can handle efficiently. In other words, the IO Scheduler would try to achieve the high throughput measured with 1MB request length, but using 128K requests. This would make the disk appear overloaded, as we saw in the charts. Assume you’re trying to reach the maximum throughput on a common disk with 4K requests, for instance. You won’t be able to do it. And since the throughput would stay below the maximum, the IO Scheduler would stuff more and more requests into the disk. It’s hoping to reach the maximum – but again, it won’t be reached. The side effect is that the IO Scheduler ends up overloading the disk controller with requests, increasing in-disk latencies for all the other processes trying to use it. As is typical when you’re navigating muddy waters like these, this turned out to be a false lead. We were stepping on some bugs in our measuring tools, IOTune and io_tester. IOTune was running with lower parallelism than those disks needed for saturation. And io_tester was measuring overwriting a file rather than writing to a pre-truncated new file. The saturation length of this type of disks was still 128k, like we had seen in the past. Fortunately, that meant we didn’t need to make potential architectural changes in Seastar in order to accommodate larger requests. A nice observation we can make here based on the tests we ran trying to dis/prove this theory is that extent allocation is a rather slow process. If the allocation group is already quite busy (a few big files already exist under the same directory, for instance), the effect on throughput when appending and extending a file is quite dramatic. Internal parallelism Another interesting finding was that the disk/filesystem seemed to suffer from internal parallelism problems. We ran io_tester with 128k requests, 8 fibers per shard with 8 shards and 64 shards. The results were very odd. The expected bandwidth was ~12.7GB/s, but we were confused to see it drop when we increased the number of shards. Generally, the bandwidth vs. latency dependency is defined by two parts. 
If you measure latency and bandwidth while you increase the app parallelism, the latency is constant and the bandwidth grows as long as the internal disk parallelism is not saturated. Once the disk is throughput loaded, the bandwidth stops growing or grows very little, while the latency scales almost linearly with the increase of the input. However, in the table above, we see something different. When the disk is overloaded, latency increases (because it holds more requests internally), but the bandwidth drops at the same time. This might explain why IOTune measured bandwidths around 12GB/s, while under the real workload (capped with IOTune-measured io-properties.yaml), the disk behaved as if it was overloaded (i.e., high latency and the actual bandwidth below the maximum bandwidth). When IOTune measures the disk, shards load the disk one-by-one. However, the real test workload sends parallel requests from all shards at once. The storage device was a RAID0 with 4 NVMes and a RAID chunk size of 1MB. The theory here was that since each shard writes via 8 fibers of 128k requests, it’s possible that many shards could end up writing to the same disk in the array. The explanation is that XFS aligns files on 1MB boundaries. If all shards start at the same file offset and move at relatively the same speed, the shards end up picking the same drive from the array. That means we might not be measuring the full throughput of the raid array. The measurements confirm that the shards do not consistently favor the same drive. Single disk throughput was measured at 3.220 GB/s while the entire 4-disk array achieved a throughput of 10.9 GB/s. If they were picking the same disk all the time, the throughput of the entire 4-disk array would’ve been equal to that of a single disk (i.e. 3.2GB/s). This lead ended up being a dead end. We tried to prove it in a simulator, but all requests ended up shuffled evenly between the disks. Sometimes, interesting theories that you bet can explain certain effects just don’t hold true in practice. In this case, something else sits at the base of this issue. XFS formatting Although the previous lead didn’t get us very far, it opened a very nice door. We noticed that the throughput drop is reproducible on 64 shards, rather than 8 shards RAID mounted with the scylla_raid_setup script (a ScyllaDB script which does preparation work on a new machine, e.g., formats the filesystem, sets up the RAID array), not on a raw block device and not on a RAID created with default parameters Comparing the running mkfs.xfs commands, we spotted a few differences. In the table below, notice how the XFS parameters differ between default-mounted XFS and XFS mounted by scylla_raid_setup. The 1K vs 4K data.bsize difference stands out. We also spotted – and this is an important observation– that truncating the test files to some large size seems to bring back the stolen throughput. The results coming from this observation are extremely surprising. Keep reading to see how this leads us to actually figuring out the root cause of the problem. The table below shows the throughput in MB/s when running tests on files that are being appended and extended and files that are being pre-truncated to their final size (both cases were run on XFS mounted with the Scylla script). We’ve experimented with Seastar’s sloppy_size=true file option, which truncates the file’s size to double every time it sees an access past the current size of the file. 
However, while it improved the throughput numbers, it unfortunately still left half of the missing throughput on the table. RWF_NOWAIT and non-blocking IO The first lead that we got from here was by running our tests under strace. Apparently, all of our writes would get re-submitted out of the Seastar Reactor thread to XFS threads. Seastar uses RWF_NOWAIT to attempt non-blocking AIO submissions directly from the Seastar Reactor thread. It sets aio_rw_flags=RWF_NOWAIT on IOCBs during initial io_submit calls from the Reactor backend. On modern kernels, this flag requests immediate failure (EAGAIN) if the submission may block, preserving reactor responsiveness.​ On io_submit failure with RWF_NOWAIT, Seastar clears the flag and queues the IOCBs for retrying. The thread pool then executes the retry submissions without the RWF_NOWAIT flag, as these can tolerate blocking. We thought reducing the number of retries would increase throughput. Unfortunately, it didn’t actually do that when we disabled the flag. The cause for the throughput drop is uncovered next. As for the RWF_NOWAIT issue, it’s still unclear why it doesn’t affect throughput. However, the fix was a kernel patch by our colleague Pavel Emelyanov which fiddles with inode time update when IOCBs carry the RWF_NOWAIT flag. More details on this would definitely exceed the scope of this blog post. blktrace and offset alignment distribution Returning to our throughput performance issue, we started running io_tester experiments with blktrace on, and we noticed something strange. For around 25% of the requests, io_tester was submitting 128k requests and XFS would queue them as 256 sector requests (256 sectors x 512 bytes reported physical sector size equals to 128k). However, it would split the requests and complete them in 2 parts. (Note the Q label on the first line of the output below; this indicates the request was queued) The first part of the request would finish in 161 microseconds, while the second part would finish in 5249 microseconds. This dragged down the latency of the whole request to 5249 microseconds (the 4th column in the table is the timestamp of the event in seconds; the latency of a request is max(ts_completed – ts_queued)). The remaining 75% of requests were queued and completed in one go as 256 sector requests. They were also quite fast: 52 microseconds, as shown below. The explanation for the split is related to how 128k requests hit the file, given that XFS lays it out on disk, considering a 1MB RAID chunk size. The split point occurs at address 21057335224 + 72, which translates to hex 0x9CE3AE00000. This reveals it is, in fact, a multiple of 0x100000 – the 1MB RAID chunk boundary. We can discuss optimizations for this, but that’s outside the scope of this article. Unfortunately, it was also out of scope for our throughput issue. However, here are some interesting charts showing how request offsets alignment looks based on the blktrace events we collected. default formatted XFS with 4K block size scylla_raid_setup XFS: 1K block scylla_raid_setup XFS: 1K block + truncation This result is significant! For XFS formatted with a 4K block size, most requests are 4K aligned. For XFS formatted with scylla_raid_setup (1K block size), most requests are 1K or 2K aligned. For XFS formatted with scylla_raid_setup (1K block size) and with test files truncated to their final size, all requests are 64K aligned (although in some cases we also saw them being 4k aligned). 
It turns out that XFS lays out files on disk very differently when the file is known in advance compared to the case when the file is grown on-demand. That results in IO unaligned to the disk block size. Punchline Now here comes the explanation of the problem we’ve been chasing since the beginning. In the first part of the article, we saw that doing random writes with 1K requests produces worse IOPS than with 4K requests on these 4K optimized NVMes. This happens because when executing the 1K request, the disk needs to perform Read-Modify-Write to land the data into chips. When we submit 128k requests (as ScyllaDB does) that are 1K or 2K aligned (see the alignment distributions in the charts above), the disk is forced to do RMW on the head and tail of the requests. This slows down all the requests (unrelated, but similar in concept to the raid chunk alignment split we’ve seen above). Individually, the slowdown is probably tiny. But since most requests are 1k and 2k aligned on XFS formatted with 1k block size (no truncation), the throughput hit is quite significant. It’s very interesting to note that, as shown in the last chart above, truncation also improved the alignment distribution quite significantly, and also improved throughput. It also appeared to significantly shorten the list of extents created for our test files. For ScyllaDB, the right solution was to format XFS with a 4K block size. Truncating to the final size of the file wasn’t really an option for us because we can’t predict how big an SSTable will grow. Since sloppy_size’ing the files didn’t provide great results, we agreed that 4K-formatted XFS was the way to go. The throughput degradation we got using the higher io-properties numbers seems to be solved. We initially expected to see “improved performance” compared to the original low io-properties case (i.e., a higher measured write throughput). The success wasn’t obvious, though. It was rather hidden within the dashboards, as shown below. Here’s what disk utilization and IO flow ratio charts look like. The disk is fully utilized and clearly not overloaded anymore. And here are memtable and commitlog charts. These look very similar to the charts we got with the initial low io-properties numbers from the “When bigger instances don’t scale” article. Most likely, this means that’s what the test can do. The good news was hidden here. While the test went full speed ahead, the compaction job filled all the available throughput, from ~10GB/s (when commitlog+memtable were running at 4.5GB/s) to 14.5GB/s (when commitlog and memtable flush processes were done). The only thing left to check was whether the filesystem formatted with the 4K block size would cause read amplification on older disks with 512 bytes physical sector size. It turns out it didn’t. We were able to achieve similar IOPS on a RAID with 4 NVMes of the older type.   Next up: Part 3, Common performance pitfalls of modern storage I/O.
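To make the head-and-tail arithmetic behind this punchline concrete, here is a small illustrative sketch (not ScyllaDB or Seastar code) that counts how many device blocks a write covers only partially; those are the blocks the SSD controller must read-modify-write:

    /// Illustrative only: count the device blocks a write overlaps without fully covering.
    /// A block can be rewritten in place only when the write covers it entirely; partially
    /// covered head and tail blocks must first be read back by the SSD controller.
    fn rmw_blocks(offset_bytes: u64, len_bytes: u64, block_bytes: u64) -> u64 {
        let head = (offset_bytes % block_bytes != 0) as u64;
        let tail = ((offset_bytes + len_bytes) % block_bytes != 0) as u64;
        head + tail
    }

    fn main() {
        let (len, block) = (128 * 1024, 4 * 1024);
        // 1K-aligned offset (as seen on XFS formatted with a 1K block size): head and tail need RMW.
        println!("1K-aligned 128K write: {} RMW blocks", rmw_blocks(3 * 1024, len, block));
        // 4K-aligned offset (XFS formatted with a 4K block size): no RMW at all.
        println!("4K-aligned 128K write: {} RMW blocks", rmw_blocks(64 * 1024, len, block));
    }

With a 4K-formatted filesystem, 128K requests start and end on device-block boundaries, so the head and tail read-modify-write simply disappears.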

Inside the 2026 Monster Scale Summit Agenda

Monster Scale Summit is all about extreme scale engineering and data-intensive applications. So here’s a big announcement: the agenda is now available! Attendees can join 50+ tech talks, including:

  • Keynotes by antirez, Camille Fournier, Pat Helland, Joran Greef, Thea Aarrestad, Dor Laor, and Avi Kivity
  • Martin Kleppmann & Chris Riccomini chatting about the second edition of Designing Data-Intensive Applications
  • Tales of extreme scale engineering – Rivian, Pinterest, LinkedIn, Nextdoor, Uber Eats, Google, Los Alamos National Labs, CERN, and AmEx
  • ScyllaDB customer perspectives – Discord, Disney, Freshworks, ShareChat, SAS, Sprig, MoEngage, Meesho, Tiket, and Zscaler
  • Database engineering – Inside looks at ScyllaDB, turbopuffer, Redis, ClickHouse, DBOS, MongoDB, DuckDB, and TigerBeetle
  • What’s new/next for ScyllaDB – Vector search, tablets, tiered storage, data consistency, incremental repair, Rust-based drivers, and more

Like other ScyllaDB-hosted conferences (e.g., P99 CONF), the event will be free and virtual so everyone can participate. Take a look, register, and start choosing your own adventure across the multiple tracks of tech talks.

Full Agenda | Register [free + virtual]

When you join us March 11 and 12, you can…

  • Chat directly with speakers and connect with ~20K of your peers
  • Participate in interactive trainings on topics like real-time AI, database performance at scale, high availability, and cloud cost optimization strategies
  • Pick the minds of ScyllaDB engineers and architects, who are available to answer your toughest database performance questions
  • Win conference swag, sea monster plushies, book bundles, and other cool giveaways

Details, Details

The agenda site has all the scheduling, abstracts, and speaker details. Please note that times are shown in your local time zone. Be sure to scroll down to the Instant Access section. This is one of the best parts of Monster Scale Summit: you can access these sessions from the minute the event platform opens until the conference wraps. Some teams have shared that they use Instant Access to build their own watch parties beyond the live conference hours. If you do this, please share photos!

Another important detail: books. Quite a few of our speakers are accomplished book authors:

  • Camille Fournier – Platform Engineering, The Manager’s Path
  • Martin Kleppmann and Chris Riccomini – Designing Data-Intensive Applications 2E
  • Chris Riccomini – The Missing README
  • Dominik Tornow – Think Distributed Systems
  • Teiva Harsanyi – 100 Go Mistakes and How to Avoid Them, The Coder Cafe
  • Felipe Cardeneti Mendes – Database Performance at Scale

We will have book giveaways throughout the event, so be sure to attend live. All registrants get 30-day access to the complete O’Reilly platform, which includes all O’Reilly, Manning, and many other tech books. This includes the shiny new second edition of Designing Data-Intensive Applications, which publishes this month. Perfect timing…

Register now – it’s free

Getting Started with Database-Level Encryption at Rest in ScyllaDB Cloud

Learn about ScyllaDB database-level encryption with Customer-Managed Keys & see how to set up and manage encryption with a customer key — or delegate encryption to ScyllaDB ScyllaDB Cloud takes a proactive approach to ensuring the security of sensitive data: we provide database-level encryption in addition to the default storage-level encryption. With this added layer of protection, customer data is always protected against attacks. Customers can focus on their core operations, knowing that their critical business and customer assets are well-protected. Our clients can either use a customer-managed key (CMK, our version of BYOK) or let ScyllaDB Cloud manage the CMK for them. The feature is available in all cloud platforms supported by ScyllaDB Cloud. This article explains how ScyllaDB Cloud protects customer data. It focuses on the technical aspects of ScyllaDB database-level encryption with Customer-Managed Keys (CMK). Storage-level encryption Encryption at rest is when data files are encrypted before being written to persistent storage. ScyllaDB Cloud always uses encrypted volumes to prevent data breaches caused by physical access to disks. Database-level encryption Database-level encryption is a technique for encrypting all data before it is stored in the database. 
 The ScyllaDB Cloud feature is based on the proven ScyllaDB Enterprise database-level encryption at rest, extended with the Customer Managed Keys (CMK) encryption control. This ensures that the data is securely stored – and the customer is the one holding the key. The keys are stored and protected separately from the database, substantially increasing security. ScyllaDB Cloud provides full database-level encryption using the Customer Managed Keys (CMK) concept. It is based on envelope encryption to encrypt the data and decrypt only when the data is needed. This is essential to protect the customer data at rest. Some industries, like healthcare or finance, have strict data security regulations. Encrypting all data helps businesses comply with these requirements, avoiding the need to prove that all tables holding sensitive personal data are covered by encryption. It also helps businesses protect their corporate data, which can be even more valuable. A key feature of CMK is that the customer has complete control of the encryption keys. Data encryption keys will be introduced later (it is confusing to cover them at  the beginning). The customer can: Revoke data access at any time Restore data access at any time Manage the master keys needed for decryption Log all access attempts to keys and data Customers can delegate all key management operations to the ScyllaDB Cloud support team if they prefer this. To achieve this, the customer can choose the ScyllaDB key when creating the cluster. To ensure customer data is secure and adheres to all privacy regulations. By default, encryption uses the symmetrical algorithm AES-128, a solid corporate encryption standard covering all practical applications. Breaking AES-128 can take an immense amount of time, approximately trillions of years. The strength can be increased to AES-256. Note: Database-level encryption in ScyllaDB Cloud is available for all clusters deployed in Amazon Web Services (AWS) and Google Cloud Platform (GCP). Encryption To ensure all user data is protected, ScyllaDB will encrypt: All user tables Commit logs Batch logs Hinted handoff data This ensures all customer data is properly encrypted. The first step of the encryption process is to encrypt every record with a data encryption key (DEK). Once the data is encrypted with the DEK, it is sent to either AWS KMS or GCP KMS, where the master key (MK) resides. The DEK is then encrypted with the master key (MK), producing an encrypted DEK (EDEK or a wrapped key). The master key remains in the KMS, while the EDEK is returned and stored with the data. The DEK used to encrypt the data is destroyed to ensure data protection. A new DEK will be generated the next time new data needs to be encrypted. Decryption Because the original non-encrypted DEK is destroyed when the EDEK was produced, the data cannot be decrypted. The EDEK cannot be used to decrypt the data directly because the DEK key is encrypted. It has to be decrypted, and for that, the master key will be required again. This can only be decrypted with the master key(MK) in the KMS. Once the DEK is unwrapped, the data can be decrypted. As you can see, the data cannot be decrypted without the master key – which is protected at all times in the KMS and cannot be “copied” outside KMS. By revoking the master key, the customer can disable access to the data independently from the database or application authorization. Multi-region deployment Adding new data centers to the ScyllaDB cluster will create additional local keys in those regions. 
All master keys support multiple regions, and a copy of each key resides locally in each region, ensuring that multi-regional setups are protected against regional cloud provider outages and disasters. The keys are available in the same region as the data center and can be controlled independently.

If you use a customer key, cloud providers will charge you for the KMS: AWS charges $1/month, and GCP charges $0.06 per cluster, prorated per hour. Each additional data center creates a replica that is counted as an additional key. There is also an additional cost per key request. ScyllaDB Enterprise uses those requests efficiently, resulting in an estimated monthly cost of up to $1 for a 9-node cluster.

Managing encryption keys adds another layer of administrative work on top of the extra cost. ScyllaDB Cloud therefore also offers database clusters encrypted using keys managed by ScyllaDB support. They provide the same level of protection, but our support team helps you manage the master keys. The ScyllaDB keys are applied by default and are subsidized by ScyllaDB.

Creating a Cluster with Database-Level Encryption

Creating a cluster with database-level encryption requires:

  • A ScyllaDB Cloud account – if you don’t have one, you can create a ScyllaDB Cloud account here.
  • 10 minutes with a ScyllaDB key, or 20 minutes creating your own key.

To create a cluster with database-level encryption enabled, we will need a master key. We can either create a customer-managed key using the ScyllaDB Cloud UI, or skip this step completely and use a ScyllaDB Managed Key (which skips the next six steps). In both cases, all the data will be protected by strong encryption at the database level. Setting up the customer-managed key is covered in the database-level encryption documentation.

Transparent database-level encryption in ScyllaDB Cloud significantly boosts the security of your ScyllaDB clusters and backups.

Next Steps

Start using this feature in ScyllaDB Cloud. Get your questions answered in our community forum and Slack channel. Or, use our contact form.
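As a recap of the encryption and decryption flow described above, here is a schematic sketch. The Kms and Aead traits are illustrative stand-ins, not ScyllaDB’s implementation or any cloud KMS SDK’s API:

    /// The KMS holds the master key (MK); it only ever wraps and unwraps DEKs.
    trait Kms {
        fn wrap_dek(&self, dek: &[u8]) -> Vec<u8>;    // returns the EDEK ("wrapped key")
        fn unwrap_dek(&self, edek: &[u8]) -> Vec<u8>; // requires the master key held by the KMS
    }

    /// A symmetric cipher (e.g., AES-128 or AES-256) keyed by the DEK.
    trait Aead {
        fn encrypt(&self, dek: &[u8], plaintext: &[u8]) -> Vec<u8>;
        fn decrypt(&self, dek: &[u8], ciphertext: &[u8]) -> Vec<u8>;
    }

    /// What is persisted with the data: the ciphertext plus the wrapped key, never the plaintext DEK.
    struct EncryptedRecord {
        ciphertext: Vec<u8>,
        edek: Vec<u8>,
    }

    fn encrypt_record(kms: &dyn Kms, aead: &dyn Aead, dek: Vec<u8>, plaintext: &[u8]) -> EncryptedRecord {
        let ciphertext = aead.encrypt(&dek, plaintext);
        let edek = kms.wrap_dek(&dek);
        drop(dek); // the plaintext DEK is destroyed; a fresh one is generated for the next write
        EncryptedRecord { ciphertext, edek }
    }

    fn decrypt_record(kms: &dyn Kms, aead: &dyn Aead, record: &EncryptedRecord) -> Vec<u8> {
        let dek = kms.unwrap_dek(&record.edek); // only possible while the master key is accessible
        aead.decrypt(&dek, &record.ciphertext)
    }

    fn main() {
        // A real deployment would back Kms with AWS KMS or GCP KMS and Aead with AES; this
        // sketch only shows the shape of the flow. Revoking the master key makes unwrap_dek,
        // and therefore decryption, impossible.
    }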

We Built a Better Cassandra + ScyllaDB Driver for Node.js – with Rust

Lessons learned building a Rust-backed Node.js driver for ScyllaDB: bridging JS and Rust, performance pitfalls, and benchmark results This blog post explores the story of building a new Node.js database driver as part of our Student Team Programming Project. Up ahead: troubles with bridging Rust with JavaScript, a new solution being initially a few times slower than the previous one, and a few charts! Note: We cover the progress made until June 2025 as part of the ZPP project, which is a collaboration between ScyllaDB and University of Warsaw. Since then, the ScyllaDB Driver team adopted the project (and now it’s almost production ready). Motivation The database speaks one language, but users want to speak to it in multiple languages: Rust, Go, C++, Python, JavaScript, etc. This is where a driver comes in, acting as a “translator” of sorts. All the JavaScript developers of the world currently rely on the DataStax Node.js driver. It is developed with the Cassandra database in mind, but can also be used for connecting to ScyllaDB, as they use the same protocol – CQL. This driver gets the job done, but it is not designed to take full advantage of ScyllaDB’s features (e.g., shard-per-core architecture, tablets). A solution for that is rewriting the driver and creating one that is in-house, developed and maintained by ScyllaDB developers. This is a challenging task requiring years of intensive development, with new tasks interrupting along the way. An alternative approach is writing the new driver as a wrapper around an existing one – theoretically simplifying the task  (spoiler: not always) to just bridging the interfaces. This concept was proven in the making of the ScyllaDB C / C++ driver, which is an overlay over the Rust driver. We chose the ScyllaDB Rust driver as the backend of the new JavaScript driver for a few reasons. ScyllaDB’s Rust driver is developed and maintained by ScyllaDB. That means it’s always up to date with the latest database features, bug fixes, and optimizations. And since it’s written in Rust, it offers native-level performance without sacrificing memory safety. [More background on this approach] Development of such a solution skips the implementation of complicated database handling logic, but brings its own set of problems. We wanted our driver to be as similar as possible to the Node.js driver so anyone wanting to switch does not need to do much configuration. This was a restriction on one side. On the other side, we have limitations of the Rust driver interface. Driver implementations differ and the API for communicating with them can vary in some places. Some give a lot of responsibility to the user, requiring more effort but giving greater flexibility. Others do most of the work without allowing for much customization. Navigating these considerations is a recurring theme when choosing to write a driver as a wrapper over a different one. Despite the challenges during development, this approach comes with some major advantages. Once the initial integration is complete, adding new ScyllaDB features becomes much easier. It’s often just a matter of implementing a few bridging functions. All the complex internal logic is handled by the Rust driver team. That means faster development, fewer bugs, and better consistency across languages. On top of that, Rust is significantly faster than Node.js. So if we keep the overhead from the bridging layer low, the resulting driver can actually outperform existing solutions in terms of raw speed. 
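The section below compares the bridging options in detail; as a taste of what such a bridge looks like in practice, here is a minimal NAPI-RS-style export. It is a generic illustration, assuming a crate configured with the napi and napi_derive dependencies and built as a Node addon, and is not code from the ScyllaDB driver:

    use napi_derive::napi;

    /// Annotating a plain Rust function with #[napi] makes NAPI-RS generate the N-API glue,
    /// so JavaScript can load the compiled addon and call `add(2, 3)` directly.
    #[napi]
    pub fn add(a: u32, b: u32) -> u32 {
        a + b
    }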
The environment: Napi vs Napi-Rs vs Neon With the goal of creating a driver that uses ScyllaDB Rust Driver underneath, we needed to decide how we would be communicating between languages. There are two main options when it comes to communicating between JavaScript and other languages: Use  a Node API (NAPI for short) – an API built directly into the NodeJS engine, or Interface the program through the V8 JavaScript engine. While we could use one of those communication methods directly, they are dedicated for C / C++, which would mean writing a lot of unsafe code. Luckily, other options exist: NAPI-RS and Neon. Those libraries handle all the unsafe code required for using the C / C++ APIs and expose (mostly safe) Rust interfaces. The first option uses NAPI exclusively under the hood, while the Neon option uses both of those interfaces. After some consideration, we decided to use NAPI-RS over Neon. Here are the things we considered when deciding which library to use: – Library approach — In NAPI-RS, the library handles the serialization of data into the expected Rust types. This lets us take full advantage of Rust’s static typing and any related optimizations. With Neon, on the other hand, we have to manually parse values into the correct types. With NAPI-RS, writing a simple function is as easy as adding a #[napi] tag:   Simple a+b example And in Neon, we need to manually handle JavaScript context: A+b example in Neon – Simplicity of use — As a result of the serialization model, NAPI-RS leads to cleaner and shorter code. When we were implementing some code samples for the performance comparison, we had serious trouble implementing code in Neon just for a simple example. Based on that experience, we assumed similar issues would likely occur in the future. – Performance — We made some simple tests comparing the performance of library function calls and sending data between languages. While both options were visibly slower than pure JavaScript code, the NAPI-RS version had better performance. Since driver efficiency is a critical requirement, this was an important factor in our decision. You can read more about the benchmarks in our thesis. – Documentation — Although the documentation for both tools is far from perfect, NAPI-RS’s documentation is slightly more complete and easier to navigate. Current state and capabilities Note: This represents the state as of May 2025. More features have been introduced since then.  See the project readme for a brief overview of current and planned features. The driver supports regular statements (both select and insert) and batch statements. It supports all CQL types, including encoding from almost all allowed JS types. We support prepared statements (when the driver knows the expected types based on the prepared statement), and we support unprepared statements (where users can either provide type hints, or the driver guesses expected value types). Error handling is one of the few major functions that behaves differently than the DataStax driver. Since the Rust driver throws different types of errors depending on the situation, it’s nearly impossible to map all of them reliably. To avoid losing valuable information, we pass through the original Rust errors as is. However, when errors are generated by our own logic in the wrapper, we try to keep them consistent with the old driver’s error types. In the DataStax driver, you needed to explicitly call shutdown() to close the database connection. 
This generated some problems: when the connection variable was dropped, the connection sometimes wouldn’t stop gracefully, even keeping the program running in some situations. We decided to switch this approach, so that the connection is automatically closed when the variable keeping the client is dropped. For now, it’s still possible to call shutdown on the client. Note: We are still discussing the right approach to handling a shutdown. As a result, the behavior described here may change in the future. Concurrent execution The driver has a dedicated endpoint for executing multiple queries concurrently. While this endpoint gives you less control over individual requests — for example, all statements must be prepared and you can’t set different options per statement — these constraints allow us to optimize performance. In fact, this approach is already more efficient than manually executing queries in parallel (around 35% faster in our internal testing), and we have additional optimization ideas planned for future implementation. Paging The Rust and DataStax drivers both have built-in support for paging, a CQL feature that allows splitting results of large queries into multiple chunks (pages). Interestingly, although the DataStax driver has multiple endpoints for paging, it doesn’t allow execution of unpaged queries. Our driver supports the paging endpoints (for now, one of those endpoints is still missing) and we also added the ability to execute unpaged queries in case someone ever needs that. With the current paging API, you have several options for retrieving paged results: Automatic iteration: You can iterate over all rows in the result set, and the driver will automatically request the next pages as needed. Manual paging:  You can manually request the next page of results when you’re ready, giving you more control over the paging process. Page state transfer:  You can extract the current page state and use it to fetch the next page from a different instance of the driver. This is especially useful in scenarios like stateless web servers, where requests may be handled by different server instances. Prepared statements cache Whenever executing multiple instances of the same statement, it’s recommended to use prepared statements. In ScyllaDB Rust Driver, by default, it’s the user’s responsibility to keep track of the already prepared statements to avoid preparing them multiple times (and, as a result, increasing both the network usage and execution times). In the DataStax driver, it was the driver’s responsibility to avoid preparing the same query multiple times. In the new driver, we use Rust’s Driver Caching Session for (most) of the statement caching. Optimizations One of the initial goals for the project was to have a driver that is faster than the DataStax driver. While using NAPI-RS added some overhead, we hoped the performance of the Rust driver would help us achieve this goal. With the initial implementation, we didn’t put much focus on efficient usage of the NAPI-RS layer. 
When we first benchmarked the new driver, it turned out to be way slower compared to both the DataStax JavaScript driver and the ScyllaDB Rust driver… Operations scylladb-javascript-driver (initial version) [s] Datastax-cassandra-driver [s] Rust-driver [s] 62500 4.08 3.53 1.04 250000 13.50 5.81 1.73 1000000 55.05 15.37 4.61 4000000 227.69 66.95 18.43 Operations scylladb-javascript-driver (initial version) [s] Datastax-cassandra-driver [s] Rust-driver [s] 62500 1.63 2.61 1.08 250000 4.09 2.89 1.52 1000000 15.74 4.90 3.45 4000000 58.96 12.72 11.64 Operations scylladb-javascript-driver (initial version) [s] Datastax-cassandra-driver [s] Rust-driver [s] 62500 1.63 2.61 1.08 250000 4.09 2.89 1.52 1000000 15.74 4.90 3.45 4000000 58.96 12.72 11.64 Operations scylladb-javascript-driver (initial version) [s] Datastax-cassandra-driver [s] Rust-driver [s] 62500 1.96 3.11 1.31 250000 4.90 4.33 1.89 1000000 16.99 10.58 4.93 4000000 65.74 31.83 17.26 Those results were a bit of a surprise, as we didn’t fully anticipate how much overhead NAPI-RS would introduce. It turns out that using JavaScript Objects introduced way higher overhead compared to other built-in types, or Buffers. You can see on the following flame graph how much time was spent executing NAPI functions (yellow-orange highlight), which are related to sending objects between languages. Creating objects with NAPI-RS is as simple as adding the #[napi] tag to the struct we want to expose to the NodeJS part of the code. This approach also allows us to create methods on those objects. Unfortunately, given its overhead, we needed to switch the approach – especially in the most used parts of the driver, like parsing parameters, results, or other parts of executing queries. We can create a napi object like this: Which is converted to the following JavaScript class: We can use this struct between JavaScript and Rust. When accepting values as arguments to Rust functions exposed in NAPI-RS, we can either accept values of the types that implement the FromNapiValue trait, or accept references to values of types that are exposed to NAPI (these implement the default FromNapiReference trait). We can do it like this:   Then, when we call the following Rust function we can just pass a number in the JavaScript code. FromNapiValue is implemented for built-in types like numbers or strings, and the  FromNapiReference trait is created automatically when using the #[napi] tag on a Rust struct. Compared to that, we need to manually implement FromNapiValue for custom structs. However, this approach allows us to receive those objects in functions exposed to NodeJS, without the need for creating Objects – and thus significantly improves performance. We used this mostly to improve the performance of passing query parameters to the Rust side of the driver. When it comes to returning values from Rust code, a type must have a ToNapiValue trait implemented. Similarly, this trait is already implemented for built-in types, and is auto generated with macros when adding the #[napi] tag to the object. And this auto generated implementation was causing most of our performance problems. Luckily, we can also implement our own ToNapiValue trait. If we return a raw value and create an object directly in the JavaScript part of the code, we can avoid almost all of the negative performance impacts that come from the default implementation of ToNapiValue. We can do it like this: This will return just the number instead of the whole struct. An example of such places in the code was UUID. 
This type is used for providing the UUID retrieved as part of any query, and can also be used for inserts. In the initial implementation, we had a UUID wrapper:  an object created in the Rust part of the code, that had a default ToNapiValue implementation, that was handling all the logic for the UUID. When we changed the approach to returning just a raw buffer representing the UUID and handling all the logic on the JavaScript side, we shaved off about 20% of the CPU time we were using in the select benchmarks at that point in time. Note: Since completing the initial project, we’ve introduced additional changes to how serialization and deserialization works. This means the current state may be different from what we describe here. A new round of benchmarking is in progress; stay tuned for those results. Benchmarks In the previous section, we showed you some early benchmarks. Let’s talk a bit more about how we tested and what we tested. All benchmarks presented here were run on a single machine – the database was run in a Docker container and the driver benchmarks were run without any virtualization or containerization. The machine was running on AMD Ryzen™ 7 PRO 7840U with 32GB RAM, with the database itself limited to 8GB of RAM in total. We tested the driver both with ScyllaDB and Cassandra (latest stable versions as of the time of testing – May 2025). Both of those databases were run in a three node configuration, with 2 shards per node in the case of ScyllaDB. With this information on the benchmarks, let’s see the effect all the optimizations we added had on the driver performance when tested with ScyllaDB: Operations Scylladb-javascript-driver [s] Datastax-cassandra-driver [s] Rust-driver [s] scylladb-javascript-driver (initial version) [s] 62500 1.89 3.45 0.99 4.08 250000 4.15 5.66 1.73 13.50 1000000 13.65 15.86 4.41 55.05 4000000 55.85 56.73 18.42 227.69   Operations Scylladb-javascript-driver [s] Datastax-cassandra-driver [s] Rust-driver [s] scylladb-javascript-driver (initial version) [s] 62500 2.83 2.48 1.04 1.63 250000 1.91 2.91 1.56 4.09 1000000 4.58 4.69 3.42 15.74 4000000 16.05 14.27 11.92 58.96 Operations Scylladb-javascript-driver [s] Datastax-cassandra-driver [s] Rust-driver [s] scylladb-javascript-driver (initial version) [s] 62500 1.50 3.04 1.33 1.96 250000 2.93 4.52 1.94 4.90 1000000 8.79 11.11 5.08 16.99 4000000 32.99 36.62 17.90 65.74   Operations Scylladb-javascript-driver [s] Datastax-cassandra-driver [s] Rust-driver [s] scylladb-javascript-driver (initial version) [s] 62500 1.42 3.09 1.25 1.45 250000 2.94 3.81 2.43 3.43 1000000 9.19 8.98 7.21 10.82 4000000 33.51 28.97 25.81 40.74 And here are the same benchmarks, without the initial driver version.   Here are the results of running the benchmark on Cassandra. Operations Scylladb-javascript-driver [s] Datastax-cassandra-driver [s] Rust-driver [s] 62500 2.48 14.50 1.25 250000 5.82 19.93 2.00 1000000 19.77 19.54 5.16     Operations Scylladb-javascript-driver [s] Datastax-cassandra-driver [s] Rust-driver [s] 62500 1.60 2.99 1.48 250000 3.06 4.46 2.42 1000000 9.02 9.03 6.53   Operations Scylladb-javascript-driver [s] Datastax-cassandra-driver [s] Rust-driver [s] 62500 2.32 4.03 2.11 250000 5.45 6.53 4.01 1000000 18.77 16.20 13.21   Operations Scylladb-javascript-driver [s] Datastax-cassandra-driver [s] Rust-driver [s] 62500 1.86 4.15 1.57 250000 4.24 5.41 3.36 1000000 13.11 14.11 10.54 The test results across both ScyllaDB and Cassandra show that the new driver has slightly better performance on the insert benchmarks. 
For select benchmarks, it starts ahead and the performance advantage decreases with time. Despite a series of optimizations, the majority of the CPU time still comes from NAPI communication and thread synchronization (according to internal flamegraph testing). There is still some room for improvement, which we’re going to explore. Since running those benchmarks, we introduced changes that improve the performance of the driver. With those improvements performance of select benchmarks is much closer to the speed of the DataStax driver. Again…please stay tuned for another blog post with updated results. Shards and tablets Since the DataStax driver lacked tablet and shard support, we were curious if our new shard-aware and tablet-aware drivers provided a measurable performance gain with shards and tablets. Operations ScyllaDB JS Driver [s] DataStax Driver [s] Rust Driver [s] Shard-Aware No Shards Shard-Aware No Shards Shard-Aware No Shards 62,500 1.89 2.61 3.45 3.51 0.99 1.20 250,000 4.15 7.61 5.66 6.14 1.73 2.30 1,000,000 13.65 30.36 15.86 16.62 4.41 8.33 4,000,000 55.85 134.90 56.73 77.68 18.42 42.64   Operations ScyllaDB JS Driver [s] DataStax Driver [s] Rust Driver [s] Shard-Aware No Shards Shard-Aware No Shards Shard-Aware No Shards 62,500 1.50 1.52 3.04 3.63 1.33 1.33 250,000 2.93 3.29 4.52 5.09 1.94 2.02 1,000,000 8.79 10.29 11.11 11.13 5.08 5.71 4,000,000 32.99 38.53 36.62 39.28 17.90 20.67 In insert benchmarks, there are noticeable changes across all drivers when having more than one shard. The Rust driver improved by around 36%, the new driver improved by around 46%, and the DataStax driver improved by only around 10% when compared to the single sharded version. While sharding provides some performance benefits for the DataStax driver, which is not shard aware, the new driver benefits significantly more — achieving performance improvements comparable to the Rust driver. This shows that it’s not only introducing more shards that provide an improvement in this case; a major part of the performance improvement is indeed shard-awareness. Operations ScyllaDB JS Driver [s] DataStax Driver [s] Rust Driver [s] No Tablets Standard No Tablets Standard No Tablets Standard 62,500 1.76 1.89 3.67 3.45 1.06 0.99 250,000 3.91 4.15 5.65 5.66 1.59 1.73 1,000,000 12.81 13.65 13.54 15.86 3.74 4.41 Operations ScyllaDB JS Driver [s] DataStax Driver [s] Rust Driver [s] No Tablets Standard No Tablets Standard No Tablets Standard 62,500 1.46 1.50 2.92 3.04 1.33 1.33 250,000 2.76 2.93 4.03 4.52 1.94 1.94 1,000,000 8.36 8.79 7.68 11.11 4.84 5.08 When it comes to tablets, the new driver and the Rust driver see only minimal changes to the performance, while the performance of the DataStax driver drops significantly. This behavior is expected. The DataStax driver is not aware of the tablets. As a result, it is unable to communicate directly with the node that will store the data – and that increases the time spent waiting on network communication. 
Interesting things happen, however, when we look at the network traffic:

| What | Total packets | CQL | TCP | Total size |
| --- | --- | --- | --- | --- |
| New driver, 3 nodes, all traffic | 412,764 | 112,318 | 300,446 | ∼43.7 MB |
| New driver, 3 nodes, driver ↔ database | 409,678 | 112,318 | 297,360 | – |
| New driver, 3 nodes, node ↔ node | 3,086 | 0 | 3,086 | – |
| DataStax driver, 3 nodes, all traffic | 268,037 | 45,052 | 222,985 | ∼81.2 MB |
| DataStax driver, 3 nodes, driver ↔ database | 90,978 | 45,052 | 45,926 | – |
| DataStax driver, 3 nodes, node ↔ node | 177,059 | 0 | 177,059 | – |

This table shows the number of packets sent during the concurrent insert benchmark on three-node ScyllaDB with 2 shards per node. These results were obtained with RF = 1. While running the database with such a replication factor is not production-suitable, we chose it to better visualize the results. Looking at those numbers, we can draw the following conclusions:

• The new driver has a different coalescing mechanism. It has a shorter wait time, which means it sends more messages to the database and achieves lower latencies.
• The new driver knows which node(s) will store the data. This reduces internal traffic between database nodes and lets the database serve more traffic with the same resources.

Future plans

The goal of this project was to create a working prototype, which we managed to successfully achieve. It's available at https://github.com/scylladb/nodejs-rs-driver, but it's considered experimental at this point. Expect it to change considerably, with ongoing work and refactors. Some features that were present in the DataStax driver, and that are expected before the driver can be considered deployment-ready, are not yet implemented. The Drivers team is actively working to add those features. If you're interested in this project and would like to contribute, here's the project's GitHub repository.

When Bigger Instances Don’t Scale

A bug hunt into why disk I/O performance failed to scale on larger AWS instances

The promise of cloud computing is simple: more resources should equate to better, faster performance. When scaling up our systems by moving to larger instances, we naturally expect a proportional increase in capabilities, especially in critical areas like disk I/O. However, ScyllaDB's experience enabling support for the AWS i7i and i7ie instance families uncovered a puzzling performance bottleneck. Contrary to expectations, bigger instances simply did not scale their I/O performance as advertised.

This blog post traces the challenging, multi-faceted investigation into why IOTune (a disk benchmarking tool shipped with Seastar) was achieving a fraction of the advertised disk bandwidth on larger instances. On these machines, throughput plateaued at a modest 8.5GB/s and IOPS were much lower than expected on increasingly beefy machines. What followed was a deep dive into the internals of the ScyllaDB IO Scheduler, where we uncovered subtle bugs and incorrect assumptions that conspired to constrain performance scaling. Join us as we investigate the symptoms, pin down the root cause, and share the hard-fought lessons learned on this long journey.

This blog post is the first in a three-part series detailing our journey to fully harness the performance of modern cloud instances. While this piece focuses on the initial set of bottlenecks within the IO Scheduler, the story continues in two subsequent posts:

• Part 2, The deceptively simple act of writing to disk, tracks down a mysterious write throughput degradation we observed in realistic ScyllaDB workloads after applying the fixes discussed here.
• Part 3, Common performance pitfalls of modern storage I/O, summarizes the invaluable lessons learned and provides a consolidated list of performance pitfalls to consider when striving for high-performance I/O on modern hardware and cloud platforms.

Problem description

Some time ago, ScyllaDB decided to support the AWS i7i and i7ie families. Before we support a new instance type, we run extensive tests to ensure ScyllaDB squeezes every drop of performance out of the provisioned hardware. While measuring disk capabilities with the Seastar IOTune tool, we noticed that the IOPS and bandwidth numbers didn't scale well with the size of the instance, and we ended up with much lower values than AWS advertised.

Read IOPS were on par with AWS specs up to i7i.4xlarge, but they were getting progressively worse, up to 25% lower than spec on i7i.48xlarge. Write IOPS were worse, starting at around 25% less than spec for i7i.4xlarge and up to 42% less on i7i.48xlarge. Bandwidth numbers were even more interesting. Our IOTune measurements were similar to fio up to the i7i.4xlarge instance type. However, as we scaled up the instance type, our IOTune bandwidth numbers were plateauing at around 8.5GB/s while fio was managing to pull up to 40GB/s throughput for i7i.48xlarge instances.

Essential Toolkit

The IOTune tool is a disk benchmarking tool that ships with Seastar. When you run this tool on a storage mount point, it outputs 4 values corresponding to the read/write IOPS and read/write bandwidth of the underlying storage system. These 4 values end up in a file called io-properties.yaml. When provided with these values, the Seastar IO Scheduler will build a model of the disk, which it will use to help ScyllaDB maximize the drive's performance.
The IO Scheduler models the disk based on the IOPS and bandwidth properties using a formula that looks something like:

read_bw/read_bw_max + write_bw/write_bw_max + read_iops/read_iops_max + write_iops/write_iops_max <= 1

The internal mechanics of how the IO Scheduler works are described very thoroughly in the blog post I linked above.

The io_tester tool is another utility within the Seastar framework. It's used for testing and profiling I/O performance, often in more controlled and customizable scenarios than the automated IOTune. It allows users to simulate specific I/O workloads (e.g., sequential vs. random, various request sizes, and concurrency levels) and measure resulting metrics like throughput and latency. It is particularly useful for:

• Deep-dive analysis: Running experiments with fine-grained control over I/O parameters (e.g., --io-latency-goal, request size, parallelism) to isolate performance characteristics or potential bottlenecks.
• Regression testing: Verifying that changes to the IO Scheduler or underlying storage stack do not negatively impact I/O performance under various conditions.
• Fair Queue experimentation: As shown in this investigation, io_tester can be used to observe the relationship between configured workload parameters, the resulting in-disk queue lengths, and the throttling behavior of the IO Scheduler.

What this meant for ScyllaDB

We didn't want to enable i7i instances if the IOTune numbers didn't accurately reflect the underlying disk performance of the instance type. Lower io-properties numbers cause the IO Scheduler to overestimate the cost of each request. This leads to more throttling, making monstrous instances like i7i.48xlarge perform like much cheaper alternatives (such as the i7i.4xlarge, for example).

Pinning the symptoms

Early on, we noticed that the observed symptoms pointed to two different problems. This helped us narrow down the root causes much faster (well, "fast" is a very misleading term here). We were chasing a lower-than-expected IOPS issue and a separate low-bandwidth issue. IOPS and bandwidth numbers behaved differently when scaling up instances. The former scaled, but with much lower values than we expected. The latter would just plateau at one point and stay there, no matter how much money you'd throw at the problem.

We started with the hypothesis that IOTune might misdetect the disk's physical block size from sysfs, so that we issue requests with a different size than what the disk "likes," leading to lower IOPS. After some debugging, we confirmed that IOTune indeed failed to detect the block size, so it defaulted to using requests of 512 bytes. There's no bug to fix on the IOTune side here, but we decided we needed to be able to specify the disk block size for reads and writes independently when measuring. This turned out to be quite helpful later on. With 4K requests, we were able to measure the expected ~1M IOPS for writes, compared to the ~650k IOPS we were getting with the autodetected 512-byte requests (numbers relevant for the i7i.12xlarge instance).

We had a fix for the IOPS issue, but – as we discovered later – we didn't properly understand the actual root cause. At that point, we thought the problem was specific to this instance type and caused by IOTune misdetecting the block size. As you'll see in the next blog post in the series, the root cause is a lot more interesting and complicated. The plateauing bandwidth issue was still on the table. Unfortunately, we had no clue about what could be going on.
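As a side note on the block-size detection mentioned above: Linux exposes the relevant attributes under /sys/block/<device>/queue/. The snippet below is a minimal, illustrative sketch of that kind of lookup; the device name and the 512-byte fallback are assumptions for the example, and this is not IOTune's actual code.

```cpp
// Minimal sketch: reading a block device's logical and physical block sizes
// from sysfs, the kind of lookup a benchmarking tool can use to pick a
// sensible request size. Illustrative only -- not IOTune's actual code.
#include <fstream>
#include <iostream>
#include <string>

static unsigned long read_sysfs_value(const std::string& path) {
    std::ifstream f(path);
    unsigned long value = 0;
    f >> value;                 // stays 0 if the attribute is missing/unreadable
    return value;
}

int main() {
    const std::string dev = "nvme0n1";  // example device name (assumption)
    const std::string base = "/sys/block/" + dev + "/queue/";
    unsigned long logical  = read_sysfs_value(base + "logical_block_size");
    unsigned long physical = read_sysfs_value(base + "physical_block_size");

    // Fall back to 512 bytes when detection fails -- the situation that led
    // to the low IOPS numbers described above.
    unsigned long request_size = physical ? physical : 512;
    std::cout << "logical=" << logical
              << " physical=" << physical
              << " chosen request size=" << request_size << "\n";
}
```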
So, we started exploring the problem space, concentrating our efforts as you'd imagine any engineer would.

Blaming the IO Scheduler

We dug around, trying to see if IOTune became CPU-limited for the bandwidth measurements. But that wasn't it. It's somewhat amusing that our initial reaction was to point the finger at the IO Scheduler. This bias stems from when the IO Scheduler was first introduced in ScyllaDB. It had such a profound impact that numerous performance issues over time – things that were propagating downward to the storage team – were often (and sometimes unfairly) attributed to it.

Understanding the root cause

We went through a series of experiments to try to narrow down the problem further and hopefully get a better understanding of what was happening. Most of the experiments in this article, unless explicitly specified, were run on an i7i.12xlarge instance. The expected throughput was ~9.6GB/s, while IOTune was measuring a write throughput of 8.5GB/s.

To rule out poor disk queue utilization, we ran fio with various iodepths and block sizes, then recorded the bandwidth. We noticed that a request needs to be ~4MB to fill the disk queue. Next, we collected the same data for io_tester with --io-latency-goal=1000 to prevent the queue from splitting requests. A larger latency goal means the scheduler can be more relaxed and submit requests as they come, because it has plenty of time (1000 ms) to complete each request on time. If the goal is smaller, the IO Scheduler gets stressed because it needs to make each request complete within that tight schedule. Sometimes it might just split a request in half to take advantage of the in-disk parallelism and hopefully make the original request fit the tight latency goal.

The fio tool seemed to be pulling the full bandwidth from the disk, but our io_tester tool was not. The issue was definitely on our side. The good news was that both io_tester and IOTune measured similar write throughputs, so we weren't chasing a bug in our measurement tools. The conclusion of this experiment was that we saturated the disk queue properly, but we still got low bandwidth.

Next, we pulled an ace out of the sleeve. A few months before this, we were at a hackathon during our engineering summit. During that hackathon, our Storage and Networking team built a prototype Seastar IO Scheduler controller that would bring more transparency and visibility into how the IO Scheduler works. One of the patches from that project was a hack that makes the IO Scheduler drop a lot of the IOPS/bandwidth throttling logic and just drain like crazy whatever requests are queued. We applied that patch to Seastar and ran the IOTune tool again. It was very rewarding to see the following output:

Measuring sequential write bandwidth: 9775 MB/s (deviation 6%)
Measuring sequential read bandwidth: 13617 MB/s (deviation 22%)

The bandwidth numbers escaped the 8.5GB/s limit that was previously constraining our measurements. This meant we were correct in blaming the IO Scheduler. We were indeed experiencing throttling from the scheduler, specifically from something in the fair queue math.

At that point, we needed to look more closely at the low-level behavior. We patched Seastar with another home-brewed project that adds a low-overhead binary tracer to the IO Scheduler. The plan was to run the tracer on both the master version and the one with the hackathon patch applied – then try to understand why the hackathon-patched scheduler performs better.
We added a few traces and we immediately started to see patterns like these in the slow master trace:

Here, it took 134-51=83us to dispatch one request. The "Q" event is when a request arrives at the scheduler and gets queued. "D" stands for when a request gets dispatched. For reference, the patched IO Scheduler spent 1us to dispatch a request.

The unexpected behavior suggested an issue with the token bucket accumulation, as requests should be dispatched instantly when running without io-properties.yaml (effectively providing unlimited tokens). This is precisely the scenario when IOTune is running: it withholds io-properties.yaml from the IO Scheduler. This allows the token bucket to operate with unlimited tokens, stressing the disk to its maximum potential so IOTune can compute, by itself, the required io-properties.yaml. The token bucket seems to run out of tokens…but why? When the token bucket runs out of tokens, it needs to wait for tokens to be replenished as other requests complete. This delays the dispatch of the next request. That's why the above request waited 83us to get dispatched when it should have been dispatched in 1us.

There wasn't much more we could do with the event tracer. We needed to get closer to the fair queue math. We returned to io_tester to examine the relationship between the parallelism of the test and the size of the in-disk queues. We ran io_tester for requests sized within [128k, 1MB] with parallelism within [1, 2, 4, 8, 16] fibers. We ran it once for the master branch (slow) and once for the "hackathon" branch (fast). Here are some plots from these results. The plots show throughput (vertical axis) against parallelism (horizontal) for two request sizes, 1MB and 128kB. For both request sizes, the "hackathon" branch outperformed the "master" branch. Also, the 1MB request saturates the disk with much lower parallelism than the 128k request. No surprises here; the result wasn't that valuable.

In a follow-up test, we collected the in-disk latencies as well. We plotted throughput against parallelism for both the master and hackathon branches. The lines crossing the bars represent the measured in-disk latencies. This is already much better. After the disk is saturated, increasing parallelism should create a proportional increase in in-disk latency. That's exactly what happens for the hackathon branch. We couldn't say the same about the master branch. Here, the throughput plateaued around 4 fibers, and the in-disk latency didn't grow! For some reason, we didn't end up stressing this disk.

To investigate further, we wanted to see the size of the actual in-disk queues. So, we coded up a patch to make io_tester output this information. We plotted the in-disk queue size alongside parallelism for various request sizes. At this point, it became clear that we weren't sufficiently leveraging the in-disk parallelism. Likely, the fair_queue math was making the IO Scheduler throttle requests excessively. This is indeed what the plots below show. In the master (slow) branch run, the in-disk queue length for the 1MB request (which saturates the disk faster) plateaus at around 4 requests once parallelism reaches 4 and higher. That's definitely not alright.

Just for fun, let's look at Little's Law in action. Little's Law says that concurrency equals throughput times latency (L = λW), so dividing the in-disk queue length by the in-disk latency estimates the throughput each branch actually extracts from the disk. We plotted disk_queue_length / latency for each branch as follows.

Next, we wanted to (somehow) replicate this behavior without involving an actual disk. This way, we could maybe create a regression test for the IO Scheduler.
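As an aside, here is a deliberately simplified token-bucket sketch that captures the behavior described above. It is not Seastar's actual implementation and the numbers are made up: a request consumes tokens at dispatch, tokens only come back on completion, so an exhausted bucket delays the next dispatch, while an effectively unlimited bucket should never block (which is what the unconfigured IOTune run relies on).

```cpp
// Deliberately simplified token bucket; not Seastar's implementation.
#include <cstdint>
#include <iostream>

struct TokenBucket {
    uint64_t tokens;            // tokens currently available

    // Returns true if the request can be dispatched immediately,
    // false if it must wait for replenishment from completions.
    bool try_dispatch(uint64_t cost) {
        if (cost > tokens) {
            return false;       // bucket exhausted: dispatch is delayed
        }
        tokens -= cost;
        return true;
    }

    // Tokens come back as in-flight requests complete.
    void on_completion(uint64_t cost) { tokens += cost; }
};

int main() {
    // "Unconfigured" case: capacity is meant to be effectively unlimited,
    // so every request should dispatch instantly.
    TokenBucket unlimited{UINT64_MAX};
    std::cout << unlimited.try_dispatch(1'000'000) << "\n";   // prints 1: instant

    // A bucket that is too small behaves like the slow trace above: once it
    // runs dry, the next dispatch has to wait for completions to refill it.
    TokenBucket limited{1'000};
    limited.try_dispatch(800);
    std::cout << limited.try_dispatch(800) << "\n";           // prints 0: must wait
}
```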
The Seastar ioinfo tool was perfect for replicating the behavior without a real disk. ioinfo can take an io-properties.yaml file as an argument. It feeds the values to the IO Scheduler, and then outputs the token bucket parameters (which can be used to calculate the theoretical IOPS and throughput values that the IO Scheduler can achieve). Our goal was to compare these calculated values with what was configured in an io-properties.yaml file and make sure the IO Scheduler could deliver very close to what it was configured for. For reference, here's how the calculated IOPS/bandwidth looked compared to the configured values. The values returned by the scheduler were within a 5% margin of the configured ones.

This was fantastic news (in a way). It meant the fair_queue math behaves correctly even with bandwidths above 8.xGB/s. We didn't get the regression test we hoped for, since the fair_queue math was not causing the throttling and disk underutilization we'd seen in the previous experiment. However, we did add a test that would check whether this behavior changes in the future. We did get a huge win from this, though. We came to the conclusion that something in the fair_queue math, or elsewhere in the IO Scheduler, must be going wrong only when it's not configured with an io-properties file. At that point, the problem space narrowed significantly.

Playing around with the inputs from the io-properties.yaml file, we uncovered yet another bug. For large enough read IOPS/bandwidth numbers in the config file, the IO Scheduler would report request costs of zero. After many discussions, we learned that this is not really a bug. It's how the math should behave. With big io-properties numbers, the math should plateau the costs at 0. It makes sense: the more resources you have available, the less significant a single unit of effort becomes. This led us to an important realization: the unconfigured case (our original issue) should also produce a cost of zero. A zero cost means that the token bucket won't consume any tokens. That gives us unbounded output…which is exactly what IOTune wants. Now we needed to figure out two things:

• Why doesn't the IO Scheduler report a cost of zero for the unconfigured case? In theory, it should. In the issue linked above, costs became zero for values that weren't even close to UINT64_MAX.
• Was our code prepared to handle costs of zero? We should ensure we don't end up with weird overflows, divisions by zero, or any undefined behavior from code that assumes costs can't be zero.

When things start to converge

At this point, we had no further leads, so we thought there must be something wrong with the fair queue math. I reviewed the math from Implementing a New IO Scheduler Algorithm for Mixed Read/Write Workloads, but I didn't find any obvious flaws that could explain our unconfigured case. We hoped we'd find some formula mistake that made the bandwidth hit a theoretical limit at 8.5GB/s. We didn't find any obvious issues here, so we concluded there must be some flaw in the implementation of the math itself. We started suspecting that there must be some overflow that ends up shrinking the bandwidth numbers. After quite some time tracking the math implementation in the code, we managed to find the issue. Two internal IO Scheduler variables that store the IOPS and bandwidth values configured via io-properties.yaml had a default value set to `std::numeric_limits<int>::max()`.
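To give a sense of why that default matters, here is a tiny, self-contained illustration of the two sentinel values involved. It is not the actual Seastar code, just the limits side by side.

```cpp
#include <cstdint>
#include <iostream>
#include <limits>

int main() {
    // The "effectively unlimited" sentinel that was actually used...
    auto bad_default  = std::numeric_limits<int>::max();       // 2,147,483,647
    // ...versus the one the unconfigured case really needs.
    auto good_default = std::numeric_limits<uint64_t>::max();  // ~1.8e19

    std::cout << "int max:      " << bad_default  << "\n";
    std::cout << "uint64_t max: " << good_default << "\n";

    // Any math that treats the sentinel as "no limit" can silently inherit a
    // ~2^31 ceiling when the smaller type's maximum is used -- which is how a
    // hard throughput plateau can hide behind an innocent-looking default.
}
```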
It wasn't that intuitive to figure out – the variables weren't holding the actual io-properties values, but rather values derived from them. This made the mistake harder to spot. There is some code that recalculates those variables when the io-properties.yaml file is provided and parsed by the Seastar code. However, in the "unconfigured" case, those code paths are intentionally not hit. So, the INT_MAX values were carried into the fair queue math, crunched through the formulas, and resulted in the 8.xGB/s throughput limit we kept seeing. The fix was as simple as changing the default value to `std::numeric_limits<uint64_t>::max()`. A one-line fix for many weeks of work.

It's been a crazy long journey chasing such a small bug, but it has been an invaluable (and fun!) learning experience. It led to lots of performance gains and enabled ScyllaDB to support highly efficient storage instances like i7i, i7ie and i8g. Stay tuned for the next episode in this series of blog posts. In part 2, we will uncover that the performance gains after this work weren't quite what we were expecting on realistic workloads. We will deep dive into some very dark corners of modern NVMe drives and filesystems to unlock a significant chunk of write throughput.

Scaling Is the “Funnest” Game: Rachel Stephens and Adam Jacob

When not to worry about scale, when to rearchitect everything and why passionate criticism is a win

"There's no funner game than the at-scale technology game. But if you play it, some people will hate you for it. That's okay…that's the game you chose to play." – Adam Jacob

At Monster Scale Summit 2025, Rachel Stephens, research director at RedMonk, spoke with Adam Jacob, co-founder of Chef and CEO of System Initiative, about what it really means to build and operate software at scale.

Note: Monster SCALE Summit 2026 will go live March 11-12, featuring antirez, creator of Redis; Camille Fournier, author of "The Manager's Path" and "Platform Engineering"; Martin Kleppmann, author of "Designing Data-Intensive Applications"; and more than 50 others. The event is free and virtual. Register for free and join the community for some lively chats.

The Existential Question of Scale

Stephens opened with an existential question: "Does your software exist if your users can't run it?" Yes, your code still exists in GitHub even if us-east-1 goes down. But what if…

• Your system crawls under load.
• Critical integrations constantly break.
• You can't afford the infrastructure costs.

"Software at scale isn't just about throughput," Stephens said. "It's about making sure that your code endures, adapts and remains accessible no matter the load and location of where you're running. Because if your users can't use it, your software may as well not exist."

With that framing, Stephens brought in someone who's spent his career dealing with scale firsthand: Adam Jacob.

Only Scale When It Hurts

Stephens asked Jacob how teams can balance quality, speed and scale under uncertainty. How do you avoid both cutting corners and premature optimization? Jacob argues that early on, it's fine not to worry much about scale. Most products fail for other reasons before scalability ever becomes a problem. He explained: "I think of it basically through the lens of optionality. When you start building new things, it's nice not to worry too much about scale, because you may never reach it. Most products don't fail because they fail to scale. Think about how badly Twitter failed to scale … and yet here we are."

The first priority is to build a solid product. Once scale becomes a real issue, that's when it makes sense to refactor and remove bottlenecks. But if you've been around the block a little, your experience helps you make early choices that pay off later. Jacob noted, "Premature optimization is real. But as you gain experience, there are some decisions you make early because you know that if things work out, you'll be happier later — like factoring your code so it can be broken apart across network boundaries over time, if you need to."

Chef Scalability Horror Stories

Next, Stephens asked Jacob if he would share a scaling horror story from his Chef days. Jacob obliged and offered two memorable ones.

"The best was when we launched the first version of Hosted Chef. The day before the launch, we discovered it took about a minute and a half to create a new user. It didn't take that long when we were running it on a laptop, but it did later … and we never really tested it. So, in the final hours before launch, we changed it from 'anyone can sign up' to a queue system with a little space robot saying, 'Demand is so high; we'll get back to you.' We just papered over the scalability problem."

"Another example: that same Chef server (the one that couldn't create accounts quickly) eventually had to work at Facebook.
The original version was written in Rails, which was great to work with, but not parallel enough. At Facebook scale, you might have 40,000 or 50,000 things pointed at one Chef server. So we rebuilt it in Erlang, which is great for that kind of problem. I literally brought the Erlang version to Facebook on a USB stick. When we installed it and bootstrapped a data center, we thought it was broken because it was using less compute and finished almost instantly."

Jacob explained that if they'd tried to build the Chef server in Erlang from the start, the project probably wouldn't have gained traction. Starting in Rails made it possible to get Chef out into the world and learn what the system really needed to do. Only later, once they understood how the system really behaved, could they rebuild it with the right architecture and runtime for scale.

Growth or Efficiency: Know Which Game You're Playing

At Chef, scaling was ultimately required to land customers like Facebook and JPMorgan Chase, which operate at massive scale. Jacob advised, "Making it scale required major investment, but it worked. You can't buy your core. If it matters to customers, you have to build it yourself. People often wait too long to realize they have a deep architectural problem that's also a business problem. Rebuilding for scale takes months, so you have to start early."

Your own approach to scale should ultimately be driven by what game you're playing:

• In the venture-capital game, growth and traction come first. You can spend money to scale faster because you're funded.
• In the profitability game, efficiency comes first. Overspending on compute or poor architecture hits the bottom line hard.

Why Scaling is the 'Funnest' Game

Stephens mentioned that "when software succeeds, it stops being yours – it becomes everyone's." She then asked Jacob what it's like when your tech scales to the point that people have extremely strong opinions about it. His response:

"It's hard to build things that people care about. If you're lucky enough to create something you love and share it with the world and people love it back, that's incredibly rewarding. Even when they don't, that's still a gift."

"Someone once tapped me in a coffee shop and said, 'You wrote Chef? I hate Chef.' I said, 'I'm sorry; I didn't write it to hurt you.' But at scale, that means he used it. It mattered in his life. And that's what you want: for people to experience what you built."

"I love the technology, the problem, the difficulty. Scaling adds more layers of complexity, more layers of fun. There's no funner game than the at-scale technology game. But if you play it, some people will hate you for it. That's okay…that's the game you chose to play."

You can watch the full talk below.

You Got OLAP in My OLTP: Can Analytics and Real-Time Database Workloads Coexist?

Explore isolation mechanisms and prioritization strategies that allow different database workloads to coexist without resource contention issues

Analytics (OLAP) and real-time (OLTP) workloads serve distinctly different purposes. OLAP (online analytical processing) is optimized for data analysis and reporting, while OLTP (online transaction processing) is optimized for real-time, low-latency traffic. Most databases are designed to primarily benefit from either OLAP or OLTP, but not both. Worse, concurrently running both workloads under the same data store will frequently introduce resource contention. The workloads end up hurting each other, considerably dragging down the overall distributed system's performance. Let's look at how this problem arises, then consider a few ways to address it.

OLTP vs OLAP Databases

There are basically two fundamental approaches to how databases store data on disk. We have row-oriented databases, often used for real-time workloads. These store all the data pertaining to a single row together on disk.

[Figure: Row-oriented storage (ideal for OLTP) vs. column-oriented storage (ideal for OLAP)]

On the other side of the spectrum, we have column-oriented databases, which are often used for running analytics. These databases store data vertically (versus the horizontal partitioning of rows). This single design decision makes it much easier and more efficient for the database to run aggregations, perform calculations and retrieve insights such as "Top K" metrics. (A toy code sketch of the two layouts appears at the end of this section.)

OLTP vs. OLAP Workloads

So the general consensus is that if you want to run OLTP workloads, you use a row-oriented database – and you use a columnar one for your analytics workloads. However, contrary to popular belief, there are a variety of reasons why people might actually want to run an OLAP workload on top of their real-time database. For example, this might be a good option when organizations want to avoid data duplication or the complexity and overhead associated with maintaining two data stores. Or maybe they don't extract insights all that often.

The Latency Problem

But problems can arise when you try to bring OLAP to your real-time database. We've studied this a lot with ScyllaDB, a specialized NoSQL database that's primarily meant for high-throughput, low-latency real-time workloads. The following graphic from ScyllaDB monitoring demonstrates what happens to latency when you try to run OLAP and OLTP workloads alongside one another. The green line represents a real-time workload, whereas the yellow one represents an analytics job running at the same time. While the OLTP workload is running on its own, latencies are great. But as soon as the OLAP workload starts, the real-time latencies dramatically rise to unacceptable levels.

The Throughput Problem

Throughput is also an issue in such scenarios. Looking at the throughput clarifies why latencies climbed: the analytics process is consuming much higher throughput than the OLTP one. You can even see that the real-time throughput drops, which is a sign that the database got overloaded. Unsurprisingly, as soon as the OLAP job finishes, the real-time throughput increases and the database can then process its backlog of queued requests from that workload. That's how the contention plays out in the database when you have two totally different workloads competing for resources in an uncoordinated way. The database is naively trying to process requests as they come in.
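To make the row-versus-column distinction above concrete, here is a toy sketch of the two layouts; the field names are made up and this is not how any particular database stores data. The point is simply that an aggregation over one column only touches contiguous memory in the columnar layout, which is why columnar stores excel at scans and "Top K" style queries.

```cpp
// Toy illustration of row-oriented vs column-oriented layouts.
// Field names are invented for the example.
#include <cstdint>
#include <iostream>
#include <vector>

// Row-oriented: each record's fields sit together (good for point lookups).
struct OrderRow {
    uint64_t order_id;
    uint32_t merchant_id;
    double   amount;
};

// Column-oriented: each field is stored contiguously (good for aggregations).
struct OrderColumns {
    std::vector<uint64_t> order_id;
    std::vector<uint32_t> merchant_id;
    std::vector<double>   amount;
};

int main() {
    std::vector<OrderRow> rows = {{1, 10, 9.5}, {2, 11, 20.0}, {3, 10, 4.25}};

    OrderColumns cols;
    for (const auto& r : rows) {            // same data, columnar layout
        cols.order_id.push_back(r.order_id);
        cols.merchant_id.push_back(r.merchant_id);
        cols.amount.push_back(r.amount);
    }

    // "Total sales" only needs the amount column: a single contiguous scan.
    double total = 0;
    for (double a : cols.amount) total += a;
    std::cout << "total sales: " << total << "\n";
}
```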
When Things Get Contentious

But why does this contention happen in the first place? If you overwhelm your database with too many requests, it cannot keep up. Usually, that's because your database lacks either the CPU or the I/O capacity required to fulfill your requests. As a result, requests queue up and latency climbs. The nature of the workloads contributes to contention too. OLTP applications often process many smaller transactions and are very latency sensitive. OLAP ones, however, generally run fewer transactions that require scanning and processing large amounts of data. So hopefully that explains the problem. But how do we actually solve it?

Option A: Physical Isolation

One option is to physically isolate these resources. For example, in a Cassandra deployment, you would simply add a new data center and separate your real-time processing from your analytics. This saves you from having to stream data and work with a different database. However, it considerably elevates your costs. Some specific examples of this strategy:

• Instaclustr, a managed services provider, shared a benchmark after isolating its deployments (Apache Spark and Apache Cassandra).
• GumGum shared the results of this approach (with multiregion Cassandra) at a past Cassandra Summit.

There are definitely use cases and organizations running OLAP on top of real-time databases. But are there any other alternatives to resolve the problem altogether?

Option B: Scheduled Isolation

Other teams take a different approach: they avoid running their OLAP jobs during their peak periods. They simply run their analytics pipelines during off-peak hours in order to mitigate the impact on latencies. For example, consider a food delivery company. Answering a question like "How much did this merchant sell within the past week?" is simple in OLTP. However, offering discounts to the 10 top-selling restaurants within a given region is much more complicated. In a wide-column database like Cassandra or ScyllaDB, it inevitably requires a full table scan. Therefore, it would make sense for such a company to run these analytics from after midnight until around 10 a.m. – before its peak traffic hours. This is a doable strategy, but it still doesn't solve the problem. For example, what if your dataset doubles or triples? Your pipeline might overrun your time window. And you have to consider that your business is still running at that time (people will still order food at 2 a.m.). If you take this approach, you still need to tune your analytics job and ensure it doesn't kill your database.

Option C: Workload Prioritization

ScyllaDB has developed an approach called Workload Prioritization to address this problem. It lets users define separate workloads and assign different resource shares to them. For example, you might define two service levels: the main one has 600 shares, and the secondary one has 200 shares.

CREATE SERVICE LEVEL main WITH shares = 600
CREATE SERVICE LEVEL secondary WITH shares = 200

ScyllaDB's internal scheduler will process three times more tasks from the main workload than from the secondary one. Whenever the system is under contention, it prioritizes its resource allocation accordingly. Why does this kick in only during contention? Because if there's no contention, there is no bottleneck, so there is effectively nothing to prioritize. [Play with an interactive animation]

Workload Prioritization Under the Hood

Under the hood, ScyllaDB's Workload Prioritization relies on Seastar scheduling groups.
Seastar is a C++ framework for data-intensive applications. ScyllaDB, Redpanda, Ceph's SeaStore and other technologies are built on top of it. Scheduling groups are effectively the way Seastar allows background operations to have little impact on foreground activities. In ScyllaDB and database-specific terms, there are several different scheduling groups within the database. ScyllaDB has a distinct group for compactions, streaming, memtables, and so on. With Cassandra, you might end up in a situation where compactions impact your workload performance. But in ScyllaDB, all compaction resources are scheduled by Seastar. According to its share of resources, the database allocates a corresponding portion of resources to the background activity (compaction, in this case), ensuring that the latency of the primary user-facing workload doesn't suffer.

Using scheduling groups in this way also helps the database auto-tune. If the user workload is running during off-peak hours, the system will automatically have more spare computing and I/O cycles to spend, and the database will simply speed up its background activities. Here's a guided tour of how Workload Prioritization actually plays out:

OLTP and OLAP Can Coexist

Running OLAP alongside OLTP inevitably involves anticipating and managing contention. You can control it in a few ways: isolate analytics on its own cluster, run it in off-peak windows, or enforce workload prioritization. And workload prioritization isn't just for allowing OLAP alongside your OLTP. The same approach could also be used to assign different priorities to reads vs. writes, for example. If you'd like to learn more, take a look at my recent tech talk on this topic: "How to Balance Multiple Workloads in a Cluster."
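To make the shares model concrete, here is a toy, self-contained sketch of proportional-share scheduling. It is not Seastar's scheduling group implementation (nor ScyllaDB's service level machinery); it only illustrates how a 600:200 share split translates into roughly a 3:1 task ratio under contention.

```cpp
// Toy shares-based scheduler: two queues, picked proportionally to their
// shares. Purely illustrative -- not Seastar's scheduling groups.
#include <algorithm>
#include <cstdint>
#include <iostream>
#include <string>
#include <vector>

struct Group {
    std::string name;
    double shares;        // e.g. 600 for "main", 200 for "secondary"
    double vruntime = 0;  // accumulated work, weighted by 1/shares
    uint64_t executed = 0;
};

int main() {
    std::vector<Group> groups = {{"main", 600}, {"secondary", 200}};

    // Under contention, always run the group that has consumed the least
    // weighted runtime; higher shares means its runtime grows more slowly.
    for (int task = 0; task < 8000; ++task) {
        auto& g = *std::min_element(groups.begin(), groups.end(),
            [](const Group& a, const Group& b) { return a.vruntime < b.vruntime; });
        g.executed += 1;
        g.vruntime += 1.0 / g.shares;   // each unit of work costs 1/shares
    }

    for (const auto& g : groups) {
        std::cout << g.name << " executed " << g.executed << " tasks\n";
    }
    // Expected: "main" runs roughly 3x as many tasks as "secondary" (600:200).
}
```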