Common Performance Pitfalls of Modern Storage I/O
Performance pitfalls when aiming for high-performance I/O on modern hardware and cloud platforms

Over the past two blog posts in this series, we've explored real-world performance investigations on storage-optimized instances with high-performance NVMe RAID arrays. Part 1 examined how fair-queue issues in the Seastar IO Scheduler inadvertently throttled bandwidth to a fraction of the advertised instance capacity. Part 2 dove into the filesystem layer, revealing how XFS's block size and alignment strategies could force read-modify-write operations and cut a big chunk out of write throughput.

This third and final part synthesizes those findings into a practical performance checklist. Whether you're optimizing ScyllaDB, building your own database system, or simply trying to understand why your storage isn't delivering the advertised performance, understanding these three interconnected layers – disk, filesystem, and application – is essential. Each layer has its own assumptions about what constitutes an optimal request. When these expectations misalign, the consequences cascade down, amplifying latency and degrading throughput. This post presents a set of subtle pitfalls we've encountered, organized by layer. Each includes concrete examples from production investigations as well as actionable mitigation strategies.

Mind the Disk — Block Size and Alignment Matter

Physical and Logical sector sizes

Modern SSDs, particularly high-performance NVMe drives, expose two critical properties: physical sector size and logical sector size. The logical sector size is what the operating system sees. It typically defaults to 512 bytes for backward compatibility, though many modern drives can also report 4K. The physical sector size reflects how data is actually stored on the flash chips. It's the size at which the drive delivers peak performance. When you submit a write request that doesn't align to the physical sector size, the SSD controller must:

- Read the entire physical page containing your data
- Modify the portion you're writing to
- Write the entire sector back to an empty/erased flash page

In our investigation of AWS i7i instances, we were bitten by exactly this problem. IOTune, our disk benchmarking tool, was using 512-byte requests to run the benchmarks because the firmware in these new NVMes was incorrectly reporting the physical sector size as 512 bytes when it was actually 4 KB. This made us measure the disks as having up to 25% lower read IOPS and up to 42% lower write IOPS. The measurements were used to configure the IO Scheduler, so we ended up using the disks like they were a less performant model. That's a very good way for your business to lose money. 🙂

If you're wondering why I only explained the reasoning for slow writes with requests unaligned to the physical sector size, it's because we still don't fully understand why read requests are also hit by this issue. It's an open question and we're still researching some leads (which I hope will get us an answer).
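On Linux you can ask the kernel what a device claims for both sector sizes. The sketch below is a minimal example (the device path is just a placeholder); keep in mind that the physical value it prints is whatever the firmware reports, so, as the i7i story shows, it still deserves a benchmark-based sanity check.

```c
/* Minimal sketch: print the logical and physical sector sizes the kernel
 * reports for a block device. The physical value comes from the drive's
 * firmware, so it can be wrong (as it was on the i7i NVMes) -- treat it as
 * a starting point for benchmarking, not as ground truth. */
#include <fcntl.h>
#include <linux/fs.h>     /* BLKSSZGET, BLKPBSZGET */
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    const char *dev = argc > 1 ? argv[1] : "/dev/nvme0n1";  /* placeholder path */
    int fd = open(dev, O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    int logical = 0;
    unsigned int physical = 0;
    if (ioctl(fd, BLKSSZGET, &logical) < 0) { perror("BLKSSZGET"); return 1; }
    if (ioctl(fd, BLKPBSZGET, &physical) < 0) { perror("BLKPBSZGET"); return 1; }

    printf("%s: logical sector = %d bytes, physical sector = %u bytes\n",
           dev, logical, physical);
    close(fd);
    return 0;
}
```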
Sustained performance

If your software relies on a disk hitting a certain performance number, try to account for the fact that even dedicated provisioned NVMes have peak and sustained performance values. It's well known that elastic storage (like EBS, for instance) has baseline performance and peak performance, but it's less intuitive for dedicated NVMes to behave like this. Measuring disk performance with workloads that run for 10+ minutes might show 3-5% lower IOPS/throughput; measuring at that timescale allows your software to better predict how the disk behaves under sustained load.

Mind the RAID

If you're using RAID0 for NVMe arrays, be aware that your app's parallel architecture might resonate with the way requests end up distributed over the RAID array. RAIDs are made of chunks and stripes. A chunk is a block of data within a single disk; when the chunk size is exceeded, the driver moves on to the next disk in the array. A stripe is the set of chunks, one from each disk, that get written in sequence. A RAID0 with 2 disks will get written like this: stripe0: chunk0 (disk0), chunk1 (disk1); stripe1; stripe2… Filesystems usually align files to the RAID stripe boundary. Depending on the write pattern of your application, you could end up stressing some of the disks more, and not leveraging the entire power of the RAID array.

Key Takeaways for the Disk Layer

- Detection: Always verify physical and logical sector sizes if you suspect your issue might be related to this. Don't blindly trust firmware-reported values; cross-check with benchmarking tools and adjust your app and filesystem to use the physical sector size if possible.
- Measurement discipline: Increase measurement time when benchmarking disks. Even dedicated NVMes can have baseline vs. peak performance.
- RAID awareness: A RAID array is made of blocks, addresses, and drivers managing them. It's not a magic endpoint that will simply amplify your N-drive array into N times the performance of a single disk. Its architecture has its own set of assumptions and limitations which, together with the filesystem's own limitations and assumptions, might interfere with your app's.

Mind the Filesystem — Block Size, Alignment, and Metadata Operations

Filesystem Block Size and Request Alignment

Every filesystem has its own block size, independent of the disk's physical sector size. XFS, for instance, can be formatted with block sizes ranging from 512 bytes to 64 KB. In ScyllaDB, we used to format with a 1K block size because we wanted the fastest possible writes for commitlog entries of 1K and above. For older SSDs, the physical sector size was 512 bytes. On modern 4K-optimized NVMe arrays, this choice became a liability. We realized that 4K block sizes would bring us lots of extra write throughput.

This filesystem-level block size affects two critical aspects: how data is stored and aligned on disk, and how metadata is laid out. Here's a concrete example. When Seastar issues 128 KB sequential writes to a file on 1K-block XFS, the filesystem doesn't seem to align these to 4K boundaries (maybe because the SSD firmware reported a physical sector size of 512 bytes). Using blktrace to inspect the actual disk I/O, we observed that approximately 75% of requests aligned to 1K or 2K boundaries. For these requests, the drive controller would split each one into at most 3 parts: a head, a 4K-aligned core, and a tail. For the head and tail, the disk would perform RMW, which is very slow. That would become the dominating factor for the entire request (consisting of the 3 parts). Reformatting the filesystem with a 4K block size completely transformed the alignment distribution of requests, and 100% of them aligned to 4K. This brought a lot of throughput back for us.
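To make the head/core/tail splitting concrete, here is a small standalone sketch (not Seastar or XFS code) that shows how a 128 KB write gets carved up against a 4K physical sector depending on its starting alignment; the head and tail are the parts that trigger read-modify-write.

```c
/* Illustrative only: split a request into an unaligned head, a 4K-aligned
 * core, and an unaligned tail. Head and tail force the drive into RMW. */
#include <stdint.h>
#include <stdio.h>

#define PHYS_SECTOR 4096ULL

static void split_request(uint64_t offset, uint64_t len)
{
    uint64_t end        = offset + len;
    uint64_t core_start = (offset + PHYS_SECTOR - 1) & ~(PHYS_SECTOR - 1); /* round up   */
    uint64_t core_end   = end & ~(PHYS_SECTOR - 1);                        /* round down */

    uint64_t head, core, tail;
    if (core_end <= core_start) {            /* request never covers a full sector */
        head = len; core = 0; tail = 0;
    } else {
        head = core_start - offset;
        core = core_end - core_start;
        tail = end - core_end;
    }
    printf("offset %8llu: head %4llu (RMW), core %6llu, tail %4llu (RMW)\n",
           (unsigned long long)offset, (unsigned long long)head,
           (unsigned long long)core, (unsigned long long)tail);
}

int main(void)
{
    split_request(4096, 131072);   /* 4K-aligned 128K write: no RMW          */
    split_request(1024, 131072);   /* 1K-aligned 128K write: head + tail RMW */
    split_request(2048, 131072);   /* 2K-aligned 128K write: head + tail RMW */
    return 0;
}
```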
Filesystem Metadata Operations and Blocking

Consider this: when a file grows, XFS must:

- Allocate extents from the freespace tree, requiring B-Tree modifications and mutex locks
- Update inode metadata to reflect the new file size
- Flush metadata periodically to ensure durability
- Update access/change times (ctimes) on every write

Each of these operations can block subsequent I/O submissions. In our research, we discovered that the RWF_NOWAIT flag (requesting non-blocking async I/O submission) was insufficiently effective when metadata operations were queued. Writes would be re-submitted from worker threads rather than the Seastar reactor, adding context-switch overhead and latency spikes.

When the final size of files is known, it is beneficial to pre-allocate or pre-truncate the file to that size using functions like fallocate() or ftruncate(). This practice dramatically improves the alignment distribution across the file and helps to amortize the overhead associated with extent allocation and metadata updates. While effective, fallocate() can be an expensive operation. It can potentially impact latency, especially if the allocation groups are already busy. Truncation is significantly cheaper; this alone can offer substantial benefits. Another helpful technique is using approaches like Seastar's sloppy_size=true, where a file's size is doubled via truncation whenever the current limit is reached.
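To make that advice concrete, here is a minimal sketch of the two pre-sizing options (error handling trimmed, the helper name is ours, and whether fallocate() or plain truncation wins depends on your filesystem and workload):

```c
/* Minimal sketch of pre-sizing a file before writing it out sequentially.
 * ftruncate() only sets the logical size (cheap); fallocate() also reserves
 * extents up front (more expensive, but no allocation during the writes). */
#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

int prepare_file(const char *path, off_t final_size, int reserve_extents)
{
    int fd = open(path, O_CREAT | O_WRONLY, 0644);
    if (fd < 0)
        return -1;

    int rc = reserve_extents
                 ? fallocate(fd, 0, 0, final_size)   /* allocate extents now      */
                 : ftruncate(fd, final_size);        /* just set the logical size */
    if (rc < 0) {
        close(fd);
        return -1;
    }
    return fd;   /* caller issues aligned writes and closes the fd */
}
```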
Key Takeaways for the Filesystem Layer

- Format Correctly: Format XFS (or other filesystems) with block sizes matching your SSD's physical sector size if possible. Most modern NVMe drives are 4K-optimized. Go lower only if there are strong restrictions – like an inability to read 4K-aligned, potential read amplification, or benchmarks showing better performance for your particular disk with a smaller request size.
- Pre-allocation: When file sizes are known, pre-truncate or pre-allocate files to their final size using `fallocate()` or truncation. This amortizes extent allocation overhead and ensures uniform alignment across the file.
- Metadata Flushing: Understand the filesystem's metadata update behavior. File size and access time updates are expensive. If you're doing AIO, use RWF_NOWAIT if possible and make sure it works by tracing the calls with `strace`.

Data deletion has also proved to have very expensive side effects. For example, on older generation NVMes, TRIM requests that accumulated in the filesystem would flush all at once, overloading the SSD controller and causing huge latency spikes for the apps running on that machine.

Mind the Application — Parallelism and Request Size Strategy

Parallelism tuning

Modern NVMe storage devices can handle thousands of requests in flight simultaneously. Factors like the internal queue depth and the number of outstanding requests the device can accept determine the maximum achievable bandwidth and latency when properly loaded. However, application-level concurrency (threads, fibers, async tasks) must be sufficient to keep these queues full. Generally, the bandwidth vs. latency dependency has two regimes. If you measure latency and bandwidth while increasing the app's parallelism, latency stays constant and bandwidth grows – as long as the internal disk parallelism is not saturated. Once the disk is throughput-loaded, the bandwidth stops growing or grows very little, while the latency scales almost linearly with the increase in load.

The relationship between throughput, latency, and queue depth follows Little's Law:

Average Queue Depth = Throughput * Average Latency

For a device delivering 14.5 GB/s with 128K requests, the number of requests/second the device can handle is 14.5 GB/s divided by 128K – roughly 113k req/s. If an individual request's latency is, for example, 1 ms, the device queue needs at least 113 outstanding requests to sustain 14.5 GB/s. This has practical implications: if you tune your application for, say, 40 concurrent requests and later upgrade to a faster device, you'll need to increase concurrency or you'll under-utilize the new device. Conversely, if you over-provision concurrency, you risk queue saturation and latency spikes, because the SSD controllers only have so much compute power themselves.

In ScyllaDB, concurrency is expressed as the number of shards (1 per CPU core) and fibers submitting I/O in parallel. Because we want the database to perform exceptionally well even in mixed workloads, we're using Seastar's IO Scheduler to modulate the number of requests we send to storage. The Scheduler is configured with a latency goal that it needs to meet and with the IOPS/bandwidth the disk can handle at peak performance. In real workloads, it's very difficult to match the load with a static execution model you've built yourself, even with thorough benchmarking and testing. Hardware performance varies, and read/write patterns are usually surprising. Your best bet is to use an IO scheduler built to leverage the maximum potential of the hardware while still obeying a latency limit you've set.

Request size strategy

Many small I/O requests often fail to saturate an NVMe drive's throughput because bandwidth is the product of IOPS and request size, and small requests hit an IOPS/latency/CPU ceiling before they hit the PCIe/NAND bandwidth ceiling. For a given request size, bandwidth is essentially IOPS x request size. So, for instance, 4K I/Os would need extremely high IOPS to reach GB/s-class throughput. A concrete example: 350k IOPS (typical for an AWS i7i.4xlarge, for instance) at a 4K request size will get you ~1.3 GB/s (an i7i.4xlarge can do 3.5 GB/s easily). This looks slow in throughput terms even though the device might be doing exactly what the workload allows.

NVMe devices reach peak bandwidth when they have enough outstanding requests (queue depth) to keep the many internal flash channels busy. If your workload issues many small requests but mostly one-at-a-time (or only a couple in flight), the drive spends time waiting between completions instead of streaming data. As a result, bandwidth stays low. Another important aspect is that with small I/O sizes, the fixed per-I/O costs (syscalls, scheduling, interrupts, copying, etc.) can dominate the latency. That means the kernel/filesystem path becomes the limiter even before the NVMe hardware does. This is one reason why real applications often can't reproduce vendors' peak numbers. In practice, you usually need high concurrency (multiple threads/fibers/jobs) and true async I/O to maintain a deep in-flight queue and approach device limits. But, to repeat one idea from the section above, also remember that the storage controller itself has limited compute capacity. Overloading it results in higher latencies for both read and write paths. It is important to find the right request size for your application's workload. Go as big as your latency expectations will allow.
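As a back-of-the-envelope check of the Little's Law and request-size arguments above, the arithmetic looks like this (the figures are the examples from the text, not measurements):

```c
/* Back-of-the-envelope sketch of Little's Law applied to an NVMe device.
 * The figures are the illustrative numbers used in the text. */
#include <stdio.h>

int main(void)
{
    double target_bw   = 14.5e9;   /* bytes/s the device can deliver      */
    double request_sz  = 128e3;    /* bytes per request (128K)            */
    double avg_latency = 1e-3;     /* seconds per request (example: 1 ms) */

    double req_per_sec = target_bw / request_sz;     /* ~113k requests/s   */
    double queue_depth = req_per_sec * avg_latency;  /* Little's Law       */

    printf("required rate: %.0f req/s, required in-flight requests: %.0f\n",
           req_per_sec, queue_depth);
    return 0;
}
```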
If your workload is not very predictable, keep in mind that you'll most likely need some dynamic logic that adjusts the I/O patterns so you can squeeze every last drop of performance out of that expensive storage.

Key Takeaways for the Application Layer

- Parallelism tuning: Benchmark your specific hardware and workload to find the optimal fiber/thread/shard count. Watch the latency. As you increase the parallelism, you'll notice throughput increasing. At some point, throughput will plateau and latency will start to increase. That's the sweet spot you're looking for.
- Request size strategy: Picking the right request size is important. Seastar defaults to 128K for sequential writes, which works well on most modern storage (but validate it via benchmarking). If your device prefers larger or smaller requests for throughput, the cost is latency – so design your workload accordingly. In practice, we've never seen SSDs that cannot be saturated for throughput with 128K requests, and ScyllaDB can achieve sub-millisecond latencies with this request size.

Conclusion

The path from application buffer to persistent storage on modern hardware is complex. Often, performance issues are counterintuitive and very difficult to track down. A 1 KB filesystem block size doesn't "save space" on 4K-optimized SSDs; it wastes throughput by forcing read-modify-write operations. A perfectly tuned IO Scheduler can still throttle requests if given incorrect disk properties. Sufficient parallelism doesn't guarantee high throughput if request sizes are too small to fill device queues. By minding the disk, the filesystem, and the application – and by understanding how they interact – you can fully take advantage of modern storage hardware and build systems that reliably deliver the performance the storage vendor advertised.

Kelsey Hightower's Take on Engineering at Scale
At ScyllaDB's Monster SCALE Summit 25, Hightower shared why facing scale the hard way is the right way

"There's a saying I like: 'Some people have good ideas. Some ideas have people.' When your idea outlives you, that's success." – Kelsey Hightower

Kelsey Hightower's career is a perfect example of that. His ideas have taken on a life of their own, extending far beyond his work at Puppet and CoreOS to KubeCon and Google. And he continues to scale his impact with his signature unscripted keynotes as well as the definitive book, "Kubernetes Up and Running." We were thrilled that Hightower joined ScyllaDB's Monster SCALE Summit to share his experiences and advice on engineering at scale. And to scale his insights beyond those who joined us for the live keynote, we're capturing some of the most memorable moments and takeaways here.

Note: Monster SCALE Summit 2026 will go live March 11-12, featuring antirez, creator of Redis; Camille Fournier, author of "The Manager's Path" and "Platform Engineering"; Martin Kleppmann, author of "Designing Data-Intensive Applications"; and more than 50 others. The event is free and virtual; register and join the community for some lively chats.

Monster Scale Summit 2026 – Free + Virtual

Fail Before You Scale

The interview with host Tim Koopmans began with a pointed warning for attendees of a conference that's all about scaling: Don't just chase scale because you're fascinated by others' scaling strategies and achievements. You need to really experience some pain personally first. As Hightower put it: "A lot of people go see a conference talk — I'm probably guilty of this myself — and then try to 'do scale things' before they even have experience with what they're doing. Back at Puppet Labs, lots of people wrote shell scripts with bad error handling. Things went awry when conditions weren't perfect. Then they moved on to configuration management, and those who made that journey could understand the trade-offs. Those who started directly with Puppet often didn't."

"Be sure you have a reason," he said. "So before you over-optimize for a problem you may not even have, you should ask: 'How bad is it really? Where are the metrics proving that you need a more scalable solution?' Sometimes you can do nothing and just wait for scaling to become the default option." Ultimately, you should hope for the "good problem" where increasing demand causes you to hit the limits of your tech, he said. That's much better than having few customers and over-engineering for problems you don't have.

Make the Best Choice You Can … For Now

The conversation shifted to what level of scale teams should target with their initial tech stack and each subsequent iteration. Should you optimize for a future state that hasn't happened yet? Play it safe in case the market changes? "If you're not sure whether you're on the right stack … I promise you, it's going to change," Hightower said. "Make the best choice you can for now. You can spend all year optimizing for 'the best thing,' but it may not be the best thing 10 years from now. Say you pick a database, go all in, learn best practices. But put a little footnote in your design doc: 'Here's how we'd change this.' Estimate the switching cost. If you do that, you won't get stuck in sunk cost fallacy later." Rather than trying to predict the future, think about how to avoid getting trapped. You don't want dependencies or extensions to limit your ability to migrate when it's time to take a better (or just different) path.
"Change isn't failure," he emphasized. "Plan for it; don't fear it." In Hightower's view, scaling decisions should start on a whiteboard, not in code. "When I was at Google, we'd do technical whiteboard sessions. Draw a line — that's time. 'Today, we're here. Our platform allows us to do these things. Is that good or bad?' Then draw ahead: 'Where do we want to be in two years?'" He continued, "Usually that's driven by teams and customer needs. You can't do everything at once. So plot milestones — six months, a year, etc. You can push things out in time for when new libraries or tools arrive. If something new shows up that gets you two years ahead instantly, great. Having a timeline gives you freedom without guilt that you can't ship everything today."

Are You Really Prepared For a 747?

Following up on the Google thread, Koopmans asked, "I'd love to hear practical ways Google avoids over-engineering when designing for scale." To illustrate why "Google-scale" solutions don't always fit everyone else, Hightower used a memorable analogy: "I had a customer once say, 'We want BigQuery on-prem.' I said, 'You do? Really? OK, how much money do you have?' And it was one of those companies that had plenty of capital, so that wasn't the issue. I told them, 'That would be like going to the airport, looking out the window, seeing a brand-new 747 and telling the airline that you want that plane. Even if they let you buy it, you don't have a pilot's license, you don't know how to fuel it. Where are you going to park it? Are you going to drive it down your subdivision, decapitating the roofs of your neighbors' houses?' Some things just aren't meant for everyone." Ultimately, whether it's over-engineering or not depends on the target user. Understand who they are, how they work and what tools they use, then build with that in mind.

Hightower also warned against treating "best practices" as universal truths: "One question that most customers show up with is, 'What are the best practices?' Not necessarily the best practices for me. They just want to know what everyone else is doing. I think that might be another anti-pattern in the mix, where you only care about what everyone else is doing and you don't bring the necessary context for a good recommendation."

How Leaders Should Think About Dev Tooling

"Serializing engineering culture" (Hightower's phrase) like Google did with its google3 monorepo makes it simple for thousands of new engineers to join the team and start contributing almost instantly. During his tenure at Google, everything from gRPC to deployment tools was integrated. Engineers just opened a browser, added code, and reviews would start automatically. However, there's a fine line between serializing and stifling. Hightower believes that prohibiting engineers from even installing Python on their laptops, for example, is overkill: "That's like telling Picasso he can't use his favorite brush." He continued: "Everyone works differently. As a leader, learn what tools people actually use and promote sharing. Have engineers show their workflows — the shortcuts, setups and plugins that make them productive. That's where creativity and speed come from. Share the nuance. Most people think their tricks are too small to matter, but they do. I want to see your dotfiles! You'll inspire others."

Watch the Complete Talk

As Hightower noted, "Some people have good ideas.
Some ideas have people." His approach to scale – pragmatic, context-driven and human – shows why some ideas really do outlive the people who created them. You can see the full talk below. Fun fact: it was truly an unscripted interview – Hightower insisted! The team met him in the hotel lobby that morning, chatted a bit during a coffee run, prepped the camera angles … and suddenly Hightower and Koopmans were broadcasting to 20,000 attendees around the world.

Claude Code Marketplace Now Available
Claude Code has become an indispensable part of my daily workflow. I use it for everything from writing code to debugging production issues. But while Claude is incredibly capable out of the box, there are areas where injecting specialized domain knowledge makes it dramatically more useful.
That’s why I built a plugin marketplace. Yesterday I released rustyrazorblade/skills, a collection of Claude Code plugins that extend Claude with expert-level knowledge in specific domains. The first plugin is something I’ve been talking about doing for a while: a Cassandra expert.
The Deceptively Simple Act of Writing to Disk
Tracking down a mysterious write throughput degradation

From a high-level perspective, writing a file seems like a trivial operation: open, write data, close. Modern programming languages abstract this task into simple, seemingly instantaneous function calls. However, beneath this thin veneer of simplicity lies a complex, multi-layered gauntlet of technical challenges, especially when dealing with large files and high-performance SSDs. For the uninitiated, the path from application buffer to persistent storage is fraught with performance pitfalls and unexpected challenges. If your goal is to master the art of writing large files efficiently on modern hardware, understanding all the details under the hood is essential. This article will walk you through a case study of fixing a throughput performance issue. We'll get into the intricacies of high-performance disk I/O, exploring the essential technical questions and common oversights that can dramatically affect reliability, speed, and efficiency. It's part 2 of a 3-part series. Read part 1 Read part 3

When lots of work leads to a performance regression

If you haven't yet read part 1 (When bigger instances don't scale), now's a great time to do so. It will help you understand the origin of the problem we're focusing on here. TL;DR: In that blog post, we described how we managed to figure out why a new class of highly performant machines didn't scale as expected when instance sizes increased. We discovered a few bugs in our Seastar IO Scheduler (stick around a bit, I'll give a brief description of it below). Fixing them helped us measure bandwidth numbers that scaled. At the time, we believed these new NVMes were inclined to perform better with 4K requests than with 512-byte requests. We later discovered that the latter issue was not related to the scheduler at all. We were actually chasing a firmware bug in the SSD controller itself. These disks do, in fact, perform better with 4K requests. What we initially thought was a problem in our measurement tool (IOTune) turned out to be something else entirely. IOTune wasn't misdetecting the disk's physical sector size (this is the request size at which a disk can achieve the best IOPS). Instead, the disk firmware was reporting it incorrectly. It was reporting it as 512 bytes. However, in reality, it was 4K. We worked around the IOPS issue since the cloud provider wasn't willing to fix the firmware bug due to backward-compatibility concerns. We also deployed the IO Scheduler fixes, and our measured disk models (io-properties) from IOTune scaled nicely with the size of the instance.

Still, in real workload tests, ScyllaDB didn't like it. Performance results of some realistic workloads showed a write throughput degradation of around 10% on some instances provisioned with quite new and very fast SSDs. While this wasn't much, it was alarming because we were kind of expecting an improvement after the last series of fixes. These first two charts give us a good indication of how well ScyllaDB utilizes the disk. In short, we're looking for both of them to be as stable as possible and as close to 100% as possible. The "I/O Group consumption" metric tracks the amount of shared capacity currently taken by in-flight operations from the group (reads, writes, compaction, etc.). It's expressed as a percentage of the configured disk capacity. The "I/O Group Queue flow ratio" metric in ScyllaDB measures the balance between incoming I/O request rates and the dispatch rate from the I/O queue for a given I/O group.
It should be as close as possible to 1.0, because requests cannot accumulate on the disk. If it jumps up, it means one of two things. The reactor might be constantly falling behind and not kicking the I/O queue in a timely manner. Or, it can mean that the disk is slower than we told ScyllaDB it was – and the scheduler is overloading it with requests. The spikes here indicate that the IO Scheduler doesn't provide very good QoS. That led us to believe the disk was overloaded with requests, so we ended up not saturating the throughput. The following throughput charts for the commitlog, memtable, and compaction groups reinforce this claim. The 4-NVMe RAID array we were testing against was capable of around 14.5GB/s throughput. We expected that at any point in time during the test, the sum of the bandwidths for those three groups would get close to the configured disk capacity. Please note that according to the formula described in the section below on the IO Scheduler, bandwidth and IOPS have a competing relationship. It's not possible to reach the maximum configured bandwidth because that would leave you with 0 space for IOPS. The reverse holds true: you cannot reach the maximum IOPS because that would mean your request size got so low that you're most likely not getting any bandwidth from the disk. At the end of the chart, we would expect an abrupt drop in throughput for the commitlog and memtable groups because the test ends, with the compaction group rapidly consuming most of the 14.5GB/s of disk throughput. That was indeed the case, except that these charts are very spiky as well. In many cases, summing up the bandwidth for the three groups shows that they consume around 12.5GB/s of the disk's total throughput.

The Seastar IO Scheduler

Seastar uses an I/O scheduler to coordinate shards for maximizing the disk's bandwidth while still preserving great IOPS. Basically, the scheduler lets Seastar saturate a disk and at the same time fit all requests within a latency goal configured on the library (usually 0.5ms). A detailed explanation of how the IO Scheduler works can be found in this blog post. But here's a summary of where the max bandwidth/IOPS values come from and where they go within the ScyllaDB IO Scheduler. I believe it will connect the problem description above with the rest of the case study below. The IOTune tool is a disk benchmarking tool that ships with Seastar. When you run this tool on a disk, it will output 4 values corresponding to read/write IOPS and read/write bandwidth. These 4 values end up in a file called io-properties.yaml. When provided with these values, the Seastar IO Scheduler will build a model of your disk. This model then helps ScyllaDB maximize the drive's performance. The IO Scheduler models the disk based on the IOPS and bandwidth properties using a formula which looks something like:

read_bw/read_bw_max + write_bw/write_bw_max + read_iops/read_iops_max + write_iops/write_iops_max <= 1
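The normalization is easiest to see with a tiny sketch (not Seastar's actual code, and the io-properties numbers below are made up): any mix of in-flight bandwidth and IOPS has to fit inside one unit of "disk capacity," which is why the two compete.

```c
/* Illustrative sketch of the normalized disk model above (not Seastar code).
 * A request mix "fits" the modeled disk if its normalized bandwidth and IOPS
 * shares sum to at most 1.0. The io_properties values here are invented. */
#include <stdbool.h>
#include <stdio.h>

struct io_properties {                       /* what IOTune would measure */
    double read_bw_max, write_bw_max;        /* bytes/s */
    double read_iops_max, write_iops_max;    /* ops/s   */
};

static bool fits(const struct io_properties *p,
                 double read_bw, double write_bw,
                 double read_iops, double write_iops)
{
    double load = read_bw / p->read_bw_max + write_bw / p->write_bw_max +
                  read_iops / p->read_iops_max + write_iops / p->write_iops_max;
    return load <= 1.0;
}

int main(void)
{
    struct io_properties disk = { 14.5e9, 9.5e9, 1.2e6, 0.9e6 };  /* example only */

    /* Writing at the full measured write bandwidth uses the whole budget,
     * leaving nothing for reads or for small-request IOPS. */
    printf("write-only at max bandwidth fits: %d\n", fits(&disk, 0, 9.5e9, 0, 0));
    printf("adding a little read traffic fits: %d\n", fits(&disk, 1e9, 9.5e9, 1e4, 0));
    return 0;
}
```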
The internal mechanics of how the IO Scheduler works are detailed
in the blog post linked above.

Peak and sustained throughput

We
observed some bursty behavior in the SSDs under test. It wasn’t
much; IOPS/bandwidth would be around 5% lower than the values
measured by running the benchmark for the default 2 minutes. The
iops/bandwidth values would start stabilizing at around 30 minutes
and that’s what we call sustained io-properties. We thought our
IOTune runs might have recorded peak disk io-properties (i.e., we
ran the tool for too short of a duration – the default is 120
seconds). With a realistic workload, we are actually testing the
disks at their sustained throughput, so the IO Scheduler builds an
inflated model of the disk and ends up sending more requests than
the disk can actually handle. This would cause the overload we saw
in the charts. We then tested with newly measured sustained
io-properties (with the IOTune duration configured to run 30
minutes, then 60 minutes). However, there wasn’t any noticeable
improvement in the throughput degradation problem…and the charts
were still spiky.

Disk saturation length

The disk saturation
length, as defined and measured by IOTune, is the smallest request
length that’s needed to achieve the maximum disk throughput. All
the disks that we’ve seen so far had a measured saturation length
of 128K. This means that it should be perfectly possible to achieve
maximum throughput with 128K requests. We noticed something quite
odd while running tests on these performant NVMes: the Seastar
IOTune tool would report a saturation length of 1MB. We immediately
panicked because there are a few important assumptions that rely on
us being able to saturate disks with 128K requests. The issue
matched the symptoms we were chasing. A disk model built with the
assumption that saturation length is 1MB would trick the IO
Scheduler into allowing a higher number of 128K requests (the
length Seastar uses for sequential writes) than the disk controller
can handle efficiently. In other words, the IO Scheduler would try
to achieve the high throughput measured with 1MB request length,
but using 128K requests. This would make the disk appear
overloaded, as we saw in the charts. Assume you’re trying to reach
the maximum throughput on a common disk with 4K requests, for
instance. You won’t be able to do it. And since the throughput
would stay below the maximum, the IO Scheduler would stuff more and
more requests into the disk. It’s hoping to reach the maximum – but
again, it won’t be reached. The side effect is that the IO
Scheduler ends up overloading the disk controller with requests,
increasing in-disk latencies for all the other processes trying to
use it. As is typical when you’re navigating muddy waters like
these, this turned out to be a false lead. We were stepping on some
bugs in our measuring tools, IOTune and io_tester. IOTune was
running with lower parallelism than those disks needed for
saturation. And io_tester was measuring overwriting a file rather
than writing to a pre-truncated new file. The saturation length of
this type of disk was still 128K, as we had seen in the past.
Fortunately, that meant we didn’t need to make potential
architectural changes in Seastar in order to accommodate larger
requests. A nice observation we can make here based on the tests we
ran trying to prove or disprove this theory is that extent allocation is a
rather slow process. If the allocation group is already quite busy
(a few big files already exist under the same directory, for
instance), the effect on throughput when appending and extending a
file is quite dramatic.

Internal parallelism

Another interesting
finding was that the disk/filesystem seemed to suffer from internal
parallelism problems. We ran io_tester with 128k requests, 8 fibers
per shard with 8 shards and 64 shards. The results were very odd.
The expected bandwidth was ~12.7GB/s, but we were confused to see
it drop when we increased the number of shards.
Generally, the bandwidth vs. latency dependency is defined by two
parts. If you measure latency and bandwidth while you increase the
app parallelism, the latency is constant and the bandwidth grows as
long as the internal disk parallelism is not saturated. Once the
disk is throughput loaded, the bandwidth stops growing or grows
very little, while the latency scales almost linearly with the
increase of the input. However, in the table above, we see
something different. When the disk is overloaded, latency increases
(because it holds more requests internally), but the bandwidth
drops at the same time. This might explain why IOTune measured
bandwidths around 12GB/s, while under the real workload (capped
with IOTune-measured io-properties.yaml), the disk behaved as if it
was overloaded (i.e., high latency and the actual bandwidth below
the maximum bandwidth). When IOTune measures the disk, shards load
the disk one-by-one. However, the real test workload sends parallel
requests from all shards at once. The storage device was a RAID0
with 4 NVMes and a RAID chunk size of 1MB. The theory here was that
since each shard writes via 8 fibers of 128k requests, it’s
possible that many shards could end up writing to the same disk in
the array. The explanation is that XFS aligns files on 1MB
boundaries. If all shards start at the same file offset and move at
relatively the same speed, the shards end up picking the same drive
from the array. That means we might not be measuring the full throughput of the RAID array.
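The mapping being hypothesized is easy to sketch: with a 1MB chunk over 4 disks, the member disk for a byte offset is simply (offset / chunk) mod 4, so shards writing at the same relative offsets inside 1MB-aligned files would keep landing on the same member (illustration only, not the md driver):

```c
/* Illustration of RAID0 chunk-to-disk mapping (not the Linux md driver).
 * With a 1MB chunk over 4 disks, offsets that are congruent modulo the
 * 4MB stripe always land on the same member disk. */
#include <stdint.h>
#include <stdio.h>

#define CHUNK  (1ULL << 20)   /* 1MB RAID chunk */
#define NDISKS 4

static unsigned member_disk(uint64_t byte_offset)
{
    return (unsigned)((byte_offset / CHUNK) % NDISKS);
}

int main(void)
{
    /* Writes at the start of two 1MB-aligned files that sit a whole
     * 4MB stripe apart would hit the same member disk. */
    printf("offset 0MB -> disk %u\n", member_disk(0));
    printf("offset 4MB -> disk %u\n", member_disk(4ULL << 20));
    printf("offset 5MB -> disk %u\n", member_disk(5ULL << 20));
    return 0;
}
```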
The measurements confirm that the shards do not consistently favor the same drive. Single disk
throughput was measured at 3.220 GB/s while the entire 4-disk array
achieved a throughput of 10.9 GB/s. If they were picking the same
disk all the time, the throughput of the entire 4-disk array
would’ve been equal to that of a single disk (i.e. 3.2GB/s). This
lead ended up being a dead end. We tried to prove it in a
simulator, but all requests ended up shuffled evenly between the
disks. Sometimes, interesting theories that you bet can explain
certain effects just don’t hold true in practice. In this case,
something else sits at the base of this issue.

XFS formatting
Although the previous lead didn’t get us very far, it opened a very
nice door. We noticed that the throughput drop is reproducible with 64 shards (rather than 8) on a RAID mounted with the scylla_raid_setup script (a ScyllaDB script which does preparation work on a new machine, e.g., formats the filesystem, sets up the RAID array), but not on a raw block device and not on a RAID created with default parameters.
Comparing the mkfs.xfs commands that were run, we spotted a few
differences. In the table below, notice how the XFS parameters
differ between default-mounted XFS and XFS mounted by
scylla_raid_setup. The 1K vs 4K data.bsize difference
stands out. We
also spotted – and this is an important
observation – that truncating the test files to some large
size seems to bring back the stolen throughput. The results coming
from this observation are extremely surprising. Keep reading to see
how this led us to the actual root cause of the
problem. The table below shows the throughput in MB/s when running
tests on files that are being appended and extended and files that
are being pre-truncated to their final size (both cases were run on
XFS mounted with the Scylla script).
We’ve experimented with Seastar’s sloppy_size=true file option,
which doubles the file's size via truncation every time it sees an
access past the current size of the file. However, while it
improved the throughput numbers, it unfortunately still left half
of the missing throughput on the table.
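For reference, the idea behind that option can be sketched roughly like this (an illustration of the technique, not Seastar's implementation):

```c
/* Rough illustration of the "sloppy size" idea: whenever a write would go
 * past the file's current logical size, truncate the file up to (at least)
 * double that size so the filesystem can lay out larger, better-aligned
 * extents. Not Seastar code. */
#include <sys/types.h>
#include <unistd.h>

static int ensure_capacity(int fd, off_t *logical_size, off_t write_end)
{
    if (write_end <= *logical_size)
        return 0;                         /* the write already fits */

    off_t new_size = *logical_size > 0 ? *logical_size : write_end;
    while (new_size < write_end)
        new_size *= 2;                    /* double until the write fits */

    if (ftruncate(fd, new_size) < 0)
        return -1;
    *logical_size = new_size;
    return 0;
}
```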
RWF_NOWAIT and non-blocking IO

The first lead that we got from here was by running our tests
under strace. Apparently, all of our writes would
get re-submitted out of the Seastar Reactor thread to XFS threads.
Seastar uses RWF_NOWAIT to attempt non-blocking AIO
submissions directly from the Seastar Reactor thread. It sets
aio_rw_flags=RWF_NOWAIT on IOCBs during initial
io_submit calls from the Reactor backend. On modern
kernels, this flag requests immediate failure (EAGAIN)
if the submission may block, preserving reactor responsiveness. On
io_submit failure with RWF_NOWAIT,
Seastar clears the flag and queues the IOCBs for retrying. The
thread pool then executes the retry submissions without the
RWF_NOWAIT flag, as these can tolerate blocking. We
thought reducing the number of retries would increase throughput. Unfortunately, disabling the flag didn't actually do that. The cause for the throughput drop is uncovered next. As for the RWF_NOWAIT issue, it's still unclear why it doesn't affect throughput. However, the fix was a kernel patch by our colleague Pavel Emelyanov which fiddles with the inode time update when IOCBs carry the RWF_NOWAIT flag. More details on this would definitely exceed the scope of this blog post.
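Seastar implements this with Linux AIO (io_submit and aio_rw_flags), but the submit-or-hand-off pattern is easier to show with pwritev2(); the worker-thread helper below is hypothetical and stands in for Seastar's thread pool:

```c
/* Compact illustration of the RWF_NOWAIT pattern. Seastar itself does this
 * via io_submit with aio_rw_flags, not pwritev2. A submission that would
 * block fails fast with EAGAIN and is retried where blocking is acceptable. */
#define _GNU_SOURCE
#include <errno.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Hypothetical helper standing in for Seastar's worker thread pool. */
extern void hand_off_to_worker_thread(int fd, const struct iovec *iov,
                                      int iovcnt, off_t offset);

void submit_write(int fd, const struct iovec *iov, int iovcnt, off_t offset)
{
    /* Try a non-blocking submission from the reactor thread. */
    ssize_t n = pwritev2(fd, iov, iovcnt, offset, RWF_NOWAIT);
    if (n >= 0)
        return;                 /* done (a real caller would handle short writes) */

    if (errno == EAGAIN)        /* would have blocked, e.g., on metadata updates */
        hand_off_to_worker_thread(fd, iov, iovcnt, offset);
}
```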
blktrace and offset alignment distribution

Returning to our throughput
performance issue, we started running io_tester experiments with
blktrace on, and we noticed something strange. For
around 25% of the requests, io_tester was submitting 128k requests
and XFS would queue them as 256 sector requests (256 sectors x 512
bytes, the reported physical sector size, equals 128k). However, it
would split the requests and complete them in 2 parts. (Note the Q
label on the first line of the output below; this indicates the
request was queued) The first part of the request would finish in
161 microseconds, while the second part would finish in 5249
microseconds. This dragged down the latency of the whole request to
5249 microseconds (the 4th column in the table is the timestamp of
the event in seconds; the latency of a request is max(ts_completed
– ts_queued)). The
remaining 75% of requests were queued and completed in one go as
256 sector requests. They were also quite fast: 52 microseconds, as
shown below.
The explanation for the split is related to how 128k requests hit
the file, given that XFS lays it out on disk, considering a 1MB
RAID chunk size. The split point occurs at sector 21057335224 + 72, which translates to byte offset 0x9CE3AE00000. This reveals it is, in
fact, a multiple of 0x100000 – the 1MB RAID chunk boundary. We can
discuss optimizations for this, but that’s outside the scope of
this article. Unfortunately, it was also out of scope for our
throughput issue. However, here are some interesting charts showing
how request offsets alignment looks based on the
blktrace events we collected.
default formatted XFS with 4K block size
scylla_raid_setup XFS: 1K block
scylla_raid_setup XFS: 1K block + truncation

This result is
significant! For XFS formatted with a 4K block size, most
requests are 4K aligned. For XFS formatted with
scylla_raid_setup (1K block size), most requests are
1K or 2K aligned. For XFS formatted with
scylla_raid_setup (1K block size) and with test files
truncated to their final size, all requests are 64K aligned
(although in some cases we also saw them being 4k aligned). It
turns out that XFS lays out files on disk very differently when the
file size is known in advance compared to the case when the file is
grown on-demand. That results in IO unaligned to the disk block
size.

Punchline

Now here comes the explanation of the problem we've
been chasing since the beginning. In
the first part of the article, we saw that doing random writes
with 1K requests produces worse IOPS than with 4K requests on these
4K optimized NVMes. This happens because when executing the 1K
request, the disk needs to perform Read-Modify-Write to land the
data onto the flash chips. When we submit 128k requests (as ScyllaDB does)
that are 1K or 2K aligned (see the alignment distributions in the
charts above), the disk is forced to do RMW on the head and tail of
the requests. This slows down all the requests (unrelated, but
similar in concept to the raid chunk alignment split we’ve seen
above). Individually, the slowdown is probably tiny. But since most
requests are 1k and 2k aligned on XFS formatted with 1k block size
(no truncation), the throughput hit is quite significant. It’s very
interesting to note that, as shown in the last chart above,
truncation also improved the alignment distribution quite
significantly, and also improved throughput. It also appeared to
significantly shorten the list of extents created for our test
files. For ScyllaDB, the right solution was to format XFS with a 4K
block size. Truncating to the final size of the file wasn’t really
an option for us because we can’t predict how big an SSTable will
grow. Since sloppy_size’ing the files didn’t provide great results,
we agreed that 4K-formatted XFS was the way to go. The throughput
degradation we got using the higher io-properties numbers seems to
be solved. We initially expected to see “improved performance”
compared to the original low io-properties case (i.e., a higher
measured write throughput). The success wasn’t obvious, though. It
was rather hidden within the dashboards, as shown below. Here’s
what disk utilization and IO flow ratio charts look like. The disk
is fully utilized and clearly not overloaded anymore.
And here are memtable and commitlog charts. These look very similar
to the charts we got with the initial low io-properties numbers
from the
“When bigger instances don’t scale” article. Most likely, this
means that’s what the test can do.
The
good news was hidden here. While the test went full speed ahead,
the compaction job filled all the available throughput, from
~10GB/s (when commitlog+memtable were running at 4.5GB/s) to
14.5GB/s (when commitlog and memtable flush processes were done).
The
only thing left to check was whether the filesystem formatted with
the 4K block size would cause read amplification on older disks
with 512 bytes physical sector size. It turns out it didn’t. We
were able to achieve similar IOPS on a RAID with 4 NVMes of the
older type.
Next up:
Part 3, Common performance pitfalls of modern storage I/O.

Inside the 2026 Monster Scale Summit Agenda
Monster Scale Summit is all about extreme scale engineering and data-intensive applications. So here's a big announcement: the agenda is now available! Attendees can join 50+ tech talks, including:

- Keynotes by antirez, Camille Fournier, Pat Helland, Joran Greef, Thea Aarrestad, Dor Laor, and Avi Kivity
- Martin Kleppmann & Chris Riccomini chatting about the second edition of Designing Data-Intensive Applications
- Tales of extreme scale engineering – Rivian, Pinterest, LinkedIn, Nextdoor, Uber Eats, Google, Los Alamos National Labs, CERN, and AmEx
- ScyllaDB customer perspectives – Discord, Disney, Freshworks, ShareChat, SAS, Sprig, MoEngage, Meesho, Tiket, and Zscaler
- Database engineering – Inside looks at ScyllaDB, turbopuffer, Redis, ClickHouse, DBOS, MongoDB, DuckDB, and TigerBeetle
- What's new/next for ScyllaDB – Vector search, tablets, tiered storage, data consistency, incremental repair, Rust-based drivers, and more

Like other ScyllaDB-hosted conferences (e.g., P99 CONF), the event will be free and virtual so everyone can participate. Take a look, register, and start choosing your own adventure across the multiple tracks of tech talks.

Full Agenda

Register [free + virtual]

When you join us March 11 and 12, you can…

- Chat directly with speakers and connect with ~20K of your peers
- Participate in interactive trainings on topics like real-time AI, database performance at scale, high availability, and cloud cost optimization strategies
- Pick the minds of ScyllaDB engineers and architects, who are available to answer your toughest database performance questions
- Win conference swag, sea monster plushies, book bundles, and other cool giveaways

Details, Details

The agenda site has all the scheduling, abstracts, and speaker details. Please note that times are shown in your local time zone. Be sure to scroll down into the Instant Access section. This is one of the best parts of Monster SCALE Summit. You can access these sessions from the minute the event platform opens until the conference wraps. Some teams have shared that they use Instant Access to build their own watch parties beyond the live conference hours. If you do this, please share photos! Another important detail: books. Quite a few of our speakers are

Apache Cassandra® 5.0: Improving performance with Unified Compaction Strategy
Introduction

Unified Compaction Strategy (UCS), introduced in Apache Cassandra 5.0, is a versatile compaction framework that not only unifies the benefits of Size-Tiered (STCS) and Leveled (LCS) Compaction Strategies, but also introduces new capabilities like shard parallelism, density-aware SSTable organization, and safer incremental compaction, all of which deliver more predictable performance at scale. By utilizing a flexible scaling model, UCS allows operators to tune compaction behavior to match evolving workloads, spanning from write-heavy to read-heavy, without requiring disruptive strategy migrations in most cases.
In the past, operators had to choose between rigid strategies and accept significant trade-offs. UCS changes this paradigm, allowing the system to efficiently adapt to changing workloads with tuneable configurations that can be altered mid-flight and even applied differently across different compaction levels based on data density.
Why compaction matters

Compaction is the critical process that determines a cluster's long-term health and cost-efficiency. When executed correctly, it produces denser nodes with highly organized SSTables, allowing each server to store more data without sacrificing speed. This efficiency translates to a smaller infrastructure footprint, which can lower cloud costs and resource usage.
Conversely, inefficient compaction is a primary driver of performance degradation. Poorly managed SSTables lead to fragmented data, forcing the system to work harder for every request. This overhead consumes excessive CPU and I/O, often forcing teams to try adding more nodes (horizontal scale) just to keep up with background maintenance noise.
Key concepts and terminology

To understand how UCS optimizes a cluster, it is necessary to understand the fundamental trade-offs it balances:
- Read amplification: Occurs when the database must consult multiple SSTables to answer a single query. High read amplification acts as a “latency tax,” forcing extra I/O to reconcile data fragments.
- Write amplification: A metric that quantifies the overhead of background processes (such as compactions). It represents the ratio between total data written to disk and the amount of data originally sent by an application. High write amplification wears out SSDs and steals throughput.
- Space amplification: The ratio of disk space used to the actual size of the “live” data. It tracks data such as tombstones or overwritten rows that haven’t been purged yet.
- Fan factor: The “growth dial” for the cluster data hierarchy. It defines how many files of a similar size must accumulate before they are merged into a larger tier.
- Sharding: UCS splits data into smaller, independent token ranges (shards), allowing the system to run multiple compactions in parallel across CPU cores.
UCS provides baseline architectural improvements that were not available in older strategies:
Improved compaction parallelism

Older strategies often got stuck on a single thread during large merges. UCS sharding allows a server to use its full processing power. This significantly reduces the likelihood of compaction storms and keeps tail latencies (p99) predictable.
Reduced disk space amplification

Because UCS operates on smaller shards, it doesn't need to double the entire disk space of a node to perform a major merge. This greatly reduces the risk of nodes running out of space during heavy maintenance cycles.
Density-based SSTable organization

UCS measures SSTables by density (token range coverage). This mitigates the huge-SSTable problem, where a single massive file becomes too large to compact, hindering read performance indefinitely.
Scaling parameter

The scaling parameter (denoted as W) is a configurable setting that determines the size ratio between compaction tiers. It helps balance write amplification and read performance by controlling how much data is rewritten during compaction operations. Lower (more negative, tiered-style) values let more SSTables accumulate before larger, less frequent compactions, whereas higher (positive, leveled-style) values trigger smaller, more frequent compactions that rewrite data more aggressively.
The strategy engine: tuning and parameters

UCS acts as a strategy engine: by adjusting the scaling parameter (W), UCS can mimic, or outperform, its predecessors STCS and LCS.
At a high level, the scaling parameter influences the effective fan-out behavior at each compaction level. Tiered-style settings such as T4 allow more SSTables to accumulate before merging, favoring write efficiency, while leveled-style settings such as L10 keep SSTables more tightly organized, reducing read amplification at the cost of additional background work.
The numbers below are illustrative and not prescriptive:
UCS configuration guide:

- Heavy writes / IoT: STCS-like (tiered) behavior, negative scaling parameter (e.g., W = -4); primary benefit is the lowest write amplification.
- Heavy reads: LCS-like (leveled) behavior, positive scaling parameter (e.g., W = 10); primary benefit is the lowest read amplification.
- Balanced: hybrid behavior, scaling parameter of zero (W = 0); balanced performance for general apps.

Practical example

UCS allows operators to mix behaviors across the data lifecycle.
'scaling_parameters': 'T4, T4, L10'
Note that scaling_parameters takes a string format that can accommodate parameters for per-level tuning.
This example instructs a cluster: “Use tiered compaction for the first two levels to keep up with the high write volume, but once data reaches the third level, reorganize it into a leveled structure so reads stay fast.”
Here’s a fuller, illustrative example of how one might structure their CQL to change the compaction strategy.
ALTER TABLE keyspace_name.table_name WITH compaction = { 'class': 'UnifiedCompactionStrategy', 'scaling_parameters': 'T4,T4,L10' };
Operational evolution: moving beyond major compactions
In older strategies and in Apache Cassandra versions prior to 5.0, operators often felt forced to run a major compaction to reclaim disk space or fix performance. This was a critical event that could impact a node’s I/O for extended periods of time and required substantial free disk space to complete.
Because UCS is density-aware and sharded, it effectively performs compactions constantly and granularly so major compactions are rarely needed. It identifies overlapping data within specific token ranges (shards) and cleans them up incrementally. Operators no longer must choose between a fragmented disk and a risky, resource-heavy manual compaction; UCS keeps data density more uniform across the cluster over time.
The migration advantage: "in-place" adoption

One of the key performance features of a UCS migration is in-place adoption, meaning that when a table is switched to UCS, it does not immediately force a massive data rewrite. Instead, it looks at the existing SSTables, calculates their density, and maps them into its new sharding structure.
This allows for moving from STCS or LCS to UCS with significantly less I/O overhead than any other strategy change.
Conclusion

UCS is an operational shift toward simplicity and predictability. By removing the need to choose between compaction trade-offs, UCS allows organizations to scale with confidence. Whether handling a massive influx of IoT data or serving high-speed user profiles, UCS helps clusters remain performant, cost-effective, and ready for the future.
On a newly deployed NetApp Instaclustr Apache Cassandra 5 cluster, UCS is already the default strategy (while Apache Cassandra 5.0 has STCS set as the default).
Ready to experience this new level of Cassandra performance for yourself? Try it with a free 30-day trial today!
The post Apache Cassandra® 5.0: Improving performance with Unified Compaction Strategy appeared first on Instaclustr.
Getting Started with Database-Level Encryption at Rest in ScyllaDB Cloud
Learn about ScyllaDB database-level encryption with Customer-Managed Keys & see how to set up and manage encryption with a customer key — or delegate encryption to ScyllaDB

ScyllaDB Cloud takes a proactive approach to ensuring the security of sensitive data: we provide database-level encryption in addition to the default storage-level encryption. With this added layer of protection, customer data is always protected against attacks. Customers can focus on their core operations, knowing that their critical business and customer assets are well-protected. Our clients can either use a customer-managed key (CMK, our version of BYOK) or let ScyllaDB Cloud manage the CMK for them. The feature is available on all cloud platforms supported by ScyllaDB Cloud. This article explains how ScyllaDB Cloud protects customer data. It focuses on the technical aspects of ScyllaDB database-level encryption with Customer-Managed Keys (CMK).

Storage-level encryption

Encryption at rest means data files are encrypted before being written to persistent storage. ScyllaDB Cloud always uses encrypted volumes to prevent data breaches caused by physical access to disks.

Database-level encryption

Database-level encryption is a technique for encrypting all data before it is stored in the database. The ScyllaDB Cloud feature is based on the proven ScyllaDB Enterprise database-level encryption at rest, extended with the Customer Managed Keys (CMK) encryption control. This ensures that the data is securely stored – and the customer is the one holding the key. The keys are stored and protected separately from the database, substantially increasing security. ScyllaDB Cloud provides full database-level encryption using the Customer Managed Keys (CMK) concept. It is based on envelope encryption: data is encrypted, and decrypted only when it is needed. This is essential to protect customer data at rest. Some industries, like healthcare or finance, have strict data security regulations. Encrypting all data helps businesses comply with these requirements, avoiding the need to prove that all tables holding sensitive personal data are covered by encryption. It also helps businesses protect their corporate data, which can be even more valuable.

A key feature of CMK is that the customer has complete control of the encryption keys. Data encryption keys will be introduced later (it would be confusing to cover them at the beginning). The customer can:

- Revoke data access at any time
- Restore data access at any time
- Manage the master keys needed for decryption
- Log all access attempts to keys and data

Customers can delegate all key management operations to the ScyllaDB Cloud support team if they prefer. To do this, the customer can choose the ScyllaDB key when creating the cluster, ensuring customer data remains secure and adheres to all privacy regulations. By default, encryption uses the symmetric algorithm AES-128, a solid corporate encryption standard covering all practical applications. Breaking AES-128 would take an immense amount of time, approximately trillions of years. The strength can be increased to AES-256. Note: Database-level encryption in ScyllaDB Cloud is available for all clusters deployed in Amazon Web Services (AWS) and Google Cloud Platform (GCP).

Encryption

To ensure all user data is protected, ScyllaDB will encrypt:

- All user tables
- Commit logs
- Batch logs
- Hinted handoff data

This ensures all customer data is properly encrypted.
The first step of the encryption process is to encrypt every record with a data encryption key (DEK). Once the data is encrypted with the DEK, the DEK is sent to either AWS KMS or GCP KMS, where the master key (MK) resides. The DEK is then encrypted with the master key (MK), producing an encrypted DEK (EDEK, or a wrapped key). The master key remains in the KMS, while the EDEK is returned and stored with the data. The DEK used to encrypt the data is destroyed to ensure data protection. A new DEK will be generated the next time new data needs to be encrypted.

Decryption
Because the original non-encrypted DEK is destroyed once the EDEK is produced, the data cannot be decrypted as it stands. The EDEK cannot be used to decrypt the data directly because the DEK inside it is encrypted. It first has to be unwrapped, and for that, the master key is required again: the EDEK can only be decrypted with the master key (MK) in the KMS. Once the DEK is unwrapped, the data can be decrypted. As you can see, the data cannot be decrypted without the master key – which is protected at all times in the KMS and cannot be “copied” outside the KMS. By revoking the master key, the customer can disable access to the data independently from the database or application authorization.

Multi-region deployment
Adding new data centers to the ScyllaDB cluster will create additional local keys in those regions. All master keys support multiple regions, and a copy of each key resides locally in each region – ensuring those multi-regional setups are protected from cloud provider regional outages and against disaster. The keys are available in the same region as the data center and can be controlled independently. If you use a customer key, cloud providers will charge you for the KMS: AWS charges $1/month, and GCP charges $0.06 per cluster, prorated per hour. Each additional DC creates a replica that is counted as an additional key. There is an additional cost per key request. ScyllaDB Enterprise utilizes those requests efficiently, resulting in an estimated monthly cost of up to $1 for a 9-node cluster. Managing encryption keys adds another layer of administrative work in addition to the extra cost. ScyllaDB Cloud also offers database clusters that can be encrypted using keys managed by ScyllaDB support. They provide the same level of protection, but our support team helps you manage the master keys. The ScyllaDB keys are applied by default and are subsidized by ScyllaDB.

Creating a Cluster with Database-Level Encryption
Creating a cluster with database-level encryption requires: a ScyllaDB Cloud account (if you don’t have one, you can create a ScyllaDB Cloud account here) and 10 minutes with a ScyllaDB Key, or 20 minutes if you create your own key. To create a cluster with database-level encryption enabled, we will need a master key. We can either create a customer-managed key using the ScyllaDB Cloud UI or skip this step completely and use a ScyllaDB Managed Key, which will skip the next six steps. In both cases, all the data will be protected by strong encryption at the database level. Instructions for setting up the customer-managed key can be found in the database-level encryption documentation. Transparent database-level encryption in ScyllaDB Cloud significantly boosts the security of your ScyllaDB clusters and backups.

Next Steps
Start using this feature in ScyllaDB Cloud. Get your questions answered in our community forum and Slack channel. Or, use our contact form.

We Built a Better Cassandra + ScyllaDB Driver for Node.js – with Rust
Lessons learned building a Rust-backed Node.js driver for ScyllaDB: bridging JS and Rust, performance pitfalls, and benchmark results This blog post explores the story of building a new Node.js database driver as part of our Student Team Programming Project. Up ahead: troubles with bridging Rust with JavaScript, a new solution being initially a few times slower than the previous one, and a few charts! Note: We cover the progress made until June 2025 as part of the ZPP project, which is a collaboration between ScyllaDB and University of Warsaw. Since then, the ScyllaDB Driver team adopted the project (and now it’s almost production ready). Motivation The database speaks one language, but users want to speak to it in multiple languages: Rust, Go, C++, Python, JavaScript, etc. This is where a driver comes in, acting as a “translator” of sorts. All the JavaScript developers of the world currently rely on the DataStax Node.js driver. It is developed with the Cassandra database in mind, but can also be used for connecting to ScyllaDB, as they use the same protocol – CQL. This driver gets the job done, but it is not designed to take full advantage of ScyllaDB’s features (e.g., shard-per-core architecture, tablets). A solution for that is rewriting the driver and creating one that is in-house, developed and maintained by ScyllaDB developers. This is a challenging task requiring years of intensive development, with new tasks interrupting along the way. An alternative approach is writing the new driver as a wrapper around an existing one – theoretically simplifying the task (spoiler: not always) to just bridging the interfaces. This concept was proven in the making of the ScyllaDB C / C++ driver, which is an overlay over the Rust driver. We chose the ScyllaDB Rust driver as the backend of the new JavaScript driver for a few reasons. ScyllaDB’s Rust driver is developed and maintained by ScyllaDB. That means it’s always up to date with the latest database features, bug fixes, and optimizations. And since it’s written in Rust, it offers native-level performance without sacrificing memory safety. [More background on this approach] Development of such a solution skips the implementation of complicated database handling logic, but brings its own set of problems. We wanted our driver to be as similar as possible to the Node.js driver so anyone wanting to switch does not need to do much configuration. This was a restriction on one side. On the other side, we have limitations of the Rust driver interface. Driver implementations differ and the API for communicating with them can vary in some places. Some give a lot of responsibility to the user, requiring more effort but giving greater flexibility. Others do most of the work without allowing for much customization. Navigating these considerations is a recurring theme when choosing to write a driver as a wrapper over a different one. Despite the challenges during development, this approach comes with some major advantages. Once the initial integration is complete, adding new ScyllaDB features becomes much easier. It’s often just a matter of implementing a few bridging functions. All the complex internal logic is handled by the Rust driver team. That means faster development, fewer bugs, and better consistency across languages. On top of that, Rust is significantly faster than Node.js. So if we keep the overhead from the bridging layer low, the resulting driver can actually outperform existing solutions in terms of raw speed. 
The environment: Napi vs Napi-Rs vs Neon With the goal of creating a driver that uses ScyllaDB Rust Driver underneath, we needed to decide how we would be communicating between languages. There are two main options when it comes to communicating between JavaScript and other languages: Use a Node API (NAPI for short) – an API built directly into the NodeJS engine, or Interface the program through the V8 JavaScript engine. While we could use one of those communication methods directly, they are dedicated for C / C++, which would mean writing a lot of unsafe code. Luckily, other options exist: NAPI-RS and Neon. Those libraries handle all the unsafe code required for using the C / C++ APIs and expose (mostly safe) Rust interfaces. The first option uses NAPI exclusively under the hood, while the Neon option uses both of those interfaces. After some consideration, we decided to use NAPI-RS over Neon. Here are the things we considered when deciding which library to use: – Library approach — In NAPI-RS, the library handles the serialization of data into the expected Rust types. This lets us take full advantage of Rust’s static typing and any related optimizations. With Neon, on the other hand, we have to manually parse values into the correct types. With NAPI-RS, writing a simple function is as easy as adding a #[napi] tag: Simple a+b example And in Neon, we need to manually handle JavaScript context: A+b example in Neon – Simplicity of use — As a result of the serialization model, NAPI-RS leads to cleaner and shorter code. When we were implementing some code samples for the performance comparison, we had serious trouble implementing code in Neon just for a simple example. Based on that experience, we assumed similar issues would likely occur in the future. – Performance — We made some simple tests comparing the performance of library function calls and sending data between languages. While both options were visibly slower than pure JavaScript code, the NAPI-RS version had better performance. Since driver efficiency is a critical requirement, this was an important factor in our decision. You can read more about the benchmarks in our thesis. – Documentation — Although the documentation for both tools is far from perfect, NAPI-RS’s documentation is slightly more complete and easier to navigate. Current state and capabilities Note: This represents the state as of May 2025. More features have been introduced since then. See the project readme for a brief overview of current and planned features. The driver supports regular statements (both select and insert) and batch statements. It supports all CQL types, including encoding from almost all allowed JS types. We support prepared statements (when the driver knows the expected types based on the prepared statement), and we support unprepared statements (where users can either provide type hints, or the driver guesses expected value types). Error handling is one of the few major functions that behaves differently than the DataStax driver. Since the Rust driver throws different types of errors depending on the situation, it’s nearly impossible to map all of them reliably. To avoid losing valuable information, we pass through the original Rust errors as is. However, when errors are generated by our own logic in the wrapper, we try to keep them consistent with the old driver’s error types. In the DataStax driver, you needed to explicitly call shutdown() to close the database connection. 
This generated some problems: when the connection variable was dropped, the connection sometimes wouldn’t stop gracefully, even keeping the program running in some situations. We decided to switch this approach, so that the connection is automatically closed when the variable keeping the client is dropped. For now, it’s still possible to call shutdown on the client. Note: We are still discussing the right approach to handling a shutdown. As a result, the behavior described here may change in the future. Concurrent execution The driver has a dedicated endpoint for executing multiple queries concurrently. While this endpoint gives you less control over individual requests — for example, all statements must be prepared and you can’t set different options per statement — these constraints allow us to optimize performance. In fact, this approach is already more efficient than manually executing queries in parallel (around 35% faster in our internal testing), and we have additional optimization ideas planned for future implementation. Paging The Rust and DataStax drivers both have built-in support for paging, a CQL feature that allows splitting results of large queries into multiple chunks (pages). Interestingly, although the DataStax driver has multiple endpoints for paging, it doesn’t allow execution of unpaged queries. Our driver supports the paging endpoints (for now, one of those endpoints is still missing) and we also added the ability to execute unpaged queries in case someone ever needs that. With the current paging API, you have several options for retrieving paged results: Automatic iteration: You can iterate over all rows in the result set, and the driver will automatically request the next pages as needed. Manual paging: You can manually request the next page of results when you’re ready, giving you more control over the paging process. Page state transfer: You can extract the current page state and use it to fetch the next page from a different instance of the driver. This is especially useful in scenarios like stateless web servers, where requests may be handled by different server instances. Prepared statements cache Whenever executing multiple instances of the same statement, it’s recommended to use prepared statements. In ScyllaDB Rust Driver, by default, it’s the user’s responsibility to keep track of the already prepared statements to avoid preparing them multiple times (and, as a result, increasing both the network usage and execution times). In the DataStax driver, it was the driver’s responsibility to avoid preparing the same query multiple times. In the new driver, we use Rust’s Driver Caching Session for (most) of the statement caching. Optimizations One of the initial goals for the project was to have a driver that is faster than the DataStax driver. While using NAPI-RS added some overhead, we hoped the performance of the Rust driver would help us achieve this goal. With the initial implementation, we didn’t put much focus on efficient usage of the NAPI-RS layer. 
When we first benchmarked the new driver, it turned out to be much slower than both the DataStax JavaScript driver and the ScyllaDB Rust driver…

Operations | scylladb-javascript-driver (initial version) [s] | Datastax-cassandra-driver [s] | Rust-driver [s]
62500 | 4.08 | 3.53 | 1.04
250000 | 13.50 | 5.81 | 1.73
1000000 | 55.05 | 15.37 | 4.61
4000000 | 227.69 | 66.95 | 18.43

Operations | scylladb-javascript-driver (initial version) [s] | Datastax-cassandra-driver [s] | Rust-driver [s]
62500 | 1.63 | 2.61 | 1.08
250000 | 4.09 | 2.89 | 1.52
1000000 | 15.74 | 4.90 | 3.45
4000000 | 58.96 | 12.72 | 11.64

Operations | scylladb-javascript-driver (initial version) [s] | Datastax-cassandra-driver [s] | Rust-driver [s]
62500 | 1.96 | 3.11 | 1.31
250000 | 4.90 | 4.33 | 1.89
1000000 | 16.99 | 10.58 | 4.93
4000000 | 65.74 | 31.83 | 17.26

Those results were a bit of a surprise, as we didn’t fully anticipate how much overhead NAPI-RS would introduce. It turns out that using JavaScript Objects introduced much higher overhead than other built-in types or Buffers. You can see on the following flame graph how much time was spent executing NAPI functions (yellow-orange highlight), which are related to sending objects between languages. Creating objects with NAPI-RS is as simple as adding the #[napi] tag to the struct we want to expose to the NodeJS part of the code. This approach also allows us to create methods on those objects. Unfortunately, given its overhead, we needed to switch the approach – especially in the most used parts of the driver, like parsing parameters, results, or other parts of executing queries. We can create a napi object like this:
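Here is a minimal sketch of such a struct (a hypothetical example rather than the driver's real code; the attribute syntax shown follows napi-rs v2):

    use napi_derive::napi;

    // A hypothetical wrapper type, not the driver's actual code.
    // The #[napi] attribute exposes the struct and its impl block to Node.js.
    #[napi]
    pub struct Wrapper {
        inner: u32,
    }

    #[napi]
    impl Wrapper {
        #[napi(constructor)]
        pub fn new(inner: u32) -> Self {
            Wrapper { inner }
        }

        // Exposed as a method on the generated JavaScript class.
        #[napi]
        pub fn value(&self) -> u32 {
            self.inner
        }
    }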
NAPI-RS converts this into a corresponding JavaScript class, and we can use the struct on both the JavaScript and Rust sides. When accepting
values as arguments to Rust functions exposed in NAPI-RS, we can
either accept values of the types that implement the
FromNapiValue trait, or accept references to values of
types that are exposed to NAPI (these implement the default
FromNapiReference trait). We can do it like this:
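For example, a function can accept a reference to the NAPI-exposed struct from the earlier sketch (again hypothetical; exact signatures depend on the napi-rs version):

    // Accepting a reference to a #[napi]-exposed struct (the hypothetical Wrapper above).
    #[napi]
    pub fn read_wrapper(wrapper: &Wrapper) -> u32 {
        wrapper.value()
    }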
Then, when we call the following Rust function, we can just pass a plain number from the JavaScript code.
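A minimal, hypothetical sketch:

    // u32 implements FromNapiValue, so JavaScript can call this
    // with a plain number, e.g. double(21).
    #[napi]
    pub fn double(value: u32) -> u32 {
        value * 2
    }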
FromNapiValue is implemented for built-in types like
numbers or strings, and the
FromNapiReference trait is created automatically
when using the #[napi] tag on a Rust struct. Compared
to that, we need to manually implement
FromNapiValue for custom structs. However, this
approach allows us to receive those objects in functions exposed to
NodeJS, without the need for creating Objects – and thus
significantly improves performance. We used this mostly to improve
the performance of passing query parameters to the Rust side of the
driver. When it comes to returning values from Rust code, a type
must have a ToNapiValue trait implemented. Similarly,
this trait is already implemented for built-in types, and is auto
generated with macros when adding the #[napi] tag to
the object. And this auto generated implementation was causing most
of our performance problems. Luckily, we can also implement our own
ToNapiValue trait. If we return a raw value and create
an object directly in the JavaScript part of the code, we can avoid
almost all of the negative performance impacts that come from the
default implementation of ToNapiValue. We can do it
like this:
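Roughly like this (a hypothetical sketch; the trait path and signature follow napi-rs v2 conventions and may differ in other versions):

    use napi::bindgen_prelude::ToNapiValue;
    use napi::sys;

    // A plain Rust struct, intentionally NOT tagged with #[napi],
    // so we control its conversion to a JavaScript value ourselves.
    pub struct Millis {
        value: u32,
    }

    impl ToNapiValue for Millis {
        unsafe fn to_napi_value(env: sys::napi_env, val: Self) -> napi::Result<sys::napi_value> {
            // Hand JavaScript a raw number instead of building a whole object here;
            // the JS side can wrap it in a class if it needs to.
            u32::to_napi_value(env, val.value)
        }
    }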
This will return just the number instead of the whole struct. An
example of such places in the code was UUID. This type is used for
providing the UUID retrieved as part of any query, and can also be
used for inserts. In the initial implementation, we had a UUID
wrapper: an object created in the Rust part of the code, that
had a default ToNapiValue implementation, that
was handling all the logic for the UUID. When we changed the
approach to returning just a raw buffer representing the UUID and
handling all the logic on the JavaScript side, we shaved off about
20% of the CPU time we were using in the select benchmarks at that
point in time. Note: Since completing the initial project, we’ve
introduced additional changes to how serialization and
deserialization works. This means the current state may be
different from what we describe here. A new round of benchmarking
is in progress; stay tuned for those results. Benchmarks In the
previous section, we showed you some early benchmarks. Let’s talk a
bit more about how we tested and what we tested. All benchmarks
presented here were run on a single machine – the database was run
in a Docker container and the driver benchmarks were run without
any virtualization or containerization. The machine was running on
AMD Ryzen™ 7 PRO 7840U with 32GB RAM, with the database itself
limited to 8GB of RAM in total. We tested the driver both with
ScyllaDB and Cassandra (latest stable versions as of the time of
testing – May 2025). Both of those databases were run in a three
node configuration, with 2 shards per node in the case of ScyllaDB.
With this information on the benchmarks, let’s see the effect all
the optimizations we added had on the driver performance when
tested with ScyllaDB:
Operations | Scylladb-javascript-driver [s] | Datastax-cassandra-driver [s] | Rust-driver [s] | scylladb-javascript-driver (initial version) [s]
62500 | 1.89 | 3.45 | 0.99 | 4.08
250000 | 4.15 | 5.66 | 1.73 | 13.50
1000000 | 13.65 | 15.86 | 4.41 | 55.05
4000000 | 55.85 | 56.73 | 18.42 | 227.69

Operations | Scylladb-javascript-driver [s] | Datastax-cassandra-driver [s] | Rust-driver [s] | scylladb-javascript-driver (initial version) [s]
62500 | 2.83 | 2.48 | 1.04 | 1.63
250000 | 1.91 | 2.91 | 1.56 | 4.09
1000000 | 4.58 | 4.69 | 3.42 | 15.74
4000000 | 16.05 | 14.27 | 11.92 | 58.96

Operations | Scylladb-javascript-driver [s] | Datastax-cassandra-driver [s] | Rust-driver [s] | scylladb-javascript-driver (initial version) [s]
62500 | 1.50 | 3.04 | 1.33 | 1.96
250000 | 2.93 | 4.52 | 1.94 | 4.90
1000000 | 8.79 | 11.11 | 5.08 | 16.99
4000000 | 32.99 | 36.62 | 17.90 | 65.74

Operations | Scylladb-javascript-driver [s] | Datastax-cassandra-driver [s] | Rust-driver [s] | scylladb-javascript-driver (initial version) [s]
62500 | 1.42 | 3.09 | 1.25 | 1.45
250000 | 2.94 | 3.81 | 2.43 | 3.43
1000000 | 9.19 | 8.98 | 7.21 | 10.82
4000000 | 33.51 | 28.97 | 25.81 | 40.74

And here are the same benchmarks, without the initial driver version. Here are the results of running the benchmark on Cassandra.

Operations | Scylladb-javascript-driver [s] | Datastax-cassandra-driver [s] | Rust-driver [s]
62500 | 2.48 | 14.50 | 1.25
250000 | 5.82 | 19.93 | 2.00
1000000 | 19.77 | 19.54 | 5.16

Operations | Scylladb-javascript-driver [s] | Datastax-cassandra-driver [s] | Rust-driver [s]
62500 | 1.60 | 2.99 | 1.48
250000 | 3.06 | 4.46 | 2.42
1000000 | 9.02 | 9.03 | 6.53

Operations | Scylladb-javascript-driver [s] | Datastax-cassandra-driver [s] | Rust-driver [s]
62500 | 2.32 | 4.03 | 2.11
250000 | 5.45 | 6.53 | 4.01
1000000 | 18.77 | 16.20 | 13.21

Operations | Scylladb-javascript-driver [s] | Datastax-cassandra-driver [s] | Rust-driver [s]
62500 | 1.86 | 4.15 | 1.57
250000 | 4.24 | 5.41 | 3.36
1000000 | 13.11 | 14.11 | 10.54

The test results across both ScyllaDB and
Cassandra show that the new driver has slightly better performance
on the insert benchmarks. For select benchmarks, it starts ahead
and the performance advantage decreases with time. Despite a series
of optimizations, the majority of the CPU time still comes from
NAPI communication and thread synchronization (according to
internal flamegraph testing). There is still some room for
improvement, which we’re going to explore. Since running those
benchmarks, we introduced changes that improve the performance of
the driver. With those improvements performance of select
benchmarks is much closer to the speed of the DataStax driver.
Again…please stay tuned for another blog post with updated results.
Shards and tablets Since the DataStax driver lacked tablet and
shard support, we were curious if our new shard-aware and
tablet-aware drivers provided a measurable performance gain with
shards and tablets.
Operations | ScyllaDB JS Driver [s] (Shard-Aware / No Shards) | DataStax Driver [s] (Shard-Aware / No Shards) | Rust Driver [s] (Shard-Aware / No Shards)
62,500 | 1.89 / 2.61 | 3.45 / 3.51 | 0.99 / 1.20
250,000 | 4.15 / 7.61 | 5.66 / 6.14 | 1.73 / 2.30
1,000,000 | 13.65 / 30.36 | 15.86 / 16.62 | 4.41 / 8.33
4,000,000 | 55.85 / 134.90 | 56.73 / 77.68 | 18.42 / 42.64

Operations | ScyllaDB JS Driver [s] (Shard-Aware / No Shards) | DataStax Driver [s] (Shard-Aware / No Shards) | Rust Driver [s] (Shard-Aware / No Shards)
62,500 | 1.50 / 1.52 | 3.04 / 3.63 | 1.33 / 1.33
250,000 | 2.93 / 3.29 | 4.52 / 5.09 | 1.94 / 2.02
1,000,000 | 8.79 / 10.29 | 11.11 / 11.13 | 5.08 / 5.71
4,000,000 | 32.99 / 38.53 | 36.62 / 39.28 | 17.90 / 20.67

In insert benchmarks, there are
noticeable changes across all drivers when having more than one
shard. The Rust driver improved by around 36%, the new driver
improved by around 46%, and the DataStax driver improved by only
around 10% when compared to the single sharded version. While
sharding provides some performance benefits for the DataStax
driver, which is not shard aware, the new driver benefits
significantly more — achieving performance improvements comparable
to the Rust driver. This shows that it’s not only introducing more
shards that provide an improvement in this case; a major part of
the performance improvement is indeed shard-awareness.
Operations | ScyllaDB JS Driver [s] (No Tablets / Standard) | DataStax Driver [s] (No Tablets / Standard) | Rust Driver [s] (No Tablets / Standard)
62,500 | 1.76 / 1.89 | 3.67 / 3.45 | 1.06 / 0.99
250,000 | 3.91 / 4.15 | 5.65 / 5.66 | 1.59 / 1.73
1,000,000 | 12.81 / 13.65 | 13.54 / 15.86 | 3.74 / 4.41

Operations | ScyllaDB JS Driver [s] (No Tablets / Standard) | DataStax Driver [s] (No Tablets / Standard) | Rust Driver [s] (No Tablets / Standard)
62,500 | 1.46 / 1.50 | 2.92 / 3.04 | 1.33 / 1.33
250,000 | 2.76 / 2.93 | 4.03 / 4.52 | 1.94 / 1.94
1,000,000 | 8.36 / 8.79 | 7.68 / 11.11 | 4.84 / 5.08

When it comes to
tablets, the new driver and the Rust driver see only minimal
changes to the performance, while the performance of the DataStax
driver drops significantly. This behavior is expected. The DataStax
driver is not aware of the tablets. As a result, it is unable to
communicate directly with the node that will store the data – and
that increases the time spent waiting on network communication.
Interesting things happen, however, when we look at the network
traffic:

What | Total packets | CQL packets | TCP packets | Total size
New driver, 3 nodes, all traffic | 412,764 | 112,318 | 300,446 | ~43.7 MB
New driver, 3 nodes, driver ↔ database | 409,678 | 112,318 | 297,360 | –
New driver, 3 nodes, node ↔ node | 3,086 | 0 | 3,086 | –
DataStax driver, 3 nodes, all traffic | 268,037 | 45,052 | 222,985 | ~81.2 MB
DataStax driver, 3 nodes, driver ↔ database | 90,978 | 45,052 | 45,926 | –
DataStax driver, 3 nodes, node ↔ node | 177,059 | 0 | 177,059 | –
This table shows the number of packets sent during the concurrent
insert benchmark on three-node ScyllaDB with 2 shards per node.
Those results were obtained with RF = 1. While running the database
with such a replication factor is not production-suitable, we
chose it to better visualize the results. When looking at those
numbers, we can draw the following conclusions: The new driver has
a different coalescing mechanism. It has a shorter wait time, which
means it sends more messages to the database and achieves
lower latencies. The new driver knows which node(s) will
store the data. This reduces internal traffic between database
nodes and lets the database serve more traffic with the same
resources. Future plans The goal of this project was to create a
working prototype, which we managed to successfully achieve. It’s
available at https://github.com/scylladb/nodejs-rs-driver,
but it’s considered experimental at this point. Expect it to change
considerably, with ongoing work and refactors. Some of the features
that were present in DataStax driver, and are expected for the
driver to be considered deployment-ready, are not yet implemented.
The Drivers team is actively working to add those features. If
you’re interested in this project and would like to contribute,
here’s the project’s GitHub
repository.
When Bigger Instances Don’t Scale
A bug hunt into why disk I/O performance failed to scale on larger AWS instances The promise of cloud computing is simple: more resources should equate to better, faster performance. When scaling up our systems by moving to larger instances, we naturally expect a proportional increase in capabilities, especially in critical areas like disk I/O. However, ScyllaDB’s experience enabling support for the AWS i7i and i7ie instance families uncovered a puzzling performance bottleneck. Contrary to expectations, bigger instances simply did not scale their I/O performance as advertised. This blog post traces the challenging, multi-faceted investigation into why IOTune (a disk benchmarking tool shipped with Seastar) was achieving a fraction of the advertised disk bandwidth on larger instances. On these machines, throughput plateaued at a modest 8.5GB/s and IOPS were much lower than expected on increasingly beefy machines. What followed was a deep dive into the internals of the ScyllaDB IO Scheduler, where we uncovered subtle bugs and incorrect assumptions that conspired to constrain performance scaling. Join us as we investigate the symptoms, pin down the root cause, and share the hard-fought lessons learned on this long journey. This blog post is the first in a three-part series detailing our journey to fully harness the performance of modern cloud instances. While this piece focuses on the initial set of bottlenecks within the IO Scheduler, the story continues in two subsequent posts. Part 2, The deceptively simple act of writing to disk, tracks down a mysterious write throughput degradation we observed in realistic ScyllaDB workloads after applying the fixes discussed here. Part 3, Common performance pitfalls of modern storage I/O, summarizes the invaluable lessons learned and provides a consolidated list of performance pitfalls to consider when striving for high-performance I/O on modern hardware and cloud platforms. Problem description Some time ago, ScyllaDB decided to support the AWS i7i and i7ie families. Before we support a new instance type, we run extensive tests to ensure ScyllaDB squeezes every drop of performance out of the provisioned hardware. While measuring disk capabilities with the Seastar IOTune tool, we noticed that the IOPS and bandwidth numbers didn’t scale well with the size of the instance, and we ended up with much lower values than AWS advertised. Read IOPS were on par with AWS specs up to i7i.4xlarge, but they were getting progressively worse, up to 25% lower than spec on i7i.48xlarge. Write IOPS were worse, starting at around 25% less than spec for i7i.4xlarge and up to 42% less on i7i.48xlarge. Bandwidth numbers were even more interesting. Our IOTune measurements were similar to fio up to the i7i.4xlarge instance type. However, as we scaled up the instance type, our IOTune bandwidth numbers were plateauing at around 8.5GB/s while fio was managing to pull up to 40GB/s throughput for i7i.48xlarge instances. Essential Toolkit The IOTune tool is a disk benchmarking tool that ships with Seastar. When you run this tool on a storage mount point, it outputs 4 values corresponding to the read/write IOPS and read/write bandwidth of the underlying storage system. These 4 values end up in a file called io-properties.yaml. When provided with these values, the Seastar IO Scheduler will build a model of the disk, which it will use to help ScyllaDB maximize the drive’s performance. 
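For reference, io-properties.yaml is a small YAML file that looks roughly like the following. The field names follow the documented Seastar/ScyllaDB format, but the numbers are made up for illustration (bandwidth values are expressed in bytes per second):

    # io-properties.yaml (illustrative values only)
    disks:
      - mountpoint: /var/lib/scylla
        read_iops: 1000000
        read_bandwidth: 10000000000
        write_iops: 800000
        write_bandwidth: 9000000000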
The IO Scheduler models the disk based on the IOPS and bandwidth properties using a formula that looks something like:

read_bw/read_bw_max + write_bw/write_bw_max + read_iops/read_iops_max + write_iops/write_iops_max <= 1

The internal mechanics of
how the IO Scheduler works are described very thoroughly in the
blog post I linked above. The io_tester tool is another utility
within the Seastar framework. It’s used for testing and profiling
I/O performance, often in more controlled and customizable
scenarios than the automated IOTune. It allows users to simulate
specific I/O workloads (e.g., sequential vs. random, various
request sizes, and concurrency levels) and measure resulting
metrics like throughput and latency. It is particularly useful for:
Deep-dive analysis: Running experiments with fine-grained control
over I/O parameters (e.g., --io-latency-goal, request
size, parallelism) to isolate performance characteristics or
potential bottlenecks. Regression testing: Verifying that changes
to the IO Scheduler or underlying storage stack do not negatively
impact I/O performance under various conditions. Fair Queue
experimentation: As shown in this investigation, io_tester can be
used to observe the relationship between configured workload
parameters, the resulting in-disk queue lengths, and the throttling
behavior of the IO Scheduler. What this meant for ScyllaDB We
didn’t want to enable i7i instances if the IOTune numbers didn’t
accurately reflect the underlying disk performance of the instance
type. Lower io-properties numbers cause the IO Scheduler to
overestimate the cost of each request. This leads to more
throttling, making monstrous instances like i7i.48xlarge perform
like much cheaper alternatives (such as the i7i.4xlarge, for
example). Pinning the symptoms Early on, we noticed that the
observed symptoms pointed to two different problems. This helped us
narrow down the root causes much faster (well, fast here is a very
misleading term). We were chasing a lower-than-expected IOPS issue
and a different low-bandwidth issue. IOPS and bandwidth numbers
were behaving differently when scaling up instances. The former was
scaling, but with much lower values than we expected. The latter
would just plateau from one point and stay there, no matter how
much money you’d throw at the problem. We started with the
hypothesis that IOTune might misdetect the disk’s physical block
size from sysfs and that we issue requests with a different size
than what the disk “likes,” leading to lower IOPS. After some
debugging, we confirmed that IOTune indeed failed to detect the
block size, so it defaulted to using requests of 512 bytes. There's
no bug to fix on the IOTune side here, but we decided we needed to
be able to specify the disk block size for reads and writes
independently when measuring. This turned out to be quite helpful
later on. With 4K requests, we were able to measure the expected
~1M IOPS for writes compared to the ~650k IOPS we were getting with
the autodetected 512-byte requests (numbers relevant for the
i7i.12xlarge instance). We had a fix for the IOPS issue, but – as
we discovered later – we didn’t properly understand the actual root
cause. At that point, we thought the problem was specific to this
instance type and caused by IOTune misdetecting the block size. As
you’ll see in the next blog post in the series, the root cause is a
lot more interesting and complicated. The plateauing bandwidth
issue was still on the table. Unfortunately, we had no clue about
what could be going on. So, we started exploring the problem space,
concentrating our efforts as you’d imagine any engineer would.
Blaming the IO Scheduler We dug around, trying to see if IOTune
became CPU-limited for the bandwidth measurements. But that wasn’t
it. It’s somewhat amusing that our initial reaction was to point
the finger at the IO Scheduler. This bias stems from when the IO
Scheduler was first introduced in ScyllaDB. It had such a profound
impact that numerous performance issues over time – things that
were propagating downward to the storage team – were often (and
sometimes unfairly) attributed to it. Understanding the root cause
We went through a series of experiments to try to narrow down the
problem further and hopefully get a better understanding of what
was happening. Most of the experiments in this article, unless
explicitly specified, were run on an i7i.12xlarge instance. The
expected throughput was ~9.6GB/s while IOTune was measuring a write
throughput of 8.5GB/s. To rule out poor disk queue utilization, we
ran fio with various iodepths and block sizes, then recorded the
bandwidth. We
noticed that the request needs to be ~4MB to fill the disk queue.
Next, we collected the same for io_tester with
--io-latency-goal=1000 to prevent the queue from splitting requests.
A larger latency goal means the scheduler can be more relaxed and
submit the requests as they come because it has plenty of time
(1000 ms) to complete each request in time. If the goal is smaller,
the IO Scheduler gets stressed because it needs to make each
request complete in that tight schedule. Sometimes it might just
split a request in half to take advantage of the in-disk
parallelism and hopefully make the original request fit the tight
latency goal. The
fio tool seemed to be pulling the full bandwidth from the disk, but
our io_tester tool was not. The issue was definitely on our side.
The good news was that both io_tester and IOTune measured similar
write throughputs, so we weren’t chasing a bug in our measurement
tools. The conclusion of this experiment was that we saturated the
disk queue properly, but we still got low bandwidth. Next, we
pulled an ace out of the sleeve. A few months before this, we were
at a hackathon during our engineering summit. During that
hackathon, our Storage and Networking team built a prototype
Seastar IO Scheduler controller that would bring more transparency
and visibility into how the IO Scheduler works. One of the patches
from that project was a hack that would make the IO Scheduler drop
a lot of the IOPS/bandwidth throttling logic and just drain like
crazy whatever requests are queued. We applied that patch to
Seastar and ran the IOTune tool again. It was very rewarding to see
the following output: Measuring sequential write bandwidth: 9775
MB/s (deviation 6%) Measuring sequential read bandwidth: 13617 MB/s
(deviation 22%) The bandwidth numbers escaped the
8.5GB/s limit that was previously constraining our
measurements. This meant we were correct in blaming the IO
Scheduler. We were indeed experiencing throttling from the
scheduler, specifically from something in the fair queue math. At
that point, we needed to look more closely at the low-level
behavior. We patched Seastar with another home-brewed project
that adds a low-overhead binary tracer to the IO Scheduler. The
plan was to run the tracer on both the master version and the one
with the hackathon patch applied – then try to understand why the
hackathon-patched scheduler performs better. We added a few traces
and we immediately started to see patterns like these in the slow
master trace: Here, it took 134-51=83us to dispatch one request.
The “Q” event is when a request arrives at the scheduler and gets
queued. “D” stands for when a request gets dispatched. For
reference, the patched IO scheduler spent 1us to
dispatch a request. The unexpected behavior suggested an issue with
the
token bucket accumulation, as requests should be dispatched
instantly when running without io-properties.yaml (effectively
providing unlimited tokens). This is precisely the scenario when
IOTune is running: it withholds io-properties.yaml from the IO
Scheduler. This allows the token bucket to operate with unlimited
tokens, stressing the disk to its maximum potential so IOTune can
compute, by itself, the required io-properties.yaml. The token
bucket seems to run out of tokens…but why? When the token bucket
runs out of tokens, it needs to wait for tokens to be replenished
when other requests are completed. This delays the dispatch of the
next request. That’s why the above request waited
83us to get dispatched when it should have
actually been dispatched in 1us. There wasn’t much
more we could do with the event tracer. We needed to get closer to
the fair queue math. We returned to io_tester to examine the
relationship between the parallelism of the test and the size of
the in-disk queues. We ran io_tester for requests sized within
[128k, 1MB] with parallelism within [1,2,4,8,16] fibers. We ran it
once for the master branch (slow) and once for the “hackathon”
branch (fast). Here are some plots from these results. The plots
are throughput (vertical axis) against parallelism (horizontal) for
two request sizes, 1MB and 128kB.
For
both request sizes, the “hackathon” branch outperformed the
“master” branch. Also, the 1MB request saturates the disk with much
lower parallelism than the 128k request. No surprises here, the
result wasn’t that valuable. In a follow-up test, we collected the
in-disk latencies as well. We plotted throughput against
parallelism for both the master and hackathon branches. The lines
crossing the bars represent the in-disk latencies measured.
This is already much better. After the disk is saturated,
increasing parallelism should create a proportional increase for
in-disk latency. That’s exactly what happens for the hackathon
branch. We couldn’t say the same about the master branch. Here, the
throughput plateaued around 4 fibers, and the in-disk latency
didn’t grow! For some reason, we didn’t end up stressing this disk.
To investigate further, we wanted to see the size of the actual
in-disk queues. So, we coded up a patch to make io_tester output
this information. We plotted the in-disk queue size alongside
parallelism for various request sizes. At this point, it became
clear that we weren’t sufficiently leveraging the in-disk
parallelism. Likely, the fair_queue math was making the IO
Scheduler throttle requests excessively. This is indeed what the
plots below show. In the master (slow) branch run, the in-disk
queue length for the 1MB request (which saturates the disk faster)
plateaus at around 4 requests once parallelism=4 and higher. That’s
definitely not alright.
Just
for fun, let’s look at Little’s Law in
action. We plotted disk_queue_length / latency for
each branch as follows. Next,
we wanted to (somehow) replicate this behavior without involving an
actual disk. This way, we could maybe create a regression test for
the IO Scheduler. The Seastar ioinfo tool was
perfect for this job. ioinfo can take an
io-properties.yaml file as an argument. It feeds the values to the
IO Scheduler, then the tool outputs the token bucket parameters
(which can be used to calculate the theoretical IOPS and throughput
values that the IO Scheduler can achieve). Our goal was to compare
these calculated values with what was configured in an
io-properties.yaml file and make sure the IO Scheduler could
deliver very close to what it was configured for. For reference,
here’s how the calculated IOPS/bandwidth looked compared to the
configured values.
The
values returned by the scheduler were within a 5% margin of the
configured one. This was fantastic news (in a way). It meant the
fair_queue math behaves correctly even with bandwidths above
8.xGB/s. We didn’t get the regression test we
hoped for, since the fair_queue math was not
causing the throttling and disk underutilization we’d seen in the
previous experiment. However, we did add a test that would check if
this behavior changes in the future. We did get a huge win from
this, though. We came to the conclusion that something must be
wrong with the fair_queue math or something in the IO Scheduler
must be incorrect only when it’s not configured with an
io-properties file. At that point, the problem space narrowed
significantly. Playing around with the inputs from the
io-properties.yaml file, we uncovered yet another bug. For large
enough read IOPS/bandwidth numbers in the config file, the IO
Scheduler would report request costs of zero. After many
discussions, we learned that this is not really a bug. It’s how the
math should behave. With big io-properties numbers, the math should
plateau the costs at 0. It makes sense: the more resources you have available, the less significant a single unit of effort becomes. This
led us to an important realization: the unconfigured case (our
original issue) should also produce a cost of zero. A zero cost
means that the token bucket won’t consume any tokens. That gives us
unbounded output…which is exactly what IOTune wants. Now we needed
to figure out two things: Why doesn’t the IO Scheduler report a
cost of zero for the unconfigured case? In theory, it should. In
the issue linked above, costs became zero for values that weren’t
even close to UINT64_MAX. Was our code prepared to handle costs of
zero? We should ensure we don’t end up with weird overflows or
divisions by zero or any undefined behavior from code that assumes
costs can’t be zero. When things start to converge At this point,
we had no further leads, so we thought there must be something
wrong with the fair queue math. I reviewed the math from
Implementing a New IO Scheduler Algorithm for Mixed Read/Write
Workloads, but I didn’t find any obvious flaws that could
explain our unconfigured case. We hoped we’d find some formula
mistakes that made the bandwidth hit its theoretical limit at
8.5GB/s. We didn’t find any obvious issues here,
so we concluded there must be some flaw in the implementation of
the math itself. We started suspecting that there must be some
overflow that ends up shrinking the bandwidth numbers. After quite
some time tracking the math implementation in the code, we managed
to find the issue. Two internal IO Scheduler variables that were
storing the IOPS and bandwidth values configured via
io-properties.yaml had a default value set to
`std::numeric_limits<int>::max()`. It wasn’t that intuitive
to figure out – the variables weren’t holding the actual
io-properties values, but rather some values that derived from
them. This made the mistake harder to spot. There is some code that
recalculates those variables when the io-properties.yaml file is
provided and parsed by the Seastar code. However, in the
“unconfigured” case, those code paths are intentionally not hit.
So, the INT_MAX values were carried into the fair queue math,
crunched into the formulas, and resulted in the
8.xGB/s throughput limit we kept seeing. The fix
was as simple as changing the default value to `std::numeric_limits<uint64_t>::max()`. A
one-line fix for many weeks of work. It’s been a crazy
long journey chasing such a small bug, but it has been an
invaluable (and fun!) learning experience. It led to lots of
performance gains and enabled ScyllaDB to support highly efficient
storage instances like i7i, i7ie and i8g. Stay tuned for the next
episode in this series of blog posts. In part 2, we will uncover
that the performance gains after this work weren’t quite what we
were expecting on realistic workloads. We will deep dive into some
very dark corners of modern NVMes and filesystems to unlock a
significant chunk of write throughput.
Read part 2
Scaling Is the "Funnest" Game: Rachel Stephens and Adam Jacob
When not to worry about scale, when to rearchitect everything and why passionate criticism is a win “There’s no funner game than the at-scale technology game. But if you play it, some people will hate you for it. That’s okay…that’s the game you chose to play.” – Adam Jacob At Monster Scale Summit 2025, Rachel Stephens, research director at RedMonk, spoke with Adam Jacob, co-founder of Chef and CEO of System Initiative, about what it really means to build and operate software at scale. Note: Monster SCALE Summit 2026 will go live March 11-12, featuring antirez, creator of Redis; Camille Fournier, author of “The Manager’s Path” and “Platform Engineering”; Martin Kleppmann, author of “Designing Data-Intensive Applications” and more than 50 others. The event is free and virtual. Register for free and join the community for some lively chats The Existential Question of Scale Stephens opened with an existential question: “Does your software exist if your users can’t run it?” Yes, your code still exists in GitHub even if us-east-1 goes down. But what if … Your system crawls under load. Critical integrations constantly break. You can’t afford the infrastructure costs. “Software at scale isn’t just about throughput,” Stephens said. “It’s about making sure that your code endures, adapts and remains accessible no matter the load and location of where you’re running. Because if your users can’t use it, your software may as well not exist.” With that framing, Stephens brought in someone who’s spent his career dealing with scale firsthand: Adam Jacob. Only Scale When It Hurts Stephens asked Jacob how teams can balance quality, speed and scale under uncertainty. How do you avoid both cutting corners and premature optimization? Jacob argues that early on, it’s fine not to worry much about scale. Most products fail for other reasons before scalability ever becomes a problem. He explained: “I think of it basically through the lens of optionality. When you start building new things, it’s nice not to worry too much about scale, because you may never reach it. Most products don’t fail because they fail to scale. Think about how badly Twitter failed to scale … and yet here we are.” The first priority is to build a solid product. Once scale becomes a real issue, that’s when it makes sense to refactor and remove bottlenecks. But if you’ve been around the block a little, your experience helps you make early choices that pay off later. Jacob noted, “Premature optimization is real. But as you gain experience, there are some decisions you make early because you know that if things work out, you’ll be happier later — like factoring your code so it can be broken apart across network boundaries over time, if you need to.” Chef Scalability Horror Stories Next, Stephens asked Jacob if he would share a scaling horror story from his Chef days. Jacob obliged and offered two memorable ones. “The best was when we launched the first version of Hosted Chef. The day before the launch, we discovered it took about a minute and a half to create a new user. It didn’t take that long when we were running it on a laptop, but it did later … and we never really tested it. So, in the final hours before launch, we changed it from ‘anyone can sign up’ to a queue system with a little space robot saying, ‘Demand is so high; we’ll get back to you.’ We just papered over the scalability problem.” “Another example: that same Chef server (the one that couldn’t create accounts quickly) eventually had to work at Facebook. 
The original version was written in Rails, which was great to work with, but not parallel enough. At Facebook scale, you might have 40,000 or 50,000 things pointed at one Chef server. So we rebuilt it in Erlang, which is great for that kind of problem. I literally brought the Erlang version to Facebook on a USB stick. When we installed it and bootstrapped a data center, we thought it was broken because it was using less compute and finished almost instantly.” Jacob explained that if they’d tried to build the Chef server in Erlang from the start, the project probably wouldn’t have gained traction. Starting in Rails made it possible to get Chef out into the world and learn what the system really needed to do. Only later, once they understood how the system really behaved, could they rebuild it with the right architecture and runtime for scale.

Growth or Efficiency: Know Which Game You’re Playing
At Chef, scaling was ultimately required to land customers like Facebook and JPMorgan Chase, which operate at massive scale. Jacob advised, “Making it scale required major investment, but it worked. You can’t buy your core. If it matters to customers, you have to build it yourself. People often wait too long to realize they have a deep architectural problem that’s also a business problem. Rebuilding for scale takes months, so you have to start early.” Your own approach to scale should ultimately be driven by what game you’re playing: In the venture-capital game, growth and traction come first. You can spend money to scale faster because you’re funded. In the profitability game, efficiency comes first. Overspending on compute or poor architecture hits the bottom line hard.

Why Scaling is the ‘Funnest’ Game
Stephens mentioned that “when software succeeds, it stops being yours – it becomes everyone’s.” She then asked Jacob what it’s like when your tech scales to the point that people have extremely strong opinions about it. His response: “It’s hard to build things that people care about. If you’re lucky enough to create something you love and share it with the world and people love it back, that’s incredibly rewarding. Even when they don’t, that’s still a gift.” “Someone once tapped me in a coffee shop and said, ‘You wrote Chef? I hate Chef.’ I said, ‘I’m sorry; I didn’t write it to hurt you.’ But at scale, that means he used it. It mattered in his life. And that’s what you want: for people to experience what you built.” “I love the technology, the problem, the difficulty. Scaling adds more layers of complexity, more layers of fun. There’s no funner game than the at-scale technology game. But if you play it, some people will hate you for it. That’s okay…that’s the game you chose to play.” You can watch the full talk below.

You Got OLAP in My OLTP: Can Analytics and Real-Time Database Workloads Coexist?
Explore isolation mechanisms and prioritization strategies that allow different database workloads to coexist without resource contention issues Analytics (OLAP) and real-time (OLTP) workloads serve distinctly different purposes. OLAP (online analytical processing) is optimized for data analysis and reporting, while OLTP (online transaction processing) is optimized for real-time low-latency traffic. Most databases are designed to primarily benefit from either OLAP or OLTP, but not both. Worse, concurrently running both workloads under the same data store will frequently introduce resource contention. The workloads end up hurting each other, considerably dragging down the overall distributed system’s performance. Let’s look at how this problem arises, then consider a few ways to address it. OLTP vs OLAP Databases There are basically two fundamental approaches involving how databases store data on disk. We have row-oriented databases, often used for real-time workloads. These store all data pertaining to a single row on disk. Row-oriented storage (ideal for OLTP) Column-oriented storage (ideal for OLAP) On the other side of the spectrum, we have column-oriented databases, which are often used for running analytics. These databases store data in a vertical way (versus horizontal partitioning of rows). This single design decision effectively makes it much easier and efficient for the database to run aggregations, perform calculations and answer retrieving insights such as “Top K” metrics. OLTP vs. OLAP Workloads So the general consensus is that if you want to run OLTP workloads, you use a row-oriented database – and you use a columnar one for your analytics workloads. However, contrary to popular belief, there are a variety of reasons why people might actually want to run an OLAP workload on top of their real-time databases. For example, this might be a good option when organizations want to avoid data duplication or the complexity and overhead associated with maintaining two data stores. Or maybe they don’t extract insights all that often. The Latency Problem But problems can arise when you try to bring OLAP to your real-time database. We’ve studied this a lot with ScyllaDB, a specialized NoSQL database that’s primarily meant for high throughput and low-latency real-time workloads. The following graphic from ScyllaDB monitoring demonstrates what happens to latency when you try to run OLAP and OLTP workloads alongside one another. The green line represents a real-time workload, whereas the yellow one represents an analytics job that’s running at the same time. While the OLTP workload is running on its own, latencies are great. But as soon as the OLAP workload starts, the real-time latencies dramatically rise to unacceptable levels. The Throughput Problem Throughput is also an issue in such scenarios. Looking at the throughput clarifies why latencies climbed: The analytics process is consuming much higher throughput than the OLTP one. You can even see that the real-time throughput drops, which is a sign that the database got overloaded. Unsurprisingly, as soon as the OLAP job finishes, the real-time throughput increases and the database can then process its backlog of queued requests from that workload. That’s how the contention plays out in the database when you have two totally different workloads competing for resources in an uncoordinated way. The database is naively trying to process requests as they come in. 
When Things Get Contentious
But why does this contention happen in the first place? If you overwhelm your database with too many requests, it cannot keep up. Usually, that’s because your database lacks either the CPU or I/O capacity required to fulfill your requests. As a result, requests queue up and latency climbs. The workloads contribute to contention too. OLTP applications often process many smaller transactions and are very latency sensitive. However, OLAP ones generally run fewer transactions that require scanning and processing through large amounts of data. So hopefully that explains the problem. But how do we actually solve it?

Option A: Physical Isolation
One option is to physically isolate these resources. For example, in a Cassandra deployment, you would simply add a new data center and separate your real-time processing from your analytics. This saves you from having to stream data and work with a different database. However, it considerably elevates your costs. Some specific examples of this strategy: Instaclustr, a managed services provider, shared a benchmark after isolating its deployments (Apache Spark and Apache Cassandra). GumGum shared the results of this approach (with multiregion Cassandra) at a past Cassandra Summit. There are definitely use cases and organizations running OLAP on top of real-time databases. But are there any other alternatives to resolve the problem altogether?

Option B: Scheduled Isolation
Other teams take a different approach: They avoid running their OLAP during their peak periods. They simply run their analytics pipelines during off-peak hours in order to mitigate the impact on latencies. For example, consider a food delivery company. Answering a question like “How much did this merchant sell within the past week?” is simple in OLTP. However, offering discounts to the 10 top-selling restaurants within a given region is much more complicated. In a wide-column database like Cassandra or ScyllaDB, it inevitably requires a full table scan. Therefore, it would make sense for such a company to run these analytics from after midnight until around 10 a.m. – before its peak traffic hours. This is a doable strategy, but it still doesn’t solve the problem. For example, what if your dataset doubles or triples? Your pipeline might overrun your time window. And you have to consider that your business is still running at that time (people will still order food at 2 a.m.). If you take this approach, you still need to tune your analytics job and ensure it doesn’t kill your database.

Option C: Workload Prioritization
ScyllaDB has developed an approach called Workload Prioritization to address this problem. It lets users define separate workloads and assign different resource shares to them. For example, you might define two service levels: The main one has 600 shares, and the secondary one has 200 shares.

CREATE SERVICE LEVEL main WITH shares = 600
CREATE SERVICE LEVEL secondary WITH shares = 200

ScyllaDB’s internal scheduler will process three times more tasks from the main workload than the secondary one. Whenever the system is under contention, it prioritizes its resource allocation accordingly. Why does this kick in only during contention? Because if there’s no contention, there is no bottleneck, so there is effectively nothing to prioritize. [Play with an interactive animation]

Workload Prioritization Under the Hood
Under the hood, ScyllaDB’s Workload Prioritization relies on Seastar scheduling groups.
Seastar is a C++ framework for data-intensive applications. ScyllaDB, Redpanda, Ceph’s SeaStore and other technologies are built on top of it. Scheduling groups are effectively the way Seastar allows background operations to have little impact on foreground activities. For example, in ScyllaDB and database-specific terms, there are several different scheduling groups within the database. ScyllaDB has a distinct group for compactions, streaming, Memtables, and so on. With Cassandra, you might end up in a situation where compactions impact your workload performance. But in ScyllaDB, all compaction resources are scheduled by Seastar. And according to its shares of resources, the database will allocate a respective share of resources to the background activity (compaction, in that case) – therefore ensuring that the latency of the primary user-facing workload doesn’t suffer. Using scheduling groups in this way also helps the database auto-tune. If the user workload is running during off-peak hours, then the system will automatically have more spare computing and I/O cycles to spend. The database will simply speed up its background activities. Here’s a guided tour of how Workload Prioritization actually plays out: OLTP and OLAP Can Coexist Running OLAP alongside OLTP inevitably involves anticipating and managing contention. You can control it in a few ways: isolate analytics to its own cluster, run it in off-peak windows, or enforce workload prioritization. And workload prioritization isn’t just for allowing OLAP along with your OLTP. That same approach could also be used to assign different priorities to reads vs. writes, for example. If you’d like to learn more, take a look at my recent tech talk on this topic: “How to Balance Multiple Workloads in a Cluster.”ScyllaDB Cloud on AWS I8g and I8ge: 2x Throughput, Lower Latency, Zero Extra Cost
How ScyllaDB performs on the new I8g and I8ge instances, across different workload types Let’s start with the bottom line. For ScyllaDB, the new Graviton4-based i8g instances improve i4i throughput by up to 2x with better latency – and the i8ge improves i3en throughput by up to 2x with better latency. Benchmarks also show single-digit millisecond latency during maintenance operations like scaling. Fast and smooth scaling is an important part of the new ScyllaDB X Cloud offering. The chart below shows ScyllaDB max through under a latency SLA of 10ms latency for different workloads, for the old i4i, i3en and the new i8g, i8ge. AWS recently launched the I8g and I8ge storage-optimized EC2 instances powered by AWS Graviton4 CPUs and 3rd-generation AWS Nitro SSDs. They’re designed for I/O-intensive workloads like real-time databases, search, and analytics (so a nice fit for ScyllaDB). Instance Family Use Case Number of vCPUs per instance Storage i8g Compute bound 2 to 96 0.5 to 22.5 TB i8ge Storage bound 2 to 192 1.25 to 120 TB Reduced TCO in ScyllaDB Cloud Based on our performance results, ScyllaDB users migrating to Graviton4 can reduce infrastructure requirements by up to 50% compared to i4i and i3en previous generations. This translates into significantly lower total cost of ownership (TCO) by requiring fewer nodes to sustain the same workload. These improvements stem from a few factors – both in the new instances themselves, and in the match between ScyllaDB and these instances. The new I8g architecture features: vCPU-to-core mapping: On x86, each vCPU uses half a physical core (a hyperthread); for i8g (ARM), each core matches one physical core Larger caches: 64kB instruction cache and 64kB data cache, compared to 32/48kB on Intel (shared between the two hyperthreads) Faster storage and networking (see spec above) In addition, ScyllaDB’s design allows it to take full advantage of the new server types: The shard-per-core architecture scales with linear performance to any number of cores The IO scheduler can take full advantage of the 3rd-generation AWS Nitro SSD, fully utilizing the higher IO rate, and lower latency without overloading it and increasing latency ARM’s relaxed memory model suits Seastar applications. Since locks and fences are rare, the memory subsystem has more opportunities to reorder memory accesses to optimize performance. What this means for you I8g and i8ge are now available on ScyllaDB Cloud. If you’re running ScyllaDB Cloud, the net impact is: Compute-bound workloads: Move from I4i to I8g. This should provide up to 2x throughput at the same ScyllaDB Cloud price. Storage-bound workloads: Move from I3en to I8ge. Here, you should expect up to 2x higher throughput at the same ScyllaDB Cloud price. Note that using the new ScyllaDB dictionary-based compression can lower the storage cost further. For both use cases, ScyllaDB can keep the 10ms P99 latency SLA during maintenance operations, including scaling out and scaling down. What we measured Max Throughput: The maximum requests per second the database can handle Max Throughput under SLA: The maximum request per second under a P99 latency of 10ms. Only throughput with latency below this SLA counts. This throughput can be sustained under any operation, like scaling and repair. This is the number you should use when sizing your ScyllaDB Database on i8g instances. 
P99 Latency: Measures the p99 latency for the Max Throughput under SLA

Results

Read Workload – cached data
Cached data: working set size < available RAM, resulting in close to 100% cache hit rate.

Instance type | Max throughput | Max throughput under latency SLA | Improvement | P99 in ms
i4i.4xlarge | 1,062,578 | 750,000 | 100% | 7.84
i8g.4xlarge | 1,434,215 | 1,300,000 | 135% | 6.29
i3en.3xlarge | 585,975 | 550,000 | 100% | 4.37
i8ge.3xlarge | 962,504 | 800,000 | 164% | 6.38

Read Workload – non-cached data, storage only
Non-cached data: working set size >> available RAM, resulting in a 0% cache hit rate. When most of the data is not cached, storage becomes a significant factor for performance.

Instance type | Max throughput | Max throughput under latency SLA | Improvement | P99 in ms
i4i.4xlarge | 218,674 | 210,000 | 100% | 4.56
i8g.4xlarge | 444,548 | 300,000 | 203% | 4.24
i3en.3xlarge | 145,702 | 140,000 | 100% | 6.83
i8ge.3xlarge | 259,693 | 255,000 | 178% | 7.95

Write Workload

Instance type | Max throughput | Max throughput under latency SLA | Improvement | P99 in ms
i4i.4xlarge | 289,154 | 150,000 | 100% | 2.4
i8g.4xlarge | 689,474 | 600,000 | 238% | 4.02
i3en.3xlarge | 217,072 | 200,000 | 100% | 5.42
i8ge.3xlarge | 452,968 | 400,000 | 209% | 3.41

Tests under maintenance operations
ScyllaDB takes pride in testing under realistic use cases, including scaling out and in, repair, backups, and various failure tests. The following results represent the average P99 latency (across all nodes) of different maintenance operations on a 3-node cluster of i8ge.3xlarge, using the same setup as above.

Setup
ScyllaDB version: 2025.3.1-20250907.2bbf3cf669bb
DB node count: 3
DB instance type: i8ge.3xlarge
Loader node count: 4
Loader instance type: c5.2xlarge
Throughput: Read 41K, Write 81K, Mixed 35K

Results

Read Test: Read Latency
Operation | Read P99 latency in ms
Base: Steady State | 0.95
During Repair | 4.92
During Add Node (scale out) | 2.68
During Replace Node | 3.10
During Decommission Node (downscale) | 2.44

Write Test: Write Latency
Operation | Write P99 latency in ms
Steady State | 2.22
During Repair | 3.24
Add Node (scale out) | 2.49
Replace Node | 3.07
Decommission Node (downscale) | 2.37

Mixed Test: Write and Read Latency
Operation | Write P99 latency in ms | Read P99 latency in ms
Steady state | 2.03 | 2.11
During Repair | 3.21 | 4.70
Add Node (scale out) | 2.19 | 2.71
Replace Node | 3.00 | 3.37
Decommission Node (downscale) | 2.20 | 3.05

The results indicate that ScyllaDB can meet the latency SLA under maintenance operations. This is critical for ScyllaDB Cloud, and in particular ScyllaDB X Cloud, where scaling out and in is automatic and can happen multiple times per day. It’s also critical in unexpected failure cases, when a node must be replaced rapidly, without hurting availability or the latency SLA.

Test Setup
ScyllaDB cluster: 3-node cluster; i4i.4xlarge vs. i8g.4xlarge; i3en.3xlarge vs. i8ge.3xlarge
Loaders: 4 loader nodes, instance type c7i.8xlarge
Workload: Replication Factor (RF) 3, Consistency Level (CL) QUORUM
Data size: 650 GB for read/mixed, 1.5 TB for write

Alan Shimel and Dor Laor on Database Elasticity and AI with ScyllaDB
Alan and Dor chat about high-performance databases & AI trends Everything about re:Invent 2025 screamed “massive” – from the exhibit hall’s towering booths, to the overflowing keynotes, to product announcements at every turn. ScyllaDB’s “scale fearlessly” message fit in perfectly. See ScyllaDB’s re:Invent videos But despite the crowds and chaos, Alan Shimel (founder and CEO of Techstrong Group) and Dor Laor (ScyllaDB co-founder and CEO) found a way to meet for a laid-back chat. Topics ranged from ScyllaDB’s origin story, to OSS, to ScyllaDB’s latest announcements for AI and extreme database elasticity. Read highlights below, or enjoy the full interview: ScyllaDB AI Use Cases: Vector Scale, Feature Store, AI Stack Alan: What’s it like on the re:Invent floor? What are the conversations like? What are you hearing? Dor: There’s certainly no shortage of crowds at the booth. A lot of the conversation is about AI. We’re seeing a surge in AI-related use cases. At this point, about half of the use cases we see with ScyllaDB are directly related to AI. Alan: Explain that to me. What’s the use case? Dor: We usually split our AI uses cases into three categories. The first is being part of the AI stack itself. During training and serving, the stack needs to access a huge number of objects, and it needs a fast database to do that. In this case, we’re part of the core AI stack. It’s distributed databases handling very high workloads – and these are very high workloads. That can be for large LLM companies, or for much smaller companies that are just starting their AI journey. That’s the first category. The second category is the feature store. Feature stores are more traditionally associated with machine learning, but they’re still part of the AI world. A feature store lets people classify users, or sometimes agents, automatically. That can be used for recommendations in e-commerce, fraud detection, and a variety of other use cases. In those cases, the feature store needs a fast database to quickly determine how a user is classified and what’s appropriate for them – what they might want to watch, what ad they should see, and so on. The third category is vector search for running LLMs on private datasets. That’s where RAG comes in, with vector data. We added vector search ourselves, and we’re already seeing a lot of interest. In January, we’ll be going live with the general availability of our RAG and vector store. Alan: So in essence, they could use ScyllaDB as their vector database. They’re creating small language models or RAG. That’s got to be big…that’s fantastic. Dor: Our vector search is the most scalable. We can easily run models with a billion objects. Very few vendors can even reach a billion. We can do that while handling hundreds of thousands of requests per second, so we scale to very high numbers. And for people with lower or medium demand – which is most users, with models around 10 million or 100 million objects – we can deliver the best latency at very low price points. Alan: That’s fantastic. Look, there are a lot of people saying we’ve scraped everything there is to scrape for these LLMs. That continuing to make generative AI better by just increasing model size or training data is starting to hit diminishing returns. The thinking is that the way forward might be smaller language models, more RAG. Some people even argue we should move away from ML altogether and toward things like world models. But I definitely believe there’s going to be a lot of activity in the SLM and RAG space. 
And beyond that, as we build AI for specific use cases, I don’t need the whole internet. I just need the data that matters for that use case – especially if it’s my own proprietary information. I don’t want to put that out there. I want it right here. So I think that’s a huge business. Congratulations. Dor: Thanks. It’s market demand. It’s not just an opportunity, it’s also a defensive move. If we don’t do it, customers will go elsewhere, to be frank. People now expect the same ease of use they get from LLMs on public internet data when they come to any vendor. They want to ask questions in free text, in a single line, and immediately get the best results – without digging through a complicated UI. That’s the power of LLMs. And sometimes it won’t even be people doing that. It could be agents that come in, automate things, and issue those queries on their behalf. True Database Elasticity: Scaling Out and In, Fast Alan: All right, let’s fast-forward past AWS for a second. You have some new announcements coming soon. Share a bit, if you don’t mind. Dor: Thank you for the opportunity. We’re moving from beta to general availability with ScyllaDB X Cloud, our managed platform. This is the new generation of our core database, delivered as a database-as-a-service, including management and consumption. The unique thing here is our new core architecture, which we call tablets. It’s way more elastic than any other database – or even infrastructure – out there. Before this, we were okay in terms of scaling clusters out and scaling them back in. We were about average. But there was demand to do it much faster. And frankly, we also compete with DynamoDB. We’re API-compatible with DynamoDB. Up until now, DynamoDB has been the best in the industry at scaling up and down quickly. If your workload changes throughout the day, you don’t want to pay for peak capacity all the time. You want the system to follow usage dynamically. That’s exactly what X Cloud is. With tablets, we break a very large database– say, a petabyte of data – into five-gigabyte chunks. We can move those chunks around super quickly. That allows us to scale extremely fast. We can increase capacity by four times in about ten minutes. For example, you can go from 500,000 operations per second to two billion operations per second in ten minutes. Alan: And back to 500K? Dor: That’s right. Alan: Sometimes with these things, it’s like blowing up a balloon. You know what I mean? It never really goes back to the size it was before you blew it up. Dor: So with this, we can also go back and shrink. It’s complicated, but it works. User workloads come and go – whether it’s Black Friday or just daily patterns. That leads to big TCO improvements and usability improvements. It’s also pretty unique. We have a shard-per-core engine. So if you have a machine with 32 cores, you’ll have 32 independent threads in the server. If you have a 64-core machine, you’ll have 64 threads, and it will perform twice as well. Now, let’s say you have a 64-core machine, but you actually need 66 threads. If you only had 64, would you buy another 64-core machine? That’s expensive. Instead, we can mix and match. You can run a 64-core machine together with a small two-vCPU machine side by side. Because of the flexibility of our sharding model, we can combine the two. I haven’t seen any other vendor that can do that. What the user gets is efficiency. They have exactly what they need, without having to buy oversized, expensive servers. 
Alan: Really, what we’re talking about here is almost a FinOps play. I think that’s where we are now, especially with cloud usage. Look, we’re talking about spending five trillion dollars on data center AI factories. But when I talk to people, what they actually say is, I want to get control of my cloud bill. They want to be more efficient in how they use these resources. That’s why I made the balloon joke– that’s pretty much how the cloud works. It never seems to go back down. People want insight. They want to be able to turn the dial. And they want to ask, how can I do this more efficiently? Dor: Most databases aren’t that loaded. I’m not talking about spikes. I’m talking about normal daily usage, or overnight usage. Often it’s only 10% or 20% utilized – but you’re paying for the entire thing. Alan: That was always the promise of the cloud – that elasticity would go up and down. In practice, it mostly just went up. Tiered Storage at ScyllaDB Alan: So, what else is new at ScyllaDB? Dor: We’re also working on things like tiered storage and other technologies to reduce the bill. Normally, we use NVMe for fast storage and performance. It’s also relatively cheap compared to other high-performance storage options. But S3 is cheaper. The problem with S3 is latency. It can be 50 milliseconds, 100 milliseconds, which is prohibitive for many workloads. With tiered storage, we can keep the hot data on fast NVMe and automatically move cold data to S3. That lets us come up with a good solution for common use cases. For example, you might want to keep 30 days of data in ScyllaDB on NVMe, but keep a year of data overall – and still access it through the same API, without having to build a separate access path. That gives users a single API and a very cost-effective solution. Learn more about what’s next for ScyllaDB at Monster SCALE Summit — free and virtual.From Batch to Real-Time: How MoEngage Achieved Millisecond Personalization with ScyllaDB
How a leading customer engagement platform handles 250K writes per second at 1ms p99 latency with 200TB+ data At MoEngage, our mission as a leading customer engagement platform is to help marketers build deep, lasting relationships with their users by processing hundreds of billions of events each month. Initially, our data architecture was built on a solid foundation of Amazon S3 for large-scale batch analytics and Elasticsearch for search. This dual system was effective for historical segmentation but began to buckle under the modern demand for instantaneous personalization. Our clients needed to react to user actions not in minutes, but in milliseconds. For example, sending a notification based on a product just viewed or updating a user’s segment the moment they qualify for a new campaign. This shift exposed the fundamental limitations of our architecture. Querying a single user’s recent activity in S3 was prohibitively slow, requiring massive dataset scans. At the same time, our write-heavy workload overwhelmed Elasticsearch, creating performance bottlenecks and significant operational overhead from indexing and sharding. It became clear that we couldn’t just optimize our way to real-time. We needed a new, purpose-built system designed from the ground up for high-throughput ingestion and low latency queries. Editor’s note: Karthik and Atish Andhare will be sharing their experiences at Monster SCALE Summit, a free + virtual conference on extreme scale engineering. Learn more and access passes here. Envisioning the Eventstore Our solution was to build the Eventstore – a system that can store all the user actions. It doesn’t replace our vast S3 data lake; think of it more like a high-speed, short-term memory for all user actions and events. Its sole purpose is to handle recent user activity, absorbing the constant firehose of incoming events while allowing us to instantly pull up any single user’s complete activity timeline from the last 30, 60, or 90 days. This new real-time backbone lets us see a user’s entire recent journey in milliseconds, a capability that was completely out of reach with our old architecture. With this new real-time backbone in place, we could finally unlock a class of product capabilities that our customers were demanding, moving from theoretical concepts to tangible features. The Eventstore directly powers: Instantaneous Segmentation: Instead of waiting hours for a segment to update, users are added or removed the moment their behavior meets specific criteria. This ensures communications are always sent to the right audience at exactly the right time. True Real-Time Triggers: Campaigns can be initiated the instant a user performs a key action, such as abandoning a cart or completing a purchase. This eliminates the “lag” that made triggered messages feel disconnected from the user’s immediate context. Hyper-Personalization at the Edge: We can now personalize messages using attributes from a user’s very last action. This allows for powerful use cases like including the “last product viewed” in an email, recommending content based on the “last article read,” or personalizing web content based on the “last item added to cart.” Live User Activity Feeds: Our platform’s user profile dashboard, which once showed a delayed activity history, can now display a live, up-to-the-millisecond feed of every action a user takes, giving marketers a true real-time view of their customers. 
Choosing Our Engine Our requirements were ambitious and non-negotiable: the system had to handle a write-heavy workload of at least 250,000 events per second with an avg latency of 1 ms, p99 of under 10ms, and it needed to scale horizontally without any performance degradation. We evaluated several distributed databases, but ScyllaDB quickly emerged as the clear frontrunner. Its architecture, a C++ rewrite of Cassandra, is engineered for raw performance, promising to harness the full power of modern hardware and deliver the predictable, ultra-low latencies we required. Also, a few of us had extensive experience working with Cassandra which made it easier to understand ScyllaDB. The ability to add nodes seamlessly to handle increasing load was the final piece of the puzzle, giving us the confidence that this was a solution that wouldn’t just meet our immediate needs, but would grow with us for years to come. Handling Multi-Tenancy As an open platform, MoEngage serves a diverse customer base, from companies trialing our product to large enterprises with varying performance and service-level agreement (SLA) expectations. This reality meant that a one-size-fits-all approach to data storage was not viable. We could not house all customer data in a single, massive cluster, as this would risk performance degradation from “noisy neighbors” and fail to meet the distinct needs of our clients. Our multi-tenancy strategy, therefore, had to be built around workload isolation from the ground up. Our decision was heavily influenced by two core ScyllaDB design principles. First, ScyllaDB recommends having one large table per cluster rather than many small ones to reduce metadata management overhead. Second, and more critically, it is a best practice to configure data retention with a Time-To-Live (TTL) at the table level, not at the cell level. Since our customers require different retention periods (15, 30, or 60 days), managing this at the row level within a single table would create significant overhead on compaction and tombstone management. Based on these constraints, we chose a strategy of physical isolation using multiple, independent ScyllaDB clusters. This approach allows us to group tenants logically based on their needs. For example: All customers with the same retention policy (e.g., 30 days) are housed in the same cluster, allowing us to use a single, efficient table-level TTL. Customers who require stricter, guaranteed SLAs can be isolated in their own dedicated cluster. All MoEngage test accounts can be grouped into a single cluster to separate their non-production workloads. This model provides the perfect balance, ensuring that the workload of one tenant group does not impact another while aligning perfectly with ScyllaDB’s operational best practices for performance and data management. Working with ScyllaDB Open Source Our Large Partition Problem One of the most critical challenges in designing our schema was avoiding the “large partition” anti-pattern in ScyllaDB. While our experiments showed that large partitions don’t significantly penalize write performance, they have a significant impact on reads and compactions. We have a use case where read won’t be able to take advantage of the clustering key ordering and hence have to fetch the entire partition and perform the filtering on the client side. In such cases querying a large partition causes ScyllaDB to fetch data from disk (if not in memtable), decompress, load into memory and then return the result. 
This creates significant latency overhead and puts unnecessary pressure on the cluster. With ScyllaDB’s official recommendation to keep partitions under 100MB, we knew that a naive partition key like `(user_id, tenant_id)` would be a recipe for disaster, as highly active users could easily generate gigabytes of event data over their retention period. To solve this, our schema design focused on proactively breaking up potentially large partitions into smaller, consistently sized buckets. Our analysis showed an average event row size of about 1KB, meaning our 100MB target partition size could comfortably hold around 100,000 events. A simple calculation revealed that for a 30-day retention period, any user generating more than 330 events per day would exceed this limit. To prevent this, we introduced a `bucket_id` as a core component of our partition key. Our final partition key became a composite of `(uid, tenant, bid)`. The `bucket_id` acts as a split mechanism, splitting a single user’s long event history into multiple, smaller physical partitions. For example, a bucket could represent a day or a week of activity, ensuring no single partition grows indefinitely. This foresight was crucial because a table’s partition key cannot be changed after creation. By including the `bucket_id` in our initial schema, we built in the flexibility to define and refine our exact bucketing strategy over time, guaranteeing a healthy, performant cluster as our data scales. Building for Resilience From the very beginning, two principles were non-negotiable for the Eventstore: fault tolerance and zero data loss. The system had to withstand common infrastructure failures like node loss or disk corruption, and under no circumstances could we lose data that had been acknowledged with a success response. This commitment to durability shaped every decision we made about our cluster architecture, from data replication to physical topology. To achieve this, we made a critical decision to use a Replication Factor (RF) of 3. This means that for every piece of data written, three copies (replicas) are stored on three different nodes in the cluster. With RF3, we could enforce a write consistency level of `LOCAL_QUORUM`. This setting guarantees that a write operation is only considered successful after a majority of the replicas (two out of the three) have confirmed the write to disk. This simple but powerful mechanism is our guarantee against data loss; even if one node fails mid-write, the data is already safe on at least two other nodes. Having three copies of the data is only half the battle; ensuring those copies are physically isolated is just as important. To protect against large-scale failures, we architected our clusters to be Availability Zone (AZ) aware. By leveraging ScyllaDB’s Ec2Snitch feature, we make the database aware of the underlying AWS infrastructure, treating each AWS AZ as a separate “rack.” With this configuration, combined with NetworkTopologyStrategy replication strategy, ScyllaDB intelligently places each of the three data replicas in a different AZ. This strategy ensures that we can withstand the complete failure of an entire Availability Zone without any data loss or service interruption. While this architecture provides excellent high availability against common failures, we also planned for disaster recovery scenarios, such as losing a quorum of nodes or a full region-wide outage. 
Since our chosen EC2 instances use ephemeral storage, our recovery strategy in these cases is to quickly bootstrap a new cluster from a previous backup. For this, we leverage ScyllaDB’s native backup capabilities and our application’s ability to replay messages from Kafka. Our process involves taking regular snapshots, supplemented by a continuous stream of incremental backups. Any data lost between the last incremental backup and point of outage is available in Kafka, by simply replaying the data from Kafka we are able to fully restore the data. This combination ensures we can rebuild a cluster to a recent, consistent state, completing our comprehensive resilience strategy from minor hiccups to major outages. Cluster Topology Choosing the right database engine was only half the equation; building a resilient and performant Eventstore meant running it on the right hardware. Our workload is fundamentally I/O-bound, characterized by a relentless, high-throughput stream of writes. This reality guided our evaluation of EC2 instance types, where the choice between local NVMe storage and network-attached EBS volumes became the central decision point. After a thorough analysis, we followed ScyllaDB’s strong recommendation and opted for storage-optimized i-series instances with local NVMe SSDs. While we considered memory-optimized instances with EBS, they proved unsuitable for our write-heavy needs. High performance `io2` EBS volumes were prohibitively expensive at our scale, and more affordable `GP3` volumes could not guarantee the p99 latencies we required and introduced risks of throttling during traffic bursts. AWS’s own guidance suggests EBS is better suited for read heavy workloads, the exact opposite of our profile. Local NVMe storage, by contrast, delivers the sustained, sub-millisecond I/O performance essential for our ingestion pipeline. Specifically, we selected the i3en instance family, which provides an excellent balance of vCPU, RAM, and the large, fast storage capacity needed to meet our heavy data retention requirements. Our approach to capacity planning is therefore not a one-time calculation but a dynamic process tied directly to our multi-tenant cluster strategy. The size and configuration of each physical cluster are determined by the specific workload of the tenants it houses. We carefully model capacity based on four key variables for each tenant: 1. The number of active users. 2. The average number of actions per user per day. 3. Their specific data retention policy (e.g., 15, 30, or 60 days). 4. The overall write and read traffic patterns. This allows us to right-size each cluster for its intended workload, ensuring performance and cost-efficiency across our entire infrastructure. Compaction Strategy A critical factor in managing the total cost of ownership for a large-scale database is controlling disk space amplification. Open Source ScyllaDB’s default Size-Tiered Compaction Strategy (STCS) requires keeping nearly 50% of disk space free for compaction operations, which would have effectively doubled our storage costs. We also experimented with the Leveled Compaction Strategy but that too required additional 50% disk space during initial bootstrapping. While ScyllaDB Enterprise offers the highly efficient Incremental Compaction Strategy (ICS) that reduces this overhead to 20%, it comes with a significant license fee. 
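Before moving on to operations, here is a minimal sketch pulling together the bucketing, replication, and TTL decisions described above. The partition key components (uid, tenant, bid) come from the post; everything else (keyspace name, datacenter name, clustering and payload columns, exact TTL) is a hypothetical illustration, not MoEngage’s actual schema:

CREATE KEYSPACE IF NOT EXISTS eventstore
  WITH replication = {'class': 'NetworkTopologyStrategy', 'us-east': 3};  -- RF 3, one replica per AZ-as-rack

CREATE TABLE IF NOT EXISTS eventstore.events (
  uid       text,       -- user identifier
  tenant    text,       -- tenant identifier
  bid       int,        -- bucket id: splits one user's history into bounded partitions
  event_ts  timestamp,  -- clustering column: events ordered by time within a bucket
  payload   blob,       -- ~1 KB serialized event
  PRIMARY KEY ((uid, tenant, bid), event_ts)
) WITH CLUSTERING ORDER BY (event_ts DESC)
  AND default_time_to_live = 2592000;  -- 30-day retention applied at the table level

Writes would then be issued at LOCAL_QUORUM from the application driver, since consistency level is a per-request setting rather than part of the schema. The same numbers also feed a back-of-the-envelope capacity estimate: for example, 10M active users generating 50 events per day at ~1 KB each is roughly 500 GB/day of raw data, or about 45 TB on disk over a 30-day retention window at RF 3 (the user and activity figures here are hypothetical, and compression and compaction overhead are ignored).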
Our Operational Challenges Cluster Management Our initial capacity planning pointed us toward the i3en.3xlarge instance type (12 vCPUs, 96GB RAM, and a 7.5TB NVMe drive). To ensure low latency for our global customer base, we deployed one ScyllaDB cluster in each of our three primary AWS regions. In total, our footprint grew to approximately 50 nodes across these clusters. ScyllaDB provides region-specific, production-ready AMIs that simplify the deployment process. Our deployment workflow followed a structured path: provisioning nodes, configuration, security, and RBAC, followed by onboarding the cluster into our internal monitoring stack. Because ScyllaDB’s AMIs are self-contained, scaling out theoretically meant launching a new node and letting it complete the automated bootstrap process. Things ran smoothly until we encountered a surge in customer data within one of our regions. As disk utilization climbed and we were using STSC, we followed our runbook and added a new node to the cluster. However, this expansion revealed two critical operational hurdles: First, during our POC, a 4TB node bootstrap typically took 18 hours using vNodes. In the live production environment, this window stretched significantly. Bootstrapping took anywhere from 24 to 36 hours. In a high-growth environment, a 1.5-day lead time for scaling is a lifetime. Followed by another issue when an on-call engineer noticed the disk space on the newly joining node was hitting 90%. This was counterintuitive—why was a joining node, which hadn’t even finished taking its share of the data, running out of space? Our investigation revealed that it was caused during the RESHAPE compactions. When a new node joins, ScyllaDB reshapes the data to fit the new shard distribution. This process creates temporary data overhead. After researching similar issues reported in the community, we identified a temporary fix to get our node back to service. Allow the node to initiate the bootstrap. The moment the RESHAPE compaction begins, manually pause it. Let the node finish joining the ring to provide immediate capacity. ICS Our initial experiences led us to a conservative rule of thumb, where we felt safe onboarding new nodes when disk usage on the existing nodes reached between 40% and 45%. This buffer was a technical necessity to accommodate a 2.4x worst-case space amplification during RESHAPE compactions while bootstrapping. We experienced a glimmer of hope when we discovered ScyllaDB’s Incremental Compaction Strategy (ICS). After discussing the ICS with the ScyllaDB team, we realized we were looking at our space amplification issues through an outdated lens. The technical shift offered by ICS is profound because it utilizes a default fragment size of 1GB, meaning a single compaction job typically requires a maximum of only 2GB of disk overhead. To put that into perspective for our specific setup, the old methodology required nearly 50% of free space on a 6.9TB node to handle heavy compaction cycles safely. Under ICS, that same 6.9TB node with 12 shards would only experience roughly 110GB of overhead during compactions. This shift creates massive headroom, allowing us to move away from capping nodes at 45% capacity and safely utilizing over 80% of the disk. By drastically minimizing space amplification, ICS has effectively doubled our storage efficiency without compromising performance during critical node operations. 
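For reference, switching a table’s compaction strategy is a single schema change. A hedged sketch of moving an events table to ICS might look like the following (table name hypothetical; ICS requires ScyllaDB Enterprise):

ALTER TABLE eventstore.events
  WITH compaction = {'class': 'IncrementalCompactionStrategy'};  -- SSTable runs are split into ~1 GB fragments, which bounds per-compaction space overhead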
Our Move to ScyllaDB Enterprise Our journey toward ScyllaDB Enterprise began with a rigorous Proof of Concept designed to validate three core pillars: performance, reliability, and operational efficiency. We needed to ensure that the Enterprise edition could not only handle our existing throughput but also provide an edge in performance and cluster operations. To validate these objectives without risking production stability, we deployed a parallel ScyllaDB Enterprise cluster. This environment supported dual writes, allowing us to mirror data from our existing Open Source Software (OSS) cluster to the new Enterprise setup in real-time. This side by-side comparison was instrumental in proving the superiority of the new architecture. The most significant architectural shift involved moving to i3en.6xlarge nodes. These powerful instances, equipped with 24 vCPUs, 192GB of memory, and two 7.5TB NVMe drives, allowed us to dramatically consolidate our infrastructure. By leveraging these denser nodes, we were able to shrink our total node count to just one-third of the original OSS cluster size, significantly reducing the complexity of our distributed network. Alongside this hardware upgrade, we transitioned our tables to the Incremental Compaction Strategy (ICS) to better manage disk space. Following a successful “soak-in” period where the Enterprise cluster met all performance benchmarks, we executed a structured four-step migration strategy. We first upgraded our OSS environment to ScyllaDB Enterprise 2024.1, followed by the systematic migration of tables to the ICS format. Once the tables were optimized, we began the process of downsizing the legacy OSS cluster and finalized the transition by onboarding the entire environment to ScyllaDB Manager for centralized management and automated maintenance. Lessons Learned Schema Design Is Paramount The most important aspect while using ScyllaDB is getting your schema right. It’s not just about the data model but aspects like RF, TTL, partition size, compaction strategy etc. that dictate how your ScyllaDB performs in production. Adding Nodes & Removing Nodes Take Longer As your data size grows the process of adding nodes and removing nodes becomes a lot slower with ScyllaDB’s legacy vNode-based replication. Make sure you are monitoring everything and plan for these activities ahead of time. One thing we learned is that while these operations are slower they don’t quite impact the query / write latencies significantly during these maintenance activities. POC != Production No matter how hard you try to anticipate & simulate issues in POC, your production system will always surprise you. Our Next Steps with ScyllaDB Our journey with the Eventstore has fundamentally transformed our real-time capabilities, but we’re always looking ahead to the next evolution of our architecture. One of the most exciting developments on our roadmap involves leveraging a powerful new feature in the latest versions of ScyllaDB: tablets. While our multi-cluster topology provides excellent isolation, it still requires us to plan capacity for peak workloads. In a multi-tenant world, traffic can be unpredictable. A single customer launching a wildly successful campaign can create a sudden performance hotspot on specific sets of nodes, even if the rest of the cluster has ample storage and spare compute capacity. Manually rebalancing or adding nodes to handle these temporary spikes is a significant operational challenge. This is where tablets change the game. 
By breaking down the token ring into smaller, movable units of data, tablets decouple data partitions from specific physical nodes. Instead of a partition being permanently owned by a set of nodes, a tablet can be automatically moved to a different node to balance the load in real-time. For us, this unlocks the holy grail of database management: true elastic scaling. When a traffic hotspot emerges, ScyllaDB can automatically rebalance the cluster by shifting tablets away from overloaded nodes to those with spare capacity. This will allow us to absorb sudden traffic surges with grace, ensuring consistent performance for all tenants without manual intervention or costly overprovisioning. It’s the key to providing on-demand compute for our customers’ biggest moments, ensuring the Eventstore remains a robust and highly elastic foundation for the future of real-time engagement at MoEngage.Exploring the key features of Cassandra® 5.0
Apache Cassandra has become one of the most broadly adopted distributed databases for large-scale, highly available applications since its launch as an open source project in 2008. The 5.0 release in September 2024 represents the most substantial advancement to the project since 4.0 was released in July 2021. Multiple customers (and our own internal Cassandra use case) have now been happily running on Cassandra 5 for up to 12 months, so we thought the time was right to explore the key features they are leveraging to power their modern applications.
An overview of new features in Apache Cassandra 5.0
Apache Cassandra 5.0 introduces core capabilities aimed at AI-driven systems, low-latency analytical workloads, and environments that blend operational and analytical processing.
Highlights include:
- The new vector data type and an Approximate Nearest Neighbor (ANN) index based on Hierarchical Navigable Small World (HNSW), which is integrated into the Storage-Attached Index (SAI) architecture
- Trie-based memtables and the Big Trie-Index (BTI) SSTable format, delivering better memory efficiency and more consistent write performance
- The Unified Compaction Strategy, a tunable density-based approach that can align with leveled or tiered compaction patterns.
Additional enhancements include expanded mathematical CQL functions, dynamic data masking, and experimental support for Java 17.
At NetApp, Apache Cassandra 5.0 is fully supported, and we are actively assisting customers as they transition from 4.x.
A deeper look at Cassandra 5.0’s key features
Storage-Attached Indexes (SAI)
Storage-Attached Indexes bring a modern, storage-integrated approach to secondary indexing in Apache Cassandra, resolving many of the scalability and maintenance challenges associated with earlier index implementations. Legacy Secondary Indexes (2i) and SASI remain available, but SAI offers a more robust and predictable indexing model for a broad range of production workloads.
SAI operates per SSTable, allowing queries to be served from locally maintained index structures rather than relying on the cluster-wide coordination that earlier strategies required. This model supports diverse CQL data types, enables efficient numeric and text range filters, and provides more consistent performance characteristics than 2i or SASI. The same storage-attached foundation is also used for Cassandra 5’s vector indexing mechanism, allowing ANN search to operate within the same storage and query framework.
SAI supports combining filters across multiple indexed columns and works seamlessly with token-aware routing to reduce unnecessary coordinator work. Public evaluations and community testing have shown faster index builds, more predictable read paths, and improved disk utilization compared with previous index formats.
Operationally, SAI functions as part of the storage engine itself: indexes are defined using standard CQL statements and are maintained automatically during flush and compaction, with no cluster-wide rebuilds required. This provides more flexible query options and can simplify application designs that previously relied on manual denormalization or external indexing systems.
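As a rough illustration (keyspace, table, and index names here are hypothetical, not taken from any specific deployment), SAI indexes are created and queried with standard CQL:

CREATE CUSTOM INDEX IF NOT EXISTS orders_price_idx
  ON store.orders (price) USING 'StorageAttachedIndex';
CREATE CUSTOM INDEX IF NOT EXISTS orders_status_idx
  ON store.orders (status) USING 'StorageAttachedIndex';

-- Filters on multiple SAI-indexed columns can be combined in one query,
-- including numeric range predicates:
SELECT order_id, price, status
FROM store.orders
WHERE status = 'SHIPPED' AND price > 100 AND price < 500;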
Native Vector Search capabilities
Apache Cassandra 5.0 introduces native support for high-dimensional vector embeddings through the new vector data type. Embeddings represent semantic information in numerical form, enabling similarity search to be performed directly within the database. The vector type is integrated with the database’s storage-attached index architecture, which uses HNSW graphs to efficiently support ANN search across cosine, Euclidean, and dot-product similarity metrics.
With vector search implemented at the storage layer, applications involving semantic matching, content discovery, and retrieval-oriented workflows can be supported while maintaining the system’s established scalability and fault-tolerance characteristics.
After upgrading to 5.0, existing schemas can add vector columns and store embeddings through standard write operations. For example:
UPDATE products SET embedding = [0.1, 0.2, 0.3, 0.4, 0.5] WHERE id = <id>;
To create a new table with a vector type column:
CREATE TABLE items (
  product_id UUID PRIMARY KEY,
  embedding VECTOR<FLOAT, 768> // 768 denotes dimensionality
);
Because vector indexes are attached to SSTables, they participate automatically in the compaction and repair processes and do not require an external indexing system. ANN queries can be combined with regular CQL filters, allowing similarity searches and metadata conditions to be evaluated within a unified distributed query workflow. This brings vector retrieval into Apache Cassandra’s native consistency, replication, and storage model.
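To run similarity queries against a vector column, an SAI-backed ANN index is created on it and queried with ORDER BY ... ANN OF. A hedged sketch, reusing the products table from the earlier UPDATE example and assuming its embedding column was declared with five dimensions (the similarity-function option value is illustrative):

CREATE CUSTOM INDEX IF NOT EXISTS products_embedding_idx
  ON products (embedding) USING 'StorageAttachedIndex'
  WITH OPTIONS = {'similarity_function': 'cosine'};

-- Return the 10 products whose embeddings are closest to the query vector
SELECT id
FROM products
ORDER BY embedding ANN OF [0.1, 0.15, 0.3, 0.12, 0.05]
LIMIT 10;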
Unified Compaction Strategy (UCS)
Unified Compaction Strategy in Apache Cassandra 5 introduces a density-aware approach to organizing SSTables that blends the strengths of Leveled Compaction Strategy (LCS) and Size-Tiered Compaction Strategy (STCS). UCS aims to provide the predictable read amplification associated with LCS and the write efficiency of STCS, without many of the workload-specific drawbacks that previously made compaction selection difficult. Choosing an unsuitable compaction strategy in earlier releases could lead to operational complexity and long-term performance issues, which UCS is designed to mitigate.
UCS exposes a set of tunable parameters like density thresholds and per-level scaling that let operators adjust compaction behavior toward read-heavy, write-heavy, or time-series patterns. This flexibility also helps smooth the transition from existing strategies, as UCS can adopt and improve the current SSTable layout without requiring a full rewrite in most cases. The introduction of compaction shards further increases parallelism and reduces the impact of large compactions on cluster performance.
Although LCS and STCS remain available (STCS is still the default strategy in 5.0, while UCS is the default on newly deployed NetApp Instaclustr managed Apache Cassandra 5 clusters), UCS supports a broader range of workloads, reduces the operational burden of compaction tuning, and aligns well with other storage engine improvements in Apache Cassandra 5 such as trie-based SSTables and Storage-Attached Indexes.
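For illustration, UCS is selected per table like any other compaction strategy; the scaling parameter below is an example value, not a recommendation, and the exact option names should be checked against the UCS documentation for your release:

ALTER TABLE store.orders
  WITH compaction = {
    'class': 'UnifiedCompactionStrategy',
    'scaling_parameters': 'T4'  -- tiered-like behavior; 'L<n>' values lean leveled, 'N' is neutral
  };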
Trie Memtables and Trie-Indexed SSTables
Trie Memtables and Trie-indexed SSTables (Big Trie-Index, BTI) are significant storage engine enhancements released in Apache Cassandra 5. They are designed to reduce memory overhead, improve lookup performance, and increase flush efficiency. A trie data structure stores keys by shared prefixes instead of repeatedly storing full keys, which lowers object count and improves CPU cache locality compared with the legacy skip-list memtable structure. These benefits are particularly visible in high-ingestion, IoT, and time-series workloads.
Skip-list memtables store full keys for every entry, which can lead to large heap usage and increased garbage collection activity under heavy write loads. Trie Memtables substantially reduce this overhead by compacting key storage and avoiding pointer-heavy layouts. On disk, the BTI SSTable format replaces the older BIG index with a trie-based partition index that removes redundant key material and reduces the number of key comparisons needed during partition lookups.
Using Trie Memtables requires enabling both the trie-based memtable implementation and the BTI SSTable format. Existing BIG SSTables are converted to BTI through normal compaction or by rebuilding data. On NetApp Instaclustr’s managed Apache Cassandra clusters, Trie Memtables and BTI are enabled by default; however, when performing a major-version upgrade to 5.0, existing data must be converted from BIG to BTI before the trie structures can be used.
Other new features
Mathematical CQL functions
Apache Cassandra 5.0 added a rich set of math functions allowing developers to perform computations directly within queries. This reduces data transfer overhead and client-side post-processing, among other benefits. From fundamental functions like ABS(), ROUND(), and SQRT() to trigonometric operations like SIN(), COS(), and TAN(), these functions apply across domains ranging from financial data to scientific measurements and spatial data.
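For example, a query against a hypothetical telemetry table could push this kind of computation to the server:

SELECT sensor_id,
       round(reading)       AS reading_rounded,
       abs(reading - 20.0)  AS deviation_from_target,
       sqrt(variance)       AS std_dev
FROM telemetry.readings
WHERE sensor_id = 42;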
Dynamic Data Masking
Dynamic Data Masking (DDM) is a new feature that obscures sensitive column-level data, either on demand at query time or by permanently attaching a masking function to a column so that its values are always returned obfuscated. Stored data values are not altered in the process, and administrators can control access through role-based access control (RBAC) so that only authorized roles see the raw data, while also tuning how the obscured values appear. This feature helps with adherence to data privacy regulations such as GDPR, HIPAA, and PCI DSS without needing external redaction systems.
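A short sketch of the CQL surface (table, column, and role names are hypothetical; mask_inner is one of the built-in masking functions, and the full list is in the 5.0 documentation):

-- Attach a masking function to a column; stored values remain unchanged
ALTER TABLE customers.accounts
  ALTER email MASKED WITH mask_inner(2, 2);

-- Roles granted UNMASK see raw values; everyone else gets the obfuscated form
GRANT UNMASK ON TABLE customers.accounts TO support_admins;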
Apache Cassandra 5.0 packs a punch with game-changing features that meet the needs of modern workloads and applications. Features like vector search and Storage-Attached Indexes stand out, as they will inevitably shape how data can be leveraged within the same database while maintaining speed, scale, and resilience.
When you deploy a managed cluster on NetApp Instaclustr’s Managed Platform, you get the benefits of all these amazing features without worrying about configuration and maintenance.
Ready to experience the power of Apache Cassandra 5.0 for yourself? Try it free for 30 days today!
Scaling Performance Comparison: ScyllaDB Tablets vs Cassandra vNodes
Benchmarks show ScyllaDB tablet-based scaling 7.2× faster than Cassandra’s vNode-based scaling (9× with cleanup), sustaining ~3.5X higher throughput with fewer errors Real-world database deployments rarely experience steady traffic. Systems need sufficient headroom to absorb short bursts, perform maintenance safely, and survive unexpected spikes. At the same time, permanently sizing for peak load is wasteful. Elasticity lets you handle fluctuations without running an overprovisioned cluster. Increase capacity just-in-time when needed, then scale back as soon as the peak passes. When we built ScyllaDB just over a decade ago, it scaled fast enough for user needs at the time. However, deployments grew larger and nodes stored far more data per vCPU. Streaming took longer, especially on complex schemas that required heavy CPU work to serialize and deserialize data. The leaderless design forced operators to serialize topology changes, preventing parallel bootstraps or decommissions. And static (vNode-based) token assignments also meant data couldn’t be moved dynamically once a node was added. ScyllaDB’s recent move to tablet-based data distribution was designed to address those elasticity constraints. ScyllaDB now organizes data into independent tablets that dynamically split or merge as data grows or shrinks. Instead of being fixed to static ranges, tablets are load balanced transparently in the background to maintain optimal distribution. Clusters scale quickly with demand, so teams don’t need to overprovision ahead of time. If load increases, multiple nodes can be bootstrapped in parallel and start serving traffic almost immediately. Tablets rebalance in small increments, letting teams safely use up to ~90% of available storage. This means less wasted storage. The goal of this design is to make data movement more granular and reduce the serialized steps that constrained vNode-based scaling. To understand the impact of this design shift, we evaluated how both ScyllaDB (now using tablets) and Cassandra (still using vNodes) compare when they must increase capacity under active traffic. The goal was to observe scale-out under realistic conditions: workloads running, caches warm, and topology changes occurring mid-operation. By expanding both clusters step by step, we captured how quickly capacity came online, how much the running workload was affected, and how each system performed after each expansion. Before we go deeper into the details, here are the key findings from the tests: Bootstrap operations: ScyllaDB completed capacity expansion 7.2X faster than Cassandra Total scaling time: When including Cassandra’s required cleanup operations (which can be performed during maintenance windows), the time difference reaches 9X Throughput while scaling: ScyllaDB sustained ~3.5X more traffic during these scaling operations Stability under load: ScyllaDB had far fewer errors and timeouts during scaling, even at higher traffic levels Why Fast Scaling Matters Most real-world database deployments are overprovisioned to some extent. The extra capacity helps sustain traffic fluctuations and short-lived bursts. It also supports routine maintenance tasks, like applying security patches, rolling out infrastructure maintenance, or recovering from replica failures. Another important consideration in real-world deployments is that benchmark reports often overlook traffic variability over time. 
In practice, only a subset of workloads consistently demand high baseline throughput, with low variability from their peak needs. Most workloads follow a cyclical pattern, with daily peaks during active hours and significantly lower baseline traffic during off-hours. A diurnal workload example, ranging between 50K to 250K operations per second in a day Fast scaling is also critical for handling unexpected events, such as viral traffic spikes, flash loads, backlog drains after cascading failures, or sudden pressure from upstream systems. It’s especially valuable when traffic has large peak-to-baseline swings, capacity needs to shift often, responses to load must be quick, or costs depend on scaling back down immediately after a surge. Comparing Tablets vs vNodes Fast scaling is ultimately a data distribution problem, and Cassandra’s vNodes and ScyllaDB’s tablets handle that distribution in distinctly different ways. Here’s more detail on the differences we previewed earlier. Apache Cassandra Apache Cassandra follows a token ring architecture. When a node joins the cluster, it is assigned a number of tokens (the default is 16), each representing a portion of the token ring. The node becomes responsible for the data whose partition keys fall within its assigned token ranges. During node bootstrap, existing replicas stream the relevant data to the new replica based on its token ownership. Conversely, when a node is removed, the process is reversed. Cassandra generally recommends avoiding concurrent topology changes; in practice, many operators add/remove nodes serially to reduce risk during range movements. Digression: In reality, topology changes in an Apache Cassandra cluster are plain unsafe. We explained the reasons in a previous blog, and pointed out that even its community acknowledged some of its design flaws. In addition to the administrative overhead involved in scaling a Cassandra cluster, there are other considerations. Adding nodes with higher CPU and memory is not straightforward. It typically requires a new tuning round and manually assigning a higher weight (increasing the number of tokens) to better match capacity. After bootstrap operations, Cassandra requires an intermediary step (cleanup) for older replicas in order to free up disk space and eliminate the risk of data resurrection. Lastly, multiple scaling rounds introduce significant streaming overhead since data is continuously shuffled across the cluster. Cassandra Token Ring ScyllaDB ScyllaDB introduced tablets starting with the 2024.2 release. Tablets are the smallest unit of replication in ScyllaDB and can be migrated independently across the cluster. Each table is dynamically split into tablets based on its size, with each tablet being assigned to a subset of replicas. In effect, tablets are smaller, manageable fragments of a table. As the topology evolves, tablet state transitions are triggered. A global load balancer balances tablets across the cluster, accounting for heterogeneity in node capacity (e.g., assigning more tablets to replicas with greater resources). Under the hood, Raft provides the underlying consensus mechanism that serializes tablet transitions in a way that avoids conflicting topology changes and ensures correctness. The load balancer is hosted on a single node, but not a designated node. If that node crashes or goes down for maintenance, the load balancer will start on another node. Raft and tablets effectively decouple topology changes from streaming operations. 
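For context, tablets are a per-keyspace property in recent ScyllaDB releases. A hedged sketch of enabling them at keyspace creation (keyspace name hypothetical; option names may differ slightly between versions, so check the release documentation):

CREATE KEYSPACE metrics
  WITH replication = {'class': 'NetworkTopologyStrategy', 'replication_factor': 3}
  AND tablets = {'enabled': true};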
Users can orchestrate topology changes in parallel with minimal administrative overhead. ScyllaDB does not require a post-bootstrap cleanup phase. That allows for immediate request serving and more efficient data movement across the network. Visual representation of tablets state transitions Adding Nodes Starting with a 3-node cluster, we ran our “real-life” mixed workload targeting 70% of each database’s inferred total capacity. Before any scaling activity, both ScyllaDB and Cassandra were warmed up to ensure disk and cache activity were in effect. Note: Configuration details are provided in the Appendix. We then started the mixed workload and let it run for another 30 minutes to establish a performance baseline. At this point, we bootstrapped 3 additional nodes, expanding the cluster to 6 nodes. We then allowed the workload to run for an additional 30 minutes to observe the effects of this first scaling step. We increased traffic proportionally. After sustaining it for another 30 minutes, we bootstrapped 3 more nodes, bringing each cluster to a total of 9 nodes. Finally, we increased traffic one last time to ensure each database could sustain its anticipated traffic. Note: See the Appendix for details on the test setup and our Cassandra tuning work. The following table shows the target throughput used during and after each scaling step along with each cluster’s inferred maximum capacity: Nodes ScyllaDB Cassandra 3 (baseline) 196K ops/sec (Max 280K) 56K ops/sec (Max 80K) 6 392K ops/sec (Max 560K) 112K ops/sec (Max 160K) 9 672K ops/sec (Max 840K) 168K ops/sec (Max 240K) We conducted this scaling exercise twice for each database, introducing a minor variation in each run. For ScyllaDB, we bootstrapped all 6 additional nodes in parallel. For Cassandra, we enabled both the Key Cache and Row Cache, as we observed it performed better overall under our initial performance results. Comparison of different scaling approaches At first glance, it might look like ScyllaDB offers only a modest improvement over Cassandra (somewhere between 1.25X and 3.6X faster). But there are deeper nuances to consider. Resiliency In both of our Cassandra benchmarks, we observed a high rate of errors, including frequent timeouts and OverloadedExceptions reported by the server. Notably, our client was configured with an exponential backoff, allowing up to 10 retries per operation. In this environment, both Cassandra configurations showed elevated error rates under sustained load during scaling. The following table summarizes the number of errors observed by the client during the tests: Kind Step Throughput Retries Cassandra 5.0 – Page Cache 3 → 6 nodes 56K ops/sec 2010 Cassandra 5.0 – Page Cache 6 → 9 nodes 112K ops/sec 0 Cassandra 5.0 – Row & Key Cache 3 → 6 nodes 56K ops/sec 5004 Cassandra 5.0 – Row & Key Cache 6 → 9 nodes 112K ops/sec 8779 With the sole exception of scaling from 6 to 9 nodes in the Page Cache scenario, all other Cassandra scaling exercises resulted in noticeable traffic disruption, even while handling 3.5X less traffic than ScyllaDB. In particular, the “Row & Key Cache” configuration proved itself unable to sustain prolonged traffic, ultimately forcing us to terminate that test prematurely. Performance The earlier comparison chart also highlights the cost of repeated streaming across incremental expansion steps. 
Although bootstrap duration is governed by the volume of data being streamed and decreases as more nodes are added, each scaling operation redundantly re-streams data that was already redistributed in prior steps. This introduces significant overhead, compounding both the time and the performance cost of scaling operations. As demonstrated, scaling directly from 3 to 9 nodes using ScyllaDB tablets eliminates that incremental redistribution overhead. By avoiding redundant streaming at each intermediate step, the system performs a single, targeted redistribution of tablets, resulting in a significantly faster and more efficient bootstrap process.

ScyllaDB tablet streaming from 3 to 9 nodes

After the scale-out operations completed, we ran the following load tests to assess each database's ability to withstand increased traffic:
For ScyllaDB, we increased traffic to 80% of its peak capacity (280 * 3 * 0.8 = 672 Kops)
For Cassandra, we increased traffic to 100% (240 Kops) and 125% (300 Kops) of its peak capacity to validate our starting assumptions

ScyllaDB sustains 672 Kops/sec with load (per vCPU) around 80% utilization, as expected.

Apache Cassandra latency variability under different throughput rates (240K vs 300K ops/sec)

Cassandra maintained its expected 240K peak traffic. However, it failed to sustain 300K over time, leading to increased pauses and errors. This outcome was anticipated, since the test was designed to validate our initial baseline assumptions, not to achieve or demonstrate superlinear scaling.

Expectations
In our tests, ScyllaDB scaled faster and delivered greater improvements in latency and throughput at each step. That reduces the number of scaling operations required, and the compounded benefits translate to significantly faster capacity expansion. In contrast, Cassandra's scaling behavior is more incremental. The initial scale-out from 3 to 6 nodes took 24 minutes. The subsequent step from 6 to 9 nodes introduced additional overhead and required 16 minutes. From this observation, we empirically derived a formula to model the scaling factor per step:

16 = 24 × (0.5/1.0)^overhead

Solving for the exponent, we approximated the streaming overhead factor as 0.6. Using this, we constructed a practical formula to estimate Cassandra's bootstrap duration at each scale step:

Bootstrap_time ≈ Base_time × (data_to_stream / data_per_node)^0.6

With these formulas, we can project the bootstrap times for subsequent scaling steps. Based on our earlier performance results (where Cassandra sustained approximately 80K ops/sec for every 3-node increase), 27 total nodes of Cassandra would be required to match the throughput achieved by ScyllaDB. The following table presents the estimated cumulative bootstrap times needed for Cassandra to reach ScyllaDB performance, using the previously derived formula and applying the 0.6 streaming overhead factor at each step:

Nodes     Data to Stream   Bootstrap Time   Cumulative Time   Peak Capacity
3         2.0TB            –                0 min             80K
3 → 6     1.0TB            24.0 min         24.0 min          160K
6 → 9     0.67TB           15.8 min         39.8 min          240K
9 → 12    0.50TB           12.4 min         52.2 min          320K
12 → 15   0.40TB           10.4 min         62.6 min          400K
15 → 18   0.33TB           9.0 min          71.6 min          480K
18 → 21   0.29TB           8.1 min          79.7 min          560K
21 → 24   0.25TB           7.3 min          87.0 min          640K
24 → 27   0.22TB           6.7 min          93.7 min          720K

Time to reach throughput capacity for bootstrap operations

As the table and chart show, ScyllaDB responds to capacity needs 7.2X faster than Cassandra. That's before accounting for the added operational and maintenance overhead associated with the process.
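As a rough aid for reproducing this kind of projection, the sketch below applies the same power-law model in code. The 2.0TB data set, the 24-minute base time, and the 0.6 exponent come from the discussion above; the per-step data estimate (the total data set divided evenly across the post-step node count, times the three nodes added) is our simplifying assumption. Because the published table was built from our measured figures, its per-step values differ somewhat from this simplified projection.

# Sketch: project per-step Cassandra bootstrap times with the power-law model above.
# Assumptions: 2.0 TB of replicated data, 3 nodes added per step, a measured
# 24-minute baseline for the 3 -> 6 step, and the 0.6 streaming overhead exponent.

TOTAL_DATA_TB = 2.0
BASE_TIME_MIN = 24.0   # measured duration of the first (3 -> 6) scaling step
OVERHEAD = 0.6         # empirically derived streaming overhead factor

def streamed_tb(nodes_after: int, nodes_added: int = 3) -> float:
    # Each new node receives an equal share of the total data set.
    return TOTAL_DATA_TB * nodes_added / nodes_after

base_streamed = streamed_tb(6)  # 1.0 TB streamed during the baseline step
cumulative = 0.0
for nodes_after in range(6, 28, 3):
    streamed = streamed_tb(nodes_after)
    step_min = BASE_TIME_MIN * (streamed / base_streamed) ** OVERHEAD
    cumulative += step_min
    print(f"{nodes_after - 3:>2} -> {nodes_after:<2}: "
          f"{streamed:.2f} TB, {step_min:4.1f} min (cumulative {cumulative:5.1f} min)")

Running the sketch prints per-step and cumulative projections in the same shape as the table above.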
Cleanup
In Cassandra, cleanup is the process that reclaims disk space after a scale-out operation. As the Cassandra documentation states: "As a safety measure, Cassandra does not automatically remove data from nodes that 'lose' part of their token range due to a range movement operation (bootstrap, move, replace). (…) If you do not do this, the old data will still be counted against the load on that node."

We estimated the following cleanup times after scaling to 9 nodes with unthrottled compactions:

Unlike topology changes, Cassandra cleanup operations can be executed in parallel across multiple replicas, rather than being serialized. The trade-off, however, is a temporary increase in compaction activity, which may impact system performance while it runs. In practice, despite this parallelizability, many users choose to run cleanup serially or per rack, preferring careful coordination in production environments to minimize disruption to user-facing traffic. The following table outlines the total time required under various cleanup strategies:

In conclusion, ScyllaDB scaled faster and sustained higher throughput during scale-out, and it removes cleanup from the scaling cycle entirely. Even for users willing to accept the risk of running cleanup in parallel across all Cassandra nodes, ScyllaDB still offers 9X faster capacity response time once the minimum required cleanup time is factored into Cassandra's previously estimated bootstrap durations. These results reflect how both databases behave under one specific scaling pattern. Teams should benchmark against their own workload shapes and operational constraints to see how these architectural differences play out in their particular environment.

Parting Thoughts
We know readers are (rightfully) skeptical of vendor benchmarks. As discussed earlier, Cassandra and ScyllaDB rely on fundamentally different scaling models, which makes designing a perfect comparison inherently difficult. The scaling exercises demonstrated here were not designed to fully maximize ScyllaDB tablets' potential. The test design actually favors Cassandra by focusing on symmetrical scaling; asymmetrical scaling scenarios would better highlight the advantage of tablets over vNodes. Even with a design that favored Cassandra's vNodes model, the results show the impact of tablets. ScyllaDB sustained 4X the throughput of Apache Cassandra while maintaining consistently lower P99 latencies on similar infrastructure. Interpreted differently, ScyllaDB delivers comparable performance to Cassandra using significantly smaller instances, which could then be scaled further by introducing larger, asymmetric nodes as needed. This approach (scaling from 3 small nodes to another 3 [much larger] nodes) optimizes infrastructure TCO and aligns naturally with ScyllaDB's tablets architecture. However, this would be far more difficult to achieve (and test) in Cassandra in practice. Also, the tests intentionally did not use large instances to avoid favoring ScyllaDB. ScyllaDB's shard-per-core architecture is designed to scale linearly across large instances without requiring the extensive tuning cycles often encountered with Apache Cassandra. For example, a 3-node cluster running on the largest AWS Graviton4 instances can sustain over 4M operations per second. When combined with tablets, ScyllaDB deployments can scale from tens of thousands to millions of operations per second within minutes.
Finally, remember that performance should be just one component in a team's database evaluation. ScyllaDB offers numerous features beyond Cassandra (local and global indexes, materialized views, workload prioritization, per-query timeouts, internal cache, and advanced dictionary-based compression, for example).

Appendix: How We Ran the Tests
Both ScyllaDB and Cassandra tests were carried out in AWS EC2 in an apples-to-apples scenario. We ran our tests on a 3-node cluster running on top of i4i.4xlarge instances placed under the same Cluster Placement Group to further reduce networking round-trips. Since this confines all replicas to a single availability zone, each node was placed on an artificial rack using the GossipingPropertyFileSnitch. As usual, all tests used LOCAL_QUORUM as the consistency level, a replication factor of 3, and NetworkTopologyStrategy as the replication strategy. To assess scalability under real-world traffic patterns, like Gaussian and other similar bell-curve shapes, we measured the time required to bootstrap new replicas into a live cluster without disrupting active traffic. Based on these results, we derived a mathematical model to quantify and compare the scalability gaps between both systems.

Methodology
To assess scalability under realistic conditions, we ran performance tests to simulate typical production traffic fluctuations. The actual benchmarking is a series of invocations of ScyllaDB's fork of latte with a consistency level of LOCAL_QUORUM. To test scalability, we used a "real-life" mixed distribution, with the majority (80%) of operations distributed over a hot set and the remaining 20% iterating over a cold set. latte is the Lightweight Benchmarking Tool for Apache Cassandra, developed by Piotr Kołaczkowski, a DataStax software engineer. Under the hood, latte relies on ScyllaDB's Rust driver, which is compatible with Apache Cassandra. It outperforms other widely used benchmarking tools, provides better scalability, and has no GC pauses, resulting in less latency variability in the results. Unlike other benchmarking tools, latte (thanks to its use of Rune) also provides a flexible syntax for defining workloads closely tied to how developers actually interact with their databases. Lastly, we can always brag we did it in Rust, just because… 🙂 We set baseline traffic at 70% of the observed peak before P99 latency crossed a 10ms threshold. This was to ensure both databases retained sufficient CPU and I/O headroom to handle sudden traffic and concurrency spikes, as well as the overhead of scaling operations.

Setup
The following table shows the infrastructure we used for our tests:

                     Cassandra/ScyllaDB           Loaders
EC2 Instance type    i4i.4xlarge                  c6in.8xlarge
Cluster size         3                            1
vCPUs (total)        16 (48)                      32
RAM (total)          128 (384) GiB                64 GiB
Storage (total)      1 x 3,750 GB AWS Nitro SSD   EBS-only
Network              Up to 25 Gbps                50 Gbps

ScyllaDB and Cassandra nodes, as well as their respective loaders, were placed under their own exclusive AWS Cluster Placement Group for low-latency networking. Given the side effect of all replicas being placed in the same availability zone, we placed each node under an artificial rack using the GossipingPropertyFileSnitch.
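Before detailing the schema, here is a minimal sketch of the 80/20 hot/cold access pattern described in the methodology above. It is not latte's actual Rune workload definition; the hot-set size and total key population are assumptions chosen for illustration.

import random

TOTAL_KEYS = 2_000_000_000    # matches the pre-populated partition count described below
HOT_KEYS = 20_000_000         # assumed hot-set size; an illustrative 1% of all keys

def next_key() -> int:
    """80% of operations target the hot set; the remaining 20% hit the cold set."""
    if random.random() < 0.80:
        return random.randrange(HOT_KEYS)             # hot set
    return random.randrange(HOT_KEYS, TOTAL_KEYS)     # cold set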
The schema used throughout all test suites resembles the default cassandra-stress schema, and the keyspace relies on NetworkTopologyStrategy with a replication factor of 3:

CREATE TABLE IF NOT EXISTS keyspace1.standard1 (
    key blob PRIMARY KEY,
    c0 blob,
    c1 blob,
    c2 blob,
    c3 blob,
    c4 blob
);

We used a payload of 1,010 bytes, where:
10 bytes represent the key size, and
each of the 5 columns is a distinct 200-byte blob

Both databases were pre-populated with 2 billion partitions for an approximate (replicated) storage utilization of ~2.02TB. That's about 60% disk utilization, considering the metadata overhead.

Tuning Apache Cassandra
Cassandra was originally designed to run on commodity hardware. As such, it ships with numerous tuning options suitable for various use cases. However, this flexibility comes with a cost: tuning Cassandra is entirely up to its administrators, with limited guidance from online resources. Unlike ScyllaDB, an Apache Cassandra deployment requires users to manually tune kernel settings, set user limits, configure the JVM, set disks' read-ahead, decide upon compaction strategies, and figure out the best approach for pushing metrics to external monitoring systems. To make things worse, some configuration file comments are outdated or ambiguous across versions. For example, CASSANDRA-16315 and CASSANDRA-7139 describe problems involving the default setting for concurrent compactors and offer advice on how to tune that parameter. Along those lines, it's worth mentioning Amy Tobey's Cassandra tuning guide (perhaps the most relevant Cassandra tuning resource available to date), where it says:

"The inaccuracy of some comments in Cassandra configs is an old tradition, dating back to 2010 or 2011. (…) What you need to know is that a lot of the advice in the config commentary is misleading. Whenever it says "number of cores" or "number of disks" is a good time to be suspicious. (…)" – Excerpt from Amy's Cassandra tuning guide, cassandra.yaml section

Tuning the JVM is a journey of its own. Cassandra 5.0 production recommendations don't mention it, and the jvm-* files page only deals with the file-based structure as shipped with the database. Although DataStax's Tuning Java resource does a better job of providing recommendations, it warns to adjust "settings gradually and test each incremental change." Further, we didn't find any references to ZGC (available as of JDK17) on either the Apache Cassandra or DataStax websites. That made us wonder whether this garbage collector was even recommended. Eventually, we settled on using settings similar to those that TheLastPickle used in their Apache Cassandra 4.0 Benchmarks. During our scaling tests, we hit another inconsistency: we noticed Cassandra's streaming operations had a default cap of 24MiB/s per node, resulting in suboptimal transfer times. Upon raising those thresholds, we noticed that:
Cassandra 4.0 docs mentioned tuning the stream_throughput_outbound_megabits_per_sec option
Both Cassandra 4.1 and Cassandra 5.0 docs referenced the stream_throughput_outbound option
This Instaclustr article (or carefully interpreting cassandra_latest.yaml) seems like the best resource for understanding the correct entire_sstable_stream_throughput_outbound option
In other words, 3 distinct settings exist for tuning the previous 3 major releases of Cassandra. If your organization is looking to upgrade, we strongly encourage you to conduct a careful review and full round of testing on your own.
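As a sanity check when upgrading, a small script along these lines can report which of those three streaming-throughput knobs a given cassandra.yaml actually defines. The configuration path is an assumption; the option names are the ones listed above.

import yaml  # PyYAML

# Streaming-throughput option names across Cassandra 4.0, 4.1, and 5.0 (see above).
STREAMING_OPTIONS = (
    "stream_throughput_outbound_megabits_per_sec",
    "stream_throughput_outbound",
    "entire_sstable_stream_throughput_outbound",
)

def report_streaming_settings(path="/etc/cassandra/cassandra.yaml"):
    with open(path) as f:
        config = yaml.safe_load(f) or {}
    for option in STREAMING_OPTIONS:
        print(f"{option}: {config.get(option, '<not set>')}")

report_streaming_settings()

Running it against each node's configuration before and after an upgrade makes it easy to spot a throughput cap that has silently reverted to its default.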
This is not an edge case; others have noted similar upgrade problems on the Apache Cassandra mailing list. CASSANDRA-20692 demonstrates that Apache Cassandra 5 failed to notice a potential WAL corruption under its newer Direct IO implementation, as issuing I/O requests without O_DSYNC could manifest as data loss during abrupt restarts. This, in turn, gives users a false sense of improved write performance. Configuring Apache Cassandra is not intuitive. We used cassandra_latest.yaml as a starting point, and ran multiple iterations of the same workload under a variety of configuration options and different GC settings. The results are shown below and demonstrate how even small tuning changes can have a dramatic impact on Cassandra's performance (for better or for worse). We started by evaluating the performance of G1GC and observed that tail latencies were severely affected beyond a throughput of 40K/s. Simply switching to ZGC gave a nice performance boost, so we decided to stick with it for the remainder of our testing. The following table shows the performance variability of Cassandra 5.0 under different tuning settings (ordered from best to worst case):

Test Kind                     Garbage Collector   Read-ahead   Compaction Throughput   P99 Latency   Throughput
Cassandra RA4 Compaction256   ZGC                 4KB          256MB/s                 6.662ms       120K/s
Cassandra RA4 Compaction0     ZGC                 4KB          Unthrottled             8.159ms       120K/s
Cassandra RA8 Compaction256   ZGC                 8KB          256MB/s                 4.657ms       100K/s
Cassandra RA8 Compaction0     ZGC                 8KB          Unthrottled             4.903ms       100K/s
Cassandra G1GC                G1GC                4KB          256MB/s                 5.521ms       40K/s

Although we spent a considerable amount of time tuning Cassandra to provide an unbiased and neutral comparison, we eventually found ourselves in a feedback loop. That is, the reported performance levels are only applicable to the workload being stressed, running on the infrastructure in question. If we were to switch to different instance types or run different workload profiles, then additional tuning cycles would be necessary. We anticipate that the majority of Cassandra deployments do not undergo the level of testing we carried out on a per-workload basis. We hope that our experience may prevent other users from running into the same mistakes and gotchas that we did. We're not claiming that our settings are the absolute best, but we don't expect that further iterations will yield large performance improvements beyond what we observed.

Tuning ScyllaDB
We carried out very little tuning for ScyllaDB beyond what is described in the Configure ScyllaDB documentation. Unlike Apache Cassandra, the scylla_setup script takes care of most of the
nitty-gritty details related to optimal OS tuning. ScyllaDB also
used tablets for data distribution. We targeted a minimum of 100
tablets/shard with the following CREATE KEYSPACE
statement: CREATE KEYSPACE IF NOT EXISTS keyspace1 WITH
REPLICATION = { 'class': 'NetworkTopologyStrategy', 'datacenter1':
3 } AND tablets = {'enabled': true, 'initial': 2048};
Limitations of our Testing
Performance testing often fails to
capture real-world performance metrics tied to the semantics and
access patterns of applications. Aspects such as variable
concurrency, the impact of DELETEs (tombstones), hotspots,
and large partitions were beyond the scope of our testing. Our work
also did not aim to provide a feature-specific comparison. While
Apache Cassandra 5.0 ships with newer (and less battle-tested)
features like
Storage-attached Indexes (SAI), ScyllaDB also ships with
Workload Prioritization,
Local Secondary Indexes, and
Synchronous Materialized Views, none of which have a
Cassandra counterpart. However, we ensured that each database's
newer, transparently enabled features were used, such as Cassandra's Trie Memtables,
Trie-indexed SSTables and its newer Unified Compaction Strategy, as
well as ScyllaDB’s features like Tablets,
Shard-awareness, SSTable Index Caching, and so forth. Future
tests will use ScyllaDB’s Trie-indexed SSTables. Also note that
both databases now offer Vector Search, which was not in scope for
this project. Finally, this benchmark focuses specifically on
scaling operations, not steady-state performance. ScyllaDB has
historically demonstrated higher throughput and lower latency than
Cassandra in multiple performance benchmarks. Cassandra 5
introduces architectural improvements, but our preliminary testing
shows that ScyllaDB maintains its performance advantage. Producing
a full apples-to-apples benchmark suite for Cassandra 5 is a
sizable project that’s outside the scope of this study. For teams
evaluating a migration, the best insights will come from testing
your real-life workload profile, data models, and SLAs directly on
ScyllaDB. If you are running your own evaluations (tip: ScyllaDB
Cloud is the easiest way), our technical
team can review your setup and share tips for accurately
measuring ScyllaDB’s performance in your specific environment.