When Bigger Instances Don’t Scale

A bug hunt into why disk I/O performance failed to scale on larger AWS instances

The promise of cloud computing is simple: more resources should equate to better, faster performance. When scaling up our systems by moving to larger instances, we naturally expect a proportional increase in capabilities, especially in critical areas like disk I/O. However, ScyllaDB’s experience enabling support for the AWS i7i and i7ie instance families uncovered a puzzling performance bottleneck. Contrary to expectations, bigger instances simply did not scale their I/O performance as advertised.

This blog post traces the challenging, multi-faceted investigation into why IOTune (a disk benchmarking tool shipped with Seastar) was achieving a fraction of the advertised disk bandwidth on larger instances. On these machines, throughput plateaued at a modest 8.5GB/s and IOPS were much lower than expected on increasingly beefy machines. What followed was a deep dive into the internals of the ScyllaDB IO Scheduler, where we uncovered subtle bugs and incorrect assumptions that conspired to constrain performance scaling. Join us as we investigate the symptoms, pin down the root cause, and share the hard-fought lessons learned on this long journey.

This blog post is the first in a three-part series detailing our journey to fully harness the performance of modern cloud instances. While this piece focuses on the initial set of bottlenecks within the IO Scheduler, the story continues in two subsequent posts. Part 2, The deceptively simple act of writing to disk, tracks down a mysterious write throughput degradation we observed in realistic ScyllaDB workloads after applying the fixes discussed here. Part 3, Common performance pitfalls of modern storage I/O, summarizes the invaluable lessons learned and provides a consolidated list of performance pitfalls to consider when striving for high-performance I/O on modern hardware and cloud platforms.

Problem description

Some time ago, ScyllaDB decided to support the AWS i7i and i7ie families. Before we support a new instance type, we run extensive tests to ensure ScyllaDB squeezes every drop of performance out of the provisioned hardware. While measuring disk capabilities with the Seastar IOTune tool, we noticed that the IOPS and bandwidth numbers didn’t scale well with the size of the instance, and we ended up with much lower values than AWS advertised.

Read IOPS were on par with AWS specs up to i7i.4xlarge, but they were getting progressively worse, up to 25% lower than spec on i7i.48xlarge. Write IOPS were worse, starting at around 25% less than spec for i7i.4xlarge and up to 42% less on i7i.48xlarge.

Bandwidth numbers were even more interesting. Our IOTune measurements were similar to fio up to the i7i.4xlarge instance type. However, as we scaled up the instance type, our IOTune bandwidth numbers were plateauing at around 8.5GB/s while fio was managing to pull up to 40GB/s throughput for i7i.48xlarge instances.

Essential Toolkit

The IOTune tool is a disk benchmarking tool that ships with Seastar. When you run this tool on a storage mount point, it outputs 4 values corresponding to the read/write IOPS and read/write bandwidth of the underlying storage system. These 4 values end up in a file called io-properties.yaml. When provided with these values, the Seastar IO Scheduler will build a model of the disk, which it will use to help ScyllaDB maximize the drive’s performance.
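That model ultimately boils down to an admission check along the following lines (a simplified sketch for intuition, not the actual Seastar code): the scheduler keeps dispatching requests only while the combined, normalized load stays within the disk’s measured capabilities.

```cpp
// Simplified sketch of the disk model -- illustrative only, not Seastar's
// actual implementation. The *_max values are the four numbers measured by
// IOTune; the other arguments describe the work currently being dispatched.
struct io_properties {
    double read_bw_max;     // bytes/s
    double write_bw_max;    // bytes/s
    double read_iops_max;   // requests/s
    double write_iops_max;  // requests/s
};

// Fraction of the modelled disk capacity consumed by a given mix of work.
double capacity_fraction(const io_properties& p,
                         double read_bw, double write_bw,
                         double read_iops, double write_iops) {
    return read_bw   / p.read_bw_max   + write_bw   / p.write_bw_max +
           read_iops / p.read_iops_max + write_iops / p.write_iops_max;
}

// The scheduler admits more requests only while capacity_fraction(...) <= 1.0.
```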
The IO Scheduler models the disk based on the IOPS and bandwidth properties using a formula that looks something like:

read_bw/read_bw_max + write_bw/write_bw_max + read_iops/read_iops_max + write_iops/write_iops_max <= 1

The internal mechanics of how the IO Scheduler works are described very thoroughly in the blog post I linked above.

The io_tester tool is another utility within the Seastar framework. It’s used for testing and profiling I/O performance, often in more controlled and customizable scenarios than the automated IOTune. It allows users to simulate specific I/O workloads (e.g., sequential vs. random, various request sizes, and concurrency levels) and measure resulting metrics like throughput and latency. It is particularly useful for:

Deep-dive analysis: Running experiments with fine-grained control over I/O parameters (e.g., --io-latency-goal, request size, parallelism) to isolate performance characteristics or potential bottlenecks.

Regression testing: Verifying that changes to the IO Scheduler or underlying storage stack do not negatively impact I/O performance under various conditions.

Fair Queue experimentation: As shown in this investigation, io_tester can be used to observe the relationship between configured workload parameters, the resulting in-disk queue lengths, and the throttling behavior of the IO Scheduler.

What this meant for ScyllaDB

We didn’t want to enable i7i instances if the IOTune numbers didn’t accurately reflect the underlying disk performance of the instance type. Lower io-properties numbers cause the IO Scheduler to overestimate the cost of each request. This leads to more throttling, making monstrous instances like i7i.48xlarge perform like much cheaper alternatives (such as the i7i.4xlarge).

Pinning the symptoms

Early on, we noticed that the observed symptoms pointed to two different problems. This helped us narrow down the root causes much faster (well, “fast” is a very misleading term here). We were chasing a lower-than-expected IOPS issue and a separate low-bandwidth issue. IOPS and bandwidth numbers were behaving differently when scaling up instances. The former was scaling, but with much lower values than we expected. The latter would just plateau at one point and stay there, no matter how much money you’d throw at the problem.

We started with the hypothesis that IOTune might misdetect the disk’s physical block size from sysfs and that we issue requests with a different size than what the disk “likes,” leading to lower IOPS. After some debugging, we confirmed that IOTune indeed failed to detect the block size, so it defaulted to using 512-byte requests. There’s no bug to fix on the IOTune side here, but we decided we needed to be able to specify the disk block size for reads and writes independently when measuring. This turned out to be quite helpful later on. With 4K requests, we were able to measure the expected ~1M IOPS for writes compared to the ~650k IOPS we were getting with the autodetected 512-byte requests (numbers relevant for the i7i.12xlarge instance).

We had a fix for the IOPS issue, but – as we discovered later – we didn’t properly understand the actual root cause. At that point, we thought the problem was specific to this instance type and caused by IOTune misdetecting the block size. As you’ll see in the next blog post in the series, the root cause is a lot more interesting and complicated.

The plateauing bandwidth issue was still on the table. Unfortunately, we had no clue about what could be going on.
So, we started exploring the problem space, concentrating our efforts as you’d imagine any engineer would.

Blaming the IO Scheduler

We dug around, trying to see if IOTune became CPU-limited for the bandwidth measurements. But that wasn’t it. It’s somewhat amusing that our initial reaction was to point the finger at the IO Scheduler. This bias stems from when the IO Scheduler was first introduced in ScyllaDB. It had such a profound impact that numerous performance issues over time – things that were propagating downward to the storage team – were often (and sometimes unfairly) attributed to it.

Understanding the root cause

We went through a series of experiments to try to narrow down the problem further and hopefully get a better understanding of what was happening. Most of the experiments in this article, unless explicitly specified, were run on an i7i.12xlarge instance. The expected throughput was ~9.6GB/s, while IOTune was measuring a write throughput of 8.5GB/s.

To rule out poor disk queue utilization, we ran fio with various iodepths and block sizes, then recorded the bandwidth. We noticed that the request needs to be ~4MB to fill the disk queue. Next, we collected the same for io_tester with --io-latency-goal=1000 to prevent the queue from splitting requests. A larger latency goal means the scheduler can be more relaxed and submit the requests as they come, because it has plenty of time (1000 ms) to complete each request. If the goal is smaller, the IO Scheduler gets stressed because it needs to make each request complete within that tight schedule. Sometimes it might just split a request in half to take advantage of the in-disk parallelism and hopefully make the original request fit the tight latency goal.

The fio tool seemed to be pulling the full bandwidth from the disk, but our io_tester tool was not. The issue was definitely on our side. The good news was that both io_tester and IOTune measured similar write throughputs, so we weren’t chasing a bug in our measurement tools. The conclusion of this experiment was that we saturated the disk queue properly, but we still got low bandwidth.

Next, we pulled an ace out of our sleeve. A few months before this, we were at a hackathon during our engineering summit. During that hackathon, our Storage and Networking team built a prototype Seastar IO Scheduler controller that would bring more transparency and visibility into how the IO Scheduler works. One of the patches from that project was a hack that would make the IO Scheduler drop a lot of the IOPS/bandwidth throttling logic and just drain like crazy whatever requests are queued. We applied that patch to Seastar and ran the IOTune tool again. It was very rewarding to see the following output:

Measuring sequential write bandwidth: 9775 MB/s (deviation 6%)
Measuring sequential read bandwidth: 13617 MB/s (deviation 22%)

The bandwidth numbers escaped the 8.5GB/s limit that was previously constraining our measurements. This meant we were correct in blaming the IO Scheduler. We were indeed experiencing throttling from the scheduler, specifically from something in the fair queue math. At that point, we needed to look more closely at the low-level behavior. We patched Seastar with another home-brewed project that adds a low-overhead binary tracer to the IO Scheduler. The plan was to run the tracer on both the master version and the one with the hackathon patch applied – then try to understand why the hackathon-patched scheduler performs better.
We added a few traces and immediately started to see patterns like the following in the slow master trace: in one example, it took 134-51=83us to dispatch a single request. The “Q” event is when a request arrives at the scheduler and gets queued; “D” stands for when a request gets dispatched. For reference, the patched IO Scheduler spent 1us to dispatch a request.

The unexpected behavior suggested an issue with the token bucket accumulation, as requests should be dispatched instantly when running without io-properties.yaml (effectively providing unlimited tokens). This is precisely the scenario when IOTune is running: it withholds io-properties.yaml from the IO Scheduler. This allows the token bucket to operate with unlimited tokens, stressing the disk to its maximum potential so IOTune can compute, by itself, the required io-properties.yaml.

The token bucket seems to run out of tokens…but why? When the token bucket runs out of tokens, it needs to wait for tokens to be replenished as other requests complete. This delays the dispatch of the next request. That’s why the request above waited 83us to get dispatched when it should have actually been dispatched in 1us.

There wasn’t much more we could do with the event tracer. We needed to get closer to the fair queue math. We returned to io_tester to examine the relationship between the parallelism of the test and the size of the in-disk queues. We ran io_tester for request sizes within [128kB, 1MB] and parallelism within [1, 2, 4, 8, 16] fibers. We ran it once for the master branch (slow) and once for the “hackathon” branch (fast), and plotted throughput (vertical axis) against parallelism (horizontal axis) for two request sizes, 1MB and 128kB. For both request sizes, the “hackathon” branch outperformed the “master” branch. Also, the 1MB requests saturate the disk with much lower parallelism than the 128kB requests. No surprises here; the result wasn’t that valuable.

In a follow-up test, we collected the in-disk latencies as well. We plotted throughput against parallelism for both the master and hackathon branches, with the measured in-disk latencies overlaid on the bars. This is already much better. After the disk is saturated, increasing parallelism should create a proportional increase in in-disk latency. That’s exactly what happens for the hackathon branch. We couldn’t say the same about the master branch. Here, the throughput plateaued around 4 fibers, and the in-disk latency didn’t grow! For some reason, we didn’t end up stressing the disk.

To investigate further, we wanted to see the size of the actual in-disk queues. So, we coded up a patch to make io_tester output this information and plotted the in-disk queue size against parallelism for various request sizes. At this point, it became clear that we weren’t sufficiently leveraging the in-disk parallelism. Likely, the fair_queue math was making the IO Scheduler throttle requests excessively. This is indeed what the plots showed: in the master (slow) branch run, the in-disk queue length for the 1MB requests (which saturate the disk faster) plateaus at around 4 requests once parallelism reaches 4 and higher. That’s definitely not alright. Just for fun, we also looked at Little’s Law in action and plotted disk_queue_length / latency (which, per Little’s Law, approximates the achieved request rate) for each branch.

Next, we wanted to (somehow) replicate this behavior without involving an actual disk. This way, we could maybe create a regression test for the IO Scheduler.
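Before moving on, here’s roughly what the token-bucket behavior described above amounts to (a deliberately simplified sketch – the real Seastar fair queue is far more involved): when the bucket has no tokens left, the next request stays queued until completions replenish it, which is exactly the kind of 83us dispatch delay visible in the trace.

```cpp
#include <cstdint>

// Simplified token bucket -- illustrative only, not the Seastar implementation.
// Requests consume tokens when dispatched; completions put tokens back.
class token_bucket {
    uint64_t _tokens;
public:
    explicit token_bucket(uint64_t initial_tokens) : _tokens(initial_tokens) {}

    // Try to take the tokens a request costs; if the bucket is empty,
    // the request stays queued and its dispatch is delayed.
    bool try_dispatch(uint64_t cost) {
        if (cost > _tokens) {
            return false;   // out of tokens -> wait for replenishment
        }
        _tokens -= cost;
        return true;
    }

    // Called as other requests complete (or as time passes).
    void replenish(uint64_t amount) { _tokens += amount; }
};

// With an effectively unlimited token budget (the unconfigured/IOTune case),
// try_dispatch() never fails and requests go out immediately.
```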
The Seastar ioinfo tool was perfect for that job of exercising the scheduler without a real disk. ioinfo can take an io-properties.yaml file as an argument. It feeds the values to the IO Scheduler, then the tool outputs the token bucket parameters (which can be used to calculate the theoretical IOPS and throughput values that the IO Scheduler can achieve). Our goal was to compare these calculated values with what was configured in an io-properties.yaml file and make sure the IO Scheduler could deliver very close to what it was configured for. For reference, we compared the calculated IOPS/bandwidth against the configured values: the values returned by the scheduler were within a 5% margin of the configured ones.

This was fantastic news (in a way). It meant the fair_queue math behaves correctly even with bandwidths above 8.xGB/s. We didn’t get the regression test we hoped for, since the fair_queue math was not causing the throttling and disk underutilization we’d seen in the previous experiment. However, we did add a test that would check if this behavior changes in the future.

We did get a huge win from this, though. We came to the conclusion that something in the fair_queue math, or elsewhere in the IO Scheduler, must be incorrect only when it’s not configured with an io-properties file. At that point, the problem space narrowed significantly.

Playing around with the inputs from the io-properties.yaml file, we uncovered yet another bug. For large enough read IOPS/bandwidth numbers in the config file, the IO Scheduler would report request costs of zero. After many discussions, we learned that this is not really a bug. It’s how the math should behave. With big io-properties numbers, the math should plateau the costs at 0. It makes sense: the more resources you have available, the less significant a single unit of effort becomes. This led us to an important realization: the unconfigured case (our original issue) should also produce a cost of zero. A zero cost means that the token bucket won’t consume any tokens. That gives us unbounded output…which is exactly what IOTune wants.

Now we needed to figure out two things:

Why doesn’t the IO Scheduler report a cost of zero for the unconfigured case? In theory, it should. In the issue linked above, costs became zero for values that weren’t even close to UINT64_MAX.

Was our code prepared to handle costs of zero? We should ensure we don’t end up with weird overflows, divisions by zero, or any undefined behavior from code that assumes costs can’t be zero.

When things start to converge

At this point, we had no further leads, so we thought there must be something wrong with the fair queue math. I reviewed the math from Implementing a New IO Scheduler Algorithm for Mixed Read/Write Workloads, hoping to find some formula mistake that made the bandwidth hit a theoretical limit at 8.5GB/s, but I didn’t find any obvious flaws that could explain our unconfigured case. So we concluded there must be some flaw in the implementation of the math itself. We started suspecting that there must be some overflow that ends up shrinking the bandwidth numbers. After quite some time tracking the math implementation in the code, we managed to find the issue. Two internal IO Scheduler variables that were storing the IOPS and bandwidth values configured via io-properties.yaml had a default value set to `std::numeric_limits<int>::max()`.
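In spirit, the problem looked something like the sketch below (a hedged reconstruction for illustration – the real variable names and derivation differ):

```cpp
#include <cstdint>
#include <limits>

// Hedged reconstruction of the bug -- not the actual Seastar code.
// These limits are recalculated from io-properties.yaml when a file is
// provided; in the unconfigured case the defaults were meant to act as
// "unlimited".
struct fair_group_limits {
    // Buggy defaults: int max (~2.1e9) instead of "practically infinite",
    // so the fair queue math still produced non-zero request costs and the
    // token bucket imposed a finite throughput ceiling.
    uint64_t max_bandwidth = std::numeric_limits<int>::max();
    uint64_t max_iops      = std::numeric_limits<int>::max();

    // The one-line fix: make the unconfigured defaults genuinely unbounded.
    // uint64_t max_bandwidth = std::numeric_limits<uint64_t>::max();
    // uint64_t max_iops      = std::numeric_limits<uint64_t>::max();
};
```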
It wasn’t that intuitive to figure out – the variables weren’t holding the actual io-properties values, but rather values derived from them. This made the mistake harder to spot. There is some code that recalculates those variables when the io-properties.yaml file is provided and parsed by the Seastar code. However, in the “unconfigured” case, those code paths are intentionally not hit. So, the INT_MAX values were carried into the fair queue math, crunched through the formulas, and resulted in the 8.xGB/s throughput limit we kept seeing. The fix was as simple as changing the default value to `std::numeric_limits<uint64_t>::max()`. A one-line fix for many weeks of work.

It’s been a crazy long journey chasing such a small bug, but it has been an invaluable (and fun!) learning experience. It led to lots of performance gains and enabled ScyllaDB to support highly efficient storage instances like i7i, i7ie and i8g.

Stay tuned for the next episode in this series of blog posts. In part 2, we will uncover that the performance gains after this work weren’t quite what we were expecting on realistic workloads. We will dive deep into some very dark corners of modern NVMe drives and filesystems to unlock a significant chunk of write throughput.

Read part 2

Scaling Is the “Funnest” Game: Rachel Stephens and Adam Jacob

When not to worry about scale, when to rearchitect everything and why passionate criticism is a win “There’s no funner game than the at-scale technology game. But if you play it, some people will hate you for it. That’s okay…that’s the game you chose to play.” – Adam Jacob At Monster Scale Summit 2025, Rachel Stephens, research director at RedMonk, spoke with Adam Jacob, co-founder of Chef and CEO of System Initiative, about what it really means to build and operate software at scale. Note: Monster SCALE Summit 2026 will go live March 11-12, featuring antirez, creator of Redis; Camille Fournier, author of “The Manager’s Path” and “Platform Engineering”; Martin Kleppmann, author of “Designing Data-Intensive Applications” and more than 50 others. The event is free and virtual. Register for free and join the community for some lively chats The Existential Question of Scale Stephens opened with an existential question: “Does your software exist if your users can’t run it?” Yes, your code still exists in GitHub even if us-east-1 goes down. But what if … Your system crawls under load. Critical integrations constantly break. You can’t afford the infrastructure costs. “Software at scale isn’t just about throughput,” Stephens said. “It’s about making sure that your code endures, adapts and remains accessible no matter the load and location of where you’re running. Because if your users can’t use it, your software may as well not exist.” With that framing, Stephens brought in someone who’s spent his career dealing with scale firsthand: Adam Jacob. Only Scale When It Hurts Stephens asked Jacob how teams can balance quality, speed and scale under uncertainty. How do you avoid both cutting corners and premature optimization? Jacob argues that early on, it’s fine not to worry much about scale. Most products fail for other reasons before scalability ever becomes a problem. He explained: “I think of it basically through the lens of optionality. When you start building new things, it’s nice not to worry too much about scale, because you may never reach it. Most products don’t fail because they fail to scale. Think about how badly Twitter failed to scale … and yet here we are.” The first priority is to build a solid product. Once scale becomes a real issue, that’s when it makes sense to refactor and remove bottlenecks. But if you’ve been around the block a little, your experience helps you make early choices that pay off later. Jacob noted, “Premature optimization is real. But as you gain experience, there are some decisions you make early because you know that if things work out, you’ll be happier later — like factoring your code so it can be broken apart across network boundaries over time, if you need to.” Chef Scalability Horror Stories Next, Stephens asked Jacob if he would share a scaling horror story from his Chef days. Jacob obliged and offered two memorable ones. “The best was when we launched the first version of Hosted Chef. The day before the launch, we discovered it took about a minute and a half to create a new user. It didn’t take that long when we were running it on a laptop, but it did later … and we never really tested it. So, in the final hours before launch, we changed it from ‘anyone can sign up’ to a queue system with a little space robot saying, ‘Demand is so high; we’ll get back to you.’ We just papered over the scalability problem.” “Another example: that same Chef server (the one that couldn’t create accounts quickly) eventually had to work at Facebook. 
The original version was written in Rails, which was great to work with, but not parallel enough. At Facebook scale, you might have 40,000 or 50,000 things pointed at one Chef server. So we rebuilt it in Erlang, which is great for that kind of problem. I literally brought the Erlang version to Facebook on a USB stick. When we installed it and bootstrapped a data center, we thought it was broken because it was using less compute and finished almost instantly.” Jacob explained that if they’d tried to build the Chef server in Erlang from the start, the project probably wouldn’t have gained traction. Starting in Rails made it possible to get Chef out into the world and learn what the system really needed to do. Only later, once they understood how the system really behaved, could they rebuild it with the right architecture and runtime for scale. Growth or Efficiency: Know Which Game You’re Playing At Chef, scaling was ultimately required to land customers like Facebook and JPMorgan Chase, which operate at massive scale. Jacob advised, “Making it scale required major investment, but it worked. You can’t buy your core. If it matters to customers, you have to build it yourself. People often wait too long to realize they have a deep architectural problem that’s also a business problem. Rebuilding for scale takes months, so you have to start early.” Your own approach to scale should ultimately be driven by what game you’re playing: In the venture-capital game, growth and traction come first. You can spend money to scale faster because you’re funded. In the profitability game, efficiency comes first. Overspending on compute or poor architecture hits the bottom line hard. Why Scaling is the ‘Funnest’ Game Stephens mentioned that “when software succeeds, it stops being yours – it becomes everyone’s.” She then asked Jacob what it’s like when your tech scales to the point that people have extremely strong opinions about it. His response: “It’s hard to build things that people care about. If you’re lucky enough to create something you love and share it with the world and people love it back, that’s incredibly rewarding. Even when they don’t, that’s still a gift.” “Someone once tapped me in a coffee shop and said, ‘You wrote Chef? I hate Chef.’ I said, ‘I’m sorry; I didn’t write it to hurt you.’ But at scale, that means he used it. It mattered in his life. And that’s what you want: for people to experience what you built.” “I love the technology, the problem, the difficulty. Scaling adds more layers of complexity, more layers of fun. There’s no funner game than the at-scale technology game. But if you play it, some people will hate you for it. That’s okay…that’s the game you chose to play.” You can watch the full talk below.

You Got OLAP in My OLTP: Can Analytics and Real-Time Database Workloads Coexist?

Explore isolation mechanisms and prioritization strategies that allow different database workloads to coexist without resource contention issues

Analytics (OLAP) and real-time (OLTP) workloads serve distinctly different purposes. OLAP (online analytical processing) is optimized for data analysis and reporting, while OLTP (online transaction processing) is optimized for real-time, low-latency traffic. Most databases are designed to primarily benefit from either OLAP or OLTP, but not both. Worse, concurrently running both workloads under the same data store will frequently introduce resource contention. The workloads end up hurting each other, considerably dragging down the overall distributed system’s performance. Let’s look at how this problem arises, then consider a few ways to address it.

OLTP vs OLAP Databases

There are basically two fundamental approaches to how databases store data on disk. We have row-oriented databases, often used for real-time workloads. These store all data pertaining to a single row together on disk.

Row-oriented storage (ideal for OLTP) vs. column-oriented storage (ideal for OLAP)

On the other side of the spectrum, we have column-oriented databases, which are often used for running analytics. These databases store data in a vertical way (versus horizontal partitioning of rows). This single design decision effectively makes it much easier and more efficient for the database to run aggregations, perform calculations, and retrieve insights such as “Top K” metrics.

OLTP vs. OLAP Workloads

So the general consensus is that if you want to run OLTP workloads, you use a row-oriented database – and you use a columnar one for your analytics workloads. However, contrary to popular belief, there are a variety of reasons why people might actually want to run an OLAP workload on top of their real-time databases. For example, this might be a good option when organizations want to avoid data duplication or the complexity and overhead associated with maintaining two data stores. Or maybe they don’t extract insights all that often.

The Latency Problem

But problems can arise when you try to bring OLAP to your real-time database. We’ve studied this a lot with ScyllaDB, a specialized NoSQL database that’s primarily meant for high-throughput and low-latency real-time workloads. A graphic from ScyllaDB monitoring demonstrates what happens to latency when you try to run OLAP and OLTP workloads alongside one another: the green line represents a real-time workload, whereas the yellow one represents an analytics job that’s running at the same time. While the OLTP workload is running on its own, latencies are great. But as soon as the OLAP workload starts, the real-time latencies dramatically rise to unacceptable levels.

The Throughput Problem

Throughput is also an issue in such scenarios. Looking at the throughput clarifies why latencies climbed: the analytics process is consuming much higher throughput than the OLTP one. You can even see that the real-time throughput drops, which is a sign that the database got overloaded. Unsurprisingly, as soon as the OLAP job finishes, the real-time throughput increases and the database can then process its backlog of queued requests from that workload. That’s how the contention plays out in the database when you have two totally different workloads competing for resources in an uncoordinated way. The database is naively trying to process requests as they come in.
When Things Get Contentious

But why does this contention happen in the first place? If you overwhelm your database with too many requests, it cannot keep up. Usually, that’s because your database lacks either the CPU or I/O capacity that’s required to fulfill your requests. As a result, requests queue up and latency climbs. The nature of the workloads contributes to contention too: OLTP applications often process many smaller transactions and are very latency sensitive, whereas OLAP ones generally run fewer transactions that scan and process large amounts of data. So hopefully that explains the problem. But how do we actually solve it?

Option A: Physical Isolation

One option is to physically isolate these resources. For example, in a Cassandra deployment, you would simply add a new data center and separate your real-time processing from your analytics. This saves you from having to stream data and work with a different database. However, it considerably elevates your costs. Some specific examples of this strategy: Instaclustr, a managed services provider, shared a benchmark after isolating its deployments (Apache Spark and Apache Cassandra). GumGum shared the results of this approach (with multiregion Cassandra) at a past Cassandra Summit. There are definitely use cases and organizations running OLAP on top of real-time databases. But are there any other alternatives to resolve the problem altogether?

Option B: Scheduled Isolation

Other teams take a different approach: they avoid running their OLAP during their peak periods. They simply run their analytics pipelines during off-peak hours in order to mitigate the impact on latencies. For example, consider a food delivery company. Answering a question like “How much did this merchant sell within the past week?” is simple in OLTP. However, offering discounts to the 10 top-selling restaurants within a given region is much more complicated. In a wide-column database like Cassandra or ScyllaDB, it inevitably requires a full table scan. Therefore, it would make sense for such a company to run these analytics from after midnight until around 10 a.m. – before its peak traffic hours. This is a doable strategy, but it still doesn’t solve the problem. For example, what if your dataset doubles or triples? Your pipeline might overrun your time window. And you have to consider that your business is still running at that time (people will still order food at 2 a.m.). If you take this approach, you still need to tune your analytics job and ensure it doesn’t kill your database.

Option C: Workload Prioritization

ScyllaDB has developed an approach called Workload Prioritization to address this problem. It lets users define separate workloads and assign different resource shares to them. For example, you might define two service levels: the main one has 600 shares, and the secondary one has 200 shares.

CREATE SERVICE LEVEL main WITH shares = 600
CREATE SERVICE LEVEL secondary WITH shares = 200

ScyllaDB’s internal scheduler will process three times more tasks from the main workload than from the secondary one. Whenever the system is under contention, it prioritizes its resource allocation accordingly. Why does this kick in only during contention? Because if there’s no contention, it means there is no bottleneck, so there is effectively nothing to prioritize.

Workload Prioritization Under the Hood

Under the hood, ScyllaDB’s Workload Prioritization relies on Seastar scheduling groups.
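To make that concrete, here’s a minimal sketch of two Seastar scheduling groups with a 3:1 share ratio, mirroring the 600/200 service levels above (this assumes the public Seastar scheduling-group API and is not ScyllaDB’s actual service-level code):

```cpp
#include <seastar/core/app-template.hh>
#include <seastar/core/scheduling.hh>
#include <seastar/core/future.hh>
#include <seastar/core/thread.hh>

// Sketch: whenever both groups have runnable tasks, the reactor gives the
// "main" group roughly three times the CPU time of "secondary".
int main(int argc, char** argv) {
    seastar::app_template app;
    return app.run(argc, argv, [] {
        return seastar::async([] {
            auto main_sg = seastar::create_scheduling_group("main", 600).get();
            auto secondary_sg = seastar::create_scheduling_group("secondary", 200).get();

            seastar::with_scheduling_group(main_sg, [] {
                // latency-sensitive, OLTP-style work goes here
                return seastar::make_ready_future<>();
            }).get();

            seastar::with_scheduling_group(secondary_sg, [] {
                // background, OLAP-style work goes here
                return seastar::make_ready_future<>();
            }).get();
        });
    });
}
```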
Seastar is a C++ framework for data-intensive applications. ScyllaDB, Redpanda, Ceph’s SeaStore and other technologies are built on top of it. Scheduling groups are effectively the way Seastar allows background operations to have little impact on foreground activities. For example, in ScyllaDB and database-specific terms, there are several different scheduling groups within the database: ScyllaDB has a distinct group for compactions, streaming, memtables, and so on. With Cassandra, you might end up in a situation where compactions impact your workload performance. But in ScyllaDB, all compaction resources are scheduled by Seastar. According to its assigned shares, the database will allocate a corresponding portion of resources to the background activity (compaction, in this case) – therefore ensuring that the latency of the primary user-facing workload doesn’t suffer.

Using scheduling groups in this way also helps the database auto-tune. If the user-facing workload is light – for example, during off-peak hours – the system automatically has more spare computing and I/O cycles to spend, and the database will simply speed up its background activities.

OLTP and OLAP Can Coexist

Running OLAP alongside OLTP inevitably involves anticipating and managing contention. You can control it in a few ways: isolate analytics to its own cluster, run it in off-peak windows, or enforce workload prioritization. And workload prioritization isn’t just for allowing OLAP along with your OLTP. That same approach could also be used to assign different priorities to reads vs. writes, for example. If you’d like to learn more, take a look at my recent tech talk on this topic: “How to Balance Multiple Workloads in a Cluster.”

ScyllaDB Cloud on AWS I8g and I8ge: 2x Throughput, Lower Latency, Zero Extra Cost

How ScyllaDB performs on the new I8g and I8ge instances, across different workload types

Let’s start with the bottom line. For ScyllaDB, the new Graviton4-based i8g instances improve i4i throughput by up to 2x with better latency – and the i8ge improves i3en throughput by up to 2x with better latency. Benchmarks also show single-digit millisecond latency during maintenance operations like scaling. Fast and smooth scaling is an important part of the new ScyllaDB X Cloud offering. The results below show ScyllaDB’s max throughput under a 10ms latency SLA for different workloads, for the old i4i and i3en and the new i8g and i8ge.

AWS recently launched the I8g and I8ge storage-optimized EC2 instances powered by AWS Graviton4 CPUs and 3rd-generation AWS Nitro SSDs. They’re designed for I/O-intensive workloads like real-time databases, search, and analytics (so a nice fit for ScyllaDB).

Instance Family | Use Case | Number of vCPUs per instance | Storage
i8g | Compute bound | 2 to 96 | 0.5 to 22.5 TB
i8ge | Storage bound | 2 to 192 | 1.25 to 120 TB

Reduced TCO in ScyllaDB Cloud

Based on our performance results, ScyllaDB users migrating to Graviton4 can reduce infrastructure requirements by up to 50% compared to the previous-generation i4i and i3en instances. This translates into significantly lower total cost of ownership (TCO) by requiring fewer nodes to sustain the same workload. These improvements stem from a few factors – both in the new instances themselves, and in the match between ScyllaDB and these instances.

The new I8g architecture features:

vCPU-to-core mapping: On x86, each vCPU uses half a physical core (a hyperthread); on i8g (ARM), each vCPU is a full physical core.
Larger caches: 64kB instruction cache and 64kB data cache, compared to 32/48kB on Intel (shared between the two hyperthreads).
Faster storage and networking (see the spec above).

In addition, ScyllaDB’s design allows it to take full advantage of the new server types:

The shard-per-core architecture scales with linear performance to any number of cores.
The IO scheduler can take full advantage of the 3rd-generation AWS Nitro SSD, fully utilizing the higher I/O rate and lower latency without overloading the drive and increasing latency.
ARM’s relaxed memory model suits Seastar applications. Since locks and fences are rare, the memory subsystem has more opportunities to reorder memory accesses to optimize performance.

What this means for you

I8g and i8ge are now available on ScyllaDB Cloud. If you’re running ScyllaDB Cloud, the net impact is:

Compute-bound workloads: Move from I4i to I8g. This should provide up to 2x throughput at the same ScyllaDB Cloud price.
Storage-bound workloads: Move from I3en to I8ge. Here, you should expect up to 2x higher throughput at the same ScyllaDB Cloud price. Note that using the new ScyllaDB dictionary-based compression can lower the storage cost further.

For both use cases, ScyllaDB can keep the 10ms P99 latency SLA during maintenance operations, including scaling out and scaling down.

What we measured

Max Throughput: The maximum requests per second the database can handle.
Max Throughput under SLA: The maximum requests per second under a P99 latency of 10ms. Only throughput with latency below this SLA counts. This throughput can be sustained under any operation, like scaling and repair. This is the number you should use when sizing your ScyllaDB database on i8g instances.
P99 Latency: Measures the p99 latency at the Max Throughput under SLA.

Results

Read Workload – cached data

Cached data: working set size < available RAM, resulting in a close to 100% cache hit rate.

Instance type | Max throughput | Max throughput under latency SLA | Improvement | P99 (ms)
i4i.4xlarge | 1,062,578 | 750,000 | 100% | 7.84
i8g.4xlarge | 1,434,215 | 1,300,000 | 135% | 6.29
i3en.3xlarge | 585,975 | 550,000 | 100% | 4.37
i8ge.3xlarge | 962,504 | 800,000 | 164% | 6.38

Read Workload – non-cached data, storage only

Non-cached data: working set size >> available RAM, resulting in a 0% cache hit rate. When most of the data is not cached, storage becomes a significant factor for performance.

Instance type | Max throughput | Max throughput under latency SLA | Improvement | P99 (ms)
i4i.4xlarge | 218,674 | 210,000 | 100% | 4.56
i8g.4xlarge | 444,548 | 300,000 | 203% | 4.24
i3en.3xlarge | 145,702 | 140,000 | 100% | 6.83
i8ge.3xlarge | 259,693 | 255,000 | 178% | 7.95

Write Workload

Instance type | Max throughput | Max throughput under latency SLA | Improvement | P99 (ms)
i4i.4xlarge | 289,154 | 150,000 | 100% | 2.4
i8g.4xlarge | 689,474 | 600,000 | 238% | 4.02
i3en.3xlarge | 217,072 | 200,000 | 100% | 5.42
i8ge.3xlarge | 452,968 | 400,000 | 209% | 3.41

Tests under maintenance operations

ScyllaDB takes pride in testing under realistic use cases, including scaling out and in, repair, backups, and various failure tests. The following results represent the average P99 latency (across all nodes) during different maintenance operations on a 3-node cluster of i8ge.3xlarge, using the same setup as above.

Setup

ScyllaDB version: 2025.3.1-20250907.2bbf3cf669bb
DB node count: 3
DB instance type: i8ge.3xlarge
Loader node count: 4
Loader instance type: c5.2xlarge
Throughput: read 41K, write 81K, mixed 35K

Results

Read Test: Read Latency

Operation | Read P99 latency (ms)
Base: Steady state | 0.95
During repair | 4.92
During add node (scale out) | 2.68
During replace node | 3.10
During decommission node (scale down) | 2.44

Write Test: Write Latency

Operation | Write P99 latency (ms)
Steady state | 2.22
During repair | 3.24
Add node (scale out) | 2.49
Replace node | 3.07
Decommission node (scale down) | 2.37

Mixed Test: Write and Read Latency

Operation | Write P99 latency (ms) | Read P99 latency (ms)
Steady state | 2.03 | 2.11
During repair | 3.21 | 4.70
Add node (scale out) | 2.19 | 2.71
Replace node | 3.00 | 3.37
Decommission node (scale down) | 2.20 | 3.05

The results indicate that ScyllaDB can meet the latency SLA under maintenance operations. This is critical for ScyllaDB Cloud, and in particular ScyllaDB X Cloud, where scaling out and in is automatic and can happen multiple times per day. It’s also critical in unexpected failure cases, when a node must be replaced rapidly, without hurting availability or the latency SLA.

Test Setup

ScyllaDB cluster: 3-node cluster; i4i.4xlarge vs. i8g.4xlarge, and i3en.3xlarge vs. i8ge.3xlarge
Loaders: 4 loader nodes, instance type c7i.8xlarge
Workload: Replication Factor (RF) 3, Consistency Level (CL) QUORUM
Data size: 650GB for read/mixed, 1.5TB for write

Alan Shimel and Dor Laor on Database Elasticity and AI with ScyllaDB

Alan and Dor chat about high-performance databases & AI trends Everything about re:Invent 2025 screamed “massive” – from the exhibit hall’s towering booths, to the overflowing keynotes, to product announcements at every turn. ScyllaDB’s “scale fearlessly” message fit in perfectly. See ScyllaDB’s re:Invent videos But despite the crowds and chaos, Alan Shimel (founder and CEO of Techstrong Group) and Dor Laor (ScyllaDB co-founder and CEO) found a way to meet for a laid-back chat. Topics ranged from ScyllaDB’s origin story, to OSS, to ScyllaDB’s latest announcements for AI and extreme database elasticity. Read highlights below, or enjoy the full interview:  ScyllaDB AI Use Cases: Vector Scale, Feature Store, AI Stack Alan: What’s it like on the re:Invent floor? What are the conversations like? What are you hearing? Dor: There’s certainly no shortage of crowds at the booth. A lot of the conversation is about AI. We’re seeing a surge in AI-related use cases. At this point, about half of the use cases we see with ScyllaDB are directly related to AI. Alan: Explain that to me. What’s the use case? Dor: We usually split our AI uses cases into three categories. The first is being part of the AI stack itself. During training and serving, the stack needs to access a huge number of objects, and it needs a fast database to do that. In this case, we’re part of the core AI stack. It’s distributed databases handling very high workloads – and these are very high workloads. That can be for large LLM companies, or for much smaller companies that are just starting their AI journey. That’s the first category. The second category is the feature store. Feature stores are more traditionally associated with machine learning, but they’re still part of the AI world. A feature store lets people classify users, or sometimes agents, automatically. That can be used for recommendations in e-commerce, fraud detection, and a variety of other use cases. In those cases, the feature store needs a fast database to quickly determine how a user is classified and what’s appropriate for them – what they might want to watch, what ad they should see, and so on. The third category is vector search for running LLMs on private datasets. That’s where RAG comes in, with vector data. We added vector search ourselves, and we’re already seeing a lot of interest. In January, we’ll be going live with the general availability of our RAG and vector store. Alan: So in essence, they could use ScyllaDB as their vector database. They’re creating small language models or RAG. That’s got to be big…that’s fantastic. Dor: Our vector search is the most scalable. We can easily run models with a billion objects. Very few vendors can even reach a billion. We can do that while handling hundreds of thousands of requests per second, so we scale to very high numbers. And for people with lower or medium demand – which is most users, with models around 10 million or 100 million objects – we can deliver the best latency at very low price points. Alan: That’s fantastic. Look, there are a lot of people saying we’ve scraped everything there is to scrape for these LLMs. That continuing to make generative AI better by just increasing model size or training data is starting to hit diminishing returns. The thinking is that the way forward might be smaller language models, more RAG. Some people even argue we should move away from ML altogether and toward things like world models. But I definitely believe there’s going to be a lot of activity in the SLM and RAG space. 
And beyond that, as we build AI for specific use cases, I don’t need the whole internet. I just need the data that matters for that use case – especially if it’s my own proprietary information. I don’t want to put that out there. I want it right here. So I think that’s a huge business. Congratulations. Dor: Thanks. It’s market demand. It’s not just an opportunity, it’s also a defensive move. If we don’t do it, customers will go elsewhere, to be frank. People now expect the same ease of use they get from LLMs on public internet data when they come to any vendor. They want to ask questions in free text, in a single line, and immediately get the best results – without digging through a complicated UI. That’s the power of LLMs. And sometimes it won’t even be people doing that. It could be agents that come in, automate things, and issue those queries on their behalf.

True Database Elasticity: Scaling Out and In, Fast

Alan: All right, let’s fast-forward past AWS for a second. You have some new announcements coming soon. Share a bit, if you don’t mind. Dor: Thank you for the opportunity. We’re moving from beta to general availability with ScyllaDB X Cloud, our managed platform. This is the new generation of our core database, delivered as a database-as-a-service, including management and consumption. The unique thing here is our new core architecture, which we call tablets. It’s way more elastic than any other database – or even infrastructure – out there. Before this, we were okay in terms of scaling clusters out and scaling them back in. We were about average. But there was demand to do it much faster. And frankly, we also compete with DynamoDB. We’re API-compatible with DynamoDB. Up until now, DynamoDB has been the best in the industry at scaling up and down quickly. If your workload changes throughout the day, you don’t want to pay for peak capacity all the time. You want the system to follow usage dynamically. That’s exactly what X Cloud is. With tablets, we break a very large database – say, a petabyte of data – into five-gigabyte chunks. We can move those chunks around super quickly. That allows us to scale extremely fast. We can increase capacity by four times in about ten minutes. For example, you can go from 500,000 operations per second to two million operations per second in ten minutes. Alan: And back to 500K? Dor: That’s right. Alan: Sometimes with these things, it’s like blowing up a balloon. You know what I mean? It never really goes back to the size it was before you blew it up. Dor: So with this, we can also go back and shrink. It’s complicated, but it works. User workloads come and go – whether it’s Black Friday or just daily patterns. That leads to big TCO improvements and usability improvements. It’s also pretty unique. We have a shard-per-core engine. So if you have a machine with 32 cores, you’ll have 32 independent threads in the server. If you have a 64-core machine, you’ll have 64 threads, and it will perform twice as well. Now, let’s say you have a 64-core machine, but you actually need 66 threads. If you only had 64, would you buy another 64-core machine? That’s expensive. Instead, we can mix and match. You can run a 64-core machine together with a small two-vCPU machine side by side. Because of the flexibility of our sharding model, we can combine the two. I haven’t seen any other vendor that can do that. What the user gets is efficiency. They have exactly what they need, without having to buy oversized, expensive servers.
Alan: Really, what we’re talking about here is almost a FinOps play. I think that’s where we are now, especially with cloud usage. Look, we’re talking about spending five trillion dollars on data center AI factories. But when I talk to people, what they actually say is, I want to get control of my cloud bill. They want to be more efficient in how they use these resources. That’s why I made the balloon joke– that’s pretty much how the cloud works. It never seems to go back down. People want insight. They want to be able to turn the dial. And they want to ask, how can I do this more efficiently? Dor: Most databases aren’t that loaded. I’m not talking about spikes. I’m talking about normal daily usage, or overnight usage. Often it’s only 10% or 20% utilized – but you’re paying for the entire thing. Alan: That was always the promise of the cloud – that elasticity would go up and down. In practice, it mostly just went up. Tiered Storage at ScyllaDB Alan: So, what else is new at ScyllaDB? Dor: We’re also working on things like tiered storage and other technologies to reduce the bill. Normally, we use NVMe for fast storage and performance. It’s also relatively cheap compared to other high-performance storage options. But S3 is cheaper. The problem with S3 is latency. It can be 50 milliseconds, 100 milliseconds, which is prohibitive for many workloads. With tiered storage, we can keep the hot data on fast NVMe and automatically move cold data to S3. That lets us come up with a good solution for common use cases. For example, you might want to keep 30 days of data in ScyllaDB on NVMe, but keep a year of data overall – and still access it through the same API, without having to build a separate access path. That gives users a single API and a very cost-effective solution. Learn more about what’s next for ScyllaDB at Monster SCALE Summit — free and virtual.

From Batch to Real-Time: How MoEngage Achieved Millisecond Personalization with ScyllaDB

How a leading customer engagement platform handles 250K writes per second at 1ms p99 latency with 200TB+ data At MoEngage, our mission as a leading customer engagement platform is to help marketers build deep, lasting relationships with their users by processing hundreds of billions of events each month. Initially, our data architecture was built on a solid foundation of Amazon S3 for large-scale batch analytics and Elasticsearch for search. This dual system was effective for historical segmentation but began to buckle under the modern demand for instantaneous personalization. Our clients needed to react to user actions not in minutes, but in milliseconds. For example, sending a notification based on a product just viewed or updating a user’s segment the moment they qualify for a new campaign. This shift exposed the fundamental limitations of our architecture. Querying a single user’s recent activity in S3 was prohibitively slow, requiring massive dataset scans. At the same time, our write-heavy workload overwhelmed Elasticsearch, creating performance bottlenecks and significant operational overhead from indexing and sharding. It became clear that we couldn’t just optimize our way to real-time. We needed a new, purpose-built system designed from the ground up for high-throughput ingestion and low latency queries. Editor’s note: Karthik and Atish Andhare will be sharing their experiences at Monster SCALE Summit, a free + virtual conference on extreme scale engineering. Learn more and access passes here.   Envisioning the Eventstore Our solution was to build the Eventstore – a system that can store all the user actions. It doesn’t replace our vast S3 data lake; think of it more like a high-speed, short-term memory for all user actions and events. Its sole purpose is to handle recent user activity, absorbing the constant firehose of incoming events while allowing us to instantly pull up any single user’s complete activity timeline from the last 30, 60, or 90 days. This new real-time backbone lets us see a user’s entire recent journey in milliseconds, a capability that was completely out of reach with our old architecture. With this new real-time backbone in place, we could finally unlock a class of product capabilities that our customers were demanding, moving from theoretical concepts to tangible features. The Eventstore directly powers: Instantaneous Segmentation: Instead of waiting hours for a segment to update, users are added or removed the moment their behavior meets specific criteria. This ensures communications are always sent to the right audience at exactly the right time. True Real-Time Triggers: Campaigns can be initiated the instant a user performs a key action, such as abandoning a cart or completing a purchase. This eliminates the “lag” that made triggered messages feel disconnected from the user’s immediate context. Hyper-Personalization at the Edge: We can now personalize messages using attributes from a user’s very last action. This allows for powerful use cases like including the “last product viewed” in an email, recommending content based on the “last article read,” or personalizing web content based on the “last item added to cart.” Live User Activity Feeds: Our platform’s user profile dashboard, which once showed a delayed activity history, can now display a live, up-to-the-millisecond feed of every action a user takes, giving marketers a true real-time view of their customers. 
Choosing Our Engine

Our requirements were ambitious and non-negotiable: the system had to handle a write-heavy workload of at least 250,000 events per second with an average latency of 1ms and a p99 of under 10ms, and it needed to scale horizontally without any performance degradation. We evaluated several distributed databases, but ScyllaDB quickly emerged as the clear frontrunner. Its architecture, a C++ rewrite of Cassandra, is engineered for raw performance, promising to harness the full power of modern hardware and deliver the predictable, ultra-low latencies we required. Also, a few of us had extensive experience working with Cassandra, which made it easier to understand ScyllaDB. The ability to add nodes seamlessly to handle increasing load was the final piece of the puzzle, giving us the confidence that this was a solution that wouldn’t just meet our immediate needs, but would grow with us for years to come.

Handling Multi-Tenancy

As an open platform, MoEngage serves a diverse customer base, from companies trialing our product to large enterprises with varying performance and service-level agreement (SLA) expectations. This reality meant that a one-size-fits-all approach to data storage was not viable. We could not house all customer data in a single, massive cluster, as this would risk performance degradation from “noisy neighbors” and fail to meet the distinct needs of our clients. Our multi-tenancy strategy, therefore, had to be built around workload isolation from the ground up.

Our decision was heavily influenced by two core ScyllaDB design principles. First, ScyllaDB recommends having one large table per cluster rather than many small ones to reduce metadata management overhead. Second, and more critically, it is a best practice to configure data retention with a Time-To-Live (TTL) at the table level, not at the cell level. Since our customers require different retention periods (15, 30, or 60 days), managing this at the row level within a single table would create significant overhead on compaction and tombstone management.

Based on these constraints, we chose a strategy of physical isolation using multiple, independent ScyllaDB clusters. This approach allows us to group tenants logically based on their needs. For example:

All customers with the same retention policy (e.g., 30 days) are housed in the same cluster, allowing us to use a single, efficient table-level TTL.
Customers who require stricter, guaranteed SLAs can be isolated in their own dedicated cluster.
All MoEngage test accounts can be grouped into a single cluster to separate their non-production workloads.

This model provides the perfect balance, ensuring that the workload of one tenant group does not impact another while aligning perfectly with ScyllaDB’s operational best practices for performance and data management.

Working with ScyllaDB Open Source

Our Large Partition Problem

One of the most critical challenges in designing our schema was avoiding the “large partition” anti-pattern in ScyllaDB. While our experiments showed that large partitions don’t significantly penalize write performance, they have a significant impact on reads and compactions. We have a use case where reads cannot take advantage of the clustering key ordering and hence have to fetch the entire partition and perform the filtering on the client side. In such cases, querying a large partition causes ScyllaDB to fetch the data from disk (if not in a memtable), decompress it, load it into memory and then return the result.
This creates significant latency overhead and puts unnecessary pressure on the cluster. With ScyllaDB’s official recommendation to keep partitions under 100MB, we knew that a naive partition key like `(user_id, tenant_id)` would be a recipe for disaster, as highly active users could easily generate gigabytes of event data over their retention period.

To solve this, our schema design focused on proactively breaking up potentially large partitions into smaller, consistently sized buckets. Our analysis showed an average event row size of about 1KB, meaning our 100MB target partition size could comfortably hold around 100,000 events. A simple calculation revealed that for a 30-day retention period, any user generating more than roughly 3,300 events per day would exceed this limit. To prevent this, we introduced a `bucket_id` as a core component of our partition key. Our final partition key became a composite of `(uid, tenant, bid)`. The `bucket_id` acts as a splitting mechanism, breaking a single user’s long event history into multiple, smaller physical partitions. For example, a bucket could represent a day or a week of activity, ensuring no single partition grows indefinitely. This foresight was crucial because a table’s partition key cannot be changed after creation. By including the `bucket_id` in our initial schema, we built in the flexibility to define and refine our exact bucketing strategy over time, guaranteeing a healthy, performant cluster as our data scales.

Building for Resilience

From the very beginning, two principles were non-negotiable for the Eventstore: fault tolerance and zero data loss. The system had to withstand common infrastructure failures like node loss or disk corruption, and under no circumstances could we lose data that had been acknowledged with a success response. This commitment to durability shaped every decision we made about our cluster architecture, from data replication to physical topology.

To achieve this, we made a critical decision to use a Replication Factor (RF) of 3. This means that for every piece of data written, three copies (replicas) are stored on three different nodes in the cluster. With RF 3, we could enforce a write consistency level of `LOCAL_QUORUM`. This setting guarantees that a write operation is only considered successful after a majority of the replicas (two out of the three) have confirmed the write to disk. This simple but powerful mechanism is our guarantee against data loss; even if one node fails mid-write, the data is already safe on at least two other nodes.

Having three copies of the data is only half the battle; ensuring those copies are physically isolated is just as important. To protect against large-scale failures, we architected our clusters to be Availability Zone (AZ) aware. By leveraging ScyllaDB’s Ec2Snitch feature, we make the database aware of the underlying AWS infrastructure, treating each AWS AZ as a separate “rack.” With this configuration, combined with the NetworkTopologyStrategy replication strategy, ScyllaDB intelligently places each of the three data replicas in a different AZ. This strategy ensures that we can withstand the complete failure of an entire Availability Zone without any data loss or service interruption. While this architecture provides excellent high availability against common failures, we also planned for disaster recovery scenarios, such as losing a quorum of nodes or a full region-wide outage.
Since our chosen EC2 instances use ephemeral storage, our recovery strategy for these disaster scenarios is to quickly bootstrap a new cluster from a previous backup. For this, we leverage ScyllaDB's native backup capabilities and our application's ability to replay messages from Kafka. Our process involves taking regular snapshots, supplemented by a continuous stream of incremental backups. Any data written between the last incremental backup and the point of the outage is still available in Kafka; by simply replaying it from Kafka, we are able to fully restore the data. This combination ensures we can rebuild a cluster to a recent, consistent state, completing our comprehensive resilience strategy from minor hiccups to major outages.

Cluster Topology

Choosing the right database engine was only half the equation; building a resilient and performant Eventstore meant running it on the right hardware. Our workload is fundamentally I/O-bound, characterized by a relentless, high-throughput stream of writes. This reality guided our evaluation of EC2 instance types, where the choice between local NVMe storage and network-attached EBS volumes became the central decision point. After a thorough analysis, we followed ScyllaDB's strong recommendation and opted for storage-optimized i-series instances with local NVMe SSDs. While we considered memory-optimized instances with EBS, they proved unsuitable for our write-heavy needs. High-performance `io2` EBS volumes were prohibitively expensive at our scale, and more affordable `GP3` volumes could not guarantee the p99 latencies we required and introduced risks of throttling during traffic bursts. AWS's own guidance suggests EBS is better suited for read-heavy workloads, the exact opposite of our profile. Local NVMe storage, by contrast, delivers the sustained, sub-millisecond I/O performance essential for our ingestion pipeline. Specifically, we selected the i3en instance family, which provides an excellent balance of vCPU, RAM, and the large, fast storage capacity needed to meet our heavy data retention requirements.

Our approach to capacity planning is therefore not a one-time calculation but a dynamic process tied directly to our multi-tenant cluster strategy. The size and configuration of each physical cluster are determined by the specific workload of the tenants it houses. We carefully model capacity based on four key variables for each tenant:

1. The number of active users.
2. The average number of actions per user per day.
3. Their specific data retention policy (e.g., 15, 30, or 60 days).
4. The overall write and read traffic patterns.

This allows us to right-size each cluster for its intended workload, ensuring performance and cost-efficiency across our entire infrastructure.

Compaction Strategy

A critical factor in managing the total cost of ownership for a large-scale database is controlling disk space amplification. Open Source ScyllaDB's default Size-Tiered Compaction Strategy (STCS) requires keeping nearly 50% of disk space free for compaction operations, which would have effectively doubled our storage costs. We also experimented with the Leveled Compaction Strategy, but that too required an additional 50% of disk space during initial bootstrapping. While ScyllaDB Enterprise offers the highly efficient Incremental Compaction Strategy (ICS) that reduces this overhead to 20%, it comes with a significant license fee.
Our Operational Challenges

Cluster Management

Our initial capacity planning pointed us toward the i3en.3xlarge instance type (12 vCPUs, 96GB RAM, and a 7.5TB NVMe drive). To ensure low latency for our global customer base, we deployed one ScyllaDB cluster in each of our three primary AWS regions. In total, our footprint grew to approximately 50 nodes across these clusters. ScyllaDB provides region-specific, production-ready AMIs that simplify the deployment process. Our deployment workflow followed a structured path: provisioning nodes, configuration, security, and RBAC, followed by onboarding the cluster into our internal monitoring stack. Because ScyllaDB's AMIs are self-contained, scaling out theoretically meant launching a new node and letting it complete the automated bootstrap process.

Things ran smoothly until we encountered a surge in customer data within one of our regions. As disk utilization climbed and we were using STCS, we followed our runbook and added a new node to the cluster. However, this expansion revealed two critical operational hurdles.

First, during our POC, a 4TB node bootstrap typically took 18 hours using vNodes. In the live production environment, this window stretched significantly: bootstrapping took anywhere from 24 to 36 hours. In a high-growth environment, a 1.5-day lead time for scaling is a lifetime.

This was followed by another issue: an on-call engineer noticed that disk usage on the newly joining node was hitting 90%. This was counterintuitive—why was a joining node, which hadn't even finished taking its share of the data, running out of space? Our investigation revealed that it was caused by RESHAPE compactions. When a new node joins, ScyllaDB reshapes the data to fit the new shard distribution. This process creates temporary data overhead. After researching similar issues reported in the community, we identified a temporary fix to get our node back to service:

1. Allow the node to initiate the bootstrap.
2. The moment the RESHAPE compaction begins, manually pause it.
3. Let the node finish joining the ring to provide immediate capacity.

ICS

Our initial experiences led us to a conservative rule of thumb, where we felt safe onboarding new nodes when disk usage on the existing nodes reached between 40% and 45%. This buffer was a technical necessity to accommodate a 2.4x worst-case space amplification during RESHAPE compactions while bootstrapping. We experienced a glimmer of hope when we discovered ScyllaDB's Incremental Compaction Strategy (ICS). After discussing ICS with the ScyllaDB team, we realized we were looking at our space amplification issues through an outdated lens. The technical shift offered by ICS is profound because it utilizes a default fragment size of 1GB, meaning a single compaction job typically requires a maximum of only 2GB of disk overhead. To put that into perspective for our specific setup, the old methodology required nearly 50% of free space on a 6.9TB node to handle heavy compaction cycles safely. Under ICS, that same 6.9TB node with 12 shards would only experience roughly 110GB of overhead during compactions. This shift creates massive headroom, allowing us to move away from capping nodes at 45% capacity and to safely utilize over 80% of the disk. By drastically minimizing space amplification, ICS has effectively doubled our storage efficiency without compromising performance during critical node operations.
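For reference, switching an existing table to ICS is a single schema change. A minimal sketch, assuming ScyllaDB Enterprise (where ICS is available) and an illustrative keyspace and table name; the 1GB default fragment size described above applies unless overridden:

-- Move an existing table to the Incremental Compaction Strategy
ALTER TABLE eventstore.events
    WITH compaction = {'class': 'IncrementalCompactionStrategy'};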
Our Move to ScyllaDB Enterprise

Our journey toward ScyllaDB Enterprise began with a rigorous Proof of Concept designed to validate three core pillars: performance, reliability, and operational efficiency. We needed to ensure that the Enterprise edition could not only handle our existing throughput but also provide an edge in performance and cluster operations. To validate these objectives without risking production stability, we deployed a parallel ScyllaDB Enterprise cluster. This environment supported dual writes, allowing us to mirror data from our existing Open Source Software (OSS) cluster to the new Enterprise setup in real time. This side-by-side comparison was instrumental in proving the superiority of the new architecture.

The most significant architectural shift involved moving to i3en.6xlarge nodes. These powerful instances, equipped with 24 vCPUs, 192GB of memory, and two 7.5TB NVMe drives, allowed us to dramatically consolidate our infrastructure. By leveraging these denser nodes, we were able to shrink our total node count to just one-third of the original OSS cluster size, significantly reducing the complexity of our distributed network. Alongside this hardware upgrade, we transitioned our tables to the Incremental Compaction Strategy (ICS) to better manage disk space. Following a successful “soak-in” period where the Enterprise cluster met all performance benchmarks, we executed a structured four-step migration strategy. We first upgraded our OSS environment to ScyllaDB Enterprise 2024.1, followed by the systematic migration of tables to the ICS format. Once the tables were optimized, we began the process of downsizing the legacy OSS cluster and finalized the transition by onboarding the entire environment to ScyllaDB Manager for centralized management and automated maintenance.

Lessons Learned

Schema Design Is Paramount

The most important aspect of using ScyllaDB is getting your schema right. It's not just the data model but aspects like RF, TTL, partition size, and compaction strategy that dictate how ScyllaDB performs in production.

Adding Nodes & Removing Nodes Take Longer

As your data size grows, the process of adding and removing nodes becomes a lot slower with ScyllaDB's legacy vNode-based replication. Make sure you are monitoring everything and plan for these activities ahead of time. One thing we learned is that, while these operations are slower, they don't significantly impact query/write latencies during these maintenance activities.

POC != Production

No matter how hard you try to anticipate and simulate issues in a POC, your production system will always surprise you.

Our Next Steps with ScyllaDB

Our journey with the Eventstore has fundamentally transformed our real-time capabilities, but we're always looking ahead to the next evolution of our architecture. One of the most exciting developments on our roadmap involves leveraging a powerful new feature in the latest versions of ScyllaDB: tablets. While our multi-cluster topology provides excellent isolation, it still requires us to plan capacity for peak workloads. In a multi-tenant world, traffic can be unpredictable. A single customer launching a wildly successful campaign can create a sudden performance hotspot on specific sets of nodes, even if the rest of the cluster has ample storage and spare compute capacity. Manually rebalancing or adding nodes to handle these temporary spikes is a significant operational challenge. This is where tablets change the game.
By breaking down the token ring into smaller, movable units of data, tablets decouple data partitions from specific physical nodes. Instead of a partition being permanently owned by a set of nodes, a tablet can be automatically moved to a different node to balance the load in real-time. For us, this unlocks the holy grail of database management: true elastic scaling. When a traffic hotspot emerges, ScyllaDB can automatically rebalance the cluster by shifting tablets away from overloaded nodes to those with spare capacity. This will allow us to absorb sudden traffic surges with grace, ensuring consistent performance for all tenants without manual intervention or costly overprovisioning. It’s the key to providing on-demand compute for our customers’ biggest moments, ensuring the Eventstore remains a robust and highly elastic foundation for the future of real-time engagement at MoEngage.

Exploring the key features of Cassandra® 5.0

Apache Cassandra has become one of the most broadly adopted distributed databases for large-scale, highly available applications since its launch as an open source project in 2008. The 5.0 release in September 2024 represents the most substantial advancement to the project since 4.0, which was released in July 2021. Multiple customers (and our own internal Cassandra use case) have now been happily running on Cassandra 5 for up to 12 months, so we thought the time was right to explore the key features they are leveraging to power their modern applications.

An overview of new features in Apache Cassandra 5.0

Apache Cassandra 5.0 introduces core capabilities aimed at AI-driven systems, low-latency analytical workloads, and environments that blend operational and analytical processing. 

Highlights include: 

  • The new vector data type and an Approximate Nearest Neighbor (ANN) index based on Hierarchical Navigable Small World (HNSW), which is integrated into the Storage-Attached Index (SAI) architecture
  • Trie-based memtables and the Big Trie-Index (BTI) SSTable format, delivering better memory efficiency and more consistent write performance
  • The Unified Compaction Strategy, a tunable density-based approach that can align with leveled or tiered compaction patterns. 

Additional enhancements include expanded mathematical CQL functions, dynamic data masking, and experimental support for Java 17.

At NetApp, Apache Cassandra 5.0 is fully supported, and we are actively assisting customers as they transition from 4.x.

A deeper look at Cassandra 5.0’s key features

Storage-Attached Indexes (SAI)

Storage-Attached Indexes bring a modern, storage-integrated approach to secondary indexing in Apache Cassandra, resolving many of the scalability and maintenance challenges associated with earlier index implementations. Legacy Secondary Indexes (2i) and SASI remain available, but SAI offers a more robust and predictable indexing model for a broad range of production workloads.

SAI operates per SSTable, keeping index data local to the data it covers rather than relying on the cluster-wide coordination required by other strategies. This model supports diverse CQL data types, enables efficient numeric and text range filters, and provides more consistent performance characteristics than 2i or SASI. The same storage-attached foundation is also used for Cassandra 5’s vector indexing mechanism, allowing ANN search to operate within the same storage and query framework.

SAI supports combining filters across multiple indexed columns and works seamlessly with token-aware routing to reduce unnecessary coordinator work. Public evaluations and community testing have shown faster index builds, more predictable read paths, and improved disk utilization compared with previous index formats.

Operationally, SAI functions as part of the storage engine itself: indexes are defined using standard CQL statements and are maintained automatically during flush and compaction, with no cluster-wide rebuilds required. This provides more flexible query options and can simplify application designs that previously relied on manual denormalization or external indexing systems.
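As a concrete illustration of that workflow, here is a brief sketch of creating and querying a SAI index with standard CQL. The table and column names are invented for the example; the `USING 'sai'` clause is the Cassandra 5.0 form of index creation:

-- Index an existing column with SAI (illustrative table and columns)
CREATE INDEX orders_total_idx ON store.orders (total) USING 'sai';

-- Equality and numeric range filters can then be served by the index
SELECT order_id, total FROM store.orders WHERE total >= 100 AND total < 500;

No external system or manual rebuild is involved: the index is populated when it is created and then maintained as part of normal flush and compaction.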

Native Vector Search capabilities

Apache Cassandra 5.0 introduces native support for high-dimensional vector embeddings through the new vector data type. Embeddings represent semantic information in numerical form, enabling similarity search to be performed directly within the database. The vector type is integrated with the database’s storage-attached index architecture, which uses HNSW graphs to efficiently support ANN search across cosine, Euclidean, and dot-product similarity metrics.

With vector search implemented at the storage layer, Cassandra can support applications involving semantic matching, content discovery, and retrieval-oriented workflows while maintaining the system’s established scalability and fault-tolerance characteristics.

After upgrading to 5.0, existing schemas can add vector columns and store embeddings through standard write operations. For example:

UPDATE products SET embedding = [0.1, 0.2, 0.3, 0.4, 0.5] WHERE id = <id>;

To create a new table with a vector type column:

CREATE TABLE items (
    product_id UUID PRIMARY KEY,
    embedding VECTOR<FLOAT, 768>  // 768 denotes dimensionality
);

Because vector indexes are attached to SSTables, they participate automatically in the compaction and repair processes and do not require an external indexing system. ANN queries can be combined with regular CQL filters, allowing similarity searches and metadata conditions to be evaluated within a unified distributed query workflow. This brings vector retrieval into Apache Cassandra’s native consistency, replication, and storage model.
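As a sketch of what that looks like in practice for the items table above (the index name is illustrative, and the query vector is shown as a placeholder because it must contain all 768 components):

-- Create a storage-attached index on the vector column
CREATE INDEX items_embedding_idx ON items (embedding) USING 'sai';

-- Return the 5 most similar items to a query embedding
SELECT product_id
FROM items
ORDER BY embedding ANN OF [<768 float values>]
LIMIT 5;

ANN ordering can also be combined with a WHERE clause on other SAI-indexed columns, which is how similarity search and metadata filters are evaluated within a single query.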

Unified Compaction Strategy (UCS)

The Unified Compaction Strategy in Apache Cassandra 5 introduces a density-aware approach to organizing SSTables that blends the strengths of the Leveled Compaction Strategy (LCS) and the Size-Tiered Compaction Strategy (STCS). UCS aims to provide the predictable read amplification associated with LCS and the write efficiency of STCS, without many of the workload-specific drawbacks that previously made compaction selection difficult. Choosing an unsuitable compaction strategy in earlier releases could lead to operational complexity and long-term performance issues, which UCS is designed to mitigate.

UCS exposes a set of tunable parameters like density thresholds and per-level scaling that let operators adjust compaction behavior toward read-heavy, write-heavy, or time-series patterns. This flexibility also helps smooth the transition from existing strategies, as UCS can adopt and improve the current SSTable layout without requiring a full rewrite in most cases. The introduction of compaction shards further increases parallelism and reduces the impact of large compactions on cluster performance.
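As a rough sketch of that tuning surface, switching a table to UCS and nudging it toward a tiered, write-friendly layout might look something like the following. The table name and parameter value are illustrative; consult the UCS documentation for the full option set before applying it:

-- Illustrative: UCS with tiered-style scaling ('T' values lean toward STCS behavior, 'L' values toward LCS)
ALTER TABLE store.orders
    WITH compaction = {
        'class': 'UnifiedCompactionStrategy',
        'scaling_parameters': 'T4'
    };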

Although LCS and STCS remain available (and while STCS remains the default strategy in 5.0, UCS is the default on newly deployed NetApp Instaclustr managed Apache Cassandra 5 clusters), UCS supports a broader range of workloads, reduces the operational burden of compaction tuning, and aligns well with other storage engine improvements in Apache Cassandra 5 such as trie-based SSTables and Storage-Attached Indexes.

Trie Memtables and Trie-Indexed SSTables

Trie Memtables and Trie-indexed SSTables (Big Trie-Index, BTI) are significant storage engine enhancements released in Apache Cassandra 5. They are designed to reduce memory overhead, improve lookup performance, and increase flush efficiency. A trie data structure stores keys by shared prefixes instead of repeatedly storing full keys, which lowers object count and improves CPU cache locality compared with the legacy skip-list memtable structure. These benefits are particularly visible in high-ingestion, IoT, and time-series workloads.

Skip-list memtables store full keys for every entry, which can lead to large heap usage and increased garbage collection activity under heavy write loads. Trie Memtables substantially reduce this overhead by compacting key storage and avoiding pointer-heavy layouts. On disk, the BTI SSTable format replaces the older BIG index with a trie-based partition index that removes redundant key material and reduces the number of key comparisons needed during partition lookups.

Using Trie Memtables requires enabling both the trie-based memtable implementation and the BTI SSTable format. Existing BIG SSTables are converted to BTI through normal compaction or by rebuilding data. On NetApp Instaclustr’s managed Apache Cassandra clusters, Trie Memtables and BTI are enabled by default, but when upgrading from an earlier major version to 5.0, data must be converted from BIG to BTI before the trie structures can be used.
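On the table side, the opt-in is a standard CQL property. A minimal sketch, assuming a memtable configuration named 'trie' has already been defined in cassandra.yaml and using an illustrative table name:

-- Point an existing table at the trie-based memtable configuration defined in cassandra.yaml
ALTER TABLE store.orders WITH memtable = 'trie';

The SSTable format itself (BIG vs BTI) is selected at the node level in cassandra.yaml rather than per table, which is why the conversion from BIG to BTI happens through compaction or rebuilds rather than a schema change.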

Other new features

Mathematical CQL functions

Apache Cassandra 5.0 added a rich set of math functions, allowing developers to perform computations directly within queries. This reduces data transfer overhead and client-side post-processing, among many other benefits. From fundamental functions like ABS(), ROUND(), or SQRT() to more complex operations like SIN(), COS(), and TAN(), these math functions apply across a multitude of domains, from financial data and scientific measurements to spatial data.
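For instance, a simple sketch using some of the functions named above (the table and column names are made up for the example):

-- Compute derived values server-side instead of shipping raw rows to the client
SELECT sensor_id,
       abs(delta)      AS absolute_change,
       round(reading)  AS rounded_reading,
       sqrt(variance)  AS standard_deviation
FROM telemetry.readings
WHERE sensor_id = 42;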

Dynamic Data Masking

Dynamic Data Masking (DDM) is a new feature that obscures sensitive column-level data at query time; the masking can also be attached permanently to a column definition so that the data is always returned obfuscated. Stored data values are not altered in this process, and administrators can use role-based access control (RBAC) to determine who can see the original data and to tune the visibility of the obscured values. This feature helps with adherence to data privacy regulations such as GDPR, HIPAA, and PCI DSS without needing external redaction systems.
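A brief sketch of the column-attached form, using an illustrative schema and role name (mask_inner leaves the outermost characters visible and obscures the rest; the UNMASK permission lets trusted roles read the original values):

-- Attach a mask at column definition time
CREATE TABLE customers (
    id    uuid PRIMARY KEY,
    name  text,
    email text MASKED WITH mask_inner(2, 2)
);

-- Allow a specific role to read the unmasked values
GRANT UNMASK ON TABLE customers TO support_role;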

Conclusion

Apache Cassandra 5.0 packs a punch with game-changing features that meet the needs of modern workloads and applications. Features like vector search and Storage-Attached Indexes stand out, as they will inevitably shape how data can be leveraged within the same database while maintaining speed, scale, and resilience.

When you deploy a managed cluster on NetApp Instaclustr’s Managed Platform, you get the benefits of all these amazing features without worrying about configuration and maintenance.

Ready to experience the power of Apache Cassandra 5.0 for yourself? Try it free for 30 days today!


Scaling Performance Comparison: ScyllaDB Tablets vs Cassandra vNodes

Benchmarks show ScyllaDB tablet-based scaling 7.2× faster than Cassandra’s vNode-based scaling (9× with cleanup), sustaining ~3.5X higher throughput with fewer errors

Real-world database deployments rarely experience steady traffic. Systems need sufficient headroom to absorb short bursts, perform maintenance safely, and survive unexpected spikes. At the same time, permanently sizing for peak load is wasteful. Elasticity lets you handle fluctuations without running an overprovisioned cluster. Increase capacity just-in-time when needed, then scale back as soon as the peak passes.

When we built ScyllaDB just over a decade ago, it scaled fast enough for user needs at the time. However, deployments grew larger and nodes stored far more data per vCPU. Streaming took longer, especially on complex schemas that required heavy CPU work to serialize and deserialize data. The leaderless design forced operators to serialize topology changes, preventing parallel bootstraps or decommissions. And static (vNode-based) token assignments also meant data couldn’t be moved dynamically once a node was added.

ScyllaDB’s recent move to tablet-based data distribution was designed to address those elasticity constraints. ScyllaDB now organizes data into independent tablets that dynamically split or merge as data grows or shrinks. Instead of being fixed to static ranges, tablets are load balanced transparently in the background to maintain optimal distribution. Clusters scale quickly with demand, so teams don’t need to overprovision ahead of time. If load increases, multiple nodes can be bootstrapped in parallel and start serving traffic almost immediately. Tablets rebalance in small increments, letting teams safely use up to ~90% of available storage. This means less wasted storage. The goal of this design is to make data movement more granular and reduce the serialized steps that constrained vNode-based scaling.

To understand the impact of this design shift, we evaluated how both ScyllaDB (now using tablets) and Cassandra (still using vNodes) compare when they must increase capacity under active traffic. The goal was to observe scale-out under realistic conditions: workloads running, caches warm, and topology changes occurring mid-operation. By expanding both clusters step by step, we captured how quickly capacity came online, how much the running workload was affected, and how each system performed after each expansion.

Before we go deeper into the details, here are the key findings from the tests:

  • Bootstrap operations: ScyllaDB completed capacity expansion 7.2X faster than Cassandra
  • Total scaling time: When including Cassandra’s required cleanup operations (which can be performed during maintenance windows), the time difference reaches 9X
  • Throughput while scaling: ScyllaDB sustained ~3.5X more traffic during these scaling operations
  • Stability under load: ScyllaDB had far fewer errors and timeouts during scaling, even at higher traffic levels

Why Fast Scaling Matters

Most real-world database deployments are overprovisioned to some extent. The extra capacity helps sustain traffic fluctuations and short-lived bursts. It also supports routine maintenance tasks, like applying security patches, rolling out infrastructure maintenance, or recovering from replica failures. Another important consideration in real-world deployments is that benchmark reports often overlook traffic variability over time.
In practice, only a subset of workloads consistently demand high baseline throughput with low variability from their peak needs. Most workloads follow a cyclical pattern, with daily peaks during active hours and significantly lower baseline traffic during off-hours.

[Figure: A diurnal workload example, ranging between 50K and 250K operations per second in a day]

Fast scaling is also critical for handling unexpected events, such as viral traffic spikes, flash loads, backlog drains after cascading failures, or sudden pressure from upstream systems. It’s especially valuable when traffic has large peak-to-baseline swings, capacity needs to shift often, responses to load must be quick, or costs depend on scaling back down immediately after a surge.

Comparing Tablets vs vNodes

Fast scaling is ultimately a data distribution problem, and Cassandra’s vNodes and ScyllaDB’s tablets handle that distribution in distinctly different ways. Here’s more detail on the differences we previewed earlier.

Apache Cassandra

Apache Cassandra follows a token ring architecture. When a node joins the cluster, it is assigned a number of tokens (the default is 16), each representing a portion of the token ring. The node becomes responsible for the data whose partition keys fall within its assigned token ranges. During node bootstrap, existing replicas stream the relevant data to the new replica based on its token ownership. Conversely, when a node is removed, the process is reversed. Cassandra generally recommends avoiding concurrent topology changes; in practice, many operators add/remove nodes serially to reduce risk during range movements.

Digression: In reality, topology changes in an Apache Cassandra cluster are plain unsafe. We explained the reasons in a previous blog, and pointed out that even its community acknowledged some of its design flaws.

In addition to the administrative overhead involved in scaling a Cassandra cluster, there are other considerations. Adding nodes with higher CPU and memory is not straightforward. It typically requires a new tuning round and manually assigning a higher weight (increasing the number of tokens) to better match capacity. After bootstrap operations, Cassandra requires an intermediary step (cleanup) for older replicas in order to free up disk space and eliminate the risk of data resurrection. Lastly, multiple scaling rounds introduce significant streaming overhead since data is continuously shuffled across the cluster.

[Figure: Cassandra Token Ring]

ScyllaDB

ScyllaDB introduced tablets starting with the 2024.2 release. Tablets are the smallest unit of replication in ScyllaDB and can be migrated independently across the cluster. Each table is dynamically split into tablets based on its size, with each tablet being assigned to a subset of replicas. In effect, tablets are smaller, manageable fragments of a table. As the topology evolves, tablet state transitions are triggered. A global load balancer balances tablets across the cluster, accounting for heterogeneity in node capacity (e.g., assigning more tablets to replicas with greater resources). Under the hood, Raft provides the underlying consensus mechanism that serializes tablet transitions in a way that avoids conflicting topology changes and ensures correctness. The load balancer is hosted on a single node, but not a designated node. If that node crashes or goes down for maintenance, the load balancer will start on another node. Raft and tablets effectively decouple topology changes from streaming operations.
Users can orchestrate topology changes in parallel with minimal administrative overhead. ScyllaDB does not require a post-bootstrap cleanup phase. That allows for immediate request serving and more efficient data movement across the network.

[Figure: Visual representation of tablets state transitions]

Adding Nodes

Starting with a 3-node cluster, we ran our “real-life” mixed workload targeting 70% of each database’s inferred total capacity. Before any scaling activity, both ScyllaDB and Cassandra were warmed up to ensure disk and cache activity were in effect. Note: Configuration details are provided in the Appendix.

We then started the mixed workload and let it run for another 30 minutes to establish a performance baseline. At this point, we bootstrapped 3 additional nodes, expanding the cluster to 6 nodes. We then allowed the workload to run for an additional 30 minutes to observe the effects of this first scaling step. We increased traffic proportionally. After sustaining it for another 30 minutes, we bootstrapped 3 more nodes, bringing each cluster to a total of 9 nodes. Finally, we increased traffic one last time to ensure each database could sustain its anticipated traffic. Note: See the Appendix for details on the test setup and our Cassandra tuning work.

The following table shows the target throughput used during and after each scaling step along with each cluster’s inferred maximum capacity:

| Nodes | ScyllaDB | Cassandra |
|---|---|---|
| 3 (baseline) | 196K ops/sec (Max 280K) | 56K ops/sec (Max 80K) |
| 6 | 392K ops/sec (Max 560K) | 112K ops/sec (Max 160K) |
| 9 | 672K ops/sec (Max 840K) | 168K ops/sec (Max 240K) |

We conducted this scaling exercise twice for each database, introducing a minor variation in each run. For ScyllaDB, we bootstrapped all 6 additional nodes in parallel. For Cassandra, we enabled both the Key Cache and Row Cache, as we observed it performed better overall under our initial performance results.

[Figure: Comparison of different scaling approaches]

At first glance, it might look like ScyllaDB offers only a modest improvement over Cassandra (somewhere between 1.25X and 3.6X faster). But there are deeper nuances to consider.

Resiliency

In both of our Cassandra benchmarks, we observed a high rate of errors, including frequent timeouts and OverloadedExceptions reported by the server. Notably, our client was configured with an exponential backoff, allowing up to 10 retries per operation. In this environment, both Cassandra configurations showed elevated error rates under sustained load during scaling. The following table summarizes the number of errors observed by the client during the tests:

| Kind | Step | Throughput | Retries |
|---|---|---|---|
| Cassandra 5.0 – Page Cache | 3 → 6 nodes | 56K ops/sec | 2010 |
| Cassandra 5.0 – Page Cache | 6 → 9 nodes | 112K ops/sec | 0 |
| Cassandra 5.0 – Row & Key Cache | 3 → 6 nodes | 56K ops/sec | 5004 |
| Cassandra 5.0 – Row & Key Cache | 6 → 9 nodes | 112K ops/sec | 8779 |

With the sole exception of scaling from 6 to 9 nodes in the Page Cache scenario, all other Cassandra scaling exercises resulted in noticeable traffic disruption, even while handling 3.5X less traffic than ScyllaDB. In particular, the “Row & Key Cache” configuration proved itself unable to sustain prolonged traffic, ultimately forcing us to terminate that test prematurely.

Performance

The earlier comparison chart also highlights the cost of repeated streaming across incremental expansion steps.
Although bootstrap duration is governed by the volume of data being streamed and decreases as more nodes are added, each scaling operation redundantly re-streams data that was already redistributed in prior steps. This introduces significant overhead, compounding both the time and performance cost of scaling operations. As demonstrated, scaling directly from 3 to 9 nodes using ScyllaDB tablets eliminates the intermediary incremental redistribution overhead. By avoiding redundant streaming at each intermediate step, the system performs a single, targeted redistribution of tablets, resulting in a significantly faster and more efficient bootstrap process.

[Figure: ScyllaDB tablet streaming from 3 to 9 nodes]

After the scale-out operations completed, we ran the following load tests to assess each database’s ability to withstand increased traffic:

  • For ScyllaDB, we increased traffic to 80% of its peak capacity (280 * 3 * 0.8 = 672 Kops)
  • For Cassandra, we increased traffic to 100% (240 Kops) and 125% (300 Kops) of its peak capacity to validate our starting assumptions

ScyllaDB sustains 672 Kops/sec with load (per vCPU) around 80% utilization, as expected.

[Figure: Apache Cassandra latency variability under different throughput rates (240K vs 300K ops/sec)]

Cassandra maintained its expected 240K peak traffic. However, it failed to sustain 300K over time, leading to increased pauses and errors. This outcome was anticipated since the test was designed to validate our initial baseline assumptions, not to achieve or demonstrate superlinear scaling.

Expectations

In our tests, ScyllaDB scaled faster and delivered greater improvements in latency and throughput at each step. That reduces the number of scaling operations required. The compounded benefits translate to significantly faster capacity expansion. In contrast, Cassandra’s scaling behavior is more incremental. The initial scale-out from 3 to 6 nodes took 24 minutes. The subsequent step from 6 to 9 nodes introduced additional overhead, requiring 16 minutes. From this observation, we empirically derived a formula to model the scaling factor per step:

16 = 24 × (0.5/1.0)^overhead

Solving for the exponent (overhead = log(16/24) / log(0.5) ≈ 0.58), we approximated the streaming overhead factor as 0.6. Using this, we constructed a practical formula to estimate Cassandra’s bootstrap duration at each scale step:

Bootstrap_time ≈ Base_time × (data_to_stream / data_per_node)^0.6

With these formulas, we can project the bootstrap times for subsequent scaling steps. Based on our earlier performance results (where Cassandra sustained approximately 80K ops/sec for every 3-node increase), 27 total nodes of Cassandra would be required to match the throughput achieved by ScyllaDB. The following table presents the estimated cumulative bootstrap times needed for Cassandra to reach ScyllaDB performance, using the previously derived formula and applying the 0.6 streaming overhead factor at each step:

| Nodes | Data to Stream | Bootstrap Time | Cumulative Time | Peak Capacity |
|---|---|---|---|---|
| 3 | 2.0TB | – | 0 min | 80K |
| 3 → 6 | 1.0TB | 24.0 min | 24.0 min | 160K |
| 6 → 9 | 0.67TB | 15.8 min | 39.8 min | 240K |
| 9 → 12 | 0.50TB | 12.4 min | 52.2 min | 320K |
| 12 → 15 | 0.40TB | 10.4 min | 62.6 min | 400K |
| 15 → 18 | 0.33TB | 9.0 min | 71.6 min | 480K |
| 18 → 21 | 0.29TB | 8.1 min | 79.7 min | 560K |
| 21 → 24 | 0.25TB | 7.3 min | 87.0 min | 640K |
| 24 → 27 | 0.22TB | 6.7 min | 93.7 min | 720K |

[Chart: Time to reach throughput capacity for bootstrap operations]

As the table and chart visually show, ScyllaDB responds to capacity needs 7.2X faster than Cassandra.
That’s before accounting for the added operational and maintenance overhead associated with the process.

Cleanup

Cleanup is a process to reclaim disk space after a scale-out operation takes place in Cassandra. As the Cassandra documentation states: “As a safety measure, Cassandra does not automatically remove data from nodes that ‘lose’ part of their token range due to a range movement operation (bootstrap, move, replace). (…) If you do not do this, the old data will still be counted against the load on that node.”

We estimated the following cleanup times after scaling to 9 nodes with unthrottled compactions:

Unlike topology changes, Cassandra cleanup operations can be executed in parallel across multiple replicas, rather than being serialized. The trade-off, however, is a temporary increase in compaction activity, something that may impact system performance throughout its execution. In practice, many users choose to run cleanup serially or per rack to minimize disruption to user-facing traffic. Despite its parallelizability, careful coordination is often preferred in production environments to minimize latency impact. The following table outlines the total time required under various cleanup strategies:

In conclusion, ScyllaDB scaled faster and sustained higher throughput during scale-out, and it removes cleanup from the scaling cycle. Even for users willing to accept the risk of running cleanup in parallel across all Cassandra nodes, ScyllaDB still offers a 9X faster capacity response time once the minimum required cleanup time is factored into Cassandra’s previously estimated bootstrap durations. These results reflect how both databases behave under one specific scaling pattern. Teams should benchmark against their own workload shapes and operational constraints to see how these architectural differences play out in their particular environment.

Parting Thoughts

We know readers are (rightfully) skeptical of vendor benchmarks. As discussed earlier, Cassandra and ScyllaDB rely on fundamentally different scaling models, which makes designing a perfect comparison inherently difficult. The scaling exercises demonstrated here were not designed to fully maximize ScyllaDB tablets’ potential. The test design actually favors Cassandra by focusing on symmetrical scaling; asymmetrical scaling scenarios would better highlight the advantage of tablets vs vNodes. Even with a design that favored Cassandra’s vNodes model, the results show the impact of tablets. ScyllaDB sustained 4X the throughput of Apache Cassandra while maintaining consistently lower P99 latencies under similar infrastructure. Interpreted differently, ScyllaDB delivers comparable performance to Cassandra using significantly smaller instances, which could then be scaled further by introducing larger, asymmetric nodes as needed. This approach (scaling from 3 small nodes to another 3 [much larger] nodes) optimizes infrastructure TCO and aligns naturally with ScyllaDB’s Tablets architecture. However, this would be far more difficult to achieve (and test) in Cassandra in practice. Also, the tests intentionally did not use large instances to avoid favoring ScyllaDB. ScyllaDB’s shard-per-core architecture is designed to scale linearly across large instances without requiring the extensive tuning cycles that are often encountered with Apache Cassandra. For example, a 3-node cluster running on the largest AWS Graviton4 instances can sustain over 4M operations per second.
When combined with Tablets, ScyllaDB deployments can scale from tens of thousands to millions of operations per second within minutes. Finally, remember that performance should be just one component in a team’s database evaluation. ScyllaDB offers numerous features beyond Cassandra (local and global indexes, materialized views, workload prioritization, per-query timeouts, an internal cache, and advanced dictionary-based compression, for example).

Appendix: How We Ran the Tests

Both ScyllaDB and Cassandra tests were carried out in AWS EC2 in an apples-to-apples scenario. We ran our tests on a 3-node cluster running on top of i4i.4xlarge instances placed under the same Cluster Placement Group to further reduce networking round-trips. Consequently, each node was placed on an artificial rack using the GossipingPropertyFileSnitch. As usual, all tests used LOCAL_QUORUM as the consistency level, a replication factor of 3, and NetworkTopologyStrategy as the replication strategy. To assess scalability under real-world traffic patterns, like Gaussian and other similar bell curve shapes, we measured the time required to bootstrap new replicas into a live cluster without disrupting active traffic. Based on these results, we derived a mathematical model to quantify and compare the scalability gaps between both systems.

Methodology

To assess scalability under realistic conditions, we ran performance tests to simulate typical production traffic fluctuations. The actual benchmarking is a series of invocations of ScyllaDB’s fork of latte with a consistency level of LOCAL_QUORUM. To test scalability, we used a “real-life” mixed distribution, with the majority (80%) of operations distributed over a hot set, and the remaining 20% iterating over a cold set.

latte is the Lightweight Benchmarking Tool for Apache Cassandra developed by Piotr Kołaczkowski, a DataStax Software Engineer. Under the hood, latte relies on ScyllaDB’s Rust driver, which is compatible with Apache Cassandra. It outperforms other widely used benchmarking tools, provides better scalability, and has no GC pauses, resulting in less latency variability in the results. Unlike other benchmarking tools, latte (thanks to its use of Rune) also provides a flexible syntax for defining workloads closely tied to how developers actually interact with their databases. Lastly, we can always brag we did it in Rust, just because… 🙂

We set baseline traffic at 70% of each database’s observed peak before P99 latency crossed a 10ms threshold. This was to ensure both databases retained sufficient CPU and I/O headroom to handle sudden traffic and concurrency spikes, as well as the overhead of scaling operations.

Setup

The following table shows the infrastructure we used for our tests:

|  | Cassandra/ScyllaDB | Loaders |
|---|---|---|
| EC2 Instance type | i4i.4xlarge | c6in.8xlarge |
| Cluster size | 3 | 1 |
| vCPUs (total) | 16 (48) | 32 |
| RAM (total) | 128 (384) GiB | 64 GiB |
| Storage (total) | 1 x 3,750 GB AWS Nitro SSD | EBS-only |
| Network | Up to 25 Gbps | 50 Gbps |

ScyllaDB and Cassandra nodes, as well as their respective loaders, were placed under their own exclusive AWS Cluster Placement Group for low-latency networking. Given the side effect of all replicas being placed in the same availability zone, we placed each node under an artificial rack using the GossipingPropertyFileSnitch.
The schema used throughout all testing suites resembles the default cassandra-stress schema, whereas the keyspace relies on NetworkTopologyStrategy with a replication factor of 3:

CREATE TABLE IF NOT EXISTS keyspace1.standard1 (
    key blob PRIMARY KEY,
    c0 blob,
    c1 blob,
    c2 blob,
    c3 blob,
    c4 blob
);

We used a payload of 1010 bytes, where:

  • 10 bytes represent the key size, and
  • each of the 5 columns is a distinct 200-byte blob

Both databases were pre-populated with 2 billion partitions for an approximate (replicated) storage utilization of ~2.02TB. That’s about 60% disk utilization, considering the metadata overhead.

Tuning Apache Cassandra

Cassandra was originally designed to run on commodity hardware. As such, one of its features is shipping with numerous different tuning options suitable for various use cases. However, this flexibility comes with a cost: tuning Cassandra is entirely up to its administrators, with limited guidance from online resources. Unlike ScyllaDB, an Apache Cassandra deployment requires users to manually tune kernel settings, set user limits, configure the JVM, set disks’ read-ahead, decide upon compaction strategies, and figure out the best approach for pushing metrics to external monitoring systems. To make things worse, some configuration file comments are outdated or ambiguous across versions. For example, CASSANDRA-16315 and CASSANDRA-7139 describe problems involving the default setting for concurrent compactors and offer advice on how to tune that parameter. Along those lines, it’s worth mentioning Amy Tobey’s Cassandra tuning guide (perhaps the most relevant Cassandra tuning resource available to date), where it says:

“The inaccuracy of some comments in Cassandra configs is an old tradition, dating back to 2010 or 2011. (…) What you need to know is that a lot of the advice in the config commentary is misleading. Whenever it says “number of cores” or “number of disks” is a good time to be suspicious. (…)” – Excerpt from Amy’s Cassandra tuning guide, cassandra.yaml section

Tuning the JVM is a journey of its own. Cassandra 5.0’s production recommendations don’t mention it, and the jvm-* files page only deals with the file-based structure as shipped with the database. Although DataStax’s Tuning Java resources does a better job of providing recommendations, it warns to adjust “settings gradually and test each incremental change.” Further, we didn’t find any references to ZGC (available as of JDK 17) on either the Apache Cassandra or DataStax websites. That made us wonder whether this garbage collector was even recommended. Eventually, we settled on using settings similar to those that TheLastPickle used in their Apache Cassandra 4.0 Benchmarks.

During our scaling tests, we hit another inconsistency: we noticed Cassandra’s streaming operations had a default cap of 24MiB/s per node, resulting in suboptimal transfer times. Upon raising those thresholds, we noticed that:

  • Cassandra 4.0 docs mentioned tuning the stream_throughput_outbound_megabits_per_sec option
  • Both Cassandra 4.1 and Cassandra 5.0 docs referenced the stream_throughput_outbound option
  • This Instaclustr article (or carefully interpreting cassandra_latest.yaml) seems like the best resource for understanding the correct entire_sstable_stream_throughput_outbound option

In other words, 3 distinct settings exist for tuning the previous 3 major releases of Cassandra.
If your organization is looking to upgrade, we strongly encourage you to conduct a careful review and a full round of testing on your own. This is not an edge case; others noted similar upgrade problems on the Apache Cassandra mailing list. CASSANDRA-20692 demonstrates that Apache Cassandra 5 failed to notice a potential WAL corruption under its newer Direct IO implementation, as issuing I/O requests without O_DSYNC could manifest as data loss during abrupt restarts. This, in turn, gives users a false sense of improved write performance.

Configuring Apache Cassandra is not intuitive. We used cassandra_latest.yaml as a starting point, and ran multiple iterations of the same workload under a variety of settings, including different GC configurations. The results are shown below and demonstrate how even small tuning changes can have a dramatic impact on Cassandra’s performance (for better or for worse). We started by evaluating the performance of G1GC and observed that tail latencies were severely affected beyond a throughput of 40K/s. Simply switching to ZGC gave a nice performance boost, so we decided to stick with it for the remainder of our testing. The following table shows the performance variability of Cassandra 5.0 while using different tuning settings (ordered from best to worst case):

| Test Kind | Garbage Collector | Read-ahead | Compaction Throughput | P99 Latency | Throughput |
|---|---|---|---|---|---|
| Cassandra RA4 Compaction256 | ZGC | 4KB | 256MB/s | 6.662ms | 120K/s |
| Cassandra RA4 Compaction0 | ZGC | 4KB | Unthrottled | 8.159ms | 120K/s |
| Cassandra RA8 Compaction256 | ZGC | 8KB | 256MB/s | 4.657ms | 100K/s |
| Cassandra RA8 Compaction0 | ZGC | 8KB | Unthrottled | 4.903ms | 100K/s |
| Cassandra G1GC | G1GC | 4KB | 256MB/s | 5.521ms | 40K/s |

Although we spent a considerable amount of time tuning Cassandra to provide an unbiased and neutral comparison, we eventually found ourselves in a feedback loop. That is, the reported performance levels are only applicable to the workload being stressed, running on the infrastructure in question. If we were to switch to different instance types or run different workload profiles, then additional tuning cycles would be necessary. We anticipate that the majority of Cassandra deployments do not undergo the level of per-workload testing we carried out. We hope that our experience may prevent other users from running into the same mistakes and gotchas that we did. We’re not claiming that our settings are the absolute best, but we don’t expect that further iterations will yield large performance improvements beyond what we observed.

Tuning ScyllaDB

We carried out very little tuning for ScyllaDB beyond what is described in the Configure ScyllaDB documentation. Unlike Apache Cassandra, the scylla_setup script takes care of most of the nitty-gritty details related to optimal OS tuning. ScyllaDB also used tablets for data distribution. We targeted a minimum of 100 tablets/shard with the following CREATE KEYSPACE statement:

CREATE KEYSPACE IF NOT EXISTS keyspace1
    WITH REPLICATION = {
        'class': 'NetworkTopologyStrategy',
        'datacenter1': 3
    }
    AND tablets = {'enabled': true, 'initial': 2048};

Limitations of our Testing

Performance testing often fails to capture real-world performance metrics tied to the semantics and access patterns of applications. Aspects such as variable concurrency, the impact of DELETEs (tombstones), hotspots, and large partitions were beyond the scope of our testing. Our work also did not aim to provide a feature-specific comparison.
While Apache Cassandra 5.0 ships with newer (and less battle-tested) features like Storage-Attached Indexes (SAI), ScyllaDB also ships with Workload Prioritization, Local Secondary Indexes, and Synchronous Materialized Views, all with no equivalent counterpart in Cassandra. However, we ensured that the transparent, newer features of both databases were used, such as Cassandra’s Trie Memtables, Trie-indexed SSTables, and its newer Unified Compaction Strategy, as well as ScyllaDB features like Tablets, Shard-awareness, SSTable Index Caching, and so forth. Future tests will use ScyllaDB’s Trie-indexed SSTables. Also note that both databases now offer Vector Search, which was not in scope for this project. Finally, this benchmark focuses specifically on scaling operations, not steady-state performance. ScyllaDB has historically demonstrated higher throughput and lower latency than Cassandra in multiple performance benchmarks. Cassandra 5 introduces architectural improvements, but our preliminary testing shows that ScyllaDB maintains its performance advantage. Producing a full apples-to-apples benchmark suite for Cassandra 5 is a sizable project that’s outside the scope of this study. For teams evaluating a migration, the best insights will come from testing your real-life workload profile, data models, and SLAs directly on ScyllaDB. If you are running your own evaluations (tip: ScyllaDB Cloud is the easiest way), our technical team can review your setup and share tips for accurately measuring ScyllaDB’s performance in your specific environment.

Instaclustr product update: December 2025


Here’s a roundup of the latest features and updates that we’ve recently released.

If you have any particular feature requests or enhancement ideas that you would like to see, please get in touch with us.

Major announcements

OpenSearch®

AI Search for OpenSearch®: Unlocking next-generation search

AI Search for OpenSearch, which is now available in Public Preview on the NetApp Instaclustr Managed Platform, is designed to bring semantic, hybrid, and multimodal search capabilities to OpenSearch deployments—turning them into an end-to-end AI-powered search solution within minutes. With built-in ML models, vector indexing, and streamlined ingestion pipelines, next-generation search can be enabled in minutes without adding operational complexity. This feature powers smarter, more relevant discovery experiences backed by AI—securely deployed across any cloud or on-premises environment.

ClickHouse®

FSx for NetApp ONTAP and Managed ClickHouse® integration is now available
We’re excited to announce that NetApp has introduced seamless integration between Amazon FSx for NetApp ONTAP and Instaclustr Managed ClickHouse, to enable customers to build a truly hybrid lakehouse architecture on AWS. This integration is designed to deliver lightning-fast analytics without the need for complex data movement, while leveraging FSx for ONTAP’s unified file and object storage, tiered performance, and cost optimization. Customers can now run zero-copy lakehouse analytics with ClickHouse directly on FSx for ONTAP data—to simplify operations, accelerate time-to-insight, and reduce total cost of ownership.

PostgreSQL®

Instaclustr for PostgreSQL® on Amazon FSx for ONTAP: A new era
We’re excited to announce the public preview of Instaclustr Managed PostgreSQL integrated with Amazon FSx for NetApp ONTAP—combining enterprise-grade storage with world-class open source database management. This integration is designed to deliver higher IOPS, lower latency, and advanced data management without increasing instance size or adding costly hardware. Customers can now run PostgreSQL clusters backed by FSx for ONTAP storage, leveraging on-disk compression for cost savings and paving the way for ONTAP-powered features, such as instant snapshot backups, instant restores, and fast forking. These ONTAP-enabled features are planned to unlock huge operational benefits and will be launched with our GA release.

Other significant changes

Apache Cassandra®
  • Transitioned Apache Cassandra v4.1.8 to CLOSED lifecycle state; scheduled to reach End of Life (EOL) on December 20, 2025.
Apache Kafka®
  • Kafka on Azure now supports v5 generation nodes, available in General Availability.
  • Instaclustr Managed Apache ZooKeeper has moved from General Availability to closed status.
ClickHouse
  • Kafka Table Engine integration with ClickHouse has added support to enable real-time data ingestion, streamline streaming analytics, and accelerate insights.
  • New ClickHouse node sizes, powered by AWS m7g, r7i, and r7g instances, are now in Limited Availability for cluster creation.
Cadence®
  • Cadence is now available to be provisioned with Cassandra 5.x, designed to deliver improved performance, enhanced scalability, and stronger security for mission-critical workflows.
OpenSearch

PostgreSQL
  • Added new PostgreSQL metrics for connect states and wait event types.
  • PostgreSQL Load Balancer add-on is now available, providing a unified endpoint for cluster access, simplifying failover handling, and ensuring node health through regular checks.
Upcoming releases

Apache Cassandra
  • We’re working on enabling multi-datacenter (multi-DC) cluster provisioning via API and console, designed to make it easier to deploy clusters across regions with secure networking and reduced manual steps.
Apache Kafka
  • We’re working on adding Kafka Tiered Storage for clusters running in GCP— designed to bring affordable, scalable retention, and instant access to historical data, to ensure flexibility and performance across clouds for enterprise Kafka users.
ClickHouse
  • We’re planning to extend our Managed ClickHouse to allow it to work with on-prem deployments.
PostgreSQL
  • Following the success of our public preview, we’re preparing to launch PostgreSQL integrated with FSx for NetApp ONTAP (FSxN) into General Availability. This enhancement is designed to combine enterprise-grade PostgreSQL with FSxN’s scalable, cost-efficient storage, enabling customers to optimize infrastructure costs while improving performance and flexibility.
OpenSearch®
  • As part of our ongoing advancements in AI for OpenSearch, we are planning to enable adding GPU nodes into OpenSearch clusters, aiming to enhance the performance and efficiency of machine learning and AI workloads.
Instaclustr Managed Platform
  • Self-service Tags Management feature—allowing users to add, edit, or delete tags for their clusters directly through the Instaclustr console, APIs, or Terraform provider for RIYOA deployments.
Did you know?
  • Cadence Workflow, the open source orchestration engine created by Uber, has officially joined the Cloud Native Computing Foundation (CNCF) as a Sandbox project. This milestone ensures transparent governance, community-driven innovation, and a sustainable future for one of the most trusted workflow technologies in modern microservices and agentic AI architectures. Uber donates Cadence Workflow to CNCF: The next big leap for the open source project—read the full story and discover what’s next for Cadence.
  • Upgrading ClickHouse® isn’t just about new features—it’s essential for security, performance, and long-term stability. In ClickHouse upgrade: Why staying updated matters, you’ll learn why skipping upgrades can lead to technical debt, missed optimizations, and security risks. Then, explore A guide to ClickHouse® upgrades and best practices for practical strategies, including when to choose LTS releases for mission-critical workloads and when stable releases make sense for fast-moving environments.
  • Our latest blog, AI Search for OpenSearch®: Unlocking next-generation search, explains how this new solution enables smarter discovery experiences using built-in ML models, vector embeddings, and advanced search techniques—all fully managed on the NetApp Instaclustr Platform. Ready to explore the future of search? Read the full article and see how AI can transform your OpenSearch deployments.

If you have any questions or need further assistance with these enhancements to the Instaclustr Managed Platform, please contact us.

SAFE HARBOR STATEMENT: Any unreleased services or features referenced in this blog are not currently available and may not be made generally available on time or at all, as may be determined in NetApp’s sole discretion. Any such referenced services or features do not represent promises to deliver, commitments, or obligations of NetApp and may not be incorporated into any contract. Customers should make their purchase decisions based upon services and features that are currently generally available.

The post Instaclustr product update: December 2025 appeared first on Instaclustr.

Freezing streaming data into Apache Iceberg™—Part 1: Using Apache Kafka®Connect Iceberg Sink Connector

Introduction 

Ever since the first distributed system—i.e. 2 or more computers networked together (in 1969)—there has been the problem of distributed data consistency: How can you ensure that data from one computer is available and consistent with the second (and more) computers? This problem can be uni-directional (one computer is considered the source of truth, others are just copies), or bi-directional (data must be synchronized in both directions across multiple computers). 

Some approaches to this problem I’ve come across in the last 8 years include Kafka Connect (for elegantly solving the heterogeneous many-to-many integration problem by streaming data from source systems to Kafka and from Kafka to sink systems, some earlier blogs on Apache Camel Kafka Connectors and a blog series on zero-code data pipelines), MirrorMaker2 (MM2, for replicating Kafka clusters, a 2 part blog series), and Debezium (Change Data Capture/CDC, for capturing changes from databases as streams and making them available in downstream systems, e.g. for Apache Cassandra and PostgreSQL)—MM2 and Debezium are actually both built on Kafka Connect.  

Recently, some “sink” systems have been taking over responsibility for streaming data from Kafka into themselves, e.g. OpenSearch pull-based ingestion (c.f. OpenSearch Sink Connector), and the ClickHouse Kafka Table Engine (c.f. ClickHouse Sink Connector). These “pull-based” approaches are potentially easier to configure and don’t require running a separate Kafka Connect cluster and sink connectors, but some downsides may be that they are not as reliable or independently scalable, and you will need to carefully monitor and scale them to ensure they perform adequately.  

And then there’s “zero-copy” approaches—these rely on the well-known computer science trick of sharing a single copy of data using references (or pointers), rather than duplicating the data. This idea has been around for almost as long as computers, and is still widely applicable, as we’ll see in part 2 of the blog. 

The distributed data use case we’re going to explore in this 2-part blog series is streaming Apache Kafka data into Apache Iceberg, or “Freezing streaming Apache Kafka data into an (Apache) Iceberg”! In part 1 we’ll introduce Apache Iceberg and look at the first approach for “freezing” streaming data using the Kafka Connect Iceberg Sink Connector. 

What is Apache Iceberg? 

Apache Iceberg is an open source, open table format specification optimized for column-oriented workloads and huge analytic datasets. It supports multiple different concurrent engines that can insert and query table data using SQL—and Iceberg is organized like, well, an iceberg! 

The tip of the Iceberg is the Catalog. An Iceberg Catalog acts as a central metadata repository, tracking the current state of Iceberg tables, including their names, schemas, and metadata file locations. It serves as the “single source of truth” for a data Lakehouse, enabling query engines to find the correct metadata file for a table to ensure consistent and atomic read/write operations.  

Just under the water, the next layer is the metadata layer. The Iceberg metadata layer tracks the structure and content of data tables in a data lake, enabling features like efficient query planning, versioning, and schema evolution. It does this by maintaining a layered structure of metadata files, manifest lists, and manifest files that store information about table schemas, partitions, and data files, allowing query engines to prune unnecessary files and perform operations atomically. 

The data layer is at the bottom. The Iceberg data layer is the storage component where the actual data files are stored. It supports different storage backends, including cloud-based object storage like Amazon S3 or Google Cloud Storage, or HDFS. It uses file formats like Parquet or Avro. Its main purpose is to work in conjunction with Iceberg’s metadata layer to manage table snapshots and provide a more reliable and performant table format for data lakes, bringing data warehouse features to large datasets. 

As shown in the above diagram, Iceberg supports multiple different engines, including Apache Spark and ClickHouse. Engines provide the “database” features you would expect, including:  

  • Data Management 
  • ACID Transactions 
  • Query Planning and Optimization 
  • Schema Evolution 
  • And more! 
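
To make the catalog, metadata, and data layers concrete, here is a minimal sketch using PyIceberg, a Python client that acts as a lightweight engine. PyIceberg isn't used in this post, and the catalog URI and table name below are made up:

from pyiceberg.catalog import load_catalog

# The catalog (the "tip of the iceberg") tells us where each table's current
# metadata file lives. A REST catalog at a hypothetical address:
catalog = load_catalog("default", **{"type": "rest", "uri": "http://iceberg-catalog:8181"})

# Loading a table reads the metadata layer: schema, partitions, snapshots.
table = catalog.load_table("analytics.events")   # hypothetical namespace.table
print(table.schema())

# A scan uses manifest lists and manifest files to prune data files, then reads
# the surviving Parquet/Avro files from the data layer (S3, GCS, HDFS, ...).
batch = table.scan(limit=10).to_arrow()
print(batch)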

I’ve recently been reading an excellent book on Apache Iceberg (“Apache Iceberg: The Definitive Guide”), which explains the philosophy, architecture and design, including operation, of Iceberg. For example, it says that it’s best practice to treat data lake storage as immutable—data should only be added to a Data Lake, not deleted.  So, in theory at least, writing infinite, immutable Kafka streams to Iceberg should be straightforward!  

But because it’s a complex distributed system (which looks like a database from above water but is really a bunch of files below water!), there is some operational complexity. For example, it handles change and consistency by creating new snapshots for every modification, enabling time travel, isolating readers from writes, and supporting optimistic concurrency control for multiple writers. But you need to manage snapshots (e.g. expiring old snapshots). And chapter 4 (performance optimisation) explains that you may need to worry about compaction (reducing too many small files), partitioning approaches (which can impact read performance), and handling row-level updates. The first two issues may be relevant for Kafka, but probably not the last one.  So, it looks like it’s good fit for the streaming Kafka use cases, but we may need to watch out for Iceberg management issues.  

“Freezing” streaming data with the Kafka Iceberg Sink Connector 

But Apache Iceberg is “frozen”—what’s the connection to fast-moving streaming data? You certainly don’t want to collide with an iceberg from your speedy streaming “ship”—but you may want to freeze your streaming data for long-term analytical queries in the future. How can you do that without sinking? Actually, a “sink” is the first answer: A Kafka Connect Iceberg Sink Connector is the most common way of “freezing” your streaming data in Iceberg!  

Kafka Connect is the standard framework provided by Apache Kafka to move data from multiple heterogeneous source systems to multiple heterogeneous sink systems, using:  

  • A Kafka cluster 
  • A Kafka Connect cluster (running connectors) 
  • Kafka Connect source connectors 
  • Kafka topics and 
  • Kafka Connect Sink Connectors 

That is, a highly decoupled approach. It provides real-time data movement with high scalability, reliability, error handling and simple transformations.   

Here’s the Kafka Connect Iceberg Sink Connector official documentation

It appears to be reasonably complicated to configure this sink connector; you will need to know something about Iceberg. For example, what is a “control topic”? It’s apparently used to coordinate commits for exactly-once semantics (EOS).  

The connector supports fan-out (writing to multiple Iceberg tables from one topic), fan-in (writing to one Iceberg table from multiple topics), static and dynamic routing, and filtering.  

As with many technologies you may want to use as Kafka Connect sinks, support for Kafka metadata is limited: the KafkaMetadata Transform (which injects topic, partition, offset and timestamp properties) is only experimental at present.  

How are Iceberg tables created with the correct metadata? If you have JSON record values, then schemas are inferred by default (but may not be correct or optimal). Alternatively, explicit schemas can be included in-line or referenced from a Kafka Schema Registry (e.g. Karapace), and, as an added bonus, schema evolution is supported.  Also note that Iceberg tables may have to be manually created prior to use if your Catalog doesn’t support table auto-creation.  
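
To make the configuration discussion concrete, here is a hedged sketch of registering the connector through the standard Kafka Connect REST API. The property names below come from the connector's documented options but should be checked against the docs for the version you run, and the connector name, topic, table, and hostnames are all made up:

import json
import requests

connector = {
    "name": "guardian-iceberg-sink",                 # hypothetical connector name
    "config": {
        "connector.class": "org.apache.iceberg.connect.IcebergSinkConnector",
        "topics": "articles",                        # hypothetical source topic
        "iceberg.tables": "analytics.articles",      # target Iceberg table (fan-in/fan-out is also possible)
        "iceberg.catalog.type": "rest",              # which catalog you use depends on your stack
        "iceberg.catalog.uri": "http://iceberg-catalog:8181",
        "iceberg.control.topic": "control-iceberg",  # coordinates commits for exactly-once semantics
        "iceberg.control.commit.interval-ms": "300000",  # commit every 5 minutes (the default)
    },
}

resp = requests.post(
    "http://kafka-connect:8083/connectors",          # standard Connect REST endpoint
    headers={"Content-Type": "application/json"},
    data=json.dumps(connector),
)
resp.raise_for_status()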

From what I understood about Iceberg, to use it (e.g. for writes), you need support from an engine (e.g. to add raw data to the Iceberg warehouse, create the metadata files, and update the catalog).  How does this work for Kafka Connect? From this blog I discovered that the Kafka Connect Iceberg Sink connector is functioning as an Iceberg engine for writes, so there really is an engine, but it’s built into the connector.  

As is the case with all Kafka Connect Sink Connectors, records are available as soon as they are written to Kafka topics by Kafka producers and Kafka Connect Source Connectors, i.e. records in active segments can be copied immediately to sink systems. But is the Iceberg Sink Connector real-time? Not really! The default time to write to Iceberg is every 5 minutes (iceberg.control.commit.interval-ms) to prevent the multiplication of small files—something that Iceberg(s) doesn’t/don’t like (“melting”?). In practice, it’s because every data file must be tracked in the metadata layer, which impacts performance in many ways—proliferation of small files is typically addressed by optimization and compaction (e.g. Apache Spark supports Iceberg management, including these operations). 

So, unlike most Kafka Connect sink connectors, which write as quickly as possible, there will be lag before records appear in Iceberg tables (“time to freeze” perhaps)!  

The systems are separate (Kafka and Iceberg are independent), records are copied to Iceberg, and that’s it! This is a clean separation of concerns and ownership. Kafka owns the source data (with Kafka controlling data lifecycles, including record expiry), Kafka Connect Iceberg Sink Connector performs the reading from Kafka and writing to Iceberg, and is independently scalable to Kafka. Kafka doesn’t handle any of the Iceberg management.  Once the data has landed in Iceberg, Kafka has no further visibility or interest in it. And the pipeline is purely one way, write only – reads or deletes are not supported.  

Here’s a summary of this approach to freezing streams:  

  1. Kafka Connect Iceberg Sink Connector shares all the benefits of the Kafka Connect framework, including scalability, reliability, error handling, routing, and transformations.  
  2. JSON record values are the minimum requirement; ideally you supply full schemas, referenced from Karapace—but not all schemas are guaranteed to work. 
  3. Kafka Connect doesn’t “manage” Iceberg (e.g. automatically aggregate small files, remove snapshots, etc.) 
  4. You may have to tune the commit interval – 5 minutes is the default. 
  5. But it does have a built-in engine that supports writing to Iceberg. 
  6. You may need to use an external tool (e.g. Apache Spark) for Iceberg management procedures. 
  7. It’s write-only to Iceberg. Reads or deletes are not supported 

But what’s the best thing about the Kafka Connect Iceberg Sink Connector? It’s available now (as part of the Apache Iceberg build) and works on the NetApp Instaclustr Kafka Connect platform as a “bring your own connector”  (instructions here).  

In part 2, we’ll look at Kafka Tiered Storage and Iceberg Topics! 

The post Freezing streaming data into Apache Iceberg™—Part 1: Using Apache Kafka®Connect Iceberg Sink Connector appeared first on Instaclustr.

Netflix Live Origin

Xiaomei Liu, Joseph Lynch, Chris Newton

Introduction

Behind the Streams: Building a Reliable Cloud Live Streaming Pipeline for Netflix introduced the architecture of the streaming pipeline. This blog post looks at the custom Origin Server we built for Live — the Netflix Live Origin. It sits at the demarcation point between the cloud live streaming pipelines on its upstream side and the distribution system, Open Connect, Netflix’s in-house Content Delivery Network (CDN), on its downstream side, and acts as a broker managing what content makes it out to Open Connect and ultimately to the client devices.

Live Streaming Distribution and Origin Architecture

Netflix Live Origin is a multi-tenant microservice operating on EC2 instances within the AWS cloud. We lean on standard HTTP protocol features to communicate with the Live Origin. The Packager pushes segments to it using PUT requests, which place a file into storage at the particular location named in the URL. The storage location corresponds to the URL that is used when the Open Connect side issues the corresponding GET request.
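
A toy illustration of that push/pull symmetry (the hostname and path here are invented; the real protocol details are Netflix-internal):

import requests

# The Packager publishes a segment with a PUT to a URL that names its storage location...
url = "https://live-origin.example.com/events/evt123/video_1080p/seg-000123.m4s"
with open("seg-000123.m4s", "rb") as f:
    requests.put(url, data=f)

# ...and Open Connect later fills its cache with a GET to exactly the same URL.
segment = requests.get(url).content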

Live Origin architecture is influenced by key technical decisions of the live streaming architecture. First, resilience is achieved through redundant regional live streaming pipelines, with failover orchestrated at the server-side to reduce client complexity. The implementation of epoch locking at the cloud encoder enables the origin to select a segment from either encoding pipeline. Second, Netflix adopted a manifest design with segment templates and constant segment duration to avoid frequent manifest refresh. The constant duration templates enable Origin to predict the segment publishing schedule.

Multi-pipeline and multi-region aware origin

Live streams inevitably contain defects due to the non-deterministic nature of live contribution feeds and strict real-time segment publishing timelines. Common defects include:

  • Short segments: Missing video frames and audio samples.
  • Missing segments: Entire segments are absent.
  • Segment timing discontinuity: Issues with the Track Fragment Decode Time.

Communicating segment discontinuity from the server to the client via a segment template-based manifest is impractical, and these defective segments can disrupt client streaming.

The redundant cloud streaming pipelines operate independently, encompassing distinct cloud regions, contribution feeds, encoder, and packager deployments. This independence substantially mitigates the probability of simultaneous defective segments across the dual pipelines. Owing to its strategic placement within the distribution path, the live origin naturally emerges as a component capable of intelligent candidate selection.

The Netflix Live Origin features multi-pipeline and multi-region awareness. When a segment is requested, the live origin checks candidates from each pipeline in a deterministic order, selecting the first valid one. Segment defects are detected via lightweight media inspection at the packager. This defect information is provided as metadata when the segment is published to the live origin. In the rare case of concurrent defects at the dual pipeline, the segment defects can be communicated downstream for intelligent client-side error concealment.

Open Connect streaming optimization

When the Live project started, Open Connect had become highly optimised for VOD content delivery — nginx had been chosen many years ago as the Web Server since it is highly capable in this role, and a number of enhancements had been added to it and to the underlying operating system (BSD). Unlike traditional CDNs, Open Connect is more of a distributed origin server — VOD assets are pre-positioned onto carefully selected server machines (OCAs, or Open Connect Appliances) rather than being filled on demand.

Alongside the VOD delivery, an on-demand fill system has been used for non-VOD assets — this includes artwork and the downloadable portions of the clients, etc. These are also served out of the same nginx workers, albeit under a distinct server block, using a distinct set of hostnames.

Live didn’t fit neatly into this ‘small object delivery’ model, so we extended the proxy-caching functionality of nginx to address Live-specific needs. We will touch on some of these here related to optimized interactions with the Origin Server. Look for a future blog post that will go into more details on the Open Connect side.

The segment templates provided to clients are also provided to the OCAs as part of the Live Event Configuration data. Using the Availability Start Time and Initial Segment number, the OCA is able to determine the legitimate range of segments for each event at any point in time — requests for objects outside this range can be rejected, preventing unnecessary requests going up through the fill hierarchy to the origin. If a request makes it through to the origin, and the segment isn’t available yet, the origin server will return a 404 Status Code (indicating File Not Found) with the expiration policy of that error so that it can be cached within Open Connect until just before that segment is expected to be published.
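
Here is a small sketch of that template-driven logic, assuming a constant 2-second segment duration. The function name and the "reject / 404-with-TTL / serve" outcomes are illustrative only, not the actual OCA or Origin code:

import time

SEGMENT_DURATION_S = 2.0  # constant segment duration from the segment template

def answer_segment_request(segment_number, availability_start_time,
                           initial_segment_number, now=None):
    now = time.time() if now is None else now
    if segment_number < initial_segment_number:
        return ("reject", None)  # outside the legitimate range for this event

    # When the template says this segment should be published.
    expected_publish = (availability_start_time
                        + (segment_number - initial_segment_number + 1) * SEGMENT_DURATION_S)

    if expected_publish <= now:
        return ("serve_or_404", None)  # should exist by now; look it up in origin storage

    # Not published yet: a 404 that caches until just before the expected publish
    # time, so Open Connect doesn't keep asking the origin for it.
    return ("404_with_ttl", max(expected_publish - now, 0.0))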

If the Live Origin knows when segments are being pushed to it, and knows what the live edge is — when a request is received for the immediately next object, rather than handing back another 404 error (which would go all the way back through Open Connect to the client), the Live Origin can ‘hold open’ the request, and service it once the segment has been published to it. By doing this, the degree of chatter within the network handling requests that arrive early has been significantly reduced. As part of this, millisecond grain caching was added to nginx to enhance the standard HTTP Cache Control, which only works at second granularity, a long time when segments are generated every 2 seconds.

Streaming metadata enhancement

The HTTP standard allows for the addition of request and response headers that can be used to provide additional information as files move between clients and servers. The HTTP headers provide notifications of events within the stream in a highly scalable way that is independently conveyed to client devices, regardless of their playback position within the stream.

These notifications are provided to the origin by the live streaming pipeline and are inserted by the origin in the form of headers, appearing on the segments generated at that point in time (and persist to future segments — they are cumulative). Whenever a segment is received at an OCA, this notification information is extracted from the response headers and used to update an in-memory data structure, keyed by event ID; and whenever a segment is served from the OCA, the latest such notification data is attached to the response. This means that, given any flow of segments into an OCA, it will always have the most recent notification data, even if all clients requesting it are behind the live edge. In fact, the notification information can be conveyed on any response, not just those supplying new segments.
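
A sketch of that bookkeeping on the OCA side; the header prefix and the data structure here are invented for illustration:

# Latest notification data per event, updated on ingest and attached on serve.
latest_notifications: dict[str, dict[str, str]] = {}

def on_response_from_origin(event_id: str, headers: dict[str, str]) -> None:
    # Extract notification headers (hypothetical "x-live-note-*" prefix) and keep the latest.
    notes = {k: v for k, v in headers.items() if k.lower().startswith("x-live-note-")}
    if notes:
        latest_notifications.setdefault(event_id, {}).update(notes)

def on_response_to_client(event_id: str, headers: dict[str, str]) -> None:
    # Attach the most recent notification data to every response for this event,
    # even if the requesting client is behind the live edge.
    headers.update(latest_notifications.get(event_id, {}))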

Cache invalidation and origin mask

An invalidation system has been available since the early days of the project. It can be used to “flush” all content associated with an event by altering the key used when looking up objects in cache — this is done by incorporating a version number into the cache key that can then be bumped on demand. This is used during pre-event testing so that the network can be returned to a pristine state for the test with minimal fuss.

Each segment published by the Live Origin conveys the encoding pipeline it was generated by, as well as the region it was requested from. Any issues that are found after segments make their way into the network can be remedied by an enhanced invalidation system that takes such variants into account. It is possible to invalidate (that is, cause to be considered expired) segments in a range of segment numbers, optionally restricted to those sourced from a particular encoder (say, encoder A), or further restricted to those sourced from encoder A and retrieved from a particular region (say, region X).

In combination with Open Connect’s enhanced cache invalidation, the Netflix Live Origin allows selective encoding pipeline masking to exclude a range of segments from a particular pipeline when serving segments to Open Connect. The enhanced cache invalidation and origin masking enable live streaming operations to hide known problematic segments (e.g., segments causing client playback errors) from streaming clients once the bad segments are detected, protecting millions of streaming clients during the DVR playback window.

Origin storage architecture

Our original storage architecture for the Live Origin was simple: just use AWS S3 like we do for SVOD. This served us well initially for our low-traffic events, but as we scaled up we discovered that Live streaming has unique latency and workload requirements that differ significantly from on-demand, where we have significant lead time to pre-position content. While S3 met its stated uptime guarantees, our strict 2-second retry budget inherent to Live events (where every write is critical) led us to explore optimizations specifically tailored for real-time delivery at scale. AWS S3 is an amazing object store, but our Live streaming requirements were closer to those of a global low-latency highly-available database. So, we went back to the drawing board and started from the requirements. The Origin required:

  1. [HA Writes] Extremely high write availability, ideally as close as possible to full write availability within a single AWS region, with low (seconds) replication delay to other regions. Any failed write operation within 500ms is considered a bug that must be triaged and prevented from re-occurring.
  2. [Throughput] High write throughput, with hundreds of MiB replicating across regions
  3. [Large Partitions] Efficiently support O(MiB) writes that accumulate to O(10k) keys per partition with O(GiB) total size per event.
  4. [Strong Consistency] Within the same region, we needed read-your-write semantics to hit our <1s read delay requirements (must be able to read published segments)
  5. [Origin Storm] During worst-case load involving Open Connect edge cases, we may need to handle O(GiB) of read throughput without affecting writes.

Fortunately, Netflix had previously invested in building a KeyValue Storage Abstraction that cleverly leveraged Apache Cassandra to provide chunked storage of MiB or even GiB values. This abstraction was initially built to support cloud saves of Game state. The Live use case would push the boundaries of this solution, however, in terms of availability for writes (#1), cumulative partition size (#3), and read throughput during Origin Storm (#5).

High Availability for Writes of Large Payloads

The KeyValue Payload Chunking and Compression Algorithm breaks O(MiB) work down so each part can be idempotently retried and hedged to maintain strict latency service level objectives, as well as spreading the data across the full cluster. When we combine this algorithm with Apache Cassandra’s local-quorum consistency model, which allows write availability even with an entire Availability Zone outage, plus a write-optimized Log-Structured Merge Tree (LSM) storage engine, we could meet the first four requirements. After iterating on the performance and availability of this solution, we were not only able to achieve the write availability required, but did so with a P99 tail latency that was similar to the status quo’s P50 average latency while also handling cross-region replication behind the scenes for the Origin. This new solution was significantly more expensive (as expected, databases backed by SSD cost more), but minimizing cost was not a key objective and low latency with high availability was:

Storage System Write Performance
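
The exact chunking protocol is Netflix-internal, but the core idea described above can be sketched in a few lines: split a large value into fixed-size parts whose ids are deterministic, so a retried or hedged write of the same part is idempotent. Chunk size, naming, and structure here are assumptions:

import hashlib

CHUNK_SIZE = 1 << 20  # 1 MiB parts; the real chunk size is an internal detail

def chunk_value(key: str, payload: bytes):
    parts = []
    for offset in range(0, len(payload), CHUNK_SIZE):
        body = payload[offset:offset + CHUNK_SIZE]
        digest = hashlib.sha256(body).hexdigest()[:12]
        # Deterministic part id: retrying (or hedging) the same write overwrites
        # the same row rather than duplicating data, and parts spread across the cluster.
        parts.append((f"{key}:{offset // CHUNK_SIZE}:{digest}", body))
    return parts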

High Availability Reads at Gbps Throughputs

Now that we solved the write reliability problem, we had to handle the Origin Storm failure case, where potentially dozens of Open Connect top-tier caches could be requesting multiple O(MiB) video segments at once. Our back-of-the-envelope calculations showed worst-case read throughput in the O(100Gbps) range, which would normally be extremely expensive for a strongly-consistent storage engine like Apache Cassandra. With careful tuning of chunk access, we were able to respond to reads at network line rate (100Gbps) from Apache Cassandra, but we observed unacceptable performance and availability degradation on concurrent writes. To resolve this issue, we introduced write-through caching of chunks using our distributed caching system EVCache, which is based on Memcached. This allows almost all reads to be served from a highly scalable cache, allowing us to easily hit 200Gbps and beyond without affecting the write path, achieving read-write separation.

Final Storage Architecture

In the final storage architecture, the Live Origin writes and reads to KeyValue, which manages a write-through cache to EVCache (memcached) and implements a safe chunking protocol that spreads large values and partitions them out across the storage cluster (Apache Cassandra). This allows almost all read load to be handled from cache, with only misses hitting the storage. This combination of cache and highly available storage has met the demanding needs of our Live Origin for over a year now.

Storage System High Level Architecture

Delivering this consistent low latency for large writes with cross-region replication and consistent write-through caching to a distributed cache required solving numerous hard problems with novel techniques, which we plan to share in detail during a future post.

Scalability and scalable architecture

Netflix’s live streaming platform must handle a high volume of diverse stream renditions for each live event. This complexity stems from supporting various video encoding formats (each with multiple encoder ladders), numerous audio options (across languages, formats, and bitrates), and different content versions (e.g., with or without advertisements). The combination of these elements, alongside concurrent event support, leads to a significant number of unique stream renditions per live event. This, in turn, necessitates a high Requests Per Second (RPS) capacity from the multi-tenant live origin service to ensure publishing-side scalability.

In addition, Netflix’s global reach presents distinct challenges to the live origin on the retrieval side. During the Tyson vs. Paul fight event in 2024, a historic peak of 65 million concurrent streams was observed. Consequently, a scalable architecture for live origin is essential for the success of large-scale live streaming.

Scaling architecture

We chose to build a highly scalable origin instead of relying on the traditional origin shields approach for better end-to-end cache consistency control and simpler system architecture. The live origin in this architecture directly connects with top-tier Open Connect nodes, which are geographically distributed across several sites. To minimize the load on the origin, only designated nodes per stream rendition at each site are permitted to directly fill from the origin.

Netflix Live Origin Scalability Architecture

While the origin service can autoscale horizontally using EC2 instances, there are other system resources that are not autoscalable, such as storage platform capacity and AWS to Open Connect backbone bandwidth capacity. Since in live streaming, not all requests to the live origin are of the same importance, the origin is designed to prioritize more critical requests over less critical requests when system resources are limited. The table below outlines the request categories, their identification, and protection methods.

Publishing isolation

Publishing traffic, unlike potentially surging CDN retrieval traffic, is predictable, making path isolation a highly effective solution. As shown in the scalability architecture diagram, the origin utilizes separate EC2 publishing and CDN stacks to protect the latency and failure-sensitive origin writes. In addition, the storage abstraction layer features distinct clusters for key-value (KV) read and KV write operations. Finally, the storage layer itself separates read (EVCache) and write (Cassandra) paths. This comprehensive path isolation facilitates independent cloud scaling of publishing and retrieval, and also prevents CDN-facing traffic surges from impacting the performance and reliability of origin publishing.

Priority rate limiting

Given Netflix’s scale, managing incoming requests during a traffic storm is challenging, especially considering non-autoscalable system resources. The Netflix Live Origin implemented priority-based rate limiting when the underlying system is under stress. This approach ensures that requests with greater user impact are prioritized to succeed, while requests with lower user impact are allowed to fail during times of stress in order to protect the streaming infrastructure and are permitted to retry later to succeed.

Leveraging Netflix’s microservice platform priority rate limiting feature, the origin prioritizes live edge traffic over DVR traffic during periods of high load on the storage platform. The live edge vs. DVR traffic detection is based on the predictable segment template. The template is further cached in memory on the origin node to enable priority rate limiting without access to the datastore, which is valuable especially during periods of high datastore stress.

To mitigate traffic surges, TTL cache control is used alongside priority rate limiting. When low-priority traffic is impacted, the origin instructs Open Connect to slow down and cache identical requests for 5 seconds, by returning an HTTP 503 error code with max-age = 5s set. This strategy effectively dampens traffic surges by preventing repeated requests to the origin within that 5-second window.
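
Putting the last two paragraphs together, a simplified sketch of the decision. The segment duration, live-edge window, and names are illustrative; the real implementation uses Netflix's platform rate-limiting feature rather than hand-rolled logic:

import time

SEGMENT_DURATION_S = 2.0
LIVE_EDGE_WINDOW = 3  # how many trailing segments still count as "live edge"

def handle_retrieval(segment_number, availability_start_time,
                     initial_segment_number, datastore_under_stress, now=None):
    now = time.time() if now is None else now
    # Live-edge vs DVR classification straight from the cached segment template,
    # with no datastore access needed.
    live_edge = initial_segment_number + int((now - availability_start_time) // SEGMENT_DURATION_S)
    is_live_edge = segment_number >= live_edge - LIVE_EDGE_WINDOW

    if datastore_under_stress and not is_live_edge:
        # Shed low-priority (DVR) traffic with a cacheable 503 so Open Connect
        # absorbs identical requests for the next 5 seconds.
        return 503, {"Cache-Control": "max-age=5"}
    return 200, {}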

The following diagrams illustrate origin priority rate limiting with simulated traffic. The nliveorigin_mp41 traffic is the low-priority traffic and is mixed with other high-priority traffic. In the first row: the 1st diagram shows the request RPS, the 2nd diagram shows the percentage of request failure. In the second row, the 1st diagram shows datastore resource utilization, and the 2nd diagram shows the origin retrieval P99 latency. The results clearly show that only the low-priority traffic (nliveorigin_mp41) is impacted at datastore high utilization, and the origin request latency is under control.

Origin Priority Rate Limiting

404 storm and cache optimization

Publishing isolation and priority rate limiting successfully protect the live origin from DVR traffic storms. However, the traffic storm generated by requests for non-existent segments presents further challenges and opportunities for optimization.

The live origin structures metadata hierarchically as event > stream rendition > segment, and the segment publishing template is maintained at the stream rendition level. This hierarchical organization allows the origin to preemptively reject requests with an HTTP 404 (Not Found) or 410 (Gone) error, leveraging highly cacheable event and stream rendition level metadata and avoiding unnecessary queries to the segment-level metadata:

  • If the event is unknown, reject the request with 404
  • If the event is known, but the segment request timing does not match the expected publishing timing, reject the request with 404 and cache control TTL matching the expected publishing time
  • If the event is known but the requested segment was never generated or missed the retry deadline, reject the request with a 410 error, preventing the client from repeatedly requesting it

At the storage layer, metadata is stored separately from media data in the control plane datastore. Unlike the media datastore, the control plane datastore does not use a distributed cache to avoid cache inconsistency. Event and rendition level metadata benefits from a high cache hit ratio when in-memory caching is utilized at the live origin instance. During traffic storms involving non-existent segments, the cache hit ratio for control plane access easily exceeds 90%.

The use of in-memory caching for metadata effectively handles 404 storms at the live origin without causing datastore stress. This metadata caching complements the storage system’s distributed media cache, providing a complete solution for traffic surge protection.

Summary

The Netflix Live Origin, built upon an optimized storage platform, is specifically designed for live streaming. It incorporates advanced media and segment publishing scheduling awareness and leverages enhanced intelligence to improve streaming quality, optimize scalability, and improve Open Connect live streaming operations.

Acknowledgement

Many teams and stunning colleagues contributed to the Netflix live origin. Special thanks to Flavio Ribeiro for advocacy and sponsorship of the live origin project; to Raj Ummadisetty, Prudhviraj Karumanchi for the storage platform; to Rosanna Lee, Hunter Ford, and Thiago Pontes for storage lifecycle management; to Ameya Vasani for e2e test framework; Thomas Symborski for orchestrator integration; to James Schek for Open Connect integration; to Kevin Wang for platform priority rate limit; to Di Li, Nathan Hubbard for origin scalability testing.


Netflix Live Origin was originally published in the Netflix TechBlog on Medium.

Stay ahead with Apache Cassandra®: 2025 CEP highlights

Apache Cassandra® committers are working hard, building new features to help you more seamlessly handle the operational challenges of a distributed database. Let’s dive into some recently approved CEPs and explain how these upcoming features will improve your workflow and efficiency.

What is a CEP?

CEP stands for Cassandra Enhancement Proposal. A CEP is the process for outlining, discussing, and gathering endorsements for a new feature in Cassandra. It’s more than a feature request; those who put forth a CEP intend to build the feature, and the proposal encourages a high degree of collaboration with the Cassandra contributors.

The CEPs discussed here were recently approved for implementation or have had significant progress in their implementation.  As with all open-source development, inclusion in a future release is contingent upon successful implementation, community consensus, testing, and approval by project committers.

CEP-42: Constraints framework

With collaboration from NetApp Instaclustr, CEP-42 (and subsequent iterations) delivers schema-level constraints, giving Cassandra users and operators more control over their data. Adding constraints at the schema level means that data can be validated at write time, with the appropriate error returned when data is invalid.

Constraints are defined in-line or as a separate definition. The inline style allows for only one constraint while a definition allows users to define multiple constraints with different expressions.

The initial scope of CEP-42 supported a few constraints that covered the majority of cases, but follow-up efforts expanded support to include scalar comparisons (>, <, >=, <=), LENGTH(), OCTET_LENGTH(), NOT NULL, JSON(), and REGEX(). Users can also define their own constraints by implementing them and putting them on Cassandra’s class path.

A simple example of an in-line constraint:

CREATE TABLE users (
    username text PRIMARY KEY,
    age int CHECK age >= 0 and age < 120
);

Constraints are not supported for UDTs (User-Defined Types) nor collections (except for using NOT NULL for frozen collections).

Enabling constraints closer to the data is a subtle but mighty way for operators to ensure that data goes into the database correctly. By defining rules just once, at the database level, application code becomes simpler and more robust, and validation can no longer be bypassed. Those who have worked with MySQL, Postgres, or MongoDB will enjoy this addition to Cassandra.

CEP-51: Support “Include” Semantics for cassandra.yaml

The cassandra.yaml file holds important settings for storage, memory, replication, compaction, and more. It’s no surprise that the average size of the file is around 1,000 lines (though, yes—most are comments). CEP-51 enables splitting the cassandra.yaml configuration into multiple files using include semantics. From the outside, this feels like a small change, but the implications are huge if a user chooses to opt in.

In general, the size of the configuration file makes it difficult to manage and coordinate changes. It’s often the case that multiple teams manage various aspects of the single file. In addition, the entire cassandra.yaml is readable by anyone with access to the file, meaning private information like credentials is commingled with all other settings. This poses both operational and security risk.

Enabling the new semantics, and therefore modularity, for the configuration file eases management, deployment, and environment-specific settings, and improves security in one shot. The configuration follows the principle of least privilege once cassandra.yaml is broken up into smaller, well-defined files; sensitive configuration settings are separated from general settings, with fine-grained access for the individual files. With the feature enabled, different development teams are better equipped to deploy safely and independently.

If you’ve deployed your Cassandra cluster on the NetApp Instaclustr platform, the cassandra.yaml file is already configured and managed for you. We pride ourselves on making it easy for you to get up and running fast.

CEP-52: Schema annotations for Apache Cassandra

With extensive review by the NetApp Instaclustr team and Stefan Miklosovic, CEP-52 introduces schema annotations in CQL allowing in-line comments and labels of schema elements such as keyspaces, tables, columns, and User Defined Types (UDT). Users can easily define and alter comments and labels on these elements. They can be copied over when desired using CREATE TABLE LIKE syntax. Comments are stored as plain text while labels are stored as structured metadata.

Comments and labels serve different annotation purposes: Comments document what a schema object is for, whereas labels describe how sensitive or controlled it is meant to be. For example, labels can be used to identify columns as “PII” or “confidential”, while the comment on that column explains usage, e.g. “Last login timestamp.”

Users can query these annotations. CEP-52 defines two new read-only tables (system_views.schema_comments and system_views.schema_security_labels) to store comments and security labels, so objects with comments can be returned as a list, or a user/machine process can query for specific labels, which is beneficial for auditing and classification. Note that security labels are descriptive metadata and do not enforce access control to the data.

CEP-53: Cassandra rolling restarts via Sidecar

Sidecar is an auxiliary component in the Cassandra ecosystem that exposes cluster management and streaming capabilities through APIs. Introducing rolling restarts through Sidecar, this feature is designed to provide operators with more efficient and safer restarts without cluster-wide downtime. More specifically, operators can monitor, pause, resume, and abort restarts all through an API with configurable options if restarts fail.

Rolling restarts bring operators a step closer to cluster-wide operations and lifecycle management via Sidecar. Operators will be able to configure the number of nodes to restart concurrently with minimal risk, as this CEP defines clear states as a node progresses through a restart. Accounting for a variety of edge cases, an operator can feel assured that, for example, a non-functioning Sidecar won’t derail operations.

The current process for restarting a node is a multi-step, manual process, which does not scale for large cluster sizes (and is also tedious for small clusters). Restarting clusters previously lacked a streamlined approach as each node needed to be restarted one at a time, making the process time-intensive and error-prone.

Though Sidecar is still considered WIP, it’s got big plans to improve operating large clusters!

The NetApp Instaclustr Platform, in conjunction with our expert TechOps team already orchestrates these laborious tasks for our Cassandra customers with a high level of care to ensure their cluster stays online. Restarting or upgrading your Cassandra nodes is a huge pain-point for operators, but it doesn’t have to be when using our managed platform (with round-the-clock support!)

CEP-54: Zstd with dictionary SSTable compression

CEP-54, with NetApp Instaclustr’s collaboration, aims to add support for Zstd dictionary compression for SSTables. Zstd, or Zstandard, is a fast, lossless data compression algorithm that boasts impressive ratio and speed and has been supported in Cassandra since 4.0. Certain workloads can benefit from significantly faster read/write performance, reduced storage footprint, and increased storage device lifetime when using dictionary compression.

At a high level, operators choose a table they want to compress with a dictionary. A dictionary must first be trained on a small amount of already-present data (no more than 10MiB is recommended). The result of training is a dictionary, which is stored cluster-wide for all other nodes to use, and this dictionary is used for all subsequent writes of SSTables to disk.
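
To see why similar rows benefit from a shared dictionary, here is a standalone sketch using the Python zstandard package. This only illustrates the idea; it is not Cassandra's implementation:

import zstandard as zstd

# Train a dictionary on a small sample of repetitive, similarly-shaped rows.
samples = [f'{{"event":"login","user":"user{i}","status":"ok"}}'.encode() for i in range(1000)]
dictionary = zstd.train_dictionary(100 * 1024, samples)   # ~100KiB, as the CEP recommends

with_dict = zstd.ZstdCompressor(dict_data=dictionary)
plain = zstd.ZstdCompressor()

row = samples[0]
print(len(with_dict.compress(row)), "bytes with a dictionary")
print(len(plain.compress(row)), "bytes without")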

Workloads with structured data of similar rows benefit most from Zstd with dictionary compression. Some examples of ideal workloads include event logs, telemetry data, metadata tables with templated messages. Think: repeated row data. If the table data is too unstructured or random, this feature likely won’t be optimal for dictionary compression, however plain Zstd will still be an excellent option.

New SSTables with dictionaries are readable across nodes and can be streamed, repaired, and backed up. Existing tables are unaffected if dictionary compression is not enabled. Too many unique dictionaries hurt decompression; use minimal dictionaries (the recommended dictionary size is about 100KiB, with one dictionary per table) and only adopt new ones when they’re noticeably better.

CEP-55: Generated role names

CEP-55 adds support to create users/roles without supplying a name, simplifying user management, especially when generating users and roles in bulk. With an example syntax, CREATE GENERATED ROLE WITH GENERATED PASSWORD, new keys are placed in a newly introduced configuration section in cassandra.yaml under “role_name_policy.”

Stefan Miklosovic, our Cassandra engineer at NetApp Instaclustr, created this CEP as a logical follow up to CEP-24 (password validation/generation), which he authored as well. These quality-of-life improvements let operators spend less time doing trivial tasks with high-risk potential and more time on truly complex matters.

Manual name selection seems trivial until a hundred role names need to be generated; now there is a security risk if the new usernames—or worse, passwords—are easily guessable. With CEP-55, the generated role name will be UUID-like, with optional prefix/suffix and size hints; however, a pluggable policy is available to generate and validate names as well. This is an opt-in feature with no effect on the existing method of generating role names.

The future of Apache Cassandra is bright

 These Cassandra Enhancement Proposals demonstrate a strong commitment to making Apache Cassandra more powerful, secure, and easier to operate. By staying on top of these updates, we ensure our managed platform seamlessly supports future releases that accelerate your business needs.

At NetApp Instaclustr, our expert TechOps team already orchestrates laborious tasks like restarts and upgrades for our Apache Cassandra customers, ensuring their clusters stay online. Our platform handles the complexity so you can get up and running fast.

Learn more about our fully managed and hosted Apache Cassandra offering and try it for free today!

The post Stay ahead with Apache Cassandra®: 2025 CEP highlights appeared first on Instaclustr.

Vector search benchmarking: Embeddings, insertion, and searching documents with ClickHouse® and Apache Cassandra®

Welcome back to our series on vector search benchmarking. In part 1, we dove into setting up a benchmarking project and explored how to implement vector search in PostgreSQL from the example code in GitHub. We saw how a hands-on project with students from Northeastern University provided a real-world testing ground for Retrieval-Augmented Generation (RAG) pipelines.

Now, we’re continuing our journey by exploring two more powerful open source technologies: ClickHouse and Apache Cassandra. Both handle vector data differently and understanding their methods is key to effective vector search benchmarking. Using the same student project as our guide, this post will examine the code for embedding, inserting, and retrieving data to see how these technologies stack up.

Let’s get started.

Vector search benchmarking with ClickHouse

ClickHouse is a column-oriented database management system known for its incredible speed in analytical queries. It’s no surprise that it has also embraced vector search. Let’s see how the student project team implemented and benchmarked the core components.

Step 1: Embedding and inserting data

scripts/vectorize_and_upload.py

This is the file that handles Step 1 of the pipeline for ClickHouse. Embeddings in this file (scripts/vectorize_and_upload.py) are used as vector representations of Guardian news articles for the purpose of storing them in a database and performing semantic search. Here’s how embeddings are handled step-by-step (the steps look similar to PostgreSQL).

First up is the generation of embeddings. The same SentenceTransformer model used in part 1 (all-MiniLM-L6-v2) is loaded in the class constructor. In the method generate_embeddings(self, articles), for each article:

  • The article’s title and body are concatenated into a text string.
  • The model generates an embedding vector (self.model.encode(text_for_embedding)), which is a numerical representation of the article’s semantic content.
  • The embedding is added to the article’s dictionary under the key embedding.

Then the embeddings are stored in ClickHouse as follows.

  • The database table guardian_articles is created with an embedding Array(Float64) NOT NULL column specifically to store these vectors (a hedged sketch of this DDL is shown after the list).
  • In upload_to_clickhouse_debug(self, articles_with_embeddings), the script inserts articles into ClickHouse, including the embedding vector as part of each row.
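
The post doesn't include the DDL itself, so here is a hedged sketch using the clickhouse-connect Python client; apart from the embedding Array(Float64) NOT NULL column, the other column names and types, the table engine, and the ordering key are all assumptions:

import clickhouse_connect
from datetime import datetime

client = clickhouse_connect.get_client(host="localhost")

# Hypothetical table definition: only the embedding column is taken from the post.
client.command("""
    CREATE TABLE IF NOT EXISTS guardian_articles (
        url String,
        title String,
        body String,
        publication_date DateTime,
        embedding Array(Float64) NOT NULL
    )
    ENGINE = MergeTree()
    ORDER BY url
""")

# Each inserted row carries its embedding alongside the article fields.
client.insert(
    "guardian_articles",
    [["https://example.com/a1", "A title", "Body text", datetime(2025, 1, 1), [0.1] * 384]],
    column_names=["url", "title", "body", "publication_date", "embedding"],
)
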
Step 2: Vector search and retrieval

services/clickhouse/clickhouse_dao.py

The steps to search are the same as for PostgreSQL in part 1. Here’s part of the related_articles method for ClickHouse:

def related_articles(self, query: str, limit: int = 5):
    """Search for similar articles using vector similarity"""
    ...
    query_embedding = self.model.encode(query).tolist()
    search_query = f"""
        SELECT url, title, body, publication_date,
               cosineDistance(embedding, {query_embedding}) as distance
        FROM guardian_articles
        ORDER BY distance ASC
        LIMIT {limit}
    """
    ...

When searching for related articles, it encodes the query into an embedding, then performs a vector similarity search in ClickHouse using cosineDistance between stored embeddings and the query embedding, and results are ordered by similarity, returning the most relevant articles.

Vector search benchmarking with Apache Cassandra

Next, let’s turn our attention to Apache Cassandra. As a distributed NoSQL database, Cassandra is designed for high availability and scalability, making it an intriguing option for large-scale RAG applications.

Step 1: Embedding and inserting data

scripts/pull_docs_cassandra.py

As in the above examples, embeddings in this file are used to convert article text (body) into numerical vector representations for storage and later retrieval in Cassandra.

For each article, the code extracts the body and computes the embeddings:

embedding = model.encode(body)
embedding_list = [float(x) for x in embedding]
  • model.encode(body) converts the text to a NumPy array of 384 floats.
  • The array is converted to a standard Python list of floats for Cassandra storage.

Next, the embedding is stored in the vector column of the articles table using a CQL INSERT:

insert_cql = SimpleStatement("""
    INSERT INTO articles (url, title, body, publication_date, vector)
    VALUES (%s, %s, %s, %s, %s)
    IF NOT EXISTS;
""")
result = session.execute(insert_cql, (url, title, body, publication_date, embedding_list))

The schema for the table specifies: vector vector<float, 384>, meaning each article has a corresponding 384-dimensional embedding. The code also creates a custom index for the vector column:

session.execute("""
    CREATE CUSTOM INDEX IF NOT EXISTS ann_index
    ON articles(vector) USING 'StorageAttachedIndex';
""")

This enables efficient vector (ANN: Approximate Nearest Neighbor) search capabilities, allowing similarity queries on stored embeddings.

A key part of the setup is the schema and indexing. The Cassandra schema in services/cassandra/init/01-schema.cql defines the vector column.

Cassandra, being a NoSQL database, has schemas that are a bit different from normal SQL databases, so it’s worth taking a closer look. This Cassandra schema is designed to support Retrieval-Augmented Generation (RAG) architectures, which combine information retrieval with generative models to answer queries using both stored data and generative AI. Here’s how the schema supports RAG:

  • Keyspace and table structure
    • Keyspace (vectorembeds): Analogous to a database, this isolates all RAG-related tables and data.
    • Table (articles): Stores retrievable knowledge sources (e.g., articles) for use in generation.
  • Table columns
    • url TEXT PRIMARY KEY: Uniquely identifies each article/document, useful for referencing and deduplication.
    • title TEXT and body TEXT: Store the actual content and metadata, which may be retrieved and passed to the generative model during RAG.
    • publication_date TIMESTAMP: Enables filtering or ranking based on recency.
    • vector VECTOR<FLOAT, 384>: Stores the embedding representation of the article. The new Cassandra vector data type is documented here.
  • Indexing
    • Sets up an Approximate Nearest Neighbor (ANN) index using Cassandra’s Storage Attached Index.

More information about Cassandra vector support is in the documentation.

Step 2: Vector search and retrieval

The retrieval logic in services/cassandra/cassandra_dao.py showcases the elegance of Cassandra’s vector search capabilities.

The code to create the query embeddings and perform the query is similar to the previous examples, but the CQL query to retrieve similar documents looks like this:

query_cql = """
    SELECT url, title, body, publication_date
    FROM articles
    ORDER BY vector ANN OF ?
    LIMIT ?
"""
prepared = self.client.prepare(query_cql)
rows = self.client.execute(prepared, (emb, limit))
What have we learned?

By exploring the code from this RAG benchmarking project we’ve seen distinct approaches to vector search. Here’s a summary of key takeaways:

  • Critical steps in the process:
    • Step 1: Embedding articles and inserting them into the vector databases.
    • Step 2: Embedding queries and retrieving relevant articles from the database.
  • Key design pattern:
    • The DAO (Data Access Object) design pattern provides a clean, scalable way to support multiple databases.
    • This approach could extend to other databases, such as OpenSearch, in the future.
  • Additional insights:
    • It’s possible to perform vector searches over the latest documents, pre-empting queries, and potentially speeding up the pipeline.
What’s next?

So far, we have only scratched the surface. The students built a complete benchmarking application with a GUI (using Streamlit), used multiple other interesting components (e.g. LangChain, LangGraph, FastAPI and uvicorn, plus Grafana and LangSmith for metrics, and Claude to answer questions using the retrieved articles), and added Docker support for the components. They also revealed some preliminary performance results! Here’s what the final system looked like (this and the previous blog focused on the bottom boxes only).

student-built benchmarking application flow chart

In a future article, we will examine the rest of the application code, look at the preliminary performance results the students uncovered, and discuss what they tell us about the trade-offs between these different databases.

Ready to learn more right now? We have a wealth of resources on vector search. You can explore our blogs on ClickHouse vector search and Apache Cassandra Vector Search (here, here, and here) to deepen your understanding.

The post Vector search benchmarking: Embeddings, insertion, and searching documents with ClickHouse® and Apache Cassandra® appeared first on Instaclustr.

Optimizing Cassandra Repair for Higher Node Density

This is the fourth post in my series on improving the cost efficiency of Apache Cassandra through increased node density. In the last post, we explored compaction strategies, specifically the new UnifiedCompactionStrategy (UCS) which appeared in Cassandra 5.

Now, we’ll tackle another aspect of Cassandra operations that directly impacts how much data you can efficiently store per node: repair. Having worked with repairs across hundreds of clusters since 2012, I’ve developed strong opinions on what works and what doesn’t when you’re pushing the limits of node density.

Building easy-cass-mcp: An MCP Server for Cassandra Operations

I’ve started working on a new project that I’d like to share, easy-cass-mcp, an MCP (Model Context Protocol) server specifically designed to assist Apache Cassandra operators.

After spending over a decade optimizing Cassandra clusters in production environments, I’ve seen teams consistently struggle with how to interpret system metrics, configuration settings, and schema design, and most importantly, how to understand how they all impact each other. While many teams have solid monitoring through JMX-based collectors, extracting and contextualizing specific operational metrics for troubleshooting or optimization can still be cumbersome. The good news is that we now have the infrastructure to make all this operational knowledge accessible through conversational AI.

easy-cass-stress Joins the Apache Cassandra Project

I’m taking a quick break from my series on Cassandra node density to share some news with the Cassandra community: easy-cass-stress has officially been donated to the Apache Software Foundation and is now part of the Apache Cassandra project ecosystem as cassandra-easy-stress.

Why This Matters

Over the past decade, I’ve worked with countless teams struggling with Cassandra performance testing and benchmarking. The reality is that stress testing distributed systems requires tools that can accurately simulate real-world workloads. Many tools make this difficult by requiring the end user to learn complex configurations and nuance. While consulting at The Last Pickle, I set out to create an easy-to-use tool that lets people get up and running in just a few minutes.

Azure fault domains vs availability zones: Achieving zero downtime migrations

Among the challenges of operating production-ready enterprise systems in the cloud is ensuring that applications remain up to date, secure, and able to benefit from the latest features. This can include operating system or application version upgrades, but is not limited to them: advancements in cloud provider offerings, and the retirement of older ones, also force change. Recently, NetApp Instaclustr undertook a migration activity, moving (almost) all our Azure fault domain customers to availability zones and their Basic SKU IP addresses to Standard SKU.

Understanding Azure fault domains vs availability zones

“Azure fault domain vs availability zone” reflects a critical distinction in ensuring high availability and fault tolerance. Fault domains offer physical separation within a data center, while availability zones expand on this by distributing workloads across data centers within a region. This enhances resiliency against failures, making availability zones a clear step forward.

The need for migrating from fault domains to availability zones

NetApp Instaclustr has supported Azure as a cloud provider for our Managed open source offerings since 2016. Originally this offering was distributed across fault domains to ensure high availability, using “Basic SKU public IP addresses”, but this solution had some drawbacks when performing particular types of maintenance. Once Azure released availability zones in several regions, we extended our Azure support to them; availability zones have a number of benefits, including more explicit placement of additional resources, and we leveraged “Standard SKU public IPs” as part of this deployment.

When we introduced availability zones, we encouraged customers to provision new workloads in them. We also supported migrating workloads to availability zones, but we had not pushed existing deployments to do the migration. This was initially due to the reduced number of regions that supported availability zones.

In early 2024, we were notified that Azure would be retiring support for Basic SKU public IP addresses in September 2025. Notably, no new Basic SKU public IPs would be created after March 1, 2025. For us and our customers, this had the potential to impact cluster availability and stability – as we would be unable to add nodes, and some replacement operations would fail.

Very quickly we identified that we needed to migrate all customer deployments from Basic SKU to Standard SKU public IPs. Unfortunately, this operation involves node-level downtime: each individual virtual machine must be stopped, the IP address detached, the IP address upgraded to the new SKU, and then reattached before the instance is started again. For customers operating their applications in line with our recommendations, node-level downtime does not impact overall application availability; however, it can increase strain on the remaining nodes.

Given that we needed to perform this potentially disruptive maintenance by a specific date, we decided to evaluate the migration of existing customers to Azure availability zones.

Key migration consideration for Cassandra clusters

As with any migration, we aimed to perform this with zero application downtime, minimal additional infrastructure cost, and as safely as possible. For some customers, we also needed to ensure that we did not change the contact IP addresses of the deployment, as that could require application updates on their side. We quickly worked out several ways to achieve this migration, each with its own set of pros and cons.

For our Cassandra customers, our go-to method for changing cluster topology is a data center migration. This is our zero-downtime migration method that we have completed hundreds of times and have vast experience in executing. The benefit here is that we can be extremely confident of application uptime through the entire operation, and confident in our ability to pause and reverse the migration if issues are encountered. The major drawback of a data center migration is the increased infrastructure cost during the migration period, as you effectively need both your source and destination data centers running simultaneously throughout the operation. The other item of note is that you will need to update your cluster contact points to the new data center.

For clusters running other applications, or for customers who are more cost conscious, we evaluated doing a “node by node” migration from Basic SKU IP addresses in fault domains to Standard SKU IP addresses in availability zones. This does not incur any short-term increase in infrastructure cost; however, the upgrade from a Basic SKU public IP to a Standard SKU is irreversible, and the two types of public IPs cannot coexist within the same fault domain. Additionally, this method comes with reduced rollback abilities. Therefore, we needed to devise a plan that minimized risk for our customers and ensured a seamless migration.

Developing a zero-downtime node-by-node migration strategy

To achieve a zero-downtime “node by node” migration, we explored several options, one of which involved building tooling to migrate the instances in the cloud provider while preserving all existing configuration. The tooling automates the migration process as follows (a rough orchestration sketch follows the list):

  1. Stop the first VM in the cluster. For cluster availability, ensure that only one VM is stopped at any time.
  2. Create an OS disk snapshot and verify its success, then do the same for the data disks.
  3. Ensure all snapshots have been created, then generate new disks from the snapshots.
  4. Create a new network interface card (NIC) and confirm its status is healthy.
  5. Create a new VM and attach the disks, confirming that the new VM is up and running.
  6. Update the private IP address and verify the change.
  7. Upgrade the public IP SKU, making sure this operation succeeds.
  8. Reattach the public IP to the VM.
  9. Start the VM.
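
To make the sequencing concrete, the sketch below outlines this kind of orchestration in Java. It is illustrative only: the AzureOps interface and its methods are hypothetical stand-ins for the underlying Azure management calls, not our actual tooling. The point it shows is that each step is verified before the next begins, and only one VM is ever down at a time.

    import java.util.ArrayList;
    import java.util.List;

    /** Hypothetical stand-in for the Azure management API calls (illustration only). */
    interface AzureOps {
        void stopVm(String vmId);
        String snapshotDisk(String diskId);                   // returns snapshot id, verified internally
        String createDiskFromSnapshot(String snapshotId);     // returns new disk id
        String createNic(String subnetId, String privateIp);  // returns NIC id, confirmed healthy
        String createVm(String availabilityZone, String nicId, List<String> diskIds);
        void upgradePublicIpToStandardSku(String publicIpId);
        void attachPublicIp(String vmId, String publicIpId);
        void startVm(String vmId);
    }

    /** Migrates VMs one at a time so the cluster never has more than one node down. */
    class NodeByNodeMigration {
        private final AzureOps azure;

        NodeByNodeMigration(AzureOps azure) { this.azure = azure; }

        void migrate(List<ClusterVm> vms, String targetZone) {
            for (ClusterVm vm : vms) {
                azure.stopVm(vm.id());                                       // step 1

                String osSnapshot = azure.snapshotDisk(vm.osDiskId());       // step 2
                List<String> newDisks = new ArrayList<>();
                newDisks.add(azure.createDiskFromSnapshot(osSnapshot));      // step 3
                for (String dataDisk : vm.dataDiskIds()) {
                    newDisks.add(azure.createDiskFromSnapshot(azure.snapshotDisk(dataDisk)));
                }

                String nic = azure.createNic(vm.subnetId(), vm.privateIp()); // steps 4 and 6
                String newVm = azure.createVm(targetZone, nic, newDisks);    // step 5

                azure.upgradePublicIpToStandardSku(vm.publicIpId());         // step 7
                azure.attachPublicIp(newVm, vm.publicIpId());                // step 8
                azure.startVm(newVm);                                        // step 9
            }
        }

        record ClusterVm(String id, String osDiskId, List<String> dataDiskIds,
                         String subnetId, String privateIp, String publicIpId) {}
    }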

Even though the disks were created from snapshots of the original disks, our testing uncovered several discrepancies between the settings of the original VM and the new VM. For instance, certain configurations, such as caching policies, did not automatically carry over, requiring manual adjustments to align with our managed standards.

Recognizing these challenges, we decided to extend our existing node replacement mechanism to streamline the migration process. With this mechanism, a new instance is provisioned with a fresh OS disk while keeping the same IP address and application data, and the Instaclustr Managed Platform configures the new node to match the original.

The next challenge: our existing solution was built so that the replacement node was provisioned to be exactly the same as the original. For this operation, however, we needed the new node to be placed in an availability zone rather than the same fault domain. This required us to extend the replacement operation so that, when triggered, the new node was placed in the desired availability zone. Once this work was complete, we had a replacement tool that ensured the new instance was correctly provisioned in the availability zone, with a Standard SKU public IP, and without data loss.

Now that we had two very viable options, we went back to our existing Azure customers to outline the problem space and the operations that needed to be completed. We worked with all impacted customers to choose the best migration path for their specific use case or application and the best time to complete the migration. Where possible, we first performed the migration on any test or QA environments before moving on to production environments.

Collaborative customer migration success

Some of our Cassandra customers opted to perform the migration using our data center migration path; however, most customers opted for the node-by-node method. We successfully migrated the existing Azure fault domain clusters over to the availability zones we were targeting, with only a very small number of clusters remaining. Those clusters are operating in Azure regions that do not yet support availability zones, but we were still able to upgrade their public IPs from the Basic SKUs that are set for retirement to Standard SKUs.

No matter which provider you use, the pace of development in cloud computing can require significant effort for ongoing maintenance and for adopting new features and opportunities. For business-critical applications, being able to migrate to new infrastructure and leverage these opportunities, while understanding their limitations and their impact on other services, is essential.

NetApp Instaclustr has a depth of experience in supporting business-critical applications in the cloud. You can read more about another large-scale migration we completed, The World’s Largest Apache Kafka and Apache Cassandra Migration, or head over to our console for a free trial of the Instaclustr Managed Platform.


Integrating support for AWS PrivateLink with Apache Cassandra® on the NetApp Instaclustr Managed Platform

Discover how NetApp Instaclustr leverages AWS PrivateLink for secure and seamless connectivity with Apache Cassandra®. This post explores the technical implementation, challenges faced, and the innovative solutions we developed to provide a robust, scalable platform for your data needs.

Last year, NetApp achieved a significant milestone by fully integrating AWS PrivateLink support for Apache Cassandra® into the NetApp Instaclustr Managed Platform. Read our AWS PrivateLink support for Apache Cassandra General Availability announcement here. Our Product Engineering team made remarkable progress in incorporating this feature into various NetApp Instaclustr application offerings. NetApp now offers AWS PrivateLink support as an Enterprise Feature add-on for the Instaclustr Managed Platform for Cassandra, Kafka®, OpenSearch®, Cadence®, and Valkey™.

The journey to support AWS PrivateLink for Cassandra involved considerable engineering effort and numerous development cycles to create a solution tailored to the unique interaction between the Cassandra application and its client driver. After extensive development and testing, our product engineering team successfully implemented an enterprise-ready solution. Read on for detailed insights into the technical implementation of our solution.

What is AWS PrivateLink?

PrivateLink is a networking solution from AWS that provides private connectivity between Virtual Private Clouds (VPCs) without exposing any traffic to the public internet. This solution is ideal for customers who require a unidirectional network connection (often due to compliance concerns), ensuring that connections can only be initiated from the source VPC to the destination VPC. Additionally, PrivateLink simplifies network management by eliminating the need to manage overlapping CIDRs between VPCs. The one-way connection allows connections to be initiated only from the source VPC to the managed cluster hosted in our platform (target VPC)—and not the other way around.

To get an idea of what major building blocks are involved in making up an end-to-end AWS PrivateLink solution for Cassandra, take a look at the following diagram—it’s a simplified representation of the infrastructure used to support a PrivateLink cluster:

[Diagram: simplified representation of the infrastructure used to support a PrivateLink cluster]

In this example, we have a 3-node Cassandra cluster at the far right, with one Cassandra node per Availability Zone (or AZ). Next, we have the VPC Endpoint Service and a Network Load Balancer (NLB). The Endpoint Service is essentially the AWS PrivateLink, and by design AWS requires it to be backed by an NLB; that’s pretty much what we have to manage on our side.

On the customer side, customers must create a VPC Endpoint that enables them to privately connect to the AWS PrivateLink on our end; naturally, they will also need a Cassandra client (or clients) to connect to the cluster.

AWS PrivateLink support with Instaclustr for Apache Cassandra

To incorporate AWS PrivateLink support with Instaclustr for Apache Cassandra on our platform, we had to work through a few technical challenges. The primary one was relatively straightforward to state: Cassandra clients need to talk to each individual node in a cluster.

However, the problem is that nodes in an AWS PrivateLink cluster are only assigned private IPs; that is what the nodes would announce by default when Cassandra clients attempt to discover the topology of the cluster. Cassandra clients cannot do much with the received private IPs as they cannot be used to connect to the nodes directly in an AWS PrivateLink setup.

We devised a plan of attack to get around this problem:

  • Make each individual Cassandra node listen for CQL queries on unique ports.
  • Configure the NLB so it can route traffic to the appropriate node based on the relevant unique port.
  • Let clients implement the AddressTranslator interface from the Cassandra driver. The custom address translator will need to translate the received private IPs to one of the VPC Endpoint Elastic Network Interface (or ENI) IPs without altering the corresponding unique ports.

To understand this approach better, consider the following example:

Suppose we have a 3-node Cassandra cluster. According to the proposed approach, we will need to do the following:

  • Let the nodes listen on 172.16.0.1:6001 (in AZ1), 172.16.0.2:6002 (in AZ2), and 172.16.0.3:6003 (in AZ3)
  • Configure the NLB to listen on the same set of ports
  • Define and associate target groups based on the port. For instance, the listener on port 6002 will be associated with a target group containing only the node that is listening on port 6002.
  • As for how the custom address translator is expected to work, let’s assume the VPC Endpoint ENI IPs are 192.168.0.1 (in AZ1), 192.168.0.2 (in AZ2) and 192.168.0.3 (in AZ3). The address translator should translate received addresses like so:
    - 172.16.0.1:6001 --> 192.168.0.1:6001
    - 172.16.0.2:6002 --> 192.168.0.2:6002
    - 172.16.0.3:6003 --> 192.168.0.3:6003

The proposed approach not only solves the connectivity problem but also allows for connecting to appropriate nodes based on query plans generated by load balancing policies.
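
To make the client-side piece more concrete, here is a minimal sketch of such a translator, assuming the DataStax Java driver 4.x AddressTranslator interface and a hypothetical static mapping from the private node IPs above to the VPC Endpoint ENI IPs in the matching AZ (ports are left untouched):

    import com.datastax.oss.driver.api.core.addresstranslation.AddressTranslator;
    import com.datastax.oss.driver.api.core.context.DriverContext;

    import java.net.InetSocketAddress;
    import java.util.Map;

    /**
     * Illustrative sketch: maps the private IPs advertised by the Cassandra nodes
     * to the VPC Endpoint ENI IP in the same AZ, keeping the per-node port intact.
     * The addresses mirror the example above and are not real endpoints.
     */
    public class PrivateLinkAddressTranslator implements AddressTranslator {

        private static final Map<String, String> PRIVATE_TO_ENI = Map.of(
                "172.16.0.1", "192.168.0.1",   // AZ1
                "172.16.0.2", "192.168.0.2",   // AZ2
                "172.16.0.3", "192.168.0.3");  // AZ3

        public PrivateLinkAddressTranslator(DriverContext context) {
            // The driver typically instantiates translators reflectively with the driver context.
        }

        @Override
        public InetSocketAddress translate(InetSocketAddress address) {
            String eniIp = PRIVATE_TO_ENI.get(address.getAddress().getHostAddress());
            // Unknown addresses pass through unchanged; the unique port is always preserved.
            return eniIp == null ? address : new InetSocketAddress(eniIp, address.getPort());
        }

        @Override
        public void close() {
            // Nothing to release in this sketch.
        }
    }

With driver 4.x, a translator like this would typically be wired in through the advanced.address-translator.class setting in the driver configuration.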

Around the same time, we also came up with a slightly modified approach: we realized the need for address translation could mostly be avoided if we made the Cassandra nodes return the VPC Endpoint ENI IPs in the first place.

But the excitement did not last for long! Why? Because we quickly discovered a key constraint: an AWS NLB supports a maximum of just 50 listeners.

While 50 is certainly a decent limit, the way we designed our solution meant we wouldn’t be able to provision a cluster with more than 50 nodes. This was quickly deemed to be an unacceptable limitation as it is not uncommon for a cluster to have more than 50 nodes; many Cassandra clusters in our fleet have hundreds of nodes. We had to abandon the idea of address translation and started thinking about alternative solution approaches.

Introducing Shotover Proxy

We were disappointed but did not lose hope. Soon after, we devised a practical solution centered on one of our open source products: Shotover Proxy.

What is Shotover Proxy? Shotover is a layer 7 database proxy built to allow developers, admins, DBAs, and operators to modify in-flight database requests. On the Instaclustr Managed Platform, Shotover Proxy is deployed with Cassandra clusters to support AWS PrivateLink. By managing database requests in transit, Shotover gives NetApp Instaclustr customers AWS PrivateLink’s simple and secure network setup along with the many benefits of Cassandra.

Below is an updated version of the previous diagram that introduces some Shotover nodes in the mix:

[Diagram: simplified representation of the infrastructure used to support a PrivateLink cluster, with Shotover nodes included]

As you can see, each AZ now has a dedicated Shotover proxy node.

In the above diagram, we have a 6-node Cassandra cluster. The Cassandra cluster sitting behind the Shotover nodes is an ordinary Private Network Cluster. The role of the Shotover nodes is to manage client requests to the Cassandra nodes while masking the real Cassandra nodes behind them. To the Cassandra client, the Shotover nodes appear to be Cassandra nodes, and they alone appear to make up the entire cluster! This is the secret recipe behind AWS PrivateLink for Instaclustr for Apache Cassandra, and it is what let us get past the challenges discussed earlier.

So how is this model made to work?

Shotover can alter certain requests from—and responses to—the client. Each Shotover node examines the tokens allocated to the Cassandra nodes in its own AZ (aka rack) and claims to be the owner of all those tokens. This essentially makes it appear to be an aggregation of the nodes in its own rack.

Given the purposely crafted topology and token allocation metadata, while the client directs queries to the Shotover node, the Shotover node in turn can pass them on to the appropriate Cassandra node and then transparently send responses back. It is worth noting that the Shotover nodes themselves do not store any data.
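
One way to observe this illusion from the client side is to dump the driver’s view of the topology: against a PrivateLink cluster, only the Shotover endpoints should appear, one per AZ. Below is a minimal sketch using the Java driver 4.x metadata API; the contact point, port, and data center name are placeholders:

    import com.datastax.oss.driver.api.core.CqlSession;
    import com.datastax.oss.driver.api.core.metadata.Node;

    import java.net.InetSocketAddress;

    public class TopologyDump {
        public static void main(String[] args) {
            // Placeholder contact point: a VPC Endpoint ENI IP and the advertised CQL port.
            try (CqlSession session = CqlSession.builder()
                    .addContactPoint(new InetSocketAddress("192.168.0.1", 9042))
                    .withLocalDatacenter("AWS_VPC_US_EAST_1")   // placeholder DC name
                    .build()) {

                // Against a PrivateLink cluster this prints only the Shotover nodes
                // (one per AZ/rack), even though the Cassandra cluster behind them may
                // have many more nodes: each Shotover node advertises the tokens of
                // every Cassandra node in its rack as its own.
                for (Node node : session.getMetadata().getNodes().values()) {
                    System.out.printf("%s  dc=%s  rack=%s%n",
                            node.getEndPoint(), node.getDatacenter(), node.getRack());
                }
            }
        }
    }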

Because we only have one Shotover node per AZ in this design, and a region has at most about 5 AZs, we only need that many listeners on the NLB to make this mechanism work. As such, the 50-listener limit on the NLB was no longer a problem.

The use of Shotover to manage client driver and cluster interoperability may sound straightforward to implement, but developing it was a year-long undertaking. As described above, the initial months of development were devoted to engineering the unique-port CQL listeners and the AddressTranslator-based client configuration to gracefully manage client connections to the Cassandra cluster. While this solution did successfully provide support for AWS PrivateLink with a Cassandra cluster, we knew that the 50-listener limit on the NLB was a barrier to adoption, and we wanted to provide our customers with a solution that could be used for any Cassandra cluster, regardless of node count.

The next few months of engineering were devoted to a proof of concept of an alternative solution, with the goal of investigating how Shotover could manage client requests for a Cassandra cluster with any number of nodes. Once that approach was proven, subsequent effort went into stability testing the new solution; the result of that engineering is the stable solution described above.

We also conducted performance testing to evaluate the relative performance of a PrivateLink-enabled Cassandra cluster compared to its non-PrivateLink counterpart. Multiple iterations of performance testing were executed, as the test cases identified some adjustments to Shotover; after those adjustments, the PrivateLink-enabled cluster’s throughput and latency measured close to those of a standard Cassandra cluster.

Related content: Read more about creating an AWS PrivateLink-enabled Cassandra cluster on the Instaclustr Managed Platform

The following was our experimental setup for identifying the maximum throughput, in operations per second, of a PrivateLink-enabled Cassandra cluster in comparison to a non-PrivateLink Cassandra cluster:

  • Baseline node size: i3en.xlarge
  • Shotover Proxy node size on Cassandra Cluster: CSO-PRD-c6gd.medium-54
  • Cassandra version: 4.1.3
  • Shotover Proxy version: 0.2.0
  • Other configuration: Repair and backup disabled, Client Encryption disabled

Throughput results (operation rate, operations/second)

Operation               With PrivateLink and Shotover   Without PrivateLink
Mixed-small (3 Nodes)   16608                           16206
Mixed-small (6 Nodes)   33585                           33598
Mixed-small (9 Nodes)   51792                           51798

Across different cluster sizes, we observed no significant difference in operation throughput between PrivateLink and non-PrivateLink configurations.

Latency results

Latency benchmarks were conducted at ~70% of the observed peak throughput (as above) to simulate realistic production traffic.

Operation               Ops/second   Setup             Mean (ms)   Median (ms)   P95 (ms)   P99 (ms)
Mixed-small (3 Nodes)   11630        Non-PrivateLink   9.90        3.2           53.7       119.4
                                     PrivateLink       9.50        3.6           48.4       118.8
Mixed-small (6 Nodes)   23510        Non-PrivateLink   6.0         2.3           27.2       79.4
                                     PrivateLink       9.10        3.4           45.4       104.9
Mixed-small (9 Nodes)   36255        Non-PrivateLink   5.5         2.4           21.8       67.6
                                     PrivateLink       11.9        2.7           77.1       141.2

Results indicate that at lower to mid-tier throughput levels, AWS PrivateLink introduced minimal to negligible overhead. At higher operation rates, however, we observed increased latency, most notably at the p99 mark—likely due to network-level factors or Shotover itself.

The increase in latency is expected as AWS PrivateLink introduces an additional hop to route traffic securely, which can impact latencies, particularly under heavy load. For the vast majority of applications, the observed latencies remain within acceptable ranges. However, for latency-sensitive workloads, we recommend adding more nodes (for high load cases) to help mitigate the impact of the additional network hop introduced by PrivateLink.

As with any generic benchmarking results, performance may vary depending on the specific data model, workload characteristics, and environment. The results presented here are based on a specific experimental setup using standard configurations and should primarily be used to compare the relative performance of PrivateLink vs. non-PrivateLink networking under similar conditions.

Why choose AWS PrivateLink with NetApp Instaclustr?

NetApp’s commitment to innovation means you benefit from cutting-edge technology combined with ease of use. With AWS PrivateLink support on our platform, customers gain:

  • Enhanced security: All traffic stays private, never touching the internet.
  • Simplified networking: No need to manage complex CIDR overlaps.
  • Enterprise scalability: Handles sizable clusters effortlessly.

By addressing challenges such as the NLB listener cap and private-IP-to-VPC-Endpoint address translation, we’ve created a solution that balances efficiency, security, and scalability.

Experience PrivateLink today

The integration of AWS PrivateLink with Apache Cassandra® is now generally available with production-ready SLAs for our customers. Log in to the Console to create a Cassandra cluster with support for AWS PrivateLink with just a few clicks today. Whether you’re managing sensitive workloads or demanding performance at scale, this feature delivers unmatched value.

Want to see it in action? Book a free demo today and experience the Shotover-powered magic of AWS PrivateLink firsthand.

Resources
  • Getting started: Visit the documentation to learn how to create an AWS PrivateLink-enabled Apache Cassandra cluster on the Instaclustr Managed Platform.
  • Connecting clients: Already created a Cassandra cluster with AWS PrivateLink? Click here to read about how to connect Cassandra clients in one VPC to an AWS PrivateLink-enabled Cassandra cluster on the Instaclustr Platform.
  • General availability announcement: For more details, read our General Availability announcement on AWS PrivateLink support for Cassandra.


Compaction Strategies, Performance, and Their Impact on Cassandra Node Density

This is the third post in my series on optimizing Apache Cassandra for maximum cost efficiency through increased node density. In the first post, I examined how streaming operations impact node density and laid out the groundwork for understanding why higher node density leads to significant cost savings. In the second post, I discussed how compaction throughput is critical to node density and introduced the optimizations we implemented in CASSANDRA-15452 to improve throughput on disaggregated storage like EBS.

Cassandra Compaction Throughput Performance Explained

This is the second post in my series on improving node density and lowering costs with Apache Cassandra. In the previous post, I examined how streaming performance impacts node density and operational costs. In this post, I’ll focus on compaction throughput, and a recent optimization in Cassandra 5.0.4 that significantly improves it, CASSANDRA-15452.

This post assumes some familiarity with Apache Cassandra storage engine fundamentals. The documentation has a nice section covering the storage engine if you’d like to brush up before reading this post.

How Cassandra Streaming, Performance, Node Density, and Cost Are All Related

This is the first post of several I have planned on optimizing Apache Cassandra for maximum cost efficiency. I’ve spent over a decade working with Cassandra and have spent tens of thousands of hours data modeling, fixing issues, writing tools for it, and analyzing its performance. I’ve always been fascinated by database performance tuning, even before Cassandra.

A decade ago I filed one of my first issues with the project, where I laid out my target goal of 20TB of data per node. This wasn’t possible for most workloads at the time, but I’ve kept this target in my sights.

Cassandra 5 Released! What's New and How to Try it

Apache Cassandra 5.0 has officially landed! This highly anticipated release brings a range of new features and performance improvements to one of the most popular NoSQL databases in the world. Having recently hosted a webinar covering the major features of Cassandra 5.0, I’m excited to give a brief overview of the key updates and show you how to easily get hands-on with the latest release using easy-cass-lab.

You can grab the latest release on the Cassandra download page.

easy-cass-lab v5 released

I’ve got some fun news to start the week off for users of easy-cass-lab: I’ve just released version 5. There are a number of nice improvements and bug fixes in here that should make it more enjoyable and more useful, and that lay the groundwork for some future enhancements.

  • When the cluster starts, we wait for the storage service to reach NORMAL state before moving to the next node. This is in contrast to the previous behavior, where we waited for 2 minutes after starting a node. The new check queries JMX directly using Swiss Java Knife and is more reliable than the 2-minute method (a rough standalone sketch of the JMX check appears after this list). Please see packer/bin-cassandra/wait-for-up-normal to read through the implementation.
  • Trunk now works correctly. Unfortunately, AxonOps doesn’t support trunk (5.1) yet, and using the agent was causing a startup error. You can test trunk out, but for now the AxonOps integration is disabled.
  • Added a new repl mode. This saves keystrokes and provides some auto-complete functionality and keeps SSH connections open. If you’re going to do a lot of work with ECL this will help you be a little more efficient. You can try this out with ecl repl.
  • Power user feature: Initial support for profiles in AWS regions other than us-west-2. We only provide AMIs for us-west-2, but you can now set up a profile in an alternate region, and build the required AMIs using easy-cass-lab build-image. This feature is still under development and requires using an easy-cass-lab build from source. Credit to Jordan West for contributing this work.
  • Power user feature: Support for multiple profiles. Setting the EASY_CASS_LAB_PROFILE environment variable allows you to configure alternate profiles. This is handy if you want to use multiple regions or have multiple organizations.
  • The project now uses Kotlin instead of Groovy for Gradle configuration.
  • Updated Gradle to 8.9.
  • When using the list command, don’t show the alias “current”.
  • Project cleanup: removed the old, unused pssh, cassandra build, and async profiler subprojects.
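
For the curious, the wait-for-NORMAL check is conceptually similar to polling the OperationMode attribute of Cassandra’s StorageService MBean over JMX until it reports NORMAL; the wait-for-up-normal script mentioned above is the authoritative implementation. A rough standalone sketch, where the host, JMX port, and timeout values are illustrative:

    import javax.management.MBeanServerConnection;
    import javax.management.ObjectName;
    import javax.management.remote.JMXConnector;
    import javax.management.remote.JMXConnectorFactory;
    import javax.management.remote.JMXServiceURL;

    public class WaitForNormal {
        public static void main(String[] args) throws Exception {
            String host = args.length > 0 ? args[0] : "127.0.0.1";   // illustrative default
            JMXServiceURL url = new JMXServiceURL(
                    "service:jmx:rmi:///jndi/rmi://" + host + ":7199/jmxrmi");

            try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
                MBeanServerConnection mbeans = connector.getMBeanServerConnection();
                ObjectName storageService =
                        new ObjectName("org.apache.cassandra.db:type=StorageService");

                // Poll until the node reports NORMAL (it has joined the ring),
                // instead of sleeping for a fixed two minutes.
                long deadline = System.currentTimeMillis() + 10 * 60 * 1000;
                while (System.currentTimeMillis() < deadline) {
                    String mode = (String) mbeans.getAttribute(storageService, "OperationMode");
                    if ("NORMAL".equals(mode)) {
                        System.out.println(host + " is NORMAL");
                        return;
                    }
                    Thread.sleep(5_000);
                }
                throw new IllegalStateException(host + " did not reach NORMAL in time");
            }
        }
    }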

The release has been published to the project’s GitHub page and to Homebrew. The project is largely driven by my own consulting needs and my training work. If you’re looking to have some features prioritized, please reach out and we can discuss a consulting engagement.

easy-cass-lab updated with Cassandra 5.0 RC-1 Support

I’m excited to announce that the latest version of easy-cass-lab now supports Cassandra 5.0 RC-1, which was just made available last week! This update marks a significant milestone, providing users with the ability to test and experiment with the newest Cassandra 5.0 features in a simplified manner. This post will walk you through how to set up a cluster, SSH in, and run your first stress test.

For those new to easy-cass-lab, it’s a tool designed to streamline the setup and management of Cassandra clusters in AWS, making it accessible for both new and experienced users. Whether you’re running tests, developing new features, or just exploring Cassandra, easy-cass-lab is your go-to tool.