The Taming of Collection Scans

Explore several collection layouts for efficient scanning, including a split-list structure that avoids extra memory.

Here’s a puzzle that I came across when trying to make task wake-up in Seastar a no-exception-throwing operation (related issue): come up with a collection of objects optimized for “scanning” usage. That is, when iterating over all elements of the collection, maximize the hardware utilization to process a single element as fast as possible. And, as always, we’re expected to minimize the amount of memory needed to maintain the collection. This seemingly simple puzzle will demonstrate some hidden effects of a CPU’s data processing.

Looking ahead, such a collection can be used, for example, as a queue of running tasks. New tasks are added at the back of the queue; when processed, the queue is scanned in front-to-back order, and all tasks are usually processed. Throughout this article, we’ll refer to this use case of the collection being a queue of tasks to execute, with occasional side notes using this scenario to demonstrate various concerns.

We will explore different ways to solve this puzzle of organizing collections for efficient scanning. First, we compare three collections: an array, an intrusive list, and an array of pointers. You will see that the scanning performance of these collections differs greatly and depends heavily on the way adjacent elements are referenced by the collection. After analyzing the way the processor executes the scanning code, we suggest a new collection called a “split list.” Although this new collection seems awkward and bulky, it ultimately provides excellent scanning performance and memory efficiency.

Classical solutions

First, consider two collections that usually come to mind: a plain sequential array of elements and a linked list of elements. The latter is sometimes unavoidable, particularly when the elements need to be created and destroyed independently and cannot be freely moved across memory. As a test, we’ll use elements that contain random, pre-populated integers and a loop that walks all elements in the collection and calculates the sum of those integers.

Every programmer knows that in this case an array of integers will win because of cache efficiency. To exclude the obvious advantage of one collection over another, we’ll penalize the array and prioritize the list. First, each element will occupy a 64-byte slot even when placed in an array, so walking a plain array doesn’t benefit from caching several adjacent elements. Second, we will use an intrusive list, which means that the “next” pointer is stored next to the element value itself; the processor can then read both the pointer and the value with a single fetch from memory to cache. The expectation here is that both collections will behave the same. However, a scanning test shows that’s not true, especially at large scale.

The plot above shows the time to process a single entry (vertical axis) versus the number of elements in the list (horizontal axis). Both axes use a logarithmic scale because the collection size was increased ten times at each new test step. Plus, the vertical axis just looks better this way.

So now we have two collections – an array and a list – and the list’s performance is worse than the array’s. However, as mentioned above, the list has an undeniable advantage: elements in the list are independent of each other (in the sense that they can be allocated and destroyed independently).
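To make the comparison concrete, here is a minimal sketch of the kind of benchmark described above. It is not the article’s actual code; the element layout and names (Element, scan_array, scan_list) are illustrative assumptions.

    #include <cstdint>
    #include <vector>

    // Each element occupies a full 64-byte slot, so the plain array gets no
    // "several elements per cache line" advantage over the list.
    struct alignas(64) Element {
        uint64_t value;           // the random, pre-populated integer we sum up
        Element* next = nullptr;  // intrusive "next" pointer (used by the list)
        char pad[48];             // pad the element up to 64 bytes
    };

    // Plain array scan: the index advances independently of the data being read.
    uint64_t scan_array(const std::vector<Element>& elements) {
        uint64_t sum = 0;
        for (const Element& e : elements) {
            sum += e.value;
        }
        return sum;
    }

    // Intrusive-list scan: each iteration needs the pointer loaded by the
    // previous one, so the memory accesses form a serial dependency chain.
    uint64_t scan_list(const Element* head) {
        uint64_t sum = 0;
        for (const Element* e = head; e != nullptr; e = e->next) {
            sum += e->value;
        }
        return sum;
    }

Both loops compute the same sum; what differs is how each one reaches the next element.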
A less obvious advantage is that the data type stored in a list collection can be an abstract class, while the actual elements stored in the list can be specific classes that inherit from that base class. The ability to collect objects of different types can be crucial in the task-processing scenario described above, where a task is described by an abstract base class and specific task implementations inherit from it and implement their own execution methods.

Is it possible to build a collection that can maintain its elements independently, as a list of elements does, yet still provide scanning performance that’s the same as (or close to) that of the array?

Not so classical solution

Let’s make an array of elements “dispersed,” like a list, in a straightforward manner: turn each array element into a pointer to that element, and allocate the element itself elsewhere, as if it were prepared to be inserted into a list. In this array, the pointers are not padded out to a cache line each, letting the processor benefit from reading several pointers from memory at once. Elements are still 64 bytes in size, to be consistent with previous tests.

The memory for pointers is allocated contiguously, with a single large allocation. This is not ideal for a dynamic collection, where the number of elements is not known beforehand: the larger the collection grows, the more re-allocations are needed. It’s possible to overcome this by maintaining a list of sub-arrays. Looking ahead, just note that this chunked array of pointers will indeed behave slightly worse than a contiguous one. All further measurements and analysis refer to the contiguous collection.

This approach actually looks worse than the linked list because it occupies more memory than the list. Also, when walking the list, the code touches one cache line per element – but when walking this collection, it additionally populates the cache with the contents of that array of pointers. Running the same scanning test shows that this cost is imaginary and the collection beats the list several times, approaching the plain array in its per-element efficiency.
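A sketch of the corresponding scan for this layout, reusing the illustrative Element type from the previous snippet (again an assumption, not the article’s code):

    // Array of pointers: the pointers are densely packed (eight per cache line),
    // while the 64-byte elements live wherever they happened to be allocated.
    uint64_t scan_pointer_array(const std::vector<const Element*>& ptrs) {
        uint64_t sum = 0;
        for (const Element* e : ptrs) {
            sum += e->value;  // fetching ptrs[i + 1] does not depend on *ptrs[i]
        }
        return sum;
    }

Advancing through ptrs never depends on the element data that was just loaded, and that independence is what the next section explains at the instruction level.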
The processor’s inner parallelism

To get an idea of why an array of pointers works better than the intrusive list, let’s drill down to the assembly level and analyze how the instructions are executed by the processor. Here’s what the array-scanning main loop looks like in assembly:

    x: mov (%rdi,%rax,8),%rax
       mov 0x10(%rax),%eax
       add %rax,%rcx
       lea 0x1(%rdx),%eax
       mov %rax,%rdx
       cmp %rsi,%rax
       jb x

We can see two memory accesses – the first moves the pointer to the array element into the ‘rax’ register, and the second fetches the value from the element into its ‘eax’ 32-bit sub-part. Then comes in-register math and a conditional jump back to the start of the loop to process the next element.

The main loop of the list-scanning code looks much shorter:

    x: mov 0x10(%rdi),%edx
       mov (%rdi),%rdi
       add %rdx,%rax
       test %rdi,%rdi
       jne x

Again, there are two memory accesses – the first fetches the element’s value into the ‘edx’ register and the next one fetches the pointer to the next element into the ‘rdi’ register.

Instructions that involve fetching data from memory can be split into four stages:

i-fetch – The processor fetches the instruction itself from memory. In our case, the instruction is likely in the instruction cache, so the fetch goes very fast.
decode – The processor decides what the instruction should do and what operands it needs.
m-fetch – The processor reads the data it needs from memory. In our case, elements are always read from memory because they are “large enough” not to be fetched into cache with anything else, while array pointers are likely to sit in cache.
exec – The processor executes the instruction.

Let’s illustrate this sequence with a color bar:

Also, we know that modern processors can run multiple instructions in parallel, by executing parts of different instructions at the same time in different stages of the pipeline, as well as running instructions fully in parallel. One example of this parallel execution can be seen in the array-scanning loop above, namely the

    add %rax,%rcx
    lea 0x1(%rdx),%eax

pair. Here, the second instruction is the increment of the index that’s used to scan through the array of pointers. The compiler rendered this as an lea instruction instead of an inc (or add) one because inc and lea are executed in different parts of the pipeline. When placed back-to-back, they will truly run in parallel. If inc were used, the second instruction would have to spend some time in the same pipeline stage as the add.

Here’s what executing the above array scan can look like:

Here, fetching the element pointer from the array is short because it likely happens from cache. Fetching the element’s value is long (and painted darker) because the element is most certainly not in cache (and thus requires a memory read). Also, fetching the value from the element happens only after the element pointer is fetched into the register. Similarly, the instruction that adds the value to the result cannot execute before the value itself is fetched from memory, so it waits after being decoded.

And here’s what scanning the list can look like:

At first glance, almost nothing changed. The difference is that the next pointer is fetched from memory and takes a long time, while the value is fetched from cache (and is faster). Also, fetching the value can start before the next pointer is saved into the register. Considering that during an array scan, the “read element pointer from array” step is long at times (e.g., when it needs to read the next cache line from memory), it’s still not clear why list scanning doesn’t win at all.

In order to see why the array of pointers wins, we need to combine two consecutive loop iterations. First comes the array scan:

It’s not obvious, but two loop iterations can run like that. Fetching the pointer for the next element is pretty much independent from fetching the pointer of the previous element; it’s just the next element of an array that’s already in the cache. Just as they predict branches, processors can “predict” that the next memory fetch will come from the pointer sitting next to the one being processed and start loading it ahead of time. List scanning cannot afford that parallelism even if the processor “foresees” that the fetched pointer will be dereferenced. As a consequence, its two loop iterations end up being serialized:

Note that the processor can only start fetching the next element after it finishes fetching the next pointer itself, so the parallelism of accessing elements is greatly penalized here. Also note that despite how it seems in the above images, scanning the list can be many times slower than scanning the array, because blue bars (memory fetches) are in reality many times longer than the others (e.g., those fetching the instruction, decoding it, and storing the result in the register).
A compromise solution

The array of pointers turned out to be a much better solution than the list of elements, but it still has an inefficiency: extra memory that can grow large. We can say that this algorithm has O(N) memory complexity, meaning that it requires extra memory proportional to the number of elements in the collection. Allocating it can be troublesome for many reasons – for example, because of memory fragmentation, and because, at large scale, growing the array requires copying all the pointers from one place to another. There are ways to mitigate the problem of maintaining this extra memory, but is it possible to eliminate it completely? Or at least make it “constant complexity” (i.e., independent of the number of elements in the collection)?

The requirement not to allocate extra memory can be crucial in the task-processing scenario. There, the auxiliary memory is allocated when an element is appended to the existing collection, and a new task is appended to the run-queue when it’s being woken up. If the allocation fails, the append fails, and so does the wake-up call. And having non-failing wake-ups can be critical.

It looks like letting the processor fetch independent data in consecutive loop iterations is beneficial. With a list, it would be good if adjacent elements were accessed independently. That can be achieved by splitting the list into several sub-lists and – when iterating the whole collection – processing it in a round-robin manner. Specifically: take an element from the first list, then from the second, … then from the Nth, then advance on the first, then advance on the second, and so on.

The scanning code is written with the assumption that the collection only grows by appending elements to one of its ends – the front or the back. This perfectly suits the task-processing usage scenario and allows the scanning code’s break condition to be very simple: once a null element is met in any of the lists, all lists after it are empty as well, so scanning can stop. Below is a simplistic version of the scanning loop; a full implementation that handles appends is a bit more hairy and is based on the C++ “iterator” concept, but overall it has the same efficiency and resulting assembly code.
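Since the original listing isn’t reproduced here, the following is a reconstruction of the described loop rather than the author’s exact code; it reuses the illustrative Element type from the earlier sketches and takes the split count N as a compile-time constant.

    #include <array>
    #include <cstddef>
    #include <cstdint>

    // Split list: N intrusive sub-lists scanned round-robin. Elements are appended
    // to the lanes in rotation, so once one lane runs out, the lanes after it are
    // exhausted too and the scan can stop.
    // The heads array is taken by value: the local copy plays the role of the
    // per-lane heads kept on the stack in the assembly shown below.
    template <std::size_t N>
    uint64_t scan_split_list(std::array<Element*, N> heads) {
        uint64_t sum = 0;
        std::size_t lane = 0;
        while (heads[lane] != nullptr) {
            Element* cur = heads[lane];
            sum += cur->value;        // value loads from different lanes are independent
            heads[lane] = cur->next;  // advance this lane for the next round
            lane = (lane + 1) % N;    // round-robin; for a power-of-two N the compiler
                                      // can turn this into a mask ("and $0xf" for N = 16)
        }
        return sum;
    }

With N = 1 this degenerates into the plain list scan; larger values of N give the processor up to N independent pointer chains to fetch in parallel.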
First, let’s check the split list with N=2. OK, scanning two lists “in parallel” definitely helps. Since the number of splits is a compile-time constant, we now need to run several tests to see which value is the most efficient one. The more we split the list, the worse it seems to behave at small scales, but the better at large scales. Splits at 16 and 32 lanes seem to “saturate” the processor’s parallelism ability.

Here’s how the results look from a different angle. This time, the horizontal axis shows N (the number of lists in the collection), and individual lines on the plot correspond to different collection sizes, starting from 10 elements and ending at one million; both axes are at logarithmic scale too. At a small scale of 10 and 100 elements, adding more lists doesn’t improve the scanning speed. But at larger scales, 16 parallel lists are indeed the saturation point.

Interestingly, the assembly code of the split-list main loop contains twice as many instructions as the plain list scan:

    x: mov %eax,%edx
       add $0x1,%eax
       and $0xf,%edx
       mov -0x78(%rsp,%rdx,8),%rcx
       mov 0x10(%rcx),%edi
       mov 0x8(%rcx),%rcx
       add %rdi,%rsi
       mov %rcx,-0x78(%rsp,%rdx,8)
       cmp %r8d,%eax
       jne x

It also has twice as many memory accesses as the plain list-scanning code. Nonetheless, since the memory is better organized, prefetching it in a parallel manner makes this code win in terms of timing.

Comparing different processors (and compilers)

The above measurements were done on an AMD Threadripper processor, and the binary was compiled with the GCC-15 compiler. It’s interesting to check what code different compilers render and, more importantly, how different processors behave. First, let’s look at the instruction counts. No big surprises here; the plain list is the shortest code, and the split list is the longest.

Running the tests on different processors, however, renders very different results. Below are the numbers of cycles a processor needs to process a single element. Since the plain list is the outlier, it is shown on its own plot. Here are the top performers – array, array of pointers, and split list. The split list is, as we’ve seen, the slowest of the three, but it’s not drastically different. More interesting is the way the Xeon processor beats the other competitors. A similar ratio was measured for plain-list processing by the different processors. But, again, even on the Xeon processor, the plain list is an order of magnitude slower than the split list.

Summing things up

In this article, we explored ways to organize a collection of objects to allow for efficient scanning. We compared four collections – array, intrusive list, array of pointers, and split list. Since plain arrays have problems maintaining objects independently, we used them as a base reference and mainly compared the three other collections with each other to find out which one behaved best. From the experiments, we discovered that an array of pointers provided the best timing for single-element access, but required a lot of extra memory. This cost can be mitigated to some extent, but the memory itself doesn’t go away. The split-list approach showed comparable (almost as good) performance – and its advantage is that it doesn’t require extra memory to work.

Top Blogs of 2025: Rust, Elasticity, and Real-Time DB Workloads

Let’s look back at the top 10 ScyllaDB blog posts published in 2025, as well as 10 “classics” that are still resonating with readers. But first: thank you to all the community members who contributed to our blogs in various ways…from users sharing best practices at Monster SCALE Summit and P99 CONF, to engineers explaining how they raised the bar for database performance, to anyone who has initiated or contributed to the discussion on HackerNews, Reddit, and the like. And if you have suggestions for additional blog topics, please share them with us on our socials. With no further ado, here are the most-read blog posts that we published in 2025…

Inside ScyllaDB Rust Driver 1.0: A Fully Async Shard-Aware CQL Driver Using Tokio
By Wojciech Przytuła
The engineering challenges and design decisions that led to the 1.0 release of ScyllaDB Rust Driver.
Read: Inside ScyllaDB Rust Driver 1.0: A Fully Async Shard-Aware CQL Driver Using Tokio
Related: P99 CONF on-demand

Introducing ScyllaDB X Cloud: A (Mostly) Technical Overview
By Tzach Livyatan
ScyllaDB X Cloud just landed! It’s a truly elastic database that supports variable/unpredictable workloads with consistent low latency, plus low costs.
Read: Introducing ScyllaDB X Cloud: A (Mostly) Technical Overview
Related: ScyllaDB X Cloud: An Inside Look with Avi Kivity

Inside Tripadvisor’s Real-Time Personalization with ScyllaDB + AWS
By Dean Poulin
See the engineering behind real-time personalization at Tripadvisor’s massive (and rapidly growing) scale.
Read: Inside Tripadvisor’s Real-Time Personalization with ScyllaDB + AWS
Related: How ShareChat Scaled their ML Feature Store 1000X without Scaling the Database

Why We Changed Our Data Streaming Approach
By Asias He
How moving from mutation-based streaming to file-based streaming resulted in 25X faster streaming time.
Read: Why We Changed Our Data Streaming Approach
Related: More engineering blog posts

How Supercell Handles Real-Time Persisted Events with ScyllaDB
By Cynthia Dunlop
How a team of just two engineers tackled real-time persisted events for hundreds of millions of players.
Read: How Supercell Handles Real-Time Persisted Events with ScyllaDB
Related: Rust Rewrite, Postgres Exit: Blitz Revamps Its “League of Legends” Backend

Why Teams Are Ditching DynamoDB
By Guilherme da Silva Nogueira, Felipe Cardeneti Mendes
Teams sometimes need lower latency, lower costs (especially as they scale), or the ability to run their applications somewhere other than AWS.
Read: Why Teams Are Ditching DynamoDB
Related: ScyllaDB vs DynamoDB: 5-Minute Demo

A New Way to Estimate DynamoDB Costs
By Tim Koopmans
We built a new DynamoDB cost analyzer that helps developers understand what their workloads will really cost.
Read: A New Way to Estimate DynamoDB Costs
Related: Understanding The True Cost of DynamoDB

Efficient Full Table Scans with ScyllaDB Tablets
By Felipe Cardeneti Mendes
How “tablets” data distribution optimizes the performance of full table scans on ScyllaDB.
Read: Efficient Full Table Scans with ScyllaDB Tablets
Related: Fast and Deterministic Full Table Scans at Scale

How We Simulate Real-World Production Workloads with “latte”
By Valerii Ponomarov
Learn why and how we adopted latte, a Rust-based lightweight benchmarking tool, for ScyllaDB’s specialized testing needs.
Read: How We Simulate Real-World Production Workloads with “latte”
Related: Database Benchmarking for Performance Masterclass

How JioCinema Uses ScyllaDB Bloom Filters for Personalization
By Cynthia Dunlop
JioCinema (now Disney+ Hotstar) was operating at a scale that required creative solutions beyond typical Redis Bloom filters. This post explains why and how they used ScyllaDB’s built-in Bloom filters for real-time watch status checks.
Read: How JioCinema Uses ScyllaDB Bloom Filters for Personalization
Related: More user perspectives

Bonus: Top NoSQL Database Blogs From Years Past

Many of the blogs published in previous years continued to resonate with the community. Here’s a rundown of 10 enduring favorites:

  • How io_uring and eBPF Will Revolutionize Programming in Linux (Glauber Costa): How io_uring and eBPF will change the way programmers develop asynchronous interfaces and execute arbitrary code, such as tracepoints, more securely. [2020]
  • Database Internals: Working with IO (Pavel Emelyanov): Explore the tradeoffs of different Linux I/O methods and learn how databases can take advantage of a modern SSD’s unique characteristics. [2024]
  • On Coordinated Omission (Ivan Prisyazhynyy): Your benchmark may be lying to you. Learn why coordinated omissions are a concern and how they are handled in ScyllaDB benchmarking. [2021]
  • ScyllaDB vs MongoDB vs PostgreSQL: Tractian’s Benchmarking & Migration (João Pedro Voltani): TRACTIAN compares ScyllaDB, MongoDB, and PostgreSQL and walks through their MongoDB-to-ScyllaDB migration, including challenges and results. [2023]
  • Introducing “Database Performance at Scale”: A Free, Open Source Book (Dor Laor): A practical guide to understanding the tradeoffs and pitfalls of optimizing data-intensive applications for high throughput and low latency. [2023]
  • ScyllaDB vs. DynamoDB Benchmark: Comparing Price Performance Across Workloads (Eliran Sinvani): A comparison of cost and latency across DynamoDB pricing models and ScyllaDB under varied workloads and read/write ratios. [2023]
  • Benchmarking MongoDB vs ScyllaDB: Performance, Scalability & Cost (Dr. Daniel Seybold): A third-party benchmark comparing MongoDB and ScyllaDB on throughput, latency, scalability, and price-performance. [2023]
  • Apache Cassandra 4.0 vs. ScyllaDB 4.4: Comparing Performance (Juliusz Stasiewicz, Piotr Grabowski, Karol Baryla): Benchmarks showing 2×–5× higher throughput and significantly better latency with ScyllaDB versus Cassandra. [2022]
  • DynamoDB: When to Move Out (Felipe Cardeneti Mendes): Why teams leave DynamoDB, including throttling, latency, item size limits, flexibility constraints, and cost. [2023]
  • Rust vs. Zig in Reality: A (Somewhat) Friendly Debate (Cynthia Dunlop): A recap of a P99 CONF debate on systems programming languages with participants from Bun.js, Turso, and ScyllaDB. [2024]

Lessons Learned Leading High-Stakes Data Migrations

“No one ever said ‘meh, it’s just our database'”

Every data migration is high stakes to the person leading it. Whether you’re upgrading an internal app’s database or moving 362 PB of Twitter’s data from bare metal to GCP, a lot can go awry — and you don’t want to be blamed for downtime or data loss. But a migration done right will not only optimize your project’s infrastructure. It will also leave you with a deeper understanding of your system and maybe even yield some fun “war stories” to share with your peers.

To cheat a bit, why not learn from others’ experiences first? Enter Miles Ward (CTO at SADA and former Google and AWS cloud lead) and Tim Koopmans (Senior Director at ScyllaDB, performance geek and SaaS startup founder). Miles and Tim recently got together to chat about lessons they’ve personally learned from leading real-world data migrations. You can watch the complete discussion here:

Let’s look at three key takeaways from the chat.

1. Start with the Hardest, Ugliest Part First

It’s always tempting to start a project with some quick wins. But tackling the worst part first will yield better results overall. Miles explains, “Start with the hardest, ugliest part first because you’re going to be wrong in terms of estimating timelines and noodling through who has the correct skills for each step and what are all of the edge conditions that drive complexity.”

For example, he saw this approach in action during Google’s seven-year migration of the Gmail backend (handling trillions of transactions per day) from its internal Gmail data system to Spanner. First, Google built Spanner specifically for this purpose. Then, the migration team ran roll-forwards and roll-backs of individual mailbox migrations for over two years before deciding that the performance, reliability and consistency in the new environment met their expectations.

Miles added, “You also get an emotional benefit in your teams. Once that scariest part is done, everything else is easier. I think that tends to work well both interpersonally and technically.”

2. Map the Minefield

You can’t safely migrate until you’ve fully mapped out every little dependency. Both Tim and Miles stress the importance of exhaustive discovery: cataloging every upstream caller, every downstream consumer, every health check and contractual downtime window before a single byte shifts. Miles warns, “If you don’t have an idea of what the consequences of your change are…you’ll design a migration that’s ignorant of those needs.”

Miles then offered a cautionary anecdote from his time at Twitter, as part of a team that migrated 362 petabytes of active data from bare-metal data centers into Google Cloud. They used an 800 Gbps interconnect (about the total internet throughput at the time) and transferred everything in 43 days. To be fair, this was a data warehouse migration, so it didn’t involve hundreds of thousands of transactional queries per second. Still, Twitter’s ad systems and revenue depended entirely on that warehouse, making the migration mission-critical.

Miles shared: “They brought incredible engineers and those folks worked with us for months to lay out the plan before we moved any bytes. Compare that to something done a little more slapdash. I think there are plenty of places where businesses go too slow, where they overinvest in risk management because they haven’t modeled the cost-benefit of a faster migration. But if you don’t have that modeling done, you should probably take the slow boat and do it carefully.”
3. Engineer a “Blissfully Boring” Cutover

“If you’re not feeling sleepy on cut-over day,” Miles quipped, “you’ve done something terribly wrong.” But how do you get to that point?

Tim shared that he’s always found dual writes with single reads useful: you can switch over once both systems are up to speed. If the database doesn’t support dual writes, replicating writes via Change Data Capture (CDC) or something similar works well. Those strategies provide confidence that the source and target behave the same under load before you start serving real traffic. Then Tim asked Miles, “Would you say those are generally good approaches, or does it just depend?”

Miles’ response: “I think the biggest driver of ‘It depends’ is that those concepts are generally sound, but real-world migrations are more complex. You always want split writes when feasible, so you build operational experience under write load in the new environment. But sample architecture diagrams and Terraform examples make migrations look simpler than they usually are.”

Another complicating factor: most companies don’t have one application on one database. They have dozens of applications talking across multiple databases, data warehouses, cache layers and so on. All of this matters when you start routing read traffic from various sources. Some systems use scheduled database-to-warehouse extractions, while others avoid streaming replication costs. Load patterns shift throughout the day as different workloads come online. That’s why you should test beyond the immediate reads after migration or when initial writes move to the new environment.

So codify every step, version it and test it all multiple times – exactly the same way. And if you need to justify extra preparation or planning for the migration, frame it as improving your overall high-availability design. Those practices will carry forward even after the cutover.

Also, be aware that new platforms will inevitably have different operational characteristics…that’s why you’re adopting them. But these changes can break hard-coded alerts or automation. For example, maybe you had alerts set to trigger at 10,000 transactions per second, but the new system easily handles 100,000. Ensure that your previous automation still works and systematically evaluate all upstream and downstream dependencies.

Follow these tips and the big day could resemble Digital Turbine’s stellar example. Miles shared, “If Digital Turbine’s database went down, its business went down. But the company’s DynamoDB to ScyllaDB migration was totally drama free. It took two and a half weeks, all buttoned up, done. It was going so well that everybody had a beer in the middle of the cutover.”

Closing Thoughts

Data migrations are always “high stakes.” As Miles bluntly put it, “I know that if I screw this up, I’ll piss off customers, drive them to competitors, or miss out on joint growth opportunities. It all comes down to trust. There are countless ways you can screw up an application in a way that breaches stakeholder trust. But doing careful planning, being thoughtful about the migration process, and making the right design decisions sets the team up to grow trust instead of eroding it.”

Data migration projects are also great opportunities to strengthen your team’s architecture and build your own engineering expertise. Tim left us with this thought: “My advice for anyone who’s scared of running a data migration: Just have a crack at it.
Do it carefully, and you’ll learn a lot about distributed systems in general – and gain all sorts of weird new insights into your own systems in particular.”

Watch the complete video (at the start of this article) for more details on these topics – as well as some fun “war stories.”

Bonus: Access our free NoSQL Migration Masterclass for a deeper dive into migration strategy, missteps, and logistics.

ScyllaDB Operator 1.19.0 Release: Multi-Tenant Monitoring with Prometheus and OpenShift Support

Multi-tenant monitoring with Prometheus/OpenShift, improved sysctl config, and a new opt-in synchronization for safer topology changes

The ScyllaDB team is pleased to announce the release of ScyllaDB Operator 1.19.0. ScyllaDB Operator is an open-source project that helps you run ScyllaDB on Kubernetes. It manages ScyllaDB clusters deployed to Kubernetes and automates tasks related to operating a ScyllaDB cluster, like installation, vertical and horizontal scaling, as well as rolling upgrades.

The latest release introduces the “External mode,” which enables multi-tenant monitoring with Prometheus and OpenShift support. It also adds a new guardrail in the must-gather debugging tool preventing accidental inclusion of sensitive information, optimizes kernel parameter (sysctl) configuration, and introduces an opt-in synchronization feature for safer topology changes – plus several other updates.

Multi-tenant monitoring with Prometheus and OpenShift support

ScyllaDB Operator monitoring uses Prometheus (an industry-standard cloud-native monitoring system) for metric collection and aggregation. Up until now, you had to run a fresh, clean instance of Prometheus for every ScyllaDB cluster. We coined the term “Managed mode” for this architecture (because, in that case, ScyllaDB Operator would manage the Prometheus deployment).

ScyllaDB Operator 1.19 introduces the “External mode” – an option to connect one or more ScyllaDB clusters to a shared Prometheus deployment that may already be present in your production environment.

The External mode provides a very important capability for users who run ScyllaDB on Red Hat OpenShift: the User Workload Monitoring (UWM) capability of OpenShift becomes available as a backend for ScyllaDB Monitoring.

Under the hood, ScyllaDB Operator 1.19 implements the new monitoring architectures by extending the ScyllaDBMonitoring CRD with a new field, .spec.components.prometheus.mode, that can be set to Managed or External. Managed is the preexisting behavior (deploy a clean Prometheus instance), while External deploys just the Grafana dashboard, using your existing Prometheus as a data source instead, and puts ServiceMonitors and PrometheusRules in place to get all the ScyllaDB metrics there.

See the new ScyllaDB Monitoring overview and Setting up ScyllaDB Monitoring documents to learn more about the new mode and how to set up ScyllaDB Monitoring with an existing Prometheus instance. The Setting up ScyllaDB Monitoring on OpenShift guide offers guidance on how to set up User Workload Monitoring (UWM) for ScyllaDB in OpenShift.

That being said, our experience shows that cluster administrators prefer closer control over the monitoring stack than what the Managed mode offered. For this reason, we intend to standardize on using External in the long run. So, we’re still supporting the Managed mode, but it’s being deprecated and will be removed in a future Operator version. If you are an existing user, please consider deploying your own Prometheus using the Prometheus Operator platform guide and switching from Managed to External.

Sensitive information excluded from must-gather

ScyllaDB Operator comes with an embedded tool (called must-gather) that helps preserve the configuration (Kubernetes objects) and runtime state (ScyllaDB node logs, gossip information, nodetool status, etc.) in a convenient archive. This allows comparative analysis and troubleshooting with a holistic, reproducible view.
As of ScyllaDB Operator 1.19, must-gather comes with a new setting, --exclude-resource, that serves as an additional guardrail preventing accidental inclusion of sensitive information – covering Secrets and SealedSecrets by default. Users can specify additional types to be restricted from capturing, or override the defaults by setting the --include-sensitive-resources flag. See the Gathering data with must-gather guide for more information.

Configuration of kernel parameters (sysctl)

ScyllaDB nodes require kernel parameter (sysctl) configuration for optimal performance and stability – ScyllaDB Operator 1.19 improves the API for doing that. Before 1.19, it was possible to configure these parameters through v1.ScyllaCluster‘s .spec.sysctls. However, we learned that this wasn’t the optimal place in the API for a setting that affects entire Kubernetes nodes. So, ScyllaDB Operator 1.19 lets you configure sysctls through v1alpha1.NodeConfig for a range of Kubernetes nodes at once, by matching the specified placement rules using a label-based selector. See the Configuring kernel parameters (sysctls) section of the documentation to learn how to configure the sysctl values recommended for production-grade ScyllaDB deployments.

With the introduction of sysctl to NodeConfig, the legacy way of configuring sysctl values through v1.ScyllaCluster‘s .spec.sysctls is now deprecated.

Topology change operations synchronisation

ScyllaDB requires that no existing nodes are down when a new node is added to a cluster. ScyllaDB Operator 1.19 addresses this by extending ScyllaDB Pods for newly joining nodes with a barrier that blocks the ScyllaDB container from starting until the preconditions for bootstrapping a new node are met.

This feature is opt-in in ScyllaDB Operator 1.19. You can enable it by setting the --feature-gates=BootstrapSynchronisation=true command-line argument to ScyllaDB Operator. The feature supports ScyllaDB 2025.2 and newer.

If you are running a multi-datacenter ScyllaDB cluster (multiple ScyllaCluster objects bound together with external seeds), you are still required to verify the preconditions yourself before initiating any topology changes. This is because the synchronisation only occurs on the level of an individual ScyllaCluster. See Synchronising bootstrap operations in ScyllaDB for more information.

Other notable changes

Deprecation of ScyllaDBMonitoring components’ exposeOptions

By adding support for external Prometheus instances, ScyllaDB Operator 1.19 takes a step towards reducing ScyllaDBMonitoring‘s complexity by deprecating exposeOptions in both ScyllaDBMonitoring‘s Prometheus and Grafana components. The use of exposeOptions is limited because it provides no way to configure an Ingress that will terminate TLS, which is likely the most common approach in production. As an alternative, this release introduces a more pragmatic and flexible approach: we simply document how the components’ corresponding Services can be exposed. This gives you the flexibility to do exactly what your use case requires. See the Exposing Grafana documentation to learn how to expose Grafana deployed by ScyllaDBMonitoring using a self-managed Ingress resource. The deprecated ScyllaDBMonitoring‘s exposeOptions will be removed in a future Operator version.
Dependency updates

This release also includes regular updates of ScyllaDB Monitoring and the packaged dashboards to support the latest ScyllaDB releases (4.11.1->4.12.1, #3031), as well as its dependencies: Grafana (12.0.2->12.2.0) and Prometheus (v3.5.0->v3.6.0). For more changes and details, check out the GitHub release notes.

Upgrade instructions

For instructions on upgrading ScyllaDB Operator to 1.19, see the Upgrading Scylla Operator documentation.

Supported versions

  • ScyllaDB 2024.1, 2025.1 – 2025.3
  • Kubernetes 1.31 – 1.34
  • Container Runtime Interface API v1
  • ScyllaDB Manager 3.5, 3.7

Getting started with ScyllaDB Operator

  • ScyllaDB Operator Documentation
  • Learn how to deploy ScyllaDB on Google Kubernetes Engine (GKE)
  • Learn how to deploy ScyllaDB on Amazon Elastic Kubernetes Service (EKS)
  • Learn how to deploy ScyllaDB on a Kubernetes Cluster

Related links

  • ScyllaDB Operator source (on GitHub)
  • ScyllaDB Operator image on DockerHub
  • ScyllaDB Operator Helm Chart repository
  • ScyllaDB Operator documentation
  • ScyllaDB Operator for Kubernetes lesson in ScyllaDB University

Report a problem

Your feedback is always welcome! Feel free to open an issue or reach out on the #scylla-operator channel in ScyllaDB User Slack.

Instaclustr product update: December 2025

Here’s a roundup of the latest features and updates that we’ve recently released.

If you have any particular feature requests or enhancement ideas that you would like to see, please get in touch with us.

Major announcements

OpenSearch®

AI Search for OpenSearch®: Unlocking next-generation search

AI Search for OpenSearch, which is now available in Public Preview on the NetApp Instaclustr Managed Platform, is designed to bring semantic, hybrid, and multimodal search capabilities to OpenSearch deployments—turning them into an end-to-end AI-powered search solution within minutes. With built-in ML models, vector indexing, and streamlined ingestion pipelines, next-generation search can be enabled in minutes without adding operational complexity. This feature powers smarter, more relevant discovery experiences backed by AI—securely deployed across any cloud or on-premises environment.

ClickHouse®

FSx for NetApp ONTAP and Managed ClickHouse® integration is now available
We’re excited to announce that NetApp has introduced seamless integration between Amazon FSx for NetApp ONTAP and Instaclustr Managed ClickHouse, to enable customers to build a truly hybrid lakehouse architecture on AWS. This integration is designed to deliver lightning-fast analytics without the need for complex data movement, while leveraging FSx for ONTAP’s unified file and object storage, tiered performance, and cost optimization. Customers can now run zero-copy lakehouse analytics with ClickHouse directly on FSx for ONTAP data—to simplify operations, accelerate time-to-insight, and reduce total cost of ownership.

PostgreSQL®

Instaclustr for PostgreSQL® on Amazon FSx for ONTAP: A new era
We’re excited to announce the public preview of Instaclustr Managed PostgreSQL integrated with Amazon FSx for NetApp ONTAP—combining enterprise-grade storage with world-class open source database management. This integration is designed to deliver higher IOPS, lower latency, and advanced data management without increasing instance size or adding costly hardware. Customers can now run PostgreSQL clusters backed by FSx for ONTAP storage, leveraging on-disk compression for cost savings and paving the way for ONTAP-powered features, such as instant snapshot backups, instant restores, and fast forking. These ONTAP-enabled features are planned to unlock huge operational benefits and will be launched with our GA release.

Other significant changes

Apache Cassandra®

  • Transitioned Apache Cassandra v4.1.8 to CLOSED lifecycle state; scheduled to reach End of Life (EOL) on December 20, 2025.

Apache Kafka®

  • Kafka on Azure now supports v5 generation nodes, available in General Availability.
  • Instaclustr Managed Apache ZooKeeper has moved from General Availability to closed status.

ClickHouse

  • Kafka Table Engine integration with ClickHouse has added support to enable real-time data ingestion, streamline streaming analytics, and accelerate insights.
  • New ClickHouse node sizes, powered by AWS m7g, r7i, and r7g instances, are now in Limited Availability for cluster creation.

Cadence®

  • Cadence is now available to be provisioned with Cassandra 5.x, designed to deliver improved performance, enhanced scalability, and stronger security for mission-critical workflows.

PostgreSQL

  • Added new PostgreSQL metrics for connect states and wait event types.
  • PostgreSQL Load Balancer add-on is now available, providing a unified endpoint for cluster access, simplifying failover handling, and ensuring node health through regular checks.

Upcoming releases

Apache Cassandra

  • We’re working on enabling multi-datacenter (multi-DC) cluster provisioning via API and console, designed to make it easier to deploy clusters across regions with secure networking and reduced manual steps.

Apache Kafka

  • We’re working on adding Kafka Tiered Storage for clusters running in GCP, designed to bring affordable, scalable retention and instant access to historical data, ensuring flexibility and performance across clouds for enterprise Kafka users.

ClickHouse

  • We’re planning to extend our Managed ClickHouse to allow it to work with on-prem deployments.

PostgreSQL

  • Following the success of our public preview, we’re preparing to launch PostgreSQL integrated with FSx for NetApp ONTAP (FSxN) into General Availability. This enhancement is designed to combine enterprise-grade PostgreSQL with FSxN’s scalable, cost-efficient storage, enabling customers to optimize infrastructure costs while improving performance and flexibility.

OpenSearch®

  • As part of our ongoing advancements in AI for OpenSearch, we are planning to enable adding GPU nodes into OpenSearch clusters, aiming to enhance the performance and efficiency of machine learning and AI workloads.

Instaclustr Managed Platform

  • We’re working on a self-service Tags Management feature, allowing users to add, edit, or delete tags for their clusters directly through the Instaclustr console, APIs, or Terraform provider for RIYOA deployments.

Did you know?

  • Cadence Workflow, the open source orchestration engine created by Uber, has officially joined the Cloud Native Computing Foundation (CNCF) as a Sandbox project. This milestone ensures transparent governance, community-driven innovation, and a sustainable future for one of the most trusted workflow technologies in modern microservices and agentic AI architectures. Uber donates Cadence Workflow to CNCF: The next big leap for the open source project—read the full story and discover what’s next for Cadence.
  • Upgrading ClickHouse® isn’t just about new features—it’s essential for security, performance, and long-term stability. In ClickHouse upgrade: Why staying updated matters, you’ll learn why skipping upgrades can lead to technical debt, missed optimizations, and security risks. Then, explore A guide to ClickHouse® upgrades and best practices for practical strategies, including when to choose LTS releases for mission-critical workloads and when stable releases make sense for fast-moving environments.
  • Our latest blog, AI Search for OpenSearch®: Unlocking next-generation search, explains how this new solution enables smarter discovery experiences using built-in ML models, vector embeddings, and advanced search techniques—all fully managed on the NetApp Instaclustr Platform. Ready to explore the future of search? Read the full article and see how AI can transform your OpenSearch deployments.

If you have any questions or need further assistance with these enhancements to the Instaclustr Managed Platform, please contact us.

SAFE HARBOR STATEMENT: Any unreleased services or features referenced in this blog are not currently available and may not be made generally available on time or at all, as may be determined in NetApp’s sole discretion. Any such referenced services or features do not represent promises to deliver, commitments, or obligations of NetApp and may not be incorporated into any contract. Customers should make their purchase decisions based upon services and features that are currently generally available.


Consuming CDC with Java, Go… and Rust!

A quick look at how to use ScyllaDB Change Data Capture with the Rust connector

In 2021, we published a guide for using Java and Go with ScyllaDB CDC. Today, we are happy to share a new version of that post, including how to use ScyllaDB CDC with the Rust connector!

Note: We will skip some of the sections in the original post, like “Why Use a Library?” and the challenges in using CDC. If you are planning to use CDC in production, you should absolutely go back and read them. But if you’re just looking to get a demo up and running, this post will get you there.

Getting Started with Rust

scylla-cdc-rust is a library for consuming the ScyllaDB CDC log in Rust applications. It automatically and transparently handles errors and topology changes of the underlying ScyllaDB cluster. As a result, the API allows the user to read the CDC log without having to deeply understand the internal structure of CDC. The library was written in pure Rust, using ScyllaDB Rust Driver and Tokio.

Let’s see how to use the Rust library. We will build an application that prints changes happening to a table in real time. You can see the final code here.

Installing the library

The scylla-cdc library is available on crates.io.

Setting up the CDC consumer

The most important part of using the library is to define a callback that will be executed after reading a CDC log entry from the database. Such a callback is defined by implementing the Consumer trait located in scylla-cdc::consumer. For now, we will define a struct with no member variables for this purpose:

Since the callback will be executed asynchronously, we have to use the async-trait crate to implement the Consumer trait. We also use the anyhow crate for error handling. The library is going to create one instance of TutorialConsumer per CDC stream, so we also need to define a ConsumerFactory for them:

Adding shared state to the consumer

Different instances of Consumer are used in separate Tokio tasks, so the runtime might schedule them on separate threads. As a consequence, a struct implementing the Consumer trait should also implement the Send trait, and a struct implementing the ConsumerFactory trait should implement the Send and Sync traits. Luckily, Rust implements these traits by default if all member variables of a struct implement them. If the consumers need to share some state, like a reference to an object, it can be wrapped in an Arc. An example of that might be a Consumer that counts the rows read by all its instances:

Note: In general, keeping shared mutable state in the Consumer is not recommended. That’s because it requires synchronization (i.e. a mutex or an atomic like AtomicUsize), which reduces the speedup Tokio gains from running the Consumer logic on multiple cores. Fortunately, keeping exclusive (not shared) mutable state in the Consumer comes with no additional overhead.

Starting the application

Now we’re ready to create our main function:

As we can see, we have to configure a few things in order to start the log reader:

  • Create a connection to the database, using the Session struct from ScyllaDB Rust Driver.
  • Specify the keyspace and the table name.
  • Create time bounds for our reader. This step is not compulsory – by default, the reader will start reading from now and will continue reading forever. In our case, we are going to read all logs added during the last 6 minutes.
  • Create the factory.
  • Build the log reader.
After creating the log reader, we can await the handle it returns, so that our application terminates as soon as the reader finishes.

Now, let’s insert some rows into the table. After inserting 3 rows and running the application, you should see the output:

    Hello, scylla-cdc!
    Hello, scylla-cdc!
    Hello, scylla-cdc!

The application printed one line for each CDC log consumed. To see how to use CDCRow and save progress, see the full example below.

Full Example

Follow this detailed cdc-rust tutorial, or:

    git clone https://github.com/scylladb/scylla-cdc-rust
    cd scylla-cdc-rust
    cargo run --release --bin scylla-cdc-printer -- --keyspace KEYSPACE --table TABLE --hostname HOSTNAME

Where HOSTNAME is the IP address of the cluster.

Getting Started with Java and Go

For a detailed walkthrough with Java and Go examples, see our previous blog, Consuming CDC with Java and Go.

Further reading

In this blog, we have explained what problems the scylla-cdc-rust, scylla-cdc-java, and scylla-cdc-go libraries solve and how to write a simple application with each. If you would like to learn more, check out the links below:

  • Replicator example application in the scylla-cdc-java repository. It is an advanced application that replicates a table from one Scylla cluster to another using the CDC log and the scylla-cdc-java library.
  • Example applications in the scylla-cdc-go repository. The repository currently contains three examples: “simple-printer”, which prints changes from a particular schema; “printer”, which is the same as the example presented in the blog; and “replicator”, which is a relatively complex application that replicates changes from one cluster to another.
  • API reference for scylla-cdc-go. It includes slightly more sophisticated examples which, unlike the example in this blog, cover saving progress.
  • CDC documentation. Knowledge about the design of Scylla’s CDC can be helpful in understanding the concepts in the documentation for both the Java and Go libraries. The parts about the CDC log schema and the representation of data in the log are especially useful.
  • ScyllaDB Users Slack. We will be happy to answer your questions about CDC on the #cdc channel.

We hope all that talk about consuming data has managed to whet your appetite for CDC! Happy and fruitful coding!

Freezing streaming data into Apache Iceberg™—Part 1: Using the Apache Kafka® Connect Iceberg Sink Connector

Introduction 

Ever since the first distributed system (i.e., two or more computers networked together, in 1969), there has been the problem of distributed data consistency: how can you ensure that data from one computer is available and consistent with the second (and further) computers? This problem can be uni-directional (one computer is considered the source of truth, others are just copies) or bi-directional (data must be synchronized in both directions across multiple computers).

Some approaches to this problem I’ve come across in the last 8 years include Kafka Connect (for elegantly solving the heterogeneous many-to-many integration problem by streaming data from source systems to Kafka and from Kafka to sink systems, some earlier blogs on Apache Camel Kafka Connectors and a blog series on zero-code data pipelines), MirrorMaker2 (MM2, for replicating Kafka clusters, a 2 part blog series), and Debezium (Change Data Capture/CDC, for capturing changes from databases as streams and making them available in downstream systems, e.g. for Apache Cassandra and PostgreSQL)—MM2 and Debezium are actually both built on Kafka Connect.  

Recently, some “sink” systems have been taking over responsibility for streaming data from Kafka into themselves, e.g. OpenSearch pull-based ingestion (c.f. OpenSearch Sink Connector), and the ClickHouse Kafka Table Engine (c.f. ClickHouse Sink Connector). These “pull-based” approaches are potentially easier to configure and don’t require running a separate Kafka Connect cluster and sink connectors, but some downsides may be that they are not as reliable or independently scalable, and you will need to carefully monitor and scale them to ensure they perform adequately.  

And then there are “zero-copy” approaches: these rely on the well-known computer science trick of sharing a single copy of data using references (or pointers), rather than duplicating the data. This idea has been around for almost as long as computers and is still widely applicable, as we’ll see in part 2 of this blog series.

The distributed data use case we’re going to explore in this 2-part blog series is streaming Apache Kafka data into Apache Iceberg, or “Freezing streaming Apache Kafka data into an (Apache) Iceberg”! In part 1 we’ll introduce Apache Iceberg and look at the first approach for “freezing” streaming data using the Kafka Connect Iceberg Sink Connector. 

What is Apache Iceberg? 

Apache Iceberg is an open source, open table format specification optimized for column-oriented workloads and supporting huge analytic datasets. It supports multiple different concurrent engines that can insert and query table data using SQL. And Iceberg is organized like, well, an iceberg!

The tip of the Iceberg is the Catalog. An Iceberg Catalog acts as a central metadata repository, tracking the current state of Iceberg tables, including their names, schemas, and metadata file locations. It serves as the “single source of truth” for a data Lakehouse, enabling query engines to find the correct metadata file for a table to ensure consistent and atomic read/write operations.  

Just under the water, the next layer is the metadata layer. The Iceberg metadata layer tracks the structure and content of data tables in a data lake, enabling features like efficient query planning, versioning, and schema evolution. It does this by maintaining a layered structure of metadata files, manifest lists, and manifest files that store information about table schemas, partitions, and data files, allowing query engines to prune unnecessary files and perform operations atomically. 

The data layer is at the bottom. The Iceberg data layer is the storage component where the actual data files are stored. It supports different storage backends, including cloud-based object storage like Amazon S3 or Google Cloud Storage, or HDFS. It uses file formats like Parquet or Avro. Its main purpose is to work in conjunction with Iceberg’s metadata layer to manage table snapshots and provide a more reliable and performant table format for data lakes, bringing data warehouse features to large datasets. 

As shown in the above diagram, Iceberg supports multiple different engines, including Apache Spark and ClickHouse. Engines provide the “database” features you would expect, including:  

  • Data Management 
  • ACID Transactions 
  • Query Planning and Optimization 
  • Schema Evolution 
  • And more! 

I’ve recently been reading an excellent book on Apache Iceberg (“Apache Iceberg: The Definitive Guide”), which explains the philosophy, architecture and design, including operation, of Iceberg. For example, it says that it’s best practice to treat data lake storage as immutable: data should only be added to a data lake, not deleted. So, in theory at least, writing infinite, immutable Kafka streams to Iceberg should be straightforward!

But because it’s a complex distributed system (which looks like a database from above water but is really a bunch of files below water!), there is some operational complexity. For example, it handles change and consistency by creating new snapshots for every modification, enabling time travel, isolating readers from writes, and supporting optimistic concurrency control for multiple writers. But you need to manage snapshots (e.g. expiring old snapshots). And chapter 4 (performance optimisation) explains that you may need to worry about compaction (reducing too many small files), partitioning approaches (which can impact read performance), and handling row-level updates. The first two issues may be relevant for Kafka, but probably not the last one. So, it looks like it’s a good fit for streaming Kafka use cases, but we may need to watch out for Iceberg management issues.

“Freezing” streaming data with the Kafka Iceberg Sink Connector 

But Apache Iceberg is “frozen” – what’s the connection to fast-moving streaming data? You certainly don’t want to collide with an iceberg from your speedy streaming “ship” – but you may want to freeze your streaming data for long-term analytical queries in the future. How can you do that without sinking? Actually, a “sink” is the first answer: a Kafka Connect Iceberg Sink Connector is the most common way of “freezing” your streaming data in Iceberg!

Kafka Connect is the standard framework provided by Apache Kafka to move data from multiple heterogeneous source systems to multiple heterogeneous sink systems, using:  

  • A Kafka cluster 
  • A Kafka Connect cluster (running connectors) 
  • Kafka Connect source connectors 
  • Kafka topics and 
  • Kafka Connect Sink Connectors 

That is, a highly decoupled approach. It provides real-time data movement with high scalability, reliability, error handling and simple transformations.   

Here’s the official documentation for the Kafka Connect Iceberg Sink Connector.

It appears to be reasonably complicated to configure this sink connector; you will need to know something about Iceberg. For example, what is a “control topic”? It’s apparently used to coordinate commits for exactly-once semantics (EOS).  
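To make that more concrete, here’s a minimal sketch of registering the connector through the Kafka Connect REST API from Python. The connector class and property names reflect my reading of the Iceberg sink documentation and may differ between versions, so treat the keys (particularly the catalog settings) as illustrative and verify them against the official docs linked above.

import json
import requests  # assumes the requests library is installed

# Illustrative connector definition; verify property names against your connector version.
connector = {
    "name": "iceberg-sink-demo",
    "config": {
        "connector.class": "org.apache.iceberg.connect.IcebergSinkConnector",
        "topics": "events",                              # source Kafka topic(s)
        "iceberg.tables": "analytics.events",            # target Iceberg table
        "iceberg.catalog.type": "rest",                  # catalog the connector writes through
        "iceberg.catalog.uri": "http://catalog:8181",    # hypothetical catalog endpoint
        "iceberg.control.topic": "control-iceberg",      # coordinates commits for exactly-once semantics
        "iceberg.control.commit.interval-ms": "300000",  # default commit interval: 5 minutes
    },
}

# Register the connector with a Kafka Connect worker (default REST port is 8083).
resp = requests.post(
    "http://localhost:8083/connectors",
    headers={"Content-Type": "application/json"},
    data=json.dumps(connector),
)
resp.raise_for_status()
print("Created connector:", resp.json()["name"])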

The connector supports fan-out (writing to multiple Iceberg tables from one topic), fan-in (writing to one Iceberg table from multiple topics), static and dynamic routing, and filtering.  

As with many technologies you may want to use as Kafka Connect sinks, support for Kafka metadata can be limited. The KafkaMetadata Transform (which injects topic, partition, offset, and timestamp properties) is only experimental at present.  

How are Iceberg tables created with the correct metadata? If you have JSON record values, then schemas are inferred by default (but may not be correct or optimal). Alternatively, explicit schemas can be included in-line or referenced from a Kafka Schema Registry (e.g. Karapace), and, as an added bonus, schema evolution is supported.  Also note that Iceberg tables may have to be manually created prior to use if your Catalog doesn’t support table auto-creation.  

From what I understood about Iceberg, to use it (e.g. for writes), you need support from an engine (e.g. to add raw data to the Iceberg warehouse, create the metadata files, and update the catalog).  How does this work for Kafka Connect? From this blog I discovered that the Kafka Connect Iceberg Sink connector is functioning as an Iceberg engine for writes, so there really is an engine, but it’s built into the connector.  

As is the case with all Kafka Connect Sink Connectors, records are available as soon as they are written to Kafka topics by Kafka producers and Kafka Connect Source Connectors, i.e. records in active segments can be copied immediately to sink systems. But is the Iceberg Sink Connector real-time? Not really! The default time to write to Iceberg is every 5 minutes (iceberg.control.commit.interval-ms) to prevent a multiplication of small files, something that Iceberg(s) doesn’t/don’t like (“melting”?). In practice, it’s because every data file must be tracked in the metadata layer, which impacts performance in many ways; the proliferation of small files is typically addressed by optimization and compaction (e.g. Apache Spark supports Iceberg management, including these operations). 

So, unlike most Kafka Connect sink connectors, which write as quickly as possible, there will be lag before records appear in Iceberg tables (“time to freeze” perhaps)!  

The systems are separate (Kafka and Iceberg are independent): records are copied to Iceberg, and that’s it! This is a clean separation of concerns and ownership. Kafka owns the source data (with Kafka controlling data lifecycles, including record expiry), while the Kafka Connect Iceberg Sink Connector performs the reading from Kafka and writing to Iceberg, and is independently scalable from Kafka. Kafka doesn’t handle any of the Iceberg management. Once the data has landed in Iceberg, Kafka has no further visibility or interest in it. And the pipeline is purely one way, write only – reads or deletes are not supported.  

Here’s a summary of this approach to freezing streams:  

  1. Kafka Connect Iceberg Sink Connector shares all the benefits of the Kafka Connect framework, including scalability, reliability, error handling, routing, and transformations.  
  2. At a minimum, JSON values are required; ideally, use full schemas referenced in Karapace, but not all schemas are guaranteed to work. 
  3. Kafka Connect doesn’t “manage” Iceberg (e.g. automatically aggregate small files, remove snapshots, etc.) 
  4. You may have to tune the commit interval – 5 minutes is the default. 
  5. But it does have a built-in engine that supports writing to Iceberg. 
  6. You may need to use an external tool (e.g. Apache Spark) for Iceberg management procedures. 
  7. It’s write-only to Iceberg. Reads or deletes are not supported. 

But what’s the best thing about the Kafka Connect Iceberg Sink Connector? It’s available now (as part of the Apache Iceberg build) and works on the NetApp Instaclustr Kafka Connect platform as a “bring your own connector”  (instructions here).  

In part 2, we’ll look at Kafka Tiered Storage and Iceberg Topics! 

The post Freezing streaming data into Apache Iceberg™—Part 1: Using Apache Kafka® Connect Iceberg Sink Connector appeared first on Instaclustr.

Netflix Live Origin

Xiaomei Liu, Joseph Lynch, Chris Newton

Introduction

Behind the Streams: Building a Reliable Cloud Live Streaming Pipeline for Netflix introduced the architecture of the streaming pipeline. This blog post looks at the custom Origin Server we built for Live — the Netflix Live Origin. It sits at the demarcation point between the cloud live streaming pipelines on its upstream side and the distribution system, Open Connect, Netflix’s in-house Content Delivery Network (CDN), on its downstream side, and acts as a broker managing what content makes it out to Open Connect and ultimately to the client devices.

Live Streaming Distribution and Origin Architecture

Netflix Live Origin is a multi-tenant microservice operating on EC2 instances within the AWS cloud. We lean on standard HTTP protocol features to communicate with the Live Origin. The Packager pushes segments to it using PUT requests, which place a file into storage at the particular location named in the URL. The storage location corresponds to the URL that is used when the Open Connect side issues the corresponding GET request.
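As a rough illustration of that PUT/GET symmetry, here’s a small Python sketch. The host name and URL layout are made up for the example; the real origin paths are internal to Netflix.

import requests

# Hypothetical segment location; the path used for the PUT is the same one
# Open Connect later uses for the GET.
segment_url = "https://live-origin.example.com/event-123/video/1080p/segment-0042.m4s"

# Packager side: publish the segment with an HTTP PUT.
with open("segment-0042.m4s", "rb") as f:
    put_resp = requests.put(segment_url, data=f)
    put_resp.raise_for_status()

# Open Connect side: fetch the same object with an HTTP GET.
get_resp = requests.get(segment_url)
get_resp.raise_for_status()
print("fetched", len(get_resp.content), "bytes")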

Live Origin architecture is influenced by key technical decisions of the live streaming architecture. First, resilience is achieved through redundant regional live streaming pipelines, with failover orchestrated on the server side to reduce client complexity. The implementation of epoch locking at the cloud encoder enables the origin to select a segment from either encoding pipeline. Second, Netflix adopted a manifest design with segment templates and constant segment duration to avoid frequent manifest refresh. The constant-duration templates enable the Origin to predict the segment publishing schedule.

Multi-pipeline and multi-region aware origin

Live streams inevitably contain defects due to the non-deterministic nature of live contribution feeds and strict real-time segment publishing timelines. Common defects include:

  • Short segments: Missing video frames and audio samples.
  • Missing segments: Entire segments are absent.
  • Segment timing discontinuity: Issues with the Track Fragment Decode Time.

Communicating segment discontinuity from the server to the client via a segment template-based manifest is impractical, and these defective segments can disrupt client streaming.

The redundant cloud streaming pipelines operate independently, encompassing distinct cloud regions, contribution feeds, encoder, and packager deployments. This independence substantially mitigates the probability of simultaneous defective segments across the dual pipelines. Owing to its strategic placement within the distribution path, the live origin naturally emerges as a component capable of intelligent candidate selection.

The Netflix Live Origin features multi-pipeline and multi-region awareness. When a segment is requested, the live origin checks candidates from each pipeline in a deterministic order, selecting the first valid one. Segment defects are detected via lightweight media inspection at the packager. This defect information is provided as metadata when the segment is published to the live origin. In the rare case of concurrent defects across both pipelines, the segment defects can be communicated downstream for intelligent client-side error concealment.
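A minimal sketch of that selection logic might look like the following, assuming the packager attaches a simple defect flag to each published segment’s metadata (the field names and pipeline identifiers are invented for illustration; this is not Netflix’s actual code).

from typing import Optional

PIPELINE_ORDER = ["pipeline-a", "pipeline-b"]  # deterministic check order

def select_candidate(candidates: dict[str, dict]) -> Optional[dict]:
    # Check each pipeline in order and return the first non-defective segment.
    for pipeline in PIPELINE_ORDER:
        segment = candidates.get(pipeline)
        if segment is None:
            continue  # this pipeline never published the segment
        if not segment["metadata"].get("defective", False):
            return segment
    # Rare case: concurrent defects on both pipelines. Serve the first available
    # candidate and let the defect metadata travel downstream for concealment.
    for pipeline in PIPELINE_ORDER:
        if candidates.get(pipeline) is not None:
            return candidates[pipeline]
    return None  # nothing published yet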

Open Connect streaming optimization

When the Live project started, Open Connect had become highly optimised for VOD content delivery — nginx had been chosen many years ago as the Web Server since it is highly capable in this role, and a number of enhancements had been added to it and to the underlying operating system (BSD). Unlike traditional CDNs, Open Connect is more of a distributed origin server — VOD assets are pre-positioned onto carefully selected server machines (OCAs, or Open Connect Appliances) rather than being filled on demand.

Alongside the VOD delivery, an on-demand fill system has been used for non-VOD assets — this includes artwork and the downloadable portions of the clients, etc. These are also served out of the same nginx workers, albeit under a distinct server block, using a distinct set of hostnames.

Live didn’t fit neatly into this ‘small object delivery’ model, so we extended the proxy-caching functionality of nginx to address Live-specific needs. We will touch on some of these here related to optimized interactions with the Origin Server. Look for a future blog post that will go into more details on the Open Connect side.

The segment templates provided to clients are also provided to the OCAs as part of the Live Event Configuration data. Using the Availability Start Time and Initial Segment number, the OCA is able to determine the legitimate range of segments for each event at any point in time — requests for objects outside this range can be rejected, preventing unnecessary requests going up through the fill hierarchy to the origin. If a request makes it through to the origin, and the segment isn’t available yet, the origin server will return a 404 Status Code (indicating File Not Found) with the expiration policy of that error so that it can be cached within Open Connect until just before that segment is expected to be published.
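The arithmetic behind that range check is straightforward. Here is a sketch using invented parameter names, assuming a constant-duration segment template as described above; the real configuration fields and safety margins are internal details.

import math
import time

# Event configuration shared with the OCAs (names and values are illustrative).
AVAILABILITY_START_TIME = 1_700_000_000.0  # epoch seconds
INITIAL_SEGMENT_NUMBER = 1
SEGMENT_DURATION_S = 2.0                   # constant-duration template

def latest_valid_segment(now: float) -> int:
    """Highest segment number that should exist at time `now`."""
    elapsed = now - AVAILABILITY_START_TIME
    return INITIAL_SEGMENT_NUMBER + math.floor(elapsed / SEGMENT_DURATION_S)

def check_request(segment_number: int, now: float):
    """Return (http_status, cache_ttl_seconds) for a segment request."""
    edge = latest_valid_segment(now)
    if segment_number < INITIAL_SEGMENT_NUMBER:
        return 404, 0                      # never legitimate for this event
    if segment_number <= edge:
        return 200, None                   # should exist; go fetch it
    # Requested ahead of the live edge: reject, but tell the cache how long the
    # 404 stays valid -- until just before the segment is due to be published.
    publish_time = AVAILABILITY_START_TIME + (segment_number - INITIAL_SEGMENT_NUMBER) * SEGMENT_DURATION_S
    ttl = max(publish_time - now - 0.1, 0)  # small safety margin
    return 404, ttl

status, ttl = check_request(segment_number=1005, now=time.time())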

Because the Live Origin knows when segments are being pushed to it and where the live edge is, when a request is received for the immediately next object it can ‘hold open’ the request rather than handing back another 404 error (which would go all the way back through Open Connect to the client), and service it once the segment has been published. By doing this, the degree of chatter within the network handling requests that arrive early has been significantly reduced. As part of this, millisecond-grain caching was added to nginx to enhance standard HTTP Cache Control, which only works at second granularity, a long time when segments are generated every 2 seconds.

Streaming metadata enhancement

The HTTP standard allows for the addition of request and response headers that can be used to provide additional information as files move between clients and servers. The HTTP headers provide notifications of events within the stream in a highly scalable way that is independently conveyed to client devices, regardless of their playback position within the stream.

These notifications are provided to the origin by the live streaming pipeline and are inserted by the origin in the form of headers, appearing on the segments generated at that point in time (and persisting to future segments — they are cumulative). Whenever a segment is received at an OCA, this notification information is extracted from the response headers and used to update an in-memory data structure, keyed by event ID; and whenever a segment is served from the OCA, the latest such notification data is attached to the response. This means that, given any flow of segments into an OCA, it will always have the most recent notification data, even if all clients requesting it are behind the live edge. In fact, the notification information can be conveyed on any response, not just those supplying new segments.
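In sketch form, the OCA-side bookkeeping boils down to a small per-event map, updated on every fill and consulted on every serve. The header prefix here is invented for illustration; the real header names are internal to the system.

# Per-event "latest notification" bookkeeping on an OCA (illustrative).
latest_notifications: dict[str, dict[str, str]] = {}

NOTIFICATION_PREFIX = "x-live-notification-"  # hypothetical header prefix

def on_segment_filled(event_id: str, response_headers: dict[str, str]) -> None:
    """Called whenever a segment arrives from the origin or an upstream cache."""
    incoming = {k: v for k, v in response_headers.items() if k.startswith(NOTIFICATION_PREFIX)}
    if incoming:
        # Notifications are cumulative, so the newest set simply replaces the old one.
        latest_notifications[event_id] = incoming

def on_segment_served(event_id: str, response_headers: dict[str, str]) -> None:
    """Called for every response, so clients behind the live edge still get the latest data."""
    response_headers.update(latest_notifications.get(event_id, {}))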

Cache invalidation and origin mask

An invalidation system has been available since the early days of the project. It can be used to “flush” all content associated with an event by altering the key used when looking up objects in cache — this is done by incorporating a version number into the cache key that can then be bumped on demand. This is used during pre-event testing so that the network can be returned to a pristine state for the test with minimal fuss.

Each segment published by the Live Origin conveys the encoding pipeline it was generated by, as well as the region it was requested from. Any issues that are found after segments make their way into the network can be remedied by an enhanced invalidation system that takes such variants into account. It is possible to invalidate (that is, cause to be considered expired) segments in a range of segment numbers, optionally restricted to those sourced from encoder A, or further restricted to those sourced from encoder A and retrieved from region X.

In combination with Open Connect’s enhanced cache invalidation, the Netflix Live Origin allows selective encoding pipeline masking to exclude a range of segments from a particular pipeline when serving segments to Open Connect. The enhanced cache invalidation and origin masking enable live streaming operations to hide known problematic segments (e.g., segments causing client playback errors) from streaming clients once the bad segments are detected, protecting millions of streaming clients during the DVR playback window.

Origin storage architecture

Our original storage architecture for the Live Origin was simple: just use AWS S3 like we do for SVOD. This served us well initially for our low-traffic events, but as we scaled up we discovered that Live streaming has unique latency and workload requirements that differ significantly from on-demand, where we have plenty of lead time to pre-position content. While S3 met its stated uptime guarantees, the strict 2-second retry budget inherent to Live events (where every write is critical) led us to explore optimizations specifically tailored for real-time delivery at scale. AWS S3 is an amazing object store, but our Live streaming requirements were closer to those of a global, low-latency, highly available database. So, we went back to the drawing board and started from the requirements. The Origin required:

  1. [HA Writes] Extremely high write availability, ideally as close as possible to full write availability within a single AWS region, with replication delay to other regions in the low seconds. Any failed write operation within 500ms is considered a bug that must be triaged and prevented from re-occurring.
  2. [Throughput] High write throughput, with hundreds of MiB replicating across regions
  3. [Large Partitions] Efficiently support O(MiB) writes that accumulate to O(10k) keys per partition with O(GiB) total size per event.
  4. [Strong Consistency] Within the same region, we needed read-your-write semantics to hit our <1s read delay requirements (must be able to read published segments)
  5. [Origin Storm] During worst-case load involving Open Connect edge cases, we may need to handle O(GiB) of read throughput without affecting writes.

Fortunately, Netflix had previously invested in building a KeyValue Storage Abstraction that cleverly leveraged Apache Cassandra to provide chunked storage of MiB or even GiB values. This abstraction was initially built to support cloud saves of Game state. The Live use case would push the boundaries of this solution, however, in terms of availability for writes (#1), cumulative partition size (#3), and read throughput during Origin Storm (#5).

High Availability for Writes of Large Payloads

The KeyValue Payload Chunking and Compression Algorithm breaks O(MiB) work down into parts that can each be idempotently retried and hedged to maintain strict latency service level objectives, while also spreading the data across the full cluster. By combining this algorithm with Apache Cassandra’s local-quorum consistency model (which preserves write availability even through an entire Availability Zone outage) and a write-optimized Log-Structured Merge Tree (LSM) storage engine, we could meet the first four requirements. After iterating on the performance and availability of this solution, we were not only able to achieve the write availability required, but did so with a P99 tail latency similar to the status quo’s P50 average latency, while also handling cross-region replication behind the scenes for the Origin. This new solution was significantly more expensive (as expected, databases backed by SSD cost more), but minimizing cost was not a key objective; low latency with high availability was:
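A toy version of that chunk-and-retry idea shows why idempotency makes retries and hedging safe. This is not Netflix’s KeyValue implementation; the chunk size, key naming, retry policy, and the `store` object (anything with put(key, value)) are all placeholders.

import hashlib

CHUNK_SIZE = 1 << 20  # 1 MiB chunks (placeholder)

def write_value(store, key: str, payload: bytes, max_attempts: int = 3) -> None:
    """Split a large value into chunks that can each be retried (or hedged)
    independently; rewriting a chunk is harmless because its key is derived
    deterministically from its position."""
    chunks = [payload[i:i + CHUNK_SIZE] for i in range(0, len(payload), CHUNK_SIZE)]
    for index, chunk in enumerate(chunks):
        chunk_key = f"{key}/chunk-{index:05d}"
        for attempt in range(max_attempts):
            try:
                store.put(chunk_key, chunk)   # idempotent: same key, same bytes
                break
            except TimeoutError:
                if attempt == max_attempts - 1:
                    raise
    # Commit a small manifest last, so readers only ever see fully written values.
    manifest = {"chunks": len(chunks), "sha256": hashlib.sha256(payload).hexdigest()}
    store.put(f"{key}/manifest", manifest)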

Storage System Write Performance

High Availability Reads at Gbps Throughputs

Now that we had solved the write reliability problem, we had to handle the Origin Storm failure case, where potentially dozens of Open Connect top-tier caches could be requesting multiple O(MiB) video segments at once. Our back-of-the-envelope calculations showed worst-case read throughput in the O(100Gbps) range, which would normally be extremely expensive for a strongly consistent storage engine like Apache Cassandra. With careful tuning of chunk access, we were able to respond to reads at network line rate (100Gbps) from Apache Cassandra, but we observed unacceptable performance and availability degradation on concurrent writes. To resolve this issue, we introduced write-through caching of chunks using our distributed caching system EVCache, which is based on Memcached. This allows almost all reads to be served from a highly scalable cache, letting us easily hit 200Gbps and beyond without affecting the write path, achieving read-write separation.

Final Storage Architecture

In the final storage architecture, the Live Origin writes to and reads from KeyValue, which manages a write-through cache in EVCache (memcached) and implements a safe chunking protocol that spreads large values and partitions across the storage cluster (Apache Cassandra). This allows almost all read load to be handled from cache, with only misses hitting the storage. This combination of cache and highly available storage has met the demanding needs of our Live Origin for over a year now.
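The read-write separation described above follows the classic write-through pattern. Here is a condensed sketch, with EVCache and the Cassandra-backed store reduced to simple get/put stand-ins; it is not the actual KeyValue API.

def write_segment(key: str, value: bytes, cassandra, evcache) -> None:
    # Write-through: persist to the durable store first, then populate the cache,
    # so a cache miss can always be served from storage.
    cassandra.put(key, value)
    evcache.put(key, value)

def read_segment(key: str, cassandra, evcache):
    # Almost all reads are absorbed by the cache; only misses touch Cassandra,
    # keeping Origin Storm read traffic away from the write path.
    value = evcache.get(key)
    if value is None:
        value = cassandra.get(key)
        if value is not None:
            evcache.put(key, value)  # backfill for subsequent readers
    return value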

Storage System High Level Architecture

Delivering this consistent low latency for large writes with cross-region replication and consistent write-through caching to a distributed cache required solving numerous hard problems with novel techniques, which we plan to share in detail during a future post.

Scalability and scalable architecture

Netflix’s live streaming platform must handle a high volume of diverse stream renditions for each live event. This complexity stems from supporting various video encoding formats (each with multiple encoder ladders), numerous audio options (across languages, formats, and bitrates), and different content versions (e.g., with or without advertisements). The combination of these elements, alongside concurrent event support, leads to a significant number of unique stream renditions per live event. This, in turn, necessitates a high Requests Per Second (RPS) capacity from the multi-tenant live origin service to ensure publishing-side scalability.

In addition, Netflix’s global reach presents distinct challenges to the live origin on the retrieval side. During the Tyson vs. Paul fight event in 2024, a historic peak of 65 million concurrent streams was observed. Consequently, a scalable architecture for live origin is essential for the success of large-scale live streaming.

Scaling architecture

We chose to build a highly scalable origin instead of relying on the traditional origin shields approach for better end-to-end cache consistency control and simpler system architecture. The live origin in this architecture directly connects with top-tier Open Connect nodes, which are geographically distributed across several sites. To minimize the load on the origin, only designated nodes per stream rendition at each site are permitted to directly fill from the origin.

Netflix Live Origin Scalability Architecture

While the origin service can autoscale horizontally using EC2 instances, there are other system resources that are not autoscalable, such as storage platform capacity and AWS-to-Open Connect backbone bandwidth capacity. Since not all requests to the live origin are of the same importance in live streaming, the origin is designed to prioritize more critical requests over less critical ones when system resources are limited. The table below outlines the request categories, their identification, and protection methods.

Publishing isolation

Publishing traffic, unlike potentially surging CDN retrieval traffic, is predictable, making path isolation a highly effective solution. As shown in the scalability architecture diagram, the origin utilizes separate EC2 publishing and CDN stacks to protect the latency and failure-sensitive origin writes. In addition, the storage abstraction layer features distinct clusters for key-value (KV) read and KV write operations. Finally, the storage layer itself separates read (EVCache) and write (Cassandra) paths. This comprehensive path isolation facilitates independent cloud scaling of publishing and retrieval, and also prevents CDN-facing traffic surges from impacting the performance and reliability of origin publishing.

Priority rate limiting

Given Netflix’s scale, managing incoming requests during a traffic storm is challenging, especially considering non-autoscalable system resources. The Netflix Live Origin implemented priority-based rate limiting when the underlying system is under stress. This approach ensures that requests with greater user impact are prioritized to succeed, while requests with lower user impact are allowed to fail during times of stress in order to protect the streaming infrastructure and are permitted to retry later to succeed.

Leveraging Netflix’s microservice platform priority rate limiting feature, the origin prioritizes live edge traffic over DVR traffic during periods of high load on the storage platform. The live edge vs. DVR traffic detection is based on the predictable segment template. The template is further cached in memory on the origin node to enable priority rate limiting without access to the datastore, which is valuable especially during periods of high datastore stress.

To mitigate traffic surges, TTL cache control is used alongside priority rate limiting. When low-priority traffic is impacted, the origin instructs Open Connect to slow down and cache identical requests for 5 seconds by setting max-age=5s and returning an HTTP 503 error code. This strategy effectively dampens traffic surges by preventing repeated requests to the origin within that 5-second window.
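In HTTP terms, the shedding response is just a 503 with a short cache lifetime. A small sketch of what the origin might hand back for a low-priority request under stress (framework details omitted; the Retry-After hint is my own addition, not something stated in the post):

def shed_low_priority_request():
    """Response returned to Open Connect for low-priority requests while the
    datastore is under stress: a 503 that downstream caches may reuse for
    5 seconds, which dampens retry storms at the origin."""
    status = 503
    headers = {
        "Cache-Control": "max-age=5",  # cache the rejection for 5 seconds
        "Retry-After": "5",            # assumed hint for well-behaved clients
    }
    return status, headers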

The following diagrams illustrate origin priority rate limiting with simulated traffic. The nliveorigin_mp41 traffic is the low-priority traffic and is mixed with other high-priority traffic. In the first row: the 1st diagram shows the request RPS, the 2nd diagram shows the percentage of request failure. In the second row, the 1st diagram shows datastore resource utilization, and the 2nd diagram shows the origin retrieval P99 latency. The results clearly show that only the low-priority traffic (nliveorigin_mp41) is impacted at datastore high utilization, and the origin request latency is under control.

Origin Priority Rate Limiting

404 storm and cache optimization

Publishing isolation and priority rate limiting successfully protect the live origin from DVR traffic storms. However, the traffic storm generated by requests for non-existent segments presents further challenges and opportunities for optimization.

The live origin structures metadata hierarchically as event > stream rendition > segment, and the segment publishing template is maintained at the stream rendition level. This hierarchical organization allows the origin to preemptively reject requests with an HTTP 404 (Not Found) or 410 (Gone) error, leveraging highly cacheable event and stream rendition level metadata and avoiding unnecessary queries to the segment-level metadata (a small sketch of this decision ladder follows the list):

  • If the event is unknown, reject the request with 404
  • If the event is known, but the segment request timing does not match the expected publishing timing, reject the request with 404 and cache control TTL matching the expected publishing time
  • If the event is known but the requested segment was never generated or missed the retry deadline, reject the request with a 410 error, preventing the client from repeatedly requesting it
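A compact sketch of that decision ladder, reusing the segment-timing idea from earlier. Event and rendition metadata are plain dictionaries here, and field names such as retry_deadline_s are invented for illustration.

def classify_request(event, rendition, segment_number: int, now: float):
    """Return (http_status, cache_ttl_seconds) using only event/rendition
    level metadata, without touching segment-level metadata."""
    if event is None:
        return 404, None                            # unknown event
    template = rendition["publishing_template"]      # illustrative field names
    start = template["availability_start_time"]
    duration = template["segment_duration_s"]
    first = template["initial_segment_number"]
    expected_publish = start + (segment_number - first) * duration
    if now < expected_publish:
        # Too early: reject, but let caches hold the 404 until publish time.
        return 404, expected_publish - now
    if now > expected_publish + template["retry_deadline_s"]:
        # Segment was never generated or missed its deadline: permanent rejection.
        return 410, None
    return 200, None                                 # within the valid window; look it up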

At the storage layer, metadata is stored separately from media data in the control plane datastore. Unlike the media datastore, the control plane datastore does not use a distributed cache to avoid cache inconsistency. Event and rendition level metadata benefits from a high cache hit ratio when in-memory caching is utilized at the live origin instance. During traffic storms involving non-existent segments, the cache hit ratio for control plane access easily exceeds 90%.

The use of in-memory caching for metadata effectively handles 404 storms at the live origin without causing datastore stress. This metadata caching complements the storage system’s distributed media cache, providing a complete solution for traffic surge protection.

Summary

The Netflix Live Origin, built upon an optimized storage platform, is specifically designed for live streaming. It incorporates advanced media and segment publishing scheduling awareness and leverages enhanced intelligence to improve streaming quality, optimize scalability, and improve Open Connect live streaming operations.

Acknowledgement

Many teams and stunning colleagues contributed to the Netflix live origin. Special thanks to Flavio Ribeiro for advocacy and sponsorship of the live origin project; to Raj Ummadisetty, Prudhviraj Karumanchi for the storage platform; to Rosanna Lee, Hunter Ford, and Thiago Pontes for storage lifecycle management; to Ameya Vasani for e2e test framework; Thomas Symborski for orchestrator integration; to James Schek for Open Connect integration; to Kevin Wang for platform priority rate limit; to Di Li, Nathan Hubbard for origin scalability testing.


Netflix Live Origin was originally published in the Netflix TechBlog on Medium.

re:Invent Recap

It’s been a while since I last attended re:Invent… long enough that I’d almost forgotten how expensive a bottle of water can be in a Vegas hotel room. This time was different. Instead of just attending, I wore many hats: audio-visual tech, salesperson, technical support, friendly ear, and booth rep.

re:Invent is an experience that’s hard to explain to anyone outside tech. Picture 65,000 people converging in Las Vegas: DJ booths thumping beside deep-dive technical sessions, competitions running nonstop, and enough swag to fill a million Christmas stockings. Only then do you start to grasp what it’s really like. Needless to say, having the privilege to fly halfway across the globe, stay in a bougie hotel, and help host the impressive ScyllaDB booth was a fitting way to finish the year on a high.

This year was ScyllaDB’s biggest re:Invent presence yet… a full-scale booth designed to show what predictable performance at extreme scale really looks like. The booth was buzzing from open to close, packed with data engineers, developers, and decision-makers exploring how ScyllaDB handles millions of operations per second with single-digit P99 millisecond latency. Some of the standout moments for me included:

  • Customer sessions at the Content Hub featuring Freshworks and SAS, both showcasing how ScyllaDB powers their mission-critical AI workloads.
  • In-booth talks from Freshworks, SAS, Sprig, and Revinate. Real users sharing real production stories. There’s nothing better than hearing how our customers are conquering performance challenges at scale.
  • Technical deep dives exploring everything from linear scalability to real-time AI pipelines.
  • ScyllaDB X Cloud linear-scale demonstration, a live visualization of throughput scaling predictably with every additional node. Watching tablets rebalance automatically never gets old.
  • High-impact in-booth videos driving home ScyllaDB’s key differentiators. I’m particularly proud of our DB Guy parody, along with the ScyllaDB monster on the Vegas Sphere (yes, we fooled many with that one).

For many visitors, this was their first time seeing ScyllaDB X Cloud and Vector Search in action. Our demos made it clear what we mean by performance at scale: serving billions of vectors or millions of events per second, all while keeping tail latency comfortably under 5 ms and cost behavior entirely predictable. Developers I chatted with loved that ScyllaDB drops neatly into existing Cassandra or DynamoDB environments while delivering much better performance and a lower TCO. Architects zeroed in on our flexibility across EC2 instance families (especially i8g) and hybrid deployment models. The ability to bring your own AWS (or GCP) account sparked plenty of conversations around performance, security, and data sovereignty.

What really stood out this year was the shift in mindset. re:Invent 2025 confirmed that the future of extreme-scale database engineering belongs to real-time systems… from AI inference to IoT telemetry, where low latency and linear scale are essential for success. ScyllaDB sits right at that intersection: a database built to scale fearlessly, delivering the control of bare metal with the simplicity of managed cloud.

If you missed us in Vegas, don’t worry… you can still catch the highlights. Watch our customer sessions and the full X Cloud demo, and see why predictable performance at extreme scale isn’t just our tagline. It’s what we do every day.

Catch the re:Invent videos

Stay ahead with Apache Cassandra®: 2025 CEP highlights

Apache Cassandra® committers are working hard, building new features to help ease the operational challenges of a distributed database. Let’s dive into some recently approved CEPs and explain how these upcoming features will improve your workflow and efficiency.

What is a CEP?

CEP stands for Cassandra Enhancement Proposal. A CEP is the process for outlining, discussing, and gathering endorsement for a new feature in Cassandra. It’s more than a feature request: those who put forward a CEP intend to build the feature, and the proposal process encourages close collaboration with the Cassandra contributors.

The CEPs discussed here were recently approved for implementation or have had significant progress in their implementation.  As with all open-source development, inclusion in a future release is contingent upon successful implementation, community consensus, testing, and approval by project committers.

CEP-42: Constraints framework

With collaboration from NetApp Instaclustr, CEP-42 (and subsequent iterations) delivers schema-level constraints, giving Cassandra users and operators more control over their data. Adding constraints at the schema level means that data can be validated at write time, with the appropriate error returned when the data is invalid.

Constraints are defined in-line or as a separate definition. The inline style allows for only one constraint while a definition allows users to define multiple constraints with different expressions.

The initial scope of CEP-42 supported a few constraints covering the majority of cases, but follow-up efforts expanded the supported list to include scalar comparisons (>, <, >=, <=), LENGTH(), OCTET_LENGTH(), NOT NULL, JSON(), and REGEX(). Users can also define their own constraints by implementing them and putting them on Cassandra’s class path.

A simple example of an in-line constraint:

CREATE TABLE users (
    username text PRIMARY KEY,
    age int CHECK age >= 0 and age < 120
);
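From an application’s point of view, a write that violates the constraint is rejected at write time. Here is a sketch using the Python driver against the table above; the keyspace name is illustrative, and the exact exception raised for a constraint violation is my assumption (InvalidRequest), so verify it against the release you run.

from cassandra.cluster import Cluster
from cassandra import InvalidRequest

cluster = Cluster(["127.0.0.1"])
session = cluster.connect("my_keyspace")  # keyspace name is illustrative

# A valid write passes the CHECK constraint.
session.execute("INSERT INTO users (username, age) VALUES (%s, %s)", ("alice", 34))

# An invalid write (age outside 0..120) is rejected at write time.
try:
    session.execute("INSERT INTO users (username, age) VALUES (%s, %s)", ("bob", 250))
except InvalidRequest as exc:
    print("rejected by constraint:", exc)

cluster.shutdown()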

Constraints are not supported for UDTs (User-Defined Types) nor collections (except for using NOT NULL for frozen collections).

Enabling constraints closer to the data is a subtle but mighty way for operators to ensure that data goes into the database correctly. Defining rules just once simplifies application code, makes it more robust, and prevents validation from being bypassed. Those who have worked with MySQL, Postgres, or MongoDB will enjoy this addition to Cassandra.

CEP-51: Support “Include” Semantics for cassandra.yaml

The cassandra.yaml file holds important settings for storage, memory, replication, compaction, and more. It’s no surprise that the average size of the file is around 1,000 lines (though, yes—most are comments). CEP-51 enables splitting the cassandra.yaml configuration into multiple files using include semantics. From the outside, this feels like a small change, but the implications are huge if a user chooses to opt in.

In general, the size of the configuration file makes it difficult to manage and coordinate changes. It’s often the case that multiple teams manage various aspects of the single file. In addition, the entire cassandra.yaml is readable by anyone with access to the file, meaning private information like credentials is commingled with all other settings. There is risk from both an operational and a security standpoint.

Enabling the new semantics, and therefore modularity for the configuration file, eases management and deployment, reduces complexity around environment-specific settings, and improves security in one shot. The configuration follows the principle of least privilege once cassandra.yaml is broken up into smaller, well-defined files; sensitive configuration settings are separated out from general settings, with fine-grained access for the individual files. With the feature enabled, different development teams are better equipped to deploy safely and independently.

If you’ve deployed your Cassandra cluster on the NetApp Instaclustr platform, the cassandra.yaml file is already configured and managed for you. We pride ourselves on making it easy for you to get up and running fast.

CEP-52: Schema annotations for Apache Cassandra

With extensive review by the NetApp Instaclustr team and Stefan Miklosovic, CEP-52 introduces schema annotations in CQL, allowing inline comments and labels on schema elements such as keyspaces, tables, columns, and User-Defined Types (UDTs). Users can easily define and alter comments and labels on these elements, and they can be copied over when desired using CREATE TABLE LIKE syntax. Comments are stored as plain text, while labels are stored as structured metadata.

Comments and labels serve different annotation purposes: Comments document what a schema object is for, whereas labels describe how sensitive or controlled it is meant to be. For example, labels can be used to identify columns as “PII” or “confidential”, while the comment on that column explains usage, e.g. “Last login timestamp.”

Users can query these annotations. CEP-52 defines two new read-only tables (system_views.schema_comments and system_views.schema_security_labels) to store comments and security labels, so objects with comments can be returned as a list, or a user or machine process can query for specific labels, which is beneficial for auditing and classification. Note that security labels are descriptive metadata and do not enforce access control to the data.
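Querying the new virtual tables should look like any ordinary CQL read. Here is a sketch with the Python driver; only the table names come from the CEP, so the wildcard select keeps the example on safe ground rather than guessing at column names.

from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])
session = cluster.connect()

# List every schema comment known to the node.
for row in session.execute("SELECT * FROM system_views.schema_comments"):
    print(row)

# Security labels live in their own read-only view.
for row in session.execute("SELECT * FROM system_views.schema_security_labels"):
    print(row)

cluster.shutdown()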

CEP-53: Cassandra rolling restarts via Sidecar

Sidecar is an auxiliary component in the Cassandra ecosystem that exposes cluster management and streaming capabilities through APIs. CEP-53 introduces rolling restarts through Sidecar, designed to give operators more efficient and safer restarts without cluster-wide downtime. More specifically, operators can monitor, pause, resume, and abort restarts, all through an API with configurable options if restarts fail.

Rolling restarts bring operators a step closer to cluster-wide operations and lifecycle management via Sidecar. Operators will be able to configure the number of nodes to restart concurrently with minimal risk, as this CEP defines clear states as a node progresses through a restart. Accounting for a variety of edge cases, an operator can feel assured that, for example, a non-functioning Sidecar won’t derail operations.

The current process for restarting a node is a multi-step, manual process, which does not scale for large cluster sizes (and is also tedious for small clusters). Restarting clusters previously lacked a streamlined approach as each node needed to be restarted one at a time, making the process time-intensive and error-prone.

Though Sidecar is still considered WIP, it’s got big plans to improve operating large clusters!

The NetApp Instaclustr Platform, in conjunction with our expert TechOps team, already orchestrates these laborious tasks for our Cassandra customers with a high level of care to ensure their cluster stays online. Restarting or upgrading your Cassandra nodes is a huge pain point for operators, but it doesn’t have to be when using our managed platform (with round-the-clock support!).

CEP-54: Zstd with dictionary SSTable compression

CEP-54, with NetApp Instaclustr’s collaboration, aims to add support for Zstd with dictionary compression for SSTables. Zstd, or Zstandard, is a fast, lossless data compression algorithm that boasts impressive ratios and speed and has been supported in Cassandra since 4.0. Certain workloads can benefit from significantly faster read/write performance, reduced storage footprint, and increased storage device lifetime when using dictionary compression.

At a high level, operators choose a table they want to compress with a dictionary. A dictionary must first be trained on a small amount of already-present data (recommended: no more than 10MiB). The result of training is a dictionary, which is stored cluster-wide for all other nodes to use, and this dictionary is used for all subsequent writes of SSTables to disk.

Workloads with structured data of similar rows benefit most from Zstd with dictionary compression. Some examples of ideal workloads include event logs, telemetry data, and metadata tables with templated messages. Think: repeated row data. If the table data is too unstructured or random, dictionary compression likely won’t be optimal; however, plain Zstd will still be an excellent option.

New SSTables with dictionaries are readable across nodes and can be streamed, repaired, and backed up. Existing tables are unaffected if dictionary compression is not enabled. Too many unique dictionaries hurt decompression; use minimal dictionaries (the recommended dictionary size is about 100KiB, with one dictionary per table) and only adopt new ones when they’re noticeably better.

CEP-55: Generated role names

CEP-55 adds support to create users/roles without supplying a name, simplifying user management, especially when generating users and roles in bulk. With the example syntax CREATE GENERATED ROLE WITH GENERATED PASSWORD, new keys are placed in a newly introduced configuration section in cassandra.yaml under “role_name_policy.”

Stefan Miklosovic, our Cassandra engineer at NetApp Instaclustr, created this CEP as a logical follow up to CEP-24 (password validation/generation), which he authored as well. These quality-of-life improvements let operators spend less time doing trivial tasks with high-risk potential and more time on truly complex matters.

Manual name selection seems trivial until a hundred role names need to be generated; then there is a security risk if the new usernames—or worse, passwords—are easily guessable. With CEP-55, the generated role name will be UUID-like, with optional prefix/suffix and size hints; however, a pluggable policy is also available to generate and validate names. This is an opt-in feature with no effect on the existing method of generating role names.

The future of Apache Cassandra is bright

 These Cassandra Enhancement Proposals demonstrate a strong commitment to making Apache Cassandra more powerful, secure, and easier to operate. By staying on top of these updates, we ensure our managed platform seamlessly supports future releases that accelerate your business needs.

At NetApp Instaclustr, our expert TechOps team already orchestrates laborious tasks like restarts and upgrades for our Apache Cassandra customers, ensuring their clusters stay online. Our platform handles the complexity so you can get up and running fast.

Learn more about our fully managed and hosted Apache Cassandra offering and try it for free today!

The post Stay ahead with Apache Cassandra®: 2025 CEP highlights appeared first on Instaclustr.

ScyllaDB Vector Search: 1B Vectors with 2ms P99s and 250K QPS Throughput

Results from a benchmark using the yandex-deep_1b dataset, which contains 1 billion vectors of 96 dimensions

January 20, 2026 Update: ScyllaDB Vector Search is now GA and production-ready. See the Quick Start Guide and give it a try!

As AI-driven applications move from experimentation into real-time production systems, the expectations placed on vector similarity search continue to rise dramatically. Teams now need to support billion-scale datasets, high concurrency, strict p99 latency budgets, and a level of operational simplicity that reduces architectural overhead rather than adding to it. ScyllaDB Vector Search was built with these constraints in mind. It offers a unified engine for storing structured data alongside unstructured embeddings, and it achieves performance that pushes the boundaries of what a managed database system can deliver at scale. The results of our recent high-scale 1-billion-vector benchmark show that ScyllaDB demonstrates both ultra-low latency and highly predictable behaviour under load.

Architecture at a Glance

To achieve low-single-millisecond performance across massive vector sets, ScyllaDB adopts an architecture that separates the storage and indexing responsibilities while keeping the system unified from the user’s perspective. The ScyllaDB nodes store both the structured attributes and the vector embeddings in the same distributed table. Meanwhile, a dedicated Vector Store service – implemented in Rust and powered by the USearch engine, optimized to support ScyllaDB’s predictable single-digit millisecond latencies – consumes updates from ScyllaDB via CDC and builds approximate-nearest-neighbour (ANN) indexes in memory. Queries are issued to the database using a familiar CQL expression such as:

SELECT … ORDER BY vector_column ANN_OF ? LIMIT k;

They are then internally routed to the Vector Store, which performs the similarity search and returns the candidate rows. This design allows each layer to scale independently, optimising for its own workload characteristics and eliminating resource interference.

Benchmarking 1 Billion Vectors

To evaluate real-world performance, ScyllaDB ran a rigorous benchmark using the publicly available yandex-deep_1b dataset, which contains 1 billion vectors of 96 dimensions. The setup consisted of six nodes: three ScyllaDB nodes running on i4i.16xlarge instances, each equipped with 64 vCPUs, and three Vector Store nodes running on r7i.48xlarge instances, each with 192 vCPUs. This hardware configuration reflects realistic production deployments where the database and vector indexing tiers are provisioned with different resource profiles. The results focus on two usage scenarios with distinct accuracy and latency goals (detailed in the following sections). A full architectural deep dive, including diagrams, performance trade-offs, and extended benchmark results for higher-dimension datasets, can be found in the technical blog post Building a Low-Latency Vector Search Engine for ScyllaDB. These additional results follow the same pattern seen in the 96-dimensional tests: exceptionally low latency, high throughput, and stability across a wide range of concurrent load profiles.

Scenario #1 — Ultra-Low Latency with Moderate Recall

The first scenario was designed for workloads such as recommendation engines and real-time personalisation systems, where the primary objective is extremely low latency and the recall can be moderately relaxed. We used index parameters m = 16, ef-construction = 128, ef-search = 64 and Euclidean distance. 
At approximately 70% recall and with 30 concurrent searches, the system maintained a p99 latency of only 1.7 milliseconds and a p50 of just 1.2 milliseconds, while sustaining 25,000 queries per second. When expanding the throughput window (still keeping p99 latency below 10 milliseconds), the cluster reached 60,000 QPS for k = 100 with a p50 latency of 4.5 milliseconds, and 252,000 QPS for k = 10 with a p50 latency of 2.2 milliseconds. Importantly, utilizing ScyllaDB’s predictable performance, this throughput scales linearly: adding more Vector Store nodes directly increases the achievable QPS without compromising latency or recall.

Latency and throughput depending on the concurrency level for recall equal to 70%.

Scenario #2 — High Recall with Slightly Higher Latency

The second scenario targets systems that require near-perfect recall, including high-fidelity semantic search and retrieval-augmented generation pipelines. Here, the index parameters were significantly increased to m = 64, ef-construction = 512, and ef-search = 512. This configuration raises compute requirements but dramatically improves recall. With 50 concurrent searches and recall approaching 98%, ScyllaDB kept p99 latency below 12 milliseconds and p50 around 8 milliseconds while delivering 6,500 QPS. When shifting the focus to maximum sustained throughput while keeping p99 latency under 20 milliseconds and p50 under 10 milliseconds, the system achieved 16,600 QPS. Even under these settings, latency remained notably stable across values of k from 10 to 100, demonstrating predictable behaviour in environments where query limits vary dynamically.

Latency and throughput depending on the concurrency level for recall equal to 98%.

Detailed results

The table below presents the summary of the results for some representative concurrency levels.

                     Run 1       Run 2       Run 3
  m                  16          16          64
  ef-construction    128         128         512
  ef-search          64          64          512
  metric             Euclidean   Euclidean   Euclidean
  upload             3.5 h       3.5 h       3.5 h
  build              4.5 h       4.5 h       24.4 h
  p50 [ms]           2.2         1.7         8.2
  p99 [ms]           9.9         4           12.3
  qps                252,799     225,891     10,206

Unified Vector Search Without the Complexity

A big advantage of integrating Vector Search with ScyllaDB is that it delivers substantial performance and networking cost advantages. The vector store resides close to the data, with just a single network hop between metadata and embedding storage in the same availability zone. This locality, combined with ScyllaDB’s shard-per-core execution model, allows the system to provide real-time latency and massive throughput even under heavy load. The result is that teams can accomplish more with fewer resources compared to specialised vector-search systems.

In addition to being fast at scale, ScyllaDB’s Vector Search is also simpler to operate. Its key advantage is its ability to unify structured and unstructured retrieval within a single dataset. This means you can store traditional attributes and vector embeddings side-by-side and express queries that combine semantic search with conventional search. For example, you can ask the database to “find the top five most similar documents, but only those belonging to this specific customer and created within the past 30 days.” This approach eliminates the common pain of maintaining separate systems for transactional data and vector search, and it removes the operational fragility associated with syncing between two sources of truth. This also means there is no ETL drift and no dual-write risk. 
Instead of shipping embeddings to a separate vector database while keeping metadata in a transactional store, ScyllaDB consolidates everything into a single system. The only pipeline you need is the computational step that generates embeddings using your preferred LLM or ML model. Once written, the data remains consistent without extra coordination, backfills, or complex streaming jobs.

Operationally, ScyllaDB simplifies the entire retrieval stack. Because it is built on ScyllaDB’s proven distributed architecture, the system is highly available, horizontally scalable, and resilient across availability zones and regions. Instead of operating two or three different technologies – each with its own monitoring, security configurations, and failure modes – you only manage one. This consolidation drastically reduces operational complexity while simultaneously improving performance.

Roadmap

The product is now in General Availability. This includes Cloud Portal provisioning, on-demand billing, a full range of instance types, and additional performance optimisations. Self-service scaling is planned for Q1. By the end of Q1 we will introduce native filtering capabilities, enabling vector search queries to combine ANN results with traditional predicates for more precise hybrid retrieval. Looking further ahead, the roadmap includes support for scalar and binary quantisation to reduce memory usage, TTL functionality for lifecycle automation of vector data, and integrated hybrid search combining ANN with BM25 for unified lexical and semantic relevance.

Conclusion

ScyllaDB has demonstrated that it is capable of delivering industry-leading performance for vector search at massive scale, handling a dataset of 1 billion vectors with p99 latency as low as 1.7 milliseconds and throughput up to 252,000 QPS. These results validate ScyllaDB Vector Search as a unified, high-performance solution that simplifies the operational complexity of real-time AI applications by co-locating structured data and unstructured embeddings. The current benchmarks showcase the current state of ScyllaDB’s scalability. With planned enhancements on the roadmap, including scalar quantization and sharding, these performance limits are set to increase in the next year. Nevertheless, even now, the feature is ready for running latency-critical workloads such as fraud detection or recommendation systems.

Cut LLM Costs and Latency with ScyllaDB Semantic Caching

How semantic caching can help with costs and latency as you scale up your AI workload

Developers building large-scale LLM solutions often rely on powerful APIs such as OpenAI’s. This approach outsources model hosting and inference, allowing teams to focus on application logic rather than infrastructure. However, there are two main challenges you might face as you scale up your AI workload: high costs and high latency. This blog post introduces semantic caching as a possible solution to these problems. Along the way, we cover how ScyllaDB can help implement semantic caching.

What is semantic caching?

Semantic caching follows the same principle as traditional caching: storing data in a system that allows faster access than your primary source. In conventional caching solutions, that source is a database. In AI systems, the source is an LLM.

Here’s a simplified semantic caching workflow:

  1. User sends a question (“What is ScyllaDB?”)
  2. Check if this type of question has been asked before (for example “whats scylladb” or “Tell me about ScyllaDB”)
  3. If yes, deliver the response from cache
  4. If no: a) send the request to the LLM and deliver the response from there, b) save the response to cache

Semantic caching stores the meaning of user queries as vector embeddings and uses vector search to find similar ones. If there’s a close enough match, it returns the cached result instead of calling the LLM. The more queries you can serve from the cache, the more you save on cost and latency over time.

Invalidating data is just as important for semantic caching as it is for traditional caching. For instance, if you are working with RAGs (where the underlying base information can change over time), then you need to invalidate the cache periodically so it returns accurate information. For example, if the user query is “What’s the most recent version of ScyllaDB Enterprise,” the answer depends on when you ask this question. The cached response to this query must be refreshed accordingly (assuming the only context your LLM works with is the one provided by the cache).

Why use a semantic cache?

Simply put, semantic caching saves you money and time. You save money by making fewer LLM calls, and you save time from faster responses. When a use case involves repeated or semantically similar queries, and identical responses are acceptable, semantic caching offers a practical way to reduce both inference costs and latency.

Heavy LLM usage might put you on OpenAI’s top spenders list. That’s great for OpenAI. But is it great for you? Sure, you’re using cutting-edge AI and delivering value to users, but the real question is: can you optimize those costs?

Cost isn’t the only concern. Latency matters too. LLMs inherently cannot achieve sub-millisecond response times. But users still expect instant responses. So how do you bridge that gap? You can combine LLM APIs with a low-latency database like ScyllaDB to speed things up. Combining AI models with traditional optimization techniques is key to meeting strict latency requirements.

Semantic caching helps mitigate these issues by caching LLM responses associated with the input embeddings. When a new input is received, its embedding is compared to those stored in the cache. If a similar-enough embedding is found (based on a defined similarity threshold), the saved response is returned from the cache. This way, you can skip the round trip to the LLM provider. This leads to two major benefits:

  • Lower latency: No need to wait for the LLM to generate a new response. Your low-latency database will always return responses faster than an LLM.
  • Lower cost: Cached responses are “free” – no LLM API fees. Unlike LLM calls, database queries don’t charge you per request or per token.

Why use ScyllaDB for semantic caching?

From day one, ScyllaDB has focused on three things: cutting latency, cost, and operational overhead. All three of those things matter just as much for LLM apps and semantic caching as they do for “traditional” applications. Furthermore, ScyllaDB is more than an in-memory cache. It’s a full-fledged high-performance database with a built-in caching layer. It offers high availability and strong P99 latency guarantees, making it ideal for real-time AI applications.

ScyllaDB has recently added a Vector Search offering, which is essential for building a semantic cache and is also used for a wide range of AI and LLM-based applications. For example, it’s quite commonly used as a feature store. In short, you can consolidate all your AI workloads into a single high-performance, low-latency database.

Now let’s see how you can implement semantic caching with ScyllaDB.

How to implement semantic caching with ScyllaDB

If you just want to dive in, clone the repo and try it yourself: check out the GitHub repository here. Here’s a simplified, general guide on how to implement semantic caching with ScyllaDB (using Python examples):

1. Create a semantic caching schema

First, we create a keyspace, then a table called prompts, which will act as our cache table. It includes the following columns:

  • prompt_id: The partition key for the table.
  • inserted_at: Stores the timestamp when the row was originally inserted (the response first cached).
  • prompt_text: The actual input provided by the user, such as a question or query.
  • prompt_embedding: The vector embedding representation of the user input.
  • llm_response: The LLM’s response for that prompt, returned from the cache when a similar prompt appears again.
  • updated_at: Timestamp of when the row was last updated, useful if the underlying data changes and the cached response needs to be refreshed.

Finally, we create an ANN (Approximate Nearest Neighbor) index on the prompt_embedding column to enable fast and efficient vector searches. Now that ScyllaDB is ready to receive and return responses, let’s implement semantic caching in our application code.

2. Convert user input to vector embedding

Take the user’s text input (which is usually a question or some kind of query) and convert it into an embedding using your chosen embedding model. It’s important that the same embedding model is used consistently for both cached data and new queries. In this example, we’re using a local embedding model from sentence transformers. In your application, you might use OpenAI or some other embedding provider platform.

3. Calculate similarity score

Use ScyllaDB Vector Search syntax, `ANN OF`, to find semantically similar entries in the cache. There are two key components in this part of the application:

  • Similarity score: You need to calculate the similarity between the user’s new query and the most similar item returned by vector search. Cosine similarity, which is the most frequently used similarity function in LLM-based applications, ranges from 0 to 1. A similarity of 1 means the embeddings are identical. A similarity of 0 means they are completely dissimilar.
  • Threshold: Determines whether the response can be provided from cache. If the similarity score is above that threshold, the new query is similar enough to one already stored in the cache, so the cached response can be returned. If it falls below the threshold, the system should fetch a fresh response from the LLM. The exact threshold should be tuned experimentally based on your use case.
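Here’s a condensed sketch of steps 2 and 3, assuming the prompts table and ANN index described above live in a keyspace called semantic_cache and that cached embeddings were produced by the same sentence-transformers model. The keyspace name, threshold, and parameter-binding details are illustrative; adapt them to your setup (and see the GitHub repository for the full example).

import numpy as np
from cassandra.cluster import Cluster
from sentence_transformers import SentenceTransformer

SIMILARITY_THRESHOLD = 0.85  # tune experimentally for your use case

model = SentenceTransformer("all-MiniLM-L6-v2")   # must match the model used when caching
cluster = Cluster(["127.0.0.1"])
session = cluster.connect("semantic_cache")       # keyspace name is illustrative

def cosine_similarity(a, b) -> float:
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def lookup(user_query: str):
    """Return (llm_response, similarity) for the closest cached prompt, or (None, similarity)."""
    embedding = model.encode(user_query).tolist()  # step 2: embed the query
    rows = session.execute(                        # step 3: ANN search in ScyllaDB
        "SELECT prompt_text, prompt_embedding, llm_response "
        "FROM prompts ORDER BY prompt_embedding ANN OF %s LIMIT 1",
        (embedding,),
    )
    row = rows.one()
    if row is None:
        return None, 0.0
    similarity = cosine_similarity(embedding, row.prompt_embedding)
    if similarity >= SIMILARITY_THRESHOLD:
        return row.llm_response, similarity        # close enough: serve from cache
    return None, similarity                        # below threshold: go to the LLM

Step 4, described next, wraps this lookup in the decision of whether to call the LLM and store the fresh response back into the prompts table.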
If the similarity score is above that threshold, it means the new query is similar enough to one already stored in the cache, so the cached response can be returned. If it falls below the threshold, the system should fetch a fresh response from the LLM. The exact threshold should be tuned experimentally based on your use case. 4. Implement cache logic Finally, putting it all together, you need a function that decides whether to serve a response from the cache or make a request to the LLM. If the user query matches something similar in the cache, follow the earlier steps and return the cached response. If it’s not in the cache, make a request to your LLM provider, such as OpenAI, return that response to the user, and then store it in the cache. This way, the next time a similar query comes in, the response can be served instantly from the cache. (A minimal end-to-end sketch of this logic appears at the end of this section.) Get started! Get started building with ScyllaDB; check out our examples on GitHub: `git clone https://github.com/scylladb/vector-search-examples.git` The repository includes Vector Search, Semantic Cache, and RAG examples. See the Quick Start Guide and give it a try. Contact us with your questions, or for a personalized tour.
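For reference, here is a minimal end-to-end sketch of steps 2–4 above. It is a sketch only, built on assumptions that the post does not spell out: the prompts table from step 1 already exists in a keyspace named semantic_cache, the sentence-transformers and OpenAI model names and the 0.9 threshold are illustrative choices, the exact `ANN OF` ordering clause may differ from the current Vector Search syntax, and binding a Python list to the vector column is assumed to work with your driver version. See the repository above for the authoritative implementation.

```python
# Minimal semantic-cache sketch (illustrative only; names, model choices, threshold,
# and the exact vector-search CQL are assumptions -- see the examples repo).
import uuid
from datetime import datetime, timezone

import numpy as np
from cassandra.cluster import Cluster                 # the Python driver also works with ScyllaDB
from openai import OpenAI
from sentence_transformers import SentenceTransformer

SIMILARITY_THRESHOLD = 0.9                            # tune experimentally (step 3)

embedder = SentenceTransformer("all-MiniLM-L6-v2")    # any model, used consistently (step 2)
llm = OpenAI()                                        # assumes OPENAI_API_KEY is set
session = Cluster(["127.0.0.1"]).connect("semantic_cache")  # hypothetical keyspace


def embed(text: str) -> list[float]:
    """Step 2: convert user input to a vector embedding."""
    return embedder.encode(text).tolist()


def cosine_similarity(a, b) -> float:
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def nearest_cached(query_embedding):
    """Step 3: ask the ANN index for the closest cached prompt (syntax may differ)."""
    return session.execute(
        "SELECT prompt_embedding, llm_response FROM prompts "
        "ORDER BY prompt_embedding ANN OF %s LIMIT 1",
        (query_embedding,),
    ).one()


def ask(prompt_text: str) -> str:
    """Step 4: serve from cache when similar enough, otherwise call the LLM and cache."""
    query_embedding = embed(prompt_text)
    hit = nearest_cached(query_embedding)
    if hit and cosine_similarity(query_embedding, hit.prompt_embedding) >= SIMILARITY_THRESHOLD:
        return hit.llm_response                       # cache hit: no LLM round trip

    answer = llm.chat.completions.create(             # cache miss: go to the LLM
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt_text}],
    ).choices[0].message.content

    now = datetime.now(timezone.utc)
    session.execute(
        "INSERT INTO prompts (prompt_id, inserted_at, prompt_text, prompt_embedding, "
        "llm_response, updated_at) VALUES (%s, %s, %s, %s, %s, %s)",
        (uuid.uuid4(), now, prompt_text, query_embedding, answer, now),
    )
    return answer


print(ask("What is ScyllaDB?"))
```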

Managing ScyllaDB Background Operations with Task Manager

Learn about Task Manager, which provides a unified way to observe and control ScyllaDB’s background maintenance work In each ScyllaDB cluster, there are a lot of background processes that help maintain data consistency, durability, and performance in a distributed environment. For instance, such operations include compaction (which cleans up on-disk data files) and repair (which ensures data consistency in a cluster). These operations are critical for preserving cluster health and integrity. However, some processes can be long-running and resource-intensive. Given that ScyllaDB is used for latency-sensitive database workloads, it’s important to monitor and track these operations. That’s where ScyllaDB’s Task Manager comes in. Task Manager allows administrators of self-managed ScyllaDB to see all running operations, manage them, or get detailed information about a specific operation. And beyond being a monitoring tool, Task Manager also provides a unified way to manage asynchronous operations. How Task Manager Organizes and Tracks Operations Task Manager adds structure and visibility into ScyllaDB’s background work. It groups related maintenance activities into modules, represents them as hierarchical task trees, and tracks their lifecycle from creation through completion. The following sections explain how operations are organized, retained, and monitored at both node and cluster levels. Supported Operations Task Manager supports the following operations: Local: Compaction; Repair; Streaming; Backup; Restore. Global: Tablet repair; Tablet migration; Tablet split and merge; Node operations: bootstrap, replace, rebuild, remove node, decommission. Reviewing Active/Completed Tasks Task Manager is divided into modules: the entities that gather information about operations of similar functionality. Task Manager captures and exposes this data using tasks. Each task covers an operation or its part (e.g., a task can represent the part of the repair operation running on a specific shard). Each operation is represented by a tree of tasks. The tree root covers the whole operation. The root may have children, which give more fine-grained control over the operation. The children may have their own children, etc. Let’s consider the example of a global major compaction task tree: The root covers the compaction of all keyspaces in a node; The children of the root task cover a single keyspace; The second-degree descendants of the root task cover a single keyspace on a single shard; The third-degree descendants of the root task cover a single table on a single shard; etc. You can inspect a task from each depth to see details on the operation’s progress.   Determining How Long Tasks Are Shown Task Manager can show completed tasks as well as running ones. The completed tasks are removed from Task Manager after some time. To customize how long a task’s status is preserved, modify task_ttl_in_seconds (aka task_ttl) and user_task_ttl_in_seconds (aka user_task_ttl) configuration parameters. Task_ttl applies to operations that are started internally, while user_task_ttl refers to those initiated by the user. When the user starts an operation, the root of the task tree is a user-task. Descendant tasks are internal and such tasks are unregistered after they finish, propagating their status to their parents. Node Tasks vs Cluster Tasks Task Manager tracks operations local to a node as well as global cluster-wide operations. A local task is created on a node that the respective operation runs on. 
Its status may be requested only from the node on which the task was created. A global task always covers the whole operation. It is the root of a task tree and it may have local children. A global task is reachable from each node in a cluster. Task_ttl and user_task_ttl are not relevant for global tasks. Per-Task Details When you list all tasks in a Task Manager module, it shows brief information about them with task_stats. Each task has a unique task_id and a sequence_number that is unique within its module. All tasks in a task tree share the same sequence_number. Task stats also include several descriptive attributes: kind: either “node” (a local operation) or “cluster” (a global one). type: what specific operation this task involves (e.g., “major compaction” or “intranode migration”). scope: the level of granularity (e.g., “keyspace” or “tablet”). Additional attributes such as shard, keyspace, table, and entity can further specify the scope. Status fields summarize the task’s state and timing: state: indicates whether the task is created, running, done, failed, or suspended. start_time and end_time: indicate when the task began and finished. If a task is still running, its end_time is set to epoch. When you request a specific task’s status, you’ll see more detailed metrics: progress_total and progress_completed show how much work is done, measured in progress_units. parent_id and children_ids place the task within its tree hierarchy. is_abortable indicates whether the task can be stopped before completion. If the task failed, you will also see the exact error message. Interacting with Task Manager Task Manager provides a REST API for listing, monitoring, and controlling ScyllaDB’s background operations. You can also use it to manage the execution of long-running maintenance tasks started with the asynchronous API instead of blocking a client call. If you prefer command-line tools, the same functionality is available through nodetool tasks. Using the Task Management API Task Manager exposes a REST API that lets you manage tasks: GET /task_manager/list_modules – lists all supported Task Manager modules. GET /task_manager/list_module_tasks/{module} – lists all tasks in a specified module. GET /task_manager/task_status/{task_id} – shows the detailed status of a specified task. GET /task_manager/wait_task/{task_id} – waits for a specified task and shows its status. POST /task_manager/abort_task/{task_id} – aborts a specified task. GET /task_manager/task_status_recursive/{task_id} – gets statuses of a specified task and all its descendants. GET/POST /task_manager/ttl – gets/sets task_ttl. GET/POST /task_manager/user_ttl – gets/sets user_task_ttl. POST /task_manager/drain/{module} – drains the finished tasks in a specified module. Running Maintenance Tasks Asynchronously Some ScyllaDB maintenance operations can take a while to complete, especially at scale. Waiting for them to finish through a synchronous API call isn’t always practical. Thanks to Task Manager, existing synchronous APIs are easily and consistently converted into asynchronous ones. Instead of waiting for an operation to finish, a new API can immediately return the ID of the root task representing the started operation. Using this task_id, you can check the operation’s progress, wait for completion, or abort it if needed. This gives you a unified and consistent way to manage all those long-running tasks. Nodetool A task can be managed using nodetool’s tasks command. For details, see the related nodetool docs page.
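Before moving on to the worked example, here is a small sketch of driving those endpoints from application code. It assumes Python’s requests library, a node address and port like the ones used in the example below, and that the endpoints return JSON with the field names described above; treat it as an illustration rather than a definitive client.

```python
# Sketch: inspecting Task Manager over its REST API with Python's requests library.
# The node address, port, and module name are placeholders for illustration.
import requests

NODE = "http://127.43.0.1:10000"   # ScyllaDB's REST API on one node

# Which modules does this node expose?
print(requests.get(f"{NODE}/task_manager/list_modules").json())

# Brief stats (task_stats) for every task tracked by the compaction module.
for task in requests.get(f"{NODE}/task_manager/list_module_tasks/compaction").json():
    print(task["task_id"], task["kind"], task["type"], task["state"])

    # Detailed status of one task: progress, parent/children, abortability, errors.
    status = requests.get(f"{NODE}/task_manager/task_status/{task['task_id']}").json()
    if status.get("is_abortable") and status["state"] == "running":
        # A running, abortable task could be stopped like this:
        # requests.post(f"{NODE}/task_manager/abort_task/{task['task_id']}")
        pass
```

The same flow maps onto the nodetool tasks subcommands used in the example that follows.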
Example: Tracking and Managing Tasks Preparation To start, we locally set up a cluster of three nodes with the IP addresses 127.43.0.1, 127.43.0.2, and 127.43.0.3. Next, we create two keyspaces: keyspace1 with replication factor 3 and keyspace2 with replication factor 2. In each keyspace, we create two tables: table1 and table2 in keyspace1, and table3 and table4 in keyspace2. We populate them with data. Exploring Task Manager Let’s start by listing the modules supported by Task Manager: nodetool tasks modules -h 127.43.0.1 Starting and Tracking a Repair Task We request a tablet repair on all tokens of table keyspace2.table3. curl -X POST --header 'Content-Type: application/json' --header 'Accept: application/json' 'http://127.43.0.3:10000/storage_service/tablets/repair?ks=keyspace2&table=table3&tokens=all' {"tablet_task_id":"2f06bff0-ab45-11f0-94c2-60ca5d6b2927"} In response, we get the task id of the respective tablet repair task. We can use it to track the progress of the repair. Let’s check whether the task with id 2f06bff0-ab45-11f0-94c2-60ca5d6b2927 is listed in the tablets module. nodetool tasks list tablets -h 127.43.0.1 Apart from the repair task, we can see that there are two intranode migrations running. All the tasks are of kind “cluster”, which means that they cover global operations. All these tasks would be visible regardless of which node we request them from. We can also see the scope of the operations. We always migrate one tablet at a time, so the migration tasks’ scope is “tablet”. For repair, the scope is “table” because we previously started the operation on a whole table. Entity, sequence_number, and shard are irrelevant for global tasks. Since all tasks are running, their end_time is set to a default value (epoch). Examining Task Status Let’s examine the status of the tablet repair using its task_id. Global tasks are available on the whole cluster, so we change the requested node… just because we can. 😉 nodetool tasks status 2f06bff0-ab45-11f0-94c2-60ca5d6b2927 -h 127.43.0.3 The task status contains detailed information about the tablet repair task. We can see whether the task is abortable (via the task_manager API). There could also be some additional information that’s not applicable for this particular task: error, which would be set if the task failed; parent_id, which would be set if it had a parent (impossible for a global task); progress_unit, progress_total, progress_completed, which would indicate task progress (not yet supported for tablet repair tasks). There’s also a list of tasks that were created as a part of the global task. The list above has been shortened to improve readability. The key point is that children of a global task may be created on all nodes in a cluster. Those children are local tasks (because global tasks cannot have a parent). Thus, they are reachable only from the nodes where they were created. For example, the status of task 1eb69569-c19d-481e-a5e6-0c433a5745ae should be requested from node 127.43.0.2. nodetool tasks status 1eb69569-c19d-481e-a5e6-0c433a5745ae -h 127.43.0.2 As expected, the child’s kind is “node”. Its parent_id references the tablet repair task’s task_id. The task has completed successfully, as indicated by the state. The task’s end_time is set. Its sequence_number is 15, which means it is the 15th task in its module. The task’s scope could be wider than the parent’s: it could encompass the whole keyspace, but – in this case – it is limited to the parent’s scope.
The task’s progress is measured in ranges, and we can see that exactly one range was repaired. This task has one child that is created on the same node as its parent. That’s always true for local tasks. nodetool tasks status 70d098c4-df79-4ea2-8a5e-6d7386d8d941 -h 127.43.0.3 We may examine other children of the global tablet repair task too. However, we may only check each one on the node where it was created. Let’s wait until the global task is completed. nodetool tasks wait 2f06bff0-ab45-11f0-94c2-60ca5d6b2927 -h 127.43.0.2 We can see that its state is “done” and its end_time is set. Working with Compaction Tasks Let’s start some compactions and have a look at the compaction module. nodetool tasks list compaction -h 127.43.0.2 We can see that one of the major compaction tasks is still running. Let’s abort it and check its task tree. nodetool tasks abort 16a6cdcc-bb32-41d0-8f06-1541907a3b48 -h 127.43.0.2 nodetool tasks tree 16a6cdcc-bb32-41d0-8f06-1541907a3b48 -h 127.43.0.2 We can see that the abort request propagated to one of the task’s children and aborted it. That task now has a failed state and its error field contains abort_requested_exception. Managing Asynchronous Operations Beyond examining the running operations, Task Manager can manage asynchronous operations started with the REST API. For example, we may start a major compaction of a keyspace synchronously with /storage_service/keyspace_compaction/{keyspace} or use an asynchronous version of this API: curl -X POST --header 'Content-Type: application/json' --header 'Accept: application/json' 'http://127.43.0.1:10000/tasks/compaction/keyspace_compaction/keyspace2' "4c6f3dd4-56dc-4242-ad6a-8be032593a02" The response includes the task_id of the operation we just started. This id may be used in Task Manager to track the progress, wait for the operation, or abort it. Key Takeaways The Task Manager provides a clear, unified way to observe and control background maintenance work in ScyllaDB. Visibility: It shows detailed, hierarchical information about ongoing and completed operations, from cluster-level tasks down to individual shards. Consistency: You can use the same mechanisms for listing, tracking, and managing all asynchronous operations. Control: You can check progress, wait for completion, or abort tasks directly, without guessing what’s running. Extensibility: It also provides a framework for turning synchronous APIs into asynchronous ones by returning task IDs that can be monitored or managed through the Task Manager. Together, these capabilities make it easier to see what ScyllaDB is doing, keep the system stable, and convert long-running operations to asynchronous workflows.
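To round off the example, here is a minimal sketch of that synchronous-to-asynchronous pattern in application code: start a major compaction through the asynchronous endpoint shown above, then use the returned task_id with wait_task. It assumes Python’s requests library and the demo cluster addresses from this example.

```python
# Sketch: start a major compaction asynchronously, then track it via Task Manager.
# Assumes the demo node address from this example and Python's requests library.
import requests

NODE = "http://127.43.0.1:10000"

# The asynchronous endpoint returns immediately with the root task's id.
task_id = requests.post(
    f"{NODE}/tasks/compaction/keyspace_compaction/keyspace2"
).json()
print("started compaction, root task:", task_id)

# Block until the operation finishes and inspect its final status...
status = requests.get(f"{NODE}/task_manager/wait_task/{task_id}").json()
print(status["state"], status.get("end_time"))

# ...or poll /task_manager/task_status/{task_id} periodically, and abort with
# POST /task_manager/abort_task/{task_id} if the operation needs to be stopped.
```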

The Cost of Multitenancy

DynamoDB and ScyllaDB share many similarities, but DynamoDB is a multi-tenant database, while ScyllaDB is single-tenant The recent DynamoDB outage is a stark reminder that even the most reliable and mature cloud services can experience downtime. Amazon DynamoDB remains a strong and proven choice for many workloads, and many teams are satisfied with its latency and cost. However, incidents like this highlight the importance of architecture, control, and flexibility when building for resilience. DynamoDB and ScyllaDB share many similarities: Both are distributed NoSQL databases with the same “ancestor”: the Dynamo paper (although both databases have significantly evolved from the original concept). A compatible API: The DynamoDB API is one of two supported APIs in ScyllaDB Cloud. Both use multi-zone deployment for higher availability. Both support multi-region deployment. DynamoDB uses Global Tables (see this analysis for more). ScyllaDB can go beyond that and allow multi-cloud, on-prem, or hybrid deployments. But they also have a major difference: DynamoDB is a multi-tenant database, while ScyllaDB is single-tenant. Source: https://blog.bytebytego.com/p/a-deep-dive-into-amazon-dynamodb Multi-tenancy has notable advantages for the vendor: Lower infrastructure cost: Since tenants’ peaks don’t align, the vendor can provision for the aggregate average rather than the sum of all peaks, and even safely over-subscribe resources. Shared burst capacity: Extra capacity for traffic spikes is pooled across all users. Multi-tenancy also comes with significant technical challenges and is never perfect. All users still share the same underlying resources (CPU, storage, and network) while the service works hard to preserve the illusion of a dedicated environment for each tenant (e.g., using various isolation mechanisms). However, sometimes the isolation breaks and the real architecture behind the curtain is revealed. One example is the Noisy Neighbor issue. Another is that when a shared resource breaks, like the DNS endpoint in the latest DynamoDB outage, MANY users are affected. In this case, all DynamoDB users in a region suffer. ScyllaDB Cloud takes a different approach: all database resources are completely separated from each other. Each ScyllaDB database runs on dedicated VMs, in a dedicated VPC, in a dedicated Security Group, using a dedicated endpoint and an optional dedicated Private Link, with isolated authorization and authentication (per database), and with dedicated Monitoring and Administration (ScyllaDB Manager) servers. When using ScyllaDB Cloud Bring Your Own Account (BYOA), the entire deployment runs in the *user’s* account, often in a dedicated sub-account. This provides additional isolation. The ScyllaDB Cloud control plane is loosely coupled to the managed databases. Even in the case of a disconnect, the database clusters will continue to serve requests. This design greatly reduces the blast radius of any one issue. While the single-tenant architecture is more resilient, it does come with a few challenges: Scaling: To scale, ScyllaDB needs to allocate new resources (nodes) from EC2 and depends on the EC2 API to provision them. Tablets and X Cloud have made a great improvement in reducing scaling time.
Workload Isolation: ScyllaDB allows users to control the resource bandwidth per workload with Workload Prioritization (docs | tech talk | demo). Pricing: Using numerous optimization techniques, like shard-per-core, ScyllaDB achieves extreme performance per node, which allows us to provide lower prices than DynamoDB for most use cases. To conclude: DynamoDB optimizes for multi-tenancy, whereas ScyllaDB favors stronger tenant isolation and a smaller blast radius.

Cache vs. Database: How Architecture Impacts Performance

Lessons learned comparing Memcached with ScyllaDB Although caches and databases are different animals, databases have always cached data and caches started to use disks, extending beyond RAM. If an in-memory cache can rely on flash storage, can a persistent database also function as a cache? And how far can you reasonably push each beyond its original intent, given the power and constraints of its underlying architecture? A little while ago, I joined forces with Memcached maintainer Alan Kasindorf (a.k.a. dormando) to explore these questions. The collaboration began with the goal of an “apples to oranges” benchmark comparing ScyllaDB with Memcached, which is covered in the article “We Compared ScyllaDB and Memcached and… We Lost?” A few months later, we were pleasantly surprised that the stars aligned for P99 CONF. At the last minute, Kasindorf was able to join us to chat about the project – specifically, what it all means for developers with performance-sensitive use cases. Note: P99 CONF is a highly technical conference on performance and low-latency engineering. We just wrapped P99 CONF 2025, and you can watch the core sessions on-demand. Cache Efficiency Which data store uses memory more efficiently? To test it, we ran a simple key-value workload on both systems. The results: Memcached cached 101 million items before evictions began, while ScyllaDB cached only 61 million items before evictions. Cache efficiency comparison What’s behind the difference? ScyllaDB also has its own LRU (Least Recently Used) cache, bypassing the Linux page cache. But unlike Memcached, ScyllaDB supports a wide-column data representation: A single key may contain many rows. This, along with additional protocol overhead, causes a single write in ScyllaDB to consume more space than a write in Memcached. Drilling down into the differences, Memcached has very little per-item overhead. In the example from the image above, each stored item consumes either 48 or 56 bytes, depending on whether compare and swap (CAS) is enabled. In contrast, ScyllaDB has to handle a lot more (it’s a persistent database after all!). It needs to allocate space for its memtables, Bloom filters and SSTable summaries so it can efficiently retrieve data from disk when a cache miss occurs. On top of that, ScyllaDB supports a much richer data model (wide-column). Another notable architectural difference stands out on the performance front: Memcached is optimized for pipelined requests (think batching, as in DynamoDB’s BatchGetItem), considerably reducing the number of roundtrips over the network to retrieve several keys. ScyllaDB is optimized for single (and contiguous) key retrievals under a wide-column representation. Read-only in-memory efficiency comparison Following each system’s ideal data model, both ScyllaDB and Memcached managed to saturate the available network throughput, servicing around 3 million rows/s while sustaining below single-digit millisecond P99 latencies. Disks and IO Efficiency Next, the focus shifted to disks. We measured performance under different payload sizes, as well as how efficiently each of the systems could maximize the underlying storage. With Extstore and small (1 KB) payloads, Memcached stored about 11 times more items (compared to its in-memory workload) before evictions started to kick in, while still leaving a significant portion of the available disk space unused. This happens because, in addition to the regular per-key overhead, Memcached stores an additional 12 bytes per item in RAM as a pointer to storage.
As RAM gets depleted, Extstore is no longer effective and users will no longer observe savings beyond that point. Disk performance with small payloads comparison For the actual performance tests, we stressed Extstore against item sizes of 1KB and 8KB. The table below summarizes the results:

Test Type              Payload Size   I/O Threads   GET Rate   P99 Latency
perfrun_metaget_pipe   1KB            32            188K/s     4~5 ms
perfrun_metaget        1KB            32            182K/s     <1 ms
perfrun_metaget_pipe   1KB            64            261K/s     5~6 ms
perfrun_metaget        1KB            64            256K/s     1~2 ms
perfrun_metaget_pipe   8KB            16            92K/s      5~6 ms
perfrun_metaget        8KB            16            90K/s      <1 ms
perfrun_metaget_pipe   8KB            32            110K/s     3~4 ms
perfrun_metaget        8KB            32            105K/s     <1 ms

We populated ScyllaDB with the same number of items as we used for Memcached. ScyllaDB actually achieved higher throughput – and just slightly higher latency – than Extstore. I’m pretty sure that if the throughput had been reduced, the latency would have been lower. But even with no tuning, the performance is quite comparable. This is summarized below:

Test Type   Payload Size   GET Rate   Server-Side P99   Client-Side P99
1KB Read    1KB            268.8K/s   2ms               2.4ms
8KB Read    8KB            156.8K/s   1.54ms            1.9ms

A few notable points from these tests: Extstore required considerable tuning to fully saturate flash storage I/O. Due to Memcached’s architecture, smaller payloads are unable to fully use the available disk space, providing smaller gains compared to ScyllaDB. ScyllaDB rates were overall higher than Memcached in a key-value orientation, especially under higher payload sizes. ScyllaDB’s latencies were better than Memcached’s pipelined requests, but slightly higher than individual GETs in Memcached. I/O Access Methods Discussion These disk-focused tests unsurprisingly sparked a discussion about the different I/O access methods used by ScyllaDB vs. Memcached/Extstore. I explained that ScyllaDB uses asynchronous direct I/O. For an extensive discussion of this, read this blog post by ScyllaDB CTO and cofounder Avi Kivity. Here’s the short version: ScyllaDB is a persistent database. When people adopt a database, they rightfully expect that it will persist their data. So, direct I/O is a deliberate choice. It bypasses the kernel page cache, giving ScyllaDB full control over disk operations. This is critical for things like compactions, write-ahead logs and efficiently reading data off disk. A user-space I/O scheduler is also involved. It lives in the middle and decides which operation gets how much I/O bandwidth. That could be an internal compaction task or a user-facing query. It arbitrates between them. That’s what enables ScyllaDB to balance persistence work with latency-sensitive operations. Extstore takes a very different approach: keep things as simple as possible and avoid touching the disk unless it’s absolutely necessary. As Kasindorf put it: “We do almost nothing.” That’s fully intentional. Most operations (like deletes, TTL updates, or overwrites) can happen entirely in memory. No disk access needed. So Extstore doesn’t bother with a scheduler. Without a scheduler, Extstore performance tuning is manual. You can change the number of Extstore I/O threads to get better utilization. If you roll it out and notice that your disk doesn’t look fully utilized – and you still have a lot of spare CPU – you can bump up the thread count. Kasindorf mentioned that it will likely become self-tuning at some point. But for now, it’s a knob that users can tweak. Another important piece is how Extstore layers itself on top of Memcached’s existing RAM cache. It’s not a replacement; it’s additive.
You still have your in-memory cache and Extstore just handles the overflow. Here’s how Kasindorf explained it: “If you have, say, five gigs of RAM and one gig of that is dedicated to these small pointers that point from memory into disk, we still have a couple extra gigs left over for RAM cache.” That means if a user is actively clicking around, their data may never even go to disk. The only time Extstore might need to read from disk is when the cache has gone cold (for instance, a user returning the next day). Then the entries get pulled back in. Basically, while ScyllaDB builds around persistent, high-performance disk I/O (with scheduling, direct control and durable storage), Extstore is almost the opposite. It’s light, minimal and tries to avoid disk entirely unless it really has to. Conclusion and Takeaways Across these and the other tests that we performed in the full benchmark, Memcached and ScyllaDB both managed to maximize the underlying hardware utilization and keep latencies predictably low. So which one should you pick? The real answer: It depends. If your existing workload can accommodate a simple key-value model and it benefits from pipelining, then Memcached should be better suited to your needs. On the other hand, if the workload requires support for complex data models, then ScyllaDB is likely a better fit. Another reason for sticking with Memcached: It easily delivers traffic far beyond what a network interface card can sustain. In fact, in this Hacker News thread, dormando mentioned that he could scale it up past 55 million read ops/sec on a considerably larger server. Given that, you could make use of smaller and/or cheaper instance types to sustain a similar workload, provided the available memory and disk footprint meet your workload needs. A different angle to consider is the data set size. Even though Extstore provides great cost savings by allowing you to store items beyond RAM, there’s a limit to how many keys can fit per gigabyte of memory (see the quick arithmetic after this section). Workloads with very small items should observe smaller gains compared to those with larger items. That’s not the case with ScyllaDB, which allows you to store billions of items irrespective of their sizes. It’s also important to consider whether data persistence is required. If it is, then running ScyllaDB as a replicated distributed cache provides you with greater resilience and non-stop operations, with the tradeoff being (and as Memcached correctly states) that replication halves your effective cache size. Unfortunately, Extstore doesn’t support warm restarts and thus the failure or maintenance of a single node is prone to elevating your cache miss ratios. Whether this is acceptable depends on your application semantics: If a cache miss corresponds to a round-trip to the database, then the end-to-end latency will be momentarily higher. Regardless of whether you choose a cache like Memcached or a database like ScyllaDB, I hope this work inspires you to think differently about performance testing. As we’ve seen, databases and caches are fundamentally different. And at the end of the day, just comparing performance numbers isn’t enough. Moreover, recognize that it’s hard to fully represent your system’s reality with simple benchmarks, and every optimization comes with some trade-offs. For example, pipelining is great, but as we saw with Extstore, it can easily introduce I/O contention.
ScyllaDB’s shard-per-core model and support for complex data models are also powerful, but they come with costs too, like losing some pipelining flexibility and adding memory overhead.  
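As a footnote to the data set size point above, here is a quick back-of-the-envelope sketch built only from the per-item figures quoted earlier (48 or 56 bytes of base overhead plus 12 bytes per Extstore pointer); it ignores key length and allocator overhead, so treat the result as a rough upper bound rather than a measurement:

```python
# Rough upper bound on Extstore-backed items per GiB of RAM, from the per-item
# figures quoted above (56 bytes base overhead with CAS + 12 bytes per Extstore
# pointer). Key length and allocator overhead are ignored, so real numbers are lower.
GIB = 1024 ** 3

def max_items_per_gib(base_overhead: int = 56, extstore_pointer: int = 12) -> int:
    return GIB // (base_overhead + extstore_pointer)

items = max_items_per_gib()
print(f"~{items / 1e6:.1f}M items per GiB of RAM")               # ~15.8M
print(f"~{items * 1024 / GIB:.1f} GiB of 1KB payloads covered")  # ~15.1 GiB
```

In other words, with very small items, each GiB of RAM spent on pointers only addresses a modest slice of flash, which is why small payloads leave disk capacity unused.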

11X Faster ScyllaDB Backup

Learn about ScyllaDB’s new native backup, which improves backup speed up to 11X by using Seastar’s CPU and IO scheduling ScyllaDB’s 2025.3 release introduces native backup functionality. Previously, an external process managed backups independently, without visibility into ScyllaDB’s internal workload. Now, Seastar’s CPU and I/O schedulers handle backups internally, which gives ScyllaDB full control over prioritization and resource usage. In this blog post, we explain why we changed our approach to backup, share what users need to know, and provide a preview of what to expect next. What We Changed and Why Previously, SSTable backups to S3 were managed entirely by ScyllaDB Manager and the Scylla Manager Agent running on each node. You would schedule the backup, and Manager would coordinate the required operations (taking snapshots, collecting metadata, and orchestrating uploads). Scylla Manager Agent handled all the actual data movement. The problem with this approach was that it was often too slow for our users’ liking, especially at the massive scale that’s common across our user base. Since uploads ran through an external process, they competed with ScyllaDB for resources (CPU, disk I/O, and network bandwidth). The rclone process read from disk at the same time that ScyllaDB did – so two processes on the same node were performing heavy disk I/O simultaneously. This contention on the disk could impact query latencies when user requests were being processed during a backup. To mitigate the effect on real-time database requests, we used a systemd slice to control Scylla Manager Agent resources. This solution successfully reduced backup bandwidth, but it could not increase the bandwidth when the pressure from online requests was low. To optimize this process, ScyllaDB now provides a native backup capability. Rather than relying on an external agent (ScyllaDB Manager) to copy files, ScyllaDB uploads files directly to S3. The new approach is faster and more efficient because ScyllaDB uses its internal IO and CPU scheduling to control the backup operations. Backup operations are assigned a lower priority than user queries. In the event of resource contention, ScyllaDB will deprioritize them so they don’t interfere with the latency of the actual workload. Note that this new native backup capability is currently available for AWS. It is coming soon for other backup targets (such as GCP Cloud Storage and Azure Storage). To enable native backup, configure the S3 connectivity in each node’s scylla.yaml and set the desired strategy (Native, Auto, or Rclone) in ScyllaDB Manager. Note that the rclone agent is always used to upload backup metadata, so you should still configure the Manager Agent even if you are using native backup and restore. Performance Improvements So how much faster is the new backup approach? We recently ran some tests to find out. We ran two tests that were identical in all aspects except for the tool used for backup: rclone in one and native ScyllaDB backup in the other. Test Setup The test uses six i4i.2xlarge nodes with a total of 2TB of injected data at RF=3. That means the 2TB of injected data becomes 6TB (RF=3), and these 6TB are spread across 6 nodes, resulting in each node holding 1TB of data.
The backup benchmark then measures how long it takes to back up the entire cluster; the reported size corresponds to the data held by a single node. Native Backup Here are the results of the native backup tests:

Name                      Size        Time (hh:mm:ss)
native_backup_1016_2234   1.057 TiB   00:19:18

Data was uploaded at a rate of approximately 900 MB/s.
OS Tx Bytes during backup
The slightly higher values for the OS metrics are due to factors such as TCP retransmits and HTTP headers, which are not part of the data but are part of the transmitted bytes. rclone Backup The exact same test with rclone produced the following results:

Name                      Size        Time (hh:mm:ss)
rclone_backup_1017_2334   1.057 TiB   03:48:57

Here, data was uploaded at a rate of approximately 80 MB/s. Next Up: Faster Restore Next, we’re optimizing restore, which is the more complex part of the backup/restore process. Backups are relatively straightforward: you just upload the data to object storage. But restoring that data is harder, especially if you need to bring a cluster back online quickly or restore it onto a topology that’s different from the original one. The original cluster’s nodes, token ranges, and data distribution might look quite different from the new setup – but during restore, ScyllaDB must somehow map between what was backed up and what the new topology expects. Replication adds even more complexity. ScyllaDB replicates data according to the specified replication factor (RF), so the backup has multiple copies of the same data. During the restore process, we don’t want to redundantly download or process those copies; we need a way to handle them efficiently. And one more complicating factor: the restore process must understand whether the cluster uses virtual nodes or tablets because that affects how data is distributed. Wrapping Up ScyllaDB’s move to native integration with object storage is a big step forward for the faster backup/restore operations that many of our large-scale users have been asking for. We’ve already sped up backups by eliminating the extra rclone layer. Now, our focus is on making restores equally efficient while handling complex topologies, replication, and data distribution. This will make it faster and easier to restore large clusters. Looking ahead, we’re working on using object storage not only for backup and restore, but also for tiering: letting ScyllaDB read data directly from object storage as if it were on local disk. For a more detailed look at ScyllaDB’s plans for backup, restore, and object storage as native storage, see this video.
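As a quick sanity check on the headline number, the speedup follows directly from the two wall-clock times in the tables above, since both runs backed up the same 1.057 TiB per node:

```python
# The same 1.057 TiB per node, backed up in 00:19:18 (native) vs 03:48:57 (rclone).
native_seconds = 19 * 60 + 18                 # 1,158 s
rclone_seconds = 3 * 3600 + 48 * 60 + 57      # 13,737 s

print(f"speedup: {rclone_seconds / native_seconds:.1f}x")  # ~11.9x, in line with the ~11X headline
```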