ScyllaDB X Cloud: An Inside Look with Avi Kivity (Part 2)

ScyllaDB’s co-founder/CTO goes deeper into how tablets work, then looks at the design behind ScyllaDB X Cloud’s autoscaling Following the recent ScyllaDB X Cloud release, Tim Koopmans sat down (virtually) with ScyllaDB Co-Founder and CTO Avi Kivity. The goal: get the engineering perspective on all the multiyear projects leading up to this release. This includes using Raft for topology and schema metadata, moving from vNodes to tablets-based data distribution, allowing up to 90% storage utilization, new compression approaches, etc. etc. In part 1 of this 3-part series, we looked at the motivations and architectural shifts behind ScyllaDB X Cloud, particularly with respect to Raft and tablets-based data distribution. This blog post goes deeper into how tablets work, then looks at the design behind ScyllaDB X Cloud’s autoscaling. Read part 1 You can watch the complete video here. Tackling technical challenges Tim: With such a complex project, I’m guessing that you didn’t nail everything perfectly on the first try. Could you walk us through some of the hard problems that took time to crack? How did you work around those hurdles? Avi: One of the difficult things was the distribution related to racks or availability zones (we use those terms interchangeably). With the vNodes method of data distribution, a particular replica can hop around different racks. That does work, but it creates problems when you have materialized views. With a materialized view, each row in the base table is tied to a row in the materialized view. If there’s a change in the relationship between which replica on the base table owns the row on the materialized view, that can cause problems with data consistency. We struggled with that a lot until we came to a solution of just forbidding having a replication factor that’s different from the number of racks or availability zones. That simple change solved a lot of problems. It’s a very small restriction because, practically speaking, the vast majority of users have a replication factor of 3, and they use 3 racks or 3 availability zones. So the restriction affects very few people, but solves a large number of problems for us…so we’re happy that we made it. How tablets prevent hot partitions Tim: What about things like hot partitions and data skew in tablets? Does tablets help here since you’re working with smaller chunks? Avi: Yes. With tablets, our granularity is 5GB, so we can balance data in 5GB chunks. That might sound large, but it’s actually very small compared to the node capacity. The 5GB size was selected because it’s around 1% of the data that a single vCPU can hold. For example, an i3 node has around 600GB of storage per vCPU, and 1% of that is 5GB. That’s where the 5GB number came from. Since we control individual tablets, we can isolate a tablet to a single vCPU. Then, instead of a tablet being 1% of a vCPU, it can take 100% of it. That effectively increases the amount of compute power that is dedicated to the tablet by a factor of 100. This will let us isolate hot partitions into their own vCPUs. We don’t do this yet, but detecting hot partitions and isolating them in this way will improve the system’s resilience to hot partition problems. Tim: That’s really interesting. So have we gone from shard per core to almost tablet per core? Is that what the 1% represents, on average? Avi:  The change is that we now have additional flexibility. With a static distribution, you look at the partition key and you know in advance where it will go. 
Here, you look at the partition key and you consult an indirection table. And that indirection table is under our control…which means we can play with it and adjust things. Tim:  Can you say more about the indirection table? Avi:  It’s called system.tablets. It lays out the topology of the cluster. For every table and every token range, it lists what node and what shard will handle those keys. It’s important that it’s per table. With vNodes, we had the same layout for all tables. Some tables can be very large, some tables can be very small, some tables can be hot, some tables can be cold…so the one-size-fits-all approach doesn’t always work. Now, we have the flexibility to lay out different tables in different ways. How driver changes simplify complexity Tim:  Very cool. So tablets seem to solve a lot of problems – they just have a lot of good things going for them. I guess they can start servicing requests as soon as a new node receives a tablet? That should help with long-tail latency for cluster operations. We also get more fine-grained control over how we pack data into the cluster (and we’ll talk about storage utilization shortly). But you mentioned the additional table. Is there any other overhead or any operational complexity? Avi: Yes. It does introduce more complexity. But since it’s under our control, we also introduced mitigations for that. For example, the drivers now have to know about this indirection layer, so we modified them. We have this reactive approach where a driver doesn’t read the tablets table upfront. Instead, when it doesn’t know the layout of tablets on a cluster, it just fires off a request randomly. If it hits, great. If it misses, then along with the results, we’ll get back a notification about the topology of that particular tablet. As it fires off more requests, it will gradually learn the topology of the cluster. And when the topology changes, it will react to how the cluster layout changes. That saves it from doing a lot of upfront work – so it can send requests as soon as it connects to the cluster.   ScyllaDB’s approach to autoscaling Tim:  Let’s shift over to autoscaling. Autoscaling in databases generally seems more like marketing than reality to me. What’s different about ScyllaDB X Cloud’s approach to autoscaling? Avi:  One difference is that we can autoscale much later, at least for storage-bound workloads. Before, we would scale at around 70% storage utilization. But now we will start scaling at 90%. This decreases the cluster cost because more of the cluster storage is used to store data, rather than being used as a free space cushion. Tablets allow us to do that. Since tablets lets us add nodes concurrently, we can scale much faster. Also, since each tablet is managed independently, we can remove its storage as soon as the tablet is migrated off its previous node. Before, we had to wait until the data was completely transitioned to a new node, and then we would run a cleaner process that would erase it from the original node. But now this is done incrementally (in 5GB increments), so it happens very quickly. We can migrate a 5GB tablet in around a minute, sometimes even less. As soon as a cluster scale out begins, the node storage decreases immediately. That means we can defer the scale out decision, waiting until it’s really needed. Scaling for CPU, by measuring the CPU usage, will be another part of that. CPU is used for many different things in ScyllaDB. 
It can be used for serving queries, but it’s also used for internal background tasks like compaction. It can also be used for queries that – from the user’s perspective – are background queries like running analytics. You wouldn’t want to scale your cluster just because you’re running analytics on it. These are jobs that can take as long as they need to; you wouldn’t necessarily want to add more hardware just to make them run faster. We can distinguish between CPU usage for foreground tasks (for queries that are latency sensitive) and CPU usage for maintenance tasks, for background work, and for queries where latency is not so important. We will only scale when the CPU for foreground tasks runs low. Tim: Does the user have to do anything special to prioritize the foreground vs background queries? Is that just part of workload prioritization? Or does it just understand the difference? Avi: We’re trying not to be too clever. It does use the existing service level mechanism. And in the service level definition, you can say whether it’s a transaction workload or a batch workload. All you need to do is run an alter service level statement to designate a particular service level as a batch workload. And once you do that, then the cluster will not scale because that service level needs more CPU. It will only scale if your real-time queries are running out of CPU. It’s pretty normal to see ScyllaDB at 100% CPU. But that 100% is split: part goes to your workload, and part goes to maintenance like compaction. You don’t want to trigger scaling just because the cluster is using idle CPU power for background work. So, we track every cycle and categorize it as either foreground work or background work, then we make decisions based on that. We don’t want it to scale out too far when that’s just not valuable.
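For readers who want to try this, here is a minimal sketch of the service level setup Avi describes, using the gocql driver. The service level and role names are invented for the example, and the exact statements and options (such as workload_type) can vary by ScyllaDB version, so treat this as a starting point and check the documentation for your release.

package main

import (
	"log"

	"github.com/gocql/gocql"
)

func main() {
	// Connect as a role that is allowed to manage service levels.
	cluster := gocql.NewCluster("127.0.0.1")
	session, err := cluster.CreateSession()
	if err != nil {
		log.Fatal(err)
	}
	defer session.Close()

	// "analytics" and "reporting_role" are made-up names for this sketch.
	// Marking the service level as a batch workload tells the cluster that
	// CPU spent on it should count as background work rather than as the
	// foreground-CPU signal that drives scaling decisions.
	// Note: some releases spell this SERVICE_LEVEL; check your version's docs.
	stmts := []string{
		`CREATE SERVICE LEVEL IF NOT EXISTS analytics`,
		`ALTER SERVICE LEVEL analytics WITH workload_type = 'batch'`,
		`ATTACH SERVICE LEVEL analytics TO reporting_role`,
	}
	for _, stmt := range stmts {
		if err := session.Query(stmt).Exec(); err != nil {
			log.Fatalf("%q failed: %v", stmt, err)
		}
	}
}

With this in place, CPU burned by the analytics service level is accounted as background work, so a long-running reporting job by itself should not trigger a scale-out.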

Why Cache Data? [Latency Book Excerpt]

Latency is a monstrous concern here at ScyllaDB. So we're pleased to bring you excerpts from Pekka Enberg's new book on Latency…and a masterclass with Pekka as well! Latency is a monstrous concern here at ScyllaDB. Our engineers, our users, and our broader community are obsessed with it…to the point that we developed an entire conference on low latency. [Side note: That's P99 CONF, a free + virtual conference coming to you live, October 22-23.] Join P99 CONF – Free + Virtual We're delighted that Pekka Enberg decided to write an entire book on latency to share his hard-fought latency lessons learned. The book (quite efficiently titled "Latency") is now off to the printers. From his intro: Latency is so important across a variety of use cases today. Still, it's a tricky topic because many low-latency techniques are effectively developer folklore hidden in blog posts, mailing lists, and side notes in books. When faced with a latency problem, understanding what you're even dealing with often takes a lot of time. I remember multiple occasions where I saw peculiar results from benchmarks, which resulted in an adventure down the software and hardware stack where I learned something new. By reading this book, you will understand latency better, learn how you can measure latency accurately, and discover the different techniques for achieving low latency. This is the book I always wished I had when grappling with latency issues. Although this book focuses on applying the techniques in practice, I will also bring up enough of the background side of things to try to balance between theory and practice. ScyllaDB is sponsoring chapters from the book, so we'll be trickling out some excerpts on our blogs. Get the Latency book excerpt PDF More good news: Pekka is also joining us for a Latency Masterclass on September 25. Be there as he speedruns through the latency-related patterns you'll want to know when working on low-latency apps. And bring your toughest latency questions! Join us live on September 25 Until then, let's kick off our Latency book excerpts with the start of Pekka's caching chapter. It's reprinted here with permission of the publisher. *** Why cache data? Typically, you should consider caching for reducing latency over other techniques if your application or system:
Doesn't need transactions or complex queries.
Cannot be changed, which makes using techniques such as replication hard.
Has compute or storage constraints that prevent other techniques.
Many applications and systems are simple enough that a key-value interface, typical for caching solutions, is more than sufficient. For example, you can store user data such as profiles and settings as a key-value pair where the key is the user ID, and the value is the user data in JSON or a similar format. Similarly, session management, where you keep track of logged-in user session state, is often simple enough that it doesn't require complex queries. However, caching can eventually be too limiting as you move to more complicated use cases, such as recommendations or ad delivery. You have to look into other techniques. Overall, whether your application is simple enough to use caching is highly use case-specific. Often, you look into caching because you cannot or don't want to change the existing system. For example, you may have a database system that you cannot change, which does not support replication, but you have clients accessing the database from multiple locations.
You may then look into caching some query results to reduce latency and scale the system, which is a typical use of caching. However, this comes with various caveats on data freshness and consistency, which we'll discuss in this chapter. Compute and storage constraints can also be a reason to use caching instead of other techniques. Depending on their implementation, colocation and replication can have high storage requirements, which may prevent you from using them. For example, suppose you want to reduce access latency to a large data set, such as a product catalog in an e-commerce site. In that case, it may be impractical to replicate the whole data set in a client with the lowest access latency. However, it may still make sense to cache parts of the product catalog in the client to reduce latency while living with the client's storage constraints. Similarly, it may be impractical to replicate a whole database to a client or a service because database access requires compute capacity for query execution, which may not be there. Caching overview With caching, you can keep a temporary copy of data to reduce access time significantly by reusing the same result many times. For example, if you have a REST API that takes a long time to compute a result, you can cache the REST API results in the client to reduce latency. Accessing the cached results can be as fast as reading from memory, which can significantly reduce latency. You can also use caching for data items that don't exist, called negative caching. For example, maybe the REST API you use is there to look up customer information based on some filtering parameters. In some cases, no results will match the filter, but you still need to perform the expensive computation to discover that. In that scenario, you would use negative caching to cache the fact that there are no results, speeding up the search. Of course, caching has a downside, too: you trade off data freshness for reduced access latency. You also need more storage space to keep the cached data around. But in many use cases, it's a trade-off you are willing to take. Cache storage is where you keep the temporary copies of data. Depending on the use case, cache storage can either be in the main memory or on disk, and cache storage can be accessed either in the same memory address space as the application or over a network protocol. For example, you can use an in-memory cache library to cache values in your application memory, or a key-value store such as Redis or Memcached to cache values in a remote server. With caching, an application looks up values from the cache based on a cache key. When the cache has a copy of the value, we call that a cache hit and serve the data access from the cache. However, if there is no value in the cache, we call that scenario a cache miss and must retrieve the value from the backing store. A key metric for an effective caching solution is the cache hit-to-miss ratio, which describes how often the application finds a relevant value in the cache and how frequently the cache does not have a value. If a cache has a high cache hit ratio, it is utilized well, meaning there is less need to perform a slow lookup or compute the result. With a high cache miss ratio, you are not taking advantage of the cache. This can mean that your application runs slower than without caching because caching itself has some overhead. One major complication with caches is cache eviction policies – that is, deciding which values to throw out of the cache.
The main point of a cache is to provide fast access but also fit the cache in a limited storage space. For example, you may have a database with hundreds of gigabytes of data. Still, you can only reasonably cache tens of gigabytes in the memory address space of your application because of machine resource limitations. You, therefore, need some policy to determine which values stay in the cache and which ones you can evict if you run out of cache space. Similarly, once you cache a value, you can’t always retain the value in the cache indefinitely if the source value changes. For example, you may have a time-based eviction policy enforcing that a cached value can be at least a minute old before updating to the latest source value. Despite the challenges, caching is an effective technique to reduce latency in your application, in particular when you can’t change some parts of the system and when your use case doesn’t warrant investment in things like colocation, replication, or partitioning. With that in mind, let’s look at the different caching strategies. Caching strategies When adding caching to your application, you must first consider your caching strategy, which determines how reads and writes happen from the cache and the underlying backing store, such as a database or a service. At a high level, you need to decide if the cache is passive or active when there is a cache miss. In other words, when your application looks up a value from the cache, but the value is not there or has expired, the caching strategy mandates whether it’s your application or the cache that retrieves the value from the backing store. As usual, different caching strategies have different trade-offs on latency and complexity, so let’s get right into it. To be continued… Get the Latency book excerpt PDF Join the Latency Masterclass on September 25
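The excerpt ends just before the strategy discussion, but the vocabulary above (cache hit, cache miss, negative caching, time-based eviction, hit ratio) maps directly to code. Here is a small illustrative cache-aside sketch in Go; it is ours rather than the book's, and the backing store is just a caller-supplied lookup function.

package cache

import (
	"sync"
	"time"
)

// entry is a cached result. found=false records a negative result
// ("this key does not exist"), i.e., negative caching.
type entry struct {
	value    string
	found    bool
	expireAt time.Time // simple time-based eviction
}

// Cache is a minimal cache-aside cache with hit/miss accounting.
type Cache struct {
	mu     sync.Mutex
	ttl    time.Duration
	data   map[string]entry
	hits   int
	misses int
}

func New(ttl time.Duration) *Cache {
	return &Cache{ttl: ttl, data: make(map[string]entry)}
}

// Get serves from the cache on a hit; on a miss it calls lookup (the slow
// backing store) and remembers the result, including "not found".
func (c *Cache) Get(key string, lookup func(string) (string, bool)) (string, bool) {
	c.mu.Lock()
	e, ok := c.data[key]
	if ok && time.Now().Before(e.expireAt) {
		c.hits++
		c.mu.Unlock()
		return e.value, e.found // cache hit (possibly a cached "not found")
	}
	c.misses++
	c.mu.Unlock()

	value, found := lookup(key) // cache miss: go to the backing store
	c.mu.Lock()
	c.data[key] = entry{value: value, found: found, expireAt: time.Now().Add(c.ttl)}
	c.mu.Unlock()
	return value, found
}

// HitRatio reports the hit-to-total ratio, the key metric mentioned above.
func (c *Cache) HitRatio() float64 {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.hits+c.misses == 0 {
		return 0
	}
	return float64(c.hits) / float64(c.hits+c.misses)
}

Note what is missing: there is no size-based eviction, so a real cache would also need to cap the number of entries, which is exactly the eviction-policy question the chapter takes up next.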

ScyllaDB X Cloud: An Inside Look with Avi Kivity (Part 1)

ScyllaDB’s co-founder/CTO on the motivations and architectural shifts behind ScyllaDB X Cloud — focusing on Raft and tablets-based data distribution If you follow ScyllaDB, you’ve probably heard us talking about Raft and tablets-based data distribution for a few years now. The ultimate goal of these projects (plus a few related ones) was to optimize elasticity and price performance – especially for dynamic and storage-bound workloads. And we finally hit a nice milestone along that journey: the release of ScyllaDB X Cloud. You can read about the release in our earlier blog post. Here, we wanted to share the engineering perspective on these architectural shifts. Tim Koopmans recently sat down with Avi Kivity – ScyllaDB Co-Founder and CTO – to chat about the underlying motivation and design decisions. You can watch the complete video here. But if you prefer to read, we’re writing up the highlights. This is the first blog post in a three-part series. Why ScyllaDB X Cloud? For scaling large clusters Tim: Let’s start with a big picture. What really motivated the architectural evolution behind what we know as ScyllaDB X Cloud? Was this change inevitable? How did it come into place? Avi: It came from our experience managing clusters for our customers. With the existing architecture, things like scaling up the cluster in preparation for events like Black Friday could take a long time. Since ScyllaDB can manage very large nodes (e.g., nodes with 30TB of data), moving that data onto new nodes could take a long time, sometimes a day. Also, nodes had to be added one at a time. If you had a large cluster, scaling the cluster would be a nail-biting experience. So we decided to improve that experience and, along the way, we improved many parts of the architecture. Tim: Do you have any numbers around what it used to be like to scale a large cluster? Avi: One of our large clusters has 75 nodes, each of which has around 60TB. It’s a pretty hefty cluster. It’s nice watching clusters like that on our dashboards and seeing compactions at tens of gigabytes per second aggregate across the cluster. Those clusters are churning through large amounts of data per second and carrying a huge amount of data. Now, we can scale this amount of data in minutes, maybe an hour for the most extreme cases. So it’s quite a huge change. Why ScyllaDB addressed scaling with Tablets & Raft Tim:  When you think about dynamic or storage-bound workloads today, what are other databases getting wrong in this space? How did that lead you to this new approach, with tablets? Avi:  “Other databases” is a huge area – there are hundreds of databases. Let’s talk about our heritage. We came from the Cassandra model. And the basic problem there was the static distribution of data. The node layout determines how data is distributed, and as long as you don’t add or remove nodes, it remains static. That means you have no flexibility. Also, the focus on having availability over consistency led to no central point for managing the topology. Without a coordinating authority, you could make only one change at a time. One of the first changes that we made was to add a coordinating authority in the form of Raft. Before, we managed topology with Gossip, which really puts cluster management responsibility on the operator. We moved it to a Raft group to centralize the management. You’ve probably heard the old proverb that anything in computer science can be solved with another layer of indirection. We did that with tablets, more or less. 
We inserted a layer of indirection so that instead of having a static distribution of data to nodes, it goes through a translation table. Each range of rows is mapped to a different node in a tablets table. By manipulating the tablets table, we can redirect small packages of data (specifically, 5GB – that's pretty small for us). We can redirect data at that 5GB granularity to any node and any CPU on any node. We can move those packages around at will, and they are moved at line rate, so it's no problem to fire them away at gigabits per second across the cluster. And that gives us the ability to rebalance data on a cluster or add and remove nodes very quickly. Tim: So tablets are really a new ScyllaDB abstraction? Is it an abstraction that breaks those tables into independently managed units? And I think you said the size is 5GB – is that configurable? Avi: It's configurable, but I don't recommend playing with it. Normally, you stay between 2.5GB and 10GB. When it reaches 10GB, it triggers a split, which will bring it back to 5GB. So each tablet will be split into two. If it goes down to 2.5GB, it will trigger a merge, merging two tablets into one larger tablet – again, bringing it back to 5GB. Tim: So ensuring that things can be dynamically split…We can move data around, rebalance across the cluster…That gives us finer-grained load distribution as well as better scalability and perhaps a bit of separation between compute and storage, right? Because we're not necessarily tied to the size of the compute node anymore. We can have different instance types in a cluster now, as an indirect result of this change. The tipping point Tim: Avi, you said that re-architecting around tablets has been a huge shift. So what was the tipping point? Was it just that vNodes didn't work anymore in terms of how you organize data? What was your aha moment where you said, "Yeah, I think we need to do something different here"? Avi: It was a combination of things, and since this was such a major change, we needed a lot of motivation to do it. One part of it was the inability to perform topology changes that involve more than one node at a time. Another part was that the previous streaming mechanism was very slow. Yet another part is that, because the streaming mechanism was so slow, we had to scale well in advance of exhausting the storage on the node. That required us to leave a lot of free space on the node, and that's wasteful. We took all of this into consideration, and that was enough motivation for us to take on a multi-year change. I think it was well worth it. Tim: Multiyear…So how long ago did you start workshopping different ideas to solve this? Avi: The first phase was changing topology to be strongly consistent and having a central authority to coordinate it. I think it took around a couple of years to switch to Raft topology. Before that, we switched schema management to use Raft as well. That was a separate problem, but since those two problems had the same solution, we jumped on it. We're still not completely done. There are still a few features that are not yet fully compatible with tablets – but we see the light at the end of the tunnel now. [Stay tuned for parts 2 and 3]
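To make the indirection idea more tangible, here is a deliberately simplified sketch. It is not ScyllaDB's actual implementation (the real routing metadata lives in a system table and is maintained by the cluster); it only shows the shape of the idea: a per-table list of token ranges pointing at (node, shard) replicas, plus the split and merge thresholds Avi mentions.

package tablets

// Replica is a (node, shard) pair owning one copy of a tablet.
type Replica struct {
	Node  string
	Shard int
}

// Tablet covers a contiguous token range; moving it only means rewriting
// its Replicas entry and streaming roughly 5GB of data.
type Tablet struct {
	LastToken int64 // upper bound of the token range this tablet owns
	SizeBytes int64
	Replicas  []Replica
}

const (
	splitThresholdBytes = int64(10) << 30 // ~10GB: split into two ~5GB tablets
	mergeThresholdBytes = int64(5) << 29  // ~2.5GB: merge with a neighbour
)

// route finds the tablet owning a token by binary search; the slice is
// assumed non-empty and sorted by LastToken (the "indirection table").
func route(tablets []Tablet, token int64) *Tablet {
	lo, hi := 0, len(tablets)-1
	for lo < hi {
		mid := (lo + hi) / 2
		if token > tablets[mid].LastToken {
			lo = mid + 1
		} else {
			hi = mid
		}
	}
	return &tablets[lo]
}

// needsSplit and needsMerge mirror the thresholds described above.
func needsSplit(t Tablet) bool { return t.SizeBytes >= splitThresholdBytes }
func needsMerge(t Tablet) bool { return t.SizeBytes <= mergeThresholdBytes }

Because every lookup goes through this table, moving a tablet is essentially a metadata update plus a 5GB stream, which is what makes adding several nodes at once practical.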

Be Part of Something Big – Speak at Monster Scale Summit

Share your "extreme scale engineering" expertise with ~20K like-minded engineers Whether you're designing, implementing, or optimizing systems that are pushed to their limits, we'd love to hear about your most impressive achievements and lessons learned – at Monster Scale Summit 2026. Become a Monster Scale Summit Speaker What's Monster Scale Summit? Monster Scale Summit is a technical conference that connects the community of people working on performance-sensitive, data-intensive applications. Engineers, architects, and SREs from gamechangers around the globe will be gathering virtually to explore "monster scale" challenges with respect to extreme levels of throughput, data, and global distribution. It's a lot like P99 CONF (also hosted by ScyllaDB) – a two-day event that's free, fully virtual, and highly interactive. The core difference is that it's focused on extreme scale engineering vs. all things performance. Last time, we hosted industry giants like Kelsey Hightower, Martin Kleppmann, Discord, Slack, and Canva. Browse past sessions Details please!
When: March 11 + 12
Where: Wherever you'd like! It's intentionally virtual, so you can present and interact with attendees from anywhere around the world.
Topics: Core topics include distributed databases, streaming and real-time processing, intriguing system designs, methods for balancing latency/concurrency/throughput, SRE techniques proven at scale, and infrastructure built for unprecedented demands.
What we're looking for: We welcome a broad range of talks about tackling the challenges that arise in the most massive, demanding environments. The conference prioritizes technical talks sharing first-hand experiences. Sessions are just 18-20 minutes – so consider this your TED Talk debut! Share your ideas

Beyond Apache Cassandra

ScyllaDB is no longer "just" a faster Cassandra. In 2008, Apache Cassandra set a new standard for database scalability. Born to support Facebook's Inbox Search, it has since been adopted by tech giants like Uber, Netflix, and Apple – where it's run by experts who also serve as Cassandra contributors (alongside DataStax/IBM). And as its adoption scaled, Cassandra remained true to its core mission of scaling on commodity hardware with high availability. But what about performance? Simplicity? Efficiency? Elasticity? In 2015, ScyllaDB was born to go beyond Cassandra's suboptimal resource utilization. Fresh from creating KVM and hacking the Linux kernel, the founders believed that their low-level engineering approach could squeeze considerably more power from the underlying infrastructure. The timing was ideal: just a year earlier, Netflix had published their numbers showing how to push Apache Cassandra to 1 million write RPS. This was an impressive feat, but one that required significant infrastructure investments and tuning efforts. The idea was quite simple (in theory, at least): take Apache Cassandra's scalable architecture and reimplement it close to the metal while keeping wire protocol compatibility. Not relying on Java meant less latency variability (plus no stop-the-world pauses), while a unique shard-per-core architecture maximized servers' throughput even under heavy system load. To prevent contention, everything was made asynchronous, and all these optimizations were paired with autonomous internal schedulers for minimal operational overhead. That was 10 years ago. While I can't speak to Cassandra's current direction, ScyllaDB has evolved quite significantly since then – shifting from "just" a faster Cassandra alternative to a database with its own identity and unique feature set. Spoiler: In this video, I walk you through some of the key ways ScyllaDB differs from Apache Cassandra. I discuss the differences in performance, elasticity, and capabilities such as workload prioritization. You can see how ScyllaDB maps data per CPU core, scales in parallel, and de-risks topology changes—allowing it to handle millions of OPS with predictable low latencies (and without constant tuning and babysitting). ScyllaDB's Evolution The first generation of ScyllaDB was all about raw performance. That's when we introduced the shard-per-core asynchronous architecture, row-based cache, and advanced schedulers that achieve predictable low latencies. ScyllaDB's second generation aimed for feature parity with Cassandra, but we actually went beyond that. For example, we introduced our materialized views and production-ready global secondary indexes (something that Cassandra still flags as experimental). Likewise, ScyllaDB also introduced support for local secondary indexes in that same year; those were just introduced in Cassandra 5 (after at least three different indexing implementations). Moreover, our Paxos implementation for lightweight transactions eliminated much of the overhead and limitations in Cassandra's own implementation. The third generation marked our shift to the cloud, along with continued innovation. This is when ScyllaDB Alternator—our DynamoDB-compatible API—was introduced. We added support for ZSTD compression in 2020 (Cassandra only adopted it late in 2021). During this period, we dramatically improved repair speeds with row-level repair and introduced workload prioritization (more on this in the next section).
The fourth generation of ScyllaDB emerged around the time AWS announced their i3en instance family, with high-density nodes holding up to 60TB of data (something Cassandra still struggles to handle effectively). During this period, we introduced the Incremental Compaction Strategy (ICS), allowing users to utilize up to 70% of their storage before scaling out. This later evolved into a hybrid compaction strategy (and we now support 90% storage utilization). We also introduced Change Data Capture (CDC) with a fundamentally different approach from Cassandra’s. And we significantly extended the CQL protocol with concepts such as shard-awareness, BYPASS CACHE, per-query configurable TIMEOUTs, and much more. Finally, we arrive at the fifth generation of ScyllaDB, which is still unfolding. This phase represents our path toward strong consistency and elasticity with Raft and Tablets. For more about the significance of this, read on… Capabilities That Set ScyllaDB Apart Our engineers have introduced lots of interesting features over the past decade. Based on my interactions with former Cassandra users, I think these are the most interesting to discuss here. Tablets Data Distribution Each ScyllaDB table is split into smaller fragments (“tablets”) to evenly distribute data and load across the system. Tablets bring elasticity to ScyllaDB, allowing you to instantly double, triple, or even 10x your cluster size to accommodate unpredictable traffic surges. They also enable more efficient use of storage, reaching up to 90% utilization. Since teams can quickly scale out in response to traffic spikes, they can satisfy latency SLAs without needing to overprovision “just in case.” Raft-based Strong Consistency for Metadata Raft introduces strong consistency to ScyllaDB’s metadata. Gone are the days when a schema change could push your cluster into disagreement or you’d lose access because you forgot to update the replication factor of your authentication keyspace (issues that still plague Cassandra). Workload Prioritization Workload prioritization allows you to consolidate multiple workloads under a single cluster, each with its own SLA. Basically, it controls how different workloads compete for system resources. Teams use it to prioritize urgent application requests that require immediate response times versus others that can tolerate slighter delays (e.g., large scans). Common use cases include balancing real-time vs batch processing, splitting writes from reads, and workload/infrastructure consolidation. Repair-based Operations Repair-based operations ensure your cluster data stays in sync, even during topology changes. This addresses a long-standing data consistency flaw in Apache Cassandra, where operations like replacing failed nodes can result in data loss. ScyllaDB also fully eliminates the problem of data resurrection, thanks to repair-based tombstone garbage collection. Incremental Compaction Incremental compaction (ICS) has been the default compaction strategy in ScyllaDB for over five years. ICS greatly reduces the temporary space amplification, resulting in more disk space being available for storing user data – and that eliminates the typical requirement of 50% free space in your drive. There is no comparable Cassandra feature. Cassandra just recently introduced Unified Compaction, which has yet to prove itself. Row-based Cache ScyllaDB’s row-based cache is also unique. It is enabled by default and requires no manual tuning. 
With the BYPASS CACHE extension, you can prevent cache pollution by keeping large scans from evicting important items (there's a short gocql sketch of this at the end of the post). Additionally, SSTable index caching significantly reduces I/O access time when fetching data from disk. Per-shard Concurrency Limits and Rate Limiters ScyllaDB includes per-shard concurrency limits and rate limiters per partition to protect against unexpected spikes. Whether dealing with a misbehaving client or a flood of requests to a specific key, ScyllaDB ensures resilience where Cassandra often falls short. DynamoDB Compatibility ScyllaDB also offers a DynamoDB-compatible layer, further distancing itself from its Apache Cassandra origins. This lets teams run their DynamoDB workloads on any cloud or on-prem – without code changes, and with 50% lower cost. This has helped quite a few teams consolidate multiple workloads on ScyllaDB. What's Next? At the recent Monster SCALE Summit, CEO/co-founder Dor Laor shared a peek at what's next for ScyllaDB. A few highlights…
Ready now (see this blog post and product page for details):
The ability to safely run at 90% storage utilization
Support for clusters with mixed instance type nodes
Dynamic provisioning and flex credit
Short-term:
Vector search
Strongly consistent tables
Fault injection service
Transparent repairs
Object and tiered storage
Raft for strongly consistent tables
Longer-term:
Multi-key transactions
Analytics and transformations with UDFs
Automated large partition balancing
Immutable infrastructure for greater stability and reliability
A replication mode for more flexible and efficient infrastructure changes
For details, watch the complete talk here: To close, ScyllaDB is faster than Cassandra (I'll share our latest benchmark results here soon). But both ScyllaDB and Cassandra have evolved to the point that ScyllaDB is no longer "just" a faster Cassandra. We've evolved beyond Cassandra. If your project needs more predictable performance – and/or could benefit from the elasticity, efficiency, and simplicity optimizations we've been focusing on for years now – you might also want to consider evolving beyond Cassandra.
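As promised above, here is a short gocql sketch of two of the CQL extensions mentioned in this post: BYPASS CACHE and per-query timeouts. The keyspace and table are made up, and both clauses are ScyllaDB-specific (Cassandra will reject them), so verify the exact syntax against the docs for your version.

package main

import (
	"fmt"
	"log"

	"github.com/gocql/gocql"
)

func main() {
	cluster := gocql.NewCluster("127.0.0.1")
	cluster.Keyspace = "catalog" // hypothetical keyspace with a products table
	session, err := cluster.CreateSession()
	if err != nil {
		log.Fatal(err)
	}
	defer session.Close()

	// Full scan for an offline job: BYPASS CACHE keeps the scan from pushing
	// hot rows out of the row-based cache.
	iter := session.Query(`SELECT id, name FROM products BYPASS CACHE`).Iter()
	var id, name string
	for iter.Scan(&id, &name) {
		fmt.Println(id, name)
	}
	if err := iter.Close(); err != nil {
		log.Fatal(err)
	}

	// Latency-sensitive point read: USING TIMEOUT gives this single query a
	// tighter server-side timeout than the cluster default.
	if err := session.Query(
		`SELECT name FROM products WHERE id = ? USING TIMEOUT 50ms`, "p-1",
	).Scan(&name); err != nil {
		log.Printf("point read failed or timed out: %v", err)
	}
}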

We Built a Tool to Diagnose ScyllaDB Kubernetes Issues

Introducing Scylla Operator Analyze, a tool to help platform engineers and administrators deploy ScyllaDB clusters running on Kubernetes Imagine it's a Friday afternoon. Your company is migrating all the data to ScyllaDB and you're in the middle of setting up the cluster on Kubernetes. Then, something goes wrong. Your time today is limited, but the sheer volume of ScyllaDB configuration feels endless. To help you detect problems in ScyllaDB deployments, we built Scylla Operator Analyze, a command-line tool designed to automatically analyze Kubernetes-based ScyllaDB clusters, identify potential misconfigurations, and offer actionable diagnostics. In modern infrastructure management, Kubernetes has revolutionized how we orchestrate containers and manage distributed systems. However, debugging complex Kubernetes deployments remains a significant challenge, especially in production-grade, high-performance environments like those powered by ScyllaDB. In this blog post, we'll explain what Scylla Operator Analyze is, how it works, and how it may help platform engineers and administrators deploy ScyllaDB clusters running on Kubernetes. The repo we've been working on is available here. It's a fork of Scylla Operator, but the project hasn't been merged upstream (it's highly experimental). What is Scylla Operator Analyze? Scylla Operator Analyze is a Go-based command-line utility that extends Scylla Operator by introducing a diagnostic command. Its goal is straightforward: automatically inspect a given Kubernetes deployment and report problems it identifies in the deployment configuration. We designed the tool to help ScyllaDB's technical support staff quickly diagnose known issues reported by our clients, both by providing solutions for simple problems and by offering helpful insights in more complex cases. However, it's also freely available as a subcommand of the Scylla Operator binary. The next few sections share how we implemented the tool. If you want to go straight to example usage, skip to the Making a diagnosis section. Capturing the cluster state Kubernetes deployments consist of many components with various functions. Collectively, they are called resources. The Kubernetes API presents them to the client as objects containing fields with information about their configuration and current state. Two modes of operation Scylla Operator Analyze supports two ways of collecting this data. Live cluster connection: the tool can connect directly to a Kubernetes cluster using the client-go API. Once connected, it retrieves data from Kubernetes resources and compiles it into an internal representation. Archive-based analysis (must-gather): alternatively, the tool can analyze archived cluster states created using a utility called must-gather. These archives contain YAML descriptions of resources, allowing offline analysis. Diagnosis by analyzing symptoms Symptoms are high-level objects representing certain issues that could occur while deploying a ScyllaDB cluster. A symptom contains the diagnosis of the problem and a suggestion on how to fix it, as well as a method for checking if the problem occurs in a given deployment (we cover this in the section about selectors). In order to create objects representing more complex problems, symptoms can be combined into tree-like structures. For example, a problem that could manifest itself in a few different ways could be represented by many symptoms checking for all the different spots the problem could affect.
Those symptoms would be connected to one root symptom, describing the cause of the problem. This way, if any of the sub-symptoms report that their condition is met, the tool can display the root cause instead of one specific manifestation of that problem. (Figure: Example of a symptom and the workflow used to detect it.) In this example, let's assume that the driver is unable to provide storage, but NodeConfig does not report a nonexistent device. When checking whether the symptom occurs, the tool will perform the following steps:
1. Check if the NodeConfig reports a nonexistent device – no.
2. Check if the driver is unable to provide storage – yes. At this point we know the symptom occurs, so we don't need to check for any more subsymptoms.
Since one of the subsymptoms occurs, the main symptom (NodeConfig configured with nonexistent volume) is reported to the user.
Deployment condition description
Resources
As described earlier, Kubernetes deployments can be considered collections of many interconnected resources. All resources are described using so-called fields. Fields contain information identifying resources, deployment configuration, and descriptions of past and current states. Together, these data give the controllers all the information they need to supervise the deployment. Because of that, they are very useful for debugging issues and are the main source of information for our tool. Resources' fields contain a special kind field, which describes what the resource is and indicates what other fields are available. Some fundamental Kubernetes resource kinds include Pods, Services, etc. These can also be extended with custom ones, such as the ScyllaCluster resource kind defined by the Scylla Operator. This provides the most basic kind of grouping of resources in Kubernetes. Other fields are grouped into sections: Metadata, which provides identifying information; Spec, which contains the configuration; and Status, which describes the current state. Such a description in YAML format may look something like this:

apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: "2024-12-03T17:47:06Z"
  labels:
    scylla/cluster: scylla
    scylla/datacenter: us-east-1
    scylla/scylla-version: 6.2.0
  name: scylla-us-east-1-us-east-1a-0
  namespace: scylla
spec:
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: data-scylla-us-east-1-us-east-1a-0
status:
  conditions:
  - lastTransitionTime: "2024-12-03T17:47:06Z"
    message: '0/1 nodes are available: pod has unbound immediate PersistentVolumeClaims.
      preemption: 0/1 nodes are available: 1 Preemption is not helpful for scheduling.'
    reason: Unschedulable
    status: "False"
    type: PodScheduled
  phase: Pending

Selectors
An accurate description of symptoms (presented in the previous section) requires a method for describing conditions in the deployment using information contained in the resources' fields. Moreover, because of the distributed nature of both Kubernetes deployments and ScyllaDB, these descriptions must also specify how the resources are related to one another. Our tool comes with a package providing selectors. They offer a simple, yet powerful, way to describe deployment conditions using Kubernetes objects in a way that's flexible and allows for automatic processing using the provided selection engine. A selector can be thought of as a query because it specifies the kinds of resources to select and criteria which they should satisfy. Selectors are constructed using four main methods of the selector structure builder.
First, the developer uses the Select method to specify the resources to be selected, giving their kind and a predicate which should be true for the selected resources. The predicate is provided as a standard Go closure to allow for complex conditions if needed. Next, the developer may call the Relate method to define a relationship between two kinds of resources. This is again defined using a Go closure as a predicate, which must hold for the two objects to be considered in the same result set. This can establish a context within which an issue should be inspected (for example: connecting a Pod to relevant Storage resources). Finally, constraints for individual resources in the result set can be specified with the Where method, similarly to how it is done in the Select method. This method is mainly meant to be used with the SelectWithNil method. The SelectWithNil method is the same as the Select method; the only difference is that it allows returning a special nil value instead of a resource instance. This nil value signifies that no resources of a given kind match all the other resources in the resulting set. Thanks to this, selectors can also be used to detect a scenario where a resource is missing just by examining the context of related resources. An example selector — shortened for brevity — may look something like this:

selector.New().
	Select("scylla-pod", selector.Type[*v1.Pod](), func(p *v1.Pod) (bool, error) { /* ... */ }).
	SelectWithNil("storage-class", selector.Type[*storagev1.StorageClass](), nil).
	Select("pod-pvc", selector.Type[*v1.PersistentVolumeClaim](), nil).
	Relate("scylla-pod", "pod-pvc", func(p *v1.Pod, pvc *v1.PersistentVolumeClaim) (bool, error) {
		for _, volume := range p.Spec.Volumes {
			vPvc := volume.PersistentVolumeClaim
			if vPvc != nil && (*vPvc).ClaimName == pvc.Name {
				return true, nil
			}
		}
		return false, nil
	}).
	Relate("pod-pvc", "storage-class", /* ... */).
	Where("storage-class", func(sc *storagev1.StorageClass) (bool, error) {
		return sc == nil, nil
	})

In symptom definitions, selectors for a corresponding condition are used and are usually constructed alongside them. Such a selector provides a description of a faulty condition. This means that if there is a matching set of resources, it can be inferred that the symptom occurs. Finally, the selector can then be used, given all the deployment's resources, to construct an iterator-like object that provides a list of all the sets of resources that match the selector. Symptoms can then use those results to detect issues and generate diagnoses containing useful debugging information.
In the final stage of analysis, those diagnoses are presented to the user and the output may look something like this:

Diagnoses:
scylladb-local-xfs StorageClass used by a ScyllaCluster is missing
Suggestions:
deploy scylladb-local-xfs StorageClass (or change StorageClass)
Resources
GVK: /v1.PersistentVolumeClaim, scylla/data-scylla-us-east-1-us-east-1a-0 (4…)
scylla.scylladb.com/v1.ScyllaCluster, scylla/scylla (b6343b79-4887-497b…)
/v1.Pod, scylla/scylla-us-east-1-us-east-1a-0 (0e716c3f-6432-4eeb-b5ff-…)

Learn more
As we suggested, Kubernetes deployments of ScyllaDB involve many interacting components, each of which has its own quirks. Here are a few strategies to help in diagnosing the problems you encounter:
Run Scylla Doctor
Check our troubleshooting guide
Look for open issues on our GitHub
Check our forum
Ask us on Slack
Learn more about ScyllaDB at ScyllaDB University
Good luck, fellow troubleshooter!
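If you are curious about the kind of check these symptoms automate, here is a small standalone client-go sketch (ours, not part of the tool) that flags ScyllaDB pods stuck Pending because they cannot be scheduled, the same condition shown in the Pod YAML above. The namespace and kubeconfig handling are simplified.

package main

import (
	"context"
	"fmt"
	"log"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Assumes a kubeconfig at the default location.
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		log.Fatal(err)
	}
	client, err := kubernetes.NewForConfig(config)
	if err != nil {
		log.Fatal(err)
	}

	pods, err := client.CoreV1().Pods("scylla").List(context.Background(), metav1.ListOptions{})
	if err != nil {
		log.Fatal(err)
	}
	for _, pod := range pods.Items {
		if pod.Status.Phase != corev1.PodPending {
			continue
		}
		for _, cond := range pod.Status.Conditions {
			// Same signal as in the YAML above: PodScheduled=False (Unschedulable),
			// often caused by an unbound PersistentVolumeClaim.
			if cond.Type == corev1.PodScheduled && cond.Status == corev1.ConditionFalse {
				fmt.Printf("%s is unschedulable: %s\n", pod.Name, cond.Message)
			}
		}
	}
}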

Building easy-cass-mcp: An MCP Server for Cassandra Operations

I’ve started working on a new project that I’d like to share, easy-cass-mcp, an MCP (Model Context Protocol) server specifically designed to assist Apache Cassandra operators.

After spending over a decade optimizing Cassandra clusters in production environments, I’ve seen teams consistently struggle with how to interpret system metrics, configuration settings, schema design, and system configuration, and most importantly, how to understand how they all impact each other. While many teams have solid monitoring through JMX-based collectors, extracting and contextualizing specific operational metrics for troubleshooting or optimization can still be cumbersome. The good news is that we now have the infrastructure to make all this operational knowledge accessible through conversational AI.

How GE Healthcare Took DynamoDB on Prem for Its AI Platform

How GE Healthcare moved a DynamoDB-powered AI platform to hospital data centers, without rewriting the app How do you move a DynamoDB-powered AI platform from AWS to hospital data centers without rewriting the app? That's the challenge that Sandeep Lakshmipathy (Director of Engineering at GE Healthcare) decided to share with the ScyllaDB community a few years back. We noticed an uptick in people viewing this video recently, so we thought we'd share it here, in blog form. Watch or read, your choice. Intro Hi, I'm Sandeep Lakshmipathy, the Director of Engineering for the Edison AI group at GE Healthcare. I have about 20 years of experience in the software industry, working predominantly in product and platform development. For the last seven years I've been in the healthcare domain at GE, rolling out solutions for our products. Let me start by setting some context with respect to the healthcare challenges that we face today. Roughly 130M babies are born every year; about 350K every single day. There's a 40% shortage of healthcare workers to help bring these babies into the world. Ultrasound scans help ensure the babies are healthy, but those scans are user-dependent, repetitive, and manual. Plus, clinical training is often neglected. Why am I talking about this? Because AI solutions can really help in this specific use case and make a big difference. Now, consider this matrix of opportunities that AI presents. Every single tiny dot within each cell is an opportunity in itself. The newborn-baby challenge I just highlighted is one tiny speck in this giant matrix. It shows what an infinite space this is, and how AI can address each challenge in a unique way. GE Healthcare is tackling these opportunities through a platform approach. Edison AI Workbench (cloud) We ingest data from many devices and customers: scanners, research networks, and more. Data is then annotated and used to train models. Once the models are trained, we deploy them onto devices. The Edison AI Workbench helps data scientists view and annotate data, train models, and package them for deployment. The whole Edison AI Workbench runs in AWS and uses AWS resources to provide a seamless experience to the data scientists and annotators who are building AI solutions for our customers. Bringing Edison AI Workbench on-prem When we showed this solution to our research customers, they said, "Great, we really like the features and the tools…but can we have Edison AI Workbench on-prem?" So, we started thinking: How do we take something that lives in the AWS cloud, uses all those resources, and relies heavily on AWS services – and move it onto an on-prem server while still giving our research customers the same experience? That's when we began exploring different options. Since DynamoDB was one of the main things tying us to the AWS cloud, we started looking for a way to replace it in the on-prem world. After some research, we saw that ScyllaDB was a good DynamoDB replacement because it provides API compatibility with DynamoDB. Without changing much code and keeping all our interfaces the same, we migrated the Workbench to on-prem and quickly delivered what our research customers asked for. Why ScyllaDB Alternator (DynamoDB-Compatible API)? Moving cloud assets on-prem is not trivial; expertise, time-to-market, service parity, and scalability all matter. We also wanted to keep our release cycles short: in the cloud we can push features every sprint; on-prem, we still need regular updates.
Keeping the database layer similar across cloud and on‑prem minimized rework. Quick proofs of concept confirmed that ScyllaDB + Alternator met our needs, and using Kubernetes on‑prem let us port microservices comfortably. The ScyllaDB team has always been available with respect to developer‑level interactions, quick fixes in nightly builds, and constant touch‑points with technical and marketing teams. All of this helped us move fast. For example, DynamoDB Streams wasn’t yet in ScyllaDB when we adopted it (back in 2020), but the team provided work‑arounds until the feature became available. They also worked with us on licensing to match our needs. This partnership was crucial to the solution’s evolution. By partnering with the ScyllaDB team, we could take a cloud‑native Workbench to our on‑prem research customers in healthcare. Final thoughts Any AI solution rollout depends on having the right data volume and balance. It’s all the annotations that drive model quality. Otherwise, the model will be brittle, and it won’t have the necessary diversity. Supporting all these on‑prem Workbench use cases helps because it takes the tools to where the data is. The cloud workbench handles data in the cloud data lake. But at the same time, our research customers who are partnering with us can use this on-prem, taking the tools to where the data is: in their hospital network.
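For readers wondering what that API compatibility looks like in code: the application keeps using an ordinary DynamoDB client, and only the endpoint changes. Here is a minimal sketch with the AWS SDK for Go (v1); the host, port (Alternator commonly listens on 8000), and credentials are placeholders for your own deployment.

package main

import (
	"fmt"
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/credentials"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/dynamodb"
)

func main() {
	// Same DynamoDB client the cloud version of the app already uses;
	// only the endpoint now points at ScyllaDB Alternator on-prem.
	sess, err := session.NewSession(&aws.Config{
		Region:      aws.String("us-east-1"), // required by the SDK, not used for routing
		Endpoint:    aws.String("http://scylla-alternator.local:8000"),
		Credentials: credentials.NewStaticCredentials("alternator-user", "alternator-secret", ""),
	})
	if err != nil {
		log.Fatal(err)
	}
	svc := dynamodb.New(sess)

	out, err := svc.ListTables(&dynamodb.ListTablesInput{})
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("tables:", aws.StringValueSlice(out.TableNames))
}

The same pattern applies to the other AWS SDKs, which is why the Workbench could move on-prem without rewriting much of its data access code.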