ScyllaDB’s Engineering Summit in Sofia, Bulgaria

From hacking to hiking: what happens when engineering sea monsters get together

ScyllaDB is a remote-first company, with team members spread across the globe. We're masters of virtual connection, but every year we look forward to the chance to step away from our screens and come together for our Engineering Summit. It's a time to reconnect, exchange ideas, and share spontaneous moments of joy that make working together so special. This year, we gathered in Sofia, Bulgaria — a city rich in history and culture, set against the stunning backdrop of the Balkan mountains.

Where Monsters Meet

This year's summit brought together a record-breaking number of participants from across the globe—over 150! As ScyllaDB continues to grow, the turnout reflects our momentum and our team's expanding global reach. We, ScyllaDB Monsters from all corners of the world, came together to share knowledge, build connections, and collaborate on shaping the future of our company and product.

A team-building activity

An elevator ride to one of the sessions

The summit brought together not just the engineering teams but also our Customer Experience (CX) colleagues. With their insights into the real-world experiences of our customers, the CX team helped us see the bigger picture and better understand how our work impacts those who use our product.

The CX team

Looking Inward, Moving Forward

The summit was packed with insightful talks, giving us a chance to reflect on where we are and where we're heading next. It was all about looking back at the wins we've had so far, getting excited about the cool new features we're working on now, and diving deep into what's coming down the pipeline.

CEO and Co-founder, Dor Laor, kicking off the summit

The sessions sparked fruitful discussions about how we can keep pushing forward and build on the strong foundation we've already laid. The speakers touched on a variety of topics, including:

- ScyllaDB X Cloud
- Consistent topology
- Data distribution with tablets
- Object storage
- Tombstone garbage collection
- Customer stories
- Improving customer experience
- And many more

Notes and focus doodles

Collaboration at Its Best: The Hackathon

This year, we took the summit energy to the next level with a hackathon that brought out the best in creativity, collaboration, and problem-solving. Participants were divided into small teams, each tackling a unique problem. The projects were chosen ahead of time so that we had a chance to work on real challenges that could make a tangible impact on our product and processes.

The range of projects was diverse. Some teams focused on adding new features, like implementing a notification API to enhance the user experience. Others took on documentation-related challenges, improving the way we share knowledge. But across the board, every team managed to create a functioning solution or prototype.

At the hackathon

The hackathon brought people from different teams together to tackle complex issues, pushing everyone a bit outside their comfort zone. Beyond the technical achievements, it was a powerful team-building experience, reinforcing our culture of collaboration and shared purpose. It reminded us that solving real-life challenges—and doing it together—makes our work even more rewarding. The hackathon will undoubtedly be a highlight of future summits!

From Development to Dance

And then, of course, came the party.
The atmosphere shifted from work to celebration with live music from a band playing all-time hits, followed by a DJ spinning tracks that kept everyone on their feet.

Live music at the party

Almost everyone hit the dance floor—even those who usually prefer to sit it out couldn't resist the rhythm. It was the perfect way to unwind and celebrate the success of the summit!

Sea monsters swaying

Exploring Sofia and Beyond

During our time in Sofia, we had the chance to immerse ourselves in the city's rich history and culture. Framed by the dramatic Balkan mountains, Sofia blends the old with the new, offering a mix of history, culture, and modern vibes. We wandered through the ancient ruins of the Roman Theater and visited the iconic Alexander Nevsky Cathedral, marveling at their beauty and historical significance. To recharge our batteries, we enjoyed delicious meals in modern Bulgarian restaurants.

In front of Alexander Nevsky Cathedral

But the adventure didn't stop in the city. We took a day trip to the Rila Mountains, where the breathtaking landscapes and serene atmosphere left us in awe. One of the standout sights was the Rila Monastery, a UNESCO World Heritage site known for its stunning architecture and spiritual significance.

The Rila Monastery

After soaking in the peaceful vibes of the monastery, we hiked the trail leading to the Stob Earth Pyramids, a natural wonder that looked almost otherworldly.

The Stob Pyramids

The hike was rewarding, offering stunning views of the mountains and the unique rock formations below. It was the perfect way to experience Bulgaria's natural beauty while winding down from the summit excitement.

Happy hiking

Looking Ahead to the Future

As we wrapped up this year's summit, we left feeling energized by the connections made, ideas shared, and challenges overcome. From brainstorming ideas to clinking glasses around the dinner table, this summit was a reminder of why in-person gatherings are so valuable—connecting not just as colleagues but as a team united by a common purpose. As ScyllaDB continues to expand, we're excited for what lies ahead, and we can't wait to meet again next year. Until then, we'll carry the lessons, memories, and new friendships with us as we keep moving forward. Чао! ("Bye!" in Bulgarian.)

We're hiring – join our team!

Our team

How ScyllaDB Simulates Real-World Production Workloads with the Rust-Based “latte” Benchmarking Tool

Learn why we use a little-known benchmarking tool for testing

Before using a tech product, it's always nice to know its capabilities and limits. In the world of databases, there are a lot of different benchmarking tools that help us assess that. If you're ok with some standard benchmarking scenarios, you're set – one of the existing tools will probably serve you well. But what if not? Rigorously assessing ScyllaDB, a high-performance distributed database, requires testing some rather specific scenarios, ones with real-world production workloads. Fortunately, there is a tool to help with that. It is latte: a Rust-based lightweight benchmarking tool for Apache Cassandra and ScyllaDB.
Special thanks to Piotr Kołaczkowski for implementing the latte benchmarking tool.
We (the ScyllaDB testing team) forked it and enhanced it. In this blog post, I'll share why and how we adopted it for our specialized testing needs.

About latte

Our team really values latte's "flexibility." Want to create a schema using a user-defined type (UDT), Map, Set, List, or any other data type? Ok. Want to create a materialized view and query it? Ok. Want to change custom function behavior based on elapsed time? Ok. Want to run multiple custom functions in parallel? Ok. Want to use small, medium, and large partitions? Ok.

Basically, latte lets us define any schema and workload functions. We can do this thanks to its implementation design. The latte tool is a type of engine/kernel, and rune scripts are essentially the "business logic" that's written separately. Rune scripts are an enhanced, more powerful analog of what cassandra-stress calls user profiles. The rune scripting language is dynamically typed and native to the Rust programming language ecosystem.

In a simple rune script, we define the two required functions (schema and prepare) plus one custom function to be used as our workload – call it myinsert. First, we create the schema with the latte schema command. Then, we use the latte run command to call our custom myinsert function. The replication_factor parameter used by the script is a custom parameter: if we do not specify it, latte will use its default value, 3. We can define any number of custom parameters. A sketch of such a script and the corresponding commands follows below.
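The original snippet isn't reproduced here, so here is a minimal sketch of what such a workload script can look like, modeled on the upstream latte examples. The keyspace/table names are made up, and the helper calls (latte::param!, db.execute, db.prepare, db.execute_prepared) as well as the CLI flags may differ slightly between latte versions, so treat this as a starting point rather than copy-paste material:

```rune
// workload.rn -- illustrative sketch only
const REPLICATION_FACTOR = latte::param!("replication_factor", 3);
const INSERT = "myinsert_stmt";

// Required: create the schema.
pub async fn schema(db) {
    db.execute(`CREATE KEYSPACE IF NOT EXISTS test
        WITH replication = {'class': 'NetworkTopologyStrategy',
                            'replication_factor': ${REPLICATION_FACTOR}}`).await?;
    db.execute("CREATE TABLE IF NOT EXISTS test.basic(id bigint PRIMARY KEY, val text)").await?;
}

// Required: prepare statements before the run starts.
pub async fn prepare(db) {
    db.prepare(INSERT, "INSERT INTO test.basic(id, val) VALUES (?, ?)").await?;
}

// Custom workload function, selected at run time.
pub async fn myinsert(db, i) {
    db.execute_prepared(INSERT, [i, `value-${i}`]).await?;
}
```

```bash
# Illustrative invocations -- check `latte --help` for the exact flags in your build.
latte schema workload.rn 172.17.0.2
latte run workload.rn --function myinsert -P replication_factor=5 172.17.0.2
```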
How is latte different from other benchmarking tools?

Based on our team's experiences, we put together a comparison of latte against the two main competitors, cassandra-stress and ycsb.

How is our fork of latte different from the original latte project?

At ScyllaDB, our main use case for latte is testing complex and realistic customer scenarios with controlled disruptions. But (from what we understand), the project was designed to perform general latency measurements in healthy DB clusters. Given these different goals, we changed some features ("overlapping features") – and added other new ones ("unique to our fork"). Here's an overview:

Overlapping feature differences

- Latency measurement. Fork-latte accounts for coordinated omission in latencies. The original project doesn't consider the "coordinated omission" phenomenon.
- Saturated DB impact. When a system under test cannot keep up with the load/stress, fork-latte tries to satisfy the "rate", compensating for missed scheduler ticks ASAP. Source-latte pulls back on violating the rate requirement and doesn't later compensate for missed scheduler ticks. This isn't a "bug"; it is a design decision, but one which also violates the idea of proper latency calculation related to the "coordinated omission" phenomenon.
- Retries. We enabled retries by default; there, they are disabled by default.
- Prepared statements. Fork-latte supports all the CQL data types available in ScyllaDB Rust Driver. The source project has limited support of CQL data types.
- ScyllaDB Rust Driver. Our fork uses the latest version, 1.2.0. The source project sticks to the old version, 0.13.2.
- Stress execution reporting. The report is disabled by default in fork-latte. It's enabled in source-latte.

Features unique to our fork

- Preferred datacenter support. Useful for testing multi-DC DB setups.
- Preferred rack support. Useful for testing multi-rack DB setups.
- Possibility to get a list of actual datacenter values from the DB nodes that the driver connected to. Useful for creating schema with dc-based keyspaces.
- Sine-wave rate limiting. Useful for SLA/Workload Prioritization demos and OLTP testing with peaks and lows.
- Batch query type support.
- Multi-row partitions. Our fork can create multi-row partitions of different sizes.
- Page size support for select queries. Useful when using the multi-row partitions feature.
- HDR histograms support. The source project has only one way to get the HDR histogram data: it stores the data in RAM till the end of a stress command execution and only releases it at the end as part of a report, which leaks RAM. Forked latte supports the above inherited approach and one more: real-time streaming of HDR histogram data without storing it in RAM. No RAM leaks.
- Rows count validation for select queries. Useful for testing data resurrection.

Example: Testing multi-row partitions of different sizes

Let's look at one specific user scenario where we applied our fork of latte to test ScyllaDB. For background, one of the user's ScyllaDB production clusters was using large partitions which could be grouped by size into 3 groups: 2000, 4000 and 8000 rows per partition. 95% of the partitions had 2000 rows, 4% of partitions had 4000 rows, and the last 1% of partitions had 8000 rows. The target table had 20+ different columns of different types. Also, ScyllaDB's Secondary Indexes (SI) feature was enabled for the target table.

One day, on one of the cloud providers, latencies spiked and throughput dropped. The source of the problem was not immediately clear. To learn more, we needed a quick way to reproduce the customer's workload in a test environment. Using the latte tool and its great flexibility, we created a rune script covering all the above specifics. Assume we have a ScyllaDB cluster where one of the nodes has a 172.17.0.2 IP address. With that simplified rune script, we first create the schema we need, then populate the just-created table. To read from the main table and from the MV, we use a similar command – just replacing the function name with get and get_from_mv respectively. The usage of these commands allowed us to get a stable issue reproducer and work on its solution.

Working with ScyllaDB's Workload Prioritization feature

In other cases, we needed to:

- Create a Workload Prioritization (WLP) demo.
- Test an OLTP setup with continuous peaks and lows to showcase giving priority to different workloads.

For these scenarios, we used a special latte feature called sine wave rate. This is an extension to the common rate-limiting feature: it allows us to specify how many operations per second we want to produce, following a wave of peaks and lows configured through command parameters. Looking at the monitoring, the operations-per-second graph shows exactly that sine-wave pattern.

Internal testing of tombstones (validation)

As of June 2025, forked latte supports row count validation. It is useful for testing data resurrection. A short rune script demonstrates these capabilities: as before, we create the schema first, then populate the table with 100k rows. To check that all rows are in place, we run a similar command, changing the function to get and defining the validation strategy as fail-fast. The supported validation strategies are retry, fail-fast, and ignore. Then, we run 2 different commands in parallel.
The first command deletes part of the rows; the second knows when to expect 1 row and when to expect none, based on the timing of the actions taking place during the two commands' runtime. That's a simple example of how we can check whether data got deleted or not. In long-running testing scenarios, we might run more parallel commands, make them depend on the elapsed time, and use many other flexibilities.

Conclusions

Yes, to take advantage of latte, you first need to study a bit of rune scripting. But once you've done that to some extent, especially with examples available, it becomes a powerful tool capable of covering various scenarios of different types.

ScyllaDB Tablets: Answering Your Top Questions

What does your team need to know about tablets – at a purely pragmatic level? Here are answers to the top user questions.

The latest ScyllaDB releases feature some significant architectural shifts. Tablets build upon a multi-year project to re-architect our legacy ring architecture. And our metadata is now fully consistent, thanks to the assistance of Raft. Together, these changes can help teams with elasticity, speed, and operational simplicity. Avi Kivity, our CTO and co-founder, provided a detailed look at why and how we made this shift in a series of blogs (Why ScyllaDB Moved to "Tablets" Data Distribution and How We Implemented ScyllaDB's "Tablets" Data Distribution).

Join Avi for a technical deep dive at our upcoming livestream

We also recently created a quick demo showing what this looks like in action, from the point of view of a database user/operator. But what does your team need to know – at a purely pragmatic level? Here are some of the questions we've heard from interested users, and a short summary of how we answer them.

What's the TL;DR on tablets?

Tablets are the smallest replication unit in ScyllaDB. Data gets distributed by splitting tables into smaller logical pieces called tablets, and this allows ScyllaDB to shift from a static to a dynamic topology. Tablets are dynamically balanced across the cluster using the Raft consensus protocol. This was introduced as part of a project to bring more elasticity to ScyllaDB, enabling faster topology changes and seamless scaling.

Tablets acknowledge that most workloads do not follow a static traffic pattern. In fact, most follow a cyclical curve with different baselines and peaks over a period of time. By decoupling topology changes from the actual streaming of data, tablets present significant cost-saving opportunities for users adopting ScyllaDB by allowing infrastructure to be scaled on demand, fast. Previously, adding or removing nodes required a sequential, one-at-a-time, serialized process with data streaming and rebalancing. Now, you can add or remove multiple nodes in parallel. This significantly speeds up the scaling process and makes ScyllaDB much more elastic.

Tablets are distributed on a per-table basis, with each table having its own set of tablets. The tablets are then further distributed across the shards in the ScyllaDB cluster. The distribution is handled automatically by ScyllaDB, with tablets being dynamically migrated across replicas as needed. Data within a table is split across tablets based on the average geometric size of a token range boundary.

How do I configure tablets?

Tablets are enabled by default in ScyllaDB 2025.1 and are also available with ScyllaDB Cloud. When creating a new keyspace, you can specify whether to enable tablets or not. There are also three key configuration options for tablets: 1) the enable_tablets boolean setting, 2) the target_tablet_size_in_bytes setting (default is 5GB), and 3) the tablets property of a CREATE KEYSPACE statement. Here are a few tips for configuring these settings:

- enable_tablets indicates whether newly created keyspaces should rely on tablets for data distribution. Note that tablets are currently not yet enabled for workloads requiring the use of Counter Tables, Secondary Indexes, Materialized Views, CDC, or LWT.
- target_tablet_size_in_bytes indicates the average geometric size of a tablet, and is particularly useful during tablet split and merge operations. The default indicates that splits are done when a tablet reaches 10GB and merges happen at 2.5GB. A higher value means tablet migration throughput can be reduced (due to larger tablets), whereas a lower value may significantly increase the number of tablets.
- The tablets property allows you to opt in to tablets on a per-keyspace basis via the 'enabled' boolean sub-option. This is particularly important if some of your workloads rely on the currently unsupported features mentioned earlier: you may opt out for those keyspaces and fall back to the still-supported vNode-based replication strategy. Still under the tablets property, the 'initial' sub-option determines how many tablets are created upfront on a per-table basis. We recommend that you target 100 tablets/shard. In future releases, we'll introduce per-table tablet options to extend and simplify this process while deprecating the keyspace sub-option. An illustrative CREATE KEYSPACE statement is shown below.
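For example, a tablets-enabled keyspace with an explicit initial tablet count, and a second keyspace that opts out in favor of vNode-based distribution, could be declared roughly like this (the keyspace names, replication factor, and initial count are illustrative only):

```cql
-- Opt in to tablets and pre-create tablets up front:
CREATE KEYSPACE app
  WITH replication = {'class': 'NetworkTopologyStrategy', 'replication_factor': 3}
  AND tablets = {'enabled': true, 'initial': 1024};

-- Opt out for workloads that still rely on currently unsupported features:
CREATE KEYSPACE legacy_app
  WITH replication = {'class': 'NetworkTopologyStrategy', 'replication_factor': 3}
  AND tablets = {'enabled': false};
```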
How/why should I monitor tablet distribution?

Starting with ScyllaDB Monitoring 4.7, we introduced two additional panels for the observability and distribution of tablets within a ScyllaDB cluster. These metrics are present within the Detailed dashboard under the Tablets section:

- The Tablets over time panel is a heatmap showing the tablet distribution over time. As the data size of a tablet-enabled table grows, you should observe the number of tablets increasing (tablet split) and being automatically distributed by ScyllaDB. Likewise, as the table size shrinks, the number of tablets should be reduced (tablet merge, but to no fewer than the 'initial' value you configured within the keyspace tablets property). Similarly, as you perform topology changes (e.g., adding nodes), you can monitor the tablet distribution progress. You'll notice that existing replicas will have their tablet count reduced while new replicas will increase.
- The Tablets per DC/Instance/Shard panel shows the absolute count of tablets within the cluster. Under homogeneous setups running on top of instances of the same size, this metric should be evenly balanced. However, the situation changes for heterogeneous setups with different shard counts. In this situation, it is expected that larger instances will hold more tablets given their additional processing power. This is, in fact, yet another benefit of tablets: the ability to run heterogeneous setups and leave it up to the database to determine how to internally maximize each instance's performance capabilities.

What are the impacts of tablets on maintenance tasks like node cleanup?

The primary benefit of tablets is elasticity. Tablets allow you to easily and quickly scale your database infrastructure out and in without hassle. This not only translates to infrastructure savings (like avoiding being overprovisioned for the peak all the time). It also allows you to reach a higher percentage of storage utilization before rushing to add more nodes – so you can better utilize the underlying infrastructure you pay for.

Another key benefit of tablets is that they eliminate the need for maintenance tasks like node cleanup. Previously, after scaling out the cluster, operators would need to run node cleanup to ensure data was properly evicted from nodes that no longer owned certain token ranges. With tablets, this is no longer necessary. The compaction process automatically handles the migration of data as tablets are dynamically balanced across the cluster. This is a significant operational improvement that reduces the maintenance burden for ScyllaDB users.
The ability to now run heterogeneous deployments without going through cryptic and hours-long tuning cycles is also a plus. ScyllaDB's tablet load balancer is smart enough to figure out how to distribute and place your data. It considers the amount of compute resources available, reducing the risk of traffic hotspots or data imbalances that may affect your clusters' performance. In the future, ScyllaDB will bring transparent repairs on a per-tablet basis, further eliminating the need for users to worry about repairing their clusters, and also provide "temperature-based balancing" so that hot partitions get split and other shards cooperate with the incoming load.

Do I need to change drivers?

ScyllaDB's latest drivers are tablet-aware, meaning they understand the tablets concept and can route queries to the correct nodes and shards. However, the drivers do not directly query the internal system.tablets table. That could become unwieldy as the number of tablets grows. Furthermore, tablets are transient, meaning a replica owning a tablet may no longer be a natural endpoint for it as time goes by. Instead, the drivers use a dynamic routing process: when a query is sent to the wrong node/shard, the coordinator will respond with the correct routing information, allowing the driver to update its routing cache. This ensures efficient query routing as tablets are migrated across the cluster.

When using ScyllaDB tablets, it's more important than ever to use ScyllaDB shard-aware – and now also tablet-aware – drivers instead of Cassandra drivers. The existing drivers will still work, but they won't work as efficiently because they lack the necessary logic to understand the coordinator-provided tablet metadata. Using the latest ScyllaDB drivers should provide a nice throughput and latency boost. Read more in How We Updated ScyllaDB Drivers for Tablets Elasticity.

More questions?

If you're interested in tablets and we didn't answer your question here, please reach out to us! Our Contact Us page offers a number of ways to interact, including a community forum and Slack.

Introducing ScyllaDB X Cloud: A (Mostly) Technical Overview

ScyllaDB X Cloud just landed! It's a truly elastic database that supports variable/unpredictable workloads with consistent low latency, plus low costs.

The ScyllaDB team is excited to announce ScyllaDB X Cloud, the next generation of our fully-managed database-as-a-service. It features architectural enhancements for greater flexibility and lower cost. ScyllaDB X Cloud is a truly elastic database designed to support variable/unpredictable workloads with consistent low latency as well as low costs. A few spoilers before we get into the details:

- You can now scale out and scale in almost instantly to match actual usage, hour by hour. For example, you can scale all the way from 100K OPS to 2M OPS in just minutes, with consistent single-digit millisecond P99 latency. This means you don't need to overprovision for the worst-case scenario or suffer latency hits while waiting for autoscaling to fully kick in.
- You can now safely run at 90% storage utilization, compared to the standard 70% utilization. This means you need fewer underlying servers and have substantially less infrastructure to pay for.
- Optimizations like file-based streaming and dictionary-based compression also speed up scaling and reduce network costs.
- Beyond the technical changes, there's also an important pricing update. To go along with all this database flexibility, we're now offering a "Flex Credit" pricing model. Basically, this gives you the flexibility of on-demand pricing with the cost advantage that comes from an annual commitment.

Access ScyllaDB X Cloud Now

If you want to get started right away, just go to ScyllaDB Cloud and choose the X Cloud cluster type when you create a cluster. This is our code name for the new type of cluster that enables greater elasticity, higher storage utilization, and automatic scaling. Note that X Cloud clusters are available from the ScyllaDB Cloud application and API. They're available on AWS and GCP, running on a ScyllaDB account or your company's account with the Bring Your Own Account (BYOA) model.

Sneak peek: In the next release, you won't need to choose instance size or number of servers if you select the X Cloud option. Instead, you will be able to define a serverless scaling policy and let X Cloud scale the cluster as required.

If you want to learn more, keep reading. In this blog post, we'll cover what's behind the technical changes and also talk a little about the new pricing option. But first, let's start with the why.

Backstory

Why did we do this? Consider this example from a marketing/AdTech platform that provides event-based targeting: its traffic follows predictable/cyclical daily peaks with a low off-hours baseline, a pattern that is quite common across retail platforms, food delivery services, and other applications aligned with customer work hours. In this case, the peak loads are 3x the base and require 2-3x the resources. With ScyllaDB X Cloud, they can provision for the baseline and quickly scale in/out as needed to serve the peaks. They get the steady low latency they need without having to overprovision – paying for peak capacity 24/7 when it's really only needed for 4 hours a day.

Tablets + just-in-time autoscaling

If you follow ScyllaDB, you know that tablets aren't new. We introduced them last year for ScyllaDB Enterprise (self-managed on the cloud or on-prem). Avi Kivity, our CTO, already provided a look at why and how we implemented tablets.
And you can see tablets in action in our demo. With tablets, data gets distributed by splitting tables into smaller logical pieces ("tablets"), which are dynamically balanced across the cluster using the Raft consensus protocol. This enables you to scale your databases as rapidly as you can scale your infrastructure.

In a self-managed ScyllaDB deployment, tablets make it much faster and simpler to expand and reduce your database capacity. However, you still need to plan ahead for expansion and initiate the operations yourself. ScyllaDB X Cloud lets you take full advantage of tablets' elasticity. Scaling can be triggered automatically based on storage capacity (more on this below) or based on your knowledge of expected usage patterns. Moreover, as capacity expands and contracts, we'll automatically optimize both node count and utilization. You don't even have to choose node size; ScyllaDB X Cloud's storage-utilization target does that for you. This should simplify admin and also save costs.

90% storage utilization

ScyllaDB has always handled running at 100% compute utilization well by having automated internal schedulers manage compactions, repairs, and lower-priority tasks in a way that prioritizes performance. Now, it also does two things that let you increase the maximum storage utilization to 90%:

- Since tablets can move data to new nodes so much faster, ScyllaDB X Cloud can defer scaling until the very last minute.
- Support for mixed instance sizes allows ScyllaDB X Cloud to allocate minimal additional resources to keep the usage close to 90%.

Previously, we recommended adding nodes at 70% capacity. This was because node additions were unpredictable and slow — sometimes taking hours or days — and you risked running out of space. We'd send a soft alert at 50% and automatically add nodes at 70%. However, those big nodes often sat underutilized. With ScyllaDB X Cloud's tablets architecture, we can safely target 90% utilization. That's particularly helpful for teams with storage-bound workloads.

Support for mixed size clusters

A little more on the "mixed instance size" support mentioned earlier. Basically, this means that ScyllaDB X Cloud can now add the exact mix of nodes you need to meet the exact capacity you need at any given time. Previous versions of ScyllaDB used a single instance size across all nodes in the cluster. For example, if you had a cluster with 3 i4i.16xlarge instances, increasing the capacity meant adding another i4i.16xlarge. That works, but it's wasteful: you're paying for a big node that you might not immediately need.

Now with ScyllaDB X Cloud (thanks to tablets and support for mixed-instance sizes), we can scale in much smaller increments. You can add tiny instances first, then replace them with larger ones if needed. That means you rarely pay for unused capacity. For example, before, if you started with an i4i.16xlarge node that had 15 TB of storage and you hit 70% utilization, you had to launch another i4i.16xlarge — adding 15 TB at once. With ScyllaDB X Cloud, you might add two xlarge nodes (2 TB each) first. Then, if you need more storage, you add more small nodes, then eventually replace them with larger nodes. And by the way, i7i instances are now available too, and they are even more powerful. The key is granular, just-in-time scaling: you only add what you need, when you need it. This applies in reverse, too. Before, you had to decommission a large node all at once.
Now, ScyllaDB X Cloud can remove smaller nodes gradually based on the policies you set, saving compute and storage costs.

Network-focused engineering optimizations

Every gigabyte leaving a node, crossing an Availability Zone (AZ) boundary, or replicating to another region shows up on your AWS, GCP, or Azure bill. That's why we've done some engineering work at different layers of ScyllaDB to shrink those bytes—and the dollars tied to them.

File-based streaming

We anticipated that mutation-based streaming would hold us back once we moved to tablets. So we shifted to a new approach: stream the entire SSTable files without deserializing them into mutation fragments and re-serializing them back into SSTables on receiving nodes. As a result, less data is streamed over the network and less CPU is consumed, especially for data models that contain small cells. Think of it as Cassandra's zero-copy streaming, except that we keep ownership metadata with each replica. You can read more about this in the blog Why We Changed ScyllaDB's Data Streaming Approach.

Dictionary-based compression

We also introduced dictionary-trained Zstandard (Zstd) compression, which is pipeline-aware. This involved building a custom RPC compressor with external dictionary support, and a mechanism that trains new dictionaries on RPC traffic, distributes them over the cluster, and performs a live switch of connections to the new dictionaries. This is done in 4 key steps:

- Sample: Continuously sample RPC traffic for some time.
- Train: Train a 100 KiB dictionary on a 16 MiB sample.
- Distribute: Distribute the new dictionary via a system distributed table.
- Switch: Negotiate the switch separately within each connection.

For comparison: LZ4 (Cassandra's default) leaves you at 72% of the original size. Generic Zstd cuts that to 50%. Our per-cluster Zstd dictionary takes it down to 30%, which is a 3X improvement over the default Cassandra compression.
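To see why a trained dictionary helps so much on small, structurally similar messages, here is a toy experiment using the python-zstandard package and synthetic payloads. This is purely illustrative and is not ScyllaDB's RPC compressor; the message shape, sample count, and dictionary size are made up (and scaled down from the 100 KiB / 16 MiB figures above):

```python
import random
import zstandard as zstd

def make_message(i):
    # Synthetic "RPC-like" payload: small, shares structure, differs in values.
    return (f'{{"table":"users","op":"write","key":"user-{i}",'
            f'"value":{random.random():.6f}}}').encode()

# "Sample" step: collect a pile of recent messages.
samples = [make_message(i) for i in range(20_000)]

# "Train" step: build a shared dictionary from the samples.
dictionary = zstd.train_dictionary(8 * 1024, samples)

plain = zstd.ZstdCompressor(level=3)
with_dict = zstd.ZstdCompressor(level=3, dict_data=dictionary)

msg = make_message(987_654)
print("raw bytes:       ", len(msg))
print("zstd, no dict:   ", len(plain.compress(msg)))
print("zstd, with dict: ", len(with_dict.compress(msg)))
# The dictionary-backed compressor usually wins by a wide margin, because the
# shared message structure lives in the dictionary rather than in every message.
```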
Flex Credit

To close, let's shift from the technical changes to a major pricing change: Flex Credit. Flex Credit is a new way to consume a ScyllaDB Cloud subscription. It can be applied to ScyllaDB Cloud as well as ScyllaDB Enterprise. Flex Credit provides the flexibility of on-demand pricing at a lower cost via an annual commitment. In combination with X Cloud, Flex Credit can be a great tool to reduce cost. You can use Reserved pricing for a load that's known in advance and use Flex for less predictable bursts. This saves you from paying the higher on-demand pricing for anything above the reserved part.

How might this play out in your day-to-day work? Imagine your baseline workload handles 100K OPS, but sometimes it spikes to 400K OPS. Previously, you'd have to provision (and pay for) enough capacity to sustain 400K OPS at all times. That's inefficient and costly. With ScyllaDB X Cloud, you reserve 100K OPS upfront. When a spike hits, we automatically call the API to spin up "flex capacity" – instantly scaling you to 400K OPS – and then tear it down when traffic subsides. You only pay for the extra capacity during the peak.

Not sure what to choose? We can help advise based on your workload specifics (contact your representative or ping us here), but here's some quick guidance in the meantime:

- Reserved Capacity: The most cost-effective option across all plans. Commit to a set number of cluster nodes or machines for a year. You lock in lower rates and guarantee capacity availability. This is ideal if your cluster size is relatively stable.
- Hybrid Model: Reserved + On-Demand: Commit to a baseline reserved capacity to lock in lower rates, but if you exceed that baseline (e.g., because you have a traffic spike), you can scale with on-demand capacity at an hourly rate. This is good if your usage is mostly stable but occasionally spikes.
- Hybrid Model: Reserved + Flex Credit: Commit to baseline reserved capacity for the lowest rates. For peak usage, use pre-purchased flex credit (which is discounted) instead of paying on-demand prices. Flex credit also applies to network and backup usage at standard provider rates. This is ideal if you have predictable peak periods (e.g., seasonal spikes, event-driven workload surges, etc.). You get the best of both worlds: low baseline costs and cost-efficient peak capacity.

Recap

In summary, ScyllaDB X Cloud uses tablets to enable faster, more granular scaling with mixed-instance sizes. This lets you avoid overprovisioning and safely run at 90% storage utilization. All of this will help you respond to volatile/unpredictable demand with low latencies and low costs. Moreover, flexible pricing (on-demand, flex credit, reserved) will help you pay only for what you need, especially when you have tablets scaling your capacity up and down in response to traffic spikes. There are also some network cost optimizations through file-based streaming and improved compression.

Want to learn more? Our Co-Founder/CTO Avi Kivity will be discussing the design decisions behind ScyllaDB X Cloud's elasticity and efficiency. Join us for the engineering deep dive on July 10.

ScyllaDB X Cloud: An Inside Look with Avi Kivity

A New Way to Estimate DynamoDB Costs

We built a new DynamoDB cost analyzer that helps developers understand what their workloads will really cost

DynamoDB costs can blindside you. Teams regularly face "bill shock": that sinking feeling when you look at a shockingly high bill and realize that you haven't paid enough attention to your usage, especially with on-demand pricing. Provisioned capacity brings a different risk: performance. If you can't accurately predict capacity or your math is off, requests get throttled. It's a delicate balancing act.

Although AWS offers a DynamoDB pricing calculator, it often misses the nuances of real-world workloads (e.g., bursty traffic, uneven access patterns, or the use of global tables or caching). We wanted something better. In full transparency, we wanted something better to help the teams considering ScyllaDB as a DynamoDB alternative. So we built a new DynamoDB cost calculator that helps developers understand what their workloads will really cost. Although we designed it for teams comparing DynamoDB with ScyllaDB, we believe it's useful for anyone looking to more accurately estimate their DynamoDB costs, for any reason. You can see the live version at: calculator.scylladb.com

How We Built It

We wanted to build something that would work client side, without the need for any server components. It's a simple JavaScript single page application that we currently host on GitHub Pages. If you want to check out the source code, feel free to take a look at https://github.com/scylladb/calculator

To be honest, working with the examples at https://calculator.aws/ was a bit of a nightmare, and when you "show calculations," you get walls of text. I was tempted to take a shorter approach, like:

Monthly WCU Cost = WCUs × Price_per_WCU_per_hour × 730 hours/month

But every time I simplified this, I found it harder to get parity between what I calculated and the final price in AWS's calculation. Sometimes the difference was due to rounding, other times it was due to the mixture of reserved + provisioned capacity, and so on.
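For illustration, that shortcut looks roughly like the sketch below (written in Python for brevity; the actual calculator is JavaScript, and the hourly prices here are placeholder parameters, not authoritative AWS rates):

```python
HOURS_PER_MONTH = 730

def naive_provisioned_monthly_cost(wcus, rcus, wcu_price_per_hour, rcu_price_per_hour):
    """Back-of-the-envelope estimate: capacity units * hourly price * 730 hours.

    Fine for a ballpark figure, but it ignores the rounding rules, reserved vs.
    provisioned mixes, and per-feature line items that AWS applies -- which is
    why the real calculator follows AWS's own calculations step by step.
    """
    return (wcus * wcu_price_per_hour + rcus * rcu_price_per_hour) * HOURS_PER_MONTH

# Example with made-up prices: 500 WCUs and 500 RCUs.
print(naive_provisioned_monthly_cost(500, 500, 0.00065, 0.00013))
```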
So to make it easier (for me) to debug, I faithfully followed their calculations line by line and tried to replicate them in my own rather ugly function: https://github.com/scylladb/calculator/blob/main/src/calculator.js

I may still refactor this into smaller functions. But for now, I wanted to get parity between theirs and ours. You'll see that there are also some end-to-end tests for these calculations — I use those to test a bunch of different configurations. I will probably expand on these in time as well. So that gets the job done for On-Demand, Provisioned (and Reserved) capacity models.

If you've used AWS's calculator, you know that you can't specify things like a peak (or peak width) in On-Demand mode. I'm not sure about their reasoning. I decided it would be easier for users to specify both the baseline and peak for reads and writes (respectively) in On-Demand, much like Provisioned capacity. Another design decision was to represent the traffic using a chart. I do better with visuals, so seeing the peaks and troughs makes it easier for me to understand – and I hope it does for you as well. You'll also notice that as you change the inputs, the URL query parameters change to reflect those inputs. That's designed to make it easier to share and reference specific variations of costs.

There's some other math in there, like figuring out the true cost of Global Tables and understanding derived costs of things like network transfer or DynamoDB Accelerator (DAX). However, explaining all that is a bit too dense for this format. We'll talk more about it in an upcoming webinar (see the next section). The good news is that you can estimate these costs in addition to your workload, as they can be big cost multipliers when planning out your usage of DynamoDB.

Explore "what if" scenarios for your own workloads

Analyzing Costs in Real-World Scenarios

The ultimate goal of all this tinkering and tuning is to help you explore various "what-if" scenarios from a DynamoDB cost perspective. To get started, we're sharing the cost impacts of some of the more interesting DynamoDB user scenarios we've come across at ScyllaDB. My colleague Gui and I just got together for a deep dive into how factors like traffic surges, multi-datacenter expansion, and the introduction of caching (e.g., DAX) impact DynamoDB costs. We explored how a few (anonymized) teams we work with ended up blindsided by their DynamoDB bills and the various options they considered for getting costs back under control.

Watch the DynamoDB costs chat now

Cassandra vs. MongoDB: When to Use Which

Which NoSQL database works best for powering your GenAI use cases? A look at Cassandra vs. MongoDB and which to use when.

Rust Rewrite, Postgres Exit: Blitz Revamps Its “League of Legends” Backend

How Blitz scaled their game coaching app with lower latency and leaner operations

Blitz is a fast-growing startup that provides personalized coaching for games such as League of Legends, Valorant, and Fortnite. They aim to help gamers become League of Legends legends through real-time insights and post-match analysis. While players play, the app does quite a lot of work. It captures live match data, analyzes it quickly, and uses it for real-time game screen overlays plus personalized post-game coaching. The guidance is based on each player's current and historic game activity, as well as data collected across billions of matches involving hundreds of millions of users.

Thanks to growing awareness of Blitz's popular stats and game-coaching app, their steadily increasing user base pushed their original Postgres- and Elixir-based architecture to its limits. This blog post explains how they recently overhauled their League of Legends data backend – using Rust and ScyllaDB.

TL;DR – In order to provide low latency, high availability, and horizontal scalability to their growing user base, they ultimately:

- Migrated backend services from Elixir to Rust.
- Replaced Postgres with ScyllaDB Cloud.
- Heavily reduced their Redis footprint.
- Removed their Riak cluster.
- Replaced queue processing with real-time processing.
- Consolidated infrastructure from over a hundred cores of microservices to four n4-standard-4 Google Cloud nodes (plus a small Redis instance for edge caching).

As an added bonus, these changes ended up cutting Blitz's infrastructure costs and reducing the database burden on their engineering staff.

Blitz Background

As Naveed Khan (Head of Engineering at Blitz) explained, "We collect a lot of data from game publishers and during gameplay. For example, if you're playing League of Legends, we use Riot's API to pull match data, and if you install our app we also monitor gameplay in real time. All of this data is stored in our transactional database for initial processing, and most of it eventually ends up in our data lake."

Scaling Past Postgres

One key part of Blitz's system is the Playstyles API, which analyzes pre-game data for both teammates and opponents. This intensive process evaluates up to 20 matches per player and runs nine separate times per game (once for each player in the match). The team strategically refactored and consolidated numerous microservices to improve performance. But the data volume remained intense. According to Brian Morin (Principal Backend Engineer at Blitz), "Finding a database solution capable of handling this query volume was critical."

They originally used Postgres, which served them well early on. However, as their write-heavy workloads scaled, the operational complexity and costs on Google Cloud grew significantly. Moreover, scaling Postgres became quite complex. Naveed shared, "We tried all sorts of things to scale. We built multiple services around Postgres to get the scale we needed: a Redis cluster, a Riak cluster, and Elixir Oban queues that occasionally overflowed. Queue management became a big task."

To stay ahead of the game, they needed to move on. As startups scale, they often switch from "just use Postgres" to "just use NoSQL." Fittingly, the Blitz team considered moving to MongoDB, but eventually ruled it out. "We had lots of MongoDB experience in the team and some of us really liked it. However, our workload is very write-heavy, with thousands of concurrent players generating a constant stream of data.
MongoDB uses a single-writer architecture, so scaling writes means vertically scaling one node." In other words, MongoDB's primary-secondary architecture would create a bottleneck for their specific workload and anticipated growth.

They then decided to move forward with RocksDB because of its low latency and cost considerations. Tests showed that it would meet their latency needs, so they performed the required data (re)modeling and migrated a few smaller games over from Postgres to RocksDB. However, they ultimately decided against RocksDB due to scale and high availability concerns. "Based on available data from our testing, it was clear RocksDB wouldn't be able to handle the load of our bigger games – and we couldn't risk vertically scaling a single instance, and then having that one instance go down," Naveed explained.

Why ScyllaDB

One of their backend engineers suggested ScyllaDB, so they reached out and ran a proof of concept. They were primarily looking for a solution that could handle the write throughput, scale horizontally, and provide high availability. They tested it on their own hardware first, then moved to ScyllaDB Cloud. Per Naveed, "The cost was pretty close to self-hosting, and we got full management for free, so it was a no-brainer. We now have a significantly reduced Redis cluster, plus we got rid of the Riak cluster and Oban queue dependencies. Just write to ScyllaDB and it all just works. The amount of time we spend on infrastructure management has significantly decreased."

Performance-wise, the shift met their goal of leveling up the user experience … and also simplified life for their engineering teams. Brian added, "ScyllaDB proved exceptional, delivering robust performance with capacity to spare after optimization. Our League product peaks at around 5k ops/sec with the cluster reporting under 20% load. Our biggest constraint has been disk usage, which we've rolled out multiple updates to mitigate. The new system can now often return results immediately instead of relying on cached data, providing more up-to-date information on other players and even identifying frequent teammates. The results of this migration have been impressive: over a hundred cores of microservices have been replaced by just four n4-standard-4 nodes and a minimal Redis instance for caching. Additionally, a 3x n2-highmem ScyllaDB cluster has effectively replaced the previous relational database infrastructure that required significant computing resources."

High-Level Architecture of Blitz Server with Rust and ScyllaDB

Rewriting Elixir Services into Rust

As part of a major backend overhaul, the Blitz team began rethinking their entire infrastructure – beyond the previously described shift from Postgres to the high-performance and distributed ScyllaDB. Alongside this database migration, they also chose to sunset their Elixir-based services in favor of a more modern language. After careful evaluation, Rust emerged as the clear choice. "Elixir is great and it served its purpose well," explained Naveed. "But we wanted to move toward something with broader adoption and a stronger systems-level ecosystem. Rust proved to be a robust and future-proof alternative."

Now that the first batch of Rust-rewritten services are in production, Naveed and team aren't looking back: "Rust is fantastic. It's fast, and the compiler forces you to write memory-safe code upfront instead of debugging garbage-collection issues later. Performance is comparable to C, and the talent pool is also much larger compared to Elixir."

Why We Changed ScyllaDB’s Data Streaming Approach

How moving from mutation-based streaming to file-based streaming resulted in 25X faster streaming time

Data streaming – an internal operation that moves data from node to node over a network – has always been the foundation of various ScyllaDB cluster operations. For example, it is used by "add node" operations to copy data to a new node in a cluster (as well as "remove node" operations to do the opposite). As part of our multiyear project to optimize ScyllaDB's elasticity, we reworked our approach to streaming. We recognized that when we moved to tablets-based data distribution, mutation-based streaming would hold us back. So we shifted to a new approach: stream the entire SSTable files without deserializing them into mutation fragments and re-serializing them back into SSTables on receiving nodes. As a result, less data is streamed over the network and less CPU is consumed, especially for data models that contain small cells.

Mutation-Based Streaming

In ScyllaDB, data streaming is a low-level mechanism to move data between nodes. For example, when nodes are added to a cluster, streaming moves data from existing nodes to the new nodes. We also use streaming to decommission nodes from the cluster. In this case, streaming moves data from the decommissioned nodes to other nodes in order to balance the data across the cluster. Previously, we were using a streaming method called mutation-based streaming. On the sender side, we read the data from multiple SSTables. We get a stream of mutations, serialize them, and send them over the network. On the receiver side, we deserialize them and write them to SSTables.

File-Based Streaming

Recently, we introduced a new file-based streaming method. The big difference is that we do not read the individual mutations from the SSTables, and we skip all the parsing and serialization work. Instead, we read and send the SSTable directly to remote nodes. A given SSTable always belongs to a single tablet. This means we can always send the entire SSTable to other nodes without worrying about whether the SSTable contains unwanted data.

We implemented this by having the Seastar RPC stream interface stream SSTable files over the network for tablet migration. More specifically, we take an internal snapshot of the SSTables we want to transfer so the SSTables won't be deleted during streaming. Then, SSTable file readers are created for them so we can use the Seastar RPC stream to send the SSTable files over the network. On the receiver side, the file streams are written into SSTable files by the SSTable writers.

Why did we do this? First, it reduces CPU usage because we do not need to read each and every mutation fragment from the SSTables, and we do not need to parse mutations. The CPU reduction is even more significant for small cells, where the ratio of the amount of metadata parsed to real user data is higher. Second, the format of the SSTable is much more compact than the mutation format (since the on-disk representation of data is more compact than the in-memory one). This means we have less data to send over the network. As a result, it can boost the streaming speed rather significantly.
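To make the contrast concrete, here is a deliberately simplified sketch of the two paths. This is a toy illustration in Python, not ScyllaDB's actual C++/Seastar code; parse_rows and encode_mutation are hypothetical stand-ins for the real deserialization and re-serialization work that mutation-based streaming performs:

```python
import io

def parse_rows(sstable_bytes):
    # Hypothetical parser: pretend every 64 bytes of the SSTable is one row.
    return [sstable_bytes[i:i + 64] for i in range(0, len(sstable_bytes), 64)]

def encode_mutation(row):
    # Hypothetical wire encoding: in-memory mutation formats are typically
    # less compact than the on-disk SSTable format, so add some framing overhead.
    return b"MUTATION" + len(row).to_bytes(4, "big") + row

def stream_mutations(sstable_bytes, network):
    # Mutation-based streaming: parse every row, re-encode it, send it.
    for row in parse_rows(sstable_bytes):
        network.write(encode_mutation(row))

def stream_file(sstable_bytes, network):
    # File-based streaming: ship the SSTable bytes as-is; the receiver simply
    # writes them back out as an SSTable file. No parsing, no re-encoding.
    network.write(sstable_bytes)

if __name__ == "__main__":
    sstable = bytes(1_000_000)               # stand-in for one tablet's SSTable
    a, b = io.BytesIO(), io.BytesIO()
    stream_mutations(sstable, a)
    stream_file(sstable, b)
    print(len(a.getvalue()), len(b.getvalue()))  # the mutation path sends more bytes
```

The real implementation also has to snapshot the SSTables, keep ownership metadata with each replica, and drive everything through Seastar's RPC streams, but the core saving is exactly this: skip the per-mutation parse/serialize round trip and put the compact on-disk bytes straight on the wire.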
Performance Improvements

To quantify how this shift impacted performance, we compared the performance of mutation-based and file-based streaming when migrating tablets between nodes. The tests involved:

- 3 ScyllaDB nodes (i4i.2xlarge)
- 3 loaders (t3.2xlarge)
- 1 billion partitions

File-based streaming results in 25 times faster streaming time. We also have much higher streaming bandwidth: the network bandwidth is 10 times faster with file-based streaming. As mentioned earlier, we have less data to send with file streaming. The data sent on the wire is almost three times less with file streaming. In addition, we can also see that file-based streaming consumes many fewer CPU cycles. Here's a little more detail, in case you're curious.

Disk IO Queue

Comparing the IO bandwidth across mutation-based and file-based streaming, the throughput was, as expected, higher with file-based streaming. In the detailed IO results for mutation-based streaming, the streaming bandwidth is 30-40 MB/s. The bandwidth for file streaming is much higher: the pattern differs from the mutation-based case because file streaming completes more quickly and can sustain a high transfer bandwidth during streaming.

CPU Load

We found that the overall CPU usage is much lower for file-based streaming. The CPU usage is around 12% for mutation-based streaming, while the CPU usage for file-based streaming is less than 5%. Again, this pattern differs from mutation-based streaming because file streams complete much more quickly and can maintain a high transfer bandwidth throughout.

Wrap Up

This new file-based streaming makes data streaming in ScyllaDB faster and more efficient. You can explore it in ScyllaDB Cloud or ScyllaDB 2025.1. Also, our CTO and co-founder Avi Kivity shares an extensive look at our other recent and upcoming engineering projects in his tech talk.

More engineering blog posts