How Supercell Handles Real-Time Persisted Events with ScyllaDB

How a team of just two engineers tackled real-time persisted events for hundreds of millions of players

With just two engineers, Supercell took on the daunting task of growing their basic account system into a social platform connecting hundreds of millions of gamers. Account management, friend requests, cross-game promotions, chat, player presence tracking, and team formation – all of this had to work across their five major games. And they wanted it all covered by a single solution that was simple enough for a single engineer to maintain, yet powerful enough to handle massive demand in real time.

Supercell's Server Engineer, Edvard Fagerholm, recently shared how their mighty team of two tackled this task. Read on to learn how they transformed a simple account management tool into a comprehensive cross-game social network infrastructure that prioritizes both operational simplicity and high performance.

Note: If you enjoy hearing about engineering feats like this, join us at Monster Scale Summit (free + virtual). Engineers from Disney+/Hulu, Slack, Canva, Uber, Salesforce, Atlassian and more will be sharing strategies and case studies.

Background: Who's Supercell?

Supercell is the Finland-based company behind the hit games Hay Day, Clash of Clans, Boom Beach, Clash Royale and Brawl Stars. Each of these games has generated $1B in lifetime revenue.

Somehow they manage to achieve this with a super small staff. Until quite recently, all the account management functionality for games servicing hundreds of millions of monthly active users was built and managed by just two engineers. And that brings us to Supercell ID.

The Genesis of Supercell ID

Supercell ID was born as a basic account system – something to help users recover accounts and move them to new devices. It was originally implemented as a relatively simple HTTP API. Edvard explained, "The client could perform HTTP queries to the account API, which mainly returned signed tokens that the client could present to the game server to prove their identity. Some operations, like making friend requests, required the account API to send a notification to another player. For example, 'Do you approve this friend request?' For that purpose, there was an event queue for notifications. We would post the event there, and the game backend would forward the notification to the client using the game socket."

Enter Two-Way Communication

After Edvard joined the Supercell ID project in late 2020, he started working on the notification backend – mainly for cross-promotion across their five games. He soon realized that they needed to implement two-way communication themselves, and built it as follows: clients connected to a fleet of proxy servers, then a routing mechanism pushed events directly to clients (without going through the game). This was sufficient for the immediate goal of handling cross-promotion and friend requests. It was fairly simple and didn't need to support high throughput or low latency. But it got them thinking bigger.

They realized they could use two-way communication to significantly increase the scope of the Supercell ID system. Edvard explained, "Basically, it allowed us to implement features that were previously part of the game server. Our goal was to take features that any new games under development might need and package them into our system – thereby accelerating their development." With that, Supercell ID began transforming into a cross-game social network that supported features like friend graphs, teaming up, chat, and friend state tracking.

Evolving Supercell ID into a Cross-Game Social Network

At this point, the social network side of the backend was still a single-person project, so they designed it with simplicity in mind. Enter abstraction.

Finding the right abstraction

"We wanted to have only one simple abstraction that would support all of our uses and could therefore be designed and implemented by a single engineer," explained Edvard. "In other words, we wanted to avoid building a chat system, a presence system, etc. We wanted to build one thing, not many."

Finding the right abstraction was key. And a hierarchical key-value store with Change Data Capture fit the bill perfectly. Here's how they implemented it:

The top-level keys in the key-value store are topics that can be subscribed to.
There's a two-layer map under each top-level key – map(string, map(string, string)).
Any change to the data under a top-level key is broadcast to all that key's subscribers.
The values in the innermost map are also timestamped. Each data source controls its own timestamps and defines the correct order. The client drops any update with an older timestamp than what it already has stored.

A typical change in the data would be something like 'level equals 10' changing to 'level equals 11'. As players play, they trigger all sorts of updates like this, which means a lot of small writes are involved in persisting all the events.

Finding the Right Database

They needed a database that would support their technical requirements and be manageable by their minimalist team. That translated to the following criteria:

Handles many small writes with low latency
Supports a hierarchical data model
Manages backups and cluster operations as a service

ScyllaDB Cloud turned out to be a great fit. (ScyllaDB Cloud is the fully-managed version of ScyllaDB, a database known for delivering predictable low latency at scale.)

How it All Plays Out

For an idea of how this plays out in Supercell games, let's look at two examples. First, consider chat messages. A simple chat message might be represented in their data model as follows:

<room ID> -> <timestamp_uuid> -> message   -> "hi there"
                                 metadata  -> …
                                 reactions -> …

Edvard explained, "The top-level key that's subscribed to is the chat room ID. The next level key is a timestamp-UUID, so we have an ordering of each message and can query chat history. The inner map contains the actual message together with other data attached to it."
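To make the timestamp rule concrete, here is a minimal sketch (in Python; illustrative only, not Supercell's actual code) of one topic's two-layer map with last-write-wins merging. Because stale or duplicate updates are simply dropped, re-delivering the same event is harmless:

import collections

class TopicState:
    """State for one subscribable topic: map(string, map(string, string))."""

    def __init__(self) -> None:
        # second-level key -> innermost key -> (timestamp, value)
        self.data: dict[str, dict[str, tuple[int, str]]] = collections.defaultdict(dict)

    def apply(self, key: str, inner_key: str, value: str, ts: int) -> bool:
        """Apply an update, dropping it if a newer timestamp is already stored."""
        current = self.data[key].get(inner_key)
        if current is not None and current[0] >= ts:
            return False  # stale or duplicate update: ignore
        self.data[key][inner_key] = (ts, value)
        return True

topic = TopicState()                       # e.g. the topic for one player
topic.apply("stats", "level", "10", ts=1)
topic.apply("stats", "level", "11", ts=2)  # applied: newer timestamp
print(topic.apply("stats", "level", "10", ts=1))  # False: older update dropped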
Next, let's look at "presence," which is used heavily in Supercell's new (and highly anticipated) game, mo.co. The goal of presence, according to Edvard: "When teaming up for battle, you want to see in real-time the avatar and the current build of your friends – basically the weapons and equipment of your friends, as well as what they're doing. If your friend changes their avatar or build, goes offline, or comes online, it should instantly be visible in the 'teaming up' menu."

Players' state data is encoded into Supercell's hierarchical map as follows:

<player ID> -> "presence" -> weapon -> sword
                             level  -> 29
                             status -> in battle

Note that:

The top level is the player ID, the second level is the type, and the inner map contains the data.
Supercell ID doesn't need to understand the data; it just forwards it to the game clients.
Game clients don't need to know the friend graph since the routing is handled by Supercell ID.

Deeper into the System Architecture

Let's close with a tour of the system architecture, as provided by Edvard.

"The backend is split into APIs, proxies, and event routing/storage servers. Topics live on the event routing servers and are sharded across them. A client connects to a proxy, which handles the client's topic subscriptions. The proxy routes these subscriptions to the appropriate event routing servers. Endpoints (e.g., for chat and presence) send their data to the event routing servers, and all events are persisted in ScyllaDB Cloud.

Each topic has a primary and a backup shard. The primary maintains in-memory sequence numbers for each message so that lost messages can be detected; the secondary forwards messages without sequence numbers. When a primary comes back up after being down, it triggers a refresh of state on the client, as well as resetting the sequence numbers.

The API for the routing layers is a simple post-event RPC containing a batch of (topic, type, key, value) tuples. The job of each API is just to rewrite its data into the above tuple representation. Every event is written to ScyllaDB before being broadcast to subscribers. Our APIs are synchronous in the sense that if an API call gives a successful response, the message was persisted in ScyllaDB. Sending the same event multiple times does no harm, since applying the update on the client is an idempotent operation – with the exception of possibly multiple sequence numbers mapping to the same message."
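That write-then-broadcast contract is easy to picture in code. Here is a minimal sketch (names and signatures are illustrative, not Supercell's API; db and subscribers stand in for the ScyllaDB session and the routing tables):

from typing import NamedTuple

class Event(NamedTuple):
    topic: str  # e.g. a chat room ID or a player ID
    type: str   # e.g. "message" or "presence"
    key: str
    value: str

def post_events(batch: list[Event], db, subscribers: dict[str, list]) -> None:
    # Persist first: a successful return implies every event is durable.
    for ev in batch:
        db.write(ev)
    # Then broadcast; re-delivery is safe because client updates are idempotent.
    for ev in batch:
        for client in subscribers.get(ev.topic, []):
            client.push(ev)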
Edvard continued: "When a client connects, the proxy figures out all of the client's friends and subscribes to their topics, and the same goes for the chat groups the client belongs to. We also subscribe to topics for the connecting client itself; these are used for sending notifications to the client, like friend requests and cross-promotions. A router reboot triggers a resubscription to topics from the proxy. We use Protocol Buffers to save on bandwidth cost.

All load balancing is at the TCP level to guarantee that requests over the same HTTP/2 connection are handled by the same TCP socket on the proxy. This lets us cache certain information in memory on the initial listen, so we don't need to refetch it on other requests. We have enough concurrent clients that we don't need to separately load balance individual HTTP/2 requests; traffic is evenly distributed anyway, and requests are about equally expensive to handle across different users. We use persistent sockets between proxies and routers. This way, we can easily send tens of thousands of subscriptions per second to a single router without an issue."

But It's Not Game Over

If you want to watch the complete tech talk, just press play below. And if you want to read more about ScyllaDB's role in the gaming world, you might also want to read:

Epic Games: How Epic Games uses ScyllaDB as a binary cache in front of NVMe and S3 to accelerate global distribution of large game assets used by Unreal Cloud DDC.
Tencent Games: How Tencent Games built service architecture based on CQRS and event sourcing patterns with Pulsar and ScyllaDB.
Discord: How Discord uses ScyllaDB to power their massive growth, moving from a niche gaming platform to one of the world's largest communication platforms.

How To Analyze ScyllaDB Cluster Capacity

Monitoring tips that can help reduce cluster size 2-5X without compromising latency

Editor's note: The following is a guest post by Andrei Manakov, Senior Staff Software Engineer at ShareChat. It was originally published on Andrei's blog.

I had the privilege of giving a talk at ScyllaDB Summit 2024, where I briefly addressed the challenge of analyzing the remaining capacity in ScyllaDB clusters. A good understanding of ScyllaDB internals is required to plan your computation cost increase when your product grows, or to reduce cost if the cluster turns out to be heavily over-provisioned. In my experience, clusters can be reduced by 2-5X without latency degradation after such an analysis. In this post, I provide more detail on how to properly analyze CPU and disk resources.

How Does ScyllaDB Use CPU?

ScyllaDB is a distributed database, and one cluster typically contains multiple nodes. Each node can contain multiple shards, and each shard is assigned to a single core. The database is built on the Seastar framework and uses a shared-nothing approach. All data is usually replicated in several copies, depending on the replication factor, and each copy is assigned to a specific shard. As a result, every shard can be analyzed as an independent unit, and every shard efficiently utilizes all available CPU resources without any overhead from contention or context switching.

Each shard has different tasks, which we can divide into two categories: client request processing and maintenance tasks. All tasks are executed by a scheduler in one thread pinned to a core, giving each one its own CPU budget limit. Such clear task separation allows isolation and prioritization of latency-critical request-processing tasks. As a result of this design, the cluster handles load spikes more efficiently and provides gradual latency degradation under heavy load. [More details about this architecture.]
Another interesting result of this design is that ScyllaDB supports workload prioritization. In my experience, this approach ensures that critical latency is not impacted during less critical load spikes. I can't recall a similar feature in other databases; such problems are usually tackled by running two separate clusters for different workloads. Keep in mind, though, that this feature is available only in ScyllaDB Enterprise.
However, background tasks may occupy all remaining resources, so overall CPU utilization in the cluster appears spiky and it's not obvious how to find the real cluster capacity. It's easy to see 100% CPU usage with no performance impact: if we increase the critical load, it will take resources (CPU, I/O) away from background tasks. The background tasks' duration can increase slightly, but that's totally manageable.

The Best CPU Utilization Metric

How can we understand the remaining cluster capacity when CPU usage spikes up to 100% throughout the day, yet the system remains stable? We need to exclude maintenance tasks and remove all these spikes from consideration. Since ScyllaDB distributes all the data by shards and every shard has its own core, we take into account the max CPU utilization by a shard, excluding maintenance tasks (you can find other task types here). In my experience, you can keep the utilization up to 60-70% without visible degradation in tail latency. Example of a Prometheus query:

max(sum(rate(scylla_scheduler_runtime_ms{group!~"compaction|streaming"}[5m])) by (instance, shard)) / 10
You can find more details about the ScyllaDB monitoring stack here. In this article, PromQL queries are used to demonstrate how to analyze key metrics effectively.
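If you want to automate this check, here is a minimal sketch (assuming the requests package and the Prometheus server bundled with the ScyllaDB monitoring stack; the URL and the 70% threshold are illustrative) that evaluates the shard-utilization query above via the Prometheus HTTP API:

import requests

PROM_URL = "http://localhost:9090"  # adjust to your monitoring setup
QUERY = ('max(sum(rate(scylla_scheduler_runtime_ms'
         '{group!~"compaction|streaming"}[5m])) by (instance, shard)) / 10')

resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": QUERY}, timeout=10)
resp.raise_for_status()
results = resp.json()["data"]["result"]
if results:
    busiest = float(results[0]["value"][1])
    print(f"busiest shard, excluding maintenance: {busiest:.1f}% of its CPU budget")
    if busiest > 70:
        print("above the ~70% guideline; consider scaling out")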
However, I don't recommend rapidly downscaling the cluster to the desired size just after looking at max CPU utilization excluding the maintenance tasks.

First, look at the average CPU utilization (excluding maintenance tasks) across all shards. In an ideal world, it should be close to the max value. In case of significant skew, it definitely makes sense to find the root cause: it could be an inefficient schema with an incorrect partition key, or an incorrect token-aware/rack-aware configuration in the driver.

Second, take a look at the average CPU utilization of the excluded tasks for anything specific to your workload. It's rarely more than 5-10%, but you might need a larger buffer if they use more CPU. Otherwise, compaction will be starved of resources, and reads will start to become more expensive in terms of both CPU and disk.

Third, it's important to downscale your cluster gradually. ScyllaDB has an in-memory row cache, which is crucial for its performance: it allocates all remaining memory for the cache, and with a memory reduction, the hit rate might drop more than you expect. Hence, CPU utilization can increase non-linearly, and a low cache hit rate can harm your tail latency. You can track the cache hit rate with:

1 - (sum(rate(scylla_cache_reads_with_misses{}[5m])) / sum(rate(scylla_cache_reads{}[5m])))
I haven’t mentioned RAM in this article as there are not many actionable points. However, since memory cache is crucial for efficient reading in ScyllaDB, I recommend always using memory-optimized virtual machines. The more memory, the better.
Disk Resources

ScyllaDB is an LSM-tree-based database. That means it is optimized for writes by design: any mutation appends new data to the disk, and the database periodically rewrites the data to ensure acceptable read performance. Disk performance therefore plays a crucial role in overall database performance. You can find more details about the write path and compaction in the ScyllaDB documentation.

There are three important disk resources we will discuss here: throughput, IOPS, and free disk space. The first two depend on the type and quantity of disks attached to our ScyllaDB nodes. But how can we understand the limit of the IOPS/throughput? There are two possible options:

Any cloud provider or manufacturer usually publishes the performance of their disks; you can find it on their website. For example, NVMe disks from Google Cloud.
The actual disk performance can differ from the numbers that manufacturers share, so the best option might be just to measure it. And we can easily get the result: ScyllaDB runs a benchmark during installation on a node and stores the result in the file io_properties.yaml. The database uses these limits internally to achieve optimal performance.

disks:
  - mountpoint: /var/lib/scylla/data
    read_iops: 2400000          # IOPS
    read_bandwidth: 5921532416  # throughput, bytes/s
    write_iops: 1200000         # IOPS
    write_bandwidth: 4663037952 # throughput, bytes/s

Disk Throughput

sum(rate(node_disk_read_bytes_total{}[5m])) / (read_bandwidth * nodeNumber)
sum(rate(node_disk_written_bytes_total{}[5m])) / (write_bandwidth * nodeNumber)

In my experience, I haven't seen any harm with utilization up to 80-90%.

Disk IOPS

sum(rate(node_disk_reads_completed_total{}[5m])) / (read_iops * nodeNumber)
sum(rate(node_disk_writes_completed_total{}[5m])) / (write_iops * nodeNumber)
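Pulling these together, here is a minimal sketch (assuming PyYAML; the file path is the usual location on a ScyllaDB install but may differ, and the measured rates are illustrative placeholders you would scrape from node_exporter) that divides measured disk rates by the benchmarked limits, mirroring the PromQL queries above:

import yaml  # PyYAML

with open("/etc/scylla.d/io_properties.yaml") as f:  # path may vary per install
    disk = yaml.safe_load(f)["disks"][0]

# Hypothetical measured rates for one node, e.g. scraped from node_exporter:
measured_read_bytes_per_s = 1.2e9
measured_read_iops = 400_000

print(f"read bandwidth used: {measured_read_bytes_per_s / disk['read_bandwidth']:.1%}")
print(f"read IOPS used:      {measured_read_iops / disk['read_iops']:.1%}")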
Disk Free Space

It's crucial to have a significant buffer on every node. If you run out of space, the node becomes basically unavailable, and it is hard to restore. Additional space is required for many operations:

Every update, write, or delete is written to the disk and allocates new space.
Compaction requires some buffer while cleaning up space.
Backup procedures.

The best way to control disk usage is to use Time To Live on tables, if that matches your use case: irrelevant data will expire and be cleaned up during compaction. I usually try to keep at least 50-60% of disk space free.

min(sum(node_filesystem_avail_bytes{mountpoint="/var/lib/scylla"}) by (instance) / sum(node_filesystem_size_bytes{mountpoint="/var/lib/scylla"}) by (instance))

Tablets

Most apps have significant load variations throughout the day or week. ScyllaDB is not elastic, and you need to provision the cluster for peak load, so you can waste a lot of resources during nights or weekends. But that could change soon.

A ScyllaDB cluster distributes data across its nodes, and the smallest unit of data is a partition, uniquely identified by a partition key. A partitioner hash function computes tokens to determine on which nodes data is stored. Every node has its own token range, and all nodes together make a ring. Previously, adding a new node wasn't a fast procedure, because it required copying data to the new node (this is called streaming), adjusting the token ranges of neighbors, etc. In addition, it was a manual procedure.

However, ScyllaDB introduced tablets in version 6.0, and they provide new opportunities. A tablet is a range of tokens in a table, and it includes partitions that can be replicated independently. This makes the overall process much smoother and increases elasticity significantly: adding new nodes takes minutes, and a new node starts processing requests even before full data synchronization. It looks like a significant step toward full elasticity, which could drastically reduce server cost for ScyllaDB even more. You can read more about tablets here. I am looking forward to testing tablets closely soon.

Conclusion

Tablets look like a solid foundation for future pure elasticity, but for now, we're planning clusters for peak load. To effectively analyze ScyllaDB cluster capacity, focus on these key recommendations:

Target max CPU utilization (excluding maintenance tasks) per shard at 60-70%.
Ensure sufficient free disk space to handle compaction and backups.
Gradually downsize clusters to avoid sudden cache degradation.

ScyllaDB University and Training Updates

It's been a while since my last update. We've been busy improving the existing ScyllaDB training material and adding new lessons and labs. In this post, I'll survey the latest developments and update you on the live training event taking place later this month. You can discuss these topics (and more!) on the community forum. Say hello here.

ScyllaDB University LIVE Training

In addition to the self-paced online courses you can take on ScyllaDB University (see below), we host online live training events. These events are a great opportunity to improve your NoSQL and ScyllaDB skills, get hands-on practice, and get your questions answered by our team of experts. The next event is ScyllaDB University LIVE, which will take place on January 29. As usual, we're planning two tracks: an Essentials track and an Advanced track. However, this time we'll change the format and make each track a complete learning path. Stay tuned for more details, and I hope to see you there. Save your spot at ScyllaDB University LIVE

ScyllaDB University Content Updates

ScyllaDB University is our online learning platform where you can learn about NoSQL and ScyllaDB and get some hands-on experience. It includes many different self-paced lessons, meaning you can study whenever you have some free time and continue where you left off. The material is free; all you have to do is create a user account. We recently added new lessons and updated many existing ones. All of the following topics were added to the course S201: Data Modeling and Application Development. Start learning

New in the How To Write Better Apps Lesson

General Data Modeling Guidelines

This lesson discusses key principles of NoSQL data modeling, emphasizing a query-driven design approach to ensure efficient data distribution and balanced workloads. It highlights the importance of selecting high-cardinality primary keys, avoiding bad access patterns, and using ScyllaDB Monitoring to identify and resolve issues such as hot partitions and large partitions. Neglecting these practices can lead to slow performance, bottlenecks, and potentially unreadable data – underscoring the need to follow best practices when creating your data model. To learn more, you can explore the complete lesson here.

Large Partitions and Collections

This lesson provides insights into common pitfalls in NoSQL data modeling, focusing on issues like large partitions, collections, and improper use of ScyllaDB features. It emphasizes avoiding large partitions due to their impact on performance and demonstrates this with real-world examples and monitoring data. Collections should generally remain small to prevent high latency. The schema used depends on the use case and on the performance requirements. Practical advice and tools are offered for testing and monitoring. You can learn more in the complete lesson here.

Hot Partitions, Cardinality and Tombstones

This lesson explores common challenges in NoSQL databases, focusing on hot partitions, low-cardinality keys, and tombstones. Hot partitions cause uneven load and bottlenecks, often due to misconfigurations or retry storms. Having many tombstones can degrade read performance due to read amplification. Best practices include avoiding retry storms, using efficient full-table scans over low-cardinality views, and preferring partition-level deletes to minimize tombstone buildup. Monitoring tools and thoughtful schema design are emphasized for efficient database performance. You can find the complete lesson here.
Diagnosis and Prevention

This lesson covers strategies to diagnose and prevent common database issues in ScyllaDB, such as large partitions, hot partitions, and tombstone-related inefficiencies. Tools like the nodetool toppartitions command help identify hot partition problems, while features like per-partition rate limits and shard concurrency limits manage load and prevent contention. Properly configuring timeout settings avoids retry storms that exacerbate hot partition problems. For tombstones, using efficient delete patterns helps maintain performance and prevent timeouts during reads. Proactive monitoring and adjustments are emphasized throughout. You can see the complete lesson here.

New in the Basic Data Modeling Lesson

CQL and the CQL Shell

The lesson introduces the Cassandra Query Language (CQL), its similarities to SQL, and its use in ScyllaDB for data definition and manipulation commands. It highlights the interactive CQL shell (CQLSH) for testing and interaction, alongside a high-level overview of drivers. Common data types and collections like Sets, Lists, Maps, and User-Defined Types in ScyllaDB are briefly mentioned. The "Pet Care IoT" lab example is presented, where sensors on pet collars record data like heart rate or temperature at intervals. This demonstrates how CQL is applied in database operations for IoT use cases, and the example is used in labs later on. You can watch the video and complete lesson here.

Data Modeling Overview and Basic Concepts

The new video introduces the basics of data modeling in ScyllaDB, contrasting NoSQL and relational approaches. It emphasizes starting with application requirements – including queries, performance, and consistency – to design models. Key concepts such as clusters, nodes, keyspaces, tables, and replication factors are explained, highlighting their role in distributed data systems. Examples illustrate how tables and primary keys (partition keys) determine data distribution across nodes using consistent hashing. The lesson demonstrates creating keyspaces and tables, showing how replication factors ensure data redundancy and how ScyllaDB maps partition keys to replica nodes for efficient reads and writes. You can find the complete lesson here.

Primary Key, Partition Key, Clustering Key

This lesson explains the structure and importance of primary keys in ScyllaDB, detailing their two components: the mandatory partition key and the optional clustering key. The partition key determines the data's location across nodes, ensuring efficient querying, while the clustering key organizes rows within a partition. For queries to be efficient, the partition key must be specified to avoid full table scans. An example using pet data illustrates how rows are sorted within partitions by the clustering key (e.g., time), enabling precise and optimized data retrieval; a sketch of this kind of schema follows below. Find the complete lesson here.
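For readers who want to try this out, here is a minimal sketch (assuming the cassandra-driver package and a local node; the keyspace, table, and column names are illustrative, loosely modeled on the Pet Care IoT example above) showing a partition key plus clustering key in action:

import uuid
from cassandra.cluster import Cluster  # the Cassandra driver also works with ScyllaDB

session = Cluster(["127.0.0.1"]).connect()
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS pet_care
    WITH replication = {'class': 'NetworkTopologyStrategy', 'replication_factor': 3}
""")
session.execute("""
    CREATE TABLE IF NOT EXISTS pet_care.heartrate (
        pet_chip_id uuid,    -- partition key: high cardinality, routes data to a node
        time timestamp,      -- clustering key: orders rows within the partition
        heart_rate int,
        PRIMARY KEY (pet_chip_id, time)
    ) WITH CLUSTERING ORDER BY (time DESC)
""")

pet_id = uuid.uuid4()
# Efficient query: the partition key is specified, so there is no full table scan,
# and the DESC clustering order returns the most recent readings first.
rows = session.execute(
    "SELECT time, heart_rate FROM pet_care.heartrate "
    "WHERE pet_chip_id = %s LIMIT 10",
    (pet_id,),
)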
Importance of Key Selection

This video emphasizes the importance of choosing partition and clustering keys in ScyllaDB for optimal performance and data distribution. Partition keys should have high cardinality to ensure even data distribution across nodes and avoid issues like large or hot partitions. Examples of good keys include unique identifiers like user IDs, while low-cardinality keys like states or ages can lead to uneven load and inefficiency. Clustering keys should align with query patterns, considering the order of rows and prioritizing efficient retrieval, such as fetching recent data for time-sensitive applications. Strategic key selection prevents resource bottlenecks and enhances scalability. Learn more in the complete lesson.

Data Modeling Lab Walkthrough (three parts)

The new three-part video lesson focuses on key aspects of data modeling in ScyllaDB, emphasizing the design and use of primary keys. It demonstrates creating a cluster and tables using the CQL shell, highlighting how partition keys determine data location and enable efficient querying, while showcasing different queries. Some tables use a clustering key, which organizes data within partitions and enables efficient range queries. It explains how compound primary keys enhance query flexibility. Next, an example of a different clustering key order (ascending or descending) is given; this enables query optimization and efficient retrieval of data. Throughout the lab walkthrough, different challenges are presented, as well as data modeling solutions to optimize performance, scalability, and resource utilization. You can watch the walkthrough here and also take the lab yourself.

New in the Advanced Data Modeling Lesson

Collections and Drivers

The new lesson discusses advanced data modeling in ScyllaDB, focusing on collections (Sets, Lists, Maps, and User-Defined Types) to simplify models with multi-value fields like phone numbers or emails. It introduces token-aware and shard-aware drivers as optimizations to enhance query efficiency. Token-aware drivers allow clients to send requests directly to replica nodes, bypassing extra hops through coordinator nodes, while shard-aware clients target specific shards within replica nodes for improved performance. ScyllaDB supports drivers in multiple languages like Java, Python, and Go, along with compatibility with Cassandra drivers. An entire course on drivers is also available. You can learn more in the complete lesson here.

New in the ScyllaDB Operations Course

Replica-Level Write/Read Path

The lesson explains ScyllaDB's read and write paths, focusing on how data is written to memtables and persisted as immutable SSTables. Because SSTables are immutable, they are compacted periodically. Writes, including updates and deletes, are stored in a commit log before being flushed to SSTables; this ensures data consistency. For reads, a cache is used to optimize performance (along with bloom filters). Compaction merges SSTables to remove outdated data, maintain efficiency, and save storage. ScyllaDB offers different compaction strategies, and you can choose the most suitable one based on your use case. Learn more in the full lesson.

Tracing Demo

The lesson provides a practical demonstration of ScyllaDB's tracing using a three-node cluster. The tracing tool is showcased as a debugging aid to track request flows and replica responses. The demo highlights how data consistency levels influence when responses are sent back to clients, and demonstrates high availability by successfully handling writes even when a node is down, provided the consistency requirements are met. You can find the complete lesson here.

Top Blogs of 2024: Comparisons, Caching & Database Internals

Let's look back at the top 10 ScyllaDB blog posts written this year – plus 10 "timeless classics" that continue to get attention. Before we start, thank you to all the community members who contributed to our blogs in various ways – from users sharing best practices at ScyllaDB Summit, to engineers explaining how they raised the bar for database performance, to anyone who has initiated or contributed to the discussion on HackerNews, Reddit, and other platforms. And if you have suggestions for 2025 blog topics, please share them with us on our socials. With no further ado, here are the most-read blog posts that we published in 2024…

We Compared ScyllaDB and Memcached and… We Lost?
By Felipe Cardeneti Mendes
Engineers behind ScyllaDB joined forces with Memcached maintainer dormando for an in-depth look at database and cache internals, and the tradeoffs in each.
Read: We Compared ScyllaDB and Memcached and… We Lost?
Related: Why Databases Cache, but Caches Go to Disk

Inside ScyllaDB's Internal Cache
By Pavel "Xemul" Emelyanov
Why ScyllaDB completely bypasses the Linux cache during reads, using its own highly efficient row-based cache instead.
Read: Inside ScyllaDB's Internal Cache
Related: Replacing Your Cache with ScyllaDB

Smooth Scaling: Why ScyllaDB Moved to "Tablets" Data Distribution
By Avi Kivity
The rationale behind ScyllaDB's new "tablets" replication architecture, which builds upon a multiyear project to implement and extend Raft.
Read: Smooth Scaling: Why ScyllaDB Moved to "Tablets" Data Distribution
Related: ScyllaDB Fast Forward: True Elastic Scale

Rust vs. Zig in Reality: A (Somewhat) Friendly Debate
By Cynthia Dunlop
A (somewhat) friendly P99 CONF popup debate with Jarred Sumner (Bun.js), Pekka Enberg (Turso), and Glauber Costa (Turso) on ThePrimeagen's stream.
Read: Rust vs. Zig in Reality: A (Somewhat) Friendly Debate
Related: P99 CONF on demand

Database Internals: Working with IO
By Pavel "Xemul" Emelyanov
Explore the tradeoffs of different Linux I/O methods and learn how databases can take advantage of a modern SSD's unique characteristics.
Read: Database Internals: Working with IO
Related: Understanding Storage I/O Under Load

How We Implemented ScyllaDB's "Tablets" Data Distribution
By Avi Kivity
How ScyllaDB implemented its new Raft-based tablets architecture, which enables teams to quickly scale out in response to traffic spikes.
Read: How We Implemented ScyllaDB's "Tablets" Data Distribution
Related: Overcoming Distributed Databases Scaling Challenges with Tablets

How ShareChat Scaled their ML Feature Store 1000X without Scaling the Database
By Ivan Burmistrov and Andrei Manakov
How ShareChat engineers managed to meet their lofty performance goal without scaling the underlying database.
Read: How ShareChat Scaled their ML Feature Store 1000X without Scaling the Database
Related: ShareChat's Path to High-Performance NoSQL with ScyllaDB

New Google Cloud Z3 Instances: Early Performance Benchmarks
By Łukasz Sójka, Roy Dahan
ScyllaDB had the privilege of testing Google Cloud's brand new Z3 GCE instances in an early preview. We observed a 23% increase in write throughput, 24% for mixed workloads, and 14% for reads per vCPU – all at a lower cost compared to N2.
Read: New Google Cloud Z3 Instances: Early Performance Benchmarks
Related: A Deep Dive into ScyllaDB's Architecture

Database Internals: Working with CPUs
By Pavel "Xemul" Emelyanov
Get a database engineer's inside look at how the database interacts with the CPU, in this excerpt from the book "Database Performance at Scale."
Read: Database Internals: Working with CPUs
Related: Database Performance at Scale: A Practical Guide [Free Book]

Migrating from Postgres to ScyllaDB, with 349X Faster Query Processing
By Dan Harris and Sebastian Vercruysse
How Coralogix cut processing times from 30 seconds to 86 milliseconds with a PostgreSQL to ScyllaDB migration.
Read: Migrating from Postgres to ScyllaDB, with 349X Faster Query Processing
Related: NoSQL Migration Masterclass

Bonus: Top NoSQL Database Blogs From Years Past

Many of the blogs published in previous years continued to resonate with the community. Here's a rundown of 10 enduring favorites:

How io_uring and eBPF Will Revolutionize Programming in Linux (Glauber Costa): How io_uring and eBPF will change the way programmers develop asynchronous interfaces and execute arbitrary code, such as tracepoints, more securely. [2020]

Benchmarking MongoDB vs ScyllaDB: Performance, Scalability & Cost (Dr. Daniel Seybold): Dr. Daniel Seybold shares how MongoDB and ScyllaDB compare on throughput, latency, scalability, and price-performance in this third-party benchmark by benchANT. [2023]

Introducing "Database Performance at Scale": A Free, Open Source Book (Dor Laor): Introducing a new book that provides practical guidance for understanding the opportunities, trade-offs, and traps you might encounter while trying to optimize data-intensive applications for high throughput and low latency. [2023]

DynamoDB: When to Move Out (Felipe Cardeneti Mendes): A look at the top reasons why teams decide to leave DynamoDB: throttling, latency, item size limits, and limited flexibility… not to mention costs. [2023]

ScyllaDB vs MongoDB vs PostgreSQL: Tractian's Benchmarking & Migration (João Pedro Voltani): TRACTIAN shares their comparison of ScyllaDB vs MongoDB and PostgreSQL, then provides an overview of their MongoDB to ScyllaDB migration process, challenges & results. [2023]

Benchmarking Apache Cassandra (40 Nodes) vs ScyllaDB (4 Nodes) (Juliusz Stasiewicz, Piotr Grabowski, Karol Baryla): We benchmarked Apache Cassandra on 40 nodes vs ScyllaDB on just 4 nodes. See how they stacked up on throughput, latency, and cost. [2022]

How Numberly Replaced Kafka with a Rust-Based ScyllaDB Shard-Aware Application (Alexys Jacob): How Numberly used Rust & ScyllaDB to replace Kafka, streamlining the way all its AdTech components send and track messages (whatever their form). [2023]

Async Rust in Practice: Performance, Pitfalls, Profiling (Piotr Sarna): How our engineers used flamegraphs to diagnose and resolve performance issues in our Tokio-based Rust driver. [2022]

On Coordinated Omission (Ivan Prisyazhynyy): Your benchmark may be lying to you! Learn why coordinated omission is a concern, and how we account for it in benchmarking ScyllaDB. [2021]

Why Disney+ Hotstar Replaced Redis and Elasticsearch with ScyllaDB Cloud (Cynthia Dunlop): Get the inside perspective on how Disney+ Hotstar simplified its "continue watching" data architecture for scale. [2022]

Apache Cassandra 2024 Wrapped: A Year of Innovation and Growth

When Spotify launched its Wrapped campaign years ago, it tapped into something we all love: taking a moment to look back and celebrate achievements. As we wind down 2024, we thought it would be fun to do our own "wrapped" for Apache Cassandra®. And what a year it's been! From groundbreaking...

Why We’re Moving to a Source Available License

TL;DR

ScyllaDB has decided to focus on a single release stream – ScyllaDB Enterprise. Starting with the ScyllaDB Enterprise 2025.1 release (ETA February 2025):

ScyllaDB Enterprise will change from closed source to source available.
ScyllaDB OSS AGPL 6.2 will stand as the final OSS AGPL release.
A free tier of the full-featured ScyllaDB Enterprise will be available to the community. This includes all the performance, efficiency, and security features previously reserved for ScyllaDB Enterprise.
For convenience, the existing ScyllaDB Enterprise 2024.2 will gain the new source available license starting from our next patch release (in December), allowing easy migration of older releases.
The source available Scylla Manager will move to AGPL, and the closed source Kubernetes multi-region operator will be merged with the main Apache-licensed ScyllaDB Kubernetes operator.
Other ScyllaDB components (e.g., Seastar, Kubernetes operator, drivers) will keep their current licenses.

Why are we doing this?

ScyllaDB's team has always been extremely passionate about open source, low-level optimizations, and the delivery of groundbreaking core technologies – from hypervisors (KVM, Xen) to operating systems (Linux, OSv) and the ScyllaDB database. Over our 12 years of existence, we developed an OS, pivoted to the database space, developed Seastar (the open source standalone core engine of ScyllaDB), and developed ScyllaDB itself. Dozens of open source projects were created: drivers, a Kubernetes operator, test harnesses, and various tools.

Open source is an outstanding way to share innovation. It is a straightforward choice for projects that are not your core business. However, it is a constant challenge for vendors whose core product is open source. For almost a decade, we have been maintaining two separate release streams: one for the open source database and one for the enterprise product. Balancing the free vs. paid offerings is a never-ending challenge that involves engineering, product, marketing, and constant sales discussions.

Unlike other projects that decided to switch to source available or BSL to protect themselves from "free ride" competition, we were comfortable with AGPL. We took different paths, from the initial reimplementation of the Apache Cassandra API to an open source implementation of a DynamoDB-compatible API. Beyond the license, we followed a whole 'open source first' approach: almost every line of code – from new features to bug fixes – went to the open source branch first.

We were developing two product lines that competed with one another, and we had to make one of them dramatically better. It's hard enough to develop a single database and support Docker, Kubernetes, virtual and physical machines, and offer a database-as-a-service. The value of developing two separate database products, along with their release trains, ultimately does not justify the massive overhead and incremental costs required. To give you some idea of what's involved, we have had nearly 60 public releases throughout 2024.

Moreover, we have been the single significant contributor of the source code. Our ecosystem tools have received a healthy amount of contributions, but not the core database. That makes sense: the ScyllaDB internal implementation is a C++, shard-per-core, future-promise code base that is extremely hard to understand and requires full-time devotion. Thus, source-wise, we operated as a full open-source-first project.
However, in reality, we benefitted from this no more than we would have as a source-available project.

"Behind the curtain" tradeoffs of free vs. paid

Balancing our requirements (open source first, efficient development, no crippling of our OSS, and differentiation between the two branches) has been challenging, to say the least. Our open source first culture drove us to develop new core features in the open. Our engineers released these features before we were prepared to decide what was appropriate for open source and what was best for the enterprise paid offering. For example, Tablets, our recent architectural shift, was developed entirely in the open – and 99% of its end-user value is available in the OSS release. As the Enterprise version branched out of the OSS branch, it was helpful to keep a unified base for reuse and efficiency. However, it reduced our paid version differentiation, since all features were open by default (unless flagged).

For a while, we thought that the OSS release would be the latest and greatest, with a short lifecycle serving as both a differentiator and a means of efficiency. Although maintaining this process required a lot of effort on our side, it could have been a nice mitigation option – a replacement for a feature/functionality gap between free and paid. However, the OSS users didn't really use the latest releases and didn't always upgrade. Instead, most users preferred to stick to old, end-of-life releases. The result was a lose-lose situation, for users and for us.

Another approach we used was to differentiate with peripheral tools – such as Scylla Manager, which helps to operate ScyllaDB (e.g., running backup/restore and managing repairs) – and place a usage limit on them. Our Kubernetes operator is open source, and we added a separate closed source repository for multi-region Kubernetes support. This is a complicated path for development, and also for our paying users.

The factor that eventually pushed us over the line is that our new architecture – with Raft, tablets, and native S3 – moves peripheral functionality into the core database:

Our backup and restore implementation moves from an agent and external manager into the core database. S3 I/O access for backup and restore (and, in the future, for tiered storage) is handled directly by the core database. The I/O operations are controlled by our schedulers, allowing full prioritization and bandwidth control. Later on, point-in-time recovery will be provided. This is a large unification change, eliminating complexity while improving control.
Repair becomes automatic. Repair is a full-scan backend process that merges inconsistent replica data. Previously, it was controlled by the external Scylla Manager. The new-generation core database runs its own automatic repair with tablet awareness. As a result, there is no need for an external peripheral tool; repair will become transparent to the user, like compaction is today.

These changes lead to a more complete core product, with better manageability and functionality. However, they eat into the differentiators for our paid offerings. As you can see, a combination of architecture consolidations, together with multiple release stream efforts, made our lives extremely complicated and slowed down our progress.

Going forward

After a tremendous amount of thought and discussion on these points, we decided to unify the two release streams as described at the start of this post.
This license shift will allow us to better serve our customers and provide increased free tier value to the community. The new model opens up access to previously restricted capabilities that:

Achieve up to 50% higher throughput and 33% lower latency via profile-guided optimization
Speed up node addition/removal by 30X via file-based streaming
Balance multiple workloads with different performance needs on a single cluster via workload prioritization
Reduce network costs with ZSTD-based network compression (with a shared dictionary) for intra-cluster RPC
Combine the best of Leveled Compaction Strategy and Size-Tiered Compaction Strategy with Incremental Compaction Strategy – resulting in 35% better storage utilization
Use encryption at rest, LDAP integration, and all of the other benefits of the previous closed source Enterprise version
Provide a single (all open source) Kubernetes operator for ScyllaDB
Enable users to enjoy a longer product life cycle

This was a difficult decision for us, and we know it might not be well received by some of our OSS users running large ScyllaDB clusters. We appreciate your journey, and we hope you will continue working with ScyllaDB. After 10 years, we believe this change is the right move for our company, our database, our customers, and our early adopters. With this shift, our team will be able to move faster, better respond to your needs, and continue making progress toward the major milestones on our roadmap: Raft for data, optimized tablet elasticity, and tiered (S3) storage. Read the FAQ

A Tale from Database Performance at Scale

The following is an excerpt from Chapter 1 of Database Performance at Scale (an Open Access book that's available for free). Follow Joan's highly fictionalized adventures with some all-too-real database performance challenges. You'll laugh. You'll cry. You'll wonder how we worked this "cheesy story" into a deeply technical book. Get the complete book, free

Lured in by impressive buzzwords like "hybrid cloud," "serverless," and "edge first," Joan readily joined a new company and started catching up with their technology stack. Her first project had recently started a transition from an in-house database implementation, which turned out not to scale at the same pace as the number of customers, to one of the industry-standard database management solutions. Their new pick was a new distributed database, which, unlike NoSQL, strives to keep the original ACID guarantees known from the SQL world. Due to a few new data protection acts that tend to appear annually nowadays, the company's board decided that they were going to maintain their own datacenter, instead of using one of the popular cloud vendors for storing sensitive information.

On a very high level, the company's main product consisted of only two layers:

The frontend, the entry point for users, which actually runs in their own browsers and communicates with the rest of the system to exchange and persist information.
The everything-else, customarily known as the "backend," but actually including load balancers, authentication, authorization, multiple cache layers, databases, backups, and so on.

Joan's first introductory task was to implement a very simple service for gathering and summing up various statistics from the database, and to integrate that service with the whole ecosystem so that it fetches data from the database in real time and allows the DevOps teams to inspect the statistics live.

To impress the management and reassure them that hiring Joan was their absolute best decision this quarter, Joan decided to deliver a proof-of-concept implementation on her first day! The company's unspoken policy was to write software in Rust, so she grabbed the first driver for their database from a brief crates.io search and sat down to her self-organized hackathon.

The day went by really smoothly, with Rust's ergonomics-focused ecosystem providing a superior developer experience. But then Joan ran her first smoke tests on a real system. Disbelief turned to disappointment and helplessness when she realized that every third request (on average) ended up in an error, even though the whole database cluster reported to be in a healthy, operable state. That meant a debugging session was in order!

Unfortunately, the driver Joan hastily picked for the foundation of her work, even though open source on its own, was just a thin wrapper over precompiled, legacy C code, with no source to be found. Fueled by a strong desire to solve the mystery and a healthy dose of fury, Joan spent a few hours inspecting the network communication with Wireshark, and she made an educated guess that the bug must be in the hashing key implementation. In the database used by the company, keys are hashed to later route requests to appropriate nodes. If a hash value is computed incorrectly, a request may be forwarded to the wrong node, which can refuse it and return an error instead.
Unable to verify the claim due to the missing source code, Joan decided on a simpler path: ditching the originally chosen driver and reimplementing the solution on one of the officially supported, open source drivers backed by the database vendor, with a solid user base and a regularly updated release schedule.

Joan's diary of lessons learned, part I

The initial lessons include:

Choose a driver carefully. It's at the core of your code's performance, robustness, and reliability.
Drivers have bugs too, and it's impossible to avoid them. Still, there are good practices to follow: unless there's a good reason, prefer the officially supported driver (if it exists); open source drivers have advantages (they're not only verified by the community, but also allow deep inspection of their code, and even modifying the driver code to get even more insights for debugging); and it's better to rely on drivers with a well-established release schedule, since they are more likely to receive bug fixes (including for security vulnerabilities) in a reasonable period of time.
Wireshark is a great open source tool for interpreting network packets; give it a try if you want to peek under the hood of your program.

The introductory task was eventually completed successfully, which made Joan ready to receive her first real assignment.

The tuning

Armed with the experience gained working on the introductory task, Joan started planning how to approach her new assignment: a misbehaving app. One of the applications notoriously caused stability issues for the whole system, disrupting other workloads each time it experienced any problems. The rogue app was already based on an officially supported driver, so Joan could cross that one off the list of potential root causes.

This particular service was responsible for injecting data backed up from the legacy system into the new database. Because the company was not in a great hurry, the application was written with low concurrency in mind, to have low priority and not interfere with user workloads. Unfortunately, once every few days something kept triggering an anomaly: the normally peaceful application seemed to be trying to perform a denial-of-service attack on its own database, flooding it with requests until the backend got overloaded enough to cause issues for other parts of the ecosystem.

As Joan watched metrics presented in a Grafana dashboard, clearly suggesting that the rate of requests generated by this application started spiking around the time of the anomaly, she wondered how on Earth this workload could behave like that. It was, after all, explicitly implemented to send new requests only when fewer than 100 of them were currently in progress.

Since collaboration was heavily advertised as one of the company's "spirit and cultural foundations" during the onboarding sessions with an onsite coach, she decided it was best to discuss the matter with her colleague, Tony. "Look, Tony, I can't wrap my head around this," she explained. "This service doesn't send any new requests when 100 of them are already in flight. And look right here in the logs: 100 requests in progress, one returned a timeout error, and…," she then stopped, startled by her own epiphany. "Alright, thanks Tony, you're a dear – best rubber duck ever!," she concluded and returned to fixing the code.

The observation that led to discovering the root cause was rather simple: the request didn't actually *return* a timeout error, because the database server never sent back such a response.
The request was simply qualified as timed out by the driver and discarded. But the sole fact that the driver no longer waits for a response to a particular request does not mean that the database is done processing it! It's entirely possible that the request was instead just stalled, taking longer than expected, and only the driver gave up waiting for its response. With that knowledge, it's easy to imagine that once 100 requests time out on the client side, the app might erroneously think that they are no longer in progress, and happily submit 100 more requests to the database, increasing the total number of in-flight requests (i.e., concurrency) to 200. Rinse, repeat, and you can achieve extreme levels of concurrency on your database cluster – even though the application was supposed to keep it limited to a small number!

Joan's diary of lessons learned, part II

The lessons continue:

Client-side timeouts are convenient for programmers, but they can interact badly with server-side timeouts. Rule of thumb: make the client-side timeouts around twice as long as the server-side ones, unless you have an extremely good reason to do otherwise. Some drivers may be capable of issuing a warning if they detect that the client-side timeout is smaller than the server-side one, or even of amending the server-side timeout to match, but in general it's best to double-check.
Tasks with seemingly fixed concurrency can actually cause spikes under certain unexpected conditions. Inspecting logs and dashboards is helpful in investigating such cases, so make sure that observability tools are available both in the database cluster and for all client applications. Bonus points for distributed tracing, like OpenTelemetry integration.
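To make the failure mode concrete, here is a minimal sketch (not from the book; db stands in for an abstract async database client) of how a client-side timeout silently leaks concurrency past a semaphore-enforced limit:

import asyncio

LIMIT = 100
slots = asyncio.Semaphore(LIMIT)

async def send_one(db, request) -> None:
    async with slots:
        try:
            # The client gives up after 1 second, but the server might still
            # be working on the request long after this coroutine moves on.
            await asyncio.wait_for(db.execute(request), timeout=1.0)
        except asyncio.TimeoutError:
            pass  # the slot is released on exit, yet the server-side work lingers

# Once requests start stalling past the timeout, each expiry frees a slot
# while the old request still occupies the server, so the real in-flight
# count creeps to 200, 300, ... despite the nominal limit of 100.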
With client-side timeouts properly amended, the application choked much less frequently and to a smaller extent, but it still wasn't a perfect citizen in the distributed system. It occasionally picked a victim database node and kept bothering it with too many requests, while ignoring the fact that seven other nodes were considerably less loaded and could help handle the workload too. At other times, its concurrency was reported to be exactly 200% larger than expected by the configuration. Whenever the two anomalies converged in time, the poor node was unable to handle all the requests it was bombarded with, and had to give up on a fair portion of them. A long study of the driver's documentation, which was fortunately available in mdBook format and kept reasonably up to date, helped Joan alleviate those pains too.

The first issue was simply a misconfiguration of the non-default load balancing policy, which tried too hard to pick "the least loaded" database node out of all the available ones, based on heuristics and statistics occasionally updated by the database itself. Unfortunately, this policy was also "best effort," and relied on the statistics arriving from the database always being legit – but a stressed database node could become so overloaded that it wasn't sending back updated statistics in time! That led the driver to falsely believe that this particular server was not actually busy at all. Joan decided that this setup was a premature optimization that turned out to be a footgun, so she just restored the original default policy, which worked as expected.

The second issue (temporary doubling of the concurrency) was caused by another misconfiguration: an overeager speculative retry policy. After waiting for a preconfigured period of time without getting an acknowledgment from the database, drivers would speculatively resend a request to maximize its chances of succeeding. This mechanism is very useful for increasing requests' success rate. However, if the original request also succeeds, it means that the speculative one was sent in vain. In order to balance the pros and cons, speculative retry should be configured to only resend requests when it's very likely that the original one failed. Otherwise, as in Joan's case, the speculative retry may act too soon, doubling the number of requests sent (and thus also doubling concurrency) without improving the success rate at all.

Whew, nothing gives a simultaneous endorphin rush and dopamine hit like a quality debugging session that ends in an astounding success (except writing a cheesy story for a deeply technical book, naturally). Great job, Joan! The end.

Editor's note: If you made it this far and can't get enough of cheesy database performance stories, see what happened to poor old Patrick in "A Tale of Database Performance Woes: Patrick's Unlucky Green Fedoras." And if you appreciate this sense of humor, see Piotr's new book on writing engineering blog posts.

How Digital Turbine Moved DynamoDB Workloads to GCP – In Just One Sprint

How Joseph Shorter and Miles Ward led a fast, safe migration with ScyllaDB's DynamoDB-compatible API

Digital Turbine is a quiet but powerful player in the mobile ad tech business. Their platform is preinstalled on Android phones, connecting app developers, advertisers, mobile carriers, and device manufacturers. In the process, they bring in $500M annually. And if their database goes down, their business goes down.

Digital Turbine recently decided to standardize on Google Cloud, so continuing with their DynamoDB database was no longer an option. They had to move fast without breaking things. Joseph Shorter (VP, Platform Architecture at Digital Turbine) teamed up with Miles Ward (CTO at SADA) and devised a game plan to pull off the move. Spoiler: they not only moved fast, but also ended up with an approach that was even faster… and less expensive too.

You can hear directly from Joe and Miles in this conference talk. We've captured some highlights from their discussion below.

Why migrate from DynamoDB

The tipping point for the DynamoDB migration was Digital Turbine's decision to standardize on GCP following a series of acquisitions. But that wasn't the only issue. DynamoDB hadn't been ideal from a cost perspective or from a performance perspective. Joe explained: "It can be a little expensive as you scale, to be honest. We were finding some performance issues. We were doing a ton of reads – 90% of all interactions with DynamoDB were read operations. With all those operations, we found that the performance hits required us to scale up more than we wanted, which increased costs."

Their DynamoDB migration requirements

Digital Turbine needed the migration to be as fast and low-risk as possible, which meant keeping application refactoring to a minimum. The main concern, according to Joe, was: "How can we migrate without radically refactoring our platform, while maintaining at least the same performance and value, and avoiding a crash-and-burn situation? Because if it failed, it would take down our whole company."

They approached SADA, who helped them think through a few options – including some Google-native solutions and ScyllaDB. ScyllaDB stood out due to its DynamoDB-compatible API, ScyllaDB Alternator.

What the DynamoDB migration entailed

In summary, it was "as easy as pudding pie" (quoting Joe here). But in a little more detail: "There is a DynamoDB API that we could just use. I won't say there was no refactoring. We did some refactoring to make it easy for engineers to plug in this information, but it was straightforward. It took less than a sprint to write that code. That was awesome. Everyone had told us that ScyllaDB was supposed to be a lot faster. Our reaction was, 'Sure, every competitor says their product performs better.' We did a lot with DynamoDB at scale, so we were skeptical. We decided to do a proper POC – not just some simple communication with ScyllaDB compared to DynamoDB. We actually put up multiple apps with some dependencies and set it up the way it actually functions in AWS, then we pressure-tested it. We couldn't afford any mistakes – a mistake here means the whole company would go down. The goal was to make sure, first, that it would work and, second, that it would actually perform. And it turns out, it delivered on all its promises. That was a huge win for us."
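To illustrate the "just use the DynamoDB API" point: ScyllaDB Alternator speaks the DynamoDB wire protocol, so a standard DynamoDB SDK client can be redirected by changing its endpoint. Here is a minimal sketch (the endpoint address, table, and credentials are illustrative, not Digital Turbine's actual setup):

import boto3

dynamodb = boto3.resource(
    "dynamodb",
    endpoint_url="http://scylla-alternator.example.com:8000",  # assumed Alternator address
    region_name="us-east-1",
    aws_access_key_id="alternator",      # credentials depend on your deployment
    aws_secret_access_key="secret",
)
table = dynamodb.Table("devices")
table.put_item(Item={"device_id": "abc123", "status": "active"})
item = table.get_item(Key={"device_id": "abc123"})["Item"]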
Results so far – with minimal cluster utilization

Beyond meeting their primary goal of moving off AWS, the Digital Turbine team improved performance – and they ended up reducing their costs a bit too, as an added benefit. From Joe: "I think part of it comes down to the fact that the performance is just better. We didn't know what to expect initially, so we scaled things to be pretty comparable. What we're finding is that it's simply running better. Because of that, we don't need as much infrastructure. And we're barely tapping the ScyllaDB clusters at all right now. A 20% cost difference – that's a big number, no matter what you're talking about. And when you consider our plans to scale even further, it becomes even more significant.

In the industry we're in, there are only a few major players – Google, Facebook, and then everyone else. Digital Turbine has carved out a chunk of this space, and we have the tools as a company to start competing in ways others can't. As we gain more customers and more people say, 'Hey, we like what you're doing,' we need to scale radically. That 20% cost difference is already significant now, and in the future, it could be massive. Better performance and better pricing – it's hard to ask for much more than that. You've got to wonder why more people haven't noticed this yet."

Learn more about the difference between ScyllaDB and DynamoDB
Compare costs: ScyllaDB vs DynamoDB