In-Memory Scylla, or Racing the Red Queen

“Now, here, you see, it takes all the running you can do, to keep in the same place. If you want to get somewhere else, you must run at least twice as fast as that!” — The Red Queen to Alice, Through the Looking-Glass

In the world of Big Data, if you are not constantly evolving you are already falling behind. This is the heart of the Red Queen hypothesis, which was first proposed for the evolution of natural systems and applies just as much to the evolution of technology. ‘Now! Now!’ cried the Queen. ‘Faster! Faster!’ And so it is with Big Data.

Over the past decade, many databases have shifted from storing all their data on Hard Disk Drives (HDDs) to Solid State Drives (SSDs), dropping latencies to just a few milliseconds and getting ever closer to “now.” The whole industry continues to run “twice as fast” just to stay in place.

As fast NVMe storage drives become commonplace in the industry, they are practically relegating SATA SSDs to legacy status; SATA SSDs are becoming “the new HDDs.”

For some use cases even NVMe is still too slow, and users need to move their data to in-memory deployments instead, where speeds for Random Access Memory (RAM) are measured in nanoseconds. Maybe not in-memory for everything — first, because in-memory isn’t persistent, and also because it can be expensive! — but at least for their most speed-intensive data.

All this acceleration is certainly good news for I/O-intensive, latency-sensitive applications, which can now use those storage devices as a substrate for workloads that used to be kept in memory for performance reasons. However, does the access speed of those devices really match what they advertise? And which workloads are most likely to need the extra speed provided by hosting their data in-memory?

In this article we will examine the performance claims of latency-bound access on real NVMe devices and show that there is still a place for in-memory solutions for extremely latency-sensitive applications. To address those workloads, ScyllaDB added an in-memory option to Scylla Enterprise 2018.1.7. We will discuss how it all ties together in a real database like Scylla and how users can benefit from this new addition to the Scylla product.

Storage Speed Hierarchy

Various storage devices have different access speeds. Faster devices are usually more expensive and have less capacity. The table below shows a brief summary of devices in broad use in modern servers and their access latencies.

Device                   Latency
Register                 1 cycle
Cache                    2-10 ns
DRAM                     100-200 ns
NVMe                     10-100 μs
SATA SSD                 400 μs
Hard Disk Drive (HDD)    10 ms

It would of course be great to have all your data in the fastest storage available: registers or cache. But if your data fits there, yours is probably not a Big Data environment. On the other hand, if the workload is backed by a spinning disk, it is hard to expect good latencies for requests that need to reach the underlying storage. Considering the size-vs-speed tradeoff, NVMe does not look bad at all here. Moreover, in real life a workload needs to fetch data from various places in the storage array to compose a request. In a hypothetical scenario in which two files are accessed for every storage-bound request, with access times of around ~50μs, the cost of a storage-bound access is around 100μs, which is not bad at all. But how reliable are those access numbers in real life?
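As a sanity check on that arithmetic, here is a toy cost model in Python. The latency figures are rough midpoints taken from the table above, and the two-files-per-request pattern is the hypothetical from this paragraph; nothing here is measured data.

```python
# Toy per-request storage cost model; latency figures are rough midpoints
# from the table above (illustrative, not measured).
latency_us = {
    'dram': 0.15,       # 100-200 ns
    'nvme': 50,         # 10-100 us
    'sata_ssd': 400,
    'hdd': 10_000,      # ~10 ms seek
}

def request_cost_us(device, files_touched=2):
    """A storage-bound request typically has to read from several files."""
    return files_touched * latency_us[device]

for device in latency_us:
    print('%-9s %8.1f us per request' % (device, request_cost_us(device)))
```

With the hypothetical two file accesses, NVMe lands at ~100μs per request, while the same request against HDD would cost ~20ms.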

Real World Latencies

In practice, though, we see that NVMe latencies may be much higher than advertised, even higher than what spinning disks provide. There are a couple of reasons for this. The first is a technology limitation: an SSD becomes slower as it fills up and data is written and rewritten. The reason is that an SSD has an internal Garbage Collection (GC) process that looks for free blocks, and this process becomes more time consuming the less free space there is. We have seen disks exhibit latencies of hundreds of milliseconds in worst-case scenarios. To avoid this problem, freed blocks have to be explicitly discarded by the operating system so that GC becomes unnecessary. This is done by running the fstrim utility periodically (which we absolutely recommend), but ironically, fstrim running in the background may cause latencies by itself.

Another reason for larger-than-promised latencies is that a query does not run in isolation. In a real I/O-intensive system like a database, latency-sensitive accesses such as queries usually run in parallel and consume disk bandwidth concurrently with high-throughput patterns like bulk writes and data reorganization (such as compactions in ScyllaDB). As a result, latency-sensitive requests may end up waiting in a device queue, which shows up as increased tail latency.

It is possible to observe all those scenarios in practice with the ioping utility. ioping is very similar to the well-known ping networking utility, but instead of sending requests over the network it sends them to a disk. Here are the results of a test we did on AWS:

No other IO:
99 requests completed in 8.54 ms, 396 KiB read, 11.6 k iops, 45.3 MiB/s
generated 100 requests in 990.3 ms, 400 KiB, 100 iops, 403.9 KiB/s
min/avg/max/mdev = 59.6 us / 86.3 us / 157.8 us / 27.2 us

Read/Write fio benchmark:
99 requests completed in 34.2 ms, 396 KiB read, 2.90 k iops, 11.3 MiB/s
generated 100 requests in 990.3 ms, 400 KiB, 100 iops, 403.9 KiB/s
min/avg/max/mdev = 73.0 us / 345.2 us / 5.74 ms / 694.3 us

99 requests completed in 300.3 ms, 396 KiB read, 329 iops, 1.29 MiB/s
generated 100 requests in 1.24 s, 400 KiB, 80 iops, 323.5 KiB/s
min/avg/max/mdev = 62.2 us / 3.03 ms / 83.4 ms / 14.5 ms

As we can see, under normal conditions the disk provides latencies in the promised range, but when the disk is under load, the maximum latency can be very high.
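The essence of what ioping does can be sketched in a few lines of Python. This is an illustrative sketch, not a replacement for the real tool: it does not use O_DIRECT, so on a file that fits in the page cache the numbers will be optimistic, and the function name and layout here are our own.

```python
import os, statistics, tempfile, time

def io_ping(path, count=100, block=4096):
    """Time block-sized reads from a file, ioping-style (simplified sketch).

    Returns (min, mean, max) read latency in microseconds. Without O_DIRECT
    these reads may be served from the page cache, so treat the numbers as
    a lower bound on real device latency.
    """
    latencies = []
    fd = os.open(path, os.O_RDONLY)
    try:
        size = os.fstat(fd).st_size
        for i in range(count):
            offset = (i * block) % max(size - block, 1)
            start = time.perf_counter()
            os.pread(fd, block, offset)
            latencies.append((time.perf_counter() - start) * 1e6)  # us
    finally:
        os.close(fd)
    return min(latencies), statistics.mean(latencies), max(latencies)

# Try it against a 1 MiB scratch file:
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(os.urandom(1 << 20))
lo, avg, hi = io_ping(f.name)
print('min/avg/max = %.1f / %.1f / %.1f us' % (lo, avg, hi))
os.unlink(f.name)
```

Running the same loop while a fio write workload hammers the device is an easy way to reproduce the tail-latency effect shown above.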

Scylla Node Storage Model

To understand the benefit of keeping all the data in memory, we need to consider how the Scylla storage model works. Here is a schematic describing the storage model of a single node.

Scylla Storage Model

When the database is queried, a node tries to locate the requested data in the cache and memtables, both of which reside in RAM. If the data is in the cache, good: all that is needed is to combine the data from the cache with the data from the memtable (if any), and a reply can be sent right away. But what if the cache has no data (and no indication that the data is absent from permanent storage as well)? In that case, the bottom part of the diagram comes into play and storage has to be contacted.

The format the data is stored in is called an sstable. Depending on the configured compaction strategy, on how recently the queried data was written, and on other factors, multiple sstables may have to be consulted to satisfy a request. Let’s take a closer look at the sstable format.

Very Brief Description Of the SSTable Format

Each sstable consists of multiple files. Here is a list of the files for a hypothetical non-compressed sstable.


Most of those files (shown in green) are very small, and their content is kept in memory while the sstable is open. But there are two exceptions: Data and Index (shown in red). Let’s take a closer look at what those two contain.

SSTable Data Format

The Data file stores the actual data. It is sorted according to partition keys, which makes binary search possible. But searching for a specific key in a large file may require a lot of disk access, so to make the task more efficient there is another file, Index, that holds a sorted list of keys and offsets into the Data file where data for those keys can be found.

As one can see, each access to an sstable requires at least two reads from disk (possibly more, depending on the size of the data to be read and the position of the key in the Index file).
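A minimal sketch of that two-read lookup, with made-up keys and offsets (this is an illustration of the idea, not Scylla's actual on-disk format):

```python
import bisect

# The Index file is conceptually a sorted list of (partition_key, offset)
# pairs; the Data file holds the rows themselves at those offsets.
# Keys and offsets below are invented for illustration.
index = [("apple", 0), ("kiwi", 120), ("mango", 260), ("pear", 400)]
keys = [k for k, _ in index]

def lookup(key):
    """Read #1: binary-search the Index to find the key's offset in Data."""
    i = bisect.bisect_left(keys, key)
    if i == len(keys) or keys[i] != key:
        return None  # key not in this sstable
    offset = index[i][1]
    # Read #2 would seek to `offset` in the Data file and read the row.
    return offset

print(lookup("mango"))
```

Each of those two steps is a separate disk read when the relevant pages are not cached, which is why a single storage-bound query costs at least two device accesses per sstable touched.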

Benchmarking Scylla

Let’s look at how those maximum latencies affect the behaviour of Scylla. The benchmark was run on a cluster in Google Compute Engine (GCE) with one NVMe disk. In our experience, NVMe on GCE is somewhat slow, which in a way helps emphasize the in-memory benefits. Below is a graph of the 99th percentile latency for access to two different tables. One is a regular table on an NVMe disk (red) and the other is in memory (green).

p99 Latencies In-Memory vs. SSD

The 99th percentile latency for the on-disk table is much higher and shows much more variation. There is another line in the graph (blue) that plots the number of compactions running in the system. The blue graph closely tracks the red one, which means the 99th percentile latency of an on-disk table is greatly affected by the compaction process. The high on-disk latencies here are a direct result of tail latencies that occur when a user read is queued behind a compaction read.

A P99 of 20ms is not much for Scylla, but in this case a single, not-so-fast NVMe disk was used. Adding more NVMe drives in a RAID0 setup would allow more parallelism and mitigate the negative effects of queuing, but it also increases the price of the setup, and at some point it erases all the price benefits of using NVMe while not necessarily achieving the same performance as an in-memory setup. The in-memory setup gives you low and consistent latency at a reasonable price point.


Two new configuration steps are needed to make use of the feature. First, one needs to specify how much memory should be reserved for in-memory sstable storage. This can be done by adding in_memory_storage_size_mb to the scylla.yaml file or by specifying --in-memory-storage-size-mb on the command line. After memory is reserved, an in-memory table can be created by executing:

     key blob PRIMARY KEY,
     "C0" blob,

) WITH compression = {}
  AND read_repair_chance = '0'
  AND speculative_retry = 'ALWAYS'
  AND in_memory = 'true'
  AND compaction = {'class':'InMemoryCompactionStrategy'};

Note the new in_memory property set to true, and the new compaction strategy. Strictly speaking, InMemoryCompactionStrategy is not required for in-memory tables, but this strategy compacts much more aggressively to get rid of data duplication as fast as possible and save memory.

Note that a mix of in-memory and regular tables is supported.


Despite what the vendors may say, real world storage devices can present high tail latencies in the face of competing requests, even for newer technology like NVMe. Workloads that cannot tolerate a jump in latencies under any circumstances can benefit greatly from the new Scylla enterprise in-memory feature. If, on the other hand, a workload can cope with occasionally higher latency for a low number of requests it is beneficial to let Scylla manage what data is held in memory with its usual caching mechanism and to use regular on-disk tables with fast NVMe storage.

Find out more about Scylla Enterprise

The post In-Memory Scylla, or Racing the Red Queen appeared first on ScyllaDB.

Scylla Open Source Release 2.3.3

Scylla Software Release

The Scylla team announces the release of Scylla Open Source 2.3.3, a bugfix release of the Scylla Open Source 2.3 stable branch. Release 2.3.3, like all past and future 2.3.y releases, is backward compatible and supports rolling upgrades.

Note that the latest stable release of Scylla Open Source is release 3.0 and you are encouraged to upgrade to it.

Issues solved in this release:

  • A race condition between the “nodetool snapshot” command and Scylla running compactions may result in a nodetool error: Scylla API server HTTP POST to URL ‘/storage_service/snapshots’ failed: filesystem error: link failed: No such file or directory #4051
  • A bootstrapping node doesn’t wait for the schema before joining the ring, which may result in the node failing to bootstrap with the error “storage_proxy – Failed to apply mutation”. In particular, this error manifests when a user-defined type is used. #4196
  • Counters: Scylla rejects SSTables that contain counters that were created by Cassandra 2.0 and earlier. Due to #4206, Scylla mistakenly rejected some SSTables that were created by Cassandra 2.1 as well.
  • Core: In very rare cases, the commit log replay fails. Commit log replay is used after a node was unexpectedly restarted #4187
  • In some rare cases, during service stop, Scylla exited #4107

The post Scylla Open Source Release 2.3.3 appeared first on ScyllaDB.

How To Set Up A Cluster With Even Token Distribution

Apache Cassandra is fantastic for storing large amounts of data and being flexible enough to scale out as the data grows. This is all fun and games until the data that is distributed in the cluster becomes unbalanced. In this post we will go through how to set up a cluster with predictive token allocation using the allocate_tokens_for_keyspace setting, which will help to evenly distribute the data as it grows.

Unbalanced clusters are bad mkay

An unbalanced load on a cluster means that some nodes will contain more data than others. An unbalanced cluster can be caused by the following:

  • Hot spots - by random chance one node ends up responsible for a higher percentage of the token space than the other nodes in the cluster.
  • Wide rows - due to data modelling issues, for example a partition row which grows significantly larger than the other rows in the data.

The above issues can have a number of impacts on individual nodes in the cluster; however, that is a completely different topic which deserves a more detailed post of its own. In summary, a node that contains disproportionately more tokens and/or data than the other nodes in the cluster may experience one or more of the following issues:

  • Run out of storage more quickly than the other nodes.
  • Serve more requests than the other nodes.
  • Suffer from higher read and write latencies than the other nodes.
  • Time to run repairs is longer than other nodes.
  • Time to run compactions is longer than other nodes.
  • Time to replace the node if it fails is longer than other nodes.

What about vnodes, don’t they help?

Both issues that cause data imbalance in a cluster (hot spots and wide rows) can be prevented by manual control. That is, specify the tokens using the initial_token setting in the cassandra.yaml file for each node, and ensure your data model distributes data evenly across the cluster. The second control measure (data modelling) is something we always need to do when adding data to Cassandra. The first one, defining the tokens manually, is cumbersome when maintaining a cluster, especially when growing or shrinking it. As a result, token management was automated early on in Cassandra (version 1.2 - CASSANDRA-4119) through the introduction of Virtual Nodes (vnodes).

Vnodes break up the available range of tokens into smaller ranges, the number of which is defined by the num_tokens setting in the cassandra.yaml file. The vnode ranges are randomly distributed across the cluster and are generally non-contiguous. If we use a large value for num_tokens, the random distribution makes hot spots less likely. Statistical computation showed that clusters of any size consistently achieved a good token range balance when 256 vnodes were used. Hence, a num_tokens default of 256 was recommended by the community to prevent hot spots in a cluster. The problem is that performance for operations requiring token-range scans (e.g. repairs, Spark operations) tanks big time. It can also cause problems with bootstrapping due to the large number of SSTables generated. Furthermore, as Joseph Lynch and Josh Snyder pointed out in a paper they wrote, the higher the value of num_tokens in large clusters, the higher the risk of data unavailability.

Token allocation gets smart

This paints a pretty grim picture of vnodes, and as far as operators are concerned, they are caught between a rock and hard place when selecting a value for num_tokens. That was until Cassandra version 3.0 was released, which brought with it a more intelligent token allocation algorithm thanks to CASSANDRA-7032. Using a ranking system, the algorithm feeds in the replication factor of a keyspace, the number of tokens, and the partitioner, to derive token ranges that are evenly distributed across the cluster of nodes.

The algorithm is configured via settings in the cassandra.yaml configuration file. Most of the settings it needs already existed in the configuration file, with one exception: a way to specify the keyspace name. So when the algorithm was added, the allocate_tokens_for_keyspace setting was introduced into the configuration file. The setting takes a keyspace name, so that during the bootstrap of a node Cassandra can query that keyspace for its replication factor and pass it to the token allocation algorithm.

However, therein lies the problem: for existing clusters updating this setting is easy, as a keyspace already exists, but for a cluster starting from scratch we have a chicken-and-egg situation. How do we specify a keyspace that doesn’t exist!? And there are other caveats, too…

  • It works for only a single replication factor. As long as all the other keyspaces use the same replication factor as the one specified for allocate_tokens_for_keyspace, all is fine. However, keyspaces with a different replication factor can potentially cause hot spots.
  • It works only when nodes are added to the cluster. The process for token distribution when a node is removed from the cluster remains unchanged, and hence can still cause hot spots.
  • It works with only the default partitioner, Murmur3Partitioner.

Additionally, this is no silver bullet for all unbalanced clusters; we still need to make sure we have a data model that evenly distributes data across partitions. Wide partitions can still be an issue and no amount of token shuffling will fix this.

Despite these drawbacks, this feature gives us the ability to allocate tokens in a more predictable way whilst leveraging the advantage of vnodes. This means we can specify a small value for vnodes (e.g. 4) and still be able to avoid hot spots. The question then becomes, in the case of starting a brand new cluster from scratch, which comes first the chicken or the egg?

One does not simply start a cluster… with evenly distributed tokens

While it might be possible to rectify an unbalanced cluster caused by unfortunate token allocations, it is better for the token allocation to be set up correctly when the cluster is created. To set up a brand new cluster that takes advantage of the allocate_tokens_for_keyspace setting, we need to use the following steps. The method below takes into account a cluster with nodes spread across multiple racks. The examples used in each step assume that our cluster will be configured as follows:

  • 4 vnodes (num_tokens = 4).
  • 3 racks with a single seed node in each rack.
  • A replication factor of 3, i.e. one replica per rack.

1. Calculate and set tokens for the seed node in each rack

We will need to set the tokens for the seed nodes in each rack manually. This is to prevent each node from randomly calculating its own token ranges. We can calculate the token ranges that we will use for the initial_token setting using the following python code:

$ python

Python 2.7.13 (default, Dec 18 2016, 07:03:39)
[GCC 4.2.1 Compatible Apple LLVM 8.0.0 (clang-800.0.42.1)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> num_tokens = 4
>>> num_racks = 3
>>> print "\n".join(['[Node {}] initial_token: {}'.format(r + 1, ','.join([str(((2**64 / (num_tokens * num_racks)) * (t * num_racks + r)) - 2**63) for t in range(num_tokens)])) for r in range(num_racks)])
[Node 1] initial_token: -9223372036854775808,-4611686018427387905,-2,4611686018427387901
[Node 2] initial_token: -7686143364045646507,-3074457345618258604,1537228672809129299,6148914691236517202
[Node 3] initial_token: -6148914691236517206,-1537228672809129303,3074457345618258600,7686143364045646503

We can then uncomment the initial_token setting in the cassandra.yaml file on each of the seed nodes, set it to the value generated by our Python command, and set the num_tokens setting to the number of vnodes. When the node first starts, the value of the initial_token setting will be used; subsequent restarts will use the num_tokens setting.

Note that we need to manually calculate and specify the initial tokens for only the seed node in each rack. All other nodes will be configured differently.
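The interactive session above uses Python 2. For reference, the same calculation written for Python 3 (note the // integer division, since / would produce floats) could look like this:

```python
# Evenly space num_tokens tokens per rack across the Murmur3 token range
# [-2**63, 2**63), interleaving the racks so that consecutive token ranges
# belong to different racks.
num_tokens = 4
num_racks = 3

step = 2**64 // (num_tokens * num_racks)
for r in range(num_racks):
    tokens = [step * (t * num_racks + r) - 2**63 for t in range(num_tokens)]
    print('[Node {}] initial_token: {}'.format(
        r + 1, ','.join(str(tok) for tok in tokens)))
```

This produces the same three initial_token lines shown in the transcript above.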

2. Start the seed node in each rack

We can start the seed nodes one at a time using the following command:

$ sudo service cassandra start

When we watch the logs, we should see messages similar to the following appear:

INFO  [main] ... - This node will not auto bootstrap because it is configured to be a seed node.
INFO  [main] ... - tokens manually specified as [-9223372036854775808,-4611686018427387905,-2,4611686018427387901]

After starting the first of the seed nodes, we can use nodetool status to verify that 4 tokens are being used:

$ nodetool status
Datacenter: dc1
|/ State=Normal/Leaving/Joining/Moving
--  Address       Load       Tokens       Owns (effective)  Host ID                               Rack
UN  99 KiB     4            100.0%            5d7e200d-ba1a-4297-a423-33737302e4d5  rack1

We will wait for this message to appear in the logs, then start the next seed node in the cluster.

INFO  [main] ... - Starting listening for CQL clients on ...

Once all seed nodes in the cluster are up, we can use nodetool ring to verify the token assignments in the cluster. It should look something like this:

$ nodetool ring

Datacenter: dc1
Address        Rack        Status State   Load            Owns                Token
                                                                              7686143364045646503
   rack1       Up     Normal  65.26 KiB       66.67%              -9223372036854775808
   rack2       Up     Normal  65.28 KiB       66.67%              -7686143364045646507
   rack3       Up     Normal  99.03 KiB       66.67%              -6148914691236517206
   rack1       Up     Normal  65.26 KiB       66.67%              -4611686018427387905
   rack2       Up     Normal  65.28 KiB       66.67%              -3074457345618258604
   rack3       Up     Normal  99.03 KiB       66.67%              -1537228672809129303
   rack1       Up     Normal  65.26 KiB       66.67%              -2
   rack2       Up     Normal  65.28 KiB       66.67%              1537228672809129299
   rack3       Up     Normal  99.03 KiB       66.67%              3074457345618258600
   rack1       Up     Normal  65.26 KiB       66.67%              4611686018427387901
   rack2       Up     Normal  65.28 KiB       66.67%              6148914691236517202
   rack3       Up     Normal  99.03 KiB       66.67%              7686143364045646503

We can then move to the next step.

3. Create only the keyspace for the cluster

On any one of the seed nodes we will use cqlsh to create the cluster keyspace using the following commands:

$ cqlsh NODE_IP_ADDRESS -u ***** -p *****

Connected to ...
[cqlsh 5.0.1 | Cassandra 3.11.3 | CQL spec 3.4.4 | Native protocol v4]
Use HELP for help.
cassandra@cqlsh> CREATE KEYSPACE keyspace_with_replication_factor_3
    WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': 3}
    AND durable_writes = true;

Note that this keyspace can have any name; it can even be the keyspace that contains the tables we will use for our data.

4. Set the number of tokens and the keyspace for all remaining nodes

We will set the num_tokens and allocate_tokens_for_keyspace settings in the cassandra.yaml file on all of the remaining nodes as follows:

num_tokens: 4
allocate_tokens_for_keyspace: keyspace_with_replication_factor_3

We have assigned the allocate_tokens_for_keyspace value to be the name of keyspace created in the previous step. Note that at this point the Cassandra service on all other nodes is still down.

5. Start the remaining nodes in the cluster, one at a time

We can start the remaining nodes in the cluster using the following command:

$ sudo service cassandra start

When we watch the logs we should see messages similar to the following appear to say that we are using the new token allocation algorithm:

INFO  [main] ... - JOINING: waiting for ring information
INFO  [main] ... - Using ReplicationAwareTokenAllocator.
WARN  [main] ... - Selected tokens [...]
INFO  ... - JOINING: Finish joining ring

As per step 2 when we started the seed nodes, we will wait for this message to appear in the logs before starting the next node in the cluster.

INFO  [main] ... - Starting listening for CQL clients on ...

Once all the nodes are up, our shiny, new, evenly-distributed-tokens cluster is ready to go!

Proof is in the token allocation

While we can learn a fair bit from talking about the theory behind the allocate_tokens_for_keyspace setting, it is still good to put it to the test and see what difference it makes when used in a cluster. I decided to create two clusters running Apache Cassandra 3.11.3 and compare the load distribution after inserting some data. For this test, I provisioned both clusters with 9 nodes using tlp-cluster and generated load using tlp-stress. Both clusters used 4 vnodes, but one of the clusters was set up using the even token distribution method described above.

Cluster using random token allocation

I started with a cluster that uses the traditional random token allocation system. For this cluster I set num_tokens: 4 and endpoint_snitch: GossipingPropertyFileSnitch in the cassandra.yaml on all the nodes. Nodes were split across three racks by specifying the rack in the cassandra-rackdc.properties file.

Once the cluster instances were up and Cassandra was installed, I started each node one at a time. After all nodes were started, the cluster looked like this:

ubuntu@ip-172-31-39-54:~$ nodetool status
Datacenter: dc1
|/ State=Normal/Leaving/Joining/Moving
--  Address        Load       Tokens       Owns (effective)  Host ID                               Rack
UN   65.29 KiB  4            16.1%             4ada2c52-0d1b-45cd-93ed-185c92038b39  rack1
UN   65.29 KiB  4            20.4%             c282ef62-430e-4c40-a1d2-47e54c5c8685  rack2
UN  65.29 KiB  4            21.2%             48d865d7-0ad0-4272-b3c1-297dce306a34  rack1
UN  87.7 KiB   4            24.5%             27aa2c78-955c-4ea6-9ea0-3f70062655d9  rack1
UN   65.29 KiB  4            30.8%             bd2d745f-d170-4fbf-bf9c-be95259597e3  rack3
UN  70.36 KiB  4            25.5%             056e2472-c93d-4275-a334-e82f87c4b53a  rack3
UN  70.37 KiB  4            24.8%             06b0e1e4-5e73-46cb-bf13-626eb6ce73b3  rack2
UN   65.29 KiB  4            23.8%             137602f0-3248-459f-b07c-c0b3e647fa48  rack2
UN  99.03 KiB  4            12.9%             cd92c974-b32e-4181-9e14-fb52dd27b09e  rack3

I ran tlp-stress against the cluster using the command below. This generated a write-only load that randomly inserted 10 million unique key value pairs into the cluster. tlp-stress inserted data into a newly created keyspace and table called tlp_stress.keyvalue.

tlp-stress run KeyValue --replication "{'class':'NetworkTopologyStrategy','dc1':3}" --cl LOCAL_QUORUM --partitions 10M --iterations 100M --reads 0 --host

After running tlp-stress the cluster load distribution for the tlp_stress keyspace looked like this:

ubuntu@ip-172-31-39-54:~$ nodetool status tlp_stress
Datacenter: dc1
|/ State=Normal/Leaving/Joining/Moving
--  Address        Load       Tokens       Owns (effective)  Host ID                               Rack
UN   1.29 GiB   4            20.8%             4ada2c52-0d1b-45cd-93ed-185c92038b39  rack1
UN   2.48 GiB   4            39.1%             c282ef62-430e-4c40-a1d2-47e54c5c8685  rack2
UN  1.82 GiB   4            35.1%             48d865d7-0ad0-4272-b3c1-297dce306a34  rack1
UN  3.45 GiB   4            44.1%             27aa2c78-955c-4ea6-9ea0-3f70062655d9  rack1
UN   2.16 GiB   4            54.3%             bd2d745f-d170-4fbf-bf9c-be95259597e3  rack3
UN  1.71 GiB   4            29.1%             056e2472-c93d-4275-a334-e82f87c4b53a  rack3
UN  1.14 GiB   4            26.2%             06b0e1e4-5e73-46cb-bf13-626eb6ce73b3  rack2
UN   2.61 GiB   4            34.7%             137602f0-3248-459f-b07c-c0b3e647fa48  rack2
UN  562.15 MiB  4            16.6%             cd92c974-b32e-4181-9e14-fb52dd27b09e  rack3

I verified the data load distribution by checking the disk usage on all nodes using pssh (parallel ssh).

ubuntu@ip-172-31-39-54:~$ pssh -ivl ... -h hosts.txt "du -sh /var/lib/cassandra/data"
[1] ... [SUCCESS]
1.2G    /var/lib/cassandra/data
[2] ... [SUCCESS]
3.5G    /var/lib/cassandra/data
[3] ... [SUCCESS]
1.3G    /var/lib/cassandra/data
[4] ... [SUCCESS]
2.5G    /var/lib/cassandra/data
[5] ... [SUCCESS]
2.7G    /var/lib/cassandra/data
[6] ... [SUCCESS]
1.8G    /var/lib/cassandra/data
[7] ... [SUCCESS]
564M    /var/lib/cassandra/data
[8] ... [SUCCESS]
2.2G    /var/lib/cassandra/data
[9] ... [SUCCESS]
1.9G    /var/lib/cassandra/data

As we can see from the above results, there was a large variation in load distribution across the nodes. The smallest node held roughly 560 MB of data, whilst the largest held about six times that amount (roughly 3.5 GB). The difference between the smallest and largest data load is close to 3.0 GB!!

Cluster using predictive token allocation

I then moved on to setting up the cluster with predictive token allocation. Similar to the previous cluster, I set num_tokens: 4 and endpoint_snitch: GossipingPropertyFileSnitch in the cassandra.yaml on all the nodes. These settings were common to all nodes in this cluster. Nodes were again split across three racks by specifying the rack in the cassandra-rackdc.properties file.

I set the initial_token setting for each of the seed nodes and started the Cassandra process on them one at a time, with one seed node allocated to each rack in the cluster.

The initial keyspace that would be specified in the allocate_tokens_for_keyspace setting was created via cqlsh using the following command:

CREATE KEYSPACE keyspace_with_replication_factor_3 WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': '3'} AND durable_writes = true;

I then set allocate_tokens_for_keyspace: keyspace_with_replication_factor_3 in the cassandra.yaml file for the remaining non-seed nodes and started the Cassandra process on them one at a time. After all nodes were started, the cluster looked like this:

ubuntu@ip-172-31-36-11:~$ nodetool status keyspace_with_replication_factor_3
Datacenter: dc1
|/ State=Normal/Leaving/Joining/Moving
--  Address        Load       Tokens       Owns (effective)  Host ID                               Rack
UN   65.4 KiB   4            32.3%             5ece457c-a3af-4173-b9c8-e937f8b63d3b  rack2
UN  117.45 KiB  4            33.3%             55a94591-ad6a-48a5-b2d7-eee7ea06912b  rack3
UN   70.49 KiB  4            33.3%             93054390-bc83-487c-8940-b99e7b85e5c2  rack3
UN   104.3 KiB  4            35.4%             5d7e200d-ba1a-4297-a423-33737302e4d5  rack1
UN  65.41 KiB  4            31.2%             ecd00ff5-a90a-4d33-b7ab-bdd22e3e50b8  rack1
UN  65.39 KiB  4            33.3%             64802174-885a-4c04-b530-a9b4685b1b96  rack1
UN   65.39 KiB  4            33.3%             0846effa-e4ac-4a19-845e-2162cd2b7680  rack3
UN  104.32 KiB  4            35.4%             5ad47bc0-9bcc-4fc5-b5b0-0a15ad63345f  rack2
UN  65.4 KiB   4            32.3%             4128ca20-b4fa-4173-88b2-aac62539a6d8  rack2

I ran tlp-stress against the cluster using the same command that was used to test the cluster with random token allocation. After running tlp-stress the cluster load distribution for the tlp_stress keyspace looked like this:

ubuntu@ip-172-31-36-11:~$ nodetool status tlp_stress
Datacenter: dc1
|/ State=Normal/Leaving/Joining/Moving
--  Address        Load       Tokens       Owns (effective)  Host ID                               Rack
UN   2.16 GiB   4            32.3%             5ece457c-a3af-4173-b9c8-e937f8b63d3b  rack2
UN  2.32 GiB   4            33.3%             55a94591-ad6a-48a5-b2d7-eee7ea06912b  rack3
UN   2.32 GiB   4            33.3%             93054390-bc83-487c-8940-b99e7b85e5c2  rack3
UN   1.84 GiB   4            35.4%             5d7e200d-ba1a-4297-a423-33737302e4d5  rack1
UN  2.01 GiB   4            31.2%             ecd00ff5-a90a-4d33-b7ab-bdd22e3e50b8  rack1
UN  2.32 GiB   4            33.3%             64802174-885a-4c04-b530-a9b4685b1b96  rack1
UN   2.32 GiB   4            33.3%             0846effa-e4ac-4a19-845e-2162cd2b7680  rack3
UN  1.83 GiB   4            35.4%             5ad47bc0-9bcc-4fc5-b5b0-0a15ad63345f  rack2
UN  2.16 GiB   4            32.3%             4128ca20-b4fa-4173-88b2-aac62539a6d8  rack2

I again verified the data load distribution by checking the disk usage on all nodes using pssh.

ubuntu@ip-172-31-36-11:~$ pssh -ivl ... -h hosts.txt "du -sh /var/lib/cassandra/data"
[1] ... [SUCCESS]
1.9G    /var/lib/cassandra/data
[2] ... [SUCCESS]
2.4G    /var/lib/cassandra/data
[3] ... [SUCCESS]
1.9G    /var/lib/cassandra/data
[4] ... [SUCCESS]
2.4G    /var/lib/cassandra/data
[5] ... [SUCCESS]
2.4G    /var/lib/cassandra/data
[6] ... [SUCCESS]
2.2G    /var/lib/cassandra/data
[7] ... [SUCCESS]
2.1G    /var/lib/cassandra/data
[8] ... [SUCCESS]
2.4G    /var/lib/cassandra/data
[9] ... [SUCCESS]
2.2G    /var/lib/cassandra/data

As we can see from the above results, there was little variation in the load distribution across nodes compared to the cluster that used random token allocation. The smallest data load on a single node was roughly 1.83 GB, and the largest was roughly 2.32 GB. The difference between the smallest and largest loads is roughly 500 MB, which is significantly less than the data size difference seen in the cluster that used random token allocation.


Having a perfectly balanced cluster takes a bit of work and planning. While there are some setup steps and caveats to using the allocate_tokens_for_keyspace setting, predictive token allocation is a must when setting up a new cluster. As we have seen from testing, it allows us to take advantage of a low num_tokens value without having to worry about hot spots developing in the cluster.

Distributed Database Things to Know: Consistent Hashing

As with Riak, which I wrote about in 2013, Cassandra remains one of the core active distributed database projects alive today that provides an effective and reliable consistent hash ring for a clustered distributed database system. A hash function is an algorithm that maps data of variable length to data of a fixed length. Consistent hashing builds on such a function to provide the pattern for mapping keys to the particular nodes responsible for them around the ring in Cassandra.
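The ring idea described above can be sketched in a few lines. This is a toy illustration, assuming MD5-derived tokens and made-up node names (node1 and so on); Cassandra actually uses the Murmur3 partitioner and vnode token ranges, but the key-to-node mapping concept is the same:

```python
import bisect
import hashlib

class HashRing:
    """Toy consistent-hash ring: a key maps to the first node clockwise."""

    def __init__(self, nodes, vnodes=8):
        # Each node gets several virtual nodes (tokens) on the ring,
        # which smooths out the key distribution.
        self.ring = sorted(
            (self._hash(f"{node}-{v}"), node)
            for node in nodes
            for v in range(vnodes)
        )
        self.tokens = [token for token, _ in self.ring]

    @staticmethod
    def _hash(key):
        # Map arbitrary-length input to a fixed-length integer token.
        return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")

    def node_for(self, key):
        # Walk clockwise to the first token >= hash(key), wrapping around.
        i = bisect.bisect(self.tokens, self._hash(key)) % len(self.ring)
        return self.ring[i][1]

ring = HashRing(["node1", "node2", "node3"])
owner = ring.node_for("user:42")          # stable for a given key
assert owner == ring.node_for("user:42")  # same key always lands on the same node
```

Because only the tokens adjacent to a joining or leaving node change owners, most keys stay put when the cluster changes size, which is the property that makes this scheme "consistent."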

ValuStor — a memcached alternative built on Scylla

Derek Ramsey, Software Engineering Manager at Sensaphone, gave an overview of ValuStor at Scylla Summit 2018. Sensaphone is a maker of remote monitoring solutions for the Industrial Internet of Things (IIoT). Their products are designed to watch over your physical plant and equipment — such as HVAC systems, oil and gas infrastructure, livestock facilities, greenhouses, food, beverage and medical cold storage. Yet there is a lot of software behind the hardware of IIoT. ValuStor is an example of ostensible “hardware guys” teaching the software guys a thing or two.

Overview and Origins of ValuStor

Derek began his Scylla Summit talk with an overview: what is ValuStor? It is a NoSQL memory cache and a persistent database utilizing a Scylla back-end, designed for key-value and document store data models. It is implemented as a single header-only database abstraction layer and comprises three components: the Scylla database, the ValuStor client, and the Cassandra driver (which Scylla can use since it is CQL-compliant).

ValuStor is released as free open source under the MIT license. The MIT license is extremely permissive, allowing anyone to “use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the software.” Which means that you can bake it into your own software, services and infrastructure freely and without reservation.

Derek then went back a bit into history to describe how Sensaphone began down this path. What circumstances gave rise to the origins of ValuStor? It all started when Sensaphone launched a single-node MySQL database. As their system scaled to hundreds of simultaneous connections, they added memcached to help offload their disk write load on the backend database. Even so, they kept scaling, and their system was insufficient to handle the growing load. It wasn’t the MySQL database that was the issue. It was memcached that was unable to handle all the requests.

For the short term, they began by batching requests. Yet Sensaphone needed to address the fundamental architectural issues, including the need for redundancy and scalability. There was also a cold cache performance risk (also known as the “thundering herd” problem) if they ever needed to restart.

No one in big data takes the decision to replace infrastructure lightly. Yet when they did the cutover it was surprisingly quick. It took only three days from inception to get Scylla running in production. And, as of last November, Sensaphone had already been in production for a year.

Since its initial implementation, Sensaphone added two additional use cases: managing web sessions, and for a distributed message queuing fan out for a producer/consumer application (a publish-subscribe design pattern akin to an in-house RabbitMQ or ActiveMQ). Derek recommended that anyone interested check out the published Usage Guide on GitHub.

Comparisons to memcached

Derek made a fair but firm assessment of the limitations of memcached for Sensaphone’s environment. First he cited the memcached FAQ itself, where it says it is not recommended for sessions, since if the cache is ever lost you may lock users out of your site. While the PHP manual has a section on sessions in memcached, there is simply no guarantee of survivability of user data.

Second, Derek cited memcached’s minimal security implementation (just optional SASL authentication). Memcached has been widely abused for DDoS amplification attacks in recent years, and while there are ways to minimize the risks, there is no substitute for built-in end-to-end encryption.

Derek listed the basic, fundamental architectural limitations: “it has no encryption, no failover, no replication, and no persistence. Of course it has no persistence — it’s a RAM cache.” That latter point, while usually counted as a performance feature in memcached’s favor, is exactly what leads to the cold cache problem when your server inevitably crashes or has to be restarted.

Sensaphone was resorting to batching to maintain a semblance of performance, whereas batching is an antipattern for Scylla.

ValuStor: How does it work?

ValuStor Client

Derek described the client design, which puts ease-of-use first and foremost. There are only two API functions: get and store. (Deletes are not done directly; instead, setting a time-to-live — TTL — of 1 second on the data is effectively a delete.)
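To make the two-call shape concrete, here is a toy in-memory sketch of a get/store API in which a delete is expressed as a short TTL. This is illustrative Python only; ValuStor's real client is a C++ template library and its actual signatures differ:

```python
import time

class ToyStore:
    """Toy two-call key-value API in the spirit of ValuStor's get/store.
    Expired entries simply behave as if they were deleted."""

    def __init__(self):
        self._data = {}  # key -> (value, expiry timestamp or None)

    def store(self, key, value, ttl=None):
        # ttl is in seconds; None means the entry never expires.
        expiry = time.time() + ttl if ttl is not None else None
        self._data[key] = (value, expiry)

    def get(self, key, default=None):
        value, expiry = self._data.get(key, (default, None))
        if expiry is not None and time.time() >= expiry:
            return default  # expired entry: effectively deleted
        return value

db = ToyStore()
db.store("session:1", {"user": "alice"})       # plain write
db.store("session:1", {"user": "alice"}, ttl=1)  # "delete" via a 1-second TTL
```

Keeping the API to just two verbs means the abstraction layer, not the caller, decides how deletion maps onto the database underneath.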

Because it is implemented as an abstraction layer, you can use your data in a native programming language with native data types: integers, floating point, strings, JSON, blobs, bytes, and UUIDs.

For fault tolerance, ValuStor also added a client-side write queue for a backlog function, and automatic adaptive consistency (more on this later).

Cassandra Driver

The Cassandra driver supports thread safety and is multi-threaded. “Practically speaking, that means you can throw requests at it and it will automatically scale to use more CPU resources as required and you don’t need to do any special locking,” Derek explained. “Unlike memcached… [where] the C driver is not thread-safe.”

ValuStor also offers connection control, so if a Scylla node goes down it will automatically re-establish the connection with a different node. It is also datacenter-aware and will choose your datacenters intelligently.

Scylla Database Server

The Scylla server at the heart of ValuStor offers various architectural advantages. “First and foremost is performance. With the obvious question, ‘How in the world can a persistent database compete with RAM-only caching?’”

Derek then described how Scylla offers its own async and userspace I/O schedulers. Such architectural features can, at times, result in Scylla responsiveness with sub-millisecond latencies.

Scylla also has its own cache and separate memtable, which acts as a sort of cache. “In our use case at Sensaphone we have 100% cache hits all the time. We never have to hit the disk, even though it has one, and since our database has never actually gone down we’ve never actually even had to load it from disk except for maintenance periods.”

In terms of cache warming, Derek provided some advice, “The cold cache penalty is actually less severe for Scylla if you use heat-weighted load balancing because Scylla will automatically warm up your cache for you for the nodes you restart.”

Derek then turned to the issues of security. His criticisms were sobering: “Memcached is what I call ‘vulnerable by design.’” In the latest major issue, “their solution was simply to disable UDP by default rather than fix the problem.”

“By contrast, ValuStor comes with complete TLS support right out of the box.” That includes client authentication and server certificate verification by domain or IP, over-the-wire encryption within and across datacenters, and of course basic password authentication and access control. You can read more about TLS setup in the ValuStor documentation.

“Inevitably, though, the database is going to go offline from the client perspective. Either you have network outage or you’ll have hardware issues on your database server.” Derek then dove down into a key additional feature for fault tolerance: a client-side write queue on the producer side. It buffers up and performs automatic retries. When the database is back up, it clears its backlog. The client keeps the requests serialized, so that data is not written in the wrong order. “Producers keep on producing and your writes are simply delayed. They aren’t lost.”
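The backlog behavior Derek describes can be sketched as a queue that preserves write order across outages. A minimal illustration, with a simulated send function standing in for the real driver call:

```python
from collections import deque

class WriteQueue:
    """Sketch of a client-side write backlog: failed writes are queued
    in order and replayed once the database is reachable again."""

    def __init__(self, send):
        self.send = send        # callable that raises ConnectionError on failure
        self.backlog = deque()

    def write(self, request):
        self.backlog.append(request)  # serialize: never reorder writes
        self.flush()

    def flush(self):
        while self.backlog:
            try:
                self.send(self.backlog[0])
            except ConnectionError:
                return          # database still down; keep the backlog
            self.backlog.popleft()

# Simulate an outage: sends fail while the link is down, then it recovers.
sent, down = [], [True]
def send(request):
    if down[0]:
        raise ConnectionError
    sent.append(request)

q = WriteQueue(send)
q.write("w1"); q.write("w2")   # buffered while the database is down
down[0] = False
q.write("w3")                  # backlog drains in the original order
assert sent == ["w1", "w2", "w3"]
```

The important property is the one Derek calls out: producers keep producing, writes are delayed rather than lost, and replay preserves the original order.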

Derek then noted “Scylla has great redundancy. You can set your custom data replication factor per keyspace. It can be changed on the fly. And the client driver is aware of this and will route your traffic to the nodes that actually have your data.” You can also set different replication factors per datacenter, and the client is also aware of your multi-datacenter topology.

In terms of availability, Derek reminded the audience of the CAP theorem: “it states you can have Consistency, Availability or Partition tolerance. Pick any two.” This leads to the quorum problem (where you require n/2 + 1 nodes to be available), which can cause fragility in multi-datacenter deployments.

To illustrate, Derek showed the following series of graphics:

Quorum Problem: Example 1

Let’s say you have a primary datacenter with three nodes, and a secondary datacenter with two nodes. The outage of any two nodes will not cause a problem in quorum.

Quorum Problem: Example 1 (Primary Datacenter Failure)

However, if your primary datacenter goes offline, your secondary datacenter would not work if it required a strict adherence to quorum being set at n/2 +1.

Quorum Problem: Example 2

In a second example Derek put forth, if you had a primary with three nodes and two secondary sites, you could still keep operating if the primary site went offline, since there would still be four nodes, which meets the n/2 + 1 requirement.

Quorum Problem: Example 2 (Secondary site failures)

However, if both of your secondary datacenters went offline, Derek observed this failure would have the unfortunate effect of bringing your primary datacenter down with it, even if there was nothing wrong with your main cluster.
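The quorum arithmetic behind both examples is easy to check. A small sketch, using the node counts from the two scenarios above (three primary nodes plus one two-node secondary in the first, plus two two-node secondaries in the second):

```python
def quorum(total_replicas):
    # A strict quorum requires n/2 + 1 replicas (integer division).
    return total_replicas // 2 + 1

# Example 1: 3-node primary DC + 2-node secondary DC = 5 nodes, quorum 3.
assert quorum(5) == 3
assert 2 < quorum(5)   # primary DC down: 2 nodes remain, quorum is lost

# Example 2: 3-node primary + two 2-node secondaries = 7 nodes, quorum 4.
assert quorum(7) == 4
assert 4 >= quorum(7)  # primary DC down: 4 nodes remain, quorum holds
assert 3 < quorum(7)   # both secondaries down: 3 nodes left, primary blocked
```

The last line is the counterintuitive failure Derek highlights: a perfectly healthy primary datacenter is taken out of service purely because remote sites vanished.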

“The solution to this problem is automatic adaptive consistency. This is done on the client side.” Since Scylla is an eventually consistent database with tunable consistency, this buys ValuStor “the ability to adaptively downgrade the consistency on a retry of the requests.” This dramatically reduces the likelihood of inconsistency. It also works well with hinted handoffs, which further reduce problems when individual nodes go offline.
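A minimal sketch of the downgrade-on-retry idea, assuming a write function that raises on timeout (the real client works through the Cassandra driver's consistency settings rather than plain strings):

```python
def adaptive_write(write, levels=("ALL", "QUORUM", "ONE")):
    """Sketch of automatic adaptive consistency: retry a failed request
    at progressively weaker consistency levels."""
    for level in levels:
        try:
            return write(level)
        except TimeoutError:
            continue  # not enough replicas answered; downgrade and retry
    raise TimeoutError("write failed even at the weakest consistency level")

# Simulate a cluster that can only satisfy QUORUM right now (one replica down).
def write(level):
    if level == "ALL":
        raise TimeoutError
    return f"acked at {level}"

assert adaptive_write(write) == "acked at QUORUM"
```

Downgrading trades a little consistency for availability only when the stronger level actually fails, which is why it pairs well with hinted handoff cleaning up afterward.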

Derek took the audience on a brief refresher on consistency levels, including ALL, QUORUM, and ONE/ANY. You can learn more about these reading the Scylla documentation and even teach yourself more going through our console demo.

Next, Derek covered the scalability of Scylla. “The Scylla architecture itself is nearly infinitely scalable. Due to the shard-per-core design you can keep throwing new cores and new machines at it and it’s perfectly happy to scale up. With the shard-aware driver, it will automatically route traffic to the appropriate location.” This is contrasted with memcached, which requires manual sharding. “This is not ideal.”
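The shard-per-core routing idea can be sketched as follows. This is a deliberate simplification: a plain modulo stands in for Scylla's actual token-to-shard algorithm, and NR_SHARDS is a made-up core count:

```python
NR_SHARDS = 8  # assumed number of cores (shards) on the target node

def shard_of(token, nr_shards=NR_SHARDS):
    # Simplified shard routing: map the partition token onto a shard.
    # Scylla's real algorithm transforms the token before reducing it to
    # a shard index; this sketch only shows the shard-per-core idea.
    return token % nr_shards

# A shard-aware driver computes the owning shard client-side and keeps a
# connection per shard, so requests avoid cross-core hand-offs on the server.
assert shard_of(12345) == 1
assert all(0 <= shard_of(t) < NR_SHARDS for t in range(1000))
```

Because each shard owns its data exclusively, routing a request to the right core up front removes locking and cross-core messaging from the hot path.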

Using ValuStor

Configuring ValuStor is accomplished in a C++ template class. Once you’ve created the table in the database, you don’t even need to write any other CQL queries.

ValuStor Configuration

This is an example of a minimal configuration. There are more options for SSL.

ValuStor Usage

Here is an example of taking a key and value and storing it, checking whether the operation succeeded, performing automatic retries if it did not, and handling errors if the operation still fails.


ValuStor Comparisons

When designing ValuStor Derek emphasized, “We wanted all of the things on the left-hand side. In evaluating some of the alternatives, none of them really met our needs.”

In particular, Derek took a look at the complexity of Redis. It has dozens of commands. It has master-slave replication. And for Derek’s bottom line, it’s not going to perform as well as Scylla. He cited the recent change in licensing to Commons Clause, which has caused some confusion and consternation in the market. He also pointed out that if you do need the complexity of Redis, you can move to Pedis, which uses the Seastar engine at its heart for better performance.

What about MongoDB or CouchDB?

Derek also made comparisons to MongoDB and CouchDB, since ValuStor has full native JSON support and can also be used as a document store. “It’s not as full-featured, but depending on your needs, it might actually be a good solution.” He cited how Mongo also recently went through a widely-discussed licensing change (which we covered in a detailed blog).

Derek Ramsey at Scylla Summit 2018

Derek Ramsey at Scylla Summit 2018

What’s Next for ValuStor

Derek finished by outlining the feature roadmap for ValuStor.

  • SWIG bindings will allow it to connect to a wide variety of languages
  • Improvements to the command line will allow scripts to use ValuStor
  • Expose underlying Futures, to process multiple requests from a single thread for better performance
  • A non-template configuration option

To learn more, you can watch Derek’s presentation below, check out Derek’s slides, or peruse the ValuStor repository on GitHub.

The post ValuStor — a memcached alternative built on Scylla appeared first on ScyllaDB.

Testing Cassandra compatible APIs

In this quick blog post, I’m going to assess how the databases that advertise themselves as “Cassandra API-compatible” fare in the compatibility department.

But that is all I will do: API testing only, and not exhaustive testing, just the APIs I see used most often. Based on this, you can start building an informed decision on whether or not to change databases.

The contenders:

  • Apache Cassandra 4.0
    • Installation: Build from Source
  • Yugabyte –
    • “YCQL is a transactional flexible-schema API that is compatible with the Cassandra Query Language (CQL).”
    • Installation: Docker
  • ScyllaDB –
    • “Apache Cassandra’s wire protocol, a rich polyglot of drivers, and integration with Spark, Presto, and Graph tools make for resource-efficient and performance-effective coding.”
    • Installation: Docker
  • Azure Cosmos –
    • “Azure Cosmos DB provides native support for NoSQL and OSS APIs including MongoDB, Cassandra, Gremlin and SQL”
    • Installation: Azure Portal Wizard

The Docker installations used the containers as they are provided. Cosmos DB used all the defaults provided by the wizard interface.

The CQL script used for the test exercised the operations listed in the next section.

What I’m not doing in this blog post: performance testing, feature comparison, and everything else that is not testing the API. Those might all be more or less important for other use cases, but they are not the scope of this blog.

What was tested

In this test, the following CQL APIs were tested:

  1. Keyspace Creation
  2. Table Creation
  3. Adding a Column to a table (Alter table)
  4. Data Insert
  5. Data Insert with TTL (Time-to-live)
  6. Data Insert with LWT (Lightweight Transactions)
  7. Select Data
  8. Select data with a full table scan (ALLOW FILTERING)
  9. Creating a Secondary Index (2I)

Cassandra 4.0

  • All statements worked (as expected)


LWT Not supported


2i Not supported


LWT Not supported


2i Not supported

Results Table

So, with these results, which are not a full comparison (I have left out other features offered by these systems), you can decide whether a given database is compatible enough for you.

Scylla Enterprise Release 2018.1.10

Scylla Enterprise Release

The Scylla team is pleased to announce the release of Scylla Enterprise 2018.1.10, a production-ready Scylla Enterprise minor release. Scylla Enterprise 2018.1.10 is a bug fix release for the 2018.1 branch, the latest stable branch of Scylla Enterprise. Scylla Enterprise customers are encouraged to upgrade to Scylla Enterprise 2018.1.10 in coordination with the Scylla support team.

The major fix in this release is avoiding reactor stalls while merging MemTable into the cache. This improvement was gradually added to the open source release, and is now backported to Scylla Enterprise. Related open source issues include: #2012, #2576, #2715, #3053, #3093, #3139, #3186, #3215, #3402, #3526, #3532, #3608, #4030

Other issues fixed in this release are listed below, with open source references, if present:

  • Counters: Scylla rejects SSTables that contain counters that were created by Cassandra 2.0 and earlier. Due to #4206, Scylla mistakenly rejected some SSTables that were created by Cassandra 2.1 as well.
  • TLS: Scylla now disables TLS1.0 by default and forces minimum 128 bit ciphers #4010. More on encryption in transit (client to server) here
  • Core: In very rare cases, the commit log replay fails. Commit log replay is used after a node was unexpectedly restarted #4187
  • In rare cases, and under heavy load (for example, during repair), Scylla Enterprise might run out of memory and exit with an error such as “compaction_manager - compaction failed: std::bad_alloc (std::bad_alloc)”. #3717, #3716
  • In some rare cases, Scylla exited during service stop #4107
  • scylla_setup: An option to select server NIC was added #3658
  • Scylla Enterprise 2018.1 failed to run the scylla_setup script, with or without the `--nic` flag, on Ubuntu 16.04 when the NIC is not eth0 (the `--nic` flag was ignored when passed)
  • CQL: Selecting from a partition with no clustering restrictions (single partition scan) might have resulted in a temporary loss of writes #3608


The post Scylla Enterprise Release 2018.1.10 appeared first on ScyllaDB.

Reaper 1.4 Released

Cassandra Reaper 1.4 was just released with security features that now expand to the whole REST API.

Security Improvements

Reaper 1.2.0 integrated Apache Shiro to provide authentication capabilities in the UI. The REST API remained fully open, though, which was a security concern. With Reaper 1.4.0, the REST API is now fully secured and managed by the very same Shiro configuration as the Web UI. JSON Web Tokens (JWT) were introduced to avoid sending credentials over the wire too often. In addition, spreaper, Reaper’s command line tool, has been updated to provide a login operation and to manipulate JWTs.
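To show why tokens reduce how often credentials cross the wire, here is a generic HS256 JWT sign/verify sketch using only the Python standard library. It illustrates the JWT mechanism in general, not Reaper's actual Shiro-based implementation, and the secret and claims are made up:

```python
import base64
import hashlib
import hmac
import json

def b64url(data: bytes) -> str:
    # JWTs use unpadded base64url encoding for each segment.
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()

def sign_jwt(claims: dict, secret: bytes) -> str:
    # A JWT is header.payload.signature, each part base64url-encoded.
    header = b64url(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    payload = b64url(json.dumps(claims).encode())
    mac = hmac.new(secret, f"{header}.{payload}".encode(), hashlib.sha256)
    return f"{header}.{payload}.{b64url(mac.digest())}"

def verify_jwt(token: str, secret: bytes) -> bool:
    header, payload, sig = token.split(".")
    mac = hmac.new(secret, f"{header}.{payload}".encode(), hashlib.sha256)
    return hmac.compare_digest(sig, b64url(mac.digest()))

secret = b"example-shared-secret"
token = sign_jwt({"sub": "admin"}, secret)
assert verify_jwt(token, secret)        # login sends credentials once;
assert not verify_jwt(token, b"wrong")  # subsequent calls present the token
```

Once the client holds a signed token, every later API call can be authenticated by verifying the signature alone, with no password in the request.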

The documentation was updated with all the necessary information to handle authentication in Reaper and even some samples on how to connect LDAP directories through Shiro.

Note that Reaper doesn’t support authorization features and it is impossible to create users with different rights.
Authentication is now enabled by default for all new installs of Reaper.

Configurable JMX port per cluster

One of the annoying things with Reaper was that it was impossible to use a port other than the default, 7199, for JMX communications.
You could define specific ports per IP, but that was really for testing purposes with CCM.
That long-overdue feature has now landed in 1.4.0, and a custom JMX port can be passed when declaring a cluster in Reaper:

Configurable JMX port

TWCS/DTCS tables blacklisting

In general, it is best to avoid repairing DTCS tables, as it can generate lots of small SSTables that could stay out of the compaction window and generate performance problems. We tend to recommend not repairing TWCS tables either, to avoid replicating timestamp overlaps between nodes, which can delay the deletion of fully expired SSTables.

When using the auto-scheduler though, it is impossible to specify blacklists, as all keyspaces and all tables get automatically scheduled by Reaper.

Based on the initial PR of Dennis Kline that was then re-worked by our very own Mick, a new configuration setting allows automatically blacklisting of TWCS and DTCS tables for all repairs:

blacklistTwcsTables: false

When set to true, Reaper will discover the compaction strategy for all tables in the keyspace and remove any table with either DTCS or TWCS, unless they are explicitly passed in the list of tables to repair.
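The filtering rule can be sketched as follows; the schema dictionary, table names, and function name are made up for illustration:

```python
def repair_tables(schema, blacklist_twcs=True, explicit_tables=()):
    """Sketch of the blacklisting rule: drop TWCS/DTCS tables from the
    repair set unless they were explicitly requested."""
    skipped_strategies = {
        "TimeWindowCompactionStrategy",   # TWCS
        "DateTieredCompactionStrategy",   # DTCS
    }
    return [
        table
        for table, strategy in schema.items()
        if table in explicit_tables
        or not (blacklist_twcs and strategy in skipped_strategies)
    ]

# Hypothetical keyspace schema: table name -> compaction strategy.
schema = {
    "events": "TimeWindowCompactionStrategy",
    "users": "SizeTieredCompactionStrategy",
    "metrics": "DateTieredCompactionStrategy",
}
assert repair_tables(schema) == ["users"]
assert repair_tables(schema, explicit_tables={"events"}) == ["events", "users"]
```

Explicitly listing a table wins over the blacklist, matching the behavior described above for the auto-scheduler.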

Web UI improvements

The Web UI used to report decommissioned nodes that still appeared in the Gossip state of the cluster with a Left state. This has been fixed, and such nodes are no longer displayed.
Another bug was the number of tokens reported in the node detail panel, which was nowhere near matching reality. We now display the correct number of tokens, and clicking on this number will open a popup containing the list of tokens the node is responsible for:


Work in progress

Work in progress will introduce the Sidecar Mode, which will collocate a Reaper instance with each Cassandra node and support clusters where JMX access is restricted to localhost.
This mode is being actively worked on currently and the branch already has working repairs.
We’re now refactoring the code and porting other features to this mode like snapshots and metric collection.
This mode will also allow for adding new features and permit Reaper to better scale with the clusters it manages.

Upgrade to Reaper 1.4.0

The upgrade to 1.4 is recommended for all Reaper users. The binaries are available from yum, apt-get, Maven Central, Docker Hub, and are also downloadable as tarball packages. Remember to backup your database before starting the upgrade.

All instructions to download, install, configure, and use Reaper 1.4 are available on the Reaper website.

How to build your very own Cassandra 4.0 release

Over the last few months, I have been seeing references to Cassandra 4.0 and some of its new features. When that happens with a technology I am interested in, I go looking for the preview releases to download and test. Unfortunately, so far, there are no such releases. But, I am still interested, so I’ve found it necessary to build my own Cassandra 4.0 release. This is in my humble opinion not the most desirable way to do things since there is no Cassandra 4.0 branch yet. Instead, the 4.0 code is on the trunk. So if you do two builds a commit or two apart, and there are typically at least three or four commits a week right now, you get a slightly different build. It is, in essence, a moving target.

All that said and done, I decided if I could do it, then the least I could do is write about how to do it and let everyone who wants to try it learn how to avoid a couple of dumb things I did when I first tried it.

Building your very own Cassandra 4.0 release is actually pretty easy. It consists of five steps:

  1. Make sure you have your prerequisites
    1. Java SDK 1.8 or Java 11, OpenJDK or Oracle
    2. Ant 1.8
    3. Git CLI client
    4. Python >= 2.7 and < 3.0
  2. Download the GIT repository
    1. git clone
  3. Build your new Cassandra release
    1. cd cassandra
    2. ant
  4. Run Cassandra
    1. cd ./bin
    2. ./cassandra
  5. Have fun
    1. ./nodetool status
    2. ./cqlsh

I will discuss each step in a little bit more detail:

Step 1) Verify, and if necessary, install your prerequisites

For Java, you can confirm the JDK presence by typing in:

john@Lenny:~$ javac -version
javac 1.8.0_191

For ant:

john@Lenny:~$ ant -version
Apache Ant(TM) version 1.9.6 compiled on July 20 2018

For git:

john@Lenny:~$ git --version
git version 2.7.4

For Python:

john@Lenny:~$ python --version
Python 2.7.12

If you have all of the right versions, you are ready for the next step. If not, you will need to install the required software which I am not going to go into here.

Step 2) Clone the repository

Verify you do not already have an older copy of the repository:

john@Lenny:~$ ls -l cassandra
ls: cannot access 'cassandra': No such file or directory

If you found a cassandra directory, you will want to delete it, move it, or switch your current directory to somewhere else. Otherwise:

john@Lenny:~$ git clone
Cloning into 'cassandra'...
remote: Counting objects: 316165, done.
remote: Compressing objects: 100% (51450/51450), done.
remote: Total 316165 (delta 192838), reused 311524 (delta 189005)
Receiving objects: 100% (316165/316165), 157.78 MiB | 2.72 MiB/s, done.
Resolving deltas: 100% (192838/192838), done.
Checking connectivity... done.
Checking out files: 100% (3576/3576), done.

john@Lenny:~$ du -sh *
294M cassandra

At this point, you have used up 294 MB on your host and you have an honest-for-real git repo clone on your host – in my case, a Lenovo laptop running Windows 10 Linux subsystem.

And your repository looks something like this:

  john@Lenny:~$ ls -l cassandra
  total 668
  drwxrwxrwx 1 john john    512 Feb  6 15:54 bin
  -rw-rw-rw- 1 john john    260 Feb  6 15:54
  -rw-rw-rw- 1 john john 101433 Feb  6 15:54 build.xml
  -rw-rw-rw- 1 john john   4832 Feb  6 15:54 CASSANDRA-14092.txt
  -rw-rw-rw- 1 john john 390460 Feb  6 15:54 CHANGES.txt
  drwxrwxrwx 1 john john    512 Feb  6 15:54 conf
  -rw-rw-rw- 1 john john   1169 Feb  6 15:54
  drwxrwxrwx 1 john john    512 Feb  6 15:54 debian
  drwxrwxrwx 1 john john    512 Feb  6 15:54 doc
  -rw-rw-rw- 1 john john   5895 Feb  6 15:54
  drwxrwxrwx 1 john john    512 Feb  6 15:54 examples
  drwxrwxrwx 1 john john    512 Feb  6 15:54 ide
  drwxrwxrwx 1 john john    512 Feb  6 15:54 lib
  -rw-rw-rw- 1 john john  11609 Feb  6 15:54 LICENSE.txt
  -rw-rw-rw- 1 john john 123614 Feb  6 15:54 NEWS.txt
  -rw-rw-rw- 1 john john   2600 Feb  6 15:54 NOTICE.txt
  drwxrwxrwx 1 john john    512 Feb  6 15:54 pylib
  -rw-rw-rw- 1 john john   3723 Feb  6 15:54 README.asc
  drwxrwxrwx 1 john john    512 Feb  6 15:54 redhat
  drwxrwxrwx 1 john john    512 Feb  6 15:54 src
  drwxrwxrwx 1 john john    512 Feb  6 15:54 test
  -rw-rw-rw- 1 john john  17215 Feb  6 15:54
  drwxrwxrwx 1 john john    512 Feb  6 15:54 tools

Step 3) Build your new Cassandra 4.0 release

Remember what I said in the beginning? There is no branch for Cassandra 4.0 at this point, so building from the trunk is quite simple:

john@Lenny:~$ cd cassandra
john@Lenny:~/cassandra$ ant
Buildfile: /home/john/cassandra/build.xml

Total time: 1 minute 4 seconds

That went quickly enough. Let’s take a look and see how much larger the directory has gotten:

john@Lenny:~$ du -sh *
375M cassandra

Our directory grew by 81MB pretty much all in the new build directory which now has 145 new files including ./build/apache-cassandra-4.0-SNAPSHOT.jar. I am liking that version 4.0 right in the middle of the filename.

Step 4) Start Cassandra up. This one is easy if you do the sensible thing

john@Lenny:~/cassandra$ cd ..
john@Lenny:~$ cd cassandra/bin
john@Lenny:~/cassandra/bin$ ./cassandra
john@Lenny:~/cassandra/bin$ CompilerOracle: dontinline org/apache/cassandra/db/Columns$Serializer.deserializeLargeSubset (Lorg/apache/cassandra/io/util/DataInputPlus;Lorg/apache/cassandra/db/Columns;I)Lorg/apache/cassandra/db/Columns;
CompilerOracle: dontinline org/apache/cassandra/db/Columns$Serializer.serializeLargeSubset (Ljava/util/Collection;ILorg/apache/cassandra/db/Columns;ILorg/apache/cassandra/io/util/DataOutputPlus;)V
CompilerOracle: dontinline org/apache/cassandra/db/Columns$Serializer.serializeLargeSubsetSize (Ljava/util/Collection;ILorg/apache/cassandra/db/Columns;I)I

INFO [MigrationStage:1] 2019-02-06 21:26:26,222 - Initializing system_auth.role_members
INFO [MigrationStage:1] 2019-02-06 21:26:26,234 - Initializing system_auth.role_permissions
INFO [MigrationStage:1] 2019-02-06 21:26:26,244 - Initializing system_auth.roles

We seem to be up and running. It’s time to try some things out:

Step 5) Have fun

We will start out making sure we are up and running by using nodetool to connect and display a cluster status. Then we will go into the CQL shell to see something new. It is important to note that since you are likely to have nodetool and cqlsh already installed on your host, you need to use the ./ in front of your commands to ensure you are using the 4.0 version. I have learned the hard way that forgetting the ./ can result in some very real confusion.

  john@Lenny:~/cassandra/bin$ ./nodetool status
  Datacenter: datacenter1
  |/ State=Normal/Leaving/Joining/Moving
  --  Address    Load       Tokens       Owns (effective)  Host ID                               Rack
  UN  115.11 KiB  256          100.0%            f875525b-3b78-49b4-a9e1-2ab0cf46b881  rack1
  john@Lenny:~/cassandra/bin$ ./cqlsh
  Connected to Test Cluster at
  [cqlsh 5.0.1 | Cassandra 4.0-SNAPSHOT | CQL spec 3.4.5 | Native protocol v4]
  Use HELP for help.
  cqlsh> desc keyspaces;

  system_traces  system_auth  system_distributed     system_views
  system_schema  system       system_virtual_schema


We got a nice cluster with one node, and we see the usual built-in keyspaces. Well um… not exactly. We see two new keyspaces: system_virtual_schema and system_views. Those look very interesting.

In my next blog, I’ll be talking more about Cassandra’s new virtual table facility and how very useful it is going to be someday soon. I hope.

What Apache Cassandra™ Developers Want


One of the fun things we were able to do at conferences we attended in 2018 was to use some “conversation starters”. We asked our booth visitors at Strata NYC and AWS re:Invent to contribute to a community comment block (aka sticky notes on our booth wall), with four prompts on the Post-its.

Welcome to the Context Economy

I like the analysts at the 451 Research Group. They’re smart, experienced, no-nonsense, and have a good track record of identifying meaningful movements in the technology space. So when they released their report on 2019 IT trends, I made sure to get my copy and thoroughly go through it.

Some of their identified trends will be instantly recognized by anyone keeping up with our industry—data becoming increasingly important among successful organizations, a ‘no-trust’ security model being implemented by smart companies, and the increasing tilt toward hybrid and multi-cloud deployments for modern applications and data platforms.

No surprises there.

But one of their trends held my attention more than the others: “The Context Economy Begins to Flourish”. I found their term ‘Context Economy’ particularly insightful and the content on why it’s important to be spot on.

What is the ‘Context Economy’?

451 Group defines the context economy as a paradigm where data without context will be mostly worthless to organizations; only contextual data (data and its associated relationships with other data elements) will deliver the lifeblood needed by today’s digital applications whose success hinges on providing the personalized experience needed to attract and retain customers.

451 Group’s research shows that those leading the pack in their respective markets are the ones (naturally) who prioritize contextual experiences:

What are the Requirements for the Context Economy?

No doubt some of you are thinking, “Well, duh… Of course data relationships and context matters—isn’t this what relational databases have been doing for 40 years now?”

Actually, no. This is exemplified by 451 Group saying that only now will the context economy begin to flourish.

RDBMSs do indeed store and enforce relationships between data elements; however, they are crippled in supporting today’s modern cloud-style applications in several ways.

First, RDBMSs cannot handle the multi-workload requirements. Data context demands a blending and blurring of transactional, analytical, search, in-memory, and graph operations in the same database, with little to no resource or data contention. For example, today’s fraud applications must digest incoming transactions, analyze each request against historical buying behavior, map that to other highly connected data to determine whether to approve or deny the transaction, and send the response back to the requesting application in the blink of an eye.

Next, data context requires multi-model support. Today’s applications utilize microservices architectures to componentize the app’s various functions. Each component has its own best-of-breed data model requirements, which then must be stored and accessed with other models in real time. Further, today’s data relationships differ from those in past generations in that they are highly connected, meaning they outstrip the ability of an RDBMS to efficiently store and query the connected data. A graph database, however, is built for this very purpose.

Lastly, data context needs multi-location capabilities for all supported data models. And make no mistake: this requirement goes beyond the master/slave or multi-master architectures of all RDBMSs and other NoSQL vendors, which are painfully hard to administer and woefully inadequate at supporting continuous availability, both read and write functions at any location, hybrid and multi-cloud support, and high-speed data synchronization.

The Context Database for the Context Economy

At DataStax, our customers are building contextual applications on DataStax Enterprise (DSE) because it natively supports multi-workload, multi-model, and multi-location operation. Multi-workload is handled via DSE Analytics, DSE Search, in-memory functionality, and distributed graph capabilities, all of which run in the same cluster/database.

DSE also supports all key data models, including tabular, key-value, document, and graph, which can all exist in the same cluster/database. Finally, DSE and its underlying foundation of open source Apache Cassandra™ are the gold standard when it comes to globally distributing and synchronizing data everywhere as well as transparently running in hybrid and multi-cloud fashion.

The 451 Group’s report on 2019 trends, and in particular their point on the context economy, dovetails well with how we see DataStax customers beating out their competition. For more information on DSE, see our Resources website page, get registered for our upcoming Accelerate Conference, and also be sure to download the latest version of DSE to try in your environment.

Anomalia Machina 7 – Kubernetes Cluster Creation and Application Deployment

Kubernetes – Greek: κυβερνήτης = Helmsman

If you are a Greek hero about to embark on an epic aquatic quest (encountering one-eyed, rock-throwing monsters, unpleasant weather, a detour to the underworld, tempting sirens, angry gods, etc.), then having a trusty helmsman is mandatory. (Even though the helmsman survived the Cyclops, like all of Odysseus’s companions he eventually came to a sticky end.)



A few months ago we decided to try Kubernetes to see how it would help with our quest to scale our Anomalia Machina application (a massively scalable Anomaly Detection application using Apache Kafka and Cassandra). This blog is a recap of our initial experiences (reported in this webinar) creating a Kubernetes cluster on AWS and deploying the application.

Kubernetes is an automation system for the management, scaling and deployment of containerized applications. Why is Kubernetes interesting?

  • It’s Open Source, and agnostic to both cloud provider (multi-cloud) and programming language (polyglot programming)
  • You can develop and test code locally, then deploy at scale
  • It helps with resource management, you can deploy an application to Kubernetes and it manages application scaling
  • More powerful frameworks built on the Kubernetes APIs are becoming available (e.g. Istio)

Kubernetes and AWS EKS

Our goal was to try Kubernetes out on AWS using the newish AWS EKS, the Amazon Elastic Container Service for Kubernetes. Kubernetes is a sophisticated four-year-old technology, and it isn’t trivial to learn or deploy on a production cloud platform. To understand, set up, and use it, you need familiarity with a mixture of Docker, AWS, security, networking, Linux, systems administration, grid computing, resource management, and performance engineering! To get up to speed faster, I accepted an invitation to attend an AWS Get-it-done-athon in Sydney for a few days last year to try out AWS EKS. Here’s the documentation I used: the EKS user guide, the step by step guide, and a useful EKS workshop which includes some extensions.


At a high level Kubernetes has a Master (controller) and multiple (worker) Nodes.


Kubernetes has a lot of layers, a bit like Russian nesting dolls. The smallest doll is the code, the Java classes. Then you have to package it as a Java JAR file, then turn the JAR into a Docker image and upload it to the Docker Hub. The Kubernetes layers include nodes (EC2 instances), pods which run on nodes, and deployments which have 1 or more pods. A Kubernetes Deployment (one of many controller types) declaratively tells the Control Plane what state (and how many) Pods there should be. Finally, there’s the control plane, the master, which runs everything. This is the focus of AWS EKS.
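The “a Deployment declares desired Pods” idea is easiest to see in a manifest. A minimal, illustrative Deployment for the anomaly detection pipeline might look like this (the image name and labels are made up for the example):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: anomalia-machina
spec:
  replicas: 3                 # desired state: keep 3 pods running
  selector:
    matchLabels:
      app: anomalia-machina
  template:
    metadata:
      labels:
        app: anomalia-machina
    spec:
      containers:
        - name: pipeline
          image: myrepo/anomalia-machina:latest   # hypothetical image name
```

You `kubectl apply` this and the control plane continuously works to make reality match the declared state.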


This is a high level overview of AWS EKS. It looks simple enough, but there are a few complications to solve before you can get your application running.

Anomalia Machina 7 - High level overview of AWS EKS



“Toto, I’ve a feeling we’re not in Kansas anymore”

Each AWS EKS runs in an AWS region, and the worker nodes must run in the same region. When I created EKS last year there were only 3 supported regions. There are now lots more EKS regions. Unfortunately my Kafka and Cassandra test clusters were in Ohio, which was not one of the EKS regions at the time. Virginia was the nearest EKS region. Ohio is only 500 km away, so I hoped the application would work ok (at least for testing) across regions. However, running EKS in a different region to other parts of your application may be undesirable due to increased complexity, latency, and extra data charges. It also introduced some unnecessary complications for this experiment.

Kubernetes Snakes and Ladders

It turns out that there are lots of steps to get AWS EKS running, and they didn’t all work smoothly (the “snakes”), particularly with the cross region complications. Let’s roll the dice and test our luck.


Anomalia Machina 7 - Snakes and Ladder

First we have to prepare things:

1 – Create Amazon EKS service role (5 steps)

2 – Create your Amazon EKS Cluster VPC (12 steps)

3 – Install and Configure kubectl for Amazon EKS (5 steps)

4 – Install aws-iam-authenticator for Amazon EKS (10 steps)

5 – Download and Install the Latest AWS CLI (6 steps)


Many (38) steps and a few hours later we were ready to create our AWS EKS cluster.


6 – Create Your Amazon EKS Cluster (12 steps)

7 – Configure kubectl for Amazon EKS (3 steps)


Next, more steps to launch and configure worker nodes:


8 – Add key pairs to EKS region (extra steps due to multiple regions, down a snake)

9 – Launch and Configure Amazon EKS Worker Nodes (29 steps)


We finally got 3 worker nodes started.


10 – Deploying the Kubernetes Web GUI dashboard is a good idea if you want to see what’s going on (it runs on worker nodes). But then down another snake as the documentation wasn’t 100% clear (12 steps).


11 – Deploy the application (5 steps).  Does it work?

12 – Not yet… down another snake. Configure, debug, test application (30 steps)


After 100+ steps (and 2 days) we could finally ramp up our application on the Kubernetes cluster and test it out. It worked (briefly). The completed “game”, with steps (1-12) numbered, looked like this:

Anomalia Machina 7 - Snakes and ladder completed game with Kubernetes

Here’s the Kubernetes GUI dashboard with the 3 worker nodes and the anomaly detection application pipeline deployed. You can kill pods, and EKS magically replaces them. You can ask for more pods and EKS creates more – just don’t ask for too many!

Anomalia Machina 7 - Kubernetes GUI Dashboard

(Mis)understanding Kubernetes Scaling

Anomalia Machina 7 - (Mis)understanding Kubernetes Scaling


It’s useful to understand the basics of AWS EKS/Kubernetes scaling (I initially didn’t). When you create a Kubernetes deployment you specify how many replicas (pods) you want. Pods are designed to run a single instance of an application and are the basic unit of concurrency in Kubernetes. They are also designed to be ephemeral: they can be started, killed, and restarted on different nodes. Once running, you can request more or fewer replicas. I started out with 1 replica, which worked ok.


I then wanted to see what happened if I requested more replicas. I had 3 worker nodes (EC2 instances), so was curious to see how many pods Kubernetes would allow: 1, 2, 3, or more? (Note that some of the worker node resources are already used by Kubernetes for the GUI and monitoring.) I soon found out that by default Kubernetes doesn’t limit the number of pods at all! That is, it assumes pods use no resources, and allows more pods to be created even when all the worker nodes have run out of spare resources. As a result the worker nodes were soon overloaded, and the whole system became unresponsive: the Kubernetes GUI froze and I couldn’t delete pods. To try to get things back to a working state I manually stopped the EC2 worker instances from the AWS console.
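The standard way to avoid this (which I learned the hard way) is to declare resource requests and limits in the pod spec, so the scheduler stops placing pods on nodes that are already full. An illustrative pod template fragment (the values and image name are made up):

```yaml
    spec:
      containers:
        - name: pipeline
          image: myrepo/anomalia-machina:latest   # hypothetical image name
          resources:
            requests:           # the scheduler reserves this much per pod
              cpu: 500m
              memory: 512Mi
            limits:             # a pod is throttled or evicted beyond this
              cpu: "1"
              memory: 1Gi
```

With requests set, asking for too many replicas leaves the surplus pods in a Pending state instead of saturating the nodes.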


Did this work? No. What I hadn’t realised was that the worker nodes are created as part of an auto scaling group.  

Anomalia Machina 7 - Auto Scaling Groups

To change the number of instances (and therefore nodes) you need to change the auto scaling group minimum and maximum sizes. The number of instances will scale elastically within these limits. After I manually stopped the worker instances, I later found that (1) they had all started again, and (2) they were all saturated again (as Kubernetes was still trying to achieve the desired number of replicas for the deployment).


Therefore, another useful thing to understand about AWS EKS and Kubernetes is that Kubernetes resources are controlled declaratively by the higher layers, but AWS EC2 instances are controlled by AWS. If you need to kill off all the pods you have to change the number of desired replicas to 0 in the Kubernetes deployment. But if you need more worker nodes you have to change the max/desired sizes in the workers’ auto scaling group, as Kubernetes cannot override these values. By default, Kubernetes does not enable elastic node auto scaling using the workers’ auto scaling group; you need the Kubernetes Cluster Autoscaler configured for this to work.

Further Resources

Here are a couple of other AWS EKS reviews:


Next blog(s) I’ll provide an update on using Kubernetes to scale the production version of the Anomalia Machina Application (with Prometheus and OpenTracing) and the results at scale.

The post Anomalia Machina 7 – Kubernetes Cluster Creation and Application Deployment appeared first on Instaclustr.

Powering A Next Generation Streaming TV Experience

Sling TV is an over-the-top (OTT) live streaming content platform that instantly delivers live TV and on-demand entertainment via the internet and customer owned and managed devices. This includes a variety of smart televisions, tablets, game consoles, computers, smartphones and streaming devices (16 platforms total). It is currently the number one live TV streaming service with approximately 2.3MM customers.

Outperforming the Competition

With so many options available to potential “cord cutters” (i.e., people seeking a wireless- or internet-based service over a paid TV subscription service), it is important to provide a first-class experience that makes your product stand out in a market that is becoming more and more saturated.

As such, it is critical for Sling TV to provide a highly resilient service that is personalized to each user and scales on demand to keep up with our expanding customer base and changes on the internet. This includes the need to be highly available and resilient while having the ability to centralize business logic across our sixteen platforms to deliver a common experience to our customers.

We also want to reduce the time to market for features in a continuous deployment model and ultimately enable a deployment unit of “datacenter” allowing for another instance of the Sling TV backend to be built on demand as needed in a hybrid cloud environment.

On the backend, we needed a common data store for our core customer and personalized content that is available in all data centers serving our middleware stack. The solution would need to provide media distribution capabilities that include authentication, program catalogs, and feature metadata storage. We had a big list of needs to fulfill, and we wanted a proven solution that would support our next-gen architecture goals.

Why Sling Chose DataStax and DSE Built on Apache Cassandra

We chose DataStax Enterprise (DSE) for three main reasons:

  1. Not only was it important for us to have a proven solution to support our goals, we were also looking for a partner—as opposed to just a vendor—to help us achieve the success we had envisioned.
  2. Having a database designed for hybrid cloud infrastructure was a key part of our strategy, as we needed to be able to replicate data across remote data centers.
  3. DSE would give us a virtually unlimited ability to scale horizontally to keep up with future growth.

DSE has become a key part of our hybrid cloud strategy and has enabled our software to run in private and public clouds with close to the same tooling. With DSE, we are now able to replicate data across the country in less than two seconds, which is a big win for us. We look forward to leveraging DSE to power the future growth of Sling TV.

eBook: The Way of Customer 360


Introducing the Humans of DataStax

At DataStax, we pride ourselves on hiring the most innovative, collaborative, and hard-working people we can find, drawing on our highly distributed and global workforce to propel the company forward.

With more than 65% of our employees being 100% distributed, we have the advantage of being able to pick from a global field of qualified candidates. This also gives us a diverse and interesting team, which we will now be featuring in a series of videos called “Humans of DataStax.”

The first two videos will feature HR Specialist Lynsey ReysNickel and Developer Advocate Amanda Moran. Lynsey, who came to DataStax on a returnship program for people seeking to re-enter the workforce after a long absence, uses her skills from academia to develop and implement global HR programs and processes. Amanda, who recently joined DataStax, works as an intermediary between Cassandra and DataStax developers.

Amanda and Lynsey illustrate our drive to hire people who demonstrate the core behaviors of accountability, visibility, respect, and growth mindset.

Said Amanda: “I’m right there in the middle, helping out our customers and our engineering team. It’s all about creating the best product that we possibly can to make our developers’ lives and our users’ lives easier.” Amanda also spends her time ensuring that DataStax development projects matter to both our customers and our employees.

For Lynsey, working at DataStax was a chance to get back to work with a company whose values align with hers.

“Leadership has been very supportive of my return to the workplace in a returnship position, and I can see that through the demonstration of their core values and core behaviors,” she said. “I have the same core values.”

You can find the full videos here.

Engineering to Improve Marketing Effectiveness (Part 3) — Scaling Paid Media campaigns


This is the third blog of the series on Marketing Technology at Netflix. This blog focuses on the marketing tech systems that are responsible for campaign setup and delivery of our paid media campaigns. The first blog focused on solving for creative development and localization at scale. The second blog focused on scaling advertising at Netflix through easier ad assembly and trafficking.

Netflix’s Marketing team is interested in raising awareness about our content and our brand, and getting new consumers excited about signing up for our service. We use a combination of paid media, owned media (e.g. via the Netflix twitter handle), and earned media (publicity) to reach people all over the world. We use a mix of art and science to decide which titles to promote in which market.

The Netflix Marketing Tech team focuses on automation and experimentation to help our marketing team save time and money. These marketing campaigns help raise brand awareness on services/channels outside of the Netflix product itself (e.g. social media channels).

What are we solving for?

The marketing tech team’s goal is to build scalable systems which enable marketers at Netflix to efficiently manage, measure, experiment and learn about tactics that help unlock the effectiveness of our paid media efforts.

Improving Incremental Marketing Effectiveness: Netflix wants to use paid media to drive incremental value to the business. For instance, if we are interested in using a campaign to get people to sign up, we only want to measure the success of our campaign on users we caused to sign up, which is a subset of total signups. We measure this through ongoing experiments and control/holdout groups. This lets us understand the difference between all signups (correlational) and incremental signups (causal). For more detailed information on this topic, please refer to some of the work done in this space — incrementality bidding and attribution and measuring ad effectiveness.
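The correlational-vs-causal distinction above boils down to simple arithmetic once a holdout group exists. A minimal sketch (all numbers are illustrative, not Netflix data):

```python
# Estimate incremental (causal) signups from a holdout experiment.

def incremental_signups(treated_users, treated_signups,
                        holdout_users, holdout_signups):
    """Incremental signups = observed signups in the treated group
    minus the signups we would have expected anyway, estimated from
    the holdout (control) group's baseline signup rate."""
    baseline_rate = holdout_signups / holdout_users
    expected_anyway = baseline_rate * treated_users
    return treated_signups - expected_anyway

# 1M users saw ads and 30k signed up; a 1M-user holdout saw no ads
# and 20k signed up anyway.
lift = incremental_signups(1_000_000, 30_000, 1_000_000, 20_000)
print(lift)  # 10000.0 signups attributable to the campaign
```

Total signups (30k) overstate the campaign’s effect threefold here; only the 10k incremental signups are causal.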

Enabling Marketing At Scale: Netflix is now available in over 190 countries and advertises globally outside of our service in dozens of languages, for hundreds of pieces of content, ultimately using millions of promotional assets created solely for the purposes of marketing. One trailer for a title can quickly turn into hundreds of individual ads with permutations by languages, aspect ratios, subtitles, testing variations, etc. Our Marketing Tech platforms need to support the assembly and delivery of a wide variety of campaigns and plan type combinations, on a variety of external advertising platforms at global scale.

Enabling Easy Experimentation: Similar to how we test on the Netflix Product, the Netflix Marketing team embraces experimentation to identify the best marketing tactics for spending paid media dollars. The Marketing Tech team seeks to create technology that will enable our partners in marketing to spend more of their time on strategic and creative decisions. Our teams use experimentation to guide their instincts on the best performing campaigns. We enable experimentation through methodologies like A/B testing and geo-based quasi testing.

Near Real-time Measurement: As an advertiser running paid media across multiple global ad platforms, our systems need to provide accurate, near real-time answers about campaign performance. We then use this data to adapt and optimize our campaign spend in order to achieve our conversion goals.

How are we solving them?

The tech systems which solve for these problems can be broken into the following types:

  • Systems which are responsible for automating the workflows used for buying paid media and unlocking efficiencies in those flows (media planner)
  • Systems which help with creative development & localization and assembly of ads from the creative assets.
  • Systems which are responsible for marketing campaign creation and execution (campaign management system)
  • Systems which are responsible for collecting analytics and insights into how our campaigns are performing on ad platforms like Google (advertising insights)
  • Systems that enable changing budget on live campaigns in order to optimize our marketing spend (ad budget optimization)

Here is a highly simplified life cycle of a paid media marketing campaign.

Lifecycle of a paid media campaign

This blog will focus mainly on the campaign management and ad budget optimization systems.

Campaign Management System

First, some terminology for those less familiar with this space. A campaign is a set of advertisements with a single idea or theme. An advertising campaign is typically broadcast through several media channels like programmatic, digital reserve, TV, print, billboards, etc. A campaign consists of the following:

  • Objective: A campaign has specific goals and objectives.
  • Target Audience: It refers to the set of audience and languages within regions that are targeted by the campaign.
  • Catalog: Contains information about all ads that are part of the campaign. It consists of a list of titles, asset links, and ad formats, e.g. video ads, carousels, etc.
  • Budget: The spend associated with a campaign. You can apply this spend to each day the campaign runs (daily budget) or over the lifetime of the campaign (lifetime budget).

Programmatic bidding is the automated bidding on advertising inventory, for the opportunity to show an ad to a specific internet user, in a specific context.

The programmatic marketing team at Netflix is responsible for planning, setting up and executing marketing campaigns on the ad platforms. Each campaign requires a separate structure, and combinations of different geography, inventory sources, etc. We built our campaign management system in order to automate large parts of the campaign creation process. The system automates the process of campaign creation by abstracting out complexities in ad catalog setup, budget recommendation, experimentation, audience and campaign setup.

The complexity of the campaigns is further increased because the programmatic marketing team often runs experiments as part of a campaign. For example, we may use a campaign to test the relative effectiveness of a 30 second creative vs. a 6 second creative. Other experiments might try to determine optimal campaign parameters like budget allocation. Without tooling, all of the combinatorics would lead to dramatic increase in campaign setup time and complexity.

To solve this, our campaign management system has the notion of plan type (recipe) — pre-built combinations of various factors such as campaign type, objective, etc. With the help of this feature, we enable the selection of an appropriate recipe and add setup information for multiple cross-country and cross-platform tests in a single place. This dramatically reduces campaign setup time, removes error prone manual steps, and increases our confidence in test learnings.

System architecture

The Campaign Management Service relies on a variety of technologies to achieve its goals. The majority of the service layer is written in Kotlin and Java. Cassandra is the primary store for most data. The UI is built on top of Node.js using React components which communicates with the backend service via REST. Titus provides container based jobs which can be used for heavy lifting, such as uploading videos to ad platforms.

Ad Budget Optimization System

When we run a marketing campaign with an objective of increasing incremental new sign ups, we are faced with the challenge of spending marketing budgets across the globe. For example, should the next incremental dollar get allocated to marketing in the U.S. or Thailand if incremental sign up and revenue is our ultimate objective? Stated simply, this is a budget allocation problem with several hard-to-measure factors behind it: Netflix product/market fit, cost of media, etc.

We solve this problem by building a system to dynamically distribute budgets across campaigns and countries to maximize incremental revenue. The system retrieves live campaign performance data from ad platforms and calculates the budget allocation per country and platform based on the current spend and the number of days remaining in the campaign.
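The pacing part of that calculation (spend so far vs. days remaining) can be sketched in a few lines. This is a toy even-pacing rule; the real system weights allocation by predicted incremental revenue per country and platform, and the function name is ours:

```python
# Toy budget pacing: spread the remaining budget evenly over the
# remaining days of the campaign.

def daily_allocation(total_budget, spend_to_date, days_remaining):
    """Budget to allocate per remaining day of the campaign."""
    if days_remaining <= 0:
        return 0.0
    remaining = max(total_budget - spend_to_date, 0.0)  # never negative
    return remaining / days_remaining

# $90k lifetime budget, $30k already spent, 20 days left:
print(daily_allocation(90_000, 30_000, 20))  # 3000.0 per day
```

Re-running this against live spend data each day automatically speeds up pacing when a campaign underspends and slows it when it overspends.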

System architecture

There are three main components in the budget optimization system. The front-end provides a CRUD interface for entering campaign metadata (campaign description, start and end dates, budget, etc.). It also uses the Netflix workflow orchestration engine Meson to enable the ability to view, select, and execute specific budget runs. The next component is a data ETL pipeline which is responsible for calculating input metrics like spend, lift, etc., and persisting those in Hive tables. The third component is the backend, which reads campaign metadata from S3 and spend/lift metrics from Hive, and applies machine learning models to calculate the budget allocation per country and ad platform. The updated budget allocations are then pushed to the external platforms through API integrations.

Future challenges

As our business evolves, our systems need to scale to support accelerated experimentation and expanded creative inventory. We will also continue to fine tune our systems to make our workflows as automated as possible. The less time and effort spent manually creating ad campaigns, the faster we will be able to move as a business. Our goal is to continue to build an efficient, robust, and scalable marketing platform that is greater than the sum of its parts and which ultimately enables consumer delight.


In summary, we’ve discussed the mechanics of creation, delivery and optimization of Netflix campaigns at global scale. Some of the details themselves are worth follow-up posts and we’ll be publishing them in the future. We need back-end engineers to build robust data pipelines and facilitate communication and integration between our many services. We’re also looking for front-end engineers to build beautiful, intuitive user interfaces that are a pleasure to use and provide a cohesive look and feel across our ecosystem. If you’re interested in joining us in working on some of these opportunities within Netflix’s Marketing Tech, we’re hiring!

Authored By Jayashree Biswas, Gopal Krishnan

Engineering to Improve Marketing Effectiveness (Part 3) — Scaling Paid Media campaigns was originally published in Netflix TechBlog on Medium, where people are continuing the conversation by highlighting and responding to this story.

14 Things To Do When Setting Up a New Cassandra Cluster

Over the last few years, we’ve spent a lot of time reviewing clusters, answering questions on the Cassandra User mailing list, and helping folks out on IRC. During this time, we’ve seen the same issues come up again and again. While we regularly go into detail on individual topics, and encourage folks to read up on the fine print later, sometimes teams just need a starting point to help them get going. This is our basic tuning checklist for those teams who just want to get a cluster up and running and avoid some early potholes.

  1. Adjust The Number of Tokens Per Node

    When the initial work on vnodes in Cassandra was started on the mailing list, 256 tokens was chosen in order to ensure even distribution. This carried over into CASSANDRA-4119 and became part of the implementation. Unfortunately, there were several unforeseen (and unpredictable) consequences of choosing a number this high. Without spending a lot of time on the subject, 256 can cause issues with bootstrapping new nodes (lots of SSTables), repair takes longer, and CPU usage is overall higher. Since Cassandra 3.0, we’ve had the ability to allocate tokens in a more predictable manner that avoids hotspots and leverages the advantage of vnodes. We thus recommend using a value of 4 here, which gives a significant improvement in token allocation. In order to use this, please ensure you’re using the allocate_tokens_for_keyspace setting in the cassandra.yaml file. It’s important to do this up front, as there’s no easy way to change it without setting up a new data center and doing a migration.

    Note: Using 4 tokens won’t allow you to add a single node to a cluster of 100 nodes and have each node get an additional 1% capacity. In practice there’s not much benefit from doing that, as you would always want to expand a cluster by roughly 25% in order to have a noticeable improvement, and 4 tokens allows for that.
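    In cassandra.yaml the two settings look like this (the keyspace name is an example; point it at a keyspace that carries your production replication settings before bootstrapping new nodes):

```yaml
num_tokens: 4
# New nodes balance their tokens against the replication
# settings of this keyspace (available since Cassandra 3.0):
allocate_tokens_for_keyspace: my_keyspace
```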

  2. Configure Racks, Snitch, and Replication

    Cassandra has the ability to place data around the ring in a manner that ensures we can survive losing a rack, an availability zone, or an entire data center. In order to do this correctly, it needs to know where each node is placed, so that it can place copies of data in a fault-tolerant manner according to the replication strategy. Production workloads should ALWAYS use the NetworkTopologyStrategy, which takes the racks and data centers into account when writing data. We recommend using GossipingPropertyFileSnitch as a default. If you’re planning on staying within a cloud provider, it’s probably easier to use that provider’s dedicated snitch, such as the EC2Snitch, which figures out its rack and data center automatically. These should all be set up before using the cluster for production, as changing them later is extremely difficult.
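    For example, a keyspace keeping three replicas in each of two data centers would be created like this (the keyspace and data center names are examples; the names must match what your snitch reports):

```sql
CREATE KEYSPACE my_app
  WITH replication = {
    'class': 'NetworkTopologyStrategy',
    'dc1': 3,
    'dc2': 3
  };
```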

  3. Set up Internode Encryption

    We discuss how to do this in our guide on setting up internode encryption. I won’t go deep into the details here since Nate did a good job of that in his post. This falls in the “do it up front or you’ll probably never do it” list with #1 and #2.

  4. Set up client authentication.

    Using authentication for your database is a good standard practice, and pretty easy to set up initially. We recommend disabling the Cassandra user altogether once auth is set up, and increasing the replication factor (RF) of the system_auth keyspace to a few nodes per rack. For example, if you have 3 racks, use RF=9 for system_auth. It’s common to read folks advising increasing the system_auth RF to match the cluster size no matter how big it is; I’ve found this to be unnecessary. It’s also problematic when you increase your cluster size. If you’re using 1000 nodes per data center, you’ll need a higher setting, but it’s very unlikely your first cluster will be that big. Be sure to bump up the roles_validity_in_ms, permissions_validity_in_ms, and credentials_validity_in_ms settings (try 30 seconds) to avoid hammering those nodes with auth requests.

    Using authentication is a great start, but adding a layer of authorization to specific tables is a good practice as well. Lots of teams use a single Cassandra cluster for multiple purposes, which isn’t great in the long term but is fine for starting out. Authorization lets you limit what each user can do on each keyspace and that’s a good thing to get right initially. Cassandra 4.0 will even have the means of restricting users to specific data centers which can help segregate different workloads and reduce human error.

    The Security section of the documentation is worth reading for more details.

  5. Disable Dynamic Snitch

    Dynamic snitch is a feature that was intended to improve the performance of reads by preferring nodes which are performing better. Logically, it makes quite a bit of sense, but unfortunately it doesn’t behave that way in practice. In reality, we suffer from a bit of the Observer Effect, where the observation of the situation affects the outcome. The dynamic snitch generates so much garbage that using it makes everything perform significantly worse. By disabling it, we make the cluster more stable overall and end up with a net reduction in performance-related problems. A fix for this is actively being worked on for Cassandra 4.0 in CASSANDRA-14459.
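The change itself is a one-liner in cassandra.yaml:

```yaml
# Prefer the static snitch ordering; avoids the dynamic snitch's garbage churn
dynamic_snitch: false
```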

  6. Set up client encryption

    If you’re going to set up authentication, it’s a good idea to set up client encryption as well, or else everything (including authentication credentials) is sent over cleartext.

    Again, the Security section of the documentation is worth reading for more details on this, I won’t rehash what’s there.

  7. Increase counter cache (if using counters)

    Counters do a read before write operation. The counter cache allows us to skip reading the value off disk. In counter heavy clusters we’ve seen a significant performance improvement by increasing the counter cache. We’ll dig deeper into this in a future post and include some performance statistics as well as graphs.
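The cache is sized in cassandra.yaml; the value below is purely illustrative and should be tuned against your actual hit rate (visible via nodetool info):

```yaml
# Default is min(2.5% of heap, 50MB); counter-heavy clusters often warrant more
counter_cache_size_in_mb: 512
```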

  8. Set up sub range repair (with Reaper)

    Incremental repair is a great idea but unfortunately has been the cause of countless issues, which we discussed in our blog previously. Until those are resolved, we recommend running full repairs over token sub-ranges on a schedule, which is exactly what Reaper manages for you. Cassandra 4.0 should fix the remaining issues with incremental repair by changing the way anticompaction works.

    Read up on the details in CASSANDRA-9143

  9. Set up Monitoring

    Without good monitoring it’s just not possible to make good decisions on a single server, and the problem is compounded when dealing with an entire cluster. There are a number of reasonable monitoring solutions available. At the moment we’re fans of Prometheus if you choose to self-host, and Datadog if you prefer hosted, but this isn’t meant to be an exhaustive list. We recommend aggregating metrics from your application and your databases into a single monitoring system.

    Once you’ve got all your metrics together, be sure to create useful dashboards that expose:

    • Throughput
    • Latency
    • Error rate

    These metrics must be monitored from all layers to truly understand what’s happening when there’s an issue in production.

    Going an extra step or two past normal monitoring is distributed tracing, made popular by the Google Dapper paper. The open source equivalent is Zipkin, which Mick here at The Last Pickle is a contributor to and wrote a bit about some time back.

  10. Backups

    Cassandra’s fault tolerance makes it easy to (incorrectly) disregard the usual advice of having a solid backup strategy. Just because data is replicated doesn’t mean we won’t need a way to recover it later on; replication happily propagates an accidental delete to every replica. This is why we always recommend having a backup strategy in place.

    There are hosted solutions like Datos, Cassandra Snapshots, volume snapshots (LVM, EBS Volumes, etc), incremental Cassandra backups, home rolled tools, etc. It’s not possible to recommend a single solution for every use case. The right solution is going to be dependent on your requirements.

  11. Basic GC Tuning

    The default Cassandra JVM arguments are, for the most part, unoptimized for virtually every workload. We’ve written a post on GC tuning so we won’t rehash it here. Most people assume GC tuning is an exercise in premature optimization and are quite surprised when they can get a 2-5x improvement in both throughput (queries per second) and p99 latency.
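As a starting point only (heap sizes must match your hardware, and every workload differs; benchmark before adopting any of this), a G1-based fragment for conf/jvm.options might look like:

```
## Illustrative only: size the heap to the host and measure
-Xms16G
-Xmx16G
-XX:+UseG1GC
-XX:MaxGCPauseMillis=200
-XX:InitiatingHeapOccupancyPercent=70
```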

  12. Disable Materialized Views

    When materialized views (MVs) were added to Cassandra 3 everyone, including me, was excited. It was one of the most interesting features added in a long time and the possibility of avoiding manual denormalization was very exciting. Unfortunately, writing such a feature to work correctly turned out to be extremely difficult as well.

    Since the release of the feature, materialized views have retroactively been marked as experimental, and we don’t recommend them for normal use. Their use makes it very difficult (or impossible) to repair and bootstrap new nodes into the cluster. We’ll dig deeper into this issue in a later post. For now, we recommend setting the following in the cassandra.yaml file:

    enable_materialized_views: false

  13. Configure Compression

    We wrote a post a while back discussing tuning compression and how it can help improve performance, especially on read heavy workloads.
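For example (the table name is invented; a 16KB chunk size is a common read-heavy starting point, not a universal answer):

```sql
-- Smaller chunks mean less data decompressed per single-row read
ALTER TABLE app.events WITH compression = {
  'class': 'LZ4Compressor',
  'chunk_length_in_kb': 16
};
```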

  14. Dial Back Read Ahead

    Generally speaking, whenever data is read off disk, it’s put in the page cache (there are exceptions). Accessing data in the page cache is significantly faster than accessing data off disk, since it’s reading from memory. Read ahead simply reads extra data. The logic is that adding extra data to a read is cheap, since the request is already being made, so we might as well get as much data in the page cache as we can.

    There’s a big problem with this.

    First, if read ahead is requesting information that’s not used, it’s a waste of resources to pull extra data off disk. Pulling extra data off disk means the disk is doing more work. If we had a small amount of data, that might be fine. The more data we have, the higher the ratio of disk space to memory, and the less likely it becomes that we’ll actually have the data in memory. In practice, we’ll likely only use some of the data, some of the time.

    The second problem has to do with page cache churn. If you have 30GB of page cache available and you’re accessing 3TB of data, you end up pulling a lot of data into your page cache. The old data needs to be evicted. This process is an incredible waste of resources. You can tell if this is happening on your system if you see a kswapd process taking up a lot of CPU, even if you have swap disabled.

    We recommend setting this to 0 or 8 sectors (4KB) on local SSD, and 16 sectors (8KB) if you’re using EBS in AWS and you’re performing reads off large partitions.

    blockdev --setra 8 /dev/<device>

    You’ll want to experiment with the read ahead setting to find out what works best for your environment.

There’s quite a bit to digest, so I’ll stop here. This isn’t a comprehensive list of everything that can be adjusted by any means, but it’s a good start if you’re looking to do some basic tuning. We’ll be going deeper into some of the topics in later posts!

Handling a Cassandra transactional workload

Overview of Cassandra

As previously mentioned in my notes on lightweight transactions, Cassandra does not support ACID transactions. Cassandra was built to support a brisk ingest of writes while being distributed for availability. Follow the link to my previous post above to learn more specifics about LWTs and the Paxos algorithm.

Here I’ll cover some other ways to handle a transactional workload in Cassandra.

Batch mode

The standard write path for Cassandra is from the client to the commit log and memtable, with memtables later flushed to SSTables. The write is stored in the memtable and commit log of the replica nodes (as configured using the replication factor) before it is considered complete.

The batch write path includes, in addition, a batch log which is used to group updates that are then considered complete (or not) together. This is an expensive operation unless batch writes affect a single partition key.
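A sketch (the schema is invented): when every statement in a logged batch shares the same partition key, the batch is applied as a single atomic mutation without the cross-partition batch log penalty:

```sql
BEGIN BATCH
  INSERT INTO app.user_events (user_id, ts, event)
    VALUES (42, '2020-01-01 00:00:00', 'login');
  INSERT INTO app.user_events (user_id, ts, event)
    VALUES (42, '2020-01-01 00:05:00', 'purchase');
APPLY BATCH;
```

Both rows live in partition user_id = 42; add a statement against a second partition and the coordinator must also persist a batch log entry on other nodes before applying the writes.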

As with lightweight transactions, time coordination among nodes remains important with batched writes.


To avoid expensive lightweight transactions or batched writes, software can be installed beside Cassandra to manage writes that need to be done together. Applications coordinate with the software to introduce locks to the write path to ensure atomicity and isolation of updates; the software used manages these locks.

Two software tools that can be used for this type of workaround are Apache Zookeeper and Hashicorp Consul. Both of these tools are typically used to manage distributed configuration but can be leveraged to assist with Cassandra transactions. Whereas Zookeeper was originally created as an in-memory data store, Consul was built to be a configuration manager.


Because Zookeeper is essentially a data store, several libraries were created for the locking functionality. Two of these are Google’s Cages and Netflix’s Curator (now maintained as an Apache project). Note that the Cages library has not been updated in several years. There is no reason application developers could not write similar functionality within their main application to interact with Zookeeper, perhaps using these as references.


Cages is a Java library used to synchronize the movement of data among distributed machines, making Cassandra transactional workloads an ideal use case.

Cages includes several classes for reading and writing data. A pertinent one for transactional workloads is ZkWriteLock, used to wrap statements inside a lock stored in Zookeeper. Note that this lock stored in Zookeeper has nothing to do with the underlying Cassandra functionality, and must be adhered to by all parts of the application. Indeed, the application or another user could bypass the lock and interact directly with Cassandra.
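To make the “everyone must honour the lock” point concrete, here is a single-process stand-in (everything is invented for illustration: a java.util.concurrent lock plays the role of a lock held in Zookeeper or Consul, and a map plays the role of a Cassandra table):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.locks.ReentrantLock;

public class LockedWrites {
    // Stand-in for an external lock service; in production this would be,
    // e.g., a Curator InterProcessMutex on a per-business-key path
    static final ReentrantLock lock = new ReentrantLock();
    static final Map<String, Integer> accounts = new HashMap<>();

    // Two writes that must be seen together. The store itself enforces
    // nothing: atomicity holds only if every writer takes the lock.
    static void transfer(String from, String to, int amount) {
        lock.lock();                                      // acquire in the lock service
        try {
            accounts.merge(from, -amount, Integer::sum);  // first write
            accounts.merge(to, amount, Integer::sum);     // second write
        } finally {
            lock.unlock();                                // always release, even on failure
        }
    }

    public static void main(String[] args) throws InterruptedException {
        accounts.put("a", 100);
        accounts.put("b", 0);
        Thread t1 = new Thread(() -> transfer("a", "b", 30));
        Thread t2 = new Thread(() -> transfer("a", "b", 20));
        t1.start(); t2.start();
        t1.join(); t2.join();
        System.out.println(accounts.get("a") + " " + accounts.get("b")); // prints: 50 50
    }
}
```

Any code path (or operator) that writes directly, without taking the lock, silently breaks the guarantee, which is exactly the caveat with ZkWriteLock above.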


Curator was created specifically to manage Zookeeper, resulting in a tighter integration. Curator works similarly to Cages, though, wrapping statements in a mutex and requiring the application to observe the locks to ensure data consistency.


Consul is also a distributed storage system used to manage configuration and similar data. It is more recently developed and remains actively maintained.

The distribution of Consul storage is highly flexible and performant, making it a great alternative to Zookeeper. The basic interaction from the application remains the same: the application would store a lock as a key-value in Consul, and all writes from the application would need to respect the lock.


Introducing extra steps in the write path is not free with regard to performance.

In addition to the lag inherent to locking, Zookeeper can become a bottleneck. This can be avoided by scaling the Zookeeper clusters. A feature called Observer helps to reduce time spent getting a quorum from the Zookeeper cluster.
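For reference (the hostnames are invented), an Observer is declared in zoo.cfg with the :observer suffix; observers receive updates but do not vote, so they add read capacity without enlarging the quorum:

```
# zoo.cfg, on every member of the ensemble
server.1=zk1:2888:3888
server.2=zk2:2888:3888
server.3=zk3:2888:3888
server.4=zk4:2888:3888:observer

# and additionally, on the observer node itself:
peerType=observer
```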

Regardless, there is an upper limit, of roughly 5-10K operations per second, that you can perform against Zookeeper, so take this into consideration when planning an architecture.


If the transactional workload is infrequent and minimal, lightweight transactions should suffice. However, if transactions are a core function of the application, we recommend using Zookeeper or Consul to manage write locks. Zookeeper has a longer history, but Consul is more up-to-date and provides great flexibility and performance, giving us a preference for Consul.

DZone: How to Run Apache Cassandra on Kubernetes

With Kubernetes’ popularity skyrocketing and the adoption of Apache Cassandra growing as a NoSQL database well-suited to matching the high availability and scalability needs of cloud-based applications, it should be no surprise that more developers are looking to run Cassandra databases on Kubernetes. However, many developers are finding that doing so is relatively simple to get started with, but considerably more challenging to execute well.

Read on to explore how to run Apache Cassandra on Kubernetes.

The post DZone: How to Run Apache Cassandra on Kubernetes appeared first on Instaclustr.

Anomalia Machina 6 – Application Tracing with OpenTracing: Massively Scalable Anomaly Detection with Apache Kafka and Cassandra

In the previous blog (Anomalia Machina 5 – Application Monitoring with Prometheus) we explored how to better understand an Open Source system using Prometheus for distributed metrics monitoring. In this blog we have a look at another way of increasing visibility into a system using OpenTracing for distributed tracing.

1 A history of Tracing

Over a hundred years ago one of the first automated duplicating machines was invented, the Mimeoscope. It was a light table which allowed operators to make quick copies of pictures by tracing them.

History of tracing - Anomalia Machina 6

Move forward 80 years (to 1981) for another example of the power of tracing. Sometimes even a simple trace is enough to convey meaning. This outline city skyline is immediately recognisable (and, for extra points, can you name the movie?)

City skyline outline - Anomalia Machina 6

You’ve been given a mission to fly into an “abandoned” city that’s been turned into a maximum security penal colony (with no guards inside, just the worst criminals, all with a one way ticket), in a glider, at night. Is a “tracing” of the city skyline going to be sufficient for you to navigate to a safe landing, rescue the president, and escape (all within 24 hours, when you’ll explode)? Evidently, if you believe the cult movie “Escape From New York” (1981), it is.

The wire-frame “tracing” computer graphics on the display screens in the glider were not actually computer-generated. Computer animation was so expensive in the early 1980s that they “simulated” it! To make the film’s wire-frame “computer” animation of Manhattan, they built a model of the (futuristic, it was supposed to be 1997) city, lined the buildings’ edges with glow-in-the-dark tape, filmed it under UV light, and moved a camera through the model city. So, basically a fake computer animation, of a model, of a future version of an actual city!

Nose View Manhattan - Anomalia Machina 6

Fast forward again to the present day. We want to increase observability into our distributed Anomalia Machina system. Here’s the pipeline we’ll be using, showing the main components, including the Kafka load generator, Kafka, the Kafka consumer, and the detector pipeline which writes and reads data to/from Cassandra and runs the algorithm:

Functional Architecture Prototype - Anomalia Machina 6

In the previous blog we used Prometheus to collect metrics including latency and throughput from each stage of the pipeline. Metrics are useful, but don’t tell the complete story. In particular, they are typically aggregations of metrics, and don’t have context (e.g. dependency and order information) or the ability to drill down to single events. This is where distributed tracing comes into play. The paper “Principled workflow-centric tracing of distributed systems” identifies some use cases for distributed tracing including:

  • Anomaly detection (identifying anomalous workflows)
  • Diagnosing steady-state problems
  • Service Level Objectives (SLOs)
  • Distributed profiling
  • Resource attribution
  • Workload modeling.

This list isn’t complete: it misses performance modelling (what you get when you combine profiling with resource attribution and workload modelling) and architecture discovery and modelling (APM vendors have different names for this, e.g. service dependency maps, application topology discovery, etc.).

Topology maps (which are also used for systems other than software, such as the famous London Underground map) are particularly informative as they abstract away unnecessary details of a system to reveal only the important information (e.g. that there are animals hiding on the Underground!)

Topology Map, hidden animals - Anomalia Machina 6

In this blog we’re going to explore how to go about adding distributed tracing to Anomalia Machina using OpenTracing. We may not find a hiding Hound, but we will hopefully learn something interesting about distributed tracing.

2 What is OpenTracing?

“OpenTracing provides an open, vendor-neutral standard API for describing distributed transactions, specifically causality, semantics and timing.”

The most important thing to understand about OpenTracing is that it’s a specification, not an implementation. To make it work you need (1) application instrumentation and (2) an OpenTracing-compatible tracer. Here’s a list of the supported tracers. Many are open source, and a few are commercial APM tools (including DataDog, and possibly Dynatrace in the future).

The specification is a language-independent and vendor-neutral specification of the semantics required to enable distributed tracing. It consists of a Data Model, an API, and semantic conventions. You use the API to instrument your code, but then you need to pick a tracer implementation to actually generate the traces, collect them, and visualise them. This diagram shows the high-level relationship between the traced applications (on the left), the API (in the middle), and the tracer implementations (on the right).

OpenTracing microservices process - Anomalia Machina 6

The core concept in OpenTracing is (oddly enough) Spans (rather than Traces):

“Spans are logical units of work in a distributed system, and they all have a name, a start time, and a duration. Spans are associated with the distributed system component that generated them.” (i.e. they are associated with the tracer instance that produced them).

A simple way of illustrating spans is with Bridges. The simplest bridge (e.g. the Sydney Harbour Bridge, the “Coathanger”) has a single span.

Illustrating spans with bridges - Anomalia Machina 6

Another important concept is relationships:

“Relationships are the connections between Spans. A Span may reference zero or more other Spans that are causally related. These connections between Spans help describe the semantics of the running system, as well as the critical path for latency-sensitive (distributed) transactions.”

OpenTracing defines two types of references: ChildOf and FollowsFrom. ChildOf is for synchronous dependencies where a result is typically returned from the child to the parent. FollowsFrom is for asynchronous relationships where the parent does not wait for the child.

So what about Traces? After some investigation it appears that Traces are actually just 1 or more spans that are related:

“Traces in OpenTracing are defined implicitly by their Spans. In particular, a Trace can be thought of as a directed acyclic graph (DAG) of Spans, where the edges between Spans are References.”

Logically a span with no parent is the first (or only) span (source node) in a trace, and spans with no other spans referencing them are sink nodes.
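As a toy illustration (plain Java, not the OpenTracing API; all names invented): if each span records at most one parent reference, the trace’s source and sink nodes fall out of the DAG directly:

```java
import java.util.*;

// Toy model: a trace is implicitly the DAG formed by spans and their
// references; the root is the span that references no other span.
public class SpanDag {
    record Span(String name, String parent) {}   // parent == null for a root span

    // Source nodes: spans with no parent reference
    static List<String> roots(List<Span> spans) {
        List<String> out = new ArrayList<>();
        for (Span s : spans) if (s.parent() == null) out.add(s.name());
        return out;
    }

    // Sink nodes: spans that no other span references as its parent
    static List<String> sinks(List<Span> spans) {
        Set<String> referenced = new HashSet<>();
        for (Span s : spans) if (s.parent() != null) referenced.add(s.parent());
        List<String> out = new ArrayList<>();
        for (Span s : spans) if (!referenced.contains(s.name())) out.add(s.name());
        return out;
    }

    public static void main(String[] args) {
        List<Span> trace = List.of(
            new Span("producer", null),              // source node
            new Span("consumer", "producer"),        // FOLLOWS_FROM producer
            new Span("detector", "consumer"),
            new Span("cassandraWrite", "detector"),  // CHILD_OF detector
            new Span("cassandraRead", "detector"),
            new Span("anomalyDetector", "detector"));
        System.out.println(roots(trace));   // [producer]
        System.out.println(sinks(trace));   // [cassandraWrite, cassandraRead, anomalyDetector]
    }
}
```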

Some bridges have multiple spans, so a journey from one side to the other can be viewed as a multi-span trace.

Millau Viaduct, multiple spans - Anomalia Machina 6

The Millau Viaduct in France has 8 spans. Each journey across the viaduct is a trace.

3 Tracing Anomalia Machina

On the application side you have to instrument your code to produce spans. See the guides for the language-specific clients available and how to use them. Alternatively, you can use a contributed library which adds OpenTracing instrumentation to target frameworks. For cross-process tracing you need to use a 3rd-party library to inject/extract context across process boundaries. At last count there are 87 contributions, including libraries for Spring, Elasticsearch, JDBC, Cassandra, and Kafka!

This documentation explains how to use the API to record traces for Java. The simplest way of creating new spans is to use the default scope manager. If there is a Scope, by default it will act as the parent to any newly started Span. However, as the default relationship is CHILD_OF, and you also can’t pass a Scope across threads, I decided to use spans explicitly and create new spans manually with an explicit relationship to the parent span. Here’s some example code that mimics the Anomalia Machina pipeline and produces spans.

import io.jaegertracing.Configuration;
import io.jaegertracing.Configuration.ReporterConfiguration;
import io.jaegertracing.Configuration.SamplerConfiguration;
import io.jaegertracing.internal.JaegerTracer;
import io.opentracing.References;
import io.opentracing.Span;
import io.opentracing.Tracer;

public class BlogTest {
    // Single tracer demo
    static Tracer tracer = initTracer("BlogTest");

    public static void main(String[] args) {
        int numTraces = 10;
        for (int i = 0; i < numTraces; i++) {
            // simulate a trace consisting of producer, consumer, and detector asynchronous spans
            Span span1 = tracer.buildSpan("producer").start();
            producer();
            span1.finish();
            Span span2 = tracer.buildSpan("consumer")
                    .addReference(References.FOLLOWS_FROM, span1.context()).start();
            consumer();
            detector(span2);
            span2.finish();
        }
    }

    public static void producer() {
        try { Thread.sleep(1 + (long) (Math.random() * 10)); } catch (InterruptedException e) { }
    }

    public static void consumer() {
        try { Thread.sleep(1 + (long) (Math.random() * 10)); } catch (InterruptedException e) { }
    }

    public static void detector(Span parent) {
        // the detector span has the consumer as parent, and is asynchronous
        Span spanDetector = tracer.buildSpan("detector")
                .addReference(References.FOLLOWS_FROM, parent.context()).start();
        // the detector span has 3 synchronous, i.e. CHILD_OF, child spans
        Span spanWrite = tracer.buildSpan("cassandraWrite")
                .addReference(References.CHILD_OF, spanDetector.context()).start();
        cassandraWrite();
        spanWrite.finish();
        Span spanRead = tracer.buildSpan("cassandraRead")
                .addReference(References.CHILD_OF, spanDetector.context()).start();
        cassandraRead();
        spanRead.finish();
        Span spanAnomalyDetector = tracer.buildSpan("anomalyDetector")
                .addReference(References.CHILD_OF, spanDetector.context()).start();
        anomalyDetector();
        spanAnomalyDetector.finish();
        spanDetector.finish();
    }

    public static void cassandraWrite() {
        try { Thread.sleep(1 + (long) (Math.random() * 10)); } catch (InterruptedException e) { }
    }

    public static void cassandraRead() {
        try { Thread.sleep(1 + (long) (Math.random() * 100)); } catch (InterruptedException e) { }
    }

    public static void anomalyDetector() {
        try { Thread.sleep(5 + (long) (Math.random() * 20)); } catch (InterruptedException e) { }
    }

    public static JaegerTracer initTracer(String service) {
        SamplerConfiguration samplerConfig = SamplerConfiguration.fromEnv().withType("const").withParam(1);
        ReporterConfiguration reporterConfig = ReporterConfiguration.fromEnv().withLogSpans(true);
        Configuration config = new Configuration(service).withSampler(samplerConfig).withReporter(reporterConfig);
        return config.getTracer();
    }
}
Note that if you actually want to use the spans outside the process, you have to create a tracer using an actual tracer implementation (although there is a MockTracer API for testing). I picked Jaeger (but also tried Zipkin, they are similar in practice). Documentation on instantiating a Jaeger Tracer is here and here.

Tracers can have different architectures. Here’s the architecture of Jaeger. Jaeger is a distributed tracing system with a number of components.

Architecture of Jaeger - Anomalia Machina 6

Jaeger should scale well in a production environment due to using Cassandra for a scalable data/query store, and Spark for dependency analysis. However, for testing there’s a simple all-in-one executable with in memory only storage that I used (just remember that it doesn’t remember traces when you restart it). It’s interesting to note that there is a bi-directional flow between the jaeger-agent and the jaeger-client which provides an important feedback loop (e.g. for adaptive sampling), but which also makes it harder to use Kafka for span transport.

Once you’ve got everything working you can start Jaeger, run the instrumented code, and then see what’s happened in the Jaeger GUI (http://localhost:16686/search). You can select traces by Service name, operation, etc and filter by other parameters and then Find. You can select a trace from the results or on the timeline graph for more detail.

Trace selection - Anomalia Machina 6

Viewing a single trace gives insights into the total trace time, and the relationships and times of the spans. For this example the trace has 3 asynchronous spans: producer, consumer, and detector, and the detector span has 3 synchronous children: CassandraWrite, CassandraRead and AnomalyDetector.

Total trace time, and the relationships and times of the spans - Anomalia Machina 6

Jaeger also has tabs to Compare 2 traces, or view Dependencies. In theory this was supposed to build a dependency map from all the traces in memory (i.e. it doesn’t depend on the search results), but it was initially just blank. I wondered if there was some configuration change required, or if I needed to run Spark to process the traces (Spark is used for production deployments). It turns out that (a) Jaeger only shows dependencies between spans produced by different services (i.e. tracers), and (b) it only shows dependencies between synchronous (CHILD_OF) nodes. This took some time to debug as you need both tricks in place for it to work. After enhancing the above code example with multiple tracers (one tracer per span), and using CHILD_OF instead of FOLLOWS_FROM, we get multi-coloured traces:

Multiple tracers - Anomalia Machina 6

And the dependencies view is now populated:

Dependencies view - Anomalia Machina 6

This is what we were hoping for, as it correctly shows the expected topology of the workflow, along with the number of spans observed. Unfortunately there’s not much added information on this graph. Ideally there would be throughputs and average response times. Also, the graph is computed from all the traces in memory, so you can’t select (e.g. a time window) or filter (e.g. a transaction type) a subset of traces. For more complex graphs it’s also useful to be able to aggregate or expand sub-graphs.

4 Tracing across process boundaries (Kafka producer to consumers)

So far all we have demonstrated with OpenTracing is that we can produce a trace within a process, but there are lots of other tools that can do that. We can also view the traces remotely from the application that produced them. For large scale distributed systems, what we really need is a way of tracing applications across boundaries: process, machine, and cloud. How can this be done? You need a way of finding the causal relationship between spans. One approach is to use Big Data techniques to analyse the separate spans and infer the relationships between them. For example, Facebook’s Mystery Machine does this by forming theories about relationships and then trying to disprove them from the empirical data. Another approach is to inject metadata into the cross-process call flows in order to provide sufficient context to build traces across heterogeneous systems. At Facebook, Canopy is the successor to the Mystery Machine and takes this approach by propagating TraceIDs across process boundaries. This is also the approach taken by OpenTracing. Implementations need to provide a way to inject and extract a spanContext across process boundaries. How they do this will depend on the protocol involved.

Mystery Machine - Anomalia Machina 6

Fortunately, someone else has already developed a solution for OpenTracing Kafka clients. On the Kafka producer side, it automatically inserts a span context into the Kafka headers. You just have to add an interceptor to the Kafka producer properties like this:

// Register tracer with GlobalTracer:
GlobalTracer.register(tracer);

// Add TracingProducerInterceptor to sender properties:
senderProps.put(ProducerConfig.INTERCEPTOR_CLASSES_CONFIG,
        TracingProducerInterceptor.class.getName());

// Instantiate KafkaProducer
KafkaProducer<Integer, String> producer = new KafkaProducer<>(senderProps);

// Send (topic, key, and value here are placeholders)
producer.send(new ProducerRecord<>("my-topic", 1, "my-message"));
On the Kafka consumer side, once you have got a single ConsumerRecord (record) to process, you extract the span context and create a new FOLLOWS_FROM span like this (I couldn’t get the helper function, TracingKafkaUtils.extractSpanContext(), to work correctly – it kept returning a null context even though on inspection the Kafka headers appeared to have the correct span context data):

SpanContext spanContext =  tracer.extract(Format.Builtin.TEXT_MAP, new MyHeadersMapExtractAdapter(record.headers(), false));

newSpan = tracer.buildSpan("consumer").addReference(References.FOLLOWS_FROM, spanContext).start();

But note that as FOLLOWS_FROM won’t produce a dependency map, you need to use CHILD_OF instead.

5 More interesting Kafka Example

The Anomalia Machina pipeline is relatively simple, so I wondered how well OpenTracing would work for discovering and visualising more complex Kafka topologies. For example, would it be possible to visualise the topology of data flow across many Kafka topics? I wrote a simple Markov chain simulator which allows you to choose the number of source topics, intermediate topics, and sink topics, and a graph density, and then produces random traces. The code is in this gist.

Here’s the dependency graph for a run of this code. In practice you would also want to add information about the Kafka producers and consumers (either as extra nodes, or by labelling the edges). There’s also a cool force-directed graph view which allows you to select a node and highlight the dependent nodes.

Dependency graph - Anomalia Machina 6

6 Further Resources

OpenTracing isn’t the only open tracing player. OpenCensus is an alternative. How do they compare?

The approach of having to instrument your code manually is time consuming, and fragile if the APIs change. Another approach is agent-based (the route most of the mainstream APM products take; Dynatrace, for example, uses bytecode injection). It looks like OpenCensus also has a Java agent.

You can’t swap tracers for different parts of the system, as the protocol is tracer implementation specific (c.f. OpenCensus).

There’s also a W3C trace context effort.

Alibaba cloud supports OpenTracing.

Some other useful OpenTracing resources are here, here and here.

This is a good comparison of Zipkin and Jaeger.

You may wonder if you really need both Prometheus and OpenTracing to monitor an application. In theory the OpenTracing spans actually have all of the information needed for monitoring application metrics (e.g. throughput and latency). So why not just collect spans once and use them for multiple purposes, including monitoring? Well, this is exactly what most commercial APMs do. And in theory (and possibly in practice; I tried early on but didn’t quite get it working) it’s possible to get OpenTracing data into Prometheus.

Finally, here’s the actual model used to produce the wire-frame images for “Escape From New York”, and the resulting movie sequence.

The post Anomalia Machina 6 – Application Tracing with OpenTracing: Massively Scalable Anomaly Detection with Apache Kafka and Cassandra appeared first on Instaclustr.

Wide Partitions in Apache Cassandra 3.11

This article was originally published on Backblaze.


Wide Partitions in Cassandra can put tremendous pressure on the java heap and garbage collector, impact read latencies, and can cause issues ranging from load shedding and dropped messages to crashed and downed nodes.

While the theoretical limit on the number of cells per Partition has always been two billion cells, the reality has been quite different, as the impacts of heap pressure show. To mitigate these problems, the community has offered a standard recommendation for Cassandra users to keep Partitions under 400MB, and preferably under 100MB.

However, in version 3 many improvements were made that affected how Cassandra handles wide Partitions. Memtables, caches, and SSTable components were moved off-heap, the storage engine was rewritten in CASSANDRA-8099 and Robert Stupp made a number of other improvements listed under CASSANDRA-11206.

Working with Backblaze and operating a Cassandra version 3.11 cluster we had the opportunity to test and validate how Cassandra actually handles Partitions with this latest version. We will demonstrate that well designed data models can go beyond the existing 400MB recommendation without nodes crashing through heap pressure.

Below, we walk through how Cassandra writes Partitions to disk in 3.11, look at how wide Partitions impact read latencies, and then present our testing and verification of wide Partition impacts on the cluster, using the work we did with Backblaze.

The Art and Science of Writing Wide Partitions to Disk

First we need to understand what a Partition is and how Cassandra writes Partitions to disk in version 3.11.

Each SSTable comprises a set of files, and the -Data.db file contains numerous Partitions.

The layout of a Partition in the -Data.db file has three components: a header, followed by zero or one static Rows, followed by zero or more ordered Clusterable objects. Each Clusterable object may be either a Row or a RangeTombstone that deletes data, and a wide Partition contains many Clusterable objects. For an excellent in-depth examination of this, see Aaron’s blog post on the Cassandra 3.x Storage Engine.

The -Index.db file stores offsets for the Partitions, as well as the serialized IndexInfo objects for each Partition. These indices facilitate locating the data on disk within the -Data.db file. Stored Partition offsets are represented by a subclass of RowIndexEntry. This subclass is chosen by the ColumnIndex and depends on the size of the Partition:

  • RowIndexEntry is used when there are no Clusterable objects in the Partition, such as when there is only a static Row. In this case there are no IndexInfo objects to store and so the parent RowIndexEntry class is used rather than a subclass.

  • The IndexedEntry subclass holds the IndexInfo objects in memory until the Partition has finished writing to disk. It is used for Partitions where the total serialized size of the IndexInfo objects is less than the column_index_cache_size_in_kb configuration setting (which defaults to 2KB).

  • The ShallowIndexedEntry subclass serializes IndexInfo objects to disk as they are created and references these objects using only their position in the file. It is used for Partitions where the total serialized size of the IndexInfo objects is more than the column_index_cache_size_in_kb configuration setting.

These IndexInfo objects provide a sampling of positional offsets for Rows within a Partition, creating an index. Each object specifies the offset the page starts at, the first Row and the last Row.

So, in general, the bigger the Partition, the more IndexInfo objects need to be created when writing to disk - and if they are held in memory until the Partition is fully written to disk they can cause memory pressure. This is why the column_index_cache_size_in_kb setting was added in Cassandra 3.6 and the objects are now serialized as they are created.
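To make the selection rules concrete, here is a small illustrative sketch (our own simplification, not actual Cassandra source; the class and method names are invented) of how the RowIndexEntry subclass is chosen:

```java
public class IndexEntryChooser {
    // Default value of the column_index_cache_size_in_kb setting.
    static final int COLUMN_INDEX_CACHE_SIZE_KB = 2;

    enum Kind { ROW_INDEX_ENTRY, INDEXED_ENTRY, SHALLOW_INDEXED_ENTRY }

    // Choose the RowIndexEntry subclass from the number of IndexInfo objects
    // in the Partition and their total serialized size.
    static Kind choose(int indexInfoCount, long serializedBytes) {
        if (indexInfoCount == 0)
            return Kind.ROW_INDEX_ENTRY;          // e.g. only a static Row
        if (serializedBytes < COLUMN_INDEX_CACHE_SIZE_KB * 1024L)
            return Kind.INDEXED_ENTRY;            // IndexInfo held in memory
        return Kind.SHALLOW_INDEXED_ENTRY;        // IndexInfo serialized as created
    }

    public static void main(String[] args) {
        System.out.println(choose(0, 0));               // ROW_INDEX_ENTRY
        System.out.println(choose(10, 1024));           // INDEXED_ENTRY
        System.out.println(choose(100_000, 1_400_000)); // SHALLOW_INDEXED_ENTRY
    }
}
```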

The relationship between Partition size and the number of objects was quantified by Robert Stupp in his presentation, Myths of Big Partitions:

IndexInfo numbers from Robert Stupp

How Wide Partitions Impact Read Latencies

Cassandra’s key cache is an optimization that is enabled by default and helps to improve the speed and efficiency of the read path by reducing the amount of disk activity per read.

Each key cache entry is identified by a combination of the keyspace, table name, SSTable, and the Partition key. The value of the key cache is a RowIndexEntry or one of its subclasses - either IndexedEntry or the new ShallowIndexedEntry. The size of the key cache is limited by the key_cache_size_in_mb configuration setting.

When a read operation in the storage engine gets a cache hit it avoids having to access the –Summary.db and –Index.db SSTable components, which reduces that read request’s latency. Wide Partitions, however, can decrease the efficiency of this key cache optimization because fewer hot Partitions will fit into the allocated cache size.

Indeed, before the ShallowIndexedEntry was added in Cassandra version 3.6, a single wide Row could fill the key cache, reducing the hit rate efficiency. When applied to multiple Rows, this will cause greater churn of additions and evictions of cache entries.

For example, if the IndexEntry for a 512MB Partition contains 100K+ IndexInfo objects and if these IndexInfo objects total 1.4MB, then the key cache would only be able to hold 140 entries.

The introduction of ShallowIndexedEntry objects changed how the key cache can hold data. The ShallowIndexedEntry contains a list of file pointers referencing the serialized IndexInfo objects and can binary search through this list, rather than having to deserialize the entire IndexInfo objects list. Thus when the ShallowIndexedEntry is used no IndexInfo objects exist within the key cache. This increases the storage efficiency of the key cache in storing more entries, but does still require that the IndexInfo objects are binary searched and deserialized from the –Index.db file on a cache hit.

In short, on wide Partitions a key cache miss still results in two additional disk reads, as it did before Cassandra 3.6, but now a key cache hit incurs a disk read to the -Index.db file where it did not before Cassandra 3.6.
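That accounting can be condensed into a toy sketch (again our own illustration, not Cassandra source):

```java
public class ReadPathCost {
    // Extra SSTable component reads per read request, per the description above:
    // a key cache miss reads -Summary.db and -Index.db (both before and after 3.6);
    // a key cache hit costs nothing extra before 3.6, but with the
    // ShallowIndexedEntry (3.6+) it reads -Index.db to deserialize IndexInfo.
    static int extraDiskReads(boolean cacheHit, boolean shallowIndexedEntry) {
        if (!cacheHit)
            return 2;                            // -Summary.db + -Index.db
        return shallowIndexedEntry ? 1 : 0;      // -Index.db only for ShallowIndexedEntry
    }

    public static void main(String[] args) {
        System.out.println(extraDiskReads(false, true));  // 2 (miss)
        System.out.println(extraDiskReads(true, false));  // 0 (hit, pre-3.6)
        System.out.println(extraDiskReads(true, true));   // 1 (hit, 3.6+)
    }
}
```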

Object Creation and Heap Behavior with Wide Partitions in 2.2.13 vs 3.11.3

Introducing the ShallowIndexedEntry in Cassandra version 3.6 created a measurable improvement in the performance of wide Partitions. To test the effects of this and the other performance enhancements introduced in version 3, we compared how Cassandra 2.2.13 and 3.11.3 performed when one hundred thousand, one million, or ten million Rows were written to a single Partition.

The results and accompanying screenshots help illustrate the impact of object creation and heap behavior when inserting Rows into wide Partitions. While version 2.2.13 crashed repeatedly during this test, 3.11.3 was able to write over 30 million Rows to a single Partition before Cassandra Out-of-Memory crashed. The test and results are reproduced below.

Both Cassandra versions were started as single-node clusters with default configurations, excepting heap customization in the cassandra–


In Cassandra, only the configured concurrency of memtable flushes and compactors determines how many Partitions are processed by a node, and thus how much pressure is put on its heap, at any one time. Based on this known concurrency limitation, profiling can be done by inserting data into one Partition against one Cassandra node with a small heap. These results extrapolate to production environments.

The tlp-stress tool inserted data in three separate profiling passes against both versions of Cassandra, creating wide Partitions of one hundred thousand (100K), one million (1M), or ten million (10M) Rows.

A tlp-stress profile for wide Partitions was written, as no suitable profile existed. The read to write ratio used the default setting of 1:100.

The following command lines then implemented the tlp-stress tool:

# To write 100000 rows into one partition
tlp-stress run Wide --replication "{'class':'SimpleStrategy','replication_factor': 1}" -n 100K

# To write 1M rows into one partition
tlp-stress run Wide --replication "{'class':'SimpleStrategy','replication_factor': 1}" -n 1M

# To write 10M rows into one partition
tlp-stress run Wide --replication "{'class':'SimpleStrategy','replication_factor': 1}" -n 10M

Each time tlp-stress executed it was immediately followed by a command to ensure the full count of specified Rows passed through the memtable flush and were written to disk:

    nodetool flush

The graphs in the sections below, taken from the Apache NetBeans Profiler, illustrate how the ShallowIndexedEntry in Cassandra version 3.11 avoids keeping IndexInfo objects in memory.

Notably, the IndexInfo objects are instantiated far more often, but are referenced for much shorter periods of time. The Garbage Collector is more effective at removing short-lived objects, as illustrated by the GC pause times being barely present in the Cassandra 3.11 graphs compared to Cassandra 2.2 where GC pause times overwhelm the JVM.

Wide Partitions in Cassandra 2.2

Benchmarks were against Cassandra 2.2.13

One Partition with 100K Rows (2.2.13)

The following three screenshots show the number of IndexInfo objects instantiated during the write benchmark, during compaction, and a heap profile.

The partition grew to be ~40MB.

Objects created during tlp-stress

Objects created during subsequent major compaction

Heap profiled during tlp-stress and major compaction

The above diagrams do not have their x-axis expanded to the full width, but still encompass the startup, stress test, flush, and compaction periods of the benchmark.

When stress testing starts with tlp-stress the CPU Time and Surviving Generations starts to climb. During this time the heap also starts to increase and decrease more frequently as it fills up and then the Garbage Collector cleans it out. In these diagrams the garbage collection intervals are easy to identify and isolate from one another.

One Partition with 1M Rows (2.2.13)

Here, the first two screenshots show the number of IndexInfo objects instantiated during the write benchmark and during the subsequent compaction process. The third screenshot shows the CPU & GC Pause Times and the heap profile from the time writes started through when the compaction was completed.

The partition grew to be ~400MB.

Already at this size the Cassandra JVM is GC thrashing and has occasionally Out-of-Memory crashed.

Objects created during tlp-stress

Objects created during subsequent major compaction

Heap profiled during tlp-stress and major compaction

The above diagrams display a longer running benchmark, with the quiet period during the startup barely noticeable on the very left-hand side of each diagram. The number of garbage collection intervals and the oscillations in heap size are far more frequent. The GC Pause Time during the stress testing period is now consistently higher and comparable to the CPU Time. It only dissipates when the benchmark performs the flush and compaction.

One Partition with 10M Rows (2.2.13)

In this final test of Cassandra version 2.2.13, the results were difficult to reproduce reliably, as more often than not this test Out-of-Memory crashed from GC heap pressure.

The first two screenshots show the number of IndexInfo objects instantiated during the write benchmark and during the subsequent compaction process. The third screenshot shows the GC Pause Time and the heap profile from the time writes started until compaction was completed.

The partition grew to be ~4GB.

Objects created during tlp-stress

Objects created during subsequent major compaction

Heap profiled during tlp-stress and major compaction

The above diagrams display consistently very high GC Pause Time compared to CPU Time. Any Cassandra node under this much duress from garbage collection is not healthy. It is suffering from high read latencies, could become blacklisted by other nodes due to its lack of responsiveness, and even crash altogether from Out-of-Memory errors (as it did often during this benchmark).

Wide Partitions in Cassandra 3.11.3

Benchmarks were against Cassandra 3.11.3

In this series, the graphs demonstrate how IndexInfo objects are created either from memtable flushes or from deserialization off disk. The ShallowIndexedEntry is used in Cassandra 3.11.3 when deserializing the IndexInfo objects from the -Index.db file.

Neither form of IndexInfo object resides long in the heap, and thus the GC Pause Time is barely visible in comparison to Cassandra 2.2.13, despite the additional IndexInfo objects created via deserialization.

One Partition with 100K Rows (3.11.3)

As with the earlier version test of this size, the first two screenshots show the number of IndexInfo objects instantiated during the write benchmark and during the subsequent compaction process. The third screenshot shows the CPU & GC Pause Time and the heap profile from the time writes started through when the compaction was completed.

The partition grew to be ~40MB, the same as with Cassandra 2.2.13.

Objects created during tlp-stress

Objects created during subsequent major compaction

Heap profiled during tlp-stress and major compaction

The diagrams above are roughly comparable to the first diagrams presented under Cassandra 2.2.13, except here the x-axis is expanded to full width. Note there are significantly more instantiated IndexInfo objects, but barely any noticeable GC Pause Time.

One Partition with 1M Rows (3.11.3)

Again, the first two screenshots show the number of IndexInfo objects instantiated during the write benchmark and during the subsequent compaction process. The third screenshot shows the CPU & GC Pause Time and the heap profile over the time writes started until the compaction was completed.

The partition grew to be ~400MB, the same as with Cassandra 2.2.13.

Objects created during tlp-stress

Objects created during subsequent major compaction

Heap profiled during tlp-stress and major compaction

The above diagrams show a wildly oscillating heap as many IndexInfo objects are created, and many garbage collection intervals, yet the GC Pause Time remains low, if at all noticeable.

One Partition with 10M Rows (3.11.3)

Here again, the first two screenshots show the number of IndexInfo objects instantiated during the write benchmark and during the subsequent compaction process. The third screenshot shows the CPU & GC Pause Time and the heap profile over the time writes started until the compaction was completed.

The partition grew to be ~4GB, the same as with Cassandra 2.2.13.

Objects created during tlp-stress

Objects created during subsequent major compaction

Heap profiled during tlp-stress and major compaction

Unlike this profile in 2.2.13, the cluster remains stable, as it was when running 1M Rows per Partition. The above diagrams display an oscillating heap as IndexInfo objects are created, and many garbage collection intervals, yet GC Pause Time remains low, if at all noticeable.

Maximum Rows in 1GB Heap (3.11.3)

In an attempt to push Cassandra 3.11.3 to the limit, we ran a test to see how much data could be written to a single Partition before Cassandra Out-of-Memory crashed.

The result was 30M+ rows, which is ~12GB of data on disk.

This is similar to the limit of 17GB of data written to a single Partition as Robert Stupp found in CASSANDRA-9754 when using a 5GB java heap.

Maximum Rows in 1GB Heap

What about Reads

The following graph reruns the benchmark on Cassandra version 3.11.3 over a longer period of time with a read to write ratio of 10:1. It illustrates that reads of wide Partitions do not create the heap pressure that writes do.

Reading from wide Partitions


While the 400MB community recommendation for Partition size is clearly appropriate for version 2.2.13, version 3.11.3 shows that performance improvements have created a tremendous ability to handle wide Partitions: they can easily be an order of magnitude larger than in earlier versions of Cassandra without nodes crashing through heap pressure.

The trade-off for better supporting wide Partitions in Cassandra 3.11.3 is increased read latency as Row offsets now need to be read off disk. However, modern SSDs and kernel pagecaches take advantage of larger configurations of physical memory providing enough IO improvements to compensate for the read latency trade-offs.

The improved stability and falling back on better hardware to deal with the read latency issue allows Cassandra operators to worry less about how to store massive amounts of data in different schemas and unexpected data growth patterns on those schemas.

Once CASSANDRA-9754 lands, custom B+ tree structures will be used to more effectively look up the deserialized Row offsets, further avoiding the deserialization and instantiation of short-lived, unused IndexInfo objects.

Backblaze logo

Advanced Node Replace

Instaclustr has a number of internal tools and procedures that help us keep Cassandra clusters healthy. One of those tools allows us to replace the instance backing a Cassandra node while keeping the IPs and data. Dynamic Resizing, released in June 2017, uses technology developed for that tool to allow customers to scale Cassandra clusters vertically based on demand.

Initially, the replace tool operated by detaching the volumes from the instance being replaced, then re-attaching the volumes to the new instance. This limited the usage of the tool to EBS-backed instances. Another often-requested extension to the tool was resizing a Data Centre to a different node size to upgrade to a newly added node size, or to switch over to the resizable class nodes.

One option for changing instance size where we could not just detach and reattach data volumes was to use Cassandra’s native node replace functionality to replace each instance in the cluster in a rolling fashion. At first this approach seems attractive, and it can be conducted with zero downtime. However, quite some time ago we realised that, unless you run a repair between each replacement, this approach almost certainly loses a small amount of data when any replace operation exceeds the hinted hand-off window. As a result, we relied on fairly tedious and complex methods of rolling upgrades involving attaching and re-attaching EBS volumes.

To address this problem, we have recently extended the replace tool to remove these limitations and support the complete rolling replace use case. The new “copy data” replace mode replaces a node in the following stages:

  1. Provision the new node of the desired size
  2. Copy most of the data from the old node to the new node
  3. Stop the old node
  4. Reallocate IPs
  5. Join the replacement node to the cluster

Provisioning is trivial with our powerful provisioning system, but copying large amounts of data from a live node presents some specific challenges: we had to do so without creating too much additional load on a cluster that might already be under stress, and we had to work carefully within constraints created by Cassandra’s hinted handoff system.

We explored a number of solutions to the problem of copying data to the new node while minimising the impact to the running nodes.  After discarding several alternatives, we settled on a solution built on Instaclustr’s existing, proven backup/restore system. This ensures minimal resource strain on the node being replaced as we only need to copy the data added since the last backup was taken and most of the data is already stored in the cloud storage.

Stopping the old node without losing data requires stopping Cassandra and uploading the data that has been written since the previous step. This process usually completes within 10 minutes, ensuring minimal degradation of cluster performance.

After all of the data is on the new node, the old node is terminated, its public and private IPs are transferred to the new node, and Cassandra is started on the new node. As the replacement node joins, it receives the data that was missed during the short downtime as hinted handoffs.

The new solution has allowed us to standardise our approach to node replacement for all instance types (local and external storage) using the proven technology of our Cassandra backup system to improve the overall performance of the process. At the moment, this resize functionality is controlled by our administrators and can be requested by customers via our support channel. We will likely make the functionality available directly to users in the future.

The post Advanced Node Replace appeared first on Instaclustr.

Anomalia Machina 5 – Application Monitoring with Prometheus: Massively Scalable Anomaly Detection with Apache Kafka and Cassandra

1 Introduction

In order to scale Anomalia Machina we plan to run the application (load generator and detector pipeline) on multiple EC2 instances. We are working on using Kubernetes (AWS EKS) to automate this, and progress so far is described in this webinar. However, before we can easily run a Kubernetes deployed application at scale we need to be able to collect and view application specific metrics from distributed instances. In this blog we’ll explore how an Open Source monitoring tool called Prometheus could do this.

Anomalia Machina 5 - Frankenstein - Modern Prometheus

Exactly two hundred years ago, Mary Shelley’s 1818 “Frankenstein” was oddly subtitled “The Modern Prometheus”. Who was the original Prometheus?  Prometheus was one of the Greek Titans who was responsible for creating and giving gifts to creatures. By the time he got around to people the supply of approved gifts (e.g. fur, claws, feathers, wings, etc) had run out, so Prometheus decided to steal the sacred fire from Mount Olympus so humans could keep warm (and eat cooked meat, etc). There is a giant mosaic on the Melbourne fire brigade building showing Prometheus stealing fire from Zeus:

Anomalia Machine 5 - Prometheus stealing fire from Zeus

This didn’t turn out so well for either Prometheus (who was punished by being chained to a rock and attacked by a liver-eating eagle every day), or humans (who were given another “gift”, Pandora – don’t open the box!). Here’s the panel to the right of the above mosaic showing the destructive side of fire resulting from Pandora’s curiosity:

Anomalia Machina 5 - destructive side of fire resulting from Pandora’s curiosity

So Frankenstein and Prometheus are both responsible for creating monsters.  But what does Prometheus have to do with monitoring? Well, Prometheus didn’t really steal fire, as the original was still there. He just copied it.  This is like monitoring, as we don’t want to impact the system being monitored, just gain insight into it at a distance.

2 Prometheus Overview

Prometheus is a popular open source monitoring and alerting system. It was developed originally by SoundCloud, made Open Source and then in 2016 was accepted as the 2nd project in the Cloud Native Computing Foundation (CNCF).

It is intended for monitoring both applications and servers. Here’s what the Prometheus architecture looks like:

Anomalia Machina 5 - Prometheus Architecture Diagram

2.1 Prometheus Components

The main components are the Prometheus server (responsible for service discovery, retrieving metrics from monitored applications, storing metrics, and analysis of time series data with PromQL, a query language), a metrics data model, a built in simple graphing GUI, and native support for Grafana. There’s also an optional alert manager (with alerts defined by the query language), and an optional push gateway (for short lived application monitoring).

Application monitoring tools typically take one of three approaches to capturing metrics: (1) Instrumentation – special custom code has to be added to the source code of the application being monitored, (2) Agents – special general purpose code is added to the application environment, which automatically captures standard application metrics, or (3) Spying – which relies on either interceptors or network taps to observe calls or data flow between systems.

Prometheus allows for a combination of (1) Instrumentation and (2) Agents (which are called Exporters). Instrumentation requires access to the source code, but does allow for the capture of custom metrics, and is programming language agnostic.

There are client libraries for instrumenting applications in many programming languages: with four officially supported client libraries (Go, Java/Scala, Python, Ruby); many unofficial libraries (anyone for LISP?); or you can write your own.

The Exporters allow for automatic instrumentation of supported third-party software, and lots are available for databases, hardware, messaging systems, storage, HTTP, cloud APIs, logging, monitoring and more. The JMX exporter exposes metrics for JVM-based software. Some software exposes Prometheus metrics directly so no exporters are needed.

There is a node exporter for monitoring host hardware and kernel metrics, and the Java client includes collectors for garbage collection, memory pools, JMX, classloading, and thread counts. These can be added individually or use DefaultExports.initialize(); to conveniently register them all.

Anomalia Machina 5 - Way of instrumenting and monitoring brains

2.2 What Prometheus doesn’t do

Prometheus does one thing well – metrics monitoring. Prometheus is therefore not a traditional complete Application Performance Management (APM) solution (c.f. Dynatrace, AppDynamics, etc) as it is focussed on server side metrics collection, graphing and alerting. For example, it doesn’t do distributed call tracing or service topology discovery and visualisation, performance analytics, or End User Experience Monitoring (EUEM, but see this github extension which pushes metrics from browsers to Prometheus).

It also doesn’t do active control, as information flow is strictly one way.  Previous R&D such as the Open Geospatial Consortium (OGC) Sensor Web Enablement, SensorML, and extensions I was involved with for graphing sensor data in a web client over XMPP (CSIRO Sensor and Sensor Networks Research Progress Report July 2006 to June 2007, pp49-50), and the earlier work on Web Services Grid Computing (e.g. the Globus Toolkit and WSRF), were more grandiose in scope and provided a rich semantics and standards for services based management of both sensors and actuators.

Prometheus runs as a single server by default, but it can scale using federations of servers (not to be confused with the musical Prometheus scale invented by Scriabin!).

Anomalia Machina - The Prometheus Scale

The Prometheus Scale
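Federation works by having one Prometheus server scrape selected series from another via its /federate endpoint. A minimal prometheus.yml fragment might look like this (the child server address and the match expression are placeholders):

```yaml
scrape_configs:
  - job_name: 'federate'
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{job="prometheusTest"}'   # which series to pull from the child server
    static_configs:
      - targets:
        - 'child-prometheus:9090'    # placeholder child server address
```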

2.3 The Prometheus Data Model

Anomalia Machina - Prometheus Data Model

A time series of 420,000 years of Antarctic ice core data

Prometheus metrics consist of time series data. These are samples of a timestamp (with millisecond precision) and a value (64-bit float). Every metric has a name and a set of key:value pairs (labels). The metric name is a String, and by convention includes the name of the thing being monitored, the logical type, and the units, e.g. http_requests_total, which could have labels for “method” (“GET”, “PUT”) and “handler” (e.g. “/login”, “/search”). Note that Prometheus automatically adds some labels to metrics:

  • job: The configured job name that the target belongs to.
  • instance: The <host>:<port> part of the target’s URL that was scraped.

Note that as units are not explicit, if you need to convert from one unit to another, you will need to do this manually (and carefully!) in the query language. We’ll come to scraping momentarily.
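As an illustration (the metric values are invented), here is what a scraped metric looks like in the Prometheus text exposition format; after the scrape, Prometheus stores each series with the automatic job and instance labels added:

```
# HELP http_requests_total Total HTTP requests.
# TYPE http_requests_total counter
http_requests_total{method="GET",handler="/login"} 1027
http_requests_total{method="PUT",handler="/search"} 3
```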

2.4 The Prometheus Metric Types

Prometheus provides four metric types as follows.

A Counter is used for values that only increase. A Gauge can go up and down, and is useful for the current value of something, or for counts that can increase and decrease. A Histogram samples observations (e.g. request durations or response sizes) and counts them in configurable buckets; it also provides a sum of all observed values. A Summary is similar to a Histogram: it samples observations and provides a total count of observations and a sum of all observed values, but it calculates configurable quantiles over a sliding time window.

3 Prometheus Example

For Anomalia Machina, what exactly do we want to monitor?  Here’s the functional architecture diagram from previous blogs:

Anomalia Machina 5 - Functional Architecture Prototype

For the application itself, we want to monitor “generic” metrics such as throughput (TPS) and response times (s) for the Kafka load generator (Kafka producer), the Kafka consumer, and the Cassandra client (the anomaly detector).  We also want to monitor some application specific metrics such as the number of rows returned for each Cassandra read, and the number of anomalies detected. It would also be nice to monitor hardware metrics for each AWS EC2 instance the application is running on (e.g. CPU), and eventually the Kafka and Cassandra metrics as well, so we can have all the monitoring in one place.

For simplicity and initial testing we’ll start with a dummy pipeline with 3 methods (producer, consumer, detector).   We use a Counter with the name “prometheusTest_requests_total” to measure the number of times each pipeline stage is successfully executed, and a label called “stage” to distinguish the different stage counts (with “total” used for the total pipeline count). A second counter with the name “prometheusTest_anomalies_total” is used to keep track of the number of anomalies detected. A Gauge (which can go up and down) with the name “prometheusTest_duration_seconds” is used to record the duration of each stage in seconds (using a label called “stage” to distinguish the stages, and “total” for total pipeline duration). The methods are instrumented to increment the counter metrics after each successful stage execution or anomaly detected (using the inc() method), and time stages and set the value of the gauge metric (using the setToTime() method). Here’s the example code:


import java.io.IOException;

import io.prometheus.client.Counter;
import io.prometheus.client.Gauge;
import io.prometheus.client.exporter.HTTPServer;
import io.prometheus.client.hotspot.DefaultExports;

// Demo of how we plan to use the Prometheus Java client to instrument Anomalia Machina.
// Note that the Anomalia Machina application will have the Kafka producer, the Kafka consumer
// and the rest of the pipeline running in multiple separate processes/instances,
// so metrics from each will have different host/port combinations.
public class PrometheusBlog {

    static String appName = "prometheusTest";

    // Counters can only increase in value (until process restart).
    // Execution count: use a single Counter for all stages of the pipeline,
    // stages are distinguished by labels.
    static final Counter pipelineCounter = Counter.build()
            .name(appName + "_requests_total")
            .help("Count of executions of pipeline stages")
            .labelNames("stage")
            .register();

    // In theory we could also use pipelineCounter to count anomalies found, using another label,
    // but there is less potential for confusion with a separate counter. It doesn't need a label.
    static final Counter anomalyCounter = Counter.build()
            .name(appName + "_anomalies_total")
            .help("Count of anomalies detected")
            .register();

    // A Gauge can go up and down, and is used to measure the current value of some variable.
    // pipelineGauge measures the duration in seconds of each stage, using labels.
    static final Gauge pipelineGauge = Gauge.build()
            .name(appName + "_duration_seconds")
            .help("Gauge of stage durations in seconds")
            .labelNames("stage")
            .register();

    public static void main(String[] args) {
        // Allow default JVM metrics to be exported.
        DefaultExports.initialize();

        // Metrics are pulled by Prometheus, so create an HTTP server as the endpoint.
        // Note: if there are multiple processes running on the same server, each needs a
        // different port number, and all IPs and port numbers must be added to the
        // Prometheus configuration file.
        HTTPServer server = null;
        try {
            server = new HTTPServer(1234);
        } catch (IOException e) {
            e.printStackTrace();
        }

        // Now run 1000 executions of the complete pipeline, with random time delays
        // and an increasing rate.
        int max = 1000;
        for (int i = 0; i < max; i++) {
            // Time the complete pipeline, and increment anomalyCounter when an anomaly is found.
            pipelineGauge.labels("total").setToTime(() -> {
                producer();
                consumer();
                if (detector())
                    anomalyCounter.inc();
            });
            // Total pipeline count.
            pipelineCounter.labels("total").inc();
            System.out.println("i=" + i);

            // Increase the rate of execution by shrinking the delay between iterations.
            try {
                Thread.sleep(max - i);
            } catch (InterruptedException e) {
                e.printStackTrace();
            }
        }
        server.stop();
    }

    // The 3 stages of the pipeline: for each we increment the stage counter
    // and set the Gauge duration time.
    public static void producer() {
        class Local {};
        String name = Local.class.getEnclosingMethod().getName();
        pipelineGauge.labels(name).setToTime(() -> {
            try {
                Thread.sleep(1 + (long)(Math.random()*20));
            } catch (InterruptedException e) {
                e.printStackTrace();
            }
        });
        pipelineCounter.labels(name).inc();
    }

    public static void consumer() {
        class Local {};
        String name = Local.class.getEnclosingMethod().getName();
        pipelineGauge.labels(name).setToTime(() -> {
            try {
                Thread.sleep(1 + (long)(Math.random()*10));
            } catch (InterruptedException e) {
                e.printStackTrace();
            }
        });
        pipelineCounter.labels(name).inc();
    }

    // detector returns true if an anomaly was detected, else false.
    public static boolean detector() {
        class Local {};
        String name = Local.class.getEnclosingMethod().getName();
        pipelineGauge.labels(name).setToTime(() -> {
            try {
                Thread.sleep(1 + (long)(Math.random()*200));
            } catch (InterruptedException e) {
                e.printStackTrace();
            }
        });
        pipelineCounter.labels(name).inc();
        return (Math.random() > 0.95);
    }
}

So, now that we’ve instrumented the sample code how do we run Prometheus, and how does Prometheus actually get the metric values from the code? As hinted above, unlike many enterprise APM solutions which have metrics pushed to them, Prometheus gets metrics by Polling (or “scraping”) the instrumented code.  In Prometheus this just means having an HTTP server running in your application code. In the above code we created a HTTP server on port 1234 to allow Prometheus to scrape the metrics. The Prometheus getting started guide provides simple instructions on downloading and running Prometheus.
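For reference, what Prometheus scrapes from that endpoint is just plain text in the Prometheus exposition format. The lines below are an illustrative, hand-written sample of what http://localhost:1234/metrics could return for the metrics defined above (the values are made up):

```
# HELP prometheusTest_requests_total Count of executions of pipeline stages
# TYPE prometheusTest_requests_total counter
prometheusTest_requests_total{stage="producer",} 42.0
prometheusTest_requests_total{stage="total",} 42.0
# HELP prometheusTest_anomalies_total Count of anomalies detected
# TYPE prometheusTest_anomalies_total counter
prometheusTest_anomalies_total 2.0
# HELP prometheusTest_duration_seconds Gauge of stage durations in seconds
# TYPE prometheusTest_duration_seconds gauge
prometheusTest_duration_seconds{stage="total",} 0.124
```

Each sample is a metric name, optional labels, and a value, which is why instrumenting and scraping are so loosely coupled: any HTTP client can read it.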

The only things remaining are Maven dependencies:

<!-- The client -->
<dependency>
  <groupId>io.prometheus</groupId>
  <artifactId>simpleclient</artifactId>
  <version>0.6.0</version>
</dependency>

<!-- Hotspot JVM metrics -->
<dependency>
  <groupId>io.prometheus</groupId>
  <artifactId>simpleclient_hotspot</artifactId>
  <version>0.6.0</version>
</dependency>

<!-- Exposition HTTPServer -->
<dependency>
  <groupId>io.prometheus</groupId>
  <artifactId>simpleclient_httpserver</artifactId>
  <version>0.6.0</version>
</dependency>

<!-- Pushgateway exposition -->
<dependency>
  <groupId>io.prometheus</groupId>
  <artifactId>simpleclient_pushgateway</artifactId>
  <version>0.6.0</version>
</dependency>

And finally telling Prometheus where to scrape from.  For simple deployments and testing this information can be added to the configuration file (the default file is prometheus.yml):


global:
  scrape_interval:     15s # By default, scrape targets every 15 seconds.

# scrape_configs has jobs and targets to scrape for each.
scrape_configs:
  # job 1 is for testing prometheus instrumentation from multiple application processes.
  # The job name is added as a label job=<job_name> to any timeseries scraped from this config.
  - job_name: 'testprometheus'

    # Override the global default and scrape targets from this job every 5 seconds.
    scrape_interval: 5s

    # this is where to put multiple targets, e.g. for Kafka load generators and detectors
    static_configs:
      - targets: ['localhost:1234', 'localhost:1235']

  # job 2 provides operating system metrics (e.g. CPU, memory etc).
  - job_name: 'node'

    # Override the global default and scrape targets from this job every 5 seconds.
    scrape_interval: 5s

    static_configs:
      - targets: ['localhost:9100']


In this file you’ll also notice a job called ‘node’ with a port 9100. This requires the Prometheus node exporter to be downloaded and run on the server the application is running on in order to provide node metrics.

Polling for metrics has some pros and cons. Polling too frequently may overload the applications, while polling too infrequently may result in unacceptable lags between events occurring and being detected. However, it makes for a very loosely coupled and robust system: the applications can run without Prometheus, and Prometheus will keep trying to poll an application that is temporarily unavailable until it is available again. You can even have multiple Prometheus servers polling the same application. If you can’t poll application metrics for any reason, or the application is highly transient, then Prometheus also offers a push gateway instead.

Anomalia Machina 5 - Sisyphus pushing a boulder up a hill

Sisyphus pushing a boulder up a hill (and repeat)

3.1 Initial Results

What do we see on Prometheus? Not much. Unlike commercial APM tools there are no default dashboards, so you have to create graphs from scratch. This is where expressions come in. You can view and select metric names from a drop-down menu (or in a browser using http://localhost:9090/metrics), enter them into the expression box, and execute them. Most of the time you’ll get an error message and have to fix something up. Results can be viewed in a table or, for some result types, graphed. By default, expressions only go back 5 minutes to find data; if there isn’t any, you get an error. Note that Prometheus monitors itself, so even if you don’t have an instrumented application to monitor you can still try out Prometheus.

How do we graph our example data? If you graph a counter you’ll just see an increasing line.

Anomalia Machina 5 - Graph a counter

How do you turn a counter into a rate graph? Simple, with the irate or rate function.

Anomalia Machina 5 - irate or rate function
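As a sketch, the kind of expression behind such a rate graph looks like this (the metric name comes from the example code above; the 1-minute range window is an arbitrary choice):

```
rate(prometheusTest_requests_total{stage="total"}[1m])
```

`rate()` gives the average per-second increase over the window, while `irate()` uses only the last two samples and reacts faster to changes.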

Here’s the graph of pipeline stage durations, which doesn’t need a rate function as it’s a Gauge not a Counter:

Anomalia Machina 5 - Graph of Pipeline stage duration

The inbuilt Prometheus graphing is limited; for example, you can’t graph multiple metrics on the same graph. This is where Grafana comes in. Grafana has inbuilt support for Prometheus and is recommended for any serious graphs.

Once you’ve installed it and have it running, open your browser and go to http://localhost:3000/. There’s good documentation on both the Prometheus and Grafana sites on how to use Grafana with Prometheus. Create a Prometheus data source and then add a Prometheus graph, entering a Prometheus expression as usual.

This graph shows both duration and rate metrics on the same graph.

Anomalia Machina 5 - Duration and rate metrics

A handy hint: If you can’t see anything on the graph it’s likely that you are looking at the wrong time range! A quick fix is to use a “Quick range”, the “Last 5 minutes” is good.

Also note that you can have rules that precompute rates to potentially speed aggregation up.
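For example, a recording-rule file (referenced from rule_files in prometheus.yml) could precompute that rate; the group and rule names below are made up for illustration:

```yaml
# rules.yml - hypothetical recording rule precomputing the pipeline request rate
groups:
  - name: pipeline_rules
    rules:
      - record: job:prometheusTest_requests:rate1m
        expr: rate(prometheusTest_requests_total{stage="total"}[1m])
```

Dashboards can then query the cheap precomputed series `job:prometheusTest_requests:rate1m` instead of re-evaluating the rate on every refresh.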

We can also graph node metrics such as CPU utilisation. This blog has a good explanation.

For example, to compute CPU Utilisation as a percentage use this expression:

100 - (avg by (instance)

(irate(node_cpu_seconds_total{job="node",mode="idle"}[5m])) * 100)

4 What next?

In this blog we’ve seen how Prometheus can be used to monitor an example application. Next we plan to try it out on the actual Anomalia Machina application code, deployed in a production-like environment. This raises a few challenges, including: (1) How will Prometheus discover the services to monitor when there are lots of them, they are running in Kubernetes, and they are potentially highly transient? (2) How can we also get end-to-end call traces and dependency diagrams? Possibly with OpenTracing. (3) Where should the Prometheus server run? And (4) how can we also include Instaclustr Kafka and Cassandra cluster metrics in Prometheus?

Finally, a cool thing I found is this Instaclustr Cassandra Exporter for Prometheus. It’s really well documented, is a good starting point for understanding Prometheus further, and would be an ideal way of integrating Cassandra metrics from a self-managed cluster into your application monitoring using Prometheus.

If you are interested in using Prometheus for monitoring the Instaclustr managed Cassandra service, there is a 3rd party Prometheus metrics exporter which works with the Instaclustr monitoring API.

To find out more about our Managed Platform for Open Source Technologies, contact us or sign up for a free trial.

The post Anomalia Machina 5 – Application Monitoring with Prometheus: Massively Scalable Anomaly Detection with Apache Kafka and Cassandra appeared first on Instaclustr.

Rolling Reboots with cstarpar

Welcome to the third post in our cstar series. So far, the first post gave an introduction to cstar, while the second post explained how to extend cstar with custom commands. In this post we will look at cstar’s cousin cstarpar. Both utilities deliver the same topology-aware orchestration, yet cstarpar executes commands locally, allowing operations cstar is not capable of.

Using ssh

cstarpar relies heavily on ssh working smoothly and without any user prompts. When we run a command with cstar, it will take the command, ssh into the remote host, and execute the command on our behalf. For example, we can run hostname on each node of a 3-node cluster:

$ cstar run --seed-host --command hostname
$ cat ~/.cstar/jobs/8ff6811e-31e7-4975-bec4-260eae885ef6/ec2-*/out

If we switch to cstarpar, it will execute the hostname command locally and we will see something different:

$ cstarpar --seed-host hostname
$ cat ~/.cstar/jobs/a1735406-ae58-4e44-829b-9e8d4a90fd06/ec2-*/out

To make cstarpar execute commands on remote machines we just need to make the command explicitly use ssh:

$ cstarpar --seed-host "ssh {} hostname"
cat ~/.cstar/jobs/2c54f7a1-8982-4f2e-ada4-8b45cde4c4eb/ec2-*/out

Here we can see the hostname was executed on the remote hosts.


The true advantage of local execution is that there is no need for interaction with the remote host. This approach allows operations that would normally prevent that interaction, such as reboots. For example, the following command reboots the entire cluster in a topology-aware fashion, albeit very roughly because it gracelessly kills all processes, including Cassandra:

$ cstarpar --seed-host -- "ssh {} sudo reboot &"

Note that this example used the sudo reboot & command. The reboot command on its own causes the reboot immediately. This is so drastic that it causes Python’s subprocess module to think an error occurred. Placing the & after the command, which runs it in the background, allows the shell execution to return to Python cleanly. Once the host is down, cstarpar will mark the host as such in the job status report.
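The effect of that trailing & is easy to demonstrate locally with any slow command (here a plain sleep stands in for reboot); a minimal sketch:

```shell
# Without '&' the shell would block for the full 2 seconds before returning.
# With '&' the command is backgrounded and control returns to the caller
# immediately, which is what keeps Python's subprocess module happy.
sleep 2 &
echo "shell returned immediately, background pid: $!"
wait  # only needed here so the demo exits cleanly
```

The same mechanism lets cstarpar record the job status and move on while the remote host goes down.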

It is important to ensure the hosts are configured to start the Cassandra process automatically after the reboot, because just like cstar, cstarpar will proceed to the next hosts only once all hosts are up, and will otherwise wait indefinitely for the rebooted host to come back.

Since cstarpar can execute local commands and scripts, it need not support complex commands in the same way cstar does. To run a complex command with cstarpar, we can use a script file. To illustrate this, the script below will add a graceful shutdown of Cassandra before executing the actual reboot:

$ cat ~/

#!/bin/bash
# The target host's FQDN is passed in as the first argument
FQDN=$1

echo "Draining Cassandra"
ssh ${FQDN} nodetool drain && sleep 5

echo "Stopping Cassandra process"
ssh ${FQDN} sudo service cassandra stop && sleep 5

echo "Rebooting"
ssh ${FQDN} sudo reboot &

The reboot command then runs like this:

$ cstarpar --seed-host -- "bash /absolute/path/to/ {}"

Replication and Conclusion

For this post, I used a simple three node cluster provisioned with tlp-cluster. cstarpar relies heavily on ssh working smoothly and without user prompts. Initially, I attempted the connection without any specific ssh configuration on my laptop or the AWS hosts, so the ssh calls looked like this:

$ cstarpar --seed-host ${SEED_IP} --ssh-identity-file=${PATH_TO_KEY}  --ssh-username ubuntu "ssh -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no ubuntu@{} hostname"

In the reboot script I also had to add the same options:

ssh -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no -i ${PATH_TO_KEY} ubuntu@${FQDN} sudo reboot &

Once configured, I was able to harness the full power of cstarpar, which supplements cstar by executing commands locally. This is useful for operations for which cstar’s mode of operation is not well suited, such as reboots. To get the most value from cstarpar, it is critical to have ssh configured to run smoothly and without any user prompts.

Step-by-step monitoring Cassandra with Prometheus and Grafana

In this blog, I’m going to give a detailed guide on how to monitor a Cassandra cluster with Prometheus and Grafana.

For this, I’m using a new VM which I’m going to call “Monitor VM”. In this blog post, I’m going to work on how to install the tools. In a second one, I’m going to go through the details of how to use and configure Grafana dashboards to get the most out of your monitoring!

High level plan

Monitor VM

  1. Install Prometheus
  2. Configure Prometheus
  3. Install Grafana

Cassandra VMs

  1. Download prometheus JMX-Exporter
  2. Configure JMX-Exporter
  3. Configure Cassandra
  4. Restart Cassandra

Detailed Plan

Monitor VM

Step 1. Install Prometheus

  $ wget
  $ tar xvfz prometheus-*.tar.gz
  $ cd prometheus-*

Step 2. Configure Prometheus

  $ vim /etc/prometheus/prometheus.yaml

  global:
    scrape_interval: 15s

  scrape_configs:
    # Cassandra config
    - job_name: 'cassandra'
      scrape_interval: 15s
      static_configs:
        - targets: ['cassandra01:7070', 'cassandra02:7070', 'cassandra03:7070']

Step 3. Create storage and start Prometheus

  $ mkdir /data
  $ chown prometheus:prometheus /data
  $ prometheus --config.file=/etc/prometheus/prometheus.yaml --storage.tsdb.path=/data

Step 4. Install Grafana

  $ wget
  $ sudo apt-get install -y adduser libfontconfig
  $ sudo dpkg -i grafana_5.1.4_amd64.deb

Step 5. Start Grafana

  $ sudo service grafana-server start

Cassandra nodes

Step 1. Download JMX-Exporter:

  $ mkdir /opt/jmx_prometheus
  $ wget

Step 2. Configure JMX-Exporter

  $ vim /opt/jmx_prometheus/cassandra.yml

  lowercaseOutputName: true
  lowercaseOutputLabelNames: true
  whitelistObjectNames: ["org.apache.cassandra.metrics:*"]
  rules:
    - pattern: org.apache.cassandra.metrics<type=(Connection|Streaming), scope=(\S*), name=(\S*)><>(Count|Value)
      name: cassandra_$1_$3
      labels:
        address: "$2"
    - pattern: org.apache.cassandra.metrics<type=(ColumnFamily), name=(RangeLatency)><>(Mean)
      name: cassandra_$1_$2_$3
    - pattern: org.apache.cassandra.net<type=(FailureDetector)><>(DownEndpointCount)
      name: cassandra_$1_$2
    - pattern: org.apache.cassandra.metrics<type=(Keyspace), keyspace=(\S*), name=(\S*)><>(Count|Mean|95thPercentile)
      name: cassandra_$1_$3_$4
      labels:
        "$1": "$2"
    - pattern: org.apache.cassandra.metrics<type=(Table), keyspace=(\S*), scope=(\S*), name=(\S*)><>(Count|Mean|95thPercentile)
      name: cassandra_$1_$4_$5
      labels:
        "keyspace": "$2"
        "table": "$3"
    - pattern: org.apache.cassandra.metrics<type=(ClientRequest), scope=(\S*), name=(\S*)><>(Count|Mean|95thPercentile)
      name: cassandra_$1_$3_$4
      labels:
        "type": "$2"
    - pattern: org.apache.cassandra.metrics<type=(\S*)(?:, ((?!scope)\S*)=(\S*))?(?:, scope=(\S*))?, name=(\S*)><>(Count|Value)
      name: cassandra_$1_$5
      labels:
        "$1": "$4"
        "$2": "$3"

Step 3. Configure Cassandra

  echo 'JVM_OPTS="$JVM_OPTS -javaagent:/opt/jmx_prometheus/jmx_prometheus_javaagent-0.3.0.jar=7070:/opt/jmx_prometheus/cassandra.yml"' >> conf/

Step 4. Restart Cassandra

  $ nodetool flush
  $ nodetool drain
  $ sudo service cassandra restart

And now, if you have no errors (and you shouldn’t!) your Prometheus is ingesting your Cassandra metrics!

Wait for the next blog post where I will guide you through a good Grafana configuration!