We Compared ScyllaDB and Memcached and… We Lost?
An in-depth look at database and cache internals, and the tradeoffs in each.

ScyllaDB would like to publicly acknowledge dormando (Memcached maintainer) and Danny Kopping for their contributions to this project, as well as thank them for their support and patience.

Engineers behind ScyllaDB – the database for predictable performance at scale – joined forces with Memcached maintainer dormando to compare both technologies head-to-head, in a collaborative, vendor-neutral way. The results reveal that:

Both Memcached and ScyllaDB maximized disk and network bandwidth while being stressed under similar conditions, sustaining similar performance overall.
While ScyllaDB required data modeling changes to fully saturate the network throughput, Memcached required additional IO threads to saturate disk I/O.
Although ScyllaDB showed better latencies than Memcached’s pipelined requests to disk, Memcached latencies were better for individual requests.

This document explains our motivation for these tests, provides a summary of the tested scenarios and results, then presents recommendations for anyone who might be deciding between ScyllaDB and Memcached. Along the way, we analyze the architectural differences behind these two solutions and discuss the tradeoffs involved in each.

There’s also a detailed Gitbook for this project, with a more extensive look at the tests and results and links to the specific configurations you can use to perform the tests yourself.

Bonus: dormando and I recently discussed this project at P99 CONF, a highly technical conference on performance and low latency engineering. Watch on demand

Why have we done this?
First and foremost, ScyllaDB invested lots of time and engineering resources optimizing our database to deliver predictable low latencies for real-time data-intensive applications. ScyllaDB’s shard-per-core, shared-nothing architecture, userspace I/O scheduler, and internal cache implementation (fully bypassing the Linux page cache) are some notable examples of such optimizations.

Second: performance converges over time. In-memory caches have long been regarded as one of the fastest infrastructure components around. Yet, it’s been a few years now since caching solutions started to look into the realm of flash disks. These initiatives obviously pose an interesting question: If an in-memory cache can rely on flash storage, then why can’t a persistent database also work as a cache?

Third: We previously discussed 7 Reasons Not to Put a Cache in Front of Your Database and recently explored how specific teams have successfully replaced their caches with ScyllaDB.

Fourth: At last year’s P99 CONF, Danny Kopping gave us an enlightening talk, Cache Me If You Can, where he explained how Memcached Extstore helped Grafana Labs scale their cache footprint 42x while driving down costs.

And finally, despite the (valid) criticism that performance benchmarks receive, they still play an important role in driving innovation. Benchmarks are a useful resource for engineers seeking in-house optimization opportunities.

Now, on to the comparison.

Setup
Instances
Tests were carried out using the following AWS instance types:

Loader: c7i.16xlarge (64 vCPUs, 128GB RAM)
Memcached: i4i.4xlarge (16 vCPUs, 128GB RAM, 3.75TB NVMe)
ScyllaDB: i4i.4xlarge (16 vCPUs, 128GB RAM, 3.75TB NVMe)

All instances can deliver up to 25Gbps of network bandwidth.
Keep in mind that, especially during tests maxing out the promised network capacity, we noticed throttling that shrank the bandwidth down to the instances’ baseline capacity.

Optimizations and Settings
To overcome potential bottlenecks, the following optimizations and settings were applied:

AWS side: All instances used a Cluster placement strategy, following the AWS Docs: “This strategy enables workloads to achieve the low-latency network performance necessary for tightly-coupled node-to-node communication that is typical of high-performance computing (HPC) applications.”

Memcached: Version 1.6.25, compiled with Extstore enabled. Except where denoted, run with 14 threads, pinned to specific CPUs. The remaining 2 vCPUs were assigned to CPU 0 (core & HT sibling) to handle Network IRQs, as specified by the sq_split mode in seastar perftune.py. CAS operations were disabled to save space on per-item overhead. The full command line arguments were:

taskset -c 1-7,9-15 /usr/local/memcached/bin/memcached -v -A -r -m 114100 -c 4096 --lock-memory --threads 14 -u scylla -C

ScyllaDB: Default settings as configured by ScyllaDB Enterprise 2024.1.2 AMI (ami-id: ami-018335b47ba6bdf9a) in an i4i.4xlarge. This includes the same CPU pinning settings as described above for Memcached.

Stressors
For Memcached loaders, we used mcshredder, part of memcached’s official testing suite. The applicable stressing profiles are in the fee-mendes/shredders GitHub repository.

For ScyllaDB, we used cassandra-stress, as shipped with ScyllaDB, and specified comparable workloads as the ones used for Memcached.

Tests and Results
The following is a summary of the tests we conducted and their results. If you want a more detailed description and analysis, go to the extended writeup of this project.

RAM Caching Efficiency
The more items you can fit into RAM, the better your chance of getting cache hits. More cache hits result in significantly faster access than going to disk. Ultimately, that improves latency.

This project began by measuring how many items we could store to each datastore. Throughout our tests, the key was between 4 to 12 bytes (key0 .. keyN) for Memcached, and 12 bytes for ScyllaDB. The value was fixed to 1000 bytes.

Memcached
Memcached stored roughly 101M items until eviction started. It’s memory efficient. Out of Memcached’s 114G assigned memory, this is approximately 101G worth of values, without considering the key size and other flags:

Memcached stored 101M items in memory before evictions started

ScyllaDB
ScyllaDB stored between 60M and 61M items before evictions started. This is no surprise, given that its protocol requires more data to be stored as part of a write (such as the write timestamp since epoch, row liveness, etc). ScyllaDB also persists data to disk as you go, which means that Bloom Filters (and optionally Indexes) need to be stored in memory for subsequent disk lookups.

With ScyllaDB, eviction starts under memory pressure while trying to load 61M rows

Takeaways
Memcached stored approximately 65% more in-memory items than ScyllaDB (see the rough arithmetic below).
ScyllaDB rows have higher per-item overhead to support a wide-column orientation.
In ScyllaDB, Bloom Filters, Index Caching, and other components are also stored in-memory to support efficient disk lookups, contributing to yet another layer of overhead.
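As a rough sanity check on these takeaways, the item counts line up with a simple per-item cost estimate. The sketch below is ours, not part of the benchmark: the 56-byte memcached item overhead and the assumption that ScyllaDB’s cache sees a comparable memory budget are approximations for illustration only.

# Back-of-the-envelope check of the in-memory capacity numbers above.
# Assumptions (not from the original post): ~56 bytes of per-item metadata for
# memcached with CAS disabled, and that most of the 114G assignment is usable
# for items. Treat this as a rough sanity check, not an exact accounting.

MEMCACHED_RAM = 114 * 1024**3          # -m 114100 (about 114 GiB)
VALUE_SIZE = 1000                      # fixed 1000-byte values
AVG_KEY_SIZE = 10                      # keys ranged from 4 to 12 bytes
MEMCACHED_ITEM_OVERHEAD = 56           # assumed item header size, CAS disabled

per_item = VALUE_SIZE + AVG_KEY_SIZE + MEMCACHED_ITEM_OVERHEAD
print(f"estimated memcached capacity: {MEMCACHED_RAM / per_item / 1e6:.0f}M items")
# -> roughly 115M items in the ideal case; slab allocation and other bookkeeping
#    bring the observed number down to ~101M.

# ScyllaDB stored ~61M rows before eviction. If a comparable amount of memory
# were available to its row cache, that would imply on the order of
# 114 GiB / 61M ~= 2 KB per row, i.e. nearly twice memcached's per-item cost,
# which is consistent with the ~65% difference in item counts reported above.
print(f"implied ScyllaDB per-row footprint: {MEMCACHED_RAM / 61e6:.0f} bytes")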
Read-only In-Memory Workload
The ideal (though unrealistic) workload for a cache is one where all the data fits in RAM, so that reads don’t require disk accesses and no evictions or misses occur. Both ScyllaDB and Memcached employ LRU (Least Recently Used) logic for freeing up memory: When the system runs under pressure, items get evicted from the LRU’s tail; these are typically the least active items.

Taking evictions and cache misses out of the picture helps measure and set a performance baseline for both datastores. It places the focus on what matters most for these kinds of workloads: read throughput and request latency.

In this test, we first warmed up both stores with the same payload sizes used during the previous test. Then, we initiated reads against their respective ranges for 30 minutes.

Memcached
Memcached achieved an impressive 3 million GETs per second, fully maximizing AWS NIC bandwidth (25 Gbps)!

Memcached kept a steady 3M rps, fully maximizing the NIC throughput

The parsed results show that p99.999 responses completed below 1ms:

stat: cmd_get : Total Ops: 5503513496 Rate: 3060908/s
=== timer mg ===
1-10us       0            0.000%
10-99us      343504394    6.238%
100-999us    5163057634   93.762%
1-2ms        11500        0.00021%

ScyllaDB
To read more rows in ScyllaDB, we needed to devise a better data model for client requests due to protocol characteristics (in particular, no pipelining). With a clustering key, we could fully maximize ScyllaDB’s cache, resulting in a significant improvement in the number of cached rows. We ingested 5M partitions, each with 16 clustering keys, for a total of 80M cached rows. As a result, the number of records within the cache significantly improved compared to the key-value numbers shown previously.

As dormando correctly pointed out (thanks!), this configuration is significantly different than the previous Memcached set-up. While the Memcached workload always hits an individual key-value pair, a single request in ScyllaDB results in several rows being returned. Notably, the same results could be achieved using Memcached by feeding the entire payload as the value under a single key, with the results scaling accordingly.

We explained the reasons for these changes in the detailed writeup. There, we covered characteristics of the CQL protocol (such as the per-item overhead [compared to memcached] and no support for pipelining) which make wide partitions more efficient on ScyllaDB than single-key fetches.

With these adjustments, our loaders ran a total of 187K read ops/second over 30 minutes. Each operation resulted in 16 rows getting retrieved. Similarly to memcached, ScyllaDB also maximized the NIC throughput. It served roughly 3M rows/second solely from in-memory data:

ScyllaDB Server Network Traffic as reported by node_exporter

Number of read operations (left) and rows being hit (right) from cache during the exercise

ScyllaDB exposes server-side latency information, which is useful for analyzing latency without the network. During the test, ScyllaDB’s server-side p99 latency remained within 1ms bounds:

Latency and Network traffic from ScyllaDB matching the adjustments done

The client-side percentiles are, unsurprisingly, higher than the server-side latency, with a read P99 of 0.9ms.

cassandra-stress P99 latency histogram

Takeaways
Both Memcached and ScyllaDB fully saturated the network; to prevent saturating the maximum network packets per second, Memcached relied on request pipelining whereas ScyllaDB was switched to a wide-column orientation (see the quick arithmetic check below).
ScyllaDB’s cache showed considerable gains following a wide-column schema, able to store more items compared to the previous simple key-value orientation.
On the protocol level, Memcached’s protocol is simpler and lightweight, whereas ScyllaDB’s CQL provides richer features but can be heavier.
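Both results bump into the same physical ceiling. As a quick back-of-the-envelope check (ours, not part of the original benchmark), roughly 3M one-kilobyte items per second is essentially a 25 Gbps NIC once you multiply it out, ignoring protocol framing:

# Rough check that ~3M x 1KB responses/s saturates a 25 Gbps NIC.
# Framing/protocol overhead is ignored, so real traffic is slightly higher.

NIC_GBPS = 25
PAYLOAD_BYTES = 1000

memcached_rps = 3_060_908            # GETs/s reported by mcshredder
scylla_rows_per_s = 187_000 * 16     # 187K ops/s x 16 rows per operation

for name, rate in [("memcached", memcached_rps), ("scylladb", scylla_rows_per_s)]:
    gbps = rate * PAYLOAD_BYTES * 8 / 1e9
    print(f"{name}: ~{gbps:.1f} Gbps of payload vs a {NIC_GBPS} Gbps NIC")
# Both land in the ~24 Gbps range, i.e. right at the NIC's advertised limit.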
Adding Disks to the Picture
Measuring flash storage performance introduces its own set of challenges, which makes it almost impossible to fully characterize a given workload realistically. For disk-related tests, we decided to measure the most pessimistic situation: compare both solutions serving data (mostly) from block storage, knowing that:

The likelihood of realistic workloads doing this is somewhere close to zero
Users should expect numbers in between the previous optimistic cache workload and the pessimistic disk-bound workload in practice

Memcached Extstore
The Extstore wiki page provides extensive detail into the solution’s inner workings. At a high level, it allows memcached to keep its hash table and keys in memory, but store values on external storage.

During our tests, we populated memcached with 1.25B items with a value size of 1KB and a key size of up to 14 bytes:

Evictions started as soon as we hit approximately 1.25B items, despite free disk space

With Extstore, we stored around 11X the number of items compared to the previous in-memory workload until evictions started to kick in (as shown in the right-hand panel in the image above). Even though 11X is an already impressive number, the total data stored on flash was only 1.25TB out of the total 3.5TB provided by the AWS instance.

Read-Only Performance
For the actual performance tests, we stressed Extstore against item sizes of 1KB and 8KB. The table below summarizes the results:

Test Type             Items per GET  Payload Size  IO Threads  GET Rate  P99
perfrun_metaget_pipe  16             1KB           32          188K/s    4~5 ms
perfrun_metaget       1              1KB           32          182K/s    <1 ms
perfrun_metaget_pipe  16             1KB           64          261K/s    5~6 ms
perfrun_metaget       1              1KB           64          256K/s    1~2 ms
perfrun_metaget_pipe  16             8KB           16          92K/s     5~6 ms
perfrun_metaget       1              8KB           16          90K/s     <1 ms
perfrun_metaget_pipe  16             8KB           32          110K/s    3~4 ms
perfrun_metaget       1              8KB           32          105K/s    <1 ms

ScyllaDB
We populated ScyllaDB with the same number of items as used for memcached. Although ScyllaDB showed higher GET rates than memcached, it did so under slightly higher tail latencies compared to memcached’s non-pipelining workloads. This is summarized below:

Test Type  Items per GET  Payload Size  GET Rate   Server-side P99  Client-side P99
1KB Read   1              1KB           268.8K/s   2ms              2.4ms
8KB Read   1              8KB           156.8K/s   1.54ms           1.9ms

Takeaways
Extstore required considerable tuning to its settings in order to fully saturate flash storage I/O.
Due to Memcached architecture, smaller payloads are unable to fully utilize the available disk space, providing smaller gains compared to ScyllaDB (the sketch below estimates why).
ScyllaDB rates were overall higher than Memcached in a key-value orientation, especially under higher payload sizes.
Latencies were better than pipelined requests, but slightly higher than individual GETs in Memcached.
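The “evictions despite free disk space” behavior follows from Extstore keeping keys and item metadata in RAM, so for small values the in-memory index, not flash capacity, becomes the limit. The estimate below is ours and hedged: the ~90-byte per-item RAM cost is an assumed figure chosen to be roughly consistent with the observed numbers, not a documented memcached constant.

# Why 1KB items stopped at ~1.25B despite ~2.25TB of unused flash:
# Extstore keeps the key plus item metadata (and a small pointer to the value's
# location on flash) in RAM, so the in-memory index becomes the limit.
# The ~90-byte per-item RAM cost below is an assumption for illustration,
# roughly consistent with the observed numbers, not a figure from memcached docs.

RAM_BYTES = 114 * 1024**3      # memory assigned to memcached (-m 114100)
RAM_PER_ITEM = 90              # assumed: item header + key (<=14B) + extstore pointer
DISK_BYTES = int(3.5 * 1e12)   # usable NVMe capacity on the i4i.4xlarge

ram_bound = RAM_BYTES // RAM_PER_ITEM
for value_size in (1000, 8192):
    disk_bound = DISK_BYTES // value_size
    limit = min(ram_bound, disk_bound)
    bound = "RAM (index)" if ram_bound < disk_bound else "disk"
    print(f"{value_size}B values: ~{limit/1e9:.2f}B items, bounded by {bound}")
# 1KB values are index-bound around ~1.3B items (flash only ~1.25TB full),
# while larger values push the limit toward the disk capacity instead.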
Overwrite Workload
Following our previous disk results, we then compared both solutions in a read-mostly workload targeting the same throughput (250K ops/sec). The workload in question is a slight modification of memcached’s ‘basic’ test for Extstore, with 10% random overwrites. It is considered a “semi-worst case scenario.”

Memcached
Memcached achieved a rate of slightly under 249K during the test. Although the write rates remained steady during the duration of the test, we observed that reads fluctuated slightly throughout the run:

Memcached: Read-mostly workload metrics

We also observed slightly high extstore_io_queue metrics despite the lowered read ratios, but latencies still remained low. These results are summarized below:

Operation  IO Threads  Rate     P99 Latency
cmd_get    64          224K/s   1~2 ms
cmd_set    64          24.8K/s  <1 ms

ScyllaDB
The ScyllaDB test was run using 2 loaders, each with half of the target rate. Even though ScyllaDB achieved a slightly higher throughput (259.5K), the write latencies were kept low throughout the run and the read latencies were higher (similar to memcached):

ScyllaDB: Read-mostly workload metrics

The table below summarizes the client-side run results across the two loaders:

Loader   Rate      Write P99  Read P99
loader1  124.9K/s  1.4ms      2.6ms
loader2  124.6K/s  1.3ms      2.6ms

Takeaways
Both Memcached and ScyllaDB write rates were steady, with reads fluctuating slightly throughout the run.
ScyllaDB writes still account for the commitlog overhead, which sits in the hot write path.
ScyllaDB server-side latencies were similar to those observed in the Memcached results, although client-side latencies were slightly higher.

Read a more detailed analysis in the Gitbook for this project.

Wrapping Up
Both memcached and ScyllaDB managed to maximize the underlying hardware utilization across all tests and keep latencies predictably low. So which one should you pick? The real answer: It depends.

If your existing workload can accommodate a simple key-value model and it benefits from pipelining, then memcached should be more suitable to your needs. On the other hand, if the workload requires support for complex data models, then ScyllaDB is likely a better fit.

Another reason for sticking with Memcached: it easily delivers traffic far beyond what a NIC can sustain. In fact, in this Hacker News thread, dormando mentioned that he could scale it up past 55 million read ops/sec for a considerably larger server. Given that, you could make use of smaller and/or cheaper instance types to sustain a similar workload, provided the available memory and disk footprint suffice for your workload needs.

A different angle to consider is the data set size. Even though Extstore provides great cost savings by allowing you to store items beyond RAM, there’s a limit to how many keys can fit per GB of memory. Workloads with very small items should observe smaller gains compared to those with larger items. That’s not the case with ScyllaDB, which allows you to store billions of items irrespective of their sizes.

It’s also important to consider whether data persistence is required. If it is, then running ScyllaDB as a replicated distributed cache provides you greater resilience and non-stop operations, with the tradeoff being (as memcached correctly states) that replication halves your effective cache size. Unfortunately, Extstore doesn’t support warm restarts, and thus the failure or maintenance of a single node is prone to elevating your cache miss ratios. Whether this is acceptable or not depends on your application semantics: If a cache miss corresponds to a round trip to the database, then the end-to-end latency will be momentarily higher.

With regards to consistent hashing, memcached clients are responsible for distributing keys across your distributed servers. This may introduce some hiccups, as different client configurations will cause keys to be assigned differently, and some implementations may not be compatible with each other. These details are outlined in Memcached’s ConfiguringClient wiki.
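To illustrate the point, here is a toy example of two client-side hashing schemes disagreeing about where the same keys live. This is not how any particular memcached client library actually hashes; it only shows why mixing client implementations splits the cache:

# Toy illustration of why client-side key distribution can differ between
# memcached clients: two (hypothetical) hashing schemes place the same keys
# on different servers, so mixing client implementations splits the cache.
import hashlib
import zlib

SERVERS = ["cache-1", "cache-2", "cache-3"]

def pick_md5(key: str) -> str:
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return SERVERS[h % len(SERVERS)]

def pick_crc32(key: str) -> str:
    return SERVERS[zlib.crc32(key.encode()) % len(SERVERS)]

for key in ("key0", "key1", "key2", "key3"):
    a, b = pick_md5(key), pick_crc32(key)
    marker = "" if a == b else "   <-- disagreement: one client's hit is the other's miss"
    print(f"{key}: md5 -> {a}, crc32 -> {b}{marker}")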
ScyllaDB takes a different approach: consistent hashing is done at the server level and propagated to clients when the connection is first established. This ensures that all connected clients always observe the same topology as you scale.

So who won (or who lost)? Well… This does not have to be a competition, nor an exhaustive list outlining every single consideration for each solution. Both ScyllaDB and memcached use different approaches to efficiently utilize the underlying infrastructure. When configured correctly, both of them have the potential to provide great cost savings.

We were pleased to see ScyllaDB matching the numbers of the industry-recognized Memcached. Of course, we had no expectations of our database being “faster.” In fact, as we approach microsecond latencies at scale, the definition of faster becomes quite subjective. 🙂

Continuing the Discussion at P99 CONF
Reminder: dormando (Alan Kasindorf) and I recently discussed this project at P99 CONF, a highly technical conference on performance and low latency engineering. Watch video
Low-Latency Database Strategies Featured at P99 CONF 24
Obsessed with high-performance low-latency data systems? See the 20+ data-related tech talk sessions we’re hosting at P99 CONF 2024.

P99 CONF is a (free + online) highly-technical conference for engineers who obsess over P99 percentiles and long-tail latencies. The open source, community-focused event is hosted by ScyllaDB, the company behind the monstrously fast and scalable NoSQL database (and the adorable one-eyed sea monster).

Since database performance is so near and dear to ScyllaDB, we quite eagerly reached out to our friends and colleagues across the community to ensure a wide spectrum of distributed data systems, approaches, and challenges would be represented at P99 CONF. As you can see on our agenda, the response was overwhelming.

This year’s attendees get to hear from – and interact with – the creators of Postgres, ScyllaDB, Turso, SlateDB, SpiceDB, Arroyo, Responsive, FerretDB, and Percona. We’re also looking forward to sessions with engineers from Redis, Oracle, TigerBeetle, AWS, QuestDB, and more. There’s a series of keynotes focused on rethinking the database, deep dives into database internals, and case studies of database engineering feats at organizations like Uber, Shopify, ShareChat, and LinkedIn.

If you share our obsession with high-performance low-latency data systems, here’s a rundown of sessions to consider joining at P99 CONF 24. Register Now – It’s Free

Just In Time LSM Compaction
Aleksei Kladov (TigerBeetle)
TigerBeetle is a reliable and fast accounting database. Its primary on-disk data structure is a log-structured merge-tree. This talk is a deep dive into TigerBeetle’s compaction algorithm — “garbage collection” for LSM, which achieves several unusual goals:

Full storage determinism across replicas, enabling recovery from disk faults.
Absence of dynamic memory allocation.
Highly concurrent implementation, utilizing all available CPU and disk bandwidth.
Perfect pacing: resources are carefully balanced between compaction and normal transaction processing, avoiding starvation and guaranteeing bounded P100.

Time-Series and Analytical Databases Walk in a Bar…
Andrei Pechkurov (QuestDB)
A good time-series database also has to be a decent analytical database. This implies both SQL features and efficient query processing. That’s why we recently added many optimizations to QuestDB’s SQL engine, featuring better on-disk data layout, specialized data structures, SIMD and SWAR-based code, scalable aggregation algorithms, and parallel execution pipelines. Many of these additions can be met in popular databases, some are unique to QuestDB. In this talk, we will go through the most important optimizations, and discuss what’s yet to be added and where we are when compared with the fastest analytical databases.

The Next Chapter in the Sordid Love/Hate Relationship Between DBs and OSes
Andy Pavlo (Carnegie Mellon)
Database management systems (DBMSs) are beautiful, free-spirited software that want nothing more than to help users store and access data as quickly as possible. To achieve this goal, DBMSs have spent decades trying to avoid operating systems (OSs) at all costs. Such avoidance is necessary because OSs always try to impose their will on DBMSs and stifle their ambitions through disingenuous syscall semantics, unscalable kernel-level data structures, and excessive data copying. The many attempts to avoid the OS through kernel-bypass methods or custom hardware have such high engineering/R&D costs that few DBMSs support them.
In the end, DBMSs are stuck in an abusive relationship: they need the OS to run their software and provide them with basic functionalities (e.g., memory allocation), but they do not like how the OS treats them. However, new technologies like eBPF, which allow DBMSs to run custom code safely inside the OS kernel to override its functionality, are poised to upend this power struggle. In this talk, I will present a new design approach called “user-bypass” for building high-performance database systems and services with eBPF. I will discuss recent developments in eBPF relevant to the DBMS community and what parts of a DBMS are most amenable to using it. We will also present the design of BPF-DB, an embedded DBMS written in eBPF that provides ACID transactions over multi-versioned data and runs entirely in the Linux kernel.

Designing a Query Queue for ScyllaDB
Avi Kivity (ScyllaDB)
Database queries can be highly variable. Some are served immediately from cache, return a single row, and are done in milliseconds. Others return gigabytes or terabytes of data, take minutes or hours, and require plenty of disk I/O and compute. Deciding what concurrency to use when running these queries is a delicate balance of CPU consumption, memory consumption, and the queue designer’s nerves. A bad design can mean high latency, under-utilization of available resources, or crashing when one runs out of memory. This talk will cover the decisions we made while designing ScyllaDB’s query queue.

Reliable Data Replication
Cameron Morgan (Shopify)
Data replication is required to make data highly available to users. Highly available in this context means users can access data in a reliable, consistent and timely fashion. Because it is so important, if this problem has not come up in your work, you have certainly used such a system.

This talk focuses on the hard problems of data replication, the ones that are usually skipped in talks. What is a backfill and why do I need them to be reliable, non-blocking and often? How do you handle schema changes? How do you validate the data is correct? How can you be resistant to failure? How can you write in parallel? This talk is about the hard problems Shopify solved replicating Shopify stores to the Edge and reaching ~5M rows replicated per second with < 1s replication lag p99.

Rust: A Productive Language for Writing Database Applications
Carl Lerche (AWS)
When you think about Rust, you might think of performance, safety, and reliability, but what about productivity? Last year, I recommended considering Rust for developing high-level applications. Rust showed great promise, but its library ecosystem needed to mature. What has changed since then? Many higher-level applications sit on top of a database. In this talk, I will explore the current state of Rust libraries for database access, focusing on ergonomics and ease of use—two crucial factors in high-level database application development.

Building a Cloud Native LSM on Object Storage
Chris Riccomini (Materialized View) and Rohan Desai (Responsive)
This talk discusses the design and implementation of SlateDB, an open source cloud native storage engine built as a log-structured merge-tree (LSM) on top of an object store like S3, Google Cloud Storage (GCS), or Azure Blob Store (ABS). LSMs are traditionally built assuming data will reside on local storage. Building an LSM on object storage allows SlateDB to benefit from object storage replication and durability guarantees while presenting unique latency and cost challenges.
We’ll discuss the design decisions and tradeoffs we faced when building SlateDB.

Taming Tail Latencies in Apache Pinot with Generational ZGC
Christopher Peck (Uber)
The introduction of Generational ZGC mostly eliminated concerns around pause time for Java applications. This session will cover a real world application of Generational ZGC and the effects. The session will also cover how application level configs/features can be used to offset some of the trade-offs we encountered when switching to Generational ZGC.

Apache Pinot is an OLAP database with an emphasis on low latency. We’ll walk through how we solved large scatter gather induced tail latencies in our Pinot clusters by switching to Generational ZGC, uncovering the low latency query potential of Pinot. We’ll also cover a couple of Pinot’s features which made using Generational ZGC possible.

Elevating PostgreSQL: Benchmarking Vector Search Performance
Daniel Seybold (benchANT)
The database market is constantly evolving with new database systems addressing specific use cases such as time series data or vector search. But there is one open source database system which has been around for nearly three decades and which has gained strong momentum over the last years: PostgreSQL. Due to its pure open source approach and strong community, PostgreSQL is continuously improving on its features, performance and extensions that enable PostgreSQL to also handle specific use cases such as vector search.

Over the last years, multiple native vector database systems have been established and many NoSQL and relational database systems have released vector extensions for their database systems. The same goes for PostgreSQL, with two available vector search extensions, pgvector and pgvecto.rs. And since vector search performance is a crucial differentiating factor for every vector search database, we report on the latest vector search benchmark results for PostgreSQL with pgvector and pgvecto.rs. The benchmarking study covers multiple data sets of varying vector dimensions, also considering different PostgreSQL configurations, from a baseline configuration to tuned configurations.

Overcoming Distributed Databases Scaling Challenges with Tablets
Dor Laor (ScyllaDB)
Getting fantastic performance cannot stop at the server level. Even after rewriting your code in assembly, you would need multiple servers to run at scale and to provide availability. Clusters are often sharded to achieve good performance. In this session, I will cover tablets, a new dynamic sharding design we applied at ScyllaDB in order to maximize cpu and storage utilization dynamically and to maximize elasticity speed.

Why Databases Cache, but Caches Go to Disk
Felipe Mendes (ScyllaDB) and Alan Kasindorf (Cache Forge)
Caches and Databases are different animals. Yet, databases have always cached data and caches are exploring disks. To quantify the strengths and tradeoffs of each, ScyllaDB joined forces with Memcached’s maintainer to compare both across different scenarios. Join us as we discuss how the results trace back to each underlying architecture’s design decisions. Specifically, we’ll compare ScyllaDB’s row-based cache with Memcached’s in-memory hash table, and look at how Memcached’s External Flash Storage compares to ScyllaDB’s userspace I/O scheduler and asynchronous AIO/DIO.
Feature Store Evolution Under Cost Constraints: When Cost is Part of the Architecture
Ivan Burmistrov and David Malinge (ShareChat)
At P99 CONF 23, the ShareChat team presented the scaling challenges for the ML Feature Store so it could handle 1 billion features per second. Once the system was scaled to handle the load, the next challenge the team faced was extreme cost constraints: it was required to make the same quality system much cheaper to run. This year, we will talk about approaches the team implemented in order to optimize for cost in the Cloud environment while maintaining the same SLA for the service. The talk will touch on such topics as advanced optimizations on various levels to bring down the compute, minimizing the waste when running on Kubernetes, autoscaling challenges for stateful Apache Flink jobs, and others. The talk should be useful for those who are either interested in building or optimizing an ML Feature Store or in general looking into cost optimizations in the cloud environment.

Running Low-Latency Workloads on Kubernetes
Jimmy Zelinskie (authzed)
Configuring Kubernetes to optimally run a particular workload is best described as a continuous journey. Depending on your requirements, best practices might not only no longer apply, but actively harm performance. In this session, we document what we’ve found to work best in our journey running SpiceDB, a low-latency authorization system.

Cheating the Cloud: 50% Savings with Compression Dictionaries
Łukasz Paszkowski (ScyllaDB)
Discover how to slash networking costs by up to 50% with advanced compression techniques. This session covers a real-world case where the default LZ4 compression in Cassandra, with its limited 25% efficiency, was causing high costs in inter-zone replication. We’ll introduce a custom RPC compressor with external dictionary support that samples RPC traffic, trains optimized dictionaries, and seamlessly switches connections to these new dictionaries. Learn how this approach dramatically improves compression ratios, reduces cloud expenses, and enhances data transfer efficiency across distributed systems. It’s perfect for those looking to optimize cloud infrastructure.

Latency, Throughput & Fault Tolerance: Designing the Arroyo Streaming Engine
Micah Wylde (Arroyo)
Arroyo is a distributed, stateful stream processing engine written in Rust. It combines predictable millisecond-latency processing with the throughput of a high-performance batch query engine—on top of a distributed checkpointing implementation that provides fault tolerance and exactly-once processing. These design goals are often in tension: increasing throughput generally comes at the expense of latency, and consistent checkpointing can introduce periodic latency spikes while we wait for alignment and IO. In this talk, I will cover the distributed architecture and implementation of Arroyo including the core Arrow-based dataflow engine, algorithms for stateful windowing and aggregates, and the Chandy-Lamport inspired distributed checkpointing system.

You’re Doing It All Wrong
Michael Stonebraker (MIT, Postgres creator)
In this talk, we consider business data processing applications, which have historically been written for a three-tier architecture. Two ideas totally upset this applecart.

Idea #1: The Cloud
All enterprises are moving everything possible to the cloud as quickly as possible.
In this new environment, you are highly encouraged to use a cloud-native architecture, whereby your system is composed of distributed functions, working in parallel, and running on a serverless (and stateless) platform like AWS Lambda or Azure Functions. You program your application as a workflow of “steps.” To make systems resilient to failures you require a separate state machine and workflow manager (e.g., AWS Step Functions, Airflow, etc.). If you use this architecture, then you don’t pay for resources when your application is idle, often a major benefit. Depending on the platform, you may also get automatic resource elasticity and load balancing; additional major benefits.

Idea #2: Leverage the DBMS
Obviously, your data belongs in a DBMS. However, by extension, so does the state of your application. Keeping track of application state in the DBMS allows one to provide once-and-only-once execution semantics for your workflow. One can also use the database concept of “sagas” to allow multi-transaction applications to be done to completion or not at all. Furthermore, to go an order of magnitude faster than AWS Lambda, you need to collocate your application and the DBMS. The fastest alternative is to run your application inside the DBMS using stored procedures (SPs). However, it is imperative to overcome SP weaknesses, specifically the requirement of a different language (e.g., PL/SQL) and the absence of a debugging environment. The latter can be accomplished by persisting the database log and allowing “time travel debugging” for SPs. The former can be supported by coding SPs in a conventional language such as TypeScript.

Extending this idea to the operating environment, one can time travel the entire system, thereby allowing recovery to a previous point in time when disasters happen (errant programs, adversary intrusions, ransomware, etc.). I will discuss one such platform (DBOS) with all of the above features. In my opinion, this is an example of why “you are doing it all wrong.”

Taming Discard Latency Spikes
Patryk Wróbel (ScyllaDB)
Discover an interesting lesson related to the impact of discarding files on read and write latency on modern NVMe SSDs, learned while fixing a real-world problem in ScyllaDB. Dive into the way TRIM requests are issued when online discard is enabled for the XFS file system, the problems that may occur, and possible solutions.

Redis Alternatives Compared
Peter Zaitsev (Percona)
In my talk, I will delve into a variety of Redis alternatives, providing an unbiased analysis that encompasses emerging forks like Valkey and Redix, established contenders such as DragonflyDB and KeyDB, and unique options like Microsoft Garnet and Redka. My presentation will cover critical aspects of these alternatives, including licensing models and their implications, comparisons of feature sets and functionality, governance and community support structures, and performance considerations tailored to different use cases. You will leave with a clearer understanding of how each alternative could meet specific needs, insights into open source compliance and licensing, and an appreciation of the importance of performance and support options in choosing the right solution. Join me to clarify your options and strategize your approach in the ever-changing world of Redis alternatives.

Database Drivers: Performance Perspectives
Piotr Sarna (poolside)
This talk explains how to get the most out of database drivers by understanding their design and potential.
It’s a technical deep dive into how database drivers work underneath, and how to adjust their performance to your expectations.

Using eBPF Off-CPU Sampling to See What Your Databases Are Really Waiting For
Tanel Poder
At P99 CONF 23, I introduced the general concept of using eBPF-populated Task State Arrays to keep track of all Linux applications’ (including database engines) thread states and activity without relying on the built-in instrumentation of the application. For example, the “wait events” built into database engines are not perfect; some voluntary waits (system calls) are not properly instrumented in all database engines. There are also other involuntary waits caused by OS-level issues, like memory allocation stalls, CPU queuing, and task scheduler glitches. This year, I will show the latest eBPF-based “xcapture” tool in practical use, measuring where MySQL, Postgres, and DuckDB really spend their time, both when on CPU and sleeping. All this can be done without having to change any source code of the database engine or applications running on it.

Java Heap Memory Optimization to Improve P99 Query Latency at LinkedIn Scale
Vivek Iyer Vaidyanathan Iyer (LinkedIn)
Apache Pinot is a real-time, distributed, OLAP database designed to serve low-latency SQL queries at high throughput. It was built and open-sourced by LinkedIn and powers many site-facing use cases for low-latency real-time analytics. Pinot Servers, the work-horses of SQL query processing, store table shards on local SSDs and memory map the columnar data buffers (data, indexes, etc). In some specialized use cases where we have a P99 query SLA under 100ms, the column buffers are loaded on the Java heap as opposed to off heap (memory map). The data in these on-heap column buffers is characterized by high cardinality, featuring a high number of unique objects alongside a notable abundance of DUPLICATE objects. Duplicate objects waste almost 20% of the JVM heap per host. The memory-intensive nature of our on-heap dictionary indexes leads to high Java heap usage, resulting in spiky P99 latencies due to the unpredictable nature of Java garbage collection.

This talk will challenge the conventional notion that discourages the use of interning methodologies and showcase how the Pinot production deployments at LinkedIn saw 20% heap savings per host while improving P99 query latencies by 35% using a home-grown, efficient strategy of FALF Interning – Fixed-Size Array Lock-Free Interning. Designed as a small, fixed-size, open-hashmap-based object cache that deduplicates objects opportunistically, these interners work on all object types and are 17% faster than traditional interners. In this talk, we will present how we used JXRAY memory analysis to discover the problem, the design and implementation, and the P99 query latency improvements we observed in production at LinkedIn scale. We will discuss the general challenges of solving the duplicate-objects problem for Java-based systems where the traditional methods of tuning JVM parameters or employing native or Guava interners don’t work.

Join the Conference Online – It’s Free
Understanding Distributed System Performance… from the Grocery Store
Learn essential steps for boosting distributed system performance – explained with grocery store checkout analogies.

I visited a small local grocery store which happens to be in a touristy part of my neighborhood. If you’ve ever traveled abroad, then you’ve probably visited a store like that to stock up on bottled water without purchasing the overpriced hotel equivalent. This was one of these stores.

To my misfortune, my visit happened to coincide with a group of tourists arriving all at once to buy beverages and warm up (it’s winter!). It just so happens that selecting beverages is often much faster than buying fruit – the reason for my visit. So after I had selected some delicious apples and grapes, I ended up waiting in line behind 10 people. And there was a single cashier to serve us all. The tourists didn’t seem to mind the wait (they were all chatting in line), but I sure wish that the store had more cashiers so I could get on with my day faster.

What does this have to do with system performance?
You’ve probably experienced a similar situation yourself and have your own tale to tell. It happens so frequently that sometimes we forget how applicable these situations can be to other domain areas, including distributed systems. Sometimes when you evaluate a new solution, the results don’t meet your expectations. Why is latency high? Why is the throughput so low? Those are two of the top questions that pop up every now and then.

Many times, the challenges can be resolved by optimizing your performance testing approach, as well as better maximizing your solution’s potential. As you’ll realize, improving the performance of a distributed system is a lot like ensuring speedy checkouts in a grocery store. This blog covers 7 performance-focused steps for you to follow as you evaluate distributed systems performance.

Step #1: Measure Time
With groceries, the first step towards doing any serious performance optimization is to precisely measure how long it takes for a single cashier to scan a barcode. Some goods, like bulk fruits that require weighing, may take longer to scan than products in industrial packaging.

A common misconception is that processing happens in parallel. It does not (note: we’re not referring to capabilities like SIMD and pipelining here). Cashiers do not service more than a single person at a time, nor do they scan your products’ barcodes simultaneously. Likewise, a single CPU in a system will process one work unit at a time, no matter how many requests are sent to it.

In a distributed system, consider all the different work units you have and execute them in an isolated way against a single shard. Execute your different items with single-threaded execution and measure how many requests per second the system can process. Eventually, you may learn that different requests get processed at different rates. For example, if the system is able to process a thousand 1 KB requests/sec, the average latency is 1ms. Similarly, if throughput is 500 requests/sec for a larger payload size, then the average latency is 2ms.

Step #2: Find the Saturation Point
A cashier is never scanning barcodes all the time. Sometimes, they will be idle waiting for customers to place their items onto the checkout counter, or waiting for payment to complete. This introduces delays you’ll typically want to avoid. Likewise, every request your client submits against a system incurs, for example, network round trip time – and you will always pay a penalty under low concurrency.
To eliminate this idleness and further increase throughput, simply increase the concurrency. Do it in small increments until you observe that the throughput saturates and the latency starts to grow. Once you reach that point, congratulations! You effectively reached the system’s limits.

In other words, unless you manage to get your work items processed faster (for example, by reducing the payload size) or tune the system to work more efficiently with your workload, you won’t achieve gains past that point. You definitely don’t want to find yourself in a situation where you are constantly pushing the system against its limits, though. Once you reach the saturation area, fall back to lower concurrency numbers to account for growth and unpredictability.

Step #3: Add More Workers
If you live in a busy area, grocery store demand might be beyond what a single cashier can sustain. Even if the store happened to hire the fastest cashier in the world, they would still be busy as demand/concurrency increases. Once the saturation point is reached, it is time to hire more workers.

In the distributed systems case, this means adding more shards to the system to scale throughput under the latency you’ve previously measured. This leads us to the following formula:

Number of Workers = Target Throughput / Single worker limit

You already discovered the performance limits of a single worker in the previous exercise. To find the total number of workers you need, simply divide your target throughput by how much a single worker can sustain under your defined latency requirements.

Distributed systems like ScyllaDB provide linear scale, which simplifies the math (and total cost of ownership [TCO]). In fact, as you add more workers, chances are that you’ll achieve even higher rates than under a single worker. The reason is due to Network IRQs, and out of scope for this write-up (but see this perftune docs page for some details).

Step #4: Increase Parallelism
Think about it. The total time to check out an order is driven by the number of items in a cart divided by the speed of a single cashier. Instead of adding all the pressure on a single cashier, wouldn’t it be far more efficient to divide the items in your shopping cart (our work) and distribute them among friends who could then check out in parallel?

Sometimes the number of work items you need to process might not be evenly split across all available cashiers. For example, if you have 100 items to check out, but there are only 5 cashiers, then you would route 20 items per counter. You might wonder: “Why shouldn’t I instead route only 5 customers with 20 items each?” That’s a great question – and you probably should do that, rather than having the store’s security kick you out.

When designing real-time low-latency OLTP systems, however, you mostly care about the time it takes for a single work unit to get processed. Although it is possible to “batch” multiple requests against a single shard, it is far more difficult (though not impossible) to consistently accomplish that task in such a way that every item is owned by that specific worker. The solution is to always ensure you dispatch individual requests one at a time. Keep concurrency high enough to overcome external delays like client processing time and network RTT, and introduce more clients for higher parallelism (see the short sketch below).
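Here is a minimal sketch of that dispatch pattern in Python’s asyncio: individual requests, no cross-shard batching, and a fixed number of in-flight requests. The fetch coroutine is a hypothetical stand-in for a real client call against your datastore.

# Minimal sketch of Step #4's advice: dispatch work items individually, but keep
# enough of them in flight to hide client-side and network delays.
# `fetch` is a hypothetical stand-in for a single-key read against your datastore.
import asyncio
import random

CONCURRENCY = 64   # tune per the saturation point found in Step #2

async def fetch(key: str) -> str:
    await asyncio.sleep(random.uniform(0.001, 0.003))  # pretend network + server time
    return f"value-for-{key}"

async def worker(queue: asyncio.Queue) -> None:
    while (key := await queue.get()) is not None:
        await fetch(key)           # one request per work unit; no cross-shard batching
        queue.task_done()

async def main() -> None:
    queue: asyncio.Queue = asyncio.Queue()
    workers = [asyncio.create_task(worker(queue)) for _ in range(CONCURRENCY)]
    for i in range(10_000):
        await queue.put(f"key{i}")
    await queue.join()
    for _ in workers:
        await queue.put(None)      # signal workers to exit
    await asyncio.gather(*workers)

asyncio.run(main())

The concurrency limit, not batching, is what keeps the pipe full while every work unit remains individually routable to the shard that owns it.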
Step #5: Avoid Hotspots
Even after multiple cashiers get hired, it sometimes happens that a long line of customers queues up behind a handful of them. More often than not, you should be able to find less busy – or even totally free – cashiers simply by walking through the hallway.

This is known as a hotspot, and it often gets triggered due to unbound concurrency. It manifests in multiple ways. A common situation is when you have a traffic spike to a few popular items (load). That momentarily causes a single worker to queue a considerable amount of requests. Another example: low cardinality (uneven data distribution) prevents you from fully benefiting from the increased workforce.

There’s also another commonly overlooked situation that frequently arises. It’s when you dispatch too much work against a single worker to coordinate, and that single worker depends on other workers to complete that task. Let’s get back to the shopping analogy:

Assume you’ve found yourself on a blessed day as you approach the checkout counters. All cashiers are idle and you can choose any of them. After most of your items get scanned, you say: “Dear Mrs. Cashier, I want one of those whiskies sitting in your locked closet.” The cashier then calls for another employee to pick up your order. A few minutes later, you realize: “Oops, I forgot to pick up my toothpaste,” and another idling cashier nicely goes and picks it up for you.

This approach introduces a few problems. First, your payment needs to be aggregated by a single cashier – the one you ran into when you approached the checkout counter. Second, although we parallelized, the “main” cashier will be idle waiting for their completion, adding delays. Third, further delays may be introduced in between each additional and individual request completion: for example, when the keys of the locked closet are only held by a single employee, the total latency will be driven by the slowest response.

Consider the following pseudocode: See that? Don’t do that.

The previous pattern works nicely when there is a single work unit (or shard) to route requests to. Key-value caches are a great example of how multiple requests can get pipelined all together for higher efficiency. As we introduce sharding into the picture, this becomes a great way to undermine your latencies given the previously outlined reasons.

Step #6: Limit Concurrency
When more clients are introduced, it’s like customers inadvertently ending up at the supermarket during rush hour. Suddenly, they can easily end up in a situation where many clients all decide to queue under a handful of cashiers.

You previously discovered the maximum concurrency at which a single shard can service requests. These are hard numbers and – as you observed during small-scale testing – you won’t see any benefits if you try to push requests further. The formula goes like this:

Concurrency = Throughput * Latency

If a single shard sustains up to 5K ops/second under an average latency of 1 ms, then you can execute up to 5 concurrent in-flight requests at all times.

Later you added more shards to scale that throughput. Say you scaled to 20 shards for a total throughput goal of 100K ops/second. Intuitively, you would think that your maximum useful concurrency would become 100. But there’s a problem.

Introducing more shards to a distributed system doesn’t increase the maximum concurrency that a single shard can handle. To continue the shopping analogy, a single cashier will continue to scan barcodes at a fixed rate – and if several customers line up waiting to get serviced, their wait time will increase.
To mitigate (though not necessarily prevent) that situation, divide the maximum useful concurrency among the number of clients. For example, if you’ve got 10 clients and a maximum useful concurrency of 100, then each client should be able to queue up to 10 requests across all available shards.

This generally works when your requests are evenly distributed. However, it can still backfire when you have a certain degree of imbalance. Say all 10 clients decided to queue at least one request under the same shard. At a given point in time, that shard’s concurrency climbed to 10, double our initially discovered maximum concurrency. As a result, latency increases, and so does your P99.

There are different approaches to prevent that situation. The right one to follow depends on your application and use case semantics. One option is to limit your client concurrency even further to minimize its P99 impact. Another strategy is to throttle at the system level, allowing each shard to shed requests as soon as it queues past a certain threshold.

Step #7: Consider Background Operations
Cashiers do not work at their maximum speed at all times. Sometimes, they inevitably slow down. They drink water, eat lunch, go to the restroom, and eventually change shifts. That’s life!

It is now time for real-life production testing. Apply what you’ve learned so far and observe how the system behaves over long periods of time. Distributed systems often need to run background maintenance activities (like compactions and repairs) to keep things running smoothly. In fact, that’s precisely the reason why I recommended that you stay away from the saturation area at the beginning of this article.

Background tasks inevitably consume system resources, and are often tricky to diagnose. I commonly receive reports like “We observed a latency increase due to compactions”, only to find out later that the actual cause was something else – for example, a spike in queued requests to a given shard. Irrespective of the cause, don’t try to “throttle” system tasks. They exist and need to run for a reason. Throttling their execution will likely backfire on you eventually. Yes, background tasks slow down a given shard momentarily (that’s normal!). Your application should simply prefer other less busy replicas (or cashiers) when it happens. For a great detailed discussion of these points, see Brian Taylor’s insights during his How to Maximize Database Concurrency talk.

Applying These Steps
Hopefully, you are now empowered to address questions like “why is latency high?” or “why is throughput so low?” As you start evaluating performance, start small. This minimizes costs and gives you fine-grained control during each step. If latencies are sub-optimal at small scale, it either means you are pushing a single shard too hard, or that your expectations are off. Do not engage in larger scale testing until you are happy with the performance a single shard gives you.

Once you feel comfortable with the performance of a single shard, scale capacity accordingly. Keep an eye on concurrency at all times and watch out for imbalances, mitigating or preventing them as needed. When you find yourself in a situation where throughput no longer increases but the system is idling, add more clients to increase parallelism.

These concepts generally apply to every distributed system out there, including ScyllaDB. Our shard-per-core architecture linearly scales, making it easy for you to follow through the steps we discussed here.
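To make the sizing arithmetic from Steps #3 and #6 concrete, here is a small worked example using the illustrative numbers from this article:

# Worked example of the formulas from Steps #3 and #6, using illustrative numbers:
#   Concurrency        = Throughput * Latency
#   Number of Workers  = Target Throughput / Single Worker Limit

per_shard_throughput = 5_000      # ops/s a single shard sustained in isolation
avg_latency_s = 0.001             # 1 ms average latency at that rate
target_throughput = 100_000       # ops/s goal for the whole cluster
num_clients = 10                  # loaders/application instances issuing requests

per_shard_concurrency = per_shard_throughput * avg_latency_s          # ~5 in-flight requests
shards_needed = target_throughput / per_shard_throughput              # ~20 shards
max_useful_concurrency = per_shard_concurrency * shards_needed        # ~100 in-flight total
per_client_budget = max_useful_concurrency / num_clients              # ~10 per client

print(f"per-shard concurrency:   {per_shard_concurrency:.0f}")
print(f"shards needed:           {shards_needed:.0f}")
print(f"max useful concurrency:  {max_useful_concurrency:.0f}")
print(f"per-client budget:       {per_client_budget:.0f}")
# Remember the caveat from Step #6: adding shards raises the cluster-wide budget,
# not the per-shard limit, so imbalance can still push a single shard past ~5.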
If you’d like to know more about how we can help, book a technical session with us.

Introducing Netflix’s Key-Value Data Abstraction Layer
Vidhya Arvind, Rajasekhar Ummadisetty, Joey Lynch, Vinay Chella
Introduction
At Netflix our ability to deliver seamless, high-quality, streaming experiences to millions of users hinges on robust, global backend infrastructure. Central to this infrastructure is our use of multiple online distributed databases such as Apache Cassandra, a NoSQL database known for its high availability and scalability. Cassandra serves as the backbone for a diverse array of use cases within Netflix, ranging from user sign-ups and storing viewing histories to supporting real-time analytics and live streaming.
Over time as new key-value databases were introduced and service owners launched new use cases, we encountered numerous challenges with datastore misuse. Firstly, developers struggled to reason about consistency, durability and performance in this complex global deployment across multiple stores. Second, developers had to constantly re-learn new data modeling practices and common yet critical data access patterns. These include challenges with tail latency and idempotency, managing “wide” partitions with many rows, handling single large “fat” columns, and slow response pagination. Additionally, the tight coupling with multiple native database APIs — APIs that continually evolve and sometimes introduce backward-incompatible changes — resulted in org-wide engineering efforts to maintain and optimize our microservice’s data access.
To overcome these challenges, we developed a holistic approach that builds upon our Data Gateway Platform. This approach led to the creation of several foundational abstraction services, the most mature of which is our Key-Value (KV) Data Abstraction Layer (DAL). This abstraction simplifies data access, enhances the reliability of our infrastructure, and enables us to support the broad spectrum of use cases that Netflix demands with minimal developer effort.
In this post, we dive deep into how Netflix’s KV abstraction works, the architectural principles guiding its design, the challenges we faced in scaling diverse use cases, and the technical innovations that have allowed us to achieve the performance and reliability required by Netflix’s global operations.
The Key-Value Service
The KV data abstraction service was introduced to solve the persistent challenges we faced with data access patterns in our distributed databases. Our goal was to build a versatile and efficient data storage solution that could handle a wide variety of use cases, ranging from the simplest hashmaps to more complex data structures, all while ensuring high availability, tunable consistency, and low latency.
Data Model
At its core, the KV abstraction is built around a two-level map architecture. The first level is a hashed string ID (the primary key), and the second level is a sorted map of a key-value pair of bytes. This model supports both simple and complex data models, balancing flexibility and efficiency.
HashMap<String, SortedMap<Bytes, Bytes>>
For complex data models such as structured Records or time-ordered Events, this two-level approach handles hierarchical structures effectively, allowing related data to be retrieved together. For simpler use cases, it also represents flat key-value Maps (e.g. id → {"" → value}) or named Sets (e.g. id → {key → ""}). This adaptability allows the KV abstraction to be used in hundreds of diverse use cases, making it a versatile solution for managing both simple and complex data models in large-scale infrastructures like Netflix.
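As a toy illustration of that shape (not Netflix’s implementation, and using the third-party sortedcontainers package for the sorted second level), the same two-level structure can host records, flat maps, and sets:

# Toy in-memory illustration of the two-level map model
# (HashMap<String, SortedMap<Bytes, Bytes>>). This only shows how Records,
# flat Maps and Sets fit the same shape -- it is not Netflix's implementation.
from sortedcontainers import SortedDict   # assumed dependency: pip install sortedcontainers

store: dict[str, SortedDict] = {}

def put_item(record_id: str, key: bytes, value: bytes) -> None:
    store.setdefault(record_id, SortedDict())[key] = value

def get_items(record_id: str, start: bytes = b"", end: bytes | None = None):
    items = store.get(record_id, SortedDict())
    # The sorted second level makes ordered range scans over a record cheap.
    return [(k, items[k]) for k in items.irange(start, end)]

# Structured record: related items retrieved together, ordered by item key
put_item("user:42", b"profile", b"{...}")
put_item("user:42", b"settings", b"{...}")

# Flat key-value map: id -> {"" -> value}
put_item("counter:views", b"", b"1234")

# Named set: id -> {member -> ""}
put_item("watchlist:42", b"movie:99", b"")

print(get_items("user:42"))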
The KV data can be visualized at a high level, as shown in the diagram below, where three records are shown.
message Item (
Bytes key,
Bytes value,
Metadata metadata,
Integer chunk
)
Database Agnostic Abstraction
The KV abstraction is designed to hide the implementation details of the underlying database, offering a consistent interface to application developers regardless of the optimal storage system for that use case. While Cassandra is one example, the abstraction works with multiple data stores, including EVCache, DynamoDB, and RocksDB.
For example, when implemented with Cassandra, the abstraction leverages Cassandra’s partitioning and clustering capabilities. The record ID acts as the partition key, and the item key as the clustering column:
The corresponding Data Definition Language (DDL) for this structure in Cassandra is:
CREATE TABLE IF NOT EXISTS <ns>.<table> (
id text,
key blob,
value blob,
value_metadata blob,
PRIMARY KEY (id, key))
WITH CLUSTERING ORDER BY (key <ASC|DESC>)
Namespace: Logical and Physical Configuration
A namespace defines where and how data is stored, providing logical and physical separation while abstracting the underlying storage systems. It also serves as central configuration of access patterns such as consistency or latency targets. Each namespace may use different backends: Cassandra, EVCache, or combinations of multiple. This flexibility allows our Data Platform to route different use cases to the most suitable storage system based on performance, durability, and consistency needs. Developers just provide their data problem rather than a database solution!
In this example configuration, the ngsegment namespace is backed by both a Cassandra cluster and an EVCache caching layer, allowing for highly durable persistent storage and lower-latency point reads.
"persistence_configuration":[
{
"id":"PRIMARY_STORAGE",
"physical_storage": {
"type":"CASSANDRA",
"cluster":"cassandra_kv_ngsegment",
"dataset":"ngsegment",
"table":"ngsegment",
"regions": ["us-east-1"],
"config": {
"consistency_scope": "LOCAL",
"consistency_target": "READ_YOUR_WRITES"
}
}
},
{
"id":"CACHE",
"physical_storage": {
"type":"CACHE",
"cluster":"evcache_kv_ngsegment"
},
"config": {
"default_cache_ttl": 180s
}
}
]
Key APIs of the KV Abstraction
To support diverse use-cases, the KV abstraction provides four basic CRUD APIs:
PutItems — Write one or more Items to a Record
The PutItems API is an upsert operation: it can insert new data or update existing data in the two-level map structure.
message PutItemRequest (
IdempotencyToken idempotency_token,
string namespace,
string id,
List<Item> items
)
As you can see, the request includes the namespace, Record ID, one or more items, and an idempotency token to ensure retries of the same write are safe. Chunked data can be written by staging chunks and then committing them with appropriate metadata (e.g. number of chunks).
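For illustration only, a write through a hypothetical client stub might look like the sketch below; the KeyValueClient interface and record types are assumptions, not the actual Netflix API:
import java.nio.charset.StandardCharsets;
import java.time.Instant;
import java.util.List;
import java.util.UUID;

// Hypothetical sketch of issuing an idempotent PutItems call; none of these types are the real client.
public class PutItemsExample {
    record Item(byte[] key, byte[] value) {}
    record IdempotencyToken(Instant generationTime, String token) {}
    record PutItemRequest(IdempotencyToken idempotencyToken, String namespace, String id, List<Item> items) {}

    interface KeyValueClient {
        void putItems(PutItemRequest request); // assumed RPC stub
    }

    static void writeProfileItem(KeyValueClient client, String userId) {
        // The same token is reused on retries/hedges so the server can de-duplicate the write.
        IdempotencyToken token = new IdempotencyToken(Instant.now(), UUID.randomUUID().toString());
        Item item = new Item("last_title".getBytes(StandardCharsets.UTF_8),
                             "some value".getBytes(StandardCharsets.UTF_8));
        client.putItems(new PutItemRequest(token, "user_profiles", userId, List.of(item)));
    }
}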
GetItems — Read one or more Items from a Record
The GetItems API provides a structured and adaptive way to fetch data using ID, predicates, and selection mechanisms. This approach balances the need to retrieve large volumes of data while meeting stringent Service Level Objectives (SLOs) for performance and reliability.
message GetItemsRequest (
String namespace,
String id,
Predicate predicate,
Selection selection,
Map<String, Struct> signals
)
The GetItemsRequest includes several key parameters:
- Namespace: Specifies the logical dataset or table
- Id: Identifies the entry in the top-level HashMap
- Predicate: Filters the matching items and can retrieve all items (match_all), specific items (match_keys), or a range (match_range)
- Selection: Narrows the returned response, for example page_size_bytes for pagination, item_limit for limiting the total number of items across pages, and include/exclude to include or exclude large values from the response
- Signals: Provides in-band signaling to indicate client capabilities, such as supporting client compression or chunking.
The GetItemResponse message contains the matching data:
message GetItemResponse (
List<Item> items,
Optional<String> next_page_token
)
- Items: A list of retrieved items based on the Predicate and Selection defined in the request.
- Next Page Token: An optional token indicating the position for subsequent reads if needed, essential for handling large data sets across multiple requests. Pagination is a critical component for efficiently managing data retrieval, especially when dealing with large datasets that could exceed typical response size limits.
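To make the pagination contract concrete, here is a hedged sketch of a caller draining a record page by page; the request/response shapes and client stub are assumptions, not the actual API:
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

// Hypothetical paging loop: keep issuing GetItems until no next_page_token is returned.
public class PagingExample {
    record Item(byte[] key, byte[] value) {}
    record GetItemsResponse(List<Item> items, Optional<String> nextPageToken) {}

    interface KeyValueClient {
        GetItemsResponse getItems(String namespace, String id, Optional<String> pageToken); // assumed stub
    }

    static List<Item> readAllItems(KeyValueClient client, String namespace, String id) {
        List<Item> all = new ArrayList<>();
        Optional<String> pageToken = Optional.empty();
        do {
            GetItemsResponse page = client.getItems(namespace, id, pageToken);
            all.addAll(page.items());
            pageToken = page.nextPageToken(); // empty when the record is exhausted
        } while (pageToken.isPresent());
        return all;
    }
}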
DeleteItems — Delete one or more Items from a Record
The DeleteItems API provides flexible options for removing data, including record-level, item-level, and range deletes — all while supporting idempotency.
message DeleteItemsRequest (
IdempotencyToken idempotency_token,
String namespace,
String id,
Predicate predicate
)
Just like in the GetItems API, the Predicate allows one or more Items to be addressed at once:
- Record-Level Deletes (match_all): Removes the entire record in constant latency regardless of the number of items in the record.
- Item-Range Deletes (match_range): This deletes a range of items within a Record. Useful for keeping “n-newest” or prefix path deletion.
- Item-Level Deletes (match_keys): Deletes one or more individual items.
Some storage engines (any store which defers true deletion) such as Cassandra struggle with high volumes of deletes due to tombstone and compaction overhead. Key-Value optimizes both record and range deletes to generate a single tombstone for the operation — you can learn more about tombstones in About Deletes and Tombstones.
Item-level deletes create many tombstones, but KV hides that storage engine complexity via TTL-based deletes with jitter. Instead of deleting an item immediately, its metadata is updated to mark it as expired, with a randomly jittered TTL applied to stagger the actual deletions. This technique maintains read pagination protections. While this doesn’t completely solve the problem, it reduces load spikes and helps maintain consistent performance while compaction catches up. These strategies help maintain system performance, reduce read overhead, and meet SLOs by minimizing the impact of deletes.
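A minimal sketch of the jittered-TTL idea; the base delay and jitter window below are assumed values, not production settings:
import java.time.Duration;
import java.util.concurrent.ThreadLocalRandom;

// Sketch: compute a randomly jittered TTL so item-level deletes don't all expire (and compact) at once.
public class JitteredTtl {
    private static final Duration BASE_TTL = Duration.ofHours(1);      // assumed base delay
    private static final Duration MAX_JITTER = Duration.ofMinutes(30); // assumed jitter window

    static Duration deleteTtl() {
        long jitterSeconds = ThreadLocalRandom.current().nextLong(MAX_JITTER.toSeconds() + 1);
        return BASE_TTL.plusSeconds(jitterSeconds);
    }

    public static void main(String[] args) {
        // e.g. "mark item expired, purge after PT1H17M42S"
        System.out.println("mark item expired, purge after " + deleteTtl());
    }
}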
Complex Mutate and Scan APIs
Beyond simple CRUD on single Records, KV also supports complex multi-item and multi-record mutations and scans via the MutateItems and ScanItems APIs. PutItems also supports atomic writes of large blob data within a single Item via a chunked protocol. These complex APIs require careful consideration to ensure predictable, linear low latency, and we will share details on their implementation in a future post.
Design Philosophies for reliable and predictable performance
Idempotency to fight tail latencies
To ensure data integrity, the PutItems and DeleteItems APIs use idempotency tokens, which uniquely identify each mutative operation and guarantee that operations are logically executed in order, even when hedged or retried for latency reasons. This is especially crucial in last-write-wins databases like Cassandra, where ensuring the correct order and de-duplication of requests is vital.
In the Key-Value abstraction, idempotency tokens contain a generation timestamp and random nonce token. Either or both may be required by backing storage engines to de-duplicate mutations.
message IdempotencyToken (
Timestamp generation_time,
String token
)
At Netflix, client-generated monotonic tokens are preferred due to their reliability, especially in environments where network delays could impact server-side token generation. This combines a client-provided monotonic generation_time timestamp with a 128-bit random UUID token. Although clock-based token generation can suffer from clock skew, our tests on EC2 Nitro instances show drift is minimal (under 1 millisecond). In some cases that require stronger ordering, regionally unique tokens can be generated using tools like Zookeeper, or globally unique tokens such as transaction IDs can be used.
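A hedged sketch of generating such a token on the client, combining a monotonic wall-clock timestamp with a random 128-bit nonce; the monotonicity guard is one possible implementation, not Netflix’s actual code:
import java.time.Instant;
import java.util.UUID;
import java.util.concurrent.atomic.AtomicLong;

// Sketch: client-side idempotency token = monotonic generation timestamp + random 128-bit UUID nonce.
public class IdempotencyTokens {
    record IdempotencyToken(Instant generationTime, String token) {}

    private static final AtomicLong lastMicros = new AtomicLong();

    static IdempotencyToken next() {
        // Nudge the timestamp forward if the clock reads the same or earlier than the last token,
        // so tokens from this client are strictly monotonic even across small clock steps.
        long nowMicros = Instant.now().toEpochMilli() * 1_000;
        long micros = lastMicros.updateAndGet(prev -> Math.max(prev + 1, nowMicros));
        Instant generationTime = Instant.ofEpochSecond(micros / 1_000_000, (micros % 1_000_000) * 1_000);
        return new IdempotencyToken(generationTime, UUID.randomUUID().toString());
    }
}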
The following graphs illustrate the observed clock skew on our Cassandra fleet, suggesting the safety of this technique on modern cloud VMs with direct access to high-quality clocks. To further maintain safety, KV servers reject writes bearing tokens with large drift, preventing both silent write discard (a write with a timestamp far in the past) and immutable doomstones (a write with a timestamp far in the future) in storage engines vulnerable to those.
Handling Large Data through Chunking
Key-Value is also designed to efficiently handle large blobs, a common challenge for traditional key-value stores. Databases often face limitations on the amount of data that can be stored per key or partition. To address these constraints, KV uses transparent chunking to manage large data efficiently.
For items smaller than 1 MiB, data is stored directly in the main backing storage (e.g. Cassandra), ensuring fast and efficient access. However, for larger items, only the id, key, and metadata are stored in the primary storage, while the actual data is split into smaller chunks and stored separately in chunk storage. This chunk storage can also be Cassandra but with a different partitioning scheme optimized for handling large values. The idempotency token ties all these writes together into one atomic operation.
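The split itself can be sketched roughly as below; the 1 MiB threshold comes from the text, while the chunk size and storage calls are assumptions:
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sketch: values at or above the threshold are split into fixed-size chunks written to a separate chunk store.
public class Chunker {
    static final int CHUNK_THRESHOLD = 1 << 20; // 1 MiB, per the post
    static final int CHUNK_SIZE = 1 << 20;      // assumed chunk size

    static List<byte[]> split(byte[] value) {
        List<byte[]> chunks = new ArrayList<>();
        for (int offset = 0; offset < value.length; offset += CHUNK_SIZE) {
            chunks.add(Arrays.copyOfRange(value, offset, Math.min(offset + CHUNK_SIZE, value.length)));
        }
        return chunks;
    }

    static void put(byte[] value) {
        if (value.length < CHUNK_THRESHOLD) {
            // store the value inline in the primary row
        } else {
            List<byte[]> chunks = split(value);
            // store id/key/metadata (e.g. chunk count) in primary storage, write each chunk to chunk
            // storage, and tie the writes together with a single idempotency token
        }
    }
}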
By splitting large items into chunks, we ensure that latency scales linearly with the size of the data, making the system both predictable and efficient. A future blog post will describe the chunking architecture in more detail, including its intricacies and optimization strategies.
Client-Side Compression
The KV abstraction leverages client-side payload compression to optimize performance, especially for large data transfers. While many databases offer server-side compression, handling compression on the client side reduces expensive server CPU usage, network bandwidth, and disk I/O. In one of our deployments, which helps power Netflix’s search, enabling client-side compression reduced payload sizes by 75%, significantly improving cost efficiency.
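As a rough illustration of the idea (standard JDK GZIP here, not necessarily Netflix’s codec), a client could compress the payload before it ever reaches the server:
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

// Sketch: compress on the client so the server spends no CPU on it and less data crosses the network/disk.
public class ClientCompression {
    static byte[] compress(byte[] payload) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(out)) {
            gzip.write(payload);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] raw = "a large, repetitive document payload ".repeat(1000).getBytes(StandardCharsets.UTF_8);
        byte[] compressed = compress(raw);
        System.out.printf("raw=%d bytes, compressed=%d bytes%n", raw.length, compressed.length);
    }
}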
Smarter Pagination
We chose payload size in bytes as the limit per response page rather than the number of items because it allows us to provide predictable operation SLOs. For instance, we can provide a single-digit millisecond SLO on a 2 MiB page read. Conversely, using the number of items per page as the limit would result in unpredictable latencies due to significant variations in item size. A request for 10 items per page could result in vastly different latencies if each item was 1 KiB versus 1 MiB.
Using bytes as a limit poses challenges, as few backing stores support byte-based pagination; most data stores use the number of results (e.g. DynamoDB and Cassandra limit by number of items or rows). To address this, we use a static limit for the initial queries to the backing store, query with this limit, and process the results. If more data is needed to meet the byte limit, additional queries are executed until the limit is met, the excess result is discarded, and a page token is generated.
This static limit can lead to inefficiencies: one large item in the result may cause us to discard many results, while small items may require multiple iterations to fill a page, resulting in read amplification. To mitigate these issues, we implemented adaptive pagination, which dynamically tunes the limits based on observed data.
Adaptive Pagination
When an initial request is made, a query is executed in the storage engine, and the results are retrieved. As the consumer processes these results, the system tracks the number of items consumed and the total size used. This data helps calculate an approximate item size, which is stored in the page token. For subsequent page requests, this stored information allows the server to apply the appropriate limits to the underlying storage, reducing unnecessary work and minimizing read amplification.
While this method is effective for follow-up page requests, what happens with the initial request? In addition to storing item size information in the page token, the server also estimates the average item size for a given namespace and caches it locally. This cached estimate helps the server set a more optimal limit on the backing store for the initial request, improving efficiency. The server continuously adjusts this limit based on recent query patterns or other factors to keep it accurate. For subsequent pages, the server uses both the cached data and the information in the page token to fine-tune the limits.
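A simplified sketch of the arithmetic, assuming the learned average item size travels in the page token and a per-namespace estimate serves as the fallback:
// Sketch: estimate how many rows to ask the backing store for, given a byte budget per page.
public class AdaptivePageLimit {
    static int nextLimit(long pageSizeBytes, long bytesConsumed, int itemsConsumed, long fallbackItemSizeBytes) {
        long approxItemSize = itemsConsumed > 0
                ? Math.max(1, bytesConsumed / itemsConsumed) // learned from the previous page (stored in the page token)
                : Math.max(1, fallbackItemSizeBytes);        // cached per-namespace estimate for the first page
        long limit = pageSizeBytes / approxItemSize;
        return (int) Math.max(1, Math.min(limit, 10_000));   // clamp to an assumed sane bound
    }

    public static void main(String[] args) {
        // 2 MiB page; last page consumed 1.5 MiB over 600 items -> ~2.6 KiB per item -> ask for ~800 rows.
        System.out.println(nextLimit(2 << 20, 1_572_864, 600, 4_096));
    }
}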
In addition to adaptive pagination, a mechanism is in place to send a response early if the server detects that processing the request is at risk of exceeding the request’s latency SLO.
For example, let us assume a client submits a GetItems request with a per-page limit of 2 MiB and a maximum end-to-end latency limit of 500ms. While processing this request, the server retrieves data from the backing store. This particular record has thousands of small items, so it would normally take longer than the 500ms SLO to gather the full page of data. If this happens, the client would receive an SLO violation error, causing the request to fail even though there is nothing exceptional about the request. To prevent this, the server tracks the elapsed time while fetching data. If it determines that continuing to retrieve more data might breach the SLO, the server will stop processing further results and return a response with a pagination token.
This approach ensures that requests are processed within the SLO, even if the full page size isn’t met, giving clients predictable progress. Furthermore, if the client is a gRPC server with proper deadlines, the client is smart enough not to issue further requests, reducing useless work.
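A hedged sketch of the early-return guard; the 500ms budget mirrors the example above, and the storage iterator is hypothetical:
import java.time.Duration;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Sketch: stop fetching and return a partial page with a token once the latency budget is nearly spent.
public class SloAwarePaging {
    record PartialPage(List<byte[]> items, boolean hasMore) {}

    static PartialPage fetchWithinBudget(Iterator<byte[]> storageResults, long pageSizeBytes, Duration budget) {
        long deadlineNanos = System.nanoTime() + budget.toNanos();
        long bytesSoFar = 0;
        List<byte[]> items = new ArrayList<>();
        while (storageResults.hasNext()) {
            if (System.nanoTime() >= deadlineNanos || bytesSoFar >= pageSizeBytes) {
                return new PartialPage(items, true); // return early with a page token rather than breach the SLO
            }
            byte[] item = storageResults.next();
            items.add(item);
            bytesSoFar += item.length;
        }
        return new PartialPage(items, false);
    }
}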
If you want to know more, the How Netflix Ensures Highly-Reliable Online Stateful Systems article talks in further detail about these and many other techniques.
Signaling
KV uses in-band messaging we call signaling that allows the dynamic configuration of the client and enables it to communicate its capabilities to the server. This ensures that configuration settings and tuning parameters can be exchanged seamlessly between the client and server. Without signaling, the client would need static configuration — requiring a redeployment for each change — or, with dynamic configuration, would require coordination with the client team.
For server-side signals, when the client is initialized, it sends a handshake to the server. The server responds back with signals, such as target or max latency SLOs, allowing the client to dynamically adjust timeouts and hedging policies. Handshakes are then made periodically in the background to keep the configuration current. For client-communicated signals, the client, along with each request, communicates its capabilities, such as whether it can handle compression, chunking, and other features.
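A rough sketch of how a client might apply server-advertised signals at startup; the field names and stub below are assumptions:
import java.time.Duration;

// Sketch: the client handshakes, then derives its timeout and hedging policy from server-sent signals.
public class SignalingHandshake {
    record ServerSignals(Duration targetLatency, Duration maxLatency, boolean compressionSupported) {}

    interface KeyValueClient {
        ServerSignals handshake(); // assumed RPC; repeated periodically in the background
    }

    static void configure(KeyValueClient client) {
        ServerSignals signals = client.handshake();
        Duration requestTimeout = signals.maxLatency();
        Duration hedgeAfter = signals.targetLatency(); // fire a hedged request once the target SLO is exceeded
        System.out.printf("timeout=%s, hedge after=%s, compression=%b%n",
                requestTimeout, hedgeAfter, signals.compressionSupported());
    }
}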
KV Usage @ Netflix
The KV abstraction powers several key Netflix use cases, including:
- Streaming Metadata: High-throughput, low-latency access to streaming metadata, ensuring personalized content delivery in real-time.
- User Profiles: Efficient storage and retrieval of user preferences and history, enabling seamless, personalized experiences across devices.
- Messaging: Storage and retrieval of the push registry for messaging needs, enabling millions of requests to flow through.
- Real-Time Analytics: Persists large-scale impression data and provides insights into user behavior and system performance, moving data from offline to online and vice versa.
Future Enhancements
Looking forward, we plan to enhance the KV abstraction with:
- Lifecycle Management: Fine-grained control over data retention and deletion.
- Summarization: Techniques to improve retrieval efficiency by summarizing records with many items into fewer backing rows.
- New Storage Engines: Integration with more storage systems to support new use cases.
- Dictionary Compression: Further reducing data size while maintaining performance.
Conclusion
The Key-Value service at Netflix is a flexible, cost-effective solution that supports a wide range of data patterns and use cases, from low to high traffic scenarios, including critical Netflix streaming use-cases. The simple yet robust design allows it to handle diverse data models like HashMaps, Sets, Event storage, Lists, and Graphs. It abstracts the complexity of the underlying databases from our developers, which enables our application engineers to focus on solving business problems instead of becoming experts in every storage engine and their distributed consistency models. As Netflix continues to innovate in online datastores, the KV abstraction remains a central component in managing data efficiently and reliably at scale, ensuring a solid foundation for future growth.
Acknowledgments: Special thanks to our stunning colleagues who contributed to Key Value’s success: William Schor, Mengqing Wang, Chandrasekhar Thumuluru, Rajiv Shringi, John Lu, George Cambell, Ammar Khaku, Jordan West, Chris Lohfink, Matt Lehman, and the whole online datastores team (ODS, f.k.a CDE).
Introducing Netflix’s Key-Value Data Abstraction Layer was originally published in Netflix TechBlog on Medium.
ScyllaDB’s Rust Developer Workshop: What We Learned
A recap of the recent Rust developer workshop, where we built and refactored a high-performance Rust app for real-time data streaming (with ScyllaDB and Redpanda).
Felipe Cardeneti Mendes (ScyllaDB Technical Director) and I recently got together with a couple thousand curious Rustaceans for a ScyllaDB Rust Developer Workshop. The agenda: walk through how we built and refactored a high-performance Rust app for real-time data streaming. We promised to show developers, engineers, and architects how to:
- Create and compile a sample social media app with Rust
- Connect the application to ScyllaDB (NoSQL data store) and Redpanda (streaming data)
- Negotiate tradeoffs related to data modeling and querying
- Manage and monitor the database for consistently low latencies
This blog post is a quick recap of what we covered. Hopefully, it’s a nice wrap-up for those who joined us live. If you missed it, you can still watch the recording (uncut – complete with a little cat chasing!). And feel free to ping me or Felipe with any questions you have.
Access the workshop now
Attend P99 CONF (free + virtual) to watch Rust tech talks
First Things First
First, I wanted to cover how I approach an existing, legacy codebase that Felipe so kindly generated for the workshop. I think it’s really important to respect everyone who interacts with code – past, present, and future. That mindset helps foster good collaboration and leads to more maintainable, higher-quality code. Who knows, you might even have a laugh along the way.
You probably spotted me using an Integrated Development Environment (IDE). Depending on your budget (from free to perhaps a couple hundred dollars), an IDE will really help streamline your coding process, especially when working with complex projects. The eagle-eyed among you may have spotted some AI in there as well from our friends at GitHub. Every bit helps!
Dealing with Dependencies
In the code walkthrough, I first tackled the structure of the code and showed how to organize workspace members. This helps me resolve dependencies efficiently and start to test the binaries in isolation:
[workspace]
members = ["backend", "consumer", "frontend"]
resolver = "1"
Then I could just run the consumer after stopping it in docker-compose with:
cargo run --package consumer --bin consumer
Updating the Driver
Another thing I did was update the driver. It’s important to keep things in check with releases from ScyllaDB, so we upgraded the Rust driver for the whole project. I did a quick walkthrough of application functionality and decided to write a quick smoke test that simulated traffic on the front end in terms of messaging between users. If you’re interested, I used a great load testing tool called k6 to simulate that load. Here’s the script:
import http from 'k6/http';

export default function () {
  http.post('http://localhost:3001/new_post', JSON.stringify({
    content: 'bar',
    subject: 'foo',
    id: '8d8712fc-786f-4d72-98ea-3669e56f7012'
  }), {
    headers: { 'Content-Type': 'application/json' },
  });
}
Dealing with an Offset Bug
Once we had some messages flowing (perhaps way too many, as it turned out), I discovered a potential bug where the offset was not being persisted between application restarts. This meant every time we restarted the application, all of the messages would be read from the topic and then re-written to the database. Without understanding functionality like the ability to parse consumer offsets in Redpanda, I went for a more naive approach by storing the offset in ScyllaDB instead. I’m sure I’m not the first dev to go down the wrong path, and I fully blame Felipe for not intercepting earlier 😉
Refactoring Time
In any case, it was fun to see how we might approach the topic of refactoring code. It’s always easier to start with small, manageable tasks when making improvements or refactoring code. The first thing I did was decide what the table (and ultimately query) might look like. This “query first design” is an important design concept in ScyllaDB. Be sure to check out some ScyllaDB University courses on this. I decided the table would look something like this to store my offset value:
CREATE TABLE IF NOT EXISTS ks.offset (consumer text PRIMARY KEY, count BigInt)
We briefly touched on why I chose a BigInt primitive instead of a Counter value. The main reason is that we can’t arbitrarily set the latter to a value, only increment or decrement it. We then tackled how we might write to that table and came up with the following:
async fn update_offset(offset: i64, session: &Session, update_counter: &PreparedStatement, consumer: &str) -> Result<()> {
    session.execute(update_counter, (offset, consumer)).await?;
    Ok(())
}
You’ll notice here that I’m passing it a prepared statement, which is an important concept to grasp when making your code perform well with ScyllaDB. Be sure to read the docs on that if you’re unsure. I also recall writing a TODO to move some existing prepared query statements outside a for loop. The main reason: you only need to do this once for your app, not over and over. So watch out for that mistake. I also stored my query as a constant:
const UPDATE_OFFSET: &str = "UPDATE ks.offset SET count = ? WHERE consumer = ?";
There are different ways to skin this, like maybe some form of model approach, but this was a simple way to keep the queries in one place within the consumer code. We restarted the app and checked the database using cqlsh to see if the offsets were being written – and they weren’t! But first, a quick tip from other webinars: if you’re running ScyllaDB in a Docker container, you can simply exec into it and run the tool:
docker exec -it scylla cqlsh
Back to my mistake: why no writes to the table? If you recall, I write the offset after the consumer has finished processing records from the topic:
offset = consumer(&postclient, "posts", offset, &session).await;
update_offset(offset, &session, &update_counter, "posts").await.expect("Failed to update offset");
tokio::time::sleep(Duration::from_secs(5)).await;
Since I had written a load test with something like 10K records, that consumer takes some time to complete, so update_offset wasn’t getting called straight away. By the end of the webinar, it actually finished reading from the topic and wrote the offset to the table. Another little change I snuck in there was on:
tokio::time::sleep(Duration::from_secs(5)).await;
Felipe spoke to the benefits of using tokio, an asynchronous runtime for Rust. The previous thread sleep would in fact do nothing, hence the change. Hooray for quick fixes! Once we had writes, we needed to read from the table, so I added another function that looked like this:
async fn fetch_offset(session: &Session, consumer: &str) -> Result<i64> {
    let query = "SELECT count FROM ks.offset WHERE consumer = ?";
    let values = (consumer,);
    let result = session.query(query, values).await.expect("Failed to execute query");
    if let Some(row) = result.maybe_first_row_typed::<(i64,)>().expect("Failed to get row") {
        Ok(row.0)
    } else {
        Ok(0)
    }
}
I spoke about some common gotchas here, like misunderstanding how query values work with different types, and whether to use a slice &[] or a tuple (). Query text is constant, but the values might change. You can pass changing values to a query by specifying a list of variables as bound values. Don’t forget the parentheses! I also highlighted some of the convenience methods in query result, like maybe_first_row_typed. That returns Option<RowT> containing the first row from the result – which is handy when you just want the first row or None. Once again, you can play around with types, and even use custom structs if you prefer for the output. In my case, it was just a tuple with an i64. The complete consumer code for posts looked something like this:
tokio::spawn(async move {
    use std::time::Duration;
    info!("Posts Consumer Started");
    let session = db_session().await;
    let update_counter = session.prepare(UPDATE_OFFSET).await.expect("Failed to prepare query");
    loop {
        let mut offset = fetch_offset(&session, "posts").await.expect("Failed to fetch offset");
        offset = consumer(&postclient, "posts", offset, &session).await;
        update_offset(offset, &session, &update_counter, "posts").await.expect("Failed to update offset");
        tokio::time::sleep(Duration::from_secs(5)).await;
    }
});
You can see I prepare the statement before the loop, then I fetch the offset from the database, consume the topic, write the offset to the database, and sleep. Keep doing that forever!
What We Didn’t Have Time to Cover
There were a few things that I wanted to cover but ran out of time. If you wanted to write results to a custom struct, the code might look something like:
#[derive(Default, FromRow)]
pub struct Offset {
    consumer: String,
    count: i64,
}

use scylla::IntoTypedRows;

async fn fetch_offset_type(session: &Session, consumer: &str) -> Offset {
    let query = "SELECT * FROM ks.offset WHERE consumer = ?";
    let values = (consumer,);
    let result = session.query(query, values).await.expect("Failed to execute query");
    if let Some(rows) = result.rows {
        if let Some(row) = rows.into_typed::<Offset>().next() {
            let offset: Offset = row.expect("Failed to parse row");
            return offset;
        }
    }
    Offset { consumer: consumer.to_string(), count: 0 }
}
There are some custom values you’ll come across, like CqlTimestamps and Counter… so you should be aware of the ways to handle these different data types. For example, rather than convert everything to and from millisecond timestamps, you can add the chrono feature flag on the crate to interact with time. You can also improve logging with the driver’s support of the tracing crate for your logs. If you add that, you can use a tracing subscriber as follows:
#[tokio::main]
async fn main() {
    tracing_subscriber::fmt::init();
    …
Wrapping Up
I personally find refactoring code enjoyable. I’d encourage you to have a patient, persistent approach to coding, testing, and refactoring. When it comes to ScyllaDB, it’s a product where it really pays to read the documentation, as many of the foot guns are well documented. If you still find yourself stuck, feel free to ask questions on the ScyllaDB forum and learn from your peers. And remember, small, continuous improvements lead to long-term benefits. Have fun!
See what you missed – watch the video
Instaclustr for Apache Cassandra® 5.0 Now Generally Available
NetApp is excited to announce the general availability (GA) of Apache Cassandra® 5.0 on the Instaclustr Platform. This follows the release of the public preview in March.
NetApp was the first managed service provider to release the beta version, and now the Generally Available version, allowing the deployment of Cassandra 5.0 across the major cloud providers: AWS, Azure, and GCP, and on-premises.
Apache Cassandra has been a leader in NoSQL databases since its inception and is known for its high availability, reliability, and scalability. The latest version brings many new features and enhancements, with a special focus on building data-driven applications through artificial intelligence and machine learning capabilities.
Cassandra 5.0 will help you optimize performance, lower costs, and get started on the next generation of distributed computing by:
- Helping you build AI/ML-based applications through Vector Search
- Bringing efficiencies to your applications through new and enhanced indexing and processing capabilities
- Improving flexibility and security
With the GA release, you can use Cassandra 5.0 for your production workloads, which are covered by NetApp’s industry-leading SLAs. NetApp has conducted performance benchmarking and extensive testing while removing the limitations that were present in the preview release to offer a more reliable and stable version. Our GA offering is suitable for all workload types as it contains the most up-to-date range of features, bug fixes, and security patches.
Support for continuous backups and private network add-ons is available. Currently, Debezium is not yet compatible with Cassandra 5.0. NetApp will work with the Debezium community to add support for Debezium on Cassandra 5.0, and it will be available on the Instaclustr Platform as soon as it is supported.
Some of the key new features in Cassandra 5.0 include:
- Storage-Attached Indexes (SAI): A highly scalable, globally distributed index for Cassandra databases. With SAI, column-level indexes can be added, leading to unparalleled I/O throughput for searches across different data types, including vectors. SAI also enables lightning-fast data retrieval through zero-copy streaming of indices, resulting in unprecedented efficiency.
- Vector Search: This is a powerful technique for searching relevant content or discovering connections by comparing similarities in large document collections and is particularly useful for AI applications. It uses storage-attached indexing and dense indexing techniques to enhance data exploration and analysis.
- Unified Compaction Strategy: This strategy unifies compaction approaches, including leveled, tiered, and time-windowed strategies. It leads to a major reduction in SSTable sizes. Smaller SSTables mean better read and write performance, reduced storage requirements, and improved overall efficiency.
- Numerous stability and testing improvements: You can read all about these changes here.
All these new features are available out-of-the-box in Cassandra 5.0 and do not incur additional costs.
Our Development team has worked diligently to bring you a stable release of Cassandra 5.0. Substantial preparatory work was done to ensure you have a seamless experience with Cassandra 5.0 on the Instaclustr Platform. This includes updating the Cassandra YAML and Java environment and enhancing the monitoring capabilities of the platform to support new data types.
We also conducted extensive performance testing and benchmarked version 5.0 with the existing stable Apache Cassandra 4.1.5 version. We will be publishing our benchmarking results shortly; the highlight so far is that Cassandra 5.0 improves responsiveness by reducing latencies by up to 30% during peak load times.
Through our dedicated Apache Cassandra committer, NetApp has contributed to the development of Cassandra 5.0 by enhancing the documentation for new features like Vector Search (Cassandra-19030), enabling Materialized Views (MV) with only partition keys (Cassandra-13857), fixing numerous bugs, and contributing to the improvements for the unified compaction strategy feature, among many other things.
Lifecycle Policy Updates
As previously communicated, the project will no longer maintain Apache Cassandra 3.0 and 3.11 versions (full details of the announcement can be found on the Apache Cassandra website).
To help you transition smoothly, NetApp will provide extended support for these versions for an additional 12 months. During this period, we will backport any critical bug fixes, including security patches, to ensure the continued security and stability of your clusters.
Cassandra 3.0 and 3.11 versions will reach end-of-life on the Instaclustr Managed Platform within the next 12 months. We will work with you to plan and upgrade your clusters during this period.
Additionally, the Cassandra 5.0 beta version and the Cassandra 5.0 RC2 version, which were released as part of the public preview, are now end-of-life. You can check the lifecycle status of different Cassandra application versions here.
You can read more about our lifecycle policies on our website.
Getting Started
Upgrading to Cassandra 5.0 will allow you to stay current and start taking advantage of its benefits. The Instaclustr by NetApp Support team is ready to help customers upgrade clusters to the latest version.
- Wondering if it’s possible to upgrade your workloads from Cassandra 3.x to Cassandra 5.0? Find the answer to this and other similar questions in this detailed blog.
- Click here to read about Storage Attached Indexes in Apache Cassandra 5.0.
- Learn about 4 new Apache Cassandra 5.0 features to be excited about.
- Click here to learn what you need to know about Apache Cassandra 5.0.
Why Choose Apache Cassandra on the Instaclustr Managed Platform?
NetApp strives to deliver the best of supported applications. Whether it’s the latest and newest application versions available on the platform or additional platform enhancements, we ensure a high quality through thorough testing before entering General Availability.
NetApp customers have the advantage of accessing the latest versions—not just the major version releases but also minor version releases—so that they can benefit from any new features and are protected from any vulnerabilities.
Don’t have an Instaclustr account yet? Sign up for a trial or reach out to our Sales team and start exploring Cassandra 5.0.
With more than 375 million node hours of management experience, Instaclustr offers unparalleled expertise. Visit our website to learn more about the Instaclustr Managed Platform for Apache Cassandra.
If you would like to upgrade your Apache Cassandra version or have any issues or questions about provisioning your cluster, please contact Instaclustr Support at any time.
The post Instaclustr for Apache Cassandra® 5.0 Now Generally Available appeared first on Instaclustr.
Apache Cassandra® 5.0: Behind the Scenes
Here at NetApp, our Instaclustr product development team has spent nearly a year preparing for the release of Apache Cassandra 5.
It started with one engineer tinkering at night with the Apache Cassandra 5 Alpha branch and grew to five engineers working on various monitoring, configuration, testing, and functionality improvements to integrate the release with the Instaclustr Platform.
It’s been a long journey to the point we are at today, offering Apache Cassandra 5 Release Candidate 1 in public preview on the Instaclustr Platform.
Note: the Instaclustr team has a dedicated open source committer to the Apache Cassandra project. His changes are not included in this document as there were too many for us to include here. Instead, this blog primarily focuses on the engineering effort to release Cassandra 5.0 onto the Instaclustr Managed Platform.
August 2023: The Beginning
We began experimenting with the Apache Cassandra 5 Alpha 1 branches using our build systems. There were several tools we built into our Apache Cassandra images that were not working at this point, but we managed to get a node to start even though it immediately crashed with errors.
One of our early achievements was identifying and fixing a bug that impacted our packaging solution; this resulted in a small contribution to the project allowing Apache Cassandra to be installed on Debian systems with non-OpenJDK Java.
September 2023: First Milestone
The release of the Alpha 1 version allowed us to achieve our first running Cassandra 5 cluster in our development environments (without crashing!).
Basic core functionalities like user creation, data writing, and backups/restores were tested successfully. However, several advanced features, such as repair and replace tooling, monitoring, and alerting were still untested.
At this point we had to pause our Cassandra 5 efforts to focus on other priorities and planned to get back to testing Cassandra 5 after Alpha 2 was released.
November 2023: Further Testing and Internal Preview
The project released Alpha 2. We repeated the same build and test process we used for Alpha 1. We also tested some more advanced procedures, like cluster resizes, with no issues.
We also started testing with some of the new 5.0 features: Vector Data types and Storage-Attached Indexes (SAI), which resulted in another small contribution.
We launched Apache Cassandra 5 Alpha 2 for internal preview (basically for internal users). This allowed the wider Instaclustr team to access and use the Alpha on the platform.
During this phase we found a bug in our metrics collector when vectors were encountered that ended up being a major project for us.
If you see errors like the below, it’s time for a Java Cassandra driver upgrade to 4.16 or newer:
java.lang.IllegalArgumentException: Could not parse type name vector<float, 5>
Nov 15 22:41:04 ip-10-0-39-7 process[1548]: at com.datastax.driver.core.DataTypeCqlNameParser.parse(DataTypeCqlNameParser.java:233)
Nov 15 22:41:04 ip-10-0-39-7 process[1548]: at com.datastax.driver.core.TableMetadata.build(TableMetadata.java:311)
Nov 15 22:41:04 ip-10-0-39-7 process[1548]: at com.datastax.driver.core.SchemaParser.buildTables(SchemaParser.java:302)
Nov 15 22:41:04 ip-10-0-39-7 process[1548]: at com.datastax.driver.core.SchemaParser.refresh(SchemaParser.java:130)
Nov 15 22:41:04 ip-10-0-39-7 process[1548]: at com.datastax.driver.core.ControlConnection.refreshSchema(ControlConnection.java:417)
Nov 15 22:41:04 ip-10-0-39-7 process[1548]: at com.datastax.driver.core.ControlConnection.refreshSchema(ControlConnection.java:356)
<Rest of stacktrace removed for brevity>
December 2023: Focus on new features and planning
As the project released Beta 1, we began focusing on the features in Cassandra 5 that we thought were the most exciting and would provide the most value to customers. There are a lot of awesome new features and changes, so it took a while to find the ones with the largest impact.
The final list of high impact features we came up with was:
- A new data type – Vectors
- Trie memtables/Trie Indexed SSTables (BTI Formatted SStables)
- Storage-Attached Indexes (SAI)
- Unified Compaction Strategy
A major new feature we considered deploying was support for JDK 17. However, due to its experimental nature, we have opted to postpone adoption and plan to support running Apache Cassandra on JDK 17 when it’s out of the experimentation phase.
Once the holiday season arrived, it was time for a break, and we were back in force in February next year.
February 2024: Intensive testing
In February, we released Beta 1 into internal preview so we could start testing it on our Preproduction test environments. As we started to do more intensive testing, we discovered issues in the interaction with our monitoring and provisioning setup.
We quickly fixed the issues identified as showstoppers for launching Cassandra 5. By the end of February, we initiated discussions about a public preview release. We also started to add more resourcing to the Cassandra 5 project. Up until now, only one person was working on it.
Next, we broke down the work we needed to do. This included identifying monitoring agents requiring upgrade and config defaults that needed to change.
From this point, the project split into 3 streams of work:
- Project Planning – Deciding how all this work gets pulled together cleanly, ensuring other work streams have adequate resourcing to hit their goals, and informing product management and the wider business of what’s happening.
- Configuration Tuning – Focusing on the new features of Apache Cassandra to include, how to approach the transition to JDK 17, and how to use BTI formatted SSTables on the platform.
- Infrastructure Upgrades – Identifying what to upgrade internally to handle Cassandra 5, including Vectors and BTI formatted SSTables.
A Senior Engineer was responsible for each workstream to ensure planned timeframes were achieved.
March 2024: Public Preview Release
In March, we launched Beta 1 into public preview on the Instaclustr Managed Platform. The initial release did not contain any opt-in features like Trie Indexed SSTables.
However, this gave us a consistent base to test in our development, test, and production environments, and proved our release pipeline for Apache Cassandra 5 was working as intended. This also gave customers the opportunity to start using Apache Cassandra 5 with their own use cases and environments for experimentation.
See our public preview launch blog for further details.
There was not much time to celebrate as we continued working on infrastructure and refining our configuration defaults.
April 2024: Configuration Tuning and Deeper Testing
The first configuration updates were completed for Beta 1, and we started performing deeper functional and performance testing. We identified a few issues from this effort and remediated them. This default configuration was applied to all Beta 1 clusters moving forward.
This allowed users to start testing Trie Indexed SSTables and Trie memtables in their environment by default.
"memtable": { "configurations": { "skiplist": { "class_name": "SkipListMemtable" }, "sharded": { "class_name": "ShardedSkipListMemtable" }, "trie": { "class_name": "TrieMemtable" }, "default": { "inherits": "trie" } } }, "sstable": { "selected_format": "bti" }, "storage_compatibility_mode": "NONE",
The snippet above illustrates an Apache Cassandra YAML configuration where BTI-formatted SSTables are used by default (which allows Trie Indexed SSTables) and Trie memtables are the default. You can override the memtable choice per table:
CREATE TABLE test WITH memtable = {'class' : 'ShardedSkipListMemtable'};
Note that you need to set storage_compatibility_mode to NONE to use BTI formatted sstables. See Cassandra documentation for more information.
You can also reference the cassandra_latest.yaml file for the latest settings (please note you should not apply these to existing clusters without rigorous testing).
May 2024: Major Infrastructure Milestone
We hit a very large infrastructure milestone when we released an upgrade to some of our core agents that were reliant on an older version of the Apache Cassandra Java driver. The upgrade to version 4.17 allowed us to start supporting vectors in certain keyspace level monitoring operations.
At the time, this was considered to be the riskiest part of the entire project as we had 1000s of nodes to upgrade across many different customer environments. This upgrade took a few weeks, finishing in June. We broke the release up into 4 separate rollouts to reduce the risk of introducing issues into our fleet, focusing on single key components in our architecture in each release. Each release had quality gates and tested rollback plans, which in the end were not needed.
June 2024: Successful Rollout of the New Cassandra Driver
The Java driver upgrade project was rolled out to all nodes in our fleet and no issues were encountered. At this point we hit all the major milestones before Release Candidates became available. We started to look at the testing systems to update to Apache Cassandra 5 by default.
July 2024: Path to Release Candidate
We upgraded our internal testing systems to use Cassandra 5 by default, meaning our nightly platform tests began running against Cassandra 5 clusters and our production releases will smoke test using Apache Cassandra 5. We started testing the upgrade path for clusters from 4.x to 5.0. This resulted in another small contribution to the Cassandra project.
The Apache Cassandra project released Apache Cassandra 5 Release Candidate 1 (RC1), and we launched RC1 into public preview on the Instaclustr Platform.
The Road Ahead to General Availability
We’ve just launched Apache Cassandra 5 Release Candidate 1 (RC1) into public preview, and there’s still more to do before we reach General Availability for Cassandra 5, including:
- Upgrading our own preproduction Apache Cassandra for internal use to Apache Cassandra 5 Release Candidate 1. This means we’ll be testing using our real-world use cases and testing our upgrade procedures on live infrastructure.
At Launch:
When Apache Cassandra 5.0 launches, we will perform another round of testing, including performance benchmarking. We will also upgrade our internal metrics storage production Apache Cassandra clusters to 5.0, and, if the results are satisfactory, we will mark the release as generally available for our customers. We want to have full confidence in running 5.0 before we recommend it for production use to our customers.
For more information about our own usage of Cassandra for storing metrics on the Instaclustr Platform check out our series on Monitoring at Scale.
What Have We Learned From This Project?
- Releasing limited, small, and frequent changes has resulted in a smooth project, even if sometimes frequent releases do not feel smooth. Some thoughts:
- Releasing to a small subset of internal users allowed us to take risks and break things more often so we could learn from our failures safely.
- Releasing small changes allowed us to more easily understand and predict the behaviour of our changes: what to look out for in case things went wrong, how to more easily measure success, etc.
- Releasing frequently built confidence within the wider Instaclustr team, which in turn meant we would be happier taking more risks and could release more often.
- Releasing to internal and public preview helped create momentum within the Instaclustr business and teams:
- This turned the Apache Cassandra 5.0 release from something that “was coming soon and very exciting” to “something I can actually use.”
- Communicating frequently, transparently, and efficiently is the foundation of success:
- We used a dedicated Slack channel (very creatively named #cassandra-5-project) to discuss everything.
- It was quick and easy to go back to see why we made certain decisions or revisit them if needed. This had a bonus of allowing a Lead Engineer to write a blog post very quickly about the Cassandra 5 project.
This has been a long-running but very exciting project for the entire team here at Instaclustr. The Apache Cassandra community is on the home stretch for this massive release, and we couldn’t be more excited to start seeing what everyone will build with it.
You can sign up today for a free trial and test Apache Cassandra 5 Release Candidate 1 by creating a cluster on the Instaclustr Managed Platform.
More Readings
- The Top 5 Questions We’re Asked about Apache Cassandra 5.0
- Vector Search in Apache Cassandra 5.0
- How Does Data Modeling Change in Apache Cassandra 5.0?
The post Apache Cassandra® 5.0: Behind the Scenes appeared first on Instaclustr.
Will Your Cassandra Database Project Succeed?: The New Stack
Open source Apache Cassandra® continues to stand out as an enterprise-proven solution for organizations seeking high availability, scalability and performance in a NoSQL database. (And hey, the brand-new 5.0 version is only making those statements even more true!) There’s a reason this database is trusted by some of the world’s largest and most successful companies.
That said, effectively harnessing the full spectrum of Cassandra’s powerful advantages can mean overcoming a fair share of operational complexity. Some folks will find a significant learning curve, and knowing what to expect is critical to success. In my years of experience working with Cassandra, it’s when organizations fail to anticipate and respect these challenges that they set the stage for their Cassandra projects to fall short of expectations.
Let’s look at the key areas where strong project management and following proven best practices will enable teams to evade common pitfalls and ensure a Cassandra implementation is built strong from Day 1.
Accurate Data Modeling Is a Must
Cassandra projects require a thorough understanding of its unique data model principles. Teams that approach Cassandra like a relational database are unlikely to model data properly. This can lead to poor performance, excessive use of secondary indexes and significant data consistency issues.
On the other hand, teams that develop familiarity with Cassandra’s specific NoSQL data model will understand the importance of including partition keys, clustering keys and denormalization. These teams will know to closely analyze query and data access patterns associated with their applications and know how to use that understanding to build a Cassandra data model that matches their application’s needs step for step.
Configure Cassandra Clusters the Right Way
Accurate, expertly managed cluster configurations are pivotal to the success of Cassandra implementations. Get those cluster settings wrong and Cassandra can suffer from data inconsistencies and performance issues due to inappropriate node capacities, poor partitioning or replication strategies that aren’t up to the task.
Teams should understand the needs of their particular use case and how each cluster configuration setting affects Cassandra’s abilities to serve that use case. Attuning configurations to best support your application — including the right settings for node capacity, data distribution, replication factor and consistency levels — will ensure that you can harness the full power of Cassandra when it counts.
Take Advantage of Tunable Consistency
Cassandra gives teams the option to leverage the best balance of data consistency and availability for their use case. While these tunable consistency levels are a valuable tool in the right hands, teams that don’t understand the nuances of these controls can saddle their applications with painful latency and troublesome data inconsistencies.
Teams that learn to operate Cassandra’s tunable consistency levels properly and carefully assess their application’s needs — especially with read and write patterns, data sensitivity and the ability to tolerate eventual consistency — will unlock far more beneficial Cassandra experiences.
Perform Regular Maintenance
Regular Cassandra maintenance is required to stave off issues such as data inconsistencies and performance drop-offs. Within their Cassandra operational procedures, teams should routinely perform compaction, repair and node-tool operations to prevent challenges down the road, while ensuring cluster health and performance are optimized.
Anticipate Capacity and Scaling Needs
By its nature, success will yield new needs. Be prepared for your Cassandra cluster to grow and scale well into the future — that is what this database is built to do. Starving your Cassandra cluster for CPU, RAM and storage resources because you don’t have a plan to seamlessly add capacity is a way of plucking failure from the jaws of success. Poor performance, data loss and expensive downtime are the rewards for growing without looking ahead.
Plan for growth and scalability from the beginning of your Cassandra implementation. Practice careful capacity planning. Look at your data volumes, write/read patterns and performance requirements today and tomorrow. Teams with clusters built for growth will be ready to do so far more easily and affordably.
Make Changes With a Careful Testing/Staging/Prod Process
Teams that think they’re streamlining their process efficiency by putting Cassandra changes straight into production actually enable a pipeline for bugs, performance roadblocks and data inconsistencies. Testing and staging environments are essential for validating changes before putting them into production environments and will save teams countless hours of headaches.
At the end of the day, running all data migrations, changes to schema and application updates through testing and staging environments is far more efficient than putting them straight into production and then cleaning up myriad live issues.
Set Up Monitoring and Alerts
Teams implementing monitoring and alerts to track metrics and flag anomalies can mitigate trouble spots before they become full-blown service interruptions. The speed at which teams become aware of issues can mean the difference between a behind-the-scenes blip and a downtime event.
Have Backup and Disaster Recovery at the Ready
In addition to standing up robust monitoring and alerting, teams should regularly test and run practice drills on their procedures for recovering from disasters and using data backups. Don’t neglect this step; these measures are absolutely essential for ensuring the safety and resilience of systems and data.
The less prepared an organization is to recover from issues, the longer and more costly and impactful downtime will be. Incremental or snapshot backup strategies, replication that’s based in the cloud or across multiple data centers and fine-tuned recovery processes should be in place to minimize downtime, stress and confusion whenever the worst occurs.
Nurture Cassandra Expertise
The expertise required to optimize Cassandra configurations, operations and performance will only come with a dedicated focus. Enlisting experienced talent, instilling continuous training regimens that keep up with Cassandra updates, turning to external support and ensuring available resources — or all of the above — will position organizations to succeed in following the best practices highlighted here and achieving all of the benefits that Cassandra can deliver.
The post Will Your Cassandra Database Project Succeed?: The New Stack appeared first on Instaclustr.
Use Your Data in LLMs With the Vector Database You Already Have: The New Stack
Open source vector databases are among the top options out there for AI development, including some you may already be familiar with or even have on hand.
Vector databases allow you to enhance your LLM models with data from your internal data stores. Prompting the LLM with local, factual knowledge can allow you to get responses tailored to what your organization already knows about the situation. This reduces “AI hallucination” and improves relevance.
You can even ask the LLM to add references to the original data it used in its answer so you can check yourself. No doubt vendors have reached out with proprietary vector database solutions, advertised as a “magic wand” enabling you to assuage any AI hallucination concerns.
But, ready for some good news?
If you’re already using Apache Cassandra 5.0, OpenSearch or PostgreSQL, your vector database success is already primed. That’s right: There’s no need for costly proprietary vector database offerings. If you’re not (yet) using these free and fully open source database technologies, your generative AI aspirations are a good reason to migrate — they are all enterprise-ready and avoid the pitfalls of proprietary systems.
For many enterprises, these open source vector databases are the most direct route to implementing LLMs — and possibly leveraging retrieval augmented generation (RAG) — that deliver tailored and factual AI experiences.
Vector databases store embedding vectors, which are lists of numbers representing spatial coordinates corresponding to pieces of data. Related data will have closer coordinates, allowing LLMs to make sense of complex and unstructured datasets for features such as generative AI responses and search capabilities.
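For intuition, “closer coordinates” is typically measured with a similarity function such as cosine similarity, sketched below in plain Java (independent of any particular vector database):
// Sketch: cosine similarity between two embedding vectors; values near 1.0 mean semantically related data.
public class CosineSimilarity {
    static double cosine(float[] a, float[] b) {
        if (a.length != b.length) throw new IllegalArgumentException("dimension mismatch");
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        float[] docA = {0.12f, 0.87f, 0.33f};
        float[] docB = {0.10f, 0.91f, 0.30f};
        System.out.println(cosine(docA, docB)); // close to 1.0 -> related content
    }
}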
RAG, a process skyrocketing in popularity, involves using a vector database to translate the words in an enterprise’s documents into embeddings to provide highly efficient and accurate querying of that documentation via LLMs.
Let’s look closer at what each open source technology brings to the vector database discussion:
Apache Cassandra 5.0 Offers Native Vector Indexing
With its latest version (currently in preview), Apache Cassandra has added to its reputation as an especially highly available and scalable open source database by including everything that enterprises developing AI applications require.
Cassandra 5.0 adds native vector indexing and vector search, as well as a new vector data type for embedding vector storage and retrieval. The new version has also added specific Cassandra Query Language (CQL) functions that enable enterprises to easily use Cassandra as a vector database. These additions make Cassandra 5.0 a smart open source choice for supporting AI workloads and executing enterprise strategies around managing intelligent data.
OpenSearch Provides a Combination of Benefits
Like Cassandra, OpenSearch is another highly popular open source solution, one that many folks on the lookout for a vector database happen to already be using. OpenSearch offers a one-stop shop for search, analytics and vector database capabilities, while also providing exceptional nearest-neighbor search capabilities that support vector, lexical, and hybrid search and analytics.
With OpenSearch, teams can put the pedal down on developing AI applications, counting on the database to deliver the stability, high availability and minimal latency it’s known for, along with the scalability to account for vectors into the tens of billions. Whether developing a recommendation engine, generative AI agent or any other solution where the accuracy of results is crucial, those using OpenSearch to leverage vector embeddings and stamp out hallucinations won’t be disappointed.
The pgvector Extension Makes Postgres a Powerful Vector Store
Enterprises are no strangers to Postgres, which ranks among the most used databases in the world. Given that the database only needs the pgvector extension to become a particularly performant vector database, countless organizations are just a simple deployment away from harnessing an ideal infrastructure for handling their intelligent data.
pgvector is especially well suited to exact nearest-neighbor search, approximate nearest-neighbor search and distance-based embedding search, and it can use cosine distance (as recommended by OpenAI), L2 distance and inner product to recognize semantic similarities. Efficiency with those capabilities makes pgvector a powerful and proven open source option for training accurate LLMs and RAG implementations, while positioning teams to deliver trustworthy AI applications they can be proud of.
Was the Answer to Your AI Challenges in Front of You All Along?
The solution to tailored LLM responses isn’t investing in some expensive proprietary vector database and then trying to dodge the very real risks of vendor lock-in or a bad fit. At least it doesn’t have to be. Recognizing that available open source vector databases are among the top options out there for AI development — including some you may already be familiar with or even have on hand — should be a very welcome revelation.
The post Use Your Data in LLMs With the Vector Database You Already Have: The New Stack appeared first on Instaclustr.
Who Did What to That and When? Exploring the User Actions Feature
NetApp recently released the user actions feature on the Instaclustr Managed Platform, allowing customers to search for user actions recorded against their accounts and organizations. We record over 100 different types of actions, with detailed descriptions of what was done, by whom, to what, and at what time.
This provides customers with visibility into the actions users are performing on their linked accounts. NetApp has always collected this information in line with our security and compliance policies, but now, all important changes to your managed cluster resources have self-service access from the Console and the APIs.
In the past, this information was accessible only through support tickets when important questions such as “Who deleted my cluster?” and “When was the firewall rule removed from my cluster?” needed answers. This feature adds more self-discoverability of what your users are doing and what our support staff are doing to keep your clusters healthy.
This blog post provides a detailed walkthrough of this new feature at a moderate level of technical detail, with the hope of encouraging you to explore and better find the actions you are looking for.
For this blog, I’ve created two Apache Cassandra® clusters in one account and performed some actions on each. I’ve also created an organization linked to this account and performed some actions on that. This will allow a full example UI to be shown and demonstrate the type of “stories” that can emerge from typical operations via user actions.
Introducing Global Directory
During development, we decided to consolidate the other global account pages into a new centralized location, which we are calling the “Directory”.
This Directory provides you with the consolidated view of all organizations and accounts that you have access to, collecting global searches and account functions into a view that does not have a “selected cluster” context (i.e., global). For more information on how Organizations, Accounts and Clusters relate to each other, check out this blog.
Organizations serve as an efficient method to consolidate all associated accounts into a single unified, easily accessible location. They introduce an extra layer to the permission model, facilitating the management and sharing of information such as contact and billing details. They also streamline the process of Single Sign-On (SSO) and account creation.
Let’s log in and click on the new button:
This will take us to the new directory landing page:
Here, you will find two types of global searches: accounts and user actions, as well as account creation. Selecting the new “User Actions” item will take us to the new page. You can also navigate to these directory pages directly from the top right ‘folder’ menu:
User Action Search Page: Walkthrough
This is the new page we land on if we choose to search for user actions:
When you first enter, it finds the last page of actions that happened in the accounts and organizations you have access to. It will show both organization and account actions on a single consolidated page, even though they are slightly different in nature.
*Note: The accessible accounts and organizations are defined as those you are linked to as CLUSTER_ADMIN or OWNER.
*TIP: If you don’t want an account user to see user actions, give them READ_ONLY access.
You may notice a brief progress bar display as the actions are retrieved. At the time of writing, we have recorded nearly 100 million actions made by our customers over a 6-month period.
From here, you can increase the number of actions shown on each page and page through the results. Sorting is not currently supported on the actions table, but it is something we will be looking to add in the future. For each action found, the table will display:
- Action: What happened to your account (or organization)? There are over 100 tracked kinds of actions recorded.
- Domain: The specific account or organization name that the action targeted.
- Description: An expanded description of what happened, using context captured at the time of action. Important values are highlighted between square brackets, and the copy button will copy the first one into the clipboard.
- User: The user who performed the action, typically via the console, APIs or Terraform provider; it can also be triggered by “Instaclustr Support” using our admin tools.
- For those actions marked with user “Instaclustr Support”, please reach out to support for more information about those actions we’ve taken on your behalf or visit https://support.instaclustr.com/hc/en-us.
- Local time: The action time from your local web browser’s perspective.
Additionally, for those who prefer programmatic access, the user action feature is fully accessible via our APIs, allowing for automation and integration into your existing workflows. Please visit our API documentation page here for more details.
Basic (super-search) Mode
Let’s say we only care about the “LeagueOfNations” organization domain; we can type ‘League’ and then click Search:
The name patterns are simple partial string patterns we look for as being ‘contained’ within the name, such as “Car” in “Carlton”. These are case insensitive. They are not (yet!) general regular expressions.
Advanced “find a needle” Search Mode
Sometimes, searching by names is not precise enough; you may want to provide more detailed search criteria, such as time ranges or narrowing down to specific clusters or kinds of actions. Expanding the “Advanced Search” section will switch the page to a more advanced search criteria form, disabling the basic search area and its criteria.
Let’s say we only want to see the “Link Account” actions over the last week:
We select it from the actions multi-chip selector using the cursor (we could also type it and allow autocomplete to kick in). Hitting search will give you your needle. Time to go chase that Carl guy down and ask why he linked that darn account:
The available criteria fields are as follows (additive in nature):
- Action: the kinds of actions, with a bracketed count of their frequency over the current criteria; if empty, all are included.
- Account: the account name of interest or its UUID; useful for narrowing matches to a specific account, especially when user, organization, and account names share string patterns, which makes the super-search less precise.
- Organization: the organization name of interest or its UUID.
- User: the user who performed the action.
- Description: matches against the value of an expanded description variable. This is useful because most actions mention the ‘target’ of the action, such as cluster-id, in the expanded description.
- Starting At: match actions starting from this time; this cannot be older than 12 months ago.
- Ending At: match actions up until this time.
Bonus Feature: Cluster Actions
While it’s nice to have this new search page, we wanted to build a higher-order question on top of it: What has happened to my cluster?
The answer can be found on the details tab of each cluster. When clicked on, it will take you directly to the user actions page with appropriate criteria to answer the question.
* TIP: we currently support entry into this view with a descriptionFormat queryParam, allowing you to save bookmarks to particular action ‘targets’. Further queryParams may be supported in the future for the remaining criteria: https://console2.instaclustr.com/global/searches/user-action?descriptionContextPattern=acde7535-3288-48fa-be64-0f7afe4641b3
Clicking this provides you the answer:
Future Thoughts
There are some future capabilities we will look to add, including the ability to subscribe to webhooks that trigger on some criteria. We would also like to add the ability to generate reports against a criterion or to run such things regularly and send them via email. Let us know what other feature improvements you would like to see!
Conclusion
This new capability allows customers to search for user actions directly without contacting support. It also provides improved visibility and auditing of what’s been changing on their clusters and who’s been making those changes. We hope you found this interesting and welcome any feedback for “higher-order” types of searches you’d like to see built on top of this new feature. What kind of common questions about user actions can you think of?
If you have any questions about this feature, please contact Instaclustr Support at any time. If you are not a current Instaclustr customer and you’re interested to learn more, register for a free trial and spin up your first cluster for free!
The post Who Did What to That and When? Exploring the User Actions Feature appeared first on Instaclustr.
Powering AI Workloads with Intelligent Data Infrastructure and Open Source
In the rapidly evolving technological landscape, artificial intelligence (AI) is emerging as a driving force behind innovation and efficiency. However, to harness its full potential, enterprises need suitable data infrastructures that can support AI workloads effectively.
This blog explores how intelligent data infrastructure, combined with open source technologies, is revolutionizing AI applications across various business functions. It outlines the benefits of leveraging existing infrastructure and highlights key open source databases that are indispensable for powering AI.
The Power of Open Source in AI Solutions
Open source technologies have long been celebrated for their flexibility, community support, and cost-efficiency. In the realm of AI these advantages are magnified. Here’s why open source is indispensable for AI-fueled solutions:
- Cost Efficiency: Open source solutions eliminate licensing fees, making them an attractive option for businesses looking to optimize their budgets.
- Community Support: A vibrant community of developers constantly improves these platforms, ensuring they remain cutting-edge.
- Flexibility and Customization: Open source tools can be tailored to meet specific needs, allowing enterprises to build solutions that align perfectly with their goals.
- Transparency and Security: With open source, you have visibility into the code, which allows for better security audits and trustworthiness.
Vector Databases: A Key Component for AI Workloads
Vector databases are increasingly indispensable for AI workloads. They store data in high-dimensional vectors, which AI models use to understand patterns and relationships. This capability is crucial for applications involving natural language processing, image recognition, and recommendation systems.
Vector databases use embedding vectors (lists of numbers) to represent data similarities and plot relationships spatially. For example, “plant” and “shrub” will have closer vector coordinates than “plant” and “car”. This allows enterprises to build their own LLMs, explore large text datasets, and enhance search capabilities.
Vector databases and embeddings also support retrieval augmented generation (RAG), which improves LLM accuracy by refining its understanding of new information. For example, RAG can let users query documentation by creating embeddings from an enterprise’s documents, translating words into vectors, finding similar words in the documentation, and retrieving relevant information. This data is then provided to an LLM, enabling it to generate accurate text answers for users.
The Role of Vector Databases in AI:
- Efficient Data Handling: Vector databases excel at handling large volumes of data efficiently, which is essential for training and deploying AI models.
- High Performance: They offer high-speed retrieval and processing of complex data types, ensuring AI applications run smoothly.
- Scalability: With the ability to scale horizontally, vector databases can grow alongside your AI initiatives without compromising performance.
Leveraging Existing Infrastructure for AI Workloads
Contrary to popular belief, it isn’t necessary to invest in new and exotic specialized data layer solutions. Your existing infrastructure can often support AI workloads with a few strategic enhancements:
- Evaluate Current Capabilities: Start by assessing your current data infrastructure to identify any gaps or areas for improvement.
- Upgrade Where Necessary: Consider upgrading components such as storage, network speed, and computing power to meet the demands of AI workloads.
- Integrate with AI Tools: Ensure your infrastructure is compatible with leading AI tools and platforms to facilitate seamless integration.
Open Source Databases for Enterprise AI
Several open source databases are particularly well-suited for enterprise AI applications. Let’s look at the 3 free open source databases that enterprise teams can leverage as they scale their intelligent data infrastructure for storing those embedding vectors:
PostgreSQL® and pgvector
“The world’s most advanced open source relational database”, PostgreSQL is also one of the most widely deployed, meaning that most enterprises will already have a strong foothold in the technology. The pgvector extension turns Postgres into a high-performance vector store, offering a path of least resistance for organizations familiar with PostgreSQL to quickly stand up intelligent data infrastructure.
From a RAG and LLM training perspective, pgvector excels at enabling distance-based embedding search, exact nearest neighbor search, and approximate nearest neighbor search. pgvector efficiently captures semantic similarities using L2 distance, inner product, and (the OpenAI-recommended) cosine distance. Teams can also harness OpenAI’s embeddings model (available as an API) to calculate embeddings for documentation and user queries. As an enterprise-ready open source option, pgvector is an already-proven solution for achieving efficient, accurate, and performant LLMs, helping equip teams to confidently launch differentiated and AI-fueled applications into production.
OpenSearch®
Because OpenSearch is a mature search and analytics engine already popular with a wide swath of enterprises, new and current users will be glad to know that the open source solution is ready to up the pace of AI application development as a singular search, analytics, and vector database.
OpenSearch has long offered low latency, high availability, and the scale to handle tens of billions of vectors while backing stable applications. It provides great nearest-neighbor search functionality to support vector, lexical, and hybrid search and analytics. These capabilities significantly simplify the implementation of AI solutions, from generative AI agents to recommendation engines with trustworthy results and minimal hallucinations.
Apache Cassandra® 5.0 with Native Vector Indexing
Known for its linear scalability and fault-tolerance on commodity hardware or cloud infrastructure, Apache Cassandra is a reliable choice for enterprise-grade AI applications. The newest version of the highly popular open source Apache Cassandra database introduces several new features built for AI workloads. It now includes Vector Search and Native Vector indexing capabilities.
Additionally, there is a new vector data type specifically for saving and retrieving embedding vectors, and new CQL functions for easily executing on those capabilities. By adding these features, Apache Cassandra 5.0 has emerged as an especially ideal database for intelligent data strategies and for enterprises rapidly building out AI applications across myriad use cases.
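As a quick illustration of those CQL additions, here is a minimal sketch; the shop.items table, its vector column v, and the 3-dimension vectors are hypothetical, and it assumes a storage-attached index already exists on v. Per the Cassandra 5.0 documentation, similarity_cosine is one of the new vector functions and ANN OF performs the vector search:
-- Assumes: CREATE TABLE shop.items (id int PRIMARY KEY, v vector<float, 3>); plus an SAI index on v
SELECT id, similarity_cosine(v, [0.1, 0.2, 0.3]) AS score
    FROM shop.items
    ORDER BY v ANN OF [0.1, 0.2, 0.3] LIMIT 5;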
Cassandra’s earned reputation for delivering high availability and scalability now adds AI-specific functionality, making it one of the most enticing open source options for enterprises.
Open Source Opens the Door to Successful AI Workloads
Clearly, given the tremendously rapid pace at which AI technology is advancing, enterprises cannot afford to wait to build out differentiated AI applications. But in this pursuit, engaging with the wrong proprietary data-layer solutions—and suffering the pitfalls of vendor lock-in or simply mismatched features—can easily be (and, for some, already is) a fatal setback. Instead, tapping into one of the very capable open source vector databases available will allow enterprises to put themselves in a more advantageous position.
When leveraging open source databases for AI workloads, consider the following:
- Data Security: Ensure robust security measures are in place to protect sensitive data.
- Scalability: Plan for future growth by choosing solutions that can scale with your needs.
- Resource Allocation: Allocate sufficient resources, such as computing power and storage, to support AI applications.
- Governance and Compliance: Adhere to governance and compliance standards to ensure responsible use of AI.
Conclusion
Intelligent data infrastructure and open source technologies are revolutionizing the way enterprises approach AI workloads. By leveraging existing infrastructure and integrating powerful open source databases, organizations can unlock the full potential of AI, driving innovation and efficiency.
Ready to take your AI initiatives to the next level? Leverage a single platform to help you design, deploy and monitor the infrastructure to support the capabilities of PostgreSQL with pgvector, OpenSearch, and Apache Cassandra 5.0 today.
And for more insights and expert guidance, don’t hesitate to contact us and speak with one of our open source experts!
The post Powering AI Workloads with Intelligent Data Infrastructure and Open Source appeared first on Instaclustr.
How Does Data Modeling Change in Apache Cassandra® 5.0 With Storage-Attached Indexes?
Data modeling in Apache Cassandra® is an important topic: how you model your data can affect your cluster’s performance, costs, and more. Today I’ll be looking at a new feature in Cassandra 5.0 called Storage-Attached Indexes (SAI), and how it affects the way you model data in Cassandra databases.
First, I’ll briefly cover what SAIs are (for more information about SAIs, check out this post). Then I’ll look at a few use cases where your modeling strategy could change with SAI. Finally, I’ll talk about the benefits and constraints of SAIs.
What Are Storage-Attached Indexes?
From the Cassandra 5.0 documentation, Storage-Attached Indexes (SAIs) “[provide] an indexing mechanism that is closely integrated with the Cassandra storage engine to make data modeling easier for users.” Secondary indexing, meaning indexing values on columns that are not part of a table’s primary key, has been available in Cassandra before (as SASI and 2i). However, SAI will replace that existing functionality, which is deprecated in 5.0 and tentatively slated for removal in Cassandra 6.0.
This is because SAI improves upon the older methods in several key ways. For one, according to the developers, SAI is the fastest indexing method available for Cassandra clusters, a performance boost that makes indexing practical in production environments. It also has lower storage overhead than prior implementations, which reduces storage costs, and it lowers the latency of index lookups, so users are less likely to abandon interactions because of slow responses.
How Do SAIs work?
SAIs are implemented as part of the SSTables, or Sorted String Tables, of a Cassandra database: SAI indexes Memtables and SSTables as they are written, and at read time it filters from both in-memory and on-disk sources to produce the matching rows for the indexed columns. I’m not going to go into too much detail here because there are a lot of existing resources on this exciting topic: see the Cassandra 5.0 Documentation and the Instaclustr site for examples.
The main thing to keep in mind is that SAI is attached to Cassandra’s storage engine, and as a result it is much more performant in terms of speed, scalability, and data storage. This means that you can use indexing reliably in production beginning with Cassandra 5.0, which quickly opens up simpler data models.
To learn more about how SAIs work, check out this piece from the Apache Cassandra blog.
What Is SAI For?
SAI is a filtering engine, and while it does have some functionality overlap with search engines, it directly says it is “not an enterprise search engine” (source).
SAI is meant for creating filters on non-primary-key or composite partition keys (source), essentially meaning that you can enable a ‘WHERE’ clause on any column in your Cassandra 5.0 database. This makes queries a lot more flexible without sacrificing latency or storage space as with prior methods.
How Can We Use SAI When Data Modeling in Cassandra 5.0?
Because of the increased scalability and performance of SAIs, data modeling in Cassandra 5.0 will most definitely change.
You will be able to search collections more thoroughly and easily, for instance, and indexing becomes a realistic option when designing your Cassandra queries. SAI also allows new query types, which can improve your existing queries and, given Cassandra’s query-first design paradigm, your table design along with them.
But what if you’re not on a greenfield project and want to use SAIs? No problem! SAI is backwards-compatible, and you can migrate your application one index at a time if you need.
How Do Storage-Attached Indexes Affect Data Modeling in Apache Cassandra 5.0?
Cassandra’s SAI was designed with data modeling in mind (source). It unlocks new query patterns that make data modeling easier in quite a few cases. In the Cassandra team’s words: “You can create a table that is most natural for you, write to just that table, and query it any way you want.” (source)
I think another great way to look at how SAIs affect data modeling is by looking at some queries that could be asked of SAI-indexed data. This is because Cassandra data modeling relies heavily on the queries that will be used to retrieve the data. I’ll take a look at 2 use cases: indexing to query non-primary-key columns, and indexing a collection to manage a one-to-many relationship.
Use Case: Querying on Values of Non-Primary-Key Columns
You may find you’re searching for records with a particular value in a particular column often in a table. An example may be a search form for a large email inbox with lots of filters. You could find yourself looking at a record like:
- Subject
- Sender
- Receiver
- Body
- Time sent
Your table creation may look like:
CREATE KEYSPACE IF NOT EXISTS inbox
    WITH REPLICATION = { 'class': 'SimpleStrategy', 'replication_factor': 3 };
CREATE TABLE IF NOT EXISTS inbox.emails (
    id int,
    sender text,
    receivers text,
    subject text,
    body text,
    timeSent timestamp,
    PRIMARY KEY (id)
);
If you allow users to search for a particular subject or sender and the data set is large, not having SAIs could make query times painful (in fact, Cassandra rejects a filter on a non-indexed, non-key column unless you add ALLOW FILTERING):
SELECT * FROM inbox.emails WHERE sender = 'sam.example@example.com';
To fix this problem, we can create SAI indexes on the sender, receivers, subject, and body columns:
CREATE CUSTOM INDEX IF NOT EXISTS sender_sai_idx ON inbox.emails (sender)
    USING 'StorageAttachedIndex'
    WITH OPTIONS = { 'case_sensitive': 'false', 'normalize': 'true', 'ascii': 'true' };
CREATE CUSTOM INDEX IF NOT EXISTS receivers_sai_idx ON inbox.emails (receivers)
    USING 'StorageAttachedIndex'
    WITH OPTIONS = { 'case_sensitive': 'false', 'normalize': 'true', 'ascii': 'true' };
CREATE CUSTOM INDEX IF NOT EXISTS body_sai_idx ON inbox.emails (body)
    USING 'StorageAttachedIndex'
    WITH OPTIONS = { 'case_sensitive': 'false', 'normalize': 'true', 'ascii': 'true' };
CREATE CUSTOM INDEX IF NOT EXISTS subject_sai_idx ON inbox.emails (subject)
    USING 'StorageAttachedIndex'
    WITH OPTIONS = { 'case_sensitive': 'false', 'normalize': 'true', 'ascii': 'true' };
Once you’ve established the indexes, you can run the same query and it will automatically use the SAI index to find all emails with a sender of “sam.example@example.com”, and subject or body matches can be queried the same way. Note that although the data model changed with the inclusion of the indexes, the SELECT query does not change, and the fields of the table stayed the same as well!
Use Case: Managing One-To-Many Relationships
Going back to the previous example, one email could have many recipients. Prior to secondary indexes, you would need to scan the recipient data of every row in the table in order to query on recipients. This could be solved in a few ways. One is to create a join table for recipients that contains an id, email id, and recipient. This becomes complicated once you add the constraint that each recipient should only appear once per email. With SAI, we now have an index-based solution: store the recipients as a collection in each row and create an index on that collection.
The script to create the table and indices changes a bit:
CREATE TABLE IF NOT EXISTS inbox.emails (
    id int,
    sender text,
    receivers set<text>,
    subject text,
    body text,
    timeSent timestamp,
    PRIMARY KEY (id)
);
The text type of receivers changes to a set<text>. A set is used because each recipient should only occur once per email. This takes the logic you would have had to implement for the join table solution and moves it into Cassandra.
The indexing code remains mostly the same, except for the creation of the index for receivers:
CREATE CUSTOM INDEX IF NOT EXISTS receivers_sai_idx ON inbox.emails (receivers) USING 'StorageAttachedIndex';
That’s it! One line of CQL and there’s now an index on receivers. We can query for emails with a particular receiver:
SELECT * FROM inbox.emails WHERE receivers CONTAINS 'sam.example@examplemail.com';
There are many one-to-many relationships that can be simplified in Cassandra with the use of secondary indexes and SAI.
What Are the Benefits of Data Modeling With Storage-Attached Indexes?
There are many benefits to using SAI when data modeling in Cassandra 5.0:
- Query performance: because of SAI’s implementation, it has much faster query speeds than previous implementations, and indexed data is generally faster to search than unindexed data. This means you have more flexibility to search within your data and write queries that search non-primary-key columns and collections.
- Move over piecemeal: SAI’s backwards compatibility, coupled with how little your table structure has to change to add SAIs, means you can move your data models over piece by piece.
- Data storage overhead: SAI has much lower data overhead than previous secondary index implementations, meaning more flexibility in what you can store in your data models without impacting overall storage needs.
- More complex queries/features: SAI allows you to write much more thorough queries against indexed columns, and offers up a lot of new functionality, like:
- Vector Search
- Numeric Range queries
- AND queries within indexes
- Support for map/set/list collections
What Are the Constraints of Storage-Attached Indexes?
While there are benefits to SAI, there are also a few constraints, including:
- Because SAI is attached to the SSTable mechanism, the performance of queries on indexed columns will be “highly dependent on the compaction strategy in use” (per the Cassandra 5.0 CEP-7)
- SAI is not designed for unlimited-size data sets, such as logs; indexing on a dataset like that would cause performance issues. The reason for this is read latency at higher row counts spread across a cluster. It is also related to consistency level (CL), as the higher the CL is, the more nodes you’ll have to ping on larger datasets. (Source).
- Query complexity: while you can query as many indexes as you like, when you do so, you incur a cost related to the number of index values processed. Be aware when designing queries to select from as few indexes as necessary.
- You cannot index multiple columns in one index, as there is a 1-to-1 mapping of an SAI index to a column. You can, however, create separate indexes and query them simultaneously (see the sketch after this list).
- This is a v1, and some features, like the LIKE comparison for strings, the OR operator, and global sorting, are all slated for v2.
- Disk usage: SAI uses an extra 20-35% of disk space over unindexed data, although that is much less than previous indexing implementations consumed (source). Even so, don’t make every column an index unless you need to; you’ll save disk space and maintain query performance.
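As a quick sketch of that last workaround, reusing the hypothetical inbox.emails table and the sender and subject SAI indexes from the use cases above (the subject value here is made up), a single query can intersect two single-column indexes:
-- sender and subject each have their own single-column SAI index;
-- one query can still combine both predicates with AND, without ALLOW FILTERING
SELECT * FROM inbox.emails
    WHERE sender = 'sam.example@example.com'
    AND subject = 'Quarterly report';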
Conclusion
SAI is a very robust solution for secondary indexes, and its addition to Cassandra 5.0 opens the door to several new data modeling strategies: from searching non-primary-key columns, to managing one-to-many relationships, to vector search. To learn more about SAI, read this post from the Instaclustr by NetApp blog, or check out the documentation for Cassandra 5.0.
If you’d like to test SAI without setting up and configuring Cassandra yourself, Instaclustr has a free trial and you can spin up Cassandra 5.0 clusters today through a public preview! Instaclustr also offers a bunch of educational content about Cassandra 5.0.
The post How Does Data Modeling Change in Apache Cassandra® 5.0 With Storage-Attached Indexes? appeared first on Instaclustr.
Cassandra Lucene Index: Update
**An important update regarding support of Cassandra Lucene Index for Apache Cassandra® 5.0 and the retirement of Apache Lucene Add-On on the Instaclustr Managed Platform.**
Instaclustr by NetApp has been maintaining the new fork of the Cassandra Lucene Index plug-in since its announcement in 2018. After extensive evaluation, we have decided not to upgrade the Cassandra Lucene Index to support Apache Cassandra® 5.0. This decision aligns with the evolving needs of the Cassandra community and the capabilities offered by Storage-Attached Indexing (SAI) in Cassandra 5.0.
SAI introduces significant improvements in secondary indexing, while simplifying data modeling and creating new use cases in Cassandra, such as Vector Search. While SAI is not a direct replacement for the Cassandra Lucene Index, it offers a more efficient alternative for many indexing needs.
For applications requiring advanced indexing features, such as full-text search or geospatial queries, users can consider external integrations, such as OpenSearch®, that offer numerous full-text search and advanced analysis features.
We are committed to maintaining the Cassandra Lucene Index for currently supported and newer versions of Apache Cassandra 4 (including minor and patch-level versions) for users who rely on its advanced search capabilities. We will continue to release bug fixes and provide necessary security patches for the supported versions in the public repository.
Retiring Apache Lucene Add-On for Instaclustr for Apache Cassandra
Similarly, Instaclustr is commencing the retirement process of the Apache Lucene add-on on its Instaclustr Managed Platform. The offering will move to the Closed state on July 31, 2024. This means that the add-on will no longer be available for new customers.
However, it will continue to be fully supported for existing customers with no restrictions on SLAs, and new deployments will be permitted by exception. Existing customers should be aware that the add-on will not be supported for Cassandra 5.0. For more details about our lifecycle policies, please visit our website here.
Instaclustr will work with the existing customers to ensure a smooth transition during this period. Support and documentation will remain in place for our customers running the Lucene add-on on their clusters.
For those transitioning to, or already using the Cassandra 5.0 beta version, we recommend exploring how Storage-Attached Indexing can help you with your indexing needs. You can try the SAI feature as part of the free trial on the Instaclustr Managed Platform.
We thank you for your understanding and support as we continue to adapt and respond to the community’s needs.
If you have any questions about this announcement, please contact us at support@instaclustr.com.
The post Cassandra Lucene Index: Update appeared first on Instaclustr.
Building a 100% ScyllaDB Shard-Aware Application Using Rust
I wrote a web transcript of the talk I gave with my colleagues Joseph and Yassir at [Scylla Su...
Learning Rust the hard way for a production Kafka+ScyllaDB pipeline
This is the web version of the talk I gave at [Scylla Summit 2022](https://www.scyllad...
On Scylla Manager Suspend & Resume feature
!!! warning "Disclaimer" This blog post is neither a rant nor intended to undermine the great work that...
Renaming and reshaping Scylla tables using scylla-migrator
We have recently faced a problem where some of the first Scylla tables we created on our main production cluster were not in line any more with the evolved s...
Python scylla-driver: how we unleashed the Scylla monster's performance
At Scylla summit 2019 I had the chance to meet Israel Fruchter and we dreamed of working on adding **shard...
Scylla Summit 2019
I've had the pleasure to attend again and present at the Scylla Summit in San Francisco and the honor to be awarded the...
Scylla: four ways to optimize your disk space consumption
We recently had to face free disk space outages on some of our scylla clusters and we learnt some very interesting things while outlining some improvements t...
Scylla Summit 2018 write-up
It's been almost one month since I had the chance to attend and speak at Scylla Summit 2018 so I'm reliev...
Authenticating and connecting to a SSL enabled Scylla cluster using Spark 2
This quick article is a wrap up for reference on how to connect to ScyllaDB using Spark 2 when authentication and SSL are enforced for the clients on the...
A botspot story
I felt like sharing a recent story that allowed us to identify a bot in a haystack thanks to Scylla.
...
Evaluating ScyllaDB for production 2/2
In my previous blog post, I shared [7 lessons on our experience in evaluating Scylla](https://www.ultrabug.fr...
Evaluating ScyllaDB for production 1/2
I have recently been conducting a quite deep evaluation of ScyllaDB to find out if we could benefit from this database in some of...
Enhance Apache Cassandra Logging
Cassandra usually outputs all its logs to a system.log file. It uses the old log4j 1.2 for Cassandra 2.0 and, since 2.1, logback, which of course uses a different syntax :)
Logs can be enhanced with some configuration. These explanations work with Cassandra 2.0.x and Cassandra 2.1.x; I haven’t tested other versions yet.
I wanted to split logs into different files, depending on their “sources” (repair, compaction, tombstones, etc.), to ease debugging, while keeping system.log as usual.
For example, to declare 2 new files to handle, say, Repair and Tombstones logs:
Cassandra 2.0:
You need to declare each new log file in the log4j-server.properties file.
Cassandra 2.1:
It is in the logback.xml file.
Now that these new files are declared, we need to fill them with logs. To do that, simply redirect some Java classes to the right file. To redirect the class org.apache.cassandra.db.filter.SliceQueryFilter, log level WARN, to the Tombstone file, simply add:
Cassandra 2.0:
Cassandra 2.1:
It’s an on-the-fly configuration, so there’s no need to restart Cassandra!
Now you will have dedicated files for each kind of log.
A list of interesting Cassandra classes:
You can find which Java class a log message comes from by adding “%c” to the log4j/logback “ConversionPattern”:
You can disable “additivity” (i.e., avoid also adding the messages to system.log, for example) in log4j for a specific class by adding:
For logback, you can add additivity="false" to <logger .../> elements.
To migrate from log4j logs to logback.xml, you can look at http://logback.qos.ch/translator/
Sources :
- http://docs.datastax.com/en/cassandra/2.1/cassandra/configuration/configLoggingLevels_r.html
- http://docs.datastax.com/en/cassandra/2.0/cassandra/configuration/configLoggingLevels_t.html
- https://logging.apache.org/log4j/1.2/manual.html
- http://logback.qos.ch/manual/appenders.html
Note: you can add http://blog.alteroot.org/feed.cassandra.xml to your rss aggregator to follow all my Cassandra posts :)
How to change Cassandra compaction strategy on a production cluster
I’ll talk about changing the Cassandra CompactionStrategy on a live production cluster.
First of all, an extract from the Cassandra documentation:
Periodic compaction is essential to a healthy Cassandra database because Cassandra does not insert/update in place. As inserts/updates occur, instead of overwriting the rows, Cassandra writes a new timestamped version of the inserted or updated data in another SSTable. Cassandra manages the accumulation of SSTables on disk using compaction. Cassandra also does not delete in place because the SSTable is immutable. Instead, Cassandra marks data to be deleted using a tombstone.
By default, Cassandra uses SizeTieredCompactionStrategy (STCS). This strategy triggers a minor compaction when there are a number of similar-sized SSTables on disk, as configured by the min_threshold table subproperty (4 by default).
Another compaction strategy, available since Cassandra 1.0, is LeveledCompactionStrategy (LCS), based on LevelDB. Since 2.0.11, DateTieredCompactionStrategy is also available.
Depending on your needs, you may need to change the compaction strategy on a running cluster. Changing this setting involves rewriting ALL SSTables to the new strategy, which may take a long time and can be CPU and I/O intensive.
I needed to change the compaction strategy on my production cluster to LeveledCompactionStrategy because of our workflow: lots of updates and deletes, wide rows, etc.
Moreover, with the default STCS, the largest SSTable that gets created will not be compacted again until the amount of actual data increases four-fold. So it can take a long time before old data is really deleted!
Note: You can test a new compactionStrategy on one new node with the write_survey bootstrap option. See the datastax blogpost about it.
The basic procedure to change the CompactionStrategy is to alter the table via CQL:
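A minimal sketch of that statement, assuming a hypothetical mykeyspace.mytable table; the compaction map is how the strategy is set, and 160MB matches the default sstable_size_in_mb mentioned later in this post:
-- Switch the table to LeveledCompactionStrategy
ALTER TABLE mykeyspace.mytable
    WITH compaction = { 'class': 'LeveledCompactionStrategy', 'sstable_size_in_mb': 160 };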
If you run the ALTER TABLE to change to LCS like that, all nodes will recompact their data at the same time, so performance problems can occur for hours or days…
A better solution is to migrate node by node!
You need to change the compaction locally, on the fly, via JMX, like in write_survey mode. I use jmxterm for that. I think I’ll write articles about all these JMX things :)
For example, to change to LCS on the mytable table with jmxterm, a nice one-liner:
On the next commitlog flush, the node will start compacting to rewrite all of its mytable SSTables to the new strategy.
You can see the progress with nodetool:
You need to wait for the node to recompact all of its SSTables, then change the strategy on instance2, etc.
The transition will be done in multiple compactions if you have lots of data. By default, new SSTables will be 160MB large.
You can monitor your table with nodetool cfstats too:
You can see the 31/4: it means that there are 31 SSTables in L0, whereas Cassandra tries to have only 4 in L0.
Taken from the code (src/java/org/apache/cassandra/db/compaction/LeveledManifest.java):
When all nodes have the new strategy, let’s go for the global ALTER TABLE. /!\ If a node restarts before the final ALTER TABLE, it will recompact to the default strategy (SizeTiered)!
Et voilà, I hope this article will help you :)
My latest Cassandra blogpost was one year ago… I have several in mind (jmx things !) so stay tuned !
Replace a dead node in Cassandra
Note (June 2020): this article is old and not really relevant anymore. If you use a modern version of Cassandra, look at the -Dcassandra.replace_address_first_boot option!
I want to share some tips about my experimentations with Cassandra (version 2.0.x).
I found some documentation on the DataStax website about replacing a dead node, but it was not suitable for our needs, because in case of a hardware crash, we will set up a new node with exactly the same IP (replace “in place”). Update: the documentation is now up to date on DataStax!
If you try to start the new node with the same IP, Cassandra doesn’t start, failing with:
java.lang.RuntimeException: A node with address /10.20.10.2 already exists, cancelling join. Use cassandra.replace_address if you want to replace this node.
So, we need to use the “cassandra.replace_address” directive (which is not really documented? :(). See this commit and this bug report; available since 1.2.11/2.0.0, it’s an easier solution and it works.
+ - New replace_address to supplant the (now removed) replace_token and
+ replace_node workflows to replace a dead node in place. Works like the
+ old options, but takes the IP address of the node to be replaced.
It’s a JVM directive, so we can add it at the end of /etc/cassandra/cassandra-env.sh (debian package), for example:
JVM_OPTS="$JVM_OPTS -Dcassandra.replace_address=10.20.10.2"
Of course, 10.20.10.2 = ip of your dead/new node.
Now, start Cassandra, and in the logs you will see:
INFO [main] 2014-03-10 14:58:17,804 StorageService.java (line 941) JOINING: schema complete, ready to bootstrap
INFO [main] 2014-03-10 14:58:17,805 StorageService.java (line 941) JOINING: waiting for pending range calculation
INFO [main] 2014-03-10 14:58:17,805 StorageService.java (line 941) JOINING: calculation complete, ready to bootstrap
INFO [main] 2014-03-10 14:58:17,805 StorageService.java (line 941) JOINING: Replacing a node with token(s): [...]
[...]
INFO [main] 2014-03-10 14:58:17,844 StorageService.java (line 941) JOINING: Starting to bootstrap...
INFO [main] 2014-03-10 14:58:18,551 StreamResultFuture.java (line 82) [Stream #effef960-6efe-11e3-9a75-3f94ec5476e9] Executing streaming plan for Bootstrap
The node is in bootstrapping mode and will retrieve data from the cluster. This may take a lot of time.
If the node is a seed node, a warning will indicate that the node did not auto bootstrap. This is normal; you need to run a nodetool repair on the node.
On the new node :
# nodetool netstats
Mode: JOINING
Bootstrap effef960-6efe-11e3-9a75-3f94ec5476e9
/10.20.10.1
Receiving 102 files, 17467071157 bytes total
[...]
After some time, you will see some information in the logs!
On the new node :
INFO [STREAM-IN-/10.20.10.1] 2014-03-10 15:15:40,363 StreamResultFuture.java (line 215) [Stream #effef960-6efe-11e3-9a75-3f94ec5476e9] All sessions completed
INFO [main] 2014-03-10 15:15:40,366 StorageService.java (line 970) Bootstrap completed! for the tokens [...]
[...]
INFO [main] 2014-03-10 15:15:40,412 StorageService.java (line 1371) Node /10.20.10.2 state jump to normal
WARN [main] 2014-03-10 15:15:40,413 StorageService.java (line 1378) Not updating token metadata for /10.20.30.51 because I am replacing it
INFO [main] 2014-03-10 15:15:40,419 StorageService.java (line 821) Startup completed! Now serving reads.
And on the other nodes:
INFO [GossipStage:1] 2014-03-10 15:15:40,625 StorageService.java (line 1371) Node /10.20.10.2 state jump to normal
Et voilà, the dead node has been replaced!
Don’t forget to REMOVE the modifications to cassandra-env.sh after the bootstrap is complete!
Enjoy!