Database Internals: Working with CPUs

Get a database engineer’s inside look at how the database interacts with the CPU…in this excerpt from the book, “Database Performance at Scale.” Note: The following blog is an excerpt from Chapter 3 of the Database Performance at Scale book, which is available for free. This book sheds light on often overlooked factors that impact database performance at scale. *** A database’s internal architecture makes a tremendous impact on the latency it can achieve and the throughput it can handle. Being an extremely complex piece of software, a database doesn’t exist in a vacuum, but rather interacts with the environment which includes the operating system and the hardware. While it’s one thing to get massive terabyte-to-petabyte scale systems up and running, it’s a whole other thing to make sure they are operating at peak efficiency. In fact, it’s usually more than just “one other thing.” Performance optimization of large distributed systems is usually a multivariate problem — combining aspects of the underlying hardware, networking, tuning operating systems, or finagling with layers of virtualization and application architectures. Such a complex problem warrants exploration from multiple perspectives. In this chapter, we’ll begin the discussion of database internals by looking at ways that databases can optimize performance by taking advantage of modern hardware and operating systems. We’ll cover how the database interacts with the operating system plus CPUs, memory, storage, and networking. Then, in the next chapter, we’ll shift focus to algorithmic optimizations. Note: This blog focuses exclusively on CPUs, but you can access the complete book (free, Open Access). Working with CPUs Programming books tell programmers that we have this CPU that can run processes or threads, and what runs means is that there’s some simple sequential instruction execution. Then there’s a footnote explaining that with multiple threads you might need to consider doing some synchronization. In fact, how things actually get executed inside CPU cores is something completely different and much more complicated. It would be very difficult to program these machines if we didn’t have those abstractions from books, but they are a lie to some degree and how you can efficiently take advantage of CPU capabilities is still very important. Share nothing across cores Individual CPU cores aren’t getting any faster. Their clock speeds reached a performance plateau long ago. Now, the ongoing increase of CPU performance continues horizontally: by increasing the number of processing units. In turn, the increase in the number of cores means that performance now depends on coordination across multiple cores (versus the throughput of a single core). On modern hardware, the performance of standard workloads depends more on the locking and coordination across cores than on the performance of an individual core. Software architects face two unattractive alternatives: Coarse-grained locking, which will see application threads contend for control of the data and wait instead of producing useful work. Fine-grained locking, which, in addition to being hard to program and debug, sees significant overhead even when no contention occurs due to the locking primitives themselves. Consider an SSD drive. The typical time needed to communicate with an SSD on a modern NVMe device is quite lengthy – it’s about 20 µseconds. That’s enough time for the CPU to execute tens of thousands of instructions. Developers should consider it as a networked device but generally do not program in that way. Instead, they often use an API that is synchronous (we return to this later in the book), which produces a thread that can be blocked. Looking at the image of the logical layout of an Intel Xeon Processor (Figure 3-1), it’s clear that this is also a networked device. Figure 3-1: The logical layout of an Intel Xeon Processor The cores are all connected by what is essentially a network — a dual ring interconnect architecture. There are two such rings and they are bidirectional. Why should developers use a synchronous API for that then? Since the sharing of information across cores requires costly locking, a shared-nothing model is perfectly worth considering. In such a model, all requests are sharded onto individual cores, one application thread is run per core, and communication depends on explicit message passing, not shared memory between threads. This design avoids slow, unscalable lock primitives and cache bounces. Any sharing of resources across cores in modern processors must be handled explicitly. For example, when two requests are part of the same session and two CPUs each get a request that depends on the same session state, one CPU must explicitly forward the request to the other. Either CPU may handle either response. Ideally, your database provides facilities that limit the need for cross-core communication – but when communication is inevitable, it provides high-performance non-blocking communication primitives to ensure performance is not degraded. Futures-Promises There are many solutions for coordinating work across multiple cores. Some are highly programmer-friendly and enable the development of software that works exactly as if it were running on a single core. For example, the classic Unix process model is designed to keep each process in total isolation and relies on kernel code to maintain a separate virtual memory space per process. Unfortunately, this increases the overhead at the OS level. There’s a model known as “futures and promises.” A future is a data structure that represents some yet-undetermined result. A promise is the provider of this result. It can be helpful to think of a promise/future pair as a first-in first-out (FIFO) queue with a maximum length of one item, which may be used only once. The promise is the producing end of the queue, while the future is the consuming end. Like FIFOs, futures and promises are used to decouple the data producer and the data consumer. However, the optimized implementations of futures and promises need to take several considerations into account. While the standard implementation targets coarse-grained tasks that may block and take a long time to complete, optimized futures and promises are used to manage fine-grained, non-blocking tasks. In order to meet this requirement efficiently, they should: Require no locking Not allocate memory Support continuations Future-promise design eliminates the costs associated with maintaining individual threads by the OS and allows close to complete utilization of the CPU. On the other hand, it calls for user-space CPU scheduling and very likely limits the developer with voluntary preemption scheduling. The latter, in turn, is prone to generating phantom jams in popular producer-consumer programming templates. Applying future-promise design to database internals has obvious benefits. First of all, database workloads can be naturally CPU-bound. For example, that’s typically the case with in-memory database engines, and aggregates’ evaluations also involve pretty intensive CPU work. Even for huge on-disk data sets, when the query time is typically dominated by the I/O, CPU should be considered. Parsing a query is a CPU-intensive task regardless of whether the workload is CPU-bound or storage-bound, and collecting, converting, and sending the data back to the user also calls for careful CPU utilization. And last but not least: processing the data always involves a lot of high-level operations and low-level instructions. Maintaining them in an optimal manner requires a good low-level programming paradigm and future-promises is one of the best choices. However, large instruction sets need even more care; this leads us to “execution stages.” Execution Stages Let’s dive deeper into CPU microarchitecture because (as discussed previously) database engine CPUs typically need to deal with millions and billions of instructions, and it’s essential to help the poor thing with that. In a very simplified way, the microarchitecture of a modern x86 CPU – from the point of view of Top-Down Analysis – consists of four major components: Front End, Back-End, Branch Speculation, and Retiring. Front End The processor’s front end is responsible for fetching and decoding instructions that are going to be executed. It may become a bottleneck when there is either a latency problem or insufficient bandwidth. The former can be caused, for example, by instruction cache misses. The latter happens when the instruction decoders cannot keep up. In the latter case, the solution may be to attempt to make the hot path (or at least significant portions of it) fit in the decoded µop cache (DSB) or be recognizable by the loop detector (LSD). Branch speculation Pipeline slots that the Top-Down Analysis classifies as Bad Speculation are not stalled, but wasted. This happens when a branch is mispredicted and the rest of the CPU executes a µop that eventually cannot be committed. The branch predictor is generally considered to be a part of the front end. However, its problems can affect the whole pipeline in ways beyond just causing the back end to be undersupplied by the instruction fetch and decode. (Note: we’ll cover branch mispredictions in more detail a bit later.) Back End The back end receives decoded µops and executes them. A stall may happen either because of an execution port being busy or a cache miss. At the lower level, a pipeline slot may be core bound either due to data dependency or an insufficient number of available execution units. Stalls caused by memory can be caused by cache misses at different levels of data cache, external memory latency, or bandwidth. Retiring Finally, there are pipeline slots that get classified as Retiring. They are the lucky ones that were able to execute and commit their µop without any problems. When 100% of the pipeline slots are able to retire without a stall, then the program has achieved the maximum number of instructions per cycle for that model of the CPU. Although this is very desirable, it doesn’t mean that there’s no opportunity for improvement. Rather, it means that the CPU is fully utilized and the only way to improve the performance is to reduce the number of instructions. Implications for Databases The way CPUs are architectured has direct implications on the database design. It may very well happen that individual requests involve a lot of logic and relatively little data, which is a scenario that stresses the CPU significantly. This kind of workload will be completely dominated by the front end – instruction cache misses in particular. If we think about this for a moment, it shouldn’t really be very surprising though. The pipeline that each request goes through is quite long. For example, write requests may need to go through transport protocol logic, query parsing code, look up in the caching layer, or be applied to the in-memory structure where it will be waiting to be flushed on disk, etc. The most obvious way to solve this is to attempt to reduce the amount of logic in the hot path. Unfortunately, this approach does not offer a huge potential for significant performance improvement. Reducing the number of instructions needed to perform a certain activity is a popular optimization practice, but a developer cannot make any code shorter infinitely. At some point, the code “freezes” – literally. There’s some minimal amount of instructions needed even to compare two strings and return the result. It’s impossible to perform that with a single instruction. A higher-level way of dealing with instruction cache problems is called Staged Event-Driven Architecture (SEDA for short). It’s an architecture that splits the request processing pipeline into a graph of stages – thereby decoupling the logic from the event and thread scheduling. This tends to yield greater performance improvements than the previous approach. Access the complete book  – it’s free, Open Access

Vector Search in Apache Cassandra® 5.0

Apache Cassandra® is moving towards an AI-driven future. Cassandra’s high availability and ability to scale with large amounts of data have made it the obvious choice for many of these types of applications.  

In the constantly evolving world of data analysis, we have seen significant transformations over time. The capabilities needed now are those which support a new data type (vectors) that has accelerated in popularity with the growing adoption of Generative AI and large language models. But what does this mean? 

Data Analysis Past, Present and Future

Traditionally, data analysis was about looking back and understanding trends and patterns. Big data analytics and predictive ML applications emerged as powerful tools for processing and analyzing vast amounts of data to make informed predictions and optimize decision-making.

Today, real-time analytics has become the norm. Organizations require timely insights to respond swiftly to changing market conditions, customer demands, and emerging trends.

Tomorrow’s world will be shaped by the emergence of generative AI. Generative AI is a transformative approach that goes way beyond traditional analytics. Instead, it leverages machine learning models trained on diverse datasets to produce something entirely new while retaining similar characteristics and patterns learned from the training data. 

By training machine learning models on vast and varied datasets, businesses can unlock the power of generative AI. These models understand the underlying patterns, structures, and meaning in the data, enabling them to generate novel and innovative outputs. Whether it’s generating lifelike images, composing original music, or even creating entirely new virtual worlds, generative AI pushes the boundaries of what is possible. 

Broadly speaking, these generative AI models store the ingested data in numerical representation, known as vectors (we’ll dive deeper later). Vectors capture the essential features and characteristics of the data, allowing the models to understand and generate outputs that align with those learned patterns. 

With vector search capabilities in Apache Cassandra® 5.0 (now in public preview on the Instaclustr Platform), we enable the storage of these vectors and efficient searching and retrieval of them based on their similarity to the query vector. ‘Vector similarity search’ opens a whole new world of possibilities and lies at the core of generative AI.

The ability to create novel and innovative outputs that were never explicitly present in the training data has significant implications across many creative, exploratory, and problem-solving uses. Organizations can now unlock the full potential of their data by performing complex similarity-based searches at scale which is key to supporting AI workloads such as recommendation systems, image, text, or voice matching applications, fraud detection, and much more. 

What Is Vector Search? 

Vectors are essentially just a giant list of numbers across as many or as few dimensions as you want… but really they are just a list (or array) of numbers!  

Embeddings, by comparison, are a numerical representation of something else, like a picture, song, topic, book; they capture the semantic meaning behind each vector and encode those semantic understandings as a vector. 

Take, for example, the words “couch, “sofa, and “chair. All 3 words are individual vectors, and while their semantic relationships are similarthey are pieces of furniture after allbut “couch” and “sofa” are more closely related to each other than to “chair”. 

These semantic relationships are encoded as embeddingsdense numerical vectors that represent the semantic meaning of each word. As such, the embeddings for “couch” and “sofa” will be geometrically closer to each other than to the embedding for “chair.” “Couch”, “sofa, and “chair” will all be closer than, say, to the word “erosion”. 

Take a look at Figure 1 below to see that relationship: 

Figure 1: While “Sofa”, “Couch”, and “Chair” are all related, “Sofa” and “Couch” are semantically closer to each other than “Chair” (i.e., they are both designed for multiple people, while a chair is meant for just one).

“Erosion, on the other hand, shares practically no resemblance to the other vectors, which is why it is geometrically much further away. 

When computers search for things, they typically rely on exact matching, text matching, or open searching/object matching. Vector search, on the other hand, works by using these embeddings to search for semantics as opposed to terms, allowing you to get a much closer match in unstructured datasets based on the meaning of the word. 

Under the hood we can generate embeddings for our search term, and then using some vector math, find other embeddings that are geometrically close to our search term. This will return results that are semantically related to our search. For example, if we search for “sofa, then “couch” followed by “chair” will be returned. 

How Does Vector Search Work With Large Language Models (LLMs)? 

Now that we’ve established an understanding of vectors, embeddings, and vector searching, let’s explore the intersection of vector searching and Large Language Models (LLMs), a trending topic within the world of sophisticated AI models.

With Retrieval Augmented Generation (RAG), we use the vectors to look up related text or data from an external source as part of our query, as part of the query we also retrieve the original human readable text. We then feed the original text alongside the original prompt into the LLM for generated text output, allowing the LLM to use the additional provided context while generating the response.  

(PGVector does a very similar thing with PostgreSQL; check out our blog where we talk all about it). 

By utilizing this approach, we can take known information, query it in a natural language way, and receive relevant responses. Essentially, we input questions, prompts, and data sources into a model and generate informative responses based on that data. 

The combination of Vector Search and LLMs opens up exciting possibilities for more efficient and contextually rich GenAI applications powered by Cassandra. 

So What Does This Mean for Cassandra?

Well, with the additions of CEP-7 (Storage Attached Index, SAI) and CEP-30 (Approximate Nearest Neighbor, ANN Vector Search via SAI + Lucene) to the Cassandra 5.0 release, we can now store vectors and create indexes on them to perform similarity searches. This feature alone broadens the scope of possible Cassandra use cases.  

We utilize ANN (or “Approximate Nearest Neighbor”) as the fast and approximate search as opposed to k-NN (or “k-Nearest Neighbor”), which is a slow and exact algorithm of large, high-dimensional data; speed and performance are the priorities. 

Just about any application could utilize a vector search feature. Thus, in an existing data model, you would simply add a column of ‘vectors’ (Vectors are 32-bit floats, a new CQL data type that also takes a dimension input parameter). You would then create a vector search index to enable similarity searches on columns containing vectors.  

Searching complexity scales linearly with vector dimensions, so it is important to keep in mind vector dimensions and normalization. Ideally, we want normalized vectors (as opposed to vectors of all different sizes) when we look for similarity as it will lead to faster and more ‘correct’ results. With that, we can now filter our data by vector. 

Let’s walk through an example CQLSH session to demonstrate the creation, insertion, querying, and the new syntax that comes along with this feature.

1. Create your keyspace

CREATE KEYSPACE catalog WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3}

2. Create your table or add a column to an existing table with the vector type (ALTER TABLE) 

CREATE TABLE IF NOT EXISTS catalog.products (
 product_id text PRIMARY KEY, 
 product_vector VECTOR<float, 3>  

*Vector dimension of 3 chosen for simplicity, the dimensions of your vector will be dependent on the vectorization method*

3. Create your custom index with SAI on the vector column to enable ANN search

CREATE CUSTOM INDEX ann_index_for_similar_products ON products(product_vector) USING ‘StorageAttachedIndex’;

*Your custom index (ann_index_for_similar_products in this case) can be created with your choice of similarity functions: default cosine, dot product, or Euclidean similarity functions. See here* 

4. Load vector data using CQL insert

INSERT INTO catalog.products (product_id, product_vector) VALUES 
(‘SKU1’, [8, 2.3, 58]);
INSERT INTO catalog.products (product_id, product_vector) VALUES 
(‘SKU3’, [1.2, 3.4, 5.6]);
INSERT INTO catalog.products (product_id, product_vector) VALUES 
(‘SKU5’, [23, 18, 3.9]);

5. Query the data to perform a similarity search using the new CQL operator 

SELECT * FROM catalog.products WHERE product_vector ANN OF [3.4, 7.8, 9.1] limit 1;


product_id  | product_vector 
SKU5  | [23, 18, 3.9]

How Can You Use This?

E-Commerce (for Content Recommendation) 

We can start to think about simple e-commerce use cases like when a customer adds a product or service to their cart. You can now have a vector stored with that product, query the ANN algorithm (aka pull your next nearest neighbor(s)/most similar datapoint(s)), and display to that customer ‘similar’ products in ‘real-time.’  

The above example data model allows you to store and retrieve products based on their unique product_id. You can also perform queries or analysis based on the vector embeddings, such as finding similar products or generating personalized recommendations. The vector column can store the vector representation of the product, which can be generated using techniques like Word2Vec, Doc2Vec, or other vectorization methods.

In a practical application, you might populate this table with actual product data, including the unique identifier, name, category, and the corresponding vector embeddings for each product. The vector embeddings can be generated elsewhere using machine learning models and then inserted into a C* table. 

Content Generation (LLM or Data Augmentation) 

What about the real-time, generative AI applications that many companies (old and new) are focusing on today and in the near future?  

Well, Cassandra can serve the persistence of vectors for such applications. With this storage, we now have persisted data that can add supplemental context to an LLM query. This is the ‘RAG’ referred to earlier in this piece and enables uses like AI chatbots.  

We can think of the ever-popular ChatGPT that tends to hallucinate on anything beyond 2021its last knowledge update. With VSS on Cassandra, it is possible to augment such an LLM with stored vectors.  

Think of the various PDF plugins behind ChatGPT. We can feed the model PDFs, which, at a high level, it then vectorizes and stores in Cassandra (there are more intermediary steps to this) and is then ready to be queried in a natural language way.

Here is a list of just a subset of possible industries/use-cases that can extract the value of a such a feature-add to the Open Source Cassandra project: 

  • Cybersecurity (Anomaly Detection)
    • Ex. Financial Fraud Detection 
  • Language Processing (Language Translation)
  • Digital Marketing (Natural Language Generation)
  • Document Management (Document Clustering)   
  • Data Integration (Entity Matching or Deduplication) 
  • Image Recognition (Image Similarity Search) 
    • Ex. Healthcare (Medical Imaging) 
  • Voice Assistants (Voice and Speech Recognition)
  • Customer Support (AI Chatbots) 
  • Pharmaceutical (Drug Discovery) 

Cassandra 5.0 is in beta release on the Instaclustr Platform today (with GA coming soon)! Stay tuned for Part 2 of this blog series, where I will delve deeper and demonstrate real-world usage of VSS in Cassandra 5.0.


The post Vector Search in Apache Cassandra® 5.0 appeared first on Instaclustr.

Top Reasons Apache Cassandra® Projects Fail (and How to Overcome Them)

Apache Cassandra® is a popular and powerful distributed NoSQL database management system that is widely used for handling large amounts of data across multiple servers. However, like any complex system, Cassandra projects can face challenges and failures if not properly planned and managed.  

Here are some of the top reasons why Cassandra projects fail and how you can overcome them: 

1. Lack of proper data modeling 

Being a NoSQL database, Cassandra’s data model is fundamentally different from traditional relational databases. Improper data modeling can lead to performance issues, excessive use of secondary indexes, and difficulty in maintaining data consistency. 

Invest time in understanding Cassandra’s data model principles such as denormalization, partition keys, and clustering keys, and perform a thorough analysis of query patterns and data access patterns so you can design an effective data model that aligns with your application’s access patterns. 

2. Poor cluster configuration 

Incorrect cluster settings, such as insufficient nodes, improper partitioning, or inappropriate replication strategies, can lead to performance issues and data inconsistencies. 

It is vital to take time to understand the implications of various configuration settings and tune them based on your specific use case. You should carefully plan and configure the cluster according to your application’s requirements, considering factors like data distribution, replication factor, consistency levels, and node capacity. 

3. Ignoring data consistency trade-offs 

Cassandra offers tunable consistency levels to balance data consistency and availability. Failing to understand these trade-offs can lead to incorrect consistency level choices, resulting in data inconsistencies or excessive latency.  

It is important to evaluate the consistency requirements of your application carefully and choose appropriate consistency levels, based on factors like data sensitivity, read and write patterns, and tolerance for eventual consistency. 

4. Lack of monitoring and alerting 

Without alerting mechanisms and proper monitoring of key metrics, having to tune your system proactively to identify and resolve issues can lead to unexpected downtimes and operational challenges.  

It is essential to set up alerting mechanisms to notify your team promptly when issues arise, along with establishing and the regular testing of backup procedures.  

5. Ignoring maintenance tasks 

Neglecting routine maintenance tasks like compaction, repair, and nodetool operations can lead to data inconsistencies and degraded performance over time. 

Implement regular maintenance tasks as part of your operational procedures. Monitor and schedule compaction, repair, and other nodetool operations to ensure cluster health and optimal performance. 

6. Insufficient capacity planning and scaling challenges 

Underestimating the required resources (CPU, RAM, disk space) for your Cassandra cluster can lead to performance bottlenecks and potential data loss. Planning for future growth and scalability are therefore essential—if you don’t, you may suffer the consequences, including performance degradation and costly downtime.  

Capacity planning must take into consideration current and projected data volumes, write/read patterns, and performance requirements. Planning for scalability from the outset by designing your cluster with room for growth will avoid future problems and expense. 

7. Inadequate testing and staging environments 

Deploying changes directly to production without proper testing and staging can introduce bugs, data inconsistencies, or performance regressions. 

You must establish robust testing and staging environments that mimic production settings. Then thoroughly test and validate all changes, including data migrations, schema alterations, and application updates, before deploying to production. 

8. Unsatisfactory backup and disaster recovery strategies 

The lack of proper backup and disaster recovery strategies can lead to data loss or prolonged downtime in case of hardware failures, data center outages, or human error. 

Avoid this by implementing reliable backup strategies, such as incremental backups or snapshot backups. Set up multi-data center replication or cloud-based, and practice disaster recovery scenarios to ensure you have a quick and reliable recovery process. 

9. Lack of expertise and training 

Cassandra is a complex system, and its architecture and operational model differ from traditional databases, so it requires specialized knowledge and expertise. Lack of proper training and experience within your team can lead to suboptimal configurations, performance issues, and operational challenges. 

Investment in training and knowledge-sharing within a team may help, along with leveraging external resources such as documentation, tutorials, and community forums.  

Talk to the Cassandra Experts 

By addressing the common pitfalls and adopting best practices outlined above you can significantly increase the chances of success with your Apache Cassandra projects and ensure a stable, high-performing, and fault-tolerant distributed database system.  

However, this can be guaranteed by enlisting the services of the leading Cassandra experts at Instaclustr.  

Instaclustr provides a fully managed service for Apache Cassandra®—SOC 2 certified and hosted in the cloud or on-prem. We customize and optimize the configuration of your cluster so you can focus on your applications. Instaclustr offers comprehensive support across various hyperscaler platforms. Discover more about Instaclustr Managed Service for Apache Cassandra by downloading our datasheet.  

Whether you’re looking for a complete managed solution or need enterprise support or consulting services, we’re here to help. Learn more by reading our white paper 10 Rules for Managing Apache Cassandra. 

The post Top Reasons Apache Cassandra® Projects Fail (and How to Overcome Them) appeared first on Instaclustr.

ScyllaDB’s Safe Topology and Schema Changes on Raft

How ScyllaDB is using Raft for all topology and schema metadata – and the impacts on elasticity, operability, and performance ScyllaDB recently completed the transition to strong consistency for all cluster metadata. This transition involved moving schema and topology metadata to Raft, implementing a centralized topology coordinator for driving topology changes, and several other changes related to our commit logs, schema versioning, authentication, and other aspects of database internals. With all topology and schema metadata now under Raft, ScyllaDB officially supports safe, concurrent, and fast bootstrapping with versions 6.0 and higher. We can have dozens of nodes start concurrently. Rapidly assembling a fresh cluster, performing concurrent topology and schema changes, and quickly restarting a node with a different IP address or configuration – all of this is now possible. This article shares why and how we moved to a new algorithm providing centralized (yet fault-tolerant) topology change coordination for metadata, as well as its implications for elasticity, operability, and performance. A Quick Consistency Catchup Since ScyllaDB was born as a Cassandra-compatible database, we started as an eventually consistent system. That made perfect business sense for storing user data. In a large cluster, we want our writes to be available even if a link to the other data center is down.   [For more on the differences between eventually consistent and strongly consistent systems, see the blog ScyllaDB’s Path to Strong Consistency: A New Milestone.] But beyond storing user data, the database maintains additional information, called metadata, that describes: Topology (nodes, data distribution…) Schema (table format, column names, indexes…) There’s minimal business value in using the eventually consistent model for metadata. Metadata changes are infrequent, so we do not need to demand extreme availability or performance for them. Yet, we want to reliably change the metadata in an automatic manner to bring elasticity. That’s difficult to achieve with an eventually consistent model. Having metadata consistently replicated to every node in the cluster allows us to bridge the gap to elasticity, enabling us to fully automate node operations and cluster scaling. So, back in 2021, we embarked on a journey to bring in Raft: an algorithm and a library that we implemented to replicate any kind of information across multiple nodes. Since then, we’ve been rolling out the implementation incrementally. Our Move to Schema and Topology Changes on Raft In ScyllaDB 5.2, we put the schema into a Raft-replicated state. That involved replicating keyspace, table, and column information through Raft. Raft provides a replicated log across all the nodes in the cluster. Everything that’s updated through Raft first gets applied to that log, then gets applied to the nodes (in exactly the same order to all nodes). Now, in ScyllaDB 6.0, we greatly expanded the amount of information we store in this Raft-based replicated state machine. We also include new schema tables for authentication and service levels. And more interestingly, we moved topology over to Raft and implemented a new centralized topology coordinator that’s instrumental for our new tablets architecture (more on that at the end of this article). We also maintain backward compatibility tables so that old drivers and old clients can still get information about the cluster in the same way. Driving Topology Changes from a Centralized Topology Coordinator Let’s take a closer look at that centralized topology coordinator. Previously, the node joining the cluster would drive the topology change forward. If a node was being removed, another node would drive its removal. If something happened to the node driving these operations, the database operator had to intervene and restart the operation from scratch. Now, there’s a centralized process (which we call the topology change coordinator) that runs alongside the Raft cluster leader node and drives all topology changes. If the leader coordinator node is down, a new node is automatically elected a leader. Since the coordinator state is stored in the deterministic state machine (which is replicated across the entire cluster), the new coordinator can continue to drive the topology work from the state where the previous coordinator left off. No human intervention is required. Every topology operation registers itself in a work queue, and the coordinator works off that queue. Multiple operations can be queued at the same time, providing an illusion of concurrency while preserving operation safety. It’s possible to build a deterministic schedule, optimizing execution of multiple operations. For example, it lets us migrate multiple tablets at once, call cleanups for multiple nodetool operations, and so on. Since information about the cluster members is now propagated through Raft instead of Gossip, it’s quickly replicated to all nodes and is strongly consistent. A snapshot of this data is always available locally. That allows a starting node to quickly obtain the topology information without reaching out to the majority of the cluster. Practical Applications of this Design Next, let’s go over some practical applications of this design, beginning with the improvements in schema changes that we introduced in ScyllaDB 6.0. Dedicated Metadata Commit Log on shard 0 The ScyllaDB schema commit log, introduced in ScyllaDB 5.0 and now mandatory in ScyllaDB 6.0, is a dedicated write-ahead log for schema tables. With ScyllaDB 6.0, we started using the same log for schema and topology changes. That brings both linearizability and durability. This commit log runs on shard 0 and has different properties than the data commit log. It’s always durable, always synced immediately after write to disk. There’s no need to sync the system tables to disk when performing schema changes, which leads to faster schema changes. And this commit log has a different segment size, allowing larger chunks of data (e.g., very large table definitions) to fit into a single segment. This log is not impacted by the tuning you might do for the data commit log, such as max size on disk or flush settings. It also has its own priority, so that data writes don’t’ stall metadata changes, and there is no priority inversion. Linearizable schema version Another important update is the change to how we build schema versions. A schema version is a table identifier that we use internally in intra-cluster RPC to understand that every node has the same version of the metadata. Whenever a table definition changes, the identifier must be rebuilt. Before, with eventual consistency allowing concurrent schema modifications, we used to rehash all the system tables to create a new version on each schema change. Now, since schema changes are linearized, only one schema change occurs at a time – making a monotonic timestamp just as effective. It turns out that schema hash calculation is a major performance hog when creating, propagating, or applying schema changes. Moving away from this enables a nice speed boost. With this change, we were able to dramatically improve schema operation (e.g., create table, drop table) performance from one schema change per 10-20 seconds (in large clusters) to one schema change per second or less. We also removed the quadratic dependency of the cost of this algorithm on the size of the schema. It used to be that the more tables you had, the longer it took to add a new table. That’s no longer the case. We plan to continue improving schema change performance until we can achieve at least several changes per second and increase the practical ceiling for the number of tables a ScyllaDB installation can hold. Authentication and service levels on Raft We moved the internal tables for authentication and service levels to Raft as well. Now, they are globally replicated (i.e., present on every node in the cluster). This means users no longer need to adjust the replication factor for authentication after adding or removing nodes. Previously, authentication information was partitioned across the entire cluster. If a part of the cluster was down and the role definition was on one of the unavailable nodes, there was a risk that this role couldn’t connect to the cluster at all. This posed a serious denial of service problem. Now that we’re replicating this information to all nodes using Raft, there’s higher reliability since the data is present at all nodes. Additionally, there’s improved performance since the data is available locally (and also no denial of service risk for the same reason). For service levels, we moved from a polling model to a triggering model. Now, service level information is rebuilt automatically whenever it’s updated, and it’s also replicated onto every node via Raft. Additional Metadata Consistency in ScyllaDB 6.0 Now, let’s shift focus to other parts of the metadata that we converted to strong consistency in ScyllaDB 6.0. With all this metadata under Raft, ScyllaDB now officially supports safe, concurrent, and fast bootstrap. We can have dozens of nodes start concurrently. Feature Negotiation on Raft To give you an idea of some of the low-level challenges involved in moving to Raft, consider how we moved little-known ScyllaDB feature called “feature negotiation.” Essentially, this is a feature with details about other features. To ensure smooth upgrades, ScyllaDB runs a negotiation protocol between cluster nodes. A new functionality is only enabled when all of the nodes in the cluster can support it. But how does a cluster know that all of the nodes support the feature? Prior to Raft, this was accomplished with Gossip. The nodes were gossiping about their supported features, and eventually deciding that it was safe to enable them (after every node sees that every other node sees the feature). However, remember that our goal was to make ScyllaDB bootstraps safe, concurrent, and fast. We couldn’t afford to continue waiting for Gossip to learn that the features are supported by the cluster. We decided to propagate features through Raft. But we needed a way to quickly determine if the cluster supported the feature of feature propagation through Raft. It’s a classic “chicken or the egg” problem. The solution: in 6.0, when joining a node, we offload its feature verification to an existing cluster member. The joining node sends its supported feature set to the cluster member, which then verifies whether the node is compatible with the current cluster. Beyond the features that this node supports, this also includes such things as the snitch used and the cluster name. All that node information is then persisted in Raft. Then, the topology coordinator decides whether to accept the node or to reject it (because it doesn’t support some of the features). The most important thing to note here is that the enablement of any cluster features is now serialized with the addition of the nodes. There is no race. It’s impossible to concurrently add a feature and add a node that doesn’t support that feature. CDC Stream Details on Raft We also moved information about CDC stream generation to Raft. Moving this CDC metadata was required in order for us to stop relying on Gossip and sleeps during boot. We use this metadata to tell drivers that the current distribution of CDC has changed because the cluster topology changed – and it needs to be refreshed. Again, Gossip was previously used to safely propagate this metadata through the cluster, and the nodes had to wait for Gossip to settle. That’s no longer the case for CDC metadata. Moving this data over to Raft on group0, with its dedicated commit log, also improved data availability & durability. Additional Updates Moreover, we implemented a number of additional updates as part of this shift: Automated SSTable Cleanup : In ScyllaDB 6.0, we also automated the SSTable cleanup that needs to run between (some) topology changes to avoid data resurrection. Sometimes even a failed topology change may require cleanup. Previously, users had to remember to run this cleanup. Now, each node tracks its own cleanup status (whether the cleanup is needed or not) and performs the cleanup. The topology coordinator automatically coordinates the next topology change with the cluster cleanup status. Raft-Based UUID host identification: Internally, we switched most ScyllaDB subsystems to Raft-based UUID host identification. These are the same identifiers that Raft uses for cluster membership. Host id is now part of every node’s handshake, and this allows us to ensure that a node removed from the cluster cannot corrupt cluster data with its write RPCs. We also provide a safety net for the database operators: if they mistakenly try to remove a live node from the cluster, they get an error. Live nodes can be decommissioned, but not removed. Improved Manageability of the Raft Subsystem: We improved the manageability of the Raft subsystem in ScyllaDB 6.0 with the following: A new system table for Raft state inspection allows users to see the current Raft identifiers, the relation of the node to the Raft cluster, and so on. It’s useful for checking whether a node is in a good state – and troubleshooting if it is not. New Rest APIs allow users to manually trigger Raft internal operations. This is mainly useful for troubleshooting clusters. A new maintenance mode lets you start a node even if it’s completely isolated from the rest of the cluster. It also lets you manipulate the data on that local node (for example, to fix it or allow it to join a different cluster). Again, this is useful for troubleshooting. We plan to continue this work going forward. Raft is Enabled – and It Enables Extreme Elasticity with Our New Tablets Architecture To sum up, the state of the topology, such as tokens, was previously propagated through gossip and eventually consistent. Now, the state is propagated through Raft, replicated to all nodes and is strongly consistent. A snapshot of this data is always available locally, so that starting nodes can quickly obtain the topology information without reaching out to the leader of the Raft group. Even if nodes start concurrently, token metadata changes are now linearized. Also, the view of token metadata at each node is not dependent on the availability of the owner of the tokens. Node C is fully aware of node B tokens (even if node B is down) when it bootstraps. Raft is enabled by default, for both schema and topology, in ScyllaDB 6.0 and higher. And it now serves as the foundation for our tablets replication implementation: the tablet load balancer could not exist without it. Learn more about our tablets initiative overall, as well as its load balancer implementation, in the following blogs: Smooth Scaling: Why ScyllaDB Moved to “Tablets” Data Distribution How We Implemented ScyllaDB’s “Tablets” Data Distribution

IoT Overdrive Part 3: Starting Up the Compute Cluster

In my previous posts (Part 1 and Part 2), I’ve covered what a cluster is and what it does. Now we’ll take a detailed look at what it takes to assemble and set up a cluster. We’ll go over assembly, OS installation, setting things up with Ansible, and deploying our services with Docker. 


To start assembling the cluster, we need a few things: 

  • 7 Raspberry Pi 4s, 4x4GB, 3x8GB: the 3 8GB Pi computers will be used as worker nodes, and the 4GB models will be used for application servers and an admin node.
  • 6 Orange Pi 5s, 3x4GB, 3x8GB: the 8GB models will be used as worker nodes, and the 4GB models will serve as cluster managers.
  • At least 20 Power over Ethernet (PoE) ethernet cables: 
    • You need at least 13 for the power converters to the switch (I use 3ft/1m cables for this) 
    • 1 for your router or other internet connection (I use a 6ft/2m here to reach my other switch under my desk)
    • I always like to keep a few spares just in case something happens (I have a rolling office chair and sometimes bend cables for travel storage). My rule of thumb is the longer the cable, the more likely it’ll break. 
  • PoE switch: I use a 16-port Netgear switch that can give up to 60W of power (more than enough). 
  • 13 PoE to 5V USB-C adapters: These turn the high-voltage PoE power into a 5V/2.4A one. They also pass internet data; the cable splits into a USB-C for power and CAT-6 for data. One for each Pi. 
  • At least 13 microSD cards: These are for the Operating Systems (OS) for the Pi computers (see OS Installation). You’ll probably want a few spares as well because these love to break.  
    • They should have at least 8GB of memory, but I recommend 32GB. You can even buy these in bulk or large quantity packs. But BE CAREFUL and don’t buy from shady dealers/brands. It’s not worth it!  
  • 7 USB drives: These are so the Raspberry Pi computers can store application and other data. Same warning about shady deals as with MicroSD cards. 
  • 6 M.2 drives: These are so the Orange Pi computers can store application and other data 
  • Case(s) for the Pi computers: You’re going to want at least a fan operating on these. There are cases for multiple Pi computers out there. I made a custom case for 2.0 and put fans and heat sinks on each Pi. 

Here’s the setup: first take the Pi computers, put them in cases, and connect them to the 5V USB-C adapters. These have a plug for the internet and a plug for power. Plug both into the Pi. 

Then connect the 5V adapters to the switch (which is powered off) using the ethernet cables. 

Next, plug the switch into your internet source. You can use your home ethernet or a router with Wi-Fi repeating if you’re on the go. 

Source: Kassian Wren

You still don’t want to plug in the switch to power, as we need operating systems on the Pi computers for them to do anything.  

OS Installation 

You’ll need a few things to install the operating systems: 

  • 13 microSD cards: I recommend at least 16GB cards; I used 64GB. Make sure they’re compatible with Raspberry Pi if you can, but it’s likely if you get them from a reputable retailer, they’ll be alright. 
  • USB adapter for microSD: so you can read/write the cards from your computer if it doesn’t have a microSD or SD card slot. You may also need a microSD to SD card adapter. 
  • Raspberry Pi Flasher installed. 
  • Balena Etcher installed. 
  • An Orange Pi OS iso image: I used the official Ubuntu image (I can’t provide an https link to one, you’ll need to search). 

Once you’ve gathered and installed everything, you’ll want to do the following 7 times: 

  • Insert a microSD card into a USB adapter and into the computer 
  • Open Raspberry Pi Flasher if it’s not open 
  • Select “Choose Device” and select the Raspberry Pi 5 
  • Select “Choose OS” and select the latest Raspberry Pi 64-bit OS image 
  • Select “Next” 
  • When prompted, select “Edit Settings” 
  • Check the first box next to “Set hostname” and in the field enter one of the following: 
    • admin.mycluster 
    • w1.mycluster 
    • w2.mycluster 
    • w3.mycluster 
    • a1.mycluster 
    • a2.mycluster 
    • a3.mycluster 
  • Check the box next to “Set username and password” and set the username to ‘pi’ and the password to ‘raspberry’
    • For those of you who are reasonably worried about the security of this, worry not, we change it to ssh key access only through Ansible later.
  • Go to the “Services” tab
    • Check the box next to ‘enable SSH’ and the radio button next to password authentication 
  • Click ‘save’, ‘yes’ and ‘yes’ to start the burn process 
  • When done, eject and remove microSD and set aside 

And the following 6 times: 

  • Insert a microSD card into a USB adapter and into the computer 
  • Open Balena Etcher if it’s not open 
  • Follow the prompts to pick your microSD card and the ISO you downloaded  
  • Start the image burn 
  • Once it’s done, pop out the microSD card and set it aside 

Then install the SD cards, with the orange and raspberry pi OS’s correctly matched to the hardware. And if your cases are labeled with a hostname, match the raspberry pi cards by their hostname.

Unplug the Orange Pi computers (except one, any one) from the switch, and give the switch power. This should boot all the Raspberry Pi computers and ONE Orange Pi. 

Now you’ll need to get the IP addresses of the Orange Pis, since you can’t easily set a hostname like we did for raspberry Pi.  

  • Let the Orange Pi that’s plugged into the switch boot 
  • Use your router menu or another tool that maps devices on your network to find the orange pi on the network and get its IP 
  • Log that IP somewhere it’s easy to copy/paste it 
  • Leave this Orange Pi alone and on, and plug in the next Orange Pi to power/ethernet 

A picture of the cluster plugged in and booting from a conference showing. There is a network switch and rows of small computer boards with fans and network cables connecting everything.

Source: Kassian Wren

Once you’ve got all the IPs, you can dig into the Ansible setup. 

Automated Setup Scripts with Ansible 

We used Ansible to automate not only the initial OS setup, but Docker cluster creation and management. 

Stage 1: Setting Up for Ansible 

We’ll need a certificate to authenticate with the cluster, which can be generated with the following command.

For Ansible to work, it needs a few things; a way to ssh into each Pi, hostnames and IPs, etc. Luckily, we know just enough bash to automate this as well. See code below: 

for ipend in [copied from previous steps, in the format ### ### ### ###]  
 sshpass -f orange-pi-password.txt sudo ssh-copy-id -o StrictHostKeyChecking=no -i cluster

for index in 1 2 3 
 for type in a w
   sshpass -f raspi-password.txt ssh-copy-id -o StrictHostKeyChecking=no -i cluster

This script: 

  • Cycles through each orange pi IP address 
    • Installs the cluster certificate for user root 
  • Cycles through each raspberry pi hostname 
    • Install the cluster certificate for user pi 

Once this script has set up ssh connections to the Pi computers, Ansible can update the Pi computers, install software, and set up the Docker cluster. 

Running Ansible Scripts 

Then, we run a series of ansible scripts that do the following: 

  • Stage 2: Updates/Dependencies script 
    • Update/upgrade package lists with apt update/upgrade 
    • Install dependencies (python3, passlib) 
    • Create ansible and nodebotanist user, add to sudoers 
      • Nodebotanist is for running commands on the cluster manually 
      • Ansible user for logging, automation purposes (previous stages ran commands as pi/root user, next stages use ansible user) 
    • Disable password ssh authentication (now allows cert auth only) 
    • Copy ssh auth certificate to ansible, nodebotanist user 
      • They can now log in with the cert we created in stage 1 
  • Stage 3: Updates for Orange Pi computers 
    • Change the Orange Pi hostnames to easier-to-use ones 
    • Install avahi-daemon on Orange Pi computers 
    • Reboot Orange Pi computers 
  • Stage 3b: Storage format/mount 
    • Unmounts the USB drives (Raspberry Pi) and M.2 drives (Orange Pi) 
    • Creates a new primary partition (this erases all data on drive) 
    • Creates an ext4 file storage on the USB/M.2 drives 
    • Mounts the USB/M.2 drives (which also edits fstab so they auto-mount on start) 
  • Stage 4: Install Docker 
    • Installs dependencies for Docker (ca-certificates gnupg curl) 
    • Installs package 
    • Add nodebotanist and ansible users to Docker groups (so we don’t have to sudo all the time) 
  • Stage 5: Create cluster groups using Docker swarm 
    • Create a Docker swarm 
      • This creates 2 tokens: worker token and manager token. These are used to join other machines to the cluster 
      • Get and save these tokens 
    • Creates swarm manager bootstrap and operational groups, sorts manager machines into them 
      • Operational: in the swarm 
      • Bootstrap: needs to be added to swarm 
    • Same groups for workers 
    • Adds bootstrap managers to cluster using manager token 
    • Adds bootstrap workers to cluster using worker token 

Once we’ve run these using Ansible, we have updated Pis running Docker swarm and a working, but idle, cluster of 13 machines. Now we can use Docker Compose to start up our Cassandra and Kafka services on it. 

Setting Up Docker ServicesStarting Up Apache Cassandra® and Apache Kafka®

We’re going to use Docker Compose to start up Cassandra and Kafka services on our cluster. The way this works is we create a docker-compose.yml file that describes each container that should run in the cluster. 

  • Stage 6a: Starting up Cassandra  
    • Describes 6 Cassandra containers, one on each worker node 
    • Links them all to a virtual network called Ericht_net 
    • Sets up volumes to store data 
    • Use the depends_on clause to make sure cassandra-01 (the leader) is online before starting up nodes 2-6 
  • Stage 6b: Starting up Kafka 
    • Describes 6 Kafka containers, one on each worker node 
    • Links them to the same virtual network as the Cassandra containers (Ericht_net) 

Once we’ve run stage 6, we have a cluster that is running Cassandra and Kafka! Now to test. 

Coming Up Next 

Adding Docker volumes for data permanence. Watch our for Part 4 in IoT Overdrive!

The post IoT Overdrive Part 3: Starting Up the Compute Cluster appeared first on Instaclustr.

How We Implemented ScyllaDB’s “Tablets” Data Distribution

How ScyllaDB implemented its new Raft-based tablets architecture, which enables teams to quickly scale out in response to traffic spikes ScyllaDB just announced the first GA release featuring our new tablets architecture. In case this is your first time hearing about it, tablets are the smallest replication unit in ScyllaDB. Unlike static and all-or-nothing topologies, tablets provides dynamic, parallel, and near-instant scaling operations. This approach allows for autonomous and flexible data balancing and ensures that data is sharded and replicated evenly across the cluster. That, in turn, optimizes performance, avoids imbalances, and increases efficiency. The first blog in this series covered why we shifted from tokens to tablets and outlined the goals we set for this project. In this blog, let’s take a deeper look at how we implemented tablets via: Indirection and abstraction Independent tablet units A Raft-based load balancer Tablet-aware drivers Indirection and Abstraction There’s a saying in computer science that every problem can be solved with a new layer of indirection. We tested that out here, with a so-called tablets table. 😉 We added a tablets table that stores all of the tablets metadata and serves as the single source of truth for the cluster. Each tablet has its own token-range-to-nodes-and-shards mapping, and those mappings can change independently of node addition and removal. That enables the shift from static to dynamic data distribution. Each node controls its own copy, but all of the copies are synchronized via Raft. The tablets table tracks the details as the topology evolves. It always knows the current topology state, including the number of tablets per table, the token boundaries for each tablet, and which nodes and shards have the replicas. It also dynamically tracks tablet transitions (e.g., is the tablet currently being migrated or rebuilt?) and what the topology will look like when the transition is complete Independent Tablet Units Tablets dynamically distribute each table based on its size on a subset of nodes and shards. This is a much different approach than having vNodes statically distribute all tables across all nodes and shards based only on the token ring. With vNodes, each shard has its own set of SSTables and memtables that contain the portion of the data that’s allocated to that node and shard. With this new approach, each tablet is isolated into its own mini memtable and its own mini SSTables. Each tablet runs the entire log-structured merge (LSM) tree independently of other tablets that run on this shard. The advantage of this approach is that everything (SSTable + memtable + LSM tree) can be migrated as a unit. We just flush the memtables and copy the SSTables before streaming (because it’s easier for us to stream SSTables and not memtables). This enables very fast and very efficient migration. Another benefit: users no longer need to worry about manual cleanup operations. With vNodes, it can take quite a while to complete a cleanup since it involves rewriting all of the data on the node. With tablets, we migrate it as a unit and we can just delete the unit when we finish streaming it. When a new node is added to the cluster, it doesn’t yet own any data. A new component that we call the load balancer (more on this below) notices an imbalance among nodes and automatically starts moving data from the existing nodes to the new node. This is all done in the background with no user intervention required. For decommissioning nodes, there’s a similar process, just in the other direction. The load balancer is given the goal of zero data on that decommissioned node, it shifts tablets to make that happen, then the node can be removed once all tablets are migrated. Each tablet holds approximately 5 GB of data. Different tables might have different tablet counts and involve a different number of nodes and shards. A large table will be divided into a large number of tablets that are spread across many nodes and many shards, while a smaller table will involve fewer tablets, nodes and shards. The ultimate goal is to spread tables across nodes and shards evenly, in a way that they can effectively tap the cluster’s combined CPU power and IO bandwidth. Load Balancer All the tablet transitioning is globally controlled by the load balancer. This includes moving data from node to node or across shards within a node, running rebuild and repair operations, etc. This means the human operator doesn’t have to perform those tasks. The load balancer moves tablets around with the goal of achieving the delicate balance between overloading the nodes and underutilizing them. We want to maximize the available throughput (saturate the CPU and network on each node). But at the same time, we need to avoid overloading the nodes so we can keep migrations as fast as possible. To do this, the load balancer runs a loop that collects statistics on tables and tablets. It looks at which nodes have too little free space and which nodes have too much free space and it works to balance free space. It also rebalances data when we want to decommission a node. The load balancer’s other core function is maintaining the serialization of transitions. That’s all managed via Raft, specifically the Raft Group 0 leader. For example, we rely on Raft to prevent migration during a tablet rebuild and prevent conflicting topology changes. If the human operator happens to increase the replication factor, we will rebuild tablets for the additional replicas – and we will not allow yet another RF change until the automatic rebuild of the new replicas completes. The load balancer is hosted on a single node, but not a designated node. If that node goes down for maintenance or crashes, the load balancer will just get restarted on another node. And since we have all the tablets metadata in the tablets tables, the new load balancer instance can just pick up wherever the last one left off. Read about our Raft implementation Tablet-aware drivers Finally, it’s important to note that our older drivers will work with tablets, but they will not work as well. We just released new tablet-aware drivers that will provide a nice performance and latency boost. We decided that the driver should not read from the tablets table because it could take too long to scan the table, plus that approach doesn’t work well with things like lambdas or cloud functions. Instead, the driver learns tablets information lazily. The driver starts without any knowledge of where tokens are located. It makes a request to a random node. If that’s the incorrect node, the node will see that the driver missed the correct host. When it returns the data, it will also add in extra routing information that indicates the correct location. Next time, the driver will know where that particular token lives, so it will send the request directly to the node that hosts the data. This avoids an extra hop. If the tablets get migrated later on, then the “lazy learning” process repeats. How Does this All Play Out? Let’s take a deeper look into monitoring metrics and even some mesmerizing tablet visualization to see how all the components come together to achieve the elasticity and speed goals laid out in the previous blog. Conclusion We have seen how tablets make ScyllaDB more elastic. With tablets, ScyllaDB scales out faster, scaling operations are independent, and the process requires less attention and care from the operator. We feel like we haven’t yet exhausted the potential of tablets. Future ScyllaDB versions will bring more innovation in this space.

Smooth Scaling: Why ScyllaDB Moved to “Tablets” Data Distribution

The rationale behind ScyllaDB’s new “tablets” replication architecture, which builds upon a multiyear project to implement and extend Raft  ScyllaDB 6.0 is the first release featuring ScyllaDB’s new tablet architecture. Tablets are designed to support flexible and dynamic data distribution across the cluster. Based on Raft, this new approach provides new levels of elasticity with near-instant bootstrap and the ability to add new nodes in parallel – even doubling an entire cluster at once. Since new nodes begin serving requests as soon as they join the cluster, users can spin up new nodes that start serving requests almost instantly. This means teams can quickly scale out in response to traffic spikes – satisfying latency SLAs without needing to overprovision “just in case.” This blog post shares why we decided to take on this massive project and the goals that we set for ourselves. Part 2 (publishing tomorrow) will focus on the technical requirements and implementation. Tablets Background First off, let’s be clear. ScyllaDB didn’t invent the idea of using tablets for data distribution. Previous tablets implementations can be seen across Google Bigtable, Google Spanner, and YugabyteDB. The 2006 Google Bigtable paper introduced the concept of splitting table rows into dynamic sections called tablets. Bigtable nodes don’t actually store data; they store pointers to tablets that are stored on Colossus (Google’s internal, highly durable file system). Data can be rebalanced across nodes by changing metadata vs actually copying data – so it’s much faster to dynamically redistribute data to balance the load. For the same reason, node failure has a minimal impact and node recovery is fast. Bigtable automatically splits busier or larger tablets in half and merges less-accessed/smaller tablets together – redistributing them between nodes as needed for load balancing and efficient resource utilization. The Bigtable tablets implementation uses Paxos. Tablets are also discussed in the 2012 Google Spanner paper (in section 2.1) and implemented in Spanner. Paxos is used here as well. Spanserver software stack – image from the Google Spanner paper Another implementation of tablets can be found in DocDB, which serves as YugabyteDB’s underlying document storage engine. Here, data is distributed by splitting the table rows and index entries into tablets according to the selected sharding method (range or hash) or auto-splitting. The Yugabyte implementation uses Raft. Each tablet has its own Raft group, with its own LSM-Tree datastore (including a memtable, in RAM, and many SSTable files on disk). YugabyteDB hash-based data partitioning – image from the YugabyteDB blog YugabyteDB range-based data partitioning – image from the YugabyteDB blog Why Did ScyllaDB Consider Tablets? Why did ScyllaDB consider a major move to tablets-based data distribution? Basically, several elements of our original design eventually became limiting as infrastructure and the shape of data evolved – and the sheer volume of data and size of deployments spiked. More specifically: Node storage: ScyllaDB streaming started off quite fast back in 2015, but storage volumes grew faster. The shapes of nodes changed: nodes got more storage per vCPU. For example, compare AWS i4i nodes to i3en ones, which have about twice the amount of storage per vCPU. As a result, each vCPU needs to stream more data. The immediate effect is that streaming takes longer. Schema shapes: The rate of streaming depends on the shape of the schema. If you have relatively large cells, then streaming is not all that CPU-intensive. However, if you have tiny cells (e.g., numerical data common with time-series data), then ScyllaDB will spend CPU time parsing and serializing – and then deserializing and writing – each cell. Eventually consistent leaderless architecture: ScyllaDB’s eventually consistent leaderless architecture, without the notion of a primary node, meant that the database operator (you) had to bear the burden of coordinating operations. Everything had to be serialized because the nodes couldn’t reliably communicate about what you were doing. That meant that you could not run bootstraps and decommissions in parallel. Static token-based distribution: Static data distribution is another design aspect that eventually became limiting. Once a node was added, it was assigned a range of tokens and those token assignments couldn’t change between the point when the node was added and when it was removed. As a result, data couldn’t be moved dynamically. This architecture – rooted in the Cassandra design – served us well for a while. However, the more we started working with larger deployments and workloads that required faster scaling, it became increasingly clear that it was time for something new. So we launched a multiyear project to implement tablets-based data distribution in ScyllaDB. Our Goals for the ScyllaDB Tablets ImplementationProject Our tablets project targeted several goals stemming from the above limitations: Fast bootstrap/decommission: The top project goal was to improve the bootstrap and decommission speed. Bootstrapping large nodes in a busy cluster could take many hours, sometimes a day in massive deployments. Bootstrapping is often done at critical times: when you’re running out of space or CPU capacity, or you’re trying to support an expected workload increase. Understandably, users in such situations want this bootstrapping to complete as fast as feasible. Incremental bootstrap: Previously, the bootstrapped node couldn’t start serving read requests until all of the data was streamed. That means you’re still starved for CPU and potentially IO until the end of that bootstrapping process. With incremental bootstrap, a node can start shouldering the load – little by little – as soon as it’s added to the cluster. That brings immediate relief. Parallel bootstrap: Previously, you could only add one node at a time. And given how long it took to add a node, increasing cluster size took hours, sometimes days in our larger deployments. With parallel bootstrap, you can add multiple nodes in parallel if you urgently need fast relief. Decouple topology operations: Another goal was to decouple changes to the cluster. Before, we had to serialize every operation. A node failure while bootstrapping or decommissioning nodes would force you to restart everything from scratch. With topology operations decoupled, you can remove a dead node while bootstrapping two new nodes. You don’t have to schedule everything and have it all waiting on some potentially slow operation to complete. Improve support for many small tables: ScyllaDB was historically optimized for a small number of large tables. However, our users have also been using it for workloads with a large number of small tables – so we wanted to equalize the performance for all kinds of workloads. Tablets in Action To see how tablets achieves those goals, let’s look at the following scenario: Preload a three-node cluster with 650 GB per replica Run a moderate mixed read/write workload Bootstrap three nodes to add more storage and CPU Decommission three nodes We ran this with the Scylla-cluster-tests (open-source) test harness that we use for our weekly regression tests. With tablets, the new nodes start gradually relieving the workload as soon as they’re bootstrapped and existing nodes start shedding the load incrementally. This offers fast relief for performance issues. In the write scenario here, bootstrapping was roughly 4X faster. We’ve tested other scenarios where bootstrapping was up to 30X faster. Next Up: Implementation The follow-up blog looks at how we implemented tablets. Specifically: Indirection and abstraction Independent tablet units A Raft-based load balancer Tablet-aware drivers Finally, we wrap it up with a more extensive demo that shows the impact of tablets from the perspective of coordinator requests, CPU load, and disk bandwidth across operations.

ScyllaDB 6.0: with Tablets & Strongly-Consistent Topology Updates

The ScyllaDB team is pleased to announce ScyllaDB Open Source 6.0, a production-ready major release. ScyllaDB 6.0 introduces two major features which change the way ScyllaDB works: Tablets, a dynamic way to distribute data across nodes that significantly improves scalability Strongly consistent topology, Auth, and Service Level updates Note: Join ScyllaDB co-founder Dor Laor on June 27 to explore what learn what this architectural shift means for elasticity and operational simplicity. Join the livestream In addition, ScyllaDB 6.0 includes many other improvements in functionality, stability, UX and performance. Only the latest two minor releases of the ScyllaDB Open Source project are supported. With this release, only ScyllaDB Open Source 6.0 and 5.4 are supported. Users running earlier releases are encouraged to upgrade to one of these two releases. Related Links Get ScyllaDB Open Source 6.0 as binary packages (RPM/DEB), AWS AMI, GCP Image and Docker Image Upgrade from ScyllaDB 5.4 to ScyllaDB 6.0 Report an issue   New features Tablets In this release, ScyllaDB enabled Tablets, a new data distribution algorithm to replace the legacy vNodes approach inherited from Apache Cassandra. While the vNodes approach statically distributes all tables across all nodes and shards based on the token ring, the Tablets approach dynamically distributes each table to a subset of nodes and shards based on its size. In the future, distribution will use CPU, OPS, and other information to further optimize the distribution. Read Avi Kivity’s blog series on tablets In particular, Tablets provide the following: Faster scaling and topology changes. New nodes can start serving reads and writes as soon as the first Tablet is migrated. Together with Strongly Consistent Topology Updates (below), this also allows users to add multiple nodes simultaneously and scale, out or down, much faster Automatic support for mixed clusters with different core counts. Manual vNode updates are not required. More efficient operations on small tables, since such tables are placed on a small subset of nodes and shards. Read more about Tablets in the docs Using Tablets Tablets are enabled by default for new clusters. No action required. To disable Tablets for a Keyspace use     CREATE KEYSPACE ... WITH TABLETS = { 'enabled': false } For Tablets limitations in 6.0, see the discussion in the docs. Monitoring Tablets To Monitor Tablets in real time, upgrade ScyllaDB Monitoring Stack to release 4.7, and use the new dynamic Tablet panels, below. Driver Support The Following Drivers support Tablets: Java driver 4.x, from (to be released soon) Java Driver 3.x, from Python driver, from 3.26.6 Gocql driver, from 1.13.0 Rust Driver from 0.13.0 Legacy ScyllaDB and Apache Cassandra drivers will continue to work with ScyllaDB but will be less efficient when working with tablet-based Keyspaces. Strongly Consistent Topology Updates With Raft-managed topology enabled, all topology operations are internally sequenced consistently. A centralized coordination process ensures that topology metadata is synchronized across the nodes on each step of a topology change procedure. This makes topology updates fast and safe, as the cluster administrator can trigger many topology operations concurrently, and the coordination process will safely drive all of them to completion. For example, multiple nodes can be bootstrapped concurrently, which couldn’t be done with the previous gossip-based topology. Strongly Consistent Topology Updates is now the default for new clusters, and should be enabled after upgrade for existing clusters. In addition to Topology Updates, more Cluster metadata elements are now strongly consistent: Strongly Consistent Auth Updates. Role-Based Access Control (RBAC) commands like create role or grant permission are safe to run in parallel. As a result, there is no need to update system_auth RF or run repair when adding a DataCenter. Strongly Consistent Service Levels. Service Levels allow you to define attributes such as timeout per workload. Service levels are now strongly consistent Native Nodetool The nodetool utility provides simple command-line interface operations and attributes. ScyllaDB inherited the Java based nodetool from Apache Cassandra. In this release, the Java implementation was replaced with a backward-compatible native nodetool. The native nodetool works much faster. Unlike the Java version ,the native nodetool is part of the ScyllaDB repo, and allows easier and faster updates. With the Native Nodetool, the JMX server has become redundant and will no longer be part of the default ScyllaDB Installation or image, and can be installed separately Maintenance Mode and Socket Maintenance mode is a new mode in which the node does not communicate with clients or other nodes and only listens to the local maintenance socket and the REST API. It can be used to fix damaged nodes – for example, by using nodetool compact or nodetool scrub. In maintenance mode, ScyllaDB skips loading tablet metadata if it is corrupted to allow an administrator to fix it. The new Maintenance Socket provides a way to interact with ScyllaDB, only from within the node it runs on, while on Maintenance Mode Maintenance Socket docs. Improvements and Bug Fixes The latest release also features extensive improvements to: Bloom Filters Stability and performance Compaction Commitlog Cluster operations Materialized views Performance Edge cases Guardrails CQL Alternator (DynamoDB compatible API) REST API Tracing Monitoring Packaging and configuration For details, see the release notes on the ScyllaDB Community Forum See the detailed release notes

NetApp and Pegasystems Open Source Support Package

We are excited to announce the collaboration between NetApp and Pegasystems to provide tailored support packages for Pega customers utilizing the open source technologies required to run the Pega Platform.  

Following recent changes to the Pega Platform, customers will become responsible for managing their own open source external services (Apache Kafka®, Apache Cassandra® and OpenSearch®). Instaclustr by NetApp brings a wealth of experience and expertise in these open source technologies, and the Pega team has relied on Instaclustr to support these technologies for many years. 

Beginning with Pega Platform v8.4 or Pega Platform ‘23, Pegasystems have made the architectural decision to transition external deployments of Apache Cassandra and Apache Kafka to support the platform; in subsequent minor versions this will be extended to include OpenSearch.   

These technologies are required to run the Pega Platform and moving to an external microservices architecture delivers improvements in performance, scalability and maintainability. 

Since 2020, Instaclustr has been providing support to Pegasystems by managing the underlying open source technologies necessary for Pega Cloud. Instaclustr’s Technical Operations teams possess extensive experience in running, operating and supporting the open source software essential for the Pega Platform. It was a natural progression to establish a partnership and extend these services to self-managed Pega customers. 

With Instaclustr’s support, Pega customers who manage their own environments gain access to round-the-clock assistance and direct communication with experts possessing deep technical knowledge. We understand the intricacies of the required technologies and are well-equipped to provide the necessary guidance and support to ensure smooth operations and optimal performance. 

“Pegasystems and Instaclustr have formed a collaborative partnership in the ever-evolving tech industry. Pega leverages Instaclustr’s technical expertise in the open source services space to enhance its offerings. Using our partnership, we are revolutionizing industry norms by combining innovative solutions and advanced technology to deliver a superior offering for our clients.”

Ramzi Souri, Vice President of Cloud Technologies at Pegasystems

Instaclustr offers support packages tailored to meet the specific operational requirements of each customer. There are 3 key options available: 

  1. Support Only: This option is designed for customers who manage their own Kafka, Cassandra, PostgreSQL® and OpenSearch clusters. With the Support Only package, customers gain peace of mind by having 24×7 access to dedicated experts from Instaclustr’s managed platform. This ensures they can rely on the same experienced professionals who oversee Instaclustr’s managed services.
  2. Managed Platform: The Managed Platform option provides customers managing the Pega Platform themselves on AWS, GCP or Azure Clouds with access to the fully supported Instaclustr Managed Platform. This comprehensive solution enables customers to leverage the expertise and support of Instaclustr while managing their Pega Platform deployments on their chosen cloud provider.
  3. On-Premise Managed Platform: For customers who prefer to host their environments on-premises, the On-Premise Managed Platform option offers all the benefits of the Instaclustr Managed Platform. This means customers can enjoy the same level of support, reliability, and expertise while having their platform hosted within their own on-premises environment.

All offerings include support planning of migrations and more details of the support packages can be found on our dedicated Pega Support package page. 

By partnering with Instaclustr, Pega customers can simplify the management complexity of open source application infrastructure.  

Now is the time to start planning your upgrade and transition journey. Get in touch with us today to find out more about how Instaclustr by NetApp can help manage your Pega open source environment.  

The post NetApp and Pegasystems Open Source Support Package appeared first on Instaclustr.

Book Excerpt: ScyllaDB, a Different Database

What’s so distinctive about ScyllaDB? Read what Bo Ingram (Staff Engineer at Discord) has to say – in this excerpt from the book “ScyllaDB in Action.” Editor’s note: We’re thrilled to share the following excerpt from Bo Ingram’s informative – and fun! – new book on ScyllaDB: ScyllaDB in Action. You might have already experienced Bo’s expertise and engaging communication style in his blog How Discord Stores Trillions of Messages or ScyllaDB Summit talks How Discord Migrated Trillions of Messages from Cassandra to ScyllaDB and  So You’ve Lost Quorum: Lessons From Accidental Downtime  If not, you should 😉 You can purchase the full 370-page book from You can also access a 122-page early-release digital copy for free, compliments of ScyllaDB. The book excerpt includes a discount code for 45% off the complete book. Get the 122-page PDF for free The following is an excerpt from Chapter 1; it’s reprinted here with permission of the publisher. *** ScyllaDB is a database — it says it in its name! Users give it data; the database gives it back when asked. This very basic and oversimplified interface isn’t too dissimilar from popular relational databases like PostgreSQL and MySQL. ScyllaDB, however, is not a relational database, eschewing joins and relational data modeling to provide a different set of benefits. To illustrate these, let’s take a look at a fictitious example. Hypothetical databases Let’s imagine you’ve just moved to a new town, and as you go to new restaurants, you want to remember what you ate so that you can order it or avoid it next time. You could write it down in a journal or save it in the notes app on your phone, but you hear about a new business model where people remember information you send them. Your friend Robert has just started a similar venture: Robert’s Rememberings. ROBERT’S REMEMBERINGS Robert’s business (figure 1.2) is straightforward: you can text Robert’s phone number, and he will remember whatever information you send him. He’ll also retrieve information for you, so you won’t need to remember everything you’ve eaten in your new town. That’s Robert’s job. Figure 1.2 Robert’s Rememberings has a seemingly simple plan. The plan works swimmingly at first, but issues begin to appear. Once, you text him, and he doesn’t respond. He apologizes later and says he had a doctor’s appointment. Not unreasonable, you want your friend to be healthy. Another time, you text him about a new meal, and it takes him several minutes to reply instead of his usual instant response. He says that business is booming, and he’s been inundated with requests — response time has suffered. He reassures you and says not to worry, he has a plan. Figure 1.3 Robert adds a friend to his system to solve problems, but it introduces complications. Robert has hired a friend to help him out. He sends you the new updated rules for his system. If you only want to ask a question, you can text his friend, Rosa. All updates are still sent to Robert; he will send everything you save to her, so she’ll have an up-to-date copy. At first, you slip up a few times and still ask Robert questions, but it seems to work well. No longer is Robert overwhelmed, and Rosa’s responses are prompt. One day, you realize that when you asked Rosa a question, she texted back an old review that you had previously overwritten. You message Robert about this discrepancy, worried that your review of the much-improved tacos at Main Street Tacos is lost forever. Robert tells you there was an issue within the system where Rosa hadn’t been receiving messages from Robert but was still able to get requests from customers. Your request hasn’t been lost, and they’re reconciling to get back in sync. You wanted to be able to answer one question: is the food here good or not? Now, you’re worrying about contacting multiple people depending on whether you’re reading a review or writing a review, whether data is in sync, and whether your friend’s system can scale to satisfy all of their users’ requests. What happens if Robert can’t handle people only saving their information? When you begin brainstorming intravenous energy drink solutions, you realize that it’s time to consider other options. ABC DATA: A DIFFERENT APPROACH Your research leads you to another business – ABC Data. They tell you that their system is a little different: they have three people – Alice, Bob, and Charlotte – and any of them can save information or answer questions. They communicate with each other to ensure each of them has the latest data, as shown in figure 1.4. You’re curious what happens if one of them is unavailable, and they say they provide a cool feature: because there are multiple of them, they coordinate within themselves to provide redundancy for your data and increased availability. If Charlotte is unavailable, Alice and Bob will receive the request and answer. If Charlotte returns later, Alice and Bob will get Charlotte back up to speed on the latest changes. Figure 1.4 ABC Data’s approach is designed to meet the scaling challenges that Robert encountered. This setup is impressive, but because each request can lead to additional requests, you’re worried this system might get overwhelmed even easier than Robert’s. This, they tell you, is the beauty of their system. They take the data set and create multiple copies of it. They then divide this redundant data amongst themselves. If they need to expand, they only need to add additional people, who take over some of the existing slices of data. When a hypothetical fourth person, Diego, joins, one customer’s data might be owned by Alice, Charlotte, and Diego, whereas Bob, Charlotte, and Diego might own other data. Because they allow you to choose how many people should respond internally for a successful request, ABC Data gives you control over availability and correctness. If you want to always have the most up-to-date data, you can require all three holders to respond. If you want to prioritize getting an answer, even if it isn’t the most recent one, you can require only one holder to respond. You can balance these properties by requiring two holders to respond — you can tolerate the loss of one, but you can ensure that a majority of them have seen the most up-to-date data, so you should get the most recent information. Figure 1.5 ABC Data’s approach gives us control over availability and correctness. You’ve learned about two imaginary databases here — one that seems straightforward but introduces complexity as requests grow, and another with a more complex implementation that attempts to handle the drawbacks of the first system. Before beginning to contemplate the awkwardness of telling a friend you’re leaving his business for a competitor, let’s snap back to reality and translate these hypothetical databases to the real world. Real-world databases Robert’s database is a metaphorical relational database, such as PostgreSQL or MySQL. They’re relatively straightforward to run, fit a multitude of use cases, and are quite performant, and their relational data model has been used in practice for more than 50 years. Very often, a relational database is a safe and strong option. Accordingly, developers tend to default toward these systems. But, as demonstrated, they also have their drawbacks. Availability is often all-or-nothing. Even if you run with a read replica, which in Robert’s database would be his friend, Rosa, you would potentially only be able to do reads if you had lost your primary instance. Scalability can also be tricky – a server has a maximum amount of compute resources and memory. Once you hit that, you’re out of room to grow. It is through these drawbacks that ScyllaDB differentiates itself. The ABC Data system is ScyllaDB. Like ABC Data, ScyllaDB is a distributed database that replicates data across its nodes to provide both scalability and fault tolerance. Scaling is straightforward – you add more nodes. This elasticity in node count extends to queries. ScyllaDB lets you decide how many replicas are required to respond for a successful query, giving your application room to handle the loss of a server. *** Want to read more from Bo? You can purchase the full 325-page book from   Also, you can access a 122-page early-release digital copy for free, compliments of ScyllaDB.  Get the 122-page PDF for free

Focus on Creativity, Not Clusters: DataStax Mission Control in Action!

While many large enterprises have made use of managed databases, several still have significant workloads being served by self-managed solutions (both on-premises and in the cloud) like DataStax Enterprise (DSE) and Apache Cassandra®. Although a significant amount of those workloads will eventually...

Why Teams are Eliminating External Database Caches

Often-overlooked risks related to external caches — and how 3 teams are replacing their core database + external cache with a single solution (ScyllaDB) Teams often consider external caches when the existing database cannot meet the required service-level agreement (SLA). This is a clear performance-oriented decision. Putting an external cache in front of the database is commonly used to compensate for subpar latency stemming from various factors, such as inefficient database internals, driver usage, infrastructure choices, traffic spikes and so on. Caching might seem like a fast and easy solution because the deployment can be implemented without tremendous hassle and without incurring the significant cost of database scaling, database schema redesign or even a deeper technology transformation. However, external caches are not as simple as they are often made out to be. In fact, they can be one of the more problematic components of a distributed application architecture. In some cases, it’s a necessary evil, such as when you require frequent access to transformed data resulting from long and expensive computations, and you’ve tried all the other means of reducing latency. But in many cases, the performance boost just isn’t worth it. You solve one problem, but create others. Here are some often-overlooked risks related to external caches and ways three teams have achieved a performance boost plus cost savings by replacing their core database and external cache with a single solution. Spoiler: They adopted ScyllaDB, a high-performance database that achieves improved long-tail latencies by tapping a specialized internal cache. Why Not Cache At ScyllaDB, we’ve worked with countless teams struggling with the costs, hassles and limits of traditional attempts to improve database performance. Here are the top struggles we’ve seen teams experience with putting an external cache in front of their database. An External Cache Adds Latency A separate cache means another hop on the way. When a cache surrounds the database, the first access occurs at the cache layer. If the data isn’t in the cache, then the request is sent to the database. This adds latency to an already slow path of uncached data. One may claim that when the entire data set fits the cache, the additional latency doesn’t come into play. However, unless your data set is considerably small, storing it entirely in memory considerably magnifies costs and is thus prohibitively expensive for most organizations. An External Cache is an Additional Cost Caching means expensive DRAM, which translates to a higher cost per gigabyte than solid-state disks (see this P99 CONF talk by Grafana’s Danny Kopping for more details on that). Rather than provisioning an entirely separate infrastructure for caching, it is often best to use the existing database memory, and even increase it for internal caching. Modern database caches can be just as efficient as traditional in-memory caching solutions when sized correctly. When the working set size is too large to fit in memory, then databases often shine in optimizing I/O access to flash storage, making databases alone (no external cache) a preferred and cheaper option. External Caching Decreases Availability No cache’s high availability solution can match that of the database itself. Modern distributed databases have multiple replicas; they also are topology-aware and speed-aware and can sustain multiple failures without data loss. For example, a common replication pattern is three local replicas, which generally allows for reads to be balanced across such replicas to efficiently make use of your database’s internal caching mechanism. Consider a nine-node cluster with a replication factor of three: Essentially every node will hold roughly a third of your total data set size. As requests are balanced among different replicas, this grants you more room for caching your data, which could completely eliminate the need for an external cache. Conversely, if an external cache happens to invalidate entries right before a surge of cold requests, availability could be impeded for a while since the database won’t have that data in its internal cache (more on this below). Caches often lack high availability properties and can easily fail or invalidate records depending on their heuristics. Partial failures, which are more common, are even worse in terms of consistency. When the cache inevitably fails, the database will get hit by the unmitigated firehose of queries and likely wreck your SLAs. In addition, even if a cache itself has some high availability features, it can’t coordinate handling such failure with the persistent database it is in front of. The bottom line: Rely on the database, rather than making your latency SLAs dependent on a cache. Application Complexity — Your Application Needs to Handle More Cases External caches introduce application and operational complexity. Once you have an external cache, it is your responsibility to keep the cache up to date with the database. Irrespective of your caching strategy (such as write-through, caching aside, etc.), there will be edge cases where your cache can run out of sync from your database, and you must account for these during application development. Your client settings (such as failover, retry and timeout policies) need to match the properties of both the cache as well as your database to function when the cache is unavailable or goes cold. Usually such scenarios are hard to test and implement. External Caching Ruins the Database Caching Modern databases have embedded caches and complex policies to manage them. When you place a cache in front of the database, most read requests will reach only the external cache and the database won’t keep these objects in its memory. As a result, the database cache is rendered ineffective. When requests eventually reach the database, its cache will be cold and the responses will come primarily from the disk. As a result, the round-trip from the cache to the database and then back to the application is likely to add latency. External Caching Might Increase Security Risks An external cache adds a whole new attack surface to your infrastructure. Encryption, isolation and access control on data placed in the cache are likely to be different from the ones at the database layer itself. External Caching Ignores The Database Knowledge And Database Resources Databases are quite complex and built for specialized I/O workloads on the system. Many of the queries access the same data, and some amount of the working set size can be cached in memory to save disk accesses. A good database should have sophisticated logic to decide which objects, indexes and accesses it should cache. The database also should have eviction policies that determine when new data should replace existing (older) cached objects. An example is scan-resistant caching. When scanning a large data set, say a large range or a full-table scan, a lot of objects are read from the disk. The database can realize this is a scan (not a regular query) and choose to leave these objects outside its internal cache. However, an external cache (following a read-through strategy) would treat the result set just like any other and attempt to cache the results. The database automatically synchronizes the content of the cache with the disk according to the incoming request rate, and thus the user and the developer do not need to do anything to make sure that lookups to recently written data are performant and consistent. Therefore, if, for some reason, your database doesn’t respond fast enough, it means that: The cache is misconfigured It doesn’t have enough RAM for caching The working set size and request pattern don’t fit the cache The database cache implementation is poor A Better Option: Let the Database Handle It How can you meet your SLAs without the risks of external database caches? Many teams have found that by moving to a faster database such as ScyllaDB with a specialized internal cache, they’re able to meet their latency SLAs with less hassle and lower costs. Results vary based on workload characteristics and technical requirements, of course. But for an idea of what’s possible, consider what these teams were able to achieve. SecurityScorecard Achieves 90% Latency Reduction with $1 Million Annual Savings SecurityScorecard aims to make the world a safer place by transforming the way thousands of organizations understand, mitigate and communicate cybersecurity. Its rating platform is an objective, data-driven and quantifiable measure of an organization’s overall cybersecurity and cyber risk exposure. The team’s previous data architecture served them well for a while, but couldn’t keep up with their growth. Their platform API queried one of three data stores: Redis (for faster lookups of 12 million scorecards), Aurora (for storing 4 billion measurement stats across nodes), or a Presto cluster on Hadoop Distributed File System (for complex SQL queries on historical results). As data and requests grew, challenges emerged. Aurora and Presto latencies spiked under high throughput. The largest possible instance of Redis still wasn’t sufficient, and they didn’t want the complexity of working with a Redis Cluster. To reduce latencies at the new scale that their rapid business growth required, the team moved to ScyllaDB Cloud and developed a new scoring API that routed less latency-sensitive requests to Presto and S3 storage. Here’s a visualization of this – and considerably simpler – architecture: The move resulted in: 90% latency reduction for most service endpoints 80% fewer production incidents related to Presto/Aurora performance $1 million infrastructure cost savings per year 30% faster data pipeline processing Much better customer experience Read more about the SecurityScorecard use case IMVU Reins in Redis Costs at 100X Scale A popular social community, IMVU enables people all over the world to interact with each other using 3D avatars on their desktops, tablets and mobile devices. To meet growing requirements for scale, IMVU decided it needed a more performant solution than its previous database architecture of Memcached in front of MySQL and Redis. The team looked for something that would be easier to configure, easier to extend and, if successful, easier to scale. “Redis was fine for prototyping features, but once we actually rolled it out, the expenses started getting hard to justify,” said Ken Rudy, senior software engineer at IMVU. “ScyllaDB is optimized for keeping the data you need in memory and everything else in disk. ScyllaDB allowed us to maintain the same responsiveness for a scale a hundred times what Redis could handle.” Comcast Reduces Long Tail Latencies 95% with $2.5 million Annual Savings Comcast is a global media and technology company with three primary businesses: Comcast Cable, one of the United States’ largest video, high-speed internet and phone providers to residential customers; NBCUniversal and Sky. Comcast’s Xfinity service serves 15 million households with more than 2 billion API calls (reads/writes) and over 200 million new objects per day. Over seven years, the project expanded from supporting 30,000 devices to more than 31 million. Cassandra’s long tail latencies proved unacceptable at the company’s rapidly increasing scale. To mask Cassandra’s latency issues from users, the team placed 60 cache servers in front of their database. Keeping this cache layer consistent with the database was causing major admin headaches. Since the cache and related infrastructure had to be replicated across data centers, Comcast needed to keep caches warm. They implemented a cache warmer that examined write volumes, then replicated the data across data centers. After struggling with the overhead of this approach, Comcast soon moved to ScyllaDB. Designed to minimize latency spikes through its internal caching mechanism, ScyllaDB enabled Comcast to eliminate the external caching layer, providing a simple framework in which the data service connected directly to the data store. Comcast was able to replace 962 Cassandra nodes with just 78 nodes of ScyllaDB. They improved overall availability and performance while completely eliminating the 60 cache servers. The result: 95% lower P99, P999 and P9999 latencies with the ability to handle over twice the requests – at 60% of the operating costs. This ultimately saved them $2.5 million annually in infrastructure costs and staff overhead.   Closing Thoughts Although external caches are a great companion for reducing latencies (such as serving static content and personalization data not requiring any level of durability), they often introduce more problems than benefits when placed in front of a database. The top tradeoffs include elevated costs, increased application complexity, additional round trips to your database and an additional security surface area. By rethinking your existing caching strategy and switching to a modern database providing predictable low latencies at scale, teams can simplify their infrastructure and minimize costs. And at the same time, they can still meet their SLAs without the extra hassles and complexities introduced by external caches.

ScyllaDB as a DynamoDB Alternative: Frequently Asked Questions

A look at the top questions engineers are asking about moving from DynamoDB to ScyllaDB to reduce cost, avoid throttling, and avoid cloud vendor lockin A great thing about working closely with our community is that I get a chance to hear a lot about their needs and – most importantly – listen to and take in their feedback. Lately, we’ve seen a growing interest from organizations considering ScyllaDB as a means to replace their existing DynamoDB deployments and, as happens with any new tech stack, some frequently recurring questions. 🙂 ScyllaDB provides you with multiple ways to get started: you can choose from its CQL protocol or ScyllaDB Alternator. CQL refers to the Cassandra Query Language, a NoSQL interface that is intentionally similar to SQL. ScyllaDB Alternator is ScyllaDB’s DynamoDB-compatible API, aiming at full compatibility with the DynamoDB protocol. Its goal is to provide a seamless transition from AWS DynamoDB to a cloud-agnostic or on-premise infrastructure while delivering predictable performance at scale. But which protocol should you choose? What are the main differences between ScyllaDB and DynamoDB? And what does a typical migration path look like? Fear no more, young sea monster! We’ve got you covered. I personally want to answer some of these top questions right here, and right now. Why switch from DynamoDB to ScyllaDB? If you are here, chances are that you fall under at least one of the following categories: Costs are running out of control Latency and/or throughput are suboptimal You are currently locked-in and would like a bit more flexibility ScyllaDB delivers predictable low latency at scale with less infrastructure required. For DynamoDB specifically, we have an in-depth article covering which pain points we address. Is ScyllaDB Alternator a DynamoDB drop-in replacement? In the term’s strict sense, it is not: notable differences across both solutions exist. DynamoDB development is closed source and driven by AWS (which ScyllaDB is not affiliated with), which means that there’s a chance that some specific features launched in DynamoDB may take some time to land in ScyllaDB. A more accurate way to describe it is as an almost drop-in replacement. Whenever you migrate to a different database, some degree of changes will always be required to get started with the new solution. We try to keep the level of changes to a minimum to make the transition as seamless as possible. For example, Digital Turbine easily migrated from DynamoDB to ScyllaDB within just a single two-week sprint, the results showing significant performance improvements and cost savings. What are the main differences between ScyllaDB Alternator and AWS DynamoDB? Provisioning: In ScyllaDB you provision nodes, not tables. In other words, a single ScyllaDB deployment is able to host several tables and serve traffic for multiple workloads combined. Load Balancing: Application clients do not route traffic through a single endpoint as in AWS DynamoDB (dynamodb.<region_name> Instead, clients may use one of our load balancing libraries, or implement a server-side load balancer. Limits: ScyllaDB does not impose a 400KB limit per item, nor any partition access limits. Metrics and Integration: Since ScyllaDB is not a “native AWS service,” it naturally does not integrate in the same way as other AWS services (such as CloudWatch and others) does with DynamoDB. For metrics specifically, ScyllaDB provides the ScyllaDB Monitoring Stack with specific dashboards for DynamoDB deployments. When should I use the DynamoDB API instead of CQL? Whenever you’re interested in moving away from DynamoDB (either to remain in AWS or to another cloud), and either: Have zero interest in refactoring your code to a new API, or Plan to get started or evaluate ScyllaDB prior to major code refactoring. For example, you would want to use the DynamoDB API in a situation where hundreds of independent Lambda services communicating with DynamoDB may require quite an effort to refactor. Or, when you rely on a connector that doesn’t provide compatibility with the CQL protocol. For all other cases, CQL is likely to be a better option. Check out our protocol comparison for more details. What is the level of effort required to migrate to ScyllaDB? Assuming that all features required by the application are supported by ScyllaDB (irrespective of which API you choose), the level of effort should be minimal. The process typically involves lifting your existing DynamoDB tables’ data and then replaying changes from DynamoDB Streams to ScyllaDB. Once that is complete, you update your application to connect to ScyllaDB. I once worked with an AdTech company choosing CQL as their protocol. This obviously required code refactoring to adhere to the new query language specification. On the other hand, a mobile platform company decided to go with ScyllaDB Alternator, eliminating the need for data transformations during the migration and application code changes. Is there a tool to migrate from DynamoDB to ScyllaDB Alternator? Yes. The ScyllaDB Migrator is a Spark-based tool available to perform end-to-end migrations. We also provide relevant material and hands-on assistance for migrating to ScyllaDB using alternative methods, as relevant to your use case. I currently rely on DynamoDB autoscaling; how does that translate to ScyllaDB? More often than not you shouldn’t need it. Autoscaling is not free (there’s idle infrastructure reserved for you), and it requires considerable “time to scale”, which may end up ruining end users’ experience. A small ScyllaDB cluster alone should be sufficient to deliver tens to hundreds of thousands of operations per second – and a moderately-sized one can easily achieve over a million operations per second. That being said, the best practice is to be provisioned for the peak. What about DynamoDB Accelerator (DAX)? ScyllaDB implements a row-based cache, which is just as fast as DAX. We follow a read-through caching strategy (unlike the DAX write-through strategy), resulting in less write amplification, simplified cache management, and lower latencies. In addition, ScyllaDB’s cache is bundled within the core database, not as a separate add-on like DynamoDB Accelerator. Which features are (not) available? The ScyllaDB Alternator Compatibility page contains a detailed breakdown of not yet implemented features. Keep in mind that some features might be just missing the DynamoDB API implementation. You might be able to achieve the same functionality in ScyllaDB in other ways. If any particular missing feature is critical for your ScyllaDB adoption, please let us know. How safe is this? Really? ScyllaDB Alternator has been production-ready since 2020, with leading organizations running it in production both on-premise as well as in ScyllaDB Cloud. Our DynamoDB compatible API is extensively tested and its code is open source. Next Steps If you’d like to learn more about how to succeed during your DynamoDB to ScyllaDB journey, I highly encourage you to check out our recent ScyllaDB Summit talk. For a detailed performance comparison across both solutions, check out our ScyllaDB Cloud vs DynamoDB benchmark. If you are still unsure whether a change makes sense, then you might want to read the top reasons why teams decide to abandon the DynamoDB ecosystem. If you’d like a high-level overview on how to move away from DynamoDB, refer to our DynamoDB: How to Move Out? article. When you are ready for your actual migration, then check out our in-depth walkthrough of an end-to-end DynamoDB to ScyllaDB migration. Chances are that I probably did not address some of your more specific questions (sorry about that!), in which case you can always book a 1:1 Technical Consultation with me so we can discuss your specific situation thoroughly. I’m looking forward to it!

Better “Goodput” Performance through C++ Exception Handling

Some very strange error reporting performance issues we uncovered while working on per-partition query rate limiting — and how we addressed it through C++ exception handling In Retaining Goodput with Query Rate Limiting, I introduced why and how ScyllaDB implemented per-partition rate limiting to maintain stable performance under stress by rejecting excess requests to maintain stable goodput. The implementation works by estimating request rates for each partition and making decisions based on those estimates. And benchmarks confirmed how rate limiting restored goodput, even under stressful conditions, by rejecting problematic requests. However, there was actually an interesting behind-the-scenes twist in that story. After we first coded the solution, we discovered that the performance wasn’t as good as expected. In fact, we were achieving the opposite effect of what we expected. Rejected operations consumed more CPU than the successful ones. This was very strange because tracking per-partition hit counts shouldn’t have been compute-intensive. It turned out that the problem was related to how ScyllaDB reports failures: namely, via C++ exceptions. In this article, I will share more details about the error reporting performance issues that we uncovered while working on per-partition query rate limiting and explain how we addressed them. On C++ (and Seastar) Exceptions Exceptions are notorious in the C++ community. There are many problems with their design, of course. But the most important and relevant one here is the unpredictable and potentially degraded performance that occurs when an exception is actually thrown. Some C++ projects disable exceptions altogether for those reasons. However, it’s hard to avoid exceptions because they’re thrown by the standard library and because they interact with other language features. Seastar, an open-source C++ framework for I/O intensive asynchronous computing, embraces exceptions and provides facilities for handling them correctly. Since ScyllaDB is built on Seastar, the ScyllaDB engineering team is rather accustomed to reporting failures via exceptions. They work fine, provided that errors aren’t very frequent. Under overload, it’s a different story though. We noticed that throwing exceptions in large volumes can introduce a performance bottleneck. This problem affects both existing errors such as timeouts and the new “rate limit exception” that we tried to introduce. This wasn’t the first time that we encountered performance issues with exceptions. In libstdc++, the standard library for GCC, throwing an exception involves acquiring a global mutex. That mutex protects some information important to the runtime that can be modified when a dynamic library is being loaded or unloaded. ScyllaDB doesn’t use dynamic libraries, so we were able to disable the mutex with some clever workarounds. As an unfortunate side effect, we disabled some caching functionality that speeds up further exception throws. However, avoiding scalability bottlenecks was more important to us. Exceptions are usually propagated up the call stack until a try..catch block is encountered. However, in programs with non-standard control flow, it sometimes makes sense to capture the exception and rethrow it elsewhere. For this purpose, std::exception_ptr can be used. Our Seastar framework is a good example of code that uses it. Seastar allows running concurrent, asynchronous tasks that can wait for the results of each other via objects called futures. If a task results in an exception, it is stored in a future as std::exception_ptr and can be later inspected by a different task. future<> do_thing() { return really_do_thing().finally([] { std::cout << "Did the thing\n"; }); } future<> really_do_thing() { if (fail_flag) { return make_exception_future<>( std::runtime_error("oh no!")); } else { return make_ready_future<>(); } } Seastar nicely reduces the time spent in the exception handling runtime. In most cases, Seastar is not interested in the exception itself; it only cares about whether a given future contains an exception. Because of that, the common control flow primitives such as then, finally etc. do not have to rethrow and inspect the exception. They only check whether an exception is present. Additionally, it’s possible to construct an “exceptional future” directly, without throwing the exception. Unfortunately, the standard library doesn’t make it possible to inspect an std::exception_ptr without rethrowing it. Because each exception must be inspected and handled appropriately at some point (sometimes more than once), it’s impossible for us to eliminate throws if we want to use exceptions and have portable code. We had to provide a cheaper way to return timeouts and “rate limit exceptions.” We tried two approaches. Approach 1: Avoid The Exceptions The first approach we explored was quite heavy-handed. It involved plowing through the code and changing all the important functions to return a boost::result. That’s a type from the boost library that holds either a successful result or an error. We also introduced a custom container for holding exceptions that allows inspecting the exception inside it without having to throw it. future<result<>> do_thing() { return really_do_thing().then( [] (result<> res) -> result<> { if (res) { // handle success } else { // handle failure } } ); } This approach worked, but had numerous drawbacks. First, it required a lot of work to adapt the existing code: return types had to be changed, and the boost::result returned by the functions had to be explicitly checked to see whether they held an exception or not. Moreover, those checks introduced a slight overhead on the happy path. We applied this approach to the coordinator logic. Approach 2: Implement the Missing Parts Another approach: create our own implementation for inspecting the value of std::exception_ptr. We encountered a proposal to extend the std::exception_ptr’s interface that included adding the possibility to inspect its value without rethrowing it. The proposal includes an experimental implementation that uses standard-library specific and ABI-specific constructs to implement it for some of the major compilers. Inspired by it, we implemented a small utility function that allows us to match on the exception’s type and inspect its value, and replaced a bunch of try..catch blocks with it. std::exception_ptr ep = get_exception(); if (auto* ex = try_catch(ep)) { // ... } else if (auto* ex = try_catch(ep)) { // ... } else { // ... } This is how we eliminated throws in the most important parts of the replica logic. It was much less work than the previous approach since we only had to get rid of try-catch blocks and make sure that exceptions were propagated using the existing, non-throwing primitives. While the solution is not portable, it works for us since we use and support only a single toolchain at a time to build ScyllaDB. However, it can be easy to accidentally reintroduce performance problems if you’re not careful and forget to use the non-throwing primitives, including the new try_catch. In contrast, the first approach does not have this problem. Results: Better Exception Handling, Better Throughput (and Goodput) To measure the results of these optimizations on the exception handling path, we ran a benchmark. We set up a cluster of 3 i3.4xlarge AWS nodes with ScyllaDB 5.1.0-rc1. We pre-populated it with a small data set that fits in memory. Then, we ran a write workload to a single partition. The data included in each request is very small, so the workload was CPU bound, not I/O bound. Each request had a small server-side timeout of 10ms. The benchmarking application gradually increases its write rate from 50k writes per second to 100k. The chart shows what happens when ScyllaDB is no longer able to handle the throughput. There are two runs of the same benchmark, superimposed on the same charts – a yellow line for ScyllaDB 5.0.3 which handles timeouts in the old, slow way, and a green line for ScyllaDB 5.1.0-rc1 which has exception handling improvements. The older version of ScyllaDB became overwhelmed somewhere around the 60k writes per second mark. We were pushing the shard to its limit, and some requests inevitably failed because we set a short timeout. Since processing timed out requests is more costly than processing a regular request, the shard very quickly entered a state where all requests that it processed resulted in timeouts. The new version, which has better exception handling performance, was able to sustain a larger throughput. Some timeouts occurred, but they did not overwhelm the shard completely. Timeouts don’t start to be more prominent until the 100k requests per second mark. Summary We discovered unexpected performance issues while implementing a special case of ScyllaDB’s per-partition query rate limiting. Despite Seastar’s support for handling exceptions efficiently, throwing exceptions in large volumes during overload situations became a bottleneck, affecting both existing errors like timeouts and new exceptions introduced for rate limiting. We solved the issue by using two different approaches in different parts of the code: Change the code to pass information about timeouts by returning boost::result instead of throwing exceptions. Implement a utility function that allows catching exceptions cheaply without throwing them. Benchmarks with ScyllaDB versions 5.0.3 and 5.1.0-rc1 demonstrated significant improvements in handling throughput with better exception handling. While the older version struggled and became overwhelmed around 60k writes per second, the newer version sustained larger throughput levels with fewer timeouts, even at 100k requests per second. Ultimately, these optimizations not only improved exception handling performance but also contributed to better throughput and “goodput” under high load scenarios.

Instaclustr for Apache Cassandra® Versions 3.11.17, 4.0.12 and 4.1.4 are now GA

NetApp is pleased to announce the general availability of Apache Cassandra® versions 3.11.17, 4.0.12 and 4.1.4 on its Instaclustr Managed Platform, maintaining the availability of the latest Cassandra application versions. These versions include a series of important fixes and improvements that enhance the stability, performance and security of your Cassandra deployments. 

Refer to the full project release notes for each version here: 3.11.17, 4.0.12 and 4.1.4. 

The main feature of these versions is that they address CVE-2023-43642 by upgrading the snappy Java to Read the related Instaclustr security advisory published on this specific vulnerability.   

Apache Cassandra versions 3.11.17, 4.0.12 and 4.1.4 are now in general availability and are the recommended choices for all new clusters. Read about the lifecycle status of Apache Cassandra versions here. 

For existing clusters on older versions, Instaclustr Support will reach out to you to schedule an upgrade

Lifecycle Policy Updates 

Following the release of the latest patch versions, the following lifecycle state changes will apply to previous versions: 

With the Release of:    Transition
Cassandra 4.1.4  Cassandra 4.1.3 will be changed from GA version to a Closed version. 
Cassandra 4.0.12  Cassandra 4.0.11 will be changed from GA version to a Legacy Support version. 
Cassandra 3.11.17  Cassandra 3.11.16 will be Deprecated. 
Cassandra 3.11.15 will be changed from Deprecated version to a Closed version. 

To view the lifecycle status of application versions, please visit our website here. You can learn about lifecycle policies here that define Instaclustr Support for offerings as they are superseded by preferred options and ultimately head toward the end of life. 

Enterprise Support Customers 

We recommend all Instaclustr Enterprise Support customers upgrade their Cassandra clusters to versions 3.11.17, 4.0.12, or 4.1.4, depending on the major version you are currently on. Upgrading to these latest Cassandra versions will ensure that you stay protected from vulnerabilities and benefit from the latest features.If you have any questions or concerns during the upgrade process, please make sure to reach out to our support team for assistance. 

Latest Apache Cassandra Version Release 

NetApp promptly releases the latest Cassandra versions following their official release by the Apache Project to ensure you benefit from the latest features and stay protected from vulnerabilities. We have also recently released the Apache Cassandra 5.0 beta version with support for Vector Search and Storage-Attached Indexes (SAI).  

Getting Started 

If you are not an existing customer, click here to sign up for a free trial to experience seamless deployments of Apache Cassandra on the Instaclustr Managed Platform. Once you have signed up, visit our documentation site to learn how to spin a Cassandra cluster with just a few clicks. 

If you have any further queries regarding this release, please contact Instaclustr Support.  

The post Instaclustr for Apache Cassandra® Versions 3.11.17, 4.0.12 and 4.1.4 are now GA appeared first on Instaclustr.

ScyllaDB University Updates, High Availability Lab & NoSQL Training Events

There are many paths to adopting and mastering ScyllaDB, but the vast majority all share a visit to ScyllaDB University. I want to update the ScyllaDB community about recent changes to ScyllaDB University, introduce the High Availability lab, and share some future events and plans. What is ScyllaDB University? For those new to ScyllaDB University, it’s an online learning and training center for ScyllaDB. The material is self-paced and includes theory, quizzes, and hands-on labs that you can run yourself. Access is free; to get started, just create an account. Start Learning ScyllaDB University has six courses and covers most ScyllaDB-related material. As the product evolves, the training material is updated and enhanced. In addition to the self-paced, on-demand material available on ScyllaDB University, we host live training events. These virtual and highly interactive events offer special opportunities for users to learn from the leading ScyllaDB engineers and experts, ask questions, and connect with their peers. The last ScyllaDB University LIVE event took place in March. It had two tracks: an Essentials track covering basic content, as well as an Advanced track covering the latest features and more complex use cases. The next event in the US/EMEA time zones will be on June 18th. Join us at ScyllaDB University LIVE Also, if you’re in India, we’re hosting a special event for you! On May 29, 9:30am-11:30am IST, join us for an interactive workshop where we’ll go hands-on to build and interact with high-performance apps using ScyllaDB. This is a great way to discover the NoSQL strategies used by top teams and apply them in a guided, supportive environment. Register for ScyllaDB Labs India Example Lab: High Availability Here’s a taste of the type of content you will find on ScyllaDB University. The High Availability lab is one of the most popular labs on the platform. It demonstrates what happens when we read and write data while simulating a situation where some of the replica nodes are unavailable, with different consistency levels. ScyllaDB, at its core, is designed to be highly available. With the right settings, it will remain up and available at all times. The database features a peer-to-peer architecture that is leaderless and uses the Gossip protocol for internode communication. This enhances resiliency and scalability. Upon startup and during topology changes such as node addition or removal, nodes automatically discover and communicate with each other, streaming data between them as needed. Multiple copies of the data (replicas) are kept. This is determined by setting a replication factor (RF) to ensure data is stored on multiple nodes, thereby allowing for data redundancy and high availability. Even if a node fails, the data is still available on other nodes, reducing downtime risks. Topology awareness is part of the design. ScyllaDB enhances its fault tolerance through rack and data center awareness, allowing distribution across multiple locations. It supports multi-datacenter replication with customizable replication factors for each site, ensuring data resiliency and availability even in case of significant failures of entire data centers. First, we’ll bring up a 3-node ScyllaDB Docker cluster. Set up a Docker Container with one node, called Node_X: docker run --name Node_X -d scylladb/scylla:5.2.0 --overprovisioned 1 --smp 1 Create two more nodes, Node_Y and Node_Z, and add them to the cluster of Node_X. The command “$(docker inspect –format='{{ .NetworkSettings.IPAddress }}’ Node_X)” translates to the IP address of Node-X: docker run --name Node_Y -d scylladb/scylla:5.2.0 --seeds="$(docker inspect --format='{{ .NetworkSettings.IPAddress }}' Node_X)" --overprovisioned 1 --smp 1 docker run --name Node_Z -d scylladb/scylla:5.2.0 --seeds="$(docker inspect --format='{{ .NetworkSettings.IPAddress }}' Node_X)" --overprovisioned 1 --smp 1 Wait a minute or so and check the node status: docker exec -it Node_Z nodetool status You’ll see that eventually, all the nodes have UN for status. U means up, and N means normal. Read more about Nodetool Status Here. Once the nodes are up, and the cluster is set, we can use the CQL shell to create a table. Run a CQL shell: docker exec -it Node_Z cqlsh Create a keyspace called “mykeyspace”, with a Replication Factor of three: CREATE KEYSPACE mykeyspace WITH REPLICATION = { 'class' : 'NetworkTopologyStrategy', 'replication_factor' : 3}; Next, create a table with three columns: user id, first name, and last name, and insert some data: use mykeyspace; CREATE TABLE users ( user_id int, fname text, lname text, PRIMARY KEY((user_id))); Insert into the newly created table two rows: insert into users(user_id, fname, lname) values (1, 'rick', 'sanchez'); insert into users(user_id, fname, lname) values (4, 'rust', 'cohle'); Read the table contents: select * from users;   Now we will see what happens when we try to read and write data to our table, with varying Consistency Levels and when some of the nodes are down. Let’s make sure all our nodes are still up: docker exec -it Node_Z nodetool status Now, set the Consistency Level to QUORUM and perform a write: docker exec -it Node_Z cqlsh use mykeyspace; CONSISTENCY QUORUM insert into users (user_id, fname, lname) values (7, 'eric', 'cartman'); Read the data to see if the insert was successful: select * from users; The read and write operations were successful. What do you think would happen if we did the same thing with Consistency Level ALL? CONSISTENCY ALL insert into users (user_id, fname, lname) values (8, 'lorne', 'malvo'); select * from users; The operations were successful again. CL=ALL means that we have to read/write to the number of nodes according to the Replication Factor, 3 in our case. Since all nodes are up, this works. Next, we’ll take one node down and check read and write operations with a Consistency Level of Quorum and ALL. Take down Node_Y and check the status (it might take some time until the node is actually down): exit docker stop Node_Y docker exec -it Node_Z nodetool status Now, set the Consistency Level to QUORUM and perform a write: docker exec -it Node_Z cqlsh CONSISTENCY QUORUM use mykeyspace; insert into users (user_id, fname, lname) values (9, 'avon', 'barksdale'); Read the data to see if the insert was successful: select * from users; With CL = QUORUM, the read and write were successful. What will happen with CL = ALL? CONSISTENCY ALL insert into users (user_id, fname, lname) values (10, 'vm', 'varga'); select * from users; Both read and write fail. CL = ALL requires that we read/write to three nodes (based on RF = 3), but only two nodes are up. What happens if another node is down? Take down Node_Z and check the status: exit docker stop Node_Z docker exec -it Node_X nodetool status Now, set the Consistency Level to QUORUM and perform a read and a write: docker exec -it Node_X cqlsh CONSISTENCY QUORUM use mykeyspace; insert into users (user_id, fname, lname) values (11, 'morty', 'smith'); select * from users; With CL = QUORUM, the read and write fail. Since RF=3, QUORUM means at least two nodes need to be up. In our case, just one is. What will happen with CL = ONE? CONSISTENCY ONE insert into users (user_id, fname, lname) values (12, 'marlo', 'stanfield'); select * from users; With CL = QUORUM, the read and write fail. Since RF=3, QUORUM means at least two nodes need to be up. In our case, just one is. What do you think will happen with CL = ONE? Check out the full High Availability lesson and lab on ScyllaDB University, with expected outputs, full explanation of the theory, and an option for seamlessly running the lab within a browser, regardless of the platform you’re using. Next Steps and Upcoming  NoSQL Events A good starting point on ScyllaDB University is ScyllaDB Essentials – Overview of ScyllaDB and NoSQL Basics course. There, you can find more hands-on labs like the one above, and learn about distributed databases, NoSQL, and ScyllaDB. ScyllaDB is backed by a vibrant community of users and developers. If you have any questions about this blog post or about a specific lesson, or want to discuss it, you can write on the ScyllaDB community forum. Say hello here. The next live training events are taking place in: An Asia-friendly time zone event will take place on the 29th of May US/Europe, Middle East, and Africa time zones on the 18th of June. Stay tuned for more details!