How to Model Leaderboards for 1M Player Game with ScyllaDB

Ever wondered how a game like League of Legends, Fortnite, or even Rockband models its leaderboards? In this article, we’ll explore how to properly model a schema for leaderboards…using a monstrously fast database (ScyllaDB)! 1. Prologue Ever since I was a kid, I’ve been fascinated by games and how they’re made. My favorite childhood game was Guitar Hero 3: Legends of Rock. Well, more than a decade later, I decided to try to contribute to some games in the open source environment, like rust-ro (Rust Ragnarok Emulator) and YARG (Yet Another Rhythm Game). YARG is another rhythm game, but this project is completely open source. It unites legendary contributors in game development and design. The game was being picked up and played mostly by Guitar Hero/Rockband streamers on Twitch. I thought: Well, it’s an open-source project, so maybe I can use my database skills to create a monstrously fast leaderboard for storing past games. It started as a simple chat on their Discord, then turned into a long discussion about how to make this project grow faster. Ultimately, I decided to contribute to it by building a leaderboard with ScyllaDB. In this blog, I’ll show you some code and concepts! 2. Query-Driven Data Modeling With NoSQL, you should first understand which query you want to run depending on the paradigm (document, graph, wide-column, etc.). Focus on the query and create your schema based on that query. In this project, we will handle two types of paradigms: Key-Value Wide Column (Clusterization) Now let’s talk about the queries/features of our modeling. 2.1 Feature: Storing the matches Every time you finish a YARG gameplay, you want to submit your scores plus other in-game metrics. Basically, it will be a single query based on a main index. SELECT score, stars, missed_notes, instrument, ... FROM leaderboard.submisisons WHERE submission_id = 'some-uuid-here-omg' 2.2 Feature: Leaderboard And now our main goal: a super cool leaderboard that you don’t need to worry about after you perform good data modeling. The leaderboard is per song: every time you play a specific song, your best score will be saved and ranked. The interface has filters that dictate exactly which leaderboard to bring: song_id: required instrument: required modifiers: required difficulty: required player_id: optional score: optional Imagine our query looks like this, and it returns the results sorted by score in descending order: SELECT player_id, score, ... FROM leaderboard.song_leaderboard WHERE instrument = 'guitar' AND difficulty = 'expert' AND modifiers = {'none'} AND track_id = 'dani-california' LIMIT 100; -- player_id | score ----------------+------- -- tzach | 12000 -- danielhe4rt | 10000 -- kadoodle | 9999 ----------------+------- Can you already imagine what the final schema will look like? No? Ok, let me help you with that! 3. Data Modeling time! It’s time to take a deep dive into data modeling with ScyllaDB and better understand how to scale it. 3.1 – Matches Modeling First, let’s understand a little more about the game itself: It’s a rhythm game; You play a certain song at a time; You can activate “modifiers” to make your life easier or harder before the game; You must choose an instrument (e.g. guitar, drums, bass, and microphone). Every aspect of the gameplay is tracked, such as: Score; Missed notes; Overdrive count; Play speed (1.5x ~ 1.0x); Date/time of gameplay; And other cool stuff. Thinking about that, let’s start our data modeling. It will turn into something like this: CREATE TABLE IF NOT EXISTS leaderboard.submissions ( submission_id uuid, track_id text, player_id text, modifiers frozen<set>, score int, difficulty text, instrument text, stars int, accuracy_percentage float, missed_count int, ghost_notes_count int, max_combo_count int, overdrive_count int, speed int, played_at timestamp, PRIMARY KEY (submission_id, played_at) ); Let’s skip all the int/text values and jump to the set<text>. The set type allows you to store a list of items of a particular type. I decided to use this list to store the modifiers because it’s a perfect fit. Look at how the queries are executed: INSERT INTO leaderboard.submissions ( submission_id, track_id, modifiers, played_at ) VALUES ( some-cool-uuid-here, 'starlight-muse' {'all-taps', 'hell-mode', 'no-hopos'}, '2024-01-01 00:00:00' ); With this type, you can easily store a list of items to retrieve later. Another cool piece of information is that this query is a key-value like! What does that mean? Since you will always query it by the submission_id only, it can be categorized as a key-value. 3.2 Leaderboard Modeling Now we’ll cover some cool wide-column database concepts. In our leaderboard query, we will always need some dynamic values in the WHERE clauses. That means these values will belong to the Partition Key while the Clustering Keys will have values that can be “optional”. A partition key is a hash based on a combination of fields that you added to identify a value. Let’s imagine that you played Starlight - Muse 100x times. If you were to query this information, it would return 100x different results differentiated by Clustering Keys like score or player_id. SELECT player_id, score --- FROM leaderboard.song_leaderboard WHERE track_id = 'starlight-muse' LIMIT 100; If 1,000,000 players play this song, your query will become slow and it will become a problem in the future because your partition key consists of only one field, which is track_id. However, if you add more fields to your Partition Key, like mandatory things before playing the game, maybe you can shrink these possibilities for a faster query. Now do you see the big picture? Adding the fields like Instrument, Difficulty, and Modifiers will give you a way to split the information about that specific track evenly. Let’s imagine with some simple numbers: -- Query Partition ID: '1' SELECT player_id, score, ... FROM leaderboard.song_leaderboard WHERE instrument = 'guitar' AND difficulty = 'expert' AND modifiers = {'none'} AND -- Modifiers Changed track_id = 'starlight-muse' LIMIT 100; -- Query Partition ID: '2' SELECT player_id, score, ... FROM leaderboard.song_leaderboard WHERE instrument = 'guitar' AND difficulty = 'expert' AND modifiers = {'all-hopos'} AND -- Modifiers Changed track_id = 'starlight-muse' LIMIT 100; So, if you build the query in a specific shape it will always look for a specific token and retrieve the data based on these specific Partition Keys. Let’s take a look at the final modeling and talk about the clustering keys and the application layer: CREATE TABLE IF NOT EXISTS leaderboard.song_leaderboard ( submission_id uuid, track_id text, player_id text, modifiers frozen<set>, score int, difficulty text, instrument text, stars int, accuracy_percentage float, missed_count int, ghost_notes_count int, max_combo_count int, overdrive_count int, speed int, played_at timestamp, PRIMARY KEY ((track_id, modifiers, difficulty, instrument), score, player_id) ) WITH CLUSTERING ORDER BY (score DESC, player_id ASC); The partition key was defined as mentioned above, consisting of our REQUIRED PARAMETERS such as track_id, modifiers, difficulty and instrument. And for the Clustering Keys, we added score and player_id. Note that by default the clustering fields are ordered by score DESC and just in case a player has the same score, the criteria to choose the winner will be alphabetical ¯\(ツ)/¯. First, it’s good to understand that we will have only ONE SCORE PER PLAYER. But, with this modeling, if the player goes through the same track twice with different scores, it will generate two different entries. INSERT INTO leaderboard.song_leaderboard ( track_id, player_id, modifiers, score, difficulty, instrument, stars, played_at ) VALUES ( 'starlight-muse', 'daniel-reis', {'none'}, 133700, 'expert', 'guitar', '2023-11-23 00:00:00' ); INSERT INTO leaderboard.song_leaderboard ( track_id, player_id, modifiers, score, difficulty, instrument, stars, played_at ) VALUES ( 'starlight-muse', 'daniel-reis', {'none'}, 123700, 'expert', 'guitar', '2023-11-23 00:00:00' ); SELECT player_id, score FROM leaderboard.song_leaderboard WHERE instrument = 'guitar' AND difficulty = 'expert' AND modifiers = {'none'} AND track_id = 'starlight-muse' LIMIT 2; -- player_id | score ----------------+------- -- daniel-reis | 133700 -- daniel-reis | 123700 ----------------+------- So how do we fix this problem? Well, it’s not a problem per se. It’s a feature! As a developer, you have to create your own business rules based on the project’s needs, and this is no different. What do I mean by that? You can run a simple DELETE query before inserting the new entry. That will guarantee that you will not have specific data from the player_id with less than the new score inside that specific group of partition keys. -- Before Insert the new Gampleplay DELETE FROM leaderboard.song_leaderboard WHERE instrument = 'guitar' AND difficulty = 'expert' AND modifiers = {'none'} AND track_id = 'starlight-muse' AND player_id = 'daniel-reis' AND score <= 'your-new-score-here'; -- Now you can insert the new payload... And with that, we finished our simple leaderboard system, the same one that runs in YARG and can also be used in games with MILLIONS of entries per second 😀 4. How to Contribute to YARG Want to contribute to this wonderful open-source project? We’re building a brand new platform for all the players using: Game: Unity3d (Repository) Front-end: NextJS (Repository) Back-end: Laravel 10.x (Repository) We will need as many developers and testers as possible to discuss future implementations of the game together with the main contributors! First, make sure to join this Discord Community. This is where all the technical discussions happen with the backing of the community before going to the development board. Also, outside of Discord, the YARG community is mostly focused on the EliteAsian (core contributor and project owner) X account for development showcases. Be sure to follow him there as well.
New replay viewer HUD for #YARG! There are still some issues with it, such as consistency, however we are planning to address them by the official stable release of v0.12. pic.twitter.com/9ACIJXAZS4 — EliteAsian (@EliteAsian123) December 16, 2023
And FYI, the Lead Artist of the game, (aka Kadu) is also a Broadcast Specialist and Product Innovation Developer at Elgato who worked with streamers like: Ninja Nadeshot StoneMountain64 and the legendary DJ Marshmello. Kadu also uses his X to share some insights and early previews of new features and experimentations for YARG. So, don’t forget to follow him as well!
Here's how the replay venue looks like now, added a lot of details on the desk, really happy with the result so far, going to add a few more and start the textures pic.twitter.com/oHH27vkREe — ⚡Kadu Waengertner (@kaduwaengertner) August 10, 2023
Here are some useful links to learn more about the project: Official Website Github Repository Task Board
Fun fact: YARG got noticed by Brian Bright, project lead on Guitar Hero, who liked the fact that the project was open source. Awesome, right?
5. Conclusion Data modeling is sometimes challenging. This project involved learning many new concepts and a lot of testing together with my community on Twitch. I have also published a Gaming Leaderboard Demo, where you can get some insights on how to implement the same project using NextJS and ScyllaDB! Also, if you like ScyllaDB and want to learn more about it, I strongly suggest you watch our free Masterclass Courses or visit ScyllaDB University!  

Will Your Cassandra Database Project Succeed?: The New Stack

Open source Apache Cassandra® continues to stand out as an enterprise-proven solution for organizations seeking high availability, scalability and performance in a NoSQL database. (And hey, the brand-new 5.0 version is only making those statements even more true!) There’s a reason this database is trusted by some of the world’s largest and most successful companies.

That said, effectively harnessing the full spectrum of Cassandra’s powerful advantages can mean overcoming a fair share of operational complexity. Some folks will find a significant learning curve, and knowing what to expect is critical to success. In my years of experience working with Cassandra, it’s when organizations fail to anticipate and respect these challenges that they set the stage for their Cassandra projects to fall short of expectations.

Let’s look at the key areas where strong project management and following proven best practices will enable teams to evade common pitfalls and ensure a Cassandra implementation is built strong from Day 1.

Accurate Data Modeling Is a Must

Cassandra projects require a thorough understanding of its unique data model principles. Teams that approach Cassandra like a relationship database are unlikely to model data properly. This can lead to poor performance, excessive use of secondary indexes and significant data consistency issues.

On the other hand, teams that develop familiarity with Cassandra’s specific NoSQL data model will understand the importance of including partition keys, clustering keys and denormalization. These teams will know to closely analyze query and data access patterns associated with their applications and know how to use that understanding to build a Cassandra data model that matches their application’s needs step for step.

The post Will Your Cassandra Database Project Succeed?: The New Stack appeared first on Instaclustr.

How ShareChat Scaled their ML Feature Store 1000X without Scaling the Database

How ShareChat successfully scaled 1000X without scaling the underlying database (ScyllaDB) The demand for low-latency machine learning feature stores is higher than ever, but actually implementing one at scale remains a challenge. That became clear when ShareChat engineers Ivan Burmistrov and Andrei Manakov took the P99 CONF 23 stage to share how they built a low-latency ML feature store based on ScyllaDB. This isn’t a tidy case study where adopting a new product saves the day. It’s a “lessons learned” story, a look at the value of relentless performance optimization – with some important engineering takeaways. The original system implementation fell far short of the company’s scalability requirements. The ultimate goal was to support 1 billion features per second, but the system failed under a load of just 1 million. With some smart problem solving, the team pulled it off though. Let’s look at how their engineers managed to pivot from the initial failure to meet their lofty performance goal without scaling the underlying database. Obsessed with performance optimizations and low-latency engineering? Join your peers at P99 24 CONF, a free highly technical virtual conference on “all things performance.” Speakers include: Michael Stonebraker, Postgres creator and MIT professor Bryan Cantrill, Co-founder and CTO of Oxide Computer Avi Kivity, KVM creator, ScyllaDB co-founder and CTO Liz Rice, Chief open source officer with eBPF specialists Isovalent Andy Pavlo, CMU professor Ashley Williams, Axo founder/CEO, former Rust core team, Rust Foundation founder Carl Lerche, Tokio creator, Rust contributor and engineer at AWS Register Now – It’s Free In addition to another great talk by Ivan from ShareChat’, expect more than 60 engineering talks on performance optimizations at Disney/Hulu, Shopify, Lyft, Uber, Netflix,  American Express, Datadog, Grafana, LinkedIn, Google, Oracle, Redis, AWS, ScyllaDB and more. Register for free. ShareChat: India’s Leading Social Media Platform To understand the scope of the challenge, it’s important to know a little about ShareChat, the leading social media platform in India. On the ShareChat app, users discover and consume content in more than 15 different languages, including videos, images, songs and more. ShareChat also hosts a TikTok-like short video platform (Moj) that encourages users to be creative with trending tags and contests. Between the two applications, they serve a rapidly growing user base that already has over 325 million monthly active users. And their AI-based content recommendation engine is essential for driving user retention and engagement. Machine learning feature stores at ShareChat This story focuses on the system behind ML feature stores for the short-form video app Moj. It offers fully personalized feeds to around 20 million daily active users, 100 million monthly active users. Feeds serve 8,000 requests per second, and there’s an average of 2,000 content candidates being ranked on each request (for example, to find the 10 best items to recommend). “Features” are pretty much anything that can be extracted from the data: Ivan Burmistrov, principal staff software engineer at ShareChat, explained: “We compute features for different ‘entities.’ Post is one entity, User is another and so on. From the computation perspective, they’re quite similar. However, the important difference is in the number of features we need to fetch for each type of entity. When a user requests a feed, we fetch user features for that single user. However, to rank all the posts, we need to fetch features for each candidate (post) being ranked, so the total load on the system generated by post features is much larger than the one generated by user features. This difference plays an important role in our story.” What went wrong At first, the primary focus was on building a real-time user feature store because, at that point, user features were most important. The team started to build the feature store with that goal in mind. But then priorities changed and post features became the focus too. This shift happened because the team started building an entirely new ranking system with two major differences versus its predecessor: Near real-time post features were more important The number of posts to rank increased from hundreds to thousands Ivan explained: “When we went to test this new system, it failed miserably. At around 1 million features per second, the system became unresponsive, latencies went through the roof and so on.” Ultimately, the problem stemmed from how the system architecture used pre-aggregated data buckets called tiles. For example, they can aggregate the number of likes for a post in a given minute or other time range. This allows them to compute metrics like the number of likes for multiple posts in the last two hours. Here’s a high-level look at the system architecture. There are a few real-time topics with raw data (likes, clicks, etc.). A Flink job aggregates them into tiles and writes them to ScyllaDB. Then there’s a feature service that requests tiles from ScyllaDB, aggregates them and returns results to the feed service. The initial database schema and tiling configuration led to scalability problems. Originally, each entity had its own partition, with rows timestamp and feature name being ordered clustering columns. [Learn more in this NoSQL data modeling masterclass]. Tiles were computed for segments of one minute, 30 minutes and one day. Querying one hour, one day, seven days or 30 days required fetching around 70 tiles per feature on average. If you do the math, it becomes clear why it failed. The system needed to handle around 22 billion rows per second. However, the database capacity was only 10 million rows/sec.   Initial optimizations At that point, the team went on an optimization mission. The initial database schema was updated to store all feature rows together, serialized as protocol buffers for a given timestamp. Because the architecture was already using Apache Flink, the transition to the new tiling schema was fairly easy, thanks to Flink’s advanced capabilities in building data pipelines. With this optimization, the “Features” multiplier has been removed from the equation above, and the number of required rows to fetch has been reduced by 100X: from around 2 billion to 200 million rows/sec. The team also optimized the tiling configuration, adding additional tiles for five minutes, three hours and five days to one minute, 30 minutes and one day tiles. This reduced the average required tiles from 70 to 23, further reducing the rows/sec to around 73 million. To handle more rows/sec on the database side, they changed the ScyllaDB compaction strategy from incremental to leveled. [Learn more about compaction strategies]. That option better suited their query patterns, keeping relevant rows together and reducing read I/O. The result: ScyllaDB’s capacity was effectively doubled. The easiest way to accommodate the remaining load would have been to scale ScyllaDB 4x. However, more/larger clusters would increase costs and that simply wasn’t in their budget. So the team continued focusing on improving the scalability without scaling up the ScyllaDB cluster. Improved cache locality One potential way to reduce the load on ScyllaDB was to improve the local cache hit rate, so the team decided to research how this could be achieved. The obvious choice was to use a consistent hashing approach, a well-known technique to direct a request to a certain replica from the client based on some information about the request. Since the team was using NGINX Ingress in their Kubernetes setup, using NGINX’s capabilities for consistent hashing seemed like a natural choice. Per NGINX Ingress documentation, setting up consistent hashing would be as simple as adding three lines of code. What could go wrong? A bit. This simple configuration didn’t work. Specifically: The client subset led to a huge key remapping – up 100% in the worst case. Since the node keys can be changed in a hash ring, it was impossible to use real-life scenarios with autoscaling. [See the ingress implementation] It was tricky to provide a hash value for a request because Ingress doesn’t support the most obvious solution: a gRPC header. The latency suffered severe degradation, and it was unclear what was causing the tail latency. To support a subset of the pods, the team modified their approach. They created a two-step hash function: first hashing an entity, then adding a random prefix. That distributed the entity across the desired number of pods. In theory, this approach could cause a collision when an entity is mapped to the same pod several times. However, the risk is low given the large number of replicas. Ingress doesn’t support using gRPC header as a variable, but the team found a workaround: using path rewriting and providing the required hash key in the path itself. The solution was admittedly a bit “hacky” … but it worked. Unfortunately, pinpointing the cause of latency degradation would have required considerable time, as well as observability improvements. A different approach was needed to scale the feature store in time. To meet the deadline, the team split the Feature service into 27 different services and manually split all entities between them on the client. It wasn’t the most elegant approach, but, it was simple and practical – and it achieved great results. The cache hit rate improved to 95% and the ScyllaDB load was reduced to 18.4 million rows per second. With this design, ShareChat scaled its feature store to 1B features per second by March. However, this “old school” deployment-splitting approach still wasn’t the ideal design. Maintaining 27 deployments was tedious and inefficient. Plus, the cache hit rate wasn’t stable, and scaling was limited by having to keep a high minimum pod count in every deployment. So even though this approach technically met their needs, the team continued their search for a better long-term solution. The next phase of optimizations: consistent hashing, Feature service Ready for yet another round of optimization, the team revisited the consistent hashing approach using a sidecar, called Envoy Proxy, deployed with the feature service. Envoy Proxy provided better observability which helped identify the latency tail issue. The problem: different request patterns to the Feature service caused a huge load on the gRPC layer and cache. That led to extensive mutex contention. The team then optimized the Feature service. They: Forked the caching library (FastCache from VictoriaMetrics) and implemented batch writes and better eviction to reduce mutex contention by 100x. Forked gprc-go and implemented buffer pool across different connections to avoid contention during high parallelism. Used object pooling and tuned garbage collector (GC) parameters to reduce allocation rates and GC cycles. With Envoy Proxy handling 15% of traffic in their proof-of-concept, the results were promising: a 98% cache hit rate, which reduced the load on ScyllaDB to 7.4M rows/sec. They could even scale the feature store more: from 1 billion features/second to 3 billion features/second. Lessons learned Here’s what this journey looked like from a timeline perspective: To close, Andrei summed up the team’s top lessons learned from this project (so far): Use proven technologies. Even as the ShareChat team drastically changed their system design, ScyllaDB, Apache Flink and VictoriaMetrics continued working well. Each optimization is harder than the previous one – and has less impact. Simple and practical solutions (such as splitting the feature store into 27 deployments) do indeed work. The solution that delivers the best performance isn’t always user-friendly. For instance, their revised database schema yields good performance, but is difficult to maintain and understand. Ultimately, they wrote some tooling around it to make it simpler to work with. Every system is unique. Sometimes you might need to fork a default library and adjust it for your specific system to get the best performance. Watch their complete P99 CONF talk

Simplifying Cassandra and DynamoDB Migrations with the ScyllaDB Migrator

Learn about the architecture of ScyllaDB Migrator, how to use it, recent developments, and upcoming features. ScyllaDB offers both a CQL-compatible API and a DynamoDB-compatible API, allowing applications that use Apache Cassandra or DynamoDB to take advantage of reduced costs and lower latencies with minimal code changes. We previously described the two main migration strategies: cold and hot migrations. In both cases, you need to backfill ScyllaDB with historical data. Either can be efficiently achieved with the ScyllaDB Migrator. In this blog post, we will provide an update on its status. You will learn about its architecture, how to use it, recent developments, and upcoming features. The Architecture of the ScyllaDB Migrator The ScyllaDB Migrator leverages Apache Spark to migrate terabytes of data in parallel. It can migrate data from various types of sources, as illustrated in the following diagram: We initially developed it to migrate from Apache Cassandra, but we have since added support for more types of data sources. At the time of writing, the Migrator can migrate data from either: A CQL-compatible source: An Apache Cassandra table. Or a Parquet file stored locally or on Amazon S3. Or a DynamoDB-compatible source: A DynamoDB table. Or a DynamoDB table export on Amazon S3. What’s so interesting about ScyllaDB Migrator? Since it runs as an Apache Spark application, you can adjust its throughput by scaling the underlying Spark cluster. It is designed to be resilient to read or write failures. If it stops prior to completion, the migration can be restarted from where it left off. It can rename item columns along the way. When migrating from DynamoDB, the Migrator can endlessly replicate new changes to ScyllaDB. This is useful for hot migration strategies. How to Use the ScyllaDB Migrator More details are available in the official Migrator documentation. The main steps are: Set Up Apache Spark: There are several ways to set up an Apache Spark cluster, from using a pre-built image on AWS EMR to manually following the official Apache Spark documentation to using our automated Ansible playbook on your own infrastructure. You may also use Docker to run a cluster on a single machine. Prepare the Configuration File: Create a YAML configuration file that specifies the source database, target ScyllaDB cluster, and any migration option. Run the Migrator: Execute the ScyllaDB Migrator using the spark-submit command. Pass the configuration file as an argument to the migrator. Monitor the Migration: The Spark UI provides logs and metrics to help you monitor the migration process. You can track the progress and troubleshoot any issues that arise. You should also monitor the source and target databases to check whether they are saturated or not. Recent Developments The ScyllaDB Migrator has seen several significant improvements, making it more versatile and easier to use: Support for Reading DynamoDB S3 Exports: You can now migrate data from DynamoDB S3 exports directly to ScyllaDB, broadening the range of sources you can migrate from. PR #140. AWS AssumeRole Authentication: The Migrator now supports AWS AssumeRole authentication, allowing for secure access to AWS resources during the migration process. PR #150. Schema-less DynamoDB Migrations: By adopting a schema-less approach, the Migrator enhances reliability when migrating to ScyllaDB Alternator, ScyllaDB’s DynamoDB-compatible API. PR #105. Dedicated Documentation Website: The Migrator’s documentation is now available on a proper website, providing comprehensive guides, examples, and throughput tuning tips. PR #166. Update to Spark 3.5 and Scala 2.13: The Migrator has been updated to support the latest versions of Spark and Scala, ensuring compatibility and leveraging the latest features and performance improvements. PR #155. Ansible Playbook for Spark Cluster Setup: An Ansible playbook is now available to automate the setup of a Spark cluster, simplifying the initial setup process. PR #148. Publish Pre-built Assemblies: You don’t need to manually build the Migrator from the source anymore. Download the latest release and pass it to the spark-submit command. PR #158. Strengthened Continuous Integration: We have set up a testing infrastructure that reduces the risk of introducing regressions and prevents us from breaking backward compatibility. PRs #107, #121, #127. Hands-on Migration Example The content of this section has been extracted from the documentation website. The original content is kept up to date. Let’s go through a migration example to illustrate some of the points listed above. We will perform a cold migration to replicate 1,000,000 items from a DynamoDB table to ScyllaDB Alternator. The whole system is composed of the DynamoDB service, a Spark cluster with a single worker node, and a ScyllaDB cluster with a single node, as illustrated below: To make it easier for interested readers to follow along, we will create all those services using Docker. All you need is the AWS CLI and Docker. The example files can be found at  https://github.com/scylladb/scylla-migrator/tree/b9be9fb684fb0e51bf7c8cbad79a1f42c6689103/docs/source/tutorials/dynamodb-to-scylladb-alternator Set Up the Services and Populate the Source Database We use Docker Compose to define each service. Our docker-compose.yml file looks as follows: Let’s break down this Docker Compose file. We define the DynamoDB service by reusing the official image amazon/dynamodb-local. We use the TCP port 8000 for communicating with DynamoDB. We define the Spark master and Spark worker services by using a custom image (see below). Indeed, the official Docker images for Spark 3.5.1 only support Scala 2.12 for now, but we need Scala 2.13. We mount the local directory ./spark-data to the Spark master container path /app so that we can supply the Migrator jar and configuration to the Spark master node. We expose the ports 8080 and 4040 of the master node to access the Spark UIs from our host environment. We allocate 2 cores and 4 GB of memory to the Spark worker node. As a general rule, we recommend allocating 2 GB of memory per core on each worker. We define the ScyllaDB service by reusing the official image scylladb/scylla. We use the TCP port 8001 for communicating with ScyllaDB Alternator. The Spark services rely on a local Dockerfile located at path ./dockerfiles/spark/Dockerfile. For the sake of completeness, here is the content of this file, which you can copy-paste: And here is the entry point used by the image, which needs to be executable: This Docker image installs Java and downloads the official Spark release. The entry point of the image takes an argument that can be either master or worker to control whether to start a master node or a worker node. Prepare your system for building the Spark Docker image with the following commands: mkdir spark-data chmod +x entrypoint.sh Finally, start all the services with the following command: docker compose up Your system’s Docker daemon will download the DynamoDB and ScyllaDB images and build our Spark Docker image. Check that you can access the Spark cluster UI by opening http://localhost:8080 in your browser. You should see your worker node in the workers list. Once all the services are up, you can access your local DynamoDB instance and your local ScyllaDB instance by using the standard AWS CLI. Make sure to configure the AWS CLI as follows before running the dynamodb commands: # Set dummy region and credentials aws configure set region us-west-1 aws configure set aws_access_key_id dummy aws configure set aws_secret_access_key dummy # Access DynamoDB aws --endpoint-url http://localhost:8000 dynamodb list-tables # Access ScyllaDB Alternator aws --endpoint-url http://localhost:8001 dynamodb list-tables The last preparatory step consists of creating a table in DynamoDB and filling it with random data. Create a file named create-data.sh, make it executable, and write the following content into it: This script creates a table named Example and adds 1 million items to it. It does so by invoking another script, create-25-items.sh, that uses the batch-write-item command to insert 25 items in a single call: Every added item contains an id and five columns, all filled with random data. Run the script: ./create-data.sh and wait for a couple of hours until all the data is inserted (or change the last line of create-data.sh to insert fewer items and make the demo faster). Perform the Migration Once you have set up the services and populated the source database, you are ready to perform the migration. Download the latest stable release of the Migrator in the spark-data directory: wget https://github.com/scylladb/scylla-migrator/releases/latest/download/scylla-migrator-assembly.jar \ –directory-prefix=./spark-data Create a configuration file in spark-data/config.yaml and write the following content: This configuration tells the Migrator to read the items from the table Example in the dynamodb service, and to write them to the table of the same name in the scylla service. Finally, start the migration with the following command: docker compose exec spark-master \ /spark/bin/spark-submit \ --executor-memory 4G \ --executor-cores 2 \ --class com.scylladb.migrator.Migrator \ --master spark://spark-master:7077 \ --conf spark.driver.host=spark-master \ --conf spark.scylla.config=/app/config.yaml \ /app/scylla-migrator-assembly.jar This command calls spark-submit in the spark-master service with the file scylla-migrator-assembly.jar, which bundles the Migrator and all its dependencies. In the spark-submit command invocation, we explicitly tell Spark to use 4 GB of memory; otherwise, it would default to 1 GB only. We also explicitly tell Spark to use 2 cores. This is not really necessary as the default behavior is to use all the available cores, but we set it for the sake of illustration. If the Spark worker node had 20 cores, it would be better to use only 10 cores per executor to optimize the throughput (big executors require more memory management operations, which decrease the overall application performance). We would achieve this by passing --executor-cores 10, and the Spark engine would allocate two executors for our application to fully utilize the resources of the worker node. The migration process inspects the source table, replicates its schema to the target database if it does not exist, and then migrates the data. The data migration uses the Hadoop framework under the hood to leverage the Spark cluster resources. The migration process breaks down the data to transfer chunks of about 128 MB each, and processes all the partitions in parallel. Since the source is a DynamoDB table in our example, each partition translates into a scan segment to maximize the parallelism level when reading the data. Here is a diagram that illustrates the migration process: During the execution of the command, a lot of logs are printed, mostly related to Spark scheduling. Still, you should be able to spot the following relevant lines: 24/07/22 15:46:13 INFO migrator: ScyllaDB Migrator 0.9.2 24/07/22 15:46:20 INFO alternator: We need to transfer: 2 partitions in total 24/07/22 15:46:20 INFO alternator: Starting write… 24/07/22 15:46:20 INFO DynamoUtils: Checking for table existence at destination And when the migration ends, you will see the following line printed: 24/07/22 15:46:24 INFO alternator: Done transferring table snapshot During the migration, it is possible to monitor the underlying Spark job by opening the Spark UI available at http://localhost:4040 Example of a migration broken down in 6 tasks. The Spark UI allows us to follow the overall progress, and it can also show specific metrics such as the memory consumption of an executor. In our example the size of the source table is ~200 MB. In practice, it is common to migrate tables containing several terabytes of data. If necessary, and as long as your DynamoDB source supports a higher read throughput level, you can increase the migration throughput by adding more Spark worker nodes. The Spark engine will automatically spread the workload between all the worker nodes. Future Enhancements The ScyllaDB team is continuously improving the Migrator. Some of the upcoming features include: Support for Savepoints with DynamoDB Sources: This will allow users to resume the migration from a specific point in case of interruptions. This is currently supported with Cassandra sources only. Shard-Aware ScyllaDB Driver: The Migrator will fully take advantage of ScyllaDB’s specific optimizations for even faster migrations. Support for SQL-based Sources: For instance, migrate from MySQL to ScyllaDB. Conclusion Thanks to the ScyllaDB Migrator, migrating data to ScyllaDB has never been easier. With its robust architecture, recent enhancements, and active development, the migrator is an indispensable tool for ensuring a smooth and efficient migration process. For more information, check out the ScyllaDB Migrator lesson on ScyllaDB University. Another useful resource is the official ScyllaDB Migrator documentation. Are you using the Migrator? Any specific feature you’d like to see? For any questions about your specific use case or about the Migrator in general, tap into the community knowledge on the ScyllaDB Community Forum.

Inside ScyllaDB’s Continuous Optimizations for Reducing P99 Latency

How the ScyllaDB Engineering team reduced latency spikes during administrative operations through continuous monitoring and rigorous testing In the world of databases, smooth and efficient operation is crucial. However, both ScyllaDB and its predecessor Cassandra have historically encountered challenges with latency spikes during administrative operations such as repair, backup, node addition, decommission, replacement, upgrades, compactions etc.. This blog post shares how the ScyllaDB Engineering team embraced continuous improvement to tackle these challenges head-on. Protecting Performance by Measuring Operational Latency Understanding and improving the performance of a database system like ScyllaDB involves continuous monitoring and rigorous testing. Each week, our team tackles this challenge by measuring performance under three types of workload scenarios: write, read, and mixed (50% read/write). We focus specifically on operational latency: how the system performs during typical and intensive operations like repair, node addition, node termination, decommission or upgrade. Our Measurement Methodology To ensure accurate results, we preload each cluster with data at a 10:1 data-to-memory ratio—equivalent to inserting 650GB on 64GB memory instances. Our benchmarks begin by recording the latency during a steady state to establish a baseline before initiating various cluster operations. We follow a strict sequence during testing: Preload data to simulate real user environments. Baseline latency measurement for a stable reference point. Sequential operational tests involving: Repair operations via Scylla Manager. Addition of three new nodes. Termination and replacement of a node. Decommissioning of three nodes. Latency is our primary metric; if it exceeds 15ms, we immediately start investigating it. We also monitor CPU instructions per operation and track reactor stalls, which are critical for understanding performance bottlenecks. How We Measure Latency Measuring latency effectively requires looking beyond the time it takes for ScyllaDB to process a command. We consider the entire lifecycle of a request: Response time: The time from the moment the query is initiated to when the response is delivered back to the client. Advanced metrics: We utilize High Dynamic Range (HDR) Histograms to capture and analyze latency from each cassandra-stress worker. This ensures we can compute a true representation of latency percentiles rather than relying on simple averages. Results from these tests are meticulously compiled and compared with previous runs. This not only helps us detect any performance degradation but also highlights improvements. It keeps the entire team informed through detailed reports that include operation durations and latency breakdowns for both reads and writes. Better Metrics, Better Performance When we started to verify performance regularly, we mostly focused on the latencies. At that time, reports lacked many details (like HDR results), but were sufficient to identify performance issues. These included high latency when decommissioning a node, or issues with latencies during the steady state. Since then, we have optimized our testing approach to include more – and more detailed – metrics. This enables us to spot emerging performance issues sooner and root out the culprit faster. The improved testing approach has been a valuable tool, providing fast and precise feedback on how well (or not) our product optimization strategies are actually working in practice. Total metrics Our current reports include HDR Histogram details providing a comprehensive overview of system latency throughout the entire test. Number of reactor stalls (which are pauses in processing due to overloaded conditions)  prompts immediate attention and action when they increase significantly. We take a similar approach to kernel callstacks which are logged when kernel space holds locks for too long.   Management repair After populating the cluster with data, we start our test from a full cluster repair using Scylla Manager and measure the latencies:   During this period, the P99 latency was 3.87 ms for writes and 9.41 ms for reads. In comparison, during the “steady state” (when no operations were performed), the latencies were 2.23 ms and 3.87 ms, respectively. Cluster growth After the repair, we add three nodes to the cluster and conduct a similar latency analysis:   Each cycle involves adding one node sequentially. These results provide a clear view of how latency changes and the duration required to expand the cluster. Node Termination and Replacement Following the cluster growth, one node is terminated and replaced with another. Cluster Shrinkage The test concludes with shrinking the cluster back to its initial size by decommissioning random nodes one by one.   These tests and reports are invaluable, uncovering numerous performance issues like increased latencies during decommission, detecting long reactor stalls in row cache update or short but frequent ones in sstable reader paths that lead to crucial fixes, improvements, and insights. This progress is evident in the numbers, where current latencies remain in the single-digit range under various conditions. Looking Ahead Our optimization journey is ongoing. ScyllaDB 6.0 introduced tablets, significantly accelerating cluster resizing to market-leading levels. The introduction of immediate node joining, which can start in parallel with accelerated data streaming, shows significant improvements across all metrics. With these improvements, we start measuring and optimizing not only the latencies during these operations but also the operations durations. Stay tuned for more details about these advancements soon. Our proactive approach to tackling latency issues not only improves our database performance but also exemplifies our commitment to excellence. As we continue to innovate and refine our processes, ScyllaDB remains dedicated to delivering superior database solutions that meet the evolving needs of our users.

ScyllaDB Elasticity: Demo, Theory, and Future Plans

Watch a demo of how ScyllaDB’s Raft and tablets initiatives play out with real operations on a real ScyllaDB cluster — and get a glimpse at what’s next on our roadmap. If you follow ScyllaDB, you’ve likely heard us talking about Raft and tablets initiatives for years now. (If not, read more on tablets from Avi Kivity and Raft from Kostja Osipov) You might have even seen some really cool animations. But how does it play out with real operations on a real ScyllaDB cluster? And what’s next on our roadmap – particularly in terms of user impacts? ScyllaDB Co-Founder Dor Laor and Technical Director Felipe Mendes recently got together to answer those questions. In case you missed it or want a recap of the action-packed and information-rich session, here’s the complete recording:   In case you want to skip to a specific section, here’s a breakdown of what they covered when: 4:45 ScyllaDB already scaled linearly 8:11 Tablets + Raft = elasticity, speed, simplicity, TCO 11:45 Demo time! 30:23 Looking under the hood 46:19: Looking ahead And in case you prefer to read vs watch, here are some key points… Double Double Demo After Dor shared why ScyllaDB adopted a new dynamic “tablets-based” data replication architecture for faster scaling, he passed the mic over to Felipe to show it in action. Felipe covers: Parallel scaling operations (adding and removing nodes) – speed and impact on latency How new nodes can start servicing increased demand almost instantly Dynamic load balancing based on node capacity, including automated resharding for new/different instance types The demo starts with the following initial setup: 3-node cluster running on AWS i4i.xlarge Each node processing ~17,000 operations/second System load at ~50% Here’s a quick play-by-play… Scale out: Bootstrapped 3 additional i4i.large nodes in parallel New nodes start serving traffic once the first tablets arrive, before the entire data set is received. Tablets migration complete in ~3 minutes Writes are at sub-millisecond latencies; so are read latencies once the cache warms up (in the meantime, reads go to warmed up nodes, thanks to heat-weighted load balancing) Scale up: Added 3 nodes of a larger instance size (i4i.2xlarge, with double the capacity of the original nodes) and increased the client load The larger nodes receive more tablets and service almost twice the traffic than the smaller replicas (as appropriate for their higher capacity) The expanded cluster handles over 100,000 operations/second with the potential to handle 200,000-300,000 operations/second Downscale: A total of 6 nodes were decommissioned in parallel As part of the decommission process, tablets were migrated to other replicas Only 8 minutes were required to fully decommission 6 replicas while serving traffic A Special Raft for the ScyllaDB Sea Monster Starting with the ScyllaDB 6.0 release, topology metadata is managed by the Raft protocol. The process of adding, removing, and replacing nodes is fully linearized. This contributes to parallel operations, simplicity, and correctness. Read barriers and fencing are two interesting aspects of our Raft implementation. Basically, if a node doesn’t know the most recent topology, it’s barred from responding to related queries. This prevents, for example, a node from observing an incorrect topology state in the cluster – which could result in data loss. It also prevents a situation where a removed node or an external node using the same cluster name could silently come back or join the cluster simply by gossiping with another replica. Another difference: Schema versions are now linearized, and use a TimeUUID to indicate the most up-to-date schema. Linearizing schema updates not only makes the operation safer; it also considerably improves performance. Previously, a schema change could take a while to propagate via gossip – especially in large cluster deployments. Now, this is gone. TimeUUIDs provide an additional safety net. Since schema versions now contain a time-based component, ScyllaDB can ensure schema versioning, which helps with: Improved visibility on conditions triggering a schema change on logs Accurately restoring a cluster backup Rejecting out-of-order schema updates Tablets relieve operational pains The latest changes simplify ScyllaDB operations in several ways: You don’t need to perform operations one by one and wait in between them; you can just initiate the operation to add or remove all the nodes you need, all at once You no longer need to cleanup after you scale the cluster Resharding (the process of changing the shard count of an existing node) is simple. Since tablets are already split on a per-shard boundary, resharding simply updates the shard ownership Managing the system_auth keyspace (for authentication) is no longer needed. All auth-related data is now automatically replicated to every node in the cluster Soon, repairs will also be automated   Expect less: typeless, sizeless, limitless ScyllaDB’s path forward from here certainly involves less: typeless, sizeless, limitless. You could be typeless. You won’t have to think about instance types ahead of time. Do you need a storage-intensive instance like the i3ens, or a throughput-intensive instance like the i4is? It no longer matters, and you can easily transition or even mix among these. You could be sizeless. That means you won’t have to worry about capacity planning when you start off. Start small and evolve from there. You could also be limitless. You could start off anticipating a high throughput and then reduce it, or you could commit to a base and add on-demand usage if you exceed it.

Use Your Data in LLMs With the Vector Database You Already Have: The New Stack

Open source vector databases are among the top options out there for AI development, including some you may already be familiar with or even have on hand.

Vector databases allow you to enhance your LLM models with data from your internal data stores. Prompting the LLM with local, factual knowledge can allow you to get responses tailored to what your organization already knows about the situation. This reduces “AI hallucination” and improves relevance.

You can even ask the LLM to add references to the original data it used in its answer so you can check yourself. No doubt vendors have reached out with proprietary vector database solutions, advertised as a “magic wand” enabling you to assuage any AI hallucination concerns.

But, ready for some good news?

If you’re already using Apache Cassandra 5.0OpenSearch or PostgreSQL, your vector database success is already primed. That’s right: There’s no need for costly proprietary vector database offerings. If you’re not (yet) using these free and fully open source database technologies, your generative AI aspirations are a good time to migrate — they are all enterprise-ready and avoid the pitfalls of proprietary systems.

For many enterprises, these open source vector databases are the most direct route to implementing LLMs — and possibly leveraging retrieval augmented generation (RAG) — that deliver tailored and factual AI experiences.

Vector databases store embedding vectors, which are lists of numbers representing spatial coordinates corresponding to pieces of data. Related data will have closer coordinates, allowing LLMs to make sense of complex and unstructured datasets for features such as generative AI responses and search capabilities.

RAG, a process skyrocketing in popularity, involves using a vector database to translate the words in an enterprise’s documents into embeddings to provide highly efficient and accurate querying of that documentation via LLMs.

Let’s look closer at what each open source technology brings to the vector database discussion:

Apache Cassandra 5.0 Offers Native Vector Indexing

With its latest version (currently in preview), Apache Cassandra has added to its reputation as an especially highly available and scalable open source database by including everything that enterprises developing AI applications require.

Cassandra 5.0 adds native vector indexing and vector search, as well as a new vector data type for embedding vector storage and retrieval. The new version has also added specific Cassandra Query Language (CQL) functions that enable enterprises to easily use Cassandra as a vector database. These additions make Cassandra 5.0 a smart open source choice for supporting AI workloads and executing enterprise strategies around managing intelligent data.

OpenSearch Provides a Combination of Benefits

Like Cassandra, OpenSearch is another highly popular open source solution, one that many folks on the lookout for a vector database happen to already be using. OpenSearch offers a one-stop shop for search, analytics and vector database capabilities, while also providing exceptional nearest-neighbor search capabilities that support vector, lexical, and hybrid search and analytics.

With OpenSearch, teams can put the pedal down on developing AI applications, counting on the database to deliver the stability, high availability and minimal latency it’s known for, along with the scalability to account for vectors into the tens of billions. Whether developing a recommendation engine, generative AI agent or any other solution where the accuracy of results is crucial, those using OpenSearch to leverage vector embeddings and stamp out hallucinations won’t be disappointed.

The pgvector Extension Makes Postgres a Powerful Vector Store

Enterprises are no strangers to Postgres, which ranks among the most used databases in the world. Given that the database only needs the pgvector extension to become a particularly performant vector database, countless organizations are just a simple deployment away from harnessing an ideal infrastructure for handling their intelligent data.

pgvector is especially well-suited to provide exact nearest-neighbor search, approximate nearest-neighbor search and distance-based embedding search, and at using cosine distance (as recommended by OpenAI), L2 distance and inner product to recognize semantic similarities. Efficiency with those capabilities makes pgvector a powerful and proven open source option for training accurate LLMs and RAG implementations, while positioning teams to deliver trustworthy AI applications they can be proud of.

Was the Answer to Your AI Challenges in Front of You All Along?

The solution to tailored LLM responses isn’t investing in some expensive proprietary vector database and then trying to dodge the very real risks of vendor lock-in or a bad fit. At least it doesn’t have to be. Recognizing that available open source vector databases are among the top options out there for AI development — including some you may already be familiar with or even have on hand — should be a very welcome revelation.

The post Use Your Data in LLMs With the Vector Database You Already Have: The New Stack appeared first on Instaclustr.

Who Did What to That and When? Exploring the User Actions Feature

NetApp recently released the user actions feature on the Instaclustr Managed Platform, allowing customers to search for user actions recorded against their accounts and organizations. We record over 100 different types of actions, with detailed descriptions of what was done, by whom, to what, and at what time. 

This provides customers with visibility into the actions users are performing on their linked accounts. NetApp has always collected this information in line with our security and compliance policies, but now, all important changes to your managed cluster resources have self-service access from the Console and the APIs.

In the past, this information was accessible only through support tickets when important questions such as “Who deleted my cluster?” and “When was the firewall rule removed from my cluster?” needed answers. This feature adds more self-discoverability of what your users are doing and what our support staff are doing to keep your clusters healthy. 

This blog post provides a detailed walkthrough of this new feature at a moderate level of technical detail, with the hope of encouraging you to explore and better find the actions you are looking for. 

For this blog, I’ve created two Apache Cassandra® clusters in one account and performed some actions on each. I’ve also created an organization linked to this account and performed some actions on that. This will allow a full example UI to be shown and demonstrate the type of “stories” that can emerge from typical operations via user actions. 

Introducing Global Directory 

During development, we decided to consolidate the other global account pages into a new centralized location, which we are calling the “Directory”.  

This Directory provides you with the consolidated view of all organizations and accounts that you have access to, collecting global searches and account functions into a view that does not have a “selected cluster” context (i.e., global).  For more information on how Organizations, Accounts and Clusters relate to each other, check out this blog.

Organizations serve as an efficient method to consolidate all associated accounts into a single unified, easily accessible location. They introduce an extra layer to the permission model, facilitating the management and sharing of information such as contact and billing details. They also streamline the process of Single Sign-On (SSO) and account creation. 

Let’s log in and click on the new button: 

This will take us to the new directory landing page: 

Here, you will find two types of global searches: accounts and user actions, as well as account creation. Selecting the new “User Actions” item will take us to the new page. You can also navigate to these directory pages directly from the top right ‘folder’ menu:

User Action Search Page: Walkthrough 

This is the new page we land on if we choose to search for user actions: 

When you first enter, it finds the last page of actions that happened in the accounts and organizations you have access to. It will show both organization and account actions on a single consolidated page, even though they are slightly different in nature. 

*Note: The accessible accounts and organisations are defined as those you are linked to as

CLUSTER_ADMIN

or

OWNER

*TIP: If you don’t want an account user to see user actions, give the

READ_ONLY

access. 

You may notice a brief progress bar display as the actions are retrieved. At the time of writing, we have recorded nearly 100 million actions made by our customers over a 6-month period.  

From here, you can increase the number of actions shown on each page and page through the results. Sorting is not currently supported on the actions table, but it is something we will be looking to add in the future. For each action found, the table will display: 

  • Action: What happened to your account (or organization)? There are over 100 tracked kinds of actions recorded. 
  • Domain: The specific account or organization name of the action targeted. 
  • Description: An expanded description of what happened, using context captured at the time of action. Important values are highlighted between square brackets, and the copy button will copy the first one into the clipboard. 
  • User: The user who performed the action, typically using the console/ APIs or Terraform provider, but it can also be triggered by “Instaclustr Support” using our admin tools.
    • For those actions marked with user Instaclustr Support”, please reach out to support for more information about those actions we’ve taken on your behalf or visit https://support.instaclustr.com/hc/en-us. 
  • Local time: The action time from your local web browser’s perspective. 

Additionally, for those who prefer programmatic access, the user action feature is fully accessible via our APIs, allowing for automation and integration into your existing workflows. Please visit our API documentation page here for more details.  

Basic (super-search) Mode 

Let’s say we only care about the LeagueOfNations organization domain; we can type ‘League’ and then click Search: 

The name patterns are simple partial string patterns we look for as being ’contained’ within the name, such as ”Car in ”Carlton”. These are case insensitive. They are not (yet!) general regular expressions.

Advanced “find a needle” Search Mode 

Sometimes, searching by names is not precise enough; you may want to provide more detailed search criteria, such as time ranges or narrowing down to specific clusters or kinds of actions. Expanding the “Advanced Search” section will switch the page to a more advanced search criteria form, disabling the basic search area and its criteria. 

Lets say we only want to see the “Link Account” actions over the last week: 

We select it from the actions multi-chip selector using the cursor (we could also type it and allow autocomplete to kick in). Hitting search will give you your needle time to go chase that Carl guy down and ask why he linked that darn account: 

The available criteria fields are as follows (additive in nature): 

  • Action: the kinds of actions, with a bracketed count of their frequency over the current criteria; if empty, all are included. 
  • Account: The account name of interest OR its UUID can be useful to narrow the matches to only a specific account. It’s also useful when user, organization, and account names share string patterns, which makes the super-search less precise. 
  • Organization: the organization name of interest or its UUID. 
  • User: the user who performed the action. 
  • Description: matches against the value of an expanded description variable. This is useful because most actions mention the ‘target’ of the action, such as cluster-id, in the expanded description. 
  • Starting At: match actions starting from this time cannot be older than 12 months ago. 
  • Ending At: match actions up until this time. 

Bonus Feature: Cluster Actions 

While it’s nice to have this new search page, we wanted to build a higher-order question on top of it: What has happened to my cluster?  

The answer can be found on the details tab of each cluster. When clicked on, it will take you directly to the user actions page with appropriate criteria to answer the question. 

* TIP: we currently support entry into this view with a

descriptionFormat queryParam

allowing you to save bookmarks to particular action ‘targets’. Further

queryParams

may be supported in the future for the remaining criteria: https://console2.instaclustr.com/global/searches/user-action?descriptionContextPattern=acde7535-3288-48fa-be64-0f7afe4641b3

Clicking this provides you the answer:

Future Thoughts 

There are some future capabilities we will look to add, including the ability to subscribe to webhooks that trigger on some criteria. We would also like to add the ability to generate reports against a criterion or to run such things regularly and send them via email. Let us know what other feature improvements you would like to see! 

Conclusion 

This new capability allows customers to search for user actions directly without contacting support. It also provides improved visibility and auditing of what’s been changing on their clusters and who’s been making those changes. We hope you found this interesting and welcome any feedback for “higher-order” types of searches you’d like to see built on top of this new feature. What kind of common questions about user actions can you think of? 

If you have any questions about this feature, please contact Instaclustr Support at any time If you are not a current Instaclustr customer and you’re interested to learn more, register for a free trial and spin up your first cluster for free!

 

The post Who Did What to That and When? Exploring the User Actions Feature appeared first on Instaclustr.

Powering AI Workloads with Intelligent Data Infrastructure and Open Source

In the rapidly evolving technological landscape, artificial intelligence (AI) is emerging as a driving force behind innovation and efficiency. However, to harness its full potential, enterprises need suitable data infrastructures that can support AI workloads effectively. 

This blog explores how intelligent data infrastructure, combined with open source technologies, is revolutionizing AI applications across various business functions. It outlines the benefits of leveraging existing infrastructure and highlights key open source databases that are indispensable for powering AI. 

The Power of Open Source in AI Solutions 

Open source technologies have long been celebrated for their flexibility, community support, and cost-efficiency. In the realm of AI these advantages are magnified. Here’s why open source is indispensable for AI-fueled solutions: 

  1. Cost Efficiency: Open source solutions eliminate licensing fees, making them an attractive option for businesses looking to optimize their budgets.
  2. Community Support: A vibrant community of developers constantly improves these platforms, ensuring they remain cutting-edge.
  3. Flexibility and Customization: Open source tools can be tailored to meet specific needs, allowing enterprises to build solutions that align perfectly with their goals. 
  4. Transparency and Security: With open source, you have visibility into the code, which allows for better security audits and trustworthiness. 

Vector Databases: A Key Component for AI Workloads 

Vector databases are increasingly indispensable for AI workloads. They store data in high-dimensional vectors, which AI models use to understand patterns and relationships. This capability is crucial for applications involving natural language processing, image recognition, and recommendation systems. 

Vector databases use embedding vectors (lists of numbers) to represent data similarities and plot relationships spatially. For example, “plant” and “shrub” will have closer vector coordinates than “plant” and “car”. This allows enterprises to build their own LLMs, explore large text datasets, and enhance search capabilities. 

Vector databases and embeddings also support retrieval augmented generation (RAG), which improves LLM accuracy by refining its understanding of new information. For example, RAG can let users query documentation by creating embeddings from an enterprise’s documents, translating words into vectors, finding similar words in the documentation, and retrieving relevant information. This data is then provided to an LLM, enabling it to generate accurate text answers for users. 

The Role of Vector Databases in AI: 

  1. Efficient Data Handling: Vector databases excel at handling large volumes of data efficiently, which is essential for training and deploying AI models. 
  2. High Performance: They offer high-speed retrieval and processing of complex data types, ensuring AI applications run smoothly. 
  3. Scalability: With the ability to scale horizontally, vector databases can grow alongside your AI initiatives without compromising performance. 

Leveraging Existing Infrastructure for AI Workloads 

Contrary to popular belief, it isn’t necessary to invest in new and exotic specialized data layer solutions. Your existing infrastructure can often support AI workloads with a few strategic enhancements: 

  1. Evaluate Current Capabilities: Start by assessing your current data infrastructure to identify any gaps or areas for improvement. 
  2. Upgrade Where Necessary: Consider upgrading components such as storage, network speed, and computing power to meet the demands of AI workloads. 
  3. Integrate with AI Tools: Ensure your infrastructure is compatible with leading AI tools and platforms to facilitate seamless integration. 

Open Source Databases for Enterprise AI 

Several open source databases are particularly well-suited for enterprise AI applications. Let‘s look at the 3 free open source databases that enterprise teams can leverage as they scale their intelligent data infrastructure for storing those embedding vectors: 

PostgreSQL® and pgvector 

“The world’s most advanced open source relational database, PostgreSQL is also one of the most widely deployed, meaning that most enterprises will already have a strong foothold in the technology. The pgvector extension turns Postgres into a high-performance vector store, offering a path of least resistance for organizations familiar with PostgreSQL to quickly stand-up intelligent data infrastructure. 

From a RAG and LLM training perspective, pgvector excels at enabling distance-based embedding search, exact nearest neighbor search, and approximate nearest neighbor search. pgvector efficiently captures semantic similarities using L2 distance, inner product, and (the OpenAI-recommended) cosine distance. Teams can also harness OpenAI’s embeddings model (available as an API) to calculate embeddings for documentation and user queries. As an enterprise-ready open source option, pgvector is an already-proven solution for achieving efficient, accurate, and performant LLMs, helping equip teams to confidently launch differentiated and AI-fueled applications into production.

OpenSearch® 

Because OpenSearch is a mature search and analytics engine already popular with a wide swath of enterprises, new and current users will be glad to know that the open source solution is ready to up the pace of AI application development as a singular search, analytics, and vector database.  

OpenSearch has long offered low latency, high availability, and the scale to handle tens of billions of vectors while backing stable applications. It provides great nearest-neighbor search functionality to support vector, lexical, and hybrid search and analytics. These capabilities significantly simplify the implementation of AI solutions, from generative AI  agents to recommendation engines with trustworthy results and minimal hallucinations. 

Apache Cassandra® 5.0 with Native Vector Indexing

Known for its linear scalability and fault-tolerance on commodity hardware or cloud infrastructure, Apache Cassandra is a reliable choice for enterprise-grade AI applications. The newest version of the highly popular open source Apache Cassandra database introduces several new features built for AI workloads. It now includes Vector Search and Native Vector indexing capabilities.

Additionally, there is a new vector data type specifically for saving and retrieving embedding vectors, and new CQL functions for easily executing on those capabilities. By adding these features, Apache Cassandra 5.0 has emerged as an especially ideal database for intelligent data strategies and for enterprises rapidly building out AI applications across myriad use cases.

Cassandra’s earned reputation for delivering high availability and scalability now adds AI-specific functionality, making it one of the most enticing open source options for enterprises. 

Open Source Opens the Door to Successful AI Workloads 

Clearly, given the tremendously rapid pace at which AI technology is advancing, enterprises cannot afford to wait to build out differentiated AI applications. But in this pursuit, engaging with the wrong proprietary data-layer solutionsand suffering the pitfalls of vendor lock-in or simply mismatched featurescan easily be (and, for some, already is) a fatal setback. Instead, tapping into one of the very capable open source vector databases available will allow enterprises to put themselves in a more advantageous position. 

When leveraging open source databases for AI workloads, consider the following: 

  • Data Security: Ensure robust security measures are in place to protect sensitive data. 
  • Scalability: Plan for future growth by choosing solutions that can scale with your needs. 
  • Resource Allocation: Allocate sufficient resources, such as computing power and storage, to support AI applications. 
  • Governance and Compliance: Adhere to governance and compliance standards to ensure responsible use of AI. 

Conclusion 

Intelligent data infrastructure and open source technologies are revolutionizing the way enterprises approach AI workloads. By leveraging existing infrastructure and integrating powerful open source databases, organizations can unlock the full potential of AI, driving innovation and efficiency. 

Ready to take your AI initiatives to the next level? Leverage a single platform to help you design, deploy and monitor the infrastructure to support the capabilities of PostgreSQL with pgvector, OpenSearch, and Apache Cassandra 5.0 today.

And for more insights and expert guidance, don’t hesitate to contact us and speak with one of our open source experts! 

The post Powering AI Workloads with Intelligent Data Infrastructure and Open Source appeared first on Instaclustr.

How Does Data Modeling Change in Apache Cassandra® 5.0 With Storage-Attached Indexes?

Data modeling in Apache Cassandra® is an important topic—how you model your data can affect your cluster’s performance, costs, etc. Today I’ll be looking at a new feature in Cassandra 5.0 called Storage -Attached Indexes (SAI), and how they affect the way you model data in Cassandra databases. 

First, I’ll briefly cover what SAIs are (for more information about SAIs, check out this post). Then I’ll look at 3 use cases where your modeling strategy could change with SAI. Finally, I’ll talk about benefits and constraints of SAIs. and constraints of SAIs. 

What Are StorageAttached Indexes? 

From the Cassandra 5.0 Documentation, Storage Attached Indexes (SAIs) “[provide] an indexing mechanism that is closely integrated with the Cassandra storage engine to make data modeling easier for users. Secondary Indexing, which is indexing values on properties that are not part of the Primary Key for that table, has been available for Cassandra in the past (called SASI and 2i). However, SAIs will replace the existing functionality, as it will be deprecated in 5.0, and then tentatively removed in Cassandra 6.0. 

This is because SAIs improve upon the older methods in a lot of key ways. For one, according to the devs, SAIs are the fastest indexing method for Cassandra clusters. This performance boost was a plus for using indexing in production environments. It also lowered the data storage overhead over prior implementations, which lowers costs by reducing the need for database storage, which induces operational costs, and by reducing latency when dealing with indexes, lowering a loss of user interaction due to high latency. 

How Do SAIs work? 

SAIs are implemented as part of the SSTables, or Sorted String Tables, of a Cassandra database. This is because SAIs index Memtables and SSTables as they are written. It filters from both in-memory and on-disk sources, filtering them out into a series of indexed columns at read time. I’m not going to go into too much detail here because there are a lot of existing resources on this exciting topic: see the Cassandra 5.0 Documentation and the Instaclustr site for examples. 

The main thing to keep in mind is that SAIs are attached to Cassandra’s storage engine, and it’s much more performant from speed, scalability, and data storage angles as a result. This means that you can use indexing reliably in production beginning with Cassandra 5.0, which allows data modeling to be improved very quickly. 

To learn more about how SAIs work, check out this piece from the Apache Cassandra blog. 

What Is SAI For? 

SAI is a filtering engine, and while it does have some functionality overlap with search engines, it directly says it is not an enterprise search engine” (source). 

SAI is meant for creating filters on non-primary-key or composite partition keys (source), essentially meaning that you can enable a ‘WHERE’ clause on any column in your Cassandra 5.0 database. This makes queries a lot more flexible without sacrificing latency or storage space as with prior methods.  

How Can We Use SAI When Data Modeling in Cassandra 5.0? 

Because of the increased scalability and performance of SAIs, data modeling in Cassandra 5.0 will most definitely change 

You will be able to search collections more thoroughly and easily, for instance, indexing is more of an option when designing your Cassandra queries. This will also allow new query types, which can improve your existing querieswhich by Cassandra’s design paradigm changes your table design. 

But what if you’re not on a greenfield project and want to use SAIs? No problem! SAI is backwards-compatible, and you can migrate your application one index at a time if you need. 

How Do StorageAttached Indexes Affect Data Modeling in Apache Cassandra 5.0? 

Cassandra’s SAI was designed with data modeling in mind (source). It unlocks new query patterns that make data modeling easier in quite a few cases. In the Cassandra team’s words: “You can create a table that is most natural for you, write to just that table, and query it any way you want.” (source) 

I think another great way to look at how SAIs affect data modeling is by looking at some queries that could be asked of SAI data. This is because Cassandra data modeling relies heavily on the queries that will be used to retrieve the data. I’ll take a look at 2 use cases: indexing as a means of searching a collection in a row and indexing to manage a one-to-many relationship. 

Use Case: Querying on Values of Non-Primary-Key Columns 

You may find you’re searching for records with a particular value in a particular column often in a table. An example may be a search form for a large email inbox with lots of filters. You could find yourself looking at a record like: 

  • Subject 
  • Sender 
  • Receiver 
  • Body 
  • Time sent 

Your table creation may look like: 

CREATE KEYSPACE IF NOT EXISTS inbox 

WITH REPLICATION = { 

  'class' : 'SimpleStrategy', 

  'replication_factor' : 3 

}; 

CREATE TABLE IF NOT EXISTS emails ( 

  id int,  

  sender text,  

  receivers text,  

  subject text,  

  body text, 

  timeSent timestamp, 

  PRIMARY KEY (id)); 

};

If you allow users to search for a particular subject or sender, and the data set is large, not having SAIs could make query times painful: 

SELECT * FROM emails WHERE emails.sender == “sam.example@example.com”

To fix this problem, we can create secondary indexes on our sender, receiver, and body fields: 

CREATE CUSTOM INDEX sender_sai_idx ON Inbox.emails (sender)  

USING 'StorageAttachedIndex'  

WITH OPTIONS = {'case_sensitive': 'false', 'normalize': 'true', 'ascii': 'true'}; 

CREATE INDEX IF NOT EXISTS receiver_sai_idx on Inbox.emails (receiver) 

USING 'StorageAttachedIndex'  

WITH OPTIONS = {'case_sensitive': 'false', 'normalize': 'true', 'ascii': 'true'}; 

CREATE CUSTOM INDEX body_sai_idx ON Inbox.emails (body)  

USING 'StorageAttachedIndex'  

WITH OPTIONS = {'case_sensitive': 'false', 'normalize': 'true', 'ascii': 'true'}; 

CREATE CUSTOM INDEX subject_sai_idx ON Inbox.emails (subject)  

USING 'StorageAttachedIndex'  

WITH OPTIONS = {'case_sensitive': 'false', 'normalize': 'true', 'ascii': 'true'};

Once you’ve established the indexes, you can run the same query and it will automatically use the SAI index to find all emails with a sender of “sam.example@examplemail.com OR by subject match/body match.  Note that although the data model changed with the inclusion of the indexes, the SELECT query does not change, and the fields of the table stayed the same as well! 

Use Case: Managing One-To-Many Relationships 

Going back to the previous example, one email could have many recipients. Prior to secondary indexes, you would need to scan every row in the collection of every row in the table in order to query on recipients. This could be solved in a few ways. One is to create a join table for recipients that contains an id, email id, and recipient. This becomes complicated when the constraint that each email should only appear once per email is added. With SAI, we now have an index-based solutioncreate an index on a collection of recipients for each row. 

The script to create the table and indices changes a bit: 

id int,  

  sender text,  

  receivers set<text>,  

  subject text,  

  body text, 

  timeSent timestamp, 

  PRIMARY KEY (id)); 

};

The text type of receivers changes to a set<text>. A set is used because each email should only occur once. This takes the logic you would have had to implement for the join table solution and moves it to Cassandra.  

The indexing code remains mostly the same, except for the creation of the index for receivers: 

CREATE INDEX IF NOT EXISTS receivers_sai_idx on Inbox.emails (receivers)

That’s it! One line of CQL and there’s now an index on receivers. We can query for emails with a particular receiver: 

SELECT * FROM emails WHERE emails.receievers CONTAINS “sam.example@examplemail.com”

There are many one-to-many relationships that can be simplified in Cassandra with the use of secondary indexes and SAI. 

What Are the Benefits of Data Modeling With Storage Attached Indexes? 

There are many benefits to using SAI when data modeling in Cassandra 5.0: 

  • Query performance: because of SAI’s implementation, it has much faster query speeds than previous implementations, and indexed data is generally faster to search than unindexed data. This means you have more flexibility to search within your data and write queries that search non-primary-key columns and collections. 
  • Move over piecemeal: SAI’s backwards compatibility, coupled with how little your table structure has to change to add SAIs, means you can move over your data models piece by piece, meaning moving is easier.  
  • Data storage overhead: SAI has much lower data overhead than previous secondary index implementations, meaning more flexibility in what you can store in your data models without impacting overall storage needs. 
  • More complex queries/features: SAI allows you to write much more thorough queries when looking through SAIs, and offers up a lot of new functionality, like: 
    • Vector Search 
    • Numeric Range queries 
    • AND queries within indexes 
    • Support for map/set/ 

What Are the Constraints of StorageAttached Indexes? 

While there are benefits to SAI, there are also a few constraints, including: 

  • Because SAI is attached to the SSTable mechanism, the performance of queries on indexed columns will be “highly dependent on the compaction strategy in use” (per the Cassandra 5.0 CEP-7) 
  • SAI is not designed for unlimited-size data sets, such as logs; indexing on a dataset like that would cause performance issues. The reason for this is read latency at higher row counts spread across a cluster. It is also related to consistency level (CL), as the higher the CL is, the more nodes you’ll have to ping on larger datasets. (Source). 
  • Query complexity: while you can query as many indexes as you like, when you do so, you incur a cost related to the number of index values processed. Be aware when designing queries to select from as few indexes as necessary. 
  • You cannot index multiple columns in one index, as there is a 1-to-1 mapping of an SAI index to a column. You can however create separate indexes and query them simultaneously. 

This is a v1, and some features, like the LIKE comparison for strings, the OR operator, and global sorting are all slated for v2. 

Disk usage: SAI uses an extra 20-35% disk space over unindexed data; note that over previous versions of indexing, it consumes much less (source). You shouldn’t just make every column an index if you don’t need to, saving disk space and maintaining query performance. 

Conclusion 

SAI is a very robust solution for secondary indexes, and their addition to Cassandra 5.0 opens the door for several new data modelling strategiesfrom searching non-primary-key columns, to managing one-to-many relationships, to vector search. To learn more about SAI, read this post from the Instaclustr by NetApp blog, or check out the documentation for Cassandra 5.0. 

If you’d like to test SAI without setting up and configuring Cassandra yourself, Instaclustr has a free trial and you can spin up Cassandra 5.0 clusters today through a public preview! Instaclustr also offers a bunch of educational content about Cassandra 5.0. 

 

The post How Does Data Modeling Change in Apache Cassandra® 5.0 With Storage-Attached Indexes? appeared first on Instaclustr.

Cassandra Lucene Index: Update

**An important update regarding support of Cassandra Lucene Index for Apache Cassandra® 5.0 and the retirement of Apache Lucene Add-On on the Instaclustr Managed Platform.** 

Instaclustr by NetApp has been maintaining the new fork of the Cassandra Lucene Index plug-in since its announcement in 2018After extensive evaluation, we have decided not to upgrade the Cassandra Lucene Index to support Apache Cassandra® 5.0. This decision aligns with the evolving needs of the Cassandra community and the capabilities offered by the StorageAttached Indexing (SAI) in Cassandra 5.0.  

SAI introduces significant improvements in secondary indexing, while simplifying data modeling and creating new use cases in Cassandra, such as Vector Search. While SAI is not a direct replacement for the Cassandra Lucene Index, it offers a more efficient alternative for many indexing needs.  

For applications requiring advanced indexing features, such as full-text search or geospatial queries, users can consider external integrations, such as OpenSearch®, that offer numerous full-text search and advanced analysis features. 

We are committed to maintaining the Cassandra Lucene Index for currently supported and newer versions of Apache Cassandra 4 (including minor and patch-level versions) for users who rely on its advanced search capabilities. We will continue to release bug fixes and provide necessary security patches for the supported versions in the public repository. 

Retiring Apache Lucene™ Add-On for Instaclustr for Apache Cassandra 

Similarly, Instaclustr is commencing the retirement process of the Apache Lucene add-on on its Instaclustr Managed Platform. The offering will move to the Closed state on July 31, 2024. This means that the add-on will no longer be available for new customers.  

However, it will continue to be fully supported for existing customers with no restrictions on SLAs, and new deployments will be permitted by exception. Existing customers should be aware that the add-on will not be supported for Cassandra 5.0. For more details about our lifecycle policies, please visit our website here.  

Instaclustr will work with the existing customers to ensure a smooth transition during this period. Support and documentation will remain in place for our customers running the Lucene addon on their clusters.  

For those transitioning to, or already using the Cassandra 5.0 beta version, we recommend exploring how Storage-Attached Indexing can help you with your indexing needs. You can try the SAI feature as part of the free trial on the Instaclustr Managed Platform.  

We thank you for your understanding and support as we continue to adapt and respond to the community’s needs. 

If you have any questions about this announcement, please contact us at support@instaclustr.com. 

The post Cassandra Lucene Index: Update appeared first on Instaclustr.

Vector Search in Apache Cassandra® 5.0

Apache Cassandra® is moving towards an AI-driven future. Cassandra’s high availability and ability to scale with large amounts of data have made it the obvious choice for many of these types of applications.  

In the constantly evolving world of data analysis, we have seen significant transformations over time. The capabilities needed now are those which support a new data type (vectors) that has accelerated in popularity with the growing adoption of Generative AI and large language models. But what does this mean? 

Data Analysis Past, Present and Future

Traditionally, data analysis was about looking back and understanding trends and patterns. Big data analytics and predictive ML applications emerged as powerful tools for processing and analyzing vast amounts of data to make informed predictions and optimize decision-making.

Today, real-time analytics has become the norm. Organizations require timely insights to respond swiftly to changing market conditions, customer demands, and emerging trends.

Tomorrow’s world will be shaped by the emergence of generative AI. Generative AI is a transformative approach that goes way beyond traditional analytics. Instead, it leverages machine learning models trained on diverse datasets to produce something entirely new while retaining similar characteristics and patterns learned from the training data. 

By training machine learning models on vast and varied datasets, businesses can unlock the power of generative AI. These models understand the underlying patterns, structures, and meaning in the data, enabling them to generate novel and innovative outputs. Whether it’s generating lifelike images, composing original music, or even creating entirely new virtual worlds, generative AI pushes the boundaries of what is possible. 

Broadly speaking, these generative AI models store the ingested data in numerical representation, known as vectors (we’ll dive deeper later). Vectors capture the essential features and characteristics of the data, allowing the models to understand and generate outputs that align with those learned patterns. 

With vector search capabilities in Apache Cassandra® 5.0 (now in public preview on the Instaclustr Platform), we enable the storage of these vectors and efficient searching and retrieval of them based on their similarity to the query vector. ‘Vector similarity search’ opens a whole new world of possibilities and lies at the core of generative AI.

The ability to create novel and innovative outputs that were never explicitly present in the training data has significant implications across many creative, exploratory, and problem-solving uses. Organizations can now unlock the full potential of their data by performing complex similarity-based searches at scale which is key to supporting AI workloads such as recommendation systems, image, text, or voice matching applications, fraud detection, and much more. 

What Is Vector Search? 

Vectors are essentially just a giant list of numbers across as many or as few dimensions as you want… but really they are just a list (or array) of numbers!  

Embeddings, by comparison, are a numerical representation of something else, like a picture, song, topic, book; they capture the semantic meaning behind each vector and encode those semantic understandings as a vector. 

Take, for example, the words “couch, “sofa, and “chair. All 3 words are individual vectors, and while their semantic relationships are similarthey are pieces of furniture after allbut “couch” and “sofa” are more closely related to each other than to “chair”. 

These semantic relationships are encoded as embeddingsdense numerical vectors that represent the semantic meaning of each word. As such, the embeddings for “couch” and “sofa” will be geometrically closer to each other than to the embedding for “chair.” “Couch”, “sofa, and “chair” will all be closer than, say, to the word “erosion”. 

Take a look at Figure 1 below to see that relationship: 

Figure 1: While “Sofa”, “Couch”, and “Chair” are all related, “Sofa” and “Couch” are semantically closer to each other than “Chair” (i.e., they are both designed for multiple people, while a chair is meant for just one).

“Erosion, on the other hand, shares practically no resemblance to the other vectors, which is why it is geometrically much further away. 

When computers search for things, they typically rely on exact matching, text matching, or open searching/object matching. Vector search, on the other hand, works by using these embeddings to search for semantics as opposed to terms, allowing you to get a much closer match in unstructured datasets based on the meaning of the word. 

Under the hood we can generate embeddings for our search term, and then using some vector math, find other embeddings that are geometrically close to our search term. This will return results that are semantically related to our search. For example, if we search for “sofa, then “couch” followed by “chair” will be returned. 

How Does Vector Search Work With Large Language Models (LLMs)? 

Now that we’ve established an understanding of vectors, embeddings, and vector searching, let’s explore the intersection of vector searching and Large Language Models (LLMs), a trending topic within the world of sophisticated AI models.

With Retrieval Augmented Generation (RAG), we use the vectors to look up related text or data from an external source as part of our query, as part of the query we also retrieve the original human readable text. We then feed the original text alongside the original prompt into the LLM for generated text output, allowing the LLM to use the additional provided context while generating the response.  

(PGVector does a very similar thing with PostgreSQL; check out our blog where we talk all about it). 

By utilizing this approach, we can take known information, query it in a natural language way, and receive relevant responses. Essentially, we input questions, prompts, and data sources into a model and generate informative responses based on that data. 

The combination of Vector Search and LLMs opens up exciting possibilities for more efficient and contextually rich GenAI applications powered by Cassandra. 

So What Does This Mean for Cassandra?

Well, with the additions of CEP-7 (Storage Attached Index, SAI) and CEP-30 (Approximate Nearest Neighbor, ANN Vector Search via SAI + Lucene) to the Cassandra 5.0 release, we can now store vectors and create indexes on them to perform similarity searches. This feature alone broadens the scope of possible Cassandra use cases.  

We utilize ANN (or “Approximate Nearest Neighbor”) as the fast and approximate search as opposed to k-NN (or “k-Nearest Neighbor”), which is a slow and exact algorithm of large, high-dimensional data; speed and performance are the priorities. 

Just about any application could utilize a vector search feature. Thus, in an existing data model, you would simply add a column of ‘vectors’ (Vectors are 32-bit floats, a new CQL data type that also takes a dimension input parameter). You would then create a vector search index to enable similarity searches on columns containing vectors.  

Searching complexity scales linearly with vector dimensions, so it is important to keep in mind vector dimensions and normalization. Ideally, we want normalized vectors (as opposed to vectors of all different sizes) when we look for similarity as it will lead to faster and more ‘correct’ results. With that, we can now filter our data by vector. 

Let’s walk through an example CQLSH session to demonstrate the creation, insertion, querying, and the new syntax that comes along with this feature.

1. Create your keyspace

CREATE KEYSPACE catalog WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3}

2. Create your table or add a column to an existing table with the vector type (ALTER TABLE) 

CREATE TABLE IF NOT EXISTS catalog.products (
 product_id text PRIMARY KEY, 
 product_vector VECTOR<float, 3>  
);

*Vector dimension of 3 chosen for simplicity, the dimensions of your vector will be dependent on the vectorization method*

3. Create your custom index with SAI on the vector column to enable ANN search

CREATE CUSTOM INDEX ann_index_for_similar_products ON products(product_vector) USING ‘StorageAttachedIndex’;

*Your custom index (ann_index_for_similar_products in this case) can be created with your choice of similarity functions: default cosine, dot product, or Euclidean similarity functions. See here* 

4. Load vector data using CQL insert

INSERT INTO catalog.products (product_id, product_vector) VALUES 
(‘SKU1’, [8, 2.3, 58]);
INSERT INTO catalog.products (product_id, product_vector) VALUES 
(‘SKU3’, [1.2, 3.4, 5.6]);
INSERT INTO catalog.products (product_id, product_vector) VALUES 
(‘SKU5’, [23, 18, 3.9]);

5. Query the data to perform a similarity search using the new CQL operator 

SELECT * FROM catalog.products WHERE product_vector ANN OF [3.4, 7.8, 9.1] limit 1;

Result:

product_id  | product_vector 
---+--------------------------------------------------------- 
SKU5  | [23, 18, 3.9]

How Can You Use This?

E-Commerce (for Content Recommendation) 

We can start to think about simple e-commerce use cases like when a customer adds a product or service to their cart. You can now have a vector stored with that product, query the ANN algorithm (aka pull your next nearest neighbor(s)/most similar datapoint(s)), and display to that customer ‘similar’ products in ‘real-time.’  

The above example data model allows you to store and retrieve products based on their unique product_id. You can also perform queries or analysis based on the vector embeddings, such as finding similar products or generating personalized recommendations. The vector column can store the vector representation of the product, which can be generated using techniques like Word2Vec, Doc2Vec, or other vectorization methods.

In a practical application, you might populate this table with actual product data, including the unique identifier, name, category, and the corresponding vector embeddings for each product. The vector embeddings can be generated elsewhere using machine learning models and then inserted into a C* table. 

Content Generation (LLM or Data Augmentation) 

What about the real-time, generative AI applications that many companies (old and new) are focusing on today and in the near future?  

Well, Cassandra can serve the persistence of vectors for such applications. With this storage, we now have persisted data that can add supplemental context to an LLM query. This is the ‘RAG’ referred to earlier in this piece and enables uses like AI chatbots.  

We can think of the ever-popular ChatGPT that tends to hallucinate on anything beyond 2021its last knowledge update. With VSS on Cassandra, it is possible to augment such an LLM with stored vectors.  

Think of the various PDF plugins behind ChatGPT. We can feed the model PDFs, which, at a high level, it then vectorizes and stores in Cassandra (there are more intermediary steps to this) and is then ready to be queried in a natural language way.

Here is a list of just a subset of possible industries/use-cases that can extract the value of a such a feature-add to the Open Source Cassandra project: 

  • Cybersecurity (Anomaly Detection)
    • Ex. Financial Fraud Detection 
  • Language Processing (Language Translation)
  • Digital Marketing (Natural Language Generation)
  • Document Management (Document Clustering)   
  • Data Integration (Entity Matching or Deduplication) 
  • Image Recognition (Image Similarity Search) 
    • Ex. Healthcare (Medical Imaging) 
  • Voice Assistants (Voice and Speech Recognition)
  • Customer Support (AI Chatbots) 
  • Pharmaceutical (Drug Discovery) 

Cassandra 5.0 is in beta release on the Instaclustr Platform today (with GA coming soon)! Stay tuned for Part 2 of this blog series, where I will delve deeper and demonstrate real-world usage of VSS in Cassandra 5.0.

 

The post Vector Search in Apache Cassandra® 5.0 appeared first on Instaclustr.

Medusa 0.16 was released

The k8ssandra team is happy to announce the release of Medusa for Apache Cassandra™ v0.16. This is a special release as we did a major overhaul of the storage classes implementation. We now have less code (and less dependencies) while providing much faster and resilient storage communications.

Back to ~basics~ official SDKs

Medusa has been an open source project for about 4 years now, and a private one for a few more. Over such a long time, other software it depends upon (or doesn’t) evolves as well. More specifically, the SDKs of the major cloud providers evolved a lot. We decided to check if we could replace a lot of custom code doing asynchronous parallelisation and calls to the cloud storage CLI utilities with the official SDKs.

Our storage classes so far relied on two different ways of interacting with the object storage backends:

  • Apache Libcloud, which provided a Python API for abstracting ourselves from the different protocols. It was convenient and fast for uploading a lot of small files, but very inefficient for large transfers.
  • Specific cloud vendors CLIs, which were much more efficient with large file transfers, but invoked through subprocesses. This created an overhead that made them inefficient for small file transfers. Relying on subprocesses also created a much more brittle implementation which led the community to create a lot of issues we’ve been struggling to fix.

To cut a long story short, we did it!

  • We started by looking at S3, where we went for the official boto3. As it turns out, boto3 does all the chunking, throttling, retries and parallelisation for us. Yay!
  • Next we looked at GCP. Here we went with TalkIQ’s gcloud-aio-storage. It works very well for everything, including the large files. The only thing missing is the throughput throttling.
  • Finally, we used Azure’s official SDK to cover Azure compatibility. Sadly, this still works without throttling as well.

Right after finishing these replacements, we spotted the following improvements:

  • The integration tests duration against the storage backends dropped from ~45 min to ~15 min.
    • This means Medusa became far more efficient.
    • There is now much less time spent managing storage interaction thanks to it being asynchronous to the core.
  • The Medusa uncompressed image size we bundle into k8ssandra dropped from ~2GB to ~600MB and its build time went from 2 hours to about 15 minutes.
    • Aside from giving us much faster feedback loops when working on k8ssandra, this should help k8ssandra itself move a little bit faster.
  • The file transfers are now much faster.
    • We observed up to several hundreds of MB/s per node when moving data from a VM to blob storage within the same provider. The available network speed is the limit now.
    • We are also aware that consuming the whole network throughput is not great. That’s why we now have proper throttling for S3 and are working on a solution for this in other backends too.

The only compromise we had to make was to drop Python 3.6 support. This is because the Pythons asyncio features only come in Python 3.7.

The other good stuff

Even though we are the happiest about the storage backends, there is a number of changes that should not go without mention:

  • We fixed a bug with hierarchical storage containers in Azure. This flavor of blob storage works more like a regular file system, meaning it has a concept of directories. None of the other backends do this (including the vanilla Azure ones), and Medusa was not dealing gracefully with this.
  • We are now able to build Medusa images for multiple architectures, including the arm64 one.
  • Medusa can now purge backups of nodes that have been decommissioned, meaning they are no longer present in the most recent backups. Use the new medusa purge-decommissioned command to trigger such a purge.

Upgrade now

We encourage all Medusa users to upgrade to version 0.16 to benefit from all these storage improvements, making it much faster and reliable.

Medusa v0.16 is the default version in the newly released k8ssandra-operator v1.9.0, and it can be used with previous releases by setting the .spec.medusa.containerImage.tag field in your K8ssandraCluster manifests.

Building a 100% ScyllaDB Shard-Aware Application Using Rust

Building a 100% ScyllaDB Shard-Aware Application Using Rust

I wrote a web transcript of the talk I gave with my colleagues Joseph and Yassir at [Scylla Su...

Learning Rust the hard way for a production Kafka+ScyllaDB pipeline

Learning Rust the hard way for a production Kafka+ScyllaDB pipeline

This is the web version of the talk I gave at [Scylla Summit 2022](https://www.scyllad...

Reaper 3.0 for Apache Cassandra was released

The K8ssandra team is pleased to announce the release of Reaper 3.0. Let’s dive into the main new features and improvements this major version brings, along with some notable removals.

Storage backends

Over the years, we regularly discussed dropping support for Postgres and H2 with the TLP team. The effort for maintaining these storage backends was moderate, despite our lack of expertise in Postgres, as long as Reaper’s architecture was simple. Complexity grew with more deployment options, culminating with the addition of the sidecar mode.
Some features require different consensus strategies depending on the backend, which sometimes led to implementations that worked well with one backend and were buggy with others.
In order to allow building new features faster, while providing a consistent experience for all users, we decided to drop the Postgres and H2 backends in 3.0.

Apache Cassandra and the managed DataStax Astra DB service are now the only production storage backends for Reaper. The free tier of Astra DB will be more than sufficient for most deployments.

Reaper does not generally require high availability - even complete data loss has mild consequences. Where Astra is not an option, a single Cassandra server can be started on the instance that hosts Reaper, or an existing cluster can be used as a backend data store.

Adaptive Repairs and Schedules

One of the pain points we observed when people start using Reaper is understanding the segment orchestration and knowing how the default timeout impacts the execution of repairs.
Repair is a complex choreography of operations in a distributed system. As such, and especially in the days when Reaper was created, the process could get blocked for several reasons and required a manual restart. The smart folks that designed Reaper at Spotify decided to put a timeout on segments to deal with such blockage, over which they would be terminated and rescheduled.
Problems arise when segments are too big (or have too much entropy) to process within the default 30 minutes timeout, despite not being blocked. They are repeatedly terminated and recreated, and the repair appears to make no progress.
Reaper did a poor job at dealing with this for mainly two reasons:

  • Each retry will use the same timeout, possibly failing segments forever
  • Nothing obvious was reported to explain what was failing and how to fix the situation

We fixed the former by using a longer timeout on subsequent retries, which is a simple trick to make repairs more “adaptive”. If the segments are too big, they’ll eventually pass after a few retries. It’s a good first step to improve the experience, but it’s not enough for scheduled repairs as they could end up with the same repeated failures for each run.
This is where we introduce adaptive schedules, which use feedback from past repair runs to adjust either the number of segments or the timeout for the next repair run.

Adaptive Schedules

Adaptive schedules will be updated at the end of each repair if the run metrics justify it. The schedule can get a different number of segments or a higher segment timeout depending on the latest run.
The rules are the following:

  • if more than 20% segments were extended, the number of segments will be raised by 20% on the schedule
  • if less than 20% segments were extended (and at least one), the timeout will be set to twice the current timeout
  • if no segment was extended and the maximum duration of segments is below 5 minutes, the number of segments will be reduced by 10% with a minimum of 16 segments per node.

This feature is disabled by default and is configurable on a per schedule basis. The timeout can now be set differently for each schedule, from the UI or the REST API, instead of having to change the Reaper config file and restart the process.

Incremental Repair Triggers

As we celebrate the long awaited improvements in incremental repairs brought by Cassandra 4.0, it was time to embrace them with more appropriate triggers. One metric that incremental repair makes available is the percentage of repaired data per table. When running against too much unrepaired data, incremental repair can put a lot of pressure on a cluster due to the heavy anticompaction process.

The best practice is to run it on a regular basis so that the amount of unrepaired data is kept low. Since your throughput may vary from one table/keyspace to the other, it can be challenging to set the right interval for your incremental repair schedules.

Reaper 3.0 introduces a new trigger for the incremental schedules, which is a threshold of unrepaired data. This allows creating schedules that will start a new run as soon as, for example, 10% of the data for at least one table from the keyspace is unrepaired.

Those triggers are complementary to the interval in days, which could still be necessary for low traffic keyspaces that need to be repaired to secure tombstones.

Percent unrepaired triggers

These new features will allow to securely optimize tombstone deletions by enabling the only_purge_repaired_tombstones compaction subproperty in Cassandra, permitting to reduce gc_grace_seconds down to 3 hours without fearing that deleted data reappears.

Schedules can be edited

That may sound like an obvious feature but previous versions of Reaper didn’t allow for editing of an existing schedule. This led to an annoying procedure where you had to delete the schedule (which isn’t made easy by Reaper either) and recreate it with the new settings.

3.0 fixes that embarrassing situation and adds an edit button to schedules, which allows to change the mutable settings of schedules:

Edit Repair Schedule

More improvements

In order to protect clusters from running mixed incremental and full repairs in older versions of Cassandra, Reaper would disallow the creation of an incremental repair run/schedule if a full repair had been created on the same set of tables in the past (and vice versa).

Now that incremental repair is safe for production use, it is necessary to allow such mixed repair types. In case of conflict, Reaper 3.0 will display a pop up informing you and allowing to force create the schedule/run:

Force bypass schedule conflict

We’ve also added a special “schema migration mode” for Reaper, which will exit after the schema was created/upgraded. We use this mode in K8ssandra to prevent schema conflicts and allow the schema creation to be executed in an init container that won’t be subject to liveness probes that could trigger the premature termination of the Reaper pod:

java -jar path/to/reaper.jar schema-migration path/to/cassandra-reaper.yaml

There are many other improvements and we invite all users to check the changelog in the GitHub repo.

Upgrade Now

We encourage all Reaper users to upgrade to 3.0.0, while recommending users to carefully prepare their migration out of Postgres/H2. Note that there is no export/import feature and schedules will need to be recreated after the migration.

All instructions to download, install, configure, and use Reaper 3.0.0 are available on the Reaper website.

Certificates management and Cassandra Pt II - cert-manager and Kubernetes

The joys of certificate management

Certificate management has long been a bugbear of enterprise environments, and expired certs have been the cause of countless outages. When managing large numbers of services at scale, it helps to have an automated approach to managing certs in order to handle renewal and avoid embarrassing and avoidable downtime.

This is part II of our exploration of certificates and encrypting Cassandra. In this blog post, we will dive into certificate management in Kubernetes. This post builds on a few of the concepts in Part I of this series, where Anthony explained the components of SSL encryption.

Recent years have seen the rise of some fantastic, free, automation-first services like letsencrypt, and no one should be caught flat footed by certificate renewals in 2021. In this blog post, we will look at one Kubernetes native tool that aims to make this process much more ergonomic on Kubernetes; cert-manager.

Recap

Anthony has already discussed several points about certificates. To recap:

  1. In asymmetric encryption and digital signing processes we always have public/private key pairs. We are referring to these as the Keystore Private Signing Key (KS PSK) and Keystore Public Certificate (KS PC).
  2. Public keys can always be openly published and allow senders to communicate to the holder of the matching private key.
  3. A certificate is just a public key - and some additional fields - which has been signed by a certificate authority (CA). A CA is a party trusted by all parties to an encrypted conversation.
  4. When a CA signs a certificate, this is a way for that mutually trusted party to attest that the party holding that certificate is who they say they are.
  5. CA’s themselves use public certificates (Certificate Authority Public Certificate; CA PC) and private signing keys (the Certificate Authority Private Signing Key; CA PSK) to sign certificates in a verifiable way.

The many certificates that Cassandra might be using

In a moderately complex Cassandra configuration, we might have:

  1. A root CA (cert A) for internode encryption.
  2. A certificate per node signed by cert A.
  3. A root CA (cert B) for the client-server encryption.
  4. A certificate per node signed by cert B.
  5. A certificate per client signed by cert B.

Even in a three node cluster, we can envisage a case where we must create two root CAs and 6 certificates, plus a certificate for each client application; for a total of 8+ certificates!

To compound the problem, this isn’t a one-off setup. Instead, we need to be able to rotate these certificates at regular intervals as they expire.

Ergonomic certificate management on Kubernetes with cert-manager

Thankfully, these processes are well supported on Kubernetes by a tool called cert-manager.

cert-manager is an all-in-one tool that should save you from ever having to reach for openssl or keytool again. As a Kubernetes operator, it manages a variety of custom resources (CRs) such as (Cluster)Issuers, CertificateRequests and Certificates. Critically it integrates with Automated Certificate Management Environment (ACME) Issuers, such as LetsEncrypt (which we will not be discussing today).

The workfow reduces to:

  1. Create an Issuer (via ACME, or a custom CA).
  2. Create a Certificate CR.
  3. Pick up your certificates and signing keys from the secrets cert-manager creates, and mount them as volumes in your pods’ containers.

Everything is managed declaratively, and you can reissue certificates at will simply by deleting and re-creating the certificates and secrets.

Or you can use the kubectl plugin which allows you to write a simple kubectl cert-manager renew. We won’t discuss this in depth here, see the cert-manager documentation for more information

Java batteries included (mostly)

At this point, Cassandra users are probably about to interject with a loud “Yes, but I need keystores and truststores, so this solution only gets me halfway”. As luck would have it, from version .15, cert-manager also allows you to create JKS truststores and keystores directly from the Certificate CR.

The fine print

There are two caveats to be aware of here:

  1. Most Cassandra deployment options currently available (including statefulSets, cass-operator or k8ssandra) do not currently support using a cert-per-node configuration in a convenient fashion. This is because the PodTemplate.spec portions of these resources are identical for each pod in the StatefulSet. This precludes the possibility of adding per-node certs via environment or volume mounts.
  2. There are currently some open questions about how to rotate certificates without downtime when using internode encryption.
    • Our current recommendation is to use a CA PC per Cassandra datacenter (DC) and add some basic scripts to merge both CA PCs into a single truststore to be propagated across all nodes. By renewing the CA PC independently you can ensure one DC is always online, but you still do suffer a network partition. Hinted handoff should theoretically rescue the situation but it is a less than robust solution, particularly on larger clusters. This solution is not recommended when using lightweight transactions or non LOCAL consistency levels.
    • One mitigation to consider is using non-expiring CA PCs, in which case no CA PC rotation is ever performed without a manual trigger. KS PCs and KS PSKs may still be rotated. When CA PC rotation is essential this approach allows for careful planning ahead of time, but it is not always possible when using a 3rd party CA.
    • Istio or other service mesh approaches can fully automate mTLS in clusters, but Istio is a fairly large committment and can create its own complexities.
    • Manual management of certificates may be possible using a secure vault (e.g. HashiCorp vault), sealed secrets, or similar approaches. In this case, cert manager may not be involved.

These caveats are not trivial. To address (2) more elegantly you could also implement Anthony’s solution from part one of this blog series; but you’ll need to script this up yourself to suit your k8s environment.

We are also in discussions with the folks over at cert-manager about how their ecosystem can better support Cassandra. We hope to report progress on this front over the coming months.

These caveats present challenges, but there are also specific cases where they matter less.

cert-manager and Reaper - a match made in heaven

One case where we really don’t care if a client is unavailable for a short period is when Reaper is the client.

Cassandra is an eventually consistent system and suffers from entropy. Data on nodes can become out of sync with other nodes due to transient network failures, node restarts and the general wear and tear incurred by a server operating 24/7 for several years.

Cassandra contemplates that this may occur. It provides a variety of consistency level settings allowing you to control how many nodes must agree for a piece of data to be considered the truth. But even though properly set consistency levels ensure that the data returned will be accurate, the process of reconciling data across the network degrades read performance - it is best to have consistent data on hand when you go to read it.

As a result, we recommend the use of Reaper, which runs as a Cassandra client and automatically repairs the cluster in a slow trickle, ensuring that a high volume of repairs are not scheduled all at once (which would overwhelm the cluster and degrade the performance of real clients) while also making sure that all data is eventually repaired for when it is needed.

The set up

The manifests for this blog post can be found here.

Environment

We assume that you’re running Kubernetes 1.21, and we’ll be running with a Cassandra 3.11.10 install. The demo environment we’ll be setting up is a 3 node environment, and we have tested this configuration against 3 nodes.

We will be installing the cass-operator and Cassandra cluster into the cass-operator namespace, while the cert-manager operator will sit within the cert-manager namespace.

Setting up kind

For testing, we often use kind to provide a local k8s cluster. You can use minikube or whatever solution you prefer (including a real cluster running on GKE, EKS, or AKS), but we’ll include some kind instructions and scripts here to ease the way.

If you want a quick fix to get you started, try running the setup-kind-multicluster.sh script from the k8ssandra-operator repository, with setup-kind-multicluster.sh --kind-worker-nodes 3. I have included this script in the root of the code examples repo that accompanies this blog.

A demo CA certificate

We aren’t going to use LetsEncrypt for this demo, firstly because ACME certificate issuance has some complexities (including needing a DNS or a publicly hosted HTTP server) and secondly because I want to reinforce that cert-manager is useful to organisations who are bringing their own certs and don’t need one issued. This is especially useful for on-prem deployments.

First off, create a new private key and certificate pair for your root CA. Note that the file names tls.crt and tls.key will become important in a moment.

openssl genrsa -out manifests/demoCA/tls.key 4096
openssl req -new -x509 -key manifests/demoCA/tls.key -out manifests/demoCA/tls.crt -subj "/C=AU/ST=NSW/L=Sydney/O=Global Security/OU=IT Department/CN=example.com"

(Or you can just run the generate-certs.sh script in the manifests/demoCA directory - ensure you run it from the root of the project so that the secrets appear in .manifests/demoCA/.)

When running this process on MacOS be aware of this issue which affects the creation of self signed certificates. The repo referenced in this blog post contains example certificates which you can use for demo purposes - but do not use these outside your local machine.

Now we’re going to use kustomize (which comes with kubectl) to add these files to Kubernetes as secrets. kustomize is not a templating language like Helm. But it fulfills a similar role by allowing you to build a set of base manifests that are then bundled, and which can be customised for your particular deployment scenario by patching.

Run kubectl apply -k manifests/demoCA. This will build the secrets resources using the kustomize secretGenerator and add them to Kubernetes. Breaking this process down piece by piece:

# ./manifests/demoCA
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: cass-operator
generatorOptions:
 disableNameSuffixHash: true
secretGenerator:
- name: demo-ca
  type: tls
  files:
  - tls.crt
  - tls.key
  • We use disableNameSuffixHash, because otherwise kustomize will add hashes to each of our secret names. This makes it harder to build these deployments one component at a time.
  • The tls type secret conventionally takes two keys with these names, as per the next point. cert-manager expects a secret in this format in order to create the Issuer which we will explain in the next step.
  • We are adding the files tls.crt and tls.key. The file names will become the keys of a secret called demo-ca.

cert-manager

cert-manager can be installed by running kubectl apply -f https://github.com/jetstack/cert-manager/releases/download/v1.5.3/cert-manager.yaml. It will install into the cert-manager namespace because a Kubernetes cluster should only ever have a single cert-manager operator installed.

cert-manager will install a deployment, as well as various custom resource definitions (CRDs) and webhooks to deal with the lifecycle of the Custom Resources (CRs).

A cert-manager Issuer

Issuers come in various forms. Today we’ll be using a CA Issuer because our components need to trust each other, but don’t need to be trusted by a web browser.

Other options include ACME based Issuers compatible with LetsEncrypt, but these would require that we have control of a public facing DNS or HTTP server, and that isn’t always the case for Cassandra, especially on-prem.

Dive into the truststore-keystore directory and you’ll find the Issuer, it is very simple so we won’t reproduce it here. The only thing to note is that it takes a secret which has keys of tls.crt and tls.key - the secret you pass in must have these keys. These are the CA PC and CA PSK we mentioned earlier.

We’ll apply this manifest to the cluster in the next step.

Some cert-manager certs

Let’s start with the Cassandra-Certificate.yaml resource:

spec:
  # Secret names are always required.
  secretName: cassandra-jks-keystore
  duration: 2160h # 90d
  renewBefore: 360h # 15d
  subject:
    organizations:
    - datastax
  dnsNames:
  - dc1.cass-operator.svc.cluster.local
  isCA: false
  usages:
    - server auth
    - client auth
  issuerRef:
    name: ca-issuer
    # We can reference ClusterIssuers by changing the kind here.
    # The default value is `Issuer` (i.e. a locally namespaced Issuer)
    kind: Issuer
    # This is optional since cert-manager will default to this value however
    # if you are using an external issuer, change this to that `Issuer` group.
    group: cert-manager.io
  keystores:
    jks:
      create: true
      passwordSecretRef: # Password used to encrypt the keystore
        key: keystore-pass
        name: jks-password
  privateKey:
    algorithm: RSA
    encoding: PKCS1
    size: 2048

The first part of the spec here tells us a few things:

  • The keystore, truststore and certificates will be fields within a secret called cassandra-jks-keystore. This secret will end up holding our KS PSK and KS PC.
  • It will be valid for 90 days.
  • 15 days before expiry, it will be renewed automatically by cert manager, which will contact the Issuer to do so.
  • It has a subject organisation. You can add any of the X509 subject fields here, but it needs to have one of them.
  • It has a DNS name - you could also provide a URI or IP address. In this case we have used the service address of the Cassandra datacenter which we are about to create via the operator. This has a format of <DC_NAME>.<NAMESPACE>.svc.cluster.local.
  • It is not a CA (isCA), and can be used for server auth or client auth (usages). You can tune these settings according to your needs. If you make your cert a CA you can even reference it in a new Issuer, and define cute tree like structures (if you’re into that).

Outside the certificates themselves, there are additional settings controlling how they are issued and what format this happens in.

  • IssuerRef is used to define the Issuer we want to issue the certificate. The Issuer will sign the certificate with its CA PSK.
  • We are specifying that we would like a keystore created with the keystore key, and that we’d like it in jks format with the corresponding key.
  • passwordSecretKeyRef references a secret and a key within it. It will be used to provide the password for the keystore (the truststore is unencrypted as it contains only public certs and no signing keys).

The Reaper-Certificate.yaml is similar in structure, but has a different DNS name. We aren’t configuring Cassandra to verify that the DNS name on the certificate matches the DNS name of the parties in this particular case.

Apply all of the certs and the Issuer using kubectl apply -k manifests/truststore-keystore.

Cass-operator

Examining the cass-operator directory, we’ll see that there is a kustomization.yaml which references the remote cass-operator repository and a local cassandraDatacenter.yaml. This applies the manifests required to run up a cass-operator installation namespaced to the cass-operator namespace.

Note that this installation of the operator will only watch its own namespace for CassandraDatacenter CRs. So if you create a DC in a different namespace, nothing will happen.

We will apply these manifests in the next step.

CassandraDatacenter

Finally, the CassandraDatacenter resource in the ./cass-operator/ directory will describe the kind of DC we want:

apiVersion: cassandra.datastax.com/v1beta1
kind: CassandraDatacenter
metadata:
  name: dc1
spec:
  clusterName: cluster1
  serverType: cassandra
  serverVersion: 3.11.10
  managementApiAuth:
    insecure: {}
  size: 1
  podTemplateSpec:
    spec:
      containers:
        - name: "cassandra"
          volumeMounts:
          - name: certs
            mountPath: "/crypto"
      volumes:
      - name: certs
        secret:
          secretName: cassandra-jks-keystore
  storageConfig:
    cassandraDataVolumeClaimSpec:
      storageClassName: standard
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 5Gi
  config:
    cassandra-yaml:
      authenticator: org.apache.cassandra.auth.AllowAllAuthenticator
      authorizer: org.apache.cassandra.auth.AllowAllAuthorizer
      role_manager: org.apache.cassandra.auth.CassandraRoleManager
      client_encryption_options:
        enabled: true
        # If enabled and optional is set to true encrypted and unencrypted connections are handled.
        optional: false
        keystore: /crypto/keystore.jks
        keystore_password: dc1
        require_client_auth: true
        # Set trustore and truststore_password if require_client_auth is true
        truststore: /crypto/truststore.jks
        truststore_password: dc1
        protocol: TLS
        # cipher_suites: [TLS_RSA_WITH_AES_128_CBC_SHA] # An earlier version of this manifest configured cipher suites but the proposed config was less secure. This section does not need to be modified.
      server_encryption_options:
        internode_encryption: all
        keystore: /crypto/keystore.jks
        keystore_password: dc1
        truststore: /crypto/truststore.jks
        truststore_password: dc1
    jvm-options:
      initial_heap_size: 800M
      max_heap_size: 800M
  • We provide a name for the DC - dc1.
  • We provide a name for the cluster - the DC would join other DCs if they already exist in the k8s cluster and we configured the additionalSeeds property.
  • We use the podTemplateSpec.volumes array to declare the volumes for the Cassandra pods, and we use the podTemplateSpec.containers.volumeMounts array to describe where and how to mount them.

The config.cassandra-yaml field is where most of the encryption configuration happens, and we are using it to enable both internode and client-server encryption, which both use the same keystore and truststore for simplicity. Remember that using internode encryption means your DC needs to go offline briefly for a full restart when the CA’s keys rotate.

  • We are not using authz/n in this case to keep things simple. Don’t do this in production.
  • For both encryption types we need to specify (1) the keystore location, (2) the truststore location and (3) the passwords for the keystores. The locations of the keystore/truststore come from where we mounted them in volumeMounts.
  • We are specifying JVM options just to make this run politely on a smaller machine. You would tune this for a production deployment.

Roll out the cass-operator and the CassandraDatacenter using kubectl apply -k manifests/cass-operator. Because the CRDs might take a moment to propagate, there is a chance you’ll see errors stating that the resource type does not exist. Just keep re-applying until everything works - this is a declarative system so applying the same manifests multiple times is an idempotent operation.

Reaper deployment

The k8ssandra project offers a Reaper operator, but for simplicity we are using a simple deployment (because not every deployment needs an operator). The deployment is standard kubernetes fare, and if you want more information on how these work you should refer to the Kubernetes docs.

We are injecting the keystore and truststore passwords into the environment here, to avoid placing them in the manifests. cass-operator does not currently support this approach without an initContainer to pre-process the cassandra.yaml using envsubst or a similar tool.

The only other note is that we are also pulling down a Cassandra image and using it in an initContainer to create a keyspace for Reaper, if it does not exist. In this container, we are also adding a ~/.cassandra/cqlshrc file under the home directory. This provides SSL connectivity configurations for the container. The critical part of the cqlshrc file that we are adding is:

[ssl]
certfile = /crypto/ca.crt
validate = true
userkey = /crypto/tls.key
usercert = /crypto/tls.crt
version = TLSv1_2

The version = TLSv1_2 tripped me up a few times, as it seems to be a recent requirement. Failing to add this line will give you back the rather fierce Last error: [SSL] internal error in the initContainer. The commands run in this container are not ideal. In particular, the fact that we are sleeping for 840 seconds to wait for Cassandra to start is sloppy. In a real deployment we’d want to health check and wait until the Cassandra service became available.

Apply the manifests using kubectl apply -k manifests/reaper.

Results

If you use a GUI, look at the logs for Reaper, you should see that it has connected to the cluster and provided some nice ASCII art to your console.

If you don’t use a GUI, you can run kubectl get pods -n cass-operator to find your Reaper pod (which we’ll call REAPER_PODNAME) and then run kubectl logs -n cass-operator REAPER_PODNAME to pull the logs.

Conclusion

While the above might seem like a complex procedure, we’ve just created a Cassandra cluster with both client-server and internode encryption enabled, all of the required certs, and a Reaper deployment which is configured to connect using the correct certs. Not bad.

Do keep in mind the weaknesses relating to key rotation, and watch this space for progress on that front.

Cassandra Certificate Management Part 1 - How to Rotate Keys Without Downtime

Welcome to this three part blog series where we dive into the management of certificates in an Apache Cassandra cluster. For this first post in the series we will focus on how to rotate keys in an Apache Cassandra cluster without downtime.

Usability and Security at Odds

If you have downloaded and installed a vanilla installation of Apache Cassandra, you may have noticed when it is first started all security is disabled. Your “Hello World” application works out of the box because the Cassandra project chose usability over security. This is deliberately done so everyone benefits from the usability, as security requirement for each deployment differ. While only some deployments require multiple layers of security, others require no security features to be enabled.

Security of a system is applied in layers. For example one layer is isolating the nodes in a cluster behind a proxy. Another layer is locking down OS permissions. Encrypting connections between nodes, and between nodes and the application is another layer that can be applied. If this is the only layer applied, it leaves other areas of a system insecure. When securing a Cassandra cluster, we recommend pursuing an informed approach which offers defence-in-depth. Consider additional aspects such as encryption at rest (e.g. disk encryption), authorization, authentication, network architecture, and hardware, host and OS security.

Encrypting connections between two hosts can be difficult to set up as it involves a number of tools and commands to generate the necessary assets for the first time. We covered this process in previous posts: Hardening Cassandra Step by Step - Part 1 Inter-Node Encryption and Hardening Cassandra Step by Step - Part 2 Hostname Verification for Internode Encryption. I recommend reading both posts before reading through the rest of the series, as we will build off concepts explained in them.

Here is a quick summary of the basic steps to create the assets necessary to encrypt connections between two hosts.

  1. Create the Root Certificate Authority (CA) key pair from a configuration file using openssl.
  2. Create a keystore for each host (node or client) using keytool.
  3. Export the Public Certificate from each host keystore as a “Signing Request” using keytool.
  4. Sign each Public Certificate “Signing Request” with our Root CA to generate a Signed Certificate using openssl.
  5. Import the Root CA Public Certificate and the Signed Certificate into each keystore using keytool.
  6. Create a common truststore and import the CA Public Certificate into it using keytool.

Security Requires Ongoing Maintenance

Setting up SSL encryption for the various connections to Cassandra is only half the story. Like all other software out in the wild, there are ongoing maintenance to ensure the SSL encrypted connections continue to work.

At some point you wil need to update the certificates and stores used to implement the SSL encrypted connections because they will expire. If the certificates for a node expire it will be unable to communicate with other nodes in the cluster. This will lead to at least data inconsistencies or, in the worst case, unavailable data.

This point is specifically called out towards the end of the Inter-Node Encryption blog post. The note refers to steps 1, 2 and 4 in the above summary of commands to set up the certificates and stores. The validity periods are set for the certificates and stores in their respective steps.

One Certificate Authority to Rule Them All

Before we jump into how we handle expiring certificates and stores in a cluster, we first need to understand the role a certificate plays in securing a connection.

Certificates (and encryption) are often considered a hard topic. However, there are only a few concepts that you need to bear in mind when managing certificates.

Consider the case where two parties A and B wish to communicate with one another. Both parties distrust each other and each needs a way to prove that they are who they claim to be, as well as verify the other party is who they claim to be. To do this a mutually trusted third party needs to be brought in. In our case the trusted third party is the Certificate Authority (CA); often referred to as the Root CA.

The Root CA is effectively just a key pair; similar to an SSH key pair. The main difference is the public portion of the key pair has additional fields detailing who the public key belongs to. It has the following two components.

  • Certificate Authority Private Signing Key (CA PSK) - Private component of the CA key pair. Used to sign a keystore’s public certificate.
  • Certificate Authority Public Certificate (CA PC) - Public component of the CA key pair. Used to provide the issuer name when signing a keystore’s public certificate, as well as by a node to confirm that a third party public certificate (when presented) has been signed by the Root CA PSK.

When you run openssl to create your CA key pair using a certificate configuration file, this is the command that is run.

$ openssl req \
      -config path/to/ca_certificate.config \
      -new \
      -x509 \
      -keyout path/to/ca_psk \
      -out path/to/ca_pc \
      -days <valid_days>

In the above command the -keyout specifies the path to the CA PSK, and the -out specifies the path to the CA PC.

And in the Darkness Sign Them

In addition to a common Root CA key pair, each party has its own certificate key pair to uniquely identify it and to encrypt communications. In the Cassandra world, two components are used to store the information needed to perform the above verification check and communication encryption; the keystore and the truststore.

The keystore contains a key pair which is made up of the following two components.

  • Keystore Private Signing Key (KS PSK) - Hidden in keystore. Used to sign messages sent by the node, and decrypt messages received by the node.
  • Keystore Public Certificate (KS PC) - Exported for signing by the Root CA. Used by a third party to encrypt messages sent to the node that owns this keystore.

When created, the keystore will contain the PC, and the PSK. The PC signed by the Root CA, and the CA PC are added to the keystore in subsequent operations to complete the trust chain. The certificates are always public and are presented to other parties, while PSK always remains secret. In an asymmetric/public key encryption system, messages can be encrypted with the PC but can only be decrypted using the PSK. In this way, a node can initiate encrypted communications without needing to share a secret.

The truststore stores one or more CA PCs of the parties which the node has chosen to trust, since they are the source of trust for the cluster. If a party tries to communicate with the node, it will refer to its truststore to see if it can validate the attempted communication using a CA PC that it knows about.

For a node’s KS PC to be trusted and verified by another node using the CA PC in the truststore, the KS PC needs to be signed by the Root CA key pair. Futhermore, the CA key pair is used to sign the KS PC of each party.

When you run openssl to sign an exported Keystore PC, this is the command that is run.

$ openssl x509 \
    -req \
    -CAkey path/to/ca_psk \
    -CA path/to/ca_pc \
    -in path/to/exported_ks_pc_sign_request \
    -out paht/to/signed_ks_pc \
    -days <valid_days> \
    -CAcreateserial \
    -passin pass:<ca_psk_password>

In the above command both the Root CA PSK and CA PC are used via -CAkey and -CA respectively when signing the KS PC.

More Than One Way to Secure a Connection

Now that we have a deeper understanding of the assets that are used to encrypt communications, we can examine various ways to implement it. There are multiple ways to implement SSL encryption in an Apache Cassandra cluster. Regardless of the encryption approach, the objective when applying this type of security to a cluster is to ensure;

  • Hosts (nodes or clients) can determine whether they should trust other hosts in cluster.
  • Any intercepted communication between two hosts is indecipherable.

The three most common methods vary in both ease of deployment and resulting level of security. They are as follows.

The Cheats Way

The easiest and least secure method for rolling out SSL encryption can be done in the following way

Generation

  • Single CA for the cluster.
  • Single truststore containing the CA PC.
  • Single keystore which has been signed by the CA.

Deployment

  • The same keystore and truststore are deployed to each node.

In this method a single Root CA and a single keystore is deployed to all nodes in the cluster. This means any node can decipher communications intended for any other node. If a bad actor gains control of a node in the cluster then they will be able to impersonate any other node. That is, compromise of one host will compromise all of them. Depending on your threat model, this approach can be better than no encryption at all. It will ensure that a bad actor with access to only the network will no longer be able to eavesdrop on traffic.

We would use this method as a stop gap to get internode encryption enabled in a cluster. The idea would be to quickly deploy internode encryption with the view of updating the deployment in the near future to be more secure.

Best Bang for Buck

Arguably the most popular and well documented method for rolling out SSL encryption is

Generation

  • Single CA for the cluster.
  • Single truststore containing the CA PC.
  • Unique keystore for each node all of which have been signed by the CA.

Deployment

  • Each keystore is deployed to its associated node.
  • The same truststore is deployed to each node.

Similar to the previous method, this method uses a cluster wide CA. However, unlike the previous method each node will have its own keystore. Each keystore has its own certificate that is signed by a Root CA common to all nodes. The process to generate and deploy the keystores in this way is practiced widely and well documented.

We would use this method as it provides better security over the previous method. Each keystore can have its own password and host verification, which further enhances the security that can be applied.

Fort Knox

The method that offers the strongest security of the three can be rolled out in following way

Generation

  • Unique CA for each node.
  • A single truststore containing the Public Certificate for each of the CAs.
  • Unique keystore for each node that has been signed by the CA specific to the node.

Deployment

  • Each keystore with its unique CA PC is deployed to its associated node.
  • The same truststore is deployed to each node.

Unlike the other two methods, this one uses a Root CA per host and similar to the previous method, each node will have its own keystore. Each keystore has its own PC that is signed by a Root CA unique to the node. The Root CA PC of each node needs to be added to the truststore that is deployed to all nodes. For large cluster deployments this encryption configuration is cumbersome and will result in a large truststore being generated. Deployments of this encryption configuration are less common in the wild.

We would use this method as it provides all the advantages of the previous method and in addition, provides the ability to isolate a node from the cluster. This can be done by simply rolling out a new truststore which excludes a specific node’s CA PC. In this way a compromised node could be isolated from the cluster by simply changing the truststore. Under the previous two approaches, isolation of a compromised node in this fashion would require a rollout of an entirely new Root CA and one or more new keystores. Furthermore, each new Keystore CA would need to be signed by the new Root CA.

WARNING: Ensure your Certificate Authority is secure!

Regardless of the deployment method chosen, the whole setup will depend on the security of the Root CA. Ideally both components should be secured, or at the very least the PSK needs to be secured properly after it is generated since all trust is based on it. If both components are compromised by a bad actor, then that actor can potentially impersonate another node in the cluster. The good news is, there are a variety of ways to secure the Root CA components, however that topic goes beyond the scope of this post.

The Need for Rotation

If we are following best practices when generating our CAs and keystores, they will have an expiry date. This is a good thing because it forces us to regenerate and roll out our new encryption assets (stores, certificates, passwords) to the cluster. By doing this we minimise the exposure that any one of the components has. For example, if a password for a keystore is unknowingly leaked, the password is only good up until the keystore expiry. Having a scheduled expiry reduces the chance of a security leak becoming a breach, and increases the difficulty for a bad actor to gain persistence in the system. In the worst case it limits the validity of compromised credentials.

Always Read the Expiry Label

The only catch to having an expiry date on our encryption assets is that we need to rotate (update) them before they expire. Otherwise, our data will be unavailable or may be inconsistent in our cluster for a period of time. Expired encryption assets when forgotten can be a silent, sinister problem. If, for example, our SSL certificates expire unnoticed we will only discover this blunder when we restart the Cassandra service. In this case the Cassandra service will fail to connect to the cluster on restart and SSL expiry error will appear in the logs. At this point there is nothing we can do without incurring some data unavailability or inconsistency in the cluster. We will cover what to do in this case in a subsequent post. However, it is best to avoid this situation by rotating the encryption assets before they expire.

How to Play Musical Certificates

Assuming we are going to rotate our SSL certificates before they expire, we can perform this operation live on the cluster without downtime. This process requires the replication factor and consistency level to configured to allow for a single node to be down for a short period of time in the cluster. Hence, it works best when use a replication factor >= 3 and use consistency level <= QUORUM or LOCAL_QUORUM depending on the cluster configuration.

  1. Create the NEW encryption assets; NEW CA, NEW keystores, and NEW truststore, using the process described earlier.
  2. Import the NEW CA to the OLD truststore already deployed in the cluster using keytool. The OLD truststore will increase in size, as it has both the OLD and NEW CAs in it.
    $ keytool -keystore <old_truststore> -alias CARoot -importcert -file <new_ca_pc> -keypass <new_ca_psk_password> -storepass <old_truststore_password> -noprompt
    

    Where:

    • <old_truststore>: The path to the OLD truststore already deployed in the cluster. This can be just a copy of the OLD truststore deployed.
    • <new_ca_pc>: The path to the NEW CA PC generated.
    • <new_ca_psk_password>: The password for the NEW CA PSKz.
    • <old_truststore_password>: The password for the OLD truststore.
  3. Deploy the updated OLD truststore to all the nodes in the cluster. Specifically, perform these steps on a single node, then repeat them on the next node until all nodes are updated. Once this step is complete, all nodes in the cluster will be able to establish connections using both the OLD and NEW CAs.
    1. Drain the node using nodetool drain.
    2. Stop the Cassandra service on the node.
    3. Copy the updated OLD truststore to the node.
    4. Start the Cassandra service on the node.
  4. Deploy the NEW keystores to their respective nodes in the cluster. Perform this operation one node at a time in the same way the OLD truststore was deployed in the previous step. Once this step is complete, all nodes in the cluster will be using their NEW SSL certificate to establish encrypted connections with each other.
  5. Deploy the NEW truststore to all the nodes in the cluster. Once again, perform this operation one node at a time in the same way the OLD truststore was deployed in Step 3.

The key to ensuring uptime in the rotation are in Steps 2 and 3. That is, we have the OLD and the NEW CAs all in the truststore and deployed on every node prior to rolling out the NEW keystores. This allows nodes to communicate regardless of whether they have the OLD or NEW keystore. This is because both the OLD and NEW assets are trusted by all nodes. The process still works whether our NEW CAs are per host or cluster wide. If the NEW CAs are per host, then they all need to be added to the OLD truststore.

Example Certificate Rotation on a Cluster

Now that we understand the theory, let’s see the process in action. We will use ccm to create a three node cluster running Cassandra 3.11.10 with internode encryption configured.

As pre-cluster setup task we will generate the keystores and truststore to implement the internode encryption. Rather than carry out the steps manually to generate the stores, we have developed a script called generate_cluster_ssl_stores that does the job for us.

The script requires us to supply the node IP addresses, and a certificate configuration file. Our certificate configuration file, test_ca_cert.conf has the following contents:

[ req ]
distinguished_name     = req_distinguished_name
prompt                 = no
output_password        = mypass
default_bits           = 2048

[ req_distinguished_name ]
C                      = AU
ST                     = NSW
L                      = Sydney
O                      = TLP
OU                     = SSLTestCluster
CN                     = SSLTestClusterRootCA
emailAddress           = info@thelastpickle.com¡

The command used to call the generate_cluster_ssl_stores.sh script is as follows.

$ ./generate_cluster_ssl_stores.sh -g -c -n 127.0.0.1,127.0.0.2,127.0.0.3 test_ca_cert.conf

Let’s break down the options in the above command.

  • -g - Generate passwords for each keystore and the truststore.
  • -c - Create a Root CA for the cluster and sign each keystore PC with it.
  • -n - List of nodes to generate keystores for.

The above command generates the following encryption assets.

$ ls -alh ssl_artifacts_20210602_125353
total 72
drwxr-xr-x   9 anthony  staff   288B  2 Jun 12:53 .
drwxr-xr-x   5 anthony  staff   160B  2 Jun 12:53 ..
-rw-r--r--   1 anthony  staff    17B  2 Jun 12:53 .srl
-rw-r--r--   1 anthony  staff   4.2K  2 Jun 12:53 127-0-0-1-keystore.jks
-rw-r--r--   1 anthony  staff   4.2K  2 Jun 12:53 127-0-0-2-keystore.jks
-rw-r--r--   1 anthony  staff   4.2K  2 Jun 12:53 127-0-0-3-keystore.jks
drwxr-xr-x  10 anthony  staff   320B  2 Jun 12:53 certs
-rw-r--r--   1 anthony  staff   1.0K  2 Jun 12:53 common-truststore.jks
-rw-r--r--   1 anthony  staff   219B  2 Jun 12:53 stores.password

With the necessary stores generated we can create our three node cluster in ccm. Prior to starting the cluster our nodes should look something like this.

$ ccm status
Cluster: 'SSLTestCluster'
-------------------------
node1: DOWN (Not initialized)
node2: DOWN (Not initialized)
node3: DOWN (Not initialized)

We can configure internode encryption in the cluster by modifying the cassandra.yaml files for each node as follows. The passwords for each store are in the stores.password file created by the generate_cluster_ssl_stores.sh script.

node1 - cassandra.yaml

...
server_encryption_options:
  internode_encryption: all
  keystore: /ssl_artifacts_20210602_125353/127-0-0-1-keystore.jks
  keystore_password: HQR6xX4XQrYCz58CgAiFkWL9OTVDz08e
  truststore: /ssl_artifacts_20210602_125353/common-truststore.jks
  truststore_password: 8dPhJ2oshBihAYHcaXzgfzq6kbJ13tQi
...

node2 - cassandra.yaml

...
server_encryption_options:
  internode_encryption: all
  keystore: /ssl_artifacts_20210602_125353/127-0-0-2-keystore.jks
  keystore_password: Aw7pDCmrtacGLm6a1NCwVGxohB4E3eui
  truststore: /ssl_artifacts_20210602_125353/common-truststore.jks
  truststore_password: 8dPhJ2oshBihAYHcaXzgfzq6kbJ13tQi
...

node3 - cassandra.yaml

...
server_encryption_options:
  internode_encryption: all
  keystore: /ssl_artifacts_20210602_125353/127-0-0-3-keystore.jks
  keystore_password: 1DdFk27up3zsmP0E5959PCvuXIgZeLzd
  truststore: /ssl_artifacts_20210602_125353/common-truststore.jks
  truststore_password: 8dPhJ2oshBihAYHcaXzgfzq6kbJ13tQi
...

Now that we configured internode encryption in the cluster, we can start the nodes and monitor the logs to make sure they start correctly.

$ ccm node1 start && touch ~/.ccm/SSLTestCluster/node1/logs/system.log && tail -n 40 -f ~/.ccm/SSLTestCluster/node1/logs/system.log
...
$ ccm node2 start && touch ~/.ccm/SSLTestCluster/node2/logs/system.log && tail -n 40 -f ~/.ccm/SSLTestCluster/node2/logs/system.log
...
$ ccm node3 start && touch ~/.ccm/SSLTestCluster/node3/logs/system.log && tail -n 40 -f ~/.ccm/SSLTestCluster/node3/logs/system.log

In all cases we see the following message in the logs indicating that internode encryption is enabled.

INFO  [main] ... MessagingService.java:704 - Starting Encrypted Messaging Service on SSL port 7001

Once all the nodes have started, we can check the cluster status. We are looking to see that all nodes are up and in a normal state.

$ ccm node1 nodetool status

Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address    Load       Tokens       Owns (effective)  Host ID                               Rack
UN  127.0.0.1  90.65 KiB  16           65.8%             2661807a-d8d3-4bba-8639-6c0fada2ac88  rack1
UN  127.0.0.2  66.31 KiB  16           65.5%             f3db4bbe-1f35-4edb-8513-cb55a05393a7  rack1
UN  127.0.0.3  71.46 KiB  16           68.7%             46c2f4b5-905b-42b4-8bb9-563a03c4b415  rack1

We will create a NEW Root CA along with a NEW set of stores for the cluster. As part of this process, we will add the NEW Root CA PC to OLD (current) truststore that is already in use in the cluster. Once again we can use our generate_cluster_ssl_stores.sh script to this, including the additional step of adding the NEW Root CA PC to our OLD truststore. This can be done with the following commands.

# Make the password to our old truststore available to script so we can add the new Root CA to it.

$ export EXISTING_TRUSTSTORE_PASSWORD=$(cat ssl_artifacts_20210602_125353/stores.password | grep common-truststore.jks | cut -d':' -f2)
$ ./generate_cluster_ssl_stores.sh -g -c -n 127.0.0.1,127.0.0.2,127.0.0.3 -e ssl_artifacts_20210602_125353/common-truststore.jks test_ca_cert.conf 

We call our script using a similar command to the first time we used it. The difference now is we are using one additional option; -e.

  • -e - Path to our OLD (existing) truststore which we will add the new Root CA PC to. This option requires us to set the OLD truststore password in the EXISTING_TRUSTSTORE_PASSWORD variable.

The above command generates the following new encryption assets. These files are located in a different directory to the old ones. The directory with the old encryption assets is ssl_artifacts_20210602_125353 and the directory with the new encryption assets is ssl_artifacts_20210603_070951

$ ls -alh ssl_artifacts_20210603_070951
total 72
drwxr-xr-x   9 anthony  staff   288B  3 Jun 07:09 .
drwxr-xr-x   6 anthony  staff   192B  3 Jun 07:09 ..
-rw-r--r--   1 anthony  staff    17B  3 Jun 07:09 .srl
-rw-r--r--   1 anthony  staff   4.2K  3 Jun 07:09 127-0-0-1-keystore.jks
-rw-r--r--   1 anthony  staff   4.2K  3 Jun 07:09 127-0-0-2-keystore.jks
-rw-r--r--   1 anthony  staff   4.2K  3 Jun 07:09 127-0-0-3-keystore.jks
drwxr-xr-x  10 anthony  staff   320B  3 Jun 07:09 certs
-rw-r--r--   1 anthony  staff   1.0K  3 Jun 07:09 common-truststore.jks
-rw-r--r--   1 anthony  staff   223B  3 Jun 07:09 stores.password

When we look at our OLD truststore we can see that it has increased in size. Originally, it was 1.0K and it is now 2.0K in size after adding the new Root CA PC it.

$ ls -alh ssl_artifacts_20210602_125353/common-truststore.jks
-rw-r--r--  1 anthony  staff   2.0K  3 Jun 07:09 ssl_artifacts_20210602_125353/common-truststore.jks

We can now roll out the updated OLD truststore. In a production Cassandra deployment we would copy the updated OLD truststore to a node and restart the Cassandra service. Then repeat this process on the other nodes in the cluster, one node at a time. In our case, our locally running nodes are already pointing to the updated OLD truststore. We need to only restart the Cassandra service.

$ for i in $(ccm status | grep UP | cut -d':' -f1); do echo "restarting ${i}" && ccm ${i} stop && sleep 3 && ccm ${i} start; done
restarting node1
restarting node2
restarting node3

After the restart, our nodes are up and in a normal state.

$ ccm node1 nodetool status

Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address    Load       Tokens       Owns (effective)  Host ID                               Rack
UN  127.0.0.1  140.35 KiB  16           100.0%            2661807a-d8d3-4bba-8639-6c0fada2ac88  rack1
UN  127.0.0.2  167.23 KiB  16           100.0%            f3db4bbe-1f35-4edb-8513-cb55a05393a7  rack1
UN  127.0.0.3  173.7 KiB  16           100.0%            46c2f4b5-905b-42b4-8bb9-563a03c4b415  rack1

Our nodes are using the updated OLD truststore which has the old Root CA PC and the new Root CA PC. This means that nodes will be able to communicate using either the old (current) keystore or the new keystore. We can now roll out the new keystore one node at a time and still have all our data available.

To do the new keystore roll out we will stop the Cassandra service, update its configuration to point to the new keystore, and then start the Cassandra service. A few notes before we start:

  • The node will need to point to the new keystore located in the directory with the new encryption assets; ssl_artifacts_20210603_070951.
  • The node will still need to use the OLD truststore, so its path will remain unchanged.

node1 - stop Cassandra service

$ ccm node1 stop
$ ccm node2 status

Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address    Load       Tokens       Owns (effective)  Host ID                               Rack
DN  127.0.0.1  140.35 KiB  16           100.0%            2661807a-d8d3-4bba-8639-6c0fada2ac88  rack1
UN  127.0.0.2  142.19 KiB  16           100.0%            f3db4bbe-1f35-4edb-8513-cb55a05393a7  rack1
UN  127.0.0.3  148.66 KiB  16           100.0%            46c2f4b5-905b-42b4-8bb9-563a03c4b415  rack1

node1 - update keystore path to point to new keystore in cassandra.yaml

...
server_encryption_options:
  internode_encryption: all
  keystore: /ssl_artifacts_20210603_070951/127-0-0-1-keystore.jks
  keystore_password: V3fKP76XfK67KTAti3CXAMc8hVJGJ7Jg
  truststore: /ssl_artifacts_20210602_125353/common-truststore.jks
  truststore_password: 8dPhJ2oshBihAYHcaXzgfzq6kbJ13tQi
...

node1 - start Cassandra service

$ ccm node1 start
$ ccm node2 status

Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address    Load       Tokens       Owns (effective)  Host ID                               Rack
UN  127.0.0.1  179.23 KiB  16           100.0%            2661807a-d8d3-4bba-8639-6c0fada2ac88  rack1
UN  127.0.0.2  142.19 KiB  16           100.0%            f3db4bbe-1f35-4edb-8513-cb55a05393a7  rack1
UN  127.0.0.3  148.66 KiB  16           100.0%            46c2f4b5-905b-42b4-8bb9-563a03c4b415  rack1

At this point we have node1 using the new keystore while node2 and node3 are using the old keystore. Our nodes are once again up and in a normal state, so we can proceed to update the certificates on node2.

node2 - stop Cassandra service

$ ccm node2 stop
$ ccm node3 status

Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address    Load       Tokens       Owns (effective)  Host ID                               Rack
UN  127.0.0.1  224.48 KiB  16           100.0%            2661807a-d8d3-4bba-8639-6c0fada2ac88  rack1
DN  127.0.0.2  188.46 KiB  16           100.0%            f3db4bbe-1f35-4edb-8513-cb55a05393a7  rack1
UN  127.0.0.3  194.35 KiB  16           100.0%            46c2f4b5-905b-42b4-8bb9-563a03c4b415  rack1

node2 - update keystore path to point to new keystore in cassandra.yaml

...
server_encryption_options:
  internode_encryption: all
  keystore: /ssl_artifacts_20210603_070951/127-0-0-2-keystore.jks
  keystore_password: 3uEjkTiR0xI56RUDyo23TENJjtMk8VbY
  truststore: /ssl_artifacts_20210602_125353/common-truststore.jks
  truststore_password: 8dPhJ2oshBihAYHcaXzgfzq6kbJ13tQi
...

node2 - start Cassandra service

$ ccm node2 start
$ ccm node3 status

Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address    Load       Tokens       Owns (effective)  Host ID                               Rack
UN  127.0.0.1  224.48 KiB  16           100.0%            2661807a-d8d3-4bba-8639-6c0fada2ac88  rack1
UN  127.0.0.2  227.12 KiB  16           100.0%            f3db4bbe-1f35-4edb-8513-cb55a05393a7  rack1
UN  127.0.0.3  194.35 KiB  16           100.0%            46c2f4b5-905b-42b4-8bb9-563a03c4b415  rack1

At this point we have node1 and node2 using the new keystore while node3 is using the old keystore. Our nodes are once again up and in a normal state, so we can proceed to update the certificates on node3.

node3 - stop Cassandra service

$ ccm node3 stop
$ ccm node1 nodetool status

Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address    Load       Tokens       Owns (effective)  Host ID                               Rack
UN  127.0.0.1  225.42 KiB  16           100.0%            2661807a-d8d3-4bba-8639-6c0fada2ac88  rack1
UN  127.0.0.2  191.31 KiB  16           100.0%            f3db4bbe-1f35-4edb-8513-cb55a05393a7  rack1
DN  127.0.0.3  194.35 KiB  16           100.0%            46c2f4b5-905b-42b4-8bb9-563a03c4b415  rack1

node3 - update keystore path to point to new keystore in cassandra.yaml

...
server_encryption_options:
  internode_encryption: all
  keystore: /ssl_artifacts_20210603_070951/127-0-0-3-keystore.jks
  keystore_password: hkjMwpn2y2aYllePAgCNzkBnpD7Vxl6f
  truststore: /ssl_artifacts_20210602_125353/common-truststore.jks
  truststore_password: 8dPhJ2oshBihAYHcaXzgfzq6kbJ13tQi
...

node3 - start Cassandra service

$ ccm node3 start
$ ccm node1 nodetool status

Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address    Load       Tokens       Owns (effective)  Host ID                               Rack
UN  127.0.0.1  225.42 KiB  16           100.0%            2661807a-d8d3-4bba-8639-6c0fada2ac88  rack1
UN  127.0.0.2  191.31 KiB  16           100.0%            f3db4bbe-1f35-4edb-8513-cb55a05393a7  rack1
UN  127.0.0.3  239.3 KiB  16           100.0%            46c2f4b5-905b-42b4-8bb9-563a03c4b415  rack1

The keystore rotation is now complete on all nodes in our cluster. However, all nodes are still using the updated OLD truststore. To ensure that our old Root CA can no longer be used to intercept messages in our cluster we need to roll out the NEW truststore to all nodes. This can be done in the same way we deployed the new keystores.

node1 - stop Cassandra service

$ ccm node1 stop
$ ccm node2 status

Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address    Load       Tokens       Owns (effective)  Host ID                               Rack
DN  127.0.0.1  225.42 KiB  16           100.0%            2661807a-d8d3-4bba-8639-6c0fada2ac88  rack1
UN  127.0.0.2  191.31 KiB  16           100.0%            f3db4bbe-1f35-4edb-8513-cb55a05393a7  rack1
UN  127.0.0.3  185.37 KiB  16           100.0%            46c2f4b5-905b-42b4-8bb9-563a03c4b415  rack1

node1 - update truststore path to point to new truststore in cassandra.yaml

...
server_encryption_options:
  internode_encryption: all
  keystore: /ssl_artifacts_20210603_070951/127-0-0-1-keystore.jks
  keystore_password: V3fKP76XfK67KTAti3CXAMc8hVJGJ7Jg
  truststore: /ssl_artifacts_20210603_070951/common-truststore.jks
  truststore_password: 0bYmrrXaKIPJQ5UrtQQTFpPLepMweaLc
...

node1 - start Cassandra service

$ ccm node1 start
$ ccm node2 status

Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address    Load       Tokens       Owns (effective)  Host ID                               Rack
UN  127.0.0.1  150 KiB    16           100.0%            2661807a-d8d3-4bba-8639-6c0fada2ac88  rack1
UN  127.0.0.2  191.31 KiB  16           100.0%            f3db4bbe-1f35-4edb-8513-cb55a05393a7  rack1
UN  127.0.0.3  185.37 KiB  16           100.0%            46c2f4b5-905b-42b4-8bb9-563a03c4b415  rack1

Now we update the truststore for node2.

node2 - stop Cassandra service

$ ccm node2 stop
$ ccm node3 nodetool status

Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address    Load       Tokens       Owns (effective)  Host ID                               Rack
UN  127.0.0.1  150 KiB    16           100.0%            2661807a-d8d3-4bba-8639-6c0fada2ac88  rack1
DN  127.0.0.2  191.31 KiB  16           100.0%            f3db4bbe-1f35-4edb-8513-cb55a05393a7  rack1
UN  127.0.0.3  185.37 KiB  16           100.0%            46c2f4b5-905b-42b4-8bb9-563a03c4b415  rack1

node2 - update truststore path to point to NEW truststore in cassandra.yaml

...
server_encryption_options:
  internode_encryption: all
  keystore: /ssl_artifacts_20210603_070951/127-0-0-2-keystore.jks
  keystore_password: 3uEjkTiR0xI56RUDyo23TENJjtMk8VbY
  truststore: /ssl_artifacts_20210603_070951/common-truststore.jks
  truststore_password: 0bYmrrXaKIPJQ5UrtQQTFpPLepMweaLc
...

node2 - start Cassandra service

$ ccm node2 start
$ ccm node3 nodetool status

Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address    Load       Tokens       Owns (effective)  Host ID                               Rack
UN  127.0.0.1  150 KiB    16           100.0%            2661807a-d8d3-4bba-8639-6c0fada2ac88  rack1
UN  127.0.0.2  294.05 KiB  16           100.0%            f3db4bbe-1f35-4edb-8513-cb55a05393a7  rack1
UN  127.0.0.3  185.37 KiB  16           100.0%            46c2f4b5-905b-42b4-8bb9-563a03c4b415  rack1

Now we update the truststore for node3.

node3 - stop Cassandra service

$ ccm node3 stop
$ ccm node1 nodetool status

Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address    Load       Tokens       Owns (effective)  Host ID                               Rack
UN  127.0.0.1  150 KiB    16           100.0%            2661807a-d8d3-4bba-8639-6c0fada2ac88  rack1
UN  127.0.0.2  208.83 KiB  16           100.0%            f3db4bbe-1f35-4edb-8513-cb55a05393a7  rack1
DN  127.0.0.3  185.37 KiB  16           100.0%            46c2f4b5-905b-42b4-8bb9-563a03c4b415  rack1

node3 - update truststore path to point to NEW truststore in cassandra.yaml

...
server_encryption_options:
  internode_encryption: all
  keystore: /ssl_artifacts_20210603_070951/127-0-0-3-keystore.jks
  keystore_password: hkjMwpn2y2aYllePAgCNzkBnpD7Vxl6f
  truststore: /ssl_artifacts_20210603_070951/common-truststore.jks
  truststore_password: 0bYmrrXaKIPJQ5UrtQQTFpPLepMweaLc
...

node3 - start Cassandra service

$ ccm node3 start
$ ccm node1 nodetool status

Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address    Load       Tokens       Owns (effective)  Host ID                               Rack
UN  127.0.0.1  150 KiB    16           100.0%            2661807a-d8d3-4bba-8639-6c0fada2ac88  rack1
UN  127.0.0.2  208.83 KiB  16           100.0%            f3db4bbe-1f35-4edb-8513-cb55a05393a7  rack1
UN  127.0.0.3  288.6 KiB  16           100.0%            46c2f4b5-905b-42b4-8bb9-563a03c4b415  rack1

The rotation of the certificates is now complete and all while having only a single node down at any one time! This process can be used for all three of the deployment variations. In addition, it can be used to move between the different deployment variations without incurring downtime.

Conclusion

Internode encryption plays an important role in securing the internal communication of a cluster. When deployed, it is crucial that certificate expiry dates be tracked so the certificates can be rotated before they expire. Failure to do so will result in unavailability and inconsistencies.

Using the process discussed in this post and combined with the appropriate tooling, internode encryption can be easily deployed and associated certificates easily rotated. In addition, the process can be used to move between the different encryption deployments.

Regardless of the reason for using the process, it can be executed without incurring downtime in common Cassandra use cases.

Running your Database on OpenShift and CodeReady Containers

Let’s take an introductory run-through of setting up your database on OpenShift, using your own hardware and RedHat’s CodeReady Containers.

CodeReady Containers is a great way to run OpenShift K8s locally, ideal for development and testing. The steps in this blog post will require a machine, laptop or desktop, of decent capability; preferably quad CPUs and 16GB+ RAM.

Download and Install RedHat’s CodeReady Containers

Download and install RedHat’s CodeReady Containers (version 1.27) as described in Red Hat OpenShift 4 on your laptop: Introducing Red Hat CodeReady Containers.

First configure CodeReady Containers, from the command line

❯ crc setup

…
Your system is correctly setup for using CodeReady Containers, you can now run 'crc start' to start the OpenShift cluster

Check the version is correct.

❯ crc version

…
CodeReady Containers version: 1.27.0+3d6bc39d
OpenShift version: 4.7.11 (not embedded in executable)

Then start it, entering the Pull Secret copied from the download page. Have patience here, this can take ten minutes or more.

❯ crc start

INFO Checking if running as non-root
…
Started the OpenShift cluster.

The server is accessible via web console at:
  https://console-openshift-console.apps-crc.testing 
…

The output above will include the kubeadmin password which is required in the following oc login … command.

❯ eval $(crc oc-env)
❯ oc login -u kubeadmin -p <password-from-crc-setup-output> https://api.crc.testing:6443

❯ oc version

Client Version: 4.7.11
Server Version: 4.7.11
Kubernetes Version: v1.20.0+75370d3

Open in a browser the URL https://console-openshift-console.apps-crc.testing

Log in using the kubeadmin username and password, as used above with the oc login … command. You might need to try a few times because of the self-signed certificate used.

Once OpenShift has started and is running you should see the following webpage

CodeReady Start Webpage

Some commands to help check status and the startup process are

❯ oc status       

In project default on server https://api.crc.testing:6443

svc/openshift - kubernetes.default.svc.cluster.local
svc/kubernetes - 10.217.4.1:443 -> 6443

View details with 'oc describe <resource>/<name>' or list resources with 'oc get all'.      

Before continuing, go to the CodeReady Containers Preferences dialog. Increase CPUs and Memory to >12 and >14GB correspondingly.

CodeReady Preferences dialog

Create the OpenShift Local Volumes

Cassandra needs persistent volumes for its data directories. There are different ways to do this in OpenShift, from enabling local host paths in Rancher persistent volumes, to installing and using the OpenShift Local Storage Operator, and of course persistent volumes on the different cloud provider backends.

This blog post will use vanilla OpenShift volumes using folders on the master k8s node.

Go to the “Terminal” tab for the master node and create the required directories. The master node is found on the /cluster/nodes/ webpage.

Click on the node, named something like crc-m89r2-master-0, and then click on the “Terminal” tab. In the terminal, execute the following commands:

sh-4.4# chroot /host
sh-4.4# mkdir -p /mnt/cass-operator/pv000
sh-4.4# mkdir -p /mnt/cass-operator/pv001
sh-4.4# mkdir -p /mnt/cass-operator/pv002
sh-4.4# 

Persistent Volumes are to be created with affinity to the master node, declared in the following yaml. The name of the master node can vary from installation to installation. If your master node is not named crc-gm7cm-master-0 then the following command replaces its name. First download the cass-operator-1.7.0-openshift-storage.yaml file, check the name of the node in the nodeAffinity sections against your current CodeReady Containers instance, updating if necessary.

❯ wget https://thelastpickle.com/files/openshift-intro/cass-operator-1.7.0-openshift-storage.yaml

# The name of your master node
❯ oc get nodes -o=custom-columns=NAME:.metadata.name --no-headers

# If it is not crc-gm7cm-master-0
❯ sed -i '' "s/crc-gm7cm-master-0/$(oc get nodes -o=custom-columns=NAME:.metadata.name --no-headers)/" cass-operator-1.7.0-openshift-storage.yaml

Create the Persistent Volumes (PV) and Storage Class (SC).

❯ oc apply -f cass-operator-1.7.0-openshift-storage.yaml

persistentvolume/server-storage-0 created
persistentvolume/server-storage-1 created
persistentvolume/server-storage-2 created
storageclass.storage.k8s.io/server-storage created

To check the existence of the PVs.

❯ oc get pv | grep server-storage

server-storage-0   10Gi   RWO    Delete   Available   server-storage     5m19s
server-storage-1   10Gi   RWO    Delete   Available   server-storage     5m19s
server-storage-2   10Gi   RWO    Delete   Available   server-storage     5m19s

To check the existence of the SC.

❯ oc get sc

NAME             PROVISIONER                    RECLAIMPOLICY   VOLUMEBINDINGMODE      ALLOWVOLUMEEXPANSION   AGE
server-storage   kubernetes.io/no-provisioner   Delete          WaitForFirstConsumer   false                  5m36s

More information on using the can be found in the RedHat documentation for OpenShift volumes.

Deploy the Cass-Operator

Now create the cass-operator. Here we can use the upstream 1.7.0 version of the cass-operator. After creating (applying) the cass-operator, it is important to quickly execute the oc adm policy … commands in the following step so the pods have the privileges required and are successfully created.

❯ oc apply -f https://raw.githubusercontent.com/k8ssandra/cass-operator/v1.7.0/docs/user/cass-operator-manifests.yaml

namespace/cass-operator created
serviceaccount/cass-operator created
secret/cass-operator-webhook-config created
W0606 14:25:44.757092   27806 warnings.go:70] apiextensions.k8s.io/v1beta1 CustomResourceDefinition is deprecated in v1.16+, unavailable in v1.22+; use apiextensions.k8s.io/v1 CustomResourceDefinition
W0606 14:25:45.077394   27806 warnings.go:70] apiextensions.k8s.io/v1beta1 CustomResourceDefinition is deprecated in v1.16+, unavailable in v1.22+; use apiextensions.k8s.io/v1 CustomResourceDefinition
customresourcedefinition.apiextensions.k8s.io/cassandradatacenters.cassandra.datastax.com created
clusterrole.rbac.authorization.k8s.io/cass-operator-cr created
clusterrole.rbac.authorization.k8s.io/cass-operator-webhook created
clusterrolebinding.rbac.authorization.k8s.io/cass-operator-crb created
clusterrolebinding.rbac.authorization.k8s.io/cass-operator-webhook created
role.rbac.authorization.k8s.io/cass-operator created
rolebinding.rbac.authorization.k8s.io/cass-operator created
service/cassandradatacenter-webhook-service created
deployment.apps/cass-operator created
W0606 14:25:46.701712   27806 warnings.go:70] admissionregistration.k8s.io/v1beta1 ValidatingWebhookConfiguration is deprecated in v1.16+, unavailable in v1.22+; use admissionregistration.k8s.io/v1 ValidatingWebhookConfiguration
W0606 14:25:47.068795   27806 warnings.go:70] admissionregistration.k8s.io/v1beta1 ValidatingWebhookConfiguration is deprecated in v1.16+, unavailable in v1.22+; use admissionregistration.k8s.io/v1 ValidatingWebhookConfiguration
validatingwebhookconfiguration.admissionregistration.k8s.io/cassandradatacenter-webhook-registration created

❯ oc adm policy add-scc-to-user privileged -z default -n cass-operator

clusterrole.rbac.authorization.k8s.io/system:openshift:scc:privileged added: "default"

❯ oc adm policy add-scc-to-user privileged -z cass-operator -n cass-operator

clusterrole.rbac.authorization.k8s.io/system:openshift:scc:privileged added: "cass-operator"

Let’s check the deployment happened.

❯ oc get deployments -n cass-operator

NAME            READY   UP-TO-DATE   AVAILABLE   AGE
cass-operator   1/1     1            1           14m

Let’s also check the cass-operator pod was created and is successfully running. Note that the kubectl command is used here, for all k8s actions the oc and kubectl commands are interchangable.

❯ kubectl get pods -w -n cass-operator

NAME                             READY   STATUS    RESTARTS   AGE
cass-operator-7675b65744-hxc8z   1/1     Running   0          15m

Troubleshooting: If the cass-operator does not end up in Running status, or if any pods in later sections fail to start, it is recommended to use the OpenShift UI Events webpage for easy diagnostics.

Setup the Cassandra Cluster

The next step is to create the cluster. The following deployment file creates a 3 node cluster. It is largely a copy from the upstream cass-operator version 1.7.0 file example-cassdc-minimal.yaml but with a small modification made to allow all the pods to be deployed to the same worker node (as CodeReady Containers only uses one k8s node by default).

❯ oc apply -n cass-operator -f https://thelastpickle.com/files/openshift-intro/cass-operator-1.7.0-openshift-minimal-3.11.yaml

cassandradatacenter.cassandra.datastax.com/dc1 created

Let’s watch the pods get created, initialise, and eventually becoming running, using the kubectl get pods … watch command.

❯ kubectl get pods -w -n cass-operator

NAME                             READY   STATUS    RESTARTS   AGE
cass-operator-7675b65744-28fhw   1/1     Running   0          102s
cluster1-dc1-default-sts-0       0/2     Pending   0          0s
cluster1-dc1-default-sts-1       0/2     Pending   0          0s
cluster1-dc1-default-sts-2       0/2     Pending   0          0s
cluster1-dc1-default-sts-0       2/2     Running   0          3m
cluster1-dc1-default-sts-1       2/2     Running   0          3m
cluster1-dc1-default-sts-2       2/2     Running   0          3m

Use the Cassandra Cluster

With the Cassandra pods each up and running, the cluster is ready to be used. Test it out using the nodetool status command.

❯ kubectl -n cass-operator exec -it cluster1-dc1-default-sts-0 -- nodetool status

Defaulting container name to cassandra.
Use 'kubectl describe pod/cluster1-dc1-default-sts-0 -n cass-operator' to see all of the containers in this pod.
Datacenter: dc1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address      Load       Tokens       Owns (effective)  Host ID                               Rack
UN  10.217.0.73  84.42 KiB  1            83.6%             672baba8-9a05-45ac-aad1-46427027b57a  default
UN  10.217.0.72  70.2 KiB   1            65.3%             42758a86-ea7b-4e9b-a974-f9e71b958429  default
UN  10.217.0.71  65.31 KiB  1            51.1%             2fa73bc2-471a-4782-ae63-5a34cc27ab69  default

The above command can be run on `cluster1-dc1-default-sts-1` and `cluster1-dc1-default-sts-2` too.

Next, test out cqlsh. For this authentication is required, so first get the CQL username and password.

# Get the cql username
❯ kubectl -n cass-operator get secret cluster1-superuser -o yaml | grep " username" | awk -F" " '{print $2}' | base64 -d && echo ""

# Get the cql password
❯ kubectl -n cass-operator get secret cluster1-superuser -o yaml | grep " password" | awk -F" " '{print $2}' | base64 -d && echo ""

❯ kubectl -n cass-operator exec -it cluster1-dc1-default-sts-0 -- cqlsh -u <cql-username> -p <cql-password>

Connected to cluster1 at 127.0.0.1:9042.
[cqlsh 5.0.1 | Cassandra 3.11.7 | CQL spec 3.4.4 | Native protocol v4]
Use HELP for help.
cluster1-superuser@cqlsh>

Keep It Clean

CodeReady Containers are very simple to clean up, especially because it is a packaging of OpenShift intended only for development purposes. To wipe everything, just “delete”

❯ crc stop
❯ crc delete

If, on the other hand, you only want to delete individual steps, each of the following can be done (but in order).

❯ oc delete -n cass-operator -f https://thelastpickle.com/files/openshift-intro/cass-operator-1.7.0-openshift-minimal-3.11.yaml

❯ oc delete -f https://raw.githubusercontent.com/k8ssandra/cass-operator/v1.7.0/docs/user/cass-operator-manifests.yaml

❯ oc delete -f cass-operator-1.7.0-openshift-storage.yaml

On Scylla Manager Suspend & Resume feature

On Scylla Manager Suspend & Resume feature

!!! warning "Disclaimer" This blog post is neither a rant nor intended to undermine the great work that...

Apache Cassandra's Continuous Integration Systems

With Apache Cassandra 4.0 just around the corner, and the feature freeze on trunk lifted, let’s take a dive into the efforts ongoing with the project’s testing and Continuous Integration systems.

continuous integration in open source

Every software project benefits from sound testing practices and having a continuous integration in place. Even more so for open source projects. From contributors working around the world in many different timezones, particularly prone to broken builds and longer wait times and uncertainties, to contributors just not having the same communication bandwidths between each other because they work in different companies and are scratching different itches.

This is especially true for Apache Cassandra. As an early-maturity technology used everywhere on mission critical data, stability and reliability are crucial for deployments. Contributors from many companies: Alibaba, Amazon, Apple, Bloomberg, Dynatrace, DataStax, Huawei, Instaclustr, Netflix, Pythian, and more; need to coordinate and collaborate and most importantly trust each other.

During the feature freeze the project was fortunate to not just stabilise and fix tons of tests, but to also expand its continuous integration systems. This really helps set the stage for a post 4.0 roadmap that features heavy on pluggability, developer experience and safety, as well as aiming for an always-shippable trunk.

@ cassandra

The continuous integration systems at play are CircleCI and ci-cassandra.apache.org

CircleCI is a commercial solution. The main usage today of CircleCI is pre-commit, that is testing your patches while they get reviewed before they get merged. To effectively use CircleCI on Cassandra requires either the medium or high resource profiles that enables the use of hundreds of containers and lots of resources, and that’s basically only available for folk working in companies that are paying for a premium CircleCI account. There are lots stages to the CircleCI pipeline, and developers just trigger those stages they feel are relevant to test that patch on.

CircleCI pipeline

ci-cassandra is our community CI. It is based on CloudBees, provided by the ASF and running 40 agents (servers) around the world donated by numerous different companies in our community. Its main usage is post-commit, and its pipelines run every stage automatically. Today the pipeline consists of 40K tests. And for the first time in many years, on the lead up to 4.0, pipeline runs are completely green.

ci-cassandra pipeline

ci-cassandra is setup with a combination of Jenkins DSL script, and declarative Jenkinsfiles. These jobs use the build scripts found here.

forty thousand tests

The project has many types of tests. It has proper unit tests, and unit tests that have some embedded Cassandra server. The unit tests are run in a number of different parameterisations: from different Cassandra configuration, JDK 8 and JDK 11, to supporting the ARM architecture. There’s CQLSH tests written in Python against a single ccm node. Then there’s the Java distributed tests and Python distributed tests. The Python distributed tests are older, use CCM, and also run parameterised. The Java distributed tests are a recent addition and run the Cassandra nodes inside the JVM. Both types of distributed tests also include testing the upgrade paths of different Cassandra versions. Most new distributed tests today are written as Java distributed tests. There are also burn and microbench (JMH) tests.

distributed is difficult

Testing distributed tech is hardcore. Anyone who’s tried to run the Python upgrade dtests locally knows the pain. Running the tests in Docker helps a lot, and this is what CircleCI and ci-cassandra predominantly does. The base Docker images are found here. Distributed tests can fall over for numerous reasons, exacerbated in ci-cassandra with heterogenous servers around the world and all the possible network and disk issues that can occur. Just for the 4.0 release over 200 Jira tickets were focused just on strengthening flakey tests. Because ci-cassandra has limited storage, the logs and test results to all runs are archived in nightlies.apache.org/cassandra.

call for help

There’s still heaps to do. This is all part-time and volunteer efforts. No one in the community is dedicated to these systems or as a build engineer. The project can use all the help it can get.

There’s a ton of exciting stuff to add. Some examples are microbench and JMH reports, Jacoco test coverage reports, Harry for fuzz testing, Adelphi or Fallout for end-to-end performance and comparison testing, hooking up Apache Yetus for efficient resource usage, or putting our Jenkins stack into a k8s operator run script so you can run the pipeline on your own k8s cluster.

So don’t be afraid to jump in, pick your poison, we’d love to see you!

Reaper 2.2 for Apache Cassandra was released

We’re pleased to announce that Reaper 2.2 for Apache Cassandra was just released. This release includes a major redesign of how segments are orchestrated, which allows users to run concurrent repairs on nodes. Let’s dive into these changes and see what they mean for Reaper’s users.

New Segment Orchestration

Reaper works in a variety of standalone or distributed modes, which create some challenges in meeting the following requirements:

  • A segment is processed successfully exactly once.
  • No more than one segment is running on a node at once.
  • Segments can only be started if the number of pending compactions on a node involved is lower than the defined threshold.

To make sure a segment won’t be handled by several Reaper instances at once, Reaper relies on LightWeight Transactions (LWT) to implement a leader election process. A Reaper instance will “take the lead” on a segment by using a LWT and then perform the checks for the last two conditions above.

To avoid race conditions between two different segments involving a common set of replicas that would start at the same time, a “master lock” was placed after the checks to guarantee that a single segment would be able to start. This required a double check to be performed before actually starting the segment.

Segment Orchestration pre 2.2 design

There were several issues with this design:

  • It involved a lot of LWTs even if no segment could be started.
  • It was a complex design which made the code hard to maintain.
  • The “master lock” was creating a lot of contention as all Reaper instances would compete for the same partition, leading to some nasty situations. This was especially the case in sidecar mode as it involved running a lot of Reaper instances (one per Cassandra node).

As we were seeing suboptimal performance and high LWT contention in some setups, we redesigned how segments were orchestrated to reduce the number of LWTs and maximize concurrency during repairs (all nodes should be busy repairing if possible).
Instead of locking segments, we explored whether it would be possible to lock nodes instead. This approach would give us several benefits:

  • We could check which nodes are already busy without issuing JMX requests to the nodes.
  • We could easily filter segments to be processed to retain only those with available nodes.
  • We could remove the master lock as we would have no more race conditions between segments.

One of the hard parts was that locking several nodes in a consistent manner would be challenging as they would involve several rows, and Cassandra doesn’t have a concept of an atomic transaction that can be rolled back as RDBMS do. Luckily, we were able to leverage one feature of batch statements: All Cassandra batch statements which target a single partition will turn all operations into a single atomic one (at the node level). If one node out of all replicas was already locked, then none would be locked by the batched LWTs. We used the following model for the leader election table on nodes:

CREATE TABLE reaper_db.running_repairs (
    repair_id uuid,
    node text,
    reaper_instance_host text,
    reaper_instance_id uuid,
    segment_id uuid,
    PRIMARY KEY (repair_id, node)
) WITH CLUSTERING ORDER BY (node ASC)

The following LWTs are then issued in a batch for each replica:

BEGIN BATCH

UPDATE reaper_db.running_repairs USING TTL 90
SET reaper_instance_host = 'reaper-host-1', 
    reaper_instance_id = 62ce0425-ee46-4cdb-824f-4242ee7f86f4,
    segment_id = 70f52bc2-7519-11eb-809e-9f94505f3a3e
WHERE repair_id = 70f52bc0-7519-11eb-809e-9f94505f3a3e AND node = 'node1'
IF reaper_instance_id = null;

UPDATE reaper_db.running_repairs USING TTL 90
SET reaper_instance_host = 'reaper-host-1',
    reaper_instance_id = 62ce0425-ee46-4cdb-824f-4242ee7f86f4,
    segment_id = 70f52bc2-7519-11eb-809e-9f94505f3a3e
WHERE repair_id = 70f52bc0-7519-11eb-809e-9f94505f3a3e AND node = 'node2'
IF reaper_instance_id = null;

UPDATE reaper_db.running_repairs USING TTL 90
SET reaper_instance_host = 'reaper-host-1',
    reaper_instance_id = 62ce0425-ee46-4cdb-824f-4242ee7f86f4,
    segment_id = 70f52bc2-7519-11eb-809e-9f94505f3a3e
WHERE repair_id = 70f52bc0-7519-11eb-809e-9f94505f3a3e AND node = 'node3'
IF reaper_instance_id = null;

APPLY BATCH;

If all the conditional updates are able to be applied, we’ll get the following data in the table:

cqlsh> select * from reaper_db.running_repairs;

 repair_id                            | node  | reaper_instance_host | reaper_instance_id                   | segment_id
--------------------------------------+-------+----------------------+--------------------------------------+--------------------------------------
 70f52bc0-7519-11eb-809e-9f94505f3a3e | node1 |        reaper-host-1 | 62ce0425-ee46-4cdb-824f-4242ee7f86f4 | 70f52bc2-7519-11eb-809e-9f94505f3a3e
 70f52bc0-7519-11eb-809e-9f94505f3a3e | node2 |        reaper-host-1 | 62ce0425-ee46-4cdb-824f-4242ee7f86f4 | 70f52bc2-7519-11eb-809e-9f94505f3a3e
 70f52bc0-7519-11eb-809e-9f94505f3a3e | node3 |        reaper-host-1 | 62ce0425-ee46-4cdb-824f-4242ee7f86f4 | 70f52bc2-7519-11eb-809e-9f94505f3a3e
 

If one of the conditional updates fails because one node is already locked for the same repair_id, then none will be applied.

Note: the Postgres backend also benefits from these new features through the use of transactions, using commit and rollback to deal with success/failure cases.

The new design is now much simpler than the initial one:

Segment Orchestration post 2.2 design

Segments are now filtered on those that have no replica locked to avoid wasting energy in trying to lock them and the pending compactions check also happens before any locking.

This reduces the number of LWTs by four in the simplest cases and we expect more challenging repairs to benefit from even more reductions:

LWT improvements

At the same time, repair duration on a 9-node cluster showed 15%-20% improvements thanks to the more efficient segment selection.

One prerequisite to make that design efficient was to store the replicas for each segment in the database when the repair run is created. You can now see which nodes are involved for each segment and which datacenter they belong to in the Segments view:

Segments view

Concurrent repairs

Using the repair id as the partition key for the node leader election table gives us another feature that was long awaited: Concurrent repairs.
A node could be locked by different Reaper instances for different repair runs, allowing several repairs to run concurrently on each node. In order to control the level of concurrency, a new setting was introduced in Reaper: maxParallelRepairs
By default it is set to 2 and should be tuned carefully as heavy concurrent repairs could have a negative impact on clusters performance.
If you have small keyspaces that need to be repaired on a regular basis, they won’t be blocked by large keyspaces anymore.

Future upgrades

As some of you are probably aware, JFrog has decided to sunset Bintray and JCenter. Bintray is our main distribution medium and we will be working on replacement repositories. The 2.2.0 release is unaffected by this change but future upgrades could require an update to yum/apt repos. The documentation will be updated accordingly in due time.

Upgrade now

We encourage all Reaper users to upgrade to 2.2.0. It was tested successfully by some of our customers which had issues with LWT pressure and blocking repairs. This version is expected to make repairs faster and more lightweight on the Cassandra backend. We were able to remove a lot of legacy code and design which were fit to single token clusters, but failed at spreading segments efficiently for clusters using vnodes.

The binaries for Reaper 2.2.0 are available from yum, apt-get, Maven Central, Docker Hub, and are also downloadable as tarball packages. Remember to backup your database before starting the upgrade.

All instructions to download, install, configure, and use Reaper 2.2.0 are available on the Reaper website.

Creating Flamegraphs with Apache Cassandra in Kubernetes (cass-operator)

In a previous blog post recommending disabling read repair chance, some flamegraphs were generated to demonstrate the effect read repair chance had on a cluster. Let’s go through how those flamegraphs were captured, step-by-step using Apache Cassandra 3.11.6, Kubernetes and the cass-operator, nosqlbench and the async-profiler.

In previous blog posts we would have used the existing tools of tlp-cluster or ccm, tlp-stress or cassandra-stress, and sjk. Here we take a new approach that is a lot more fun, as with k8s the same approach can be used locally or in the cloud. No need to switch between ccm clusters for local testing and tlp-cluster for cloud testing. Nor are you bound to AWS for big instance testing, that’s right: no vendor lock-in. Cass-operator and K8ssandra is getting a ton of momentum from DataStax, so it is only deserved and exciting to introduce them to as much of the open source world as we can.

This blog post is not an in-depth dive into using cass-operator, rather a simple teaser to demonstrate how we can grab some flamegraphs, as quickly as possible. The blog post is split into three sections

  • Setting up Kubernetes and getting Cassandra running
  • Getting access to Cassandra from outside Kubernetes
  • Stress testing and creating flamegraphs

Setup

Let’s go through a quick demonstration using Kubernetes, the cass-operator, and some flamegraphs.

First, download four yaml configuration files that will be used. This is not strictly necessary for the latter three, as kubectl may reference them by their URLs, but let’s download them for the sake of having the files locally and being able to make edits if and when desired.

wget https://thelastpickle.com/files/2021-01-31-cass_operator/01-kind-config.yaml
wget https://thelastpickle.com/files/2021-01-31-cass_operator/02-storageclass-kind.yaml
wget https://thelastpickle.com/files/2021-01-31-cass_operator/11-install-cass-operator-v1.1.yaml
wget https://thelastpickle.com/files/2021-01-31-cass_operator/13-cassandra-cluster-3nodes.yaml

The next steps involve kind and kubectl to create a local cluster we can test. To use kind you have docker running locally, it is recommended to have 4 CPU and 12GB RAM for this exercise.

kind create cluster --name read-repair-chance-test --config 01-kind-config.yaml

kubectl create ns cass-operator
kubectl -n cass-operator apply -f 02-storageclass-kind.yaml
kubectl -n cass-operator apply -f 11-install-cass-operator-v1.1.yaml

# watch and wait until the pod is running
watch kubectl -n cass-operator get pod

# create 3 node C* cluster
kubectl -n cass-operator apply -f 13-cassandra-cluster-3nodes.yaml

# again, wait for pods to be running
watch kubectl -n cass-operator get pod

# test the three nodes are up
kubectl -n cass-operator exec -it cluster1-dc1-default-sts-0 -- nodetool status

Access

For this example we are going to run NoSqlBench from outside the k8s cluster, so we will need access to a pod’s Native Protocol interface via port-forwarding. This approach is practical here because it was desired to have the benchmark connect to just one coordinator. In many situations you would instead run NoSqlBench from a separate dedicated pod inside the k8s cluster.

# get the cql username
kubectl -n cass-operator get secret cluster1-superuser -o yaml | grep " username" | awk -F" " '{print $2}' | base64 -d && echo ""

# get the cql password
kubectl -n cass-operator get secret cluster1-superuser -o yaml | grep " password" | awk -F" " '{print $2}' | base64 -d && echo ""

# port forward the native protocol (CQL)
kubectl -n cass-operator port-forward --address 0.0.0.0 cluster1-dc1-default-sts-0 9042:9042 

The above sets up the k8s cluster, a k8s storageClass, and the cass-operator with a three node Cassandra cluster. For a more in depth look at this setup checkout this tutorial.

Stress Testing and Flamegraphs

With a cluster to play with, let’s generate some load and then go grab some flamegraphs.

Instead of using SJK (Swiss Java Knife), as our previous blog posts have done, we will use the async-profiler. The async-profiler does not suffer from Safepoint bias problem, an issue we see more often than we would like in Cassandra nodes (protip: make sure you configure ParallelGCThreads and ConcGCThreads to the same value).

Open a new terminal window and do the following.

# get the latest NoSqlBench jarfile
wget https://github.com/nosqlbench/nosqlbench/releases/latest/download/nb.jar

# generate some load, use credentials as found above

java -jar nb.jar cql-keyvalue username=<cql_username> password=<cql_password> whitelist=127.0.0.1 rampup-cycles=10000 main-cycles=500000 rf=3 read_cl=LOCAL_ONE

# while the load is still running,
# open a shell in the coordinator pod, download async-profiler and generate a flamegraph
kubectl -n cass-operator exec -it cluster1-dc1-default-sts-0 -- /bin/bash

wget https://github.com/jvm-profiling-tools/async-profiler/releases/download/v1.8.3/async-profiler-1.8.3-linux-x64.tar.gz

tar xvf async-profiler-1.8.3-linux-x64.tar.gz

async-profiler-1.8.3-linux-x64/profiler.sh -d 300 -f /tmp/flame_away.svg <CASSANDRA_PID>

exit

# copy the flamegraph out of the pod
kubectl -n cass-operator cp cluster1-dc1-default-sts-0:/tmp/flame_away.svg flame_away.svg

Keep It Clean

After everything is done, it is time to clean up after yourself.

Delete the CassandraDatacenters first, otherwise Kubernetes will block deletion because we use a finalizer. Note, this will delete all data in the cluster.

kubectl delete cassdcs --all-namespaces --all

Remove the operator Deployment, CRD, etc.

# this command can take a while, be patient

kubectl delete -f https://raw.githubusercontent.com/datastax/cass-operator/v1.5.1/docs/user/cass-operator-manifests-v1.16.yaml

# if troubleshooting, to forcibly remove resources, though
# this should not be necessary, and take care as this will wipe all resources

kubectl delete "$(kubectl api-resources --namespaced=true --verbs=delete -o name | tr "\n" "," | sed -e 's/,$//')" --all

To remove the local Kubernetes cluster altogether

kind delete cluster --name read-repair-chance-test

To stop and remove the docker containers that are left running…

docker stop $(docker ps | grep kindest | cut -d" " -f1)
docker rm $(docker ps -a | grep kindest | cut -d" " -f1)

More… the cass-operator tutorials

There is a ton of documentation and tutorials getting released on how to use the cass-operator. If you are keen to learn more the following is highly recommended: Managing Cassandra Clusters in Kubernetes Using Cass-Operator.

The Impacts of Changing the Number of VNodes in Apache Cassandra

Apache Cassandra’s default value for num_tokens is about to change in 4.0! This might seem like a small edit note in the CHANGES.txt, however such a change can have a profound effect on day-to-day operations of the cluster. In this post we will examine how changing the value for num_tokens impacts the cluster and its behaviour.

There are many knobs and levers that can be modified in Apache Cassandra to tune its behaviour. The num_tokens setting is one of those. Like many settings it lives in the cassandra.yaml file and has a defined default value. That’s where it stops being like many of Cassandra’s settings. You see, most of Cassandra’s settings will only affect a single aspect of the cluster. However, when changing the value of num_tokens there is an array of behaviours that are altered. The Apache Cassandra project has committed and resolved CASSANDRA-13701 which changed the default value for num_tokens from 256 to 16. This change is significant, and to understand the consequences we first need to understand the role that num_tokens play in the cluster.

Never try this on production

Before we dive into any details it is worth noting that the num_tokens setting on a node should never ever be changed once it has joined the cluster. For one thing the node will fail on a restart. The value of this setting should be the same for every node in a datacenter. Historically, different values were expected for heterogeneous clusters. While it’s rare to see, nor would we recommend, you can still in theory double the num_tokens on nodes that are twice as big in terms of hardware specifications. Furthermore, it is common to see the nodes in a datacenter have a value for num_tokens that differs to nodes in another datacenter. This is partly how changing the value of this setting on a live cluster can be safely done with zero downtime. It is out of scope for this blog post, but details can be found in migration to a new datacenter.

The Basics

The num_tokens setting influences the way Cassandra allocates data amongst the nodes, how that data is retrieved, and how that data is moved between nodes.

Under the hood Cassandra uses a partitioner to decide where data is stored in the cluster. The partitioner is a consistent hashing algorithm that maps a partition key (first part of the primary key) to a token. The token dictates which nodes will contain the data associated with the partition key. Each node in the cluster is assigned one or more unique token values from a token ring. This is just a fancy way of saying each node is assigned a number from a circular number range. That is, “the number” being the token hash, and “a circular number range” being the token ring. The token ring is circular because the next value after the maximum token value is the minimum token value.

An assigned token defines the range of tokens in the token ring the node is responsible for. This is commonly known as a “token range”. The “token range” a node is responsible for is bounded by its assigned token, and the next smallest token value going backwards in the ring. The assigned token is included in the range, and the smallest token value going backwards is excluded from the range. The smallest token value going backwards typically resides on the previous neighbouring node. Having a circular token ring means that the range of tokens a node is responsible for, could include both the minimum and maximum tokens in the ring. In at least one case the smallest token value going backwards will wrap back past the maximum token value in the ring.

For example, in the following Token Ring Assignment diagram we have a token ring with a range of hashes from 0 to 99. Token 10 is allocated to Node 1. The node before Node 1 in the cluster is Node 5. Node 5 is allocated token 90. Therefore, the range of tokens that Node 1 is responsible for is between 91 and 10. In this particular case, the token range wraps around past the maximum token in the ring.

Token ring

Note that the above diagram is for only a single data replica. This is because only a single node is assigned to each token in the token ring. If multiple replicas of the data exists, a node’s neighbours become replicas for the token as well. This is illustrated in the Token Ring Assignment diagram below.

Token ring

The reason the partitioner is defined as a consistent hashing algorithm is because it is just that; no matter how many times you feed in a specific input, it will always generate the same output value. It ensures that every node, coordinator, or otherwise, will always calculate the same token for a given partition key. The calculated token can then be used to reliably pinpoint the nodes with the sought after data.

Consequently, the minimum and maximum numbers for the token ring are defined by the partitioner. The default Murur3Partitioner based on the Murmur hash has for example, a minimum and maximum range of -2^63 to +2^63 - 1. The legacy RandomPartitioner (based on the MD5 hash) on the other hand has a range of 0 to 2^127 - 1. A critical side effect of this system is that once a partitioner for a cluster is picked, it can never be changed. Changing to a different partitioner requires the creation of a new cluster with the desired partitioner and then reloading the data into the new cluster.

Further information on consistent hashing functionality can be found in the Apache Cassandra documentation.

Back in the day…

Back in the pre-1.2 era, nodes could only be manually assigned a single token. This was done and can still be done today using the initial_token setting in the cassandra.yaml file. The default partitioner at that point was the RandomPartitioner. Despite token assignment being manual, the partitioner made the process of calculating the assigned tokens fairly straightforward when setting up a cluster from scratch. For example, if you had a three node cluster you would divide 2^127 - 1 by 3 and the quotient would give you the correct increment amount for each token value. Your first node would have an initial_token of 0, your next node would have an initial_token of (2^127 - 1) / 3, and your third node would have an initial_token of (2^127 - 1) / 3 * 2. Thus, each node will have the same sized token ranges.

Dividing the token ranges up evenly makes it less likely individual nodes are overloaded (assuming identical hardware for the nodes, and an even distribution of data across the cluster). Uneven token distribution can result in what is termed “hot spots”. This is where a node is under pressure as it is servicing more requests or carrying more data than other nodes.

Even though setting up a single token cluster can be a very manual process, their deployment is still common. Especially for very large Cassandra clusters where the node count typically exceeds 1,000 nodes. One of the advantages of this type of deployment, is you can ensure that the token distribution is even.

Although setting up a single token cluster from scratch can result in an even load distribution, growing the cluster is far less straight forward. If you insert a single node into your three node cluster, the result is that two out of the four nodes will have a smaller token range than the other two nodes. To fix this problem and re-balance, you then have to run nodetool move to relocate tokens to other nodes. This is a tedious and expensive task though, involving a lot of streaming around the whole cluster. The alternative is to double the size of your cluster each time you expand it. However, this usually means using more hardware than you need. Much like having an immaculate backyard garden, maintaining an even token range per node in a single token cluster requires time, care, and attention, or alternatively, a good deal of clever automation.

Scaling in a single token world is only half the challenge. Certain failure scenarios heavily reduce time to recovery. Let’s say for example you had a six node cluster with three replicas of the data in a single datacenter (Replication Factor = 3). Replicas might reside on Node 1 and Node 4, Node 2 and Node 5, and lastly on Node 3 and Node 6. In this scenario each node is responsible for a sixth of each of the three replicas.

Six node cluster and three replicas

In the above diagram, the tokens in the token ring are assigned an alpha character. This is to make tracking the token assignment to each node easier to follow. If the cluster had an outage where Node 1 and Node 6 are unavailable, you could only use Nodes 2 and 5 to recover the unique sixth of the data they each have. That is, only Node 2 could be used to recover the data associated with token range ‘F’, and similarly only Node 5 could be used to recover the data associated with token range ‘E’. This is illustrated in the diagram below.

Six node cluster and three replicas failures scenario

vnodes to the rescue

To solve the shortcomings of a single token assignment, Cassandra version 1.2 was enhanced to allow a node to be assigned multiple tokens. That is a node could be responsible for multiple token ranges. This Cassandra feature is known as “virtual node” or vnodes for short. The vnodes feature was introduced via CASSANDRA-4119. As per the ticket description, the goals of vnodes were:

  • Reduced operations complexity for scaling up/down.
  • Reduced rebuild time in event of failure.
  • Evenly distributed load impact in the event of failure.
  • Evenly distributed impact of streaming operations.
  • More viable support for heterogeneity of hardware.

The introduction of this feature gave birth to the num_tokens setting in the cassandra.yaml file. The setting defined the number of vnodes (token ranges) a node was responsible for. By increasing the number of vnodes per node, the token ranges become smaller. This is because the token ring has a finite number of tokens. The more ranges it is divided up into the smaller each range is.

To maintain backwards compatibility with older 1.x series clusters, the num_tokens defaulted to a value of 1. Moreover, the setting was effectively disabled on a vanilla installation. Specifically, the value in the cassandra.yaml file was commented out. The commented line and previous development commits did give a glimpse into the future of where the feature was headed though.

As foretold by the cassandra.yaml file, and the git commit history, when Cassandra version 2.0 was released out the vnodes feature was enabled by default. The num_tokens line was no longer commented out, so its effective default value on a vanilla installation was 256. Thus ushering in a new era of clusters that had relatively even token distributions, and were simple to grow.

With nodes consisting of 256 vnodes and the accompanying additional features, expanding the cluster was a dream. You could insert one new node into your cluster and Cassandra would calculate and assign the tokens automatically! The token values were randomly calculated, and so over time as you added more nodes, the cluster would converge on being in a balanced state. This engineering wizardry put an end to spending hours doing calculations and nodetool move operations to grow a cluster. The option was still there though. If you had a very large cluster or another requirement, you could still use the initial_token setting which was commented out in Cassandra version 2.0. In this case, the value for the num_tokens still had to be set to the number of tokens manually defined in the initial_token setting.

Remember to read the fine print

This gave us a feature that was like a personal devops assistant; you handed them a node, told them to insert it, and then after some time it had tokens allocated and was part of the cluster. However, in a similar vein, there is a price to pay for the convenience…

While we get a more even token distribution when using 256 vnodes, the problem is that availability degrades earlier. Ironically, the more we break the token ranges up the more quickly we can get data unavailability. Then there is the issue of unbalanced token ranges when using a small number of vnodes. By small, I mean values less than 32. Cassandra’s random token allocation is hopeless when it comes to small vnode values. This is because there are insufficient tokens to balance out the wildly different token range sizes that are generated.

Pics or it didn’t happen

It is very easy to demonstrate the availability and token range imbalance issues, using a test cluster. We can set up a single token range cluster with six nodes using ccm. After calculating the tokens, configuring and starting our test cluster, it looked like this.

$ ccm node1 nodetool status

Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address    Load       Tokens       Owns (effective)  Host ID                               Rack
UN  127.0.0.1  71.17 KiB  1            33.3%             8d483ae7-e7fa-4c06-9c68-22e71b78e91f  rack1
UN  127.0.0.2  65.99 KiB  1            33.3%             cc15803b-2b93-40f7-825f-4e7bdda327f8  rack1
UN  127.0.0.3  85.3 KiB   1            33.3%             d2dd4acb-b765-4b9e-a5ac-a49ec155f666  rack1
UN  127.0.0.4  104.58 KiB  1            33.3%             ad11be76-b65a-486a-8b78-ccf911db4aeb  rack1
UN  127.0.0.5  71.19 KiB  1            33.3%             76234ece-bf24-426a-8def-355239e8f17b  rack1
UN  127.0.0.6  30.45 KiB  1            33.3%             cca81c64-d3b9-47b8-ba03-46356133401b  rack1

We can then create a test keyspace and populated it using cqlsh.

$ ccm node1 cqlsh
Connected to SINGLETOKEN at 127.0.0.1:9042.
[cqlsh 5.0.1 | Cassandra 3.11.9 | CQL spec 3.4.4 | Native protocol v4]
Use HELP for help.
cqlsh> CREATE KEYSPACE test_keyspace WITH REPLICATION = { 'class' : 'NetworkTopologyStrategy', 'datacenter1' : 3 };
cqlsh> CREATE TABLE test_keyspace.test_table (
...   id int,
...   value text,
...   PRIMARY KEY (id));
cqlsh> CONSISTENCY LOCAL_QUORUM;
Consistency level set to LOCAL_QUORUM.
cqlsh> INSERT INTO test_keyspace.test_table (id, value) VALUES (1, 'foo');
cqlsh> INSERT INTO test_keyspace.test_table (id, value) VALUES (2, 'bar');
cqlsh> INSERT INTO test_keyspace.test_table (id, value) VALUES (3, 'net');
cqlsh> INSERT INTO test_keyspace.test_table (id, value) VALUES (4, 'moo');
cqlsh> INSERT INTO test_keyspace.test_table (id, value) VALUES (5, 'car');
cqlsh> INSERT INTO test_keyspace.test_table (id, value) VALUES (6, 'set');

To confirm that the cluster is perfectly balanced, we can check the token ring.

$ ccm node1 nodetool ring test_keyspace


Datacenter: datacenter1
==========
Address    Rack   Status  State   Load        Owns     Token
                                                       6148914691236517202
127.0.0.1  rack1  Up      Normal  125.64 KiB  50.00%   -9223372036854775808
127.0.0.2  rack1  Up      Normal  125.31 KiB  50.00%   -6148914691236517206
127.0.0.3  rack1  Up      Normal  124.1 KiB   50.00%   -3074457345618258604
127.0.0.4  rack1  Up      Normal  104.01 KiB  50.00%   -2
127.0.0.5  rack1  Up      Normal  126.05 KiB  50.00%   3074457345618258600
127.0.0.6  rack1  Up      Normal  120.76 KiB  50.00%   6148914691236517202

We can see in the “Owns” column all nodes have 50% ownership of the data. To make the example easier to follow we can manually add a letter representation next to each token number. So the token ranges could be represented in the following way:

$ ccm node1 nodetool ring test_keyspace


Datacenter: datacenter1
==========
Address    Rack   Status  State   Load        Owns     Token                 Token Letter
                                                       6148914691236517202   F
127.0.0.1  rack1  Up      Normal  125.64 KiB  50.00%   -9223372036854775808  A
127.0.0.2  rack1  Up      Normal  125.31 KiB  50.00%   -6148914691236517206  B
127.0.0.3  rack1  Up      Normal  124.1 KiB   50.00%   -3074457345618258604  C
127.0.0.4  rack1  Up      Normal  104.01 KiB  50.00%   -2                    D
127.0.0.5  rack1  Up      Normal  126.05 KiB  50.00%   3074457345618258600   E
127.0.0.6  rack1  Up      Normal  120.76 KiB  50.00%   6148914691236517202   F

We can then capture the output of ccm node1 nodetool describering test_keyspace and change the token numbers to the corresponding letters in the above token ring output.

$ ccm node1 nodetool describering test_keyspace

Schema Version:6256fe3f-a41e-34ac-ad76-82dba04d92c3
TokenRange:
  TokenRange(start_token:A, end_token:B, endpoints:[127.0.0.2, 127.0.0.3, 127.0.0.4], rpc_endpoints:[127.0.0.2, 127.0.0.3, 127.0.0.4], endpoint_details:[EndpointDetails(host:127.0.0.2, datacenter:datacenter1, rack:rack1), EndpointDetails(host:127.0.0.3, datacenter:datacenter1, rack:rack1), EndpointDetails(host:127.0.0.4, datacenter:datacenter1, rack:rack1)])
  TokenRange(start_token:C, end_token:D, endpoints:[127.0.0.4, 127.0.0.5, 127.0.0.6], rpc_endpoints:[127.0.0.4, 127.0.0.5, 127.0.0.6], endpoint_details:[EndpointDetails(host:127.0.0.4, datacenter:datacenter1, rack:rack1), EndpointDetails(host:127.0.0.5, datacenter:datacenter1, rack:rack1), EndpointDetails(host:127.0.0.6, datacenter:datacenter1, rack:rack1)])
  TokenRange(start_token:B, end_token:C, endpoints:[127.0.0.3, 127.0.0.4, 127.0.0.5], rpc_endpoints:[127.0.0.3, 127.0.0.4, 127.0.0.5], endpoint_details:[EndpointDetails(host:127.0.0.3, datacenter:datacenter1, rack:rack1), EndpointDetails(host:127.0.0.4, datacenter:datacenter1, rack:rack1), EndpointDetails(host:127.0.0.5, datacenter:datacenter1, rack:rack1)])
  TokenRange(start_token:D, end_token:E, endpoints:[127.0.0.5, 127.0.0.6, 127.0.0.1], rpc_endpoints:[127.0.0.5, 127.0.0.6, 127.0.0.1], endpoint_details:[EndpointDetails(host:127.0.0.5, datacenter:datacenter1, rack:rack1), EndpointDetails(host:127.0.0.6, datacenter:datacenter1, rack:rack1), EndpointDetails(host:127.0.0.1, datacenter:datacenter1, rack:rack1)])
  TokenRange(start_token:F, end_token:A, endpoints:[127.0.0.1, 127.0.0.2, 127.0.0.3], rpc_endpoints:[127.0.0.1, 127.0.0.2, 127.0.0.3], endpoint_details:[EndpointDetails(host:127.0.0.1, datacenter:datacenter1, rack:rack1), EndpointDetails(host:127.0.0.2, datacenter:datacenter1, rack:rack1), EndpointDetails(host:127.0.0.3, datacenter:datacenter1, rack:rack1)])
  TokenRange(start_token:E, end_token:F, endpoints:[127.0.0.6, 127.0.0.1, 127.0.0.2], rpc_endpoints:[127.0.0.6, 127.0.0.1, 127.0.0.2], endpoint_details:[EndpointDetails(host:127.0.0.6, datacenter:datacenter1, rack:rack1), EndpointDetails(host:127.0.0.1, datacenter:datacenter1, rack:rack1), EndpointDetails(host:127.0.0.2, datacenter:datacenter1, rack:rack1)])

Using the above output, specifically the end_token, we can determine all the token ranges assigned to each node. As mentioned earlier, the token range is defined by the values after the previous token (start_token) up to and including the assigned token (end_token). The token ranges assigned to each node looked like this:

Six node cluster and three replicas

In this setup, if node3 and node6 were unavailable, we would lose an entire replica. Even if the application is using a Consistency Level of LOCAL_QUORUM, all the data is still available. We still have two other replicas across the other four nodes.

Now let’s consider the case where our cluster is using vnodes. For example purposes we can set num_tokens to 3. It will give us a smaller number of tokens making for an easier to follow example. After configuring and starting the nodes in ccm, our test cluster initially looked like this.

For the majority of production deployments where the cluster size is less than 500 nodes, it is recommended that you use a larger value for `num_tokens`. Further information can be found in the Apache Cassandra Production Recommendations.
$ ccm node1 nodetool status

Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address    Load       Tokens  Owns (effective)  Host ID                               Rack
UN  127.0.0.1  71.21 KiB  3       46.2%             7d30cbd4-8356-4189-8c94-0abe8e4d4d73  rack1
UN  127.0.0.2  66.04 KiB  3       37.5%             16bb0b37-2260-440c-ae2a-08cbf9192f85  rack1
UN  127.0.0.3  90.48 KiB  3       28.9%             dc8c9dfd-cf5b-470c-836d-8391941a5a7e  rack1
UN  127.0.0.4  104.64 KiB  3      20.7%             3eecfe2f-65c4-4f41-bbe4-4236bcdf5bd2  rack1
UN  127.0.0.5  66.09 KiB  3       36.1%             4d5adf9f-fe0d-49a0-8ab3-e1f5f9f8e0a2  rack1
UN  127.0.0.6  71.23 KiB  3       30.6%             b41496e6-f391-471c-b3c4-6f56ed4442d6  rack1

Right off the blocks we can see signs that the cluster might be unbalanced. Similar to what we did with the single node cluster, here we create the test keyspace and populate it using cqlsh. We then grab a read out of the token ring to see what that looks like. Once again, to make the example easier to follow we manually add a letter representation next to each token number.

$ ccm node1 nodetool ring test_keyspace

Datacenter: datacenter1
==========
Address    Rack   Status  State   Load        Owns    Token                 Token Letter
                                                      8828652533728408318   R
127.0.0.5  rack1  Up      Normal  121.09 KiB  41.44%  -7586808982694641609  A
127.0.0.1  rack1  Up      Normal  126.49 KiB  64.03%  -6737339388913371534  B
127.0.0.2  rack1  Up      Normal  126.04 KiB  66.60%  -5657740186656828604  C
127.0.0.3  rack1  Up      Normal  135.71 KiB  39.89%  -3714593062517416200  D
127.0.0.6  rack1  Up      Normal  126.58 KiB  40.07%  -2697218374613409116  E
127.0.0.1  rack1  Up      Normal  126.49 KiB  64.03%  -1044956249817882006  F
127.0.0.2  rack1  Up      Normal  126.04 KiB  66.60%  -877178609551551982   G
127.0.0.4  rack1  Up      Normal  110.22 KiB  47.96%  -852432543207202252   H
127.0.0.5  rack1  Up      Normal  121.09 KiB  41.44%  117262867395611452    I
127.0.0.6  rack1  Up      Normal  126.58 KiB  40.07%  762725591397791743    J
127.0.0.3  rack1  Up      Normal  135.71 KiB  39.89%  1416289897444876127   K
127.0.0.1  rack1  Up      Normal  126.49 KiB  64.03%  3730403440915368492   L
127.0.0.4  rack1  Up      Normal  110.22 KiB  47.96%  4190414744358754863   M
127.0.0.2  rack1  Up      Normal  126.04 KiB  66.60%  6904945895761639194   N
127.0.0.5  rack1  Up      Normal  121.09 KiB  41.44%  7117770953638238964   O
127.0.0.4  rack1  Up      Normal  110.22 KiB  47.96%  7764578023697676989   P
127.0.0.3  rack1  Up      Normal  135.71 KiB  39.89%  8123167640761197831   Q
127.0.0.6  rack1  Up      Normal  126.58 KiB  40.07%  8828652533728408318   R

As we can see from the “Owns” column above, there are some large token range ownership imbalances. The smallest token range ownership is by node 127.0.0.3 at 39.89%. The largest token range ownership is by node 127.0.0.2 at 66.6%. This is about 26% difference!

Once again, we capture the output of ccm node1 nodetool describering test_keyspace and change the token numbers to the corresponding letters in the above token ring.

$ ccm node1 nodetool describering test_keyspace

Schema Version:4b2dc440-2e7c-33a4-aac6-ffea86cb0e21
TokenRange:
    TokenRange(start_token:J, end_token:K, endpoints:[127.0.0.3, 127.0.0.1, 127.0.0.4], rpc_endpoints:[127.0.0.3, 127.0.0.1, 127.0.0.4], endpoint_details:[EndpointDetails(host:127.0.0.3, datacenter:datacenter1, rack:rack1), EndpointDetails(host:127.0.0.1, datacenter:datacenter1, rack:rack1), EndpointDetails(host:127.0.0.4, datacenter:datacenter1, rack:rack1)])
    TokenRange(start_token:K, end_token:L, endpoints:[127.0.0.1, 127.0.0.4, 127.0.0.2], rpc_endpoints:[127.0.0.1, 127.0.0.4, 127.0.0.2], endpoint_details:[EndpointDetails(host:127.0.0.1, datacenter:datacenter1, rack:rack1), EndpointDetails(host:127.0.0.4, datacenter:datacenter1, rack:rack1), EndpointDetails(host:127.0.0.2, datacenter:datacenter1, rack:rack1)])
    TokenRange(start_token:E, end_token:F, endpoints:[127.0.0.1, 127.0.0.2, 127.0.0.4], rpc_endpoints:[127.0.0.1, 127.0.0.2, 127.0.0.4], endpoint_details:[EndpointDetails(host:127.0.0.1, datacenter:datacenter1, rack:rack1), EndpointDetails(host:127.0.0.2, datacenter:datacenter1, rack:rack1), EndpointDetails(host:127.0.0.4, datacenter:datacenter1, rack:rack1)])
    TokenRange(start_token:D, end_token:E, endpoints:[127.0.0.6, 127.0.0.1, 127.0.0.2], rpc_endpoints:[127.0.0.6, 127.0.0.1, 127.0.0.2], endpoint_details:[EndpointDetails(host:127.0.0.6, datacenter:datacenter1, rack:rack1), EndpointDetails(host:127.0.0.1, datacenter:datacenter1, rack:rack1), EndpointDetails(host:127.0.0.2, datacenter:datacenter1, rack:rack1)])
    TokenRange(start_token:I, end_token:J, endpoints:[127.0.0.6, 127.0.0.3, 127.0.0.1], rpc_endpoints:[127.0.0.6, 127.0.0.3, 127.0.0.1], endpoint_details:[EndpointDetails(host:127.0.0.6, datacenter:datacenter1, rack:rack1), EndpointDetails(host:127.0.0.3, datacenter:datacenter1, rack:rack1), EndpointDetails(host:127.0.0.1, datacenter:datacenter1, rack:rack1)])
    TokenRange(start_token:A, end_token:B, endpoints:[127.0.0.1, 127.0.0.2, 127.0.0.3], rpc_endpoints:[127.0.0.1, 127.0.0.2, 127.0.0.3], endpoint_details:[EndpointDetails(host:127.0.0.1, datacenter:datacenter1, rack:rack1), EndpointDetails(host:127.0.0.2, datacenter:datacenter1, rack:rack1), EndpointDetails(host:127.0.0.3, datacenter:datacenter1, rack:rack1)])
    TokenRange(start_token:R, end_token:A, endpoints:[127.0.0.5, 127.0.0.1, 127.0.0.2], rpc_endpoints:[127.0.0.5, 127.0.0.1, 127.0.0.2], endpoint_details:[EndpointDetails(host:127.0.0.5, datacenter:datacenter1, rack:rack1), EndpointDetails(host:127.0.0.1, datacenter:datacenter1, rack:rack1), EndpointDetails(host:127.0.0.2, datacenter:datacenter1, rack:rack1)])
    TokenRange(start_token:M, end_token:N, endpoints:[127.0.0.2, 127.0.0.5, 127.0.0.4], rpc_endpoints:[127.0.0.2, 127.0.0.5, 127.0.0.4], endpoint_details:[EndpointDetails(host:127.0.0.2, datacenter:datacenter1, rack:rack1), EndpointDetails(host:127.0.0.5, datacenter:datacenter1, rack:rack1), EndpointDetails(host:127.0.0.4, datacenter:datacenter1, rack:rack1)])
    TokenRange(start_token:H, end_token:I, endpoints:[127.0.0.5, 127.0.0.6, 127.0.0.3], rpc_endpoints:[127.0.0.5, 127.0.0.6, 127.0.0.3], endpoint_details:[EndpointDetails(host:127.0.0.5, datacenter:datacenter1, rack:rack1), EndpointDetails(host:127.0.0.6, datacenter:datacenter1, rack:rack1), EndpointDetails(host:127.0.0.3, datacenter:datacenter1, rack:rack1)])
    TokenRange(start_token:L, end_token:M, endpoints:[127.0.0.4, 127.0.0.2, 127.0.0.5], rpc_endpoints:[127.0.0.4, 127.0.0.2, 127.0.0.5], endpoint_details:[EndpointDetails(host:127.0.0.4, datacenter:datacenter1, rack:rack1), EndpointDetails(host:127.0.0.2, datacenter:datacenter1, rack:rack1), EndpointDetails(host:127.0.0.5, datacenter:datacenter1, rack:rack1)])
    TokenRange(start_token:N, end_token:O, endpoints:[127.0.0.5, 127.0.0.4, 127.0.0.3], rpc_endpoints:[127.0.0.5, 127.0.0.4, 127.0.0.3], endpoint_details:[EndpointDetails(host:127.0.0.5, datacenter:datacenter1, rack:rack1), EndpointDetails(host:127.0.0.4, datacenter:datacenter1, rack:rack1), EndpointDetails(host:127.0.0.3, datacenter:datacenter1, rack:rack1)])
    TokenRange(start_token:P, end_token:Q, endpoints:[127.0.0.3, 127.0.0.6, 127.0.0.5], rpc_endpoints:[127.0.0.3, 127.0.0.6, 127.0.0.5], endpoint_details:[EndpointDetails(host:127.0.0.3, datacenter:datacenter1, rack:rack1), EndpointDetails(host:127.0.0.6, datacenter:datacenter1, rack:rack1), EndpointDetails(host:127.0.0.5, datacenter:datacenter1, rack:rack1)])
    TokenRange(start_token:Q, end_token:R, endpoints:[127.0.0.6, 127.0.0.5, 127.0.0.1], rpc_endpoints:[127.0.0.6, 127.0.0.5, 127.0.0.1], endpoint_details:[EndpointDetails(host:127.0.0.6, datacenter:datacenter1, rack:rack1), EndpointDetails(host:127.0.0.5, datacenter:datacenter1, rack:rack1), EndpointDetails(host:127.0.0.1, datacenter:datacenter1, rack:rack1)])
    TokenRange(start_token:F, end_token:G, endpoints:[127.0.0.2, 127.0.0.4, 127.0.0.5], rpc_endpoints:[127.0.0.2, 127.0.0.4, 127.0.0.5], endpoint_details:[EndpointDetails(host:127.0.0.2, datacenter:datacenter1, rack:rack1), EndpointDetails(host:127.0.0.4, datacenter:datacenter1, rack:rack1), EndpointDetails(host:127.0.0.5, datacenter:datacenter1, rack:rack1)])
    TokenRange(start_token:C, end_token:D, endpoints:[127.0.0.3, 127.0.0.6, 127.0.0.1], rpc_endpoints:[127.0.0.3, 127.0.0.6, 127.0.0.1], endpoint_details:[EndpointDetails(host:127.0.0.3, datacenter:datacenter1, rack:rack1), EndpointDetails(host:127.0.0.6, datacenter:datacenter1, rack:rack1), EndpointDetails(host:127.0.0.1, datacenter:datacenter1, rack:rack1)])
    TokenRange(start_token:G, end_token:H, endpoints:[127.0.0.4, 127.0.0.5, 127.0.0.6], rpc_endpoints:[127.0.0.4, 127.0.0.5, 127.0.0.6], endpoint_details:[EndpointDetails(host:127.0.0.4, datacenter:datacenter1, rack:rack1), EndpointDetails(host:127.0.0.5, datacenter:datacenter1, rack:rack1), EndpointDetails(host:127.0.0.6, datacenter:datacenter1, rack:rack1)])
    TokenRange(start_token:B, end_token:C, endpoints:[127.0.0.2, 127.0.0.3, 127.0.0.6], rpc_endpoints:[127.0.0.2, 127.0.0.3, 127.0.0.6], endpoint_details:[EndpointDetails(host:127.0.0.2, datacenter:datacenter1, rack:rack1), EndpointDetails(host:127.0.0.3, datacenter:datacenter1, rack:rack1), EndpointDetails(host:127.0.0.6, datacenter:datacenter1, rack:rack1)])
    TokenRange(start_token:O, end_token:P, endpoints:[127.0.0.4, 127.0.0.3, 127.0.0.6], rpc_endpoints:[127.0.0.4, 127.0.0.3, 127.0.0.6], endpoint_details:[EndpointDetails(host:127.0.0.4, datacenter:datacenter1, rack:rack1), EndpointDetails(host:127.0.0.3, datacenter:datacenter1, rack:rack1), EndpointDetails(host:127.0.0.6, datacenter:datacenter1, rack:rack1)])

Finally, we can determine all the token ranges assigned to each node. The token ranges assigned to each node looked like this:

Six node cluster and three replicas

Using this we can see what happens if we had the same outage as the single token cluster did, that is, node3 and node6 are unavailable. As we can see node3 and node6 are both responsible for tokens C, D, I, J, P, and Q. Hence, data associated with those tokens would be unavailable if our application is using a Consistency Level of LOCAL_QUORUM. To put that in different terms, unlike our single token cluster, in this case 33.3% of our data could no longer be retrieved.

Rack ‘em up

A seasoned Cassandra operator will notice that so far we have run our token distribution tests on clusters with only a single rack. To help increase the availability when using vnodes racks can be deployed. When racks are used Cassandra will try to place single replicas in each rack. That is, it will try to ensure no two identical token ranges appear in the same rack.

The key here is to configure the cluster so that for a given datacenter the number of racks is the same as the replication factor.

Let’s retry our previous example where we set num_tokens to 3, only this time we’ll define three racks in the test cluster. After configuring and starting the nodes in ccm, our newly configured test cluster initially looks like this:

$ ccm node1 nodetool status

Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address    Load       Tokens  Owns (effective)  Host ID                               Rack
UN  127.0.0.1  71.08 KiB  3       31.8%             49df615d-bfe5-46ce-a8dd-4748c086f639  rack1
UN  127.0.0.2  71.04 KiB  3       34.4%             3fef187e-00f5-476d-b31f-7aa03e9d813c  rack2
UN  127.0.0.3  66.04 KiB  3       37.3%             c6a0a5f4-91f8-4bd1-b814-1efc3dae208f  rack3
UN  127.0.0.4  109.79 KiB  3      52.9%             74ac0727-c03b-476b-8f52-38c154cfc759  rack1
UN  127.0.0.5  66.09 KiB  3       18.7%             5153bad4-07d7-4a24-8066-0189084bbc80  rack2
UN  127.0.0.6  66.09 KiB  3       25.0%             6693214b-a599-4f58-b1b4-a6cf0dd684ba  rack3

We can still see signs that the cluster might be unbalanced. This is a side issue, as the main point to take from the above is that we now have three racks defined in the cluster with two nodes assigned in each. Once again, similar to the single node cluster, we can create the test keyspace and populate it using cqlsh. We then grab a read out of the token ring to see what it looks like. Same as the previous tests, to make the example easier to follow, we manually add a letter representation next to each token number.

ccm node1 nodetool ring test_keyspace


Datacenter: datacenter1
==========
Address    Rack   Status  State   Load        Owns    Token                 Token Letter
                                                      8993942771016137629   R
127.0.0.5  rack2  Up      Normal  122.42 KiB  34.65%  -8459555739932651620  A
127.0.0.4  rack1  Up      Normal  111.07 KiB  53.84%  -8458588239787937390  B
127.0.0.3  rack3  Up      Normal  116.12 KiB  60.72%  -8347996802899210689  C
127.0.0.1  rack1  Up      Normal  121.31 KiB  46.16%  -5712162437894176338  D
127.0.0.4  rack1  Up      Normal  111.07 KiB  53.84%  -2744262056092270718  E
127.0.0.6  rack3  Up      Normal  122.39 KiB  39.28%  -2132400046698162304  F
127.0.0.2  rack2  Up      Normal  121.42 KiB  65.35%  -1232974565497331829  G
127.0.0.4  rack1  Up      Normal  111.07 KiB  53.84%  1026323925278501795   H
127.0.0.2  rack2  Up      Normal  121.42 KiB  65.35%  3093888090255198737   I
127.0.0.2  rack2  Up      Normal  121.42 KiB  65.35%  3596129656253861692   J
127.0.0.3  rack3  Up      Normal  116.12 KiB  60.72%  3674189467337391158   K
127.0.0.5  rack2  Up      Normal  122.42 KiB  34.65%  3846303495312788195   L
127.0.0.1  rack1  Up      Normal  121.31 KiB  46.16%  4699181476441710984   M
127.0.0.1  rack1  Up      Normal  121.31 KiB  46.16%  6795515568417945696   N
127.0.0.3  rack3  Up      Normal  116.12 KiB  60.72%  7964270297230943708   O
127.0.0.5  rack2  Up      Normal  122.42 KiB  34.65%  8105847793464083809   P
127.0.0.6  rack3  Up      Normal  122.39 KiB  39.28%  8813162133522758143   Q
127.0.0.6  rack3  Up      Normal  122.39 KiB  39.28%  8993942771016137629   R

Once again we capture the output of ccm node1 nodetool describering test_keyspace and change the token numbers to the corresponding letters in the above token ring.

$ ccm node1 nodetool describering test_keyspace

Schema Version:aff03498-f4c1-3be1-b133-25503becf208
TokenRange:
    TokenRange(start_token:B, end_token:C, endpoints:[127.0.0.3, 127.0.0.1, 127.0.0.2], rpc_endpoints:[127.0.0.3, 127.0.0.1, 127.0.0.2], endpoint_details:[EndpointDetails(host:127.0.0.3, datacenter:datacenter1, rack:rack3), EndpointDetails(host:127.0.0.1, datacenter:datacenter1, rack:rack1), EndpointDetails(host:127.0.0.2, datacenter:datacenter1, rack:rack2)])
    TokenRange(start_token:L, end_token:M, endpoints:[127.0.0.1, 127.0.0.3, 127.0.0.5], rpc_endpoints:[127.0.0.1, 127.0.0.3, 127.0.0.5], endpoint_details:[EndpointDetails(host:127.0.0.1, datacenter:datacenter1, rack:rack1), EndpointDetails(host:127.0.0.3, datacenter:datacenter1, rack:rack3), EndpointDetails(host:127.0.0.5, datacenter:datacenter1, rack:rack2)])
    TokenRange(start_token:N, end_token:O, endpoints:[127.0.0.3, 127.0.0.5, 127.0.0.4], rpc_endpoints:[127.0.0.3, 127.0.0.5, 127.0.0.4], endpoint_details:[EndpointDetails(host:127.0.0.3, datacenter:datacenter1, rack:rack3), EndpointDetails(host:127.0.0.5, datacenter:datacenter1, rack:rack2), EndpointDetails(host:127.0.0.4, datacenter:datacenter1, rack:rack1)])
    TokenRange(start_token:P, end_token:Q, endpoints:[127.0.0.6, 127.0.0.5, 127.0.0.4], rpc_endpoints:[127.0.0.6, 127.0.0.5, 127.0.0.4], endpoint_details:[EndpointDetails(host:127.0.0.6, datacenter:datacenter1, rack:rack3), EndpointDetails(host:127.0.0.5, datacenter:datacenter1, rack:rack2), EndpointDetails(host:127.0.0.4, datacenter:datacenter1, rack:rack1)])
    TokenRange(start_token:K, end_token:L, endpoints:[127.0.0.5, 127.0.0.1, 127.0.0.3], rpc_endpoints:[127.0.0.5, 127.0.0.1, 127.0.0.3], endpoint_details:[EndpointDetails(host:127.0.0.5, datacenter:datacenter1, rack:rack2), EndpointDetails(host:127.0.0.1, datacenter:datacenter1, rack:rack1), EndpointDetails(host:127.0.0.3, datacenter:datacenter1, rack:rack3)])
    TokenRange(start_token:R, end_token:A, endpoints:[127.0.0.5, 127.0.0.4, 127.0.0.3], rpc_endpoints:[127.0.0.5, 127.0.0.4, 127.0.0.3], endpoint_details:[EndpointDetails(host:127.0.0.5, datacenter:datacenter1, rack:rack2), EndpointDetails(host:127.0.0.4, datacenter:datacenter1, rack:rack1), EndpointDetails(host:127.0.0.3, datacenter:datacenter1, rack:rack3)])
    TokenRange(start_token:I, end_token:J, endpoints:[127.0.0.2, 127.0.0.3, 127.0.0.1], rpc_endpoints:[127.0.0.2, 127.0.0.3, 127.0.0.1], endpoint_details:[EndpointDetails(host:127.0.0.2, datacenter:datacenter1, rack:rack2), EndpointDetails(host:127.0.0.3, datacenter:datacenter1, rack:rack3), EndpointDetails(host:127.0.0.1, datacenter:datacenter1, rack:rack1)])
    TokenRange(start_token:Q, end_token:R, endpoints:[127.0.0.6, 127.0.0.5, 127.0.0.4], rpc_endpoints:[127.0.0.6, 127.0.0.5, 127.0.0.4], endpoint_details:[EndpointDetails(host:127.0.0.6, datacenter:datacenter1, rack:rack3), EndpointDetails(host:127.0.0.5, datacenter:datacenter1, rack:rack2), EndpointDetails(host:127.0.0.4, datacenter:datacenter1, rack:rack1)])
    TokenRange(start_token:E, end_token:F, endpoints:[127.0.0.6, 127.0.0.2, 127.0.0.4], rpc_endpoints:[127.0.0.6, 127.0.0.2, 127.0.0.4], endpoint_details:[EndpointDetails(host:127.0.0.6, datacenter:datacenter1, rack:rack3), EndpointDetails(host:127.0.0.2, datacenter:datacenter1, rack:rack2), EndpointDetails(host:127.0.0.4, datacenter:datacenter1, rack:rack1)])
    TokenRange(start_token:H, end_token:I, endpoints:[127.0.0.2, 127.0.0.3, 127.0.0.1], rpc_endpoints:[127.0.0.2, 127.0.0.3, 127.0.0.1], endpoint_details:[EndpointDetails(host:127.0.0.2, datacenter:datacenter1, rack:rack2), EndpointDetails(host:127.0.0.3, datacenter:datacenter1, rack:rack3), EndpointDetails(host:127.0.0.1, datacenter:datacenter1, rack:rack1)])
    TokenRange(start_token:D, end_token:E, endpoints:[127.0.0.4, 127.0.0.6, 127.0.0.2], rpc_endpoints:[127.0.0.4, 127.0.0.6, 127.0.0.2], endpoint_details:[EndpointDetails(host:127.0.0.4, datacenter:datacenter1, rack:rack1), EndpointDetails(host:127.0.0.6, datacenter:datacenter1, rack:rack3), EndpointDetails(host:127.0.0.2, datacenter:datacenter1, rack:rack2)])
    TokenRange(start_token:A, end_token:B, endpoints:[127.0.0.4, 127.0.0.3, 127.0.0.2], rpc_endpoints:[127.0.0.4, 127.0.0.3, 127.0.0.2], endpoint_details:[EndpointDetails(host:127.0.0.4, datacenter:datacenter1, rack:rack1), EndpointDetails(host:127.0.0.3, datacenter:datacenter1, rack:rack3), EndpointDetails(host:127.0.0.2, datacenter:datacenter1, rack:rack2)])
    TokenRange(start_token:C, end_token:D, endpoints:[127.0.0.1, 127.0.0.6, 127.0.0.2], rpc_endpoints:[127.0.0.1, 127.0.0.6, 127.0.0.2], endpoint_details:[EndpointDetails(host:127.0.0.1, datacenter:datacenter1, rack:rack1), EndpointDetails(host:127.0.0.6, datacenter:datacenter1, rack:rack3), EndpointDetails(host:127.0.0.2, datacenter:datacenter1, rack:rack2)])
    TokenRange(start_token:F, end_token:G, endpoints:[127.0.0.2, 127.0.0.4, 127.0.0.3], rpc_endpoints:[127.0.0.2, 127.0.0.4, 127.0.0.3], endpoint_details:[EndpointDetails(host:127.0.0.2, datacenter:datacenter1, rack:rack2), EndpointDetails(host:127.0.0.4, datacenter:datacenter1, rack:rack1), EndpointDetails(host:127.0.0.3, datacenter:datacenter1, rack:rack3)])
    TokenRange(start_token:O, end_token:P, endpoints:[127.0.0.5, 127.0.0.6, 127.0.0.4], rpc_endpoints:[127.0.0.5, 127.0.0.6, 127.0.0.4], endpoint_details:[EndpointDetails(host:127.0.0.5, datacenter:datacenter1, rack:rack2), EndpointDetails(host:127.0.0.6, datacenter:datacenter1, rack:rack3), EndpointDetails(host:127.0.0.4, datacenter:datacenter1, rack:rack1)])
    TokenRange(start_token:J, end_token:K, endpoints:[127.0.0.3, 127.0.0.5, 127.0.0.1], rpc_endpoints:[127.0.0.3, 127.0.0.5, 127.0.0.1], endpoint_details:[EndpointDetails(host:127.0.0.3, datacenter:datacenter1, rack:rack3), EndpointDetails(host:127.0.0.5, datacenter:datacenter1, rack:rack2), EndpointDetails(host:127.0.0.1, datacenter:datacenter1, rack:rack1)])
    TokenRange(start_token:G, end_token:H, endpoints:[127.0.0.4, 127.0.0.2, 127.0.0.3], rpc_endpoints:[127.0.0.4, 127.0.0.2, 127.0.0.3], endpoint_details:[EndpointDetails(host:127.0.0.4, datacenter:datacenter1, rack:rack1), EndpointDetails(host:127.0.0.2, datacenter:datacenter1, rack:rack2), EndpointDetails(host:127.0.0.3, datacenter:datacenter1, rack:rack3)])
    TokenRange(start_token:M, end_token:N, endpoints:[127.0.0.1, 127.0.0.3, 127.0.0.5], rpc_endpoints:[127.0.0.1, 127.0.0.3, 127.0.0.5], endpoint_details:[EndpointDetails(host:127.0.0.1, datacenter:datacenter1, rack:rack1), EndpointDetails(host:127.0.0.3, datacenter:datacenter1, rack:rack3), EndpointDetails(host:127.0.0.5, datacenter:datacenter1, rack:rack2)])

Lastly, we once again determine all the token ranges assigned to each node:

Six node cluster and three replicas

As we can see from the way Cassandra has assigned the tokens, there is now a complete data replica spread across two nodes in each of our three racks. If we go back to our failure scenario where node3 and node6 become unavailable, we can still service queries using a Consistency Level of LOCAL_QUORUM. The only elephant in the room here is node3 has a lot more tokens distributed to it than other nodes. Its counterpart in the same rack, node6, is at the opposite end with fewer tokens allocated to it.

Too many vnodes spoil the cluster

Given the token distribution issues with a low numbers of vnodes, one would think the best option is to have a large vnode value. However, apart from having a higher chance of some data being unavailable in a multi-node outage, large vnode values also impact streaming operations. To repair data on a node, Cassandra will start one repair session per vnode. These repair sessions need to be processed sequentially. Hence, the larger the vnode value the longer the repair times, and the overhead needed to run a repair.

In an effort to fix slow repair times as a result of large vnode values, CASSANDRA-5220 was introduced in 3.0. This change allows Cassandra to group common token ranges for a set of nodes into a single repair session. It increased the size of the repair session as multiple token ranges were being repaired, but reduced the number of repair sessions being executed in parallel.

We can see the effect that vnodes have on repair by running a simple test on a cluster backed by real hardware. To do this test we first need create a cluster that uses single tokens run a repair. Then we can create the same cluster except with 256 vnodes, and run the same repair. We will use tlp-cluster to create a Cassandra cluster in AWS with the following properties.

  • Instance size: i3.2xlarge
  • Node count: 12
  • Rack count: 3 (4 nodes per rack)
  • Cassandra version: 3.11.9 (latest stable release at the time of writing)

The commands to build this cluster are as follows.

$ tlp-cluster init --azs a,b,c --cassandra 12 --instance i3.2xlarge --stress 1 TLP BLOG "Blogpost repair testing"
$ tlp-cluster up
$ tlp-cluster use --config "cluster_name:SingleToken" --config "num_tokens:1" 3.11.9
$ tlp-cluster install

Once we provision the hardware we set the initial_token property for each of the nodes individually. We can calculate the initial tokens for each node using a simple Python command.

Python 2.7.16 (default, Nov 23 2020, 08:01:20)
[GCC Apple LLVM 12.0.0 (clang-1200.0.30.4) [+internal-os, ptrauth-isa=sign+stri on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> num_tokens = 1
>>> num_nodes = 12
>>> print("\n".join(['[Node {}] initial_token: {}'.format(n + 1, ','.join([str(((2**64 / (num_tokens * num_nodes)) * (t * num_nodes + n)) - 2**63) for t in range(num_tokens)])) for n in range(num_nodes)]))
[Node 1] initial_token: -9223372036854775808
[Node 2] initial_token: -7686143364045646507
[Node 3] initial_token: -6148914691236517206
[Node 4] initial_token: -4611686018427387905
[Node 5] initial_token: -3074457345618258604
[Node 6] initial_token: -1537228672809129303
[Node 7] initial_token: -2
[Node 8] initial_token: 1537228672809129299
[Node 9] initial_token: 3074457345618258600
[Node 10] initial_token: 4611686018427387901
[Node 11] initial_token: 6148914691236517202
[Node 12] initial_token: 7686143364045646503

After starting Cassandra on all the nodes, around 3 GB of data per node can be preloaded using the following tlp-stress command. In this command we set our keyspace replication factor to 3 and set gc_grace_seconds to 0. This is done to make hints expire immediately when they are created, which ensures they are never delivered to the destination node.

ubuntu@ip-172-31-19-180:~$ tlp-stress run KeyValue --replication "{'class': 'NetworkTopologyStrategy', 'us-west-2':3 }" --cql "ALTER TABLE tlp_stress.keyvalue WITH gc_grace_seconds = 0" --reads 1 --partitions 100M --populate 100M --iterations 1

Upon completion of the data loading, the cluster status looks like this.

ubuntu@ip-172-31-30-95:~$ nodetool status
Datacenter: us-west-2
=====================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address        Load       Tokens       Owns (effective)  Host ID                               Rack
UN  172.31.30.95   2.78 GiB   1            25.0%             6640c7b9-c026-4496-9001-9d79bea7e8e5  2a
UN  172.31.31.106  2.79 GiB   1            25.0%             ceaf9d56-3a62-40be-bfeb-79a7f7ade402  2a
UN  172.31.2.74    2.78 GiB   1            25.0%             4a90b071-830e-4dfe-9d9d-ab4674be3507  2c
UN  172.31.39.56   2.79 GiB   1            25.0%             37fd3fe0-598b-428f-a84b-c27fc65ee7d5  2b
UN  172.31.31.184  2.78 GiB   1            25.0%             40b4e538-476a-4f20-a012-022b10f257e9  2a
UN  172.31.10.87   2.79 GiB   1            25.0%             fdccabef-53a9-475b-9131-b73c9f08a180  2c
UN  172.31.18.118  2.79 GiB   1            25.0%             b41ab8fe-45e7-4628-94f0-a4ec3d21f8d0  2a
UN  172.31.35.4    2.79 GiB   1            25.0%             246bf6d8-8deb-42fe-bd11-05cca8f880d7  2b
UN  172.31.40.147  2.79 GiB   1            25.0%             bdd3dd61-bb6a-4849-a7a6-b60a2b8499f6  2b
UN  172.31.13.226  2.79 GiB   1            25.0%             d0389979-c38f-41e5-9836-5a7539b3d757  2c
UN  172.31.5.192   2.79 GiB   1            25.0%             b0031ef9-de9f-4044-a530-ffc67288ebb6  2c
UN  172.31.33.0    2.79 GiB   1            25.0%             da612776-4018-4cb7-afd5-79758a7b9cf8  2b

We can then run a full repair on each node using the following commands.

$ source env.sh
$ c_all "nodetool repair -full tlp_stress"

The repair times recorded for each node were.

[2021-01-22 20:20:13,952] Repair command #1 finished in 3 minutes 55 seconds
[2021-01-22 20:23:57,053] Repair command #1 finished in 3 minutes 36 seconds
[2021-01-22 20:27:42,123] Repair command #1 finished in 3 minutes 32 seconds
[2021-01-22 20:30:57,654] Repair command #1 finished in 3 minutes 21 seconds
[2021-01-22 20:34:27,740] Repair command #1 finished in 3 minutes 17 seconds
[2021-01-22 20:37:40,449] Repair command #1 finished in 3 minutes 23 seconds
[2021-01-22 20:41:32,391] Repair command #1 finished in 3 minutes 36 seconds
[2021-01-22 20:44:52,917] Repair command #1 finished in 3 minutes 25 seconds
[2021-01-22 20:47:57,729] Repair command #1 finished in 2 minutes 58 seconds
[2021-01-22 20:49:58,868] Repair command #1 finished in 1 minute 58 seconds
[2021-01-22 20:51:58,724] Repair command #1 finished in 1 minute 53 seconds
[2021-01-22 20:54:01,100] Repair command #1 finished in 1 minute 50 seconds

These times give us a total repair time of 36 minutes and 44 seconds.

The same cluster can be reused to test repair times when 256 vnodes are used. To do this we execute the following steps.

  • Shut down Cassandra on all the nodes.
  • Delete the contents in each of the directories data, commitlog, hints, and saved_caches (these are located in /var/lib/cassandra/ on each node).
  • Set num_tokens in the cassandra.yaml configuration file to a value of 256 and remove the initial_token setting.
  • Start up Cassandra on all the nodes.

After populating the cluster with data its status looked like this.

ubuntu@ip-172-31-30-95:~$ nodetool status
Datacenter: us-west-2
=====================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address        Load       Tokens       Owns (effective)  Host ID                               Rack
UN  172.31.30.95   2.79 GiB   256          24.3%             10b0a8b5-aaa6-4528-9d14-65887a9b0b9c  2a
UN  172.31.2.74    2.81 GiB   256          24.4%             a748964d-0460-4f86-907d-a78edae2a2cb  2c
UN  172.31.31.106  3.1 GiB    256          26.4%             1fc68fbd-335d-4689-83b9-d62cca25c88a  2a
UN  172.31.31.184  2.78 GiB   256          23.9%             8a1b25e7-d2d8-4471-aa76-941c2556cc30  2a
UN  172.31.39.56   2.73 GiB   256          23.5%             3642a964-5d21-44f9-b330-74c03e017943  2b
UN  172.31.10.87   2.95 GiB   256          25.4%             540a38f5-ad05-4636-8768-241d85d88107  2c
UN  172.31.18.118  2.99 GiB   256          25.4%             41b9f16e-6e71-4631-9794-9321a6e875bd  2a
UN  172.31.35.4    2.96 GiB   256          25.6%             7f62d7fd-b9c2-46cf-89a1-83155feebb70  2b
UN  172.31.40.147  3.26 GiB   256          27.4%             e17fd867-2221-4fb5-99ec-5b33981a05ef  2b
UN  172.31.13.226  2.91 GiB   256          25.0%             4ef69969-d9fe-4336-9618-359877c4b570  2c
UN  172.31.33.0    2.74 GiB   256          23.6%             298ab053-0c29-44ab-8a0a-8dde03b4f125  2b
UN  172.31.5.192   2.93 GiB   256          25.2%             7c690640-24df-4345-aef3-dacd6643d6c0  2c

When we run the same repair test for the single token cluster on the vnode cluster, the following repair times were recorded.

[2021-01-22 22:45:56,689] Repair command #1 finished in 4 minutes 40 seconds
[2021-01-22 22:50:09,170] Repair command #1 finished in 4 minutes 6 seconds
[2021-01-22 22:54:04,820] Repair command #1 finished in 3 minutes 43 seconds
[2021-01-22 22:57:26,193] Repair command #1 finished in 3 minutes 27 seconds
[2021-01-22 23:01:23,554] Repair command #1 finished in 3 minutes 44 seconds
[2021-01-22 23:04:40,523] Repair command #1 finished in 3 minutes 27 seconds
[2021-01-22 23:08:20,231] Repair command #1 finished in 3 minutes 23 seconds
[2021-01-22 23:11:01,230] Repair command #1 finished in 2 minutes 45 seconds
[2021-01-22 23:13:48,682] Repair command #1 finished in 2 minutes 40 seconds
[2021-01-22 23:16:23,630] Repair command #1 finished in 2 minutes 32 seconds
[2021-01-22 23:18:56,786] Repair command #1 finished in 2 minutes 26 seconds
[2021-01-22 23:21:38,961] Repair command #1 finished in 2 minutes 30 seconds

These times give us a total repair time of 39 minutes and 23 seconds.

While the time difference is quite small for 3 GB of data per node (up to an additional 45 seconds per node), it is easy to see how the difference could balloon out when we have data sizes in the order of hundreds of gigabytes per node.

Unfortunately, all data streaming operations like bootstrap and datacenter rebuild fall victim to the same issue repairs have with large vnode values. Specifically, when a node needs to stream data to another node a streaming session is opened for each token range on the node. This results in a lot of unnecessary overhead, as data is transferred via the JVM.

Secondary indexes impacted too

To add insult to injury, the negative effect of a large vnode values extends to secondary indexes because of the way the read path works.

When a coordinator node receives a secondary index request from a client, it fans out the request to all the nodes in the cluster or datacenter depending on the locality of the consistency level. Each node then checks the SSTables for each of the token ranges assigned to it for a match to the secondary index query. Matches to the query are then returned to the coordinator node.

Hence, the larger the number of vnodes, the larger the impact to the responsiveness of the secondary index query. Furthermore, the performance impacts on secondary indexes grow exponentially with the number of replicas in the cluster. In a scenario where multiple datacenters have nodes using many vnodes, secondary indexes become even more inefficient.

A new hope

So what we are left with then is a property in Cassandra that really hits the mark in terms of reducing the complexities when resizing a cluster. Unfortunately, their benefits come at the expense of unbalanced token ranges on one end, and degraded operations performance at the other. That being said, the vnodes story is far from over.

Eventually, it became a well-known fact in the Apache Cassandra project that large vnode values had undesirable side effects on a cluster. To combat this issue, clever contributors and committers added CASSANDRA-7032 in 3.0; a replica aware token allocation algorithm. The idea was to allow a low value to be used for num_tokens while maintaining relatively even balanced token ranges. The enhancement includes the addition of the allocate_tokens_for_keyspace setting in the cassandra.yaml file. The new algorithm is used instead of the random token allocator when an existing user keyspace is assigned to the allocate_tokens_for_keyspace setting.

Behind the scenes, Cassandra takes the replication factor of the defined keyspace and uses it when calculating the token values for the node when it first enters the cluster. Unlike the random token generator, the replica aware generator is like an experienced member of a symphony orchestra; sophisticated and in tune with its surroundings. So much so, that the process it uses to generate token ranges involves:

  • Constructing an initial token ring state.
  • Computing candidates for new tokens by splitting all existing token ranges right in the middle.
  • Evaluating the expected improvements from all candidates and forming a priority queue.
  • Iterating through the candidates in the queue and selecting the best combination.
    • During token selection, re-evaluate the candidate improvements in the queue.

While this was good advancement for Cassandra, there are a few gotchas to watch out for when using the replica aware token allocation algorithm. To start with, it only works with the Murmur3Partitioner partitioner. If you started with an old cluster that used another partitioner such as the RandomPartitioner and have upgraded over time to 3.0, the feature is unusable. The second and more common stumbling block is that some trickery is required to use this feature when creating a cluster from scratch. The question was common enough that we wrote a blog post specifically on how to use the new replica aware token allocation algorithm to set up a new cluster with even token distribution.

As you can see, Cassandra 3.0 made a genuine effort to address vnode’s rough edges. What’s more, there are additional beacons of light on the horizon with the upcoming Cassandra 4.0 major release. For instance, a new allocate_tokens_for_local_replication_factor setting has been added to the cassandra.yaml file via CASSANDRA-15260. Similar to its cousin the allocate_tokens_for_keyspace setting, the replica aware token allocation algorithm is activated when a value is supplied to it.

However, unlike its close relative, it is more user-friendly. This is because no phaffing is required to create a balanced cluster from scratch. In the simplest case, you can set a value for the allocate_tokens_for_local_replication_factor setting and just start adding nodes. Advanced operators can still manually assign tokens to the initial nodes to ensure the desired replication factor is met. After that, subsequent nodes can be added with the replication factor value assigned to the allocate_tokens_for_local_replication_factor setting.

Arguably, one of the longest time coming and significant changes to be released with Cassandra 4.0 is the update to the default value of the num_tokens setting. As mentioned at the beginning of this post thanks to CASSANDRA-13701 Cassandra 4.0 will ship with a num_tokens value set to 16 in the cassandra.yaml file. In addition, the allocate_tokens_for_local_replication_factor setting is enabled by default and set to a value of 3.

These changes are much better user defaults. On a vanilla installation of Cassandra 4.0, the replica aware token allocation algorithm kicks in as soon as there are enough hosts to satisfy a replication factor of 3. The result is an evenly distributed token ranges for new nodes with all the benefits that a low vnodes value has to offer.

Conclusion

The consistent hashing and token allocation functionality form part of Cassandra’s backbone. Virtual nodes take the guess work out of maintaining this critical functionality, specifically, making cluster resizing quicker and easier. As a rule of thumb, the lower the number of vnodes, the less even the token distribution will be, leading to some nodes being over worked. Alternatively, the higher the number of vnodes, the slower cluster wide operations take to complete and more likely data will be unavailable if multiple nodes are down. The features in 3.0 and the enhancements to those features thanks to 4.0, allow Cassandra to use a low number of vnodes while still maintaining a relatively even token distribution. Ultimately, it will produce a better out-of-the-box experience for new users when running a vanilla installation of Cassandra 4.0.

Get Rid of Read Repair Chance

Apache Cassandra has a feature called Read Repair Chance that we always recommend our clients to disable. It is often an additional ~20% internal read load cost on your cluster that serves little purpose and provides no guarantees.

What is read repair chance?

The feature comes with two schema options at the table level: read_repair_chance and dclocal_read_repair_chance. Each representing the probability that the coordinator node will query the extra replica nodes, beyond the requested consistency level, for the purpose of read repairs.

The original setting read_repair_chance now defines the probability of issuing the extra queries to all replicas in all data centers. And the newer dclocal_read_repair_chance setting defines the probability of issuing the extra queries to all replicas within the current data center.

The default values are read_repair_chance = 0.0 and dclocal_read_repair_chance = 0.1. This means that cross-datacenter asynchronous read repair is disabled and asynchronous read repair within the datacenter occurs on 10% of read requests.

What does it cost?

Consider the following cluster deployment:

  • A keyspace with a replication factor of three (RF=3) in a single data center
  • The default value of dclocal_read_repair_chance = 0.1
  • Client reads using a consistency level of LOCAL_QUORUM
  • Client is using the token aware policy (default for most drivers)

In this setup, the cluster is going to see ~10% of the read requests result in the coordinator issuing two messaging system queries to two replicas, instead of just one. This results in an additional ~5% load.

If the requested consistency level is LOCAL_ONE, which is the default for the java-driver, then ~10% of the read requests result in the coordinator increasing messaging system queries from zero to two. This equates to a ~20% read load increase.

With read_repair_chance = 0.1 and multiple datacenters the situation is much worse. With three data centers each with RF=3, then 10% of the read requests will result in the coordinator issuing eight extra replica queries. And six of those extra replica queries are now via cross-datacenter queries. In this use-case it becomes a doubling of your read load.

Let’s take a look at this with some flamegraphs…

The first flamegraph shows the default configuration of dclocal_read_repair_chance = 0.1. When the coordinator’s code hits the AbstractReadExecutor.getReadExecutor(..) method, it splits paths depending on the ReadRepairDecision for the table. Stack traces containing either AlwaysSpeculatingReadExecutor, SpeculatingReadExecutor or NeverSpeculatingReadExecutor provide us a hint to which code path we are on, and whether either the read repair chance or speculative retry are in play.

Async Read Repairs in the default configuration

The second flamegraph shows the behaviour when the configuration has been changed to dclocal_read_repair_chance = 0.0. The AlwaysSpeculatingReadExecutor flame is gone and this demonstrates the degree of complexity removed from runtime. Specifically, read requests from the client are now forwarded to every replica instead of only those defined by the consistency level.

No Read Repairs

ℹ️   These flamegraphs were created with Apache Cassandra 3.11.9, Kubernetes and the cass-operator, nosqlbench and the async-profiler.

Previously we relied upon the existing tools of tlp-cluster, ccm, tlp-stress and cassandra-stress. This new approach with new tools is remarkably easy, and by using k8s the same approach can be used locally or against a dedicated k8s infrastructure. That is, I don't need to switch between ccm clusters for local testing and tlp-cluster for cloud testing. The same recipe applies everywhere. Nor am I bound to AWS for my cloud testing. It is also worth mentioning that these new tools are gaining a lot of focus and momentum from DataStax, so the introduction of this new approach to the open source community is deserved.

The full approach and recipe to generating these flamegraphs will follow in a [subsequent blog post](/blog/2021/01/31/cassandra_and_kubernetes_cass_operator.html).

What is the benefit of this additional load?

The coordinator returns the result to the client once it has received the response from one of the replicas, per the user’s requested consistency level. This is why we call the feature asynchronous read repairs. This means that read latencies are not directly impacted though the additional background load will indirectly impact latencies.

Asynchronous read repairs also means that there’s no guarantee that the response to the client is repaired data. In summary, 10% of the data you read will be guaranteed to be repaired after you have read it. This is not a guarantee clients can use or rely upon. And it is not a guarantee Cassandra operators can rely upon to ensure data at rest is consistent. In fact it is not a guarantee an operator would want to rely upon anyway, as most inconsistencies are dealt with by hints and nodes down longer than the hint window are expected to be manually repaired.

Furthermore, systems that use strong consistency (i.e. where reads and writes are using quorum consistency levels) will not expose such unrepaired data anyway. Such systems only need repairs and consistent data on disk for lower read latencies (by avoiding the additional digest mismatch round trip between coordinator and replicas) and ensuring deleted data is not resurrected (i.e. tombstones are properly propagated).

So the feature gives us additional load for no usable benefit. This is why disabling the feature is always an immediate recommendation we give everyone.

It is also the rationale for the feature being removed altogether in the next major release, Cassandra version 4.0. And, since 3.0.17 and 3.11.3, if you still have values set for these properties in your table, you may have noticed the following warning during startup:

dclocal_read_repair_chance table option has been deprecated and will be removed in version 4.0

Get Rid of It

For Cassandra clusters not yet on version 4.0, do the following to disable all asynchronous read repairs:

cqlsh -e 'ALTER TABLE <keyspace_name>.<table_name> WITH read_repair_chance = 0.0 AND dclocal_read_repair_chance = 0.0;'

When upgrading to Cassandra 4.0 no action is required, these settings are ignored and disappear.

Renaming and reshaping Scylla tables using scylla-migrator

We have recently faced a problem where some of the first Scylla tables we created on our main production cluster were not in line any more with the evolved s...

Python scylla-driver: how we unleashed the Scylla monster's performance

At Scylla summit 2019 I had the chance to meet Israel Fruchter and we dreamed of working on adding **shard...

Scylla Summit 2019

I've had the pleasure to attend again and present at the Scylla Summit in San Francisco and the honor to be awarded the...

Scylla: four ways to optimize your disk space consumption

We recently had to face free disk space outages on some of our scylla clusters and we learnt some very interesting things while outlining some improvements t...

Scylla Summit 2018 write-up

It's been almost one month since I had the chance to attend and speak at Scylla Summit 2018 so I'm reliev...

Authenticating and connecting to a SSL enabled Scylla cluster using Spark 2

This quick article is a wrap up for reference on how to connect to ScyllaDB using Spark 2 when authentication and SSL are enforced for the clients on the...

A botspot story

I felt like sharing a recent story that allowed us identify a bot in a haystack thanks to Scylla.

...