How to Model Leaderboards for 1M Player Game with ScyllaDB
Ever wondered how a game like League of Legends, Fortnite, or even Rockband models its leaderboards? In this article, we'll explore how to properly model a schema for leaderboards…using a monstrously fast database (ScyllaDB)!

1. Prologue
Ever since I was a kid, I've been fascinated by games and how they're made. My favorite childhood game was Guitar Hero 3: Legends of Rock. Well, more than a decade later, I decided to try to contribute to some games in the open source environment, like rust-ro (Rust Ragnarok Emulator) and YARG (Yet Another Rhythm Game).
YARG is another rhythm game, but this project is completely open source. It unites legendary contributors in game development and design. The game was being picked up and played mostly by Guitar Hero/Rockband streamers on Twitch. I thought: Well, it's an open-source project, so maybe I can use my database skills to create a monstrously fast leaderboard for storing past games.
It started as a simple chat on their Discord, then turned into a long discussion about how to make this project grow faster. Ultimately, I decided to contribute to it by building a leaderboard with ScyllaDB. In this blog, I'll show you some code and concepts!

2. Query-Driven Data Modeling
With NoSQL, you should first understand which query you want to run, depending on the paradigm (document, graph, wide-column, etc.). Focus on the query and create your schema based on that query. In this project, we will handle two types of paradigms:
- Key-Value
- Wide Column (Clusterization)
Now let's talk about the queries/features of our modeling.

2.1 Feature: Storing the matches
Every time you finish a YARG gameplay, you want to submit your scores plus other in-game metrics. Basically, it will be a single query based on a main index:

SELECT score, stars, missed_notes, instrument, ...
FROM leaderboard.submissions
WHERE submission_id = 'some-uuid-here-omg'
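From the application side, this lookup is a single prepared statement bound to the primary key. Here is a minimal sketch using the Python driver (scylla-driver/cassandra-driver); the contact point and the fetch_submission helper are illustrative assumptions, not part of YARG's actual code.

from uuid import UUID
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])        # assumption: a locally reachable ScyllaDB node
session = cluster.connect("leaderboard")

# Prepared once, reused for every lookup on the hot path.
get_submission = session.prepare(
    "SELECT score, stars, missed_count, instrument "
    "FROM submissions WHERE submission_id = ?"
)

def fetch_submission(submission_id: UUID):
    # Single-partition, key-value style read: returns one row (or None).
    return session.execute(get_submission, [submission_id]).one()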
2.2 Feature: Leaderboard
And now our main goal: a super cool leaderboard that you don't need to worry about after you perform good data modeling. The leaderboard is per song: every time you play a specific song, your best score will be saved and ranked.
The interface has filters that dictate exactly which leaderboard to bring:
- song_id: required
- instrument: required
- modifiers: required
- difficulty: required
- player_id: optional
- score: optional
Imagine our query looks like this, and it returns the results sorted by score in descending order:

SELECT player_id, score, ...
FROM leaderboard.song_leaderboard
WHERE instrument = 'guitar' AND
      difficulty = 'expert' AND
      modifiers = {'none'} AND
      track_id = 'dani-california'
LIMIT 100;

-- player_id    | score
-- --------------+-------
-- tzach        | 12000
-- danielhe4rt  | 10000
-- kadoodle     |  9999
Can you already imagine what the final schema will look like? No? Ok, let me help you with that!

3. Data Modeling time!
It's time to take a deep dive into data modeling with ScyllaDB and better understand how to scale it.

3.1 Matches Modeling
First, let's understand a little more about the game itself:
- It's a rhythm game;
- You play a certain song at a time;
- You can activate "modifiers" to make your life easier or harder before the game;
- You must choose an instrument (e.g. guitar, drums, bass, and microphone).
Every aspect of the gameplay is tracked, such as:
- Score;
- Missed notes;
- Overdrive count;
- Play speed (1.5x ~ 1.0x);
- Date/time of gameplay;
- And other cool stuff.
Thinking about that, let's start our data modeling. It will turn into something like this:

CREATE TABLE IF NOT EXISTS leaderboard.submissions (
    submission_id uuid,
    track_id text,
    player_id text,
    modifiers frozen<set<text>>,
    score int,
    difficulty text,
    instrument text,
    stars int,
    accuracy_percentage float,
    missed_count int,
    ghost_notes_count int,
    max_combo_count int,
    overdrive_count int,
    speed int,
    played_at timestamp,
    PRIMARY KEY (submission_id, played_at)
);
Let's skip all the int/text values and jump to the set<text> type. The set type allows you to store a list of items of a particular type. I decided to use it to store the modifiers because it's a perfect fit. Look at how the queries are executed:

INSERT INTO leaderboard.submissions (
    submission_id,
    track_id,
    modifiers,
    played_at
) VALUES (
    some-cool-uuid-here,
    'starlight-muse',
    {'all-taps', 'hell-mode', 'no-hopos'},
    '2024-01-01 00:00:00'
);
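If you are writing this from application code rather than cqlsh, the driver maps a Python set directly onto the CQL set<text> column. A small hedged sketch follows; it assumes the session object from the earlier example and generates its own UUID and timestamp.

from uuid import uuid4
from datetime import datetime, timezone

insert_submission = session.prepare(
    "INSERT INTO leaderboard.submissions "
    "(submission_id, track_id, modifiers, played_at) VALUES (?, ?, ?, ?)"
)

session.execute(insert_submission, [
    uuid4(),                                  # submission_id
    "starlight-muse",                         # track_id
    {"all-taps", "hell-mode", "no-hopos"},    # Python set -> CQL set<text>
    datetime.now(timezone.utc),               # played_at
])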
With this type, you can easily store a list of items to retrieve later. Another cool piece of information is that this query is key-value-like! What does that mean? Since you will always query it by the submission_id only, it can be categorized as key-value.

3.2 Leaderboard Modeling
Now we'll cover some cool wide-column database concepts. In our leaderboard query, we will always need some dynamic values in the WHERE clause. That means these values will belong to the Partition Key, while the Clustering Keys will hold values that can be "optional".
A partition key is a hash based on a combination of fields that you add to identify a value. Let's imagine that you played Starlight - Muse 100 times. If you were to query this information, it would return 100 different results, differentiated by Clustering Keys like score or player_id.
SELECT player_id, score, ...
FROM leaderboard.song_leaderboard
WHERE track_id = 'starlight-muse'
LIMIT 100;

If 1,000,000 players play this song, your query will become slow, and it will become a problem in the future, because your partition key consists of only one field: track_id. However, if you add more fields to your Partition Key, like things that are mandatory before playing the game, you can shrink these possibilities and get a faster query.
Now do you see the big picture? Adding fields like Instrument, Difficulty, and Modifiers will give you a way to split the information about that specific track evenly. Let's imagine it with some simple numbers:

-- Query Partition ID: '1'
SELECT player_id, score, ...
FROM leaderboard.song_leaderboard
WHERE instrument = 'guitar' AND
      difficulty = 'expert' AND
      modifiers = {'none'} AND -- Modifiers changed
      track_id = 'starlight-muse'
LIMIT 100;

-- Query Partition ID: '2'
SELECT player_id, score, ...
FROM leaderboard.song_leaderboard
WHERE instrument = 'guitar' AND
      difficulty = 'expert' AND
      modifiers = {'all-hopos'} AND -- Modifiers changed
      track_id = 'starlight-muse'
LIMIT 100;
So, if you build the query in a specific shape, it will always look for a specific token and retrieve the data based on these specific Partition Keys. Let's take a look at the final modeling and talk about the clustering keys and the application layer:

CREATE TABLE IF NOT EXISTS leaderboard.song_leaderboard (
    submission_id uuid,
    track_id text,
    player_id text,
    modifiers frozen<set<text>>,
    score int,
    difficulty text,
    instrument text,
    stars int,
    accuracy_percentage float,
    missed_count int,
    ghost_notes_count int,
    max_combo_count int,
    overdrive_count int,
    speed int,
    played_at timestamp,
    PRIMARY KEY ((track_id, modifiers, difficulty, instrument), score, player_id)
) WITH CLUSTERING ORDER BY (score DESC, player_id ASC);
The partition key was defined as mentioned above, consisting of our REQUIRED PARAMETERS: track_id, modifiers, difficulty and instrument. For the Clustering Keys, we added score and player_id. Note that by default the clustering fields are ordered by score DESC, and just in case two players have the same score, the criterion to choose the winner will be alphabetical ¯\_(ツ)_/¯.
First, it's good to understand that we will have only ONE SCORE PER PLAYER. But with this modeling, if the player goes through the same track twice with different scores, it will generate two different entries:

INSERT INTO leaderboard.song_leaderboard (
    track_id,
    player_id,
    modifiers,
    score,
    difficulty,
    instrument,
    played_at
) VALUES (
    'starlight-muse',
    'daniel-reis',
    {'none'},
    133700,
    'expert',
    'guitar',
    '2023-11-23 00:00:00'
);

INSERT INTO leaderboard.song_leaderboard (
    track_id,
    player_id,
    modifiers,
    score,
    difficulty,
    instrument,
    played_at
) VALUES (
    'starlight-muse',
    'daniel-reis',
    {'none'},
    123700,
    'expert',
    'guitar',
    '2023-11-23 00:00:00'
);

SELECT player_id, score
FROM leaderboard.song_leaderboard
WHERE instrument = 'guitar' AND
      difficulty = 'expert' AND
      modifiers = {'none'} AND
      track_id = 'starlight-muse'
LIMIT 2;

-- player_id    | score
-- --------------+--------
-- daniel-reis  | 133700
-- daniel-reis  | 123700
So how do we fix this problem? Well, it's not a problem per se. It's a feature! As a developer, you have to create your own business rules based on the project's needs, and this is no different. What do I mean by that? You can run a simple DELETE query before inserting the new entry. That will guarantee that you will not keep any data from that player_id with a score lower than the new score inside that specific group of partition keys.

-- Before inserting the new gameplay
DELETE FROM leaderboard.song_leaderboard
WHERE instrument = 'guitar' AND
      difficulty = 'expert' AND
      modifiers = {'none'} AND
      track_id = 'starlight-muse' AND
      player_id = 'daniel-reis' AND
      score <= 'your-new-score-here';

-- Now you can insert the new payload...
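On the read side, rendering the leaderboard is a single-partition query: bind the four required filters that make up the partition key and let the clustering order return the rows already sorted by score. Here is a hedged sketch with the Python driver (it reuses the assumed session from earlier; the frozen set is passed as a plain Python set):

top_scores = session.prepare(
    "SELECT player_id, score FROM leaderboard.song_leaderboard "
    "WHERE track_id = ? AND modifiers = ? AND difficulty = ? AND instrument = ? "
    "LIMIT 100"
)

rows = session.execute(top_scores, ["starlight-muse", {"none"}, "expert", "guitar"])
for rank, row in enumerate(rows, start=1):
    print(rank, row.player_id, row.score)   # rows already arrive sorted by score DESC

Because score is the first clustering column and the table declares CLUSTERING ORDER BY (score DESC), no ORDER BY or application-side sorting is needed.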
And with that, we finished our simple leaderboard system, the same one that runs in YARG and can also be used in games with MILLIONS of entries per second 😀

4. How to Contribute to YARG
Want to contribute to this wonderful open-source project? We're building a brand new platform for all the players using:
- Game: Unity3d (Repository)
- Front-end: NextJS (Repository)
- Back-end: Laravel 10.x (Repository)
We will need as many developers and testers as possible to discuss future implementations of the game together with the main contributors!
First, make sure to join this Discord Community. This is where all the technical discussions happen with the backing of the community before going to the development board.
Also, outside of Discord, the YARG community is mostly focused on the X account of EliteAsian (core contributor and project owner) for development showcases. Be sure to follow him there as well.
"New replay viewer HUD for #YARG! There are still some issues with it, such as consistency, however we are planning to address them by the official stable release of v0.12. pic.twitter.com/9ACIJXAZS4" — EliteAsian (@EliteAsian123) December 16, 2023

And FYI, the Lead Artist of the game, Kadu Waengertner (aka Kadu), is also a Broadcast Specialist and Product Innovation Developer at Elgato who has worked with streamers like Ninja, Nadeshot, StoneMountain64 and the legendary DJ Marshmello. Kadu also uses his X account to share insights and early previews of new features and experiments for YARG. So, don't forget to follow him as well!

"Here's how the replay venue looks like now, added a lot of details on the desk, really happy with the result so far, going to add a few more and start the textures pic.twitter.com/oHH27vkREe" — ⚡Kadu Waengertner (@kaduwaengertner) August 10, 2023

Here are some useful links to learn more about the project:
- Official Website
- Github Repository
- Task Board
Fun fact: YARG got noticed by Brian Bright, project lead on Guitar Hero, who liked the fact that the project was open source. Awesome, right?

5. Conclusion
Data modeling is sometimes challenging. This project involved learning many new concepts and a lot of testing together with my community on Twitch. I have also published a Gaming Leaderboard Demo, where you can get some insights on how to implement the same project using NextJS and ScyllaDB!
Also, if you like ScyllaDB and want to learn more about it, I strongly suggest you watch our free Masterclass Courses or visit ScyllaDB University!
Will Your Cassandra Database Project Succeed?: The New Stack
Open source Apache Cassandra® continues to stand out as an enterprise-proven solution for organizations seeking high availability, scalability and performance in a NoSQL database. (And hey, the brand-new 5.0 version is only making those statements even more true!) There’s a reason this database is trusted by some of the world’s largest and most successful companies.
That said, effectively harnessing the full spectrum of Cassandra’s powerful advantages can mean overcoming a fair share of operational complexity. Some folks will find a significant learning curve, and knowing what to expect is critical to success. In my years of experience working with Cassandra, it’s when organizations fail to anticipate and respect these challenges that they set the stage for their Cassandra projects to fall short of expectations.
Let’s look at the key areas where strong project management and following proven best practices will enable teams to evade common pitfalls and ensure a Cassandra implementation is built strong from Day 1.
Accurate Data Modeling Is a Must
Cassandra projects require a thorough understanding of its unique data model principles. Teams that approach Cassandra like a relational database are unlikely to model data properly. This can lead to poor performance, excessive use of secondary indexes and significant data consistency issues.
On the other hand, teams that develop familiarity with Cassandra’s specific NoSQL data model will understand the importance of including partition keys, clustering keys and denormalization. These teams will know to closely analyze query and data access patterns associated with their applications and know how to use that understanding to build a Cassandra data model that matches their application’s needs step for step.
Configure Cassandra Clusters the Right Way
Accurate, expertly managed cluster configurations are pivotal to the success of Cassandra implementations. Get those cluster settings wrong and Cassandra can suffer from data inconsistencies and performance issues due to inappropriate node capacities, poor partitioning or replication strategies that aren’t up to the task.
Teams should understand the needs of their particular use case and how each cluster configuration setting affects Cassandra’s abilities to serve that use case. Attuning configurations to best support your application — including the right settings for node capacity, data distribution, replication factor and consistency levels — will ensure that you can harness the full power of Cassandra when it counts.
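For example, the replication factor is a per-keyspace setting chosen at creation time (or changed later with ALTER KEYSPACE followed by a repair). A minimal, hedged illustration using the Python driver; the keyspace name, data center name and contact point are placeholders, not recommendations:

from cassandra.cluster import Cluster

session = Cluster(["10.0.0.1"]).connect()   # placeholder contact point

session.execute("""
    CREATE KEYSPACE IF NOT EXISTS inventory
    WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': 3}
""")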
Take Advantage of Tunable Consistency
Cassandra gives teams the option to leverage the best balance of data consistency and availability for their use case. While these tunable consistency levels are a valuable tool in the right hands, teams that don’t understand the nuances of these controls can saddle their applications with painful latency and troublesome data inconsistencies.
Teams that learn to operate Cassandra’s tunable consistency levels properly and carefully assess their application’s needs — especially with read and write patterns, data sensitivity and the ability to tolerate eventual consistency — will unlock far more beneficial Cassandra experiences.
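As a concrete illustration, most drivers let you choose the consistency level per statement, so a critical write and a latency-sensitive read can make different trade-offs. A hedged sketch with the Python driver follows; the keyspace, table and values are hypothetical:

from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

session = Cluster(["10.0.0.1"]).connect("shop")   # placeholder cluster/keyspace

# Durable write: wait for a quorum of replicas in the local data center.
critical_write = SimpleStatement(
    "INSERT INTO orders (id, status) VALUES (%s, %s)",
    consistency_level=ConsistencyLevel.LOCAL_QUORUM,
)
session.execute(critical_write, (42, "paid"))

# Latency-sensitive read: accept eventual consistency from a single replica.
fast_read = SimpleStatement(
    "SELECT status FROM orders WHERE id = %s",
    consistency_level=ConsistencyLevel.ONE,
)
row = session.execute(fast_read, (42,)).one()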
Perform Regular Maintenance
Regular Cassandra maintenance is required to stave off issues such as data inconsistencies and performance drop-offs. Within their Cassandra operational procedures, teams should routinely perform compaction, repair and nodetool operations to prevent challenges down the road, while ensuring cluster health and performance are optimized.
Anticipate Capacity and Scaling Needs
By its nature, success will yield new needs. Be prepared for your Cassandra cluster to grow and scale well into the future — that is what this database is built to do. Starving your Cassandra cluster for CPU, RAM and storage resources because you don’t have a plan to seamlessly add capacity is a way of plucking failure from the jaws of success. Poor performance, data loss and expensive downtime are the rewards for growing without looking ahead.
Plan for growth and scalability from the beginning of your Cassandra implementation. Practice careful capacity planning. Look at your data volumes, write/read patterns and performance requirements today and tomorrow. Teams with clusters built for growth will be ready to do so far more easily and affordably.
Make Changes With a Careful Testing/Staging/Prod Process
Teams that think they’re streamlining their process efficiency by putting Cassandra changes straight into production actually enable a pipeline for bugs, performance roadblocks and data inconsistencies. Testing and staging environments are essential for validating changes before putting them into production environments and will save teams countless hours of headaches.
At the end of the day, running all data migrations, changes to schema and application updates through testing and staging environments is far more efficient than putting them straight into production and then cleaning up myriad live issues.
Set Up Monitoring and Alerts
Teams implementing monitoring and alerts to track metrics and flag anomalies can mitigate trouble spots before they become full-blown service interruptions. The speed at which teams become aware of issues can mean the difference between a behind-the-scenes blip and a downtime event.
Have Backup and Disaster Recovery at the Ready
In addition to standing up robust monitoring and alerting, teams should regularly test and run practice drills on their procedures for recovering from disasters and using data backups. Don’t neglect this step; these measures are absolutely essential for ensuring the safety and resilience of systems and data.
The less prepared an organization is to recover from issues, the longer and more costly and impactful downtime will be. Incremental or snapshot backup strategies, replication that’s based in the cloud or across multiple data centers and fine-tuned recovery processes should be in place to minimize downtime, stress and confusion whenever the worst occurs.
Nurture Cassandra Expertise
The expertise required to optimize Cassandra configurations, operations and performance will only come with a dedicated focus. Enlisting experienced talent, instilling continuous training regimens that keep up with Cassandra updates, turning to external support and ensuring available resources — or all of the above — will position organizations to succeed in following the best practices highlighted here and achieving all of the benefits that Cassandra can deliver.
How ShareChat Scaled their ML Feature Store 1000X without Scaling the Database
How ShareChat successfully scaled 1000X without scaling the underlying database (ScyllaDB)

The demand for low-latency machine learning feature stores is higher than ever, but actually implementing one at scale remains a challenge. That became clear when ShareChat engineers Ivan Burmistrov and Andrei Manakov took the P99 CONF 23 stage to share how they built a low-latency ML feature store based on ScyllaDB.
This isn't a tidy case study where adopting a new product saves the day. It's a "lessons learned" story, a look at the value of relentless performance optimization – with some important engineering takeaways.
The original system implementation fell far short of the company's scalability requirements. The ultimate goal was to support 1 billion features per second, but the system failed under a load of just 1 million. With some smart problem solving, the team pulled it off though. Let's look at how their engineers managed to pivot from the initial failure to meet their lofty performance goal without scaling the underlying database.

Obsessed with performance optimizations and low-latency engineering? Join your peers at P99 CONF 24, a free highly technical virtual conference on "all things performance." Speakers include:
- Michael Stonebraker, Postgres creator and MIT professor
- Bryan Cantrill, Co-founder and CTO of Oxide Computer
- Avi Kivity, KVM creator, ScyllaDB co-founder and CTO
- Liz Rice, Chief open source officer with eBPF specialists Isovalent
- Andy Pavlo, CMU professor
- Ashley Williams, Axo founder/CEO, former Rust core team, Rust Foundation founder
- Carl Lerche, Tokio creator, Rust contributor and engineer at AWS
Register Now – It's Free
In addition to another great talk by Ivan from ShareChat, expect more than 60 engineering talks on performance optimizations at Disney/Hulu, Shopify, Lyft, Uber, Netflix, American Express, Datadog, Grafana, LinkedIn, Google, Oracle, Redis, AWS, ScyllaDB and more. Register for free.

ShareChat: India's Leading Social Media Platform
To understand the scope of the challenge, it's important to know a little about ShareChat, the leading social media platform in India. On the ShareChat app, users discover and consume content in more than 15 different languages, including videos, images, songs and more. ShareChat also hosts a TikTok-like short video platform (Moj) that encourages users to be creative with trending tags and contests.
Between the two applications, they serve a rapidly growing user base that already has over 325 million monthly active users. And their AI-based content recommendation engine is essential for driving user retention and engagement.

Machine learning feature stores at ShareChat
This story focuses on the system behind ML feature stores for the short-form video app Moj. It offers fully personalized feeds to around 20 million daily active users, 100 million monthly active users. Feeds serve 8,000 requests per second, and there's an average of 2,000 content candidates being ranked on each request (for example, to find the 10 best items to recommend).
"Features" are pretty much anything that can be extracted from the data. Ivan Burmistrov, principal staff software engineer at ShareChat, explained: "We compute features for different 'entities.' Post is one entity, User is another and so on. From the computation perspective, they're quite similar. However, the important difference is in the number of features we need to fetch for each type of entity. When a user requests a feed, we fetch user features for that single user.
However, to rank all the posts, we need to fetch features for each candidate (post) being ranked, so the total load on the system generated by post features is much larger than the one generated by user features. This difference plays an important role in our story."

What went wrong
At first, the primary focus was on building a real-time user feature store because, at that point, user features were most important. The team started to build the feature store with that goal in mind. But then priorities changed and post features became the focus too. This shift happened because the team started building an entirely new ranking system with two major differences versus its predecessor:
- Near real-time post features were more important
- The number of posts to rank increased from hundreds to thousands
Ivan explained: "When we went to test this new system, it failed miserably. At around 1 million features per second, the system became unresponsive, latencies went through the roof and so on."
Ultimately, the problem stemmed from how the system architecture used pre-aggregated data buckets called tiles. For example, they can aggregate the number of likes for a post in a given minute or other time range. This allows them to compute metrics like the number of likes for multiple posts in the last two hours.
Here's a high-level look at the system architecture. There are a few real-time topics with raw data (likes, clicks, etc.). A Flink job aggregates them into tiles and writes them to ScyllaDB. Then there's a feature service that requests tiles from ScyllaDB, aggregates them and returns results to the feed service.
The initial database schema and tiling configuration led to scalability problems. Originally, each entity had its own partition, with timestamp and feature name as ordered clustering columns. [Learn more in this NoSQL data modeling masterclass]. Tiles were computed for segments of one minute, 30 minutes and one day. Querying one hour, one day, seven days or 30 days required fetching around 70 tiles per feature on average. If you do the math, it becomes clear why it failed. The system needed to handle around 22 billion rows per second. However, the database capacity was only 10 million rows/sec.

Initial optimizations
At that point, the team went on an optimization mission. The initial database schema was updated to store all feature rows together, serialized as protocol buffers for a given timestamp. Because the architecture was already using Apache Flink, the transition to the new tiling schema was fairly easy, thanks to Flink's advanced capabilities in building data pipelines. With this optimization, the "Features" multiplier has been removed from the equation above, and the number of required rows to fetch has been reduced by roughly 100X, to around 200 million rows/sec.
The team also optimized the tiling configuration, adding additional tiles for five minutes, three hours and five days to the one minute, 30 minutes and one day tiles. This reduced the average required tiles from 70 to 23, further reducing the rows/sec to around 73 million.
To handle more rows/sec on the database side, they changed the ScyllaDB compaction strategy from incremental to leveled. [Learn more about compaction strategies]. That option better suited their query patterns, keeping relevant rows together and reducing read I/O. The result: ScyllaDB's capacity was effectively doubled.
The easiest way to accommodate the remaining load would have been to scale ScyllaDB 4x.
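Using the approximate figures quoted above, a quick back-of-envelope sketch (ours, not from the talk itself) shows both why the original design failed and where the roughly 4x gap comes from:

# Approximate figures quoted in this story, in rows per second.
initial_load = 22_000_000_000   # original schema: ~22B rows/sec needed
packed_load  =    200_000_000   # all features for a timestamp packed into one protobuf row
retiled_load =     73_000_000   # ~23 tiles per query instead of ~70

base_capacity   = 10_000_000            # quoted ScyllaDB capacity
better_capacity = base_capacity * 2     # leveled compaction roughly doubled it

print(initial_load / base_capacity)     # ~2200x over capacity at the start
print(retiled_load / better_capacity)   # ~3.7 -> scaling the cluster ~4x would close the gap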
However, more/larger clusters would increase costs and that simply wasn't in their budget. So the team continued focusing on improving the scalability without scaling up the ScyllaDB cluster.

Improved cache locality
One potential way to reduce the load on ScyllaDB was to improve the local cache hit rate, so the team decided to research how this could be achieved. The obvious choice was to use a consistent hashing approach, a well-known technique to direct a request to a certain replica from the client based on some information about the request. Since the team was using NGINX Ingress in their Kubernetes setup, using NGINX's capabilities for consistent hashing seemed like a natural choice. Per NGINX Ingress documentation, setting up consistent hashing would be as simple as adding three lines of code. What could go wrong?
A bit. This simple configuration didn't work. Specifically:
- The client subset led to a huge key remapping – up to 100% in the worst case. Since the node keys can be changed in a hash ring, it was impossible to use real-life scenarios with autoscaling. [See the ingress implementation]
- It was tricky to provide a hash value for a request because Ingress doesn't support the most obvious solution: a gRPC header.
- The latency suffered severe degradation, and it was unclear what was causing the tail latency.
To support a subset of the pods, the team modified their approach. They created a two-step hash function: first hashing an entity, then adding a random prefix. That distributed the entity across the desired number of pods. In theory, this approach could cause a collision when an entity is mapped to the same pod several times. However, the risk is low given the large number of replicas.
Ingress doesn't support using a gRPC header as a variable, but the team found a workaround: using path rewriting and providing the required hash key in the path itself. The solution was admittedly a bit "hacky" … but it worked.
Unfortunately, pinpointing the cause of latency degradation would have required considerable time, as well as observability improvements. A different approach was needed to scale the feature store in time. To meet the deadline, the team split the Feature service into 27 different services and manually split all entities between them on the client. It wasn't the most elegant approach, but it was simple and practical – and it achieved great results. The cache hit rate improved to 95% and the ScyllaDB load was reduced to 18.4 million rows per second. With this design, ShareChat scaled its feature store to 1B features per second by March.
However, this "old school" deployment-splitting approach still wasn't the ideal design. Maintaining 27 deployments was tedious and inefficient. Plus, the cache hit rate wasn't stable, and scaling was limited by having to keep a high minimum pod count in every deployment. So even though this approach technically met their needs, the team continued their search for a better long-term solution.

The next phase of optimizations: consistent hashing, Feature service
Ready for yet another round of optimization, the team revisited the consistent hashing approach using a sidecar, called Envoy Proxy, deployed with the feature service. Envoy Proxy provided better observability which helped identify the latency tail issue. The problem: different request patterns to the Feature service caused a huge load on the gRPC layer and cache. That led to extensive mutex contention. The team then optimized the Feature service.
They:
- Forked the caching library (FastCache from VictoriaMetrics) and implemented batch writes and better eviction to reduce mutex contention by 100x.
- Forked grpc-go and implemented a buffer pool across different connections to avoid contention during high parallelism.
- Used object pooling and tuned garbage collector (GC) parameters to reduce allocation rates and GC cycles.
With Envoy Proxy handling 15% of traffic in their proof-of-concept, the results were promising: a 98% cache hit rate, which reduced the load on ScyllaDB to 7.4M rows/sec. They could even scale the feature store more: from 1 billion features/second to 3 billion features/second.

Lessons learned
Here's what this journey looked like from a timeline perspective:
To close, Andrei summed up the team's top lessons learned from this project (so far):
- Use proven technologies. Even as the ShareChat team drastically changed their system design, ScyllaDB, Apache Flink and VictoriaMetrics continued working well.
- Each optimization is harder than the previous one – and has less impact.
- Simple and practical solutions (such as splitting the feature store into 27 deployments) do indeed work.
- The solution that delivers the best performance isn't always user-friendly. For instance, their revised database schema yields good performance, but is difficult to maintain and understand. Ultimately, they wrote some tooling around it to make it simpler to work with.
- Every system is unique. Sometimes you might need to fork a default library and adjust it for your specific system to get the best performance.
Watch their complete P99 CONF talk.

Simplifying Cassandra and DynamoDB Migrations with the ScyllaDB Migrator
Learn about the architecture of ScyllaDB Migrator, how to use it, recent developments, and upcoming features.
ScyllaDB offers both a CQL-compatible API and a DynamoDB-compatible API, allowing applications that use Apache Cassandra or DynamoDB to take advantage of reduced costs and lower latencies with minimal code changes. We previously described the two main migration strategies: cold and hot migrations. In both cases, you need to backfill ScyllaDB with historical data. Either can be efficiently achieved with the ScyllaDB Migrator. In this blog post, we will provide an update on its status. You will learn about its architecture, how to use it, recent developments, and upcoming features.
The Architecture of the ScyllaDB Migrator
The ScyllaDB Migrator leverages Apache Spark to migrate terabytes of data in parallel. It can migrate data from various types of sources, as illustrated in the following diagram:
We initially developed it to migrate from Apache Cassandra, but we have since added support for more types of data sources. At the time of writing, the Migrator can migrate data from either:
- A CQL-compatible source: an Apache Cassandra table, or a Parquet file stored locally or on Amazon S3.
- A DynamoDB-compatible source: a DynamoDB table, or a DynamoDB table export on Amazon S3.
What's so interesting about ScyllaDB Migrator?
- Since it runs as an Apache Spark application, you can adjust its throughput by scaling the underlying Spark cluster.
- It is designed to be resilient to read or write failures. If it stops prior to completion, the migration can be restarted from where it left off.
- It can rename item columns along the way.
- When migrating from DynamoDB, the Migrator can endlessly replicate new changes to ScyllaDB. This is useful for hot migration strategies.
How to Use the ScyllaDB Migrator
More details are available in the official Migrator documentation. The main steps are:
1. Set Up Apache Spark: There are several ways to set up an Apache Spark cluster, from using a pre-built image on AWS EMR, to manually following the official Apache Spark documentation, to using our automated Ansible playbook on your own infrastructure. You may also use Docker to run a cluster on a single machine.
2. Prepare the Configuration File: Create a YAML configuration file that specifies the source database, target ScyllaDB cluster, and any migration options.
3. Run the Migrator: Execute the ScyllaDB Migrator using the spark-submit command. Pass the configuration file as an argument to the migrator.
4. Monitor the Migration: The Spark UI provides logs
and metrics to help you monitor the migration process. You can
track the progress and troubleshoot any issues that arise. You
should also monitor the source and target databases to check
whether they are saturated or not.
Recent Developments
The ScyllaDB Migrator has seen several significant improvements, making it more versatile and easier to use:
Support for Reading DynamoDB
S3 Exports: You can now migrate data from DynamoDB S3
exports directly to ScyllaDB, broadening the range of sources you
can migrate from. PR #140.
AWS AssumeRole Authentication: The Migrator now
supports AWS AssumeRole authentication, allowing for secure access
to AWS resources during the migration process. PR #150.
Schema-less DynamoDB Migrations: By adopting a
schema-less approach, the Migrator enhances reliability when
migrating to ScyllaDB Alternator, ScyllaDB’s DynamoDB-compatible
API. PR #105.
Dedicated Documentation Website: The Migrator’s
documentation is now available on a proper website, providing
comprehensive guides, examples, and throughput tuning tips.
PR
#166. Update to Spark 3.5 and Scala 2.13: The
Migrator has been updated to support the latest versions of Spark
and Scala, ensuring compatibility and leveraging the latest
features and performance improvements. PR #155.
Ansible Playbook for Spark Cluster Setup: An
Ansible playbook is now available to automate the setup of a Spark
cluster, simplifying the initial setup process. PR #148.
Publish Pre-built Assemblies: You don’t need to
manually build the Migrator from the source anymore. Download the
latest release and pass it to the spark-submit
command. PR #158.
Strengthened Continuous Integration: We have set
up a testing infrastructure that reduces the risk of introducing
regressions and prevents us from breaking backward compatibility.
PRs #107,
#121,
#127.
Hands-on Migration Example
The content of this section has been
extracted from the documentation website.
The original content is kept up to date. Let’s go through
a migration example to illustrate some of the points listed above.
We will perform a cold migration to replicate 1,000,000 items from
a DynamoDB table to ScyllaDB Alternator. The whole system is
composed of the DynamoDB service, a Spark cluster with a single
worker node, and a ScyllaDB cluster with a single node, as
illustrated below: To make it easier for interested readers to
follow along, we will create all those services using Docker. All you need is the AWS CLI and Docker. The example
files can be found at
https://github.com/scylladb/scylla-migrator/tree/b9be9fb684fb0e51bf7c8cbad79a1f42c6689103/docs/source/tutorials/dynamodb-to-scylladb-alternator
Set Up the Services and Populate the Source Database
We use Docker
Compose to define each service. Our docker-compose.yml
file looks as follows: Let’s break down this Docker Compose file.
We define the DynamoDB service by reusing the official image
amazon/dynamodb-local
. We use the TCP port 8000 for
communicating with DynamoDB. We define the Spark master and Spark
worker services by using a custom image (see below). Indeed, the
official
Docker images for Spark 3.5.1 only support Scala 2.12 for now,
but we need Scala 2.13. We mount the local directory
./spark-data
to the Spark master container path
/app
so that we can supply the Migrator jar and
configuration to the Spark master node. We expose the ports 8080
and 4040 of the master node to access the Spark UIs from our host
environment. We allocate 2 cores and 4 GB of memory to the Spark
worker node. As a general rule, we recommend allocating 2 GB of
memory per core on each worker. We define the ScyllaDB service by
reusing the official image scylladb/scylla
. We use the
TCP port 8001 for communicating with ScyllaDB Alternator. The Spark
services rely on a local Dockerfile located at path
./dockerfiles/spark/Dockerfile
. For the sake of
completeness, here is the content of this file, which you can
copy-paste: And here is the entry point used by the image, which
needs to be executable: This Docker image installs Java and
downloads the official Spark release. The entry point of the image
takes an argument that can be either master or worker to control
whether to start a master node or a worker node. Prepare your
system for building the Spark Docker image with the following
commands: mkdir spark-data
chmod +x entrypoint.sh
Finally, start all the services with the following command:
docker compose up
Your
system’s Docker daemon will download the DynamoDB and ScyllaDB
images and build our Spark Docker image. Check that you can access
the Spark cluster UI by opening http://localhost:8080 in your browser.
You should see your worker node in the workers list. Once all the
services are up, you can access your local DynamoDB instance and
your local ScyllaDB instance by using the standard AWS CLI. Make
sure to configure the AWS CLI as follows before running the
dynamodb commands:

# Set dummy region and credentials
aws configure set region us-west-1
aws configure set aws_access_key_id dummy
aws configure set aws_secret_access_key dummy

# Access DynamoDB
aws --endpoint-url http://localhost:8000 dynamodb list-tables

# Access ScyllaDB Alternator
aws --endpoint-url http://localhost:8001 dynamodb list-tables
The last preparatory step consists
of creating a table in DynamoDB and filling it with random data.
Create a file named create-data.sh
, make it
executable, and write the following content into it: This script
creates a table named Example and adds 1 million items to it. It
does so by invoking another script,
create-25-items.sh
, that uses the
batch-write-item
command to insert 25 items in a
single call: Every added item contains an id and five columns, all
filled with random data. Run the script:
./create-data.sh
and wait
for a couple of hours until all the data is inserted (or change the
last line of create-data.sh
to insert fewer items and
make the demo faster).
Perform the Migration
Once you have set up
the services and populated the source database, you are ready to
perform the migration. Download the latest stable release of the
Migrator in the spark-data
directory:

wget https://github.com/scylladb/scylla-migrator/releases/latest/download/scylla-migrator-assembly.jar \
  --directory-prefix=./spark-data
Create a configuration
file in spark-data/config.yaml
and write the following
content: This configuration tells the Migrator to read the items
from the table Example in the dynamodb service, and to write them
to the table of the same name in the scylla
service.
Finally, start the migration with the following command:
docker compose exec spark-master \
  /spark/bin/spark-submit \
  --executor-memory 4G \
  --executor-cores 2 \
  --class com.scylladb.migrator.Migrator \
  --master spark://spark-master:7077 \
  --conf spark.driver.host=spark-master \
  --conf spark.scylla.config=/app/config.yaml \
  /app/scylla-migrator-assembly.jar
This command
calls spark-submit
in the spark-master
service with the file scylla-migrator-assembly.jar
,
which bundles the Migrator and all its dependencies. In the
spark-submit
command invocation, we explicitly tell
Spark to use 4 GB of memory; otherwise, it would default to 1 GB
only. We also explicitly tell Spark to use 2 cores. This is not
really necessary as the default behavior is to use all the
available cores, but we set it for the sake of illustration. If the
Spark worker node had 20 cores, it would be better to use only 10
cores per executor to optimize the throughput (big executors
require more memory management operations, which decrease the
overall application performance). We would achieve this by passing
--executor-cores 10
, and the Spark engine would
allocate two executors for our application to fully utilize the
resources of the worker node. The migration process inspects the
source table, replicates its schema to the target database if it
does not exist, and then migrates the data. The data migration uses
the Hadoop framework under the hood to leverage the Spark cluster
resources. The migration process breaks down the data to transfer
chunks of about 128 MB each, and processes all the partitions in
parallel. Since the source is a DynamoDB table in our example, each
partition translates into a
scan segment to maximize the parallelism level when reading the
data. Here is a diagram that illustrates the migration process:
During the execution of the command, a lot of logs are printed,
mostly related to Spark scheduling. Still, you should be able to
spot the following relevant lines:

24/07/22 15:46:13 INFO migrator: ScyllaDB Migrator 0.9.2
…
24/07/22 15:46:20 INFO alternator: We need to transfer: 2 partitions in total
24/07/22 15:46:20 INFO alternator: Starting write…
24/07/22 15:46:20 INFO DynamoUtils: Checking for table existence at destination

And when the migration ends, you will see the following line printed:

24/07/22 15:46:24 INFO alternator: Done transferring table snapshot
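The "2 partitions in total" line matches a quick sanity check: the demo table holds roughly 200 MB of data (as noted below), and the Migrator splits the source into chunks of about 128 MB each.

import math

table_size_mb = 200    # approximate size of the demo table
chunk_size_mb = 128    # approximate chunk size used by the Migrator
print(math.ceil(table_size_mb / chunk_size_mb))   # 2 scan segments / partitions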
During the migration, it is possible to monitor the underlying Spark job by opening the Spark UI available at http://localhost:4040.
Example of a migration broken down in 6 tasks.
The Spark UI allows us to follow
the overall progress, and it can also show specific metrics such as
the memory consumption of an executor. In our example the size
of the source table is ~200 MB. In practice, it is common to
migrate tables containing several terabytes of data. If necessary,
and as long as your DynamoDB source supports a higher read
throughput level, you can increase the migration throughput by
adding more Spark worker nodes. The Spark engine will automatically
spread the workload between all the worker nodes.
Future Enhancements
The ScyllaDB team is continuously improving the
Migrator. Some of the upcoming features include: Support
for Savepoints with DynamoDB Sources: This will allow
users to resume the migration from a specific point in case of
interruptions. This is currently supported with Cassandra sources
only. Shard-Aware ScyllaDB Driver: The Migrator
will fully take advantage of ScyllaDB’s specific optimizations for
even faster migrations. Support for SQL-based
Sources: For instance, migrate from MySQL to ScyllaDB.
Conclusion
Thanks to the ScyllaDB Migrator, migrating data to
ScyllaDB has never been easier. With its robust architecture,
recent enhancements, and active development, the migrator is an
indispensable tool for ensuring a smooth and efficient migration
process. For more information, check out the
ScyllaDB Migrator lesson on ScyllaDB University. Another useful
resource is the official ScyllaDB
Migrator documentation. Are you using the Migrator? Any
specific feature you’d like to see? For any questions about your
specific use case or about the Migrator in general, tap into the
community knowledge on the ScyllaDB Community Forum.

Inside ScyllaDB's Continuous Optimizations for Reducing P99 Latency
How the ScyllaDB Engineering team reduced latency spikes during administrative operations through continuous monitoring and rigorous testing

In the world of databases, smooth and efficient operation is crucial. However, both ScyllaDB and its predecessor Cassandra have historically encountered challenges with latency spikes during administrative operations such as repair, backup, node addition, decommission, replacement, upgrades, compactions, etc. This blog post shares how the ScyllaDB Engineering team embraced continuous improvement to tackle these challenges head-on.

Protecting Performance by Measuring Operational Latency
Understanding and improving the performance of a database system like ScyllaDB involves continuous monitoring and rigorous testing. Each week, our team tackles this challenge by measuring performance under three types of workload scenarios: write, read, and mixed (50% read/write). We focus specifically on operational latency: how the system performs during typical and intensive operations like repair, node addition, node termination, decommission or upgrade.

Our Measurement Methodology
To ensure accurate results, we preload each cluster with data at a 10:1 data-to-memory ratio—equivalent to inserting 650GB on 64GB memory instances. Our benchmarks begin by recording the latency during a steady state to establish a baseline before initiating various cluster operations. We follow a strict sequence during testing:
- Preload data to simulate real user environments.
- Baseline latency measurement for a stable reference point.
- Sequential operational tests involving: repair operations via Scylla Manager, addition of three new nodes, termination and replacement of a node, and decommissioning of three nodes.
Latency is our primary metric; if it exceeds 15ms, we immediately start investigating it. We also monitor CPU instructions per operation and track reactor stalls, which are critical for understanding performance bottlenecks.

How We Measure Latency
Measuring latency effectively requires looking beyond the time it takes for ScyllaDB to process a command. We consider the entire lifecycle of a request:
- Response time: The time from the moment the query is initiated to when the response is delivered back to the client.
- Advanced metrics: We utilize High Dynamic Range (HDR) Histograms to capture and analyze latency from each cassandra-stress worker. This ensures we can compute a true representation of latency percentiles rather than relying on simple averages.
Results from these tests are meticulously compiled and compared with previous runs. This not only helps us detect any performance degradation but also highlights improvements. It keeps the entire team informed through detailed reports that include operation durations and latency breakdowns for both reads and writes.

Better Metrics, Better Performance
When we started to verify performance regularly, we mostly focused on the latencies. At that time, reports lacked many details (like HDR results), but were sufficient to identify performance issues. These included high latency when decommissioning a node, or issues with latencies during the steady state. Since then, we have optimized our testing approach to include more – and more detailed – metrics. This enables us to spot emerging performance issues sooner and root out the culprit faster. The improved testing approach has been a valuable tool, providing fast and precise feedback on how well (or not) our product optimization strategies are actually working in practice.
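To make the HDR approach above concrete, here is a hedged sketch of how per-worker latency histograms can be merged before reading percentiles. It assumes the hdrh package (a Python port of HdrHistogram); the value ranges and worker count are illustrative, not taken from our test setup.

from hdrh.histogram import HdrHistogram

# One histogram per cassandra-stress worker: values in microseconds,
# tracked from 1 us to 60 s with 3 significant digits.
workers = [HdrHistogram(1, 60_000_000, 3) for _ in range(4)]

# ... each worker records its own request latencies ...
workers[0].record_value(1200)   # 1.2 ms, purely illustrative

# Merge the raw distributions instead of averaging per-worker percentiles.
merged = HdrHistogram(1, 60_000_000, 3)
for h in workers:
    merged.add(h)

p99_us = merged.get_value_at_percentile(99.0)
print(f"P99 = {p99_us / 1000:.2f} ms")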
Total metrics
Our current reports include HDR Histogram details providing a comprehensive overview of system latency throughout the entire test. The number of reactor stalls (which are pauses in processing due to overloaded conditions) prompts immediate attention and action when it increases significantly. We take a similar approach to kernel callstacks, which are logged when kernel space holds locks for too long.

Management repair
After populating the cluster with data, we start our test from a full cluster repair using Scylla Manager and measure the latencies: during this period, the P99 latency was 3.87 ms for writes and 9.41 ms for reads. In comparison, during the "steady state" (when no operations were performed), the latencies were 2.23 ms and 3.87 ms, respectively.

Cluster growth
After the repair, we add three nodes to the cluster and conduct a similar latency analysis. Each cycle involves adding one node sequentially. These results provide a clear view of how latency changes and the duration required to expand the cluster.

Node Termination and Replacement
Following the cluster growth, one node is terminated and replaced with another.

Cluster Shrinkage
The test concludes with shrinking the cluster back to its initial size by decommissioning random nodes one by one.
These tests and reports are invaluable, uncovering numerous performance issues like increased latencies during decommission, and detecting long reactor stalls in row cache updates or short but frequent ones in sstable reader paths, leading to crucial fixes, improvements, and insights. This progress is evident in the numbers, where current latencies remain in the single-digit range under various conditions.

Looking Ahead
Our optimization journey is ongoing. ScyllaDB 6.0 introduced tablets, significantly accelerating cluster resizing to market-leading levels. The introduction of immediate node joining, which can start in parallel with accelerated data streaming, shows significant improvements across all metrics. With these improvements, we start measuring and optimizing not only the latencies during these operations but also the operation durations. Stay tuned for more details about these advancements soon.
Our proactive approach to tackling latency issues not only improves our database performance but also exemplifies our commitment to excellence. As we continue to innovate and refine our processes, ScyllaDB remains dedicated to delivering superior database solutions that meet the evolving needs of our users.

ScyllaDB Elasticity: Demo, Theory, and Future Plans
Watch a demo of how ScyllaDB's Raft and tablets initiatives play out with real operations on a real ScyllaDB cluster — and get a glimpse at what's next on our roadmap.

If you follow ScyllaDB, you've likely heard us talking about Raft and tablets initiatives for years now. (If not, read more on tablets from Avi Kivity and Raft from Kostja Osipov.) You might have even seen some really cool animations. But how does it play out with real operations on a real ScyllaDB cluster? And what's next on our roadmap – particularly in terms of user impacts?
ScyllaDB Co-Founder Dor Laor and Technical Director Felipe Mendes recently got together to answer those questions. In case you missed it or want a recap of the action-packed and information-rich session, here's the complete recording:
In case you want to skip to a specific section, here's a breakdown of what they covered when:
- 4:45 ScyllaDB already scaled linearly
- 8:11 Tablets + Raft = elasticity, speed, simplicity, TCO
- 11:45 Demo time!
- 30:23 Looking under the hood
- 46:19 Looking ahead
And in case you prefer to read vs watch, here are some key points…

Double Double Demo
After Dor shared why ScyllaDB adopted a new dynamic "tablets-based" data replication architecture for faster scaling, he passed the mic over to Felipe to show it in action. Felipe covers:
- Parallel scaling operations (adding and removing nodes) – speed and impact on latency
- How new nodes can start servicing increased demand almost instantly
- Dynamic load balancing based on node capacity, including automated resharding for new/different instance types
The demo starts with the following initial setup:
- 3-node cluster running on AWS i4i.xlarge
- Each node processing ~17,000 operations/second
- System load at ~50%
Here's a quick play-by-play…
Scale out:
- Bootstrapped 3 additional i4i.large nodes in parallel
- New nodes start serving traffic once the first tablets arrive, before the entire data set is received
- Tablets migration completed in ~3 minutes
- Writes are at sub-millisecond latencies; so are read latencies once the cache warms up (in the meantime, reads go to warmed-up nodes, thanks to heat-weighted load balancing)
Scale up:
- Added 3 nodes of a larger instance size (i4i.2xlarge, with double the capacity of the original nodes) and increased the client load
- The larger nodes receive more tablets and service almost twice the traffic of the smaller replicas (as appropriate for their higher capacity)
- The expanded cluster handles over 100,000 operations/second, with the potential to handle 200,000-300,000 operations/second
Downscale:
- A total of 6 nodes were decommissioned in parallel
- As part of the decommission process, tablets were migrated to other replicas
- Only 8 minutes were required to fully decommission 6 replicas while serving traffic

A Special Raft for the ScyllaDB Sea Monster
Starting with the ScyllaDB 6.0 release, topology metadata is managed by the Raft protocol. The process of adding, removing, and replacing nodes is fully linearized. This contributes to parallel operations, simplicity, and correctness.
Read barriers and fencing are two interesting aspects of our Raft implementation. Basically, if a node doesn't know the most recent topology, it's barred from responding to related queries. This prevents, for example, a node from observing an incorrect topology state in the cluster – which could result in data loss.
It also prevents a situation where a removed node or an external node using the same cluster name could silently come back or join the cluster simply by gossiping with another replica.
Another difference: schema versions are now linearized and use a TimeUUID to indicate the most up-to-date schema. Linearizing schema updates not only makes the operation safer; it also considerably improves performance. Previously, a schema change could take a while to propagate via gossip – especially in large cluster deployments. Now, this is gone. TimeUUIDs provide an additional safety net. Since schema versions now contain a time-based component, ScyllaDB can ensure schema versioning, which helps with:
- Improved visibility on the conditions triggering a schema change in the logs
- Accurately restoring a cluster backup
- Rejecting out-of-order schema updates

Tablets relieve operational pains
The latest changes simplify ScyllaDB operations in several ways:
- You don't need to perform operations one by one and wait in between them; you can just initiate the operation to add or remove all the nodes you need, all at once
- You no longer need to clean up after you scale the cluster
- Resharding (the process of changing the shard count of an existing node) is simple. Since tablets are already split on a per-shard boundary, resharding simply updates the shard ownership
- Managing the system_auth keyspace (for authentication) is no longer needed. All auth-related data is now automatically replicated to every node in the cluster
- Soon, repairs will also be automated

Expect less: typeless, sizeless, limitless
ScyllaDB's path forward from here certainly involves less: typeless, sizeless, limitless.
You could be typeless. You won't have to think about instance types ahead of time. Do you need a storage-intensive instance like the i3ens, or a throughput-intensive instance like the i4is? It no longer matters, and you can easily transition or even mix among these.
You could be sizeless. That means you won't have to worry about capacity planning when you start off. Start small and evolve from there.
You could also be limitless. You could start off anticipating a high throughput and then reduce it, or you could commit to a base and add on-demand usage if you exceed it.

Use Your Data in LLMs With the Vector Database You Already Have: The New Stack
Open source vector databases are among the top options out there for AI development, including some you may already be familiar with or even have on hand.
Vector databases allow you to enhance your LLM models with data from your internal data stores. Prompting the LLM with local, factual knowledge can allow you to get responses tailored to what your organization already knows about the situation. This reduces “AI hallucination” and improves relevance.
You can even ask the LLM to add references to the original data it used in its answer so you can check yourself. No doubt vendors have reached out with proprietary vector database solutions, advertised as a “magic wand” enabling you to assuage any AI hallucination concerns.
But, ready for some good news?
If you’re already using Apache Cassandra 5.0, OpenSearch or PostgreSQL, your vector database success is already primed. That’s right: There’s no need for costly proprietary vector database offerings. If you’re not (yet) using these free and fully open source database technologies, your generative AI aspirations are a good time to migrate — they are all enterprise-ready and avoid the pitfalls of proprietary systems.
For many enterprises, these open source vector databases are the most direct route to implementing LLMs — and possibly leveraging retrieval augmented generation (RAG) — that deliver tailored and factual AI experiences.
Vector databases store embedding vectors, which are lists of numbers representing spatial coordinates corresponding to pieces of data. Related data will have closer coordinates, allowing LLMs to make sense of complex and unstructured datasets for features such as generative AI responses and search capabilities.
RAG, a process skyrocketing in popularity, involves using a vector database to translate the words in an enterprise’s documents into embeddings to provide highly efficient and accurate querying of that documentation via LLMs.
Let’s look closer at what each open source technology brings to the vector database discussion:
Apache Cassandra 5.0 Offers Native Vector Indexing
With its latest version (currently in preview), Apache Cassandra has added to its reputation as an especially highly available and scalable open source database by including everything that enterprises developing AI applications require.
Cassandra 5.0 adds native vector indexing and vector search, as well as a new vector data type for embedding vector storage and retrieval. The new version has also added specific Cassandra Query Language (CQL) functions that enable enterprises to easily use Cassandra as a vector database. These additions make Cassandra 5.0 a smart open source choice for supporting AI workloads and executing enterprise strategies around managing intelligent data.
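As a rough sketch of what this looks like in CQL (the ai keyspace, documents table, and doc_embedding_idx index are illustrative names, and the 3-dimensional vector only keeps the example short; real embeddings typically have hundreds or thousands of dimensions):
CREATE KEYSPACE IF NOT EXISTS ai WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};
CREATE TABLE IF NOT EXISTS ai.documents ( doc_id uuid PRIMARY KEY, content text, embedding vector<float, 3> ); -- vector is the new CQL data type for embeddings
CREATE CUSTOM INDEX IF NOT EXISTS doc_embedding_idx ON ai.documents (embedding) USING 'StorageAttachedIndex';
SELECT doc_id, content FROM ai.documents ORDER BY embedding ANN OF [0.1, 0.8, 0.3] LIMIT 5; -- approximate-nearest-neighbor retrieval of the most similar rows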
OpenSearch Provides a Combination of Benefits
Like Cassandra, OpenSearch is another highly popular open source solution, one that many folks on the lookout for a vector database happen to already be using. OpenSearch offers a one-stop shop for search, analytics and vector database capabilities, while also providing exceptional nearest-neighbor search capabilities that support vector, lexical, and hybrid search and analytics.
With OpenSearch, teams can put the pedal down on developing AI applications, counting on the database to deliver the stability, high availability and minimal latency it’s known for, along with the scalability to account for vectors into the tens of billions. Whether developing a recommendation engine, generative AI agent or any other solution where the accuracy of results is crucial, those using OpenSearch to leverage vector embeddings and stamp out hallucinations won’t be disappointed.
The pgvector Extension Makes Postgres a Powerful Vector Store
Enterprises are no strangers to Postgres, which ranks among the most used databases in the world. Given that the database only needs the pgvector extension to become a particularly performant vector database, countless organizations are just a simple deployment away from harnessing an ideal infrastructure for handling their intelligent data.
pgvector is especially well suited to providing exact nearest-neighbor search, approximate nearest-neighbor search, and distance-based embedding search, and to using cosine distance (as recommended by OpenAI), L2 distance, and inner product to recognize semantic similarities. Efficiency with those capabilities makes pgvector a powerful and proven open source option for training accurate LLMs and RAG implementations, while positioning teams to deliver trustworthy AI applications they can be proud of.
Was the Answer to Your AI Challenges in Front of You All Along?
The solution to tailored LLM responses isn’t investing in some expensive proprietary vector database and then trying to dodge the very real risks of vendor lock-in or a bad fit. At least it doesn’t have to be. Recognizing that available open source vector databases are among the top options out there for AI development — including some you may already be familiar with or even have on hand — should be a very welcome revelation.
The post Use Your Data in LLMs With the Vector Database You Already Have: The New Stack appeared first on Instaclustr.
Who Did What to That and When? Exploring the User Actions Feature
NetApp recently released the user actions feature on the Instaclustr Managed Platform, allowing customers to search for user actions recorded against their accounts and organizations. We record over 100 different types of actions, with detailed descriptions of what was done, by whom, to what, and at what time.
This provides customers with visibility into the actions users are performing on their linked accounts. NetApp has always collected this information in line with our security and compliance policies, but now all important changes to your managed cluster resources are available self-service from the Console and the APIs.
In the past, this information was accessible only through support tickets when important questions such as “Who deleted my cluster?” and “When was the firewall rule removed from my cluster?” needed answers. This feature adds more self-discoverability of what your users are doing and what our support staff are doing to keep your clusters healthy.
This blog post provides a detailed walkthrough of this new feature at a moderate level of technical detail, with the hope of encouraging you to explore and better find the actions you are looking for.
For this blog, I’ve created two Apache Cassandra® clusters in one account and performed some actions on each. I’ve also created an organization linked to this account and performed some actions on that. This will allow a full example UI to be shown and demonstrate the type of “stories” that can emerge from typical operations via user actions.
Introducing Global Directory
During development, we decided to consolidate the other global account pages into a new centralized location, which we are calling the “Directory”.
This Directory provides you with the consolidated view of all organizations and accounts that you have access to, collecting global searches and account functions into a view that does not have a “selected cluster” context (i.e., global). For more information on how Organizations, Accounts and Clusters relate to each other, check out this blog.
Organizations serve as an efficient method to consolidate all associated accounts into a single unified, easily accessible location. They introduce an extra layer to the permission model, facilitating the management and sharing of information such as contact and billing details. They also streamline the process of Single Sign-On (SSO) and account creation.
Let’s log in and click on the new button:
This will take us to the new directory landing page:
Here, you will find two types of global searches: accounts and user actions, as well as account creation. Selecting the new “User Actions” item will take us to the new page. You can also navigate to these directory pages directly from the top right ‘folder’ menu:
User Action Search Page: Walkthrough
This is the new page we land on if we choose to search for user actions:
When you first enter, it finds the last page of actions that happened in the accounts and organizations you have access to. It will show both organization and account actions on a single consolidated page, even though they are slightly different in nature.
*Note: The accessible accounts and organizations are defined as those you are linked to as CLUSTER_ADMIN or OWNER.
*TIP: If you don’t want an account user to see user actions, give them READ_ONLY access.
You may notice a brief progress bar display as the actions are retrieved. At the time of writing, we have recorded nearly 100 million actions made by our customers over a 6-month period.
From here, you can increase the number of actions shown on each page and page through the results. Sorting is not currently supported on the actions table, but it is something we will be looking to add in the future. For each action found, the table will display:
- Action: What happened to your account (or organization)? There are over 100 tracked kinds of actions recorded.
- Domain: The specific account or organization name that the action targeted.
- Description: An expanded description of what happened, using context captured at the time of action. Important values are highlighted between square brackets, and the copy button will copy the first one into the clipboard.
- User: The user who performed the action, typically using the console/APIs or Terraform provider, but it can also be triggered by “Instaclustr Support” using our admin tools.
- For those actions marked with user “Instaclustr Support”, please reach out to support for more information about those actions we’ve taken on your behalf or visit https://support.instaclustr.com/hc/en-us.
- Local time: The action time from your local web browser’s perspective.
Additionally, for those who prefer programmatic access, the user action feature is fully accessible via our APIs, allowing for automation and integration into your existing workflows. Please visit our API documentation page here for more details.
Basic (super-search) Mode
Let’s say we only care about the “LeagueOfNations” organization domain; we can type ‘League’ and then click Search:
The name patterns are simple partial string patterns we look for as being “contained” within the name, such as “Car” in “Carlton”. These are case insensitive. They are not (yet!) general regular expressions.
Advanced “find a needle” Search Mode
Sometimes, searching by names is not precise enough; you may want to provide more detailed search criteria, such as time ranges or narrowing down to specific clusters or kinds of actions. Expanding the “Advanced Search” section will switch the page to a more advanced search criteria form, disabling the basic search area and its criteria.
Let’s say we only want to see the “Link Account” actions over the last week:
We select it from the actions multi-chip selector using the cursor (we could also type it and allow autocomplete to kick in). Hitting search will give you your needle; time to go chase that Carl guy down and ask why he linked that darn account:
The available criteria fields are as follows (additive in nature):
- Action: the kinds of actions, with a bracketed count of their frequency over the current criteria; if empty, all are included.
- Account: The account name of interest, or its UUID, which is useful to narrow the matches to a specific account. This helps when user, organization, and account names share string patterns, which makes the super-search less precise.
- Organization: the organization name of interest or its UUID.
- User: the user who performed the action.
- Description: matches against the value of an expanded description variable. This is useful because most actions mention the ‘target’ of the action, such as cluster-id, in the expanded description.
- Starting At: match actions starting from this time; this cannot be older than 12 months ago.
- Ending At: match actions up until this time.
Bonus Feature: Cluster Actions
While it’s nice to have this new search page, we wanted to build a higher-order question on top of it: What has happened to my cluster?
The answer can be found on the details tab of each cluster. When clicked on, it will take you directly to the user actions page with appropriate criteria to answer the question.
* TIP: we currently support entry into this view with a descriptionFormat queryParam, allowing you to save bookmarks to particular action ‘targets’. Further queryParams may be supported in the future for the remaining criteria: https://console2.instaclustr.com/global/searches/user-action?descriptionContextPattern=acde7535-3288-48fa-be64-0f7afe4641b3
Clicking this provides you the answer:
Future Thoughts
There are some future capabilities we will look to add, including the ability to subscribe to webhooks that trigger on some criteria. We would also like to add the ability to generate reports against a criterion or to run such things regularly and send them via email. Let us know what other feature improvements you would like to see!
Conclusion
This new capability allows customers to search for user actions directly without contacting support. It also provides improved visibility and auditing of what’s been changing on their clusters and who’s been making those changes. We hope you found this interesting and welcome any feedback for “higher-order” types of searches you’d like to see built on top of this new feature. What kind of common questions about user actions can you think of?
If you have any questions about this feature, please contact Instaclustr Support at any time. If you are not a current Instaclustr customer and you’re interested to learn more, register for a free trial and spin up your first cluster for free!
The post Who Did What to That and When? Exploring the User Actions Feature appeared first on Instaclustr.
Powering AI Workloads with Intelligent Data Infrastructure and Open Source
In the rapidly evolving technological landscape, artificial intelligence (AI) is emerging as a driving force behind innovation and efficiency. However, to harness its full potential, enterprises need suitable data infrastructures that can support AI workloads effectively.
This blog explores how intelligent data infrastructure, combined with open source technologies, is revolutionizing AI applications across various business functions. It outlines the benefits of leveraging existing infrastructure and highlights key open source databases that are indispensable for powering AI.
The Power of Open Source in AI Solutions
Open source technologies have long been celebrated for their flexibility, community support, and cost-efficiency. In the realm of AI these advantages are magnified. Here’s why open source is indispensable for AI-fueled solutions:
- Cost Efficiency: Open source solutions eliminate licensing fees, making them an attractive option for businesses looking to optimize their budgets.
- Community Support: A vibrant community of developers constantly improves these platforms, ensuring they remain cutting-edge.
- Flexibility and Customization: Open source tools can be tailored to meet specific needs, allowing enterprises to build solutions that align perfectly with their goals.
- Transparency and Security: With open source, you have visibility into the code, which allows for better security audits and trustworthiness.
Vector Databases: A Key Component for AI Workloads
Vector databases are increasingly indispensable for AI workloads. They store data in high-dimensional vectors, which AI models use to understand patterns and relationships. This capability is crucial for applications involving natural language processing, image recognition, and recommendation systems.
Vector databases use embedding vectors (lists of numbers) to represent data similarities and plot relationships spatially. For example, “plant” and “shrub” will have closer vector coordinates than “plant” and “car”. This allows enterprises to build their own LLMs, explore large text datasets, and enhance search capabilities.
Vector databases and embeddings also support retrieval augmented generation (RAG), which improves LLM accuracy by refining its understanding of new information. For example, RAG can let users query documentation by creating embeddings from an enterprise’s documents, translating words into vectors, finding similar words in the documentation, and retrieving relevant information. This data is then provided to an LLM, enabling it to generate accurate text answers for users.
The Role of Vector Databases in AI:
- Efficient Data Handling: Vector databases excel at handling large volumes of data efficiently, which is essential for training and deploying AI models.
- High Performance: They offer high-speed retrieval and processing of complex data types, ensuring AI applications run smoothly.
- Scalability: With the ability to scale horizontally, vector databases can grow alongside your AI initiatives without compromising performance.
Leveraging Existing Infrastructure for AI Workloads
Contrary to popular belief, it isn’t necessary to invest in new and exotic specialized data layer solutions. Your existing infrastructure can often support AI workloads with a few strategic enhancements:
- Evaluate Current Capabilities: Start by assessing your current data infrastructure to identify any gaps or areas for improvement.
- Upgrade Where Necessary: Consider upgrading components such as storage, network speed, and computing power to meet the demands of AI workloads.
- Integrate with AI Tools: Ensure your infrastructure is compatible with leading AI tools and platforms to facilitate seamless integration.
Open Source Databases for Enterprise AI
Several open source databases are particularly well-suited for enterprise AI applications. Let’s look at the 3 free open source databases that enterprise teams can leverage as they scale their intelligent data infrastructure for storing those embedding vectors:
PostgreSQL® and pgvector
“The world’s most advanced open source relational database”, PostgreSQL is also one of the most widely deployed, meaning that most enterprises will already have a strong foothold in the technology. The pgvector extension turns Postgres into a high-performance vector store, offering a path of least resistance for organizations familiar with PostgreSQL to quickly stand up intelligent data infrastructure.
From a RAG and LLM training perspective, pgvector excels at enabling distance-based embedding search, exact nearest neighbor search, and approximate nearest neighbor search. pgvector efficiently captures semantic similarities using L2 distance, inner product, and (the OpenAI-recommended) cosine distance. Teams can also harness OpenAI’s embeddings model (available as an API) to calculate embeddings for documentation and user queries. As an enterprise-ready open source option, pgvector is an already-proven solution for achieving efficient, accurate, and performant LLMs, helping equip teams to confidently launch differentiated and AI-fueled applications into production.
OpenSearch®
Because OpenSearch is a mature search and analytics engine already popular with a wide swath of enterprises, new and current users will be glad to know that the open source solution is ready to up the pace of AI application development as a singular search, analytics, and vector database.
OpenSearch has long offered low latency, high availability, and the scale to handle tens of billions of vectors while backing stable applications. It provides great nearest-neighbor search functionality to support vector, lexical, and hybrid search and analytics. These capabilities significantly simplify the implementation of AI solutions, from generative AI agents to recommendation engines with trustworthy results and minimal hallucinations.
Apache Cassandra® 5.0 with Native Vector Indexing
Known for its linear scalability and fault-tolerance on commodity hardware or cloud infrastructure, Apache Cassandra is a reliable choice for enterprise-grade AI applications. The newest version of the highly popular open source Apache Cassandra database introduces several new features built for AI workloads. It now includes Vector Search and Native Vector indexing capabilities.
Additionally, there is a new vector data type specifically for saving and retrieving embedding vectors, and new CQL functions for easily executing on those capabilities. By adding these features, Apache Cassandra 5.0 has emerged as an especially ideal database for intelligent data strategies and for enterprises rapidly building out AI applications across myriad use cases.
Cassandra’s earned reputation for delivering high availability and scalability is now paired with AI-specific functionality, making it one of the most enticing open source options for enterprises.
Open Source Opens the Door to Successful AI Workloads
Clearly, given the tremendously rapid pace at which AI technology is advancing, enterprises cannot afford to wait to build out differentiated AI applications. But in this pursuit, engaging with the wrong proprietary data-layer solutions—and suffering the pitfalls of vendor lock-in or simply mismatched features—can easily be (and, for some, already is) a fatal setback. Instead, tapping into one of the very capable open source vector databases available will allow enterprises to put themselves in a more advantageous position.
When leveraging open source databases for AI workloads, consider the following:
- Data Security: Ensure robust security measures are in place to protect sensitive data.
- Scalability: Plan for future growth by choosing solutions that can scale with your needs.
- Resource Allocation: Allocate sufficient resources, such as computing power and storage, to support AI applications.
- Governance and Compliance: Adhere to governance and compliance standards to ensure responsible use of AI.
Conclusion
Intelligent data infrastructure and open source technologies are revolutionizing the way enterprises approach AI workloads. By leveraging existing infrastructure and integrating powerful open source databases, organizations can unlock the full potential of AI, driving innovation and efficiency.
Ready to take your AI initiatives to the next level? Leverage a single platform to help you design, deploy and monitor the infrastructure to support the capabilities of PostgreSQL with pgvector, OpenSearch, and Apache Cassandra 5.0 today.
And for more insights and expert guidance, don’t hesitate to contact us and speak with one of our open source experts!
The post Powering AI Workloads with Intelligent Data Infrastructure and Open Source appeared first on Instaclustr.
How Does Data Modeling Change in Apache Cassandra® 5.0 With Storage-Attached Indexes?
Data modeling in Apache Cassandra® is an important topic—how you model your data can affect your cluster’s performance, costs, etc. Today I’ll be looking at a new feature in Cassandra 5.0 called Storage-Attached Indexes (SAI), and how they affect the way you model data in Cassandra databases.
First, I’ll briefly cover what SAIs are (for more information about SAIs, check out this post). Then I’ll look at 3 use cases where your modeling strategy could change with SAI. Finally, I’ll talk about the benefits and constraints of SAIs.
What Are Storage-Attached Indexes?
From the Cassandra 5.0 Documentation, Storage-Attached Indexes (SAIs) “[provide] an indexing mechanism that is closely integrated with the Cassandra storage engine to make data modeling easier for users.” Secondary indexing, meaning indexing values on columns that are not part of the table’s primary key, has been available in Cassandra before (as SASI and 2i). However, SAI will replace that existing functionality: it is deprecated in 5.0 and tentatively slated for removal in Cassandra 6.0.
This is because SAI improves upon the older methods in a lot of key ways. For one, according to the devs, SAI is the fastest indexing method for Cassandra clusters, a performance boost that makes indexing practical in production environments. It also has lower data storage overhead than prior implementations, which reduces storage costs, and it lowers latency when working with indexes, avoiding the loss of user engagement that high latency causes.
How Do SAIs work?
SAIs are implemented as part of the SSTables, or Sorted String Tables, of a Cassandra database. This is because SAIs index Memtables and SSTables as they are written. At read time, SAI filters from both in-memory and on-disk sources and merges the results into a set of indexed columns. I’m not going to go into too much detail here because there are a lot of existing resources on this exciting topic: see the Cassandra 5.0 Documentation and the Instaclustr site for examples.
The main thing to keep in mind is that SAIs are attached to Cassandra’s storage engine, which makes them much more performant from speed, scalability, and data storage angles. This means that you can use indexing reliably in production beginning with Cassandra 5.0, which opens the door to improving your data models very quickly.
To learn more about how SAIs work, check out this piece from the Apache Cassandra blog.
What Is SAI For?
SAI is a filtering engine, and while it has some functionality overlap with search engines, the documentation states directly that it is “not an enterprise search engine” (source).
SAI is meant for creating filters on non-primary-key or composite partition keys (source), essentially meaning that you can enable a ‘WHERE’ clause on any column in your Cassandra 5.0 database. This makes queries a lot more flexible without sacrificing latency or storage space as with prior methods.
How Can We Use SAI When Data Modeling in Cassandra 5.0?
Because of the increased scalability and performance of SAIs, data modeling in Cassandra 5.0 will most definitely change.
For instance, you will be able to search collections more thoroughly and easily, so indexing becomes more of an option when designing your Cassandra queries. SAI also enables new query types, which can improve your existing queries; and by Cassandra’s query-driven design paradigm, that in turn changes your table design.
But what if you’re not on a greenfield project and want to use SAIs? No problem! SAI is backwards-compatible, and you can migrate your application one index at a time if you need.
How Do Storage-Attached Indexes Affect Data Modeling in Apache Cassandra 5.0?
Cassandra’s SAI was designed with data modeling in mind (source). It unlocks new query patterns that make data modeling easier in quite a few cases. In the Cassandra team’s words: “You can create a table that is most natural for you, write to just that table, and query it any way you want.” (source)
I think another great way to look at how SAIs affect data modeling is by looking at some queries that could be asked of SAI data. This is because Cassandra data modeling relies heavily on the queries that will be used to retrieve the data. I’ll take a look at 2 use cases: indexing as a means of searching a collection in a row and indexing to manage a one-to-many relationship.
Use Case: Querying on Values of Non-Primary-Key Columns
You may often find yourself searching a table for records with a particular value in a particular column. An example may be a search form for a large email inbox with lots of filters. You could find yourself looking at a record like:
- Subject
- Sender
- Receiver
- Body
- Time sent
Your table creation may look like:
CREATE KEYSPACE IF NOT EXISTS inbox WITH REPLICATION = { 'class' : 'SimpleStrategy', 'replication_factor' : 3 };
CREATE TABLE IF NOT EXISTS inbox.emails ( id int, sender text, receivers text, subject text, body text, timeSent timestamp, PRIMARY KEY (id));
If you allow users to search for a particular subject or sender, and the data set is large, not having SAIs could make query times painful:
SELECT * FROM inbox.emails WHERE sender = 'sam.example@examplemail.com';
To fix this problem, we can create SAI indexes on our sender, receivers, subject, and body fields:
CREATE CUSTOM INDEX sender_sai_idx ON inbox.emails (sender) USING 'StorageAttachedIndex' WITH OPTIONS = {'case_sensitive': 'false', 'normalize': 'true', 'ascii': 'true'};
CREATE CUSTOM INDEX IF NOT EXISTS receiver_sai_idx ON inbox.emails (receivers) USING 'StorageAttachedIndex' WITH OPTIONS = {'case_sensitive': 'false', 'normalize': 'true', 'ascii': 'true'};
CREATE CUSTOM INDEX body_sai_idx ON inbox.emails (body) USING 'StorageAttachedIndex' WITH OPTIONS = {'case_sensitive': 'false', 'normalize': 'true', 'ascii': 'true'};
CREATE CUSTOM INDEX subject_sai_idx ON inbox.emails (subject) USING 'StorageAttachedIndex' WITH OPTIONS = {'case_sensitive': 'false', 'normalize': 'true', 'ascii': 'true'};
Once you’ve established the indexes, you can run the same query and it will automatically use the SAI index to find all emails with a sender of 'sam.example@examplemail.com', or match on subject or body in the same way. Note that although the data model changed with the inclusion of the indexes, the SELECT query does not change, and the fields of the table stayed the same as well!
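For example, a lookup by subject remains a plain equality filter (the subject string below is a made-up value); because the index above was created with case_sensitive set to false, the match is effectively case-insensitive:
SELECT id, sender, subject FROM inbox.emails WHERE subject = 'Quarterly report'; -- served by subject_sai_idx, no ALLOW FILTERING needed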
Use Case: Managing One-To-Many Relationships
Going back to the previous example, one email could have many recipients. Prior to secondary indexes, you would need to scan the recipients of every row in the table in order to query on recipients. This could be solved in a few ways. One is to create a join table for recipients that contains an id, email id, and recipient. This becomes complicated when you add the constraint that each recipient should only appear once per email. With SAI, we now have an index-based solution: create an index on a collection of recipients for each row.
The script to create the table and indices changes a bit:
CREATE TABLE IF NOT EXISTS inbox.emails ( id int, sender text, receivers set<text>, subject text, body text, timeSent timestamp, PRIMARY KEY (id));
The text type of receivers changes to a set<text>. A set is used because each recipient should only occur once per email. This takes the logic you would have had to implement for the join table solution and moves it to Cassandra.
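For illustration, an insert into this revised table might look like the following (the addresses and subject are made-up values); the set literal automatically discards duplicate addresses:
INSERT INTO inbox.emails (id, sender, receivers, subject, body, timeSent) VALUES (1, 'sam.example@examplemail.com', {'ana@example.com', 'bob@example.com'}, 'Weekly sync', 'Agenda attached.', toTimestamp(now())); -- receivers is a set<text>, so each address appears at most once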
The indexing code remains mostly the same, except for the creation of the index for receivers:
CREATE CUSTOM INDEX IF NOT EXISTS receivers_sai_idx ON inbox.emails (receivers) USING 'StorageAttachedIndex';
That’s it! One line of CQL and there’s now an index on receivers. We can query for emails with a particular receiver:
SELECT * FROM inbox.emails WHERE receivers CONTAINS 'sam.example@examplemail.com';
There are many one-to-many relationships that can be simplified in Cassandra with the use of secondary indexes and SAI.
What Are the Benefits of Data Modeling With Storage Attached Indexes?
There are many benefits to using SAI when data modeling in Cassandra 5.0:
- Query performance: because of SAI’s implementation, it has much faster query speeds than previous implementations, and indexed data is generally faster to search than unindexed data. This means you have more flexibility to search within your data and write queries that search non-primary-key columns and collections.
- Move over piecemeal: SAI’s backwards compatibility, coupled with how little your table structure has to change to add SAIs, means you can migrate your data models piece by piece, making the move easier.
- Data storage overhead: SAI has much lower data overhead than previous secondary index implementations, meaning more flexibility in what you can store in your data models without impacting overall storage needs.
- More complex queries/features: SAI allows you to write much more thorough queries when looking through indexed data, and offers up a lot of new functionality, like:
- Vector Search
- Numeric Range queries
- AND queries within indexes
- Support for map/set/
What Are the Constraints of Storage-Attached Indexes?
While there are benefits to SAI, there are also a few constraints, including:
- Because SAI is attached to the SSTable mechanism, the performance of queries on indexed columns will be “highly dependent on the compaction strategy in use” (per the Cassandra 5.0 CEP-7)
- SAI is not designed for unlimited-size data sets, such as logs; indexing on a dataset like that would cause performance issues. The reason for this is read latency at higher row counts spread across a cluster. It is also related to consistency level (CL), as the higher the CL is, the more nodes you’ll have to ping on larger datasets. (Source).
- Query complexity: while you can query as many indexes as you like, when you do so you incur a cost related to the number of index values processed, so design queries to select from as few indexes as possible.
- You cannot index multiple columns in one index, as there is a 1-to-1 mapping of an SAI index to a column. You can, however, create separate indexes and query them simultaneously (see the sketch after this list).
- This is a v1, and some features, like the LIKE comparison for strings, the OR operator, and global sorting, are all slated for v2.
- Disk usage: SAI uses an extra 20-35% disk space over unindexed data; note that this is much less than previous versions of indexing consumed (source). Still, you shouldn’t make every column an index unless you need to, which saves disk space and maintains query performance.
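As a small sketch of querying separate indexes together (referenced in the list above), the sender and subject indexes from the earlier inbox.emails example can both serve a single query, since SAI supports AND across indexes; the filter values here are made up:
SELECT id, subject FROM inbox.emails WHERE sender = 'sam.example@examplemail.com' AND subject = 'build failure'; -- both sender_sai_idx and subject_sai_idx are consulted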
Conclusion
SAI is a very robust solution for secondary indexes, and their addition to Cassandra 5.0 opens the door for several new data modelling strategies—from searching non-primary-key columns, to managing one-to-many relationships, to vector search. To learn more about SAI, read this post from the Instaclustr by NetApp blog, or check out the documentation for Cassandra 5.0.
If you’d like to test SAI without setting up and configuring Cassandra yourself, Instaclustr has a free trial and you can spin up Cassandra 5.0 clusters today through a public preview! Instaclustr also offers a bunch of educational content about Cassandra 5.0.
The post How Does Data Modeling Change in Apache Cassandra® 5.0 With Storage-Attached Indexes? appeared first on Instaclustr.
Cassandra Lucene Index: Update
**An important update regarding support of Cassandra Lucene Index for Apache Cassandra® 5.0 and the retirement of Apache Lucene Add-On on the Instaclustr Managed Platform.**
Instaclustr by NetApp has been maintaining the new fork of the Cassandra Lucene Index plug-in since its announcement in 2018. After extensive evaluation, we have decided not to upgrade the Cassandra Lucene Index to support Apache Cassandra® 5.0. This decision aligns with the evolving needs of the Cassandra community and the capabilities offered by the Storage–Attached Indexing (SAI) in Cassandra 5.0.
SAI introduces significant improvements in secondary indexing, while simplifying data modeling and creating new use cases in Cassandra, such as Vector Search. While SAI is not a direct replacement for the Cassandra Lucene Index, it offers a more efficient alternative for many indexing needs.
For applications requiring advanced indexing features, such as full-text search or geospatial queries, users can consider external integrations, such as OpenSearch®, that offer numerous full-text search and advanced analysis features.
We are committed to maintaining the Cassandra Lucene Index for currently supported and newer versions of Apache Cassandra 4 (including minor and patch-level versions) for users who rely on its advanced search capabilities. We will continue to release bug fixes and provide necessary security patches for the supported versions in the public repository.
Retiring Apache Lucene Add-On for Instaclustr for Apache Cassandra
Similarly, Instaclustr is commencing the retirement process of the Apache Lucene add-on on its Instaclustr Managed Platform. The offering will move to the Closed state on July 31, 2024. This means that the add-on will no longer be available for new customers.
However, it will continue to be fully supported for existing customers with no restrictions on SLAs, and new deployments will be permitted by exception. Existing customers should be aware that the add-on will not be supported for Cassandra 5.0. For more details about our lifecycle policies, please visit our website here.
Instaclustr will work with the existing customers to ensure a smooth transition during this period. Support and documentation will remain in place for our customers running the Lucene add–on on their clusters.
For those transitioning to, or already using the Cassandra 5.0 beta version, we recommend exploring how Storage-Attached Indexing can help you with your indexing needs. You can try the SAI feature as part of the free trial on the Instaclustr Managed Platform.
We thank you for your understanding and support as we continue to adapt and respond to the community’s needs.
If you have any questions about this announcement, please contact us at support@instaclustr.com.
The post Cassandra Lucene Index: Update appeared first on Instaclustr.
Vector Search in Apache Cassandra® 5.0
Apache Cassandra® is moving towards an AI-driven future. Cassandra’s high availability and ability to scale with large amounts of data have made it an obvious choice for many AI-driven applications.
In the constantly evolving world of data analysis, we have seen significant transformations over time. The capabilities needed now are those which support a new data type (vectors) that has accelerated in popularity with the growing adoption of Generative AI and large language models. But what does this mean?
Data Analysis Past, Present and Future
Traditionally, data analysis was about looking back and understanding trends and patterns. Big data analytics and predictive ML applications emerged as powerful tools for processing and analyzing vast amounts of data to make informed predictions and optimize decision-making.
Today, real-time analytics has become the norm. Organizations require timely insights to respond swiftly to changing market conditions, customer demands, and emerging trends.
Tomorrow’s world will be shaped by the emergence of generative AI. Generative AI is a transformative approach that goes way beyond traditional analytics: it leverages machine learning models trained on diverse datasets to produce something entirely new while retaining similar characteristics and patterns learned from the training data.
By training machine learning models on vast and varied datasets, businesses can unlock the power of generative AI. These models understand the underlying patterns, structures, and meaning in the data, enabling them to generate novel and innovative outputs. Whether it’s generating lifelike images, composing original music, or even creating entirely new virtual worlds, generative AI pushes the boundaries of what is possible.
Broadly speaking, these generative AI models store the ingested data in numerical representation, known as vectors (we’ll dive deeper later). Vectors capture the essential features and characteristics of the data, allowing the models to understand and generate outputs that align with those learned patterns.
With vector search capabilities in Apache Cassandra® 5.0 (now in public preview on the Instaclustr Platform), we enable the storage of these vectors and efficient searching and retrieval of them based on their similarity to the query vector. ‘Vector similarity search’ opens a whole new world of possibilities and lies at the core of generative AI.
The ability to create novel and innovative outputs that were never explicitly present in the training data has significant implications across many creative, exploratory, and problem-solving uses. Organizations can now unlock the full potential of their data by performing complex similarity-based searches at scale, which is key to supporting AI workloads such as recommendation systems, image, text, or voice matching applications, fraud detection, and much more.
What Is Vector Search?
Vectors are essentially just a giant list of numbers across as many or as few dimensions as you want… but really they are just a list (or array) of numbers!
Embeddings, by comparison, are a numerical representation of something else, like a picture, song, topic, book; they capture the semantic meaning behind each vector and encode those semantic understandings as a vector.
Take, for example, the words “couch”, “sofa”, and “chair”. All 3 words can be represented as individual vectors, and while they are semantically similar (they are pieces of furniture, after all), “couch” and “sofa” are more closely related to each other than to “chair”.
These semantic relationships are encoded as embeddings—dense numerical vectors that represent the semantic meaning of each word. As such, the embeddings for “couch” and “sofa” will be geometrically closer to each other than to the embedding for “chair.” “Couch”, “sofa”, and “chair” will all be closer than, say, to the word “erosion”.
Take a look at Figure 1 below to see that relationship:
Figure 1: While “Sofa”, “Couch”, and “Chair” are all related, “Sofa” and “Couch” are semantically closer to each other than “Chair” (i.e., they are both designed for multiple people, while a chair is meant for just one).
“Erosion”, on the other hand, shares practically no resemblance to the other vectors, which is why it is geometrically much further away.
When computers search for things, they typically rely on exact matching, text matching, or open searching/object matching. Vector search, on the other hand, works by using these embeddings to search for semantics as opposed to terms, allowing you to get a much closer match in unstructured datasets based on the meaning of the word.
Under the hood we can generate embeddings for our search term, and then using some vector math, find other embeddings that are geometrically close to our search term. This will return results that are semantically related to our search. For example, if we search for “sofa”, then “couch” followed by “chair” will be returned.
How Does Vector Search Work With Large Language Models (LLMs)?
Now that we’ve established an understanding of vectors, embeddings, and vector searching, let’s explore the intersection of vector searching and Large Language Models (LLMs), a trending topic within the world of sophisticated AI models.
With Retrieval Augmented Generation (RAG), we use the vectors to look up related text or data from an external source as part of our query; alongside those vectors, we also retrieve the original human-readable text. We then feed the original text alongside the original prompt into the LLM for generated text output, allowing the LLM to use the additional provided context while generating the response.
(PGVector does a very similar thing with PostgreSQL; check out our blog where we talk all about it).
By utilizing this approach, we can take known information, query it in a natural language way, and receive relevant responses. Essentially, we input questions, prompts, and data sources into a model and generate informative responses based on that data.
The combination of Vector Search and LLMs opens up exciting possibilities for more efficient and contextually rich GenAI applications powered by Cassandra.
So What Does This Mean for Cassandra?
Well, with the additions of CEP-7 (Storage Attached Index, SAI) and CEP-30 (Approximate Nearest Neighbor, ANN Vector Search via SAI + Lucene) to the Cassandra 5.0 release, we can now store vectors and create indexes on them to perform similarity searches. This feature alone broadens the scope of possible Cassandra use cases.
We utilize ANN (or “Approximate Nearest Neighbor”) as the fast, approximate search, as opposed to k-NN (or “k-Nearest Neighbor”), which is a slow and exact algorithm for large, high-dimensional data; speed and performance are the priorities here.
Just about any application could utilize a vector search feature. In an existing data model, you would simply add a column of vectors (vector is a new CQL data type that stores 32-bit floats and takes a dimension input parameter). You would then create a vector search index to enable similarity searches on columns containing vectors.
Searching complexity scales linearly with vector dimensions, so it is important to keep in mind vector dimensions and normalization. Ideally, we want normalized vectors (as opposed to vectors of all different sizes) when we look for similarity as it will lead to faster and more ‘correct’ results. With that, we can now filter our data by vector.
Let’s walk through an example CQLSH session to demonstrate the creation, insertion, querying, and the new syntax that comes along with this feature.
1. Create your keyspace
CREATE KEYSPACE catalog WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};
2. Create your table or add a column to an existing table with the vector type (ALTER TABLE)
CREATE TABLE IF NOT EXISTS catalog.products ( product_id text PRIMARY KEY, product_vector VECTOR<float, 3> );
*A vector dimension of 3 is chosen for simplicity; the dimensions of your vectors will depend on the vectorization method*
3. Create your custom index with SAI on the vector column to enable ANN search
CREATE CUSTOM INDEX ann_index_for_similar_products ON catalog.products (product_vector) USING 'StorageAttachedIndex';
*Your custom index (ann_index_for_similar_products in this case) can be created with your choice of similarity function: the default cosine, dot product, or Euclidean similarity. See here*
4. Load vector data using CQL insert
INSERT INTO catalog.products (product_id, product_vector) VALUES ('SKU1', [8, 2.3, 58]);
INSERT INTO catalog.products (product_id, product_vector) VALUES ('SKU3', [1.2, 3.4, 5.6]);
INSERT INTO catalog.products (product_id, product_vector) VALUES ('SKU5', [23, 18, 3.9]);
5. Query the data to perform a similarity search using the new CQL ANN operator
SELECT * FROM catalog.products ORDER BY product_vector ANN OF [3.4, 7.8, 9.1] LIMIT 1;
Result:
product_id | product_vector
-----------+----------------
SKU5       | [23, 18, 3.9]
How Can You Use This?
E-Commerce (for Content Recommendation)
We can start to think about simple e-commerce use cases like when a customer adds a product or service to their cart. You can now have a vector stored with that product, query the ANN algorithm (aka pull your next nearest neighbor(s)/most similar datapoint(s)), and display to that customer ‘similar’ products in ‘real-time.’
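Sticking with the catalog.products table from the walkthrough above, that lookup can be a single ANN query in which the vector of the item just added to the cart (SKU1’s [8, 2.3, 58] in this sketch) becomes the search term; this is an illustration rather than a drop-in recommendation engine:
SELECT product_id, product_vector FROM catalog.products ORDER BY product_vector ANN OF [8, 2.3, 58] LIMIT 3; -- the 3 products closest to the item in the cart, per the index's similarity function
Note that the item itself comes back as its own closest match, so in practice you would request one extra row and drop it in application code.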
The above example data model allows you to store and retrieve products based on their unique product_id. You can also perform queries or analysis based on the vector embeddings, such as finding similar products or generating personalized recommendations. The vector column can store the vector representation of the product, which can be generated using techniques like Word2Vec, Doc2Vec, or other vectorization methods.
In a practical application, you might populate this table with actual product data, including the unique identifier, name, category, and the corresponding vector embeddings for each product. The vector embeddings can be generated elsewhere using machine learning models and then inserted into a C* table.
Content Generation (LLM or Data Augmentation)
What about the real-time, generative AI applications that many companies (old and new) are focusing on today and in the near future?
Well, Cassandra can serve the persistence of vectors for such applications. With this storage, we now have persisted data that can add supplemental context to an LLM query. This is the ‘RAG’ referred to earlier in this piece and enables uses like AI chatbots.
We can think of the ever-popular ChatGPT that tends to hallucinate on anything beyond 2021—its last knowledge update. With vector similarity search (VSS) on Cassandra, it is possible to augment such an LLM with stored vectors.
Think of the various PDF plugins behind ChatGPT. We can feed the model PDFs, which, at a high level, it then vectorizes and stores in Cassandra (there are more intermediary steps to this) and is then ready to be queried in a natural language way.
Here is a list of just a subset of possible industries/use-cases that can extract the value of such a feature-add to the open source Cassandra project:
- Cybersecurity (Anomaly Detection)
- Ex. Financial Fraud Detection
- Language Processing (Language Translation)
- Digital Marketing (Natural Language Generation)
- Document Management (Document Clustering)
- Data Integration (Entity Matching or Deduplication)
- Image Recognition (Image Similarity Search)
- Ex. Healthcare (Medical Imaging)
- Voice Assistants (Voice and Speech Recognition)
- Customer Support (AI Chatbots)
- Pharmaceutical (Drug Discovery)
Cassandra 5.0 is in beta release on the Instaclustr Platform today (with GA coming soon)! Stay tuned for Part 2 of this blog series, where I will delve deeper and demonstrate real-world usage of VSS in Cassandra 5.0.
The post Vector Search in Apache Cassandra® 5.0 appeared first on Instaclustr.
Medusa 0.16 was released
The k8ssandra team is happy to announce the release of Medusa for Apache Cassandra™ v0.16. This is a special release, as we did a major overhaul of the storage classes implementation. We now have less code (and fewer dependencies) while providing much faster and more resilient storage communications.
Back to ~basics~ official SDKs
Medusa has been an open source project for about 4 years now, and a private one for a few more. Over such a long time, other software it depends upon (or doesn’t) evolves as well. More specifically, the SDKs of the major cloud providers evolved a lot. We decided to check if we could replace a lot of custom code doing asynchronous parallelisation and calls to the cloud storage CLI utilities with the official SDKs.
Our storage classes so far relied on two different ways of interacting with the object storage backends:
- Apache Libcloud, which provided a Python API for abstracting ourselves from the different protocols. It was convenient and fast for uploading a lot of small files, but very inefficient for large transfers.
- Specific cloud vendors CLIs, which were much more efficient with large file transfers, but invoked through subprocesses. This created an overhead that made them inefficient for small file transfers. Relying on subprocesses also created a much more brittle implementation which led the community to create a lot of issues we’ve been struggling to fix.
To cut a long story short, we did it!
- We started by looking at S3, where we went for the official boto3. As it turns out, boto3 does all the chunking, throttling, retries and parallelisation for us. Yay!
- Next we looked at GCP. Here we went with TalkIQ’s gcloud-aio-storage. It works very well for everything, including the large files. The only thing missing is the throughput throttling.
- Finally, we used Azure’s official SDK to cover Azure compatibility. Sadly, this still works without throttling as well.
Right after finishing these replacements, we spotted the following improvements:
- The integration tests duration against the storage backends dropped from ~45 min to ~15 min.
- This means Medusa became far more efficient.
- There is now much less time spent managing storage interaction thanks to it being asynchronous to the core.
- The Medusa uncompressed image size we bundle into k8ssandra dropped from ~2GB to ~600MB and its build time went from 2 hours to about 15 minutes.
- Aside from giving us much faster feedback loops when working on k8ssandra, this should help k8ssandra itself move a little bit faster.
- The file transfers are now much faster.
- We observed up to several hundreds of MB/s per node when moving data from a VM to blob storage within the same provider. The available network speed is the limit now.
- We are also aware that consuming the whole network throughput is not great. That’s why we now have proper throttling for S3 and are working on a solution for this in other backends too.
The only compromise we had to make was to drop Python 3.6 support. This is because the asyncio features we rely on only come in Python 3.7.
The other good stuff
Even though we are the happiest about the storage backends, there are a number of changes that should not go without mention:
- We fixed a bug with hierarchical storage containers in Azure. This flavor of blob storage works more like a regular file system, meaning it has a concept of directories. None of the other backends do this (including the vanilla Azure ones), and Medusa was not dealing gracefully with this.
- We are now able to build Medusa images for multiple architectures, including the arm64 one.
- Medusa can now purge backups of nodes that have been decommissioned, meaning they are no longer present in the most recent backups. Use the new medusa purge-decommissioned command to trigger such a purge.
Upgrade now
We encourage all Medusa users to upgrade to version 0.16 to benefit from all these storage improvements, which make it much faster and more reliable.
Medusa v0.16 is the default version in the newly released k8ssandra-operator v1.9.0, and it can be used with previous releases by setting the .spec.medusa.containerImage.tag field in your K8ssandraCluster manifests.
Building a 100% ScyllaDB Shard-Aware Application Using Rust
I wrote a web transcript of the talk I gave with my colleagues Joseph and Yassir at [Scylla Su...
Learning Rust the hard way for a production Kafka+ScyllaDB pipeline
This is the web version of the talk I gave at [Scylla Summit 2022](https://www.scyllad...
Reaper 3.0 for Apache Cassandra was released
The K8ssandra team is pleased to announce the release of Reaper 3.0. Let’s dive into the main new features and improvements this major version brings, along with some notable removals.
Storage backends
Over the years, we regularly discussed dropping support for Postgres and H2 with the TLP team. The effort for maintaining these storage backends was moderate, despite our lack of expertise in Postgres, as long as Reaper’s architecture was simple. Complexity grew with more deployment options, culminating with the addition of the sidecar mode.
Some features require different consensus strategies depending on the backend, which sometimes led to implementations that worked well with one backend and were buggy with others.
In order to allow building new features faster, while providing a consistent experience for all users, we decided to drop the Postgres and H2 backends in 3.0.
Apache Cassandra and the managed DataStax Astra DB service are now the only production storage backends for Reaper. The free tier of Astra DB will be more than sufficient for most deployments.
Reaper does not generally require high availability - even complete data loss has mild consequences. Where Astra is not an option, a single Cassandra server can be started on the instance that hosts Reaper, or an existing cluster can be used as a backend data store.
Adaptive Repairs and Schedules
One of the pain points we observed when people start using Reaper is understanding the segment orchestration and knowing how the default timeout impacts the execution of repairs.
Repair is a complex choreography of operations in a distributed system. As such, and especially in the days when Reaper was created, the process could get blocked for several reasons and required a manual restart. The smart folks that designed Reaper at Spotify decided to put a timeout on segments to deal with such blockage, over which they would be terminated and rescheduled. Problems arise when segments are too big (or have too much entropy) to process within the default 30 minutes timeout, despite not being blocked. They are repeatedly terminated and recreated, and the repair appears to make no progress.
Reaper did a poor job at dealing with this for mainly two reasons:
- Each retry will use the same timeout, possibly failing segments forever
- Nothing obvious was reported to explain what was failing and how to fix the situation
We fixed the former by using a longer timeout on subsequent retries, which is a simple trick to make repairs more “adaptive”. If the segments are too big, they’ll eventually pass after a few retries. It’s a good first step to improve the experience, but it’s not enough for scheduled repairs as they could end up with the same repeated failures for each run.
This is where we introduce adaptive schedules, which use feedback from past repair runs to adjust either the number of segments or the timeout for the next repair run.
Adaptive schedules will be updated at the end of each repair if the run metrics justify it. The schedule can get a different number of segments or a higher segment timeout depending on the latest run.
The rules are the following:
- if more than 20% segments were extended, the number of segments will be raised by 20% on the schedule
- if less than 20% segments were extended (and at least one), the timeout will be set to twice the current timeout
- if no segment was extended and the maximum duration of segments is below 5 minutes, the number of segments will be reduced by 10% with a minimum of 16 segments per node.
This feature is disabled by default and is configurable on a per schedule basis. The timeout can now be set differently for each schedule, from the UI or the REST API, instead of having to change the Reaper config file and restart the process.
Incremental Repair Triggers
As we celebrate the long awaited improvements in incremental repairs brought by Cassandra 4.0, it was time to embrace them with more appropriate triggers. One metric that incremental repair makes available is the percentage of repaired data per table. When running against too much unrepaired data, incremental repair can put a lot of pressure on a cluster due to the heavy anticompaction process.
The best practice is to run it on a regular basis so that the amount of unrepaired data is kept low. Since your throughput may vary from one table/keyspace to the other, it can be challenging to set the right interval for your incremental repair schedules.
Reaper 3.0 introduces a new trigger for the incremental schedules, which is a threshold of unrepaired data. This allows creating schedules that will start a new run as soon as, for example, 10% of the data for at least one table from the keyspace is unrepaired.
Those triggers are complementary to the interval in days, which could still be necessary for low traffic keyspaces that need to be repaired to secure tombstones.
These new features allow you to safely optimize tombstone deletions by enabling the only_purge_repaired_tombstones compaction subproperty in Cassandra, which permits reducing gc_grace_seconds down to 3 hours without fear that deleted data will reappear.
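As a hedged sketch of what that looks like (the table name is hypothetical, and note that setting compaction replaces the whole compaction map, so include your existing strategy and options):
# Enable repaired-only tombstone purging and drop gc_grace_seconds to 3 hours (10800 s).
cqlsh -e "ALTER TABLE my_keyspace.my_table
  WITH compaction = {'class': 'SizeTieredCompactionStrategy',
                     'only_purge_repaired_tombstones': 'true'}
  AND gc_grace_seconds = 10800;"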
Schedules can be edited
That may sound like an obvious feature but previous versions of Reaper didn’t allow for editing of an existing schedule. This led to an annoying procedure where you had to delete the schedule (which isn’t made easy by Reaper either) and recreate it with the new settings.
3.0 fixes that embarrassing situation and adds an edit button to schedules, which allows you to change the mutable settings of a schedule:
More improvements
In order to protect clusters from running mixed incremental and full repairs in older versions of Cassandra, Reaper would disallow the creation of an incremental repair run/schedule if a full repair had been created on the same set of tables in the past (and vice versa).
Now that incremental repair is safe for production use, it is necessary to allow such mixed repair types. In case of conflict, Reaper 3.0 will display a pop-up informing you and allowing you to force create the schedule/run:
We’ve also added a special “schema migration mode” for Reaper, which will exit after the schema was created/upgraded. We use this mode in K8ssandra to prevent schema conflicts and allow the schema creation to be executed in an init container that won’t be subject to liveness probes that could trigger the premature termination of the Reaper pod:
java -jar path/to/reaper.jar schema-migration path/to/cassandra-reaper.yaml
There are many other improvements and we invite all users to check the changelog in the GitHub repo.
Upgrade Now
We encourage all Reaper users to upgrade to 3.0.0, while recommending that they carefully prepare their migration out of Postgres/H2. Note that there is no export/import feature and schedules will need to be recreated after the migration.
All instructions to download, install, configure, and use Reaper 3.0.0 are available on the Reaper website.
Certificates management and Cassandra Pt II - cert-manager and Kubernetes
The joys of certificate management
Certificate management has long been a bugbear of enterprise environments, and expired certs have been the cause of countless outages. When managing large numbers of services at scale, it helps to have an automated approach to managing certs in order to handle renewal and avoid embarrassing and avoidable downtime.
This is part II of our exploration of certificates and encrypting Cassandra. In this blog post, we will dive into certificate management in Kubernetes. This post builds on a few of the concepts in Part I of this series, where Anthony explained the components of SSL encryption.
Recent years have seen the rise of some fantastic, free, automation-first services like letsencrypt, and no one should be caught flat-footed by certificate renewals in 2021. In this blog post, we will look at one Kubernetes-native tool that aims to make this process much more ergonomic on Kubernetes: cert-manager.
Recap
Anthony has already discussed several points about certificates. To recap:
- In asymmetric encryption and digital signing processes we always have public/private key pairs. We are referring to these as the Keystore Private Signing Key (KS PSK) and Keystore Public Certificate (KS PC).
- Public keys can always be openly published and allow senders to communicate to the holder of the matching private key.
- A certificate is just a public key - and some additional fields - which has been signed by a certificate authority (CA). A CA is a party trusted by all parties to an encrypted conversation.
- When a CA signs a certificate, this is a way for that mutually trusted party to attest that the party holding that certificate is who they say they are.
- CAs themselves use public certificates (Certificate Authority Public Certificate; CA PC) and private signing keys (the Certificate Authority Private Signing Key; CA PSK) to sign certificates in a verifiable way.
The many certificates that Cassandra might be using
In a moderately complex Cassandra configuration, we might have:
- A root CA (cert A) for internode encryption.
- A certificate per node signed by cert A.
- A root CA (cert B) for the client-server encryption.
- A certificate per node signed by cert B.
- A certificate per client signed by cert B.
Even in a three node cluster, we can envisage a case where we must create two root CAs and 6 certificates, plus a certificate for each client application; for a total of 8+ certificates!
To compound the problem, this isn’t a one-off setup. Instead, we need to be able to rotate these certificates at regular intervals as they expire.
Ergonomic certificate management on Kubernetes with cert-manager
Thankfully, these processes are well supported on Kubernetes by a tool called cert-manager. cert-manager is an all-in-one tool that should save you from ever having to reach for openssl or keytool again. As a Kubernetes operator, it manages a variety of custom resources (CRs) such as (Cluster)Issuers, CertificateRequests and Certificates. Critically, it integrates with Automated Certificate Management Environment (ACME) Issuers, such as LetsEncrypt (which we will not be discussing today).
The workflow reduces to:
- Create an Issuer (via ACME, or a custom CA).
- Create a Certificate CR.
- Pick up your certificates and signing keys from the secrets cert-manager creates, and mount them as volumes in your pods' containers.
Everything is managed declaratively, and you can reissue certificates at will simply by deleting and re-creating the certificates and secrets.
Or you can use the kubectl plugin, which allows you to issue a simple kubectl cert-manager renew. We won't discuss this in depth here; see the cert-manager documentation for more information.
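For example, renewing the certificate we create later in this post would look something like the following (the certificate name and namespace are assumptions based on this demo's manifests):
kubectl cert-manager renew cassandra-jks-keystore -n cass-operator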
Java batteries included (mostly)
At this point, Cassandra users are probably about to interject with a loud "Yes, but I need keystores and truststores, so this solution only gets me halfway". As luck would have it, from version 0.15, cert-manager also allows you to create JKS truststores and keystores directly from the Certificate CR.
The fine print
There are two caveats to be aware of here:
- Most Cassandra deployment options currently available (including StatefulSets, cass-operator or k8ssandra) do not currently support using a cert-per-node configuration in a convenient fashion. This is because the PodTemplate.spec portions of these resources are identical for each pod in the StatefulSet. This precludes the possibility of adding per-node certs via environment or volume mounts.
- There are currently some open questions about how to rotate certificates without downtime when using internode encryption.
  - Our current recommendation is to use a CA PC per Cassandra datacenter (DC) and add some basic scripts to merge both CA PCs into a single truststore to be propagated across all nodes. By renewing each CA PC independently you can ensure one DC is always online, but you still suffer a network partition. Hinted handoff should theoretically rescue the situation, but it is a less than robust solution, particularly on larger clusters. This approach is not recommended when using lightweight transactions or non-LOCAL consistency levels.
  - One mitigation to consider is using non-expiring CA PCs, in which case no CA PC rotation is ever performed without a manual trigger. KS PCs and KS PSKs may still be rotated. When CA PC rotation is essential this approach allows for careful planning ahead of time, but it is not always possible when using a 3rd party CA.
  - Istio or other service mesh approaches can fully automate mTLS in clusters, but Istio is a fairly large commitment and can create its own complexities.
  - Manual management of certificates may be possible using a secure vault (e.g. HashiCorp Vault), sealed secrets, or similar approaches. In this case, cert-manager may not be involved.
These caveats are not trivial. To address (2) more elegantly you could also implement Anthony’s solution from part one of this blog series; but you’ll need to script this up yourself to suit your k8s environment.
We are also in discussions with the folks over at cert-manager about how their ecosystem can better support Cassandra. We hope to report progress on this front over the coming months.
These caveats present challenges, but there are also specific cases where they matter less.
cert-manager and Reaper - a match made in heaven
One case where we really don’t care if a client is unavailable for a short period is when Reaper is the client.
Cassandra is an eventually consistent system and suffers from entropy. Data on nodes can become out of sync with other nodes due to transient network failures, node restarts and the general wear and tear incurred by a server operating 24/7 for several years.
Cassandra contemplates that this may occur. It provides a variety of consistency level settings allowing you to control how many nodes must agree for a piece of data to be considered the truth. But even though properly set consistency levels ensure that the data returned will be accurate, the process of reconciling data across the network degrades read performance - it is best to have consistent data on hand when you go to read it.
As a result, we recommend the use of Reaper, which runs as a Cassandra client and automatically repairs the cluster in a slow trickle, ensuring that a high volume of repairs are not scheduled all at once (which would overwhelm the cluster and degrade the performance of real clients) while also making sure that all data is eventually repaired for when it is needed.
The setup
The manifests for this blog post can be found here.
Environment
We assume that you're running Kubernetes 1.21, and we'll be running with a Cassandra 3.11.10 install. The demo environment is a three node cluster, which is the configuration we have tested.
We will be installing the cass-operator and Cassandra cluster into the cass-operator namespace, while the cert-manager operator will sit within the cert-manager namespace.
Setting up kind
For testing, we often use kind to provide a local k8s cluster. You can use minikube or whatever solution you prefer (including a real cluster running on GKE, EKS, or AKS), but we'll include some kind instructions and scripts here to ease the way.
If you want a quick fix to get you started, try running the setup-kind-multicluster.sh script from the k8ssandra-operator repository with setup-kind-multicluster.sh --kind-worker-nodes 3. I have included this script in the root of the code examples repo that accompanies this blog.
A demo CA certificate
We aren't going to use LetsEncrypt for this demo, firstly because ACME certificate issuance has some complexities (including needing a DNS or a publicly hosted HTTP server) and secondly because I want to reinforce that cert-manager is useful to organisations who are bringing their own certs and don't need one issued. This is especially useful for on-prem deployments.
First off, create a new private key and certificate pair for your root CA. Note that the file names tls.crt and tls.key will become important in a moment.
openssl genrsa -out manifests/demoCA/tls.key 4096
openssl req -new -x509 -key manifests/demoCA/tls.key -out manifests/demoCA/tls.crt -subj "/C=AU/ST=NSW/L=Sydney/O=Global Security/OU=IT Department/CN=example.com"
(Or you can just run the generate-certs.sh script in the manifests/demoCA directory - ensure you run it from the root of the project so that the secrets appear in ./manifests/demoCA/.)
When running this process on MacOS be aware of this issue which affects the creation of self signed certificates. The repo referenced in this blog post contains example certificates which you can use for demo purposes - but do not use these outside your local machine.
Now we're going to use kustomize (which comes with kubectl) to add these files to Kubernetes as secrets. kustomize is not a templating language like Helm, but it fulfills a similar role by allowing you to build a set of base manifests that are then bundled, and which can be customised for your particular deployment scenario by patching.
Run kubectl apply -k manifests/demoCA. This will build the secrets resources using the kustomize secretGenerator and add them to Kubernetes. Breaking this process down piece by piece:
# ./manifests/demoCA
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: cass-operator
generatorOptions:
disableNameSuffixHash: true
secretGenerator:
- name: demo-ca
type: tls
files:
- tls.crt
- tls.key
- We use disableNameSuffixHash, because otherwise kustomize will add hashes to each of our secret names. This makes it harder to build these deployments one component at a time.
- The tls type secret conventionally takes two keys with these names, as per the next point. cert-manager expects a secret in this format in order to create the Issuer, which we will explain in the next step.
- We are adding the files tls.crt and tls.key. The file names will become the keys of a secret called demo-ca.
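If you want to confirm the result once the kustomization has been applied, a quick standard check is:
kubectl describe secret demo-ca -n cass-operator
You should see the tls.crt and tls.key data keys listed along with their sizes.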
cert-manager
cert-manager can be installed by running kubectl apply -f https://github.com/jetstack/cert-manager/releases/download/v1.5.3/cert-manager.yaml. It will install into the cert-manager namespace because a Kubernetes cluster should only ever have a single cert-manager operator installed.
cert-manager will install a deployment, as well as various custom resource definitions (CRDs) and webhooks to deal with the lifecycle of the Custom Resources (CRs).
A cert-manager Issuer
Issuers come in various forms. Today we'll be using a CA Issuer because our components need to trust each other, but don't need to be trusted by a web browser. Other options include ACME-based Issuers compatible with LetsEncrypt, but these would require that we have control of a public-facing DNS or HTTP server, and that isn't always the case for Cassandra, especially on-prem.
Dive into the truststore-keystore directory and you'll find the Issuer; it is very simple, and a rough sketch of its shape follows below. The only thing to note is that it takes a secret which has keys of tls.crt and tls.key - the secret you pass in must have these keys. These are the CA PC and CA PSK we mentioned earlier.
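This is an illustrative sketch only (the Issuer name and namespace are assumed to match this demo, and the real manifest ships in the repo and is applied with the rest of the kustomization in the next step); the --dry-run=client flag just validates the manifest locally without creating anything:
cat <<'EOF' | kubectl apply --dry-run=client -n cass-operator -f -
apiVersion: cert-manager.io/v1
kind: Issuer
metadata:
  name: ca-issuer
spec:
  ca:
    secretName: demo-ca
EOF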
We’ll apply this manifest to the cluster in the next step.
Some cert-manager certs
Let's start with the Cassandra-Certificate.yaml resource:
spec:
# Secret names are always required.
secretName: cassandra-jks-keystore
duration: 2160h # 90d
renewBefore: 360h # 15d
subject:
organizations:
- datastax
dnsNames:
- dc1.cass-operator.svc.cluster.local
isCA: false
usages:
- server auth
- client auth
issuerRef:
name: ca-issuer
# We can reference ClusterIssuers by changing the kind here.
# The default value is `Issuer` (i.e. a locally namespaced Issuer)
kind: Issuer
# This is optional since cert-manager will default to this value however
# if you are using an external issuer, change this to that `Issuer` group.
group: cert-manager.io
keystores:
jks:
create: true
passwordSecretRef: # Password used to encrypt the keystore
key: keystore-pass
name: jks-password
privateKey:
algorithm: RSA
encoding: PKCS1
size: 2048
The first part of the spec here tells us a few things:
- The keystore, truststore and certificates will be fields within a secret called cassandra-jks-keystore. This secret will end up holding our KS PSK and KS PC.
- It will be valid for 90 days.
- 15 days before expiry, it will be renewed automatically by cert-manager, which will contact the Issuer to do so.
- It has a subject organisation. You can add any of the X509 subject fields here, but it needs to have one of them.
- It has a DNS name - you could also provide a URI or IP address. In this case we have used the service address of the Cassandra datacenter which we are about to create via the operator. This has a format of <DC_NAME>.<NAMESPACE>.svc.cluster.local.
- It is not a CA (isCA), and can be used for server auth or client auth (usages). You can tune these settings according to your needs. If you make your cert a CA you can even reference it in a new Issuer, and define cute tree-like structures (if you're into that).
Outside the certificates themselves, there are additional settings controlling how they are issued and what format this happens in.
- issuerRef is used to define the Issuer we want to use to issue the certificate. The Issuer will sign the certificate with its CA PSK.
- We are specifying that we would like a keystore created with the keystore key, and that we'd like it in jks format with the corresponding key.
- passwordSecretRef references a secret and a key within it. It will be used to provide the password for the keystore (the truststore is unencrypted as it contains only public certs and no signing keys).
The Reaper-Certificate.yaml is similar in structure, but has a different DNS name. We aren't configuring Cassandra to verify that the DNS name on the certificate matches the DNS name of the parties in this particular case.
Apply all of the certs and the Issuer using kubectl apply -k manifests/truststore-keystore.
Cass-operator
Examining the cass-operator directory, we'll see that there is a kustomization.yaml which references the remote cass-operator repository and a local cassandraDatacenter.yaml. This applies the manifests required to run up a cass-operator installation namespaced to the cass-operator namespace.
Note that this installation of the operator will only watch its own namespace for CassandraDatacenter CRs. So if you create a DC in a different namespace, nothing will happen.
We will apply these manifests in the next step.
CassandraDatacenter
Finally, the CassandraDatacenter
resource in the ./cass-operator/
directory will describe the kind of DC we want:
apiVersion: cassandra.datastax.com/v1beta1
kind: CassandraDatacenter
metadata:
name: dc1
spec:
clusterName: cluster1
serverType: cassandra
serverVersion: 3.11.10
managementApiAuth:
insecure: {}
size: 1
podTemplateSpec:
spec:
containers:
- name: "cassandra"
volumeMounts:
- name: certs
mountPath: "/crypto"
volumes:
- name: certs
secret:
secretName: cassandra-jks-keystore
storageConfig:
cassandraDataVolumeClaimSpec:
storageClassName: standard
accessModes:
- ReadWriteOnce
resources:
requests:
storage: 5Gi
config:
cassandra-yaml:
authenticator: org.apache.cassandra.auth.AllowAllAuthenticator
authorizer: org.apache.cassandra.auth.AllowAllAuthorizer
role_manager: org.apache.cassandra.auth.CassandraRoleManager
client_encryption_options:
enabled: true
# If enabled and optional is set to true encrypted and unencrypted connections are handled.
optional: false
keystore: /crypto/keystore.jks
keystore_password: dc1
require_client_auth: true
# Set truststore and truststore_password if require_client_auth is true
truststore: /crypto/truststore.jks
truststore_password: dc1
protocol: TLS
# cipher_suites: [TLS_RSA_WITH_AES_128_CBC_SHA] # An earlier version of this manifest configured cipher suites but the proposed config was less secure. This section does not need to be modified.
server_encryption_options:
internode_encryption: all
keystore: /crypto/keystore.jks
keystore_password: dc1
truststore: /crypto/truststore.jks
truststore_password: dc1
jvm-options:
initial_heap_size: 800M
max_heap_size: 800M
- We provide a name for the DC - dc1.
- We provide a name for the cluster - the DC would join other DCs if they already exist in the k8s cluster and we configured the additionalSeeds property.
- We use the podTemplateSpec.volumes array to declare the volumes for the Cassandra pods, and we use the podTemplateSpec.containers.volumeMounts array to describe where and how to mount them.
The config.cassandra-yaml field is where most of the encryption configuration happens, and we are using it to enable both internode and client-server encryption, which both use the same keystore and truststore for simplicity. Remember that using internode encryption means your DC needs to go offline briefly for a full restart when the CA's keys rotate.
- We are not using authz/n in this case to keep things simple. Don't do this in production.
- For both encryption types we need to specify (1) the keystore location, (2) the truststore location and (3) the passwords for the keystores. The locations of the keystore/truststore come from where we mounted them in volumeMounts.
- We are specifying JVM options just to make this run politely on a smaller machine. You would tune this for a production deployment.
Roll out the cass-operator and the CassandraDatacenter using kubectl apply -k manifests/cass-operator. Because the CRDs might take a moment to propagate, there is a chance you'll see errors stating that the resource type does not exist. Just keep re-applying until everything works - this is a declarative system, so applying the same manifests multiple times is an idempotent operation.
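A simple way to watch progress while the CRDs and pods settle (nothing specific to this demo, just standard kubectl):
# Watch the operator and Cassandra pods come up; Ctrl-C once everything is Running/Ready.
kubectl get pods -n cass-operator -w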
Reaper deployment
The k8ssandra project offers a Reaper operator, but for simplicity we are using a simple deployment (because not every deployment needs an operator). The deployment is standard kubernetes fare, and if you want more information on how these work you should refer to the Kubernetes docs.
We are injecting the keystore and truststore passwords into the environment here, to avoid placing them in the manifests. cass-operator does not currently support this approach without an initContainer to pre-process the cassandra.yaml using envsubst or a similar tool.
The only other note is that we are also pulling down a Cassandra image and using it in an initContainer to create a keyspace for Reaper, if it does not exist. In this container, we are also adding a ~/.cassandra/cqlshrc file under the home directory. This provides SSL connectivity configurations for the container. The critical part of the cqlshrc file that we are adding is:
[ssl]
certfile = /crypto/ca.crt
validate = true
userkey = /crypto/tls.key
usercert = /crypto/tls.crt
version = TLSv1_2
The version = TLSv1_2 tripped me up a few times, as it seems to be a recent requirement. Failing to add this line will give you back the rather fierce Last error: [SSL] internal error in the initContainer. The commands run in this container are not ideal. In particular, the fact that we are sleeping for 840 seconds to wait for Cassandra to start is sloppy. In a real deployment we'd want to health check and wait until the Cassandra service became available.
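A sketch of a more robust wait, under the assumption that the initContainer already has cqlsh, the mounted /crypto certs and the cqlshrc shown above (the host below is the service DNS name used earlier and is an assumption about what the initContainer should target):
# Poll CQL over SSL until Cassandra answers, instead of sleeping a fixed 840 seconds.
CASSANDRA_HOST=dc1.cass-operator.svc.cluster.local
until cqlsh --ssl "$CASSANDRA_HOST" 9042 -e 'DESCRIBE KEYSPACES' > /dev/null 2>&1; do
  echo "Waiting for Cassandra..."
  sleep 10
done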
Apply the manifests using kubectl apply -k manifests/reaper.
Results
If you use a GUI, look at the logs for Reaper: you should see that it has connected to the cluster and printed some nice ASCII art to your console.
If you don't use a GUI, you can run kubectl get pods -n cass-operator to find your Reaper pod (which we'll call REAPER_PODNAME) and then run kubectl logs -n cass-operator REAPER_PODNAME to pull the logs.
Conclusion
While the above might seem like a complex procedure, we’ve just created a Cassandra cluster with both client-server and internode encryption enabled, all of the required certs, and a Reaper deployment which is configured to connect using the correct certs. Not bad.
Do keep in mind the weaknesses relating to key rotation, and watch this space for progress on that front.
Cassandra Certificate Management Part 1 - How to Rotate Keys Without Downtime
Welcome to this three part blog series where we dive into the management of certificates in an Apache Cassandra cluster. For this first post in the series we will focus on how to rotate keys in an Apache Cassandra cluster without downtime.
Usability and Security at Odds
If you have downloaded and installed a vanilla installation of Apache Cassandra, you may have noticed that when it is first started all security is disabled. Your "Hello World" application works out of the box because the Cassandra project chose usability over security. This is deliberately done so everyone benefits from the usability, as security requirements differ for each deployment. Some deployments require multiple layers of security, while others need no security features enabled at all.
Security of a system is applied in layers. For example one layer is isolating the nodes in a cluster behind a proxy. Another layer is locking down OS permissions. Encrypting connections between nodes, and between nodes and the application is another layer that can be applied. If this is the only layer applied, it leaves other areas of a system insecure. When securing a Cassandra cluster, we recommend pursuing an informed approach which offers defence-in-depth. Consider additional aspects such as encryption at rest (e.g. disk encryption), authorization, authentication, network architecture, and hardware, host and OS security.
Encrypting connections between two hosts can be difficult to set up as it involves a number of tools and commands to generate the necessary assets for the first time. We covered this process in previous posts: Hardening Cassandra Step by Step - Part 1 Inter-Node Encryption and Hardening Cassandra Step by Step - Part 2 Hostname Verification for Internode Encryption. I recommend reading both posts before reading through the rest of the series, as we will build off concepts explained in them.
Here is a quick summary of the basic steps to create the assets necessary to encrypt connections between two hosts.
- Create the Root Certificate Authority (CA) key pair from a configuration file using openssl.
- Create a keystore for each host (node or client) using keytool (a sketch of this step and the next follows below).
- Export the Public Certificate from each host keystore as a “Signing Request” using keytool.
- Sign each Public Certificate “Signing Request” with our Root CA to generate a Signed Certificate using openssl.
- Import the Root CA Public Certificate and the Signed Certificate into each keystore using keytool.
- Create a common truststore and import the CA Public Certificate into it using keytool.
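For illustration, creating a host keystore and exporting its Public Certificate as a signing request (steps 2 and 3) might look like this with keytool; the alias, file names, distinguished name and passwords are placeholders, not values from the posts referenced above.
$ keytool -genkeypair -keyalg RSA -keysize 2048 \
    -alias node1 -validity 365 \
    -keystore node1-keystore.jks \
    -storepass <keystore_password> -keypass <keystore_password> \
    -dname "CN=node1, OU=MyCluster, O=MyOrg, L=Sydney, ST=NSW, C=AU"

$ keytool -certreq -alias node1 \
    -keystore node1-keystore.jks -storepass <keystore_password> \
    -file node1_ks_pc_sign_request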
Security Requires Ongoing Maintenance
Setting up SSL encryption for the various connections to Cassandra is only half the story. Like all other software out in the wild, there is ongoing maintenance required to ensure the SSL encrypted connections continue to work.
At some point you will need to update the certificates and stores used to implement the SSL encrypted connections, because they will expire. If the certificates for a node expire, it will be unable to communicate with other nodes in the cluster. This will lead to at least data inconsistencies or, in the worst case, unavailable data.
This point is specifically called out towards the end of the Inter-Node Encryption blog post. The note refers to steps 1, 2 and 4 in the above summary of commands to set up the certificates and stores. The validity periods are set for the certificates and stores in their respective steps.
One Certificate Authority to Rule Them All
Before we jump into how we handle expiring certificates and stores in a cluster, we first need to understand the role a certificate plays in securing a connection.
Certificates (and encryption) are often considered a hard topic. However, there are only a few concepts that you need to bear in mind when managing certificates.
Consider the case where two parties A and B wish to communicate with one another. Both parties distrust each other and each needs a way to prove that they are who they claim to be, as well as verify the other party is who they claim to be. To do this a mutually trusted third party needs to be brought in. In our case the trusted third party is the Certificate Authority (CA); often referred to as the Root CA.
The Root CA is effectively just a key pair; similar to an SSH key pair. The main difference is the public portion of the key pair has additional fields detailing who the public key belongs to. It has the following two components.
- Certificate Authority Private Signing Key (CA PSK) - Private component of the CA key pair. Used to sign a keystore’s public certificate.
- Certificate Authority Public Certificate (CA PC) - Public component of the CA key pair. Used to provide the issuer name when signing a keystore’s public certificate, as well as by a node to confirm that a third party public certificate (when presented) has been signed by the Root CA PSK.
When you run openssl to create your CA key pair using a certificate configuration file, this is the command that is run.
$ openssl req \
-config path/to/ca_certificate.config \
-new \
-x509 \
-keyout path/to/ca_psk \
-out path/to/ca_pc \
-days <valid_days>
In the above command the -keyout specifies the path to the CA PSK, and the -out specifies the path to the CA PC.
And in the Darkness Sign Them
In addition to a common Root CA key pair, each party has its own certificate key pair to uniquely identify it and to encrypt communications. In the Cassandra world, two components are used to store the information needed to perform the above verification check and communication encryption; the keystore and the truststore.
The keystore contains a key pair which is made up of the following two components.
- Keystore Private Signing Key (KS PSK) - Hidden in keystore. Used to sign messages sent by the node, and decrypt messages received by the node.
- Keystore Public Certificate (KS PC) - Exported for signing by the Root CA. Used by a third party to encrypt messages sent to the node that owns this keystore.
When created, the keystore will contain the PC, and the PSK. The PC signed by the Root CA, and the CA PC are added to the keystore in subsequent operations to complete the trust chain. The certificates are always public and are presented to other parties, while PSK always remains secret. In an asymmetric/public key encryption system, messages can be encrypted with the PC but can only be decrypted using the PSK. In this way, a node can initiate encrypted communications without needing to share a secret.
The truststore stores one or more CA PCs of the parties which the node has chosen to trust, since they are the source of trust for the cluster. If a party tries to communicate with the node, it will refer to its truststore to see if it can validate the attempted communication using a CA PC that it knows about.
For a node's KS PC to be trusted and verified by another node using the CA PC in the truststore, the KS PC needs to be signed by the Root CA key pair. Furthermore, the same CA key pair is used to sign the KS PC of each party.
When you run openssl to sign an exported Keystore PC, this is the command that is run.
$ openssl x509 \
-req \
-CAkey path/to/ca_psk \
-CA path/to/ca_pc \
-in path/to/exported_ks_pc_sign_request \
-out path/to/signed_ks_pc \
-days <valid_days> \
-CAcreateserial \
-passin pass:<ca_psk_password>
In the above command both the Root CA PSK and CA PC are used via -CAkey and -CA respectively when signing the KS PC.
More Than One Way to Secure a Connection
Now that we have a deeper understanding of the assets that are used to encrypt communications, we can examine the various ways to implement it. There are multiple ways to implement SSL encryption in an Apache Cassandra cluster. Regardless of the encryption approach, the objective when applying this type of security to a cluster is to ensure that:
- Hosts (nodes or clients) can determine whether they should trust other hosts in the cluster.
- Any intercepted communication between two hosts is indecipherable.
The three most common methods vary in both ease of deployment and resulting level of security. They are as follows.
The Cheats Way
The easiest and least secure method for rolling out SSL encryption can be done in the following way
Generation
- Single CA for the cluster.
- Single truststore containing the CA PC.
- Single keystore which has been signed by the CA.
Deployment
- The same keystore and truststore are deployed to each node.
In this method a single Root CA and a single keystore is deployed to all nodes in the cluster. This means any node can decipher communications intended for any other node. If a bad actor gains control of a node in the cluster then they will be able to impersonate any other node. That is, compromise of one host will compromise all of them. Depending on your threat model, this approach can be better than no encryption at all. It will ensure that a bad actor with access to only the network will no longer be able to eavesdrop on traffic.
We would use this method as a stop gap to get internode encryption enabled in a cluster. The idea would be to quickly deploy internode encryption with the view of updating the deployment in the near future to be more secure.
Best Bang for Buck
Arguably the most popular and well documented method for rolling out SSL encryption is
Generation
- Single CA for the cluster.
- Single truststore containing the CA PC.
- Unique keystore for each node all of which have been signed by the CA.
Deployment
- Each keystore is deployed to its associated node.
- The same truststore is deployed to each node.
Similar to the previous method, this method uses a cluster wide CA. However, unlike the previous method each node will have its own keystore. Each keystore has its own certificate that is signed by a Root CA common to all nodes. The process to generate and deploy the keystores in this way is practiced widely and well documented.
We would use this method as it provides better security over the previous method. Each keystore can have its own password and host verification, which further enhances the security that can be applied.
Fort Knox
The method that offers the strongest security of the three can be rolled out in the following way
Generation
- Unique CA for each node.
- A single truststore containing the Public Certificate for each of the CAs.
- Unique keystore for each node that has been signed by the CA specific to the node.
Deployment
- Each keystore with its unique CA PC is deployed to its associated node.
- The same truststore is deployed to each node.
Unlike the other two methods, this one uses a Root CA per host and similar to the previous method, each node will have its own keystore. Each keystore has its own PC that is signed by a Root CA unique to the node. The Root CA PC of each node needs to be added to the truststore that is deployed to all nodes. For large cluster deployments this encryption configuration is cumbersome and will result in a large truststore being generated. Deployments of this encryption configuration are less common in the wild.
We would use this method as it provides all the advantages of the previous method and, in addition, the ability to isolate a node from the cluster. This can be done by simply rolling out a new truststore which excludes a specific node's CA PC. In this way a compromised node could be isolated from the cluster by simply changing the truststore. Under the previous two approaches, isolation of a compromised node in this fashion would require a rollout of an entirely new Root CA and one or more new keystores. Furthermore, each new keystore's PC would need to be signed by the new Root CA.
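As a sketch of that isolation step (the alias is hypothetical; list the truststore first to find the real alias of the compromised node's CA PC):
$ keytool -list -keystore common-truststore.jks -storepass <truststore_password>
$ keytool -delete -alias node3-root-ca \
    -keystore common-truststore.jks -storepass <truststore_password>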
WARNING: Ensure your Certificate Authority is secure!
Regardless of the deployment method chosen, the whole setup will depend on the security of the Root CA. Ideally both components should be secured, or at the very least the PSK needs to be secured properly after it is generated since all trust is based on it. If both components are compromised by a bad actor, then that actor can potentially impersonate another node in the cluster. The good news is, there are a variety of ways to secure the Root CA components, however that topic goes beyond the scope of this post.
The Need for Rotation
If we are following best practices when generating our CAs and keystores, they will have an expiry date. This is a good thing because it forces us to regenerate and roll out our new encryption assets (stores, certificates, passwords) to the cluster. By doing this we minimise the exposure that any one of the components has. For example, if a password for a keystore is unknowingly leaked, the password is only good up until the keystore expiry. Having a scheduled expiry reduces the chance of a security leak becoming a breach, and increases the difficulty for a bad actor to gain persistence in the system. In the worst case it limits the validity of compromised credentials.
Always Read the Expiry Label
The only catch to having an expiry date on our encryption assets is that we need to rotate (update) them before they expire. Otherwise, our data will be unavailable or may be inconsistent in our cluster for a period of time. Expired encryption assets, when forgotten, can be a silent, sinister problem. If, for example, our SSL certificates expire unnoticed, we will only discover this blunder when we restart the Cassandra service. In this case the Cassandra service will fail to connect to the cluster on restart and an SSL expiry error will appear in the logs. At this point there is nothing we can do without incurring some data unavailability or inconsistency in the cluster. We will cover what to do in this case in a subsequent post. However, it is best to avoid this situation by rotating the encryption assets before they expire.
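A quick way to keep an eye on those dates is to inspect the assets directly; for example (paths and passwords are placeholders):
# Check the validity window of the entries in a keystore or truststore.
$ keytool -list -v -keystore <keystore_or_truststore> -storepass <store_password> | grep "Valid from"

# Check the expiry date of a CA Public Certificate in PEM format.
$ openssl x509 -in <ca_pc> -noout -enddate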
How to Play Musical Certificates
Assuming we are going to rotate our SSL certificates before they expire, we can perform this operation live on the cluster without downtime. This process requires the replication factor and consistency level to be configured to allow for a single node to be down for a short period of time in the cluster. Hence, it works best when using a replication factor >= 3 and a consistency level <= QUORUM or LOCAL_QUORUM, depending on the cluster configuration.
- Create the NEW encryption assets; NEW CA, NEW keystores, and NEW truststore, using the process described earlier.
- Import the NEW CA to the OLD truststore already deployed in the cluster using keytool. The OLD truststore will increase in size, as it has both the OLD and NEW CAs in it.
$ keytool -keystore <old_truststore> -alias CARoot -importcert -file <new_ca_pc> -keypass <new_ca_psk_password> -storepass <old_truststore_password> -noprompt
Where:
- <old_truststore>: The path to the OLD truststore already deployed in the cluster. This can be just a copy of the OLD truststore deployed.
- <new_ca_pc>: The path to the NEW CA PC generated.
- <new_ca_psk_password>: The password for the NEW CA PSK.
- <old_truststore_password>: The password for the OLD truststore.
- Deploy the updated OLD truststore to all the nodes in the cluster. Specifically, perform these steps on a single node, then repeat them on the next node until all nodes are updated. Once this step is complete, all nodes in the cluster will be able to establish connections using both the OLD and NEW CAs.
  - Drain the node using nodetool drain.
  - Stop the Cassandra service on the node.
  - Copy the updated OLD truststore to the node.
  - Start the Cassandra service on the node.
- Deploy the NEW keystores to their respective nodes in the cluster. Perform this operation one node at a time in the same way the OLD truststore was deployed in the previous step. Once this step is complete, all nodes in the cluster will be using their NEW SSL certificate to establish encrypted connections with each other.
- Deploy the NEW truststore to all the nodes in the cluster. Once again, perform this operation one node at a time in the same way the OLD truststore was deployed in Step 3.
The key to ensuring uptime in the rotation is in Steps 2 and 3. That is, we have the OLD and the NEW CAs all in the truststore and deployed on every node prior to rolling out the NEW keystores. This allows nodes to communicate regardless of whether they have the OLD or NEW keystore. This is because both the OLD and NEW assets are trusted by all nodes. The process still works whether our NEW CAs are per host or cluster wide. If the NEW CAs are per host, then they all need to be added to the OLD truststore.
Example Certificate Rotation on a Cluster
Now that we understand the theory, let's see the process in action. We will use ccm to create a three node cluster running Cassandra 3.11.10 with internode encryption configured.
As a pre-cluster setup task we will generate the keystores and truststore to implement the internode encryption. Rather than carry out the steps manually to generate the stores, we have developed a script called generate_cluster_ssl_stores that does the job for us.
The script requires us to supply the node IP addresses, and a certificate configuration file. Our certificate configuration file, test_ca_cert.conf has the following contents:
[ req ]
distinguished_name = req_distinguished_name
prompt = no
output_password = mypass
default_bits = 2048
[ req_distinguished_name ]
C = AU
ST = NSW
L = Sydney
O = TLP
OU = SSLTestCluster
CN = SSLTestClusterRootCA
emailAddress = info@thelastpickle.com
The command used to call the generate_cluster_ssl_stores.sh script is as follows.
$ ./generate_cluster_ssl_stores.sh -g -c -n 127.0.0.1,127.0.0.2,127.0.0.3 test_ca_cert.conf
Let’s break down the options in the above command.
- -g - Generate passwords for each keystore and the truststore.
- -c - Create a Root CA for the cluster and sign each keystore PC with it.
- -n - List of nodes to generate keystores for.
The above command generates the following encryption assets.
$ ls -alh ssl_artifacts_20210602_125353
total 72
drwxr-xr-x 9 anthony staff 288B 2 Jun 12:53 .
drwxr-xr-x 5 anthony staff 160B 2 Jun 12:53 ..
-rw-r--r-- 1 anthony staff 17B 2 Jun 12:53 .srl
-rw-r--r-- 1 anthony staff 4.2K 2 Jun 12:53 127-0-0-1-keystore.jks
-rw-r--r-- 1 anthony staff 4.2K 2 Jun 12:53 127-0-0-2-keystore.jks
-rw-r--r-- 1 anthony staff 4.2K 2 Jun 12:53 127-0-0-3-keystore.jks
drwxr-xr-x 10 anthony staff 320B 2 Jun 12:53 certs
-rw-r--r-- 1 anthony staff 1.0K 2 Jun 12:53 common-truststore.jks
-rw-r--r-- 1 anthony staff 219B 2 Jun 12:53 stores.password
With the necessary stores generated we can create our three node cluster in ccm. Prior to starting the cluster our nodes should look something like this.
$ ccm status
Cluster: 'SSLTestCluster'
-------------------------
node1: DOWN (Not initialized)
node2: DOWN (Not initialized)
node3: DOWN (Not initialized)
We can configure internode encryption in the cluster by modifying the cassandra.yaml files for each node as follows. The passwords for each store are in the stores.password file created by the generate_cluster_ssl_stores.sh script.
node1 - cassandra.yaml
...
server_encryption_options:
internode_encryption: all
keystore: /ssl_artifacts_20210602_125353/127-0-0-1-keystore.jks
keystore_password: HQR6xX4XQrYCz58CgAiFkWL9OTVDz08e
truststore: /ssl_artifacts_20210602_125353/common-truststore.jks
truststore_password: 8dPhJ2oshBihAYHcaXzgfzq6kbJ13tQi
...
node2 - cassandra.yaml
...
server_encryption_options:
internode_encryption: all
keystore: /ssl_artifacts_20210602_125353/127-0-0-2-keystore.jks
keystore_password: Aw7pDCmrtacGLm6a1NCwVGxohB4E3eui
truststore: /ssl_artifacts_20210602_125353/common-truststore.jks
truststore_password: 8dPhJ2oshBihAYHcaXzgfzq6kbJ13tQi
...
node3 - cassandra.yaml
...
server_encryption_options:
internode_encryption: all
keystore: /ssl_artifacts_20210602_125353/127-0-0-3-keystore.jks
keystore_password: 1DdFk27up3zsmP0E5959PCvuXIgZeLzd
truststore: /ssl_artifacts_20210602_125353/common-truststore.jks
truststore_password: 8dPhJ2oshBihAYHcaXzgfzq6kbJ13tQi
...
Now that we configured internode encryption in the cluster, we can start the nodes and monitor the logs to make sure they start correctly.
$ ccm node1 start && touch ~/.ccm/SSLTestCluster/node1/logs/system.log && tail -n 40 -f ~/.ccm/SSLTestCluster/node1/logs/system.log
...
$ ccm node2 start && touch ~/.ccm/SSLTestCluster/node2/logs/system.log && tail -n 40 -f ~/.ccm/SSLTestCluster/node2/logs/system.log
...
$ ccm node3 start && touch ~/.ccm/SSLTestCluster/node3/logs/system.log && tail -n 40 -f ~/.ccm/SSLTestCluster/node3/logs/system.log
In all cases we see the following message in the logs indicating that internode encryption is enabled.
INFO [main] ... MessagingService.java:704 - Starting Encrypted Messaging Service on SSL port 7001
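As an optional sanity check (a sketch, not part of the original setup), you can inspect the certificate a node presents on the encrypted internode port and confirm its subject, issuer and expiry:
$ openssl s_client -connect 127.0.0.1:7001 </dev/null 2>/dev/null \
    | openssl x509 -noout -subject -issuer -enddate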
Once all the nodes have started, we can check the cluster status. We are looking to see that all nodes are up and in a normal state.
$ ccm node1 nodetool status
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns (effective) Host ID Rack
UN 127.0.0.1 90.65 KiB 16 65.8% 2661807a-d8d3-4bba-8639-6c0fada2ac88 rack1
UN 127.0.0.2 66.31 KiB 16 65.5% f3db4bbe-1f35-4edb-8513-cb55a05393a7 rack1
UN 127.0.0.3 71.46 KiB 16 68.7% 46c2f4b5-905b-42b4-8bb9-563a03c4b415 rack1
We will create a NEW Root CA along with a NEW set of stores for the cluster. As part of this process, we will add the NEW Root CA PC to the OLD (current) truststore that is already in use in the cluster. Once again we can use our generate_cluster_ssl_stores.sh script to do this, including the additional step of adding the NEW Root CA PC to our OLD truststore. This can be done with the following commands.
# Make the password to our old truststore available to script so we can add the new Root CA to it.
$ export EXISTING_TRUSTSTORE_PASSWORD=$(cat ssl_artifacts_20210602_125353/stores.password | grep common-truststore.jks | cut -d':' -f2)
$ ./generate_cluster_ssl_stores.sh -g -c -n 127.0.0.1,127.0.0.2,127.0.0.3 -e ssl_artifacts_20210602_125353/common-truststore.jks test_ca_cert.conf
We call our script using a similar command to the first time we used it. The difference now is that we are using one additional option: -e.
- -e - Path to our OLD (existing) truststore which we will add the new Root CA PC to. This option requires us to set the OLD truststore password in the EXISTING_TRUSTSTORE_PASSWORD environment variable.
The above command generates the following new encryption assets. These files are located in a different directory to the old ones. The directory with the old encryption assets is ssl_artifacts_20210602_125353 and the directory with the new encryption assets is ssl_artifacts_20210603_070951.
$ ls -alh ssl_artifacts_20210603_070951
total 72
drwxr-xr-x 9 anthony staff 288B 3 Jun 07:09 .
drwxr-xr-x 6 anthony staff 192B 3 Jun 07:09 ..
-rw-r--r-- 1 anthony staff 17B 3 Jun 07:09 .srl
-rw-r--r-- 1 anthony staff 4.2K 3 Jun 07:09 127-0-0-1-keystore.jks
-rw-r--r-- 1 anthony staff 4.2K 3 Jun 07:09 127-0-0-2-keystore.jks
-rw-r--r-- 1 anthony staff 4.2K 3 Jun 07:09 127-0-0-3-keystore.jks
drwxr-xr-x 10 anthony staff 320B 3 Jun 07:09 certs
-rw-r--r-- 1 anthony staff 1.0K 3 Jun 07:09 common-truststore.jks
-rw-r--r-- 1 anthony staff 223B 3 Jun 07:09 stores.password
When we look at our OLD truststore we can see that it has increased in size. Originally it was 1.0K, and it is now 2.0K in size after adding the new Root CA PC to it.
$ ls -alh ssl_artifacts_20210602_125353/common-truststore.jks
-rw-r--r-- 1 anthony staff 2.0K 3 Jun 07:09 ssl_artifacts_20210602_125353/common-truststore.jks
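We can also list the truststore entries to confirm that both the OLD and NEW Root CA PCs are now present (the password comes from the original stores.password file):
$ keytool -list \
    -keystore ssl_artifacts_20210602_125353/common-truststore.jks \
    -storepass 8dPhJ2oshBihAYHcaXzgfzq6kbJ13tQi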
We can now roll out the updated OLD truststore. In a production Cassandra deployment we would copy the updated OLD truststore to a node and restart the Cassandra service. Then repeat this process on the other nodes in the cluster, one node at a time. In our case, our locally running nodes are already pointing to the updated OLD truststore. We need to only restart the Cassandra service.
$ for i in $(ccm status | grep UP | cut -d':' -f1); do echo "restarting ${i}" && ccm ${i} stop && sleep 3 && ccm ${i} start; done
restarting node1
restarting node2
restarting node3
After the restart, our nodes are up and in a normal state.
$ ccm node1 nodetool status
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns (effective) Host ID Rack
UN 127.0.0.1 140.35 KiB 16 100.0% 2661807a-d8d3-4bba-8639-6c0fada2ac88 rack1
UN 127.0.0.2 167.23 KiB 16 100.0% f3db4bbe-1f35-4edb-8513-cb55a05393a7 rack1
UN 127.0.0.3 173.7 KiB 16 100.0% 46c2f4b5-905b-42b4-8bb9-563a03c4b415 rack1
Our nodes are using the updated OLD truststore which has the old Root CA PC and the new Root CA PC. This means that nodes will be able to communicate using either the old (current) keystore or the new keystore. We can now roll out the new keystore one node at a time and still have all our data available.
To do the new keystore roll out we will stop the Cassandra service, update its configuration to point to the new keystore, and then start the Cassandra service. A few notes before we start:
- The node will need to point to the new keystore located in the directory with the new encryption assets: ssl_artifacts_20210603_070951.
- The node will still need to use the OLD truststore, so its path will remain unchanged.
node1 - stop Cassandra service
$ ccm node1 stop
$ ccm node2 status
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns (effective) Host ID Rack
DN 127.0.0.1 140.35 KiB 16 100.0% 2661807a-d8d3-4bba-8639-6c0fada2ac88 rack1
UN 127.0.0.2 142.19 KiB 16 100.0% f3db4bbe-1f35-4edb-8513-cb55a05393a7 rack1
UN 127.0.0.3 148.66 KiB 16 100.0% 46c2f4b5-905b-42b4-8bb9-563a03c4b415 rack1
node1 - update keystore path to point to new keystore in cassandra.yaml
...
server_encryption_options:
internode_encryption: all
keystore: /ssl_artifacts_20210603_070951/127-0-0-1-keystore.jks
keystore_password: V3fKP76XfK67KTAti3CXAMc8hVJGJ7Jg
truststore: /ssl_artifacts_20210602_125353/common-truststore.jks
truststore_password: 8dPhJ2oshBihAYHcaXzgfzq6kbJ13tQi
...
node1 - start Cassandra service
$ ccm node1 start
$ ccm node2 status
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns (effective) Host ID Rack
UN 127.0.0.1 179.23 KiB 16 100.0% 2661807a-d8d3-4bba-8639-6c0fada2ac88 rack1
UN 127.0.0.2 142.19 KiB 16 100.0% f3db4bbe-1f35-4edb-8513-cb55a05393a7 rack1
UN 127.0.0.3 148.66 KiB 16 100.0% 46c2f4b5-905b-42b4-8bb9-563a03c4b415 rack1
At this point we have node1 using the new keystore while node2 and node3 are using the old keystore. Our nodes are once again up and in a normal state, so we can proceed to update the certificates on node2.
node2 - stop Cassandra service
$ ccm node2 stop
$ ccm node3 status
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns (effective) Host ID Rack
UN 127.0.0.1 224.48 KiB 16 100.0% 2661807a-d8d3-4bba-8639-6c0fada2ac88 rack1
DN 127.0.0.2 188.46 KiB 16 100.0% f3db4bbe-1f35-4edb-8513-cb55a05393a7 rack1
UN 127.0.0.3 194.35 KiB 16 100.0% 46c2f4b5-905b-42b4-8bb9-563a03c4b415 rack1
node2 - update keystore path to point to new keystore in cassandra.yaml
...
server_encryption_options:
internode_encryption: all
keystore: /ssl_artifacts_20210603_070951/127-0-0-2-keystore.jks
keystore_password: 3uEjkTiR0xI56RUDyo23TENJjtMk8VbY
truststore: /ssl_artifacts_20210602_125353/common-truststore.jks
truststore_password: 8dPhJ2oshBihAYHcaXzgfzq6kbJ13tQi
...
node2 - start Cassandra service
$ ccm node2 start
$ ccm node3 status
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns (effective) Host ID Rack
UN 127.0.0.1 224.48 KiB 16 100.0% 2661807a-d8d3-4bba-8639-6c0fada2ac88 rack1
UN 127.0.0.2 227.12 KiB 16 100.0% f3db4bbe-1f35-4edb-8513-cb55a05393a7 rack1
UN 127.0.0.3 194.35 KiB 16 100.0% 46c2f4b5-905b-42b4-8bb9-563a03c4b415 rack1
At this point we have node1 and node2 using the new keystore while node3 is using the old keystore. Our nodes are once again up and in a normal state, so we can proceed to update the certificates on node3.
node3 - stop Cassandra service
$ ccm node3 stop
$ ccm node1 nodetool status
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns (effective) Host ID Rack
UN 127.0.0.1 225.42 KiB 16 100.0% 2661807a-d8d3-4bba-8639-6c0fada2ac88 rack1
UN 127.0.0.2 191.31 KiB 16 100.0% f3db4bbe-1f35-4edb-8513-cb55a05393a7 rack1
DN 127.0.0.3 194.35 KiB 16 100.0% 46c2f4b5-905b-42b4-8bb9-563a03c4b415 rack1
node3 - update keystore path to point to new keystore in cassandra.yaml
...
server_encryption_options:
internode_encryption: all
keystore: /ssl_artifacts_20210603_070951/127-0-0-3-keystore.jks
keystore_password: hkjMwpn2y2aYllePAgCNzkBnpD7Vxl6f
truststore: /ssl_artifacts_20210602_125353/common-truststore.jks
truststore_password: 8dPhJ2oshBihAYHcaXzgfzq6kbJ13tQi
...
node3 - start Cassandra service
$ ccm node3 start
$ ccm node1 nodetool status
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns (effective) Host ID Rack
UN 127.0.0.1 225.42 KiB 16 100.0% 2661807a-d8d3-4bba-8639-6c0fada2ac88 rack1
UN 127.0.0.2 191.31 KiB 16 100.0% f3db4bbe-1f35-4edb-8513-cb55a05393a7 rack1
UN 127.0.0.3 239.3 KiB 16 100.0% 46c2f4b5-905b-42b4-8bb9-563a03c4b415 rack1
The keystore rotation is now complete on all nodes in our cluster. However, all nodes are still using the updated OLD truststore. To ensure that our old Root CA can no longer be used to intercept messages in our cluster we need to roll out the NEW truststore to all nodes. This can be done in the same way we deployed the new keystores.
node1 - stop Cassandra service
$ ccm node1 stop
$ ccm node2 status
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns (effective) Host ID Rack
DN 127.0.0.1 225.42 KiB 16 100.0% 2661807a-d8d3-4bba-8639-6c0fada2ac88 rack1
UN 127.0.0.2 191.31 KiB 16 100.0% f3db4bbe-1f35-4edb-8513-cb55a05393a7 rack1
UN 127.0.0.3 185.37 KiB 16 100.0% 46c2f4b5-905b-42b4-8bb9-563a03c4b415 rack1
node1 - update truststore path to point to new truststore in cassandra.yaml
...
server_encryption_options:
internode_encryption: all
keystore: /ssl_artifacts_20210603_070951/127-0-0-1-keystore.jks
keystore_password: V3fKP76XfK67KTAti3CXAMc8hVJGJ7Jg
truststore: /ssl_artifacts_20210603_070951/common-truststore.jks
truststore_password: 0bYmrrXaKIPJQ5UrtQQTFpPLepMweaLc
...
node1 - start Cassandra service
$ ccm node1 start
$ ccm node2 status
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns (effective) Host ID Rack
UN 127.0.0.1 150 KiB 16 100.0% 2661807a-d8d3-4bba-8639-6c0fada2ac88 rack1
UN 127.0.0.2 191.31 KiB 16 100.0% f3db4bbe-1f35-4edb-8513-cb55a05393a7 rack1
UN 127.0.0.3 185.37 KiB 16 100.0% 46c2f4b5-905b-42b4-8bb9-563a03c4b415 rack1
Now we update the truststore for node2.
node2 - stop Cassandra service
$ ccm node2 stop
$ ccm node3 nodetool status
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns (effective) Host ID Rack
UN 127.0.0.1 150 KiB 16 100.0% 2661807a-d8d3-4bba-8639-6c0fada2ac88 rack1
DN 127.0.0.2 191.31 KiB 16 100.0% f3db4bbe-1f35-4edb-8513-cb55a05393a7 rack1
UN 127.0.0.3 185.37 KiB 16 100.0% 46c2f4b5-905b-42b4-8bb9-563a03c4b415 rack1
node2 - update truststore path to point to NEW truststore in cassandra.yaml
...
server_encryption_options:
internode_encryption: all
keystore: /ssl_artifacts_20210603_070951/127-0-0-2-keystore.jks
keystore_password: 3uEjkTiR0xI56RUDyo23TENJjtMk8VbY
truststore: /ssl_artifacts_20210603_070951/common-truststore.jks
truststore_password: 0bYmrrXaKIPJQ5UrtQQTFpPLepMweaLc
...
node2 - start Cassandra service
$ ccm node2 start
$ ccm node3 nodetool status
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns (effective) Host ID Rack
UN 127.0.0.1 150 KiB 16 100.0% 2661807a-d8d3-4bba-8639-6c0fada2ac88 rack1
UN 127.0.0.2 294.05 KiB 16 100.0% f3db4bbe-1f35-4edb-8513-cb55a05393a7 rack1
UN 127.0.0.3 185.37 KiB 16 100.0% 46c2f4b5-905b-42b4-8bb9-563a03c4b415 rack1
Now we update the truststore for node3.
node3 - stop Cassandra service
$ ccm node3 stop
$ ccm node1 nodetool status
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns (effective) Host ID Rack
UN 127.0.0.1 150 KiB 16 100.0% 2661807a-d8d3-4bba-8639-6c0fada2ac88 rack1
UN 127.0.0.2 208.83 KiB 16 100.0% f3db4bbe-1f35-4edb-8513-cb55a05393a7 rack1
DN 127.0.0.3 185.37 KiB 16 100.0% 46c2f4b5-905b-42b4-8bb9-563a03c4b415 rack1
node3 - update truststore path to point to NEW truststore in cassandra.yaml
...
server_encryption_options:
internode_encryption: all
keystore: /ssl_artifacts_20210603_070951/127-0-0-3-keystore.jks
keystore_password: hkjMwpn2y2aYllePAgCNzkBnpD7Vxl6f
truststore: /ssl_artifacts_20210603_070951/common-truststore.jks
truststore_password: 0bYmrrXaKIPJQ5UrtQQTFpPLepMweaLc
...
node3 - start Cassandra service
$ ccm node3 start
$ ccm node1 nodetool status
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns (effective) Host ID Rack
UN 127.0.0.1 150 KiB 16 100.0% 2661807a-d8d3-4bba-8639-6c0fada2ac88 rack1
UN 127.0.0.2 208.83 KiB 16 100.0% f3db4bbe-1f35-4edb-8513-cb55a05393a7 rack1
UN 127.0.0.3 288.6 KiB 16 100.0% 46c2f4b5-905b-42b4-8bb9-563a03c4b415 rack1
The rotation of the certificates is now complete, all while having only a single node down at any one time! This process can be used for all three of the deployment variations. In addition, it can be used to move between the different deployment variations without incurring downtime.
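If you want to double check which certificate a node is now presenting for internode connections, one option is to inspect the certificate served on the encrypted internode port. The command below is an illustrative check only; it assumes openssl is available and that the node accepts encrypted internode connections on the legacy ssl_storage_port 7001, so adjust the address and port for your deployment.
$ openssl s_client -connect 127.0.0.1:7001 -showcerts </dev/null 2>/dev/null | openssl x509 -noout -subject -issuer -enddate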
Conclusion
Internode encryption plays an important role in securing the internal communication of a cluster. When deployed, it is crucial that certificate expiry dates be tracked so the certificates can be rotated before they expire. Failure to do so will result in unavailability and inconsistencies.
Using the process discussed in this post and combined with the appropriate tooling, internode encryption can be easily deployed and associated certificates easily rotated. In addition, the process can be used to move between the different encryption deployments.
Regardless of the reason for using the process, it can be executed without incurring downtime in common Cassandra use cases.
Running your Database on OpenShift and CodeReady Containers
Let’s take an introductory run-through of setting up your database on OpenShift, using your own hardware and RedHat’s CodeReady Containers.
CodeReady Containers is a great way to run OpenShift K8s locally, ideal for development and testing. The steps in this blog post will require a machine, laptop or desktop, of decent capability; preferably quad CPUs and 16GB+ RAM.
Download and Install RedHat’s CodeReady Containers
Download and install RedHat’s CodeReady Containers (version 1.27) as described in Red Hat OpenShift 4 on your laptop: Introducing Red Hat CodeReady Containers.
First configure CodeReady Containers, from the command line
❯ crc setup
…
Your system is correctly setup for using CodeReady Containers, you can now run 'crc start' to start the OpenShift cluster
Check the version is correct.
❯ crc version
…
CodeReady Containers version: 1.27.0+3d6bc39d
OpenShift version: 4.7.11 (not embedded in executable)
Then start it, entering the Pull Secret copied from the download page. Have patience here, this can take ten minutes or more.
❯ crc start
INFO Checking if running as non-root
…
Started the OpenShift cluster.
The server is accessible via web console at:
https://console-openshift-console.apps-crc.testing
…
The output above will include the kubeadmin password, which is required in the following oc login … command.
❯ eval $(crc oc-env)
❯ oc login -u kubeadmin -p <password-from-crc-setup-output> https://api.crc.testing:6443
❯ oc version
Client Version: 4.7.11
Server Version: 4.7.11
Kubernetes Version: v1.20.0+75370d3
Open the URL https://console-openshift-console.apps-crc.testing in a browser.
Log in using the kubeadmin username and password, as used above with the oc login … command.
You might need to try a few times because of the self-signed certificate used.
Once OpenShift has started and is running you should see the following webpage
Some commands to help check the status and the startup process are:
❯ oc status
In project default on server https://api.crc.testing:6443
svc/openshift - kubernetes.default.svc.cluster.local
svc/kubernetes - 10.217.4.1:443 -> 6443
View details with 'oc describe <resource>/<name>' or list resources with 'oc get all'.
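Another quick check is crc status, which reports the state of the CodeReady Containers virtual machine and the OpenShift cluster running inside it:
❯ crc status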
Before continuing, go to the CodeReady Containers Preferences dialog. Increase CPUs and Memory to >12 and >14GB respectively.
Create the OpenShift Local Volumes
Cassandra needs persistent volumes for its data directories. There are different ways to do this in OpenShift, from enabling local host paths in Rancher persistent volumes, to installing and using the OpenShift Local Storage Operator, and of course persistent volumes on the different cloud provider backends.
This blog post will use vanilla OpenShift volumes using folders on the master k8s node.
Go to the “Terminal” tab for the master node and create the required directories. The master node is found on the /cluster/nodes/ webpage.
Click on the node, named something like crc-m89r2-master-0, and then click on the “Terminal” tab. In the terminal, execute the following commands:
sh-4.4# chroot /host
sh-4.4# mkdir -p /mnt/cass-operator/pv000
sh-4.4# mkdir -p /mnt/cass-operator/pv001
sh-4.4# mkdir -p /mnt/cass-operator/pv002
sh-4.4#
Persistent Volumes are to be created with affinity to the master node, declared in the following yaml. The name of the master node can vary from installation to installation. If your master node is not named crc-gm7cm-master-0, the following commands will replace the name for you. First download the cass-operator-1.7.0-openshift-storage.yaml file, then check the name of the node in the nodeAffinity sections against your current CodeReady Containers instance, updating it if necessary.
❯ wget https://thelastpickle.com/files/openshift-intro/cass-operator-1.7.0-openshift-storage.yaml
# The name of your master node
❯ oc get nodes -o=custom-columns=NAME:.metadata.name --no-headers
# If it is not crc-gm7cm-master-0
❯ sed -i '' "s/crc-gm7cm-master-0/$(oc get nodes -o=custom-columns=NAME:.metadata.name --no-headers)/" cass-operator-1.7.0-openshift-storage.yaml
Create the Persistent Volumes (PV) and Storage Class (SC).
❯ oc apply -f cass-operator-1.7.0-openshift-storage.yaml
persistentvolume/server-storage-0 created
persistentvolume/server-storage-1 created
persistentvolume/server-storage-2 created
storageclass.storage.k8s.io/server-storage created
To check the existence of the PVs.
❯ oc get pv | grep server-storage
server-storage-0 10Gi RWO Delete Available server-storage 5m19s
server-storage-1 10Gi RWO Delete Available server-storage 5m19s
server-storage-2 10Gi RWO Delete Available server-storage 5m19s
To check the existence of the SC.
❯ oc get sc
NAME PROVISIONER RECLAIMPOLICY VOLUMEBINDINGMODE ALLOWVOLUMEEXPANSION AGE
server-storage kubernetes.io/no-provisioner Delete WaitForFirstConsumer false 5m36s
More information can be found in the RedHat documentation for OpenShift volumes.
Deploy the Cass-Operator
Now create the cass-operator. Here we can use the upstream 1.7.0 version of the cass-operator. After creating (applying) the cass-operator, it is important to quickly execute the oc adm policy … commands in the following step so the pods have the privileges required and are successfully created.
❯ oc apply -f https://raw.githubusercontent.com/k8ssandra/cass-operator/v1.7.0/docs/user/cass-operator-manifests.yaml
namespace/cass-operator created
serviceaccount/cass-operator created
secret/cass-operator-webhook-config created
W0606 14:25:44.757092 27806 warnings.go:70] apiextensions.k8s.io/v1beta1 CustomResourceDefinition is deprecated in v1.16+, unavailable in v1.22+; use apiextensions.k8s.io/v1 CustomResourceDefinition
W0606 14:25:45.077394 27806 warnings.go:70] apiextensions.k8s.io/v1beta1 CustomResourceDefinition is deprecated in v1.16+, unavailable in v1.22+; use apiextensions.k8s.io/v1 CustomResourceDefinition
customresourcedefinition.apiextensions.k8s.io/cassandradatacenters.cassandra.datastax.com created
clusterrole.rbac.authorization.k8s.io/cass-operator-cr created
clusterrole.rbac.authorization.k8s.io/cass-operator-webhook created
clusterrolebinding.rbac.authorization.k8s.io/cass-operator-crb created
clusterrolebinding.rbac.authorization.k8s.io/cass-operator-webhook created
role.rbac.authorization.k8s.io/cass-operator created
rolebinding.rbac.authorization.k8s.io/cass-operator created
service/cassandradatacenter-webhook-service created
deployment.apps/cass-operator created
W0606 14:25:46.701712 27806 warnings.go:70] admissionregistration.k8s.io/v1beta1 ValidatingWebhookConfiguration is deprecated in v1.16+, unavailable in v1.22+; use admissionregistration.k8s.io/v1 ValidatingWebhookConfiguration
W0606 14:25:47.068795 27806 warnings.go:70] admissionregistration.k8s.io/v1beta1 ValidatingWebhookConfiguration is deprecated in v1.16+, unavailable in v1.22+; use admissionregistration.k8s.io/v1 ValidatingWebhookConfiguration
validatingwebhookconfiguration.admissionregistration.k8s.io/cassandradatacenter-webhook-registration created
❯ oc adm policy add-scc-to-user privileged -z default -n cass-operator
clusterrole.rbac.authorization.k8s.io/system:openshift:scc:privileged added: "default"
❯ oc adm policy add-scc-to-user privileged -z cass-operator -n cass-operator
clusterrole.rbac.authorization.k8s.io/system:openshift:scc:privileged added: "cass-operator"
Let’s check the deployment happened.
❯ oc get deployments -n cass-operator
NAME READY UP-TO-DATE AVAILABLE AGE
cass-operator 1/1 1 1 14m
Let’s also check the cass-operator pod was created and is successfully running. Note that the kubectl command is used here; for all k8s actions the oc and kubectl commands are interchangeable.
❯ kubectl get pods -w -n cass-operator
NAME READY STATUS RESTARTS AGE
cass-operator-7675b65744-hxc8z 1/1 Running 0 15m
Troubleshooting: If the cass-operator does not end up in Running status, or if any pods in later sections fail to start, it is recommended to use the OpenShift UI Events webpage for easy diagnostics.
Setup the Cassandra Cluster
The next step is to create the cluster. The following deployment file creates a 3 node cluster. It is largely a copy from the upstream cass-operator version 1.7.0 file example-cassdc-minimal.yaml but with a small modification made to allow all the pods to be deployed to the same worker node (as CodeReady Containers only uses one k8s node by default).
❯ oc apply -n cass-operator -f https://thelastpickle.com/files/openshift-intro/cass-operator-1.7.0-openshift-minimal-3.11.yaml
cassandradatacenter.cassandra.datastax.com/dc1 created
Let’s watch the pods get created, initialise, and eventually reach the Running state, using the kubectl get pods … watch command.
❯ kubectl get pods -w -n cass-operator
NAME READY STATUS RESTARTS AGE
cass-operator-7675b65744-28fhw 1/1 Running 0 102s
cluster1-dc1-default-sts-0 0/2 Pending 0 0s
cluster1-dc1-default-sts-1 0/2 Pending 0 0s
cluster1-dc1-default-sts-2 0/2 Pending 0 0s
cluster1-dc1-default-sts-0 2/2 Running 0 3m
cluster1-dc1-default-sts-1 2/2 Running 0 3m
cluster1-dc1-default-sts-2 2/2 Running 0 3m
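Since CodeReady Containers schedules everything onto its single node, you can also confirm where the Cassandra pods landed with the -o wide flag (a quick, optional check):
❯ kubectl -n cass-operator get pods -o wide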
Use the Cassandra Cluster
With the Cassandra pods each up and running, the cluster is ready to be used. Test it out using the nodetool status command.
❯ kubectl -n cass-operator exec -it cluster1-dc1-default-sts-0 -- nodetool status
Defaulting container name to cassandra.
Use 'kubectl describe pod/cluster1-dc1-default-sts-0 -n cass-operator' to see all of the containers in this pod.
Datacenter: dc1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns (effective) Host ID Rack
UN 10.217.0.73 84.42 KiB 1 83.6% 672baba8-9a05-45ac-aad1-46427027b57a default
UN 10.217.0.72 70.2 KiB 1 65.3% 42758a86-ea7b-4e9b-a974-f9e71b958429 default
UN 10.217.0.71 65.31 KiB 1 51.1% 2fa73bc2-471a-4782-ae63-5a34cc27ab69 default
The above command can be run on `cluster1-dc1-default-sts-1` and `cluster1-dc1-default-sts-2` too.
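For example, a minimal sketch that loops over all three pods:
❯ for i in 0 1 2; do kubectl -n cass-operator exec cluster1-dc1-default-sts-$i -- nodetool status; done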
Next, test out cqlsh. For this, authentication is required, so first get the CQL username and password.
# Get the cql username
❯ kubectl -n cass-operator get secret cluster1-superuser -o yaml | grep " username" | awk -F" " '{print $2}' | base64 -d && echo ""
# Get the cql password
❯ kubectl -n cass-operator get secret cluster1-superuser -o yaml | grep " password" | awk -F" " '{print $2}' | base64 -d && echo ""
❯ kubectl -n cass-operator exec -it cluster1-dc1-default-sts-0 -- cqlsh -u <cql-username> -p <cql-password>
Connected to cluster1 at 127.0.0.1:9042.
[cqlsh 5.0.1 | Cassandra 3.11.7 | CQL spec 3.4.4 | Native protocol v4]
Use HELP for help.
cluster1-superuser@cqlsh>
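If you prefer not to copy and paste the credentials by hand, here is a small sketch that captures them into shell variables (using kubectl's jsonpath output instead of grep/awk) and opens cqlsh in one go; the variable names are just illustrative:
❯ CASS_USER=$(kubectl -n cass-operator get secret cluster1-superuser -o jsonpath='{.data.username}' | base64 -d)
❯ CASS_PASS=$(kubectl -n cass-operator get secret cluster1-superuser -o jsonpath='{.data.password}' | base64 -d)
❯ kubectl -n cass-operator exec -it cluster1-dc1-default-sts-0 -- cqlsh -u "$CASS_USER" -p "$CASS_PASS"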
Keep It Clean
CodeReady Containers is very simple to clean up, especially because it is a packaging of OpenShift intended only for development purposes. To wipe everything, just stop and delete:
❯ crc stop
❯ crc delete
If, on the other hand, you only want to undo individual steps, each of the following can be run (in order).
❯ oc delete -n cass-operator -f https://thelastpickle.com/files/openshift-intro/cass-operator-1.7.0-openshift-minimal-3.11.yaml
❯ oc delete -f https://raw.githubusercontent.com/k8ssandra/cass-operator/v1.7.0/docs/user/cass-operator-manifests.yaml
❯ oc delete -f cass-operator-1.7.0-openshift-storage.yaml
On Scylla Manager Suspend & Resume feature
Disclaimer: This blog post is neither a rant nor intended to undermine the great work that...
Apache Cassandra's Continuous Integration Systems
With Apache Cassandra 4.0 just around the corner, and the feature freeze on trunk lifted, let’s take a dive into the efforts ongoing with the project’s testing and Continuous Integration systems.
continuous integration in open source
Every software project benefits from sound testing practices and having continuous integration in place. Even more so for open source projects: contributors work around the world in many different timezones, are particularly prone to broken builds, longer wait times and uncertainties, and often don't have the same communication bandwidth with each other because they work in different companies and are scratching different itches.
This is especially true for Apache Cassandra. As an early-maturity technology used everywhere on mission critical data, stability and reliability are crucial for deployments. Contributors from many companies (Alibaba, Amazon, Apple, Bloomberg, Dynatrace, DataStax, Huawei, Instaclustr, Netflix, Pythian, and more) need to coordinate, collaborate, and most importantly trust each other.
During the feature freeze the project was fortunate to not just stabilise and fix tons of tests, but to also expand its continuous integration systems. This really helps set the stage for a post-4.0 roadmap that features heavily on pluggability, developer experience and safety, as well as aiming for an always-shippable trunk.
@ cassandra
The continuous integration systems at play are CircleCI and ci-cassandra.apache.org
CircleCI is a commercial solution. The main usage today of CircleCI is pre-commit, that is, testing your patches while they get reviewed and before they get merged. Effectively using CircleCI on Cassandra requires either the medium or high resource profile, which enables the use of hundreds of containers and lots of resources, and that's basically only available to folks working in companies that pay for a premium CircleCI account. There are lots of stages to the CircleCI pipeline, and developers just trigger the stages they feel are relevant to test that patch on.
ci-cassandra is our community CI. It is based on CloudBees, provided by the ASF, and runs 40 agents (servers) around the world donated by numerous different companies in our community. Its main usage is post-commit, and its pipelines run every stage automatically. Today the pipeline consists of 40K tests. And for the first time in many years, in the lead up to 4.0, pipeline runs are completely green.
ci-cassandra is set up with a combination of Jenkins DSL scripts and declarative Jenkinsfiles. These jobs use the build scripts found here.
forty thousand tests
The project has many types of tests. It has proper unit tests, and unit tests that have an embedded Cassandra server. The unit tests are run in a number of different parameterisations: from different Cassandra configurations, JDK 8 and JDK 11, to supporting the ARM architecture. There are CQLSH tests written in Python against a single ccm node. Then there are the Java distributed tests and Python distributed tests. The Python distributed tests are older, use CCM, and also run parameterised. The Java distributed tests are a recent addition and run the Cassandra nodes inside the JVM. Both types of distributed tests also include testing the upgrade paths between different Cassandra versions. Most new distributed tests today are written as Java distributed tests. There are also burn and microbench (JMH) tests.
distributed is difficult
Testing distributed tech is hardcore. Anyone who's tried to run the Python upgrade dtests locally knows the pain. Running the tests in Docker helps a lot, and this is what CircleCI and ci-cassandra predominantly do. The base Docker images are found here. Distributed tests can fall over for numerous reasons, exacerbated in ci-cassandra with heterogeneous servers around the world and all the possible network and disk issues that can occur. Just for the 4.0 release, over 200 Jira tickets were focused on strengthening flaky tests. Because ci-cassandra has limited storage, the logs and test results for all runs are archived in nightlies.apache.org/cassandra.
call for help
There’s still heaps to do. This is all part-time and volunteer effort. No one in the community is dedicated to these systems or works as a build engineer. The project can use all the help it can get.
There’s a ton of exciting stuff to add. Some examples are microbench and JMH reports, Jacoco test coverage reports, Harry for fuzz testing, Adelphi or Fallout for end-to-end performance and comparison testing, hooking up Apache Yetus for efficient resource usage, or putting our Jenkins stack into a k8s operator run script so you can run the pipeline on your own k8s cluster.
So don’t be afraid to jump in, pick your poison, we’d love to see you!
Reaper 2.2 for Apache Cassandra was released
We’re pleased to announce that Reaper 2.2 for Apache Cassandra was just released. This release includes a major redesign of how segments are orchestrated, which allows users to run concurrent repairs on nodes. Let’s dive into these changes and see what they mean for Reaper’s users.
New Segment Orchestration
Reaper works in a variety of standalone or distributed modes, which create some challenges in meeting the following requirements:
- A segment is processed successfully exactly once.
- No more than one segment is running on a node at once.
- Segments can only be started if the number of pending compactions on a node involved is lower than the defined threshold.
To make sure a segment won’t be handled by several Reaper instances at once, Reaper relies on LightWeight Transactions (LWT) to implement a leader election process. A Reaper instance will “take the lead” on a segment by using a LWT and then perform the checks for the last two conditions above.
To avoid race conditions between two different segments involving a common set of replicas that would start at the same time, a “master lock” was placed after the checks to guarantee that a single segment would be able to start. This required a double check to be performed before actually starting the segment.
There were several issues with this design:
- It involved a lot of LWTs even if no segment could be started.
- It was a complex design which made the code hard to maintain.
- The “master lock” was creating a lot of contention as all Reaper instances would compete for the same partition, leading to some nasty situations. This was especially the case in sidecar mode as it involved running a lot of Reaper instances (one per Cassandra node).
As we were seeing suboptimal performance and high LWT contention
in some setups, we redesigned how segments were orchestrated to
reduce the number of LWTs and maximize concurrency during repairs
(all nodes should be busy repairing if possible).
Instead of locking segments, we explored whether it would be
possible to lock nodes instead. This approach would give us several
benefits:
- We could check which nodes are already busy without issuing JMX requests to the nodes.
- We could easily filter segments to be processed to retain only those with available nodes.
- We could remove the master lock as we would have no more race conditions between segments.
One of the hard parts was that locking several nodes in a consistent manner would be challenging, as it would involve several rows, and Cassandra doesn't have a concept of an atomic transaction that can be rolled back the way an RDBMS does. Luckily, we were able to leverage one feature of batch statements: all Cassandra batch statements that target a single partition will turn all operations into a single atomic one (at the node level). If one node out of all replicas was already locked, then none would be locked by the batched LWTs. We used the following model for the leader election table on nodes:
CREATE TABLE reaper_db.running_repairs (
repair_id uuid,
node text,
reaper_instance_host text,
reaper_instance_id uuid,
segment_id uuid,
PRIMARY KEY (repair_id, node)
) WITH CLUSTERING ORDER BY (node ASC);
The following LWTs are then issued in a batch for each replica:
BEGIN BATCH
UPDATE reaper_db.running_repairs USING TTL 90
SET reaper_instance_host = 'reaper-host-1',
reaper_instance_id = 62ce0425-ee46-4cdb-824f-4242ee7f86f4,
segment_id = 70f52bc2-7519-11eb-809e-9f94505f3a3e
WHERE repair_id = 70f52bc0-7519-11eb-809e-9f94505f3a3e AND node = 'node1'
IF reaper_instance_id = null;
UPDATE reaper_db.running_repairs USING TTL 90
SET reaper_instance_host = 'reaper-host-1',
reaper_instance_id = 62ce0425-ee46-4cdb-824f-4242ee7f86f4,
segment_id = 70f52bc2-7519-11eb-809e-9f94505f3a3e
WHERE repair_id = 70f52bc0-7519-11eb-809e-9f94505f3a3e AND node = 'node2'
IF reaper_instance_id = null;
UPDATE reaper_db.running_repairs USING TTL 90
SET reaper_instance_host = 'reaper-host-1',
reaper_instance_id = 62ce0425-ee46-4cdb-824f-4242ee7f86f4,
segment_id = 70f52bc2-7519-11eb-809e-9f94505f3a3e
WHERE repair_id = 70f52bc0-7519-11eb-809e-9f94505f3a3e AND node = 'node3'
IF reaper_instance_id = null;
APPLY BATCH;
If all the conditional updates are able to be applied, we’ll get the following data in the table:
cqlsh> select * from reaper_db.running_repairs;
repair_id | node | reaper_instance_host | reaper_instance_id | segment_id
--------------------------------------+-------+----------------------+--------------------------------------+--------------------------------------
70f52bc0-7519-11eb-809e-9f94505f3a3e | node1 | reaper-host-1 | 62ce0425-ee46-4cdb-824f-4242ee7f86f4 | 70f52bc2-7519-11eb-809e-9f94505f3a3e
70f52bc0-7519-11eb-809e-9f94505f3a3e | node2 | reaper-host-1 | 62ce0425-ee46-4cdb-824f-4242ee7f86f4 | 70f52bc2-7519-11eb-809e-9f94505f3a3e
70f52bc0-7519-11eb-809e-9f94505f3a3e | node3 | reaper-host-1 | 62ce0425-ee46-4cdb-824f-4242ee7f86f4 | 70f52bc2-7519-11eb-809e-9f94505f3a3e
If one of the conditional updates fails because one node is already locked for the same repair_id, then none will be applied.
Note: the Postgres backend also benefits from these new features through the use of transactions, using commit and rollback to deal with success/failure cases.
The new design is now much simpler than the initial one:
Segments are now filtered to those that have no replica locked, to avoid wasting energy in trying to lock them, and the pending compactions check also happens before any locking.
This reduces the number of LWTs by four in the simplest cases and we expect more challenging repairs to benefit from even more reductions:
At the same time, repair duration on a 9-node cluster showed 15%-20% improvements thanks to the more efficient segment selection.
One prerequisite to make that design efficient was to store the replicas for each segment in the database when the repair run is created. You can now see which nodes are involved for each segment and which datacenter they belong to in the Segments view:
Concurrent repairs
Using the repair id as the partition key for the node leader election table gives us another feature that was long awaited: concurrent repairs. A node could be locked by different Reaper instances for different repair runs, allowing several repairs to run concurrently on each node. In order to control the level of concurrency, a new setting was introduced in Reaper: maxParallelRepairs. By default it is set to 2 and should be tuned carefully, as heavy concurrent repairs could have a negative impact on cluster performance. If you have small keyspaces that need to be repaired on a regular basis, they won't be blocked by large keyspaces anymore.
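For illustration only, the setting is a single line in Reaper's YAML configuration. The path below is an assumption (it depends on how Reaper was installed), so check it against your own deployment, and restart Reaper after changing the value:
# hypothetical package-install location for the Reaper configuration file
$ grep -n "maxParallelRepairs" /etc/cassandra-reaper/cassandra-reaper.yaml
If the line is absent, adding for example maxParallelRepairs: 4 and restarting Reaper raises the limit from the default of 2.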
Future upgrades
As some of you are probably aware, JFrog has decided to sunset Bintray and JCenter. Bintray is our main distribution medium and we will be working on replacement repositories. The 2.2.0 release is unaffected by this change but future upgrades could require an update to yum/apt repos. The documentation will be updated accordingly in due time.
Upgrade now
We encourage all Reaper users to upgrade to 2.2.0. It was tested successfully by some of our customers who had issues with LWT pressure and blocking repairs. This version is expected to make repairs faster and more lightweight on the Cassandra backend. We were able to remove a lot of legacy code and design that was fit for single token clusters but failed at spreading segments efficiently for clusters using vnodes.
The binaries for Reaper 2.2.0 are available from yum, apt-get, Maven Central, Docker Hub, and are also downloadable as tarball packages. Remember to backup your database before starting the upgrade.
All instructions to download, install, configure, and use Reaper 2.2.0 are available on the Reaper website.
Creating Flamegraphs with Apache Cassandra in Kubernetes (cass-operator)
In a previous blog post recommending disabling read repair chance, some flamegraphs were generated to demonstrate the effect read repair chance had on a cluster. Let’s go through how those flamegraphs were captured, step-by-step using Apache Cassandra 3.11.6, Kubernetes and the cass-operator, nosqlbench and the async-profiler.
In previous blog posts we would have used the existing tools of tlp-cluster or ccm, tlp-stress or cassandra-stress, and sjk. Here we take a new approach that is a lot more fun, as with k8s the same approach can be used locally or in the cloud. No need to switch between ccm clusters for local testing and tlp-cluster for cloud testing. Nor are you bound to AWS for big instance testing; that's right, no vendor lock-in. Cass-operator and K8ssandra are getting a ton of momentum from DataStax, so it is only deserved and exciting to introduce them to as much of the open source world as we can.
This blog post is not an in-depth dive into using cass-operator, rather a simple teaser to demonstrate how we can grab some flamegraphs, as quickly as possible. The blog post is split into three sections
- Setting up Kubernetes and getting Cassandra running
- Getting access to Cassandra from outside Kubernetes
- Stress testing and creating flamegraphs
Setup
Let’s go through a quick demonstration using Kubernetes, the cass-operator, and some flamegraphs.
First, download four yaml configuration files that will be used. This is not strictly necessary for the latter three, as kubectl may reference them by their URLs, but let’s download them for the sake of having the files locally and being able to make edits if and when desired.
wget https://thelastpickle.com/files/2021-01-31-cass_operator/01-kind-config.yaml
wget https://thelastpickle.com/files/2021-01-31-cass_operator/02-storageclass-kind.yaml
wget https://thelastpickle.com/files/2021-01-31-cass_operator/11-install-cass-operator-v1.1.yaml
wget https://thelastpickle.com/files/2021-01-31-cass_operator/13-cassandra-cluster-3nodes.yaml
The next steps involve kind and kubectl to create a local cluster we can test with. To use kind you need docker running locally; it is recommended to have 4 CPUs and 12GB RAM for this exercise.
kind create cluster --name read-repair-chance-test --config 01-kind-config.yaml
kubectl create ns cass-operator
kubectl -n cass-operator apply -f 02-storageclass-kind.yaml
kubectl -n cass-operator apply -f 11-install-cass-operator-v1.1.yaml
# watch and wait until the pod is running
watch kubectl -n cass-operator get pod
# create 3 node C* cluster
kubectl -n cass-operator apply -f 13-cassandra-cluster-3nodes.yaml
# again, wait for pods to be running
watch kubectl -n cass-operator get pod
# test the three nodes are up
kubectl -n cass-operator exec -it cluster1-dc1-default-sts-0 -- nodetool status
Access
For this example we are going to run NoSqlBench from outside the k8s cluster, so we will need access to a pod’s Native Protocol interface via port-forwarding. This approach is practical here because it was desired to have the benchmark connect to just one coordinator. In many situations you would instead run NoSqlBench from a separate dedicated pod inside the k8s cluster.
# get the cql username
kubectl -n cass-operator get secret cluster1-superuser -o yaml | grep " username" | awk -F" " '{print $2}' | base64 -d && echo ""
# get the cql password
kubectl -n cass-operator get secret cluster1-superuser -o yaml | grep " password" | awk -F" " '{print $2}' | base64 -d && echo ""
# port forward the native protocol (CQL)
kubectl -n cass-operator port-forward --address 0.0.0.0 cluster1-dc1-default-sts-0 9042:9042
The above sets up the k8s cluster, a k8s storageClass, and the cass-operator with a three node Cassandra cluster. For a more in depth look at this setup checkout this tutorial.
Stress Testing and Flamegraphs
With a cluster to play with, let’s generate some load and then go grab some flamegraphs.
Instead of using SJK (Swiss Java Knife), as our previous blog posts have done, we will use the async-profiler. The async-profiler does not suffer from the Safepoint bias problem, an issue we see more often than we would like in Cassandra nodes (protip: make sure you configure ParallelGCThreads and ConcGCThreads to the same value).
Open a new terminal window and do the following.
# get the latest NoSqlBench jarfile
wget https://github.com/nosqlbench/nosqlbench/releases/latest/download/nb.jar
# generate some load, use credentials as found above
java -jar nb.jar cql-keyvalue username=<cql_username> password=<cql_password> whitelist=127.0.0.1 rampup-cycles=10000 main-cycles=500000 rf=3 read_cl=LOCAL_ONE
# while the load is still running,
# open a shell in the coordinator pod, download async-profiler and generate a flamegraph
kubectl -n cass-operator exec -it cluster1-dc1-default-sts-0 -- /bin/bash
wget https://github.com/jvm-profiling-tools/async-profiler/releases/download/v1.8.3/async-profiler-1.8.3-linux-x64.tar.gz
tar xvf async-profiler-1.8.3-linux-x64.tar.gz
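# the profiler needs the PID of the Cassandra JVM; one way to find it inside
# the pod (assuming pgrep is available in the container image) is:
CASSANDRA_PID=$(pgrep -f CassandraDaemon)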
async-profiler-1.8.3-linux-x64/profiler.sh -d 300 -f /tmp/flame_away.svg <CASSANDRA_PID>
exit
# copy the flamegraph out of the pod
kubectl -n cass-operator cp cluster1-dc1-default-sts-0:/tmp/flame_away.svg flame_away.svg
Keep It Clean
After everything is done, it is time to clean up after yourself.
Delete the CassandraDatacenters first, otherwise Kubernetes will block deletion because we use a finalizer. Note, this will delete all data in the cluster.
kubectl delete cassdcs --all-namespaces --all
Remove the operator Deployment, CRD, etc.
# this command can take a while, be patient
kubectl delete -f https://raw.githubusercontent.com/datastax/cass-operator/v1.5.1/docs/user/cass-operator-manifests-v1.16.yaml
# if troubleshooting, to forcibly remove resources, though
# this should not be necessary, and take care as this will wipe all resources
kubectl delete "$(kubectl api-resources --namespaced=true --verbs=delete -o name | tr "\n" "," | sed -e 's/,$//')" --all
To remove the local Kubernetes cluster altogether
kind delete cluster --name read-repair-chance-test
To stop and remove the docker containers that are left running…
docker stop $(docker ps | grep kindest | cut -d" " -f1)
docker rm $(docker ps -a | grep kindest | cut -d" " -f1)
More… the cass-operator tutorials
There is a ton of documentation and tutorials getting released on how to use the cass-operator. If you are keen to learn more the following is highly recommended: Managing Cassandra Clusters in Kubernetes Using Cass-Operator.
The Impacts of Changing the Number of VNodes in Apache Cassandra
Apache Cassandra’s default value for num_tokens is about to change in 4.0! This might seem like a small edit note in the CHANGES.txt; however, such a change can have a profound effect on day-to-day operations of the cluster. In this post we will examine how changing the value for num_tokens impacts the cluster and its behaviour.
There are many knobs and levers that can be modified in Apache Cassandra to tune its behaviour. The num_tokens setting is one of those. Like many settings it lives in the cassandra.yaml file and has a defined default value. That's where it stops being like many of Cassandra's settings. You see, most of Cassandra's settings will only affect a single aspect of the cluster. However, when changing the value of num_tokens there is an array of behaviours that are altered. The Apache Cassandra project has committed and resolved CASSANDRA-13701, which changed the default value for num_tokens from 256 to 16. This change is significant, and to understand the consequences we first need to understand the role that num_tokens plays in the cluster.
Never try this on production
Before we dive into any details, it is worth noting that the num_tokens setting on a node should never ever be changed once it has joined the cluster. For one thing, the node will fail on a restart. The value of this setting should be the same for every node in a datacenter. Historically, different values were expected for heterogeneous clusters. While it's rare to see, and not something we would recommend, you can still in theory double the num_tokens on nodes that are twice as big in terms of hardware specifications.
Furthermore, it is common to see the nodes in one datacenter have a value for num_tokens that differs from the nodes in another datacenter. This is partly how changing the value of this setting on a live cluster can be safely done with zero downtime. It is out of scope for this blog post, but details can be found in migration to a new datacenter.
The Basics
The num_tokens
setting
influences the way Cassandra allocates data amongst the nodes, how
that data is retrieved, and how that data is moved between
nodes.
Under the hood Cassandra uses a partitioner to decide where data is stored in the cluster. The partitioner is a consistent hashing algorithm that maps a partition key (first part of the primary key) to a token. The token dictates which nodes will contain the data associated with the partition key. Each node in the cluster is assigned one or more unique token values from a token ring. This is just a fancy way of saying each node is assigned a number from a circular number range. That is, “the number” being the token hash, and “a circular number range” being the token ring. The token ring is circular because the next value after the maximum token value is the minimum token value.
An assigned token defines the range of tokens in the token ring the node is responsible for. This is commonly known as a “token range”. The “token range” a node is responsible for is bounded by its assigned token, and the next smallest token value going backwards in the ring. The assigned token is included in the range, and the smallest token value going backwards is excluded from the range. The smallest token value going backwards typically resides on the previous neighbouring node. Having a circular token ring means that the range of tokens a node is responsible for, could include both the minimum and maximum tokens in the ring. In at least one case the smallest token value going backwards will wrap back past the maximum token value in the ring.
For example, in the following Token Ring Assignment diagram we have a token ring with a range of hashes from 0 to 99. Token 10 is allocated to Node 1. The node before Node 1 in the cluster is Node 5. Node 5 is allocated token 90. Therefore, the range of tokens that Node 1 is responsible for is between 91 and 10. In this particular case, the token range wraps around past the maximum token in the ring.
Note that the above diagram is for only a single data replica. This is because only a single node is assigned to each token in the token ring. If multiple replicas of the data exists, a node’s neighbours become replicas for the token as well. This is illustrated in the Token Ring Assignment diagram below.
The reason the partitioner is defined as a consistent hashing algorithm is because it is just that; no matter how many times you feed in a specific input, it will always generate the same output value. It ensures that every node, coordinator, or otherwise, will always calculate the same token for a given partition key. The calculated token can then be used to reliably pinpoint the nodes with the sought after data.
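A quick way to see this determinism for yourself (using the test_keyspace.test_table created later in this post) is CQL's token() function; the same partition key always maps to the same token, no matter which node you ask or how many times you run the query:
$ ccm node1 cqlsh
cqlsh> SELECT id, token(id) FROM test_keyspace.test_table WHERE id = 1;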
Consequently, the minimum and maximum numbers for the token ring are defined by the partitioner. The default Murmur3Partitioner, based on the Murmur hash, has for example a minimum and maximum range of -2^63 to +2^63 - 1. The legacy RandomPartitioner (based on the MD5 hash), on the other hand, has a range of 0 to 2^127 - 1. A critical side effect of this system is that once a partitioner for a cluster is picked, it can never be changed. Changing to a different partitioner requires the creation of a new cluster with the desired partitioner and then reloading the data into the new cluster.
Further information on consistent hashing functionality can be found in the Apache Cassandra documentation.
Back in the day…
Back in the pre-1.2 era, nodes could only be manually assigned a single token. This was done, and can still be done today, using the initial_token setting in the cassandra.yaml file. The default partitioner at that point was the RandomPartitioner.
Despite token assignment being manual, the partitioner made the process of calculating the assigned tokens fairly straightforward when setting up a cluster from scratch. For example, if you had a three node cluster you would divide 2^127 - 1 by 3, and the quotient would give you the correct increment amount for each token value. Your first node would have an initial_token of 0, your next node would have an initial_token of (2^127 - 1) / 3, and your third node would have an initial_token of (2^127 - 1) / 3 * 2. Thus, each node will have the same sized token ranges.
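To sanity check that arithmetic from the shell, bc works well since it handles arbitrarily large integers (purely illustrative; any big-integer calculator will do):
$ for i in 0 1 2; do echo "(2^127 - 1) * $i / 3" | bc; done
0
56713727820156410577229101238628035242
113427455640312821154458202477256070484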
Dividing the token ranges up evenly makes it less likely individual nodes are overloaded (assuming identical hardware for the nodes, and an even distribution of data across the cluster). Uneven token distribution can result in what is termed “hot spots”. This is where a node is under pressure as it is servicing more requests or carrying more data than other nodes.
Even though setting up a single token cluster can be a very manual process, their deployment is still common. Especially for very large Cassandra clusters where the node count typically exceeds 1,000 nodes. One of the advantages of this type of deployment, is you can ensure that the token distribution is even.
Although setting up a single token cluster from scratch can result in an even load distribution, growing the cluster is far less straightforward. If you insert a single node into your three node cluster, the result is that two out of the four nodes will have a smaller token range than the other two nodes. To fix this problem and re-balance, you then have to run nodetool move to relocate tokens to other nodes. This is a tedious and expensive task though, involving a lot of streaming around the whole cluster. The alternative is to double the size of your cluster each time you expand it. However, this usually means using more hardware than you need. Much like having an immaculate backyard garden, maintaining an even token range per node in a single token cluster requires time, care, and attention, or alternatively, a good deal of clever automation.
Scaling in a single token world is only half the challenge. Certain failure scenarios heavily reduce time to recovery. Let’s say for example you had a six node cluster with three replicas of the data in a single datacenter (Replication Factor = 3). Replicas might reside on Node 1 and Node 4, Node 2 and Node 5, and lastly on Node 3 and Node 6. In this scenario each node is responsible for a sixth of each of the three replicas.
In the above diagram, the tokens in the token ring are assigned an alpha character. This is to make tracking the token assignment to each node easier to follow. If the cluster had an outage where Node 1 and Node 6 are unavailable, you could only use Nodes 2 and 5 to recover the unique sixth of the data they each have. That is, only Node 2 could be used to recover the data associated with token range ‘F’, and similarly only Node 5 could be used to recover the data associated with token range ‘E’. This is illustrated in the diagram below.
vnodes to the rescue
To solve the shortcomings of a single token assignment, Cassandra version 1.2 was enhanced to allow a node to be assigned multiple tokens. That is a node could be responsible for multiple token ranges. This Cassandra feature is known as “virtual node” or vnodes for short. The vnodes feature was introduced via CASSANDRA-4119. As per the ticket description, the goals of vnodes were:
- Reduced operations complexity for scaling up/down.
- Reduced rebuild time in event of failure.
- Evenly distributed load impact in the event of failure.
- Evenly distributed impact of streaming operations.
- More viable support for heterogeneity of hardware.
The introduction of this feature gave birth to the num_tokens
setting in
the cassandra.yaml file. The setting defined the number of
vnodes (token ranges) a node was responsible for. By increasing the
number of vnodes per node, the token ranges become smaller. This is
because the token ring has a finite number of tokens. The more
ranges it is divided up into the smaller each range is.
To maintain backwards compatibility with older 1.x series
clusters, the num_tokens
defaulted to a value of 1. Moreover, the
setting was effectively disabled on a vanilla installation.
Specifically, the value in the cassandra.yaml file was
commented out. The commented line and
previous development commits did give a
glimpse into the future of where the feature was headed
though.
As foretold by the cassandra.yaml file and the git commit history, when Cassandra version 2.0 was released the vnodes feature was enabled by default. The num_tokens line was no longer commented out, so its effective default value on a vanilla installation was 256. This ushered in a new era of clusters that had relatively even token distributions and were simple to grow.
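For instance, on a 3.x node installed from packages the default is visible right in the yaml (the path below is the package-install location and will differ for tarball installs):
$ grep "^num_tokens" /etc/cassandra/cassandra.yaml
num_tokens: 256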
With nodes consisting of 256 vnodes and the accompanying
additional features, expanding the cluster was a dream. You could
insert one new node into your cluster and Cassandra would calculate
and assign the tokens automatically! The token values were randomly
calculated, and so over time as you added more nodes, the cluster
would converge on being in a balanced state. This engineering
wizardry put an end to spending hours doing calculations and
nodetool
move
operations to grow a cluster. The option was still
there though. If you had a very large cluster or another
requirement, you could still use the initial_token
setting
which was commented out in Cassandra version 2.0. In this case, the
value for the num_tokens
still had
to be set to the number of tokens manually defined in the
initial_token
setting.
Remember to read the fine print
This gave us a feature that was like a personal devops assistant; you handed them a node, told them to insert it, and then after some time it had tokens allocated and was part of the cluster. However, in a similar vein, there is a price to pay for the convenience…
While we get a more even token distribution when using 256 vnodes, the problem is that availability degrades earlier. Ironically, the more we break the token ranges up the more quickly we can get data unavailability. Then there is the issue of unbalanced token ranges when using a small number of vnodes. By small, I mean values less than 32. Cassandra’s random token allocation is hopeless when it comes to small vnode values. This is because there are insufficient tokens to balance out the wildly different token range sizes that are generated.
Pics or it didn’t happen
It is very easy to demonstrate the availability and token range
imbalance issues, using a test cluster. We can set up a single
token range cluster with six nodes using ccm
. After
calculating the tokens, configuring and starting our test cluster,
it looked like this.
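Reproducing such a cluster with ccm looks roughly like this (a sketch; the original setup calculated the tokens explicitly, but when vnodes are not enabled ccm can compute the evenly spaced single tokens for you):
$ ccm create SINGLETOKEN -v 3.11.9 -n 6 -s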
$ ccm node1 nodetool status
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns (effective) Host ID Rack
UN 127.0.0.1 71.17 KiB 1 33.3% 8d483ae7-e7fa-4c06-9c68-22e71b78e91f rack1
UN 127.0.0.2 65.99 KiB 1 33.3% cc15803b-2b93-40f7-825f-4e7bdda327f8 rack1
UN 127.0.0.3 85.3 KiB 1 33.3% d2dd4acb-b765-4b9e-a5ac-a49ec155f666 rack1
UN 127.0.0.4 104.58 KiB 1 33.3% ad11be76-b65a-486a-8b78-ccf911db4aeb rack1
UN 127.0.0.5 71.19 KiB 1 33.3% 76234ece-bf24-426a-8def-355239e8f17b rack1
UN 127.0.0.6 30.45 KiB 1 33.3% cca81c64-d3b9-47b8-ba03-46356133401b rack1
We can then create a test keyspace and populate it using cqlsh.
$ ccm node1 cqlsh
Connected to SINGLETOKEN at 127.0.0.1:9042.
[cqlsh 5.0.1 | Cassandra 3.11.9 | CQL spec 3.4.4 | Native protocol v4]
Use HELP for help.
cqlsh> CREATE KEYSPACE test_keyspace WITH REPLICATION = { 'class' : 'NetworkTopologyStrategy', 'datacenter1' : 3 };
cqlsh> CREATE TABLE test_keyspace.test_table (
... id int,
... value text,
... PRIMARY KEY (id));
cqlsh> CONSISTENCY LOCAL_QUORUM;
Consistency level set to LOCAL_QUORUM.
cqlsh> INSERT INTO test_keyspace.test_table (id, value) VALUES (1, 'foo');
cqlsh> INSERT INTO test_keyspace.test_table (id, value) VALUES (2, 'bar');
cqlsh> INSERT INTO test_keyspace.test_table (id, value) VALUES (3, 'net');
cqlsh> INSERT INTO test_keyspace.test_table (id, value) VALUES (4, 'moo');
cqlsh> INSERT INTO test_keyspace.test_table (id, value) VALUES (5, 'car');
cqlsh> INSERT INTO test_keyspace.test_table (id, value) VALUES (6, 'set');
To confirm that the cluster is perfectly balanced, we can check the token ring.
$ ccm node1 nodetool ring test_keyspace
Datacenter: datacenter1
==========
Address Rack Status State Load Owns Token
6148914691236517202
127.0.0.1 rack1 Up Normal 125.64 KiB 50.00% -9223372036854775808
127.0.0.2 rack1 Up Normal 125.31 KiB 50.00% -6148914691236517206
127.0.0.3 rack1 Up Normal 124.1 KiB 50.00% -3074457345618258604
127.0.0.4 rack1 Up Normal 104.01 KiB 50.00% -2
127.0.0.5 rack1 Up Normal 126.05 KiB 50.00% 3074457345618258600
127.0.0.6 rack1 Up Normal 120.76 KiB 50.00% 6148914691236517202
We can see in the “Owns” column all nodes have 50% ownership of the data. To make the example easier to follow we can manually add a letter representation next to each token number. So the token ranges could be represented in the following way:
$ ccm node1 nodetool ring test_keyspace
Datacenter: datacenter1
==========
Address Rack Status State Load Owns Token Token Letter
6148914691236517202 F
127.0.0.1 rack1 Up Normal 125.64 KiB 50.00% -9223372036854775808 A
127.0.0.2 rack1 Up Normal 125.31 KiB 50.00% -6148914691236517206 B
127.0.0.3 rack1 Up Normal 124.1 KiB 50.00% -3074457345618258604 C
127.0.0.4 rack1 Up Normal 104.01 KiB 50.00% -2 D
127.0.0.5 rack1 Up Normal 126.05 KiB 50.00% 3074457345618258600 E
127.0.0.6 rack1 Up Normal 120.76 KiB 50.00% 6148914691236517202 F
We can then capture the output of ccm node1 nodetool
describering test_keyspace
and change the token numbers to
the corresponding letters in the above token ring output.
$ ccm node1 nodetool describering test_keyspace
Schema Version:6256fe3f-a41e-34ac-ad76-82dba04d92c3
TokenRange:
TokenRange(start_token:A, end_token:B, endpoints:[127.0.0.2, 127.0.0.3, 127.0.0.4], rpc_endpoints:[127.0.0.2, 127.0.0.3, 127.0.0.4], endpoint_details:[EndpointDetails(host:127.0.0.2, datacenter:datacenter1, rack:rack1), EndpointDetails(host:127.0.0.3, datacenter:datacenter1, rack:rack1), EndpointDetails(host:127.0.0.4, datacenter:datacenter1, rack:rack1)])
TokenRange(start_token:C, end_token:D, endpoints:[127.0.0.4, 127.0.0.5, 127.0.0.6], rpc_endpoints:[127.0.0.4, 127.0.0.5, 127.0.0.6], endpoint_details:[EndpointDetails(host:127.0.0.4, datacenter:datacenter1, rack:rack1), EndpointDetails(host:127.0.0.5, datacenter:datacenter1, rack:rack1), EndpointDetails(host:127.0.0.6, datacenter:datacenter1, rack:rack1)])
TokenRange(start_token:B, end_token:C, endpoints:[127.0.0.3, 127.0.0.4, 127.0.0.5], rpc_endpoints:[127.0.0.3, 127.0.0.4, 127.0.0.5], endpoint_details:[EndpointDetails(host:127.0.0.3, datacenter:datacenter1, rack:rack1), EndpointDetails(host:127.0.0.4, datacenter:datacenter1, rack:rack1), EndpointDetails(host:127.0.0.5, datacenter:datacenter1, rack:rack1)])
TokenRange(start_token:D, end_token:E, endpoints:[127.0.0.5, 127.0.0.6, 127.0.0.1], rpc_endpoints:[127.0.0.5, 127.0.0.6, 127.0.0.1], endpoint_details:[EndpointDetails(host:127.0.0.5, datacenter:datacenter1, rack:rack1), EndpointDetails(host:127.0.0.6, datacenter:datacenter1, rack:rack1), EndpointDetails(host:127.0.0.1, datacenter:datacenter1, rack:rack1)])
TokenRange(start_token:F, end_token:A, endpoints:[127.0.0.1, 127.0.0.2, 127.0.0.3], rpc_endpoints:[127.0.0.1, 127.0.0.2, 127.0.0.3], endpoint_details:[EndpointDetails(host:127.0.0.1, datacenter:datacenter1, rack:rack1), EndpointDetails(host:127.0.0.2, datacenter:datacenter1, rack:rack1), EndpointDetails(host:127.0.0.3, datacenter:datacenter1, rack:rack1)])
TokenRange(start_token:E, end_token:F, endpoints:[127.0.0.6, 127.0.0.1, 127.0.0.2], rpc_endpoints:[127.0.0.6, 127.0.0.1, 127.0.0.2], endpoint_details:[EndpointDetails(host:127.0.0.6, datacenter:datacenter1, rack:rack1), EndpointDetails(host:127.0.0.1, datacenter:datacenter1, rack:rack1), EndpointDetails(host:127.0.0.2, datacenter:datacenter1, rack:rack1)])
Using the above output, specifically the end_token, we can determine all the token ranges assigned to each node. As mentioned earlier, the token range is defined by the values after the previous token (start_token) up to and including the assigned token (end_token). The token ranges assigned to each node looked like this:
In this setup, if node3 and node6 were unavailable, we would lose an entire replica. Even if the application is using a Consistency Level of LOCAL_QUORUM, all the data is still available. We still have two other replicas across the other four nodes.
Now let’s consider the case where our cluster is using vnodes. For example purposes we can set num_tokens to 3. This will give us a smaller number of tokens, making for an easier to follow example. After configuring and starting the nodes in ccm, our test cluster initially looked like this.
For the majority of production deployments where the cluster size is less than 500 nodes, it is recommended that you use a larger value for `num_tokens`. Further information can be found in the Apache Cassandra Production Recommendations.
$ ccm node1 nodetool status
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns (effective) Host ID Rack
UN 127.0.0.1 71.21 KiB 3 46.2% 7d30cbd4-8356-4189-8c94-0abe8e4d4d73 rack1
UN 127.0.0.2 66.04 KiB 3 37.5% 16bb0b37-2260-440c-ae2a-08cbf9192f85 rack1
UN 127.0.0.3 90.48 KiB 3 28.9% dc8c9dfd-cf5b-470c-836d-8391941a5a7e rack1
UN 127.0.0.4 104.64 KiB 3 20.7% 3eecfe2f-65c4-4f41-bbe4-4236bcdf5bd2 rack1
UN 127.0.0.5 66.09 KiB 3 36.1% 4d5adf9f-fe0d-49a0-8ab3-e1f5f9f8e0a2 rack1
UN 127.0.0.6 71.23 KiB 3 30.6% b41496e6-f391-471c-b3c4-6f56ed4442d6 rack1
Right off the blocks we can see signs that the cluster might be unbalanced. Similar to what we did with the single token cluster, here we create the test keyspace and populate it using cqlsh. We then grab a read out of the token ring to see what that looks like. Once again, to make the example easier to follow, we manually add a letter representation next to each token number.
$ ccm node1 nodetool ring test_keyspace
Datacenter: datacenter1
==========
Address Rack Status State Load Owns Token Token Letter
8828652533728408318 R
127.0.0.5 rack1 Up Normal 121.09 KiB 41.44% -7586808982694641609 A
127.0.0.1 rack1 Up Normal 126.49 KiB 64.03% -6737339388913371534 B
127.0.0.2 rack1 Up Normal 126.04 KiB 66.60% -5657740186656828604 C
127.0.0.3 rack1 Up Normal 135.71 KiB 39.89% -3714593062517416200 D
127.0.0.6 rack1 Up Normal 126.58 KiB 40.07% -2697218374613409116 E
127.0.0.1 rack1 Up Normal 126.49 KiB 64.03% -1044956249817882006 F
127.0.0.2 rack1 Up Normal 126.04 KiB 66.60% -877178609551551982 G
127.0.0.4 rack1 Up Normal 110.22 KiB 47.96% -852432543207202252 H
127.0.0.5 rack1 Up Normal 121.09 KiB 41.44% 117262867395611452 I
127.0.0.6 rack1 Up Normal 126.58 KiB 40.07% 762725591397791743 J
127.0.0.3 rack1 Up Normal 135.71 KiB 39.89% 1416289897444876127 K
127.0.0.1 rack1 Up Normal 126.49 KiB 64.03% 3730403440915368492 L
127.0.0.4 rack1 Up Normal 110.22 KiB 47.96% 4190414744358754863 M
127.0.0.2 rack1 Up Normal 126.04 KiB 66.60% 6904945895761639194 N
127.0.0.5 rack1 Up Normal 121.09 KiB 41.44% 7117770953638238964 O
127.0.0.4 rack1 Up Normal 110.22 KiB 47.96% 7764578023697676989 P
127.0.0.3 rack1 Up Normal 135.71 KiB 39.89% 8123167640761197831 Q
127.0.0.6 rack1 Up Normal 126.58 KiB 40.07% 8828652533728408318 R
As we can see from the “Owns” column above, there are some large token range ownership imbalances. The smallest token range ownership is by node 127.0.0.3 at 39.89%. The largest token range ownership is by node 127.0.0.2 at 66.6%. This is about 26% difference!
Once again, we capture the output of ccm node1 nodetool
describering test_keyspace
and change the token numbers to
the corresponding letters in the above token ring.
$ ccm node1 nodetool describering test_keyspace
Schema Version:4b2dc440-2e7c-33a4-aac6-ffea86cb0e21
TokenRange:
TokenRange(start_token:J, end_token:K, endpoints:[127.0.0.3, 127.0.0.1, 127.0.0.4], rpc_endpoints:[127.0.0.3, 127.0.0.1, 127.0.0.4], endpoint_details:[EndpointDetails(host:127.0.0.3, datacenter:datacenter1, rack:rack1), EndpointDetails(host:127.0.0.1, datacenter:datacenter1, rack:rack1), EndpointDetails(host:127.0.0.4, datacenter:datacenter1, rack:rack1)])
TokenRange(start_token:K, end_token:L, endpoints:[127.0.0.1, 127.0.0.4, 127.0.0.2], rpc_endpoints:[127.0.0.1, 127.0.0.4, 127.0.0.2], endpoint_details:[EndpointDetails(host:127.0.0.1, datacenter:datacenter1, rack:rack1), EndpointDetails(host:127.0.0.4, datacenter:datacenter1, rack:rack1), EndpointDetails(host:127.0.0.2, datacenter:datacenter1, rack:rack1)])
TokenRange(start_token:E, end_token:F, endpoints:[127.0.0.1, 127.0.0.2, 127.0.0.4], rpc_endpoints:[127.0.0.1, 127.0.0.2, 127.0.0.4], endpoint_details:[EndpointDetails(host:127.0.0.1, datacenter:datacenter1, rack:rack1), EndpointDetails(host:127.0.0.2, datacenter:datacenter1, rack:rack1), EndpointDetails(host:127.0.0.4, datacenter:datacenter1, rack:rack1)])
TokenRange(start_token:D, end_token:E, endpoints:[127.0.0.6, 127.0.0.1, 127.0.0.2], rpc_endpoints:[127.0.0.6, 127.0.0.1, 127.0.0.2], endpoint_details:[EndpointDetails(host:127.0.0.6, datacenter:datacenter1, rack:rack1), EndpointDetails(host:127.0.0.1, datacenter:datacenter1, rack:rack1), EndpointDetails(host:127.0.0.2, datacenter:datacenter1, rack:rack1)])
TokenRange(start_token:I, end_token:J, endpoints:[127.0.0.6, 127.0.0.3, 127.0.0.1], rpc_endpoints:[127.0.0.6, 127.0.0.3, 127.0.0.1], endpoint_details:[EndpointDetails(host:127.0.0.6, datacenter:datacenter1, rack:rack1), EndpointDetails(host:127.0.0.3, datacenter:datacenter1, rack:rack1), EndpointDetails(host:127.0.0.1, datacenter:datacenter1, rack:rack1)])
TokenRange(start_token:A, end_token:B, endpoints:[127.0.0.1, 127.0.0.2, 127.0.0.3], rpc_endpoints:[127.0.0.1, 127.0.0.2, 127.0.0.3], endpoint_details:[EndpointDetails(host:127.0.0.1, datacenter:datacenter1, rack:rack1), EndpointDetails(host:127.0.0.2, datacenter:datacenter1, rack:rack1), EndpointDetails(host:127.0.0.3, datacenter:datacenter1, rack:rack1)])
TokenRange(start_token:R, end_token:A, endpoints:[127.0.0.5, 127.0.0.1, 127.0.0.2], rpc_endpoints:[127.0.0.5, 127.0.0.1, 127.0.0.2], endpoint_details:[EndpointDetails(host:127.0.0.5, datacenter:datacenter1, rack:rack1), EndpointDetails(host:127.0.0.1, datacenter:datacenter1, rack:rack1), EndpointDetails(host:127.0.0.2, datacenter:datacenter1, rack:rack1)])
TokenRange(start_token:M, end_token:N, endpoints:[127.0.0.2, 127.0.0.5, 127.0.0.4], rpc_endpoints:[127.0.0.2, 127.0.0.5, 127.0.0.4], endpoint_details:[EndpointDetails(host:127.0.0.2, datacenter:datacenter1, rack:rack1), EndpointDetails(host:127.0.0.5, datacenter:datacenter1, rack:rack1), EndpointDetails(host:127.0.0.4, datacenter:datacenter1, rack:rack1)])
TokenRange(start_token:H, end_token:I, endpoints:[127.0.0.5, 127.0.0.6, 127.0.0.3], rpc_endpoints:[127.0.0.5, 127.0.0.6, 127.0.0.3], endpoint_details:[EndpointDetails(host:127.0.0.5, datacenter:datacenter1, rack:rack1), EndpointDetails(host:127.0.0.6, datacenter:datacenter1, rack:rack1), EndpointDetails(host:127.0.0.3, datacenter:datacenter1, rack:rack1)])
TokenRange(start_token:L, end_token:M, endpoints:[127.0.0.4, 127.0.0.2, 127.0.0.5], rpc_endpoints:[127.0.0.4, 127.0.0.2, 127.0.0.5], endpoint_details:[EndpointDetails(host:127.0.0.4, datacenter:datacenter1, rack:rack1), EndpointDetails(host:127.0.0.2, datacenter:datacenter1, rack:rack1), EndpointDetails(host:127.0.0.5, datacenter:datacenter1, rack:rack1)])
TokenRange(start_token:N, end_token:O, endpoints:[127.0.0.5, 127.0.0.4, 127.0.0.3], rpc_endpoints:[127.0.0.5, 127.0.0.4, 127.0.0.3], endpoint_details:[EndpointDetails(host:127.0.0.5, datacenter:datacenter1, rack:rack1), EndpointDetails(host:127.0.0.4, datacenter:datacenter1, rack:rack1), EndpointDetails(host:127.0.0.3, datacenter:datacenter1, rack:rack1)])
TokenRange(start_token:P, end_token:Q, endpoints:[127.0.0.3, 127.0.0.6, 127.0.0.5], rpc_endpoints:[127.0.0.3, 127.0.0.6, 127.0.0.5], endpoint_details:[EndpointDetails(host:127.0.0.3, datacenter:datacenter1, rack:rack1), EndpointDetails(host:127.0.0.6, datacenter:datacenter1, rack:rack1), EndpointDetails(host:127.0.0.5, datacenter:datacenter1, rack:rack1)])
TokenRange(start_token:Q, end_token:R, endpoints:[127.0.0.6, 127.0.0.5, 127.0.0.1], rpc_endpoints:[127.0.0.6, 127.0.0.5, 127.0.0.1], endpoint_details:[EndpointDetails(host:127.0.0.6, datacenter:datacenter1, rack:rack1), EndpointDetails(host:127.0.0.5, datacenter:datacenter1, rack:rack1), EndpointDetails(host:127.0.0.1, datacenter:datacenter1, rack:rack1)])
TokenRange(start_token:F, end_token:G, endpoints:[127.0.0.2, 127.0.0.4, 127.0.0.5], rpc_endpoints:[127.0.0.2, 127.0.0.4, 127.0.0.5], endpoint_details:[EndpointDetails(host:127.0.0.2, datacenter:datacenter1, rack:rack1), EndpointDetails(host:127.0.0.4, datacenter:datacenter1, rack:rack1), EndpointDetails(host:127.0.0.5, datacenter:datacenter1, rack:rack1)])
TokenRange(start_token:C, end_token:D, endpoints:[127.0.0.3, 127.0.0.6, 127.0.0.1], rpc_endpoints:[127.0.0.3, 127.0.0.6, 127.0.0.1], endpoint_details:[EndpointDetails(host:127.0.0.3, datacenter:datacenter1, rack:rack1), EndpointDetails(host:127.0.0.6, datacenter:datacenter1, rack:rack1), EndpointDetails(host:127.0.0.1, datacenter:datacenter1, rack:rack1)])
TokenRange(start_token:G, end_token:H, endpoints:[127.0.0.4, 127.0.0.5, 127.0.0.6], rpc_endpoints:[127.0.0.4, 127.0.0.5, 127.0.0.6], endpoint_details:[EndpointDetails(host:127.0.0.4, datacenter:datacenter1, rack:rack1), EndpointDetails(host:127.0.0.5, datacenter:datacenter1, rack:rack1), EndpointDetails(host:127.0.0.6, datacenter:datacenter1, rack:rack1)])
TokenRange(start_token:B, end_token:C, endpoints:[127.0.0.2, 127.0.0.3, 127.0.0.6], rpc_endpoints:[127.0.0.2, 127.0.0.3, 127.0.0.6], endpoint_details:[EndpointDetails(host:127.0.0.2, datacenter:datacenter1, rack:rack1), EndpointDetails(host:127.0.0.3, datacenter:datacenter1, rack:rack1), EndpointDetails(host:127.0.0.6, datacenter:datacenter1, rack:rack1)])
TokenRange(start_token:O, end_token:P, endpoints:[127.0.0.4, 127.0.0.3, 127.0.0.6], rpc_endpoints:[127.0.0.4, 127.0.0.3, 127.0.0.6], endpoint_details:[EndpointDetails(host:127.0.0.4, datacenter:datacenter1, rack:rack1), EndpointDetails(host:127.0.0.3, datacenter:datacenter1, rack:rack1), EndpointDetails(host:127.0.0.6, datacenter:datacenter1, rack:rack1)])
Finally, from this output we can determine all the token ranges assigned to each node.
Using this, we can see what happens if we have the same outage the single token cluster did, that is, node3 and node6 being unavailable. As we can see, node3 and node6 are both responsible for tokens C, D, I, J, P, and Q. Hence, data associated with those tokens would be unavailable if our application uses a Consistency Level of LOCAL_QUORUM. To put that in different terms, unlike with our single token cluster, in this case 33.3% of our data could no longer be retrieved.
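To make that arithmetic concrete, here is a small Python sketch (my own illustration, not part of the original test) that replays the outage against the replica sets transcribed from the describering output above, with each range named by its end-token letter and each node by the last octet of its address:

# Replica sets per token range, transcribed from the describering output above.
# Ranges are named by their end-token letter; nodes by the last octet of their address.
REPLICAS = {
    "A": {5, 1, 2}, "B": {1, 2, 3}, "C": {2, 3, 6}, "D": {3, 6, 1},
    "E": {6, 1, 2}, "F": {1, 2, 4}, "G": {2, 4, 5}, "H": {4, 5, 6},
    "I": {5, 6, 3}, "J": {6, 3, 1}, "K": {3, 1, 4}, "L": {1, 4, 2},
    "M": {4, 2, 5}, "N": {2, 5, 4}, "O": {5, 4, 3}, "P": {4, 3, 6},
    "Q": {3, 6, 5}, "R": {6, 5, 1},
}

def unavailable_ranges(down, quorum=2):
    """Ranges that can no longer satisfy LOCAL_QUORUM (RF=3 needs 2 live replicas)."""
    return sorted(r for r, nodes in REPLICAS.items() if len(nodes - down) < quorum)

lost = unavailable_ranges(down={3, 6})
print(lost, f"{len(lost) / len(REPLICAS):.1%}")  # ['C', 'D', 'I', 'J', 'P', 'Q'] 33.3%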
Rack ‘em up
A seasoned Cassandra operator will notice that so far we have run our token distribution tests on clusters with only a single rack. To help increase availability when using vnodes, racks can be deployed. When racks are used, Cassandra will try to place a single replica in each rack; that is, it will try to ensure that no two replicas of the same token range appear in the same rack.
The key here is to configure the cluster so that for a given datacenter the number of racks is the same as the replication factor.
Let’s retry our previous example where we set num_tokens to 3, only this time we’ll define three racks in the test cluster. After configuring and starting the nodes in ccm, our newly configured test cluster initially looks like this:
$ ccm node1 nodetool status
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns (effective) Host ID Rack
UN 127.0.0.1 71.08 KiB 3 31.8% 49df615d-bfe5-46ce-a8dd-4748c086f639 rack1
UN 127.0.0.2 71.04 KiB 3 34.4% 3fef187e-00f5-476d-b31f-7aa03e9d813c rack2
UN 127.0.0.3 66.04 KiB 3 37.3% c6a0a5f4-91f8-4bd1-b814-1efc3dae208f rack3
UN 127.0.0.4 109.79 KiB 3 52.9% 74ac0727-c03b-476b-8f52-38c154cfc759 rack1
UN 127.0.0.5 66.09 KiB 3 18.7% 5153bad4-07d7-4a24-8066-0189084bbc80 rack2
UN 127.0.0.6 66.09 KiB 3 25.0% 6693214b-a599-4f58-b1b4-a6cf0dd684ba rack3
We can still see signs that the cluster might be unbalanced. This is a side issue, as the main point to take from the above is that we now have three racks defined in the cluster with two nodes assigned to each. Once again, similar to the single token cluster, we can create the test keyspace and populate it using cqlsh. We then grab a readout of the token ring to see what it looks like. As in the previous tests, to make the example easier to follow, we manually add a letter representation next to each token number.
$ ccm node1 nodetool ring test_keyspace
Datacenter: datacenter1
==========
Address Rack Status State Load Owns Token Token Letter
8993942771016137629 R
127.0.0.5 rack2 Up Normal 122.42 KiB 34.65% -8459555739932651620 A
127.0.0.4 rack1 Up Normal 111.07 KiB 53.84% -8458588239787937390 B
127.0.0.3 rack3 Up Normal 116.12 KiB 60.72% -8347996802899210689 C
127.0.0.1 rack1 Up Normal 121.31 KiB 46.16% -5712162437894176338 D
127.0.0.4 rack1 Up Normal 111.07 KiB 53.84% -2744262056092270718 E
127.0.0.6 rack3 Up Normal 122.39 KiB 39.28% -2132400046698162304 F
127.0.0.2 rack2 Up Normal 121.42 KiB 65.35% -1232974565497331829 G
127.0.0.4 rack1 Up Normal 111.07 KiB 53.84% 1026323925278501795 H
127.0.0.2 rack2 Up Normal 121.42 KiB 65.35% 3093888090255198737 I
127.0.0.2 rack2 Up Normal 121.42 KiB 65.35% 3596129656253861692 J
127.0.0.3 rack3 Up Normal 116.12 KiB 60.72% 3674189467337391158 K
127.0.0.5 rack2 Up Normal 122.42 KiB 34.65% 3846303495312788195 L
127.0.0.1 rack1 Up Normal 121.31 KiB 46.16% 4699181476441710984 M
127.0.0.1 rack1 Up Normal 121.31 KiB 46.16% 6795515568417945696 N
127.0.0.3 rack3 Up Normal 116.12 KiB 60.72% 7964270297230943708 O
127.0.0.5 rack2 Up Normal 122.42 KiB 34.65% 8105847793464083809 P
127.0.0.6 rack3 Up Normal 122.39 KiB 39.28% 8813162133522758143 Q
127.0.0.6 rack3 Up Normal 122.39 KiB 39.28% 8993942771016137629 R
Once again we capture the output of ccm node1 nodetool describering test_keyspace and change the token numbers to the corresponding letters in the above token ring.
$ ccm node1 nodetool describering test_keyspace
Schema Version:aff03498-f4c1-3be1-b133-25503becf208
TokenRange:
TokenRange(start_token:B, end_token:C, endpoints:[127.0.0.3, 127.0.0.1, 127.0.0.2], rpc_endpoints:[127.0.0.3, 127.0.0.1, 127.0.0.2], endpoint_details:[EndpointDetails(host:127.0.0.3, datacenter:datacenter1, rack:rack3), EndpointDetails(host:127.0.0.1, datacenter:datacenter1, rack:rack1), EndpointDetails(host:127.0.0.2, datacenter:datacenter1, rack:rack2)])
TokenRange(start_token:L, end_token:M, endpoints:[127.0.0.1, 127.0.0.3, 127.0.0.5], rpc_endpoints:[127.0.0.1, 127.0.0.3, 127.0.0.5], endpoint_details:[EndpointDetails(host:127.0.0.1, datacenter:datacenter1, rack:rack1), EndpointDetails(host:127.0.0.3, datacenter:datacenter1, rack:rack3), EndpointDetails(host:127.0.0.5, datacenter:datacenter1, rack:rack2)])
TokenRange(start_token:N, end_token:O, endpoints:[127.0.0.3, 127.0.0.5, 127.0.0.4], rpc_endpoints:[127.0.0.3, 127.0.0.5, 127.0.0.4], endpoint_details:[EndpointDetails(host:127.0.0.3, datacenter:datacenter1, rack:rack3), EndpointDetails(host:127.0.0.5, datacenter:datacenter1, rack:rack2), EndpointDetails(host:127.0.0.4, datacenter:datacenter1, rack:rack1)])
TokenRange(start_token:P, end_token:Q, endpoints:[127.0.0.6, 127.0.0.5, 127.0.0.4], rpc_endpoints:[127.0.0.6, 127.0.0.5, 127.0.0.4], endpoint_details:[EndpointDetails(host:127.0.0.6, datacenter:datacenter1, rack:rack3), EndpointDetails(host:127.0.0.5, datacenter:datacenter1, rack:rack2), EndpointDetails(host:127.0.0.4, datacenter:datacenter1, rack:rack1)])
TokenRange(start_token:K, end_token:L, endpoints:[127.0.0.5, 127.0.0.1, 127.0.0.3], rpc_endpoints:[127.0.0.5, 127.0.0.1, 127.0.0.3], endpoint_details:[EndpointDetails(host:127.0.0.5, datacenter:datacenter1, rack:rack2), EndpointDetails(host:127.0.0.1, datacenter:datacenter1, rack:rack1), EndpointDetails(host:127.0.0.3, datacenter:datacenter1, rack:rack3)])
TokenRange(start_token:R, end_token:A, endpoints:[127.0.0.5, 127.0.0.4, 127.0.0.3], rpc_endpoints:[127.0.0.5, 127.0.0.4, 127.0.0.3], endpoint_details:[EndpointDetails(host:127.0.0.5, datacenter:datacenter1, rack:rack2), EndpointDetails(host:127.0.0.4, datacenter:datacenter1, rack:rack1), EndpointDetails(host:127.0.0.3, datacenter:datacenter1, rack:rack3)])
TokenRange(start_token:I, end_token:J, endpoints:[127.0.0.2, 127.0.0.3, 127.0.0.1], rpc_endpoints:[127.0.0.2, 127.0.0.3, 127.0.0.1], endpoint_details:[EndpointDetails(host:127.0.0.2, datacenter:datacenter1, rack:rack2), EndpointDetails(host:127.0.0.3, datacenter:datacenter1, rack:rack3), EndpointDetails(host:127.0.0.1, datacenter:datacenter1, rack:rack1)])
TokenRange(start_token:Q, end_token:R, endpoints:[127.0.0.6, 127.0.0.5, 127.0.0.4], rpc_endpoints:[127.0.0.6, 127.0.0.5, 127.0.0.4], endpoint_details:[EndpointDetails(host:127.0.0.6, datacenter:datacenter1, rack:rack3), EndpointDetails(host:127.0.0.5, datacenter:datacenter1, rack:rack2), EndpointDetails(host:127.0.0.4, datacenter:datacenter1, rack:rack1)])
TokenRange(start_token:E, end_token:F, endpoints:[127.0.0.6, 127.0.0.2, 127.0.0.4], rpc_endpoints:[127.0.0.6, 127.0.0.2, 127.0.0.4], endpoint_details:[EndpointDetails(host:127.0.0.6, datacenter:datacenter1, rack:rack3), EndpointDetails(host:127.0.0.2, datacenter:datacenter1, rack:rack2), EndpointDetails(host:127.0.0.4, datacenter:datacenter1, rack:rack1)])
TokenRange(start_token:H, end_token:I, endpoints:[127.0.0.2, 127.0.0.3, 127.0.0.1], rpc_endpoints:[127.0.0.2, 127.0.0.3, 127.0.0.1], endpoint_details:[EndpointDetails(host:127.0.0.2, datacenter:datacenter1, rack:rack2), EndpointDetails(host:127.0.0.3, datacenter:datacenter1, rack:rack3), EndpointDetails(host:127.0.0.1, datacenter:datacenter1, rack:rack1)])
TokenRange(start_token:D, end_token:E, endpoints:[127.0.0.4, 127.0.0.6, 127.0.0.2], rpc_endpoints:[127.0.0.4, 127.0.0.6, 127.0.0.2], endpoint_details:[EndpointDetails(host:127.0.0.4, datacenter:datacenter1, rack:rack1), EndpointDetails(host:127.0.0.6, datacenter:datacenter1, rack:rack3), EndpointDetails(host:127.0.0.2, datacenter:datacenter1, rack:rack2)])
TokenRange(start_token:A, end_token:B, endpoints:[127.0.0.4, 127.0.0.3, 127.0.0.2], rpc_endpoints:[127.0.0.4, 127.0.0.3, 127.0.0.2], endpoint_details:[EndpointDetails(host:127.0.0.4, datacenter:datacenter1, rack:rack1), EndpointDetails(host:127.0.0.3, datacenter:datacenter1, rack:rack3), EndpointDetails(host:127.0.0.2, datacenter:datacenter1, rack:rack2)])
TokenRange(start_token:C, end_token:D, endpoints:[127.0.0.1, 127.0.0.6, 127.0.0.2], rpc_endpoints:[127.0.0.1, 127.0.0.6, 127.0.0.2], endpoint_details:[EndpointDetails(host:127.0.0.1, datacenter:datacenter1, rack:rack1), EndpointDetails(host:127.0.0.6, datacenter:datacenter1, rack:rack3), EndpointDetails(host:127.0.0.2, datacenter:datacenter1, rack:rack2)])
TokenRange(start_token:F, end_token:G, endpoints:[127.0.0.2, 127.0.0.4, 127.0.0.3], rpc_endpoints:[127.0.0.2, 127.0.0.4, 127.0.0.3], endpoint_details:[EndpointDetails(host:127.0.0.2, datacenter:datacenter1, rack:rack2), EndpointDetails(host:127.0.0.4, datacenter:datacenter1, rack:rack1), EndpointDetails(host:127.0.0.3, datacenter:datacenter1, rack:rack3)])
TokenRange(start_token:O, end_token:P, endpoints:[127.0.0.5, 127.0.0.6, 127.0.0.4], rpc_endpoints:[127.0.0.5, 127.0.0.6, 127.0.0.4], endpoint_details:[EndpointDetails(host:127.0.0.5, datacenter:datacenter1, rack:rack2), EndpointDetails(host:127.0.0.6, datacenter:datacenter1, rack:rack3), EndpointDetails(host:127.0.0.4, datacenter:datacenter1, rack:rack1)])
TokenRange(start_token:J, end_token:K, endpoints:[127.0.0.3, 127.0.0.5, 127.0.0.1], rpc_endpoints:[127.0.0.3, 127.0.0.5, 127.0.0.1], endpoint_details:[EndpointDetails(host:127.0.0.3, datacenter:datacenter1, rack:rack3), EndpointDetails(host:127.0.0.5, datacenter:datacenter1, rack:rack2), EndpointDetails(host:127.0.0.1, datacenter:datacenter1, rack:rack1)])
TokenRange(start_token:G, end_token:H, endpoints:[127.0.0.4, 127.0.0.2, 127.0.0.3], rpc_endpoints:[127.0.0.4, 127.0.0.2, 127.0.0.3], endpoint_details:[EndpointDetails(host:127.0.0.4, datacenter:datacenter1, rack:rack1), EndpointDetails(host:127.0.0.2, datacenter:datacenter1, rack:rack2), EndpointDetails(host:127.0.0.3, datacenter:datacenter1, rack:rack3)])
TokenRange(start_token:M, end_token:N, endpoints:[127.0.0.1, 127.0.0.3, 127.0.0.5], rpc_endpoints:[127.0.0.1, 127.0.0.3, 127.0.0.5], endpoint_details:[EndpointDetails(host:127.0.0.1, datacenter:datacenter1, rack:rack1), EndpointDetails(host:127.0.0.3, datacenter:datacenter1, rack:rack3), EndpointDetails(host:127.0.0.5, datacenter:datacenter1, rack:rack2)])
Lastly, we once again determine all the token ranges assigned to each node.
As we can see from the way Cassandra has assigned the tokens, there is now a complete data replica spread across the two nodes in each of our three racks. If we go back to our failure scenario where node3 and node6 become unavailable, we can still service queries using a Consistency Level of LOCAL_QUORUM. The only elephant in the room here is that node3 has a lot more tokens distributed to it than the other nodes. Its counterpart in the same rack, node6, is at the opposite end, with the fewest tokens allocated to it.
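The same style of check (again, my own sketch rather than anything from the original test) confirms both observations: every range has one replica in each rack, so losing both rack3 nodes never removes more than one replica from any range, and node3 carries noticeably more ranges than its rack-mate node6:

RACKS = {1: "rack1", 2: "rack2", 3: "rack3", 4: "rack1", 5: "rack2", 6: "rack3"}
# Replica sets per range, transcribed from the rack-aware describering output above.
REPLICAS = {
    "A": {5, 4, 3}, "B": {4, 3, 2}, "C": {3, 1, 2}, "D": {1, 6, 2},
    "E": {4, 6, 2}, "F": {6, 2, 4}, "G": {2, 4, 3}, "H": {4, 2, 3},
    "I": {2, 3, 1}, "J": {2, 3, 1}, "K": {3, 5, 1}, "L": {5, 1, 3},
    "M": {1, 3, 5}, "N": {1, 3, 5}, "O": {3, 5, 4}, "P": {5, 6, 4},
    "Q": {6, 5, 4}, "R": {6, 5, 4},
}

# One replica per rack for every range, so LOCAL_QUORUM survives the node3 + node6 outage.
assert all(len({RACKS[n] for n in nodes}) == 3 for nodes in REPLICAS.values())
assert all(len(nodes - {3, 6}) >= 2 for nodes in REPLICAS.values())

# Ranges replicated per node: node3 holds the most, node6 the fewest.
counts = {n: sum(n in nodes for nodes in REPLICAS.values()) for n in RACKS}
print(counts)  # node3 -> 12 ranges, node6 -> 6 ranges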
Too many vnodes spoil the cluster
Given the token distribution issues with a low number of vnodes, one would think the best option is to use a large vnode value. However, apart from having a higher chance of some data being unavailable in a multi-node outage, large vnode values also impact streaming operations. To repair data on a node, Cassandra will start one repair session per vnode. These repair sessions need to be processed sequentially. Hence, the larger the vnode value, the longer the repairs take and the greater the overhead needed to run them.
In an effort to fix slow repair times as a result of large vnode values, CASSANDRA-5220 was introduced in 3.0. This change allows Cassandra to group common token ranges for a set of nodes into a single repair session. It increased the size of the repair session as multiple token ranges were being repaired, but reduced the number of repair sessions being executed in parallel.
We can see the effect that vnodes have on repair by running a simple test on a cluster backed by real hardware. To do this test, we first create a cluster that uses single tokens and run a repair. Then we create the same cluster except with 256 vnodes, and run the same repair. We will use tlp-cluster to create a Cassandra cluster in AWS with the following properties.
- Instance size: i3.2xlarge
- Node count: 12
- Rack count: 3 (4 nodes per rack)
- Cassandra version: 3.11.9 (latest stable release at the time of writing)
The commands to build this cluster are as follows.
$ tlp-cluster init --azs a,b,c --cassandra 12 --instance i3.2xlarge --stress 1 TLP BLOG "Blogpost repair testing"
$ tlp-cluster up
$ tlp-cluster use --config "cluster_name:SingleToken" --config "num_tokens:1" 3.11.9
$ tlp-cluster install
Once we provision the hardware we set the initial_token property for each of the nodes individually. We can calculate the initial tokens for each node using a simple Python command.
Python 2.7.16 (default, Nov 23 2020, 08:01:20)
[GCC Apple LLVM 12.0.0 (clang-1200.0.30.4)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> num_tokens = 1
>>> num_nodes = 12
>>> print("\n".join(['[Node {}] initial_token: {}'.format(n + 1, ','.join([str(((2**64 / (num_tokens * num_nodes)) * (t * num_nodes + n)) - 2**63) for t in range(num_tokens)])) for n in range(num_nodes)]))
[Node 1] initial_token: -9223372036854775808
[Node 2] initial_token: -7686143364045646507
[Node 3] initial_token: -6148914691236517206
[Node 4] initial_token: -4611686018427387905
[Node 5] initial_token: -3074457345618258604
[Node 6] initial_token: -1537228672809129303
[Node 7] initial_token: -2
[Node 8] initial_token: 1537228672809129299
[Node 9] initial_token: 3074457345618258600
[Node 10] initial_token: 4611686018427387901
[Node 11] initial_token: 6148914691236517202
[Node 12] initial_token: 7686143364045646503
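For anyone on Python 3, where / performs float division, the same calculation can be written as a small function using integer division. This is just a reformulation of the one-liner above, not part of the original recipe:

def initial_tokens(num_nodes, num_tokens=1):
    """Evenly spaced Murmur3 tokens in the range [-2**63, 2**63)."""
    spacing = 2 ** 64 // (num_tokens * num_nodes)
    return {
        node + 1: [spacing * (t * num_nodes + node) - 2 ** 63 for t in range(num_tokens)]
        for node in range(num_nodes)
    }

for node, tokens in initial_tokens(num_nodes=12).items():
    print("[Node {}] initial_token: {}".format(node, ",".join(map(str, tokens))))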
After starting Cassandra on all the nodes, around 3 GB of data per node can be preloaded using the following tlp-stress command. In this command we set our keyspace replication factor to 3 and set gc_grace_seconds to 0. This is done to make hints expire immediately when they are created, which ensures they are never delivered to the destination node.
ubuntu@ip-172-31-19-180:~$ tlp-stress run KeyValue --replication "{'class': 'NetworkTopologyStrategy', 'us-west-2':3 }" --cql "ALTER TABLE tlp_stress.keyvalue WITH gc_grace_seconds = 0" --reads 1 --partitions 100M --populate 100M --iterations 1
Upon completion of the data loading, the cluster status looks like this.
ubuntu@ip-172-31-30-95:~$ nodetool status
Datacenter: us-west-2
=====================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns (effective) Host ID Rack
UN 172.31.30.95 2.78 GiB 1 25.0% 6640c7b9-c026-4496-9001-9d79bea7e8e5 2a
UN 172.31.31.106 2.79 GiB 1 25.0% ceaf9d56-3a62-40be-bfeb-79a7f7ade402 2a
UN 172.31.2.74 2.78 GiB 1 25.0% 4a90b071-830e-4dfe-9d9d-ab4674be3507 2c
UN 172.31.39.56 2.79 GiB 1 25.0% 37fd3fe0-598b-428f-a84b-c27fc65ee7d5 2b
UN 172.31.31.184 2.78 GiB 1 25.0% 40b4e538-476a-4f20-a012-022b10f257e9 2a
UN 172.31.10.87 2.79 GiB 1 25.0% fdccabef-53a9-475b-9131-b73c9f08a180 2c
UN 172.31.18.118 2.79 GiB 1 25.0% b41ab8fe-45e7-4628-94f0-a4ec3d21f8d0 2a
UN 172.31.35.4 2.79 GiB 1 25.0% 246bf6d8-8deb-42fe-bd11-05cca8f880d7 2b
UN 172.31.40.147 2.79 GiB 1 25.0% bdd3dd61-bb6a-4849-a7a6-b60a2b8499f6 2b
UN 172.31.13.226 2.79 GiB 1 25.0% d0389979-c38f-41e5-9836-5a7539b3d757 2c
UN 172.31.5.192 2.79 GiB 1 25.0% b0031ef9-de9f-4044-a530-ffc67288ebb6 2c
UN 172.31.33.0 2.79 GiB 1 25.0% da612776-4018-4cb7-afd5-79758a7b9cf8 2b
We can then run a full repair on each node using the following commands.
$ source env.sh
$ c_all "nodetool repair -full tlp_stress"
The repair times recorded for each node were as follows:
[2021-01-22 20:20:13,952] Repair command #1 finished in 3 minutes 55 seconds
[2021-01-22 20:23:57,053] Repair command #1 finished in 3 minutes 36 seconds
[2021-01-22 20:27:42,123] Repair command #1 finished in 3 minutes 32 seconds
[2021-01-22 20:30:57,654] Repair command #1 finished in 3 minutes 21 seconds
[2021-01-22 20:34:27,740] Repair command #1 finished in 3 minutes 17 seconds
[2021-01-22 20:37:40,449] Repair command #1 finished in 3 minutes 23 seconds
[2021-01-22 20:41:32,391] Repair command #1 finished in 3 minutes 36 seconds
[2021-01-22 20:44:52,917] Repair command #1 finished in 3 minutes 25 seconds
[2021-01-22 20:47:57,729] Repair command #1 finished in 2 minutes 58 seconds
[2021-01-22 20:49:58,868] Repair command #1 finished in 1 minute 58 seconds
[2021-01-22 20:51:58,724] Repair command #1 finished in 1 minute 53 seconds
[2021-01-22 20:54:01,100] Repair command #1 finished in 1 minute 50 seconds
These times give us a total repair time of 36 minutes and 44 seconds.
The same cluster can be reused to test repair times when 256 vnodes are used. To do this we execute the following steps.
- Shut down Cassandra on all the nodes.
- Delete the contents of the data, commitlog, hints, and saved_caches directories (these are located in /var/lib/cassandra/ on each node).
- Set num_tokens in the cassandra.yaml configuration file to a value of 256 and remove the initial_token setting.
- Start up Cassandra on all the nodes.
After populating the cluster with data, its status looked like this.
ubuntu@ip-172-31-30-95:~$ nodetool status
Datacenter: us-west-2
=====================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns (effective) Host ID Rack
UN 172.31.30.95 2.79 GiB 256 24.3% 10b0a8b5-aaa6-4528-9d14-65887a9b0b9c 2a
UN 172.31.2.74 2.81 GiB 256 24.4% a748964d-0460-4f86-907d-a78edae2a2cb 2c
UN 172.31.31.106 3.1 GiB 256 26.4% 1fc68fbd-335d-4689-83b9-d62cca25c88a 2a
UN 172.31.31.184 2.78 GiB 256 23.9% 8a1b25e7-d2d8-4471-aa76-941c2556cc30 2a
UN 172.31.39.56 2.73 GiB 256 23.5% 3642a964-5d21-44f9-b330-74c03e017943 2b
UN 172.31.10.87 2.95 GiB 256 25.4% 540a38f5-ad05-4636-8768-241d85d88107 2c
UN 172.31.18.118 2.99 GiB 256 25.4% 41b9f16e-6e71-4631-9794-9321a6e875bd 2a
UN 172.31.35.4 2.96 GiB 256 25.6% 7f62d7fd-b9c2-46cf-89a1-83155feebb70 2b
UN 172.31.40.147 3.26 GiB 256 27.4% e17fd867-2221-4fb5-99ec-5b33981a05ef 2b
UN 172.31.13.226 2.91 GiB 256 25.0% 4ef69969-d9fe-4336-9618-359877c4b570 2c
UN 172.31.33.0 2.74 GiB 256 23.6% 298ab053-0c29-44ab-8a0a-8dde03b4f125 2b
UN 172.31.5.192 2.93 GiB 256 25.2% 7c690640-24df-4345-aef3-dacd6643d6c0 2c
When we ran the same repair test on the vnode cluster, the following repair times were recorded.
[2021-01-22 22:45:56,689] Repair command #1 finished in 4 minutes 40 seconds
[2021-01-22 22:50:09,170] Repair command #1 finished in 4 minutes 6 seconds
[2021-01-22 22:54:04,820] Repair command #1 finished in 3 minutes 43 seconds
[2021-01-22 22:57:26,193] Repair command #1 finished in 3 minutes 27 seconds
[2021-01-22 23:01:23,554] Repair command #1 finished in 3 minutes 44 seconds
[2021-01-22 23:04:40,523] Repair command #1 finished in 3 minutes 27 seconds
[2021-01-22 23:08:20,231] Repair command #1 finished in 3 minutes 23 seconds
[2021-01-22 23:11:01,230] Repair command #1 finished in 2 minutes 45 seconds
[2021-01-22 23:13:48,682] Repair command #1 finished in 2 minutes 40 seconds
[2021-01-22 23:16:23,630] Repair command #1 finished in 2 minutes 32 seconds
[2021-01-22 23:18:56,786] Repair command #1 finished in 2 minutes 26 seconds
[2021-01-22 23:21:38,961] Repair command #1 finished in 2 minutes 30 seconds
These times give us a total repair time of 39 minutes and 23 seconds.
While the time difference is quite small for 3 GB of data per node (up to an additional 45 seconds per node), it is easy to see how the difference could balloon out when we have data sizes in the order of hundreds of gigabytes per node.
Unfortunately, all data streaming operations like bootstrap and datacenter rebuild fall victim to the same issue repairs have with large vnode values. Specifically, when a node needs to stream data to another node a streaming session is opened for each token range on the node. This results in a lot of unnecessary overhead, as data is transferred via the JVM.
Secondary indexes impacted too
To add insult to injury, the negative effect of large vnode values extends to secondary indexes because of the way the read path works.
When a coordinator node receives a secondary index request from a client, it fans out the request to all the nodes in the cluster or datacenter depending on the locality of the consistency level. Each node then checks the SSTables for each of the token ranges assigned to it for a match to the secondary index query. Matches to the query are then returned to the coordinator node.
Hence, the larger the number of vnodes, the larger the impact to the responsiveness of the secondary index query. Furthermore, the performance impacts on secondary indexes grow exponentially with the number of replicas in the cluster. In a scenario where multiple datacenters have nodes using many vnodes, secondary indexes become even more inefficient.
A new hope
So what we are left with, then, is a property in Cassandra that really hits the mark in terms of reducing the complexity of resizing a cluster. Unfortunately, its benefits come at the expense of unbalanced token ranges at one end, and degraded operations performance at the other. That being said, the vnodes story is far from over.
Eventually, it became a well-known fact in the Apache Cassandra project that large vnode values had undesirable side effects on a cluster. To combat this issue, clever contributors and committers added CASSANDRA-7032 in 3.0: a replica-aware token allocation algorithm. The idea was to allow a low value to be used for num_tokens while maintaining relatively evenly balanced token ranges. The enhancement includes the addition of the allocate_tokens_for_keyspace setting in the cassandra.yaml file. The new algorithm is used instead of the random token allocator when an existing user keyspace is assigned to the allocate_tokens_for_keyspace setting.
Behind the scenes, Cassandra takes the replication factor of the defined keyspace and uses it when calculating the token values for a node when it first enters the cluster. Unlike the random token generator, the replica-aware generator is like an experienced member of a symphony orchestra: sophisticated and in tune with its surroundings. So much so that the process it uses to generate token ranges involves the following steps (a simplified sketch of the midpoint-splitting idea appears after the list):
- Constructing an initial token ring state.
- Computing candidates for new tokens by splitting all existing token ranges right in the middle.
- Evaluating the expected improvements from all candidates and forming a priority queue.
- Iterating through the candidates in the queue and selecting the best combination.
- Re-evaluating the candidate improvements in the queue during token selection.
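The sketch below is my own toy illustration of just the midpoint-splitting idea from steps 2–4, not the CASSANDRA-7032 code: candidate tokens are the midpoints of the existing ranges, and the candidate that most evens out range sizes is chosen. The real algorithm additionally scores candidates by replicated ownership per node, which is ignored here.

RING = 2 ** 64
MIN_TOKEN = -2 ** 63

def range_sizes(tokens):
    """Width of each range (previous_token, token], walking the ring clockwise."""
    ts = sorted(tokens)
    return [(b - a) % RING or RING for a, b in zip([ts[-1]] + ts[:-1], ts)]

def candidates(tokens):
    """Step 2: split every existing range right in the middle."""
    ts = sorted(tokens)
    mids = []
    for a, b in zip([ts[-1]] + ts[:-1], ts):
        width = (b - a) % RING or RING
        mids.append((a - MIN_TOKEN + width // 2) % RING + MIN_TOKEN)
    return mids

def allocate(tokens, num_new):
    """Steps 3-4 (simplified): greedily pick the candidate that minimises imbalance."""
    tokens = list(tokens)
    for _ in range(num_new):
        imbalance = lambda c: max(range_sizes(tokens + [c])) - min(range_sizes(tokens + [c]))
        tokens.append(min(candidates(tokens), key=imbalance))
    return sorted(tokens)

# Example: allocate 4 new tokens into a deliberately uneven four-token ring.
print(allocate([-6 * 10**18, -10**18, 3 * 10**18, 7 * 10**18], num_new=4))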
While this was a good advancement for Cassandra, there are a few gotchas to watch out for when using the replica-aware token allocation algorithm. To start with, it only works with the Murmur3Partitioner. If you started with an old cluster that used another partitioner, such as the RandomPartitioner, and have upgraded over time to 3.0, the feature is unusable. The second and more common stumbling block is that some trickery is required to use this feature when creating a cluster from scratch. The question was common enough that we wrote a blog post specifically on how to use the new replica-aware token allocation algorithm to set up a new cluster with even token distribution.
As you can see, Cassandra 3.0 made a genuine effort to address vnodes’ rough edges. What’s more, there are additional beacons of light on the horizon with the upcoming Cassandra 4.0 major release. For instance, a new allocate_tokens_for_local_replication_factor setting has been added to the cassandra.yaml file via CASSANDRA-15260. Similar to its cousin, the allocate_tokens_for_keyspace setting, the replica-aware token allocation algorithm is activated when a value is supplied to it. However, unlike its close relative, it is more user-friendly: no faffing is required to create a balanced cluster from scratch. In the simplest case, you can set a value for the allocate_tokens_for_local_replication_factor setting and just start adding nodes. Advanced operators can still manually assign tokens to the initial nodes to ensure the desired replication factor is met. After that, subsequent nodes can be added with the replication factor value assigned to the allocate_tokens_for_local_replication_factor setting.
Arguably, one of the most significant and longest-awaited changes to be released with Cassandra 4.0 is the update to the default value of the num_tokens setting. As mentioned at the beginning of this post, thanks to CASSANDRA-13701, Cassandra 4.0 will ship with a num_tokens value set to 16 in the cassandra.yaml file. In addition, the allocate_tokens_for_local_replication_factor setting is enabled by default and set to a value of 3.
These are much better user defaults. On a vanilla installation of Cassandra 4.0, the replica-aware token allocation algorithm kicks in as soon as there are enough hosts to satisfy a replication factor of 3. The result is evenly distributed token ranges for new nodes, with all the benefits that a low vnode value has to offer.
Conclusion
The consistent hashing and token allocation functionality form part of Cassandra’s backbone. Virtual nodes take the guesswork out of maintaining this critical functionality, specifically by making cluster resizing quicker and easier. As a rule of thumb, the lower the number of vnodes, the less even the token distribution will be, leading to some nodes being overworked. Conversely, the higher the number of vnodes, the longer cluster-wide operations take to complete and the more likely data will be unavailable if multiple nodes are down. The features introduced in 3.0, and the enhancements to them in 4.0, allow Cassandra to use a low number of vnodes while still maintaining a relatively even token distribution. Ultimately, this will produce a better out-of-the-box experience for new users running a vanilla installation of Cassandra 4.0.
Get Rid of Read Repair Chance
Apache Cassandra has a feature called Read Repair Chance that we always recommend our clients disable. It often adds an extra ~20% internal read load to your cluster while serving little purpose and providing no guarantees.
What is read repair chance?
The feature comes with two schema options at the table level:
read_repair_chance and dclocal_read_repair_chance. Each represents the probability that the coordinator node will query the extra replica nodes, beyond the requested consistency level, for the purpose of read repairs.
The original setting, read_repair_chance, defines the probability of issuing the extra queries to all replicas in all data centers. The newer dclocal_read_repair_chance setting defines the probability of issuing the extra queries to all replicas within the current data center.
The default values are read_repair_chance = 0.0 and dclocal_read_repair_chance = 0.1. This means that cross-datacenter asynchronous read repair is disabled and asynchronous read repair within the datacenter occurs on 10% of read requests.
What does it cost?
Consider the following cluster deployment:
- A keyspace with a replication factor of three (RF=3) in a single data center
- The default value of dclocal_read_repair_chance = 0.1
- Client reads using a consistency level of LOCAL_QUORUM
- Client is using the token aware policy (default for most drivers)
In this setup, the cluster is going to see ~10% of the read requests result in the coordinator issuing two messaging system queries to two replicas, instead of just one. This results in an additional ~5% load.
If the requested consistency level is LOCAL_ONE, which is the default for the java-driver, then ~10% of the read requests result in the coordinator increasing messaging system queries from zero to two. This equates to a ~20% read load increase.
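The arithmetic behind those two figures can be reproduced with a couple of lines of Python; this is a back-of-the-envelope sketch, assuming a token-aware client and RF=3 as described above:

def extra_read_load(rf, replicas_read, chance):
    """Average extra replica reads caused by asynchronous read repair,
    relative to the replica reads the consistency level needs anyway."""
    return chance * (rf - replicas_read) / replicas_read

print(f"LOCAL_QUORUM: +{extra_read_load(rf=3, replicas_read=2, chance=0.1):.0%}")  # +5%
print(f"LOCAL_ONE:    +{extra_read_load(rf=3, replicas_read=1, chance=0.1):.0%}")  # +20%

The same idea extends to the global read_repair_chance case by counting the replicas across every datacenter (nine replicas with three data centers at RF=3, hence the eight extra queries mentioned below).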
With read_repair_chance = 0.1 and multiple datacenters, the situation is much worse. With three data centers, each with RF=3, 10% of the read requests will result in the coordinator issuing eight extra replica queries, six of which are now cross-datacenter queries. In this use-case it becomes a doubling of your read load.
Let’s take a look at this with some flamegraphs…
The first flamegraph shows the default configuration of dclocal_read_repair_chance = 0.1. When the coordinator’s code hits the AbstractReadExecutor.getReadExecutor(..) method, it splits paths depending on the ReadRepairDecision for the table. Stack traces containing either AlwaysSpeculatingReadExecutor, SpeculatingReadExecutor or NeverSpeculatingReadExecutor provide a hint as to which code path we are on, and whether read repair chance or speculative retry is in play.
The second flamegraph shows the behaviour when the configuration has been changed to dclocal_read_repair_chance = 0.0. The AlwaysSpeculatingReadExecutor flame is gone, which demonstrates the degree of complexity removed from the runtime. Specifically, read requests from the client are no longer forwarded to every replica, but only to those required by the consistency level.
ℹ️ These flamegraphs were created with Apache Cassandra 3.11.9, Kubernetes and the cass-operator, nosqlbench and the async-profiler.
Previously we relied upon the existing tools of tlp-cluster, ccm, tlp-stress and cassandra-stress. This new approach with new tools is remarkably easy, and by using k8s the same approach can be used locally or against a dedicated k8s infrastructure. That is, I don't need to switch between ccm clusters for local testing and tlp-cluster for cloud testing. The same recipe applies everywhere. Nor am I bound to AWS for my cloud testing. It is also worth mentioning that these new tools are gaining a lot of focus and momentum from DataStax, so the introduction of this new approach to the open source community is deserved.
The full approach and recipe to generating these flamegraphs will follow in a [subsequent blog post](/blog/2021/01/31/cassandra_and_kubernetes_cass_operator.html).
What is the benefit of this additional load?
The coordinator returns the result to the client once it has received the response from one of the replicas, per the user’s requested consistency level. This is why we call the feature asynchronous read repairs. It means that read latencies are not directly impacted, though the additional background load will indirectly impact latencies.
Asynchronous read repairs also mean there is no guarantee that the response to the client contains repaired data. In summary, 10% of the data you read is guaranteed to be repaired only after you have read it. This is not a guarantee clients can use or rely upon, and it is not a guarantee Cassandra operators can rely upon to ensure data at rest is consistent. In fact, it is not a guarantee an operator would want to rely upon anyway, as most inconsistencies are dealt with by hints, and nodes down longer than the hint window are expected to be manually repaired.
Furthermore, systems that use strong consistency (i.e. where reads and writes are using quorum consistency levels) will not expose such unrepaired data anyway. Such systems only need repairs and consistent data on disk for lower read latencies (by avoiding the additional digest mismatch round trip between coordinator and replicas) and ensuring deleted data is not resurrected (i.e. tombstones are properly propagated).
So the feature gives us additional load for no usable benefit. This is why disabling the feature is always an immediate recommendation we give everyone.
It is also the rationale for the feature being removed altogether in the next major release, Cassandra version 4.0. And, since 3.0.17 and 3.11.3, if you still have values set for these properties in your table, you may have noticed the following warning during startup:
dclocal_read_repair_chance table option has been deprecated and will be removed in version 4.0
Get Rid of It
For Cassandra clusters not yet on version 4.0, do the following to disable all asynchronous read repairs:
cqlsh -e 'ALTER TABLE <keyspace_name>.<table_name> WITH read_repair_chance = 0.0 AND dclocal_read_repair_chance = 0.0;'
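For clusters with many tables, the same change can be scripted. Below is a minimal sketch using the Python cassandra-driver that walks system_schema.tables and alters every non-system table; the contact point and the absence of authentication are assumptions you will likely need to adjust:

from cassandra.cluster import Cluster

SYSTEM_KEYSPACES = {
    "system", "system_schema", "system_auth", "system_distributed", "system_traces",
}

def disable_read_repair_chance(contact_point="127.0.0.1"):
    """Set both read repair chance options to 0.0 on every user table."""
    cluster = Cluster([contact_point])
    session = cluster.connect()
    rows = session.execute("SELECT keyspace_name, table_name FROM system_schema.tables")
    for row in rows:
        if row.keyspace_name in SYSTEM_KEYSPACES:
            continue
        session.execute(
            "ALTER TABLE {}.{} WITH read_repair_chance = 0.0 "
            "AND dclocal_read_repair_chance = 0.0".format(row.keyspace_name, row.table_name)
        )
    cluster.shutdown()

if __name__ == "__main__":
    disable_read_repair_chance()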
When upgrading to Cassandra 4.0 no action is required; these settings are ignored and will disappear.