ScyllaDB Summit 2024 Recap: An Inside Look

A rapid rundown of the whirlwind database performance event

Hello readers, it’s great to be back in the US to help host ScyllaDB Summit, 2024 edition. What a great virtual conference it has been – and once again, I’m excited to share the behind-the-scenes perspective as one of your hosts.

First, let’s thank all the presenters once again for their contributions from around the world. With 30 presentations covering all things ScyllaDB, it made for a great event.

To kick things off, we had the now-famous Felipe Cardeneti Mendes host the ScyllaDB lounge and get straight into answering questions. Once the audience got a taste for it (and realized the depth of Felipe’s technical knowledge), there was an unstoppable stream of questions for him. Felipe was so popular that he became a meme for the conference!

The first morning, we also trialed something new by running hands-on labs aimed at both novice and advanced users. These live streams were a hit, with over 1000 people in attendance and everyone keen to get their hands on ScyllaDB. If you’d like to continue that experience, be sure to check out the self-paced ScyllaDB University and the interactive instructor-led ScyllaDB University LIVE event coming up in March. Both are free and virtual!

Let me recap some presentations that stood out to me on the first day of the conference. The opening keynote, by CEO and co-founder Dor Laor, was titled ScyllaDB Leaps Forward. This is a must-see presentation. It provides the background context you need to understand tablet architecture and the direction that ScyllaDB is headed: not only the fastest database in terms of latency (at any scale), but also the quickest to scale in terms of elasticity. The companion keynote on day two from CTO and co-founder Avi Kivity completes the loop and explains in more detail why ScyllaDB is making this major architecture shift from vNode-based replication to tablets. Take a look at Tablets: Rethink Replication for more insights.

The second keynote, from Discord Staff Engineer Bo Ingram, opened with the premise So You’ve Lost Quorum: Lessons From Accidental Downtime and shared how to diagnose issues in your clusters and how to avoid making a fault too big to tolerate. Bo is a talented storyteller and published author. Be sure to watch this keynote for great tips on how to handle production incidents at scale. And don’t forget to buy his book, ScyllaDB in Action, for even more practical advice on getting the most out of ScyllaDB.

Download the first 4 chapters for free

An underlying theme for the conference was exploring individual customers’ migration paths from other databases onto ScyllaDB. To that end, we were fortunate to hear from JP Voltani, Head of Engineering at Tractian, on their Experience with Real-Time ML and the reasons why they moved from MongoDB to ScyllaDB to scale their data pipeline. Working with over 5B samples from 50K+ IoT devices, they were able to achieve their migration goals. Felipe’s presentation on MongoDB to ScyllaDB: Technical Comparison and the Path to Success then detailed the benchmarks, processes, and tools you need to be successful for these types of migrations. There were also great presentations looking at migration paths from DynamoDB and Cassandra; be sure to take a look at them if you’re on any of those journeys.

A common component in customer migration paths was the use of Change Data Capture (CDC), and we heard from Expedia on their migration journey from Cassandra to ScyllaDB. They covered the challenges and pitfalls the team needed to overcome as part of their Identity service project. If you are keen to learn more about this topic, then Guilherme’s presentation on Real-Time Event Processing with CDC is a must-see.

Martina Alilović Rojnić gave us the Strategy Behind ReversingLabs’ Massive Key-Value Migration which had mind-boggling scale, migrating more than 300 TB of data and over 400 microservices from their bespoke key-value store to ScyllaDB – with ZERO downtime. An impressive feat of engineering!

ShareChat shared everything about Getting the Most Out of ScyllaDB Monitoring. This is a practical talk about working with non-standard ScyllaDB metrics to analyze the remaining cluster capacity, debug performance problems, and more. Definitely worth a watch if you’re already running ScyllaDB in production.

After a big day of hosting the conference, combined with the fatigue of international travel from Australia (and assisted by a couple of cold beverages that night), sleep was the priority. Well rested and eager for more, we launched early into day two of the event with more great content.

Leading the day was Avi’s keynote, which I already mentioned above. Equally informative was the following keynote from Miles Ward and Joe Shorter on Radically Outperforming DynamoDB. If you’re looking for more reasons to switch, this was a good presentation to learn from, including details of the migration and using ScyllaDB Cloud with Google Cloud Platform.

Felipe delivered another presentation (earning him MVP of the conference) about using workload prioritization features of ScyllaDB to handle both Real-Time and Analytical workloads, something you might not ordinarily consider compatible. I also enjoyed Piotr’s presentation on how ScyllaDB Drivers take advantage of the unique ScyllaDB architecture to deliver high performance and ultra-low latencies. This is yet another engineering talk showcasing the strengths of ScyllaDB’s feature sets. Kostja Osipov set the stage for this on Day 1. Kostja consistently delivers impressive Raft talks year after year, and his Topology on Raft: An Inside Look talk is another can’t-miss. There’s a lot there! Give it a (re)watch if you want all the details on how Raft is implemented in the new releases and what it all means for you, from the user perspective.

We also heard from Kishore Krishnamurthy, CTO at ZEE5, giving us insights into Steering a High-Stakes Database Migration. It’s always interesting to hear the executive-level perspective on the direction you might take to reduce costs while maintaining your SLAs. There were also more fascinating insights from ZEE5 engineers on how they are Tracking Millions of Heartbeats on Zee’s OTT Platform. Solid technical content.

In a similar vein, proving that “simple” things can still present serious engineering challenges when things are distributed and at scale, Edvard Fagerholm showed us how Supercell Persists Real-Time Events. Edvard illustrated how ScyllaDB helps them process real-time events for their games like Clash of Clans and Clash Royale with hundreds of millions of users.

The day flew by. Before long, we were wrapping up the conference. Thanks to the community of 7,500 attendees for the great participation – and thousands of comments – from start to finish. I truly enjoyed hosting the introductory lab and getting swarmed by questions. And no, I’m not a Kiwi! Thank you all for a wonderful experience.

Seastar, ScyllaDB, and C++23

Seastar now supports C++20 and C++23 (and dropped support for C++17)

Seastar is an open-source (Apache 2.0 licensed) C++ framework for I/O-intensive asynchronous computing, using the thread-per-core model. Seastar underpins several high-performance distributed systems: ScyllaDB, Redpanda, and Ceph Crimson. The Seastar source is available on GitHub.

Background

As a C++ framework, Seastar must choose which C++ versions to support. The support policy is last-two-versions. That means that at any given time, the most recently released version as well as the previous one are supported, but earlier versions cannot be expected to work. This policy gives users of the framework three years to upgrade to the next C++ edition while not constraining Seastar to ancient versions of the language.

Now that C++23 has been ratified, Seastar officially supports C++20 and C++23. C++17 is no longer supported.

New features in C++23

We will focus here on C++23 features that are relevant to Seastar users; this isn’t a comprehensive review of C++23 changes. For an overview of C++23 additions, consult the cppreference page.

std::expected

std::expected is a way to communicate error conditions without exceptions. This is useful since throwing and handling exceptions is very slow in most C++ implementations. In a way, it is similar to std::future and seastar::future: they are all variant types that can hold values and errors, though, of course, futures also represent concurrent computations, not just values.

So far, ScyllaDB has used boost::outcome for the same role that std::expected fills. This improved ScyllaDB’s performance under overload conditions. We’ll likely replace it with std::expected soon, and deeper integration into Seastar itself is a good area for extending the framework.
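To make that concrete, here’s a minimal, self-contained sketch of error signaling with std::expected. The parse_port helper is hypothetical and this isn’t Seastar or ScyllaDB code; it just shows the value-or-error shape that replaces throwing:

```cpp
// Minimal sketch: signaling errors with std::expected (C++23) instead of
// exceptions. parse_port() is a hypothetical helper, not Seastar/ScyllaDB code.
#include <charconv>
#include <cstdint>
#include <expected>
#include <iostream>
#include <string>

enum class parse_error { empty_input, not_a_number, out_of_range };

std::expected<std::uint16_t, parse_error> parse_port(const std::string& s) {
    if (s.empty()) {
        return std::unexpected(parse_error::empty_input);
    }
    unsigned value = 0;
    auto [ptr, ec] = std::from_chars(s.data(), s.data() + s.size(), value);
    if (ec != std::errc{} || ptr != s.data() + s.size()) {
        return std::unexpected(parse_error::not_a_number);
    }
    if (value > 65535) {
        return std::unexpected(parse_error::out_of_range);
    }
    return static_cast<std::uint16_t>(value);
}

int main() {
    auto port = parse_port("9042");
    if (port) {
        std::cout << "port: " << *port << '\n';  // value path: no exception machinery
    } else {
        std::cout << "parse failed\n";           // error path: plain return value
    }
}
```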

std::flat_set and std::flat_map

These new containers reduce allocations compared to their traditional variants and are suitable for request processing in Seastar applications. Seastar itself won’t use them since it still maintains C++20 compatibility, but Seastar users should consider them, along with abseil containers.
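Here’s a small illustrative sketch of std::flat_map usage, assuming your standard library already ships the C++23 <flat_map> header (coverage still varies across toolchains); it isn’t Seastar code:

```cpp
// Minimal sketch of std::flat_map (C++23). Keys and values live in sorted
// contiguous containers (std::vector by default), so lookups are cache-friendly
// and there is no per-node allocation. Illustrative only, not Seastar code.
#include <flat_map>
#include <iostream>
#include <string>

int main() {
    std::flat_map<std::string, int> counters;
    counters["reads"] = 42;
    counters["writes"] = 7;

    if (auto it = counters.find("reads"); it != counters.end()) {
        std::cout << it->first << " = " << it->second << '\n';
    }
}
```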

Retiring C++17 support

As can be seen from the previous section, C++23 does not have a dramatic impact on Seastar. The retiring of C++17, however, does. This is because we can now fully use C++20-only features in Seastar itself.

Coroutines

C++20 introduced coroutines, which make asynchronous code both easier to write and more efficient (a very rare tradeoff). Seastar applications could already use coroutines freely, but Seastar itself could not due to the need to support C++17. Since all supported C++ editions now have coroutines, continuation-style code will be replaced by coroutines where this makes sense.
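As a rough sketch of what that shift looks like, here’s the same trivial operation written with .then() continuations and as a coroutine. It’s written against Seastar’s public future API but should be read as an illustration, not as code taken from Seastar itself:

```cpp
// The same trivial logic written in continuation style and as a C++20 coroutine.
// A sketch against Seastar's future API, not code taken from Seastar itself.
#include <seastar/core/coroutine.hh>
#include <seastar/core/future.hh>

// Stand-in for a real asynchronous operation.
seastar::future<int> fetch_value() {
    return seastar::make_ready_future<int>(21);
}

// Continuation style: the flow is expressed as a chain of .then() callbacks.
seastar::future<int> doubled_with_continuations() {
    return fetch_value().then([] (int v) {
        return v * 2;
    });
}

// Coroutine style: co_await suspends until the future resolves, so the code
// reads top to bottom, much like synchronous code.
seastar::future<int> doubled_with_coroutine() {
    int v = co_await fetch_value();
    co_return v * 2;
}
```

Both functions have the same return type and can be chained or awaited by callers in exactly the same way; only the internal style differs.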

std::format

Seastar has long been using the wonderful {fmt} library. Since it was standardized as std::format in C++20, we may drop this dependency in favor of the standard library version.
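A minimal sketch of what that switch looks like at a call site, assuming both {fmt} and a C++20 standard library are available:

```cpp
// Equivalent formatting calls with the {fmt} library and with C++20 std::format.
// Illustrative only; the actual migration inside Seastar may look different.
#include <fmt/core.h>
#include <format>
#include <iostream>
#include <string>

int main() {
    std::string with_fmt = fmt::format("shard {} handled {} requests", 3, 1024);
    std::string with_std = std::format("shard {} handled {} requests", 3, 1024);
    std::cout << with_fmt << '\n' << with_std << '\n';
}
```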

The std::ranges library

Another long-time dependency, the Boost.Range library, can now be replaced by its modern equivalent std::ranges. This promises better compile times, and, more importantly, better compile-time error reporting as the standard library uses C++ concepts to point out programmer errors more directly.
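Here’s an illustrative example of the lazy, composable pipelines std::ranges enables; it’s not an actual Boost.Range-to-std::ranges migration from the Seastar codebase:

```cpp
// A small lazy pipeline with std::ranges, the kind of code that previously
// leaned on Boost.Range adaptors. Illustrative only, not from Seastar.
#include <iostream>
#include <ranges>
#include <vector>

int main() {
    std::vector<int> latencies_us{120, 450, 80, 3000, 95};

    // Keep sub-millisecond samples and double them; nothing is computed
    // until the range is iterated.
    auto adjusted = latencies_us
                  | std::views::filter([] (int us) { return us < 1000; })
                  | std::views::transform([] (int us) { return us * 2; });

    for (int us : adjusted) {
        std::cout << us << '\n';
    }
}
```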

Concepts

As concepts were introduced in C++20, they can now be used unconditionally in Seastar. Previously, they were only used when C++20 mode was active, which somewhat restricted what could be done with them.
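As a small, hypothetical example of the kind of constraint concepts express (not an actual Seastar concept):

```cpp
// A small, hypothetical concept: constrain a template to types that
// std::to_string accepts. Not an actual Seastar concept.
#include <concepts>
#include <iostream>
#include <string>

template <typename T>
concept Printable = requires(const T& t) {
    { std::to_string(t) } -> std::convertible_to<std::string>;
};

template <Printable T>
void print_twice(const T& value) {
    // If T doesn't satisfy Printable, the error points at the constraint,
    // not at some template instantiation deep inside the function body.
    std::cout << std::to_string(value) << ' ' << std::to_string(value) << '\n';
}

int main() {
    print_twice(42);       // OK: int satisfies Printable
    // print_twice("hi");  // would be rejected at the constraint check
}
```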

Conclusion

C++23 isn’t a revolution for C++ users in general and Seastar users in particular, but it does reduce the dependency on third-party libraries for common tasks. Adopting it, and dropping C++17 at the same time, allows us to continue modernizing and improving Seastar.

Distributed Database Consistency: Dr. Daniel Abadi & Kostja Osipov Chat

Dr. Daniel Abadi (University of Maryland) and Kostja Osipov (ScyllaDB) discuss PACELC, CAP theorem, Raft, and Paxos

Database consistency has been a strongly consistent theme at ScyllaDB Summit over the past few years – and we guarantee that will continue at ScyllaDB Summit 2024 (free + virtual). Co-founder Dor Laor’s opening keynote on “ScyllaDB Leaps Forward” includes an overview of the latest milestones on ScyllaDB’s path to immediate consistency. Kostja Osipov (Director of Engineering) then shares the details behind how we’re implementing this shift with Raft and what the new consistent metadata management updates mean for users. Then on Day 2, Avi Kivity (Co-founder) picks up this thread in his keynote introducing ScyllaDB’s revolutionary new tablet architecture – which is built on the foundation of Raft.

Update: ScyllaDB Summit 2024 is now a wrap!

Access ScyllaDB Summit On Demand

ScyllaDB Summit 2023 featured two talks on database consistency. Kostja Osipov shared a preview of Raft After ScyllaDB 5.2: Safe Topology Changes (also covered in this blog series). And Dr. Daniel Abadi, creator of the PACELC theorem, explored The Consistency vs Throughput Tradeoff in Distributed Databases.

After their talks, Daniel and Kostja got together to chat about distributed database consistency. You can watch the full discussion below.

Here are some key moments from the chat…

What is the CAP theorem and what is PACELC

Daniel: Let’s start with the CAP theorem. That’s the more well-known one, and that’s the one that came first historically. Some say it’s a three-way tradeoff, some say it’s a two-way tradeoff. It was originally described as a three-way tradeoff: out of consistency, availability, and tolerance to network partitions, you can have two of them, but not all three. That’s the way it’s defined. The intuition is that if you have a copy of your data in America and a copy of your data in Europe and you want to do a write in one of those two locations, you have two choices.

You do it in America, and then you say it’s done before it gets to Europe. Or, you wait for it to get to Europe, and then you wait for it to occur there before you say that it’s done. In the first case, if you commit and finish a transaction before it gets to Europe, then you’re giving up consistency because the value in Europe is not the most current value (the current value is the write that happened in America). But if America goes down, you could at least respond with stale data from Europe to maintain availability.

PACELC is really an extension of the CAP theorem. The PAC of PACELC is CAP. Basically, that’s saying that when there is a network partition, you must choose either availability or consistency. But the key point of PACELC is that network partitions are extremely rare. There’s all kinds of redundant ways to get a message from point A to point B.

So the CAP theorem is kind of interesting in theory, but in practice, there’s no real reason why you have to give up C or A. You can have both in most cases because there’s never a network partition. Yet we see many systems that do give up on consistency. Why? The main reason why you give up on consistency these days is latency. Consistency just takes time. Consistency requires coordination. You have to have two different locations communicate with each other to be able to remain consistent with one another. If you want consistency, that’s great. But you have to pay in latency. And if you don’t want to pay that latency cost, you’re going to pay in consistency. So the high-level explanation of the PACELC theorem is that when there is a partition, you have to choose between availability and consistency. But in the common case where there is no partition, you have to choose between latency and consistency.

[Read more in Dr. Abadi’s paper, Consistency Tradeoffs in Modern Distributed Database System Design]

In ScyllaDB, when we talk about consensus protocols, there’s Paxos and Raft. What’s the purpose for each?

Kostja: First, I would like to second what Dr. Abadi said. This is a tradeoff between latency and consistency. Consistency requires latency, basically. My take on the CAP theorem is that it was really oversold back in the 2010s. We were looking at this as a fundamental requirement, and we have been building systems as if we are never going to go back to strong consistency again. And now the train has turned around completely. Now many vendors are adding back strong consistency features.

For ScyllaDB, I’d say the biggest difference between Paxos and Raft is whether it’s a centralized algorithm or a decentralized algorithm. I think decentralized algorithms are just generally much harder to reason about. We use Raft for configuration changes, which we use as a basis for our topology changes (when we need the cluster to agree on a single state). The main reason we chose Raft was that it has been very well specified, very well tested and implemented, and so on. Paxos itself is not a multi-round protocol. You have to build on top of it; there are papers on how to build multi-Paxos on top of Paxos and how you manage configurations on top of that. If you are a practitioner, you need some very complete thing to build upon. Even when we were looking at Raft, we found quite a few open issues with the spec. That’s why both can co-exist. And I guess, we also have eventual consistency – so we could take the best of all worlds.

For data, we are certainly going to run multiple Raft groups. But this means that every partition is going to be its own consensus – running independently, essentially having its own leader. In the end, we’re going to have, logically, many leaders in the cluster. However, if you look at our schema and topology, there’s still a single leader. So for schema and topology, we have all of the members of the cluster in the same group. We do run a single leader, but this is an advantage because the topology state machine is itself quite complicated. Running in a decentralized fashion without a single leader would complicate it quite a bit more. For a layman, linearizable just means that you can very easily reason about what’s going on: one thing happens after another. And when you build algorithms, that’s a huge value. We build complex transitions of topology when you stream data from one node to another – you might need to abort this, you might need to coordinate it with another streaming operation, and having one central place to coordinate this is just much, much easier to reason about.

Daniel: Returning to what Kostja was saying. It’s not just that the trend (away from consistency) has started to reverse. I think it’s very true that people overreacted to CAP. It’s sort of like they used CAP as an excuse for why they didn’t create a consistent system. I think there are probably more systems than there should have been that might have been designed very differently if they didn’t drink the CAP Kool-Aid so much. I think it’s a shame, and as Kostja said, it’s starting to reverse now.

Daniel and Kostja on Industry Shifts

Daniel: We are seeing sort of a lot of systems now, giving you the best of both worlds. You don’t want to do consistency at the application level. You really want to have a database that can take care of the consistency for you. It can often do it faster than the application can deal with it. Also, you see bugs coming up all the time in the application layer. It’s hard to get all those corner cases right. It’s not impossible but it’s just so hard. In many cases, it’s just worth paying the cost to get the consistency guaranteed in the system and be working with a rock-solid system. On the other hand, sometimes you need performance. Sometimes users can’t tolerate 20 milliseconds – it’s just too long. Sometimes you don’t need consistency. It makes sense to have both options. ScyllaDB is one example of this, and there are also other systems providing options for users. I think it’s a good thing.

Kostja: I want to say more about the complexity problem. There was this research study on Ruby on Rails, Python, and Go applications, looking at how they actually use strongly consistent databases and different consistency levels that are in the SQL standard. It discovered that most of the applications have potential issues simply because they use the default settings for transactional databases, like snapshot isolation and not serializable isolation. Applied complexity has to be taken into account. Building applications is more difficult and even more diverse than building databases. So you have to push the problem down to the database layer and provide strong consistency in the database layer to make all the data layers simpler. It makes a lot of sense.

Daniel: Yes, that was Peter Bailis’ 2015 UC Berkeley Ph.D. thesis, Coordination Avoidance in Distributed Databases. Very nice comparison. What I was saying was that they know what they’re getting, at least, and they just tried to design around it and they hit bugs. But what you’re saying is even worse: they don’t even know what they’re getting into. They’re just using the defaults and not getting full isolation and not getting full consistency – and they don’t even know what happened.

Continuing the Database Consistency Conversation

Intrigued by database consistency? Here are some places to learn more:


ScyllaDB Summit Speaker Spotlight: Miles Ward, CTO at SADA

SADA CTO Miles Ward shares a preview of his ScyllaDB Summit keynote with Joseph Shorter (VP of Platform Architecture at Digital Turbine) 

ScyllaDB Summit is now just days away! If database performance at scale matters to your team, join us to hear about your peers’ experiences and discover new ways to alleviate your own database latency, throughput, and cost pains. It’s free, virtual, and highly interactive.

While setting the virtual stage, we caught up with ScyllaDB Summit keynote speaker, SADA CTO, and electric sousaphone player Miles Ward. Miles and the ScyllaDB team go way back, and we’re thrilled to welcome him back to ScyllaDB Summit – along with Joseph Shorter, VP of Platform Architecture at Digital Turbine.

Miles and team worked with Joseph and team on a high-stakes (e.g., “if we make a mistake, the business goes down”) and extreme-scale DynamoDB to ScyllaDB migration.

The keynote will be an informal chat about why and how they pulled off this database migration in the midst of a cloud migration (AWS to Google Cloud).

Update: ScyllaDB Summit 24 is now a wrap! That means you can watch this session on demand.

Watch Miles and Joe On Demand

Here’s what Miles shared in our chat…

Can you share a sneak peek of your keynote? What should attendees expect?

Lots of tech talks speak in the hypothetical about features yet to ship, about potential and capabilities. Not here! The engineers at SADA and Digital Turbine are all done: we switched from the old and dusted DynamoDB on AWS to the new hotness Alternator via ScyllaDB on GCP, and the metrics are in! We’ll have the play-by-play, lessons learned, and the specifics you can use as you’re evaluating your own adoption of ScyllaDB.

You’re co-presenting with an exceptional tech leader, Digital Turbine’s Joseph Shorter. Can you tell us more about him?

Joseph is a stellar technical leader. We met as we were first connecting with Digital Turbine through their complex and manifold migration. Remember: they’re built by acquisition so it’s all-day integrations and reconciliations.

Joe stood out as utterly bereft of BS, clear about the human dimension of all the change his company was going through, and able to grapple with all the layers of this complex stack to bring order out of chaos.

Are there any recent developments on the SADA and Google Cloud fronts that might intrigue ScyllaDB Summit attendees?

Three!

GenAI is all the rage! SADA is building highly efficient systems using the power of Google’s Gemini and Duet APIs to automate some of the most rote, laborious tasks from our customers. None of this works if you can’t keep the data systems humming; thanks ScyllaDB!

New instances from GCP with even larger SSDs (critical for ScyllaDB performance!) are coming very soon. Perhaps keep your eyes out around Google Next (April 9-11!)

SADA just got snapped up by the incredible folks at Insight, so now we can help in waaaaaay more places and for customers big and small. If what Digital Turbine did sounds like something you could use help with, let me know!

How did you first come across ScyllaDB, and how has your relationship evolved over time?

I met Dor in Tel Aviv a long, long, (insert long pause), LONG time ago, right when ScyllaDB was getting started. I loved the value prop, the origin story, and the team immediately. I’ve been hunting down good use cases for ScyllaDB ever since!

What ScyllaDB Summit 24 sessions (besides your own, of course) are you most looking forward to and why?

One is the Disney talk, which is right after ours. We absolutely have to see how Disney’s DynamoDB migration compares to ours.

Another is Discord; I am an avid user of Discord. I’ve seen some of their performance infrastructure and they have very, very, very VERY high throughput so I’m sure they have some pretty interesting experiences to share.

Also, Dor and Avi of course!

Are you at liberty to share your daily caffeine intake?

Editor’s note: If you’ve ever witnessed Miles’ energy, you’d understand why we asked this question.

I’m pretty serious about my C8H10N4O2 intake. It is a delicious chemical that takes care of me. I’m steady with two shots from my La Marzocco Linea Mini to start the day right, with typically a post-lunch-fight-the-sleepies cappuccino. Plus, my lovely wife has a tendency to only drink half of basically anything I serve her, so I get ‘dregs’ or sips of tasty coffee leftovers on the regular. Better living through chemistry!

***
We just met with Miles and Joe, and that chemistry is amazing too. This is probably your once-in-a-lifetime chance to attend a talk that covers electric sousaphones and motorcycles along with databases – with great conversation, practical tips, and candid lessons learned.

Trust us: You won’t want to miss their keynote, or all the other great talks that your peers across the community are preparing for ScyllaDB Summit.

Register Now – It’s Free

From Redis & Aurora to ScyllaDB, with 90% Lower Latency and $1M Savings

How SecurityScorecard built a scalable and resilient security ratings platform with ScyllaDB

SecurityScorecard is the global leader in cybersecurity ratings, with millions of organizations continuously rated. Their rating platform provides instant risk ratings across ten groups of risk factors, including DNS health, IP reputation, web application security, network security, leaked information, hacker chatter, endpoint security, and patching cadence.

Nguyen Cao, Staff Software Engineer at SecurityScorecard, joined us at ScyllaDB Summit 2023 to share how SecurityScorecard’s scoring architecture works and why they recently rearchitected it. This blog shares his perspective on how they decoupled the frontend and backend services by introducing a middle layer for improved scalability and maintainability.

Spoiler: Their architecture shift involved a migration from Redis and Aurora to ScyllaDB. And it resulted in:

  • 90% latency reduction for most service endpoints
  • 80% fewer production incidents related to Presto/Aurora performance
  • $1M infrastructure cost savings per year
  • 30% faster data pipeline processing
  • Much better customer experience

Curious? Read on as we unpack this.

Join us at ScyllaDB Summit 24 to hear more firsthand accounts of how teams are tackling their toughest database challenges. Disney, Discord, Expedia, Paramount, and more are all on the agenda.

Update: ScyllaDB Summit 2024 is now a wrap!

Access ScyllaDB Summit On Demand

SecurityScorecard’s Data Pipeline

SecurityScorecard’s mission is to make the world a safer place by transforming the way organizations understand, mitigate, and communicate cybersecurity to their boards, employees, and vendors. To do this, they continuously evaluate an organization’s security profile and report on ten key risk factors, each with a grade of A-F.

Here’s an abstraction of the data pipeline used to calculate these ratings:

Starting with signal collection, their global network of sensors is deployed across over 50 countries to scan IPs, domains, DNS, and various external data sources and instantly detect threats. That result is then processed by the Attribution Engine and Cyber Analytics modules, which try to associate IP addresses and domains with vulnerabilities. Finally, the scoring engines compute a rating score. Nguyen’s team is responsible for the scoring engines that calculate the scores accessed by the frontend services.

Challenges with Redis and Aurora

The team’s previous data architecture served SecurityScorecard well for a while, but it couldn’t keep up with the company’s growth.

The platform API (shown on the left side of the diagram) received requests from end users, then made further requests to other internal services, such as the measurements service. That service then queried datastores such as Redis, Aurora, and Presto on HDFS. The scoring workflow (an enhancement of Apache Airflow) then generated a score based on the measurements and the findings of over 12 million scorecards.

This architecture met their needs for years. One aspect that worked well was using different datastores for different needs. They used three main datastores: Redis (for faster lookups of 12 million scorecards), Aurora (for storing 4 billion measurement stats across nodes), and a Presto cluster on HDFS (for complex SQL queries on historical results).

However, as the company grew, challenges with Redis and Aurora emerged.

Aurora and Presto latencies spiked under high throughput. The Scoring Workflow was running batch inserts of over 4B rows into Aurora throughout the day. Aurora has a primary/secondary architecture, where the primary node is solely responsible for writes and the secondary replicas are read-only. The main drawback of this approach was that their write-intensive workloads couldn’t keep up. Since writes couldn’t scale, read latencies climbed as well because the replicas were also overwhelmed. Moreover, the Presto cluster became consistently overwhelmed as the number of requests and the amount of data associated with each scorecard grew. At times, latencies spiked to minutes, and this caused problems that impacted the entire platform.

The largest possible instance of Redis still wasn’t sufficient. It supported only 12M scorecards, but they needed to grow beyond that. They tried Redis Cluster for increased cache capacity, but determined that this approach would bring excessive complexity. For example, at the time, only the Python driver supported consistent hashing-based routing. That meant they would need to implement their own custom drivers to support this critical functionality on top of their Java and Node.js services.

HDFS didn’t allow fast score updating. The company wants to encourage customers to rapidly remediate reported issues – and the ability to show them an immediate score improvement offers positive reinforcement for good behavior. However, their HDFS configuration (with data immutability) meant that data ingestion to HDFS had to go through the complete scoring workflow first. This meant that score updates could be delayed for 3 days.

Maintainability. Their ~50 internal services were implemented in a variety of tech stacks (Go, Java, Node.js, Python, etc.). All these services directly accessed the various datastores, so they had to handle all the different queries (SQL, Redis, etc.) effectively and efficiently. Whenever the team changed the database schema, they also had to update all the services.

Moving to a New Architecture with ScyllaDB

To reduce latencies at the new scale that their rapid business growth required, the team moved to ScyllaDB Cloud and developed a new scoring API that routed less latency-sensitive requests to Presto + S3 storage. Here’s a visualization of this new – and considerably simpler – architecture:

A ScyllaDB Cloud cluster replaced Redis and Aurora, and AWS S3 replaced HDFS (Presto remains) for storing the scorecard details. Also added: a scoring-api service, which works as a data gateway. This component routes particular types of traffic to the appropriate data store.

How did this address their challenges?

Latency
With a carefully-designed ScyllaDB schema, they can quickly access the data that they need based on the primary key. Their scoring API can receive requests for up to 100,000 scorecards. This led them to build an API capable of splitting and parallelizing these payloads into smaller processing tasks to avoid overwhelming a single ScyllaDB coordinator. Upon completion, the results are aggregated and then returned to the calling service. Also, their high read throughput no longer causes latency spikes, thanks largely to ScyllaDB’s eventually consistent architecture.
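As a rough, generic illustration of that split/parallelize/aggregate pattern: the sketch below is not SecurityScorecard’s code, and fetch_chunk is just a placeholder for a batched query against the datastore.

```cpp
// Generic sketch of the split/parallelize/aggregate pattern described above.
// fetch_chunk() is a placeholder; this is not SecurityScorecard's implementation.
#include <algorithm>
#include <cstddef>
#include <future>
#include <string>
#include <unordered_map>
#include <vector>

using Score = double;

// Stand-in for a real query that fetches scores for one small batch of IDs.
std::unordered_map<std::string, Score> fetch_chunk(std::vector<std::string> ids) {
    std::unordered_map<std::string, Score> result;
    for (const auto& id : ids) {
        result[id] = 0.0;  // placeholder score
    }
    return result;
}

std::unordered_map<std::string, Score>
fetch_scorecards(const std::vector<std::string>& ids, std::size_t chunk_size) {
    std::vector<std::future<std::unordered_map<std::string, Score>>> tasks;

    // Split the (potentially huge) request into smaller chunks and issue them
    // concurrently, so no single batch overwhelms one coordinator.
    for (std::size_t i = 0; i < ids.size(); i += chunk_size) {
        std::vector<std::string> chunk(
            ids.begin() + i,
            ids.begin() + std::min(i + chunk_size, ids.size()));
        tasks.push_back(std::async(std::launch::async, fetch_chunk, std::move(chunk)));
    }

    // Aggregate the partial results before returning them to the caller.
    std::unordered_map<std::string, Score> merged;
    for (auto& task : tasks) {
        auto partial = task.get();
        merged.insert(partial.begin(), partial.end());
    }
    return merged;
}
```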

Scalability
With ScyllaDB Cloud, they can simply add more nodes to their clusters as demand increases, allowing them to overcome limits faced with Redis. The scoring API is also scalable since it’s deployed as an ECS service. If they need to serve requests faster, they just add more instances.

Fast Score Updating
Now, scorecards can be updated immediately by sending an upsert request to ScyllaDB Cloud.

Maintainability
Instead of accessing the datastore directly, services now send REST queries to the scoring API, which works as a gateway. It directs requests to the appropriate places, depending on the use case. For example, if the request needs a low-latency response it’s sent to ScyllaDB Cloud. If it’s a request for historical data, it goes to Presto.

Results & Lessons Learned

Nguyen then shared the KPIs they used to track project success. On the day that they flipped the switch, latency immediately decreased by over 90% on the two endpoints tracked in this chart (from 4.5 seconds to 303 milliseconds). They’re fielding 80% fewer incidents. And they’ve already saved over $1M USD a year in infrastructure costs by replacing Redis and Aurora. On top of that, they achieved a 30% speed improvement in their data processing.

Wrapping up, Nguyen shared their top 3 lessons learned:

  • Design your ScyllaDB schema based on data access patterns with a focus on query latency.
  • Route infrequent, complex, and latency-tolerant data access to OLAP engines like Presto or Athena (generating reports, custom analysis, etc.).
  • Build a scalable, highly parallel processing aggregation component to fully benefit from ScyllaDB’s high throughput.

Watch the Complete SecurityScorecard Tech Talk

You can watch Nguyen’s complete tech talk and skim through the deck in our tech talk library.

Watch the Full Tech Talk

WebAssembly: Putting Code and Data Where They Belong

Brian Sletten and Piotr Sarna chat about Wasm + data trends at ScyllaDB Summit

ScyllaDB Summit isn’t just about ScyllaDB. It’s also a space for learning about the latest trends impacting the broader data and database world. Case in point: the WebAssembly discussions at ScyllaDB Summit 23.

Free Access to ScyllaDB Summit 24 On-Demand

Last year, the conference featured two talks that addressed Wasm from two distinctly different perspectives: Brian Sletten (WebAssembly: The Definitive Guide author and President at Bosatsu Consulting) and Piotr Sarna (maintainer of libSQL, which supports Wasm user-defined functions and compiles to WebAssembly).

Brian’s Everything in its Place: Putting Code and Data Where They Belong talk shared how Wasm challenges the fundamental assumption that code runs on computers and data is stored in databases. And Piotr’s libSQL talk introduced, well, libSQL (Turso’s fork of SQLite that’s modernized for edge computing and distributed systems) and what’s behind its Wasm-powered support for dynamic function creation and execution.

See Brian’s Talk & Deck

See Piotr’s Talk & Deck

As the event unfolded, Brian and Piotr met up for a casual Wasm chat. Here are some highlights from their discussion…

How people are using WebAssembly to create a much closer binding between the data and the application

Brian: From mainframe, to client server, to extending servers with things like servlets, Perl scripts, and cgi-bin, I think we’ve just really been going back and forth, and back and forth, back and forth. The reality is that different problems require different topologies. We’re gaining more topologies to support, but also tools that help us support those topologies.

The idea that we can co-locate large amounts of data for long-term training sessions in the cloud avoids having to push a lot of that back and forth. But the more we want to support things like offline applications and interactions, the more we’re going to need to be able to capture data on-device and on-browser. So, the idea of being able to push app databases into the browser, like DuckDB and SQLite, and Postgres, and – I fully expect someday – ScyllaDB as well, allows us to say, “Let’s run the application, capture whatever interactions with the user we want locally, then sync it incrementally or in an offline capacity.”

We’re getting more and more options for where we co-locate these things, and it’s going to be driven by the business requirements as much as technical requirements for certain kinds of interactions. And things like regulatory compliance are a big part of that as well. For example, people might want to schedule work in Europe – or outside of Europe – for various regulatory reasons. So, there are lots of things happening that are driving the need, and technologies like WebAssembly and LLVM are helping us address that.

How Turso’s libSQL lets users get their compute closer to their data

Piotr: libSQL is actually just an embedded database library. By itself, it’s not a distributed system – but, it’s a very nice building block for one. You can use it at the edge. That’s a very broad term; to be clear, we don’t mean running software on your fridges or light bulbs. It’s just local data centers that are closer to the users. And we can build a distributed database by having lots of small instances running libSQL and then replicating to it. CRDTs – basically, this offline-first approach where users can interact with something and then sync later – is also a very interesting area. Turso is exploring that direction. And there’s actually another project called cr-sqlite that applies CRDTs to SQLite. It’s close to the approach we’d like to take with libSQL. We want to have such capabilities natively so users can write these kinds of offline-first applications and the system knows how to resolve conflicts and apply changes later.

Moving from “Web Scale” big data to a lot of small data

Brian: I think this shift represents a more realistic view. Our definition of scale is “web scale.” Nobody would think about trying to put the web into a container. Instead, you put things into the web. The idea that all of our data has to go into one container is something that clearly has an upper limit. That limit is getting bigger over time, but it will never be “all of our data.”
Protocols and data models that interlink loosely federated collections of data (on-demand, as needed, using standard query mechanisms) will allow us to keep the smaller amounts of data on-device, and then connect it back up. You’ve learned additional things locally that may be of interest when aggregated and connected. That could be an offline syncing to real-time linked data kinds of interactions.

Really, this idea of trying to “own” entire datasets is essentially an outdated mentality. We have to get it to where it needs to go, and obviously have the optimizations and engineering constraints around the analytics that we need to ask. But the reality is that data is produced in lots of different places. And having a way to work with different scenarios of where that data lives and how we connect it is all part of that story.

I’ve been a big advocate for linked data, knowledge graphs, and things like that for quite some time. That’s where I think things like WebAssembly and linked data and architectural distribution, and serverless functions, and cloud computing and edge computing are all sort of coalescing on this fluid fabric view of data and computation.

Trends: From JavaScript to Rust, WebAssembly, and LLVM

Brian: JavaScript is a technology that began in the browser, but was increasingly pushed to the back end. For that to work, you either need a kind of flat computational environment where everything is the same, or something where we can virtualize the computation in the form of Java bytecode, or .NET CIL, or whatever. JavaScript is part of that because we try to have a JavaScript engine that runs in all these different environments. You get better code reuse, and you get more portability in terms of where the code runs. But JavaScript itself also has some limitations in terms of the kinds of low-level stuff that it can handle (for example, there’s no real opportunity for ahead-of-time optimizations).

Rust represents a material advancement in languages for system engineering and application development. That then helps improve the performance and safety of our code. However, when Rust is built on the LLVM infrastructure, its pluggable back-end capability allows it to emit native code, emit WebAssembly, and even WASI-flavored WebAssembly. This ensures that it runs within an environment providing the capabilities required to do what it needs…and nothing else.

That’s why I think the intersection of architecture, data models, and computational substrates – and I consider both WebAssembly and LLVM important parts of those – to be a big part of solving this problem. I mean, the reason Apple was able to migrate from x86 to ARM relatively quickly is because they changed their tooling to be LLVM-based. At that point, it becomes a question of recompiling to some extent.

What’s needed to make small localized databases on the edge + big backend databases a successful design pattern

Piotr: These smaller local databases that are on this magical edge need some kind of source of truth that keeps everything consistent. Especially if ScyllaDB goes big on stronger consistency levels (with Raft), I do imagine a design where you can have these small local databases (say, libSQL instances) that are able to store user data, and users interact with them because they’re close by. That brings really low latency. Then, these small local databases could be synced to a larger database that becomes the single source of truth for all the data, allowing users to access this global state as well.

Watch the Complete WebAssembly Chat

You can watch the complete Brian <> Piotr chat here:



How ShareChat Performs Aggregations at Scale with Kafka + ScyllaDB

Sub-millisecond P99 latencies – even over 1M operations per second

ShareChat is India’s largest homegrown social media platform, with ~180 million monthly average users and 50 million daily active users. They capture and aggregate various engagement metrics such as likes, views, shares, comments, etc., at the post level to curate better content for their users.

Since engagement performance directly impacts users, ShareChat needs a datastore that offers ultra-low latencies while remaining highly available, resilient, and scalable – ideally, all at a reasonable cost. This blog shares how they accomplished that using in-house Kafka streams and ScyllaDB. It’s based on the information that Charan Movva, Technical Lead at ShareChat, shared at ScyllaDB Summit 23.

Join us at ScyllaDB Summit 24 to hear more firsthand accounts of how teams are tackling their toughest database challenges. Disney, Discord, Paramount, Expedia, and more are all on the agenda. Plus we’ll be featuring more insights from ShareChat. Ivan Burmistrov and Andrei Manakov will be presenting two tech talks: “Getting the Most Out of ScyllaDB Monitoring: ShareChat’s Tips” and “From 1M to 1B Features Per Second: Scaling ShareChat’s ML Feature Store.”

Update: ScyllaDB Summit 2024 is now a wrap!

Access ScyllaDB Summit On Demand

ShareChat: India’s Largest Social Media Platform

ShareChat aims to foster an inclusive community by supporting content creation and consumption in 15 languages. Allowing users to share experiences in the language they’re most comfortable with increases engagement – as demonstrated by their over 1.3 billion monthly shares, with users spending an average of 31 minutes per day on the app.

As all these users interact with the app, ShareChat collects events, including post views and engagement actions such as likes, shares, and comments. These events, which occur at a rate of 370k to 440k ops/second, are critical for populating the user feed and curating content via their data science and machine learning models. All of this is critical for enhancing the user experience and providing valuable content to ShareChat’s diverse community.

Why Stream Processing?

The team considered three options for processing all of these engagement events:

  • Request-response promised the lowest latency. But given the scale of events they handle, it would burden the database with too many connections and (unnecessary) writes and updates. For example, since every number of likes between 12,500 and 12,599 is displayed as “12.5K likes,” those updates don’t require the level of precision that this approach offers (at a price).
  • Batch processing offered high throughput, but also brought challenges such as stale data and delayed updates, especially for post engagements. This is especially problematic for early engagement on a post (imagine a user sharing something, then incorrectly thinking that nobody likes it because your updates are delayed).
  • Stream processing emerged as a well-balanced solution, offering continuous and non-blocking data processing. This was particularly important given their dynamic event stream with an unbounded and ever-growing dataset. The continuous nature of stream processing bridged the gap between request-response and batching.

Charan explained, “Our specific technical requirements revolve around windowed aggregation, where we aggregate events over predefined time frames, such as the last 5 or 10 minutes. Moreover, we need support for multiple aggregation windows based on engagement levels, requiring instant aggregation for smaller counters and a more flexible approach for larger counters.”

Additional technical considerations include:

  • Support for triggers, which are vital for user engagement
  • The ease of configuring new counters, triggers, and aggregation windows, which enables them to quickly evolve the product

Inside The ShareChat Architecture

Here’s a look at the architecture they designed to satisfy these requirements.

Various types of engagement events are captured by backend services, then sent to the Kafka cluster. Business logic at different services captures and derives internal counters crucial for understanding and making decisions about how the post should be displayed on the feed. Multiple Kafka topics cater to various use cases.

Multiple instances of the aggregation service are all running on Kubernetes. The Kafka Streams API handles the core logic for windowed aggregations and triggers. Once aggregations are complete, they update the counter or aggregated value in ScyllaDB and publish a change log to a Kafka topic.

Under the hood, ShareChat taps Kafka’s consistent hashing to prevent different messages or events for a given entity ID (post ID) and counter from being processed by different Kubernetes pods or instances. To support this, the combination of entity ID and counter is used as the partition key, and all its relevant events will be processed by the same consumers.

All the complex logic related to windowing and aggregations is managed using Kafka Streams. Each stream processing application executes a defined topology, essentially a directed acyclic graph (DAG). Events are pushed through a series of transformations, and the aggregated values are then updated in the data store, which is ScyllaDB.

Exploring the ShareChat Topology

Here’s how Charan mapped out their topology.

Starting from the top left:

  • They consume events from an engagement topic, apply basic filtering to check if the counter is registered, and divert unregistered events for monitoring and logging.
  • The aggregation window is defined based on the counter’s previous value, branching the main event stream into different streams with distinct windows. To handle stateful operations, especially during node shutdowns or rebalancing, Kafka Streams employs an embedded RocksDB state store, persisting data to local disk for rapid recovery.
  • The output stream from aggregations is created by merging the individual streams, and the aggregated values are updated in the data store for the counters. They log changes in a change log topic before updating counters on the data store.

Next, Charan walked through some critical parts of their code, highlighting design decisions such as their grace period setting and sharing how aggregations and views were implemented. He concluded, “Ultimately, this approach achieved a 7:1 ratio of events received as inputs compared to the database writes. That resulted in an approximately 80% reduction in messages.”

Where ScyllaDB Comes In

As ShareChat’s business took off, it became clear that their existing DBaaS solution (from their cloud provider) couldn’t provide acceptable latencies. That’s when they decided to move to ScyllaDB.

As Charan explained, “ScyllaDB is continuously delivering sub-millisecond latencies for the counters cluster. In addition to meeting our speed requirements, it also provides critical transparency into the database through flexible monitoring. Having multiple levels of visibility—data center, cluster, instance, and shards—helps us identify issues, abnormalities, or unexpected occurrences so we can react promptly.”

The ScyllaDB migration has also paid off in terms of cost savings: they’ve cut database costs by over 50% in the face of rapid business growth.

For the engagement counters use case, ShareChat runs a three-node ScyllaDB cluster. Each node has 48 vCPUs and over 350 GB of memory. Below, you can see the P99 read and write latencies: all in microseconds, even under heavy load.

With a similar setup, they tested the cluster’s response to an extreme (but feasible) 1.2 million ops/sec by replicating events in a parallel cluster. Even at 90% load, the cluster remained stable with minimal impact on their ultra-low latencies.

Charan summed it up as follows: “ScyllaDB has helped us a great deal in optimizing our application and better serving our end users. We are really in awe of what ScyllaDB has to offer and are expanding ScyllaDB adoption across our organization.”

Watch the Complete Tech Talk

You can watch the complete tech talk and skim through the deck in our tech talk library.

Watch the Full Tech Talk

Learn more in this ShareChat blog

The ScyllaDB NoSQL Community: Shaping the Future

Get involved with the ScyllaDB community – via forums, open source contributions, events & more – and help shape the future of ScyllaDB 

ScyllaDB was created from the get-go as an open-source, community-driven solution. Across the wide and dynamic landscape of modern databases, ScyllaDB stands out as a high-performance, distributed NoSQL database – especially when single-digit millisecond latency and high scalability are required. A big part of this success can be attributed to a thriving open-source community that plays a crucial role in shaping and enhancing this cutting-edge technology.

Being open-source also has a strong impact on our company culture, but that’s a topic for a different post.

The ScyllaDB community is vibrant and growing. Our Slack Channel has thousands of users. Our Community Forum, launched last year, has become the go-to place for meaningful conversations.

No matter how you prefer to engage, we welcome discussions, bug reports, fixes, feature contributions, feature requests, documentation improvements, and other ways to make ScyllaDB even faster, more flexible, and more robust.

Say hello & share how you’re using ScyllaDB

How to Get Involved

The community, which comprises developers, engineers, users, and enthusiasts from around the globe, actively contributes to the growth and evolution of the database. The collaborative nature of open source allows for diverse perspectives and expertise to come together for continuous improvements and valuable innovations.

Here are a few ways to get involved:

There are also online communities in languages beyond English, as well as related open-source projects: the Seastar framework, the ScyllaDB Monitoring Stack, the Kubernetes ScyllaDB Operator, the Go CQL Extension (GoCQLX) driver, and more.

ScyllaDB Community Forum

It’s been a bit over a year since we went live with the ScyllaDB Community Forum. We chose Discourse as the underlying platform.

ScyllaDB developers actively participate in forum discussions, offering insights, addressing technical queries, and providing updates on the latest features and developments. This direct interaction between the community and the core development team fosters a sense of transparency and collaboration, enhancing the product and the overall user experience.

Here is a taste of some popular topics. You can see many more discussions on the forum itself.

Popular ScyllaDB Forum Topics

The differences between column families in Cassandra’s data model and Bigtable’s

In this discussion, a user is seeking clarification on the relationship between Cassandra’s column-family-based data model and Google’s Bigtable. The user notes the multi-dimensional sparse map structure in Bigtable, where data is organized by rows, columns, and time.

The explanation is that, initially, Cassandra’s data model was indeed based on Bigtable’s, where a row could include any number of columns, following a schema-less approach. However, as Cassandra evolved, the developers recognized the value of schemas for application correctness. The introduction of Cassandra Query Language (CQL) around Cassandra version 0.8 marked a shift towards a more structured approach with clustering keys and schemas.

The response emphasizes the importance of clustering keys, which define the structure within wide rows (now called partitions). Cassandra internally maintains a dual representation, converting user-facing CQL rows into old-style wide rows on disk.

Listing all keyspaces in a ScyllaDB (or Cassandra) cluster

The topic deals with retrieving a list of all existing keyspaces in a cluster. This is useful, for example, when a user forgets the keyspace name they previously created. The provided answer suggests using the CQL Shell command DESC KEYSPACES or DESCRIBE to achieve this. This command works identically for both ScyllaDB and Cassandra, and additional information can be found in the documentation.

The best way to fetch rows in ScyllaDB (Count, Limit, or paging)

There are different ways to query data to fetch rows. The discussion lists some examples of how and when to use each method.

The user is utilizing ScyllaDB to limit users’ actions within the past 24 hours, allowing only a specific number of orders. They use ScyllaDB’s TTL, count records, and employ the Murmur3 hash for the partition key. The user seeks advice on the most efficient query method, considering options like COUNT(), LIMIT, and paging.

The response emphasizes the importance of always using paging to avoid latency and memory fragmentation. It suggests using LIMIT with paging, highlighting potential issues with the user’s initial approach.

Community Forum vs. Slack

Both the Community Forum and Slack serve as valuable communication channels within the ScyllaDB community, but they cater to different needs and scenarios.

When should you use each one? Here are some tips.

ScyllaDB Community Forum

Technical Discussions

  • When to Use: For in-depth technical questions, discussing specific features, or getting help with troubleshooting. If you think your question would be helpful to others, see if someone already asked it. If not, ask it in the forum so people can find it in the future.
  • Why: The forum is easy to search. It provides a structured environment for technical discussions, allowing for detailed explanations, code snippets, and collaborative problem-solving.

Showcasing Use Cases, Tips and Guides

  • When to Use: For seeking or providing use cases, tips, and best practices, and practical applications related to ScyllaDB.
  • Why: The forum is easy to search. It’s an excellent repository for community-generated content, making it a valuable resource for users looking to learn or share knowledge.

Announcements and Updates

  • When to Use: To stay informed about the latest releases, updates, events, and announcements from the ScyllaDB team.
  • Why: Important news and updates are posted on the forum, providing a centralized location for community members to stay up-to-date.

ScyllaDB Slack

Real-Time, Specific Communication

  • When to Use: For quick, very specific questions, immediate assistance, or engaging in real-time discussions with community members.
  • Why: Slack offers a more instantaneous communication channel, making it suitable for situations where quick responses are crucial.

Informal Conversations

  • When to Use: For informal conversations, networking, and community bonding.
  • Why: Slack channels often have a more relaxed and conversational atmosphere, providing a space for community members to connect on a personal level.

Be aware that the support provided by the community on Slack and on the Forum is “best effort.” If you are running ScyllaDB in production and want premium support with guaranteed SLAs, I recommend looking into ScyllaDB Enterprise or ScyllaDB Cloud, which offer 24×7 priority support from our dedicated support engineers.

Community Events and Appreciation

We regularly organize events that serve as platforms for learning, collaboration, and the celebration of achievements within the ScyllaDB ecosystem.

In just a few weeks,  we’ll host ScyllaDB Summit. This free online event brings together developers, engineers, and database enthusiasts from around the globe. It’s an immersive experience, with keynote speakers, technical sessions, and hands-on workshops covering a myriad of topics, from performance optimization to real-world use cases. Participants gain insights into the latest developments, upcoming features, and best practices directly from the core contributors and experts behind ScyllaDB.

Other notable events include P99 CONF (the technical conference for anyone who obsesses over high-performance, low-latency applications), ScyllaDB University Live (instructor-led NoSQL database training sessions), Meetups, NoSQL Masterclasses, ScyllaDB Labs (hands-on online training events), our webinars, and more. All are free and virtual.

Contributors are recognized for their efforts through various means, including contributor spotlights, awards, and acknowledgments within the ScyllaDB ecosystem. This recognition not only celebrates individual achievements but also motivates others to actively participate and contribute to the community.

Say hello & share how you’re using ScyllaDB

Closing Thoughts

ScyllaDB embraces an open-source, community-driven approach, fostering collaboration and transparency.
The collaborative nature of open source allows for meaningful discussions, bug reports, feature contributions, and more. There are multiple avenues for involvement, including the Community Forum, Slack channel, and GitHub repository.

As the ScyllaDB community continues to grow, events like the ScyllaDB Summit, P99 CONF, ScyllaDB University Live, and more provide platforms for learning, collaboration, and celebrating achievements. I hope to hear from you soon on one of these platforms!

Connect to ScyllaDB Clusters using the DBeaver Universal Database Manager

Learn how to use DBeaver as a simple alternative to cqlsh – so you can access your ScyllaDB clusters through a GUI with syntax highlighting

DBeaver is a universal database manager for working with SQL and NoSQL. It provides a central GUI for connecting to and administering a wide variety of popular databases (Postgres, MySQL, Redis…) . You can use it to toggle between the various databases in your tech stack, view and evaluate multiple database schemas, and visualize your data.

And now you can use it with ScyllaDB, the monstrously fast NoSQL database. This provides a simple alternative to cqlsh – enabling you to access your ScyllaDB clusters through a GUI with syntax highlighting.

DBeaver Enterprise has released version 23.3, introducing support for ScyllaDB. This update lets you seamlessly connect to ScyllaDB clusters using DBeaver. Both self-hosted and ScyllaDB Cloud instances work well with DBeaver. You can install DBeaver on Windows, Linux, and macOS.

In this post, I will show you how easy it is to connect DBeaver and ScyllaDB and run some basic queries.

Connect to ScyllaDB in DBeaver

Since ScyllaDB is compatible with Apache Cassandra, you can leverage DBeaver’s Cassandra driver to interact with ScyllaDB.
To create a new ScyllaDB connection in DBeaver:

  1. Download DBeaver Enterprise at https://dbeaver.com/download/enterprise/
  2. Click the “New database connection” button and search for “ScyllaDB.”
  3. Click “Next”.
  4. Enter the database connection details as follows:
  5. Click “Test Connection” and “Finish.”
  6. Inspect your tables.
  7. Run your CQL queries.

Query examples

Once the connection is set up, you can run all your CQL queries and see the result table right below your query. For example:
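
A simple lookup against a hypothetical keyspace and table:

SELECT * FROM my_keyspace.users WHERE user_id = 1;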

Aside from simple SELECT queries, you can also run queries to create new objects in the database, like a materialized view:
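
A sketch of such a statement, assuming a hypothetical my_keyspace.users table with a country column, could be:

CREATE MATERIALIZED VIEW my_keyspace.users_by_country AS
    SELECT * FROM my_keyspace.users
    WHERE country IS NOT NULL AND user_id IS NOT NULL
    PRIMARY KEY (country, user_id);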

Wrapping up

DBeaver is a great database management tool that allows easy access to your ScyllaDB clusters. It provides a simple alternative to cqlsh with a nice user interface and syntax highlighting. Get started by downloading DBeaver and creating a new ScyllaDB Cloud cluster.

Any questions? You can discuss this post and share your thoughts in our community forum.

Worldwide Local Latency With ScyllaDB: The ZeroFlucs Strategy

How data is replicated to support low latency for ZeroFlucs’ global usage patterns – without racking up unnecessary costs

ZeroFlucs’ business – processing sports betting data – is rather latency sensitive. Content must be processed in near real-time, constantly, and in a region local to both the customer and the data. And there’s incredibly high throughput and concurrency requirements – events can update dozens of times per minute and each one of those updates triggers tens of thousands of new simulations (they process ~250,000 in-game events per second).

At ScyllaDB Summit 23, ZeroFlucs’ Director of Software Engineering Carly Christensen walked attendees through how ZeroFlucs uses ScyllaDB to provide optimized data storage local to the customer – including how their recently open-sourced package (cleverly named Charybdis) facilitates this. This blog post, based on that talk, shares their brilliant approach to figuring out exactly how data should be replicated to support low latency for their global usage patterns without racking up unnecessary storage costs.

Join us at ScyllaDB Summit 24 to hear more firsthand accounts of how teams are tackling their toughest database challenges. Disney, Discord, Expedia, Fanatics, Paramount, and more are all on the agenda.

Update: ScyllaDB Summit 2024 is now a wrap!

Access ScyllaDB Summit On Demand

What’s ZeroFlucs?

First, a little background on the business challenges that the ZeroFlucs technology is supporting. ZeroFlucs’ same-game pricing lets sports enthusiasts bet on multiple correlated outcomes within a single game. This is leagues beyond traditional bets on what team will win a game and by what spread. Here, customers are encouraged to design and test sophisticated game theories involving interconnected outcomes within the game. As a result, placing a bet is complex, and there’s a lot more at stake as the live event unfolds.

For example, assume there are three “markets” for bets:

  • Will Team A or Team B win?
  • Which player will score the first touchdown?
  • Will the combined scores of Team A and B be over or under 45.5 points?

Someone could place a bet on team A to win, B. Bhooma to score the first touchdown, and for the total score to be under 45.5 points. If you look at those 3 outcomes and multiply the prices together, you get a price of around $28. But in this case, the correct price is approximately $14.50.

Carly explains why. “It’s because these are correlated outcomes. So, we need to use a simulation-based approach to more effectively model the relationships between those outcomes. If a team wins, it’s much more likely that they will score the first touchdown or any other touchdown in that match. So, we run simulations, and each simulation models a game end-to-end, play-by-play. We run tens of thousands of these simulations to ensure that we cover as much of the probability space as possible.”

The ZeroFlucs Architecture

The ZeroFlucs platform was designed from the ground up to be cloud native. Their software stack runs on Kubernetes, using Oracle Container Engine for Kubernetes. There are 130+ microservices, growing every week. And a lot of their environment can be managed through custom resource definitions (CRDs) and operators. As Carly explains, “For example, if we want to add a new sport, we just define a new instance of that resource type and deploy that YAML file out to all of our clusters.” A few more tech details:

  • Services are primarily Golang
  • Python is used for modeling and simulation services
  • GRPC is used for internal communications
  • Kafka is used for “at least once” delivery of all incoming and outgoing updates
  • GraphQL is used for external-facing APIs

As the diagram above shows:

  • Multiple third-party sources send content feeds.
  • Those content items are combined into booking events, which are then used for model simulations.
  • The simulation results are used to generate hundreds to thousands of new markets (specific outcomes that can be bet on), which are then stored back on the original booking event.
  • Customers can interact directly with that booking event. Or, they can use the ZeroFlucs API to request prices for custom combinations of outcomes via the ZeroFlucs query engine. Those queries are answered with stored results from their simulations.

Any content update starts the entire process over again.

Keeping Pace with Live In-Play Events

ZeroFlucs’ ultimate goal is to process and simulate events fast enough to offer same-game prices for live in-play events. For example, they need to predict whether this play results in a touchdown and which player will score the next touchdown – and they must do it fast enough to provide the prices before the play is completed. There are two main challenges to accomplishing this:

  • High throughput and concurrency. Events can update dozens of times per minute, and each update triggers tens of thousands of new simulations (hundreds of megabytes of data). They’re currently processing about 250,000 in-game events per second.
  • Customers can be located anywhere in the world. That means ZeroFlucs must be able to place their services — and the associated data – near these customers. With each request passing through many microservices, even a small increase in latency between those services and the database can result in a major impact on the total end-to-end processing time.

Selecting a Database That’s Up to the Task

Carly and team initially explored whether three popular databases might meet their needs here.

  • MongoDB was familiar to many team members. However, they discovered that with a high number of concurrent queries, some queries took several seconds to complete.
  • Cassandra supported network-aware replication strategies, but its performance and resource usage fell short of their requirements.
  • CosmosDB addressed all their performance and regional distribution needs, but they couldn’t justify its high cost, and its Azure-only availability meant vendor lock-in.

Then they thought about ScyllaDB, a database they had discovered while working on a different project. It didn’t make sense for the earlier use case, but it met this project’s requirements quite nicely. As Carly put it: “ScyllaDB supported the distributed architecture that we needed, so we could locate our data replicas near our services and our customers to ensure that they always had low latency. It also supported the high throughput and concurrency that we required. We haven’t yet found a situation that we couldn’t just scale through. ScyllaDB was also easy to adopt. Using ScyllaDB Operator, we didn’t need a lot of domain knowledge to get started.”

Inside their ScyllaDB Architecture

ZeroFlucs is currently using ScyllaDB hosted on Oracle Cloud Flex 4 VMs. These VMs allow them to change the CPU and memory allocation to those nodes if needed. It’s currently performing well, but the company’s throughput increases with every new customer. That’s why they appreciate being able to scale up and run on bare metal if needed in the future.

They’re already using ScyllaDB Operator to manage ScyllaDB, and they were reviewing their strategy around ScyllaDB Manager and ScyllaDB Monitoring at the time of the talk.

Ensuring Data is Local to Customers

To make the most of ScyllaDB, ZeroFlucs divided their data into three main categories:

  • Global data. This is slow-changing data used by all their customers. It’s replicated to each and every one of their regions.
  • Regional data. This is data that’s used by multiple customers in a single region (for example, a sports feed). If a customer in another region requires their data, they separately replicate it into that region.
  • Customer data. This is data that is specific to that customer, such as their booked events or their simulation results. Each customer has a home region where multiple replicas of their data are stored. ZeroFlucs also keeps additional copies of their data in other regions that they can use for disaster recovery purposes.

Carly shared an example: “Just to illustrate that idea, let’s say we have a customer in London. We will place a copy of our services (“a cell”) into that region. And all of that customer’s interactions will be contained in that region, ensuring that they always have low latency. We’ll place multiple replicas of their data in that region. And will also place additional replicas of their data in other regions. This becomes important later.”

Now assume there’s a customer in the Newport region. They would place a cell of their services there, and all of that customer’s interactions would be contained within the Newport region so they also have low latency.

Carly continues, “If the London data center becomes unavailable, we can redirect that customer’s requests to the Newport region. And although they would have increased latency on the first hop of those requests, the rest of the processing is still contained within one data center – so it would still be low latency.” With a complete outage for that customer averted, ZeroFlucs would then increase the number of replicas of their data in that region to restore data resiliency for them.

Between Scylla(DB) and Charybdis

ZeroFlucs separates data into services and keyspaces, with each service using at least one keyspace. Global data has just one keyspace, regional data has a keyspace per region, and customer data has a keyspace per customer. Some services can have more than one data type, and thus might have both a global keyspace as well as customer keyspaces.

They needed a simple way to manage the orchestration and updating of keyspaces across all their services. Enter Charybdis, the Golang ScyllaDB helper library that the ZeroFlucs team created and open sourced. Charybdis features a table manager that will automatically create keyspaces as well as add tables, columns, and indexes. It offers simplified functions for CRUD-style operations, and it supports LWT and TTL.

Note: For an in-depth look at the design decisions behind Charybdis, see this blog by ZeroFlucs Founder and CEO Steve Gray.

There’s also a topology Controller Service that’s responsible for managing the replication settings and keyspace information related to every service.

Upon startup, the service calls the topology controller and retrieves its replication settings. It then combines that data with its table definitions and uses it to maintain its keyspaces in ScyllaDB. The above image shows sample Charybdis-generated DDL statements that include a network topology strategy.
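
As a rough sketch of what such DDL typically looks like (the keyspace name, datacenter names, and replication counts below are hypothetical, not ZeroFlucs’ actual settings):

-- 'london' is the hypothetical home region; 'newport' holds extra replicas for disaster recovery
CREATE KEYSPACE IF NOT EXISTS customer_a_orders
    WITH replication = {'class': 'NetworkTopologyStrategy', 'london': 3, 'newport': 2};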

Next on their Odyssey

Carly concluded: “We still have a lot to learn, and we’re really early in our journey. For example, our initial attempt at dynamic keyspace creation caused some timeouts between our services, especially if it was the first request for that instance of the service. And there are still many ScyllaDB settings that we have yet to explore. I’m sure that we’ll be able to increase our performance and get even more out of ScyllaDB in the future.”

Watch the Complete Tech Talk

You can watch Carly’s complete tech talk and skim through her deck in our tech talk library.

Watch the Full Tech Talk

Running Apache Cassandra® Single and Multi-Node Clusters on Docker with Docker Compose

Sometimes you might need to spin up a local test database quickly: a database that doesn’t need to last beyond a set time or number of uses. Or maybe you want to integrate Apache Cassandra® into an existing Docker setup.

Either way, you’re going to want to run Cassandra on Docker, which means running it in a container with Docker as the container manager. This tutorial is here to guide you through running a single and multi-node setup of Apache Cassandra on Docker. 

Prerequisites 

Before getting started, you’ll need to have a few things already installed, and a few basic skills. These will make deploying and running your Cassandra database in Docker a seamless experience: 

  • Docker installed  
  • Basic knowledge of containers and Docker (see the Docker documentation for more insight) 
  • Basic command line knowledge 
  • A code editor (I use VSCode) 
  • CQL shell, aka cqlsh, installed (instructions for installing a standalone cqlsh without installing Cassandra can be found here) 

Method 1: Running a single Cassandra node using Docker CLI 

This method uses the Docker CLI to create a container based on the latest official Cassandra image. In this example we will: 

  • Set up the Docker container 
  • Test that it’s set up by connecting to it and running cqlsh 
  • Clean up the container once you’re done with using it. 

Setting up the container 

You can run Cassandra on your machine by opening up a terminal and using the following command in the Docker CLI: 

docker run --name my-cassandra-db -d cassandra:latest

Let’s look at what this command does: 

  • Docker uses the ‘run’ subcommand to run new containers.  
  • The ‘--name’ option allows us to name the container, which helps for later use and cleanup; we’ll use the name ‘my-cassandra-db’
  • The ‘-d’ flag tells Docker to run the container in the background, so we can run other commands or close the terminal without turning off the container.  
  • The final argument ‘cassandra:latest’ is the image to build the container from; we’re using the latest official Cassandra image 

When you run this, you should see an ID, like the screenshot below: 

To check and make sure everything is running smoothly, run the following command: 

docker ps -a 

You should see something like this: 

Connecting to the container 

Now that the data container has been created, you can now connect to it using the following command: 

docker exec -it my-cassandra-db cqlsh 

This will run cqlsh, or CQL Shell, inside your container, allowing you to make queries to your new Cassandra database. You should see a prompt like the following: 

Cleaning up the container 

Once you’re done, you can clean up the container with the ‘docker rm’ command. First, you’ll need to stop the container, so you must run the following 2 commands:

docker stop my-cassandra-db 

docker rm my-cassandra-db 

This will delete the database container, including all data that was written to the database. You’ll see a prompt like the following, which, if it worked correctly, will show the ID of the container being stopped/removed: 

Method 2: Deploying a three-node Apache Cassandra cluster using Docker compose 

This method allows you to have multiple nodes running on a single machine. But in which situations would you want to use this method? Some examples include testing the consistency level of your queries, your replication setup, and more.

Writing a docker-compose.yml 

The first step is creating a docker-compose.yml file that describes our Cassandra cluster. In your code editor, create a docker-compose.yml file and enter the following into it: 

version: '3.8'
networks:
  cassandra:
services:
  cassandra1:
    image: cassandra:latest
    container_name: cassandra1
    hostname: cassandra1
    networks:
      - cassandra
    ports:
      - "9042:9042"
    environment: &environment
        CASSANDRA_SEEDS: "cassandra1,cassandra2"
        CASSANDRA_CLUSTER_NAME: MyTestCluster
        CASSANDRA_DC: DC1
        CASSANDRA_RACK: RACK1
        CASSANDRA_ENDPOINT_SNITCH: GossipingPropertyFileSnitch
        CASSANDRA_NUM_TOKENS: 128
  cassandra2:
    image: cassandra:latest
    container_name: cassandra2
    hostname: cassandra2
    networks:
      - cassandra
    ports:
      - "9043:9042"
    environment: *environment
    depends_on:
      cassandra1:
        condition: service_started
  cassandra3:
    image: cassandra:latest
    container_name: cassandra3
    hostname: cassandra3
    networks:
      - cassandra
    ports:
      - "9044:9042"
    environment: *environment
    depends_on:
      cassandra2:
        condition: service_started

So what does this all mean? Let’s examine it part-by-part:  

First, we declare our docker compose version.  

version: '3.8'

Then, we declare a network called cassandra to host our cluster.

networks:
  cassandra:

Under services, cassandra1 is started. (NOTE: the service_started conditions in cassandra2’s and cassandra3’s depends_on attributes prevent them from starting until the services on cassandra1 and cassandra2, respectively, have started.) We also set up port forwarding here so that our local 9042 port maps to the container’s 9042 port. We also add the service to the cassandra network we established:

services:
  cassandra1:
    image: cassandra:latest
    container_name: cassandra1
    hostname: cassandra1
    networks:
      - cassandra
    ports:
      - "9042:9042"
    environment: &environment
        CASSANDRA_SEEDS: "cassandra1,cassandra2"
        CASSANDRA_CLUSTER_NAME: MyTestCluster
        CASSANDRA_DC: DC1
        CASSANDRA_RACK: RACK1
        CASSANDRA_ENDPOINT_SNITCH: GossipingPropertyFileSnitch
        CASSANDRA_NUM_TOKENS: 128

Finally, we set some environment variables needed for startup, such as declaring CASSANDRA_SEEDS to be cassandra1 and cassandra2. 

The configurations for containers ‘cassandra2’ and ‘cassandra3’ are very similar; the only real difference is the names.

  • Both use the same cassandra:latest image, set container names, add themselves to the Cassandra network, and expose their 9042 port.  
  • They also point to the same environment variables as cassandra1 with the *environment syntax. 
    • Their only difference? cassandra2 waits on cassandra1, and cassandra3 waits on cassandra2. 

Here is the code section that this maps to: 

  cassandra2:
    image: cassandra:latest
    container_name: cassandra2
    hostname: cassandra2
    networks:
      - cassandra
    ports:
      - "9043:9042"
    environment: *environment
    depends_on:
      cassandra1:
        condition: service_started
  cassandra3:
    image: cassandra:latest
    container_name: cassandra3
    hostname: cassandra3
    networks:
      - cassandra
    ports:
      - "9044:9042"
    environment: *environment
    depends_on:
      cassandra2:
        condition: service_started

Deploying your Cassandra cluster and running commands 

To deploy your Cassandra cluster, use the Docker CLI in the same folder as your docker-compose.yml to run the following command (the -d causes the containers to run in the background): 

docker compose up -d

Quite a few things should happen in your terminal when you run the command, but when the dust has settled you should see something like this: 

If you run the ‘docker ps -a’ command, you should see three running containers:

To access your Cassandra cluster, you can use cqlsh to connect to the container database using the following command:

sudo docker exec -it cassandra1 cqlsh

You can also check the cluster configuration using: 

docker exec -it cassandra1 nodetool status

Which will get you something like this: 

And the node info with: 

docker exec -it cassandra1 nodetool info

From which you’ll see something similar to the following: 

You can also run these commands on the cassandra2 and cassandra3 containers. 
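
If you want to experiment with replication and consistency levels (one of the motivations for a multi-node setup mentioned earlier), a quick hypothetical session could look like the following. Connect with ‘docker exec -it cassandra1 cqlsh’, then run:

CREATE KEYSPACE demo WITH replication = {'class': 'NetworkTopologyStrategy', 'DC1': 3};
CREATE TABLE demo.kv (id int PRIMARY KEY, val text);
CONSISTENCY QUORUM;
INSERT INTO demo.kv (id, val) VALUES (1, 'hello');
SELECT * FROM demo.kv WHERE id = 1;

With replication factor 3 and QUORUM consistency, each read and write must be acknowledged by at least two of the three nodes, which you can verify by stopping one container and repeating the queries.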

Cleaning up 

Once you’re done with the database cluster, you can take it down and remove it with the following command: 

docker compose down

This will stop and destroy all three containers, outputting something like this:

Now that we’ve covered two ways to run Cassandra in Docker, let’s look at a few things to keep in mind when you’re using it. 

Important things to know about running Cassandra in Docker 

Data Permanence 

Unless you declare volumes on the host machine that map to container volumes, the data you write to your Cassandra database will be erased when the container is destroyed. (You can read more about using Docker volumes here.)
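
For example, a variation of the earlier single-node command that persists data in a named Docker volume might look like this (the volume name is arbitrary; /var/lib/cassandra is where the official image stores its data):

docker run --name my-cassandra-db -d -v cassandra_data:/var/lib/cassandra cassandra:latest

With this in place, removing and recreating the container reuses the data in the cassandra_data volume instead of starting from an empty database.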

Performance and Resources 

Apache Cassandra can take a lot of resources, especially when a cluster is deployed on a single machine. This can affect the performance of queries, and you’ll need a decent amount of CPU and RAM to run a cluster locally. 

Conclusion 

There are several ways to run Apache Cassandra on Docker, and we hope this post has illuminated a few ways to do so. If you’re interested in learning more about Cassandra, you can find out more about how data modelling works with Cassandra, or how PostgreSQL and Cassandra differ. 

Ready to spin up some Cassandra clusters yourself? Give it a go with a free trial on the Instaclustr Managed Platform for Apache Cassandra® today!

The post Running Apache Cassandra® Single and Multi-Node Clusters on Docker with Docker Compose appeared first on Instaclustr.

The World’s Largest Apache Kafka® and Apache Cassandra® Migration?

Here at Instaclustr by NetApp, we pride ourselves on our ability to migrate customers from self-managed or other managed providers with minimal risk and zero downtime, no matter how complex the scenario. At any point in time, we typically have 5-10 cluster migrations in progress. Planning and executing these migrations with our customers is a core part of the expertise of our Technical Operations team. 

Recently, we completed the largest new customer onboarding migration exercise in our history, and it’s quite possibly the largest Apache Cassandra and Apache Kafka migration exercise ever completed by anyone. While we can’t tell you who the customer is, in this blog we will walk through the overall process and provide details of our approach. This will give you an idea of the lengths we go to when onboarding customers, and perhaps help you pick up some hints for your own migration exercises.

Firstly, some stats to give you a sense of the scale of the exercise: 

  • Apache Cassandra: 
    • 58 clusters
    • 1,079 nodes
    • 17 node sizes (ranging from r6g.medium to im4gn.4xlarge)
    • 2 cloud providers (AWS and GCP)
    • 6 cloud provider regions
  • Apache Kafka 
    • 154 clusters
    • 1,050 nodes
    • 21 node sizes (ranging from r6g.large to im4gn.4xlarge and r6gd.4xlarge)
    • 2 cloud providers (AWS and GCP)
    • 6 cloud provider regions

From the size of the environment, you can get a sense that the customer involved is a pretty large and mature organisation. Interestingly, this customer had been an Instaclustr support customer for a number of years. Based on that support experience, they decided to trust us with taking on full management of their clusters to help reduce costs and improve reliability.

Clearly, completing this number of migrations required a big effort both from Instaclustr and our customer. The timeline for the project looked something like: 

  • July 2022: contract signed and project kicked off 
  • July 2022 – March 2023: customer compliance review, POCs and feature enhancement development 
  • February 2023 – October 2023: production migrations 

Project Management and Governance 

A key to the success of any large project like this is strong project management and governance. Instaclustr has a well-established customer project management methodology that we apply to projects:

Source: Instaclustr 

In line with this methodology, we staffed several key roles to support this project: 

  • Overall program manager 
  • Cassandra migration project manager 
  • Cassandra technical lead 
  • Kafka migration project manager 
  • Kafka technical lead 
  • Key Customer Product Manager 

The team worked directly with our customer counterparts and established several communication mechanisms that were vital to the success of the project. 

Architectural and Security Compliance 

While high-level compliance with the customer’s security and architecture requirements had been established during the pre-contract phase, the first phase of post-contract work was a more detailed solution review with the customer’s compliance and architectural teams.  

To facilitate this requirement, Instaclustr staff met regularly with the customer’s security team to understand their requirements and explain Instaclustr’s existing controls that met these needs. 

As expected, Instaclustr’s existing SOC2 and PCI certified controls meant that a very high percentage of the customer’s requirements were met right out of the box. This included controls such as intrusion detection, access logging and operating system hardening.  

However, as is common in mature environments with well-established requirements, a few gaps were identified and Instaclustr agreed to take these on as system enhancements. Some examples of the enhancements we delivered prior to commencing production migrations include: 

  • Extending the existing system to export logs to a customer-owned location to include audit logs 
  • The ability to opt-in at an account level for all newly created clusters to be automatically configured with log shipping 
  • Allowing the process that loads custom Kafka Connect connectors to use instance roles rather than access keys for s3 access 
  • Enhancements to our SCIM API for provisioning SSO access 

In addition to establishing security compliance, we used this period to further validate architectural fit and identified some enhancements that would help to ensure an optimal fit for the migrated clusters. Two key enhancements were delivered to meet this goal: 

  • Support for Kafka clusters running in two Availability Zones with RF2 
  • This is necessary as the customer has a fairly unique architecture that delivers HA above the Kafka cluster level 
  • Enabling multiple new AWS and GCP node types to optimize infrastructure spend 

Apache Kafka Migration 

Often when migrating Apache Kafka, the simplest approach is what we call Drain Out.   

In this approach, Kafka consumers are pointed at both the source and destination clusters; the producers are then switched to send messages to just the destination cluster. Once all messages are read from the source cluster, the consumers there can be switched off and the migration is complete. 

However, while this is the simplest approach from a Kafka point of view, it does not allow you to preserve message ordering through the cutover. This can be important in many use cases, and was certainly important for this customer. 

When the Drain Out approach is not suitable, using MirrorMaker2 can also be an option; we have deployed it on many occasions for other migrations. In this particular case, however, the level of consumer/producer application dependency ruled out using MirrorMaker2.

This left us with the Shared Cluster approach, where we operate the source and destination clusters as a single cluster for a period before decommissioning the source. 

The high-level steps we followed for this shared cluster migration approach are: 

1. Provision destination Instaclustr managed cluster, shut down and wipe all data 

2. Update configurations on the destination cluster to match source cluster as required 

3. Join network environments with the source cluster (VPC peering, etc) 

4. Start up destination Apache ZooKeeper™ in observer mode, and start up destination Kafka brokers 

5. Use Kafka partition reassignment to move data (a hedged sketch of the tooling for this step follows the list):

a. Increase replication factor and replicate across destination as well as source brokers 

b. Swap preferred leaders to destination brokers 

c. Decrease replication factor to remove replicas from source brokers 

6. Reconfigure clients to use destination brokers as initial contact points 

7. Remove old brokers 
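
As referenced in step 5, here is a hedged sketch of the kind of tooling that step relies on; the topic name, broker IDs, and file name are purely illustrative, and the real change plans were far more detailed:

# describe the desired replica placement (source brokers 1-3 plus destination brokers 101-103)
cat > reassign.json <<'EOF'
{
  "version": 1,
  "partitions": [
    { "topic": "orders", "partition": 0, "replicas": [1, 2, 3, 101, 102, 103] }
  ]
}
EOF

# apply the reassignment to increase the replication factor across both sets of brokers
kafka-reassign-partitions.sh --bootstrap-server destination-broker:9092 \
  --reassignment-json-file reassign.json --execute

# once replicas are in sync and the destination brokers are listed first, trigger preferred leader election
kafka-leader-election.sh --bootstrap-server destination-broker:9092 \
  --election-type preferred --all-topic-partitions

Decreasing the replication factor afterwards is just a second reassignment whose replicas list contains only the destination broker IDs.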

For each cluster, a detailed change plan was created by Instaclustr to cover all of the high-level steps listed above and to roll back if any issues arose.

A couple of other specific requirements from this environment added extra complexity and are worth mentioning:

  • The source environment shared a single ZooKeeper instance across multiple clusters. This is not a configuration that we support and the customer agreed that it was a legacy configuration that they would rather leave behind. To accommodate the migration from this shared ZooKeeper, we had to develop functionality for custom configuration of ZooKeeper node names in our managed clusters as well as build a tool to “clean” the destination ZooKeeper of data related to other clusters after migration (for security and ongoing supportability). 
  • The existing clusters had port listener mappings that did not align with the mappings supported by our management system, and reconfiguring these prior to migration would have added extensive work on the customer side. We therefore extended our custom configuration to allow more extensive custom configuration of listeners. Like other custom configuration we support, this is stored in our central configuration database so it survives node replacements and is automatically added to new nodes in a cluster. 

Apache Cassandra Migration 

We have been doing zero downtime migrations of Apache Cassandra since 2014. All of them basically follow the “add a datacenter to an existing cluster” process that we outlined in a 2016 blog. 

One key enhancement that we’ve made since this blog–and even utilized during this most recent migration–is the introduction of the Instaclustr Minotaur consistent rebuild tool (available on GitHub here).

If the source cluster is missing replicas of some data prior to starting the rebuild, the standard Cassandra data center rebuild process can try to copy more than one replica from the same source node. This results in even fewer replicas of data on the destination cluster.

In the standard case of replication factor 3 and consistency level quorum queries, this can mean going from having 2 replicas and data being consistently returned on the source cluster to only 1 replica (or even 0 replicas) and data being intermittently missed on the destination cluster.

The “textbook” Cassandra approach to address this is to run Cassandra repairs after the rebuild, which will ensure all expected replicas are in sync. However, we are frequently asked to migrate clusters that have not been repaired for a long time and that can make running repairs a very tricky operation.

Instaclustr Minotaur addresses these issues.

Using the Minotaur tool, we can guarantee that the destination cluster has at least as many replicas as the source cluster. Running repairs to get the cluster back into a fully healthy state can then be left until the cluster is fully migrated, and our Tech Ops team can hand-hold the process. 

This approach was employed across all Cassandra migrations for this customer and proved particularly important for certain clusters with high levels of inconsistency pre-migration; one particularly tricky cluster even took two and a half months to fully repair post-migration!

Another noteworthy challenge from this migration was a set of clusters where tables were dropped every 2 to 3 hours.  

This is a common design pattern for temporary data in Cassandra as it allows the data to be quickly and completely removed when it is no longer required (rather than a standard delete creating “tombstones” or virtual delete records). The downside is that the streaming of data to new nodes fails if a schema change occurs during a streaming operation and can’t be restarted.  

Through the migration process, we managed to work around this with manual coordination of pausing the table drop operation on the customer side while each node rebuild was occurring. However, it quickly became apparent that this would be too cumbersome to sustain through ongoing operations.  

To remedy this, we held a joint brainstorming meeting with the customer to work through the issue and potential solutions. The end result was a design for the automation on the customer-side to pause the dropping of tables whenever it was detected that a node in the cluster was not fully available. Instaclustr’s provisioning API provided node status information that could be used to facilitate this automation.  

Conclusion 

This was a mammoth effort that not only relied on Instaclustr’s accumulated expertise from many years of running Cassandra and Kafka, but also our strong focus on working as part of one team with our customers.  

The following feedback we received from the customer project manager is exactly the type of reaction we aim for with every customer interaction: 

“We’ve hit our goal ahead of schedule and could not have done it without the work from everyone on the Instaclustr side and [customer team]. It was a pleasure working with all parties involved!  

“The migration went smoothly with minimal disruption and some lessons learned. I’m looking forward to working with the Instaclustr team as we start to normalize with the new environment and build new processes with your teams to leverage your expertise.

“Considering the size, scope, timeline and amount of data transferred, this was the largest migration I’ve ever worked on and I couldn’t have asked for better partners on both sides.” 

Interested in doing a migration yourself? Reach out to our team of engineers and we’ll get started on the best plan of action for your use case! 

The post The World’s Largest Apache Kafka® and Apache Cassandra® Migration? appeared first on Instaclustr.

Instaclustr Product Update: December 2023

There have been a lot of exciting things happening at Instaclustr! Here’s a roundup of the updates to our platform over the last few months. 

As always, if you have a specific feature request or improvement you’d love to see, please get in contact with us. 

Major Announcements 

Cadence®: 

  • Instaclustr Managed Platform adds HTTP API for Cadence®

The Instaclustr Managed Platform now provides an HTTP API to allow interaction with Cadence® workflows from virtually any language. For further information or a technical briefing, contact an Instaclustr Customer Success representative or sales@instaclustr.com 

  • Cadence® on the Instaclustr Managed Platform Achieves PCI Certification 

We are pleased to announce that Cadence® 1.0 on the Instaclustr Managed Platform is now PCI Certified on AWS and GCP, demonstrating our company’s commitment to stringent data security practices and architecture. For further information or a technical briefing, contact an Instaclustr Customer Success representative or sales@instaclustr.com 

Apache Cassandra®: 

  • Debezium Connector Cassandra is Released for General Availability  

Instaclustr’s managed Debezium connector for Apache Cassandra makes it easy to export a stream of data changes from an Instaclustr Managed Cassandra cluster to Instaclustr Managed Kafka cluster. Try creating a Cassandra cluster with Debezium Connector on our console. 

Ocean for Apache Spark™:  

  • Interactive data engineering at scale: Jupyter Notebooks, embedded in the Ocean for Apache Spark UI, are now generally available in Ocean Spark. Our customers’ administrators no longer need to install components, since Ocean Spark now hosts the Jupyter infrastructure. Users no longer need to switch applications, as they can have Ocean for Apache Spark process massive amounts of data in the cloud while viewing live analytic results on their screen.

Other Significant Changes 

Apache Cassandra®: 

  • Apache Cassandra® Version 3.11.16 was released as generally available on Instaclustr Platform 
  • A new feature has been released for zero-downtime restores for Apache Cassandra® that eliminates downtime when the in-place restore operation is performed on a Cassandra cluster 
  • Added support for customer-initiated resize on GCP and Azure, allowing customers to vertically scale their Cassandra cluster and select any larger production disk or instance size when scaling up. For instance sizes, customers can also scale down through the Instaclustr Console or API.

Apache Kafka®: 

  • Support for Apache ZooKeeper™ version 3.8.2 released in general availability 
  • Altered the maximum heap size allocated to Instaclustr for Kafka and ZooKeeper on smaller node sizes to improve application stability. 

Cadence®: 

  • Cadence 1.2.2 is now in general availability 

Instaclustr Managed Platform: 

  • Added support for new AWS regions: UAE, Zurich, Jakarta, Melbourne, and Osaka

OpenSearch®: 

  • OpenSearch 1.3.11 and OpenSearch 2.9.0 are now in general availability on the Instaclustr Platform 
  • Dedicated Ingest Nodes for OpenSearch® is released as public preview

PostgreSQL®: 

  • PostgreSQL 15.4, 14.9, 13.12, and 16 are now in general availability 

Ocean for Apache Spark™: 

  • New memory metrics charts in our UI give our users the self-help they need to write better big data analytical applications. Our UI-experience-tracking software has shown that this new feature has already helped some customers to find the root cause of failed applications. 
  • Autotuning memory down for cost savings. Customers can use rules to mitigate out of memory errors. 
  • Cost metrics improved in both API and UI: 
    • API: users can now make REST calls to retrieve Ocean Spark usage charges one month at a time 
    • UI: metrics that estimate cloud provider costs have been enhanced to account for differences among on-demand instances, spot instances, and reserved instances.  

Upcoming Releases 

Apache Cassandra®: 

  • AWS PrivateLink with Instaclustr for Apache Cassandra® will soon be released to general availability. The AWS PrivateLink offering provides our AWS customers with a simpler and more secure option for network connectivity. Log into the Instaclustr Console with just one click to trial the AWS PrivateLink public preview release with your Cassandra clusters today. Alternatively, the AWS PrivateLink public preview for Cassandra is available at the Instaclustr API.
  • AWS 7-series nodes for Apache Cassandra® will soon be generally available on Instaclustr’s Managed Platform. The AWS 7-series nodes introduce new hardware, with Arm-based AWS Graviton3 processors and Double Data Rate 5 (DDR5) memory, providing customers with the performance benefits that Graviton3 processors are known for.

Apache Kafka®: 

  • Support for custom Subject Alternative Names (SAN) on AWS​ will be added early next year, making it easier for our customers to use internal DNS to connect to their managed Kafka clusters without needing to use IP addresses. 
  • Kafka 3.6.x will be introduced soon in general availability. This will be the first Managed Kafka release on our platform to have KRaft available in GA, entitling our customers to full coverage under our extensive SLAs. 

Cadence®: 

  • User authentication through open source is under development to enhance security of our Cadence architecture. This also helps our customers more easily pass security compliance checks of applications running on Instaclustr Managed Cadence. 

Instaclustr Managed Platform: 

  • Instaclustr Managed Services will soon be available to purchase through the Azure Marketplace. This will allow you to allocate your Instaclustr spend towards any Microsoft commit you might have. The integration will first be rolled out to RIIA clusters, then a fully RIYOA setup. Both self-serve and private offers will be supported. 
  • The upcoming release of Signup with Microsoft integration marks another step in enhancing our platform’s accessibility. Users will have the option to sign up and sign in using their Microsoft Account credentials. This addition complements our existing support for Google and GitHub SSO, as well as the ability to create an account using email, ensuring a variety of convenient access methods to suit different user preferences.  

OpenSearch®: 

  • Release of searchable snapshots​​ feature is coming soon as GA. This feature will make it possible to search indexes that are stored as snapshots within remote repositories (i.e., S3) without the need to download all the index data to disk ahead of time, allowing customers to save time and leverage cheaper storage options. 
  • Continuing on from the public preview release of dedicated ingest nodes for Managed OpenSearch, we’re now working on making it generally available within the coming months. 

PostgreSQL®: 

  • Release of PostgreSQL–Azure NetApp (ANF) Files Fast Forking feature will be available soon. By leveraging the ANF filesystem, customers can quickly and easily create an exact copy of their PostgreSQL cluster on additional infrastructure. 
  • Release of pgvector extension is expected soon. This extension lets you use your PostgreSQL server for vector embeddings to take advantage of the latest in Large Language Models (LLMs). 

Ocean for Apache Spark™: 

  • We will soon be announcing SOC 2 certification by an external auditor since we have implemented new controls and processes to better protect data. 
  • We are currently developing new log collection techniques and UI charts to accommodate longer running applications since our customers are running more streaming workloads on our platform. 

Did you know? 

It has become more common for customers to create many accounts, each managing smaller groups of clusters. The account search functionality helps answer several questions: which account a given cluster belongs to, where that account is, and what it owns.

Combining the real-time data streaming capabilities of Apache Kafka with the powerful distributed computing capabilities of Apache Spark can simplify your end-to-end big data processing and Machine Learning pipelines. Explore a real-world example in the first part of a blog series on Real Time Machine Learning using NetApp’s Ocean for Apache Spark and Instaclustr for Apache Kafka 

Apache Kafka cluster visualization—AKHQ—is an open source GUI for Apache Kafka, which helps with managing topics, consumer groups, Schema Registry, Kafka® Connect, and more. In this blog we detail the steps needed to get AKHQ, an open source project licensed under Apache 2, working with Instaclustr for Apache Kafka. This is the first in a series of planned blogs where we’ll take you through similar steps for using other popular open source Kafka user interfaces with your beloved Instaclustr for Apache Kafka.

The post Instaclustr Product Update: December 2023 appeared first on Instaclustr.

Debezium® Change Data Capture for Apache Cassandra®

The Debezium® Change Data Capture (CDC) source connector for Apache Cassandra® version 4.0.11 is now generally available on the Instaclustr Managed Platform. 

The Debezium CDC source connector will stream your Cassandra data (via Apache Kafka®) to a centralized enterprise data warehouse or any other downstream data consumer. Now business information in your Cassandra database is streamed much more simply than before!  

Instaclustr has been supporting the Debezium CDC Cassandra source connector through a private offering since 2020. More recently, our Development and Open Source teams have been developing the CDC feature for use by all Instaclustr Cassandra customers. 

Instaclustr’s support for the Debezium Cassandra source connector was released to Public Preview earlier this year for our customers to trial in a test environment. Now with the general availability release, our customers can get a fully supported Cassandra CDC feature already integrated and tested on the Instaclustr platform, rather than performing the tricky integration themselves.

“The Debezium CDC source connector is a really exciting feature to add to our product set, enabling our customers to transform Cassandra database records into streaming events and easily pipe the events to downstream consumers.

“Our team has been actively developing support for the Debezium CDC Cassandra source connector both on our managed platform and also through open source contributions by NetApp to the Debezium Cassandra connector project. We’re looking forward to seeing more of our Cassandra customers using CDC to enhance their data infrastructure operations.”—Jaime Borucinski, Apache Cassandra Product Manager

Instaclustr’s industry-leading SLAs and support offered for the CDC feature provide our customers with the confidence to rely on this solution for production. This means that Instaclustr’s managed Debezium source connector for Cassandra is a good fit for your most demanding production data workloads, increasing the value for our customers with integrated solutions across our product set. Our support document, Creating a Cassandra Cluster With Debezium Connector, provides step-by-step instructions to create an Instaclustr for Cassandra CDC solution for your business.

How Does Change Data Capture Operate with Cassandra?  

Change Data Capture is a native Cassandra setting that can be enabled on a table at creation, or with an alter table command on an existing table (a short CQL sketch of this follows the list below). Once enabled, it will create logs that capture and track cluster data that has changed because of inserts, updates, or deletes. Once initial installation and setup have been completed, the CDC process will take these row-level changes and convert your database records into streaming events for downstream data integration consumers. When running, the Debezium Cassandra source connector will:

  1. Read the Cassandra commit log files in the cdc_raw directory for the inserts, updates, and deletes logged in each node’s commit log.
  2. Create a change event for every row-level insert, update, and delete.
  3. For each table, publish change events in a separate Kafka topic. In practice this means that each CDC enabled table in your Cassandra cluster will have its own Kafka topic.
  4. Delete the commit log from the cdc_raw directory. 
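
As referenced above, enabling CDC on a table is a one-line schema change. A minimal CQL sketch with hypothetical keyspace and table names:

CREATE TABLE shop.orders (id uuid PRIMARY KEY, total decimal) WITH cdc = true;
ALTER TABLE shop.legacy_orders WITH cdc = true;

Here shop.orders is created with CDC from the start, while shop.legacy_orders has it switched on after the fact; cdc_enabled must also be set to true in cassandra.yaml for the CDC logs to be written.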

Apache Cassandra and Debezium Open Source Developments 

Support for CDC was introduced in Cassandra version 3.11 and has continued to mature through version 4.0. There have been notable open source contributions that improve CDC functionality, including splitting support for Cassandra version 3.11 and Cassandra 4.0 into separate modules with shared common logic in a core module, authored by NetApp employee Stefan Miklosovic and committed by Gunnar Morling. The need for this split was identified during development of Cassandra version 4.0 CDC support.

As support for version 4.0 was added, certain components of Cassandra version 3.11 CDC would break. The introduction of the core module now allows Debezium features and fixes to be pinned to a specific Cassandra version when released, without breaking Debezium support for other Cassandra versions. This will be particularly valuable for future development of Debezium support for Cassandra version 4.1 and each newer Cassandra version. 

Stefan has continued to enhance Cassandra 4.0 support for Debezium with additional contributions like changing the CQL schema of a node without interrupting streaming events to Kafka. Previously, propagating changes in the CQL schema back to the Debezium source connector required a restart of the Debezium connector. Stefan created a CQL Java Driver schema listener, hooked it to a running node, and as soon as somebody adds a column or a table or similar to their Cassandra cluster these changes can now be detected in Debezium streamed events with no Debezium restarts! 

Another notable improvement for Cassandra CDC in Cassandra version 4.0 is the way the Debezium source connector reads the commit log. Cassandra version 3.11 CDC would buffer the CDC change events and only publish change events in batch cycles creating a processing delay. Cassandra version 4.0 CDC now continuously processes the commit logs as new data is available, achieving near real-time publishing. 

If you already have Cassandra clusters that you want to stream into an enterprise data store, get started today using the Debezium source connector for Cassandra on Instaclustr’s Managed Platform.  

Contact our Support team to learn more about how Instaclustr’s managed Debezium connector for Cassandra can unlock the value of your new or existing Cassandra data stores.  

The post Debezium® Change Data Capture for Apache Cassandra® appeared first on Instaclustr.

Medusa 0.16 was released

The k8ssandra team is happy to announce the release of Medusa for Apache Cassandra™ v0.16. This is a special release, as we did a major overhaul of the storage classes implementation. We now have less code (and fewer dependencies) while providing much faster and more resilient storage communications.

Back to ~basics~ official SDKs

Medusa has been an open source project for about 4 years now, and a private one for a few more. Over such a long time, other software it depends upon (or doesn’t) evolves as well. More specifically, the SDKs of the major cloud providers evolved a lot. We decided to check if we could replace a lot of custom code doing asynchronous parallelisation and calls to the cloud storage CLI utilities with the official SDKs.

Our storage classes so far relied on two different ways of interacting with the object storage backends:

  • Apache Libcloud, which provided a Python API for abstracting ourselves from the different protocols. It was convenient and fast for uploading a lot of small files, but very inefficient for large transfers.
  • Specific cloud vendors CLIs, which were much more efficient with large file transfers, but invoked through subprocesses. This created an overhead that made them inefficient for small file transfers. Relying on subprocesses also created a much more brittle implementation which led the community to create a lot of issues we’ve been struggling to fix.

To cut a long story short, we did it!

  • We started by looking at S3, where we went for the official boto3. As it turns out, boto3 does all the chunking, throttling, retries and parallelisation for us. Yay!
  • Next we looked at GCP. Here we went with TalkIQ’s gcloud-aio-storage. It works very well for everything, including the large files. The only thing missing is the throughput throttling.
  • Finally, we used Azure’s official SDK to cover Azure compatibility. Sadly, this still works without throttling as well.

Right after finishing these replacements, we spotted the following improvements:

  • The integration tests duration against the storage backends dropped from ~45 min to ~15 min.
    • This means Medusa became far more efficient.
    • There is now much less time spent managing storage interaction thanks to it being asynchronous to the core.
  • The Medusa uncompressed image size we bundle into k8ssandra dropped from ~2GB to ~600MB and its build time went from 2 hours to about 15 minutes.
    • Aside from giving us much faster feedback loops when working on k8ssandra, this should help k8ssandra itself move a little bit faster.
  • The file transfers are now much faster.
    • We observed up to several hundreds of MB/s per node when moving data from a VM to blob storage within the same provider. The available network speed is the limit now.
    • We are also aware that consuming the whole network throughput is not great. That’s why we now have proper throttling for S3 and are working on a solution for this in other backends too.

The only compromise we had to make was to drop Python 3.6 support. This is because the asyncio features we rely on only arrived in Python 3.7.

The other good stuff

Even though we are the happiest about the storage backends, there is a number of changes that should not go without mention:

  • We fixed a bug with hierarchical storage containers in Azure. This flavor of blob storage works more like a regular file system, meaning it has a concept of directories. None of the other backends do this (including the vanilla Azure ones), and Medusa was not dealing gracefully with this.
  • We are now able to build Medusa images for multiple architectures, including the arm64 one.
  • Medusa can now purge backups of nodes that have been decommissioned, meaning they are no longer present in the most recent backups. Use the new medusa purge-decommissioned command to trigger such a purge.

Upgrade now

We encourage all Medusa users to upgrade to version 0.16 to benefit from all these storage improvements, which make it much faster and more reliable.

Medusa v0.16 is the default version in the newly released k8ssandra-operator v1.9.0, and it can be used with previous releases by setting the .spec.medusa.containerImage.tag field in your K8ssandraCluster manifests.
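
If you want to pin the image on an older operator release, here is a hedged sketch of what that field looks like in a K8ssandraCluster manifest (the cluster name and exact tag value are illustrative):

apiVersion: k8ssandra.io/v1alpha1
kind: K8ssandraCluster
metadata:
  name: demo-cluster
spec:
  medusa:
    containerImage:
      tag: "0.16.0"  # pin the Medusa image tag explicitly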

Building a 100% ScyllaDB Shard-Aware Application Using Rust

I wrote a web transcript of the talk I gave with my colleagues Joseph and Yassir at [Scylla Su...

Learning Rust the hard way for a production Kafka+ScyllaDB pipeline

This is the web version of the talk I gave at [Scylla Summit 2022](https://www.scyllad...

Reaper 3.0 for Apache Cassandra was released

The K8ssandra team is pleased to announce the release of Reaper 3.0. Let’s dive into the main new features and improvements this major version brings, along with some notable removals.

Storage backends

Over the years, we regularly discussed dropping support for Postgres and H2 with the TLP team. The effort for maintaining these storage backends was moderate, despite our lack of expertise in Postgres, as long as Reaper’s architecture was simple. Complexity grew with more deployment options, culminating with the addition of the sidecar mode.
Some features require different consensus strategies depending on the backend, which sometimes led to implementations that worked well with one backend and were buggy with others.
In order to allow building new features faster, while providing a consistent experience for all users, we decided to drop the Postgres and H2 backends in 3.0.

Apache Cassandra and the managed DataStax Astra DB service are now the only production storage backends for Reaper. The free tier of Astra DB will be more than sufficient for most deployments.

Reaper does not generally require high availability - even complete data loss has mild consequences. Where Astra is not an option, a single Cassandra server can be started on the instance that hosts Reaper, or an existing cluster can be used as a backend data store.

Adaptive Repairs and Schedules

One of the pain points we observed when people start using Reaper is understanding the segment orchestration and knowing how the default timeout impacts the execution of repairs.
Repair is a complex choreography of operations in a distributed system. As such, and especially in the days when Reaper was created, the process could get blocked for several reasons and required a manual restart. The smart folks that designed Reaper at Spotify decided to put a timeout on segments to deal with such blockages, after which the segments would be terminated and rescheduled.
Problems arise when segments are too big (or have too much entropy) to process within the default 30 minute timeout, despite not being blocked. They are repeatedly terminated and recreated, and the repair appears to make no progress.
Reaper did a poor job of dealing with this, for two main reasons:

  • Each retry used the same timeout, potentially failing segments forever
  • Nothing obvious was reported to explain what was failing and how to fix the situation

We fixed the former by using a longer timeout on subsequent retries, which is a simple trick to make repairs more “adaptive”. If the segments are too big, they’ll eventually pass after a few retries. It’s a good first step to improve the experience, but it’s not enough for scheduled repairs as they could end up with the same repeated failures for each run.
This is where we introduce adaptive schedules, which use feedback from past repair runs to adjust either the number of segments or the timeout for the next repair run.

Adaptive Schedules

Adaptive schedules will be updated at the end of each repair if the run metrics justify it. The schedule can get a different number of segments or a higher segment timeout depending on the latest run.
The rules are the following:

  • if more than 20% of segments were extended, the number of segments will be raised by 20% on the schedule
  • if less than 20% of segments were extended (and at least one), the timeout will be set to twice the current timeout
  • if no segment was extended and the maximum segment duration is below 5 minutes, the number of segments will be reduced by 10%, with a minimum of 16 segments per node.

This feature is disabled by default and is configurable on a per schedule basis. The timeout can now be set differently for each schedule, from the UI or the REST API, instead of having to change the Reaper config file and restart the process.

Incremental Repair Triggers

As we celebrate the long awaited improvements in incremental repairs brought by Cassandra 4.0, it was time to embrace them with more appropriate triggers. One metric that incremental repair makes available is the percentage of repaired data per table. When running against too much unrepaired data, incremental repair can put a lot of pressure on a cluster due to the heavy anticompaction process.

The best practice is to run it on a regular basis so that the amount of unrepaired data is kept low. Since your throughput may vary from one table/keyspace to the other, it can be challenging to set the right interval for your incremental repair schedules.

Reaper 3.0 introduces a new trigger for the incremental schedules, which is a threshold of unrepaired data. This allows creating schedules that will start a new run as soon as, for example, 10% of the data for at least one table from the keyspace is unrepaired.

Those triggers are complementary to the interval in days, which could still be necessary for low traffic keyspaces that need to be repaired to secure tombstones.

Percent unrepaired triggers

These new features allow you to safely optimize tombstone deletions by enabling the only_purge_repaired_tombstones compaction subproperty in Cassandra, making it possible to reduce gc_grace_seconds down to 3 hours without fearing that deleted data will reappear.
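
For reference, here is a hedged sketch of enabling that subproperty on a table (the keyspace, table and compaction class are illustrative - keep whatever compaction strategy the table already uses):

ALTER TABLE my_keyspace.my_table
  WITH compaction = {'class': 'SizeTieredCompactionStrategy',
                     'only_purge_repaired_tombstones': 'true'}
  AND gc_grace_seconds = 10800;  -- 3 hours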

Schedules can be edited

That may sound like an obvious feature but previous versions of Reaper didn’t allow for editing of an existing schedule. This led to an annoying procedure where you had to delete the schedule (which isn’t made easy by Reaper either) and recreate it with the new settings.

3.0 fixes that embarrassing situation and adds an edit button to schedules, which allows you to change the mutable settings of schedules:

Edit Repair Schedule

More improvements

In order to protect clusters from running mixed incremental and full repairs in older versions of Cassandra, Reaper would disallow the creation of an incremental repair run/schedule if a full repair had been created on the same set of tables in the past (and vice versa).

Now that incremental repair is safe for production use, it is necessary to allow such mixed repair types. In case of conflict, Reaper 3.0 will display a pop up informing you and allowing you to force create the schedule/run:

Force bypass schedule conflict

We’ve also added a special “schema migration mode” for Reaper, which exits after the schema has been created/upgraded. We use this mode in K8ssandra to prevent schema conflicts and to allow the schema creation to be executed in an init container that won’t be subject to liveness probes which could trigger the premature termination of the Reaper pod:

java -jar path/to/reaper.jar schema-migration path/to/cassandra-reaper.yaml

There are many other improvements and we invite all users to check the changelog in the GitHub repo.

Upgrade Now

We encourage all Reaper users to upgrade to 3.0.0, and we recommend that users carefully prepare their migration out of Postgres/H2. Note that there is no export/import feature; schedules will need to be recreated after the migration.

All instructions to download, install, configure, and use Reaper 3.0.0 are available on the Reaper website.

Certificates management and Cassandra Pt II - cert-manager and Kubernetes

The joys of certificate management

Certificate management has long been a bugbear of enterprise environments, and expired certs have been the cause of countless outages. When managing large numbers of services at scale, it helps to have an automated approach to managing certs in order to handle renewal and avoid embarrassing and avoidable downtime.

This is part II of our exploration of certificates and encrypting Cassandra. In this blog post, we will dive into certificate management in Kubernetes. This post builds on a few of the concepts in Part I of this series, where Anthony explained the components of SSL encryption.

Recent years have seen the rise of some fantastic, free, automation-first services like letsencrypt, and no one should be caught flat footed by certificate renewals in 2021. In this blog post, we will look at one Kubernetes native tool that aims to make this process much more ergonomic on Kubernetes: cert-manager.

Recap

Anthony has already discussed several points about certificates. To recap:

  1. In asymmetric encryption and digital signing processes we always have public/private key pairs. We are referring to these as the Keystore Private Signing Key (KS PSK) and Keystore Public Certificate (KS PC).
  2. Public keys can always be openly published and allow senders to communicate to the holder of the matching private key.
  3. A certificate is just a public key - and some additional fields - which has been signed by a certificate authority (CA). A CA is a party trusted by all parties to an encrypted conversation.
  4. When a CA signs a certificate, this is a way for that mutually trusted party to attest that the party holding that certificate is who they say they are.
  5. CAs themselves use public certificates (Certificate Authority Public Certificate; CA PC) and private signing keys (the Certificate Authority Private Signing Key; CA PSK) to sign certificates in a verifiable way.

The many certificates that Cassandra might be using

In a moderately complex Cassandra configuration, we might have:

  1. A root CA (cert A) for internode encryption.
  2. A certificate per node signed by cert A.
  3. A root CA (cert B) for the client-server encryption.
  4. A certificate per node signed by cert B.
  5. A certificate per client signed by cert B.

Even in a three node cluster, we can envisage a case where we must create two root CAs and 6 certificates, plus a certificate for each client application; for a total of 8+ certificates!

To compound the problem, this isn’t a one-off setup. Instead, we need to be able to rotate these certificates at regular intervals as they expire.

Ergonomic certificate management on Kubernetes with cert-manager

Thankfully, these processes are well supported on Kubernetes by a tool called cert-manager.

cert-manager is an all-in-one tool that should save you from ever having to reach for openssl or keytool again. As a Kubernetes operator, it manages a variety of custom resources (CRs) such as (Cluster)Issuers, CertificateRequests and Certificates. Critically it integrates with Automated Certificate Management Environment (ACME) Issuers, such as LetsEncrypt (which we will not be discussing today).

The workflow reduces to:

  1. Create an Issuer (via ACME, or a custom CA).
  2. Create a Certificate CR.
  3. Pick up your certificates and signing keys from the secrets cert-manager creates, and mount them as volumes in your pods’ containers.

Everything is managed declaratively, and you can reissue certificates at will simply by deleting and re-creating the certificates and secrets.
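
For example, here is a hedged sketch of forcing reissuance of the keystore secret we create later in this post (the secret name matches the manifests below; cert-manager recreates the Secret as long as the Certificate resource still exists):

# Delete the generated secret; cert-manager will notice and reissue it.
kubectl delete secret cassandra-jks-keystore -n cass-operator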

Or you can use the kubectl plugin, which allows you to run a simple kubectl cert-manager renew. We won’t discuss this in depth here; see the cert-manager documentation for more information.

Java batteries included (mostly)

At this point, Cassandra users are probably about to interject with a loud “Yes, but I need keystores and truststores, so this solution only gets me halfway”. As luck would have it, from version 0.15, cert-manager also allows you to create JKS truststores and keystores directly from the Certificate CR.

The fine print

There are two caveats to be aware of here:

  1. Most Cassandra deployment options currently available (including statefulSets, cass-operator or k8ssandra) do not currently support using a cert-per-node configuration in a convenient fashion. This is because the PodTemplate.spec portions of these resources are identical for each pod in the StatefulSet. This precludes the possibility of adding per-node certs via environment or volume mounts.
  2. There are currently some open questions about how to rotate certificates without downtime when using internode encryption.
    • Our current recommendation is to use a CA PC per Cassandra datacenter (DC) and add some basic scripts to merge both CA PCs into a single truststore to be propagated across all nodes (a sketch of the merge follows this list). By renewing the CA PCs independently you can ensure one DC is always online, but you still suffer a network partition. Hinted handoff should theoretically rescue the situation, but it is a less than robust solution, particularly on larger clusters. This approach is not recommended when using lightweight transactions or non-LOCAL consistency levels.
    • One mitigation to consider is using non-expiring CA PCs, in which case no CA PC rotation is ever performed without a manual trigger. KS PCs and KS PSKs may still be rotated. When CA PC rotation is essential this approach allows for careful planning ahead of time, but it is not always possible when using a 3rd party CA.
    • Istio or other service mesh approaches can fully automate mTLS in clusters, but Istio is a fairly large commitment and can create its own complexities.
    • Manual management of certificates may be possible using a secure vault (e.g. HashiCorp vault), sealed secrets, or similar approaches. In this case, cert manager may not be involved.

These caveats are not trivial. To address (2) more elegantly you could also implement Anthony’s solution from part one of this blog series; but you’ll need to script this up yourself to suit your k8s environment.

We are also in discussions with the folks over at cert-manager about how their ecosystem can better support Cassandra. We hope to report progress on this front over the coming months.

These caveats present challenges, but there are also specific cases where they matter less.

cert-manager and Reaper - a match made in heaven

One case where we really don’t care if a client is unavailable for a short period is when Reaper is the client.

Cassandra is an eventually consistent system and suffers from entropy. Data on nodes can become out of sync with other nodes due to transient network failures, node restarts and the general wear and tear incurred by a server operating 24/7 for several years.

Cassandra contemplates that this may occur. It provides a variety of consistency level settings allowing you to control how many nodes must agree for a piece of data to be considered the truth. But even though properly set consistency levels ensure that the data returned will be accurate, the process of reconciling data across the network degrades read performance - it is best to have consistent data on hand when you go to read it.

As a result, we recommend the use of Reaper, which runs as a Cassandra client and automatically repairs the cluster in a slow trickle, ensuring that a high volume of repairs are not scheduled all at once (which would overwhelm the cluster and degrade the performance of real clients) while also making sure that all data is eventually repaired for when it is needed.

The set up

The manifests for this blog post can be found here.

Environment

We assume that you’re running Kubernetes 1.21, and we’ll be running with a Cassandra 3.11.10 install. The demo environment we’ll be setting up is a three node Cassandra cluster, which is the configuration we have tested against.

We will be installing the cass-operator and Cassandra cluster into the cass-operator namespace, while the cert-manager operator will sit within the cert-manager namespace.

Setting up kind

For testing, we often use kind to provide a local k8s cluster. You can use minikube or whatever solution you prefer (including a real cluster running on GKE, EKS, or AKS), but we’ll include some kind instructions and scripts here to ease the way.

If you want a quick fix to get you started, try running the setup-kind-multicluster.sh script from the k8ssandra-operator repository, with setup-kind-multicluster.sh --kind-worker-nodes 3. I have included this script in the root of the code examples repo that accompanies this blog.

A demo CA certificate

We aren’t going to use LetsEncrypt for this demo, firstly because ACME certificate issuance has some complexities (including needing a DNS or a publicly hosted HTTP server) and secondly because I want to reinforce that cert-manager is useful to organisations who are bringing their own certs and don’t need one issued. This is especially useful for on-prem deployments.

First off, create a new private key and certificate pair for your root CA. Note that the file names tls.crt and tls.key will become important in a moment.

openssl genrsa -out manifests/demoCA/tls.key 4096
openssl req -new -x509 -key manifests/demoCA/tls.key -out manifests/demoCA/tls.crt -subj "/C=AU/ST=NSW/L=Sydney/O=Global Security/OU=IT Department/CN=example.com"

(Or you can just run the generate-certs.sh script in the manifests/demoCA directory - ensure you run it from the root of the project so that the secrets appear in ./manifests/demoCA/.)

When running this process on MacOS, be aware of this issue which affects the creation of self-signed certificates. The repo referenced in this blog post contains example certificates which you can use for demo purposes - but do not use these outside your local machine.

Now we’re going to use kustomize (which comes with kubectl) to add these files to Kubernetes as secrets. kustomize is not a templating language like Helm. But it fulfills a similar role by allowing you to build a set of base manifests that are then bundled, and which can be customised for your particular deployment scenario by patching.

Run kubectl apply -k manifests/demoCA. This will build the secrets resources using the kustomize secretGenerator and add them to Kubernetes. Breaking this process down piece by piece:

# ./manifests/demoCA
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: cass-operator
generatorOptions:
 disableNameSuffixHash: true
secretGenerator:
- name: demo-ca
  type: tls
  files:
  - tls.crt
  - tls.key
  • We use disableNameSuffixHash, because otherwise kustomize will add hashes to each of our secret names. This makes it harder to build these deployments one component at a time.
  • The tls type secret conventionally takes two keys with these names, as per the next point. cert-manager expects a secret in this format in order to create the Issuer which we will explain in the next step.
  • We are adding the files tls.crt and tls.key. The file names will become the keys of a secret called demo-ca.

cert-manager

cert-manager can be installed by running kubectl apply -f https://github.com/jetstack/cert-manager/releases/download/v1.5.3/cert-manager.yaml. It will install into the cert-manager namespace because a Kubernetes cluster should only ever have a single cert-manager operator installed.

cert-manager will install a deployment, as well as various custom resource definitions (CRDs) and webhooks to deal with the lifecycle of the Custom Resources (CRs).

A cert-manager Issuer

Issuers come in various forms. Today we’ll be using a CA Issuer because our components need to trust each other, but don’t need to be trusted by a web browser.

Other options include ACME based Issuers compatible with LetsEncrypt, but these would require that we have control of a public facing DNS or HTTP server, and that isn’t always the case for Cassandra, especially on-prem.

Dive into the truststore-keystore directory and you’ll find the Issuer. It is very simple: the only thing to note is that it takes a secret which has keys of tls.crt and tls.key - the secret you pass in must have these keys. These are the CA PC and CA PSK we mentioned earlier.
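
For orientation, a minimal sketch of what a CA Issuer of this shape looks like (the metadata here is illustrative; the secretName matches the demo-ca secret we generated earlier):

apiVersion: cert-manager.io/v1
kind: Issuer
metadata:
  name: ca-issuer
  namespace: cass-operator
spec:
  ca:
    secretName: demo-ca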

We’ll apply this manifest to the cluster in the next step.

Some cert-manager certs

Let’s start with the Cassandra-Certificate.yaml resource:

spec:
  # Secret names are always required.
  secretName: cassandra-jks-keystore
  duration: 2160h # 90d
  renewBefore: 360h # 15d
  subject:
    organizations:
    - datastax
  dnsNames:
  - dc1.cass-operator.svc.cluster.local
  isCA: false
  usages:
    - server auth
    - client auth
  issuerRef:
    name: ca-issuer
    # We can reference ClusterIssuers by changing the kind here.
    # The default value is `Issuer` (i.e. a locally namespaced Issuer)
    kind: Issuer
    # This is optional since cert-manager will default to this value however
    # if you are using an external issuer, change this to that `Issuer` group.
    group: cert-manager.io
  keystores:
    jks:
      create: true
      passwordSecretRef: # Password used to encrypt the keystore
        key: keystore-pass
        name: jks-password
  privateKey:
    algorithm: RSA
    encoding: PKCS1
    size: 2048

The first part of the spec here tells us a few things:

  • The keystore, truststore and certificates will be fields within a secret called cassandra-jks-keystore. This secret will end up holding our KS PSK and KS PC.
  • It will be valid for 90 days.
  • 15 days before expiry, it will be renewed automatically by cert manager, which will contact the Issuer to do so.
  • It has a subject organisation. You can add any of the X509 subject fields here, but it needs to have one of them.
  • It has a DNS name - you could also provide a URI or IP address. In this case we have used the service address of the Cassandra datacenter which we are about to create via the operator. This has a format of <DC_NAME>.<NAMESPACE>.svc.cluster.local.
  • It is not a CA (isCA), and can be used for server auth or client auth (usages). You can tune these settings according to your needs. If you make your cert a CA you can even reference it in a new Issuer, and define cute tree like structures (if you’re into that).

Outside the certificates themselves, there are additional settings controlling how they are issued and what format this happens in.

  • issuerRef is used to define the Issuer we want to use to issue the certificate. The Issuer will sign the certificate with its CA PSK.
  • We are specifying that we would like a keystore created with the keystore key, and that we’d like it in jks format with the corresponding key.
  • passwordSecretRef references a secret and a key within it. It will be used to provide the password for the keystore (the truststore is unencrypted as it contains only public certs and no signing keys).

The Reaper-Certificate.yaml is similar in structure, but has a different DNS name. We aren’t configuring Cassandra to verify that the DNS name on the certificate matches the DNS name of the parties in this particular case.

Apply all of the certs and the Issuer using kubectl apply -k manifests/truststore-keystore.

Cass-operator

Examining the cass-operator directory, we’ll see that there is a kustomization.yaml which references the remote cass-operator repository and a local cassandraDatacenter.yaml. This applies the manifests required to run up a cass-operator installation namespaced to the cass-operator namespace.

Note that this installation of the operator will only watch its own namespace for CassandraDatacenter CRs. So if you create a DC in a different namespace, nothing will happen.

We will apply these manifests in the next step.

CassandraDatacenter

Finally, the CassandraDatacenter resource in the ./cass-operator/ directory will describe the kind of DC we want:

apiVersion: cassandra.datastax.com/v1beta1
kind: CassandraDatacenter
metadata:
  name: dc1
spec:
  clusterName: cluster1
  serverType: cassandra
  serverVersion: 3.11.10
  managementApiAuth:
    insecure: {}
  size: 1
  podTemplateSpec:
    spec:
      containers:
        - name: "cassandra"
          volumeMounts:
          - name: certs
            mountPath: "/crypto"
      volumes:
      - name: certs
        secret:
          secretName: cassandra-jks-keystore
  storageConfig:
    cassandraDataVolumeClaimSpec:
      storageClassName: standard
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 5Gi
  config:
    cassandra-yaml:
      authenticator: org.apache.cassandra.auth.AllowAllAuthenticator
      authorizer: org.apache.cassandra.auth.AllowAllAuthorizer
      role_manager: org.apache.cassandra.auth.CassandraRoleManager
      client_encryption_options:
        enabled: true
        # If enabled and optional is set to true encrypted and unencrypted connections are handled.
        optional: false
        keystore: /crypto/keystore.jks
        keystore_password: dc1
        require_client_auth: true
        # Set trustore and truststore_password if require_client_auth is true
        truststore: /crypto/truststore.jks
        truststore_password: dc1
        protocol: TLS
        # cipher_suites: [TLS_RSA_WITH_AES_128_CBC_SHA] # An earlier version of this manifest configured cipher suites but the proposed config was less secure. This section does not need to be modified.
      server_encryption_options:
        internode_encryption: all
        keystore: /crypto/keystore.jks
        keystore_password: dc1
        truststore: /crypto/truststore.jks
        truststore_password: dc1
    jvm-options:
      initial_heap_size: 800M
      max_heap_size: 800M
  • We provide a name for the DC - dc1.
  • We provide a name for the cluster - the DC would join other DCs if they already exist in the k8s cluster and we configured the additionalSeeds property.
  • We use the podTemplateSpec.volumes array to declare the volumes for the Cassandra pods, and we use the podTemplateSpec.containers.volumeMounts array to describe where and how to mount them.

The config.cassandra-yaml field is where most of the encryption configuration happens, and we are using it to enable both internode and client-server encryption, which both use the same keystore and truststore for simplicity. Remember that using internode encryption means your DC needs to go offline briefly for a full restart when the CA’s keys rotate.

  • We are not using authz/n in this case to keep things simple. Don’t do this in production.
  • For both encryption types we need to specify (1) the keystore location, (2) the truststore location and (3) the passwords for the keystores. The locations of the keystore/truststore come from where we mounted them in volumeMounts.
  • We are specifying JVM options just to make this run politely on a smaller machine. You would tune this for a production deployment.

Roll out the cass-operator and the CassandraDatacenter using kubectl apply -k manifests/cass-operator. Because the CRDs might take a moment to propagate, there is a chance you’ll see errors stating that the resource type does not exist. Just keep re-applying until everything works - this is a declarative system so applying the same manifests multiple times is an idempotent operation.

Reaper deployment

The k8ssandra project offers a Reaper operator, but for simplicity we are using a simple deployment (because not every deployment needs an operator). The deployment is standard kubernetes fare, and if you want more information on how these work you should refer to the Kubernetes docs.

We are injecting the keystore and truststore passwords into the environment here, to avoid placing them in the manifests. cass-operator does not currently support this approach without an initContainer to pre-process the cassandra.yaml using envsubst or a similar tool.
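
If you do want the same pattern on the Cassandra side, a hypothetical initContainer step could render the passwords into the config before startup (paths and variable names are illustrative):

# Substitute only the listed variables into the template, then hand the result to Cassandra.
envsubst '$KEYSTORE_PASS $TRUSTSTORE_PASS' \
    < /config-templates/cassandra.yaml > /config/cassandra.yaml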

The only other note is that we are also pulling down a Cassandra image and using it in an initContainer to create a keyspace for Reaper, if it does not exist. In this container, we are also adding a ~/.cassandra/cqlshrc file under the home directory. This provides SSL connectivity configurations for the container. The critical part of the cqlshrc file that we are adding is:

[ssl]
certfile = /crypto/ca.crt
validate = true
userkey = /crypto/tls.key
usercert = /crypto/tls.crt
version = TLSv1_2

The version = TLSv1_2 tripped me up a few times, as it seems to be a recent requirement. Failing to add this line will give you back the rather fierce Last error: [SSL] internal error in the initContainer. The commands run in this container are not ideal. In particular, the fact that we are sleeping for 840 seconds to wait for Cassandra to start is sloppy. In a real deployment we’d want to health check and wait until the Cassandra service became available.
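
As a rough illustration, a wait loop along these lines would be more robust than a fixed sleep (it reuses the cqlshrc above; the hostname is an assumption based on the DC service address we used in the Certificate):

# Poll until Cassandra answers a trivial query over SSL, then proceed.
until cqlsh --ssl dc1.cass-operator.svc.cluster.local -e 'DESCRIBE KEYSPACES' > /dev/null 2>&1; do
  echo "Waiting for Cassandra to become available..."
  sleep 10
done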

Apply the manifests using kubectl apply -k manifests/reaper.

Results

If you use a GUI, look at the logs for Reaper; you should see that it has connected to the cluster and provided some nice ASCII art to your console.

If you don’t use a GUI, you can run kubectl get pods -n cass-operator to find your Reaper pod (which we’ll call REAPER_PODNAME) and then run kubectl logs -n cass-operator REAPER_PODNAME to pull the logs.

Conclusion

While the above might seem like a complex procedure, we’ve just created a Cassandra cluster with both client-server and internode encryption enabled, all of the required certs, and a Reaper deployment which is configured to connect using the correct certs. Not bad.

Do keep in mind the weaknesses relating to key rotation, and watch this space for progress on that front.

Cassandra Certificate Management Part 1 - How to Rotate Keys Without Downtime

Welcome to this three part blog series where we dive into the management of certificates in an Apache Cassandra cluster. For this first post in the series we will focus on how to rotate keys in an Apache Cassandra cluster without downtime.

Usability and Security at Odds

If you have downloaded and installed a vanilla installation of Apache Cassandra, you may have noticed that when it is first started all security is disabled. Your “Hello World” application works out of the box because the Cassandra project chose usability over security. This is deliberate so that everyone benefits from the usability, as security requirements differ for each deployment. Some deployments require multiple layers of security, while others require no security features to be enabled.

Security of a system is applied in layers. For example one layer is isolating the nodes in a cluster behind a proxy. Another layer is locking down OS permissions. Encrypting connections between nodes, and between nodes and the application is another layer that can be applied. If this is the only layer applied, it leaves other areas of a system insecure. When securing a Cassandra cluster, we recommend pursuing an informed approach which offers defence-in-depth. Consider additional aspects such as encryption at rest (e.g. disk encryption), authorization, authentication, network architecture, and hardware, host and OS security.

Encrypting connections between two hosts can be difficult to set up as it involves a number of tools and commands to generate the necessary assets for the first time. We covered this process in previous posts: Hardening Cassandra Step by Step - Part 1 Inter-Node Encryption and Hardening Cassandra Step by Step - Part 2 Hostname Verification for Internode Encryption. I recommend reading both posts before reading through the rest of the series, as we will build off concepts explained in them.

Here is a quick summary of the basic steps to create the assets necessary to encrypt connections between two hosts; a sketch of the keytool commands for steps 2, 3, 5 and 6 follows the list.

  1. Create the Root Certificate Authority (CA) key pair from a configuration file using openssl.
  2. Create a keystore for each host (node or client) using keytool.
  3. Export the Public Certificate from each host keystore as a “Signing Request” using keytool.
  4. Sign each Public Certificate “Signing Request” with our Root CA to generate a Signed Certificate using openssl.
  5. Import the Root CA Public Certificate and the Signed Certificate into each keystore using keytool.
  6. Create a common truststore and import the CA Public Certificate into it using keytool.
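
A hedged sketch of the keytool side of these steps (aliases, file names, passwords and validity are illustrative; the openssl commands for steps 1 and 4 are shown later in this post):

# Step 2: create a keystore (key pair) for a host.
$ keytool -genkeypair -keyalg RSA -keysize 2048 -validity 365 \
      -alias node1 -keystore node1-keystore.jks \
      -storepass mypass -keypass mypass \
      -dname "CN=node1, OU=SSLTestCluster, O=TLP, L=Sydney, ST=NSW, C=AU"

# Step 3: export the host's public certificate as a "Signing Request".
$ keytool -certreq -alias node1 -keystore node1-keystore.jks \
      -storepass mypass -file node1.csr

# Step 5: import the Root CA PC and the signed certificate into the keystore.
$ keytool -importcert -alias rootca -file ca_pc \
      -keystore node1-keystore.jks -storepass mypass -noprompt
$ keytool -importcert -alias node1 -file node1_signed_pc \
      -keystore node1-keystore.jks -storepass mypass -noprompt

# Step 6: create a common truststore containing the Root CA PC.
$ keytool -importcert -alias rootca -file ca_pc \
      -keystore common-truststore.jks -storepass trustpass -noprompt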

Security Requires Ongoing Maintenance

Setting up SSL encryption for the various connections to Cassandra is only half the story. Like all other software out in the wild, there is ongoing maintenance required to ensure the SSL encrypted connections continue to work.

At some point you will need to update the certificates and stores used to implement the SSL encrypted connections because they will expire. If the certificates for a node expire, it will be unable to communicate with other nodes in the cluster. This will lead to data inconsistencies at the very least or, in the worst case, unavailable data.

This point is specifically called out towards the end of the Inter-Node Encryption blog post. The note refers to steps 1, 2 and 4 in the above summary of commands to set up the certificates and stores. The validity periods are set for the certificates and stores in their respective steps.

One Certificate Authority to Rule Them All

Before we jump into how we handle expiring certificates and stores in a cluster, we first need to understand the role a certificate plays in securing a connection.

Certificates (and encryption) are often considered a hard topic. However, there are only a few concepts that you need to bear in mind when managing certificates.

Consider the case where two parties A and B wish to communicate with one another. Both parties distrust each other and each needs a way to prove that they are who they claim to be, as well as verify the other party is who they claim to be. To do this a mutually trusted third party needs to be brought in. In our case the trusted third party is the Certificate Authority (CA); often referred to as the Root CA.

The Root CA is effectively just a key pair; similar to an SSH key pair. The main difference is the public portion of the key pair has additional fields detailing who the public key belongs to. It has the following two components.

  • Certificate Authority Private Signing Key (CA PSK) - Private component of the CA key pair. Used to sign a keystore’s public certificate.
  • Certificate Authority Public Certificate (CA PC) - Public component of the CA key pair. Used to provide the issuer name when signing a keystore’s public certificate, as well as by a node to confirm that a third party public certificate (when presented) has been signed by the Root CA PSK.

When you run openssl to create your CA key pair using a certificate configuration file, this is the command that is run.

$ openssl req \
      -config path/to/ca_certificate.config \
      -new \
      -x509 \
      -keyout path/to/ca_psk \
      -out path/to/ca_pc \
      -days <valid_days>

In the above command the -keyout specifies the path to the CA PSK, and the -out specifies the path to the CA PC.

And in the Darkness Sign Them

In addition to a common Root CA key pair, each party has its own certificate key pair to uniquely identify it and to encrypt communications. In the Cassandra world, two components are used to store the information needed to perform the above verification check and communication encryption; the keystore and the truststore.

The keystore contains a key pair which is made up of the following two components.

  • Keystore Private Signing Key (KS PSK) - Hidden in keystore. Used to sign messages sent by the node, and decrypt messages received by the node.
  • Keystore Public Certificate (KS PC) - Exported for signing by the Root CA. Used by a third party to encrypt messages sent to the node that owns this keystore.

When created, the keystore will contain the KS PC and the KS PSK. The KS PC signed by the Root CA, along with the CA PC, are added to the keystore in subsequent operations to complete the trust chain. The certificates are always public and are presented to other parties, while the PSK always remains secret. In an asymmetric/public key encryption system, messages can be encrypted with the PC but can only be decrypted using the PSK. In this way, a node can initiate encrypted communications without needing to share a secret.

The truststore stores one or more CA PCs of the parties which the node has chosen to trust, since they are the source of trust for the cluster. If a party tries to communicate with the node, it will refer to its truststore to see if it can validate the attempted communication using a CA PC that it knows about.

For a node’s KS PC to be trusted and verified by another node using the CA PC in the truststore, the KS PC needs to be signed by the Root CA key pair. Hence, the CA key pair is used to sign the KS PC of each party.

When you run openssl to sign an exported Keystore PC, this is the command that is run.

$ openssl x509 \
    -req \
    -CAkey path/to/ca_psk \
    -CA path/to/ca_pc \
    -in path/to/exported_ks_pc_sign_request \
    -out path/to/signed_ks_pc \
    -days <valid_days> \
    -CAcreateserial \
    -passin pass:<ca_psk_password>

In the above command both the Root CA PSK and CA PC are used via -CAkey and -CA respectively when signing the KS PC.

More Than One Way to Secure a Connection

Now that we have a deeper understanding of the assets used to encrypt communications, we can examine the various ways to implement SSL encryption in an Apache Cassandra cluster. Regardless of the encryption approach, the objective when applying this type of security to a cluster is to ensure:

  • Hosts (nodes or clients) can determine whether they should trust other hosts in the cluster.
  • Any intercepted communication between two hosts is indecipherable.

The three most common methods vary in both ease of deployment and resulting level of security. They are as follows.

The Cheats Way

The easiest and least secure method for rolling out SSL encryption can be done in the following way:

Generation

  • Single CA for the cluster.
  • Single truststore containing the CA PC.
  • Single keystore which has been signed by the CA.

Deployment

  • The same keystore and truststore are deployed to each node.

In this method a single Root CA and a single keystore is deployed to all nodes in the cluster. This means any node can decipher communications intended for any other node. If a bad actor gains control of a node in the cluster then they will be able to impersonate any other node. That is, compromise of one host will compromise all of them. Depending on your threat model, this approach can be better than no encryption at all. It will ensure that a bad actor with access to only the network will no longer be able to eavesdrop on traffic.

We would use this method as a stop gap to get internode encryption enabled in a cluster. The idea would be to quickly deploy internode encryption with the view of updating the deployment in the near future to be more secure.

Best Bang for Buck

Arguably the most popular and well documented method for rolling out SSL encryption is:

Generation

  • Single CA for the cluster.
  • Single truststore containing the CA PC.
  • Unique keystore for each node all of which have been signed by the CA.

Deployment

  • Each keystore is deployed to its associated node.
  • The same truststore is deployed to each node.

Similar to the previous method, this method uses a cluster wide CA. However, unlike the previous method each node will have its own keystore. Each keystore has its own certificate that is signed by a Root CA common to all nodes. The process to generate and deploy the keystores in this way is practiced widely and well documented.

We would use this method as it provides better security over the previous method. Each keystore can have its own password and host verification, which further enhances the security that can be applied.

Fort Knox

The method that offers the strongest security of the three can be rolled out in the following way:

Generation

  • Unique CA for each node.
  • A single truststore containing the Public Certificate for each of the CAs.
  • Unique keystore for each node that has been signed by the CA specific to the node.

Deployment

  • Each keystore with its unique CA PC is deployed to its associated node.
  • The same truststore is deployed to each node.

Unlike the other two methods, this one uses a Root CA per host and similar to the previous method, each node will have its own keystore. Each keystore has its own PC that is signed by a Root CA unique to the node. The Root CA PC of each node needs to be added to the truststore that is deployed to all nodes. For large cluster deployments this encryption configuration is cumbersome and will result in a large truststore being generated. Deployments of this encryption configuration are less common in the wild.

We would use this method as it provides all the advantages of the previous method and, in addition, the ability to isolate a node from the cluster. This can be done by simply rolling out a new truststore which excludes a specific node’s CA PC. In this way a compromised node could be isolated from the cluster by simply changing the truststore. Under the previous two approaches, isolating a compromised node in this fashion would require rolling out an entirely new Root CA and one or more new keystores. Furthermore, each new KS PC would need to be signed by the new Root CA.

WARNING: Ensure your Certificate Authority is secure!

Regardless of the deployment method chosen, the whole setup will depend on the security of the Root CA. Ideally both components should be secured, or at the very least the PSK needs to be secured properly after it is generated since all trust is based on it. If both components are compromised by a bad actor, then that actor can potentially impersonate another node in the cluster. The good news is, there are a variety of ways to secure the Root CA components, however that topic goes beyond the scope of this post.

The Need for Rotation

If we are following best practices when generating our CAs and keystores, they will have an expiry date. This is a good thing because it forces us to regenerate and roll out our new encryption assets (stores, certificates, passwords) to the cluster. By doing this we minimise the exposure that any one of the components has. For example, if a password for a keystore is unknowingly leaked, the password is only good up until the keystore expiry. Having a scheduled expiry reduces the chance of a security leak becoming a breach, and increases the difficulty for a bad actor to gain persistence in the system. In the worst case it limits the validity of compromised credentials.

Always Read the Expiry Label

The only catch to having an expiry date on our encryption assets is that we need to rotate (update) them before they expire. Otherwise, our data will be unavailable or may be inconsistent in our cluster for a period of time. Expired encryption assets, when forgotten, can be a silent, sinister problem. If, for example, our SSL certificates expire unnoticed we will only discover this blunder when we restart the Cassandra service. In this case the Cassandra service will fail to connect to the cluster on restart and an SSL expiry error will appear in the logs. At this point there is nothing we can do without incurring some data unavailability or inconsistency in the cluster. We will cover what to do in this case in a subsequent post. However, it is best to avoid this situation by rotating the encryption assets before they expire.

How to Play Musical Certificates

Assuming we are going to rotate our SSL certificates before they expire, we can perform this operation live on the cluster without downtime. This process requires the replication factor and consistency level to be configured to allow for a single node to be down for a short period of time in the cluster. Hence, it works best with a replication factor >= 3 and a consistency level of QUORUM or LOCAL_QUORUM (or lower), depending on the cluster configuration.

  1. Create the NEW encryption assets; NEW CA, NEW keystores, and NEW truststore, using the process described earlier.
  2. Import the NEW CA to the OLD truststore already deployed in the cluster using keytool. The OLD truststore will increase in size, as it has both the OLD and NEW CAs in it.
    $ keytool -keystore <old_truststore> -alias CARoot -importcert -file <new_ca_pc> -keypass <new_ca_psk_password> -storepass <old_truststore_password> -noprompt
    

    Where:

    • <old_truststore>: The path to the OLD truststore already deployed in the cluster. This can be just a copy of the OLD truststore deployed.
    • <new_ca_pc>: The path to the NEW CA PC generated.
    • <new_ca_psk_password>: The password for the NEW CA PSK.
    • <old_truststore_password>: The password for the OLD truststore.
  3. Deploy the updated OLD truststore to all the nodes in the cluster. Specifically, perform these steps on a single node, then repeat them on the next node until all nodes are updated. Once this step is complete, all nodes in the cluster will be able to establish connections using both the OLD and NEW CAs.
    1. Drain the node using nodetool drain.
    2. Stop the Cassandra service on the node.
    3. Copy the updated OLD truststore to the node.
    4. Start the Cassandra service on the node.
  4. Deploy the NEW keystores to their respective nodes in the cluster. Perform this operation one node at a time in the same way the OLD truststore was deployed in the previous step. Once this step is complete, all nodes in the cluster will be using their NEW SSL certificate to establish encrypted connections with each other.
  5. Deploy the NEW truststore to all the nodes in the cluster. Once again, perform this operation one node at a time in the same way the OLD truststore was deployed in Step 3.

The key to ensuring uptime during the rotation is in Steps 2 and 3. That is, we have both the OLD and the NEW CAs in the truststore and deployed on every node prior to rolling out the NEW keystores. This allows nodes to communicate regardless of whether they have the OLD or NEW keystore, because both the OLD and NEW assets are trusted by all nodes. The process still works whether our NEW CAs are per host or cluster wide. If the NEW CAs are per host, then they all need to be added to the OLD truststore.

Example Certificate Rotation on a Cluster

Now that we understand the theory, let’s see the process in action. We will use ccm to create a three node cluster running Cassandra 3.11.10 with internode encryption configured.

As a pre-cluster setup task we will generate the keystores and truststore to implement the internode encryption. Rather than carry out the steps manually to generate the stores, we have developed a script called generate_cluster_ssl_stores.sh that does the job for us.

The script requires us to supply the node IP addresses, and a certificate configuration file. Our certificate configuration file, test_ca_cert.conf has the following contents:

[ req ]
distinguished_name     = req_distinguished_name
prompt                 = no
output_password        = mypass
default_bits           = 2048

[ req_distinguished_name ]
C                      = AU
ST                     = NSW
L                      = Sydney
O                      = TLP
OU                     = SSLTestCluster
CN                     = SSLTestClusterRootCA
emailAddress           = info@thelastpickle.com

The command used to call the generate_cluster_ssl_stores.sh script is as follows.

$ ./generate_cluster_ssl_stores.sh -g -c -n 127.0.0.1,127.0.0.2,127.0.0.3 test_ca_cert.conf

Let’s break down the options in the above command.

  • -g - Generate passwords for each keystore and the truststore.
  • -c - Create a Root CA for the cluster and sign each keystore PC with it.
  • -n - List of nodes to generate keystores for.

The above command generates the following encryption assets.

$ ls -alh ssl_artifacts_20210602_125353
total 72
drwxr-xr-x   9 anthony  staff   288B  2 Jun 12:53 .
drwxr-xr-x   5 anthony  staff   160B  2 Jun 12:53 ..
-rw-r--r--   1 anthony  staff    17B  2 Jun 12:53 .srl
-rw-r--r--   1 anthony  staff   4.2K  2 Jun 12:53 127-0-0-1-keystore.jks
-rw-r--r--   1 anthony  staff   4.2K  2 Jun 12:53 127-0-0-2-keystore.jks
-rw-r--r--   1 anthony  staff   4.2K  2 Jun 12:53 127-0-0-3-keystore.jks
drwxr-xr-x  10 anthony  staff   320B  2 Jun 12:53 certs
-rw-r--r--   1 anthony  staff   1.0K  2 Jun 12:53 common-truststore.jks
-rw-r--r--   1 anthony  staff   219B  2 Jun 12:53 stores.password

With the necessary stores generated we can create our three node cluster in ccm. Prior to starting the cluster our nodes should look something like this.

$ ccm status
Cluster: 'SSLTestCluster'
-------------------------
node1: DOWN (Not initialized)
node2: DOWN (Not initialized)
node3: DOWN (Not initialized)

We can configure internode encryption in the cluster by modifying the cassandra.yaml files for each node as follows. The passwords for each store are in the stores.password file created by the generate_cluster_ssl_stores.sh script.

node1 - cassandra.yaml

...
server_encryption_options:
  internode_encryption: all
  keystore: /ssl_artifacts_20210602_125353/127-0-0-1-keystore.jks
  keystore_password: HQR6xX4XQrYCz58CgAiFkWL9OTVDz08e
  truststore: /ssl_artifacts_20210602_125353/common-truststore.jks
  truststore_password: 8dPhJ2oshBihAYHcaXzgfzq6kbJ13tQi
...

node2 - cassandra.yaml

...
server_encryption_options:
  internode_encryption: all
  keystore: /ssl_artifacts_20210602_125353/127-0-0-2-keystore.jks
  keystore_password: Aw7pDCmrtacGLm6a1NCwVGxohB4E3eui
  truststore: /ssl_artifacts_20210602_125353/common-truststore.jks
  truststore_password: 8dPhJ2oshBihAYHcaXzgfzq6kbJ13tQi
...

node3 - cassandra.yaml

...
server_encryption_options:
  internode_encryption: all
  keystore: /ssl_artifacts_20210602_125353/127-0-0-3-keystore.jks
  keystore_password: 1DdFk27up3zsmP0E5959PCvuXIgZeLzd
  truststore: /ssl_artifacts_20210602_125353/common-truststore.jks
  truststore_password: 8dPhJ2oshBihAYHcaXzgfzq6kbJ13tQi
...

Now that we have configured internode encryption in the cluster, we can start the nodes and monitor the logs to make sure they start correctly.

$ ccm node1 start && touch ~/.ccm/SSLTestCluster/node1/logs/system.log && tail -n 40 -f ~/.ccm/SSLTestCluster/node1/logs/system.log
...
$ ccm node2 start && touch ~/.ccm/SSLTestCluster/node2/logs/system.log && tail -n 40 -f ~/.ccm/SSLTestCluster/node2/logs/system.log
...
$ ccm node3 start && touch ~/.ccm/SSLTestCluster/node3/logs/system.log && tail -n 40 -f ~/.ccm/SSLTestCluster/node3/logs/system.log

In all cases we see the following message in the logs indicating that internode encryption is enabled.

INFO  [main] ... MessagingService.java:704 - Starting Encrypted Messaging Service on SSL port 7001

Once all the nodes have started, we can check the cluster status. We are looking to see that all nodes are up and in a normal state.

$ ccm node1 nodetool status

Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address    Load       Tokens       Owns (effective)  Host ID                               Rack
UN  127.0.0.1  90.65 KiB  16           65.8%             2661807a-d8d3-4bba-8639-6c0fada2ac88  rack1
UN  127.0.0.2  66.31 KiB  16           65.5%             f3db4bbe-1f35-4edb-8513-cb55a05393a7  rack1
UN  127.0.0.3  71.46 KiB  16           68.7%             46c2f4b5-905b-42b4-8bb9-563a03c4b415  rack1

We will create a NEW Root CA along with a NEW set of stores for the cluster. As part of this process, we will add the NEW Root CA PC to the OLD (current) truststore that is already in use in the cluster. Once again we can use our generate_cluster_ssl_stores.sh script to do this, including the additional step of adding the NEW Root CA PC to our OLD truststore. This can be done with the following commands.

# Make the password to our old truststore available to script so we can add the new Root CA to it.

$ export EXISTING_TRUSTSTORE_PASSWORD=$(cat ssl_artifacts_20210602_125353/stores.password | grep common-truststore.jks | cut -d':' -f2)
$ ./generate_cluster_ssl_stores.sh -g -c -n 127.0.0.1,127.0.0.2,127.0.0.3 -e ssl_artifacts_20210602_125353/common-truststore.jks test_ca_cert.conf 

We call our script using a similar command to the first time we used it. The difference now is that we are using one additional option: -e.

  • -e - Path to our OLD (existing) truststore which we will add the new Root CA PC to. This option requires us to set the OLD truststore password in the EXISTING_TRUSTSTORE_PASSWORD variable.

The above command generates the following new encryption assets. These files are located in a different directory to the old ones: the directory with the old encryption assets is ssl_artifacts_20210602_125353, and the directory with the new encryption assets is ssl_artifacts_20210603_070951.

$ ls -alh ssl_artifacts_20210603_070951
total 72
drwxr-xr-x   9 anthony  staff   288B  3 Jun 07:09 .
drwxr-xr-x   6 anthony  staff   192B  3 Jun 07:09 ..
-rw-r--r--   1 anthony  staff    17B  3 Jun 07:09 .srl
-rw-r--r--   1 anthony  staff   4.2K  3 Jun 07:09 127-0-0-1-keystore.jks
-rw-r--r--   1 anthony  staff   4.2K  3 Jun 07:09 127-0-0-2-keystore.jks
-rw-r--r--   1 anthony  staff   4.2K  3 Jun 07:09 127-0-0-3-keystore.jks
drwxr-xr-x  10 anthony  staff   320B  3 Jun 07:09 certs
-rw-r--r--   1 anthony  staff   1.0K  3 Jun 07:09 common-truststore.jks
-rw-r--r--   1 anthony  staff   223B  3 Jun 07:09 stores.password

When we look at our OLD truststore we can see that it has increased in size. Originally it was 1.0K, and it is now 2.0K after adding the new Root CA PC to it.

$ ls -alh ssl_artifacts_20210602_125353/common-truststore.jks
-rw-r--r--  1 anthony  staff   2.0K  3 Jun 07:09 ssl_artifacts_20210602_125353/common-truststore.jks

We can now roll out the updated OLD truststore. In a production Cassandra deployment we would copy the updated OLD truststore to a node and restart the Cassandra service, then repeat this process on the other nodes in the cluster, one node at a time. In our case, our locally running nodes are already pointing to the updated OLD truststore, so we only need to restart the Cassandra service.

$ for i in $(ccm status | grep UP | cut -d':' -f1); do echo "restarting ${i}" && ccm ${i} stop && sleep 3 && ccm ${i} start; done
restarting node1
restarting node2
restarting node3

After the restart, our nodes are up and in a normal state.

$ ccm node1 nodetool status

Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address    Load       Tokens       Owns (effective)  Host ID                               Rack
UN  127.0.0.1  140.35 KiB  16           100.0%            2661807a-d8d3-4bba-8639-6c0fada2ac88  rack1
UN  127.0.0.2  167.23 KiB  16           100.0%            f3db4bbe-1f35-4edb-8513-cb55a05393a7  rack1
UN  127.0.0.3  173.7 KiB  16           100.0%            46c2f4b5-905b-42b4-8bb9-563a03c4b415  rack1

Our nodes are using the updated OLD truststore which has the old Root CA PC and the new Root CA PC. This means that nodes will be able to communicate using either the old (current) keystore or the new keystore. We can now roll out the new keystore one node at a time and still have all our data available.

To do the new keystore roll out we will stop the Cassandra service, update its configuration to point to the new keystore, and then start the Cassandra service. A few notes before we start:

  • The node will need to point to the new keystore located in the directory with the new encryption assets; ssl_artifacts_20210603_070951.
  • The node will still need to use the OLD truststore, so its path will remain unchanged.

node1 - stop Cassandra service

$ ccm node1 stop
$ ccm node2 status

Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address    Load       Tokens       Owns (effective)  Host ID                               Rack
DN  127.0.0.1  140.35 KiB  16           100.0%            2661807a-d8d3-4bba-8639-6c0fada2ac88  rack1
UN  127.0.0.2  142.19 KiB  16           100.0%            f3db4bbe-1f35-4edb-8513-cb55a05393a7  rack1
UN  127.0.0.3  148.66 KiB  16           100.0%            46c2f4b5-905b-42b4-8bb9-563a03c4b415  rack1

node1 - update keystore path to point to new keystore in cassandra.yaml

...
server_encryption_options:
  internode_encryption: all
  keystore: /ssl_artifacts_20210603_070951/127-0-0-1-keystore.jks
  keystore_password: V3fKP76XfK67KTAti3CXAMc8hVJGJ7Jg
  truststore: /ssl_artifacts_20210602_125353/common-truststore.jks
  truststore_password: 8dPhJ2oshBihAYHcaXzgfzq6kbJ13tQi
...

node1 - start Cassandra service

$ ccm node1 start
$ ccm node2 status

Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address    Load       Tokens       Owns (effective)  Host ID                               Rack
UN  127.0.0.1  179.23 KiB  16           100.0%            2661807a-d8d3-4bba-8639-6c0fada2ac88  rack1
UN  127.0.0.2  142.19 KiB  16           100.0%            f3db4bbe-1f35-4edb-8513-cb55a05393a7  rack1
UN  127.0.0.3  148.66 KiB  16           100.0%            46c2f4b5-905b-42b4-8bb9-563a03c4b415  rack1

At this point we have node1 using the new keystore while node2 and node3 are using the old keystore. Our nodes are once again up and in a normal state, so we can proceed to update the certificates on node2.

node2 - stop Cassandra service

$ ccm node2 stop
$ ccm node3 status

Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address    Load       Tokens       Owns (effective)  Host ID                               Rack
UN  127.0.0.1  224.48 KiB  16           100.0%            2661807a-d8d3-4bba-8639-6c0fada2ac88  rack1
DN  127.0.0.2  188.46 KiB  16           100.0%            f3db4bbe-1f35-4edb-8513-cb55a05393a7  rack1
UN  127.0.0.3  194.35 KiB  16           100.0%            46c2f4b5-905b-42b4-8bb9-563a03c4b415  rack1

node2 - update keystore path to point to new keystore in cassandra.yaml

...
server_encryption_options:
  internode_encryption: all
  keystore: /ssl_artifacts_20210603_070951/127-0-0-2-keystore.jks
  keystore_password: 3uEjkTiR0xI56RUDyo23TENJjtMk8VbY
  truststore: /ssl_artifacts_20210602_125353/common-truststore.jks
  truststore_password: 8dPhJ2oshBihAYHcaXzgfzq6kbJ13tQi
...

node2 - start Cassandra service

$ ccm node2 start
$ ccm node3 status

Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address    Load       Tokens       Owns (effective)  Host ID                               Rack
UN  127.0.0.1  224.48 KiB  16           100.0%            2661807a-d8d3-4bba-8639-6c0fada2ac88  rack1
UN  127.0.0.2  227.12 KiB  16           100.0%            f3db4bbe-1f35-4edb-8513-cb55a05393a7  rack1
UN  127.0.0.3  194.35 KiB  16           100.0%            46c2f4b5-905b-42b4-8bb9-563a03c4b415  rack1

At this point we have node1 and node2 using the new keystore while node3 is using the old keystore. Our nodes are once again up and in a normal state, so we can proceed to update the certificates on node3.

node3 - stop Cassandra service

$ ccm node3 stop
$ ccm node1 nodetool status

Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address    Load       Tokens       Owns (effective)  Host ID                               Rack
UN  127.0.0.1  225.42 KiB  16           100.0%            2661807a-d8d3-4bba-8639-6c0fada2ac88  rack1
UN  127.0.0.2  191.31 KiB  16           100.0%            f3db4bbe-1f35-4edb-8513-cb55a05393a7  rack1
DN  127.0.0.3  194.35 KiB  16           100.0%            46c2f4b5-905b-42b4-8bb9-563a03c4b415  rack1

node3 - update keystore path to point to new keystore in cassandra.yaml

...
server_encryption_options:
  internode_encryption: all
  keystore: /ssl_artifacts_20210603_070951/127-0-0-3-keystore.jks
  keystore_password: hkjMwpn2y2aYllePAgCNzkBnpD7Vxl6f
  truststore: /ssl_artifacts_20210602_125353/common-truststore.jks
  truststore_password: 8dPhJ2oshBihAYHcaXzgfzq6kbJ13tQi
...

node3 - start Cassandra service

$ ccm node3 start
$ ccm node1 nodetool status

Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address    Load       Tokens       Owns (effective)  Host ID                               Rack
UN  127.0.0.1  225.42 KiB  16           100.0%            2661807a-d8d3-4bba-8639-6c0fada2ac88  rack1
UN  127.0.0.2  191.31 KiB  16           100.0%            f3db4bbe-1f35-4edb-8513-cb55a05393a7  rack1
UN  127.0.0.3  239.3 KiB  16           100.0%            46c2f4b5-905b-42b4-8bb9-563a03c4b415  rack1

The keystore rotation is now complete on all nodes in our cluster. However, all nodes are still using the updated OLD truststore. To ensure that our old Root CA can no longer be used to intercept messages in our cluster we need to roll out the NEW truststore to all nodes. This can be done in the same way we deployed the new keystores.

node1 - stop Cassandra service

$ ccm node1 stop
$ ccm node2 status

Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address    Load       Tokens       Owns (effective)  Host ID                               Rack
DN  127.0.0.1  225.42 KiB  16           100.0%            2661807a-d8d3-4bba-8639-6c0fada2ac88  rack1
UN  127.0.0.2  191.31 KiB  16           100.0%            f3db4bbe-1f35-4edb-8513-cb55a05393a7  rack1
UN  127.0.0.3  185.37 KiB  16           100.0%            46c2f4b5-905b-42b4-8bb9-563a03c4b415  rack1

node1 - update truststore path to point to new truststore in cassandra.yaml

...
server_encryption_options:
  internode_encryption: all
  keystore: /ssl_artifacts_20210603_070951/127-0-0-1-keystore.jks
  keystore_password: V3fKP76XfK67KTAti3CXAMc8hVJGJ7Jg
  truststore: /ssl_artifacts_20210603_070951/common-truststore.jks
  truststore_password: 0bYmrrXaKIPJQ5UrtQQTFpPLepMweaLc
...

node1 - start Cassandra service

$ ccm node1 start
$ ccm node2 status

Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address    Load       Tokens       Owns (effective)  Host ID                               Rack
UN  127.0.0.1  150 KiB    16           100.0%            2661807a-d8d3-4bba-8639-6c0fada2ac88  rack1
UN  127.0.0.2  191.31 KiB  16           100.0%            f3db4bbe-1f35-4edb-8513-cb55a05393a7  rack1
UN  127.0.0.3  185.37 KiB  16           100.0%            46c2f4b5-905b-42b4-8bb9-563a03c4b415  rack1

Now we update the truststore for node2.

node2 - stop Cassandra service

$ ccm node2 stop
$ ccm node3 nodetool status

Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address    Load       Tokens       Owns (effective)  Host ID                               Rack
UN  127.0.0.1  150 KiB    16           100.0%            2661807a-d8d3-4bba-8639-6c0fada2ac88  rack1
DN  127.0.0.2  191.31 KiB  16           100.0%            f3db4bbe-1f35-4edb-8513-cb55a05393a7  rack1
UN  127.0.0.3  185.37 KiB  16           100.0%            46c2f4b5-905b-42b4-8bb9-563a03c4b415  rack1

node2 - update truststore path to point to NEW truststore in cassandra.yaml

...
server_encryption_options:
  internode_encryption: all
  keystore: /ssl_artifacts_20210603_070951/127-0-0-2-keystore.jks
  keystore_password: 3uEjkTiR0xI56RUDyo23TENJjtMk8VbY
  truststore: /ssl_artifacts_20210603_070951/common-truststore.jks
  truststore_password: 0bYmrrXaKIPJQ5UrtQQTFpPLepMweaLc
...

node2 - start Cassandra service

$ ccm node2 start
$ ccm node3 nodetool status

Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address    Load       Tokens       Owns (effective)  Host ID                               Rack
UN  127.0.0.1  150 KiB    16           100.0%            2661807a-d8d3-4bba-8639-6c0fada2ac88  rack1
UN  127.0.0.2  294.05 KiB  16           100.0%            f3db4bbe-1f35-4edb-8513-cb55a05393a7  rack1
UN  127.0.0.3  185.37 KiB  16           100.0%            46c2f4b5-905b-42b4-8bb9-563a03c4b415  rack1

Now we update the truststore for node3.

node3 - stop Cassandra service

$ ccm node3 stop
$ ccm node1 nodetool status

Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address    Load       Tokens       Owns (effective)  Host ID                               Rack
UN  127.0.0.1  150 KiB    16           100.0%            2661807a-d8d3-4bba-8639-6c0fada2ac88  rack1
UN  127.0.0.2  208.83 KiB  16           100.0%            f3db4bbe-1f35-4edb-8513-cb55a05393a7  rack1
DN  127.0.0.3  185.37 KiB  16           100.0%            46c2f4b5-905b-42b4-8bb9-563a03c4b415  rack1

node3 - update truststore path to point to NEW truststore in cassandra.yaml

...
server_encryption_options:
  internode_encryption: all
  keystore: /ssl_artifacts_20210603_070951/127-0-0-3-keystore.jks
  keystore_password: hkjMwpn2y2aYllePAgCNzkBnpD7Vxl6f
  truststore: /ssl_artifacts_20210603_070951/common-truststore.jks
  truststore_password: 0bYmrrXaKIPJQ5UrtQQTFpPLepMweaLc
...

node3 - start Cassandra service

$ ccm node3 start
$ ccm node1 nodetool status

Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address    Load       Tokens       Owns (effective)  Host ID                               Rack
UN  127.0.0.1  150 KiB    16           100.0%            2661807a-d8d3-4bba-8639-6c0fada2ac88  rack1
UN  127.0.0.2  208.83 KiB  16           100.0%            f3db4bbe-1f35-4edb-8513-cb55a05393a7  rack1
UN  127.0.0.3  288.6 KiB  16           100.0%            46c2f4b5-905b-42b4-8bb9-563a03c4b415  rack1

The rotation of the certificates is now complete, all while having only a single node down at any one time! This process can be used for all three of the deployment variations. In addition, it can be used to move between the different deployment variations without incurring downtime.

Conclusion

Internode encryption plays an important role in securing the internal communication of a cluster. When deployed, it is crucial that certificate expiry dates be tracked so the certificates can be rotated before they expire. Failure to do so will result in unavailability and inconsistencies.
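
For example, keytool can print the validity dates of every certificate in a keystore, which can be fed into whatever expiry monitoring you use. The password placeholder below is illustrative; the generated passwords live in the stores.password file alongside the stores.

$ keytool -list -v -keystore 127-0-0-1-keystore.jks -storepass <keystore-password> | grep "Valid"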

Using the process discussed in this post and combined with the appropriate tooling, internode encryption can be easily deployed and associated certificates easily rotated. In addition, the process can be used to move between the different encryption deployments.

Regardless of the reason for using the process, it can be executed without incurring downtime in common Cassandra use cases.

Running your Database on OpenShift and CodeReady Containers

Let’s take an introductory run-through of setting up your database on OpenShift, using your own hardware and RedHat’s CodeReady Containers.

CodeReady Containers is a great way to run OpenShift K8s locally, ideal for development and testing. The steps in this blog post will require a reasonably capable machine, laptop or desktop, preferably with quad CPUs and 16GB+ RAM.

Download and Install RedHat’s CodeReady Containers

Download and install RedHat’s CodeReady Containers (version 1.27) as described in Red Hat OpenShift 4 on your laptop: Introducing Red Hat CodeReady Containers.

First configure CodeReady Containers, from the command line

❯ crc setup

…
Your system is correctly setup for using CodeReady Containers, you can now run 'crc start' to start the OpenShift cluster

Check the version is correct.

❯ crc version

…
CodeReady Containers version: 1.27.0+3d6bc39d
OpenShift version: 4.7.11 (not embedded in executable)

Then start it, entering the Pull Secret copied from the download page. Have patience here, this can take ten minutes or more.

❯ crc start

INFO Checking if running as non-root
…
Started the OpenShift cluster.

The server is accessible via web console at:
  https://console-openshift-console.apps-crc.testing 
…

The output above will include the kubeadmin password which is required in the following oc login … command.

❯ eval $(crc oc-env)
❯ oc login -u kubeadmin -p <password-from-crc-setup-output> https://api.crc.testing:6443

❯ oc version

Client Version: 4.7.11
Server Version: 4.7.11
Kubernetes Version: v1.20.0+75370d3

Open the URL https://console-openshift-console.apps-crc.testing in a browser.

Log in using the kubeadmin username and password, as used above with the oc login … command. You might need to try a few times because of the self-signed certificate used.

Once OpenShift has started and is running, you should see the following webpage

CodeReady Start Webpage

Some commands to help check status and the startup process are

❯ oc status       

In project default on server https://api.crc.testing:6443

svc/openshift - kubernetes.default.svc.cluster.local
svc/kubernetes - 10.217.4.1:443 -> 6443

View details with 'oc describe <resource>/<name>' or list resources with 'oc get all'.      

Before continuing, go to the CodeReady Containers Preferences dialog. Increase CPUs and Memory to >12 and >14GB respectively.

CodeReady Preferences dialog
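
The same limits can also be set from the command line with crc config (memory is in MiB); the values below match the recommendation above, and the cluster needs a restart for them to take effect.

❯ crc config set cpus 12
❯ crc config set memory 14336
❯ crc stop && crc start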

Create the OpenShift Local Volumes

Cassandra needs persistent volumes for its data directories. There are different ways to do this in OpenShift, from enabling local host paths in Rancher persistent volumes, to installing and using the OpenShift Local Storage Operator, and of course persistent volumes on the different cloud provider backends.

This blog post will use vanilla OpenShift volumes using folders on the master k8s node.

Go to the “Terminal” tab for the master node and create the required directories. The master node is found on the /cluster/nodes/ webpage.

Click on the node, named something like crc-m89r2-master-0, and then click on the “Terminal” tab. In the terminal, execute the following commands:

sh-4.4# chroot /host
sh-4.4# mkdir -p /mnt/cass-operator/pv000
sh-4.4# mkdir -p /mnt/cass-operator/pv001
sh-4.4# mkdir -p /mnt/cass-operator/pv002
sh-4.4# 

Persistent Volumes are to be created with affinity to the master node, as declared in the following YAML. The name of the master node can vary from installation to installation, so first download the cass-operator-1.7.0-openshift-storage.yaml file and check the node name in its nodeAffinity sections against your current CodeReady Containers instance. If your master node is not named crc-gm7cm-master-0, the sed command below replaces the name for you.
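
For orientation, each PV defined in that file is a plain local volume pinned to the master node via nodeAffinity. A minimal sketch of one such definition is shown below; the capacity, reclaim policy, and storage class mirror the oc get output further on, but the downloaded file remains the authoritative version.

apiVersion: v1
kind: PersistentVolume
metadata:
  name: server-storage-0
spec:
  capacity:
    storage: 10Gi
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Delete
  storageClassName: server-storage
  local:
    path: /mnt/cass-operator/pv000
  nodeAffinity:
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: kubernetes.io/hostname
              operator: In
              values:
                - crc-gm7cm-master-0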

❯ wget https://thelastpickle.com/files/openshift-intro/cass-operator-1.7.0-openshift-storage.yaml

# The name of your master node
❯ oc get nodes -o=custom-columns=NAME:.metadata.name --no-headers

# If it is not crc-gm7cm-master-0
❯ sed -i '' "s/crc-gm7cm-master-0/$(oc get nodes -o=custom-columns=NAME:.metadata.name --no-headers)/" cass-operator-1.7.0-openshift-storage.yaml

Create the Persistent Volumes (PV) and Storage Class (SC).

❯ oc apply -f cass-operator-1.7.0-openshift-storage.yaml

persistentvolume/server-storage-0 created
persistentvolume/server-storage-1 created
persistentvolume/server-storage-2 created
storageclass.storage.k8s.io/server-storage created

To check the existence of the PVs.

❯ oc get pv | grep server-storage

server-storage-0   10Gi   RWO    Delete   Available   server-storage     5m19s
server-storage-1   10Gi   RWO    Delete   Available   server-storage     5m19s
server-storage-2   10Gi   RWO    Delete   Available   server-storage     5m19s

To check the existence of the SC.

❯ oc get sc

NAME             PROVISIONER                    RECLAIMPOLICY   VOLUMEBINDINGMODE      ALLOWVOLUMEEXPANSION   AGE
server-storage   kubernetes.io/no-provisioner   Delete          WaitForFirstConsumer   false                  5m36s

More information on using these volumes can be found in the RedHat documentation for OpenShift volumes.

Deploy the Cass-Operator

Now create the cass-operator. Here we can use the upstream 1.7.0 version of the cass-operator. After creating (applying) the cass-operator, it is important to quickly execute the oc adm policy … commands in the following step so the pods have the privileges required and are successfully created.

❯ oc apply -f https://raw.githubusercontent.com/k8ssandra/cass-operator/v1.7.0/docs/user/cass-operator-manifests.yaml

namespace/cass-operator created
serviceaccount/cass-operator created
secret/cass-operator-webhook-config created
W0606 14:25:44.757092   27806 warnings.go:70] apiextensions.k8s.io/v1beta1 CustomResourceDefinition is deprecated in v1.16+, unavailable in v1.22+; use apiextensions.k8s.io/v1 CustomResourceDefinition
W0606 14:25:45.077394   27806 warnings.go:70] apiextensions.k8s.io/v1beta1 CustomResourceDefinition is deprecated in v1.16+, unavailable in v1.22+; use apiextensions.k8s.io/v1 CustomResourceDefinition
customresourcedefinition.apiextensions.k8s.io/cassandradatacenters.cassandra.datastax.com created
clusterrole.rbac.authorization.k8s.io/cass-operator-cr created
clusterrole.rbac.authorization.k8s.io/cass-operator-webhook created
clusterrolebinding.rbac.authorization.k8s.io/cass-operator-crb created
clusterrolebinding.rbac.authorization.k8s.io/cass-operator-webhook created
role.rbac.authorization.k8s.io/cass-operator created
rolebinding.rbac.authorization.k8s.io/cass-operator created
service/cassandradatacenter-webhook-service created
deployment.apps/cass-operator created
W0606 14:25:46.701712   27806 warnings.go:70] admissionregistration.k8s.io/v1beta1 ValidatingWebhookConfiguration is deprecated in v1.16+, unavailable in v1.22+; use admissionregistration.k8s.io/v1 ValidatingWebhookConfiguration
W0606 14:25:47.068795   27806 warnings.go:70] admissionregistration.k8s.io/v1beta1 ValidatingWebhookConfiguration is deprecated in v1.16+, unavailable in v1.22+; use admissionregistration.k8s.io/v1 ValidatingWebhookConfiguration
validatingwebhookconfiguration.admissionregistration.k8s.io/cassandradatacenter-webhook-registration created

❯ oc adm policy add-scc-to-user privileged -z default -n cass-operator

clusterrole.rbac.authorization.k8s.io/system:openshift:scc:privileged added: "default"

❯ oc adm policy add-scc-to-user privileged -z cass-operator -n cass-operator

clusterrole.rbac.authorization.k8s.io/system:openshift:scc:privileged added: "cass-operator"

Let’s check that the deployment happened.

❯ oc get deployments -n cass-operator

NAME            READY   UP-TO-DATE   AVAILABLE   AGE
cass-operator   1/1     1            1           14m

Let’s also check that the cass-operator pod was created and is successfully running. Note that the kubectl command is used here; for all k8s actions, the oc and kubectl commands are interchangeable.

❯ kubectl get pods -w -n cass-operator

NAME                             READY   STATUS    RESTARTS   AGE
cass-operator-7675b65744-hxc8z   1/1     Running   0          15m

Troubleshooting: If the cass-operator does not end up in Running status, or if any pods in later sections fail to start, it is recommended to use the OpenShift UI Events webpage for easy diagnostics.
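
The same events can also be pulled from the command line, which may be quicker when iterating:

❯ oc get events -n cass-operator --sort-by=.lastTimestamp
❯ oc describe pod <pod-name> -n cass-operator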

Setup the Cassandra Cluster

The next step is to create the cluster. The following deployment file creates a 3-node cluster. It is largely a copy of the upstream cass-operator version 1.7.0 file example-cassdc-minimal.yaml, but with a small modification to allow all the pods to be deployed to the same worker node (as CodeReady Containers only uses one k8s node by default).
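
For orientation, the important fields of that manifest look roughly like the sketch below. The values shown (cluster1/dc1, three nodes, Cassandra 3.11.7, the server-storage storage class) come from the outputs later in this post; the resource sizes are assumptions, so treat the linked file as the authoritative version.

apiVersion: cassandra.datastax.com/v1beta1
kind: CassandraDatacenter
metadata:
  name: dc1
spec:
  clusterName: cluster1
  serverType: cassandra
  serverVersion: "3.11.7"
  size: 3
  managementApiAuth:
    insecure: {}
  # lets all three Cassandra pods schedule onto the single CRC node
  allowMultipleNodesPerWorker: true
  resources:
    requests:
      cpu: 1000m
      memory: 2Gi
    limits:
      cpu: 1000m
      memory: 2Gi
  storageConfig:
    cassandraDataVolumeClaimSpec:
      storageClassName: server-storage
      accessModes:
        - ReadWriteOnce
      resources:
        requests:
          storage: 5Gi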

❯ oc apply -n cass-operator -f https://thelastpickle.com/files/openshift-intro/cass-operator-1.7.0-openshift-minimal-3.11.yaml

cassandradatacenter.cassandra.datastax.com/dc1 created

Let’s watch the pods get created, initialise, and eventually reach the Running state, using the kubectl get pods … watch command.

❯ kubectl get pods -w -n cass-operator

NAME                             READY   STATUS    RESTARTS   AGE
cass-operator-7675b65744-28fhw   1/1     Running   0          102s
cluster1-dc1-default-sts-0       0/2     Pending   0          0s
cluster1-dc1-default-sts-1       0/2     Pending   0          0s
cluster1-dc1-default-sts-2       0/2     Pending   0          0s
cluster1-dc1-default-sts-0       2/2     Running   0          3m
cluster1-dc1-default-sts-1       2/2     Running   0          3m
cluster1-dc1-default-sts-2       2/2     Running   0          3m

Use the Cassandra Cluster

With the Cassandra pods each up and running, the cluster is ready to be used. Test it out using the nodetool status command.

❯ kubectl -n cass-operator exec -it cluster1-dc1-default-sts-0 -- nodetool status

Defaulting container name to cassandra.
Use 'kubectl describe pod/cluster1-dc1-default-sts-0 -n cass-operator' to see all of the containers in this pod.
Datacenter: dc1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address      Load       Tokens       Owns (effective)  Host ID                               Rack
UN  10.217.0.73  84.42 KiB  1            83.6%             672baba8-9a05-45ac-aad1-46427027b57a  default
UN  10.217.0.72  70.2 KiB   1            65.3%             42758a86-ea7b-4e9b-a974-f9e71b958429  default
UN  10.217.0.71  65.31 KiB  1            51.1%             2fa73bc2-471a-4782-ae63-5a34cc27ab69  default

The above command can be run on `cluster1-dc1-default-sts-1` and `cluster1-dc1-default-sts-2` too.

Next, test out cqlsh. For this, authentication is required, so first get the CQL username and password.

# Get the cql username
❯ kubectl -n cass-operator get secret cluster1-superuser -o yaml | grep " username" | awk -F" " '{print $2}' | base64 -d && echo ""

# Get the cql password
❯ kubectl -n cass-operator get secret cluster1-superuser -o yaml | grep " password" | awk -F" " '{print $2}' | base64 -d && echo ""

❯ kubectl -n cass-operator exec -it cluster1-dc1-default-sts-0 -- cqlsh -u <cql-username> -p <cql-password>

Connected to cluster1 at 127.0.0.1:9042.
[cqlsh 5.0.1 | Cassandra 3.11.7 | CQL spec 3.4.4 | Native protocol v4]
Use HELP for help.
cluster1-superuser@cqlsh>
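
As an aside, the secret lookups above can also be written with kubectl’s jsonpath output, which avoids the grep/awk plumbing:

# equivalent lookups using jsonpath
❯ kubectl -n cass-operator get secret cluster1-superuser -o jsonpath='{.data.username}' | base64 -d && echo ""
❯ kubectl -n cass-operator get secret cluster1-superuser -o jsonpath='{.data.password}' | base64 -d && echo ""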

Keep It Clean

CodeReady Containers is very simple to clean up, especially because it is a packaging of OpenShift intended only for development purposes. To wipe everything, just stop the cluster and “delete” it:

❯ crc stop
❯ crc delete

If, on the other hand, you only want to undo individual steps, each of the following can be run (in order).

❯ oc delete -n cass-operator -f https://thelastpickle.com/files/openshift-intro/cass-operator-1.7.0-openshift-minimal-3.11.yaml

❯ oc delete -f https://raw.githubusercontent.com/k8ssandra/cass-operator/v1.7.0/docs/user/cass-operator-manifests.yaml

❯ oc delete -f cass-operator-1.7.0-openshift-storage.yaml

On Scylla Manager Suspend & Resume feature


Disclaimer: This blog post is neither a rant nor intended to undermine the great work that...

Apache Cassandra's Continuous Integration Systems

With Apache Cassandra 4.0 just around the corner, and the feature freeze on trunk lifted, let’s take a dive into the efforts ongoing with the project’s testing and Continuous Integration systems.

continuous integration in open source

Every software project benefits from sound testing practices and having continuous integration in place. This is even more true for open source projects, where contributors work around the world in many different timezones, which makes the project particularly prone to broken builds, longer wait times, and uncertainty, and where contributors don’t have the same communication bandwidth with each other because they work at different companies and are scratching different itches.

This is especially true for Apache Cassandra. As an early-maturity technology used everywhere on mission-critical data, stability and reliability are crucial for deployments. Contributors from many companies, including Alibaba, Amazon, Apple, Bloomberg, Dynatrace, DataStax, Huawei, Instaclustr, Netflix, Pythian, and more, need to coordinate, collaborate, and most importantly trust each other.

During the feature freeze the project was fortunate to not just stabilise and fix tons of tests, but to also expand its continuous integration systems. This really helps set the stage for a post-4.0 roadmap that leans heavily on pluggability, developer experience, and safety, as well as aiming for an always-shippable trunk.

ci @ cassandra

The continuous integration systems at play are CircleCI and ci-cassandra.apache.org.

CircleCI is a commercial solution. The main usage of CircleCI today is pre-commit, that is, testing your patches while they get reviewed and before they get merged. Using CircleCI effectively on Cassandra requires either the medium or high resource profile, which enables the use of hundreds of containers and lots of resources, and that’s basically only available to folks working at companies that pay for a premium CircleCI account. There are lots of stages in the CircleCI pipeline, and developers just trigger the stages they feel are relevant to test a patch on.

CircleCI pipeline

ci-cassandra is our community CI. It is based on CloudBees, provided by the ASF, and runs 40 agents (servers) around the world donated by numerous different companies in our community. Its main usage is post-commit, and its pipelines run every stage automatically. Today the pipeline consists of 40K tests. And for the first time in many years, in the lead-up to 4.0, pipeline runs are completely green.

ci-cassandra pipeline

ci-cassandra is set up with a combination of Jenkins DSL scripts and declarative Jenkinsfiles. These jobs use the build scripts found here.

forty thousand tests

The project has many types of tests. It has proper unit tests, and unit tests that spin up an embedded Cassandra server. The unit tests are run with a number of different parameterisations: from different Cassandra configurations, JDK 8 and JDK 11, to supporting the ARM architecture. There are CQLSH tests written in Python against a single ccm node. Then there are the Java distributed tests and Python distributed tests. The Python distributed tests are older, use CCM, and also run parameterised. The Java distributed tests are a recent addition and run the Cassandra nodes inside the JVM. Both types of distributed tests also include testing the upgrade paths between different Cassandra versions. Most new distributed tests today are written as Java distributed tests. There are also burn and microbench (JMH) tests.
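
If you want a taste of these locally, a rough sketch is below; the paths are placeholders and the project’s own testing documentation has the full details.

# unit tests, run from an Apache Cassandra source checkout (requires ant and a JDK)
ant test

# Python distributed tests, run from a cassandra-dtest checkout with its requirements installed;
# point --cassandra-dir at a locally built Cassandra tree
pytest --cassandra-dir=/path/to/cassandra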

distributed is difficult

Testing distributed tech is hardcore. Anyone who’s tried to run the Python upgrade dtests locally knows the pain. Running the tests in Docker helps a lot, and this is what CircleCI and ci-cassandra predominantly do. The base Docker images are found here. Distributed tests can fall over for numerous reasons, exacerbated on ci-cassandra by heterogeneous servers around the world and all the possible network and disk issues that can occur. For the 4.0 release alone, over 200 Jira tickets were focused just on strengthening flaky tests. Because ci-cassandra has limited storage, the logs and test results for all runs are archived in nightlies.apache.org/cassandra.

call for help

There’s still heaps to do. This is all part-time, volunteer effort. No one in the community is dedicated full-time to these systems or working as a dedicated build engineer. The project can use all the help it can get.

There’s a ton of exciting stuff to add. Some examples are microbench and JMH reports, Jacoco test coverage reports, Harry for fuzz testing, Adelphi or Fallout for end-to-end performance and comparison testing, hooking up Apache Yetus for efficient resource usage, or putting our Jenkins stack into a k8s operator run script so you can run the pipeline on your own k8s cluster.

So don’t be afraid to jump in, pick your poison, we’d love to see you!

Reaper 2.2 for Apache Cassandra was released

We’re pleased to announce that Reaper 2.2 for Apache Cassandra was just released. This release includes a major redesign of how segments are orchestrated, which allows users to run concurrent repairs on nodes. Let’s dive into these changes and see what they mean for Reaper’s users.

New Segment Orchestration

Reaper works in a variety of standalone or distributed modes, which creates some challenges in meeting the following requirements:

  • A segment is processed successfully exactly once.
  • No more than one segment is running on a node at once.
  • Segments can only be started if the number of pending compactions on a node involved is lower than the defined threshold.

To make sure a segment won’t be handled by several Reaper instances at once, Reaper relies on LightWeight Transactions (LWT) to implement a leader election process. A Reaper instance will “take the lead” on a segment by using an LWT and then perform the checks for the last two conditions above.

To avoid race conditions between two different segments involving a common set of replicas that would start at the same time, a “master lock” was placed after the checks to guarantee that a single segment would be able to start. This required a double check to be performed before actually starting the segment.

Segment Orchestration pre 2.2 design

There were several issues with this design:

  • It involved a lot of LWTs even if no segment could be started.
  • It was a complex design which made the code hard to maintain.
  • The “master lock” was creating a lot of contention as all Reaper instances would compete for the same partition, leading to some nasty situations. This was especially the case in sidecar mode as it involved running a lot of Reaper instances (one per Cassandra node).

As we were seeing suboptimal performance and high LWT contention in some setups, we redesigned how segments were orchestrated to reduce the number of LWTs and maximize concurrency during repairs (all nodes should be busy repairing if possible).
Instead of locking segments, we explored whether it would be possible to lock nodes instead. This approach would give us several benefits:

  • We could check which nodes are already busy without issuing JMX requests to the nodes.
  • We could easily filter segments to be processed to retain only those with available nodes.
  • We could remove the master lock as we would have no more race conditions between segments.

One of the hard parts was that locking several nodes in a consistent manner would be challenging, as it would involve several rows, and Cassandra doesn’t have a concept of an atomic transaction that can be rolled back the way an RDBMS does. Luckily, we were able to leverage one feature of batch statements: all Cassandra batch statements that target a single partition turn all their operations into a single atomic one (at the node level). If one node out of all the replicas was already locked, then none would be locked by the batched LWTs. We used the following model for the leader election table on nodes:

CREATE TABLE reaper_db.running_repairs (
    repair_id uuid,
    node text,
    reaper_instance_host text,
    reaper_instance_id uuid,
    segment_id uuid,
    PRIMARY KEY (repair_id, node)
) WITH CLUSTERING ORDER BY (node ASC);

The following LWTs are then issued in a batch for each replica:

BEGIN BATCH

UPDATE reaper_db.running_repairs USING TTL 90
SET reaper_instance_host = 'reaper-host-1', 
    reaper_instance_id = 62ce0425-ee46-4cdb-824f-4242ee7f86f4,
    segment_id = 70f52bc2-7519-11eb-809e-9f94505f3a3e
WHERE repair_id = 70f52bc0-7519-11eb-809e-9f94505f3a3e AND node = 'node1'
IF reaper_instance_id = null;

UPDATE reaper_db.running_repairs USING TTL 90
SET reaper_instance_host = 'reaper-host-1',
    reaper_instance_id = 62ce0425-ee46-4cdb-824f-4242ee7f86f4,
    segment_id = 70f52bc2-7519-11eb-809e-9f94505f3a3e
WHERE repair_id = 70f52bc0-7519-11eb-809e-9f94505f3a3e AND node = 'node2'
IF reaper_instance_id = null;

UPDATE reaper_db.running_repairs USING TTL 90
SET reaper_instance_host = 'reaper-host-1',
    reaper_instance_id = 62ce0425-ee46-4cdb-824f-4242ee7f86f4,
    segment_id = 70f52bc2-7519-11eb-809e-9f94505f3a3e
WHERE repair_id = 70f52bc0-7519-11eb-809e-9f94505f3a3e AND node = 'node3'
IF reaper_instance_id = null;

APPLY BATCH;

If all the conditional updates are able to be applied, we’ll get the following data in the table:

cqlsh> select * from reaper_db.running_repairs;

 repair_id                            | node  | reaper_instance_host | reaper_instance_id                   | segment_id
--------------------------------------+-------+----------------------+--------------------------------------+--------------------------------------
 70f52bc0-7519-11eb-809e-9f94505f3a3e | node1 |        reaper-host-1 | 62ce0425-ee46-4cdb-824f-4242ee7f86f4 | 70f52bc2-7519-11eb-809e-9f94505f3a3e
 70f52bc0-7519-11eb-809e-9f94505f3a3e | node2 |        reaper-host-1 | 62ce0425-ee46-4cdb-824f-4242ee7f86f4 | 70f52bc2-7519-11eb-809e-9f94505f3a3e
 70f52bc0-7519-11eb-809e-9f94505f3a3e | node3 |        reaper-host-1 | 62ce0425-ee46-4cdb-824f-4242ee7f86f4 | 70f52bc2-7519-11eb-809e-9f94505f3a3e
 

If one of the conditional updates fails because one node is already locked for the same repair_id, then none will be applied.

Note: the Postgres backend also benefits from these new features through the use of transactions, using commit and rollback to deal with success/failure cases.

The new design is now much simpler than the initial one:

Segment Orchestration post 2.2 design

Segments are now filtered down to those that have no replica locked, to avoid wasting energy trying to lock them, and the pending compactions check also happens before any locking.

This reduces the number of LWTs by four in the simplest cases, and we expect more challenging repairs to benefit from even greater reductions:

LWT improvements

At the same time, repair duration on a 9-node cluster showed 15%-20% improvements thanks to the more efficient segment selection.

One prerequisite to make that design efficient was to store the replicas for each segment in the database when the repair run is created. You can now see which nodes are involved for each segment and which datacenter they belong to in the Segments view:

Segments view

Concurrent repairs

Using the repair id as the partition key for the node leader election table gives us another long-awaited feature: concurrent repairs. A node can be locked by different Reaper instances for different repair runs, allowing several repairs to run concurrently on each node. To control the level of concurrency, a new setting was introduced in Reaper: maxParallelRepairs. By default it is set to 2 and should be tuned carefully, as heavy concurrent repairs could have a negative impact on cluster performance. If you have small keyspaces that need to be repaired on a regular basis, they won’t be blocked by large keyspaces anymore.
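
Assuming the setting is exposed under the same name in cassandra-reaper.yaml (the config path and service name below vary by installation, so check your own setup first), raising the limit might look like:

# assumed property name and paths; verify against your cassandra-reaper.yaml before editing
sudo sed -i 's/^maxParallelRepairs:.*/maxParallelRepairs: 3/' /etc/cassandra-reaper/cassandra-reaper.yaml
sudo systemctl restart cassandra-reaper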

Future upgrades

As some of you are probably aware, JFrog has decided to sunset Bintray and JCenter. Bintray is our main distribution medium and we will be working on replacement repositories. The 2.2.0 release is unaffected by this change but future upgrades could require an update to yum/apt repos. The documentation will be updated accordingly in due time.

Upgrade now

We encourage all Reaper users to upgrade to 2.2.0. It was tested successfully by some of our customers who had issues with LWT pressure and blocking repairs. This version is expected to make repairs faster and more lightweight on the Cassandra backend. We were able to remove a lot of legacy code and design that was fit for single-token clusters but failed at spreading segments efficiently for clusters using vnodes.

The binaries for Reaper 2.2.0 are available from yum, apt-get, Maven Central, Docker Hub, and are also downloadable as tarball packages. Remember to back up your database before starting the upgrade.
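
If Reaper uses the Cassandra backend, a snapshot of its keyspace before upgrading is a cheap safety net. The keyspace name below is the one used in the examples above, and the snapshot tag is arbitrary; adjust both if yours differ.

# run on each node of the cluster backing Reaper
nodetool snapshot -t reaper_pre_2_2_0 reaper_db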

All instructions to download, install, configure, and use Reaper 2.2.0 are available on the Reaper website.