Two Accelerate Talks Certain to Cause ‘Sparks’

There are going to be so many amazing talks this year at DataStax Accelerate. But one of the talks I’m particularly excited for is “To Spark or Not to Spark” by Russell Spitzer.

I feel like this talk is the perfect pairing with my talk, “Lightening A Spark with Machine Learning”. While my talk will focus on the practical “what and how” of machine learning with Apache Spark™ and Apache Cassandra™, Russell’s will focus mostly on the “why”. Russell will cover the best and most accessible use cases for using distributed analytics with Spark and DataStax Analytics.

Russell will also talk about advanced use cases using Spark’s streaming service. I will be sitting in the front row to make sure I get to ask the first question: “How is Spark Streaming different from Apache Kafka™!?” Russell will also be covering some of the “what and how” that my talk will not be covering, such as using Spark to load data, modify tables, and move data from cluster to cluster. These are the topics I am frequently asked about, so I am excited to finally get my own questions answered!

I cannot wait to hear this talk and I think it’s the perfect pairing to my talk—like peanut butter and jelly! And of course the answer is: “TO SPARK!”

Come check out both our talks at Accelerate!



Why Anyone into Apache Cassandra Needs to Attend Accelerate

If you are building modern applications that scale in all sorts of ways, then you’re probably into Apache Cassandra. If you’re attracted to multi-data center replication, fault tolerance, tunable consistency, and the distribution of data across clusters in a masterless architecture, then you are definitely into Cassandra. As a Cassandra enthusiast, you probably need to attend DataStax Accelerate, the world’s premier Cassandra conference, which is taking place May 21–23 near Washington, D.C. This is a great opportunity to spend a couple of focused days learning from others and meeting people just like yourself.

Cassandra has come a long way since it was open sourced more than 10 years ago. I myself have a long history with Cassandra, some of it painful in a “nobody gets it” kind of way. But we all know where the story is headed!

We have collectively pushed through the fear, uncertainty, and doubt to arrive at an interesting point in the history of NoSQL databases and databases in general. The real value behind the transformational technology in Cassandra is leading the migration, intentional or not, to hybrid and multi-cloud computing environments.

Hybrid cloud popularity is building as organizations become more aware of the growing technical debt around data usage and their databases. What’s your cloud strategy? How has that radically changed in the past few years? As of today, it’s hardly ever a bulk migration to one cloud. It’s a mix of old and new, which has put a lot of burden on engineers struggling to make it work.

As an open source project built with scaling in mind, Cassandra is uniquely positioned to be the go-to database of the hybrid cloud future, and DataStax expertise will play a key role in allowing companies to take full advantage of it. We get what you are trying to do and we are here to help.

As a part of that commitment, we are hosting a gathering place for the community of engineers trying to build that future. At Accelerate, we’re looking forward to bringing together all kinds of Cassandra enthusiasts to meet one another, share ideas, and learn how some of today’s leading enterprises are using Cassandra to change the world.

You do not want to miss this event. Just take a quick peek at some Accelerate sessions to give you an idea of what to expect:

1. Cassandra at Yahoo! Japan

Learn how Yahoo! Japan uses Cassandra at scale, spinning up as many as 5,000 servers at any time. Yahoo! Japan has been a driving force for Cassandra in Japan and hosts many awesome community events. They bring a deep expertise you won’t want to miss.

2. Two Years in the Making: What’s New with Apache Cassandra 4.0?

Apache Cassandra™ 4.0 is almost here! This session explores the new features and performance improvements you can expect, as well as the thinking that inspired the architecture of the release. There is no better source of 4.0 information than Cassandra committer Dinesh Joshi.

3. 10 Easy Ways to Tune Your Cassandra Cluster

Are you getting the most out of your Cassandra cluster? Probably not. Discover how to pinpoint bottlenecks and learn 10 simple ways to boost performance while cutting costs. This is a talk by Jon Haddad, who has been a leader in Cassandra performance for years. You may see him on the user mailing list talking about this topic often. Here is your chance to meet him in person and get some firsthand knowledge.

4. Cassandra Traffic Management at Instagram

Instagram has been using Cassandra for years, and their deployment is still growing fast. Learn how the company has worked to improve its infrastructure over the years, increasing the efficiency and reliability of their clusters along the way.

5. Modern Data Architectures with Kafka and Cassandra

Modern companies are connecting Cassandra and Kafka to stream data from microservices, databases, IoT events, and other critical systems to answer real-time questions and make better decisions. Find out why Cassandra and Kafka should be core components of any modern data architecture.

6. Speed Up Your Apache Cassandra™ Applications: A Practical Guide to Reactive Programming

Everything is an event. To speed up applications, you need to think reactive. The DataStax team has been developing Cassandra drivers for years, and in our latest version of the enterprise driver, we introduced reactive programming. Learn how to migrate a CRUD Java service into reactive and bring home a working project.

7. Cassandra Use Cases for Large-Scale Family History

See how FamilySearch uses DataStax Enterprise to solve large-scale family history problems, handling three unique workloads in production.

And that’s just the tip of the iceberg. Check out other Cassandra sessions you won’t want to miss here.

See you in D.C.!



Going Head-to-Head: Scylla Cloud vs. Google Cloud Bigtable


In this article, we will compare Scylla Cloud and Google Cloud Bigtable, two managed database solutions. The TL;DR: we show that Scylla Cloud is 1/5th the cost of Cloud Bigtable under optimal conditions (a perfectly uniform distribution), and that when tested with a real-world, unoptimized data distribution, Scylla performs 26x better than Cloud Bigtable. You’ll see that Scylla manages to maintain its SLA while Cloud Bigtable fails to do so. Finally, we investigate the worst case for both offerings, where a single hot row is queried. In this case, both databases fail to meet the 90,000 ops SLA, but Scylla processes 195x more requests than Cloud Bigtable.

In this benchmark study, we will simulate a scenario in which the user has a predetermined workload with the following SLA: 90,000 operations per second (ops), half of those being reads and half being updates, that needs to survive zone failures. We set business requirements so that 95% of reads must complete in 10ms or less, and we will determine the minimum cost of running such a cluster in each of the offerings.

A Nod to Our Lineage

Competing against Cloud Bigtable is a pivotal moment for us, as Scylla is, in a way, also a descendant of Bigtable. Scylla was designed to be a drop-in-replacement for Cassandra, which, in turn, was inspired by the original Bigtable and Dynamo papers. In fact, Cassandra is described as the “daughter” of Dynamo and Bigtable. We’ve already done a comparison of Scylla versus Amazon DynamoDB. Now, in this competitive benchmark, Scylla is tackling Google Cloud Bigtable, the commercially available version of Bigtable, the database that is still used internally at Google to power their apps and services. You can see the full “family tree” in Figure 1 below.


Figure 1: The “Family Tree” for Scylla.

The Comparison

Our goal was to perform 90,000 operations per second in the cluster (50% updates, 50% reads) while keeping read latencies at 10ms or lower for 95% of the requests. We want the cluster to be present in three different zones, leading to higher availability and lower, local latencies. This is important not only to protect against entire-zone failures, which are rare, but also to reduce latency spikes and timeouts of Bigtable. This article does a good job of describing how Cloud Bigtable zone maintenance jobs can impact latencies for the application. Both offerings allow for tunable consistency settings, and we use eventual consistency.

For Google Cloud Bigtable, increasing the replication factor means adding replica clusters. Cloud Bigtable claims that each node (per replica cluster) should be able to do 10,000 operations per second at 6ms latencies, although it does not specify at which percentile and rightfully makes the disclaimer that those numbers are workload-dependent. Still, we use them as a basis for analysis and start our tests by provisioning three clusters of 9 nodes each. We will then increase the total size of the deployment by adding a node to each cluster until the desired SLAs are met.
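That provisioning arithmetic can be sketched in a few lines. This is a hedged illustration: the 10,000 ops/node figure is Cloud Bigtable's published estimate quoted above, and the rounding choice is our own assumption.

```python
# Initial Cloud Bigtable sizing from the published per-node throughput figure.
TARGET_OPS = 90_000      # workload SLA: total operations per second
OPS_PER_NODE = 10_000    # Cloud Bigtable's published per-node estimate
REPLICA_CLUSTERS = 3     # one replica cluster per zone

# Ceiling division: round up so the cluster is never under-provisioned.
nodes_per_cluster = -(-TARGET_OPS // OPS_PER_NODE)
total_nodes = nodes_per_cluster * REPLICA_CLUSTERS

print(nodes_per_cluster, total_nodes)  # 9 27
```

From there the benchmark simply adds one node per cluster at a time until the SLA is met.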

For Scylla, we will select a 3-node cluster of AWS i3.4xlarge instances. The selection is based on ScyllaDB’s recommendation of leaving 50% free space per instance and the fact that this is the smallest instance capable of holding the data generated in the population phase of the benchmark: each i3.4xlarge can store a total of 3.8TB of data and should be able to comfortably hold 1TB at 50% utilization. We will follow the same procedure as Cloud Bigtable and keep adding nodes (one per zone) until our SLA is met.

Test Results

The results are summarized in Table 1. As we can see, Cloud Bigtable is not able to meet the desired 90,000 operations per second with the initial setup of 9 nodes per cluster. We report the latencies for completeness, but they are uninteresting since the cluster was clearly bottlenecked and operating over capacity. During this run, we can verify that this is indeed the case by looking at Cloud Bigtable’s monitoring dashboard. Figure 2 shows the average CPU utilization among the nodes, already way above the recommended threshold of 70%. Figure 3 shows CPU utilization at the hottest node, which is already at the limit.

Google Cloud Bigtable is still unable to meet the desired amount of operations with clusters of 10 nodes, and is finally able to do so with 11 nodes. However, the 95th percentile for reads is above the desired goal of 10 ms so we take an extra step. With clusters of 12 nodes each, Cloud Bigtable is finally able to achieve the desired SLA.

Scylla Cloud had no issues meeting the SLA at its first attempt.

The above was actually a fast-forward version of what we encountered. Originally, we didn’t start with the perfect uniform distribution but chose the real-life-like Zipfian distribution. Over and over, we received only 3,000 operations per second instead of the desired 90,000. We thought something was wrong with the test until we cross-checked everything and switched to uniform distribution testing. Since Zipfian test results mimic real-life behavior, we ran additional tests and received the same original poor result (as described in the Zipfian section below).

For uniform distribution, the results are shown in Table 1, and the average and hottest node CPU utilizations are shown just below in Figures 2 and 3, respectively.

| Configuration | OPS | P95 read latency (µs) | P95 update latency (µs) | Cost per replica/AZ per year (USD) | Total cost for 3 replicas/AZ per year (USD) |
|---|---|---|---|---|---|
| Cloud Bigtable, 3×9 nodes (27 total) | 79,890 | 27,679 | 27,183 | $53,334.98 | $160,004.88 |
| Cloud Bigtable, 3×10 nodes (30 total) | 87,722 | 21,199 | 21,423 | $59,028.96 | $177,086.88 |
| Cloud Bigtable, 3×11 nodes (33 total) | 90,000 | 12,847 | 12,479 | $64,772.96 | $184,168.88 |
| Cloud Bigtable, 3×12 nodes (36 total) | 90,000 | 8,487 | 8,059 | $70,416.96 | $211,250.88 |
| Scylla Cloud, 3×1 nodes (3 total) | 90,000 | 5,871 | 2,042 | $14,880 | $44,640 |

Table 1: Scylla Cloud is able to meet the desired SLA with just one instance per zone (for a total of three). For Cloud Bigtable, 12 instances per cluster (a total of 36) are needed to meet the performance characteristics of our workload. For both Scylla Cloud and Cloud Bigtable, costs exclude network transfers. The Cloud Bigtable price was obtained from the Google Cloud pricing calculator and used as-is; Scylla Cloud prices were obtained from the official pricing page, with the network and backup rates subtracted. For Scylla Cloud, the price doesn’t vary with the amount of data stored up to the instance’s limit. For Cloud Bigtable, price depends on the data actually stored, up to the instance’s maximum. We use 1TB in the calculations in this table.


Figure 2: Average CPU load on a 3-cluster 9-node Cloud Bigtable instance, running Workload A


Figure 3: Hottest node CPU load on a 3-cluster 9-node Cloud Bigtable instance, running Workload A

Behavior under Real-life, Non-uniform Workloads

Both Scylla Cloud and Cloud Bigtable behave better under uniform data and request distribution, and all users are advised to strive for that. But no matter how much work is put into guaranteeing good partitioning, real-life workloads often behave differently, whether permanently or temporarily, and that affects performance in practice.

For example, a user profile application can see patterns in time where groups of users are more active than others. An IoT application tracking data for sensors can have sensors that end up accumulating more data than others, or having time periods in which data gets clustered. A famous case is the dress that broke the internet, where a lively discussion among tens of millions of Internet users about the color of a particular dress led to issues in handling traffic for the underlying database.

In this section we will keep the cluster sizes determined in the previous phase constant, and study how both offerings behave under such scenarios.

Zipfian Distribution

To simulate real-world conditions, we changed the request distribution in the YCSB loaders to a Zipfian distribution. We have kept all other parameters the same, so the loaders are still trying to send the same 90,000 requests per second (with 50% reads, 50% updates).

The Zipfian distribution was originally used to describe word frequency in languages. However, this distribution curve, known as Zipf’s Law, has also been shown to fit many other real-life scenarios. It often indicates a “rich-get-richer” self-reinforcing effect, such as a bandwagon or network effect, where the distribution of results is heavily skewed and disproportionately weighted. For instance, once a certain search result becomes popular, more people click on it, and it becomes an increasingly popular result over time. Examples include the number of “likes” different NBA teams have on social media, as shown in Figure 4, or the activity distribution among users of the Quora website.

When these sorts of skews occur, access to the database becomes extremely unequal, and performance suffers as a result. For database testing, this means keys are accessed in a heavily skewed distribution pattern, which lets us see how the database in question handles hotspot scenarios.


Figure 4: The number of “likes” NBA teams get on Facebook follows a Zipfian distribution.
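To get a feel for how skewed such access is, here is a small illustrative sketch (this is not YCSB’s actual Zipfian generator, and the key counts are made up) that samples keys from a Zipfian distribution and measures how much traffic the hottest 1% of keys receives:

```python
import random
from collections import Counter

random.seed(42)

N_KEYS = 10_000
N_REQUESTS = 100_000
S = 0.99  # skew exponent; YCSB's default Zipfian constant is similar

# Build the Zipfian CDF over key ranks 1..N_KEYS.
weights = [1 / rank ** S for rank in range(1, N_KEYS + 1)]
total = sum(weights)
cdf, acc = [], 0.0
for w in weights:
    acc += w / total
    cdf.append(acc)

def zipf_key() -> int:
    """Sample a key rank by inverting the CDF with binary search."""
    u = random.random()
    lo, hi = 0, N_KEYS - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if cdf[mid] < u:
            lo = mid + 1
        else:
            hi = mid
    return lo + 1

hits = Counter(zipf_key() for _ in range(N_REQUESTS))
top_100 = sum(count for _, count in hits.most_common(100))
print(f"top 1% of keys received {100 * top_100 / N_REQUESTS:.0f}% of requests")
```

Under a uniform distribution the top 1% of keys would receive 1% of the traffic; under Zipfian skew they receive roughly half of it, which is exactly the hotspot behavior the benchmark exercises.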

The results of our Zipfian distribution test are summarized in Table 2. Cloud Bigtable is not able to sustain answering the 90,000 requests per second that the clients send. It can process only 3,450 requests per second. Scylla Cloud, on the other hand, actually gets slightly better in its latencies. This result is surprising at first and deserves a deeper explanation. But to understand why that is the case, it will be helpful to first look at our next proposed test scenario, a single hot row. We’ll then address these surprising results in the section “Why Did Zipfian Latencies Go Down?”.

| Zipfian Distribution | Overall throughput (ops/s) | P95 read latency | P95 update latency |
|---|---|---|---|
| Google Cloud Bigtable | 3,450 | 1,601 ms | 122 ms |
| Scylla Cloud | 90,000 | 3 ms | 1 ms |

Table 2: Results of Scylla Cloud and Cloud Bigtable under the Zipfian distribution. Cloud Bigtable is not able to sustain the full 90,000 requests per second. Scylla Cloud was able to sustain 26x the throughput, and with read latencies 1/800th and write latencies less than 1/100th of Cloud Bigtable.

A Few Words about Consistency/Cost

Scylla Cloud leverages instance-local ephemeral storage for its data, meaning that with a single replica, parts of the dataset would not survive hardware failures. This means that running a single replica is not acceptable for data durability reasons, and it’s not even a choice in Scylla Cloud’s interface.

Due to its choice of network storage, Cloud Bigtable, on the other hand, does not lose any local data when an individual node fails, meaning it is reasonable to run it without any replicas. Still, if an entire zone fails, service availability will be disrupted. Also, databases under the hood have to undergo node-local maintenance, which can temporarily disrupt availability in single-replica setups, as this article does a good job of explaining.

Still, it’s fair to say that not all users require such a high level of availability and could run single-zone Cloud Bigtable clusters. These users would be forced to run more zones in Scylla Cloud in order to withstand node failure without data loss, even if availability is not a concern. However, even if we compare the total Scylla Cloud cost of $44,640.00 per year with the cost of running Cloud Bigtable in a single zone (sacrificing availability) of $70,416.96, or $140,833.92 for two zones, Scylla Cloud is still a fraction of the cost of Cloud Bigtable.

Note that we did not conduct tests to see precisely how many Cloud Bigtable nodes would be needed to achieve the same required 90,000 ops/second throughput under Zipfian distribution. We had already scaled Cloud Bigtable to 36 nodes (versus the 3 for Scylla) to simply achieve 90,000 ops/second under uniform distribution at the required latencies. However, Cloud Bigtable’s throughput under Zipfian distribution was 1/26th that of Scylla.

Theoretically, presuming Cloud Bigtable throughput continued to scale linearly under a Zipfian distribution, it would have required over 300 nodes (as a single replica/AZ), or more than 900 nodes (triple replicated), to achieve the 90,000 ops the SLA required. The annual cost for a single-replicated cluster of that scale would be approximately $1.8 million, or nearly $5.5 million if triple-replicated. That would be 41x or 123x the cost of Scylla, respectively. Presuming a middle-ground situation where Cloud Bigtable was deployed in two availability zones, the cost would be approximately $3.6 million annually, or 82x the cost of Scylla Cloud running triple-replicated. We understand these are only hypothetical projections and welcome others to run these Zipfian tests themselves to see how many nodes would be required to meet the 90,000 ops/second requirement on Cloud Bigtable.
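The arithmetic behind those hypothetical projections, using only numbers from this report and the stated (unverified) assumption of linear scaling:

```python
# Hypothetical projection only: assumes (without evidence) that Cloud Bigtable
# throughput under Zipfian load scales linearly with node count.
TARGET_OPS = 90_000
ZIPFIAN_OPS = 3_450                  # measured with 3 clusters x 12 nodes
NODES_PER_CLUSTER = 12
COST_PER_NODE_YEAR = 70_416.96 / NODES_PER_CLUSTER  # per-node cost from Table 1

scale = TARGET_OPS / ZIPFIAN_OPS                  # ~26x throughput shortfall
nodes_single = round(NODES_PER_CLUSTER * scale)   # single replica/AZ
nodes_triple = 3 * nodes_single                   # triple replicated

annual_single = nodes_single * COST_PER_NODE_YEAR   # ~$1.8M per year
annual_triple = nodes_triple * COST_PER_NODE_YEAR   # ~$5.5M per year
print(nodes_single, nodes_triple, round(annual_single), round(annual_triple))
```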

A Single Hot Row

We want to understand how each offering will behave under the extreme case of a single hot row being accessed over a certain period of time. To do that, we kept the same YCSB parameters, meaning the client still tries to send 90,000 requests per second, but set the key population of the workload to a single row. Results are summarized in Table 3.

| Single Hot Row | Overall throughput (ops/s) | P95 read latency | P95 update latency |
|---|---|---|---|
| Google Cloud Bigtable | 180 | 7,733 ms | 5,365 ms |
| Scylla Cloud | 35,400 | 73 ms | 13 ms |

Table 3: Scylla Cloud and Cloud Bigtable results when accessing a single hot row. Both databases see their throughput reduced for this worst-case scenario, but Scylla Cloud is still able to achieve 35,400 requests per second. Given the data anomaly, neither Scylla Cloud nor Cloud Bigtable were able to meet the SLA, but Scylla was able to perform much better, with nearly 200 times Cloud Bigtable’s anemic throughput and multi-second latency. This Scylla environment is 1/5th the cost of Cloud Bigtable.

We see that Cloud Bigtable is capable of processing only 180 requests per second, and the latencies, not surprisingly, shoot up. As we can see in Figure 5, from Cloud Bigtable monitoring, although the overall CPU utilization is low, the hottest node is already bottlenecked.


Figure 5: While handling a single hot row, Cloud Bigtable shows its hottest node at 100% utilization even though overall utilization is low.

Scylla Cloud is also not capable of processing requests at the rate the client sends. However, its rate drops to a baseline of 35,400 requests per second.

This result clearly shows that Scylla Cloud has a far more efficient engine underneath. While higher throughput can always be achieved by stacking processing units side by side, a more efficient engine guarantees better behavior during worst-case scenarios.

Why Did Zipfian Latencies Go Down?

We draw the reader’s attention to the fact that while the uniform distribution provides the best-case scenario for resource distribution, when the dataset is larger than memory (as it clearly is in our case) it also provides the worst-case scenario for the database’s internal cache, since most requests have to be fetched from disk.

In the Zipfian distribution case, the number of distinct keys being queried shrinks, the database’s ability to cache those values improves, and, as a result, latencies get better. Serving requests from cache is not only cheaper in general but also easier on the CPU.

With each request being cheaper due to their placement in the cache, the number of requests that each processing unit can serve also increases (note that the Single Hot Row scenario provides the best case for cache access, with a 100% hit rate). As a combination of those two effects, the busiest CPU in the Scylla Cloud nodes is working less than in the uniform case and is actually further away from the bottleneck than it was in the uniform case.
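A toy model makes the cache argument concrete. Assuming (illustratively) a cache that always holds the hottest keys and a Zipfian exponent near YCSB’s default, the expected hit rate dwarfs the uniform case; none of these parameters are measured values from the benchmark:

```python
# Toy cache model: expected hit rate when the cache pins the hottest keys.
def zipf_hit_rate(n_keys: int, cache_size: int, s: float = 0.99) -> float:
    """Fraction of Zipfian-distributed requests landing on the cached keys."""
    weights = [1 / rank ** s for rank in range(1, n_keys + 1)]
    return sum(weights[:cache_size]) / sum(weights)

N_KEYS = 1_000_000
CACHE = 10_000  # cache holds only 1% of the keyspace

uniform_hit = CACHE / N_KEYS             # 1%: nearly everything hits disk
zipf_hit = zipf_hit_rate(N_KEYS, CACHE)  # majority of requests hit cache

print(f"uniform: {uniform_hit:.1%}, zipfian: {zipf_hit:.1%}")
```

The single-hot-row case is the limit of this curve: one key, a 100% hit rate, and the whole load funneled through one processing unit.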

We don’t mean to imply that Scylla Cloud will always perform better in Zipfian cases than uniform. The results will depend on which set of many, many real-life competing factors wins. But we do believe that our results for both the Zipfian and Single Hot Row scenario clearly show that Scylla Cloud performs better under real-life conditions than the Cloud Bigtable offering.

Optimize Your Data Model

All distributed databases prefer near-perfect access distribution. In some cases, like the hot-row case, that may be impossible, and some cases are hard to plan for. However, you can optimize your data model by making the primary key a composite key, adding another column to it. For instance, if your primary key is a customer ID, you can make a composite key of the ID and another field: location, date, or sub-department. If your key is an IoT device, add the date to it at the year/month/day/hour granularity of your choice.

Data model optimization comes with a price beyond the development time: when you need to scan all of the items that belong to a customer ID, or all of the events from an IoT device, you must scan multiple composite keys, which increases read times, reduces consistency, and complicates development.
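A minimal sketch of that bucketing technique, with illustrative names (nothing here is a specific Bigtable or Scylla API):

```python
from datetime import date, timedelta

def bucketed_key(device_id: str, day: date) -> str:
    """Composite partition key: device ID plus a day bucket (illustrative)."""
    return f"{device_id}#{day.isoformat()}"

def keys_for_range(device_id: str, start: date, end: date) -> list:
    """Reading a range back now means scanning one key per bucket."""
    days = (end - start).days + 1
    return [bucketed_key(device_id, start + timedelta(d)) for d in range(days)]

# One hot device's writes now spread across 7 partitions for a week of data,
# but a week-long read has to touch all 7 keys.
week = keys_for_range("sensor-42", date(2019, 4, 1), date(2019, 4, 7))
print(week[0], len(week))  # sensor-42#2019-04-01 7
```

The write path is spread out, but every range read fans out across the buckets, which is exactly the read-time and consistency cost described above.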

Of course, it is much better when the database can offload these complexities on your behalf!


When both Scylla Cloud and Google Cloud Bigtable are exposed to a synthetic, well-behaved, uniform lab workload, Scylla Cloud is 1/5th the cost of Cloud Bigtable. However, under the more likely real-world Zipfian distribution of data, which we see in our own customer experience as well as in academic research and empirical validation, Cloud Bigtable comes nowhere near meeting the desired SLA and is therefore not practical.

To recap our results, Scylla Cloud is 1/5th the expense while providing 26x the performance of Cloud Bigtable, with better latency and better behavior in the face of hot rows. Scylla met its SLA of 90,000 OPS with low-latency responses, while Cloud Bigtable couldn’t do more than 3,450 OPS. As an exercise for the reader, try to compute the total cost of ownership of a real-life Cloud Bigtable workload at 90,000 OPS. Would it be 5x (the Scylla savings) multiplied by 26x? Even if you reduce Cloud Bigtable replication to two zones or even a single zone, the difference is enormous.

We explained how to optimize the data model with composite keys as a workaround for Bigtable’s limitation. However, that approach requires development time and makes scans more expensive, less consistent, and more complicated. You may still hit a hot row, which with Bigtable cannot return more than 180 requests per second.

Scylla Cloud vs. Google Cloud Bigtable: Bottom Line
As you can see, Scylla has a notable advantage on every metric in every test we conducted. We invite you to judge how much better it would be in a scenario similar to your own.

Cloud Bigtable is an offering available exclusively on the Google Cloud Platform (GCP), which locks the user in to Google as both the database provider and infrastructure service provider — in this case, to its own public cloud offering.

Scylla Cloud is a managed offering of the Scylla Enterprise database. It is available on Amazon Web Services (AWS) at the moment, and is soon coming to GCP and other public clouds. Beyond Scylla Cloud, Scylla also offers Open Source and Enterprise versions, which can be deployed to private clouds, on-premises, or co-location environments. While Scylla Cloud is ScyllaDB’s own public cloud offering, Scylla provides greater flexibility and does not lock users in to any particular deployment option.

Yet while vendors like us can make claims, the best judge of performance is your own assessment based on your specific use case. Give us a try on Scylla Cloud and please feel free to drop in to ask questions via our Slack channel.


In the following sections, we will provide a detailed discussion of what is behind the choices made in this comparison.

Google Cloud Bigtable Replication Settings and Test Setup

In Cloud Bigtable terminology, replication is achieved by adding additional clusters in different zones. To withstand the loss of two of them, we set three replicas, as shown below in Figure 7:

Figure 7: Example setup of Google Cloud Bigtable in three zones. A single zone is enough to guarantee that node failures will not lead to data loss, but is not enough to guarantee service availability/data survivability if the zone is down.

Cloud Bigtable allows for different consistency models, as described in its documentation, according to the cluster routing model. By default, data is eventually consistent and multi-cluster routing is used, which provides the highest availability. Since eventual consistency is our goal, we kept Cloud Bigtable at its default settings with multi-cluster routing. Figure 8 shows the explanation of routing from Cloud Bigtable’s routing selection interface.


Figure 8: Google Cloud Bigtable settings for availability. In this test, we want to guarantee maximum availability across three zones.

We verified that the cluster is indeed set up in this mode, as can be seen in Figure 9 below:


Figure 9: Google Cloud Bigtable is configured in multi-cluster mode.

Scylla Cloud Replication Settings and Test Setup

For Scylla, consistency is set on a per-request basis and is not a cluster-wide property, as described in Scylla’s documentation (referred to as “tunable consistency”). Unlike Cloud Bigtable, in Scylla’s terminology all nodes are part of the same cluster. Replication across availability zones is achieved by adding nodes present in different racks and setting the replication factor to match. We set up the Scylla cluster in a single datacenter (us-east-1) and set the number of replicas to three (RF=3), placing them in three different availability zones within the AWS us-east-1 region. This setup can be seen in Figure 10 below:


Figure 10: Scylla Cloud will be set up in 3 availability zones. To guarantee data durability, Scylla Cloud requires replicas, but that also means that any Scylla Cloud setup is able to maintain availability of service when availability zones go down.

Eventual consistency is achieved by setting the consistency of the requests to LOCAL_ONE for both reads and writes. This will cause Scylla to acknowledge write requests as successful when one of the replicas responds, and to serve reads from a single replica. Mechanisms such as hinted handoff and periodic repairs are used to guarantee that data will eventually be consistent among all replicas.

Client Setup

We used the YCSB benchmark running across multiple zones in the same region as the cluster. To achieve the desired distribution of 50% reads and 50% updates, we will use YCSB’s Workload A, which is already pre-configured to have that ratio. We will keep the workload’s default record size (1kB), and add one billion records, enough to generate approximately 1TB of data on each database.

We will then run the load for a total of 1.5 hours. At the end of the run, YCSB produces a latency report that includes the 95th-percentile latency for each client. Aggregating percentiles is a challenge on its own. Throughout this report, we will use the client that reported the highest 95th percentile as our number. While we understand this is not the “real” 95th percentile of the distribution, it at least maps well to a real-world situation where the clients are independent and we want to guarantee that no client sees a percentile higher than the desired SLA.
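That aggregation rule can be stated in a couple of lines; the per-client values below are made up purely for illustration:

```python
# Conservative aggregation of per-client p95 latencies: report the worst one.
# The per-client values are invented for illustration (microseconds).
client_p95_us = {
    "client-01": 1_840,
    "client-02": 2_042,  # the worst client defines the reported figure
    "client-03": 1_990,
}

reported_p95 = max(client_p95_us.values())
print(reported_p95)  # 2042
```

Taking the maximum is deliberately pessimistic: it guarantees no individual client observed a p95 worse than the reported number, even though it is not the true p95 of the pooled distribution.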

We will start the SLA investigation by using a uniform distribution, since this guarantees good data distribution across all processing units of each database while being adversarial to caching. This allows us to make sure that both databases are exercising their I/O subsystems and not relying only on in-memory behavior.

1. Google Cloud Bigtable Clients

Our instance has 3 clusters, all in the same region. We spawned 12 small (4 CPUs, 8GB RAM) GCE machines, 4 in each of the 3 zones where the Cloud Bigtable clusters are located. Each machine ran 1 YCSB client with 50 threads and a target of 7,500 ops per second, for a total of 90,000 operations per second. The command used is:

~/YCSB/bin/ycsb run googlebigtable -p requestdistribution=uniform -p columnfamily=cf -p google.bigtable.project.id=$PROJECT_ID -p google.bigtable.instance.id=$INSTANCE_ID -p google.bigtable.auth.json.keyfile=auth.json -p recordcount=1000000000 -p operationcount=125000000 -p maxexecutiontime=5400 -s -P ~/YCSB/workloads/$WORKLOAD -threads 50 -target 7500

2. Scylla Cloud Clients

Our Scylla cluster has 3 nodes spread across 3 different availability zones. We spawned 12 c5.xlarge machines, 4 in each AZ, each running one YCSB client with 50 threads and a target of 7,500 ops per second, for a total of 90,000 ops per second. For this benchmark, we used a YCSB version that handles prepared statements, which means all queries will be compiled only once and then reused. We also used the Scylla-optimized Java driver. Although Scylla is compatible with Apache Cassandra drivers, it ships with optimized drivers that increase performance through Scylla-specific features.

~/YCSB/bin/ycsb run cassandra-cql -p hosts=$HOSTS -p recordcount=1000000000 -p operationcount=125000000 -p requestdistribution=uniform -p cassandra.writeconsistencylevel=LOCAL_ONE -p cassandra.readconsistencylevel=LOCAL_ONE -p maxexecutiontime=5400 -s -P workloads/$WORKLOAD -threads 50 -target 7500

Availability and Consistency Models

When comparing different database offerings it is important to make sure they are both providing similar data placement availability and consistency guarantees. Stronger consistency and higher availability are often available but come at a cost, so benchmarks should take this into consideration.

Both Cloud Bigtable and Scylla have flexible settings for replication and consistency, so the first step is to understand how those are configured for each offering. Cloud Bigtable does not lose any local data when an individual node fails, meaning it is reasonable to run it without any replicas. Still, nothing will save you from an entire zone failure, so for disaster recovery it is highly recommended to replicate to at least one more availability zone.

Scylla Cloud, on the other hand, uses instance-local ephemeral storage for its data, meaning local-node data will not survive hardware failures. Running a single replica is therefore not acceptable for durability reasons, but as a nice side effect, any standard Scylla Cloud setup is already replicated across availability zones, and the service remains available in the face of an availability-zone failure.

To properly compare such different solutions, we will start from a user-driven set of requirements and compare the cost of both solutions. We will simulate a scenario in which the user wants the data available in three different zones, all in the same region. This keeps all communication low latency, yet can withstand the failure of up to two zones as long as the region itself is still available.

Across zones, we will assume a scenario in which the user aims for eventual consistency. This should lead to the best-performing setup possible for both offerings, given the restriction of maintaining three replicas that we imposed above.

The post Going Head-to-Head: Scylla Cloud vs. Google Cloud Bigtable appeared first on ScyllaDB.

Scylla Summit 2019 Call for Speakers

Scylla Summit 2019 Call for Speakers

Got a Real-Time Big Data Story to Share?

Scylla Summit 2019 is coming to San Francisco this November 5-6. Our sessions will cover real-world Scylla use cases, roadmap plans, technical product demonstrations, integrations with NoSQL and other Big Data ecosystem components, training sessions, the latest Scylla news, and much more.

Everyone has a story about their journey to find the best database solution for their specific use case. What’s yours? Are you a developer, architect, designer, or product manager leveraging the close-to-the-hardware design of Scylla or Seastar? Are you an executive, entrepreneur, or innovator faced with navigating the impact of NoSQL databases on your organization? If so, come and share your experiences to help build community and your own profile within it.

Best of all, speakers receive complimentary admission to both the 2-day summit and the pre-summit training!

What Our Attendees Most Want to Hear

Our user and customer talks are the key aspect of Scylla Summit! We’re looking for compelling case studies, technical sessions, technical and organizational best practices, and more. See below for a list of suggested topics, and feel free to recommend others. The deadline for submissions is Friday, June 28th, 2019.

Real-world Use Cases

Everything from design and architectural considerations to POCs, to production deployment and operational lessons learned, including…

  • Practical examples for vertical markets — Ad Tech, FinTech, fraud detection, customer/user data, e-commerce, media/multimedia, social, B2B or B2C apps, IoT, and more.
  • Migrations — How did you get from where you were in the past to where you are today? What were the limitations of other systems you needed to overcome? How did Scylla address those issues?
  • Hard numbers — Clusters, nodes, CPUs, RAM and disk, data size, growth, OPS, throughput, latencies, benchmark and stress test results (think graphs, charts and tables)
  • War stories — What were the hardest Big Data challenges you faced? How did you solve them? What lessons can you pass along?

Tips & Tricks

From getting started to performance optimization to disaster management, tell us your devops secrets, or unleash your chaos monkey!

  • Computer languages and dev environments — What are your favorite languages and tools? Are you a Pythonista? Doing something interesting in Golang or Rust?
  • Is this Open Source? — Got a Github repo to share? Our attendees would love to walk through your code.
  • Integrations into Big Data ecosystem — Share your stack! Kafka, Spark, other SQL and NoSQL systems; time series, graph, in-memory integrations (think “block diagrams”).
  • Seastar — The Seastar infrastructure, which is the heart of our Scylla database, can be used for other projects as well. What sort of systems architecture challenges are you tackling with Seastar?

Event Details and Call for Speakers Schedule

  • Dates: Tuesday-Wednesday, November 5-6, 2019 (Pre-Summit Training Day on November 4)
  • Location: Parc 55 Hotel, 55 Cyril Magnin Street, San Francisco, CA
  • Speaker Submission Deadline: Friday, June 28th, 2019, 5:00 PM
  • Speakers receive complimentary admission to the 2-day summit and 1-day pre-summit training (on Monday, November 4, 2019)
  • Code of Conduct

Required Information

You’ll be asked to include the following information in your proposal:

  • Proposed title
  • Description
  • Suggested main topic
  • Audience information
  • Who is this presentation for?
  • What will the audience take away?
  • What prerequisite knowledge would they need?
  • Videos or live demos included in presentation?
  • Length of the presentation (30-45 minutes recommended)

Eight Tips for a Successful Speaker Proposal

Help us understand why your presentation is the right one for Scylla Summit 2019. Please keep in mind this event is made by and for deeply technical professionals. All presentations and supporting materials must be respectful and inclusive.

  1. Be authentic — Your peers need original ideas with real-world scenarios, relevant examples, and knowledge transfer
  2. Be catchy — Give your proposal a simple and straightforward but catchy title
  3. Be interesting — Make sure the subject will be of interest to others; explain why people will want to attend and what they’ll take away from it
  4. Be complete — Include as much detail about the presentation as possible
  5. Don’t be “pitchy” — Keep proposals free of marketing and sales. We tend to ignore proposals submitted by PR agencies and require that we can reach the suggested participant directly.
  6. Be understandable — While you can certainly cite industry terms, try to write a jargon-free proposal that contains clear value for attendees
  7. Be deliverable — Sessions have a fixed length, and you will not be able to cover everything. The best sessions are concise and focused. Overviews aren’t great in this format; the narrower your topic is, the deeper you can dive into it, giving the audience more to take home
  8. Be cautious — Live demos sometimes don’t go as planned, so we don’t recommend them


Super Early Bird Registration for Scylla Summit 2019 is open!

Follow us on Twitter or subscribe to our blog to stay up to date!

The post Scylla Summit 2019 Call for Speakers appeared first on ScyllaDB.

Anomalia Machina 10 – Final Results: Massively Scalable Anomaly Detection with Apache Kafka and Cassandra

In this tenth and final blog of the Anomalia Machina series we tune the anomaly detection system, scale the application out from 3 to 48 Cassandra nodes, and achieve some impressive numbers: 574 CPU cores (across the Cassandra, Kafka, and Kubernetes clusters), 2.3 Million writes/s into Kafka (peak), and 220,000 anomaly checks per second (sustainable), which is a massive 19 Billion anomaly checks a day.

1. The Scaling Journey

Odysseus’s final challenge was to regain his throne! Odysseus finally reached his homeland of Ithaca only to find his palace overrun with a crowd of 108 suitors who were drinking his wine, slaughtering his cattle, and courting his wife, Penelope. If he had rushed in without thinking, it would probably have ended badly. Instead, disguised as a beggar, he planned to recapture his throne. Penelope announced to the suitors that she would marry the man who could string the bow of Odysseus and shoot an arrow through 12 axes placed in a row. The suitors all tried and failed. When the “beggar” tried, he accomplished the feat! Throwing off his disguise, Odysseus fought and eliminated all the suitors.

Likewise, rather than attempting to jump straight to the end of the story and run the Anomalia Machina Application at a massive scale we thought it prudent to scale it out on increasingly bigger clusters. The aim was to (a) discover how to scale it before committing a larger amount of resources, and (b) document how well it scales with increasing cluster sizes.

Anomalia Machina 10 - Odysseus eliminates Penelope's suitors

Odysseus eliminates Penelope’s suitors

1.1 Preparations

For the initial scalability testing, I simplified the approach to focus on Cassandra scalability and application tuning on my Kubernetes cluster. To do this I used a small production Kafka cluster (3 nodes with 4 cores each) to “replay” the same events that I had previously sent to it.  Event reprocessing is a rich use case for Kafka that we explored in the blog Exploring the Apache Kafka “Castle” Part B: Event Reprocessing. Kafka consumers also place a very low load on Kafka clusters compared with producers, so this ensured that Kafka was not a bottleneck and the results were repeatable as I scaled the rest of the system.

To get the Anomalia Machina application ready to scale, there were a few things to improve from the previous blog. Given the likely increase in the number of Pods to monitor with Prometheus, the previous approach of running the Prometheus server on a laptop and manually adding each Pod IP address to the configuration file was no longer workable. To fix this, I deployed Prometheus to the Kubernetes cluster and automated Pod monitoring using the Prometheus Operator.

Anomalia Machina 10 - Prometheus Kubernetes Operator

Prometheus Kubernetes Operator

The first step is to install the Prometheus Operator and get it running. I did this by copying its bundle.yaml file to my local machine and running:

kubectl apply -f bundle.yaml

The Prometheus Operator works by dynamically detecting and monitoring Pods with labels that match a selector. Some assembly is required:

  1. Service Objects monitor Pods (with labels that match the selector)
  2. ServiceMonitors discover Service Objects
  3. Prometheus objects specify which ServiceMonitors should be included
  4. To access the Prometheus instance it must be exposed to the outside world (e.g. by using a Service of type NodePort)
  5. When it doesn’t work the first time, you most likely need to create Role-based access control rules for both Prometheus and Prometheus Operator

See this documentation for all the steps and examples.  

But with Prometheus now running in the Kubernetes cluster, it was no longer easy for it to monitor the Kafka load generator (the Kafka producer application) running on standalone AWS EC2 instances. I also wanted to ensure sufficient resources for the Kafka producer, so an obvious solution was to deploy it into the Kubernetes cluster too. This turned out to be easy: just a simple copy/edit of the existing Kubernetes deployment artefacts to create a new Deployment type from the producer jar file. The Kafka producer load can now be easily increased by scaling the number of Pods. This also enables unified monitoring of both the producer application and the detector pipeline application by the Prometheus Operator. We're now ready to continue with the scaling journey.

1.2 Pre-tuning

Anomalia Machina 10 - Pre-tuning

Pre-tuning: “La Jamais Contente”, first automobile to reach 100 km/h in 1899 (electric, 68hp)

There are a few knobs to twiddle to tune the anomaly detector pipeline part of the application. Each Pod has 2 Kafka consumers (in a thread pool) reading events from the Kafka cluster and another thread pool which performs Cassandra writes and reads and runs the anomaly detection algorithm. A single Cassandra connection is initially started per Pod, but this can be increased automatically by the Cassandra Java driver, if the connection becomes overloaded (it didn’t).

There are therefore 2 parameters that are critical for tuning the application with increasing scale: (1) the number of Kafka consumers, and (2) the number of detector threads per Pod. The number of partitions for the Kafka topic must be greater than or equal to the number of Kafka consumers to ensure that every consumer receives data (any consumers in excess of the number of partitions will be idle). The number of detector threads per Pod is critical, as throughput peaks at a "magic" number and drops off if there are more or fewer threads.
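
The first constraint can be illustrated with a tiny sketch (helper names are mine; the real assignment is performed by the Kafka consumer group protocol):

```python
def active_consumers(partitions, consumers):
    """Each Kafka partition is assigned to at most one consumer in a group,
    so at most `partitions` consumers can be doing useful work."""
    return min(partitions, consumers)

def idle_consumers(partitions, consumers):
    """Consumers in excess of the partition count receive no data."""
    return max(0, consumers - partitions)

# 200 partitions exactly feed ~200 consumers (2 per Pod x 100 Pods)...
print(active_consumers(200, 200), idle_consumers(200, 200))  # 200 0
# ...but adding 10 more Pods (20 more consumers) would leave 20 of them idle
print(active_consumers(200, 220), idle_consumers(200, 220))  # 200 20
```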

I initially assumed that I could scale out by the relatively simple process of (1) tuning the detector thread pool at low load (1 Pod) on a 3 node cluster, and then (2) increasing the number of Pods for each bigger cluster until maximum throughput was reached. I used this approach for increasing cluster sizes, doubling the number of nodes from 3 to 6, 12, and 24. Surprisingly, this gave sub-linear scalability, as shown by this graph.

Anomalia Machina 10 - TPS(Pre-tuning)

Given the theoretical perfect linear scalability of Cassandra with more nodes, something was obviously wrong with this application tuning approach.

1.3 Post-tuning

To gain more visibility into what was actually going on, I exposed more metrics in Prometheus, including the throughput for the Kafka consumers, the detector thread pool, and different detector events (detector run or not run, i.e. >= 50 rows returned from Cassandra or < 50 rows), as well as the number of rows of data read from Cassandra. This enabled confirmation that the business metric was correct (only reporting when >= 50 rows were available for an ID), and checking/tuning the thread pool so that the consumer and detector rates stay in steady state (one not getting ahead of or behind the other), as an imbalance can be suboptimal.

I then reran the benchmark, and using the increased visibility the extra metrics gave, improved the tuning approach for smaller iterations of cluster sizes (adding only 3 nodes at a time rather than doubling the number as above). Adding extra nodes to an existing Cassandra cluster is easy with an Instaclustr managed Cassandra cluster, as there is a button on the console to automatically add extra nodes.  This also sped up the benchmarking process as I didn’t have to reload the data into Cassandra at the start of each run (as I did for a new cluster each time), as it was already in place.

Note that I kept the Kafka cluster the same size for these experiments (as load from the Kafka consumers was minimal, < 10% CPU), but I did increase the size of the Kubernetes cluster by increasing the number of Kubernetes worker nodes each time CPU utilisation went over 50% to ensure the application resources weren’t a bottleneck.

The following graphs show the post-tuning results for clusters from 3 to 21 nodes, with linear extrapolation to 50 nodes.  The number of expected Pods for 50 nodes is approximately 100.

Anomalia Machina 10 - Worker Pods

A similar graph, but this time showing the predicted number of Kafka consumers (2 x Pods): around 200 for a 50 node cluster.

Anomalia Machina 10 - Kafka Consumers

This graph shows the results of tuning the number of detector threads per pod with increasing Cassandra nodes, and linear extrapolation predicts more than 250 threads per Pod for a 50 node Cassandra cluster.

Anomalia Machina 10 - Detector Threads Per Pod

Note that there was up to +/-10% variation in results across runs with identical configurations, and that the tuning may not be 100% optimal for some of the middle-sized clusters (i.e. slightly too many pods and insufficient threads per pod). However, it is likely that extrapolation over 7 data points gives reasonable predictions for bigger clusters.
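
The extrapolations above are simple straight-line fits over the measured points. Here is a sketch of the technique using illustrative data only (the node/Pod pairs below assume roughly 2 Pods per Cassandra node for the example; the real measurements are in the graphs):

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a*x + b."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    return a, my - a * mx

# Illustrative (nodes, pods) points, not the measured data.
nodes = [3, 6, 9, 12, 15, 18, 21]
pods = [6, 12, 18, 24, 30, 36, 42]
a, b = fit_line(nodes, pods)
print(round(a * 50 + b))  # predicted Pods for a 50 node cluster: 100
```

With perfectly linear input the fit is exact; on the real +/-10% noisy data the least-squares slope smooths out run-to-run variation before extrapolating.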

Did all the tuning help? Yes, from 3 to 21 node Cassandra clusters we now have significantly better scalability compared with the first attempt, now close to perfect linear scalability (within the +/-10% error variation) as shown in this graph.

Anomalia Machina 10 - TPS (Pre vs Post Tuning)

Observation: Apache Cassandra is perfectly linearly scalable (as a shared-nothing architecture, it has no shared resources that can become bottlenecks as nodes are added), but you need to put some effort into application optimisation. Cassandra will handle large numbers of connections, but for good scalability try to minimise the total number of Cassandra connections by optimising the use of each connection.

Anomalia Machina 10 - Post Tuning, Fast forward 120 years

Post-tuning: Fast-forward 120 years… “Pininfarina Battista” the fastest car in the world, 0-100 kph in 2 seconds, top speed 350 kph (electric, 1,900hp).

2. Results at Massive Scale

  • 2.3 Million writes/s into Kafka (peak)
  • 220,000 anomaly checks per second (sustainable)
    • 400 million events checked for anomalies in 30 minutes
    • 19 Billion anomaly checks a day
  • 6.4 Million events/s total “system” throughput (peak)
  • 574 CPU cores across Cassandra, Kafka, Kubernetes

For the final run, we revisited the original "Kafka as a buffer" use case (decoupling event producers from consumers). We want a Kafka cluster that will process at least 2 Million writes/s for a few minutes to cope with load spikes, while enabling the rest of the anomaly detection pipeline to scale and run at maximum capacity to process the backlog of events as fast as possible.

Based on the experience of tuning the application up to a 21 node Cassandra cluster we hopefully have sufficient experience to tackle the final challenge and scale up to something even bigger – a 48 node cluster.

2.1 Kafka Cluster Sizing 

Based on the predictions above, it looked like we needed 200 partitions on the Kafka side. I therefore spun up some different-sized Kafka clusters and experimented with increasing producer throughputs and numbers of partitions.

Using a 6 node (4 CPU cores per node) Kafka cluster as a starting point, it's apparent that write throughput drops significantly with increasing partitions. (This turns out to be due to the number of partitions being written to, rather than just in existence: writing to only 6 out of 600 partitions results in the same throughput as if there were only 6 partitions.)

Anomalia Machina 10 - Partitions vs Max Write:s

Using bigger Kafka node sizes (8 cores per node) gets us into the target (>= 2M write/s) range for 200 partitions (right hand orange bar), but the cluster is close to maxed out, so we decided to use a 9 node (8 CPU cores per node) Kafka cluster, as we don’t want the Kafka cluster to be a bottleneck.

Anomalia Machina 10 - Kafka maximum write throughput

Initial testing revealed that 9 Kafka Producer Pods were sufficient to exceed the target of 2M writes/s for the 9×8 Kafka cluster with 200 partitions.

2.2 48 Node Cassandra Cluster

To get the final results we spun up a 48 node Cassandra cluster on AWS using the Instaclustr console. We then tuned the application thread pool (thread pool 2 in the diagram in section 2.4) and increased the number of Pods while monitoring the application metrics in Prometheus, and the Kafka and Cassandra cluster metrics using the Instaclustr console. We reached the maximum anomaly checks/s with 100 Pods, with 300 detector threads per Pod (slightly more than predicted, giving a total of 30,000 detector pipeline application threads), and with the Cassandra cluster running close to flat out at 97% CPU (higher than recommended for a production cluster), and Kafka with some headroom at 66% CPU.

To test the Kafka as a buffer use case we switched from replaying existing Kafka events to reading new events, ramped up the Kafka producer load over 2 minutes, and held the load at maximum for a further 2 minutes before terminating.

After a few false starts, I found it useful to use an open source Kubernetes tool, Weaveworks Scope, to see that everything was working as expected. It is easy to connect to the Kubernetes cluster and supports different views and filtering of nodes. This view shows the main Services (some of which I’d had problems with configuring previously) and shows that Prometheus is correctly deployed and monitoring 100 Consumer Pods and 9 Producer Pods via the Prometheus operator.

Anomalia Machina 10 - Weavescope

Here are the specifications of the final system.

Cluster Details (all running in AWS, US East North Virginia)

Instaclustr managed Kafka – 9 x r4.2xlarge (EBS high-throughput 1,500 GB disk, 61 GB RAM, 8 cores), Apache Kafka 2.1.0, Replication Factor=3

Instaclustr managed Cassandra – Extra Large, 48 x i3.2xlarge (1769 GB (SSD), 61 GB RAM, 8 cores), Apache Cassandra 3.11.3, Replication Factor=3

AWS EKS Kubernetes Worker Nodes – 2 x c5.18xlarge (72 cores, 144 GB RAM, 25 Gbps network), Kubernetes Version 1.10, Platform Version eks.3

2.3 Raw Results in Prometheus and Grafana

Here are the raw results. The average latency of the detector pipeline (from reading an event from Kafka, to deciding if it is an anomaly or not) was under 460ms for this test as shown in this Prometheus graph.

Anomalia Machina 10 - Raw results in Prometheus

The next graph shows the Kafka producer ramping up (from 1 to 9 Kubernetes Pods), with 2 minutes load time, peaking at 2.3M events/s (this time in Grafana). Note that because each metric was being retrieved from multiple Pods I had to view them as stacked graphs to get the total metric value for all the Pods.

Anomalia Machina 10 - Total Metric value for all the pods

This graph shows the anomaly check rate reaching 220,000 events/s and continuing (until all the events are processed). Prometheus is gathering this metric from 100 Kubernetes Pods.

Anomalia Machina 10 - Prometheus gathering metric from 100 Kubernetes Pods

2.4 Is this a good result?

Ten years ago it was considered "impractical to present an entire series of transactions" to an anomaly detection system; the recommendation instead was to use aggregated historical data. However, we've demonstrated that current technology is more than up to the task of detecting anomalies from the raw transactions, rather than having to rely on aggregated data.

How do our results compare with more recent ones? Results published in 2018, for a similar system, achieved 200 anomaly checks/s using 240 cores. They used supervised anomaly detection, which required training the classifiers (once a day), so they used Apache Spark (for ML, feature engineering, and classification) as well as Kafka and Cassandra. Taking resource differences into account, our result is around 500 times higher throughput, with faster real-time latency. They had more overhead due to the "feature engineering" phase, and their use of Spark to run the classifier introduced up to 200s of latency, making it unsuitable for real-time use. With a detection latency under 1s (average 500ms), our solution is fast enough to provide real-time anomaly detection and blocking. If the incoming load exceeds the capacity of the pipeline for brief periods, processing time increases, and potentially anomalous transactions detected during that window may need to be handled differently.

2.5 Analysis

To summarise, the maximum Kafka writes/s reached 2.3M/s, while the rest of the pipeline managed a sustainable 220,000 anomaly checks/s.

Anomalia Machina 10 - Analysis-Final Results

The numbers are actually bigger than this if we take into account all the events in the complete system (i.e. all the events flowing between the distributed systems). In the previous blog we showed that for every anomaly check decision, there are many other events contributing to it. For the load spike scenario, we need to take into account the bigger Kafka load spike (Y, blue line, 2.3M/s) and the smaller detector pipeline rate (Z, orange line, 220,000/s):

Anomalia Machina 10 - Analysis - Load Spike

The peak total system throughput calculation is slightly different from the previous steady-state calculation as the load spike (Y) is produced by the Kafka load generator (step 1) and written to the Kafka cluster (step 2) until the rest of the pipeline catches up and processes them (steps 3-10, at maximum rate Z).

Anomalia Machina 10 - Analysis - Peak system throughput

The peak system throughput is therefore (2 x peak Kafka load) + (8 x anomaly checks/s) = (2 x 2.3M) + (8 x 0.22M) = 6.4 Million events/s as shown in this graph:
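
As a sanity check, the arithmetic can be reproduced directly (variable names are mine; the factors 2 and 8 come from the event-counting in the previous blog, steps 1-2 running at the spike rate and steps 3-10 at the pipeline rate):

```python
peak_kafka_load = 2.3e6   # peak writes/s into Kafka (steps 1-2)
anomaly_checks = 0.22e6   # sustainable anomaly checks/s (steps 3-10)

peak_system_throughput = 2 * peak_kafka_load + 8 * anomaly_checks
print(peak_system_throughput / 1e6)  # 6.36, i.e. the ~6.4 Million events/s headline
```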

Anomalia Machina 10 - Analysis-Peak total system throughput

Here are some more big numbers. The 4 minutes of events (ramp up and load) into Kafka produces 400 million events to be checked, and it takes the anomaly detection pipeline 30 minutes to process all the events.  The following graph shows this scenario (note that rates are in Millions per minute):

Anomalia Machina 10 - Analysis - No background load

A more realistic scenario is to assume that on top of the load spike, there is an average background load of say 50% of the pipeline capacity running continuously (110,000 events/s). It then takes 60 minutes to clear all the events due to the load spike as shown in this graph:

Anomalia Machina 10 - Analysis - 50% detector pipeline capacity background load

Under what circumstances would this be useful? Imagine we have an SLA in place to process say 99% of events per week in under 1s, with an upper bound of 1-hour latency. Assuming load spike events like this are relatively infrequent (e.g. once a week) then these scenarios can satisfy a 99.4% SLA (1 hour is 0.6% of a week).
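
The backlog and SLA arithmetic behind these scenarios is easy to check; here is a quick sketch (the helper name is mine, not from the original application):

```python
def drain_minutes(backlog_events, pipeline_rate, background_rate=0.0):
    """Minutes to clear a backlog, given the pipeline's spare capacity."""
    spare_capacity = pipeline_rate - background_rate
    return backlog_events / spare_capacity / 60

backlog = 400e6      # events accumulated during the 4 minute spike
pipeline = 220_000   # sustainable anomaly checks/s

print(round(drain_minutes(backlog, pipeline)))           # 30 minutes, idle pipeline
print(round(drain_minutes(backlog, pipeline, 110_000)))  # ~60 minutes at 50% background load

# One hour of degraded latency per week still leaves a ~99.4% within-SLA rate
print(round(100 * (1 - 1 / (7 * 24)), 1))  # 99.4
```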

And for our final and Biggest Number, the following graph shows that our largest Anomalia Machine system with 48 Cassandra nodes has more than sufficient capability to process 19 Billion anomaly checks a day.

Anomalia Machina 10 - Analysis - Transactions per day

2.6 How Big is Anomalia Machina?

How big is our final Anomalia Machina “machine”?  Here’s a graph showing the business metric vs the number of cores used for the Cassandra, Kafka, and Kubernetes clusters and the total system.

Anomalia Machina 10 - Throughput vs CPU Cores Used

The complete machine for the biggest result (48 Cassandra nodes) has 574 cores in total.  This is a lot of cores! Managing the provisioning and monitoring of this sized system by hand would be an enormous effort. With the combination of the Instaclustr managed Cassandra and Kafka clusters (automated provisioning and monitoring), and the Kubernetes (AWS EKS) managed cluster for the application deployment it was straightforward to spin up clusters on demand, run the application for a few hours, and delete the resources when finished for significant cost savings. Monitoring over 100 Pods running the application using the Prometheus Kubernetes operator worked smoothly and gave enhanced visibility into the application and the necessary access to the benchmark metrics for tuning and reporting of results.

The system (irrespective of size) was delivering an approximately constant 400 anomaly checks per second per core.
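
This per-core figure falls straight out of the headline numbers (220,000 / 574 ≈ 383, i.e. roughly the constant 400 checks/s/core):

```python
anomaly_checks_per_sec = 220_000
total_cores = 574  # Cassandra + Kafka + Kubernetes combined
print(round(anomaly_checks_per_sec / total_cores))  # 383 checks/s per core
```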

It is worth noting that the Cassandra cluster is more than 5 times bigger than the Kafka cluster, even though the Kafka cluster is processing an order of magnitude larger load spike (2.3M/s) than the Cassandra cluster (220,000/s). It is obviously more efficient (easier, cheaper, more elastic) to use "Kafka as a buffer" to cope with load spikes than to increase the size of the Cassandra cluster by an order of magnitude (i.e. from 48 to 480 nodes!) in a hurry.

However, it is possible to dynamically resize a Cassandra cluster given sufficient warning. Instaclustr's dynamic resizing for Apache Cassandra enables vertical scaling up or down in minutes (20-30 minutes for a complete cluster, but capacity starts to increase almost immediately). The biggest increase in capacity is from r4.large (2 cores) to r4.4xlarge (16 cores), an 8 times capacity increase. This would be sufficient for this scenario if used in conjunction with Kafka as a buffer, and would result in significantly faster processing of the event backlog. I tried this on a smaller cluster, resizing one node at a time (concurrent resizing is also an option), and it worked flawlessly. For this to work you need to (1) create a resizable Instaclustr Cassandra cluster, (2) with sufficient nodes to enable vertical scaling to satisfy the target load, and (3) enable elastic scaling of the application on Kubernetes (this is another challenge).

2.7 Affordability at Scale

We have proven that our system can scale well to process 19 Billion events a day, more than adequate for even a large business. So, what is the operational cost to run an anomaly detection system of this size? This graph shows that it only costs around $1,000 a day for the basic infrastructure using on-demand AWS instances.  

Anomalia Machina 10 - Analysis - Anomalia Machina Cost $:day

This graph also shows that the system can easily be scaled up or down to match different business requirements, and the infrastructure costs will scale proportionally. For example, the smallest system we ran still checked 1.5 Billion events per day, for a cost of only $100/day for the AWS infrastructure.


Admittedly, the total cost of ownership would be higher (including R&D of the anomaly detection application, ongoing maintenance of the application, managed service costs, etc.). Assuming a more realistic total cost of $10,000 a day (10x the infrastructure cost), the system can run anomaly checks on 1.9 Million events per dollar spent.
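
Under that assumed 10x total-cost multiplier, the checks-per-dollar figure is simple arithmetic:

```python
checks_per_day = 19e9               # anomaly checks/day at full scale
infrastructure_cost = 1_000         # USD/day, on-demand AWS instances
assumed_total_cost = 10 * infrastructure_cost  # the 10x assumption above

print(checks_per_day / assumed_total_cost / 1e6)  # 1.9 Million checks per dollar
```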

Just as Homer’s epic was an exercise in unbounded imagination (e.g. Heroes, gods, monsters such as Polyphemus the one-eyed cannibal Cyclops, the alluring Sirens, Scylla the six-headed sea snake, Calypso the Nymph), the size and use cases of a scalable Anomaly Detector system are only limited by your imagination!  In this series we have demonstrated that a combination of open source technologies (Kafka, Cassandra, Kubernetes, Prometheus) and Instaclustr managed Kafka and Cassandra clusters can scale to detect anomalies hidden in Billions of events a day, provides a significant return/cost ratio and actual dollar savings, and is applicable across many application areas. Do you have a Homeric scale imagination?

I had a chance encounter with “Kubernetes” (helmsman) related technology recently. The Fijian museum in Suva has the last Fijian ocean going double-hulled canoe (Drua – the Ratu Finau, 14m hull). Here’s a photo with its steering oar (uli), which is over 3m long, but could be managed by one helmsman.

Anomalia Machina 10 - Postscript

The steering oars of the older bigger canoes (36m hulls) were even more massive (as long as this canoe) and needed up to 4 helmsmen to handle them (with the help of ropes) and keep them on course.

Anomalia Machina 10 - Fijian Canoe

To find out more about our Managed Platform for Open Source Technologies, contact us or sign up for a free trial.

The post Anomalia Machina 10 – Final Results: Massively Scalable Anomaly Detection with Apache Kafka and Cassandra appeared first on Instaclustr.

What’s New in Scylla Monitoring Stack 2.3

New CQL Optimization Dashboard

We just released Scylla Monitoring Stack version 2.3. The new version comes with dashboards to support the upcoming Scylla Enterprise 2019.1 release and the Scylla Manager 1.4 release.

Making the Scylla Monitoring Stack more robust


Scylla Monitoring Stack 2.3 improves the way Scylla Monitoring works with templates and makes some of the magic of dashboard generation more visible and explicit.

Scylla Monitoring Stack uses Grafana for its front-end dashboards. Grafana dashboard definitions are verbose and hard to maintain. To make dashboard maintenance easier we use a hierarchical template mechanism; you can read more about it in the blog post here.

We use a Python script to generate the dashboards from the templates, which created a dependency on Python for the solution to work.

As of Scylla Monitoring Stack 2.3, the dashboards ship pre-generated with the release, which means that by default you no longer have a Python dependency.
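As a toy illustration of why templating helps (this is not Scylla's actual generator, which is a Python script; the names and JSON shape here are invented), a compact template expands into the kind of verbose panel JSON that Grafana expects:

```go
package main

import (
	"bytes"
	"fmt"
	"text/template"
)

// renderPanel expands a minimal panel template into Grafana-style JSON.
// Purely illustrative: real dashboards carry far more fields per panel,
// which is exactly why hand-maintaining the raw JSON is painful.
func renderPanel(title string) string {
	tmpl := template.Must(template.New("panel").Parse(
		`{"title": {{printf "%q" .}}, "type": "graph", "datasource": "prometheus"}`))
	var buf bytes.Buffer
	if err := tmpl.Execute(&buf, title); err != nil {
		panic(err)
	}
	return buf.String()
}

func main() {
	fmt.Println(renderPanel("Cross-shard traffic"))
}
```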

Making changes to the dashboards

If you make changes to the dashboard templates, you will need to run the generation script for the changes to take effect. Note that the script changes the dashboards in place, that the Grafana server picks up the changes without a restart, and that this step does depend on Python.

Docker and Permissions

Using Docker Containers is an easy way to install the different servers of the Scylla Monitoring Stack. Containers bring a layer of isolation with the intent to provide additional security. In practice, many users run into issues when a process inside the container needs to access and modify files outside of the container. This happens with the Prometheus data directory, and now, with Grafana dashboards and plugins.

Many users facing this problem tend to use workarounds that bypass Linux security; examples are running as root (using sudo), changing directory permissions to world-writable, and disabling SELinux.

All these workarounds are inadvisable. We made multiple changes in the way we run the containers so that they are no longer necessary.

Best Practices for using Scylla Monitoring Stack:

  • Do not use root; use the same user for everything (e.g. centos)
  • Add that user to the docker group (see here)
  • Use the same user when downloading and extracting Scylla Monitoring Stack
  • If you used sudo in the past, it is preferable to change the directory and file ownership rather than grant excessive root permissions.

Controlling the alerts configuration files from the command line

It is now possible to override the Prometheus alert file and the alertmanager configuration files from the command line.

The Prometheus alert file describes which alerts will be triggered and when. To specify the Prometheus alert file, use the -R command line option with

For example: ./ -R prometheus.rules.yml

The Alertmanager configuration describes how to handle alerts. To specify the Alertmanager config file, use the -r command line option with

For example: ./ -r rules.yml

As of Scylla Monitoring Stack 2.3, all directories and files can be passed either as a relative path or as an absolute path.

Generating Prometheus configuration with is a utility that can generate the scylla_servers files for Prometheus. In Scylla Monitoring Stack 2.3 there are multiple enhancements to it.

  • It now supports dc and cluster names.
  • You can use the output of nodetool status as input; this will make sure that your datacenters are configured correctly.
  • It no longer creates the node_exporter file that was deprecated.

New panels to existing dashboards

CQL Optimization: Cross shard

Scylla uses a shared-nothing model that shards all requests onto individual cores. Scylla runs one application thread-per-core, and depends on explicit message passing, not shared memory between threads. This design avoids slow, unscalable lock primitives and cache bounces.

Ideally, each request to a Scylla node reaches the right core (shard), avoiding internal communication between cores. This is not always the case, for example when using a non-shard-aware Scylla driver (see more here).

New panels in the cql optimization dashboard were added to help identify cross-shard traffic.

Cross-shard traffic
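The shard-routing idea behind these panels can be sketched in miniature (illustrative only: Scylla's real mapping goes through its token ring, not an FNV hash; the function name is ours):

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// shardOf maps a partition key to a core. The key property, which real
// shard-aware drivers exploit, is that the owning shard is a pure
// function of the partition key, so a driver can open a connection per
// shard and send each request straight to the right core, avoiding the
// cross-shard hop the dashboard panels above are designed to surface.
func shardOf(partitionKey string, nShards int) int {
	h := fnv.New32a()
	h.Write([]byte(partitionKey))
	return int(h.Sum32()) % nShards
}

func main() {
	for _, key := range []string{"user:1", "user:2", "user:3"} {
		fmt.Printf("%s -> shard %d\n", key, shardOf(key, 8))
	}
}
```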

Per-machine Dashboard: Disk Usage Over time

In response to user requests to show disk usage as a graph over time, the disk size panel shows aggregated disk usage (by instance, dc, or cluster).

Storage usage over time


Now you’ve seen the changes that were made in Scylla Monitoring Stack 2.3 to make it easier to run and more secure. The next step is yours! Download Scylla Monitoring Stack 2.3 directly from Github. It’s free and open source. If you try it, we’d love to hear your feedback, either by contacting us privately or sharing your experience with your fellow users on our Slack channel.

The post What’s New in Scylla Monitoring Stack 2.3 appeared first on ScyllaDB.

Scylla Monitoring Stack Release 2.3

Scylla Monitoring Stack Release Notes

The Scylla team is pleased to announce the release of Scylla Monitoring Stack 2.3.

Scylla Monitoring Stack is an open source stack for monitoring Scylla Enterprise and Scylla Open Source, based on Prometheus and Grafana. Scylla Monitoring Stack 2.3 supports:

  • Scylla Open Source versions 2.3 and 3.0
  • Scylla Enterprise versions 2018.x and 2019.x (upcoming release)
  • Scylla Manager 1.3.x, 1.4.x (upcoming release)

Related Links

New in Scylla Monitoring Stack 2.3

  • Scylla enterprise dashboards for 2019.1 (#538)
    Getting ready for Scylla Enterprise 2019.1, the dashboards for Scylla Enterprise 2019.1 are included.
  • Scylla manager dashboard for 1.4 (#557)
    Getting ready for Scylla Manager 1.4, the dashboard for Scylla Manager 1.4 is included.
  • Dashboards are precompiled in the release
    Scylla Monitoring Stack uses templates for simpler dashboard representation, and a Python script is used to generate the dashboards from the templates. As of Scylla Monitoring Stack 2.3, the dashboards are packaged pre-compiled as part of the release. This removes the dependency on Python for users that do not make changes to the dashboards. In addition, the starting script will not generate new dashboards on change, but will only issue a warning that a dashboard was changed and ask the user to run the generation script.
  • Add cross_shard_ops panel to cql optimization (#553)
    Scylla uses a shared-nothing model that shards all requests onto individual cores. Scylla runs one application thread per core and depends on explicit message passing, not shared memory between threads. This design avoids slow, unscalable lock primitives and cache bounces. Ideally, each request to a Scylla node reaches the right core (shard), avoiding internal communication between cores. This is not always the case, for example when using a non-shard-aware Scylla driver (see more here). New panels in the cql optimization dashboard were added to help identify cross-shard traffic.
    Cross-shard traffic
  • Cluster name in use is shown in the dashboard (#533)
    To simplify monitoring multi-cluster installations, the cluster name is now shown in the dashboard header.
    Cluster Name
 with multi-dc support (#513)
    The utility can accept a datacenter name, and it can also be used with nodetool output for simpler configuration.
  • Add a storage usage over time panel (#466)
    Add a panel to the per-machine dashboard that shows the disk usage over time.
    Storage usage over time
  • Upgrade Prometheus to 2.7.2 (#456)
    Prometheus container now uses Prometheus version 2.7.2, see Prometheus releases for more information
  • Show more information about compaction (#491)
    Two panels were added to the per-server dashboard: one that shows the percentage of CPU used by compaction, and one that shows the compaction shares over time.
    CPU used by compactions
  • Alertmanager and Prometheus alerts can be configured from the command line
    It is now possible to override the Prometheus alert file and the Alertmanager configuration file from the command line. To specify the Alertmanager config file, use the -r command line argument with; for example: ./ -r rules.yml. To specify the Prometheus alert file, use the -R command line argument; for example: ./ -R prometheus.rules.yml
  • Warn users when starting docker as root and make grafana volume sharable
    Users should avoid running Docker containers as root. When a user starts the monitoring stack as root, a warning is now issued.
  • Prometheus data directory to accept relative path (#527)
    No need to specify the full path to the Prometheus data directory.

Bug Fixes

  • Prometheus.rules.yaml: Fix the alert rule that warns when there is no CQL connectivity (#541)
  • Not all 2018.1 uses cluster and dc (#540)

The post Scylla Monitoring Stack Release 2.3 appeared first on ScyllaDB.

Veramine Turns to Scylla to Manage Big Data for Enterprise Cybersecurity


The threat of cyberattack is real, pervasive, and omnipresent. Real-time global cybersecurity attack maps, such as those maintained by Kaspersky Labs and Fortinet, show there are no breaks, vacations, or downtime in the world of online and computer security.

The scale, variety, complexity, velocity, and ferocity of attacks have compounded year over year. The costs associated with such attacks also continue to increase, with cybercrime estimated to exceed $6 trillion annually by 2021. Beyond the criminal costs, there are also risks that range from national security to personal safety.

Veramine is one company tackling the national security threat. Awarded contracts from the U.S. Air Force and U.S. Department of Homeland Security to defend against cyberthreats, Veramine is also a commercially available service for enterprises.

About Veramine

Veramine provides advanced capabilities for reactive intrusion response and proactive threat detection. Using endpoint telemetry to feed a central server, Veramine scours huge amounts of network data to identify attacker activity. With its advanced detection engine and its rule-based and machine-learning algorithms, Veramine can identify Mimikatz-style password dumping, kernel-mode exploitation (local EoP), process injection, unauthorized lateral movement, and other attacker activity.

The Challenge

For Veramine, cybersecurity analysis begins with good data. As such, the company’s goal is to collect as much data from the enterprise network as possible, including both servers and desktops. Events collected by the platform are enriched with context information from the system. For example, every network connection that’s recorded is associated with its originating process, user, time, and associated metadata. The end result is a huge and ever-growing data set.

“Even if the performance were only as good as Cassandra, and in fact it’s much better, Scylla would still be a significant improvement.”

— Jonathan Ness, CEO, Veramine

According to Jonathan Ness, CEO of Veramine, “We’re trying to collect everything you would ever want to know about what goes on on a computer and centralize that data and run algorithms on it.”

All that data needs to be stored somewhere. Since the data is so sensitive, very few Veramine customers permit it to be stored in the cloud. Veramine needs to provide low-latency big data capabilities in on-premises datacenters. Given the sensitive nature of the data collected, Veramine personnel were unable to directly access databases to help with support.

Veramine began using Postgres, but quickly realized that a NoSQL database was more appropriate to their use case. They switched to Cassandra, but soon realized that it was not up to the task.

“The problem was every week it was crashing, so we created all this infrastructure just to keep Cassandra alive,” said Ness. Veramine went so far as to parse Cassandra logs in an attempt to predict when garbage collection would happen, and then apply throttling to avoid crashing the database. Without direct access to customer environments, Cassandra soon became a nightmare. The team set out to find a replacement.

The Solution

What was needed was a low-latency NoSQL database that provided extremely low administrative overhead and high stability. Initial attempts to use PostgreSQL did not meet the challenge. And while Cassandra was able to scale, it had operational management problems. That was when the Veramine team turned to Scylla.

Veramine saw instant results from using Scylla. “We started using Scylla two years ago,” said Ness. “We fell in love with Scylla because it doesn’t crash and we don’t have to manage it.” Since Scylla is a feature-complete, drop-in replacement for Cassandra, the migration was quick and painless. “Our code didn’t change much going from Cassandra to Scylla.”

According to Ness, a big benefit of Scylla is developer productivity. Scylla lets the team focus on business logic rather than on custom code around the datastore. Veramine’s Scylla clusters that are running in production are surprisingly small compared to Cassandra.

Ness summed up Veramine’s Scylla journey: “Even if the performance were only as good as Cassandra, and in fact it’s much better, Scylla would still be a significant improvement due to its stability and lower administrative overhead.”

The post Veramine Turns to Scylla to Manage Big Data for Enterprise Cybersecurity appeared first on ScyllaDB.

Install All the THINGS! But especially Apache Cassandra, Apache Spark and Jupyter

Installing Apache Cassandra

Download bits

Untar and Start

Even Higher Availability with 5x Faster Streaming in Cassandra 4.0

Streaming is a process where nodes of a cluster exchange data in the form of SSTables. Streaming can kick in during many situations such as bootstrap, repair, rebuild, range movement, cluster expansion, etc. In this post, we discuss the massive performance improvements made to the streaming process in Apache Cassandra 4.0.

High Availability

As we know Cassandra is a Highly Available, Eventually Consistent database. The way it maintains its legendary availability is by storing redundant copies of data in nodes known as replicas, usually running on commodity hardware. During normal operations, these replicas may end up having hardware issues causing them to fail. As a result, we need to replace them with new nodes on fresh hardware.

As part of this replacement operation, the new Cassandra node streams data from the neighboring nodes that hold copies of the data belonging to the new node’s token range. Depending on the amount of data stored, this process can require substantial network bandwidth and take some time to complete. The longer these types of operations take, the more we are exposed to loss of availability. Depending on your replication factor and consistency requirements, if another node fails during this replacement operation, availability will be impacted.

Increasing Availability

To minimize the failure window, we want to make these operations as fast as possible. The faster the new node completes streaming its data, the faster it can serve traffic, increasing the availability of the cluster. Towards this goal, Cassandra 4.0 saw the addition of Zero Copy streaming. For more details on Cassandra’s zero copy implementation, see this blog post and CASSANDRA-14556.

Talking Numbers

To quantify the results of these improvements, we at Netflix measured the performance impact of streaming in 4.0 vs 3.0, using our open source NDBench benchmarking tool with the CassJavaDriverGeneric plugin. Though we knew there would be improvements, we were still amazed by the overall result: a fivefold increase in streaming performance. The test setup and operations are all detailed below.

Test Setup

In our test setup, we used the following configurations:

  • 6-node clusters on i3.xl, i3.2xl, i3.4xl and i3.8xl EC2 instances, each on 3.0 and trunk (sha dd7ec5a2d6736b26d3c5f137388f2d0028df7a03).
  • Table schema
CREATE TABLE testing.test (
    key text,
    column1 int,
    value text,
    PRIMARY KEY (key, column1)
) WITH bloom_filter_fp_chance = 0.01
    AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
    AND comment = ''
    AND compaction = {'class': 'org.apache.cassandra.db.compaction.LeveledCompactionStrategy'}
    AND compression = {'enabled': 'false'}
    AND crc_check_chance = 1.0
    AND dclocal_read_repair_chance = 0.1
    AND default_time_to_live = 0
    AND gc_grace_seconds = 864000
    AND max_index_interval = 2048
    AND memtable_flush_period_in_ms = 0
    AND min_index_interval = 128
    AND read_repair_chance = 0.0
    AND speculative_retry = '99PERCENTILE';
  • Data size per node: 500GB
  • No. of tokens per node: 1 (no vnodes)

To trigger the streaming process we used the following steps in each of the clusters:

  • terminate a node
  • add a new node as a replacement
  • measure the time taken by the new node to complete streaming the data of the terminated node

For each cluster and version, we repeated this exercise multiple times to collect several samples.

Below is the distribution of streaming times we found across the clusters.

Benchmark results

Interpreting the Results

Based on the graph above, one can draw many conclusions. Some of them are:

  • 3.0 streaming times are inconsistent and show a high degree of variability (fat distributions across multiple samples)
  • 3.0 streaming is highly affected by the instance type and generally looks CPU bound
  • Zero Copy streaming is approximately 5x faster
  • Zero Copy streaming time shows little variability in its performance (thin distributions across multiple samples)
  • Zero Copy streaming performance is not CPU bound and remains consistent across instance types

It is clear from the performance test results that Zero Copy Streaming has a huge performance benefit over the previous streaming infrastructure in Cassandra. But what does it mean in the real world? The following key points are the main takeaways.

MTTR (Mean Time to Recovery): MTTR is a KPI (Key Performance Indicator) that is used to measure how quickly a system recovers from a failure. Zero Copy Streaming has a very direct impact here with a five fold improvement on performance.

Costs: Zero Copy Streaming is ~5x faster. This translates directly into cost savings for some organizations, primarily as a result of reducing the need to maintain spare server or cloud capacity. In other situations, where you’re migrating data to larger instance types or moving AZs or DCs, it means that instances sending data can be turned off sooner, saving costs. An added benefit is that you no longer have to over-provision the instance: you get similar streaming performance whether you use an i3.xl or an i3.8xl, provided the bandwidth is available to the instance.

Risk Reduction: Zero Copy Streaming also greatly reduces risk. Since a cluster’s recovery mainly depends on streaming speed, Cassandra clusters with failed nodes will be able to recover much more quickly (5x faster). This means the window of vulnerability is reduced significantly, in some situations down to a few minutes.

Finally, a benefit we generally don’t talk about is the environmental impact of this change. Zero Copy Streaming enables us to move data very quickly through the cluster, and it objectively reduces the number and sizes of instances used to build a Cassandra cluster. As a result, not only does it reduce Cassandra’s TCO (Total Cost of Ownership), it also helps the environment by consuming fewer resources!

Scylla: four ways to optimize your disk space consumption

We recently had to face free disk space outages on some of our Scylla clusters, and we learned some very interesting things while outlining some improvements that could be suggested to the ScyllaDB team.

100% disk space usage?

First of all I wanted to give a bit of a heads up about what happened when some of our scylla nodes reached (almost) 100% disk space usage.

Basically they:

  • stopped listening to client requests
  • complained in the logs
  • wouldn’t flush the commitlog (expected)
  • aborted their compaction work (which actually gave back a few GB of space)
  • stayed in a stuck / unable-to-stop state (unexpected; this has been reported)

After restarting your Scylla server, the first and obvious thing you can try in order to get out of this situation is to run the nodetool clearsnapshot command, which will remove any data snapshots that could be lying around. That’s usually a handy command for reclaiming space.

Reminder: depending on your compaction strategy, it is usually not advised to allow your data to grow over 50% of disk space...

But that’s only a patch so let’s go down the rabbit hole and look at the optimization options we have.

Optimize your schemas

Schema design and the types you choose for your columns have a huge impact on disk space usage! In our case, we had indeed overlooked some optimizations that we could have made from the start, which cost us a lot of wasted disk space. Fortunately it was easy and fast to change.

To illustrate this, I’ll take a sample of 100,000 rows of a simple and naive schema associating readings of 50 integers to a user ID:

Note: all those operations were done using Scylla 3.0.3 on Gentoo Linux.

CREATE TABLE IF NOT EXISTS test.not_optimized (
uid text,
readings list<int>
) WITH compression = {};

Once inserted on disk, this takes about 250MB of disk space:

250M    not_optimized-00cf1500520b11e9ae38000000000004

Now depending on your use case, if those readings are not meant to be updated, for example, you could use a frozen list instead, which allows a huge storage optimization:

CREATE TABLE IF NOT EXISTS test.mid_optimized (
uid text,
readings frozen<list<int>>
) WITH compression = {};

With this frozen list we now consume 54MB of disk space for the same data!

54M     mid_optimized-011bae60520b11e9ae38000000000004

There’s another optimization we could make, since our user IDs are UUIDs: let’s switch to the uuid type instead of text:

CREATE TABLE IF NOT EXISTS test.optimized (
uid uuid,
readings frozen<list<int>>
) WITH compression = {};

By switching to uuid, we now consume 50MB of disk space: that’s an 80% reduction in disk space consumption compared to the naive schema for the same data!

50M     optimized-01f74150520b11e9ae38000000000004
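The gain from that last change is easy to see from the value widths involved (a quick sketch; actual on-disk sizes also include per-cell overhead):

```go
package main

import "fmt"

func main() {
	// A UUID rendered as text is 36 bytes per value (32 hex digits plus
	// 4 hyphens); the native uuid type stores the same 128-bit
	// identifier in just 16 bytes.
	textUUID := "00cf1500-520b-11e9-ae38-000000000004"
	fmt.Println(len(textUUID)) // bytes as text
	fmt.Println(16)            // bytes as native uuid
}
```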

Enable compression

None of the examples above used compression. If your workload latencies allow it, you should probably enable compression on your sstables.

Let’s see its impact on our tables:

ALTER TABLE test.not_optimized WITH compression = {'sstable_compression': ''};
ALTER TABLE test.mid_optimized WITH compression = {'sstable_compression': ''};
ALTER TABLE test.optimized WITH compression = {'sstable_compression': ''};

Then we run nodetool compact test to force a (re)compaction of all the sstables, and we get:

63M     not_optimized-00cf1500520b11e9ae38000000000004
28M mid_optimized-011bae60520b11e9ae38000000000004
24M optimized-01f74150520b11e9ae38000000000004

Compression is really a great gain here, allowing another 50% reduction in disk space usage on our optimized table!

Switch to the new “mc” sstable format

Since the Scylla 3.0 release you can use the latest “mc” sstable storage format on your Scylla clusters. It promises greater efficiency and usually much lower disk space consumption!

It is not enabled by default; you have to add the enable_sstables_mc_format: true parameter to your scylla.yaml for it to be taken into account.

Since it’s backward compatible, you have nothing else to do: new compactions will start using the “mc” storage format, and the Scylla server will seamlessly read old sstables as well.

But in our case of immediate disk space outage, we switched to the new format one node at a time, dropped its data, and ran nodetool rebuild to reconstruct the whole node using the new sstable format.

Let’s demonstrate its impact on our test tables: we add the option to the scylla.yaml file, restart scylla-server and run nodetool compact test again:

49M     not_optimized-00cf1500520b11e9ae38000000000004
26M mid_optimized-011bae60520b11e9ae38000000000004
22M optimized-01f74150520b11e9ae38000000000004

That’s a pretty cool disk space gain, even more so for the non-optimized version of our schema!

So if you’re in great need of disk space or it is hard for you to change your schemas, switching to the new “mc” sstable format is a simple and efficient way to free up some space without effort.

Consider using secondary indexes

While denormalization is the norm (yep… legitimate pun) in the NoSQL world, this does not mean we have to duplicate everything all the time. A good example lies in the internals of secondary indexes, if your workload can tolerate their moderate impact on latency.

Secondary indexes on Scylla are built on top of Materialized Views that basically store an up-to-date pointer from your indexed column to your main table’s partition key. That means secondary index MVs do not duplicate all the columns (and thus the data) of your main table, as you would have to when denormalizing a table to query by another column: this saves disk space!

This of course comes with a latency drawback, because if your workload is interested in columns other than the partition key of the main table, the coordinator node will actually issue two queries to get all your data:

  1. query the secondary index MV to get the pointer to the partition key of the main table
  2. query the main table with the partition key to get the rest of the columns you asked for

This has been an effective trick to avoid duplicating a table and save disk space for some of our workloads!
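The two-step read above can be sketched in miniature (an in-memory analogy with invented names; in reality the coordinator performs two CQL reads against the index MV and the base table):

```go
package main

import "fmt"

// row stands in for a base-table row with columns beyond the partition key.
type row struct{ name, email string }

// lookupByEmail resolves a read on an indexed column in two steps, the
// way the coordinator does: first the index MV yields the base-table
// partition key(s), then the base table is read with each key.
func lookupByEmail(base map[string]row, indexMV map[string][]string, email string) []row {
	var out []row
	for _, pk := range indexMV[email] { // step 1: index MV -> partition keys
		out = append(out, base[pk]) // step 2: base-table read by key
	}
	return out
}

func main() {
	base := map[string]row{"pk1": {name: "alice", email: ""}}
	indexMV := map[string][]string{"": {"pk1"}}
	fmt.Println(lookupByEmail(base, indexMV, "")[0].name)
}
```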

(not a tip) Move the commitlog to another disk / partition?

This should only be considered as a sort of emergency procedure or for cost efficiency (cheap disk tiering) on non critical clusters.

While this is possible even if the disk is not formatted using XFS, it is not advised to separate the commitlog from the data on modern SSD/NVMe disks, but… you technically can do it (as we did) on non-production clusters.

Switching is simple, you just need to change the commitlog_directory parameter in your scylla.yaml file.

Speculative Query Executions with GoCql

Speculative query executions in Cassandra are an interesting solution for some use cases, such as faulty/slow/unresponsive nodes or network issues. This kind of execution allows a client to launch a DB request to multiple endpoints at the same time and let these requests compete for the winning response. This only works if the query itself is defined as idempotent, that is, it renders the same result no matter how many times it is run. We’ve written about idempotence and speculative query execution before, so head over to refresh your memory if needed.

In most cases, using speculative query execution will not lead to a performance improvement in normal execution (even though in some edge cases it might), but it improves the reliability of getting a response from the server. In other cases, it might improve overall execution time, as getting the response from the fastest node does save time. One should also remember that while this can improve query reliability or overall execution time, it comes at the cost of using more resources (mostly CPU/network).

A few use cases for using speculative queries:

  1. A node you are querying is down and your SLAs require a response from the server within a timeframe that is lower than the Cassandra timeout.
  2. A node is flaky, leading to inconsistent response times or dropped queries.
  3. A node is returning timeout errors requiring a client application to retry the query on another node.

Speculative query execution has been a feature of the Java driver for quite some time now, and we have recently included similar functionality in the GoCql driver. As mentioned above, it allows a user to run the same query on multiple hosts at the same time, letting the executions compete for the winning spot. The first query to get a response wins and is returned to the client.

In order to use speculative query execution with this change, one must define the query as idempotent and provide an execution policy:

    cluster := gocql.NewCluster("", "", "")
    sp := &SimpleSpeculativeExecution{NumAttempts: 1, TimeoutDelay: 200 * time.Millisecond}
    session, err := cluster.CreateSession()
    // Build the query
    qry := session.Query("speculative").SetSpeculativeExecutionPolicy(sp).Idempotent(true)

As can be seen from the example above, we’ve used a SimpleSpeculativeExecution policy, which is the one implemented in the driver. It is a very simple policy that defines the number of additional executions (that is, in addition to the original request) and the constant delay between them. One could easily implement their own policy; for example, to have a policy that pauses incrementally longer between additional executions, one could build the following:

type IncreasingSpeculativeExecution struct {
        NumAttempts  int
        TimeoutDelay time.Duration
}

func (sp *IncreasingSpeculativeExecution) Attempts() int        { return sp.NumAttempts }
func (sp *IncreasingSpeculativeExecution) Delay() time.Duration {
    sp.TimeoutDelay += 50 * time.Millisecond
    return sp.TimeoutDelay
}
And then use it in the query execution:

    cluster := gocql.NewCluster("", "", "")
    sp := &IncreasingSpeculativeExecution{NumAttempts: 1, TimeoutDelay: 200 * time.Millisecond}
    session, err := cluster.CreateSession()
    // Build the query
    qry := session.Query("speculative").SetSpeculativeExecutionPolicy(sp).Idempotent(true)

To show an example of using speculative query executions, we’ll use a 3-node Cassandra cluster. The use case we’re going to explore is a slow node, which we’ll simulate using the simple tc tool that comes as part of the iproute2 package. Our example is going to be a bit extreme, but hopefully it conveys the idea of when speculative queries might be useful.

To simulate a slow node, run the following command on one of the nodes:

sudo tc qdisc add dev eth0 root netem delay 250ms

This adds a 250ms delay to all outbound packets on the given network device (eth0 in the above case). Then we use the following client code to run the tests:


/* Before you execute the program, Launch `cqlsh` and execute:
create keyspace example with replication = { 'class' : 'SimpleStrategy', 'replication_factor' : 3 };
create table example.tweet(timeline text, id UUID, data int, PRIMARY KEY(id));
create index on example.tweet(timeline);
*/
package main

import (
    "context"
    "flag"
    "fmt"
    "math/rand"
    "time"

    ""
)

type hostMetrics struct {
    attempts int
    latency  int
}

// The observer type to watch the queries data
type testQueryObserver struct {
    metrics map[string]*hostMetrics
    verbose bool
}

func (o *testQueryObserver) ObserveQuery(ctx context.Context, q gocql.ObservedQuery) {
    host := q.Host.ConnectAddress().String()
    curMetric := o.metrics[host]
    curAttempts := 0
    curLatency := 0
    if curMetric != nil {
        curAttempts = curMetric.attempts
        curLatency = curMetric.latency
    }
    if q.Err == nil {
        o.metrics[host] = &hostMetrics{attempts: q.Metrics.Attempts + curAttempts, latency: curLatency + int(q.Metrics.TotalLatency/1000000)}
    }
    if o.verbose {
        fmt.Printf("Observed query %q. Returned %v rows, took %v on host %q with %v attempts and total latency %v. Error: %q\n",
            q.Statement, q.Rows, q.End.Sub(q.Start), host, q.Metrics.Attempts, q.Metrics.TotalLatency, q.Err)
    }
}

func (o *testQueryObserver) GetMetrics() {
    for h, m := range o.metrics {
        fmt.Printf("Host: %s, Attempts: %v, Avg Latency: %vms\n", h, m.attempts, m.latency/m.attempts)
    }
}

// Simple retry policy for attempting the connection to 1 host only per query
type RT struct {
    num int
}

func (rt *RT) Attempt(q gocql.RetryableQuery) bool {
    return q.Attempts() <= rt.num
}

func (rt *RT) GetRetryType(err error) gocql.RetryType {
    return gocql.Rethrow
}

func main() {

    specExec := flag.Bool("specExec", false, "Speculative execution")
    flag.Parse()

    // the number of entries to insert
    cycles := 10000

    // connect to the cluster
    cluster := gocql.NewCluster("...")
    cluster.Keyspace = "example"

    // the timeout of one of the nodes is very high, so let’s make sure we wait long enough
    cluster.Timeout = 10 * time.Second
    cluster.RetryPolicy = &RT{num: 3}
    session, err := cluster.CreateSession()
    if err != nil {
        panic(err)
    }
    defer session.Close()

    observer := &testQueryObserver{metrics: make(map[string]*hostMetrics), verbose: false}
    for i := 0; i < cycles; i = i + 1 {
        r := rand.Intn(10000)
        u, _ := gocql.RandomUUID()
        query := session.Query(`INSERT INTO example.tweet (id, timeline, data) VALUES (?, 'me', ?)`, u, r).Observer(observer)
        // Create speculative execution policy with the timeout delay between following executions set to 10ms
        sp := &gocql.SimpleSpeculativeExecution{NumAttempts: 2, TimeoutDelay: 10 * time.Millisecond}
        // Specifically set Idempotence to either true or false to control normal/speculative execution
        query.SetSpeculativeExecutionPolicy(sp).Idempotent(*specExec)
        query.Exec()
    }

    // wait a sec before everything finishes
    <-time.After(1 * time.Second)

    // Print results
    observer.GetMetrics()
}
This code is also available at

This client code will insert 10000 entries into the cluster. As we’re using random numbers for the key column (id), all the queries are expected to be distributed more or less evenly among the nodes. Now, when we start the cluster and execute the client, we notice the following:

admin@ip-10-0-17-222:~/go$ time go run spectest.go


Host1: <ip>, Attempts: 3334, Avg Latency: 502ms
Host2: <ip> , Attempts: 3333, Avg Latency: 2ms
Host3: <ip>, Attempts: 3333, Avg Latency: 2ms

real    28m21.859s
user    0m2.920s
sys     0m1.828s

So it takes about half an hour to run the queries because one of the nodes has a constant delay of about half a second. When running with speculative execution mode, we get:

admin@ip-10-0-17-222:~/go$ time go run spectest.go --specExec


Host2: <ip>, Attempts: 5000, Avg Latency: 1ms
Host3: <ip>, Attempts: 4999, Avg Latency: 2ms

real    1m24.493s
user    0m3.900s
sys     0m3.072s

Not only do we never see the “problematic” node responding to a query, but the queries are also split between the 2 other “fast” nodes, and the run takes only about 1.5 minutes to complete.
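As a rough sanity check, the per-host metrics explain the wall-clock times. The back-of-the-envelope sketch below (not part of the original test client) assumes the inserts run serially, so the total time is approximately the sum over hosts of attempts × average latency:

```go
package main

import "fmt"

// hostResult mirrors the per-host metrics printed by the test client.
type hostResult struct {
	attempts   int
	avgLatency int // milliseconds
}

// totalSeconds estimates the wall-clock time of a serial run
// as the sum of attempts * average latency across all hosts.
func totalSeconds(results []hostResult) float64 {
	totalMs := 0
	for _, r := range results {
		totalMs += r.attempts * r.avgLatency
	}
	return float64(totalMs) / 1000.0
}

func main() {
	// Without speculative execution: one host answers in ~502ms.
	slow := []hostResult{{3334, 502}, {3333, 2}, {3333, 2}}
	// With speculative execution: the slow host never wins the race.
	fast := []hostResult{{5000, 1}, {4999, 2}}

	fmt.Printf("serial run estimate: ~%.0f s (~28 min)\n", totalSeconds(slow))
	fmt.Printf("speculative run estimate: ~%.0f s\n", totalSeconds(fast))
}
```

The non-speculative estimate (~1687 s) closely matches the observed 28m22s; in the speculative run, query latency is so low that client-side overhead dominates the observed 1m24s.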

As for the overhead of this improvement, there’s a jump in open sockets while running in speculative mode vs non-speculative:

Open sockets while running in speculative mode vs non-speculative

As can be seen, the number of open sockets jumps from about 147 to 153 (a roughly 4% increase). There are no significant increases in CPU, memory, or I/O utilisation.

This might be considered a good trade-off, but remember that this is what happens with a single, very simple client on a performant host (m4-2xlarge) that only ran this test. In other use cases, such as a smaller instance, a busy client, or a client running many queries in parallel, this may have a more significant effect on system resource utilisation. As a result, it is highly recommended to test the benefits of using speculative queries in your pre-production environment before deploying them in production.

After this feature was released, a few issues were identified and have already been fixed, thanks to the community of contributors. It is in good shape now and is ready for general use. Enhancements planned for this feature include running the query observer (useful for monitoring query executions) in the main goroutine, which will make it more convenient to debug failures in the client code.


The post Speculative Query Executions with GoCql appeared first on Instaclustr.

Anomalia Machina 9 – Anomaly Detection at Scale: Massively Scalable Anomaly Detection with Apache Kafka and Cassandra

In the previous blog, we deployed the Anomalia Machina application in a production-like Kubernetes environment. In this blog, we test it out and see how many anomalies it can detect at scale on small Kafka and Cassandra Instaclustr production clusters.

Coming towards the end of our epic odyssey we now have a final challenge to overcome. Can we ride out the storm and scale the application?

Anomalia Machina - Poseidon stirs up a storm

Poseidon stirs up a storm

Kubernetes Resource Management (Replicas, Pods, Nodes)

I ended my initial Kubernetes experiments (Anomalia Machina 7) by accidentally exploding my Kubernetes cluster – I overloaded the Kubernetes worker nodes with too many Pods. This time we’ll try to scale correctly. As we discovered, Kubernetes by default assumes Pods require no resources, so it will try to create an infinite number of Pods on a finite number of Nodes. Here’s the Kubernetes resource management documentation. In the Kubernetes configuration file, you can specify the minimum (requests) and maximum (limits) resources a Pod requires to run, in terms of CPU and/or memory. This is how Pods are scheduled to Nodes:

When you create a Pod, the Kubernetes scheduler selects a node for the Pod to run on. Each node has a maximum capacity for each of the resource types: the amount of CPU and memory it can provide for Pods. The scheduler ensures that, for each resource type, the sum of the resource requests of the scheduled Containers is less than the capacity of the node.

If a Pod exceeds its resource limits then it might be evicted from the node. We won’t use limits.

It’s therefore easy enough to force Kubernetes to allocate only one Pod per Node, and at the same time prevent accidental overloading of the nodes, by setting the resources request to the number of cores each EC2 instance provides. If we pick a small instance size, with say 2 cores, and use a resource request of 2 (cpu: “2000m” below) this gives us the flexibility to scale up in small increments:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: consumer-deployment
  labels:
    app: consumer
spec:
  replicas: 1
  selector:
    matchLabels:
      app: consumer
  template:
    metadata:
      labels:
        app: consumer
    spec:
      containers:
      - name: consumer
        image: myimages/consumer:latest
        ports:
        - containerPort: 1235
        resources:
          requests:
            cpu: "2000m"

This also gives us the opportunity to tune the application so it’s scalable in units of 2 cores. Our anomaly detection application pipeline consists of a Kafka consumer and a Cassandra client (and detector algorithm) in two separate thread pools. After some experimentation, we settled on 1 thread for the Kafka consumer and 20 threads for the Cassandra client as optimal. Kafka consumers are fast and return many events per poll, and Kafka scales best with fewer consumers. However, the writes and reads to/from Cassandra and the detector algorithm are slower, so more threads are needed to keep up with the consumer.
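The two-thread-pool shape of the pipeline can be sketched with goroutines and a channel. This is a simplified, hypothetical sketch (the real application uses a Kafka consumer and the Cassandra client, for which `pollBatch` and `checkAnomaly` below are stand-ins): one fast consumer goroutine fetches batches of events, and 20 workers run the slower detector stage.

```go
package main

import (
	"fmt"
	"sync"
)

const detectorWorkers = 20 // matches the 1-consumer / 20-worker split

// pollBatch stands in for a Kafka consumer poll returning many events.
func pollBatch(start, n int) []int {
	batch := make([]int, n)
	for i := range batch {
		batch[i] = start + i
	}
	return batch
}

// checkAnomaly stands in for the Cassandra writes/reads plus the detector.
func checkAnomaly(event int) bool {
	return event%97 == 0 // placeholder rule, not a real detector
}

func main() {
	events := make(chan int, 100)
	var anomalies int
	var mu sync.Mutex
	var wg sync.WaitGroup

	// The slow stage gets many workers so it keeps up with the consumer.
	for w := 0; w < detectorWorkers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for e := range events {
				if checkAnomaly(e) {
					mu.Lock()
					anomalies++
					mu.Unlock()
				}
			}
		}()
	}

	// A single fast consumer feeding the pipeline in batches.
	for batch := 0; batch < 10; batch++ {
		for _, e := range pollBatch(batch*1000, 1000) {
			events <- e
		}
	}
	close(events)
	wg.Wait()
	fmt.Println("anomalies found:", anomalies)
}
```

The channel acts as the hand-off between the pools, so the consumer never blocks on a slow detector call as long as some worker is free.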

This also means that each Pod has 1 Kafka consumer and at least one Cassandra connection (Cassandra uses connection pooling by default, so the 20 threads can use 1 or more connections depending on configuration). As we scale further, we need to ensure that the number of Kafka partitions on the topic is >= the number of Pods (otherwise any excess Kafka consumers, and therefore Pods, will be idle). Cassandra doesn’t have a maximum connection limit; each Cassandra node will easily handle hundreds of connections, and the Java client automatically manages the connection pool.
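A tiny helper (hypothetical, not from the original code) makes the partition constraint concrete: with fewer partitions than single-consumer Pods, the excess consumers sit idle.

```go
package main

import "fmt"

// idleConsumers returns how many single-consumer Pods will sit idle
// for a topic with the given number of partitions, since Kafka assigns
// each partition to at most one consumer in a group.
func idleConsumers(partitions, pods int) int {
	if pods <= partitions {
		return 0
	}
	return pods - partitions
}

func main() {
	fmt.Println(idleConsumers(10, 10)) // 0: every Pod gets a partition
	fmt.Println(idleConsumers(8, 10))  // 2: two Pods have nothing to read
}
```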

Anomaly Detection at Scale

Checking a few events for anomalies is easy, but what happens when there are many events to check?

Anomalia Machina - Sheep

In order to scale the application, we need sufficient worker nodes/EC2 instances. As discovered in the last blog, Kubernetes by default doesn’t automatically manage the instances.

Unlike pods and services, a node is not inherently created by Kubernetes: it is created externally by cloud providers.

In AWS EKS the simplest way of increasing worker nodes is to manually increase the desired and maximum values in the worker nodes’ auto scaling group. Is it possible to actually get autoscaling working? Yes, in theory. There are two aspects to Kubernetes autoscaling: Pods and Nodes.

The Horizontal Pod Autoscaler scales the number of Pods based on triggers such as CPU utilisation. As such it’s probably not a good fit for this benchmarking use case where we want each Pod to run flat out and use close to 100% of the resources available to it.  

The Cluster Autoscaler scales the worker nodes based on pending Pods.  Although the Cluster Autoscaler is the de facto standard for automatic scaling in Kubernetes, it is not part of the main release.  Here are some blogs on how to run the Cluster Autoscaler on AWS. And here’s a good blog explaining the differences between the scaling approaches.

One reason I ran into problems initially with pod scaling was that I naively assumed Kubernetes was more intelligent than it actually is (by default). I thought it could somehow automatically work out resource usage for Pods. Well, it turns out that I wasn’t completely crazy. There’s also a second Pod autoscaler called the Vertical Pod Autoscaler, which sets resource requests on pod containers automatically based on historical usage. This is a good blog which covers all three autoscalers.

Anomalia Machina - Odysseus nasty encounter with Polyphemus

Odysseus had a nasty encounter with the giant shepherd Polyphemus but used the flock of sheep to escape the blinded Cyclops.

Results: 18,000 Anomaly Checks per second

Finally, we can reveal some results. I spun up (small) production sized Cassandra (version 3.11.3) and Kafka (version 2.1.0) clusters in the same region as my AWS EKS cluster. Here are the cluster and the worker nodes and producer instance details:


Cassandra Cluster

3 nodes x 8 cores = 24 cores (i3.2xlarge, 1900GB SSD, 61GB RAM, 8 cores)

Kafka Cluster

3 nodes x 4 cores = 12 cores (r4.xlarge-750, 750GB Disk, 30.5GB RAM, 4 cores)

Kubernetes Worker Nodes

10 nodes x 2 cores = 20 cores (c5.large)

Kafka Producer

1 node x 8 cores = 8 cores (m4.2xlarge)


For simplicity, I ran the Kafka producer flat out (around 2 million TPS) for long enough to create sufficient data (Billions of events) in Kafka to run the rest of the pipeline and produce consistent results. I then gradually increased the number of Pods from 1 to the maximum number of worker nodes available, 10, by increasing the number of replicas with this command:

kubectl scale deployment consumer-deployment --replicas=10

Using the Prometheus monitoring metrics, the maximum throughput obtained for the business level metric (Anomaly Checks per second) was 18,000 TPS, with all the worker nodes and Cassandra cluster at 100% CPU Utilisation. Not surprisingly, the throughput didn’t increase further with more worker nodes. The Kafka cluster had plenty of spare CPU capacity. Here’s a graph showing these results:

Anomalia Machina - Cassandra Cluster results

What’s also informative is to understand the throughput (events in and out) for each sub-system (Kafka load generator, Kafka cluster, Kubernetes cluster, and Cassandra cluster), and compute the total system throughput as the sum of the subsystem throughputs. This will give us extra insights such as, how much of the total workload each subsystem handles, and how many events in total are required to perform each anomaly check.

We assume a Kafka load generator rate of Z events per second, and a steady state (i.e. all subsystems keep up with this event rate). Specifically, this means that the business metric, Anomaly Checks per second (step 9 below) is also equal to Z.

The subsystem throughputs are: Kafka load generator (step 1: 1 x Z), Kafka cluster (steps 2, 3, 10: 3 x Z), Kubernetes cluster (steps 4, 5, 8, 9: 4 x Z), and Cassandra cluster (steps 6, 7: 2 x Z). The total system throughput is, therefore: (1 + 3 + 4 + 2) x Z = 10 x Z.  i.e. The system throughput is an order of magnitude higher than the business metric.
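The bookkeeping above can be checked with a few lines, using the step-to-subsystem mapping from the text:

```go
package main

import "fmt"

func main() {
	z := 18000 // Anomaly Checks per second (the business metric)

	// Events handled per subsystem, as a multiple of the generator rate Z.
	subsystems := map[string]int{
		"Kafka load generator": 1, // step 1
		"Kafka cluster":        3, // steps 2, 3, 10
		"Kubernetes cluster":   4, // steps 4, 5, 8, 9
		"Cassandra cluster":    2, // steps 6, 7
	}

	total := 0
	for name, factor := range subsystems {
		fmt.Printf("%s: %d events/s\n", name, factor*z)
		total += factor * z
	}
	fmt.Println("total system throughput:", total, "events/s") // 10 x Z = 180,000
}
```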

Anomalia Machina 9 - Kubernetes Cluster Throughput

For the initial small production cluster this gives a system throughput of: 18,000 x 10 = 180,000 events/s. This stacked bar graph shows the breakdown for the subsystems and the total. The load generator rate on the bottom is the same value as Anomaly Checks per second (18,000):

Anomalia Machina - System events

Next blog we’ll keep scaling the system and see how far we can go.


The post Anomalia Machina 9 – Anomaly Detection at Scale: Massively Scalable Anomaly Detection with Apache Kafka and Cassandra appeared first on Instaclustr.

Virtual tables are coming in Cassandra 4.0

One of the exciting features coming in Cassandra 4.0 is the addition of Virtual Tables. They will expose elements like configuration settings, metrics, or running compactions through the CQL interface instead of JMX for more convenient access. This post explains what Virtual Tables are and walks through the various types that will be available in version 4.0.

Virtual Tables

The term “Virtual Tables” can be confusing, as a quick Google search may leave one under the impression that they are views that can be created through a DDL statement. In the context of Cassandra, however, Virtual Tables will be created and managed by Cassandra itself, with no possibility of creating custom ones through CQL.

They are not to be confused with Materialized Views either, which persist data from a base table into another table with a different primary key.

For Cassandra 4.0, virtual tables will be read only, trivially exposing data as CQL rows. Such data was (and will still be) accessible through JMX, which can be cumbersome to interact with and secure.

Two new keyspaces were added in Cassandra 4.0 to support Virtual Tables: system_views and system_virtual_schema.
The latter will contain schema information on the Virtual Tables, while the former will contain the actual tables.

cqlsh> select * from system_virtual_schema.tables;

 keyspace_name         | table_name    | comment
-----------------------+---------------+------------------------------
          system_views |        caches |                system caches
          system_views |       clients |  currently connected clients
          system_views |      settings |             current settings
          system_views | sstable_tasks |        current sstable tasks
          system_views |  thread_pools |                             
 system_virtual_schema |       columns |   virtual column definitions
 system_virtual_schema |     keyspaces | virtual keyspace definitions
 system_virtual_schema |        tables |    virtual table definitions

Neither of these keyspaces can be described through the DESCRIBE KEYSPACE command, so listing the rows in system_virtual_schema.tables is the only way to discover the Virtual Tables.

The tables themselves can be described as shown here:

cqlsh> describe table system_views.caches

CREATE TABLE system_views.caches (
    capacity_bytes bigint PRIMARY KEY,
    entry_count int,
    hit_count bigint,
    hit_ratio double,
    name text,
    recent_hit_rate_per_second bigint,
    recent_request_rate_per_second bigint,
    request_count bigint,
    size_bytes bigint
) WITH compaction = {'class': 'None'}
    AND compression = {};

Available Tables in 4.0

Since Apache Cassandra 4.0 was feature frozen in September 2018, we already have the definitive list of Virtual Tables that will land in that release.


Caches

The caches virtual table displays the list of caches involved in Cassandra’s read path. It contains all the necessary information to get an overview of their settings, usage, and efficiency:

cqlsh> select * from system_views.caches;

 name     | capacity_bytes | entry_count | hit_count | hit_ratio | recent_hit_rate_per_second | recent_request_rate_per_second | request_count | size_bytes
   chunks |       95420416 |          16 |       134 |  0.864516 |                          0 |                              0 |           155 |    1048576
 counters |       12582912 |           0 |         0 |       NaN |                          0 |                              0 |             0 |          0
     keys |       25165824 |          18 |        84 |  0.792453 |                          0 |                              0 |           106 |       1632
     rows |              0 |           0 |         0 |       NaN |                          0 |                              0 |             0 |          0

This information is currently available through the nodetool info command.


Clients

The clients virtual table will list all connected clients, with information such as the number of issued requests and which username each connection is using:

cqlsh> select * from system_views.clients;

 address   | port  | connection_stage | driver_name | driver_version | hostname  | protocol_version | request_count | ssl_cipher_suite | ssl_enabled | ssl_protocol | username
-----------+-------+------------------+-------------+----------------+-----------+------------------+---------------+------------------+-------------+--------------+-----------
           | 61164 |            ready |        null |           null | localhost |                4 |           146 |             null |       False |         null | anonymous
           | 61165 |            ready |        null |           null | localhost |                4 |           155 |             null |       False |         null | anonymous


Settings

The settings virtual table will list all the configuration settings exposed from the cassandra.yaml config file:

cqlsh> select * from system_views.settings limit 100;

@ Row 1
 name  | allocate_tokens_for_keyspace
 value | null

@ Row 2
 name  | audit_logging_options_audit_logs_dir
 value | /Users/adejanovski/.ccm/trunk/node1/logs/audit/

@ Row 3
 name  | audit_logging_options_enabled
 value | false

@ Row 4
 name  | audit_logging_options_excluded_categories
 value | 

@ Row 5
 name  | audit_logging_options_excluded_keyspaces
 value | 

...

@ Row 17
 name  | back_pressure_strategy
 value | {high_ratio=0.9, factor=5, flow=FAST}

@ Row 18
 name  | batch_size_fail_threshold_in_kb
 value | 50

@ Row 19
 name  | batch_size_warn_threshold_in_kb
 value | 5

@ Row 20
 name  | batchlog_replay_throttle_in_kb
 value | 1024

Here, I’ve truncated the output, as there are currently 209 settings exposed. There are plans to make this table writeable so that some settings can be changed at runtime, as can currently be done through JMX. Such changes, of course, would need to be persisted in cassandra.yaml to survive a restart of the Cassandra process.


SSTable Tasks

The sstable_tasks virtual table will expose currently running operations on SSTables, such as compactions, upgradesstables, or cleanup. For example:

cqlsh> select * from system_views.sstable_tasks ;

 keyspace_name | table_name  | task_id                              | kind       | progress | total     | unit
    tlp_stress | sensor_data | f6506ec0-3064-11e9-95e2-b3ac36f635bf | compaction | 17422218 | 127732310 | bytes

This information is currently available through the nodetool compactionstats command.


Thread Pools

The thread_pools virtual table will display the metrics for each thread pool in Cassandra:

cqlsh> select * from system_views.thread_pools ;

 name                         | active_tasks | active_tasks_limit | blocked_tasks | blocked_tasks_all_time | completed_tasks | pending_tasks
             AntiEntropyStage |            0 |                  1 |             0 |                      0 |               0 |             0
         CacheCleanupExecutor |            0 |                  1 |             0 |                      0 |               0 |             0
           CompactionExecutor |            0 |                  2 |             0 |                      0 |            3121 |             0
         CounterMutationStage |            0 |                 32 |             0 |                      0 |               0 |             0
                  GossipStage |            0 |                  1 |             0 |                      0 |           17040 |             0
              HintsDispatcher |            0 |                  2 |             0 |                      0 |               0 |             0
        InternalResponseStage |            0 |                  8 |             0 |                      0 |               0 |             0
          MemtableFlushWriter |            0 |                  2 |             0 |                      0 |              20 |             0
            MemtablePostFlush |            0 |                  1 |             0 |                      0 |              21 |             0
        MemtableReclaimMemory |            0 |                  1 |             0 |                      0 |              20 |             0
               MigrationStage |            0 |                  1 |             0 |                      0 |               0 |             0
                    MiscStage |            0 |                  1 |             0 |                      0 |               0 |             0
                MutationStage |            0 |                 32 |             0 |                      0 |               8 |             0
    Native-Transport-Requests |            1 |                128 |             0 |                      0 |             717 |             0
       PendingRangeCalculator |            0 |                  1 |             0 |                      0 |               6 |             0
 PerDiskMemtableFlushWriter_0 |            0 |                  2 |             0 |                      0 |              20 |             0
              ReadRepairStage |            0 |                  8 |             0 |                      0 |               0 |             0
                    ReadStage |            0 |                 32 |             0 |                      0 |              22 |             0
                  Repair-Task |            0 |         2147483647 |             0 |                      0 |               0 |             0
         RequestResponseStage |            0 |                  8 |             0 |                      0 |              22 |             0
                      Sampler |            0 |                  1 |             0 |                      0 |               0 |             0
     SecondaryIndexManagement |            0 |                  1 |             0 |                      0 |               0 |             0
           ValidationExecutor |            0 |         2147483647 |             0 |                      0 |               0 |             0
            ViewBuildExecutor |            0 |                  1 |             0 |                      0 |               0 |             0
            ViewMutationStage |            0 |                 32 |             0 |                      0 |               0 |             0

This information is currently available through the nodetool tpstats command.


Virtual Tables, regardless of the type, contain data that is specific to each node. They are not replicated, have no associated SSTables, and querying them will return the values of the coordinator (the node that the driver chooses to coordinate the request). They will also ignore the consistency level of the queries they are sent.

When interacting with Virtual Tables through cqlsh, results will come from the node that cqlsh connected to:

cqlsh> consistency ALL
Consistency level set to ALL.
cqlsh> select * from system_views.caches;

 name     | capacity_bytes | entry_count | hit_count | hit_ratio | recent_hit_rate_per_second | recent_request_rate_per_second | request_count | size_bytes
   chunks |       95420416 |          16 |       134 |  0.864516 |                          0 |                              0 |           155 |    1048576
 counters |       12582912 |           0 |         0 |       NaN |                          0 |                              0 |             0 |          0
     keys |       25165824 |          18 |        84 |  0.792453 |                          0 |                              0 |           106 |       1632
     rows |              0 |           0 |         0 |       NaN |                          0 |                              0 |             0 |          0

(4 rows)

Tracing session: 06cb2100-3060-11e9-95e2-b3ac36f635bf

 activity                                                                 | timestamp                  | source    | source_elapsed | client
                                                       Execute CQL3 query | 2019-02-14 14:54:20.048000 | |              0 |
 Parsing select * from system_views.caches; [Native-Transport-Requests-1] | 2019-02-14 14:54:20.049000 | |            390 |
                        Preparing statement [Native-Transport-Requests-1] | 2019-02-14 14:54:20.049000 | |            663 |
                                                         Request complete | 2019-02-14 14:54:20.049424 | |           1424 |

When interacting through the driver, there is no simple way of selecting a single node as coordinator: the load balancing policy is responsible for this, and it is set on the Cluster object, not on a per-query basis.
For the DataStax Java Driver, a new feature was introduced through JAVA-1917 to ease Virtual Tables access by supporting the selection of a specific node. It adds a setNode(Node node) method to the Statement class in order to forcefully designate the node responsible for the query, and “voilà”.
For the record, the same feature was added to the Python driver.

Beyond Apache Cassandra 4.0

What is currently missing from Virtual Tables are global and table-level metrics such as latencies and throughputs (Cassandra exposes A LOT of table-specific metrics beyond those two).
Rest assured that these are being worked on through two different approaches in CASSANDRA-14670 and CASSANDRA-14572, which were not ready in time for the feature freeze.

It will probably take some time for Virtual Tables to match the amount of data available through JMX but we are confident it will catch up eventually.
Convenient and secure CQL access to runtime metrics in Apache Cassandra will tremendously ease building tools like Reaper which currently rely on JMX.

SLAs – Watch Out for the Shades of Gray

The Service Level Agreement (SLA) is an integral part of an MSP’s (Managed Service Provider) business. Its purpose is to define the scope of services that the MSP offers, including:

  • guarantees on metrics relevant to their business and technology
  • customer responsibilities
  • issue management
  • compensation commitment when MSPs fail to deliver on the SLA.

It should set the customer’s expectations, be realistic, and be crystal clear, with no scope for misinterpretation. Well, that’s how they are meant to be. But unfortunately, in the quest for more sales, some MSPs tend to commit themselves to unrealistic SLAs. It’s tempting to buy into a service when an MSP offers you 100% availability. It is even more tempting when you see a compensation clause that gives you confidence in going ahead with that MSP. But hold on! Have you checked out the exclusion clauses? Have you checked out your responsibilities in order to get what you are entitled to in the SLA? Just as it is the MSP’s responsibility to define a crystal-clear SLA, it is the customer’s responsibility to thoroughly understand the SLA and be aware of every clause in it. That is how you will notice the shades of gray!

We have put together a list of things to look for in an SLA so that customers are aware of the nuances involved and avoid unpleasant surprises after signing on.

Priced SLAs

Some MSPs provide a baseline SLA for their service, and customers wishing to receive higher levels of commitment may need to fork out extra money.

At Instaclustr, we have a different take on this. We have four tiers of SLA — not to have customers pay more but because our SLA approach is based on the capability and limitations of the underlying technology we are servicing. The SLA tiers are based on the number of nodes in a cluster and a set of responsibilities that customers are prepared to commit to. Our customers do not pay extra for a higher level of SLA. They pay for the number of nodes in the cluster. With more nodes come the higher levels of SLA.



Compensation

While MSPs do their best in delivering on their SLA commitments, sometimes things go south. In scenarios where an SLA metric is not delivered, MSPs provide compensation. Typically, it is paid in credit or as a cut in the monthly bill. Customers should look into how the compensation is calculated: some compensate based on the length of downtime, while others may compensate with a flat reduction in the monthly bill irrespective of the length of downtime.

Instaclustr compensates its customers fairly, with a flat reduction in their monthly bill no matter how small the downtime is. The exact rate varies based on the SLA tier the customer’s cluster falls in. For example, a customer with a cluster of 12 production nodes (Critical tier) is entitled to compensation of up to 100% of their monthly bill, with each violation of the availability SLA capped at 30% and each violation of the latency SLA capped at 10%. Every time we fail to deliver on an SLA metric, there is a big impact on Instaclustr’s business — but we like the challenge. Continuously improving uptime and performance is at the core of our business.

Issue Management

Another important factor customers have to look at in an SLA is how the MSP handles issues. Issue management commonly includes communication touchpoints, named contacts, escalation procedure, issue impact and severity levels, first-response time and, in some cases, resolution time. Customers should familiarize themselves with each of these aspects and make their internal teams aware of them.

Instaclustr customers can get this information in our Support Policy document, which gives clear details on issue management and sets the right expectations.

Customer Responsibilities

Although the primary purpose of an SLA is for MSPs to provide guarantees on key metrics such as availability and latency, MSPs usually add a “help us help you” clause. It basically means: in order for MSPs to uphold those SLA guarantees, customers have certain responsibilities that they have to commit to. If your SLA doesn’t have customer responsibilities recorded, talk to your MSP and get it clarified first thing.

Instaclustr’s SLA has a list of customer responsibilities that must be met in order for us to deliver on those SLA guarantees. Each SLA tier, which is based on the number of nodes in a cluster, has a different level of responsibilities: customers take on more responsibilities to receive higher levels of SLA guarantees. For example, the “Small” tier for Kafka requires customers to maintain a minimum of 3 replicas per Kafka topic to get the SLA guarantees that the tier promises, while the “Critical” tier for a Kafka cluster requires customers not only to maintain 3 replicas per topic but also to maintain separate testing and production clusters. This transparency upfront avoids uncertainty and unpleasant surprises if an issue arises.

Service/Technology Dependent SLA

Instaclustr’s SLA guarantees are defined on the basis of the technology under management and its cluster configuration. For instance, a customer with a Kafka cluster comprising 5 production nodes will be guaranteed 99.95% availability for writes but is given no guarantee for latency. However, if the same customer has another Kafka cluster with 12 production nodes (and meets other documented conditions), they will be guaranteed 99.999% availability and 99th percentile latency. Similarly, Cassandra clusters of different sizes come with different tiers of SLA—basically, the larger the cluster (more nodes), the higher the availability of data. This guarantee is backed by our experience in providing massive-scale technology as a fully managed service. Instaclustr simply offers the best SLA realistically possible for the technology and the cluster configuration.

If your MSP is promising a 100% availability SLA irrespective of the technology under management, its size and its configuration, that is simply not realistic. Be sure to check for exclusions and also review the compensation clause to make sure it is substantial.

Hold Harmless Clauses

This is probably the most important section to watch out for in an SLA. MSPs need to protect themselves from situations where an SLA guarantee isn’t met because of conditions outside their control. MSPs operate in several autonomous environments with several technologies and integrations. Even with best practices in place, sometimes something will go wrong due to no fault on the part of an MSP.

For example, Instaclustr has this clause: “All service levels exclude outages caused by non-availability of service at the underlying cloud provider region level or availability zone level in regions which only support two availability zones”. Clauses like this are critical from an MSP’s perspective as they ensure they aren’t vulnerable to conditions outside their control—simply because we can’t do anything about such large-scale disruptions in the underlying cloud infrastructure. However, an unreasonable exclusion would be to exclude failures of virtual machines (VM) in the cloud. With hundreds or thousands of VMs running on a cloud, it is normal to expect VM failures. The large-scale technologies we operate can easily be designed to handle VM (node) failures without hurting availability. If your MSP has a VM failure exclusion clause, it is time to have a conversation.

Another necessary exclusion is significant, unexpected changes in application behavior. An MSP has limited visibility into changes in your application environment (either business-level or technical) that may impact the delivered services. However, we do have the shared objective with our customers of maximising their cluster’s performance and availability. Communicating, and in some cases testing, significant changes before deploying on production is necessary to ensure the service can cope with the change. Instaclustr SLAs for latency exclude “unusual/unplanned workload on cluster caused by customer’s application”, and generally exclude “issues caused by customer actions including but not limited to attempting to operate a cluster beyond available processing or storage capacity”. Customers can avoid being impacted by these clauses by contacting our technical operations team ahead of significant changes to manage the risks.

It is common for MSPs to add a third-party exclusion clause. This means: if an issue is found to be in a third-party application they integrate with, and where they have no control, they are safeguarded. But Instaclustr does not have any third-party application related exclusion. Given that we operate in the open source technology space, adding a clause like this would be protecting ourselves in relation to any issue in Kafka or Cassandra—which is what our customers are paying us to manage. If your MSP has third-party application related exclusions, it is time to closely review them if you haven’t done so already.

Customers should understand these exclusions in depth and make sure they are not unreasonable. This will need thorough review from technical and legal teams.

It’s Not Just About Uptime

Over the years, uptime or availability has become the default term when discussing an SLA with MSPs. It is definitely the most important metric you would want covered. However, there are many other metrics that could be very valuable to a customer’s business. It is important to make sure that the SLA covers the right metrics—the ones that matter to the customer based on the application and the technology under management.

Instaclustr has always included availability and latency in SLAs, as they are two key metrics for the technologies we manage. We are glad to announce the recent addition of the Recovery Point Objective (RPO) metric to our SLA. We take daily backups, with an option of continuous backup (5-minute intervals), and our technical operations team has always treated data recovery as a top-priority incident, as it impacts our customers’ businesses. So it just made sense to add an RPO guarantee to our SLA.

SLAs Are Negotiable

SLAs are a mutual contract—in fact, a legal agreement between two parties. Although it is primarily the MSP’s responsibility to draft the initial SLA, customers have the right to negotiate. Customers should look into it thoroughly and negotiate if something important to their business is missing before signing up for it. Instaclustr Sales and Customer success teams handle most of these discussions and we welcome any SLA requirements from customers and are open to negotiation.

SLAs Aren’t Everything

SLAs are a quantitative, contractual promise with financial penalties for failures. They are important for covering the most important aspects of service delivery. However, SLAs can never capture every aspect of delivery of a high-quality service that meets the customer’s needs. At Instaclustr, meeting our SLAs is a core focus but just the first step in providing a service that meets and exceeds our customers’ expectations. For example, ticket response time guarantees tell you how quickly you will get a first response to a ticket but not the quality of response. At Instaclustr we have expert engineers as the first responders to all tickets to provide the best quality response possible.

Instaclustr SLA Update

As part of our commitment to keep our SLAs relevant and realistic, we regularly review them on the basis of quantitative evidence. We have recently updated our SLAs to strengthen our commitments. Here is a summary of the changes.

  • Cassandra
    • Recovery Point Objective (RPO) is added as a new metric, stating: “We will maintain backups to allow a restoration of data with less than 24 hours data loss for standard backups and less than 5 minutes data loss for our Continuous Back-ups option. Should we fail to meet this Recovery Point Objective, you will be eligible for SLA credits of 100% of monthly fees for the relevant cluster. If you have undertaken restore testing of your cluster in the last 6 months (using our automated restore functionality) and can demonstrate that data loss during an emergency restore is outside target RPO and your verification testing, then you will be eligible for SLA credits of 500% of monthly fees”.
    • Starter plan: increase from 99.9% to 99.95% for availability target with best effort.
    • Small plan: 99.9% availability for consistency ONE has been increased to 99.95% availability for consistency LOCAL_QUORUM.
    • Enterprise plan: 99.95% availability has been increased to 99.99%.
  • Kafka
    • Starter plan: increase from 99.9% to 99.95% for availability target with best effort.
    • Small plan: increase from 99.9% availability to 99.95%.
    • Enterprise plan: increase from 99.95% availability to 99.99%.
    • Critical plan: increase from 99.99% availability to 99.999%.


More details and the complete SLA can be found on the Policy documentation page on the Instaclustr website.

The post SLAs – Watch Out for the Shades of Gray appeared first on Instaclustr.

Upgraded Compute Intensive Node Sizes for AWS Clusters

Last year, Instaclustr upgraded its support for high throughput, compute intensive nodes on top of the AWS platform. This involved switching away from the aging c3 instance types and onto the newer c5d instance type. The new c5d instances provide better performance and compute power, and are better value for money than the c3 instances.

The new Instaclustr node type is based on the c5d.2xlarge compute image and is provisioned with 8 vCPUs, 16GiB RAM and 200GB NVMe SSD storage. These nodes are suitable for high throughput, low latency deployments of Apache Cassandra.

But how much better are the c5d instances? We ran our standard Cassandra benchmarking tests on c3 and c5d instances running Cassandra clusters to find out. These tests cover a combination of pure read, pure write and mixed read/write operations, with varying data sizes to evaluate Cassandra’s performance on these node sizes.

Below are the results for our basic latency and throughput tests on clusters with 3 nodes, running Cassandra version 3.11.3.


We measure latency by performing tests with a static number of operations per second across all test clusters. This number of operations is calibrated to represent a moderate load on the least powerful cluster under test. This shows a difference in the amount of time taken to process the same number of requests.

Small Operations

For small sized operations, the c3s showed 87% higher median latency on insert operations, and 136% higher median latency on read operations.

Latency - Small Operations

Medium Operations

Medium sized operations show 33% higher latency on inserts on c3s, stretching out to 90% higher latency for read operations.

Latency - Medium Operations


Then we measure a cluster’s throughput by starting at a small number of operations per second and increasing it until the cluster starts to become overloaded (indicated by median latency exceeding 20ms for reads, 5ms for writes, or more than 20 pending compactions one minute after the load completes). This tells us whether a cluster is able to perform more or fewer requests in the same amount of time.
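The overload criteria above can be written down as a simple predicate. A sketch of the benchmark’s stopping condition, using the thresholds quoted in the text:

```python
def cluster_overloaded(median_read_ms, median_write_ms, pending_compactions):
    """Return True when any of the overload thresholds from our benchmark is hit."""
    return (median_read_ms > 20           # reads slower than 20ms median
            or median_write_ms > 5        # writes slower than 5ms median
            or pending_compactions > 20)  # backlog one minute after load completes

# Ramp operations/second upward until the predicate trips.
print(cluster_overloaded(12.0, 3.1, 4))   # False: cluster keeping up
print(cluster_overloaded(25.5, 3.1, 4))   # True: read latency threshold exceeded
```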

Small Operations

For small sized operations, we can see that c5d nodes allow for an 80% improvement in insert operations per second and a huge 125% improvement in read operations per second.

Throughput - Small Operations

Medium Operations

While for medium data sizes we have the slightly more modest 33% improvement in insert operations per second and the still significant 93% improvement in read operations per second.

Throughput - Medium Operations


Comparing those performance figures against the modest additional price of the c5d instances (approximately 15% more expensive, with some variance depending on the region), we can see that the c5ds represent a strong value-for-money improvement over the c3s.
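A rough way to weigh these numbers is throughput gained per dollar spent. An illustrative calculation, assuming the approximate 15% price premium and the read improvements reported above:

```python
def value_per_dollar_gain(throughput_multiplier, price_multiplier):
    """How much more throughput per dollar the new instance type delivers."""
    return throughput_multiplier / price_multiplier

# Small-op reads: +125% throughput (2.25x) for ~15% more cost (1.15x).
print(round(value_per_dollar_gain(2.25, 1.15), 2))  # 1.96
# Medium-op reads: +93% throughput (1.93x) for the same premium.
print(round(value_per_dollar_gain(1.93, 1.15), 2))  # 1.68
```

Anything above 1.0 means more throughput per dollar, so both workloads come out well ahead on c5d.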

Instaclustr supports the c5d instances in the following AWS regions:

  • US-East-1 (N. Virginia)
  • US-East-2 (Ohio)
  • US-West-1 (Northern California)
  • US-West-2 (Oregon)
  • Europe-Central-1 (Frankfurt)
  • Europe-West-1 (Ireland)
  • Europe-West-2 (London)
  • Asia-Pacific-Southeast-1 (Singapore)

If you already have a c3 cluster with Instaclustr, don’t worry, these will continue to be supported as normal.

Instaclustr supports a range of machine size configurations across all its cloud providers to make sure there is a size perfect for your needs. To see what we’re offering with the new c5d nodes or any of our other existing node sizes, visit our console or contact us.

The post Upgraded Compute Intensive Node Sizes for AWS Clusters appeared first on Instaclustr.

Apache Cassandra - Data Center Switch

Did you ever wonder how to change the hardware efficiently on your Apache Cassandra cluster? Or did you maybe read this blog post from Anthony last week about how to set up a balanced cluster, found it as exciting as we do, but are still unsure how to change the number of vnodes?

A Data Center “Switch” might be what you need.

We will go through this process together hereafter. Additionally, if you’re just looking at adding or removing a data center from an existing cluster, these operations are described as part of the data center switch. I hope the structure of the post makes it easy for you to find the part you need!

Warning/Note: If you are unable to get hardware easily, this won’t work for you. The Data Center (DC) Switch is well adapted to cloud environments or for teams with spare/available hardware, as we will need to create a new data center before we can actually free up the previously used machines. Generally we use the same number of nodes in the new data center, but this is not mandatory. If the hardware changes, you’ll want to use a proportional number of machines to keep performance unchanged.


First things first, what is a “Data Center Switch” in our Apache Cassandra context?

The idea is to transition to a new data center, freshly added for this operation, and then to remove the old one. In between, clients need to be switched to the new data center.

Logical isolation / topology between data centers in Cassandra helps keep this operation safe and allows you to rollback the operation at almost any stage and with little effort. This technique does not generate any downtime when it is well executed.

Use Cases

A DC Switch can be used for changes that cannot be easily or safely performed within an existing data center, through JMX, or even with a rolling restart. It can also allow you to make changes such as modifying the number of vnodes in use for a cluster, or to change the hardware without creating unwanted transitional states where some nodes would have distinct hardware (or operating system).

It could also be used for a very critical upgrade. Note that with an upgrade it’s important to keep in mind that streaming in a cluster running mixed versions of Cassandra is not recommended. In this case, it’s better to add the new data center using the same Cassandra version, feed it with the data from the old data center, and only then upgrade the Cassandra version in the new data center. When the new data center is considered stable, the clients can be routed to the new data center and the old one can be removed.

It might even fit your needs in some cases I cannot think of right now. Once you know and understand this process, you might find yourself making original use of it.

Limitations and Risks

  • Again, this is a good fit for a cloud environment or when getting new servers. In some cases it might not be possible (or is not worth it) to double the hardware in use, even for a short period of time.
  • You cannot use the DC Switch to change global settings like the cluster_name, as this value is unique for the whole cluster.

How-to - DC Switch

This section provides a detailed runbook to perform a Data Center Switch.

We will cover the following aspects of a data center switch.

Phases described here are mostly independent, and at the end of each phase the cluster should be in a stable state. You can run only Phase 1 (add a DC) or Phase 3 (remove a DC), for example. You should always read and pay attention to Phase 0, though.


Before we start, it is important to note that until the very last step, anything can be easily rolled back. You can always safely and quickly go back to the previous state during the procedure. Simply stated, you should be able to run the opposite commands to the ones shown here, in reverse order, until the cluster is in an acceptable state. It’s good to check (and store!) the value of each configuration we are about to change, before any step.

Keep in mind that after each phase below the cluster is in a stable state, and can remain like this for a long time, or forever, without problems.

Phase 0 - Prepare Configuration for Multiple Data Centers

First, the preparation phase. We want to make sure Cassandra and Clients will react as expected in a Multi-DC environment.

Server Side (Cassandra)

Step 1: All the keyspaces use NTS

  • Confirm that each user keyspace is using the NetworkTopologyStrategy:
$ cqlsh -e "DESCRIBE KEYSPACES;"
$ cqlsh -e "DESCRIBE KEYSPACE <my_ks>;" | grep replication
$ cqlsh -e "DESCRIBE KEYSPACE <my_other_ks>;" | grep replication

If not, change it to be NetworkTopologyStrategy.

  • Confirm that all the keyspaces (including system_*, possibly opscenter, …) also use either the NetworkTopologyStrategy or the LocalStrategy. To be clear:
    • keyspaces using SimpleStrategy (i.e. possibly system_auth, system_distributed, …) should be switched to NetworkTopologyStrategy.
    • system keyspace or any other keyspace, using LocalStrategy should not be changed.

Note: SimpleStrategy is not data center aware, and none of the keyspaces should use it in a Multi-DC context. This is because client operations would touch the distinct data centers to answer reads, breaking the expected isolation between the data centers.

$ cqlsh -e "DESCRIBE KEYSPACE system_auth;" | grep replication
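The keyspace checks above are easy to script. A minimal sketch that scans DESCRIBE output for keyspaces still using SimpleStrategy (the helper and the sample output are illustrative, not a supported tool):

```python
import re

def keyspaces_using_simple_strategy(describe_output):
    """Return keyspace names whose replication class is SimpleStrategy."""
    hits = []
    for match in re.finditer(
            r"CREATE KEYSPACE (\w+) WITH replication = \{'class': '(\w+)'",
            describe_output):
        keyspace, strategy = match.groups()
        if strategy == "SimpleStrategy":
            hits.append(keyspace)
    return hits

# Fabricated DESCRIBE output for illustration:
sample = (
    "CREATE KEYSPACE tlp_labs WITH replication = {'class': 'SimpleStrategy', "
    "'replication_factor': '3'};\n"
    "CREATE KEYSPACE system_auth WITH replication = {'class': "
    "'NetworkTopologyStrategy', 'old-dc': '3'};\n"
)
print(keyspaces_using_simple_strategy(sample))  # ['tlp_labs']
```

Any keyspace this flags should be switched to NetworkTopologyStrategy before the new data center is added.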

In case you need to change this, do something like:

-- Custom Keyspace
ALTER KEYSPACE tlp_labs WITH replication = {
  'class': 'NetworkTopologyStrategy',
  'old-dc': '3'
};

-- system_* keyspaces from SimpleStrategy to NetworkTopologyStrategy
ALTER KEYSPACE system_auth WITH replication = {
  'class': 'NetworkTopologyStrategy',
  'old-dc': '3' -- Or more... But it's not today's topic ;-)
};
ALTER KEYSPACE system_distributed WITH replication = {
  'class': 'NetworkTopologyStrategy',
  'old-dc': '3'
};
ALTER KEYSPACE system_traces WITH replication = {
  'class': 'NetworkTopologyStrategy',
  'old-dc': '3'
};


This process might have consequences on data availability; be sure to understand the impact on token ownership so you don’t break service availability.

You can avoid this problem by mirroring the previous distribution that SimpleStrategy was producing. By using NetworkTopologyStrategy in combination with GossipingPropertyFileSnitch to copy the previous SimpleStrategy/SimpleSnitch behaviour (data center and rack names), you should be able to make this change a ‘non-event’, where nothing actually happens from the Cassandra topology perspective.

In other words, if both topologies result in the same logical placement of the nodes, then there is no movement and no risk. If the operation results in a topology change (e.g. two clusters previously considered as one), it’s good to consider the consequences ahead of time and to run a full repair after the transition.

Step 2: Check the snitch

Make sure the existing nodes are not using the SimpleSnitch.

Instead, the snitch must be one that considers the data center (and racks).

For example:

 endpoint_snitch: Ec2Snitch

or

 endpoint_snitch: GossipingPropertyFileSnitch

If your cluster is using SimpleSnitch at the moment, be careful changing this value, as you might induce a change in the topology or where the data belongs. It is worth reading in detail about this specific topic if that is the case.

That’s it for now on Cassandra, we are done configuring the server side.

Client Side (Cassandra)

Since there are many clients out there, you might have to adapt this to your specific case or driver. I’ll take the DataStax Java driver as an example here.

All the clients using the cluster should go through the checks/changes below.

Step 4: Use a DCAware policy and disable remote connections

Cluster cluster = Cluster.builder()
        .addContactPoint("<old_dc_contact_point>")
        .withLoadBalancingPolicy(DCAwareRoundRobinPolicy.builder()
                .withLocalDc("<old_dc_name>")
                .withUsedHostsPerRemoteDc(0) // do not fall back to remote DCs
                .build())
        .build();

Step 5: Pin connections to the existing data center

It’s important to use a consistency level that aims at retrieving the data from the current data center and not across the whole cluster. In general, use a consistency of the form: LOCAL_*. If you were using QUORUM before, change it to LOCAL_QUORUM for example.

At this point, all the clients should be ready to receive the new data center. Clients should now ignore the new data center’s nodes and only contact the local data center as defined above.

Phase 1 - Add a New Data Center

Step 6: Create and configure new Cassandra nodes

Choose the right hardware and number of nodes for the new data center, then bring the machines up.

Configure the Cassandra nodes exactly like the old nodes, except for the configuration you intend to change with the new DC, along with the data center name. The data center name is defined depending on the snitch you picked: it can be determined by the IP address, a file, or the AWS region name.

To perform this change using GossipingPropertyFileSnitch, edit the cassandra-rackdc.properties file on all new nodes:

dc=<new_dc_name>
rack=<rack_name>
To create a new data center in the same region in AWS, you have to set the dc_suffix option in the cassandra-rackdc.properties file on all new nodes:

dc_suffix=-awesome-cluster
# to have us-east-1-awesome-cluster for example

Step 7: Add new nodes to the cluster

Nodes can now be added, one at a time. Just start Cassandra using the service start method specific to your operating system. For example, on my Linux systems we can run:

service cassandra start


  • Start with the seed nodes for this new data center. Using two or three nodes as seeds per DC is a standard recommendation.
    -seeds: "<old_ip1>, <old_ip2>, <old_ip3>, <new_ip1>, <new_ip2>, <new_ip3>"
  • There should be no streaming; adding a node should be quick - check the logs to make sure of it: tail -fn 100 /var/log/cassandra/system.log
  • Due to the previous point, the nodes should join quickly as part of the new data center; check nodetool status. Make sure a node appears as UN before moving to the next node.

Step 8: Start accepting writes on the new data center

The next step is to accept writes for this data center by changing the topology so the new DC is also part of replication strategy:

ALTER KEYSPACE <my_ks> WITH replication = {
        'class': 'NetworkTopologyStrategy',
        '<old_dc_name>': '<replication_factor>',
        '<my_new_dc>': '<replication_factor>'
};
ALTER KEYSPACE <my_other_ks> WITH replication = {
        'class': 'NetworkTopologyStrategy',
        '<old_dc_name>': '<replication_factor>',
        '<my_new_dc>': '<replication_factor>'
};

Include system keyspaces that should now be using the NetworkTopologyStrategy for replication:

ALTER KEYSPACE system_auth WITH replication = {
  'class': 'NetworkTopologyStrategy',
  '<old_dc_name>': '<replication_factor>',
  '<new_dc_name>': '<replication_factor>'
};
ALTER KEYSPACE system_distributed WITH replication = {
  'class': 'NetworkTopologyStrategy',
  '<old_dc_name>': '<replication_factor>',
  '<new_dc_name>': '<replication_factor>'
};
ALTER KEYSPACE system_traces WITH replication = {
  'class': 'NetworkTopologyStrategy',
  '<old_dc_name>': '<replication_factor>',
  '<new_dc_name>': '<replication_factor>'
};

Note: Here again, do not modify keyspaces using LocalStrategy.

To make sure that the keyspaces were altered as expected, you can see the ownership with:

nodetool status <my_ks>

Note: this is a good moment to detect any imbalance in ownership and fix any issues, before actually streaming the data to the new data center. We detailed this part in the post “How To Set Up A Cluster With Even Token Distribution” that Anthony wrote. You should really check this if you plan to use a low number of vnodes; if you don’t, you might go through an operational nightmare trying to handle imbalances.
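One quick way to eyeball ownership balance before streaming is to parse the ownership column of nodetool status. A hypothetical helper (the sample output below is fabricated for illustration):

```python
import re

def ownership_spread(nodetool_status_output):
    """Return (min, max) ownership percentages found in nodetool status output."""
    owns = [float(pct) for pct in
            re.findall(r"(\d+\.\d+)%", nodetool_status_output)]
    return min(owns), max(owns)

# Fabricated, trimmed-down nodetool status output:
sample = """
UN  10.0.0.1  105 GiB  4  32.5%  a1b2...  rack1
UN  10.0.0.2  98 GiB   4  33.1%  c3d4...  rack2
UN  10.0.0.3  101 GiB  4  34.4%  e5f6...  rack3
"""
low, high = ownership_spread(sample)
print(high - low)  # a small spread means the ring is well balanced
```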

Step 9: Stream historical data to the new data center

Stream the historical data to the new data center to fill the gap, as the new cluster is now receiving writes, but is still missing all the past data. This is done by running this on all the nodes of the new data center:

nodetool rebuild <old_dc_name>

The output (in the logs) should look like:

INFO  [RMI TCP Connection(8)-] ... - rebuild from dc: ..., ..., (All tokens)
INFO  [RMI TCP Connection(8)-] ... - [Stream ...] Executing streaming plan for Rebuild
INFO  [StreamConnectionEstablisher:1] ... - [Stream ...] Starting streaming to ...
INFO  [StreamConnectionEstablisher:2] ... - [Stream ...] Beginning stream session with ...


  • nodetool setstreamthroughput X can help reduce the burden the streaming puts on the nodes answering requests or, the other way around, make the transfer faster.
  • A good way to know when the rebuild has finished is to run the command above from a screen or tmux session, for example: screen -R rebuild.

Phase 2 - Switch Clients to the new DC

At this point the new data center can be tested and should be a mirror of the previous one, except for the things you changed of course.

Step 10: Client Switch

The clients can now be routed to the new data center. To do so, change the contact points and the data center name. Doing this one client at a time, while observing the impact, is probably the safest way when there are many clients connected to a single cluster. Back to our Java driver example, it would now look like this:

Cluster cluster = Cluster.builder()
        .addContactPoint("<new_dc_contact_point>")
        .withLoadBalancingPolicy(DCAwareRoundRobinPolicy.builder()
                .withLocalDc("<new_dc_name>")
                .withUsedHostsPerRemoteDc(0)
                .build())
        .build();

Note: Before going forward you can (and probably should) make sure that no client is connected to old nodes anymore. You can do this in a number of ways:

  • Look at netstat for open (native or Thrift) connections:
    netstat -tupawn | grep -e 9042 -e 9160
  • Check that the node does not receive local reads (i.e. ReadStage should not increase) and does not act as a coordinator (i.e. RequestResponseStage should not increase).
    watch -d "nodetool tpstats"
  • Monitoring system/Dashboards: Ensure there are no local reads heading to the old data center.

Phase 3 - Remove the old DC

Step 11: Stop replicating data in the old data center

Alter the keyspaces so they no longer reference the old data center:

ALTER KEYSPACE <my_ks> WITH replication = {
        'class': 'NetworkTopologyStrategy',
        '<my_new_dc>': '<replication_factor>'
};
ALTER KEYSPACE <my_other_ks> WITH replication = {
        'class': 'NetworkTopologyStrategy',
        '<my_new_dc>': '<replication_factor>'
};


ALTER KEYSPACE system_auth WITH replication = {
  'class': 'NetworkTopologyStrategy',
  '<new_dc_name>': '<replication_factor>'
};
ALTER KEYSPACE system_distributed WITH replication = {
  'class': 'NetworkTopologyStrategy',
  '<new_dc_name>': '<replication_factor>'
};
ALTER KEYSPACE system_traces WITH replication = {
  'class': 'NetworkTopologyStrategy',
  '<new_dc_name>': '<replication_factor>'
};

Step 12: Decommission old nodes

Finally we need to get rid of the old nodes. Stopping the nodes is not enough as Cassandra will continue to expect them to come back to life anytime. We want them out, safely, but once and for all. This is what a decommission does. To cleanly remove the old data center we need to decommission all the nodes in this data center, one by one.

On the bright side, the operation should be almost instantaneous (or at least very quick), because this data center no longer owns any data from a Cassandra perspective. Thus, there should be no data to stream to other nodes. If streaming happens, you probably forgot about a keyspace using SimpleStrategy, or a NetworkTopologyStrategy keyspace that still references the old data center.

Sequentially, on each node of the old data center, run:

$ nodetool decommission

This should be fast, if not immediate, as this command should trigger no streaming at all thanks to the changes we made to the keyspace replication configuration. This data center should not own any token ranges anymore, as we removed it from all the keyspaces in the previous step.

Step 13: Remove old nodes from the seeds

To remove any reference to the old data center, we need to update the cassandra.yaml file.

-seeds: "<new_ip1>, <new_ip2>, <new_ip3>"

And that’s it! You have now successfully switched over to a new Data Center. Of course during the process, ensure that the changes you just made are actually acknowledged in this new data center.

How To Set Up A Cluster With Even Token Distribution

Apache Cassandra is fantastic for storing large amounts of data and being flexible enough to scale out as the data grows. This is all fun and games until the data that is distributed in the cluster becomes unbalanced. In this post we will go through how to set up a cluster with predictive token allocation using the allocate_tokens_for_keyspace setting, which will help to evenly distribute the data as it grows.

Unbalanced clusters are bad mkay

An unbalanced load on a cluster means that some nodes will contain more data than others. An unbalanced cluster can be caused by the following:

  • Hot spots - by random chance one node ends up responsible for a higher percentage of the token space than the other nodes in the cluster.
  • Wide rows - due to data modelling issues, for example a partition row which grows significantly larger than the other rows in the data.

The above issues can have a number of impacts on individual nodes in the cluster, however this is a completely different topic and requires a more detailed post. In summary though, a node that contains disproportionately more tokens and/or data than other nodes in the cluster may experience one or more of the following issues:

  • Run out of storage more quickly than the other nodes.
  • Serve more requests than the other nodes.
  • Suffer from higher read and write latencies than the other nodes.
  • Time to run repairs is longer than other nodes.
  • Time to run compactions is longer than other nodes.
  • Time to replace the node if it fails is longer than other nodes.

What about vnodes, don’t they help?

Both issues that cause data imbalance in the cluster (hot spots, wide rows) can be prevented by manual control. That is, specify the tokens using the initial_token setting in the cassandra.yaml file for each node and ensure your data model evenly distributes data across the cluster. The second control measure (data modelling) is something we always need to do when adding data to Cassandra. The first point however, defining the tokens manually, is cumbersome to do when maintaining a cluster, especially when growing or shrinking it. As a result, token management was automated early on in Cassandra (version 1.2 - CASSANDRA-4119) through the introduction of Virtual Nodes (vnodes).

Vnodes break up the available range of tokens into smaller ranges, defined by the num_tokens setting in the cassandra.yaml file. The vnode ranges are randomly distributed across the cluster and are generally non-contiguous. If we use a large number for num_tokens to break up the token ranges, the random distribution means it is less likely that we will have hot spots. Using statistical computation, the point where clusters of any size always had a good token range balance was when 256 vnodes were used. Hence, the num_tokens default value of 256 was the value recommended by the community to prevent hot spots in a cluster. The problem here is that performance for operations requiring token-range scans (e.g. repairs, Spark operations) will tank big time. It can also cause problems with bootstrapping due to the large number of SSTables generated. Furthermore, as Joseph Lynch and Josh Snyder pointed out in a paper they wrote, the higher the value of num_tokens in large clusters, the higher the risk of data unavailability.
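The hot-spot effect of random allocation is easy to demonstrate with a small simulation: place num_tokens random tokens per node on the ring and compare the worst-case ownership imbalance. This is illustrative only; it ignores racks and replication:

```python
import random

def max_min_ownership_ratio(num_nodes, num_tokens, seed=42):
    """Simulate random token allocation and return max/min node ownership."""
    random.seed(seed)
    # Each node gets num_tokens random positions on the 2**64 token ring.
    ring = sorted(
        (random.randrange(2**64), node)
        for node in range(num_nodes)
        for _ in range(num_tokens))
    owned = [0] * num_nodes
    prev = ring[-1][0] - 2**64  # wrap the ring around
    for token, node in ring:
        owned[node] += token - prev  # a token owns the range since its predecessor
        prev = token
    return max(owned) / min(owned)

# With few vnodes the ratio is typically large (one node owns far more
# of the ring); with 256 vnodes per node it is typically close to 1.
print(round(max_min_ownership_ratio(12, 4), 2))
print(round(max_min_ownership_ratio(12, 256), 2))
```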

Token allocation gets smart

This paints a pretty grim picture of vnodes, and as far as operators are concerned, they are caught between a rock and hard place when selecting a value for num_tokens. That was until Cassandra version 3.0 was released, which brought with it a more intelligent token allocation algorithm thanks to CASSANDRA-7032. Using a ranking system, the algorithm feeds in the replication factor of a keyspace, the number of tokens, and the partitioner, to derive token ranges that are evenly distributed across the cluster of nodes.

The algorithm is configured by settings in the cassandra.yaml configuration file. Prior to this algorithm being added, the configuration file contained the necessary settings to configure the algorithm with the exception of the one to specify the keyspace name. When the algorithm was added, the allocate_tokens_for_keyspace setting was introduced into the configuration file. The setting allows a keyspace name to be specified so that during the bootstrap of a node we query the keyspace for its replication factor and pass that to the token allocation algorithm.

However, therein lies the problem, for existing clusters updating this setting is easy, as a keyspace already exists, but for a cluster starting from scratch we have a chicken and egg situation. How do we specify a keyspace that doesn’t exist!? And there are other caveats, too…

  • It works for only a single replication factor. As long as all the other keyspaces are using the same replication as the one specified for allocate_tokens_for_keyspace all is fine. However, if you have keyspaces with a different replication factor they can potentially cause hot spots.
  • It works only when nodes are added to the cluster. The process for token distribution when a node is removed from the cluster remains unchanged, and hence can cause hot spots.
  • It works only with the default partitioner, Murmur3Partitioner.

Additionally, this is no silver bullet for all unbalanced clusters; we still need to make sure we have a data model that evenly distributes data across partitions. Wide partitions can still be an issue and no amount of token shuffling will fix this.

Despite these drawbacks, this feature gives us the ability to allocate tokens in a more predictable way whilst leveraging the advantage of vnodes. This means we can specify a small value for vnodes (e.g. 4) and still be able to avoid hot spots. The question then becomes, in the case of starting a brand new cluster from scratch, which comes first the chicken or the egg?

One does not simply start a cluster… with evenly distributed tokens

While it might be possible to rectify an unbalanced cluster caused by unfortunate token allocations, it is better for the token allocation to be set up correctly when the cluster is created. To set up a brand new cluster that takes advantage of the allocate_tokens_for_keyspace setting, we need to use the following steps. The method below takes into account a cluster with nodes spread across multiple racks. The examples used in each step assume that our cluster will be configured as follows:

  • 4 vnodes (num_tokens = 4).
  • 3 racks with a single seed node in each rack.
  • A replication factor of 3, i.e. one replica per rack.

1. Calculate and set tokens for the seed node in each rack

We need to set the tokens for the seed node in each rack manually. This prevents each node from randomly calculating its own token ranges. We can calculate the token ranges that we will use for the initial_token setting using the following Python code:

$ python

Python 2.7.13 (default, Dec 18 2016, 07:03:39)
[GCC 4.2.1 Compatible Apple LLVM 8.0.0 (clang-800.0.42.1)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> num_tokens = 4
>>> num_racks = 3
>>> print "\n".join(['[Node {}] initial_token: {}'.format(r + 1, ','.join([str(((2**64 / (num_tokens * num_racks)) * (t * num_racks + r)) - 2**63) for t in range(num_tokens)])) for r in range(num_racks)])
[Node 1] initial_token: -9223372036854775808,-4611686018427387905,-2,4611686018427387901
[Node 2] initial_token: -7686143364045646507,-3074457345618258604,1537228672809129299,6148914691236517202
[Node 3] initial_token: -6148914691236517206,-1537228672809129303,3074457345618258600,7686143364045646503

We can then uncomment the initial_token setting in the cassandra.yaml file on each of the seed nodes, set it to the value generated by our Python command, and set the num_tokens setting to the number of vnodes. When the node first starts, the value of the initial_token setting will be used; subsequent restarts will use the num_tokens setting.
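The same calculation, rewritten for Python 3 (where print is a function and integer division needs //), produces identical token values:

```python
num_tokens = 4   # vnodes per node
num_racks = 3    # one seed node per rack

# Split the Murmur3 token space into num_tokens * num_racks even ranges.
range_size = 2**64 // (num_tokens * num_racks)

for rack in range(num_racks):
    tokens = [str(range_size * (t * num_racks + rack) - 2**63)
              for t in range(num_tokens)]
    print("[Node {}] initial_token: {}".format(rack + 1, ",".join(tokens)))
```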

Note that we need to manually calculate and specify the initial tokens for only the seed node in each rack. All other nodes will be configured differently.

2. Start the seed node in each rack

We can start the seed nodes one at a time using the following command:

$ sudo service cassandra start

When we watch the logs, we should see messages similar to the following appear:

INFO  [main] ... - This node will not auto bootstrap because it is configured to be a seed node.
INFO  [main] ... - tokens manually specified as [-9223372036854775808,-4611686018427387905,-2,4611686018427387901]

After starting the first of the seed nodes, we can use nodetool status to verify that 4 tokens are being used:

$ nodetool status
Datacenter: dc1
|/ State=Normal/Leaving/Joining/Moving
--  Address       Load       Tokens       Owns (effective)  Host ID                               Rack
UN  99 KiB     4            100.0%            5d7e200d-ba1a-4297-a423-33737302e4d5  rack1

We will wait for the following message to appear in the logs, then start the next seed node in the cluster.

INFO  [main] ... - Starting listening for CQL clients on ...

Once all seed nodes in the cluster are up, we can use nodetool ring to verify the token assignments in the cluster. It should look something like this:

$ nodetool ring

Datacenter: dc1
Address        Rack        Status State   Load            Owns                Token
                                                                              7686143364045646503
   rack1       Up     Normal  65.26 KiB       66.67%              -9223372036854775808
   rack2       Up     Normal  65.28 KiB       66.67%              -7686143364045646507
   rack3       Up     Normal  99.03 KiB       66.67%              -6148914691236517206
   rack1       Up     Normal  65.26 KiB       66.67%              -4611686018427387905
   rack2       Up     Normal  65.28 KiB       66.67%              -3074457345618258604
   rack3       Up     Normal  99.03 KiB       66.67%              -1537228672809129303
   rack1       Up     Normal  65.26 KiB       66.67%              -2
   rack2       Up     Normal  65.28 KiB       66.67%              1537228672809129299
   rack3       Up     Normal  99.03 KiB       66.67%              3074457345618258600
   rack1       Up     Normal  65.26 KiB       66.67%              4611686018427387901
   rack2       Up     Normal  65.28 KiB       66.67%              6148914691236517202
   rack3       Up     Normal  99.03 KiB       66.67%              7686143364045646503

We can then move to the next step.

3. Create only the keyspace for the cluster

On any one of the seed nodes we will use cqlsh to create the cluster keyspace using the following commands:

$ cqlsh NODE_IP_ADDRESS -u ***** -p *****

Connected to ...
[cqlsh 5.0.1 | Cassandra 3.11.3 | CQL spec 3.4.4 | Native protocol v4]
Use HELP for help.
cassandra@cqlsh> CREATE KEYSPACE keyspace_with_replication_factor_3
    WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': 3}
    AND durable_writes = true;

Note that this keyspace can have any name; it can even be the keyspace that contains the tables we will use for our data.

4. Set the number of tokens and the keyspace for all remaining nodes

We will set the num_tokens and allocate_tokens_for_keyspace settings in the cassandra.yaml file on all of the remaining nodes as follows:

num_tokens: 4
allocate_tokens_for_keyspace: keyspace_with_replication_factor_3

We have assigned the allocate_tokens_for_keyspace value to be the name of the keyspace created in the previous step. Note that at this point the Cassandra service on all other nodes is still down.

5. Start the remaining nodes in the cluster, one at a time

We can start the remaining nodes in the cluster using the following command:

$ sudo service cassandra start

When we watch the logs, we should see messages similar to the following appear, indicating that we are using the new token allocation algorithm:

INFO  [main] ... - JOINING: waiting for ring information
INFO  [main] ... - Using ReplicationAwareTokenAllocator.
WARN  [main] ... - Selected tokens [...]
INFO  ... - JOINING: Finish joining ring

As in step 2 when we started the seed nodes, we will wait for the following message to appear in the logs before starting the next node in the cluster.

INFO  [main] ... - Starting listening for CQL clients on ...

Once all the nodes are up, our shiny, new, evenly-distributed-tokens cluster is ready to go!

Proof is in the token allocation

While we can learn a fair bit from talking about the theory behind the allocate_tokens_for_keyspace setting, it is still good to put it to the test and see what difference it makes when used in a cluster. I decided to create two clusters running Apache Cassandra 3.11.3 and compare the load distribution after inserting some data. For this test, I provisioned both clusters with 9 nodes using tlp-cluster and generated load using tlp-stress. Both clusters used 4 vnodes, but one of the clusters was set up using the even token distribution method described above.

Cluster using random token allocation

I started with a cluster that uses the traditional random token allocation system. For this cluster I set num_tokens: 4 and endpoint_snitch: GossipingPropertyFileSnitch in the cassandra.yaml on all the nodes. Nodes were split across three racks by specifying the rack in the cassandra-rackdc.properties file.

Once the cluster instances were up and Cassandra was installed, I started each node one at a time. After all nodes were started, the cluster looked like this:

ubuntu@ip-172-31-39-54:~$ nodetool status
Datacenter: dc1
|/ State=Normal/Leaving/Joining/Moving
--  Address        Load       Tokens       Owns (effective)  Host ID                               Rack
UN   65.29 KiB  4            16.1%             4ada2c52-0d1b-45cd-93ed-185c92038b39  rack1
UN   65.29 KiB  4            20.4%             c282ef62-430e-4c40-a1d2-47e54c5c8685  rack2
UN  65.29 KiB  4            21.2%             48d865d7-0ad0-4272-b3c1-297dce306a34  rack1
UN  87.7 KiB   4            24.5%             27aa2c78-955c-4ea6-9ea0-3f70062655d9  rack1
UN   65.29 KiB  4            30.8%             bd2d745f-d170-4fbf-bf9c-be95259597e3  rack3
UN  70.36 KiB  4            25.5%             056e2472-c93d-4275-a334-e82f87c4b53a  rack3
UN  70.37 KiB  4            24.8%             06b0e1e4-5e73-46cb-bf13-626eb6ce73b3  rack2
UN   65.29 KiB  4            23.8%             137602f0-3248-459f-b07c-c0b3e647fa48  rack2
UN  99.03 KiB  4            12.9%             cd92c974-b32e-4181-9e14-fb52dd27b09e  rack3

I ran tlp-stress against the cluster using the command below. This generated a write-only load that randomly inserted 10 million unique key value pairs into the cluster. tlp-stress inserted data into a newly created keyspace and table called tlp_stress.keyvalue.

tlp-stress run KeyValue --replication "{'class':'NetworkTopologyStrategy','dc1':3}" --cl LOCAL_QUORUM --partitions 10M --iterations 100M --reads 0 --host

After running tlp-stress the cluster load distribution for the tlp_stress keyspace looked like this:

ubuntu@ip-172-31-39-54:~$ nodetool status tlp_stress
Datacenter: dc1
|/ State=Normal/Leaving/Joining/Moving
--  Address        Load       Tokens       Owns (effective)  Host ID                               Rack
UN   1.29 GiB   4            20.8%             4ada2c52-0d1b-45cd-93ed-185c92038b39  rack1
UN   2.48 GiB   4            39.1%             c282ef62-430e-4c40-a1d2-47e54c5c8685  rack2
UN  1.82 GiB   4            35.1%             48d865d7-0ad0-4272-b3c1-297dce306a34  rack1
UN  3.45 GiB   4            44.1%             27aa2c78-955c-4ea6-9ea0-3f70062655d9  rack1
UN   2.16 GiB   4            54.3%             bd2d745f-d170-4fbf-bf9c-be95259597e3  rack3
UN  1.71 GiB   4            29.1%             056e2472-c93d-4275-a334-e82f87c4b53a  rack3
UN  1.14 GiB   4            26.2%             06b0e1e4-5e73-46cb-bf13-626eb6ce73b3  rack2
UN   2.61 GiB   4            34.7%             137602f0-3248-459f-b07c-c0b3e647fa48  rack2
UN  562.15 MiB  4            16.6%             cd92c974-b32e-4181-9e14-fb52dd27b09e  rack3

I verified the data load distribution by checking the disk usage on all nodes using pssh (parallel ssh).

ubuntu@ip-172-31-39-54:~$ pssh -ivl ... -h hosts.txt "du -sh /var/lib/cassandra/data"
[1] ... [SUCCESS]
1.2G    /var/lib/cassandra/data
[2] ... [SUCCESS]
3.5G    /var/lib/cassandra/data
[3] ... [SUCCESS]
1.3G    /var/lib/cassandra/data
[4] ... [SUCCESS]
2.5G    /var/lib/cassandra/data
[5] ... [SUCCESS]
2.7G    /var/lib/cassandra/data
[6] ... [SUCCESS]
1.8G    /var/lib/cassandra/data
[7] ... [SUCCESS]
564M    /var/lib/cassandra/data
[8] ... [SUCCESS]
2.2G    /var/lib/cassandra/data
[9] ... [SUCCESS]
1.9G    /var/lib/cassandra/data

As we can see from the above results, there was a large variation in load distribution across the nodes. One node held the smallest amount of data (roughly 560 MB), whilst another node held over six times that amount (roughly 3.5 GB). Effectively, the difference between the smallest and largest data load is nearly 3.0 GB!

Cluster using predictive token allocation

I then moved on to setting up the cluster with predictive token allocation. Similar to the previous cluster, I set num_tokens: 4 and endpoint_snitch: GossipingPropertyFileSnitch in the cassandra.yaml on all the nodes. These settings were common to all nodes in this cluster. Nodes were again split across three racks by specifying the rack in the cassandra-rackdc.properties file.

I set the initial_token setting for each of the seed nodes and started the Cassandra process on them one at a time. One seed node was allocated to each rack in the cluster.

The initial keyspace that would be specified in the allocate_tokens_for_keyspace setting was created via cqlsh using the following command:

CREATE KEYSPACE keyspace_with_replication_factor_3 WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': '3'} AND durable_writes = true;

I then set allocate_tokens_for_keyspace: keyspace_with_replication_factor_3 in the cassandra.yaml file for the remaining non-seed nodes and started the Cassandra process on them one at a time. After all nodes were started, the cluster looked like this:

ubuntu@ip-172-31-36-11:~$ nodetool status keyspace_with_replication_factor_3
Datacenter: dc1
|/ State=Normal/Leaving/Joining/Moving
--  Address        Load       Tokens       Owns (effective)  Host ID                               Rack
UN   65.4 KiB   4            32.3%             5ece457c-a3af-4173-b9c8-e937f8b63d3b  rack2
UN  117.45 KiB  4            33.3%             55a94591-ad6a-48a5-b2d7-eee7ea06912b  rack3
UN   70.49 KiB  4            33.3%             93054390-bc83-487c-8940-b99e7b85e5c2  rack3
UN   104.3 KiB  4            35.4%             5d7e200d-ba1a-4297-a423-33737302e4d5  rack1
UN  65.41 KiB  4            31.2%             ecd00ff5-a90a-4d33-b7ab-bdd22e3e50b8  rack1
UN  65.39 KiB  4            33.3%             64802174-885a-4c04-b530-a9b4685b1b96  rack1
UN   65.39 KiB  4            33.3%             0846effa-e4ac-4a19-845e-2162cd2b7680  rack3
UN  104.32 KiB  4            35.4%             5ad47bc0-9bcc-4fc5-b5b0-0a15ad63345f  rack2
UN  65.4 KiB   4            32.3%             4128ca20-b4fa-4173-88b2-aac62539a6d8  rack2

I ran tlp-stress against the cluster using the same command that was used to test the cluster with random token allocation. After running tlp-stress the cluster load distribution for the tlp_stress keyspace looked like this:

ubuntu@ip-172-31-36-11:~$ nodetool status tlp_stress
Datacenter: dc1
|/ State=Normal/Leaving/Joining/Moving
--  Address        Load       Tokens       Owns (effective)  Host ID                               Rack
UN   2.16 GiB   4            32.3%             5ece457c-a3af-4173-b9c8-e937f8b63d3b  rack2
UN  2.32 GiB   4            33.3%             55a94591-ad6a-48a5-b2d7-eee7ea06912b  rack3
UN   2.32 GiB   4            33.3%             93054390-bc83-487c-8940-b99e7b85e5c2  rack3
UN   1.84 GiB   4            35.4%             5d7e200d-ba1a-4297-a423-33737302e4d5  rack1
UN  2.01 GiB   4            31.2%             ecd00ff5-a90a-4d33-b7ab-bdd22e3e50b8  rack1
UN  2.32 GiB   4            33.3%             64802174-885a-4c04-b530-a9b4685b1b96  rack1
UN   2.32 GiB   4            33.3%             0846effa-e4ac-4a19-845e-2162cd2b7680  rack3
UN  1.83 GiB   4            35.4%             5ad47bc0-9bcc-4fc5-b5b0-0a15ad63345f  rack2
UN  2.16 GiB   4            32.3%             4128ca20-b4fa-4173-88b2-aac62539a6d8  rack2

I again verified the data load distribution by checking the disk usage on all nodes using pssh.

ubuntu@ip-172-31-36-11:~$ pssh -ivl ... -h hosts.txt "du -sh /var/lib/cassandra/data"
[1] ... [SUCCESS]
1.9G    /var/lib/cassandra/data
[2] ... [SUCCESS]
2.4G    /var/lib/cassandra/data
[3] ... [SUCCESS]
1.9G    /var/lib/cassandra/data
[4] ... [SUCCESS]
2.4G    /var/lib/cassandra/data
[5] ... [SUCCESS]
2.4G    /var/lib/cassandra/data
[6] ... [SUCCESS]
2.2G    /var/lib/cassandra/data
[7] ... [SUCCESS]
2.1G    /var/lib/cassandra/data
[8] ... [SUCCESS]
2.4G    /var/lib/cassandra/data
[9] ... [SUCCESS]
2.2G    /var/lib/cassandra/data

As we can see from the above results, there was little variation in the load distribution across nodes compared to the cluster that used random token allocation. The least loaded node held roughly 1.83 GiB of data, while the most loaded nodes held roughly 2.32 GiB each. The difference between the smallest and largest data load is roughly 500 MB, which is significantly less than the data size difference in the cluster that used random token allocation.
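The imbalance can be quantified directly from the per-keyspace loads reported by nodetool for the two clusters (a quick sketch; the 562.15 MiB figure is converted to GiB):

```python
# Per-node loads (GiB) for the tlp_stress keyspace, taken from the
# nodetool status output of each cluster above.
random_alloc = [1.29, 2.48, 1.82, 3.45, 2.16, 1.71, 1.14, 2.61, 562.15 / 1024]
predictive_alloc = [2.16, 2.32, 2.32, 1.84, 2.01, 2.32, 2.32, 1.83, 2.16]

def spread(loads):
    # Difference between the most and least loaded node.
    return max(loads) - min(loads)

print(round(spread(random_alloc), 2))      # -> 2.9
print(round(spread(predictive_alloc), 2))  # -> 0.49
```

The predictive allocation shrinks the load spread from nearly 3 GiB to about half a GiB on the same workload.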


Having a perfectly balanced cluster takes a bit of work and planning. While there are some steps to set up and caveats to using the allocate_tokens_for_keyspace setting, predictive token allocation is a must-use when setting up a new cluster. As we have seen from testing, it allows us to take advantage of num_tokens being set to a low value without having to worry about hot spots developing in the cluster.

Testing Cassandra compatible APIs

In this quick blog post, I’m going to assess how the databases that advertise themselves as “Cassandra API-compatible” fare in the compatibility department.

But that is all I will do: API testing only, and not exhaustive testing at that, just the APIs I see used most often. Based on this, you can start building an informed decision on whether or not to change databases.

The contenders:

  • Apache Cassandra 4.0
    • Installation: Build from Source
  • Yugabyte –
    • “YCQL is a transactional flexible-schema API that is compatible with the Cassandra Query Language (CQL). “
    • Installation: Docker
  • ScyllaDB –
    • “Apache Cassandra’s wire protocol, a rich polyglot of drivers, and integration with Spark, Presto, and Graph tools make for resource-efficient and performance-effective coding.”
    • Installation: Docker
  • Azure Cosmos –
    • “Azure Cosmos DB provides native support for NoSQL and OSS APIs including MongoDB, Cassandra, Gremlin and SQL”
    • Installation: Azure Portal Wizard

All installations were done with the containers as they are provided. Cosmos DB used all defaults as they were provided by the wizard interface.

The CQL script used for testing exercised the operations listed in the next section.

What I’m not doing in this blog post: performance testing, feature comparison, and everything else that is not testing the API. Those might all be more or less important for other use cases, but they are not in the scope of this blog.

What was tested

In this test, the following CQL APIs were tested:

  1. Keyspace Creation
  2. Table Creation
  3. Adding a Column to a table (Alter table)
  4. Data Insert
  5. Data Insert with TTL (Time-to-live)
  6. Data Insert with LWT (Lightweight Transactions)
  7. Select Data
  8. Select data with a full table scan (ALLOW FILTERING)
  9. Creating a Secondary Index (2I)
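The nine operations above could be exercised with a script like the following sketch; the apitest keyspace and users table are illustrative names, not those from the original script:

```python
# Hypothetical CQL statements covering the nine tested operations,
# one per list entry, in the order listed above.
statements = [
    "CREATE KEYSPACE apitest WITH replication = "
    "{'class': 'SimpleStrategy', 'replication_factor': 1};",
    "CREATE TABLE apitest.users (id int PRIMARY KEY, name text);",
    "ALTER TABLE apitest.users ADD email text;",
    "INSERT INTO apitest.users (id, name) VALUES (1, 'alice');",
    "INSERT INTO apitest.users (id, name) VALUES (2, 'bob') USING TTL 86400;",
    "INSERT INTO apitest.users (id, name) VALUES (3, 'carol') IF NOT EXISTS;",
    "SELECT * FROM apitest.users WHERE id = 1;",
    "SELECT * FROM apitest.users WHERE name = 'bob' ALLOW FILTERING;",
    "CREATE INDEX users_by_name ON apitest.users (name);",
]
for cql in statements:
    print(cql)
```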

Cassandra 4.0

  • All statements worked (as expected)


LWT not supported

2i not supported


LWT not supported

2i not supported

Results Table

So, with these results, which are not a full comparison (I have left out other capabilities offered by these systems), you can decide whether each database is compatible enough for you.

Reaper 1.4 Released

Cassandra Reaper 1.4 was just released with security features that now extend to the whole REST API.

Security Improvements

Reaper 1.2.0 integrated Apache Shiro to provide authentication capabilities in the UI. The REST API remained fully open though, which was a security concern. With Reaper 1.4.0, the REST API is now fully secured and managed by the very same Shiro configuration as the Web UI. JSON Web Tokens (JWT) were introduced to avoid sending credentials over the wire too often. In addition, spreaper, Reaper’s command line tool, has been updated to provide a login operation and to manipulate JWTs.

The documentation was updated with all the necessary information to handle authentication in Reaper and even some samples on how to connect LDAP directories through Shiro.

Note that Reaper doesn’t support authorization features, and it is impossible to create users with different rights.
Authentication is now enabled by default for all new installs of Reaper.

Configurable JMX port per cluster

One of the annoying things with Reaper was that it was impossible to use a port other than the default, 7199, for JMX communications.
You could define specific ports per IP, but that was really for testing purposes with CCM.
That long-overdue feature has now landed in 1.4.0, and a custom JMX port can be passed when declaring a cluster in Reaper:

Configurable JMX port

TWCS/DTCS tables blacklisting

In general, it is best to avoid repairing DTCS tables, as it can generate lots of small SSTables that could stay out of the compaction window and generate performance problems. We tend to recommend not repairing TWCS tables either, to avoid replicating timestamp overlaps between nodes that can delay the deletion of fully expired SSTables.

When using the auto-scheduler though, it is impossible to specify blacklists, as all keyspaces and all tables get automatically scheduled by Reaper.

Based on the initial PR from Dennis Kline, which was then reworked by our very own Mick, a new configuration setting allows automatic blacklisting of TWCS and DTCS tables for all repairs:

blacklistTwcsTables: false

When set to true, Reaper will discover the compaction strategy of all tables in the keyspace and remove from the repair any table using either DTCS or TWCS, unless it is explicitly passed in the list of tables to repair.

Web UI improvements

The Web UI reported decommissioned nodes that still appeared in the Gossip state of the cluster with a Left state. This has been fixed, and such nodes are no longer displayed.
Another bug was the number of tokens reported in the node detail panel, which was nowhere near matching reality. We now display the correct number of tokens, and clicking on this number will open a popup containing the list of tokens the node is responsible for:


Work in progress

Work in progress will introduce the Sidecar Mode, which will collocate a Reaper instance with each Cassandra node and support clusters where JMX access is restricted to localhost.
This mode is being actively worked on and the branch already has working repairs.
We’re now refactoring the code and porting other features, such as snapshots and metric collection, to this mode.
This mode will also allow new features to be added and will permit Reaper to better scale with the clusters it manages.

Upgrade to Reaper 1.4.0

The upgrade to 1.4 is recommended for all Reaper users. The binaries are available from yum, apt-get, Maven Central, Docker Hub, and are also downloadable as tarball packages. Remember to backup your database before starting the upgrade.

All instructions to download, install, configure, and use Reaper 1.4 are available on the Reaper website.

How to build your very own Cassandra 4.0 release

Over the last few months, I have been seeing references to Cassandra 4.0 and some of its new features. When that happens with a technology I am interested in, I go looking for the preview releases to download and test. Unfortunately, so far, there are no such releases. But I am still interested, so I’ve found it necessary to build my own Cassandra 4.0 release. In my humble opinion, this is not the most desirable way to do things, since there is no Cassandra 4.0 branch yet; the 4.0 code is on trunk. So if you do two builds a commit or two apart (and there are typically at least three or four commits a week right now), you get slightly different builds. It is, in essence, a moving target.

All that said and done, I decided if I could do it, then the least I could do is write about how to do it and let everyone who wants to try it learn how to avoid a couple of dumb things I did when I first tried it.

Building your very own Cassandra 4.0 release is actually pretty easy. It consists of five steps:

  1. Make sure you have your prerequisites
    1. Java SDK 1.8 or Java 11 (OpenJDK or Oracle)
    2. Ant 1.8
    3. Git CLI client
    4. Python >= 2.7, < 3.0
  2. Download the Git repository
    1. git clone
  3. Build your new Cassandra release
    1. cd cassandra
    2. ant
  4. Run Cassandra
    1. cd ./bin
    2. ./cassandra
  5. Have fun
    1. ./nodetool status
    2. ./cqlsh

I will discuss each step in a little bit more detail:

Step 1) Verify, and if necessary, install your prerequisites

For Java, you can confirm the JDK presence by typing in:

john@Lenny:~$ javac -version
javac 1.8.0_191

For ant:

john@Lenny:~$ ant -version
Apache Ant(TM) version 1.9.6 compiled on July 20 2018

For git:

john@Lenny:~$ git --version
git version 2.7.4

For Python:

john@Lenny:~$ python --version
Python 2.7.12

If you have all of the right versions, you are ready for the next step. If not, you will need to install the required software which I am not going to go into here.

Step 2) Clone the repository

Verify you do not already have an older copy of the repository:

john@Lenny:~$ ls -l cassandra
ls: cannot access 'cassandra': No such file or directory

If you found a Cassandra directory, you will want to delete it, move it, or change to a different working directory. Otherwise:

john@Lenny:~$ git clone
Cloning into 'cassandra'...
remote: Counting objects: 316165, done.
remote: Compressing objects: 100% (51450/51450), done.
remote: Total 316165 (delta 192838), reused 311524 (delta 189005)
Receiving objects: 100% (316165/316165), 157.78 MiB | 2.72 MiB/s, done.
Resolving deltas: 100% (192838/192838), done.
Checking connectivity... done.
Checking out files: 100% (3576/3576), done.

john@Lenny:~$ du -sh *
294M cassandra

At this point, you have used up 294 MB on your host and you have an honest-for-real git repo clone on your host – in my case, a Lenovo laptop running Windows 10 Linux subsystem.

And your repository looks something like this:

  john@Lenny:~$ ls -l cassandra
  total 668
  drwxrwxrwx 1 john john    512 Feb  6 15:54 bin
  -rw-rw-rw- 1 john john    260 Feb  6 15:54
  -rw-rw-rw- 1 john john 101433 Feb  6 15:54 build.xml
  -rw-rw-rw- 1 john john   4832 Feb  6 15:54 CASSANDRA-14092.txt
  -rw-rw-rw- 1 john john 390460 Feb  6 15:54 CHANGES.txt
  drwxrwxrwx 1 john john    512 Feb  6 15:54 conf
  -rw-rw-rw- 1 john john   1169 Feb  6 15:54
  drwxrwxrwx 1 john john    512 Feb  6 15:54 debian
  drwxrwxrwx 1 john john    512 Feb  6 15:54 doc
  -rw-rw-rw- 1 john john   5895 Feb  6 15:54
  drwxrwxrwx 1 john john    512 Feb  6 15:54 examples
  drwxrwxrwx 1 john john    512 Feb  6 15:54 ide
  drwxrwxrwx 1 john john    512 Feb  6 15:54 lib
  -rw-rw-rw- 1 john john  11609 Feb  6 15:54 LICENSE.txt
  -rw-rw-rw- 1 john john 123614 Feb  6 15:54 NEWS.txt
  -rw-rw-rw- 1 john john   2600 Feb  6 15:54 NOTICE.txt
  drwxrwxrwx 1 john john    512 Feb  6 15:54 pylib
  -rw-rw-rw- 1 john john   3723 Feb  6 15:54 README.asc
  drwxrwxrwx 1 john john    512 Feb  6 15:54 redhat
  drwxrwxrwx 1 john john    512 Feb  6 15:54 src
  drwxrwxrwx 1 john john    512 Feb  6 15:54 test
  -rw-rw-rw- 1 john john  17215 Feb  6 15:54
  drwxrwxrwx 1 john john    512 Feb  6 15:54 tools

Step 3) Build your new Cassandra 4.0 release

Remember what I said in the beginning? There is no branch for Cassandra 4.0 at this point, so building from the trunk is quite simple:

john@Lenny:~$ cd cassandra
john@Lenny:~/cassandra$ ant
Buildfile: /home/john/cassandra/build.xml

Total time: 1 minute 4 seconds

That went quickly enough. Let’s take a look and see how much larger the directory has gotten:

john@Lenny:~$ du -sh *
375M cassandra

Our directory grew by 81 MB, pretty much all of it in the new build directory, which now has 145 new files including ./build/apache-cassandra-4.0-SNAPSHOT.jar. I am liking that version 4.0 right in the middle of the filename.

Step 4) Start Cassandra up. This one is easy if you do the sensible thing

john@Lenny:~/cassandra$ cd ..
john@Lenny:~$ cd cassandra/bin
john@Lenny:~/cassandra/bin$ ./cassandra
john@Lenny:~/cassandra/bin$ CompilerOracle: dontinline org/apache/cassandra/db/Columns$Serializer.deserializeLargeSubset (Lorg/apache/cassandra/io/util/DataInputPlus;Lorg/apache/cassandra/db/Columns;I)Lorg/apache/cassandra/db/Columns;
CompilerOracle: dontinline org/apache/cassandra/db/Columns$Serializer.serializeLargeSubset (Ljava/util/Collection;ILorg/apache/cassandra/db/Columns;ILorg/apache/cassandra/io/util/DataOutputPlus;)V
CompilerOracle: dontinline org/apache/cassandra/db/Columns$Serializer.serializeLargeSubsetSize (Ljava/util/Collection;ILorg/apache/cassandra/db/Columns;I)I

INFO [MigrationStage:1] 2019-02-06 21:26:26,222 - Initializing system_auth.role_members
INFO [MigrationStage:1] 2019-02-06 21:26:26,234 - Initializing system_auth.role_permissions
INFO [MigrationStage:1] 2019-02-06 21:26:26,244 - Initializing system_auth.roles

We seem to be up and running. It’s time to try some things out:

Step 5) Have fun

We will start out making sure we are up and running by using nodetool to connect and display a cluster status. Then we will go into the CQL shell to see something new. It is important to note that since you are likely to have nodetool and cqlsh already installed on your host, you need to use the ./ in front of your commands to ensure you are using the 4.0 version. I have learned the hard way that forgetting the ./ can result in some very real confusion.

  john@Lenny:~/cassandra/bin$ ./nodetool status
  Datacenter: datacenter1
  |/ State=Normal/Leaving/Joining/Moving
  --  Address    Load       Tokens       Owns (effective)  Host ID                               Rack
  UN  115.11 KiB  256          100.0%            f875525b-3b78-49b4-a9e1-2ab0cf46b881  rack1
  john@Lenny:~/cassandra/bin$ ./cqlsh
  Connected to Test Cluster at
  [cqlsh 5.0.1 | Cassandra 4.0-SNAPSHOT | CQL spec 3.4.5 | Native protocol v4]
  Use HELP for help.
  cqlsh> desc keyspaces;

  system_traces  system_auth  system_distributed     system_views
  system_schema  system       system_virtual_schema


We got a nice cluster with one node and we see the usual built-in keyspaces. Well um… not exactly. We see two new keyspaces: system_virtual_schema and system_views. Those look very interesting.

In my next blog, I’ll be talking more about Cassandra’s new virtual table facility and how very useful it is going to be someday soon. I hope.

Anomalia Machina 7 – Kubernetes Cluster Creation and Application Deployment

Kubernetes – Greek: κυβερνήτης = Helmsman

If you are a Greek hero about to embark on an epic aquatic quest (encountering one-eyed rock-throwing monsters, unpleasant weather, a detour to the underworld, tempting sirens, angry gods, etc.) then having a trusty helmsman is mandatory. (Even though the helmsman survived the Cyclops, like all of Odysseus’s companions he eventually came to a sticky end.)

Anomalia Machina 7 - Odysseus and Polyphemus

Anomalia Machina 7 - Kubernetes


A few months ago we decided to try Kubernetes to see how it would help with our quest to scale our Anomalia Machina application (a massively scalable Anomaly Detection application using Apache Kafka and Cassandra). This blog is a recap of our initial experiences (reported in this webinar) creating a Kubernetes cluster on AWS and deploying the application.

Kubernetes is an automation system for the management, scaling and deployment of containerized applications. Why is Kubernetes interesting?

  • It’s Open Source, and cloud provider (multi cloud) and programming language (polyglot programming) agnostic
  • You can develop and test code locally, then deploy at scale
  • It helps with resource management, you can deploy an application to Kubernetes and it manages application scaling
  • More powerful frameworks built on the Kubernetes APIs are becoming available (e.g. Istio)

Kubernetes and AWS EKS

Our goal was to try Kubernetes out on AWS using the newish AWS EKS, the Amazon Elastic Container Service for Kubernetes. Kubernetes is a sophisticated four-year-old technology, and it isn’t trivial to learn or deploy on a production cloud platform. To understand, set up, and use it, you need familiarity with a mixture of Docker, AWS, security, networking, Linux, systems administration, grid computing, resource management, and performance engineering! To get up to speed faster, I accepted an invitation to attend an AWS Get-it-done-athon in Sydney for a few days last year to try out AWS EKS. Here’s the documentation I used: the EKS user guide, a step by step guide, and a useful EKS workshop which includes some extensions.


At a high level Kubernetes has a Master (controller) and multiple (worker) Nodes.


Kubernetes has a lot of layers, a bit like Russian nesting dolls. The smallest doll is the code, the Java classes. You package the classes as a Java JAR file, turn the JAR into a Docker image, and upload it to Docker Hub. The Kubernetes layers include nodes (EC2 instances), pods which run on nodes, and deployments which manage one or more pods. A Kubernetes Deployment (one of many controller types) declaratively tells the control plane what state the Pods should be in and how many there should be. Finally there is the control plane, the master, which runs everything; this is the focus of AWS EKS.
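The layering described above can be sketched as a minimal Deployment manifest (the deployment name and image are hypothetical placeholders, not the actual Anomalia Machina artifacts):

```yaml
# Hypothetical Deployment: asks the control plane to keep 3 pods of
# the application container running somewhere on the worker nodes.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: anomalia-machina
spec:
  replicas: 3                  # desired number of pods
  selector:
    matchLabels:
      app: anomalia-machina
  template:
    metadata:
      labels:
        app: anomalia-machina
    spec:
      containers:
      - name: anomalia-machina
        image: mydockerhub/anomalia-machina:latest  # image pushed to Docker Hub
```

Applying this with `kubectl apply -f deployment.yaml` hands the desired state to the control plane, which then schedules pods onto nodes.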


This is a high level overview of AWS EKS. It looks simple enough, but there are a few complications to solve before you can get your application running.

Anomalia Machina 7 - High level overview of AWS EKS



“Toto, I’ve a feeling we’re not in Kansas anymore”

Each AWS EKS cluster runs in an AWS region, and its worker nodes must run in the same region. When I created our EKS cluster last year there were only 3 supported regions; there are now many more. Unfortunately my Kafka and Cassandra test clusters were in Ohio, which was not an EKS region at the time, so Virginia was the nearest option. Ohio is only 500 km away, so I hoped the application would work well enough across regions, at least for testing. However, running EKS in a different region from the rest of your application may be undesirable due to increased complexity, latency, and extra data charges, and it introduced some unnecessary complications for this experiment.

Kubernetes Snakes and Ladders

It turns out that there are lots of steps to get AWS EKS running, and they didn’t all work smoothly (the “snakes”), particularly with the cross region complications. Let’s roll the dice and test our luck.


First we have to prepare things:

1 – Create Amazon EKS service role (5 steps)

2 – Create your Amazon EKS Cluster VPC (12 steps)

3 – Install and Configure kubectl for Amazon EKS (5 steps)

4 – Install aws-iam-authenticator for Amazon EKS (10 steps)

5 – Download and Install the Latest AWS CLI (6 steps)


Many (38) steps and a few hours later we were ready to create our AWS EKS cluster.


6 – Create Your Amazon EKS Cluster (12 steps)

7 – Configure kubectl for Amazon EKS (3 steps)


Next, more steps to launch and configure worker nodes:


8 – Add key pairs to EKS region (extra steps due to multiple regions, down a snake)

9 – Launch and Configure Amazon EKS Worker Nodes (29 steps)


We finally got 3 worker nodes started.


10 – Deploying the Kubernetes web GUI dashboard is a good idea if you want to see what’s going on (it runs on the worker nodes). But then it was down another snake, as the documentation wasn’t 100% clear (12 steps).


11 – Deploy the application (5 steps).  Does it work?

12 – Not yet… down another snake. Configure, debug, test application (30 steps)


After 100+ steps (and 2 days) we could finally ramp up our application on the Kubernetes cluster and test it out. It worked (briefly). The completed “game”, with steps (1-12) numbered, looked like this:

Anomalia Machina 7 - Snakes and ladder completed game with Kubernetes

Here’s the Kubernetes GUI dashboard with the 3 worker nodes and the anomaly detection application pipeline deployed. You can kill pods, and EKS magically replaces them. You can ask for more pods and EKS creates more – just don’t ask for too many!
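The pod-killing and scaling just described can also be done from the command line; a sketch with kubectl (the deployment name is a placeholder):

```shell
# Kill a pod and watch the Deployment controller replace it
kubectl delete pod <pod-name>
kubectl get pods --watch

# Ask for more (or fewer) pods
kubectl scale deployment anomalia-machina --replicas=5
```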

Anomalia Machina 7 - Kubernetes GUI Dashboard

(Mis)understanding Kubernetes Scaling



It’s useful to understand the basics of AWS EKS/Kubernetes scaling (I initially didn’t). When you create a Kubernetes deployment you specify how many replicas (pods) you want. Pods are designed to run a single instance of an application and are the basic unit of concurrency in Kubernetes. They are also designed to be ephemeral: they can be started, killed, and restarted on different nodes. Once running, you can request more or fewer replicas. I started out with 1 replica, which worked fine.


I then wanted to see what happened if I requested more replicas. I had 3 worker nodes (EC2 instances), so I was curious to see how many pods Kubernetes would allow: 1, 2, 3, or more? (Note that some of the worker node resources were already used by Kubernetes for the GUI and monitoring.) I soon found out that by default Kubernetes doesn’t limit the number of pods at all! That is, it assumes pods use no resources, and allows more pods to be created even when all the worker nodes have run out of spare resources. As a result the worker nodes were soon overloaded, and the whole system became unresponsive: the Kubernetes GUI froze and I couldn’t delete pods. To try to get things back to a working state I manually stopped the EC2 worker instances from the AWS console.
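One way to stop the scheduler over-packing nodes like this is to declare resource requests and limits in the pod spec; the scheduler then refuses to place pods a node cannot fit. A sketch with illustrative values:

```yaml
# Per-container resource declarations (values are illustrative).
# With a request set, the scheduler only places the pod on a node with
# that much unreserved CPU/memory, and leaves it Pending otherwise.
resources:
  requests:
    cpu: "500m"      # half a core reserved at scheduling time
    memory: "512Mi"
  limits:
    cpu: "1"         # hard caps enforced at runtime
    memory: "1Gi"
```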


Did this work? No. What I hadn’t realised was that the worker nodes are created as part of an auto scaling group.  


To change the number of instances (and therefore nodes) you need to change the auto scaling group’s minimum and maximum sizes; the number of instances then scales elastically within those limits. After I manually stopped the worker instances, I checked back later and found that (1) they had all started again, and (2) they were all saturated again, as Kubernetes was still trying to achieve the desired number of replicas for the deployment.


Another useful thing to understand about AWS EKS and Kubernetes, therefore, is that Kubernetes resources are controlled declaratively by the higher layers, but AWS EC2 instances are controlled by AWS. If you need to kill off all the pods, you change the number of desired replicas to 0 in the Kubernetes deployment. But if you need more worker nodes, you change the max/desired sizes in the workers’ auto scaling group, as Kubernetes cannot override these values. By default Kubernetes does not elastically auto scale nodes using the workers’ auto scaling group; you need the Kubernetes Cluster Autoscaler configured for that to work.
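The two control loops are therefore driven from two different tools; a sketch (deployment and group names are hypothetical):

```shell
# Kubernetes side: kill all pods by setting desired replicas to 0
kubectl scale deployment anomalia-machina --replicas=0

# AWS side: resize the worker fleet via its Auto Scaling group
aws autoscaling update-auto-scaling-group \
  --auto-scaling-group-name eks-worker-group \
  --min-size 3 --max-size 6 --desired-capacity 6
```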

Further Resources

Here are a couple of other AWS EKS reviews:


In the next blog(s) I’ll provide an update on using Kubernetes to scale the production version of the Anomalia Machina application (with Prometheus and OpenTracing), and the results at scale.

The post Anomalia Machina 7 – Kubernetes Cluster Creation and Application Deployment appeared first on Instaclustr.

14 Things To Do When Setting Up a New Cassandra Cluster

Over the last few years, we’ve spent a lot of time reviewing clusters, answering questions on the Cassandra User mailing list, and helping folks out on IRC. During this time, we’ve seen the same issues come up again and again. While we regularly go into detail on individual topics, and encourage folks to read up on the fine print later, sometimes teams just need a starting point to help them get going. This is our basic tuning checklist for those teams who just want to get a cluster up and running and avoid some early potholes.

  1. Adjust The Number of Tokens Per Node

    When the initial work on vnodes in Cassandra was started on the mailing list, 256 tokens was chosen in order to ensure even distribution. This carried over into CASSANDRA-4119 and became part of the implementation. Unfortunately, there were several unforeseen (and unpredictable) consequences of choosing a number this high. Without spending a lot of time on the subject: 256 tokens can cause issues when bootstrapping new nodes (lots of SSTables), repair takes longer, and CPU usage is overall higher. Since Cassandra 3.0, we’ve had the ability to allocate tokens in a more predictable manner that avoids hotspots and leverages the advantage of vnodes. We thus recommend using a value of 4 here, which gives a significant improvement in token allocation. To use this, please ensure you set allocate_tokens_for_keyspace in the cassandra.yaml file. It’s important to do this up front, as there’s no easy way to change it without setting up a new data center and doing a migration.

    Note: Using 4 tokens won’t allow you to add a single node to a cluster of 100 nodes and have each node get an additional 1% capacity. In practice there’s not much benefit from doing that, as you would always want to expand a cluster by roughly 25% in order to have a noticeable improvement, and 4 tokens allows for that.
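    The relevant cassandra.yaml settings for a new cluster would look something like this (the keyspace name is a placeholder for one whose replication factor matches your target RF):

    ```yaml
    # cassandra.yaml — set before the node first bootstraps
    num_tokens: 4
    # Drives the predictable token allocation algorithm (Cassandra 3.0+)
    allocate_tokens_for_keyspace: my_keyspace
    ```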

  2. Configure Racks, Snitch, and Replication

    Cassandra has the ability to place data around the ring in a manner that ensures we can survive losing a rack, an availability zone, or an entire data center. In order to do this correctly, it needs to know where each node is placed, so that it can place copies of data in a fault-tolerant manner according to the replication strategy. Production workloads should ALWAYS use NetworkTopologyStrategy, which takes racks and data centers into account when writing data. We recommend GossipingPropertyFileSnitch as a default. If you’re planning on staying within a single cloud provider, it’s probably easier to use that provider’s dedicated snitch, such as Ec2Snitch, as these figure out their rack and data center automatically. These should all be set up before using the cluster in production, as changing them later is extremely difficult.
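    With GossipingPropertyFileSnitch, each node reads its rack and data center from cassandra-rackdc.properties, and keyspaces then name their data centers explicitly. A sketch (keyspace, DC names, and RFs are illustrative):

    ```sql
    -- Production keyspace: NetworkTopologyStrategy with per-DC replication
    CREATE KEYSPACE my_keyspace
      WITH replication = {
        'class': 'NetworkTopologyStrategy',
        'dc1': 3,
        'dc2': 3
      };
    ```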

  3. Set up Internode Encryption

    We discuss how to do this in our guide on setting up internode encryption. I won’t go deep into the details here since Nate did a good job of that in his post. This falls in the “do it up front or you’ll probably never do it” list with #1 and #2.

  4. Set up Client Authentication

    Using authentication for your database is a good standard practice, and it’s pretty easy to set up initially. We recommend disabling the default cassandra user altogether once auth is set up, and increasing the replication factor (RF) of the system_auth keyspace to a few nodes per rack. For example, if you have 3 racks, use RF=9 for system_auth. It’s common to see folks advising that system_auth replication match the cluster size no matter how big it is; I’ve found this to be unnecessary, and it’s also problematic when you increase your cluster size. If you’re running 1000 nodes per data center you’ll need a higher setting, but it’s very unlikely your first cluster will be that big. Be sure to bump up the roles_validity_in_ms, permissions_validity_in_ms, and credentials_validity_in_ms settings (try 30 seconds) to avoid hammering those nodes with auth requests.

    Using authentication is a great start, but adding a layer of authorization to specific tables is a good practice as well. Lots of teams use a single Cassandra cluster for multiple purposes, which isn’t great in the long term but is fine for starting out. Authorization lets you limit what each user can do on each keyspace and that’s a good thing to get right initially. Cassandra 4.0 will even have the means of restricting users to specific data centers which can help segregate different workloads and reduce human error.

    The Security section of the documentation is worth reading for more details.
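    A minimal sketch of the auth bootstrap described above (role name, DC name, and RF are illustrative; run as a superuser other than the one being disabled):

    ```sql
    -- Raise system_auth replication: roughly 3 replicas per rack, 3 racks
    ALTER KEYSPACE system_auth
      WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': 9};

    -- Create a replacement superuser, then disable the default account
    CREATE ROLE admin WITH PASSWORD = 'use-a-real-secret'
      AND SUPERUSER = true AND LOGIN = true;
    ALTER ROLE cassandra WITH LOGIN = false AND SUPERUSER = false;
    ```

    After changing the replication factor, run a repair on system_auth so existing credentials are copied to the new replicas.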

  5. Disable Dynamic Snitch

    Dynamic snitch is a feature that was intended to improve the performance of reads by preferring nodes which are performing better. Logically, it makes quite a bit of sense, but unfortunately it doesn’t behave that way in practice. In reality, we suffer from a bit of the Observer Effect, where the observation of the situation affects the outcome. As a result, the dynamic snitch generates quite a bit of garbage, so much in fact that using it makes everything perform significantly worse. By disabling it, we make the cluster more stable overall and end up with a net reduction in performance related problems. A fix for this is actively being worked on for Cassandra 4.0 in CASSANDRA-14459.
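    Disabling it is a one-line change:

    ```yaml
    # cassandra.yaml
    dynamic_snitch: false
    ```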

  6. Set up client encryption

    If you’re going to set up authentication, it’s a good idea to set up client encryption as well, or else everything (including authentication credentials) is sent over cleartext.

    Again, the Security section of the documentation is worth reading for more details on this, I won’t rehash what’s there.

  7. Increase counter cache (if using counters)

    Counters do a read before write operation. The counter cache allows us to skip reading the value off disk. In counter heavy clusters we’ve seen a significant performance improvement by increasing the counter cache. We’ll dig deeper into this in a future post and include some performance statistics as well as graphs.
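    The cache size is a single cassandra.yaml setting; the default is the smaller of 2.5% of the heap or 50MB, and the value below is only an illustrative starting point for a counter-heavy workload:

    ```yaml
    # cassandra.yaml — counter-heavy clusters
    counter_cache_size_in_mb: 512
    ```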

  8. Set up sub range repair (with Reaper)

    Incremental repair is a great idea but unfortunately has been the cause of countless issues, which we discussed in our blog previously. Until those are resolved, we recommend scheduled subrange full repairs, which Reaper can orchestrate for you. Cassandra 4.0 should fix the remaining issues with incremental repair by changing the way anticompaction works.

    Read up on the details in CASSANDRA-9143

  9. Set up Monitoring

    Without good monitoring it’s just not possible to make good decisions on a single server, and the problem is compounded when dealing with an entire cluster. There are a number of reasonable monitoring solutions available. At the moment we’re fans of Prometheus if you choose to self-host, and Datadog if you prefer hosted, but this isn’t meant to be an exhaustive list. We recommend aggregating metrics from your application and your databases into a single monitoring system.

    Once you’ve got all your metrics together, be sure to create useful dashboards that expose:

    • Throughput
    • Latency
    • Error rate

    These metrics must be monitored from all layers to truly understand what’s happening when there’s an issue in production.

    Going an extra step or two past normal monitoring is distributed tracing, made popular by the Google Dapper paper. The open source equivalent is Zipkin, which Mick here at The Last Pickle is a contributor to and wrote a bit about some time back.

  10. Backups

    Cassandra’s fault tolerance makes it easy to (incorrectly) disregard the usual advice of having a solid backup strategy. Just because data is replicated doesn’t mean we won’t need to recover data later on. This is why we always recommend having a backup strategy in place.

    There are hosted solutions like Datos, as well as Cassandra snapshots, volume snapshots (LVM, EBS volumes, etc.), incremental Cassandra backups, and home-rolled tools. It’s not possible to recommend a single solution for every use case; the right one will depend on your requirements.

  11. Basic GC Tuning

    The default Cassandra JVM arguments are, for the most part, unoptimized for virtually every workload. We’ve written a post on GC tuning, so we won’t rehash it here. Most people assume GC tuning is an exercise in premature optimization and are quite surprised when they can get a 2-5x improvement in both throughput (queries per second) and p99 latency.

  12. Disable Materialized Views

    When materialized views (MVs) were added in Cassandra 3.0, everyone, including me, was excited. It was one of the most interesting features added in a long time, and the possibility of avoiding manual denormalization was very exciting. Unfortunately, building such a feature so that it works correctly turned out to be extremely difficult.

    Since the release of the feature, materialized views have retroactively been marked as experimental, and we don’t recommend them for normal use. Their use makes it very difficult (or impossible) to repair and bootstrap new nodes into the cluster. We’ll dig deeper into this issue in a later post. For now, we recommend setting the following in the cassandra.yaml file:

    enable_materialized_views: false

  13. Configure Compression

    We wrote a post a while back discussing tuning compression and how it can help improve performance, especially on read heavy workloads.

  14. Dial Back Read Ahead

    Generally speaking, whenever data is read off disk, it’s put in the page cache (there are exceptions). Accessing data in the page cache is significantly faster than accessing data off disk, since it’s reading from memory. Read ahead simply reads extra data. The logic is that adding extra data to a read is cheap, since the request is already being made, so we might as well get as much data in the page cache as we can.

    There’s a big problem with this.

    First, if read ahead is requesting information that’s not used, it’s a waste of resources to pull extra data off disk. Pulling extra data off disk means the disk is doing more work. If we had a small amount of data, that might be fine. The more data we have, the higher the ratio of disk space to memory, and that means it keeps getting less and less likely we’ll actually have the data in memory. In fact, we end up finding it likely we’ll only use some of the data, some of the time.

    The second problem has to do with page cache churn. If you have 30GB of page cache available and you’re accessing 3TB of data, you end up pulling a lot of data into your page cache. The old data needs to be evicted. This process is an incredible waste of resources. You can tell if this is happening on your system if you see a kswapd process taking up a lot of CPU, even if you have swap disabled.

    We recommend setting this to 0 or 8 sectors (4KB) on local SSDs, and 16 sectors (8KB) if you’re using EBS in AWS and performing reads off large partitions.

    blockdev --setra 8 /dev/sda   # substitute your data disk's device

    You’ll want to experiment with the read ahead setting to find out what works best for your environment.

There’s quite a bit to digest, so I’ll stop here. This isn’t a comprehensive list of everything that can be adjusted by any means, but it’s a good start if you’re looking to do some basic tuning. We’ll be going deeper into some of these topics in later posts!

Handling a Cassandra transactional workload

Overview of Cassandra

As previously mentioned in my notes on lightweight transactions, Cassandra does not support ACID transactions. Cassandra was built to support a brisk ingest of writes while being distributed for availability. Follow the link to my previous post above to learn more specifics about LWTs and the Paxos algorithm.

Here I’ll cover some other ways to handle a transactional workload in Cassandra.

Batch mode

The standard write path for Cassandra goes from the client to the memtable and commit log, with memtables later flushed to SSTables. A write is stored in the memtable and commit log of replica nodes (how many replicas exist is set by the replication factor, and how many must acknowledge by the consistency level) before it is considered complete.

The batch write path includes, in addition, a batch log which is used to group updates that are then considered complete (or not) together. This is an expensive operation unless batch writes affect a single partition key.
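A logged batch groups mutations so that they all eventually apply, or none do (schema and values are illustrative); note this gives atomicity, not isolation:

```sql
BEGIN BATCH
  INSERT INTO shop.orders (order_id, status) VALUES (1234, 'paid');
  INSERT INTO shop.order_events (order_id, ts, event)
    VALUES (1234, toTimestamp(now()), 'payment-received');
APPLY BATCH;
```

Because these rows live in different partitions, the batch log is involved, making this the expensive case described above.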

As with lightweight transactions, time coordination among nodes remains important with batched writes.


To avoid expensive lightweight transactions or batched writes, software can be installed alongside Cassandra to manage writes that need to be done together. Applications coordinate with this software to introduce locks into the write path, ensuring atomicity and isolation of updates; the external software manages these locks.

Two software tools that can be used for this type of workaround are Apache Zookeeper and Hashicorp Consul. Both of these tools are typically used to manage distributed configuration but can be leveraged to assist with Cassandra transactions. Whereas Zookeeper was originally created as an in-memory data store, Consul was built to be a configuration manager.


Because Zookeeper is essentially a data store, several libraries were created for the locking functionality. Two of these are Google’s Cages and Netflix’s Curator (now maintained as an Apache project). Note that Zookeeper and the Cages/Curator libraries have not been updated in several years. There is no reason application developers could not write similar functionality within their main application to interact with Zookeeper, perhaps using these as references.


Cages is a Java library used to synchronize the movement of data among distributed machines, making Cassandra transactional workloads an ideal use case.

Cages includes several classes for reading and writing data. A pertinent one for transactional workloads is ZkWriteLock, used to wrap statements inside a lock stored in Zookeeper. Note that this lock stored in Zookeeper has nothing to do with the underlying Cassandra functionality, and must be adhered to by all parts of the application. Indeed, the application or another user could bypass the lock and interact directly with Cassandra.


Curator was created specifically to manage Zookeeper, resulting in a tighter integration. Curator works similarly to Cages, though, wrapping statements in a mutex and requiring the application to observe the locks to ensure data consistency.


Consul is also a distributed storage system used to manage configuration and similar data. It is more recently developed and remains actively maintained.

The distribution of Consul storage is highly flexible and performant, making it a great alternative to Zookeeper. The basic interaction from the application remains the same: the application would store a lock as a key-value in Consul, and all writes from the application would need to respect the lock.
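Consul even ships a CLI wrapper for this pattern: it acquires a lock under a key-value prefix, runs a command while holding it, and releases it on exit (the prefix and script are hypothetical):

```shell
# Holds the lock at KV prefix "locks/orders" while the script runs;
# a second invocation blocks until the first releases the lock.
consul lock locks/orders ./apply-order-writes.sh
```

As with Zookeeper, the lock is purely advisory: every writer in the application must go through it, since nothing stops a client from talking to Cassandra directly.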


Introducing extra steps in the write path is not free with regard to performance.

In addition to the lag inherent to locking, Zookeeper can become a bottleneck. This can be avoided by scaling the Zookeeper clusters. A feature called Observer helps to reduce time spent getting a quorum from the Zookeeper cluster.

Regardless, there is an upper limit of about 5-10K operations per second that you can perform against Zookeeper, so take this into consideration when planning an architecture.


If the transactional workload is infrequent and minimal, lightweight transactions should suffice. However, if transactions are a core function of the application, we recommend using Zookeeper or Consul to manage write locks. Zookeeper has a longer history, but Consul is more up-to-date and provides great flexibility and performance, giving us a preference for Consul.