ScyllaDB Tablets: Answering Your Top Questions

What does your team need to know about tablets, at a purely pragmatic level? Here are answers to the top user questions.

The latest ScyllaDB releases feature some significant architectural shifts. Tablets build upon a multi-year project to re-architect our legacy ring architecture. And our metadata is now fully consistent, thanks to the assistance of Raft. Together, these changes can help teams with elasticity, speed, and operational simplicity. Avi Kivity, our CTO and co-founder, provided a detailed look at why and how we made this shift in a series of blogs (Why ScyllaDB Moved to “Tablets” Data Distribution and How We Implemented ScyllaDB’s “Tablets” Data Distribution). Here are some of the questions we’ve heard from interested users, and a short summary of how we answer them.

What’s the TL;DR on tablets?

Tablets are the smallest replication unit in ScyllaDB. Data gets distributed by splitting tables into smaller logical pieces called tablets, and this allows ScyllaDB to shift from a static to a dynamic topology. Tablets are dynamically balanced across the cluster using the Raft consensus protocol. This was introduced as part of a project to bring more elasticity to ScyllaDB, enabling faster topology changes and seamless scaling. Tablets acknowledge that most workloads do not follow a static traffic pattern. In fact, most follow a cyclical curve, with a baseline and peaks that vary over time. By decoupling topology changes from the actual streaming of data, tablets let infrastructure be scaled on demand and quickly, which presents significant cost-saving opportunities for users adopting ScyllaDB.
Previously, adding or removing nodes required a sequential, one-at-a-time and serializable process with data streaming and rebalancing. Now, you can add or remove multiple nodes in parallel. This significantly speeds up the scaling process and makes ScyllaDB much more elastic. Tablets are distributed on a per-table basis, with each table having its own set of tablets. The tablets are then further distributed across the shards in the ScyllaDB cluster. The distribution is handled automatically by ScyllaDB, with tablets being dynamically migrated across replicas as needed. Data within a table is split across tablets based on the average geometric size of a token range boundary. How do I configure tablets? Tablets are enabled by default in ScyllaDB 2025.1 and are also available with ScyllaDB Cloud. When creating a new keyspace, you can specify whether to enable tablets or not. There are also three key configuration options for tablets: 1) the enable_tablets boolean setting, 2) the target_tablet_size_in_bytes (default is 5GB), and 3) the tablets property during a CREATE KEYSPACE statement. Here are a few tips for configuring these settings: enable_tablets indicates whether newly created keyspaces should rely on tablets for data distribution. Note that tablets are currently not yet enabled for workloads requiring the use of Counters Tables, Secondary Indexes, Materialized Views, CDC, or LWT. target_tablet_size_in_bytes indicates the average geometric size of a tablet, and is particularly useful during tablet split and merge operations. The default indicates splits are done when a tablet reaches 10GB and merges at 2.5GB. A higher value means tablet migration throughput can be reduced (due to larger tablets), whereas a lower value may significantly increase the number of tablets. The tablets property allows you to opt for tablets on a per keyspace basis via the ‘enabled’ boolean sub-option. 
This is particularly important if some of your workloads rely on the currently unsupported features mentioned earlier: you may opt out for these tables and fall back to the still-supported vNode-based replication strategy. Still under the tablets property, the ‘initial’ sub-option determines how many tablets are created upfront on a per-table basis. We recommend that you target 100 tablets/shard. In future releases, we’ll introduce per-table tablet options to extend and simplify this process while deprecating the keyspace sub-option.

How/why should I monitor tablet distribution?

Starting with ScyllaDB Monitoring 4.7, we introduced two additional panels for observing the distribution of tablets within a ScyllaDB cluster. These metrics are present within the Detailed dashboard under the Tablets section.

The Tablets over time panel is a heatmap showing the tablet distribution over time. As the data size of a tablet-enabled table grows, you should observe the number of tablets increasing (tablet split) and being automatically distributed by ScyllaDB. Likewise, as the table size shrinks, the number of tablets should be reduced (tablet merge, but to no less than the ‘initial’ value configured in the keyspace tablets property). Similarly, as you perform topology changes (e.g., adding nodes), you can monitor the tablet distribution progress. You’ll notice that the tablet count of existing replicas decreases while that of new replicas increases.

The Tablets per DC/Instance/Shard panel shows the absolute count of tablets within the cluster. Under homogeneous setups running on top of instances of the same size, this metric should be evenly balanced. However, the situation changes for heterogeneous setups with different shard counts. In this situation, it is expected that larger instances will hold more tablets given their additional processing power.
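To make the sizing rules above concrete, here is a minimal Python sketch. It assumes, as the text suggests, that a tablet splits at twice the target tablet size and merges at half of it. The tablet-count estimate and the shard-proportional spread are simplifying assumptions for illustration only, not ScyllaDB’s actual load-balancing algorithm, and all function and node names are hypothetical.

```python
import math

GB = 1024 ** 3  # binary gigabytes assumed for this sketch


def split_merge_thresholds(target_tablet_size_bytes):
    """Split at 2x the target size, merge at 0.5x (5GB -> 10GB / 2.5GB)."""
    return 2 * target_tablet_size_bytes, target_tablet_size_bytes / 2


def estimate_tablet_count(table_size_bytes, target_tablet_size_bytes, initial=1):
    """Rough estimate: table size over target size, never below 'initial'."""
    by_size = math.ceil(table_size_bytes / target_tablet_size_bytes)
    return max(initial, by_size)


def tablets_per_node(total_tablets, shards_per_node):
    """Assumption: tablets are spread proportionally to each node's shard count."""
    total_shards = sum(shards_per_node.values())
    return {
        node: total_tablets * shards / total_shards
        for node, shards in shards_per_node.items()
    }


target = 5 * GB
split_at, merge_at = split_merge_thresholds(target)
count = estimate_tablet_count(1000 * GB, target)
# Hypothetical heterogeneous cluster: two 8-shard nodes and one 16-shard node.
spread = tablets_per_node(count, {"node-a": 8, "node-b": 8, "node-c": 16})
print(split_at // GB, merge_at / GB, count, spread)
```

With the default 5GB target, a table of about 1,000GB lands at around 200 tablets in this toy model, and the 16-shard instance ends up holding twice as many tablets as each 8-shard instance.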
This is, in fact, yet another benefit of tablets: the ability to run heterogeneous setups and leave it up to the database to determine how to internally maximize each instance’s performance capabilities. What are the impacts of tablets on maintenance tasks like node cleanup? The primary benefit of tablets is elasticity. Tablets allow you to easily and quickly scale out and in your database infrastructure without hassle. This not only translates to infrastructure savings (like avoiding being overprovisioned for the peak all the time). It also allows you to reach a higher percentage of storage utilization before rushing to add more nodes – so you can better utilize the underlying infrastructure you pay for. Another key benefit of tablets is that they eliminate the need for maintenance tasks like node cleanup. Previously, after scaling out the cluster, operators would need to run node cleanup to ensure data was properly evicted from nodes that no longer owned certain token ranges. With tablets, this is no longer necessary. The compaction process automatically handles the migration of data as tablets are dynamically balanced across the cluster. This is a significant operational improvement that reduces the maintenance burden for ScyllaDB users. The ability to now run heterogeneous deployments without running through cryptic and hours-long tuning cycles is also a plus. ScyllaDB’s tablet load balancer is smart enough to figure out how to distribute and place your data. It considers the amount of compute resources available, reducing the risk of traffic hotspots or data imbalances that may affect your clusters’ performance. In the future, ScyllaDB will bring transparent repairs on a per-tablet basis, further eliminating the need for users to worry about repairing their clusters, and also provide “temperature-based balancing” so that hot partitions get split and other shards cooperate with the incoming load. Do I need to change drivers? 
ScyllaDB’s latest drivers are tablet-aware, meaning they understand the tablets concept and can route queries to the correct nodes and shards. However, the drivers do not directly query the internal system.tablets table. That could become unwieldy as the number of tablets grows. Furthermore, tablets are transient, meaning a replica owning a tablet may no longer be a natural endpoint for it as time goes by. Instead, the drivers use a dynamic routing process: when a query is sent to the wrong node/shard, the coordinator will respond with the correct routing information, allowing the driver to update its routing cache. This ensures efficient query routing as tablets are migrated across the cluster. When using ScyllaDB tablets, it’s more important than ever to use ScyllaDB shard-aware – and now also tablet-aware – drivers instead of Cassandra drivers. The existing drivers will still work, but they won’t work as efficiently because they lack the necessary logic to understand the coordinator-provided tablet metadata. Using the latest ScyllaDB drivers should provide a nice throughput and latency boost. Read more in How We Updated ScyllaDB Drivers for Tablets Elasticity. More questions? If you’re interested in tablets and we didn’t answer your question here, please reach out to us! Our Contact Us page offers a number of ways to interact, including a community forum and Slack.  
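As a rough illustration of the lazy routing described in the drivers answer above, here is a toy Python model. It is not the real driver or server code: `ToyCluster`, `ToyDriver`, the token ranges, and the shape of the routing hint are all hypothetical. The only behavior taken from the text is that a misrouted request comes back with the correct routing information, which the driver stores in its cache.

```python
class ToyCluster:
    """Stand-in for the server side: knows which node owns each tablet."""

    def __init__(self, tablets):
        # Each tablet is (first_token, last_token, owner_node), inclusive range.
        self.tablets = list(tablets)

    def execute(self, contacted_node, token):
        """Serve the request; attach a hint if the wrong node was contacted."""
        for first, last, owner in self.tablets:
            if first <= token <= last:
                hint = None if contacted_node == owner else (first, last, owner)
                return {"owner": owner, "hint": hint}
        raise ValueError("token outside all tablet ranges")

    def migrate(self, index, new_owner):
        """Simulate the load balancer moving a tablet to another node."""
        first, last, _ = self.tablets[index]
        self.tablets[index] = (first, last, new_owner)


class ToyDriver:
    """Learns tablet locations only from hints, never from a full table scan."""

    def __init__(self, cluster, fallback_node):
        self.cluster = cluster
        self.fallback_node = fallback_node
        self.routes = {}     # (first_token, last_token) -> owner node
        self.misroutes = 0

    def _pick_node(self, token):
        for (first, last), owner in self.routes.items():
            if first <= token <= last:
                return owner
        return self.fallback_node

    def query(self, token):
        reply = self.cluster.execute(self._pick_node(token), token)
        if reply["hint"] is not None:
            first, last, owner = reply["hint"]
            self.routes[(first, last)] = owner  # refresh the stale or missing entry
            self.misroutes += 1
        return reply["owner"]


cluster = ToyCluster([(0, 99, "node-a"), (100, 199, "node-b")])
driver = ToyDriver(cluster, fallback_node="node-a")
driver.query(150)             # unknown tablet: misrouted once, hint cached
driver.query(160)             # served from the routing cache
cluster.migrate(1, "node-c")  # the tablet moves; the cache is now stale
driver.query(170)             # misrouted once more, cache corrected
driver.query(180)             # routed correctly again
print(driver.misroutes, driver.routes)
```

In this sketch, each tablet costs the driver at most one misrouted request after it moves, which is why the cache can stay small even as the number of tablets grows.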

Introducing ScyllaDB X Cloud: A (Mostly) Technical Overview

ScyllaDB X Cloud just landed! It’s a truly elastic database that supports variable/unpredictable workloads with consistent low latency, plus low costs. The ScyllaDB team is excited to announce ScyllaDB X Cloud, the next generation of our fully-managed database-as-a-service. It features architectural enhancements for greater flexibility and lower cost. ScyllaDB X Cloud is a truly elastic database designed to support variable/unpredictable workloads with consistent low latency as well as low costs. A few spoilers before we get into the details: You can now scale out and scale in almost instantly to match actual usage, hour by hour. For example, you can scale all the way from 100K OPS to 2M OPS in just minutes, with consistent single-digit millisecond P99 latency. This means you don’t need to overprovision for the worst-case scenario or suffer latency hits while waiting for autoscaling to fully kick in. You can now safely run at 90% storage utilization, compared to the standard 70% utilization. This means you need fewer underlying servers and have substantially less infrastructure to pay for. Optimizations like file-based streaming and dictionary-based compression also speed up scaling and reduce network costs. Beyond the technical changes, there’s also an important pricing update. To go along with all this database flexibility, we’re now offering a “Flex Credit” pricing model. Basically, this gives you the flexibility of on-demand pricing with the cost advantage that comes from an annual commitment. Access ScyllaDB X Cloud Now If you want to get started right away, just go to ScyllaDB Cloud and choose the X Cloud cluster type when you create a cluster. This is our code name for the new type of cluster that enables greater elasticity, higher storage utilization, and automatic scaling. Note that X Cloud clusters are available from the ScyllaDB Cloud application (below) and API. 
They’re available on AWS and GCP, running on a ScyllaDB account or your company’s account with the Bring Your Own Account (BYOA) model. Sneak peek: In the next release, you won’t need to choose instance size or number of services if you select the X Cloud option. Instead, you will be able to define a serverless scaling policy and let X Cloud scale the cluster as required. If you want to learn more, keep reading. In this blog post, we’ll cover what’s behind the technical changes and also talk a little about the new pricing option. But first, let’s start with the why.

Backstory

Why did we do this? Consider a marketing/AdTech platform that provides event-based targeting. Its traffic pattern, with predictable/cyclical daily peaks and a low baseline during off-hours, is quite common across retail platforms, food delivery services, and other applications aligned with customer work hours. In this case, the peak loads are 3x the base and require 2-3x the resources. With ScyllaDB X Cloud, they can provision for the baseline and quickly scale in/out as needed to serve the peaks. They get the steady low latency they need without having to overprovision, that is, paying for peak capacity 24/7 when it’s really only needed for 4 hours a day.

Tablets + just-in-time autoscaling

If you follow ScyllaDB, you know that tablets aren’t new. We introduced them last year for ScyllaDB Enterprise (self-managed on the cloud or on-prem). Avi Kivity, our CTO, already provided a look at why and how we implemented tablets. With tablets, data gets distributed by splitting tables into smaller logical pieces (“tablets”), which are dynamically balanced across the cluster using the Raft consensus protocol. This enables you to scale your databases as rapidly as you can scale your infrastructure. In a self-managed ScyllaDB deployment, tablets make it much faster and simpler to expand and reduce your database capacity.
However, you still need to plan ahead for expansion and initiate the operations yourself. ScyllaDB X Cloud lets you take full advantage of tablets’ elasticity. Scaling can be triggered automatically based on storage capacity (more on this below) or based on your knowledge of expected usage patterns. Moreover, as capacity expands and contracts, we’ll automatically optimize both node count and utilization. You don’t even have to choose node size; ScyllaDB X Cloud’s storage-utilization target does that for you. This should simplify admin and also save costs.

90% storage utilization

ScyllaDB has always handled running at 100% compute utilization well by having automated internal schedulers manage compactions, repairs, and lower-priority tasks in a way that prioritizes performance. Now, it also does two things that let you increase the maximum storage utilization to 90%: 1) since tablets can move data to new nodes so much faster, ScyllaDB X Cloud can defer scaling until the very last minute, and 2) support for mixed instance sizes allows ScyllaDB X Cloud to allocate minimal additional resources to keep the usage close to 90%. Previously, we recommended adding nodes at 70% capacity. This was because node additions were unpredictable and slow (sometimes taking hours or days) and you risked running out of space. We’d send a soft alert at 50% and automatically add nodes at 70%. However, those big nodes often sat underutilized. With ScyllaDB X Cloud’s tablets architecture, we can safely target 90% utilization. That’s particularly helpful for teams with storage-bound workloads.

Support for mixed size clusters

A little more on the “mixed instance size” support mentioned earlier. Basically, this means that ScyllaDB X Cloud can now add the exact mix of nodes you need to meet the exact capacity you need at any given time. Previous versions of ScyllaDB used a single instance size across all nodes in the cluster.
For example, if you had a cluster with 3 i4i.16xlarge instances, increasing the capacity meant adding another i4i.16xlarge. That works, but it’s wasteful: you’re paying for a big node that you might not immediately need. Now with ScyllaDB X Cloud (thanks to tablets and support for mixed-instance sizes), we can scale in much smaller increments. You can add tiny instances first, then replace them with larger ones if needed. That means you rarely pay for unused capacity. For example, before, if you started with an i4i.16xlarge node that had 15 TB of storage and you hit 70% utilization, you had to launch another i4i.16xlarge — adding 15 TB at once. With ScyllaDB X Cloud, you might add two xlarge nodes (2 TB each) first. Then, if you need more storage, you add more small nodes, then eventually replace them with larger nodes. And by the way, i7i instances are now available too, and they are even more powerful. The key is granular, just-in-time scaling: you only add what you need, when you need it. This applies in reverse, too. Before, you had to decommission a large node all at once. Now, ScyllaDB X Cloud can remove smaller nodes gradually based on the policies you set, saving compute and storage costs. Network-focused engineering optimizations Every gigabyte leaving a node, crossing an Availability Zone (AZ) boundary, or replicating to another region shows up on your AWS, GCP, or Azure bill. That’s why we’ve done some engineering work at different layers of ScyllaDB to shrink those bytes—and the dollars tied to them. File-based streaming We anticipated that mutation-based streaming would hold us back once we moved to tablets. So we shifted to a new approach: stream the entire SSTable files without deserializing them into mutation fragments and re-serializing them back into SSTables on receiving nodes. As a result, less data is streamed over the network and less CPU is consumed, especially for data models that contain small cells. 
Think of it as Cassandra’s zero-copy streaming, except that we keep ownership metadata with each replica. You can read more about this in the blog Why We Changed ScyllaDB’s Data Streaming Approach.

Dictionary-based compression

We also introduced dictionary-trained Zstandard (Zstd), which is pipeline-aware. This involved building a custom RPC compressor with external dictionary support, and a mechanism that trains new dictionaries on RPC traffic, distributes them over the cluster, and performs a live switch of connections to the new dictionaries. This is done in 4 key steps: 1) Sample: continuously sample RPC traffic for some time, 2) Train: train a 100 KiB dictionary on a 16 MiB sample, 3) Distribute: distribute the new dictionary via a system distributed table, and 4) Switch: negotiate the switch separately within each connection. LZ4 (Cassandra’s default) leaves you at 72% of the original size. Generic Zstd cuts that to 50%. Our per-cluster Zstd dictionary takes it down to 30%, which is roughly 2.4X smaller than with the default Cassandra compression.

Flex Credit

To close, let’s shift from the technical changes to a major pricing change: Flex Credit. Flex Credit is a new way to consume a ScyllaDB Cloud subscription. It can be applied to ScyllaDB Cloud as well as ScyllaDB Enterprise. Flex Credit provides the flexibility of on-demand pricing at a lower cost via an annual commitment. In combination with X Cloud, Flex Credit can be a great tool to reduce cost. You can use Reserved pricing for a load that’s known in advance and use Flex for less predictable bursts. This saves you from paying the higher on-demand pricing for anything above the reserved part. How might this play out in your day-to-day work? Imagine your baseline workload handles 100K OPS, but sometimes it spikes to 400K OPS. Previously, you’d have to provision (and pay for) enough capacity to sustain 400K OPS at all times. That’s inefficient and costly.
With ScyllaDB X Cloud, you reserve 100K OPS upfront. When a spike hits, we automatically call the API to spin up “flex capacity” – instantly scaling you to 400K OPS – and then tear it down when traffic subsides. You only pay for the extra capacity during the peak. Not sure what to choose? We can help advise based on your workload specifics (contact your representative or ping us here), but here’s some quick guidance in the meantime. Reserved Capacity: The most cost-effective option across all plans. Commit to a set number of cluster nodes or machines for a year. You lock in lower rates and guarantee capacity availability. This is ideal if your cluster size is relatively stable. Hybrid Model: Reserved + On-Demand: Commit to a baseline reserved capacity to lock in lower rates, but if you exceed that baseline (e.g., because you have a traffic spike), you can scale with on-demand capacity at an hourly rate. This is good if your usage is mostly stable but occasionally spikes. Hybrid Model: Reserved + Flex Credit: Commit to baseline reserved capacity for the lowest rates. For peak usage, use pre-purchased flex credit (which is discounted) instead of paying on-demand prices. Flex credit also applies to network and backup usage at standard provider rates. This is ideal if you have predictable peak periods (e.g., seasonal spikes, event-driven workload surges, etc.). You get the best of both worlds: low baseline costs and cost-efficient peak capacity. Recap In summary, ScyllaDB X Cloud uses tablets to enable faster, more granular scaling with mixed-instance sizes. This lets you avoid overprovisioning and safely run at 90% storage utilization. All of this will help you respond to volatile/unpredictable demand with low latencies and low costs. Moreover, flexible pricing (on-demand, flex credit, reserved) will help you pay only for what you need, especially when you have tablets scaling your capacity up and down in response to traffic spikes. 
There are also some network cost optimizations through file-based streaming and improved compression. Want to learn more? Our Co-Founder/CTO Avi Kivity will be discussing the design decisions behind ScyllaDB X Cloud’s elasticity and efficiency. Join us for the engineering deep dive on July 10. ScyllaDB X Cloud: An Inside Look with Avi Kivity
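Returning to the dictionary-based compression described earlier, the following Python sketch shows why a shared dictionary helps with small RPC-style messages. It is only an analogy: it uses the preset-dictionary feature of zlib from the Python standard library rather than Zstd, and the “dictionary” is simply a few concatenated sample messages rather than a trained one. The message format is made up for the example.

```python
import zlib


def compress(payload, dictionary=None):
    """Deflate `payload`, optionally primed with a preset dictionary."""
    if dictionary is None:
        comp = zlib.compressobj(level=6)
    else:
        comp = zlib.compressobj(level=6, zdict=dictionary)
    return comp.compress(payload) + comp.flush()


def decompress(blob, dictionary=None):
    """Inverse of compress(); the same dictionary must be supplied."""
    if dictionary is None:
        decomp = zlib.decompressobj()
    else:
        decomp = zlib.decompressobj(zdict=dictionary)
    return decomp.decompress(blob) + decomp.flush()


# "Sample" step: collect some traffic (hypothetical RPC-like messages).
samples = [
    b'{"op":"write","keyspace":"ks","table":"events","token":%d}' % i
    for i in range(50)
]
# Stand-in for the "train" step: just reuse a few samples as the dictionary.
dictionary = b"".join(samples[:10])

message = b'{"op":"write","keyspace":"ks","table":"events","token":987654}'
plain = compress(message)
with_dict = compress(message, dictionary)

print(len(message), len(plain), len(with_dict))
```

A single short message has little internal repetition, so a generic compressor gains almost nothing on it. With the shared dictionary, most of the message is encoded as a reference to bytes both sides already hold. Both ends of a connection must agree on the dictionary, which is where the distribute and switch steps come in.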

A New Way to Estimate DynamoDB Costs

We built a new DynamoDB cost analyzer that helps developers understand what their workloads will really cost.

DynamoDB costs can blindside you. Teams regularly face “bill shock”: that sinking feeling when you look at a shockingly high bill and realize that you haven’t paid enough attention to your usage, especially with on-demand pricing. Provisioned capacity brings a different risk: performance. If you can’t accurately predict capacity or your math is off, requests get throttled. It’s a delicate balancing act. Although AWS offers a DynamoDB pricing calculator, it often misses the nuances of real-world workloads (e.g., bursty traffic, uneven access patterns, global tables, or caching). We wanted something better. In full transparency, we wanted something better to help the teams considering ScyllaDB as a DynamoDB alternative. So we built a new DynamoDB cost calculator that helps developers understand what their workloads will really cost. Although we designed it for teams comparing DynamoDB with ScyllaDB, we believe it’s useful for anyone looking to more accurately estimate their DynamoDB costs, for any reason. You can see the live version at calculator.scylladb.com.

How We Built It

We wanted to build something that would work client side, without the need for any server components. It’s a simple JavaScript single-page application that we currently host on GitHub Pages. If you want to check out the source code, feel free to take a look at https://github.com/scylladb/calculator

To be honest, working with the examples at https://calculator.aws/ was a bit of a nightmare, and when you “show calculations,” you get walls of text. I was tempted to take a shorter approach, like:

Monthly WCU Cost = WCUs × Price_per_WCU_per_hour × 730 hours/month

But every time I simplified this, I found it harder to get parity between what I calculated and the final price in AWS’s calculation.
Sometimes the difference was due to rounding, other times it was due to the mixture of reserved + provisioned capacity, and so on. So to make it easier (for me) to debug, I faithfully followed their calculations line by line and tried to replicate this in my own rather ugly function: https://github.com/scylladb/calculator/blob/main/src/calculator.js

I may still refactor this into smaller functions. But for now, I wanted to get parity between theirs and ours. You’ll see that there are also some end-to-end tests for these calculations. I use those to test a bunch of different configurations, and I will probably expand on them in time as well.

So that gets the job done for the On Demand, Provisioned (and Reserved) capacity models. If you’ve used AWS’s calculator, you know that you can’t specify things like a peak (or peak width) in On Demand. I’m not sure about their reasoning. I decided it would be easier for users to specify both the baseline and peak for reads and writes (respectively) in On Demand, much like Provisioned capacity.

Another design decision was to represent the traffic using a chart. I do better with visuals, so seeing the peaks and troughs makes it easier for me to understand, and I hope it does for you as well. You’ll also notice that as you change the inputs, the URL query parameters change to reflect those inputs. That’s designed to make it easier to share and reference specific variations of costs.

There’s some other math in there, like figuring out the true cost of Global Tables and understanding derived costs of things like network transfer or DynamoDB Accelerator (DAX). However, explaining all that is a bit too dense for this format. We’ll talk more about that in an upcoming webinar (see the next section). The good news is that you can estimate these costs in addition to your workload, as they can be big cost multipliers when planning out your usage of DynamoDB.
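For readers who want to play with the simplified formula quoted earlier (WCUs × price per WCU-hour × 730 hours/month), here is a small Python sketch, extended with a baseline-plus-peak variant. This is not the calculator’s actual code and it deliberately ignores rounding, reserved capacity, and everything else that made parity with AWS hard. The price used is a made-up placeholder, not a real AWS rate, and the function names are hypothetical.

```python
HOURS_PER_MONTH = 730


def monthly_provisioned_cost(units, price_per_unit_hour, hours=HOURS_PER_MONTH):
    """Simplified formula: units x hourly price per unit x hours."""
    return units * price_per_unit_hour * hours


def monthly_cost_with_peak(baseline_units, peak_units, peak_hours_per_day,
                           price_per_unit_hour):
    """Pay for the baseline off-peak and for the peak level during peak hours."""
    days_per_month = HOURS_PER_MONTH / 24
    peak_hours = peak_hours_per_day * days_per_month
    baseline_hours = HOURS_PER_MONTH - peak_hours
    return (
        monthly_provisioned_cost(baseline_units, price_per_unit_hour, baseline_hours)
        + monthly_provisioned_cost(peak_units, price_per_unit_hour, peak_hours)
    )


PLACEHOLDER_PRICE = 0.001  # hypothetical price per WCU-hour, for illustration only

flat = monthly_provisioned_cost(1000, PLACEHOLDER_PRICE)
peaky = monthly_cost_with_peak(1000, 3000, peak_hours_per_day=4,
                               price_per_unit_hour=PLACEHOLDER_PRICE)
always_peak = monthly_provisioned_cost(3000, PLACEHOLDER_PRICE)
print(round(flat, 2), round(peaky, 2), round(always_peak, 2))
```

Comparing `peaky` with `always_peak` shows how much of a bill can come from sizing for a peak that only lasts a few hours a day, which is the kind of what-if question the calculator is meant to answer.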
Explore “what if” scenarios for your own workloads

Analyzing Costs in Real-World Scenarios

The ultimate goal of all this tinkering and tuning is to help you explore various “what-if” scenarios from a DynamoDB cost perspective. To get started, we’re sharing the cost impacts of some of the more interesting DynamoDB user scenarios we’ve come across at ScyllaDB. My colleague Gui and I just got together for a deep dive into how factors like traffic surges, multi-datacenter expansion, and the introduction of caching (e.g., DAX) impact DynamoDB costs. We explored how a few (anonymized) teams we work with ended up blindsided by their DynamoDB bills and the various options they considered for getting costs back under control.

Watch the DynamoDB costs chat now

Cassandra vs. MongoDB: When to Use Which

Which NoSQL database works best for powering your GenAI use cases? A look at Cassandra vs. MongoDB and which to use when.

Rust Rewrite, Postgres Exit: Blitz Revamps Its “League of Legends” Backend

How Blitz scaled their game coaching app with lower latency and leaner operations Blitz is a fast-growing startup that provides personalized coaching for games such as League of Legends, Valorant, and Fortnite. They aim to help gamers become League of Legends legends through real-time insights and post-match analysis. While players play, the app does quite a lot of work. It captures live match data, analyzes it quickly, and uses it for real-time game screen overlays plus personalized post-game coaching. The guidance is based on each player’s current and historic game activity, as well as data collected across billions of matches involving hundreds of millions of users. Thanks to growing awareness of Blitz’s popular stats and game-coaching app, their steadily increasing user base pushed their original Postgres- and Elixir-based architecture to its limits. This blog post explains how they recently overhauled their League of Legends data backend – using Rust and ScyllaDB. TL;DR – In order to provide low latency, high availability, and horizontal scalability to their growing user base, they ultimately: Migrated backend services from Elixir to Rust. Replaced Postgres with ScyllaDB Cloud. Heavily reduced their Redis footprint. Removed their Riak cluster. Replaced queue processing with realtime processing. Consolidated infrastructure from over a hundred cores of microservices to four n4‑standard‑4 Google Cloud nodes (plus a small Redis instance for edge caching) As an added bonus, these changes ended up cutting Blitz’s infrastructure costs and reducing the database burden on their engineering staff. Blitz Background As Naveed Khan (Head of Engineering at Blitz) explained, “We collect a lot of data from game publishers and during gameplay. For example, if you’re playing League of Legends, we use Riot’s API to pull match data, and if you install our app we also monitor gameplay in real time. 
All of this data is stored in our transactional database for initial processing, and most of it eventually ends up in our data lake.” Scaling Past Postgres One key part of Blitz’s system is the Playstyles API, which analyzes pre-game data for both teammates and opponents. This intensive process evaluates up to 20 matches per player and runs nine separate times per game (once for each player in the match). The team strategically refactored and consolidated numerous microservices to improve performance. But the data volume remained intense. According to Brian Morin (Principal Backend Engineer at Blitz), “Finding a database solution capable of handling this query volume was critical.” They originally used Postgres, which served them well early on. However, as their write-heavy workloads scaled, the operational complexity and costs on Google Cloud grew significantly. Moreover, scaling Postgres became quite complex. Naveed shared, “We tried all sorts of things to scale. We built multiple services around Postgres to get the scale we needed: a Redis cluster, a Riak cluster, and Elixir Oban queues that occasionally overflowed. Queue management became a big task.” To stay ahead of the game, they needed to move on. As startups scale, they often switch from “just use Postgres” to “just use NoSQL.” Fittingly, the Blitz team considered moving to MongoDB, but eventually ruled it out. “We had lots of MongoDB experience in the team and some of us really liked it. However, our workload is very write-heavy, with thousands of concurrent players generating a constant stream of data. MongoDB uses a single-writer architecture, so scaling writes means vertically scaling one node.” In other words, MongoDB’s primary-secondary architecture would create a bottleneck for their specific workload and anticipated growth. They then decided to move forward with RocksDB because of its low latency and cost considerations. 
Tests showed that it would meet their latency needs, so they performed the required data (re)modeling and migrated a few smaller games over from Postgres to RocksDB. However, they ultimately decided against RocksDB due to scale and high availability concerns. “Based on available data from our testing, it was clear RocksDB wouldn’t be able to handle the load of our bigger games – and we couldn’t risk vertically scaling a single instance, and then having that one instance go down,” Naveed explained. Why ScyllaDB One of their backend engineers suggested ScyllaDB, so they reached out and ran a proof of concept. They were primarily looking for a solution that can handle the write throughput, scales horizontally, and provides high availability. They tested it on their own hardware first, then moved to ScyllaDB Cloud. Per Naveed, “The cost was pretty close to self-hosting, and we got full management for free, so it was a no-brainer. We now have a significantly reduced Redis cluster, plus we got rid of the Riak cluster and Oban queues dependencies. Just write to ScyllaDB and it all just works. The amount of time we spend on infrastructure management has significantly decreased.” Performance-wise, the shift met their goal of leveling up the user experience … and also simplified life for their engineering teams. Brian added, “ScyllaDB proved exceptional, delivering robust performance with capacity to spare after optimization. Our League product peaks at around 5k ops/sec with the cluster reporting under 20% load. Our biggest constraint has been disk usage, which we’ve rolled out multiple updates to mitigate. The new system can now often return results immediately instead of relying on cached data, providing more up-to-date information on other players and even identifying frequent teammates. The results of this migration have been impressive: over a hundred cores of microservices have been replaced by just four n4-standard-4 nodes and a minimal Redis instance for caching. 
Additionally, a 3xn2-highmem ScyllaDB cluster has effectively replaced the previous relational database infrastructure that required significant computing resources.” High-Level Architecture of Blitz Server with Rust and ScyllaDB Rewriting Elixir Services into Rust As part of a major backend overhaul, the Blitz team began rethinking their entire infrastructure – beyond the previously described shift from Postgres to the high-performance and distributed ScyllaDB. Alongside this database migration, they also chose to sunset their Elixir-based services in favor of a more modern language. After careful evaluation, Rust emerged as the clear choice. “Elixir is great and it served its purpose well,” explained Naveed. “But we wanted to move toward something with broader adoption and a stronger systems-level ecosystem. Rust proved to be a robust and future-proof alternative.” Now that the first batch of Rust rewritten services are in production, Naveed and team aren’t looking back: “Rust is fantastic. It’s fast, and the compiler forces you to write memory-safe code upfront instead of debugging garbage-collection issues later. Performance is comparable to C, and the talent pool is also much larger compared to Elixir.”

Why We Changed ScyllaDB’s Data Streaming Approach

How moving from mutation-based streaming to file-based streaming resulted in 25X faster streaming time Data streaming – an internal operation that moves data from node to node over a network – has always been the foundation of various ScyllaDB cluster operations. For example, it is used by “add node” operations to copy data to a new node in a cluster (as well as “remove node” operations to do the opposite). As part of our multiyear project to optimize ScyllaDB’s elasticity, we reworked our approach to streaming. We recognized that when we moved to tablets-based data distribution, mutation-based streaming would hold us back. So we shifted to a new approach: stream the entire SSTable files without deserializing them into mutation fragments and re-serializing them back into SSTables on receiving nodes. As a result, less data is streamed over the network and less CPU is consumed, especially for data models that contain small cells. Mutation-Based Streaming In ScyllaDB, data streaming is a low-level mechanism to move data between nodes. For example, when nodes are added to a cluster, streaming moves data from existing nodes to the new nodes. We also use streaming to decommission nodes from the cluster. In this case, streaming moves data from the decommissioned nodes to other nodes in order to balance the data across the cluster. Previously, we were using a streaming method called mutation-based streaming.   On the sender side, we read the data from multiple SSTables. We get a stream of mutations, serialize them, and send them over the network. On the receiver side, we deserialize and write them to SSTables. File-Based Streaming Recently, we introduced a new file-based streaming method. The big difference is that we do not read the individual mutations from the SSTables, and we skip all the parsing and serialization work. Instead, we read and send the SSTable directly to remote nodes. A given SSTable always belongs to a single tablet. 
This means we can always send the entire SSTable to other nodes without worrying about whether the SSTable contains unwanted data. We implemented this by having the Seastar RPC stream interface stream SSTable files on the network for tablet migration. More specifically, we take an internal snapshot of the SSTables we want to transfer so the SSTables won’t be deleted during streaming. Then, SSTable file readers are created for them so we can use the Seastar RPC stream to send the SSTable files over the network. On the receiver side, the file streams are written into SSTable files by the SSTable writers.       Why did we do this? First, it reduces CPU usage because we do not need to read each and every mutation fragment from the SSTables, and we do not need to parse mutations. The CPU reduction is even more significant for small cells, where the ratio of the amount of metadata parsed to real user data is higher. Second, the format of the SSTable is much more compact than the mutation format (since on-disk presentation of data is more compact than in-memory). This means we have less data to send over the network. As a result, it can boost the streaming speed rather significantly. Performance Improvements To quantify how this shift impacted performance, we compared the performance of mutation-based and file-based streaming when migrating tablets between nodes. The tests involved: 3 ScyllaDB nodes i4i.2xlarge 3 loaders t3.2xlarge 1 billion partitions Here are the results:   Note that file-based streaming results in 25 times faster streaming time. We also have much higher streaming bandwidth: the network bandwidth is 10 times faster with file-based streaming. As mentioned earlier, we have less data to send with file streaming. The data sent on the wire is almost three times less with file streaming. In addition, we can also see that file-based streaming consumes many fewer CPU cycles. Here’s a little more detail, in case you’re curious. 
Disk IO Queue

The following sections show how the IO bandwidth compares across mutation-based and file-based streaming. Different colors represent different nodes. As expected, the throughput was higher with file-based streaming. The detailed IO results for mutation-based streaming show a streaming bandwidth of 30-40MB/s. The bandwidth for file-based streaming is much higher. The pattern also differs from the mutation-based graph because file streaming completes more quickly and sustains a high transfer bandwidth throughout.

CPU Load

We found that overall CPU usage is much lower for file-based streaming. CPU usage is around 12% for mutation-based streaming and less than 5% for file-based streaming. Again, the pattern differs from the mutation-based streaming graph because file streams complete much more quickly and can maintain a high transfer bandwidth throughout.

Wrap Up

This new file-based streaming makes data streaming in ScyllaDB faster and more efficient. You can explore it in ScyllaDB Cloud or ScyllaDB 2025.1. Our CTO and co-founder Avi Kivity also shares an extensive look at our other recent and upcoming engineering projects in a tech talk.

The Strategy Behind ReversingLabs’ Monster Scale Key-Value Migration

Migrating 300+ TB of data and 400+ services from a key-value database to ScyllaDB – with zero downtime

ReversingLabs recently completed the largest migration in their history: migrating more than 300 TB of data, more than 400 services, and data models from their internally-developed key-value database to ScyllaDB seamlessly, and with zero downtime. Services using multiple tables (reading, writing, and deleting data, and even using transactions) needed to go through a fast and seamless switch. How did they pull it off? Martina recently shared their strategy, including data modeling changes, the actual data migration, service migration, and a peek at how they addressed distributed locking. Her complete tech talk covers the details; the highlights follow below.

About ReversingLabs

ReversingLabs is a security company that aims to analyze every enterprise software package, container and file to identify potential security threats and mitigate cybersecurity risks. They maintain a library of 20B classified samples of known “goodware” (benign) and malware files and packages. Those samples are supported by ~300 TB of metadata, which are processed using a network of approximately 400 microservices. As Martina put it: “It’s a huge system, complex system – a lot of services, a lot of communication, and a lot of maintenance.”

Never build your own database (maybe?)

When the ReversingLabs team set out to select a database in 2011, the options were limited:
- Cassandra was at version 0.6, which lacked row-level isolation
- DynamoDB was not yet released
- ScyllaDB was not yet released
- MongoDB 1.6 had consistency issues between replicas
- PostgreSQL was struggling with multi-version concurrency control (MVCC), which created significant overhead

“That was an issue for us: Postgres used so much memory,” Martina explained. “For a startup with limited resources, having a database that ate all our memory was a problem. So we built our own data store.
I know, it’s scandalous—a crazy idea today—but in this context, in this market, it made sense.” The team built a simple key-value store tailored to their specific needs—no extra features, just efficiency. It required manual maintenance and was only usable by their specialized database team. But it was fast, used minimal resources, and helped ReversingLabs, as a small startup, handle massive amounts of data (which became a core differentiator). However, after 10 years, ReversingLabs’ growing complexity and expanding use cases became overwhelming – to the database itself and the small database team responsible for it. Realizing that they reached their home-grown database’s tipping point, they started exploring alternatives. Enter ScyllaDB. Martina shared: “After an extensive search, we found ScyllaDB to be the most suitable replacement for our existing database. It was fast, resilient, and scalable enough for our use case. Plus, it had all the features our old database lacked. So, we decided on ScyllaDB and began a major migration project.” Migration Time The migration involved 300 TB of data, hundreds of tables, and 400 services. The system was complex, so the team followed one rule: keep it simple. They made minimal changes to the data model and didn’t change the code at all. “We decided to keep the existing interface from our old database and modify the code inside it,” Martina shared. “We created an interface library and adapted it to work with the ScyllaDB driver. The services didn’t need to know anything about the change—they were simply redeployed with the new version of the library, continuing to communicate with ScyllaDB instead of the old database.” Moving from a database with a single primary node to one with a leaderless ring architecture did require some changes, though. The team had to adjust the primary key structure, but the value itself didn’t need to be changed. In the old key-value store, data was stored as a packed protobuf with many fields. 
Although ScyllaDB could unpack these protobufs and separate the fields, the team chose to keep them as they were to ensure a smoother migration. At this point, they really just wanted to make it work exactly like before. The migration had to be invisible — they didn’t want API users to notice any differences. Here’s an overview of the migration process they performed once the models were ready: 1. Stream the old database output to Kafka The first step was to set up a Kafka topic dedicated to capturing updates from the old database. 2. Dump the old database into a specified location Once the streaming pipeline was in place, the team exported the full dataset from the old database. 3. Prepare a ScyllaDB table by configuring its structure and settings Before loading the data, they needed to create a ScyllaDB table with the new schema. 4. Prepare and load the dump into the ScyllaDB table With the table ready, the exported data was transformed as needed and loaded into ScyllaDB. 5. Continuously stream data to ScyllaDB They set up a continuous pipeline with a service that listened to the Kafka topic for updates and loaded the data into ScyllaDB. After the backlog was processed, the two databases were fully in sync, with only a negligible delay between the data in the old database and ScyllaDB. It’s a fairly straightforward process…but it had to be repeated for 100+ tables. Next Up: Service Migration The next challenge was migrating their ~400 microservices. Martina introduced the system as follows: “We have master services that act as data generators. They listen for new reports from static analysis, dynamic analysis, and other sources. These services serve as the source of truth, storing raw reports that need further processing. Each master service writes data to its own table and streams updates to relevant queues. 
The delivery services in the pipeline combine data from different master services, potentially populating, adding, or calculating something with the data, and combining various inputs. Their primary purpose is to store the data in a format that makes it easy for the APIs to read. The delivery services optimize the data for queries and store it in their own database, while the APIs then read from these new databases and expose the data to users.”

Here’s the 5-step approach they applied to service migration:

1. Migrate the APIs one by one
The team migrated APIs incrementally. Each API was updated to use the new ScyllaDB-backed interface library. After redeploying each API, the team monitored performance and data consistency before moving on to the next one.

2. Prepare for the big migration day
Once the APIs were migrated, they had to prepare for the big migration day. Since all the services before the APIs are intertwined, they had to be migrated all at once.

3. Stop the master services
On migration day, the team stopped the master services (data generators), causing input queues to accumulate until the migration was complete. During this time, the APIs continued serving traffic without any downtime. However, the data in the databases was delayed for about an hour or two until all services were fully migrated.

4. Migrate the delivery services
After stopping the master services, the team waited for the queues between the master and delivery services to empty, ensuring that the delivery services processed all data and stopped writing. The delivery services were then migrated one by one to the new database. No new data was arriving at this point because the master services were stopped.

5. Migrate and start the master services
At last, it was time to migrate and start the master services. The final step was to shut down the old database because everything was now working on ScyllaDB.

“It worked great,” Martina shared. “We were happy with the latencies we achieved.
If you remember, our old architecture had a single master node, which created a single point of failure. Now, with ScyllaDB, we had resiliency and high availability, and we were quite pleased with the results.” And Finally…Resource Locking One final challenge: resource locking. Per Martina, “In the old architecture, resource locking was simple because there was a single master node handling all writes. You could just use a mutex on the master node, and that was it—locking was straightforward. Of course, it needed to be tied to the database connection, but that was the extent of it.” ScyllaDB’s leaderless architecture meant that the team had to figure out distributed locking. They leveraged ScyllaDB’s lightweight transactions and built a distributed locking mechanism on top of it. The team worked closely with ScyllaDB engineers, going through several proofs of concept (POCs)—some successful, others less so. Eventually, they developed a working solution for distributed locking in their new architecture. You can read all the details in Martina’s blog post, Implementing distributed locking with ScyllaDB.  
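The article does not describe the locking mechanism itself, so the following is only a minimal sketch of one common pattern for building a lock on lightweight transactions: a conditional insert with a TTL to acquire, and a conditional delete to release. The `locks` table, its columns, and the `FakeLwtTable` class are hypothetical names used for illustration, and the in-memory class merely mimics the conditional semantics so the example runs without a cluster. It is not ReversingLabs' implementation.

```python
import time

# CQL a real implementation might issue (assumed, hypothetical schema:
# CREATE TABLE locks (resource text PRIMARY KEY, owner text)).
ACQUIRE_CQL = "INSERT INTO locks (resource, owner) VALUES (?, ?) IF NOT EXISTS USING TTL ?"
RELEASE_CQL = "DELETE FROM locks WHERE resource = ? IF owner = ?"


class FakeLwtTable:
    """In-memory stand-in that mimics the conditional semantics above."""

    def __init__(self, clock=time.monotonic):
        self._rows = {}      # resource -> (owner, expires_at)
        self._clock = clock  # injectable so tests can control time

    def insert_if_not_exists(self, resource, owner, ttl_seconds):
        """Acquire: succeeds only if no live row exists for the resource."""
        now = self._clock()
        row = self._rows.get(resource)
        if row is not None and row[1] > now:
            return False  # equivalent to [applied] = false
        self._rows[resource] = (owner, now + ttl_seconds)
        return True

    def delete_if_owner(self, resource, owner):
        """Release: succeeds only if the caller still owns a live lock."""
        row = self._rows.get(resource)
        if row is None or row[1] <= self._clock() or row[0] != owner:
            return False
        del self._rows[resource]
        return True
```

The TTL is a design choice in this sketch: it lets a lock expire on its own if the holder crashes, at the cost of requiring holders to finish (or renew) before the TTL elapses.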

Efficient Full Table Scans with ScyllaDB Tablets

“Tablets” data distribution makes full table scans on ScyllaDB more performant than ever Full scans are resource-intensive operations reading through an entire dataset. They’re often required by analytical queries such as counting total records, identifying users from specific regions, or deriving top-K rankings. This article describes how ScyllaDB’s shift to tablets significantly improves full scan performance and processing time, as well as how it eliminates the complex tuning heuristics often needed with the previous vNodes based approach. It’s been quite some time since we last touched on the subject of handling full table scans on ScyllaDB. Previously, Avi Kivity described how the CQL token() function could be used in a divide and conquer approach to maximize running analytics on top of ScyllaDB. We also provided sample Go code and demonstrated how easy and efficient full scans could be done. With the recent introduction of tablets, it turns out that full scans are more performant than ever. Token Ring Revisited Prior to tablets, nodes in a ScyllaDB cluster owned fractions of the token ring, also known as token ranges. A token range is nothing more than a contiguous segment represented by two (very large) numbers. By default, each node used to own 256 ranges, also known as vNodes. When data gets written to the cluster, the Murmur3 hashing function is responsible for distributing data to replicas of a given token range. A full table scan thus involved parallelizing several token ranges until clients eventually traverse the entire ring. As a refresher, a scan involves iterating through multiple subranges (smaller vNode ranges) with the help of the token() function, like this: SELECT ... FROM t WHERE token(key) >= ? AND token(key) < ? To fully traverse the ring as fast as possible, clients needed to keep parallelism high enough (number of nodes x shard count x some smudge factor) to fully benefit from all available processing power. 
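The token-range iteration described above can be sketched roughly as follows. This is an illustrative helper, not the sample Go code referenced earlier: `split_ring` and `parallelism` are hypothetical names, the table `t` and column `key` come from the query shown above, and the default smudge factor of 3 is an arbitrary assumption.

```python
MIN_TOKEN = -(2**63)      # lowest Murmur3 token
MAX_TOKEN = 2**63 - 1     # highest Murmur3 token


def parallelism(nodes, shards_per_node, smudge_factor=3):
    """Number of concurrent range queries to keep in flight (assumed factor)."""
    return nodes * shards_per_node * smudge_factor


def split_ring(num_subranges):
    """Split the whole token ring into contiguous subrange queries.

    Returns a list of (cql, start, end) tuples. All subranges are
    half-open [start, end) except the last one, which is closed so
    that MAX_TOKEN itself is not skipped.
    """
    span = MAX_TOKEN - MIN_TOKEN + 1  # 2**64 tokens in total
    bounds = [MIN_TOKEN + (span * i) // num_subranges for i in range(num_subranges)]
    bounds.append(MAX_TOKEN)
    queries = []
    for i in range(num_subranges):
        op = "<=" if i == num_subranges - 1 else "<"
        cql = f"SELECT ... FROM t WHERE token(key) >= ? AND token(key) {op} ?"
        queries.append((cql, bounds[i], bounds[i + 1]))
    return queries
```

A client would then run these queries through a worker pool sized by `parallelism(...)`, binding `start` and `end` to the two placeholders.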
In other words, different cluster topologies would require different parallelism settings, which could often change as nodes got added or removed. Traversing vNodes worked nicely, but the approach introduced some additional drawbacks, such as: Sparse tables result in wasted work because most token ranges contain little or no data. Popular and high-density ranges could require fine-grained tuning to prevent uneven load distribution and resource contention. Otherwise, they would be prone to processing bottlenecks and suboptimal utilization. It was impossible to scan a token range owned by a single shard, and particularly difficult to even scan a range owned by a single replica. This increases coordination overhead, and creates a performance ceiling on how fast a single token range could be processed. The old way: system.size_estimates To assist applications during range scans, ScyllaDB provided a node-local system.size_estimates table (something we inherited from Apache Cassandra) whose schema looks like this: CREATE TABLE system.size_estimates ( keyspace_name text, table_name text, range_start text, range_end text, mean_partition_size bigint, partitions_count bigint, PRIMARY KEY (keyspace_name, table_name, range_start, range_end) ) Every token range owned by a given replica provides an estimated number of partitions along with a mean partition size. The product of both columns therefore provides a raw estimate on how much data needs to be retrieved if a scan reads through the entire range. This design works nicely under small clusters and when data isn’t frequently changing. Since the data is node local, an application in charge of the full scan would be required to keep track of 256 vNodes*Node entries to submit its queries. Therefore, larger clusters could introduce higher processing overhead. Even then, (as the table name suggests) the number of partitions and their sizes are just estimates, which can be underestimated or overestimated. 
Underestimating a token range size makes a scan more prone to timeouts, particularly when its data contains a few large partitions along with many smaller keys. Overestimating it means a scan may take longer to complete due to wasted cycles while scanning through sparse ranges. Parsing the system.size_estimates table’s data is precisely what connectors like Trino and Spark do when you integrate them with either Cassandra or ScyllaDB. To address estimate skews, these tools often allow you to manually tune settings like split-size in a trial-and-error fashion until it somewhat works for your workload. The rationale works like this:
- Clients parse the system.size_estimates data from every node in the cluster (since vNodes are non-overlapping ranges, this fully describes the ring distribution)
- The size of a specific range is determined by partitionsCount * meanPartitionSize
- The client then calculates the estimated number of partitions and the size of the table to be scanned
- It evenly splits each vNode range into subranges, taking its corresponding ring fraction into account
- Subranges are parallelized across workers and routed to natural replicas as an additional optimization

Finally, prior to tablets there was no deterministic way to scan a particular range and target a specific ScyllaDB shard. vNodes have no 1:1 token/shard mapping, meaning a single coordinator request would often need to communicate with other replica shards, which made CPU contention more likely.

A layer of indirection: system.tablets

Starting with ScyllaDB 2024.2, tablets are production ready. Tablets are the foundation behind ScyllaDB elasticity, while also effectively addressing the drawbacks involved with full table scans under the old vNode structure. In case you missed it, I highly encourage you to watch Avi Kivity’s talk on Tablets: Rethinking Replication for an in-depth understanding of how tablets evolved from the previous static vNode topologies.
During his talk, Avi mentions that tablets are implemented as a layer of indirection involving a token range to a (replica, shard) tuple. This layer of indirection is exposed in ScyllaDB as the system.tablets table, whose schema looks like this: CREATE TABLE system.tablets ( table_id uuid, last_token bigint, keyspace_name text STATIC, resize_seq_number bigint STATIC, resize_type text STATIC, table_name text STATIC, tablet_count int STATIC, new_replicas frozen<list<frozen<tuple<uuid, int>>>>, replicas frozen<list<frozen<tuple<uuid, int>>>>, session uuid, stage text, transition text, PRIMARY KEY (table_id, last_token) ) A tablet represents a contiguous token range owned by a group of replicas and shards. Unlike the previous static vNode topology, tablets are created on a per table basis and get dynamically split or merged on demand. This is important, because workloads may vary significantly: Some are very throughput intensive under frequently accessed (and small) data sets and will have fewer tablets. These take less time to scan. Others may become considerably storage bound over time, spanning through multiple terabytes (or even petabytes) of disk space. These take longer to scan. A single tablet targets a geometric average size of 5GB before it gets split. Therefore, splits are done when a tablet reaches 10GB and merges at 2.5GB. Note that the average size is configurable, and the default might change in the future. However, scanning over each tablet owned range allows full scans to deterministically determine up to how much data they are reading. The only exception to this rule is when very large (larger than the average) partitions are involved, although this is an edge case. Consider the following set of operations: In this example, we start by defining that we want tables within the ks keyspace to start with 128 tablets each. After we create table t, observe that the tablet_count matches what we’ve set upfront. 
If we had asked for a number that is not a power of 2, the tablet_count would be rounded up to the next power of 2. The tablet_count represents the total number of tablets across the cluster, whereas the replicas column holds the (host ID, shard) tuples that are replicas of that tablet, matching our defined replication factor. Therefore, the previous logic can be optimized like this:
- Clients parse the system.tablets table and retrieve the existing tablet distribution
- Tablet ranges spanning the same replica-shards get grouped and split together
- Workers route requests to natural replica/shard endpoints via shard awareness by setting a routingKey for every request

Full scans over tablets benefit greatly from these improvements. By directly querying specific shards, we eliminate the cost of cross-CPU and cross-node communication. Traversing the ring is not only more efficient, but also removes the problem of sparse ranges and the need for different tuning logic for small and large tables. Finally, given that a tablet has a predetermined size, long gone are the days of fine-tuning splitSizes!

Example

This GitHub repo contains boilerplate code demonstrating how to carry out these tasks efficiently. The process involves splitting tablets into smaller pieces of work, and scheduling them evenly across their corresponding replica/shards. The scheduler ensures that replica shards are kept busy with at least 2 inflight requests each, whereas the least loaded replica always consumes pending work for processing. The code also simulates real-world latency variability by introducing some jitter during each request.

Conclusion

This is just the beginning of our journey with tablets. The logic explained in this blog is provided for application builders to follow as part of their full scan jobs. It is worth mentioning that the previous vNode technique is backward compatible and still works if you use tablets.
Remember that full scans often require reading through lots of data, and we highly recommend using BYPASS CACHE to avoid evicting important cached rows. Furthermore, ScyllaDB Workload Prioritization helps with isolation and ensures latencies of concurrent workloads are kept low. Happy scanning!
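As a rough illustration of the tablet-based planning described in this article, the sketch below turns rows read from system.tablets into per-shard scan ranges. It is not the code from the GitHub repo. `plan_tablet_scan` is a hypothetical helper, and it assumes that each tablet covers the range from the previous tablet's last_token (exclusive) up to its own last_token (inclusive), with the first tablet starting at the minimum token.

```python
MIN_TOKEN = -(2**63)

# Assumed query shape for one tablet range; t and key are placeholders.
SCAN_CQL = "SELECT ... FROM t WHERE token(key) > ? AND token(key) <= ? BYPASS CACHE"


def plan_tablet_scan(tablet_rows):
    """Group tablet token ranges by the (host_id, shard) chosen to serve them.

    tablet_rows: iterable of (last_token, replicas) for a single table,
    where replicas is a list of (host_id, shard) tuples.
    Returns {(host_id, shard): [(start_exclusive, end_inclusive), ...]}.
    """
    plan = {}
    start = MIN_TOKEN
    for last_token, replicas in sorted(tablet_rows, key=lambda row: row[0]):
        # Pick the replica/shard with the least work assigned so far.
        target = min((tuple(r) for r in replicas),
                     key=lambda r: len(plan.get(r, [])))
        plan.setdefault(target, []).append((start, last_token))
        start = last_token
    return plan
```

Each worker would then bind the two ends of a range to `SCAN_CQL` and route the request to the chosen replica and shard. Picking the least-loaded replica is a simplification here; a real scheduler would also track in-flight requests, as the repo's scheduler is described as doing.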

From Raw Performance to Price Performance: A Decade of Evolution at ScyllaDB

Tech journalist George Anadiotis catches up on how ScyllaDB’s latest releases deliver extreme elasticity and price-performance, and shares a peek at what’s next (vector search, object storage, and more)

This is a guest post authored by tech journalist George Anadiotis. It’s a follow-up to articles that he published in 2023 and 2022.

In business, they say it takes ten years to become an overnight success. In technology, they say it takes ten years to build a file system. ScyllaDB is in the technology business, offering a distributed NoSQL database that is monstrously fast and scalable. It turns out that it also takes ten years or more to build a successful database. This is something that Felipe Mendes and Guilherme Nogueira know well. Mendes and Nogueira are Technical Directors at ScyllaDB, working directly on the product as well as consulting clients. Recently, they presented some of the things they’ve been working on at ScyllaDB’s Monster Scale Summit, and they shared their insights in an exclusive fireside chat. You can also catch the podcast on Apple, Spotify, and Amazon.

The evolution of ScyllaDB

When ScyllaDB started out, it was all about raw performance. The goal was to be “the fastest NoSQL database available in the market, and we did that – we still are” as Mendes put it. However, as he added, raw speed alone does not necessarily make a good database. Features such as materialized views, secondary indexes, and integrations with third party solutions are really important as well. Adding such features marked the second generation in ScyllaDB’s evolution. ScyllaDB started as a performance-oriented alternative to Cassandra, so inevitably, evolution meant feature parity with Cassandra. The third generation of ScyllaDB was marked by the move to the cloud. ScyllaDB Cloud was introduced in 2019 and has been growing at 200% YoY.
As Nogueira shared, even today there are daily signups of new users ready to try the oddly-named database that’s used by companies such as Discord, Medium, and Tripadvisor, all of which the duo works with. The next generation brought a radical break from what Mendes called the inefficiencies in Cassandra, which involved introducing the Raft protocol for node coordination. Now ScyllaDB is moving to a new generation, by implementing what Mendes and Nogueira referred to as hallmark features: strong consistency and tablets.

Strong consistency and tablets

The combination of the new Raft and Tablets features enables clusters to scale up in seconds because it enables nodes to join in parallel, as opposed to sequentially, which was the case for the Gossip protocol in Cassandra (which ScyllaDB also relied on originally). But it’s not just adding nodes that’s improved, it’s also removing nodes. When a node goes down for maintenance, for example, ScyllaDB’s strong consistency support means that the rest of the nodes in the cluster will be immediately aware. By contrast, in the previously supported regime of eventual consistency via a gossip protocol, it could take such updates a while to propagate. Using Raft means transitioning to a state machine mechanism, as Mendes noted. A node leader is appointed, so when a change occurs in the cluster, the state machine is updated and the change is immediately propagated. Raft is used to propagate updates consistently at every step of a topology change. It also allows for parallel topology updates, such as adding multiple nodes at once. This was not possible under the gossip-based approach. And this is where tablets come in. With tablets, instead of having one single leader per cluster, there is one leader per tablet. A tablet is a logical abstraction that partitions data in tables into smaller fragments. Tablets are load-balanced after new nodes join, ensuring consistent distribution across the cluster.
Any changes to Tablets ownership are also ensured to be consistent by using Raft to propagate these changes. Each tablet is independent from the rest, which means that ScyllaDB with Raft can move them to other nodes on demand atomically and in a strongly consistent way as workloads grow or shrink. Speed, economy, elasticity By breaking down tables into smaller and more manageable units, data can be moved between nodes in a cluster much faster. This means that clusters can be scaled up rapidly, as Mendes demonstrated. When new nodes join a cluster, the data is redistributed in minutes rather than hours, which was the case previously (and is still the case with alternatives like Cassandra). When we’re talking about machines that have higher capacity, that also means that they have a higher storage density to be used, as Mendes noted. Tablets balance out in a way that utilizes storage capacity evenly, so all nodes in the cluster will have a similar utilization rate. That’s because the number of tablets at each node is determined according to the number of CPUs, which is always tied to storage in cloud nodes. In this sense, as storage utilization is more flexible and the cluster can scale faster, it also allows users to run at a much higher storage utilization rate. A typical storage utilization rate, Mendes said, is 50% to 60%. ScyllaDB aims to run at up to 90% storage utilization. That’s because tablets and cloud automations enable ScyllaDB Cloud to rapidly scale the cluster once those storage thresholds are exceeded, as ScyllaDB’s benchmarking shows. Going from 60% to 90% storage utilization means an extra 30% per node disk space can be utilized. At scale, that translates to significant savings for users. Further to scaling speed and economy, there is an additional benefit to tablets: enabling the elasticity of cloud operations for cloud deployments, without the complexity. 
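To make the utilization arithmetic above concrete, here is a small worked sketch. The figures (100 TB of data, 10 TB of disk per node) are hypothetical and chosen only for illustration; `nodes_needed` is an illustrative helper, and the calculation ignores replication factor and compaction headroom.

```python
import math


def nodes_needed(data_tb, node_capacity_tb, max_utilization):
    """Nodes required to hold data_tb without exceeding max_utilization per node."""
    usable_per_node = node_capacity_tb * max_utilization
    return math.ceil(data_tb / usable_per_node)


# Hypothetical cluster: 100 TB of data on nodes with 10 TB of disk each.
at_60_percent = nodes_needed(100, 10, 0.60)  # 100 / 6 TB usable per node
at_90_percent = nodes_needed(100, 10, 0.90)  # 100 / 9 TB usable per node
```

Under these assumed numbers, the cluster needs 17 nodes at a 60% ceiling but only 12 at 90%, which is where the savings at scale come from.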
Something old, something new, something borrowed, something blue Beyond strong consistency and tablets, there is a wide range of new features and improvements that the ScyllaDB team is working on. Some of these, such as support for S3 object storage, are efforts that are ongoing. Besides offering users choice, as well as a way to economize even further on storage, object storage support could also serve resilience. Other features, such as workload prioritization or the Alternator DynamoDB-compatible API, have been there for a while but are being improved and re-emphasized. As Mendes shared, when running a variety of workloads, it’s very hard for the database to know which is which and how to prioritize. Workload prioritization enables users to characterize and prioritize workloads, assigning appropriate service levels to each. Last but not least, ScyllaDB is also adding vector capabilities to the database engine. Vector data types, data structures, and query capabilities have been implemented and are being benchmarked. Initial results show great promise, even outperforming pure-play vector databases. This will eventually become a core feature, supported on both on-premise and cloud offerings. Once again, ScyllaDB is keeping with the times in its own characteristic way. As Mendes and Nogueira noted, there are many ScyllaDB clients using ScyllaDB to power AI workloads, some of them like Clearview AI sharing their stories. Nevertheless, ScyllaDB remains focused on database fundamentals, taking calculated steps in the spirit of continuous improvement that has become its trademark. After all, why change something that’s so deeply ingrained in the organization’s culture, is working well for them and appreciated by the ones who matter most – users?

How to Use Testcontainers with ScyllaDB

Learn how to use Testcontainers to create lightweight, throwaway instances of ScyllaDB for testing

Why wrestle with all the complexities of database configuration for each round of integration testing? In this blog post, we will explain how to use the Testcontainers library to provide lightweight, throwaway instances of ScyllaDB for testing. We’ll go through a hands-on example that includes creating the database instance and testing against it.

Testcontainers: A Valuable Tool for Integration Testing with ScyllaDB

You automatically unit test your code and (hopefully) integration test your system…but what about your database? To rest assured that the application works as expected, you need to extend beyond unit testing. You also need to automatically test how the units interact with one another and how they interact with external services and systems (message brokers, data stores, and so on).

But running those integration tests requires the infrastructure to be configured correctly, with all the components set to the proper state. You also need to ensure that the tests are isolated and don’t produce any side effects or “test pollution.” How do you reduce the pain…and get it all running in your CI/CD process?

This is where Testcontainers comes into play. Testcontainers is an open source library for throwaway, lightweight instances of databases (including ScyllaDB), message brokers, web browsers, or just about anything that can run in a Docker container. You define your dependencies in code, which makes it well-suited for CI/CD processes. When you run your tests, a ScyllaDB container will be created and then deleted. This allows you to test your application against a real instance of the database without having to worry about complex environment configurations. It also ensures that the database setup has no effect on the production environment.
Some of the advantages of using Testcontainers with ScyllaDB:

  • It launches Dockerized databases on demand, so you get a fresh environment for every test run.
  • It isolates tests with throwaway containers. There’s no test interference or state leakage since each test gets a pristine database state.
  • Tests are fast and realistic, since the container starts in seconds, ScyllaDB responds fast, and actual CQL responses are used.

Tutorial: Building a ScyllaDB Test Step-by-Step

The Testcontainers ScyllaDB integration works with Java, Go, Python (see example here), and Node.js. Here, we’ll walk through an example of how to use it with Java. The steps described below are applicable to any programming language and its corresponding testing framework. In our specific Java example, we will be using the JUnit 5 testing framework.

The integration between Testcontainers and ScyllaDB uses Docker. You can read more about using ScyllaDB with Docker, and learn the Best Practices for Running ScyllaDB on Docker.

Step 1: Configure Your Project Dependencies

Before we begin, make sure you have:

  • Java 21 or newer installed
  • Docker installed and running (required for Testcontainers)
  • Gradle 8 or newer

Note: If you are more comfortable with Maven, you can still follow this tutorial, but the setup and test execution steps will be different.

To verify that Java 21 or newer is installed, run:

java --version

To verify that Docker is installed and running correctly, run:

docker run hello-world

To verify that Gradle 8 or newer is installed, run:

gradle --version

Once you have verified that all of the relevant project dependencies are installed and ready, you can move on to creating a new project.

mkdir testcontainers-scylladb-java
cd testcontainers-scylladb-java
gradle init

A series of prompts will appear.
Here are the relevant choices you need to select:

  • Select application
  • Select java
  • Enter Java version: 21
  • Enter project name: testcontainers-scylladb-java
  • Select application structure: Single application project
  • Select build script DSL: Groovy
  • Select test framework: JUnit Jupiter
  • For “Generate build using new APIs and behavior” select no

After that part is finished, to verify the successful initialization of the new project, run:

./gradlew --version

If everything goes well, you should see a build.gradle file in the app folder. You will need to add the following dependencies in your app/build.gradle file:

dependencies {
    // Use JUnit Jupiter for testing.
    testImplementation libs.junit.jupiter
    testRuntimeOnly 'org.junit.platform:junit-platform-launcher'

    // This dependency is used by the application.
    implementation libs.guava

    // Add the required dependencies for the test
    testImplementation 'org.testcontainers:scylladb:1.20.5'
    testImplementation 'com.scylladb:java-driver-core:4.18.1.0'
    implementation 'ch.qos.logback:logback-classic:1.4.11'
}

Also, to get test report output in the terminal, you will need to add testLogging to the app/build.gradle file as well:

tasks.named('test') {
    // Use JUnit Platform for unit tests.
    useJUnitPlatform()

    // Add this testLogging configuration to get
    // the test results in terminal
    testLogging {
        events "passed", "skipped", "failed"
        showStandardStreams = true
        exceptionFormat = 'full'
        showCauses = true
    }
}

Once you’re finished editing the app/build.gradle file, you need to install the dependencies by running this command in the terminal:

./gradlew build

You should see the BUILD SUCCESSFUL output in the terminal.

The final preparation step is to create a ScyllaDBExampleTest.java file somewhere in the src/test/java folder. JUnit will find all tests in the src/test/java folder.
For example:

touch src/test/java/org/example/ScyllaDBExampleTest.java

Step 2: Launch ScyllaDB in a Container

Once the dependencies are installed and the ScyllaDBExampleTest.java file is created, you can copy and paste the code provided below to the ScyllaDBExampleTest.java file. This code will start a fresh ScyllaDB instance for every test in this file in the setUp method. To ensure the instance gets shut down after every test, we’ve created the tearDown method, too.

Step 3: Connect via the Java Driver

You’ll connect to the ScyllaDB container by creating a new session. To do so, you’ll need to update your setUp method in the ScyllaDBExampleTest.java file:

Step 4: Define Your Schema

Now that you have the code to run ScyllaDB and connect to it, you can use the connection to create the schema for the database. Let’s define the schema by updating your setUp method in the ScyllaDBExampleTest.java file:

Step 5: Insert and Query Data

Once you have prepared the ScyllaDB instance, you can run operations on it. To do so, let’s add a new method to our ScyllaDBExampleTest class in the ScyllaDBExampleTest.java file:

Step 6: Run and Validate the Test

Your test is now complete and ready to be executed! Use the following command to execute the test:

./gradlew clean test --no-daemon

If the execution is successful, you’ll notice the container starting in the logs, and the test will pass if the assertions are met.
Here’s an example of what a successful terminal output might look like:

12:05:26.708 [Test worker] DEBUG com.github.dockerjava.zerodep.shaded.org.apache.hc.client5.http.impl.io.PoolingHttpClientConnectionManager -- ep-00000012: connection released [route: {}->unix://localhost:2375][total available: 1; route allocated: 1 of 2147483647; total allocated: 1 of 2147483647]
12:05:26.708 [Test worker] DEBUG org.testcontainers.utility.ResourceReaper -- Removed container and associated volume(s): scylladb/scylla:2025.1

ScyllaDBExampleTest > testScyllaDBOperations() PASSED

BUILD SUCCESSFUL in 35s
3 actionable tasks: 3 executed

Full code example

The repository for the full code example can be found here: https://github.com/scylladb/scylla-code-samples/tree/master/java-testcontainers

Level Up: Extending Your ScyllaDB Tests

That’s just the basics. Here are some additional uses you might want to explore on your own:

  • Test schema migrations – Verify that your database evolution scripts work correctly
  • Simulate multi-node clusters – Use multiple containers to test your application with multi-node and multi-dc scenarios
  • Benchmark performance – Measure your application’s throughput under various workloads
  • Test failure scenarios – Simulate how your application handles network partitions or node failures

Wrap-Up: Master ScyllaDB Testing with Confidence

You’ve built a fast, real ScyllaDB test in Java that provides realistic database behavior without the overhead of a permanent installation. This approach should give you confidence that your code will work correctly in production. You can try it with an example app on ScyllaDB University, customize it to your project and specific needs, and share your experience with the community!

Resources: Dive Deeper

  • ScyllaDB Documentation
  • ScyllaDB with Docker
  • Best Practices for Running ScyllaDB on Docker
  • Testcontainers GitHub Repository with Examples
  • ScyllaDB University – Free courses to master ScyllaDB

Why Teams Are Ditching DynamoDB

Teams sometimes need lower latency, lower costs (especially as they scale) or the ability to run their applications somewhere other than AWS

It’s easy to understand why so many teams have turned to Amazon DynamoDB since its introduction in 2012. It’s simple to get started, especially if your organization is already entrenched in the AWS ecosystem. It’s relatively fast and scalable, with a low learning curve. And since it’s fully managed, it abstracts away the operational effort and know-how traditionally required to keep a database up and running in a healthy state.

But as time goes on, drawbacks emerge, especially as workloads scale and business requirements evolve. Teams sometimes need lower latency, lower costs (especially as they scale), or the ability to run their applications somewhere other than AWS. In those cases, ScyllaDB, which offers a DynamoDB-compatible API, is often selected as an alternative. Let’s explore the challenges that drove three teams to leave DynamoDB.

Multi-Cloud Flexibility and Cost Savings

Yieldmo is an online advertising platform that connects publishers and advertisers in real-time using an auction-based system, optimized with ML. Their business relies on delivering ads quickly (within 200-300 milliseconds) and efficiently, which requires ultra-fast, high-throughput database lookups at scale. Database delays directly translate to lost business.

They initially built the platform on DynamoDB. However, while DynamoDB had been reliable, significant limitations emerged as they grew. As Todd Coleman, Technical Co-Founder and Chief Architect, explained, their primary concerns were twofold: escalating costs and geographic restrictions. The database was becoming increasingly expensive as they scaled, and it locked them into AWS, preventing true multi-cloud flexibility.
While exploring DynamoDB alternatives, they were hoping to find an option that would maintain speed, scalability, and reliability while reducing costs and providing cloud vendor independence. Yieldmo first considered staying with DynamoDB and adding a caching layer. However, caching couldn’t fix the geographic latency issue. Cache misses would be too slow, making this approach impractical.

They also explored Aerospike, which offered speed and cross-cloud support. However, Aerospike’s in-memory indexing would have required a prohibitively large and expensive cluster to handle Yieldmo’s large number of small data objects. Additionally, migrating to Aerospike would have required extensive and time-consuming code changes.

Then they discovered ScyllaDB. And ScyllaDB’s DynamoDB-compatible API (Alternator) was a game changer. Todd explained, “ScyllaDB supported cross cloud deployments, required a manageable number of servers, and offered competitive costs. Best of all, its API was DynamoDB compatible, meaning we could migrate with minimal code changes. In fact, a single engineer implemented the necessary modifications in just a few days.”

The migration process was carefully planned, leveraging their existing Kafka message queue architecture to ensure data integrity. They conducted two proof-of-concept (POC) tests: first with a single table of 28 billion objects, and then across all five AWS regions. The results were impressive. Todd shared, “Our database costs were cut in half, even with DynamoDB reserved capacity pricing.”

And beyond cost savings, Yieldmo gained the flexibility to potentially deploy across different cloud providers. Their latency improved, and ScyllaDB was as simple to operate as DynamoDB.

Wrapping up, Todd concluded: “One of our initial concerns was moving away from DynamoDB’s proven reliability. However, ScyllaDB has been an excellent partner. Their team provides monitoring of our clusters, alerts us to potential issues, and advises us when scaling is needed in terms of ongoing maintenance overhead. The experience has been comparable to DynamoDB, but with greater independence and substantial cost savings.”

Hear from Yieldmo

Migrating to GCP with Better Performance and Lower Costs

Digital Turbine, a major player in mobile ad tech with $500 million in annual revenue, faced growing challenges with its DynamoDB implementation. While its primary motivation for migration was standardizing on Google Cloud Platform following acquisitions, the existing DynamoDB solution had been causing both performance and cost concerns at scale.

“It can be a little expensive as you scale, to be honest,” explained Joseph Shorter, vice president of Platform Architecture at Digital Turbine. “We were finding some performance issues. We were doing a ton of reads – 90% of all interactions with DynamoDB were read operations. With all those operations, we found that the performance hits required us to scale up more than we wanted, which increased costs.”

Digital Turbine needed the migration to be as fast and low-risk as possible, which meant keeping application refactoring to a minimum. The main concern, according to Shorter, was “How can we migrate without radically refactoring our platform, while maintaining at least the same performance and value – and avoiding a crash-and-burn situation? Because if it failed, it would take down our whole company.”

After evaluating several options, Digital Turbine moved to ScyllaDB and achieved immediate improvements. The migration took less than a sprint to implement and the results exceeded expectations. “A 20% cost difference – that’s a big number, no matter what you’re talking about,” Shorter noted. “And when you consider our plans to scale even further, it becomes even more significant.”

Beyond the cost savings, they found themselves “barely tapping the ScyllaDB clusters,” suggesting room for even more growth without proportional cost increases.

Hear from Digital Turbine

High Write Throughput with Low Latency and Lower Costs

The User State and Customizations team for one of the world’s largest media streaming services had been using DynamoDB for several years. As they were rearchitecting two existing use cases, they wondered if it was time for a database change. The two use cases were:

  • Pause/resume: If a user is watching a show and pauses it, they can pick up where they left off – on any device, from any location.
  • Watch state: Using that same data, determine whether the user has watched the show.

[Architecture diagram]

Every 30 seconds, the client sends heartbeats with the updated playhead position of the show and then sends those events to the database. The Edge Pipeline loads events in the same region as the user, while the Authority (Auth) Pipeline combines events for all five regions that the company serves. Finally, the data has to be fetched and served back to the client to support playback. Note that the team wanted to preserve separation between the Auth and Edge regions, so they weren’t looking for any database-specific replication between them.

The two main technical requirements for supporting this architecture were:

  • To ensure a great user experience, the system had to remain highly available, with low-latency reads and the ability to scale based on traffic surges.
  • To avoid extensive infrastructure setup or DBA work, they needed easy integration with their AWS services.

Once those boxes were checked, the team also hoped to reduce overall cost.
“Our existing infrastructure had data spread across various clusters of DynamoDB and ElastiCache, so we really wanted something simple that could combine these into a much lower cost system,” explained their backend engineer.

Specifically, they needed a database with:

  • Multiregion support, since the service was popular across five major geographic regions.
  • The ability to handle over 170K writes per second. Updates didn’t have a strict service-level agreement (SLA), but the system needed to perform conditional updates based on event timestamps.
  • The ability to handle over 78K reads per second with a P99 latency of 10 to 20 milliseconds. The use case involved only simple point queries; things like indexes, partitioning and complicated query patterns weren’t a primary concern.
  • Around 10TB of data with room for growth.

Why move from DynamoDB? According to their backend engineer, “DynamoDB could support our technical requirements perfectly. But given our data size and high (write-heavy) throughput, continuing with DynamoDB would have been like shoveling money into the fire.”

Based on their requirements for write performance and cost, they decided to explore ScyllaDB. For a proof of concept, they set up a ScyllaDB Cloud test cluster with six AWS i4i 4xlarge nodes and preloaded the cluster with 3 billion records. They ran combined loads of 170K writes per second and 78K reads per second. And the results? “We hit the combined load with zero errors. Our P99 read latency was 9 ms and the write latency was less than 1 ms.”

These low latencies, paired with significant cost savings (over 50%), convinced them to leave DynamoDB. Beyond lower latencies at lower cost, the team also appreciated the following aspects of ScyllaDB:

  • ScyllaDB’s performance-focused design (being built on the Seastar framework, using C++, being NUMA-aware, offering shard-aware drivers, etc.) helps the team reduce maintenance time and costs.
  • Incremental Compaction Strategy helps them significantly reduce write amplification.
  • Flexible consistency levels and replication factors help them support separate Auth and Edge pipelines. For example, Auth uses quorum consistency while Edge uses a consistency level of “1” due to the data duplication and high throughput.

Their backend engineer concluded: “Choosing a database is hard. You need to consider not only features, but also costs. Serverless is not a silver bullet, especially in the database domain. In our case, due to the high throughput and latency requirements, DynamoDB serverless was not a great option. Also, don’t underestimate the role of hardware. Better utilizing the hardware is key to reducing costs while improving performance.”

Learn More

Is Your Team Next?

If your team is considering a move from DynamoDB, ScyllaDB might be an option to explore. Sign up for a technical consultation to talk more about your use case, SLAs, technical requirements and what you’re hoping to optimize. We’ll let you know if ScyllaDB is a good fit and, if so, what a migration might involve in terms of application changes, data modeling, infrastructure and so on.

Bonus: Here’s a quick look at how ScyllaDB compares to DynamoDB

Cassandra Compaction Throughput Performance Explained

This is the second post in my series on improving node density and lowering costs with Apache Cassandra. In the previous post, I examined how streaming performance impacts node density and operational costs. In this post, I’ll focus on compaction throughput, and a recent optimization in Cassandra 5.0.4 that significantly improves it, CASSANDRA-15452.

This post assumes some familiarity with Apache Cassandra storage engine fundamentals. The documentation has a nice section covering the storage engine if you’d like to brush up before reading this post.

CEP-24 Behind the scenes: Developing Apache Cassandra®’s password validator and generator

Introduction: The need for an Apache Cassandra® password validator and generator

Here’s the problem: while users have always had the ability to create whatever password they wanted in Cassandra–from straightforward to incredibly complex and everything in between–this ultimately created a noticeable security vulnerability.

While organizations might have internal processes for generating secure passwords that adhere to their own security policies, Cassandra itself did not have the means to enforce these standards. To make the security vulnerability worse, if a password initially met internal security guidelines, users could later downgrade their password to a less secure option simply by using “ALTER ROLE” statements.

When internal password requirements are enforced, individual users face the additional burden of creating compliant passwords. This inevitably involves lots of trial and error in attempting to create a password that satisfies complex security rules.

But what if there was a way to have Cassandra automatically create passwords that meet all bespoke security requirements–but without requiring manual effort from users or system operators?

That’s why we developed CEP-24: Password validation/generation. We recognized that the complexity of secure password management could be significantly reduced (or eliminated entirely) with the right approach, improving both security and user experience at the same time.

The Goals of CEP-24

A Cassandra Enhancement Proposal (or CEP) is a structured process for proposing, creating, and ultimately implementing new features for the Cassandra project. All CEPs are thoroughly vetted among the Cassandra community before they are officially integrated into the project.

These were the key goals we established for CEP-24:

  • Introduce a way to enforce password strength upon role creation or role alteration.
  • Implement a reference implementation of a password validator which adheres to a recommended password strength policy, to be used for Cassandra users out of the box.
  • Emit a warning (and proceed) or just reject “create role” and “alter role” statements when the provided password does not meet a certain security level, based on user configuration of Cassandra.
  • Allow a custom password validator with its own policy, whatever it might be, to be implemented, and provide a modular/pluggable mechanism to do so.
  • Provide a way for Cassandra to generate a password which would pass the subsequent validation for use by the user.

The Cassandra Password Validator and Generator builds upon an established framework in Cassandra called Guardrails, which was originally implemented under CEP-3 (more details here).

The password validator implements a custom guardrail introduced as part of CEP-24. A custom guardrail can validate and generate values of arbitrary types when properly implemented. In the CEP-24 context, the password guardrail provides CassandraPasswordValidator by extending ValueValidator, while passwords are generated by CassandraPasswordGenerator by extending ValueGenerator. Both components work with passwords as String type values.

Password validation and generation are configured in the cassandra.yaml file under the password_validator section. Let’s explore the key configuration properties available. First, the class_name and generator_class_name parameters specify which validator and generator classes will be used to validate and generate passwords respectively.

Cassandra ships CassandraPasswordValidator and CassandraPasswordGenerator out of the box. However, if a particular enterprise decides that it needs something very custom, it is free to implement its own validator, put it on Cassandra’s class path, and reference it in the configuration via the class_name parameter. The same applies to the generator and the generator_class_name parameter.

CEP-24 provides implementations of the validator and generator that the Cassandra team believes will satisfy the requirements of most users. These default implementations address common password security needs. However, the framework is designed with flexibility in mind, allowing organizations to implement custom validation and generation rules that align with their specific security policies and business requirements.

password_validator: 
 # Implementation class of a validator. When not in form of FQCN, the 
 # package name org.apache.cassandra.db.guardrails.validators is prepended. 
 # By default, there is no validator. 
 class_name: CassandraPasswordValidator 
 # Implementation class of related generator which generates values which are valid when 
 # tested against this validator. When not in form of FQCN, the 
 # package name org.apache.cassandra.db.guardrails.generators is prepended. 
 # By default, there is no generator. 
 generator_class_name: CassandraPasswordGenerator
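The comments in this configuration say that a name which is not a fully qualified class name (FQCN) gets a default package prepended. A minimal sketch of that rule follows; the helper class and the "contains a dot" test for recognizing an FQCN are assumptions made for illustration, not Cassandra's actual code, and com.example.MyPasswordValidator is a hypothetical custom class.

```java
// Illustrative sketch only; not taken from the Cassandra code base.
final class ClassNameResolution {

    // Default packages named in the configuration comments above.
    static final String VALIDATOR_PACKAGE = "org.apache.cassandra.db.guardrails.validators";
    static final String GENERATOR_PACKAGE = "org.apache.cassandra.db.guardrails.generators";

    /** Assumption: a configured name containing a dot is already fully qualified. */
    static String resolve(String configuredName, String defaultPackage) {
        return configuredName.contains(".")
                ? configuredName
                : defaultPackage + "." + configuredName;
    }

    public static void main(String[] args) {
        System.out.println(resolve("CassandraPasswordValidator", VALIDATOR_PACKAGE));
        System.out.println(resolve("CassandraPasswordGenerator", GENERATOR_PACKAGE));
        // A hypothetical custom validator referenced by its full name is left unchanged.
        System.out.println(resolve("com.example.MyPasswordValidator", VALIDATOR_PACKAGE));
    }
}
```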

Password quality can be viewed as the number of characteristics a password satisfies. Every password is evaluated at two levels: a warning level and a failure level. These levels fit nicely with how Guardrails work, since every guardrail has warning and failure thresholds. Depending on the value a specific guardrail evaluates, it will either emit a warning to the user that the value is discouraged (but ultimately allowed) or reject the value altogether.

This same principle applies to password evaluation – each password is assessed against both warning and failure thresholds. These thresholds are determined by counting the characteristics present in the password. The system evaluates the password’s overall length plus four character-based characteristics: the number of uppercase characters, the number of lowercase characters, the number of special characters, and the number of digits. A comprehensive password security policy can be enforced by configuring minimum requirements for each of these.

# There are four characteristics:
# upper-case, lower-case, special character and digit.
# If this value is set e.g. to 3, a password has to
# consist of 3 out of 4 characteristics.
# For example, it has to contain at least 2 upper-case characters,
# 2 lower-case, and 2 digits to pass,
# but it does not have to contain any special characters.
# If the number of characteristics found in the password is
# less than or equal to this number, it will emit a warning.
characteristic_warn: 3
# If the number of characteristics found in the password is
# less than or equal to this number, it will emit a failure.
characteristic_fail: 2
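To make the counting rule concrete, here is a minimal sketch of how the number of satisfied characteristics could be compared with these thresholds. It is not the code of CassandraPasswordValidator. It follows the "less than or equal to" wording of the comments, treats every character that is not a letter or digit as special (an assumption), and uses a single per-class minimum as a parameter, whereas the post configures the minimum separately per class (2 at warning level and 1 at failure level in its example).

```java
// Illustrative sketch only; this is not the code of CassandraPasswordValidator.
final class CharacteristicCountSketch {

    /** How many of the four classes (upper, lower, digit, special) occur at least minPerClass times. */
    static int satisfiedCharacteristics(String password, int minPerClass) {
        int upper = 0, lower = 0, digit = 0, special = 0;
        for (char c : password.toCharArray()) {
            if (Character.isUpperCase(c)) upper++;
            else if (Character.isLowerCase(c)) lower++;
            else if (Character.isDigit(c)) digit++;
            else special++; // assumption: any other character counts as special
        }
        int satisfied = 0;
        if (upper >= minPerClass) satisfied++;
        if (lower >= minPerClass) satisfied++;
        if (digit >= minPerClass) satisfied++;
        if (special >= minPerClass) satisfied++;
        return satisfied;
    }

    /** Applies the "less than or equal to this number" rule from the configuration comments. */
    static boolean violates(String password, int minPerClass, int threshold) {
        return satisfiedCharacteristics(password, minPerClass) <= threshold;
    }

    public static void main(String[] args) {
        String password = "mYAtt3mp";
        // Warning level: 2 per class, characteristic_warn = 3
        System.out.println("warning level violated: " + violates(password, 2, 3)); // true
        // Failure level: 1 per class, characteristic_fail = 2
        System.out.println("failure level violated: " + violates(password, 1, 2)); // false
    }
}
```

For 'mYAtt3mp', two classes reach the warning-level minimum and three reach the failure-level minimum, so under this reading the password would be accepted with a warning.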

Next, there are configuration parameters for each characteristic which count towards warning or failure:

# If the password is shorter than this value, 
# the validator will emit a warning. 
length_warn: 12 
# If a password is shorter than this value, 
# the validator will emit a failure. 
length_fail: 8 
# If a password does not contain at least n 
# upper-case characters, the validator will emit a warning. 
upper_case_warn: 2 
# If a password does not contain at least 
# n upper-case characters, the validator will emit a failure. 
upper_case_fail: 1 
# If a password does not contain at least 
# n lower-case characters, the validator will emit a warning. 
lower_case_warn: 2 
# If a password does not contain at least 
# n lower-case characters, the validator will emit a failure. 
lower_case_fail: 1 
# If a password does not contain at least 
# n digits, the validator will emit a warning. 
digit_warn: 2 
# If a password does not contain at least 
# n digits, the validator will emit a failure. 
digit_fail: 1 
# If a password does not contain at least 
# n special characters, the validator will emit a warning. 
special_warn: 2 
# If a password does not contain at least 
# n special characters, the validator will emit a failure. 
special_fail: 1
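The per-characteristic settings can be read as a checklist applied once with the *_warn values and once with the *_fail values. The sketch below lists which individual rules a password misses for a given set of minimums. It is a simplified illustration rather than Cassandra's implementation: the message wording is made up, and any character that is not a letter or digit is assumed to be special.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only; message wording and the notion of "special" are assumptions.
final class RuleCheckSketch {

    /** Lists the individual rules that the password misses for the given minimums. */
    static List<String> unmetRules(String password, int minLength, int minUpper,
                                   int minLower, int minDigit, int minSpecial) {
        int upper = 0, lower = 0, digit = 0, special = 0;
        for (char c : password.toCharArray()) {
            if (Character.isUpperCase(c)) upper++;
            else if (Character.isLowerCase(c)) lower++;
            else if (Character.isDigit(c)) digit++;
            else special++;
        }
        List<String> unmet = new ArrayList<>();
        if (password.length() < minLength) unmet.add("length < " + minLength);
        if (upper < minUpper) unmet.add("upper-case < " + minUpper);
        if (lower < minLower) unmet.add("lower-case < " + minLower);
        if (digit < minDigit) unmet.add("digits < " + minDigit);
        if (special < minSpecial) unmet.add("special < " + minSpecial);
        return unmet;
    }

    public static void main(String[] args) {
        // *_warn values from the configuration above
        System.out.println(unmetRules("mYAtt3mp", 12, 2, 2, 2, 2)); // [length < 12, digits < 2, special < 2]
        // *_fail values from the configuration above
        System.out.println(unmetRules("mYAtt3mp", 8, 1, 1, 1, 1));  // [special < 1]
    }
}
```

Note that an unmet character rule does not by itself decide the outcome: the number of satisfied characteristics is still compared with characteristic_warn and characteristic_fail.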

It is also possible to forbid illegal sequences of a certain length found in a password:

# If a password contains illegal sequences that are at least this long, it is invalid. 
# Illegal sequences might be either alphabetical (form 'abcde'), 
# numerical (form '34567'), or US qwerty (form 'asdfg') as well 
# as sequences from supported character sets. 
# The minimum value for this property is 3, 
# by default it is set to 5. 
illegal_sequence_length: 5
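One way to picture the illegal-sequence check is as a scan for runs of characters that are neighbours in a reference sequence. The sketch below is only an approximation of that idea: the reference sequences (in particular the keyboard rows) are assumptions, it checks ascending runs only, and it ignores case. The real validator may handle more cases, such as reversed sequences or other character sets.

```java
import java.util.Locale;

// Illustrative sketch only; the reference sequences below are assumptions.
final class IllegalSequenceSketch {

    private static final String[] SEQUENCES = {
        "abcdefghijklmnopqrstuvwxyz",         // alphabetical
        "0123456789",                         // numerical
        "qwertyuiop", "asdfghjkl", "zxcvbnm"  // US qwerty keyboard rows
    };

    /** True if the password contains a run of at least minLength neighbouring characters. */
    static boolean containsIllegalSequence(String password, int minLength) {
        if (minLength < 3) {
            throw new IllegalArgumentException("minimum sequence length is 3");
        }
        String p = password.toLowerCase(Locale.ROOT);
        for (String sequence : SEQUENCES) {
            int run = 1;
            for (int i = 1; i < p.length(); i++) {
                int previous = sequence.indexOf(p.charAt(i - 1));
                int current = sequence.indexOf(p.charAt(i));
                // Extend the run only when both characters are adjacent in this sequence.
                run = (previous >= 0 && current == previous + 1) ? run + 1 : 1;
                if (run >= minLength) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(containsIllegalSequence("xxabcdeyy", 5));    // true
        System.out.println(containsIllegalSequence("R7tb33?.mcAX", 5)); // false
    }
}
```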

Lastly, it is also possible to configure a dictionary of passwords to check against, which guards against password dictionary attacks. It is up to the operator of a cluster to configure the password dictionary:

# Dictionary to check the passwords against. Defaults to no dictionary. 
# Whole dictionary is cached into memory. Use with caution with relatively big dictionaries. 
# Entries in a dictionary, one per line, have to be sorted per String's compareTo contract. 
dictionary: /path/to/dictionary/file
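The requirement that entries be sorted per String's compareTo contract suggests a lookup that relies on ordering. A plausible reading, and only an assumption here, is that the in-memory word list is searched with binary search. The sketch below shows that idea with a hypothetical three-word dictionary.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative sketch only; assumes the sorted dictionary is searched with binary search.
final class DictionarySketch {

    private final String[] sortedWords;

    DictionarySketch(List<String> words) {
        sortedWords = words.toArray(new String[0]);
        // Binary search is only correct on sorted input, so reject unsorted dictionaries.
        for (int i = 1; i < sortedWords.length; i++) {
            if (sortedWords[i - 1].compareTo(sortedWords[i]) > 0) {
                throw new IllegalArgumentException("dictionary is not sorted at entry " + i);
            }
        }
    }

    /** True if the exact password appears in the dictionary. */
    boolean contains(String password) {
        return Arrays.binarySearch(sortedWords, password) >= 0;
    }

    public static void main(String[] args) {
        // Hypothetical dictionary content, one entry per line in the real file.
        DictionarySketch dictionary =
                new DictionarySketch(List.of("cassandra", "cassandraisadatabase", "password"));
        System.out.println(dictionary.contains("cassandraisadatabase")); // true
        System.out.println(dictionary.contains("T8aum3?"));              // false
    }
}
```

This sketch matches whole passwords only. Whether the real validator also matches dictionary words embedded in a longer password is not covered here.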

Now that we have gone over all the configuration parameters, let’s take a look at an example of how password validation and generation look in practice.

Consider a scenario where a Cassandra super-user (such as the default ‘cassandra’ role) attempts to create a new role named ‘alice’.

cassandra@cqlsh> CREATE ROLE alice WITH PASSWORD = 'cassandraisadatabase' AND LOGIN = true; 

InvalidRequest: Error from server: code=2200 [Invalid query] 
message="Password was not set as it violated configured password strength 
policy. To fix this error, the following has to be resolved: Password 
contains the dictionary word 'cassandraisadatabase'. You may also use 
'GENERATED PASSWORD' upon role creation or alteration."

The password was rejected because it is a dictionary word. When an operator sees this, they will try to fix it by choosing a password that is not in the dictionary:

cassandra@cqlsh> CREATE ROLE alice WITH PASSWORD = 'T8aum3?' AND LOGIN = true; 
InvalidRequest: Error from server: code=2200 [Invalid query] 
message="Password was not set as it violated configured password strength 
policy. To fix this error, the following has to be resolved: Password 
must be 8 or more characters in length. You may also use 
'GENERATED PASSWORD' upon role creation or alteration."

This password is not in the dictionary, but at 7 characters it is shorter than the configured length_fail of 8, so it is rejected as well. The operator tries again with a longer password:

cassandra@cqlsh> CREATE ROLE alice WITH PASSWORD = 'mYAtt3mp' AND LOGIN = true; 

Warnings: 

Guardrail password violated: Password was set, however it might not be 
strong enough according to the configured password strength policy. 
To fix this warning, the following has to be resolved: Password must be 12 or more 
characters in length. Passwords must contain 2 or more digit characters. Password 
must contain 2 or more special characters. Password matches 2 of 4 character rules, 
but 4 are required. You may also use 'GENERATED PASSWORD' upon role creation or alteration.

The password is finally set, but it is not completely secure. It satisfies the minimum requirements but our validator identified that not all characteristics were met. 

Seeing this warning, the operator notices the note about the ‘GENERATED PASSWORD’ clause, which generates a password automatically so that the operator does not need to invent one. As the attempts above show, inventing a compliant password by hand is often cumbersome and is better left to a machine, which is also more efficient and reliable.

cassandra@cqlsh> ALTER ROLE alice WITH GENERATED PASSWORD; 

generated_password 
------------------ 
   R7tb33?.mcAX

The generated password shown above automatically satisfies all the rules configured in cassandra.yaml, as does every generated password. This is a clear advantage over inventing passwords manually.
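For intuition about how a generator can guarantee compliance, here is a minimal sketch that first picks the required number of characters from each class, fills the remaining positions at random, and then shuffles. It is not CassandraPasswordGenerator: the special-character set is an assumption, and the sketch only guarantees length and per-class counts. A complete generator would also need to avoid illegal sequences and dictionary words, for example by validating the candidate and retrying.

```java
import java.security.SecureRandom;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Illustrative sketch only; not the code of CassandraPasswordGenerator.
final class PasswordGeneratorSketch {

    private static final String UPPER = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
    private static final String LOWER = "abcdefghijklmnopqrstuvwxyz";
    private static final String DIGITS = "0123456789";
    private static final String SPECIAL = "!?.,;:-_#%&*+="; // assumed set of special characters
    private static final SecureRandom RANDOM = new SecureRandom();

    /** Generates a password of the given length with at least minPerClass characters of each class. */
    static String generate(int length, int minPerClass) {
        if (minPerClass < 0 || length < 4 * minPerClass) {
            throw new IllegalArgumentException("length cannot hold the required characters");
        }
        List<Character> chars = new ArrayList<>();
        // Guarantee the minimum count for every class first.
        for (String characterClass : new String[] {UPPER, LOWER, DIGITS, SPECIAL}) {
            for (int i = 0; i < minPerClass; i++) {
                chars.add(pick(characterClass));
            }
        }
        // Fill the remaining positions from all classes combined.
        String all = UPPER + LOWER + DIGITS + SPECIAL;
        while (chars.size() < length) {
            chars.add(pick(all));
        }
        // Shuffle so the guaranteed characters do not sit in predictable positions.
        Collections.shuffle(chars, RANDOM);
        StringBuilder password = new StringBuilder(length);
        for (char c : chars) {
            password.append(c);
        }
        return password.toString();
    }

    private static char pick(String from) {
        return from.charAt(RANDOM.nextInt(from.length()));
    }

    public static void main(String[] args) {
        // 12 characters and 2 per class match the warning-level settings in the post.
        System.out.println(generate(12, 2));
    }
}
```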

When the CQL statement is executed, it will be visible in the CQLSH history (HISTORY command or in cqlsh_history file) but the password will not be logged, hence it cannot leak. It will also not appear in any auditing logs. Previously, Cassandra had to obfuscate such statements. This is not necessary anymore.

We can create a role with a generated password like this:

cassandra@cqlsh> CREATE ROLE alice WITH GENERATED PASSWORD AND LOGIN = true; 

or by CREATE USER: 

cassandra@cqlsh> CREATE USER alice WITH GENERATED PASSWORD;

When a password is generated for alice (out of scope of this documentation), she can log in:

$ cqlsh -u alice -p R7tb33?.mcAX 
... 
alice@cqlsh>

Note: It is recommended to save the password to ~/.cassandra/credentials, for example:

[PlainTextAuthProvider] 
username = cassandra
password = R7tb33?.mcAX

and to set auth_provider in ~/.cassandra/cqlshrc:

[auth_provider] 
module = cassandra.auth 
classname = PlainTextAuthProvider

It is also possible to configure password validators in such a way that a user does not see why a password failed. This is driven by the detailed_messages configuration property of password_validator. When it is set to false, the violation messages are very brief:

alice@cqlsh> ALTER ROLE alice WITH PASSWORD = 'myattempt'; 

InvalidRequest: Error from server: code=2200 [Invalid query] 
message="Password was not set as it violated configured password strength policy. 
You may also use 'GENERATED PASSWORD' upon role creation or alteration."

The following command will automatically generate a new password that meets all configured security requirements.

alice@cqlsh> ALTER ROLE alice WITH GENERATED PASSWORD;

Several potential enhancements to password generation and validation could be implemented in future releases. One promising extension would be validating new passwords against previous values. This would prevent users from reusing passwords until after they’ve created a specified number of different passwords. A related enhancement could include restricting how frequently users can change their passwords, preventing rapid cycling through passwords to circumvent history-based restrictions.

These features, while valuable for comprehensive password security, were considered beyond the scope of the initial implementation and may be addressed in future updates.

Final thoughts and next steps

The Cassandra Password Validator and Generator implemented under CEP-24 represents a significant improvement in Cassandra’s security posture.

By providing robust, configurable password policies with built-in enforcement mechanisms and convenient password generation capabilities, organizations can now ensure compliance with their security standards directly at the database level. This not only strengthens overall system security but also improves the user experience by eliminating guesswork around password requirements.

As Cassandra continues to evolve as an enterprise-ready database solution, these security enhancements demonstrate a commitment to meeting the demanding security requirements of modern applications while maintaining the flexibility that makes Cassandra so powerful.

Ready to experience CEP-24 yourself? Try it out on the Instaclustr Managed Platform and spin up your first Cassandra cluster for free.

CEP-24 is just our latest contribution to open source. Check out everything else we’re working on here.

The post CEP-24 Behind the scenes: Developing Apache Cassandra®’s password validator and generator appeared first on Instaclustr.

Introduction to similarity search: Part 2–Simplifying with Apache Cassandra® 5’s new vector data type

In Part 1 of this series, we explored how you can combine Cassandra 4 and OpenSearch to perform similarity searches with word embeddings. While that approach is powerful, it requires managing two different systems.

But with the release of Cassandra 5, things become much simpler.

Cassandra 5 introduces a native VECTOR data type and built-in Vector Search capabilities, simplifying the architecture by enabling Cassandra 5 to handle storage, indexing, and querying seamlessly within a single system.

Now in Part 2, we’ll dive into how Cassandra 5 streamlines the process of working with word embeddings for similarity search. We’ll walk through how the new vector data type works, how to store and query embeddings, and how the Storage-Attached Indexing (SAI) feature enhances your ability to efficiently search through large datasets.

The power of vector search in Cassandra 5

Vector search is a game-changing feature added in Cassandra 5 that enables you to perform similarity searches directly within the database. This is especially useful for AI applications, where embeddings are used to represent data like text or images as high-dimensional vectors. The goal of vector search is to find the closest matches to these vectors, which is critical for tasks like product recommendations or image recognition.

The key to this functionality lies in embeddings: arrays of floating-point numbers that represent objects in such a way that similar objects have vectors close to one another. By storing these embeddings as vectors in Cassandra, you can use Vector Search to find connections in your data that may not be obvious through traditional queries.

How vectors work

Vectors are fixed-size sequences of non-null values, much like lists. However, in Cassandra 5, you cannot modify individual elements of a vector — you must replace the entire vector if you need to update it. This makes vectors ideal for storing embeddings, where you need to work with the whole data structure at once.

When working with embeddings, you’ll typically store them as vectors of floating-point numbers to represent the semantic meaning.

Storage-Attached Indexing (SAI): The engine behind vector search

Vector Search in Cassandra 5 is powered by Storage-Attached Indexing, which enables high-performance indexing and querying of vector data. SAI is essential for Vector Search, providing the ability to create column-level indexes on vector data types. This ensures that your vector queries are both fast and scalable, even with large datasets.

SAI isn’t just limited to vectors—it also indexes other types of data, making it a versatile tool for boosting the performance of your queries across the board.

Example: Performing similarity search with Cassandra 5’s vector data type

Now that we’ve introduced the new vector data type and the power of Vector Search in Cassandra 5, let’s dive into a practical example. In this section, we’ll show how to set up a table to store embeddings, insert data, and perform similarity searches directly within Cassandra.

Step 1: Setting up the embeddings table

To get started with this example, you’ll need access to a Cassandra 5 cluster. Cassandra 5 introduces native support for vector data types and Vector Search, available on Instaclustr’s managed platform. Once you have your cluster up and running, the first step is to create a table to store the embeddings. We’ll also create an index on the vector column to optimize similarity searches using SAI.

CREATE KEYSPACE aisearch WITH REPLICATION = {'class': 'SimpleStrategy', 'replication_factor': 1};

USE aisearch;

CREATE TABLE IF NOT EXISTS embeddings ( 
    id UUID, 
    paragraph_uuid UUID, 
    filename TEXT, 
    embeddings vector<float, 300>, 
    text TEXT, 
    last_updated timestamp, 
    PRIMARY KEY (id, paragraph_uuid) 
); 
 

CREATE INDEX IF NOT EXISTS ann_index 
  ON embeddings(embeddings) USING 'sai';

This setup allows us to store the embeddings as 300-dimensional vectors, along with metadata like file names and text. The SAI index will be used to speed up similarity searches on the embeddings column.

You can also fine-tune the index by specifying the similarity function to be used for vector comparisons. Cassandra 5 supports three types of similarity functions: DOT_PRODUCT, COSINE, and EUCLIDEAN. By default, the similarity function is set to COSINE, but you can specify your preferred method when creating the index:

CREATE INDEX IF NOT EXISTS ann_index 
    ON embeddings(embeddings) USING 'sai' 
WITH OPTIONS = { 'similarity_function': 'DOT_PRODUCT' };

Each similarity function has its own advantages depending on your use case. DOT_PRODUCT is often used when you need to measure the direction and magnitude of vectors, COSINE is ideal for comparing the angle between vectors, and EUCLIDEAN calculates the straight-line distance between vectors. By selecting the appropriate function, you can optimize your search results to better match the needs of your application.
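To see how the three measures differ, the following sketch computes their textbook definitions with NumPy on tiny two-dimensional vectors. The helper names are hypothetical, and the scores returned by Cassandra's own similarity functions may be rescaled differently, so treat this as an illustration of the underlying math rather than of Cassandra's output.

```python
import numpy as np


def dot_product(a, b):
    """Sensitive to both direction and magnitude."""
    return float(np.dot(a, b))


def cosine(a, b):
    """Depends only on the angle between the vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def euclidean(a, b):
    """Straight-line distance between the two points."""
    return float(np.linalg.norm(a - b))


a = np.array([1.0, 0.0])
b = np.array([2.0, 0.0])  # same direction as a, twice as long
c = np.array([0.0, 1.0])  # perpendicular to a

for name, other in (("b", b), ("c", c)):
    print(name, dot_product(a, other), cosine(a, other), euclidean(a, other))
```

Here a and b point the same way, so their cosine similarity is 1 even though their lengths differ, while the dot product and Euclidean distance both reflect the difference in magnitude.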

Step 2: Inserting embeddings into Cassandra 5

To insert embeddings into Cassandra 5, we can use the same code from the first part of this series to extract text from files, load the FastText model, and generate the embeddings. Once the embeddings are generated, the following function will insert them into Cassandra:

import time
from uuid import uuid4, UUID
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement
from cassandra.policies import DCAwareRoundRobinPolicy
from cassandra.auth import PlainTextAuthProvider

# Connect to the cluster
cluster = Cluster(
    # Replace with your node IP list
    ["xxx.xxx.xxx.xxx", "xxx.xxx.xxx.xxx", "xxx.xxx.xxx.xxx"],
    load_balancing_policy=DCAwareRoundRobinPolicy(local_dc='AWS_VPC_US_EAST_1'),  # Update the local data centre if needed
    port=9042,
    auth_provider=PlainTextAuthProvider(
        username='iccassandra',
        password='replace_with_your_password'
    )
)
session = cluster.connect()

print('Connected to cluster %s' % cluster.metadata.cluster_name)

def insert_embedding_to_cassandra(session, embedding, id=None, paragraph_uuid=None, filename=None, text=None, keyspace_name=None):
    try:
        # Convert the embedding to a list of Python floats
        embeddings = list(map(float, embedding))

        # Generate UUIDs if not provided
        if id is None:
            id = uuid4()
        if paragraph_uuid is None:
            paragraph_uuid = uuid4()
        # Ensure id and paragraph_uuid are UUID objects
        if isinstance(id, str):
            id = UUID(id)
        if isinstance(paragraph_uuid, str):
            paragraph_uuid = UUID(paragraph_uuid)

        # Create the query string with placeholders
        insert_query = f"""
        INSERT INTO {keyspace_name}.embeddings (id, paragraph_uuid, filename, embeddings, text, last_updated)
        VALUES (?, ?, ?, ?, ?, toTimestamp(now()))
        """

        # Create a prepared statement with the query
        prepared = session.prepare(insert_query)

        # Execute the query
        session.execute(prepared.bind((id, paragraph_uuid, filename, embeddings, text)))

        return None  # Successful insertion

    except Exception as e:
        error_message = f"Failed to execute query:\nError: {str(e)}"
        return error_message  # Return error message on failure

def insert_with_retry(session, embedding, id=None, paragraph_uuid=None,
                      filename=None, text=None, keyspace_name=None, max_retries=3,
                      retry_delay_seconds=1):
    retry_count = 0
    while retry_count < max_retries:
        result = insert_embedding_to_cassandra(session, embedding, id, paragraph_uuid, filename, text, keyspace_name)
        if result is None:
            return True  # Successful insertion
        else:
            retry_count += 1
            print(f"Insertion failed on attempt {retry_count} with error: {result}")
            if retry_count < max_retries:
                time.sleep(retry_delay_seconds)  # Delay before the next retry
    return False  # Failed after max_retries

# Replace the file path pointing to the desired file
file_path = "/path/to/Cassandra-Best-Practices.pdf"
keyspace_name = "aisearch"  # Keyspace created in Step 1
# extract_text_with_page_number_and_embeddings and text_to_vector come from Part 1
paragraphs_with_embeddings = extract_text_with_page_number_and_embeddings(file_path, text_to_vector)

from tqdm import tqdm

for paragraph in tqdm(paragraphs_with_embeddings, desc="Inserting paragraphs"):
    if not insert_with_retry(
        session=session,
        embedding=paragraph['embedding'],
        id=paragraph['uuid'],
        paragraph_uuid=paragraph['paragraph_uuid'],
        text=paragraph['text'],
        filename=paragraph['filename'],
        keyspace_name=keyspace_name,
        max_retries=3,
        retry_delay_seconds=1
    ):
        # Display an error message if insertion fails
        tqdm.write(f"Insertion failed after maximum retries for UUID {paragraph['uuid']}: {paragraph['text'][:50]}...")

This function handles inserting embeddings and metadata into Cassandra, ensuring that UUIDs are correctly generated for each entry.

Step 3: Performing similarity searches in Cassandra 5

Once the embeddings are stored, we can perform similarity searches directly within Cassandra using the following function:

import numpy as np
# ------------------ Embedding Functions ------------------
def text_to_vector(text):
    """Convert a text chunk into a vector using the FastText model."""
    words = text.split()
    vectors = [fasttext_model[word] for word in words if word in fasttext_model.key_to_index]
    return np.mean(vectors, axis=0) if vectors else np.zeros(fasttext_model.vector_size)

def find_similar_texts_cassandra(session, input_text, keyspace_name=None, top_k=5):
    # Convert the input text to an embedding
    input_embedding = text_to_vector(input_text)
    input_embedding_str = ', '.join(map(str, input_embedding.tolist()))

    # The ANN ordering receives the embedding as an inline vector literal,
    # while the similarity score receives it as a bound parameter
    query = f"""
    SELECT text, filename, similarity_cosine(embeddings, ?) AS similarity
    FROM {keyspace_name}.embeddings
    ORDER BY embeddings ANN OF [{input_embedding_str}]
    LIMIT {top_k};
    """

    prepared = session.prepare(query)
    # Bind a list of Python floats, as the insert function does
    bound = prepared.bind((list(map(float, input_embedding)),))
    rows = session.execute(bound)

    # Sort the results by similarity in Python
    similar_texts = sorted([(row.similarity, row.filename, row.text) for row in rows], key=lambda x: x[0], reverse=True)

    return similar_texts[:top_k]

from IPython.display import display, HTML 

# The word you want to find similarities for 
input_text = "place" 

# Call the function to find similar texts in the Cassandra database 
similar_texts = find_similar_texts_cassandra(session, input_text, keyspace_name="aisearch", top_k=10)

This function searches for similar embeddings in Cassandra and retrieves the top results based on cosine similarity. Under the hood, Cassandra’s vector search uses Hierarchical Navigable Small Worlds (HNSW). HNSW organizes data points in a multi-layer graph structure, making queries significantly faster by narrowing down the search space efficiently—particularly important when handling large datasets.

Step 4: Displaying the results

To display the results in a readable format, we can loop through the similar texts and present them along with their similarity scores:

# Print the similar texts along with their similarity scores
for similarity, filename, text in similar_texts:
    html_content = f"""
    <div style="margin-bottom: 10px;">
        <p><b>Similarity:</b> {similarity:.4f}</p>
        <p><b>Text:</b> {text}</p>
        <p><b>File:</b> {filename}</p>
    </div>
    <hr/>
    """

    display(HTML(html_content))

This code will display the top similar texts, along with their similarity scores and associated file names.

Cassandra 5 vs. Cassandra 4 + OpenSearch®

Cassandra 4 relies on an integration with OpenSearch to handle word embeddings and similarity searches. This approach works well for applications that are already using or comfortable with OpenSearch, but it does introduce additional complexity with the need to maintain two systems.

Cassandra 5, on the other hand, brings vector support directly into the database. With its native VECTOR data type and similarity search functions, it simplifies your architecture and improves performance, making it an ideal solution for applications that require embedding-based searches at scale.

Feature            | Cassandra 4 + OpenSearch  | Cassandra 5 (Preview)
Embedding Storage  | OpenSearch                | Native VECTOR Data Type
Similarity Search  | KNN Plugin in OpenSearch  | COSINE, EUCLIDEAN, DOT_PRODUCT
Search Method      | Exact K-Nearest Neighbor  | Approximate Nearest Neighbor (ANN)
System Complexity  | Requires two systems      | All-in-one Cassandra solution

Conclusion: A simpler path to similarity search with Cassandra 5

With Cassandra 5, the complexity of setting up and managing a separate search system for word embeddings is gone. The new vector data type and Vector Search capabilities allow you to perform similarity searches directly within Cassandra, simplifying your architecture and making it easier to build AI-powered applications.

Coming up: more in-depth examples and use cases that demonstrate how to take full advantage of these new features in Cassandra 5 in future blogs!

Ready to experience vector search with Cassandra 5? Spin up your first cluster for free on the Instaclustr Managed Platform and try it out!

The post Introduction to similarity search: Part 2–Simplifying with Apache Cassandra® 5’s new vector data type appeared first on Instaclustr.

Introduction to similarity search with word embeddings: Part 1–Apache Cassandra® 4.0 and OpenSearch®

Word embeddings have revolutionized how we approach tasks like natural language processing, search, and recommendation engines.

They allow us to convert words and phrases into numerical representations (vectors) that capture their meaning based on the context in which they appear. Word embeddings are especially useful for tasks where traditional keyword searches fall short, such as finding semantically similar documents or making recommendations based on textual data.

For example, a search for “Laptop” might return results related to “Notebook” or “MacBook” when using embeddings (as opposed to something like “Tablet”), offering a more intuitive and accurate search experience.

As applications increasingly rely on AI and machine learning to drive intelligent search and recommendation engines, the ability to efficiently handle word embeddings has become critical. That’s where databases like Apache Cassandra come into play—offering the scalability and performance needed to manage and query large amounts of vector data.

In Part 1 of this series, we’ll explore how you can leverage word embeddings for similarity searches using Cassandra 4 and OpenSearch. By combining Cassandra’s robust data storage capabilities with OpenSearch’s powerful search functions, you can build scalable and efficient systems that handle both metadata and word embeddings.

Cassandra 4 and OpenSearch: A partnership for embeddings

Cassandra 4 doesn’t natively support vector data types or specific similarity search functions, but that doesn’t mean you’re out of luck. By integrating Cassandra with OpenSearch, an open-source search and analytics platform, you can store word embeddings and perform similarity searches using the k-Nearest Neighbors (kNN) plugin.

This hybrid approach is advantageous over relying on OpenSearch alone because it allows you to leverage Cassandra’s strengths as a high-performance, scalable database for data storage while using OpenSearch for its robust indexing and search capabilities.

Instead of duplicating large volumes of data into OpenSearch solely for search purposes, you can keep the original data in Cassandra. OpenSearch, in this setup, acts as an intelligent pointer, indexing the embeddings stored in Cassandra and performing efficient searches without the need to manage the entire dataset directly.

This approach not only optimizes resource usage but also enhances system maintainability and scalability by segregating storage and search functionalities into specialized layers.

Deploying the environment

To set up your environment for word embeddings and similarity search, you can leverage the Instaclustr Managed Platform, which simplifies deploying and managing your Cassandra cluster and OpenSearch. Instaclustr takes care of the heavy lifting, allowing you to focus on building your application rather than managing infrastructure. In this configuration, Cassandra serves as your primary data store, while OpenSearch handles vector operations and similarity searches.

Here’s how to get started:

  1. Deploy a managed Cassandra cluster: Start by provisioning your Cassandra 4 cluster on the Instaclustr platform. This managed solution ensures your cluster is optimized, secure, and ready to store non-vector data.
  2. Set up OpenSearch with kNN plugin: Instaclustr also offers a fully managed OpenSearch service. You will need to deploy OpenSearch, with the kNN plugin enabled, which is critical for handling word embeddings and executing similarity searches.

By using Instaclustr, you gain access to a robust platform that seamlessly integrates Cassandra and OpenSearch, combining Cassandra’s scalable, fault-tolerant database with OpenSearch’s powerful search capabilities. This managed environment minimizes operational complexity, so you can focus on delivering fast and efficient similarity searches for your application.

Preparing the environment

Now that we’ve outlined the environment setup, let’s dive into the specific technical steps to prepare Cassandra and OpenSearch for storing and searching word embeddings.

Step 1: Setting up Cassandra

In Cassandra, we’ll need to create a table to store the metadata. Here’s how to do that:

  1. Create the keyspace and table:
    First, create a keyspace, then create a table to store the metadata. This table will hold details such as the file name, the related text, and the UUIDs that link each paragraph to its embedding in OpenSearch:

CREATE KEYSPACE IF NOT EXISTS aisearch WITH REPLICATION = {'class': 'SimpleStrategy', 'replication_factor': 3};

USE aisearch;

DROP TABLE IF EXISTS file_metadata;
CREATE TABLE IF NOT EXISTS file_metadata (
  id UUID,
  paragraph_uuid UUID,
  filename TEXT,
  text TEXT,
  last_updated timestamp,
  PRIMARY KEY (id, paragraph_uuid)
);

Step 2: Configuring OpenSearch

In OpenSearch, you’ll need to create an index that supports vector operations for similarity search. Here’s how you can configure it:

  1. Create the index:
    Define the index settings and mappings, ensuring that vector operations are enabled and that the correct space type (e.g., L2) is used for similarity calculations.
{
  "settings": {
    "index": {
      "number_of_shards": 2,
      "knn": true,
      "knn.space_type": "l2"
    }
  },
  "mappings": { 
    "properties": { 
      "file_uuid": { 
        "type": "keyword" 
      }, 
      "paragraph_uuid": { 
        "type": "keyword" 
      }, 
      "embedding": { 
        "type": "knn_vector", 
        "dimension": 300 
      } 
    } 
  } 
}

This index configuration is optimized for storing and searching embeddings using the k-Nearest Neighbors algorithm, which is crucial for similarity search.
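If you prefer to create the index from Python rather than by sending the JSON by hand, the definition above can be built as a dictionary and passed to a client. The sketch below assumes an opensearch-py client object is available as os_client; the helper names and the example index name are hypothetical.

```python
import json


def build_knn_index_body(dimension=300, shards=2, space_type="l2"):
    """Return the index definition shown above as a Python dict."""
    return {
        "settings": {
            "index": {
                "number_of_shards": shards,
                "knn": True,
                "knn.space_type": space_type,
            }
        },
        "mappings": {
            "properties": {
                "file_uuid": {"type": "keyword"},
                "paragraph_uuid": {"type": "keyword"},
                "embedding": {"type": "knn_vector", "dimension": dimension},
            }
        },
    }


def create_knn_index(os_client, index_name, body):
    """Create the index unless it already exists.

    os_client is assumed to be an opensearch-py OpenSearch client.
    """
    if not os_client.indices.exists(index=index_name):
        os_client.indices.create(index=index_name, body=body)


body = build_knn_index_body()
print(json.dumps(body, indent=2))
# With a connected client: create_knn_index(os_client, "embeddings-index", body)
```

Keeping the dimension as a parameter makes it harder for the index and the embedding model to drift apart: the value must match the 300-dimensional vectors that will be stored.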

With these steps, your environment will be ready to handle word embeddings for similarity search using Cassandra and OpenSearch.

Generating embeddings with FastText

Once you have your environment set up, the next step is to generate the word embeddings that will drive your similarity search. For this, we’ll use FastText, a popular library from Facebook’s AI Research team that provides pre-trained word vectors. Specifically, we’re using the crawl-300d-2M model, which offers 300-dimensional vectors for millions of English words.

Step 1: Download and load the FastText model

To start, you’ll need to download the pre-trained model file. This can be done easily using Python and the requests library. Here’s the process:

1. Download the FastText model: The FastText model is stored in a zip file, which you can download from the official FastText website. The following Python script will handle the download and extraction:

import requests 
import zipfile 
import os 

# Adjust file_url  and local_filename  variables accordingly 
file_url = 'https://dl.fbaipublicfiles.com/fasttext/vectors-english/crawl-300d-2M.vec.zip' 
local_filename = '/content/gdrive/MyDrive/0_notebook_files/model/crawl-300d-2M.vec.zip' 
extract_dir = '/content/gdrive/MyDrive/0_notebook_files/model/' 

def download_file(url, filename): 
    with requests.get(url, stream=True) as r: 
        r.raise_for_status() 
        os.makedirs(os.path.dirname(filename), exist_ok=True) 
        with open(filename, 'wb') as f: 
            for chunk in r.iter_content(chunk_size=8192): 
                f.write(chunk) 
 

def unzip_file(filename, extract_to): 
    with zipfile.ZipFile(filename, 'r') as zip_ref: 
        zip_ref.extractall(extract_to) 

# Download and extract 
download_file(file_url, local_filename) 
unzip_file(local_filename, extract_dir)

2. Load the model: Once the model is downloaded and extracted, you’ll load it using Gensim’s KeyedVectors class. This allows you to work with the embeddings directly: 

from gensim.models import KeyedVectors 

# Adjust model_path variable accordingly
model_path = "/content/gdrive/MyDrive/0_notebook_files/model/crawl-300d-2M.vec"
fasttext_model = KeyedVectors.load_word2vec_format(model_path, binary=False)

Step 2: Generate embeddings from text

With the FastText model loaded, the next task is to convert text into vectors. This process involves splitting the text into words, looking up the vector for each word in the FastText model, and then averaging the vectors to get a single embedding for the text.

Here’s a function that handles the conversion:

import numpy as np 
import re 

def text_to_vector(text): 
    """Convert text into a vector using the FastText model.""" 
    text = text.lower() 
    words = re.findall(r'\b\w+\b', text) 
    vectors = [fasttext_model[word] for word in words if word in fasttext_model.key_to_index] 

    if not vectors: 
        print(f"No embeddings found for text: {text}") 
        return np.zeros(fasttext_model.vector_size) 

    return np.mean(vectors, axis=0)

This function tokenizes the input text, retrieves the corresponding word vectors from the model, and computes the average to create a final embedding.
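The averaging step is easier to follow on a toy example. The sketch below replaces the FastText model with a hypothetical three-dimensional dictionary of word vectors (real FastText vectors have 300 dimensions) and applies the same tokenize, look up, and average logic.

```python
import re

import numpy as np

# Hypothetical stand-in for the FastText model: word -> 3-dimensional vector
toy_model = {
    "fast": np.array([1.0, 0.0, 0.0]),
    "database": np.array([0.0, 1.0, 0.0]),
}


def toy_text_to_vector(text, model, size=3):
    """Average the vectors of all known words; return zeros if none are known."""
    words = re.findall(r'\b\w+\b', text.lower())
    vectors = [model[word] for word in words if word in model]
    if not vectors:
        return np.zeros(size)
    return np.mean(vectors, axis=0)


# "a" is not in the toy model, so only "fast" and "database" are averaged
print(toy_text_to_vector("A fast database!", toy_model))
# No known words at all: the result is the zero vector
print(toy_text_to_vector("???", toy_model))
```

The first call yields the midpoint of the two known word vectors (0.5, 0.5, 0.0). Unknown words are simply skipped, which is why a text made up entirely of unknown words falls back to the zero vector.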

Step 3: Extract text and generate embeddings from documents

In real-world applications, your text might come from various types of documents, such as PDFs, Word files, or presentations. The following code shows how to extract text from different file formats and convert that text into embeddings:

import uuid 
import mimetypes 
import pandas as pd 
from pdfminer.high_level import extract_pages 
from pdfminer.layout import LTTextContainer 
from docx import Document 
from pptx import Presentation 

def generate_deterministic_uuid(name): 
    return uuid.uuid5(uuid.NAMESPACE_DNS, name) 

def generate_random_uuid(): 
    return uuid.uuid4() 

def get_file_type(file_path): 
    # Guess the MIME type based on the file extension 
    mime_type, _ = mimetypes.guess_type(file_path) 
    return mime_type 

def extract_text_from_excel(excel_path):
    xls = pd.ExcelFile(excel_path)
    text_list = []

    for sheet_index, sheet_name in enumerate(xls.sheet_names):
        df = xls.parse(sheet_name)
        for row in df.iterrows():
            text_list.append((" ".join(map(str, row[1].values)), sheet_index + 1))  # +1 to make it 1 based index

    return text_list

def extract_text_from_pdf(pdf_path): 
    return [(text_line.get_text().strip().replace('\xa0', ' '), page_num) 
            for page_num, page_layout in enumerate(extract_pages(pdf_path), start=1) 
            for element in page_layout if isinstance(element, LTTextContainer) 
            for text_line in element if text_line.get_text().strip()] 

def extract_text_from_word(file_path):
    doc = Document(file_path)
    # python-docx does not expose page numbers, so every paragraph is reported as page 1
    return [(para.text, 1) for para in doc.paragraphs if para.text.strip()]

def extract_text_from_txt(file_path): 
    with open(file_path, 'r') as file: 
        return [(line.strip(), 1) for line in file.readlines() if line.strip()] 

def extract_text_from_pptx(pptx_path): 
    prs = Presentation(pptx_path) 
    return [(shape.text.strip(), slide_num) for slide_num, slide in enumerate(prs.slides, start=1) 
            for shape in slide.shapes if hasattr(shape, "text") and shape.text.strip()] 

def extract_text_with_page_number_and_embeddings(file_path, embedding_function): 
    file_uuid = generate_deterministic_uuid(file_path) 
    file_type = get_file_type(file_path) 

    extractors = { 
        'text/plain': extract_text_from_txt, 
        'application/pdf': extract_text_from_pdf, 
        'application/vnd.openxmlformats-officedocument.wordprocessingml.document': extract_text_from_word, 
        'application/vnd.openxmlformats-officedocument.presentationml.presentation': extract_text_from_pptx, 
        'application/zip': lambda path: extract_text_from_pptx(path) if path.endswith('.pptx') else [], 
        'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet': extract_text_from_excel, 
        'application/vnd.ms-excel': extract_text_from_excel
    }

    text_list = extractors.get(file_type, lambda _: [])(file_path) 

    return [ 
      { 
          "uuid": file_uuid, 
          "paragraph_uuid": generate_random_uuid(), 
          "filename": file_path, 
          "text": text, 
          "page_num": page_num, 
          "embedding": embedding 
      } 
      for text, page_num in text_list 
      if (embedding := embedding_function(text)).any()  # Check if the embedding is not all zeros 
    ] 

# Replace the file path with the one you want to process 

file_path = "../../docs-manager/Cassandra-Best-Practices.pdf"
paragraphs_with_embeddings = extract_text_with_page_number_and_embeddings(file_path, text_to_vector)

This code handles extracting text from different document types, generating embeddings for each text chunk, and associating them with unique IDs.
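The two UUID helpers behave differently, and the difference matters for linking records later. As a small illustration (the file path is hypothetical), a name-based UUID is stable for a given path, while random UUIDs differ on every call:

```python
import uuid

path = "docs/example.pdf"  # hypothetical file path

# Name-based (version 5) UUIDs are derived from the input, so they repeat
file_id_first_run = uuid.uuid5(uuid.NAMESPACE_DNS, path)
file_id_second_run = uuid.uuid5(uuid.NAMESPACE_DNS, path)

# Random (version 4) UUIDs are different on every call
paragraph_id_a = uuid.uuid4()
paragraph_id_b = uuid.uuid4()

print(file_id_first_run == file_id_second_run)  # True
print(paragraph_id_a == paragraph_id_b)         # False
```

In practice this means that reprocessing the same file yields the same file UUID, whereas every paragraph receives a fresh identifier on each run.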

With FastText set up and embeddings generated, you’re now ready to store these vectors in OpenSearch and start performing similarity searches.

Performing similarity searches

To conduct similarity searches, we utilize the k-Nearest Neighbors (kNN) plugin within OpenSearch. This plugin allows us to efficiently search for the most similar embeddings stored in the system. Essentially, you’re querying OpenSearch to find the closest matches to a word or phrase based on your embeddings.

For example, if you’ve embedded product descriptions, using kNN search helps you locate products that are semantically similar to a given input. This capability can significantly enhance your application’s recommendation engine, categorization, or clustering.

This setup with Cassandra and OpenSearch is a powerful combination, but it’s important to remember that it requires managing two systems. As Cassandra evolves, the introduction of built-in vector support in Cassandra 5 simplifies this architecture. But for now, let’s focus on leveraging both systems to get the most out of similarity searches.

Example: Inserting metadata in Cassandra and embeddings in OpenSearch

In this example, we use Cassandra 4 to store metadata related to files and paragraphs, while OpenSearch handles the actual word embeddings. By storing the paragraph and file IDs in both systems, we can link the metadata in Cassandra with the embeddings in OpenSearch.

We first need to store metadata such as the file name, paragraph UUID, and other relevant details in Cassandra. This metadata will be crucial for linking the data between Cassandra, OpenSearch, and the file itself in the filesystem.

The following code demonstrates how to insert this metadata into Cassandra and the embeddings into OpenSearch. Make sure to run the previous script first so that the “paragraphs_with_embeddings” variable is populated:

from tqdm import tqdm 

# Function to insert data into both Cassandra and OpenSearch 
def insert_paragraph_data(session, os_client, paragraph, keyspace_name, index_name): 
    # Insert into Cassandra 
    cassandra_result = insert_with_retry( 
        session=session, 
        id=paragraph['uuid'], 
        paragraph_uuid=paragraph['paragraph_uuid'], 
        text=paragraph['text'], 
        filename=paragraph['filename'], 
        keyspace_name=keyspace_name, 
        max_retries=3, 
        retry_delay_seconds=1 
    ) 

    if not cassandra_result: 
        return False  # Stop further processing if Cassandra insertion fails 

    # Insert into OpenSearch 
    opensearch_result = insert_embedding_to_opensearch( 
        os_client=os_client, 
        index_name=index_name, 
        file_uuid=paragraph['uuid'], 
        paragraph_uuid=paragraph['paragraph_uuid'], 
        embedding=paragraph['embedding'] 
    ) 

    if opensearch_result is not None: 
        return False  # Return False if OpenSearch insertion fails 

    return True  # Return True on success for both 

# Process each paragraph with a progress bar 
print("Starting batch insertion of paragraphs.") 

for paragraph in tqdm(paragraphs_with_embeddings, desc="Inserting paragraphs"): 
    if not insert_paragraph_data( 
        session=session, 
        os_client=os_client, 
        paragraph=paragraph, 
        keyspace_name=keyspace_name, 
        index_name=index_name 
    ):
        print(f"Insertion failed for UUID {paragraph['uuid']}: {paragraph['text'][:50]}...")

print("Batch insertion completed.")

Performing similarity search

Now that we’ve stored both metadata in Cassandra and embeddings in OpenSearch, it’s time to perform a similarity search. This step involves searching OpenSearch for embeddings that closely match a given input and then retrieving the corresponding metadata from Cassandra.

The process is straightforward: we start by converting the input text into an embedding, then use the k-Nearest Neighbors (kNN) plugin in OpenSearch to find the most similar embeddings. Once we have the results, we fetch the related metadata from Cassandra, such as the original text and file name.

Here’s how it works:

  1. Convert text to embedding: Start by converting your input text into an embedding vector using the FastText model. This vector will serve as the query for our similarity search.
  2. Search OpenSearch for similar embeddings: Using the kNN search capability in OpenSearch, we find the top k most similar embeddings. Each result includes the corresponding file and paragraph UUIDs, which help us link the results back to Cassandra.
  3. Fetch metadata from Cassandra: With the UUIDs retrieved from OpenSearch, we query Cassandra to get the metadata, such as the original text and file name, associated with each embedding.

The following code demonstrates this process:

import uuid 
from IPython.display import display, HTML 

def find_similar_embeddings_opensearch(os_client, index_name, input_embedding, top_k=5): 
    """Search for similar embeddings in OpenSearch and return the associated UUIDs.""" 
    query = { 
        "size": top_k, 
        "query": { 
            "knn": { 
                "embedding": { 
                    "vector": input_embedding.tolist(), 
                    "k": top_k 
                } 
            } 
        } 
    }

    response = os_client.search(index=index_name, body=query)

    similar_uuids = [] 
    for hit in response['hits']['hits']: 
        file_uuid = hit['_source']['file_uuid'] 
        paragraph_uuid = hit['_source']['paragraph_uuid'] 
        similar_uuids.append((file_uuid, paragraph_uuid))  

    return similar_uuids 

def fetch_metadata_from_cassandra(session, file_uuid, paragraph_uuid, keyspace_name): 
    """Fetch the metadata (text and filename) from Cassandra based on UUIDs.""" 
    file_uuid = uuid.UUID(file_uuid) 
    paragraph_uuid = uuid.UUID(paragraph_uuid) 

    query = f""" 
    SELECT text, filename 
    FROM {keyspace_name}.file_metadata 
    WHERE id = ? AND paragraph_uuid = ?; 
    """ 
    prepared = session.prepare(query) 
    bound = prepared.bind((file_uuid, paragraph_uuid)) 
    rows = session.execute(bound)    

    for row in rows: 
        return row.filename, row.text 
    return None, None 

# Input text to find similar embeddings 
input_text = "place" 

# Convert input text to embedding 
input_embedding = text_to_vector(input_text) 

# Find similar embeddings in OpenSearch 
similar_uuids = find_similar_embeddings_opensearch(os_client, index_name=index_name, input_embedding=input_embedding, top_k=10) 

# Fetch and display metadata from Cassandra based on the UUIDs found in OpenSearch 
for file_uuid, paragraph_uuid in similar_uuids: 
    filename, text = fetch_metadata_from_cassandra(
        session, file_uuid, paragraph_uuid, keyspace_name
    )

    if filename and text: 
        html_content = f""" 
        <div style="margin-bottom: 10px;"> 
            <p><b>File UUID:</b> {file_uuid}</p> 
            <p><b>Paragraph UUID:</b> {paragraph_uuid}</p> 
            <p><b>Text:</b> {text}</p> 
            <p><b>File:</b> {filename}</p> 
        </div> 

        <hr/> 
        """ 

        display(HTML(html_content))

This code demonstrates how to find similar embeddings in OpenSearch and retrieve the corresponding metadata from Cassandra. By linking the two systems via the UUIDs, you can build powerful search and recommendation systems that combine metadata storage with advanced embedding-based searches.
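To make the idea of a nearest-neighbor lookup concrete, here is a minimal brute-force sketch in pure Python. It is only an illustration: OpenSearch relies on its own index structures and on whatever distance metric the index is configured with, so the cosine similarity, the toy 3-dimensional vectors, and the paragraph IDs below are all assumptions made for this example.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def brute_force_knn(query, stored, top_k=2):
    """Return the top_k (paragraph_id, score) pairs most similar to query."""
    scored = [(pid, cosine_similarity(query, vec)) for pid, vec in stored.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]

# Toy "embeddings" keyed by hypothetical paragraph IDs
stored_embeddings = {
    "p1": [1.0, 0.0, 0.0],
    "p2": [0.9, 0.1, 0.0],
    "p3": [0.0, 1.0, 0.0],
}

nearest = brute_force_knn([1.0, 0.0, 0.0], stored_embeddings, top_k=2)
for paragraph_id, score in nearest:
    print(paragraph_id, round(score, 4))
```

Comparing the query against every stored vector, as this sketch does, becomes expensive as the collection grows, which is one reason to delegate the search to a dedicated engine.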

Conclusion and next steps: A powerful combination of Cassandra 4 and OpenSearch

By leveraging the strengths of Cassandra 4 and OpenSearch, you can build a system that handles both metadata storage and similarity search. Cassandra efficiently stores your file and paragraph metadata, while OpenSearch takes care of embedding-based searches using the k-Nearest Neighbors algorithm. Together, these two technologies enable powerful, large-scale applications for text search, recommendation engines, and more.

Coming up in Part 2, we’ll explore how Cassandra 5 simplifies this architecture with built-in vector support and native similarity search capabilities.

Ready to try vector search with Cassandra and OpenSearch? Spin up your first cluster for free on the Instaclustr Managed Platform and explore the incredible power of vector search.

The post Introduction to similarity search with word embeddings: Part 1–Apache Cassandra® 4.0 and OpenSearch® appeared first on Instaclustr.

How Cassandra Streaming, Performance, Node Density, and Cost are All Related

This is the first post of several I have planned on optimizing Apache Cassandra for maximum cost efficiency. I’ve spent over a decade working with Cassandra and have spent tens of thousands of hours data modeling, fixing issues, writing tools for it, and analyzing its performance. I’ve always been fascinated by database performance tuning, even before Cassandra.

A decade ago I filed one of my first issues with the project, where I laid out my target goal of 20TB of data per node. This wasn’t possible for most workloads at the time, but I’ve kept this target in my sights.

IBM acquires DataStax: What that means for customers–and why Instaclustr is a smart alternative

IBM’s recent acquisition of DataStax has certainly made waves in the tech industry. With IBM’s expanding influence in data solutions and DataStax’s reputation for advancing Apache Cassandra® technology, this acquisition could signal a shift in the database management landscape.

For businesses currently using DataStax, this news might have sparked questions about what the future holds. How does this acquisition impact your systems, your data, and, most importantly, your goals?

While the acquisition raises the prospect of integrating IBM’s cloud capabilities with high-performance NoSQL solutions, there’s uncertainty too. Transition periods for acquisitions often involve changes in product development priorities, pricing structures, and support strategies.

However, one thing is certain: customers want reliable, scalable, and transparent solutions. If you’re re-evaluating your options amid these changes, here’s why NetApp Instaclustr offers an excellent path forward.

Decoding the IBM-DataStax link-up

DataStax is a provider of enterprise solutions for Apache Cassandra, a powerful NoSQL database trusted for its ability to handle massive amounts of distributed data. IBM’s acquisition reflects its growing commitment to strengthening data management and expanding its footprint in the open source ecosystem.

While the acquisition promises an infusion of IBM’s resources and reach, IBM’s strategy often leans into long-term integration into its own cloud services and platforms. This could potentially reshape DataStax’s roadmap to align with IBM’s broader cloud-first objectives. Customers who don’t rely solely on IBM’s ecosystem—or want flexibility in their database management—might feel caught in a transitional limbo.

This is where Instaclustr comes into the picture as a strong, reliable alternative solution.

Why consider Instaclustr?

Instaclustr is purpose-built to empower businesses with a robust, open source data stack. For businesses relying on Cassandra or DataStax, Instaclustr delivers an alternative that’s stable, high-performing, and highly transparent.

Here’s why Instaclustr could be your best option moving forward:

1. 100% open source commitment

We’re firm believers in the power of open source technology. We offer pure Apache Cassandra, keeping it true to its roots without the proprietary lock-ins or hidden limitations. Unlike proprietary solutions, a commitment to pure open source ensures flexibility, freedom, and no vendor lock-in. You maintain full ownership and control.

2. Platform agnostic

One of the things that sets our solution apart is our platform-agnostic approach. Whether you’re running your workloads on AWS, Google Cloud, Azure, or on-premises environments, we make it seamless for you to deploy, manage, and scale Cassandra. This differentiates us from vendors tied deeply to specific clouds—like IBM.

3. Transparent pricing

Worried about the potential for a pricing overhaul under IBM’s leadership of DataStax? At Instaclustr, we pride ourselves on simplicity and transparency. What you see is what you get—predictable costs without hidden fees or confusing licensing rules. Our customer-first approach ensures that you remain in control of your budget.

4. Expert support and services

With Instaclustr, you’re not just getting access to technology—you’re also gaining access to a team of Cassandra experts who breathe open source. We’ve been managing and optimizing Cassandra clusters across the globe for years, with a proven commitment to providing best-in-class support.

Whether it’s data migration, scaling real-world workloads, or troubleshooting, we have you covered every step of the way. And our reliable SLA-backed managed Cassandra services mean businesses can focus less on infrastructure stress and more on innovation.

5. Seamless migrations

Concerned about the transition process? If you’re currently on DataStax and contemplating a move, our solution provides tools, guidance, and hands-on support to make the migration process smooth and efficient. Our experience in executing seamless migrations ensures minimal disruption to your operations.

Customer-centric focus

At the heart of everything we do is a commitment to your success. We understand that your data management strategy is critical to achieving your business goals, and we work hard to provide adaptable solutions.

Instaclustr comes to the table with over 10 years of experience in managing open source technologies including Cassandra, Apache Kafka®, PostgreSQL®, OpenSearch®, Valkey®, ClickHouse® and more, backed by over 400 million node hours and 18+ petabytes of data under management. Our customers trust and rely on us to manage the data that drives their critical business applications.

With a focus on fostering an open source future, our solutions aren’t tied to any single cloud, ecosystem, or bit of red tape. Simply put: your open source success is our mission.

Final thoughts: Why Instaclustr is the smart choice for this moment

IBM’s acquisition of DataStax might open new doors—but close many others. While the collaboration between IBM and DataStax might appeal to some enterprises, it’s important to weigh alternative solutions that offer reliability, flexibility, and freedom.

With Instaclustr, you get a partner that’s been empowering businesses with open source technologies for years, providing the transparency, support, and performance you need to thrive.

Ready to explore a stable, long-term alternative to DataStax? Check out Instaclustr for Apache Cassandra.

Contact us and learn more about Instaclustr for Apache Cassandra or request a demo of the Instaclustr platform today!

The post IBM acquires DataStax: What that means for customers–and why Instaclustr is a smart alternative appeared first on Instaclustr.

Innovative data compression for time series: An open source solution

Introduction

There’s no escaping the role that monitoring plays in our everyday lives, whether it’s monitoring the weather, the number of steps we take in a day, computer systems, or ever-popular IoT devices.

Practically any activity can be monitored in one form or another these days. This generates increasing amounts of data to be pored over and analyzed–but storing all this data adds significant costs over time. Given this huge amount of data that only increases with each passing day, efficient compression techniques are crucial.  

Here at NetApp® Instaclustr we saw a great opportunity to improve the current compression techniques for our time series data. That’s why we created the Advanced Time Series Compressor (ATSC) in partnership with the University of Canberra through the OpenSI initiative.

ATSC is a groundbreaking compressor designed to address the challenges of efficiently compressing large volumes of time-series data. Internal tests with production data from our database metrics showed that ATSC compresses, averaged across the dataset, ~10x more than LZ4 and ~30x more than the default Prometheus compression. Check out ATSC on GitHub.

There are so many compressors already, so why develop another one? 

While other compression methods like LZ4, DoubleDelta, and ZSTD are lossless, most of our timeseries data is already lossy. Timeseries data can be lossy from the beginning due to under-sampling or insufficient data collection, or it can become lossy over time as metrics are rolled over or averaged. Because of this, the idea of a lossy compressor was born. 

ATSC is a highly configurable, lossy compressor that leverages the characteristics of time-series data to create function approximations. ATSC finds a fitting function and stores the parametrization of that function; no actual data from the original timeseries is stored. When the data is decompressed, it isn’t identical to the original, but it is still sufficient for the intended use.

Here’s an example: for a temperature change metric, which mostly varies slowly (as do a lot of system metrics!), instead of storing all the points that have a small change, we fit a curve (or a line) and store that curve/line, achieving significant compression ratios.

Image 1: ATSC data for temperature 
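The temperature example can be sketched in a few lines of Python. This is not ATSC's code (ATSC is written in Rust and uses its own set of functions); it is a minimal illustration, assuming a plain least-squares line and made-up temperature readings, of how storing two parameters can stand in for a whole run of slowly changing points.

```python
def fit_line(values):
    """Least-squares fit of y = slope * i + intercept over sample indices i."""
    n = len(values)
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    sxx = sum((i - mean_x) ** 2 for i in range(n))
    sxy = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(values))
    slope = sxy / sxx if sxx else 0.0
    return slope, mean_y - slope * mean_x

def reconstruct(slope, intercept, n):
    """Regenerate n points from the stored line parameters."""
    return [slope * i + intercept for i in range(n)]

# Hypothetical slowly rising temperature readings
temperatures = [20.0, 20.1, 20.2, 20.2, 20.3, 20.4, 20.5, 20.5, 20.6, 20.7]

slope, intercept = fit_line(temperatures)
restored = reconstruct(slope, intercept, len(temperatures))
max_rel_error = max(abs(r - t) / abs(t) for r, t in zip(restored, temperatures))

print(f"stored 2 parameters instead of {len(temperatures)} points")
print(f"max relative error: {max_rel_error:.4%}")
```

The restored points are not identical to the originals, which is exactly the lossy trade-off described above: a small, measurable error in exchange for storing far less.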

How does ATSC work? 

ATSC looks at the actual time series, in whole or in parts, to find a function that best fits the existing data. A quick statistical analysis is done first; if the results are inconclusive, a sample is compressed with all the functions and the best function is selected.

By default, ATSC will segment the data. This guarantees better local fitting, more and smaller computations, and less memory usage. It also ensures that decompression targets a specific block instead of the whole file.

In each fitting frame, ATSC will create a function from a pre-defined set and calculate the parametrization of said function.

ATSC currently uses one of the following functions per frame:

  • FFT (Fast Fourier Transforms) 
  • Constant 
  • Interpolation – Catmull-Rom 
  • Interpolation – Inverse Distance Weight 

Image 2: Polynomial fitting vs. Fast-Fourier Transform fitting 

These methods allow ATSC to compress data with a fitting error within 1% (configurable!) of the original time-series.
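The frame-by-frame approach can be illustrated with a small Python sketch. The two candidate functions here (a constant and a straight line between a frame's first and last points) are simplified stand-ins chosen for brevity, not ATSC's FFT or interpolation implementations, and the frame size, function names, and sample data are all hypothetical. Only the 1% error bound comes from the text above.

```python
def max_relative_error(original, approx):
    """Largest per-point error relative to the original value."""
    return max(abs(a - o) / max(abs(o), 1e-12) for o, a in zip(original, approx))

def rebuild(name, params, length):
    """Regenerate a frame's points from its stored parameters."""
    if name == "constant":
        return [params[0]] * length
    first, last = params  # "endpoints": straight line from first to last
    step = (last - first) / (length - 1) if length > 1 else 0.0
    return [first + step * i for i in range(length)]

def candidates(frame):
    """Candidate (name, params) pairs, simplest first."""
    yield "constant", (sum(frame) / len(frame),)
    yield "endpoints", (frame[0], frame[-1])

def compress(values, frame_size=8, max_error=0.01):
    """Pick, per frame, the simplest candidate within max_error."""
    models = []
    for start in range(0, len(values), frame_size):
        frame = values[start:start + frame_size]
        scored = [
            (name, params, max_relative_error(frame, rebuild(name, params, len(frame))))
            for name, params in candidates(frame)
        ]
        within = [s for s in scored if s[2] <= max_error]
        # Fall back to the lowest-error candidate if none meets the bound
        name, params, _ = within[0] if within else min(scored, key=lambda s: s[2])
        models.append((name, params, len(frame)))
    return models

# Hypothetical metric: flat for 8 samples, then a steady ramp for 8 samples
series = [50.0] * 8 + [50.0 + 2.0 * i for i in range(8)]
models = compress(series)
for name, params, length in models:
    print(name, params, length)

# Decompression can target one frame without touching the others
second_frame = rebuild(*models[1])
```

Because each frame carries its own function and parameters, the flat stretch and the ramp get different models, and a reader interested only in the ramp can rebuild that frame alone.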

For a more detailed insight into ATSC internals and operations check our paper! 

Use cases for ATSC and results

ATSC draws inspiration from established compression and signal analysis techniques, achieving compression ratios ranging from 46x to 880x with a fitting error within 1% of the original time-series. In some cases, ATSC can produce highly compressed data without losing any meaningful information, making it a versatile tool for various applications (please see use cases below). 

Our internal tests comparing ATSC to LZ4 and the default Prometheus compression yielded the following results:

Method       Compressed size (bytes)   Compression ratio
Prometheus   454,778,552               1.33
LZ4          141,347,821               4.29
ATSC         14,276,544                42.47
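As a quick sanity check, the compressed sizes in the table can be compared directly to see how they relate to the "~10x" and "~30x" figures quoted in the introduction. The short Python sketch below uses only the byte counts from the table; the variable names are illustrative.

```python
# Compressed sizes in bytes, taken from the results table
compressed_bytes = {
    "Prometheus": 454_778_552,
    "LZ4": 141_347_821,
    "ATSC": 14_276_544,
}

atsc = compressed_bytes["ATSC"]

# How many times smaller the ATSC output is than each other method's output
relative = {
    name: size / atsc
    for name, size in compressed_bytes.items()
    if name != "ATSC"
}

for name, factor in relative.items():
    print(f"ATSC output is {factor:.1f}x smaller than {name}")
```

The factors work out to roughly 9.9x against LZ4 and 31.9x against Prometheus, in line with the approximate figures given earlier.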

Another characteristic is the trade-off between slower compression and faster decompression: compression is about 30x slower than decompression. This suits time series, which are expected to be compressed once but decompressed many times.

Image 3: A better fitting (purple) vs. a loose fitting (red). Purple takes twice as much space.

ATSC is versatile and can be applied in various scenarios where space reduction is prioritized over absolute precision. Some examples include: 

  • Rolled-over time series: For metrics data that are rolled over and stored long term, ATSC provides the same or greater space savings with minimal information loss.
  • Under-sampled time series: Increase sample rates without using more space. Systems with very low sampling rates (one sample every 30 seconds or more) make it very difficult to identify actual events. ATSC provides the space savings while keeping the information about the events.
  • Long, slow-moving data series: Ideal for patterns that are easy to fit, such as weather data. 
  • Human visualization: Data meant for human analysis, with minimal impact on accuracy, such as historic views into system metrics (CPU, Memory, Disk, etc.)

Image 4: ATSC data (green) with an 88x compression vs. the original data (yellow)   

Using ATSC 

ATSC is written in Rust and is available on GitHub. You can build and run it yourself following these instructions.

Future work 

Currently, we are planning to evolve ATSC in two ways (check our open issues): 

  1. Adding features to the core compressor focused on these functionalities:
    • Frame expansion for appending new data to existing frames 
    • Dynamic function loading to add more functions without altering the codebase 
    • Global and per-frame error storage 
    • Improved error encoding 
  2. Integrations with additional technologies (e.g. databases):
    • We are currently looking into integrating ATSC with ClickHouse® and Apache Cassandra®

CREATE TABLE sensors_poly (
    sensor_id UInt16,
    location UInt32,
    timestamp DateTime,
    pressure Float64 CODEC(ATSC('Polynomial', 1)),
    temperature Float64 CODEC(ATSC('Polynomial', 1))
)
ENGINE = MergeTree
ORDER BY (sensor_id, location, timestamp);

Image 5: Currently testing ClickHouse integration 

Sound interesting? Try it out and let us know what you think.  

ATSC represents a significant advancement in time-series data compression, offering high compression ratios with a configurable accuracy loss. Whether for long-term storage or efficient data visualization, ATSC is a powerful open source tool for managing large volumes of time-series data. 

But don’t just take our word for it: download and run it!

Check our documentation for any information you need and submit ideas for improvements or issues you find using GitHub issues. We also have easy first issues tagged if you’d like to contribute to the project.   

Want to integrate this with another tool? You can build and run our demo integration with ClickHouse. 

The post Innovative data compression for time series: An open source solution appeared first on Instaclustr.

New cassandra_latest.yaml configuration for a top performant Apache Cassandra®

Welcome to our deep dive into the latest advancements in Apache Cassandra® 5.0, specifically focusing on the cassandra_latest.yaml configuration that is available for new Cassandra 5.0 clusters.  

This blog post will walk you through the motivation behind these changes, how to use the new configuration, and the benefits it brings to your Cassandra clusters. 

Motivation 

The primary motivation for introducing cassandra_latest.yaml is to bridge the gap between maintaining backward compatibility and leveraging the latest features and performance improvements. The yaml addresses the following varying needs for new Cassandra 5.0 clusters: 

  1. Cassandra Developers: who want to push new features but face challenges due to backward compatibility constraints. 
  2. Operators: who prefer stability and minimal disruption during upgrades. 
  3. Evangelists and New Users: who seek the latest features and performance enhancements without worrying about compatibility. 

Using cassandra_latest.yaml 

Using cassandra_latest.yaml is straightforward. It involves copying the cassandra_latest.yaml content to your cassandra.yaml or pointing the cassandra.config JVM property to the cassandra_latest.yaml file.  

This configuration is designed for new Cassandra 5.0 clusters (or those evaluating Cassandra), ensuring they get the most out of the latest features in Cassandra 5.0 and performance improvements. 

Key changes and features 

Key Cache Size 

  • Old: The smaller of 5% of the heap or 100MB
  • Latest: Explicitly set to 0

Impact: Setting the key cache size to 0 in the latest configuration avoids performance degradation with the new SSTable format. This change is particularly beneficial for clusters using the new SSTable format, which doesn’t require key caching in the same way as the old format. Key caching was used to reduce the time it takes to find a specific key in Cassandra storage. 

Commit Log Disk Access Mode 

  • Old: Set to legacy
  • Latest: Set to auto

Impact: The auto setting optimizes the commit log disk access mode based on the available disks, potentially improving write performance. It can automatically choose the best mode (e.g., direct I/O) depending on the hardware and workload, leading to better performance without manual tuning.

Memtable Implementation 

  • Old: Skiplist-based
  • Latest: Trie-based

Impact: The trie-based memtable implementation reduces garbage collection overhead and improves throughput by moving more metadata off-heap. This change can lead to more efficient memory usage and higher write performance, especially under heavy load.

create table … with memtable = {'class': 'TrieMemtable', … }

Memtable Allocation Type 

  • Old: Heap buffers 
  • Latest: Off-heap objects 

Impact: Using off-heap objects for memtable allocation reduces the pressure on the Java heap, which can improve garbage collection performance and overall system stability. This is particularly beneficial for large datasets and high-throughput environments. 

Trickle Fsync 

  • Old: False 
  • Latest: True 

Impact: Enabling trickle fsync improves performance on SSDs by periodically flushing dirty buffers to disk, which helps avoid sudden large I/O operations that can impact read latencies. This setting is particularly useful for maintaining consistent performance in write-heavy workloads. 

SSTable Format 

  • Old: big 
  • Latest: bti (trie-indexed structure) 

Impact: The new BTI format is designed to improve read and write performance by using a trie-based indexing structure. This can lead to faster data access and more efficient storage management, especially for large datasets. 

sstable:
  selected_format: bti
  default_compression: zstd
  compression:
    zstd:
      enabled: true
      chunk_length: 16KiB
      max_compressed_length: 16KiB

Default Compaction Strategy 

  • Old: STCS (Size-Tiered Compaction Strategy) 
  • Latest: Unified Compaction Strategy 

Impact: The Unified Compaction Strategy (UCS) is more efficient and can handle a wider variety of workloads compared to STCS. UCS can reduce write amplification and improve read performance by better managing the distribution of data across SSTables. 

default_compaction:
  class_name: UnifiedCompactionStrategy
  parameters:
    scaling_parameters: T4
    max_sstables_to_compact: 64
    target_sstable_size: 1GiB
    sstable_growth: 0.3333333333333333
    min_sstable_size: 100MiB

Concurrent Compactors 

  • Old: Defaults to the smaller of the number of disks and cores
  • Latest: Explicitly set to 8

Impact: Setting the number of concurrent compactors to 8 ensures that multiple compaction operations can run simultaneously, helping to maintain read performance during heavy write operations. This is particularly beneficial for SSD-backed storage where parallel I/O operations are more efficient. 

Default Secondary Index 

  • Old: legacy_local_table
  • Latest: sai

Impact: Storage-Attached Indexing (SAI) is a new index implementation that builds on the advancements made with SSTable Attached Secondary Index (SASI). It enables users to index multiple columns on the same table without suffering scaling problems, especially at write time.

Stream Entire SSTables 

  • Old: Implicitly set to True
  • Latest: Explicitly set to True

Impact: When enabled, it permits Cassandra to zero-copy stream entire eligible SSTables between nodes, including every component. This speeds up the network transfer significantly, subject to throttling specified by entire_sstable_stream_throughput_outbound and, for inter-DC transfers, entire_sstable_inter_dc_stream_throughput_outbound.

UUID SSTable Identifiers 

  • Old: False
  • Latest: True

Impact: Enabling UUID-based SSTable identifiers ensures that each SSTable has a unique name, simplifying backup and restore operations. This change reduces the risk of name collisions and makes it easier to manage SSTables in distributed environments. 

Storage Compatibility Mode 

  • Old: Cassandra 4
  • Latest: None

Impact: Setting the storage compatibility mode to none enables all new features by default, allowing users to take full advantage of the latest improvements in Cassandra, such as the new SSTable format. This setting is ideal for new clusters or those that do not need to maintain backward compatibility with older versions.

Testing and validation 

The cassandra_latest.yaml configuration has undergone rigorous testing to ensure it works seamlessly. Currently, the Cassandra project CI pipeline tests both the standard (cassandra.yaml) and latest (cassandra_latest.yaml) configurations, ensuring compatibility and performance. This includes unit tests, distributed tests, and DTests. 

Future improvements 

Future improvements may include enforcing password strength policies and other security enhancements. The community is encouraged to suggest features that could be enabled by default in cassandra_latest.yaml. 

Conclusion 

The cassandra_latest.yaml configuration for new Cassandra 5.0 clusters is a significant step forward in making Cassandra more performant and feature-rich while maintaining the stability and reliability that users expect. Whether you are a developer, an operator, or an evangelist/end user, cassandra_latest.yaml offers something valuable for everyone.

Try it out 

Ready to experience the incredible power of the cassandra_latest.yaml configuration on Apache Cassandra 5.0? Spin up your first cluster with a free trial on the Instaclustr Managed Platform and get started today with Cassandra 5.0!

The post New cassandra_latest.yaml configuration for a top performant Apache Cassandra® appeared first on Instaclustr.

Cassandra 5 Released! What's New and How to Try it

Apache Cassandra 5.0 has officially landed! This highly anticipated release brings a range of new features and performance improvements to one of the most popular NoSQL databases in the world. Having recently hosted a webinar covering the major features of Cassandra 5.0, I’m excited to give a brief overview of the key updates and show you how to easily get hands-on with the latest release using easy-cass-lab.

You can grab the latest release on the Cassandra download page.

Instaclustr for Apache Cassandra® 5.0 Now Generally Available

NetApp is excited to announce the general availability (GA) of Apache Cassandra® 5.0 on the Instaclustr Platform. This follows the release of the public preview in March.

NetApp was the first managed service provider to release the beta version, and now the Generally Available version, allowing the deployment of Cassandra 5.0 across the major cloud providers (AWS, Azure, and GCP) and on-premises.

Apache Cassandra has been a leader in NoSQL databases since its inception and is known for its high availability, reliability, and scalability. The latest version brings many new features and enhancements, with a special focus on building data-driven applications through artificial intelligence and machine learning capabilities.

Cassandra 5.0 will help you optimize performance, lower costs, and get started on the next generation of distributed computing by: 

  • Helping you build AI/ML-based applications through Vector Search  
  • Bringing efficiencies to your applications through new and enhanced indexing and processing capabilities 
  • Improving flexibility and security 

With the GA release, you can use Cassandra 5.0 for your production workloads, which are covered by NetApp’s industry-leading SLAs. NetApp has conducted performance benchmarking and extensive testing while removing the limitations that were present in the preview release to offer a more reliable and stable version. Our GA offering is suitable for all workload types as it contains the most up-to-date range of features, bug fixes, and security patches.

Support for continuous backups and private network add-ons is available. Currently, Debezium is not yet compatible with Cassandra 5.0. NetApp will work with the Debezium community to add support for Debezium on Cassandra 5.0 and it will be available on the Instaclustr Platform as soon as it is supported.

Some of the key new features in Cassandra 5.0 include: 

  • Storage-Attached Indexes (SAI): A highly scalable, globally distributed index for Cassandra databases. With SAI, column-level indexes can be added, leading to unparalleled I/O throughput for searches across different data types, including vectors. SAI also enables lightning-fast data retrieval through zero-copy streaming of indices, resulting in unprecedented efficiency.  
  • Vector Search: This is a powerful technique for searching relevant content or discovering connections by comparing similarities in large document collections and is particularly useful for AI applications. It uses storage-attached indexing and dense indexing techniques to enhance data exploration and analysis.  
  • Unified Compaction Strategy: This strategy unifies compaction approaches, including leveled, tiered, and time-windowed strategies. It leads to a major reduction in SSTable sizes. Smaller SSTables mean better read and write performance, reduced storage requirements, and improved overall efficiency.  
  • Numerous stability and testing improvements: You can read all about these changes here. 

All these new features are available out-of-the-box in Cassandra 5.0 and do not incur additional costs.  

Our Development team has worked diligently to bring you a stable release of Cassandra 5.0. Substantial preparatory work was done to ensure you have a seamless experience with Cassandra 5.0 on the Instaclustr Platform. This includes updating the Cassandra YAML and Java environment and enhancing the monitoring capabilities of the platform to support new data types.  

We also conducted extensive performance testing and benchmarked version 5.0 with the existing stable Apache Cassandra 4.1.5 version. We will be publishing our benchmarking results shortly; the highlight so far is that Cassandra 5.0 improves responsiveness by reducing latencies by up to 30% during peak load times.  

Through our dedicated Apache Cassandra committer, NetApp has contributed to the development of Cassandra 5.0 by enhancing the documentation for new features like Vector Search (Cassandra-19030), enabling Materialized Views (MV) with only partition keys (Cassandra-13857), fixing numerous bugs, and contributing to the improvements for the unified compaction strategy feature, among many other things. 

Lifecycle Policy Updates 

As previously communicated, the project will no longer maintain Apache Cassandra 3.0 and 3.11 versions (full details of the announcement can be found on the Apache Cassandra website).

To help you transition smoothly, NetApp will provide extended support for these versions for an additional 12 months. During this period, we will backport any critical bug fixes, including security patches, to ensure the continued security and stability of your clusters. 

Cassandra 3.0 and 3.11 versions will reach end-of-life on the Instaclustr Managed Platform within the next 12 months. We will work with you to plan and upgrade your clusters during this period.  

Additionally, the Cassandra 5.0 beta version and the Cassandra 5.0 RC2 version, which were released as part of the public preview, are now end-of-life. You can check the lifecycle status of different Cassandra application versions here.  

You can read more about our lifecycle policies on our website. 

Getting Started 

Upgrading to Cassandra 5.0 will allow you to stay current and start taking advantage of its benefits. The Instaclustr by NetApp Support team is ready to help customers upgrade clusters to the latest version.  

  • Wondering if it’s possible to upgrade your workloads from Cassandra 3.x to Cassandra 5.0? Find the answer to this and other similar questions in this detailed blog.
  • Click here to read about Storage Attached Indexes in Apache Cassandra 5.0.
  • Learn about 4 new Apache Cassandra 5.0 features to be excited about. 
  • Click here to learn what you need to know about Apache Cassandra 5.0. 

Why Choose Apache Cassandra on the Instaclustr Managed Platform? 

NetApp strives to deliver the best of supported applications. Whether it’s the latest application versions available on the platform or additional platform enhancements, we ensure high quality through thorough testing before they enter General Availability.  

NetApp customers have the advantage of accessing the latest versions—not just the major version releases but also minor version releases—so that they can benefit from any new features and are protected from any vulnerabilities.  

Don’t have an Instaclustr account yet? Sign up for a trial or reach out to our Sales team and start exploring Cassandra 5.0.  

With more than 375 million node hours of management experience, Instaclustr offers unparalleled expertise. Visit our website to learn more about the Instaclustr Managed Platform for Apache Cassandra.  

If you would like to upgrade your Apache Cassandra version or have any issues or questions about provisioning your cluster, please contact Instaclustr Support at any time.  

The post Instaclustr for Apache Cassandra® 5.0 Now Generally Available appeared first on Instaclustr.

Apache Cassandra® 5.0: Behind the Scenes

Here at NetApp, our Instaclustr product development team has spent nearly a year preparing for the release of Apache Cassandra 5.  

The effort started with one engineer tinkering at night with the Apache Cassandra 5 Alpha branch and grew to as many as 5 engineers working on various monitoring, configuration, testing and functionality improvements to integrate the release with the Instaclustr Platform.  

It’s been a long journey to the point we are at today, offering Apache Cassandra 5 Release Candidate 1 in public preview on the Instaclustr Platform. 

Note: the Instaclustr team has a dedicated open source committer to the Apache Cassandra project. His changes are not included in this document as there were too many for us to include here. Instead, this blog primarily focuses on the engineering effort to release Cassandra 5.0 onto the Instaclustr Managed Platform. 

August 2023: The Beginning

We began experimenting with the Apache Cassandra 5 Alpha 1 branches using our build systems. There were several tools we built into our Apache Cassandra images that were not working at this point, but we managed to get a node to start even though it immediately crashed with errors.  

One of our early achievements was identifying and fixing a bug that impacted our packaging solution; this resulted in a small contribution to the project allowing Apache Cassandra to be installed on Debian systems with non-OpenJDK Java. 

September 2023: First Milestone 

The release of the Alpha 1 version allowed us to achieve our first running Cassandra 5 cluster in our development environments (without crashing!).  

Basic core functionalities like user creation, data writing, and backups/restores were tested successfully. However, several advanced features, such as repair and replace tooling, monitoring, and alerting were still untested.  

At this point we had to pause our Cassandra 5 efforts to focus on other priorities and planned to get back to testing Cassandra 5 after Alpha 2 was released. 

November 2023: Further Testing and Internal Preview 

The project released Alpha 2. We repeated the same build and tests we ran on Alpha 1. We also tested some more advanced procedures like cluster resizes with no issues.  

We also started testing with some of the new 5.0 features: Vector Data types and Storage-Attached Indexes (SAI), which resulted in another small contribution.  

We launched Apache Cassandra 5 Alpha 2 for internal preview (basically for internal users). This allowed the wider Instaclustr team to access and use the Alpha on the platform.  

During this phase we found a bug in our metrics collector that was triggered when vectors were encountered; fixing it ended up being a major project for us. 

If you see errors like the below, it’s time for a Java Cassandra driver upgrade to 4.16 or newer: 

java.lang.IllegalArgumentException: Could not parse type name vector<float, 5>  
Nov 15 22:41:04 ip-10-0-39-7 process[1548]: at com.datastax.driver.core.DataTypeCqlNameParser.parse(DataTypeCqlNameParser.java:233)  
Nov 15 22:41:04 ip-10-0-39-7 process[1548]: at com.datastax.driver.core.TableMetadata.build(TableMetadata.java:311)
Nov 15 22:41:04 ip-10-0-39-7 process[1548]: at com.datastax.driver.core.SchemaParser.buildTables(SchemaParser.java:302)
Nov 15 22:41:04 ip-10-0-39-7 process[1548]: at com.datastax.driver.core.SchemaParser.refresh(SchemaParser.java:130)
Nov 15 22:41:04 ip-10-0-39-7 process[1548]: at com.datastax.driver.core.ControlConnection.refreshSchema(ControlConnection.java:417)  
Nov 15 22:41:04 ip-10-0-39-7 process[1548]: at com.datastax.driver.core.ControlConnection.refreshSchema(ControlConnection.java:356)  
<Rest of stacktrace removed for brevity>
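
The type name in the first line of that trace, vector<float, 5>, is the new CQL vector type: an element type followed by a fixed dimension. As a rough illustration of what a schema parser has to recognize, here is a minimal Python sketch. The helper parse_vector_type is hypothetical and written only for this example; it is not code from any Cassandra driver, and it handles only simple element type names.

```python
import re
from typing import Optional, Tuple

# Hypothetical helper for illustration only; not part of any Cassandra driver.
# Matches names such as "vector<float, 5>" (simple element types only).
_VECTOR_RE = re.compile(r"^\s*vector\s*<\s*(\w+)\s*,\s*(\d+)\s*>\s*$", re.IGNORECASE)


def parse_vector_type(type_name: str) -> Optional[Tuple[str, int]]:
    """Return (element_type, dimension) for a CQL vector type name, else None."""
    match = _VECTOR_RE.match(type_name)
    if match is None:
        return None
    return match.group(1).lower(), int(match.group(2))


print(parse_vector_type("vector<float, 5>"))  # ('float', 5)
print(parse_vector_type("text"))              # None
```

A parser that predates the vector type has no rule for this shape of name, which would be consistent with the IllegalArgumentException shown above.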

December 2023: Focus on new features and planning 

As the project released Beta 1, we began focusing on the features in Cassandra 5 that we thought were the most exciting and would provide the most value to customers. There are a lot of awesome new features and changes, so it took a while to find the ones with the largest impact.  

 The final list of high impact features we came up with was: 

  • A new data type: Vectors 
  • Trie memtables/Trie Indexed SSTables (BTI Formatted SSTables) 
  • Storage-Attached Indexes (SAI) 
  • Unified Compaction Strategy 

A major new feature we considered deploying was support for JDK 17. However, due to its experimental nature, we have opted to postpone adoption and plan to support running Apache Cassandra on JDK 17 when it’s out of the experimentation phase. 

Once the holiday season arrived, it was time for a break, and we were back in force the following February. 

February 2024: Intensive testing 

In February, we released Beta 1 into internal preview so we could start testing it on our Preproduction test environments. As we started to do more intensive testing, we discovered issues in the interaction with our monitoring and provisioning setup. 

We quickly fixed the issues identified as showstoppers for launching Cassandra 5. By the end of February, we initiated discussions about a public preview release. We also started to add more resourcing to the Cassandra 5 project. Up until now, only one person was working on it.  

Next, we broke down the work we needed to do. This included identifying monitoring agents requiring upgrade and config defaults that needed to change. 

From this point, the project split into 3 streams of work: 

  1. Project Planning – Deciding how all this work gets pulled together cleanly, ensuring other work streams have adequate resourcing to hit their goals, and informing product management and the wider business of what’s happening.  
  2. Configuration Tuning – Focusing on the new features of Apache Cassandra to include, how to approach the transition to JDK 17, and how to use BTI formatted SSTables on the platform.  
  3. Infrastructure Upgrades – Identifying what to upgrade internally to handle Cassandra 5, including Vectors and BTI formatted SSTables. 

A Senior Engineer was responsible for each workstream to ensure planned timeframes were achieved. 

March 2024: Public Preview Release 

In March, we launched Beta 1 into public preview on the Instaclustr Managed Platform. The initial release did not contain any opt-in features like Trie indexed SSTables. 

However, this gave us a consistent base to test in our development, test, and production environments, and proved our release pipeline for Apache Cassandra 5 was working as intended. This also gave customers the opportunity to start using Apache Cassandra 5 with their own use cases and environments for experimentation.  

See our public preview launch blog for further details. 

There was not much time to celebrate as we continued working on infrastructure and refining our configuration defaults. 

April 2024: Configuration Tuning and Deeper Testing 

The first configuration updates were completed for Beta 1, and we started performing deeper functional and performance testing. We identified a few issues from this effort and remediated them. This default configuration was applied for all Beta 1 clusters moving forward.  

This allowed users to start testing Trie Indexed SSTables and Trie memtables in their environment by default. 

"memtable": 
  { 
    "configurations": 
      { 
        "skiplist": 
          { 
            "class_name": "SkipListMemtable" 
          }, 
        "sharded": 
          { 
            "class_name": "ShardedSkipListMemtable" 
          }, 
        "trie": 
          { 
            "class_name": "TrieMemtable" 
          }, 
        "default": 
          { 
            "inherits": "trie" 
          } 
      } 
  }, 
"sstable": 
  { 
    "selected_format": "bti" 
  }, 
"storage_compatibility_mode": "NONE",

The above illustrates an Apache Cassandra YAML configuration where BTI formatted SSTables are used by default (which allows Trie Indexed SSTables) and Trie is the default for memtables. You can override this per table: 

CREATE TABLE test WITH memtable = {'class' : 'ShardedSkipListMemtable'};

Note that you need to set storage_compatibility_mode to NONE to use BTI formatted SSTables. See the Cassandra documentation for more information.

You can also reference the cassandra_latest.yaml file for the latest settings (please note you should not apply these to existing clusters without rigorous testing). 
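
To make the "inherits" key in the memtable configuration above more concrete, here is a minimal Python sketch of how the "default" entry could resolve to TrieMemtable. It assumes that "inherits" copies settings from the named configuration and that an entry's own settings take precedence; the function name resolve_memtable is ours, and Cassandra's actual resolution logic may differ in its details.

```python
from typing import Any, Dict

# The "configurations" block from the snippet above, expressed as a Python dict.
MEMTABLE_CONFIGURATIONS: Dict[str, Dict[str, Any]] = {
    "skiplist": {"class_name": "SkipListMemtable"},
    "sharded": {"class_name": "ShardedSkipListMemtable"},
    "trie": {"class_name": "TrieMemtable"},
    "default": {"inherits": "trie"},
}


def resolve_memtable(name: str, configurations: Dict[str, Dict[str, Any]]) -> Dict[str, Any]:
    """Follow 'inherits' links and return the effective settings for `name`."""
    seen = []
    merged: Dict[str, Any] = {}
    current = name
    while current is not None:
        if current in seen:
            raise ValueError("inheritance cycle: " + " -> ".join(seen + [current]))
        if current not in configurations:
            raise KeyError(f"unknown memtable configuration: {current}")
        seen.append(current)
        entry = configurations[current]
        # Settings closer to the requested name win over inherited ones.
        for key, value in entry.items():
            if key != "inherits":
                merged.setdefault(key, value)
        current = entry.get("inherits")
    return merged


print(resolve_memtable("default", MEMTABLE_CONFIGURATIONS))  # {'class_name': 'TrieMemtable'}
print(resolve_memtable("sharded", MEMTABLE_CONFIGURATIONS))  # {'class_name': 'ShardedSkipListMemtable'}
```

Under these assumptions, tables that do not name a memtable configuration get the trie implementation, while the other named entries remain available as per-table alternatives.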

May 2024: Major Infrastructure Milestone 

We hit a very large infrastructure milestone when we released an upgrade to some of our core agents that were reliant on an older version of the Apache Cassandra Java driver. The upgrade to version 4.17 allowed us to start supporting vectors in certain keyspace level monitoring operations.  

At the time, this was considered to be the riskiest part of the entire project as we had thousands of nodes to upgrade across many different customer environments. This upgrade took a few weeks, finishing in June. We broke the release up into 4 separate rollouts to reduce the risk of introducing issues into our fleet, focusing on single key components in our architecture in each release. Each release had quality gates and tested rollback plans, which in the end were not needed. 

June 2024: Successful Rollout of the New Cassandra Driver 

The Java driver upgrade project was rolled out to all nodes in our fleet and no issues were encountered. At this point we hit all the major milestones before Release Candidates became available. We started to look at the testing systems to update to Apache Cassandra 5 by default. 

July 2024: Path to Release Candidate 

We upgraded our internal testing systems to use Cassandra 5 by default, meaning our nightly platform tests began running against Cassandra 5 clusters and our production releases are now smoke tested using Apache Cassandra 5. We started testing the upgrade path for clusters from 4.x to 5.0. This resulted in another small contribution to the Cassandra project.  

The Apache Cassandra project released Apache Cassandra 5 Release Candidate 1 (RC1), and we launched RC1 into public preview on the Instaclustr Platform. 

The Road Ahead to General Availability 

We’ve just launched Apache Cassandra 5 Release Candidate 1 (RC1) into public preview, and there’s still more to do before we reach General Availability for Cassandra 5, including: 

  • Upgrading our own preproduction Apache Cassandra for internal use to Apache Cassandra 5 Release Candidate 1. This means we’ll be testing using our real-world use cases and testing our upgrade procedures on live infrastructure. 

At Launch: 

When Apache Cassandra 5.0 launches, we will perform another round of testing, including performance benchmarking. We will also upgrade our internal metrics storage production Apache Cassandra clusters to 5.0, and, if the results are satisfactory, we will mark the release as generally available for our customers. We want to have full confidence in running 5.0 before we recommend it for production use to our customers.  

For more information about our own usage of Cassandra for storing metrics on the Instaclustr Platform check out our series on Monitoring at Scale.  

What Have We Learned From This Project? 

  • Releasing limited, small and frequent changes has resulted in a smooth project, even if sometimes frequent releases do not feel smooth. Some thoughts: 
    • Releasing to a small subset of internal users allowed us to take risks and break things more often so we could learn from our failures safely.
    • Releasing small changes allowed us to more easily understand and predict the behaviour of our changes: what to look out for in case things went wrong, how to more easily measure success, etc. 
    • Releasing frequently built confidence within the wider Instaclustr team, which in turn meant we would be happier taking more risks and could release more often.  
  • Releasing to internal and public preview helped create momentum within the Instaclustr business and teams:  
    • This turned the Apache Cassandra 5.0 release from something that “was coming soon and very exciting” to “something I can actually use.”
  • Communicating frequently, transparently, and efficiently is the foundation of success:  
    • We used a dedicated Slack channel (very creatively named #cassandra-5-project) to discuss everything. 
    • It was quick and easy to go back to see why we made certain decisions or revisit them if needed. This had a bonus of allowing a Lead Engineer to write a blog post very quickly about the Cassandra 5 project. 

This has been a long-running but very exciting project for the entire team here at Instaclustr. The Apache Cassandra community is on the home stretch for this massive release, and we couldn’t be more excited to start seeing what everyone will build with it.  

You can sign up today for a free trial and test Apache Cassandra 5 Release Candidate 1 by creating a cluster on the Instaclustr Managed Platform.  


The post Apache Cassandra® 5.0: Behind the Scenes appeared first on Instaclustr.

easy-cass-lab v5 released

I’ve got some fun news to start the week off for users of easy-cass-lab: I’ve just released version 5. There are a number of nice improvements and bug fixes in here that should make it more enjoyable, more useful, and lay groundwork for some future enhancements.

  • When the cluster starts, we wait for the storage service to reach NORMAL state, then move to the next node. This is in contrast to the previous behavior where we waited for 2 minutes after starting a node. This queries JMX directly using Swiss Java Knife and is more reliable than the 2-minute method. Please see packer/bin-cassandra/wait-for-up-normal to read through the implementation.
  • Trunk now works correctly. Unfortunately, AxonOps doesn’t support trunk (5.1) yet, and using the agent was causing a startup error. You can test trunk out, but for now the AxonOps integration is disabled.
  • Added a new repl mode. This saves keystrokes and provides some auto-complete functionality and keeps SSH connections open. If you’re going to do a lot of work with ECL this will help you be a little more efficient. You can try this out with ecl repl.
  • Power user feature: Initial support for profiles in AWS regions other than us-west-2. We only provide AMIs for us-west-2, but you can now set up a profile in an alternate region, and build the required AMIs using easy-cass-lab build-image. This feature is still under development and requires using an easy-cass-lab build from source. Credit to Jordan West for contributing this work.
  • Power user feature: Support for multiple profiles. Setting the EASY_CASS_LAB_PROFILE environment variable allows you to configure alternate profiles. This is handy if you want to use multiple regions or have multiple organizations.
  • The project now uses Kotlin instead of Groovy for Gradle configuration.
  • Updated Gradle to 8.9.
  • When using the list command, don’t show the alias “current”.
  • Project cleanup: removed the old, unused pssh, cassandra build, and async profiler subprojects.
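
The first item in the list above replaces a fixed 2-minute wait with a check of the node's actual state. The real implementation is the script at packer/bin-cassandra/wait-for-up-normal, which queries JMX using Swiss Java Knife. The Python sketch below is not that script; it only illustrates the general poll-until-ready pattern, with a hypothetical get_state callable standing in for the JMX query.

```python
import time
from typing import Callable


def wait_for_state(
    get_state: Callable[[], str],
    wanted: str = "NORMAL",
    timeout_s: float = 120.0,
    interval_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll get_state() until it returns `wanted`; return False on timeout."""
    deadline = clock() + timeout_s
    while True:
        if get_state() == wanted:
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)


# Usage with a fake node that needs two polls before it reports NORMAL.
# `sleep` is stubbed out so the example finishes immediately.
states = iter(["STARTING", "JOINING", "NORMAL"])
print(wait_for_state(lambda: next(states), sleep=lambda _: None))  # True
```

Compared with a fixed sleep, polling returns as soon as the node is ready and reports a failure if it never gets there, instead of silently moving on to the next node.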

The release has been published to the project’s GitHub page and to Homebrew. The project is largely driven by my own consulting needs and for my training. If you’re looking to have some features prioritized please reach out, and we can discuss a consulting engagement.

easy-cass-lab updated with Cassandra 5.0 RC-1 Support

I’m excited to announce that the latest version of easy-cass-lab now supports Cassandra 5.0 RC-1, which was just made available last week! This update marks a significant milestone, providing users with the ability to test and experiment with the newest Cassandra 5.0 features in a simplified manner. This post will walk you through how to set up a cluster, SSH in, and run your first stress test.

For those new to easy-cass-lab, it’s a tool designed to streamline the setup and management of Cassandra clusters in AWS, making it accessible for both new and experienced users. Whether you’re running tests, developing new features, or just exploring Cassandra, easy-cass-lab is your go-to tool.