The Developer’s Data Modeling Cheat Guide

In Cassandra 5.0, storage-attached indexes (SAI) offer a new way to query your data, giving developers the flexibility to filter on multiple columns, apply range conditions, and get better performance.
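For illustration, here is a minimal sketch of how a storage-attached index can be created and queried in Cassandra 5.0 (the keyspace, table, and column names are made up for the example, and the keyspace is assumed to already exist):

  CREATE TABLE shop.orders (
      id         uuid PRIMARY KEY,
      customer   text,
      total      double,
      created_at timestamp
  );

  -- One SAI index per column you want to filter on
  CREATE INDEX orders_customer_idx ON shop.orders (customer) USING 'sai';
  CREATE INDEX orders_total_idx    ON shop.orders (total) USING 'sai';

  -- Filter on multiple indexed columns, including a range condition,
  -- without resorting to ALLOW FILTERING
  SELECT id, total
    FROM shop.orders
   WHERE customer = 'alice' AND total > 100;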

ScyllaDB’s Engineering Summit in Sofia, Bulgaria

From hacking to hiking: what happens when engineering sea monsters get together

ScyllaDB is a remote-first company, with team members spread across the globe. We’re masters of virtual connection, but every year we look forward to the chance to step away from our screens and come together for our Engineering Summit. It’s a time to reconnect, exchange ideas, and share spontaneous moments of joy that make working together so special. This year, we gathered in Sofia, Bulgaria — a city rich in history and culture, set against the stunning backdrop of the Balkan mountains.

Where Monsters Meet

This year’s summit brought together a record-breaking number of participants from across the globe—over 150! As ScyllaDB continues to grow, the turnout reflects our momentum and our team’s expanding global reach. We, ScyllaDB Monsters from all corners of the world, came together to share knowledge, build connections, and collaborate on shaping the future of our company and product.

The summit brought together not just the engineering teams but also our Customer Experience (CX) colleagues. With their insights into the real-world experiences of our customers, the CX team helped us see the bigger picture and better understand how our work impacts those who use our product.

Looking Inward, Moving Forward

The summit was packed with insightful talks, giving us a chance to reflect on where we are and where we’re heading next. It was all about looking back at the wins we’ve had so far, getting excited about the new features we’re working on now, and diving deep into what’s coming down the pipeline. The sessions sparked fruitful discussions about how we can keep pushing forward and build on the strong foundation we’ve already laid. The speakers touched on a variety of topics, including: ScyllaDB X Cloud, consistent topology, data distribution with tablets, object storage, tombstone garbage collection, customer stories, improving customer experience, and many more.

Collaboration at Its Best: The Hackathon

This year, we took the summit energy to the next level with a hackathon that brought out the best in creativity, collaboration, and problem-solving. Participants were divided into small teams, each tackling a unique problem. The projects were chosen ahead of time so that we had a chance to work on real challenges that could make a tangible impact on our product and processes. The range of projects was diverse. Some teams focused on adding new features, like implementing a notification API to enhance the user experience. Others took on documentation-related challenges, improving the way we share knowledge. But across the board, every team managed to create a functioning solution or prototype.

The hackathon brought people from different teams together to tackle complex issues, pushing everyone a bit outside their comfort zone. Beyond the technical achievements, it was a powerful team-building experience, reinforcing our culture of collaboration and shared purpose. It reminded us that solving real-life challenges—and doing it together—makes our work even more rewarding. The hackathon will undoubtedly be a highlight of future summits!

From Development to Dance

And then, of course, came the party. The atmosphere shifted from work to celebration with live music from a band playing all-time hits, followed by a DJ spinning tracks that kept everyone on their feet. Almost everyone hit the dance floor—even those who usually prefer to sit it out couldn’t resist the rhythm. It was the perfect way to unwind and celebrate the success of the summit!

Exploring Sofia and Beyond

During our time in Sofia, we had the chance to immerse ourselves in the city’s rich history and culture. Framed by the dramatic Balkan mountains, Sofia blends the old with the new, offering a mix of history, culture, and modern vibe. We wandered through the ancient ruins of the Roman Theater and visited the iconic Alexander Nevsky Cathedral, marveling at their beauty and historical significance. To recharge our batteries, we enjoyed delicious meals in modern Bulgarian restaurants.

But the adventure didn’t stop in the city. We took a day trip to the Rila Mountains, where the breathtaking landscapes and serene atmosphere left us in awe. One of the standout sights was the Rila Monastery, a UNESCO World Heritage site known for its stunning architecture and spiritual significance. After soaking in the peaceful vibes of the monastery, we hiked the trail leading to the Stob Earth Pyramids, a natural wonder that looked almost otherworldly. The hike was rewarding, offering stunning views of the mountains and the unique rock formations below. It was the perfect way to experience Bulgaria’s natural beauty while winding down from the summit excitement.

Looking Ahead to the Future

As we wrapped up this year’s summit, we left feeling energized by the connections made, ideas shared, and challenges overcome. From brainstorming ideas to clinking glasses around the dinner table, this summit was a reminder of why in-person gatherings are so valuable—connecting not just as colleagues but as a team united by a common purpose. As ScyllaDB continues to expand, we’re excited for what lies ahead, and we can’t wait to meet again next year. Until then, we’ll carry the lessons, memories, and new friendships with us as we keep moving forward. Чао!

We’re hiring – join our team!

How ScyllaDB Simulates Real-World Production Workloads with the Rust-Based “latte” Benchmarking Tool

Learn why we use a little-known benchmarking tool for testing Before using a tech product, it’s always nice to know its capabilities and limits. In the world of databases, there are a lot of different benchmarking tools that help us assess that… If you’re ok with some standard benchmarking scenarios, you’re set – one of the existing tools will probably serve you well. But what if not? Rigorously assessing ScyllaDB, a high-performance distributed database, requires testing some rather specific scenarios, ones with real-world production workloads. Fortunately, there is a tool to help with that. It is latte: a Rust-based lightweight benchmarking tool for Apache Cassandra and ScyllaDB.
Special thanks to Piotr Kołaczkowski for implementing the latte benchmarking tool.
We (the ScyllaDB testing team) forked it and enhanced it. In this blog post, I’ll share why and how we adopted it for our specialized testing needs.

About latte

Our team really values latte’s “flexibility.” Want to create a schema using a user-defined type (UDT), Map, Set, List, or any other data type? Ok. Want to create a materialized view and query it? Ok. Want to change custom function behavior based on elapsed time? Ok. Want to run multiple custom functions in parallel? Ok. Want to use small, medium, and large partitions? Ok. Basically, latte lets us define any schema and workload functions.

We can do this thanks to its implementation design. The latte tool is an engine/kernel, and rune scripts are essentially the “business logic” written separately. Rune scripts are an enhanced, more powerful analog of what cassandra-stress calls user profiles. The rune scripting language is dynamically typed and native to the Rust programming language ecosystem.

In a simple rune script, we define two required functions (schema and prepare) plus one custom function to be used as our workload – myinsert. First, we create the schema; then, we use the latte run command to call our custom myinsert function. Such a script can also declare custom parameters – for example, a replication_factor parameter. If we do not specify it, latte will use its default value, 3. We can define any number of custom parameters.

How is latte different from other benchmarking tools?

Based on our team’s experiences, we compared latte against the two main alternatives: cassandra-stress and YCSB.

How is our fork of latte different from the original latte project?

At ScyllaDB, our main use case for latte is testing complex and realistic customer scenarios with controlled disruptions. But (from what we understand), the original project was designed to perform general latency measurements in healthy DB clusters. Given these different goals, we changed some features (“overlapping features”) and added other new ones (“unique to our fork”). Here’s an overview:

Overlapping feature differences

  • Latency measurement. Fork-latte accounts for coordinated omission in latencies; the original project doesn’t consider the “coordinated omission” phenomenon.
  • Saturated DB impact. When a system under test cannot keep up with the load/stress, fork-latte tries to satisfy the “rate,” compensating for missed scheduler ticks as soon as possible. Source-latte pulls back on violating the rate requirement and doesn’t later compensate for missed scheduler ticks. This isn’t a “bug”; it is a design decision, but one that also conflicts with proper latency calculation in the presence of coordinated omission.
  • Retries. We enabled retries by default; in the original, they are disabled by default.
  • Prepared statements. Fork-latte supports all the CQL data types available in the ScyllaDB Rust Driver. The source project has limited support for CQL data types.
  • ScyllaDB Rust Driver. Our fork uses the latest version, 1.2.0; the source project sticks to the old version, 0.13.2.
  • Stress execution reporting. The report is disabled by default in fork-latte; it’s enabled in source-latte.

Features unique to our fork

  • Preferred datacenter support. Useful for testing multi-DC DB setups.
  • Preferred rack support. Useful for testing multi-rack DB setups.
  • Possibility to get a list of actual datacenter values from the DB nodes that the driver connected to. Useful for creating schemas with DC-based keyspaces.
  • Sine-wave rate limiting.
Useful for SLA/Workload Prioritization demos and OLTP testing with peaks and lows.
  • Batch query type support.
  • Multi-row partitions. Our fork can create multi-row partitions of different sizes.
  • Page size support for select queries. Useful when using the multi-row partitions feature.
  • HDR histogram support. The source project has only one way to get HDR histogram data: it stores the data in RAM until the end of a stress command execution and only then releases it as part of a report, which leaks RAM. Forked latte supports that inherited approach plus one more: real-time streaming of HDR histogram data without storing it in RAM – no RAM leaks.
  • Row count validation for select queries. Useful for testing data resurrection.

Example: Testing multi-row partitions of different sizes

Let’s look at one specific user scenario where we applied our fork of latte to test ScyllaDB. For background, one of the user’s ScyllaDB production clusters was using large partitions that could be grouped by size into 3 groups: 2000, 4000, and 8000 rows per partition. 95% of the partitions had 2000 rows, 4% of partitions had 4000 rows, and the last 1% of partitions had 8000 rows. The target table had 20+ different columns of different types. Also, ScyllaDB’s Secondary Indexes (SI) feature was enabled for the target table.

One day, on one of the cloud providers, latencies spiked and throughput dropped. The source of the problem was not immediately clear. To learn more, we needed a quick way to reproduce the customer’s workload in a test environment. Using the latte tool and its great flexibility, we created a rune script covering all the above specifics.

Assume we have a ScyllaDB cluster where one of the nodes has the IP address 172.17.0.2. We first run a command to create the schema we need, then a command to populate the just-created table. To read from the main table and from the MV, we use a similar command – just replacing the function name with get and get_from_mv respectively. These commands gave us a stable issue reproducer and let us work on a solution.

Working with ScyllaDB’s Workload Prioritization feature

In other cases, we needed to create a Workload Prioritization (WLP) demo and test an OLTP setup with continuous peaks and lows to showcase giving priority to different workloads. For these scenarios, we used a special latte feature called sine wave rate. This is an extension to the common rate-limiting feature, which allows us to specify how many operations per second we want to produce; with the sine-wave extension, the rate follows configurable peaks and lows, and the operations-per-second graph in the monitoring follows the configured sine wave.

Internal testing of tombstones (validation)

As of June 2025, forked latte supports row count validation, which is useful for testing data resurrection. To demonstrate these capabilities with a rune script, we create the schema first, as before, and then populate the table with 100k rows. To check that all rows are in place, we use a similar command, change the function to get, and define the validation strategy to be fail-fast. The supported validation strategies are retry, fail-fast, and ignore. Then, we run 2 different commands in parallel.
The first command deletes part of the rows; the second one knows, based on the timing of actions during the two commands’ runtime, when we expect 1 row and when we expect none. That’s a simple example of how we can check whether data got deleted or not. In long-running testing scenarios, we might run more parallel commands, make them depend on the elapsed time, and use many other flexibilities.

Conclusions

Yes, to take advantage of latte, you first need to study a bit of rune scripting. But once you’ve done that to some extent, especially with examples available, it becomes a powerful tool capable of covering various scenarios of different types.
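To make the multi-row-partition scenario described earlier a bit more concrete, here is a rough CQL sketch of the kind of schema such a rune script would create (the keyspace, table, view, and column names are illustrative, not the customer’s actual model):

  CREATE KEYSPACE IF NOT EXISTS latte_demo
    WITH replication = {'class': 'NetworkTopologyStrategy', 'replication_factor': 3};

  -- A partition key plus a clustering key lets a single partition hold
  -- 2000, 4000, or 8000 rows, matching the size groups described above.
  CREATE TABLE IF NOT EXISTS latte_demo.big_partitions (
      pk    bigint,
      ck    bigint,
      col_1 text,
      col_2 int,
      -- the real table had 20+ columns of different types
      PRIMARY KEY (pk, ck)
  );

  -- The customer also had a Secondary Index enabled on the table
  CREATE INDEX IF NOT EXISTS big_partitions_col_1_idx ON latte_demo.big_partitions (col_1);

  -- A materialized view over the same data, to exercise the get_from_mv path
  CREATE MATERIALIZED VIEW IF NOT EXISTS latte_demo.big_partitions_by_col_2 AS
      SELECT * FROM latte_demo.big_partitions
      WHERE pk IS NOT NULL AND ck IS NOT NULL AND col_2 IS NOT NULL
      PRIMARY KEY (col_2, pk, ck);

With a schema like this, the 95/4/1 split between 2000-, 4000-, and 8000-row partitions is controlled by how many clustering rows the rune script writes per partition key.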

ScyllaDB Tablets: Answering Your Top Questions

What does your team need to know about tablets – at a purely pragmatic level? Here are answers to the top user questions.

The latest ScyllaDB releases feature some significant architectural shifts. Tablets build upon a multi-year project to re-architect our legacy ring architecture. And our metadata is now fully consistent, thanks to the assistance of Raft. Together, these changes can help teams with elasticity, speed, and operational simplicity. Avi Kivity, our CTO and co-founder, provided a detailed look at why and how we made this shift in a series of blogs (Why ScyllaDB Moved to “Tablets” Data Distribution and How We Implemented ScyllaDB’s “Tablets” Data Distribution). Join Avi for a technical deep dive at our upcoming livestream. We also recently created a quick demo that shows what this looks like in action, from the point of view of a database user/operator.

But what does your team need to know – at a purely pragmatic level? Here are some of the questions we’ve heard from interested users, and a short summary of how we answer them.

What’s the TL;DR on tablets?

Tablets are the smallest replication unit in ScyllaDB. Data gets distributed by splitting tables into smaller logical pieces called tablets, which allows ScyllaDB to shift from a static to a dynamic topology. Tablets are dynamically balanced across the cluster using the Raft consensus protocol. They were introduced as part of a project to bring more elasticity to ScyllaDB, enabling faster topology changes and seamless scaling.

Tablets acknowledge that most workloads do not follow a static traffic pattern; most follow a cyclical curve with a different baseline and peaks over a period of time. By decoupling topology changes from the actual streaming of data, tablets present significant cost-saving opportunities for users adopting ScyllaDB by allowing infrastructure to be scaled on demand, fast. Previously, adding or removing nodes required a sequential, one-at-a-time, serialized process with data streaming and rebalancing. Now, you can add or remove multiple nodes in parallel. This significantly speeds up the scaling process and makes ScyllaDB much more elastic.

Tablets are distributed on a per-table basis, with each table having its own set of tablets. The tablets are then further distributed across the shards in the ScyllaDB cluster. The distribution is handled automatically by ScyllaDB, with tablets being dynamically migrated across replicas as needed. Data within a table is split across tablets based on the average geometric size of a token range boundary.

How do I configure tablets?

Tablets are enabled by default in ScyllaDB 2025.1 and are also available with ScyllaDB Cloud. When creating a new keyspace, you can specify whether to enable tablets or not (see the example sketch at the end of this section). There are three key configuration options for tablets: 1) the enable_tablets boolean setting, 2) target_tablet_size_in_bytes (default is 5GB), and 3) the tablets property of a CREATE KEYSPACE statement. Here are a few tips for configuring these settings:

  • enable_tablets indicates whether newly created keyspaces should rely on tablets for data distribution. Note that tablets are currently not yet enabled for workloads requiring the use of Counter Tables, Secondary Indexes, Materialized Views, CDC, or LWT.
  • target_tablet_size_in_bytes indicates the average geometric size of a tablet, and is particularly useful during tablet split and merge operations.
The default indicates splits are done when a tablet reaches 10GB and merges happen at 2.5GB. A higher value means tablet migration throughput can be reduced (due to larger tablets), whereas a lower value may significantly increase the number of tablets.
  • The tablets property allows you to opt in to tablets on a per-keyspace basis via the ‘enabled’ boolean sub-option. This is particularly important if some of your workloads rely on the currently unsupported features mentioned earlier: you may opt out for those tables and fall back to the still-supported vNode-based replication strategy. Still under the tablets property, the ‘initial’ sub-option determines how many tablets are created upfront on a per-table basis. We recommend that you target 100 tablets/shard. In future releases, we’ll introduce per-table tablet options to extend and simplify this process while deprecating the keyspace sub-option.

How/why should I monitor tablet distribution?

Starting with ScyllaDB Monitoring 4.7, we introduced two additional panels for the observability and distribution of tablets within a ScyllaDB cluster. These metrics are present within the Detailed dashboard under the Tablets section.

The Tablets over time panel is a heatmap showing the tablet distribution over time. As the data size of a tablet-enabled table grows, you should observe the number of tablets increasing (tablet split) and being automatically distributed by ScyllaDB. Likewise, as the table size shrinks, the number of tablets should be reduced (tablet merge, but to no fewer than your initially configured ‘initial’ value within the keyspace tablets property). Similarly, as you perform topology changes (e.g., adding nodes), you can monitor the tablet distribution progress. You’ll notice that existing replicas will have their tablet count reduced while new replicas will see theirs increase.

The Tablets per DC/Instance/Shard panel shows the absolute count of tablets within the cluster. Under homogeneous setups running on top of instances of the same size, this metric should be evenly balanced. However, the situation changes for heterogeneous setups with different shard counts. In that situation, it is expected that larger instances will hold more tablets, given their additional processing power. This is, in fact, yet another benefit of tablets: the ability to run heterogeneous setups and leave it up to the database to determine how to internally maximize each instance’s performance capabilities.

What are the impacts of tablets on maintenance tasks like node cleanup?

The primary benefit of tablets is elasticity. Tablets allow you to easily and quickly scale your database infrastructure out and in without hassle. This not only translates to infrastructure savings (like avoiding being overprovisioned for the peak all the time); it also allows you to reach a higher percentage of storage utilization before rushing to add more nodes – so you can better utilize the underlying infrastructure you pay for.

Another key benefit of tablets is that they eliminate the need for maintenance tasks like node cleanup. Previously, after scaling out the cluster, operators would need to run node cleanup to ensure data was properly evicted from nodes that no longer owned certain token ranges. With tablets, this is no longer necessary. The compaction process automatically handles the migration of data as tablets are dynamically balanced across the cluster. This is a significant operational improvement that reduces the maintenance burden for ScyllaDB users.
The ability to run heterogeneous deployments without going through cryptic, hours-long tuning cycles is also a plus. ScyllaDB’s tablet load balancer is smart enough to figure out how to distribute and place your data. It considers the amount of compute resources available, reducing the risk of traffic hotspots or data imbalances that may affect your cluster’s performance. In the future, ScyllaDB will bring transparent repairs on a per-tablet basis, further eliminating the need for users to worry about repairing their clusters, and will also provide “temperature-based balancing” so that hot partitions get split and other shards cooperate with the incoming load.

Do I need to change drivers?

ScyllaDB’s latest drivers are tablet-aware, meaning they understand the tablets concept and can route queries to the correct nodes and shards. However, the drivers do not directly query the internal system.tablets table; that could become unwieldy as the number of tablets grows. Furthermore, tablets are transient: a replica owning a tablet may no longer be a natural endpoint for it as time goes by. Instead, the drivers use a dynamic routing process: when a query is sent to the wrong node/shard, the coordinator will respond with the correct routing information, allowing the driver to update its routing cache. This ensures efficient query routing as tablets are migrated across the cluster.

When using ScyllaDB tablets, it’s more important than ever to use ScyllaDB shard-aware – and now also tablet-aware – drivers instead of Cassandra drivers. The existing drivers will still work, but they won’t work as efficiently because they lack the logic to understand the coordinator-provided tablet metadata. Using the latest ScyllaDB drivers should provide a nice throughput and latency boost. Read more in How We Updated ScyllaDB Drivers for Tablets Elasticity.

More questions?

If you’re interested in tablets and we didn’t answer your question here, please reach out to us! Our Contact Us page offers a number of ways to interact, including a community forum and Slack.
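As promised above, here is a minimal sketch of the per-keyspace tablets configuration (the keyspace names, replication settings, and initial tablet count are illustrative; the ‘enabled’ and ‘initial’ sub-options are the ones described in this post):

  -- Keyspace that uses tablets and pre-creates 128 tablets per table
  CREATE KEYSPACE app_ks
    WITH replication = {'class': 'NetworkTopologyStrategy', 'replication_factor': 3}
    AND tablets = {'enabled': true, 'initial': 128};

  -- Keyspace that opts out of tablets and falls back to vNode-based distribution,
  -- e.g., for workloads that still need Materialized Views, Secondary Indexes, CDC, or LWT
  CREATE KEYSPACE legacy_ks
    WITH replication = {'class': 'NetworkTopologyStrategy', 'replication_factor': 3}
    AND tablets = {'enabled': false};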

Introducing ScyllaDB X Cloud: A (Mostly) Technical Overview

ScyllaDB X Cloud just landed! It’s a truly elastic database that supports variable/unpredictable workloads with consistent low latency, plus low costs.

The ScyllaDB team is excited to announce ScyllaDB X Cloud, the next generation of our fully managed database-as-a-service. It features architectural enhancements for greater flexibility and lower cost. ScyllaDB X Cloud is a truly elastic database designed to support variable/unpredictable workloads with consistent low latency as well as low costs. A few spoilers before we get into the details:

  • You can now scale out and scale in almost instantly to match actual usage, hour by hour. For example, you can scale all the way from 100K OPS to 2M OPS in just minutes, with consistent single-digit millisecond P99 latency. This means you don’t need to overprovision for the worst-case scenario or suffer latency hits while waiting for autoscaling to fully kick in.
  • You can now safely run at 90% storage utilization, compared to the standard 70% utilization. This means you need fewer underlying servers and have substantially less infrastructure to pay for.
  • Optimizations like file-based streaming and dictionary-based compression also speed up scaling and reduce network costs.

Beyond the technical changes, there’s also an important pricing update. To go along with all this database flexibility, we’re now offering a “Flex Credit” pricing model. Basically, this gives you the flexibility of on-demand pricing with the cost advantage that comes from an annual commitment.

Access ScyllaDB X Cloud Now

If you want to get started right away, just go to ScyllaDB Cloud and choose the X Cloud cluster type when you create a cluster. This is our code name for the new type of cluster that enables greater elasticity, higher storage utilization, and automatic scaling. Note that X Cloud clusters are available from the ScyllaDB Cloud application and API. They’re available on AWS and GCP, running on a ScyllaDB account or your company’s account with the Bring Your Own Account (BYOA) model.

Sneak peek: In the next release, you won’t need to choose an instance size or number of servers if you select the X Cloud option. Instead, you will be able to define a serverless scaling policy and let X Cloud scale the cluster as required.

If you want to learn more, keep reading. In this blog post, we’ll cover what’s behind the technical changes and also talk a little about the new pricing option. But first, let’s start with the why.

Backstory

Why did we do this? Consider this example from a marketing/AdTech platform that provides event-based targeting. Such a pattern, with predictable/cyclical daily peaks and a low baseline during off-hours, is quite common across retail platforms, food delivery services, and other applications aligned with customer work hours. In this case, the peak loads are 3x the base and require 2-3x the resources. With ScyllaDB X Cloud, they can provision for the baseline and quickly scale in/out as needed to serve the peaks. They get the steady low latency they need without having to overprovision – paying for peak capacity 24/7 when it’s really only needed for 4 hours a day.

Tablets + just-in-time autoscaling

If you follow ScyllaDB, you know that tablets aren’t new. We introduced them last year for ScyllaDB Enterprise (self-managed on the cloud or on-prem). Avi Kivity, our CTO, already provided a look at why and how we implemented tablets.
With tablets, data gets distributed by splitting tables into smaller logical pieces (“tablets”), which are dynamically balanced across the cluster using the Raft consensus protocol. This enables you to scale your database as rapidly as you can scale your infrastructure. In a self-managed ScyllaDB deployment, tablets make it much faster and simpler to expand and reduce your database capacity. However, you still need to plan ahead for expansion and initiate the operations yourself.

ScyllaDB X Cloud lets you take full advantage of tablets’ elasticity. Scaling can be triggered automatically based on storage capacity (more on this below) or based on your knowledge of expected usage patterns. Moreover, as capacity expands and contracts, we’ll automatically optimize both node count and utilization. You don’t even have to choose node size; ScyllaDB X Cloud’s storage-utilization target does that for you. This should simplify admin and also save costs.

90% storage utilization

ScyllaDB has always handled running at 100% compute utilization well by having automated internal schedulers manage compactions, repairs, and lower-priority tasks in a way that prioritizes performance. Now, it also does two things that let you increase the maximum storage utilization to 90%:

  • Since tablets can move data to new nodes so much faster, ScyllaDB X Cloud can defer scaling until the very last minute.
  • Support for mixed instance sizes allows ScyllaDB X Cloud to allocate minimal additional resources to keep usage close to 90%.

Previously, we recommended adding nodes at 70% capacity. This was because node additions were unpredictable and slow — sometimes taking hours or days — and you risked running out of space. We’d send a soft alert at 50% and automatically add nodes at 70%. However, those big nodes often sat underutilized. With ScyllaDB X Cloud’s tablets architecture, we can safely target 90% utilization. That’s particularly helpful for teams with storage-bound workloads.

Support for mixed size clusters

A little more on the “mixed instance size” support mentioned earlier. Basically, this means that ScyllaDB X Cloud can now add the exact mix of nodes you need to meet the exact capacity you need at any given time. Previous versions of ScyllaDB used a single instance size across all nodes in the cluster. For example, if you had a cluster with 3 i4i.16xlarge instances, increasing the capacity meant adding another i4i.16xlarge. That works, but it’s wasteful: you’re paying for a big node that you might not immediately need.

Now, thanks to tablets and support for mixed instance sizes, ScyllaDB X Cloud can scale in much smaller increments. You can add tiny instances first, then replace them with larger ones if needed. That means you rarely pay for unused capacity. For example, before, if you started with an i4i.16xlarge node that had 15 TB of storage and you hit 70% utilization, you had to launch another i4i.16xlarge — adding 15 TB at once. With ScyllaDB X Cloud, you might add two xlarge nodes (2 TB each) first. Then, if you need more storage, you add more small nodes and eventually replace them with larger nodes. (And by the way, i7i instances are now available too, and they are even more powerful.) The key is granular, just-in-time scaling: you only add what you need, when you need it.

This applies in reverse, too. Before, you had to decommission a large node all at once.
Now, ScyllaDB X Cloud can remove smaller nodes gradually, based on the policies you set, saving compute and storage costs.

Network-focused engineering optimizations

Every gigabyte leaving a node, crossing an Availability Zone (AZ) boundary, or replicating to another region shows up on your AWS, GCP, or Azure bill. That’s why we’ve done some engineering work at different layers of ScyllaDB to shrink those bytes—and the dollars tied to them.

File-based streaming

We anticipated that mutation-based streaming would hold us back once we moved to tablets. So we shifted to a new approach: stream the entire SSTable files without deserializing them into mutation fragments and re-serializing them back into SSTables on receiving nodes. As a result, less data is streamed over the network and less CPU is consumed, especially for data models that contain small cells. Think of it as Cassandra’s zero-copy streaming, except that we keep ownership metadata with each replica. You can read more about this in the blog Why We Changed ScyllaDB’s Data Streaming Approach.

Dictionary-based compression

We also introduced dictionary-trained Zstandard (Zstd) compression, which is pipeline-aware. This involved building a custom RPC compressor with external dictionary support, plus a mechanism that trains new dictionaries on RPC traffic, distributes them over the cluster, and performs a live switch of connections to the new dictionaries. This is done in 4 key steps:

  • Sample: continuously sample RPC traffic for some time
  • Train: train a 100 KiB dictionary on a 16 MiB sample
  • Distribute: distribute the new dictionary via a system distributed table
  • Switch: negotiate the switch separately within each connection

For comparison: LZ4 (Cassandra’s default) leaves you at 72% of the original size. Generic Zstd cuts that to 50%. Our per-cluster Zstd dictionary takes it down to 30%, which is a 3X improvement over the default Cassandra compression.

Flex Credit

To close, let’s shift from the technical changes to a major pricing change: Flex Credit. Flex Credit is a new way to consume a ScyllaDB Cloud subscription. It can be applied to ScyllaDB Cloud as well as ScyllaDB Enterprise. Flex Credit provides the flexibility of on-demand pricing at a lower cost via an annual commitment. In combination with X Cloud, Flex Credit can be a great tool to reduce cost. You can use Reserved pricing for a load that’s known in advance and use Flex for less predictable bursts. This saves you from paying the higher on-demand pricing for anything above the reserved part.

How might this play out in your day-to-day work? Imagine your baseline workload handles 100K OPS, but sometimes it spikes to 400K OPS. Previously, you’d have to provision (and pay for) enough capacity to sustain 400K OPS at all times. That’s inefficient and costly. With ScyllaDB X Cloud, you reserve 100K OPS upfront. When a spike hits, we automatically call the API to spin up “flex capacity” – instantly scaling you to 400K OPS – and then tear it down when traffic subsides. You only pay for the extra capacity during the peak.

Not sure what to choose? We can help advise based on your workload specifics (contact your representative or ping us here), but here’s some quick guidance in the meantime.

Reserved Capacity: The most cost-effective option across all plans. Commit to a set number of cluster nodes or machines for a year. You lock in lower rates and guarantee capacity availability. This is ideal if your cluster size is relatively stable.
Hybrid Model: Reserved + On-Demand: Commit to a baseline reserved capacity to lock in lower rates, but if you exceed that baseline (e.g., because you have a traffic spike), you can scale with on-demand capacity at an hourly rate. This is good if your usage is mostly stable but occasionally spikes.

Hybrid Model: Reserved + Flex Credit: Commit to baseline reserved capacity for the lowest rates. For peak usage, use pre-purchased flex credit (which is discounted) instead of paying on-demand prices. Flex credit also applies to network and backup usage at standard provider rates. This is ideal if you have predictable peak periods (e.g., seasonal spikes, event-driven workload surges, etc.). You get the best of both worlds: low baseline costs and cost-efficient peak capacity.

Recap

In summary, ScyllaDB X Cloud uses tablets to enable faster, more granular scaling with mixed-instance sizes. This lets you avoid overprovisioning and safely run at 90% storage utilization. All of this will help you respond to volatile/unpredictable demand with low latencies and low costs. Moreover, flexible pricing (on-demand, flex credit, reserved) will help you pay only for what you need, especially when you have tablets scaling your capacity up and down in response to traffic spikes. There are also some network cost optimizations through file-based streaming and improved compression.

Want to learn more? Our Co-Founder/CTO Avi Kivity will be discussing the design decisions behind ScyllaDB X Cloud’s elasticity and efficiency. Join us for the engineering deep dive on July 10: ScyllaDB X Cloud: An Inside Look with Avi Kivity.

A New Way to Estimate DynamoDB Costs

We built a new DynamoDB cost analyzer that helps developers understand what their workloads will really cost.

DynamoDB costs can blindside you. Teams regularly face “bill shock”: that sinking feeling when you look at a shockingly high bill and realize that you haven’t paid enough attention to your usage, especially with on-demand pricing. Provisioned capacity brings a different risk: performance. If you can’t accurately predict capacity or your math is off, requests get throttled. It’s a delicate balancing act.

Although AWS offers a DynamoDB pricing calculator, it often misses the nuances of real-world workloads (e.g., bursty traffic, uneven access patterns, global tables, or caching). We wanted something better. In full transparency, we wanted something better to help the teams considering ScyllaDB as a DynamoDB alternative. So we built a new DynamoDB cost calculator that helps developers understand what their workloads will really cost. Although we designed it for teams comparing DynamoDB with ScyllaDB, we believe it’s useful for anyone looking to more accurately estimate their DynamoDB costs, for any reason. You can see the live version at calculator.scylladb.com.

How We Built It

We wanted to build something that would work client side, without the need for any server components. It’s a simple JavaScript single-page application that we currently host on GitHub Pages. If you want to check out the source code, feel free to take a look at https://github.com/scylladb/calculator

To be honest, working with the examples at https://calculator.aws/ was a bit of a nightmare; when you “show calculations,” you get walls of text. I was tempted to take a shorter approach, like:

Monthly WCU Cost = WCUs × Price_per_WCU_per_hour × 730 hours/month

But every time I simplified this, I found it harder to get parity between what I calculated and the final price in AWS’s calculation. Sometimes the difference was due to rounding, other times it was due to the mixture of reserved + provisioned capacity, and so on. So to make it easier (for me) to debug, I faithfully followed their calculations line by line and replicated them in my own rather ugly function: https://github.com/scylladb/calculator/blob/main/src/calculator.js

I may still refactor this into smaller functions, but for now, I wanted to get parity between theirs and ours. You’ll see that there are also some end-to-end tests for these calculations — I use those to test a bunch of different configurations. I will probably expand on these in time as well.

So that gets the job done for the On-Demand, Provisioned, and Reserved capacity models. If you’ve used AWS’s calculator, you know that you can’t specify things like a peak (or peak width) in On-Demand mode. I’m not sure about their reasoning. I decided it would be easier for users to specify both the baseline and peak for reads and writes (respectively) in On-Demand mode, much like Provisioned capacity.

Another design decision was to represent the traffic using a chart. I do better with visuals, so seeing the peaks and troughs makes it easier for me to understand – and I hope it does for you as well. You’ll also notice that as you change the inputs, the URL query parameters change to reflect those inputs. That’s designed to make it easier to share and reference specific variations of costs.

There’s some other math in there, like figuring out the true cost of Global Tables and understanding derived costs of things like network transfer or DynamoDB Accelerator (DAX).
However, explaining all that is a bit too dense for this format. We’ll talk more about it in an upcoming webinar (see below). The good news is that you can estimate these costs in addition to your workload, as they can be big cost multipliers when planning out your usage of DynamoDB.

Analyzing Costs in Real-World Scenarios

The ultimate goal of all this tinkering and tuning is to help you explore various “what if” scenarios for your own workloads from a DynamoDB cost perspective. To get started, we’re sharing the cost impacts of some of the more interesting DynamoDB user scenarios we’ve come across at ScyllaDB. My colleague Gui and I just got together for a deep dive into how factors like traffic surges, multi-datacenter expansion, and the introduction of caching (e.g., DAX) impact DynamoDB costs. We explored how a few (anonymized) teams we work with ended up blindsided by their DynamoDB bills and the various options they considered for getting costs back under control. Watch the DynamoDB costs chat now.
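As a rough, back-of-the-envelope illustration of the simplified formula above (assuming roughly $0.00065 per provisioned WCU-hour, the long-standing us-east-1 list price; check current AWS pricing before relying on it), a steady 100 provisioned WCUs comes to about:

  100 WCUs × $0.00065/WCU-hour × 730 hours/month ≈ $47.45/month

Real bills rarely reduce to a single line like this once reserved capacity, peaks, storage, and data transfer enter the picture, which is exactly why the calculator follows AWS’s own calculation steps instead.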

Cassandra vs. MongoDB: When to Use Which

Which NoSQL database works best for powering your GenAI use cases? A look at Cassandra vs. MongoDB and which to use when.

Rust Rewrite, Postgres Exit: Blitz Revamps Its “League of Legends” Backend

How Blitz scaled their game coaching app with lower latency and leaner operations

Blitz is a fast-growing startup that provides personalized coaching for games such as League of Legends, Valorant, and Fortnite. They aim to help gamers become League of Legends legends through real-time insights and post-match analysis. While players play, the app does quite a lot of work. It captures live match data, analyzes it quickly, and uses it for real-time game screen overlays plus personalized post-game coaching. The guidance is based on each player’s current and historic game activity, as well as data collected across billions of matches involving hundreds of millions of users.

Thanks to growing awareness of Blitz’s popular stats and game-coaching app, their steadily increasing user base pushed their original Postgres- and Elixir-based architecture to its limits. This blog post explains how they recently overhauled their League of Legends data backend – using Rust and ScyllaDB.

TL;DR – In order to provide low latency, high availability, and horizontal scalability to their growing user base, they ultimately:

  • Migrated backend services from Elixir to Rust.
  • Replaced Postgres with ScyllaDB Cloud.
  • Heavily reduced their Redis footprint.
  • Removed their Riak cluster.
  • Replaced queue processing with real-time processing.
  • Consolidated infrastructure from over a hundred cores of microservices to four n4-standard-4 Google Cloud nodes (plus a small Redis instance for edge caching).

As an added bonus, these changes ended up cutting Blitz’s infrastructure costs and reducing the database burden on their engineering staff.

Blitz Background

As Naveed Khan (Head of Engineering at Blitz) explained, “We collect a lot of data from game publishers and during gameplay. For example, if you’re playing League of Legends, we use Riot’s API to pull match data, and if you install our app we also monitor gameplay in real time. All of this data is stored in our transactional database for initial processing, and most of it eventually ends up in our data lake.”

Scaling Past Postgres

One key part of Blitz’s system is the Playstyles API, which analyzes pre-game data for both teammates and opponents. This intensive process evaluates up to 20 matches per player and runs nine separate times per game (once for each player in the match). The team strategically refactored and consolidated numerous microservices to improve performance. But the data volume remained intense. According to Brian Morin (Principal Backend Engineer at Blitz), “Finding a database solution capable of handling this query volume was critical.”

They originally used Postgres, which served them well early on. However, as their write-heavy workloads scaled, the operational complexity and costs on Google Cloud grew significantly. Moreover, scaling Postgres became quite complex. Naveed shared, “We tried all sorts of things to scale. We built multiple services around Postgres to get the scale we needed: a Redis cluster, a Riak cluster, and Elixir Oban queues that occasionally overflowed. Queue management became a big task.” To stay ahead of the game, they needed to move on.

As startups scale, they often switch from “just use Postgres” to “just use NoSQL.” Fittingly, the Blitz team considered moving to MongoDB, but eventually ruled it out. “We had lots of MongoDB experience in the team and some of us really liked it. However, our workload is very write-heavy, with thousands of concurrent players generating a constant stream of data.
MongoDB uses a single-writer architecture, so scaling writes means vertically scaling one node.” In other words, MongoDB’s primary-secondary architecture would create a bottleneck for their specific workload and anticipated growth.

They then decided to move forward with RocksDB because of its low latency and cost considerations. Tests showed that it would meet their latency needs, so they performed the required data (re)modeling and migrated a few smaller games over from Postgres to RocksDB. However, they ultimately decided against RocksDB due to scale and high-availability concerns. “Based on available data from our testing, it was clear RocksDB wouldn’t be able to handle the load of our bigger games – and we couldn’t risk vertically scaling a single instance, and then having that one instance go down,” Naveed explained.

Why ScyllaDB

One of their backend engineers suggested ScyllaDB, so they reached out and ran a proof of concept. They were primarily looking for a solution that could handle the write throughput, scale horizontally, and provide high availability. They tested it on their own hardware first, then moved to ScyllaDB Cloud. Per Naveed, “The cost was pretty close to self-hosting, and we got full management for free, so it was a no-brainer. We now have a significantly reduced Redis cluster, plus we got rid of the Riak cluster and Oban queue dependencies. Just write to ScyllaDB and it all just works. The amount of time we spend on infrastructure management has significantly decreased.”

Performance-wise, the shift met their goal of leveling up the user experience, and it also simplified life for their engineering teams. Brian added, “ScyllaDB proved exceptional, delivering robust performance with capacity to spare after optimization. Our League product peaks at around 5K ops/sec with the cluster reporting under 20% load. Our biggest constraint has been disk usage, which we’ve rolled out multiple updates to mitigate. The new system can now often return results immediately instead of relying on cached data, providing more up-to-date information on other players and even identifying frequent teammates. The results of this migration have been impressive: over a hundred cores of microservices have been replaced by just four n4-standard-4 nodes and a minimal Redis instance for caching. Additionally, a 3x n2-highmem ScyllaDB cluster has effectively replaced the previous relational database infrastructure that required significant computing resources.”

Rewriting Elixir Services into Rust

As part of a major backend overhaul, the Blitz team began rethinking their entire infrastructure – beyond the previously described shift from Postgres to the high-performance, distributed ScyllaDB. Alongside this database migration, they also chose to sunset their Elixir-based services in favor of a more modern language. After careful evaluation, Rust emerged as the clear choice. “Elixir is great and it served its purpose well,” explained Naveed. “But we wanted to move toward something with broader adoption and a stronger systems-level ecosystem. Rust proved to be a robust and future-proof alternative.” Now that the first batch of Rust-rewritten services is in production, Naveed and team aren’t looking back: “Rust is fantastic. It’s fast, and the compiler forces you to write memory-safe code upfront instead of debugging garbage-collection issues later. Performance is comparable to C, and the talent pool is also much larger compared to Elixir.”

Why We Changed ScyllaDB’s Data Streaming Approach

How moving from mutation-based streaming to file-based streaming resulted in 25X faster streaming time

Data streaming – an internal operation that moves data from node to node over a network – has always been the foundation of various ScyllaDB cluster operations. For example, it is used by “add node” operations to copy data to a new node in a cluster (as well as “remove node” operations to do the opposite). As part of our multiyear project to optimize ScyllaDB’s elasticity, we reworked our approach to streaming. We recognized that when we moved to tablets-based data distribution, mutation-based streaming would hold us back. So we shifted to a new approach: stream the entire SSTable files without deserializing them into mutation fragments and re-serializing them back into SSTables on receiving nodes. As a result, less data is streamed over the network and less CPU is consumed, especially for data models that contain small cells.

Mutation-Based Streaming

In ScyllaDB, data streaming is a low-level mechanism to move data between nodes. For example, when nodes are added to a cluster, streaming moves data from existing nodes to the new nodes. We also use streaming to decommission nodes from the cluster. In this case, streaming moves data from the decommissioned nodes to other nodes in order to balance the data across the cluster. Previously, we used a streaming method called mutation-based streaming. On the sender side, we read the data from multiple SSTables, get a stream of mutations, serialize them, and send them over the network. On the receiver side, we deserialize them and write them to SSTables.

File-Based Streaming

Recently, we introduced a new file-based streaming method. The big difference is that we do not read the individual mutations from the SSTables, and we skip all the parsing and serialization work. Instead, we read and send the SSTable directly to remote nodes. A given SSTable always belongs to a single tablet. This means we can always send the entire SSTable to other nodes without worrying about whether the SSTable contains unwanted data. We implemented this by having the Seastar RPC stream interface stream SSTable files over the network for tablet migration. More specifically, we take an internal snapshot of the SSTables we want to transfer so the SSTables won’t be deleted during streaming. Then, SSTable file readers are created for them so we can use the Seastar RPC stream to send the SSTable files over the network. On the receiver side, the file streams are written into SSTable files by the SSTable writers.

Why did we do this? First, it reduces CPU usage because we do not need to read each and every mutation fragment from the SSTables, and we do not need to parse mutations. The CPU reduction is even more significant for small cells, where the ratio of metadata parsed to real user data is higher. Second, the SSTable format is much more compact than the mutation format (since the on-disk representation of data is more compact than the in-memory one). This means we have less data to send over the network. As a result, it can boost streaming speed rather significantly.

Performance Improvements

To quantify how this shift impacted performance, we compared mutation-based and file-based streaming when migrating tablets between nodes. The tests involved 3 ScyllaDB nodes (i4i.2xlarge), 3 loaders (t3.2xlarge), and 1 billion partitions. The headline result: file-based streaming delivered 25 times faster streaming time.
We also have much higher streaming bandwidth: the network bandwidth is 10 times faster with file-based streaming. As mentioned earlier, we have less data to send with file streaming: the data sent on the wire is almost three times less. In addition, file-based streaming consumes far fewer CPU cycles. Here’s a little more detail, in case you’re curious.

Disk IO Queue

Comparing the IO bandwidth of the two methods, the throughput was, as expected, higher with file-based streaming. The streaming bandwidth is 30-40 MB/s with mutation-based streaming, while the bandwidth for file streaming is much higher. The pattern also differs from the mutation-based case because file streaming completes more quickly and can sustain a high transfer bandwidth throughout the operation.

CPU Load

We found that the overall CPU usage is much lower for file-based streaming: around 12% for mutation-based streaming versus less than 5% for file-based streaming. Again, the pattern differs from mutation-based streaming because file streams complete much more quickly and maintain a high transfer bandwidth throughout.

Wrap Up

This new file-based streaming makes data streaming in ScyllaDB faster and more efficient. You can explore it in ScyllaDB Cloud or ScyllaDB 2025.1. Also, our CTO and co-founder Avi Kivity shares an extensive look at our other recent and upcoming engineering projects in this tech talk.

CEP-24 Behind the scenes: Developing Apache Cassandra®’s password validator and generator

Introduction: The need for an Apache Cassandra® password validator and generator

Here’s the problem: while users have always had the ability to create whatever password they wanted in Cassandra–from straightforward to incredibly complex and everything in between–this ultimately created a noticeable security vulnerability.

While organizations might have internal processes for generating secure passwords that adhere to their own security policies, Cassandra itself did not have the means to enforce these standards. To make the security vulnerability worse, if a password initially met internal security guidelines, users could later downgrade their password to a less secure option simply by using “ALTER ROLE” statements.
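For example (the role name and passwords here are made up), nothing stopped a sequence like this in earlier Cassandra versions:

  -- Created with a strong, policy-compliant password...
  CREATE ROLE alice WITH LOGIN = true AND PASSWORD = 'C0mpl3x&Long!Passphrase';

  -- ...and later quietly downgraded to a weak one
  ALTER ROLE alice WITH PASSWORD = 'alice123';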

When internal password requirements are enforced, users face the additional burden of creating compliant passwords. This inevitably involves lots of trial and error in attempting to create a password that satisfies complex security rules.

But what if there was a way to have Cassandra automatically create passwords that meet all bespoke security requirements–but without requiring manual effort from users or system operators?

That’s why we developed CEP-24: Password validation/generation. We recognized that the complexity of secure password management could be significantly reduced (or eliminated entirely) with the right approach, improving both security and user experience at the same time.

The Goals of CEP-24

A Cassandra Enhancement Proposal (or CEP) is a structured process for proposing, creating, and ultimately implementing new features for the Cassandra project. All CEPs are thoroughly vetted among the Cassandra community before they are officially integrated into the project.

These were the key goals we established for CEP-24:

  • Introduce a way to enforce password strength upon role creation or role alteration.
  • Implement a reference implementation of a password validator which adheres to a recommended password strength policy, to be used for Cassandra users out of the box.
  • Emit a warning (and proceed) or just reject “create role” and “alter role” statements when the provided password does not meet a certain security level, based on user configuration of Cassandra.
  • Allow implementing a custom password validator with its own policy, whatever it might be, through a modular/pluggable mechanism.
  • Provide a way for Cassandra to generate a password which would pass the subsequent validation for use by the user.

The Cassandra Password Validator and Generator builds upon an established framework in Cassandra called Guardrails, which was originally implemented under CEP-3 (more details here).

The password validator implements a custom guardrail introduced as part of CEP-24. A custom guardrail can validate and generate values of arbitrary types when properly implemented. In the CEP-24 context, the password guardrail provides CassandraPasswordValidator by extending ValueValidator, while passwords are generated by CassandraPasswordGenerator by extending ValueGenerator. Both components work with passwords as String type values.

Password validation and generation are configured in the cassandra.yaml file under the password_validator section. Let’s explore the key configuration properties available. First, the class_name and generator_class_name parameters specify which validator and generator classes will be used to validate and generate passwords respectively.

Cassandra ships CassandraPasswordValidator and CassandraPasswordGenerator out of the box. However, if a particular enterprise decides that they need something very custom, they are free to implement their own validator, put it on Cassandra’s class path and reference it in the configuration behind the class_name parameter. The same applies to the generator.

CEP-24 provides implementations of the validator and generator that the Cassandra team believes will satisfy the requirements of most users. These default implementations address common password security needs. However, the framework is designed with flexibility in mind, allowing organizations to implement custom validation and generation rules that align with their specific security policies and business requirements.

password_validator: 
 # Implementation class of a validator. When not in form of FQCN, the 
 # package name org.apache.cassandra.db.guardrails.validators is prepended. 
 # By default, there is no validator. 
 class_name: CassandraPasswordValidator 
 # Implementation class of related generator which generates values which are valid when 
 # tested against this validator. When not in form of FQCN, the 
 # package name org.apache.cassandra.db.guardrails.generators is prepended. 
 # By default, there is no generator. 
 generator_class_name: CassandraPasswordGenerator

Password quality might be looked at as the number of characteristics a password satisfies. There are two levels for any password to be evaluated – warning level and failure level. Warning and failure levels nicely fit into how Guardrails act. Every guardrail has warning and failure thresholds. Based on what value a specific guardrail evaluates, it will either emit a warning to a user that its usage is discouraged (but ultimately allowed) or it will fail to be set altogether.

This same principle applies to password evaluation: each password is assessed against both warning and failure thresholds. These thresholds are based on counting the characteristics present in the password. The validator checks four characteristics (the number of uppercase characters, lowercase characters, special characters, and digits) alongside the password’s overall length. A comprehensive password security policy can be enforced by configuring minimum requirements for each of these.

# There are four characteristics:
# upper-case, lower-case, special character and digit.
# If this value is set e.g. to 3, a password has to
# consist of 3 out of 4 characteristics.
# For example, it has to contain at least 2 upper-case characters,
# 2 lower-case, and 2 digits to pass,
# but it does not have to contain any special characters.
# If the number of characteristics found in the password is
# less than or equal to this number, it will emit a warning.
characteristic_warn: 3
# If the number of characteristics found in the password is
# less than or equal to this number, it will emit a failure.
characteristic_fail: 2

Next, there are configuration parameters for each characteristic which count towards warning or failure:

# If the password is shorter than this value, 
# the validator will emit a warning. 
length_warn: 12 
# If a password is shorter than this value, 
# the validator will emit a failure. 
length_fail: 8 
# If a password does not contain at least n 
# upper-case characters, the validator will emit a warning. 
upper_case_warn: 2 
# If a password does not contain at least 
# n upper-case characters, the validator will emit a failure. 
upper_case_fail: 1 
# If a password does not contain at least 
# n lower-case characters, the validator will emit a warning. 
lower_case_warn: 2 
# If a password does not contain at least 
# n lower-case characters, the validator will emit a failure. 
lower_case_fail: 1 
# If a password does not contain at least 
# n digits, the validator will emit a warning. 
digit_warn: 2 
# If a password does not contain at least 
# n digits, the validator will emit a failure. 
digit_fail: 1 
# If a password does not contain at least 
# n special characters, the validator will emit a warning. 
special_warn: 2 
# If a password does not contain at least 
# n special characters, the validator will emit a failure. 
special_fail: 1
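
To make the interplay of these thresholds concrete, here is a small illustrative Python sketch of the evaluation logic described above. It is not the actual implementation (CassandraPasswordValidator is Java code inside Cassandra, and its exact counting rules may differ); in this sketch a characteristic counts as met once it reaches its warn threshold:

import string

# Example thresholds mirroring the configuration shown above
# (per-characteristic fail thresholds omitted for brevity)
config = {
    "length_warn": 12, "length_fail": 8,
    "upper_case_warn": 2, "lower_case_warn": 2,
    "digit_warn": 2, "special_warn": 2,
    "characteristic_warn": 3, "characteristic_fail": 2,
}

def evaluate(password: str) -> str:
    counts = {
        "upper_case": sum(c.isupper() for c in password),
        "lower_case": sum(c.islower() for c in password),
        "digit": sum(c.isdigit() for c in password),
        "special": sum(c in string.punctuation for c in password),
    }
    # Assumption: a characteristic is "met" once it reaches its warn threshold
    met = sum(counts[name] >= config[f"{name}_warn"] for name in counts)

    if len(password) < config["length_fail"] or met <= config["characteristic_fail"]:
        return "fail"
    if len(password) < config["length_warn"] or met <= config["characteristic_warn"]:
        return "warn"
    return "ok"

print(evaluate("password1"))     # fail: only one characteristic is met
print(evaluate("R7tb33?.mcAX"))  # ok: long enough and all four characteristics met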

It is also possible to forbid illegal sequences of a certain length found in a password:

# If a password contains illegal sequences that are at least this long, it is invalid. 
# Illegal sequences might be either alphabetical (form 'abcde'), 
# numerical (form '34567'), or US qwerty (form 'asdfg') as well 
# as sequences from supported character sets. 
# The minimum value for this property is 3, 
# by default it is set to 5. 
illegal_sequence_length: 5

Lastly, it is also possible to configure a dictionary of passwords to check against, which guards against password dictionary attacks. It is up to the operator of a cluster to configure the password dictionary:

# Dictionary to check the passwords against. Defaults to no dictionary. 
# Whole dictionary is cached into memory. Use with caution with relatively big dictionaries. 
# Entries in a dictionary, one per line, have to be sorted per String's compareTo contract. 
dictionary: /path/to/dictionary/file
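
Because entries must be sorted according to Java’s String.compareTo contract, you may want to prepare the dictionary file programmatically. Here is a small illustrative sketch (the input file name is just a placeholder); Python’s default string sort matches compareTo for plain ASCII word lists:

# Illustrative helper for preparing a sorted, de-duplicated dictionary file.
# Java's String.compareTo orders by UTF-16 code unit, which Python's default
# sort matches for plain ASCII entries; re-check the ordering for non-ASCII lists.
with open("password-list.txt", encoding="utf-8", errors="ignore") as src:
    words = {line.strip() for line in src if line.strip()}

with open("/path/to/dictionary/file", "w", encoding="utf-8") as dst:
    dst.write("\n".join(sorted(words)) + "\n")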

Now that we have gone over all the configuration parameters, let’s take a look at an example of how password validation and generation look in practice.

Consider a scenario where a Cassandra super-user (such as the default ‘cassandra’ role) attempts to create a new role named ‘alice’.

cassandra@cqlsh> CREATE ROLE alice WITH PASSWORD = 'cassandraisadatabase' AND LOGIN = true; 

InvalidRequest: Error from server: code=2200 [Invalid query] 
message="Password was not set as it violated configured password strength 
policy. To fix this error, the following has to be resolved: Password 
contains the dictionary word 'cassandraisadatabase'. You may also use 
'GENERATED PASSWORD' upon role creation or alteration."

This attempt was rejected because the password appears in the configured dictionary. The operator’s next try avoids dictionary words, but the password is not long enough:

cassandra@cqlsh> CREATE ROLE alice WITH PASSWORD = 'T8aum3?' AND LOGIN = true; 
InvalidRequest: Error from server: code=2200 [Invalid query] 
message="Password was not set as it violated configured password strength 
policy. To fix this error, the following has to be resolved: Password 
must be 8 or more characters in length. You may also use 
'GENERATED PASSWORD' upon role creation or alteration."

Seeing the length error, the operator makes the password longer. This time the role is created, but the password is still not completely secure: it clears the failure thresholds, yet the validator warns that not all characteristics were met.

cassandra@cqlsh> CREATE ROLE alice WITH PASSWORD = 'mYAtt3mp' AND LOGIN = true; 

Warnings: 

Guardrail password violated: Password was set, however it might not be 
strong enough according to the configured password strength policy. 
To fix this warning, the following has to be resolved: Password must be 12 or more 
characters in length. Passwords must contain 2 or more digit characters. Password 
must contain 2 or more special characters. Password matches 2 of 4 character rules, 
but 4 are required. You may also use 'GENERATED PASSWORD' upon role creation or alteration.

Seeing this warning, the operator notices the note about the ‘GENERATED PASSWORD’ clause, which generates a password automatically so that the operator does not have to invent one. As the attempts above show, crafting a compliant password by hand is a cumbersome process that is better left to the machine, which is both more efficient and more reliable.

cassandra@cqlsh> ALTER ROLE alice WITH GENERATED PASSWORD; 

generated_password 
------------------ 
   R7tb33?.mcAX

The generated password shown above automatically satisfies all the rules configured in cassandra.yaml, as every generated password will. This is a clear advantage over inventing passwords manually.

When the CQL statement is executed, it will be visible in the CQLSH history (HISTORY command or in cqlsh_history file) but the password will not be logged, hence it cannot leak. It will also not appear in any auditing logs. Previously, Cassandra had to obfuscate such statements. This is not necessary anymore.

We can create a role with generated password like this:

cassandra@cqlsh> CREATE ROLE alice WITH GENERATED PASSWORD AND LOGIN = true; 

or by CREATE USER: 

cassandra@cqlsh> CREATE USER alice WITH GENERATED PASSWORD;

Once a password has been generated for alice (how it is delivered to her is out of scope here), she can log in:

$ cqlsh -u alice -p R7tb33?.mcAX 
... 
alice@cqlsh>

Note: It is recommended to save the password to ~/.cassandra/credentials, for example:

[PlainTextAuthProvider] 
username = alice
password = R7tb33?.mcAX

and to set auth_provider in ~/.cassandra/cqlshrc:

[auth_provider] 
module = cassandra.auth 
classname = PlainTextAuthProvider

It is also possible to configure password validators in such a way that a user does not see why a password failed. This is driven by the password_validator configuration property detailed_messages. When it is set to false, the violation messages are very brief:

alice@cqlsh> ALTER ROLE alice WITH PASSWORD = 'myattempt'; 

InvalidRequest: Error from server: code=2200 [Invalid query] 
message="Password was not set as it violated configured password strength policy. 
You may also use 'GENERATED PASSWORD' upon role creation or alteration."

The following command will automatically generate a new password that meets all configured security requirements.

alice@cqlsh> ALTER ROLE alice WITH GENERATED PASSWORD;

Several potential enhancements to password generation and validation could be implemented in future releases. One promising extension would be validating new passwords against previous values. This would prevent users from reusing passwords until after they’ve created a specified number of different passwords. A related enhancement could include restricting how frequently users can change their passwords, preventing rapid cycling through passwords to circumvent history-based restrictions.

These features, while valuable for comprehensive password security, were considered beyond the scope of the initial implementation and may be addressed in future updates.

Final thoughts and next steps

The Cassandra Password Validator and Generator implemented under CEP-24 represents a significant improvement in Cassandra’s security posture.

By providing robust, configurable password policies with built-in enforcement mechanisms and convenient password generation capabilities, organizations can now ensure compliance with their security standards directly at the database level. This not only strengthens overall system security but also improves the user experience by eliminating guesswork around password requirements.

As Cassandra continues to evolve as an enterprise-ready database solution, these security enhancements demonstrate a commitment to meeting the demanding security requirements of modern applications while maintaining the flexibility that makes Cassandra so powerful.

Ready to experience CEP-24 yourself? Try it out on the Instaclustr Managed Platform and spin up your first Cassandra cluster for free.

CEP-24 is just our latest contribution to open source. Check out everything else we’re working on here.

The post CEP-24 Behind the scenes: Developing Apache Cassandra®’s password validator and generator appeared first on Instaclustr.

Introduction to similarity search: Part 2–Simplifying with Apache Cassandra® 5’s new vector data type

In Part 1 of this series, we explored how you can combine Cassandra 4 and OpenSearch to perform similarity searches with word embeddings. While that approach is powerful, it requires managing two different systems.

But with the release of Cassandra 5, things become much simpler.

Cassandra 5 introduces a native VECTOR data type and built-in Vector Search capabilities, simplifying the architecture by enabling Cassandra 5 to handle storage, indexing, and querying seamlessly within a single system.

Now in Part 2, we’ll dive into how Cassandra 5 streamlines the process of working with word embeddings for similarity search. We’ll walk through how the new vector data type works, how to store and query embeddings, and how the Storage-Attached Indexing (SAI) feature enhances your ability to efficiently search through large datasets.

The power of vector search in Cassandra 5

Vector search is a game-changing feature added in Cassandra 5 that enables you to perform similarity searches directly within the database. This is especially useful for AI applications, where embeddings are used to represent data like text or images as high-dimensional vectors. The goal of vector search is to find the closest matches to these vectors, which is critical for tasks like product recommendations or image recognition.

The key to this functionality lies in embeddings: arrays of floating-point numbers that represent the similarity of objects. By storing these embeddings as vectors in Cassandra, you can use Vector Search to find connections in your data that may not be obvious through traditional queries.

How vectors work

Vectors are fixed-size sequences of non-null values, much like lists. However, in Cassandra 5, you cannot modify individual elements of a vector — you must replace the entire vector if you need to update it. This makes vectors ideal for storing embeddings, where you need to work with the whole data structure at once.

When working with embeddings, you’ll typically store them as vectors of floating-point numbers to represent the semantic meaning.
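
Because of this, updating a stored embedding means writing a brand-new vector value. A minimal sketch with the Python driver, assuming a hypothetical table demo.items with a vector<float, 3> column named v, might look like this:

from cassandra.cluster import Cluster

# Placeholder contact point; replace with your cluster's address
session = Cluster(["127.0.0.1"]).connect()

# Individual elements of a vector cannot be changed in place,
# so the whole 3-dimensional value is rewritten at once.
session.execute("UPDATE demo.items SET v = [0.12, 0.34, 0.56] WHERE id = 1")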

Storage-Attached Indexing (SAI): The engine behind vector search

Vector Search in Cassandra 5 is powered by Storage-Attached Indexing, which enables high-performance indexing and querying of vector data. SAI is essential for Vector Search, providing the ability to create column-level indexes on vector data types. This ensures that your vector queries are both fast and scalable, even with large datasets.

SAI isn’t just limited to vectors—it also indexes other types of data, making it a versatile tool for boosting the performance of your queries across the board.

Example: Performing similarity search with Cassandra 5’s vector data type

Now that we’ve introduced the new vector data type and the power of Vector Search in Cassandra 5, let’s dive into a practical example. In this section, we’ll show how to set up a table to store embeddings, insert data, and perform similarity searches directly within Cassandra.

Step 1: Setting up the embeddings table

To get started with this example, you’ll need access to a Cassandra 5 cluster. Cassandra 5 introduces native support for vector data types and Vector Search, available on Instaclustr’s managed platform. Once you have your cluster up and running, the first step is to create a table to store the embeddings. We’ll also create an index on the vector column to optimize similarity searches using SAI.

CREATE KEYSPACE aisearch WITH REPLICATION = {'class': 'SimpleStrategy', 'replication_factor': 1};

 

CREATE TABLE IF NOT EXISTS embeddings ( 
    id UUID, 
    paragraph_uuid UUID, 
    filename TEXT, 
    embeddings vector<float, 300>, 
    text TEXT, 
    last_updated timestamp, 
    PRIMARY KEY (id, paragraph_uuid) 
); 
 

CREATE INDEX IF NOT EXISTS ann_index 
  ON embeddings(embeddings) USING 'sai';

This setup allows us to store the embeddings as 300-dimensional vectors, along with metadata like file names and text. The SAI index will be used to speed up similarity searches on the embeddings column.

You can also fine-tune the index by specifying the similarity function to be used for vector comparisons. Cassandra 5 supports three types of similarity functions: DOT_PRODUCT, COSINE, and EUCLIDEAN. By default, the similarity function is set to COSINE, but you can specify your preferred method when creating the index:

CREATE INDEX IF NOT EXISTS ann_index 
    ON embeddings(embeddings) USING 'sai' 
WITH OPTIONS = { 'similarity_function': 'DOT_PRODUCT' };

Each similarity function has its own advantages depending on your use case. DOT_PRODUCT is often used when you need to measure the direction and magnitude of vectors, COSINE is ideal for comparing the angle between vectors, and EUCLIDEAN calculates the straight-line distance between vectors. By selecting the appropriate function, you can optimize your search results to better match the needs of your application.
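
To build intuition for how these functions differ, here is a small standalone NumPy illustration (Cassandra computes the functions internally; this only shows how the three measures behave for two vectors that point in the same direction but differ in magnitude):

import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])  # same direction as a, twice the magnitude

dot = float(np.dot(a, b))                               # 28.0: grows with magnitude
cosine = dot / (np.linalg.norm(a) * np.linalg.norm(b))  # 1.0: identical direction
euclidean = float(np.linalg.norm(a - b))                # ~3.74: distance between the points
print(dot, cosine, euclidean)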

Step 2: Inserting embeddings into Cassandra 5

To insert embeddings into Cassandra 5, we can use the same code from the first part of this series to extract text from files, load the FastText model, and generate the embeddings. Once the embeddings are generated, the following function will insert them into Cassandra:

import time
from uuid import uuid4, UUID
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement
from cassandra.policies import DCAwareRoundRobinPolicy
from cassandra.auth import PlainTextAuthProvider
from google.colab import userdata  # Only needed when running in Google Colab

# Connect to the cluster
cluster = Cluster(
    # Replace with your IP list
    ["xxx.xxx.xxx.xxx", "xxx.xxx.xxx.xxx", "xxx.xxx.xxx.xxx"],
    load_balancing_policy=DCAwareRoundRobinPolicy(local_dc='AWS_VPC_US_EAST_1'),  # Update the local data centre if needed
    port=9042,
    auth_provider=PlainTextAuthProvider(
        username='iccassandra',
        password='replace_with_your_password'
    )
)
session = cluster.connect()

print('Connected to cluster %s' % cluster.metadata.cluster_name)

def insert_embedding_to_cassandra(session, embedding, id=None, paragraph_uuid=None, filename=None, text=None, keyspace_name=None):
    try:
        embeddings = list(map(float, embedding))

        # Generate UUIDs if not provided
        if id is None:
            id = uuid4()
        if paragraph_uuid is None:
            paragraph_uuid = uuid4()

        # Ensure id and paragraph_uuid are UUID objects
        if isinstance(id, str):
            id = UUID(id)
        if isinstance(paragraph_uuid, str):
            paragraph_uuid = UUID(paragraph_uuid)

        # Create the query string with placeholders
        insert_query = f"""
        INSERT INTO {keyspace_name}.embeddings (id, paragraph_uuid, filename, embeddings, text, last_updated)
        VALUES (?, ?, ?, ?, ?, toTimestamp(now()))
        """

        # Create a prepared statement with the query
        prepared = session.prepare(insert_query)

        # Execute the query
        session.execute(prepared.bind((id, paragraph_uuid, filename, embeddings, text)))

        return None  # Successful insertion

    except Exception as e:
        error_message = f"Failed to execute query:\nError: {str(e)}"
        return error_message  # Return error message on failure

def insert_with_retry(session, embedding, id=None, paragraph_uuid=None,
                      filename=None, text=None, keyspace_name=None, max_retries=3,
                      retry_delay_seconds=1):
    retry_count = 0
    while retry_count < max_retries:
        result = insert_embedding_to_cassandra(session, embedding, id, paragraph_uuid, filename, text, keyspace_name)
        if result is None:
            return True  # Successful insertion
        else:
            retry_count += 1
            print(f"Insertion failed on attempt {retry_count} with error: {result}")
            if retry_count < max_retries:
                time.sleep(retry_delay_seconds)  # Delay before the next retry
    return False  # Failed after max_retries

# Replace the file path pointing to the desired file
file_path = "/path/to/Cassandra-Best-Practices.pdf"
# text_to_vector is the embedding function defined in Part 1 of this series
paragraphs_with_embeddings = extract_text_with_page_number_and_embeddings(file_path, text_to_vector)

from tqdm import tqdm

# Keyspace created in Step 1
keyspace_name = "aisearch"

for paragraph in tqdm(paragraphs_with_embeddings, desc="Inserting paragraphs"):
    if not insert_with_retry(
        session=session,
        embedding=paragraph['embedding'],
        id=paragraph['uuid'],
        paragraph_uuid=paragraph['paragraph_uuid'],
        text=paragraph['text'],
        filename=paragraph['filename'],
        keyspace_name=keyspace_name,
        max_retries=3,
        retry_delay_seconds=1
    ):
        # Display an error message if insertion fails
        tqdm.write(f"Insertion failed after maximum retries for UUID {paragraph['uuid']}: {paragraph['text'][:50]}...")
This function handles inserting embeddings and metadata into Cassandra, ensuring that UUIDs are correctly generated for each entry.

Step 3: Performing similarity searches in Cassandra 5

Once the embeddings are stored, we can perform similarity searches directly within Cassandra using the following function:

import numpy as np

# ------------------ Embedding Functions ------------------
def text_to_vector(text):
    """Convert a text chunk into a vector using the FastText model."""
    words = text.split()
    vectors = [fasttext_model[word] for word in words if word in fasttext_model.key_to_index]
    return np.mean(vectors, axis=0) if vectors else np.zeros(fasttext_model.vector_size)

def find_similar_texts_cassandra(session, input_text, keyspace_name=None, top_k=5):
    # Convert the input text to an embedding
    input_embedding = text_to_vector(input_text)
    input_embedding_str = ', '.join(map(str, input_embedding.tolist()))

    # ANN query: similarity_cosine() returns the score, ORDER BY ... ANN OF performs the search
    query = f"""
    SELECT text, filename, similarity_cosine(embeddings, ?) AS similarity
    FROM {keyspace_name}.embeddings
    ORDER BY embeddings ANN OF [{input_embedding_str}]
    LIMIT {top_k};
    """

    prepared = session.prepare(query)
    bound = prepared.bind((list(map(float, input_embedding)),))
    rows = session.execute(bound)

    # Sort the results by similarity in Python
    similar_texts = sorted([(row.similarity, row.filename, row.text) for row in rows], key=lambda x: x[0], reverse=True)

    return similar_texts[:top_k]

from IPython.display import display, HTML

# The word you want to find similarities for
input_text = "place"

# Call the function to find similar texts in the Cassandra database
similar_texts = find_similar_texts_cassandra(session, input_text, keyspace_name="aisearch", top_k=10)

This function searches for similar embeddings in Cassandra and retrieves the top results based on cosine similarity. Under the hood, Cassandra’s vector search uses Hierarchical Navigable Small Worlds (HNSW). HNSW organizes data points in a multi-layer graph structure, making queries significantly faster by narrowing down the search space efficiently—particularly important when handling large datasets.

Step 4: Displaying the results

To display the results in a readable format, we can loop through the similar texts and present them along with their similarity scores:

# Print the similar texts along with their similarity scores
for similarity, filename, text in similar_texts:
    html_content = f"""
    <div style="margin-bottom: 10px;">
        <p><b>Similarity:</b> {similarity:.4f}</p>
        <p><b>Text:</b> {text}</p>
        <p><b>File:</b> {filename}</p>
    </div>
    <hr/>
    """
    display(HTML(html_content))

This code will display the top similar texts, along with their similarity scores and associated file names.

Cassandra 5 vs. Cassandra 4 + OpenSearch®

Cassandra 4 relies on an integration with OpenSearch to handle word embeddings and similarity searches. This approach works well for applications that are already using or comfortable with OpenSearch, but it does introduce additional complexity with the need to maintain two systems.

Cassandra 5, on the other hand, brings vector support directly into the database. With its native VECTOR data type and similarity search functions, it simplifies your architecture and improves performance, making it an ideal solution for applications that require embedding-based searches at scale.

Feature            | Cassandra 4 + OpenSearch  | Cassandra 5 (Preview)
-------------------|---------------------------|------------------------------------
Embedding Storage  | OpenSearch                | Native VECTOR Data Type
Similarity Search  | KNN Plugin in OpenSearch  | COSINE, EUCLIDEAN, DOT_PRODUCT
Search Method      | Exact K-Nearest Neighbor  | Approximate Nearest Neighbor (ANN)
System Complexity  | Requires two systems      | All-in-one Cassandra solution

Conclusion: A simpler path to similarity search with Cassandra 5

With Cassandra 5, the complexity of setting up and managing a separate search system for word embeddings is gone. The new vector data type and Vector Search capabilities allow you to perform similarity searches directly within Cassandra, simplifying your architecture and making it easier to build AI-powered applications.

Coming up: more in-depth examples and use cases that demonstrate how to take full advantage of these new features in Cassandra 5 in future blogs!

Ready to experience vector search with Cassandra 5? Spin up your first cluster for free on the Instaclustr Managed Platform and try it out!

The post Introduction to similarity search: Part 2–Simplifying with Apache Cassandra® 5’s new vector data type appeared first on Instaclustr.

Introduction to similarity search with word embeddings: Part 1–Apache Cassandra® 4.0 and OpenSearch®

Word embeddings have revolutionized how we approach tasks like natural language processing, search, and recommendation engines.

They allow us to convert words and phrases into numerical representations (vectors) that capture their meaning based on the context in which they appear. Word embeddings are especially useful for tasks where traditional keyword searches fall short, such as finding semantically similar documents or making recommendations based on textual data.

scatter plot graph

For example: a search for “Laptop” might return results related to “Notebook” or “MacBook” when using embeddings (as opposed to something like “Tablet”) offering a more intuitive and accurate search experience.

As applications increasingly rely on AI and machine learning to drive intelligent search and recommendation engines, the ability to efficiently handle word embeddings has become critical. That’s where databases like Apache Cassandra come into play—offering the scalability and performance needed to manage and query large amounts of vector data.

In Part 1 of this series, we’ll explore how you can leverage word embeddings for similarity searches using Cassandra 4 and OpenSearch. By combining Cassandra’s robust data storage capabilities with OpenSearch’s powerful search functions, you can build scalable and efficient systems that handle both metadata and word embeddings.

Cassandra 4 and OpenSearch: A partnership for embeddings

Cassandra 4 doesn’t natively support vector data types or specific similarity search functions, but that doesn’t mean you’re out of luck. By integrating Cassandra with OpenSearch, an open-source search and analytics platform, you can store word embeddings and perform similarity searches using the k-Nearest Neighbors (kNN) plugin.

This hybrid approach is advantageous over relying on OpenSearch alone because it allows you to leverage Cassandra’s strengths as a high-performance, scalable database for data storage while using OpenSearch for its robust indexing and search capabilities.

Instead of duplicating large volumes of data into OpenSearch solely for search purposes, you can keep the original data in Cassandra. OpenSearch, in this setup, acts as an intelligent pointer, indexing the embeddings stored in Cassandra and performing efficient searches without the need to manage the entire dataset directly.

This approach not only optimizes resource usage but also enhances system maintainability and scalability by segregating storage and search functionalities into specialized layers.

Deploying the environment

To set up your environment for word embeddings and similarity search, you can leverage the Instaclustr Managed Platform, which simplifies deploying and managing your Cassandra cluster and OpenSearch. Instaclustr takes care of the heavy lifting, allowing you to focus on building your application rather than managing infrastructure. In this configuration, Cassandra serves as your primary data store, while OpenSearch handles vector operations and similarity searches.

Here’s how to get started:

  1. Deploy a managed Cassandra cluster: Start by provisioning your Cassandra 4 cluster on the Instaclustr platform. This managed solution ensures your cluster is optimized, secure, and ready to store non-vector data.
  2. Set up OpenSearch with kNN plugin: Instaclustr also offers a fully managed OpenSearch service. You will need to deploy OpenSearch, with the kNN plugin enabled, which is critical for handling word embeddings and executing similarity searches.

By using Instaclustr, you gain access to a robust platform that seamlessly integrates Cassandra and OpenSearch, combining Cassandra’s scalable, fault-tolerant database with OpenSearch’s powerful search capabilities. This managed environment minimizes operational complexity, so you can focus on delivering fast and efficient similarity searches for your application.

Preparing the environment

Now that we’ve outlined the environment setup, let’s dive into the specific technical steps to prepare Cassandra and OpenSearch for storing and searching word embeddings.

Step 1: Setting up Cassandra

In Cassandra, we’ll need to create a table to store the metadata. Here’s how to do that:

  1. Create the table:
    Create a keyspace and a table to store the metadata. This table will hold details such as the file name, the related text, and the paragraph identifiers:

CREATE KEYSPACE IF NOT EXISTS aisearch WITH REPLICATION = {'class': 'SimpleStrategy', 'replication_factor': 3};

USE aisearch;

DROP TABLE IF EXISTS file_metadata;

CREATE TABLE IF NOT EXISTS file_metadata (
  id UUID,
  paragraph_uuid UUID,
  filename TEXT,
  text TEXT,
  last_updated timestamp,
  PRIMARY KEY (id, paragraph_uuid)
);

Step 2: Configuring OpenSearch

In OpenSearch, you’ll need to create an index that supports vector operations for similarity search. Here’s how you can configure it:

  1. Create the index:
    Define the index settings and mappings, ensuring that vector operations are enabled and that the correct space type (e.g., L2) is used for similarity calculations.
{
  "settings": {
    "index": {
      "number_of_shards": 2,
      "knn": true,
      "knn.space_type": "l2"
    }
  },
  "mappings": {
    "properties": {
      "file_uuid": {
        "type": "keyword"
      },
      "paragraph_uuid": {
        "type": "keyword"
      },
      "embedding": {
        "type": "knn_vector",
        "dimension": 300
      }
    }
  }
}

This index configuration is optimized for storing and searching embeddings using the k-Nearest Neighbors algorithm, which is crucial for similarity search.
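
If you prefer to create the index programmatically rather than through the REST API, a minimal sketch using the opensearch-py client could look like the following; the host, credentials, and index name are placeholders for your own OpenSearch cluster details, and the os_client and index_name defined here are reused in the code later in this post:

from opensearchpy import OpenSearch

# Placeholder connection details: replace with your OpenSearch endpoint and credentials
os_client = OpenSearch(
    hosts=[{"host": "your-opensearch-host", "port": 9200}],
    http_auth=("icopensearch", "replace_with_your_password"),
    use_ssl=True,
    verify_certs=True,
)

index_name = "embeddings"
index_body = {
    "settings": {"index": {"number_of_shards": 2, "knn": True, "knn.space_type": "l2"}},
    "mappings": {
        "properties": {
            "file_uuid": {"type": "keyword"},
            "paragraph_uuid": {"type": "keyword"},
            "embedding": {"type": "knn_vector", "dimension": 300},
        }
    },
}

# Create the kNN-enabled index defined above
os_client.indices.create(index=index_name, body=index_body)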

With these steps, your environment will be ready to handle word embeddings for similarity search using Cassandra and OpenSearch.

Generating embeddings with FastText

Once you have your environment set up, the next step is to generate the word embeddings that will drive your similarity search. For this, we’ll use FastText, a popular library from Facebook’s AI Research team that provides pre-trained word vectors. Specifically, we’re using the crawl-300d-2M model, which offers 300-dimensional vectors for millions of English words.

Step 1: Download and load the FastText model

To start, you’ll need to download the pre-trained model file. This can be done easily using Python and the requests library. Here’s the process:

1. Download the FastText model: The FastText model is stored in a zip file, which you can download from the official FastText website. The following Python script will handle the download and extraction:

import requests 
import zipfile 
import os 

# Adjust file_url  and local_filename  variables accordingly 
file_url = 'https://dl.fbaipublicfiles.com/fasttext/vectors-english/crawl-300d-2M.vec.zip' 
local_filename = '/content/gdrive/MyDrive/0_notebook_files/model/crawl-300d-2M.vec.zip' 
extract_dir = '/content/gdrive/MyDrive/0_notebook_files/model/' 

def download_file(url, filename): 
    with requests.get(url, stream=True) as r: 
        r.raise_for_status() 
        os.makedirs(os.path.dirname(filename), exist_ok=True) 
        with open(filename, 'wb') as f: 
            for chunk in r.iter_content(chunk_size=8192): 
                f.write(chunk) 
 

def unzip_file(filename, extract_to): 
    with zipfile.ZipFile(filename, 'r') as zip_ref: 
        zip_ref.extractall(extract_to) 

# Download and extract 
download_file(file_url, local_filename) 
unzip_file(local_filename, extract_dir)

2. Load the model: Once the model is downloaded and extracted, you’ll load it using Gensim’s KeyedVectors class. This allows you to work with the embeddings directly: 

from gensim.models import KeyedVectors 

# Adjust model_path variable accordingly
model_path = "/content/gdrive/MyDrive/0_notebook_files/model/crawl-300d-2M.vec"
fasttext_model = KeyedVectors.load_word2vec_format(model_path, binary=False)

Step 2: Generate embeddings from text

With the FastText model loaded, the next task is to convert text into vectors. This process involves splitting the text into words, looking up the vector for each word in the FastText model, and then averaging the vectors to get a single embedding for the text.

Here’s a function that handles the conversion:

import numpy as np 
import re 

def text_to_vector(text): 
    """Convert text into a vector using the FastText model.""" 
    text = text.lower() 
    words = re.findall(r'\b\w+\b', text) 
    vectors = [fasttext_model[word] for word in words if word in fasttext_model.key_to_index] 

    if not vectors: 
        print(f"No embeddings found for text: {text}") 
        return np.zeros(fasttext_model.vector_size) 

    return np.mean(vectors, axis=0)

This function tokenizes the input text, retrieves the corresponding word vectors from the model, and computes the average to create a final embedding.
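
As a quick sanity check (assuming the FastText model is loaded as shown earlier), you can confirm that the output is a 300-dimensional vector and that related words produce similar embeddings; the word pair below is just an illustration:

vec_laptop = text_to_vector("laptop")
vec_notebook = text_to_vector("notebook")
print(vec_laptop.shape)  # (300,)

# Cosine similarity between the two embeddings
cosine = np.dot(vec_laptop, vec_notebook) / (
    np.linalg.norm(vec_laptop) * np.linalg.norm(vec_notebook)
)
print(round(float(cosine), 3))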

Step 3: Extract text and generate embeddings from documents

In real-world applications, your text might come from various types of documents, such as PDFs, Word files, or presentations. The following code shows how to extract text from different file formats and convert that text into embeddings:

import uuid 
import mimetypes 
import pandas as pd 
from pdfminer.high_level import extract_pages 
from pdfminer.layout import LTTextContainer 
from docx import Document 
from pptx import Presentation 

def generate_deterministic_uuid(name): 
    return uuid.uuid5(uuid.NAMESPACE_DNS, name) 

def generate_random_uuid(): 
    return uuid.uuid4() 

def get_file_type(file_path): 
    # Guess the MIME type based on the file extension 
    mime_type, _ = mimetypes.guess_type(file_path) 
    return mime_type 

def extract_text_from_excel(excel_path):
    xls = pd.ExcelFile(excel_path)
    text_list = []

    for sheet_index, sheet_name in enumerate(xls.sheet_names):
        df = xls.parse(sheet_name)
        for row in df.iterrows():
            text_list.append((" ".join(map(str, row[1].values)), sheet_index + 1))  # +1 to make it a 1-based index

    return text_list

def extract_text_from_pdf(pdf_path): 
    return [(text_line.get_text().strip().replace('\xa0', ' '), page_num) 
            for page_num, page_layout in enumerate(extract_pages(pdf_path), start=1) 
            for element in page_layout if isinstance(element, LTTextContainer) 
            for text_line in element if text_line.get_text().strip()] 

def extract_text_from_word(file_path): 
    doc = Document(file_path) 
    return [(para.text, (i == 0) + 1) for i, para in enumerate(doc.paragraphs) if para.text.strip()] 

def extract_text_from_txt(file_path): 
    with open(file_path, 'r') as file: 
        return [(line.strip(), 1) for line in file.readlines() if line.strip()] 

def extract_text_from_pptx(pptx_path): 
    prs = Presentation(pptx_path) 
    return [(shape.text.strip(), slide_num) for slide_num, slide in enumerate(prs.slides, start=1) 
            for shape in slide.shapes if hasattr(shape, "text") and shape.text.strip()] 

def extract_text_with_page_number_and_embeddings(file_path, embedding_function): 
    file_uuid = generate_deterministic_uuid(file_path) 
    file_type = get_file_type(file_path) 

    extractors = { 
        'text/plain': extract_text_from_txt, 
        'application/pdf': extract_text_from_pdf, 
        'application/vnd.openxmlformats-officedocument.wordprocessingml.document': extract_text_from_word, 
        'application/vnd.openxmlformats-officedocument.presentationml.presentation': extract_text_from_pptx, 
        'application/zip': lambda path: extract_text_from_pptx(path) if path.endswith('.pptx') else [], 
        'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet': extract_text_from_excel, 
        'application/vnd.ms-excel': extract_text_from_excel
    }

    text_list = extractors.get(file_type, lambda _: [])(file_path) 

    return [ 
      { 
          "uuid": file_uuid, 
          "paragraph_uuid": generate_random_uuid(), 
          "filename": file_path, 
          "text": text, 
          "page_num": page_num, 
          "embedding": embedding 
      } 
      for text, page_num in text_list 
      if (embedding := embedding_function(text)).any()  # Check if the embedding is not all zeros 
    ] 

# Replace the file path with the one you want to process 

file_path = "../../docs-manager/Cassandra-Best-Practices.pdf"
# Pass the text_to_vector function defined above as the embedding function
paragraphs_with_embeddings = extract_text_with_page_number_and_embeddings(file_path, text_to_vector)

This code handles extracting text from different document types, generating embeddings for each text chunk, and associating them with unique IDs.

With FastText set up and embeddings generated, you’re now ready to store these vectors in OpenSearch and start performing similarity searches.

Performing similarity searches

To conduct similarity searches, we utilize the k-Nearest Neighbors (kNN) plugin within OpenSearch. This plugin allows us to efficiently search for the most similar embeddings stored in the system. Essentially, you’re querying OpenSearch to find the closest matches to a word or phrase based on your embeddings.

For example, if you’ve embedded product descriptions, using kNN search helps you locate products that are semantically similar to a given input. This capability can significantly enhance your application’s recommendation engine, categorization, or clustering.

This setup with Cassandra and OpenSearch is a powerful combination, but it’s important to remember that it requires managing two systems. As Cassandra evolves, the introduction of built-in vector support in Cassandra 5 simplifies this architecture. But for now, let’s focus on leveraging both systems to get the most out of similarity searches.

Example: Inserting metadata in Cassandra and embeddings in OpenSearch

In this example, we use Cassandra 4 to store metadata related to files and paragraphs, while OpenSearch handles the actual word embeddings. By storing the paragraph and file IDs in both systems, we can link the metadata in Cassandra with the embeddings in OpenSearch.

We first need to store metadata such as the file name, paragraph UUID, and other relevant details in Cassandra. This metadata will be crucial for linking the data between Cassandra, OpenSearch, and the file itself in the filesystem.

The following code demonstrates how to insert this metadata into Cassandra and the embeddings into OpenSearch. Make sure to run the previous script first so that the paragraphs_with_embeddings variable is populated:

from tqdm import tqdm 

# Function to insert data into both Cassandra and OpenSearch 
def insert_paragraph_data(session, os_client, paragraph, keyspace_name, index_name): 
    # Insert into Cassandra 
    cassandra_result = insert_with_retry( 
        session=session, 
        id=paragraph['uuid'], 
        paragraph_uuid=paragraph['paragraph_uuid'], 
        text=paragraph['text'], 
        filename=paragraph['filename'], 
        keyspace_name=keyspace_name, 
        max_retries=3, 
        retry_delay_seconds=1 
    ) 

    if not cassandra_result: 
        return False  # Stop further processing if Cassandra insertion fails 

    # Insert into OpenSearch 
    opensearch_result = insert_embedding_to_opensearch( 
        os_client=os_client, 
        index_name=index_name, 
        file_uuid=paragraph['uuid'], 
        paragraph_uuid=paragraph['paragraph_uuid'], 
        embedding=paragraph['embedding'] 
    ) 

    if opensearch_result is not None: 
        return False  # Return False if OpenSearch insertion fails 

    return True  # Return True on success for both 

# Process each paragraph with a progress bar 
print("Starting batch insertion of paragraphs.") 

for paragraph in tqdm(paragraphs_with_embeddings, desc="Inserting paragraphs"): 
    if not insert_paragraph_data( 
        session=session, 
        os_client=os_client, 
        paragraph=paragraph, 
        keyspace_name=keyspace_name, 
        index_name=index_name 
    ): 

        print(f"Insertion failed for UUID {paragraph['uuid']}: {paragraph['text'][:50]}...") 

print("Batch insertion completed.")

Performing similarity search

Now that we’ve stored both metadata in Cassandra and embeddings in OpenSearch, it’s time to perform a similarity search. This step involves searching OpenSearch for embeddings that closely match a given input and then retrieving the corresponding metadata from Cassandra.

The process is straightforward: we start by converting the input text into an embedding, then use the k-Nearest Neighbors (kNN) plugin in OpenSearch to find the most similar embeddings. Once we have the results, we fetch the related metadata from Cassandra, such as the original text and file name.

Here’s how it works:

  1. Convert text to embedding: Start by converting your input text into an embedding vector using the FastText model. This vector will serve as the query for our similarity search.
  2. Search OpenSearch for similar embeddings: Using the KNN search capability in OpenSearch, we find the top k most similar embeddings. Each result includes the corresponding file and paragraph UUIDs, which help us link the results back to Cassandra.
  3. Fetch metadata from Cassandra: With the UUIDs retrieved from OpenSearch, we query Cassandra to get the metadata, such as the original text and file name, associated with each embedding.

The following code demonstrates this process:

import uuid 
from IPython.display import display, HTML 

def find_similar_embeddings_opensearch(os_client, index_name, input_embedding, top_k=5): 
    """Search for similar embeddings in OpenSearch and return the associated UUIDs.""" 
    query = { 
        "size": top_k, 
        "query": { 
            "knn": { 
                "embedding": { 
                    "vector": input_embedding.tolist(), 
                    "k": top_k 
                } 
            } 
        } 
    }

    response = os_client.search(index=index_name, body=query)

    similar_uuids = [] 
    for hit in response['hits']['hits']: 
        file_uuid = hit['_source']['file_uuid'] 
        paragraph_uuid = hit['_source']['paragraph_uuid'] 
        similar_uuids.append((file_uuid, paragraph_uuid))  

    return similar_uuids 

def fetch_metadata_from_cassandra(session, file_uuid, paragraph_uuid, keyspace_name): 
    """Fetch the metadata (text and filename) from Cassandra based on UUIDs.""" 
    file_uuid = uuid.UUID(file_uuid) 
    paragraph_uuid = uuid.UUID(paragraph_uuid) 

    query = f""" 
    SELECT text, filename 
    FROM {keyspace_name}.file_metadata 
    WHERE id = ? AND paragraph_uuid = ?; 
    """ 
    prepared = session.prepare(query) 
    bound = prepared.bind((file_uuid, paragraph_uuid)) 
    rows = session.execute(bound)    

    for row in rows: 
        return row.filename, row.text 
    return None, None 

# Input text to find similar embeddings 
input_text = "place" 

# Convert input text to embedding 
input_embedding = text_to_vector(input_text) 

# Find similar embeddings in OpenSearch 
similar_uuids = find_similar_embeddings_opensearch(os_client, index_name=index_name, input_embedding=input_embedding, top_k=10) 

# Fetch and display metadata from Cassandra based on the UUIDs found in OpenSearch 
for file_uuid, paragraph_uuid in similar_uuids: 
    filename, text = fetch_metadata_from_cassandra(session, file_uuid, paragraph_uuid, 
keyspace_name)

    if filename and text: 
        html_content = f""" 
        <div style="margin-bottom: 10px;"> 
            <p><b>File UUID:</b> {file_uuid}</p> 
            <p><b>Paragraph UUID:</b> {paragraph_uuid}</p> 
            <p><b>Text:</b> {text}</p> 
            <p><b>File:</b> {filename}</p> 
        </div> 

        <hr/> 
        """ 

        display(HTML(html_content))

This code demonstrates how to find similar embeddings in OpenSearch and retrieve the corresponding metadata from Cassandra. By linking the two systems via the UUIDs, you can build powerful search and recommendation systems that combine metadata storage with advanced embedding-based searches.

Conclusion and next steps: A powerful combination of Cassandra 4 and OpenSearch

By leveraging the strengths of Cassandra 4 and OpenSearch, you can build a system that handles both metadata storage and similarity search. Cassandra efficiently stores your file and paragraph metadata, while OpenSearch takes care of embedding-based searches using the k-Nearest Neighbors algorithm. Together, these two technologies enable powerful, large-scale applications for text search, recommendation engines, and more.

Coming up in Part 2, we’ll explore how Cassandra 5 simplifies this architecture with built-in vector support and native similarity search capabilities.

Ready to try vector search with Cassandra and OpenSearch? Spin up your first cluster for free on the Instaclustr Managed Platform and explore the incredible power of vector search.

The post Introduction to similarity search with word embeddings: Part 1–Apache Cassandra® 4.0 and OpenSearch® appeared first on Instaclustr.

IBM acquires DataStax: What that means for customers–and why Instaclustr is a smart alternative

IBM’s recent acquisition of DataStax has certainly made waves in the tech industry. With IBM’s expanding influence in data solutions and DataStax’s reputation for advancing Apache Cassandra® technology, this acquisition could signal a shift in the database management landscape.

For businesses currently using DataStax, this news might have sparked questions about what the future holds. How does this acquisition impact your systems, your data, and, most importantly, your goals?

While the acquisition opens up the prospect of integrating IBM’s cloud capabilities with high-performance NoSQL solutions, it also brings uncertainty. Transition periods for acquisitions often involve changes in product development priorities, pricing structures, and support strategies.

However, one thing is certain: customers want reliable, scalable, and transparent solutions. If you’re re-evaluating your options amid these changes, here’s why NetApp Instaclustr offers an excellent path forward.

Decoding the IBM-DataStax link-up

DataStax is a provider of enterprise solutions for Apache Cassandra, a powerful NoSQL database trusted for its ability to handle massive amounts of distributed data. IBM’s acquisition reflects its growing commitment to strengthening data management and expanding its footprint in the open source ecosystem.

While the acquisition promises an infusion of IBM’s resources and reach, IBM’s strategy often leans into long-term integration into its own cloud services and platforms. This could potentially reshape DataStax’s roadmap to align with IBM’s broader cloud-first objectives. Customers who don’t rely solely on IBM’s ecosystem—or want flexibility in their database management—might feel caught in a transitional limbo.

This is where Instaclustr comes into the picture as a strong, reliable alternative solution.

Why consider Instaclustr?

Instaclustr is purpose-built to empower businesses with a robust, open source data stack. For businesses relying on Cassandra or DataStax, Instaclustr delivers an alternative that’s stable, high-performing, and highly transparent.

Here’s why Instaclustr could be your best option moving forward:

1. 100% open source commitment

We’re firm believers in the power of open source technology. We offer pure Apache Cassandra, keeping it true to its roots without the proprietary lock-ins or hidden limitations. Unlike proprietary solutions, a commitment to pure open source ensures flexibility, freedom, and no vendor lock-in. You maintain full ownership and control.

2. Platform agnostic

One of the things that sets our solution apart is our platform-agnostic approach. Whether you’re running your workloads on AWS, Google Cloud, Azure, or on-premises environments, we make it seamless for you to deploy, manage, and scale Cassandra. This differentiates us from vendors tied deeply to specific clouds—like IBM.

3. Transparent pricing

Worried about the potential for a pricing overhaul under IBM’s leadership of DataStax? At Instaclustr, we pride ourselves on simplicity and transparency. What you see is what you get—predictable costs without hidden fees or confusing licensing rules. Our customer-first approach ensures that you remain in control of your budget.

4. Expert support and services

With Instaclustr, you’re not just getting access to technology—you’re also gaining access to a team of Cassandra experts who breathe open source. We’ve been managing and optimizing Cassandra clusters across the globe for years, with a proven commitment to providing best-in-class support.

Whether it’s data migration, scaling real-world workloads, or troubleshooting, we have you covered every step of the way. And our reliable SLA-backed managed Cassandra services mean businesses can focus less on infrastructure stress and more on innovation.

5. Seamless migrations

Concerned about the transition process? If you’re currently on DataStax and contemplating a move, our solution provides tools, guidance, and hands-on support to make the migration process smooth and efficient. Our experience in executing seamless migrations ensures minimal disruption to your operations.

Customer-centric focus

At the heart of everything we do is a commitment to your success. We understand that your data management strategy is critical to achieving your business goals, and we work hard to provide adaptable solutions.

Instaclustr comes to the table with over 10 years of experience in managing open source technologies including Cassandra, Apache Kafka®, PostgreSQL®, OpenSearch®, Valkey®, ClickHouse® and more, backed by over 400 million node hours and 18+ petabytes of data under management. Our customers trust and rely on us to manage the data that drives their critical business applications.

With a focus on fostering an open source future, our solutions aren’t tied to any single cloud, ecosystem, or bit of red tape. Simply put: your open source success is our mission.

Final thoughts: Why Instaclustr is the smart choice for this moment

IBM’s acquisition of DataStax might open new doors—but close many others. While the collaboration between IBM and DataStax might appeal to some enterprises, it’s important to weigh alternative solutions that offer reliability, flexibility, and freedom.

With Instaclustr, you get a partner that’s been empowering businesses with open source technologies for years, providing the transparency, support, and performance you need to thrive.

Ready to explore a stable, long-term alternative to DataStax? Check out Instaclustr for Apache Cassandra.

Contact us and learn more about Instaclustr for Apache Cassandra or request a demo of the Instaclustr platform today!

The post IBM acquires DataStax: What that means for customers–and why Instaclustr is a smart alternative appeared first on Instaclustr.

Innovative data compression for time series: An open source solution

Introduction

There’s no escaping the role that monitoring plays in our everyday lives, whether it’s the weather, the number of steps we take in a day, computer systems, or ever-popular IoT devices.

Practically any activity can be monitored in one form or another these days. This generates increasing amounts of data to be pored over and analyzed–but storing all this data adds significant costs over time. Given this huge amount of data that only increases with each passing day, efficient compression techniques are crucial.  

Here at NetApp® Instaclustr we saw a great opportunity to improve the current compression techniques for our time series data. That’s why we created the Advanced Time Series Compressor (ATSC) in partnership with the University of Canberra through the OpenSI initiative.

ATSC is a groundbreaking compressor designed to address the challenges of efficiently compressing large volumes of time-series data. Internal test results with production data from our database metrics showed that ATSC would compress, on average across the dataset, ~10x more than LZ4 and ~30x more than the default Prometheus compression. Check out ATSC on GitHub.

There are so many compressors already, so why develop another one? 

While other compression methods like LZ4, DoubleDelta, and ZSTD are lossless, most of our timeseries data is already lossy. Timeseries data can be lossy from the beginning due to under-sampling or insufficient data collection, or it can become lossy over time as metrics are rolled over or averaged. Because of this, the idea of a lossy compressor was born. 

ATSC is a highly configurable, lossy compressor that leverages the characteristics of time-series data to create function approximations. ATSC finds a fitting function and stores the parametrization of that function; no actual data from the original timeseries is stored. When the data is decompressed, it isn’t identical to the original, but it is still sufficient for the intended use.

Here’s an example: for a temperature change metric, which mostly varies slowly (as do a lot of system metrics!), instead of storing all the points that have a small change, we fit a curve (or a line) and store that curve/line, achieving significant compression ratios.

Image 1: ATSC data for temperature 

How does ATSC work? 

ATSC looks at the actual time series, in whole or in parts, to find how to better calculate a function that fits the existing data. For that, a quick statistical analysis is done, but if the results are inconclusive a sample is compressed with all the functions and the best function is selected.  

By default, ATSC will segment the data; this guarantees better local fitting, more and smaller computations, and less memory usage. It also ensures that decompression targets a specific block instead of the whole file.

In each fitting frame, ATSC will create a function from a pre-defined set and calculate the parametrization of said function. 

ATSC currently uses one (per frame) of the following functions:

  • FFT (Fast Fourier Transforms) 
  • Constant 
  • Interpolation – Catmull-Rom 
  • Interpolation – Inverse Distance Weight 

Image 2: Polynomial fitting vs. Fast-Fourier Transform fitting 

These methods allow ATSC to compress data with a fitting error within 1% (configurable!) of the original time-series.
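
To illustrate the idea (ATSC itself is written in Rust and uses the function set listed above), here is a conceptual Python sketch of fitting candidate functions to a frame and keeping the cheapest one that stays within the error bound; the candidate set and selection rule here are simplified stand-ins, not ATSC’s actual algorithm:

import numpy as np

def fit_frame(frame: np.ndarray, max_error: float = 0.01):
    """Fit each candidate to the frame and keep the cheapest one within the error bound."""
    x = np.arange(len(frame))
    candidates = {}

    # Constant: a single stored parameter (the mean of the frame)
    candidates["constant"] = (np.full_like(frame, frame.mean()), 1)

    # Low-degree polynomial: a handful of coefficients
    coeffs = np.polyfit(x, frame, deg=3)
    candidates["polynomial"] = (np.polyval(coeffs, x), len(coeffs))

    best = None
    for name, (approx, n_params) in candidates.items():
        err = np.max(np.abs(approx - frame) / np.maximum(np.abs(frame), 1e-9))
        if err <= max_error and (best is None or n_params < best[1]):
            best = (name, n_params, float(err))
    return best  # None means no candidate met the error bound

# A slowly rising temperature-like frame: the polynomial wins within 1%
print(fit_frame(np.linspace(20.0, 21.0, 64)))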

For a more detailed insight into ATSC internals and operations check our paper! 

 Use cases for ATSC and results 

ATSC draws inspiration from established compression and signal analysis techniques, achieving compression ratios ranging from 46x to 880x with a fitting error within 1% of the original time-series. In some cases, ATSC can produce highly compressed data without losing any meaningful information, making it a versatile tool for various applications (please see use cases below). 

Our internal tests comparing ATSC to LZ4 and the default Prometheus compression yielded the following results:

Method       Compressed size (bytes)   Compression Ratio
Prometheus   454,778,552               1.33
LZ4          141,347,821               4.29
ATSC         14,276,544                42.47

Another characteristic is the asymmetry between compression and decompression speed: compression is about 30x slower than decompression. This matches the expected usage pattern, in which a time series is compressed once but decompressed many times.

Image 3: A better fitting (purple) vs. a loose fitting (red). Purple takes twice as much space.

ATSC is versatile and can be applied in various scenarios where space reduction is prioritized over absolute precision. Some examples include: 

  • Rolled-over time series: Metrics data that are rolled over and stored for the long term. ATSC provides the same or greater space savings without meaningful loss of precision. 
  • Under-sampled time series: Systems with very low sampling rates (30 seconds or more), where it is difficult to identify actual events. ATSC delivers the space savings while keeping the information about the events, so sample rates can be increased without using more space. 
  • Long, slow-moving data series: Ideal for patterns that are easy to fit, such as weather data. 
  • Human visualization: Data meant for human analysis, with minimal impact on accuracy, such as historic views into system metrics (CPU, memory, disk, etc.).

Image 4: ATSC data (green) with an 88x compression vs. the original data (yellow)   

Using ATSC 

ATSC is written in Rust and is available on GitHub. You can build and run it yourself by following these instructions.

Future work 

Currently, we are planning to evolve ATSC in two ways (check our open issues): 

  1. Adding features to the core compressor focused on these functionalities:
    • Frame expansion for appending new data to existing frames 
    • Dynamic function loading to add more functions without altering the codebase 
    • Global and per-frame error storage 
    • Improved error encoding 
  2. Integrations with additional technologies (e.g. databases):
    • We are currently looking into integrating ATSC with ClickHouse® and Apache Cassandra® 
CREATE TABLE sensors_poly (
    sensor_id UInt16,
    location UInt32,
    timestamp DateTime,
    pressure Float64 CODEC(ATSC('Polynomial', 1)),
    temperature Float64 CODEC(ATSC('Polynomial', 1))
)
ENGINE = MergeTree
ORDER BY (sensor_id, location, timestamp);

Image 5: Currently testing ClickHouse integration 

Sound interesting? Try it out and let us know what you think.  

ATSC represents a significant advancement in time-series data compression, offering high compression ratios with a configurable accuracy loss. Whether for long-term storage or efficient data visualization, ATSC is a powerful open source tool for managing large volumes of time-series data. 

But don’t just take our word for it: download and run it!

Check our documentation for any information you need and submit ideas for improvements or issues you find using GitHub issues. We also have easy first issues tagged if you’d like to contribute to the project.   

Want to integrate this with another tool? You can build and run our demo integration with ClickHouse. 

The post Innovative data compression for time series: An open source solution appeared first on Instaclustr.

New cassandra_latest.yaml configuration for a top performant Apache Cassandra®

Welcome to our deep dive into the latest advancements in Apache Cassandra® 5.0, specifically focusing on the cassandra_latest.yaml configuration that is available for new Cassandra 5.0 clusters.  

This blog post will walk you through the motivation behind these changes, how to use the new configuration, and the benefits it brings to your Cassandra clusters. 

Motivation 

The primary motivation for introducing cassandra_latest.yaml is to bridge the gap between maintaining backward compatibility and leveraging the latest features and performance improvements. The yaml addresses the following varying needs for new Cassandra 5.0 clusters: 

  1. Cassandra Developers: who want to push new features but face challenges due to backward compatibility constraints. 
  2. Operators: who prefer stability and minimal disruption during upgrades. 
  3. Evangelists and New Users: who seek the latest features and performance enhancements without worrying about compatibility. 

Using cassandra_latest.yaml 

Using cassandra_latest.yaml is straightforward. It involves copying the cassandra_latest.yaml content to your cassandra.yaml or pointing the cassandra.config JVM property to the cassandra_latest.yaml file.  

This configuration is designed for new Cassandra 5.0 clusters (or those evaluating Cassandra), ensuring they get the most out of the latest features in Cassandra 5.0 and performance improvements. 

Key changes and features 

Key Cache Size 

  • Old: Evaluated as the minimum of 5% of the heap and 100MB
  • Latest: Explicitly set to 0

Impact: Setting the key cache size to 0 in the latest configuration avoids performance degradation with the new SSTable format. This change is particularly beneficial for clusters using the new SSTable format, which doesn’t require key caching in the same way as the old format. Key caching was used to reduce the time it takes to find a specific key in Cassandra storage. 

Commit Log Disk Access Mode 

  • Old: Set to legacy
  • Latest: Set to auto

Impact: The auto setting optimizes the commit log disk access mode based on the available disks, potentially improving write performance. It can automatically choose the best mode (e.g., direct I/O) depending on the hardware and workload, leading to better performance without manual tuning.

Memtable Implementation 

  • Old: Skiplist-based
  • Latest: Trie-based

Impact: The trie-based memtable implementation reduces garbage collection overhead and improves throughput by moving more metadata off-heap. This change can lead to more efficient memory usage and higher write performance, especially under heavy load.

create table … with memtable = {'class': 'TrieMemtable', … }

Memtable Allocation Type 

  • Old: Heap buffers 
  • Latest: Off-heap objects 

Impact: Using off-heap objects for memtable allocation reduces the pressure on the Java heap, which can improve garbage collection performance and overall system stability. This is particularly beneficial for large datasets and high-throughput environments. 

Trickle Fsync 

  • Old: False 
  • Latest: True 

Impact: Enabling trickle fsync improves performance on SSDs by periodically flushing dirty buffers to disk, which helps avoid sudden large I/O operations that can impact read latencies. This setting is particularly useful for maintaining consistent performance in write-heavy workloads. 

SSTable Format 

  • Old: big 
  • Latest: bti (trie-indexed structure) 

Impact: The new BTI format is designed to improve read and write performance by using a trie-based indexing structure. This can lead to faster data access and more efficient storage management, especially for large datasets. 

sstable:
  selected_format: bti
  default_compression: zstd
  compression:
    zstd:
      enabled: true
      chunk_length: 16KiB
      max_compressed_length: 16KiB

Default Compaction Strategy 

  • Old: STCS (Size-Tiered Compaction Strategy) 
  • Latest: Unified Compaction Strategy 

Impact: The Unified Compaction Strategy (UCS) is more efficient and can handle a wider variety of workloads compared to STCS. UCS can reduce write amplification and improve read performance by better managing the distribution of data across SSTables. 

default_compaction:
  class_name: UnifiedCompactionStrategy
  parameters:
    scaling_parameters: T4
    max_sstables_to_compact: 64
    target_sstable_size: 1GiB
    sstable_growth: 0.3333333333333333
    min_sstable_size: 100MiB
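
UCS does not have to be a cluster-wide default to be useful: it can also be enabled per table through CQL. Below is a minimal sketch; the keyspace and table names are hypothetical, and additional UCS options can be supplied in the same map.

-- Switch a single (hypothetical) table to the Unified Compaction Strategy
ALTER TABLE shop.orders
WITH compaction = {'class': 'UnifiedCompactionStrategy'};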

Concurrent Compactors 

  • Old: Defaults to the smaller of the number of disks and cores
  • Latest: Explicitly set to 8

Impact: Setting the number of concurrent compactors to 8 ensures that multiple compaction operations can run simultaneously, helping to maintain read performance during heavy write operations. This is particularly beneficial for SSD-backed storage where parallel I/O operations are more efficient. 

Default Secondary Index 

  • Old: legacy_local_table
  • Latest: sai

Impact: SAI is a new index implementation that builds on the advancements made with SSTable Attached Secondary Indexes (SASI). It provides a solution that enables users to index multiple columns on the same table without suffering scaling problems, especially at write time. 
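
To make the change concrete, here is a minimal CQL sketch; the keyspace, table, index, and column names are hypothetical. With sai as the default implementation, a plain CREATE INDEX creates a storage-attached index, and the implementation can also be named explicitly.

CREATE TABLE shop.orders (
    order_id uuid PRIMARY KEY,
    customer_id uuid,
    total decimal
);

-- Picks up the default secondary index implementation (SAI with cassandra_latest.yaml)
CREATE INDEX orders_customer_idx ON shop.orders (customer_id);

-- The same thing with the implementation spelled out
CREATE INDEX orders_total_idx ON shop.orders (total) USING 'sai';

-- SAI-indexed columns can be combined and support range conditions
SELECT * FROM shop.orders
WHERE customer_id = 0b90e1c0-53cc-4e5b-a16e-9b9c7a6dbf3b AND total > 100;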

Stream Entire SSTables 

  • Old: implicitly set to True
  • Latest: explicitly set to True

Impact: When enabled, it permits Cassandra to zero-copy stream entire eligible SSTables between nodes, including every component. This speeds up network transfer significantly, subject to the throttling specified by

entire_sstable_stream_throughput_outbound

and

entire_sstable_inter_dc_stream_throughput_outbound

for inter-DC transfers. 

UUID SSTable Identifiers 

  • Old: False
  • Latest: True

Impact: Enabling UUID-based SSTable identifiers ensures that each SSTable has a unique name, simplifying backup and restore operations. This change reduces the risk of name collisions and makes it easier to manage SSTables in distributed environments. 

Storage Compatibility Mode 

  • Old: Cassandra 4
  • Latest: None

Impact: Setting the storage compatibility mode to none enables all new features by default, allowing users to take full advantage of the latest improvements, such as the new sstable format, in Cassandra. This setting is ideal for new clusters or those that do not need to maintain backward compatibility with older versions. 

Testing and validation 

The cassandra_latest.yaml configuration has undergone rigorous testing to ensure it works seamlessly. Currently, the Cassandra project CI pipeline tests both the standard (cassandra.yaml) and latest (cassandra_latest.yaml) configurations, ensuring compatibility and performance. This includes unit tests, distributed tests, and DTests. 

Future improvements 

Future improvements may include enforcing password strength policies and other security enhancements. The community is encouraged to suggest features that could be enabled by default in cassandra_latest.yaml. 

Conclusion 

The cassandra_latest.yaml configuration for new Cassandra 5.0 clusters is a significant step forward in making Cassandra more performant and feature-rich while maintaining the stability and reliability that users expect. Whether you are a developer, an operator, or an evangelist/end user, cassandra_latest.yaml offers something valuable for everyone. 

Try it out 

Ready to experience the incredible power of the cassandra_latest.yaml configuration on Apache Cassandra 5.0? Spin up your first cluster with a free trial on the Instaclustr Managed Platform and get started today with Cassandra 5.0!

The post New cassandra_latest.yaml configuration for a top performant Apache Cassandra® appeared first on Instaclustr.

Instaclustr for Apache Cassandra® 5.0 Now Generally Available

NetApp is excited to announce the general availability (GA) of Apache Cassandra® 5.0 on the Instaclustr Platform. This follows the release of the public preview in March.

NetApp was the first managed service provider to release the beta version, and now the Generally Available version, allowing the deployment of Cassandra 5.0 across the major cloud providers: AWS, Azure, and GCP, and on-premises.

Apache Cassandra has been a leader in NoSQL databases since its inception and is known for its high availability, reliability, and scalability. The latest version brings many new features and enhancements, with a special focus on building data-driven applications through artificial intelligence and machine learning capabilities.

Cassandra 5.0 will help you optimize performance, lower costs, and get started on the next generation of distributed computing by: 

  • Helping you build AI/ML-based applications through Vector Search  
  • Bringing efficiencies to your applications through new and enhanced indexing and processing capabilities 
  • Improving flexibility and security 

With the GA release, you can use Cassandra 5.0 for your production workloads, which are covered by NetApp’s industry-leading SLAs. NetApp has conducted performance benchmarking and extensive testing while removing the limitations that were present in the preview release to offer a more reliable and stable version. Our GA offering is suitable for all workload types as it contains the most up-to-date range of features, bug fixes, and security patches.  

Support for continuous backups and private network add-ons is available. Currently, Debezium is not yet compatible with Cassandra 5.0. NetApp will work with the Debezium community to add support for Debezium on Cassandra 5.0, and it will be available on the Instaclustr Platform as soon as it is supported. 

Some of the key new features in Cassandra 5.0 include: 

  • Storage-Attached Indexes (SAI): A highly scalable, globally distributed index for Cassandra databases. With SAI, column-level indexes can be added, leading to unparalleled I/O throughput for searches across different data types, including vectors. SAI also enables lightning-fast data retrieval through zero-copy streaming of indices, resulting in unprecedented efficiency.  
  • Vector Search: This is a powerful technique for searching relevant content or discovering connections by comparing similarities in large document collections and is particularly useful for AI applications. It uses storage-attached indexing and dense indexing techniques to enhance data exploration and analysis (see the CQL sketch after this list).  
  • Unified Compaction Strategy: This strategy unifies compaction approaches, including leveled, tiered, and time-windowed strategies. It leads to a major reduction in SSTable sizes. Smaller SSTables mean better read and write performance, reduced storage requirements, and improved overall efficiency.  
  • Numerous stability and testing improvements: You can read all about these changes here. 
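
To give a flavour of how Storage-Attached Indexes and Vector Search come together, here is a minimal CQL sketch; the keyspace, table, index, and column names are hypothetical, and a real embedding would normally have far more than 3 dimensions.

CREATE TABLE demo.products (
    id uuid PRIMARY KEY,
    description text,
    embedding vector<float, 3>  -- tiny dimension, purely for illustration
);

-- Storage-attached index on the vector column
CREATE INDEX products_embedding_idx ON demo.products (embedding) USING 'sai';

-- Approximate nearest-neighbour search, ordered by similarity to the query vector
SELECT id, description
FROM demo.products
ORDER BY embedding ANN OF [0.1, 0.2, 0.3]
LIMIT 10;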

All these new features are available out-of-the-box in Cassandra 5.0 and do not incur additional costs.  

Our Development team has worked diligently to bring you a stable release of Cassandra 5.0. Substantial preparatory work was done to ensure you have a seamless experience with Cassandra 5.0 on the Instaclustr Platform. This includes updating the Cassandra YAML and Java environment and enhancing the monitoring capabilities of the platform to support new data types.  

We also conducted extensive performance testing and benchmarked version 5.0 against the existing stable Apache Cassandra 4.1.5 release. We will be publishing our benchmarking results shortly; the highlight so far is that Cassandra 5.0 improves responsiveness by reducing latencies by up to 30% during peak load times.  

Through our dedicated Apache Cassandra committer, NetApp has contributed to the development of Cassandra 5.0 by enhancing the documentation for new features like Vector Search (Cassandra-19030), enabling Materialized Views (MV) with only partition keys (Cassandra-13857), fixing numerous bugs, and contributing to the improvements for the unified compaction strategy feature, among many other things. 

Lifecycle Policy Updates 

As previously communicated, the project will no longer maintain Apache Cassandra 3.0 and 3.11 versions (full details of the announcement can be found on the Apache Cassandra website).

To help you transition smoothly, NetApp will provide extended support for these versions for an additional 12 months. During this period, we will backport any critical bug fixes, including security patches, to ensure the continued security and stability of your clusters. 

Cassandra 3.0 and 3.11 versions will reach end-of-life on the Instaclustr Managed Platform within the next 12 months. We will work with you to plan and upgrade your clusters during this period.  

Additionally, the Cassandra 5.0 beta version and the Cassandra 5.0 RC2 version, which were released as part of the public preview, are now end-of-life. You can check the lifecycle status of different Cassandra application versions here.  

You can read more about our lifecycle policies on our website. 

Getting Started 

Upgrading to Cassandra 5.0 will allow you to stay current and start taking advantage of its benefits. The Instaclustr by NetApp Support team is ready to help customers upgrade clusters to the latest version.  

  • Wondering if it’s possible to upgrade your workloads from Cassandra 3.x to Cassandra 5.0? Find the answer to this and other similar questions in this detailed blog.
  • Click here to read about Storage Attached Indexes in Apache Cassandra 5.0.
  • Learn about 4 new Apache Cassandra 5.0 features to be excited about. 
  • Click here to learn what you need to know about Apache Cassandra 5.0. 

Why Choose Apache Cassandra on the Instaclustr Managed Platform? 

NetApp strives to deliver the best of the applications we support. Whether it’s making the latest application versions available on the platform or adding platform enhancements, we ensure high quality through thorough testing before entering General Availability.  

NetApp customers have the advantage of accessing the latest versions—not just the major version releases but also minor version releases—so that they can benefit from any new features and are protected from any vulnerabilities.  

Don’t have an Instaclustr account yet? Sign up for a trial or reach out to our Sales team and start exploring Cassandra 5.0.  

With more than 375 million node hours of management experience, Instaclustr offers unparalleled expertise. Visit our website to learn more about the Instaclustr Managed Platform for Apache Cassandra.  

If you would like to upgrade your Apache Cassandra version or have any issues or questions about provisioning your cluster, please contact Instaclustr Support at any time.  

The post Instaclustr for Apache Cassandra® 5.0 Now Generally Available appeared first on Instaclustr.

Apache Cassandra® 5.0: Behind the Scenes

Here at NetApp, our Instaclustr product development team has spent nearly a year preparing for the release of Apache Cassandra 5.  

It started with one engineer tinkering at night with the Apache Cassandra 5 Alpha branch, and grew to 5 engineers working on various monitoring, configuration, testing, and functionality improvements to integrate the release with the Instaclustr Platform.  

It’s been a long journey to the point we are at today, offering Apache Cassandra 5 Release Candidate 1 in public preview on the Instaclustr Platform. 

Note: the Instaclustr team has a dedicated open source committer to the Apache Cassandra project. His changes are not included in this document, as there were too many for us to include here. Instead, this blog primarily focuses on the engineering effort to release Cassandra 5.0 onto the Instaclustr Managed Platform. 

August 2023: The Beginning

We began experimenting with the Apache Cassandra 5 Alpha 1 branches using our build systems. There were several tools we built into our Apache Cassandra images that were not working at this point, but we managed to get a node to start even though it immediately crashed with errors.  

One of our early achievements was identifying and fixing a bug that impacted our packaging solution; this resulted in a small contribution to the project allowing Apache Cassandra to be installed on Debian systems with non-OpenJDK Java. 

September 2023: First Milestone 

The release of the Alpha 1 version allowed us to achieve our first running Cassandra 5 cluster in our development environments (without crashing!).  

Basic core functionalities like user creation, data writing, and backups/restores were tested successfully. However, several advanced features, such as repair and replace tooling, monitoring, and alerting were still untested.  

At this point we had to pause our Cassandra 5 efforts to focus on other priorities and planned to get back to testing Cassandra 5 after Alpha 2 was released. 

November 2023: Further Testing and Internal Preview 

The project released Alpha 2, and we repeated the same build and test process we had used for Alpha 1. We also tested some more advanced procedures, like cluster resizes, with no issues.  

We also started testing with some of the new 5.0 features: Vector Data types and Storage-Attached Indexes (SAI), which resulted in another small contribution.  

We launched Apache Cassandra 5 Alpha 2 for internal preview (basically for internal users). This allowed the wider Instaclustr team to access and use the Alpha on the platform.  

During this phase we found a bug in our metrics collector when vectors were encountered that ended up being a major project for us. 

If you see errors like the below, it’s time for a Java Cassandra driver upgrade to 4.16 or newer: 

java.lang.IllegalArgumentException: Could not parse type name vector<float, 5>  
Nov 15 22:41:04 ip-10-0-39-7 process[1548]: at com.datastax.driver.core.DataTypeCqlNameParser.parse(DataTypeCqlNameParser.java:233)  
Nov 15 22:41:04 ip-10-0-39-7 process[1548]: at com.datastax.driver.core.TableMetadata.build(TableMetadata.java:311)
Nov 15 22:41:04 ip-10-0-39-7 process[1548]: at com.datastax.driver.core.SchemaParser.buildTables(SchemaParser.java:302)
Nov 15 22:41:04 ip-10-0-39-7 process[1548]: at com.datastax.driver.core.SchemaParser.refresh(SchemaParser.java:130)
Nov 15 22:41:04 ip-10-0-39-7 process[1548]: at com.datastax.driver.core.ControlConnection.refreshSchema(ControlConnection.java:417)  
Nov 15 22:41:04 ip-10-0-39-7 process[1548]: at com.datastax.driver.core.ControlConnection.refreshSchema(ControlConnection.java:356)  
<Rest of stacktrace removed for brevity>

December 2023: Focus on new features and planning 

As the project released Beta 1, we began focusing on the features in Cassandra 5 that we thought were the most exciting and would provide the most value to customers. There are a lot of awesome new features and changes, so it took a while to find the ones with the largest impact.  

 The final list of high impact features we came up with was: 

  • A new data type: Vectors 
  • Trie memtables/Trie Indexed SSTables (BTI Formatted SStables) 
  • Storage-Attached Indexes (SAI) 
  • Unified Compaction Strategy 

A major new feature we considered deploying was support for JDK 17. However, due to its experimental nature, we have opted to postpone adoption and plan to support running Apache Cassandra on JDK 17 when it’s out of the experimentation phase. 

Once the holiday season arrived, it was time for a break, and we were back in force in February next year. 

February 2024: Intensive testing 

In February, we released Beta 1 into internal preview so we could start testing it in our pre-production test environments. As we started more intensive testing, we discovered issues in the interaction with our monitoring and provisioning setup. 

We quickly fixed the issues identified as showstoppers for launching Cassandra 5. By the end of February, we initiated discussions about a public preview release. We also started to add more resourcing to the Cassandra 5 project. Up until now, only one person was working on it.  

Next, we broke down the work we needed to do. This included identifying monitoring agents that required upgrades and config defaults that needed to change. 

From this point, the project split into 3 streams of work: 

  1. Project Planning – Deciding how all this work gets pulled together cleanly, ensuring other work streams have adequate resourcing to hit their goals, and informing product management and the wider business of what’s happening.  
  2. Configuration Tuning – Focusing on the new features of Apache Cassandra to include, how to approach the transition to JDK 17, and how to use BTI formatted SSTables on the platform.  
  3. Infrastructure Upgrades – Identifying what to upgrade internally to handle Cassandra 5, including Vectors and BTI formatted SSTables. 

A Senior Engineer was responsible for each workstream to ensure planned timeframes were achieved. 

March 2024: Public Preview Release 

In March, we launched Beta 1 into public preview on the Instaclustr Managed Platform. The initial release did not contain any opt-in features like Trie indexed SSTables. 

However, this gave us a consistent base to test in our development, test, and production environments, and proved our release pipeline for Apache Cassandra 5 was working as intended. This also gave customers the opportunity to start using Apache Cassandra 5 with their own use cases and environments for experimentation.  

See our public preview launch blog for further details. 

There was not much time to celebrate as we continued working on infrastructure and refining our configuration defaults. 

April 2024: Configuration Tuning and Deeper Testing 

The first configuration updates were completed for Beta 1, and we started performing deeper functional and performance testing. We identified a few issues from this effort and remediated them. This default configuration was applied to all Beta 1 clusters moving forward.  

This allowed users to start testing Trie Indexed SSTables and Trie memtables in their environment by default. 

"memtable": 
  { 
    "configurations": 
      { 
        "skiplist": 
          { 
            "class_name": "SkipListMemtable" 
          }, 
        "sharded": 
          { 
            "class_name": "ShardedSkipListMemtable" 
          }, 
        "trie": 
          { 
            "class_name": "TrieMemtable" 
          }, 
        "default": 
          { 
            "inherits": "trie" 
          } 
      } 
  }, 
"sstable": 
  { 
    "selected_format": "bti" 
  }, 
"storage_compatibility_mode": "NONE",

The above configuration illustrates an Apache Cassandra YAML setup where BTI formatted SSTables are used by default (which allows Trie Indexed SSTables) and Trie memtables are the default. You can override this per table: 

CREATE TABLE test … WITH memtable = {'class' : 'ShardedSkipListMemtable'};

Note that you need to set storage_compatibility_mode to NONE to use BTI formatted SSTables. See the Cassandra documentation for more information. 

You can also reference the cassandra_latest.yaml  file for the latest settings (please note you should not apply these to existing clusters without rigorous testing). 

May 2024: Major Infrastructure Milestone 

We hit a very large infrastructure milestone when we released an upgrade to some of our core agents that were reliant on an older version of the Apache Cassandra Java driver. The upgrade to version 4.17 allowed us to start supporting vectors in certain keyspace level monitoring operations.  

At the time, this was considered the riskiest part of the entire project, as we had thousands of nodes to upgrade across many different customer environments. This upgrade took a few weeks, finishing in June. We broke the release up into 4 separate rollouts to reduce the risk of introducing issues into our fleet, focusing on a single key component of our architecture in each release. Each release had quality gates and tested rollback plans, which in the end were not needed. 

June 2024: Successful Rollout of the New Cassandra Driver 

The Java driver upgrade project was rolled out to all nodes in our fleet with no issues encountered. At this point we had hit all the major milestones before Release Candidates became available, and we started looking at updating our testing systems to use Apache Cassandra 5 by default. 

July 2024: Path to Release Candidate 

We upgraded our internal testing systems to use Cassandra 5 by default, meaning our nightly platform tests began running against Cassandra 5 clusters and our production releases would be smoke tested using Apache Cassandra 5. We also started testing the upgrade path for clusters from 4.x to 5.0, which resulted in another small contribution to the Cassandra project.  

The Apache Cassandra project released Apache Cassandra 5 Release Candidate 1 (RC1), and we launched RC1 into public preview on the Instaclustr Platform. 

The Road Ahead to General Availability 

We’ve just launched Apache Cassandra 5 Release Candidate 1 (RC1) into public preview, and there’s still more to do before we reach General Availability for Cassandra 5, including: 

  • Upgrading our own preproduction Apache Cassandra for internal use to Apache Cassandra 5 Release Candidate 1. This means we’ll be testing using our real-world use cases and testing our upgrade procedures on live infrastructure. 

At Launch: 

When Apache Cassandra 5.0 launches, we will perform another round of testing, including performance benchmarking. We will also upgrade our internal metrics storage production Apache Cassandra clusters to 5.0, and, if the results are satisfactory, we will mark the release as generally available for our customers. We want to have full confidence in running 5.0 before we recommend it for production use to our customers.  

For more information about our own usage of Cassandra for storing metrics on the Instaclustr Platform check out our series on Monitoring at Scale.  

What Have We Learned From This Project? 

  • Releasing limited, small and frequent changes has resulted in a smooth project, even if sometimes frequent releases do not feel smooth. Some thoughts: 
    • Releasing to a small subset of internal users allowed us to take risks and break things more often so we could learn from our failures safely.
    • Releasing small changes allowed us to more easily understand and predict the behaviour of our changes: what to look out for in case things went wrong, how to more easily measure success, etc. 
    • Releasing frequently built confidence within the wider Instaclustr team, which in turn meant we would be happier taking more risks and could release more often.  
  • Releasing to internal and public preview helped create momentum within the Instaclustr business and teams:  
    • This turned the Apache Cassandra 5.0 release from something that “was coming soon and very exciting” to “something I can actually use.”
  • Communicating frequently, transparently, and efficiently is the foundation of success:  
    • We used a dedicated Slack channel (very creatively named #cassandra-5-project) to discuss everything. 
    • It was quick and easy to go back to see why we made certain decisions or revisit them if needed. This had a bonus of allowing a Lead Engineer to write a blog post very quickly about the Cassandra 5 project. 

This has been a long-running but very exciting project for the entire team here at Instaclustr. The Apache Cassandra community is on the home stretch for this massive release, and we couldn’t be more excited to start seeing what everyone will build with it.  

You can sign up today for a free trial and test Apache Cassandra 5 Release Candidate 1 by creating a cluster on the Instaclustr Managed Platform.  

More Readings 

 

The post Apache Cassandra® 5.0: Behind the Scenes appeared first on Instaclustr.

Building a 100% ScyllaDB Shard-Aware Application Using Rust

I wrote a web transcript of the talk I gave with my colleagues Joseph and Yassir at [Scylla Su...

Learning Rust the hard way for a production Kafka+ScyllaDB pipeline

This is the web version of the talk I gave at [Scylla Summit 2022](https://www.scyllad...

On Scylla Manager Suspend & Resume feature

!!! warning "Disclaimer" This blog post is neither a rant nor intended to undermine the great work that...

Renaming and reshaping Scylla tables using scylla-migrator

We have recently faced a problem where some of the first Scylla tables we created on our main production cluster were not in line any more with the evolved s...

Python scylla-driver: how we unleashed the Scylla monster's performance

At Scylla summit 2019 I had the chance to meet Israel Fruchter and we dreamed of working on adding **shard...

Scylla Summit 2019

I've had the pleasure to attend again and present at the Scylla Summit in San Francisco and the honor to be awarded the...

Scylla: four ways to optimize your disk space consumption

We recently had to face free disk space outages on some of our scylla clusters and we learnt some very interesting things while outlining some improvements t...