The Developer’s Data Modeling Cheat Guide
In Cassandra 5.0, storage-attached indexes offer a new way to interact with your data, giving developers the flexibility to query multiple columns with filtering, range conditions, and better performance.

ScyllaDB’s Engineering Summit in Sofia, Bulgaria
From hacking to hiking: what happens when engineering sea monsters get together

ScyllaDB is a remote-first company, with team members spread across the globe. We’re masters of virtual connection, but every year we look forward to the chance to step away from our screens and come together for our Engineering Summit. It’s a time to reconnect, exchange ideas, and share the spontaneous moments of joy that make working together so special. This year, we gathered in Sofia, Bulgaria — a city rich in history and culture, set against the stunning backdrop of the Balkan mountains.

Where Monsters Meet

This year’s summit brought together a record-breaking number of participants from across the globe—over 150! As ScyllaDB continues to grow, the turnout reflects our momentum and our team’s expanding global reach. We, ScyllaDB Monsters from all corners of the world, came together to share knowledge, build connections, and collaborate on shaping the future of our company and product.

A team-building activity
An elevator ride to one of the sessions

The summit brought together not just the engineering teams but also our Customer Experience (CX) colleagues. With their insights into the real-world experiences of our customers, the CX team helped us see the bigger picture and better understand how our work impacts those who use our product.

The CX team

Looking Inward, Moving Forward

The summit was packed with insightful talks, giving us a chance to reflect on where we are and where we’re heading next. It was all about looking back at the wins we’ve had so far, getting excited about the new features we’re working on now, and diving deep into what’s coming down the pipeline.

CEO and Co-founder, Dor Laor, kicking off the summit

The sessions sparked fruitful discussions about how we can keep pushing forward and build on the strong foundation we’ve already laid. The speakers touched on a variety of topics, including:

- ScyllaDB X Cloud
- Consistent topology
- Data distribution with tablets
- Object storage
- Tombstone garbage collection
- Customer stories
- Improving customer experience
- And many more

Notes and focus doodles

Collaboration at Its Best: The Hackathon

This year, we took the summit energy to the next level with a hackathon that brought out the best in creativity, collaboration, and problem-solving. Participants were divided into small teams, each tackling a unique problem. The projects were chosen ahead of time so that we had a chance to work on real challenges that could make a tangible impact on our product and processes.

The range of projects was diverse. Some teams focused on adding new features, like implementing a notification API to enhance the user experience. Others took on documentation-related challenges, improving the way we share knowledge. But across the board, every team managed to create a functioning solution or prototype.

At the hackathon

The hackathon brought people from different teams together to tackle complex issues, pushing everyone a bit outside their comfort zone. Beyond the technical achievements, it was a powerful team-building experience, reinforcing our culture of collaboration and shared purpose. It reminded us that solving real-life challenges—and doing it together—makes our work even more rewarding. The hackathon will undoubtedly be a highlight of future summits!

From Development to Dance

And then, of course, came the party.
The atmosphere shifted from work to celebration with live music from a band playing all-time hits, followed by a DJ spinning tracks that kept everyone on their feet.

Live music at the party

Almost everyone hit the dance floor—even those who usually prefer to sit it out couldn’t resist the rhythm. It was the perfect way to unwind and celebrate the success of the summit!

Sea monsters swaying

Exploring Sofia and Beyond

During our time in Sofia, we had the chance to immerse ourselves in the city’s rich history and culture. Framed by the dramatic Balkan mountains, Sofia blends the old with the new, offering a mix of history, culture, and modern vibe. We wandered through the ancient ruins of the Roman Theater and visited the iconic Alexander Nevsky Cathedral, marveling at their beauty and historical significance. To recharge our batteries, we enjoyed delicious meals in modern Bulgarian restaurants.

In front of Alexander Nevsky Cathedral

But the adventure didn’t stop in the city. We took a day trip to the Rila Mountains, where the breathtaking landscapes and serene atmosphere left us in awe. One of the standout sights was the Rila Monastery, a UNESCO World Heritage site known for its stunning architecture and spiritual significance.

The Rila Monastery

After soaking in the peaceful vibes of the monastery, we hiked the trail leading to the Stob Earth Pyramids, a natural wonder that looked almost otherworldly.

The Stob Pyramids

The hike was rewarding, offering stunning views of the mountains and the unique rock formations below. It was the perfect way to experience Bulgaria’s natural beauty while winding down from the summit excitement.

Happy hiking

Looking Ahead to the Future

As we wrapped up this year’s summit, we left feeling energized by the connections made, ideas shared, and challenges overcome. From brainstorming ideas to clinking glasses around the dinner table, this summit was a reminder of why in-person gatherings are so valuable—connecting not just as colleagues but as a team united by a common purpose. As ScyllaDB continues to expand, we’re excited for what lies ahead, and we can’t wait to meet again next year. Until then, we’ll carry the lessons, memories, and new friendships with us as we keep moving forward. Чао! (Bulgarian for “bye!”)

We’re hiring – join our team!

Our team

How ScyllaDB Simulates Real-World Production Workloads with the Rust-Based “latte” Benchmarking Tool
Learn why we use a little-known benchmarking tool for testing

Before using a tech product, it’s always nice to know its capabilities and limits. In the world of databases, there are a lot of different benchmarking tools that help us assess that… If you’re ok with some standard benchmarking scenarios, you’re set – one of the existing tools will probably serve you well. But what if not? Rigorously assessing ScyllaDB, a high-performance distributed database, requires testing some rather specific scenarios, ones with real-world production workloads. Fortunately, there is a tool to help with that. It is latte: a Rust-based lightweight benchmarking tool for Apache Cassandra and ScyllaDB. Special thanks to Piotr Kołaczkowski for implementing the latte benchmarking tool. We (the ScyllaDB testing team) forked it and enhanced it. In this blog post, I’ll share why and how we adapted it for our specialized testing needs.

About latte

Our team really values latte’s “flexibility.” Want to create a schema using a user-defined type (UDT), Map, Set, List, or any other data type? Ok. Want to create a materialized view and query it? Ok. Want to change custom function behavior based on elapsed time? Ok. Want to run multiple custom functions in parallel? Ok. Want to use small, medium, and large partitions? Ok. Basically, latte lets us define any schema and workload functions. We can do this thanks to its implementation design. The
latte tool is a type of engine/kernel, and rune scripts are essentially the “business logic” that’s written separately. Rune scripts are an enhanced, more powerful analog of what cassandra-stress calls user profiles. The rune scripting language is dynamically typed and native to the Rust programming language ecosystem. Here’s a simple example of a rune script:
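(What follows is an illustrative sketch modeled on the upstream latte workload examples, not the exact script from our tests: the helper calls – latte::param!, db.execute, db.prepare, db.execute_prepared – follow upstream latte, and the keyspace, table, and parameter names are assumptions.)

// Illustrative rune script (test.rn); names and CQL are examples only
const REPLICATION_FACTOR = latte::param!("replication_factor", 3);

const KEYSPACE = "test_ks";
const TABLE = "test_table";

// Required function: creates the keyspace and table.
pub async fn schema(db) {
    db.execute(`CREATE KEYSPACE IF NOT EXISTS ${KEYSPACE} WITH REPLICATION = {'class': 'NetworkTopologyStrategy', 'replication_factor': ${REPLICATION_FACTOR}}`).await?;
    db.execute(`CREATE TABLE IF NOT EXISTS ${KEYSPACE}.${TABLE} (id bigint PRIMARY KEY, data text)`).await?;
}

// Required function: registers the prepared statements used by the workload functions.
pub async fn prepare(db) {
    db.prepare("insert", `INSERT INTO ${KEYSPACE}.${TABLE} (id, data) VALUES (:id, :data)`).await?;
}

// Custom workload function, referenced by name from the command line.
pub async fn myinsert(db, i) {
    db.execute_prepared("insert", [i, `payload-${i}`]).await?;
}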
In the above example, we defined two required functions (schema and prepare) and one custom function to be used as our workload, myinsert. First, we create the schema:
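(Sketch only: the schema subcommand and trailing node address follow upstream latte usage; the file name and the parameter-override flag are assumptions, so verify with latte schema --help.)

# Create the keyspace and table defined in the script's schema() function
latte schema test.rn 172.17.0.2
# Custom parameters such as replication_factor can be overridden on the command line,
# e.g. with a "-P replication_factor=5" style option (assumed flag; check --help)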
Then, we use the latte run command to call our custom myinsert function:
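(Sketch only: run, -f/--function, -d/--duration, and -r/--rate are upstream latte options; the way custom parameters are passed is an assumption here.)

# Run the custom myinsert function for 60 seconds at 10,000 ops/s
latte run test.rn -f myinsert -d 60s -r 10000 172.17.0.2

# Override the custom replication_factor parameter (assumed "-P" syntax; verify with latte run --help)
latte run test.rn -f myinsert -P replication_factor=5 172.17.0.2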
The replication_factor parameter above is a custom parameter. If we do not specify it, latte will use its default value of 3. We can define any number of custom
parameters.

How is latte different from other benchmarking tools?
Based on our team’s experiences, here’s how latte compares to its two main competitors, cassandra-stress and YCSB:

How is our fork of latte different from the
original latte project?

At ScyllaDB, our main use case for latte is
testing complex and realistic customer scenarios with controlled
disruptions. But (from what we understand), the project was
designed to perform general latency measurements in healthy DB
clusters. Given these different goals, we changed some features
(“overlapping features”) – and added other new ones (“unique to our
fork”). Here’s an overview:

Overlapping feature differences
- Latency measurement. Fork-latte accounts for coordinated omission in latencies. The original project doesn’t consider the “coordinated omission” phenomenon.
- Saturated DB impact. When a system under test cannot keep up with the load/stress, fork-latte tries to satisfy the “rate,” compensating for missed scheduler ticks as soon as possible. Source-latte backs off when the rate requirement is violated and doesn’t later compensate for missed scheduler ticks. This isn’t a “bug”; it is a design decision, but one that also conflicts with proper latency calculation in the presence of coordinated omission.
- Retries. We enabled retries by default; in the original, they are disabled by default.
- Prepared statements. Fork-latte supports all the CQL data types available in the ScyllaDB Rust Driver. The source project has limited support for CQL data types.
- ScyllaDB Rust Driver. Our fork uses the latest version, 1.2.0. The source project sticks to the old version, 0.13.2.
- Stress execution reporting. The report is disabled by default in fork-latte; it’s enabled in source-latte.

Features unique to our fork

- Preferred datacenter support. Useful for testing multi-DC DB setups.
- Preferred rack support. Useful for testing multi-rack DB setups.
- Possibility to get a list of actual datacenter values from the DB nodes that the driver connected to. Useful for creating schemas with dc-based keyspaces.
- Sine-wave rate limiting. Useful for SLA/Workload Prioritization demos and OLTP testing with peaks and lows.
- Batch query type support.
- Multi-row partitions. Our fork can create multi-row partitions of different sizes.
- Page size support for select queries. Useful when using the multi-row partitions feature.
- HDR histograms support. The source project has only one way to get the HDR histograms data: it stores HDR histograms data in RAM until the end of a stress command execution and only releases it at the end as part of a report, which leaks RAM. Forked latte supports the above inherited approach and one more: real-time streaming of HDR histogram data without storing it in RAM (no RAM leaks).
- Rows count validation for select queries. Useful for testing data resurrection.

Example: Testing multi-row partitions of different sizes

Let’s look at one specific user scenario where
we applied our fork of latte to test ScyllaDB. For background, one
of the user’s ScyllaDB production clusters was using large
partitions which could be grouped by size into three groups: 2000, 4000
and 8000 rows per partition. 95% of the partitions had 2000 rows,
4% of partitions had 4000 rows, and the last 1% of partitions had
8000 rows. The target table had 20+ different columns of different
types. Also, ScyllaDB’s Secondary Indexes (SI) feature was enabled
for the target table. One day, on one of the cloud providers,
latencies spiked and throughput dropped. The source of the problem
was not immediately clear. To learn more, we needed to have a quick
way to reproduce the customer’s workload in a test environment.
Using the latte
tool and its great flexibility, we
created a rune script covering all the above specifics. The simplified rune script looks like the following:
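(The sketch below is illustrative only: the helper calls follow the upstream latte examples, the table has only a few columns instead of the 20+ real ones, and all names, sizes, and CQL are assumptions rather than the actual reproducer.)

const ROW_COUNT = latte::param!("row_count", 100000);

const KEYSPACE = "big_partitions_ks";
const TABLE = "events";
const MV = "events_by_status";

pub async fn schema(db) {
    db.execute(`CREATE KEYSPACE IF NOT EXISTS ${KEYSPACE} WITH REPLICATION = {'class': 'NetworkTopologyStrategy', 'replication_factor': 3}`).await?;
    // The real table had 20+ columns of different types plus secondary indexes; a few columns suffice here.
    db.execute(`CREATE TABLE IF NOT EXISTS ${KEYSPACE}.${TABLE} (pk bigint, ck bigint, status text, payload text, PRIMARY KEY (pk, ck))`).await?;
    db.execute(`CREATE MATERIALIZED VIEW IF NOT EXISTS ${KEYSPACE}.${MV} AS SELECT * FROM ${KEYSPACE}.${TABLE} WHERE pk IS NOT NULL AND ck IS NOT NULL AND status IS NOT NULL PRIMARY KEY (status, pk, ck)`).await?;
}

pub async fn prepare(db) {
    db.prepare("ins", `INSERT INTO ${KEYSPACE}.${TABLE} (pk, ck, status, payload) VALUES (:pk, :ck, :status, :payload)`).await?;
    db.prepare("sel", `SELECT * FROM ${KEYSPACE}.${TABLE} WHERE pk = :pk`).await?;
    db.prepare("sel_mv", `SELECT * FROM ${KEYSPACE}.${MV} WHERE status = :status`).await?;
}

// 95% of partitions get 2000 rows, 4% get 4000 rows, and 1% get 8000 rows.
fn rows_for(pk) {
    let bucket = pk % 100;
    if bucket < 95 { 2000 } else if bucket < 99 { 4000 } else { 8000 }
}

// Populates one whole partition per invocation.
pub async fn insert(db, i) {
    let pk = i % ROW_COUNT;
    for ck in 0..rows_for(pk) {
        let s = ck % 3;
        db.execute_prepared("ins", [pk, ck, `status-${s}`, `payload-${ck}`]).await?;
    }
}

pub async fn get(db, i) {
    db.execute_prepared("sel", [i % ROW_COUNT]).await?;
}

pub async fn get_from_mv(db, i) {
    let s = i % 3;
    db.execute_prepared("sel_mv", [`status-${s}`]).await?;
}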
Assume we have a ScyllaDB cluster where one of the nodes has the IP address 172.17.0.2. Here is the command to create the schema we need:
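(File name assumed; subcommand and node address as in the sketch above.)

latte schema multirow.rn 172.17.0.2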
And here is the command to populate the just-created table:
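(A sketch that drives the insert function shown earlier; choose a duration and rate suited to the data set, and check latte run --help for the exact options.)

latte run multirow.rn -f insert -d 30m 172.17.0.2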
To read from the main table and from the MV, use a similar command – just replacing the function name with
get
and get_from_mv
respectively. So, the
usage of the above commands allowed us to get a stable issue
reproducer and work on its solution.

Working with ScyllaDB’s Workload Prioritization feature

In other cases, we needed to:
- Create a Workload Prioritization (WLP) demo.
- Test an OLTP setup with continuous peaks and lows to showcase giving priority to different workloads.

And for these scenarios, we used a special latte feature
called sine wave rate
. This is an extension to the
common rate-limiting
feature. It allows us to specify
how many operations per second we want to produce. It can be used
with the following command parameters:
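(Illustration only: -f and --rate are standard latte options, but the sine-wave-specific flag names below are hypothetical placeholders; the fork’s real option names are listed in its latte run --help.)

# Hypothetical flags for the sine-wave extension (actual names in the fork may differ):
#   --rate-sine-amplitude : swing above/below the base rate
#   --rate-sine-period    : length of one peak/low cycle
latte run workload.rn -f myinsert --rate 20000 --rate-sine-amplitude 15000 --rate-sine-period 10m 172.17.0.2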
And looking at the monitoring, we can see the following picture of the operations per second
graph:

Internal testing of tombstones (validation)

As of June 2025,
forked latte supports row count validation. It is useful for
testing
data resurrection. Here is the rune script for latte to
demonstrate these capabilities: As before, we create the schema
first: Then, we populate the table with 100k rows using the
following command: To check that all rows are in place, we use a
command similar to the one above, change the function to be
get
, and define the validation strategy to be
fail-fast
: The supported validation strategies are
retry
, fail-fast
, and ignore.
Then, we run 2 different commands in parallel. Here is the first
one, which deletes part of the rows: Here is the second one, which
knows when we expect 1 row and when we expect none: And here is the
timing of actions that take place during these 2 commands’ runtime:
That’s a simple example of how we can check whether data got
deleted or not. In long-running testing scenarios, we might run
more parallel commands, make them depend on the elapsed time, and use many other capabilities.

Conclusions

Yes, to take advantage
of latte, you first need to study a bit of rune
scripting. But
once you’ve done that to some extent, especially with examples available, it becomes a powerful tool capable of covering a wide variety of scenarios.

ScyllaDB Tablets: Answering Your Top Questions
What does your team need to know about tablets – at a purely pragmatic level? Here are answers to the top user questions.

The latest ScyllaDB releases feature some significant architectural shifts. Tablets build upon a multi-year project to re-architect our legacy ring architecture. And our metadata is now fully consistent, thanks to the assistance of Raft. Together, these changes can help teams with elasticity, speed, and operational simplicity. Avi Kivity, our CTO and co-founder, provided a detailed look at why and how we made this shift in a series of blogs (Why ScyllaDB Moved to “Tablets” Data Distribution and How We Implemented ScyllaDB’s “Tablets” Data Distribution).

Join Avi for a technical deep dive at our upcoming livestream

And we recently created this quick demo to show you what this looks like in action, from the point of view of a database user/operator:

But what does your team need to know – at a purely pragmatic level? Here are some of the questions we’ve heard from interested users, and a short summary of how we answer them.

What’s the TL;DR on tablets?

Tablets are the smallest replication unit in ScyllaDB. Data gets distributed by splitting tables into smaller logical pieces called tablets, and this allows ScyllaDB to shift from a static to a dynamic topology. Tablets are dynamically balanced across the cluster using the Raft consensus protocol. This was introduced as part of a project to bring more elasticity to ScyllaDB, enabling faster topology changes and seamless scaling.

Tablets acknowledge that most workloads do not follow a static traffic pattern. In fact, most follow a cyclical curve with different baselines and peaks over a period of time. By decoupling topology changes from the actual streaming of data, tablets therefore present significant cost-saving opportunities for users adopting ScyllaDB by allowing infrastructure to be scaled on demand, fast. Previously, adding or removing nodes required a sequential, one-at-a-time, serialized process with data streaming and rebalancing. Now, you can add or remove multiple nodes in parallel. This significantly speeds up the scaling process and makes ScyllaDB much more elastic.

Tablets are distributed on a per-table basis, with each table having its own set of tablets. The tablets are then further distributed across the shards in the ScyllaDB cluster. The distribution is handled automatically by ScyllaDB, with tablets being dynamically migrated across replicas as needed. Data within a table is split across tablets based on the average geometric size of a token range boundary.

How do I configure tablets?

Tablets are enabled by default in ScyllaDB 2025.1 and are also available with ScyllaDB Cloud. When creating a new keyspace, you can specify whether to enable tablets or not. There are also three key configuration options for tablets: 1) the enable_tablets boolean setting, 2) the target_tablet_size_in_bytes (default is 5GB), and 3) the tablets property during a CREATE KEYSPACE statement. Here are a few tips for configuring these settings: enable_tablets indicates whether newly created keyspaces should rely on tablets for data distribution. Note that tablets are currently not yet enabled for workloads requiring the use of Counter Tables, Secondary Indexes, Materialized Views, CDC, or LWT. target_tablet_size_in_bytes indicates the average geometric size of a tablet, and is particularly useful during tablet split and merge operations.
The default indicates splits are done when a tablet reaches 10GB and merges at 2.5GB. A higher value means tablet migration throughput can be reduced (due to larger tablets), whereas a lower value may significantly increase the number of tablets. The tablets property allows you to opt for tablets on a per-keyspace basis via the ‘enabled’ boolean sub-option. This is particularly important if some of your workloads rely on the currently unsupported features mentioned earlier: you may opt out for these tables and fall back to the still-supported vNode-based replication strategy. Still under the tablets property, the ‘initial’ sub-option determines how many tablets are created upfront on a per-table basis. We recommend that you target 100 tablets/shard. In future releases, we’ll introduce per-table tablet options to extend and simplify this process while deprecating the keyspace sub-option.

How/why should I monitor tablet distribution?

Starting with ScyllaDB Monitoring 4.7, we introduced two additional panels for the observability and distribution of tablets within a ScyllaDB cluster. These metrics are present within the Detailed dashboard under the Tablets section: The Tablets over time panel is a heatmap showing the tablet distribution over time. As the data size of a tablet-enabled table grows, you should observe the number of tablets increasing (tablet split) and being automatically distributed by ScyllaDB. Likewise, as the table size shrinks, the number of tablets should be reduced (tablet merge, but to no less than your initially configured ‘initial’ value within the keyspace tablets property). Similarly, as you perform topology changes (e.g., adding nodes), you can monitor the tablet distribution progress. You’ll notice that existing replicas will have their tablet count reduced while new replicas will see theirs increase.

The Tablets per DC/Instance/Shard panel shows the absolute count of tablets within the cluster. Under homogeneous setups running on top of instances of the same size, this metric should be evenly balanced. However, the situation changes for heterogeneous setups with different shard counts. In this situation, it is expected that larger instances will hold more tablets given their additional processing power. This is, in fact, yet another benefit of tablets: the ability to run heterogeneous setups and leave it up to the database to determine how to internally maximize each instance’s performance capabilities.

What are the impacts of tablets on maintenance tasks like node cleanup?

The primary benefit of tablets is elasticity. Tablets allow you to easily and quickly scale out and in your database infrastructure without hassle. This not only translates to infrastructure savings (like avoiding being overprovisioned for the peak all the time). It also allows you to reach a higher percentage of storage utilization before rushing to add more nodes – so you can better utilize the underlying infrastructure you pay for. Another key benefit of tablets is that they eliminate the need for maintenance tasks like node cleanup. Previously, after scaling out the cluster, operators would need to run node cleanup to ensure data was properly evicted from nodes that no longer owned certain token ranges. With tablets, this is no longer necessary. The compaction process automatically handles the migration of data as tablets are dynamically balanced across the cluster. This is a significant operational improvement that reduces the maintenance burden for ScyllaDB users.
The ability to now run heterogeneous deployments without running through cryptic and hours-long tuning cycles is also a plus. ScyllaDB’s tablet load balancer is smart enough to figure out how to distribute and place your data. It considers the amount of compute resources available, reducing the risk of traffic hotspots or data imbalances that may affect your clusters’ performance. In the future, ScyllaDB will bring transparent repairs on a per-tablet basis, further eliminating the need for users to worry about repairing their clusters, and also provide “temperature-based balancing” so that hot partitions get split and other shards cooperate with the incoming load.

Do I need to change drivers?

ScyllaDB’s latest drivers are tablet-aware, meaning they understand the tablets concept and can route queries to the correct nodes and shards. However, the drivers do not directly query the internal system.tablets table. That could become unwieldy as the number of tablets grows. Furthermore, tablets are transient, meaning a replica owning a tablet may no longer be a natural endpoint for it as time goes by. Instead, the drivers use a dynamic routing process: when a query is sent to the wrong node/shard, the coordinator will respond with the correct routing information, allowing the driver to update its routing cache. This ensures efficient query routing as tablets are migrated across the cluster. When using ScyllaDB tablets, it’s more important than ever to use ScyllaDB shard-aware – and now also tablet-aware – drivers instead of Cassandra drivers. The existing drivers will still work, but they won’t work as efficiently because they lack the necessary logic to understand the coordinator-provided tablet metadata. Using the latest ScyllaDB drivers should provide a nice throughput and latency boost. Read more in How We Updated ScyllaDB Drivers for Tablets Elasticity.

More questions?

If you’re interested in tablets and we didn’t answer your question here, please reach out to us! Our Contact Us page offers a number of ways to interact, including a community forum and Slack.

Introducing ScyllaDB X Cloud: A (Mostly) Technical Overview
ScyllaDB X Cloud just landed! It’s a truly elastic database that supports variable/unpredictable workloads with consistent low latency, plus low costs.

The ScyllaDB team is excited to announce ScyllaDB X Cloud, the next generation of our fully-managed database-as-a-service. It features architectural enhancements for greater flexibility and lower cost. ScyllaDB X Cloud is a truly elastic database designed to support variable/unpredictable workloads with consistent low latency as well as low costs. A few spoilers before we get into the details:

- You can now scale out and scale in almost instantly to match actual usage, hour by hour. For example, you can scale all the way from 100K OPS to 2M OPS in just minutes, with consistent single-digit millisecond P99 latency. This means you don’t need to overprovision for the worst-case scenario or suffer latency hits while waiting for autoscaling to fully kick in.
- You can now safely run at 90% storage utilization, compared to the standard 70% utilization. This means you need fewer underlying servers and have substantially less infrastructure to pay for.
- Optimizations like file-based streaming and dictionary-based compression also speed up scaling and reduce network costs.

Beyond the technical changes, there’s also an important pricing update. To go along with all this database flexibility, we’re now offering a “Flex Credit” pricing model. Basically, this gives you the flexibility of on-demand pricing with the cost advantage that comes from an annual commitment.

Access ScyllaDB X Cloud Now

If you want to get started right away, just go to ScyllaDB Cloud and choose the X Cloud cluster type when you create a cluster. This is our code name for the new type of cluster that enables greater elasticity, higher storage utilization, and automatic scaling. Note that X Cloud clusters are available from the ScyllaDB Cloud application (below) and API. They’re available on AWS and GCP, running on a ScyllaDB account or your company’s account with the Bring Your Own Account (BYOA) model. Sneak peek: In the next release, you won’t need to choose instance size or number of services if you select the X Cloud option. Instead, you will be able to define a serverless scaling policy and let X Cloud scale the cluster as required.

If you want to learn more, keep reading. In this blog post, we’ll cover what’s behind the technical changes and also talk a little about the new pricing option. But first, let’s start with the why.

Backstory

Why did we do this? Consider this example from a marketing/AdTech platform that provides event-based targeting. Such a pattern, with predictable/cyclical daily peaks and low baseline off-hours, is quite common across retail platforms, food delivery services, and other applications aligned with customer work hours. In this case, the peak loads are 3x the base and require 2-3x the resources. With ScyllaDB X Cloud, they can provision for the baseline and quickly scale in/out as needed to serve the peaks. They get the steady low latency they need without having to overprovision – paying for peak capacity 24/7 when it’s really only needed for 4 hours a day.

Tablets + just-in-time autoscaling

If you follow ScyllaDB, you know that tablets aren’t new. We introduced them last year for ScyllaDB Enterprise (self-managed on the cloud or on-prem). Avi Kivity, our CTO, already provided a look at why and how we implemented tablets.
And you can see tablets in action here:

With tablets, data gets distributed by splitting tables into smaller logical pieces (“tablets”), which are dynamically balanced across the cluster using the Raft consensus protocol. This enables you to scale your databases as rapidly as you can scale your infrastructure. In a self-managed ScyllaDB deployment, tablets make it much faster and simpler to expand and reduce your database capacity. However, you still need to plan ahead for expansion and initiate the operations yourself. ScyllaDB X Cloud lets you take full advantage of tablets’ elasticity. Scaling can be triggered automatically based on storage capacity (more on this below) or based on your knowledge of expected usage patterns. Moreover, as capacity expands and contracts, we’ll automatically optimize both node count and utilization. You don’t even have to choose node size; ScyllaDB X Cloud’s storage-utilization target does that for you. This should simplify admin and also save costs.

90% storage utilization

ScyllaDB has always handled running at 100% compute utilization well by having automated internal schedulers manage compactions, repairs, and lower-priority tasks in a way that prioritizes performance. Now, it also does two things that let you increase the maximum storage utilization to 90%:

- Since tablets can move data to new nodes so much faster, ScyllaDB X Cloud can defer scaling until the very last minute.
- Support for mixed instance sizes allows ScyllaDB X Cloud to allocate minimal additional resources to keep the usage close to 90%.

Previously, we recommended adding nodes at 70% capacity. This was because node additions were unpredictable and slow — sometimes taking hours or days — and you risked running out of space. We’d send a soft alert at 50% and automatically add nodes at 70%. However, those big nodes often sat underutilized. With ScyllaDB X Cloud’s tablets architecture, we can safely target 90% utilization. That’s particularly helpful for teams with storage-bound workloads.

Support for mixed size clusters

A little more on the “mixed instance size” support mentioned earlier. Basically, this means that ScyllaDB X Cloud can now add the exact mix of nodes you need to meet the exact capacity you need at any given time. Previous versions of ScyllaDB used a single instance size across all nodes in the cluster. For example, if you had a cluster with 3 i4i.16xlarge instances, increasing the capacity meant adding another i4i.16xlarge. That works, but it’s wasteful: you’re paying for a big node that you might not immediately need. Now with ScyllaDB X Cloud (thanks to tablets and support for mixed-instance sizes), we can scale in much smaller increments. You can add tiny instances first, then replace them with larger ones if needed. That means you rarely pay for unused capacity. For example, before, if you started with an i4i.16xlarge node that had 15 TB of storage and you hit 70% utilization, you had to launch another i4i.16xlarge — adding 15 TB at once. With ScyllaDB X Cloud, you might add two xlarge nodes (2 TB each) first. Then, if you need more storage, you add more small nodes, then eventually replace them with larger nodes. And by the way, i7i instances are now available too, and they are even more powerful. The key is granular, just-in-time scaling: you only add what you need, when you need it. This applies in reverse, too. Before, you had to decommission a large node all at once.
Now, ScyllaDB X Cloud can remove smaller nodes gradually based on the policies you set, saving compute and storage costs.

Network-focused engineering optimizations

Every gigabyte leaving a node, crossing an Availability Zone (AZ) boundary, or replicating to another region shows up on your AWS, GCP, or Azure bill. That’s why we’ve done some engineering work at different layers of ScyllaDB to shrink those bytes—and the dollars tied to them.

File-based streaming

We anticipated that mutation-based streaming would hold us back once we moved to tablets. So we shifted to a new approach: stream the entire SSTable files without deserializing them into mutation fragments and re-serializing them back into SSTables on receiving nodes. As a result, less data is streamed over the network and less CPU is consumed, especially for data models that contain small cells. Think of it as Cassandra’s zero-copy streaming, except that we keep ownership metadata with each replica. This table shows the result: You can read more about this in the blog Why We Changed ScyllaDB’s Data Streaming Approach.

Dictionary-based compression

We also introduced dictionary-trained Zstandard (Zstd), which is pipeline-aware. This involved building a custom RPC compressor with external dictionary support, and a mechanism that trains new dictionaries on RPC traffic, distributes them over the cluster, and performs a live switch of connections to the new dictionaries. This is done in 4 key steps:

- Sample: Continuously sample RPC traffic for some time
- Train: Train a 100 kiB dictionary on a 16 MiB sample
- Distribute: Distribute a new dictionary via a system distributed table
- Switch: Negotiate the switch separately within each connection

On the graph below, you can see LZ4 (Cassandra’s default) leaves you at 72% of the original size. Generic Zstd cuts that to 50%. Our per-cluster Zstd dictionary takes it down to 30%, which is a 3X improvement over the default Cassandra compression.

Flex Credit

To close, let’s shift from the technical changes to a major pricing change: Flex Credit. Flex Credit is a new way to consume a ScyllaDB Cloud subscription. It can be applied to ScyllaDB Cloud as well as ScyllaDB Enterprise. Flex Credit provides the flexibility of on-demand pricing at a lower cost via an annual commitment. In combination with X Cloud, Flex Credit can be a great tool to reduce cost. You can use Reserved pricing for a load that’s known in advance and use Flex for less predictable bursts. This saves you from paying the higher on-demand pricing for anything above the reserved part.

How might this play out in your day-to-day work? Imagine your baseline workload handles 100K OPS, but sometimes it spikes to 400K OPS. Previously, you’d have to provision (and pay for) enough capacity to sustain 400K OPS at all times. That’s inefficient and costly. With ScyllaDB X Cloud, you reserve 100K OPS upfront. When a spike hits, we automatically call the API to spin up “flex capacity” – instantly scaling you to 400K OPS – and then tear it down when traffic subsides. You only pay for the extra capacity during the peak.

Not sure what to choose? We can help advise based on your workload specifics (contact your representative or ping us here), but here’s some quick guidance in the meantime.

- Reserved Capacity: The most cost-effective option across all plans. Commit to a set number of cluster nodes or machines for a year. You lock in lower rates and guarantee capacity availability. This is ideal if your cluster size is relatively stable.
- Hybrid Model: Reserved + On-Demand: Commit to a baseline reserved capacity to lock in lower rates, but if you exceed that baseline (e.g., because you have a traffic spike), you can scale with on-demand capacity at an hourly rate. This is good if your usage is mostly stable but occasionally spikes.
- Hybrid Model: Reserved + Flex Credit: Commit to baseline reserved capacity for the lowest rates. For peak usage, use pre-purchased flex credit (which is discounted) instead of paying on-demand prices. Flex credit also applies to network and backup usage at standard provider rates. This is ideal if you have predictable peak periods (e.g., seasonal spikes, event-driven workload surges, etc.). You get the best of both worlds: low baseline costs and cost-efficient peak capacity.

Recap

In summary, ScyllaDB X Cloud uses tablets to enable faster, more granular scaling with mixed-instance sizes. This lets you avoid overprovisioning and safely run at 90% storage utilization. All of this will help you respond to volatile/unpredictable demand with low latencies and low costs. Moreover, flexible pricing (on-demand, flex credit, reserved) will help you pay only for what you need, especially when you have tablets scaling your capacity up and down in response to traffic spikes. There are also some network cost optimizations through file-based streaming and improved compression.

Want to learn more? Our Co-Founder/CTO Avi Kivity will be discussing the design decisions behind ScyllaDB X Cloud’s elasticity and efficiency. Join us for the engineering deep dive on July 10.

ScyllaDB X Cloud: An Inside Look with Avi Kivity

A New Way to Estimate DynamoDB Costs
We built a new DynamoDB cost analyzer that helps developers understand what their workloads will really cost

DynamoDB costs can blindside you. Teams regularly face “bill shock”: that sinking feeling when you look at a shockingly high bill and realize that you haven’t paid enough attention to your usage, especially with on-demand pricing. Provisioned capacity brings a different risk: performance. If you can’t accurately predict capacity or your math is off, requests get throttled. It’s a delicate balancing act. Although AWS offers a DynamoDB pricing calculator, it often misses the nuances of real-world workloads (e.g., bursty traffic or uneven access patterns, or using global tables or caching). We wanted something better. In full transparency, we wanted something better to help the teams considering ScyllaDB as a DynamoDB alternative. So we built a new DynamoDB cost calculator that helps developers understand what their workloads will really cost. Although we designed it for teams comparing DynamoDB with ScyllaDB, we believe it’s useful for anyone looking to more accurately estimate their DynamoDB costs, for any reason. You can see the live version at: calculator.scylladb.com

How We Built It

We wanted to build something that would work client side, without the need for any server components. It’s a simple JavaScript single page application that we currently host on GitHub Pages. If you want to check out the source code, feel free to take a look at https://github.com/scylladb/calculator To be honest, working with the examples at https://calculator.aws/ was a bit of a nightmare, and when you “show calculations,” you get these walls of text: I was tempted to take a shorter approach, like:

Monthly WCU Cost = WCUs × Price_per_WCU_per_hour × 730 hours/month

But every time I simplified this, I found it harder to get parity between what I calculated and the final price in AWS’s calculation. Sometimes the difference was due to rounding, other times it was due to the mixture of reserved + provisioned capacity, and so on. So to make it easier (for me) to debug, I faithfully followed their calculations line by line and tried to replicate this in my own rather ugly function: https://github.com/scylladb/calculator/blob/main/src/calculator.js I may still refactor this into smaller functions. But for now, I wanted to get parity between theirs and ours. You’ll see that there are also some end-to-end tests for these calculations — I use those to test for a bunch of different configurations. I will probably expand on these in time as well. So that gets the job done for On Demand, Provisioned (and Reserved) capacity models.

If you’ve used AWS’s calculator, you know that you can’t specify things like a peak (or peak width) in On Demand. I’m not sure about their reasoning. I decided it would be easier for users to specify both the baseline and peak for reads and writes (respectively) in On Demand, much like Provisioned capacity. Another design decision was to represent the traffic using a chart. I do better with visuals, so seeing the peaks and troughs makes it easier for me to understand – and I hope it does for you as well. You’ll also notice that as you change the inputs, the URL query parameters change to reflect those inputs. That’s designed to make it easier to share and reference specific variations of costs. There’s some other math in there, like figuring out the true cost of Global Tables and understanding derived costs of things like network transfer or DynamoDB Accelerator (DAX).
However, explaining all that is a bit too dense for this format. We’ll talk more about that in an upcoming webinar (see the next section). The good news is that you can estimate these costs in addition to your workload, as they can be big cost multipliers when planning out your usage of DynamoDB.

Explore “what if” scenarios for your own workloads

Analyzing Costs in Real-World Scenarios

The ultimate goal of all this tinkering and tuning is to help you explore various “what-if” scenarios from a DynamoDB cost perspective. To get started, we’re sharing the cost impacts of some of the more interesting DynamoDB user scenarios we’ve come across at ScyllaDB. My colleague Gui and I just got together for a deep dive into how factors like traffic surges, multi-datacenter expansion, and the introduction of caching (e.g., DAX) impact DynamoDB costs. We explored how a few (anonymized) teams we work with ended up blindsided by their DynamoDB bills and the various options they considered for getting costs back under control.

Watch the DynamoDB costs chat now

Rust Rewrite, Postgres Exit: Blitz Revamps Its “League of Legends” Backend
How Blitz scaled their game coaching app with lower latency and leaner operations

Blitz is a fast-growing startup that provides personalized coaching for games such as League of Legends, Valorant, and Fortnite. They aim to help gamers become League of Legends legends through real-time insights and post-match analysis. While players play, the app does quite a lot of work. It captures live match data, analyzes it quickly, and uses it for real-time game screen overlays plus personalized post-game coaching. The guidance is based on each player’s current and historic game activity, as well as data collected across billions of matches involving hundreds of millions of users. Thanks to growing awareness of Blitz’s popular stats and game-coaching app, their steadily increasing user base pushed their original Postgres- and Elixir-based architecture to its limits. This blog post explains how they recently overhauled their League of Legends data backend – using Rust and ScyllaDB.

TL;DR – In order to provide low latency, high availability, and horizontal scalability to their growing user base, they ultimately:

- Migrated backend services from Elixir to Rust.
- Replaced Postgres with ScyllaDB Cloud.
- Heavily reduced their Redis footprint.
- Removed their Riak cluster.
- Replaced queue processing with realtime processing.
- Consolidated infrastructure from over a hundred cores of microservices to four n4‑standard‑4 Google Cloud nodes (plus a small Redis instance for edge caching).

As an added bonus, these changes ended up cutting Blitz’s infrastructure costs and reducing the database burden on their engineering staff.

Blitz Background

As Naveed Khan (Head of Engineering at Blitz) explained, “We collect a lot of data from game publishers and during gameplay. For example, if you’re playing League of Legends, we use Riot’s API to pull match data, and if you install our app we also monitor gameplay in real time. All of this data is stored in our transactional database for initial processing, and most of it eventually ends up in our data lake.”

Scaling Past Postgres

One key part of Blitz’s system is the Playstyles API, which analyzes pre-game data for both teammates and opponents. This intensive process evaluates up to 20 matches per player and runs nine separate times per game (once for each player in the match). The team strategically refactored and consolidated numerous microservices to improve performance. But the data volume remained intense. According to Brian Morin (Principal Backend Engineer at Blitz), “Finding a database solution capable of handling this query volume was critical.” They originally used Postgres, which served them well early on. However, as their write-heavy workloads scaled, the operational complexity and costs on Google Cloud grew significantly. Moreover, scaling Postgres became quite complex. Naveed shared, “We tried all sorts of things to scale. We built multiple services around Postgres to get the scale we needed: a Redis cluster, a Riak cluster, and Elixir Oban queues that occasionally overflowed. Queue management became a big task.”

To stay ahead of the game, they needed to move on. As startups scale, they often switch from “just use Postgres” to “just use NoSQL.” Fittingly, the Blitz team considered moving to MongoDB, but eventually ruled it out. “We had lots of MongoDB experience in the team and some of us really liked it. However, our workload is very write-heavy, with thousands of concurrent players generating a constant stream of data.
MongoDB uses a single-writer architecture, so scaling writes means vertically scaling one node.” In other words, MongoDB’s primary-secondary architecture would create a bottleneck for their specific workload and anticipated growth. They then decided to move forward with RocksDB because of its low latency and cost considerations. Tests showed that it would meet their latency needs, so they performed the required data (re)modeling and migrated a few smaller games over from Postgres to RocksDB. However, they ultimately decided against RocksDB due to scale and high availability concerns. “Based on available data from our testing, it was clear RocksDB wouldn’t be able to handle the load of our bigger games – and we couldn’t risk vertically scaling a single instance, and then having that one instance go down,” Naveed explained.

Why ScyllaDB

One of their backend engineers suggested ScyllaDB, so they reached out and ran a proof of concept. They were primarily looking for a solution that could handle the write throughput, scale horizontally, and provide high availability. They tested it on their own hardware first, then moved to ScyllaDB Cloud. Per Naveed, “The cost was pretty close to self-hosting, and we got full management for free, so it was a no-brainer. We now have a significantly reduced Redis cluster, plus we got rid of the Riak cluster and Oban queues dependencies. Just write to ScyllaDB and it all just works. The amount of time we spend on infrastructure management has significantly decreased.”

Performance-wise, the shift met their goal of leveling up the user experience … and also simplified life for their engineering teams. Brian added, “ScyllaDB proved exceptional, delivering robust performance with capacity to spare after optimization. Our League product peaks at around 5k ops/sec with the cluster reporting under 20% load. Our biggest constraint has been disk usage, which we’ve rolled out multiple updates to mitigate. The new system can now often return results immediately instead of relying on cached data, providing more up-to-date information on other players and even identifying frequent teammates. The results of this migration have been impressive: over a hundred cores of microservices have been replaced by just four n4-standard-4 nodes and a minimal Redis instance for caching. Additionally, a 3xn2-highmem ScyllaDB cluster has effectively replaced the previous relational database infrastructure that required significant computing resources.”

High-Level Architecture of Blitz Server with Rust and ScyllaDB

Rewriting Elixir Services into Rust

As part of a major backend overhaul, the Blitz team began rethinking their entire infrastructure – beyond the previously described shift from Postgres to the high-performance and distributed ScyllaDB. Alongside this database migration, they also chose to sunset their Elixir-based services in favor of a more modern language. After careful evaluation, Rust emerged as the clear choice. “Elixir is great and it served its purpose well,” explained Naveed. “But we wanted to move toward something with broader adoption and a stronger systems-level ecosystem. Rust proved to be a robust and future-proof alternative.” Now that the first batch of Rust-rewritten services is in production, Naveed and team aren’t looking back: “Rust is fantastic. It’s fast, and the compiler forces you to write memory-safe code upfront instead of debugging garbage-collection issues later.
Performance is comparable to C, and the talent pool is also much larger compared to Elixir.”

Why We Changed ScyllaDB’s Data Streaming Approach
How moving from mutation-based streaming to file-based streaming resulted in 25X faster streaming time

Data streaming – an internal operation that moves data from node to node over a network – has always been the foundation of various ScyllaDB cluster operations. For example, it is used by “add node” operations to copy data to a new node in a cluster (as well as “remove node” operations to do the opposite). As part of our multiyear project to optimize ScyllaDB’s elasticity, we reworked our approach to streaming. We recognized that when we moved to tablets-based data distribution, mutation-based streaming would hold us back. So we shifted to a new approach: stream the entire SSTable files without deserializing them into mutation fragments and re-serializing them back into SSTables on receiving nodes. As a result, less data is streamed over the network and less CPU is consumed, especially for data models that contain small cells.

Mutation-Based Streaming

In ScyllaDB, data streaming is a low-level mechanism to move data between nodes. For example, when nodes are added to a cluster, streaming moves data from existing nodes to the new nodes. We also use streaming to decommission nodes from the cluster. In this case, streaming moves data from the decommissioned nodes to other nodes in order to balance the data across the cluster. Previously, we were using a streaming method called mutation-based streaming. On the sender side, we read the data from multiple SSTables. We get a stream of mutations, serialize them, and send them over the network. On the receiver side, we deserialize and write them to SSTables.

File-Based Streaming

Recently, we introduced a new file-based streaming method. The big difference is that we do not read the individual mutations from the SSTables, and we skip all the parsing and serialization work. Instead, we read and send the SSTable directly to remote nodes. A given SSTable always belongs to a single tablet. This means we can always send the entire SSTable to other nodes without worrying about whether the SSTable contains unwanted data. We implemented this by having the Seastar RPC stream interface stream SSTable files on the network for tablet migration. More specifically, we take an internal snapshot of the SSTables we want to transfer so the SSTables won’t be deleted during streaming. Then, SSTable file readers are created for them so we can use the Seastar RPC stream to send the SSTable files over the network. On the receiver side, the file streams are written into SSTable files by the SSTable writers.

Why did we do this? First, it reduces CPU usage because we do not need to read each and every mutation fragment from the SSTables, and we do not need to parse mutations. The CPU reduction is even more significant for small cells, where the ratio of the amount of metadata parsed to real user data is higher. Second, the format of the SSTable is much more compact than the mutation format (since the on-disk presentation of data is more compact than in-memory). This means we have less data to send over the network. As a result, it can boost the streaming speed rather significantly.

Performance Improvements

To quantify how this shift impacted performance, we compared the performance of mutation-based and file-based streaming when migrating tablets between nodes. The tests involved:

- 3 ScyllaDB nodes (i4i.2xlarge)
- 3 loaders (t3.2xlarge)
- 1 billion partitions

Here are the results: Note that file-based streaming results in 25 times faster streaming time.
We also have much higher streaming bandwidth: the network bandwidth is 10 times higher with file-based streaming. As mentioned earlier, we have less data to send with file streaming. The data sent on the wire is almost three times less with file streaming. In addition, we can also see that file-based streaming consumes many fewer CPU cycles. Here’s a little more detail, in case you’re curious.

Disk IO Queue

The following sections show how the IO bandwidth compares across mutation-based and file-based streaming. Different colors represent different nodes. As expected, the throughput was higher with file-based streaming. Here are the detailed IO results for mutation-based streaming: The streaming bandwidth is 30-40MB/s with mutation-based streaming. Here are the detailed IO results for file-based streaming: The bandwidth for file streaming is much higher than with mutation-based streaming. The pattern differs from the mutation-based graph because file streaming completes more quickly and can sustain a high transfer bandwidth during streaming.

CPU Load

We found that the overall CPU usage is much lower for file-based streaming. Here are the detailed CPU results for mutation-based streaming: Note that the CPU usage is around 12% for mutation-based streaming. Here are the detailed CPU results for file-based streaming: Note that the CPU usage for file-based streaming is less than 5%. Again, this pattern differs from the mutation-based streaming graph because file streams complete much more quickly and can maintain a high transfer bandwidth throughout.

Wrap Up

This new file-based streaming makes data streaming in ScyllaDB faster and more efficient. You can explore it in ScyllaDB Cloud or ScyllaDB 2025.1. Also, our CTO and co-founder Avi Kivity shares an extensive look at our other recent and upcoming engineering projects in this tech talk:

More engineering blog posts

CEP-24 Behind the scenes: Developing Apache Cassandra®’s password validator and generator
Introduction: The need for an Apache Cassandra® password validator and generator
Here’s the problem: while users have always had the ability to create whatever password they wanted in Cassandra–from straightforward to incredibly complex and everything in between–this ultimately created a noticeable security vulnerability.
While organizations might have internal processes for generating secure passwords that adhere to their own security policies, Cassandra itself did not have the means to enforce these standards. To make the security vulnerability worse, if a password initially met internal security guidelines, users could later downgrade their password to a less secure option simply by using “ALTER ROLE” statements.
When internal password requirements are enforced for an individual, users face the additional burden of creating compliant passwords. This inevitably involves lots of trial and error in attempting to create a compliant password that satisfies complex security rules.
But what if there was a way to have Cassandra automatically create passwords that meet all bespoke security requirements–but without requiring manual effort from users or system operators?
That’s why we developed CEP-24: Password validation/generation. We recognized that the complexity of secure password management could be significantly reduced (or eliminated entirely) with the right approach, improving both security and user experience at the same time.
The Goals of CEP-24
A Cassandra Enhancement Proposal (or CEP) is a structured process for proposing, creating, and ultimately implementing new features for the Cassandra project. All CEPs are thoroughly vetted among the Cassandra community before they are officially integrated into the project.
These were the key goals we established for CEP-24:
- Introduce a way to enforce password strength upon role creation or role alteration.
- Implement a reference implementation of a password validator which adheres to a recommended password strength policy, to be used for Cassandra users out of the box.
- Emit a warning (and proceed) or just reject “create role” and “alter role” statements when the provided password does not meet a certain security level, based on user configuration of Cassandra.
- To be able to implement a custom password validator with its own policy, whatever it might be, and provide a modular/pluggable mechanism to do so.
- Provide a way for Cassandra to generate a password which would pass the subsequent validation for use by the user.
The Cassandra Password Validator and Generator builds upon an established framework in Cassandra called Guardrails, which was originally implemented under CEP-3 (more details here).
The password validator implements a custom guardrail introduced
as part of
CEP-24. A custom guardrail can validate and generate values of
arbitrary types when properly implemented. In the CEP-24 context,
the password guardrail provides
CassandraPasswordValidator
by extending
ValueValidator
, while passwords are generated by
CassandraPasswordGenerator
by extending
ValueGenerator
. Both components work with passwords as
String type values.
Password validation and generation are configured in the
cassandra.yaml
file under the
password_validator
section. Let’s explore the key
configuration properties available. First, the
class_name
and generator_class_name
parameters specify which validator and generator classes will be
used to validate and generate passwords respectively.
Cassandra
ships CassandraPasswordValidator
and CassandraPasswordGenerator
out
of the box. However, if a particular enterprise decides that they
need something very custom, they are free to implement their own
validator, put it on Cassandra’s class path, and reference it in
the configuration behind the class_name parameter. The same applies to the
generator.
CEP-24 provides implementations of the validator and generator that the Cassandra team believes will satisfy the requirements of most users. These default implementations address common password security needs. However, the framework is designed with flexibility in mind, allowing organizations to implement custom validation and generation rules that align with their specific security policies and business requirements.
password_validator:
  # Implementation class of a validator. When not in form of FQCN, the
  # package name org.apache.cassandra.db.guardrails.validators is prepended.
  # By default, there is no validator.
  class_name: CassandraPasswordValidator
  # Implementation class of related generator which generates values which are valid when
  # tested against this validator. When not in form of FQCN, the
  # package name org.apache.cassandra.db.guardrails.generators is prepended.
  # By default, there is no generator.
  generator_class_name: CassandraPasswordGenerator
Password quality can be thought of as the number of characteristics a password satisfies. Every password is evaluated against two levels – a warning level and a failure level. This fits naturally into how Guardrails work: every guardrail has warning and failure thresholds. Depending on the value a specific guardrail evaluates, it will either emit a warning telling the user that the value is discouraged (but ultimately allowed) or reject the value altogether.
This same principle applies to password evaluation – each password is assessed against both warning and failure thresholds. These thresholds are determined by counting the characteristics present in the password. The system evaluates five key characteristics: the password’s overall length, the number of uppercase characters, the number of lowercase characters, the number of special characters, and the number of digits. A comprehensive password security policy can be enforced by configuring minimum requirements for each of these characteristics.
# There are four characteristics:
# upper-case, lower-case, special character and digit.
# If this value is set e.g. to 3, a password has to
# consist of 3 out of 4 characteristics.
# For example, it has to contain at least 2 upper-case characters,
# 2 lower-case, and 2 digits to pass,
# but it does not have to contain any special characters.
# If the number of characteristics found in the password is
# less than or equal to this number, it will emit a warning.
characteristic_warn: 3
# If the number of characteristics found in the password is
# less than or equal to this number, it will emit a failure.
characteristic_fail: 2
Next, there are configuration parameters for each characteristic which count towards warning or failure:
# If the password is shorter than this value,
# the validator will emit a warning.
length_warn: 12
# If a password is shorter than this value,
# the validator will emit a failure.
length_fail: 8
# If a password does not contain at least n
# upper-case characters, the validator will emit a warning.
upper_case_warn: 2
# If a password does not contain at least
# n upper-case characters, the validator will emit a failure.
upper_case_fail: 1
# If a password does not contain at least
# n lower-case characters, the validator will emit a warning.
lower_case_warn: 2
# If a password does not contain at least
# n lower-case characters, the validator will emit a failure.
lower_case_fail: 1
# If a password does not contain at least
# n digits, the validator will emit a warning.
digit_warn: 2
# If a password does not contain at least
# n digits, the validator will emit a failure.
digit_fail: 1
# If a password does not contain at least
# n special characters, the validator will emit a warning.
special_warn: 2
# If a password does not contain at least
# n special characters, the validator will emit a failure.
special_fail: 1
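To make the threshold logic concrete, here is an illustrative Python sketch of how the warning and failure levels above combine. It mirrors the sample values in this configuration, but it is not Cassandra’s actual CassandraPasswordValidator, it treats Python’s punctuation set as “special characters”, and it ignores the dictionary and illegal-sequence checks described below.

import string

# Values mirror the sample cassandra.yaml settings shown above
WARN = {"length": 12, "upper": 2, "lower": 2, "digit": 2, "special": 2, "characteristics": 3}
FAIL = {"length": 8,  "upper": 1, "lower": 1, "digit": 1, "special": 1, "characteristics": 2}

def char_counts(pw):
    return {
        "upper": sum(c.isupper() for c in pw),
        "lower": sum(c.islower() for c in pw),
        "digit": sum(c.isdigit() for c in pw),
        "special": sum(c in string.punctuation for c in pw),  # assumption: punctuation = special
    }

def evaluate(pw):
    counts = char_counts(pw)
    met_at_fail = sum(counts[k] >= FAIL[k] for k in counts)  # characteristics met at fail-level minimums
    met_at_warn = sum(counts[k] >= WARN[k] for k in counts)  # characteristics met at warn-level minimums
    if len(pw) < FAIL["length"] or met_at_fail <= FAIL["characteristics"]:
        return "fail"
    if len(pw) < WARN["length"] or met_at_warn <= WARN["characteristics"]:
        return "warn"
    return "ok"

print(evaluate("T8aum3?"))       # fail: too short
print(evaluate("mYAtt3mp"))      # warn: length and some characteristics below warn minimums
print(evaluate("R7tb33?.mcAX"))  # ok: satisfies every configured characteristic

The three sample passwords are the ones used in the walkthrough later in this post, so you can compare the sketch’s verdicts against the real server responses shown there.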
It is also possible to forbid passwords that contain illegal sequences of at least a certain length:
# If a password contains illegal sequences that are at least this long, it is invalid.
# Illegal sequences might be either alphabetical (form 'abcde'),
# numerical (form '34567'), or US qwerty (form 'asdfg') as well
# as sequences from supported character sets.
# The minimum value for this property is 3,
# by default it is set to 5.
illegal_sequence_length: 5
Lastly, it is also possible to configure a dictionary of passwords to check against, which guards against password dictionary attacks. It is up to the operator of the cluster to provide and configure the password dictionary:
# Dictionary to check the passwords against. Defaults to no dictionary.
# Whole dictionary is cached into memory. Use with caution with relatively big dictionaries.
# Entries in a dictionary, one per line, have to be sorted per String's compareTo contract.
dictionary: /path/to/dictionary/file
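Because the entries must be sorted consistently with Java’s String.compareTo contract, a small helper script can produce the file; for plain ASCII word lists, Python’s default sort yields the same ordering. This helper is not part of Cassandra, and the path below simply mirrors the placeholder used in the configuration above.

# Write a dictionary file in the expected format: one entry per line, sorted.
words = {"password", "letmein", "qwerty123", "cassandraisadatabase"}

with open("/path/to/dictionary/file", "w", encoding="utf-8") as f:
    for word in sorted(words):
        f.write(word + "\n")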
Now that we have gone over all the configuration parameters, let’s take a look at an example of how password validation and generation look in practice.
Consider a scenario where a Cassandra super-user (such as the default ‘cassandra’ role) attempts to create a new role named ‘alice’.
cassandra@cqlsh> CREATE ROLE alice WITH PASSWORD = 'cassandraisadatabase' AND LOGIN = true;
InvalidRequest: Error from server: code=2200 [Invalid query] message="Password was not set as it violated configured password strength policy. To fix this error, the following has to be resolved: Password contains the dictionary word 'cassandraisadatabase'. You may also use 'GENERATED PASSWORD' upon role creation or alteration."
The next attempt avoids the dictionary, but the password is still not long enough:
cassandra@cqlsh> CREATE ROLE alice WITH PASSWORD = 'T8aum3?' AND LOGIN = true;
InvalidRequest: Error from server: code=2200 [Invalid query] message="Password was not set as it violated configured password strength policy. To fix this error, the following has to be resolved: Password must be 8 or more characters in length. You may also use 'GENERATED PASSWORD' upon role creation or alteration."
Seeing the length failure, the operator tries a longer password. This time the password is set, but it is not completely secure: it satisfies the minimum requirements, yet the validator warns that not all characteristics were met.
cassandra@cqlsh> CREATE ROLE alice WITH PASSWORD = 'mYAtt3mp' AND LOGIN = true;

Warnings:
Guardrail password violated: Password was set, however it might not be strong enough according to the configured password strength policy. To fix this warning, the following has to be resolved: Password must be 12 or more characters in length. Passwords must contain 2 or more digit characters. Password must contain 2 or more special characters. Password matches 2 of 4 character rules, but 4 are required. You may also use 'GENERATED PASSWORD' upon role creation or alteration.
The warning also points to the ‘GENERATED PASSWORD’ clause, which generates a compliant password automatically so the operator does not have to invent one on their own. As the attempts above show, coming up with a compliant password by hand is a cumbersome process that is better left to the machine, which is also more efficient and reliable.
cassandra@cqlsh> ALTER ROLE alice WITH GENERATED PASSWORD;

 generated_password
 ------------------
 R7tb33?.mcAX
The generated password shown above automatically satisfies all the rules configured in cassandra.yaml, as will every other generated password. This is a clear advantage over inventing passwords manually.
When the CQL statement is executed, it will be visible in the CQLSH history (via the HISTORY command or the cqlsh_history file), but the password itself is not logged, so it cannot leak. It will also not appear in any auditing logs. Previously, Cassandra had to obfuscate such statements; this is no longer necessary.
We can create a role with generated password like this:
cassandra@cqlsh> CREATE ROLE alice WITH GENERATED PASSWORD AND LOGIN = true;

or by CREATE USER:

cassandra@cqlsh> CREATE USER alice WITH GENERATED PASSWORD;
When a password is generated for alice (out of scope of this documentation), she can log in:
$ cqlsh -u alice -p R7tb33?.mcAX
...
alice@cqlsh>
Note: It is recommended to save the password to ~/.cassandra/credentials, for example:
[PlainTextAuthProvider]
username = alice
password = R7tb33?.mcAX
and to set the auth_provider in ~/.cassandra/cqlshrc:
[auth_provider]
module = cassandra.auth
classname = PlainTextAuthProvider
It is also possible to configure the password validator so that a user does not see why a password failed. This behavior is driven by the password_validator configuration property called detailed_messages. When it is set to false, the reported violations are very brief:
alice@cqlsh> ALTER ROLE alice WITH PASSWORD = 'myattempt';
InvalidRequest: Error from server: code=2200 [Invalid query] message="Password was not set as it violated configured password strength policy. You may also use 'GENERATED PASSWORD' upon role creation or alteration."
The following command will automatically generate a new password that meets all configured security requirements.
alice@cqlsh> ALTER ROLE alice WITH GENERATED PASSWORD;
Several potential enhancements to password generation and validation could be implemented in future releases. One promising extension would be validating new passwords against previous values. This would prevent users from reusing passwords until after they’ve created a specified number of different passwords. A related enhancement could include restricting how frequently users can change their passwords, preventing rapid cycling through passwords to circumvent history-based restrictions.
These features, while valuable for comprehensive password security, were considered beyond the scope of the initial implementation and may be addressed in future updates.
Final thoughts and next steps
The Cassandra Password Validator and Generator implemented under CEP-24 represents a significant improvement in Cassandra’s security posture.
By providing robust, configurable password policies with built-in enforcement mechanisms and convenient password generation capabilities, organizations can now ensure compliance with their security standards directly at the database level. This not only strengthens overall system security but also improves the user experience by eliminating guesswork around password requirements.
As Cassandra continues to evolve as an enterprise-ready database solution, these security enhancements demonstrate a commitment to meeting the demanding security requirements of modern applications while maintaining the flexibility that makes Cassandra so powerful.
Ready to experience CEP-24 yourself? Try it out on the Instaclustr Managed Platform and spin up your first Cassandra cluster for free.
CEP-24 is just our latest contribution to open source. Check out everything else we’re working on here.
The post CEP-24 Behind the scenes: Developing Apache Cassandra®’s password validator and generator appeared first on Instaclustr.
Introduction to similarity search: Part 2–Simplifying with Apache Cassandra® 5’s new vector data type
In Part 1 of this series, we explored how you can combine Cassandra 4 and OpenSearch to perform similarity searches with word embeddings. While that approach is powerful, it requires managing two different systems.
But with the release of Cassandra 5, things become much simpler.
Cassandra 5 introduces a native VECTOR data type and built-in Vector Search capabilities, simplifying the architecture by enabling Cassandra 5 to handle storage, indexing, and querying seamlessly within a single system.
Now in Part 2, we’ll dive into how Cassandra 5 streamlines the process of working with word embeddings for similarity search. We’ll walk through how the new vector data type works, how to store and query embeddings, and how the Storage-Attached Indexing (SAI) feature enhances your ability to efficiently search through large datasets.
The power of vector search in Cassandra 5
Vector search is a game-changing feature added in Cassandra 5 that enables you to perform similarity searches directly within the database. This is especially useful for AI applications, where embeddings are used to represent data like text or images as high-dimensional vectors. The goal of vector search is to find the closest matches to these vectors, which is critical for tasks like product recommendations or image recognition.
The key to this functionality lies in embeddings: arrays of floating-point numbers that represent the similarity of objects. By storing these embeddings as vectors in Cassandra, you can use Vector Search to find connections in your data that may not be obvious through traditional queries.
How vectors work
Vectors are fixed-size sequences of non-null values, much like lists. However, in Cassandra 5, you cannot modify individual elements of a vector — you must replace the entire vector if you need to update it. This makes vectors ideal for storing embeddings, where you need to work with the whole data structure at once.
When working with embeddings, you’ll typically store them as vectors of floating-point numbers that capture the semantic meaning of the underlying data.
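Because individual elements cannot be changed, updating an embedding means writing the whole vector again. Here is a minimal sketch of that, assuming the embeddings table created in the next section, an already connected session, and a cassandra-driver version recent enough to map Python float lists to the vector type; row_id and paragraph_id are placeholder primary-key values.

# Replacing a vector wholesale: compute a fresh embedding and overwrite the column.
new_embedding = [0.0] * 300  # in practice, a freshly computed 300-dimensional embedding
session.execute(
    "UPDATE aisearch.embeddings "
    "SET embeddings = %s, last_updated = toTimestamp(now()) "
    "WHERE id = %s AND paragraph_uuid = %s",
    (new_embedding, row_id, paragraph_id),
)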
Storage-Attached Indexing (SAI): The engine behind vector search
Vector Search in Cassandra 5 is powered by Storage-Attached Indexing, which enables high-performance indexing and querying of vector data. SAI is essential for Vector Search, providing the ability to create column-level indexes on vector data types. This ensures that your vector queries are both fast and scalable, even with large datasets.
SAI isn’t just limited to vectors—it also indexes other types of data, making it a versatile tool for boosting the performance of your queries across the board.
Example: Performing similarity search with Cassandra 5’s vector data type
Now that we’ve introduced the new vector data type and the power of Vector Search in Cassandra 5, let’s dive into a practical example. In this section, we’ll show how to set up a table to store embeddings, insert data, and perform similarity searches directly within Cassandra.
Step 1: Setting up the embeddings table
To get started with this example, you’ll need access to a Cassandra 5 cluster. Cassandra 5 introduces native support for vector data types and Vector Search, available on Instaclustr’s managed platform. Once you have your cluster up and running, the first step is to create a table to store the embeddings. We’ll also create an index on the vector column to optimize similarity searches using SAI.
CREATE KEYSPACE aisearch WITH REPLICATION = {'class': 'SimpleStrategy', 'replication_factor': 1};

CREATE TABLE IF NOT EXISTS embeddings (
    id UUID,
    paragraph_uuid UUID,
    filename TEXT,
    embeddings vector<float, 300>,
    text TEXT,
    last_updated timestamp,
    PRIMARY KEY (id, paragraph_uuid)
);

CREATE INDEX IF NOT EXISTS ann_index ON embeddings(embeddings) USING 'sai';
This setup allows us to store the embeddings as 300-dimensional vectors, along with metadata like file names and text. The SAI index will be used to speed up similarity searches on the embeddings column.
You can also fine-tune the index by specifying the similarity function to be used for vector comparisons. Cassandra 5 supports three types of similarity functions: DOT_PRODUCT, COSINE, and EUCLIDEAN. By default, the similarity function is set to COSINE, but you can specify your preferred method when creating the index:
CREATE INDEX IF NOT EXISTS ann_index ON embeddings(embeddings) USING 'sai' WITH OPTIONS = { 'similarity_function': 'DOT_PRODUCT' };
Each similarity function has its own advantages depending on your use case. DOT_PRODUCT is often used when you need to measure the direction and magnitude of vectors, COSINE is ideal for comparing the angle between vectors, and EUCLIDEAN calculates the straight-line distance between vectors. By selecting the appropriate function, you can optimize your search results to better match the needs of your application.
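To get a feel for what each function measures, here is a small standalone comparison in Python using two toy 3-dimensional vectors (computed with numpy, outside Cassandra):

import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 3.0, 4.0])

dot = float(np.dot(a, b))                                # direction and magnitude
cosine = dot / (np.linalg.norm(a) * np.linalg.norm(b))   # angle only, magnitude-independent
euclidean = float(np.linalg.norm(a - b))                 # straight-line distance

print(dot, cosine, euclidean)   # 20.0, ~0.9926, ~1.732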
Step 2: Inserting embeddings into Cassandra 5
To insert embeddings into Cassandra 5, we can use the same code from the first part of this series to extract text from files, load the FastText model, and generate the embeddings. Once the embeddings are generated, the following function will insert them into Cassandra:
import time
from uuid import uuid4, UUID

from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement
from cassandra.policies import DCAwareRoundRobinPolicy
from cassandra.auth import PlainTextAuthProvider

# Connect to the cluster
cluster = Cluster(
    # Replace with your IP list
    ["xxx.xxx.xxx.xxx", "xxx.xxx.xxx.xxx", "xxx.xxx.xxx.xxx"],
    # Update the local data centre if needed
    load_balancing_policy=DCAwareRoundRobinPolicy(local_dc='AWS_VPC_US_EAST_1'),
    port=9042,
    auth_provider=PlainTextAuthProvider(
        username='iccassandra',
        password='replace_with_your_password'
    )
)
session = cluster.connect()
print('Connected to cluster %s' % cluster.metadata.cluster_name)

def insert_embedding_to_cassandra(session, embedding, id=None, paragraph_uuid=None,
                                  filename=None, text=None, keyspace_name=None):
    try:
        embeddings = list(map(float, embedding))

        # Generate UUIDs if not provided
        if id is None:
            id = uuid4()
        if paragraph_uuid is None:
            paragraph_uuid = uuid4()

        # Ensure id and paragraph_uuid are UUID objects
        if isinstance(id, str):
            id = UUID(id)
        if isinstance(paragraph_uuid, str):
            paragraph_uuid = UUID(paragraph_uuid)

        # Create the query string with placeholders
        insert_query = f"""
        INSERT INTO {keyspace_name}.embeddings (id, paragraph_uuid, filename, embeddings, text, last_updated)
        VALUES (?, ?, ?, ?, ?, toTimestamp(now()))
        """

        # Create a prepared statement with the query
        prepared = session.prepare(insert_query)

        # Execute the query
        session.execute(prepared.bind((id, paragraph_uuid, filename, embeddings, text)))

        return None  # Successful insertion
    except Exception as e:
        error_message = f"Failed to execute query:\nError: {str(e)}"
        return error_message  # Return error message on failure

def insert_with_retry(session, embedding, id=None, paragraph_uuid=None, filename=None, text=None,
                      keyspace_name=None, max_retries=3, retry_delay_seconds=1):
    retry_count = 0
    while retry_count < max_retries:
        result = insert_embedding_to_cassandra(session, embedding, id, paragraph_uuid,
                                               filename, text, keyspace_name)
        if result is None:
            return True  # Successful insertion
        else:
            retry_count += 1
            print(f"Insertion failed on attempt {retry_count} with error: {result}")
            if retry_count < max_retries:
                time.sleep(retry_delay_seconds)  # Delay before the next retry
    return False  # Failed after max_retries

# Replace the file path pointing to the desired file
file_path = "/path/to/Cassandra-Best-Practices.pdf"
paragraphs_with_embeddings = extract_text_with_page_number_and_embeddings(file_path)

from tqdm import tqdm

for paragraph in tqdm(paragraphs_with_embeddings, desc="Inserting paragraphs"):
    if not insert_with_retry(
        session=session,
        embedding=paragraph['embedding'],
        id=paragraph['uuid'],
        paragraph_uuid=paragraph['paragraph_uuid'],
        text=paragraph['text'],
        filename=paragraph['filename'],
        keyspace_name=keyspace_name,
        max_retries=3,
        retry_delay_seconds=1
    ):
        # Display an error message if insertion fails
        tqdm.write(f"Insertion failed after maximum retries for UUID {paragraph['uuid']}: {paragraph['text'][:50]}...")
This function handles inserting embeddings and metadata into Cassandra, ensuring that UUIDs are correctly generated for each entry.
Step 3: Performing similarity searches in Cassandra 5
Once the embeddings are stored, we can perform similarity searches directly within Cassandra using the following function:
import numpy as np

# ------------------ Embedding Functions ------------------

def text_to_vector(text):
    """Convert a text chunk into a vector using the FastText model."""
    words = text.split()
    vectors = [fasttext_model[word] for word in words if word in fasttext_model.key_to_index]
    return np.mean(vectors, axis=0) if vectors else np.zeros(fasttext_model.vector_size)

def find_similar_texts_cassandra(session, input_text, keyspace_name=None, top_k=5):
    # Convert the input text to an embedding
    input_embedding = text_to_vector(input_text)
    input_embedding_str = ', '.join(map(str, input_embedding.tolist()))

    # ANN query: similarity_cosine() returns the score, ORDER BY ... ANN OF drives the search
    query = f"""
    SELECT text, filename, similarity_cosine(embeddings, ?) AS similarity
    FROM {keyspace_name}.embeddings
    ORDER BY embeddings ANN OF [{input_embedding_str}]
    LIMIT {top_k};
    """
    prepared = session.prepare(query)
    bound = prepared.bind((input_embedding,))
    rows = session.execute(bound)

    # Sort the results by similarity in Python
    similar_texts = sorted([(row.similarity, row.filename, row.text) for row in rows],
                           key=lambda x: x[0], reverse=True)
    return similar_texts[:top_k]

from IPython.display import display, HTML

# The word you want to find similarities for
input_text = "place"

# Call the function to find similar texts in the Cassandra database
similar_texts = find_similar_texts_cassandra(session, input_text, keyspace_name="aisearch", top_k=10)
This function searches for similar embeddings in Cassandra and retrieves the top results based on cosine similarity. Under the hood, Cassandra’s vector search uses Hierarchical Navigable Small Worlds (HNSW). HNSW organizes data points in a multi-layer graph structure, making queries significantly faster by narrowing down the search space efficiently—particularly important when handling large datasets.
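If you are curious about the intuition behind HNSW, the toy Python sketch below builds a single-layer neighbour graph by brute force and then greedily hops towards the query. Real HNSW adds a hierarchy of layers and a far smarter construction, so treat this only as a simplified illustration of the navigable-graph idea, not what Cassandra runs internally.

import numpy as np

rng = np.random.default_rng(42)
points = rng.normal(size=(300, 16))   # toy "embeddings"

# Brute-force k-NN graph construction (fine for a toy example, not for real datasets)
k = 8
pairwise = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
neighbours = np.argsort(pairwise, axis=1)[:, 1:k + 1]   # skip self at position 0

def greedy_search(query, entry=0):
    """Hop to whichever neighbour is closest to the query until no neighbour improves."""
    current = entry
    current_dist = np.linalg.norm(points[current] - query)
    while True:
        cand = neighbours[current]
        cand_dists = np.linalg.norm(points[cand] - query, axis=1)
        if cand_dists.min() >= current_dist:
            return current, current_dist   # local minimum = approximate nearest neighbour
        current = int(cand[np.argmin(cand_dists)])
        current_dist = float(cand_dists.min())

query = rng.normal(size=16)
approx_idx, _ = greedy_search(query)
exact_idx = int(np.argmin(np.linalg.norm(points - query, axis=1)))
print(approx_idx, exact_idx)   # often, though not always, the same point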
Step 4: Displaying the results
To display the results in a readable format, we can loop through the similar texts and present them along with their similarity scores:
# Print the similar texts along with their similarity scores
for similarity, filename, text in similar_texts:
    html_content = f"""
    <div style="margin-bottom: 10px;">
        <p><b>Similarity:</b> {similarity:.4f}</p>
        <p><b>Text:</b> {text}</p>
        <p><b>File:</b> {filename}</p>
    </div>
    <hr/>
    """
    display(HTML(html_content))
This code will display the top similar texts, along with their similarity scores and associated file names.
Cassandra 5 vs. Cassandra 4 + OpenSearch®
Cassandra 4 relies on an integration with OpenSearch to handle word embeddings and similarity searches. This approach works well for applications that are already using or comfortable with OpenSearch, but it does introduce additional complexity with the need to maintain two systems.
Cassandra 5, on the other hand, brings vector support directly into the database. With its native VECTOR data type and similarity search functions, it simplifies your architecture and improves performance, making it an ideal solution for applications that require embedding-based searches at scale.
Feature | Cassandra 4 + OpenSearch | Cassandra 5 (Preview) |
Embedding Storage | OpenSearch | Native VECTOR Data Type |
Similarity Search | KNN Plugin in OpenSearch | COSINE, EUCLIDEAN, DOT_PRODUCT |
Search Method | Exact K-Nearest Neighbor | Approximate Nearest Neighbor (ANN) |
System Complexity | Requires two systems | All-in-one Cassandra solution |
Conclusion: A simpler path to similarity search with Cassandra 5
With Cassandra 5, the complexity of setting up and managing a separate search system for word embeddings is gone. The new vector data type and Vector Search capabilities allow you to perform similarity searches directly within Cassandra, simplifying your architecture and making it easier to build AI-powered applications.
Coming up: more in-depth examples and use cases that demonstrate how to take full advantage of these new features in Cassandra 5 in future blogs!
Ready to experience vector search with Cassandra 5? Spin up your first cluster for free on the Instaclustr Managed Platform and try it out!
The post Introduction to similarity search: Part 2–Simplifying with Apache Cassandra® 5’s new vector data type appeared first on Instaclustr.
Introduction to similarity search with word embeddings: Part 1–Apache Cassandra® 4.0 and OpenSearch®
Word embeddings have revolutionized how we approach tasks like natural language processing, search, and recommendation engines.
They allow us to convert words and phrases into numerical representations (vectors) that capture their meaning based on the context in which they appear. Word embeddings are especially useful for tasks where traditional keyword searches fall short, such as finding semantically similar documents or making recommendations based on textual data.
For example: a search for “Laptop” might return results related to “Notebook” or “MacBook” when using embeddings (as opposed to something like “Tablet”) offering a more intuitive and accurate search experience.
As applications increasingly rely on AI and machine learning to drive intelligent search and recommendation engines, the ability to efficiently handle word embeddings has become critical. That’s where databases like Apache Cassandra come into play—offering the scalability and performance needed to manage and query large amounts of vector data.
In Part 1 of this series, we’ll explore how you can leverage word embeddings for similarity searches using Cassandra 4 and OpenSearch. By combining Cassandra’s robust data storage capabilities with OpenSearch’s powerful search functions, you can build scalable and efficient systems that handle both metadata and word embeddings.
Cassandra 4 and OpenSearch: A partnership for embeddings
Cassandra 4 doesn’t natively support vector data types or specific similarity search functions, but that doesn’t mean you’re out of luck. By integrating Cassandra with OpenSearch, an open-source search and analytics platform, you can store word embeddings and perform similarity searches using the k-Nearest Neighbors (kNN) plugin.
This hybrid approach is advantageous over relying on OpenSearch alone because it allows you to leverage Cassandra’s strengths as a high-performance, scalable database for data storage while using OpenSearch for its robust indexing and search capabilities.
Instead of duplicating large volumes of data into OpenSearch solely for search purposes, you can keep the original data in Cassandra. OpenSearch, in this setup, acts as an intelligent pointer, indexing the embeddings stored in Cassandra and performing efficient searches without the need to manage the entire dataset directly.
This approach not only optimizes resource usage but also enhances system maintainability and scalability by segregating storage and search functionalities into specialized layers.
Deploying the environment
To set up your environment for word embeddings and similarity search, you can leverage the Instaclustr Managed Platform, which simplifies deploying and managing your Cassandra cluster and OpenSearch. Instaclustr takes care of the heavy lifting, allowing you to focus on building your application rather than managing infrastructure. In this configuration, Cassandra serves as your primary data store, while OpenSearch handles vector operations and similarity searches.
Here’s how to get started:
- Deploy a managed Cassandra cluster: Start by provisioning your Cassandra 4 cluster on the Instaclustr platform. This managed solution ensures your cluster is optimized, secure, and ready to store non-vector data.
- Set up OpenSearch with kNN plugin: Instaclustr also offers a fully managed OpenSearch service. You will need to deploy OpenSearch, with the kNN plugin enabled, which is critical for handling word embeddings and executing similarity searches.
By using Instaclustr, you gain access to a robust platform that seamlessly integrates Cassandra and OpenSearch, combining Cassandra’s scalable, fault-tolerant database with OpenSearch’s powerful search capabilities. This managed environment minimizes operational complexity, so you can focus on delivering fast and efficient similarity searches for your application.
Preparing the environment
Now that we’ve outlined the environment setup, let’s dive into the specific technical steps to prepare Cassandra and OpenSearch for storing and searching word embeddings.
Step 1: Setting up Cassandra
In Cassandra, we’ll need to create a table to store the metadata. Here’s how to do that:
- Create the Table:
Next, create a table to store the metadata. This table will hold details such as the file name, the related text, and other metadata (the embedding vectors themselves will live in OpenSearch):
CREATE KEYSPACE IF NOT EXISTS aisearch WITH REPLICATION = {'class': 'SimpleStrategy', 'replication_factor': 3};

USE aisearch;

DROP TABLE IF EXISTS file_metadata;

CREATE TABLE IF NOT EXISTS file_metadata (
    id UUID,
    paragraph_uuid UUID,
    filename TEXT,
    text TEXT,
    last_updated timestamp,
    PRIMARY KEY (id, paragraph_uuid)
);
Step 2: Configuring OpenSearch
In OpenSearch, you’ll need to create an index that supports vector operations for similarity search. Here’s how you can configure it:
- Create the index:
Define the index settings and mappings, ensuring that vector operations are enabled and that the correct space type (e.g., L2) is used for similarity calculations.
{ "settings": { "index": { "number_of_shards": 2, "knn": true, "knn.space_type": "l2" } }, "mappings": { "properties": { "file_uuid": { "type": "keyword" }, "paragraph_uuid": { "type": "keyword" }, "embedding": { "type": "knn_vector", "dimension": 300 } } } }
This index configuration is optimized for storing and searching embeddings using the k-Nearest Neighbors algorithm, which is crucial for similarity search.
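One way to create this index programmatically is with the opensearch-py client, as sketched below. The host, port, credentials, and index name are placeholders you would replace with your cluster’s details; the later snippets in this post assume an os_client and index_name created along these lines.

from opensearchpy import OpenSearch

# Placeholder connection details for your managed OpenSearch cluster
os_client = OpenSearch(
    hosts=[{"host": "your-opensearch-host", "port": 9200}],
    http_auth=("username", "password"),
    use_ssl=True,
)

index_name = "embeddings"   # example name; pick whatever suits your project
index_body = {
    "settings": {"index": {"number_of_shards": 2, "knn": True, "knn.space_type": "l2"}},
    "mappings": {
        "properties": {
            "file_uuid": {"type": "keyword"},
            "paragraph_uuid": {"type": "keyword"},
            "embedding": {"type": "knn_vector", "dimension": 300},
        }
    },
}

# Create the index only if it does not already exist
if not os_client.indices.exists(index=index_name):
    os_client.indices.create(index=index_name, body=index_body)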
With these steps, your environment will be ready to handle word embeddings for similarity search using Cassandra and OpenSearch.
Generating embeddings with FastText
Once you have your environment set up, the next step is to generate the word embeddings that will drive your similarity search. For this, we’ll use FastText, a popular library from Facebook’s AI Research team that provides pre-trained word vectors. Specifically, we’re using the crawl-300d-2M model, which offers 300-dimensional vectors for millions of English words.
Step 1: Download and load the FastText model
To start, you’ll need to download the pre-trained model file. This can be done easily using Python and the requests library. Here’s the process:
1. Download the FastText model: The FastText model is stored in a zip file, which you can download from the official FastText website. The following Python script will handle the download and extraction:
import requests
import zipfile
import os

# Adjust file_url and local_filename variables accordingly
file_url = 'https://dl.fbaipublicfiles.com/fasttext/vectors-english/crawl-300d-2M.vec.zip'
local_filename = '/content/gdrive/MyDrive/0_notebook_files/model/crawl-300d-2M.vec.zip'
extract_dir = '/content/gdrive/MyDrive/0_notebook_files/model/'

def download_file(url, filename):
    with requests.get(url, stream=True) as r:
        r.raise_for_status()
        os.makedirs(os.path.dirname(filename), exist_ok=True)
        with open(filename, 'wb') as f:
            for chunk in r.iter_content(chunk_size=8192):
                f.write(chunk)

def unzip_file(filename, extract_to):
    with zipfile.ZipFile(filename, 'r') as zip_ref:
        zip_ref.extractall(extract_to)

# Download and extract
download_file(file_url, local_filename)
unzip_file(local_filename, extract_dir)
2. Load the model: Once the model is downloaded and extracted, you’ll load it using Gensim’s KeyedVectors class. This allows you to work with the embeddings directly:
from gensim.models import KeyedVectors

# Adjust model_path variable accordingly
model_path = "/content/gdrive/MyDrive/0_notebook_files/model/crawl-300d-2M.vec"
fasttext_model = KeyedVectors.load_word2vec_format(model_path, binary=False)
Step 2: Generate embeddings from text
With the FastText model loaded, the next task is to convert text into vectors. This process involves splitting the text into words, looking up the vector for each word in the FastText model, and then averaging the vectors to get a single embedding for the text.
Here’s a function that handles the conversion:
import numpy as np
import re

def text_to_vector(text):
    """Convert text into a vector using the FastText model."""
    text = text.lower()
    words = re.findall(r'\b\w+\b', text)
    vectors = [fasttext_model[word] for word in words if word in fasttext_model.key_to_index]
    if not vectors:
        print(f"No embeddings found for text: {text}")
        return np.zeros(fasttext_model.vector_size)
    return np.mean(vectors, axis=0)
This function tokenizes the input text, retrieves the corresponding word vectors from the model, and computes the average to create a final embedding.
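As a quick sanity check (assuming the FastText model from the previous step is loaded as fasttext_model), the helper should return a 300-dimensional vector:

vec = text_to_vector("Cassandra is a distributed database")
print(vec.shape)   # (300,) with the crawl-300d-2M model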
Step 3: Extract text and generate embeddings from documents
In real-world applications, your text might come from various types of documents, such as PDFs, Word files, or presentations. The following code shows how to extract text from different file formats and convert that text into embeddings:
import uuid
import mimetypes
import pandas as pd
from pdfminer.high_level import extract_pages
from pdfminer.layout import LTTextContainer
from docx import Document
from pptx import Presentation

def generate_deterministic_uuid(name):
    return uuid.uuid5(uuid.NAMESPACE_DNS, name)

def generate_random_uuid():
    return uuid.uuid4()

def get_file_type(file_path):
    # Guess the MIME type based on the file extension
    mime_type, _ = mimetypes.guess_type(file_path)
    return mime_type

def extract_text_from_excel(excel_path):
    xls = pd.ExcelFile(excel_path)
    text_list = []
    for sheet_index, sheet_name in enumerate(xls.sheet_names):
        df = xls.parse(sheet_name)
        for row in df.iterrows():
            text_list.append((" ".join(map(str, row[1].values)), sheet_index + 1))  # +1 to make it 1 based index
    return text_list

def extract_text_from_pdf(pdf_path):
    return [(text_line.get_text().strip().replace('\xa0', ' '), page_num)
            for page_num, page_layout in enumerate(extract_pages(pdf_path), start=1)
            for element in page_layout if isinstance(element, LTTextContainer)
            for text_line in element if text_line.get_text().strip()]

def extract_text_from_word(file_path):
    doc = Document(file_path)
    return [(para.text, (i == 0) + 1) for i, para in enumerate(doc.paragraphs) if para.text.strip()]

def extract_text_from_txt(file_path):
    with open(file_path, 'r') as file:
        return [(line.strip(), 1) for line in file.readlines() if line.strip()]

def extract_text_from_pptx(pptx_path):
    prs = Presentation(pptx_path)
    return [(shape.text.strip(), slide_num)
            for slide_num, slide in enumerate(prs.slides, start=1)
            for shape in slide.shapes
            if hasattr(shape, "text") and shape.text.strip()]

def extract_text_with_page_number_and_embeddings(file_path, embedding_function):
    file_uuid = generate_deterministic_uuid(file_path)
    file_type = get_file_type(file_path)

    extractors = {
        'text/plain': extract_text_from_txt,
        'application/pdf': extract_text_from_pdf,
        'application/vnd.openxmlformats-officedocument.wordprocessingml.document': extract_text_from_word,
        'application/vnd.openxmlformats-officedocument.presentationml.presentation': extract_text_from_pptx,
        'application/zip': lambda path: extract_text_from_pptx(path) if path.endswith('.pptx') else [],
        'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet': extract_text_from_excel,
        'application/vnd.ms-excel': extract_text_from_excel
    }

    text_list = extractors.get(file_type, lambda _: [])(file_path)

    return [
        {
            "uuid": file_uuid,
            "paragraph_uuid": generate_random_uuid(),
            "filename": file_path,
            "text": text,
            "page_num": page_num,
            "embedding": embedding
        }
        for text, page_num in text_list
        if (embedding := embedding_function(text)).any()  # Check if the embedding is not all zeros
    ]

# Replace the file path with the one you want to process
file_path = "../../docs-manager/Cassandra-Best-Practices.pdf"
paragraphs_with_embeddings = extract_text_with_page_number_and_embeddings(file_path, text_to_vector)
This code handles extracting text from different document types, generating embeddings for each text chunk, and associating them with unique IDs.
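Before moving on, it can be worth peeking at the first extracted paragraph to confirm the structure produced above (this assumes the snippet above ran successfully and found at least one paragraph):

first = paragraphs_with_embeddings[0]
print(first["filename"], first["page_num"], len(first["embedding"]), first["text"][:80])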
With FastText set up and embeddings generated, you’re now ready to store these vectors in OpenSearch and start performing similarity searches.
Performing similarity searches
To conduct similarity searches, we utilize the k-Nearest Neighbors (kNN) plugin within OpenSearch. This plugin allows us to efficiently search for the most similar embeddings stored in the system. Essentially, you’re querying OpenSearch to find the closest matches to a word or phrase based on your embeddings.
For example, if you’ve embedded product descriptions, using kNN search helps you locate products that are semantically similar to a given input. This capability can significantly enhance your application’s recommendation engine, categorization, or clustering.
This setup with Cassandra and OpenSearch is a powerful combination, but it’s important to remember that it requires managing two systems. As Cassandra evolves, the introduction of built-in vector support in Cassandra 5 simplifies this architecture. But for now, let’s focus on leveraging both systems to get the most out of similarity searches.
Example: Inserting metadata in Cassandra and embeddings in OpenSearch
In this example, we use Cassandra 4 to store metadata related to files and paragraphs, while OpenSearch handles the actual word embeddings. By storing the paragraph and file IDs in both systems, we can link the metadata in Cassandra with the embeddings in OpenSearch.
We first need to store metadata such as the file name, paragraph UUID, and other relevant details in Cassandra. This metadata is crucial for linking the data between Cassandra, OpenSearch, and the file itself in the filesystem.
The following code demonstrates how to insert the metadata into Cassandra and the embeddings into OpenSearch. Make sure to run the previous script first so that the paragraphs_with_embeddings variable is populated:
from tqdm import tqdm

# Function to insert data into both Cassandra and OpenSearch
def insert_paragraph_data(session, os_client, paragraph, keyspace_name, index_name):
    # Insert into Cassandra
    cassandra_result = insert_with_retry(
        session=session,
        id=paragraph['uuid'],
        paragraph_uuid=paragraph['paragraph_uuid'],
        text=paragraph['text'],
        filename=paragraph['filename'],
        keyspace_name=keyspace_name,
        max_retries=3,
        retry_delay_seconds=1
    )
    if not cassandra_result:
        return False  # Stop further processing if Cassandra insertion fails

    # Insert into OpenSearch
    opensearch_result = insert_embedding_to_opensearch(
        os_client=os_client,
        index_name=index_name,
        file_uuid=paragraph['uuid'],
        paragraph_uuid=paragraph['paragraph_uuid'],
        embedding=paragraph['embedding']
    )
    if opensearch_result is not None:
        return False  # Return False if OpenSearch insertion fails

    return True  # Return True on success for both

# Process each paragraph with a progress bar
print("Starting batch insertion of paragraphs.")
for paragraph in tqdm(paragraphs_with_embeddings, desc="Inserting paragraphs"):
    if not insert_paragraph_data(
        session=session,
        os_client=os_client,
        paragraph=paragraph,
        keyspace_name=keyspace_name,
        index_name=index_name
    ):
        print(f"Insertion failed for UUID {paragraph['uuid']}: {paragraph['text'][:50]}...")

print("Batch insertion completed.")
Performing similarity search
Now that we’ve stored both metadata in Cassandra and embeddings in OpenSearch, it’s time to perform a similarity search. This step involves searching OpenSearch for embeddings that closely match a given input and then retrieving the corresponding metadata from Cassandra.
The process is straightforward: we start by converting the input text into an embedding, then use the k-Nearest Neighbors (kNN) plugin in OpenSearch to find the most similar embeddings. Once we have the results, we fetch the related metadata from Cassandra, such as the original text and file name.
Here’s how it works:
- Convert text to embedding: Start by converting your input text into an embedding vector using the FastText model. This vector will serve as the query for our similarity search.
- Search OpenSearch for similar embeddings: Using the KNN search capability in OpenSearch, we find the top k most similar embeddings. Each result includes the corresponding file and paragraph UUIDs, which help us link the results back to Cassandra.
- Fetch metadata from Cassandra: With the UUIDs retrieved from OpenSearch, we query Cassandra to get the metadata, such as the original text and file name, associated with each embedding.
The following code demonstrates this process:
import uuid
from IPython.display import display, HTML

def find_similar_embeddings_opensearch(os_client, index_name, input_embedding, top_k=5):
    """Search for similar embeddings in OpenSearch and return the associated UUIDs."""
    query = {
        "size": top_k,
        "query": {
            "knn": {
                "embedding": {
                    "vector": input_embedding.tolist(),
                    "k": top_k
                }
            }
        }
    }

    response = os_client.search(index=index_name, body=query)

    similar_uuids = []
    for hit in response['hits']['hits']:
        file_uuid = hit['_source']['file_uuid']
        paragraph_uuid = hit['_source']['paragraph_uuid']
        similar_uuids.append((file_uuid, paragraph_uuid))

    return similar_uuids

def fetch_metadata_from_cassandra(session, file_uuid, paragraph_uuid, keyspace_name):
    """Fetch the metadata (text and filename) from Cassandra based on UUIDs."""
    file_uuid = uuid.UUID(file_uuid)
    paragraph_uuid = uuid.UUID(paragraph_uuid)

    query = f"""
    SELECT text, filename
    FROM {keyspace_name}.file_metadata
    WHERE id = ? AND paragraph_uuid = ?;
    """
    prepared = session.prepare(query)
    bound = prepared.bind((file_uuid, paragraph_uuid))
    rows = session.execute(bound)

    for row in rows:
        return row.filename, row.text
    return None, None

# Input text to find similar embeddings
input_text = "place"

# Convert input text to embedding
input_embedding = text_to_vector(input_text)

# Find similar embeddings in OpenSearch
similar_uuids = find_similar_embeddings_opensearch(os_client, index_name=index_name,
                                                   input_embedding=input_embedding, top_k=10)

# Fetch and display metadata from Cassandra based on the UUIDs found in OpenSearch
for file_uuid, paragraph_uuid in similar_uuids:
    filename, text = fetch_metadata_from_cassandra(session, file_uuid, paragraph_uuid, keyspace_name)
    if filename and text:
        html_content = f"""
        <div style="margin-bottom: 10px;">
            <p><b>File UUID:</b> {file_uuid}</p>
            <p><b>Paragraph UUID:</b> {paragraph_uuid}</p>
            <p><b>Text:</b> {text}</p>
            <p><b>File:</b> {filename}</p>
        </div>
        <hr/>
        """
        display(HTML(html_content))
This code demonstrates how to find similar embeddings in OpenSearch and retrieve the corresponding metadata from Cassandra. By linking the two systems via the UUIDs, you can build powerful search and recommendation systems that combine metadata storage with advanced embedding-based searches.
Conclusion and next steps: A powerful combination of Cassandra 4 and OpenSearch
By leveraging the strengths of Cassandra 4 and OpenSearch, you can build a system that handles both metadata storage and similarity search. Cassandra efficiently stores your file and paragraph metadata, while OpenSearch takes care of embedding-based searches using the k-Nearest Neighbors algorithm. Together, these two technologies enable powerful, large-scale applications for text search, recommendation engines, and more.
Coming up in Part 2, we’ll explore how Cassandra 5 simplifies this architecture with built-in vector support and native similarity search capabilities.
Ready to try vector search with Cassandra and OpenSearch? Spin up your first cluster for free on the Instaclustr Managed Platform and explore the incredible power of vector search.
The post Introduction to similarity search with word embeddings: Part 1–Apache Cassandra® 4.0 and OpenSearch® appeared first on Instaclustr.
IBM acquires DataStax: What that means for customers–and why Instaclustr is a smart alternative
IBM’s recent acquisition of DataStax has certainly made waves in the tech industry. With IBM’s expanding influence in data solutions and DataStax’s reputation for advancing Apache Cassandra® technology, this acquisition could signal a shift in the database management landscape.
For businesses currently using DataStax, this news might have sparked questions about what the future holds. How does this acquisition impact your systems, your data, and, most importantly, your goals?
While the acquisition offers the prospect of integrating IBM’s cloud capabilities with high-performance NoSQL solutions, there is uncertainty too. Transition periods for acquisitions often involve changes in product development priorities, pricing structures, and support strategies.
However, one thing is certain: customers want reliable, scalable, and transparent solutions. If you’re re-evaluating your options amid these changes, here’s why NetApp Instaclustr offers an excellent path forward.
Decoding the IBM-DataStax link-up
DataStax is a provider of enterprise solutions for Apache Cassandra, a powerful NoSQL database trusted for its ability to handle massive amounts of distributed data. IBM’s acquisition reflects its growing commitment to strengthening data management and expanding its footprint in the open source ecosystem.
While the acquisition promises an infusion of IBM’s resources and reach, IBM’s strategy often leans into long-term integration into its own cloud services and platforms. This could potentially reshape DataStax’s roadmap to align with IBM’s broader cloud-first objectives. Customers who don’t rely solely on IBM’s ecosystem—or want flexibility in their database management—might feel caught in a transitional limbo.
This is where Instaclustr comes into the picture as a strong, reliable alternative solution.
Why consider Instaclustr?
Instaclustr is purpose-built to empower businesses with a robust, open source data stack. For businesses relying on Cassandra or DataStax, Instaclustr delivers an alternative that’s stable, high-performing, and highly transparent.
Here’s why Instaclustr could be your best option moving forward:
1. 100% open source commitment
We’re firm believers in the power of open source technology. We offer pure Apache Cassandra, keeping it true to its roots without the proprietary lock-ins or hidden limitations. Unlike proprietary solutions, a commitment to pure open source ensures flexibility, freedom, and no vendor lock-in. You maintain full ownership and control.
2. Platform agnostic
One of the things that sets our solution apart is our platform-agnostic approach. Whether you’re running your workloads on AWS, Google Cloud, Azure, or on-premises environments, we make it seamless for you to deploy, manage, and scale Cassandra. This differentiates us from vendors tied deeply to specific clouds—like IBM.
3. Transparent pricing
Worried about the potential for a pricing overhaul under IBM’s leadership of DataStax? At Instaclustr, we pride ourselves on simplicity and transparency. What you see is what you get—predictable costs without hidden fees or confusing licensing rules. Our customer-first approach ensures that you remain in control of your budget.
4. Expert support and services
With Instaclustr, you’re not just getting access to technology—you’re also gaining access to a team of Cassandra experts who breathe open source. We’ve been managing and optimizing Cassandra clusters across the globe for years, with a proven commitment to providing best-in-class support.
Whether it’s data migration, scaling real-world workloads, or troubleshooting, we have you covered every step of the way. And our reliable SLA-backed managed Cassandra services mean businesses can focus less on infrastructure stress and more on innovation.
5. Seamless migrations
Concerned about the transition process? If you’re currently on DataStax and contemplating a move, our solution provides tools, guidance, and hands-on support to make the migration process smooth and efficient. Our experience in executing seamless migrations ensures minimal disruption to your operations.
Customer-centric focus
At the heart of everything we do is a commitment to your success. We understand that your data management strategy is critical to achieving your business goals, and we work hard to provide adaptable solutions.
Instaclustr comes to the table with over 10 years of experience in managing open source technologies including Cassandra, Apache Kafka®, PostgreSQL®, OpenSearch®, Valkey®, ClickHouse® and more, backed by over 400 million node hours and 18+ petabytes of data under management. Our customers trust and rely on us to manage the data that drives their critical business applications.
With a focus on fostering an open source future, our solutions aren’t tied to any single cloud, ecosystem, or bit of red tape. Simply put: your open source success is our mission.
Final thoughts: Why Instaclustr is the smart choice for this moment
IBM’s acquisition of DataStax might open new doors—but close many others. While the collaboration between IBM and DataStax might appeal to some enterprises, it’s important to weigh alternative solutions that offer reliability, flexibility, and freedom.
With Instaclustr, you get a partner that’s been empowering businesses with open source technologies for years, providing the transparency, support, and performance you need to thrive.
Ready to explore a stable, long-term alternative to DataStax? Check out Instaclustr for Apache Cassandra.
Contact us and learn more about Instaclustr for Apache Cassandra or request a demo of the Instaclustr platform today!
The post IBM acquires DataStax: What that means for customers–and why Instaclustr is a smart alternative appeared first on Instaclustr.
Innovative data compression for time series: An open source solution
Introduction
There’s no escaping the role that monitoring plays in our everyday lives, whether it’s the weather, the number of steps we take in a day, computer systems, or ever-popular IoT devices.
Practically any activity can be monitored in one form or another these days. This generates increasing amounts of data to be pored over and analyzed–but storing all this data adds significant costs over time. Given this huge amount of data that only grows with each passing day, efficient compression techniques are crucial.
Here at NetApp® Instaclustr we saw a great opportunity to improve the current compression techniques for our time series data. That’s why we created the Advanced Time Series Compressor (ATSC) in partnership with the University of Canberra through the OpenSI initiative.
ATSC is a groundbreaking compressor designed to address the challenges of efficiently compressing large volumes of time-series data. Internal test results with production data from our database metrics showed that ATSC compresses, on average across the dataset, ~10x more than LZ4 and ~30x more than the default Prometheus compression. Check out ATSC on GitHub.
There are so many compressors already, so why develop another one?
While other compression methods like LZ4, DoubleDelta, and ZSTD are lossless, most of our timeseries data is already lossy. Timeseries data can be lossy from the beginning due to under-sampling or insufficient data collection, or it can become lossy over time as metrics are rolled over or averaged. Because of this, the idea of a lossy compressor was born.
ATSC is a highly configurable, lossy compressor that leverages the characteristics of time-series data to create function approximations. ATSC finds a fitting function and stores the parametrization of that function—no actual data from the original timeseries is stored. When the data is decompressed, it isn’t identical to the original, but it is still sufficient for the intended use.
Here’s an example: for a temperature change metric—which mostly varies slowly (as do a lot of system metrics!)—instead of storing all the points that have a small change, we fit a curve (or a line) and store that curve/line achieving significant compression ratios.
Image 1: ATSC data for temperature
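To make the idea concrete, here is a small illustrative Python sketch (not ATSC itself) that fits a line to a slowly varying synthetic temperature series and keeps only the two line coefficients instead of every point:

import numpy as np

t = np.arange(1000)
temperature = 20 + 0.001 * t + np.random.default_rng(0).normal(0, 0.05, t.size)

slope, intercept = np.polyfit(t, temperature, 1)   # 2 floats stored instead of 1000 samples
reconstructed = slope * t + intercept

max_err = np.max(np.abs(reconstructed - temperature))
print(f"stored 2 values instead of {t.size}; max error = {max_err:.3f}")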
How does ATSC work?
ATSC looks at the actual time series, in whole or in parts, to find how to better calculate a function that fits the existing data. For that, a quick statistical analysis is done, but if the results are inconclusive a sample is compressed with all the functions and the best function is selected.
By default, ATSC will segment the data—this guarantees better local fitting, more and smaller computations, and less memory usage. It also ensures that decompression targets a specific block instead of the whole file.
In each fitting frame, ATSC will create a function from a pre-defined set and calculate the parametrization of said function.
ATSC currently uses one (per frame) of those following functions:
- FFT (Fast Fourier Transforms)
- Constant
- Interpolation – Catmull-Rom
- Interpolation – Inverse Distance Weight
Image 2: Polynomial fitting vs. Fast-Fourier Transform fitting
These methods allow ATSC to compress data with a fitting error within 1% (configurable!) of the original time-series.
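As a rough illustration of the per-frame selection described above, the toy Python sketch below frames a series, tries a couple of candidate fits per frame, and keeps only the best parametrization. ATSC itself is written in Rust and uses the richer function set listed earlier, so treat this purely as an intuition aid.

import numpy as np

def fit_constant(frame):
    c = float(np.mean(frame))
    return ("constant", [c]), np.full_like(frame, c)

def fit_line(frame):
    x = np.arange(frame.size)
    coeffs = np.polyfit(x, frame, 1)
    return ("line", coeffs.tolist()), np.polyval(coeffs, x)

def compress(series, frame_size=256):
    compressed = []
    for start in range(0, series.size, frame_size):
        frame = series[start:start + frame_size]
        candidates = [fit_constant(frame), fit_line(frame)]
        # Keep the candidate whose reconstruction has the smallest maximum error
        params, _ = min(candidates, key=lambda c: np.max(np.abs(c[1] - frame)))
        compressed.append(params)   # only the function name and its parameters are kept
    return compressed

series = 20 + np.sin(np.linspace(0, 3, 2048))
print(compress(series)[:2])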
For a more detailed insight into ATSC internals and operations check our paper!
Use cases for ATSC and results
ATSC draws inspiration from established compression and signal analysis techniques, achieving compression ratios ranging from 46x to 880x with a fitting error within 1% of the original time-series. In some cases, ATSC can produce highly compressed data without losing any meaningful information, making it a versatile tool for various applications (please see use cases below).
Some results from our internal tests comparing to LZ4 and normal Prometheus compression yielded the following results:
Method | Compressed size (bytes) | Compression Ratio |
Prometheus | 454,778,552 | 1.33 |
LZ4 | 141,347,821 | 4.29 |
ATSC | 14,276,544 | 42.47 |
Another characteristic is the asymmetry between compression and decompression speed: compression is about 30x slower than decompression. This is a reasonable trade-off, since a time series is expected to be compressed once but decompressed many times.
Image 3: A better fitting (purple) vs. a loose fitting (red). Purple takes twice as much space.
ATSC is versatile and can be applied in various scenarios where space reduction is prioritized over absolute precision. Some examples include:
- Rolled-over time series: metrics data that are rolled over and stored for the long term. ATSC can offer the same or greater space savings, but with minimal loss of information and precision.
- Under-sampled time series: systems with very low sampling rates (30 seconds or more) make it very difficult to identify actual events. ATSC delivers the space savings while keeping the information about those events, effectively letting you increase sample rates without using more space.
- Long, slow-moving data series: Ideal for patterns that are easy to fit, such as weather data.
- Human visualization: Data meant for human analysis, with minimal impact on accuracy, such as historic views into system metrics (CPU, Memory, Disk, etc.)
Image 4: ATSC data (green) with an 88x compression vs. the original data (yellow)
Using ATSC
ATSC is written in Rust and is available on GitHub. You can build and run it yourself by following these instructions.
Future work
Currently, we are planning to evolve ATSC in two ways (check our open issues):
- Adding features to the core compressor, focused on these functionalities:
- Frame expansion for appending new data to existing frames
- Dynamic function loading to add more functions without altering the codebase
- Global and per-frame error storage
- Improved error encoding
- Integrations with additional technologies (e.g. databases):
- We are currently looking into integrating ATSC with ClickHouse® and Apache Cassandra®
CREATE TABLE sensors_poly (
    sensor_id UInt16,
    location UInt32,
    timestamp DateTime,
    pressure Float64 CODEC(ATSC('Polynomial', 1)),
    temperature Float64 CODEC(ATSC('Polynomial', 1))
)
ENGINE = MergeTree
ORDER BY (sensor_id, location, timestamp);
Image 5: Currently testing ClickHouse integration
Sound interesting? Try it out and let us know what you think.
ATSC represents a significant advancement in time-series data compression, offering high compression ratios with a configurable accuracy loss. Whether for long-term storage or efficient data visualization, ATSC is a powerful open source tool for managing large volumes of time-series data.
But don’t just take our word for it—download and run it!
Check our documentation for any information you need and submit ideas for improvements or issues you find using GitHub issues. We also have easy first issues tagged if you’d like to contribute to the project.
Want to integrate this with another tool? You can build and run our demo integration with ClickHouse.
The post Innovative data compression for time series: An open source solution appeared first on Instaclustr.
New cassandra_latest.yaml configuration for a top performant Apache Cassandra®
Welcome to our deep dive into the latest advancements in Apache Cassandra® 5.0, specifically focusing on the cassandra_latest.yaml configuration that is available for new Cassandra 5.0 clusters.
This blog post will walk you through the motivation behind these changes, how to use the new configuration, and the benefits it brings to your Cassandra clusters.
Motivation
The primary motivation for introducing cassandra_latest.yaml is to bridge the gap between maintaining backward compatibility and leveraging the latest features and performance improvements. The yaml addresses the following varying needs for new Cassandra 5.0 clusters:
- Cassandra Developers: who want to push new features but face challenges due to backward compatibility constraints.
- Operators: who prefer stability and minimal disruption during upgrades.
- Evangelists and New Users: who seek the latest features and performance enhancements without worrying about compatibility.
Using cassandra_latest.yaml
Using cassandra_latest.yaml is straightforward. It involves copying the cassandra_latest.yaml content to your cassandra.yaml or pointing the cassandra.config JVM property to the cassandra_latest.yaml file.
This configuration is designed for new Cassandra 5.0 clusters (or those evaluating Cassandra), ensuring they get the most out of the latest features in Cassandra 5.0 and performance improvements.
Key changes and features
Key Cache Size
- Old: Defaults to the minimum of 5% of the heap and 100MB
- Latest: Explicitly set to 0
Impact: Setting the key cache size to 0 in the latest configuration avoids performance degradation with the new SSTable format. This change is particularly beneficial for clusters using the new SSTable format, which doesn’t require key caching in the same way as the old format. Key caching was used to reduce the time it takes to find a specific key in Cassandra storage.
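For reference, a minimal sketch of how this can look in the yaml (option name per the unit-suffixed syntax used in recent Cassandra versions; verify against the cassandra_latest.yaml shipped with your build):
# Disable the key cache; the new SSTable format does not rely on it
key_cache_size: 0MiB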
Commit Log Disk Access Mode
- Old: Set to legacy
- Latest: Set to auto
Impact: The auto setting optimizes the commit log disk access mode based on the available disks, potentially improving write performance. It can automatically choose the best mode (e.g., direct I/O) depending on the hardware and workload, leading to better performance without manual tuning.
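A sketch of the corresponding setting, assuming the option name introduced in Cassandra 5.0:
# Let Cassandra choose the best commit log access mode for the hardware (e.g. direct I/O)
commitlog_disk_access_mode: auto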
Memtable Implementation
- Old: Skiplist-based
- Latest: Trie-based
Impact: The trie-based memtable implementation reduces garbage collection overhead and improves throughput by moving more metadata off-heap. This change can lead to more efficient memory usage and higher write performance, especially under heavy load.
create table … with memtable = {'class': 'TrieMemtable', … }
Memtable Allocation Type
- Old: Heap buffers
- Latest: Off-heap objects
Impact: Using off-heap objects for memtable allocation reduces the pressure on the Java heap, which can improve garbage collection performance and overall system stability. This is particularly beneficial for large datasets and high-throughput environments.
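In yaml terms, this is roughly (the option itself exists in earlier versions too; only the default changes here):
# Store memtable contents as off-heap objects to reduce Java heap pressure
memtable_allocation_type: offheap_objects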
Trickle Fsync
- Old: False
- Latest: True
Impact: Enabling trickle fsync improves performance on SSDs by periodically flushing dirty buffers to disk, which helps avoid sudden large I/O operations that can impact read latencies. This setting is particularly useful for maintaining consistent performance in write-heavy workloads.
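A minimal sketch of the relevant settings (the interval shown is the commonly used default; confirm the values in your own configuration):
# Flush dirty buffers in small increments rather than one large fsync
trickle_fsync: true
trickle_fsync_interval: 10240KiB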
SSTable Format
- Old: big
- Latest: bti (trie-indexed structure)
Impact: The new BTI format is designed to improve read and write performance by using a trie-based indexing structure. This can lead to faster data access and more efficient storage management, especially for large datasets.
sstable:
  selected_format: bti

default_compression: zstd
compression:
  zstd:
    enabled: true
    chunk_length: 16KiB
    max_compressed_length: 16KiB
Default Compaction Strategy
- Old: STCS (Size-Tiered Compaction Strategy)
- Latest: Unified Compaction Strategy
Impact: The Unified Compaction Strategy (UCS) is more efficient and can handle a wider variety of workloads compared to STCS. UCS can reduce write amplification and improve read performance by better managing the distribution of data across SSTables.
default_compaction:
  class_name: UnifiedCompactionStrategy
  parameters:
    scaling_parameters: T4
    max_sstables_to_compact: 64
    target_sstable_size: 1GiB
    sstable_growth: 0.3333333333333333
    min_sstable_size: 100MiB
Concurrent Compactors
- Old: Defaults to the smaller of the number of disks and cores
- Latest: Explicitly set to 8
Impact: Setting the number of concurrent compactors to 8 ensures that multiple compaction operations can run simultaneously, helping to maintain read performance during heavy write operations. This is particularly beneficial for SSD-backed storage where parallel I/O operations are more efficient.
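The setting itself is a one-liner (tune the value to your hardware rather than copying it blindly):
# Allow up to 8 compaction tasks to run in parallel
concurrent_compactors: 8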
Default Secondary Index
- Old: legacy_local_table
- Latest: sai
Impact: SAI is a new index implementation that builds on the advancements made with SSTable Attached Secondary Indexes (SASI). It enables users to index multiple columns on the same table without suffering scaling problems, especially at write time.
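A sketch of how this default is expressed in the yaml (option names as they appear in the 5.0 configuration; double-check before applying):
# Use SAI when CREATE INDEX does not specify an index implementation
default_secondary_index: sai
default_secondary_index_enabled: true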
Stream Entire SSTables
- Old: implicitly set to True
- Latest: explicitly set to True
Impact: When enabled, Cassandra can zero-copy stream entire eligible SSTables between nodes, including every component. This speeds up network transfers significantly, subject to the throttling specified by entire_sstable_stream_throughput_outbound and, for inter-DC transfers, entire_sstable_inter_dc_stream_throughput_outbound.
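A rough yaml sketch of these settings (the throughput values shown are illustrative defaults, not recommendations):
# Zero-copy streaming of whole SSTables, with outbound throughput caps
stream_entire_sstables: true
entire_sstable_stream_throughput_outbound: 24MiB/s
entire_sstable_inter_dc_stream_throughput_outbound: 24MiB/s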
UUID SSTable Identifiers
- Old: False
- Latest: True
Impact: Enabling UUID-based SSTable identifiers ensures that each SSTable has a unique name, simplifying backup and restore operations. This change reduces the risk of name collisions and makes it easier to manage SSTables in distributed environments.
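The corresponding setting is a single flag (name as used since Cassandra 4.1):
# Name SSTables with unique, UUID-based identifiers
uuid_sstable_identifiers_enabled: true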
Storage Compatibility Mode
- Old: Cassandra 4
- Latest: None
Impact: Setting the storage compatibility mode to none enables all new features by default, allowing users to take full advantage of the latest improvements, such as the new sstable format, in Cassandra. This setting is ideal for new clusters or those that do not need to maintain backward compatibility with older versions.
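A sketch of the setting as it appears in the 5.0 configuration:
# Do not constrain storage behaviour for compatibility with pre-5.0 nodes
storage_compatibility_mode: NONE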
Testing and validation
The cassandra_latest.yaml configuration has undergone rigorous testing to ensure it works seamlessly. Currently, the Cassandra project CI pipeline tests both the standard (cassandra.yaml) and latest (cassandra_latest.yaml) configurations, ensuring compatibility and performance. This includes unit tests, distributed tests, and DTests.
Future improvements
Future improvements may include enforcing password strength policies and other security enhancements. The community is encouraged to suggest features that could be enabled by default in cassandra_latest.yaml.
Conclusion
The cassandra_latest.yaml configuration for new Cassandra 5.0 clusters is a significant step forward in making Cassandra more performant and feature-rich while maintaining the stability and reliability that users expect. Whether you are a developer, an operations professional, or an evangelist/end user, cassandra_latest.yaml offers something valuable for everyone.
Try it out
Ready to experience the incredible power of the cassandra_latest.yaml configuration on Apache Cassandra 5.0? Spin up your first cluster with a free trial on the Instaclustr Managed Platform and get started today with Cassandra 5.0!
The post New cassandra_latest.yaml configuration for a top performant Apache Cassandra® appeared first on Instaclustr.
Instaclustr for Apache Cassandra® 5.0 Now Generally Available
NetApp is excited to announce the general availability (GA) of Apache Cassandra® 5.0 on the Instaclustr Platform. This follows the release of the public preview in March.
NetApp was the first managed service provider to release the beta version, and now offers the Generally Available version, allowing the deployment of Cassandra 5.0 across the major cloud providers (AWS, Azure, and GCP) and on-premises.
Apache Cassandra has been a leader in NoSQL databases since its inception and is known for its high availability, reliability, and scalability. The latest version brings many new features and enhancements, with a special focus on building data-driven applications through artificial intelligence and machine learning capabilities.
Cassandra 5.0 will help you optimize performance, lower costs, and get started on the next generation of distributed computing by:
- Helping you build AI/ML-based applications through Vector Search
- Bringing efficiencies to your applications through new and enhanced indexing and processing capabilities
- Improving flexibility and security
With the GA release, you can use Cassandra 5.0 for your production workloads, which are covered by NetApp’s industry-leading SLAs. NetApp has conducted performance benchmarking and extensive testing while removing the limitations that were present in the preview release to offer a more reliable and stable version. Our GA offering is suitable for all workload types as it contains the most up-to-date range of features, bug fixes, and security patches.
Support for continuous backups and private network add-ons is available. Currently, Debezium is not yet compatible with Cassandra 5.0. NetApp will work with the Debezium community to add support for Debezium on Cassandra 5.0 and it will be available on the Instaclustr Platform as soon as it is supported.
Some of the key new features in Cassandra 5.0 include:
- Storage-Attached Indexes (SAI): A highly scalable, globally distributed index for Cassandra databases. With SAI, column-level indexes can be added, leading to unparalleled I/O throughput for searches across different data types, including vectors. SAI also enables lightning-fast data retrieval through zero-copy streaming of indices, resulting in unprecedented efficiency.
- Vector Search: This is a powerful technique for searching relevant content or discovering connections by comparing similarities in large document collections and is particularly useful for AI applications. It uses storage-attached indexing and dense indexing techniques to enhance data exploration and analysis.
- Unified Compaction Strategy: This strategy unifies compaction approaches, including leveled, tiered, and time-windowed strategies. It leads to a major reduction in SSTable sizes. Smaller SSTables mean better read and write performance, reduced storage requirements, and improved overall efficiency.
- Numerous stability and testing improvements: You can read all about these changes here.
All these new features are available out-of-the-box in Cassandra 5.0 and do not incur additional costs.
Our Development team has worked diligently to bring you a stable release of Cassandra 5.0. Substantial preparatory work was done to ensure you have a seamless experience with Cassandra 5.0 on the Instaclustr Platform. This includes updating the Cassandra YAML and Java environment and enhancing the monitoring capabilities of the platform to support new data types.
We also conducted extensive performance testing and benchmarked version 5.0 with the existing stable Apache Cassandra 4.1.5 version. We will be publishing our benchmarking results shortly; the highlight so far is that Cassandra 5.0 improves responsiveness by reducing latencies by up to 30% during peak load times.
Through our dedicated Apache Cassandra committer, NetApp has contributed to the development of Cassandra 5.0 by enhancing the documentation for new features like Vector Search (Cassandra-19030), enabling Materialized Views (MV) with only partition keys (Cassandra-13857), fixing numerous bugs, and contributing to the improvements for the unified compaction strategy feature, among many other things.
Lifecycle Policy Updates
As previously communicated, the project will no longer maintain Apache Cassandra 3.0 and 3.11 versions (full details of the announcement can be found on the Apache Cassandra website).
To help you transition smoothly, NetApp will provide extended support for these versions for an additional 12 months. During this period, we will backport any critical bug fixes, including security patches, to ensure the continued security and stability of your clusters.
Cassandra 3.0 and 3.11 versions will reach end-of-life on the Instaclustr Managed Platform within the next 12 months. We will work with you to plan and upgrade your clusters during this period.
Additionally, the Cassandra 5.0 beta version and the Cassandra 5.0 RC2 version, which were released as part of the public preview, are now end-of-life. You can check the lifecycle status of different Cassandra application versions here.
You can read more about our lifecycle policies on our website.
Getting Started
Upgrading to Cassandra 5.0 will allow you to stay current and start taking advantage of its benefits. The Instaclustr by NetApp Support team is ready to help customers upgrade clusters to the latest version.
- Wondering if it’s possible to upgrade your workloads from Cassandra 3.x to Cassandra 5.0? Find the answer to this and other similar questions in this detailed blog.
- Click here to read about Storage Attached Indexes in Apache Cassandra 5.0.
- Learn about 4 new Apache Cassandra 5.0 features to be excited about.
- Click here to learn what you need to know about Apache Cassandra 5.0.
Why Choose Apache Cassandra on the Instaclustr Managed Platform?
NetApp strives to deliver the best of supported applications. Whether it’s the latest application versions available on the platform or additional platform enhancements, we ensure high quality through thorough testing before anything enters General Availability.
NetApp customers have the advantage of accessing the latest versions—not just the major version releases but also minor version releases—so that they can benefit from any new features and are protected from any vulnerabilities.
Don’t have an Instaclustr account yet? Sign up for a trial or reach out to our Sales team and start exploring Cassandra 5.0.
With more than 375 million node hours of management experience, Instaclustr offers unparalleled expertise. Visit our website to learn more about the Instaclustr Managed Platform for Apache Cassandra.
If you would like to upgrade your Apache Cassandra version or have any issues or questions about provisioning your cluster, please contact Instaclustr Support at any time.
The post Instaclustr for Apache Cassandra® 5.0 Now Generally Available appeared first on Instaclustr.
Apache Cassandra® 5.0: Behind the Scenes
Here at NetApp, our Instaclustr product development team has spent nearly a year preparing for the release of Apache Cassandra 5.
It started with one engineer tinkering at night with the Apache Cassandra 5 Alpha branch and grew to five engineers working on various monitoring, configuration, testing, and functionality improvements to integrate the release with the Instaclustr Platform.
It’s been a long journey to the point we are at today, offering Apache Cassandra 5 Release Candidate 1 in public preview on the Instaclustr Platform.
Note: the Instaclustr team has a dedicated open source committer to the Apache Cassandra project. His changes are not included in this document as there were too many for us to include here. Instead, this blog primarily focuses on the engineering effort to release Cassandra 5.0 onto the Instaclustr Managed Platform.
August 2023: The Beginning
We began experimenting with the Apache Cassandra 5 Alpha 1 branches using our build systems. There were several tools we built into our Apache Cassandra images that were not working at this point, but we managed to get a node to start even though it immediately crashed with errors.
One of our early achievements was identifying and fixing a bug that impacted our packaging solution; this resulted in a small contribution to the project allowing Apache Cassandra to be installed on Debian systems with non-OpenJDK Java.
September 2023: First Milestone
The release of the Alpha 1 version allowed us to achieve our first running Cassandra 5 cluster in our development environments (without crashing!).
Basic core functionalities like user creation, data writing, and backups/restores were tested successfully. However, several advanced features, such as repair and replace tooling, monitoring, and alerting were still untested.
At this point we had to pause our Cassandra 5 efforts to focus on other priorities and planned to get back to testing Cassandra 5 after Alpha 2 was released.
November 2023: Further Testing and Internal Preview
The project released Alpha 2. We repeated the same build and test we did on alpha 1. We also tested some more advanced procedures like cluster resizes with no issues.
We also started testing with some of the new 5.0 features: Vector Data types and Storage-Attached Indexes (SAI), which resulted in another small contribution.
We launched Apache Cassandra 5 Alpha 2 for internal preview (basically for internal users). This allowed the wider Instaclustr team to access and use the Alpha on the platform.
During this phase we found a bug in our metrics collector, triggered when vectors were encountered, that ended up becoming a major project for us.
If you see errors like the below, it’s time for a Java Cassandra driver upgrade to 4.16 or newer:
java.lang.IllegalArgumentException: Could not parse type name vector<float, 5>
Nov 15 22:41:04 ip-10-0-39-7 process[1548]: at com.datastax.driver.core.DataTypeCqlNameParser.parse(DataTypeCqlNameParser.java:233)
Nov 15 22:41:04 ip-10-0-39-7 process[1548]: at com.datastax.driver.core.TableMetadata.build(TableMetadata.java:311)
Nov 15 22:41:04 ip-10-0-39-7 process[1548]: at com.datastax.driver.core.SchemaParser.buildTables(SchemaParser.java:302)
Nov 15 22:41:04 ip-10-0-39-7 process[1548]: at com.datastax.driver.core.SchemaParser.refresh(SchemaParser.java:130)
Nov 15 22:41:04 ip-10-0-39-7 process[1548]: at com.datastax.driver.core.ControlConnection.refreshSchema(ControlConnection.java:417)
Nov 15 22:41:04 ip-10-0-39-7 process[1548]: at com.datastax.driver.core.ControlConnection.refreshSchema(ControlConnection.java:356)
<Rest of stacktrace removed for brevity>
December 2023: Focus on new features and planning
As the project released Beta 1, we began focusing on the features in Cassandra 5 that we thought were the most exciting and would provide the most value to customers. There are a lot of awesome new features and changes, so it took a while to find the ones with the largest impact.
The final list of high impact features we came up with was:
- A new data type – Vectors
- Trie memtables/Trie Indexed SSTables (BTI Formatted SStables)
- Storage-Attached Indexes (SAI)
- Unified Compaction Strategy
A major new feature we considered deploying was support for JDK 17. However, due to its experimental nature, we have opted to postpone adoption and plan to support running Apache Cassandra on JDK 17 when it’s out of the experimentation phase.
Once the holiday season arrived, it was time for a break, and we were back in force in February next year.
February 2024: Intensive testing
In February, we released Beta 1 into internal preview so we could start testing it on our Preproduction test environments. As we started to do more intensive testing, we discovered issues in the interaction with our monitoring and provisioning setup.
We quickly fixed the issues identified as showstoppers for launching Cassandra 5. By the end of February, we initiated discussions about a public preview release. We also started to add more resourcing to the Cassandra 5 project. Up until now, only one person was working on it.
Next, we broke down the work we needed to do. This included identifying monitoring agents requiring upgrade and config defaults that needed to change.
From this point, the project split into 3 streams of work:
- Project Planning – Deciding how all this work gets pulled together cleanly, ensuring other work streams have adequate resourcing to hit their goals, and informing product management and the wider business of what’s happening.
- Configuration Tuning – Deciding which new Apache Cassandra features to include, how to approach the transition to JDK 17, and how to use BTI formatted SSTables on the platform.
- Infrastructure Upgrades – Identifying what to upgrade internally to handle Cassandra 5, including Vectors and BTI formatted SSTables.
A Senior Engineer was responsible for each workstream to ensure planned timeframes were achieved.
March 2024: Public Preview Release
In March, we launched Beta 1 into public preview on the Instaclustr Managed Platform. The initial release did not contain any opt-in features like Trie indexed SSTables.
However, this gave us a consistent base to test in our development, test, and production environments, and proved our release pipeline for Apache Cassandra 5 was working as intended. This also gave customers the opportunity to start using Apache Cassandra 5 with their own use cases and environments for experimentation.
See our public preview launch blog for further details.
There was not much time to celebrate as we continued working on infrastructure and refining our configuration defaults.
April 2024: Configuration Tuning and Deeper Testing
The first configuration updates were completed for Beta 1, and we started performing deeper functional and performance testing. We identified a few issues from this effort and remediated them. The resulting default configuration was applied to all Beta 1 clusters moving forward.
This allowed users to start testing Trie Indexed SSTables and Trie memtables in their environment by default.
"memtable": { "configurations": { "skiplist": { "class_name": "SkipListMemtable" }, "sharded": { "class_name": "ShardedSkipListMemtable" }, "trie": { "class_name": "TrieMemtable" }, "default": { "inherits": "trie" } } }, "sstable": { "selected_format": "bti" }, "storage_compatibility_mode": "NONE",
The excerpt above illustrates an Apache Cassandra YAML configuration where BTI formatted sstables are used by default (which allows Trie Indexed SSTables) and memtables default to the Trie implementation. You can override this per table:
CREATE TABLE test WITH memtable = {'class' : 'ShardedSkipListMemtable'};
Note that you need to set storage_compatibility_mode to NONE to use BTI formatted sstables. See Cassandra documentation for more information.
You can also reference the cassandra_latest.yaml file for the latest settings (please note you should not apply these to existing clusters without rigorous testing).
May 2024: Major Infrastructure Milestone
We hit a very large infrastructure milestone when we released an upgrade to some of our core agents that were reliant on an older version of the Apache Cassandra Java driver. The upgrade to version 4.17 allowed us to start supporting vectors in certain keyspace level monitoring operations.
At the time, this was considered the riskiest part of the entire project, as we had thousands of nodes to upgrade across many different customer environments. This upgrade took a few weeks, finishing in June. We broke the release up into 4 separate rollouts to reduce the risk of introducing issues into our fleet, focusing on a single key component of our architecture in each release. Each release had quality gates and tested rollback plans, which in the end were not needed.
June 2024: Successful Rollout New Cassandra Driver
The Java driver upgrade project was rolled out to all nodes in our fleet with no issues encountered. At this point, we had hit all the major milestones before Release Candidates became available, and we started looking at which testing systems to update to Apache Cassandra 5 by default.
July 2024: Path to Release Candidate
We upgraded our internal testing systems to use Cassandra 5 by default, meaning our nightly platform tests began running against Cassandra 5 clusters and our production releases would smoke test against Apache Cassandra 5. We also started testing the upgrade path for clusters from 4.x to 5.0, which resulted in another small contribution to the Cassandra project.
The Apache Cassandra project released Apache Cassandra 5 Release Candidate 1 (RC1), and we launched RC1 into public preview on the Instaclustr Platform.
The Road Ahead to General Availability
We’ve just launched Apache Cassandra 5 Release Candidate 1 (RC1) into public preview, and there’s still more to do before we reach General Availability for Cassandra 5, including:
- Upgrading our own preproduction Apache Cassandra for internal use to Apache Cassandra 5 Release Candidate 1. This means we’ll be testing using our real-world use cases and testing our upgrade procedures on live infrastructure.
At Launch:
When Apache Cassandra 5.0 launches, we will perform another round of testing, including performance benchmarking. We will also upgrade our internal metrics storage production Apache Cassandra clusters to 5.0, and, if the results are satisfactory, we will mark the release as generally available for our customers. We want to have full confidence in running 5.0 before we recommend it for production use to our customers.
For more information about our own usage of Cassandra for storing metrics on the Instaclustr Platform check out our series on Monitoring at Scale.
What Have We Learned From This Project?
- Releasing limited, small, and frequent changes has resulted in a smooth project, even if sometimes frequent releases do not feel smooth. Some thoughts:
- Releasing to a small subset of internal users allowed us to take risks and break things more often so we could learn from our failures safely.
- Releasing small changes allowed us to more easily understand and predict the behaviour of our changes: what to look out for in case things went wrong, how to more easily measure success, etc.
- Releasing frequently built confidence within the wider Instaclustr team, which in turn meant we would be happier taking more risks and could release more often.
- Releasing to internal and public preview helped create momentum within the Instaclustr business and teams:
- This turned the Apache Cassandra 5.0 release from something that “was coming soon and very exciting” to “something I can actually use.”
- Communicating frequently, transparently, and efficiently is the foundation of success:
- We used a dedicated Slack channel (very creatively named #cassandra-5-project) to discuss everything.
- It was quick and easy to go back to see why we made certain decisions or revisit them if needed. This had a bonus of allowing a Lead Engineer to write a blog post very quickly about the Cassandra 5 project.
This has been a long-running but very exciting project for the entire team here at Instaclustr. The Apache Cassandra community is on the home stretch for this massive release, and we couldn’t be more excited to start seeing what everyone will build with it.
You can sign up today for a free trial and test Apache Cassandra 5 Release Candidate 1 by creating a cluster on the Instaclustr Managed Platform.
More Readings
- The Top 5 Questions We’re Asked about Apache Cassandra 5.0
- Vector Search in Apache Cassandra 5.0
- How Does Data Modeling Change in Apache Cassandra 5.0?
The post Apache Cassandra® 5.0: Behind the Scenes appeared first on Instaclustr.
Building a 100% ScyllaDB Shard-Aware Application Using Rust
I wrote a web transcript of the talk I gave with my colleagues Joseph and Yassir at [Scylla Su...
Learning Rust the hard way for a production Kafka+ScyllaDB pipeline
This is the web version of the talk I gave at [Scylla Summit 2022](https://www.scyllad...
On Scylla Manager Suspend & Resume feature
Disclaimer: This blog post is neither a rant nor intended to undermine the great work that...
Renaming and reshaping Scylla tables using scylla-migrator
We have recently faced a problem where some of the first Scylla tables we created on our main production cluster were not in line any more with the evolved s...
Python scylla-driver: how we unleashed the Scylla monster's performance
At Scylla summit 2019 I had the chance to meet Israel Fruchter and we dreamed of working on adding **shard...
Scylla Summit 2019
I've had the pleasure to attend again and present at the Scylla Summit in San Francisco and the honor to be awarded the...
Scylla: four ways to optimize your disk space consumption
We recently had to face free disk space outages on some of our scylla clusters and we learnt some very interesting things while outlining some improvements t...