How to Reduce DynamoDB Costs: Expert Tips from Alex DeBrie
DynamoDB consultant Alex DeBrie shares where teams tend to get into trouble

DynamoDB pricing can be a blessing and a curse. When you're just starting out, costs are usually quite reasonable, and on-demand pricing can seem like the perfect way to minimize upfront costs. But then perhaps you face "catastrophic success," with an exponential increase of users flooding your database… and your monthly bill far exceeds your budget. The more predictable provisioned capacity model might seem safer. But if you overprovision, you're burning money – and if you underprovision, your application might be throttled during a critical peak period.

It's complicated. Add in the often-overlooked costs of secondary indexes, ACID transactions, and global tables – plus the nuances of dealing with DAX – and you could find that your cost estimates were worlds away from reality.

Rather than learn these lessons the hard (and costly) way, why not take a shortcut: tap the expert known for helping teams reduce their DynamoDB costs. Enter Alex DeBrie, the guy who literally wrote the book on DynamoDB. Alex shared his experiences at the recent Monster SCALE Summit. This article recaps the key points from his talk (you can watch his complete talk here).

Note: If you need further cost reduction beyond these strategies, consider ScyllaDB. ScyllaDB is an API-compatible DynamoDB alternative that provides better latency at 50% of the cost (or less), thanks to extreme engineering efficiency. Learn more about ScyllaDB as a DynamoDB alternative

DynamoDB Pricing: The Basics

Alex began the talk with an overview of how DynamoDB's pricing structure works. Unlike other cloud databases where you provision resources like CPU and RAM, DynamoDB charges directly for operations. You pay for:

- Read Capacity Units (RCUs): Each RCU allows reading up to 4KB of data per request
- Write Capacity Units (WCUs): Each WCU allows writing up to 1KB of data per request
- Storage: Priced per gigabyte-month (similar to EBS or S3)

Then there are DynamoDB billing modes, which determine how you get that capacity for reads and writes.

Provisioned Throughput is the traditional billing mode. You specify how many RCUs and WCUs you want available on a per-second basis. Basically, it's a "use it or lose it" model: you're paying for what you requested, whether you take advantage of it or not. If you happen to exceed what you requested, your workload gets throttled.

And speaking of throttling, Alex called out another important difference between DynamoDB and other databases. With other databases, response times gradually worsen as concurrent queries increase. Not so in DynamoDB. Alex explained, "As you increase the number of concurrent queries, you'll still hit some saturation point where you might not have provisioned enough throughput to support the reads or writes you want to perform. But rather than giving you long-tail response times, which aren't ideal, it simply throttles you. It instantly returns a 500 error, telling you, 'Hey, you haven't provisioned enough for this particular second. Come back in another second, and you'll have more reads and writes available.'" As a result, you get predictable response times – to a limit, at least.

On-Demand Mode is more like a serverless, pay-per-request mode. Rather than declaring how much capacity you want in advance, you are charged per request. As you throw reads and writes at your DynamoDB database, AWS charges you fractions of a cent each time. At the end of the month, it totals up those charges and sends you a bill.
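To make the capacity-unit math concrete, here is a minimal sketch in Rust of how request rates and item sizes translate into consumed RCUs/WCUs and a rough monthly on-demand bill. The workload numbers and per-million-request prices are illustrative assumptions, not current AWS list prices.

```rust
// Rough DynamoDB capacity-unit math, as described above.
// Prices and workload numbers below are illustrative placeholders.

fn rcus_for_read(item_size_kb: f64) -> f64 {
    (item_size_kb / 4.0).ceil().max(1.0) // each RCU covers up to 4KB per read
}

fn wcus_for_write(item_size_kb: f64) -> f64 {
    item_size_kb.ceil().max(1.0) // each WCU covers up to 1KB per write
}

fn main() {
    // Assumed workload: 500 reads/s and 100 writes/s of 10KB items, all month.
    let reads_per_sec = 500.0;
    let writes_per_sec = 100.0;
    let item_kb = 10.0;
    let secs_per_month = 30.0 * 24.0 * 3600.0;

    let monthly_rcus = rcus_for_read(item_kb) * reads_per_sec * secs_per_month;
    let monthly_wcus = wcus_for_write(item_kb) * writes_per_sec * secs_per_month;

    // Placeholder on-demand prices per million request units; check the AWS
    // pricing page for the real numbers in your region.
    let read_price_per_million = 0.25;
    let write_price_per_million = 1.25;

    let on_demand_cost = monthly_rcus / 1e6 * read_price_per_million
        + monthly_wcus / 1e6 * write_price_per_million;

    println!("~{:.0} RCUs and ~{:.0} WCUs consumed per month", monthly_rcus, monthly_wcus);
    println!("Estimated on-demand cost: ~${:.2}/month (illustrative prices)", on_demand_cost);
}
```

Note how the 10KB item alone triples the read cost and multiplies the write cost tenfold, which is exactly the kind of multiplier discussed next.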
Beyond the Basics

For an accurate assessment of your DynamoDB costs, you need to go beyond simply plugging your anticipated read and write estimates into a calculator (either the AWS-hosted DynamoDB cost calculator or the more nuanced DynamoDB cost analyzer we've designed). Many other factors – in your DynamoDB configuration as well as your actual application – impact your costs. Critical DynamoDB cost factors that Alex highlighted in his talk include:

- Table storage classes
- WCU and RCU cost multipliers

Let's look at each in turn.

Table Storage Classes

In DynamoDB, "table storage classes" define the underlying storage tier and access patterns for your table's data. There are two options: Standard for hot data and Standard-IA for infrequently accessed, historical, or backup data.

Standard: This is the traditional table storage class. It provides high-performance storage optimized for frequent access, and it's the cheapest class for operations. However, be aware that storage is more expensive (about 25 cents per gigabyte-month in the cheapest regions).

Standard-IA (Infrequent Access): This is a lower-cost, less performant tier designed for infrequent access. If you have a table with a lot of data and you're doing fewer operations on it, you can use this option for cheaper storage (only about 10 cents per gigabyte-month). However, the tradeoffs are that you pay a premium on operations and you cannot reserve capacity.

[Amazon's tips on selecting the table storage class]

WCU and RCU Cost Multipliers

Beyond the core settings, there's also an array of "multipliers" that can dramatically increase your capacity unit consumption. Factors such as item size, secondary indexes, transactions, global table replication, and read consistency can all cause costs to skyrocket if you're not careful. The riskiest cost multipliers that Alex called out include:

Item size: Although the standard RCU is 4KB and the standard WCU is 1KB, you can go beyond that (for a cost). If you're reading a 20KB item, that's going to be 5 RCUs (20KB / 4KB = 5 RCUs). If you're writing a 10KB item, that's going to be 10 WCUs (10KB / 1KB = 10 WCUs).

Secondary indexes: DynamoDB lets you use secondary indexes, but again – it will cost you. In addition to paying for the writes that go to your main table, you also pay for all the writes to your secondary indexes. That can really drive up your WCU consumption.

ACID transactions: You can use ACID transactions to operate on multiple items in a single request in an all-or-nothing way. However, you pay quite a premium for this.

Global Tables: DynamoDB Global Tables replicate data across multiple regions, but you really pay the price due to increased write operations as well as increased storage needs.

Consistent reads: Consistent reads ensure that a read request always returns the most recent write. But you pay higher costs compared to eventually consistent reads, which might return slightly older data.

How to Reduce DynamoDB Costs

Alex's top tip is to "mind your multipliers." Make sure you really understand the cost impacts of different options, and avoid any options that don't justify their steep costs. In particular…

Watch Item Sizes

DynamoDB users tend to bloat their item sizes without really thinking about it.
This consumes a lot of resources (disk/memory/CPU), so review your item sizes carefully:

- Remove unused attributes
- If you have large values, consider storing them in S3 instead
- Shorten attribute names (attribute names count toward the item size you are billed for, so long names mean larger items)
- If you have a small amount of frequently updated data and a larger amount of slow-moving data, consider splitting items into multiple items (vertical partitioning)

Limit Secondary Indexes

Secondary indexes are another common culprit behind unexpected DynamoDB costs. Be vigilant about spotting and removing secondary indexes that you don't really need. Remember, they make you pay twice: in storage and on every write. You can also use projections to limit the number of writes to your secondary indexes and/or limit the size of the items in those indexes. Regularly review secondary indexes to ensure they are being utilized. Remove any index that isn't being read, and evaluate the write:read cost ratio to determine whether the cost is justified.

Use Transactions Sparingly

Limit transactions. Alex put it this way: "AWS came out with DynamoDB transactions six or seven years ago. They're super useful for many things, but I wouldn't use them willy-nilly. They're slower than traditional DynamoDB operations and more expensive. So, I try to limit my transactions to high-value, low-volume, low-frequency applications. That's where I find them worthwhile — if I use them at all. Otherwise, I focus on modeling around them, leaning into DynamoDB's design to avoid needing transactions in the first place."

Be Selective with Global Tables

Global tables are critical if you need data in multiple regions, but make sure they're really worth it. Given that they will multiply your write and storage costs, they should add significant value to justify their existence.

Consider Eventually Consistent Reads

Do you really need strongly consistent reads every time? Alex has found that in most cases, users don't. "You're almost always going to get the latest version of the item, and even if you don't, it shouldn't cause data corruption."

Choose the Right Billing Mode

On-demand DynamoDB costs about 3.5X the price of fully utilized provisioned capacity (and this is quite an improvement over the previous 7X). However, achieving full utilization of provisioned capacity is difficult because overprovisioning is often necessary to handle traffic spikes. Generally, ~28-29% utilization is needed to make provisioned capacity cost effective. For smaller workloads or those with unpredictable traffic, on-demand is often the better choice. Alex advises: "Use on-demand until it hurts. If your DynamoDB bill is under $1,000 a month, don't spend too much time optimizing provisioned capacity. Instead, set it to on-demand and see how it goes. Once costs start to rise, then consider whether it's worth optimizing. If you're using provisioned capacity, aim for at least 28.8% utilization. If you're not hitting that, switch to on-demand. Autoscaling can help with provisioned capacity – as long as your traffic doesn't have rapid spikes. For stable, predictable workloads, reserved capacity (purchased a year in advance) can save you a lot of money." A rough break-even check is sketched below.
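Here is a minimal sketch of that break-even check, using the roughly 3.5X on-demand-to-provisioned ratio cited above; the example utilization figure is an assumption for illustration.

```rust
// Break-even check between on-demand and provisioned capacity, based on the
// rough "on-demand costs ~3.5x fully utilized provisioned" ratio cited above.

fn main() {
    let on_demand_to_provisioned_ratio = 3.5;

    // Below this average utilization, on-demand is likely cheaper.
    let break_even_utilization = 1.0 / on_demand_to_provisioned_ratio; // ~28.6%
    println!(
        "Provisioned capacity pays off above ~{:.1}% average utilization",
        break_even_utilization * 100.0
    );

    // Example: a workload that averages 20% of its provisioned throughput.
    let utilization: f64 = 0.20;
    if utilization < break_even_utilization {
        println!("At {:.0}% utilization, on-demand is likely cheaper", utilization * 100.0);
    } else {
        println!("At {:.0}% utilization, provisioned is likely cheaper", utilization * 100.0);
    }
}
```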
Review Table Storage Classes Monthly

Review your table storage classes every month. When deciding between storage classes, the key metric is whether total operations costs are at least 2.4X total storage costs. If operations costs exceed that threshold, standard storage is preferable; otherwise, Standard-IA (infrequent access) is the better choice. Also, be aware that the optimal setting can change over time. Per Alex, "Standard storage is usually cheaper at first. For example, writing a kilobyte of data costs roughly the same as storing it for five months, so you'll likely start in standard storage. However, over time, as your data grows, storage costs increase, and it may be worth switching to standard IA."

Another tip on this front: use TTL to your advantage. If you don't need to keep data forever, use TTL to automatically expire it. This will help with storage costs.

"DynamoDB pricing should influence how you build your application"

Alex left us with this thought: "DynamoDB pricing should influence how you build your application. You should consider these cost multipliers when designing your data model because you can easily see the connection between resource usage and cost, ensuring you're getting value from it. For example, if you're thinking about adding a secondary index, run the numbers to see if it's better to over-read from your main table instead of paying the write cost for a secondary index. There are many strategies you can use."

Browse our DynamoDB Resources
Learn how ScyllaDB Compares to DynamoDB

CEP-24 Behind the scenes: Developing Apache Cassandra®'s password validator and generator
Introduction: The need for an Apache Cassandra® password validator and generator
Here’s the problem: while users have always had the ability to create whatever password they wanted in Cassandra–from straightforward to incredibly complex and everything in between–this ultimately created a noticeable security vulnerability.
While organizations might have internal processes for generating secure passwords that adhere to their own security policies, Cassandra itself did not have the means to enforce these standards. To make the security vulnerability worse, if a password initially met internal security guidelines, users could later downgrade their password to a less secure option simply by using “ALTER ROLE” statements.
When internal password requirements are enforced, users face the additional burden of creating compliant passwords. This inevitably involves a lot of trial and error in coming up with a password that satisfies complex security rules.
But what if there was a way to have Cassandra automatically create passwords that meet all bespoke security requirements–but without requiring manual effort from users or system operators?
That's why we developed CEP-24: Password validation/generation. We recognized that the complexity of secure password management could be significantly reduced (or eliminated entirely) with the right approach, improving both security and user experience at the same time.
The Goals of CEP-24
A Cassandra Enhancement Proposal (or CEP) is a structured process for proposing, creating, and ultimately implementing new features for the Cassandra project. All CEPs are thoroughly vetted among the Cassandra community before they are officially integrated into the project.
These were the key goals we established for CEP-24:
- Introduce a way to enforce password strength upon role creation or role alteration.
- Provide a reference implementation of a password validator that adheres to a recommended password strength policy, available to Cassandra users out of the box.
- Emit a warning (and proceed) or just reject “create role” and “alter role” statements when the provided password does not meet a certain security level, based on user configuration of Cassandra.
- Provide a modular/pluggable mechanism so that a custom password validator with its own policy, whatever it might be, can be implemented.
- Provide a way for Cassandra to generate a password which would pass the subsequent validation for use by the user.
The Cassandra Password Validator and Generator builds upon an established framework in Cassandra called Guardrails, which was originally implemented under CEP-3 (more details here).
The password validator implements a custom guardrail introduced as part of CEP-24. A custom guardrail can validate and generate values of arbitrary types when properly implemented. In the CEP-24 context, the password guardrail provides CassandraPasswordValidator by extending ValueValidator, while passwords are generated by CassandraPasswordGenerator by extending ValueGenerator. Both components work with passwords as String values.

Password validation and generation are configured in the cassandra.yaml file under the password_validator section. Let's explore the key configuration properties available. First, the class_name and generator_class_name parameters specify which validator and generator classes will be used to validate and generate passwords, respectively. Cassandra ships CassandraPasswordValidator and CassandraPasswordGenerator out of the box. However, if a particular enterprise decides that they need something very custom, they are free to implement their own validator, put it on Cassandra's class path, and reference it in the configuration behind the class_name parameter. The same applies to the generator.
CEP-24 provides implementations of the validator and generator that the Cassandra team believes will satisfy the requirements of most users. These default implementations address common password security needs. However, the framework is designed with flexibility in mind, allowing organizations to implement custom validation and generation rules that align with their specific security policies and business requirements.
password_validator:
    # Implementation class of a validator. When not in form of FQCN, the
    # package name org.apache.cassandra.db.guardrails.validators is prepended.
    # By default, there is no validator.
    class_name: CassandraPasswordValidator
    # Implementation class of related generator which generates values which are valid when
    # tested against this validator. When not in form of FQCN, the
    # package name org.apache.cassandra.db.guardrails.generators is prepended.
    # By default, there is no generator.
    generator_class_name: CassandraPasswordGenerator
Password quality can be thought of as the number of characteristics a password satisfies. There are two levels at which any password is evaluated: warning level and failure level. Warning and failure levels fit nicely into how Guardrails work. Every guardrail has warning and failure thresholds. Depending on the value a specific guardrail evaluates, it will either emit a warning to the user that its usage is discouraged (but ultimately allowed) or it will reject the value altogether.
This same principle applies to password evaluation – each password is assessed against both warning and failure thresholds. These thresholds are determined by counting the characteristics present in the password. The system evaluates five key characteristics: the password’s overall length, the number of uppercase characters, the number of lowercase characters, the number of special characters, and the number of digits. A comprehensive password security policy can be enforced by configuring minimum requirements for each of these characteristics.
# There are four characteristics:
# upper-case, lower-case, special character and digit.
# If this value is set e.g. to 3, a password has to
# consist of 3 out of 4 characteristics.
# For example, it has to contain at least 2 upper-case characters,
# 2 lower-case, and 2 digits to pass,
# but it does not have to contain any special characters.
# If the number of characteristics found in the password is
# less than or equal to this number, it will emit a warning.
characteristic_warn: 3
# If the number of characteristics found in the password is
# less than or equal to this number, it will emit a failure.
characteristic_fail: 2
Next, there are configuration parameters for each characteristic which count towards warning or failure:
# If the password is shorter than this value,
# the validator will emit a warning.
length_warn: 12
# If a password is shorter than this value,
# the validator will emit a failure.
length_fail: 8
# If a password does not contain at least n
# upper-case characters, the validator will emit a warning.
upper_case_warn: 2
# If a password does not contain at least
# n upper-case characters, the validator will emit a failure.
upper_case_fail: 1
# If a password does not contain at least
# n lower-case characters, the validator will emit a warning.
lower_case_warn: 2
# If a password does not contain at least
# n lower-case characters, the validator will emit a failure.
lower_case_fail: 1
# If a password does not contain at least
# n digits, the validator will emit a warning.
digit_warn: 2
# If a password does not contain at least
# n digits, the validator will emit a failure.
digit_fail: 1
# If a password does not contain at least
# n special characters, the validator will emit a warning.
special_warn: 2
# If a password does not contain at least
# n special characters, the validator will emit a failure.
special_fail: 1
It is also possible to forbid illegal sequences of a certain length found in a password:
# If a password contains illegal sequences that are at least this long, it is invalid.
# Illegal sequences might be either alphabetical (form 'abcde'),
# numerical (form '34567'), or US qwerty (form 'asdfg') as well
# as sequences from supported character sets.
# The minimum value for this property is 3,
# by default it is set to 5.
illegal_sequence_length: 5
Lastly, it is also possible to configure a dictionary of passwords to check against, which guards against password dictionary attacks. It is up to the operator of a cluster to configure the password dictionary:
# Dictionary to check the passwords against. Defaults to no dictionary.
# Whole dictionary is cached into memory. Use with caution with relatively big dictionaries.
# Entries in a dictionary, one per line, have to be sorted per String's compareTo contract.
dictionary: /path/to/dictionary/file
Now that we have gone over all the configuration parameters, let’s take a look at an example of how password validation and generation look in practice.
Consider a scenario where a Cassandra super-user (such as the default ‘cassandra’ role) attempts to create a new role named ‘alice’.
cassandra@cqlsh> CREATE ROLE alice WITH PASSWORD = 'cassandraisadatabase' AND LOGIN = true;
InvalidRequest: Error from server: code=2200 [Invalid query] message="Password was not set as it violated configured password strength policy. To fix this error, the following has to be resolved: Password contains the dictionary word 'cassandraisadatabase'. You may also use 'GENERATED PASSWORD' upon role creation or alteration."
The operator's next attempt is not found in the dictionary, but it is not long enough. Seeing the following error, they try to fix it by making the password longer:
cassandra@cqlsh> CREATE ROLE alice WITH PASSWORD = 'T8aum3?' AND LOGIN = true;
InvalidRequest: Error from server: code=2200 [Invalid query] message="Password was not set as it violated configured password strength policy. To fix this error, the following has to be resolved: Password must be 8 or more characters in length. You may also use 'GENERATED PASSWORD' upon role creation or alteration."
With a longer password, the role is finally created, but the password is still not completely secure. It satisfies the minimum requirements, but the validator warns that not all characteristics were met:
cassandra@cqlsh> CREATE ROLE alice WITH PASSWORD = 'mYAtt3mp' AND LOGIN = true;
Warnings:
Guardrail password violated: Password was set, however it might not be strong enough according to the configured password strength policy. To fix this warning, the following has to be resolved: Password must be 12 or more characters in length. Passwords must contain 2 or more digit characters. Password must contain 2 or more special characters. Password matches 2 of 4 character rules, but 4 are required. You may also use 'GENERATED PASSWORD' upon role creation or alteration.
Seeing this, the operator notices the note about the 'GENERATED PASSWORD' clause, which generates a password automatically so that nobody has to invent one by hand. As shown above, inventing a compliant password manually is often a cumbersome process that is better left to the machine, which is also more efficient and reliable.
cassandra@cqlsh> ALTER ROLE alice WITH GENERATED PASSWORD;

 generated_password
 ------------------
 R7tb33?.mcAX
The generated password shown above automatically satisfies all the rules configured in cassandra.yaml; every generated password will. This is a clear advantage over inventing passwords manually.
When the CQL statement is executed, it will be visible in the CQLSH history (HISTORY command or in cqlsh_history file) but the password will not be logged, hence it cannot leak. It will also not appear in any auditing logs. Previously, Cassandra had to obfuscate such statements. This is not necessary anymore.
We can create a role with generated password like this:
cassandra@cqlsh> CREATE ROLE alice WITH GENERATED PASSWORD AND LOGIN = true;

or by CREATE USER:

cassandra@cqlsh> CREATE USER alice WITH GENERATED PASSWORD;
Once a password has been generated for alice (how it is delivered to her is out of scope here), she can log in:
$ cqlsh -u alice -p R7tb33?.mcAX
...
alice@cqlsh>
Note: It is recommended to save the password to ~/.cassandra/credentials, for example:
[PlainTextAuthProvider]
username = cassandra
password = R7tb33?.mcAX
and by setting auth_provider in ~/.cassandra/cqlshrc
[auth_provider]
module = cassandra.auth
classname = PlainTextAuthProvider
It is also possible to configure password validators in such a way that a user does not see why a password failed. This is driven by a password_validator configuration property called detailed_messages. When set to false, the reported violations are very brief:
alice@cqlsh> ALTER ROLE alice WITH PASSWORD = 'myattempt';
InvalidRequest: Error from server: code=2200 [Invalid query] message="Password was not set as it violated configured password strength policy. You may also use 'GENERATED PASSWORD' upon role creation or alteration."
The following command will automatically generate a new password that meets all configured security requirements.
alice@cqlsh> ALTER ROLE alice WITH GENERATED PASSWORD;
Several potential enhancements to password generation and validation could be implemented in future releases. One promising extension would be validating new passwords against previous values. This would prevent users from reusing passwords until after they’ve created a specified number of different passwords. A related enhancement could include restricting how frequently users can change their passwords, preventing rapid cycling through passwords to circumvent history-based restrictions.
These features, while valuable for comprehensive password security, were considered beyond the scope of the initial implementation and may be addressed in future updates.
Final thoughts and next steps
The Cassandra Password Validator and Generator implemented under CEP-24 represents a significant improvement in Cassandra’s security posture.
By providing robust, configurable password policies with built-in enforcement mechanisms and convenient password generation capabilities, organizations can now ensure compliance with their security standards directly at the database level. This not only strengthens overall system security but also improves the user experience by eliminating guesswork around password requirements.
As Cassandra continues to evolve as an enterprise-ready database solution, these security enhancements demonstrate a commitment to meeting the demanding security requirements of modern applications while maintaining the flexibility that makes Cassandra so powerful.
Ready to experience CEP-24 yourself? Try it out on the Instaclustr Managed Platform and spin up your first Cassandra cluster for free.
CEP-24 is just our latest contribution to open source. Check out everything else we’re working on here.
Announcing ScyllaDB 2025.1, Our First Source-Available Release
Tablets are enabled by default + new support for mixed clusters with varying core counts and resources

The ScyllaDB team is pleased to announce the release of ScyllaDB 2025.1.0 LTS, a production-ready ScyllaDB Long Term Support Major Release. ScyllaDB 2025.1 is the first release under our Source-Available License. It combines all the improvements from Enterprise releases (up to 2024.2) and Open Source releases (up to 6.2) into a single source-available code base.

ScyllaDB 2025.1 enables Tablets by default. It also improves performance and scaling speed and allows mixed clusters (nodes that use different instance types). Several new capabilities, updates, and hundreds of bug fixes are also included. This release is the base for the upcoming ScyllaDB X Cloud, the new and improved ScyllaDB Cloud. That release offers fast boot, fast scaling (out and in), and an upper limit of 90% storage utilization (compared to 70% today).

In this blog, we'll highlight the new capabilities our users have frequently asked about. For the complete details, read the release notes.

Read the detailed release notes on our forum
Learn more about ScyllaDB Enterprise
Get ScyllaDB Enterprise 2025.1
Upgrade from ScyllaDB Enterprise 2024.x to 2025.1
Upgrade from ScyllaDB Open Source 6.2 to 2025.1

ScyllaDB Enterprise customers are encouraged to upgrade to ScyllaDB Enterprise 2025 and are welcome to contact our Support Team with questions.

Tablets Overview

In this release, ScyllaDB makes Tablets the default for new Keyspaces. "Tablets" is a new data distribution algorithm that improves upon the legacy vNodes approach from Apache Cassandra. Unlike vNodes, which statically distribute tables across nodes based on the token ring, Tablets dynamically assign tables to a subset of nodes based on size. Future updates will optimize this distribution using CPU and OPS information.

Key benefits of Tablets include:

- Faster scaling and topology changes: new nodes can serve reads and writes as soon as the first Tablet is migrated. Together with Raft-based Strongly Consistent Topology Updates, Tablets enable users to add multiple nodes simultaneously.
- Automatic support for mixed clusters with varying core counts.

You can run some Keyspaces with Tablets enabled and others with Tablets disabled. In this case, scaling improvements will only apply to the Keyspaces with Tablets enabled. vNodes will continue to be supported for existing and new Keyspaces using the `tablets = { 'enabled': false }` option.

Tablet Merge

Tablet Merge is a new feature in 2025.1. Its goal is to reduce the tablet count for a shrinking table, similar to how Tablet Split increases the count while a table is growing. The load balancer's decision to merge was already implemented (it came with the infrastructure introduced for Split), but it was not acted on until now. The topology coordinator now detects shrunk tables and merges adjacent tablets to meet the average tablet replica size goal. #18181
Tablet-Based Keyspace Limitations

Tablets Keyspaces are NOT yet enabled for the following features:

- Materialized Views
- Secondary Indexes
- Change Data Capture (CDC)
- Lightweight Transactions (LWT)
- Counters
- Alternator (Amazon DynamoDB API) using Tablets

You can continue using these features by using a vNode-based Keyspace.

Monitoring Tablets

To monitor Tablets in real time, upgrade ScyllaDB Monitoring Stack to release 4.7 and use the new dynamic Tablet panels.

Tablets Driver Support

The following driver versions and newer support Tablets:

- Java driver 4.x, from 4.18.0.2
- Java driver 3.x, from 3.11.5.4
- Python driver, from 3.28.1
- Gocql driver, from 1.14.5
- Rust driver, from 0.13.0

Legacy ScyllaDB and Apache Cassandra drivers will continue to work with ScyllaDB. However, they will be less efficient when working with tablet-based Keyspaces.

File-based Streaming for Tablets

File-based streaming enhances tablet migration. In previous releases, migration involved streaming mutation fragments, requiring deserialization and reserialization of SSTable files. In this release, we directly stream entire SSTables, eliminating the need to process mutation fragments. This method reduces network data transfer and CPU usage, particularly for small-cell data models. File-based streaming is utilized for tablet migration in all keyspaces with tablets enabled. More in Docs.

Arbiter and Zero-Token Nodes

There is now support for zero-token nodes. These nodes do not replicate data but can assist in query coordination and Raft quorum voting. This allows the creation of an Arbiter: a tiebreaker node that helps maintain quorum in a symmetrical two-datacenter cluster. If one data center fails, the Arbiter (placed in a third data center) keeps the quorum alive without replicating user data or incurring network and storage costs. #15360

You can use nodetool status to find the list of zero-token nodes.
Additional Key Features

The following features were introduced in ScyllaDB Enterprise Feature Release 2024.2 and are now available in Long-Term Support ScyllaDB 2025.1. For a full description of each, see the 2024.2 release notes and ScyllaDB Docs.

- Strongly Consistent Topology Updates. With Raft-managed topology enabled, all topology operations are internally sequenced consistently. Strongly Consistent Topology Updates is now the default for new clusters and should be enabled after upgrade for existing clusters.
- Strongly Consistent Auth Updates. Role-Based Access Control (RBAC) commands like create role or grant permission are safe to run in parallel, without a risk of getting out of sync with themselves and other metadata operations, like schema changes. As a result, there is no need to update system_auth RF or run repair when adding a data center.
- Strongly Consistent Service Levels. Service Levels allow you to define attributes like timeout per workload. Service Levels are now strongly consistent using Raft, like Schema, Topology, and Auth.
- Improved network compression for inter-node RPC. New compression improvements for node-to-node communication: using zstd instead of lz4, and using a shared dictionary re-trained periodically on the traffic instead of message-by-message compression.
- Alternator RBAC. Alternator supports Role-Based Access Control (RBAC) for authorization. Control is done via CQL.
- Native Nodetool. The nodetool utility provides simple command-line operations and attributes. The native nodetool works much faster. Unlike the Java version, the native nodetool is part of the ScyllaDB repo and allows easier and faster updates.
- Removing the JMX Server. With the native nodetool (above), the JMX server has become redundant and will no longer be part of the default ScyllaDB installation or image.
- Maintenance Mode. Maintenance mode is a new mode in which the node does not communicate with clients or other nodes and only listens to the local maintenance socket and the REST API.
- Maintenance Socket. The Maintenance Socket provides a new way to interact with ScyllaDB from within the node it runs on. It is mainly for debugging. As described in the Maintenance Socket docs, you can use cqlsh with the Maintenance Socket.

Read the detailed release notes

Inside ScyllaDB Rust Driver 1.0: A Fully Async Shard-Aware CQL Driver Using Tokio
The engineering challenges and design decisions that led to the 1.0 release of ScyllaDB Rust Driver

ScyllaDB Rust Driver is a client-side, shard-aware driver written in pure Rust with a fully async API using Tokio. The Rust Driver project was born back in 2021 during ScyllaDB's internal developer hackathon. Our initial goal was to provide a native implementation of a CQL driver that's compatible with Apache Cassandra and also contains a variety of ScyllaDB-specific optimizations. Later that year, we released ScyllaDB Rust Driver 0.2.0 on the Rust community's package registry, crates.io. Comparative benchmarks for that early release confirmed that the driver was (more than) satisfactory in terms of performance. So we continued working on it, with the goal of an official release – and also an ambitious plan to unify other ScyllaDB-specific drivers by converting them into bindings for our Rust driver.

Now that we've reached a major milestone for the Rust Driver project (officially releasing ScyllaDB Rust Driver 1.0), it's time to share the challenges and design decisions that led to this 1.0 release.

Learn about our versioning rationale

What's New in ScyllaDB Rust Driver 1.0?

Along with stability, this new release brings powerful new features, better performance, and smarter design choices. Here's a look at what we worked on and why.

Refactored Error Types

Our original error types met ad hoc needs, but weren't ideal for long-term production use. They weren't very type-safe, some of them stringified other errors, and they did not provide sufficient information to diagnose an error's root cause. Some of them were severely abused – most notably ParseError. There was a One-To-Rule-Them-All error type: the ubiquitous QueryError, which many user-facing APIs used to return.

Before

Back in 0.13 of the driver, QueryError looked like this:
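The original post showed the full definition; the condensed sketch below only illustrates its shape (variant names drawn from the description that follows, payload types simplified) and is not the exact 0.13 source.

```rust
use std::sync::Arc;

// Condensed, illustrative sketch of the old 0.13-era QueryError shape:
// not the exact source, some variants and payload types are elided.
#[derive(Debug)]
pub enum QueryError {
    // Error returned by the database itself, plus a stringified message.
    DbError(String),
    // I/O errors buried behind a single variant.
    IoError(Arc<std::io::Error>),
    // The worst offender: all sorts of different errors jammed into one string.
    InvalidMessage(String),
    // Extremely niche errors as inline variants of the same flat enum.
    UnableToAllocStreamId,
    // ...and so on: a painfully flat structure.
}
```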
Note that:

- The structure was painfully flat, with extremely niche errors (such as UnableToAllocStreamId) being just inline variants of this enum.
- Many variants contained just strings. The worst offender was InvalidMessage, which jammed all sorts of different error types into a single string. Many errors were buried inside IoError, too. This stringification broke the clear code path to the underlying errors, affecting readability and causing chaos.
- Due to the omnipresent stringification, matching on error kinds was virtually impossible.
- The error types were public and, at the same time, were not decorated with the #[non_exhaustive] attribute. Because of this, adding any new error variant required breaking the API! That was unacceptable for a driver aspiring to bear the name of an API-stable library.

In version 1.0.0, the new error types are clearer and more helpful. The error hierarchy now reflects the code flow. Error conversions are explicit, so no undesired, confusing conversion takes place. The one-to-fit-them-all error type has been replaced. Instead, APIs return various error types that exhaustively cover the possible errors, without any need to match on error variants that can't occur when executing a given function. QueryError's new counterpart, ExecutionError, looks like this:
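The post likewise showed the new definition in full; the following is an abridged, illustrative sketch of the nested shape (variant and type names are approximations, payloads simplified), not the exact 1.0 source.

```rust
// Abridged, illustrative sketch of the nested error hierarchy: variant names
// are approximate and payload types are simplified placeholders.
#[derive(Debug)]
#[non_exhaustive]
pub enum ExecutionError {
    /// The request itself was malformed (detected before hitting the network).
    BadQuery(BadQueryError),
    /// Something went wrong at the connection-pool layer.
    ConnectionPoolError(ConnectionPoolError),
    /// The request was sent, but the last attempt failed.
    LastAttemptError(RequestAttemptError),
}

#[derive(Debug)]
#[non_exhaustive]
pub enum ConnectionPoolError {
    /// The connection to the target node/shard is broken.
    Broken(BrokenConnectionError),
    // ...further variants, mirroring the driver's abstraction layers.
}

// Placeholder leaf types standing in for the real nested error types.
#[derive(Debug)] pub struct BadQueryError(String);
#[derive(Debug)] pub struct BrokenConnectionError(String);
#[derive(Debug)] pub struct RequestAttemptError(String);
```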
Note that:

- There is much more nesting, reflecting the driver's modules and abstraction layers.
- The stringification is gone!
- Error types are decorated with the #[non_exhaustive] attribute, which requires downstream crates to always include an "else" case (like _ => { … }) when matching on them. This way, we prevent breaking downstream crates' code when adding a new error variant.

Refactored Module Structure

The module structure also stemmed from various ad hoc decisions. Users familiar with older releases of our driver may recall, for example, the ubiquitous transport module. It used to contain a bit of absolutely everything: essentially, it was a flat bag with no real deeper structure.

Back in 0.15.1, the module structure looked like this (omitting the modules that were not later restructured):

transport
    load_balancing
        default.rs
        mod.rs
        plan.rs
    locator
        (submodules)
    caching_session.rs
    cluster.rs
    connection_pool.rs
    connection.rs
    downgrading_consistency_retry_policy.rs
    errors.rs
    execution_profile.rs
    host_filter.rs
    iterator.rs
    metrics.rs
    node.rs
    partitioner.rs
    query_result.rs
    retry_policy.rs
    session_builder.rs
    session_test.rs
    session.rs
    speculative_execution.rs
    topology.rs
history.rs
routing.rs

The new module structure clarifies the driver's separate abstraction layers. Each higher-level module is documented with a description of what abstractions it should hold. We also refined our item export policy. Before, there could be multiple paths to import items from. Now items can be imported from just one path: either their original path (i.e., where they are defined) or their re-export path (i.e., where they are imported and then re-exported from).

In 1.0.0, the module structure is the following (again, omitting the unchanged modules):

client
    caching_session.rs
    execution_profile.rs
    pager.rs
    self_identity.rs
    session_builder.rs
    session.rs
    session_test.rs
cluster
    metadata.rs
    node.rs
    state.rs
    worker.rs
network
    connection.rs
    connection_pool.rs
errors.rs (top-level module)
policies
    address_translator.rs
    host_filter.rs
    load_balancing
        default.rs
        plan.rs
    retry
        default.rs
        downgrading_consistency.rs
        fallthrough.rs
        retry_policy.rs
    speculative_execution.rs
observability
    driver_tracing.rs
    history.rs
    metrics.rs
    tracing.rs
response
    query_result.rs
    request_response.rs
routing
    locator (unchanged contents)
    partitioner.rs
    sharding.rs
Removed Unstable Dependencies From the Public API

With the ScyllaDB Rust Driver 1.0 release, we wanted to fully eliminate unstable (pre-1.0) dependencies from the public API. Instead, we now expose these dependencies through feature flags that explicitly encode the major version number, such as "num-bigint-03". Why did we do this?

- API Stability & Semver Compliance – The 1.0 release promises a stable API, so breaking changes must be avoided in future minor updates. If our public API directly depended on pre-1.0 crates, any breaking changes in those dependencies would force us to introduce breaking changes as well. By removing them from the public API, we shield users from unexpected incompatibilities.
- Greater Flexibility for Users – Developers using the ScyllaDB Rust driver can now opt into specific versions of optional dependencies via feature flags. This allows better integration with their existing projects without being forced to upgrade or downgrade dependencies due to our choices.
- Long-Term Maintainability – By isolating unstable dependencies, we reduce technical debt and make future updates easier. If a dependency introduces breaking changes, we can simply update the corresponding feature flag (e.g., "num-bigint-04") without affecting the core driver API.
- Avoiding Unnecessary Dependencies – Some users may not need certain dependencies at all. Exposing them via opt-in feature flags helps keep the dependency tree lean, improving compilation times and reducing potential security risks.
- Improved Ecosystem Compatibility – By allowing users to choose specific versions of dependencies, we minimize conflicts with other crates in their projects. This is particularly important when working with the broader Rust ecosystem, where dependency version mismatches can lead to build failures or unwanted upgrades.
- Support for Multiple Versions Simultaneously – By namespacing dependencies with feature flags (e.g., "num-bigint-03" and "num-bigint-04"), users can leverage multiple versions of the same dependency within their project. This is particularly useful when integrating with other crates that may require different versions of a shared dependency, reducing version conflicts and easing the upgrade path.

How this impacts users:

- The core ScyllaDB Rust driver remains stable and free from external pre-1.0 dependencies (with one exception: the popular rand crate, which is still in 0.*).
- If you need functionality from an optional dependency, enable it explicitly using the appropriate feature flag (e.g., "num-bigint-03").
- Future updates can introduce new versions of dependencies under separate feature flags – without breaking existing integrations.

This change ensures that the ScyllaDB Rust driver remains stable, flexible, and future-proof, while still providing access to powerful third-party libraries when needed.

Rustls Support for TLS

The driver now supports Rustls, simplifying TLS connections and removing the need for additional system C libraries (openssl). Previously, ScyllaDB Rust Driver only supported OpenSSL-based TLS – like our other drivers did. However, the Rust ecosystem has its own native TLS library: Rustls. Rustls is designed for both performance and security, leveraging Rust's strong memory safety guarantees while often outperforming OpenSSL in real-world benchmarks.

With the 1.0.0 release, we have added Rustls as an alternative TLS backend. This gives users more flexibility in choosing their preferred implementation. Additional system C libraries (openssl) are no longer required to establish secure connections.

Feature-Based Backend Selection

Just as we isolated pre-1.0 dependencies via version-encoded feature flags (see the previous section), we applied the same strategy to TLS backends. Both OpenSSL and Rustls are exposed through opt-in feature flags. This allows users to explicitly select their desired implementation and ensures:

- API Stability – Users can enable TLS support without introducing unnecessary dependencies in their projects.
- Avoiding Unwanted Conflicts – Users can choose the TLS backend that best fits their project without forcing a dependency on OpenSSL or Rustls if they don't need it.
- Future-Proofing – If a breaking change occurs in a TLS library, we can introduce a new feature flag (e.g., "rustls-023", "openssl-010") without modifying the core API.

Abstraction Over TLS Backends

We also introduced an abstraction layer over the TLS backends. Key enums such as TlsProvider, TlsContext, TlsConfig and Tls now contain variants corresponding to each backend. This means that switching between OpenSSL and Rustls (as well as between different versions of the same backend) is a matter of enabling the respective feature flag and selecting the desired variant. If you prefer Rustls, enable the "rustls-023" feature and use the TlsContext::Rustls variant. If you need OpenSSL, enable "openssl-010" and use TlsContext::OpenSSL. If you want both backends or different versions of the same backend (in production or just to explore), you can enable multiple features and it will "just work." If you don't require TLS at all, you can exclude both, reducing dependency overhead.

Our ultimate goal with adding Rustls support and refining TLS backend selection was to ensure that the ScyllaDB Rust Driver is both flexible and well-integrated with the Rust ecosystem. We hope this better accommodates users' different performance and security needs.
The Battle For The Empty Enums

We really wanted to let users build the driver with no TLS backends opted in. In particular, this required us to make our enums work without any variants (i.e., as empty enums). This was a bit tricky. For instance, one cannot match over &x, where x: X is an instance of the enum, if X is empty. Specifically, consider the following definition:
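A minimal sketch of the kind of definition being discussed (the names X, A, and the feature "a" are taken from the compiler output quoted below; the real driver code differs):

```rust
// Illustrative shape only: with no features enabled, this enum has no variants.
enum X {
    #[cfg(feature = "a")]
    A(String),
}
```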
This would not compile:

error[E0004]: non-exhaustive patterns: type `&X` is non-empty
   --> scylla/src/network/tls.rs:230:11
    |
230 |     match x {
    |           ^
    |
note: `X` defined here
   --> scylla/src/network/tls.rs:223:6
    |
223 | enum X {
    |      ^
    = note: the matched value is of type `&X`
    = note: references are always considered inhabited
help: ensure that all possible cases are being handled by adding a match arm with a wildcard pattern as shown
    |
230 ~     match x {
231 +         _ => todo!(),
232 +     }
    |

Note that references are always considered inhabited. Therefore, in order to make the code compile in such a case, we have to match on the value itself, not on a reference:
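A sketch of that intermediate version, matching on the dereferenced value (same illustrative names as above; as the next error shows, this still fails once the "a" feature is enabled):

```rust
// Match on the value itself (dereferenced), not on the reference.
// Compiles when X is empty; fails with E0507 once the "a" variant exists.
fn handle(x: &X) {
    match *x {
        #[cfg(feature = "a")]
        X::A(s) => { /* Handle it */ }
    }
}
```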
But if we now enable the "a" feature, we get another error…

error[E0507]: cannot move out of `x` as enum variant `A` which is behind a shared reference
   --> scylla/src/network/tls.rs:230:11
    |
230 |     match *x {
    |           ^^
231 |         #[cfg(feature = "a")]
232 |         X::A(s) => { /* Handle it */ }
    |              -
    |              |
    |              data moved here
    |              move occurs because `s` has type `String`, which does not implement the `Copy` trait
    |
help: consider removing the dereference here
    |
230 -     match *x {
230 +     match x {
    |

Ugh. rustc literally advises us to revert the change. No luck… then we would end up with the same problem as before. Hmmm… Wait a moment… I vaguely remember Rust had an obscure reserved word used for matching by reference, ref. Let's try it out.
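A sketch of the version that finally compiles, binding the payload by reference with ref (same illustrative names as above):

```rust
// Bind the variant's payload by reference, so nothing is moved out of `*x`.
fn handle(x: &X) {
    match *x {
        #[cfg(feature = "a")]
        X::A(ref s) => {
            // `s` is a &String here; no move out of the shared reference.
            let _ = s;
        }
    }
}
```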
Yay, it compiles!!! This is how we made our (possibly) empty enums work… finally!

Faster and Extended Metrics

Performance matters. So we reworked how the driver handles metrics, eliminating bottlenecks and reducing overhead for those who need real-time insights. Moreover, metrics are now an opt-in feature, so you only pay (in terms of resource consumption) for what you use. And we added even more metrics!

Background

Benchmarks showed that the driver may spend significant time logging query latency. Flamegraphs revealed that collecting metrics can consume up to 11.68% of CPU time! We suspected that the culprit was contention on a mutex guarding the metrics histogram. Even though the issue was discovered in 2021 (!), we postponed dealing with it because the publicly available crates didn't yet include a lock-free histogram (which we hoped would reduce the overhead).

Lock-free histogram

As we approached the 1.0 release deadline, two contributors (Nikodem Gapski and Dawid Pawlik) engaged with the issue. Nikodem explored the new generation of the histogram crate and discovered that someone had added a lock-free histogram: AtomicHistogram. "Great," he thought. "This is exactly what's needed." Then he discovered that AtomicHistogram is flawed: there's a logical race due to insufficient synchronization! To fix the problem, he ported the Go implementation of LockFreeHistogram from Prometheus, which prevents logical races at the cost of execution time (though it still performed much better than a mutex). If you are interested in all the details about what was wrong with AtomicHistogram and how LockFreeHistogram tries to solve it, see the discussion in this PR.

Eventually, the histogram crate's maintainer joined the discussion and convinced us that the skew caused by the logical races in AtomicHistogram is benign. Long story short, a histogram is a bit skewed anyway, and we need to accept it. In the end, we accepted AtomicHistogram for its lower overhead compared to LockFreeHistogram. LockFreeHistogram is still available on its author's dedicated branch. We left ourselves a way to replace one histogram implementation with another if we decide it's needed.

More metrics

The Rust driver is a proud base for the cpp-rust-driver (a rewrite of cpp-driver as a thin bindings layer on top of – as you can probably guess at this point – the Rust driver). Before cpp-driver functionality could be implemented in cpp-rust-driver, it had to be implemented in the Rust driver first. This was the case for some metrics, too. The same two contributors took care of that as well. (Btw, thanks, guys! Some cool sea monster swag will be coming your way.)

Metrics as an opt-in

Not every driver user needs metrics. In fact, it's quite probable that most users don't check them even once. So why force users to pay (in terms of resource consumption) for metrics they're not using? To avoid this, we put the metrics module behind the "metrics" feature (which is disabled by default). Even more performance gain!

For a comprehensive list of changes introduced in the 1.0 release, see our release notes.
Stepping Stones on the Path to the 1.0 Release

We've been working towards this 1.0 release for years, and it involved a lot of incremental improvements that we rolled out in minor releases along the way. Here's a look at the most notable ones.

Ser/De (from versions 0.11 and 0.15)

Previous releases reworked the serialization and deserialization APIs to improve safety and efficiency. In short, the 0.11 release introduced a revamped serialization API that leverages Rust's type system to catch misserialization issues early. And the 0.15 release refined deserialization for better performance and memory efficiency. Here are more details.

Serialization API Refactor (released in 0.11): Leverage Rust's Powerful Type System to Prevent Misserialization — For Safer and More Robust Query Binding

Before 0.11, the driver's serialization API had several pitfalls, particularly around type safety. The old approach relied on loosely structured traits and structs (Value, ValueList, SerializedValues, BatchValues, etc.), which lacked strong compile-time guarantees. This meant that if a user mistakenly bound an incorrect type to a query parameter, they wouldn't receive an immediate, clear error. Instead, they might encounter a confusing serialization error from ScyllaDB — or, in the worst case, could suffer from silent data corruption!

To address these issues, we introduced a redesigned serialization API that replaces the old traits with SerializeValue, SerializeRow, and new versions of BatchValues and SerializedValues. This new approach enforces stronger type safety. Now, type mismatches are caught locally at compile time or runtime (rather than surfacing as obscure database errors after query execution). Key benefits of this refactor include:

- Early Error Detection – Incorrectly typed bind markers now trigger clear, local errors instead of ambiguous database-side failures.
- Stronger Type Safety – The new API ensures that only compatible types can be bound to queries, reducing the risk of subtle bugs.
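As a hedged illustration (derive and method names as in recent driver releases; check the crate docs for your exact version), binding now typically goes through a derived SerializeRow type:

```rust
use scylla::SerializeRow;

// Each field is checked against the prepared statement's column types, so a
// mismatch is reported locally instead of as an obscure database-side error.
#[derive(SerializeRow)]
struct AddSongArgs {
    artist: String,
    title: String,
    year: i32,
}

// Hedged usage sketch (method name per the paging section later in this post):
//   session.execute_unpaged(&prepared_insert, &AddSongArgs { /* ... */ }).await?;
```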
Deserialization API Refactor (released in 0.15): For Better Performance and Memory Efficiency

Prior to release 0.15, the driver's deserialization process was burdened with multiple inefficiencies, slowing down applications and increasing memory usage. The first major issue was type erasure — all values were initially converted into the CQL-type-agnostic CqlValue before being transformed into the user's desired type. This unnecessary indirection introduced additional allocations and copying, making the entire process slower than it needed to be.

But the inefficiencies didn't stop there. Another major flaw was the eager allocation of columns and rows. Instead of deserializing data on demand, every column in a row was eagerly allocated at once — whether it was needed or not. Even worse, each page of query results was fully materialized into a Vec<Row>. As a result, all rows in a page were allocated at the same time — all of them in the form of the ephemeral CqlValue. This usually required further conversion to the user's desired type and incurred allocations. For queries returning large datasets, this led to excessive memory usage and unnecessary CPU overhead.

To fix these issues, we introduced a completely redesigned deserialization API. The new approach ensures that:

- CQL values are deserialized lazily, directly into user-defined types, skipping CqlValue entirely and eliminating redundant allocations.
- Columns are no longer eagerly deserialized and allocated. Memory is used only for the fields that are actually accessed.
- Rows are streamed instead of eagerly materialized. This avoids unnecessary bulk allocations and allows more efficient processing of large result sets.
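On the read path, a row type can likewise be derived so that values deserialize lazily into it. The usage lines in the comment are a hedged sketch; exact method names may differ between driver versions:

```rust
use scylla::DeserializeRow;

// Rows deserialize lazily, straight into this struct, skipping the
// intermediate CqlValue representation and its extra allocations.
#[derive(DeserializeRow)]
struct Song {
    artist: String,
    title: String,
    year: i32,
}

// Hedged usage sketch with transparent paging (see the Paging API section):
//   let mut rows = session
//       .query_iter("SELECT artist, title, year FROM ks.songs", &[])
//       .await?
//       .rows_stream::<Song>()?;
//   while let Some(song) = rows.try_next().await? { /* ... */ }
```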
Paging API (released in 0.14)

We heard from our users that the driver's API for executing queries was prone to misuse with regard to query paging. For instance, the Session::query() and Session::execute() methods would silently return only the first page of the result if page size was set on the statement. On the other hand, if page size was not set, those methods would perform unpaged queries, putting high and undesirable load on the cluster. Furthermore, Session::query_paged() and Session::execute_paged() would only fetch a single page! (if page size was set on the statement; otherwise, the query would not be paged…!!!)

To combat this:

- We decided to redesign the paging API in a way that no other driver had done before. We concluded that the API must be crystal clear about paging, and that paging will be controlled by the method used, not by the statement itself.
- We ditched query() and query_paged() (as well as their execute counterparts), replacing them with query_unpaged() and query_single_page(), respectively (similarly for execute*).
- We separated the setting of page size from the paging method itself. Page size is now mandatory on the statement (before, it was optional). The paging method (no paging, manual paging, transparent automated paging) is now selected by using different session methods ({query,execute}_unpaged(), {query,execute}_single_page(), and {query,execute}_iter(), respectively). This separation is likely the most important change we made to help users avoid footguns and pitfalls.
- We introduced strongly typed PagingState and PagingStateResponse abstractions. This made it clearer how to use manual paging (available using {query,execute}_single_page()).
- Ultimately, we provided a cheat sheet in the Docs that describes best practices regarding statement execution.
Looking Ahead

The journey doesn't stop here. We have many ideas for possible future driver improvements:

- Adding a prelude module containing the driver's commonly used functionality.
- More performance optimizations to push the limits of scalability (and benchmarks to track how we're doing).
- Extending CQL execution APIs to combine transparent paging with zero-copy deserialization, and introducing BoundStatement.
- Designing our own test harness to enable cluster sharing and reuse between tests (with hopes of speeding up test suite execution and encouraging people to write more tests).
- Reworking CQL execution APIs for less code duplication and better usability.
- Introducing QueryDisplayer to pretty-print query results in a tabular way, similar to the cqlsh tool.
- (In our dreams) Rewriting cqlsh (based on the Python driver) as cqlsh-rs (a wrapper over the Rust driver).

And of course, we're always eager to hear from the community — your feedback helps shape the future of the driver!

Get Started with ScyllaDB Rust Driver 1.0

If you're working on cool Rust applications that use ScyllaDB and/or you want to contribute to this Rust driver project, here are some starting points:

- GitHub Repository: ScyllaDB Rust Driver – Contributions welcome!
- Crates.io: Scylla Crate
- Documentation: crate docs on docs.rs, the guide to the driver.

And if you have any questions, please contact us on the community forum or ScyllaDB User Slack (see the #rust-driver channel).

ScyllaDB Rust Driver 1.0 is Officially Released
The long-awaited ScyllaDB Rust Driver 1.0 is finally released. This open source project was designed to bring a stable, high-performance, and production-ready CQL driver to the Rust ecosystem. Key changes in the 1.0 release include: Improved Stability: We removed unstable dependencies and put them behind feature flags. This keeps the driver stable, flexible, and future-proof while still allowing access to powerful-yet-unstable third-party libraries when needed. Refactored Error Types: The error types were significantly improved for clarity, type safety, and diagnostic information. This makes debugging easier and prevents API-breaking changes in future updates. Refactored Module Structure: The module structure was reorganized to better reflect abstraction layers and improve clarity. This makes the driver’s architecture more understandable and simplifies importing items. Easier TLS Setup: Rustls support provides a Rust-native alternative to openssl. This simplifies TLS configuration and can prevent system library issues. Faster and Extended Metrics: New metrics were added and metrics were optimized using an atomic histogram that reduces CPU overhead. The entire metrics module is now optional – so users who don’t care about it won’t suffer any performance impacts from it. Read the release notes In this post, we’ll shed light on why we took this unconventionally extensive (years long) path from a popular production-ready 0.x release to a 1.0 release. We’ll also share our versioning/release plans from this point forward. Inside ScyllaDB Rust Driver 1.0: A Fully Async Shard-Aware CQL Driver Using Tokio provides a deep dive into exactly what we changed and why. Read the deep dive into what we changed and why The Path to “1.0” Over the past few years, Rust Driver has proven itself to be high quality, with very few bugs compared to other drivers as well as better performance. It is successfully used by customers in production, and by us internally. By all means, we have considered it fully production-ready for a long time. Given that, why did we keep releasing 0.x versions? Although we were confident in the driver’s quality, we weren’t satisfied with some aspects of its API. Keeping the version at 0.x was our way of saying that breaking changes are expected often. Frequent breaking changes are not really great for our users. Instead of just updating the driver, they have to adjust their code after pretty much every update. However, 0.x version numbers suggest that the driver is not actually production-ready (but in this case, it truly was). So we really wanted to release a 1.0 version. One option was to just call one of the previous versions (e.g. 0.9) version 1.0 and be done with it. But we knew there were still many breaking changes we wanted to make – and if we kept introducing planned changes, we would quickly arrive at a high version number like 7.0. In Rust (and semver in general) 1.0 is called an “API-stable” version. There is no definition of that term, so it can have various interpretations. What’s perfectly clear, however, is that rapidly releasing major versions – thus quickly arriving at a high major version number – does not constitute API stability. It also does nothing to help users easily update. They would still need to change their code after most updates! We also realized that we will never be able to achieve complete stabilization. There are, and will probably always be, things that we want to improve in our API. We don’t want stability to stand in the way of driver refinement. 
Even if we somehow achieve an API that we are fully satisfied with, that we don’t want to change at all, there is another reason for change: the databases that the driver supports (ScyllaDB and Cassandra) are constantly changing, and some of those changes may require modifying the driver API. For example, ScyllaDB recently introduced a new replication mechanism: Tablets. It is possible to add Tablets support to a driver without a breaking change. We did that in our other drivers, which are forks, because we can’t break compatibility there. However, it requires ugly workarounds. With Tablets, calculating a replica list for a request requires knowing which table the request uses. Tablets are per-table data structures, which means that different tables may have different replica sets for the same token (as opposed to the token ring, which is per-keyspace). This affects many APIs in the driver: Metadata, Load Balancing, and Request Routing, to name just a few. In Rust Driver, we could nicely adapt those APIs, and we want to continue doing so when major changes are introduced in ScyllaDB or Cassandra. Given those restrictions, we reached a compromise. We decided to focus on the API-breaking changes we had planned and complete a big portion of them – making the API more future-proof and flexible. This reduces the risk of being forced to make unwanted API-breaking changes in the future. What’s Next for Rust Driver Now that we’ve reached the long-anticipated “1.0” status, what’s next? We will focus on other driver tasks that do not require changing the API. Those will be released as minor updates (1.x versions). Releasing minor versions means that our users can easily update the driver without changing their code, and so they will quickly get the latest improvements. Of course, we won’t stay at 1.0 forever. We don’t know exactly when the 2.0 release will happen, but we want to provide some reasonable stability to make life easier for our users. We’ve settled on 9 months for 1.0 – so 2.0 won’t be released any earlier than 9 months after the 1.0 release date. For future versions (3.0, etc) this time may (almost certainly) be increased since we will have already smoothed out more and more API rough edges. When a new major version (e.g. 2.0) is released, we will keep supporting the previous major version (e.g. 1.x) with bugfixes, but no new functionalities. The duration of such support is not yet decided. This will also make the migration to a new major version a bit easier. Get Started with Rust Driver 1.0 If you’re ready to get started, take a look at: GitHub Repository: ScyllaDB Rust Driver – Contributions welcome! Crates.io: Scylla Crate Documentation: crate docs on docs.rs, the guide to the driver. And if you have any questions, please contact us on the community forum or ScyllaDB User Slack (see the #rust-driver channel).Upcoming ScyllaDB University LIVE and Community Forum Updates
What to expect at the upcoming ScyllaDB University Live training event – and what’s trending on the community forum Following up on all the interest in ScyllaDB – at Monster SCALE Summit and a whirlwind of in-person events around the world – let’s continue the ScyllaDB conversation. Is ScyllaDB a good fit for your use case? How do you navigate some of the decisions you face when getting started? We’re here to help! In this post, I’ll update you about the upcoming ScyllaDB University Live training event and highlight some trending topics from the community forum. ScyllaDB University LIVE Our next ScyllaDB University LIVE training event will be held on Wednesday, April 9, 2025, 8 AM PDT – 10 AM PDT. This is a free live virtual training led by our top engineers and architects. Whether you’re just curious about ScyllaDB or an experienced user looking to master advanced strategies, join us for ScyllaDB University LIVE! Sessions are interactive and NOT available on-demand – be sure to mark your calendar and attend! The event will be interactive, and you will have a chance to run some hands-on labs throughout the event, and learn by actually doing. The team and I are preparing lots of new examples and exercises – so if you’ve joined before, there’s a great excuse to join again. 😉 Register here In the event, there will be two parallel tracks, Essentials and Advanced. Essentials Track The Essentials track (Getting Started with ScyllaDB) is intended for people new to ScyllaDB. I will start with a talk covering a quick overview of NoSQL and where ScyllaDB fits in the NoSQL world. Next, you will run the Quick Wins labs, in which you’ll see how easy it is to start a ScyllaDB cluster, create a keyspace, create a table, and run some basic queries. After the lab, you’ll learn about ScyllaDB’s basic architecture, including a node, cluster, data replication, Replication Factor, how the database partitions data, Consistency Level, multiple data centers, and an example of what happens when we write data to a cluster. We’ll cover data modeling fundamentals for ScyllaDB. Key concepts include the difference in data modeling between NoSQL and Relational databases, Keyspace, Table, Row, CQL, the CQL shell, Partition Key, and Clustering Key. After that, you’ll run another lab, where you’ll put the data modeling theory into practice. Finally (if we have enough time left), we will discuss ScyllaDB’s special shard-aware drivers. The next part of this session is led by Attila Toth. Here, we’ll walk through a real-world application and understand how the different concepts from the previous talk come into play. We’ll also use a lab where you can do the coding and test it yourself. Additionally, you will see a demo application running one million ops/sec with single-digit millisecond latency and learn how to run this demo yourself. Advanced Track In the Advanced Track (Extreme Elasticity and Performance) by Tzach Livyatan and Felipe Mendes, you will take a deep dive into ScyllaDB’s unique features and tooling such as Workload Prioritization as well as advanced data modeling, and tips for using counters and Time To Live (TTL). You’ll learn how ScyllaDB’s new Tablets feature enables extreme elasticity without any downtime and how to have multiple workloads on a single cluster. The two talks in this track will also use multiple labs that you can run yourself during the event. Before the event, please make sure you have a ScyllaDB University account (free). We will use this platform during the event for the hands-on labs. 
Register on ScyllaDB University Trending Topics on the Community Forum The community forum is the place to discuss anything ScyllaDB and NoSQL related, learn from your peers, share how you’re using ScyllaDB, and ask questions about your use case. It’s where you can read the popular weekly “Last week in scylladb.git master” update from Avi Kivity, our co-founder and CTO (for example, here). It’s also the place to learn about new releases and events. Say Hello here Many of the new topics focus on performance issues, troubleshooting, specific use case questions, and general data modeling questions. Many of the recent discussions have been about Tablets and how this feature affects performance and elasticity. Here’s a summary of some of the top topics since my last update. Latency spikes and hot partitions: A user asked about latency spikes, hot partitions, and how to detect them. Key insights shared in this discussion emphasize the importance of understanding compaction settings and implementing strategies to mitigate tombstone accumulation. Upgrade paths and Tablets integration: The introduction of the Tablets feature led to significant discussions regarding its adoption for scaling purposes. A user discussed the process of enabling this feature after an upgrade, and its effects on performance, in posts like this one. General cluster management support: Contributors actively assisted newcomers by clarifying admin procedures, such as addressing schema migrations, compaction, and SSTable behavior. An example of such a discussion deals with the process for gracefully stopping ScyllaDB. Data modeling: A popular topic was data modeling and the data model’s effect on performance for specific use cases. Users exchanged ideas on addressing challenges tied to row-level reads, batching, drivers, and the implications of large (and hot) partitions. One such discussion dealt with data modeling when having subgroups of data with volume disparity. Alternator: The DynamoDB-compatible API was a popular topic. Users asked how views work under the hood with Alternator, as well as other questions related to compatibility with DynamoDB and performance. Hope to see you at the ScyllaDB University Live event! Meanwhile, stay in touch.
Introduction to similarity search: Part 2–Simplifying with Apache Cassandra® 5’s new vector data type
In Part 1 of this series, we explored how you can combine Cassandra 4 and OpenSearch to perform similarity searches with word embeddings. While that approach is powerful, it requires managing two different systems.
But with the release of Cassandra 5, things become much simpler.
Cassandra 5 introduces a native VECTOR data type and built-in Vector Search capabilities, simplifying the architecture by enabling Cassandra 5 to handle storage, indexing, and querying seamlessly within a single system.
Now in Part 2, we’ll dive into how Cassandra 5 streamlines the process of working with word embeddings for similarity search. We’ll walk through how the new vector data type works, how to store and query embeddings, and how the Storage-Attached Indexing (SAI) feature enhances your ability to efficiently search through large datasets.
The power of vector search in Cassandra 5
Vector search is a game-changing feature added in Cassandra 5 that enables you to perform similarity searches directly within the database. This is especially useful for AI applications, where embeddings are used to represent data like text or images as high-dimensional vectors. The goal of vector search is to find the closest matches to these vectors, which is critical for tasks like product recommendations or image recognition.
The key to this functionality lies in embeddings: arrays of floating-point numbers that capture the semantic features of objects, so that similar objects end up with similar vectors. By storing these embeddings as vectors in Cassandra, you can use Vector Search to find connections in your data that may not be obvious through traditional queries.
How vectors work
Vectors are fixed-size sequences of non-null values, much like lists. However, in Cassandra 5, you cannot modify individual elements of a vector — you must replace the entire vector if you need to update it. This makes vectors ideal for storing embeddings, where you need to work with the whole data structure at once.
When working with embeddings, you’ll typically store them as vectors of floating-point numbers to represent the semantic meaning.
Storage-Attached Indexing (SAI): The engine behind vector search
Vector Search in Cassandra 5 is powered by Storage-Attached Indexing, which enables high-performance indexing and querying of vector data. SAI is essential for Vector Search, providing the ability to create column-level indexes on vector data types. This ensures that your vector queries are both fast and scalable, even with large datasets.
SAI isn’t just limited to vectors—it also indexes other types of data, making it a versatile tool for boosting the performance of your queries across the board.
Example: Performing similarity search with Cassandra 5’s vector data type
Now that we’ve introduced the new vector data type and the power of Vector Search in Cassandra 5, let’s dive into a practical example. In this section, we’ll show how to set up a table to store embeddings, insert data, and perform similarity searches directly within Cassandra.
Step 1: Setting up the embeddings table
To get started with this example, you’ll need access to a Cassandra 5 cluster. Cassandra 5 introduces native support for vector data types and Vector Search, available on Instaclustr’s managed platform. Once you have your cluster up and running, the first step is to create a table to store the embeddings. We’ll also create an index on the vector column to optimize similarity searches using SAI.
CREATE KEYSPACE IF NOT EXISTS aisearch WITH REPLICATION = {'class': 'SimpleStrategy', 'replication_factor': 1};

USE aisearch;

CREATE TABLE IF NOT EXISTS embeddings (
    id UUID,
    paragraph_uuid UUID,
    filename TEXT,
    embeddings vector<float, 300>,
    text TEXT,
    last_updated timestamp,
    PRIMARY KEY (id, paragraph_uuid)
);

CREATE INDEX IF NOT EXISTS ann_index ON embeddings(embeddings) USING 'sai';
This setup allows us to store the embeddings as 300-dimensional vectors, along with metadata like file names and text. The SAI index will be used to speed up similarity searches on the embeddings column.
You can also fine-tune the index by specifying the similarity function to be used for vector comparisons. Cassandra 5 supports three types of similarity functions: DOT_PRODUCT, COSINE, and EUCLIDEAN. By default, the similarity function is set to COSINE, but you can specify your preferred method when creating the index:
CREATE INDEX IF NOT EXISTS ann_index ON embeddings(embeddings) USING 'sai' WITH OPTIONS = { 'similarity_function': 'DOT_PRODUCT' };
Each similarity function has its own advantages depending on your use case. DOT_PRODUCT is often used when you need to measure the direction and magnitude of vectors, COSINE is ideal for comparing the angle between vectors, and EUCLIDEAN calculates the straight-line distance between vectors. By selecting the appropriate function, you can optimize your search results to better match the needs of your application.
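To build intuition for how these functions differ, here is a minimal sketch (assuming NumPy is installed; the two 3-dimensional vectors are made up for illustration and stand in for real 300-dimensional embeddings) that computes all three measures by hand:

import numpy as np

# Two toy vectors standing in for embeddings (real ones would be 300-dimensional)
a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 3.0, 4.0])

dot_product = float(np.dot(a, b))                                     # direction and magnitude
cosine = dot_product / float(np.linalg.norm(a) * np.linalg.norm(b))   # angle between the vectors
euclidean = float(np.linalg.norm(a - b))                              # straight-line distance

print(f"DOT_PRODUCT: {dot_product:.4f}")  # larger means more similar
print(f"COSINE: {cosine:.4f}")            # closer to 1 means more similar
print(f"EUCLIDEAN: {euclidean:.4f}")      # smaller means more similar

Whichever function you pick when creating the index should match how your embeddings were produced; note that for normalized (unit-length) vectors, COSINE and DOT_PRODUCT rank results identically.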
Step 2: Inserting embeddings into Cassandra 5
To insert embeddings into Cassandra 5, we can use the same code from the first part of this series to extract text from files, load the FastText model, and generate the embeddings. Once the embeddings are generated, the following function will insert them into Cassandra:
import time from uuid import uuid4, UUID from cassandra.cluster import Cluster from cassandra.query import SimpleStatement from cassandra.policies import DCAwareRoundRobinPolicy from cassandra.auth import PlainTextAuthProvider from google.colab import userdata # Connect to the single-node cluster cluster = Cluster( # Replace with your IP list ["xxx.xxx.xxx.xxx", "xxx.xxx.xxx.xxx ", " xxx.xxx.xxx.xxx "], # Single-node cluster address load_balancing_policy=DCAwareRoundRobinPolicy(local_dc='AWS_VPC_US_EAST_1'), # Update the local data centre if needed port=9042, auth_provider=PlainTextAuthProvider ( username='iccassandra', password='replace_with_your_password' ) ) session = cluster.connect() print('Connected to cluster %s' % cluster.metadata.cluster_name) def insert_embedding_to_cassandra(session, embedding, id=None, paragraph_uuid=None, filename=None, text=None, keyspace_name=None): try: embeddings = list(map(float, embedding)) # Generate UUIDs if not provided if id is None: id = uuid4() if paragraph_uuid is None: paragraph_uuid = uuid4() # Ensure id and paragraph_uuid are UUID objects if isinstance(id, str): id = UUID(id) if isinstance(paragraph_uuid, str): paragraph_uuid = UUID(paragraph_uuid) # Create the query string with placeholders insert_query = f""" INSERT INTO {keyspace_name}.embeddings (id, paragraph_uuid, filename, embeddings, text, last_updated) VALUES (?, ?, ?, ?, ?, toTimestamp(now())) """ # Create a prepared statement with the query prepared = session.prepare(insert_query) # Execute the query session.execute(prepared.bind((id, paragraph_uuid, filename, embeddings, text))) return None # Successful insertion except Exception as e: error_message = f"Failed to execute query:\nError: {str(e)}" return error_message # Return error message on failure def insert_with_retry(session, embedding, id=None, paragraph_uuid=None, filename=None, text=None, keyspace_name=None, max_retries=3, retry_delay_seconds=1): retry_count = 0 while retry_count < max_retries: result = insert_embedding_to_cassandra(session, embedding, id, paragraph_uuid, filename, text, keyspace_name) if result is None: return True # Successful insertion else: retry_count += 1 print(f"Insertion failed on attempt {retry_count} with error: {result}") if retry_count < max_retries: time.sleep(retry_delay_seconds) # Delay before the next retry return False # Failed after max_retries # Replace the file path pointing to the desired file file_path = "/path/to/Cassandra-Best-Practices.pdf" paragraphs_with_embeddings = extract_text_with_page_number_and_embeddings(file_path) from tqdm import tqdm for paragraph in tqdm(paragraphs_with_embeddings, desc="Inserting paragraphs"): if not insert_with_retry( session=session, embedding=paragraph['embedding'], id=paragraph['uuid'], paragraph_uuid=paragraph['paragraph_uuid'], text=paragraph['text'], filename=paragraph['filename'], keyspace_name=keyspace_name, max_retries=3, retry_delay_seconds=1 ): # Display an error message if insertion fails tqdm.write(f"Insertion failed after maximum retries for UUID {paragraph['uuid']}: {paragraph['text'][:50]}...")
This function handles inserting embeddings and metadata into Cassandra, ensuring that UUIDs are correctly generated for each entry.
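Note that the snippet above assumes a keyspace_name variable is already defined. A one-line definition matching the keyspace created in Step 1 would be:

keyspace_name = "aisearch"  # must match the keyspace created in Step 1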
Step 3: Performing similarity searches in Cassandra 5
Once the embeddings are stored, we can perform similarity searches directly within Cassandra using the following function:
import numpy as np
from IPython.display import display, HTML

# ------------------ Embedding Functions ------------------

def text_to_vector(text):
    """Convert a text chunk into a vector using the FastText model."""
    words = text.split()
    vectors = [fasttext_model[word] for word in words if word in fasttext_model.key_to_index]
    return np.mean(vectors, axis=0) if vectors else np.zeros(fasttext_model.vector_size)

def find_similar_texts_cassandra(session, input_text, keyspace_name=None, top_k=5):
    # Convert the input text to an embedding
    input_embedding = text_to_vector(input_text)
    input_embedding_str = ', '.join(map(str, input_embedding.tolist()))

    # ANN query: order the rows by vector similarity to the input embedding,
    # and also return the cosine similarity so we can display it
    query = f"""
    SELECT text, filename, similarity_cosine(embeddings, ?) AS similarity
    FROM {keyspace_name}.embeddings
    ORDER BY embeddings ANN OF [{input_embedding_str}]
    LIMIT {top_k};
    """
    prepared = session.prepare(query)
    bound = prepared.bind((list(map(float, input_embedding)),))
    rows = session.execute(bound)

    # Sort the results by similarity in Python
    similar_texts = sorted(
        [(row.similarity, row.filename, row.text) for row in rows],
        key=lambda x: x[0],
        reverse=True,
    )
    return similar_texts[:top_k]

# The word you want to find similarities for
input_text = "place"

# Call the function to find similar texts in the Cassandra database
similar_texts = find_similar_texts_cassandra(session, input_text, keyspace_name="aisearch", top_k=10)
This function searches for similar embeddings in Cassandra and retrieves the top results based on cosine similarity. Under the hood, Cassandra’s vector search uses Hierarchical Navigable Small Worlds (HNSW). HNSW organizes data points in a multi-layer graph structure, making queries significantly faster by narrowing down the search space efficiently—particularly important when handling large datasets.
Step 4: Displaying the results
To display the results in a readable format, we can loop through the similar texts and present them along with their similarity scores:
# Print the similar texts along with their similarity scores
for similarity, filename, text in similar_texts:
    html_content = f"""
    <div style="margin-bottom: 10px;">
        <p><b>Similarity:</b> {similarity:.4f}</p>
        <p><b>Text:</b> {text}</p>
        <p><b>File:</b> {filename}</p>
    </div>
    <hr/>
    """
    display(HTML(html_content))
This code will display the top similar texts, along with their similarity scores and associated file names.
Cassandra 5 vs. Cassandra 4 + OpenSearch®
Cassandra 4 relies on an integration with OpenSearch to handle word embeddings and similarity searches. This approach works well for applications that are already using or comfortable with OpenSearch, but it does introduce additional complexity with the need to maintain two systems.
Cassandra 5, on the other hand, brings vector support directly into the database. With its native VECTOR data type and similarity search functions, it simplifies your architecture and improves performance, making it an ideal solution for applications that require embedding-based searches at scale.
Feature | Cassandra 4 + OpenSearch | Cassandra 5 (Preview) |
Embedding Storage | OpenSearch | Native VECTOR Data Type |
Similarity Search | KNN Plugin in OpenSearch | COSINE, EUCLIDEAN, DOT_PRODUCT |
Search Method | Exact K-Nearest Neighbor | Approximate Nearest Neighbor (ANN) |
System Complexity | Requires two systems | All-in-one Cassandra solution |
Conclusion: A simpler path to similarity search with Cassandra 5
With Cassandra 5, the complexity of setting up and managing a separate search system for word embeddings is gone. The new vector data type and Vector Search capabilities allow you to perform similarity searches directly within Cassandra, simplifying your architecture and making it easier to build AI-powered applications.
Coming up: more in-depth examples and use cases that demonstrate how to take full advantage of these new features in Cassandra 5 in future blogs!
Ready to experience vector search with Cassandra 5? Spin up your first cluster for free on the Instaclustr Managed Platform and try it out!
Monster Scale Summit Recap: Scaling Systems, Databases, and Engineering Leadership
Monster Scale Summit brought together some of the sharpest minds in distributed systems, data infrastructure, and engineering leadership — all focused on one thing: what it really takes to build and operate systems at scale. From database internals to leadership lessons, here are some highlights from two packed days of tech talks. Watch On-Demand Kelsey Hightower: Engineering at Scale: What Separates Leaders from Laggards We kicked off with a candid conversation with Kelsey Hightower, a name that needs no introduction if you’ve ever dealt with Kubernetes, CoreOS, or even Puppet. Kelsey has been at the center of some of the biggest shifts in infrastructure over the past decade. Hearing his perspective on what separates companies that succeed at scale from those that don’t was the most memorable part of the event for me. Kelsey tackled questions such as: Misconceptions in scaling engineering efforts: What common mistakes do engineers make? Design trade-offs: How do you balance the need to move fast while still designing for future growth? Avoiding over-engineering: How do you build just enough to handle scale without building complexity that slows you down? Developer experience and tooling: How do you give teams the right tools without overwhelming them? Leadership balance: How do technical depth and soft skills factor into great engineering leadership? And of course, I couldn’t resist asking: “Good programmers copy, great programmers paste” — is that still true? Spoiler: his answer was “They use ChatGPT!” Kelsey shared razor-sharp, unfiltered insights throughout the unscripted live session. If you care about engineering leadership in high-scale environments, watch this – now. Dor Laor, ScyllaDB CEO: Pushing the Boundaries of Performance Dor Laor, ScyllaDB CEO and Co-founder, took the virtual stage to share 10 years of lessons learned building ScyllaDB, a database designed for extreme speed and scale. Dor walked us through: The shard-per-core design that sets ScyllaDB apart. How ScyllaDB evolved from an idea (codename: “Sea Star” [C*]) to production systems handling billions of operations per day. What’s next in terms of performance, cost-efficiency, and scalability. Organizations have wasted time and money overprovisioning other databases at scale. Dor presented the next generation of ScyllaDB X Cloud which provides true elasticity and unmatched storage capability, unique to ScyllaDB. If you’re dealing with high-throughput, low-latency database workloads, take some time to absorb all the advances introduced… and how they might help your team. Real-World Scaling Stories from Industry Leaders One of the best parts of Monster Scale was hearing directly from the people building and operating some of the largest systems on the planet. Some of the talks that got the chat buzzing include… Extreme Scale in Action Cloudflare: Serving millions of boot artifacts to a global audience. Agoda: Scaling 50x throughput with ScyllaDB. Discord: Handling trillions of search requests. American Express: Sharing design choices for routing global payments. Canva: Running machine learning workflows with over 100M images/day. Database Internals and Their Impacts Avi Kivity (ScyllaDB CTO): Deep dive into engineering advances enabling massive scale. Felipe Mendes (ScyllaDB Technical Director): Detailed breakdown of how ScyllaDB stacks up against Cassandra 5.0. Responsive: Almog Gavra on replacing RocksDB with ScyllaDB to achieve next-level Kafka stream processing. 
Optimizing Cost and Performance in the Cloud ScyllaDB: Cloud cost reduction, tiered storage, and high availability (HA) strategies. Slack: Managing 300+ mission-critical cron jobs efficiently. Yieldmo: Real savings from moving off DynamoDB to ScyllaDB. Gwen Shapira: Reengineering Postgres for Millions of Tenants If you think relational databases can’t handle scale, Gwen Shapira showed up to challenge that. She detailed how Nile is rethinking Postgres to serve millions of tenants and shared the real operational challenges behind that journey. Her bottom line: “Scaling relational data is frigging hard.” But it’s also possible if you know what you’re doing. ShareChat: Building One of the World’s Largest Feature Stores With over 300M monthly active users, ShareChat has built a feature store that processes over a billion features per second. David and Ivan walked us through how they got there, the role ScyllaDB plays, and what they’re doing now to optimize cost without compromising on scale. Martin Kleppmann + Chris Riccomini: Designing Data-Intensive Apps in 2025 Yes, Martin & Chris confirmed an update to “Designing Data-Intensive Applications” is on the way. But this wasn’t a book promo — it was a frank discussion on real-world data architecture, including what’s broken and what still works when scaling distributed systems. Avi Kivity: ScyllaDB’s Monstrous Engineering Advances Avi took us through ScyllaDB’s latest innovations, from internals to future plans — essential viewing if you’re using ScyllaDB and/or you’re curious about the engineering behind high-performance, distributed databases. More Sessions on Tackling Scale Head-On Resonate, Antithesis, Turso, poolside, Uber: Simple (and not-so-simple) mechanics of scaling. Medium, Alex DeBrie, Guilherme Nogueira+ Nadav Har’El, Patrick Bossman: The reality of DynamoDB costs and why customers switch to ScyllaDB – plus practical migration insights. Kostja Osipov (ScyllaDB): Real lessons in surviving majority failures and consensus mechanics. Dzejla Medjedovic (Social Explorer): Exploring the benefits and tradeoffs between B-trees, B^eps-trees, and LSM-trees. Ethan Donowitz: Database Upgrades with Shadow Clusters at Discord Ethan gave us a compelling presentation on the use of “shadow clusters” at Discord to effectively de-risk the upgrade process in large-scale production systems. This included insights on how to build, mirror, validate, test and monitor — all practical tips you can apply to your own database environments. Rachel Stephens + Adam Jacob: Scaling is the Funnest Game Rachel and Adam gave us their honest take on the human side of scaling, with plenty of fun stories around technical trade-offs and why business context matters as much as engineering decisions. To quote Adam (while recounting some anecdotal coffee shop encounters with Chef users): “There is no funner game than the at-scale technology game.” Personal Takeaways As an event host, I get the chance to review the recordings before the show — but it’s not until the entire show is assembled and streamed online that the true depth and quality of content becomes apparent to me. Also, what a privilege it was to interview Kelsey in person. I’ve used many of the systems and software he has influenced, so having a chat with him was both inspiring and grounding. You couldn’t ask for a better role model in software engineering leadership. Cheers mate! 
Monster Scale Summit wasn’t just about theory — it was about what happens when systems, teams, and businesses hit real limits and what it takes to move past them. From deep engineering to leadership lessons, if you’re working on systems that need to scale and perform predictably, this was a treasure trove of insights. And if you missed it? Check out the replays — because this is the kind of knowledge that will save you months of effort and pain. Watch Tech Talk Replays On-Demand
Behind the scenes, from the perspective of Wayne’s Ray-Ban Smart Glasses
High Performance on a Low Budget: Gwen Shapira’s Tips for Startups
How even a scrappy early-stage startup can deliver outstanding performance “It’s one thing to solve performance challenges when you have plenty of time, money and expertise available. But what do you do in the opposite situation: If you are a small startup with no time or money and still need to deliver outstanding performance?” – Gwen Shapira, co-founder of Nile (PostgreSQL reengineered for multi-tenant apps) That’s the multi-million-dollar question for many early-stage startups. And who better to lead that discussion than Gwen Shapira, who has tackled performance from two vastly different perspectives? After years of focusing on data systems performance at large organizations, she recently pivoted to leading a startup – where she found herself responsible for full-stack performance, from the ground up. In her P99 CONF keynote, “High Performance on a Low Budget,” Gwen explored the topic by sharing performance challenges she and her small team faced at Nile – how they approached them, tradeoffs and lessons learned. A few takeaways that set the conference chat on fire: Benchmarks should pay rent Keep tests stupid simple If you don’t have time to optimize, at least don’t pessimize But the real value is in hearing the experiences behind these and other zingers. You can watch her talk below or keep reading for a guided tour. Enjoy Gwen’s insights? She’ll be delivering another keynote at Monster SCALE Summit—alongside Kelsey Hightower and engineers from Discord, Disney+, Slack, Canva, Atlassian, Uber, ScyllaDB and many other leaders sharing how they’re tackling extreme-scale engineering challenges. Join us – it’s free + virtual. Get Your Conference Pass Do Worry About Performance Before You Ship It Per Gwen, founders get all sorts of advice on how to run the company (whether they want it or not). Regarding performance, it’s common to hear tips like: Don’t worry about performance until users complain. Don’t worry about performance until you have product market fit. Performance is a good problem to have. If people complain about performance, it is a great sign! But she respectfully disagrees. After years of focusing on performance, she’s all too familiar with the aftermath of that approach. Gwen shared, “As you talk to people, you want to figure out the minimal feature set required to bring an impactful product to market. And performance is part of this package. You should discover the target market’s performance expectations when you discover the other expectations.” Things to consider at an early stage: If you’re trying to beat the competition on performance, how much faster do you need to be? Even if performance is not your key differentiator, what are your users’ expectations regarding performance? To what extent are users willing to accept latency or throughput tradeoffs for different capabilities – or accept higher costs to avoid those tradeoffs? Founders are often told that even if you’re not fully satisfied with the product, just ship it and see how people react. Gwen’s reaction: “If you do a startup, there is a 100% chance that you will ship something that you’re not 100% happy with. But the reason you ship early and iterate is because you really want to learn fast. If you identified performance expectations during the discovery phase, try to ship something in that ballpark. 
Otherwise, you’re probably not learning all that much – with respect to performance, at least.” Hyperfocus on the User’s Perceived Latency For startups looking to make the biggest performance impact with limited resources, you can “cheat” by focusing on what will really help you attract and retain customers: optimizing the user’s perceived latency. Web apps are a great place to begin. Even if your core product is an eBPF-based edge database, your users will likely be interacting with a web app from the start of the user journey. Plus, there are lots of nice metrics to track (for example, Google’s Core Web Vitals). Gwen noted, “Startups very rarely have a scale problem. If you do have a scale problem, you can, for example, put people on a wait list while you’re determining how to scale out and add machines. However, even if you have a small number of users, you definitely care about them having a great experience with low latency. And perceived low latency is what really matters.” For example, consider this dashboard: When users logged in, the Nile team wanted to impress them by having this cool dashboard load instantly. However, they found that response times ranged from a snappy 200 milliseconds to a terrible 10+ seconds. To tackle the problem, the team started by parallelizing requests, filling in dashboard elements as data arrived and creating a progressive loading experience. These optimizations helped – and progressive loading turned out to be a fantastic way to hide latency (keeping the user engaged, like mirrors distracting you in a slow elevator). However, the optimizations exposed another issue. The app was making 2,000 individual API calls just to count open tickets. This is the classic N + 1 problem (when you end up running a query for each result instead of running a single optimized query that retrieves all necessary data at once.). Naturally, that inspired some API refinement and tuning. Then, another discovery. Their front-end dev noticed they were fetching more data than needed, so he cached it in the browser. This update sped up dashboard interactions by serving pre-cached data from the browser’s local storage. However, despite all those optimizations, the dashboard remained data-heavy. “Our customers loved it, but there was no reason why it had to be the first page after logging in,” Gwen remarked. So they moved the dashboard a layer down in the navigation. In its place, they added a much simpler landing page with easy access to the most common user tasks. Benchmarks Should Pay Rent Next, topic: The importance of being strategic about benchmarking. “Performance people love benchmarking (and I’m guilty of that),” Gwen admitted. ”But you can spend infinite time benchmarking with very little to show for it. So I want to share some tips on how to spend less time and have more to show for it.” She continued, “Benchmarks should pay rent by answering some important questions that you have. If you don’t have an important question, don’t run a benchmark. There, I just saved you weeks of your life – something invaluable for startups. You can thank me later. “ If your primary competitive advantage is performance, you will be expected to share performance tests to (attempt to) prove how fast and cool you are. Call it “benchmarketing.” For everyone else, two common questions to answer with benchmarking are: Is our database setup optimal? Are we delivering a good user experience? 
To assess the database setup, teams tend to: Turn to a standard benchmark like TPCC Increase load over time Look for bottlenecks Fix what they can But is this really the best way for a startup to spend its limited time and resources? Given her background as a performance expert, Gwen couldn’t resist doing such a benchmark early on at Nile. But she doesn’t recommend it – at least not for startups: “First of all, it takes a lot of time to run the standard benchmarks when you’re not used to doing it week in, week out. It takes time to adjust all the knobs and parameters. It takes time to analyze the results, rinse and repeat. Even with good tools, it’s never easy.” They did identify and fix some low-hanging fruits from this exercise. But since tests were based on a standard benchmark, it was unclear how well it mapped to actual user experiences. Gwen continued, “I didn’t feel the ROI was exactly compelling. If you’re a performance expert and it takes you only about a day, it’s probably worth it. But if you have to get up to speed, if you’re spending significant time on the setup, you’re better off focusing your efforts elsewhere.” A better question to obsess over is “Are we delivering a good experience?” More specifically, focus on these three areas: Optimizing user onboarding paths Addressing performance issues that annoy developers (these likely annoy users too) Paying attention to metrics that customers obsess over – even if they’re not the ones your team has focused on Keep Benchmarking Tests Stupid Simple Another testing lesson learned: Focus on extra stupid sanity tests. At Nile, the team ran the simplest possible queries, like loading an empty page or querying an empty table. If those were slow, there was no point in running more complex tests. Stop, fix the problem, then proceed with more interesting tests. Also, obsess over understanding what the numbers actually measure. You don’t want to base critical performance decisions on misleading results (e.g., empty responses) or unintended behaviors. For example, her team once intended to test the write path but ended up testing the read path thanks to a misconfigured DNS. Build Infrastructure for Long-Term Value, Optimize for Quick Wins The instrumentation and observability tools put in place during testing will pay off for years to come. At Nile, this infrastructure became invaluable throughout the product’s lifetime for answering the persistent question “Why is it slow?” As Gwen put it: “Those early performance test numbers, that instrumentation, all the observability – this is very much a gift that keeps on giving as you continue to build and users inevitably complain about performance.” When prioritizing performance improvements, look for quick wins. For example, Nile found that a slow request was spending 40% of the time on parsing, 40% on lookups, and just 20% on actual work. The developer realized he could reuse an existing caching library to speed up lookups. That was a nice quick win – giving 40% of the time back with minimal effort. However, if he’d said, “I’m not 100% sure about caching, but I have this fast JSON parsing library,” then that would have been a better way to shave off an equivalent 40%. About a year later, they pushed most of the parsing down to a Postgres extension that was written in C and nicely optimized. The optimizations never end! No Time to Optimize? Then At Least Don’t Pessimize Gwen’s final tip involved empowering experienced engineers to make common-sense improvements. 
“Last but not least, sometimes you really don’t have time to optimize. But if you have a team of experienced engineers, they know not to pessimize. They are familiar with faster JSON libraries, async libraries that work behind the scenes, they know not to put slow stuff on the critical path and so on. Even if you lack the time to prove that these things are actually faster, just do them. It’s not premature optimization. It’s just avoiding premature pessimization.”
Introduction to similarity search with word embeddings: Part 1–Apache Cassandra® 4.0 and OpenSearch®
Word embeddings have revolutionized how we approach tasks like natural language processing, search, and recommendation engines.
They allow us to convert words and phrases into numerical representations (vectors) that capture their meaning based on the context in which they appear. Word embeddings are especially useful for tasks where traditional keyword searches fall short, such as finding semantically similar documents or making recommendations based on textual data.
For example, a search for “Laptop” might return results related to “Notebook” or “MacBook” when using embeddings (as opposed to something like “Tablet”), offering a more intuitive and accurate search experience.
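To make that idea concrete, here is a tiny illustrative sketch, assuming NumPy is available. The 3-dimensional vectors are made up for the example; real embeddings, such as the 300-dimensional FastText vectors used later in this post, behave the same way, just with many more dimensions:

import numpy as np

# Made-up toy embeddings; real models produce hundreds of dimensions
toy_embeddings = {
    "laptop":   np.array([0.9, 0.8, 0.1]),
    "notebook": np.array([0.85, 0.75, 0.15]),
    "banana":   np.array([0.1, 0.2, 0.9]),
}

def cosine_similarity(a, b):
    """Cosine similarity between two vectors (closer to 1.0 means more similar)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

query = toy_embeddings["laptop"]
ranked = sorted(
    ((word, cosine_similarity(query, vec)) for word, vec in toy_embeddings.items() if word != "laptop"),
    key=lambda pair: pair[1],
    reverse=True,
)
print(ranked)  # "notebook" ranks far above "banana"

A plain keyword search would never connect “laptop” and “notebook”; the embedding-based ranking does, because the two vectors point in nearly the same direction.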
As applications increasingly rely on AI and machine learning to drive intelligent search and recommendation engines, the ability to efficiently handle word embeddings has become critical. That’s where databases like Apache Cassandra come into play—offering the scalability and performance needed to manage and query large amounts of vector data.
In Part 1 of this series, we’ll explore how you can leverage word embeddings for similarity searches using Cassandra 4 and OpenSearch. By combining Cassandra’s robust data storage capabilities with OpenSearch’s powerful search functions, you can build scalable and efficient systems that handle both metadata and word embeddings.
Cassandra 4 and OpenSearch: A partnership for embeddings
Cassandra 4 doesn’t natively support vector data types or specific similarity search functions, but that doesn’t mean you’re out of luck. By integrating Cassandra with OpenSearch, an open-source search and analytics platform, you can store word embeddings and perform similarity searches using the k-Nearest Neighbors (kNN) plugin.
This hybrid approach is advantageous over relying on OpenSearch alone because it allows you to leverage Cassandra’s strengths as a high-performance, scalable database for data storage while using OpenSearch for its robust indexing and search capabilities.
Instead of duplicating large volumes of data into OpenSearch solely for search purposes, you can keep the original data in Cassandra. OpenSearch, in this setup, acts as an intelligent index: it stores and searches the embeddings, along with pointers (UUIDs) back to the source records kept in Cassandra, without having to manage the entire dataset directly.
This approach not only optimizes resource usage but also enhances system maintainability and scalability by segregating storage and search functionalities into specialized layers.
Deploying the environment
To set up your environment for word embeddings and similarity search, you can leverage the Instaclustr Managed Platform, which simplifies deploying and managing your Cassandra cluster and OpenSearch. Instaclustr takes care of the heavy lifting, allowing you to focus on building your application rather than managing infrastructure. In this configuration, Cassandra serves as your primary data store, while OpenSearch handles vector operations and similarity searches.
Here’s how to get started:
- Deploy a managed Cassandra cluster: Start by provisioning your Cassandra 4 cluster on the Instaclustr platform. This managed solution ensures your cluster is optimized, secure, and ready to store non-vector data.
- Set up OpenSearch with kNN plugin: Instaclustr also offers a fully managed OpenSearch service. You will need to deploy OpenSearch, with the kNN plugin enabled, which is critical for handling word embeddings and executing similarity searches.
By using Instaclustr, you gain access to a robust platform that seamlessly integrates Cassandra and OpenSearch, combining Cassandra’s scalable, fault-tolerant database with OpenSearch’s powerful search capabilities. This managed environment minimizes operational complexity, so you can focus on delivering fast and efficient similarity searches for your application.
Preparing the environment
Now that we’ve outlined the environment setup, let’s dive into the specific technical steps to prepare Cassandra and OpenSearch for storing and searching word embeddings.
Step 1: Setting up Cassandra
In Cassandra, we’ll need to create a table to store the metadata. Here’s how to do that:
- Create the table:
Next, create a table to store the metadata. This table will hold details such as the file name, the related text, and timestamps (the embeddings themselves will live in OpenSearch):
CREATE KEYSPACE IF NOT EXISTS aisearch WITH REPLICATION = {'class': 'SimpleStrategy', 'replication_factor': 3};

USE aisearch;

DROP TABLE IF EXISTS file_metadata;

CREATE TABLE IF NOT EXISTS file_metadata (
    id UUID,
    paragraph_uuid UUID,
    filename TEXT,
    text TEXT,
    last_updated timestamp,
    PRIMARY KEY (id, paragraph_uuid)
);
Step 2: Configuring OpenSearch
In OpenSearch, you’ll need to create an index that supports vector operations for similarity search. Here’s how you can configure it:
- Create the index:
Define the index settings and mappings, ensuring that vector operations are enabled and that the correct space type (e.g., L2) is used for similarity calculations.
{ "settings": { "index": { "number_of_shards": 2, "knn": true, "knn.space_type": "l2" } }, "mappings": { "properties": { "file_uuid": { "type": "keyword" }, "paragraph_uuid": { "type": "keyword" }, "embedding": { "type": "knn_vector", "dimension": 300 } } } }
This index configuration is optimized for storing and searching embeddings using the k-Nearest Neighbors algorithm, which is crucial for similarity search.
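The snippets later in this post assume an OpenSearch client (os_client) and an index name are already available. As a rough sketch of what that setup might look like with the opensearch-py client (the hostname, credentials, and index name are placeholders you would replace with your own cluster details):

from opensearchpy import OpenSearch

index_name = "embeddings"  # referenced by the indexing and search code below

os_client = OpenSearch(
    hosts=[{"host": "your-opensearch-host", "port": 9200}],  # placeholder host
    http_auth=("your-username", "your-password"),            # placeholder credentials
    use_ssl=True,
    verify_certs=True,
)

index_body = {
    "settings": {"index": {"number_of_shards": 2, "knn": True, "knn.space_type": "l2"}},
    "mappings": {
        "properties": {
            "file_uuid": {"type": "keyword"},
            "paragraph_uuid": {"type": "keyword"},
            "embedding": {"type": "knn_vector", "dimension": 300},
        }
    },
}

# Create the index only if it does not exist yet
if not os_client.indices.exists(index=index_name):
    os_client.indices.create(index=index_name, body=index_body)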
With these steps, your environment will be ready to handle word embeddings for similarity search using Cassandra and OpenSearch.
Generating embeddings with FastText
Once you have your environment set up, the next step is to generate the word embeddings that will drive your similarity search. For this, we’ll use FastText, a popular library from Facebook’s AI Research team that provides pre-trained word vectors. Specifically, we’re using the crawl-300d-2M model, which offers 300-dimensional vectors for millions of English words.
Step 1: Download and load the FastText model
To start, you’ll need to download the pre-trained model file. This can be done easily using Python and the requests library. Here’s the process:
1. Download the FastText model: The FastText model is stored in a zip file, which you can download from the official FastText website. The following Python script will handle the download and extraction:
import requests import zipfile import os # Adjust file_url and local_filename variables accordingly file_url = 'https://dl.fbaipublicfiles.com/fasttext/vectors-english/crawl-300d-2M.vec.zip' local_filename = '/content/gdrive/MyDrive/0_notebook_files/model/crawl-300d-2M.vec.zip' extract_dir = '/content/gdrive/MyDrive/0_notebook_files/model/' def download_file(url, filename): with requests.get(url, stream=True) as r: r.raise_for_status() os.makedirs(os.path.dirname(filename), exist_ok=True) with open(filename, 'wb') as f: for chunk in r.iter_content(chunk_size=8192): f.write(chunk) def unzip_file(filename, extract_to): with zipfile.ZipFile(filename, 'r') as zip_ref: zip_ref.extractall(extract_to) # Download and extract download_file(file_url, local_filename) unzip_file(local_filename, extract_dir)
2. Load the model: Once the model is downloaded and extracted, you’ll load it using Gensim’s KeyedVectors class. This allows you to work with the embeddings directly:
from gensim.models import KeyedVectors # Adjust model_path variable accordingly model_path = "/content/gdrive/MyDrive/0_notebook_files/model/crawl-300d-2M.vec" fasttext_model = KeyedVectors.load_word2vec_format(model_path, binary=False)
Step 2: Generate embeddings from text
With the FastText model loaded, the next task is to convert text into vectors. This process involves splitting the text into words, looking up the vector for each word in the FastText model, and then averaging the vectors to get a single embedding for the text.
Here’s a function that handles the conversion:
import numpy as np import re def text_to_vector(text): """Convert text into a vector using the FastText model.""" text = text.lower() words = re.findall(r'\b\w+\b', text) vectors = [fasttext_model[word] for word in words if word in fasttext_model.key_to_index] if not vectors: print(f"No embeddings found for text: {text}") return np.zeros(fasttext_model.vector_size) return np.mean(vectors, axis=0)
This function tokenizes the input text, retrieves the corresponding word vectors from the model, and computes the average to create a final embedding.
Step 3: Extract text and generate embeddings from documents
In real-world applications, your text might come from various types of documents, such as PDFs, Word files, or presentations. The following code shows how to extract text from different file formats and convert that text into embeddings:
import uuid import mimetypes import pandas as pd from pdfminer.high_level import extract_pages from pdfminer.layout import LTTextContainer from docx import Document from pptx import Presentation def generate_deterministic_uuid(name): return uuid.uuid5(uuid.NAMESPACE_DNS, name) def generate_random_uuid(): return uuid.uuid4() def get_file_type(file_path): # Guess the MIME type based on the file extension mime_type, _ = mimetypes.guess_type(file_path) return mime_type def extract_text_from_excel(excel_path): xls = pd.ExcelFile(excel_path) text_list = [] for sheet_index, sheet_name in enumerate(xls.sheet_names): df = xls.parse(sheet_name) for row in df.iterrows(): text_list.append((" ".join(map(str, row[1].values)), sheet_index + 1)) # +1 to make it 1 based index return text_list def extract_text_from_pdf(pdf_path): return [(text_line.get_text().strip().replace('\xa0', ' '), page_num) for page_num, page_layout in enumerate(extract_pages(pdf_path), start=1) for element in page_layout if isinstance(element, LTTextContainer) for text_line in element if text_line.get_text().strip()] def extract_text_from_word(file_path): doc = Document(file_path) return [(para.text, (i == 0) + 1) for i, para in enumerate(doc.paragraphs) if para.text.strip()] def extract_text_from_txt(file_path): with open(file_path, 'r') as file: return [(line.strip(), 1) for line in file.readlines() if line.strip()] def extract_text_from_pptx(pptx_path): prs = Presentation(pptx_path) return [(shape.text.strip(), slide_num) for slide_num, slide in enumerate(prs.slides, start=1) for shape in slide.shapes if hasattr(shape, "text") and shape.text.strip()] def extract_text_with_page_number_and_embeddings(file_path, embedding_function): file_uuid = generate_deterministic_uuid(file_path) file_type = get_file_type(file_path) extractors = { 'text/plain': extract_text_from_txt, 'application/pdf': extract_text_from_pdf, 'application/vnd.openxmlformats-officedocument.wordprocessingml.document': extract_text_from_word, 'application/vnd.openxmlformats-officedocument.presentationml.presentation': extract_text_from_pptx, 'application/zip': lambda path: extract_text_from_pptx(path) if path.endswith('.pptx') else [], 'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet': extract_text_from_excel, 'application/vnd.ms-excel': extract_text_from_excel } text_list = extractors.get(file_type, lambda _: [])(file_path) return [ { "uuid": file_uuid, "paragraph_uuid": generate_random_uuid(), "filename": file_path, "text": text, "page_num": page_num, "embedding": embedding } for text, page_num in text_list if (embedding := embedding_function(text)).any() # Check if the embedding is not all zeros ] # Replace the file path with the one you want to process file_path = "../../docs-manager/Cassandra-Best-Practices.pdf" paragraphs_with_embeddings = extract_text_with_page_number_and_embeddings(file_path)
This code handles extracting text from different document types, generating embeddings for each text chunk, and associating them with unique IDs.
With FastText set up and embeddings generated, you’re now ready to store these vectors in OpenSearch and start performing similarity searches.
Performing similarity searches
To conduct similarity searches, we utilize the k-Nearest Neighbors (kNN) plugin within OpenSearch. This plugin allows us to efficiently search for the most similar embeddings stored in the system. Essentially, you’re querying OpenSearch to find the closest matches to a word or phrase based on your embeddings.
For example, if you’ve embedded product descriptions, using kNN search helps you locate products that are semantically similar to a given input. This capability can significantly enhance your application’s recommendation engine, categorization, or clustering.
This setup with Cassandra and OpenSearch is a powerful combination, but it’s important to remember that it requires managing two systems. As Cassandra evolves, the introduction of built-in vector support in Cassandra 5 simplifies this architecture. But for now, let’s focus on leveraging both systems to get the most out of similarity searches.
Example: Inserting metadata in Cassandra and embeddings in OpenSearch
In this example, we use Cassandra 4 to store metadata related to files and paragraphs, while OpenSearch handles the actual word embeddings. By storing the paragraph and file IDs in both systems, we can link the metadata in Cassandra with the embeddings in OpenSearch.
We first need to store metadata such as the file name, paragraph UUID, and other relevant details in Cassandra. This metadata will be crucial for linking the data between Cassandra, OpenSearch, and the file itself on the filesystem.
The following code demonstrates how to insert this metadata into Cassandra and the embeddings into OpenSearch. Make sure to run the previous script first so that the “paragraphs_with_embeddings” variable is populated:
from tqdm import tqdm


# Function to insert data into both Cassandra and OpenSearch
def insert_paragraph_data(session, os_client, paragraph, keyspace_name, index_name):
    # Insert into Cassandra
    cassandra_result = insert_with_retry(
        session=session,
        id=paragraph['uuid'],
        paragraph_uuid=paragraph['paragraph_uuid'],
        text=paragraph['text'],
        filename=paragraph['filename'],
        keyspace_name=keyspace_name,
        max_retries=3,
        retry_delay_seconds=1
    )
    if not cassandra_result:
        return False  # Stop further processing if Cassandra insertion fails

    # Insert into OpenSearch
    opensearch_result = insert_embedding_to_opensearch(
        os_client=os_client,
        index_name=index_name,
        file_uuid=paragraph['uuid'],
        paragraph_uuid=paragraph['paragraph_uuid'],
        embedding=paragraph['embedding']
    )
    if opensearch_result is not None:
        return False  # Return False if OpenSearch insertion fails

    return True  # Return True on success for both


# Process each paragraph with a progress bar
print("Starting batch insertion of paragraphs.")
for paragraph in tqdm(paragraphs_with_embeddings, desc="Inserting paragraphs"):
    if not insert_paragraph_data(
        session=session,
        os_client=os_client,
        paragraph=paragraph,
        keyspace_name=keyspace_name,
        index_name=index_name
    ):
        print(f"Insertion failed for UUID {paragraph['uuid']}: {paragraph['text'][:50]}...")

print("Batch insertion completed.")
Performing similarity search
Now that we’ve stored both metadata in Cassandra and embeddings in OpenSearch, it’s time to perform a similarity search. This step involves searching OpenSearch for embeddings that closely match a given input and then retrieving the corresponding metadata from Cassandra.
The process is straightforward: we start by converting the input text into an embedding, then use the k-Nearest Neighbors (kNN) plugin in OpenSearch to find the most similar embeddings. Once we have the results, we fetch the related metadata from Cassandra, such as the original text and file name.
Here’s how it works:
- Convert text to embedding: Start by converting your input text into an embedding vector using the FastText model. This vector will serve as the query for our similarity search.
- Search OpenSearch for similar embeddings: Using the KNN search capability in OpenSearch, we find the top k most similar embeddings. Each result includes the corresponding file and paragraph UUIDs, which help us link the results back to Cassandra.
- Fetch metadata from Cassandra: With the UUIDs retrieved from OpenSearch, we query Cassandra to get the metadata, such as the original text and file name, associated with each embedding.
The following code demonstrates this process:
import uuid

from IPython.display import display, HTML


def find_similar_embeddings_opensearch(os_client, index_name, input_embedding, top_k=5):
    """Search for similar embeddings in OpenSearch and return the associated UUIDs."""
    query = {
        "size": top_k,
        "query": {
            "knn": {
                "embedding": {
                    "vector": input_embedding.tolist(),
                    "k": top_k
                }
            }
        }
    }

    response = os_client.search(index=index_name, body=query)

    similar_uuids = []
    for hit in response['hits']['hits']:
        file_uuid = hit['_source']['file_uuid']
        paragraph_uuid = hit['_source']['paragraph_uuid']
        similar_uuids.append((file_uuid, paragraph_uuid))

    return similar_uuids


def fetch_metadata_from_cassandra(session, file_uuid, paragraph_uuid, keyspace_name):
    """Fetch the metadata (text and filename) from Cassandra based on UUIDs."""
    file_uuid = uuid.UUID(file_uuid)
    paragraph_uuid = uuid.UUID(paragraph_uuid)

    query = f"""
    SELECT text, filename FROM {keyspace_name}.file_metadata
    WHERE id = ? AND paragraph_uuid = ?;
    """
    prepared = session.prepare(query)
    bound = prepared.bind((file_uuid, paragraph_uuid))
    rows = session.execute(bound)

    for row in rows:
        return row.filename, row.text
    return None, None


# Input text to find similar embeddings
input_text = "place"

# Convert input text to embedding
input_embedding = text_to_vector(input_text)

# Find similar embeddings in OpenSearch
similar_uuids = find_similar_embeddings_opensearch(os_client, index_name=index_name, input_embedding=input_embedding, top_k=10)

# Fetch and display metadata from Cassandra based on the UUIDs found in OpenSearch
for file_uuid, paragraph_uuid in similar_uuids:
    filename, text = fetch_metadata_from_cassandra(session, file_uuid, paragraph_uuid, keyspace_name)
    if filename and text:
        html_content = f"""
        <div style="margin-bottom: 10px;">
            <p><b>File UUID:</b> {file_uuid}</p>
            <p><b>Paragraph UUID:</b> {paragraph_uuid}</p>
            <p><b>Text:</b> {text}</p>
            <p><b>File:</b> {filename}</p>
        </div>
        <hr/>
        """
        display(HTML(html_content))
This code demonstrates how to find similar embeddings in OpenSearch and retrieve the corresponding metadata from Cassandra. By linking the two systems via the UUIDs, you can build powerful search and recommendation systems that combine metadata storage with advanced embedding-based searches.
Conclusion and next steps: A powerful combination of Cassandra 4 and OpenSearch
By leveraging the strengths of Cassandra 4 and OpenSearch, you can build a system that handles both metadata storage and similarity search. Cassandra efficiently stores your file and paragraph metadata, while OpenSearch takes care of embedding-based searches using the k-Nearest Neighbors algorithm. Together, these two technologies enable powerful, large-scale applications for text search, recommendation engines, and more.
Coming up in Part 2, we’ll explore how Cassandra 5 simplifies this architecture with built-in vector support and native similarity search capabilities.
Ready to try vector search with Cassandra and OpenSearch? Spin up your first cluster for free on the Instaclustr Managed Platform and explore the incredible power of vector search.
How Cassandra Streaming, Performance, Node Density, and Cost Are All Related
This is the first post of several I have planned on optimizing Apache Cassandra for maximum cost efficiency. I’ve spent over a decade working with Cassandra and have spent tens of thousands of hours data modeling, fixing issues, writing tools for it, and analyzing its performance. I’ve always been fascinated by database performance tuning, even before Cassandra.
A decade ago I filed one of my first issues with the project, where I laid out my target goal of 20TB of data per node. This wasn’t possible for most workloads at the time, but I’ve kept this target in my sights.
IBM acquires DataStax: What that means for customers–and why Instaclustr is a smart alternative
IBM’s recent acquisition of DataStax has certainly made waves in the tech industry. With IBM’s expanding influence in data solutions and DataStax’s reputation for advancing Apache Cassandra® technology, this acquisition could signal a shift in the database management landscape.
For businesses currently using DataStax, this news might have sparked questions about what the future holds. How does this acquisition impact your systems, your data, and, most importantly, your goals?
While the acquisition offers the prospect of integrating IBM’s cloud capabilities with high-performance NoSQL solutions, there’s uncertainty too. Transition periods for acquisitions often involve changes in product development priorities, pricing structures, and support strategies.
However, one thing is certain: customers want reliable, scalable, and transparent solutions. If you’re re-evaluating your options amid these changes, here’s why NetApp Instaclustr offers an excellent path forward.
Decoding the IBM-DataStax link-up
DataStax is a provider of enterprise solutions for Apache Cassandra, a powerful NoSQL database trusted for its ability to handle massive amounts of distributed data. IBM’s acquisition reflects its growing commitment to strengthening data management and expanding its footprint in the open source ecosystem.
While the acquisition promises an infusion of IBM’s resources and reach, IBM’s strategy often leans into long-term integration into its own cloud services and platforms. This could potentially reshape DataStax’s roadmap to align with IBM’s broader cloud-first objectives. Customers who don’t rely solely on IBM’s ecosystem—or want flexibility in their database management—might feel caught in a transitional limbo.
This is where Instaclustr comes into the picture as a strong, reliable alternative solution.
Why consider Instaclustr?
Instaclustr is purpose-built to empower businesses with a robust, open source data stack. For businesses relying on Cassandra or DataStax, Instaclustr delivers an alternative that’s stable, high-performing, and highly transparent.
Here’s why Instaclustr could be your best option moving forward:
1. 100% open source commitment
We’re firm believers in the power of open source technology. We offer pure Apache Cassandra, keeping it true to its roots without the proprietary lock-ins or hidden limitations. Unlike proprietary solutions, a commitment to pure open source ensures flexibility, freedom, and no vendor lock-in. You maintain full ownership and control.
2. Platform agnostic
One of the things that sets our solution apart is our platform-agnostic approach. Whether you’re running your workloads on AWS, Google Cloud, Azure, or on-premises environments, we make it seamless for you to deploy, manage, and scale Cassandra. This differentiates us from vendors tied deeply to specific clouds—like IBM.
3. Transparent pricing
Worried about the potential for a pricing overhaul under IBM’s leadership of DataStax? At Instaclustr, we pride ourselves on simplicity and transparency. What you see is what you get—predictable costs without hidden fees or confusing licensing rules. Our customer-first approach ensures that you remain in control of your budget.
4. Expert support and services
With Instaclustr, you’re not just getting access to technology—you’re also gaining access to a team of Cassandra experts who breathe open source. We’ve been managing and optimizing Cassandra clusters across the globe for years, with a proven commitment to providing best-in-class support.
Whether it’s data migration, scaling real-world workloads, or troubleshooting, we have you covered every step of the way. And our reliable SLA-backed managed Cassandra services mean businesses can focus less on infrastructure stress and more on innovation.
5. Seamless migrations
Concerned about the transition process? If you’re currently on DataStax and contemplating a move, our solution provides tools, guidance, and hands-on support to make the migration process smooth and efficient. Our experience in executing seamless migrations ensures minimal disruption to your operations.
Customer-centric focus
At the heart of everything we do is a commitment to your success. We understand that your data management strategy is critical to achieving your business goals, and we work hard to provide adaptable solutions.
Instaclustr comes to the table with over 10 years of experience in managing open source technologies including Cassandra, Apache Kafka®, PostgreSQL®, OpenSearch®, Valkey®, ClickHouse® and more, backed by over 400 million node hours and 18+ petabytes of data under management. Our customers trust and rely on us to manage the data that drives their critical business applications.
With a focus on fostering an open source future, our solutions aren’t tied to any single cloud, ecosystem, or bit of red tape. Simply put: your open source success is our mission.
Final thoughts: Why Instaclustr is the smart choice for this moment
IBM’s acquisition of DataStax might open new doors—but close many others. While the collaboration between IBM and DataStax might appeal to some enterprises, it’s important to weigh alternative solutions that offer reliability, flexibility, and freedom.
With Instaclustr, you get a partner that’s been empowering businesses with open source technologies for years, providing the transparency, support, and performance you need to thrive.
Ready to explore a stable, long-term alternative to DataStax? Check out Instaclustr for Apache Cassandra.
Contact us and learn more about Instaclustr for Apache Cassandra or request a demo of the Instaclustr platform today!
Innovative data compression for time series: An open source solution
Introduction
There’s no escaping the role that monitoring plays in our everyday lives, whether we’re tracking the weather, the number of steps we take in a day, computer systems, or ever-popular IoT devices.
Practically any activity can be monitored in one form or another these days. This generates increasing amounts of data to be pored over and analyzed, but storing all this data adds significant costs over time. Given this huge amount of data that only increases with each passing day, efficient compression techniques are crucial.
Here at NetApp® Instaclustr we saw a great opportunity to improve the current compression techniques for our time series data. That’s why we created the Advanced Time Series Compressor (ATSC) in partnership with the University of Canberra through the OpenSI initiative.
ATSC is a groundbreaking compressor designed to address the challenges of efficiently compressing large volumes of time-series data. Internal test results with production data from our database metrics showed that, on average across the dataset, ATSC compresses ~10x more than LZ4 and ~30x more than the default Prometheus compression. Check out ATSC on GitHub.
There are so many compressors already, so why develop another one?
While other compression methods like LZ4, DoubleDelta, and ZSTD are lossless, most of our timeseries data is already lossy. Timeseries data can be lossy from the beginning due to under-sampling or insufficient data collection, or it can become lossy over time as metrics are rolled over or averaged. Because of this, the idea of a lossy compressor was born.
ATSC is a highly configurable, lossy compressor that leverages the characteristics of time-series data to create function approximations. ATSC finds a fitting function and stores the parametrization of that function—no actual data from the original timeseries is stored. When the data is decompressed, it isn’t identical to the original, but it is still sufficient for the intended use.
Here’s an example: for a temperature change metric—which mostly varies slowly (as do a lot of system metrics!)—instead of storing all the points that have a small change, we fit a curve (or a line) and store that curve/line achieving significant compression ratios.
Image 1: ATSC data for temperature
How does ATSC work?
ATSC looks at the actual time series, in whole or in parts, to find how to better calculate a function that fits the existing data. For that, a quick statistical analysis is done, but if the results are inconclusive a sample is compressed with all the functions and the best function is selected.
By default, ATSC will segment the data—this guarantees better local fitting, more and smaller computations, and less memory usage. It also ensures that decompression targets a specific block instead of the whole file.
In each fitting frame, ATSC will create a function from a pre-defined set and calculate the parametrization of said function.
ATSC currently uses one of the following functions per frame:
- FFT (Fast Fourier Transforms)
- Constant
- Interpolation – Catmull-Rom
- Interpolation – Inverse Distance Weight
Image 2: Polynomial fitting vs. Fast-Fourier Transform fitting
These methods allow ATSC to compress data with a fitting error within 1% (configurable!) of the original time-series.
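To make the per-frame fitting idea concrete, here is a small illustrative sketch in Python. It is not the actual ATSC implementation (which is written in Rust) and it uses plain polynomial fitting rather than ATSC's full function set; it simply shows the principle of splitting a series into frames, fitting a curve to each frame, and storing only the fitted parameters.

import numpy as np

def compress_frames(series, frame_size=64, degree=2):
    """Illustration only: fit a low-degree polynomial per frame and keep
    just the coefficients instead of the raw points."""
    frames = []
    for start in range(0, len(series), frame_size):
        frame = np.asarray(series[start:start + frame_size], dtype=float)
        x = np.arange(len(frame))
        coeffs = np.polyfit(x, frame, deg=degree)  # the stored parametrization
        frames.append((len(frame), coeffs))
    return frames

def decompress_frames(frames):
    """Rebuild an approximation of the original series from the stored parameters."""
    out = []
    for length, coeffs in frames:
        out.extend(np.polyval(coeffs, np.arange(length)))
    return np.array(out)

# 1,000 slow-moving samples collapse to 16 frames of 3 coefficients each.
original = np.sin(np.linspace(0, 4, 1000)) * 20 + 50
approx = decompress_frames(compress_frames(original))
print(np.max(np.abs(original - approx)))  # small fitting error, large space saving

The real compressor picks the best-fitting function per frame and bounds the error, but the storage principle is the same: parameters instead of points.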
For a more detailed insight into ATSC internals and operations check our paper!
Use cases for ATSC and results
ATSC draws inspiration from established compression and signal analysis techniques, achieving compression ratios ranging from 46x to 880x with a fitting error within 1% of the original time-series. In some cases, ATSC can produce highly compressed data without losing any meaningful information, making it a versatile tool for various applications (please see use cases below).
Some results from our internal tests comparing to LZ4 and normal Prometheus compression yielded the following results:
Method | Compressed size (bytes) | Compression Ratio |
Prometheus | 454,778,552 | 1.33 |
LZ4 | 141,347,821 | 4.29 |
ATSC | 14,276,544 | 42.47 |
Another characteristic is the trade-off between compression and decompression speed: compression is about 30x slower than decompression. This matches the expected usage pattern, where time series are compressed once but decompressed several times.
Image 3: A better fitting (purple) vs. a loose fitting (red). Purple takes twice as much space.
ATSC is versatile and can be applied in various scenarios where space reduction is prioritized over absolute precision. Some examples include:
- Rolled-over time series: For metrics data that is rolled over and stored long term, ATSC provides the same or greater space savings, but with minimal information loss rather than discarded precision.
- Under-sampled time series: Systems with very low sampling rates (30 seconds or more) make it difficult to identify actual events. ATSC lets you increase the sample rate without increasing storage, while keeping the information about the events.
- Long, slow-moving data series: Ideal for patterns that are easy to fit, such as weather data.
- Human visualization: Data meant for human analysis, with minimal impact on accuracy, such as historic views into system metrics (CPU, Memory, Disk, etc.)
Image 4: ATSC data (green) with an 88x compression vs. the original data (yellow)
Using ATSC
ATSC is written in Rust and is available on GitHub. You can build and run it yourself by following these instructions.
Future work
Currently, we are planning to evolve ATSC in two ways (check our open issues):
- Adding features to the core compressor, focused on these functionalities:
  - Frame expansion for appending new data to existing frames
  - Dynamic function loading to add more functions without altering the codebase
  - Global and per-frame error storage
  - Improved error encoding
- Integrations with additional technologies (e.g. databases):
  - We are currently looking into integrating ATSC with ClickHouse® and Apache Cassandra®
CREATE TABLE sensors_poly (
    sensor_id UInt16,
    location UInt32,
    timestamp DateTime,
    pressure Float64 CODEC(ATSC('Polynomial', 1)),
    temperature Float64 CODEC(ATSC('Polynomial', 1))
) ENGINE = MergeTree
ORDER BY (sensor_id, location, timestamp);
Image 5: Currently testing ClickHouse integration
Sound interesting? Try it out and let us know what you think.
ATSC represents a significant advancement in time-series data compression, offering high compression ratios with a configurable accuracy loss. Whether for long-term storage or efficient data visualization, ATSC is a powerful open source tool for managing large volumes of time-series data.
But don’t just take our word for it—download and run it!
Check our documentation for any information you need and submit ideas for improvements or issues you find using GitHub issues. We also have easy first issues tagged if you’d like to contribute to the project.
Want to integrate this with another tool? You can build and run our demo integration with ClickHouse.
New cassandra_latest.yaml configuration for a top performant Apache Cassandra®
Welcome to our deep dive into the latest advancements in Apache Cassandra® 5.0, specifically focusing on the cassandra_latest.yaml configuration that is available for new Cassandra 5.0 clusters.
This blog post will walk you through the motivation behind these changes, how to use the new configuration, and the benefits it brings to your Cassandra clusters.
Motivation
The primary motivation for introducing cassandra_latest.yaml is to bridge the gap between maintaining backward compatibility and leveraging the latest features and performance improvements. The yaml addresses the following varying needs for new Cassandra 5.0 clusters:
- Cassandra Developers: who want to push new features but face challenges due to backward compatibility constraints.
- Operators: who prefer stability and minimal disruption during upgrades.
- Evangelists and New Users: who seek the latest features and performance enhancements without worrying about compatibility.
Using cassandra_latest.yaml
Using cassandra_latest.yaml is straightforward. It involves copying the cassandra_latest.yaml content to your cassandra.yaml or pointing the cassandra.config JVM property to the cassandra_latest.yaml file.
This configuration is designed for new Cassandra 5.0 clusters (or those evaluating Cassandra), ensuring they get the most out of the latest features in Cassandra 5.0 and performance improvements.
Key changes and features
Key Cache Size
- Old: Evaluated as a minimum from 5% of the heap or 100MB
- Latest: Explicitly set to 0
Impact: Setting the key cache size to 0 in the latest configuration avoids performance degradation with the new SSTable format. This change is particularly beneficial for clusters using the new SSTable format, which doesn’t require key caching in the same way as the old format. Key caching was used to reduce the time it takes to find a specific key in Cassandra storage.
Commit Log Disk Access Mode
- Old: Set to legacy
- Latest: Set to auto
Impact: The auto setting optimizes the commit log disk access mode based on the available disks, potentially improving write performance. It can automatically choose the best mode (e.g., direct I/O) depending on the hardware and workload, leading to better performance without manual tuning.
Memtable Implementation
- Old: Skiplist-based
- Latest: Trie-based
Impact: The trie-based memtable implementation reduces garbage collection overhead and improves throughput by moving more metadata off-heap. This change can lead to more efficient memory usage and higher write performance, especially under heavy load.
create table … with memtable = {'class': 'TrieMemtable', … }
Memtable Allocation Type
- Old: Heap buffers
- Latest: Off-heap objects
Impact: Using off-heap objects for memtable allocation reduces the pressure on the Java heap, which can improve garbage collection performance and overall system stability. This is particularly beneficial for large datasets and high-throughput environments.
Trickle Fsync
- Old: False
- Latest: True
Impact: Enabling trickle fsync improves performance on SSDs by periodically flushing dirty buffers to disk, which helps avoid sudden large I/O operations that can impact read latencies. This setting is particularly useful for maintaining consistent performance in write-heavy workloads.
SSTable Format
- Old: big
- Latest: bti (trie-indexed structure)
Impact: The new BTI format is designed to improve read and write performance by using a trie-based indexing structure. This can lead to faster data access and more efficient storage management, especially for large datasets.
sstable:
  selected_format: bti
  default_compression: zstd
  compression:
    zstd:
      enabled: true
      chunk_length: 16KiB
      max_compressed_length: 16KiB
Default Compaction Strategy
- Old: STCS (Size-Tiered Compaction Strategy)
- Latest: Unified Compaction Strategy
Impact: The Unified Compaction Strategy (UCS) is more efficient and can handle a wider variety of workloads compared to STCS. UCS can reduce write amplification and improve read performance by better managing the distribution of data across SSTables.
default_compaction:
  class_name: UnifiedCompactionStrategy
  parameters:
    scaling_parameters: T4
    max_sstables_to_compact: 64
    target_sstable_size: 1GiB
    sstable_growth: 0.3333333333333333
    min_sstable_size: 100MiB
Concurrent Compactors
- Old: Defaults to the smaller of the number of disks and cores
- Latest: Explicitly set to 8
Impact: Setting the number of concurrent compactors to 8 ensures that multiple compaction operations can run simultaneously, helping to maintain read performance during heavy write operations. This is particularly beneficial for SSD-backed storage where parallel I/O operations are more efficient.
Default Secondary Index
- Old: legacy_local_table
- Latest: sai
Impact: SAI is a new index implementation that builds on the advancements made with SSTable Attached Secondary Indexes (SASI). It provides a solution that enables users to index multiple columns on the same table without suffering scaling problems, especially at write time.
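As a rough sketch of what this means day to day (hypothetical keyspace and table names, using the Python driver against a Cassandra 5.0 node running with these defaults), a plain CREATE INDEX now gives you an SAI index that you can filter on directly:

from cassandra.cluster import Cluster

# Minimal sketch; assumes a local Cassandra 5.0 node using cassandra_latest.yaml
# (default_secondary_index: sai). Keyspace and table names are hypothetical.
session = Cluster(["127.0.0.1"]).connect()

session.execute("""
    CREATE KEYSPACE IF NOT EXISTS demo
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
""")
session.execute("""
    CREATE TABLE IF NOT EXISTS demo.users (
        id uuid PRIMARY KEY,
        name text,
        country text
    )
""")

# With the latest defaults, a plain CREATE INDEX creates an SAI index;
# you can also request it explicitly with USING 'sai'.
session.execute("CREATE INDEX IF NOT EXISTS users_country_idx ON demo.users (country)")

# The indexed column can then be queried with an ordinary equality predicate.
for row in session.execute("SELECT id, name FROM demo.users WHERE country = 'AU'"):
    print(row.id, row.name)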
Stream Entire SSTables
- Old: implicitly set to True
- Latest: explicitly set to True
Impact: When enabled, it permits Cassandra to zero-copy stream entire eligible SSTables between nodes, including every component. This speeds up network transfer significantly, subject to the throttling specified by entire_sstable_stream_throughput_outbound and, for inter-DC transfers, entire_sstable_inter_dc_stream_throughput_outbound.
UUID SSTable Identifiers
- Old: False
- Latest: True
Impact: Enabling UUID-based SSTable identifiers ensures that each SSTable has a unique name, simplifying backup and restore operations. This change reduces the risk of name collisions and makes it easier to manage SSTables in distributed environments.
Storage Compatibility Mode
- Old: Cassandra 4
- Latest: None
Impact: Setting the storage compatibility mode to none enables all new features by default, allowing users to take full advantage of the latest improvements, such as the new sstable format, in Cassandra. This setting is ideal for new clusters or those that do not need to maintain backward compatibility with older versions.
Testing and validation
The cassandra_latest.yaml configuration has undergone rigorous testing to ensure it works seamlessly. Currently, the Cassandra project CI pipeline tests both the standard (cassandra.yaml) and latest (cassandra_latest.yaml) configurations, ensuring compatibility and performance. This includes unit tests, distributed tests, and DTests.
Future improvements
Future improvements may include enforcing password strength policies and other security enhancements. The community is encouraged to suggest features that could be enabled by default in cassandra_latest.yaml.
Conclusion
The cassandra_latest.yaml configuration for new Cassandra 5.0 clusters is a significant step forward in making Cassandra more performant and feature-rich while maintaining the stability and reliability that users expect. Whether you are a developer, an operator, or an evangelist/end user, cassandra_latest.yaml offers something valuable for everyone.
Try it out
Ready to experience the incredible power of the cassandra_latest.yaml configuration on Apache Cassandra 5.0? Spin up your first cluster with a free trial on the Instaclustr Managed Platform and get started today with Cassandra 5.0!
Cassandra 5 Released! What's New and How to Try it
Apache Cassandra 5.0 has officially landed! This highly anticipated release brings a range of new features and performance improvements to one of the most popular NoSQL databases in the world. Having recently hosted a webinar covering the major features of Cassandra 5.0, I’m excited to give a brief overview of the key updates and show you how to easily get hands-on with the latest release using easy-cass-lab.
You can grab the latest release on the Cassandra download page.
Instaclustr for Apache Cassandra® 5.0 Now Generally Available
NetApp is excited to announce the general availability (GA) of Apache Cassandra® 5.0 on the Instaclustr Platform. This follows the release of the public preview in March.
NetApp was the first managed service provider to release the beta version, and now the Generally Available version, allowing the deployment of Cassandra 5.0 across the major cloud providers: AWS, Azure, and GCP, and on-premises.
Apache Cassandra has been a leader in NoSQL databases since its inception and is known for its high availability, reliability, and scalability. The latest version brings many new features and enhancements, with a special focus on building data-driven applications through artificial intelligence and machine learning capabilities.
Cassandra 5.0 will help you optimize performance, lower costs, and get started on the next generation of distributed computing by:
- Helping you build AI/ML-based applications through Vector Search
- Bringing efficiencies to your applications through new and enhanced indexing and processing capabilities
- Improving flexibility and security
With the GA release, you can use Cassandra 5.0 for your production workloads, which are covered by NetApp’s industry-leading SLAs. NetApp has conducted performance benchmarking and extensive testing while removing the limitations that were present in the preview release to offer a more reliable and stable version. Our GA offering is suitable for all workload types as it contains the most up-to-date range of features, bug fixes, and security patches.
Support for continuous backups and private network add-ons is available. Currently, Debezium is not yet compatible with Cassandra 5.0. NetApp will work with the Debezium community to add support for Debezium on Cassandra 5.0 and it will be available on the Instaclustr Platform as soon as it is supported.
Some of the key new features in Cassandra 5.0 include:
- Storage-Attached Indexes (SAI): A highly scalable, globally distributed index for Cassandra databases. With SAI, column-level indexes can be added, leading to unparalleled I/O throughput for searches across different data types, including vectors. SAI also enables lightning-fast data retrieval through zero-copy streaming of indices, resulting in unprecedented efficiency.
- Vector Search: This is a powerful technique for searching relevant content or discovering connections by comparing similarities in large document collections and is particularly useful for AI applications. It uses storage-attached indexing and dense indexing techniques to enhance data exploration and analysis.
- Unified Compaction Strategy: This strategy unifies compaction approaches, including leveled, tiered, and time-windowed strategies. It leads to a major reduction in SSTable sizes. Smaller SSTables mean better read and write performance, reduced storage requirements, and improved overall efficiency.
- Numerous stability and testing improvements: You can read all about these changes here.
All these new features are available out-of-the-box in Cassandra 5.0 and do not incur additional costs.
Our Development team has worked diligently to bring you a stable release of Cassandra 5.0. Substantial preparatory work was done to ensure you have a seamless experience with Cassandra 5.0 on the Instaclustr Platform. This includes updating the Cassandra YAML and Java environment and enhancing the monitoring capabilities of the platform to support new data types.
We also conducted extensive performance testing and benchmarked version 5.0 with the existing stable Apache Cassandra 4.1.5 version. We will be publishing our benchmarking results shortly; the highlight so far is that Cassandra 5.0 improves responsiveness by reducing latencies by up to 30% during peak load times.
Through our dedicated Apache Cassandra committer, NetApp has contributed to the development of Cassandra 5.0 by enhancing the documentation for new features like Vector Search (Cassandra-19030), enabling Materialized Views (MV) with only partition keys (Cassandra-13857), fixing numerous bugs, and contributing to the improvements for the unified compaction strategy feature, among many other things.
Lifecycle Policy Updates
As previously communicated, the project will no longer maintain Apache Cassandra 3.0 and 3.11 versions (full details of the announcement can be found on the Apache Cassandra website).
To help you transition smoothly, NetApp will provide extended support for these versions for an additional 12 months. During this period, we will backport any critical bug fixes, including security patches, to ensure the continued security and stability of your clusters.
Cassandra 3.0 and 3.11 versions will reach end-of-life on the Instaclustr Managed Platform within the next 12 months. We will work with you to plan and upgrade your clusters during this period.
Additionally, the Cassandra 5.0 beta version and the Cassandra 5.0 RC2 version, which were released as part of the public preview, are now end-of-life. You can check the lifecycle status of different Cassandra application versions here.
You can read more about our lifecycle policies on our website.
Getting Started
Upgrading to Cassandra 5.0 will allow you to stay current and start taking advantage of its benefits. The Instaclustr by NetApp Support team is ready to help customers upgrade clusters to the latest version.
- Wondering if it’s possible to upgrade your workloads from Cassandra 3.x to Cassandra 5.0? Find the answer to this and other similar questions in this detailed blog.
- Click here to read about Storage Attached Indexes in Apache Cassandra 5.0.
- Learn about 4 new Apache Cassandra 5.0 features to be excited about.
- Click here to learn what you need to know about Apache Cassandra 5.0.
Why Choose Apache Cassandra on the Instaclustr Managed Platform?
NetApp strives to deliver the best of supported applications. Whether it’s the latest and newest application versions available on the platform or additional platform enhancements, we ensure a high quality through thorough testing before entering General Availability.
NetApp customers have the advantage of accessing the latest versions—not just the major version releases but also minor version releases—so that they can benefit from any new features and are protected from any vulnerabilities.
Don’t have an Instaclustr account yet? Sign up for a trial or reach out to our Sales team and start exploring Cassandra 5.0.
With more than 375 million node hours of management experience, Instaclustr offers unparalleled expertise. Visit our website to learn more about the Instaclustr Managed Platform for Apache Cassandra.
If you would like to upgrade your Apache Cassandra version or have any issues or questions about provisioning your cluster, please contact Instaclustr Support at any time.
Apache Cassandra® 5.0: Behind the Scenes
Here at NetApp, our Instaclustr product development team has spent nearly a year preparing for the release of Apache Cassandra 5.
Starting with one engineer tinkering at night with the Apache Cassandra 5 Alpha branch, and then up to 5 engineers working on various monitoring, configuration, testing and functionality improvements to integrate the release with the Instaclustr Platform.
It’s been a long journey to the point we are at today, offering Apache Cassandra 5 Release Candidate 1 in public preview on the Instaclustr Platform.
Note: the Instaclustr team has a dedicated open source committer to the Apache Cassandra project. His changes are not included in this document as there were too many for us to include here. Instead, this blog primarily focuses on the engineering effort to release Cassandra 5.0 onto the Instaclustr Managed Platform.
August 2023: The Beginning
We began experimenting with the Apache Cassandra 5 Alpha 1 branches using our build systems. There were several tools we built into our Apache Cassandra images that were not working at this point, but we managed to get a node to start even though it immediately crashed with errors.
One of our early achievements was identifying and fixing a bug that impacted our packaging solution; this resulted in a small contribution to the project allowing Apache Cassandra to be installed on Debian systems with non-OpenJDK Java.
September 2023: First Milestone
The release of the Alpha 1 version allowed us to achieve our first running Cassandra 5 cluster in our development environments (without crashing!).
Basic core functionalities like user creation, data writing, and backups/restores were tested successfully. However, several advanced features, such as repair and replace tooling, monitoring, and alerting were still untested.
At this point we had to pause our Cassandra 5 efforts to focus on other priorities and planned to get back to testing Cassandra 5 after Alpha 2 was released.
November 2023: Further Testing and Internal Preview
The project released Alpha 2. We repeated the same build and test we did on alpha 1. We also tested some more advanced procedures like cluster resizes with no issues.
We also started testing with some of the new 5.0 features: Vector Data types and Storage-Attached Indexes (SAI), which resulted in another small contribution.
We launched Apache Cassandra 5 Alpha 2 for internal preview (basically for internal users). This allowed the wider Instaclustr team to access and use the Alpha on the platform.
During this phase we found a bug in our metrics collector when vectors were encountered; fixing it ended up being a major project for us.
If you see errors like the below, it’s time for a Java Cassandra driver upgrade to 4.16 or newer:
java.lang.IllegalArgumentException: Could not parse type name vector<float, 5>
Nov 15 22:41:04 ip-10-0-39-7 process[1548]: at com.datastax.driver.core.DataTypeCqlNameParser.parse(DataTypeCqlNameParser.java:233)
Nov 15 22:41:04 ip-10-0-39-7 process[1548]: at com.datastax.driver.core.TableMetadata.build(TableMetadata.java:311)
Nov 15 22:41:04 ip-10-0-39-7 process[1548]: at com.datastax.driver.core.SchemaParser.buildTables(SchemaParser.java:302)
Nov 15 22:41:04 ip-10-0-39-7 process[1548]: at com.datastax.driver.core.SchemaParser.refresh(SchemaParser.java:130)
Nov 15 22:41:04 ip-10-0-39-7 process[1548]: at com.datastax.driver.core.ControlConnection.refreshSchema(ControlConnection.java:417)
Nov 15 22:41:04 ip-10-0-39-7 process[1548]: at com.datastax.driver.core.ControlConnection.refreshSchema(ControlConnection.java:356)
<Rest of stacktrace removed for brevity>
December 2023: Focus on new features and planning
As the project released Beta 1, we began focusing on the features in Cassandra 5 that we thought were the most exciting and would provide the most value to customers. There are a lot of awesome new features and changes, so it took a while to find the ones with the largest impact.
The final list of high impact features we came up with was:
- A new data type – Vectors
- Trie memtables/Trie Indexed SSTables (BTI Formatted SStables)
- Storage-Attached Indexes (SAI)
- Unified Compaction Strategy
A major new feature we considered deploying was support for JDK 17. However, due to its experimental nature, we have opted to postpone adoption and plan to support running Apache Cassandra on JDK 17 when it’s out of the experimentation phase.
Once the holiday season arrived, it was time for a break, and we were back in force in February next year.
February 2024: Intensive testing
In February, we released Beta 1 into internal preview so we could start testing it on our Preproduction test environments. As we started to do more intensive testing, we discovered issues in the interaction with our monitoring and provisioning setup.
We quickly fixed the issues identified as showstoppers for launching Cassandra 5. By the end of February, we initiated discussions about a public preview release. We also started to add more resourcing to the Cassandra 5 project. Up until now, only one person was working on it.
Next, we broke down the work we needed to do. This included identifying monitoring agents requiring upgrade and config defaults that needed to change.
From this point, the project split into 3 streams of work:
- Project Planning – Deciding how all this work gets pulled together cleanly, ensuring other work streams have adequate resourcing to hit their goals, and informing product management and the wider business of what’s happening.
- Configuration Tuning – Focusing on the new features of Apache Cassandra to include, how to approach the transition to JDK 17, and how to use BTI formatted SSTables on the platform.
- Infrastructure Upgrades – Identifying what to upgrade internally to handle Cassandra 5, including Vectors and BTI formatted SSTables.
A Senior Engineer was responsible for each workstream to ensure planned timeframes were achieved.
March 2024: Public Preview Release
In March, we launched Beta 1 into public preview on the Instaclustr Managed Platform. The initial release did not contain any opt-in features like Trie-indexed SSTables.
However, this gave us a consistent base to test in our development, test, and production environments, and proved our release pipeline for Apache Cassandra 5 was working as intended. This also gave customers the opportunity to start using Apache Cassandra 5 with their own use cases and environments for experimentation.
See our public preview launch blog for further details.
There was not much time to celebrate as we continued working on infrastructure and refining our configuration defaults.
April 2024: Configuration Tuning and Deeper Testing
The first configuration updates were completed for Beta 1, and we started performing deeper functional and performance testing. We identified a few issues from this effort and remediated. This default configuration was applied for all Beta 1 clusters moving forward.
This allowed users to start testing Trie Indexed SSTables and Trie memtables in their environment by default.
"memtable": {
  "configurations": {
    "skiplist": {
      "class_name": "SkipListMemtable"
    },
    "sharded": {
      "class_name": "ShardedSkipListMemtable"
    },
    "trie": {
      "class_name": "TrieMemtable"
    },
    "default": {
      "inherits": "trie"
    }
  }
},
"sstable": {
  "selected_format": "bti"
},
"storage_compatibility_mode": "NONE",
The above snippet illustrates an Apache Cassandra YAML configuration where BTI-formatted SSTables are used by default (which allows Trie-indexed SSTables) and memtables default to the Trie implementation. You can override this per table:
CREATE TABLE test WITH memtable = {'class': 'ShardedSkipListMemtable'};
Note that you need to set storage_compatibility_mode to NONE to use BTI formatted sstables. See Cassandra documentation for more information.
You can also reference the cassandra_latest.yaml file for the latest settings (please note you should not apply these to existing clusters without rigorous testing).
May 2024: Major Infrastructure Milestone
We hit a very large infrastructure milestone when we released an upgrade to some of our core agents that were reliant on an older version of the Apache Cassandra Java driver. The upgrade to version 4.17 allowed us to start supporting vectors in certain keyspace level monitoring operations.
At the time, this was considered to be the riskiest part of the entire project as we had 1000s of nodes to upgrade across many different customer environments. This upgrade took a few weeks, finishing in June. We broke the release up into 4 separate rollouts to reduce the risk of introducing issues into our fleet, focusing on single key components in our architecture in each release. Each release had quality gates and tested rollback plans, which in the end were not needed.
June 2024: Successful Rollout of the New Cassandra Driver
The Java driver upgrade project was rolled out to all nodes in our fleet and no issues were encountered. At this point we hit all the major milestones before Release Candidates became available. We started to look at the testing systems to update to Apache Cassandra 5 by default.
July 2024: Path to Release Candidate
We upgraded our internal testing systems to use Cassandra 5 by default, meaning our nightly platform tests began running against Cassandra 5 clusters and our production releases will smoke test using Apache Cassandra 5. We started testing the upgrade path for clusters from 4.x to 5.0. This resulted in another small contribution to the Cassandra project.
The Apache Cassandra project released Apache Cassandra 5 Release Candidate 1 (RC1), and we launched RC1 into public preview on the Instaclustr Platform.
The Road Ahead to General Availability
We’ve just launched Apache Cassandra 5 Release Candidate 1 (RC1) into public preview, and there’s still more to do before we reach General Availability for Cassandra 5, including:
- Upgrading our own preproduction Apache Cassandra for internal use to Apache Cassandra 5 Release Candidate 1. This means we’ll be testing using our real-world use cases and testing our upgrade procedures on live infrastructure.
At Launch:
When Apache Cassandra 5.0 launches, we will perform another round of testing, including performance benchmarking. We will also upgrade our internal metrics storage production Apache Cassandra clusters to 5.0, and, if the results are satisfactory, we will mark the release as generally available for our customers. We want to have full confidence in running 5.0 before we recommend it for production use to our customers.
For more information about our own usage of Cassandra for storing metrics on the Instaclustr Platform check out our series on Monitoring at Scale.
What Have We Learned From This Project?
- Releasing limited, small, and frequent changes has resulted in a smooth project, even if sometimes frequent releases do not feel smooth. Some thoughts:
  - Releasing to a small subset of internal users allowed us to take risks and break things more often so we could learn from our failures safely.
  - Releasing small changes allowed us to more easily understand and predict the behaviour of our changes: what to look out for in case things went wrong, how to more easily measure success, etc.
  - Releasing frequently built confidence within the wider Instaclustr team, which in turn meant we would be happier taking more risks and could release more often.
- Releasing to internal and public preview helped create momentum within the Instaclustr business and teams:
  - This turned the Apache Cassandra 5.0 release from something that “was coming soon and very exciting” to “something I can actually use.”
- Communicating frequently, transparently, and efficiently is the foundation of success:
  - We used a dedicated Slack channel (very creatively named #cassandra-5-project) to discuss everything.
  - It was quick and easy to go back to see why we made certain decisions or revisit them if needed. This had a bonus of allowing a Lead Engineer to write a blog post very quickly about the Cassandra 5 project.
This has been a long-running but very exciting project for the entire team here at Instaclustr. The Apache Cassandra community is on the home stretch for this massive release, and we couldn’t be more excited to start seeing what everyone will build with it.
You can sign up today for a free trial and test Apache Cassandra 5 Release Candidate 1 by creating a cluster on the Instaclustr Managed Platform.
More Readings
- The Top 5 Questions We’re Asked about Apache Cassandra 5.0
- Vector Search in Apache Cassandra 5.0
- How Does Data Modeling Change in Apache Cassandra 5.0?
Will Your Cassandra Database Project Succeed?: The New Stack
Open source Apache Cassandra® continues to stand out as an enterprise-proven solution for organizations seeking high availability, scalability and performance in a NoSQL database. (And hey, the brand-new 5.0 version is only making those statements even more true!) There’s a reason this database is trusted by some of the world’s largest and most successful companies.
That said, effectively harnessing the full spectrum of Cassandra’s powerful advantages can mean overcoming a fair share of operational complexity. Some folks will find a significant learning curve, and knowing what to expect is critical to success. In my years of experience working with Cassandra, it’s when organizations fail to anticipate and respect these challenges that they set the stage for their Cassandra projects to fall short of expectations.
Let’s look at the key areas where strong project management and following proven best practices will enable teams to evade common pitfalls and ensure a Cassandra implementation is built strong from Day 1.
Accurate Data Modeling Is a Must
Cassandra projects require a thorough understanding of its unique data model principles. Teams that approach Cassandra like a relational database are unlikely to model data properly. This can lead to poor performance, excessive use of secondary indexes and significant data consistency issues.
On the other hand, teams that develop familiarity with Cassandra’s specific NoSQL data model will understand the importance of including partition keys, clustering keys and denormalization. These teams will know to closely analyze query and data access patterns associated with their applications and know how to use that understanding to build a Cassandra data model that matches their application’s needs step for step.
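For example, the hypothetical, illustrative table sketched below (using the Python driver) is shaped around a single access pattern, "latest orders for a customer", rather than around normalized entities; the partition key and clustering order come straight from the query.

import uuid

from cassandra.cluster import Cluster

# Illustrative only: assumes a 'shop' keyspace already exists on a local cluster.
session = Cluster(["127.0.0.1"]).connect("shop")

session.execute("""
    CREATE TABLE IF NOT EXISTS orders_by_customer (
        customer_id uuid,        -- partition key: one partition per customer
        order_time timestamp,    -- clustering key: orders stored newest-first
        order_id uuid,
        total decimal,
        PRIMARY KEY ((customer_id), order_time, order_id)
    ) WITH CLUSTERING ORDER BY (order_time DESC, order_id ASC)
""")

# The read path matches the table layout exactly, so no secondary index is needed.
customer_id = uuid.uuid4()  # would normally come from the application
recent = session.execute(
    "SELECT order_id, total FROM orders_by_customer WHERE customer_id = %s LIMIT 10",
    [customer_id],
)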
Configure Cassandra Clusters the Right Way
Accurate, expertly managed cluster configurations are pivotal to the success of Cassandra implementations. Get those cluster settings wrong and Cassandra can suffer from data inconsistencies and performance issues due to inappropriate node capacities, poor partitioning or replication strategies that aren’t up to the task.
Teams should understand the needs of their particular use case and how each cluster configuration setting affects Cassandra’s abilities to serve that use case. Attuning configurations to best support your application — including the right settings for node capacity, data distribution, replication factor and consistency levels — will ensure that you can harness the full power of Cassandra when it counts.
Take Advantage of Tunable Consistency
Cassandra gives teams the option to leverage the best balance of data consistency and availability for their use case. While these tunable consistency levels are a valuable tool in the right hands, teams that don’t understand the nuances of these controls can saddle their applications with painful latency and troublesome data inconsistencies.
Teams that learn to operate Cassandra’s tunable consistency levels properly and carefully assess their application’s needs — especially with read and write patterns, data sensitivity and the ability to tolerate eventual consistency — will unlock far more beneficial Cassandra experiences.
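As a small illustration (a sketch using the Python driver; the keyspace and table are hypothetical), the consistency level can be chosen per statement, so each query explicitly trades latency and availability against stronger guarantees:

import uuid
from datetime import datetime, timezone

from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

session = Cluster(["127.0.0.1"]).connect("shop")  # hypothetical keyspace and table

user_id = uuid.uuid4()

# A write that must reach a quorum of replicas in the local data centre.
write = SimpleStatement(
    "INSERT INTO user_events (user_id, ts, event) VALUES (%s, %s, %s)",
    consistency_level=ConsistencyLevel.LOCAL_QUORUM,
)
session.execute(write, (user_id, datetime.now(timezone.utc), "login"))

# A read where low latency matters more than seeing the very latest write.
read = SimpleStatement(
    "SELECT event FROM user_events WHERE user_id = %s LIMIT 20",
    consistency_level=ConsistencyLevel.ONE,
)
rows = session.execute(read, (user_id,))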
Perform Regular Maintenance
Regular Cassandra maintenance is required to stave off issues such as data inconsistencies and performance drop-offs. Within their Cassandra operational procedures, teams should routinely perform compaction, repair and node-tool operations to prevent challenges down the road, while ensuring cluster health and performance are optimized.
Anticipate Capacity and Scaling Needs
By its nature, success will yield new needs. Be prepared for your Cassandra cluster to grow and scale well into the future — that is what this database is built to do. Starving your Cassandra cluster for CPU, RAM and storage resources because you don’t have a plan to seamlessly add capacity is a way of plucking failure from the jaws of success. Poor performance, data loss and expensive downtime are the rewards for growing without looking ahead.
Plan for growth and scalability from the beginning of your Cassandra implementation. Practice careful capacity planning. Look at your data volumes, write/read patterns and performance requirements today and tomorrow. Teams with clusters built for growth will be ready to do so far more easily and affordably.
Make Changes With a Careful Testing/Staging/Prod Process
Teams that think they’re streamlining their process efficiency by putting Cassandra changes straight into production actually enable a pipeline for bugs, performance roadblocks and data inconsistencies. Testing and staging environments are essential for validating changes before putting them into production environments and will save teams countless hours of headaches.
At the end of the day, running all data migrations, changes to schema and application updates through testing and staging environments is far more efficient than putting them straight into production and then cleaning up myriad live issues.
Set Up Monitoring and Alerts
Teams implementing monitoring and alerts to track metrics and flag anomalies can mitigate trouble spots before they become full-blown service interruptions. The speed at which teams become aware of issues can mean the difference between a behind-the-scenes blip and a downtime event.
Have Backup and Disaster Recovery at the Ready
In addition to standing up robust monitoring and alerting, teams should regularly test and run practice drills on their procedures for recovering from disasters and using data backups. Don’t neglect this step; these measures are absolutely essential for ensuring the safety and resilience of systems and data.
The less prepared an organization is to recover from issues, the longer and more costly and impactful downtime will be. Incremental or snapshot backup strategies, replication that’s based in the cloud or across multiple data centers and fine-tuned recovery processes should be in place to minimize downtime, stress and confusion whenever the worst occurs.
Nurture Cassandra Expertise
The expertise required to optimize Cassandra configurations, operations and performance will only come with a dedicated focus. Enlisting experienced talent, instilling continuous training regimens that keep up with Cassandra updates, turning to external support and ensuring available resources — or all of the above — will position organizations to succeed in following the best practices highlighted here and achieving all of the benefits that Cassandra can deliver.
Use Your Data in LLMs With the Vector Database You Already Have: The New Stack
Open source vector databases are among the top options out there for AI development, including some you may already be familiar with or even have on hand.
Vector databases allow you to enhance your LLM models with data from your internal data stores. Prompting the LLM with local, factual knowledge can allow you to get responses tailored to what your organization already knows about the situation. This reduces “AI hallucination” and improves relevance.
You can even ask the LLM to add references to the original data it used in its answer so you can check yourself. No doubt vendors have reached out with proprietary vector database solutions, advertised as a “magic wand” enabling you to assuage any AI hallucination concerns.
But, ready for some good news?
If you’re already using Apache Cassandra 5.0, OpenSearch or PostgreSQL, your vector database success is already primed. That’s right: There’s no need for costly proprietary vector database offerings. If you’re not (yet) using these free and fully open source database technologies, your generative AI aspirations are a good time to migrate — they are all enterprise-ready and avoid the pitfalls of proprietary systems.
For many enterprises, these open source vector databases are the most direct route to implementing LLMs — and possibly leveraging retrieval augmented generation (RAG) — that deliver tailored and factual AI experiences.
Vector databases store embedding vectors, which are lists of numbers representing spatial coordinates corresponding to pieces of data. Related data will have closer coordinates, allowing LLMs to make sense of complex and unstructured datasets for features such as generative AI responses and search capabilities.
RAG, a process skyrocketing in popularity, involves using a vector database to translate the words in an enterprise’s documents into embeddings to provide highly efficient and accurate querying of that documentation via LLMs.
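To make that concrete, here is a minimal, framework-free sketch of the retrieval step. The embed() function is a stand-in for whatever embedding model you use; with a real model, semantically related text ends up with nearby vectors.

import numpy as np

def embed(text: str) -> np.ndarray:
    """Stand-in for a real embedding model (FastText, OpenAI, Hugging Face, etc.).
    A real model maps related text to nearby vectors; this stub just returns a
    pseudo-random unit vector so the example runs end to end."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.normal(size=384)
    return v / np.linalg.norm(v)

documents = [
    "Our refund policy allows returns within 30 days.",
    "Support is available 24/7 via chat and email.",
    "The enterprise plan includes a dedicated account manager.",
]
doc_vectors = np.stack([embed(d) for d in documents])

question = "How long do customers have to return a product?"
query_vector = embed(question)

# Retrieve the closest documents by cosine similarity (vectors are unit length).
scores = doc_vectors @ query_vector
top = np.argsort(scores)[::-1][:2]
context = "\n".join(documents[i] for i in top)

# The retrieved context is what grounds the LLM's answer in your own data.
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
print(prompt)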
Let’s look closer at what each open source technology brings to the vector database discussion:
Apache Cassandra 5.0 Offers Native Vector Indexing
With its latest version (currently in preview), Apache Cassandra has added to its reputation as a highly available and scalable open source database by including everything that enterprises developing AI applications require.
Cassandra 5.0 adds native vector indexing and vector search, as well as a new vector data type for embedding vector storage and retrieval. The new version has also added specific Cassandra Query Language (CQL) functions that enable enterprises to easily use Cassandra as a vector database. These additions make Cassandra 5.0 a smart open source choice for supporting AI workloads and executing enterprise strategies around managing intelligent data.
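As a rough sketch of what this can look like, the snippet below uses the open source Python driver to create a table with the new vector type, add a storage-attached index, and run an approximate nearest-neighbor query. The keyspace, table, and column names are invented for illustration, a recent driver version is assumed, and the exact CQL may vary slightly across 5.0 builds:

```python
from cassandra.cluster import Cluster  # pip install cassandra-driver (recent version)

# Assumes a local node and an existing keyspace named demo_keyspace.
session = Cluster(["127.0.0.1"]).connect("demo_keyspace")

# A 3-dimensional vector column; real embeddings typically have hundreds of dimensions.
session.execute("""
    CREATE TABLE IF NOT EXISTS documents (
        id int PRIMARY KEY,
        body text,
        embedding vector<float, 3>
    )
""")

# A storage-attached index (SAI) enables ANN search on the vector column.
session.execute("""
    CREATE CUSTOM INDEX IF NOT EXISTS docs_embedding_idx
    ON documents (embedding) USING 'StorageAttachedIndex'
""")

session.execute(
    "INSERT INTO documents (id, body, embedding) VALUES (%s, %s, %s)",
    (1, "Our SLA guarantees 99.9% uptime.", [0.9, 0.1, 0.0]),
)

# Approximate nearest-neighbor query: rows whose embeddings are closest to the query vector.
rows = session.execute(
    "SELECT body FROM documents ORDER BY embedding ANN OF [0.85, 0.15, 0.1] LIMIT 2"
)
for row in rows:
    print(row.body)
```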
OpenSearch Provides a Combination of Benefits
Like Cassandra, OpenSearch is a highly popular open source solution, and one that many teams on the lookout for a vector database happen to already be using. OpenSearch offers a one-stop shop for search, analytics and vector database capabilities, while also providing exceptional nearest-neighbor search that supports vector, lexical, and hybrid search and analytics.
With OpenSearch, teams can put the pedal down on developing AI applications, counting on the database to deliver the stability, high availability and minimal latency it’s known for, along with the scalability to account for vectors into the tens of billions. Whether developing a recommendation engine, generative AI agent or any other solution where the accuracy of results is crucial, those using OpenSearch to leverage vector embeddings and stamp out hallucinations won’t be disappointed.
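Here is an illustrative sketch using the opensearch-py client and the k-NN capabilities described above. The index name, field names, and toy vectors are assumptions rather than a definitive setup, and it assumes a local, unsecured development instance:

```python
from opensearchpy import OpenSearch  # pip install opensearch-py

# Assumes a local dev instance without authentication.
client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])

# Create an index with a knn_vector field; the dimension must match your embeddings.
client.indices.create(
    index="docs",
    body={
        "settings": {"index": {"knn": True}},
        "mappings": {
            "properties": {
                "body": {"type": "text"},
                "embedding": {"type": "knn_vector", "dimension": 3},
            }
        },
    },
)

client.index(
    index="docs",
    body={"body": "Our SLA guarantees 99.9% uptime.", "embedding": [0.9, 0.1, 0.0]},
    refresh=True,
)

# Nearest-neighbor query: return the 2 documents closest to the query vector.
results = client.search(
    index="docs",
    body={
        "size": 2,
        "query": {"knn": {"embedding": {"vector": [0.85, 0.15, 0.1], "k": 2}}},
    },
)
for hit in results["hits"]["hits"]:
    print(hit["_source"]["body"])
```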
The pgvector Extension Makes Postgres a Powerful Vector Store
Enterprises are no strangers to Postgres, which ranks among the most used databases in the world. Given that the database only needs the pgvector extension to become a particularly performant vector database, countless organizations are just a simple deployment away from harnessing an ideal infrastructure for handling their intelligent data.
pgvector is especially well-suited to exact nearest-neighbor search, approximate nearest-neighbor search and distance-based embedding search, using cosine distance (as recommended by OpenAI), L2 distance and inner product to recognize semantic similarities. Efficiency with those capabilities makes pgvector a powerful and proven open source option for supporting accurate LLM and RAG implementations, while positioning teams to deliver trustworthy AI applications they can be proud of.
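A brief sketch of what this looks like with psycopg2, assuming the pgvector extension is installed on the Postgres server; the connection string, table, column names, and toy vectors are illustrative only:

```python
import psycopg2  # pip install psycopg2-binary

conn = psycopg2.connect("dbname=demo user=postgres")
cur = conn.cursor()

# Enable the extension and create a table with a 3-dimensional vector column.
cur.execute("CREATE EXTENSION IF NOT EXISTS vector")
cur.execute("""
    CREATE TABLE IF NOT EXISTS documents (
        id bigserial PRIMARY KEY,
        body text,
        embedding vector(3)
    )
""")

cur.execute(
    "INSERT INTO documents (body, embedding) VALUES (%s, %s::vector)",
    ("Our SLA guarantees 99.9% uptime.", "[0.9, 0.1, 0.0]"),
)

# <=> is pgvector's cosine-distance operator; <-> is L2 distance, <#> is negative inner product.
cur.execute(
    "SELECT body FROM documents ORDER BY embedding <=> %s::vector LIMIT 2",
    ("[0.85, 0.15, 0.1]",),
)
for (body,) in cur.fetchall():
    print(body)

conn.commit()
cur.close()
conn.close()
```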
Was the Answer to Your AI Challenges in Front of You All Along?
The solution to tailored LLM responses isn’t investing in some expensive proprietary vector database and then trying to dodge the very real risks of vendor lock-in or a bad fit. At least it doesn’t have to be. Recognizing that available open source vector databases are among the top options out there for AI development — including some you may already be familiar with or even have on hand — should be a very welcome revelation.
easy-cass-lab v5 released
I’ve got some fun news to start the week off for users of easy-cass-lab: I’ve just released version 5. There are a number of nice improvements and bug fixes in here that should make it more enjoyable, more useful, and lay groundwork for some future enhancements.
- When the cluster starts, we wait for the storage service to reach NORMAL state, then move to the next node. This is in contrast to the previous behavior, where we waited for 2 minutes after starting a node. This queries JMX directly using Swiss Java Knife and is more reliable than the 2-minute method. Please see packer/bin-cassandra/wait-for-up-normal to read through the implementation.
- Trunk now works correctly. Unfortunately, AxonOps doesn’t support trunk (5.1) yet, and using the agent was causing a startup error. You can test trunk out, but for now the AxonOps integration is disabled.
- Added a new repl mode. This saves keystrokes, provides some auto-complete functionality, and keeps SSH connections open. If you’re going to do a lot of work with ECL, this will help you be a little more efficient. You can try this out with ecl repl.
- Power user feature: Initial support for profiles in AWS regions other than us-west-2. We only provide AMIs for us-west-2, but you can now set up a profile in an alternate region and build the required AMIs using easy-cass-lab build-image. This feature is still under development and requires using an easy-cass-lab build from source. Credit to Jordan West for contributing this work.
- Power user feature: Support for multiple profiles. Setting the EASY_CASS_LAB_PROFILE environment variable allows you to configure alternate profiles. This is handy if you want to use multiple regions or have multiple organizations.
- The project now uses Kotlin instead of Groovy for Gradle configuration.
- Updated Gradle to 8.9.
- When using the list command, don’t show the alias “current”.
- Project cleanup: removed the old unused pssh, cassandra build, and async profiler subprojects.
The release has been published to the project’s GitHub page and to Homebrew. The project is largely driven by my own consulting needs and my training program. If you’re looking to have some features prioritized, please reach out and we can discuss a consulting engagement.
easy-cass-lab updated with Cassandra 5.0 RC-1 Support
I’m excited to announce that the latest version of easy-cass-lab now supports Cassandra 5.0 RC-1, which was just made available last week! This update marks a significant milestone, providing users with the ability to test and experiment with the newest Cassandra 5.0 features in a simplified manner. This post will walk you through how to set up a cluster, SSH in, and run your first stress test.
For those new to easy-cass-lab, it’s a tool designed to streamline the setup and management of Cassandra clusters in AWS, making it accessible for both new and experienced users. Whether you’re running tests, developing new features, or just exploring Cassandra, easy-cass-lab is your go-to tool.
easy-cass-lab now available in Homebrew
I’m happy to share some exciting news for all Cassandra enthusiasts! My open source project, easy-cass-lab, is now installable via a Homebrew tap. This powerful tool is designed to make testing any major version of Cassandra (or even builds that haven’t been released yet) a breeze, using AWS. A big thank-you to Jordan West, who took the time to make this happen!
What is easy-cass-lab?
easy-cass-lab is a versatile testing tool for Apache Cassandra. Whether you’re dealing with the latest stable releases or experimenting with unreleased builds, easy-cass-lab provides a seamless way to test and validate your applications. With easy-cass-lab, you can ensure compatibility and performance across different Cassandra versions, making it an essential tool for developers and system administrators. easy-cass-lab is used extensively for my consulting engagements, my training program, and to evaluate performance patches destined for open source Cassandra. Here are a few examples:
Cassandra Training Signups For July and August Are Open!
I’m pleased to announce that I’ve opened training signups for Operator Excellence to the public for July and August. If you’re interested in stepping up your game as a Cassandra operator, this course is for you. Head over to the training page to find out more and sign up for the course.
Streaming My Sessions With Cassandra 5.0
As a long time participant with the Cassandra project, I’ve witnessed firsthand the evolution of this incredible database. From its early days to the present, our journey has been marked by continuous innovation, challenges, and a relentless pursuit of excellence. I’m thrilled to share that I’ll be streaming several working sessions over the next several weeks as I evaluate the latest builds and test out new features as we move toward the 5.0 release.