How to Reduce DynamoDB Costs: Expert Tips from Alex DeBrie

DynamoDB consultant Alex DeBrie shares where teams tend to get into trouble.

DynamoDB pricing can be a blessing and a curse. When you're just starting off, costs are usually quite reasonable, and on-demand pricing can seem like the perfect way to minimize upfront costs. But then perhaps you face "catastrophic success," with an exponential increase of users flooding your database… and your monthly bill far exceeds your budget. The more predictable provisioned capacity model might seem safer. But if you overprovision, you're burning money – and if you underprovision, your application might be throttled during a critical peak period. It's complicated.

Add in the often-overlooked costs of secondary indexes, ACID transactions, and global tables – plus the nuances of dealing with DAX – and you could find that your cost estimates were worlds away from reality.

Rather than learn these lessons the hard (and costly) way, why not take a shortcut: tap the expert known for helping teams reduce their DynamoDB costs. Enter Alex DeBrie, the guy who literally wrote the book on DynamoDB. Alex shared his experiences at the recent Monster SCALE Summit. This article recaps the key points from his talk (you can watch his complete talk here).

Watch Alex's Complete Talk

Note: If you need further cost reduction beyond these strategies, consider ScyllaDB. ScyllaDB is an API-compatible DynamoDB alternative that provides better latency at 50% of the cost (or less), thanks to extreme engineering efficiency. Learn more about ScyllaDB as a DynamoDB alternative

DynamoDB Pricing: The Basics

Alex began the talk with an overview of how DynamoDB's pricing structure works. Unlike other cloud databases where you provision resources like CPU and RAM, DynamoDB charges directly for operations. You pay for:

  • Read Capacity Units (RCUs): Each RCU allows reading up to 4KB of data per request
  • Write Capacity Units (WCUs): Each WCU allows writing up to 1KB of data per request
  • Storage: Priced per gigabyte-month (similar to EBS or S3)

Then there are DynamoDB billing modes, which determine how you get that capacity for reads and writes.

Provisioned Throughput is the traditional billing mode. You specify how many RCUs and WCUs you want available on a per-second basis. Basically, it's a "use it or lose it" model. You're paying for what you requested, whether you take advantage of it or not. If you happen to exceed what you requested, your workload gets throttled.

And speaking of throttling, Alex called out another important difference between DynamoDB and other databases. With other databases, response times gradually worsen as concurrent queries increase. Not so in DynamoDB. Alex explained, "As you increase the number of concurrent queries, you'll still hit some saturation point where you might not have provisioned enough throughput to support the reads or writes you want to perform. But rather than giving you long-tail response times, which aren't ideal, it simply throttles you. It instantly returns a 500 error, telling you, 'Hey, you haven't provisioned enough for this particular second. Come back in another second, and you'll have more reads and writes available.'" As a result, you get predictable response times – to a limit, at least.

On-Demand Mode is more like a serverless or pay-per-request mode. Rather than saying how much capacity you want in advance, you just get charged per request. As you throw reads and writes at your DynamoDB database, AWS will charge you fractions of a cent each time. At the end of the month, they'll total up all those costs and send you a bill.
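Before going further, it helps to see the unit rounding in action. Here's a minimal sketch (illustrative Rust, not an AWS billing tool) of the per-item RCU/WCU arithmetic described above:

// Capacity units are counted per item and rounded up to whole units.
fn rcus_for_read(item_bytes: u64) -> u64 {
    item_bytes.div_ceil(4 * 1024) // one RCU covers up to 4KB read
}

fn wcus_for_write(item_bytes: u64) -> u64 {
    item_bytes.div_ceil(1024) // one WCU covers up to 1KB written
}

fn main() {
    assert_eq!(rcus_for_read(100), 1); // even a tiny read consumes a full RCU
    assert_eq!(wcus_for_write(100), 1); // likewise for writes
    assert_eq!(rcus_for_read(20 * 1024), 5); // a 20KB read costs 5 RCUs
    assert_eq!(wcus_for_write(10 * 1024), 10); // a 10KB write costs 10 WCUs
}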
Beyond the Basics

For an accurate assessment of your DynamoDB costs, you need to go beyond simply plugging your anticipated read and write estimates into a calculator (either the AWS-hosted DynamoDB cost calculator or the more nuanced DynamoDB cost analyzer we've designed). Many other factors – in your DynamoDB configuration as well as your actual application – impact your costs. Critical DynamoDB cost factors that Alex highlighted in his talk include:

  • Table storage classes
  • WCU and RCU cost multipliers

Let's look at each in turn.

Table Storage Classes

In DynamoDB, "table storage classes" define the underlying storage tier and access patterns for your table's data. There are two options: Standard for hot data and Standard-IA for infrequently accessed, historical, or backup data.

Standard: This is the traditional table storage class. It provides high-performance storage optimized for frequent access, and it's the cheapest class for operations. However, be aware that storage is more expensive (about 25 cents per gigabyte-month in the cheapest regions).

Standard-IA (Infrequent Access): This is a lower-cost, less performant tier designed for infrequent access. If you have a table with a lot of data and you're doing fewer operations on it, you can use this option for cheaper storage (only about 10 cents per gigabyte-month). However, the tradeoffs are that you pay a premium on operations and you cannot reserve capacity.

[Amazon's tips on selecting the table storage class]

WCU and RCU Cost Multipliers

Beyond the core settings, there's also an array of "multipliers" that can dramatically increase your capacity unit consumption. Factors such as item size, secondary indexes, transactions, global table replication, and read consistency can all cause costs to skyrocket if you're not careful. The riskiest cost multipliers that Alex called out include:

  • Item size: Although the standard RCU is 4KB and the standard WCU is 1KB, you can go beyond that (for a cost). If you're reading a 20KB item, that's going to be 5 RCUs (20KB / 4KB = 5 RCUs). Or if you're writing a 10KB item, that's going to be 10 WCUs (10KB / 1KB = 10 WCUs).
  • Secondary indexes: DynamoDB lets you use secondary indexes, but again – it will cost you. In addition to paying for the writes that go to your main table, you will also pay for all the writes to your secondary indexes. That can really drive up your WCU consumption.
  • ACID transactions: You can use ACID transactions to operate on multiple items in a single request in an all-or-nothing way. However, you pay quite a premium for this.
  • Global Tables: DynamoDB Global Tables replicate data across multiple regions, but you really pay the price due to increased write operations as well as increased storage needs.
  • Consistent reads: Strongly consistent reads ensure that a read request always returns the most recent write. But you pay higher costs compared to eventually consistent reads, which might return slightly older data.
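A rough sketch of how the multipliers above compound on a single item (illustrative Rust with simplified assumptions: each secondary index write costs about as many WCUs as the base write, transactional writes cost double, and eventually consistent reads cost half, rounded up here for simplicity; real index costs depend on projected item size):

// Simplified multiplier model – not an exact reproduction of AWS billing.
fn write_wcus(item_bytes: u64, secondary_indexes: u64, transactional: bool) -> u64 {
    let base = item_bytes.div_ceil(1024);
    let with_indexes = base * (1 + secondary_indexes); // each index repeats the write
    if transactional { with_indexes * 2 } else { with_indexes } // transactions cost ~2x
}

fn read_rcus(item_bytes: u64, strongly_consistent: bool) -> u64 {
    let base = item_bytes.div_ceil(4 * 1024);
    if strongly_consistent { base } else { base.div_ceil(2) } // eventual consistency is half price
}

fn main() {
    // A 10KB write with 2 secondary indexes, inside a transaction:
    // 10 WCUs x 3 (table + 2 indexes) x 2 (transaction) = 60 WCUs.
    assert_eq!(write_wcus(10 * 1024, 2, true), 60);
    // A 20KB read: 5 RCUs strongly consistent, ~3 eventually consistent.
    assert_eq!(read_rcus(20 * 1024, true), 5);
    assert_eq!(read_rcus(20 * 1024, false), 3);
}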
How to Reduce DynamoDB Costs

Alex's top tip is to "mind your multipliers." Make sure you really understand the cost impacts of different options, and avoid any options that don't justify their steep costs. In particular…

Watch Item Sizes

DynamoDB users tend to bloat their item sizes without really thinking about it. This consumes a lot of resources (disk/memory/CPU), so review your item sizes carefully:

  • Remove unused attributes
  • If you have large values, consider storing them in S3 instead
  • Shorten attribute names (since AWS charges for the full payload transmitted over the wire, long attribute names result in larger item sizes)
  • If you have a smaller amount of frequently updated data and a larger amount of slow-moving data, consider splitting items into multiple different items (vertical partitioning)

Limit Secondary Indexes

Secondary indexes are another common culprit behind unexpected DynamoDB costs. Be vigilant about spotting and removing secondary indexes that you don't really need. Remember, they make you pay twice: in storage and on every write. You can also use projections to limit the number of writes to your secondary indexes and/or limit the size of the items in those indexes. Regularly review secondary indexes to ensure they are being utilized. Remove any index that isn't being read, and evaluate the write:read cost ratio to determine whether the cost is justified.

Use Transactions Sparingly

Limit transactions. Alex put it this way: "AWS came out with DynamoDB transactions six or seven years ago. They're super useful for many things, but I wouldn't use them willy-nilly. They're slower than traditional DynamoDB operations and more expensive. So, I try to limit my transactions to high-value, low-volume, low-frequency applications. That's where I find them worthwhile — if I use them at all. Otherwise, I focus on modeling around them, leaning into DynamoDB's design to avoid needing transactions in the first place."

Be Selective with Global Tables

Global tables are critical if you need data in multiple regions, but make sure they're really worth it. Given that they will multiply your write and storage costs, they should add significant value to justify their existence.

Consider Eventually Consistent Reads

Do you really need strongly consistent reads every time? Alex has found that in most cases, users don't. "You're almost always going to get the latest version of the item, and even if you don't, it shouldn't cause data corruption."

Choose the Right Billing Mode

On-demand DynamoDB costs about 3.5X the price of fully utilized provisioned capacity (and this is quite an improvement over the previous 7X). However, achieving full utilization of provisioned capacity is difficult because overprovisioning is often necessary to handle traffic spikes. Generally, ~28-29% utilization is needed to make provisioned capacity cost effective. For smaller workloads or those with unpredictable traffic, on-demand is often the better choice.

Alex advises: "Use on-demand until it hurts. If your DynamoDB bill is under $1,000 a month, don't spend too much time optimizing provisioned capacity. Instead, set it to on-demand and see how it goes. Once costs start to rise, then consider whether it's worth optimizing. If you're using provisioned capacity, aim for at least 28.8% utilization. If you're not hitting that, switch to on-demand. Autoscaling can help with provisioned capacity – as long as your traffic doesn't have rapid spikes. For stable, predictable workloads, reserved capacity (purchased a year in advance) can save you a lot of money."
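That 28.8% figure falls almost directly out of the price ratio: if on-demand costs 3.5X fully utilized provisioned capacity, provisioned capacity only wins once average utilization clears 1/3.5 ≈ 28.6%. A quick sanity check (illustrative Rust):

// If on-demand costs 3.5x fully utilized provisioned capacity, provisioned
// capacity is cheaper only when average utilization exceeds 1/3.5 (~28.6%).
const ON_DEMAND_PREMIUM: f64 = 3.5;

fn provisioned_is_cheaper(avg_utilization: f64) -> bool {
    avg_utilization > 1.0 / ON_DEMAND_PREMIUM
}

fn main() {
    assert!(!provisioned_is_cheaper(0.20)); // 20% utilized: stay on on-demand
    assert!(provisioned_is_cheaper(0.30)); // 30% utilized: provisioned wins
}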
Review Table Storage Classes Monthly

Review your table storage classes every month. When deciding between storage classes, the key metric is whether total operations costs are 2.4X total storage costs. If operations costs exceed this, standard storage is preferable; otherwise, Standard-IA is the better choice. Also, be aware that the optimal setting can vary over time. Per Alex, "Standard storage is usually cheaper at first. For example, writing a kilobyte of data costs roughly the same as storing it for five months, so you'll likely start in standard storage. However, over time, as your data grows, storage costs increase, and it may be worth switching to standard IA."

Another tip on this front: use TTL to your advantage. If you don't need to keep data forever, use TTL to automatically expire it. This will help with storage costs.

"DynamoDB pricing should influence how you build your application"

Alex left us with this thought: "DynamoDB pricing should influence how you build your application. You should consider these cost multipliers when designing your data model because you can easily see the connection between resource usage and cost, ensuring you're getting value from it. For example, if you're thinking about adding a secondary index, run the numbers to see if it's better to over-read from your main table instead of paying the write cost for a secondary index. There are many strategies you can use."

Browse our DynamoDB Resources

Learn how ScyllaDB Compares to DynamoDB

CEP-24 Behind the scenes: Developing Apache Cassandra®’s password validator and generator

Introduction: The need for an Apache Cassandra® password validator and generator

Here’s the problem: while users have always had the ability to create whatever password they wanted in Cassandra – from straightforward to incredibly complex and everything in between – this ultimately created a noticeable security vulnerability.

While organizations might have internal processes for generating secure passwords that adhere to their own security policies, Cassandra itself did not have the means to enforce these standards. To make the security vulnerability worse, if a password initially met internal security guidelines, users could later downgrade their password to a less secure option simply by using “ALTER ROLE” statements.

Even when internal password requirements are enforced, users face the additional burden of creating compliant passwords. This inevitably involves lots of trial and error in attempting to create a password that satisfies complex security rules.

But what if there was a way to have Cassandra automatically create passwords that meet all bespoke security requirements – but without requiring manual effort from users or system operators?

That’s why we developed CEP-24: Password validation/generation. We recognized that the complexity of secure password management could be significantly reduced (or eliminated entirely) with the right approach – improving both security and user experience at the same time.

The Goals of CEP-24

A Cassandra Enhancement Proposal (or CEP) is a structured process for proposing, creating, and ultimately implementing new features for the Cassandra project. All CEPs are thoroughly vetted among the Cassandra community before they are officially integrated into the project.

These were the key goals we established for CEP-24:

  • Introduce a way to enforce password strength upon role creation or role alteration.
  • Provide a reference implementation of a password validator which adheres to a recommended password strength policy, available to Cassandra users out of the box.
  • Emit a warning (and proceed) or reject “create role” and “alter role” statements outright when the provided password does not meet a certain security level, based on how Cassandra is configured.
  • Allow implementers to plug in a custom password validator with its own policy, whatever it might be, via a modular/pluggable mechanism.
  • Provide a way for Cassandra to generate a password which would pass the subsequent validation, for use by the user.

The Cassandra Password Validator and Generator builds upon an established framework in Cassandra called Guardrails, which was originally implemented under CEP-3 (more details here).

The password validator implements a custom guardrail introduced as part of CEP-24. A custom guardrail can validate and generate values of arbitrary types when properly implemented. In the CEP-24 context, the password guardrail provides CassandraPasswordValidator by extending ValueValidator, while passwords are generated by CassandraPasswordGenerator by extending ValueGenerator. Both components work with passwords as String type values.

Password validation and generation are configured in the cassandra.yaml file under the password_validator section. Let’s explore the key configuration properties available. First, the class_name and generator_class_name parameters specify which validator and generator classes will be used to validate and generate passwords respectively.

Cassandra ships CassandraPasswordValidator and CassandraPasswordGenerator out of the box. However, if a particular enterprise decides that they need something very custom, they are free to implement their own validator, put it on Cassandra’s class path, and reference it in the configuration behind the class_name parameter. The same goes for the generator.

CEP-24 provides implementations of the validator and generator that the Cassandra team believes will satisfy the requirements of most users. These default implementations address common password security needs. However, the framework is designed with flexibility in mind, allowing organizations to implement custom validation and generation rules that align with their specific security policies and business requirements.

password_validator:
  # Implementation class of a validator. When not in form of FQCN, the
  # package name org.apache.cassandra.db.guardrails.validators is prepended.
  # By default, there is no validator.
  class_name: CassandraPasswordValidator
  # Implementation class of related generator which generates values which are valid when
  # tested against this validator. When not in form of FQCN, the
  # package name org.apache.cassandra.db.guardrails.generators is prepended.
  # By default, there is no generator.
  generator_class_name: CassandraPasswordGenerator

Password quality can be thought of as the number of characteristics a password satisfies. Every password is evaluated against two levels – a warning level and a failure level. This fits naturally with how Guardrails act: every guardrail has warning and failure thresholds. Depending on the value a specific guardrail evaluates, it will either emit a warning to the user that the value is discouraged (but ultimately allowed) or reject the value altogether.

This same principle applies to password evaluation – each password is assessed against both warning and failure thresholds. These thresholds are determined by counting the characteristics present in the password. The system checks the password’s overall length plus four character characteristics: the number of upper-case characters, lower-case characters, special characters, and digits. A comprehensive password security policy can be enforced by configuring minimum requirements for each of these.

# There are four characteristics:
# upper-case, lower-case, special character and digit.
# If this value is set e.g. to 3, a password has to
# consist of 3 out of 4 characteristics.
# For example, it has to contain at least 2 upper-case characters,
# 2 lower-case, and 2 digits to pass,
# but it does not have to contain any special characters.
# If the number of characteristics found in the password is
# less than or equal to this number, it will emit a warning.
characteristic_warn: 3
# If the number of characteristics found in the password is
# less than or equal to this number, it will emit a failure.
characteristic_fail: 2

Next, there are configuration parameters for each characteristic which count towards warning or failure:

# If the password is shorter than this value, 
# the validator will emit a warning. 
length_warn: 12 
# If a password is shorter than this value, 
# the validator will emit a failure. 
length_fail: 8 
# If a password does not contain at least n 
# upper-case characters, the validator will emit a warning. 
upper_case_warn: 2 
# If a password does not contain at least 
# n upper-case characters, the validator will emit a failure. 
upper_case_fail: 1 
# If a password does not contain at least 
# n lower-case characters, the validator will emit a warning. 
lower_case_warn: 2 
# If a password does not contain at least 
# n lower-case characters, the validator will emit a failure. 
lower_case_fail: 1 
# If a password does not contain at least 
# n digits, the validator will emit a warning. 
digit_warn: 2 
# If a password does not contain at least 
# n digits, the validator will emit a failure. 
digit_fail: 1 
# If a password does not contain at least 
# n special characters, the validator will emit a warning. 
special_warn: 2 
# If a password does not contain at least 
# n special characters, the validator will emit a failure. 
special_fail: 1
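To make the threshold logic concrete, here is a rough sketch of characteristic counting (illustrative Rust, not the actual Java implementation; treating any non-alphanumeric character as “special” is a simplification):

// Counts how many of the four character characteristics a password meets,
// given per-characteristic minimums (the *_warn or *_fail settings above).
fn characteristics_met(
    password: &str,
    min_upper: usize,
    min_lower: usize,
    min_digit: usize,
    min_special: usize,
) -> usize {
    let upper = password.chars().filter(|c| c.is_ascii_uppercase()).count();
    let lower = password.chars().filter(|c| c.is_ascii_lowercase()).count();
    let digit = password.chars().filter(|c| c.is_ascii_digit()).count();
    // Simplification: treat any non-alphanumeric character as special.
    let special = password.chars().filter(|c| !c.is_ascii_alphanumeric()).count();

    [
        upper >= min_upper,
        lower >= min_lower,
        digit >= min_digit,
        special >= min_special,
    ]
    .into_iter()
    .filter(|&met| met)
    .count()
}

fn main() {
    // 'mYAtt3mp' (from the example below) at the warn-level minimums of 2:
    // only the upper- and lower-case rules are met, so 2 of 4 characteristics
    // (2 <= characteristic_warn: 3, which emits a warning).
    assert_eq!(characteristics_met("mYAtt3mp", 2, 2, 2, 2), 2);
    // At the fail-level minimums of 1: 3 of 4 characteristics
    // (3 > characteristic_fail: 2, so no failure).
    assert_eq!(characteristics_met("mYAtt3mp", 1, 1, 1, 1), 3);
}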

It is also possible to forbid illegal sequences of a certain length found in a password:

# If a password contains illegal sequences that are at least this long, it is invalid. 
# Illegal sequences might be either alphabetical (of the form 'abcde'),
# numerical (of the form '34567'), or US qwerty (of the form 'asdfg'), as well
# as sequences from supported character sets.
# The minimum value for this property is 3, 
# by default it is set to 5. 
illegal_sequence_length: 5

Lastly, it is also possible to configure a dictionary of passwords to check against, which guards against password dictionary attacks. It is up to the operator of a cluster to configure the password dictionary:

# Dictionary to check the passwords against. Defaults to no dictionary. 
# Whole dictionary is cached into memory. Use with caution with relatively big dictionaries. 
# Entries in a dictionary, one per line, have to be sorted per String's compareTo contract. 
dictionary: /path/to/dictionary/file
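The sorted-order requirement suggests the cached dictionary is meant to be searched efficiently rather than scanned. A rough sketch of the idea (illustrative Rust; the real validator is Java, and the exact lookup strategy is an assumption here):

// A sorted, in-memory dictionary can be probed in O(log n) per password.
fn in_dictionary(sorted_words: &[String], password: &str) -> bool {
    sorted_words
        .binary_search_by(|word| word.as_str().cmp(password))
        .is_ok()
}

fn main() {
    // Entries must already be sorted, mirroring the compareTo requirement above.
    let dictionary = vec![
        "123456".to_owned(),
        "cassandraisadatabase".to_owned(),
        "password".to_owned(),
    ];
    assert!(in_dictionary(&dictionary, "cassandraisadatabase"));
    assert!(!in_dictionary(&dictionary, "T8aum3?"));
}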

Now that we have gone over all the configuration parameters, let’s take a look at an example of how password validation and generation look in practice.

Consider a scenario where a Cassandra super-user (such as the default ‘cassandra’ role) attempts to create a new role named ‘alice’.

cassandra@cqlsh> CREATE ROLE alice WITH PASSWORD = 'cassandraisadatabase' AND LOGIN = true; 

InvalidRequest: Error from server: code=2200 [Invalid query] 
message="Password was not set as it violated configured password strength 
policy. To fix this error, the following has to be resolved: Password 
contains the dictionary word 'cassandraisadatabase'. You may also use 
'GENERATED PASSWORD' upon role creation or alteration."

The first attempt was rejected because it appears in the configured dictionary. The operator’s next attempt is not found in the dictionary, but it is not long enough:

cassandra@cqlsh> CREATE ROLE alice WITH PASSWORD = 'T8aum3?' AND LOGIN = true; 
InvalidRequest: Error from server: code=2200 [Invalid query] 
message="Password was not set as it violated configured password strength 
policy. To fix this error, the following has to be resolved: Password 
must be 8 or more characters in length. You may also use 
'GENERATED PASSWORD' upon role creation or alteration."

Making the password longer satisfies the failure-level thresholds, so this time the password is set – but it is still not completely secure. It satisfies the minimum requirements, but the validator identifies that not all characteristics were met:

cassandra@cqlsh> CREATE ROLE alice WITH PASSWORD = 'mYAtt3mp' AND LOGIN = true; 

Warnings: 

Guardrail password violated: Password was set, however it might not be 
strong enough according to the configured password strength policy. 
To fix this warning, the following has to be resolved: Password must be 12 or more 
characters in length. Passwords must contain 2 or more digit characters. Password 
must contain 2 or more special characters. Password matches 2 of 4 character rules, 
but 4 are required. You may also use 'GENERATED PASSWORD' upon role creation or alteration.

Seeing this warning, the operator notices the hint about the ‘GENERATED PASSWORD’ clause, which generates a compliant password automatically, with no need to invent one manually. As the attempts above show, inventing a compliant password is often a cumbersome process that is better left to a machine – automating it is both more efficient and more reliable.

cassandra@cqlsh> ALTER ROLE alice WITH GENERATED PASSWORD; 

generated_password 
------------------ 
   R7tb33?.mcAX

The generated password shown above automatically satisfies all the rules we have configured in cassandra.yaml – as will every generated password. This is a clear advantage over inventing passwords manually.

When the CQL statement is executed, it will be visible in the CQLSH history (the HISTORY command, or the cqlsh_history file), but the password will not be logged, so it cannot leak that way. It will also not appear in any audit logs. Previously, Cassandra had to obfuscate such statements; that is no longer necessary.

We can create a role with generated password like this:

cassandra@cqlsh> CREATE ROLE alice WITH GENERATED PASSWORD AND LOGIN = true; 

or by CREATE USER: 

cassandra@cqlsh> CREATE USER alice WITH GENERATED PASSWORD;

When a password has been generated for alice (delivering it to her is out of scope of this documentation), she can log in:

$ cqlsh -u alice -p R7tb33?.mcAX 
... 
alice@cqlsh>

Note: It is recommended to save the password to ~/.cassandra/credentials, for example:

[PlainTextAuthProvider] 
username = alice
password = R7tb33?.mcAX

and by setting the auth_provider section in ~/.cassandra/cqlshrc:

[auth_provider] 
module = cassandra.auth 
classname = PlainTextAuthProvider

It is also possible to configure the password validator in such a way that a user does not see why a password failed. This is driven by the password_validator configuration property detailed_messages. When set to false, the violation messages are very brief:

alice@cqlsh> ALTER ROLE alice WITH PASSWORD = 'myattempt'; 

InvalidRequest: Error from server: code=2200 [Invalid query] 
message="Password was not set as it violated configured password strength policy. 
You may also use 'GENERATED PASSWORD' upon role creation or alteration."
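For reference, that behavior corresponds to a cassandra.yaml setting along these lines (placement within the password_validator section assumed):

password_validator:
  # When false, error messages do not enumerate the violated rules.
  detailed_messages: false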

The following command will automatically generate a new password that meets all configured security requirements.

alice@cqlsh> ALTER ROLE alice WITH GENERATED PASSWORD;

Several potential enhancements to password generation and validation could be implemented in future releases. One promising extension would be validating new passwords against previous values. This would prevent users from reusing passwords until after they’ve created a specified number of different passwords. A related enhancement could include restricting how frequently users can change their passwords, preventing rapid cycling through passwords to circumvent history-based restrictions.

These features, while valuable for comprehensive password security, were considered beyond the scope of the initial implementation and may be addressed in future updates.

Final thoughts and next steps

The Cassandra Password Validator and Generator implemented under CEP-24 represents a significant improvement in Cassandra’s security posture.

By providing robust, configurable password policies with built-in enforcement mechanisms and convenient password generation capabilities, organizations can now ensure compliance with their security standards directly at the database level. This not only strengthens overall system security but also improves the user experience by eliminating guesswork around password requirements.

As Cassandra continues to evolve as an enterprise-ready database solution, these security enhancements demonstrate a commitment to meeting the demanding security requirements of modern applications while maintaining the flexibility that makes Cassandra so powerful.

Ready to experience CEP-24 yourself? Try it out on the Instaclustr Managed Platform and spin up your first Cassandra cluster for free.

CEP-24 is just our latest contribution to open source. Check out everything else we’re working on here.

The post CEP-24 Behind the scenes: Developing Apache Cassandra®’s password validator and generator appeared first on Instaclustr.

Announcing ScyllaDB 2025.1, Our First Source-Available Release

Tablets are enabled by default + new support for mixed clusters with varying core counts and resources

The ScyllaDB team is pleased to announce the release of ScyllaDB 2025.1.0 LTS, a production-ready ScyllaDB Long Term Support Major Release.

ScyllaDB 2025.1 is the first release under our Source-Available License. It combines all the improvements from Enterprise releases (up to 2024.2) and Open Source releases (up to 6.2) into a single source-available code base.

ScyllaDB 2025.1 enables Tablets by default. It also improves performance and scaling speed and allows mixed clusters (nodes that use different instance types). Several new capabilities, updates, and hundreds of bug fixes are also included.

This release is the base for the upcoming ScyllaDB X Cloud, the new and improved ScyllaDB Cloud. That release offers fast boot, fast scaling (out and in), and an upper limit of 90% storage utilization (compared to 70% today).

In this blog, we’ll highlight the new capabilities our users have frequently asked about. For the complete details, read the release notes.

Read the detailed release notes on our forum
Learn more about ScyllaDB Enterprise
Get ScyllaDB Enterprise 2025.1
Upgrade from ScyllaDB Enterprise 2024.x to 2025.1
Upgrade from ScyllaDB Open Source 6.2 to 2025.1

ScyllaDB Enterprise customers are encouraged to upgrade to ScyllaDB Enterprise 2025 and are welcome to contact our Support Team with questions.

Tablets Overview

In this release, ScyllaDB makes tablets the default for new Keyspaces. “Tablets” is a new data distribution algorithm that improves upon the legacy vNodes approach from Apache Cassandra. Unlike vNodes, which statically distribute tables across nodes based on the token ring, Tablets dynamically assign tables to a subset of nodes based on size. Future updates will optimize this distribution using CPU and OPS information.

Key benefits of Tablets include:

  • Faster scaling and topology changes: new nodes can serve reads and writes as soon as the first Tablet is migrated. Together with Raft-based Strongly Consistent Topology Updates, Tablets enable users to add multiple nodes simultaneously.
  • Automatic support for mixed clusters with varying core counts.

You can run some Keyspaces with Tablets enabled and others with Tablets disabled. In this case, scaling improvements will only apply to Keyspaces with Tablets enabled. vNodes will continue to be supported for existing and new Keyspaces using the `tablets = { 'enabled': false }` option.

Tablet Merge

Tablet Merge is a new feature in 2025.1. Its goal is to reduce the tablet count for a shrinking table, similar to how Split increases the count while a table is growing. The load balancer’s decision to merge was already implemented (it came with the infrastructure introduced for Split), but it was not acted upon until now. The topology coordinator will now detect shrunk tables and merge adjacent tablets to meet the average tablet replica size goal. #18181

Tablet-Based Keyspace Limitations

Tablets Keyspaces are NOT yet enabled for the following features:

  • Materialized Views
  • Secondary Indexes
  • Change Data Capture (CDC)
  • Lightweight Transactions (LWT)
  • Counters
  • Alternator (Amazon DynamoDB API)

Using Tablets

You can continue using these features by using a vNode-based Keyspace.

Monitoring Tablets

To monitor Tablets in real time, upgrade ScyllaDB Monitoring Stack to release 4.7 and use the new dynamic Tablet panels.
Tablets Driver Support

The following driver versions and newer support Tablets:

  • Java driver 4.x, from 4.18.0.2
  • Java driver 3.x, from 3.11.5.4
  • Python driver, from 3.28.1
  • Gocql driver, from 1.14.5
  • Rust driver, from 0.13.0

Legacy ScyllaDB and Apache Cassandra drivers will continue to work with ScyllaDB. However, they will be less efficient when working with tablet-based Keyspaces.

File-based Streaming for Tablets

File-based streaming enhances tablet migration. In previous releases, migration involved streaming mutation fragments, requiring deserialization and reserialization of SSTable files. In this release, we directly stream entire SSTables, eliminating the need to process mutation fragments. This method reduces network data transfer and CPU usage, particularly for small-cell data models. File-based streaming is used for tablet migration in all keyspaces with tablets enabled. More in Docs.

Arbiter and Zero-Token Nodes

There is now support for zero-token nodes. These nodes do not replicate data but can assist in query coordination and Raft quorum voting. This allows the creation of an Arbiter: a tiebreaker node that helps maintain quorum in a symmetrical two-datacenter cluster. If one data center fails, the Arbiter (placed in a third data center) keeps the quorum alive without replicating user data or incurring network and storage costs. #15360 You can use nodetool status to find a list of zero-token nodes.

Additional Key Features

The following features were introduced in ScyllaDB Enterprise Feature Release 2024.2 and are now available in Long-Term Support ScyllaDB 2025.1. For a full description of each, see the 2024.2 release notes and ScyllaDB Docs.

  • Strongly Consistent Topology Updates. With Raft-managed topology enabled, all topology operations are internally sequenced consistently. Strongly Consistent Topology Updates is now the default for new clusters and should be enabled after upgrade for existing clusters.
  • Strongly Consistent Auth Updates. Role-Based Access Control (RBAC) commands like create role or grant permission are safe to run in parallel, without a risk of getting out of sync with themselves and other metadata operations, like schema changes. As a result, there is no need to update system_auth RF or run repair when adding a DataCenter.
  • Strongly Consistent Service Levels. Service Levels allow you to define attributes like timeout per workload. Service levels are now strongly consistent using Raft, like Schema, Topology, and Auth.
  • Improved network compression for inter-node RPC. New compression improvements for node-to-node communication: using zstd instead of lz4, and using a shared dictionary re-trained periodically on the traffic instead of message-by-message compression.
  • Alternator RBAC. Alternator supports Role-Based Access Control (RBAC). Control is done via CQL.
  • Native Nodetool. The nodetool utility provides simple command-line interface operations and attributes. The native nodetool works much faster. Unlike the Java version, the native nodetool is part of the ScyllaDB repo and allows easier and faster updates.
  • Removing the JMX Server. With the native nodetool (above), the JMX server has become redundant and will no longer be part of the default ScyllaDB installation or image.
  • Maintenance Mode. Maintenance mode is a new mode in which the node does not communicate with clients or other nodes and only listens to the local maintenance socket and the REST API.
  • Maintenance Socket. The Maintenance Socket provides a new way to interact with ScyllaDB from within the node it runs on. It is mainly for debugging. As described in the Maintenance Socket docs, you can use cqlsh with the Maintenance Socket.

Read the detailed release notes

Inside ScyllaDB Rust Driver 1.0: A Fully Async Shard-Aware CQL Driver Using Tokio

The engineering challenges and design decisions that led to the 1.0 release of ScyllaDB Rust Driver

ScyllaDB Rust driver is a client-side, shard-aware driver written in pure Rust with a fully async API using Tokio. The Rust Driver project was born back in 2021 during ScyllaDB’s internal developer hackathon. Our initial goal was to provide a native implementation of a CQL driver that’s compatible with Apache Cassandra and also contains a variety of ScyllaDB-specific optimizations. Later that year, we released ScyllaDB Rust Driver 0.2.0 on the Rust community’s package registry, crates.io. Comparative benchmarks for that early release confirmed that this driver was (more than) satisfactory in terms of performance. So we continued working on it, with the goal of an official release – and also an ambitious plan to unify other ScyllaDB-specific drivers by converting them into bindings for our Rust driver.

Now that we’ve reached a major milestone for the Rust Driver project (officially releasing ScyllaDB Rust Driver 1.0), it’s time to share the challenges and design decisions that led to this 1.0 release.

Learn about our versioning rationale

What’s New in ScyllaDB Rust Driver 1.0?

Along with stability, this new release brings powerful new features, better performance, and smarter design choices. Here’s a look at what we worked on and why.

Refactored Error Types

Our original error types met ad hoc needs, but weren’t ideal for long-term production use. They weren’t very type-safe, some of them stringified other errors, and they did not provide sufficient information to diagnose the error’s root cause. Some of them were severely abused – most notably ParseError. There was a One-To-Rule-Them-All error type: the ubiquitous QueryError, which many user-facing APIs used to return.

Before

Back in 0.13 of the driver, QueryError was a single flat enum that most user-facing APIs returned. Note that:

  • The structure was painfully flat, with extremely niche errors (such as UnableToAllocStreamId) being just inline variants of this enum.
  • Many variants contained just strings. The worst offender was InvalidMessage, which just jammed all sorts of different error types into a single string. Many errors were buried inside IoError, too. This stringification broke the clear code path to the underlying errors, affecting readability and causing chaos.
  • Due to the above omnipresent stringification, matching on error kinds was virtually impossible.
  • The error types were public and, at the same time, were not decorated with the #[non_exhaustive] attribute. Due to this, adding any new error variant required breaking the API! That was unacceptable for a driver aspiring to bear the name of an API-stable library.

In version 1.0.0, the new error types are clearer and more helpful. The error hierarchy now reflects the code flow. Error conversions are explicit, so no undesired confusing conversion takes place. The one-to-fit-them-all error type has been replaced. Instead, APIs return various error types that exhaustively cover the possible errors, without any need to match on error variants that can’t occur when executing a given function. For QueryError’s new counterpart, ExecutionError, note that:

  • There is much more nesting, reflecting the driver’s modules and abstraction layers.
  • The stringification is gone!
  • Error types are decorated with the #[non_exhaustive] attribute, which requires downstream crates to always have the “else” case (like _ => { … }) when matching on them. This way, we prevent breaking downstream crates’ code when adding a new error variant.
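To see what the #[non_exhaustive] requirement means for applications, here is a minimal sketch with a made-up error enum (not the driver’s actual ExecutionError):

#[non_exhaustive]
pub enum ConnectionError {
    Timeout,
    Broken(String),
}

fn describe(err: &ConnectionError) -> &'static str {
    match err {
        ConnectionError::Timeout => "connection timed out",
        // Outside the defining crate, #[non_exhaustive] makes a wildcard arm
        // mandatory, so the library can add variants in a minor release
        // without breaking downstream matches like this one.
        _ => "some other connection problem",
    }
}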
Refactored Module Structure

The module structure also stemmed from various ad hoc decisions. Users familiar with older releases of our driver may recall, for example, the ubiquitous transport module. It used to contain a bit of absolutely everything: essentially, it was a flat bag with no real deeper structure.

Back in 0.15.1, the module structure looked like this (omitting the modules that were not later restructured):

transport
    load_balancing
        default.rs
        mod.rs
        plan.rs
    locator
        (submodules)
    caching_session.rs
    cluster.rs
    connection_pool.rs
    connection.rs
    downgrading_consistency_retry_policy.rs
    errors.rs
    execution_profile.rs
    host_filter.rs
    iterator.rs
    metrics.rs
    node.rs
    partitioner.rs
    query_result.rs
    retry_policy.rs
    session_builder.rs
    session_test.rs
    session.rs
    speculative_execution.rs
    topology.rs
history.rs
routing.rs

The new module structure clarifies the driver’s separate abstraction layers. Each higher-level module is documented with descriptions of what abstractions it should hold. We also refined our item export policy. Before, there could be multiple paths to import items from. Now items can be imported from just one path: either their original paths (i.e., where they are defined), or from their re-export paths (i.e., where they are imported, and then re-exported from).

In 1.0.0, the module structure is the following (again, omitting the unchanged modules):

client
    caching_session.rs
    execution_profile.rs
    pager.rs
    self_identity.rs
    session_builder.rs
    session.rs
    session_test.rs
cluster
    metadata.rs
    node.rs
    state.rs
    worker.rs
network
    connection.rs
    connection_pool.rs
errors.rs (top-level module)
policies
    address_translator.rs
    host_filter.rs
    load_balancing
        default.rs
        plan.rs
    retry
        default.rs
        downgrading_consistency.rs
        fallthrough.rs
        retry_policy.rs
    speculative_execution.rs
observability
    driver_tracing.rs
    history.rs
    metrics.rs
    tracing.rs
response
    query_result.rs
    request_response.rs
routing
    locator
        (unchanged contents)
    partitioner.rs
    sharding.rs

Removed Unstable Dependencies From the Public API

With the ScyllaDB Rust Driver 1.0 release, we wanted to fully eliminate unstable (pre-1.0) dependencies from the public API. Instead, we now expose these dependencies through feature flags that explicitly encode the major version number, such as "num-bigint-03".

Why did we do this?

  • API Stability & Semver Compliance – The 1.0 release promises a stable API, so breaking changes must be avoided in future minor updates. If our public API directly depended on pre-1.0 crates, any breaking changes in those dependencies would force us to introduce breaking changes as well. By removing them from the public API, we shield users from unexpected incompatibilities.
  • Greater Flexibility for Users – Developers using the ScyllaDB Rust driver can now opt into specific versions of optional dependencies via feature flags. This allows better integration with their existing projects without being forced to upgrade or downgrade dependencies due to our choices.
  • Long-Term Maintainability – By isolating unstable dependencies, we reduce technical debt and make future updates easier. If a dependency introduces breaking changes, we can simply update the corresponding feature flag (e.g., "num-bigint-04") without affecting the core driver API.
  • Avoiding Unnecessary Dependencies – Some users may not need certain dependencies at all. Exposing them via opt-in feature flags helps keep the dependency tree lean, improving compilation times and reducing potential security risks.
  • Improved Ecosystem Compatibility – By allowing users to choose specific versions of dependencies, we minimize conflicts with other crates in their projects. This is particularly important when working with the broader Rust ecosystem, where dependency version mismatches can lead to build failures or unwanted upgrades.
  • Support for Multiple Versions Simultaneously – By namespacing dependencies with feature flags (e.g., "num-bigint-03" and "num-bigint-04"), users can leverage multiple versions of the same dependency within their project. This is particularly useful when integrating with other crates that may require different versions of a shared dependency, reducing version conflicts and easing the upgrade path.

How this impacts users:

  • The core ScyllaDB Rust driver remains stable and free from external pre-1.0 dependencies (with one exception: the popular rand crate, which is still in 0.*).
  • If you need functionality from an optional dependency, enable it explicitly using the appropriate feature flag (e.g., "num-bigint-03").
  • Future updates can introduce new versions of dependencies under separate feature flags – without breaking existing integrations.

This change ensures that the ScyllaDB Rust driver remains stable, flexible, and future-proof, while still providing access to powerful third-party libraries when needed.

Rustls Support for TLS

The driver now supports Rustls, simplifying TLS connections and removing the need for additional system C libraries (openssl). Previously, ScyllaDB Rust Driver only supported OpenSSL-based TLS – like our other drivers did. However, the Rust ecosystem has its own native TLS library: Rustls. Rustls is designed for both performance and security, leveraging Rust’s strong memory safety guarantees while often outperforming OpenSSL in real-world benchmarks.

With the 1.0.0 release, we have added Rustls as an alternative TLS backend. This gives users more flexibility in choosing their preferred implementation. Additional system C libraries (openssl) are no longer required to establish secure connections.

Feature-Based Backend Selection

Just as we isolated pre-1.0 dependencies via version-encoded feature flags (see the previous section), we applied the same strategy to TLS backends. Both OpenSSL and Rustls are exposed through opt-in feature flags. This allows users to explicitly select their desired implementation and ensures:

  • API Stability – Users can enable TLS support without introducing unnecessary dependencies in their projects.
  • Avoiding Unwanted Conflicts – Users can choose the TLS backend that best fits their project without forcing a dependency on OpenSSL or Rustls if they don’t need it.
  • Future-Proofing – If a breaking change occurs in a TLS library, we can introduce a new feature flag (e.g., "rustls-023", "openssl-010") without modifying the core API.

Abstraction Over TLS Backends

We also introduced an abstraction layer over the TLS backends. Key enums such as TlsProvider, TlsContext, TlsConfig and Tls now contain variants corresponding to each backend. This means that switching between OpenSSL and Rustls (as well as between different versions of the same backend) is a matter of enabling the respective feature flag and selecting the desired variant. If you prefer Rustls, enable the "rustls-023" feature and use the TlsContext::Rustls variant. If you need OpenSSL, enable "openssl-010" and use TlsContext::OpenSSL.
If you want both backends or different versions of the same backend (in production or just to explore), you can enable multiple features and it will “just work.” If you don’t require TLS at all, you can exclude both, reducing dependency overhead. Our ultimate goal with adding Rustls support and refining TLS backend selection was to ensure that the ScyllaDB Rust Driver is both flexible and well-integrated with the Rust ecosystem. We hope this better accommodates users’ different performance and security needs.

The Battle For The Empty Enums

We really wanted to let users build the driver with no TLS backends opted in. In particular, this required us to make our enums work without any variants (i.e., as empty enums). This was a bit tricky. For instance, one cannot match over &x, where x: X is an instance of the enum, if X is empty. Specifically, consider an enum X whose only variant, A(String), is gated behind a feature flag "a", and a function that matches over a reference to it. This would not compile:

error[E0004]: non-exhaustive patterns: type `&X` is non-empty
   --> scylla/src/network/tls.rs:230:11
    |
230 |     match x {
    |           ^
    |
note: `X` defined here
   --> scylla/src/network/tls.rs:223:6
    |
223 | enum X {
    |      ^
    = note: the matched value is of type `&X`
    = note: references are always considered inhabited
help: ensure that all possible cases are being handled by adding a match arm with a wildcard pattern as shown
    |
230 ~     match x {
231 +         _ => todo!(),
232 +     }
    |

Note that references are always considered inhabited. Therefore, in order to make the code compile in such a case, we have to match on the value itself, not on a reference. But if we now enable the "a" feature, we get another error…

error[E0507]: cannot move out of `x` as enum variant `A` which is behind a shared reference
   --> scylla/src/network/tls.rs:230:11
    |
230 |     match *x {
    |           ^^
231 |         #[cfg(feature = "a")]
232 |         X::A(s) => { /* Handle it */ }
    |              -
    |              |
    |              data moved here
    |              move occurs because `s` has type `String`, which does not implement the `Copy` trait
    |
help: consider removing the dereference here
    |
230 -     match *x {
230 +     match x {
    |

Ugh. rustc literally advises us to revert the change. No luck… Then we would end up with the same problem as before. Hmmm… Wait a moment… I vaguely remember Rust had an obscure reserved word used for matching by reference: ref. Let’s try it out. Yay, it compiles!!! This is how we made our (possibly) empty enums work… finally!
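Putting the pieces together, here is a minimal reconstruction of the final pattern, assembled from the compiler messages above (the real code lives in scylla/src/network/tls.rs):

enum X {
    // With the feature disabled, X has no variants at all.
    #[cfg(feature = "a")]
    A(String),
}

fn handle(x: &X) {
    // Matching on the value `*x` (not the reference `x`) keeps this match
    // legal even when every variant is compiled out and X is empty...
    match *x {
        // ...and `ref` binds the String by reference, avoiding the
        // "cannot move out of `x`" error when the variant does exist.
        #[cfg(feature = "a")]
        X::A(ref s) => println!("got {s}"),
    }
}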
Faster and Extended Metrics

Performance matters. So we reworked how the driver handles metrics, eliminating bottlenecks and reducing overhead for those who need real-time insights. Moreover, metrics are now an opt-in feature, so you only pay (in terms of resource consumption) for what you use. And we added even more metrics!

Background

Benchmarks showed that the driver may spend significant time logging query latency. Flamegraphs revealed that collecting metrics can consume up to 11.68% of CPU time! We suspected that the culprit was contention on a mutex guarding the metrics histogram. Even though the issue was discovered in 2021 (!), we postponed dealing with it because the publicly available crates didn’t yet include a lock-free histogram (which we hoped would reduce the overhead).

Lock-free histogram

As we approached the 1.0 release deadline, two contributors (Nikodem Gapski and Dawid Pawlik) engaged with the issue. Nikodem explored the new generation of the histogram crate and discovered that someone had added a lock-free histogram: AtomicHistogram. “Great”, he thought. “This is exactly what’s needed.” Then, he discovered that AtomicHistogram is flawed: there’s a logical race due to insufficient synchronization! To fix the problem, he ported the Go implementation of LockFreeHistogram from Prometheus, which prevents logical races at the cost of execution time (though it was still performing much better than a mutex). If you are interested in all the details about what was wrong with AtomicHistogram and how LockFreeHistogram tries to solve it, see the discussion in this PR. Eventually, the histogram crate’s maintainer joined the discussion and convinced us that the skew caused by the logical races in AtomicHistogram is benign. Long story short, the histogram is a bit skewed anyway, and we need to accept it. In the end, we accepted AtomicHistogram for its lower overhead compared to LockFreeHistogram. LockFreeHistogram is still available on its author’s dedicated branch. We left ourselves a way to replace one histogram implementation with another if we decide it’s needed.

More metrics

The Rust driver is a proud base for the cpp-rust-driver (a rewrite of cpp-driver as a thin bindings layer on top of – as you can probably guess at this point – the Rust driver). Before cpp-driver functionalities could be implemented in cpp-rust-driver, they had to be implemented in the Rust driver first. This was the case for some metrics, too. The same two contributors took care of that as well. (Btw, thanks, guys! Some cool sea monster swag will be coming your way.)

Metrics as an opt-in

Not every driver user needs metrics. In fact, it’s quite probable that most users don’t check them even once. So why force users to pay (in terms of resource consumption) for metrics they’re not using? To avoid this, we put the metrics module behind the "metrics" feature (which is disabled by default). Even more performance gain! For a comprehensive list of changes introduced in the 1.0 release, see our release notes.

Stepping Stones on the Path to the 1.0 Release

We’ve been working towards this 1.0 release for years, and it involved a lot of incremental improvements that we rolled out in minor releases along the way. Here’s a look at the most notable ones.

Ser/De (from versions 0.11 and 0.15)

Previous releases reworked the serialization and deserialization APIs to improve safety and efficiency. In short, the 0.11 release introduced a revamped serialization API that leverages Rust’s type system to catch misserialization issues early. And the 0.15 release refined deserialization for better performance and memory efficiency. Here are more details.

Serialization API Refactor (released in 0.11): Leverage Rust’s Powerful Type System to Prevent Misserialization — For Safer and More Robust Query Binding

Before 0.11, the driver’s serialization API had several pitfalls, particularly around type safety. The old approach relied on loosely structured traits and structs (Value, ValueList, SerializedValues, BatchValues, etc.), which lacked strong compile-time guarantees. This meant that if a user mistakenly bound an incorrect type to a query parameter, they wouldn’t receive an immediate, clear error. Instead, they might encounter a confusing serialization error from ScyllaDB — or, in the worst case, could suffer from silent data corruption!

To address these issues, we introduced a redesigned serialization API that replaces the old traits with SerializeValue, SerializeRow, and new versions of BatchValues and SerializedValues. This new approach enforces stronger type safety. Now, type mismatches are caught locally at compile time or runtime (rather than surfacing as obscure database errors after query execution). Key benefits of this refactor include:

  • Early Error Detection – Incorrectly typed bind markers now trigger clear, local errors instead of ambiguous database-side failures.
  • Stronger Type Safety – The new API ensures that only compatible types can be bound to queries, reducing the risk of subtle bugs.
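For example, here is the kind of typed binding the new API encourages – an illustrative sketch assuming an async context, a hypothetical ks.users table, and the 1.0 module paths:

use scylla::client::session::Session;

// Hypothetical table: CREATE TABLE ks.users (id bigint PRIMARY KEY, name text)
#[derive(scylla::SerializeRow)]
struct NewUser {
    id: i64,
    name: String,
}

async fn insert_user(session: &Session) -> Result<(), Box<dyn std::error::Error>> {
    let user = NewUser { id: 1, name: "Ada".to_owned() };
    // Struct fields are matched to the named bind markers; binding, say, a
    // String where the column is a bigint fails with a clear local type
    // error instead of an obscure server-side serialization failure.
    session
        .query_unpaged(
            "INSERT INTO ks.users (id, name) VALUES (:id, :name)",
            user,
        )
        .await?;
    Ok(())
}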
Deserialization API Refactor (released in 0.15): For Better Performance and Memory Efficiency

Prior to release 0.15, the driver’s deserialization process was burdened with multiple inefficiencies, slowing down applications and increasing memory usage. The first major issue was type erasure — all values were initially converted into the CQL-type-agnostic CqlValue before being transformed into the user’s desired type. This unnecessary indirection introduced additional allocations and copying, making the entire process slower than it needed to be.

But the inefficiencies didn’t stop there. Another major flaw was the eager allocation of columns and rows. Instead of deserializing data on demand, every column in a row was eagerly allocated at once — whether it was needed or not. Even worse, each page of query results was fully materialized into a Vec<Row>. As a result, all rows in a page were allocated at the same time — all of them in the form of the ephemeral CqlValue. This usually required further conversion to the user’s desired type and incurred allocations. For queries returning large datasets, this led to excessive memory usage and unnecessary CPU overhead.

To fix these issues, we introduced a completely redesigned deserialization API. The new approach ensures that:

  • CQL values are deserialized lazily, directly into user-defined types, skipping CqlValue entirely and eliminating redundant allocations.
  • Columns are no longer eagerly deserialized and allocated. Memory is used only for the fields that are actually accessed.
  • Rows are streamed instead of eagerly materialized. This avoids unnecessary bulk allocations and allows more efficient processing of large result sets.

Paging API (released in 0.14)

We heard from our users that the driver’s API for executing queries was prone to misuse with regard to query paging. For instance, the Session::query() and Session::execute() methods would silently return only the first page of the result if page size was set on the statement. On the other hand, if page size was not set, those methods would perform unpaged queries, putting high and undesirable load on the cluster. Furthermore, Session::query_paged() and Session::execute_paged() would only fetch a single page – if page size was set on the statement; otherwise, the query would not be paged at all!

To combat this, we decided to redesign the paging API in a way that no other driver had done before. We concluded that the API must be crystal clear about paging, and that paging will be controlled by the method used, not by the statement itself.

  • We ditched query() and query_paged() (as well as their execute counterparts), replacing them with query_unpaged() and query_single_page(), respectively (similarly for execute*).
  • We separated the setting of page size from the paging method itself. Page size is now mandatory on the statement (before, it was optional). The paging method (no paging, manual paging, transparent automated paging) is now selected by using different session methods ({query,execute}_unpaged(), {query,execute}_single_page(), and {query,execute}_iter(), respectively). This separation is likely the most important change we made to help users avoid footguns and pitfalls.
  • We introduced strongly typed PagingState and PagingStateResponse abstractions. This made it clearer how to use manual paging (available using {query,execute}_single_page()).

Ultimately, we provided a cheat sheet in the Docs that describes best practices regarding statement execution.
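Here is how the three paging styles look side by side – an illustrative sketch (async context assumed; the re-export paths of PagingState and PagingStateResponse are an assumption, so check the crate docs):

use futures::TryStreamExt;
use scylla::client::session::Session;
// Assumed re-export path for the paging types.
use scylla::response::{PagingState, PagingStateResponse};

async fn paging_styles(session: &Session) -> Result<(), Box<dyn std::error::Error>> {
    // 1. Explicitly unpaged: the whole result at once, no silent truncation.
    let _result = session
        .query_unpaged("SELECT name FROM ks.users", &[])
        .await?;

    // 2. Manual paging: one page plus the state needed to request the next.
    let (_page, _paging_state_response) = session
        .query_single_page("SELECT name FROM ks.users", &[], PagingState::start())
        .await?;

    // 3. Transparent automated paging: rows streamed across pages.
    let mut rows = session
        .query_iter("SELECT name FROM ks.users", &[])
        .await?
        .rows_stream::<(String,)>()?;
    while let Some((name,)) = rows.try_next().await? {
        println!("{name}");
    }
    Ok(())
}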
Looking Ahead

The journey doesn’t stop here. We have many ideas for possible future driver improvements:

  • Adding a prelude module containing commonly used driver functionalities.
  • More performance optimizations to push the limits of scalability (and benchmarks to track how we’re doing).
  • Extending CQL execution APIs to combine transparent paging with zero-copy deserialization, and introducing BoundStatement.
  • Designing our own test harness to enable cluster sharing and reuse between tests (with hopes of speeding up test suite execution and encouraging people to write more tests).
  • Reworking CQL execution APIs for less code duplication and better usability.
  • Introducing QueryDisplayer to pretty print results of a query in a tabular way, similar to the cqlsh tool.
  • (In our dreams) Rewriting cqlsh (based on the Python driver) with cqlsh-rs (a wrapper over the Rust driver).

And of course, we’re always eager to hear from the community — your feedback helps shape the future of the driver!

Get Started with ScyllaDB Rust Driver 1.0

If you’re working on cool Rust applications that use ScyllaDB and/or you want to contribute to this Rust driver project, here are some starting points:

  • GitHub Repository: ScyllaDB Rust Driver – Contributions welcome!
  • Crates.io: Scylla Crate
  • Documentation: crate docs on docs.rs, the guide to the driver.

And if you have any questions, please contact us on the community forum or ScyllaDB User Slack (see the #rust-driver channel).

ScyllaDB Rust Driver 1.0 is Officially Released

The long-awaited ScyllaDB Rust Driver 1.0 is finally released. This open source project was designed to bring a stable, high-performance, production-ready CQL driver to the Rust ecosystem. Key changes in the 1.0 release include:

- Improved Stability: We removed unstable dependencies and put them behind feature flags. This keeps the driver stable, flexible, and future-proof while still allowing access to powerful-yet-unstable third-party libraries when needed.
- Refactored Error Types: The error types were significantly improved for clarity, type safety, and diagnostic information. This makes debugging easier and prevents API-breaking changes in future updates.
- Refactored Module Structure: The module structure was reorganized to better reflect abstraction layers and improve clarity. This makes the driver's architecture more understandable and simplifies importing items.
- Easier TLS Setup: Rustls support provides a Rust-native alternative to openssl. This simplifies TLS configuration and can prevent system library issues.
- Faster and Extended Metrics: New metrics were added, and metrics were optimized using an atomic histogram that reduces CPU overhead. The entire metrics module is now optional, so users who don't need it won't pay any performance cost for it.

Read the release notes

In this post, we'll shed light on why we took this unconventionally long (years-long) path from a popular, production-ready 0.x release to a 1.0 release. We'll also share our versioning/release plans from this point forward. Inside ScyllaDB Rust Driver 1.0: A Fully Async Shard-Aware CQL Driver Using Tokio provides a deep dive into exactly what we changed and why.

Read the deep dive into what we changed and why

The Path to "1.0"

Over the past few years, the Rust driver has proven itself to be high quality, with very few bugs compared to other drivers, as well as better performance. It is successfully used by customers in production, and by us internally. By all measures, we have considered it fully production-ready for a long time. Given that, why did we keep releasing 0.x versions?

Although we were confident in the driver's quality, we weren't satisfied with some aspects of its API. Keeping the version at 0.x was our way of saying that breaking changes are expected often. Frequent breaking changes are not great for our users: instead of just updating the driver, they have to adjust their code after pretty much every update. At the same time, 0.x version numbers suggest that the driver is not actually production-ready (when in this case, it truly was). So we really wanted to release a 1.0 version.

One option was to just call one of the previous versions (e.g., 0.9) version 1.0 and be done with it. But we knew there were still many breaking changes we wanted to make – and if we kept introducing planned changes, we would quickly arrive at a high version number like 7.0. In Rust (and semver in general), 1.0 is called an "API-stable" version. There is no strict definition of that term, so it can have various interpretations. What's perfectly clear, however, is that rapidly releasing major versions – thus quickly arriving at a high major version number – does not constitute API stability. It also does nothing to help users easily update: they would still need to change their code after most updates!

We also realized that we will never be able to achieve complete stabilization. There are, and will probably always be, things that we want to improve in our API. We don't want stability to stand in the way of driver refinement.
Even if we somehow achieved an API that we were fully satisfied with and didn't want to change at all, there is another reason for change: the databases that the driver supports (ScyllaDB and Cassandra) are constantly evolving, and some of those changes may require modifying the driver API. For example, ScyllaDB recently introduced a new replication mechanism: Tablets. It is possible to add Tablets support to a driver without a breaking change. We did that in our other drivers, which are forks, because we can't break compatibility there. However, it requires ugly workarounds. With Tablets, calculating a replica list for a request requires knowing which table the request uses. Tablets are per-table data structures, which means that different tables may have different replica sets for the same token (as opposed to the token ring, which is per-keyspace). This affects many APIs in the driver: Metadata, Load Balancing, and Request Routing, to name just a few. In Rust Driver, we could adapt those APIs cleanly, and we want to continue doing so when major changes are introduced in ScyllaDB or Cassandra.

Given those constraints, we reached a compromise. We decided to focus on the API-breaking changes we had planned and complete a big portion of them – making the API more future-proof and flexible. This reduces the risk of being forced to make unwanted API-breaking changes in the future.

What's Next for Rust Driver

Now that we've reached the long-anticipated "1.0" status, what's next? We will focus on other driver tasks that do not require changing the API. Those will be released as minor updates (1.x versions). Releasing minor versions means that our users can easily update the driver without changing their code, so they will quickly get the latest improvements.

Of course, we won't stay at 1.0 forever. We don't know exactly when the 2.0 release will happen, but we want to provide some reasonable stability to make life easier for our users. We've settled on 9 months for 1.0 – so 2.0 won't be released any earlier than 9 months after the 1.0 release date. For future versions (3.0, etc.), this period will almost certainly be increased, since we will have already smoothed out more and more of the API's rough edges. When a new major version (e.g., 2.0) is released, we will keep supporting the previous major version (e.g., 1.x) with bugfixes, but no new functionality. The duration of such support is not yet decided. This will also make migration to a new major version a bit easier.

Get Started with Rust Driver 1.0

If you're ready to get started, take a look at:

- GitHub Repository: ScyllaDB Rust Driver – contributions welcome!
- Crates.io: the Scylla crate
- Documentation: the crate docs on docs.rs and the guide to the driver

And if you have any questions, please contact us on the community forum or ScyllaDB User Slack (see the #rust-driver channel).

Upcoming ScyllaDB University LIVE and Community Forum Updates

What to expect at the upcoming ScyllaDB University LIVE training event – and what's trending on the community forum

Following up on all the interest in ScyllaDB – at Monster SCALE Summit and a whirlwind of in-person events around the world – let's continue the ScyllaDB conversation. Is ScyllaDB a good fit for your use case? How do you navigate some of the decisions you face when getting started? We're here to help! In this post, I'll update you on the upcoming ScyllaDB University LIVE training event and highlight some trending topics from the community forum.

ScyllaDB University LIVE

Our next ScyllaDB University LIVE training event will be held on Wednesday, April 9, 2025, 8 AM – 10 AM PDT. This is a free live virtual training led by our top engineers and architects. Whether you're just curious about ScyllaDB or an experienced user looking to master advanced strategies, join us for ScyllaDB University LIVE! Sessions are interactive and NOT available on-demand – be sure to mark your calendar and attend! You will have a chance to run hands-on labs throughout the event and learn by actually doing. The team and I are preparing lots of new examples and exercises – so if you've joined before, there's a great excuse to join again. 😉

Register here

The event will feature two parallel tracks: Essentials and Advanced.

Essentials Track

The Essentials track (Getting Started with ScyllaDB) is intended for people new to ScyllaDB. I will start with a talk covering a quick overview of NoSQL and where ScyllaDB fits in the NoSQL world. Next, you will run the Quick Wins labs, in which you'll see how easy it is to start a ScyllaDB cluster, create a keyspace, create a table, and run some basic queries. After the lab, you'll learn about ScyllaDB's basic architecture, including nodes, clusters, data replication, Replication Factor, how the database partitions data, Consistency Level, multiple data centers, and an example of what happens when we write data to a cluster. We'll cover data modeling fundamentals for ScyllaDB. Key concepts include the differences in data modeling between NoSQL and relational databases, Keyspace, Table, Row, CQL, the CQL shell, Partition Key, and Clustering Key. After that, you'll run another lab, where you'll put the data modeling theory into practice. Finally (if we have enough time left), we will discuss ScyllaDB's special shard-aware drivers.

The next part of this session is led by Attila Toth. Here, we'll walk through a real-world application and see how the different concepts from the previous talk come into play. We'll also use a lab where you can do the coding and test it yourself. Additionally, you will see a demo application running one million ops/sec with single-digit millisecond latency and learn how to run this demo yourself.

Advanced Track

In the Advanced track (Extreme Elasticity and Performance), Tzach Livyatan and Felipe Mendes will take a deep dive into ScyllaDB's unique features and tooling, such as Workload Prioritization, as well as advanced data modeling and tips for using counters and Time To Live (TTL). You'll learn how ScyllaDB's new Tablets feature enables extreme elasticity without any downtime, and how to run multiple workloads on a single cluster. The two talks in this track will also use multiple labs that you can run yourself during the event.

Before the event, please make sure you have a (free) ScyllaDB University account. We will use this platform during the event for the hands-on labs.
Register on ScyllaDB University

Trending Topics on the Community Forum

The community forum is the place to discuss anything ScyllaDB and NoSQL related, learn from your peers, share how you're using ScyllaDB, and ask questions about your use case. It's where you can read the popular weekly "Last week in scylladb.git master" updates from Avi Kivity, our co-founder and CTO (for example, here). It's also the place to learn about new releases and events.

Say Hello here

Many of the new topics focus on performance issues, troubleshooting, specific use case questions, and general data modeling questions. Many of the recent discussions have been about Tablets and how this feature affects performance and elasticity. Here's a summary of some of the top topics since my last update:

- Latency spikes and hot partitions: A user asked about latency spikes, hot partitions, and how to detect them. Key insights shared in this discussion emphasize the importance of understanding compaction settings and implementing strategies to mitigate tombstone accumulation.
- Upgrade paths and Tablets integration: The introduction of the Tablets feature led to significant discussions regarding its adoption for scaling purposes. A user discussed the process of enabling this feature after an upgrade, and its effects on performance, in posts like this one.
- General cluster management support: Different contributors actively assisted newcomers by clarifying various admin procedures, such as schema migrations, compaction, and SSTable behavior. One such discussion deals with the process for gracefully stopping ScyllaDB.
- Data modeling: A popular topic was data modeling and the data model's effect on performance for specific use cases. Users exchanged ideas on addressing challenges tied to row-level reads, batching, drivers, and the implications of large (and hot) partitions. One such discussion dealt with data modeling when subgroups of data have very different volumes.
- Alternator: The DynamoDB-compatible API was a popular topic. Users asked how views work under the hood with Alternator, as well as other questions related to DynamoDB compatibility and performance.

Hope to see you at the ScyllaDB University LIVE event! Meanwhile, stay in touch.

Introduction to similarity search: Part 2–Simplifying with Apache Cassandra® 5’s new vector data type

In Part 1 of this series, we explored how you can combine Cassandra 4 and OpenSearch to perform similarity searches with word embeddings. While that approach is powerful, it requires managing two different systems.

But with the release of Cassandra 5, things become much simpler.

Cassandra 5 introduces a native VECTOR data type and built-in Vector Search capabilities, simplifying the architecture by enabling Cassandra 5 to handle storage, indexing, and querying seamlessly within a single system.

Now in Part 2, we’ll dive into how Cassandra 5 streamlines the process of working with word embeddings for similarity search. We’ll walk through how the new vector data type works, how to store and query embeddings, and how the Storage-Attached Indexing (SAI) feature enhances your ability to efficiently search through large datasets.

The power of vector search in Cassandra 5

Vector search is a game-changing feature added in Cassandra 5 that enables you to perform similarity searches directly within the database. This is especially useful for AI applications, where embeddings are used to represent data like text or images as high-dimensional vectors. The goal of vector search is to find the closest matches to these vectors, which is critical for tasks like product recommendations or image recognition.

The key to this functionality lies in embeddings: arrays of floating-point numbers that represent objects (such as words, documents, or images) so that semantically similar objects end up with similar vectors. By storing these embeddings as vectors in Cassandra, you can use Vector Search to find connections in your data that may not be obvious through traditional queries.

How vectors work

Vectors are fixed-size sequences of non-null values, much like lists. However, in Cassandra 5, you cannot modify individual elements of a vector — you must replace the entire vector if you need to update it. This makes vectors ideal for storing embeddings, where you need to work with the whole data structure at once.

When working with embeddings, you’ll typically store them as vectors of floating-point numbers to represent the semantic meaning.
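To make the replace-the-whole-vector rule concrete, here is a minimal, hypothetical sketch using the Python driver (it assumes Cassandra 5 and a recent cassandra-driver release with vector support; the demo keyspace, table, and values are made up, and session is an open connection like the one created later in this post):

# Hypothetical schema, for illustration only:
#   CREATE TABLE demo.items (id uuid PRIMARY KEY, v vector<float, 3>);
from uuid import uuid4

item_id = uuid4()
session.execute("INSERT INTO demo.items (id, v) VALUES (%s, %s)",
                (item_id, [0.1, 0.2, 0.3]))

# There is no way to set v[1] alone; "updating" means writing the entire vector back:
session.execute("UPDATE demo.items SET v = %s WHERE id = %s",
                ([0.1, 0.9, 0.3], item_id))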

Storage-Attached Indexing (SAI): The engine behind vector search

Vector Search in Cassandra 5 is powered by Storage-Attached Indexing, which enables high-performance indexing and querying of vector data. SAI is essential for Vector Search, providing the ability to create column-level indexes on vector data types. This ensures that your vector queries are both fast and scalable, even with large datasets.

SAI isn’t just limited to vectors—it also indexes other types of data, making it a versatile tool for boosting the performance of your queries across the board.

Example: Performing similarity search with Cassandra 5’s vector data type

Now that we’ve introduced the new vector data type and the power of Vector Search in Cassandra 5, let’s dive into a practical example. In this section, we’ll show how to set up a table to store embeddings, insert data, and perform similarity searches directly within Cassandra.

Step 1: Setting up the embeddings table

To get started with this example, you’ll need access to a Cassandra 5 cluster. Cassandra 5 introduces native support for vector data types and Vector Search, available on Instaclustr’s managed platform. Once you have your cluster up and running, the first step is to create a table to store the embeddings. We’ll also create an index on the vector column to optimize similarity searches using SAI.

CREATE KEYSPACE aisearch WITH REPLICATION = {'class': 'SimpleStrategy', 'replication_factor': 1};

 

CREATE TABLE IF NOT EXISTS embeddings ( 
    id UUID, 
    paragraph_uuid UUID, 
    filename TEXT, 
    embeddings vector<float, 300>, 
    text TEXT, 
    last_updated timestamp, 
    PRIMARY KEY (id, paragraph_uuid) 
); 
 

CREATE INDEX IF NOT EXISTS ann_index 
  ON embeddings(embeddings) USING 'sai';

This setup allows us to store the embeddings as 300-dimensional vectors, along with metadata like file names and text. The SAI index will be used to speed up similarity searches on the embeddings column.

You can also fine-tune the index by specifying the similarity function to be used for vector comparisons. Cassandra 5 supports three types of similarity functions: DOT_PRODUCT, COSINE, and EUCLIDEAN. By default, the similarity function is set to COSINE, but you can specify your preferred method when creating the index:

CREATE INDEX IF NOT EXISTS ann_index 
    ON embeddings(embeddings) USING 'sai' 
WITH OPTIONS = { 'similarity_function': 'DOT_PRODUCT' };

Each similarity function has its own advantages depending on your use case. DOT_PRODUCT is often used when you need to measure the direction and magnitude of vectors, COSINE is ideal for comparing the angle between vectors, and EUCLIDEAN calculates the straight-line distance between vectors. By selecting the appropriate function, you can optimize your search results to better match the needs of your application.
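To build intuition for how these measures differ, here's a small, self-contained sketch in plain numpy (independent of Cassandra; the vectors are made up). Two vectors pointing in the same direction are a perfect cosine match even though their dot product and Euclidean distance say otherwise:

import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])  # same direction as a, twice the magnitude

dot = float(np.dot(a, b))                                               # direction and magnitude: 28.0
cosine = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))  # angle only: 1.0 (identical direction)
euclidean = float(np.linalg.norm(a - b))                                # straight-line distance: ~3.74

print(dot, cosine, euclidean)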

Step 2: Inserting embeddings into Cassandra 5

To insert embeddings into Cassandra 5, we can use the same code from the first part of this series to extract text from files, load the FastText model, and generate the embeddings. Once the embeddings are generated, the following function will insert them into Cassandra:

import time
from uuid import uuid4, UUID
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement
from cassandra.policies import DCAwareRoundRobinPolicy
from cassandra.auth import PlainTextAuthProvider
from google.colab import userdata

# Connect to the cluster
cluster = Cluster(
    # Replace with your cluster's IP address list
    ["xxx.xxx.xxx.xxx", "xxx.xxx.xxx.xxx", "xxx.xxx.xxx.xxx"],
    load_balancing_policy=DCAwareRoundRobinPolicy(local_dc='AWS_VPC_US_EAST_1'),  # Update the local data centre if needed
    port=9042,
    auth_provider=PlainTextAuthProvider(
        username='iccassandra',
        password='replace_with_your_password'
    )
)
session = cluster.connect()

print('Connected to cluster %s' % cluster.metadata.cluster_name)

keyspace_name = "aisearch"  # The keyspace created in Step 1

def insert_embedding_to_cassandra(session, embedding, id=None, paragraph_uuid=None, filename=None, text=None, keyspace_name=None):
    try:
        embeddings = list(map(float, embedding))

        # Generate UUIDs if not provided
        if id is None:
            id = uuid4()
        if paragraph_uuid is None:
            paragraph_uuid = uuid4()
        # Ensure id and paragraph_uuid are UUID objects
        if isinstance(id, str):
            id = UUID(id)
        if isinstance(paragraph_uuid, str):
            paragraph_uuid = UUID(paragraph_uuid)

        # Create the query string with bind markers
        insert_query = f"""
        INSERT INTO {keyspace_name}.embeddings (id, paragraph_uuid, filename, embeddings, text, last_updated)
        VALUES (?, ?, ?, ?, ?, toTimestamp(now()))
        """

        # Create a prepared statement from the query
        prepared = session.prepare(insert_query)

        # Execute the query
        session.execute(prepared.bind((id, paragraph_uuid, filename, embeddings, text)))

        return None  # Successful insertion

    except Exception as e:
        error_message = f"Failed to execute query:\nError: {str(e)}"
        return error_message  # Return error message on failure

def insert_with_retry(session, embedding, id=None, paragraph_uuid=None,
                      filename=None, text=None, keyspace_name=None, max_retries=3,
                      retry_delay_seconds=1):
    retry_count = 0
    while retry_count < max_retries:
        result = insert_embedding_to_cassandra(session, embedding, id, paragraph_uuid, filename, text, keyspace_name)
        if result is None:
            return True  # Successful insertion
        else:
            retry_count += 1
            print(f"Insertion failed on attempt {retry_count} with error: {result}")
            if retry_count < max_retries:
                time.sleep(retry_delay_seconds)  # Delay before the next retry
    return False  # Failed after max_retries

# Replace the file path with the one pointing to the desired file;
# the extraction helpers and text_to_vector come from Part 1 of this series
file_path = "/path/to/Cassandra-Best-Practices.pdf"
paragraphs_with_embeddings = extract_text_with_page_number_and_embeddings(file_path, text_to_vector)

from tqdm import tqdm

for paragraph in tqdm(paragraphs_with_embeddings, desc="Inserting paragraphs"):
    if not insert_with_retry(
        session=session,
        embedding=paragraph['embedding'],
        id=paragraph['uuid'],
        paragraph_uuid=paragraph['paragraph_uuid'],
        text=paragraph['text'],
        filename=paragraph['filename'],
        keyspace_name=keyspace_name,
        max_retries=3,
        retry_delay_seconds=1
    ):
        # Display an error message if insertion fails
        tqdm.write(f"Insertion failed after maximum retries for UUID {paragraph['uuid']}: {paragraph['text'][:50]}...")

This function handles inserting embeddings and metadata into Cassandra, ensuring that UUIDs are correctly generated for each entry.

Step 3: Performing similarity searches in Cassandra 5

Once the embeddings are stored, we can perform similarity searches directly within Cassandra using the following function:

import numpy as np

# ------------------ Embedding Functions ------------------
def text_to_vector(text):
    """Convert a text chunk into a vector using the FastText model (loaded as in Part 1)."""
    words = text.split()
    vectors = [fasttext_model[word] for word in words if word in fasttext_model.key_to_index]
    return np.mean(vectors, axis=0) if vectors else np.zeros(fasttext_model.vector_size)

def find_similar_texts_cassandra(session, input_text, keyspace_name=None, top_k=5):
    # Convert the input text to an embedding (a plain list of floats for the driver)
    input_embedding = list(map(float, text_to_vector(input_text)))
    input_embedding_str = ', '.join(map(str, input_embedding))

    # ANN query: ORDER BY ... ANN OF performs the vector search,
    # while similarity_cosine() reports the score for each returned row
    query = f"""
    SELECT text, filename, similarity_cosine(embeddings, ?) AS similarity
    FROM {keyspace_name}.embeddings
    ORDER BY embeddings ANN OF [{input_embedding_str}]
    LIMIT {top_k};
    """

    prepared = session.prepare(query)
    bound = prepared.bind((input_embedding,))
    rows = session.execute(bound)

    # Sort the results by similarity (highest first) in Python
    similar_texts = sorted([(row.similarity, row.filename, row.text) for row in rows], key=lambda x: x[0], reverse=True)

    return similar_texts[:top_k]

from IPython.display import display, HTML 

# The word you want to find similarities for 
input_text = "place" 

# Call the function to find similar texts in the Cassandra database 
similar_texts = find_similar_texts_cassandra(session, input_text, keyspace_name="aisearch", top_k=10)

This function searches for similar embeddings in Cassandra and retrieves the top results based on cosine similarity. Under the hood, Cassandra’s vector search uses Hierarchical Navigable Small Worlds (HNSW). HNSW organizes data points in a multi-layer graph structure, making queries significantly faster by narrowing down the search space efficiently—particularly important when handling large datasets.

Step 4: Displaying the results

To display the results in a readable format, we can loop through the similar texts and present them along with their similarity scores:

# Print the similar texts along with their similarity scores
for similarity, filename, text in similar_texts:
    html_content = f"""
    <div style="margin-bottom: 10px;">
        <p><b>Similarity:</b> {similarity:.4f}</p>
        <p><b>Text:</b> {text}</p>
        <p><b>File:</b> {filename}</p>
    </div>
    <hr/>
    """

    display(HTML(html_content))

This code will display the top similar texts, along with their similarity scores and associated file names.

Cassandra 5 vs. Cassandra 4 + OpenSearch®

Cassandra 4 relies on an integration with OpenSearch to handle word embeddings and similarity searches. This approach works well for applications that are already using or comfortable with OpenSearch, but it does introduce additional complexity with the need to maintain two systems.

Cassandra 5, on the other hand, brings vector support directly into the database. With its native VECTOR data type and similarity search functions, it simplifies your architecture and improves performance, making it an ideal solution for applications that require embedding-based searches at scale.

Feature           | Cassandra 4 + OpenSearch  | Cassandra 5 (Preview)
Embedding Storage | OpenSearch                | Native VECTOR data type
Similarity Search | kNN plugin in OpenSearch  | COSINE, EUCLIDEAN, DOT_PRODUCT
Search Method     | Exact k-Nearest Neighbor  | Approximate Nearest Neighbor (ANN)
System Complexity | Requires two systems      | All-in-one Cassandra solution

Conclusion: A simpler path to similarity search with Cassandra 5

With Cassandra 5, the complexity of setting up and managing a separate search system for word embeddings is gone. The new vector data type and Vector Search capabilities allow you to perform similarity searches directly within Cassandra, simplifying your architecture and making it easier to build AI-powered applications.

Coming up: more in-depth examples and use cases that demonstrate how to take full advantage of these new features in Cassandra 5 in future blogs!

Ready to experience vector search with Cassandra 5? Spin up your first cluster for free on the Instaclustr Managed Platform and try it out!


Monster Scale Summit Recap: Scaling Systems, Databases, and Engineering Leadership

Monster Scale Summit brought together some of the sharpest minds in distributed systems, data infrastructure, and engineering leadership – all focused on one thing: what it really takes to build and operate systems at scale. From database internals to leadership lessons, here are some highlights from two packed days of tech talks.

Watch On-Demand

Kelsey Hightower: Engineering at Scale: What Separates Leaders from Laggards

We kicked off with a candid conversation with Kelsey Hightower, a name that needs no introduction if you've ever dealt with Kubernetes, CoreOS, or even Puppet. Kelsey has been at the center of some of the biggest shifts in infrastructure over the past decade. Hearing his perspective on what separates companies that succeed at scale from those that don't was the most memorable part of the event for me. Kelsey tackled questions such as:

- Misconceptions in scaling engineering efforts: What common mistakes do engineers make?
- Design trade-offs: How do you balance the need to move fast while still designing for future growth?
- Avoiding over-engineering: How do you build just enough to handle scale without building complexity that slows you down?
- Developer experience and tooling: How do you give teams the right tools without overwhelming them?
- Leadership balance: How do technical depth and soft skills factor into great engineering leadership?

And of course, I couldn't resist asking: "Good programmers copy, great programmers paste" – is that still true? Spoiler: his answer was "They use ChatGPT!" Kelsey shared razor-sharp, unfiltered insights throughout the unscripted live session. If you care about engineering leadership in high-scale environments, watch this – now.

Dor Laor, ScyllaDB CEO: Pushing the Boundaries of Performance

Dor Laor, ScyllaDB CEO and Co-founder, took the virtual stage to share 10 years of lessons learned building ScyllaDB, a database designed for extreme speed and scale. Dor walked us through:

- The shard-per-core design that sets ScyllaDB apart.
- How ScyllaDB evolved from an idea (codename: "Sea Star" [C*]) to production systems handling billions of operations per day.
- What's next in terms of performance, cost-efficiency, and scalability.

Organizations have wasted time and money overprovisioning other databases at scale. Dor presented the next generation of ScyllaDB X Cloud, which provides true elasticity and unmatched storage capability, unique to ScyllaDB. If you're dealing with high-throughput, low-latency database workloads, take some time to absorb all the advances introduced… and how they might help your team.

Real-World Scaling Stories from Industry Leaders

One of the best parts of Monster Scale was hearing directly from the people building and operating some of the largest systems on the planet. Some of the talks that got the chat buzzing include…

Extreme Scale in Action

- Cloudflare: Serving millions of boot artifacts to a global audience.
- Agoda: Scaling 50x throughput with ScyllaDB.
- Discord: Handling trillions of search requests.
- American Express: Sharing design choices for routing global payments.
- Canva: Running machine learning workflows with over 100M images/day.

Database Internals and Their Impacts

- Avi Kivity (ScyllaDB CTO): Deep dive into engineering advances enabling massive scale.
- Felipe Mendes (ScyllaDB Technical Director): Detailed breakdown of how ScyllaDB stacks up against Cassandra 5.0.
- Responsive: Almog Gavra on replacing RocksDB with ScyllaDB to achieve next-level Kafka stream processing.
Optimizing Cost and Performance in the Cloud

- ScyllaDB: Cloud cost reduction, tiered storage, and high availability (HA) strategies.
- Slack: Managing 300+ mission-critical cron jobs efficiently.
- Yieldmo: Real savings from moving off DynamoDB to ScyllaDB.

Gwen Shapira: Reengineering Postgres for Millions of Tenants

If you think relational databases can't handle scale, Gwen Shapira showed up to challenge that. She detailed how Nile is rethinking Postgres to serve millions of tenants and shared the real operational challenges behind that journey. Her bottom line: "Scaling relational data is frigging hard." But it's also possible if you know what you're doing.

ShareChat: Building One of the World's Largest Feature Stores

With over 300M monthly active users, ShareChat has built a feature store that processes over a billion features per second. David and Ivan walked us through how they got there, the role ScyllaDB plays, and what they're doing now to optimize cost without compromising on scale.

Martin Kleppmann + Chris Riccomini: Designing Data-Intensive Apps in 2025

Yes, Martin and Chris confirmed an update to "Designing Data-Intensive Applications" is on the way. But this wasn't a book promo – it was a frank discussion on real-world data architecture, including what's broken and what still works when scaling distributed systems.

Avi Kivity: ScyllaDB's Monstrous Engineering Advances

Avi took us through ScyllaDB's latest innovations, from internals to future plans – essential viewing if you're using ScyllaDB and/or you're curious about the engineering behind high-performance, distributed databases.

More Sessions on Tackling Scale Head-On

- Resonate, Antithesis, Turso, poolside, Uber: Simple (and not-so-simple) mechanics of scaling.
- Medium, Alex DeBrie, Guilherme Nogueira + Nadav Har'El, Patrick Bossman: The reality of DynamoDB costs and why customers switch to ScyllaDB – plus practical migration insights.
- Kostja Osipov (ScyllaDB): Real lessons in surviving majority failures and consensus mechanics.
- Dzejla Medjedovic (Social Explorer): Exploring the benefits and tradeoffs between B-trees, B^eps-trees, and LSM-trees.

Ethan Donowitz: Database Upgrades with Shadow Clusters at Discord

Ethan gave us a compelling presentation on the use of "shadow clusters" at Discord to effectively de-risk the upgrade process in large-scale production systems. This included insights on how to build, mirror, validate, test, and monitor – all practical tips you can apply to your own database environments.

Rachel Stephens + Adam Jacob: Scaling is the Funnest Game

Rachel and Adam gave us their honest take on the human side of scaling, with plenty of fun stories about technical trade-offs and why business context matters as much as engineering decisions. To quote Adam (while recounting some anecdotal coffee shop encounters with Chef users): "There is no funner game than the at-scale technology game."

Personal Takeaways

As an event host, I get the chance to review the recordings before the show – but it's not until the entire show is assembled and streamed online that the true depth and quality of the content becomes apparent to me. Also, what a privilege it was to interview Kelsey in person. I've used many of the systems and software he has influenced, so having a chat with him was both inspiring and grounding. You couldn't ask for a better role model in software engineering leadership. Cheers mate!
Monster Scale Summit wasn’t just about theory — it was about what happens when systems, teams, and businesses hit real limits and what it takes to move past them. From deep engineering to leadership lessons, if you’re working on systems that need to scale and perform predictably, this was a treasure trove of insights. And if you missed it? Check out the replays — because this is the kind of knowledge that will save you months of effort and pain. Watch Tech Talk Replays On-Demand    Behind the scenes, from the perspective of Wayne’s Ray-Ban Smart Glasses  

High Performance on a Low Budget: Gwen Shapira’s Tips for Startups

How even a scrappy early-stage startup can deliver outstanding performance

"It's one thing to solve performance challenges when you have plenty of time, money and expertise available. But what do you do in the opposite situation: if you are a small startup with no time or money and still need to deliver outstanding performance?" – Gwen Shapira, co-founder of Nile (PostgreSQL reengineered for multi-tenant apps)

That's the multi-million-dollar question for many early-stage startups. And who better to lead that discussion than Gwen Shapira, who has tackled performance from two vastly different perspectives? After years of focusing on data systems performance at large organizations, she recently pivoted to leading a startup – where she found herself responsible for full-stack performance, from the ground up. In her P99 CONF keynote, "High Performance on a Low Budget," Gwen explored the topic by sharing performance challenges she and her small team faced at Nile – how they approached them, the tradeoffs, and the lessons learned. A few takeaways that set the conference chat on fire:

- Benchmarks should pay rent
- Keep tests stupid simple
- If you don't have time to optimize, at least don't pessimize

But the real value is in hearing the experiences behind these and other zingers. You can watch her talk below or keep reading for a guided tour.

Enjoy Gwen's insights? She'll be delivering another keynote at Monster SCALE Summit – alongside Kelsey Hightower and engineers from Discord, Disney+, Slack, Canva, Atlassian, Uber, ScyllaDB, and many other leaders sharing how they're tackling extreme-scale engineering challenges. Join us – it's free + virtual.

Get Your Conference Pass

Do Worry About Performance Before You Ship It

Per Gwen, founders get all sorts of advice on how to run the company (whether they want it or not). Regarding performance, it's common to hear tips like:

- Don't worry about performance until users complain.
- Don't worry about performance until you have product-market fit.
- Performance is a good problem to have. If people complain about performance, it is a great sign!

But she respectfully disagrees. After years of focusing on performance, she's all too familiar with the aftermath of that approach. Gwen shared, "As you talk to people, you want to figure out the minimal feature set required to bring an impactful product to market. And performance is part of this package. You should discover the target market's performance expectations when you discover the other expectations."

Things to consider at an early stage:

- If you're trying to beat the competition on performance, how much faster do you need to be?
- Even if performance is not your key differentiator, what are your users' expectations regarding performance?
- To what extent are users willing to accept latency or throughput tradeoffs for different capabilities – or accept higher costs to avoid those tradeoffs?

Founders are often told that even if you're not fully satisfied with the product, just ship it and see how people react. Gwen's reaction: "If you do a startup, there is a 100% chance that you will ship something that you're not 100% happy with. But the reason you ship early and iterate is because you really want to learn fast. If you identified performance expectations during the discovery phase, try to ship something in that ballpark.
Otherwise, you're probably not learning all that much – with respect to performance, at least."

Hyperfocus on the User's Perceived Latency

For startups looking to make the biggest performance impact with limited resources, you can "cheat" by focusing on what will really help you attract and retain customers: optimizing the user's perceived latency.

Web apps are a great place to begin. Even if your core product is an eBPF-based edge database, your users will likely be interacting with a web app from the start of the user journey. Plus, there are lots of nice metrics to track (for example, Google's Core Web Vitals). Gwen noted, "Startups very rarely have a scale problem. If you do have a scale problem, you can, for example, put people on a wait list while you're determining how to scale out and add machines. However, even if you have a small number of users, you definitely care about them having a great experience with low latency. And perceived low latency is what really matters."

For example, consider Nile's dashboard. When users logged in, the Nile team wanted to impress them by having this cool dashboard load instantly. However, they found that response times ranged from a snappy 200 milliseconds to a terrible 10+ seconds. To tackle the problem, the team started by parallelizing requests, filling in dashboard elements as data arrived and creating a progressive loading experience. These optimizations helped – and progressive loading turned out to be a fantastic way to hide latency (keeping the user engaged, like mirrors distracting you in a slow elevator).

However, the optimizations exposed another issue: the app was making 2,000 individual API calls just to count open tickets. This is the classic N+1 problem (running a query for each result instead of a single optimized query that retrieves all the necessary data at once). Naturally, that inspired some API refinement and tuning.

Then, another discovery. Their front-end dev noticed they were fetching more data than needed, so he cached it in the browser. This update sped up dashboard interactions by serving pre-cached data from the browser's local storage. However, despite all those optimizations, the dashboard remained data-heavy. "Our customers loved it, but there was no reason why it had to be the first page after logging in," Gwen remarked. So they moved the dashboard a layer down in the navigation. In its place, they added a much simpler landing page with easy access to the most common user tasks.

Benchmarks Should Pay Rent

Next topic: the importance of being strategic about benchmarking. "Performance people love benchmarking (and I'm guilty of that)," Gwen admitted. "But you can spend infinite time benchmarking with very little to show for it. So I want to share some tips on how to spend less time and have more to show for it."

She continued, "Benchmarks should pay rent by answering some important questions that you have. If you don't have an important question, don't run a benchmark. There, I just saved you weeks of your life – something invaluable for startups. You can thank me later."

If your primary competitive advantage is performance, you will be expected to share performance tests to (attempt to) prove how fast and cool you are. Call it "benchmarketing." For everyone else, two common questions to answer with benchmarking are:

- Is our database setup optimal?
- Are we delivering a good user experience?
To assess the database setup, teams tend to:

- Turn to a standard benchmark like TPC-C
- Increase load over time
- Look for bottlenecks
- Fix what they can

But is this really the best way for a startup to spend its limited time and resources? Given her background as a performance expert, Gwen couldn't resist doing such a benchmark early on at Nile. But she doesn't recommend it – at least not for startups: "First of all, it takes a lot of time to run the standard benchmarks when you're not used to doing it week in, week out. It takes time to adjust all the knobs and parameters. It takes time to analyze the results, rinse and repeat. Even with good tools, it's never easy."

They did identify and fix some low-hanging fruit from this exercise. But since the tests were based on a standard benchmark, it was unclear how well they mapped to actual user experiences. Gwen continued, "I didn't feel the ROI was exactly compelling. If you're a performance expert and it takes you only about a day, it's probably worth it. But if you have to get up to speed, if you're spending significant time on the setup, you're better off focusing your efforts elsewhere."

A better question to obsess over is "Are we delivering a good experience?" More specifically, focus on these three areas:

- Optimizing user onboarding paths
- Addressing performance issues that annoy developers (these likely annoy users too)
- Paying attention to metrics that customers obsess over – even if they're not the ones your team has focused on

Keep Benchmarking Tests Stupid Simple

Another testing lesson learned: focus on extra-stupid sanity tests. At Nile, the team ran the simplest possible queries, like loading an empty page or querying an empty table. If those were slow, there was no point in running more complex tests. Stop, fix the problem, then proceed with more interesting tests.

Also, obsess over understanding what the numbers actually measure. You don't want to base critical performance decisions on misleading results (e.g., empty responses) or unintended behaviors. For example, her team once intended to test the write path but ended up testing the read path, thanks to a misconfigured DNS.

Build Infrastructure for Long-Term Value, Optimize for Quick Wins

The instrumentation and observability tools put in place during testing will pay off for years to come. At Nile, this infrastructure became invaluable throughout the product's lifetime for answering the persistent question "Why is it slow?" As Gwen put it: "Those early performance test numbers, that instrumentation, all the observability – this is very much a gift that keeps on giving as you continue to build and users inevitably complain about performance."

When prioritizing performance improvements, look for quick wins. For example, Nile found that a slow request was spending 40% of its time on parsing, 40% on lookups, and just 20% on actual work. The developer realized he could reuse an existing caching library to speed up lookups. That was a nice quick win – giving 40% of the time back with minimal effort. However, if he'd said, "I'm not 100% sure about caching, but I have this fast JSON parsing library," then that would have been a better way to shave off an equivalent 40%. About a year later, they pushed most of the parsing down to a Postgres extension that was written in C and nicely optimized. The optimizations never end!

No Time to Optimize? Then At Least Don't Pessimize

Gwen's final tip involved empowering experienced engineers to make common-sense improvements.
“Last but not least, sometimes you really don’t have time to optimize. But if you have a team of experienced engineers, they know not to pessimize. They are familiar with faster JSON libraries, async libraries that work behind the scenes, they know not to put slow stuff on the critical path and so on. Even if you lack the time to prove that these things are actually faster, just do them. It’s not premature optimization. It’s just avoiding premature pessimization.”

Introduction to similarity search with word embeddings: Part 1–Apache Cassandra® 4.0 and OpenSearch®

Word embeddings have revolutionized how we approach tasks like natural language processing, search, and recommendation engines.

They allow us to convert words and phrases into numerical representations (vectors) that capture their meaning based on the context in which they appear. Word embeddings are especially useful for tasks where traditional keyword searches fall short, such as finding semantically similar documents or making recommendations based on textual data.

[Figure: scatter plot of word embeddings]

For example: a search for "Laptop" might return results related to "Notebook" or "MacBook" when using embeddings (as opposed to something like "Tablet"), offering a more intuitive and accurate search experience.

As applications increasingly rely on AI and machine learning to drive intelligent search and recommendation engines, the ability to efficiently handle word embeddings has become critical. That’s where databases like Apache Cassandra come into play—offering the scalability and performance needed to manage and query large amounts of vector data.

In Part 1 of this series, we’ll explore how you can leverage word embeddings for similarity searches using Cassandra 4 and OpenSearch. By combining Cassandra’s robust data storage capabilities with OpenSearch’s powerful search functions, you can build scalable and efficient systems that handle both metadata and word embeddings.

Cassandra 4 and OpenSearch: A partnership for embeddings

Cassandra 4 doesn’t natively support vector data types or specific similarity search functions, but that doesn’t mean you’re out of luck. By integrating Cassandra with OpenSearch, an open-source search and analytics platform, you can store word embeddings and perform similarity searches using the k-Nearest Neighbors (kNN) plugin.

This hybrid approach is advantageous over relying on OpenSearch alone because it allows you to leverage Cassandra’s strengths as a high-performance, scalable database for data storage while using OpenSearch for its robust indexing and search capabilities.

Instead of duplicating large volumes of data into OpenSearch solely for search purposes, you can keep the original data in Cassandra. OpenSearch, in this setup, acts as an intelligent pointer, indexing the embeddings stored in Cassandra and performing efficient searches without the need to manage the entire dataset directly.

This approach not only optimizes resource usage but also enhances system maintainability and scalability by segregating storage and search functionalities into specialized layers.

Deploying the environment

To set up your environment for word embeddings and similarity search, you can leverage the Instaclustr Managed Platform, which simplifies deploying and managing your Cassandra cluster and OpenSearch. Instaclustr takes care of the heavy lifting, allowing you to focus on building your application rather than managing infrastructure. In this configuration, Cassandra serves as your primary data store, while OpenSearch handles vector operations and similarity searches.

Here’s how to get started:

  1. Deploy a managed Cassandra cluster: Start by provisioning your Cassandra 4 cluster on the Instaclustr platform. This managed solution ensures your cluster is optimized, secure, and ready to store non-vector data.
  2. Set up OpenSearch with kNN plugin: Instaclustr also offers a fully managed OpenSearch service. You will need to deploy OpenSearch, with the kNN plugin enabled, which is critical for handling word embeddings and executing similarity searches.

By using Instaclustr, you gain access to a robust platform that seamlessly integrates Cassandra and OpenSearch, combining Cassandra’s scalable, fault-tolerant database with OpenSearch’s powerful search capabilities. This managed environment minimizes operational complexity, so you can focus on delivering fast and efficient similarity searches for your application.

Preparing the environment

Now that we’ve outlined the environment setup, let’s dive into the specific technical steps to prepare Cassandra and OpenSearch for storing and searching word embeddings.

Step 1: Setting up Cassandra

In Cassandra, we’ll need to create a table to store the metadata. Here’s how to do that:

  1. Create the keyspace and table:
     First, create the keyspace, then a table to store the file and paragraph metadata, such as the file name, paragraph text, and timestamps (the embeddings themselves will live in OpenSearch):

CREATE KEYSPACE IF NOT EXISTS aisearch WITH REPLICATION = {'class': 'SimpleStrategy', 'replication_factor': 3};

USE aisearch;

DROP TABLE IF EXISTS file_metadata;

CREATE TABLE IF NOT EXISTS file_metadata (
    id UUID,
    paragraph_uuid UUID,
    filename TEXT,
    text TEXT,
    last_updated timestamp,
    PRIMARY KEY (id, paragraph_uuid)
);

Step 2: Configuring OpenSearch

In OpenSearch, you’ll need to create an index that supports vector operations for similarity search. Here’s how you can configure it:

  1. Create the index:
    Define the index settings and mappings, ensuring that vector operations are enabled and that the correct space type (e.g., L2) is used for similarity calculations.
{ 
  "settings": { 
   "index": { 
     "number_of_shards": 2, 
      "knn": true, 
      "knn.space_type": "l2" 
    } 
  }, 
  "mappings": { 
    "properties": { 
      "file_uuid": { 
        "type": "keyword" 
      }, 
      "paragraph_uuid": { 
        "type": "keyword" 
      }, 
      "embedding": { 
        "type": "knn_vector", 
        "dimension": 300 
      } 
    } 
  } 
}

This index configuration is optimized for storing and searching embeddings using the k-Nearest Neighbors algorithm, which is crucial for similarity search.
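If you prefer to create the index programmatically, here is a minimal sketch using the opensearch-py client. The connection details and the index name "embeddings" are placeholders (the later snippets refer to the client as os_client and the index via an index_name variable); adjust them for your cluster:

from opensearchpy import OpenSearch

# Connection details below are placeholders; adjust them for your cluster
os_client = OpenSearch(
    hosts=[{"host": "your-opensearch-host", "port": 9200}],
    http_auth=("username", "replace_with_your_password"),
    use_ssl=True,
    verify_certs=True,
)

index_body = {
    "settings": {
        "index": {
            "number_of_shards": 2,
            "knn": True,
            "knn.space_type": "l2",
        }
    },
    "mappings": {
        "properties": {
            "file_uuid": {"type": "keyword"},
            "paragraph_uuid": {"type": "keyword"},
            "embedding": {"type": "knn_vector", "dimension": 300},
        }
    },
}

# Create the index with kNN enabled for the 300-dimensional embedding field
os_client.indices.create(index="embeddings", body=index_body)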

With these steps, your environment will be ready to handle word embeddings for similarity search using Cassandra and OpenSearch.

Generating embeddings with FastText

Once you have your environment set up, the next step is to generate the word embeddings that will drive your similarity search. For this, we’ll use FastText, a popular library from Facebook’s AI Research team that provides pre-trained word vectors. Specifically, we’re using the crawl-300d-2M model, which offers 300-dimensional vectors for millions of English words.

Step 1: Download and load the FastText model

To start, you’ll need to download the pre-trained model file. This can be done easily using Python and the requests library. Here’s the process:

1. Download the FastText model: The FastText model is stored in a zip file, which you can download from the official FastText website. The following Python script will handle the download and extraction:

import requests 
import zipfile 
import os 

# Adjust file_url  and local_filename  variables accordingly 
file_url = 'https://dl.fbaipublicfiles.com/fasttext/vectors-english/crawl-300d-2M.vec.zip' 
local_filename = '/content/gdrive/MyDrive/0_notebook_files/model/crawl-300d-2M.vec.zip' 
extract_dir = '/content/gdrive/MyDrive/0_notebook_files/model/' 

def download_file(url, filename): 
    with requests.get(url, stream=True) as r: 
        r.raise_for_status() 
        os.makedirs(os.path.dirname(filename), exist_ok=True) 
        with open(filename, 'wb') as f: 
            for chunk in r.iter_content(chunk_size=8192): 
                f.write(chunk) 
 

def unzip_file(filename, extract_to): 
    with zipfile.ZipFile(filename, 'r') as zip_ref: 
        zip_ref.extractall(extract_to) 

# Download and extract 
download_file(file_url, local_filename) 
unzip_file(local_filename, extract_dir)

2. Load the model: Once the model is downloaded and extracted, you’ll load it using Gensim’s KeyedVectors class. This allows you to work with the embeddings directly: 

from gensim.models import KeyedVectors 

# Adjust model_path variable accordingly
model_path = "/content/gdrive/MyDrive/0_notebook_files/model/crawl-300d-2M.vec"
fasttext_model = KeyedVectors.load_word2vec_format(model_path, binary=False)

Step 2: Generate embeddings from text

With the FastText model loaded, the next task is to convert text into vectors. This process involves splitting the text into words, looking up the vector for each word in the FastText model, and then averaging the vectors to get a single embedding for the text.

Here’s a function that handles the conversion:

import numpy as np 
import re 

def text_to_vector(text): 
    """Convert text into a vector using the FastText model.""" 
    text = text.lower() 
    words = re.findall(r'\b\w+\b', text) 
    vectors = [fasttext_model[word] for word in words if word in fasttext_model.key_to_index] 

    if not vectors: 
        print(f"No embeddings found for text: {text}") 
        return np.zeros(fasttext_model.vector_size) 

    return np.mean(vectors, axis=0)

This function tokenizes the input text, retrieves the corresponding word vectors from the model, and computes the average to create a final embedding.
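As a quick sanity check (assuming the model and function defined above), you can confirm that the output shape matches the model's 300 dimensions:

vec = text_to_vector("Cassandra stores word embeddings")
print(vec.shape)  # (300,) for the crawl-300d-2M model
print(vec[:5])    # first few components of the averaged vector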

Step 3: Extract text and generate embeddings from documents

In real-world applications, your text might come from various types of documents, such as PDFs, Word files, or presentations. The following code shows how to extract text from different file formats and convert that text into embeddings:

import uuid 
import mimetypes 
import pandas as pd 
from pdfminer.high_level import extract_pages 
from pdfminer.layout import LTTextContainer 
from docx import Document 
from pptx import Presentation 

def generate_deterministic_uuid(name): 
    return uuid.uuid5(uuid.NAMESPACE_DNS, name) 

def generate_random_uuid(): 
    return uuid.uuid4() 

def get_file_type(file_path): 
    # Guess the MIME type based on the file extension 
    mime_type, _ = mimetypes.guess_type(file_path) 
    return mime_type 

def extract_text_from_excel(excel_path): 
    xls = pd.ExcelFile(excel_path) 
    text_list = [] 

    for sheet_index, sheet_name in enumerate(xls.sheet_names): 
        df = xls.parse(sheet_name) 
        for row in df.iterrows(): 
            text_list.append((" ".join(map(str, row[1].values)), sheet_index + 1))  # +1 to make it a 1-based index 

    return text_list 

def extract_text_from_pdf(pdf_path): 
    return [(text_line.get_text().strip().replace('\xa0', ' '), page_num) 
            for page_num, page_layout in enumerate(extract_pages(pdf_path), start=1) 
            for element in page_layout if isinstance(element, LTTextContainer) 
            for text_line in element if text_line.get_text().strip()] 

def extract_text_from_word(file_path): 
    doc = Document(file_path) 
    # python-docx exposes no page numbers, so use a 1-based paragraph index instead 
    return [(para.text, i + 1) for i, para in enumerate(doc.paragraphs) if para.text.strip()] 

def extract_text_from_txt(file_path): 
    with open(file_path, 'r') as file: 
        return [(line.strip(), 1) for line in file.readlines() if line.strip()] 

def extract_text_from_pptx(pptx_path): 
    prs = Presentation(pptx_path) 
    return [(shape.text.strip(), slide_num) for slide_num, slide in enumerate(prs.slides, start=1) 
            for shape in slide.shapes if hasattr(shape, "text") and shape.text.strip()] 

def extract_text_with_page_number_and_embeddings(file_path, embedding_function): 
    file_uuid = generate_deterministic_uuid(file_path) 
    file_type = get_file_type(file_path) 

    extractors = { 
        'text/plain': extract_text_from_txt, 
        'application/pdf': extract_text_from_pdf, 
        'application/vnd.openxmlformats-officedocument.wordprocessingml.document': extract_text_from_word, 
        'application/vnd.openxmlformats-officedocument.presentationml.presentation': extract_text_from_pptx, 
        'application/zip': lambda path: extract_text_from_pptx(path) if path.endswith('.pptx') else [], 
        'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet': extract_text_from_excel, 
        'application/vnd.ms-excel': extract_text_from_excel
    }

    text_list = extractors.get(file_type, lambda _: [])(file_path) 

    return [ 
      { 
          "uuid": file_uuid, 
          "paragraph_uuid": generate_random_uuid(), 
          "filename": file_path, 
          "text": text, 
          "page_num": page_num, 
          "embedding": embedding 
      } 
      for text, page_num in text_list 
      if (embedding := embedding_function(text)).any()  # Check if the embedding is not all zeros 
    ] 

# Replace the file path with the one you want to process 

file_path = "../../docs-manager/Cassandra-Best-Practices.pdf"
paragraphs_with_embeddings = extract_text_with_page_number_and_embeddings(file_path, text_to_vector)

This code handles extracting text from different document types, generating embeddings for each text chunk, and associating them with unique IDs.

With FastText set up and embeddings generated, you’re now ready to store these vectors in OpenSearch and start performing similarity searches.

Performing similarity searches

To conduct similarity searches, we utilize the k-Nearest Neighbors (kNN) plugin within OpenSearch. This plugin allows us to efficiently search for the most similar embeddings stored in the system. Essentially, you’re querying OpenSearch to find the closest matches to a word or phrase based on your embeddings.

For example, if you’ve embedded product descriptions, using kNN search helps you locate products that are semantically similar to a given input. This capability can significantly enhance your application’s recommendation engine, categorization, or clustering.

This setup with Cassandra and OpenSearch is a powerful combination, but it’s important to remember that it requires managing two systems. As Cassandra evolves, the introduction of built-in vector support in Cassandra 5 simplifies this architecture. But for now, let’s focus on leveraging both systems to get the most out of similarity searches.

Example: Inserting metadata in Cassandra and embeddings in OpenSearch

In this example, we use Cassandra 4 to store metadata related to files and paragraphs, while OpenSearch handles the actual word embeddings. By storing the paragraph and file IDs in both systems, we can link the metadata in Cassandra with the embeddings in OpenSearch.

We first need to store metadata such as the file name, paragraph UUID, and other relevant details in Cassandra. This metadata is crucial for linking the data between Cassandra, OpenSearch, and the file itself on the filesystem.

The following code demonstrates how to insert the metadata into Cassandra and the embeddings into OpenSearch. Make sure you have run the previous script so that the “paragraphs_with_embeddings” variable is populated:

from tqdm import tqdm 

# Function to insert data into both Cassandra and OpenSearch 
def insert_paragraph_data(session, os_client, paragraph, keyspace_name, index_name): 
    # Insert into Cassandra 
    cassandra_result = insert_with_retry( 
        session=session, 
        id=paragraph['uuid'], 
        paragraph_uuid=paragraph['paragraph_uuid'], 
        text=paragraph['text'], 
        filename=paragraph['filename'], 
        keyspace_name=keyspace_name, 
        max_retries=3, 
        retry_delay_seconds=1 
    ) 

    if not cassandra_result: 
        return False  # Stop further processing if Cassandra insertion fails 

    # Insert into OpenSearch 
    opensearch_result = insert_embedding_to_opensearch( 
        os_client=os_client, 
        index_name=index_name, 
        file_uuid=paragraph['uuid'], 
        paragraph_uuid=paragraph['paragraph_uuid'], 
        embedding=paragraph['embedding'] 
    ) 

    # insert_embedding_to_opensearch is assumed to return None on success,
    # so any non-None value signals that the OpenSearch insertion failed
    if opensearch_result is not None:
        return False

    return True  # Return True on success for both 

# Process each paragraph with a progress bar 
print("Starting batch insertion of paragraphs.") 

for paragraph in tqdm(paragraphs_with_embeddings, desc="Inserting paragraphs"): 
    if not insert_paragraph_data( 
        session=session, 
        os_client=os_client, 
        paragraph=paragraph, 
        keyspace_name=keyspace_name, 
        index_name=index_name 
    ): 

        print(f"Insertion failed for UUID {paragraph['uuid']}: {paragraph['text'][:50]}...") 

print("Batch insertion completed.")

Performing similarity search

Now that we’ve stored both metadata in Cassandra and embeddings in OpenSearch, it’s time to perform a similarity search. This step involves searching OpenSearch for embeddings that closely match a given input and then retrieving the corresponding metadata from Cassandra.

The process is straightforward: we start by converting the input text into an embedding, then use the k-Nearest Neighbors (kNN) plugin in OpenSearch to find the most similar embeddings. Once we have the results, we fetch the related metadata from Cassandra, such as the original text and file name.

Here’s how it works:

  1. Convert text to embedding: Start by converting your input text into an embedding vector using the FastText model. This vector will serve as the query for our similarity search.
  2. Search OpenSearch for similar embeddings: Using the KNN search capability in OpenSearch, we find the top k most similar embeddings. Each result includes the corresponding file and paragraph UUIDs, which help us link the results back to Cassandra.
  3. Fetch metadata from Cassandra: With the UUIDs retrieved from OpenSearch, we query Cassandra to get the metadata, such as the original text and file name, associated with each embedding.

The following code demonstrates this process:

import uuid 
from IPython.display import display, HTML 

def find_similar_embeddings_opensearch(os_client, index_name, input_embedding, top_k=5): 
    """Search for similar embeddings in OpenSearch and return the associated UUIDs.""" 
    query = { 
        "size": top_k, 
        "query": { 
            "knn": { 
                "embedding": { 
                    "vector": input_embedding.tolist(), 
                    "k": top_k 
                } 
            } 
        } 
    }

    response = os_client.search(index=index_name, body=query)

    similar_uuids = [] 
    for hit in response['hits']['hits']: 
        file_uuid = hit['_source']['file_uuid'] 
        paragraph_uuid = hit['_source']['paragraph_uuid'] 
        similar_uuids.append((file_uuid, paragraph_uuid))  

    return similar_uuids 

def fetch_metadata_from_cassandra(session, file_uuid, paragraph_uuid, keyspace_name): 
    """Fetch the metadata (text and filename) from Cassandra based on UUIDs.""" 
    file_uuid = uuid.UUID(file_uuid) 
    paragraph_uuid = uuid.UUID(paragraph_uuid) 

    query = f""" 
    SELECT text, filename 
    FROM {keyspace_name}.file_metadata 
    WHERE id = ? AND paragraph_uuid = ?; 
    """ 
    prepared = session.prepare(query) 
    bound = prepared.bind((file_uuid, paragraph_uuid)) 
    rows = session.execute(bound)    

    for row in rows: 
        return row.filename, row.text 
    return None, None 

# Input text to find similar embeddings 
input_text = "place" 

# Convert input text to embedding 
input_embedding = text_to_vector(input_text) 

# Find similar embeddings in OpenSearch 
similar_uuids = find_similar_embeddings_opensearch(os_client, index_name=index_name, input_embedding=input_embedding, top_k=10) 

# Fetch and display metadata from Cassandra based on the UUIDs found in OpenSearch 
for file_uuid, paragraph_uuid in similar_uuids: 
    filename, text = fetch_metadata_from_cassandra(session, file_uuid, paragraph_uuid, keyspace_name)

    if filename and text: 
        html_content = f""" 
        <div style="margin-bottom: 10px;"> 
            <p><b>File UUID:</b> {file_uuid}</p> 
            <p><b>Paragraph UUID:</b> {paragraph_uuid}</p> 
            <p><b>Text:</b> {text}</p> 
            <p><b>File:</b> {filename}</p> 
        </div> 

        <hr/> 
        """ 

        display(HTML(html_content))

This code demonstrates how to find similar embeddings in OpenSearch and retrieve the corresponding metadata from Cassandra. By linking the two systems via the UUIDs, you can build powerful search and recommendation systems that combine metadata storage with advanced embedding-based searches.

Conclusion and next steps: A powerful combination of Cassandra 4 and OpenSearch

By leveraging the strengths of Cassandra 4 and OpenSearch, you can build a system that handles both metadata storage and similarity search. Cassandra efficiently stores your file and paragraph metadata, while OpenSearch takes care of embedding-based searches using the k-Nearest Neighbors algorithm. Together, these two technologies enable powerful, large-scale applications for text search, recommendation engines, and more.

Coming up in Part 2, we’ll explore how Cassandra 5 simplifies this architecture with built-in vector support and native similarity search capabilities.

Ready to try vector search with Cassandra and OpenSearch? Spin up your first cluster for free on the Instaclustr Managed Platform and explore the incredible power of vector search.


How Cassandra Streaming, Performance, Node Density, and Cost Are All Related

This is the first post of several I have planned on optimizing Apache Cassandra for maximum cost efficiency. I’ve spent over a decade working with Cassandra, including tens of thousands of hours of data modeling, fixing issues, writing tools for it, and analyzing its performance. I’ve always been fascinated by database performance tuning, even before Cassandra.

A decade ago I filed one of my first issues with the project, where I laid out my target goal of 20TB of data per node. This wasn’t possible for most workloads at the time, but I’ve kept this target in my sights.

Why TRACTIAN Migrated from MongoDB to ScyllaDB for Real-Time ML

TRACTIAN’s ML model workloads increased over 2X in a year. Here’s why they changed databases, and their lessons learned.

What happens when you hit a database scaling wall? Since TRACTIAN, an AI-driven industrial monitoring company, is all about preventing problems, they didn’t want to wait and see. After the company’s ML workloads doubled in a year, their industrial IoT platform was experiencing unsolvable performance degradation. With more rapid growth on the horizon, their engineering leaders decided to rethink their distributed data system before they hit MongoDB’s breaking point.

JP Voltani, TRACTIAN’s Director of Engineering, recently shared the team’s experiences at ScyllaDB Summit. If we gave out Academy Awards for production, this one would have been the clear winner (all credit to the TRACTIAN team). So, be sure to watch this quick look at some impressive scaling work.

Enjoy engineering case studies like this? Choose your own adventure through 60+ tech talks at Monster Scale Summit (free + virtual). You can learn from experts like Martin Kleppmann, Kelsey Hightower and Gwen Shapira, plus engineers from Discord, Disney+, Slack, Atlassian, Uber, Canva, Medium, Cloudflare, and more.

Get a free conference pass

Key Takeaways

A few key takeaways:

TRACTIAN was reaching a critical inflection point when their sensor network grew more than 2x in a single year. MongoDB struggled, even after the team’s valiant optimization and scaling attempts. The constant stream of time-series sensor data (vibration, temperature, energy consumption) caused performance degradation that could compromise their latency targets.

The team wanted a database architecture specifically designed for high-throughput, time-partitioned data workloads, which led them to ScyllaDB. They benchmarked ScyllaDB vs Cassandra, Postgres, and MongoDB. The results showed a 10x performance improvement with ScyllaDB, and they appreciated its operational simplicity compared to Cassandra.

The TRACTIAN team moved their most performance-critical workloads to ScyllaDB while maintaining MongoDB for other use cases, exemplifying their “right tool for the job” philosophy. They experienced a 10x improvement in throughput and latency with ScyllaDB.

TRACTIAN applied a four-phase migration process (dual writes → historical backfill → read switching → final validation). This phased approach maintained 99.95% availability while transitioning critical industrial IoT data pipelines.

The team mapped their IoT workload to ScyllaDB by partitioning data by sensor ID and clustering by timestamp. This data modeling change improved query performance for time-window searches and eliminated the hotspot issues that had plagued their MongoDB implementation.

Here’s a lightly edited transcript…

Intro

Hello, everyone. My name is JP, and I’m the Director of Engineering at TRACTIAN. Today, I’m going to talk about our experience with real-time machine learning using ScyllaDB. I’ll start with what TRACTIAN is and what we do, what our infrastructure looks like, why we migrated away from MongoDB for some workloads, our ScyllaDB migration process, and what’s next for us.

At TRACTIAN, we build solutions for industrial maintenance. We want to empower maintenance teams around the globe with best-in-class hybrid and AI-assisted software. We have three products: The Smart Trac is a vibration and temperature sensor that is able to detect more than 70 types of failures in rotating machines.
The TracOS is a system with everything needed to manage the operations of maintenance teams on the plant floor, enabling mobile and offline operations. The Energy Trac is a sensor that is able to monitor energy consumption, efficiency, and electrical quality. Together, these products form a very concise solution that works seamlessly as a whole – bringing a very Apple-like experience to industrial maintenance. We have already raised over $100M through VC funding, establishing a global footprint with customers across the Americas. We have three different headquarters: one in Brazil, one in Mexico, one in the USA. We have employees worldwide.

The TRACTIAN Tech Stack

Let’s talk about our tech stack. We have a very straightforward approach to adopting new technologies: if it helps solve a real problem, we embrace it. For this reason, our tech stack is very modern and extensive. We use more than 80 databases and 6 different languages for our services, which allows us to leverage the strengths of each technology. We have a microservices architecture with more than 30 services, ranging from APIs and consumers to producers and batch processes. Together they handle more than 1,500 events per second from different sources – with an average latency lower than 200 milliseconds and 99.95% availability.

Here’s what our infrastructure looked like before ScyllaDB. The sensors sent data to our APIs, and the APIs put the sensor data into Kafka topics. Different services consumed these topics to process the data, saving it into different MongoDB collections. After that, we sent triggers to the AI pipeline to process the data. We start with a binary blob from the sensor, and the processing services expand the data into different tables. Some are used for client visualizations, others as vectors for AI (training and inference).

Why They Evolved

As the company grew, the number of samples arriving in the system also grew. We saw the workloads increase over 2x in a single year, and the database needed to deal with that increase. Unfortunately, even after upscale operations and optimizations, that was not the case with MongoDB. Performance degradation made us look for alternative solutions for our warehouse and AI workloads.

Why ScyllaDB

Why ScyllaDB? At the time, we had already tested Cassandra. The results were promising, but some database operations, like upscaling, had aspects that were not attractive to us. MongoDB was not handling the IoT workload very well, and we wanted something that was easier to scale. ScyllaDB showed itself to be a light at the end of the tunnel. We were searching for something really specific, and luckily ScyllaDB had a data model that fit our problem very well. Also, ScyllaDB’s database operations were way better than Cassandra’s.

This is just one example of how ScyllaDB’s data model works in favor of our workloads: we have binary data that we want to partition by sensor ID and order by timestamp, and ScyllaDB will make a query for a specific ID within a time window very fast.
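The talk showed this model as a diagram. As a rough illustration only (hypothetical keyspace, table, and column names, not TRACTIAN’s actual schema), the shape of such a table with the scylla Rust driver (0.13+ API) might look like this:

use scylla::{Session, SessionBuilder};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let session: Session = SessionBuilder::new()
        .known_node("127.0.0.1:9042")
        .build()
        .await?;

    // Hypothetical schema: partition by sensor, cluster by time (newest first),
    // so a time-window query for one sensor reads a single partition in order.
    // Assumes an `iot` keyspace already exists.
    session
        .query_unpaged(
            "CREATE TABLE IF NOT EXISTS iot.sensor_samples (
                sensor_id  uuid,
                sampled_at timestamp,
                payload    blob,
                PRIMARY KEY ((sensor_id), sampled_at)
            ) WITH CLUSTERING ORDER BY (sampled_at DESC)",
            (),
        )
        .await?;

    // The hot query then stays within one partition:
    //   SELECT payload FROM iot.sensor_samples
    //   WHERE sensor_id = ? AND sampled_at >= ? AND sampled_at < ?;
    Ok(())
}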
We had a plan on our hands. First, we created a new DSL: what would the tables on ScyllaDB look like? How would MongoDB data map to the new tables? After that, we did a bunch of theoretical benchmarks – basically, testing with synthetic data. This is an easy and fast way to validate an idea. Then we did it all over again, but with real data. Sometimes synthetic tests fail to capture nuances of real data and miss things like partition imbalances and hot spots. Other times, they fail to create a good mapping, and this only becomes visible when you test with real data. So it’s important not to skip this step.

Next, we went into the weeds and refactored all the existing application code to use the new database. It’s important to have very, very clear success criteria: what are you trying to achieve with this migration? We had a very clear number of devices in mind that the new infrastructure should be able to handle. The test results came in favor of ScyllaDB. In some workloads, we saw a 10x improvement in throughput and latency.

Migration Strategy

Next, let’s talk about the migration game plan (sketched in code at the end of this recap). We did everything live and without downtime. Initially, all the data was being written to MongoDB. After that, we started to write to both databases. This was the first checkpoint of the migration. At this step, we checked whether both databases agreed that the data was correct and whether the initial performance tests matched the benchmark ones. After that, we started our migration script, which would backfill ScyllaDB with the historical data from MongoDB and check that no data was missing. Then we switched reads over to ScyllaDB, while continuing to write to MongoDB as a backup in case any problems occurred. This is how we did our online, no-downtime migration. The results speak for themselves.

Results

We have great write and read latency after the migration, and ScyllaDB has scaled very well with our increasing workload. Our infrastructure now has ScyllaDB as one of its backbones, and we still use MongoDB for other types of workloads – and also a bunch of other databases for other challenges.

Read more about TRACTIAN’s comparison of ScyllaDB vs MongoDB and PostgreSQL in ScyllaDB vs MongoDB vs PostgreSQL
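To make the four migration phases concrete, here is a minimal, self-contained sketch of the dual-write/read-switch pattern described above. The types and method bodies are hypothetical stand-ins, not TRACTIAN’s code:

// Hypothetical stand-ins for the two database clients.
struct MongoRepo;
struct ScyllaRepo;

impl MongoRepo {
    fn write(&self, sample: &str) { println!("mongo write: {sample}"); }
    fn read(&self, key: &str) -> String { format!("mongo value for {key}") }
}
impl ScyllaRepo {
    fn write(&self, sample: &str) { println!("scylla write: {sample}"); }
    fn read(&self, key: &str) -> String { format!("scylla value for {key}") }
}

// Phase flags for the live migration.
struct MigrationConfig {
    dual_write: bool,    // phase 1: write to both stores
    read_from_new: bool, // phase 3: flip reads once backfill + validation pass
}

struct Repo {
    mongo: MongoRepo,
    scylla: ScyllaRepo,
    cfg: MigrationConfig,
}

impl Repo {
    fn write(&self, sample: &str) {
        // Keep MongoDB as the fallback during the whole migration.
        self.mongo.write(sample);
        if self.cfg.dual_write {
            self.scylla.write(sample);
        }
    }
    fn read(&self, key: &str) -> String {
        if self.cfg.read_from_new {
            self.scylla.read(key)
        } else {
            self.mongo.read(key)
        }
    }
}

fn main() {
    let repo = Repo {
        mongo: MongoRepo,
        scylla: ScyllaRepo,
        cfg: MigrationConfig { dual_write: true, read_from_new: false },
    };
    repo.write("sensor-1:vibration=0.4");
    println!("{}", repo.read("sensor-1"));
}

The historical backfill (phase 2) runs as a separate script while dual writes are active; only after it verifies that no data is missing does read_from_new get flipped.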

IBM acquires DataStax: What that means for customers–and why Instaclustr is a smart alternative

IBM’s recent acquisition of DataStax has certainly made waves in the tech industry. With IBM’s expanding influence in data solutions and DataStax’s reputation for advancing Apache Cassandra® technology, this acquisition could signal a shift in the database management landscape.

For businesses currently using DataStax, this news might have sparked questions about what the future holds. How does this acquisition impact your systems, your data, and, most importantly, your goals?

While the acquisition holds promise for integrating IBM’s cloud capabilities with high-performance NoSQL solutions, there’s uncertainty too. Transition periods for acquisitions often involve changes in product development priorities, pricing structures, and support strategies.

However, one thing is certain: customers want reliable, scalable, and transparent solutions. If you’re re-evaluating your options amid these changes, here’s why NetApp Instaclustr offers an excellent path forward.

Decoding the IBM-DataStax link-up

DataStax is a provider of enterprise solutions for Apache Cassandra, a powerful NoSQL database trusted for its ability to handle massive amounts of distributed data. IBM’s acquisition reflects its growing commitment to strengthening data management and expanding its footprint in the open source ecosystem.

While the acquisition promises an infusion of IBM’s resources and reach, IBM’s strategy often leans into long-term integration into its own cloud services and platforms. This could potentially reshape DataStax’s roadmap to align with IBM’s broader cloud-first objectives. Customers who don’t rely solely on IBM’s ecosystem—or want flexibility in their database management—might feel caught in a transitional limbo.

This is where Instaclustr comes into the picture as a strong, reliable alternative solution.

Why consider Instaclustr?

Instaclustr is purpose-built to empower businesses with a robust, open source data stack. For businesses relying on Cassandra or DataStax, Instaclustr delivers an alternative that’s stable, high-performing, and highly transparent.

Here’s why Instaclustr could be your best option moving forward:

1. 100% open source commitment

We’re firm believers in the power of open source technology. We offer pure Apache Cassandra, keeping it true to its roots without the proprietary lock-ins or hidden limitations. Unlike proprietary solutions, a commitment to pure open source ensures flexibility, freedom, and no vendor lock-in. You maintain full ownership and control.

2. Platform agnostic

One of the things that sets our solution apart is our platform-agnostic approach. Whether you’re running your workloads on AWS, Google Cloud, Azure, or on-premises environments, we make it seamless for you to deploy, manage, and scale Cassandra. This differentiates us from vendors tied deeply to specific clouds—like IBM.

3. Transparent pricing

Worried about the potential for a pricing overhaul under IBM’s leadership of DataStax? At Instaclustr, we pride ourselves on simplicity and transparency. What you see is what you get—predictable costs without hidden fees or confusing licensing rules. Our customer-first approach ensures that you remain in control of your budget.

4. Expert support and services

With Instaclustr, you’re not just getting access to technology—you’re also gaining access to a team of Cassandra experts who breathe open source. We’ve been managing and optimizing Cassandra clusters across the globe for years, with a proven commitment to providing best-in-class support.

Whether it’s data migration, scaling real-world workloads, or troubleshooting, we have you covered every step of the way. And our reliable SLA-backed managed Cassandra services mean businesses can focus less on infrastructure stress and more on innovation.

5. Seamless migrations

Concerned about the transition process? If you’re currently on DataStax and contemplating a move, our solution provides tools, guidance, and hands-on support to make the migration process smooth and efficient. Our experience in executing seamless migrations ensures minimal disruption to your operations.

Customer-centric focus

At the heart of everything we do is a commitment to your success. We understand that your data management strategy is critical to achieving your business goals, and we work hard to provide adaptable solutions.

Instaclustr comes to the table with over 10 years of experience in managing open source technologies including Cassandra, Apache Kafka®, PostgreSQL®, OpenSearch®, Valkey®, ClickHouse® and more, backed by over 400 million node hours and 18+ petabytes of data under management. Our customers trust and rely on us to manage the data that drives their critical business applications.

With a focus on fostering an open source future, our solutions aren’t tied to any single cloud, ecosystem, or bit of red tape. Simply put: your open source success is our mission.

Final thoughts: Why Instaclustr is the smart choice for this moment

IBM’s acquisition of DataStax might open new doors—but close many others. While the collaboration between IBM and DataStax might appeal to some enterprises, it’s important to weigh alternative solutions that offer reliability, flexibility, and freedom.

With Instaclustr, you get a partner that’s been empowering businesses with open source technologies for years, providing the transparency, support, and performance you need to thrive.

Ready to explore a stable, long-term alternative to DataStax? Check out Instaclustr for Apache Cassandra.

Contact us and learn more about Instaclustr for Apache Cassandra or request a demo of the Instaclustr platform today!


Build an RPG Using the Bluesky Jetstream, ScyllaDB, and Rust

Learn how to build a Rust application that tracks Bluesky user experiences and events. Let’s build a high-performance, scalable, and reliable application that can:

Fetch and process public events from the Bluesky platform.
Track user events and experiences.
Implement a leveling system with experience points (XP).
Display user levels and progress based on XP via a REST API.

1. Background

Bluesky, which uses a mix of SQLite and ScyllaDB to store data, has a really cool feature called Firehose. Firehose is an aggregated stream of all the public data updates in the network. You can get a feel for it at FireSky.tv, an app that consumes this stream and serves it directly in the browser. Implementing it from scratch requires deep knowledge of the AT Protocol, but a Bluesky engineer built Jetstream: a Firehose aggregator. With Jetstream, you can just listen on a websocket and get a JSON stream of selected events. Here’s a sample of an event payload from Jetstream:

Just listening to one of these streams without any issues is amazing. And it turns out that you can even select which type of event you want to listen to, like: app.bsky.graph.follow; app.bsky.feed.post; app.bsky.feed.like; app.bsky.feed.repost; and many more! But how can we turn it into an application? Well, it depends on your needs. The data is there; just consume it and do your magic! In my case, I like to transform data into games.

2. Gamifying Jetstream

I’m not a game developer, but games follow an event-driven approach, right? Every time you earn points for something, you level up or learn a new skill. But to earn experience points, users need to take actions. And that’s exactly what you do inside a social network: actions! Imagine that every time you:

Post: Just text? Earns 50 experience. Has media? Earns 60 experience. Has media with alt text? Earns 70 experience.
Like: Earns 10 experience.
Repost: Just text? Earns 50 experience. Has media? Earns 60 experience. Has media with alt text? Earns 70 experience!

There are plenty of other abstractions that can be done, but that’s the idea. The experience will be calculated using arithmetic progression, and should follow this simple rule:
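The original post shows the rule as an image, which isn’t reproduced here. A typical arithmetic-progression leveling rule (my assumption, the constants are not necessarily the author’s) makes each level cost a constant step d more XP than the previous one, so the total XP required to reach level N is:

XP(N) = \sum_{n=1}^{N-1} d \cdot n = d \cdot \frac{N(N-1)}{2}

With d = 100, for example, reaching level 2 costs 100 XP, level 3 costs 300 XP, and level 10 costs 4,500 XP in total.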
With that, we can now talk about the technologies used in this project.

3. Meet the Stack

Bluesky uses ScyllaDB to serve the whole AppView layer with high availability and throughput in mind, so we’re going to do the same! Also, I’ve been using Rust extensively (and always learning more!), so I decided to implement this project in Rust. Here’s the tech stack in a nutshell:

Language: Rust
Database: ScyllaDB
Packages: HTTP server: actix-web; ORM: charybdis; Jetstream client: jetstream-oxide; Bluesky client: atrium-api

My goal is to build something that, besides creating cool charts on Grafana, can also display something via a REST API. First, let’s explore our data modeling strategy.

4. What about the Data Modeling?

Initially, the idea was to just store the events and test how stressed the app/database would become. But at this point, we can go a little bit further. ScyllaDB follows a query-driven development approach because it’s a wide-column NoSQL database. Let’s think about that. First, it’s an RPG focused on a timeline profile, so it will have heavy read operations on top of the “characters” (the model definitions are consolidated in the sketch at the end of this section). Since we only have one item in the WHERE clause, our query is a key-value lookup. But wait… we also need to store the current experience of this user. For that, I would use the Counter type to atomically store it using key-value pairs. It’s supposed to be simple, just like this! But it also has to be fast enough to serve 1M requests/s with ease.

WARNING: Counter types can’t be used as clustering or partition keys. Also, if you use them in a table, all fields besides the partition key must be Counters!

I also want to track all possible events happening in a user’s account and list them in our extension to show how that person can be a better Bluesky user. So, the queries will be centered on users, and events must be clustered in descending order. Alright, that should be enough for an MVP. Now let’s model each part!

4.1 Modeling: Leveling State UDT

Since we’re using ScyllaDB, we can use UDTs (User Defined Types). Keeping track of operations can be a pain. However, if you’re making this a pattern across all tables, UDTs can be useful when you don’t want to recreate the same fields every time. We can then reuse the type across the other tables, whether they’re related to events or characters.

4.2 Modeling: Characters Table

This will be the most accessed table inside our project via the REST API. The modeling (at this moment) is simple, since we only want the user_handle and the leveling state (UDT). With the UDT, we can serve exactly the latest leveling state to build a UI later on. We can also add new fields, since none of them will be part of the partition key.

4.3 Modeling: Characters Experience Table

As mentioned earlier, we should store the experience in a way that doesn’t create a race condition. Why? ScyllaDB is a highly available database that replicates your data across multiple nodes. To avoid race conditions, we need to use the only atomic type available: the Counter type. With that, we ensure that every write/read will see the latest value. Yes, it impacts performance. However, Counters are designed and optimized for exactly this type of operation.

4.4 Modeling: Events Table and MV

This is the most “complicated” part, but it’s not that hard. As mentioned before, there are plenty of events around ATProto Bluesky, and I want to surface all the possible events for each user. Displaying data in descending order is a must, and ScyllaDB provides this if you include a clustering key in your table. With CLUSTERING ORDER BY (event_at DESC), I’m basically telling it that every time I fetch a chunk of data from this table, it will ALWAYS be the most recent inserts. However, now we have a problem. Imagine that we want to list all events of a specific type. With this table, we’re not able to do that. Why? Because you can only use items in a WHERE clause that are part of your partition or clustering keys. However, we can get around this by creating a Materialized View! Materialized Views are tables created based on a parent table: every time the parent table receives a write, your view will also receive it. You can then play with the partitioning/clustering. Now we have different partitions for the same user, storing different types of events that we’re able to query directly. With that, our data modeling is finally DONE!
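The post’s model snippets were published as images and aren’t reproduced here. Below is a rough Charybdis sketch of the four models described above; table, type, and field names are guesses, and the exact macro attributes (especially table_options and the view macro) may differ from the original:

use charybdis::macros::{charybdis_model, charybdis_udt_model, charybdis_view_model};
use charybdis::types::{BigInt, Counter, Int, Text, Timestamp, Uuid};

// 4.1 Leveling state UDT, reused across tables
#[charybdis_udt_model(type_name = leveling_state)]
pub struct LevelingState {
    pub level: Int,
    pub experience: BigInt,
}

// 4.2 Characters table: a key-value lookup by user DID
#[charybdis_model(
    table_name = characters,
    partition_keys = [user_did],
    clustering_keys = []
)]
pub struct Character {
    pub user_did: Text,
    pub user_handle: Text,
    pub state: LevelingState,
}

// 4.3 Experience counters: every field outside the key must be a Counter
#[charybdis_model(
    table_name = characters_experience,
    partition_keys = [user_did],
    clustering_keys = []
)]
pub struct CharacterExperience {
    pub user_did: Text,
    pub experience: Counter,
}

// 4.4 Events table, newest first via CLUSTERING ORDER BY (event_at DESC)
#[charybdis_model(
    table_name = character_events,
    partition_keys = [user_did],
    clustering_keys = [event_at, event_id],
    table_options = "CLUSTERING ORDER BY (event_at DESC)"
)]
pub struct CharacterEvent {
    pub user_did: Text,
    pub event_at: Timestamp,
    pub event_id: Uuid,
    pub event_type: Text,
}

// Materialized view to query a user's events by type
#[charybdis_view_model(
    table_name = character_events_by_type,
    base_table = character_events,
    partition_keys = [user_did, event_type],
    clustering_keys = [event_at, event_id]
)]
pub struct CharacterEventsByType {
    pub user_did: Text,
    pub event_type: Text,
    pub event_at: Timestamp,
    pub event_id: Uuid,
}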
Let’s jump into some business rules implementation.

5. Hands-on: Application Flow

With the basics taken care of, let’s explain how everything works under the hood.

5.1 App: Jetstream Oxide

At the websocket layer, we’re using the Jetstream Oxide package to receive all the events in an elegantly structured way. The boilerplate can look like this:

For each type of event, we’ll receive a specific amount of experience and a different response, asynchronously. With that, the goal was to make an OCP-style integration (open for extension, closed for modification) where we only need to add new events as they become relevant:

That takes us to the last step, which sets up the default event behavior in the trait. We have three types of event actions: Create, Update, and Delete. The handler takes care of the whole action/communication with ScyllaDB through the Charybdis ORM. In this example, you can check how the CreateEventHandler works:

We can implement other types of events by simply extending the trait to a new dynamic struct, and it will work fine.

5.2 App: Actix Web

To serve this data, there’s a simple implementation of an endpoint using Actix. Since the long-term goal is to build a browser extension, we need to serve an endpoint with the character/user information:

6. Conclusion

This exploration of Bluesky Jetstream and its potential for gamification showcases the power of leveraging cutting-edge technologies like ScyllaDB and Rust to build scalable, high-performance applications. By focusing on event-driven development, we demonstrated how to create an interactive system that transforms social media activity into measurable, gamified metrics. You can check out the project here.

How JioCinema Uses ScyllaDB Bloom Filters for Personalization

Why they used ScyllaDB Bloom filters instead of building their own with common solutions like Redis Bloom filters

When you log in to your favorite streaming service, first impressions matter. The featured content should instantly lure you into binge-watching mode. If it’s full of shows and movies you’ve already seen, your brain quickly shifts into “Hmmm, is it time to cancel this service?” mode. At a technical level, ensuring fresh recommendations is a challenge that every streaming platform faces. But the standard solutions weren’t a good fit for JioCinema, a prominent Indian streaming service known for its affordability and extensive content library – and currently experiencing explosive growth (e.g., with world-record-breaking 620M IPL viewers, peaking at over 20M concurrent viewers). Instead of building their own Bloom filters or using common solutions like Redis Bloom filters, they took a different path: using ScyllaDB’s built-in Bloom filter support to check watch status in real time.

JioCinema’s Charan Kamal (Back-End Developer) and Harshit Jain (Software Engineer) recently shared why they took this unconventional path, including the tradeoffs of the more obvious solutions and the logistics of implementing this with ScyllaDB. Watch their complete tech talk below, or read on to skim the highlights.

Enjoy engineering case studies like this? Choose your own adventure through 60+ tech talks at Monster Scale Summit (free + virtual). You can learn from experts like Martin Kleppmann, Kelsey Hightower and Gwen Shapira, plus engineers from Discord, Disney+, Slack, Atlassian, Uber, Canva, Medium, Cloudflare, and more.

Get a free conference pass

The Challenge: “Watch Discounting” for Fresh Recommendations

JioCinema is a leading “over the top” (OTT) streaming platform. The service features top Indian and international entertainment, including live sports (from IPL cricket, to LaLiga European football, to NBA basketball), new movies, HBO originals, and more. Their massive content library spans 10 Indian languages. The JioCinema app uses customized content trays like “Because You Watched” to keep users engaged and help them discover new content. For example, after a user completes “Game of Thrones,” the platform might commonly recommend “House of the Dragon” – but if the user has already watched it, it will suggest something else.

As Harshit Jain put it, “These personalized trays not only keep the customers engaged but also enhance content discoverability, fostering long-term engagement and reducing churn rates. However, personalization comes with its own challenges, particularly the issue of recommending content that the customers have already watched. To address this, we have implemented a solution and termed it ‘Watch Discounting.’”

This service must cost-efficiently satisfy low-latency requirements at an impressive scale (e.g., 10M daily active users consuming hundreds of thousands of shows and films per day). Charan Kamal explained, “Keeping the sheer size of our customer base and catalog in mind, we had to use a data structure which was space-efficient as well as blazing fast. While we want to keep our recommendations fresh, we also want to avoid over-engineering and making the system overly complex. We could tolerate occasional false positives here. So this led us to Bloom filters – space-efficient probabilistic data structures designed for rapid membership lookup in a set.”
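As a quick refresher on the mechanics (a toy illustration, not JioCinema’s code): a Bloom filter hashes each inserted key into several bit positions, and a lookup answers either “definitely not present” or “probably present”:

use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

struct BloomFilter {
    bits: Vec<bool>,
    num_hashes: u64,
}

impl BloomFilter {
    fn new(num_bits: usize, num_hashes: u64) -> Self {
        Self { bits: vec![false; num_bits], num_hashes }
    }

    // Derive one bit position per seed by hashing (seed, item).
    fn index(&self, item: &str, seed: u64) -> usize {
        let mut h = DefaultHasher::new();
        seed.hash(&mut h);
        item.hash(&mut h);
        (h.finish() as usize) % self.bits.len()
    }

    fn insert(&mut self, item: &str) {
        for seed in 0..self.num_hashes {
            let i = self.index(item, seed);
            self.bits[i] = true;
        }
    }

    // False means "definitely never inserted"; true means "probably inserted".
    fn might_contain(&self, item: &str) -> bool {
        (0..self.num_hashes).all(|seed| self.bits[self.index(item, seed)])
    }
}

fn main() {
    let mut watched = BloomFilter::new(1 << 16, 4);
    watched.insert("user42:game-of-thrones");
    assert!(watched.might_contain("user42:game-of-thrones"));
    // Lookups for keys never inserted are usually false, but can be false positives:
    println!("{}", watched.might_contain("user42:house-of-the-dragon"));
}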
The Problem with Custom and Redis Bloom Filters

Okay, but which Bloom filters were the best fit here? The team first considered building a custom Bloom filter to store and serve content. Although this “fun exercise” would have provided complete control over the implementation, it presented significant scaling challenges. They didn’t trust that a simple map of Bloom filters would scale vertically to keep pace with JioCinema’s massive (and rapidly growing) user base. Horizontal scaling would have required either implementing sticky sessions (where specific pods would hold Bloom filters for particular users) or replicating data across every pod in the system. While this would have been an interesting engineering challenge, it just didn’t make sense from a business perspective.

The next option they explored was Redis with Bloom filter capabilities. Their initial testing with an open-source Redis cluster revealed an interesting issue with Redis’ cluster topology. During high load, nodes would occasionally get hot and trigger failovers, promoting replicas to primary nodes. This created a cascade effect where entire nodes within the cluster became unusable while primary-replica promotions continued in an endless loop. Looking to avoid that risk, they explored Redis Enterprise. This approach showed significant performance improvements and indeed met their SLA requirements. However, there was a catch. JioCinema’s business requires millisecond-level latency across multiple regions. According to Charan, “Even with Redis Enterprise, we faced a choice between an active-active deployment to maintain low latency or compromising the customer experience in certain regions. The latter was unacceptable for our use case. Additionally, Redis Enterprise imposed substantial charges for each operation and replication, making it challenging to justify the return on investment of this feature for our business. This led us to explore ScyllaDB.”

Implementing Watch Discounting with ScyllaDB

Charan continued, “ScyllaDB not only supported Bloom filters out of the box, but we also had prior experience using it for different personalization use cases. Knowing its exceptional speed, with the ability to replicate data into multiple regions and serve customers from locations close to their origin, ScyllaDB seemed like a comprehensive solution. This allowed us to concentrate on developing what mattered most for our customers and enabled us to go to market fast.”

As the following diagram shows, the Watch Discounting feature was powered by two ingestion pipelines. Batch processes compute users’ watch history within a specified time window, determining whether a piece of content meets the completion criteria to be considered “watched.” If so, the system updates the ScyllaDB table with a time-to-live (TTL) attribute, ensuring content only becomes rediscoverable after a specified amount of time. Short videos (e.g., 30-60 second videos that drive high engagement) require a different treatment. Here, content must be marked as “viewed” immediately, so real-time event streaming is used to update the watch discounting repository.
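Here is a minimal sketch of those TTL mechanics with the scylla Rust driver (0.x API; the keyspace, table, and TTL value are hypothetical, not JioCinema’s schema):

use scylla::query::Query;
use scylla::statement::Consistency;
use scylla::{Session, SessionBuilder};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let session: Session = SessionBuilder::new()
        .known_node("127.0.0.1:9042")
        .build()
        .await?;

    // Mark a title as watched; the row silently expires after the TTL,
    // so the content becomes recommendable again on its own.
    let ttl_seconds: i32 = 90 * 24 * 60 * 60; // resurface after ~90 days (made-up number)
    session
        .query_unpaged(
            "INSERT INTO reco.watched_content (user_id, content_id) \
             VALUES (?, ?) USING TTL ?",
            ("user-42", "house-of-the-dragon", ttl_seconds),
        )
        .await?;

    // Read with LOCAL_QUORUM so lookups are served from the user's region.
    let mut lookup =
        Query::new("SELECT content_id FROM reco.watched_content WHERE user_id = ?");
    lookup.set_consistency(Consistency::LocalQuorum);
    let _result = session.query_unpaged(lookup, ("user-42",)).await?;
    Ok(())
}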
Why ScyllaDB

Charan concluded, “As mentioned earlier, adopting ScyllaDB enabled us to prioritize developing new functionality over creating data structures. This approach allowed us to keep our nodes small and maintain a clear separation of concerns between Bloom filters and filtering watched content. The unmatched performance of ScyllaDB became evident, especially when dealing with high cardinality of partitions and small data sizes—precisely the characteristics of our dataset. TTLs made cleanups easy and permitted the discovery of watched content after a specified period. Moreover, reading from LOCAL_QUORUM ensured that we could access data from the closest region to the user, resulting in high throughput and lower latencies. This strategic combination of features in ScyllaDB significantly contributed to the efficacy and effectiveness of our system.”

Charybdis: Building High-Performance Distributed Rust Backends with ScyllaDB

Build a high-performance distributed Rust backend—without losing the expressiveness and ease of use of Ruby on Rails and SQL

Editor’s note: This post was originally published on Goran’s blog.

Ruby on Rails (RoR) is one of the most renowned web frameworks. When combined with SQL databases, RoR transforms into a powerhouse for developing back-end (or even full-stack) applications. It resolves numerous issues out of the box, sometimes without developers even realizing it. For example, with the right callbacks, complex business logic for a single API action is automatically wrapped within a transaction, ensuring ACID (Atomicity, Consistency, Isolation, Durability) compliance. This removes many potential concerns from the developer’s plate. Typically, developers only need to define a functional data model and adhere to the framework’s conventions — sounds easy, right?

However, as with all good things, there are trade-offs. In this case, it’s performance. While the RoR and RDBMS combination is exceptional for many applications, it struggles to provide a suitable solution for large-scale systems. Additionally, using frameworks like RoR alongside standard relational databases introduces another pitfall: it becomes easy to develop poor data models. Why? Simply because SQL databases are highly flexible, allowing developers to make almost any data model work. We just utilize more indexing, joins, and preloading to avoid the dreaded N+1 query problem. We’ve all fallen into this trap at some point.

What if we could build a high-performance, distributed, Rust-based backend while retaining some of the expressiveness and ease of use found in RoR and SQL? This is where ScyllaDB and the Charybdis ORM come into play. Before diving into these technologies, it’s essential to understand the fundamental differences between traditional Relational Database Management Systems (RDBMS) and ScyllaDB NoSQL.

LSM vs. B+ Tree

ScyllaDB, like Cassandra, employs a Log-Structured Merge Tree (LSM) storage engine, which optimizes write operations by appending data to in-memory structures called memtables and periodically flushing them to disk as SSTables. This approach allows for high write throughput and efficient handling of large volumes of data. By using a partition key and a hash function, ScyllaDB can quickly locate the relevant SSTables and memtables, avoiding global index scans and focusing operations on specific data segments. However, while LSM trees excel at write-heavy workloads, they can introduce read amplification since data might be spread across multiple SSTables. To mitigate this, ScyllaDB/Cassandra uses Bloom filters and optimized indexing strategies. Read performance may occasionally be less predictable compared to B+ trees, especially for certain read patterns.

Traditional SQL Databases: B+ Tree Indexing

In contrast, traditional SQL databases like PostgreSQL and MySQL (InnoDB) use B+ Tree indexing, which provides O(log N) read operations by traversing the tree from root to leaf nodes to locate specific rows. This structure is highly effective for read-heavy applications and supports complex queries, including range scans and multi-table joins. While B+ trees offer excellent read performance, write operations are slower compared to LSM trees due to the need to maintain tree balance, which may involve node splits and more random I/O operations. Additionally, SQL databases benefit from sophisticated caching mechanisms that keep frequently accessed index pages in memory, further enhancing read efficiency.
Horizontal Scalability

ScyllaDB/Cassandra: Designed for Seamless Horizontal Scaling

ScyllaDB/Cassandra are inherently built for horizontal scalability through their shared-nothing architecture. Each node operates independently, and data is automatically distributed across the cluster using consistent hashing. This design ensures that adding more nodes proportionally increases both storage capacity and compute resources, allowing the system to handle growing workloads efficiently. The automatic data distribution and replication mechanisms provide high availability and fault tolerance, ensuring that the system remains resilient even if individual nodes fail. Furthermore, ScyllaDB/Cassandra offer tunable consistency levels, allowing developers to balance between consistency and availability based on application requirements. This flexibility is particularly advantageous for distributed applications that need to maintain performance and reliability at scale.

Traditional SQL Databases: Challenges with Horizontal Scaling

Traditional SQL databases, on the other hand, were primarily designed for vertical scalability, relying on enhancing a single server’s resources to manage increased load. While replication (primary-replica or multi-primary) and sharding techniques enable horizontal scaling, these approaches often introduce significant operational complexity. Managing data distribution, ensuring consistency across replicas, and handling failovers require careful planning and additional tooling. Moreover, maintaining ACID properties across a distributed SQL setup can be resource-intensive, potentially limiting scalability compared to NoSQL solutions like ScyllaDB/Cassandra.

Data Modeling

To harness ScyllaDB’s full potential, there is one fundamental rule: data modeling should revolve around queries. This means designing your data structures based on how you plan to access and query them. At first glance, this might seem obvious, prompting the question: aren’t we already doing this with traditional RDBMSs? Not entirely. The flexibility of SQL databases allows developers to make nearly any data model work by leveraging joins, indexes, and preloading techniques. This often masks underlying inefficiencies, making it easy to overlook suboptimal data designs. In contrast, ScyllaDB requires a more deliberate approach. You must carefully select partition and clustering keys to ensure that queries are scoped to single partitions and data is ordered optimally. This eliminates the need for extensive indexing and complex joins, allowing ScyllaDB’s Log-Structured Merge (LSM) engine to deliver high performance. While this approach demands more upfront effort, it leads to more efficient and scalable data models. To be fair, it also means that, as a rule, you usually have to provide more information to locate the desired data. Although this can initially appear challenging, the more you work with it, the more you naturally develop the intuition needed to create optimal models.

Charybdis

Now that we have grasped the fundamentals of data modeling in ScyllaDB, we can turn our attention to Charybdis. Charybdis is a Rust ORM built on top of the ScyllaDB Rust Driver, focusing on ease of use and performance. Out of the box, it generates nearly all available queries for a model and provides helpers for custom queries. It also supports automatic migrations, allowing you to run commands to migrate the database structure based on differences between model definitions in your code and the database. Additionally, Charybdis supports partial models, enabling developers to work seamlessly with subsets of model fields while implementing all traits and functionalities that are present in the main model.

Sample User Model

Note: We will always query a user by id, so we simply added id as the partition key, leaving the clustering key empty.
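The original model snippet was published as an image. Here is an approximate reconstruction based on the note above; the fields beyond id are guesses, and the generated query names in the comments are approximate, not the authoritative Charybdis API:

use charybdis::macros::charybdis_model;
use charybdis::types::{Text, Timestamp, Uuid};

// Queried only by id, so id is the partition key and there is no clustering key.
#[charybdis_model(
    table_name = users,
    partition_keys = [id],
    clustering_keys = []
)]
pub struct User {
    pub id: Uuid,
    pub username: Text,
    pub email: Text,
    pub created_at: Timestamp,
    pub updated_at: Timestamp,
}

// Out of the box, Charybdis generates the basic queries for the model,
// along these lines (method names approximate):
//
//   user.insert().execute(&session).await?;
//   user.update().execute(&session).await?;
//   user.delete().execute(&session).await?;
//   User::find_by_primary_key_value((id,)).execute(&session).await?;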
Installing and Running Migrations

First, install the migration tool:

cargo install charybdis-migrate

Within your src/ directory, run the migration:

migrate --hosts <host> --keyspace <your_keyspace> --drop-and-replace (optional)

This command will create the users table with the fields defined in your model. Note that for migrations to work, you need to use types or aliases defined within charybdis::types::*.

Basic Queries for the User Model

Sample Models for a Reddit-Like Application

In a Reddit-like application, we have communities that have posts, and posts have comments. Note that the following sample is available within the Charybdis examples repository.

Community Model

Post Model

Actix-Web Services for Posts

Note: The insert_cb method triggers the before_insert callback within our trait, assigning a new id and created_at to the post.

Retrieving All Posts for a Community

Updating a Post’s Description

To avoid potential inconsistency issues, such as concurrent requests updating a post’s description and other fields, we use the automatically generated partial_<model>! macro. The partial_post! macro is automatically generated by the charybdis_model macro. The first argument is the struct name of the new partial model, and the rest are the subset of the main model’s fields that we want to work with. In this case, UpdateDescriptionPost behaves just like the standard Post model but operates on a subset of model fields. For each partial model, we must provide the complete primary key, and the main model must implement the Default trait. Additionally, all traits on the main model that are defined below the charybdis_model macro will automatically be included for all partial models. Now we can have an Actix service dedicated to updating a post’s description (a sketch of the Post model and partial_post! usage follows below):

Note: To update a post, you must provide all components of the primary key (community_id, created_at, id).
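The post’s code samples aren’t reproduced here, but based on the description above, the Post model and its partial might look roughly like this (field names and the table_options attribute are assumptions; chrono is assumed as a dependency):

use charybdis::macros::charybdis_model;
use charybdis::types::{Text, Timestamp, Uuid};

#[charybdis_model(
    table_name = posts,
    partition_keys = [community_id],
    clustering_keys = [created_at, id],
    table_options = "CLUSTERING ORDER BY (created_at DESC)"
)]
pub struct Post {
    pub community_id: Uuid,
    pub created_at: Timestamp,
    pub id: Uuid,
    pub title: Text,
    pub description: Text,
}

// Partial models require Default on the main model; Timestamp (chrono) has
// no derived Default, so we implement it by hand.
impl Default for Post {
    fn default() -> Self {
        Self {
            community_id: Uuid::default(),
            created_at: chrono::Utc::now(),
            id: Uuid::default(),
            title: Text::default(),
            description: Text::default(),
        }
    }
}

// Generated by charybdis_model: the new struct name first, then the subset of
// fields to work with (the full primary key must be included).
partial_post!(UpdateDescriptionPost, community_id, created_at, id, description);

// An update then touches only the description column, e.g. (approximate API):
//   let post = UpdateDescriptionPost { community_id, created_at, id, description };
//   post.update().execute(&session).await?;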
Final Notes

A full working sample is available within the Charybdis examples repository. Note: We defined our models somewhat differently than in typical SQL scenarios by using three columns to define the primary key. This is because, in designing models, we also determine where and how data will be stored for querying and data transformation.

ACID Compliance Considerations

While ScyllaDB offers exceptional performance and seamless horizontal scalability for many applications, it is not suitable for scenarios where ACID (Atomicity, Consistency, Isolation, Durability) properties are required.

Sample Use Cases Requiring ACID Integrity

Bank Transactions: Ensuring that fund transfers are processed atomically to prevent discrepancies and maintain financial accuracy.
Seat Reservations: Guaranteeing that seat allocations in airline bookings or event ticketing systems are handled without double-booking.
Inventory Management: Maintaining accurate stock levels in e-commerce platforms to avoid overselling items.

For such critical applications, the lack of inherent ACID guarantees in ScyllaDB means that developers must implement additional safeguards to ensure data integrity. In cases where absolute transactional reliability is non-negotiable, integrating ScyllaDB with a traditional RDBMS that provides full ACID compliance might be necessary.

In upcoming articles, we will explore how to handle additional scenarios, how to leverage eventual consistency effectively for the majority of your web application, and strategies for maintaining strong consistency when required by your data models in ScyllaDB.