Inside ScyllaDB Rust Driver 1.0: A Fully Async Shard-Aware CQL Driver Using Tokio
The engineering challenges and design decisions that led to the 1.0 release of ScyllaDB Rust Driver

ScyllaDB Rust Driver is a client-side, shard-aware driver written in pure Rust, with a fully async API using Tokio. The Rust Driver project was born back in 2021 during ScyllaDB's internal developer hackathon. Our initial goal was to provide a native implementation of a CQL driver that's compatible with Apache Cassandra and also contains a variety of ScyllaDB-specific optimizations. Later that year, we released ScyllaDB Rust Driver 0.2.0 on the Rust community's package registry, crates.io. Comparative benchmarks for that early release confirmed that the driver was (more than) satisfactory in terms of performance. So we continued working on it, with the goal of an official release – and an ambitious plan to unify other ScyllaDB-specific drivers by converting them into bindings for our Rust driver.

Now that we've reached a major milestone for the Rust Driver project (officially releasing ScyllaDB Rust Driver 1.0), it's time to share the challenges and design decisions that led to this release.

What's New in ScyllaDB Rust Driver 1.0?

Along with stability, this new release brings powerful new features, better performance, and smarter design choices. Here's a look at what we worked on and why.

Refactored Error Types

Our original error types met ad hoc needs, but weren't ideal for long-term production use. They weren't very type-safe, some of them stringified other errors, and they did not provide sufficient information to diagnose an error's root cause. Some of them were severely abused – most notably `ParseError`. There was also a one-to-rule-them-all error type: the ubiquitous `QueryError`, which many user-facing APIs used to return.

Back in 0.13 of the driver, `QueryError` suffered from several problems:

- The structure was painfully flat, with extremely niche errors (such as `UnableToAllocStreamId`) being inline variants of the enum.
- Many variants contained just strings. The worst offender was `InvalidMessage`, which jammed all sorts of different error types into a single string. Many errors were buried inside `IoError`, too. This stringification obscured the code path to the underlying errors, hurting readability and diagnosability.
- Due to the omnipresent stringification, matching on error kinds was virtually impossible.
- The error types were public and, at the same time, not decorated with the `#[non_exhaustive]` attribute. Because of this, adding any new error variant required breaking the API – unacceptable for a driver aspiring to bear the name of an API-stable library.

In version 1.0.0, the new error types are clearer and more helpful. The error hierarchy now reflects the code flow, and error conversions are explicit, so no undesired, confusing conversion takes place. The one-to-fit-them-all error type has been replaced: APIs now return various error types that exhaustively cover the possible errors, without any need to match on error variants that can't occur when executing a given function. `QueryError`'s new counterpart, `ExecutionError`, looks like this:
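The original snippet was not preserved in this copy of the post, so here is an illustrative sketch of the new style (hypothetical variant names, not the driver's actual definitions): nested, layered enums marked `#[non_exhaustive]`:

```rust
// Illustrative sketch only -- these are NOT the driver's real definitions,
// just the shape: errors nested per abstraction layer, typed sources
// instead of strings, and #[non_exhaustive] on every public enum.
#[non_exhaustive]
pub enum ConnectionError {
    /// An I/O failure kept as a typed source, not flattened to a string.
    Io(std::io::Error),
    /// Hypothetical example of a niche, layer-specific error.
    Timeout,
}

#[non_exhaustive]
pub enum ExecutionError {
    /// Nested: connection-layer errors live in their own enum.
    Connection(ConnectionError),
    /// The server rejected the request (simplified).
    BadQuery(String),
}

pub fn describe(err: &ExecutionError) -> &'static str {
    match err {
        ExecutionError::Connection(ConnectionError::Timeout) => "timed out",
        ExecutionError::Connection(_) => "connection problem",
        // In downstream crates a wildcard arm like this is mandatory,
        // because the enum is #[non_exhaustive].
        _ => "other error",
    }
}
```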
Note that:

- There is much more nesting, reflecting the driver's modules and abstraction layers.
- The stringification is gone!
- Error types are decorated with the `#[non_exhaustive]` attribute, which requires downstream crates to always include a catch-all arm (like `_ => { … }`) when matching on them. This way, we prevent breaking downstream crates' code when adding a new error variant.

Refactored Module Structure

The module structure also stemmed from various ad hoc decisions. Users familiar with older releases of our driver may recall, for example, the ubiquitous `transport` module. It used to contain a bit of absolutely everything: essentially, it was a flat bag with no real deeper structure. Back in 0.15.1, the module structure looked like this (omitting the modules that were not later restructured):

transport
  load_balancing
    default.rs
    mod.rs
    plan.rs
  locator
    (submodules)
  caching_session.rs
  cluster.rs
  connection_pool.rs
  connection.rs
  downgrading_consistency_retry_policy.rs
  errors.rs
  execution_profile.rs
  host_filter.rs
  iterator.rs
  metrics.rs
  node.rs
  partitioner.rs
  query_result.rs
  retry_policy.rs
  session_builder.rs
  session_test.rs
  session.rs
  speculative_execution.rs
  topology.rs
history.rs
routing.rs

The new module structure clarifies the driver's separate abstraction layers. Each higher-level module is documented with a description of what abstractions it should hold. We also refined our item export policy. Before, there could be multiple paths to import items from. Now each item can be imported from just one path: either its original path (i.e., where it is defined), or its re-export path (i.e., where it is imported and then re-exported from). In 1.0.0, the module structure is the following (again, omitting the unchanged modules):
client
  caching_session.rs
  execution_profile.rs
  pager.rs
  self_identity.rs
  session_builder.rs
  session.rs
  session_test.rs
cluster
  metadata.rs
  node.rs
  state.rs
  worker.rs
network
  connection.rs
  connection_pool.rs
errors.rs (top-level module)
policies
  address_translator.rs
  host_filter.rs
  load_balancing
    default.rs
    plan.rs
  retry
    default.rs
    downgrading_consistency.rs
    fallthrough.rs
    retry_policy.rs
  speculative_execution.rs
observability
  driver_tracing.rs
  history.rs
  metrics.rs
  tracing.rs
response
  query_result.rs
  request_response.rs
routing
  locator (unchanged contents)
  partitioner.rs
  sharding.rs

Removed Unstable Dependencies From the Public API

With the ScyllaDB Rust Driver 1.0 release, we wanted to fully eliminate unstable (pre-1.0) dependencies from the public API. Instead, we now expose these dependencies through feature flags that explicitly encode the major version number, such as "num-bigint-03". Why did we do this?
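First, the "how": a downstream crate opts in through its manifest. This is an illustrative sketch (the feature name comes from this post; the crate version shown is an assumption):

```toml
# Cargo.toml of an application using the driver (illustrative).
[dependencies]
# Opt in to num-bigint 0.3 support explicitly; without a flag like this,
# no pre-1.0 crate appears in the driver's public API.
scylla = { version = "1.0", features = ["num-bigint-03"] }
```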
- API Stability & Semver Compliance – The 1.0 release promises a stable API, so breaking changes must be avoided in future minor updates. If our public API directly depended on pre-1.0 crates, any breaking change in those dependencies would force us to introduce breaking changes as well. By removing them from the public API, we shield users from unexpected incompatibilities.
- Greater Flexibility for Users – Developers using the ScyllaDB Rust driver can now opt into specific versions of optional dependencies via feature flags. This allows better integration with their existing projects, without being forced to upgrade or downgrade dependencies because of our choices.
- Long-Term Maintainability – By isolating unstable dependencies, we reduce technical debt and make future updates easier. If a dependency introduces breaking changes, we can simply add a corresponding feature flag (e.g., "num-bigint-04") without affecting the core driver API.
- Avoiding Unnecessary Dependencies – Some users may not need certain dependencies at all. Exposing them via opt-in feature flags helps keep the dependency tree lean, improving compilation times and reducing potential security risks.
- Improved Ecosystem Compatibility – By allowing users to choose specific versions of dependencies, we minimize conflicts with other crates in their projects. This is particularly important in the broader Rust ecosystem, where dependency version mismatches can lead to build failures or unwanted upgrades.
- Support for Multiple Versions Simultaneously – By namespacing dependencies with feature flags (e.g., "num-bigint-03" and "num-bigint-04"), users can leverage multiple versions of the same dependency within their project. This is particularly useful when integrating with other crates that may require different versions of a shared dependency, reducing version conflicts and easing the upgrade path.

How this impacts users:

- The core ScyllaDB Rust driver remains stable and free from external pre-1.0 dependencies (with one exception: the popular rand crate, which is still at 0.*).
- If you need functionality from an optional dependency, enable it explicitly using the appropriate feature flag (e.g., "num-bigint-03").
- Future updates can introduce new versions of dependencies under separate feature flags – without breaking existing integrations.

This change ensures that the ScyllaDB Rust driver remains stable, flexible, and future-proof, while still providing access to powerful third-party libraries when needed.

Rustls Support for TLS

The driver now supports Rustls,
simplifying TLS connections and removing the need for additional system C libraries (openssl). Previously, ScyllaDB Rust Driver only supported OpenSSL-based TLS – like our other drivers. However, the Rust ecosystem has its own native TLS library: Rustls. Rustls is designed for both performance and security, leveraging Rust's strong memory safety guarantees while often outperforming OpenSSL in real-world benchmarks. With the 1.0.0 release, we have added Rustls as an alternative TLS backend. This gives users more flexibility in choosing their preferred implementation. Additional system C libraries (openssl) are no longer required to establish secure connections.

Feature-Based Backend Selection

Just as we isolated pre-1.0 dependencies via version-encoded feature flags (see the previous section), we applied the same strategy to TLS backends. Both OpenSSL and Rustls are exposed through opt-in feature flags. This allows users to explicitly select their desired implementation and ensures:

- API Stability – Users can enable TLS support without introducing unnecessary dependencies in their projects.
- Avoiding Unwanted Conflicts – Users can choose the TLS backend that best fits their project without forcing a dependency on OpenSSL or Rustls if they don't need it.
- Future-Proofing – If a breaking change occurs in a TLS library, we can introduce a new feature flag (e.g., "rustls-023", "openssl-010") without modifying the core API.

Abstraction Over TLS Backends

We also introduced an abstraction layer over the TLS backends. Key enums such as `TlsProvider`, `TlsContext`, `TlsConfig` and `Tls` now contain variants corresponding to each backend. This means that switching between OpenSSL and Rustls (as well as between different versions of the same backend) is a matter of enabling the respective feature flag and selecting the desired variant. If you prefer Rustls, enable the "rustls-023" feature and use the `TlsContext::Rustls` variant. If you need OpenSSL, enable "openssl-010" and use `TlsContext::OpenSSL`. If you want both backends, or different versions of the same backend (in production or just to explore), you can enable multiple features and it will "just work."
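For instance, a manifest enabling both backends side by side might look like this (feature names as given above; the crate version shown is illustrative):

```toml
[dependencies]
# Enable both TLS backends; each feature unlocks the corresponding
# enum variant (TlsContext::Rustls / TlsContext::OpenSSL).
scylla = { version = "1.0", features = ["rustls-023", "openssl-010"] }
```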
If you don’t require TLS at all, you can exclude both, reducing
dependency overhead. Our ultimate goal with adding Rustls support
and refining TLS backend selection was to ensure that the ScyllaDB
Rust Driver is both flexible and well-integrated with the Rust
ecosystem. We hope this better accommodates users’ different
performance and security needs.

The Battle For The Empty Enums

We really wanted to let users build the driver with no TLS backends opted in. In particular, this required making our enums work without any variants (i.e., as empty enums). This was a bit tricky. For instance, one cannot match over `&x`, where `x: X` is an instance of the enum, if `X` is empty. Specifically, consider the definition of `X` that appears in the compiler output below. This
would not compile:

```
error[E0004]: non-exhaustive patterns: type `&X` is non-empty
   --> scylla/src/network/tls.rs:230:11
    |
230 |     match x {
    |           ^
    |
note: `X` defined here
   --> scylla/src/network/tls.rs:223:6
    |
223 | enum X {
    |      ^
    = note: the matched value is of type `&X`
    = note: references are always considered inhabited
help: ensure that all possible cases are being handled by adding a match arm with a wildcard pattern as shown
    |
230 ~     match x {
231 +         _ => todo!(),
232 +     }
    |
```

Note that references are always considered inhabited. Therefore, in order to make the code compile in such a case, we have to match on the value itself, not on a reference:
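The original snippets are missing from this copy of the post, but their shape can be read off the compiler output: an enum with a single feature-gated `String` variant, matched by value. A reconstruction (a sketch, not the driver's actual `tls.rs` code):

```rust
// Reconstructed from the compiler output above; a sketch, not the
// driver's actual code. With no features enabled, `X` has no variants.
enum X {
    #[cfg(feature = "a")]
    A(String),
}

fn handle(x: &X) -> usize {
    // Match on the value (*x), not the reference: when `X` is empty,
    // `*x` is a place of an uninhabited type, so a match with zero
    // arms is exhaustive and this compiles.
    match *x {
        #[cfg(feature = "a")]
        X::A(s) => s.len(), // binds `s` by value -- the next problem
    }
}
```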
But if we now enable the "a" feature, we get another error…

```
error[E0507]: cannot move out of `x` as enum variant `A` which is behind a shared reference
   --> scylla/src/network/tls.rs:230:11
    |
230 |     match *x {
    |           ^^
231 |         #[cfg(feature = "a")]
232 |         X::A(s) => { /* Handle it */ }
    |              - data moved here
    |
    = note: move occurs because `s` has type `String`, which does not implement the `Copy` trait
help: consider removing the dereference here
    |
230 -     match *x {
230 +     match x {
    |
```

Ugh. rustc literally advises us to revert the change. No luck… Then we would end up with the same problem as before. Hmmm… Wait a moment… I vaguely remember Rust had an obscure reserved word used for matching by reference: `ref`. Let's try it out.
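A sketch of the `ref`-based fix (again reconstructed, not the literal driver code): `ref s` binds the `String` by reference, so nothing is moved out from behind the shared reference, and with the feature disabled the match is simply empty:

```rust
enum X {
    #[cfg(feature = "a")]
    A(String),
}

fn handle(x: &X) -> usize {
    // Still matching on the value, so the empty-enum case compiles...
    match *x {
        #[cfg(feature = "a")]
        X::A(ref s) => s.len(), // ...and `ref` avoids the move (E0507)
    }
}
```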
Yay, it compiles! This is how we made our (possibly) empty enums work… finally!

Faster and Extended Metrics

Performance matters. So we reworked how the driver handles metrics, eliminating bottlenecks and reducing overhead for those who need real-time insights. Moreover, metrics are now an opt-in feature, so you only pay (in terms of resource consumption) for what you use. And we added even more metrics!

Background

Benchmarks showed that the driver may spend significant time logging query latency. Flamegraphs revealed that collecting metrics could consume up to 11.68% of CPU time!
We suspected that the culprit was contention on a mutex guarding the metrics histogram. Even though the issue was discovered in 2021 (!), we postponed dealing with it because the publicly available crates didn't yet include a lock-free histogram (which we hoped would reduce the overhead).

Lock-free histogram

As we approached the 1.0 release deadline, two contributors (Nikodem Gapski and Dawid Pawlik) engaged with the issue. Nikodem explored the new generation of the `histogram` crate and discovered that someone had added a lock-free histogram: `AtomicHistogram`. "Great," he thought. "This is exactly what's needed." Then he discovered that `AtomicHistogram` is flawed: there's a logical race due to insufficient synchronization! To fix the problem, he ported the Go implementation of `LockFreeHistogram` from Prometheus, which prevents logical races at the cost of execution time (though it still performed much better than a mutex). If you are interested in the details of what was wrong with `AtomicHistogram` and how `LockFreeHistogram` tries to solve it, see the discussion in this PR. Eventually, the `histogram` crate's maintainer joined the discussion and convinced us that the skew caused by the logical races in `AtomicHistogram` is benign. Long story short, a histogram is a bit skewed anyway, and we need to accept that. In the end, we chose `AtomicHistogram` for its lower overhead compared to `LockFreeHistogram`.
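The core idea of atomic recording can be sketched with the standard library alone. This is not the `histogram` crate's implementation – just the lock-free pattern: one atomic counter per bucket, incremented with a relaxed atomic add instead of taking a mutex:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Minimal lock-free latency histogram with power-of-two buckets.
/// NOT the `histogram` crate's AtomicHistogram -- just the pattern.
struct AtomicHist {
    buckets: Vec<AtomicU64>,
}

impl AtomicHist {
    fn new(n_buckets: usize) -> Self {
        Self { buckets: (0..n_buckets).map(|_| AtomicU64::new(0)).collect() }
    }

    /// Record a value; safe to call concurrently from many tasks.
    /// A single atomic add -- no mutex, so no contention-induced blocking.
    fn record(&self, value: u64) {
        // Bucket index grows with the bit width of the value (log2 scale).
        let idx = (64 - value.leading_zeros() as usize)
            .min(self.buckets.len() - 1);
        self.buckets[idx].fetch_add(1, Ordering::Relaxed);
    }

    fn count(&self) -> u64 {
        // Relaxed loads: totals may be momentarily skewed, which is the
        // kind of benign inaccuracy discussed above.
        self.buckets.iter().map(|b| b.load(Ordering::Relaxed)).sum()
    }
}
```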
`LockFreeHistogram` is still available on its author's dedicated branch, and we left ourselves a way to swap one histogram implementation for another if we ever decide it's needed.

More metrics

The Rust driver is a proud base for `cpp-rust-driver` (a rewrite of cpp-driver as a thin bindings layer on top of – as you can probably guess at this point – the Rust driver). Before cpp-driver functionalities could be implemented in cpp-rust-driver, they had to be implemented in the Rust driver first. This was the case for some metrics, too. The same two contributors took care of that as well. (By the way, thanks, guys! Some cool sea monster swag will be coming your way.)

Metrics as an opt-in

Not every driver user needs metrics. In fact, it's quite probable that most users don't check them even once. So why force users to pay (in terms of resource consumption) for metrics they're not using? To avoid this, we put the metrics module behind the "metrics" feature (which is disabled by default). Even more performance gain!

For a comprehensive list of changes introduced in the 1.0 release, see our release notes.

Stepping Stones on the Path to the 1.0 Release

We've been working towards this 1.0 release for years, and it involved a lot of incremental improvements that we rolled out in minor releases along the way. Here's a look at the most notable ones.

Ser/De (from versions 0.11 and 0.15)

Previous releases
reworked the serialization and deserialization APIs to improve safety and efficiency. In short, the 0.11 release introduced a revamped serialization API that leverages Rust's type system to catch misserialization issues early, and the 0.15 release refined deserialization for better performance and memory efficiency. Here are more details.

Serialization API Refactor (released in 0.11): Leverage Rust's Powerful Type System to Prevent Misserialization – For Safer and More Robust Query Binding

Before 0.11, the driver's serialization API had several pitfalls, particularly around type safety. The old approach relied on loosely structured traits and structs (`Value`, `ValueList`, `SerializedValues`, `BatchValues`, etc.), which lacked strong compile-time guarantees. This meant that if a user mistakenly bound an incorrect type to a query parameter, they wouldn't receive an immediate, clear error. Instead, they might encounter a confusing serialization error from ScyllaDB – or, in the worst case, could suffer silent data corruption! To address these issues, we introduced a redesigned serialization API that replaces the old traits with `SerializeValue`, `SerializeRow`, and new versions of `BatchValues` and `SerializedValues`. This new approach enforces stronger type safety: type mismatches are now caught locally, at compile time or runtime, rather than surfacing as obscure database errors after query execution. Key benefits of this refactor include:

- Early Error Detection – Incorrectly typed bind markers now trigger clear, local errors instead of ambiguous database-side failures.
- Stronger Type Safety – The new API ensures that only compatible types can be bound to queries, reducing the risk of subtle bugs.

Deserialization API Refactor (released in 0.15): For Better Performance and Memory Efficiency

Prior to release 0.15, the driver's
deserialization process was burdened with multiple inefficiencies, slowing down applications and increasing memory usage. The first major issue was type erasure – all values were initially converted into the CQL-type-agnostic `CqlValue` before being transformed into the user's desired type. This unnecessary indirection introduced additional allocations and copying, making the entire process slower than it needed to be. But the inefficiencies didn't stop there. Another major flaw was the eager allocation of columns and rows. Instead of deserializing data on demand, every column in a row was eagerly allocated at once – whether it was needed or not. Even worse, each page of query results was fully materialized into a `Vec<Row>`. As a result, all rows in a page were allocated at the same time – all of them in the form of the ephemeral `CqlValue`. This usually required further conversion to the user's desired type and incurred yet more allocations. For queries returning large datasets, this led to excessive memory usage and unnecessary CPU overhead. To fix these issues, we introduced a completely redesigned deserialization API. The new approach ensures that:

- CQL values are deserialized lazily, directly into user-defined types, skipping `CqlValue` entirely and eliminating redundant allocations.
- Columns are no longer eagerly deserialized and allocated. Memory is used only for the fields that are actually accessed.
- Rows are streamed instead of eagerly materialized. This avoids unnecessary bulk allocations and allows more efficient processing of large result sets.

Paging API (released in 0.14)

We heard from
our users that the driver's API for executing queries was prone to misuse with regard to query paging. For instance, the `Session::query()` and `Session::execute()` methods would silently return only the first page of the result if page size was set on the statement. On the other hand, if page size was not set, those methods would perform unpaged queries, putting high and undesirable load on the cluster. Furthermore, `Session::query_paged()` and `Session::execute_paged()` would fetch only a single page (and only if page size was set on the statement; otherwise, the query would not be paged at all!).

To combat this:

- We decided to redesign the paging API in a way that no other driver had done before. We concluded that the API must be crystal clear about paging, and that paging should be controlled by the method used, not by the statement itself.
- We ditched `query()` and `query_paged()` (as well as their `execute` counterparts), replacing them with `query_unpaged()` and `query_single_page()`, respectively (and similarly for `execute*`).
- We separated the setting of page size from the paging method itself. Page size is now mandatory on the statement (before, it was optional). The paging method (no paging, manual paging, or transparent automated paging) is now selected by using different session methods (`{query,execute}_unpaged()`, `{query,execute}_single_page()`, and `{query,execute}_iter()`, respectively). This separation is likely the most important change we made to help users avoid footguns and pitfalls.
- We introduced strongly typed `PagingState` and `PagingStateResponse` abstractions. This made it clearer how to use manual paging (available via `{query,execute}_single_page()`).
- Ultimately, we provided a cheat sheet in the docs that describes best practices regarding statement execution.

Looking Ahead

The journey doesn't
stop here. We have many ideas for possible future driver improvements:

- Adding a `prelude` module containing the driver's commonly used functionality.
- More performance optimizations to push the limits of scalability (and benchmarks to track how we're doing).
- Extending CQL execution APIs to combine transparent paging with zero-copy deserialization, and introducing `BoundStatement`.
- Designing our own test harness to enable cluster sharing and reuse between tests (with hopes of speeding up test suite execution and encouraging people to write more tests).
- Reworking CQL execution APIs for less code duplication and better usability.
- Introducing `QueryDisplayer` to pretty-print the results of a query in a tabular way, similar to the cqlsh tool.
- (In our dreams) Rewriting cqlsh (based on the Python driver) with cqlsh-rs (a wrapper over the Rust driver).

And of course, we're always eager to hear from the community – your feedback helps shape the future of the driver!

Get Started with ScyllaDB Rust Driver 1.0

If you're working on cool Rust applications that use ScyllaDB, and/or you want to contribute to this Rust driver project, here are some starting points:

- GitHub Repository: ScyllaDB Rust Driver – Contributions welcome!
- Crates.io: Scylla Crate
- Documentation: crate docs on docs.rs, the guide to the driver.

And if you have any questions, please contact us on the community forum or ScyllaDB User Slack (see the #rust-driver channel).

ScyllaDB Rust Driver 1.0 is Officially Released
The long-awaited ScyllaDB Rust Driver 1.0 is finally released. This open source project was designed to bring a stable, high-performance, production-ready CQL driver to the Rust ecosystem. Key changes in the 1.0 release include:

- Improved Stability: We removed unstable dependencies and put them behind feature flags. This keeps the driver stable, flexible, and future-proof while still allowing access to powerful-yet-unstable third-party libraries when needed.
- Refactored Error Types: The error types were significantly improved for clarity, type safety, and diagnostic information. This makes debugging easier and prevents API-breaking changes in future updates.
- Refactored Module Structure: The module structure was reorganized to better reflect abstraction layers and improve clarity. This makes the driver's architecture more understandable and simplifies importing items.
- Easier TLS Setup: Rustls support provides a Rust-native alternative to OpenSSL. This simplifies TLS configuration and avoids system library issues.
- Faster and Extended Metrics: New metrics were added, and metrics collection was optimized with an atomic histogram that reduces CPU overhead. The entire metrics module is now optional, so users who don't need it won't pay any performance cost for it.

In this post, we'll shed light on why we took this unconventionally long (years-long) path from a popular, production-ready 0.x release to a 1.0 release. We'll also share our versioning and release plans from this point forward. Inside ScyllaDB Rust Driver 1.0: A Fully Async Shard-Aware CQL Driver Using Tokio provides a deep dive into exactly what we changed and why.

The Path to "1.0"

Over the past few years, Rust Driver has proven itself to be high quality, with very few bugs compared to other drivers, as well as better performance. It is successfully used by customers in production, and by us internally.
By all means, we have considered it fully production-ready for a long time. Given that, why did we keep releasing 0.x versions?

Although we were confident in the driver's quality, we weren't satisfied with some aspects of its API. Keeping the version at 0.x was our way of saying that breaking changes were to be expected often. Frequent breaking changes are not great for our users: instead of just updating the driver, they have to adjust their code after pretty much every update. Moreover, 0.x version numbers suggest that the driver is not actually production-ready (while in this case, it truly was). So we really wanted to release a 1.0 version.

One option was to just call one of the previous versions (e.g. 0.9) version 1.0 and be done with it. But we knew there were still many breaking changes we wanted to make – and if we kept introducing planned changes, we would quickly arrive at a high version number like 7.0. In Rust (and semver in general), 1.0 is called an "API-stable" version. There is no strict definition of that term, so it can have various interpretations. What's perfectly clear, however, is that rapidly releasing major versions – thus quickly arriving at a high major version number – does not constitute API stability. It also does nothing to help users update easily: they would still need to change their code after most updates!

We also realized that we will never achieve complete stabilization. There are, and probably always will be, things we want to improve in our API, and we don't want stability to stand in the way of driver refinement. Even if we somehow arrived at an API that we were fully satisfied with and didn't want to change at all, there is another driver of change: the databases the driver supports (ScyllaDB and Cassandra) are constantly evolving, and some of those changes may require modifying the driver API. For example, ScyllaDB recently introduced a new replication mechanism: Tablets.
It is possible to add Tablets support to a driver without a breaking change. We did that in our other drivers, which are forks, because we can't break compatibility there. However, it requires ugly workarounds. With Tablets, calculating the replica list for a request requires knowing which table the request uses. Tablets are per-table data structures, which means that different tables may have different replica sets for the same token (as opposed to the token ring, which is per-keyspace). This affects many APIs in the driver: Metadata, Load Balancing, and Request Routing, to name just a few. In Rust Driver, we could adapt those APIs cleanly, and we want to continue doing so when major changes are introduced in ScyllaDB or Cassandra.

Given those constraints, we reached a compromise. We decided to focus on the API-breaking changes we had planned and complete a big portion of them – making the API more future-proof and flexible. This reduces the risk of being forced to make unwanted API-breaking changes in the future.

What's Next for Rust Driver

Now that we've reached the long-anticipated "1.0" status, what's next? We will focus on other driver tasks that do not require changing the API. Those will be released as minor updates (1.x versions). Releasing minor versions means that our users can easily update the driver without changing their code, so they will quickly get the latest improvements.

Of course, we won't stay at 1.0 forever. We don't know exactly when the 2.0 release will happen, but we want to provide reasonable stability to make life easier for our users. We've settled on 9 months for 1.0 – so 2.0 won't be released any earlier than 9 months after the 1.0 release date. For future versions (3.0, etc.) this period may well be extended, since we will have already smoothed out more and more of the API's rough edges. When a new major version (e.g. 2.0) is released, we will keep supporting the previous major version (e.g.
1.x) with bugfixes, but no new functionality. The duration of such support is not yet decided. This will also make migration to a new major version a bit easier.

Get Started with Rust Driver 1.0

If you're ready to get started, take a look at:

- GitHub Repository: ScyllaDB Rust Driver – Contributions welcome!
- Crates.io: Scylla Crate
- Documentation: crate docs on docs.rs, the guide to the driver.

And if you have any questions, please contact us on the community forum or ScyllaDB User Slack (see the #rust-driver channel).

Upcoming ScyllaDB University LIVE and Community Forum Updates
What to expect at the upcoming ScyllaDB University LIVE training event – and what's trending on the community forum

Following up on all the interest in ScyllaDB – at Monster SCALE Summit and a whirlwind of in-person events around the world – let's continue the ScyllaDB conversation. Is ScyllaDB a good fit for your use case? How do you navigate some of the decisions you face when getting started? We're here to help! In this post, I'll update you on the upcoming ScyllaDB University LIVE training event and highlight some trending topics from the community forum.

ScyllaDB University LIVE

Our next ScyllaDB University LIVE training event will be held on Wednesday, April 9, 2025, 8 AM – 10 AM PDT. This is a free live virtual training led by our top engineers and architects. Whether you're just curious about ScyllaDB or an experienced user looking to master advanced strategies, join us for ScyllaDB University LIVE! Sessions are interactive and NOT available on-demand, so be sure to mark your calendar and attend. You will have a chance to run hands-on labs throughout the event and learn by actually doing. The team and I are preparing lots of new examples and exercises – so if you've joined before, there's a great excuse to join again. 😉

Register here

The event will feature two parallel tracks: Essentials and Advanced.

Essentials Track

The Essentials track (Getting Started with ScyllaDB) is intended for people new to ScyllaDB. I will start with a talk covering a quick overview of NoSQL and where ScyllaDB fits in the NoSQL world. Next, you will run the Quick Wins labs, in which you'll see how easy it is to start a ScyllaDB cluster, create a keyspace, create a table, and run some basic queries.
After the lab, you'll learn about ScyllaDB's basic architecture, including nodes, clusters, data replication, the Replication Factor, how the database partitions data, Consistency Levels, multiple data centers, and an example of what happens when we write data to a cluster. We'll cover data modeling fundamentals for ScyllaDB. Key concepts include the differences between NoSQL and relational data modeling, Keyspaces, Tables, Rows, CQL, the CQL shell, Partition Keys, and Clustering Keys. After that, you'll run another lab, where you'll put the data modeling theory into practice. Finally (if we have enough time left), we will discuss ScyllaDB's special shard-aware drivers.

The next part of this session is led by Attila Toth. Here, we'll walk through a real-world application and see how the concepts from the previous talk come into play. We'll also use a lab where you can do the coding and test it yourself. Additionally, you will see a demo application running one million ops/sec with single-digit millisecond latency, and learn how to run this demo yourself.

Advanced Track

In the Advanced track (Extreme Elasticity and Performance), Tzach Livyatan and Felipe Mendes will take a deep dive into ScyllaDB's unique features and tooling, such as Workload Prioritization, as well as advanced data modeling and tips for using counters and Time To Live (TTL). You'll learn how ScyllaDB's new Tablets feature enables extreme elasticity without any downtime, and how to run multiple workloads on a single cluster. The two talks in this track will also use multiple labs that you can run yourself during the event.

Before the event, please make sure you have a ScyllaDB University account (free). We will use this platform during the event for the hands-on labs.
Register on ScyllaDB University

Trending Topics on the Community Forum

The community forum is the place to discuss anything ScyllaDB and NoSQL related, learn from your peers, share how you're using ScyllaDB, and ask questions about your use case. It's where you can read our co-founder and CTO Avi Kivity's popular weekly "Last week in scylladb.git master" update (for example, here). It's also the place to learn about new releases and events.

Say Hello here

Many of the new topics focus on performance issues, troubleshooting, specific use case questions, and general data modeling. Many of the recent discussions have been about Tablets and how this feature affects performance and elasticity. Here's a summary of some of the top topics since my last update.

- Hot partitions and latency spikes: A user asked about latency spikes, hot partitions, and how to detect them. Key insights shared in this discussion emphasize the importance of understanding compaction settings and implementing strategies to mitigate tombstone accumulation.
- Upgrade paths and Tablets integration: The introduction of the Tablets feature led to significant discussions regarding its adoption for scaling purposes. A user discussed the process of enabling this feature after an upgrade, and its effects on performance, in posts like this one.
- General cluster management support: Contributors actively assisted newcomers by clarifying admin procedures such as schema migrations, compaction, and SSTable behavior. One such discussion deals with the process for gracefully stopping ScyllaDB.
- Data modeling: A popular topic was data modeling and the data model's effect on performance for specific use cases. Users exchanged ideas on addressing challenges tied to row-level reads, batching, drivers, and the implications of large (and hot) partitions. One such discussion dealt with data modeling when subgroups of data have large volume disparities.
Alternator, the DynamoDB-compatible API, was also a popular topic. Users asked how views work under the hood with Alternator, as well as other questions related to DynamoDB compatibility and performance.

Hope to see you at the ScyllaDB University LIVE event! Meanwhile, stay in touch.

A Decade of Apache Cassandra® Data Modeling
Data modeling has been a challenge with Apache Cassandra for as long as the project has been around. After a decade, we have tools and functions at our disposal that can help us better solve this problem from a developer's perspective.

Introduction to similarity search: Part 2 – Simplifying with Apache Cassandra® 5's new vector data type
In Part 1 of this series, we explored how you can combine Cassandra 4 and OpenSearch to perform similarity searches with word embeddings. While that approach is powerful, it requires managing two different systems.
But with the release of Cassandra 5, things become much simpler.
Cassandra 5 introduces a native VECTOR data type and built-in Vector Search capabilities, simplifying the architecture by enabling Cassandra 5 to handle storage, indexing, and querying seamlessly within a single system.
Now in Part 2, we’ll dive into how Cassandra 5 streamlines the process of working with word embeddings for similarity search. We’ll walk through how the new vector data type works, how to store and query embeddings, and how the Storage-Attached Indexing (SAI) feature enhances your ability to efficiently search through large datasets.
The power of vector search in Cassandra 5
Vector search is a game-changing feature added in Cassandra 5 that enables you to perform similarity searches directly within the database. This is especially useful for AI applications, where embeddings are used to represent data like text or images as high-dimensional vectors. The goal of vector search is to find the closest matches to these vectors, which is critical for tasks like product recommendations or image recognition.
The key to this functionality lies in embeddings: arrays of floating-point numbers that capture an object's semantic features, so that similar objects map to nearby vectors. By storing these embeddings as vectors in Cassandra, you can use Vector Search to find connections in your data that may not be obvious through traditional queries.
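To make the idea of "nearby vectors" concrete, here is a minimal, self-contained sketch in plain Python (no Cassandra required) of how cosine similarity scores the closeness of two toy embeddings. The 4-dimensional vectors and their values are invented purely for illustration; real models produce hundreds of dimensions:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity: 1.0 means identical direction, near 0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 4-dimensional "embeddings" (made up for the example)
cat = [0.9, 0.1, 0.0, 0.2]
kitten = [0.85, 0.15, 0.05, 0.25]   # semantically close to "cat"
car = [0.0, 0.9, 0.8, 0.1]          # unrelated concept

print(cosine_similarity(cat, kitten))  # high score (close to 1.0)
print(cosine_similarity(cat, car))     # much lower score
```

Because an embedding model places semantically related items close together, the "cat"/"kitten" pair scores far higher than "cat"/"car" even though no query ever compares the words themselves.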
How vectors work
Vectors are fixed-size sequences of non-null values, much like lists. However, in Cassandra 5, you cannot modify individual elements of a vector — you must replace the entire vector if you need to update it. This makes vectors ideal for storing embeddings, where you need to work with the whole data structure at once.
When working with embeddings, you’ll typically store them as vectors of floating-point numbers to represent the semantic meaning.
Storage-Attached Indexing (SAI): The engine behind vector search
Vector Search in Cassandra 5 is powered by Storage-Attached Indexing, which enables high-performance indexing and querying of vector data. SAI is essential for Vector Search, providing the ability to create column-level indexes on vector data types. This ensures that your vector queries are both fast and scalable, even with large datasets.
SAI isn’t just limited to vectors—it also indexes other types of data, making it a versatile tool for boosting the performance of your queries across the board.
Example: Performing similarity search with Cassandra 5’s vector data type
Now that we’ve introduced the new vector data type and the power of Vector Search in Cassandra 5, let’s dive into a practical example. In this section, we’ll show how to set up a table to store embeddings, insert data, and perform similarity searches directly within Cassandra.
Step 1: Setting up the embeddings table
To get started with this example, you’ll need access to a Cassandra 5 cluster. Cassandra 5 introduces native support for vector data types and Vector Search, available on Instaclustr’s managed platform. Once you have your cluster up and running, the first step is to create a table to store the embeddings. We’ll also create an index on the vector column to optimize similarity searches using SAI.
CREATE KEYSPACE aisearch WITH REPLICATION = {'class': 'SimpleStrategy', 'replication_factor': 1};

CREATE TABLE IF NOT EXISTS embeddings (
    id UUID,
    paragraph_uuid UUID,
    filename TEXT,
    embeddings vector<float, 300>,
    text TEXT,
    last_updated timestamp,
    PRIMARY KEY (id, paragraph_uuid)
);

CREATE INDEX IF NOT EXISTS ann_index ON embeddings(embeddings) USING 'sai';
This setup allows us to store the embeddings as 300-dimensional vectors, along with metadata like file names and text. The SAI index will be used to speed up similarity searches on the embeddings column.
You can also fine-tune the index by specifying the similarity function to be used for vector comparisons. Cassandra 5 supports three types of similarity functions: DOT_PRODUCT, COSINE, and EUCLIDEAN. By default, the similarity function is set to COSINE, but you can specify your preferred method when creating the index:
CREATE INDEX IF NOT EXISTS ann_index ON embeddings(embeddings) USING 'sai' WITH OPTIONS = { 'similarity_function': 'DOT_PRODUCT' };
Each similarity function has its own advantages depending on your use case. DOT_PRODUCT is often used when you need to measure the direction and magnitude of vectors, COSINE is ideal for comparing the angle between vectors, and EUCLIDEAN calculates the straight-line distance between vectors. By selecting the appropriate function, you can optimize your search results to better match the needs of your application.
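To see how the three functions behave differently, here is a small self-contained Python sketch (the vectors are invented for the example). Note how two vectors pointing in the same direction score a perfect cosine similarity, while the dot product and Euclidean distance both reflect the difference in magnitude:

```python
import math

def dot_product(a, b):
    """Sensitive to both direction and magnitude."""
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    """Compares direction only: magnitude is normalized away."""
    return dot_product(a, b) / (
        math.sqrt(dot_product(a, a)) * math.sqrt(dot_product(b, b))
    )

def euclidean(a, b):
    """Straight-line distance between the vector endpoints."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

a = [1.0, 2.0, 3.0]
b = [2.0, 4.0, 6.0]  # same direction as a, but twice the magnitude

print(dot_product(a, b))  # 28.0 -- grows with magnitude
print(cosine(a, b))       # ~1.0 -- identical direction
print(euclidean(a, b))    # ~3.742 -- sqrt(14), distance between endpoints
```

This is why COSINE is a common default for text embeddings (where direction carries the meaning), while EUCLIDEAN or DOT_PRODUCT can be better when magnitude is informative.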
Step 2: Inserting embeddings into Cassandra 5
To insert embeddings into Cassandra 5, we can use the same code from the first part of this series to extract text from files, load the FastText model, and generate the embeddings. Once the embeddings are generated, the following function will insert them into Cassandra:
import time
from uuid import uuid4, UUID
from cassandra.cluster import Cluster
from cassandra.policies import DCAwareRoundRobinPolicy
from cassandra.auth import PlainTextAuthProvider

# Connect to the cluster
cluster = Cluster(
    # Replace with your node IP list
    ["xxx.xxx.xxx.xxx", "xxx.xxx.xxx.xxx", "xxx.xxx.xxx.xxx"],
    # Update the local data centre if needed
    load_balancing_policy=DCAwareRoundRobinPolicy(local_dc='AWS_VPC_US_EAST_1'),
    port=9042,
    auth_provider=PlainTextAuthProvider(
        username='iccassandra',
        password='replace_with_your_password'
    )
)
session = cluster.connect()
print('Connected to cluster %s' % cluster.metadata.cluster_name)

def insert_embedding_to_cassandra(session, embedding, id=None, paragraph_uuid=None,
                                  filename=None, text=None, keyspace_name=None):
    try:
        embeddings = list(map(float, embedding))
        # Generate UUIDs if not provided
        if id is None:
            id = uuid4()
        if paragraph_uuid is None:
            paragraph_uuid = uuid4()
        # Ensure id and paragraph_uuid are UUID objects
        if isinstance(id, str):
            id = UUID(id)
        if isinstance(paragraph_uuid, str):
            paragraph_uuid = UUID(paragraph_uuid)
        # Create the query string with placeholders
        insert_query = f"""
        INSERT INTO {keyspace_name}.embeddings (id, paragraph_uuid, filename, embeddings, text, last_updated)
        VALUES (?, ?, ?, ?, ?, toTimestamp(now()))
        """
        # Create a prepared statement and execute it
        prepared = session.prepare(insert_query)
        session.execute(prepared.bind((id, paragraph_uuid, filename, embeddings, text)))
        return None  # Successful insertion
    except Exception as e:
        return f"Failed to execute query:\nError: {str(e)}"  # Error message on failure

def insert_with_retry(session, embedding, id=None, paragraph_uuid=None, filename=None,
                      text=None, keyspace_name=None, max_retries=3, retry_delay_seconds=1):
    retry_count = 0
    while retry_count < max_retries:
        result = insert_embedding_to_cassandra(session, embedding, id, paragraph_uuid,
                                               filename, text, keyspace_name)
        if result is None:
            return True  # Successful insertion
        retry_count += 1
        print(f"Insertion failed on attempt {retry_count} with error: {result}")
        if retry_count < max_retries:
            time.sleep(retry_delay_seconds)  # Delay before the next retry
    return False  # Failed after max_retries

# Replace the file path pointing to the desired file
file_path = "/path/to/Cassandra-Best-Practices.pdf"
paragraphs_with_embeddings = extract_text_with_page_number_and_embeddings(file_path)

from tqdm import tqdm

for paragraph in tqdm(paragraphs_with_embeddings, desc="Inserting paragraphs"):
    if not insert_with_retry(
        session=session,
        embedding=paragraph['embedding'],
        id=paragraph['uuid'],
        paragraph_uuid=paragraph['paragraph_uuid'],
        text=paragraph['text'],
        filename=paragraph['filename'],
        keyspace_name=keyspace_name,
        max_retries=3,
        retry_delay_seconds=1
    ):
        # Display an error message if insertion fails
        tqdm.write(f"Insertion failed after maximum retries for UUID {paragraph['uuid']}: {paragraph['text'][:50]}...")
This function handles inserting embeddings and metadata into Cassandra, ensuring that UUIDs are correctly generated for each entry.
Step 3: Performing similarity searches in Cassandra 5
Once the embeddings are stored, we can perform similarity searches directly within Cassandra using the following function:
import numpy as np

# ------------------ Embedding Functions ------------------

def text_to_vector(text):
    """Convert a text chunk into a vector using the FastText model."""
    words = text.split()
    vectors = [fasttext_model[word] for word in words if word in fasttext_model.key_to_index]
    return np.mean(vectors, axis=0) if vectors else np.zeros(fasttext_model.vector_size)

def find_similar_texts_cassandra(session, input_text, keyspace_name=None, top_k=5):
    # Convert the input text to an embedding
    input_embedding = text_to_vector(input_text)
    input_embedding_str = ', '.join(map(str, input_embedding.tolist()))
    # ANN query: ORDER BY ... ANN OF uses the SAI index to retrieve the nearest vectors
    query = f"""
    SELECT text, filename, similarity_cosine(embeddings, ?) AS similarity
    FROM {keyspace_name}.embeddings
    ORDER BY embeddings ANN OF [{input_embedding_str}]
    LIMIT {top_k};
    """
    prepared = session.prepare(query)
    bound = prepared.bind((input_embedding,))
    rows = session.execute(bound)
    # Sort the results by similarity in Python
    similar_texts = sorted([(row.similarity, row.filename, row.text) for row in rows],
                           key=lambda x: x[0], reverse=True)
    return similar_texts[:top_k]

from IPython.display import display, HTML

# The word you want to find similarities for
input_text = "place"

# Call the function to find similar texts in the Cassandra database
similar_texts = find_similar_texts_cassandra(session, input_text, keyspace_name="aisearch", top_k=10)
This function searches for similar embeddings in Cassandra and retrieves the top results based on cosine similarity. Under the hood, Cassandra's vector search uses the Hierarchical Navigable Small World (HNSW) algorithm. HNSW organizes data points in a multi-layer graph structure, making queries significantly faster by efficiently narrowing down the search space – particularly important when handling large datasets.
Step 4: Displaying the results
To display the results in a readable format, we can loop through the similar texts and present them along with their similarity scores:
# Print the similar texts along with their similarity scores
for similarity, filename, text in similar_texts:
    html_content = f"""
    <div style="margin-bottom: 10px;">
        <p><b>Similarity:</b> {similarity:.4f}</p>
        <p><b>Text:</b> {text}</p>
        <p><b>File:</b> {filename}</p>
    </div>
    <hr/>
    """
    display(HTML(html_content))
This code will display the top similar texts, along with their similarity scores and associated file names.
Cassandra 5 vs. Cassandra 4 + OpenSearch®
Cassandra 4 relies on an integration with OpenSearch to handle word embeddings and similarity searches. This approach works well for applications that are already using or comfortable with OpenSearch, but it does introduce additional complexity with the need to maintain two systems.
Cassandra 5, on the other hand, brings vector support directly into the database. With its native VECTOR data type and similarity search functions, it simplifies your architecture and improves performance, making it an ideal solution for applications that require embedding-based searches at scale.
| Feature | Cassandra 4 + OpenSearch | Cassandra 5 (Preview) |
|---|---|---|
| Embedding Storage | OpenSearch | Native VECTOR data type |
| Similarity Search | KNN plugin in OpenSearch | COSINE, EUCLIDEAN, DOT_PRODUCT |
| Search Method | Exact K-Nearest Neighbor | Approximate Nearest Neighbor (ANN) |
| System Complexity | Requires two systems | All-in-one Cassandra solution |
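To illustrate the "Search Method" difference, here is a hedged sketch of the exact k-nearest-neighbor scan that ANN indexes such as HNSW are designed to avoid. The `exact_knn` helper and the random dataset are invented for illustration and are not part of either system's API; the point is that an exact search must score every stored vector, which is why it stops scaling as datasets grow:

```python
import heapq
import math
import random

def euclidean(a, b):
    """Straight-line distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def exact_knn(query, dataset, k=5):
    """Exact k-nearest-neighbor: scores EVERY vector, O(n * d) per query.
    ANN indexes like HNSW instead walk a multi-layer graph, visiting only
    a small fraction of the dataset at the cost of occasional misses."""
    return heapq.nsmallest(k, dataset, key=lambda item: euclidean(query, item[1]))

# A toy corpus of 1000 random 8-dimensional "embeddings"
random.seed(42)
dataset = [(f"doc-{i}", [random.random() for _ in range(8)]) for i in range(1000)]
query = [0.5] * 8

for doc_id, vec in exact_knn(query, dataset, k=3):
    print(doc_id, round(euclidean(query, vec), 4))
```

Exact search guarantees the true nearest neighbors; ANN trades that guarantee for queries that stay fast at millions of vectors, which is the trade-off Cassandra 5 makes with HNSW.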
Conclusion: A simpler path to similarity search with Cassandra 5
With Cassandra 5, the complexity of setting up and managing a separate search system for word embeddings is gone. The new vector data type and Vector Search capabilities allow you to perform similarity searches directly within Cassandra, simplifying your architecture and making it easier to build AI-powered applications.
Coming up in future blogs: more in-depth examples and use cases that demonstrate how to take full advantage of these new features in Cassandra 5!
Ready to experience vector search with Cassandra 5? Spin up your first cluster for free on the Instaclustr Managed Platform and try it out!
The post Introduction to similarity search: Part 2–Simplifying with Apache Cassandra® 5’s new vector data type appeared first on Instaclustr.