Inside ScyllaDB Rust Driver 1.0: A Fully Async Shard-Aware CQL Driver Using Tokio
The engineering challenges and design decisions that led to the 1.0 release of ScyllaDB Rust Driver

ScyllaDB Rust driver is a client-side, shard-aware driver written in pure Rust with a fully async API using Tokio. The Rust Driver project was born back in 2021 during ScyllaDB’s internal developer hackathon. Our initial goal was to provide a native implementation of a CQL driver that’s compatible with Apache Cassandra and also contains a variety of ScyllaDB-specific optimizations. Later that year, we released ScyllaDB Rust Driver 0.2.0 on the Rust community’s package registry, crates.io. Comparative benchmarks for that early release confirmed that this driver was (more than) satisfactory in terms of performance. So we continued working on it, with the goal of an official release – and also an ambitious plan to unify other ScyllaDB-specific drivers by converting them into bindings for our Rust driver. Now that we’ve reached a major milestone for the Rust Driver project (officially releasing ScyllaDB Rust Driver 1.0), it’s time to share the challenges and design decisions that led to this 1.0 release. Learn about our versioning rationale

What’s New in ScyllaDB Rust Driver 1.0?

Along with stability, this new release brings powerful new features, better performance, and smarter design choices. Here’s a look at what we worked on and why.

Refactored Error Types

Our original error types met ad hoc needs, but weren’t ideal for long-term production use. They weren’t very type-safe, some of them stringified other errors, and they did not provide sufficient information to diagnose the error’s root cause. Some of them were severely abused – most notably ParseError
. There was a
One-To-Rule-Them-All error type: the ubiquitous
QueryError
, which many user-facing APIs used to
return.

Before: Back in 0.13 of the driver,
QueryError
looked like this:
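(Abridged sketch; a few variants and the derive attributes are omitted, and the details are approximate.)

// Abridged sketch of the 0.13-era QueryError; some variants omitted, details approximate.
// DbError and BadQuery are the driver's own types (not shown here).
use std::sync::Arc;

pub enum QueryError {
    DbError(DbError, String),          // server-side error plus a stringified message
    BadQuery(BadQuery),
    IoError(Arc<std::io::Error>),      // all sorts of errors got buried in here
    ProtocolError(&'static str),
    InvalidMessage(String),            // different error types jammed into a single string
    TimeoutError,
    TooManyOrphanedStreamIds(u16),
    UnableToAllocStreamId,             // an extremely niche error as an inline variant
    RequestTimeout(String),
}
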
Note that: The structure was painfully flat, with extremely niche errors
(such as UnableToAllocStreamId
) being just inline
variants of this enum. Many variants contained just strings. The
worst offender was InvalidMessage
, which just jammed
all sorts of different error types into a single string. Many
errors were buried inside IoError
, too. This
stringification broke the clear code path to the underlying errors,
affecting readability and causing chaos. Due to the above
omnipresent stringification, matching on error kinds was virtually
impossible. The error types were public and, at the same time, were
not decorated with the #[non_exhaustive]
attribute.
Due to this, adding any new error variant required breaking the
API! It was unacceptable for a driver that was aspiring to bear the
name of an API-stable library. In version 1.0.0, the new error
types are clearer and more helpful. The error hierarchy now
reflects the code flow. Error conversions are explicit, so no
undesired confusing conversion takes place. The one-to-fit-them-all
error type has been replaced. Instead, APIs return various error
types that exhaustively cover the possible errors, without any need
to match on error variants that can’t occur when executing a given
function. The QueryError
’s new counterpart,
ExecutionError
, looks like this:
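(Abridged sketch; several variants are omitted and the names are approximate.)

// Abridged sketch of the 1.0-era ExecutionError; variant names are approximate.
// Each variant nests the dedicated error type of the layer that produced it,
// and nothing is flattened into strings anymore.
#[non_exhaustive]
pub enum ExecutionError {
    BadQuery(BadQuery),
    EmptyPlan,                                   // the load balancer returned no nodes
    MetadataError(MetadataError),
    ConnectionPoolError(ConnectionPoolError),
    PrepareError(PrepareError),
    LastAttemptError(RequestAttemptError),       // error of the final attempt after retries
}
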
Note that: There is much more nesting, reflecting the driver’s
modules and abstraction layers. The stringification is gone! Error
types are decorated with the #[non_exhaustive]
attribute, which requires downstream crates to always have the
“else” case (like _ => { … }
) when matching on
them. This way, we prevent breaking downstream crates’ code when adding a new error variant, as in the sketch below.
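(Sketch reusing the approximate variant names from above.)

// Because ExecutionError is #[non_exhaustive], downstream crates must keep a
// wildcard arm, so a future driver release can add variants without breaking them.
fn classify(err: &ExecutionError) -> &'static str {
    match err {
        ExecutionError::BadQuery(_) => "the statement itself is wrong",
        ExecutionError::ConnectionPoolError(_) => "no usable connection",
        // Required by #[non_exhaustive] even if all current variants were listed:
        _ => "some other execution failure",
    }
}
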
Refactored Module Structure

The module
structure also stemmed from various ad-hoc decisions. Users
familiar with older releases of our driver may recall, for example,
the ubiquitous transport
module. It used to contain a
bit of absolutely everything: essentially, it was a flat bag with
no real deeper structure. Back in 0.15.1, the module structure
looked like this (omitting the modules that were not later
restructured):

transport
    load_balancing
        default.rs
        mod.rs
        plan.rs
    locator
        (submodules)
    caching_session.rs
    cluster.rs
    connection_pool.rs
    connection.rs
    downgrading_consistency_retry_policy.rs
    errors.rs
    execution_profile.rs
    host_filter.rs
    iterator.rs
    metrics.rs
    node.rs
    partitioner.rs
    query_result.rs
    retry_policy.rs
    session_builder.rs
    session_test.rs
    session.rs
    speculative_execution.rs
    topology.rs
history.rs
routing.rs

The new module structure clarifies the
driver’s separate abstraction layers. Each higher-level module is
documented with descriptions of what abstractions it should hold.
We also refined our item export policy. Before, there could be
multiple paths to import items from. Now items can be imported from
just one path: either their original paths (i.e., where they are
defined), or from their re-export paths (i.e., where they are
imported, and then re-exported from). In 1.0.0, the module
structure is the following (again, omitting the unchanged modules):
client
    caching_session.rs
    execution_profile.rs
    pager.rs
    self_identity.rs
    session_builder.rs
    session.rs
    session_test.rs
cluster
    metadata.rs
    node.rs
    state.rs
    worker.rs
network
    connection.rs
    connection_pool.rs
errors.rs (top-level module)
policies
    address_translator.rs
    host_filter.rs
    load_balancing
        default.rs
        plan.rs
    retry
        default.rs
        downgrading_consistency.rs
        fallthrough.rs
        retry_policy.rs
    speculative_execution.rs
observability
    driver_tracing.rs
    history.rs
    metrics.rs
    tracing.rs
response
    query_result.rs
    request_response.rs
routing
    locator
        (unchanged contents)
    partitioner.rs
    sharding.rs

Removed Unstable Dependencies From the
Public API With the ScyllaDB Rust Driver 1.0 release, we wanted to
fully eliminate unstable (pre-1.0) dependencies from the public
API. Instead, we now expose these dependencies through feature
flags that explicitly encode the major version number, such as
"num-bigint-03"
. Why did we do this?
API Stability & Semver Compliance – The 1.0
release promises a stable API, so breaking changes must be avoided
in future minor updates. If our public API directly depended on
pre-1.0 crates, any breaking changes in those dependencies would
force us to introduce breaking changes as well. By removing them
from the public API, we shield users from unexpected
incompatibilities. Greater Flexibility for Users –
Developers using the ScyllaDB Rust driver can now opt into specific
versions of optional dependencies via feature flags. This allows
better integration with their existing projects without being
forced to upgrade or downgrade dependencies due to our choices.
Long-Term Maintainability – By isolating unstable
dependencies, we reduce technical debt and make future updates
easier. If a dependency introduces breaking changes, we can simply
update the corresponding feature flag (e.g.,
"num-bigint-04"
) without affecting the core driver
API. Avoiding Unnecessary Dependencies – Some
users may not need certain dependencies at all. Exposing them via
opt-in feature flags helps keep the dependency tree lean, improving
compilation times and reducing potential security risks.
Improved Ecosystem Compatibility – By allowing
users to choose specific versions of dependencies, we minimize
conflicts with other crates in their projects. This is particularly
important when working with the broader Rust ecosystem, where
dependency version mismatches can lead to build failures or
unwanted upgrades. Support for Multiple Versions
Simultaneously – By namespacing dependencies with feature
flags (e.g., "num-bigint-03"
and
"num-bigint-04"
), users can leverage multiple versions
of the same dependency within their project. This is particularly
useful when integrating with other crates that may require
different versions of a shared dependency, reducing version
conflicts and easing the upgrade path. How this impacts
users: The core ScyllaDB Rust driver remains stable and
free from external pre-1.0 dependencies (with one exception: the
popular rand
crate, which is still in 0.*). If you
need functionality from an optional dependency, enable it
explicitly using the appropriate feature flag (e.g.,
"num-bigint-03"
). Future updates can introduce new
versions of dependencies under separate feature flags – without
breaking existing integrations. This change ensures that the
ScyllaDB Rust driver remains stable, flexible, and future-proof,
while still providing access to powerful third-party libraries when
needed.
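For example, a rough sketch with the "num-bigint-03" feature enabled (the table here is made up, and the Session import path follows the new module layout described above, so double-check it against the docs):

// Sketch: requires the "num-bigint-03" feature of the scylla crate,
// which maps num-bigint 0.3's BigInt to the CQL varint type.
use num_bigint::BigInt; // num-bigint 0.3

async fn store_varint(
    session: &scylla::client::session::Session, // path assumed from the new module layout
) -> Result<(), Box<dyn std::error::Error>> {
    let big = BigInt::parse_bytes(b"123456789012345678901234567890", 10).unwrap();
    session
        .query_unpaged("INSERT INTO ks.numbers (id, value) VALUES (?, ?)", (1i32, big))
        .await?;
    Ok(())
}
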
Rustls Support for TLS

The driver now supports Rustls, simplifying TLS connections and removing the need for additional
system C libraries (openssl). Previously, ScyllaDB Rust Driver only
supported OpenSSL-based TLS – like our other drivers did. However,
the Rust ecosystem has its own native TLS library:
Rustls. Rustls is designed for both performance
and security, leveraging Rust’s strong memory safety guarantees
while often outperforming
OpenSSL in real-world benchmarks. With the 1.0.0 release, we
have added Rustls as an alternative TLS backend. This gives users
more flexibility in choosing their preferred implementation.
Additional system C libraries (openssl) are no longer required to
establish secure connections. Feature-Based Backend
Selection Just as we isolated pre-1.0 dependencies via
version-encoded feature flags (see the previous section), we
applied the same strategy to TLS backends. Both
OpenSSL and Rustls are exposed
through opt-in feature flags. This allows users to explicitly
select their desired implementation and ensures: API
Stability – Users can enable TLS support without
introducing unnecessary dependencies in their projects.
Avoiding Unwanted Conflicts – Users can choose the
TLS backend that best fits their project without forcing a
dependency on OpenSSL or Rustls if they don’t need it.
Future-Proofing – If a breaking change occurs in a
TLS library, we can introduce a new feature flag (e.g.,
"rustls-023", "openssl-010"
) without modifying the
core API. Abstraction Over TLS Backends We also introduced an
abstraction layer over the TLS backends. Key enums such as
TlsProvider
, TlsContext
,
TlsConfig
and Tls
now contain variants
corresponding to each backend. This means that switching between
OpenSSL and Rustls (as well as between different versions of the
same backend) is a matter of enabling the respective feature flag
and selecting the desired variant. If you prefer Rustls, enable the
"rustls-023"
feature and use the
TlsContext::Rustls
variant. If you need OpenSSL,
enable "openssl-010"
and use
TlsContext::OpenSSL
. If you want both backends or
different versions of the same backend (in production or just to
explore), you can enable multiple features and it will “just work.”
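For instance, building a session with the Rustls backend looks roughly like this (a sketch: the tls_context setter name and the import paths are assumptions to verify against the docs):

// Sketch only: requires the "rustls-023" feature; the `tls_context` setter name
// and the import paths are assumptions, so check the 1.0 docs for the exact API.
use std::sync::Arc;
use scylla::client::session::Session;
use scylla::client::session_builder::SessionBuilder;
use scylla::network::tls::TlsContext; // assumed path

async fn connect_with_rustls() -> Result<Session, Box<dyn std::error::Error>> {
    // Plain rustls 0.23 client config trusting the Mozilla root set.
    let mut roots = rustls::RootCertStore::empty();
    roots.extend(webpki_roots::TLS_SERVER_ROOTS.iter().cloned());
    let tls = rustls::ClientConfig::builder()
        .with_root_certificates(roots)
        .with_no_client_auth();

    let session = SessionBuilder::new()
        .known_node("db.example.com:9142") // placeholder address
        .tls_context(Some(TlsContext::Rustls(Arc::new(tls)))) // assumed wiring
        .build()
        .await?;
    Ok(session)
}
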
If you don’t require TLS at all, you can exclude both, reducing
dependency overhead. Our ultimate goal with adding Rustls support
and refining TLS backend selection was to ensure that the ScyllaDB
Rust Driver is both flexible and well-integrated with the Rust
ecosystem. We hope this better accommodates users’ different
performance and security needs. The Battle For The Empty Enums We
really wanted to let users build the driver with no TLS backends
opted in. In particular, this required us to make our enums work
without any variants (i.e., as empty enums). This was a bit
tricky. For instance, one cannot match over &x
,
where x: X
is an instance of the enum, if
X
is empty. Specifically, consider the following
definition:
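(A minimal sketch reconstructed to be consistent with the compiler output below; the feature names "a" and "b" stand in for the real TLS backend features.)

// A cfg-gated enum whose variants disappear when the corresponding features
// are off, plus a match on a reference to it.
enum X {
    #[cfg(feature = "a")]
    A(String),
    #[cfg(feature = "b")]
    B(String),
}

fn handle(x: &X) {
    match x {
        #[cfg(feature = "a")]
        X::A(s) => { /* Handle it */ }
        #[cfg(feature = "b")]
        X::B(s) => { /* Handle it */ }
    }
}
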
This would not compile:

error[E0004]: non-exhaustive patterns: type `&X` is non-empty
   --> scylla/src/network/tls.rs:230:11
    |
230 |     match x {
    |           ^
    |
note: `X` defined here
   --> scylla/src/network/tls.rs:223:6
    |
223 | enum X {
    |      ^
    = note: the matched value is of type `&X`
    = note: references are always considered inhabited
help: ensure that all possible cases are being handled by adding a match arm with a wildcard pattern as shown
    |
230 ~     match x {
231 +         _ => todo!(),
232 +     }
    |

Note that references are
always considered inhabited. Therefore, in order to
make code compile in such a case, we have to match on the value
itself, not on a reference:
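(Again a sketch, matching the error shown next.)

// Match on the dereferenced value instead of the reference; with zero variants
// this is accepted, because `*x` itself is of the (possibly empty) type X.
fn handle(x: &X) {
    match *x {
        #[cfg(feature = "a")]
        X::A(s) => { /* Handle it */ }
        #[cfg(feature = "b")]
        X::B(s) => { /* Handle it */ }
    }
}
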
But if we now enable the "a"
feature, we get
another error…

error[E0507]: cannot move out of `x` as enum variant `A` which is behind a shared reference
   --> scylla/src/network/tls.rs:230:11
    |
230 |     match *x {
    |           ^^
231 |         #[cfg(feature = "a")]
232 |         X::A(s) => { /* Handle it */ }
    |              -
    |              |
    |              data moved here
    |              move occurs because `s` has type `String`, which does not implement the `Copy` trait
    |
help: consider removing the dereference here
    |
230 -     match *x {
230 +     match x {
    |

Ugh. rustc
literally
advises us to revert the change. No luck… Then we would end up with
the same problem as before. Hmmm… Wait a moment… I vaguely remember
Rust had an obscure reserved word used for matching by reference,
ref
. Let’s try it out.
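(Sketch with ref bindings in the patterns.)

// `ref` bindings borrow out of `*x` instead of moving, so this compiles both
// with backends enabled and (thanks to matching on the value) with none at all.
fn handle(x: &X) {
    match *x {
        #[cfg(feature = "a")]
        X::A(ref s) => { /* Handle it */ }
        #[cfg(feature = "b")]
        X::B(ref s) => { /* Handle it */ }
    }
}
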
Yay, it compiles!!! This is how we made our (possibly) empty
enums work… finally!

Faster and Extended Metrics

Performance
matters. So we reworked how the driver handles metrics, eliminating
bottlenecks and reducing overhead for those who need real-time
insights. Moreover, metrics are now an opt-in feature, so you only
pay (in terms of resource consumption) for what you use. And we
added even more metrics! Background Benchmarks
showed that the driver may spend significant time logging query
latency.
Flamegraphs revealed that collecting metrics can consume up to
11.68% of CPU time!
We suspected that the culprit was contention on a mutex
guarding the metrics histogram. Even though the issue was
discovered in 2021 (!), we postponed dealing with it because the
publicly available crates didn’t yet include a lock-free histogram
(which we hoped would reduce the overhead). Lock-free
histogram As we approached the 1.0 release deadline, two
contributors (Nikodem Gapski and Dawid Pawlik) engaged with the
issue. Nikodem explored the new generation of the
histogram
crate and discovered that someone had added
a lock-free histogram: AtomicHistogram
. “Great”, he
thought. “This is exactly what’s needed.” Then, he discovered that
AtomicHistogram
is flawed: there’s a logical race due
to insufficient synchronization! To fix the problem, he ported the
Go implementation of LockFreeHistogram
from
Prometheus, which prevents logical races at the cost of execution
time (though it was still performing much better than a mutex).
If you are interested in all the details about what was wrong
with AtomicHistogram
and how
LockFreeHistogram
tries to solve it, see the
discussion in this PR. Eventually, the
histogram
crate’s maintainer joined the discussion and
convinced us that the skew caused by the logical races in
AtomicHistogram
is benign. Long story short, histogram
is a bit skewed anyway, and we need to accept it. In the end, we
accepted AtomicHistogram
for its lower overhead
compared to LockFreeHistogram
.
LockFreeHistogram
is still available on
its author’s dedicated branch. We left ourselves a way to
replace one histogram implementation with another if we decide it’s
needed. More metrics The Rust driver is a proud
base for the cpp-rust-driver
(a rewrite of cpp-driver as a thin
bindings layer on top of – as you can probably guess at this point
– the Rust driver). Before cpp-driver functionalities could be
implemented in cpp-rust-driver, they had to be implemented in the
Rust driver first. This was the case for some metrics, too. The
same two contributors took care of that, too. (Btw, thanks, guys!
Some cool sea monster swag will be coming your way).
Metrics as an opt-in Not every driver user needs
metrics. In fact, it’s quite probable that most users don’t check
them even once. So why force users to pay (in terms of resource
consumption) for metrics they’re not using? To avoid this, we put
the metrics module behind the "metrics"
feature (which
is disabled by default). Even more performance gain!
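With the feature enabled, reading the counters looks roughly like this (a sketch; the accessor names follow the pre-1.0 Metrics API and may differ slightly in 1.0):

// Sketch: requires the off-by-default "metrics" feature; accessor names assumed.
use scylla::client::session::Session; // path assumed from the new module layout

fn report(session: &Session) {
    let metrics = session.get_metrics();
    println!("requests total: {}", metrics.get_queries_num());
    println!("paged requests: {}", metrics.get_queries_iter_num());
    println!("errors:         {}", metrics.get_errors_num());
    if let Ok(p999) = metrics.get_latency_percentile_ms(99.9) {
        println!("p99.9 latency:  {} ms", p999);
    }
}
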
For a comprehensive list of changes introduced in the 1.0 release,
see our release notes. Stepping Stones on the Path to the 1.0
Release We’ve been working towards this 1.0 release for years, and
it involved a lot of incremental improvements that we rolled out in
minor releases along the way. Here’s a look at the most notable
ones. Ser/De (from versions 0.11 and 0.15) Previous releases
reworked the serialization and deserialization APIs to improve
safety and efficiency. In short, the 0.11 release introduced a
revamped serialization API that leverages Rust’s type system to
catch misserialization issues early. And the 0.15 release refined
deserialization for better performance and memory efficiency. Here
are more details. Serialization API Refactor (released in
0.11): Leverage Rust’s Powerful Type System to Prevent
Misserialization — For Safer and More Robust Query Binding
Before 0.11, the driver’s serialization API had several pitfalls,
particularly around type safety. The old approach relied on loosely
structured traits and structs (Value
,
ValueList
, SerializedValues
,
BatchValues
, etc.), which lacked strong compile-time
guarantees. This meant that if a user mistakenly bound an incorrect
type to a query parameter, they wouldn’t receive an immediate,
clear error. Instead, they might encounter a confusing
serialization error from ScyllaDB — or, in the worst case, could
suffer from silent data corruption! To address these issues, we
introduced a redesigned serialization API that replaces the old
traits with SerializeValue
, SerializeRow
,
and new versions of BatchValues
and
SerializedValues
. This new approach enforces stronger
type safety. Now, type mismatches are caught locally at compile
time or runtime (rather than surfacing as obscure database errors
after query execution). Key benefits of this refactor include:
Early Error Detection – Incorrectly typed bind
markers now trigger clear, local errors instead of ambiguous
database-side failures. Stronger Type Safety – The
new API ensures that only compatible types can be bound to queries,
reducing the risk of subtle bugs.
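For example, binding values through a derived struct looks roughly like this (a sketch; the table is made up and the import paths may differ):

// Sketch: the derive checks each field against the statement's bind markers.
use scylla::SerializeRow;

#[derive(SerializeRow)]
struct NewUser {
    id: i64,
    name: String,
    karma: i32,
}

async fn insert_user(
    session: &scylla::client::session::Session, // path assumed from the new module layout
    user: &NewUser,
) -> Result<(), Box<dyn std::error::Error>> {
    // A type mismatch between the struct and the table surfaces as a local,
    // well-typed serialization error instead of an obscure server-side one.
    session
        .query_unpaged("INSERT INTO ks.users (id, name, karma) VALUES (?, ?, ?)", user)
        .await?;
    Ok(())
}
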
Deserialization API Refactor (released in 0.15): For Better Performance and Memory
Efficiency Prior to release 0.15, the driver’s
deserialization process was burdened with multiple inefficiencies,
slowing down applications and increasing memory usage. The first
major issue was type erasure — all values were initially converted
into the CQL-type-agnostic CqlValue before being transformed into
the user’s desired type. This unnecessary indirection introduced
additional allocations and copying, making the entire process
slower than it needed to be. But the inefficiencies didn’t stop
there. Another major flaw was the eager allocation of columns and
rows. Instead of deserializing data on demand, every column in a
row was eagerly allocated at once — whether it was needed or not.
Even worse, each page of query results was fully materialized into
a Vec<Row>
. As a result, all rows in a page were
allocated at the same time — all of them in the form of the
ephemeral CqlValue
. This usually required further
conversion to the user’s desired type and incurred allocations. For
queries returning large datasets, this led to excessive memory
usage and unnecessary CPU overhead. To fix these issues, we
introduced a completely redesigned deserialization API. The new
approach ensures that: CQL values are deserialized lazily,
directly into user-defined types, skipping
CqlValue
entirely and eliminating redundant
allocations. Columns are no longer eagerly
deserialized and allocated. Memory is used only for the
fields that are actually accessed. Rows are
streamed instead of eagerly materialized. This avoids
unnecessary bulk allocations and allows more efficient processing
of large result sets.
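Put together, typed and lazy result handling looks roughly like this (a sketch; the method names follow the 0.15+ API and the import paths may differ):

// Sketch: rows stream lazily, page by page, straight into a user-chosen type.
use futures::TryStreamExt;

async fn print_users(
    session: &scylla::client::session::Session, // path assumed from the new module layout
) -> Result<(), Box<dyn std::error::Error>> {
    let pager = session
        .query_iter("SELECT id, name FROM ks.users", &[])
        .await?;
    // No intermediate CqlValue and no Vec<Row>: each row is decoded on demand.
    let mut rows = pager.rows_stream::<(i64, String)>()?;
    while let Some((id, name)) = rows.try_next().await? {
        println!("{id}: {name}");
    }
    Ok(())
}
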
Paging API (released in 0.14)

We heard from our users that the driver’s API for executing queries was prone to
misuse with regard to query paging. For instance, the
Session::query()
and Session::execute()
methods would silently return only the first page of the result if
page size was set on the statement. On the other hand, if page size
was not set, those methods would perform unpaged queries, putting
high and undesirable load on the cluster. Furthermore,
Session::query_paged()
and
Session::execute_paged()
would only fetch a single
page! (if page size was set on the statement; otherwise, the query
would not be paged…!!!) To combat this: We decided
to redesign the paging API in a way that no other driver had done
before. We concluded that the API must be crystal clear about
paging, and that paging will be controlled by the method used, not
by the statement itself. We ditched query()
and
query_paged()
(as well as their execute
counterparts), replacing them with query_unpaged()
and
query_single_page()
, respectively (similarly for
execute*
). We separated the setting of page size from
the paging method itself. Page size is now mandatory on the
statement (before, it was optional). The paging method (no paging,
manual paging, transparent automated paging) is now selected by
using different session methods
({query,execute}_unpaged()
,
{query,execute}_single_page()
, and
{query,execute}_iter()
, respectively). This separation
is likely the most important change we made to help users avoid
footguns and pitfalls. We introduced strongly typed PagingState and
PagingStateResponse abstractions. This made it clearer how to use
manual paging (available using
{query,execute}_single_page()
).
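Roughly, manual paging now looks like this (a sketch; the exact paths and signatures should be checked against the docs):

// Sketch of manual paging with PagingState / PagingStateResponse; paths assumed.
use scylla::client::session::Session;
use scylla::response::PagingStateResponse;
use scylla::statement::PagingState;

async fn fetch_all(session: &Session) -> Result<(), Box<dyn std::error::Error>> {
    let mut paging_state = PagingState::start();
    loop {
        let (page, response) = session
            .query_single_page("SELECT id, name FROM ks.users", &[], paging_state)
            .await?;
        // ... consume `page` (a single page of rows) here ...
        match response {
            PagingStateResponse::HasMorePages { state } => paging_state = state,
            PagingStateResponse::NoMorePages => break,
        }
    }
    Ok(())
}
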
Ultimately, we provided
a cheat sheet in the Docs that describes best practices
regarding statement execution. Looking Ahead The journey doesn’t
stop here. We have many ideas for possible future driver
improvements: Adding a prelude
module
containing commonly used driver functionality. More
performance optimizations to push the limits of
scalability (and benchmarks to track how we’re doing). Extending
CQL execution APIs to combine transparent paging
with zero-copy deserialization, and introducing
BoundStatement
. Designing our own test
harness to enable cluster sharing and reuse between tests
(with hopes of speeding up test suite execution and encouraging
people to write more tests). Reworking CQL execution
APIs for less code duplication and better usability.
Introducing QueryDisplayer
to pretty print results of
the query in a tabular way, similarly to the cqlsh
tool. (In our dreams) Rewriting cqlsh
(based on Python
driver) with cqlsh-rs (a wrapper over Rust driver). And of course,
we’re always eager to hear from the community — your feedback helps
shape the future of the driver! Get Started with ScyllaDB Rust
Driver 1.0 If you’re working on cool Rust applications that use
ScyllaDB and/or you want to contribute to this Rust driver project,
here are some starting points. GitHub Repository:
ScyllaDB
Rust Driver – Contributions welcome!
Crates.io: Scylla Crate
Documentation: crate docs on docs.rs,
the guide
to the driver. And if you have any questions, please contact us
on the community forum or
ScyllaDB User Slack (see
the #rust-driver channel).

ScyllaDB Rust Driver 1.0 is Officially Released
The long-awaited ScyllaDB Rust Driver 1.0 is finally released. This open source project was designed to bring a stable, high-performance, and production-ready CQL driver to the Rust ecosystem. Key changes in the 1.0 release include: Improved Stability: We removed unstable dependencies and put them behind feature flags. This keeps the driver stable, flexible, and future-proof while still allowing access to powerful-yet-unstable third-party libraries when needed. Refactored Error Types: The error types were significantly improved for clarity, type safety, and diagnostic information. This makes debugging easier and prevents API-breaking changes in future updates. Refactored Module Structure: The module structure was reorganized to better reflect abstraction layers and improve clarity. This makes the driver’s architecture more understandable and simplifies importing items. Easier TLS Setup: Rustls support provides a Rust-native alternative to openssl. This simplifies TLS configuration and can prevent system library issues. Faster and Extended Metrics: New metrics were added and metrics were optimized using an atomic histogram that reduces CPU overhead. The entire metrics module is now optional – so users who don’t care about it won’t suffer any performance impacts from it. Read the release notes In this post, we’ll shed light on why we took this unconventionally extensive (years long) path from a popular production-ready 0.x release to a 1.0 release. We’ll also share our versioning/release plans from this point forward. Inside ScyllaDB Rust Driver 1.0: A Fully Async Shard-Aware CQL Driver Using Tokio provides a deep dive into exactly what we changed and why. Read the deep dive into what we changed and why The Path to “1.0” Over the past few years, Rust Driver has proven itself to be high quality, with very few bugs compared to other drivers as well as better performance. It is successfully used by customers in production, and by us internally. By all means, we have considered it fully production-ready for a long time. Given that, why did we keep releasing 0.x versions? Although we were confident in the driver’s quality, we weren’t satisfied with some aspects of its API. Keeping the version at 0.x was our way of saying that breaking changes are expected often. Frequent breaking changes are not really great for our users. Instead of just updating the driver, they have to adjust their code after pretty much every update. However, 0.x version numbers suggest that the driver is not actually production-ready (but in this case, it truly was). So we really wanted to release a 1.0 version. One option was to just call one of the previous versions (e.g. 0.9) version 1.0 and be done with it. But we knew there were still many breaking changes we wanted to make – and if we kept introducing planned changes, we would quickly arrive at a high version number like 7.0. In Rust (and semver in general) 1.0 is called an “API-stable” version. There is no definition of that term, so it can have various interpretations. What’s perfectly clear, however, is that rapidly releasing major versions – thus quickly arriving at a high major version number – does not constitute API stability. It also does nothing to help users easily update. They would still need to change their code after most updates! We also realized that we will never be able to achieve complete stabilization. There are, and will probably always be, things that we want to improve in our API. We don’t want stability to stand in the way of driver refinement. 
Even if we somehow achieve an API that we are fully satisfied with, that we don’t want to change at all, there is another reason for change: the databases that the driver supports (ScyllaDB and Cassandra) are constantly changing, and some of those changes may require modifying the driver API. For example, ScyllaDB recently introduced a new replication mechanism: Tablets. It is possible to add Tablets support to a driver without a breaking change. We did that in our other drivers, which are forks, because we can’t break compatibility there. However, it requires ugly workarounds. With Tablets, calculating a replica list for a request requires knowing which table the request uses. Tablets are per-table data structures, which means that different tables may have different replica sets for the same token (as opposed to the token ring, which is per-keyspace). This affects many APIs in the driver: Metadata, Load Balancing, and Request Routing, to name just a few. In Rust Driver, we could nicely adapt those APIs, and we want to continue doing so when major changes are introduced in ScyllaDB or Cassandra. Given those restrictions, we reached a compromise. We decided to focus on the API-breaking changes we had planned and complete a big portion of them – making the API more future-proof and flexible. This reduces the risk of being forced to make unwanted API-breaking changes in the future. What’s Next for Rust Driver Now that we’ve reached the long-anticipated “1.0” status, what’s next? We will focus on other driver tasks that do not require changing the API. Those will be released as minor updates (1.x versions). Releasing minor versions means that our users can easily update the driver without changing their code, and so they will quickly get the latest improvements. Of course, we won’t stay at 1.0 forever. We don’t know exactly when the 2.0 release will happen, but we want to provide some reasonable stability to make life easier for our users. We’ve settled on 9 months for 1.0 – so 2.0 won’t be released any earlier than 9 months after the 1.0 release date. For future versions (3.0, etc) this time may (almost certainly) be increased since we will have already smoothed out more and more API rough edges. When a new major version (e.g. 2.0) is released, we will keep supporting the previous major version (e.g. 1.x) with bugfixes, but no new functionalities. The duration of such support is not yet decided. This will also make the migration to a new major version a bit easier. Get Started with Rust Driver 1.0 If you’re ready to get started, take a look at: GitHub Repository: ScyllaDB Rust Driver – Contributions welcome! Crates.io: Scylla Crate Documentation: crate docs on docs.rs, the guide to the driver. And if you have any questions, please contact us on the community forum or ScyllaDB User Slack (see the #rust-driver channel).Upcoming ScyllaDB University LIVE and Community Forum Updates
What to expect at the upcoming ScyllaDB University Live training event – and what’s trending on the community forum Following up on all the interest in ScyllaDB – at Monster SCALE Summit and a whirlwind of in-person events around the world – let’s continue the ScyllaDB conversation. Is ScyllaDB a good fit for your use case? How do you navigate some of the decisions you face when getting started? We’re here to help! In this post, I’ll update you about the upcoming ScyllaDB University Live training event and highlight some trending topics from the community forum. ScyllaDB University LIVE Our next ScyllaDB University LIVE training event will be held on Wednesday, April 9, 2025, 8 AM PDT – 10 AM PDT. This is a free live virtual training led by our top engineers and architects. Whether you’re just curious about ScyllaDB or an experienced user looking to master advanced strategies, join us for ScyllaDB University LIVE! Sessions are interactive and NOT available on-demand – be sure to mark your calendar and attend! The event will be interactive, and you will have a chance to run some hands-on labs throughout the event, and learn by actually doing. The team and I are preparing lots of new examples and exercises – so if you’ve joined before, there’s a great excuse to join again. 😉 Register here In the event, there will be two parallel tracks, Essentials and Advanced. Essentials Track The Essentials track (Getting Started with ScyllaDB) is intended for people new to ScyllaDB. I will start with a talk covering a quick overview of NoSQL and where ScyllaDB fits in the NoSQL world. Next, you will run the Quick Wins labs, in which you’ll see how easy it is to start a ScyllaDB cluster, create a keyspace, create a table, and run some basic queries. After the lab, you’ll learn about ScyllaDB’s basic architecture, including a node, cluster, data replication, Replication Factor, how the database partitions data, Consistency Level, multiple data centers, and an example of what happens when we write data to a cluster. We’ll cover data modeling fundamentals for ScyllaDB. Key concepts include the difference in data modeling between NoSQL and Relational databases, Keyspace, Table, Row, CQL, the CQL shell, Partition Key, and Clustering Key. After that, you’ll run another lab, where you’ll put the data modeling theory into practice. Finally (if we have enough time left), we will discuss ScyllaDB’s special shard-aware drivers. The next part of this session is led by Attila Toth. Here, we’ll walk through a real-world application and understand how the different concepts from the previous talk come into play. We’ll also use a lab where you can do the coding and test it yourself. Additionally, you will see a demo application running one million ops/sec with single-digit millisecond latency and learn how to run this demo yourself. Advanced Track In the Advanced Track (Extreme Elasticity and Performance) by Tzach Livyatan and Felipe Mendes, you will take a deep dive into ScyllaDB’s unique features and tooling such as Workload Prioritization as well as advanced data modeling, and tips for using counters and Time To Live (TTL). You’ll learn how ScyllaDB’s new Tablets feature enables extreme elasticity without any downtime and how to have multiple workloads on a single cluster. The two talks in this track will also use multiple labs that you can run yourself during the event. Before the event, please make sure you have a ScyllaDB University account (free). We will use this platform during the event for the hands-on labs. 
Register on ScyllaDB University Trending Topics on the Community Forum The community forum is the place to discuss anything ScyllaDB and NoSQL related, learn from your peers, share how you’re using ScyllaDB, and ask questions about your use case. It’s where you can read Avi Kivity’s, our co-founder and CTO’s, popular, weekly Last week in scylladb.git master update (for example here). It’s also the place to learn about new releases and events. Say Hello here Many of the new topics focus on performance issues, troubleshooting, specific use case questions and general data modeling questions. Many of the recent discussions have been about Tablets and how this feature affects performance and elasticity. Here’s a summary of some of the top topics since my last update. Hot partitions: A user asked about latency spikes, hot partitions, and how to detect this. Key insights shared in this discussion emphasize the importance of understanding compaction settings and implementing strategies to mitigate tombstone accumulation. Upgrade paths and Tablets integration: The introduction of the Tablets feature led to significant discussions regarding its adoption for scaling purposes. A user discussed the processes of enabling this feature after an upgrade, and its effects on performance in posts like this one. General cluster management support: Different contributors actively assisted newcomers by clarifying different admin procedures, such as addressing schema migrations, compaction, and SSTable behavior. An example of such a discussion deals with the process for gracefully stopping ScyllaDB. Data modeling: A popular topic was data modeling and the data model’s effect on performance for specific use cases. Users exchanged ideas on addressing challenges tied to row-level reads, batching, drivers, and the implications of large (and hot) partitions. One such discussion dealt with data modeling when having subgroups of data with volume disparity. Alternator: The DynamoDB-compatible API was a popular topic. Users asked about how views work under the hood with Alternator as well as other questions related to compatibility with DynamoDB and performance. Hope to see you at the ScyllaDB University Live event! Meanwhile, stay in touch.

A Decade of Apache Cassandra® Data Modeling
Data modeling has been a challenge with Apache Cassandra for as long as the project has been around. After a decade, we have tools and functions at our disposal that can help us to better solve this problem from a developer’s perspective.

Introduction to similarity search: Part 2 – Simplifying with Apache Cassandra® 5’s new vector data type
In Part 1 of this series, we explored how you can combine Cassandra 4 and OpenSearch to perform similarity searches with word embeddings. While that approach is powerful, it requires managing two different systems.
But with the release of Cassandra 5, things become much simpler.
Cassandra 5 introduces a native VECTOR data type and built-in Vector Search capabilities, simplifying the architecture by enabling Cassandra 5 to handle storage, indexing, and querying seamlessly within a single system.
Now in Part 2, we’ll dive into how Cassandra 5 streamlines the process of working with word embeddings for similarity search. We’ll walk through how the new vector data type works, how to store and query embeddings, and how the Storage-Attached Indexing (SAI) feature enhances your ability to efficiently search through large datasets.
The power of vector search in Cassandra 5
Vector search is a game-changing feature added in Cassandra 5 that enables you to perform similarity searches directly within the database. This is especially useful for AI applications, where embeddings are used to represent data like text or images as high-dimensional vectors. The goal of vector search is to find the closest matches to these vectors, which is critical for tasks like product recommendations or image recognition.
The key to this functionality lies in embeddings: arrays of floating-point numbers that represent the similarity of objects. By storing these embeddings as vectors in Cassandra, you can use Vector Search to find connections in your data that may not be obvious through traditional queries.
How vectors work
Vectors are fixed-size sequences of non-null values, much like lists. However, in Cassandra 5, you cannot modify individual elements of a vector — you must replace the entire vector if you need to update it. This makes vectors ideal for storing embeddings, where you need to work with the whole data structure at once.
When working with embeddings, you’ll typically store them as vectors of floating-point numbers to represent the semantic meaning.
Storage-Attached Indexing (SAI): The engine behind vector search
Vector Search in Cassandra 5 is powered by Storage-Attached Indexing, which enables high-performance indexing and querying of vector data. SAI is essential for Vector Search, providing the ability to create column-level indexes on vector data types. This ensures that your vector queries are both fast and scalable, even with large datasets.
SAI isn’t just limited to vectors—it also indexes other types of data, making it a versatile tool for boosting the performance of your queries across the board.
Example: Performing similarity search with Cassandra 5’s vector data type
Now that we’ve introduced the new vector data type and the power of Vector Search in Cassandra 5, let’s dive into a practical example. In this section, we’ll show how to set up a table to store embeddings, insert data, and perform similarity searches directly within Cassandra.
Step 1: Setting up the embeddings table
To get started with this example, you’ll need access to a Cassandra 5 cluster. Cassandra 5 introduces native support for vector data types and Vector Search, available on Instaclustr’s managed platform. Once you have your cluster up and running, the first step is to create a table to store the embeddings. We’ll also create an index on the vector column to optimize similarity searches using SAI.
CREATE KEYSPACE aisearch WITH REPLICATION = {'class': 'SimpleStrategy', 'replication_factor': 1};

CREATE TABLE IF NOT EXISTS embeddings (
    id UUID,
    paragraph_uuid UUID,
    filename TEXT,
    embeddings vector<float, 300>,
    text TEXT,
    last_updated timestamp,
    PRIMARY KEY (id, paragraph_uuid)
);

CREATE INDEX IF NOT EXISTS ann_index ON embeddings(embeddings) USING 'sai';
This setup allows us to store the embeddings as 300-dimensional vectors, along with metadata like file names and text. The SAI index will be used to speed up similarity searches on the embedding’s column.
You can also fine-tune the index by specifying the similarity function to be used for vector comparisons. Cassandra 5 supports three types of similarity functions: DOT_PRODUCT, COSINE, and EUCLIDEAN. By default, the similarity function is set to COSINE, but you can specify your preferred method when creating the index:
CREATE INDEX IF NOT EXISTS ann_index ON embeddings(embeddings) USING 'sai' WITH OPTIONS = { 'similarity_function': 'DOT_PRODUCT' };
Each similarity function has its own advantages depending on your use case. DOT_PRODUCT is often used when you need to measure the direction and magnitude of vectors, COSINE is ideal for comparing the angle between vectors, and EUCLIDEAN calculates the straight-line distance between vectors. By selecting the appropriate function, you can optimize your search results to better match the needs of your application.
Step 2: Inserting embeddings into Cassandra 5
To insert embeddings into Cassandra 5, we can use the same code from the first part of this series to extract text from files, load the FastText model, and generate the embeddings. Once the embeddings are generated, the following function will insert them into Cassandra:
import time
from uuid import uuid4, UUID
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement
from cassandra.policies import DCAwareRoundRobinPolicy
from cassandra.auth import PlainTextAuthProvider
from google.colab import userdata

# Connect to the single-node cluster
cluster = Cluster(
    # Replace with your IP list
    ["xxx.xxx.xxx.xxx", "xxx.xxx.xxx.xxx", "xxx.xxx.xxx.xxx"],  # Single-node cluster address
    load_balancing_policy=DCAwareRoundRobinPolicy(local_dc='AWS_VPC_US_EAST_1'),  # Update the local data centre if needed
    port=9042,
    auth_provider=PlainTextAuthProvider(
        username='iccassandra',
        password='replace_with_your_password'
    )
)
session = cluster.connect()
print('Connected to cluster %s' % cluster.metadata.cluster_name)

def insert_embedding_to_cassandra(session, embedding, id=None, paragraph_uuid=None, filename=None, text=None, keyspace_name=None):
    try:
        embeddings = list(map(float, embedding))
        # Generate UUIDs if not provided
        if id is None:
            id = uuid4()
        if paragraph_uuid is None:
            paragraph_uuid = uuid4()
        # Ensure id and paragraph_uuid are UUID objects
        if isinstance(id, str):
            id = UUID(id)
        if isinstance(paragraph_uuid, str):
            paragraph_uuid = UUID(paragraph_uuid)
        # Create the query string with placeholders
        insert_query = f"""
        INSERT INTO {keyspace_name}.embeddings (id, paragraph_uuid, filename, embeddings, text, last_updated)
        VALUES (?, ?, ?, ?, ?, toTimestamp(now()))
        """
        # Create a prepared statement with the query
        prepared = session.prepare(insert_query)
        # Execute the query
        session.execute(prepared.bind((id, paragraph_uuid, filename, embeddings, text)))
        return None  # Successful insertion
    except Exception as e:
        error_message = f"Failed to execute query:\nError: {str(e)}"
        return error_message  # Return error message on failure

def insert_with_retry(session, embedding, id=None, paragraph_uuid=None, filename=None, text=None, keyspace_name=None, max_retries=3, retry_delay_seconds=1):
    retry_count = 0
    while retry_count < max_retries:
        result = insert_embedding_to_cassandra(session, embedding, id, paragraph_uuid, filename, text, keyspace_name)
        if result is None:
            return True  # Successful insertion
        else:
            retry_count += 1
            print(f"Insertion failed on attempt {retry_count} with error: {result}")
            if retry_count < max_retries:
                time.sleep(retry_delay_seconds)  # Delay before the next retry
    return False  # Failed after max_retries

# Replace the file path pointing to the desired file
file_path = "/path/to/Cassandra-Best-Practices.pdf"
paragraphs_with_embeddings = extract_text_with_page_number_and_embeddings(file_path)

from tqdm import tqdm

for paragraph in tqdm(paragraphs_with_embeddings, desc="Inserting paragraphs"):
    if not insert_with_retry(
        session=session,
        embedding=paragraph['embedding'],
        id=paragraph['uuid'],
        paragraph_uuid=paragraph['paragraph_uuid'],
        text=paragraph['text'],
        filename=paragraph['filename'],
        keyspace_name=keyspace_name,
        max_retries=3,
        retry_delay_seconds=1
    ):
        # Display an error message if insertion fails
        tqdm.write(f"Insertion failed after maximum retries for UUID {paragraph['uuid']}: {paragraph['text'][:50]}...")
This function handles inserting embeddings and metadata into Cassandra, ensuring that UUIDs are correctly generated for each entry.
Step 3: Performing similarity searches in Cassandra 5
Once the embeddings are stored, we can perform similarity searches directly within Cassandra using the following function:
import numpy as np

# ------------------ Embedding Functions ------------------
def text_to_vector(text):
    """Convert a text chunk into a vector using the FastText model."""
    words = text.split()
    vectors = [fasttext_model[word] for word in words if word in fasttext_model.key_to_index]
    return np.mean(vectors, axis=0) if vectors else np.zeros(fasttext_model.vector_size)

def find_similar_texts_cassandra(session, input_text, keyspace_name=None, top_k=5):
    # Convert the input text to an embedding
    input_embedding = text_to_vector(input_text)
    input_embedding_str = ', '.join(map(str, input_embedding.tolist()))
    # ANN query: ask the SAI index for the embeddings nearest to the input vector
    query = f"""
    SELECT text, filename, similarity_cosine(embeddings, ?) AS similarity
    FROM {keyspace_name}.embeddings
    ORDER BY embeddings ANN OF [{input_embedding_str}]
    LIMIT {top_k};
    """
    prepared = session.prepare(query)
    bound = prepared.bind((input_embedding,))
    rows = session.execute(bound)
    # Sort the results by similarity in Python
    similar_texts = sorted([(row.similarity, row.filename, row.text) for row in rows], key=lambda x: x[0], reverse=True)
    return similar_texts[:top_k]

from IPython.display import display, HTML

# The word you want to find similarities for
input_text = "place"

# Call the function to find similar texts in the Cassandra database
similar_texts = find_similar_texts_cassandra(session, input_text, keyspace_name="aisearch", top_k=10)
This function searches for similar embeddings in Cassandra and retrieves the top results based on cosine similarity. Under the hood, Cassandra’s vector search uses Hierarchical Navigable Small Worlds (HNSW). HNSW organizes data points in a multi-layer graph structure, making queries significantly faster by narrowing down the search space efficiently—particularly important when handling large datasets.
Step 4: Displaying the results
To display the results in a readable format, we can loop through the similar texts and present them along with their similarity scores:
# Print the similar texts along with their similarity scores
for similarity, filename, text in similar_texts:
    html_content = f"""
    <div style="margin-bottom: 10px;">
        <p><b>Similarity:</b> {similarity:.4f}</p>
        <p><b>Text:</b> {text}</p>
        <p><b>File:</b> {filename}</p>
    </div>
    <hr/>
    """
    display(HTML(html_content))
This code will display the top similar texts, along with their similarity scores and associated file names.
Cassandra 5 vs. Cassandra 4 + OpenSearch®
Cassandra 4 relies on an integration with OpenSearch to handle word embeddings and similarity searches. This approach works well for applications that are already using or comfortable with OpenSearch, but it does introduce additional complexity with the need to maintain two systems.
Cassandra 5, on the other hand, brings vector support directly into the database. With its native VECTOR data type and similarity search functions, it simplifies your architecture and improves performance, making it an ideal solution for applications that require embedding-based searches at scale.
Feature | Cassandra 4 + OpenSearch | Cassandra 5 (Preview) |
Embedding Storage | OpenSearch | Native VECTOR Data Type |
Similarity Search | KNN Plugin in OpenSearch | COSINE, EUCLIDEAN, DOT_PRODUCT |
Search Method | Exact K-Nearest Neighbor | Approximate Nearest Neighbor (ANN) |
System Complexity | Requires two systems | All-in-one Cassandra solution |
Conclusion: A simpler path to similarity search with Cassandra 5
With Cassandra 5, the complexity of setting up and managing a separate search system for word embeddings is gone. The new vector data type and Vector Search capabilities allow you to perform similarity searches directly within Cassandra, simplifying your architecture and making it easier to build AI-powered applications.
Coming up: more in-depth examples and use cases that demonstrate how to take full advantage of these new features in Cassandra 5 in future blogs!
Ready to experience vector search with Cassandra 5? Spin up your first cluster for free on the Instaclustr Managed Platform and try it out!
Monster Scale Summit Recap: Scaling Systems, Databases, and Engineering Leadership
Monster Scale Summit brought together some of the sharpest minds in distributed systems, data infrastructure, and engineering leadership — all focused on one thing: what it really takes to build and operate systems at scale. From database internals to leadership lessons, here are some highlights from two packed days of tech talks. Watch On-Demand Kelsey Hightower: Engineering at Scale: What Separates Leaders from Laggards We kicked off with a candid conversation with Kelsey Hightower, a name that needs no introduction if you’ve ever dealt with Kubernetes, CoreOS, or even Puppet. Kelsey has been at the center of some of the biggest shifts in infrastructure over the past decade. Hearing his perspective on what separates companies that succeed at scale from those that don’t was the most memorable part of the event for me. Kelsey tackled questions such as: Misconceptions in scaling engineering efforts: What common mistakes do engineers make? Design trade-offs: How do you balance the need to move fast while still designing for future growth? Avoiding over-engineering: How do you build just enough to handle scale without building complexity that slows you down? Developer experience and tooling: How do you give teams the right tools without overwhelming them? Leadership balance: How do technical depth and soft skills factor into great engineering leadership? And of course, I couldn’t resist asking: “Good programmers copy, great programmers paste” — is that still true? Spoiler: his answer was “They use ChatGPT!” Kelsey shared razor-sharp, unfiltered insights throughout the unscripted live session. If you care about engineering leadership in high-scale environments, watch this – now. Dor Laor, ScyllaDB CEO: Pushing the Boundaries of Performance Dor Laor, ScyllaDB CEO and Co-founder, took the virtual stage to share 10 years of lessons learned building ScyllaDB, a database designed for extreme speed and scale. Dor walked us through: The shard-per-core design that sets ScyllaDB apart. How ScyllaDB evolved from an idea (codename: “Sea Star” [C*]) to production systems handling billions of operations per day. What’s next in terms of performance, cost-efficiency, and scalability. Organizations have wasted time and money overprovisioning other databases at scale. Dor presented the next generation of ScyllaDB X Cloud which provides true elasticity and unmatched storage capability, unique to ScyllaDB. If you’re dealing with high-throughput, low-latency database workloads, take some time to absorb all the advances introduced… and how they might help your team. Real-World Scaling Stories from Industry Leaders One of the best parts of Monster Scale was hearing directly from the people building and operating some of the largest systems on the planet. Some of the talks that got the chat buzzing include… Extreme Scale in Action Cloudflare: Serving millions of boot artifacts to a global audience. Agoda: Scaling 50x throughput with ScyllaDB. Discord: Handling trillions of search requests. American Express: Sharing design choices for routing global payments. Canva: Running machine learning workflows with over 100M images/day. Database Internals and Their Impacts Avi Kivity (ScyllaDB CTO): Deep dive into engineering advances enabling massive scale. Felipe Mendes (ScyllaDB Technical Director): Detailed breakdown of how ScyllaDB stacks up against Cassandra 5.0. Responsive: Almog Gavra on replacing RocksDB with ScyllaDB to achieve next-level Kafka stream processing. 
Optimizing Cost and Performance in the Cloud ScyllaDB: Cloud cost reduction, tiered storage, and high availability (HA) strategies. Slack: Managing 300+ mission-critical cron jobs efficiently. Yieldmo: Real savings from moving off DynamoDB to ScyllaDB. Gwen Shapira: Reengineering Postgres for Millions of Tenants If you think relational databases can’t handle scale, Gwen Shapira showed up to challenge that. She detailed how Nile is rethinking Postgres to serve millions of tenants and shared the real operational challenges behind that journey. Her bottom line: “Scaling relational data is frigging hard.” But it’s also possible if you know what you’re doing. ShareChat: Building One of the World’s Largest Feature Stores With over 300M monthly active users, ShareChat has built a feature store that processes over a billion features per second. David and Ivan walked us through how they got there, the role ScyllaDB plays, and what they’re doing now to optimize cost without compromising on scale. Martin Kleppmann + Chris Riccomini: Designing Data-Intensive Apps in 2025 Yes, Martin & Chris confirmed an update to “Designing Data-Intensive Applications” is on the way. But this wasn’t a book promo — it was a frank discussion on real-world data architecture, including what’s broken and what still works when scaling distributed systems. Avi Kivity: ScyllaDB’s Monstrous Engineering Advances Avi took us through ScyllaDB’s latest innovations, from internals to future plans — essential viewing if you’re using ScyllaDB and/or you’re curious about the engineering behind high-performance, distributed databases. More Sessions on Tackling Scale Head-On Resonate, Antithesis, Turso, poolside, Uber: Simple (and not-so-simple) mechanics of scaling. Medium, Alex DeBrie, Guilherme Nogueira+ Nadav Har’El, Patrick Bossman: The reality of DynamoDB costs and why customers switch to ScyllaDB – plus practical migration insights. Kostja Osipov (ScyllaDB): Real lessons in surviving majority failures and consensus mechanics. Dzejla Medjedovic (Social Explorer): Exploring the benefits and tradeoffs between B-trees, B^eps-trees, and LSM-trees. Ethan Donowitz: Database Upgrades with Shadow Clusters at Discord Ethan gave us a compelling presentation on the use of “shadow clusters” at Discord to effectively de-risk the upgrade process in large-scale production systems. This included insights on how to build, mirror, validate, test and monitor — all practical tips you can apply to your own database environments. Rachel Stephens + Adam Jacob: Scaling is the Funnest Game Rachel and Adam gave us their honest take on the human side of scaling, with plenty of fun stories around technical trade-offs and why business context matters as much as engineering decisions. To quote Adam (while recounting some anecdotal coffee shop encounters with Chef users): “There is no funner game than the at-scale technology game.” Personal Takeaways As an event host, I get the chance to review the recordings before the show — but it’s not until the entire show is assembled and streamed online that the true depth and quality of content becomes apparent to me. Also, what a privilege it was to interview Kelsey in person. I’ve used many of the systems and software he has influenced, so having a chat with him was both inspiring and grounding. You couldn’t ask for a better role model in software engineering leadership. Cheers mate! 
Monster Scale Summit wasn’t just about theory — it was about what happens when systems, teams, and businesses hit real limits and what it takes to move past them. From deep engineering to leadership lessons, if you’re working on systems that need to scale and perform predictably, this was a treasure trove of insights. And if you missed it? Check out the replays — because this is the kind of knowledge that will save you months of effort and pain. Watch Tech Talk Replays On-Demand Behind the scenes, from the perspective of Wayne’s Ray-Ban Smart Glasses

High Performance on a Low Budget: Gwen Shapira’s Tips for Startups
How even a scrappy early-stage startup can deliver outstanding performance “It’s one thing to solve performance challenges when you have plenty of time, money and expertise available. But what do you do in the opposite situation: If you are a small startup with no time or money and still need to deliver outstanding performance?” – Gwen Shapira, co-founder of Nile (PostgreSQL reengineered for multi-tenant apps) That’s the multi-million-dollar question for many early-stage startups. And who better to lead that discussion than Gwen Shapira, who has tackled performance from two vastly different perspectives? After years of focusing on data systems performance at large organizations, she recently pivoted to leading a startup – where she found herself responsible for full-stack performance, from the ground up. In her P99 CONF keynote, “High Performance on a Low Budget,” Gwen explored the topic by sharing performance challenges she and her small team faced at Nile – how they approached them, tradeoffs and lessons learned. A few takeaways that set the conference chat on fire: Benchmarks should pay rent Keep tests stupid simple If you don’t have time to optimize, at least don’t pessimize But the real value is in hearing the experiences behind these and other zingers. You can watch her talk below or keep reading for a guided tour. Enjoy Gwen’s insights? She’ll be delivering another keynote at Monster SCALE Summit—alongside Kelsey Hightower and engineers from Discord, Disney+, Slack, Canva, Atlassian, Uber, ScyllaDB and many other leaders sharing how they’re tackling extreme-scale engineering challenges. Join us – it’s free + virtual. Get Your Conference Pass Do Worry About Performance Before You Ship It Per Gwen, founders get all sorts of advice on how to run the company (whether they want it or not). Regarding performance, it’s common to hear tips like: Don’t worry about performance until users complain. Don’t worry about performance until you have product market fit. Performance is a good problem to have. If people complain about performance, it is a great sign! But she respectfully disagrees. After years of focusing on performance, she’s all too familiar with the aftermath of that approach. Gwen shared, “As you talk to people, you want to figure out the minimal feature set required to bring an impactful product to market. And performance is part of this package. You should discover the target market’s performance expectations when you discover the other expectations.” Things to consider at an early stage: If you’re trying to beat the competition on performance, how much faster do you need to be? Even if performance is not your key differentiator, what are your users’ expectations regarding performance? To what extent are users willing to accept latency or throughput tradeoffs for different capabilities – or accept higher costs to avoid those tradeoffs? Founders are often told that even if you’re not fully satisfied with the product, just ship it and see how people react. Gwen’s reaction: “If you do a startup, there is a 100% chance that you will ship something that you’re not 100% happy with. But the reason you ship early and iterate is because you really want to learn fast. If you identified performance expectations during the discovery phase, try to ship something in that ballpark. 
Otherwise, you’re probably not learning all that much – with respect to performance, at least.”
Hyperfocus on the User’s Perceived Latency
For startups looking to make the biggest performance impact with limited resources, you can “cheat” by focusing on what will really help you attract and retain customers: optimizing the user’s perceived latency.
Web apps are a great place to begin. Even if your core product is an eBPF-based edge database, your users will likely be interacting with a web app from the start of the user journey. Plus, there are lots of nice metrics to track (for example, Google’s Core Web Vitals).
Gwen noted, “Startups very rarely have a scale problem. If you do have a scale problem, you can, for example, put people on a wait list while you’re determining how to scale out and add machines. However, even if you have a small number of users, you definitely care about them having a great experience with low latency. And perceived low latency is what really matters.”
For example, consider this dashboard: When users logged in, the Nile team wanted to impress them by having this cool dashboard load instantly. However, they found that response times ranged from a snappy 200 milliseconds to a terrible 10+ seconds.
To tackle the problem, the team started by parallelizing requests, filling in dashboard elements as data arrived and creating a progressive loading experience. These optimizations helped – and progressive loading turned out to be a fantastic way to hide latency (keeping the user engaged, like mirrors distracting you in a slow elevator).
However, the optimizations exposed another issue: the app was making 2,000 individual API calls just to count open tickets. This is the classic N+1 problem, where you end up running a query for each result instead of a single optimized query that retrieves all the necessary data at once. Naturally, that inspired some API refinement and tuning.
Then, another discovery. Their front-end dev noticed they were fetching more data than needed, so he cached it in the browser. This update sped up dashboard interactions by serving pre-cached data from the browser’s local storage.
However, despite all those optimizations, the dashboard remained data-heavy. “Our customers loved it, but there was no reason why it had to be the first page after logging in,” Gwen remarked. So they moved the dashboard a layer down in the navigation. In its place, they added a much simpler landing page with easy access to the most common user tasks.
Benchmarks Should Pay Rent
Next topic: the importance of being strategic about benchmarking. “Performance people love benchmarking (and I’m guilty of that),” Gwen admitted. “But you can spend infinite time benchmarking with very little to show for it. So I want to share some tips on how to spend less time and have more to show for it.”
She continued, “Benchmarks should pay rent by answering some important questions that you have. If you don’t have an important question, don’t run a benchmark. There, I just saved you weeks of your life – something invaluable for startups. You can thank me later.”
If your primary competitive advantage is performance, you will be expected to share performance tests to (attempt to) prove how fast and cool you are. Call it “benchmarketing.” For everyone else, two common questions to answer with benchmarking are:
- Is our database setup optimal?
- Are we delivering a good user experience?
To assess the database setup, teams tend to:
- Turn to a standard benchmark like TPC-C
- Increase load over time
- Look for bottlenecks
- Fix what they can
But is this really the best way for a startup to spend its limited time and resources? Given her background as a performance expert, Gwen couldn’t resist doing such a benchmark early on at Nile. But she doesn’t recommend it – at least not for startups: “First of all, it takes a lot of time to run the standard benchmarks when you’re not used to doing it week in, week out. It takes time to adjust all the knobs and parameters. It takes time to analyze the results, rinse and repeat. Even with good tools, it’s never easy.”
They did identify and fix some low-hanging fruit from this exercise. But since the tests were based on a standard benchmark, it was unclear how well they mapped to actual user experiences. Gwen continued, “I didn’t feel the ROI was exactly compelling. If you’re a performance expert and it takes you only about a day, it’s probably worth it. But if you have to get up to speed, if you’re spending significant time on the setup, you’re better off focusing your efforts elsewhere.”
A better question to obsess over is “Are we delivering a good experience?” More specifically, focus on these three areas:
- Optimizing user onboarding paths
- Addressing performance issues that annoy developers (these likely annoy users too)
- Paying attention to metrics that customers obsess over – even if they’re not the ones your team has focused on
Keep Benchmarking Tests Stupid Simple
Another testing lesson learned: focus on extra-stupid sanity tests. At Nile, the team ran the simplest possible queries, like loading an empty page or querying an empty table. If those were slow, there was no point in running more complex tests. Stop, fix the problem, then proceed with more interesting tests.
Also, obsess over understanding what the numbers actually measure. You don’t want to base critical performance decisions on misleading results (e.g., empty responses) or unintended behaviors. For example, her team once intended to test the write path but ended up testing the read path thanks to a misconfigured DNS.
Build Infrastructure for Long-Term Value, Optimize for Quick Wins
The instrumentation and observability tools put in place during testing will pay off for years to come. At Nile, this infrastructure became invaluable throughout the product’s lifetime for answering the persistent question “Why is it slow?” As Gwen put it: “Those early performance test numbers, that instrumentation, all the observability – this is very much a gift that keeps on giving as you continue to build and users inevitably complain about performance.”
When prioritizing performance improvements, look for quick wins. For example, Nile found that a slow request was spending 40% of its time on parsing, 40% on lookups, and just 20% on actual work. The developer realized he could reuse an existing caching library to speed up lookups. That was a nice quick win – giving 40% of the time back with minimal effort. However, if he’d said, “I’m not 100% sure about caching, but I have this fast JSON parsing library,” then that would have been a better way to shave off an equivalent 40%. About a year later, they pushed most of the parsing down to a Postgres extension that was written in C and nicely optimized. The optimizations never end!
No Time to Optimize? Then At Least Don’t Pessimize
Gwen’s final tip involved empowering experienced engineers to make common-sense improvements.
“Last but not least, sometimes you really don’t have time to optimize. But if you have a team of experienced engineers, they know not to pessimize. They are familiar with faster JSON libraries, async libraries that work behind the scenes, they know not to put slow stuff on the critical path and so on. Even if you lack the time to prove that these things are actually faster, just do them. It’s not premature optimization. It’s just avoiding premature pessimization.”
Introduction to similarity search with word embeddings: Part 1–Apache Cassandra® 4.0 and OpenSearch®
Word embeddings have revolutionized how we approach tasks like natural language processing, search, and recommendation engines.
They allow us to convert words and phrases into numerical representations (vectors) that capture their meaning based on the context in which they appear. Word embeddings are especially useful for tasks where traditional keyword searches fall short, such as finding semantically similar documents or making recommendations based on textual data.
For example: a search for “Laptop” might return results related to “Notebook” or “MacBook” when using embeddings (as opposed to something like “Tablet”), offering a more intuitive and accurate search experience.
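To make “similarity” concrete: with embeddings, closeness between two pieces of text is usually measured as the cosine similarity of their vectors. Here is a minimal sketch using tiny made-up 3-dimensional vectors purely for illustration; real embeddings (such as the 300-dimensional FastText vectors used later in this post) work the same way, just in higher dimensions:
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity: close to 1.0 means very similar direction, lower means less related."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy vectors, invented for illustration only
laptop   = [0.90, 0.80, 0.10]
notebook = [0.85, 0.75, 0.20]
tablet   = [0.40, 0.20, 0.90]

print(cosine_similarity(laptop, notebook))  # high, roughly 0.99
print(cosine_similarity(laptop, tablet))    # noticeably lower, roughly 0.50
A kNN search simply returns the stored vectors with the highest similarity (or smallest distance) to the query vector.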
As applications increasingly rely on AI and machine learning to drive intelligent search and recommendation engines, the ability to efficiently handle word embeddings has become critical. That’s where databases like Apache Cassandra come into play—offering the scalability and performance needed to manage and query large amounts of vector data.
In Part 1 of this series, we’ll explore how you can leverage word embeddings for similarity searches using Cassandra 4 and OpenSearch. By combining Cassandra’s robust data storage capabilities with OpenSearch’s powerful search functions, you can build scalable and efficient systems that handle both metadata and word embeddings.
Cassandra 4 and OpenSearch: A partnership for embeddings
Cassandra 4 doesn’t natively support vector data types or specific similarity search functions, but that doesn’t mean you’re out of luck. By integrating Cassandra with OpenSearch, an open-source search and analytics platform, you can store word embeddings and perform similarity searches using the k-Nearest Neighbors (kNN) plugin.
This hybrid approach is advantageous over relying on OpenSearch alone because it allows you to leverage Cassandra’s strengths as a high-performance, scalable database for data storage while using OpenSearch for its robust indexing and search capabilities.
Instead of duplicating large volumes of data into OpenSearch solely for search purposes, you can keep the original data in Cassandra. OpenSearch, in this setup, acts as an intelligent pointer: it indexes the embeddings and links back to the source data kept in Cassandra, performing efficient searches without having to manage the entire dataset directly.
This approach not only optimizes resource usage but also enhances system maintainability and scalability by segregating storage and search functionalities into specialized layers.
Deploying the environment
To set up your environment for word embeddings and similarity search, you can leverage the Instaclustr Managed Platform, which simplifies deploying and managing your Cassandra cluster and OpenSearch. Instaclustr takes care of the heavy lifting, allowing you to focus on building your application rather than managing infrastructure. In this configuration, Cassandra serves as your primary data store, while OpenSearch handles vector operations and similarity searches.
Here’s how to get started:
- Deploy a managed Cassandra cluster: Start by provisioning your Cassandra 4 cluster on the Instaclustr platform. This managed solution ensures your cluster is optimized, secure, and ready to store non-vector data.
- Set up OpenSearch with kNN plugin: Instaclustr also offers a fully managed OpenSearch service. You will need to deploy OpenSearch, with the kNN plugin enabled, which is critical for handling word embeddings and executing similarity searches.
By using Instaclustr, you gain access to a robust platform that seamlessly integrates Cassandra and OpenSearch, combining Cassandra’s scalable, fault-tolerant database with OpenSearch’s powerful search capabilities. This managed environment minimizes operational complexity, so you can focus on delivering fast and efficient similarity searches for your application.
Preparing the environment
Now that we’ve outlined the environment setup, let’s dive into the specific technical steps to prepare Cassandra and OpenSearch for storing and searching word embeddings.
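The Python snippets later in this post assume an open Cassandra session and an OpenSearch client, referred to as session and os_client, along with keyspace_name and index_name variables. Here is a minimal connection sketch, assuming the cassandra-driver and opensearch-py libraries; the hostnames, credentials, and the index name are placeholders you would replace with your own cluster details:
from cassandra.cluster import Cluster
from cassandra.auth import PlainTextAuthProvider
from opensearchpy import OpenSearch

# Placeholder endpoints and credentials -- substitute your own cluster details
auth_provider = PlainTextAuthProvider(username='<cassandra-user>', password='<cassandra-password>')
cluster = Cluster(['<cassandra-node-ip>'], auth_provider=auth_provider)
session = cluster.connect()

os_client = OpenSearch(
    hosts=[{'host': '<opensearch-endpoint>', 'port': 9200}],
    http_auth=('<opensearch-user>', '<opensearch-password>'),
    use_ssl=True,
    verify_certs=True,
)

keyspace_name = 'aisearch'            # keyspace created in Step 1 below
index_name = 'paragraph_embeddings'   # index created in Step 2 below (name is an assumption)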
Step 1: Setting up Cassandra
In Cassandra, we’ll need to create a table to store the metadata. Here’s how to do that:
- Create the Table:
Next, create the keyspace and a table to store the file and paragraph metadata. The embedding vectors themselves will live in OpenSearch; this table holds the related text and metadata that search results will link back to:
CREATE KEYSPACE IF NOT EXISTS aisearch
  WITH REPLICATION = {'class': 'SimpleStrategy', 'replication_factor': 3};

USE aisearch;

DROP TABLE IF EXISTS file_metadata;

CREATE TABLE IF NOT EXISTS file_metadata (
  id UUID,
  paragraph_uuid UUID,
  filename TEXT,
  text TEXT,
  last_updated timestamp,
  PRIMARY KEY (id, paragraph_uuid)
);
Step 2: Configuring OpenSearch
In OpenSearch, you’ll need to create an index that supports vector operations for similarity search. Here’s how you can configure it:
- Create the index:
Define the index settings and mappings, ensuring that vector operations are enabled and that the correct space type (e.g., L2) is used for similarity calculations.
{ "settings": { "index": { "number_of_shards": 2, "knn": true, "knn.space_type": "l2" } }, "mappings": { "properties": { "file_uuid": { "type": "keyword" }, "paragraph_uuid": { "type": "keyword" }, "embedding": { "type": "knn_vector", "dimension": 300 } } } }
This index configuration is optimized for storing and searching embeddings using the k-Nearest Neighbors algorithm, which is crucial for similarity search.
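If you prefer to create the index from Python rather than posting the JSON to the REST API directly, the same settings and mappings can be passed to the opensearch-py client. This sketch reuses the os_client and index_name placeholders from the connection sketch above:
# Create the kNN-enabled index with the settings and mappings shown above
index_body = {
    "settings": {
        "index": {"number_of_shards": 2, "knn": True, "knn.space_type": "l2"}
    },
    "mappings": {
        "properties": {
            "file_uuid": {"type": "keyword"},
            "paragraph_uuid": {"type": "keyword"},
            "embedding": {"type": "knn_vector", "dimension": 300},
        }
    },
}

if not os_client.indices.exists(index=index_name):
    os_client.indices.create(index=index_name, body=index_body)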
With these steps, your environment will be ready to handle word embeddings for similarity search using Cassandra and OpenSearch.
Generating embeddings with FastText
Once you have your environment set up, the next step is to generate the word embeddings that will drive your similarity search. For this, we’ll use FastText, a popular library from Facebook’s AI Research team that provides pre-trained word vectors. Specifically, we’re using the crawl-300d-2M model, which offers 300-dimensional vectors for millions of English words.
Step 1: Download and load the FastText model
To start, you’ll need to download the pre-trained model file. This can be done easily using Python and the requests library. Here’s the process:
1. Download the FastText model: The FastText model is stored in a zip file, which you can download from the official FastText website. The following Python script will handle the download and extraction:
import requests
import zipfile
import os

# Adjust file_url and local_filename variables accordingly
file_url = 'https://dl.fbaipublicfiles.com/fasttext/vectors-english/crawl-300d-2M.vec.zip'
local_filename = '/content/gdrive/MyDrive/0_notebook_files/model/crawl-300d-2M.vec.zip'
extract_dir = '/content/gdrive/MyDrive/0_notebook_files/model/'

def download_file(url, filename):
    with requests.get(url, stream=True) as r:
        r.raise_for_status()
        os.makedirs(os.path.dirname(filename), exist_ok=True)
        with open(filename, 'wb') as f:
            for chunk in r.iter_content(chunk_size=8192):
                f.write(chunk)

def unzip_file(filename, extract_to):
    with zipfile.ZipFile(filename, 'r') as zip_ref:
        zip_ref.extractall(extract_to)

# Download and extract
download_file(file_url, local_filename)
unzip_file(local_filename, extract_dir)
2. Load the model: Once the model is downloaded and extracted, you’ll load it using Gensim’s KeyedVectors class. This allows you to work with the embeddings directly:
from gensim.models import KeyedVectors

# Adjust model_path variable accordingly
model_path = "/content/gdrive/MyDrive/0_notebook_files/model/crawl-300d-2M.vec"
fasttext_model = KeyedVectors.load_word2vec_format(model_path, binary=False)
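As a quick sanity check once the model is loaded, gensim’s KeyedVectors exposes a most_similar lookup, which should surface the kind of semantic neighbors described earlier (the exact words and scores will vary):
# Nearest neighbors of "laptop" in the FastText vector space
for word, score in fasttext_model.most_similar('laptop', topn=3):
    print(f"{word}: {score:.3f}")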
Step 2: Generate embeddings from text
With the FastText model loaded, the next task is to convert text into vectors. This process involves splitting the text into words, looking up the vector for each word in the FastText model, and then averaging the vectors to get a single embedding for the text.
Here’s a function that handles the conversion:
import numpy as np
import re

def text_to_vector(text):
    """Convert text into a vector using the FastText model."""
    text = text.lower()
    words = re.findall(r'\b\w+\b', text)
    vectors = [fasttext_model[word] for word in words if word in fasttext_model.key_to_index]
    if not vectors:
        print(f"No embeddings found for text: {text}")
        return np.zeros(fasttext_model.vector_size)
    return np.mean(vectors, axis=0)
This function tokenizes the input text, retrieves the corresponding word vectors from the model, and computes the average to create a final embedding.
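For example, any short phrase now maps to a single 300-dimensional vector:
embedding = text_to_vector("scalable distributed database")
print(embedding.shape)   # (300,)
print(embedding[:5])     # first few dimensions of the averaged vector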
Step 3: Extract text and generate embeddings from documents
In real-world applications, your text might come from various types of documents, such as PDFs, Word files, or presentations. The following code shows how to extract text from different file formats and convert that text into embeddings:
import uuid
import mimetypes
import pandas as pd
from pdfminer.high_level import extract_pages
from pdfminer.layout import LTTextContainer
from docx import Document
from pptx import Presentation

def generate_deterministic_uuid(name):
    return uuid.uuid5(uuid.NAMESPACE_DNS, name)

def generate_random_uuid():
    return uuid.uuid4()

def get_file_type(file_path):
    # Guess the MIME type based on the file extension
    mime_type, _ = mimetypes.guess_type(file_path)
    return mime_type

def extract_text_from_excel(excel_path):
    xls = pd.ExcelFile(excel_path)
    text_list = []
    for sheet_index, sheet_name in enumerate(xls.sheet_names):
        df = xls.parse(sheet_name)
        for row in df.iterrows():
            text_list.append((" ".join(map(str, row[1].values)), sheet_index + 1))  # +1 to make it a 1-based index
    return text_list

def extract_text_from_pdf(pdf_path):
    return [(text_line.get_text().strip().replace('\xa0', ' '), page_num)
            for page_num, page_layout in enumerate(extract_pages(pdf_path), start=1)
            for element in page_layout if isinstance(element, LTTextContainer)
            for text_line in element if text_line.get_text().strip()]

def extract_text_from_word(file_path):
    doc = Document(file_path)
    # python-docx does not expose page numbers, so paragraphs default to page 1
    return [(para.text, 1) for para in doc.paragraphs if para.text.strip()]

def extract_text_from_txt(file_path):
    with open(file_path, 'r') as file:
        return [(line.strip(), 1) for line in file.readlines() if line.strip()]

def extract_text_from_pptx(pptx_path):
    prs = Presentation(pptx_path)
    return [(shape.text.strip(), slide_num)
            for slide_num, slide in enumerate(prs.slides, start=1)
            for shape in slide.shapes if hasattr(shape, "text") and shape.text.strip()]

def extract_text_with_page_number_and_embeddings(file_path, embedding_function):
    file_uuid = generate_deterministic_uuid(file_path)
    file_type = get_file_type(file_path)
    extractors = {
        'text/plain': extract_text_from_txt,
        'application/pdf': extract_text_from_pdf,
        'application/vnd.openxmlformats-officedocument.wordprocessingml.document': extract_text_from_word,
        'application/vnd.openxmlformats-officedocument.presentationml.presentation': extract_text_from_pptx,
        'application/zip': lambda path: extract_text_from_pptx(path) if path.endswith('.pptx') else [],
        'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet': extract_text_from_excel,
        'application/vnd.ms-excel': extract_text_from_excel
    }
    text_list = extractors.get(file_type, lambda _: [])(file_path)
    return [
        {
            "uuid": file_uuid,
            "paragraph_uuid": generate_random_uuid(),
            "filename": file_path,
            "text": text,
            "page_num": page_num,
            "embedding": embedding
        }
        for text, page_num in text_list
        if (embedding := embedding_function(text)).any()  # Check that the embedding is not all zeros
    ]

# Replace the file path with the one you want to process
file_path = "../../docs-manager/Cassandra-Best-Practices.pdf"
paragraphs_with_embeddings = extract_text_with_page_number_and_embeddings(file_path, text_to_vector)
This code handles extracting text from different document types, generating embeddings for each text chunk, and associating them with unique IDs.
With FastText set up and embeddings generated, you’re now ready to store these vectors in OpenSearch and start performing similarity searches.
Performing similarity searches
To conduct similarity searches, we utilize the k-Nearest Neighbors (kNN) plugin within OpenSearch. This plugin allows us to efficiently search for the most similar embeddings stored in the system. Essentially, you’re querying OpenSearch to find the closest matches to a word or phrase based on your embeddings.
For example, if you’ve embedded product descriptions, using kNN search helps you locate products that are semantically similar to a given input. This capability can significantly enhance your application’s recommendation engine, categorization, or clustering.
This setup with Cassandra and OpenSearch is a powerful combination, but it’s important to remember that it requires managing two systems. As Cassandra evolves, the introduction of built-in vector support in Cassandra 5 simplifies this architecture. But for now, let’s focus on leveraging both systems to get the most out of similarity searches.
Example: Inserting metadata in Cassandra and embeddings in OpenSearch
In this example, we use Cassandra 4 to store metadata related to files and paragraphs, while OpenSearch handles the actual word embeddings. By storing the paragraph and file IDs in both systems, we can link the metadata in Cassandra with the embeddings in OpenSearch.
We first need to store metadata such as the file name, paragraph UUID, and other relevant details in Cassandra. This metadata is what links the data between Cassandra, OpenSearch, and the file itself on the filesystem.
The following code demonstrates how to insert the metadata into Cassandra and the embeddings into OpenSearch. Make sure to run the previous script first so that the paragraphs_with_embeddings variable is populated:
from tqdm import tqdm

# Function to insert data into both Cassandra and OpenSearch
def insert_paragraph_data(session, os_client, paragraph, keyspace_name, index_name):
    # Insert into Cassandra
    cassandra_result = insert_with_retry(
        session=session,
        id=paragraph['uuid'],
        paragraph_uuid=paragraph['paragraph_uuid'],
        text=paragraph['text'],
        filename=paragraph['filename'],
        keyspace_name=keyspace_name,
        max_retries=3,
        retry_delay_seconds=1
    )
    if not cassandra_result:
        return False  # Stop further processing if Cassandra insertion fails

    # Insert into OpenSearch
    opensearch_result = insert_embedding_to_opensearch(
        os_client=os_client,
        index_name=index_name,
        file_uuid=paragraph['uuid'],
        paragraph_uuid=paragraph['paragraph_uuid'],
        embedding=paragraph['embedding']
    )
    if opensearch_result is not None:
        return False  # Return False if OpenSearch insertion fails

    return True  # Return True on success for both

# Process each paragraph with a progress bar
print("Starting batch insertion of paragraphs.")
for paragraph in tqdm(paragraphs_with_embeddings, desc="Inserting paragraphs"):
    if not insert_paragraph_data(
        session=session,
        os_client=os_client,
        paragraph=paragraph,
        keyspace_name=keyspace_name,
        index_name=index_name
    ):
        print(f"Insertion failed for UUID {paragraph['uuid']}: {paragraph['text'][:50]}...")
print("Batch insertion completed.")
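The snippet above relies on two helper functions, insert_with_retry and insert_embedding_to_opensearch, whose definitions are not shown here. Below is a minimal sketch of what they might look like, following the conventions the calling code expects: the Cassandra helper returns True or False, and the OpenSearch helper returns None on success or the error on failure. The exact statements and error handling are assumptions, not the original implementation:
import time
from datetime import datetime, timezone

def insert_with_retry(session, id, paragraph_uuid, text, filename,
                      keyspace_name, max_retries=3, retry_delay_seconds=1):
    """Insert one metadata row into Cassandra, retrying on transient errors.
    (`id` mirrors the keyword argument used by the caller.)"""
    # In production you would prepare this statement once and reuse it
    insert_stmt = session.prepare(f"""
        INSERT INTO {keyspace_name}.file_metadata
            (id, paragraph_uuid, filename, text, last_updated)
        VALUES (?, ?, ?, ?, ?)
    """)
    for attempt in range(max_retries):
        try:
            session.execute(insert_stmt.bind(
                (id, paragraph_uuid, filename, text, datetime.now(timezone.utc))))
            return True
        except Exception as exc:
            print(f"Cassandra insert failed (attempt {attempt + 1}): {exc}")
            time.sleep(retry_delay_seconds)
    return False

def insert_embedding_to_opensearch(os_client, index_name, file_uuid,
                                   paragraph_uuid, embedding):
    """Index one embedding document; return None on success, the error otherwise."""
    doc = {
        "file_uuid": str(file_uuid),
        "paragraph_uuid": str(paragraph_uuid),
        "embedding": embedding.tolist(),
    }
    try:
        os_client.index(index=index_name, body=doc)
        return None
    except Exception as exc:
        return exc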
Performing similarity search
Now that we’ve stored both metadata in Cassandra and embeddings in OpenSearch, it’s time to perform a similarity search. This step involves searching OpenSearch for embeddings that closely match a given input and then retrieving the corresponding metadata from Cassandra.
The process is straightforward: we start by converting the input text into an embedding, then use the k-Nearest Neighbors (kNN) plugin in OpenSearch to find the most similar embeddings. Once we have the results, we fetch the related metadata from Cassandra, such as the original text and file name.
Here’s how it works:
- Convert text to embedding: Start by converting your input text into an embedding vector using the FastText model. This vector will serve as the query for our similarity search.
- Search OpenSearch for similar embeddings: Using the KNN search capability in OpenSearch, we find the top k most similar embeddings. Each result includes the corresponding file and paragraph UUIDs, which help us link the results back to Cassandra.
- Fetch metadata from Cassandra: With the UUIDs retrieved from OpenSearch, we query Cassandra to get the metadata, such as the original text and file name, associated with each embedding.
The following code demonstrates this process:
import uuid
from IPython.display import display, HTML

def find_similar_embeddings_opensearch(os_client, index_name, input_embedding, top_k=5):
    """Search for similar embeddings in OpenSearch and return the associated UUIDs."""
    query = {
        "size": top_k,
        "query": {
            "knn": {
                "embedding": {
                    "vector": input_embedding.tolist(),
                    "k": top_k
                }
            }
        }
    }
    response = os_client.search(index=index_name, body=query)
    similar_uuids = []
    for hit in response['hits']['hits']:
        file_uuid = hit['_source']['file_uuid']
        paragraph_uuid = hit['_source']['paragraph_uuid']
        similar_uuids.append((file_uuid, paragraph_uuid))
    return similar_uuids

def fetch_metadata_from_cassandra(session, file_uuid, paragraph_uuid, keyspace_name):
    """Fetch the metadata (text and filename) from Cassandra based on UUIDs."""
    file_uuid = uuid.UUID(file_uuid)
    paragraph_uuid = uuid.UUID(paragraph_uuid)
    query = f"""
        SELECT text, filename FROM {keyspace_name}.file_metadata
        WHERE id = ? AND paragraph_uuid = ?;
    """
    prepared = session.prepare(query)
    bound = prepared.bind((file_uuid, paragraph_uuid))
    rows = session.execute(bound)
    for row in rows:
        return row.filename, row.text
    return None, None

# Input text to find similar embeddings
input_text = "place"

# Convert input text to embedding
input_embedding = text_to_vector(input_text)

# Find similar embeddings in OpenSearch
similar_uuids = find_similar_embeddings_opensearch(
    os_client, index_name=index_name, input_embedding=input_embedding, top_k=10)

# Fetch and display metadata from Cassandra based on the UUIDs found in OpenSearch
for file_uuid, paragraph_uuid in similar_uuids:
    filename, text = fetch_metadata_from_cassandra(session, file_uuid, paragraph_uuid, keyspace_name)
    if filename and text:
        html_content = f"""
        <div style="margin-bottom: 10px;">
            <p><b>File UUID:</b> {file_uuid}</p>
            <p><b>Paragraph UUID:</b> {paragraph_uuid}</p>
            <p><b>Text:</b> {text}</p>
            <p><b>File:</b> {filename}</p>
        </div>
        <hr/>
        """
        display(HTML(html_content))
This code demonstrates how to find similar embeddings in OpenSearch and retrieve the corresponding metadata from Cassandra. By linking the two systems via the UUIDs, you can build powerful search and recommendation systems that combine metadata storage with advanced embedding-based searches.
Conclusion and next steps: A powerful combination of Cassandra 4 and OpenSearch
By leveraging the strengths of Cassandra 4 and OpenSearch, you can build a system that handles both metadata storage and similarity search. Cassandra efficiently stores your file and paragraph metadata, while OpenSearch takes care of embedding-based searches using the k-Nearest Neighbors algorithm. Together, these two technologies enable powerful, large-scale applications for text search, recommendation engines, and more.
Coming up in Part 2, we’ll explore how Cassandra 5 simplifies this architecture with built-in vector support and native similarity search capabilities.
Ready to try vector search with Cassandra and OpenSearch? Spin up your first cluster for free on the Instaclustr Managed Platform and explore the incredible power of vector search.