ScyllaDB Rust Driver 1.0 is Officially Released

The long-awaited ScyllaDB Rust Driver 1.0 is finally released. This open source project was designed to bring a stable, high-performance, and production-ready CQL driver to the Rust ecosystem. Key changes in the 1.0 release include:

- Improved Stability: We removed unstable dependencies and put them behind feature flags. This keeps the driver stable, flexible, and future-proof while still allowing access to powerful-yet-unstable third-party libraries when needed.
- Refactored Error Types: The error types were significantly improved for clarity, type safety, and diagnostic information. This makes debugging easier and prevents API-breaking changes in future updates.
- Refactored Module Structure: The module structure was reorganized to better reflect abstraction layers and improve clarity. This makes the driver's architecture more understandable and simplifies importing items.
- Easier TLS Setup: Rustls support provides a Rust-native alternative to openssl. This simplifies TLS configuration and can prevent system library issues.
- Faster and Extended Metrics: New metrics were added, and the metrics code was optimized using an atomic histogram that reduces CPU overhead. The entire metrics module is now optional – so users who don't care about it won't suffer any performance impacts from it.

Read the release notes

In this post, we'll shed light on why we took this unconventionally extensive (years-long) path from a popular production-ready 0.x release to a 1.0 release. We'll also share our versioning/release plans from this point forward. We'll share a more detailed post next week that provides a deep dive into exactly what we changed and why.

The Path to "1.0"

Over the past few years, Rust Driver has proven itself to be high quality, with very few bugs compared to other drivers, as well as better performance. It is successfully used by customers in production, and by us internally. By all measures, we have considered it fully production-ready for a long time.

Given that, why did we keep releasing 0.x versions? Although we were confident in the driver's quality, we weren't satisfied with some aspects of its API. Keeping the version at 0.x was our way of saying that breaking changes are expected often.

Frequent breaking changes are not great for our users. Instead of just updating the driver, they have to adjust their code after pretty much every update. Moreover, 0.x version numbers suggest that the driver is not actually production-ready (but in this case, it truly was). So we really wanted to release a 1.0 version.

One option was to just call one of the previous versions (e.g. 0.9) version 1.0 and be done with it. But we knew there were still many breaking changes we wanted to make – and if we kept introducing planned changes, we would quickly arrive at a high version number like 7.0.

In Rust (and semver in general), 1.0 is called an "API-stable" version. There is no strict definition of that term, so it can have various interpretations. What's perfectly clear, however, is that rapidly releasing major versions – thus quickly arriving at a high major version number – does not constitute API stability. It also does nothing to help users easily update. They would still need to change their code after most updates!

We also realized that we will never be able to achieve complete stabilization. There are, and will probably always be, things that we want to improve in our API. We don't want stability to stand in the way of driver refinement.
Even if we somehow achieved an API that we were fully satisfied with and didn't want to change at all, there is another reason for change: the databases that the driver supports (ScyllaDB and Cassandra) are constantly evolving, and some of those changes may require modifying the driver API.

For example, ScyllaDB recently introduced a new replication mechanism: Tablets. It is possible to add Tablets support to a driver without a breaking change. We did that in our other drivers, which are forks, because we can't break compatibility there. However, it requires ugly workarounds. With Tablets, calculating a replica list for a request requires knowing which table the request uses. Tablets are per-table data structures, which means that different tables may have different replica sets for the same token (as opposed to the token ring, which is per-keyspace). This affects many APIs in the driver: Metadata, Load Balancing, and Request Routing, to name just a few. In Rust Driver, we could adapt those APIs cleanly, and we want to continue doing so when major changes are introduced in ScyllaDB or Cassandra.

Given those constraints, we reached a compromise. We decided to focus on the API-breaking changes we had planned and complete a big portion of them – making the API more future-proof and flexible. This reduces the risk of being forced to make unwanted API-breaking changes in the future.

What's Next for Rust Driver

Now that we've reached the long-anticipated "1.0" status, what's next? We will focus on other driver tasks that do not require changing the API. Those will be released as minor updates (1.x versions). Releasing minor versions means that our users can easily update the driver without changing their code, so they will quickly get the latest improvements.

Of course, we won't stay at 1.0 forever. We don't know exactly when the 2.0 release will happen, but we want to provide some reasonable stability to make life easier for our users. We've settled on 9 months for 1.0 – so 2.0 won't be released any earlier than 9 months after the 1.0 release date. For future versions (3.0, etc.), this period may (almost certainly) be extended, since we will have already smoothed out more and more API rough edges.

When a new major version (e.g. 2.0) is released, we will keep supporting the previous major version (e.g. 1.x) with bugfixes, but no new functionality. The duration of such support is not yet decided. This will also make the migration to a new major version a bit easier.

Get Started with Rust Driver 1.0

If you're ready to get started, take a look at:

- GitHub Repository: ScyllaDB Rust Driver – Contributions welcome!
- Crates.io: Scylla Crate
- Documentation: crate docs on docs.rs, the guide to the driver.

And if you have any questions, please contact us on the community forum or ScyllaDB User Slack (see the #rust-driver channel).

Upcoming ScyllaDB University LIVE and Community Forum Updates

What to expect at the upcoming ScyllaDB University Live training event – and what's trending on the community forum

Following up on all the interest in ScyllaDB – at Monster SCALE Summit and a whirlwind of in-person events around the world – let's continue the ScyllaDB conversation. Is ScyllaDB a good fit for your use case? How do you navigate some of the decisions you face when getting started? We're here to help! In this post, I'll update you about the upcoming ScyllaDB University LIVE training event and highlight some trending topics from the community forum.

ScyllaDB University LIVE

Our next ScyllaDB University LIVE training event will be held on Wednesday, April 9, 2025, 8 AM PDT – 10 AM PDT. This is a free live virtual training led by our top engineers and architects. Whether you're just curious about ScyllaDB or an experienced user looking to master advanced strategies, join us for ScyllaDB University LIVE!

Sessions are interactive and NOT available on-demand – be sure to mark your calendar and attend! You will have a chance to run hands-on labs throughout the event and learn by actually doing. The team and I are preparing lots of new examples and exercises – so if you've joined before, there's a great excuse to join again. 😉

Register here

The event will feature two parallel tracks: Essentials and Advanced.

Essentials Track

The Essentials track (Getting Started with ScyllaDB) is intended for people new to ScyllaDB. I will start with a talk covering a quick overview of NoSQL and where ScyllaDB fits in the NoSQL world. Next, you will run the Quick Wins labs, in which you'll see how easy it is to start a ScyllaDB cluster, create a keyspace, create a table, and run some basic queries.

After the lab, you'll learn about ScyllaDB's basic architecture, including nodes, clusters, data replication, Replication Factor, how the database partitions data, Consistency Level, multiple data centers, and an example of what happens when we write data to a cluster. We'll also cover data modeling fundamentals for ScyllaDB. Key concepts include the difference in data modeling between NoSQL and relational databases, Keyspace, Table, Row, CQL, the CQL shell, Partition Key, and Clustering Key. After that, you'll run another lab, where you'll put the data modeling theory into practice. Finally (if we have enough time left), we will discuss ScyllaDB's special shard-aware drivers.

The next part of this session is led by Attila Toth. Here, we'll walk through a real-world application and see how the different concepts from the previous talk come into play. We'll also use a lab where you can do the coding and test it yourself. Additionally, you will see a demo application running one million ops/sec with single-digit millisecond latency and learn how to run this demo yourself.

Advanced Track

In the Advanced track (Extreme Elasticity and Performance), Tzach Livyatan and Felipe Mendes will take a deep dive into ScyllaDB's unique features and tooling, such as Workload Prioritization, as well as advanced data modeling and tips for using counters and Time To Live (TTL). You'll learn how ScyllaDB's new Tablets feature enables extreme elasticity without any downtime and how to run multiple workloads on a single cluster. The two talks in this track will also use multiple labs that you can run yourself during the event.

Before the event, please make sure you have a (free) ScyllaDB University account. We will use this platform during the event for the hands-on labs.
Register on ScyllaDB University

Trending Topics on the Community Forum

The community forum is the place to discuss anything ScyllaDB and NoSQL related, learn from your peers, share how you're using ScyllaDB, and ask questions about your use case. It's where you can read our co-founder and CTO Avi Kivity's popular weekly "Last week in scylladb.git master" update (for example, here). It's also the place to learn about new releases and events.

Say Hello here

Many of the new topics focus on performance issues, troubleshooting, specific use case questions, and general data modeling questions. Many of the recent discussions have been about Tablets and how this feature affects performance and elasticity. Here's a summary of some of the top topics since my last update:

- Latency spikes and hot partitions: A user asked about latency spikes, hot partitions, and how to detect them. Key insights shared in this discussion emphasize the importance of understanding compaction settings and implementing strategies to mitigate tombstone accumulation.
- Upgrade paths and Tablets integration: The introduction of the Tablets feature led to significant discussions regarding its adoption for scaling purposes. A user discussed the process of enabling this feature after an upgrade, and its effects on performance, in posts like this one.
- General cluster management support: Different contributors actively assisted newcomers by clarifying various admin procedures, such as addressing schema migrations, compaction, and SSTable behavior. An example of such a discussion deals with the process for gracefully stopping ScyllaDB.
- Data modeling: A popular topic was data modeling and the data model's effect on performance for specific use cases. Users exchanged ideas on addressing challenges tied to row-level reads, batching, drivers, and the implications of large (and hot) partitions. One such discussion dealt with data modeling when having subgroups of data with volume disparity.
- Alternator: The DynamoDB-compatible API was a popular topic. Users asked how views work under the hood with Alternator, as well as other questions related to compatibility with DynamoDB and performance.

Hope to see you at the ScyllaDB University Live event! Meanwhile, stay in touch.

A Decade of Apache Cassandra® Data Modeling

Data modeling has been a challenge with Apache Cassandra for as long as the project has been around. After a decade, we have tools and functions at our disposal that can help us to better solve this problem from a developer’s perspective.

Introduction to similarity search: Part 2–Simplifying with Apache Cassandra® 5’s new vector data type

In Part 1 of this series, we explored how you can combine Cassandra 4 and OpenSearch to perform similarity searches with word embeddings. While that approach is powerful, it requires managing two different systems.

But with the release of Cassandra 5, things become much simpler.

Cassandra 5 introduces a native VECTOR data type and built-in Vector Search capabilities, simplifying the architecture by enabling Cassandra 5 to handle storage, indexing, and querying seamlessly within a single system.

Now in Part 2, we’ll dive into how Cassandra 5 streamlines the process of working with word embeddings for similarity search. We’ll walk through how the new vector data type works, how to store and query embeddings, and how the Storage-Attached Indexing (SAI) feature enhances your ability to efficiently search through large datasets.

The power of vector search in Cassandra 5

Vector search is a game-changing feature added in Cassandra 5 that enables you to perform similarity searches directly within the database. This is especially useful for AI applications, where embeddings are used to represent data like text or images as high-dimensional vectors. The goal of vector search is to find the closest matches to these vectors, which is critical for tasks like product recommendations or image recognition.

The key to this functionality lies in embeddings: arrays of floating-point numbers that represent objects in such a way that similar objects map to nearby vectors. By storing these embeddings as vectors in Cassandra, you can use Vector Search to find connections in your data that may not be obvious through traditional queries.

How vectors work

Vectors are fixed-size sequences of non-null values, much like lists. However, in Cassandra 5, you cannot modify individual elements of a vector — you must replace the entire vector if you need to update it. This makes vectors ideal for storing embeddings, where you need to work with the whole data structure at once.

When working with embeddings, you’ll typically store them as vectors of floating-point numbers to represent the semantic meaning.
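As a minimal sketch of this whole-value semantics (assuming a cassandra-driver `session` connected to a Cassandra 5 cluster with vector support; the `demo.items` keyspace/table names are illustrative, not from the original post):

import time
from uuid import uuid4
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])  # assumes a local Cassandra 5 node
session = cluster.connect()

session.execute("""
    CREATE KEYSPACE IF NOT EXISTS demo
    WITH REPLICATION = {'class': 'SimpleStrategy', 'replication_factor': 1}
""")
session.execute("""
    CREATE TABLE IF NOT EXISTS demo.items (
        id uuid PRIMARY KEY,
        embedding vector<float, 3>
    )
""")

item_id = uuid4()
# Vector literals use the CQL list syntax; the value is always written whole
session.execute(
    "INSERT INTO demo.items (id, embedding) VALUES (%s, [0.1, 0.2, 0.3])",
    (item_id,),
)
# There is no SET embedding[0] = ... for single elements; replace the full vector
session.execute(
    "UPDATE demo.items SET embedding = [0.4, 0.5, 0.6] WHERE id = %s",
    (item_id,),
)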

Storage-Attached Indexing (SAI): The engine behind vector search

Vector Search in Cassandra 5 is powered by Storage-Attached Indexing, which enables high-performance indexing and querying of vector data. SAI is essential for Vector Search, providing the ability to create column-level indexes on vector data types. This ensures that your vector queries are both fast and scalable, even with large datasets.

SAI isn’t just limited to vectors—it also indexes other types of data, making it a versatile tool for boosting the performance of your queries across the board.

Example: Performing similarity search with Cassandra 5’s vector data type

Now that we’ve introduced the new vector data type and the power of Vector Search in Cassandra 5, let’s dive into a practical example. In this section, we’ll show how to set up a table to store embeddings, insert data, and perform similarity searches directly within Cassandra.

Step 1: Setting up the embeddings table

To get started with this example, you’ll need access to a Cassandra 5 cluster. Cassandra 5 introduces native support for vector data types and Vector Search, available on Instaclustr’s managed platform. Once you have your cluster up and running, the first step is to create a table to store the embeddings. We’ll also create an index on the vector column to optimize similarity searches using SAI.

CREATE KEYSPACE aisearch WITH REPLICATION = {'class': 'SimpleStrategy', 'replication_factor': 1};

 

CREATE TABLE IF NOT EXISTS embeddings ( 
    id UUID, 
    paragraph_uuid UUID, 
    filename TEXT, 
    embeddings vector<float, 300>, 
    text TEXT, 
    last_updated timestamp, 
    PRIMARY KEY (id, paragraph_uuid) 
); 
 

CREATE INDEX IF NOT EXISTS ann_index 
  ON embeddings(embeddings) USING 'sai';

This setup allows us to store the embeddings as 300-dimensional vectors, along with metadata like file names and text. The SAI index will be used to speed up similarity searches on the embeddings column.

You can also fine-tune the index by specifying the similarity function to be used for vector comparisons. Cassandra 5 supports three types of similarity functions: DOT_PRODUCT, COSINE, and EUCLIDEAN. By default, the similarity function is set to COSINE, but you can specify your preferred method when creating the index:

CREATE INDEX IF NOT EXISTS ann_index 
    ON embeddings(embeddings) USING 'sai' 
WITH OPTIONS = { 'similarity_function': 'DOT_PRODUCT' };

Each similarity function has its own advantages depending on your use case. DOT_PRODUCT is often used when you need to measure the direction and magnitude of vectors, COSINE is ideal for comparing the angle between vectors, and EUCLIDEAN calculates the straight-line distance between vectors. By selecting the appropriate function, you can optimize your search results to better match the needs of your application.
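To make the difference concrete, here is a small numpy sketch (with illustrative vectors, not real embeddings) that computes all three measures by hand:

import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 3.0, 4.0])

dot = np.dot(a, b)                                       # direction and magnitude
cosine = dot / (np.linalg.norm(a) * np.linalg.norm(b))   # angle between vectors
euclidean = np.linalg.norm(a - b)                        # straight-line distance

print(f"dot={dot:.2f} cosine={cosine:.4f} euclidean={euclidean:.2f}")
# dot=20.00 cosine=0.9926 euclidean=1.73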

Step 2: Inserting embeddings into Cassandra 5

To insert embeddings into Cassandra 5, we can use the same code from the first part of this series to extract text from files, load the FastText model, and generate the embeddings. Once the embeddings are generated, the following function will insert them into Cassandra:

import time
from uuid import uuid4, UUID
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement
from cassandra.policies import DCAwareRoundRobinPolicy
from cassandra.auth import PlainTextAuthProvider

# Connect to the cluster
cluster = Cluster(
    # Replace with your node IP list
    ["xxx.xxx.xxx.xxx", "xxx.xxx.xxx.xxx", "xxx.xxx.xxx.xxx"],
    load_balancing_policy=DCAwareRoundRobinPolicy(local_dc='AWS_VPC_US_EAST_1'),  # Update the local data centre if needed
    port=9042,
    auth_provider=PlainTextAuthProvider(
        username='iccassandra',
        password='replace_with_your_password'
    )
)
session = cluster.connect()

print('Connected to cluster %s' % cluster.metadata.cluster_name)

def insert_embedding_to_cassandra(session, embedding, id=None, paragraph_uuid=None, filename=None, text=None, keyspace_name=None):
    try:
        embeddings = list(map(float, embedding))

        # Generate UUIDs if not provided
        if id is None:
            id = uuid4()
        if paragraph_uuid is None:
            paragraph_uuid = uuid4()

        # Ensure id and paragraph_uuid are UUID objects
        if isinstance(id, str):
            id = UUID(id)
        if isinstance(paragraph_uuid, str):
            paragraph_uuid = UUID(paragraph_uuid)

        # Create the query string with placeholders
        insert_query = f"""
        INSERT INTO {keyspace_name}.embeddings (id, paragraph_uuid, filename, embeddings, text, last_updated)
        VALUES (?, ?, ?, ?, ?, toTimestamp(now()))
        """

        # Create a prepared statement with the query
        prepared = session.prepare(insert_query)

        # Execute the query
        session.execute(prepared.bind((id, paragraph_uuid, filename, embeddings, text)))

        return None  # Successful insertion

    except Exception as e:
        error_message = f"Failed to execute query:\nError: {str(e)}"
        return error_message  # Return error message on failure

def insert_with_retry(session, embedding, id=None, paragraph_uuid=None,
                      filename=None, text=None, keyspace_name=None, max_retries=3,
                      retry_delay_seconds=1):
    retry_count = 0
    while retry_count < max_retries:
        result = insert_embedding_to_cassandra(session, embedding, id, paragraph_uuid, filename, text, keyspace_name)
        if result is None:
            return True  # Successful insertion
        else:
            retry_count += 1
            print(f"Insertion failed on attempt {retry_count} with error: {result}")
            if retry_count < max_retries:
                time.sleep(retry_delay_seconds)  # Delay before the next retry
    return False  # Failed after max_retries

# Replace the file path pointing to the desired file
file_path = "/path/to/Cassandra-Best-Practices.pdf"

# Extraction reuses the helpers from Part 1 of this series (assumed here to
# already apply the FastText embedding function to each extracted paragraph)
paragraphs_with_embeddings = extract_text_with_page_number_and_embeddings(file_path)

keyspace_name = "aisearch"  # the keyspace created in Step 1

from tqdm import tqdm

for paragraph in tqdm(paragraphs_with_embeddings, desc="Inserting paragraphs"):
    if not insert_with_retry(
        session=session,
        embedding=paragraph['embedding'],
        id=paragraph['uuid'],
        paragraph_uuid=paragraph['paragraph_uuid'],
        text=paragraph['text'],
        filename=paragraph['filename'],
        keyspace_name=keyspace_name,
        max_retries=3,
        retry_delay_seconds=1
    ):
        # Display an error message if insertion fails
        tqdm.write(f"Insertion failed after maximum retries for UUID {paragraph['uuid']}: {paragraph['text'][:50]}...")

This function handles inserting embeddings and metadata into Cassandra, ensuring that UUIDs are correctly generated for each entry.

Step 3: Performing similarity searches in Cassandra 5

Once the embeddings are stored, we can perform similarity searches directly within Cassandra using the following function:

import numpy as np

# ------------------ Embedding Functions ------------------
def text_to_vector(text):
    """Convert a text chunk into a vector using the FastText model."""
    words = text.split()
    vectors = [fasttext_model[word] for word in words if word in fasttext_model.key_to_index]
    return np.mean(vectors, axis=0) if vectors else np.zeros(fasttext_model.vector_size)

def find_similar_texts_cassandra(session, input_text, keyspace_name=None, top_k=5):
    # Convert the input text to an embedding (a plain list of floats)
    input_embedding = [float(x) for x in text_to_vector(input_text)]
    input_embedding_str = ', '.join(map(str, input_embedding))

    # ANN query: ORDER BY ... ANN OF asks SAI for the approximate nearest
    # neighbours, while similarity_cosine returns the score for each row
    query = f"""
    SELECT text, filename, similarity_cosine(embeddings, ?) AS similarity
    FROM {keyspace_name}.embeddings
    ORDER BY embeddings ANN OF [{input_embedding_str}]
    LIMIT {top_k};
    """

    prepared = session.prepare(query)
    bound = prepared.bind((input_embedding,))
    rows = session.execute(bound)

    # Sort the results by similarity in Python
    similar_texts = sorted([(row.similarity, row.filename, row.text) for row in rows], key=lambda x: x[0], reverse=True)

    return similar_texts[:top_k]

from IPython.display import display, HTML

# The word you want to find similarities for
input_text = "place"

# Call the function to find similar texts in the Cassandra database
similar_texts = find_similar_texts_cassandra(session, input_text, keyspace_name="aisearch", top_k=10)

This function searches for similar embeddings in Cassandra and retrieves the top results based on cosine similarity. Under the hood, Cassandra’s vector search uses Hierarchical Navigable Small Worlds (HNSW). HNSW organizes data points in a multi-layer graph structure, making queries significantly faster by narrowing down the search space efficiently—particularly important when handling large datasets.

Step 4: Displaying the results

To display the results in a readable format, we can loop through the similar texts and present them along with their similarity scores:

# Print the similar texts along with their similarity scores
for similarity, filename, text in similar_texts:
    html_content = f"""
    <div style="margin-bottom: 10px;">
        <p><b>Similarity:</b> {similarity:.4f}</p>
        <p><b>Text:</b> {text}</p>
        <p><b>File:</b> {filename}</p>
    </div>
    <hr/>
    """
    display(HTML(html_content))

This code will display the top similar texts, along with their similarity scores and associated file names.

Cassandra 5 vs. Cassandra 4 + OpenSearch®

Cassandra 4 relies on an integration with OpenSearch to handle word embeddings and similarity searches. This approach works well for applications that are already using or comfortable with OpenSearch, but it does introduce additional complexity with the need to maintain two systems.

Cassandra 5, on the other hand, brings vector support directly into the database. With its native VECTOR data type and similarity search functions, it simplifies your architecture and improves performance, making it an ideal solution for applications that require embedding-based searches at scale.

Feature           | Cassandra 4 + OpenSearch   | Cassandra 5 (Preview)
Embedding Storage | OpenSearch                 | Native VECTOR Data Type
Similarity Search | KNN Plugin in OpenSearch   | COSINE, EUCLIDEAN, DOT_PRODUCT
Search Method     | Exact K-Nearest Neighbor   | Approximate Nearest Neighbor (ANN)
System Complexity | Requires two systems       | All-in-one Cassandra solution

Conclusion: A simpler path to similarity search with Cassandra 5

With Cassandra 5, the complexity of setting up and managing a separate search system for word embeddings is gone. The new vector data type and Vector Search capabilities allow you to perform similarity searches directly within Cassandra, simplifying your architecture and making it easier to build AI-powered applications.

Coming up: more in-depth examples and use cases that demonstrate how to take full advantage of these new features in Cassandra 5 in future blogs!

Ready to experience vector search with Cassandra 5? Spin up your first cluster for free on the Instaclustr Managed Platform and try it out!


Monster Scale Summit Recap: Scaling Systems, Databases, and Engineering Leadership

Monster Scale Summit brought together some of the sharpest minds in distributed systems, data infrastructure, and engineering leadership — all focused on one thing: what it really takes to build and operate systems at scale. From database internals to leadership lessons, here are some highlights from two packed days of tech talks.

Watch On-Demand

Kelsey Hightower: Engineering at Scale: What Separates Leaders from Laggards

We kicked off with a candid conversation with Kelsey Hightower, a name that needs no introduction if you've ever dealt with Kubernetes, CoreOS, or even Puppet. Kelsey has been at the center of some of the biggest shifts in infrastructure over the past decade. Hearing his perspective on what separates companies that succeed at scale from those that don't was the most memorable part of the event for me. Kelsey tackled questions such as:

- Misconceptions in scaling engineering efforts: What common mistakes do engineers make?
- Design trade-offs: How do you balance the need to move fast while still designing for future growth?
- Avoiding over-engineering: How do you build just enough to handle scale without building complexity that slows you down?
- Developer experience and tooling: How do you give teams the right tools without overwhelming them?
- Leadership balance: How do technical depth and soft skills factor into great engineering leadership?

And of course, I couldn't resist asking: "Good programmers copy, great programmers paste" — is that still true? Spoiler: his answer was "They use ChatGPT!" Kelsey shared razor-sharp, unfiltered insights throughout the unscripted live session. If you care about engineering leadership in high-scale environments, watch this – now.

Dor Laor, ScyllaDB CEO: Pushing the Boundaries of Performance

Dor Laor, ScyllaDB CEO and Co-founder, took the virtual stage to share 10 years of lessons learned building ScyllaDB, a database designed for extreme speed and scale. Dor walked us through:

- The shard-per-core design that sets ScyllaDB apart.
- How ScyllaDB evolved from an idea (codename: "Sea Star" [C*]) to production systems handling billions of operations per day.
- What's next in terms of performance, cost-efficiency, and scalability.

Organizations have wasted time and money overprovisioning other databases at scale. Dor presented the next generation of ScyllaDB X Cloud, which provides true elasticity and unmatched storage capability, unique to ScyllaDB. If you're dealing with high-throughput, low-latency database workloads, take some time to absorb all the advances introduced… and how they might help your team.

Real-World Scaling Stories from Industry Leaders

One of the best parts of Monster Scale was hearing directly from the people building and operating some of the largest systems on the planet. Some of the talks that got the chat buzzing include…

Extreme Scale in Action

- Cloudflare: Serving millions of boot artifacts to a global audience.
- Agoda: Scaling 50x throughput with ScyllaDB.
- Discord: Handling trillions of search requests.
- American Express: Sharing design choices for routing global payments.
- Canva: Running machine learning workflows with over 100M images/day.

Database Internals and Their Impacts

- Avi Kivity (ScyllaDB CTO): Deep dive into engineering advances enabling massive scale.
- Felipe Mendes (ScyllaDB Technical Director): Detailed breakdown of how ScyllaDB stacks up against Cassandra 5.0.
- Responsive: Almog Gavra on replacing RocksDB with ScyllaDB to achieve next-level Kafka stream processing.
Optimizing Cost and Performance in the Cloud

- ScyllaDB: Cloud cost reduction, tiered storage, and high availability (HA) strategies.
- Slack: Managing 300+ mission-critical cron jobs efficiently.
- Yieldmo: Real savings from moving off DynamoDB to ScyllaDB.

Gwen Shapira: Reengineering Postgres for Millions of Tenants

If you think relational databases can't handle scale, Gwen Shapira showed up to challenge that. She detailed how Nile is rethinking Postgres to serve millions of tenants and shared the real operational challenges behind that journey. Her bottom line: "Scaling relational data is frigging hard." But it's also possible if you know what you're doing.

ShareChat: Building One of the World's Largest Feature Stores

With over 300M monthly active users, ShareChat has built a feature store that processes over a billion features per second. David and Ivan walked us through how they got there, the role ScyllaDB plays, and what they're doing now to optimize cost without compromising on scale.

Martin Kleppmann + Chris Riccomini: Designing Data-Intensive Apps in 2025

Yes, Martin & Chris confirmed an update to "Designing Data-Intensive Applications" is on the way. But this wasn't a book promo — it was a frank discussion on real-world data architecture, including what's broken and what still works when scaling distributed systems.

Avi Kivity: ScyllaDB's Monstrous Engineering Advances

Avi took us through ScyllaDB's latest innovations, from internals to future plans — essential viewing if you're using ScyllaDB and/or you're curious about the engineering behind high-performance, distributed databases.

More Sessions on Tackling Scale Head-On

- Resonate, Antithesis, Turso, poolside, Uber: Simple (and not-so-simple) mechanics of scaling.
- Medium, Alex DeBrie, Guilherme Nogueira + Nadav Har'El, Patrick Bossman: The reality of DynamoDB costs and why customers switch to ScyllaDB – plus practical migration insights.
- Kostja Osipov (ScyllaDB): Real lessons in surviving majority failures and consensus mechanics.
- Dzejla Medjedovic (Social Explorer): Exploring the benefits and tradeoffs between B-trees, B^eps-trees, and LSM-trees.

Ethan Donowitz: Database Upgrades with Shadow Clusters at Discord

Ethan gave us a compelling presentation on the use of "shadow clusters" at Discord to effectively de-risk the upgrade process in large-scale production systems. This included insights on how to build, mirror, validate, test and monitor — all practical tips you can apply to your own database environments.

Rachel Stephens + Adam Jacob: Scaling is the Funnest Game

Rachel and Adam gave us their honest take on the human side of scaling, with plenty of fun stories around technical trade-offs and why business context matters as much as engineering decisions. To quote Adam (while recounting some anecdotal coffee shop encounters with Chef users): "There is no funner game than the at-scale technology game."

Personal Takeaways

As an event host, I get the chance to review the recordings before the show — but it's not until the entire show is assembled and streamed online that the true depth and quality of the content becomes apparent to me. Also, what a privilege it was to interview Kelsey in person. I've used many of the systems and software he has influenced, so having a chat with him was both inspiring and grounding. You couldn't ask for a better role model in software engineering leadership. Cheers mate!
Monster Scale Summit wasn’t just about theory — it was about what happens when systems, teams, and businesses hit real limits and what it takes to move past them. From deep engineering to leadership lessons, if you’re working on systems that need to scale and perform predictably, this was a treasure trove of insights. And if you missed it? Check out the replays — because this is the kind of knowledge that will save you months of effort and pain. Watch Tech Talk Replays On-Demand    Behind the scenes, from the perspective of Wayne’s Ray-Ban Smart Glasses  

High Performance on a Low Budget: Gwen Shapira’s Tips for Startups

How even a scrappy early-stage startup can deliver outstanding performance

"It's one thing to solve performance challenges when you have plenty of time, money and expertise available. But what do you do in the opposite situation: If you are a small startup with no time or money and still need to deliver outstanding performance?" – Gwen Shapira, co-founder of Nile (PostgreSQL reengineered for multi-tenant apps)

That's the multi-million-dollar question for many early-stage startups. And who better to lead that discussion than Gwen Shapira, who has tackled performance from two vastly different perspectives? After years of focusing on data systems performance at large organizations, she recently pivoted to leading a startup – where she found herself responsible for full-stack performance, from the ground up.

In her P99 CONF keynote, "High Performance on a Low Budget," Gwen explored the topic by sharing performance challenges she and her small team faced at Nile – how they approached them, tradeoffs, and lessons learned. A few takeaways that set the conference chat on fire:

- Benchmarks should pay rent
- Keep tests stupid simple
- If you don't have time to optimize, at least don't pessimize

But the real value is in hearing the experiences behind these and other zingers. You can watch her talk below or keep reading for a guided tour.

Enjoy Gwen's insights? She'll be delivering another keynote at Monster SCALE Summit—alongside Kelsey Hightower and engineers from Discord, Disney+, Slack, Canva, Atlassian, Uber, ScyllaDB and many other leaders sharing how they're tackling extreme-scale engineering challenges. Join us – it's free + virtual.

Get Your Conference Pass

Do Worry About Performance Before You Ship It

Per Gwen, founders get all sorts of advice on how to run the company (whether they want it or not). Regarding performance, it's common to hear tips like:

- Don't worry about performance until users complain.
- Don't worry about performance until you have product market fit.
- Performance is a good problem to have. If people complain about performance, it is a great sign!

But she respectfully disagrees. After years of focusing on performance, she's all too familiar with the aftermath of that approach. Gwen shared, "As you talk to people, you want to figure out the minimal feature set required to bring an impactful product to market. And performance is part of this package. You should discover the target market's performance expectations when you discover the other expectations."

Things to consider at an early stage:

- If you're trying to beat the competition on performance, how much faster do you need to be?
- Even if performance is not your key differentiator, what are your users' expectations regarding performance?
- To what extent are users willing to accept latency or throughput tradeoffs for different capabilities – or accept higher costs to avoid those tradeoffs?

Founders are often told that even if you're not fully satisfied with the product, just ship it and see how people react. Gwen's reaction: "If you do a startup, there is a 100% chance that you will ship something that you're not 100% happy with. But the reason you ship early and iterate is because you really want to learn fast. If you identified performance expectations during the discovery phase, try to ship something in that ballpark.
Otherwise, you’re probably not learning all that much – with respect to performance, at least.” Hyperfocus on the User’s Perceived Latency For startups looking to make the biggest performance impact with limited resources, you can “cheat” by focusing on what will really help you attract and retain customers: optimizing the user’s perceived latency. Web apps are a great place to begin. Even if your core product is an eBPF-based edge database, your users will likely be interacting with a web app from the start of the user journey. Plus, there are lots of nice metrics to track (for example, Google’s Core Web Vitals). Gwen noted, “Startups very rarely have a scale problem. If you do have a scale problem, you can, for example, put people on a wait list while you’re determining how to scale out and add machines. However, even if you have a small number of users, you definitely care about them having a great experience with low latency. And perceived low latency is what really matters.” For example, consider this dashboard: When users logged in, the Nile team wanted to impress them by having this cool dashboard load instantly. However, they found that response times ranged from a snappy 200 milliseconds to a terrible 10+ seconds. To tackle the problem, the team started by parallelizing requests, filling in dashboard elements as data arrived and creating a progressive loading experience. These optimizations helped – and progressive loading turned out to be a fantastic way to hide latency (keeping the user engaged, like mirrors distracting you in a slow elevator). However, the optimizations exposed another issue. The app was making 2,000 individual API calls just to count open tickets. This is the classic N + 1 problem (when you end up running a query for each result instead of running a single optimized query that retrieves all necessary data at once.). Naturally, that inspired some API refinement and tuning. Then, another discovery. Their front-end dev noticed they were fetching more data than needed, so he cached it in the browser. This update sped up dashboard interactions by serving pre-cached data from the browser’s local storage. However, despite all those optimizations, the dashboard remained data-heavy. “Our customers loved it, but there was no reason why it had to be the first page after logging in,” Gwen remarked. So they moved the dashboard a layer down in the navigation. In its place, they added a much simpler landing page with easy access to the most common user tasks. Benchmarks Should Pay Rent Next, topic: The importance of being strategic about benchmarking. “Performance people love benchmarking (and I’m guilty of that),” Gwen admitted. ”But you can spend infinite time benchmarking with very little to show for it. So I want to share some tips on how to spend less time and have more to show for it.” She continued, “Benchmarks should pay rent by answering some important questions that you have. If you don’t have an important question, don’t run a benchmark. There, I just saved you weeks of your life – something invaluable for startups. You can thank me later. “ If your primary competitive advantage is performance, you will be expected to share performance tests to (attempt to) prove how fast and cool you are. Call it “benchmarketing.” For everyone else, two common questions to answer with benchmarking are: Is our database setup optimal? Are we delivering a good user experience? 
To assess the database setup, teams tend to:

- Turn to a standard benchmark like TPC-C
- Increase load over time
- Look for bottlenecks
- Fix what they can

But is this really the best way for a startup to spend its limited time and resources? Given her background as a performance expert, Gwen couldn't resist doing such a benchmark early on at Nile. But she doesn't recommend it – at least not for startups: "First of all, it takes a lot of time to run the standard benchmarks when you're not used to doing it week in, week out. It takes time to adjust all the knobs and parameters. It takes time to analyze the results, rinse and repeat. Even with good tools, it's never easy."

They did identify and fix some low-hanging fruit from this exercise. But since the tests were based on a standard benchmark, it was unclear how well they mapped to actual user experiences. Gwen continued, "I didn't feel the ROI was exactly compelling. If you're a performance expert and it takes you only about a day, it's probably worth it. But if you have to get up to speed, if you're spending significant time on the setup, you're better off focusing your efforts elsewhere."

A better question to obsess over is "Are we delivering a good experience?" More specifically, focus on these three areas:

- Optimizing user onboarding paths
- Addressing performance issues that annoy developers (these likely annoy users too)
- Paying attention to metrics that customers obsess over – even if they're not the ones your team has focused on

Keep Benchmarking Tests Stupid Simple

Another testing lesson learned: Focus on extra stupid sanity tests. At Nile, the team ran the simplest possible queries, like loading an empty page or querying an empty table. If those were slow, there was no point in running more complex tests. Stop, fix the problem, then proceed with more interesting tests.

Also, obsess over understanding what the numbers actually measure. You don't want to base critical performance decisions on misleading results (e.g., empty responses) or unintended behaviors. For example, her team once intended to test the write path but ended up testing the read path, thanks to a misconfigured DNS.

Build Infrastructure for Long-Term Value, Optimize for Quick Wins

The instrumentation and observability tools put in place during testing will pay off for years to come. At Nile, this infrastructure became invaluable throughout the product's lifetime for answering the persistent question "Why is it slow?" As Gwen put it: "Those early performance test numbers, that instrumentation, all the observability – this is very much a gift that keeps on giving as you continue to build and users inevitably complain about performance."

When prioritizing performance improvements, look for quick wins. For example, Nile found that a slow request was spending 40% of the time on parsing, 40% on lookups, and just 20% on actual work. The developer realized he could reuse an existing caching library to speed up lookups. That was a nice quick win – giving 40% of the time back with minimal effort. However, if he'd said, "I'm not 100% sure about caching, but I have this fast JSON parsing library," then that would have been a better way to shave off an equivalent 40%.

About a year later, they pushed most of the parsing down to a Postgres extension that was written in C and nicely optimized. The optimizations never end!

No Time to Optimize? Then At Least Don't Pessimize

Gwen's final tip involved empowering experienced engineers to make common-sense improvements.
“Last but not least, sometimes you really don’t have time to optimize. But if you have a team of experienced engineers, they know not to pessimize. They are familiar with faster JSON libraries, async libraries that work behind the scenes, they know not to put slow stuff on the critical path and so on. Even if you lack the time to prove that these things are actually faster, just do them. It’s not premature optimization. It’s just avoiding premature pessimization.”

Introduction to similarity search with word embeddings: Part 1–Apache Cassandra® 4.0 and OpenSearch®

Word embeddings have revolutionized how we approach tasks like natural language processing, search, and recommendation engines.

They allow us to convert words and phrases into numerical representations (vectors) that capture their meaning based on the context in which they appear. Word embeddings are especially useful for tasks where traditional keyword searches fall short, such as finding semantically similar documents or making recommendations based on textual data.

[Figure: scatter plot of word embeddings]

For example: a search for "Laptop" might return results related to "Notebook" or "MacBook" when using embeddings (as opposed to something like "Tablet"), offering a more intuitive and accurate search experience.

As applications increasingly rely on AI and machine learning to drive intelligent search and recommendation engines, the ability to efficiently handle word embeddings has become critical. That’s where databases like Apache Cassandra come into play—offering the scalability and performance needed to manage and query large amounts of vector data.

In Part 1 of this series, we’ll explore how you can leverage word embeddings for similarity searches using Cassandra 4 and OpenSearch. By combining Cassandra’s robust data storage capabilities with OpenSearch’s powerful search functions, you can build scalable and efficient systems that handle both metadata and word embeddings.

Cassandra 4 and OpenSearch: A partnership for embeddings

Cassandra 4 doesn’t natively support vector data types or specific similarity search functions, but that doesn’t mean you’re out of luck. By integrating Cassandra with OpenSearch, an open-source search and analytics platform, you can store word embeddings and perform similarity searches using the k-Nearest Neighbors (kNN) plugin.

This hybrid approach is advantageous over relying on OpenSearch alone because it allows you to leverage Cassandra’s strengths as a high-performance, scalable database for data storage while using OpenSearch for its robust indexing and search capabilities.

Instead of duplicating large volumes of data into OpenSearch solely for search purposes, you can keep the original data in Cassandra. In this setup, OpenSearch acts as an intelligent pointer: it indexes the embeddings and stores just the IDs needed to link search results back to the records in Cassandra, so it never has to manage the full dataset directly.

This approach not only optimizes resource usage but also enhances system maintainability and scalability by segregating storage and search functionalities into specialized layers.

Deploying the environment

To set up your environment for word embeddings and similarity search, you can leverage the Instaclustr Managed Platform, which simplifies deploying and managing your Cassandra cluster and OpenSearch. Instaclustr takes care of the heavy lifting, allowing you to focus on building your application rather than managing infrastructure. In this configuration, Cassandra serves as your primary data store, while OpenSearch handles vector operations and similarity searches.

Here’s how to get started:

  1. Deploy a managed Cassandra cluster: Start by provisioning your Cassandra 4 cluster on the Instaclustr platform. This managed solution ensures your cluster is optimized, secure, and ready to store non-vector data.
  2. Set up OpenSearch with kNN plugin: Instaclustr also offers a fully managed OpenSearch service. You will need to deploy OpenSearch, with the kNN plugin enabled, which is critical for handling word embeddings and executing similarity searches.

By using Instaclustr, you gain access to a robust platform that seamlessly integrates Cassandra and OpenSearch, combining Cassandra’s scalable, fault-tolerant database with OpenSearch’s powerful search capabilities. This managed environment minimizes operational complexity, so you can focus on delivering fast and efficient similarity searches for your application.

Preparing the environment

Now that we’ve outlined the environment setup, let’s dive into the specific technical steps to prepare Cassandra and OpenSearch for storing and searching word embeddings.

Step 1: Setting up Cassandra

In Cassandra, we’ll need to create a table to store the metadata. Here’s how to do that:

  1. Create the keyspace and table:
    First, create the keyspace, then the table that will store the file metadata (details such as the related text, filename, and timestamps):

CREATE KEYSPACE IF NOT EXISTS aisearch WITH REPLICATION = {'class': 'SimpleStrategy', 'replication_factor': 3};

USE aisearch;

DROP TABLE IF EXISTS file_metadata;

CREATE TABLE IF NOT EXISTS file_metadata (
    id UUID,
    paragraph_uuid UUID,
    filename TEXT,
    text TEXT,
    last_updated timestamp,
    PRIMARY KEY (id, paragraph_uuid)
);

Step 2: Configuring OpenSearch

In OpenSearch, you’ll need to create an index that supports vector operations for similarity search. Here’s how you can configure it:

  1. Create the index:
    Define the index settings and mappings, ensuring that vector operations are enabled and that the correct space type (e.g., L2) is used for similarity calculations.
{
  "settings": {
    "index": {
      "number_of_shards": 2,
      "knn": true,
      "knn.space_type": "l2"
    }
  },
  "mappings": {
    "properties": {
      "file_uuid": {
        "type": "keyword"
      },
      "paragraph_uuid": {
        "type": "keyword"
      },
      "embedding": {
        "type": "knn_vector",
        "dimension": 300
      }
    }
  }
}

This index configuration is optimized for storing and searching embeddings using the k-Nearest Neighbors algorithm, which is crucial for similarity search.
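If you prefer to create the index from Python rather than via the REST API, here is a sketch using the opensearch-py client (the endpoint and credentials are placeholders; substitute your own connection details):

from opensearchpy import OpenSearch

# Placeholder endpoint and credentials -- replace with your OpenSearch connection details
os_client = OpenSearch(
    hosts=[{"host": "your-opensearch-host", "port": 9200}],
    http_auth=("icopensearch", "replace_with_your_password"),
    use_ssl=True,
    verify_certs=True,
)

index_name = "embeddings"
index_body = {
    "settings": {"index": {"number_of_shards": 2, "knn": True, "knn.space_type": "l2"}},
    "mappings": {
        "properties": {
            "file_uuid": {"type": "keyword"},
            "paragraph_uuid": {"type": "keyword"},
            "embedding": {"type": "knn_vector", "dimension": 300},
        }
    },
}

# Create the index only if it doesn't already exist
if not os_client.indices.exists(index=index_name):
    os_client.indices.create(index=index_name, body=index_body)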

With these steps, your environment will be ready to handle word embeddings for similarity search using Cassandra and OpenSearch.

Generating embeddings with FastText

Once you have your environment set up, the next step is to generate the word embeddings that will drive your similarity search. For this, we’ll use FastText, a popular library from Facebook’s AI Research team that provides pre-trained word vectors. Specifically, we’re using the crawl-300d-2M model, which offers 300-dimensional vectors for millions of English words.

Step 1: Download and load the FastText model

To start, you’ll need to download the pre-trained model file. This can be done easily using Python and the requests library. Here’s the process:

1. Download the FastText model: The FastText model is stored in a zip file, which you can download from the official FastText website. The following Python script will handle the download and extraction:

import requests 
import zipfile 
import os 

# Adjust the file_url and local_filename variables accordingly
file_url = 'https://dl.fbaipublicfiles.com/fasttext/vectors-english/crawl-300d-2M.vec.zip' 
local_filename = '/content/gdrive/MyDrive/0_notebook_files/model/crawl-300d-2M.vec.zip' 
extract_dir = '/content/gdrive/MyDrive/0_notebook_files/model/' 

def download_file(url, filename): 
    with requests.get(url, stream=True) as r: 
        r.raise_for_status() 
        os.makedirs(os.path.dirname(filename), exist_ok=True) 
        with open(filename, 'wb') as f: 
            for chunk in r.iter_content(chunk_size=8192): 
                f.write(chunk) 
 

def unzip_file(filename, extract_to): 
    with zipfile.ZipFile(filename, 'r') as zip_ref: 
        zip_ref.extractall(extract_to) 

# Download and extract 
download_file(file_url, local_filename) 
unzip_file(local_filename, extract_dir)

2. Load the model: Once the model is downloaded and extracted, you’ll load it using Gensim’s KeyedVectors class. This allows you to work with the embeddings directly: 

from gensim.models import KeyedVectors 

# Adjust model_path variable accordingly
model_path = "/content/gdrive/MyDrive/0_notebook_files/model/crawl-300d-2M.vec"
fasttext_model = KeyedVectors.load_word2vec_format(model_path, binary=False)

Step 2: Generate embeddings from text

With the FastText model loaded, the next task is to convert text into vectors. This process involves splitting the text into words, looking up the vector for each word in the FastText model, and then averaging the vectors to get a single embedding for the text.

Here’s a function that handles the conversion:

import numpy as np 
import re 

def text_to_vector(text): 
    """Convert text into a vector using the FastText model.""" 
    text = text.lower() 
    words = re.findall(r'\b\w+\b', text) 
    vectors = [fasttext_model[word] for word in words if word in fasttext_model.key_to_index] 

    if not vectors: 
        print(f"No embeddings found for text: {text}") 
        return np.zeros(fasttext_model.vector_size) 

    return np.mean(vectors, axis=0)

This function tokenizes the input text, retrieves the corresponding word vectors from the model, and computes the average to create a final embedding.
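As a quick sanity check (assuming the model loaded above; the example words are illustrative), you can compare two embeddings directly:

import numpy as np

v1 = text_to_vector("laptop")
v2 = text_to_vector("notebook")
print(v1.shape)  # (300,) -- one 300-dimensional vector per text

# Cosine similarity between the two embeddings; semantically related
# words should score noticeably higher than unrelated ones
cosine = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
print(f"cosine similarity: {cosine:.4f}")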

Step 3: Extract text and generate embeddings from documents

In real-world applications, your text might come from various types of documents, such as PDFs, Word files, or presentations. The following code shows how to extract text from different file formats and convert that text into embeddings:

import uuid 
import mimetypes 
import pandas as pd 
from pdfminer.high_level import extract_pages 
from pdfminer.layout import LTTextContainer 
from docx import Document 
from pptx import Presentation 

def generate_deterministic_uuid(name): 
    return uuid.uuid5(uuid.NAMESPACE_DNS, name) 

def generate_random_uuid(): 
    return uuid.uuid4() 

def get_file_type(file_path): 
    # Guess the MIME type based on the file extension 
    mime_type, _ = mimetypes.guess_type(file_path) 
    return mime_type 

def extract_text_from_excel(excel_path): 
    xls = pd.ExcelFile(excel_path) 
    text_list = [] 

    for sheet_index, sheet_name in enumerate(xls.sheet_names): 
        df = xls.parse(sheet_name) 
        for row in df.iterrows(): 
            text_list.append((" ".join(map(str, row[1].values)), sheet_index + 1))  # +1 to make it a 1-based index 

    return text_list 

def extract_text_from_pdf(pdf_path): 
    return [(text_line.get_text().strip().replace('\xa0', ' '), page_num) 
            for page_num, page_layout in enumerate(extract_pages(pdf_path), start=1) 
            for element in page_layout if isinstance(element, LTTextContainer) 
            for text_line in element if text_line.get_text().strip()] 

def extract_text_from_word(file_path): 
    doc = Document(file_path) 
    # python-docx does not expose page numbers, so 1 is used as a placeholder 
    return [(para.text, 1) for para in doc.paragraphs if para.text.strip()] 

def extract_text_from_txt(file_path): 
    with open(file_path, 'r') as file: 
        return [(line.strip(), 1) for line in file.readlines() if line.strip()] 

def extract_text_from_pptx(pptx_path): 
    prs = Presentation(pptx_path) 
    return [(shape.text.strip(), slide_num) for slide_num, slide in enumerate(prs.slides, start=1) 
            for shape in slide.shapes if hasattr(shape, "text") and shape.text.strip()] 

def extract_text_with_page_number_and_embeddings(file_path, embedding_function): 
    file_uuid = generate_deterministic_uuid(file_path) 
    file_type = get_file_type(file_path) 

    extractors = { 
        'text/plain': extract_text_from_txt, 
        'application/pdf': extract_text_from_pdf, 
        'application/vnd.openxmlformats-officedocument.wordprocessingml.document': extract_text_from_word, 
        'application/vnd.openxmlformats-officedocument.presentationml.presentation': extract_text_from_pptx, 
        'application/zip': lambda path: extract_text_from_pptx(path) if path.endswith('.pptx') else [], 
        'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet': extract_text_from_excel, 
        'application/vnd.ms-excel': extract_text_from_excel
    }

    text_list = extractors.get(file_type, lambda _: [])(file_path) 

    return [ 
      { 
          "uuid": file_uuid, 
          "paragraph_uuid": generate_random_uuid(), 
          "filename": file_path, 
          "text": text, 
          "page_num": page_num, 
          "embedding": embedding 
      } 
      for text, page_num in text_list 
      if (embedding := embedding_function(text)).any()  # Check if the embedding is not all zeros 
    ] 

# Replace the file path with the one you want to process 
file_path = "../../docs-manager/Cassandra-Best-Practices.pdf"
paragraphs_with_embeddings = extract_text_with_page_number_and_embeddings(file_path, text_to_vector)

This code handles extracting text from different document types, generating embeddings for each text chunk, and associating them with unique IDs.

With FastText set up and embeddings generated, you’re now ready to store these vectors in OpenSearch and start performing similarity searches.

Performing similarity searches

To conduct similarity searches, we utilize the k-Nearest Neighbors (kNN) plugin within OpenSearch. This plugin allows us to efficiently search for the most similar embeddings stored in the system. Essentially, you’re querying OpenSearch to find the closest matches to a word or phrase based on your embeddings.

For example, if you’ve embedded product descriptions, using kNN search helps you locate products that are semantically similar to a given input. This capability can significantly enhance your application’s recommendation engine, categorization, or clustering.
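
Before you can run kNN queries, the target index must be created with kNN enabled and a knn_vector field for the embeddings. A minimal sketch with the opensearch-py client follows; the index name and the 300-dimension assumption (FastText's default vector size) should be adjusted to match your setup:

# Sketch: create a kNN-enabled index (index name and dimension are assumptions) 
index_name = "paragraph_embeddings" 

index_body = { 
    "settings": {"index": {"knn": True}},  # enable the kNN plugin for this index 
    "mappings": { 
        "properties": { 
            "embedding": {"type": "knn_vector", "dimension": 300},  # FastText vectors are 300-dimensional by default 
            "file_uuid": {"type": "keyword"}, 
            "paragraph_uuid": {"type": "keyword"} 
        } 
    } 
} 

if not os_client.indices.exists(index=index_name): 
    os_client.indices.create(index=index_name, body=index_body)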

This setup with Cassandra and OpenSearch is a powerful combination, but it’s important to remember that it requires managing two systems. As Cassandra evolves, the introduction of built-in vector support in Cassandra 5 simplifies this architecture. But for now, let’s focus on leveraging both systems to get the most out of similarity searches.

Example: Inserting metadata in Cassandra and embeddings in OpenSearch

In this example, we use Cassandra 4 to store metadata related to files and paragraphs, while OpenSearch handles the actual word embeddings. By storing the paragraph and file IDs in both systems, we can link the metadata in Cassandra with the embeddings in OpenSearch.

We first need to store metadata such as the file name, paragraph UUID, and other relevant details in Cassandra. This metadata will be crucial for linking the data between Cassandra, OpenSearch, and the file itself on the filesystem.
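
The schema for this table isn't shown in this post, but based on the queries used below, a file_metadata table along these lines would work (a sketch; tune the primary key to your access patterns):

# Sketch: metadata table assumed by the insert and select queries in this post 
session.execute(f""" 
    CREATE TABLE IF NOT EXISTS {keyspace_name}.file_metadata ( 
        id uuid,               -- deterministic UUID derived from the file path 
        paragraph_uuid uuid,   -- random UUID assigned to each paragraph 
        filename text, 
        text text, 
        PRIMARY KEY (id, paragraph_uuid) 
    ); 
""")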

The following code demonstrates how to insert the metadata into Cassandra and the embeddings into OpenSearch. It relies on two helper functions, insert_with_retry and insert_embedding_to_opensearch, which are sketched after the code. Make sure to run the previous script first so that the “paragraphs_with_embeddings” variable is populated:

from tqdm import tqdm 

# Function to insert data into both Cassandra and OpenSearch 
def insert_paragraph_data(session, os_client, paragraph, keyspace_name, index_name): 
    # Insert into Cassandra 
    cassandra_result = insert_with_retry( 
        session=session, 
        id=paragraph['uuid'], 
        paragraph_uuid=paragraph['paragraph_uuid'], 
        text=paragraph['text'], 
        filename=paragraph['filename'], 
        keyspace_name=keyspace_name, 
        max_retries=3, 
        retry_delay_seconds=1 
    ) 

    if not cassandra_result: 
        return False  # Stop further processing if Cassandra insertion fails 

    # Insert into OpenSearch 
    opensearch_result = insert_embedding_to_opensearch( 
        os_client=os_client, 
        index_name=index_name, 
        file_uuid=paragraph['uuid'], 
        paragraph_uuid=paragraph['paragraph_uuid'], 
        embedding=paragraph['embedding'] 
    ) 

    if opensearch_result is not None:  # insert_embedding_to_opensearch returns None on success 
        return False  # Return False if OpenSearch insertion fails 

    return True  # Return True on success for both 

# Process each paragraph with a progress bar 
print("Starting batch insertion of paragraphs.") 

for paragraph in tqdm(paragraphs_with_embeddings, desc="Inserting paragraphs"): 
    if not insert_paragraph_data( 
        session=session, 
        os_client=os_client, 
        paragraph=paragraph, 
        keyspace_name=keyspace_name, 
        index_name=index_name 
    ): 

        print(f"Insertion failed for UUID {paragraph['uuid']}: {paragraph['text'][:50]}...") 

print("Batch insertion completed.")

Performing similarity search

Now that we’ve stored both metadata in Cassandra and embeddings in OpenSearch, it’s time to perform a similarity search. This step involves searching OpenSearch for embeddings that closely match a given input and then retrieving the corresponding metadata from Cassandra.

The process is straightforward: we start by converting the input text into an embedding, then use the k-Nearest Neighbors (kNN) plugin in OpenSearch to find the most similar embeddings. Once we have the results, we fetch the related metadata from Cassandra, such as the original text and file name.

Here’s how it works:

  1. Convert text to embedding: Start by converting your input text into an embedding vector using the FastText model (a minimal sketch of this helper follows this list). This vector will serve as the query for our similarity search.
  2. Search OpenSearch for similar embeddings: Using the KNN search capability in OpenSearch, we find the top k most similar embeddings. Each result includes the corresponding file and paragraph UUIDs, which help us link the results back to Cassandra.
  3. Fetch metadata from Cassandra: With the UUIDs retrieved from OpenSearch, we query Cassandra to get the metadata, such as the original text and file name, associated with each embedding.
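
For reference, here is a minimal sketch of the text_to_vector helper referenced in step 1, assuming a FastText model loaded earlier during setup (for example, the pre-trained cc.en.300.bin):

import fasttext 

# Assumes the model was loaded earlier during FastText setup, e.g.: 
# model = fasttext.load_model("cc.en.300.bin") 

def text_to_vector(text): 
    """Convert a piece of text into a single embedding vector (numpy array).""" 
    # fastText expects a single line of text, so strip newlines first 
    return model.get_sentence_vector(text.replace("\n", " "))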

The following code demonstrates this process:

import uuid 
from IPython.display import display, HTML 

def find_similar_embeddings_opensearch(os_client, index_name, input_embedding, top_k=5): 
    """Search for similar embeddings in OpenSearch and return the associated UUIDs.""" 
    query = { 
        "size": top_k, 
        "query": { 
            "knn": { 
                "embedding": { 
                    "vector": input_embedding.tolist(), 
                    "k": top_k 
                } 
            } 
        } 
    }

    response = os_client.search(index=index_name, body=query) 

    similar_uuids = [] 
    for hit in response['hits']['hits']: 
        file_uuid = hit['_source']['file_uuid'] 
        paragraph_uuid = hit['_source']['paragraph_uuid'] 
        similar_uuids.append((file_uuid, paragraph_uuid))  

    return similar_uuids 

def fetch_metadata_from_cassandra(session, file_uuid, paragraph_uuid, keyspace_name): 
    """Fetch the metadata (text and filename) from Cassandra based on UUIDs.""" 
    file_uuid = uuid.UUID(file_uuid) 
    paragraph_uuid = uuid.UUID(paragraph_uuid) 

    query = f""" 
    SELECT text, filename 
    FROM {keyspace_name}.file_metadata 
    WHERE id = ? AND paragraph_uuid = ?; 
    """ 
    prepared = session.prepare(query) 
    bound = prepared.bind((file_uuid, paragraph_uuid)) 
    rows = session.execute(bound)    

    for row in rows: 
        return row.filename, row.text 
    return None, None 

# Input text to find similar embeddings 
input_text = "place" 

# Convert input text to embedding 
input_embedding = text_to_vector(input_text) 

# Find similar embeddings in OpenSearch 
similar_uuids = find_similar_embeddings_opensearch(os_client, index_name=index_name, input_embedding=input_embedding, top_k=10) 

# Fetch and display metadata from Cassandra based on the UUIDs found in OpenSearch 
for file_uuid, paragraph_uuid in similar_uuids: 
    filename, text = fetch_metadata_from_cassandra(session, file_uuid, paragraph_uuid, keyspace_name)

    if filename and text: 
        html_content = f""" 
        <div style="margin-bottom: 10px;"> 
            <p><b>File UUID:</b> {file_uuid}</p> 
            <p><b>Paragraph UUID:</b> {paragraph_uuid}</p> 
            <p><b>Text:</b> {text}</p> 
            <p><b>File:</b> {filename}</p> 
        </div> 

        <hr/> 
        """ 

        display(HTML(html_content))

This code demonstrates how to find similar embeddings in OpenSearch and retrieve the corresponding metadata from Cassandra. By linking the two systems via the UUIDs, you can build powerful search and recommendation systems that combine metadata storage with advanced embedding-based searches.

Conclusion and next steps: A powerful combination of Cassandra 4 and OpenSearch

By leveraging the strengths of Cassandra 4 and OpenSearch, you can build a system that handles both metadata storage and similarity search. Cassandra efficiently stores your file and paragraph metadata, while OpenSearch takes care of embedding-based searches using the k-Nearest Neighbors algorithm. Together, these two technologies enable powerful, large-scale applications for text search, recommendation engines, and more.

Coming up in Part 2, we’ll explore how Cassandra 5 simplifies this architecture with built-in vector support and native similarity search capabilities.

Ready to try vector search with Cassandra and OpenSearch? Spin up your first cluster for free on the Instaclustr Managed Platform and explore the incredible power of vector search.


How Cassandra Streaming, Performance, Node Density, and Cost Are All Related

This is the first post of several I have planned on optimizing Apache Cassandra for maximum cost efficiency. I’ve spent over a decade working with Cassandra, including tens of thousands of hours data modeling, fixing issues, writing tools for it, and analyzing its performance. I’ve always been fascinated by database performance tuning, even before Cassandra.

A decade ago I filed one of my first issues with the project, where I laid out my target goal of 20TB of data per node. This wasn’t possible for most workloads at the time, but I’ve kept this target in my sights.