High Performance on a Low Budget: Gwen Shapira’s Tips for Startups

How even a scrappy early-stage startup can deliver outstanding performance “It’s one thing to solve performance challenges when you have plenty of time, money and expertise available. But what do you do in the opposite situation: If you are a small startup with no time or money and still need to deliver outstanding performance?” – Gwen Shapira, co-founder of Nile (PostgreSQL reengineered for multi-tenant apps) That’s the multi-million-dollar question for many early-stage startups. And who better to lead that discussion than Gwen Shapira, who has tackled performance from two vastly different perspectives? After years of focusing on data systems performance at large organizations, she recently pivoted to leading a startup – where she found herself responsible for full-stack performance, from the ground up. In her P99 CONF keynote, “High Performance on a Low Budget,” Gwen explored the topic by sharing performance challenges she and her small team faced at Nile – how they approached them, tradeoffs and lessons learned. A few takeaways that set the conference chat on fire: Benchmarks should pay rent Keep tests stupid simple If you don’t have time to optimize, at least don’t pessimize But the real value is in hearing the experiences behind these and other zingers. You can watch her talk below or keep reading for a guided tour. Enjoy Gwen’s insights? She’ll be delivering another keynote at Monster SCALE Summit—alongside Kelsey Hightower and engineers from Discord, Disney+, Slack, Canva, Atlassian, Uber, ScyllaDB and many other leaders sharing how they’re tackling extreme-scale engineering challenges. Join us – it’s free + virtual. Get Your Conference Pass Do Worry About Performance Before You Ship It Per Gwen, founders get all sorts of advice on how to run the company (whether they want it or not). Regarding performance, it’s common to hear tips like: Don’t worry about performance until users complain. Don’t worry about performance until you have product market fit. Performance is a good problem to have. If people complain about performance, it is a great sign! But she respectfully disagrees. After years of focusing on performance, she’s all too familiar with the aftermath of that approach. Gwen shared, “As you talk to people, you want to figure out the minimal feature set required to bring an impactful product to market. And performance is part of this package. You should discover the target market’s performance expectations when you discover the other expectations.” Things to consider at an early stage: If you’re trying to beat the competition on performance, how much faster do you need to be? Even if performance is not your key differentiator, what are your users’ expectations regarding performance? To what extent are users willing to accept latency or throughput tradeoffs for different capabilities – or accept higher costs to avoid those tradeoffs? Founders are often told that even if you’re not fully satisfied with the product, just ship it and see how people react. Gwen’s reaction: “If you do a startup, there is a 100% chance that you will ship something that you’re not 100% happy with. But the reason you ship early and iterate is because you really want to learn fast. If you identified performance expectations during the discovery phase, try to ship something in that ballpark. Otherwise, you’re probably not learning all that much – with respect to performance, at least.” Hyperfocus on the User’s Perceived Latency For startups looking to make the biggest performance impact with limited resources, you can “cheat” by focusing on what will really help you attract and retain customers: optimizing the user’s perceived latency. Web apps are a great place to begin. Even if your core product is an eBPF-based edge database, your users will likely be interacting with a web app from the start of the user journey. Plus, there are lots of nice metrics to track (for example, Google’s Core Web Vitals). Gwen noted, “Startups very rarely have a scale problem. If you do have a scale problem, you can, for example, put people on a wait list while you’re determining how to scale out and add machines. However, even if you have a small number of users, you definitely care about them having a great experience with low latency. And perceived low latency is what really matters.” For example, consider this dashboard: When users logged in, the Nile team wanted to impress them by having this cool dashboard load instantly. However, they found that response times ranged from a snappy 200 milliseconds to a terrible 10+ seconds. To tackle the problem, the team started by parallelizing requests, filling in dashboard elements as data arrived and creating a progressive loading experience. These optimizations helped – and progressive loading turned out to be a fantastic way to hide latency (keeping the user engaged, like mirrors distracting you in a slow elevator). However, the optimizations exposed another issue. The app was making 2,000 individual API calls just to count open tickets. This is the classic N + 1 problem (when you end up running a query for each result instead of running a single optimized query that retrieves all necessary data at once.). Naturally, that inspired some API refinement and tuning. Then, another discovery. Their front-end dev noticed they were fetching more data than needed, so he cached it in the browser. This update sped up dashboard interactions by serving pre-cached data from the browser’s local storage. However, despite all those optimizations, the dashboard remained data-heavy. “Our customers loved it, but there was no reason why it had to be the first page after logging in,” Gwen remarked. So they moved the dashboard a layer down in the navigation. In its place, they added a much simpler landing page with easy access to the most common user tasks. Benchmarks Should Pay Rent Next, topic: The importance of being strategic about benchmarking. “Performance people love benchmarking (and I’m guilty of that),” Gwen admitted. ”But you can spend infinite time benchmarking with very little to show for it. So I want to share some tips on how to spend less time and have more to show for it.” She continued, “Benchmarks should pay rent by answering some important questions that you have. If you don’t have an important question, don’t run a benchmark. There, I just saved you weeks of your life – something invaluable for startups. You can thank me later. “ If your primary competitive advantage is performance, you will be expected to share performance tests to (attempt to) prove how fast and cool you are. Call it “benchmarketing.” For everyone else, two common questions to answer with benchmarking are: Is our database setup optimal? Are we delivering a good user experience? To assess the database setup, teams tend to: Turn to a standard benchmark like TPCC Increase load over time Look for bottlenecks Fix what they can But is this really the best way for a startup to spend its limited time and resources? Given her background as a performance expert, Gwen couldn’t resist doing such a benchmark early on at Nile. But she doesn’t recommend it – at least not for startups: “First of all, it takes a lot of time to run the standard benchmarks when you’re not used to doing it week in, week out. It takes time to adjust all the knobs and parameters. It takes time to analyze the results, rinse and repeat. Even with good tools, it’s never easy.” They did identify and fix some low-hanging fruits from this exercise. But since tests were based on a standard benchmark, it was unclear how well it mapped to actual user experiences. Gwen continued, “I didn’t feel the ROI was exactly compelling. If you’re a performance expert and it takes you only about a day, it’s probably worth it. But if you have to get up to speed, if you’re spending significant time on the setup, you’re better off focusing your efforts elsewhere.” A better question to obsess over is “Are we delivering a good experience?” More specifically, focus on these three areas: Optimizing user onboarding paths Addressing performance issues that annoy developers (these likely annoy users too) Paying attention to metrics that customers obsess over – even if they’re not the ones your team has focused on Keep Benchmarking Tests Stupid Simple Another testing lesson learned: Focus on extra stupid sanity tests. At Nile, the team ran the simplest possible queries, like loading an empty page or querying an empty table. If those were slow, there was no point in running more complex tests. Stop, fix the problem, then proceed with more interesting tests. Also, obsess over understanding what the numbers actually measure. You don’t want to base critical performance decisions on misleading results (e.g., empty responses) or unintended behaviors. For example, her team once intended to test the write path but ended up testing the read path thanks to a misconfigured DNS. Build Infrastructure for Long-Term Value, Optimize for Quick Wins The instrumentation and observability tools put in place during testing will pay off for years to come. At Nile, this infrastructure became invaluable throughout the product’s lifetime for answering the persistent question “Why is it slow?” As Gwen put it: “Those early performance test numbers, that instrumentation, all the observability – this is very much a gift that keeps on giving as you continue to build and users inevitably complain about performance.” When prioritizing performance improvements, look for quick wins. For example, Nile found that a slow request was spending 40% of the time on parsing, 40% on lookups, and just 20% on actual work. The developer realized he could reuse an existing caching library to speed up lookups. That was a nice quick win – giving 40% of the time back with minimal effort. However, if he’d said, “I’m not 100% sure about caching, but I have this fast JSON parsing library,” then that would have been a better way to shave off an equivalent 40%. About a year later, they pushed most of the parsing down to a Postgres extension that was written in C and nicely optimized. The optimizations never end! No Time to Optimize? Then At Least Don’t Pessimize Gwen’s final tip involved empowering experienced engineers to make common-sense improvements. “Last but not least, sometimes you really don’t have time to optimize. But if you have a team of experienced engineers, they know not to pessimize. They are familiar with faster JSON libraries, async libraries that work behind the scenes, they know not to put slow stuff on the critical path and so on. Even if you lack the time to prove that these things are actually faster, just do them. It’s not premature optimization. It’s just avoiding premature pessimization.”

Introduction to similarity search with word embeddings: Part 1–Apache Cassandra® 4.0 and OpenSearch®

Word embeddings have revolutionized how we approach tasks like natural language processing, search, and recommendation engines.

They allow us to convert words and phrases into numerical representations (vectors) that capture their meaning based on the context in which they appear. Word embeddings are especially useful for tasks where traditional keyword searches fall short, such as finding semantically similar documents or making recommendations based on textual data.

For example: a search for “Laptop” might return results related to “Notebook” or “MacBook” when using embeddings (as opposed to something like “Tablet”) offering a more intuitive and accurate search experience.

As applications increasingly rely on AI and machine learning to drive intelligent search and recommendation engines, the ability to efficiently handle word embeddings has become critical. That’s where databases like Apache Cassandra come into play—offering the scalability and performance needed to manage and query large amounts of vector data.

In Part 1 of this series, we’ll explore how you can leverage word embeddings for similarity searches using Cassandra 4 and OpenSearch. By combining Cassandra’s robust data storage capabilities with OpenSearch’s powerful search functions, you can build scalable and efficient systems that handle both metadata and word embeddings.

Cassandra 4 and OpenSearch: A partnership for embeddings

Cassandra 4 doesn’t natively support vector data types or specific similarity search functions, but that doesn’t mean you’re out of luck. By integrating Cassandra with OpenSearch, an open-source search and analytics platform, you can store word embeddings and perform similarity searches using the k-Nearest Neighbors (kNN) plugin.

This hybrid approach is advantageous over relying on OpenSearch alone because it allows you to leverage Cassandra’s strengths as a high-performance, scalable database for data storage while using OpenSearch for its robust indexing and search capabilities.

Instead of duplicating large volumes of data into OpenSearch solely for search purposes, you can keep the original data in Cassandra. OpenSearch, in this setup, acts as an intelligent pointer, indexing the embeddings stored in Cassandra and performing efficient searches without the need to manage the entire dataset directly.

This approach not only optimizes resource usage but also enhances system maintainability and scalability by segregating storage and search functionalities into specialized layers.

Deploying the environment

To set up your environment for word embeddings and similarity search, you can leverage the Instaclustr Managed Platform, which simplifies deploying and managing your Cassandra cluster and OpenSearch. Instaclustr takes care of the heavy lifting, allowing you to focus on building your application rather than managing infrastructure. In this configuration, Cassandra serves as your primary data store, while OpenSearch handles vector operations and similarity searches.

Here’s how to get started:

  1. Deploy a managed Cassandra cluster: Start by provisioning your Cassandra 4 cluster on the Instaclustr platform. This managed solution ensures your cluster is optimized, secure, and ready to store non-vector data.
  2. Set up OpenSearch with kNN plugin: Instaclustr also offers a fully managed OpenSearch service. You will need to deploy OpenSearch, with the kNN plugin enabled, which is critical for handling word embeddings and executing similarity searches.

By using Instaclustr, you gain access to a robust platform that seamlessly integrates Cassandra and OpenSearch, combining Cassandra’s scalable, fault-tolerant database with OpenSearch’s powerful search capabilities. This managed environment minimizes operational complexity, so you can focus on delivering fast and efficient similarity searches for your application.

Preparing the environment

Now that we’ve outlined the environment setup, let’s dive into the specific technical steps to prepare Cassandra and OpenSearch for storing and searching word embeddings.

Step 1: Setting up Cassandra

In Cassandra, we’ll need to create a table to store the metadata. Here’s how to do that:

  1. Create the Table:
    Next, create a table to store the embeddings. This table will hold details such as the embedding vector, related text, and metadata:CREATE KEYSPACE IF NOT EXISTS aisearch WITH REPLICATION = {‘class’: ‘SimpleStrategy’, ‘
CREATE KEYSPACE IF NOT EXISTS aisearch WITH REPLICATION = {'class': 'SimpleStrategy',          '
replication_factor': 3};

USE file_metadata;
 
DROP TABLE IF EXISTS file_metadata; 
    CREATE TABLE IF NOT EXISTS file_metadata ( 
      id UUID, 
      paragraph_uuid UUID, 
      filename TEXT, 
      text TEXT, 
      last_updated timestamp, 
      PRIMARY KEY (id, paragraph_uuid) 
    );

Step 2: Configuring OpenSearch

In OpenSearch, you’ll need to create an index that supports vector operations for similarity search. Here’s how you can configure it:

  1. Create the index:
    Define the index settings and mappings, ensuring that vector operations are enabled and that the correct space type (e.g., L2) is used for similarity calculations.
{ 
  "settings": { 
   "index": { 
     "number_of_shards": 2, 
      "knn": true, 
      "knn.space_type": "l2" 
    } 
  }, 
  "mappings": { 
    "properties": { 
      "file_uuid": { 
        "type": "keyword" 
      }, 
      "paragraph_uuid": { 
        "type": "keyword" 
      }, 
      "embedding": { 
        "type": "knn_vector", 
        "dimension": 300 
      } 
    } 
  } 
}

This index configuration is optimized for storing and searching embeddings using the k-Nearest Neighbors algorithm, which is crucial for similarity search.

With these steps, your environment will be ready to handle word embeddings for similarity search using Cassandra and OpenSearch.

Generating embeddings with FastText

Once you have your environment set up, the next step is to generate the word embeddings that will drive your similarity search. For this, we’ll use FastText, a popular library from Facebook’s AI Research team that provides pre-trained word vectors. Specifically, we’re using the crawl-300d-2M model, which offers 300-dimensional vectors for millions of English words.

Step 1: Download and load the FastText model

To start, you’ll need to download the pre-trained model file. This can be done easily using Python and the requests library. Here’s the process:

1. Download the FastText model: The FastText model is stored in a zip file, which you can download from the official FastText website. The following Python script will handle the download and extraction:

import requests 
import zipfile 
import os 

# Adjust file_url  and local_filename  variables accordingly 
file_url = 'https://dl.fbaipublicfiles.com/fasttext/vectors-english/crawl-300d-2M.vec.zip' 
local_filename = '/content/gdrive/MyDrive/0_notebook_files/model/crawl-300d-2M.vec.zip' 
extract_dir = '/content/gdrive/MyDrive/0_notebook_files/model/' 

def download_file(url, filename): 
    with requests.get(url, stream=True) as r: 
        r.raise_for_status() 
        os.makedirs(os.path.dirname(filename), exist_ok=True) 
        with open(filename, 'wb') as f: 
            for chunk in r.iter_content(chunk_size=8192): 
                f.write(chunk) 
 

def unzip_file(filename, extract_to): 
    with zipfile.ZipFile(filename, 'r') as zip_ref: 
        zip_ref.extractall(extract_to) 

# Download and extract 
download_file(file_url, local_filename) 
unzip_file(local_filename, extract_dir)

2. Load the model: Once the model is downloaded and extracted, you’ll load it using Gensim’s KeyedVectors class. This allows you to work with the embeddings directly: 

from gensim.models import KeyedVectors 

# Adjust model_path variable accordingly
model_path = "/content/gdrive/MyDrive/0_notebook_files/model/crawl-300d-2M.vec"
fasttext_model = KeyedVectors.load_word2vec_format(model_path, binary=False)

Step 2: Generate embeddings from text

With the FastText model loaded, the next task is to convert text into vectors. This process involves splitting the text into words, looking up the vector for each word in the FastText model, and then averaging the vectors to get a single embedding for the text.

Here’s a function that handles the conversion:

import numpy as np 
import re 

def text_to_vector(text): 
    """Convert text into a vector using the FastText model.""" 
    text = text.lower() 
    words = re.findall(r'\b\w+\b', text) 
    vectors = [fasttext_model[word] for word in words if word in fasttext_model.key_to_index] 

    if not vectors: 
        print(f"No embeddings found for text: {text}") 
        return np.zeros(fasttext_model.vector_size) 

    return np.mean(vectors, axis=0)

This function tokenizes the input text, retrieves the corresponding word vectors from the model, and computes the average to create a final embedding.

Step 3: Extract text and generate embeddings from documents

In real-world applications, your text might come from various types of documents, such as PDFs, Word files, or presentations. The following code shows how to extract text from different file formats and convert that text into embeddings:

import uuid 
import mimetypes 
import pandas as pd 
from pdfminer.high_level import extract_pages 
from pdfminer.layout import LTTextContainer 
from docx import Document 
from pptx import Presentation 

def generate_deterministic_uuid(name): 
    return uuid.uuid5(uuid.NAMESPACE_DNS, name) 

def generate_random_uuid(): 
    return uuid.uuid4() 

def get_file_type(file_path): 
    # Guess the MIME type based on the file extension 
    mime_type, _ = mimetypes.guess_type(file_path) 
    return mime_type 

def extract_text_from_excel(excel_path): 
    xls = pd.ExcelFile(excel_path) 
    text_list = [] 

for sheet_index, sheet_name in enumerate(xls.sheet_names): 
        df = xls.parse(sheet_name) 
        for row in df.iterrows(): 
            text_list.append((" ".join(map(str, row[1].values)), sheet_index + 1))  # +1 to make it 1 based index 

return text_list 

def extract_text_from_pdf(pdf_path): 
    return [(text_line.get_text().strip().replace('\xa0', ' '), page_num) 
            for page_num, page_layout in enumerate(extract_pages(pdf_path), start=1) 
            for element in page_layout if isinstance(element, LTTextContainer) 
            for text_line in element if text_line.get_text().strip()] 

def extract_text_from_word(file_path): 
    doc = Document(file_path) 
    return [(para.text, (i == 0) + 1) for i, para in enumerate(doc.paragraphs) if para.text.strip()] 

def extract_text_from_txt(file_path): 
    with open(file_path, 'r') as file: 
        return [(line.strip(), 1) for line in file.readlines() if line.strip()] 

def extract_text_from_pptx(pptx_path): 
    prs = Presentation(pptx_path) 
    return [(shape.text.strip(), slide_num) for slide_num, slide in enumerate(prs.slides, start=1) 
            for shape in slide.shapes if hasattr(shape, "text") and shape.text.strip()] 

def extract_text_with_page_number_and_embeddings(file_path, embedding_function): 
    file_uuid = generate_deterministic_uuid(file_path) 
    file_type = get_file_type(file_path) 

    extractors = { 
        'text/plain': extract_text_from_txt, 
        'application/pdf': extract_text_from_pdf, 
        'application/vnd.openxmlformats-officedocument.wordprocessingml.document': extract_text_from_word, 
        'application/vnd.openxmlformats-officedocument.presentationml.presentation': extract_text_from_pptx, 
        'application/zip': lambda path: extract_text_from_pptx(path) if path.endswith('.pptx') else [], 
        'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet': extract_text_from_excel, 
        'application/vnd.ms-excel': extract_text_from_excel
    }

    text_list = extractors.get(file_type, lambda _: [])(file_path) 

    return [ 
      { 
          "uuid": file_uuid, 
          "paragraph_uuid": generate_random_uuid(), 
          "filename": file_path, 
          "text": text, 
          "page_num": page_num, 
          "embedding": embedding 
      } 
      for text, page_num in text_list 
      if (embedding := embedding_function(text)).any()  # Check if the embedding is not all zeros 
    ] 

# Replace the file path with the one you want to process 

file_path = "../../docs-manager/Cassandra-Best-Practices.pdf"
paragraphs_with_embeddings = extract_text_with_page_number_and_embeddings(file_path)

This code handles extracting text from different document types, generating embeddings for each text chunk, and associating them with unique IDs.

With FastText set up and embeddings generated, you’re now ready to store these vectors in OpenSearch and start performing similarity searches.

Performing similarity searches

To conduct similarity searches, we utilize the k-Nearest Neighbors (kNN) plugin within OpenSearch. This plugin allows us to efficiently search for the most similar embeddings stored in the system. Essentially, you’re querying OpenSearch to find the closest matches to a word or phrase based on your embeddings.

For example, if you’ve embedded product descriptions, using kNN search helps you locate products that are semantically similar to a given input. This capability can significantly enhance your application’s recommendation engine, categorization, or clustering.

This setup with Cassandra and OpenSearch is a powerful combination, but it’s important to remember that it requires managing two systems. As Cassandra evolves, the introduction of built-in vector support in Cassandra 5 simplifies this architecture. But for now, let’s focus on leveraging both systems to get the most out of similarity searches.

Example: Inserting metadata in Cassandra and embeddings in OpenSearch

In this example, we use Cassandra 4 to store metadata related to files and paragraphs, while OpenSearch handles the actual word embeddings. By storing the paragraph and file IDs in both systems, we can link the metadata in Cassandra with the embeddings in OpenSearch.

We first need to store metadata such as the file name, paragraph UUID, and other relevant details in Cassandra. This metadata will be crucial for linking the data between Cassandra, OpenSearch and the file itself in filesystem.

The following code demonstrates how to insert this metadata into Cassandra and embeddings in OpenSearch, make sure to run the previous script, so the “paragraphs_with_embeddings” variable will be populated:

from tqdm import tqdm 

# Function to insert data into both Cassandra and OpenSearch 
def insert_paragraph_data(session, os_client, paragraph, keyspace_name, index_name): 
    # Insert into Cassandra 
    cassandra_result = insert_with_retry( 
        session=session, 
        id=paragraph['uuid'], 
        paragraph_uuid=paragraph['paragraph_uuid'], 
        text=paragraph['text'], 
        filename=paragraph['filename'], 
        keyspace_name=keyspace_name, 
        max_retries=3, 
        retry_delay_seconds=1 
    ) 

    if not cassandra_result: 
        return False  # Stop further processing if Cassandra insertion fails 

    # Insert into OpenSearch 
    opensearch_result = insert_embedding_to_opensearch( 
        os_client=os_client, 
        index_name=index_name, 
        file_uuid=paragraph['uuid'], 
        paragraph_uuid=paragraph['paragraph_uuid'], 
        embedding=paragraph['embedding'] 
    ) 

    if opensearch_result is not None: 
        return False  # Return False if OpenSearch insertion fails 

    return True  # Return True on success for both 

# Process each paragraph with a progress bar 
print("Starting batch insertion of paragraphs.") 

for paragraph in tqdm(paragraphs_with_embeddings, desc="Inserting paragraphs"): 
    if not insert_paragraph_data( 
        session=session, 
        os_client=os_client, 
        paragraph=paragraph, 
        keyspace_name=keyspace_name, 
        index_name=index_name 
    ): 

        print(f"Insertion failed for UUID {paragraph['uuid']}: {paragraph['text'][:50]}...") 

print("Batch insertion completed.")

Performing similarity search

Now that we’ve stored both metadata in Cassandra and embeddings in OpenSearch, it’s time to perform a similarity search. This step involves searching OpenSearch for embeddings that closely match a given input and then retrieving the corresponding metadata from Cassandra.

The process is straightforward: we start by converting the input text into an embedding, then use the k-Nearest Neighbors (kNN) plugin in OpenSearch to find the most similar embeddings. Once we have the results, we fetch the related metadata from Cassandra, such as the original text and file name.

Here’s how it works:

  1. Convert text to embedding: Start by converting your input text into an embedding vector using the FastText model. This vector will serve as the query for our similarity search.
  2. Search OpenSearch for similar embeddings: Using the KNN search capability in OpenSearch, we find the top k most similar embeddings. Each result includes the corresponding file and paragraph UUIDs, which help us link the results back to Cassandra.
  3. Fetch metadata from Cassandra: With the UUIDs retrieved from OpenSearch, we query Cassandra to get the metadata, such as the original text and file name, associated with each embedding.

The following code demonstrates this process:

import uuid 
from IPython.display import display, HTML 

def find_similar_embeddings_opensearch(os_client, index_name, input_embedding, top_k=5): 
    """Search for similar embeddings in OpenSearch and return the associated UUIDs.""" 
    query = { 
        "size": top_k, 
        "query": { 
            "knn": { 
                "embedding": { 
                    "vector": input_embedding.tolist(), 
                    "k": top_k 
                } 
            } 
        } 
    }

        response = os_client.search(index=index_name, body=query) 

    similar_uuids = [] 
    for hit in response['hits']['hits']: 
        file_uuid = hit['_source']['file_uuid'] 
        paragraph_uuid = hit['_source']['paragraph_uuid'] 
        similar_uuids.append((file_uuid, paragraph_uuid))  

    return similar_uuids 

def fetch_metadata_from_cassandra(session, file_uuid, paragraph_uuid, keyspace_name): 
    """Fetch the metadata (text and filename) from Cassandra based on UUIDs.""" 
    file_uuid = uuid.UUID(file_uuid) 
    paragraph_uuid = uuid.UUID(paragraph_uuid) 

    query = f""" 
    SELECT text, filename 
    FROM {keyspace_name}.file_metadata 
    WHERE id = ? AND paragraph_uuid = ?; 
    """ 
    prepared = session.prepare(query) 
    bound = prepared.bind((file_uuid, paragraph_uuid)) 
    rows = session.execute(bound)    

    for row in rows: 
        return row.filename, row.text 
    return None, None 

# Input text to find similar embeddings 
input_text = "place" 

# Convert input text to embedding 
input_embedding = text_to_vector(input_text) 

# Find similar embeddings in OpenSearch 
similar_uuids = find_similar_embeddings_opensearch(os_client, index_name=index_name, input_embedding=input_embedding, top_k=10) 

# Fetch and display metadata from Cassandra based on the UUIDs found in OpenSearch 
for file_uuid, paragraph_uuid in similar_uuids: 
    filename, text = fetch_metadata_from_cassandra(session, file_uuid, paragraph_uuid, 
keyspace_name)

    if filename and text: 
        html_content = f""" 
        <div style="margin-bottom: 10px;"> 
            <p><b>File UUID:</b> {file_uuid}</p> 
            <p><b>Paragraph UUID:</b> {paragraph_uuid}</p> 
            <p><b>Text:</b> {text}</p> 
            <p><b>File:</b> {filename}</p> 
        </div> 

        <hr/> 
        """ 

        display(HTML(html_content))

This code demonstrates how to find similar embeddings in OpenSearch and retrieve the corresponding metadata from Cassandra. By linking the two systems via the UUIDs, you can build powerful search and recommendation systems that combine metadata storage with advanced embedding-based searches.

Conclusion and next steps: A powerful combination of Cassandra 4 and OpenSearch

By leveraging the strengths of Cassandra 4 and OpenSearch, you can build a system that handles both metadata storage and similarity search. Cassandra efficiently stores your file and paragraph metadata, while OpenSearch takes care of embedding-based searches using the k-Nearest Neighbors algorithm. Together, these two technologies enable powerful, large-scale applications for text search, recommendation engines, and more.

Coming up in Part 2, we’ll explore how Cassandra 5 simplifies this architecture with built-in vector support and native similarity search capabilities.

Ready to try vector search with Cassandra and OpenSearch? Spin up your first cluster for free on the Instaclustr Managed Platform and explore the incredible power of vector search.

The post Introduction to similarity search with word embeddings: Part 1–Apache Cassandra® 4.0 and OpenSearch® appeared first on Instaclustr.

Why TRACTIAN Migrated from MongoDB to ScyllaDB for Real-Time ML

TRACTIAN’s ML model workloads increased over 2X in a year. Here’s why they changed databases and their lessons learned What happens when you hit a database scaling wall? Since TRACTIAN, an AI-driven industrial monitoring company, is all about preventing problems, they didn’t want to wait and see. After the company’s ML workloads doubled in a year, their industrial IoT platform was experiencing unsolvable performance degradation. With more rapid growth on the horizon, their engineering leaders decided to rethink their distributed data system before they hit MongoDB’s breaking point. JP Voltani, TRACTIAN’s Director of Engineering, recently shared the team’s experiences at ScyllaDB Summit. If we gave out Academy Awards for production, this one would have been the clear winner (all credit to the TRACTIAN team). So, be sure to watch this quick look at some impressive scaling work. Enjoy engineering case studies like this? Choose your own adventure through 60+ tech talks at Monster Scale Summit (free + virtual). You can learn from experts like Martin Kleppmann, Kelsey Hightower and Gwen Shapira, plus engineers from Discord, Disney+, Slack, Atlassian, Uber, Canva, Medium, Cloudflare, and more. Get a free conference pass Key Takeaways A few key takeaways: TRACTIAN was reaching a critical inflection point when their sensor network grew more than 2x in a single year. MongoDB struggled, even after the team’s valiant optimization and scaling attempts. The constant stream of time-series sensor data (vibration, temperature, energy consumption) caused performance degradation that could compromise their latency targets. The team wanted a database architecture specifically designed for high-throughput, time-partitioned data workloads, which led them to ScyllaDB. They benchmarked ScyllaDB vs Cassandra, Postgres, and MongoDB. The results showed a 10x performance improvement with ScyllaDB, and they appreciated its operational simplicity compared to Cassandra. The TRACTIAN team moved their most performance-critical workloads to ScyllaDB while maintaining MongoDB for other use cases, exemplifying their “right tool for the job” philosophy. They experienced a 10x improvement in throughput and latency with ScyllaDB. TRACTIAN applied a four-phase migration process (dual writes → historical backfill → read switching → final validation). This phased approach maintained 99.95% availability while transitioning critical industrial IoT data pipelines. The team mapped their IoT workload to ScyllaDB by partitioning data by sensor ID and clustering by timestamp. This data modeling change improved query performance for time-window searches and eliminated the hotspot issues that had plagued their MongoDB implementation. Here’s a lightly edited transcript… Intro Hello, everyone. My name is JP, and I’m the Director of Engineering at TRACTIAN. Today, I’m going to talk about our experience with real time machine learning using ScyllaDB. I will start talking about what TRACTIAN is and what we do, what our infrastructure looks like, why we migrated away from MongoDB for some workloads, our ScyllaDB migration process, and what is next for us. At TRACTIAN, we build solutions for industrial maintenance. We want to empower the maintenance teams around the globe with the best in class hybrid and AI assisted software. We have three products: The Smart Trac is a vibration and temperature sensor that is able to detect more than 70 types of failures in rotating machines. The TracOS is a system with everything needed to manage the operations of maintenance teams on the plant floor, enabling mobile and offline operations. The Energy Trac is a sensor that is able to monitor energy consumption, efficiency and electrical quality. Together, these products form a very concise solution that works seamlessly with one another – bringing a very Apple-like experience to industrial maintenance. We have already raised over $100M through VC funding, establishing a global footprint with customers across the Americas. We have three different headquarters: one in Brazil, one in Mexico, one in the USA. We have employees worldwide. The TRACTIAN Tech Stack Let’s talk about our tech stack. We have a very straightforward approach to adopting new technologies: If it helps solve a real problem, we embrace it. For this reason, our tech stack is very modern and extensive. We use more than 80 databases and 6 different languages for our services. That allows us to leverage the strengths of each technology. We have a microservices architecture with more than 30 services, ranging from APIs, consumers, producers and batch processes. They all handle more than 1500 events per second from different sources. And they do so with an average latency lower than 200 milliseconds and with 99.95% availability. Here’s what our infrastructure looked like before ScyllaDB. The sensor sent data to our APIs, and the APIs put the sensor data into Kafka topics. We had different services that would consume these topics to process the data– saving into MongoDB, into different collections. After that, we sent triggers to the AI pipeline to process the data. We start with a binary blob from the sensor and the processing services expand the data to different tables. Some use it for client visualizations, others as vectors for AI (training and inference). Why They Evolved As the company grew, the number of samples arriving to the system also grew. We saw the workloads increase over 2x in a single year, and the database needed to deal with that increase. Unfortunately, even after upscale operations and optimizations, that was not the case with MongoDB. Performance degradation made us look for alternative solutions for our warehouse and AI workloads. Why ScyllaDB Why ScyllaDB? At the time, we already tested Cassandra. The results were promising, but some database operations, like upscaling, had some aspects that were not attractive to us. MongoDB was not handling the IoT workload very well, and we wanted something that was easier to scale. ScyllaDB showed itself to be a light at the end of the tunnel. We were searching for something really specific, and luckily ScyllaDB had a data model that fit our problem very well. Also, ScyllaDB’s database operations were way better than Cassandra’s. This is just one example of how ScyllaDB’s data model works in favor of our workloads. In this case, we have some binary data that we want to start partitioning by sensor ID and ordering by the timestamp. ScyllaDB will make this query for a specific ID in a time window very fast. We had a plan on our hands. First, we created a new DSL. What would the tables on ScyllaDB look like? How would MongoDB data map to the new tables? After that, we did a bunch of theoretical benchmarks, which is basically testing with synthetic data. This is an easy and fast way to validate an idea. Then we did it all over again, but with real data. Sometimes synthetic tests fail to map some nuances of real data and miss things like partitions and hot spots. Other times, they fail to create a good mapping, and this only becomes visible when you test with real data. So, it’s important to not skip this step. Next, we went into the weeds and refactored all the existing application code to use the new database. It’s important to have very, very clear success criteria. What are you trying to achieve with this migration? We had a very clear number of devices in mind that the new infrastructure should be able to handle. The test results came in favor of ScyllaDB. In some workloads, we saw an increase of 10x in throughput and latency.   Migration Strategy Next, let’s talk about the migration game plan. We did everything live and without downtime. Initially, all the data was being written to MongoDB. After that, we started to write to both databases. This was the first checkpoint of the migration. At this step, we checked to see if both databases agreed if the data was correct and if the initial performance test agreed with the benchmark ones. After that, we started our migration script that would backfill ScyllaDB with the historical data from MongoDB and check that no data was missing. Then, we switched the reads to occur on ScyllaDB, while continuing to write on MongoDB as a backup if any problems occurred. This is how we did our online no downtime migration. The results speak for themselves. Results We have a great write read latency after migration and ScyllaDB has scaled very well with our increasing workload. Our infrastructure now has ScyllaDB as one of its backbones, and we still use MongoDB for other types of workloads – and also a bunch of other databases for other challenges. Read more about TRACTIAN’s comparison of ScyllaDB vs MongoDB and PostgreSQL in ScyllaDB vs MongoDB vs PostgreSQL

Cutting Over from Apache Cassandra® to Astra DB

Learn about the final steps of migrating from Cassandra to Astra DB

IBM acquires DataStax: What that means for customers–and why Instaclustr is a smart alternative

IBM’s recent acquisition of DataStax has certainly made waves in the tech industry. With IBM’s expanding influence in data solutions and DataStax’s reputation for advancing Apache Cassandra® technology, this acquisition could signal a shift in the database management landscape.

For businesses currently using DataStax, this news might have sparked questions about what the future holds. How does this acquisition impact your systems, your data, and, most importantly, your goals?

While the acquisition proposes prospects in integrating IBM’s cloud capabilities with high-performance NoSQL solutions, there’s uncertainty too. Transition periods for acquisitions often involve changes in product development priorities, pricing structures, and support strategies.

However, one thing is certain: customers want reliable, scalable, and transparent solutions. If you’re re-evaluating your options amid these changes, here’s why NetApp Instaclustr offers an excellent path forward.

Decoding the IBM-DataStax link-up

DataStax is a provider of enterprise solutions for Apache Cassandra, a powerful NoSQL database trusted for its ability to handle massive amounts of distributed data. IBM’s acquisition reflects its growing commitment to strengthening data management and expanding its footprint in the open source ecosystem.

While the acquisition promises an infusion of IBM’s resources and reach, IBM’s strategy often leans into long-term integration into its own cloud services and platforms. This could potentially reshape DataStax’s roadmap to align with IBM’s broader cloud-first objectives. Customers who don’t rely solely on IBM’s ecosystem—or want flexibility in their database management—might feel caught in a transitional limbo.

This is where Instaclustr comes into the picture as a strong, reliable alternative solution.

Why consider Instaclustr?

Instaclustr is purpose-built to empower businesses with a robust, open source data stack. For businesses relying on Cassandra or DataStax, Instaclustr delivers an alternative that’s stable, high-performing, and highly transparent.

Here’s why Instaclustr could be your best option moving forward:

1. 100% open source commitment

We’re firm believers in the power of open source technology. We offer pure Apache Cassandra, keeping it true to its roots without the proprietary lock-ins or hidden limitations. Unlike proprietary solutions, a commitment to pure open source ensures flexibility, freedom, and no vendor lock-in. You maintain full ownership and control.

2. Platform agnostic

One of the things that sets our solution apart is our platform-agnostic approach. Whether you’re running your workloads on AWS, Google Cloud, Azure, or on-premises environments, we make it seamless for you to deploy, manage, and scale Cassandra. This differentiates us from vendors tied deeply to specific clouds—like IBM.

3. Transparent pricing

Worried about the potential for a pricing overhaul under IBM’s leadership of DataStax? At Instaclustr, we pride ourselves on simplicity and transparency. What you see is what you get—predictable costs without hidden fees or confusing licensing rules. Our customer-first approach ensures that you remain in control of your budget.

4. Expert support and services

With Instaclustr, you’re not just getting access to technology—you’re also gaining access to a team of Cassandra experts who breathe open source. We’ve been managing and optimizing Cassandra clusters across the globe for years, with a proven commitment to providing best-in-class support.

Whether it’s data migration, scaling real-world workloads, or troubleshooting, we have you covered every step of the way. And our reliable SLA-backed managed Cassandra services mean businesses can focus less on infrastructure stress and more on innovation.

5. Seamless migrations

Concerned about the transition process? If you’re currently on DataStax and contemplating a move, our solution provides tools, guidance, and hands-on support to make the migration process smooth and efficient. Our experience in executing seamless migrations ensures minimal disruption to your operations.

Customer-centric focus

At the heart of everything we do is a commitment to your success. We understand that your data management strategy is critical to achieving your business goals, and we work hard to provide adaptable solutions.

Instaclustr comes to the table with over 10 years of experience in managing open source technologies including Cassandra, Apache Kafka®, PostgreSQL®, OpenSearch®, Valkey,® ClickHouse® and more, backed by over 400 million node hours and 18+ petabytes of data under management. Our customers trust and rely on us to manage the data that drives their critical business applications.

With a focus on fostering an open source future, our solutions aren’t tied to any single cloud, ecosystem, or bit of red tape. Simply put: your open source success is our mission.

Final thoughts: Why Instaclustr is the smart choice for this moment

IBM’s acquisition of DataStax might open new doors—but close many others. While the collaboration between IBM and DataStax might appeal to some enterprises, it’s important to weigh alternative solutions that offer reliability, flexibility, and freedom.

With Instaclustr, you get a partner that’s been empowering businesses with open source technologies for years, providing the transparency, support, and performance you need to thrive.

Ready to explore a stable, long-term alternative to DataStax? Check out Instaclustr for Apache Cassandra.

Contact us and learn more about Instaclustr for Apache Cassandra or request a demo of the Instaclustr platform today!

The post IBM acquires DataStax: What that means for customers–and why Instaclustr is a smart alternative appeared first on Instaclustr.

Build an RPG Using the Bluesky Jetstream, ScyllaDB, and Rust

Learn how to build a Rust application that tracks Bluesky user experiences and events. Let’s build a high-performance, scalable, and reliable application that can: Fetch and process public events from the Bluesky platform. Track user events and experiences. Implement a leveling system with experience points (XP). Display user levels and progress based on XP via a REST API. 1. Background Bluesky, which uses a mix of SQLite and ScyllaDB to store data, has a really cool feature called Firehose. Firehose is an aggregated stream of all the public data updates in the network. You can understand it by accessing FireSky.tv, an app that implements this stream and serves it directly in the browser. Implementing it from scratch requires deep knowledge of the AT Protocol. But a Bluesky engineer built Jetstream: a Firehose aggregator. With Firehose, you can just listen on a websocket and get a JSON stream of selected events. Here’s a sample of an event payload from Jetstream: Just listening to one of these streams without any issues is amazing. And it turns out that you can even select which type of event you want to listen to, like: app.bsky.graph.follow; app.bsky.feed.post; app.bsky.feed.like; app.bsky.feed.repost; and many more! But how can we turn it into an application? Well, it depends on your needs. The data is there; just consume it and do your magic! In my case, I like to transform data into games. 2. Gamifying Jetstream I’m not a game developer, but games follow an Event-Driven Development approach, right? Every time that you earn some points in something, you level up or learn a new skill. But to earn experience points, users need to take actions. And that’s what you do inside a Social Network: actions! Imagine that every time you: Post: Just Text? Earns 50 experience Have Media? Earns 60 experience Have Media with Alt Text? Earns 70 Like: Earns 10 experience Repost: Just Text? Earns 50 experience Have Media? Earns 60 experience Have Media with Alt Text? Earn 70 experience! There are plenty of other abstractions that can be done, but that’s the idea. The experience will be calculated using arithmetic progression, and should follow this simple rule: With that, we can now talk about the technologies used in this project. 3. Meet the Stack Bluesky uses ScyllaDB to serve all the AppView layer thinking about high availability and throughput, so we’re going to do the same! Also, I’ve been using Rust extensively (and always learning more!), so I decided to implement this project with Rust. Here’s the tech stack in a nutshell: Language: Rust Database: ScyllaDB Packages: HTTP Server: actix-web ORM: charybdis Jetstream Client: jetstream-oxide Bluesky Client: atrium-api My goal is to build something that, besides creating cool charts on Grafana, can also display something via REST API. First, let’s explore our data modeling strategy. 4. What about the Data Modeling? Initially, the idea was to just store the events and test how stressed the app/database would become. But, at this point, we can go a little bit further. ScyllaDB follows a Query Driven Development approach because it’s a Wide-Column NoSQL Database.  Let’s think about that. First, it’s an RPG focused on a timeline profile, so it will have heavy read operations on top of the “characters”: Since we only have one item in the WHERE CLAUSE, it means that our query is a Key Value  lookup. But wait…we also need to store the current experience of this user.  For that, I would use the Counter type to atomically store it using key-value pairs: It’s supposed to be simple, just like this! But it also has to be fast enough to serve 1M requests/s with ease. WARNING: Counter types can’t be clusterized or used as partition keys. Also, if you use them in a table, all fields besides the Partition Keys aggregates must be Counters! I also want to track all possible events happening in a user’s account and list them in our extension to show how that person can be a better Bluesky user. So, the queries would be around users and they must be clusterized in descending order: Alright, that should be enough for an MVP. Now let’s model each part showing some Rust and Charybdis ORM! 4.1 Modeling: Leveling State UDT Since we’re using ScyllaDB, we can use UDTs (User Defined Types). Keeping track of operations can be a pain. However, if you’re making this a pattern across all tables, UDTs can be useful when you don’t want to recreate the same fields every time. Now we can just use it around the other tables, whether it’s related to events or characters. 4.2 Modeling: Characters Table This will be the most accessed table inside our project via REST API. And the modeling (at this moment) is simple since we only want the user_handle and the leveling state (udt). Check it out: With the UDT, we can serve exactly the latest leveling state to build a UI later on. We can also add new fields since none of them will be part of the Partition Key. 4.3 Modeling: Characters Experience Table As mentioned earlier, we should store the experience so that it won’t become a race condition. Why? ScyllaDB is a highly available database that can replicate your data across multiple nodes. To avoid race conditions, we need to use the only Atomic Type available: the Counter type. With that, we will ensure that every write/read will be the latest there. Yes, it impacts performance. However, Counters are planned and optimized for this type of operation. The modeling would be: Now the last one, the events table! 4.4 Modeling: Events Table and MV This is the most “complicated” part, but it’s not that hard. As mentioned before, there are plenty of events around ATProto Bluesky, and I want to give all the possible events for each user. Displaying data in descending order is a must. ScyllaDB can provide this functionality if you include a Clustering Key in your table. Check it out: With the CLUSTERING ORDER BY (event_at DESC) I’m basically telling it that every time I fetch a chunk of data from this table, it ALWAYS will be the recent inserts. However, now we have a problem. Imagine that we want to list all events from a specific type. With this table, we’re not able to do that. Why? Because you can only use as WHERE clause items that you add inside your Partitions or Clustering Keys. However, we can get around this by creating a Materialized View! Materialized Views are tables created based on a parent table. Every time that this parent table receives a write, your view will also receive it.  You can then play with the partition/clusterization. Check it out: Now, we have different partitions for the same user, storing different types of events that we’re able to query directly. With that, our data modeling is finally DONE! Let’s jump into some business rules implementation. 5. Hands-on: Application Flow With the basics taken care of, let’s explain how everything works under the hood. 5.1 App: Jetstream Oxide At the Websocket layer, we’re using the Jetstream Oxide package to receive all the events in an elegantly structured way. The boilerplate can be like: For each type of event, we’ll receive a specific amount of experience and a different response in asynchronicity. With that, the goal was to make an OCP integration where we only need to add new events when possible:  That takes us to the last step, which sets up the event default behavior at the Trait. We have three types of event actions: Create, Update, and Delete. The Handler will take care of the whole Action/Communication with ScyllaDB through Charybdis ORM. In this example, you can check how the CreateEventHandler works: We can implement other types of events by only extending the trait to the new Dynamic Struct, and it will be working fine. 5.2 App: Actix Web For serving this data, there’s a simple implementation of an endpoint using Actix. Since the long-term goal is to build a browser extension, we need to serve an endpoint with the character/user information: 6. Conclusion This exploration of Bluesky Jetstream and its potential for gamification showcases the power of leveraging cutting-edge technologies like ScyllaDB and Rust to build scalable, high-performance applications. By focusing on event-driven development, we successfully demonstrated how to create an interactive system that transforms social media activities into measurable, gamified metrics. You can check out the project here.  

Moving Data between Apache Cassandra® and Astra DB

Learn about two methods to move data between Cassandra and Astra DB: the Cassandra Data Migrator and the DataStax Bulk Loader.

How JioCinema Uses ScyllaDB Bloom Filters for Personalization

Why they used ScyllaDB Bloom Filters instead of building their own using common solutions like Redis Bloom filters When you log in to your favorite streaming service, first impressions matter. The featured content should instantly lure you into binge-watching mode. If it’s full of shows and movies you’ve already seen, your brain quickly shifts into “Hmmm, is it time to cancel this service?” mode. At a technical level, ensuring fresh recommendations is something that every streaming platform faces. But the standard solutions weren’t a good fit for JioCinema, a prominent Indian streaming service known for its affordability and extensive content library – and currently experiencing explosive growth (e.g., with world-record-breaking 620M IPL viewers, peaking over 20M concurrent viewers). Instead of building their own Bloom filters or using common solutions like Redis Bloom filters, they took a different path: using ScyllaDB’s built-in Bloom filter support to check watch status in real time. JioCinema’s Charan Kamal (Back-End Developer) and Harshit Jain (Software Engineer) recently shared why they took this unconventional path, including the tradeoffs of the more obvious solutions and the logistics of implementing this with ScyllaDB. Watch their complete tech talk below, or read on to skim the highlights. Enjoy engineering case studies like this? Choose your own adventure through 60+ tech talks at Monster Scale Summit (free + virtual). You can learn from experts like Martin Kleppmann, Kelsey Hightower and Gwen Shapira, plus engineers from Discord, Disney+, Slack, Atlassian, Uber, Canva, Medium, Cloudflare, and more.  Get a free conference pass The Challenge: “Watch Discounting” for Fresh Recommendations JioCinema is a leading “over the top” (OTT) streaming platform. The service features top Indian and international entertainment, including live sports (from IPL cricket, to LaLiga European football, to NBA basketball), new movies, HBO originals, and more. Their massive content library spans 10 Indian languages. The JioCinema app uses customized content trays like “Because You Watched” to keep users engaged and help them discover new content. For example, after a user completes “Game of Thrones,” the platform might commonly recommend “House of the Dragon” – but if the user already watched it, it will suggest something else.     As Harshit Jain put it, “These personalized trays not only keep the customers engaged but also enhance content discoverability, fostering long-term engagement and reducing churn rates. However, personalization comes with its own challenges, particularly the issue of recommending content that the customers have already watched. To address this, we have implemented a solution and termed it ‘Watch Discounting.’” This service must cost-efficiently satisfy low-latency requirements at an impressive scale (e.g., 10M daily active users consuming hundreds of thousands of shows and films per day). Charan Kamal explained, “Keeping the sheer size of our customer base and catalog in mind, we had to use a data structure which was space-efficient as well as blazing fast. While we want to keep our recommendations fresh, we also want to avoid over-engineering and making the system overly complex. We could tolerate occasional false positives here. So this led us to Bloom filters – space-efficient probabilistic data structures designed for rapid membership lookup in a set.” The Problem with Custom and Redis Bloom Filters Okay, but which Bloom filters were the best fit here? The team first considered building a custom Bloom filter to store and serve content. Although this “fun exercise” would have provided complete control over the implementation, it presented significant scaling challenges. They didn’t trust that a simple map of Bloom filters would scale vertically to keep pace with JioCinema’s massive (and rapidly growing) user base. Horizontal scaling would have required either implementing sticky sessions (where specific pods would hold Bloom filters for particular users) or replicating data across every pod in the system. While this would have been an interesting engineering challenge, it just didn’t make sense from a business perspective. The next option they explored was Redis with Bloom filter capabilities. Their initial testing with an open-source Redis cluster revealed an interesting issue with Redis’ cluster topology. During high load, nodes would occasionally get hot and trigger failovers, promoting replicas to primary nodes. This created a cascade effect where entire nodes within the cluster became unusable while primary-replica promotions continued in an endless loop. Looking to avoid that risk, they explored Redis Enterprise. This approach showed significant performance improvements and indeed met their SLA requirements. However, there was a catch. JioCinema’s business requires millisecond-level latency across multiple regions.   According to Charan, “Even with Redis Enterprise, we faced a choice between an active-active deployment to maintain low latency or compromising the customer experience in certain regions. The latter was unacceptable for our use case. Additionally, Redis Enterprise imposed substantial charges for each operation and replication, making it challenging to justify the return on investment of this feature for our business. This led us to explore ScyllaDB.” Implementing Watch Discounting with ScyllaDB Charan continued, “ScyllaDB not only supported Bloom filters out of the box, but we also had prior experience using it for different personalization use cases. Knowing its exceptional speed with the ability to replicate data into multiple regions and serve customers from locations close to their origin, ScyllaDB seemed like a comprehensive solution. This allowed us to concentrate on developing what mattered most for our customers and enabled us to go to market fast.” As the following diagram shows, the Watch Discounting feature was powered by two ingestion pipelines. Batch processes compute users’ watch history within a specified time window, determining if a piece of content meets the completion criteria to be considered “watched.” If so, the system updates the ScyllaDB table with a time-to-live (TTL) attribute, ensuring content only becomes rediscoverable after a specified amount of time. Short videos (e.g., 30-60 second videos that drive high engagement) require a different treatment. Here, content must be marked as “viewed” immediately, so real-time event streaming is used to update the watch discounting repository. Why ScyllaDB Charan concluded, “As mentioned earlier, adopting ScyllaDB enabled us to prioritize developing new functionality over creating data structures. This approach allowed us to keep our nodes small and maintain a clear separation of concerns between Bloom filters and filtering watched content. The unmatched performance of ScyllaDB became evident, especially when dealing with high cardinality of partitions and small data sizes—precisely the characteristics of our dataset. TTLs made cleanups easy and permitted the discovery of watched content after a specified period. Moreover, reading from LOCAL_QUORUM ensured that we could access data from the closest region to the user, resulting in high throughput and lower latencies. This strategic combination of features in ScyllaDB significantly contributed to the efficacy and effectiveness of our system.”