By focusing solely on unrepaired data, we made ScyllaDB’s
incremental repair 10X faster
Maintaining data consistency
in large-scale distributed databases often comes at a high
performance cost. As clusters grow and data volume expands rapidly,
traditional repair methods often become a bottleneck. At ScyllaDB,
we needed a way to make consistency checks faster and more
efficient. In response, we implemented
Incremental Repair for ScyllaDB’s tablets data distribution.
It’s an optimization designed to minimize repair overhead by
focusing solely on unrepaired data. This blog explains what it is,
how we implemented it, and the performance gains it delivered.
What is Incremental Repair?
Before we talk about the incremental part,
let’s look at what repair actually involves in a
distributed system context. In a system like ScyllaDB, repair is an
essential maintenance operation. Even with the best hardware,
replicas can drift due to network hiccups, disk failures, or load.
Repair detects mismatches between replicas and fixes them to ensure
every node has the latest, correct version of the data. It’s a
safety net that guarantees data consistency across the cluster.
Incremental repair is a new feature in ScyllaDB (currently
available for tablets-based tables). The idea behind it is simple:
why worry about data that we have already repaired? Traditional
repair scans everything. Incremental repair targets only unrepaired
data. Technically, this is achieved by splitting SSTables into two
distinct sets: repaired and unrepaired. The repaired set is
consistent and synchronized, while the unrepaired set is
potentially inconsistent and requires validation.
We created two modes of incremental repair:
incremental and full. In incremental mode, only SSTables in the
unrepaired set are selected for the repair process. Once the repair
completes, those SSTables are marked as repaired and promoted into
the repaired set. This should be your default mode because it
significantly minimizes the IO and CPU required for repair. In full
mode, the incremental logic is still active, but the selection
criteria change. Instead of skipping the repaired set, it selects
all data (both repaired and unrepaired). Once the process is
finished, all participating SSTables are marked as repaired. Think
of this as a “trust but verify” mode. Use this when you want to
revalidate the entire data set from scratch while still using the
incremental infrastructure. Finally, there’s disabled mode, where
the incremental repair logic is turned off. In this case, the
repair behavior is exactly the same as in previous versions of
ScyllaDB – before the incremental repair feature was introduced. It
selects both repaired and unrepaired SSTables for the repair
process. After repair completes, the system does not mark SSTables
as repaired. This is useful for scenarios where you want to run a
repair without affecting the metadata that tracks the repair state.
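The three modes differ only in which SSTables are selected and whether they are marked afterwards. The sketch below models that in plain Python; the `SSTable` class, the `RepairMode` enum, and both function names are hypothetical illustrations, not ScyllaDB's internal API.

```python
from dataclasses import dataclass
from enum import Enum


class RepairMode(Enum):
    INCREMENTAL = "incremental"
    FULL = "full"
    DISABLED = "disabled"


@dataclass
class SSTable:
    name: str
    repaired: bool  # True if the SSTable already belongs to the repaired set


def select_for_repair(sstables, mode):
    """Return the SSTables a repair run would read in the given mode."""
    if mode is RepairMode.INCREMENTAL:
        # Only the unrepaired set is scanned.
        return [s for s in sstables if not s.repaired]
    # FULL and DISABLED both scan repaired and unrepaired SSTables.
    return list(sstables)


def marks_repaired_afterwards(mode):
    """Whether participating SSTables are marked repaired when the run completes."""
    return mode is not RepairMode.DISABLED


tables = [SSTable("a", repaired=True), SSTable("b", repaired=False)]
incremental_names = [s.name for s in select_for_repair(tables, RepairMode.INCREMENTAL)]
full_names = [s.name for s in select_for_repair(tables, RepairMode.FULL)]
```

Here `incremental_names` contains only the unrepaired SSTable, while `full_names` contains both.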
Incremental repair is integrated directly into the existing
workflow, with three options: nodetool lets you
use the standard nodetool cluster repair command with incremental
flags. ScyllaDB Manager also supports the same
flags for automated scheduling. A REST API makes
incremental repair available to teams building custom tools.
Making incremental repair work (internals)
To make incremental repair
work, we had to solve a classic distributed systems problem: state
consistency. We need to know exactly which data is repaired, and
that state must survive crashes. So, we track this using two
decoupled markers: repaired_at, a number stored directly in the
SSTable metadata on disk, and sstables_repaired_at, a value stored
in our system tables. The logic follows a two-phase commit model. Phase
one: prepare. First, we run the row-level repair. Once
that is finished, we update the repaired_at value in the SSTables.
At this point, the system still treats them as unrepaired because
they haven’t been activated yet. Phase two:
commit. After every node confirms the row-level repair and
the updated repaired_at value, we update the sstables_repaired_at
value in the system table. We define an SSTable as repaired if and
only if its repaired_at value is not zero and is less than or equal
to the sstables_repaired_at value in the system table. If a node
crashes between phases, the mismatch
between the file and the system table ensures that we don’t
accidentally skip data that wasn’t 100% verified.
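A minimal sketch of the two-phase marking described above, assuming the markers are integers that increase with each repair run (the text only says repaired_at is a number). The class and function names are hypothetical; only the `repaired_at` and `sstables_repaired_at` marker names come from the text.

```python
from dataclasses import dataclass


@dataclass
class SSTable:
    repaired_at: int = 0  # marker in the SSTable's on-disk metadata; 0 means never repaired


@dataclass
class SystemTable:
    sstables_repaired_at: int = 0  # marker kept in the system table


def is_repaired(sstable, system):
    """Repaired iff the SSTable marker is non-zero and <= the system-table marker."""
    return sstable.repaired_at != 0 and sstable.repaired_at <= system.sstables_repaired_at


def prepare(sstables, new_mark):
    """Phase one: after row-level repair, stamp the SSTables (not yet activated)."""
    for s in sstables:
        s.repaired_at = new_mark


def commit(system, new_mark):
    """Phase two: once every node has confirmed, advance the system-table marker."""
    system.sstables_repaired_at = new_mark


system = SystemTable()
batch = [SSTable(), SSTable()]

prepare(batch, new_mark=1)
after_prepare = [is_repaired(s, system) for s in batch]  # a crash here leaves them unrepaired

commit(system, new_mark=1)
after_commit = [is_repaired(s, system) for s in batch]
```

Between the two phases every SSTable in the batch still evaluates as unrepaired, which is the property that makes a crash at that point safe.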
Under normal operations, you don’t need to run full repairs
regularly. Still, it’s needed occasionally. If you experience
significant loss of SSTables (perhaps due to a disk failure), then
a full repair is required to reconstruct missing data across the
cluster. In practice, we suggest one full repair after a long
series of incremental runs. This gives you an extra layer of
security, even if it is not strictly required. This brings us to a
critical challenge: compaction. If we let compaction mix repaired
and unrepaired data, the repaired status would be lost, and we’d
need to re-repair everything. To solve this, we introduce
compaction barriers. We effectively split the tablets into two
independent worlds. The first is the unrepaired set, where all
new writes and memtable flushes go; compaction only merges
unrepaired SSTables with other unrepaired ones. The second is the
repaired set, where SSTables are compacted together to
maintain and optimize the read path. The rule is that a compaction
strategy can never merge an unrepaired SSTable into the repaired
set. The only bridge between these sets is repair. This prevents
potentially inconsistent data from polluting the repaired set.
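The barrier rule can be sketched as a grouping step that runs before any compaction strategy picks candidates. This is an illustration only: SSTables are modeled as hypothetical `(name, repaired)` tuples, and the function names are not ScyllaDB's.

```python
def compaction_groups(sstables):
    """Split (name, repaired) tuples into the two sets compaction may work within.

    Compaction may merge SSTables inside one group, never across groups.
    """
    groups = {"repaired": [], "unrepaired": []}
    for name, repaired in sstables:
        groups["repaired" if repaired else "unrepaired"].append(name)
    return groups


def can_compact_together(a, b):
    """Two SSTables may be merged only if they share the same repair status."""
    return a[1] == b[1]


sstables = [("s1", True), ("s2", False), ("s3", False), ("s4", True)]
groups = compaction_groups(sstables)
```

Only a completed repair moves an SSTable from the `unrepaired` group to the `repaired` one; nothing in the compaction path does.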
With this new design, compaction now has a dependency on repair. As
we run repairs, the repaired sets grow. But because incremental
repair is so much lighter than traditional methods, we encourage
you to run it much more frequently. We are currently working on an
automatic repair feature that will trigger those runs at the very
moment the unrepaired set grows too large. That should keep your
unrepaired window as small as possible. The efficiency of
incremental repair depends on your workload: Update
heavy: If you have lots of overwrites or deletes, new data
will invalidate older repaired SSTables. In extreme cases, there
could be so much data to repair that it looks a lot like full
repair. Append heavy: This is a perfect use case,
like IoT or logging. Since new data doesn’t invalidate old data,
the repaired set stays consistent and untouched. This should
provide nice performance gains. Even in update-heavy cases, you don’t
lose anything by choosing incremental repair. In the worst case, it
performs the same amount of work as a full repair would. In almost
all real-world scenarios, you could gain significant improvements
without any trade-off with respect to consistency.
Performance Improvements
To understand how this approach translates to
performance improvements, let’s model the improvement ratio. Say
that n is the size of your new unrepaired data and E is the size of
your existing repaired data. A full repair works on E plus n.
Incremental repair works only on n. So, the improvement ratio
equals n / (E + n). If
you ingest 100 gigabytes a day on a 10-terabyte node, you are
repairing only about 1% of the data instead of 100%. This is an
order-of-magnitude shift in overhead. Our testing confirms that
theory. We ran multiple insert and repair cycles. In
the first round, nearly all the data was new, so the repair time
was almost the same as with full repair. In the second round, with
a 50/50 split, the time dropped by half. In the third round, as the
repaired set became dominant, incremental repair took only 35% of
the time that a full repair would have taken. To
wrap up, incremental repair for tablets is faster, lighter, and
more efficient. It is a foundational step toward our goal of a
fully autonomous database that handles its own maintenance. By
adopting this feature, you reduce operational burden and ensure
your cluster remains consistent without repair storms.
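As a quick check of the improvement ratio modeled in the Performance Improvements section, the snippet below evaluates n / (E + n) for the 100-gigabyte, 10-terabyte example. It assumes 10 TB is taken as 10,000 GB; the function name is illustrative.

```python
def repaired_fraction(new_gb, existing_gb):
    """Fraction of the data an incremental repair touches: n / (E + n)."""
    return new_gb / (existing_gb + new_gb)


ratio = repaired_fraction(new_gb=100, existing_gb=10_000)  # 10 TB taken as 10,000 GB
print(f"{ratio:.2%}")  # prints 0.99%
```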
In a
previous post, we described how Netflix uses Apache Druid to
ingest millions of events per second and query trillions of rows,
providing the real-time insights needed to ensure a high-quality
experience for our members. Since that post, our scale has grown
considerably.
With our database holding over 10 trillion rows and regularly
ingesting up to 15 million events per second, the value of our
real-time data is undeniable. But this massive scale introduced a
new challenge: queries. The live show monitoring, dashboards,
automated alerting, canary analysis, and A/B test monitoring that
are built on top of Druid became so heavily relied upon that the
repetitive query load started to become a scaling concern
in itself.
This post describes an experimental caching layer we built to
address this problem, and the trade-offs we chose
to accept.
The Problem
Our internal dashboards are heavily used for real-time
monitoring, especially during high-profile live shows or global
launches. A typical dashboard has 10+ charts, each triggering one
or more Druid queries; one popular dashboard with 26 charts and
stats generates 64 queries per load. When dozens of engineers view
the same dashboards and metrics for the same event, the query
volume quickly becomes unmanageable.
Take the popular dashboard above: 64 queries per load,
refreshing every 10 seconds, viewed by 30 people. That’s 192
queries per second from one dashboard, mostly for nearly identical
data. We still need Druid capacity for automated alerting, canary
analysis, and ad-hoc queries. And because these dashboards request
a rolling last-few-hours window, each refresh changes slightly as
the time range advances.
Druid’s built-in caches, both the full-result cache and the
per-segment cache, are effective. But neither is designed to handle
the continuous, overlapping time-window shifts inherent to
rolling-window dashboards. The full-result cache misses for
two reasons.
If the time window shifts even slightly, the query is
different, so it’s a cache miss.
Druid deliberately refuses to cache results that involve
realtime segments (those still being indexed), because it values
deterministic, stable cache results and query correctness over a
higher cache hit rate.
The per-segment cache does help avoid redundant scans on
historical nodes, but we still need to collect those cached segment
results from each data node and merge them in the brokers with data
from the realtime nodes for every query.
During major shows, rolling-window dashboards can generate a
flood of near-duplicate queries that Druid’s caches mostly miss,
creating heavy redundant load. At our scale, solving this by simply
adding more hardware is prohibitively expensive.
We needed a smarter approach.
The Insight
When a dashboard requests the last 3 hours of data, the vast
majority of that data, everything except the most recent few
minutes, is already settled. The data from 2 hours ago
won’t change.
What if we could remember the older portions of the result and
only ask Druid for the part that’s actually new?
This is the core idea behind a new caching service that
understands the structure of Druid queries and serves
previously-seen results from cache while fetching only the freshest
portion from Druid.
A Deliberate Trade-Off
Before diving into the implementation, it’s worth being explicit
about the trade-off we’re making. Caching query results introduces
some staleness: specifically, up to 5 seconds for the newest data.
This is acceptable for most of our operational dashboards, which
refresh every 10 to 30 seconds. In practice, many of our queries
already set an end time of now-1m or now-5s to avoid the “flappy
tail” that can occur with currently-arriving data.
Since our end-to-end data pipeline latency is typically under 5
seconds at P90, a 5-second cache TTL on the freshest data
introduces negligible additional staleness on top of what’s already
inherent in the system. We decided it was better to accept this
small amount of staleness in exchange for significantly lower query
load on Druid. But a 5s cache on its own is not
very useful.
Exponential TTLs
Not all data points are equally trustworthy. In real-time
analytics, there’s a well-known late-arriving data problem. Events
can arrive out of order or be delayed in the ingestion pipeline. A
data point from 30 seconds ago might still change as late-arriving
events trickle in. A data point from 30 minutes ago is almost
certainly final.
We use this observation to set cache TTLs that increase
exponentially with the age of the data. Data less than 2 minutes
old gets a minimum TTL of 5 seconds. After that, the TTL doubles
for each additional minute of age: 10 seconds at 2 minutes old, 20
seconds at 3 minutes, 40 seconds at 4 minutes, and so on, up to a
maximum TTL of 1 hour.
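The schedule above can be written as a small function. This is a sketch that assumes the doubling happens in whole-minute steps, which matches the listed values (10 s at 2 minutes, 20 s at 3, 40 s at 4); the function and constant names are illustrative.

```python
MIN_TTL_S = 5        # TTL for data less than 2 minutes old
MAX_TTL_S = 60 * 60  # cap of 1 hour


def ttl_for_age(age_seconds):
    """Return the cache TTL in seconds for a bucket whose data is `age_seconds` old."""
    age_minutes = int(age_seconds // 60)
    doublings = max(0, age_minutes - 1)  # 0 doublings below 2 minutes of age
    doublings = min(doublings, 10)       # 5 * 2**10 = 5120 s already exceeds the cap
    return min(MIN_TTL_S * 2 ** doublings, MAX_TTL_S)


schedule = {minutes: ttl_for_age(minutes * 60) for minutes in (0, 1, 2, 3, 4, 10, 11)}
```

Under these assumptions the TTL reaches the 1-hour cap once data is 11 minutes old, so almost all of a 3-hour window sits in long-lived buckets.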
The effect is that fresh data cycles through the cache rapidly,
so any corrections from late-arriving events in the most recent
couple of minutes are picked up quickly. Older data lingers much
longer, because our confidence in its accuracy grows
with time.
For a 3-hour rolling window, the exponential TTL ensures the
vast majority of the query is served from the cache, leaving Druid
to only scan the most recent, unsettled data.
Bucketing
If we were to use a single-level cache key for the query and
interval, similar to Druid’s existing result-level cache, we
wouldn’t be able to extract only the relevant time range from
cached results. A shifted window means a different key, which means
a cache miss.
Instead, we use a map-of-maps. The top-level key is the query
hash without the time interval; the inner keys are timestamps
bucketed to the query granularity (or 1 minute, whichever is
larger) and encoded as big-endian bytes so lexicographic order
matches time. This enables efficient range scans, such as fetching all
cached buckets between times A and B for a query hash. A 3-hour
query at 1-minute granularity becomes 180 independent cached
buckets, each with its own TTL; when the window shifts (e.g., 30
seconds later), we reuse most buckets from cache and only query
Druid for the new data.
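To illustrate the map-of-maps layout, here is an in-memory sketch. It assumes timestamps are non-negative epoch seconds and omits per-bucket TTLs; `BucketCache`, `bucket_key`, and their parameters are hypothetical names, not the service's actual code.

```python
import struct
from bisect import bisect_left


def bucket_key(ts_seconds, granularity_s=60):
    """Floor a timestamp to its bucket and encode it as 8 big-endian bytes."""
    width = max(granularity_s, 60)  # query granularity or 1 minute, whichever is larger
    return struct.pack(">Q", ts_seconds - ts_seconds % width)


class BucketCache:
    def __init__(self):
        self._store = {}  # query hash -> {bucket key -> cached value}

    def put(self, query_hash, ts_seconds, value):
        self._store.setdefault(query_hash, {})[bucket_key(ts_seconds)] = value

    def range_scan(self, query_hash, start_ts, end_ts):
        """Return (key, value) pairs for buckets in [start_ts, end_ts), in time order."""
        inner = self._store.get(query_hash, {})
        keys = sorted(inner)  # byte order equals time order for big-endian keys
        lo = bisect_left(keys, bucket_key(start_ts))
        hi = bisect_left(keys, bucket_key(end_ts))
        return [(k, inner[k]) for k in keys[lo:hi]]


cache = BucketCache()
for ts in (0, 60, 120, 180):
    cache.put("query-hash", ts, ts)
window = [value for _, value in cache.range_scan("query-hash", 60, 180)]
```

Big-endian encoding is what makes the sorted byte keys line up with time; a little-endian encoding would order 300 before 240.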
How It Works
Today, the cache runs as an external service integrated
transparently by intercepting requests at the Druid Router and
redirecting them to the cache. If the cache fully satisfies a
request, it returns the result; otherwise it shrinks the time
interval to the uncached portion and calls back into the Router,
bypassing the redirect to query Druid normally. Non-cached requests
(e.g., metadata queries or queries without time group-bys) pass
straight through to Druid unchanged.
This intercepting proxy design allows us to enable or disable
caching without any client changes and is a key to its adoption. We
see this setup as temporary while we work out a way to better
integrate this capability into Druid more natively.
When a cacheable query arrives (one that groups by time, such as
timeseries or groupBy), the cache performs the following steps.
Parsing and Hashing. We parse each incoming
query to extract the time interval, granularity, and structure,
then compute a SHA-256 hash of the query with the time interval and
parts of the context removed. That hash is the cache key: it
encodes what is being asked (datasource, filters,
aggregations, granularity) but not when, so the same
logical query over different overlapping time windows maps to the
same cache entry. Some context properties can alter
the response structure or contents, so these are included in the
cache key.
Cache Lookup. Using the cache key, we fetch
cached points within the requested range, but only if they’re
contiguous from the start. Because bucket TTLs can expire unevenly,
gaps can appear; when we hit a gap, we stop and fetch all newer
data from Druid. This guarantees a complete, unbroken result set
while sending at most one Druid query, rather than “filling gaps”
with multiple small, fragmented queries that would increase
Druid load.
Fetching the Missing Tail. On a partial cache
hit (e.g., 2h 50m of a 3h window), we rebuild the query with a
narrowed interval for the missing 10 minutes and send only that to
Druid. Since Druid then scans just the recent segments for a small
time range, the query is usually faster and cheaper than the
original.
Combining. The cached data and fresh data are
concatenated, sorted by timestamp, and returned to the client. From
the client’s perspective, the response looks identical to what
Druid would have returned: same JSON format, same fields.
Asynchronous Caching. The fresh data from Druid
is parsed into individual time-granularity buckets and written back
to the cache asynchronously, so we don’t add latency to the
response path.
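The steps above can be sketched end to end in a few functions. This is a simplified illustration, not the service's code: the cache is a plain dict keyed by bucket timestamp, `fetch_from_druid` is a hypothetical callback, the asynchronous write-back is omitted, and dropping `queryId` from the context is an assumed example of a property excluded from the key.

```python
import copy
import hashlib
import json


def cache_key(query):
    """SHA-256 of the query with its time interval and per-request context removed."""
    stripped = copy.deepcopy(query)  # never mutate the caller's query
    stripped.pop("intervals", None)
    stripped.get("context", {}).pop("queryId", None)  # assumed volatile property
    canonical = json.dumps(stripped, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def contiguous_prefix(cached, start, end, step):
    """Collect cached buckets from `start` until the first gap."""
    hits = []
    ts = start
    while ts < end and ts in cached:
        hits.append((ts, cached[ts]))
        ts += step
    return hits, ts  # ts is where the single Druid query must begin


def answer(start, end, step, cached, fetch_from_druid):
    """Serve the cached prefix and fetch only the missing tail from Druid."""
    hits, missing_from = contiguous_prefix(cached, start, end, step)
    fresh = fetch_from_druid(missing_from, end) if missing_from < end else []
    return sorted(hits + fresh, key=lambda row: row[0])


cached = {0: "a", 60: "b", 180: "d"}  # bucket 120 has expired, leaving a gap
calls = []


def fake_druid(start, end):
    calls.append((start, end))
    return [(ts, "fresh") for ts in range(start, end, 60)]


result = answer(0, 240, 60, cached, fake_druid)
```

In this example the bucket at 180 is still cached but sits after the gap, so it is refetched along with 120 in one Druid call covering 120 to 240.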
Negative Caching
Some metrics are sparse. Certain time buckets may genuinely have
no data. Without special handling, the cache would treat these
empty buckets as gaps and re-query Druid for them
every time.
We handle this by caching empty sentinel values for time buckets
where Druid returned no data. Our gap-detection logic recognizes
these empty entries as valid cached data rather than missing data,
preventing needless re-queries for naturally
sparse metrics.
However, we’re careful not to negative-cache trailing empty
buckets. If a query returns data up to minute 45 and nothing after,
we only cache empty entries for gaps between data points,
not after the last one. This avoids incorrectly caching “no data”
for time periods where events simply haven’t arrived yet, which
would exacerbate the chart delays of late-arriving data.
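A minimal sketch of that rule, assuming Druid's response has already been parsed into a dict from bucket timestamp to value. The `EMPTY` sentinel and the function name are hypothetical. Buckets before the first data point are also left uncached here, since the text only describes gaps between data points.

```python
EMPTY = object()  # sentinel meaning "Druid returned no data for this bucket"


def buckets_to_cache(rows, step):
    """Return the buckets to write back, filling gaps between data points with EMPTY.

    Nothing after the last data point is cached, so late-arriving events
    are not masked by a cached "no data" entry.
    """
    if not rows:
        return {}
    first, last = min(rows), max(rows)
    return {ts: rows.get(ts, EMPTY) for ts in range(first, last + step, step)}


# Data at minutes 1 and 3; minute 2 is a genuine gap; minute 4 onward is empty.
to_cache = buckets_to_cache({60: 1.0, 180: 3.0}, step=60)
```

The resulting dict holds buckets 60, 120, and 180, with 120 stored as `EMPTY`; bucket 240 is not written at all.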
The Storage Layer
For the backing store, we use Netflix’s
Key-Value Data Abstraction Layer (KVDAL), backed by Cassandra.
KVDAL provides a two-level map abstraction, a natural fit for our
needs. The outer key is the query hash, and the inner keys are
timestamps. Crucially, KVDAL supports independent TTLs on each
inner key-value pair, eliminating the need for us to manage cache
eviction manually.
This two-level structure gives us efficient range queries over
the inner keys, which is exactly what we need for partial cache
lookups: “give me all cached buckets between time A and time B for
query hash X.”
Results
The biggest win is during high-volume events (e.g., live shows):
when many users view the same dashboards, the cache serves most
identical queries as full hits, so the query rate reaching Druid is
essentially the same with 1 viewer or 100. The scaling bottleneck
moves from Druid’s query capacity to the much cheaper-to-scale
cache, and with ~5.5 ms P90 cache responses, dashboards load faster
for everyone.
On a typical day, 82% of real user queries get at least a
partial cache hit, and 84% of result data is served from cache. As
a result, the queries that reach Druid scan much narrower time
ranges, touching fewer segments and processing less data, freeing
Druid to focus on aggregating the newest data instead of repeatedly
re-querying historical segments.
An experiment validated this, showing about a 33% drop in
queries to Druid and a 66% improvement in overall P90 query times.
It also cut result bytes and segments queried, and in some cases,
enabling the cache reduced result bytes by more than 14x. Caveat:
the size of these gains depends heavily on how similar and
repetitive the query workload is.
Looking Ahead
This caching layer is still experimental, but results are
promising and we’re exploring next steps. We’ve added partial
support for templated SQL so dashboard tools can benefit without
writing native Druid queries.
Longer term, we’d like interval-aware caching to be built into
Druid: an external proxy adds infrastructure to manage, extra
network hops, and workarounds (like SQL templating) to extract
intervals. Implemented inside Druid, it could be more efficient,
with direct access to the query planner and segment metadata, and
benefit the broader community without custom infrastructure. We’d
likely ship it as an opt-in, configurable, result-level cache in
the Brokers, with metrics to tune TTLs and measure effectiveness.
Please leave a comment if you have a use-case that could benefit
from this feature.
More broadly, this strategy, splitting time-series results into
independently cached, granularity-aligned buckets with age-based
exponential TTLs, isn’t Druid-specific and could apply to any
time-series database with frequent overlapping-window queries.
Summary
As more Netflix teams rely on real-time analytics, query volume
grows too. Dashboards are essential at our scale, but their
popularity can become a scaling bottleneck. By inserting an
intelligent cache between dashboards and Druid, one that
understands query structure, breaks results into
granularity-aligned buckets, and trades a small amount of staleness
for much lower Druid load, we’ve increased query capacity without
scaling infrastructure proportionally, and hope to deliver these
benefits to the Druid community soon as a built-in
Druid feature.
Sometimes the best way to handle a flood of queries is to stop
answering the same question twice.
Today’s filmmakers capture more footage than ever to maximize
their creative options, often generating hundreds, if not
thousands, of hours of raw material per season or franchise.
Extracting the vital moments needed to craft compelling storylines
from this sheer volume of media is a notoriously slow and punishing
process. When editorial teams cannot surface these key moments
quickly, creative momentum stalls and severe fatigue
sets in.
Meanwhile, the broader search landscape is undergoing a profound
transformation. We are moving beyond simple keyword matching toward
AI-driven systems capable of understanding deep context and intent.
Yet, while these advances have revolutionized text and image
retrieval, searching through video, the richest medium for
storytelling, remains a daunting “needle in a haystack”
challenge.
The solution to this bottleneck cannot rely on a single
algorithm. Instead, it demands orchestrating an expansive ensemble
of specialized models: tools that identify specific characters, map
visual environments, and parse nuanced dialogue. The ultimate
challenge lies in unifying these heterogeneous signals, textual
labels, and high-dimensional vectors into a cohesive, real-time
intelligence, one that cuts through the noise and responds to
complex queries at the speed of thought, truly empowering the
creative process.
Why Video Search is Deceptively Complex
Since video is a multi-layered medium, building an effective
search engine required us to overcome significant technical
bottlenecks. Multi-modal search is exponentially more complex than
traditional indexing: it demands the unification of outputs from
multiple specialized models, each analyzing a different facet of
the content to generate its own distinct metadata. The ultimate
challenge lies in harmonizing these heterogeneous data streams to
support rich, multi-dimensional queries in real time.
Unifying the Timeline: To ensure critical
moments aren’t lost across scene boundaries, each model segments
the video into overlapping intervals. The resulting metadata varies
wildly, ranging from discrete text-based object labels to dense
vector embeddings. Synchronizing these disjointed, multi-modal
timelines into a unified chronological map presents a massive
computational hurdle.
Processing at Scale: A standard 2,000-hour
production archive can contain over 216 million frames. When
processed through an ensemble of specialized models, this baseline
explodes into billions of multi-layered data points. Storing,
aligning, and intersecting this staggering volume of records while
maintaining sub-second query latency far exceeds the capabilities
of traditional database architectures.
Surfacing the Best Moments: Surface-level
mathematical similarity is not enough to identify the most relevant
clip. Because continuous shots naturally generate thousands of
visually redundant candidates, the system must dynamically cluster
and deduplicate results to surface the singular best match for a
given scene. To achieve this, effective ranking relies on a
sophisticated hybrid scoring engine that weighs symbolic text
matches against semantic vector embeddings, ensuring both precision
and interpretability.
Zero-Friction Search: For filmmakers, search
is a stream-of-consciousness process, and a ten-second delay can
disrupt the creative flow. Because sequential scanning of raw
footage is fundamentally unscalable, our architecture is built to
navigate and correlate billions of vectors and metadata records
efficiently, operating at the speed of thought.
Figure 1: Unified Multimodal Result
Processing
The Ingestion and Fusion Pipeline
To ensure system resilience and scalability, the transition from
raw model output to searchable intelligence follows a decoupled,
three-stage process:
1. Transactional Persistence
Raw annotations are ingested via high-availability
pipelines and stored in our
annotation service, which leverages Apache Cassandra
for distributed storage. This stage strictly prioritizes data
integrity and high-speed write throughput, guaranteeing that every
piece of model output is safely captured.
Figure 2: Sample Scene Search Model Annotation
Output
2. Offline Data Fusion
Once the annotation service securely persists the raw data, the
system publishes an event via Apache Kafka to trigger an
asynchronous processing job. Serving as the architecture’s central
logic layer, this offline pipeline handles the heavy computational
lifting out-of-band. It performs precise temporal intersections,
fusing overlapping annotations from disparate models into cohesive,
unified records that empower complex, multi-dimensional
queries.
Cleanly decoupling these intensive processing tasks from the
ingestion pipeline guarantees that complex data intersections never
bottleneck real-time intake. As a result, the system maintains
maximum uptime and peak responsiveness, even when processing the
massive scale of the Netflix media catalog.
To achieve this intersection at scale, the offline pipeline
normalizes disparate model outputs by mapping them into fixed-size
temporal buckets. This discretization process unfolds in
three steps:
Bucket Mapping: Continuous detections are
segmented into discrete intervals. For example, if a model detects
a character “Joey” from seconds 2 through 8, the pipeline
maps this continuous span of frames into seven distinct one-second
buckets.
Annotation Intersection: When multiple models
generate annotations for the same temporal bucket, such as
character recognition “Joey” and scene detection
“kitchen” overlapping in second 4, the system fuses them
into a single, comprehensive record.
Optimized Persistence: These newly enriched
records are written back to Cassandra as distinct entities. This
creates a highly optimized, second-by-second index of multi-modal
intersections, perfectly associating every fused annotation with
its source asset.
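The bucket mapping and intersection steps can be sketched as follows. This assumes both endpoints of a detection are inclusive, which is what yields seven buckets for seconds 2 through 8. The tuple layout, the function names, and the kitchen span of seconds 4 through 6 are hypothetical.

```python
from collections import defaultdict


def to_buckets(start_s, end_s):
    """Map a continuous detection to one-second buckets (both endpoints inclusive)."""
    return list(range(start_s, end_s + 1))


def fuse(annotations):
    """Fuse (kind, label, start_s, end_s) annotations into one record per bucket."""
    fused = defaultdict(dict)
    for kind, label, start_s, end_s in annotations:
        for bucket in to_buckets(start_s, end_s):
            fused[bucket].setdefault(kind, []).append(label)
    return dict(fused)


records = fuse([
    ("character", "Joey", 2, 8),
    ("scene", "kitchen", 4, 6),
])
```

Here `records[4]` carries both the character and the scene annotation, while `records[3]` carries only the character.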
Figure 3: Temporal Data Fusion with Fixed-Size
Time Buckets
The following record shows the overlap of the character
“Joey” and scene “kitchen” annotations during a 4
to 5 second window in a video asset:
Figure 4: Sample Intersection Record for Character +
Scene Search
3. Indexing for Real Time Search
Once the enriched temporal buckets are securely persisted in
Cassandra, a subsequent event triggers their ingestion into
Elasticsearch.
To guarantee absolute data consistency, the pipeline executes
upsert operations using a composite key (asset ID + time
bucket) as the unique document identifier. If a temporal
bucket already exists for a specific second of video, perhaps
populated by an earlier model run, the system intelligently updates
the existing record rather than generating a duplicate. This
mechanism establishes a single, unified source of truth for every
second of footage.
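The upsert behavior can be illustrated with a plain dict standing in for the Elasticsearch index. The ID format (`asset_id:bucket`), the document fields, and the function names are assumptions made for this sketch; the text specifies only that the key combines asset ID and time bucket.

```python
def document_id(asset_id, bucket_start_s):
    """Composite key: asset ID plus time bucket."""
    return f"{asset_id}:{bucket_start_s}"


def upsert(index, asset_id, bucket_start_s, annotations):
    """Create the bucket document if missing, otherwise merge new annotations into it."""
    doc_id = document_id(asset_id, bucket_start_s)
    doc = index.setdefault(
        doc_id,
        {"asset_id": asset_id, "bucket": bucket_start_s, "annotations": []},
    )
    for annotation in annotations:
        if annotation not in doc["annotations"]:  # re-running a model adds nothing new
            doc["annotations"].append(annotation)
    return doc_id


index = {}
upsert(index, "asset-1", 4, [{"type": "character", "label": "Joey"}])
upsert(index, "asset-1", 4, [{"type": "scene", "label": "kitchen"}])
upsert(index, "asset-1", 4, [{"type": "character", "label": "Joey"}])  # repeated model run
```

After three calls the index still holds a single document for second 4, containing two child annotations.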
Architecturally, the pipeline structures each temporal bucket as
a nested document. The root level captures the overarching asset
context, while associated child documents house the specific,
multi-modal annotation data. This hierarchical data model is
precisely what empowers users to execute highly efficient,
cross-annotation queries at scale.
The search service provides a high-performance interface for
real-time discovery across the global Netflix catalog. Upon
receiving a user request, the system immediately initiates a query
preprocessing phase, generating a structured execution plan through
three core steps:
Query Type Detection: Dynamically categorizes
the incoming request to route it down the most efficient retrieval
path.
Filter Extraction: Isolates specific semantic
constraints such as character names, physical objects, or
environmental contexts to rapidly narrow the candidate pool.
Vector Transformation: Converts raw text into
high-dimensional, model-specific embeddings to enable deep,
context-aware semantic matching.
Once generated, the system compiles this structured plan into a
highly optimized Elasticsearch query, executing it directly against
the pre-fused temporal buckets to deliver instantaneous,
frame-accurate results.
Fine-Tuning Semantic Search
To support the diverse workflows of different production teams,
the system provides fine-grained control over search behavior
through configurable parameters:
Exact vs. Approximate Search: Users can toggle
between exact k-Nearest Neighbors (k-NN) for uncompromising
precision, and Approximate Nearest Neighbor (ANN) algorithms (such
as HNSW)
to maintain blazing speed when querying massive datasets.
Dynamic Similarity Metrics: The system
supports multiple distance calculations, including cosine
similarity and Euclidean
distance. Because different models shape their high-dimensional
vector spaces distinctly based on their underlying training
architectures, the flexibility to swap metrics ensures that
mathematical closeness perfectly translates to true semantic
relevance.
Confidence Thresholding: By establishing
strict minimum score boundaries for results, users can actively
prune the long tail of low-probability matches. This aggressively
filters out visual noise, guaranteeing that creative teams are not
distracted and only review results that meet a rigorous standard of
mathematical similarity.
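To make the three controls concrete, the sketch below implements both metrics and a brute-force (exact) k-NN with a minimum-score cutoff in plain Python. It illustrates the concepts only, not how Elasticsearch executes them; the vectors, clip IDs, and function names are made up.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def euclidean_distance(a, b):
    """Straight-line distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def exact_knn(query, candidates, k, min_score):
    """Score every candidate, drop those below `min_score`, return the top k."""
    scored = [(cosine_similarity(query, vec), clip_id) for clip_id, vec in candidates.items()]
    kept = [(score, clip_id) for score, clip_id in scored if score >= min_score]
    return sorted(kept, reverse=True)[:k]


candidates = {"clip-a": [1.0, 0.0], "clip-b": [0.0, 1.0], "clip-c": [1.0, 1.0]}
top = exact_knn([1.0, 0.0], candidates, k=2, min_score=0.5)
top_ids = [clip_id for _, clip_id in top]
```

An ANN index such as HNSW avoids scoring every candidate as this loop does, trading a small amount of recall for speed.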
Textual Analysis & Linguistic Precision
To handle the deep nuances of dialogue-heavy searches, such as
isolating a character’s exact catchphrase amidst thousands of hours
of speech, we implement a sophisticated text analysis strategy
within Elasticsearch. This ensures that conversational context is
captured and indexed accurately.
Phrase & Proximity Matching: To respect the
narrative weight of specific lines (e.g., “Friends don’t
lie” in Stranger Things), we leverage
match-phrase queries with a configurable slop
parameter. This guarantees the system retrieves the correct scene
even if the user’s memory slightly deviates from the exact
transcription.
N-Gram Analysis for Partial Discovery: Because
video search is inherently exploratory, we utilize edge N-gram tokenizers to
support search-as-you-type functionality. By actively indexing
dialogue and metadata substrings, the system surfaces
frame-accurate results the moment an editor begins typing,
drastically reducing cognitive load.
Tokenization and Linguistic Stemming: To
seamlessly support the global scale of the Netflix catalog, our
analysis chain applies sophisticated stemming across
multiple languages. This ensures a query for “running”
automatically intersects with scenes tagged with “run” or
“ran”, collapsing grammatical variations into a single,
unified search intent.
Levenshtein Fuzzy Matching: To account for
transcription anomalies or phonetic misspellings, we incorporate
fuzzy search capabilities based on Levenshtein
distance algorithms. This intelligent soft-matching approach
ensures that high-value shots are never lost to minor data-entry
errors or imperfect queries.
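Two of the techniques above are easy to show in isolation. The following is a plain-Python illustration of edge n-gram generation and Levenshtein distance, not Elasticsearch's implementation; the function names and default parameters are chosen for this sketch.

```python
def edge_ngrams(token, min_gram=1, max_gram=5):
    """Prefixes of `token`, the substrings indexed for search-as-you-type."""
    upper = min(max_gram, len(token))
    return [token[:n] for n in range(min_gram, upper + 1)]


def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, or substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


prefixes = edge_ngrams("kitchen", min_gram=1, max_gram=3)
distance = levenshtein("kitten", "sitting")
```

With these inputs `prefixes` is `["k", "ki", "kit"]` and `distance` is 3; a fuzzy match accepts terms whose distance from the query falls within a configured limit.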
Aggregations and Flexible Grouping
The architecture operates at immense scale, seamlessly executing
queries within a single title or across thousands of assets
simultaneously. To combat result fatigue, the system leverages
custom aggregations to intelligently cluster and group outputs
based on specific parameters, such as isolating the top 5 most
relevant clips of an actor per episode. This guarantees a diverse,
highly representative return set, preventing any single asset from
dominating the search results.
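The grouping idea can be illustrated outside Elasticsearch. The sketch below is an in-memory stand-in for the aggregation, not the system's implementation: it assumes each hit is a dictionary with hypothetical `episode_id` and `score` keys and keeps the five highest-scoring clips per episode.

```python
from collections import defaultdict
from heapq import nlargest


def top_clips_per_episode(hits, n=5):
    """Group hits by episode and keep the n best-scoring clips in each group."""
    grouped = defaultdict(list)
    for hit in hits:
        grouped[hit["episode_id"]].append(hit)
    # nlargest returns each group's clips sorted by descending score.
    return {
        episode: nlargest(n, clips, key=lambda h: h["score"])
        for episode, clips in grouped.items()
    }
```

Capping each group, rather than taking the global top results, is what keeps one episode with many strong matches from crowding out the others.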
Search Response Curation
While temporal buckets are the internal mechanism for search
efficiency, the system post-processes Elasticsearch results to
reconstruct original time boundaries. The reconstruction process
ensures results reflect narrative scene context rather than
arbitrary intervals. Depending on the query intent, the system
generates results based on two logic types:
Figure 6: Depiction of Temporal Union vs
Intersection
Union: Returns the full span of all matching
annotations (3–8 sec), which prioritizes breadth,
capturing any instance where a specified feature occurs.
Intersection: Returns only the exact
overlapping duration of matching signals (4–6 sec). The
intersection logic focuses on co-occurrence, isolating moments when
multiple criteria align.
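To make the two logic types concrete, here is a small sketch that treats each matching annotation as a `(start, end)` pair in seconds. The input intervals `(3, 6)` and `(4, 8)` are assumed for illustration; they are chosen so the outputs line up with the 3–8 sec and 4–6 sec ranges mentioned above.

```python
def temporal_union(intervals):
    """Full span covered by the matching annotations: earliest start to latest end."""
    return (min(start for start, _ in intervals), max(end for _, end in intervals))


def temporal_intersection(intervals):
    """Exact overlap shared by every annotation, or None if they never co-occur."""
    start = max(start for start, _ in intervals)
    end = min(end for _, end in intervals)
    return (start, end) if start < end else None


annotations = [(3, 6), (4, 8)]  # assumed example annotations, in seconds
union_span = temporal_union(annotations)           # (3, 8)
overlap_span = temporal_intersection(annotations)  # (4, 6)
```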
Figure 7: Sample Response for “Joey” + “Kitchen”
Query
Future Extensions
While our current architecture establishes a highly resilient
and scalable foundation, it represents only the first phase of our
multi-modal search vision. To continuously close the gap between
human intuition and machine retrieval, our roadmap focuses on three
core evolutions:
Natural Language Discovery: Transitioning from
structured JSON payloads to fluid, conversational interfaces (e.g.,
“Find the best tracking shots of Tom Holland running on a
roof”). This will abstract away underlying query complexity,
allowing creatives to interact with the archive organically.
Adaptive Ranking: Implementing machine
learning feedback loops to dynamically refine scoring algorithms.
By continuously analyzing how editorial teams interact with and
select clips, the system will self-tune its mathematical definition
of semantic relevance over time.
Domain-Specific Personalization: Dynamically
calibrating search weights and retrieval behaviors to match the
exact context of the user. The platform will tailor its results
depending on whether a team is cutting high-action marketing
trailers, editing narrative scenes, or conducting deep archival
research.
Ultimately, these advancements will elevate the platform from a
highly optimized search engine into an intelligent creative
partner, fully equipped to navigate the ever-growing complexity and
scale of global video media.
Acknowledgements
We would like to extend our gratitude to the following teams and
individuals whose expertise and collaboration were instrumental in
the development of this system:
Learn how ScyllaDB vector search simplifies scalable
semantic search & RAG with real-time performance – plus see FAQs on
vector search scaling, embeddings, and architecture
As a Solutions Architect at ScyllaDB, I talk a lot about vector search
these days. Most vector search conversations start with RAG. Build
a chatbot, embed some documents, do a similarity lookup. That’s an
important use case, but it’s just the beginning of what vector
search can do when you have a platform like ScyllaDB that can scale
to massive datasets and hit those p99 latencies.
At Monster Scale Summit, I showed two live demos that cover what I think
are the two most practical patterns for vector search today: hybrid
memory for RAG, and real-time anomaly detection. Both run entirely
on ScyllaDB, without needing an external cache (e.g., Redis) or
dedicated vector database (e.g., Pinecone). This post explains what
I showed and also answers the top questions from attendees.
The Limits of Traditional Search and Anomaly Detection
The old way of
doing anomaly detection, retrieving data, doing lookups, using
inverted indexes, and running semantic search is brittle. Anyone
who has tried to scale an index-based system has experienced this
firsthand. The first step was figuring out what was said before,
what data was loaded. Redis caches, key-value stores, document
databases, and SQL databases were used to handle that. Then, to
determine what documents are relevant, we built custom search
pipelines using Lucene-based indexes and similar tools. Next, we
built complex rules engines that became difficult to troubleshoot
and couldn’t scale. These systems became extremely complex, with SQL-based
regressions and similar mechanisms. The problem is that they’re
hard to scale and very rigid. Every new failure mode required
building a new rule, sometimes entirely new rules engines. Often,
these detection paths required completely different systems. The
result was graph use cases, vector systems, and multiple databases:
a collection of systems solving what are ultimately simple
problems. As you might imagine, these systems become quite brittle
and expensive to operate. You could avoid this with a conceptual
shift: from matching rules to finding things that look similar.
That shift from rigid logic to semantic similarity is what vector
search is trying to achieve, whether through text embeddings or
other encodings.
RAG with Hybrid Memory
The first demo showed how we can build
three distinct memory tiers for RAG, all living within ScyllaDB.
This “hybrid memory” pattern allows you to handle everything from
immediate context to long-forgotten conversations without a complex
conglomerate of different databases.
Short-term memory: This handles the last five records of your
interaction as a rolling window. It is very fast, uses a short TTL,
and ensures the AI stays on track with the immediate conversation.
Long-term memory: This is where vectors shine. If
you’ve been chatting with a bot for 6 days, that history won’t fit
in a standard context window. We use vector search to retrieve
specific conversation parts that matter based on embedded
similarity, loading only the relevant history into the prompt. This
is the piece that makes a chatbot feel like it actually remembers.
Document enrichment: This phase takes uploaded
files (PDFs, Markdown files, etc.), chunks them semantically,
embeds each chunk, and stores those vectors in ScyllaDB. When you
ask a question, the system retrieves the most relevant chunks and
feeds them to the model.
Basically, I used ScyllaDB to manage key-value lookups, vector
similarity, and TTL expiration in a single cluster, which meant I
didn’t need extra infrastructure like Redis or Pinecone. The demo
processed LinkedIn profiles by breaking them into semantic chunks
(one profile generated 43 chunks) and stored those embeddings in
ScyllaDB Cloud to be queried in real time. The frontend is built in
Elixir; the backend is ScyllaDB Cloud with vector search enabled.
The code is on my GitHub. Pull it, run it, and try it out with your own
documents.
Anomaly Detection: Three Paths, One Database
The second
demo is where vector search truly earns its keep. The setup models
an IoT pipeline:
Devices generate metrics every 10 seconds.
Metrics stream through Kafka.
They’re aggregated into 60-second windows.
Finally, they land in ScyllaDB as 384-dimensional embeddings
generated by Ollama.
From there, three detection paths run in parallel against
every snapshot.
Path 1: Rules engine. This is
classic Z-score analysis. If four or more metrics exceed a Z-score
of 6.0, flag it. This is fast and interpretable. It catches known
anomaly patterns and misses everything else.
Path 2: Embedding similarity. Build a behavioral fingerprint for
each device. Compare every new snapshot against that fingerprint
using cosine similarity. If the score drops below 0.92, something
has drifted. This catches gradual behavioral changes that hard
thresholds miss.
Path 3: Vector search. Run an ANN
query across the device’s historical snapshots in ScyllaDB.
Post-filter for the same device, non-anomalous records, and a
cosine similarity above 0.75. If fewer than five similar normal
snapshots exist, that snapshot is anomalous.
You don’t need to write any rules or tune thresholds per
device. The data tells you what normal looks like. In the demo, I
showed a sensor drift scenario where power consumption dropped
slightly and a few other metrics shifted just enough to be unusual.
This slipped past the rules engine and embedding similarity, but
vector search caught it. It’s Exhibit A for how vector search
exposes subtle anomalies that rigid systems can’t see.
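The decision logic of the three paths can be sketched in a few lines. This is a simplified illustration rather than the demo's code: the thresholds come from the description above, but the function names, the per-metric mean and standard deviation inputs, and the shape of the candidate rows (`device_id`, `is_anomalous`, `embedding`) are assumptions. The candidate list stands in for whatever rows an ANN query returns.

```python
import math

Z_THRESHOLD = 6.0            # Path 1: per-metric Z-score limit
MIN_OUTLIER_METRICS = 4      # Path 1: metrics that must exceed the limit
FINGERPRINT_MIN_SIM = 0.92   # Path 2: minimum similarity to the fingerprint
NEIGHBOR_MIN_SIM = 0.75      # Path 3: similarity required of a neighbor
MIN_NORMAL_NEIGHBORS = 5     # Path 3: similar normal snapshots required


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rules_path(metrics, means, stdevs):
    """Path 1: flag when four or more metrics exceed a Z-score of 6.0."""
    outliers = sum(
        1
        for name, value in metrics.items()
        if stdevs[name] > 0 and abs(value - means[name]) / stdevs[name] > Z_THRESHOLD
    )
    return outliers >= MIN_OUTLIER_METRICS


def fingerprint_path(embedding, fingerprint):
    """Path 2: flag when similarity to the device fingerprint drops below 0.92."""
    return cosine_similarity(embedding, fingerprint) < FINGERPRINT_MIN_SIM


def vector_search_path(embedding, device_id, candidates):
    """Path 3: post-filter ANN candidates, flag if fewer than five remain."""
    similar_normal = [
        row
        for row in candidates  # hypothetical rows from an ANN query
        if row["device_id"] == device_id
        and not row["is_anomalous"]
        and cosine_similarity(embedding, row["embedding"]) > NEIGHBOR_MIN_SIM
    ]
    return len(similar_normal) < MIN_NORMAL_NEIGHBORS
```

Because each function returns an independent boolean, the three paths can run in parallel and their results can be combined however the application prefers.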
The reason this works on a single cluster is that ScyllaDB handles
vectors alongside time series, key-value, and TTL data with sub-10
millisecond queries under heavy concurrent load. Snapshots,
profiles, anomaly events, and the vector index all live in one
place. The anomaly detection demo runs in Docker if you want to
try it on your own data.
Common Questions About ScyllaDB Vector Search
Technical demos naturally lead to questions, and we heard plenty at
Monster Scale Summit (thanks for keeping us busy in the chat). Here
are responses to the top questions we fielded.
“What about new devices with no history?”
Fair concern. Path 3 decides “anomaly” when it
finds fewer than five similar normal snapshots. A brand new device
with ten records will trigger on everything. The fix is a minimum
history gate. Don’t run the vector path until a device has enough
baseline data to make comparisons meaningful. In practice, the
rules engine and embedding similarity cover the gap while the
device builds history.
“How do I know my embeddings are good enough?”
Test them against known anomalies before you
go to production. Bad embeddings produce bad neighbors, and
similarity search will give you confidently wrong answers.
Run a batch of labeled data through your embedding model, query for
nearest neighbors, and check whether the results make sense. If
your model does not capture the right features, no amount of tuning
the similarity threshold will save you.
“Does this scale to millions of devices?”
The ANN query itself scales; the
part you want to watch is post-filtering. The query returns
candidates, and your application code still has to filter by
device, status, and similarity threshold. At millions of devices,
that step needs to be efficient. Batch your comparisons and keep
your filter criteria tight. Someone at the talk asked about the
performance impact of deletions on the vector index as old
snapshots expire. ScyllaDB’s index structure is optimized for
deletion, so the index stays healthy as data churns through.
“Can I just use vector search and skip the rules engine?”
Yes, you can…but you’ll regret it. The three-path
approach works because each path catches a different class of
problem. Rules engines are still the fastest way to catch known,
well-defined failures. Someone at the talk asked whether an LLM
with MCP could just explore the context and find anomalies itself.
It can. But vector search is cheaper. No LLM invocation, no context
window limit, and it runs at the speed of a database query. Use
each tool for what it is good at.
Try It Yourself
If you want to
see what ScyllaDB can do for your vector workloads, sign up for a demo. I’m always
happy to walk through architectures and talk through what makes
most sense for your specific requirements. Connect with me on
LinkedIn
if you have questions, want to collaborate, or just want to talk
about distributed systems. I read every message.
Learn about ScyllaDB’s new features, performance
improvements, and innovations – plus, get a sneak peek into what’s
coming next.
2025 was a busy year for ScyllaDB R&D. We
shipped one major release, three minor releases, and a continuous
stream of updates across both ScyllaDB and ScyllaDB Cloud. Our
users are most excited about elasticity, vector search, and reduced
total cost of ownership. We also made nice progress on years-long
projects like strong consistency and object storage. I raced through
a year-in-review recap in my recent Monster Scale Summit talk (time
limits, sorry), which you can
watch on-demand. But I also want to share a written version
that links to related talks and blog posts with additional detail.
Feel free to choose your own adventure.
Watch the talk
Tablets: Fast Elasticity
ScyllaDB’s “tablets” data
distribution approach is now fully supported across all ScyllaDB
capabilities, for both CQL and
Alternator (our DynamoDB-compatible API). Tablets are the key
technology that dynamically scales clusters in and out. It’s been
in production for over a year now; some customers use it to scale
for the usual daily fluctuations, others rely on its fast responses
to workload volatility like expected events, unpredictable spikes,
etc. Right now, the autoscaling lets users maximize their disks,
which reduces costs. Soon, we’ll automatically scale based on
workload characteristics as well as storage.
A few things make this genuinely different from the vNodes
design that we originally inherited from Cassandra, but replaced
with tablets: When you add new nodes, load balancing starts
immediately and in parallel. No cleanup operation is needed.
There’s also no resharding; rebalancing is automated. Moreover, we
shifted from mutation-based streaming to file-based streaming:
we now stream entire SSTable files without deserializing them into
mutation fragments and reserializing them back into SSTables on
receiving nodes. As a result, 3X less data is streamed over the
network and less CPU is consumed, especially for data models that
contain small cells. This change provides up to 25X
faster streaming. As
Avi’s talk explains, we can scale out in minutes and the
tablets load balancing algorithm balances nodes based on their
storage consumption. That means more usable space on an ongoing
basis – up to
90% disk utilization. Also noteworthy: you can now scale with
different node sizes – so you can increase capacity in much smaller
increments. You can add tiny instances first, then replace them
with larger ones if needed. That means you rarely pay for unused
capacity. For example, before, if you started with an i4i.16xlarge
node that had 15 TB of storage and you hit 70% utilization, you had
to launch another i4i.16xlarge – adding 15 TB at once. Now, you
might add two xlarge nodes (1.8 TB each) first. Then, if you need
more storage, you add more small nodes, and eventually replace them
with larger nodes.
Vector Search: Real-Time AI
Customers like
Tripadvisor,
ShareChat,
Medium, and
Agoda have been using ScyllaDB as a fast feature store in their
AI pipelines for several years. ScyllaDB now also has
real-time vector search, which is embedded in ScyllaDB Cloud. Our
vector search takes advantage of ScyllaDB’s unique
architecture. Technologies like our super-fast Rust driver, CDC,
and tablets help us deliver real-time AI queries. We can handle
datasets of
1 billion vectors with P99 latency as low as 1.7 ms and
throughput up to 252,000 QPS.
Dictionary-Based Compression
I’ve always been interested in
compression, even as a student. It’s particularly interesting for
databases, though. It allows you to deliver more with less, so it’s
yet another way to reduce costs. ScyllaDB Cloud has always had
compression enabled by default, both for data at rest and for data
in transit. We compress our SSTables on disk, we compress traffic
between nodes, and optionally between clients and nodes. In 2025,
we improved its efficiency by up to 80% by enabling dictionary-based
compression. The compressor gains better context of the data being
compressed. That gives it higher compression ratios and, in some
cases, even better performance. Our two most popular compression
algorithms, LZ4 and ZSTD, both
benefit from dictionaries now. It works by sampling some data,
from which it creates a dictionary and then uses it to further
compress the data.
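The effect of a shared dictionary is easy to see on small records. The sketch below is only an illustration of the general technique, not ScyllaDB's implementation: it uses Python's standard `zlib` module and its preset-dictionary support (`zdict`) as a stand-in for LZ4 and ZSTD, and it builds the dictionary by simply concatenating sample records, which is an assumption and cruder than a trained dictionary. The record contents are made up.

```python
import zlib


def compress(data: bytes, dictionary: bytes = b"") -> bytes:
    """Deflate `data`, optionally priming the compressor with a dictionary."""
    comp = zlib.compressobj(zdict=dictionary) if dictionary else zlib.compressobj()
    return comp.compress(data) + comp.flush()


def decompress(blob: bytes, dictionary: bytes = b"") -> bytes:
    """Inflate `blob`; the same dictionary must be supplied if one was used."""
    decomp = zlib.decompressobj(zdict=dictionary) if dictionary else zlib.decompressobj()
    return decomp.decompress(blob) + decomp.flush()


# "Sample some data": a few representative records form the dictionary.
samples = [
    b'{"device_id": "sensor-0001", "temperature": 20.1, "status": "ok"}',
    b'{"device_id": "sensor-0002", "temperature": 22.7, "status": "ok"}',
]
dictionary = b"".join(samples)

record = b'{"device_id": "sensor-0042", "temperature": 21.5, "status": "ok"}'
plain = compress(record)
with_dict = compress(record, dictionary)

print(len(record), len(plain), len(with_dict))
```

A single short record has little internal repetition, so compressing it alone gains almost nothing. With the dictionary, the field names and structure are already known to the compressor, and only the values that differ need to be encoded.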
The graph on the lower left shows the impact of enabling dictionary
compression for network traffic. Both compression algorithms are
working and both drop nicely – from 70% to less than 50% for LZ4
and from around 50% to 33% for ZSTD. The table on the lower right
shows a similar change. It shows the benefit for disk utilization
on a customer’s production cluster, essentially cutting down
storage consumption from 50% to less than 30%. Note that the 50%
was an already compressed dataset. With this new compression, we
further compressed it to less than 30%, a significant saving.
Raft-Based Topology
We’ve been working on Raft and features built on
top of it for several years now. Currently, we use Raft for
multiple purposes. Schema consistency was the first step, but
topology is the more interesting improvement. With Raft-based fast
parallel scaling and safe schema updates, we’re ready to finally
retire Gossip-based topology. Other features that use Raft-based
topology are authentication, service levels (also known as
workload prioritization), and tablets. We are actively working
on making strong consistency available for data as well.
ScyllaDB X Cloud
ScyllaDB X Cloud is
the next generation of ScyllaDB Cloud, a truly elastic
database-as-a-service. It builds upon innovation in both ScyllaDB
core (such as tablets and Raft-based topology), as well as
innovation and improvements in ScyllaDB Cloud itself (such as
parallel setup of cloud resources, reduced boot time, and the new
resource resizing algorithm) to provide immediate elasticity to
clusters. Curious how it works? It’s quite simple, really. You just
select two important parameters for your cluster, and those define
the minimum values for resources:
The minimum vCPU count, which is a way to measure the ‘horsepower’
that you initially want to reserve for your clusters.
The minimum storage size.
And that’s it. You can do this via the API or UI. In the UI, it
looks like this:
If you wish, you can also limit it to a specific instance family.
Now, let’s see how this scaling looks in action.
A few things to note here:
There are three somewhat large nodes and three additional nodes
that are smaller.
Some of the tablets are not equal, and that’s perfectly fine.
The load was very high initially, then additional load moved
gradually to the new nodes.
The workload itself, in terms of requests, didn’t change. It
changed which nodes it is going to, but the overall value remained
the same.
The average latency, the P95 latency, and the P99 latency
are all great: even the P99s are in the single-digit milliseconds.
And here’s a look at additional ScyllaDB Cloud updates before we
move on:
A Few More Things We Shipped Last Year
Backup is a critical feature
of any database, and we run it regularly in ScyllaDB Cloud for our
customers. In the past, we used an external utility that backed up
snapshots of the data. That was somewhat inefficient. Also, it
competed with the memory, CPU, disk, and network resources that
ScyllaDB was consuming – potentially affecting the throughput and
latency of user workloads. We reimplemented the backup client so it
now runs inside ScyllaDB’s scheduler and cooperates with the rest
of the system. The result is minimal impact on user workload and
11X
faster execution that scales linearly with the number of shards
per node. On the infrastructure side, the
new AWS Graviton i8g instances proved themselves this year. We
measured up to
2X throughput and lower latency at the same price, along with
higher usable storage from improved compression and utilization. We
don’t embrace every new instance type since we have very specific
and demanding requirements. However, when we see clear value like
this, we encourage customers to move to newer generations. On the security side, all new clusters now have their
data encrypted at rest by default. When creating a cluster, you
can either use your own key (known as ‘BYOK’) or use the ScyllaDB
Cloud key. We also reached general availability of
our Rust driver. This is interesting because it’s our fastest
driver. Also, its binding is the foundation for our grand plan:
unifying our drivers under the Rust infrastructure. We started
with a new C++ driver. Next up (almost ready) is our NodeJS driver
– and we’ll continue with others as well. We also released our C# driver (another popular request) and
across the board improved our drivers’ reliability, capabilities,
compatibility, and performance. Finally, our Alternator clients in
various languages received some important updates as well, such as
network compression for both requests and responses.
What’s Next for ScyllaDB
Finally, let’s close with a glimpse at some of the major
things we are expecting to deliver in 2026:
An efficient, easy-to-use, and online process for migrating from
vNodes to tablets.
Strong consistency for data (beyond metadata and internal
features).
A fast dedicated data path to key-value data.
Additional capabilities focused on 1) improved total cost of
ownership and 2) more real-time AI features and integrations.
There’s a lot to look forward to! We’d love to answer questions or
hear your feedback.
How Martin Kleppmann’s iconic book evolved for AI and
cloud-native architectures
Since its release in 2017,
Designing Data-Intensive Applications (DDIA) has become
known as the bible for anyone working on large-scale, data-driven
systems. The book’s focus on fundamentals (like storage engines,
replication, and partitioning) has helped it age well. Still, the
world of distributed systems has evolved substantially in the past
decade. Cloud-native architectures are now the default. Object
storage has become a first-class building block. Databases
increasingly run everywhere from embedded and edge deployments to
fully managed cloud services. And AI-driven workloads have made a
mark on storage formats, query engines, and indexing strategies.
So, after years of requests for a second edition, Martin Kleppmann
revisited the book – and he enlisted his longtime collaborator
Chris Riccomini as a co-author. Their goal: Keep the core intact
while weaving in new ideas and refreshing the details throughout.
With
the second edition of DDIA now arriving at people’s
doorsteps, let’s look at what Kleppmann and Riccomini shared
about the project at Monster Scale Summit 2025. That conversation
was recorded mid-revision. You can watch the entire chat or read
highlights below. Extra: Hear
what Kleppmann and Riccomini had to say when the book went off to
the printers this year. They were featured again at
Monster
SCALE Summit 2026, which just concluded. This talk is available
on-demand, alongside 60+ others by antirez, Pat Helland, Camille
Fournier, Discord, Disney, and more.
People have been requesting a second edition for a while – why now?
Kleppmann: Yeah, I’ve long wanted to do a second
edition, but I also have a lot of other things I want to do, so
it’s always a matter of prioritization. In the end, though, I felt
that the technology had moved on enough. Most of the book focuses
on fundamental concepts that don’t change that quickly. They change
on the timescale of decades, not months. In that sense, it’s a nice
book to revise, because there aren’t that many big changes. At the
same time, many details have changed. The biggest one is
how people use cloud services today compared to ten years ago.
Cloud existed back then, of course, but I think its impact on data
systems architecture has only been felt in the last few years.
That’s something we’re trying to work in throughout the second
edition of the book.
How have database architectures evolved since the first edition?
Riccomini: I’ve been thinking
about whether database architecture has changed now that we’re
putting databases in more places – cloud, BYOC, on-prem, on-client,
on-edge, etc. I think so. I have a hypothesis that successful
future databases will be able to move or scale with you, from your
laptop, to your server, to your cloud, and even to another cloud.
We’re already seeing evidence of this with things like DuckDB and
MotherDuck, which span embedded to cloud-based use cases. I think
PGlite and Postgres are another example, where you see Postgres
being embedded. SQLite is an obvious signal on the embedded side.
On the query engine side, you see this with systems like Daft,
which can run locally and in a distributed setting. So the answer
is yes. The biggest shift is the split between control, data, and
compute planes. That architecture is widely accepted now and has
proven flexible when moving between SaaS and BYOC. There’s also an
operational aspect to this. When you talk about SaaS versus
non-SaaS, you’re talking about things like multi-tenancy and how
much you can leverage cloud-provider infrastructure. I had an
interesting discussion about Confluent Freight versus WarpStream,
two competing Kafka streaming systems. Freight is built to take
advantage of a lot of the in-cloud SaaS infrastructure Confluent
has, while WarpStream is built more like a BYOC system and doesn’t
rely on things like a custom network plane. Operationally, there’s
a lot to consider regarding security and multi-tenancy, and I’m not
sure we’re as far along as we could be. A lot of what SaaS
companies are doing still feels proprietary and internal. That’s my
read on the situation.
Kleppmann: I’d add a little
to that. At the time of the first edition, the model for databases
was that a node ran on a machine, storing data on its local file
system. Storage was local disks, and replication happened at the
application layer on top of that. Now we’re increasingly seeing a
model where storage is an object store. It’s not a local file
system; it’s a remote service, and it’s already replicated.
Building on top of an abstraction like object storage lets you do
fundamentally different things compared to local disk storage. I’m
not saying one is better than the other – there are always
tradeoffs. But this represents a new point in the tradeoff space
that really wasn’t present at the time of the first edition. Of
course, object stores existed back then, but far fewer databases
took advantage of them in the way people do now.
We’ve seen a proliferation of specialized databases recently – do
you think we’re moving toward consolidation?
Riccomini: With
my investment hat on, this is the million-dollar…or
billion-dollar…question. Which of these is it? I think the answer
is probably both. The reality is that Postgres has really taken the
world by storm lately, and its extension ecosystem has become
pretty robust in recent versions. For most people, when they’re
starting out, they’re naturally going to build on something like
Postgres. They’ll use pg_search or something similar for their
search index, pgvector for vector embeddings, and pg_analytics or
pg_duckdb for their data lake. Then the question is: as they
scale, will that still be okay? And in some cases, yes. In other
cases, no. My personal hypothesis is that as you not only scale up
but also need features that are core to your product, you’re more
likely to move to a specialized system. pgvector, for example, is a
reasonable starting point. But if your entire product is like
Cursor AI or an IDE that does code completion, you probably need
something more robust and scalable than pgvector can provide. At
that point, you’d likely look at something like Pinecone or
Turbopuffer or companies like that. So I think it’s both. And
because Postgres is going to eat the bottom of the market, I do
think there will be fewer specialized vendors, but I don’t think
they’ll disappear entirely.
What are some of the key tradeoffs you see with streaming systems today?
Kleppmann:
Streaming sits in a slightly weird space. A typical stream
processor has a one-record-at-a-time, callback-based API. It’s very
imperative. On top of that, you can build things like relational
operators and query plans. But if you keep pushing in that
direction, the result starts to look much more like a database that
does incremental view maintenance. There are projects like
Materialize that are aiming there. You just give it a SQL query,
and the fact that it’s streaming is an implementation detail that’s
almost hidden. I don’t know if that means the result for many of
these systems is this: if you have a query you can express in SQL,
you hand it off to one of these systems and let it maintain the
view. And what we currently think of as streaming, with the
lower-level APIs, is used for a more specialized set of
applications. That might be very high-scale use cases, or queries
that just don’t fit well into a relational style.
Riccomini: Another thing I’d add is the
fundamental tradeoff between latency and throughput, which most
streaming systems have to deal with. Ideally, you want the lowest
possible latency. But when you do that, it becomes harder to get
higher throughput. The usual way to increase throughput is to batch
writes. But as soon as you start batching writes, you increase the
latency between when a message is sent and when it’s received.
How is AI impacting data-intensive applications?
Kleppmann: There’s some AI-plus-data work we’ve
been exploring in research (not really part of the book). The idea
is this: if you want to give an AI some control over a system – if
it’s allowed to press buttons that affect data, like editing or
updating it – then the safest way to do that is through a
well-defined API. That API defines which buttons the AI is allowed
to press, and those actions correspond to things that make sense
and maybe fulfill certain consistency properties. More generally,
it seems important to have interfaces that allow AI agents and
humans to work safely together, with the database as common ground.
Humans can update data, the AI can update data, and both can see
each other’s changes. You can imagine workflows where changes are
reviewed, compared, and merged. Those kinds of processes will be
necessary if we want good collaboration between humans and AI
systems.
Riccomini: From an implementation
perspective, storage formats are definitely going to evolve. We’re
already seeing this with systems like LanceDB, which are trying to
support multimodal data better. Arrow, for example, is built for
columnar data, which may not be the best fit for some multimodal
use cases. And this goes beyond storage into things like Arrow RPC
as well. On the query engine side, there’s also a lot of ongoing
work around query optimization and indexing. The idea is to build
smarter databases that can look at query patterns and adjust
themselves over time. There was a good paper from Google a while
back that used more traditional machine learning techniques to do
dynamic indexing based on query patterns. That line of work will
continue. And then, of course, support for embeddings, vector
search, and semantic search will become more common. Good
integrations with RAG (Retrieval-Augmented Generation)…that’s also
important. We’re still very much at the forefront of all of this,
so it’s tricky.
Watch the 2026 Chat with Martin and Chris
It seemed like just yesterday I was here in the US hosting P99 CONF…and then suddenly I’m back
again for Monster SCALE Summit 2026. The pace of it all says
something – not so much about my travel between the US and
Australia, but about how quickly this space is moving. Monster
SCALE Summit delivered an overwhelming amount of content – so much
that you need both time and distance to process it all. Sitting
here now at Los Angeles International Airport, I can conclude this
was the strongest summit to date.
Watch sessions on-demand
Day 1
To start with, the keynotes. As
expected, Dor Laor set the tone with
a strong opening – making a clear case that elasticity is the
defining capability behind ScyllaDB operating at scale. “Elastic”
is one of those overloaded terms in cloud infrastructure, but the
distinction here was sharp. The focus was on a database that
continuously adapts under load without breaking performance. Hot on
the heels of Dor was the team from Discord, showing us just
how they operate at scale while automating everything from
standing up new clusters to expanding them under load, all with
ScyllaDB. What stood out wasn’t just the scale but also the level
of automation. I mention it at every conference, but [Discord
engineer] Bo Ingram’s ScyllaDB
in Action book is required reading for anyone running
databases at scale. Since the conference was hosted by ScyllaDB,
there was no shortage of deep dives into operating at scale. Felipe
Cardeneti Mendes, author of Database
Performance at Scale, expanded on the
engineering behind ScyllaDB. Tablets play a key role as a core
abstraction that links predictable performance with true
elasticity. If you didn’t get a chance to participate in any of the
live lounge events watching Felipe scale ScyllaDB to millions of
operations per second, then you can try it yourself. The gap between
demo and production is enticingly small, so it’s well worth a
whirl. From the customer side,
Freshworks and
ShareChat, both with their own grounded presentations about
super low latency and super low costs, focused on what actually
matters in production. Enough of ScyllaDB for a bit, though. Let me
recommend Joy Gao’s presentation from ClickHouse about
cross-system database replication at petabyte scale. As you
would expect, that’s a difficult problem to solve. Or if you want
to know about ad reporting at scale, then check out the
presentation from Pinterest. The presentation of the day for me
was Thea Aarrestad talking about
nanosecond AI at the Large Hadron Collider. The metrics in this
presentation were insane, with datasets far exceeding the size of
the Netflix catalog being generated per second… all in the pursuit
of capturing the rarest particle interactions in real time. It’s a
useful reminder that at true extreme scale, you don’t store then
analyze. Instead, you analyze inline, or you lose the data forever.
A big thanks to Thea. We asked for monster scale, and you certainly
delivered. At the end of day one, we had another dense block of
quality content from equally strong presenters. Long-time
favourite,
Ben Cane from American Express, is always entertaining and
informative. We had some great research presented by Murat Demirbas
from MongoDB around
designing for distributed systems. And on the topic of
distributed systems, Stephanie Wang showed us how to
avoid some distributed traps with fast, simple analytics using
DuckDB. Miguel Young de la Sota from Buf Technologies indulged us
with
speedrunning Super Mario 64 and its relationship to performance
engineering. In parallel, Yaniv from ScyllaDB core engineering
showed us
what’s new and what’s next for our database. Pat Helland and
Daniel May finished Day 1 with a closing keynote on “Yours,
Mine, and Ours,” which was all about set reconciliation in the
distributed world. This is one of those topics that sounds niche,
but sits at the core of correctness at scale. When systems diverge
– and they always do – then reconciliation becomes the real
problem, not replication. Pat is a legend in this space and his
writing at https://pathelland.substack.com/
expands on many of these ideas.
Day 2
Day 2 got off to an early
start with a pre-show lounge hosted by Tzach Livyatan and Attila
Toth sharing tips for getting started with ScyllaDB. The opening
keynote from Avi Kivity went deep into the
mechanics behind ScyllaDB tablets, with the kind of
under-the-hood engineering that makes extreme scale tractable
rather than theoretical. I then had the opportunity to host
Camille Fournier, joining remotely for a live session on
engineering leadership and platform strategy under increasing
system complexity. AI was a recurring theme, but the sharper
takeaway was how platform teams need to evolve their role as
abstraction layers become more dynamic and less predictable.
Camille also stayed on for a live Q&A, engaging directly with
the audience on both leadership and the practical realities of
operating modern platforms at scale. Big thanks to Camille! The
next block shifted back into deep technical content, starting with
Dominik Tornow from Resonate HQ, sharing a
first-principles approach to agentic systems. Alex Dathskovsky
followed with a forward-looking session on the
future of data consistency in ScyllaDB, reframing consistency
not as a static tradeoff but as something increasingly adaptive and
workload-aware. Szymon Wasik went deep on
ScyllaDB vector search internals with real-time performance at
billion-vector scale. It was great to see how ScyllaDB avoids the
typical latency cliffs seen in ANN systems. From LinkedIn, Satya
Lakshmikanth shared practical lessons from operating
large-scale data systems in production, grounding the
conversation in real-world constraints. Asias He showed us how
incremental repair for ScyllaDB tablets makes repair much more
operationally feasible by cutting the traditional overhead of
keeping replicas consistent. We also heard from Tyler Denton, showcasing ScyllaDB
vector search in action across two use cases (RAG with Hybrid
Memory and anomaly detection at scale). Before lunch came another
top-quality keynote from TigerBeetle’s Joran Dirk Greef, who had plenty
of quotable lines in his talk, “You
Can’t Scale When You’re Dead.” And crowd favorite antirez from
Redis gave us his take on
HNSW indexes and all the tradeoffs that need to be considered.
The final block in an already monster-sized conference followed up
with Teiva Harsanyi from Google, sharing lessons
drawn from large-scale distributed systems like Gemini.
MoEngage brought a
practitioner view on operating high-throughput, user-facing
systems where latency directly impacts engagement. Benny Halevy
outlined the path
to tiered storage on ScyllaDB. Rivian discussed
real-world automotive data platforms and Brian Jones from SAS
highlighted how they
tackled tension between batch and realtime workloads at scale
by applying ScyllaDB. I also enjoyed the AWS talk from KT Tambe
with Tzach Livyatan, tying cloud infrastructure back to database
design decisions with the I8g family. To wrap the conference,
we heard from Brendan Cox on how
Sprig simplified their database infrastructure and achieved
predictable low latency with – I’m sure you’ve guessed by now –
ScyllaDB. Throughout the event, attendees could also binge-watch
talks in Instant Access, which featured gems from
Martin Kleppmann and Chris Riccomini,
Disney/Hulu,
DBOS,
Tiket, and more.
Prepare for Takeoff
Final call to board my
monstrous 15-hour flight home… It’s a great privilege to host these
virtual conferences live backed by the team at ScyllaDB. The
quality of this event ultimately comes down to the people – both
the presenters who bring great depth and detail and the audience
that keeps the conversation engaged and interesting. Thanks to
everyone who contributed to making this the best Monster SCALE
Summit to date.
Catch up on what you missed
If you want to chat with our engineers about
how ScyllaDB might work with your use case, book a
technical strategy session.
OpenShift users gain a trusted, validated path for
installing and managing ScyllaDB Operator – backed by
enterprise-grade support
ScyllaDB Operator is an
open-source project that helps you run ScyllaDB on Kubernetes by
managing ScyllaDB clusters deployed to Kubernetes and automating
tasks related to operating a ScyllaDB cluster. For example, it
automates installation, vertical and horizontal scaling, as well as
rolling upgrades. The latest release (version 1.20) is now Red Hat
certified and available directly in the Red Hat OpenShift ecosystem
catalog. Additionally, it brings new features, stability
improvements and documentation updates.
Red Hat OpenShift Certification and Catalog Availability
OpenShift has become a
cornerstone platform for enterprise Kubernetes deployments, and
we’ve been working to ensure ScyllaDB Operator feels like a native
part of that ecosystem. With ScyllaDB Operator 1.20, we’re taking a
significant step forward: the operator is now Red Hat certified and
available directly in the Red Hat OpenShift ecosystem catalog. See
the
ScyllaDB Operator project in the Red Hat Ecosystem Catalog.
This milestone gives OpenShift users a trusted, validated path for
installing and managing ScyllaDB Operator – backed by
enterprise-grade support.
With this release, you can install ScyllaDB Operator through OLM
(Operator Lifecycle Manager) using either the OpenShift Web Console
or CLI. For detailed installation instructions and
OpenShift-specific configuration examples – including guidance for
platforms like Red Hat OpenShift Service on AWS (ROSA) – see the
Installing ScyllaDB Operator on OpenShift guide.
IPv6 support
ScyllaDB Operator 1.20 brings native support for IPv6 and
dual-stack networking to your ScyllaDB clusters. With dual-stack
support, your ScyllaDB clusters can operate with both IPv4 and IPv6
addresses simultaneously. That provides flexibility for gradual
migration scenarios or environments requiring support for both
protocols. You can also configure IPv6-first deployments, where
ScyllaDB uses IPv6 for internal communication while remaining
accessible via both protocols. You can control the IP addressing
behaviour of your cluster through v1.ScyllaCluster’s .spec.network.
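For readers less familiar with dual-stack on Kubernetes, here is a minimal Python sketch of the underlying idea. It builds a plain Kubernetes Service manifest, not a ScyllaCluster: `ipFamilyPolicy` and `ipFamilies` are fields of the core Service API, and the first entry in `ipFamilies` is the primary family, which roughly mirrors the IPv4-first versus IPv6-first distinction described above. The helper `dual_stack_service`, the service name, and the selector label are hypothetical. The exact fields under `.spec.network` are not reproduced here, so consult the IPv6 networking documentation for those.

```python
import json

# The first entry in ipFamilies is treated by Kubernetes as the primary family.
FAMILY_ORDER = {
    "IPv4": ["IPv4", "IPv6"],  # IPv4-first dual-stack
    "IPv6": ["IPv6", "IPv4"],  # IPv6-first dual-stack
}


def dual_stack_service(name: str, first_family: str) -> dict:
    """Build a plain Kubernetes Service manifest that requests dual-stack.

    Hypothetical helper for illustration only; this is not the
    ScyllaCluster API and not how ScyllaDB Operator is implemented.
    """
    if first_family not in FAMILY_ORDER:
        raise ValueError(f"unknown IP family: {first_family!r}")
    return {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": {"name": name},
        "spec": {
            # PreferDualStack falls back to single-stack if the cluster
            # itself is not configured for dual-stack networking.
            "ipFamilyPolicy": "PreferDualStack",
            "ipFamilies": list(FAMILY_ORDER[first_family]),
            "selector": {"app": name},  # assumed label, for illustration
            "ports": [{"name": "cql", "port": 9042}],
        },
    }


if __name__ == "__main__":
    # JSON is also valid YAML, so the output can be inspected or applied as-is.
    print(json.dumps(dual_stack_service("scylla-example", "IPv6"), indent=2))
```

Swapping the second argument between `"IPv4"` and `"IPv6"` only changes the order of `ipFamilies`; both families stay reachable either way.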
The API abstracts away the underlying ScyllaDB configuration
complexity so you can focus on your networking requirements rather
than implementation details. With this release, IPv4-first
dual-stack and IPv6-first dual-stack configurations are
production-ready. IPv6-only single-stack mode is available as an
experimental feature under active development; it’s not recommended
for production use. See
Production readiness for details. Learn more about IPv6
networking in the
IPv6 networking documentation.
Other notable changes
Documentation improvements
This release includes a new guide on
automatic data cleanups. It explains how ScyllaDB Operator
ensures that your ScyllaDB clusters maintain storage efficiency and
data integrity by removing stale data and preventing data
resurrection.
Dependency updates
This release also includes regular
updates of ScyllaDB Monitoring and the packaged dashboards to
support the latest ScyllaDB releases (4.12.1->4.14.0, #3250),
as well as its dependencies: Grafana (12.2.0->12.3.2) and
Prometheus (v3.6.0->v3.9.1). For more changes and details, check
out the GitHub
release notes.
Upgrade instructions
For instructions on
upgrading ScyllaDB Operator to 1.20, please refer to the
Upgrading ScyllaDB Operator section of the documentation.
Supported versions
ScyllaDB 2024.1, 2025.1, 2025.3 – 2025.4
Red Hat OpenShift 4.20
Kubernetes 1.32 – 1.35
Container Runtime Interface API v1
ScyllaDB Manager 3.7 – 3.8
Getting started with ScyllaDB Operator
ScyllaDB Operator Documentation
Learn how to deploy ScyllaDB on Google Kubernetes Engine (GKE)
Learn how to deploy ScyllaDB on Amazon Elastic Kubernetes Service (EKS)
Learn how to deploy ScyllaDB on a Kubernetes Cluster
Related links
ScyllaDB Operator source (on GitHub)
ScyllaDB Operator image on DockerHub
ScyllaDB Operator Helm Chart repository
ScyllaDB Operator documentation
ScyllaDB Operator for Kubernetes lesson in ScyllaDB University
Report a problem
Your feedback is always welcome! Feel free to open an
issue or reach out on the #scylla-operator channel in ScyllaDB User Slack.