Agentic AI State Management with ScyllaDB and LangGraph
How to combine LangGraph and ScyllaDB for durable state management, crash recovery, and a highly available backend for your agentic AI applications.

Most agent implementations today are request-response loops. The challenge with this approach is that you are just one network issue or server restart away from losing context and progress. We have more powerful LLMs than ever, yet we're wrapping them in fragile infrastructure. As an example, assume you have an agent process that takes three minutes and involves seven API calls. There are a lot of places where it can go wrong. The process dies, the state disappears, and the agent starts over with no recollection of what it was doing.

Implementing a well-designed workflow orchestration client is not enough to solve this problem. You also need to implement a distributed and highly available backend to support your agents, something with:
- multi-region, durable storage
- automatic data replication
- fault tolerance
- high throughput

This post shows you how you can simplify your backend by using a single mature database that handles both high availability and durable storage for your agents. You write agent state to a persistent store, it survives crashes by default, and you can still meet 5ms P99 latency requirements. Pair that with an orchestration framework like LangChain's LangGraph that saves state after every step, and you get a reliable and scalable agentic backend. Let's see why and how you should implement a system like that with ScyllaDB.

Achieving zero agent downtime with ScyllaDB
ScyllaDB is a high-performance distributed NoSQL database designed to stay up and available for mission-critical applications. The Raft consensus algorithm handles topology changes and schema updates with strong consistency. Replication is automatic: you set a replication factor and ScyllaDB distributes copies across nodes, racks, and datacenters. On temporary node loss, hinted handoffs record missed writes and replay them when the node returns. For longer outages, row-level repair brings a replacement node up to date in the background. You don't need load balancers, external replication jobs, or manual failover steps.

ScyllaDB Cloud is a mature cloud offering. Multi-region clusters with tunable replication factors per datacenter, rack and availability-zone awareness, and zero-downtime operations are all available out of the box, with no extra components required. ScyllaDB also provides practical features for agentic use cases.

Persistent by design
Every write goes to durable storage. There is no configuration flag to enable durability; it is the default, not an option. Persistence allows your agent to recover from crashes and continue a process.

Data model
In ScyllaDB, you design tables around the queries your application will run. A partition key determines which node owns the data, rows within a partition are sorted by a clustering key, and that sort order is fixed at schema creation time. This design is a great fit for key-value agentic systems.

Lightweight transactions
ScyllaDB supports LWTs to provide compare-and-set semantics natively, without client-side locking: INSERT IF NOT EXISTS and UPDATE ... IF ...
This feature enables idempotent checkpoint writes.

Time-to-live
Agentic sessions eventually go stale. ScyllaDB provides a native way to expire old data from your database.

ScyllaDB's role in your agentic infrastructure
Now let's explore specific use cases where ScyllaDB helps you build agentic applications. The following examples use LangGraph (TypeScript) and the community-created ScyllaDBSaver checkpointer. What is a checkpointer? A checkpointer is LangGraph's abstraction for a persistence backend; it is how LangGraph integrates with databases.
Durable conversation memory
One of the main technical problems with agents is handling failures such as:
- network hiccups
- server restarts
- any other reason a process gets killed midway through
When that happens, the in-memory state is gone, and the agent behaves as if the conversation never happened. LangGraph's MemorySaver (built-in in-memory checkpointer) makes this reproducible. Run two turns, discard the saver object, create a new one, and run a third turn:
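Here is a minimal sketch of that experiment in LangGraph's TypeScript API. The graph is a toy: a single node returns a canned reply instead of calling a real LLM.

import { StateGraph, MessagesAnnotation, MemorySaver, START, END } from "@langchain/langgraph";
import { AIMessage, HumanMessage } from "@langchain/core/messages";

// Toy graph: one node that appends a canned assistant reply.
const builder = new StateGraph(MessagesAnnotation)
  .addNode("agent", async (state: typeof MessagesAnnotation.State) => ({
    messages: [new AIMessage(`I have seen ${state.messages.length} messages so far`)],
  }))
  .addEdge(START, "agent")
  .addEdge("agent", END);

const config = { configurable: { thread_id: "demo-thread" } };

// Turns 1 and 2 share one in-memory checkpointer...
let graph = builder.compile({ checkpointer: new MemorySaver() });
await graph.invoke({ messages: [new HumanMessage("first turn")] }, config);
await graph.invoke({ messages: [new HumanMessage("second turn")] }, config);

// ...but once the saver is discarded (think: the process restarted),
// turn 3 starts from an empty state. The history is gone.
graph = builder.compile({ checkpointer: new MemorySaver() });
await graph.invoke({ messages: [new HumanMessage("third turn")] }, config);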
thread_id: a named
conversation/session in LangGraph; all checkpoints for one
conversation share the same thread. With ScyllaDB as the
checkpointer, all three requests operate identically from an
application standpoint. The agent picks up exactly where it left
off because the conversation state lives in the database rather
than in the server process.
ScyllaDBSaver example: the query that loads state on every invoke() is sketched below.
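This sketch uses the Node.js cassandra-driver; the contact point and the non-key columns (checkpoint, metadata) are illustrative, so the exact projection in the package may differ.

import { Client } from "cassandra-driver";

const client = new Client({
  contactPoints: ["scylla.example.internal"],   // hypothetical endpoint
  localDataCenter: "datacenter1",
  keyspace: "langgraph",
});

// Latest checkpoint for one conversation: single partition, first row only.
const LATEST_CHECKPOINT_CQL = `
  SELECT checkpoint_id, checkpoint, metadata
  FROM checkpoints
  WHERE thread_id = ? AND checkpoint_ns = ?
  LIMIT 1`;

const result = await client.execute(
  LATEST_CHECKPOINT_CQL,
  ["demo-thread", ""],            // default (empty) checkpoint namespace
  { prepare: true },
);
console.log(result.first()?.get("checkpoint_id"));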
Note that we don't use ORDER BY or run a full-table scan. There's only one row returned: the most recent checkpoint for the thread. Why does LIMIT 1 return the newest row without an explicit sort? Let's see how the ScyllaDB data model enables this kind of query.
Source:
https://aws.amazon.com/blogs/database/build-durable-ai-agents-with-langgraph-and-amazon-dynamodb/
Query-first schema design: reading the latest checkpoint
LangGraph reads the latest checkpoint on every invoke(). In a busy agent server, that is a read-heavy query pattern. The checkpoints table is defined with a compound primary key:
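Roughly, the DDL looks like this. The key structure matches the article; the payload columns and types are assumptions, and the package's migrate script creates the real table for you.

// client: the cassandra-driver Client from the earlier sketch.
const CREATE_CHECKPOINTS_CQL = `
  CREATE TABLE IF NOT EXISTS langgraph.checkpoints (
    thread_id     text,
    checkpoint_ns text,
    checkpoint_id text,    -- UUIDv6 string; its encoding sorts chronologically
    checkpoint    blob,    -- serialized graph state (assumed)
    metadata      blob,    -- source, step, etc. (assumed)
    source        text,
    step          int,
    PRIMARY KEY ((thread_id, checkpoint_ns), checkpoint_id)
  ) WITH CLUSTERING ORDER BY (checkpoint_id DESC)`;

await client.execute(CREATE_CHECKPOINTS_CQL);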
The partition key is (thread_id, checkpoint_ns). That key determines how your data is partitioned across the ScyllaDB cluster: all checkpoints for a single conversation land in the same partition, so "get all steps for this conversation" never requires cross-node coordination. The clustering key is checkpoint_id DESC, which keeps the rows within each partition sorted by that column in descending order. Because checkpoint_id is a UUIDv6 (which encodes a timestamp in its bit layout), rows are physically stored on disk with the newest checkpoint first. LIMIT 1 on a partition scan reads only the first row; no full scan is required.
Source: https://docs.langchain.com/oss/python/langgraph/persistence
Crash recovery with idempotent writes
A node in an agent graph can fail mid-execution after it has already written some of its output. Without a write-ahead log, the only safe option on retry is to re-run the node from scratch. This may produce duplicates, trigger external side effects, or be expensive for long-running LLM calls. ScyllaDB and LangGraph solve this with a second table, checkpoint_writes, that acts as a write-ahead log at the channel level. Before a checkpoint row is written to checkpoints, each individual channel write is staged in checkpoint_writes using a lightweight transaction:
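A sketch of that staging write; the non-key columns (task_id, idx, channel, value) are assumptions about the package's layout, but the IF NOT EXISTS clause is the point.

// client: the cassandra-driver Client from the earlier sketches.
const checkpointId = "1f06a5e2-0000-6000-8000-000000000000"; // UUIDv6 of the checkpoint (example)
const taskId = "task-1";
const serializedValue = Buffer.from(JSON.stringify({ role: "ai", content: "partial output" }));

// Stage one channel write in the write-ahead log. Re-running this on a retry
// is a no-op: the LWT only applies if the row does not already exist.
const STAGE_WRITE_CQL = `
  INSERT INTO langgraph.checkpoint_writes
    (thread_id, checkpoint_ns, checkpoint_id, task_id, idx, channel, value)
  VALUES (?, ?, ?, ?, ?, ?, ?)
  IF NOT EXISTS`;

const rs = await client.execute(
  STAGE_WRITE_CQL,
  ["demo-thread", "", checkpointId, taskId, 0, "messages", serializedValue],
  { prepare: true },
);
// rs.first()["[applied]"] is false when the row already existed (i.e., a retry).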
IF NOT EXISTS makes the insert idempotent. Here's what happens if the server crashes after three of five channel writes have landed and then restarts:
1. LangGraph loads the latest checkpoints row.
2. It loads the pending checkpoint_writes for that checkpoint ID.
3. It finds the three completed writes.
4. It resumes from there without re-running successful steps.
The partition key on checkpoint_writes is (thread_id, checkpoint_ns, checkpoint_id). All pending writes for a single checkpoint are in the same partition, so "load all pending writes for checkpoint X" is a single-partition scan, not a cross-cluster lookup. The two tables serve different query patterns; keeping them separate makes both queries efficient.

Time-travel and conversation history
LangGraph exposes historical snapshots through the checkpointer's list() method:
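A sketch of walking that history; checkpointer is the ScyllaDBSaver (or any LangGraph checkpointer) constructed elsewhere, and list() is an async iterator in the TypeScript API.

const config = { configurable: { thread_id: "demo-thread" } };

// Newest-first history of the conversation.
for await (const tuple of checkpointer.list(config)) {
  console.log(
    tuple.checkpoint.id,      // checkpoint_id for this step
    tuple.metadata?.source,   // "input" | "loop" | "update"
    tuple.metadata?.step,     // step number within the run
  );
}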
Each tuple is a full CheckpointTuple: the serialized state at that step, the metadata (source, step number, what changed), and the config needed to resume from that exact point. That last part is what enables time-travel: pass a past checkpoint_id as the starting configuration and LangGraph replays from there, branching the conversation into an alternative trajectory without modifying the original history. Here's the underlying ScyllaDB query:
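Under the same schema assumptions as before, it is roughly:

// Full history for one thread: the same single partition, no LIMIT, no ORDER BY.
const HISTORY_CQL = `
  SELECT checkpoint_id, checkpoint, metadata
  FROM langgraph.checkpoints
  WHERE thread_id = ? AND checkpoint_ns = ?`;

const history = await client.execute(HISTORY_CQL, ["demo-thread", ""], { prepare: true });
// Rows arrive newest-first thanks to CLUSTERING ORDER BY (checkpoint_id DESC).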
You get all rows for one thread in one partition, sorted newest-first. This is the same partition that hosts the latest-checkpoint read. No additional indexes are required for the history use case. The source field indicates what kind of step produced it:
- "input" (a user message was ingested, before any node ran)
- "loop" (a node executed)
- "update" (state was patched directly via graph.updateState())
Secondary indexes on source and step allow filtering across all threads when needed:
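Assuming source and step are materialized as columns (as the article implies), the indexes would look roughly like this:

// client: the cassandra-driver Client from the earlier sketches.
await client.execute("CREATE INDEX IF NOT EXISTS ON langgraph.checkpoints (source)");
await client.execute("CREATE INDEX IF NOT EXISTS ON langgraph.checkpoints (step)");

// Now a query can filter across all threads, e.g.:
// SELECT thread_id, checkpoint_id FROM langgraph.checkpoints WHERE source = 'update';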
Auto-expire data with time-to-live
Production agent deployments accumulate checkpoint data continuously. A customer support agent with 10,000 active threads, each with a 10-turn history, generates tens of thousands of checkpoint rows. Sessions eventually go stale. You might decide, for example, that a thread abandoned by the user after one message can be deleted and stored elsewhere after a certain period of time. In ScyllaDB, TTL is part of the data model.
You attach it directly to the inserted row at write time:
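For example, with the same illustrative column list as before and a 24-hour expiry:

// Every row written with this statement is removed automatically once the TTL elapses.
const INSERT_WITH_TTL_CQL = `
  INSERT INTO langgraph.checkpoints
    (thread_id, checkpoint_ns, checkpoint_id, checkpoint, metadata)
  VALUES (?, ?, ?, ?, ?)
  USING TTL 86400`;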
USING TTL 86400 tells ScyllaDB to delete this row after 24 hours. The same TTL clause appears on checkpoint_writes in the same write batch. The ScyllaDBSaver accepts a ttlConfig parameter that applies this clause to every write:
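Something along these lines; the exact constructor options belong to the community package, so treat this as illustrative.

import { ScyllaDBSaver } from "@gbyte.tech/langgraph-checkpoint-scylladb";

// client: the cassandra-driver Client from the earlier sketches.
const checkpointer = new ScyllaDBSaver(client, {
  ttlConfig: { defaultTTLSeconds: 86400 },   // every new checkpoint row expires after 24h
});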
Change defaultTTLSeconds and every subsequent write picks up the new expiry. No migration required.
Integrate ScyllaDB into your LangGraph project
To use ScyllaDB as a persistent store in your LangGraph application, you need to install the ScyllaDB checkpointer. This package will handle the schema migration and all subsequent CQL queries for you.

Install the package:
npm install @gbyte.tech/langgraph-checkpoint-scylladb

Create the schema:
npm run migrate
# runs: CREATE KEYSPACE IF NOT EXISTS langgraph ...
#       CREATE TABLE IF NOT EXISTS langgraph.checkpoints ...
#       CREATE TABLE IF NOT EXISTS langgraph.checkpoint_writes ...

Wire the checkpointer into your graph:
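A sketch of the wiring, again with an illustrative ScyllaDBSaver constructor and a stub node in place of a real LLM call:

import { Client } from "cassandra-driver";
import { StateGraph, MessagesAnnotation, START, END } from "@langchain/langgraph";
import { AIMessage, HumanMessage } from "@langchain/core/messages";
import { ScyllaDBSaver } from "@gbyte.tech/langgraph-checkpoint-scylladb";

const scylla = new Client({
  contactPoints: ["scylla.example.internal"],   // hypothetical endpoint
  localDataCenter: "datacenter1",
  keyspace: "langgraph",
});

// Durable checkpointer instead of MemorySaver.
const checkpointer = new ScyllaDBSaver(scylla, {
  ttlConfig: { defaultTTLSeconds: 86400 },
});

// Stub node; a real agent would call the model and tools here.
const callModel = async (state: typeof MessagesAnnotation.State) => ({
  messages: [new AIMessage("model reply goes here")],
});

const graph = new StateGraph(MessagesAnnotation)
  .addNode("agent", callModel)
  .addEdge(START, "agent")
  .addEdge("agent", END)
  .compile({ checkpointer });

// State now survives restarts: same thread_id, same conversation.
await graph.invoke(
  { messages: [new HumanMessage("hello again")] },
  { configurable: { thread_id: "customer-42" } },
);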
Wrapping up
By combining LangGraph with ScyllaDB's built-in durability and high availability, you move from fragile, stateful processes to resilient agent systems. Restarts, retries, or lost context won't be a problem because your architecture treats failure as a normal condition and continues seamlessly. This shift simplifies your infrastructure and enables more ambitious, long-running agent workflows to operate reliably at scale.

Learn more about ScyllaDB and agentic applications:
- Clone the example application
- Read how others use ScyllaDB for AI use cases
- Sign up for ScyllaDB Cloud

Why We Changed ScyllaDB's Approach to Repair
By focusing solely on unrepaired data, we made ScyllaDB's incremental repair 10X faster

Maintaining data consistency in large-scale distributed databases often comes at a high performance cost. As clusters grow and data volume expands rapidly, traditional repair methods often become a bottleneck. At ScyllaDB, we needed a way to make consistency checks faster and more efficient. In response, we implemented Incremental Repair for ScyllaDB's tablets data distribution. It's an optimization designed to minimize repair overhead by focusing solely on unrepaired data. This blog explains what it is, how we implemented it, and the performance gains it delivered.

What is Incremental Repair?
Before we talk about the incremental part, let's look at what repair actually involves in a distributed system context. In a system like ScyllaDB, repair is an essential maintenance operation. Even with the best hardware, replicas can drift due to network hiccups, disk failures, or load. Repair detects mismatches between replicas and fixes them to ensure every node has the latest, correct version of the data. It's a safety net that guarantees data consistency across the cluster.

Incremental repair is a new feature in ScyllaDB (currently available for tablets-based tables). The idea behind it is simple: why worry about data that we have already repaired? Traditional repair scans everything. Incremental repair targets only unrepaired data. Technically, this is achieved by splitting SSTables into two distinct sets: repaired and unrepaired. The repaired set is consistent and synchronized, while the unrepaired set is potentially inconsistent and requires validation.

We created two modes of incremental repair: incremental and full. In incremental mode, only SSTables in the unrepaired set are selected for the repair process. Once the repair completes, those SSTables are marked as repaired and promoted into the repaired set. This should be your default mode because it significantly minimizes the IO and CPU required for repair. In full mode, the incremental logic is still active, but the selection criteria change. Instead of skipping the repaired set, it selects all data (both repaired and unrepaired). Once the process is finished, all participating SSTables are marked as repaired. Think of this as a "trust but verify" mode. Use this when you want to revalidate the entire data set from scratch while still using the incremental infrastructure.

Finally, there's disabled mode, where the incremental repair logic is turned off. In this case, the repair behavior is exactly the same as in previous versions of ScyllaDB, before the incremental repair feature was introduced. It selects both repaired and unrepaired SSTables for the repair process. After repair completes, the system does not mark SSTables as repaired. This is useful for scenarios where you want to run a repair without affecting the metadata that tracks the repair state.

Incremental repair is integrated directly into the existing workflow, with three options:
- nodetool lets you use the standard nodetool cluster repair command with incremental flags.
- ScyllaDB Manager also supports the same flags for automated scheduling.
- A REST API makes incremental repair available to teams building custom tools.

Making incremental repair work (internals)
To make incremental repair work, we had to solve a classic distributed systems problem: state consistency. We need to know exactly which data is repaired, and that state must survive crashes.
So, we track this using two decoupled markers:
- repaired_at, a number stored directly in the SSTable metadata on disk.
- sstables_repaired_at, a value stored in our system tables.

The logic follows a two-phase commit model.

Phase one: prepare. First, we run the row-level repair. Once that is finished, we update the repaired_at value in the SSTables. At this point, the system still treats them as unrepaired because they haven't been activated yet.

Phase two: commit. After every node confirms the row-level repair and the updated repaired_at value, we update the sstables_repaired_at value in the system table. We define an SSTable as repaired if and only if its repaired_at is not zero and is less than or equal to the sstables_repaired_at value in the system table. If a node crashes between phases, the mismatch between the file and the system table ensures that we don't accidentally skip data that wasn't 100% verified.

Under normal operations, you don't need to run full repairs regularly. Still, it's needed occasionally. If you experience significant loss of SSTables (perhaps due to a disk failure), then a full repair is required to reconstruct missing data across the cluster. In practice, we suggest one full repair after a long series of incremental runs. This gives you an extra layer of security, even if it is not strictly required.

This brings us to a critical challenge: compaction. If we let compaction mix repaired and unrepaired data, the repaired status would be lost, and we'd need to re-repair everything. To solve this, we introduce compaction barriers. We effectively split the tablets into two independent worlds:
- The unrepaired set, where all new writes and memtable flushes go. Compaction only merges unrepaired SSTables with other unrepaired ones.
- The repaired set, where SSTables are compacted together to maintain and optimize the read path.
The rule is that a compaction strategy can never merge an unrepaired SSTable into the repaired set. The only bridge between these sets is repair. This prevents potentially inconsistent data from polluting the repaired set.

With this new design, compaction now has a dependency on repair. As we run repairs, the repaired sets grow. But because incremental repair is so much lighter than traditional methods, we encourage you to run it much more frequently. We are currently working on an automatic repair feature that will trigger those runs at the very moment the unrepaired set grows too large. That should keep your unrepaired window as small as possible.

The efficiency of incremental repair depends on your workload:
- Update heavy: If you have lots of overwrites or deletes, new data will invalidate older repaired SSTables. In extreme cases, there could be so much data to repair that it looks a lot like full repair.
- Append heavy: This is a perfect use case, like IoT or logging. Since new data doesn't invalidate old data, the repaired set stays consistent and untouched. This should provide nice performance gains.
Even in update-heavy cases, you don't lose anything by choosing incremental repair. In the worst case, it performs the same amount of work as a full repair would. In almost all real-world scenarios, you gain significant improvements without any trade-off with respect to consistency.

Performance Improvements
To understand how this approach translates to performance improvements, let's model the improvement ratio. Say that n is the size of your new unrepaired data and E is the size of your existing repaired data. A full repair works on E plus n. Incremental repair works only on n.
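Restating that estimate as a formula (nothing new here, just the same quantities in symbols):

\[
\text{ratio} \;=\; \frac{\text{incremental repair work}}{\text{full repair work}} \;=\; \frac{n}{E + n}
\]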
So, the improvement ratio equals n divided by (E plus n). If you ingest 100 gigabytes a day on a 10-terabyte node, you are repairing only 1% of the data instead of 100%. This is an order-of-magnitude shift in overhead.

Our testing confirms that theory. We ran multiple insert and repair cycles. In the first round, nearly all the data was new, so the repair time was almost the same as with full repair. In the second round, with a 50/50 split, the time dropped by half. In the third round, as the repaired set became dominant, incremental repair took only 35% of the time that a full repair would have taken.

To wrap up, incremental repair for tablets is faster, lighter, and more efficient. It is a foundational step toward our goal of a fully autonomous database that handles its own maintenance. By adopting this feature, you reduce operational burden and ensure your cluster remains consistent without repair storms.

Stop Answering the Same Question Twice: Interval-Aware Caching for Druid at Netflix Scale
By Ben Sykes
In a previous post, we described how Netflix uses Apache Druid to ingest millions of events per second and query trillions of rows, providing the real-time insights needed to ensure a high-quality experience for our members. Since that post, our scale has grown considerably.
With our database holding over 10 trillion rows and regularly ingesting up to 15 million events per second, the value of our real-time data is undeniable. But this massive scale introduced a new challenge: queries. The live show monitoring, dashboards, automated alerting, canary analysis, and A/B test monitoring that are built on top of Druid became so heavily relied upon that the repetitive query load started to become a scaling concern in itself.
This post describes an experimental caching layer we built to address this problem, and the trade-offs we chose to accept.
The Problem
Our internal dashboards are heavily used for real-time monitoring, especially during high-profile live shows or global launches. A typical dashboard has 10+ charts, each triggering one or more Druid queries; one popular dashboard with 26 charts and stats generates 64 queries per load. When dozens of engineers view the same dashboards and metrics for the same event, the query volume quickly becomes unmanageable.
Take the popular dashboard above: 64 queries per load, refreshing every 10 seconds, viewed by 30 people. That’s 192 queries per second from one dashboard, mostly for nearly identical data. We still need Druid capacity for automated alerting, canary analysis, and ad-hoc queries. And because these dashboards request a rolling last-few-hours window, each refresh changes slightly as the time range advances.
Druid's built-in caches, both the full-result cache and the per-segment cache, are effective. But neither is designed to handle the continuous, overlapping time-window shifts inherent to rolling-window dashboards. The full-result cache misses for two reasons.
- If the time window shifts even slightly, the query is different, so it’s a cache miss.
- Druid deliberately refuses to cache results that involve realtime segments (those still being indexed), because it values deterministic, stable cache results and query correctness over a higher cache hit rate.
The per-segment cache does help avoid redundant scans on historical nodes, but we still need to collect those cached segment results from each data node and merge them in the brokers with data from the realtime nodes for every query.
During major shows, rolling-window dashboards can generate a flood of near-duplicate queries that Druid’s caches mostly miss, creating heavy redundant load. At our scale, solving this by simply adding more hardware is prohibitively expensive.
We needed a smarter approach.
The Insight
When a dashboard requests the last 3 hours of data, the vast majority of that data, everything except the most recent few minutes, is already settled. The data from 2 hours ago won’t change.
What if we could remember the older portions of the result and only ask Druid for the part that’s actually new?
This is the core idea behind a new caching service that understands the structure of Druid queries and serves previously-seen results from cache while fetching only the freshest portion from Druid.

A Deliberate Trade-Off
Before diving into the implementation, it’s worth being explicit about the trade-off we’re making. Caching query results introduces some staleness, specifically, up to 5 seconds for the newest data. This is acceptable for most of our operational dashboards, which refresh every 10 to 30 seconds. In practice, many of our queries already set an end time of now-1m or now-5s to avoid the “flappy tail” that can occur with currently-arriving data.
Since our end-to-end data pipeline latency is typically under 5 seconds at P90, a 5-second cache TTL on the freshest data introduces negligible additional staleness on top of what’s already inherent in the system. We decided it was better to accept this small amount of staleness in exchange for significantly lower query load on Druid. But a 5s cache on its own is not very useful.
Exponential TTLs
Not all data points are equally trustworthy. In real-time analytics, there’s a well-known late-arriving data problem. Events can arrive out of order or be delayed in the ingestion pipeline. A data point from 30 seconds ago might still change as late-arriving events trickle in. A data point from 30 minutes ago is almost certainly final.
We use this observation to set cache TTLs that increase exponentially with the age of the data. Data less than 2 minutes old gets a minimum TTL of 5 seconds. After that, the TTL doubles for each additional minute of age: 10 seconds at 2 minutes old, 20 seconds at 3 minutes, 40 seconds at 4 minutes, and so on, up to a maximum TTL of 1 hour.
The effect is that fresh data cycles through the cache rapidly, so any corrections from late-arriving events in the most recent couple of minutes are picked up quickly. Older data lingers much longer, because our confidence in its accuracy grows with time.
For a 3-hour rolling window, the exponential TTL ensures the vast majority of the query is served from the cache, leaving Druid to only scan the most recent, unsettled data.
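A sketch of that policy (not the production code); the constants come straight from the description above.

// TTL for a cached bucket, based on how old the bucket's data is.
// Less than 2 min old -> 5 s; then doubling per extra minute of age, capped at 1 h.
function bucketTtlSeconds(bucketEndMillis: number, nowMillis: number = Date.now()): number {
  const MIN_TTL = 5;      // seconds
  const MAX_TTL = 3600;   // seconds (1 hour)
  const ageMinutes = Math.floor((nowMillis - bucketEndMillis) / 60_000);
  if (ageMinutes < 2) return MIN_TTL;
  return Math.min(MAX_TTL, MIN_TTL * 2 ** (ageMinutes - 1));
}

// e.g. 2 min old -> 10 s, 3 min -> 20 s, 4 min -> 40 s, and capped at 3600 s from ~11 min on.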

Bucketing
If we were to use a single-level cache key for the query and interval, similar to Druid’s existing result-level cache, we wouldn’t be able to extract only the relevant time range from cached results. A shifted window means a different key, which means a cache miss.
Instead, we use a map-of-maps. The top-level key is the query hash without the time interval; the inner keys are timestamps bucketed to the query granularity (or 1 minute, whichever is larger) and encoded as big-endian bytes so lexicographic order matches time. This enables efficient range scans, such as fetching all cached buckets between times A and B for a query hash. A 3-hour query at 1-minute granularity becomes 180 independent cached buckets, each with its own TTL; when the window shifts (e.g., 30 seconds later), we reuse most buckets from cache and only query Druid for the new data.
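A sketch of the inner-key encoding in Node.js, following the rules above:

// Inner cache key: the bucket's start time, floored to the bucket size and
// encoded big-endian so that byte-wise ordering equals chronological ordering.
function bucketKey(timestampMillis: number, granularityMillis: number): Buffer {
  const bucketMillis = Math.max(granularityMillis, 60_000);   // at least 1 minute
  const bucketStart = Math.floor(timestampMillis / bucketMillis) * bucketMillis;
  const key = Buffer.alloc(8);
  key.writeBigUInt64BE(BigInt(bucketStart));
  return key;
}

// A 3-hour window at 1-minute granularity maps to 180 such keys under one query hash.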

How It Works
Today, the cache runs as an external service integrated transparently by intercepting requests at the Druid Router and redirecting them to the cache. If the cache fully satisfies a request, it returns the result; otherwise it shrinks the time interval to the uncached portion and calls back into the Router, bypassing the redirect to query Druid normally. Non-cached requests (e.g., metadata queries or queries without time group-bys) pass straight through to Druid unchanged.
This intercepting proxy design allows us to enable or disable caching without any client changes and is a key to its adoption. We see this setup as temporary while we work out a way to better integrate this capability into Druid more natively.
When a cacheable query arrives, meaning one that groups by time (timeseries, groupBy), the cache performs the following steps.
Parsing and Hashing. We parse each incoming query to extract the time interval, granularity, and structure, then compute a SHA-256 hash of the query with the time interval and parts of the context removed. That hash is the cache key: it encodes what is being asked (datasource, filters, aggregations, granularity) but not when, so the same logical query over different overlapping time windows maps to the same cache entry. There are some context properties that can alter the response structure or contents, so these are included in the cache-key.
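A sketch of that derivation with Node's crypto module; which context properties survive the strip is deployment-specific, so the allow-list here is illustrative.

import { createHash } from "node:crypto";

// Hash the query *without* its time interval so overlapping windows share a key.
function cacheKeyFor(druidQuery: Record<string, unknown>): string {
  const { intervals, context, ...rest } = druidQuery;
  const keyMaterial = {
    ...rest,
    // Keep only context properties that change the shape or contents of the
    // response (illustrative allow-list).
    context: { skipEmptyBuckets: (context as any)?.skipEmptyBuckets },
  };
  // A real implementation would canonicalize key order before hashing.
  return createHash("sha256").update(JSON.stringify(keyMaterial)).digest("hex");
}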

Cache Lookup. Using the cache key, we fetch cached points within the requested range, but only if they’re contiguous from the start. Because bucket TTLs can expire unevenly, gaps can appear; when we hit a gap, we stop and fetch all newer data from Druid. This guarantees a complete, unbroken result set while sending at most one Druid query, rather than “filling gaps” with multiple small, fragmented queries that would increase Druid load.
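The gap rule, roughly: take cached buckets only while they form an unbroken run from the start of the requested window, and hand everything from the first gap onward to Druid.

interface CachedBucket { start: number; rows: unknown[] }   // start in epoch millis

function splitAtFirstGap(
  expectedStarts: number[],            // every bucket start in the requested window, ascending
  cached: Map<number, CachedBucket>,   // result of the range scan
): { fromCache: CachedBucket[]; druidFrom?: number } {
  const fromCache: CachedBucket[] = [];
  for (const start of expectedStarts) {
    const hit = cached.get(start);
    if (!hit) return { fromCache, druidFrom: start };   // first gap: stop, query Druid from here
    fromCache.push(hit);
  }
  return { fromCache };                                 // full cache hit
}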
Fetching the Missing Tail. On a partial cache hit (e.g., 2h 50m of a 3h window), we rebuild the query with a narrowed interval for the missing 10 minutes and send only that to Druid. Since Druid then scans just the recent segments for a small time range, the query is usually faster and cheaper than the original.
Combining. The cached data and fresh data are concatenated, sorted by timestamp, and returned to the client. From the client’s perspective, the response looks identical to what Druid would have returned, same JSON format, same fields.
Asynchronous Caching. The fresh data from Druid is parsed into individual time-granularity buckets and written back to the cache asynchronously, so we don’t add latency to the response path.

Negative Caching
Some metrics are sparse. Certain time buckets may genuinely have no data. Without special handling, the cache would treat these empty buckets as gaps and re-query Druid for them every time.
We handle this by caching empty sentinel values for time buckets where Druid returned no data. Our gap-detection logic recognizes these empty entries as valid cached data rather than missing data, preventing needless re-queries for naturally sparse metrics.
However, we’re careful not to negative-cache trailing empty buckets. If a query returns data up to minute 45 and nothing after, we only cache empty entries for gaps between data points, not after the last one. This avoids incorrectly caching “no data” for time periods where events simply haven’t arrived yet, which would exacerbate the chart delays of late arriving data.
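A sketch of that rule, where rowsByStart maps each requested bucket to whatever rows Druid returned for it:

// Decide which empty buckets are safe to negative-cache: only gaps *between*
// buckets that have data, never the trailing run after the last data point.
function emptiesToCache(expectedStarts: number[], rowsByStart: Map<number, unknown[]>): number[] {
  const withData = expectedStarts.filter((s) => (rowsByStart.get(s)?.length ?? 0) > 0);
  if (withData.length === 0) return [];
  const lastWithData = withData[withData.length - 1];
  return expectedStarts.filter(
    (s) => s < lastWithData && (rowsByStart.get(s)?.length ?? 0) === 0,
  );
}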
The Storage Layer
For the backing store, we use Netflix’s Key-Value Data Abstraction Layer (KVDAL), backed by Cassandra. KVDAL provides a two-level map abstraction, a natural fit for our needs. The outer key is the query hash, and the inner keys are timestamps. Crucially, KVDAL supports independent TTLs on each inner key-value pair, eliminating the need for us to manage cache eviction manually.
This two-level structure gives us efficient range queries over the inner keys, which is exactly what we need for partial cache lookups: “give me all cached buckets between time A and time B for query hash X.”
Results
The biggest win is during high-volume events (e.g., live shows): when many users view the same dashboards, the cache serves most identical queries as full hits, so the query rate reaching Druid is essentially the same with 1 viewer or 100. The scaling bottleneck moves from Druid’s query capacity to the much cheaper-to-scale cache, and with ~5.5 ms P90 cache responses, dashboards load faster for everyone.
On a typical day, 82% of real user queries get at least a partial cache hit, and 84% of result data is served from cache. As a result, the queries that reach Druid scan much narrower time ranges, touching fewer segments and processing less data, freeing Druid to focus on aggregating the newest data instead of repeatedly re-querying historical segments.

An experiment validated this, showing about a 33% drop in queries to Druid and a 66% improvement in overall P90 query times. It also cut result bytes and segments queried, and in some cases, enabling the cache reduced result bytes by more than 14x. Caveat: the size of these gains depends heavily on how similar and repetitive the query workload is.

Looking Ahead
This caching layer is still experimental, but results are promising and we’re exploring next steps. We’ve added partial support for templated SQL so dashboard tools can benefit without writing native Druid queries.
Longer term, we’d like interval-aware caching to be built into Druid: an external proxy adds infrastructure to manage, extra network hops, and workarounds (like SQL templating) to extract intervals. Implemented inside Druid, it could be more efficient, with direct access to the query planner and segment metadata, and benefit the broader community without custom infrastructure. We’d likely ship it as an opt-in, configurable, result-level cache in the Brokers, with metrics to tune TTLs and measure effectiveness. Please leave a comment if you have a use-case that could benefit from this feature.
More broadly, this strategy, splitting time-series results into independently cached, granularity-aligned buckets with age-based exponential TTLs, isn’t Druid-specific and could apply to any time-series database with frequent overlapping-window queries.
Summary
As more Netflix teams rely on real-time analytics, query volume grows too. Dashboards are essential at our scale, but their popularity can become a scaling bottleneck. By inserting an intelligent cache between dashboards and Druid, one that understands query structure, breaks results into granularity-aligned buckets, and trades a small amount of staleness for much lower Druid load, we’ve increased query capacity without scaling infrastructure proportionally, and hope to deliver these benefits to the Druid community soon as a built-in Druid feature.
Sometimes the best way to handle a flood of queries is to stop answering the same question twice.
Synchronizing the Senses: Powering Multimodal Intelligence for Video Search
By Meenakshi Jindal and Munya Marazanye
Today’s filmmakers capture more footage than ever to maximize their creative options, often generating hundreds, if not thousands, of hours of raw material per season or franchise. Extracting the vital moments needed to craft compelling storylines from this sheer volume of media is a notoriously slow and punishing process. When editorial teams cannot surface these key moments quickly, creative momentum stalls and severe fatigue sets in.
Meanwhile, the broader search landscape is undergoing a profound transformation. We are moving beyond simple keyword matching toward AI-driven systems capable of understanding deep context and intent. Yet, while these advances have revolutionized text and image retrieval, searching through video, the richest medium for storytelling, remains a daunting “needle in a haystack” challenge.
The solution to this bottleneck cannot rely on a single algorithm. Instead, it demands orchestrating an expansive ensemble of specialized models: tools that identify specific characters, map visual environments, and parse nuanced dialogue. The ultimate challenge lies in unifying these heterogeneous signals, textual labels, and high-dimensional vectors into a cohesive, real-time intelligence. One that cuts through the noise and responds to complex queries at the speed of thought, truly empowering the creative process.
Why Video Search is Deceptively Complex
Since video is a multi-layered medium, building an effective search engine required us to overcome significant technical bottlenecks. Multi-modal search is exponentially more complex than traditional indexing: it demands the unification of outputs from multiple specialized models, each analyzing a different facet of the content to generate its own distinct metadata. The ultimate challenge lies in harmonizing these heterogeneous data streams to support rich, multi-dimensional queries in real time.
- Unifying the Timeline: To ensure critical moments aren't lost across scene boundaries, each model segments the video into overlapping intervals. The resulting metadata varies wildly, ranging from discrete text-based object labels to dense vector embeddings. Synchronizing these disjointed, multi-modal timelines into a unified chronological map presents a massive computational hurdle.
- Processing at Scale: A standard 2,000-hour production archive can contain over 216 million frames. When processed through an ensemble of specialized models, this baseline explodes into billions of multi-layered data points. Storing, aligning, and intersecting this staggering volume of records while maintaining sub-second query latency far exceeds the capabilities of traditional database architectures.
- Surfacing the Best Moments: Surface-level mathematical similarity is not enough to identify the most relevant clip. Because continuous shots naturally generate thousands of visually redundant candidates, the system must dynamically cluster and deduplicate results to surface the singular best match for a given scene. To achieve this, effective ranking relies on a sophisticated hybrid scoring engine that weighs symbolic text matches against semantic vector embeddings, ensuring both precision and interpretability.
- Zero-Friction Search: For filmmakers, search is a stream-of-consciousness process, and a ten-second delay can disrupt the creative flow. Because sequential scanning of raw footage is fundamentally unscalable, our architecture is built to navigate and correlate billions of vectors and metadata records efficiently, operating at the speed of thought.
The Ingestion and Fusion Pipeline
To ensure system resilience and scalability, the transition from raw model output to searchable intelligence follows a decoupled, three-stage process:
1. Transactional Persistence
Raw annotations are ingested via high-availability pipelines and stored in our annotation service, which leverages Apache Cassandra for distributed storage. This stage strictly prioritizes data integrity and high-speed write throughput, guaranteeing that every piece of model output is safely captured.
{
  "type": "SCENE_SEARCH",
  "time_range": {
    "start_time_ns": 4000000000,
    "end_time_ns": 9000000000
  },
  "embedding_vector": [
    -0.036, -0.33, -0.29 ...
  ],
  "label": "kitchen",
  "confidence_score": 0.72
}
Figure 2: Sample Scene Search Model Annotation Output
2. Offline Data Fusion
Once the annotation service securely persists the raw data, the system publishes an event via Apache Kafka to trigger an asynchronous processing job. Serving as the architecture’s central logic layer, this offline pipeline handles the heavy computational lifting out-of-band. It performs precise temporal intersections, fusing overlapping annotations from disparate models into cohesive, unified records that empower complex, multi-dimensional queries.
Cleanly decoupling these intensive processing tasks from the ingestion pipeline guarantees that complex data intersections never bottleneck real-time intake. As a result, the system maintains maximum uptime and peak responsiveness, even when processing the massive scale of the Netflix media catalog.
To achieve this intersection at scale, the offline pipeline normalizes disparate model outputs by mapping them into fixed-size temporal buckets. This discretization process unfolds in three steps:
- Bucket Mapping: Continuous detections are segmented into discrete intervals. For example, if a model detects a character “Joey” from seconds 2 through 8, the pipeline maps this continuous span of frames into seven distinct one-second buckets.
- Annotation Intersection: When multiple models generate annotations for the same temporal bucket, such as character recognition “Joey” and scene detection “kitchen” overlapping in second 4, the system fuses them into a single, comprehensive record.
- Optimized Persistence: These newly enriched records are written back to Cassandra as distinct entities. This creates a highly optimized, second-by-second index of multi-modal intersections, perfectly associating every fused annotation with its source asset.
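A simplified sketch of the first two steps, using the field names from the sample records; the real pipeline runs offline at catalog scale.

interface Annotation {
  annotation_id: string;
  annotation_type: string;   // e.g. CHARACTER_SEARCH, SCENE_SEARCH
  label?: string;
  time_range: { start_time_ns: number; end_time_ns: number };
}

const ONE_SECOND_NS = 1_000_000_000;

// Step 1 - bucket mapping: explode a continuous detection into one-second buckets.
function bucketsFor(a: Annotation): number[] {
  const first = Math.floor(a.time_range.start_time_ns / ONE_SECOND_NS);
  const last = Math.ceil(a.time_range.end_time_ns / ONE_SECOND_NS);
  const starts: number[] = [];
  for (let s = first; s < last; s++) starts.push(s * ONE_SECOND_NS);
  return starts;
}

// Step 2 - annotation intersection: group every annotation that touches a bucket.
function fuse(annotations: Annotation[]): Map<number, Annotation[]> {
  const byBucket = new Map<number, Annotation[]>();
  for (const a of annotations) {
    for (const start of bucketsFor(a)) {
      const existing = byBucket.get(start) ?? [];
      existing.push(a);
      byBucket.set(start, existing);
    }
  }
  return byBucket;   // each entry becomes one fused record written back to Cassandra
}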
The following record shows the overlap of the character “Joey” and scene “kitchen” annotations during a 4 to 5 second window in a video asset:
{
  "associated_ids": {
    "MOVIE_ID": "81686010",
    "ASSET_ID": "01325120-7482-11ef-b66f-0eb58bc8a0ad"
  },
  "time_bucket_start_ns": 4000000000,
  "time_bucket_end_ns": 5000000000,
  "source_annotations": [
    {
      "annotation_id": "7f5959b4-5ec7-11f0-b475-122953903c43",
      "annotation_type": "CHARACTER_SEARCH",
      "label": "Joey",
      "time_range": {
        "start_time_ns": 2000000000,
        "end_time_ns": 8000000000
      }
    },
    {
      "annotation_id": "c9d59338-842c-11f0-91de-12433798cf4d",
      "annotation_type": "SCENE_SEARCH",
      "time_range": {
        "start_time_ns": 4000000000,
        "end_time_ns": 9000000000
      },
      "label": "kitchen",
      "embedding_vector": [
        0.9001, 0.00123 ....
      ]
    }
  ]
}
Figure 4: Sample Intersection Record for Character + Scene Search
3. Indexing for Real Time Search
Once the enriched temporal buckets are securely persisted in Cassandra, a subsequent event triggers their ingestion into Elasticsearch.
To guarantee absolute data consistency, the pipeline executes upsert operations using a composite key (asset ID + time bucket) as the unique document identifier. If a temporal bucket already exists for a specific second of video, perhaps populated by an earlier model run, the system intelligently updates the existing record rather than generating a duplicate. This mechanism establishes a single, unified source of truth for every second of footage.
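A sketch of that upsert against the Elasticsearch REST API; the index name, endpoint, and document shape are placeholders rather than our production values.

// Composite document ID: one document per asset per temporal bucket.
const assetId = "01325120-7482-11ef-b66f-0eb58bc8a0ad";
const bucketStartNs = 4_000_000_000;
const docId = `${assetId}:${bucketStartNs}`;

// doc_as_upsert: create the bucket document if it's new, or update it in place
// if an earlier model run already wrote it.
await fetch(`https://search.example.internal/video-buckets/_update/${encodeURIComponent(docId)}`, {
  method: "POST",
  headers: { "content-type": "application/json" },
  body: JSON.stringify({
    doc: {
      time_bucket_start_ns: bucketStartNs,
      source_annotations: [{ annotation_type: "SCENE_SEARCH", label: "kitchen" }],
    },
    doc_as_upsert: true,
  }),
});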
Architecturally, the pipeline structures each temporal bucket as a nested document. The root level captures the overarching asset context, while associated child documents house the specific, multi-modal annotation data. This hierarchical data model is precisely what empowers users to execute highly efficient, cross-annotation queries at scale.
Multimodal Discovery and Result Ranking
The search service provides a high-performance interface for real-time discovery across the global Netflix catalog. Upon receiving a user request, the system immediately initiates a query preprocessing phase, generating a structured execution plan through three core steps:
- Query Type Detection: Dynamically categorizes the incoming request to route it down the most efficient retrieval path.
- Filter Extraction: Isolates specific semantic constraints such as character names, physical objects, or environmental contexts to rapidly narrow the candidate pool.
- Vector Transformation: Converts raw text into high-dimensional, model-specific embeddings to enable deep, context-aware semantic matching.
Once generated, the system compiles this structured plan into a highly optimized Elasticsearch query, executing it directly against the pre-fused temporal buckets to deliver instantaneous, frame-accurate results.
Fine-Tuning Semantic Search
To support the diverse workflows of different production teams, the system provides fine-grained control over search behavior through configurable parameters:
- Exact vs. Approximate Search: Users can toggle between exact k-Nearest Neighbors (k-NN) for uncompromising precision, and Approximate Nearest Neighbor (ANN) algorithms (such as HNSW) to maintain blazing speed when querying massive datasets.
- Dynamic Similarity Metrics: The system supports multiple distance calculations, including cosine similarity and Euclidean distance. Because different models shape their high-dimensional vector spaces distinctly based on their underlying training architectures, the flexibility to swap metrics ensures that mathematical closeness perfectly translates to true semantic relevance.
- Confidence Thresholding: By establishing strict minimum score boundaries for results, users can actively prune the long tail of low-probability matches. This aggressively filters out visual noise, guaranteeing that creative teams are not distracted and only review results that meet a rigorous standard of mathematical similarity.
Textual Analysis & Linguistic Precision
To handle the deep nuances of dialogue-heavy searches, such as isolating a character’s exact catchphrase amidst thousands of hours of speech, we implement a sophisticated text analysis strategy within Elasticsearch. This ensures that conversational context is captured and indexed accurately.
- Phrase & Proximity Matching: To respect the narrative weight of specific lines (e.g., “Friends don’t lie” in Stranger Things), we leverage match-phrase queries with a configurable slop parameter. This guarantees the system retrieves the correct scene even if the user’s memory slightly deviates from the exact transcription.
- N-Gram Analysis for Partial Discovery: Because video search is inherently exploratory, we utilize edge N-gram tokenizers to support search-as-you-type functionality. By actively indexing dialogue and metadata substrings, the system surfaces frame-accurate results the moment an editor begins typing, drastically reducing cognitive load.
- Tokenization and Linguistic Stemming: To seamlessly support the global scale of the Netflix catalog, our analysis chain applies sophisticated stemming across multiple languages. This ensures a query for “running” automatically intersects with scenes tagged with “run” or “ran” collapsing grammatical variations into a single, unified search intent.
- Levenshtein Fuzzy Matching: To account for transcription anomalies or phonetic misspellings, we incorporate fuzzy search capabilities based on Levenshtein distance algorithms. This intelligent soft-matching approach ensures that high-value shots are never lost to minor data-entry errors or imperfect queries.
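For example, the phrase and fuzzy clauses might look like this in query DSL form; the dialogue field name is illustrative, and a real query wraps these in a nested clause because annotations are nested documents.

// Phrase-with-slop: tolerate a slightly misremembered line.
const phraseClause = {
  match_phrase: {
    "source_annotations.dialogue": { query: "Friends don't lie", slop: 2 },
  },
};

// Fuzzy matching: absorb transcription typos and phonetic misspellings.
const fuzzyClause = {
  match: {
    "source_annotations.dialogue": { query: "Frends dont lie", fuzziness: "AUTO" },
  },
};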
Aggregations and Flexible Grouping
The architecture operates at immense scale, seamlessly executing queries within a single title or across thousands of assets simultaneously. To combat result fatigue, the system leverages custom aggregations to intelligently cluster and group outputs based on specific parameters, such as isolating the top 5 most relevant clips of an actor per episode. This guarantees a diverse, highly representative return set, preventing any single asset from dominating the search results.
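A sketch of that grouping with a terms aggregation plus top_hits; the field names are illustrative.

// Top 5 most relevant clips per episode, so no single asset floods the results.
const perEpisodeTopClips = {
  size: 0,
  query: { match_all: {} },   // replaced by the compiled search query
  aggs: {
    per_episode: {
      terms: { field: "associated_ids.ASSET_ID", size: 50 },
      aggs: {
        top_clips: { top_hits: { size: 5, sort: [{ _score: "desc" }] } },
      },
    },
  },
};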
Search Response Curation
While temporal buckets are the internal mechanism for search efficiency, the system post-processes Elasticsearch results to reconstruct original time boundaries. The reconstruction process ensures results reflect narrative scene context rather than arbitrary intervals. Depending on the query intent, the system generates results based on two logic types:
- Union: Returns the full span of all matching annotations (3–8 sec), which prioritizes breadth, capturing any instance where a specified feature occurs.
- Intersection: Returns only the exact overlapping duration of matching signals (4–6 sec). The intersection logic focuses on co-occurrence, isolating moments when multiple criteria align.
{
  "entity_id": {
    "entity_type": "ASSET",
    "id": "1bba97a1-3562-4426-9cd2-dfbacddcb97b"
  },
  "range_intervals": [
    {
      "intersection_time_range": {
        "start_time_ns": 4000000000,
        "end_time_ns": 8000000000
      },
      "union_time_range": {
        "start_time_ns": 2000000000,
        "end_time_ns": 9000000000
      },
      "source_annotations": [
        {
          "annotation_id": "fc1525d0-93a7-11ef-9344-1239fc3a8917",
          "annotation_type": "SCENE_SEARCH",
          "metadata": {
            "label": "kitchen"
          }
        },
        {
          "annotation_id": "5974fb01-93b0-11ef-9344-1239fc3a8917",
          "annotation_type": "CHARACTER_SEARCH",
          "metadata": {
            "character_name": [
              "Joey"
            ]
          }
        }
      ]
    }
  ]
}
Figure 7: Sample Response for “Joey” + “Kitchen” Query
Future Extensions
While our current architecture establishes a highly resilient and scalable foundation, it represents only the first phase of our multi-modal search vision. To continuously close the gap between human intuition and machine retrieval, our roadmap focuses on three core evolutions:
- Natural Language Discovery: Transitioning from structured JSON payloads to fluid, conversational interfaces (e.g., “Find the best tracking shots of Tom Holland running on a roof”). This will abstract away underlying query complexity, allowing creatives to interact with the archive organically.
- Adaptive Ranking: Implementing machine learning feedback loops to dynamically refine scoring algorithms. By continuously analyzing how editorial teams interact with and select clips, the system will self-tune its mathematical definition of semantic relevance over time.
- Domain-Specific Personalization: Dynamically calibrating search weights and retrieval behaviors to match the exact context of the user. The platform will tailor its results depending on whether a team is cutting high-action marketing trailers, editing narrative scenes, or conducting deep archival research.
Ultimately, these advancements will elevate the platform from a highly optimized search engine into an intelligent creative partner, fully equipped to navigate the ever-growing complexity and scale of global video media.
Acknowledgements
We would like to extend our gratitude to the following teams and individuals whose expertise and collaboration were instrumental in the development of this system:
- Data Science Engineering: Nagendra Kamath, Chao Pan, Prachee Sharma, Ying Liao and Carolyn Soo for the critical media model insights that informed our architectural design.
- Product Management: Nimesh Narayan, Ian Krabacher, Ananya Poddar, Meghan Bailey and Anita Kuc for defining the user requirements and product vision.
- Media Production Suite Team: Szymon Borodziuk, Mike Czarnota, Dominika Sarkowicz, Bohdan Koval and Sasha Sabov for their work in engineering the end-user search experience.
- Asset Management Platform Team: For their collaborative efforts in operationalizing this design and bringing the system into production.