New Research on Cloud Database Trends: Technical Risks, Cost Pressures, and Migration Triggers

Good enough until it isn’t: the database complacency trap A database is like a water heater. When all is well, it just does its job in the background. You don’t fantasize about replacing it or envy the one your friend just got. Really, you don’t even think about it — until something goes awry. But new research reveals a key difference: With databases, the problems don’t blindside you. Some 38% of technology leaders worry that their current database won’t meet their needs in the near future. However, they aren’t acting on it. They wait until some compelling event (e.g., a production incident, usage spike, budget cut, or cloud strategy pivot) pushes the database to the top of the priority list. That’s just one of the interesting findings from the Futurum Group’s latest research study, commissioned by ScyllaDB, which explores the latest trends in cloud database cost pressures, performance risks, and migration motivations. Respondents include technical decision-makers who shape cloud database strategy as well as team members directly responsible for the database. Guy Currier, Futurum Group Chief Analyst, summarizes the findings this way: “Those technology leaders expressed complacency with their cloud databases at the same time as concern and caution. This combination suggests that although they would prefer not to take immediate action, they know they will have to move when compelling events force a change.” The full report, Is Cloud Database Complacency Affecting Your Business Objectives?, is available now. Here are some key takeaways. Comfort masks concern A third of the leaders surveyed report satisfaction with the performance of their current cloud databases. Yet, 38% worry that their database isn’t fit to support future AI/ML workloads and the resulting explosion in data volume. The prime characteristic of these workloads is their unpredictability; past database performance is a poor indicator of future behavior as the technology evolves and as volumes increase. “Organizations experience what we might call ‘good enough for now’ syndrome,” Currier noted. “Their databases handle today’s workloads adequately, but leaders doubt these solutions will scale to meet tomorrow’s demands.” Cloud database costs are also a major concern. The research found that 35% of leaders want to improve performance but feel constrained by budget. Another 35% are concerned about rising costs despite being satisfied with performance. The top cloud database cost drivers include: Unexpected loads (40%) New or strict technical requirements (38%) Networking bandwidth growth (38%) Storage growth (38%) The 10% cost-savings tipping point Nearly 40% of organizations are meeting their cloud database budgets, but just as many consider their predictable costs too high. As Currier explains, “Organizations might tolerate high costs when they can plan for them. However, this tolerance creates an opening for solutions that can deliver similar predictability at lower price points.” That opening is quite specific: A 10% cost reduction is all it would take for many tech leaders to consider migrating their cloud database. Why so low? Likely, the answer lies in scale. When database costs climb into the millions annually – which is not unusual for platforms like DynamoDB, according to the research – even a modest 10% translates to substantial savings. Event-driven database migration triggers Still, technical leaders don’t proactively seek alternatives that are more cost-efficient or better prepared for the technical needs of current/future AI/ML workloads. They wait for trigger events that force them into a crisis-driven decision. Leadership changes (36%) and major production incidents (32%) emerged as the primary catalysts. Other significant triggers include: Load spikes (32%) Cost reductions of 10% or more (31%) Maintenance burdens (31%) Performance issues (29%) Volatile costs (28%) Most of these triggers highlight the reactive nature of these migrations, rather than proactive, strategic changes. Note that volatile database costs drive 28% of switching decisions, suggesting that sheer unpredictability can be nearly as disruptive as high costs. “Database decisions are rarely made in a vacuum,” the research report notes. “Even when teams identify performance or cost inefficiencies, acting on them competes with feature delivery, roadmap commitments, limited operational bandwidth, and against their familiar tech stack.” Early warning signs While water heater issues tend to surface without warning, database issues can usually be anticipated. There are several early warning signs that a database is starting to become a constraint: Cost is growing faster than throughput. When database spend rises faster than the throughput it’s handling, the system may not be as scalable as it appears. Teams patch their way forward (e.g., with caches) to sustain performance. But the cost per query keeps climbing. Rising tail latency. When P95 or P99 latency starts to climb during peak periods or background operations, it indicates the system is nearing its breaking point. These changes might be dismissed if they don’t immediately violate SLAs, but they’re canaries in the coal mine. Increasing operational friction. More manual tuning, more frequent capacity adjustments, more time spent managing the database to maintain the same level of performance…all these signal diminishing returns from the current architecture. Disproportionate complexity for organic growth. When routine scaling or new workload support requires outsized engineering effort, it’s a sign that the database has become a constraint rather than an enabler. From reactive to strategic Recognizing these signals is one thing, but actually acting on them before a crisis forces your hand is another. Some due diligence now will help you stay ahead of it. Get a general sense of what options are available for your use cases Define vendor-neutral evaluation criteria Stress test your existing database to understand its breaking point – before production traffic exposes it for you Set clear decision triggers (e.g., specific performance thresholds, cost targets, and capability gaps) Map your database capabilities against your 12–24 month strategic roadmap, not just your current workloads As Currier concludes: “Your database might be ‘good enough for now,’ but if that isn’t aligned with where your business needs to go, complacency is already costing you.” Download the full report here; you’ll also get access to an expert panel discussing the research findings.

Native Vector Search for the DynamoDB API

Developers building on the DynamoDB API can run vector similarity search without the complexity of bolted-on “Zero ETL” For users in the DynamoDB environment, implementing vector search has been overly complicated. Amazon’s “Zero ETL” forces a dual-service approach (managing both DynamoDB and OpenSearch) and requires using two separate APIs just for Vector Semantic Search queries. ScyllaDB believes this is unnecessary complexity. We’re eliminating the heavy lifting by integrating vector search capabilities into Alternator, our DynamoDB-compatible API. This gives DynamoDB users high-performance similarity search within their familiar API, without the need for extra clusters or constant API context-switching. Architectural Differences: Unified vs. Fragmented Amazon’s approach to vector search exports data to S3 and then syncs it to OpenSearch via DynamoDB Streams. While “Zero ETL” sounds hands-off, you’re still responsible for the cost and complexity of a separate search cluster. The AWS cost is composed of DynamoDB, DynamoDB Streams, S3, OpenSearch, and the OSIS pipeline. Each of these elements’ pricing is complex on its own. Amazon Vector Search (using Open Search) for DynamoDB architecture. Source: AWS Blog. ScyllaDB Alternator simplifies this by integrating the vector store engine directly into the backend. Simple module: The ScyllaDB database hosts both the data and the vector index. Native API: You perform vector searches using DynamoDB Query operations. Performance: 10 Million Vectors on a Budget In our latest benchmark using a 10-million-vector dataset (768-dimensional Cohere embeddings), a modest five-node ScyllaDB cluster delivered over 12K QPS with single-digit millisecond latency.
Setup: 10M vectors; 768 dimensions; K: 10 (retrieve top K values); No Quantization
Results Recall: ~90% Throughput: 12,763 QPS P99 Latency: 7.8 ms Cost: $1,643 / Month for 1Y full up front Estimating the AWS cost for this case is not trivial. The write-path includes DynamoDB (storage+ops), DynamoDB streams, S3 (storage, API), OpenSearch (data nodes, master nodes, EBS), and the OSIS pipeline. To read more on the pricing of Amazon Zero ETL, see Implementing search on Amazon DynamoDB data using zero-ETL integration with Amazon OpenSearch service. Code Examples Note: The exact JSON format might change in the next few months. 1. Enabling a Vector Index You can enable vector indexing during CreateTable or via UpdateTable. Note the new VectorSecondaryIndexUpdates parameter. // Adding a vector index to an existing table { "TableName": "ProductCatalog", "AttributeDefinitions": [ {"AttributeName": "ProductEmbedding", "AttributeType": "V"} ], "VectorSecondaryIndexUpdates": [ { "Create": { "IndexName": "VectorIdx", "VectorAttribute": { "AttributeName": "ProductEmbedding", "Dimensions": 768 }, "IndexOptions": { "SimilarityFunction": "COSINE", "M": 32, "ef_construction": 256 } } } ] } Pro Tip: You will get the best results with ScyllaDB’s optimized “V” (Vector) type. Although you can use standard DynamoDB Lists, the “V” type will store data as a tight array of 32-bit floats – and that saves storage while boosting performance. 2. Performing a Vector Search To search, use the Query operation with the ScyllaDB VectorSearch parameter. { "TableName": "ProductCatalog", "IndexName": "VectorIdx", "VectorSearch": { "QueryVector": [0.12, 0.05, ..., 0.88], "Oversampling": 1.5 }, "Limit": 10, "ReturnVectorSearchSimilarity": "SIMILARITY" } Example Use Cases Semantic Product Search Instead of relying on exact keyword matches, users can find products based on intent. For example, a search for “waterproof rugged hiking gear” can surface relevant items even if those exact words aren’t in the title. RAG (Retrieval-Augmented Generation) For knowledge bases, precision is non-negotiable. Using the High Recall configuration, ScyllaDB delivers 99.2% recall. That way, the LLM receives the most accurate context possible for generating responses. Semantic Deduplication At the Max Throughput end of the spectrum, ScyllaDB can quickly scan millions of incoming vectors to find near-duplicates. That prevents redundant data from cluttering your system – reducing costs and improving performance. Conclusion With ScyllaDB, DynamoDB users now have a “fast track” to AI-ready infrastructure. By unifying storage and vector search into a single API, you eliminate the operational tax of “Zero ETL” without sacrificing the sub-millisecond performance ScyllaDB is known for.

ScyllaDB Vector Search Benchmark: 10M Vectors on a Compact Cluster

Even a small, compact setup achieved up to 12,840 QPS at k=10 with a serial P99 latency of 5.5 ms Our 1-billion-vector benchmark demonstrated that ScyllaDB Vector Search can sustain 252,000 QPS with 2 ms P99 latency across a large-scale deployment. But not every workload starts at a billion vectors. Many production use cases (e.g., product catalogs, knowledge bases for RAG, and semantic caches) live comfortably in the 10–100 million range. This post presents a smaller benchmark: a 10-million-vector dataset of 768-dimensional Cohere embeddings on a compact five-node cluster. It used three modest storage nodes and two memory-optimized search nodes, all running on AWS Graviton. We explore four index configurations that span the recall-throughput spectrum, from near-perfect recall to maximum throughput. The results show that even this small setup can deliver up to 12,840 QPS at k=10 with a serial P99 latency of 5.5 ms — without any quantization. Architecture at a Glance First, some background. ScyllaDB Vector Search separates storage and indexing responsibilities while keeping the system unified from the user’s perspective. The ScyllaDB storage nodes hold both the structured attributes and the vector embeddings in the same distributed table. Meanwhile, a dedicated Vector Store service — implemented in Rust and powered by the USearch engine — consumes updates from ScyllaDB via CDC and builds approximate nearest neighbor (ANN) indexes in memory. Queries are issued through standard CQL: SELECT … ORDER BY vector_column ANN OF ? LIMIT k; The queries are internally routed to the Vector Store service, which performs the HNSW similarity search and returns the candidate rows. This design allows each layer to scale independently, optimizing for its own workload characteristics and eliminating resource interference. For a detailed architectural deep-dive, see the 1-billion-vector benchmark and the technical blog Building a Low-Latency Vector Search Engine for ScyllaDB. Benchmark Setup Here’s a look at the dataset and hardware used for the benchmark. Dataset Property Value Vectors 10,000,000 Dimensions 768 Embedding model Cohere Similarity function COSINE Quantization None (f32) Hardware Role Instance vCPUs RAM Count Storage nodes i8g.large 2 16 GB 3 Search nodes r7g.2xlarge 8 64 GB 2 With 768-dimensional f32 vectors and M values up to 64, the in-memory index size can be estimated as: Memory ≈ N × (D × 4 + M × 16) × 1.2 For the largest configuration (M=64): 10M × (768 × 4 + 64 × 16) × 1.2 ≈ 49 GB, which fits comfortably in the 64 GB of a single r7g.2xlarge search node. No quantization is needed at this scale. Experiments We tested four HNSW index configurations, progressively lowering graph connectivity (M) and search effort (ef_search) to shift the balance from recall toward throughput. Experiment M ef_construction ef_search k tested #1 (high quality) 64 384 192 100, 10 #2 (balanced) 32 256 128 100, 10 #3 (high throughput) 24 256 64 100, 10 #4 (max throughput) 20 256 48 10 The three HNSW parameters control different aspects of the index: M (maximum_node_connections): Maximum edges per node in the HNSW graph. Higher values create a richer, better-connected graph that improves recall at the cost of more memory and slower inserts and queries. ef_construction (construction_beam_width): Controls how thoroughly the algorithm searches for the best neighbors when inserting a new vector. Higher values produce a higher-quality graph but slow down index building. This is a one-time cost. ef_search (search_beam_width): The main tuning knob for query performance. Controls the size of the candidate beam during search. Higher values evaluate more candidates, which improves recall but increases query latency. Since vector index options cannot be changed after creation, each experiment required dropping and recreating the index. Here are the CQL statements used: -- Experiment #1: M=64, ef_construction=384, ef_search=192 CREATE CUSTOM INDEX vdb_bench_collection_vector_idx ON vdb_bench.vdb_bench_collection (vector) USING 'vector_index' WITH OPTIONS = { 'search_beam_width': '192', 'construction_beam_width': '384', 'maximum_node_connections': '64', 'similarity_function': 'COSINE' }; -- Experiment #2: M=32, ef_construction=256, ef_search=128 CREATE CUSTOM INDEX vdb_bench_collection_vector_idx ON vdb_bench.vdb_bench_collection (vector) USING 'vector_index' WITH OPTIONS = { 'search_beam_width': '128', 'construction_beam_width': '256', 'maximum_node_connections': '32', 'similarity_function': 'COSINE' }; -- Experiment #3: M=24, ef_construction=256, ef_search=64 CREATE CUSTOM INDEX vdb_bench_collection_vector_idx ON vdb_bench.vdb_bench_collection (vector) USING 'vector_index' WITH OPTIONS = { 'search_beam_width': '64', 'construction_beam_width': '256', 'maximum_node_connections': '24', 'similarity_function': 'COSINE' }; -- Experiment #4: M=20, ef_construction=256, ef_search=48 CREATE CUSTOM INDEX vdb_bench_collection_vector_idx ON vdb_bench.vdb_bench_collection (vector) USING 'vector_index' WITH OPTIONS = { 'search_beam_width': '48', 'construction_beam_width': '256', 'maximum_node_connections': '20', 'similarity_function': 'COSINE' }; The benchmark was run using VectorDBBench with the upcoming ScyllaDB Python driver built on a Rust core (a dev version is available at python-rs-driver). VectorDBBench ramps concurrency from 1 to 150 concurrent search clients and measures QPS, P99 and average latency at each level. A separate serial run of 1,000 queries measures recall and nDCG against brute-force ground truth. Results Peak QPS Comparison To start our analysis, let’s examine the maximum throughput that each index configuration can sustain under peak concurrency. When strictly looking at the highest throughput achieved: The bar chart highlights the dramatic impact of index parameters at k=10: throughput rises sharply as the index becomes lighter. At k=100, the differences are much smaller; all configurations cluster between 2,300 and 3,000 QPS. QPS vs Concurrency The chart below shows how each index configuration scales as concurrency ramps from 1 to 150 clients. At k=10, the lighter configurations (Experiments #3 and #4) scale nearly linearly up to 60–80 concurrent clients before saturating. Experiment #4 demonstrates the benefit of a leaner graph: it achieves 5.5X higher peak QPS than Experiment #1 at k=10. At k=100, all configurations converge to a narrower throughput band (2,300–3,025 QPS). This shows that retrieving 100 neighbors dominates the per-query cost regardless of index parameters. P99 and Average Latency vs Concurrency As expected, increasing throughput adds queuing delay, and that leads to higher tail latencies. <!-- Note: The original document had 6 images. The source note lists the order as 4-1-2-3-5-6. The text contains 7 [image] placeholders. Based on the document's structure, I will assume the sixth placeholder corresponds to the last chart (Average Latency) and omit the extra placeholder, as the source note only accounts for six images. I will adjust the numbering below. The original list was 4-1-2-3-5-6. I will use the final placeholder (Image 6 from source) here. The next section has another chart, so I will add a seventh placeholder and mark it as 'Source Unknown'. Lighter configurations start at dramatically lower baseline latencies. Experiment #4 maintains sub-6 ms P99 latency up to 30 concurrent clients, while Experiment #1 starts above 13 ms, even at concurrency 1. All configurations show latency rising proportionally once throughput saturates. This is the expected queuing behavior when the system is at capacity. QPS vs P99 Latency (Pareto View) Plotting throughput directly against tail latency provides a Pareto frontier of our benchmark configurations: This view makes the operational trade-off easier to read than the concurrency charts alone. At k=10, Experiments #3 and #4 push the frontier outward, with much higher QPS at the same or lower tail latency. At k=100, the frontier is tighter, which again shows that returning more neighbors dominates the total cost per query. Recall vs Peak QPS Finally, plotting recall helps select the optimal index strategy based on business requirements: This chart summarizes the core choice in a single picture: should you spend compute on accuracy or throughput? Experiment #1 sits at the high-recall end, Experiment #4 at the high-throughput end, and Experiment #2 emerges as the practical middle ground for workloads that need both. Scenario Analysis With the charts above as a visual reference, let’s examine the three main usage scenarios that emerge from the data. Scenario 1: Maximum Throughput Experiments #3 (M=24, ef_search=64) and #4 (M=20, ef_search=48) target workloads where throughput is the primary objective and moderate recall is acceptable — for example, coarse candidate retrieval stages in recommendation pipelines or semantic deduplication. At k=10, Experiment #4 reached a peak of 12,840 QPS at concurrency 100, with a serial P99 latency of just 5.5 ms and recall of 92.0%. Experiment #3 achieved 9,719 QPS with marginally better recall at 95.0% and a serial P99 of 6.0 ms. Even at k=100, these lightweight configurations delivered competitive throughput: Experiment #3 peaked at 3,025 QPS (87.9% recall), which is comparable to the heavier configurations. Retrieval of 100 neighbors per query inherently requires more work, which limits the throughput range across all configurations. Scenario 2: High Recall Experiment #1 (M=64, ef_search=192) prioritizes accuracy for applications that cannot tolerate missed results (e.g., high-fidelity semantic search, retrieval-augmented generation [RAG] pipelines, or compliance-sensitive retrieval). At k=10, the system delivered 99.2% recall and 99.1% nDCG — essentially indistinguishable from exact brute-force search. Peak QPS reached 2,324 with a serial P99 latency of 14.6 ms. At k=100, recall was 96.8% with 2,345 QPS and a serial P99 of 15.2 ms. The higher latency and lower throughput are a direct consequence of the richer graph (64 connections per node) and wider search beam (192 candidates), which evaluate substantially more distance computations per query. Scenario 3: Balanced Experiment #2 (M=32, ef_search=128) takes the middle ground, offering strong recall with significantly better throughput than the high-recall configuration. At k=10, it achieved 97.7% recall with 4,897 QPS — roughly double the throughput of Experiment #1, with only a 1.5 percentage-point recall reduction. The serial P99 was 8.7 ms. At k=100, recall was 92.0% with 2,975 QPS and a serial P99 of 9.6 ms. This configuration represents a practical sweet spot for many production deployments where both recall and throughput matter. Summary Tables k=100 Metric #1 M=64 ef_s=192 #2 M=32 ef_s=128 #3 M=24 ef_s=64 Peak QPS 2,345 (c=150) 2,975 (c=40) 3,025 (c=40) QPS @ c=10 947 1,314 1,489 Serial P99 Latency 15.2 ms 9.6 ms 7.8 ms P99 Latency @ c=1 15.5 ms 9.9 ms 8.1 ms P99 Latency @ c=100 81.2 ms 49.9 ms 49.6 ms Recall 96.8% 92.0% 87.9% nDCG 97.3% 93.1% 89.7% k=10 Metric #1 M=64 ef_s=192 #2 M=32 ef_s=128 #3 M=24 ef_s=64 #4 M=20 ef_s=48 Peak QPS 2,324 (c=100) 4,897 (c=80) 9,719 (c=80) 12,840 (c=100) QPS @ c=10 1,054 1,602 2,046 2,311 Serial P99 Latency 14.6 ms 8.7 ms 6.0 ms 5.5 ms P99 Latency @ c=1 14.0 ms 8.5 ms 6.2 ms 5.5 ms P99 Latency @ c=100 81.0 ms 38.1 ms 18.0 ms 12.3 ms Recall 99.2% 97.7% 95.0% 92.0% nDCG 99.1% 97.6% 94.9% 92.0% Key Takeaways k=10 vs k=100: At k=10, lighter index parameters yield massive throughput gains (up to 5.5X) with modest recall loss. At k=100, all configurations converge to a narrow QPS band (~1.3X range) because retrieving more neighbors dominates per-query cost. Recall trade-offs are favorable: At k=10, recall drops only 7.2 pp (99.2% to 92.0%) for a 5.5X QPS increase. At k=100, the trade-off is steeper: 8.9 pp for just 1.3X gain. Latency tracks index weight: Serial P99 drops from 14.6 ms to 5.5 ms at k=10, and from 15.2 ms to 7.8 ms at k=100, as lighter graphs require fewer distance computations. Saturation points differ: Experiments #1–#3 plateau around c=40–80; Experiment #4 scales further to c=100 before saturating, reflecting its lower per-query compute cost. Conclusion These results show that ScyllaDB Vector Search delivers strong performance even on a compact, five-node cluster with 10 million 768-dimensional vectors. A pair of r7g.2xlarge search nodes provides enough memory to hold the full HNSW index at f32 precision – without requiring any quantization. The three storage nodes with replication factor 3, combined with vector search nodes distributed across availability zones, also provide high availability. The system is designed to tolerate node failures without data loss or service interruption. Depending on the index configuration, the system can prioritize near-perfect recall (99.2% at k=10) or maximize throughput (12,840 QPS at k=10 with 92% recall), with practical balanced options in between. This 10M scenario represents the accessible end of the scale. For workloads that push into hundreds of millions or billions of vectors, quantization, additional search nodes and larger instances extend the same architecture. See the ScyllaDB 1-billion-vector benchmark for results at extreme scale, and look for our upcoming 100-million-vector benchmark post. At K=10, the performance bottleneck resides within the vector index nodes, leaving ScyllaDB with significant headroom. This means you can likely add a Vector Search index to your cluster and continue running a similar workload on your existing ScyllaDB infrastructure without needing to scale your database nodes. The full Jupyter notebook with interactive charts and all data is available in this repository. Ready to try it yourself? Follow the ScyllaDB Vector Search Quick Start Guide to get started.

ScyllaDB X Cloud: Your Questions Answered

A technical FAQ on ScyllaDB X Cloud: architecture, autoscaling, compression, use cases, and more It’s been a few months since ScyllaDB X Cloud landed. In case you missed the news, here’s a quick recap… ScyllaDB X Cloud is the next generation of ScyllaDB’s fully-managed database-as-a-service. It’s a truly elastic database designed to support variable/unpredictable workloads with consistent low latency as well as low costs. Users can scale out and scale in almost instantly to match actual usage. For example, you can scale all the way from 100K OPS to 2M OPS in just minutes, with consistent single-digit millisecond P99 latency. This means you don’t need to overprovision for the worst-case scenario or suffer the lag traditionally associated with ramping up capacity in response to a sudden surge. Some key features (all covered in Introducing ScyllaDB X Cloud: A (Mostly) Technical Overview): Tablets + just-in-time autoscaling Up to 90% storage utilization Support for mixed size clusters File-based streaming Dictionary-based compression Flex credit Here’s a look at ScyllaDB X Cloud in action:  Not surprisingly, users have been quite curious about all these changes and new options. So we thought we’d collect some of the most common questions here, along with our answers. In no particular order… What are the key differences between a “standard” ScyllaDB Cloud database and “ScyllaDB X Cloud”? Compared to a standard ScyllaDB Cloud database, ScyllaDB  X Cloud provides two major advantages: Faster scaling in and out. Higher storage utilization (90% vs. 70%). The above advantages are the result of two technical updates: X Cloud always uses Tablets, while standard databases can use a mix of vNode and Tablets keyspaces. X Cloud enables mixed sized clusters, so you can define more granular cluster and storage sizes. In which cases should you choose a “standard” ScyllaDB Cloud Database vs X Cloud? None! We’ve reached full parity now. Materialized views, CDC, Alternator (DynamoDB API), even counters – it’s all supported. Can I migrate from one type of ScyllaDB Cloud database to the other? Yes. If you are using a standard database with Tablets only, you can migrate this database to X Cloud. If you are using vNode keyspaces, you cannot (yet). How does X Cloud achieve higher storage utilization? Two factors enable higher storage utilization: Faster scaling removes the need to over-reserve storage space (or “sandbag”) while waiting for the cluster to expand Support for mixed instance sizes allows for more granular cluster size How can I start an X Cloud cluster? Simply choose the “X Cloud” Cluster Type on ScyllaDB Cloud’s Launch Cluster page. How can I set the scaling policy? Can I change it later, while the database is in production? (UI/API) The scaling policy is part of the X Cloud cluster properties. You can either set it when launching the cluster or update it later. The policy is optional. It defines the minimum required resources for your database in terms of vCPU and Storage. If you’re not sure how to set it, you can keep the default minimum values (zero) as is. The cluster will scale automatically if and when storage is approaching the threshold, and you can scale the vCPU as required by your workload. Note that the parameters affect each other since more storage may require more compute power. How are X Cloud and Tablets related? X Cloud takes advantage of (and depends on) Tablets to achieve faster scale and higher storage utilization. That means all Keyspaces in X Cloud must use Tablets, which is already the default for ScyllaDB Cloud. How can X Cloud help reduce database costs? There are a few ways that X Cloud reduces cost. The primary factor is the extreme elasticity. You can scale the cluster in and out, even multiple times per day, to meet the demand. If you cannot reliably plan the cluster usage, you can reserve a minimal deployment and pay for bursts using Flex Credit. The higher storage utilization means you use less cloud resources. Improved compression, both on the wire and at rest, reduces cost further. What’s a good use case for ScyllaDB X Cloud? Am I a good candidate for ScyllaDB X Cloud? New (greenfield) workloads should use X Cloud. Workloads that require frequent scaling out/in will benefit the most. For example: A workload with significant fluctuation throughout the day (e.g., peak hours during the evening). A workload with expected high demand on specific days of the year (e.g., Super Bowl, IPL games, or Black Friday). With X Cloud, scaling can be done days in advance. You don’t need to do it one or more weeks ahead. Difficult-to-predict workloads, with common (but volatile) bursts. How many times per day can X Cloud scale? As often as required. Although new nodes start serving requests very fast, it still takes time for the data balancing to be complete if you’re working with rather large nodes. Does X Cloud support multi-DC (region) deployment? Does each region scale independently? X Cloud does not yet support multi-datacenter deployment. Multi-DC support is coming with the 2026.2 release. Scaling Policy: I asked for storage of Y TB and got a bigger cluster with storage of W TB…why? Same for vCPU? vCPU, RAM, and Storage are not independent variables. ScyllaDB will allocate each of these 3 variables to support the required value of the other two. For example, higher storage requires more RAM – which requires more vCPU. The policy UI reflects the expected deployment per each resource selection. Can I suspend / resume the dynamic scaling? Currently: no. Can I restore a backup from X Cloud database to a standard database and vice versa? Yes, you can. Is X Cloud production ready? Absolutely, customers are already using it in production. Why should I care about advanced compression? What is the advantage of having it? ScyllaDB already supported compression before X Cloud – including at-rest and in-transit. However, dictionary-based compression is much more effective in reducing data overhead. By compressing data further, you save on disk space utilization (combined with up to 90% disk space utilization) as well as inter-AZ networking for data replication and high availability. X Cloud claims faster scaling. How fast is it really? The legacy vNode-based architecture imposed some limitations: Nodes could only be added one at a time, even across DCs. Data was replicated in rows – that is, rows were being transferred over the wire. A node only started serving requests after its streaming was fully completed. This process could easily take hours, if not days, to complete on large clusters. Now, X Cloud leverages tablets to remove those limits: Nodes can be added in parallel, multiple nodes at a time, including across DCs. Nodes join the cluster instantly, then start streaming data later. Streaming under Tablets relies on file-based streaming, transferring gigabytes of data per second in a very efficient process. As Tablet transfers complete, nodes start to serve requests immediately; this increases as more transfers complete, until the cluster rebalancing is completed. This allows X Cloud to scale to an unlimited number of nodes at a single step – and streaming data is made super efficient by file-based streaming. A cluster can go from 100K ops per second to 2M ops per second in a matter of a few minutes, not hours or days. Can I use Vector Search with X Cloud? Yes, you can! Enable the Vector Search option at the bottom of the Launch Cluster page and choose the Vector Search instances. Note that Vector Search index nodes scale independently from ScyllaDB nodes. You can learn more about Vector Search here.

6 Reasons ScyllaDB Costs a Fraction of DynamoDB

Why teams typically experience 50% (or greater) cost reductions when moving from DynamoDB to ScyllaDB DynamoDB is expensive at scale. Some of that cost is fundamental to the managed service model. But much of it is the pricing model, the way DynamoDB charges per read, per write, per byte, and per region. ScyllaDB rethinks pricing from first principles. The result: teams typically see more than 50% cost reductions on equivalent workloads. In this post, I’ll share a few reasons why. Cheap writes DynamoDB charges 5x more for writes than reads. Write a 1 KB item and it costs 5 write capacity units. Read the same 1 KB item and it costs 0.25 read capacity units. ScyllaDB pricing is based on provisioned cluster capacity (nodes), not per operation. Whether you do 10K writes/sec or 100K writes/sec on a 3-node cluster, the ScyllaDB cost remains the same. Write-heavy workloads for AI, real-time analytics, logging, time-series data and IoT sensors often see the biggest savings. Take a look at our AI Feature Store example. A batch workload scenario with overnight peaks approximately 3x the daytime average on DynamoDB will cost $2.2M/year. The same workload on ScyllaDB would cost $145K/year. In other words, that’s at least 15x savings just switching to ScyllaDB. No need for a separate cache DynamoDB’s baseline latency is in the 10-20ms range. For many applications, that’s unacceptable. In those cases, teams commonly deploy DAX, Redis, or Memcached on top. That adds cost, complexity, and another service to operate and monitor. ScyllaDB was built for low latency. Internal caching and a shard-per-core architecture deliver sub-millisecond latencies on reads. For most workloads, an external cache is unnecessary. Let’s look at a retail example with a read-heavy workload that is cached and running on demand. On DynamoDB running with DAX, that workload would cost $1.6M/year. The same workload on ScyllaDB would cost $271K/year (and even less if you switch to a hybrid plan). That’s at least 6x cheaper using ScyllaDB. Plus: there are fewer moving parts, simpler operations, and no cache coherency headaches. Affordable multi-region data centers DynamoDB Global Tables charge replicated writes (rWCUs) at a premium: roughly 2x the cost of normal writes. Moreover, cross-region data transfer incurs AWS’s standard rates: $0.02-0.09/GB. For a workload doing 10K writes/sec with 5 KB payloads across 2 regions, data transfer alone can add $10K+/month. A social media scenario modeled across 3 regions on DynamoDB would cost $11.0M/year. The huge cost is partly because the write capacity cannot be reserved, and you effectively pay twice for the writes. The same workload on ScyllaDB would cost $591K/year. That’s a monstrous +$10M/year saving by switching to ScyllaDB. ScyllaDB handles multi-DC replication natively. You provision nodes in each data center, and replication is built into the protocol along with shard-aware and rack-aware drivers. This helps minimize network overhead and avoids the per-operation premium. You pay for the cluster nodes; replication comes with the territory. Large items don’t cost more In DynamoDB, a 1 KB write costs 1 WCU, and a 10KB write costs 10 WCUs. Item size directly drives billing. This incentivizes shrinking payloads, compressing data, and splitting tables. Architectural decisions are driven by cost, not design. A simple on-demand scenario with DynamoDB using 3 KB item sizes would cost $633K/year. ScyllaDB would cost $39K/year. Along with multi-region, item size remains one of the biggest cost levers to pull when looking for savings on DynamoDB. ScyllaDB billing is independent of item size. Store 1 KB items or 100 KB items and the cluster cost is unchanged. You architect around performance and correctness, not billing thresholds. Making multi-tenancy work for you DynamoDB is multi-tenant infrastructure. That’s how AWS achieves efficiency. But it also means: You pay for provisioned capacity AWS oversubscribes hardware Idle capacity benefits AWS, not you You pay for the full machine, but AWS shares it with everyone else. Multi-tenant infrastructure reduces cost for AWS but increases risk for users. Large DynamoDB outages (like us-east-1) impact thousands of customers simultaneously. When shared infrastructure fails, the blast radius is enormous. ScyllaDB flips that model. You get a dedicated cluster, which gives you: Isolation by design The ability to run multiple workloads The option to share idle capacity internally This is especially powerful for: Multi-tenant SaaS Microservices Multiple environments (dev/staging/prod) Instead of provisioning 100 tables separately, you provision one cluster and use it fully. You control your infrastructure. AWS monetizes multi-tenancy. ScyllaDB lets you monetize it. Flexible and predictable pricing DynamoDB is excellent for certain use cases: serverless applications with unpredictable spikes, multi-tenant services that need table-level isolation, and teams that prioritize operational simplicity over cost. But if you’re running a predictable, scale-intensive workload – especially one that’s write-heavy, multi-region, or stores large items – then DynamoDB’s per-operation pricing model becomes a massive cost driver. ScyllaDB’s node-based, cluster-centric model is fundamentally more cost-efficient for these scenarios. Combined with its performance and operational features, it’s why teams see more than 50% cost reductions. Want to see the actual numbers for your workload? Use the ScyllaDB Cost Calculator at calculator.scylladb.com to model a comparison between your current DynamoDB spend and equivalent ScyllaDB infrastructure.

Apache Cassandra® 6 Accord transactions: What you need to know

There have always been architectural trade-offs when considering a distributed database like Apache Cassandra versus a relational database. Cassandra excels at linear horizontal scalability, multi-region replication, and fault-tolerant uptime that relational systems couldn’t match. This comes at the expense of general-purpose ACID (Atomicity, Consistency, Isolation, Durability) transactions which allows the ability to express complex, multi-row operations with guaranteed consistency.

With Cassandra 6 on its way to general availability status (and an alpha already released), we’re approaching a turning point where we can revisit whether these trade-offs will still exist. The latest version delivers general-purpose ACID transactions through a new protocol called Accord. With Cassandra 6, those transactional guarantees will be native, without compromising Cassandra’s operational model or availability.

Transactions

In database parlance, a transaction says, “These operations belong together. They must all be applied, or none of them.” The classic example is a bank transfer. When you move money from one account to another, two things must happen: a debit and a credit. If the debit succeeds but the credit fails, money has disappeared. A transaction prevents this issue by guaranteeing the two operations are atomic, meaning they succeed or fail as a unit; combined with isolation, no other process can see an immediate or half-finished state.

Experiences like these depend on transactional guarantees at the data layer, which rely on ACID semantics, particularly atomicity and isolation, to prevent inconsistent intermediate states.

For most developers who have worked with relational databases, transactions are so fundamental they’re almost invisible. For Cassandra users, comparable guarantees across multiple partitions or tables historically required significant application-level coordination or weren’t natively supported.

Coordination at scale is fundamentally hard

Because Cassandra is designed to deal with data replication and scaling, coordinating atomic changes across multiple nodes is inherently challenging (e.g., decrement a balance here, increment one there). All participating replicas must agree on an order of operations. Distributed consensus protocols exist to solve exactly this, but prior approaches came with trade-offs.

Raft and Zab are examples of protocols that use leaders, which is not suitable for Cassandra since nodes are treated equally.

More information about prior solutions can be found in more details in CEP-15, but generally, leader-based approaches pose issues at scale.

The Accord protocol

The Accord protocol, proposed in CEP-15, is built to achieve fast, general-purpose distributed transactions that remain stable under the same failure conditions Cassandra already tolerates— with no elected leaders.

How it orders transactions

Accord is leaderless so any node can coordinate any transaction. Transactions are assigned unique timestamps using hybrid logical clocks, where each node appends its own unique ID to its clock value to ensure global uniqueness across the cluster. Conflicting transactions execute in timestamp order across all replicas. Under normal conditions, a transaction reaches consensus in a single round trip.

The reorder buffer

The challenge with timestamp-based ordering in a geo-distributed system is that two transactions started concurrently from different regions might arrive at replicas in different orders, breaking fast-path consensus. Accord solves this by having replicas buffer incoming transactions. The wait time is precisely bounded to be just long enough to account for clock differences between nodes and network latency, and no longer. This guarantees that replicas always process transactions in the correct order without needing extra message rounds.

Fast-path electorates

When replicas fail, other leaderless protocols fall back to slower, more expensive message patterns. Accord avoids this by dynamically adjusting which replicas participate in fast-path decisions as failures occur. The result is that Accord maintains fast-path availability under failure, avoiding the degradation to slower message patterns that other leaderless protocols experience.

The net effect: strict serializable isolation across multiple partitions and tables, in a single round trip, with no leaders, and preserving performance characteristics under the same minority‑failure conditions that Cassandra is designed to tolerate.

New CQL syntax to support transactions

The most visible change for developers is new CQL syntax. Transactions in Cassandra 6 are wrapped in BEGIN TRANSACTION and COMMIT TRANSACTION blocks, similar to SQL syntax.

Let’s examine a flight booking transaction that must simultaneously reserve a seat and deduct loyalty miles from two separate tables. Note: Cassandra 6 is pre-release. Syntax shown reflects the current alpha and may evolve before general availability.

BEGIN TRANSACTION LET seat = (SELECT available FROM flight_seats WHERE flight_id = 'ZZ101' AND seat_number = '14C'); LET miles = (SELECT balance FROM loyalty_accounts WHERE member_id = 'M-7823'); IF seat.available = true AND miles.balance >= 25000 THEN UPDATE flight_seats SET available = false, booked_by = 'M-7823' WHERE flight_id = 'ZZ101' AND seat_number = '14C'; UPDATE loyalty_accounts SET balance = miles.balance - 25000 WHERE member_id = 'M-7823'; END IF COMMIT TRANSACTION ;

Everything between BEGIN TRANSACTION and COMMIT TRANSACTION executes atomically with strict serializable isolation from the perspective of all other concurrent transactions. The LET clause reads current values from the database and binds them to variables. The IF block uses those values to guard the writes. If the seat is already taken or the member doesn’t have enough miles, nothing happens. Both updates either apply together or not at all, across two different tables and two different partition keys.

This is logic that previously had to live in the application, complete with retry handling, race condition guards, and compensating operations if something failed halfway through. Now it lives in the database.

Enabling Accord in Cassandra 6: The CMS dependency

We can’t talk about Accord without discussing Cluster Metadata Service (CMS). Before Accord transactions are functional, Cluster Metadata Service (CMS), introduced alongside Accord as CEP-21, must be enabled. For teams upgrading from Cassandra 5, this is the most significant operational change in the release.

CMS is required. Accord needs every replica to have the same authoritative view of cluster topology showing which nodes own which data, and which replicas participate in a given transaction. Before Cassandra 6, this information was propagated via the eventually consistent Gossip Protocol. This is suitable for normal reads and writes, but Accord’s correctness depends on knowing precisely who the transaction participants are before committing. CMS replaces Gossip-based metadata propagation with a distributed, linearized transaction log, giving all nodes a consistent view of cluster state. Without it, Accord’s guarantees don’t hold.

Upgrading from Cassandra 5 to 6—plan carefully

The upgrade cannot begin until every node in the cluster is running Cassandra 6. CMS initialization requires full cluster agreement; no mixed-version clusters are supported. Before upgrading, disable any automation that could trigger schema changes, node bootstrapping, decommissions, or replacements. These operations are blocked during the upgrade window, and if they fire on an older node before CMS is initialized, the migration can fail in ways that require manual intervention to recover.

Once all nodes are upgraded, run nodetool cms initialize on one node to activate CMS. This creates the service with a single member, which is enough to unblock metadata operations but is not suitable for production. Follow up immediately with nodetool cms reconfigure to add more members. CMS uses Paxos internally and requires a minimum of three nodes for a viable quorum, with more recommended for production depending on cluster size.

Important: CMS initialization is not easily reversible. Plan the upgrade window accordingly and treat it as a one-way operational step.

On a fresh Cassandra 6 cluster that wasn’t migrated from a previous version, CMS is automatically enabled. First, one node is designated as the initial CMS member. From there, CMS membership scales automatically based on cluster size, with the service adding members as the cluster grows without requiring manual intervention.

Of course, for Instaclustr users, our platform and techops team will take care of most of this for you and walk you through any requirements on your side when the time comes to upgrade.

Coexistence with Lightweight Transactions (LWT)

Existing LWT syntax (IF NOT EXISTS, IF EXISTS, conditional UPDATE/INSERT statements) continues to work and fundamentally differs from Accord transactions as LWT is scoped to a single partition and is extremely limited. Accord doesn’t replace or break existing applications. Using BEGIN TRANSACTION/END TRANSACTION is how developers opt into the broader cross-partition guarantees.

Why this is architecturally significant

Every prior approach to distributed transactions required accepting one of three constraints: a global leader (single point of failure, WAN latency penalty), limited to single-partition scope (LWT), or degraded performance under failure (prior leaderless protocols). The Accord paper’s central claim is that these constraints are not fundamental. They are artifacts of specific protocol design choices.

By combining flexible fast-path electorates with a timestamp reorder buffer on top of a leaderless execution model, Accord achieves:

  • True cross-partition atomicity across multiple tables and partition keys
  • Strict serializable isolation with formally proven correctness
  • Single round-trip latency under normal operating conditions
  • Failure‑tolerant steady‑state performance, avoiding the systematic degradation seen in earlier leaderless protocols
  • No elected leaders, consistent with Cassandra’s existing operational model

This opens workloads that were previously natively incompatible with Cassandra: financial transaction processing, distributed inventory reservation, multi-step workflow coordination, and any application where ‘commit these changes together or not at all’ is a strict correctness requirement.

Looking ahead

Though the Accord protocol is still maturing, the fundamental capability is finally here. We now have general-purpose, leaderless, multi-partition ACID transactions natively in Apache Cassandra.

The historically difficult problem of achieving strict serializable isolation in a geo-distributed system without compromising fault tolerance now has a proven, working answer.

For Cassandra users, this raises an exciting question: which workloads have you been routing to relational databases specifically because they needed transactional guarantees? It is time to reevaluate.

Stay tuned for a preview release of Cassandra 6 on the Instaclustr Platform and get ready to experience the power of ACID transactions on Cassandra for yourself!

The post Apache Cassandra® 6 Accord transactions: What you need to know appeared first on Instaclustr.

4 DynamoDB Configuration Changes for Significant Cost Savings

Learn about ways to cut DynamoDB costs with minimal code changes, zero migration, and no architectural upheaval If you’re running DynamoDB at scale, your bill might be tens of thousands of dollars higher than it needs to be. However, most teams don’t need a complete migration or architecture overhaul to save significantly. These configuration changes, all easily implemented, can reduce your costs by 50-80%. This guide covers the biggest wins for DynamoDB cost optimization, with the real math behind each recommendation. We will be sharing links to the ScyllaDB Cost Calculator at calculator.scylladb.com, which lets you model different workload scenarios with customized parameters and compare ScyllaDB pricing to DynamoDB pricing at the click of a button.   Switch from on-demand to provisioned + reserved capacity This is the single biggest DynamoDB cost lever for most teams. On-demand capacity is convenient at first, with no planning required and just pay-as-you-go. But it’s also expensive. After AWS’s recent price reduction, on-demand costs 7.5x more than provisioned capacity. Before the drop, it was roughly 15x. Either way, the math is brutal. Let’s look at a simple example: a mid-sized workload running 10,000 reads/sec and 10,000 writes/sec, 24/7. On-Demand: ~$239K/year Provisioned: ~$71K/year Reserved: ~$34K/year That’s a 7x difference between on-demand and reserved. Even if your workload isn’t perfectly predictable, reserved capacity often pays for itself within months. The trade-off here is that you need a predictable load and the financial flexibility to commit. If your traffic varies wildly (or you’re short-term focused) provisioned mode without reservation is the middle ground. Still, it’s 3.3x cheaper than on-demand. Optimize item sizes DynamoDB’s billing is granular: writes are charged per 1KB of item size, and reads per 4KB. This means a 1.1KB item costs the same as a 2KB item on writes. If your items are consistently over these thresholds by a small margin, you’re paying 2-3x more than necessary. Let’s look at the same simple example, but with increasing item size for comparison. On-Demand with 1KB items: ~$239K/year On-Demand with 10KB items: ~$2M/year On-Demand with 100KB items: ~$20M/year Common culprits for higher DynamoDB costs here: Nested JSON with whitespace or redundant fields Variable-length strings with no trimming Metadata or audit fields added to every item Base64-encoded payloads What should you do? Compress JSON payloads before storage, remove redundant attributes, move infrequently accessed data to a separate table, or use a columnar storage strategy. Trimming just 200 bytes per item – across millions of items and thousands of writes/sec – adds up to thousands per month. Deploy DAX (DynamoDB Accelerator) for read-heavy workloads If your workload skews heavily toward reads and you’re not using an in-memory cache layer yet, DAX is one of the highest ROI moves you can make. DAX sits in front of DynamoDB and caches frequently accessed items in memory. Cache hits bypass DynamoDB entirely, meaning you avoid the RCU charge. For hot items queried thousands of times per minute, a single DAX cluster can reduce DynamoDB read capacity needs. Let’s look at another simple example: a read-heavy workload running 80,000 reads/sec and 1,000 writes/sec, 24/7. On-Demand: ~$335K/year On-Demand with DAX: ~$158K/year The cost math: a medium sized DAX cluster (3 nodes, cache.r5g.8xlarge) costs roughly $9K/month. A high hit rate on your cache will proportionally reduce your more expensive read costs. That can lead to potentially hundreds of thousands of dollars saved with DynamoDB. Bonus: DAX also improves latency dramatically. Cache hits respond in microseconds rather than milliseconds. Use the DynamoDB Infrequent Access (IA) table class Not all tables are created equal. If you have tables where data is accessed rarely but storage is high (think audit logs, historical records, compliance archives, or cold lookup tables), then the Standard-IA table class can save you substantially on storage. The pricing difference: Standard class: $0.25/GB Standard-IA class: $0.10/GB (up to 60% savings) The catch is that IA has a minimum item size of 100 bytes and a minimum billing duration. It’s designed for cold data. So, if you’re frequently scanning or querying these tables, IA isn’t the right fit (read costs are identical, but you lose the write discount). However, for true archive tables accessed only occasionally, it’s a no-brainer. The bottom line These four DynamoDB changes require minimal code changes, zero migration, and no architectural upheaval. They’re configuration changes, caching tweaks, and data optimization. Combined, they typically deliver massive cost reductions. Start with switching to provisioned + reserved (highest impact), then layer in the others based on your workload shape. Ready to model your savings? Use the ScyllaDB Cost Calculator at calculator.scylladb.com to compare your current DynamoDB costs against these optimizations. And to save even more, see how ScyllaDB compares.

Shrinking the Search: Introducing ScyllaDB Vector Quantization

Learn how ScyllaDB Vector Quantization shrinks your vector index memory by up to 30x for cost-efficient, real-time AI applications Earlier this year, ScyllaDB launched integrated Vector Search, delivering sub-2ms P99 latencies for billion-vector datasets. However, high-dimensional vectors are notoriously memory-hungry. To help with memory efficiency, ScyllaDB recently introduced Vector Quantization. This allows you to shrink the index memory footprint for storing vectors by up to 30x (excluding index structure) without sacrificing the real-time performance ScyllaDB is known for. What is Quantization? To understand how we compress massive AI datasets, let’s look to the fundamentals of computer science. As Sam Rose explains in the ngrok blog on quantization, computers store numbers in bits, and representing high-precision decimal numbers (floating point) requires a significant number of them. Standard vectors use 32-bit floating point (f32) precision, where each dimension takes 4 bytes. Quantization is the process of compromising on this “floating point precision” to save space. By sacrificing some significant figures of accuracy, we can represent vectors as smaller 16-bit floats or even 8-bit or 1-bit integers. As Sam notes, while this results in a “precision compromise,” modern AI models are remarkably robust to this loss of information. They often maintain high quality even when compressed significantly. The Trade-off: Memory vs. Accuracy In ScyllaDB 2026.1, quantization is an index-only feature. The original source data remains at full precision in storage, while the in-memory HNSW index is compressed. This allows you to choose the level of “information loss” you are willing to accept for a given memory budget: Level Bytes/Dim Memory Savings Best For f32 (default) 4 1x (None) Small datasets, highest possible recall. f16 / bf16 2 ~2x Good balance of accuracy and memory. i8 1 ~4x Large datasets with moderate recall loss. b1 0.125 ~32x Maximum savings for massive datasets. CRITICAL NOTE: Quantization only compresses the vector data itself. The HNSW graph structure (the “neighbor lists” that make search fast) remains uncompressed to ensure query performance. Because of this fixed graph overhead, an i8 index typically provides a total memory reduction of ~3x rather than a raw 4x. Calculating Your Memory Needs To size your ScyllaDB Vector Search cluster effectively, be sure to consider both vector data and graph overhead. The total memory required for a vector index can be estimated with this formula: Memory ≈ N * (D * B + m * 16) * 1.2 N: Total number of vectors. D: Dimensions (e.g., 768 or 1536). B: Bytes per dimension based on quantization level (f32=4, i8=1, b1=0.125). m: Maximum connections per node (default 16). 1.2: 20% operational headroom for system processes and query handling. Example: 10 Million OpenAI Embeddings (768 Dimensions) Using this formula, let’s see how quantization affects your choice of AWS EC2 instances on ScyllaDB Cloud (which primarily utilizes the r7g Graviton and r7i Intel families): f32 (No Quantization): Requires ~40 GB RAM. You would need an r7g.2xlarge (64 GB) to ensure headroom. i8 Quantization: Requires ~12 GB RAM. You can comfortably drop to an r7g.xlarge (32 GB). b1 (1-bit): Requires ~4 GB RAM. This fits on a tiny r7g.medium (8 GB). By moving from f32 to i8, you can drop 2-3 instance tiers. This gets you significant cost savings. Improving Accuracy with Oversampling and Rescoring To mitigate the accuracy loss from quantization, ScyllaDB provides two complementary mechanisms. Oversampling retrieves a larger candidate set during the initial index search, increasing the chance that the true nearest neighbors are included. When a client requests the top K vectors, the algorithm retrieves ceiling(K * oversampling) candidates, sorts them by distance, and returns only the top K. A larger candidate pool means better recall without any extra round-trips to the application. Even without quantization, setting oversampling above 1.0 can improve recall on high-dimensionality datasets. Rescoring is a second-pass operation that recalculates distances using the original full-precision vectors stored in ScyllaDB, then re-ranks candidates before returning results. Because it must fetch and recompute exact distances for every candidate, rescoring can reduce search throughput by roughly 2x – so enable it only when high recall is critical. Note that rescoring is only beneficial when quantization is enabled; for unquantized indexes (default f32), the index already contains full-precision data, making the rescoring pass redundant. Both features are configured as index options when creating a vector index: CREATE CUSTOM INDEX ON myapp.comments(comment_vector) USING 'vector_index' WITH OPTIONS = { 'similarity_function': 'COSINE', 'quantization': 'i8', 'oversampling': '5.0', 'rescoring': 'true' }; When (and When Not) to Use Quantization Use quantization when: You are managing millions or billions of vectors and need to control costs. You are memory-constrained but can tolerate a small drop in recall. You are using high-dimensional vectors (≥ 768), where the savings are most pronounced. Avoid quantization when: You have a small dataset where memory is not a bottleneck. Highest possible recall is your only priority. Your application cannot afford the ~2x throughput reduction that comes with rescoring—the process of recalculating exact distances using the original f32 data to improve accuracy. Choosing the Right Configuration for Your Scenario Here are some guidelines to help you select the right configuration: Scenario Recommendation Small dataset, high recall required Use default f32 — no quantization needed. Large dataset, memory-constrained Use i8 or f16 with oversampling of 3.0–10.0. Add rescoring: true only if very high recall is required. Very large dataset, approximate results acceptable Use b1 for maximum memory savings. Enable oversampling to compensate for accuracy loss. High-dimensionality vectors (≥ 768) Consider oversampling > 1.0 even with f32 to improve recall. Try ScyllaDB Vector Search Now Quantization is just one part of the ScyllaDB 2026.1 release, which also includes Filtering, Similarity Values, and Real-Time Ingestion. With these tools, you can build production-grade RAG applications that are both blazingly fast and cost-efficient. Vector Search is available in ScyllaDB Cloud. Get Started: Check out the Quick Start Guide to Vector Search in ScyllaDB Cloud. Deep Dive: Read our past posts on building a Movie Recommendation App or our 1-billion vector benchmark. Documentation: View the full ScyllaDB Cloud Vector Search Documentation. Try ScyllaDB Cloud for free today and see how quantization can supercharge your AI infrastructure.

The Great Stream Fix: Interleaving Writes in Seastar with AI-Powered Invariants Tracing

How we used AI-assisted invariant-based testing to locate and resolve tricky hidden bugs with complex state transitions Seastar is a high-performance C++ framework for writing asynchronous server applications. It powers projects like ScyllaDB and Redpanda. One of its core rules is simple but strict: no blocking allowed. Every operation that could take time (e.g., reading from disk, writing to a socket, waiting for a lock) must be expressed asynchronously by returning a future that resolves when the work is completed. This makes Seastar applications extremely efficient on modern hardware. However, it also means that even seemingly mundane things, like writing data to a stream, require careful thought about ownership, lifetimes, and buffering. Moreover Seastar’s output stream has always experienced a limitation: the inability to freely mix small, buffered writes with large, zero-copy chunks. It was something that developers avoided and tolerated – but we always considered it something worth improving … someday. Fixing this requires a deep dive into complex state transitions, which inherently creates a high risk for introducing sequencing bugs. A standard coding approach won’t work; the task requires a way to trace the system’s state across millions of test cases. This post describes the process of using AI-assisted invariant-based testing to try to locate and resolve these tricky hidden bugs. TL;DR What could have been an extremely complicated fix ultimately was actually surprisingly smooth and effective. Output streams Output stream is Seastar’s output byte flow abstraction. It’s used wherever data needs to go out of an application. For example, it’s used for disk files, network connections, and stackable virtual streams that transform data on the fly (such as compression or encryption layers sitting on top of another stream). Whatever the underlying sink is, the output stream presents a uniform interface to the caller. It gives callers two ways to push data through: Buffered writes: Copy bytes into an internal buffer; flush when the buffer fills up or when explicitly requested. Zero-copy writes: Hand over memory buffers directly; the stream passes it to the sink without copying a single byte of the buffer data. Zero-copy is important for large blobs since we want to avoid copying megabytes of data. Buffered writes are important for building up small pieces efficiently. In a real application, it’s natural to interleave both: write a small header into the buffer, then attach a large payload as a zero-copy buffer, then write a small trailer. There is also a trim_to_size stream option. When enabled, the stream guarantees that no chunk delivered to the underlying sink exceeds the configured stream buffer size. This matters for sinks that have an upper limit on how much data they can accept in a single call – certain network APIs, for instance, or aligned disk I/O. Without it, a larger buffer can pass through as-is. The Problem Until recently, mixing the two write modes was not supported. Internally, buffered and zero-copy writes used two different storages: internal buffer for the former data, and dedicated container for the latter. There was no clean way to append buffered bytes onto the tail of pending zero-copy data while preserving ordering. The code simply asserted that the zero-copy container was empty whenever a buffered write arrived and vice-versa. The nearby code comment, however, stated that mixing writes was not supported yet – so the intention to fix it had always been there. The goal of the work described here was to make it happen. Start with the Tests We figured we should build a solid test foundation before touching the implementation. We had some pre-existing tests for output streams, but they were really just a collection of ad-hoc cases (specific input sequences with hardcoded expected outputs). This was fine for catching regressions but not great for systematically exploring the large space of possible inputs against drastic code changes. The new approach was invariant-based testing. Rather than checking exact output sequences, the tests need to verify that certain properties always hold, regardless of input. Specifically, we wanted to check that: All written bytes arrive at the sink, in order, with no corruption. Every chunk delivered to the sink (except the last) must be at least stream_size bytes with no undersized non-last chunks. With the trimming option enabled, all outgoing chunks must be exactly stream_size bytes. With these invariants defined, the test iterates over all combinations of chunk sizes (1 byte through 3x times the stream_size bytes) and all assignments of write type (buffered or zero-copy) to each chunk. For n chunks ,that’s 2^n type patterns plus trimming option giving about 1.6 million combinations in total. The ad-hoc tests were then removed – the invariant test subsumed them. One practical issue: 1.6 million cases ran fast in a regular build (~5 seconds), but under sanitizers (ASan, UBSan) it ballooned to over two minutes. Given the whole seastar test suite runs for several minutes, this new timing had to be improved somehow. The fix was to turn an exhaustive test into a fuzzy one: in debug builds, shuffle all 2^n masks, always keep the all-buffered and all-zero-copy patterns, and sample ~10% of the rest. That brought sanitizer runs down to less than twenty seconds. Implementing the Fix With tests in place, the implementation work began. The key challenge was making the internal buffer and zero-copy container interoperate cleanly. Two transitions required handling: Buffered → zero-copy Zero-copy → buffered Buffered → zero-copy When a zero-copy write arrives and there’s buffered data. That data needs to be folded into the zero-copy container so that ordering is preserved. The naive approach – trim buffer to its filled length and move it into container – works, but it wastes the rest of the buffer allocation. Instead, the filled buffer prefix is shared into the container as a view or sub-span, and the buffer itself is advanced past it, thus sharing the underlying memory. This way, the tail of the original allocation is still available for future buffered writes after the zero-copy sequence. No reallocation is needed on the mode switch. This tail – trimmer buffer, pointing at unused capacity within the original allocation – is what we call the remnant. It is a new concept introduced by this change. Before mixed-mode writes were supported, the buffer was always either full (and flushed) or freshly allocated. The remnant is an optimization. But (as will become clear shortly), its existence also introduced several subtle failure modes that took time to track down. Zero-copy → buffered When a buffered write arrives and the zero-copy container is non-empty, the new data can just be appended to the internal buffer regularly. A new zero-copy arrival, if it happens, would keep the buffer ordering as described above. The flushing path was also updated to handle the case where buffered and zero-copy data are both available at that time. The buffered tail is folded into the zero-copy vector before forwarding both downstream. Bugs Found Along the Way Once the implementation was written, the tests were run. They failed – many times, in different ways. To facilitate debugging, each test failure logged the exact stream configuration and the write sequence that triggered the problem, for example: stream_size=5 trim_to_size=0 input_chunks=[3, 2] write_types=[buffered, zero_copy] This made the diagnosis unusually direct. Given a specific failing case, we could trace the execution step by step – mentally simulating the stream, tracking the bytes accumulated in the buffer and zero-copy container state at each point, and comparing that against what the state should be at that moment. The stream’s logic is not huge, but it has enough branches and interacting state variables to make it excessively time- and effort-consuming for a human. So, we turned to AI to hopefully save time and effort. Each time a test failed, instead of walking the input_stream state step-by-step, we prompted the AI with the exact test-case description string, the code itself, and the request to find where the problem could be. The AI then started “thinking” and its mind-flow looked as if it was simulating the input_stream code execution step-by-step, like a developer would do, just much faster. This turned out to be an incredibly fast way to find where the invariant breaks. In many cases, it could correctly analyze the desired state the stream had to happen in. Sometimes, it even suggested excellent ways to fix the problem that was found. Identifying the root cause is often only half the work; understanding which fix is correct, safe, and doesn’t introduce new assumptions that will break later is the other half. AI found the problems for us and was quite helpful in helping us fix them. For example, one of the first debugging sessions with AI looked like this… Developer:
The test fails like that fatal error: in "test_splitting_invariants": data integrity check failed: stream_size=5 trim_to_size=0 input_chunks=[1, 1, 1, 4] write_types=[zero_copy, buffered, buffered, buffered] Please, analyze
AI: Let me trace through this case carefully. stream_size=5, trim_to_size=false, writes: [zc(1), buf(1), buf(1), buf(4)] After zc(1): _zc_bufs = [tb(1)], _zc_len = 1, _end = 0 _zc_len < _size, nothing flushed After buf(1): Goes to write(const char*, 1) fast path — but !_buf, so falls to slow_write _end=0, n=1 < _size=5, so skips the !_end && n >= _size branch Allocates _buf, copies 1 byte, _end = 1 After buf(1): Fast path: fits in _buf, _end = 2 After buf(4): _end=2, n=4, n > _size - _end = 3, so goes to slow_write _end=2, so skips the !_end && n >= _size branch now = min(4, 5-2) = 3, copies 3 bytes, _end = 5 now != n (3 != 4), so allocates next with max(4-3, 5) = 5 bytes, copies 1 byte into it _zc_bufs is non-empty, so enters the new branch: seals _buf.share(0, 5) into _zc_bufs _zc_len += 5 → _zc_len = 6 _zc_len += n - now = 1 → _zc_len = 7 pushes next into _zc_bufs trim_to_size=false → calls zero_copy_put(_zc_bufs) So zero_copy_put sends [tb(1), tb(5), tb(5)] — but next only has 1 byte of valid data in it, yet it’s a 5-byte allocation! We’re pushing the full _size-sized next buffer but only n - now = 1 bytes were written into it. The buffer isn’t trimmed before being pushed. The fix: trim next to n - now bytes before pushing it into _zc_bufs when _zc_bufs is non-empty. Shall I go ahead and fix it? The root-case was obvious and the suggested fix was simple and correct. So, we accepted the suggestion and the debugging session went on. We went through several rounds like that, with each step producing more lengthy analysis and more sophisticated fix proposals. At some point, a test with a small buffered write followed by a zero-copy write showed data corruption downstream. Tracing the execution revealed that after trimming the internal buffer past the filled region, the trimmed buffer still appeared non-null even though it had zero usable capacity remaining. The fast-path check for buffer availability wasn’t prepared for that and didn’t trigger reallocation on the next write. As a result, bytes were written into unowned memory. Another case involved a buffered write code branch which handles writes larger than stream_size by chopping them into chunks. After a mode switch, the internal buffer might become a small remnant – smaller than the tail chunk the code in question was trying to store there. Tracing through the code, we saw that it was the place where the capacity check wasn’t prepared to meet with the remnant buffer. It silently assumed that the internal buffer always had room for a full-sized tail. The result was an ASan-detected heap overflow. A more subtle issue arose around the same remnant buffer in a different scenario. When buffered write chopping code encounters a tail chunk that is smaller than the stream_size, but larger than the remnant's remaining capacity, it has to make a choice. It could either fill the remnant partially and asynchronously put it before allocating a fresh buffer for the rest, or simply abandon the remnant and allocate a fresh full buffer. The first option is more space-efficient, but would require an async flushing inside what is otherwise a synchronous setup step, significantly complicating the code. The second option wastes the unused bytes of the remnant's allocation – but crucially, it doesn’t leak them. The remnant shares its underlying allocation with the sealed buffer already in the zero-copy container, so the memory is freed once that buffer is flushed and all references to the allocation are dropped. The deliberate trade-off – wasted but not leaked – was worth making, and a comment in the code explains the reasoning for whoever reads it next. Each bug effectively had the same shape: a subtle assumption about stream state that held in the original single-mode code silently broke in mixed-mode scenarios. The invariant test exposed the bugs by providing a minimal reproducible case and a clear description of which invariant was violated. Plus, it also made each one straightforward to reason about and fix. The Result The work touches tests and implementation in roughly equal measure, which feels about right for a change like this. The test suite grew from a handful of hand-crafted cases into an exhaustive invariant-based framework that covers all combinations of chunk sizes and write types – something that would have been impractical to write by hand. On the implementation side, the long-standing restriction on mixed-mode writes is gone. Buffered and zero-copy writes can now be freely interleaved in any order, with the stream handling the transitions internally. This preserves ordering and the chunk-size invariants that sinks depend on. In general, writing a test that covers as many possible situations as possible and then making sure that the code passes those tests is a very good approach. It makes sure the end code is correct. In rare cases when the test covers all possible situations the code may have to deal with, we can say that “the code is officially bug free.” Making AI facilitate testing turned out to be the best decision made in this work. Given the amount of test cases and the number of possible combinations of input_stream inner states, debugging each failing test case would be a nightmare for the developer.

The Hidden Insanity of DynamoDB Pricing

Learn how to navigate some of the sneakiest aspects of DynamoDB pricing DynamoDB’s pricing model has some head-scratching quirks that slyly inflate bills by hundreds of thousands of dollars per year. Most of these aren’t malicious; they’re just design decisions from 2012 that made sense at the time, but became increasingly absurd at scale. This post walks through four of the most egregious examples and the real cost impact on teams running large workloads. Cost per item size is punitive DynamoDB charges you for writes per 1KB chunk and reads per 4KB chunk. This means: 1KB write = 1 WCU 1.1KB write = 2 WCUs (you’re charged for 2KB, but only used 1.1KB) 1.5KB write = 2 WCUs 2.1KB write = 3 WCUs Every byte over a threshold doubles your cost for that operation. It’s a tax on items that don’t fit neatly into the billing boundary. And almost nothing fits neatly: JSON payloads with nested objects, variable-length strings, metadata, timestamps… Most real-world items end up hitting those boundaries, so you risk paying 2x or more for the overage. Consider a team logging 100M events per day, averaging 1.2KB each. That’s ~120M writes, almost all hitting the 2KB billing threshold. They’re paying for 200M KB instead of 120M KB. That’s a 67% surcharge baked into every bill. If their write cost is $10,000/month, that surcharge alone is ~$6,700/month in wasted capacity. On demand comes at a premium On-demand pricing was introduced as a convenience layer for unpredictable workloads. It saves teams the pain of provisioning and forecasting (“just pay for what you use”). The trade-off is that pricing is steep. Even after AWS’s recent price cut (it used to be ~15x!), on-demand is 7.5x more expensive than provisioned capacity. For a team that starts on on-demand and never switches, the cost difference is catastrophic. For example, say a SaaS company launches a new product on DynamoDB; they start with on-demand for convenience and quickly scale to 20K reads/sec and 20K writes/sec. On-demand now costs $39K/month. Switching to provisioned would drop that down to $11K/month. And teams often don’t switch because ‘it works’ or ‘the bill surprise hasn’t happened yet.’ The convenience tax on DynamoDB is insane. Even if you wanted to retain that flexibility, ScyllaDB would cost $3K/month for on-demand or just $1K/month with a hybrid subscription + flex component. Multi-region network costs are deceptive Global Tables already charge replicated writes (rWCUs) at a premium. But there’s a second hidden cost too: data transfer. AWS charges for cross-region data transfer at standard EC2 rates: $0.02/GB to adjacent regions, up to $0.09/GB to distant regions. As a result, Global Tables end up costing 2-3x more than expected. These hidden network costs often don’t appear as a line item on your DynamoDB bill. They’re rolled up into ‘Data Transfer’ charges. Many teams don’t notice or attribute it correctly. ScyllaDB can’t escape the variable costs of cross-region data transfer that AWS enforces. However, we have a number of cost reduction mechanisms that assist with these costs. ScyllaDB handles multi-DC replication natively. You provision nodes in each data center, and replication is built into the protocol. There are also shard-aware and rack-aware drivers, which help minimize network overhead. Add network compression, and your cross-region data costs get even lower. Reserved capacity requires you to predict capacity Reserved capacity offers massive discounts, up to 70% off. But there’s a catch: you must commit for 1 or 3 years upfront, and you must predict your read and write throughput independently. This is absurdly difficult. Your workload changes: new features launch, old features get deprecated, customer behavior shifts, and traffic patterns evolve. Predicting the exact read/write ratio years out is impossible. Teams either over-commit (wasting money on unused capacity) or under-commit (paying on-demand rates for the overage). Example: You commit to 200K reads/sec and 500K writes/sec for 1 year. On DynamoDB, that is going to cost $1.4M/year for the upfront and annual commitment. But six months into the year, growth exceeds your capacity estimates and your application starts having requests throttled. You revert to autoscaling a mixture of reserved plus on-demand. Now, you’re paying the 7.5x markup – and that costly misjudgment is locked in for the remainder of the year. The solution? Over-commit to hedge your bets. This guarantees you’re wasting money on overprovisioning, just to avoid even higher on-demand charges. It’s a no-win scenario. Compare this to ScyllaDB with a hybrid subscription + flex component that automatically scales to your requirements throughout the year, which might cost $133K/year to start with. Radically less expensive and more flexible (on both compute and storage requirements) thanks to true elastic scaling with X Cloud. Why does this matter? These four pricing quirks aren’t hypothetical. Combined, they add tens of thousands to six figures per year to bills across the industry. They’re especially brutal for write-heavy workloads, multi-region systems, and large items. And because they’re partially hidden, buried in separate line items, masked by the per-operation model, or justified by architectural constraints… Teams often don’t realize how much they’re paying. Some of this is inevitable with a fully managed service. But databases built on different cost models can deliver the same durability, consistency, flexibility and scale at a fraction of the price. For example, this is the case with ScyllaDB, which charges by the node and includes replication and large items at no extra cost. Curious what your workload actually costs? Use the ScyllaDB DynamoDB Cost Calculator at calculator.scylladb.com to model your real costs, including all the hidden charges, and see how ScyllaDB pricing stacks up.  

Powering a Billion Dreams: Scaling Meesho’s E-commerce Platform

How ScyllaDB plays a critical role in handling Meesho’s millions of transactions – optimizing our catalog rankings and ensuring ultra-low-latency operations With over a billion Indians set to shop online, Meesho is redefining e-commerce by making it accessible, affordable, and inclusive at an unprecedented scale. But scaling for Bharat isn’t just about growth—it’s about building a tech backbone that can handle massive traffic surges, dynamic pricing, real-time recommendations, and seamless user experiences. Let me take you behind the scenes of Meesho’s journey to democratize e-commerce while operating at monster scale. We’ll cover how ScyllaDB plays a critical role in handling Meesho’s millions of transactions – optimizing our catalog rankings and ensuring ultra-low-latency operations.  Note: Adarsha Das from Meesho will be presenting a keynote at the upcoming Monster Scale Summit India/APAC. That talk is on BharatMLStack, an open-source, end-to-end machine learning infrastructure stack built at Meesho to support real-time and batch ML workloads at Bharat scale. Join Monster Scale Summit India/APAC — it’s free and virtual About Meesho In case you’re not familiar with Meesho, we’re an Indian e-commerce platform. The company was founded in 2015 to connect small and medium enterprises in India. Meesho helps consumers from these areas access products from all over India, beyond their local markets. Meesho focuses on bringing affordable product selections to Tier 2 cities and smaller markets. The company operates with a zero-commission model that reduces barriers for sellers. We function as an asset-light marketplace that connects sellers, logistics partners, and consumers. We make the listing process quite simple. Sellers just need to take a picture of the product, upload it, set the price, and start selling. Why Personalization is Essential for Meesho Meesho’s architecture aims to support people who are new to e-commerce. Tech-savvy users from Tier 1 cities likely know how to use search, tweak keywords, and find what they want. But someone from a Tier 2 city, new to e-commerce, needs discovery to be simpler. That’s why we invested in a lot of tech to build personalized experiences on the app. Specifically, we invested significantly in AI and personalization technologies to create intuitive app experiences. We personalize all the way from the moment the app opens to order completion. For example, different users see different homepages and product selections based on their preferences and purchase history. We also personalize for sellers, helping them create product descriptions that make sense to their buyers. Real-Time Feed-First Personalization Meesho meets these needs with a fundamentally feed-first app. We create a tailored product feed, ranking products based on preferences and actions (searches, clicks, etc). To do this, we built a CTR (click-through rate) prediction model to decide what product tiles to show each user, and in what order. Two people logging in will see different selections based on their behavior. Given all this, Meesho had to move from traditional recommendation systems to real-time, highly personalized experiences. Batch processing wasn’t sufficient; our personalization must respond instantly to recent user actions. That requires low-latency databases and systems at scale, with the ability to support millions of sellers and users on the app simultaneously. Why ScyllaDB We experimented with a few different databases and data stores: SQL, NoSQL, columnar, and non-columnar. Some worked at certain scales. But as we kept growing, we had to reinvent our storage strategy. Then we discovered ScyllaDB, which met our needs and proved itself at Meesho scale. More specifically, ScyllaDB provided… Horizontal Scaling Given the ever-increasing scale of Meesho – where user transactions kept increasing and users kept growing over years – horizontal scalability was very important to us. Today, I might be running with X nodes. If that becomes 2X tomorrow, how do you scale in a live manner? Being a low-cost e-commerce platform, we are conscious about server spend, so we try to emulate traffic patterns by dynamically scaling up and down based on demand. For example, not all 24 hours have the same number of orders; there are peaks and lows. We want to provision for baseline load and auto-scale for demand without downtime, since the cost of downtime for a business like ours is very high. Downtime can result in user churn and loss of trust, so we prioritize reliability and availability above all. Moreover, we expect that adding new nodes will linearly increase throughput. For example, if I run an X-node cluster and add nodes, I should get a proportionate throughput increase. This is critical as we scale up or down. We observed that in distributed systems with a primary-secondary configuration, the primary can become a bottleneck. So, we wanted a peer-to-peer architecture like ScyllaDB’s, where each node can service writes as well as reads. ScyllaDB gives us linear scalability. Low-Level Optimizations for Efficiency The database’s efficiency is also a factor for us. A major challenge we saw in JVM- or Java-based systems was garbage collection and related overheads. These impact performance, interrupt scaling, and limit hardware utilization. That’s why we prefer C++-based or other low-level language implementations, with minimal JVM or garbage collection issues, and minimal memory overhead. Most of our use cases require low-latency, real-time personalization, where every bit of memory is used for application logic and data, not overhead. Smart Architecture and Fault Tolerance Having a smart, fault-tolerant architecture was another consideration. Much of our user base is in Tier 3 and 4 cities, where network connectivity is sometimes flaky. We want to provide a Tier 1 user experience to Tier 4 users, so low latency is critical. We prioritize keeping latency within a few milliseconds. One of ScyllaDB’s key features is token-aware routing. When a query comes, it goes directly to the node with relevant data – reducing network hops since each node acts as its own master. This is the kind of distributed architecture we were looking for, and the token-level routing helps with horizontal scalability. Reliability and fault tolerance are also major requirements. When running on a public cloud, a big pain point is a particular zone going down. We’ve seen cloud regions and zones go down before. To minimize impact, we look for automatic data replication across zones and seamless failover in case of failures, so that user impact is minimal. Building trust with first-time e-commerce users is hard. If we lose it, getting them back is even harder. That’s why this capability is critical. Operational Simplicity Another thing we wanted is operational simplicity—having a system where adding or removing nodes is as simple as running a script or clicking a button. We like having an engine where we don’t need to tune everything ourselves. Results So Far We’ve been using ScyllaDB to power very low-latency systems at high throughput, for both reads and writes. We started with small workflows, scaled to platform workflows like ML platform and experimentation performance, and continue to scale. It’s been a good journey so far, and we’re looking forward to using it for more use cases.

Agentic AI State Management with ScyllaDB and LangGraph

How to combine LangGraph and ScyllaDB for durable state management, crash recovery, and a highly available backend for your agentic AI applications. Most agent implementations today are request-response loops. The challenge with this approach is that you are just one network issue or server restart away from losing context and progress. We have more powerful LLMs than ever, yet we’re wrapping them in fragile infrastructure. As an example, assume you have an agent process that takes three minutes and involves seven API calls. There are a lot of places where it can go wrong. The process dies, the state disappears, and the agent starts over with no recollection of what it was doing. Implementing a well-designed workflow orchestration client is not enough to solve this problem. You also need to implement a distributed and highly available backend to support your agents, something with: multi-region, durable storage automatic data replication fault tolerance high-throughput This post shows you how you can simplify your backend by using a single mature database that handles both high availability and durable storage for your agents. You write agent state to a persistent store, it survives crashes by default, and you can still meet 5ms P99 latency requirements. Pair that with an orchestration framework like LangChain’s LangGraph that saves state after every step, and you get a reliable and scalable agentic backend. Let’s see why and how you should implement a system like that with ScyllaDB. Achieving zero agent downtime with ScyllaDB ScyllaDB is a high-performance distributed NoSQL database designed to stay up and available for mission-critical applications. The Raft consensus algorithm handles topology changes and schema updates with strong consistency. Replication is automatic: you set a replication factor and ScyllaDB distributes copies across nodes, racks, and datacenters. On temporary node loss, Hinted Handoffs record missed writes and replay them when the node returns. For longer outages, row-level repair brings a replacement node up to date in the background. You don’t need load balancers, external replication jobs, or manual failover steps. ScyllaDB Cloud is a mature cloud offering. Multi-region clusters with tunable replication factors per datacenter, rack and availability-zone awareness, and zero-downtime operations are all available out of the box, with no extra components required. ScyllaDB also provides practical features for agentic use cases… Persistent by design Every write goes to durable storage. There is no configuration flag to enable durability; it is the default, not an option. Persistence allows your agent to recover from crashes and continue a process. Data model In ScyllaDB, you design tables around the queries your application will run. A partition key determines which node owns the data, rows within a partition are sorted by a clustering key, and that sort order is fixed at schema creation time. This design is a great fit for key-value agentic systems. Lightweight transactions ScyllaDB supports LWTs to provide compare-and-set semantics natively, without client-side locking: INSERT IF NOT EXISTS and UPDATE ... IF ... This feature enables idempotent checkpoint writes. Time-to-live Agentic sessions eventually go stale. ScyllaDB provides a native way to expire old data from your database. ScyllaDB’s role in your agentic infrastructure Now let’s explore specific use cases where ScyllaDB helps you build agentic applications. The following examples use LangGraph (TypeScript) and the community-created ScyllaDBSaver checkpointer. What is a checkpointer? Checkpointer is LangGraph’s abstraction for a persistence backend. This is how LangGraph integrates with databases. Durable conversation memory One of the main technical problems with agents is handling failures such as: network hiccups server restarts other reasons a process gets killed midway through The in-memory state is gone, and the agent behaves as if the conversation never happened. LangGraph’s MemorySaver (built-in in-memory checkpointer) makes this reproducible. Run two turns, discard the saver object, create a new one, and run a third turn: thread_id: a named conversation/session in LangGraph; all checkpoints for one conversation share the same thread. With ScyllaDB as the checkpointer, all three requests operate identically from an application standpoint. The agent picks up exactly where it left off because the conversation state lives in the database rather than in the server process. ScyllaDBSaver example: The query that loads state on every invoke() is: Note that we don’t use ORDER BY or run a full-table scan. There’s only one row returned: the most recent checkpoint for the thread. Why does LIMIT 1 return the newest row without an explicit sort? Let’s see how the ScyllaDB data model enables this kind of query. Source: https://aws.amazon.com/blogs/database/build-durable-ai-agents-with-langgraph-and-amazon-dynamodb/ Query-first schema design: reading the latest checkpoint LangGraph reads the latest checkpoint on every invoke(). In a busy agent server, that is a read-heavy query pattern. The checkpoints table is defined with a compound primary key: The partition key is (thread_id, checkpoint_ns). That means this key will be used to partition your data across the ScyllaDB cluster. All checkpoints for a single conversation land in the same partition. “Get all steps for this conversation” never requires cross-node coordination. The clustering key is checkpoint_id DESC. It makes sure that the rows within each partition are sorted according to that column in descending order. Because checkpoint_id is a UUIDv6 (which encodes a timestamp in its bit layout), rows are physically stored on disk with the newest checkpoint first. LIMIT 1 on a partition scan reads only the first row; no full scan is required. Source: https://docs.langchain.com/oss/python/langgraph/persistence Crash recovery with idempotent writes A node in an agent graph can fail mid-execution after it has already written some of its output. Without a write-ahead log, the only safe option on retry is to re-run the node from scratch. This may produce duplicates, trigger external side effects, or be expensive for long-running LLM calls. ScyllaDB and LangGraph solves this with a second table, checkpoint_writes, that acts as a write-ahead log at the channel level: Before a checkpoint row is written to checkpoints, each individual channel write is staged in checkpoint_writes using a lightweight transaction: IF NOT EXISTS is an idempotent insert. Here’s what happens if the server crashes after three of five channel writes have landed and then restarts: LangGraph loads the latest checkpoints row It loads the pending checkpoint_writes for that checkpoint ID It finds the three completed writes It resumes from there without re-running successful steps The partition key on checkpoint_writes is (thread_id, checkpoint_ns, checkpoint_id). All pending writes for a single checkpoint are in the same partition. “Load all pending writes for checkpoint X” is a single-partition scan, not a cross-cluster lookup. The two tables serve different query patterns. Keeping them separate makes both queries efficient. Time-travel and conversation history LangGraph exposes historical snapshots through the checkpointer’s list() method: Each tuple is a full CheckpointTuple: the serialized state at that step, the metadata (source, step number, what changed), and the config needed to resume from that exact point. That last part is what enables time-travel: pass a past checkpoint_id as the starting configuration and LangGraph replays from there, branching the conversation into an alternative trajectory without modifying the original history. Here’s the underlying ScyllaDB query: You get all rows for one thread in one partition, sorted newest-first. This is the same partition that hosts the latest-checkpoint read. No additional indexes are required for the history use case. The source field indicates what kind of step produced it: "input" (user message ingested, before any node ran) "loop" (a node executed) "update" (state was patched directly via graph.updateState()). Secondary indexes on source and step allow filtering across all threads when needed: Auto-expire data with time-to-live Production agent deployments accumulate checkpoint data continuously. A customer support agent with 10,000 active threads, each with a 10-turn history, generates tens of thousands of checkpoint rows. Sessions eventually go stale. You might decide, for example, that a thread abandoned by the user after one message can be deleted and stored elsewhere after a certain period of time. In ScyllaDB, TTL is part of the data model. You attach it directly to the inserted row at write time: USING TTL 86400 tells ScyllaDB to delete this row after 24 hours. The same TTL clause appears on checkpoint_writes in the same write batch. The ScyllaDBSaver accepts a ttlConfig parameter that applies this clause to every write: Change defaultTTLSeconds and every subsequent write picks up the new expiry. No migration required. Integrate ScyllaDB into your LangGraph project To use ScyllaDB as a persistent store in your LangGraph application, you need to install the ScyllaDB checkpointer. This package will handle the migration and all subsequent CQL queries for you. Install the package: npm install @gbyte.tech/langgraph-checkpoint-scylladb Create the schema: npm run migrate # runs: CREATE KEYSPACE IF NOT EXISTS langgraph ... # CREATE TABLE IF NOT EXISTS langgraph.checkpoints ... # CREATE TABLE IF NOT EXISTS langgraph.checkpoint_writes ... Wire the checkpointer into your graph: Wrapping up By combining LangGraph with ScyllaDB’s built-in durability and high availability, you move from fragile, stateful processes to resilient agent systems. Restarts, retries, or lost context won’t be a problem because your architecture treats failure as a normal condition and continues seamlessly. This shift simplifies your infrastructure as well as enables more ambitious, long-running agent workflows to operate reliably at scale. Learn more about ScyllaDB and agentic applications: Clone the example application Read how others use ScyllaDB for AI use cases Sign up for ScyllaDB Cloud

Why We Changed ScyllaDB’s Approach to Repair

By focusing solely on unrepaired data, we made ScyllaDB’s incremental repair 10X faster Maintaining data consistency in large-scale distributed databases often comes at a high performance cost. As clusters grow and data volume expands rapidly, traditional repair methods often become a bottleneck. At ScyllaDB, we needed a way to make consistency checks faster and more efficient. In response, we implemented Incremental Repair for ScyllaDB’s tablets data distribution. It’s an optimization designed to minimize repair overhead by focusing solely on unrepaired data. This blog explains what it is, how we implemented it, and the performance gains it delivered. What is Incremental Repair? Before we talk about the incremental part, let’s look at what repair actually involves in a distributed system context. In a system like ScyllaDB, repair is an essential maintenance operation. Even with the best hardware, replicas can drift due to network hiccups, disk failures, or load. Repair detects mismatches between replicas and fixes them to ensure every node has the latest, correct version of the data. It’s a safety net that guarantees data consistency across the cluster. Incremental repair is a new feature in ScyllaDB (currently available for tablets-based tables). The idea behind it is simple: why worry about data that we have already repaired? Traditional repair scans everything. Incremental repair targets only unrepaired data. Technically, this is achieved by splitting SSTables into two distinct sets: repaired and unrepaired. The repaired set is consistent and synchronized, while the unrepaired set is potentially inconsistent and requires validation.       We created two modes of incremental repair: incremental and full. In incremental mode, only SSTables in the unrepaired set are selected for the repair process. Once the repair completes, those SSTables are marked as repaired and promoted into the repaired set. This should be your default mode because it significantly minimizes the IO and CPU required for repair. In full mode, the incremental logic is still active, but the selection criteria change. Instead of skipping the repaired set, it selects all data (both repaired and unrepaired). Once the process is finished, all participating SSTables are marked as repaired. Think of this as a “trust but verify” mode. Use this when you want to revalidate the entire data set from scratch while still using the incremental infrastructure. Finally, there’s disabled mode, where the incremental repair logic is turned off. In this case, the repair behavior is exactly the same as in previous versions of ScyllaDB – before the incremental repair feature was introduced. It selects both repaired and unrepaired SSTables for the repair process. After repair completes, the system does not mark SSTables as repaired. This is useful for scenarios where you want to run a repair without affecting the metadata that tracks the repair state. Incremental repair is integrated directly into the existing workflow, with three options: nodetool lets you use the standard nodetool cluster repair command with incremental flags. ScyllaDB Manager also supports the same flags for automated scheduling. A REST API makes incremental repair available to teams building custom tools. Making incremental repair work (internals) To make incremental repair work, we had to solve a classic distributed systems problem: state consistency. We need to know exactly which data is repaired, and that state must survive crashes. So, we track this using two decoupled markers. repaired_at, number stored directly in the SSTable metadata on disk. sstables_repaired_at, a value stored in our system tables. The logic follows a two-phase commit model. Phase one: prepare. First, we run the row-level repair. Once that is finished, we update the repaired_at value in the SSTables. At this point, the system still treats them as unrepaired because they haven’t been activated yet. Phase two: commit. After every node confirms the row-level repair and the updated repaired_at value, we update the sstables_repaired_at value in the system table. We define an SSTable as repaired if and only if it is not zero and it is less than or equal to the system’s repaired_at value. If a node crashes between phases, the mismatch between the file and the system table ensures that we don’t accidentally skip data that wasn’t 100% verified. Under normal operations, you don’t need to run full repairs regularly. Still, it’s needed occasionally. If you experience significant loss of SSTables (perhaps due to a disk failure), then a full repair is required to reconstruct missing data across the cluster. In practice, we suggest one full repair after a long series of incremental runs. This gives you an extra layer of security, even if it is not strictly required. This brings us to a critical challenge: compaction. If we let compaction mix repaired and unrepaired data, the repaired status would be lost, and we’d need to re-repair everything. To solve this, we introduce compaction barriers. We effectively split the tablets into two independent worlds. The unrepaired set, where all new writes and memtable flushes go. Compaction only merges unrepaired SSTables with other unrepaired ones. The repaired set, where SSTables are compacted together to maintain and optimize the read path. The rule is that a compaction strategy can never merge an unrepaired SSTable into the repaired set. The only bridge between these sets is repair. This prevents potentially inconsistent data from polluting the repaired set. With this new design, compaction now has a dependency on repair. As we run repairs, the repaired sets grow. But because incremental repair is so much lighter than traditional methods, we encourage you to run it much more frequently. We are currently working on an automatic repair feature that will trigger those runs at the very moment the unrepaired set grows too large. That should keep your unrepaired window as small as possible. The efficiency of incremental repair depends on your workload: Update heavy: If you have lots of overwrites or deletes, new data will invalidate older repaired SSTables. In extreme cases, there could be so much data to repair that it looks a lot like full repair. Append heavy: This is a perfect use case, like IoT or logging. Since new data doesn’t invalidate old data, the repaired set stays consistent and untouched. This should provide nice performance gains. Even in update-heavy cases, don’t lose anything by choosing incremental repair. In the worst case, it performs the same amount of work as a full repair would. In almost all real-world scenarios, you could gain significant improvements without any trade-off with respect to consistency. Performance Improvements To understand how this approach translates to performance improvements, let’s model the improvement ratio. Say that n is the size of your new unrepaired data and E is the size of your existing repaired data. A full repair works on E plus n. Incremental repair works only on n. So, the improvement ratio equals n divided by E plus n. If you ingest 100 gigabytes a day on a 10-terabyte node, you are repairing only 1% of the data instead of 100%. This is an order-of-magnitude shift in overhead. Our testing confirms that theory. We ran multiple insert and repair cycles. In the first round, nearly all the data was new, so the repair time was almost the same as with full repair. In the second round, with a 50/50 split, the time dropped by half. In the third round, as the repaired set became dominant, incremental repair took only 35% of the time that a full repair would have taken. To wrap up, incremental repair for tablets is faster, lighter, and more efficient. It is a foundational step toward our goal of a fully autonomous database that handles its own maintenance. By adopting this feature, you reduce operational burden and ensure your cluster remains consistent without repair storms.

Stop Answering the Same Question Twice: Interval-Aware Caching for Druid at Netflix Scale

By Ben Sykes

In a previous post, we described how Netflix uses Apache Druid to ingest millions of events per second and query trillions of rows, providing the real-time insights needed to ensure a high-quality experience for our members. Since that post, our scale has grown considerably.

With our database holding over 10 trillion rows and regularly ingesting up to 15 million events per second, the value of our real-time data is undeniable. But this massive scale introduced a new challenge: queries. The live show monitoring, dashboards, automated alerting, canary analysis, and A/B test monitoring that are built on top of Druid became so heavily relied upon that the repetitive query load started to become a scaling concern in itself.

This post describes an experimental caching layer we built to address this problem, and the trade-offs we chose to accept.

The Problem

Our internal dashboards are heavily used for real-time monitoring, especially during high-profile live shows or global launches. A typical dashboard has 10+ charts, each triggering one or more Druid queries; one popular dashboard with 26 charts and stats generates 64 queries per load. When dozens of engineers view the same dashboards and metrics for the same event, the query volume quickly becomes unmanageable.

Take the popular dashboard above: 64 queries per load, refreshing every 10 seconds, viewed by 30 people. That’s 192 queries per second from one dashboard, mostly for nearly identical data. We still need Druid capacity for automated alerting, canary analysis, and ad-hoc queries. And because these dashboards request a rolling last-few-hours window, each refresh changes slightly as the time range advances.

Druid’s built-in caches are effective. Both the full-result cache and the per-segment cache. But neither is designed to handle the continuous, overlapping time-window shifts inherent to rolling-window dashboards. The full-result cache misses for two reasons.

  • If the time window shifts even slightly, the query is different, so it’s a cache miss.
  • Druid deliberately refuses to cache results that involve realtime segments (those still being indexed), because it values deterministic, stable cache results and query correctness over a higher cache hit rate.

The per-segment cache does help avoid redundant scans on historical nodes, but we still need to collect those cached segment results from each data node and merge them in the brokers with data from the realtime nodes for every query.

During major shows, rolling-window dashboards can generate a flood of near-duplicate queries that Druid’s caches mostly miss, creating heavy redundant load. At our scale, solving this by simply adding more hardware is prohibitively expensive.

We needed a smarter approach.

The Insight

When a dashboard requests the last 3 hours of data, the vast majority of that data, everything except the most recent few minutes, is already settled. The data from 2 hours ago won’t change.

What if we could remember the older portions of the result and only ask Druid for the part that’s actually new?

This is the core idea behind a new caching service that understands the structure of Druid queries and serves previously-seen results from cache while fetching only the freshest portion from Druid.

A Deliberate Trade-Off

Before diving into the implementation, it’s worth being explicit about the trade-off we’re making. Caching query results introduces some staleness, specifically, up to 5 seconds for the newest data. This is acceptable for most of our operational dashboards, which refresh every 10 to 30 seconds. In practice, many of our queries already set an end time of now-1m or now-5s to avoid the “flappy tail” that can occur with currently-arriving data.

Since our end-to-end data pipeline latency is typically under 5 seconds at P90, a 5-second cache TTL on the freshest data introduces negligible additional staleness on top of what’s already inherent in the system. We decided it was better to accept this small amount of staleness in exchange for significantly lower query load on Druid. But a 5s cache on its own is not very useful.

Exponential TTLs

Not all data points are equally trustworthy. In real-time analytics, there’s a well-known late-arriving data problem. Events can arrive out of order or be delayed in the ingestion pipeline. A data point from 30 seconds ago might still change as late-arriving events trickle in. A data point from 30 minutes ago is almost certainly final.

We use this observation to set cache TTLs that increase exponentially with the age of the data. Data less than 2 minutes old gets a minimum TTL of 5 seconds. After that, the TTL doubles for each additional minute of age: 10 seconds at 2 minutes old, 20 seconds at 3 minutes, 40 seconds at 4 minutes, and so on, up to a maximum TTL of 1 hour.

The effect is that fresh data cycles through the cache rapidly, so any corrections from late-arriving events in the most recent couple of minutes are picked up quickly. Older data lingers much longer, because our confidence in its accuracy grows with time.

For a 3-hour rolling window, the exponential TTL ensures the vast majority of the query is served from the cache, leaving Druid to only scan the most recent, unsettled data.

Bucketing

If we were to use a single-level cache key for the query and interval, similar to Druid’s existing result-level cache, we wouldn’t be able to extract only the relevant time range from cached results. A shifted window means a different key, which means a cache miss.

Instead, we use a map-of-maps. The top-level key is the query hash without the time interval; the inner keys are timestamps bucketed to the query granularity (or 1 minute, whichever is larger) and encoded as big-endian bytes so lexicographic order matches time. This enables efficient range scans; fetching all cached buckets between times A and B for a query hash. A 3-hour query at 1-minute granularity becomes 180 independent cached buckets, each with its own TTL; when the window shifts (e.g., 30 seconds later), we reuse most buckets from cache and only query Druid for the new data.

How It Works

Today, the cache runs as an external service integrated transparently by intercepting requests at the Druid Router and redirecting them to the cache. If the cache fully satisfies a request, it returns the result; otherwise it shrinks the time interval to the uncached portion and calls back into the Router, bypassing the redirect to query Druid normally. Non-cached requests (e.g., metadata queries or queries without time group-bys) pass straight through to Druid unchanged.

This intercepting proxy design allows us to enable or disable caching without any client changes and is a key to its adoption. We see this setup as temporary while we work out a way to better integrate this capability into Druid more natively.

When a cacheable query arrives, those that are grouping-by time (timeseries, groupBy), the cache performs the following steps.

Parsing and Hashing. We parse each incoming query to extract the time interval, granularity, and structure, then compute a SHA-256 hash of the query with the time interval and parts of the context removed. That hash is the cache key: it encodes what is being asked (datasource, filters, aggregations, granularity) but not when, so the same logical query over different overlapping time windows maps to the same cache entry. There are some context properties that can alter the response structure or contents, so these are included in the cache-key.

Cache Lookup. Using the cache key, we fetch cached points within the requested range, but only if they’re contiguous from the start. Because bucket TTLs can expire unevenly, gaps can appear; when we hit a gap, we stop and fetch all newer data from Druid. This guarantees a complete, unbroken result set while sending at most one Druid query, rather than “filling gaps” with multiple small, fragmented queries that would increase Druid load.

Fetching the Missing Tail. On a partial cache hit (e.g., 2h 50m of a 3h window), we rebuild the query with a narrowed interval for the missing 10 minutes and send only that to Druid. Since Druid then scans just the recent segments for a small time range, the query is usually faster and cheaper than the original.

Combining. The cached data and fresh data are concatenated, sorted by timestamp, and returned to the client. From the client’s perspective, the response looks identical to what Druid would have returned, same JSON format, same fields.

Asynchronous Caching. The fresh data from Druid is parsed into individual time-granularity buckets and written back to the cache asynchronously, so we don’t add latency to the response path.

Negative Caching

Some metrics are sparse. Certain time buckets may genuinely have no data. Without special handling, the cache would treat these empty buckets as gaps and re-query Druid for them every time.

We handle this by caching empty sentinel values for time buckets where Druid returned no data. Our gap-detection logic recognizes these empty entries as valid cached data rather than missing data, preventing needless re-queries for naturally sparse metrics.

However, we’re careful not to negative-cache trailing empty buckets. If a query returns data up to minute 45 and nothing after, we only cache empty entries for gaps between data points, not after the last one. This avoids incorrectly caching “no data” for time periods where events simply haven’t arrived yet, which would exacerbate the chart delays of late arriving data.

The Storage Layer

For the backing store, we use Netflix’s Key-Value Data Abstraction Layer (KVDAL), backed by Cassandra. KVDAL provides a two-level map abstraction, a natural fit for our needs. The outer key is the query hash, and the inner keys are timestamps. Crucially, KVDAL supports independent TTLs on each inner key-value pair, eliminating the need for us to manage cache eviction manually.

This two-level structure gives us efficient range queries over the inner keys, which is exactly what we need for partial cache lookups: “give me all cached buckets between time A and time B for query hash X.”

Results

The biggest win is during high-volume events (e.g., live shows): when many users view the same dashboards, the cache serves most identical queries as full hits, so the query rate reaching Druid is essentially the same with 1 viewer or 100. The scaling bottleneck moves from Druid’s query capacity to the much cheaper-to-scale cache, and with ~5.5 ms P90 cache responses, dashboards load faster for everyone.

On a typical day, 82% of real user queries get at least a partial cache hit, and 84% of result data is served from cache. As a result, the queries that reach Druid scan much narrower time ranges, touching fewer segments and processing less data, freeing Druid to focus on aggregating the newest data instead of repeatedly re-querying historical segments.

An experiment validated this, showing about a 33% drop in queries to Druid and a 66% improvement in overall P90 query times. It also cut result bytes and segments queried, and in some cases, enabling the cache reduced result bytes by more than 14x. Caveat: the size of these gains depends heavily on how similar and repetitive the query workload is.

Looking Ahead

This caching layer is still experimental, but results are promising and we’re exploring next steps. We’ve added partial support for templated SQL so dashboard tools can benefit without writing native Druid queries.

Longer term, we’d like interval-aware caching to be built into Druid: an external proxy adds infrastructure to manage, extra network hops, and workarounds (like SQL templating) to extract intervals. Implemented inside Druid, it could be more efficient, with direct access to the query planner and segment metadata, and benefit the broader community without custom infrastructure. We’d likely ship it as an opt-in, configurable, result-level cache in the Brokers, with metrics to tune TTLs and measure effectiveness. Please leave a comment if you have a use-case that could benefit from this feature.

More broadly, this strategy, splitting time-series results into independently cached, granularity-aligned buckets with age-based exponential TTLs, isn’t Druid-specific and could apply to any time-series database with frequent overlapping-window queries.

Summary

As more Netflix teams rely on real-time analytics, query volume grows too. Dashboards are essential at our scale, but their popularity can become a scaling bottleneck. By inserting an intelligent cache between dashboards and Druid, one that understands query structure, breaks results into granularity-aligned buckets, and trades a small amount of staleness for much lower Druid load, we’ve increased query capacity without scaling infrastructure proportionally, and hope to deliver these benefits to the Druid community soon as a built-in Druid feature.

Sometimes the best way to handle a flood of queries is to stop answering the same question twice.


Stop Answering the Same Question Twice: Interval-Aware Caching for Druid at Netflix Scale was originally published in Netflix TechBlog on Medium, where people are continuing the conversation by highlighting and responding to this story.

Powering Multimodal Intelligence for Video Search

Synchronizing the Senses: Powering Multimodal Intelligence for Video Search

By Meenakshi Jindal and Munya Marazanye

Today’s filmmakers capture more footage than ever to maximize their creative options, often generating hundreds, if not thousands, of hours of raw material per season or franchise. Extracting the vital moments needed to craft compelling storylines from this sheer volume of media is a notoriously slow and punishing process. When editorial teams cannot surface these key moments quickly, creative momentum stalls and severe fatigue sets in.

Meanwhile, the broader search landscape is undergoing a profound transformation. We are moving beyond simple keyword matching toward AI-driven systems capable of understanding deep context and intent. Yet, while these advances have revolutionized text and image retrieval, searching through video, the richest medium for storytelling, remains a daunting “needle in a haystack” challenge.

The solution to this bottleneck cannot rely on a single algorithm. Instead, it demands orchestrating an expansive ensemble of specialized models: tools that identify specific characters, map visual environments, and parse nuanced dialogue. The ultimate challenge lies in unifying these heterogeneous signals, textual labels, and high-dimensional vectors into a cohesive, real-time intelligence. One that cuts through the noise and responds to complex queries at the speed of thought, truly empowering the creative process.

Why Video Search is Deceptively Complex

Since video is a multi-layered medium, building an effective search engine required us to overcome significant technical bottlenecks. Multi-modal search is exponentially more complex than traditional indexing: it demands the unification of outputs from multiple specialized models, each analyzing a different facet of the content to generate its own distinct metadata. The ultimate challenge lies in harmonizing these heterogeneous data streams to support rich, multi-dimensional queries in real time.

  1. Unifying the Timeline
    To ensure critical moments aren’t lost across scene boundaries, each model segments the video into overlapping intervals. The resulting metadata varies wildly, ranging from discrete text-based object labels to dense vector embeddings. Synchronizing these disjointed, multi-modal timelines into a unified chronological map presents a massive computational hurdle.
  2. Processing at Scale
    A standard 2,000-hour production archive can contain over 216 million frames. When processed through an ensemble of specialized models, this baseline explodes into billions of multi-layered data points. Storing, aligning, and intersecting this staggering volume of records while maintaining sub-second query latency far exceeds the capabilities of traditional database architectures.
  3. Surfacing the Best Moments
    Surface-level mathematical similarity is not enough to identify the most relevant clip. Because continuous shots naturally generate thousands of visually redundant candidates, the system must dynamically cluster and deduplicate results to surface the singular best match for a given scene. To achieve this, effective ranking relies on a sophisticated hybrid scoring engine that weighs symbolic text matches against semantic vector embeddings, ensuring both precision and interpretability.
  4. Zero-Friction Search
    For filmmakers, search is a stream-of-consciousness process, and a ten-second delay can disrupt the creative flow. Because sequential scanning of raw footage is fundamentally unscalable, our architecture is built to navigate and correlate billions of vectors and metadata records efficiently, operating at the speed of thought.
Figure 1: Unified Multimodal Result Processing

The Ingestion and Fusion Pipeline

To ensure system resilience and scalability, the transition from raw model output to searchable intelligence follows a decoupled, three-stage process:

1. Transactional Persistence

Raw annotations are ingested via high-availability pipelines and stored in our annotation service, which leverages Apache Cassandra for distributed storage. This stage strictly prioritizes data integrity and high-speed write throughput, guaranteeing that every piece of model output is safely captured.

{
"type": "SCENE_SEARCH",
"time_range": {
"start_time_ns": 4000000000,
"end_time_ns": 9000000000
},
"embedding_vector": [
-0.036, -0.33, -0.29 ...
],
"label": "kitchen",
"confidence_score": 0.72
}

Figure 2: Sample Scene Search Model Annotation Output

2. Offline Data Fusion

Once the annotation service securely persists the raw data, the system publishes an event via Apache Kafka to trigger an asynchronous processing job. Serving as the architecture’s central logic layer, this offline pipeline handles the heavy computational lifting out-of-band. It performs precise temporal intersections, fusing overlapping annotations from disparate models into cohesive, unified records that empower complex, multi-dimensional queries.

Cleanly decoupling these intensive processing tasks from the ingestion pipeline guarantees that complex data intersections never bottleneck real-time intake. As a result, the system maintains maximum uptime and peak responsiveness, even when processing the massive scale of the Netflix media catalog.

To achieve this intersection at scale, the offline pipeline normalizes disparate model outputs by mapping them into fixed-size temporal buckets. This discretization process unfolds in three steps:

  • Bucket Mapping: Continuous detections are segmented into discrete intervals. For example, if a model detects a character “Joey” from seconds 2 through 8, the pipeline maps this continuous span of frames into seven distinct one-second buckets.
  • Annotation Intersection: When multiple models generate annotations for the same temporal bucket, such as character recognition “Joey” and scene detection “kitchen” overlapping in second 4, the system fuses them into a single, comprehensive record.
  • Optimized Persistence: These newly enriched records are written back to Cassandra as distinct entities. This creates a highly optimized, second-by-second index of multi-modal intersections, perfectly associating every fused annotation with its source asset.
Figure 3: Temporal Data Fusion with Fixed-Size Time Buckets

The following record shows the overlap of the character “Joey” and scene “kitchen” annotations during a 4 to 5 second window in a video asset:

{
"associated_ids": {
"MOVIE_ID": "81686010",
"ASSET_ID": "01325120–7482–11ef-b66f-0eb58bc8a0ad"
},
"time_bucket_start_ns": 4000000000,
"time_bucket_end_ns": 5000000000,
"source_annotations": [
{
"annotation_id": "7f5959b4–5ec7–11f0-b475–122953903c43",
"annotation_type": "CHARACTER_SEARCH",
"label": "Joey",
"time_range": {
"start_time_ns": 2000000000,
"end_time_ns": 8000000000
}
},
{
"annotation_id": "c9d59338–842c-11f0–91de-12433798cf4d",
"annotation_type": "SCENE_SEARCH",
"time_range": {
"start_time_ns": 4000000000,
"end_time_ns": 9000000000
},
"label": "kitchen",
"embedding_vector": [
0.9001, 0.00123 ....
]
}
]
}

Figure 4: Sample Intersection Record for Character + Scene Search

3. Indexing for Real Time Search

Once the enriched temporal buckets are securely persisted in Cassandra, a subsequent event triggers their ingestion into Elasticsearch.

To guarantee absolute data consistency, the pipeline executes upsert operations using a composite key (asset ID + time bucket) as the unique document identifier. If a temporal bucket already exists for a specific second of video, perhaps populated by an earlier model run, the system intelligently updates the existing record rather than generating a duplicate. This mechanism establishes a single, unified source of truth for every second of footage.

Architecturally, the pipeline structures each temporal bucket as a nested document. The root level captures the overarching asset context, while associated child documents house the specific, multi-modal annotation data. This hierarchical data model is precisely what empowers users to execute highly efficient, cross-annotation queries at scale.

Figure 5: Simplified Elasticsearch Document Structure

Multimodal Discovery and Result Ranking

The search service provides a high-performance interface for real-time discovery across the global Netflix catalog. Upon receiving a user request, the system immediately initiates a query preprocessing phase, generating a structured execution plan through three core steps:

  • Query Type Detection: Dynamically categorizes the incoming request to route it down the most efficient retrieval path.
  • Filter Extraction: Isolates specific semantic constraints such as character names, physical objects, or environmental contexts to rapidly narrow the candidate pool.
  • Vector Transformation: Converts raw text into high-dimensional, model-specific embeddings to enable deep, context-aware semantic matching.

Once generated, the system compiles this structured plan into a highly optimized Elasticsearch query, executing it directly against the pre-fused temporal buckets to deliver instantaneous, frame-accurate results.

Fine-Tuning Semantic Search

To support the diverse workflows of different production teams, the system provides fine-grained control over search behavior through configurable parameters:

  • Exact vs. Approximate Search: Users can toggle between exact k-Nearest Neighbors (k-NN) for uncompromising precision, and Approximate Nearest Neighbor (ANN) algorithms (such as HNSW) to maintain blazing speed when querying massive datasets.
  • Dynamic Similarity Metrics: The system supports multiple distance calculations, including cosine similarity and Euclidean distance. Because different models shape their high-dimensional vector spaces distinctly based on their underlying training architectures, the flexibility to swap metrics ensures that mathematical closeness perfectly translates to true semantic relevance.
  • Confidence Thresholding: By establishing strict minimum score boundaries for results, users can actively prune the long tail of low-probability matches. This aggressively filters out visual noise, guaranteeing that creative teams are not distracted and only review results that meet a rigorous standard of mathematical similarity.

Textual Analysis & Linguistic Precision

To handle the deep nuances of dialogue-heavy searches, such as isolating a character’s exact catchphrase amidst thousands of hours of speech, we implement a sophisticated text analysis strategy within Elasticsearch. This ensures that conversational context is captured and indexed accurately.

  • Phrase & Proximity Matching: To respect the narrative weight of specific lines (e.g., “Friends don’t lie” in Stranger Things), we leverage match-phrase queries with a configurable slop parameter. This guarantees the system retrieves the correct scene even if the user’s memory slightly deviates from the exact transcription.
  • N-Gram Analysis for Partial Discovery: Because video search is inherently exploratory, we utilize edge N-gram tokenizers to support search-as-you-type functionality. By actively indexing dialogue and metadata substrings, the system surfaces frame-accurate results the moment an editor begins typing, drastically reducing cognitive load.
  • Tokenization and Linguistic Stemming: To seamlessly support the global scale of the Netflix catalog, our analysis chain applies sophisticated stemming across multiple languages. This ensures a query for “running” automatically intersects with scenes tagged with “run” or “ran” collapsing grammatical variations into a single, unified search intent.
  • Levenshtein Fuzzy Matching: To account for transcription anomalies or phonetic misspellings, we incorporate fuzzy search capabilities based on Levenshtein distance algorithms. This intelligent soft-matching approach ensures that high-value shots are never lost to minor data-entry errors or imperfect queries.

Aggregations and Flexible Grouping

The architecture operates at immense scale, seamlessly executing queries within a single title or across thousands of assets simultaneously. To combat result fatigue, the system leverages custom aggregations to intelligently cluster and group outputs based on specific parameters, such as isolating the top 5 most relevant clips of an actor per episode. This guarantees a diverse, highly representative return set, preventing any single asset from dominating the search results.

Search Response Curation

While temporal buckets are the internal mechanism for search efficiency, the system post-processes Elasticsearch results to reconstruct original time boundaries. The reconstruction process ensures results reflect narrative scene context rather than arbitrary intervals. Depending on the query intent, the system generates results based on two logic types:

Figure 6: Depiction of Temporal Union vs Intersection
  • Union: Returns the full span of all matching annotations (3–8 sec), which prioritizes breadth, capturing any instance where a specified feature occurs.
  • Intersection: Returns only the exact overlapping duration of matching signals (4–6 sec). The intersection logic focuses on co-occurrence, isolating moments when multiple criteria align.
{
"entity_id": {
"entity_type": "ASSET",
"id": "1bba97a1–3562–4426–9cd2-dfbacddcb97b"
},
"range_intervals": [
{
"intersection_time_range": {
"start_time_ns": 4000000000,
"end_time_ns": 8000000000
},
"union_time_range": {
"start_time_ns": 2000000000,
"end_time_ns": 9000000000
},
"source_annotations": [
{
"annotation_id": "fc1525d0–93a7–11ef-9344–1239fc3a8917",
"annotation_type": "SCENE_SEARCH",
"metadata": {
"label": "kitchen"
}
},
{
"annotation_id": "5974fb01–93b0–11ef-9344–1239fc3a8917",
"annotation_type": "CHARACTER_SEARCH",
"metadata": {
"character_name": [
"Joey"
]
}
}
]
}
]
}

Figure 7: Sample Response for “Joey” + “Kitchen” Query

Future Extensions

While our current architecture establishes a highly resilient and scalable foundation, it represents only the first phase of our multi-modal search vision. To continuously close the gap between human intuition and machine retrieval, our roadmap focuses on three core evolutions:

  • Natural Language Discovery: Transitioning from structured JSON payloads to fluid, conversational interfaces (e.g., “Find the best tracking shots of Tom Holland running on a roof”). This will abstract away underlying query complexity, allowing creatives to interact with the archive organically.
  • Adaptive Ranking: Implementing machine learning feedback loops to dynamically refine scoring algorithms. By continuously analyzing how editorial teams interact with and select clips, the system will self-tune its mathematical definition of semantic relevance over time.
  • Domain-Specific Personalization: Dynamically calibrating search weights and retrieval behaviors to match the exact context of the user. The platform will tailor its results depending on whether a team is cutting high-action marketing trailers, editing narrative scenes, or conducting deep archival research.

Ultimately, these advancements will elevate the platform from a highly optimized search engine into an intelligent creative partner, fully equipped to navigate the ever-growing complexity and scale of global video media.

Acknowledgements

We would like to extend our gratitude to the following teams and individuals whose expertise and collaboration were instrumental in the development of this system:


Powering Multimodal Intelligence for Video Search was originally published in Netflix TechBlog on Medium, where people are continuing the conversation by highlighting and responding to this story.

ScyllaDB Vector Search in Action: Real-time RAG and Anomaly Detection

Learn how ScyllaDB vector search simplifies scalable semantic search & RAG with real-time performance – plus see FAQs on vector search scaling, embeddings, and architecture As a Solutions Architect at ScyllaDB, I talk a lot about vector search these days. Most vector search conversations start with RAG. Build a chatbot, embed some documents, do a similarity lookup. That’s an important use case, but it’s just the beginning of what vector search can do when you have a platform like ScyllaDB that can scale to massive datasets and hit those p99 latencies At Monster Scale Summit, I showed two live demos that cover what I think are the two most practical patterns for vector search today: hybrid memory for RAG, and real-time anomaly detection. Both run entirely on ScyllaDB, without needing an external cache (e.g., Redis) or dedicated vector database (e.g., Pinecone). This post explains what I showed and also answers the top questions from attendees. The Limits of Traditional Search and Anomaly Detection The old way of doing anomaly detection, retrieving data, doing lookups, using inverted indexes, and running semantic search is brittle. Anyone who has tried to scale an index-based system has experienced this firsthand. The first step was figuring out what was said before, what data was loaded. Redis caches, key-value stores, document databases, and SQL databases were used to handle that. Then, to determine what documents are relevant, we built custom search pipelines using Lucene-based indexes and similar tools. Next, we built complex rules engines that became difficult to troubleshoot and couldn’t scale. These systems became extremely complex, with SQL-based regressions and similar mechanisms. The problem is that they’re hard to scale and very rigid. Every new failure mode required building a new rule, sometimes entirely new rules engines. Often, these detection paths required completely different systems. The result was graph use cases, vector systems, and multiple databases: a collection of systems solving what are ultimately simple problems. As you might imagine, these systems become quite brittle and expensive to operate. You could avoid this with a conceptual shift: from matching rules to finding things that look similar. That shift from rigid logic to semantic similarity is what vector search is trying to achieve, whether through text embeddings or other encodings. RAG with Hybrid Memory The first demo showed how we can build three distinct memory tiers for RAG, all living within ScyllaDB. This “hybrid memory” pattern allows you to handle everything from immediate context to long-forgotten conversations without a complex conglomerate of different databases. Short-term memory: This handles the last five records of your interaction as a rolling window. It is very fast, uses a short TTL, and ensures the AI stays on track with the immediate conversation. Long-term memory: This is where vectors shine. If you’ve been chatting with a bot for 6 days, that history won’t fit in a standard context window. We use vector search to retrieve specific conversation parts that matter based on embedded similarity, loading only the relevant history into the prompt. This is the piece that makes a chatbot feel like it actually remembers. Document enrichment: This phase takes uploaded files (PDFs, Markdown files, etc.), chunks them semantically, embeds each chunk, and stores those vectors in ScyllaDB. When you ask a question, the system retrieves the most relevant chunks and feeds them to the model. Basically, I used ScyllaDB to manage key-value lookups, vector similarity, and TTL expiration in a single cluster, which meant I didn’t need extra infrastructure like Redis or Pinecone. It processed LinkedIn profiles by breaking them into semantic chunks (one profile generated 43 chunks) and stored those embeddings in ScyllaDB Cloud to be queried in real-time. The frontend is built in Elixir, the backend is ScyllaDB Cloud with vector search enabled. The code is on my GitHub. Pull it, run it, and try it out with your own documents. Anomaly Detection: Three Paths, One Database The second demo is where vector search truly earns its keep. The setup models an IoT pipeline: Devices generate metrics every 10 seconds Metrics stream through Kafka They’re aggregated into 60-second windows Finally, they land in ScyllaDB as 384-dimensional embeddings generated by Ollama. From there, three detection paths run in parallel against every snapshot. Path 1: Rules engine. This is classic Z-score analysis. If four or more metrics exceed a Z-score of 6.0, flag it. This is fast and interpretable. It catches known anomaly patterns and misses everything else. Path 2: Embedding similarity. Build a behavioral fingerprint for each device. Compare every new snapshot against that fingerprint using cosine similarity. If the score drops below 0.92, something has drifted. This catches gradual behavioral changes that hard thresholds miss. Path 3: Vector search. Run an ANN query across the device’s historical snapshots in ScyllaDB. Post-filter for the same device, non-anomalous records, and a cosine similarity above 0.75. If fewer than five similar normal snapshots exist, that snapshot is anomalous. You don’t need to write any rules or tune thresholds per device. The data tells you what normal looks like. In the demo, I showed a sensor drift scenario where power consumption dropped slightly and a few other metrics shifted just enough to be unusual. This slipped past the rules engine and embedding similarity, but vector search caught it. It’s Exhibit A for how vector search exposes subtle anomalies that rigid systems can’t see. The reason this works on a single cluster is that ScyllaDB handles vectors alongside time series, key-value, and TTL data with sub-10 millisecond queries under heavy concurrent load. Snapshots, profiles, anomaly events, and the vector index all live in one place. The anomaly detection demo runs in Docker if you want to try it on your own data. Common Questions About ScyllaDB Vector Search Technical demos naturally lead to questions, and we heard plenty at MonsterScale (thanks for keeping us busy in the chat). Here are responses to the top questions we fielded… “What about new devices with no history?” Fair concern. Path 3 decides “anomaly” when it finds fewer than five similar normal snapshots. A brand new device with ten records will trigger on everything. The fix is a minimum history gate. Don’t run the vector path until a device has enough baseline data to make comparisons meaningful. In practice, the rules engine and embedding similarity cover the gap while the device builds history. “How do I know my embeddings are good enough?” Test them against known anomalies before you go to production. Bad embeddings produce bad neighbors, and similarity search will give you confidently wrong answers. Run a batch of labeled data through your embedding model, query for nearest neighbors, and check whether the results make sense. If your model does not capture the right features, no amount of tuning the similarity threshold will save you. “Does this scale to millions of devices?” The ANN query itself scales; the part you want to watch is post-filtering. The query returns candidates, and your application code still has to filter by device, status, and similarity threshold. At millions of devices, that step needs to be efficient. Batch your comparisons and keep your filter criteria tight. Someone at the talk asked about the performance impact of deletions on the vector index as old snapshots expire. ScyllaDB’s index structure is optimized for deletion, so the index stays healthy as data churns through. “Can I just use vector search and skip the rules engine?” Yes, you can…but you’ll regret it. The three-path approach works because each path catches a different class of problem. Rules engines are still the fastest way to catch known, well-defined failures. Someone at the talk asked whether an LLM with MCP could just explore the context and find anomalies itself. It can. But vector search is cheaper. No LLM invocation, no context window limit, and it runs at the speed of a database query. Use each tool for what it is good at. Try It Yourself If you want to see what ScyllaDB can do for your vector workloads, sign up for a demo. I’m always happy to walk through architectures and talk through what makes most sense for your specific requirements. Connect with me on LinkedIn if you have questions, want to collaborate, or just want to talk about distributed systems. I read every message.