ScyllaDB X Cloud: Your Questions Answered
A technical FAQ on ScyllaDB X Cloud: architecture, autoscaling, compression, use cases, and more
It’s been a few months since ScyllaDB X Cloud landed. In case you missed the news, here’s a quick recap…
ScyllaDB X Cloud is the next generation of ScyllaDB’s fully-managed database-as-a-service. It’s a truly elastic database designed to support variable/unpredictable workloads with consistent low latency as well as low costs. Users can scale out and scale in almost instantly to match actual usage. For example, you can scale all the way from 100K OPS to 2M OPS in just minutes, with consistent single-digit millisecond P99 latency. This means you don’t need to overprovision for the worst-case scenario or suffer the lag traditionally associated with ramping up capacity in response to a sudden surge.
Some key features (all covered in Introducing ScyllaDB X Cloud: A (Mostly) Technical Overview):
- Tablets + just-in-time autoscaling
- Up to 90% storage utilization
- Support for mixed size clusters
- File-based streaming
- Dictionary-based compression
- Flex credit
Here’s a look at ScyllaDB X Cloud in action:
Not surprisingly, users have been quite curious about all these changes and new options. So we thought we’d collect some of the most common questions here, along with our answers. In no particular order…
What are the key differences between a “standard” ScyllaDB Cloud database and “ScyllaDB X Cloud”?
Compared to a standard ScyllaDB Cloud database, ScyllaDB X Cloud provides two major advantages:
- Faster scaling in and out.
- Higher storage utilization (90% vs. 70%).
The above advantages are the result of two technical updates:
- X Cloud always uses Tablets, while standard databases can use a mix of vNode and Tablets keyspaces.
- X Cloud enables mixed sized clusters, so you can define more granular cluster and storage sizes.
In which cases should you choose a “standard” ScyllaDB Cloud Database vs X Cloud?
None! We’ve reached full parity now.
Materialized views, CDC, Alternator (DynamoDB API), even counters – it’s all supported.
Can I migrate from one type of ScyllaDB Cloud database to the other?
Yes. If you are using a standard database with Tablets only, you can migrate this database to X Cloud. If you are using vNode keyspaces, you cannot (yet).
How does X Cloud achieve higher storage utilization?
Two factors enable higher storage utilization:
- Faster scaling removes the need to over-reserve storage space (or “sandbag”) while waiting for the cluster to expand
- Support for mixed instance sizes allows for more granular cluster size
How can I start an X Cloud cluster?
Simply choose the “X Cloud” Cluster Type on ScyllaDB Cloud’s Launch Cluster page.
How can I set the scaling policy? Can I change it later, while the database is in production? (UI/API)
The scaling policy is part of the X Cloud cluster properties. You can either set it when launching the cluster or update it later. The policy is optional. It defines the minimum required resources for your database in terms of vCPU and Storage. If you’re not sure how to set it, you can keep the default minimum values (zero) as is. The cluster will scale automatically if and when storage is approaching the threshold, and you can scale the vCPU as required by your workload. Note that the parameters affect each other since more storage may require more compute power.
How are X Cloud and Tablets related?
X Cloud takes advantage of (and depends on) Tablets to achieve faster scale and higher storage utilization. That means all Keyspaces in X Cloud must use Tablets, which is already the default for ScyllaDB Cloud.
How can X Cloud help reduce database costs?
There are a few ways that X Cloud reduces cost.
The primary factor is the extreme elasticity. You can scale the cluster in and out, even multiple times per day, to meet the demand.
If you cannot reliably plan the cluster usage, you can reserve a minimal deployment and pay for bursts using Flex Credit.
The higher storage utilization means you use fewer cloud resources.
Improved compression, both on the wire and at rest, reduces cost further.
What’s a good use case for ScyllaDB X Cloud? Am I a good candidate for ScyllaDB X Cloud?
New (greenfield) workloads should use X Cloud. Workloads that require frequent scaling out/in will benefit the most. For example:
- A workload with significant fluctuation throughout the day (e.g., peak hours during the evening).
- A workload with expected high demand on specific days of the year (e.g., Super Bowl, IPL games, or Black Friday). With X Cloud, scaling can be done days in advance. You don’t need to do it one or more weeks ahead.
- Difficult-to-predict workloads, with common (but volatile) bursts.
How many times per day can X Cloud scale?
As often as required. Although new nodes start serving requests very fast, it still takes time for the data balancing to be complete if you’re working with rather large nodes.
Does X Cloud support multi-DC (region) deployment? Does each region scale independently?
X Cloud does not yet support multi-datacenter deployment. Multi-DC support is coming with the 2026.2 release.
Scaling Policy: I asked for storage of Y TB and got a bigger cluster with storage of W TB…why? Same for vCPU?
vCPU, RAM, and Storage are not independent variables. ScyllaDB will allocate each of these 3 variables to support the required value of the other two. For example, higher storage requires more RAM – which requires more vCPU. The policy UI reflects the expected deployment per each resource selection.
Can I suspend / resume the dynamic scaling?
Currently: no.
Can I restore a backup from X Cloud database to a standard database and vice versa?
Yes, you can.
Is X Cloud production ready?
Absolutely, customers are already using it in production.
Why should I care about advanced compression? What is the advantage of having it?
ScyllaDB already supported compression before X Cloud – including at-rest and in-transit.
However, dictionary-based compression is much more effective in reducing data overhead. By compressing data further, you save on disk space utilization (combined with up to 90% disk space utilization) as well as inter-AZ networking for data replication and high availability.
X Cloud claims faster scaling. How fast is it really?
The legacy vNode-based architecture imposed some limitations:
- Nodes could only be added one at a time, even across DCs.
- Data was replicated in rows – that is, rows were being transferred over the wire.
- A node only started serving requests after its streaming was fully completed.
This process could easily take hours, if not days, to complete on large clusters.
Now, X Cloud leverages tablets to remove those limits:
- Nodes can be added in parallel, multiple nodes at a time, including across DCs.
- Nodes join the cluster instantly, then start streaming data later.
- Streaming under Tablets relies on file-based streaming, transferring gigabytes of data per second in a very efficient process.
- As Tablet transfers complete, nodes start to serve requests immediately; this increases as more transfers complete, until the cluster rebalancing is completed.
This allows X Cloud to scale to an unlimited number of nodes at a single step – and streaming data is made super efficient by file-based streaming. A cluster can go from 100K ops per second to 2M ops per second in a matter of a few minutes, not hours or days.
Can I use Vector Search with X Cloud?
Yes, you can! Enable the Vector Search option at the bottom of the Launch Cluster page and choose the Vector Search instances. Note that Vector Search index nodes scale independently from ScyllaDB nodes. You can learn more about Vector Search here.
6 Reasons ScyllaDB Costs a Fraction of DynamoDB
Why teams typically experience 50% (or greater) cost reductions when moving from DynamoDB to ScyllaDB
DynamoDB is expensive at scale. Some of that cost is fundamental to the managed service model. But much of it is the pricing model, the way DynamoDB charges per read, per write, per byte, and per region. ScyllaDB rethinks pricing from first principles. The result: teams typically see more than 50% cost reductions on equivalent workloads. In this post, I’ll share a few reasons why.
Cheap writes
DynamoDB charges 5x more for writes than reads. Write a 1 KB item and it costs 1 write capacity unit. Read the same 1 KB item and it costs 0.25 read capacity units.
ScyllaDB pricing is based on provisioned cluster capacity (nodes), not per operation. Whether you do 10K writes/sec or 100K writes/sec on a 3-node cluster, the ScyllaDB cost remains the same. Write-heavy workloads for AI, real-time analytics, logging, time-series data and IoT sensors often see the biggest savings.
Take a look at our AI Feature Store example. A batch workload scenario with overnight peaks approximately 3x the daytime average on DynamoDB will cost $2.2M/year. The same workload on ScyllaDB would cost $145K/year. In other words, that’s at least 15x savings just switching to ScyllaDB.
No need for a separate cache
DynamoDB’s baseline latency is in the 10-20ms range. For many applications, that’s unacceptable. In those cases, teams commonly deploy DAX, Redis, or Memcached on top. That adds cost, complexity, and another service to operate and monitor.
ScyllaDB was built for low latency. Internal caching and a shard-per-core architecture deliver sub-millisecond latencies on reads. For most workloads, an external cache is unnecessary.
Let’s look at a retail example with a read-heavy workload that is cached and running on demand. On DynamoDB running with DAX, that workload would cost $1.6M/year. The same workload on ScyllaDB would cost $271K/year (and even less if you switch to a hybrid plan).
That’s roughly 6x cheaper using ScyllaDB. Plus: there are fewer moving parts, simpler operations, and no cache coherency headaches.
Affordable multi-region data centers
DynamoDB Global Tables charge replicated writes (rWCUs) at a premium: roughly 2x the cost of normal writes. Moreover, cross-region data transfer incurs AWS’s standard rates: $0.02-0.09/GB. For a workload doing 10K writes/sec with 5 KB payloads across 2 regions, data transfer alone can add $10K+/month.
A social media scenario modeled across 3 regions on DynamoDB would cost $11.0M/year. The huge cost is partly because the write capacity cannot be reserved, and you effectively pay twice for the writes. The same workload on ScyllaDB would cost $591K/year. That’s a monstrous +$10M/year saving by switching to ScyllaDB.
ScyllaDB handles multi-DC replication natively. You provision nodes in each data center, and replication is built into the protocol along with shard-aware and rack-aware drivers. This helps minimize network overhead and avoids the per-operation premium. You pay for the cluster nodes; replication comes with the territory.
Large items don’t cost more
In DynamoDB, a 1 KB write costs 1 WCU, and a 10 KB write costs 10 WCUs. Item size directly drives billing. This incentivizes shrinking payloads, compressing data, and splitting tables. Architectural decisions are driven by cost, not design.
A simple on-demand scenario with DynamoDB using 3 KB item sizes would cost $633K/year. ScyllaDB would cost $39K/year. Along with multi-region, item size remains one of the biggest cost levers to pull when looking for savings on DynamoDB.
ScyllaDB billing is independent of item size. Store 1 KB items or 100 KB items and the cluster cost is unchanged. You architect around performance and correctness, not billing thresholds.
Making multi-tenancy work for you
DynamoDB is multi-tenant infrastructure. That’s how AWS achieves efficiency.
But it also means:
- You pay for provisioned capacity
- AWS oversubscribes hardware
- Idle capacity benefits AWS, not you
You pay for the full machine, but AWS shares it with everyone else. Multi-tenant infrastructure reduces cost for AWS but increases risk for users. Large DynamoDB outages (like us-east-1) impact thousands of customers simultaneously. When shared infrastructure fails, the blast radius is enormous.
ScyllaDB flips that model. You get a dedicated cluster, which gives you:
- Isolation by design
- The ability to run multiple workloads
- The option to share idle capacity internally
This is especially powerful for:
- Multi-tenant SaaS
- Microservices
- Multiple environments (dev/staging/prod)
Instead of provisioning 100 tables separately, you provision one cluster and use it fully. You control your infrastructure. AWS monetizes multi-tenancy. ScyllaDB lets you monetize it.
Flexible and predictable pricing
DynamoDB is excellent for certain use cases: serverless applications with unpredictable spikes, multi-tenant services that need table-level isolation, and teams that prioritize operational simplicity over cost. But if you’re running a predictable, scale-intensive workload – especially one that’s write-heavy, multi-region, or stores large items – then DynamoDB’s per-operation pricing model becomes a massive cost driver.
ScyllaDB’s node-based, cluster-centric model is fundamentally more cost-efficient for these scenarios. Combined with its performance and operational features, it’s why teams see more than 50% cost reductions.
Want to see the actual numbers for your workload? Use the ScyllaDB Cost Calculator at calculator.scylladb.com to model a comparison between your current DynamoDB spend and equivalent ScyllaDB infrastructure.
Apache Cassandra® 6 Accord transactions: What you need to know
There have always been architectural trade-offs when considering a distributed database like Apache Cassandra versus a relational database. Cassandra excels at linear horizontal scalability, multi-region replication, and fault-tolerant uptime that relational systems couldn’t match. This comes at the expense of general-purpose ACID (Atomicity, Consistency, Isolation, Durability) transactions, which let you express complex, multi-row operations with guaranteed consistency.
With Cassandra 6 on its way to general availability status (and an alpha already released), we’re approaching a turning point where we can revisit whether these trade-offs will still exist. The latest version delivers general-purpose ACID transactions through a new protocol called Accord. With Cassandra 6, those transactional guarantees will be native, without compromising Cassandra’s operational model or availability.
Transactions
In database parlance, a transaction says, “These operations belong together. They must all be applied, or none of them.” The classic example is a bank transfer. When you move money from one account to another, two things must happen: a debit and a credit. If the debit succeeds but the credit fails, money has disappeared. A transaction prevents this issue by guaranteeing the two operations are atomic, meaning they succeed or fail as a unit. Combined with isolation, this means no other process can see an intermediate or half-finished state.
Experiences like these depend on transactional guarantees at the data layer, which rely on ACID semantics, particularly atomicity and isolation, to prevent inconsistent intermediate states.
For most developers who have worked with relational databases, transactions are so fundamental they’re almost invisible. For Cassandra users, comparable guarantees across multiple partitions or tables historically required significant application-level coordination or weren’t natively supported.
Coordination at scale is fundamentally hard
Because Cassandra is designed to deal with data replication and scaling, coordinating atomic changes across multiple nodes is inherently challenging (e.g., decrement a balance here, increment one there). All participating replicas must agree on an order of operations. Distributed consensus protocols exist to solve exactly this, but prior approaches came with trade-offs.
Raft and Zab are examples of protocols that rely on an elected leader, which does not suit Cassandra, where all nodes are treated equally.
CEP-15 covers prior solutions in more detail, but generally, leader-based approaches pose issues at scale.
The Accord protocol
The Accord protocol, proposed in CEP-15, is built to achieve fast, general-purpose distributed transactions that remain stable under the same failure conditions Cassandra already tolerates, with no elected leaders.
How it orders transactions
Accord is leaderless, so any node can coordinate any transaction. Transactions are assigned unique timestamps using hybrid logical clocks, where each node appends its own unique ID to its clock value to ensure global uniqueness across the cluster. Conflicting transactions execute in timestamp order across all replicas. Under normal conditions, a transaction reaches consensus in a single round trip.
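The timestamp scheme described above can be illustrated with a minimal hybrid logical clock in Python. This is a generic textbook-style sketch, not Accord's actual implementation: the class names, the field layout, and the injectable `now_ms` clock are all assumptions made for illustration.

```python
import time
from dataclasses import dataclass


@dataclass(frozen=True, order=True)
class TxnTimestamp:
    """Hypothetical timestamp: compared field by field, in this order."""
    wall_ms: int   # physical clock component (milliseconds)
    logical: int   # counter that breaks ties within one millisecond
    node_id: int   # unique node ID, so no two nodes can emit equal timestamps


class HybridLogicalClock:
    """Generic HLC sketch; `now_ms` is injectable so the clock can be tested."""

    def __init__(self, node_id, now_ms=None):
        self.node_id = node_id
        self._now_ms = now_ms or (lambda: int(time.time() * 1000))
        self._wall = 0
        self._logical = 0

    def next(self):
        """Issue a timestamp for a locally coordinated transaction."""
        physical = self._now_ms()
        if physical > self._wall:
            self._wall, self._logical = physical, 0
        else:
            # Physical clock has not advanced: bump the logical counter.
            self._logical += 1
        return TxnTimestamp(self._wall, self._logical, self.node_id)

    def observe(self, remote):
        """Merge a timestamp seen from another node, staying ahead of it."""
        physical = self._now_ms()
        new_wall = max(self._wall, remote.wall_ms, physical)
        if new_wall == self._wall == remote.wall_ms:
            self._logical = max(self._logical, remote.logical) + 1
        elif new_wall == self._wall:
            self._logical += 1
        elif new_wall == remote.wall_ms:
            self._logical = remote.logical + 1
        else:
            self._logical = 0
        self._wall = new_wall
        return TxnTimestamp(self._wall, self._logical, self.node_id)
```

Because the node ID is the last comparison field, two coordinators that read the same wall-clock millisecond still produce distinct, totally ordered timestamps, which is the property the paragraph above relies on.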
The reorder buffer
The challenge with timestamp-based ordering in a geo-distributed system is that two transactions started concurrently from different regions might arrive at replicas in different orders, breaking fast-path consensus. Accord solves this by having replicas buffer incoming transactions. The wait time is precisely bounded to be just long enough to account for clock differences between nodes and network latency, and no longer. This guarantees that replicas always process transactions in the correct order without needing extra message rounds.
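The buffering idea above can be sketched as a small priority queue. This is a simplified illustration only: timestamps are plain integer milliseconds, the class and parameter names are hypothetical, and computing the wait as clock skew plus network delay is an assumption about how such a bound could be derived, not Accord's actual calculation.

```python
import heapq


class ReorderBuffer:
    """Holds incoming transactions briefly, then releases them in timestamp order."""

    def __init__(self, max_clock_skew_ms, max_network_delay_ms):
        # Assumed bound: long enough for any earlier-stamped transaction
        # from another region to arrive, and no longer.
        self.wait_ms = max_clock_skew_ms + max_network_delay_ms
        self._heap = []

    def receive(self, txn_timestamp_ms, txn_id):
        """Buffer a transaction; arrival order does not matter."""
        heapq.heappush(self._heap, (txn_timestamp_ms, txn_id))

    def release_ready(self, local_now_ms):
        """Return IDs whose wait has elapsed, lowest timestamp first."""
        ready = []
        while self._heap and self._heap[0][0] + self.wait_ms <= local_now_ms:
            ready.append(heapq.heappop(self._heap)[1])
        return ready
```

In this sketch, a transaction stamped 105 that arrives before one stamped 100 is still released second, because neither leaves the buffer until its own wait has elapsed.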
Fast-path electorates
When replicas fail, other leaderless protocols fall back to slower, more expensive message patterns. Accord avoids this by dynamically adjusting which replicas participate in fast-path decisions as failures occur. The result is that Accord maintains fast-path availability under failure, avoiding the degradation to slower message patterns that other leaderless protocols experience.
The net effect: strict serializable isolation across multiple partitions and tables, in a single round trip, with no leaders, and preserving performance characteristics under the same minority‑failure conditions that Cassandra is designed to tolerate.
New CQL syntax to support transactions
The most visible change for developers is new CQL syntax. Transactions in Cassandra 6 are wrapped in BEGIN TRANSACTION and COMMIT TRANSACTION blocks, similar to SQL syntax.
Let’s examine a flight booking transaction that must simultaneously reserve a seat and deduct loyalty miles from two separate tables. Note: Cassandra 6 is pre-release. Syntax shown reflects the current alpha and may evolve before general availability.
BEGIN TRANSACTION
  LET seat = (SELECT available FROM flight_seats WHERE flight_id = 'ZZ101' AND seat_number = '14C');
  LET miles = (SELECT balance FROM loyalty_accounts WHERE member_id = 'M-7823');
  IF seat.available = true AND miles.balance >= 25000 THEN
    UPDATE flight_seats SET available = false, booked_by = 'M-7823' WHERE flight_id = 'ZZ101' AND seat_number = '14C';
    UPDATE loyalty_accounts SET balance = miles.balance - 25000 WHERE member_id = 'M-7823';
  END IF
COMMIT TRANSACTION;
Everything between BEGIN TRANSACTION and COMMIT TRANSACTION executes atomically with strict serializable isolation from the perspective of all other concurrent transactions. The LET clause reads current values from the database and binds them to variables. The IF block uses those values to guard the writes. If the seat is already taken or the member doesn’t have enough miles, nothing happens. Both updates either apply together or not at all, across two different tables and two different partition keys.
This is logic that previously had to live in the application, complete with retry handling, race condition guards, and compensating operations if something failed halfway through. Now it lives in the database.
Enabling Accord in Cassandra 6: The CMS dependency
We can’t talk about Accord without discussing Cluster Metadata Service (CMS). Before Accord transactions are functional, CMS, introduced alongside Accord as CEP-21, must be enabled. For teams upgrading from Cassandra 5, this is the most significant operational change in the release.
CMS is required. Accord needs every replica to have the same authoritative view of cluster topology showing which nodes own which data, and which replicas participate in a given transaction. Before Cassandra 6, this information was propagated via the eventually consistent Gossip Protocol. This is suitable for normal reads and writes, but Accord’s correctness depends on knowing precisely who the transaction participants are before committing. CMS replaces Gossip-based metadata propagation with a distributed, linearized transaction log, giving all nodes a consistent view of cluster state. Without it, Accord’s guarantees don’t hold.
Upgrading from Cassandra 5 to 6: plan carefully
The CMS migration cannot begin until every node in the cluster is running Cassandra 6. CMS initialization requires full cluster agreement, so it cannot run on a mixed-version cluster. Before upgrading, disable any automation that could trigger schema changes, node bootstrapping, decommissions, or replacements. These operations are blocked during the upgrade window, and if they fire on an older node before CMS is initialized, the migration can fail in ways that require manual intervention to recover.
Once all nodes are upgraded, run nodetool cms initialize on one node to activate CMS. This creates the service with a single member, which is enough to unblock metadata operations but is not suitable for production. Follow up immediately with nodetool cms reconfigure to add more members. CMS uses Paxos internally and requires a minimum of three nodes for a viable quorum, with more recommended for production depending on cluster size.
Important: CMS initialization is not easily reversible. Plan the upgrade window accordingly and treat it as a one-way operational step.
On a fresh Cassandra 6 cluster that wasn’t migrated from a previous version, CMS is automatically enabled. First, one node is designated as the initial CMS member. From there, CMS membership scales automatically based on cluster size, with the service adding members as the cluster grows without requiring manual intervention.
Of course, for Instaclustr users, our platform and techops team will take care of most of this for you and walk you through any requirements on your side when the time comes to upgrade.
Coexistence with Lightweight Transactions (LWT)
Existing LWT syntax (IF NOT EXISTS, IF EXISTS, conditional UPDATE/INSERT statements) continues to work. LWT differs fundamentally from Accord transactions: it is scoped to a single partition and is far more limited. Accord doesn’t replace or break existing applications. Using BEGIN TRANSACTION/COMMIT TRANSACTION is how developers opt into the broader cross-partition guarantees.
Every prior approach to distributed transactions required accepting one of three constraints: a global leader (single point of failure, WAN latency penalty), single-partition scope (LWT), or degraded performance under failure (prior leaderless protocols). The Accord paper’s central claim is that these constraints are not fundamental. They are artifacts of specific protocol design choices.
By combining flexible fast-path electorates with a timestamp reorder buffer on top of a leaderless execution model, Accord achieves:
- True cross-partition atomicity across multiple tables and partition keys
- Strict serializable isolation with formally proven correctness
- Single round-trip latency under normal operating conditions
- Failure‑tolerant steady‑state performance, avoiding the systematic degradation seen in earlier leaderless protocols
- No elected leaders, consistent with Cassandra’s existing operational model
This opens up workloads that Cassandra previously could not support natively: financial transaction processing, distributed inventory reservation, multi-step workflow coordination, and any application where ‘commit these changes together or not at all’ is a strict correctness requirement.
Looking ahead
Though the Accord protocol is still maturing, the fundamental capability is finally here. We now have general-purpose, leaderless, multi-partition ACID transactions natively in Apache Cassandra.
The historically difficult problem of achieving strict serializable isolation in a geo-distributed system without compromising fault tolerance now has a proven, working answer.
For Cassandra users, this raises an exciting question: which workloads have you been routing to relational databases specifically because they needed transactional guarantees? It is time to reevaluate.
Stay tuned for a preview release of Cassandra 6 on the Instaclustr Platform and get ready to experience the power of ACID transactions on Cassandra for yourself!
The post Apache Cassandra® 6 Accord transactions: What you need to know appeared first on Instaclustr.
4 DynamoDB Configuration Changes for Significant Cost Savings
Learn about ways to cut DynamoDB costs with minimal code changes, zero migration, and no architectural upheaval
If you’re running DynamoDB at scale, your bill might be tens of thousands of dollars higher than it needs to be. However, most teams don’t need a complete migration or architecture overhaul to save significantly. These configuration changes, all easily implemented, can reduce your costs by 50-80%.
This guide covers the biggest wins for DynamoDB cost optimization, with the real math behind each recommendation. We will be sharing links to the ScyllaDB Cost Calculator at calculator.scylladb.com, which lets you model different workload scenarios with customized parameters and compare ScyllaDB pricing to DynamoDB pricing at the click of a button.
Switch from on-demand to provisioned + reserved capacity
This is the single biggest DynamoDB cost lever for most teams. On-demand capacity is convenient at first, with no planning required and just pay-as-you-go. But it’s also expensive. After AWS’s recent price reduction, on-demand costs 7.5x more than provisioned capacity. Before the drop, it was roughly 15x. Either way, the math is brutal.
Let’s look at a simple example: a mid-sized workload running 10,000 reads/sec and 10,000 writes/sec, 24/7.
- On-Demand: ~$239K/year
- Provisioned: ~$71K/year
- Reserved: ~$34K/year
That’s a 7x difference between on-demand and reserved. Even if your workload isn’t perfectly predictable, reserved capacity often pays for itself within months.
The trade-off here is that you need a predictable load and the financial flexibility to commit. If your traffic varies wildly (or you’re short-term focused) provisioned mode without reservation is the middle ground. Still, it’s 3.3x cheaper than on-demand.
Optimize item sizes
DynamoDB’s billing is granular: writes are charged per 1KB of item size, and reads per 4KB. This means a 1.1KB item costs the same as a 2KB item on writes.
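The rounding rule above can be made concrete with a short Python sketch. It models only the size rounding described in this guide (1 KB write increments, 4 KB read increments) and ignores consistency mode and transactional operations; the function names and the use of 1 KB = 1024 bytes are assumptions for illustration.

```python
import math

KB = 1024  # assumed: 1 KB = 1024 bytes


def write_units(item_bytes):
    """Billable write units for one item: size rounded up to the next 1 KB."""
    return max(1, math.ceil(item_bytes / KB))


def read_units(item_bytes):
    """Billable read units for one item: size rounded up to the next 4 KB."""
    return max(1, math.ceil(item_bytes / (4 * KB)))


if __name__ == "__main__":
    for size in (1 * KB, int(1.1 * KB), 2 * KB, 4 * KB + 1):
        print(f"{size:>5} bytes -> {write_units(size)} write unit(s), "
              f"{read_units(size)} read unit(s)")
```

Running it shows the cliff the text describes: an item just over 1 KB is billed as 2 write units, exactly like a full 2 KB item, so trimming it back under the boundary halves its write cost.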
If your items are consistently over these thresholds by a small margin, you’re paying 2-3x more than necessary.
Let’s look at the same simple example, but with increasing item size for comparison.
- On-Demand with 1KB items: ~$239K/year
- On-Demand with 10KB items: ~$2M/year
- On-Demand with 100KB items: ~$20M/year
Common culprits for higher DynamoDB costs here:
- Nested JSON with whitespace or redundant fields
- Variable-length strings with no trimming
- Metadata or audit fields added to every item
- Base64-encoded payloads
What should you do? Compress JSON payloads before storage, remove redundant attributes, move infrequently accessed data to a separate table, or use a columnar storage strategy. Trimming just 200 bytes per item – across millions of items and thousands of writes/sec – adds up to thousands per month.
Deploy DAX (DynamoDB Accelerator) for read-heavy workloads
If your workload skews heavily toward reads and you’re not using an in-memory cache layer yet, DAX is one of the highest ROI moves you can make.
DAX sits in front of DynamoDB and caches frequently accessed items in memory. Cache hits bypass DynamoDB entirely, meaning you avoid the RCU charge. For hot items queried thousands of times per minute, a single DAX cluster can reduce DynamoDB read capacity needs.
Let’s look at another simple example: a read-heavy workload running 80,000 reads/sec and 1,000 writes/sec, 24/7.
- On-Demand: ~$335K/year
- On-Demand with DAX: ~$158K/year
The cost math: a medium sized DAX cluster (3 nodes, cache.r5g.8xlarge) costs roughly $9K/month. A high hit rate on your cache will proportionally reduce your more expensive read costs. That can lead to potentially hundreds of thousands of dollars saved with DynamoDB.
Bonus: DAX also improves latency dramatically. Cache hits respond in microseconds rather than milliseconds.
Use the DynamoDB Infrequent Access (IA) table class
Not all tables are created equal.
If you have tables where data is accessed rarely but storage is high (think audit logs, historical records, compliance archives, or cold lookup tables), then the Standard-IA table class can save you substantially on storage.
The pricing difference:
- Standard class: $0.25/GB
- Standard-IA class: $0.10/GB (up to 60% savings)
The catch is that IA has a minimum item size of 100 bytes and a minimum billing duration. It’s designed for cold data. So, if you’re frequently scanning or querying these tables, IA isn’t the right fit (read costs are identical, but you lose the write discount). However, for true archive tables accessed only occasionally, it’s a no-brainer.
The bottom line
These four DynamoDB changes require minimal code changes, zero migration, and no architectural upheaval. They’re configuration changes, caching tweaks, and data optimization. Combined, they typically deliver massive cost reductions.
Start with switching to provisioned + reserved (highest impact), then layer in the others based on your workload shape.
Ready to model your savings? Use the ScyllaDB Cost Calculator at calculator.scylladb.com to compare your current DynamoDB costs against these optimizations. And to save even more, see how ScyllaDB compares.
Shrinking the Search: Introducing ScyllaDB Vector Quantization
Learn how ScyllaDB Vector Quantization shrinks your vector index memory by up to 30x for cost-efficient, real-time AI applications
Earlier this year, ScyllaDB launched integrated Vector Search, delivering sub-2ms P99 latencies for billion-vector datasets. However, high-dimensional vectors are notoriously memory-hungry. To help with memory efficiency, ScyllaDB recently introduced Vector Quantization. This allows you to shrink the index memory footprint for storing vectors by up to 30x (excluding index structure) without sacrificing the real-time performance ScyllaDB is known for.
What is Quantization?
To understand how we compress massive AI datasets, let’s look to the fundamentals of computer science. As Sam Rose explains in the ngrok blog on quantization, computers store numbers in bits, and representing high-precision decimal numbers (floating point) requires a significant number of them. Standard vectors use 32-bit floating point (f32) precision, where each dimension takes 4 bytes.
Quantization is the process of compromising on this “floating point precision” to save space. By sacrificing some significant figures of accuracy, we can represent vectors as smaller 16-bit floats or even 8-bit or 1-bit integers. As Sam notes, while this results in a “precision compromise,” modern AI models are remarkably robust to this loss of information. They often maintain high quality even when compressed significantly.
The Trade-off: Memory vs. Accuracy
In ScyllaDB 2026.1, quantization is an index-only feature. The original source data remains at full precision in storage, while the in-memory HNSW index is compressed. This allows you to choose the level of “information loss” you are willing to accept for a given memory budget:
Level | Bytes/Dim | Memory Savings | Best For
--- | --- | --- | ---
f32 (default) | 4 | 1x (None) | Small datasets, highest possible recall.
f16 / bf16 | 2 | ~2x | Good balance of accuracy and memory.
i8 | 1 | ~4x | Large datasets with moderate recall loss.
b1 | 0.125 | ~32x | Maximum savings for massive datasets.
CRITICAL NOTE: Quantization only compresses the vector data itself. The HNSW graph structure (the “neighbor lists” that make search fast) remains uncompressed to ensure query performance. Because of this fixed graph overhead, an i8 index typically provides a total memory reduction of ~3x rather than a raw 4x.
Calculating Your Memory Needs
To size your ScyllaDB Vector Search cluster effectively, be sure to consider both vector data and graph overhead. The total memory required for a vector index can be estimated with this formula:
Memory ≈ N * (D * B + m * 16) * 1.2
- N: Total number of vectors.
- D: Dimensions (e.g., 768 or 1536).
- B: Bytes per dimension based on quantization level (f32=4, i8=1, b1=0.125).
- m: Maximum connections per node (default 16).
- 1.2: 20% operational headroom for system processes and query handling.
Example: 10 Million OpenAI Embeddings (768 Dimensions)
Using this formula, let’s see how quantization affects your choice of AWS EC2 instances on ScyllaDB Cloud (which primarily utilizes the r7g Graviton and r7i Intel families):
- f32 (No Quantization): Requires ~40 GB RAM. You would need an r7g.2xlarge (64 GB) to ensure headroom.
- i8 Quantization: Requires ~12 GB RAM. You can comfortably drop to an r7g.xlarge (32 GB).
- b1 (1-bit): Requires ~4 GB RAM. This fits on a tiny r7g.medium (8 GB).
By moving from f32 to i8 in this example, you can drop an instance tier (from r7g.2xlarge to r7g.xlarge). This gets you significant cost savings.
Improving Accuracy with Oversampling and Rescoring
To mitigate the accuracy loss from quantization, ScyllaDB provides two complementary mechanisms.
Oversampling retrieves a larger candidate set during the initial index search, increasing the chance that the true nearest neighbors are included. When a client requests the top K vectors, the algorithm retrieves ceiling(K * oversampling) candidates, sorts them by distance, and returns only the top K.
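The sizing formula and the oversampling candidate count described above are easy to check with a few lines of Python. This is a sketch of the arithmetic only; the function names and the lookup table are hypothetical and not part of any ScyllaDB API, and gigabytes are taken as 10^9 bytes.

```python
import math

# Bytes per dimension for each quantization level, as listed in the table above.
BYTES_PER_DIM = {"f32": 4, "f16": 2, "bf16": 2, "i8": 1, "b1": 0.125}


def index_memory_bytes(n_vectors, dims, quantization="f32", m=16, headroom=1.2):
    """Memory ≈ N * (D * B + m * 16) * 1.2, using the article's defaults."""
    per_vector = dims * BYTES_PER_DIM[quantization] + m * 16
    return n_vectors * per_vector * headroom


def oversampled_candidates(k, oversampling):
    """Number of candidates fetched for a top-K request: ceiling(K * oversampling)."""
    return math.ceil(k * oversampling)


if __name__ == "__main__":
    for level in ("f32", "i8", "b1"):
        gb = index_memory_bytes(10_000_000, 768, level) / 1e9
        print(f"{level}: ~{gb:.1f} GB")
    print("candidates for K=10, oversampling=5.0:", oversampled_candidates(10, 5.0))
```

For 10 million 768-dimension vectors this yields roughly 40 GB, 12 GB, and 4 GB for f32, i8, and b1, matching the example figures. It also shows why i8 saves about 3x rather than 4x in total: the `m * 16` graph term does not shrink with quantization.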
A larger candidate pool means better recall without any extra round-trips to the application. Even without quantization, setting oversampling above 1.0 can improve recall on high-dimensionality datasets.

Rescoring is a second-pass operation that recalculates distances using the original full-precision vectors stored in ScyllaDB, then re-ranks candidates before returning results. Because it must fetch and recompute exact distances for every candidate, rescoring can reduce search throughput by roughly 2x – so enable it only when high recall is critical. Note that rescoring is only beneficial when quantization is enabled; for unquantized indexes (default f32), the index already contains full-precision data, making the rescoring pass redundant.

Both features are configured as index options when creating a vector index:

CREATE CUSTOM INDEX ON myapp.comments(comment_vector)
USING 'vector_index' WITH OPTIONS = {
    'similarity_function': 'COSINE',
    'quantization': 'i8',
    'oversampling': '5.0',
    'rescoring': 'true'
};

When (and When Not) to Use Quantization

Use quantization when:
- You are managing millions or billions of vectors and need to control costs.
- You are memory-constrained but can tolerate a small drop in recall.
- You are using high-dimensional vectors (≥ 768), where the savings are most pronounced.

Avoid quantization when:
- You have a small dataset where memory is not a bottleneck.
- Highest possible recall is your only priority.
- Your application cannot afford the ~2x throughput reduction that comes with rescoring (the process of recalculating exact distances using the original f32 data to improve accuracy).

Choosing the Right Configuration for Your Scenario

Here are some guidelines to help you select the right configuration:

Scenario | Recommendation
Small dataset, high recall required | Use default f32; no quantization needed.
Large dataset, memory-constrained | Use i8 or f16 with oversampling of 3.0–10.0. Add rescoring: true only if very high recall is required.
Very large dataset, approximate results acceptable | Use b1 for maximum memory savings. Enable oversampling to compensate for accuracy loss.
High-dimensionality vectors (≥ 768) | Consider oversampling > 1.0 even with f32 to improve recall.

Try ScyllaDB Vector Search Now

Quantization is just one
part of the
ScyllaDB 2026.1 release, which also includes
Filtering,
Similarity Values, and
Real-Time Ingestion. With these tools, you can build
production-grade RAG applications that are both blazingly fast and
cost-efficient. Vector Search is available in ScyllaDB Cloud.

- Get Started: Check out the Quick Start Guide to Vector Search in ScyllaDB Cloud.
- Deep Dive: Read our past posts on building a Movie Recommendation App or our 1-billion vector benchmark.
- Documentation: View the full ScyllaDB Cloud Vector Search Documentation.

Try ScyllaDB Cloud for free today and see how quantization can supercharge your AI infrastructure.

The Great Stream Fix: Interleaving Writes in Seastar with AI-Powered Invariants Tracing

How we used AI-assisted invariant-based testing to locate and resolve tricky hidden bugs with complex state transitions

Seastar is a high-performance C++ framework for writing asynchronous server
applications. It powers projects like ScyllaDB and Redpanda. One of its core rules is
simple but strict: no blocking allowed. Every operation that could
take time (e.g., reading from disk, writing to a socket, waiting
for a lock) must be expressed asynchronously by returning a future
that resolves when the work is completed. This makes Seastar
applications extremely efficient on modern hardware. However, it
also means that even seemingly mundane things, like writing data to
a stream, require careful thought about ownership, lifetimes, and
buffering.

Moreover, Seastar’s output stream has long had a limitation: it could not freely mix small, buffered writes with large, zero-copy chunks. Developers worked around it and tolerated it, but we always considered it something worth improving … someday. Fixing this requires a deep dive into complex state transitions, which inherently carries a high risk of introducing sequencing bugs. A standard coding approach won’t work; the task requires a way to trace the system’s state across millions of test cases. This post describes the process of using AI-assisted invariant-based testing to locate and resolve these tricky hidden bugs.

TL;DR: What could have been an extremely complicated fix turned out to be surprisingly smooth and effective.

Output streams

Output stream is Seastar’s output byte flow abstraction. It’s used wherever data needs to go out of an application. For example, it’s used for disk files, network connections, and stackable virtual streams that transform data on the fly (such as compression or encryption layers sitting on top of another stream). Whatever the underlying sink is, the output stream presents a uniform interface to the caller. It gives callers two ways to push data through:
- Buffered writes: Copy bytes into an internal buffer; flush when the buffer fills up or when explicitly requested.
- Zero-copy writes: Hand over memory buffers directly; the stream passes them to the sink without copying a single byte of the buffer data.
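
The two modes listed above can be pictured with a toy model. This is a Python sketch for illustration, not Seastar's C++ API: ToySink, ToyBufferedStream, and zero_copy_write are hypothetical names, and the model deliberately keeps the two modes on separate objects.

```python
class ToySink:
    """Hypothetical sink that records every chunk it is handed."""

    def __init__(self):
        self.chunks = []

    def put(self, chunk):
        self.chunks.append(chunk)


class ToyBufferedStream:
    """Buffered mode: bytes are copied into an internal buffer and
    forwarded when the buffer fills up or flush() is called."""

    def __init__(self, sink, size):
        self.sink, self.size = sink, size
        self.buf = bytearray()

    def write(self, data):
        for byte in data:
            self.buf.append(byte)  # copy into the internal buffer
            if len(self.buf) == self.size:
                self.flush()

    def flush(self):
        if self.buf:
            self.sink.put(bytes(self.buf))
            self.buf.clear()


def zero_copy_write(sink, buffers):
    """Zero-copy mode: the caller's buffer objects reach the sink as-is."""
    for b in buffers:
        sink.put(b)


sink = ToySink()
stream = ToyBufferedStream(sink, size=4)
stream.write(b"abcdef")  # "abcd" is forwarded when the buffer fills
stream.flush()           # "ef" is forwarded on the explicit flush
print(sink.chunks)       # [b'abcd', b'ef']

payload = bytearray(b"x" * 1024)
zc_sink = ToySink()
zero_copy_write(zc_sink, [payload])
print(zc_sink.chunks[0] is payload)  # True
```

The identity check at the end is the point of zero-copy: the sink holds the caller's buffer object itself, not a copy of its bytes.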

Zero-copy is important for large blobs since we want to avoid copying megabytes of data. Buffered writes are important for building up small pieces efficiently. In a real application, it’s natural to interleave both: write a small header into the buffer, then attach a large payload as a zero-copy buffer, then write a small trailer.

There is also a trim_to_size stream option. When enabled, the stream guarantees that no chunk delivered to the underlying sink exceeds the configured stream buffer size. This matters for sinks that have an upper limit on how much data they can accept in a single call – certain network APIs, for instance, or aligned disk I/O. Without it, a larger buffer can pass through as-is.

The Problem

Until recently, mixing the two write modes was not supported. Internally, buffered and zero-copy writes used two different storage areas: an internal buffer for the former and a dedicated container for the latter. There was no clean way to append buffered bytes onto the tail of pending zero-copy data while preserving ordering. The code simply asserted that the zero-copy container was empty whenever a buffered write arrived, and vice versa. A nearby code comment, however, stated that mixing writes was not supported yet, so the intention to fix it had always been there.
The goal of the work described here was to make it happen.

Start with the Tests

We figured we should build a solid test foundation before touching the implementation. We had some pre-existing tests for output streams, but they were really just a collection of ad-hoc cases (specific input sequences with hardcoded expected outputs). This was fine for catching regressions but not great for systematically exploring the large space of possible inputs against drastic code changes.

The new approach was invariant-based testing. Rather than checking exact output sequences, the tests verify that certain properties always hold, regardless of input. Specifically, we wanted to check that:
- All written bytes arrive at the sink, in order, with no corruption.
- Every chunk delivered to the sink (except the last) is at least stream_size bytes.
- With the trimming option enabled, every outgoing chunk (again, except possibly the last) is exactly stream_size bytes.

With these invariants defined, the test iterates over all combinations of chunk sizes (1 byte through 3x the stream_size) and all assignments of write type (buffered or zero-copy) to each chunk. For n chunks, that’s 2^n type patterns, times the trimming option, giving about 1.6 million combinations in total. The ad-hoc tests were then removed, since the invariant test subsumed them.

One practical issue: the 1.6 million cases ran fast in a regular build (~5 seconds), but under sanitizers (ASan, UBSan) the run ballooned to over two minutes. Given that the whole Seastar test suite runs for several minutes, this had to be improved. The fix was to turn the exhaustive test into a randomized one: in debug builds, shuffle all 2^n masks, always keep the all-buffered and all-zero-copy patterns, and sample ~10% of the rest. That brought sanitizer runs down to less than twenty seconds.
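
The shape of such a test can be sketched in a few lines of Python. This is not the Seastar test itself: ToyStream is a hypothetical stand-in that treats both write types alike, the sizes are kept small so the run finishes quickly, and the function names are made up for illustration. What it does show is the structure described above: enumerate chunk sizes and write-type masks, check the three invariants, and optionally sample the masks while always keeping the two uniform patterns.

```python
import itertools
import random


class ToyStream:
    """Hypothetical stand-in for the stream under test (not Seastar code)."""

    def __init__(self, size, trim_to_size):
        self.size, self.trim = size, trim_to_size
        self.pending = bytearray()
        self.sink = []  # chunks delivered downstream

    def write(self, data, zero_copy):
        # The toy ignores the write type; the real stream does not.
        self.pending += data
        if len(self.pending) >= self.size:
            self._emit(final=False)

    def flush(self):
        self._emit(final=True)

    def _emit(self, final):
        step = self.size if self.trim else len(self.pending)
        while len(self.pending) >= self.size:
            self.sink.append(bytes(self.pending[:step]))
            del self.pending[:step]
        if final and self.pending:
            self.sink.append(bytes(self.pending))
            self.pending.clear()


def check_invariants(size, trim, chunk_sizes, mask):
    stream, written = ToyStream(size, trim), bytearray()
    for i, n in enumerate(chunk_sizes):
        data = bytes((len(written) + j) % 251 for j in range(n))
        written += data
        stream.write(data, zero_copy=bool((mask >> i) & 1))
    stream.flush()
    ctx = (f"stream_size={size} trim_to_size={int(trim)} "
           f"input_chunks={list(chunk_sizes)} mask={mask:b}")
    # Invariant 1: all bytes arrive, in order, uncorrupted.
    assert b"".join(stream.sink) == bytes(written), f"data integrity: {ctx}"
    for chunk in stream.sink[:-1]:
        # Invariant 2: no undersized non-last chunks.
        assert len(chunk) >= size, f"undersized chunk: {ctx}"
        # Invariant 3: with trimming, non-last chunks are exactly stream_size.
        assert not trim or len(chunk) == size, f"untrimmed chunk: {ctx}"


def select_masks(n_chunks, sampled, rng):
    all_masks = list(range(2 ** n_chunks))
    if not sampled:
        return all_masks
    keep = {0, all_masks[-1]}  # all-buffered and all-zero-copy patterns
    rest = [m for m in all_masks if m not in keep]
    rng.shuffle(rest)
    return sorted(keep) + rest[: max(1, len(rest) // 10)]


def run(size=4, n_chunks=3, sampled=False, seed=0):
    rng, cases = random.Random(seed), 0
    sizes = range(1, 3 * size + 1)  # 1 byte through 3x stream_size
    for chunk_sizes in itertools.product(sizes, repeat=n_chunks):
        for mask in select_masks(n_chunks, sampled, rng):
            for trim in (False, True):
                check_invariants(size, trim, chunk_sizes, mask)
                cases += 1
    return cases


print(run(sampled=False), "exhaustive cases passed")
print(run(sampled=True), "sampled cases passed")
```

Because every assertion message carries the full configuration, a failure immediately yields a small reproducible case, which is what makes this style of test convenient to debug.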

Implementing the Fix

With tests in place, the implementation work began. The key challenge was making the internal buffer and zero-copy container interoperate cleanly. Two transitions required handling:
- Buffered → zero-copy
- Zero-copy → buffered

Buffered → zero-copy

When a zero-copy write arrives and there’s buffered data, that data needs to be folded into the zero-copy container so that ordering is preserved. The naive approach (trim the buffer to its filled length and move it into the container) works, but it wastes the rest of the buffer allocation. Instead, the filled buffer prefix is shared into the container as a view or sub-span, and the buffer itself is advanced past it, so both refer to the same underlying memory. This way, the tail of the original allocation is still available for future buffered writes after the zero-copy sequence. No reallocation is needed on the mode switch.

This tail (a trimmed buffer pointing at unused capacity within the original allocation) is what we call the remnant. It is a new concept introduced by this change. Before mixed-mode writes were supported, the buffer was always either full (and flushed) or freshly allocated. The remnant is an optimization. But (as will become clear shortly), its existence also introduced several subtle failure modes that took time to track down.

Zero-copy → buffered

When a buffered write arrives and the zero-copy container is non-empty, the new data can simply be appended to the internal buffer as usual. A new zero-copy arrival, if it happens, would preserve the ordering as described above.

The flushing path was also updated to handle the case where buffered and zero-copy data are both pending at flush time. The buffered tail is folded into the zero-copy vector before forwarding both downstream.

Bugs Found Along the Way

Once the implementation was written, the tests were run. They failed – many times, in different ways. To facilitate debugging, each test failure logged the exact stream configuration and the write sequence that triggered the problem, for example:

stream_size=5 trim_to_size=0 input_chunks=[3, 2] write_types=[buffered, zero_copy]

This made the diagnosis unusually direct. Given a specific failing case, we could trace the execution step by step: mentally simulating the stream, tracking the bytes accumulated in the buffer and the zero-copy container state at each point, and comparing that against what the state should be at that moment. The stream’s logic is not huge, but it has enough branches and interacting state variables to make this excessively time- and effort-consuming for a human. So, we turned to AI to hopefully save time and effort.

Each time a test failed, instead of walking the output_stream state step-by-step ourselves, we prompted the AI with the exact test-case description string, the code itself, and a request to find where the problem could be. The AI then started “thinking,” and its reasoning trace looked as if it was simulating the output_stream code execution step-by-step, like a developer would do, just much faster. This turned out to be an incredibly fast way to find where the invariant breaks. In many cases, it could correctly work out the state the stream should have been in. Sometimes, it even suggested excellent ways to fix the problem that was found.

Identifying the root cause is often only half the work; understanding which fix is correct, safe, and doesn’t introduce new assumptions that will break later is the other half. AI found the problems for us and was quite helpful in fixing them. For example, one of the first debugging
sessions with AI looked like this…

Developer: The test fails like that

fatal error: in "test_splitting_invariants": data integrity check failed: stream_size=5 trim_to_size=0 input_chunks=[1, 1, 1, 4] write_types=[zero_copy, buffered, buffered, buffered]

Please, analyze

AI: Let me trace through this case carefully.

stream_size=5, trim_to_size=false, writes: [zc(1), buf(1), buf(1), buf(4)]

After zc(1):
- _zc_bufs = [tb(1)], _zc_len = 1, _end = 0
- _zc_len < _size, nothing flushed

After buf(1):
- Goes to write(const char*, 1) fast path – but !_buf, so falls to slow_write
- _end=0, n=1 < _size=5, so skips the !_end && n >= _size branch
- Allocates _buf, copies 1 byte, _end = 1

After buf(1):
- Fast path: fits in _buf, _end = 2

After buf(4):
- _end=2, n=4, n > _size - _end = 3, so goes to slow_write
- _end=2, so skips the !_end && n >= _size branch
- now = min(4, 5-2) = 3, copies 3 bytes, _end = 5
- now != n (3 != 4), so allocates next with max(4-3, 5) = 5 bytes, copies 1 byte into it
- _zc_bufs is non-empty, so enters the new branch:
- seals _buf.share(0, 5) into _zc_bufs
- _zc_len += 5 → _zc_len = 6
- _zc_len += n - now = 1 → _zc_len = 7
- pushes next into _zc_bufs
- trim_to_size=false → calls zero_copy_put(_zc_bufs)

So zero_copy_put sends [tb(1), tb(5), tb(5)] – but next only has 1 byte of valid data in it, yet it’s a 5-byte allocation! We’re pushing the full _size-sized next buffer but only n - now = 1 bytes were written into it. The buffer isn’t trimmed before being pushed.

The fix: trim next to n - now bytes before pushing it into _zc_bufs when _zc_bufs is non-empty. Shall I go ahead and fix it?

The root cause was obvious and the suggested fix was simple and correct. So, we accepted the suggestion and the debugging session went on. We went through several rounds like that, with each step producing lengthier analysis and more sophisticated fix proposals.

At some point, a
test with a small buffered write followed by a zero-copy write showed data corruption downstream. Tracing the execution revealed that after trimming the internal buffer past the filled region, the trimmed buffer still appeared non-null even though it had zero usable capacity remaining. The fast-path check for buffer availability wasn’t prepared for that and didn’t trigger reallocation on the next write. As a result, bytes were written into unowned memory.

Another case involved the buffered write code branch that handles writes larger than stream_size by chopping them into chunks. After a mode switch, the internal buffer might become a small remnant, smaller than the tail chunk the code in question was trying to store there. Tracing through the code, we saw that the capacity check there wasn’t prepared for a remnant buffer. It silently assumed that the internal buffer always had room for a full-sized tail. The result was an ASan-detected heap overflow.

A more subtle issue arose around the same remnant buffer in a different scenario. When the buffered write chopping code encounters a tail chunk that is smaller than stream_size but larger than the remnant's remaining capacity, it has to make a choice. It could either fill the remnant partially and asynchronously put it before allocating a fresh buffer for the rest, or simply abandon the remnant and allocate a fresh full buffer. The first option is more space-efficient, but would require an asynchronous flush inside what is otherwise a synchronous setup step, significantly complicating the code. The second option wastes the unused bytes of the remnant's allocation – but crucially, it doesn’t leak them. The remnant shares its underlying allocation with the sealed buffer already in the zero-copy container, so the memory is freed once that buffer is flushed and all references to the allocation are dropped. The deliberate trade-off – wasted but not leaked – was worth making, and a comment in the code explains the reasoning for whoever reads it next.

Each bug effectively had the same shape: a subtle assumption about stream state that held in the original single-mode code silently broke in mixed-mode scenarios. The invariant test exposed the bugs by providing a minimal reproducible case and a clear description of which invariant was violated. That also made each one straightforward to reason about and fix.

The Result

The work touches tests and
implementation in roughly equal measure, which feels about right for a change like this. The test suite grew from a handful of hand-crafted cases into an exhaustive invariant-based framework that covers all combinations of chunk sizes and write types – something that would have been impractical to write by hand.

On the implementation side, the long-standing restriction on mixed-mode writes is gone. Buffered and zero-copy writes can now be freely interleaved in any order, with the stream handling the transitions internally. This preserves ordering and the chunk-size invariants that sinks depend on.

In general, writing a test that covers as many situations as possible and then making sure that the code passes it is a very good approach. It gives strong assurance that the final code is correct. In the rare cases when the test covers all possible situations the code may have to deal with, we can say that “the code is officially bug free.”

Making AI facilitate testing turned out to be the best decision made in this work. Given the number of test cases and the number of possible combinations of output_stream inner states, debugging each failing test case by hand would have been a nightmare for the developer.