Cut LLM Costs and Latency with ScyllaDB Semantic Caching

How semantic caching can help with costs and latency as you scale up your AI workload

Developers building large-scale LLM solutions often rely on powerful APIs such as OpenAI’s. This approach outsources model hosting and inference, allowing teams to focus on application logic rather than infrastructure. However, there are two main challenges you might face as you scale up your AI workload: high costs and high latency. This blog post introduces semantic caching as a possible solution to these problems. Along the way, we cover how ScyllaDB can help implement semantic caching.

What is semantic caching?

Semantic caching follows the same principle as traditional caching: storing data in a system that allows faster access than your primary source. In conventional caching solutions, that source is a database. In AI systems, the source is an LLM.

Here’s a simplified semantic caching workflow:

  • The user sends a question (“What is ScyllaDB?”)
  • Check whether this type of question has been asked before (for example, “whats scylladb” or “Tell me about ScyllaDB”)
  • If yes, deliver the response from the cache
  • If no: a) send the request to the LLM and deliver the response from there, and b) save the response to the cache

Semantic caching stores the meaning of user queries as vector embeddings and uses vector search to find similar ones. If there’s a close enough match, it returns the cached result instead of calling the LLM. The more queries you can serve from the cache, the more you save on cost and latency over time.

Invalidating data is just as important for semantic caching as it is for traditional caching. For instance, if you are working with RAG (where the underlying base information can change over time), then you need to invalidate the cache periodically so it returns accurate information. For example, if the user query is “What’s the most recent version of ScyllaDB Enterprise,” the answer depends on when you ask this question. The cached response to this question must be refreshed accordingly (assuming the only context your LLM works with is the one provided by the cache).

Why use a semantic cache?

Simply put, semantic caching saves you money and time. You save money by making fewer LLM calls, and you save time from faster responses. When a use case involves repeated or semantically similar queries, and identical responses are acceptable, semantic caching offers a practical way to reduce both inference costs and latency.

Heavy LLM usage might put you on OpenAI’s top spenders list. That’s great for OpenAI. But is it great for you? Sure, you’re using cutting-edge AI and delivering value to users, but the real question is: can you optimize those costs?

Cost isn’t the only concern. Latency matters too. LLMs inherently cannot achieve sub-millisecond response times, but users still expect instant responses. So how do you bridge that gap? You can combine LLM APIs with a low-latency database like ScyllaDB to speed things up. Combining AI models with traditional optimization techniques is key to meeting strict latency requirements.

Semantic caching helps mitigate these issues by caching LLM responses associated with the input embeddings. When a new input is received, its embedding is compared to those stored in the cache. If a similar-enough embedding is found (based on a defined similarity threshold), the saved response is returned from the cache. This way, you can skip the round trip to the LLM provider. This leads to two major benefits:

  • Lower latency: No need to wait for the LLM to generate a new response. Your low-latency database will always return responses faster than an LLM.
  • Lower cost: Cached responses are “free” – no LLM API fees. Unlike LLM calls, database queries don’t charge you per request or per token.

Why use ScyllaDB for semantic caching?

From day one, ScyllaDB has focused on three things: cutting latency, cost, and operational overhead. All three of those things matter just as much for LLM apps and semantic caching as they do for “traditional” applications.

Furthermore, ScyllaDB is more than an in-memory cache. It’s a full-fledged high-performance database with a built-in caching layer. It offers high availability and strong P99 latency guarantees, making it ideal for real-time AI applications. ScyllaDB recently added a Vector Search offering, which is essential for building a semantic cache, and it’s also used for a wide range of AI and LLM-based applications. For example, it’s quite commonly used as a feature store. In short, you can consolidate all your AI workloads into a single high-performance, low-latency database.

Now let’s see how you can implement semantic caching with ScyllaDB.

How to implement semantic caching with ScyllaDB

> If you just want to dive in, clone the repo, and try it yourself, check out the GitHub repository here.

Here’s a simplified, general guide on how to implement semantic caching with ScyllaDB (using Python examples):

1. Create a semantic caching schema

First, we create a keyspace, then a table called prompts, which will act as our cache table. It includes the following columns:

  • prompt_id: The partition key for the table.
  • inserted_at: Stores the timestamp of when the row was originally inserted (the response first cached).
  • prompt_text: The actual input provided by the user, such as a question or query.
  • prompt_embedding: The vector embedding representation of the user input.
  • llm_response: The LLM’s response for that prompt, returned from the cache when a similar prompt appears again.
  • updated_at: Timestamp of when the row was last updated, useful if the underlying data changes and the cached response needs to be refreshed.

Finally, we create an ANN (Approximate Nearest Neighbor) index on the prompt_embedding column to enable fast and efficient vector searches.
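To make the schema concrete, here is a minimal CQL sketch along the lines described above (not the repository’s exact schema). The column names follow the post; the keyspace name, replication settings, and the 384-dimension size (which assumes a Sentence Transformers model such as all-MiniLM-L6-v2) are assumptions, and the exact ANN index DDL depends on your ScyllaDB release:

```cql
-- Minimal sketch; adjust keyspace name, replication, and vector dimensions
-- to your environment.
CREATE KEYSPACE IF NOT EXISTS semantic_cache
  WITH replication = {'class': 'NetworkTopologyStrategy', 'replication_factor': 3};

CREATE TABLE IF NOT EXISTS semantic_cache.prompts (
    prompt_id uuid PRIMARY KEY,             -- partition key
    inserted_at timestamp,                  -- when the response was first cached
    prompt_text text,                       -- the user's original input
    prompt_embedding vector<float, 384>,    -- embedding of the input
    llm_response text,                      -- cached LLM response
    updated_at timestamp                    -- last refresh of the cached response
);

-- Plus an ANN index on prompt_embedding; see the ScyllaDB Vector Search docs
-- for the exact CREATE INDEX syntax in your release.
```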
Now that ScyllaDB is ready to receive and return responses, let’s implement semantic caching in our application code.

2. Convert user input to vector embedding

Take the user’s text input (which is usually a question or some kind of query) and convert it into an embedding using your chosen embedding model. It’s important that the same embedding model is used consistently for both cached data and new queries. In this example, we’re using a local embedding model from Sentence Transformers. In your application, you might use OpenAI or some other embedding provider platform.

3. Calculate similarity score

Use ScyllaDB’s Vector Search syntax, `ANN OF`, to find semantically similar entries in the cache. There are two key components in this part of the application:

  • Similarity score: You need to calculate the similarity between the user’s new query and the most similar item returned by vector search. Cosine similarity, which is the most frequently used similarity function in LLM-based applications, ranges from 0 to 1. A similarity of 1 means the embeddings are identical; a similarity of 0 means they are completely dissimilar.
  • Threshold: Determines whether the response can be provided from the cache. If the similarity score is above that threshold, it means the new query is similar enough to one already stored in the cache, so the cached response can be returned. If it falls below the threshold, the system should fetch a fresh response from the LLM. The exact threshold should be tuned experimentally based on your use case.

4. Implement cache logic

Finally, putting it all together, you need a function that decides whether to serve a response from the cache or make a request to the LLM. If the user query matches something similar in the cache, follow the earlier steps and return the cached response. If it’s not in the cache, make a request to your LLM provider, such as OpenAI, return that response to the user, and then store it in the cache. This way, the next time a similar query comes in, the response can be served instantly from the cache.
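To make these steps concrete, here is a minimal sketch of the cache-or-LLM decision (not the repository’s actual code). It assumes the schema sketched earlier and a local Sentence Transformers model; call_llm() is a placeholder for your LLM provider, and the similarity threshold and the way the vector value is passed to the query are assumptions you should adapt to your driver and server versions:

```python
import uuid
from datetime import datetime, timezone

import numpy as np
from cassandra.cluster import Cluster          # the ScyllaDB/Cassandra Python driver
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
session = Cluster(["127.0.0.1"]).connect("semantic_cache")
SIMILARITY_THRESHOLD = 0.85  # tune experimentally for your use case


def call_llm(prompt: str) -> str:
    raise NotImplementedError("call your LLM provider (e.g., OpenAI) here")


def cosine_similarity(a, b) -> float:
    a, b = np.asarray(a), np.asarray(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def answer(prompt: str) -> str:
    embedding = model.encode(prompt).tolist()
    # Vector search for the closest cached prompt (ANN OF syntax as described above;
    # the embedding is inlined as a literal here for simplicity).
    rows = list(session.execute(
        f"""SELECT prompt_text, prompt_embedding, llm_response
            FROM prompts
            ORDER BY prompt_embedding ANN OF {embedding}
            LIMIT 1"""
    ))
    if rows and cosine_similarity(embedding, rows[0].prompt_embedding) >= SIMILARITY_THRESHOLD:
        return rows[0].llm_response            # cache hit: skip the LLM call

    response = call_llm(prompt)                # cache miss: ask the LLM ...
    session.execute(                           # ... and store the result in the cache
        "INSERT INTO prompts (prompt_id, inserted_at, prompt_text, prompt_embedding, "
        "llm_response, updated_at) VALUES (%s, %s, %s, %s, %s, %s)",
        (uuid.uuid4(), datetime.now(timezone.utc), prompt, embedding,
         response, datetime.now(timezone.utc)),
    )
    return response
```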
Get started!

Get started building with ScyllaDB; check out our examples on GitHub:

`git clone https://github.com/scylladb/vector-search-examples.git`

  • Vector Search
  • Semantic Cache
  • RAG

See the Quick Start Guide and give it a try. Contact us with your questions, or for a personalized tour.

Managing ScyllaDB Background Operations with Task Manager

Learn about Task Manager, which provides a unified way to observe and control ScyllaDB’s background maintenance work In each ScyllaDB cluster, there are a lot of background processes that help maintain data consistency, durability, and performance in a distributed environment. For instance, such operations include compaction (which cleans up on-disk data files) and repair (which ensures data consistency in a cluster). These operations are critical for preserving cluster health and integrity. However, some processes can be long-running and resource-intensive. Given that ScyllaDB is used for latency-sensitive database workloads, it’s important to monitor and track these operations. That’s where ScyllaDB’s Task Manager comes in. Task Manager allows administrators of self-managed ScyllaDB to see all running operations, manage them, or get detailed information about a specific operation. And beyond being a monitoring tool, Task Manager also provides a unified way to manage asynchronous operations. How Task Manager Organizes and Tracks Operations Task Manager adds structure and visibility into ScyllaDB’s background work. It groups related maintenance activities into modules, represents them as hierarchical task trees, and tracks their lifecycle from creation through completion. The following sections explain how operations are organized, retained, and monitored at both node and cluster levels. Supported Operations Task Manager supports the following operations: Local: Compaction; Repair; Streaming; Backup; Restore. Global: Tablet repair; Tablet migration; Tablet split and merge; Node operations: bootstrap, replace, rebuild, remove node, decommission. Reviewing Active/Completed Tasks Task Manager is divided into modules: the entities that gather information about operations of similar functionality. Task Manager captures and exposes this data using tasks. Each task covers an operation or its part (e.g., a task can represent the part of the repair operation running on a specific shard). Each operation is represented by a tree of tasks. The tree root covers the whole operation. The root may have children, which give more fine-grained control over the operation. The children may have their own children, etc. Let’s consider the example of a global major compaction task tree: The root covers the compaction of all keyspaces in a node; The children of the root task cover a single keyspace; The second-degree descendants of the root task cover a single keyspace on a single shard; The third-degree descendants of the root task cover a single table on a single shard; etc. You can inspect a task from each depth to see details on the operation’s progress.   Determining How Long Tasks Are Shown Task Manager can show completed tasks as well as running ones. The completed tasks are removed from Task Manager after some time. To customize how long a task’s status is preserved, modify task_ttl_in_seconds (aka task_ttl) and user_task_ttl_in_seconds (aka user_task_ttl) configuration parameters. Task_ttl applies to operations that are started internally, while user_task_ttl refers to those initiated by the user. When the user starts an operation, the root of the task tree is a user-task. Descendant tasks are internal and such tasks are unregistered after they finish, propagating their status to their parents. Node Tasks vs Cluster Tasks Task Manager tracks operations local to a node as well as global cluster-wide operations. A local task is created on a node that the respective operation runs on. 
Its status may be requested only from a node on which the task was created. A global task always covers the whole operation. It is the root of a task tree and it may have local children. A global task is reachable from each node in a cluster. The task_ttl and user_task_ttl parameters are not relevant for global tasks.

Per-Task Details

When you list all tasks in a Task Manager module, it shows brief information about them with task_stats. Each task has a unique task_id and a sequence_number that’s unique within its module. All tasks in a task tree share the same sequence_number.

Task stats also include several descriptive attributes:

  • kind: either “node” (a local operation) or “cluster” (a global one).
  • type: what specific operation this task involves (e.g., “major compaction” or “intranode migration”).
  • scope: the level of granularity (e.g., “keyspace” or “tablet”). Additional attributes such as shard, keyspace, table, and entity can further specify the scope.

Status fields summarize the task’s state and timing:

  • state: indicates whether the task was created, running, done, failed, or suspended.
  • start_time and end_time: indicate when the task began and finished. If a task is still running, its end_time is set to epoch.

When you request a specific task’s status, you’ll see more detailed metrics:

  • progress_total and progress_completed show how much work is done, measured in progress_units.
  • parent_id and children_ids place the task within its tree hierarchy.
  • is_abortable indicates whether the task can be stopped before completion.
  • If the task failed, you will also see the exact error message.

Interacting with Task Manager

Task Manager provides a REST API for listing, monitoring, and controlling ScyllaDB’s background operations. You can also use it to manage the execution of long-running maintenance tasks started with the asynchronous API instead of blocking a client call. If you prefer command-line tools, the same functionality is available through nodetool tasks.

Using the Task Management API

Task Manager exposes a REST API that lets you manage tasks:

  • GET /task_manager/list_modules – lists all supported Task Manager modules.
  • GET /task_manager/list_module_tasks/{module} – lists all tasks in a specified module.
  • GET /task_manager/task_status/{task_id} – shows the detailed status of a specified task.
  • GET /task_manager/wait_task/{task_id} – waits for a specified task and shows its status.
  • POST /task_manager/abort_task/{task_id} – aborts a specified task.
  • GET /task_manager/task_status_recursive/{task_id} – gets statuses of a specified task and all its descendants.
  • GET/POST /task_manager/ttl – gets/sets task_ttl.
  • GET/POST /task_manager/user_ttl – gets/sets user_task_ttl.
  • POST /task_manager/drain/{module} – drains the finished tasks in a specified module.

Running Maintenance Tasks Asynchronously

Some ScyllaDB maintenance operations can take a while to complete, especially at scale. Waiting for them to finish through a synchronous API call isn’t always practical. Thanks to Task Manager, existing synchronous APIs are easily and consistently converted into asynchronous ones. Instead of waiting for an operation to finish, a new API can immediately return the ID of the root task representing the started operation. Using this task_id, you can check the operation’s progress, wait for completion, or abort it if needed. This gives you a unified and consistent way to manage all those long-running tasks.

Nodetool

A task can be managed using nodetool’s tasks command. For details, see the related nodetool docs page.
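For illustration, here is what a few of the REST calls listed above might look like with curl against a single node. The endpoint paths come from the list above; the port (10000) matches the API examples later in this post, and the task ID is a placeholder:

```bash
# List the supported Task Manager modules
curl http://127.0.0.1:10000/task_manager/list_modules

# List all tasks in the compaction module
curl http://127.0.0.1:10000/task_manager/list_module_tasks/compaction

# Show the detailed status of one task, then abort it
curl http://127.0.0.1:10000/task_manager/task_status/<task_id>
curl -X POST http://127.0.0.1:10000/task_manager/abort_task/<task_id>

# Read the current task_ttl
curl http://127.0.0.1:10000/task_manager/ttl
```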
Example: Tracking and Managing Tasks

Preparation

To start, we locally set up a cluster of three nodes with the IP addresses 127.43.0.1, 127.43.0.2, and 127.43.0.3. Next, we create two keyspaces: keyspace1 with replication factor 3 and keyspace2 with replication factor 2. In each keyspace, we create two tables: table1 and table2 in keyspace1, and table3 and table4 in keyspace2. We populate them with data.

Exploring Task Manager

Let’s start by listing the modules supported by Task Manager:

nodetool tasks modules -h 127.43.0.1

Starting and Tracking a Repair Task

We request a tablet repair on all tokens of table keyspace2.table3:

curl -X POST --header 'Content-Type: application/json' --header 'Accept: application/json' 'http://127.43.0.3:10000/storage_service/tablets/repair?ks=keyspace2&table=table3&tokens=all'

{"tablet_task_id":"2f06bff0-ab45-11f0-94c2-60ca5d6b2927"}

In response, we get the task id of the respective tablet repair task. We can use it to track the progress of the repair. Let’s check whether the task with id 2f06bff0-ab45-11f0-94c2-60ca5d6b2927 is listed in the tablets module:

nodetool tasks list tablets -h 127.43.0.1

Apart from the repair task, we can see that there are two intranode migrations running. All the tasks are of kind “cluster”, which means that they cover global operations. All these tasks would be visible regardless of which node we request them from. We can also see the scope of the operations. We always migrate one tablet at a time, so the migration tasks’ scope is “tablet”. For repair, the scope is “table” because we previously started the operation on a whole table. Entity, sequence_number, and shard are irrelevant for global tasks. Since all tasks are running, their end_time is set to a default value (epoch).

Examining Task Status

Let’s examine the status of the tablet repair using its task_id. Global tasks are available on the whole cluster, so we change the requested node… just because we can. 😉

nodetool tasks status 2f06bff0-ab45-11f0-94c2-60ca5d6b2927 -h 127.43.0.3

The task status contains detailed information about the tablet repair task. We can see whether the task is abortable (via the task_manager API). There could also be some additional information that’s not applicable for this particular task:

  • error, which would be set if the task failed;
  • parent_id, which would be set if it had a parent (impossible for a global task);
  • progress_unit, progress_total, and progress_completed, which would indicate task progress (not yet supported for tablet repair tasks).

There’s also a list of tasks that were created as a part of the global task. The list above has been shortened to improve readability. The key point is that children of a global task may be created on all nodes in a cluster. Those children are local tasks (because global tasks cannot have a parent). Thus, they are reachable only from the nodes where they were created. For example, the status of task 1eb69569-c19d-481e-a5e6-0c433a5745ae should be requested from node 127.43.0.2:

nodetool tasks status 1eb69569-c19d-481e-a5e6-0c433a5745ae -h 127.43.0.2

As expected, the child’s kind is “node”. Its parent_id references the tablet repair task’s task_id. The task has completed successfully, as indicated by the state. The end_time of the task is set. Its sequence_number is 15, which means it is the 15th task in its module. The task’s scope could be wider than the parent’s – it could encompass the whole keyspace – but in this case it is limited to the parent’s scope.
The task’s progress is measured in ranges, and we can see that exactly one range was repaired. This task has one child that is created on the same node as its parent. That’s always true for local tasks. nodetool tasks status 70d098c4-df79-4ea2-8a5e-6d7386d8d941 -h 127.43.0.3 We may examine other children of the global tablet repair task too. However, we may only check each one on the node where it was created. Let’s wait until the global task is completed. nodetool tasks wait 2f06bff0-ab45-11f0-94c2-60ca5d6b2927 -h 127.43.0.2 We can see that its state is “done” and its end_time is set. Working with Compaction Tasks Let’s start some compactions and have a look at the compaction module. nodetool tasks list compaction -h 127.43.0.2 We can see that one of the major compaction tasks is still running. Let’s abort it and check its task tree. nodetool tasks abort 16a6cdcc-bb32-41d0-8f06-1541907a3b48 -h 127.43.0.2 nodetool tasks tree 16a6cdcc-bb32-41d0-8f06-1541907a3b48 -h 127.43.0.2 We can see that the abort request propagated to one of the task’s children and aborted it. That task now has a failed state and its error field contains abort_requested_exception. Managing Asynchronous Operations Beyond examining the running operations, Task Manager can manage asynchronous operations started with the REST API. For example, we may start a major compaction of a keyspace synchronously with /storage_service/keyspace_compaction/{keyspace} or use an asynchronous version of this API: curl -X POST --header 'Content-Type: application/json' --header 'Accept: application/json' 'http://127.43.0.1:10000/tasks/compaction/keyspace_compaction/keyspace2' "4c6f3dd4-56dc-4242-ad6a-8be032593a02" The response includes the task_id of the operation we just started. This id may be used in Task Manager to track the progress, wait for the operation, or abort it. Key Takeaways The Task Manager provides a clear, unified way to observe and control background maintenance work in ScyllaDB. Visibility: It shows detailed, hierarchical information about ongoing and completed operations, from cluster-level tasks down to individual shards. Consistency: You can use the same mechanisms for listing, tracking, and managing all asynchronous operations. Control: You can check progress, wait for completion, or abort tasks directly, without guessing what’s running. Extensibility: It also provides a framework for turning synchronous APIs into asynchronous ones by returning task IDs that can be monitored or managed through the Task Manager. Together, these capabilities make it easier to see what ScyllaDB is doing, keep the system stable, and convert long-running operations to asynchronous workflows.

The Cost of Multitenancy

DynamoDB and ScyllaDB share many similarities, but DynamoDB is a multi-tenant database, while ScyllaDB is single-tenant

The recent DynamoDB outage is a stark reminder that even the most reliable and mature cloud services can experience downtime. Amazon DynamoDB remains a strong and proven choice for many workloads, and many teams are satisfied with its latency and cost. However, incidents like this highlight the importance of architecture, control, and flexibility when building for resilience.

DynamoDB and ScyllaDB share many similarities:

  • Both are distributed NoSQL databases with the same “ancestor”: the Dynamo paper (although both databases have significantly evolved from the original concept).
  • A compatible API: The DynamoDB API is one of two supported APIs in ScyllaDB Cloud.
  • Both use multi-zone deployment for high availability.
  • Both support multi-region deployment. DynamoDB uses Global Tables (see this analysis for more). ScyllaDB can go beyond and allow multi-cloud deployments, or on-prem/hybrid deployments.

But they also have a major difference: DynamoDB is a multi-tenant database, while ScyllaDB is single-tenant.

Source: https://blog.bytebytego.com/p/a-deep-dive-into-amazon-dynamodb

Multi-tenancy has notable advantages for the vendor:

  • Lower infrastructure cost: Since tenants’ peaks don’t align, the vendor can provision for the aggregate average rather than the sum of all peaks, and even safely over-subscribe resources.
  • Shared burst capacity: Extra capacity for traffic spikes is pooled across all users.

Multi-tenancy also comes with significant technical challenges and is never perfect. All users still share the same underlying resources (CPU, storage, and network) while the service works hard to preserve the illusion of a dedicated environment for each tenant (e.g., using various isolation mechanisms). However, sometimes the isolation breaks and the real architecture behind the curtain is revealed. One example is the Noisy Neighbor issue. Another is that when a shared resource breaks, like the DNS endpoint in the latest DynamoDB outage, MANY users are affected. In this case, all DynamoDB users in a region suffer.

ScyllaDB Cloud takes a different approach: all database resources are completely separated from each other. Each ScyllaDB database runs:

  • On dedicated VMs
  • On a dedicated VPC
  • In a dedicated Security Group
  • Using a dedicated endpoint and an (optional) dedicated Private Link
  • With isolated authorization and authentication (per database)
  • With dedicated Monitoring and Administration (ScyllaDB Manager) servers

When using ScyllaDB Cloud Bring Your Own Account (BYOA), the entire deployment runs on the *user* account, often on a dedicated sub-account. This provides additional isolation. The ScyllaDB Cloud control plane is loosely coupled to the managed databases. Even in the case of a disconnect, the database clusters will continue to serve requests. This design greatly reduces the blast radius of any one issue.

While the single-tenant architecture is more resilient, it does come with a few challenges:

  • Scaling: To scale, ScyllaDB needs to allocate new resources (nodes) from EC2 and depends on the EC2 API to allocate them. Tablets and X Cloud have made a great improvement in reducing scaling time.
  • Workload Isolation: ScyllaDB allows users to control the resource bandwidth per workload with Workload Prioritization (docs | tech talk | demo).
  • Pricing: Using numerous optimization techniques, like shard-per-core, ScyllaDB achieves extreme performance per node, which allows us to provide lower prices than DynamoDB for most use cases.

To conclude: DynamoDB optimizes for multi-tenancy, whereas ScyllaDB favors stronger tenant isolation and a smaller blast radius.

Cache vs. Database: How Architecture Impacts Performance

Lessons learned comparing Memcached with ScyllaDB Although caches and databases are different animals, databases have always cached data and caches started to use disks, extending beyond RAM. If an in-memory cache can rely on flash storage, can a persistent database also function as a cache? And how far can you reasonably push each beyond its original intent, given the power and constraints of its underlying architecture? A little while ago, I joined forces with Memcached maintainer Alan Kasindorf (a.k.a. dormando) to explore these questions. The collaboration began with the goal of an “apples to oranges” benchmark comparing ScyllaDB with Memcached, which is covered in the article “We Compared ScyllaDB and Memcached and… We Lost?” A few months later, we were pleasantly surprised that the stars aligned for P99 CONF. At the last minute, Kasindorf was able to join us to chat about the project – specifically, what it all means for developers with performance-sensitive use cases. Note: P99 CONF is a highly technical conference on performance and low-latency engineering. We just wrapped P99 CONF 2025, and you can watch the core sessions on-demand. Watch on demand  Cache Efficiency Which data store uses memory more efficiently? To test it, we ran a simple key-value workload on both systems. The results: Memcached cached 101 million items before evictions began ScyllaDB cached only 61 million items before evictions Cache efficiency comparison What’s behind the difference? ScyllaDB also has its own LRU (Least Recently Used) cache, bypassing the Linux cache. But unlike Memcached, ScyllaDB supports a wide-column data representation: A single key may contain many rows. This, along with additional protocol overhead, causes a single write in ScyllaDB to consume more space than a write in Memcached. Drilling down into the differences, Memcached has very little per-item overhead. In the example from the image above, each stored item consumes either 48 or 56 bytes, depending on whether compare and swap (CAS) is enabled. In contrast, ScyllaDB has to handle a lot more (it’s a persistent database after all!). It needs to allocate space for its memtables, Bloom filters and SSTable summaries so it can efficiently retrieve data from disk when a cache miss occurs. On top of that, ScyllaDB supports a much richer data model(wide column). Another notable architectural difference stands out in the performance front: Memcached is optimized for pipelined requests (think batching, as in DynamoDB’s BatchGetItem), considerably reducing the number of roundtrips over the network to retrieve several keys. ScyllaDB is optimized for single (and contiguous) key retrievals under a wide-column representation. Read-only in-memory efficiency comparison Following each system’s ideal data model, both ScyllaDB and Memcached managed to saturate the available network throughput, servicing around 3 million rows/s while sustaining below single-digit millisecond P99 latencies. Disks and IO Efficiency Next, the focus shifted to disks. We measured performance under different payload sizes, as well as how efficiently each of the systems could maximize the underlying storage. With Extstore and small (1K) payloads, Memcached stored about 11 times more items (compared to its in-memory workload) before evictions started to kick in, leaving a significant portion of free available disk space. This happens because, in addition to the regular per-key overhead, Memcached stores an additional 12 bytes per item in RAM as a pointer to storage. 
As RAM gets depleted, Extstore is no longer effective and users will no longer observe savings beyond that point.

Disk performance with small payloads comparison

For the actual performance tests, we stressed Extstore against item sizes of 1KB and 8KB. The table below summarizes the results:

Test Type             Payload Size  I/O Threads  GET Rate  P99 Latency
perfrun_metaget_pipe  1KB           32           188K/s    4~5 ms
perfrun_metaget       1KB           32           182K/s    <1 ms
perfrun_metaget_pipe  1KB           64           261K/s    5~6 ms
perfrun_metaget       1KB           64           256K/s    1~2 ms
perfrun_metaget_pipe  8KB           16           92K/s     5~6 ms
perfrun_metaget       8KB           16           90K/s     <1 ms
perfrun_metaget_pipe  8KB           32           110K/s    3~4 ms
perfrun_metaget       8KB           32           105K/s    <1 ms

We populated ScyllaDB with the same number of items as we used for Memcached. ScyllaDB actually achieved higher throughput – and just slightly higher latency – than Extstore. I’m pretty sure that if the throughput had been reduced, the latency would have been lower. But even with no tuning, the performance is quite comparable. This is summarized below:

Test Type  Payload Size  GET Rate   Server-Side P99  Client-Side P99
1KB Read   1KB           268.8K/s   2 ms             2.4 ms
8KB Read   8KB           156.8K/s   1.54 ms          1.9 ms

A few notable points from these tests:

  • Extstore required considerable tuning to fully saturate flash storage I/O.
  • Due to Memcached’s architecture, smaller payloads are unable to fully use the available disk space, providing smaller gains compared to ScyllaDB.
  • ScyllaDB rates were overall higher than Memcached in a key-value orientation, especially under higher payload sizes. ScyllaDB’s latencies were better than Memcached’s pipelined requests, but slightly higher than individual GETs.

I/O Access Methods Discussion

These disk-focused tests unsurprisingly sparked a discussion about the different I/O access methods used by ScyllaDB vs. Memcached/Extstore.

I explained that ScyllaDB uses asynchronous direct I/O. For an extensive discussion of this, read this blog post by ScyllaDB CTO and cofounder Avi Kivity. Here’s the short version: ScyllaDB is a persistent database. When people adopt a database, they rightfully expect that it will persist their data. So, direct I/O is a deliberate choice. It bypasses the kernel page cache, giving ScyllaDB full control over disk operations. This is critical for things like compactions, write-ahead logs and efficiently reading data off disk. A user-space I/O scheduler is also involved. It lives in the middle and decides which operation gets how much I/O bandwidth. That could be an internal compaction task or a user-facing query. It arbitrates between them. That’s what enables ScyllaDB to balance persistence work with latency-sensitive operations.

Extstore takes a rather different approach: keep things as simple as possible and avoid touching the disk unless it’s absolutely necessary. As Kasindorf put it: “We do almost nothing.” That’s fully intentional. Most operations – like deletes, TTL updates, or overwrites – can happen entirely in memory, with no disk access needed. So Extstore doesn’t bother with a scheduler. Without a scheduler, Extstore performance tuning is manual. You can change the number of Extstore I/O threads to get better utilization. If you roll it out and notice that your disk doesn’t look fully utilized – and you still have a lot of spare CPU – you can bump up the thread count. Kasindorf mentioned that it will likely become self-tuning at some point. But for now, it’s a knob that users can tweak.

Another important piece is how Extstore layers itself on top of Memcached’s existing RAM cache. It’s not a replacement; it’s additive.
You still have your in-memory cache and Extstore just handles the overflow. Here’s how Kasindorf explained it: “If you have, say, five gigs of RAM and one gig of that is dedicated to these small pointers that point from memory into disk, we still have a couple extra gigs left over for RAM cache.” That means if a user is actively clicking around, their data may never even go to disk. The only time Extstore might need to read from disk is when the cache has gone cold (for instance, a user returning the next day). Then the entries get pulled back in. Basically, while ScyllaDB builds around persistent, high-performance disk I/O (with scheduling, direct control and durable storage), Extstore is almost the opposite. It’s light, minimal and tries to avoid disk entirely unless it really has to. Conclusion and Takeaways Across these and the other tests that we performed in the full benchmark, Memcached and ScyllaDB both managed to maximize the underlying hardware utilization and keep latencies predictably low. So which one should you pick? The real answer: It depends. If your existing workload can accommodate a simple key-value model and it benefits from pipelining, then Memcached should be more suitable to your needs. On the other hand, if the workload requires support for complex data models, then ScyllaDB is likely a better fit. Another reason for sticking with Memcached: It easily delivers traffic far beyond what a network interface card can sustain. In fact, in this Hacker News thread, dormando mentioned that he could scale it up past 55 million read ops/sec for a considerably larger server. Given that, you could make use of smaller and/or cheaper instance types to sustain a similar workload, provided the available memory and disk footprint meet your workload needs. A different angle to consider is the data set size. Even though Extstore provides great cost savings by allowing you to store items beyond RAM, there’s a limit to how many keys can fit per gigabyte of memory. Workloads with very small items should observe smaller gains compared to those with larger items. That’s not the case with ScyllaDB, which allows you to store billions of items irrespective of their sizes. It’s also important to consider whether data persistence is required. If it is, then running ScyllaDB as a replicated distributed cache provides you greater resilience and non-stop operations, with the tradeoff being (and as Memcached correctly states) that replication halves your effective cache size. Unfortunately, Extstore doesn’t support warm restarts and thus the failure or maintenance of a single node is prone to elevating your cache miss ratios. Whether this is acceptable depends on your application semantics: If a cache miss corresponds to a round-trip to the database, then the end-to-end latency will be momentarily higher. Regardless of whether you choose a cache like Memcached or a database like ScyllaDB, I hope this work inspires you to think differently about performance testing. As we’ve seen, databases and caches are fundamentally different. And at the end of the day, just comparing performance numbers isn’t enough. Moreover, recognize that it’s hard to fully represent your system’s reality with simple benchmarks, and every optimization comes with some trade-offs. For example, pipelining is great, but as we saw with Extstore, it can easily introduce I/O contention. 
ScyllaDB’s shard-per-core model and support for complex data models are also powerful, but they come with costs too, like losing some pipelining flexibility and adding memory overhead.  

11X Faster ScyllaDB Backup

Learn about ScyllaDB’s new native backup, which improves backup speed up to 11X by using Seastar’s CPU and IO scheduling

ScyllaDB’s 2025.3 release introduces native backup functionality. Previously, an external process managed backups independently, without visibility into ScyllaDB’s internal workload. Now, Seastar’s CPU and I/O schedulers handle backups internally, which gives ScyllaDB full control over prioritization and resource usage. In this blog post, we explain why we changed our approach to backup, share what users need to know, and provide a preview of what to expect next.

What We Changed and Why

Previously, SSTable backups to S3 were managed entirely by ScyllaDB Manager and the Scylla Manager Agent running on each node. You would schedule the backup, and Manager would coordinate the required operations (taking snapshots, collecting metadata, and orchestrating uploads). Scylla Manager Agent handled all the actual data movement.

The problem with this approach was that it was often too slow for our users’ liking, especially at the massive scale that’s common across our user base. Since uploads ran through an external process, they competed with ScyllaDB for resources (CPU, disk I/O, and network bandwidth). The rclone process read from disk at the same time that ScyllaDB did – so two processes on the same node were performing heavy disk I/O simultaneously. This contention on the disk could impact query latencies when user requests were being processed during a backup. To mitigate the effect on real-time database requests, we used a systemd slice to control the Scylla Manager Agent’s resources. This solution successfully reduced backup bandwidth, but it failed to increase the bandwidth when the pressure from online requests was low.

To optimize this process, ScyllaDB now provides a native backup capability. Rather than relying on an external agent (the Scylla Manager Agent) to copy files, ScyllaDB uploads files directly to S3. The new approach is faster and more efficient because ScyllaDB uses its internal IO and CPU scheduling to control the backup operations. Backup operations are assigned a lower priority than user queries. In the event of resource contention, ScyllaDB will deprioritize them so they don’t interfere with the latency of the actual workload.

Note that this new native backup capability is currently available for AWS. It is coming soon for other backup targets (such as GCP Cloud Storage and Azure Storage). To enable native backup, configure the S3 connectivity in each node’s scylla.yaml and set the desired strategy (Native, Auto, or Rclone) in ScyllaDB Manager. Note that the rclone agent is always used to upload backup metadata, so you should still configure the Manager Agent even if you are using native backup and restore.
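For orientation, once S3 connectivity is configured, scheduling the backup itself still goes through ScyllaDB Manager. A minimal invocation might look like the sketch below; the cluster name and bucket are placeholders, and the strategy (Native, Auto, or Rclone) is selected in Manager as described above rather than shown here:

```bash
# Placeholder cluster and bucket names; see the ScyllaDB Manager docs for the
# exact options available in your Manager version.
sctool backup -c my-cluster -L s3:my-backup-bucket
```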
Performance Improvements

So how much faster is the new backup approach? We recently ran some tests to find out. We ran two tests that are the same in all aspects except for the tool being used for backup: rclone in one and native ScyllaDB in the other.

Test Setup

The test uses 6 i4i.2xlarge nodes with a total of 2TB of injected data at RF=3. That means the 2TB of injected data becomes 6TB (RF=3), and these 6TB are spread across 6 nodes, resulting in each node holding 1TB of data. The backup benchmark then measures how long it takes to back up the entire cluster; the size reported below is the data size of one node.

Native Backup

Here are the results of the native backup tests:

Name                     Size       Time [hh:mm:ss]
native_backup_1016_2234  1.057 TiB  00:19:18

Data was uploaded at a rate of approximately 900 MB/s.

OS Tx Bytes during backup

The slightly higher values for the OS metrics are due to, for example, TCP retransmits and HTTP headers, which are not part of the data but are part of the transmitted bytes.

rclone Backup

The exact same test with rclone produced the following results:

Name                     Size       Time [hh:mm:ss]
rclone_backup_1017_2334  1.057 TiB  03:48:57

Here, data was uploaded at a rate of approximately 80 MB/s.

Next Up: Faster Restore

Next, we’re optimizing restore, which is the more complex part of the backup/restore process. Backups are relatively straightforward: you just upload the data to object storage. But restoring that data is harder, especially if you need to bring a cluster back online quickly or restore it onto a topology that’s different from the original one. The original cluster’s nodes, token ranges, and data distribution might look quite different from the new setup – but during restore, ScyllaDB must somehow map between what was backed up and what the new topology expects.

Replication adds even more complexity. ScyllaDB replicates data according to the specified replication factor (RF), so the backup has multiple copies of the same data. During the restore process, we don’t want to redundantly download or process those copies; we need a way to handle them efficiently. And one more complicating factor: the restore process must understand whether the cluster uses virtual nodes or tablets because that affects how data is distributed.

Wrapping Up

ScyllaDB’s move to native integration with object storage is a big step forward for the faster backup/restore operations that many of our large-scale users have been asking for. We’ve already sped up backups by eliminating the extra rclone layer. Now, our focus is on making restores equally efficient while handling complex topologies, replication, and data distribution. This will make it faster and easier to restore large clusters. Looking ahead, we’re working on using object storage not only for backup and restore, but also for tiering: letting ScyllaDB read data directly from object storage as if it were on local disk. For a more detailed look at ScyllaDB’s plans for backup, restore, and object storage as native storage, see this video:

Vector search benchmarking: Embeddings, insertion, and searching documents with ClickHouse® and Apache Cassandra®

Welcome back to our series on vector search benchmarking. In part 1, we dove into setting up a benchmarking project and explored how to implement vector search in PostgreSQL from the example code in GitHub. We saw how a hands-on project with students from Northeastern University provided a real-world testing ground for Retrieval-Augmented Generation (RAG) pipelines.

Now, we’re continuing our journey by exploring two more powerful open source technologies: ClickHouse and Apache Cassandra. Both handle vector data differently and understanding their methods is key to effective vector search benchmarking. Using the same student project as our guide, this post will examine the code for embedding, inserting, and retrieving data to see how these technologies stack up.

Let’s get started.

Vector search benchmarking with ClickHouse

ClickHouse is a column-oriented database management system known for its incredible speed in analytical queries. It’s no surprise that it has also embraced vector search. Let’s see how the student project team implemented and benchmarked the core components.

Step 1: Embedding and inserting data

scripts/vectorize_and_upload.py

This is the file that handles Step 1 of the pipeline for ClickHouse. Embeddings in this file (scripts/vectorize_and_upload.py) are used as vector representations of Guardian news articles for the purpose of storing them in a database and performing semantic search. Here’s how embeddings are handled step-by-step (the steps look similar to PostgreSQL).

First up is the generation of embeddings (a minimal sketch follows the list below). The same SentenceTransformer model used in part 1 (all-MiniLM-L6-v2) is loaded in the class constructor. In the method generate_embeddings(self, articles), for each article:

  • The article’s title and body are concatenated into a text string.
  • The model generates an embedding vector (self.model.encode(text_for_embedding)), which is a numerical representation of the article’s semantic content.
  • The embedding is added to the article’s dictionary under the key embedding.
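A minimal sketch of that step (not the project’s actual code) could look like the following; the model name matches the post, while the exact article dictionary shape is an assumption:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

def generate_embeddings(articles):
    # Concatenate title and body, encode, and attach the vector under "embedding".
    for article in articles:
        text_for_embedding = f"{article['title']} {article['body']}"
        article["embedding"] = model.encode(text_for_embedding).tolist()
    return articles
```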

Then the embeddings are stored in ClickHouse as follows.

  • The database table guardian_articles is created with an embedding Array(Float64) NOT NULL column specifically to store these vectors (a possible table definition is sketched after this list).
  • In upload_to_clickhouse_debug(self, articles_with_embeddings), the script inserts articles into ClickHouse, including the embedding vector as part of each row.
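To make that concrete, a table definition consistent with this description might look like the sketch below. Only the embedding Array(Float64) column comes directly from the post; the remaining column types, the engine, and the ordering key are assumptions:

```sql
CREATE TABLE IF NOT EXISTS guardian_articles
(
    url              String,
    title            String,
    body             String,
    publication_date DateTime,
    embedding        Array(Float64) NOT NULL
)
ENGINE = MergeTree
ORDER BY url;
```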

Step 2: Vector search and retrieval

services/clickhouse/clickhouse_dao.py

The steps to search are the same as for PostgreSQL in part 1. Here’s part of the related_articles method for ClickHouse:

def related_articles(self, query: str, limit: int = 5):
    """Search for similar articles using vector similarity"""
    ...
    query_embedding = self.model.encode(query).tolist()
    search_query = f"""
        SELECT url, title, body, publication_date,
               cosineDistance(embedding, {query_embedding}) as distance
        FROM guardian_articles
        ORDER BY distance ASC
        LIMIT {limit}
    """
    ...

When searching for related articles, it encodes the query into an embedding, then performs a vector similarity search in ClickHouse using cosineDistance between stored embeddings and the query embedding, and results are ordered by similarity, returning the most relevant articles.

Vector search benchmarking with Apache Cassandra

Next, let’s turn our attention to Apache Cassandra. As a distributed NoSQL database, Cassandra is designed for high availability and scalability, making it an intriguing option for large-scale RAG applications.

Step 1: Embedding and inserting data

scripts/pull_docs_cassandra.py

As in the above examples, embeddings in this file are used to convert article text (body) into numerical vector representations for storage and later retrieval in Cassandra.

For each article, the code extracts the body and computes the embeddings:

embedding = model.encode(body)
embedding_list = [float(x) for x in embedding]
  • model.encode(body) converts the text to a NumPy array of 384 floats.
  • The array is converted to a standard Python list of floats for Cassandra storage.

Next, the embedding is stored in the vector column of the articles table using a CQL INSERT:

insert_cql = SimpleStatement("""
    INSERT INTO articles (url, title, body, publication_date, vector)
    VALUES (%s, %s, %s, %s, %s)
    IF NOT EXISTS;
""")
result = session.execute(insert_cql, (url, title, body, publication_date, embedding_list))

The schema for the table specifies: vector vector<float, 384>, meaning each article has a corresponding 384-dimensional embedding. The code also creates a custom index for the vector column:

session.execute("""
    CREATE CUSTOM INDEX IF NOT EXISTS ann_index
    ON articles(vector)
    USING 'StorageAttachedIndex';
""")

This enables efficient vector (ANN: Approximate Nearest Neighbor) search capabilities, allowing similarity queries on stored embeddings.

A key part of the setup is the schema and indexing. The Cassandra schema in services/cassandra/init/01-schema.cql defines the vector column.

Because Cassandra is a NoSQL database, its schemas are a bit different from normal SQL databases, so it’s worth taking a closer look. This Cassandra schema is designed to support Retrieval-Augmented Generation (RAG) architectures, which combine information retrieval with generative models to answer queries using both stored data and generative AI. Here’s how the schema supports RAG (a consolidated CQL sketch follows the list):

  • Keyspace and table structure
    • Keyspace (vectorembeds): Analogous to a database, this isolates all RAG-related tables and data.
    • Table (articles): Stores retrievable knowledge sources (e.g., articles) for use in generation.
  • Table columns
    • url TEXT PRIMARY KEY: Uniquely identifies each article/document, useful for referencing and deduplication.
    • title TEXT and body TEXT: Store the actual content and metadata, which may be retrieved and passed to the generative model during RAG.
    • publication_date TIMESTAMP: Enables filtering or ranking based on recency.
    • vector VECTOR<FLOAT, 384>: Stores the embedding representation of the article. The new Cassandra vector data type is documented here.
  • Indexing
    • Sets up an Approximate Nearest Neighbor (ANN) index using Cassandra’s Storage Attached Index.
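Putting the pieces above together, the schema might look like the following consolidated sketch. The column definitions and index come from the post; the replication settings are an assumption:

```cql
CREATE KEYSPACE IF NOT EXISTS vectorembeds
  WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};

CREATE TABLE IF NOT EXISTS vectorembeds.articles (
    url               TEXT PRIMARY KEY,
    title             TEXT,
    body              TEXT,
    publication_date  TIMESTAMP,
    vector            VECTOR<FLOAT, 384>
);

CREATE CUSTOM INDEX IF NOT EXISTS ann_index
  ON vectorembeds.articles (vector)
  USING 'StorageAttachedIndex';
```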

More information about Cassandra vector support is in the documentation.

Step 2: Vector search and retrieval

The retrieval logic in services/cassandra/cassandra_dao.py showcases the elegance of Cassandra’s vector search capabilities.

The code to create the query embeddings and perform the query is similar to the previous examples, but the CQL query to retrieve similar documents looks like this:

query_cql = """
    SELECT url, title, body, publication_date
    FROM articles
    ORDER BY vector ANN OF ?
    LIMIT ?
"""
prepared = self.client.prepare(query_cql)
rows = self.client.execute(prepared, (emb, limit))
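Here, emb is assumed to be the query text passed through the same embedding model used at ingestion time, converted to a plain list of floats, mirroring the earlier ingestion snippet:

```python
# Assumed query-side embedding step, mirroring the ingestion code shown earlier.
emb = [float(x) for x in model.encode(query)]
```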

What have we learned?

By exploring the code from this RAG benchmarking project we’ve seen distinct approaches to vector search. Here’s a summary of key takeaways:

  • Critical steps in the process:
    • Step 1: Embedding articles and inserting them into the vector databases.
    • Step 2: Embedding queries and retrieving relevant articles from the database.
  • Key design pattern:
    • The DAO (Data Access Object) design pattern provides a clean, scalable way to support multiple databases.
    • This approach could extend to other databases, such as OpenSearch, in the future.
  • Additional insights:
    • It’s possible to perform vector searches over the latest documents, pre-empting queries, and potentially speeding up the pipeline.

What’s next?

So far, we have only scratched the surface. The students built a complete benchmarking application with a GUI (using Streamlit), used several other interesting components (e.g. LangChain, LangGraph, FastAPI and uvicorn), added Grafana and LangSmith for metrics, used Claude to answer questions from the retrieved articles, and included Docker support for the components. They also revealed some preliminary performance results! Here’s what the final system looked like (this and the previous blog focused on the bottom boxes only).

student-built benchmarking application flow chart

In a future article, we will examine the rest of the application code, look at the preliminary performance results the students uncovered, and discuss what they tell us about the trade-offs between these different databases.

Ready to learn more right now? We have a wealth of resources on vector search. You can explore our blogs on ClickHouse vector search and Apache Cassandra Vector Search (here, here, and here) to deepen your understanding.


P99 CONF 2025 Recap: Latency to LLMs

Another year has flown by — in the blink of an eye, I found myself back in the US hosting P99 CONF again. This makes it my third time on stage and the fifth in this incredible series of talks from engineers around the world, all sharing their stories about chasing that elusive P99 latency. Beyond raw speed and tail latency, we explored modern systems programming, kernel innovation, databases and storage at scale, observability, testing, performance insights… and of course, this year’s big wave: artificial intelligence, with vector search and LLMs taking center stage. In this blog post, I’ll help you chart a course through it all (though honestly, every session is worth your time). Watch 60+ P99 CONF Talks On Demand Starting with low latency “Taming tail latency” has been a running theme at P99 for years, and this time PayPal and TigerBeetle both took the stage to show how they deal with unpredictable outliers that wreck user experience. PayPal showed how tiny inefficiencies multiply under load. And (as we’ve come to expect from the past optimization talks), TigerBeetle engineered determinism, single-threaded scheduling, predictable I/O, and batching everywhere. ScyllaDB joined the party with their low-latency vector search engine, proving that AI queries don’t have to experience high latency. If your database architecture already handles the long tail at the storage layer, your vector workloads inherit that speed for free. And since I can’t resist some speculative trading (Aussies love a bet), I enjoyed the talks from Maven Securities on lock-free queues for trading systems and Bloomberg on building scalable, end-to-end latency metrics from distributed traces. If you’re chasing nanoseconds instead of milliseconds, check out Steve Heller’s Design Considerations for P99-Optimized Hash Tables. Rust was everywhere (again) Turso is rewriting SQLite in Rust. ClickHouse tried converting 1.5 million lines of C++ … or at least part of it. Neon rebuilt its I/O stack with tokio and io_uring, Datadog squeezed out extra juice from Lambda extensions in Rust, and Trigo visualized async abstractions (also Rust). Maybe we’re the unofficial Rust conf? But Go and C++ held their ground too. Miguel Young de la Sota presented a faster protobuf implementation in Go, and Manya Bansal (MIT PhD student) gave a highly detailed talk on how not to program GPUs with some C++ insights there. On the database and storage side Avi Kivity kicked off the conference by sharing how ScyllaDB’s Seastar CPU scheduler prioritizes complex workloads while keeping latency predictable. Nadav Har’El showed how ScyllaDB manages client ingestion to prevent memory blowouts, Dor Laor connected theory to real-world tail behavior, and Felipe explained how tablet replication delivers true elasticity. And Andy Pavlo took us through the very real challenge of both humans and autonomous systems tuning databases. Other databases showed up strong too. Turso is pushing SQLite into new dimensions, DragonflyDB nailed sorted sets with B+ trees, and Qian Li from DBOS shared how they merged app logic and state right into the database. These are all good reminders that performance starts at the data layer. Performance engineering, testing, and observability My old mate Ashutosh Agrawal shared how he built and tested systems for 32 million concurrent cricket fans. Also on testing, I loved the double feature on deterministic simulation testing: Resonate’s approach to catching Heisenbugs and Antithesis running fuzzing workloads at scale in near real time. 
eBPF continues to fascinate. Cosmic elaborated on reliability versus memory trade-offs, and Tanel Poder showed thread-level observability with some seriously impressive tooling. AWS’s Geoff Blake introduced aperf for profiling the nitty-gritty, Arm unlocked new insights with the PMUv3 plugin, and Raphael Carvalho took us on a wild hunt through a Linux kernel bug triggered by io_uring. Proper detective work, that one.

And then there’s AI and ML, the new frontier

Chip Huyen delivered a standout keynote on LLM inference optimization, tying together hardware, software, and architecture choices. Eshcar Hillel went deep into KV cache offloading, Microsoft’s Magdalen Manohar showed how to make vector search cost-effective and fast, and ScyllaDB’s Pawel Pery explained how we decoupled a Rust-based vector engine for high-performance ANN queries. We wrapped it up with a cracking conversation between Rachel Stephens from RedMonk and Adrian Cockcroft, exploring AI-assisted analytics and that eternal hunt for predictability at the tail.

Hosting this conference is always a privilege. Big thanks to Natalie Estrada, who keeps the whole thing running (and somehow curates killer playlists); Cynthia Dunlop, who finds the best speakers and replies to messages faster than anyone I know; and Felipe and the lounge crew, daring the demo deities and answering audience questions live. Plus the many ScyllaDB engineers who present, coordinate, and quietly make it all happen behind the scenes. And to all 30,000 registrants and 100,000+ participants so far… you’re what makes this community so special. From all of us at ScyllaDB: a huge thank you. See you at the next one.

Join us for P99 CONF’s sister conference, Monster SCALE Summit.

Optimizing Cassandra Repair for Higher Node Density

This is the fourth post in my series on improving the cost efficiency of Apache Cassandra through increased node density. In the last post, we explored compaction strategies, specifically the new UnifiedCompactionStrategy (UCS) which appeared in Cassandra 5.

Now, we’ll tackle another aspect of Cassandra operations that directly impacts how much data you can efficiently store per node: repair. Having worked with repairs across hundreds of clusters since 2012, I’ve developed strong opinions on what works and what doesn’t when you’re pushing the limits of node density.

Building a Movie Recommendation App with ScyllaDB Vector Search

Use ScyllaDB to perform semantic search across movie plot descriptions

We built a sample movie recommendation app to showcase ScyllaDB’s new vector search capabilities. The sample app gives you a simple way to experience building low-latency semantic search and vector-based applications with ScyllaDB.

See the Quick Start Guide and give it a try. Contact us with your questions, or for a personalized tour.

In this post, we’ll show how to perform semantic search across movie plot descriptions to find movies by meaning, not keywords. This example also shows how you can add ScyllaDB Vector Search to your existing applications. Before diving into the application, let’s clarify what we mean by semantic search and provide some context about similarity functions.

About vector similarity functions

Similarity between two vectors can be calculated in several ways. The most common methods are cosine similarity, dot product (inner product), and L2 (Euclidean) distance. ScyllaDB Vector Search supports all of these functions.

For text embeddings, cosine similarity is the most often used similarity function. That’s because, when working with text, we mostly focus on the direction of the vector, rather than its magnitude. Cosine similarity considers only the angle between the vectors (i.e., the difference in directions) and ignores the magnitude (length of the vector). For example, a short document (1 page) and a longer document (10 pages) on the same topic will still point in similar directions in the vector space even though they are different lengths. This is what makes cosine similarity ideal for capturing topical similarity.

Cosine similarity formula

In practice, many embedding models (e.g., OpenAI models) produce normalized vectors. Normalized vectors all have the same length (magnitude of 1). For normalized vectors, cosine similarity and the dot product return the same result. This is because cosine similarity divides the dot product by the magnitudes of the vectors, which are all 1 when vectors are normalized. The L2 function produces different distance values compared to the dot product or cosine similarity, but the ordering of the embeddings remains the same (assuming normalized vectors).
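For reference, the cosine similarity formula (shown as an image in the original post) is:

cos(A, B) = (A · B) / (‖A‖ ‖B‖)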
Other than ScyllaDB Cloud, no commercial or paid services are needed to run the example.

Configuration and database connection

A config.py file stores ScyllaDB Cloud credentials, including the host address and connection details. A separate ScyllaDB helper module handles the following:

  • Creating the connection and session
  • Inserting and querying data
  • Providing helper functions for clean database interactions

Database schema

The schema is defined in a schema.cql file, executed when running the project's migration script. It includes:

  • Keyspace creation (with a replication factor of 3)
  • Table definition for movies, storing fields like release_date, title, genre, and plot
  • Vector search index

Schema highlights:

  • `plot` – text, stores the movie description used for similarity comparison.
  • `plot_embedding` – vector, embedding representation of the plot, defined using the vector data type with 384 dimensions (matching the Sentence Transformers model).
  • `Primary key` – id as the partition key, for efficient lookups by id.
  • CDC enabled – required for ScyllaDB vector search.
  • `Vector index` – an Approximate Nearest Neighbor (ANN) index created on the plot_embedding column to enable efficient vector queries.

The goal of this schema is to allow efficient search on the plot embeddings and store additional information alongside the vectors.

Embeddings

An Embedding Creator class handles text embedding generation with Sentence Transformers. The function accepts any text input and returns a list of float values that you can insert into ScyllaDB's `vector` column.

Recommendations implemented with vector search

The app's main function is to provide movie recommendations, and these recommendations are implemented using vector search. So we create a module called recommender that handles:

  • Taking the input text
  • Turning the text into embeddings
  • Running vector search

Let's break down the vector search query:

SELECT * FROM recommend.movies ORDER BY plot_embedding ANN OF [0.1, 0.2, 0.3, …] LIMIT 5;

User input is first converted to an embedding, ensuring that we're comparing embedding to embedding. The rows in the table are ordered by similarity using the ANN operator (ANN OF). Results are limited to five similar movies. The SELECT statement retrieves all columns from the table.

In similarity search, we calculate the distance between two vectors. The closer the vectors are in vector space, the more similar their underlying content; in other words, a smaller distance suggests higher similarity. Therefore, ORDER BY sorts results in ascending order, with smaller distances appearing first.

Streamlit UI

The UI, defined in app.py, ties everything together. It takes the user's query, converts it to an embedding, and executes a vector search. The UI displays the best match and a list of other similar movie recommendations.

Try it yourself!

If you want to get started building with ScyllaDB Vector Search, you have several options:

  • Explore the source code on GitHub
  • Use the README to set up the app on your computer
  • Follow the tutorial to build the app from scratch

And if you have questions, use the forum and we'll be happy to help.
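
For reference, here is a minimal sketch of the recommendation flow described above. It assumes the ScyllaDB Python driver (which exposes the same API as the Cassandra driver) and placeholder connection details; the real app keeps credentials in config.py and wraps the queries in a helper module:

from cassandra.cluster import Cluster          # the ScyllaDB Python driver exposes this same API
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

# Placeholder connection details -- use your ScyllaDB Cloud host and credentials.
cluster = Cluster(["your-scylla-cloud-host"])
session = cluster.connect("recommend")

def recommend(query_text: str, limit: int = 5):
    # Turn the user's text into an embedding so we compare embedding to embedding.
    embedding = [float(x) for x in model.encode(query_text)]
    query = (
        "SELECT * FROM recommend.movies "
        f"ORDER BY plot_embedding ANN OF {embedding} LIMIT {limit}"
    )
    return list(session.execute(query))

for row in recommend("American football"):
    print(row.title)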

Building a Low-Latency Vector Search Engine for ScyllaDB

ScyllaDB Vector Search is now available. Learn about the design decisions, testing, and optimizations involved in achieving our performance goals.

December 18, 2025 Update: ScyllaDB Vector Search is now GA and production-ready.

ScyllaDB Vector Search is now available. It brings millisecond-latency vector retrieval to massive scale. This makes ScyllaDB optimal for large-scale semantic search and retrieval-augmented generation workloads. See the Quick Start Guide and give it a try. Contact us with your questions, or for a personalized tour.

In this blog post, we share a bit about what was involved in introducing low-latency, high-throughput Vector Search to ScyllaDB. We'll cover the architectural design decisions behind integrating ScyllaDB's shard-per-core architecture for real-time operations with high-performance ANN processing. Additionally, we'll look at some unexpected performance challenges we encountered and how we addressed them.

If you're really just looking for some early performance numbers, here you go: ScyllaDB Vector Search outperforms industry averages in both throughput and latency. Using public VectorDBBench datasets, it sustained up to 65K QPS (P99 < 20ms) on openai_small_50k, and 12K QPS (P99 < 40ms) on laion_large_100m. Across both configurations, tests demonstrate consistently high recall accuracy and predictable latencies, even under extreme concurrency.

Why Vector Search for ScyllaDB?

You might be wondering why we built Vector Search for ScyllaDB. Many vendors offer Vector Search, but we had some unique goals when we started our journey. ScyllaDB's architecture is recognized for its performance. Users have been relying on us for real-time ML, predictive analytics, fraud detection, and other latency-sensitive AI workloads for years. A growing number of users mentioned they were working with third-party Vector Search databases, but found them overly complex (and costly) to manage at scale. So we committed to building integrated low-latency vector search at ScyllaDB scale.

We started with the question: How do we bring ScyllaDB's low latencies and high throughput to something as complex as Vector Search? Most built-in vector solutions sacrifice performance for accuracy or scale. We wanted to deliver all three.

Vector Search Design Decisions and Architecture

Note: The topics in the remainder of this blog will be covered in more detail during P99 CONF, a free + virtual conference on all things performance. Join us live to learn more and ask questions.

Rather than embedding HNSW indexing directly into the core database, we decoupled vector indexing and similarity search into a dedicated Rust engine. ScyllaDB replicas are paired with a local Vector Store node in the same availability zone as the core ScyllaDB database. ScyllaDB nodes store tables with vectors and other data. The Vector Store service builds internal indexes based on the data read from these tables. The Vector Store retrieves data from ScyllaDB using its native CQL protocol and CDC functionality. The client performs a CQL query on ScyllaDB, then ScyllaDB requests the list of neighbors from the Vector Store index using HTTP.

Why did we design it this way?

  • It allows the database and Vector Store nodes to scale independently.
  • Running each component on its own VM lets you fine-tune hardware types: SSTables live on storage-optimized nodes, while vectors benefit from RAM-optimized ones.
  • Traffic remains zone-local, optimizing network transfer costs for intensive workloads.
  • It isolates the performance of regular queries from ANN queries, optimizing latency for both. Real-time ingestion can progress while updates get transparently replicated to the Vector Store for inferencing.

From the user's perspective, clients simply issue ANN queries to ScyllaDB via the CQL API, and ScyllaDB transparently requests the list of neighbors from the Vector Store. The vector type is already supported by ScyllaDB's Java, Rust, C++, Python, and C# drivers; it's coming soon for GoCQL.

Vector Store Architecture

The core of our Vector Store is built on top of the USearch engine. We also use a set of Rust services to interface with ScyllaDB, build vector indexes, and provide search capabilities. The Vector Store service is built around an actor-based architecture, using Rust, Tokio, Axum, and USearch. Its functionality is divided into several actors:

  • "httpd" serves as a REST API endpoint for executing ANN searches.
  • "db" and "db-index" are responsible for communicating with ScyllaDB. Specifically, "db-index" is responsible for building an index upfront when it is created (via a full table scan), as well as consuming CDC streams and forwarding those results to "monitor-items" to update the underlying index. "db" retrieves schema information and handles metadata changes (like DROP'ing an index), thereby ensuring that the underlying Vector Store remains consistent with ScyllaDB.

Communication between actors happens over Tokio channels (queues) using Rust's async/await features. There's also a separate actor type for search functionality. It encapsulates all USearch computations and serves as a foundation for the entire service.

One important note about our current implementation: for optimal performance, the Vector Store keeps all indexes in memory. This means that the entire index needs to fit into a single node's RAM. We're exploring hybrid approaches for future iterations.

Building an Index

We extended ScyllaDB with a CUSTOM INDEX statement, as well as a set of options that the Vector Store service uses to build the index. The Vector Store service first performs a full table scan to build the initial index. After that, the Vector Store index is kept in sync with ScyllaDB via Change Data Capture (CDC). Each write appends an entry to ScyllaDB's CDC log, which the Vector Store service eventually consumes to keep its corresponding index consistent.

A key design choice is that the Vector Store holds only the primary key and its corresponding vector embedding in memory. This greatly reduces the Vector Store's memory requirements. When an ANN query runs (as shown above by the ANN OF syntax with a LIMIT clause), it returns just the list of primary keys back to the ScyllaDB caller. Those keys are then used by ScyllaDB internally to build the ResultSet returned to the caller application.

Testing and Optimizing Performance

Update: Read our latest benchmark: 1B vectors with 2ms P99s and 250K QPS throughput. Read 1B Vector Benchmark

While building low-latency systems is no easy task, building low-latency Vector Stores is an even harder problem. Not surprisingly, we went through quite a few testing + optimization loops before reaching our latency targets for the Early Access program. Our basic testing environment involved a single shared instance in AWS, where we manually pinned CPUs to each process via cgroups. Next, we loaded a small dataset using VectorDBBench and proceeded with testing performance using the same set of parameters through each run.
Even though we used a single instance, we decided to use a replication factor of 3 to simulate the load of a small Cloud cluster. Next, to define our embeddings, we used the ScyllaDB native vector type during table creation. We built an index as described above. Then, we microbenchmarked both CQL ANN OF queries through ScyllaDB. We also benchmarked direct requests to the in-memory Vector Store. Once done, we compared QPS and P99 latency under increasing concurrency levels to identify bottlenecks in our integration layer. Exploring the Latency Penalty of Nagle’s Algorithm Our initial benchmarks against ScyllaDB produced an unexpected result. Even at very low concurrency, we observed latencies around 50ms. More interestingly, latency remained nearly constant as we increased concurrency, indicating that the system wasn’t struggling to handle additional load. The bottleneck had to be elsewhere. When we compared ScyllaDB queries with requests sent directly to the Vector Store, the difference became clear. Vector Store queries returned in single-digit milliseconds and scaled smoothly until around 5K QPS. In contrast, ScyllaDB requests showed much higher P99 latency, which directly reduced throughput. At low concurrency, the gap between the two paths was about 46ms: a clue that pointed to a networking issue. A network capture confirmed it. Linux’s TCP Delayed ACK can wait up to 40ms before sending acknowledgments. Combined with Nagle’s algorithm, which buffers small packets until an ACK arrives, this created a feedback loop that directly inflated ScyllaDB’s latencies. The fix was straightforward: disable Nagle’s algorithm with the TCP_NODELAY socket option. With Nagle disabled, ScyllaDB latencies dropped to nearly match those of direct Vector Store queries. That said, throughput was still lower. While the Vector Store sustained ~5K QPS, ScyllaDB saturated around ~3K QPS. And that led to, of course, more testing and more optimization. Experimenting with Thread Layouts Our tests measuring performance across different thread layouts for our Vector Store service also yielded some interesting results. Each layout implements a different set of asynchronous and synchronous threads. Async threads are provided by the Rust Tokio runtime. They’re primarily used for I/O intensive computation, like networking and actor coordination. Synchronous threads used Rayon to execute CPU-intensive USearch tasks. The image below shows the layouts we implemented. The letter ‘a’ denotes a thread for asynchronous (io-intensive) computation and ‘s’ indicates a thread for synchronous (cpu-intensive) computation. For example, a1s3, stands for one asynchronous thread with three synchronous threads. The initial results below show that the layout using only asynchronous tasks provided the best QPS, at the expense of higher latency in high concurrency tests. The lowest latency was observed when threads weren’t fighting for CPU resources, with one asynchronous task and three synchronous threads. This layout, however, also provided the lowest QPS compared with all other tests. Looking at other variants (below), we can see that while oversubscribing CPUs (a1s4) does improve QPS to some extent, it comes at a significant latency cost. Dedicating one thread per CPU (a1s3) provided lower latency in contrast. Similarly, oversubscribing a single CPU for asynchronous processing also performed better than oversubscribing all CPU cores for both async and synchronous work. See those results below. 
Therefore, the only optimization opportunity we found here was to reduce latency on the asynchronous-only variant. The chart below shows that its latency is lower than the oversubscribed one, but grows at a faster pace under higher concurrency.

So, in summary, we found that:

  • Async only (a4s0) delivered the best QPS, but latencies rose sharply at higher concurrency.
  • Mixed (a1s3) avoided CPU contention, yielding the lowest latencies (but also the lowest QPS).
  • Oversubscribed setups (a1s4, a4s4) gained some throughput (but at the cost of latency).

The key takeaway is that adding sync threads improved latency at the cost of throughput, while async-only favored throughput but suffered under load.

More Latency Optimizations

A closer look at CPU traces revealed why. Each ANN request runs a burst of USearch computation. However, under concurrency, tasks preempt one another. This delays completions and hurts P99 latency. Tokio doesn't offer task prioritization, but we implemented a neat trick: inserting a yield_now before starting USearch computation. This moved new tasks to the back of the queue, giving in-flight requests a chance to finish first. Comparing both approaches side by side (below) shows that our one-line code change provides marginally worse throughput, but big latency wins. As you can see below, the asynchronous-only, yield layout also drives even lower latency than the previous oversubscribed setup. Moreover, the graph below shows that it still drives higher QPS and now lower latencies than the mixed non-oversubscribed layout. It's quite fascinating what a single line of code can do these days…

Scaling with ScyllaDB Cloud

Finally, we turned to ScyllaDB Cloud environments to test scaling. On the R7i.xlarge, we started by replicating the same tests that we ran in our previous single-node setup. Here, each ANN query retrieves the 100 most similar neighbors. This is quite a compute-intensive operation, often used for re-ranking scenarios. We achieved the same 5K QPS with single-digit millisecond latencies under moderate concurrency, while we approached the saturation point somewhere close to a concurrency of 80.

Using R7i.8xlarge instances, we scaled our setup by 4X: going from 4 vCPUs to 16 vCPUs per node. Here, we ran two series of tests. For the 100 most similar neighbors, throughput saturates between 13 and 14K QPS, while latency remains below 5ms under low concurrency and reaches up to 20ms under a concurrency of 100. For the 10 most similar neighbors, throughput saturates at 20K QPS, with single-digit millisecond latencies even under a concurrency of 100.

Large-Scale Performance Test

Our final test involved scaling the Vector Store nodes to 64 CPUs per node. Our goal here was to get enough memory to run a larger dataset with 100M embeddings at 768 dimensions. This scale is rarely published by other vector search providers, and it still leaves plenty of headroom for even larger datasets. With 100M embeddings, we reached 12K QPS with P99 latency ranging from 20ms at low concurrency to 40ms at a concurrency of 200, while maintaining over 97% recall. For comparison, the smaller dataset reached around 65K QPS for k=10 while keeping latencies steadily low even under extreme concurrency. Of course, your mileage may vary. Our tests ran on static datasets, and real-world workloads may behave differently. Still, the trajectory is promising, and we're continuing to push towards linear scaling.
Next Steps

ScyllaDB Vector Search was built for users with real-time workload needs; our architecture isolates similarity function computation from the database and abstracts complexity for the user. This blog has outlined some of the design decisions, testing, and optimization involved in achieving those performance goals. We're excited about the results of these early performance tests, and we hope you are too. We're eager to hear our community's feedback. Give it a try, share your feedback, and help shape the future of this product. See the Quick Start Guide and give it a try. Contact us with your questions, or for a personalized tour.

Inside the Database Internals Talks at P99 CONF 2025

“Never write a database. Even if you want to, even if you think you should. Resist. Never write a database. Unless you have to write a database. But you don’t.” – Charity Majors
But someone has to write the databases that others rely on. Hearing about the engineering challenges they're tackling is both fascinating and Schadenfreude-invoking – so perfect tech conference material. 😉 Since database performance is so near and dear to ScyllaDB, we reached out to our friends and colleagues across the community to ensure that a nice range of distributed data systems, approaches, and challenges would be represented at P99 CONF 2025. As you can see from our agenda, the response was overwhelming.

A quick PSA for the uninitiated: P99 CONF is a free 2-day community event that's intentionally virtual, highly interactive, and purely technical. It's an immersion into all things performance. Distributed systems, database internals, Rust, C++, Java, Go, Wasm, Zig, Linux kernel, tracing, AI/ML & more – it's all on the agenda. This year, you can look forward to first-hand engineering experiences from the likes of Pinterest, ClickHouse, Gemini, Arm, Rivian and VW Group Technology, Meta, Wayfair, Disney, NVIDIA, Turso, Neon, TigerBeetle, ScyllaDB, and too many others to list here.

Here's a sneak peek of the database internals talks you can look forward to at P99 CONF 2025… Join us at P99 CONF (free + virtual)

ClickHouse's C++ and Rust Journey
Alexey Milovidov, Co-founder and CTO at ClickHouse
Full rewrite from C++ to Rust or gradual integration with Rust libraries? For a large C++ codebase, only the latter works, but even then, there are many complications and rough edges. In my presentation, I will describe our experience integrating Rust and C++ code and some weird and unusual problems we had to overcome.

Rethinking Durable Workflows and Queues: A Library-based Approach
Qian Li, Co-founder at DBOS, Inc
Durable workflow engines checkpoint program state to persistent storage (like a database) so that execution can always recover from where it left off. Most systems today rely on external orchestration: a centralized orchestrator and distributed workers communicating via message-passing. While this model is well-established, it's often heavyweight, introducing substantial overhead, write amplification, and operational complexity. In this talk, we explore an alternative: a lightweight library-based durable workflow engine that embeds into application code and checkpoints state directly to the database. It handles queues and flow control through the database itself. This approach eliminates the need for a separate orchestrator, reduces network traffic, and improves performance by avoiding unnecessary writes. We'll share our experience building DBOS, a library-based engine designed for simplicity and efficiency. We'll discuss the architectural trade-offs, challenges in failure recovery, and key optimizations for scalability and maintainability.

The Gory Details of a Full-Featured Userspace CPU Scheduler
Avi Kivity, Co-founder and CTO at ScyllaDB
Userspace CPU schedulers, which often accompany asynchronous I/O engines like io_uring and Linux AIO, are usually simplistic run-to-completion FIFO loops. This suffices for I/O-bound applications, but for use cases that can be both CPU-bound and I/O-bound, this is not enough. Avi Kivity, CTO of ScyllaDB and co-maintainer of Seastar, will cover the design and implementation of the Seastar userspace CPU scheduler, which caters to more complex applications that require preemption and prioritization.
The Tale of Taming TigerBeetle's Tail Latency
Tobias Ziegler, Software Engineer at TigerBeetle
In this talk, we dive into how we reduced TigerBeetle's tail latency through algorithm engineering. Algorithm engineering goes beyond studying theoretical complexity and considers how algorithms are executed efficiently on modern super-scalar CPUs. Specifically, we will look at Radix Sort and a k-way merge and explore how to implement them efficiently. We then demonstrate how we apply these algorithms incrementally to avoid latency spikes in practice.

Why We're Rewriting SQLite in Rust
Glauber Costa, Co-founder and CEO at Turso
Over two years ago, we forked SQLite. We were huge fans of the embedded nature of SQLite, but wanted a more open model of development…and libSQL was born as an Open Contribution project. Last year, as we were adding Vector Search to SQLite, we had a crazy idea. What could we achieve if we were to completely rewrite SQLite in Rust? This talk explains what drove us down this path, how we're using deterministic simulation testing to ensure the reliability of the Rust rewrite, and the lessons learned (so far). I will show how a reimagining of this iconic database can lead to performance improvements of over 500x in some cases by looking at what powers it under the hood.

Shared Nothing Databases at Scale
Nick Van Wiggeren, CTO at PlanetScale
This talk will discuss how PlanetScale scales databases in the cloud, focusing on a shared-nothing architecture that is built around expecting failure. Nick will go into how they built low-latency, high-throughput systems that span multiple nodes, availability zones, and regions, while maintaining sub-millisecond response times. This starts at the storage layer and builds all the way up to micro-optimizing the load balancer, with a lot of learning at every step of the way.

Reworking the Neon IO stack: Rust+tokio+io_uring+O_DIRECT
Christian Schwarz, Member of Technical Staff at Databricks
Neon is a serverless Postgres platform. Recently acquired by Databricks, the same technology now also powers Databricks Lakebase. In this talk, we will dive into Pageserver, the multi-tenant storage service at the heart of the architecture. We share techniques and lessons learned from reworking its IO stack to a fully asynchronous model, with direct IO against local NVMe drives; all during a period of rapid growth. Pageserver is implemented in Rust; we use the tokio async runtime for networking and integrate it with io_uring for filesystem access.

A Deep Dive into the Seastar Event Loop
Pavel Emelyanov, Principal Software Engineer at ScyllaDB
The core and the basis of ScyllaDB's outstanding performance is the Seastar framework, and the core and the basis of Seastar is its event loop. In this presentation, we'll see what the loop does in great detail, analyze the limitations that it runs under and all the consequences that follow from those limitations. We'll also learn how the loop is observed by the user and various means to understand its behavior.

Cost Effective, Low Latency Vector Search In Databases: A Case Study with Azure Cosmos DB
Magdalen Manohar, Senior Researcher at Microsoft
We've integrated DiskANN, a state-of-the-art vector indexing algorithm, into Azure Cosmos DB NoSQL, a state-of-the-art cloud-native operational database. Learn how we overcame the systems and algorithmic challenges of this integration to achieve <20ms query latency at the 10 million scale, while supporting scale-out to billions of vectors via automatic partitioning.
Measuring Query Latency the Hard Way: An Adventure in Impractical Postgres Monitoring
Simon Notley, Observability and Optimization at EnterpriseDB
Sampling the session state (as exposed by pg_stat_activity) is a surprisingly powerful way to understand how your Postgres instance spends its time. It is something I can wholeheartedly recommend to any Postgres DBA who needs a lightweight way to monitor query performance in production. However, it's a terrible way to measure query latency, fraught with complexity and weird statistical biases that could be avoided by simply using an extension built for the job, or even log analysis. But pursuing terrible ideas can be fun, so in this talk, I dive into my adventures in measuring query latency from session sampling, generate some extremely funky charts, and end up unexpectedly performing a vector similarity search. I'll show how, instead of attempting to correct the biases that plague estimates of query latency based on time-domain sampling, we can pre-calculate the distribution of (biased) estimates based on a range of true distributions and use vector search to compare our observed distribution to these pre-calculated ones, thereby inferring the true query latency. This 'eccentric' method is actually surprisingly effective, and surprisingly fun.

Fast and Deterministic Full Table Scans at Scale
Felipe Cardeneti Mendes, Technical Director at ScyllaDB
ScyllaDB's new tablet replication algorithm replaces static vNodes with dynamic, elastic data distribution that adapts to shifting workloads. This talk discusses how tablets enable fast, predictable full table scans by keeping operations shard-local, balancing load automatically, and scaling linearly through a simple layer of indirection.

Optimizing Tiered Storage for Low-Latency Real-Time Analytics
Neha Pawar, Founding Engineer and Head of Data at StarTree
Real-time OLAP databases usually trade performance for cost when moving from local storage to cloud object storage. This talk shows how we extended Apache Pinot to use cloud storage while still achieving sub-second P99 latencies. We'll cover the abstraction that makes Pinot location-agnostic, strategies like pipelining, prefetching, and selective block fetches, and how to balance local and cloud storage for both cost efficiency and speed.

As Fast as Possible, But Not Faster: ScyllaDB Flow Control
Nadav Har'El, Distinguished Engineer at ScyllaDB
Pushing requests faster than a system can handle results in rapidly growing queues. If unchecked, this risks depleting memory and undermining system stability. This talk discusses how we engineered ScyllaDB's flow control for high-volume ingestion, allowing it to throttle over-eager clients to exactly the right pace – not so fast that we run out of memory, but also not so slow that we let available resources go to waste.

Push the Database Beyond the Edge
Nikita Sivukhin, Software Engineer at Turso
Almost any application can benefit from having data available locally – enabling blazing-fast access and optimized write patterns. This talk will walk you through one approach to designing a full-featured sync engine, applicable across a wide range of domains, including front-end, back-end, and machine learning training.

Engineering a Low-Latency Vector Search Engine for ScyllaDB
Pawel Pery, Senior Software Engineer at ScyllaDB
Implementing Vector Search in ScyllaDB brings challenges ranging from low latency to predictable performance at scale.
Rather than embedding HNSW indexing directly into the core database, we decoupled vector indexing and similarity search into a dedicated Rust engine. Learn about the architectural design decisions that enabled us to combine ScyllaDB's shard-per-core architecture for real-time operations with high-performance ANN processing via USearch.

We Told B+ Trees to Do Sorted Sets—They Nailed It
Joe Zhou, Developer Advocate at DragonflyDB
Sorted sets are a critical Redis data type used for leaderboards, time-series data, and priority queues. However, Redis's skiplist-based implementation introduces significant memory overhead: an average of 37 bytes per entry on top of the essential 16 bytes for the (member, score) pair. For large sorted sets, this inefficiency can become a major bottleneck. In this talk, we'll explore how Dragonfly reimplemented sorted sets using a B+ tree, reducing memory overhead to just 2-3 bytes per entry while improving performance. We'll cover:

  • Why skiplists are inefficient for large sorted sets.
  • How B+ trees with bucketing drastically cut memory usage while maintaining O(log N) operations.
  • Benchmark results showing 40% lower memory and better throughput vs. Redis.

This optimization, now stable in Dragonfly, demonstrates how rethinking core data structures can unlock major efficiency gains. Attendees will leave with insights into:

  • Trade-offs between skiplists and B+ trees.
  • Real-world impact on memory and latency (P99 improvements).
  • Lessons from implementing a custom ranking API for B+ trees.

Keynote: Andy Pavlo
You can also look forward to a keynote by Andy Pavlo. We're not revealing the topic yet, but if you know Andy, you know you won't want to miss it.

Join us at P99 CONF (free + virtual)

Building a Resilient Data Platform with Write-Ahead Log at Netflix

By Prudhviraj Karumanchi, Samuel Fu, Sriram Rangarajan, Vidhya Arvind, Yun Wang, John Lu

Introduction

Netflix operates at a massive scale, serving hundreds of millions of users with diverse content and features. Behind the scenes, ensuring data consistency, reliability, and efficient operations across various services presents a continuous challenge. At the heart of many critical functions lies the concept of a Write-Ahead Log (WAL) abstraction. At Netflix scale, every challenge gets amplified. Some of the key challenges we encountered include:

  • Accidental data loss and data corruption in databases
  • System entropy across different datastores (e.g., writing to Cassandra and Elasticsearch)
  • Handling updates to multiple partitions (e.g., building secondary indices on top of a NoSQL database)
  • Data replication (in-region and across regions)
  • Reliable retry mechanisms for real-time data pipelines at scale
  • Bulk deletes to database causing OOM on the Key-Value nodes

All the above challenges either resulted in production incidents or outages, consumed significant engineering resources, or led to bespoke solutions and technical debt. During one particular incident, a developer issued an ALTER TABLE command that led to data corruption. Fortunately, the data was fronted by a cache, so quickly extending the cache TTL, combined with the application writing its mutations to Kafka, allowed us to recover. Without those resilience features in the application, there would have been permanent data loss. As the data platform team, we needed to provide resilience and guarantees to protect not just this application, but all the critical applications we have at Netflix.

Regarding the retry mechanisms for real time data pipelines, Netflix operates at a massive scale where failures (network errors, downstream service outages, etc.) are inevitable. We needed a reliable and scalable way to retry failed messages, without sacrificing throughput.

With these problems in mind, we decided to build a system that would solve all the aforementioned issues and continue to serve the future needs of Netflix in the online data platform space. Our Write-Ahead Log (WAL) is a distributed system that captures data changes, provides strong durability guarantees, and reliably delivers these changes to downstream consumers. This blog post dives into how Netflix is building a generic WAL solution to address common data challenges, enhance developer efficiency, and power high-leverage capabilities such as secondary indices, cross-region replication for non-replicated storage engines, and widely used patterns like delayed queues.

API

Our API is intentionally simple, exposing just the essential parameters. WAL has one main API endpoint, WriteToLog, abstracting away the internal implementation and ensuring that users can onboard easily.

rpc WriteToLog (WriteToLogRequest) returns (WriteToLogResponse) {...}

/**
* WAL request message
* namespace: Identifier for a particular WAL
* lifecycle: How much delay to set and original write time
* payload: Payload of the message
* target: Details of where to send the payload
*/
message WriteToLogRequest {
string namespace = 1;
Lifecycle lifecycle = 2;
bytes payload = 3;
Target target = 4;
}

/**
* WAL response message
* durable: Whether the request succeeded, failed, or unknown
* message: Reason for failure
*/
message WriteToLogResponse {
Trilean durable = 1;
string message = 2;
}

A namespace defines where and how data is stored, providing logical separation while abstracting the underlying storage systems. Each namespace can be configured to use different queues: Kafka, SQS, or a combination of several. The namespace also serves as a central configuration point for settings such as the backoff multiplier and the maximum number of retry attempts. This flexibility allows our Data Platform to route different use cases to the most suitable storage system based on performance, durability, and consistency needs.
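
To make the API shape concrete, here is a rough, hypothetical client sketch in Python. The generated gRPC stub and message modules (wal_pb2, wal_pb2_grpc, WALStub) and the endpoint are our own placeholders, not Netflix's actual packaging:

import grpc

# Hypothetical generated modules for the WAL protos shown above.
import wal_pb2
import wal_pb2_grpc

channel = grpc.insecure_channel("wal.example.internal:8980")  # placeholder endpoint
stub = wal_pb2_grpc.WALStub(channel)                          # hypothetical service stub

request = wal_pb2.WriteToLogRequest(
    namespace="pds",                   # selects the queue type, retry policy, and targets
    payload=b'{"event": "retry-me"}',
    # The lifecycle (delay, original write time) and target (delivery destination)
    # fields are omitted here because their exact sub-fields aren't shown in the post.
)

response = stub.WriteToLog(request)
# durable is a trilean: the write succeeded, failed, or its outcome is unknown.
print(response.durable, response.message)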

WAL can assume different personas depending on the namespace configuration.

Persona #1 (Delayed Queues)

In the example configuration below, the Product Data Systems (PDS) namespace uses SQS as the underlying message queue, enabling delayed messages. PDS uses Kafka extensively, and failures (network errors, downstream service outages, etc.) are inevitable. We needed a reliable and scalable way to retry failed messages, without sacrificing throughput. That’s when PDS started leveraging WAL for delayed messages.

"persistenceConfigurations": {
"persistenceConfiguration": [
{
"physicalStorage": {
"type": "SQS",
},
"config": {
"wal-queue": [
"dgwwal-dq-pds"
],
"wal-dlq-queue": [
"dgwwal-dlq-pds"
],
"queue.poll-interval.secs": 10,
"queue.max-messages-per-poll": 100
}
}
]
}

Persona #2 (Generic Cross-Region Replication)

Below is the namespace configuration for cross-region replication of EVCache using WAL, which replicates messages from a source region to multiple destinations. It uses Kafka under the hood.

"persistence_configurations": {
"persistence_configuration": [
{
"physical_storage": {
"type": "KAFKA"
},
"config": {
"consumer_stack": "consumer",
"context": "This is for cross region replication for evcache_foobar",
"target": {
"euwest1": "dgwwal.foobar.cluster.eu-west-1.netflix.net",
"type": "evc-replication",
"useast1": "dgwwal.foobar.cluster.us-east-1.netflix.net",
"useast2": "dgwwal.foobar.cluster.us-east-2.netflix.net",
"uswest2": "dgwwal.foobar.cluster.us-west-2.netflix.net"
},
"wal-kafka-dlq-topics": [],
"wal-kafka-topics": [
"evcache_foobar"
],
"wal.kafka.bootstrap.servers.prefix": "kafka-foobar"
}
}
]
}

Persona #3 (Handling multi-partition mutations)

Below is the namespace configuration for supporting the mutateItems API in Key-Value, where multiple write requests can go to different partitions and have to be eventually consistent. A key detail in the configuration below is the presence of Kafka and durable_storage. These data stores are required to facilitate two-phase commit semantics, which we will discuss in detail below.

"persistence_configurations": {
"persistence_configuration": [
{
"physical_storage": {
"type": "KAFKA"
},
"config": {
"consumer_stack": "consumer",
"contacts": "unknown",
"context": "WAL to support multi-id/namespace mutations for dgwkv.foobar",
"durable_storage": {
"namespace": "foobar_wal_type",
"shard": "walfoobar",
"type": "kv"
},
"target": {},
"wal-kafka-dlq-topics": [
"foobar_kv_multi_id-dlq"
],
"wal-kafka-topics": [
"foobar_kv_multi_id"
],
"wal.kafka.bootstrap.servers.prefix": "kaas_kafka-dgwwal_foobar7102"
}
}
]
}

An important note is that requests to WAL are handled with at-least-once semantics due to the underlying implementation.
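
Because delivery is at-least-once, downstream consumers generally need to tolerate duplicate deliveries. A generic illustration of duplicate-tolerant handling keyed on a message ID (our sketch, not WAL's actual consumer logic):

# Generic sketch of duplicate-tolerant processing under at-least-once delivery.
processed_ids = set()   # in practice this would be a durable store, likely with a TTL

def apply_mutation(payload: bytes) -> None:
    # Placeholder for the side effect against the target datastore.
    print("applying", payload)

def handle(message_id: str, payload: bytes) -> None:
    if message_id in processed_ids:
        return                        # duplicate delivery -- safe to skip
    apply_mutation(payload)
    processed_ids.add(message_id)     # record only after the side effect succeeds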

Under the Hood

The core architecture consists of several key components working together.

Message Producer and Message Consumer separation: The message producer receives incoming messages from client applications and adds them into the queue, while the message consumer processes messages from the queue and sends them to the targets. Because of this separation, other systems can bring their own pluggable producers or consumers, depending on their use cases. WAL’s control plane allows for a pluggable model, which, depending on the use-case, allows us to switch between different message queues.

SQS and Kafka with a dead letter queue by default: Every WAL namespace has its own message queue and gets a dead letter queue (DLQ) by default, because there can be transient errors and hard errors. Application teams using Key-Value abstraction simply need to toggle a flag to enable WAL and get all this functionality without needing to understand the underlying complexity.

  • Kafka-backed namespaces: handle standard message processing
  • SQS-backed namespaces: support delayed queue semantics (we added custom logic to go beyond the standard limits SQS enforces on delay, message size, and so on; one common approach is sketched after this list)
  • Complex multi-partition scenarios: use queues and durable storage
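
As an illustration of the kind of custom logic mentioned above, here is one common way (not necessarily what WAL does) to support delays longer than SQS's 15-minute per-message maximum: carry the target delivery time in the message and re-enqueue with the maximum delay until that time is reached. The function and queue names are hypothetical; the SQS calls are standard boto3.

import json
import time

SQS_MAX_DELAY_SECS = 900  # SQS caps per-message delay at 15 minutes

def enqueue_with_delay(sqs_client, queue_url: str, payload: dict, delay_secs: int) -> None:
    deliver_at = time.time() + delay_secs
    sqs_client.send_message(
        QueueUrl=queue_url,
        MessageBody=json.dumps({"deliver_at": deliver_at, "payload": payload}),
        DelaySeconds=min(delay_secs, SQS_MAX_DELAY_SECS),
    )

def on_message(sqs_client, queue_url: str, body: str) -> None:
    message = json.loads(body)
    remaining = message["deliver_at"] - time.time()
    if remaining > 0:
        # Not due yet: push it back with up to the maximum delay again.
        sqs_client.send_message(
            QueueUrl=queue_url,
            MessageBody=body,
            DelaySeconds=int(min(remaining, SQS_MAX_DELAY_SECS)),
        )
    else:
        deliver(message["payload"])   # hypothetical delivery to the target

def deliver(payload: dict) -> None:
    print("delivering", payload)
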

Target Flexibility: The messages added to WAL are pushed to the target datastores. Targets can be Cassandra databases, Memcached caches, Kafka queues, or upstream applications. Users can specify the target via namespace configuration and in the API itself.

Architecture of WAL

Deployment Model

WAL is deployed using the Data Gateway infrastructure. This means that WAL deployments automatically come with mTLS, connection management, authentication, runtime and deployment configurations out of the box.

Each data gateway abstraction (including WAL) is deployed as a shard. A shard is a physical concept describing a group of hardware instances. Each use case of WAL is usually deployed as a separate shard. For example, the Ads Events service will send requests to WAL shard A, while the Gaming Catalog service will send requests to WAL shard B, allowing for separation of concerns and avoiding noisy neighbour problems.

Each shard of WAL can have multiple namespaces. A namespace is a logical concept describing a configuration. Each request to WAL has to specify its namespace so that WAL can apply the correct configuration to the request. Each namespace has its own configuration of queues to ensure isolation per use case. If the underlying queue of a WAL namespace becomes the bottleneck of throughput, the operators can choose to add more queues on the fly by modifying the namespace configurations. The concept of shards and namespaces is shared across all Data Gateway Abstractions, including Key-Value, Counter, Timeseries, etc. The namespace configurations are stored in a globally replicated Relational SQL database to ensure availability and consistency.

Deployment model of WAL

Based on certain CPU and network thresholds, the Producer group and the Consumer group of each shard will (separately) automatically scale up the number of instances to ensure the service has low latency, high throughput and high availability. WAL, along with other abstractions, also uses the Netflix adaptive load shedding libraries and Envoy to automatically shed requests beyond a certain limit. WAL can be deployed to multiple regions, so each region will deploy its own group of instances.

Solving different flavors of problems with no change to the core architecture

The WAL addresses multiple data reliability challenges with no changes to the core architecture:

Data Loss Prevention: In case of database downtime, WAL can continue to hold the incoming mutations. When the database becomes available again, WAL replays those mutations back to the database. The tradeoff is eventual consistency rather than immediate consistency, but with no data loss.

Generic Data Replication: For systems like EVCache (using Memcached) and RocksDB that do not support replication by default, WAL provides systematic replication (both in-region and across-region). The target can be another application, another WAL, or another queue — it’s completely pluggable through configuration.

System Entropy and Multi-Partition Solutions: Whether dealing with writes across two databases (like Cassandra and Elasticsearch) or mutations across multiple partitions in one database, the solution is the same — write to WAL first, then let the WAL consumer handle the mutations. No more asynchronous repairs needed; WAL handles retries and backoff automatically.

Data Corruption Recovery: In case of database corruption, restore to the last known good backup, then replay the mutations from WAL, omitting the offending write or mutation.

There are some major differences between using WAL and directly using Kafka/SQS. WAL is an abstraction on the underlying queues, so the underlying technology can be swapped out depending on use cases with no code changes. WAL emphasizes an easy yet effective API that saves users from complicated setups and configurations. We leverage the control plane to pivot technologies behind WAL when needed without app or client intervention.

WAL usage at Netflix

Delay Queue

The most common use case for WAL is as a Delay Queue. If an application is interested in sending a request at a certain time in the future, it can offload its requests to WAL, which guarantees that their requests will land after the specified delay.

Netflix’s Live Origin processes and delivers Netflix live stream video chunks, storing its video data in a Key-Value abstraction backed by Cassandra and EVCache. When Live Origin decides to delete certain video data after an event is completed, it issues delete requests to the Key-Value abstraction. However, the large volume of delete requests in a short burst interferes with the more important real-time read/write requests, causing performance issues in Cassandra and timeouts for the incoming live traffic. To get around this, Key-Value issues the delete requests to WAL first, with a random delay and jitter set for each delete request. WAL, after the delay, sends the delete requests back to Key-Value. Since the deletes now form a flatter curve of requests over time, Key-Value is able to send the requests to the datastore with no issues.

Requests being spread out over time through delayed requests
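
The delay-and-jitter approach above can be sketched in a few lines of Python. This is our own illustration, not the actual Key-Value code; the namespace, delay values, and WriteToLog wrapper are hypothetical:

import random

def schedule_deletes(keys, base_delay_secs: int = 300, jitter_secs: int = 600) -> None:
    # Each delete gets a base delay plus random jitter, flattening the burst
    # into a gentler curve of requests over time.
    for key in keys:
        delay = base_delay_secs + random.uniform(0, jitter_secs)
        write_to_log(namespace="live-origin-deletes",   # hypothetical namespace
                     payload=key.encode(),
                     delay_secs=delay)

def write_to_log(namespace: str, payload: bytes, delay_secs: float) -> None:
    # Placeholder for the WriteToLog call shown earlier in the API section.
    print(f"enqueue delete {payload!r} with {delay_secs:.0f}s delay in {namespace}")

schedule_deletes(["video-chunk-1", "video-chunk-2", "video-chunk-3"])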

Additionally, WAL is used by many services that utilize Kafka to stream events, including Ads, Gaming, Product Data Systems, etc. Whenever Kafka requests fail for any reason, the client apps send WAL a request to retry the Kafka request with a delay. This abstracts away the backoff and retry layer of Kafka for many teams, increasing developer efficiency.

Backoff and delayed retries for clients producing to Kafka
Backoff and delayed retries for clients consuming from Kafka

Cross-Region Replication

WAL is also used for global cross-region replication. The architecture of WAL is generic and allows any datastore/applications to onboard for cross-region replication. Currently, the largest use case is EVCache, and we are working to onboard other storage engines.

EVCache is deployed by clusters of Memcached instances across multiple regions, where each cluster in each region shares the same data. Each region’s client apps will write, read, or delete data from the EVCache cluster of the same region. To ensure global consistency, the EVCache client of one region will replicate write and delete requests to all other regions. To implement this, the EVCache client that originated the request will send the request to a WAL corresponding to the EVCache cluster and region.

Since the EVCache client acts as the message producer group in this case, WAL only needs to deploy the message consumer groups. From there, the multiple message consumers are set up to each target region. They will read from the Kafka topic, and send the replicated write or delete requests to a Writer group in their target region. The Writer group will then go ahead and replicate the request to the EVCache server in the same region.

EVCache Global Cross-Region Replication Implemented through WAL

The biggest benefit of this approach, compared to our legacy architecture, is being able to migrate from a multi-tenant architecture to a single-tenant architecture for the most latency-sensitive applications. For example, Live Origin will have its own dedicated Message Consumer and Writer groups, while a less latency-sensitive service can be multi-tenant. This helps us reduce the blast radius of issues and also prevents noisy neighbor problems.

Multi-Table Mutations

WAL is used by Key-Value service to build the MutateItems API. WAL enables the API’s multi-table and multi-id mutations by implementing 2-phase commit semantics under the hood. For this discussion, we can assume that Key-Value service is backed by Cassandra, and each of its namespaces represents a certain table in a Cassandra DB.

When a Key-Value client issues a MutateItems request to Key-Value server, the request can contain multiple PutItems or DeleteItems requests. Each of those requests can go to different ids and namespaces, or Cassandra tables.

message MutateItemsRequest {
repeated MutationRequest mutations = 1;
message MutationRequest {
oneof mutation {
PutItemsRequest put = 1;
DeleteItemsRequest delete = 2;
}
}
}

The MutateItems request operates on an eventually consistent model. When the Key-Value server returns a success response, it guarantees that every operation within the MutateItemsRequest will eventually complete successfully. Individual put or delete operations may be partitioned into smaller chunks based on request size, meaning a single operation could spawn multiple chunk requests that must be processed in a specific sequence.

Two approaches exist to ensure Key-Value client requests achieve success. The synchronous approach involves client-side retries until all mutations complete. However, this method introduces significant challenges: datastores might not natively support transactions, and it provides no guarantees about the entire request succeeding. Additionally, when more than one replica set is involved in a request, latency can spike in unexpected ways, and the entire request chain must be retried. Also, partial failures in synchronous processing can leave the database in an inconsistent state if some mutations succeed while others fail, requiring complex rollback mechanisms or leaving data integrity compromised. The asynchronous approach was ultimately adopted to address these performance and consistency concerns.

Given Key-Value’s stateless architecture, the service cannot maintain the mutation success state or guarantee order internally. Instead, it leverages a Write-Ahead Log (WAL) to guarantee mutation completion. For each MutateItems request, Key-Value forwards individual put or delete operations to WAL as they arrive, with each operation tagged with a sequence number to preserve ordering. After transmitting all mutations, Key-Value sends a completion marker indicating the full request has been submitted.

The WAL producer receives these messages and persists the content, state, and ordering information to durable storage. The message producer then forwards only the completion marker to the message queue. The message consumer retrieves these markers from the queue and reconstructs the complete mutation set by reading the stored state and content data, ordering operations according to their designated sequence. Failed mutations trigger re-queuing of the completion marker for subsequent retry attempts.

Architecture of Multi-Table Mutations through WAL
Sequence diagram for Multi-Table Mutations through WAL
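
To make the flow above concrete, here is a rough consumer-side sketch in Python. It is our illustration, not the actual WAL implementation; the store, queue, and helper names are hypothetical:

def apply_to_partition(mutation) -> None:
    # Placeholder for issuing the put/delete chunk against its target table.
    print("applying", mutation)

def on_completion_marker(marker, durable_store, queue) -> None:
    # Reconstruct the full mutation set for this request from durable storage,
    # ordered by the sequence numbers assigned when the request was submitted.
    mutations = sorted(
        durable_store.read_mutations(marker.request_id),
        key=lambda m: m.sequence_number,
    )
    try:
        for mutation in mutations:
            apply_to_partition(mutation)
    except Exception:
        # Any failure re-queues the completion marker for a later retry attempt;
        # with at-least-once delivery, applies must tolerate being repeated.
        queue.requeue(marker)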

Closing Thoughts

Building Netflix’s generic Write-Ahead Log system has taught us several key lessons that guided our design decisions:

Pluggable Architecture is Core: The ability to support different targets, whether databases, caches, queues, or upstream applications, through configuration rather than code changes has been fundamental to WAL’s success across diverse use cases.

Leverage Existing Building Blocks: We had control plane infrastructure, Key-Value abstractions, and other components already in place. Building on top of these existing abstractions allowed us to focus on the unique challenges WAL needed to solve.

Separation of Concerns Enables Scale: By separating message production from message consumption and allowing independent scaling of each component, we can handle traffic surges and failures more gracefully.

Systems Fail — Consider Tradeoffs Carefully: WAL itself has failure modes, including traffic surges, slow consumers, and non-transient errors. We use abstractions and operational strategies like data partitioning and backpressure signals to handle these, but the tradeoffs must be understood.

Future work

  • We are planning to add secondary indices in Key-Value service leveraging WAL.
  • WAL can also be used by a service to guarantee sending requests to multiple datastores at the same time, for example a database and a backup, or a database and a queue.

Acknowledgements

Launching WAL was a collaborative effort involving multiple teams at Netflix, and we are grateful to everyone who contributed to making this idea a reality. We would like to thank the following teams for their roles in this launch.

  • Caching team — Additional thanks to Shih-Hao Yeh, Akashdeep Goel for contributing to cross region replication for KV, EVCache etc. and owning this service.
  • Product Data System team — Carlos Matias Herrero, Brandon Bremen for contributing to the delay queue design and being early adopters of WAL giving valuable feedback.
  • KeyValue and Composite abstractions team — Raj Ummadisetty for feedback on API design and mutateItems design discussions. Rajiv Shringi for feedback on API design.
  • Kafka and Real Time Data Infrastructure teams — Nick Mahilani for feedback and inputs on integrating the WAL client into Kafka client. Sundaram Ananthanarayan for design discussions around the possibility of leveraging Flink for some of the WAL use cases.
  • Joseph Lynch for providing strategic direction and organizational support for this project.

Building a Resilient Data Platform with Write-Ahead Log at Netflix was originally published in the Netflix TechBlog on Medium.

ScyllaDB X Cloud: An Inside Look with Avi Kivity (Part 3)

ScyllaDB’s co-founder/CTO discusses decisions to increase efficiency for storage-bound workloads and allow deployment on mixed size clusters To get the engineering perspective on the recent shifts to ScyllaDB’s architecture, Tim Koopmans recently caught up with ScyllaDB Co-Founder and CTO Avi Kivity. In part 1 of this 3-part series, they talked about the motivations and architectural shifts behind ScyllaDB X Cloud, particularly with respect to Raft and tablets-based data distribution. In part 2, they went deeper into how tablets work, then looked at the design of ScyllaDB X Cloud’s autoscaling. In this final part of the series, they discuss changes that increase efficiency for storage-bound workloads and allow deployment on mixed size clusters. You can watch the complete video here. Storage-bound workloads and compression Tim: Let’s switch gears and talk about compression. This was a way to double-down on storage-bound workloads, right? Would you say storage-bound workloads are more common than CPU bound ones? Is that what’s driving this? Avi: Yes, and there’s two reasons for that. One reason is that our CPU efficiency is quite high. If you’re CPU-efficient, then storage is going to dominate. And the other reason is that when your database grows – say it’s twice as large as before – it’s rare that you actually have twice the amount of work. It can happen. But for many workloads, the growth of data is mostly historical, so the number of ops doesn’t scale linearly with the size of the database. As the database grows, the ratio of ops to storage decreases, and it becomes storage-bound. So, many of our larger workloads are storage-bound. The small and medium ones can be either storage-bound or CPU-bound…it really depends on the workload. We have some workloads where most of the storage in the cluster isn’t used because they’re so CPU-intensive. And we have others where the CPU is mostly idle, but the cluster is holding a lot of storage. We try to cater to all of these workloads. Tim: So a storage-bound workload is likely to have lower CPU utilization in general, and that gives you more CPU bandwidth to do things like more advanced compression? What’s the default compression, in terms of planning for storage? Is it like 50%? Or what’s the typical rate? Or is the real answer just “it depends”? Avi: “It depends” is an easy escape, but the truth is there’s a wide variety of storage options now. A recent addition is dictionary-based compression. That’s where the cluster periodically samples data on disk and constructs a dictionary from those samples. That dictionary is then used to boost compression. Everyone probably knows dictionary compression: it finds repetitive byte sequences in the data and matches against them. By having samples, you can match against the samples and gain higher compression. We recently started rolling it out, and it does give a nice improvement. Of course, it varies widely. Some people store data that’s already compressed, so it won’t compress further. Others store data like JSON, which compresses very well. In those cases, we might see above 50% compression ratios. And for many storage-bound workloads, you can set the compression parameters higher and gain more compression at the expense of CPU…but it’s CPU that you already have. Tim: Is there anything else on the compression roadmap, like column aware compression? Avi:  It’s not on the roadmap yet, but we will do columnar storage for time series and data. But there’s no timeline for that yet. 
Tim: Any hardware-accelerated stuff? Avi: We looked at hardware acceleration, but it's too rare to really matter. One problem is that on the cloud, it's only available with the very largest instance sizes. And while we do have clusters with large instance sizes, it's not enough to justify the work. I'm talking about machines with 96 vCPUs and 60TB of storage per node. It would only make sense for the very largest clusters, the petabyte-class clusters. They do exist, but they're not yet common enough to make it worth the effort. On smaller instances, the accelerators are just hidden by virtualization. The other problem with hardware-accelerated compression is that it doesn't keep up with the advances in software compression. That's a general problem with hardware. For example, dictionary compression isn't supported by those accelerators, but dictionary compression is very useful. We wouldn't want to give that up. Tim: Yeah, it seems like unless there's a very specific, almost niche need for it, it's safer to stick with software-based compression.

Mixed instance types & CPU-to-storage ratios

Tim: And in a roundabout way, this brings me back to the last thing I wanted to ask about. I think we've already touched on it: the idea of 90% storage utilization. You've already mentioned reasons why, including tablets. And we also spoke about having mixed instance types in the cluster. That's quite significant for this release, right? Avi: Yes, it's quite important. Assume you have those large instances with 96 vCPUs and 60TB of storage per node… and your data grows. It's not doubling, just incremental growth. If you have a large amount of data, the rate of growth won't be very large. So, you want to add a smaller amount of storage each time, not 60TB. That gives you two options. One option is to compose your cluster from a large number of very small instances. But large clusters introduce management problems. The odds of a node failing grow as the cluster grows, so you want to keep clusters at a manageable size. The other option is to have mixed-size clusters. For example, if you have clusters of 60TB nodes, then you might add a 6TB node. As the data grows, you can then replace those smaller nodes with larger ones, until you're back to having a cluster that's full of the largest node size. There's another reason for mixed-size clusters: changing the CPU-to-storage ratio. Typically, storage-bound clusters use nodes with a large disk-to-CPU ratio – a lot of disk and relatively little CPU. But there might be times across a day or throughout the year where the number of OPS increases without a corresponding increase in storage. For example, think about Black Friday or workloads spiking in certain geographies. In those cases, you might switch from nodes with a low CPU-to-disk ratio to ones with a high CPU-to-disk ratio, then switch back later. That way, you keep total storage constant, but increase the amount of CPU serving that storage. It lets you adapt to changing CPU requirements without having to buy more storage. Tim: Got it. So it's really about averaging out the ratios to get the price–performance balance you want between storage and CPU. Is that something the user has to figure out, or does it fall under the autoscaler? Avi: It will be automatic. It's too much to ask a user to track the right mix of instances and keep managing that.
Looking back and looking forward

Tim: Looking back, and a little forward…if you could go back to 2014, when you first came up with ScyllaDB, would you tell your past self to do anything different? Or do you think it’s evolved naturally? Would you save yourself some pain?

Avi: Yeah. So, when you start a project, it always looks simple and you think you know everything. Then you discover how much you didn’t know. I don’t even know what my 2014 self would say about how much I mispredicted the amount of work that would be necessary to do this. I mean, I knew databases were hard – one of the most complex areas in software engineering – but I didn’t know how hard.

Tim: And what about looking forward? What’s the next big thing on the horizon that people aren’t really talking about yet?

Avi: I want to fully complete the tablets project before we talk about the next step.

Tim: Just one last question from me before we wrap. Aside from the correct pronunciation of ScyllaDB, what’s the most misunderstood part of ScyllaDB’s new architecture? What are people getting wrong?

Avi: I don’t think people are getting it wrong. It’s not that complicated. It’s another layer of indirection, and people do understand that. We have some nice visualizations of that as well. Maybe we should have a session showing how tablets move around, because it’s a little like Tetris – how we fit different tablets to fill the nodes. So I think tablets are easily understood. It’s complex to implement, but not complicated to understand.

The Latency vs. Complexity Tradeoffs with 6 Caching Strategies

How to choose between cache-aside, read-through, write-through, write-behind, client-side, and distributed caching strategies

As we mentioned in the recent Why Cache Data? post, we’re delighted that Pekka Enberg decided to write an entire book on latency, and we’re proud to sponsor 3 chapters from it. Get the Latency book excerpt PDF

Also, Pekka just shared key takeaways from that book in a masterclass on Building Low Latency Apps (now available on demand). Let’s continue our Latency book excerpts with more from Pekka’s caching chapter. It’s reprinted here with permission of the publisher.

***

When adding caching to your application, you must first consider your caching strategy, which determines how reads and writes happen from the cache and the underlying backing store, such as a database or a service. At a high level, you need to decide if the cache is passive or active when there is a cache miss. In other words, when your application looks up a value from the cache, but the value is not there or has expired, the caching strategy mandates whether it’s your application or the cache that retrieves the value from the backing store. As usual, different caching strategies have different trade-offs on latency and complexity, so let’s get right into it.

Cache-Aside Caching

Cache-aside caching is perhaps the most typical caching strategy you will encounter. When there is a cache hit, data access latency is dominated by communication latency, which is typically small, as you can get a cache close by on a cache server or even in your application memory space. However, when there is a cache miss, with cache-aside caching, the cache is a passive store updated by the application. That is, the cache just reports a miss, and the application is responsible for fetching data from the backing store and updating the cache. Figure 1 shows an example of cache-aside caching in action. An application looks up a value from a cache by a caching key, which determines the data the application is interested in. If the key exists in the cache, the cache returns the value associated with the key, which the application can use. However, if the key does not exist or is expired in the cache, we have a cache miss, which the application has to handle. The application queries the value from the backing store and stores the value in the cache. Suppose you are caching user information and using the user ID as the lookup key. In that case, the application performs a query by the user ID to read user information from the database. The user information returned from the database is then transformed into a format you can store in the cache. Then, the cache is updated with the user ID as the cache key and the information as the value. For example, a typical way to perform this type of caching is to transform the user information returned from the database into JSON and store that in the cache.

Figure 1: With cache-aside caching, the client first looks up a key from the cache. On cache miss, the client queries the database and updates the cache.

Cache-aside caching is popular because it is easy to set up a cache server such as Redis and use it to cache database queries and service responses. With cache-aside caching, the cache server is passive and does not need to know which database you use or how the results are mapped to the cache. It is your application doing all the cache management and data transformation. In many cases, cache-aside caching is a simple and effective way to reduce application latency.
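Here is a minimal cache-aside sketch in Python, using Redis as the cache. The fetch_user_from_db() function, the key format, and the TTL are placeholders invented for the example; in a real application you would replace the loader with an actual database query.

```python
# Minimal cache-aside sketch: the application manages the cache explicitly.
import json
import redis

cache = redis.Redis(host="localhost", port=6379)
CACHE_TTL_SECONDS = 300  # arbitrary freshness window for this example

def fetch_user_from_db(user_id: int) -> dict:
    # Placeholder for a real database query (e.g. via a SQL or CQL driver).
    return {"id": user_id, "name": "Example User", "plan": "free"}

def get_user(user_id: int) -> dict:
    key = f"user:{user_id}"
    cached = cache.get(key)             # 1. look up the key in the cache
    if cached is not None:              # 2. cache hit: reuse the cached copy
        return json.loads(cached)
    user = fetch_user_from_db(user_id)  # 3. cache miss: query the backing store
    cache.setex(key, CACHE_TTL_SECONDS, json.dumps(user))  # 4. populate the cache
    return user
```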
You can hide database access latency by having the most relevant information in a cache server close to your application. However, cache-aside caching can also be problematic if you have data consistency or freshness requirements. For example, if you have multiple concurrent readers looking up a key in the cache, you need to coordinate in your application how you handle concurrent cache misses; otherwise, you may end up with multiple database accesses and cache updates, which may result in subsequent cache lookups returning different values. Furthermore, with cache-aside caching, you lose transaction support because the cache and the database do not know about each other, and it’s the application’s responsibility to coordinate updates to the data. Finally, cache-aside caching can have significant tail latency because some cache lookups experience the database read latency on a cache miss. That is, although access latency on a cache hit is fast because the value comes from a nearby cache server, cache lookups that experience a cache miss are only as fast as database access. That’s why the geographic latency to your database can still matter a great deal even if you are caching: in many scenarios, that tail latency is experienced surprisingly often.

Read-Through Caching

Read-through caching is a strategy where, unlike cache-aside caching, the cache is an active component when there is a cache miss. When there is a cache miss, a read-through cache attempts to read a value for the key from the backing store automatically. Latency is similar to cache-aside caching, although the backing store retrieval happens from the cache to the backing store rather than from the application to the backing store, so it may be lower, depending on your deployment architecture. Figure 2 shows an example of a read-through cache in action. The application performs a cache lookup on a key, and if there is a cache miss, the cache performs a read to the database to obtain the value for the key. The cache then updates itself and returns the value to the application. From an application point of view, a cache miss is transparent because the cache always returns a value if one exists, regardless of whether there was a cache miss or not.

Figure 2: With read-through caching, the client looks up a key from the cache. Unlike with cache-aside caching, the cache queries the database and updates itself on cache miss.

Read-through caching is more complex to implement because a cache needs to be able to read the backing store, but it also needs to transform the database results into a format for the cache. For example, if the backing store is an SQL database server, you need to convert the query results into JSON or a similar format to store the results in the cache. The cache is, therefore, more coupled with your application logic because it needs to know more about your data model and formats. However, because the cache coordinates the updates and the database reads with read-through caching, it can give transactional guarantees to the application and ensure consistency on concurrent cache misses. Furthermore, although a read-through cache is more complex from an application integration point of view, it does remove cache management complexity from the application. Of course, the same caveat about tail latency applies to read-through caches as it does to cache-aside caching. One exception: as active components, read-through caches can hide latency better with, for example, refresh-ahead caching.
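A read-through cache can be sketched as a small wrapper that owns the loader for the backing store, so the application only ever talks to the cache. The following is a self-contained illustration; the load_user() loader is a stand-in you would replace with a real database query, and a production implementation would also handle concurrency and error cases.

```python
# Minimal read-through cache sketch: the cache itself loads missing values.
import time
from typing import Any, Callable

class ReadThroughCache:
    def __init__(self, loader: Callable[[Any], Any], ttl_seconds: float = 300.0):
        self._loader = loader          # how the cache reads the backing store
        self._ttl = ttl_seconds
        self._store: dict = {}         # key -> (value, expiry timestamp)

    def get(self, key: Any) -> Any:
        entry = self._store.get(key)
        if entry is not None and entry[1] > time.monotonic():
            return entry[0]                       # cache hit
        value = self._loader(key)                 # cache miss: cache reads the DB itself
        self._store[key] = (value, time.monotonic() + self._ttl)
        return value                              # the miss is transparent to the caller

# Usage: the application never touches the database directly.
def load_user(user_id):
    return {"id": user_id, "name": "Example User"}   # placeholder for a DB query

users = ReadThroughCache(load_user, ttl_seconds=60)
print(users.get(42))   # miss: loaded from the "database", then cached
print(users.get(42))   # hit: served from the cache
```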
With refresh-ahead caching, the cache asynchronously refreshes values before they expire – hiding the database access latency from applications altogether as long as a value is in the cache.

Write-Through Caching

Cache-aside and read-through caching are strategies around caching reads, but sometimes you also want the cache to support writes. In such cases, the cache provides an interface for updating the value of a key that the application can invoke. In the case of cache-aside caching, the application is the only one communicating with the backing store and, therefore, updates the cache itself. However, with read-through caching, there are two options for dealing with writes: write-through and write-behind caching. Write-through caching is a strategy where an update to the cache propagates immediately to the backing store. Whenever the cache is updated, it synchronously updates the backing store with the cached value. The write latency of a write-through cache is dominated by the write latency to the backing store, which can be significant. As shown in Figure 3, an application updates a cache using an interface provided by the cache with a key and a value pair. The cache updates its state with the new value, updates the database with the new value, and waits for the database to commit the update before acknowledging the cache update to the application.

Figure 3: With write-through caching, the client writes a key-value pair to the cache. The cache immediately updates itself and the database.

Write-through caching aims to keep the cache and the backing storage in sync. However, for non-transactional caches, the cache and backing store can be out of sync in the presence of errors. For example, if the write to the cache succeeds but the write to the backing store fails, the two will be out of sync. Of course, a write-through cache can provide transactional guarantees by trading off some latency to ensure that the cache and the database are either both updated or neither of them is. As with a read-through cache, write-through caching assumes that the cache can connect to the database and transform a cache value into a database query. For example, if you are caching user data where the user ID serves as the key and a JSON document represents the value, the cache must be able to transform the JSON representation of user information into a database update. With write-through caching, the simplest solution is often to store the JSON in the database. The primary drawback of write-through caching is the latency associated with cache updates, which is essentially equivalent to database commit latency. This can be significant.

Write-Behind Caching

The write-behind caching strategy also updates the cache immediately but, unlike write-through caching, defers the database updates. In other words, with write-behind caching, the cache may accept multiple updates before updating the backing store, as shown in Figure 4, where the cache accepts three cache updates before updating the database.

Figure 4: With write-behind caching, the client writes a key-value pair to the cache. However, unlike with write-through caching, the cache updates itself but defers the database update. Instead, a write-behind cache will batch multiple cache updates into a single database update.

The write latency of a write-behind cache is lower than with write-through caching because the backing store is updated asynchronously, as the sketch below illustrates.
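Here is a minimal write-behind sketch in Python: the cache is updated synchronously, dirty keys are collected, and a flush writes them to the backing store in one batch. The write_batch_to_db callback and the batch size are invented for the example; a real implementation would also handle background flushing, retries, ordering, and crash recovery.

```python
# Minimal write-behind sketch: writes hit the cache now, the database later.
class WriteBehindCache:
    def __init__(self, write_batch_to_db, max_pending: int = 3):
        self._write_batch_to_db = write_batch_to_db  # bulk update to the backing store
        self._max_pending = max_pending
        self._cache: dict = {}
        self._dirty: dict = {}        # keys updated since the last flush

    def put(self, key, value) -> None:
        self._cache[key] = value      # acknowledged immediately: low-latency write
        self._dirty[key] = value
        if len(self._dirty) >= self._max_pending:
            self.flush()              # batch several updates into one DB write

    def get(self, key):
        return self._cache.get(key)

    def flush(self) -> None:
        if self._dirty:
            self._write_batch_to_db(self._dirty)  # deferred backing-store update
            self._dirty = {}

# Usage: three puts are batched into a single backing-store update.
cache = WriteBehindCache(write_batch_to_db=lambda batch: print("DB <-", batch))
cache.put("user:1", {"name": "Ada"})
cache.put("user:2", {"name": "Linus"})
cache.put("user:3", {"name": "Grace"})   # triggers one batched database write
```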
The cache can acknowledge the write immediately to the application, resulting in a low-latency write, and then perform the backing store update in the background. However, the downside of write-behind caching is that you lose transaction support because the cache can no longer guarantee that the cache and the database are in sync. Furthermore, write-behind caching can reduce durability, which is the guarantee that you don’t lose data. If the cache crashes before flushing updates to the backing store, you can lose the updates.

Client-Side Caching

A client-side caching strategy means having the cache at the client layer within your application. Although cache servers such as Redis use in-memory caching, the application must communicate over the network to access the cache via the Redis protocol. If the application is a service running in a data center, a cache server is excellent for caching because the network round trip within a data center is fast, and the cache complexity lives in the cache itself. However, last-mile latency can still be a significant factor in user experience on a device, which is why client-side caching is so lucrative. Instead of using a cache server, you have the cache in your application. With client-side caching, a combination of read-through and write-behind caching is optimal from a latency point of view because both reads and writes are fast. Of course, your client usually won’t be able to connect to the database directly, but instead accesses the database indirectly via a proxy or an API server. Client-side caching also makes transactions hard to guarantee because of the database access indirection layers and latency. For many applications that need low-latency client-side caching, a local-first approach to replication may be more practical. But for simple read caching, client-side caching can be a good solution to achieve low latency. Of course, client-side caching also has a trade-off: it can increase the memory consumption of the application because you need space for the cache.

Distributed Caching

So far, we have only discussed caching as if a single cache instance existed. For example, you use an in-application cache or a single Redis server to cache queries from a PostgreSQL database. However, you often need multiple copies of the data to reduce geographic latency across various locations or to scale out to accommodate your workload. With such distributed caching, you have numerous instances of the cache that work either independently or in a cache cluster. With distributed caching, you run into many of the same complications and considerations as discussed in Chapter 4 on replication and Chapter 5 on partitioning. With distributed caching, you don’t want to fit all the cached data on every instance; instead, the cached data is partitioned between the nodes. Similarly, you can replicate the partitions on multiple instances for high availability and reduced access latency. Overall, distributed caching is an intersection of the benefits and problems of caching, partitioning, and replication, so watch out if you’re going with that.

***

To keep reading, download the 3-chapter Latency excerpt free from ScyllaDB or purchase the complete book from Manning.

ScyllaDB X Cloud: An Inside Look with Avi Kivity (Part 2)

ScyllaDB’s co-founder/CTO goes deeper into how tablets work, then looks at the design behind ScyllaDB X Cloud’s autoscaling

Following the recent ScyllaDB X Cloud release, Tim Koopmans sat down (virtually) with ScyllaDB Co-Founder and CTO Avi Kivity. The goal: get the engineering perspective on all the multiyear projects leading up to this release. This includes using Raft for topology and schema metadata, moving from vNodes to tablets-based data distribution, allowing up to 90% storage utilization, new compression approaches, and more. In part 1 of this 3-part series, we looked at the motivations and architectural shifts behind ScyllaDB X Cloud, particularly with respect to Raft and tablets-based data distribution. This blog post goes deeper into how tablets work, then looks at the design behind ScyllaDB X Cloud’s autoscaling. Read part 1 Read part 3 You can watch the complete video here.

Tackling technical challenges

Tim: With such a complex project, I’m guessing that you didn’t nail everything perfectly on the first try. Could you walk us through some of the hard problems that took time to crack? How did you work around those hurdles?

Avi: One of the difficult things was the distribution related to racks or availability zones (we use those terms interchangeably). With the vNodes method of data distribution, a particular replica can hop around different racks. That does work, but it creates problems when you have materialized views. With a materialized view, each row in the base table is tied to a row in the materialized view. If there’s a change in the relationship between which replica on the base table owns the row on the materialized view, that can cause problems with data consistency. We struggled with that a lot until we came to a solution of simply forbidding a replication factor that’s different from the number of racks or availability zones. That simple change solved a lot of problems. It’s a very small restriction because, practically speaking, the vast majority of users have a replication factor of 3, and they use 3 racks or 3 availability zones. So the restriction affects very few people, but solves a large number of problems for us…so we’re happy that we made it.

How tablets prevent hot partitions

Tim: What about things like hot partitions and data skew in tablets? Do tablets help here since you’re working with smaller chunks?

Avi: Yes. With tablets, our granularity is 5GB, so we can balance data in 5GB chunks. That might sound large, but it’s actually very small compared to the node capacity. The 5GB size was selected because it’s around 1% of the data that a single vCPU can hold. For example, an i3 node has around 600GB of storage per vCPU, and 1% of that is roughly 5GB. That’s where the 5GB number came from. Since we control individual tablets, we can isolate a tablet to a single vCPU. Then, instead of a tablet being 1% of a vCPU, it can take 100% of it. That effectively increases the amount of compute power dedicated to the tablet by a factor of 100. This will let us isolate hot partitions into their own vCPUs. We don’t do this yet, but detecting hot partitions and isolating them in this way will improve the system’s resilience to hot partition problems.

Tim: That’s really interesting. So have we gone from shard per core to almost tablet per core? Is that what the 1% represents, on average?

Avi: The change is that we now have additional flexibility. With a static distribution, you look at the partition key and you know in advance where it will go.
Here, you look at the partition key and you consult an indirection table. And that indirection table is under our control…which means we can play with it and adjust things.

Tim: Can you say more about the indirection table?

Avi: It’s called system.tablets. It lays out the topology of the cluster. For every table and every token range, it lists which node and which shard will handle those keys. It’s important that it’s per table. With vNodes, we had the same layout for all tables. Some tables can be very large, some tables can be very small, some tables can be hot, some tables can be cold…so the one-size-fits-all approach doesn’t always work. Now, we have the flexibility to lay out different tables in different ways.

How driver changes simplify complexity

Tim: Very cool. So tablets seem to solve a lot of problems – they just have a lot of good things going for them. I guess they can start servicing requests as soon as a new node receives a tablet? That should help with long-tail latency for cluster operations. We also get more fine-grained control over how we pack data into the cluster (and we’ll talk about storage utilization shortly). But you mentioned the additional table. Is there any other overhead or any operational complexity?

Avi: Yes. It does introduce more complexity. But since it’s under our control, we also introduced mitigations for that. For example, the drivers now have to know about this indirection layer, so we modified them. We have this reactive approach where a driver doesn’t read the tablets table upfront. Instead, when it doesn’t know the layout of tablets on a cluster, it just fires off a request randomly. If it hits, great. If it misses, then along with the results, it gets back a notification about the topology of that particular tablet. As it fires off more requests, it will gradually learn the topology of the cluster. And when the topology changes, it will react to how the cluster layout changes. That saves it from doing a lot of upfront work – so it can send requests as soon as it connects to the cluster.

ScyllaDB’s approach to autoscaling

Tim: Let’s shift over to autoscaling. Autoscaling in databases generally seems more like marketing than reality to me. What’s different about ScyllaDB X Cloud’s approach to autoscaling?

Avi: One difference is that we can autoscale much later, at least for storage-bound workloads. Before, we would scale at around 70% storage utilization. But now we will start scaling at 90%. This decreases the cluster cost because more of the cluster storage is used to store data, rather than being used as a free-space cushion. Tablets allow us to do that. Since tablets let us add nodes concurrently, we can scale much faster. Also, since each tablet is managed independently, we can remove its storage as soon as the tablet is migrated off its previous node. Before, we had to wait until the data was completely transitioned to a new node, and then we would run a cleanup process that would erase it from the original node. But now this is done incrementally (in 5GB increments), so it happens very quickly. We can migrate a 5GB tablet in around a minute, sometimes even less. As soon as a cluster scale-out begins, the node storage decreases immediately. That means we can defer the scale-out decision, waiting until it’s really needed. Scaling for CPU, by measuring the CPU usage, will be another part of that. CPU is used for many different things in ScyllaDB.
It can be used for serving queries, but it’s also used for internal background tasks like compaction. It can also be used for queries that – from the user’s perspective – are background queries, like running analytics. You wouldn’t want to scale your cluster just because you’re running analytics on it. These are jobs that can take as long as they need to; you wouldn’t necessarily want to add more hardware just to make them run faster. We can distinguish between CPU usage for foreground tasks (queries that are latency-sensitive) and CPU usage for maintenance tasks, background work, and queries where latency is not so important. We will only scale when the CPU for foreground tasks runs low.

Tim: Does the user have to do anything special to prioritize the foreground vs. background queries? Is that just part of workload prioritization? Or does it just understand the difference?

Avi: We’re trying not to be too clever. It does use the existing service level mechanism. And in the service level definition, you can say whether it’s a transaction workload or a batch workload. All you need to do is run an ALTER SERVICE LEVEL statement to designate a particular service level as a batch workload. And once you do that, the cluster will not scale because that service level needs more CPU. It will only scale if your real-time queries are running out of CPU. It’s pretty normal to see ScyllaDB at 100% CPU. But that 100% is split: part goes to your workload, and part goes to maintenance like compaction. You don’t want to trigger scaling just because the cluster is using idle CPU power for background work. So, we track every cycle and categorize it as either foreground work or background work, then we make decisions based on that. We don’t want it to scale out too far when that’s just not valuable.
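For reference, designating a service level as a batch workload might look something like the following sketch, shown here via the Python driver. The service level name is made up, and the exact workload_type syntax should be checked against the ScyllaDB service-level documentation for your version; this is an assumed example, not official guidance from the interview.

```python
# Sketch: mark an existing service level as a batch workload so its CPU usage
# is treated as background work. Names and syntax are illustrative; verify
# against the ScyllaDB service-level docs for your version.
from cassandra.cluster import Cluster  # the scylla-driver package exposes the same API

cluster = Cluster(["scylla-node1.example.com"])
session = cluster.connect()

session.execute("ALTER SERVICE LEVEL analytics WITH workload_type = 'batch'")
```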

Why Cache Data? [Latency Book Excerpt]

Latency is a monstrous concern here at ScyllaDB. So we’re pleased to bring you excerpts from Pekka Enberg’s new book on Latency…and a masterclass with Pekka as well!

Latency is a monstrous concern here at ScyllaDB. Our engineers, our users, and our broader community are obsessed with it…to the point that we developed an entire conference on low latency. [Side note: That’s P99 CONF, a free + virtual conference coming to you live, October 22-23.] Join P99 CONF – Free + Virtual

We’re delighted that Pekka Enberg decided to write an entire book on latency to share his hard-fought latency lessons learned. The book (quite efficiently titled “Latency”) is now off to the printers. From his intro:

Latency is so important across a variety of use cases today. Still, it’s a tricky topic because many low-latency techniques are effectively developer folklore hidden in blog posts, mailing lists, and side notes in books. When faced with a latency problem, understanding what you’re even dealing with often takes a lot of time. I remember multiple occasions where I saw peculiar results from benchmarks, which resulted in an adventure down the software and hardware stack where I learned something new. By reading this book, you will understand latency better, learn how to measure latency accurately, and discover the different techniques for achieving low latency. This is the book I always wished I had when grappling with latency issues. Although this book focuses on applying the techniques in practice, I will also bring up enough of the background side of things to try to balance between theory and practice.

ScyllaDB is sponsoring chapters from the book, so we’ll be trickling out some excerpts on our blogs. Get the Latency book excerpt PDF

More good news: Pekka just joined us for a masterclass on Building Low Latency Apps (now available on demand), where he shared key takeaways from that book. Let’s kick off our Latency book excerpts with the start of Pekka’s caching chapter. It’s reprinted here with permission of the publisher.

***

Why cache data?

Typically, you should consider caching to reduce latency over other techniques if your application or system:

Doesn’t need transactions or complex queries.
Cannot be changed, which makes using techniques such as replication hard.
Has compute or storage constraints that prevent other techniques.

Many applications and systems are simple enough that a key-value interface, typical for caching solutions, is more than sufficient. For example, you can store user data such as profiles and settings as a key-value pair, where the key is the user ID and the value is the user data in JSON or a similar format. Similarly, session management, where you keep track of logged-in user session state, is often simple enough that it doesn’t require complex queries. However, caching can eventually be too limiting as you move to more complicated use cases, such as recommendations or ad delivery. Then you have to look into other techniques. Overall, whether your application is simple enough to use caching is highly use case-specific. Often, you look into caching because you cannot or don’t want to change the existing system. For example, you may have a database system that you cannot change, which does not support replication, but you have clients accessing the database from multiple locations. You may then look into caching some query results to reduce latency and scale the system, which is a typical use of caching.
However, this comes with various caveats on data freshness and consistency, which we’ll discuss in this chapter. Compute and storage constraints can also be a reason to use caching instead of other techniques. Depending on their implementation, colocation and replication can have high storage requirements, which may prevent you from using them. For example, suppose you want to reduce access latency to a large data set, such as a product catalog in an e-commerce site. In that case, it may be impractical to replicate the whole data set in the client with the lowest access latency. However, caching parts of the product catalog in the client may still make sense: you reduce latency while living within the client’s storage constraints. Similarly, it may be impractical to replicate a whole database to a client or a service because database access requires compute capacity for query execution, which may not be there.

Caching overview

With caching, you can keep a temporary copy of data to reduce access time significantly by reusing the same result many times. For example, if you have a REST API that takes a long time to compute a result, you can cache the REST API results in the client to reduce latency. Accessing the cached results can be as fast as reading from memory, which can significantly reduce latency. You can also use caching for data items that don’t exist, which is called negative caching. For example, maybe the REST API you use is there to look up customer information based on some filtering parameters. In some cases, no results will match the filter, but you still need to perform the expensive computation to discover that. In that scenario, you would use negative caching to cache the fact that there are no results, speeding up the search. Of course, caching has a downside, too: you trade off data freshness for reduced access latency. You also need more storage space to keep the cached data around. But in many use cases, it’s a trade-off you are willing to take.

Cache storage is where you keep the temporary copies of data. Depending on the use case, cache storage can be either in main memory or on disk, and it can be accessed either in the same memory address space as the application or over a network protocol. For example, you can use an in-memory cache library to cache values in your application memory, or a key-value store such as Redis or Memcached to cache values in a remote server. With caching, an application looks up values from the cache based on a cache key. When the cache has a copy of the value, we call that a cache hit and serve the data access from the cache. However, if there is no value in the cache, we call that scenario a cache miss and must retrieve the value from the backing store. A key metric for an effective caching solution is the cache hit-to-miss ratio, which describes how often the application finds a relevant value in the cache and how frequently the cache does not have a value. If a cache has a high cache hit ratio, it is utilized well, meaning there is less need to perform a slow lookup or compute the result. With a high cache miss ratio, you are not taking advantage of the cache. This can mean that your application runs slower than without caching, because caching itself has some overhead.

One major complication with caches is the cache eviction policy, or what values to throw out from the cache. The main point of a cache is to provide fast access but also to fit the cache in a limited storage space; the toy cache sketched below illustrates these mechanics.
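The following toy cache pulls these ideas together: a bounded in-memory store with least-recently-used eviction, time-based expiry, negative caching of missing results, and a hit/miss counter for computing the hit ratio. It is a teaching sketch under simplifying assumptions (no thread safety, no memory accounting), not a production cache.

```python
# Toy in-memory cache: LRU eviction, TTL expiry, negative caching, hit ratio.
import time
from collections import OrderedDict

_MISSING = object()   # sentinel so we can cache "no result" (negative caching)

class ToyCache:
    def __init__(self, max_entries: int = 1024, ttl_seconds: float = 60.0):
        self._entries: OrderedDict = OrderedDict()   # key -> (value, expiry)
        self._max_entries = max_entries
        self._ttl = ttl_seconds
        self.hits = 0
        self.misses = 0

    def get(self, key, loader):
        entry = self._entries.get(key)
        if entry is not None and entry[1] > time.monotonic():
            self.hits += 1
            self._entries.move_to_end(key)           # mark as recently used
            value = entry[0]
        else:
            self.misses += 1
            loaded = loader(key)                     # may legitimately return None
            value = _MISSING if loaded is None else loaded
            self._entries[key] = (value, time.monotonic() + self._ttl)
            if len(self._entries) > self._max_entries:
                self._entries.popitem(last=False)    # evict the least recently used
        return None if value is _MISSING else value

    def hit_ratio(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0

# Usage: the second lookup of a missing customer is served from the cache too.
cache = ToyCache(max_entries=2, ttl_seconds=30)
expensive_search = lambda key: None                  # stand-in for a slow filter query
cache.get("customer:missing", expensive_search)      # miss, cached as "no result"
cache.get("customer:missing", expensive_search)      # negative-cache hit
print(cache.hit_ratio())                             # 0.5
```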
For example, you may have a database with hundreds of gigabytes of data. Still, you can only reasonably cache tens of gigabytes in the memory address space of your application because of machine resource limitations. You, therefore, need some policy to determine which values stay in the cache and which ones you can evict if you run out of cache space. Similarly, once you cache a value, you can’t always retain the value in the cache indefinitely if the source value changes. For example, you may have a time-based eviction policy enforcing that a cached value can be at least a minute old before updating to the latest source value. Despite the challenges, caching is an effective technique to reduce latency in your application, in particular when you can’t change some parts of the system and when your use case doesn’t warrant investment in things like colocation, replication, or partitioning. With that in mind, let’s look at the different caching strategies.

Caching strategies

When adding caching to your application, you must first consider your caching strategy, which determines how reads and writes happen from the cache and the underlying backing store, such as a database or a service. At a high level, you need to decide if the cache is passive or active when there is a cache miss. In other words, when your application looks up a value from the cache, but the value is not there or has expired, the caching strategy mandates whether it’s your application or the cache that retrieves the value from the backing store. As usual, different caching strategies have different trade-offs on latency and complexity, so let’s get right into it. To be continued…

Get the Latency book excerpt PDF Join the Latency Masterclass on September 25