ScyllaDB Vector Search in Action: Real-time RAG and Anomaly Detection

Learn how ScyllaDB vector search simplifies scalable semantic search and RAG with real-time performance – plus see FAQs on vector search scaling, embeddings, and architecture.

As a Solutions Architect at ScyllaDB, I talk a lot about vector search these days. Most vector search conversations start with RAG: build a chatbot, embed some documents, do a similarity lookup. That’s an important use case, but it’s just the beginning of what vector search can do when you have a platform like ScyllaDB that can scale to massive datasets and hit those p99 latencies.

At Monster Scale Summit, I showed two live demos that cover what I think are the two most practical patterns for vector search today: hybrid memory for RAG, and real-time anomaly detection. Both run entirely on ScyllaDB, without needing an external cache (e.g., Redis) or a dedicated vector database (e.g., Pinecone). This post explains what I showed and also answers the top questions from attendees.

The Limits of Traditional Search and Anomaly Detection

The old way of doing anomaly detection – retrieving data, doing lookups, using inverted indexes, and running semantic search – is brittle. Anyone who has tried to scale an index-based system has experienced this firsthand. The first step was figuring out what was said before and what data was loaded; Redis caches, key-value stores, document databases, and SQL databases were used to handle that. Then, to determine which documents are relevant, we built custom search pipelines using Lucene-based indexes and similar tools. Next, we built complex rules engines that became difficult to troubleshoot and couldn’t scale. These systems became extremely complex, with SQL-based regressions and similar mechanisms. The problem is that they’re hard to scale and very rigid. Every new failure mode required building a new rule, sometimes an entirely new rules engine. Often, these detection paths required completely different systems.
The result was graph use cases, vector systems, and multiple databases: a collection of systems solving what are ultimately simple problems. As you might imagine, these systems become quite brittle and expensive to operate. You could avoid this with a conceptual shift: from matching rules to finding things that look similar. That shift from rigid logic to semantic similarity is what vector search is trying to achieve, whether through text embeddings or other encodings.

RAG with Hybrid Memory

The first demo showed how we can build three distinct memory tiers for RAG, all living within ScyllaDB. This “hybrid memory” pattern allows you to handle everything from immediate context to long-forgotten conversations without a complex conglomerate of different databases.

Short-term memory: This handles the last five records of your interaction as a rolling window. It is very fast, uses a short TTL, and ensures the AI stays on track with the immediate conversation.

Long-term memory: This is where vectors shine. If you’ve been chatting with a bot for six days, that history won’t fit in a standard context window. We use vector search to retrieve the specific conversation parts that matter based on embedding similarity, loading only the relevant history into the prompt. This is the piece that makes a chatbot feel like it actually remembers.

Document enrichment: This phase takes uploaded files (PDFs, Markdown files, etc.), chunks them semantically, embeds each chunk, and stores those vectors in ScyllaDB. When you ask a question, the system retrieves the most relevant chunks and feeds them to the model.

Basically, I used ScyllaDB to manage key-value lookups, vector similarity, and TTL expiration in a single cluster, which meant I didn’t need extra infrastructure like Redis or Pinecone. The demo processed LinkedIn profiles by breaking them into semantic chunks (one profile generated 43 chunks) and stored those embeddings in ScyllaDB Cloud to be queried in real time.
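The three tiers can be sketched in a few lines of Python. This is a toy in-memory stand-in, not the demo code: in the real system each tier is a ScyllaDB table (with TTL and a vector index), and the embeddings come from a real model rather than the hand-made vectors used here. All class and method names below are mine.

```python
import math
import time
from collections import deque

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

class HybridMemory:
    """Toy stand-in for the three ScyllaDB-backed memory tiers."""

    def __init__(self, window=5, ttl_seconds=3600):
        self.short_term = deque(maxlen=window)  # rolling window of recent turns
        self.long_term = []                     # (embedding, text, expires_at) rows
        self.ttl = ttl_seconds

    def remember(self, text, embedding):
        """Record a conversation turn in both tiers."""
        self.short_term.append(text)
        self.long_term.append((embedding, text, time.time() + self.ttl))

    def recall(self, query_embedding, k=3):
        """Return the k most similar, non-expired long-term entries."""
        now = time.time()
        scored = [(cosine(query_embedding, emb), text)
                  for emb, text, expires_at in self.long_term
                  if expires_at > now]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [text for _, text in scored[:k]]
```

Document enrichment follows the same recall path: chunk the file, embed each chunk, `remember` it, and feed the top-k `recall` results into the prompt.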
The frontend is built in Elixir; the backend is ScyllaDB Cloud with vector search enabled. The code is on my GitHub. Pull it, run it, and try it out with your own documents.

Anomaly Detection: Three Paths, One Database

The second demo is where vector search truly earns its keep. The setup models an IoT pipeline:

Devices generate metrics every 10 seconds
Metrics stream through Kafka
They’re aggregated into 60-second windows
Finally, they land in ScyllaDB as 384-dimensional embeddings generated by Ollama

From there, three detection paths run in parallel against every snapshot.

Path 1: Rules engine. This is classic Z-score analysis. If four or more metrics exceed a Z-score of 6.0, flag it. This is fast and interpretable. It catches known anomaly patterns and misses everything else.

Path 2: Embedding similarity. Build a behavioral fingerprint for each device. Compare every new snapshot against that fingerprint using cosine similarity. If the score drops below 0.92, something has drifted. This catches gradual behavioral changes that hard thresholds miss.

Path 3: Vector search. Run an ANN query across the device’s historical snapshots in ScyllaDB. Post-filter for the same device, non-anomalous records, and a cosine similarity above 0.75. If fewer than five similar normal snapshots exist, that snapshot is anomalous. You don’t need to write any rules or tune thresholds per device. The data tells you what normal looks like.

In the demo, I showed a sensor drift scenario where power consumption dropped slightly and a few other metrics shifted just enough to be unusual. This slipped past the rules engine and embedding similarity, but vector search caught it. It’s Exhibit A for how vector search exposes subtle anomalies that rigid systems can’t see.

The reason this works on a single cluster is that ScyllaDB handles vectors alongside time series, key-value, and TTL data with sub-10-millisecond queries under heavy concurrent load.
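For concreteness, here is a minimal Python sketch of the three detection paths. The thresholds match the ones described above, but the helper names are mine, and the `history` list stands in for the ANN query plus post-filtering that ScyllaDB performs in the real pipeline.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def rules_engine(metrics, means, stds, z_limit=6.0, min_hits=4):
    """Path 1: flag if min_hits or more metrics exceed the Z-score limit."""
    hits = sum(1 for name, value in metrics.items()
               if stds[name] > 0 and abs(value - means[name]) / stds[name] > z_limit)
    return hits >= min_hits

def fingerprint_drift(snapshot, fingerprint, floor=0.92):
    """Path 2: flag if the snapshot has drifted from the device's fingerprint."""
    return cosine(snapshot, fingerprint) < floor

def vector_search(snapshot, history, floor=0.75, min_neighbors=5):
    """Path 3: flag if fewer than min_neighbors similar normal snapshots exist.

    `history` is a list of (embedding, was_anomalous) pairs for this device,
    standing in for an ANN query over ScyllaDB plus post-filtering.
    """
    similar_normal = sum(1 for emb, was_anomalous in history
                         if not was_anomalous and cosine(snapshot, emb) >= floor)
    return similar_normal < min_neighbors
```

Note that `vector_search` flags everything for a device with little history, which is exactly the new-device question raised in the FAQ below; gating it on a minimum history length is the usual fix.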
Snapshots, profiles, anomaly events, and the vector index all live in one place. The anomaly detection demo runs in Docker if you want to try it on your own data.

Common Questions About ScyllaDB Vector Search

Technical demos naturally lead to questions, and we heard plenty at Monster Scale Summit (thanks for keeping us busy in the chat). Here are responses to the top questions we fielded.

“What about new devices with no history?”

Fair concern. Path 3 decides “anomaly” when it finds fewer than five similar normal snapshots, so a brand new device with ten records will trigger on everything. The fix is a minimum history gate: don’t run the vector path until a device has enough baseline data to make comparisons meaningful. In practice, the rules engine and embedding similarity cover the gap while the device builds history.

“How do I know my embeddings are good enough?”

Test them against known anomalies before you go to production. Bad embeddings produce bad neighbors, and similarity search will give you confidently wrong answers. Run a batch of labeled data through your embedding model, query for nearest neighbors, and check whether the results make sense. If your model does not capture the right features, no amount of tuning the similarity threshold will save you.

“Does this scale to millions of devices?”

The ANN query itself scales; the part you want to watch is post-filtering. The query returns candidates, and your application code still has to filter by device, status, and similarity threshold. At millions of devices, that step needs to be efficient. Batch your comparisons and keep your filter criteria tight. Someone at the talk asked about the performance impact of deletions on the vector index as old snapshots expire. ScyllaDB’s index structure is optimized for deletion, so the index stays healthy as data churns through.

“Can I just use vector search and skip the rules engine?”

Yes, you can…but you’ll regret it.
The three-path approach works because each path catches a different class of problem. Rules engines are still the fastest way to catch known, well-defined failures. Someone at the talk asked whether an LLM with MCP could just explore the context and find anomalies itself. It can. But vector search is cheaper: no LLM invocation, no context window limit, and it runs at the speed of a database query. Use each tool for what it is good at.

Try It Yourself

If you want to see what ScyllaDB can do for your vector workloads, sign up for a demo. I’m always happy to walk through architectures and talk through what makes the most sense for your specific requirements. Connect with me on LinkedIn if you have questions, want to collaborate, or just want to talk about distributed systems. I read every message.

ScyllaDB R&D Year in Review: Elasticity, Efficiency, and Real-Time Vector Search

2025 was a busy year for ScyllaDB R&D. We shipped one major release, three minor releases, and a continuous stream of updates across both ScyllaDB and ScyllaDB Cloud. Our users are most excited about elasticity, vector search, and reduced total cost of ownership. We also made nice progress on years-long projects like strong consistency and object storage.

I raced through a year-in-review recap in my recent Monster Scale Summit talk (time limits, sorry), which you can watch on-demand. But I also want to share a written version that links to related talks and blog posts with additional detail. Feel free to choose your own adventure.

Watch the talk

Tablets: Fast Elasticity

ScyllaDB’s “tablets” data distribution approach is now fully supported across all ScyllaDB capabilities, for both CQL and Alternator (our DynamoDB-compatible API). Tablets are the key technology that dynamically scales clusters in and out. It’s been in production for over a year now; some customers use it to scale for the usual daily fluctuations, while others rely on its fast response to workload volatility, whether expected events or unpredictable spikes. Right now, autoscaling lets users maximize their disks, which reduces costs. Soon, we’ll automatically scale based on workload characteristics as well as storage.

A few things make this genuinely different from the vNodes design that we originally inherited from Cassandra and have now replaced with tablets: When you add new nodes, load balancing starts immediately and in parallel. No cleanup operation is needed. There’s also no resharding; rebalancing is automated. Moreover, we shifted from mutation-based streaming to file-based streaming: we stream entire SSTable files without deserializing them into mutation fragments and reserializing them back into SSTables on the receiving nodes. As a result, 3X less data is streamed over the network and less CPU is consumed, especially for data models that contain small cells.
This change provides up to 25X faster streaming. As Avi’s talk explains, we can scale out in minutes, and the tablets load balancing algorithm balances nodes based on their storage consumption. That means more usable space on an ongoing basis – up to 90% disk utilization.

Also noteworthy: you can now scale with different node sizes, so you can increase capacity in much smaller increments. You can add tiny instances first, then replace them with larger ones if needed. That means you rarely pay for unused capacity. For example, before, if you started with an i4i.16xlarge node that had 15 TB of storage and you hit 70% utilization, you had to launch another i4i.16xlarge – adding 15 TB at once. Now, you might add two xlarge nodes (1.8 TB each) first. Then, if you need more storage, you add more small nodes, eventually replacing them with larger nodes.

Vector Search: Real-Time AI

Customers like Tripadvisor, ShareChat, Medium, and Agoda have been using ScyllaDB as a fast feature store in their AI pipelines for several years now. Now, ScyllaDB also has real-time vector search, embedded in ScyllaDB Cloud. Our vector search takes advantage of ScyllaDB’s unique architecture. Technologies like our super-fast Rust driver, CDC, and tablets help us deliver real-time AI queries. We can handle datasets of 1 billion vectors with P99 latency as low as 1.7 ms and throughput up to 252,000 QPS.

Dictionary-Based Compression

I’ve always been interested in compression, even as a student. It’s particularly interesting for databases: it allows you to deliver more with less, so it’s yet another way to reduce costs. ScyllaDB Cloud has always had compression enabled by default, both for data at rest and for data in transit. We compress our SSTables on disk, we compress traffic between nodes, and optionally between clients and nodes. In 2025, we improved its efficiency by up to 80% by enabling dictionary-based compression.
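ScyllaDB trains dictionaries for LZ4 and ZSTD, but the core idea is easy to demonstrate with Python’s stdlib zlib, which also accepts a preset dictionary. The sample records and hand-built dictionary below are mine, purely for illustration:

```python
import zlib

# Many small, similar records: the worst case for plain per-record compression,
# and the case where a shared dictionary helps most.
samples = [('{"device":"sensor-%03d","status":"ok","power_w":%d}' % (i, 40 + i)).encode()
           for i in range(50)]

# In practice the dictionary is trained from sampled data; here we just use
# the boilerplate shared by every record.
dictionary = b'{"device":"sensor-","status":"ok","power_w":'

def compress(data, zdict=None):
    c = zlib.compressobj(zdict=zdict) if zdict else zlib.compressobj()
    return c.compress(data) + c.flush()

def decompress(blob, zdict=None):
    d = zlib.decompressobj(zdict=zdict) if zdict else zlib.decompressobj()
    return d.decompress(blob) + d.flush()

plain_size = sum(len(compress(s)) for s in samples)
dict_size = sum(len(compress(s, dictionary)) for s in samples)
print(plain_size, dict_size)  # the dictionary version is noticeably smaller

# Round trip: the decompressor needs the same dictionary.
assert decompress(compress(samples[0], dictionary), dictionary) == samples[0]
```

The tradeoff is that the dictionary must be stored and shipped alongside the data, which is why the technique pays off for many small, similar blocks rather than a few large ones.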
The compressor gains better context about the data being compressed. That gives it higher compression ratios and, in some cases, even better performance. Our two most popular compression algorithms, LZ4 and ZSTD, both benefit from dictionaries now. It works by sampling some data, creating a dictionary from the samples, and then using that dictionary to further compress the data.

The graph on the lower left shows the impact of enabling dictionary compression for network traffic. Both compression algorithms benefit, and both ratios drop nicely – from 70% to less than 50% for LZ4 and from around 50% to 33% for ZSTD. The table on the lower right shows a similar change: the benefit for disk utilization on a customer’s production cluster, essentially cutting storage consumption from 50% to less than 30%. Note that the 50% was an already compressed dataset. With this new compression, we further compressed it to less than 30% – a significant saving.

Raft-Based Topology

We’ve been working on Raft and features built on top of it for several years now. Currently, we use Raft for multiple purposes. Schema consistency was the first step, but topology is the more interesting improvement. With Raft-based fast parallel scaling and safe schema updates, we’re ready to finally retire Gossip-based topology. Other features that use Raft-based topology are authentication, service levels (also known as workload prioritization), and tablets. We are actively working on making strong consistency available for data as well.

ScyllaDB X Cloud

ScyllaDB X Cloud is the next generation of ScyllaDB Cloud, a truly elastic database-as-a-service. It builds on innovation in the ScyllaDB core (such as tablets and Raft-based topology) as well as innovation and improvements in ScyllaDB Cloud itself (such as parallel setup of cloud resources, reduced boot time, and the new resource resizing algorithm) to provide immediate elasticity to clusters. Curious how it works? It’s quite simple, really.
You just select two important parameters for your cluster, and those define the minimum values for resources:

The minimum vCPU count, which is a way to measure the ‘horsepower’ that you initially want to reserve for your clusters.
The minimum storage size.

And that’s it. You can do this via the API or UI. If you wish, you can also limit the cluster to a specific instance family.

Now, let’s see how this scaling looks in action. A few things to note here: There are three somewhat large nodes and three additional nodes that are smaller. Some of the tablets are not equal in size, and that’s perfectly fine. The load was very high initially, then additional load moved gradually to the new nodes. The workload itself, in terms of requests, didn’t change; it changed which nodes it was going to, but the overall volume remained the same. The average latency, the P95 latency, and the P99 latency are all great: even the P99s are in the single-digit milliseconds.

A Few More Things We Shipped Last Year

Backup is a critical feature of any database, and we run it regularly in ScyllaDB Cloud for our customers. In the past, we used an external utility that backed up snapshots of the data. That was somewhat inefficient. It also competed with the memory, CPU, disk, and network resources that ScyllaDB was consuming – potentially affecting the throughput and latency of user workloads. We reimplemented the backup client so it now runs inside ScyllaDB’s scheduler and cooperates with the rest of the system. The result is minimal impact on user workloads and 11X faster execution that scales linearly with the number of shards per node.

On the infrastructure side, the new AWS Graviton i8g instances proved themselves this year. We measured up to 2X throughput and lower latency at the same price, along with higher usable storage from improved compression and utilization.
We don’t embrace every new instance type, since we have very specific and demanding requirements. However, when we see clear value like this, we encourage customers to move to newer generations.

On the security side, all new clusters now have their data encrypted at rest by default. When creating a cluster, you can either use your own key (known as ‘BYOK’) or use the ScyllaDB Cloud key.

We also reached general availability of our Rust driver. This is interesting because it’s our fastest driver. Also, its bindings are the foundation for our grand plan: unifying our drivers on the Rust infrastructure. We started with a new C++ driver. Next up (almost ready) is our NodeJS driver – and we’ll continue with others as well. We also released our C# driver (another popular demand) and, across the board, improved our drivers’ reliability, capabilities, compatibility, and performance. Finally, our Alternator clients in various languages received some important updates as well, such as applying network compression to both requests and responses.

What’s Next for ScyllaDB

Finally, let’s close with a glimpse at some of the major things we expect to deliver in 2026:

An efficient, easy-to-use, and online process for migrating from vNodes to tablets.
Strong consistency for data (beyond metadata and internal features).
A fast dedicated data path for key-value data.
Additional capabilities focused on 1) improved total cost of ownership and 2) more real-time AI features and integrations.

There’s a lot to look forward to! We’d love to answer questions or hear your feedback.

Rethinking “Designing Data-Intensive Applications”

How Martin Kleppmann’s iconic book evolved for AI and cloud-native architectures

Since its release in 2017, Designing Data-Intensive Applications (DDIA) has become known as the bible for anyone working on large-scale, data-driven systems. The book’s focus on fundamentals (like storage engines, replication, and partitioning) has helped it age well. Still, the world of distributed systems has evolved substantially in the past decade. Cloud-native architectures are now the default. Object storage has become a first-class building block. Databases increasingly run everywhere from embedded and edge deployments to fully managed cloud services. And AI-driven workloads have made a mark on storage formats, query engines, and indexing strategies.

So, after years of requests for a second edition, Martin Kleppmann revisited the book – and he enlisted his longtime collaborator Chris Riccomini as a co-author. Their goal: keep the core intact while weaving in new ideas and refreshing the details throughout.

With the second edition of DDIA now arriving at people’s doorsteps, let’s look at what Kleppmann and Riccomini shared about the project at Monster Scale Summit 2025. That conversation was recorded mid-revision. You can watch the entire chat or read highlights below.

Extra: Hear what Kleppmann and Riccomini had to say when the book went off to the printers this year. They were featured again at Monster SCALE Summit 2026, which just concluded. This talk is available on-demand, alongside 60+ others by antirez, Pat Helland, Camille Fournier, Discord, Disney, and more.

People have been requesting a second edition for a while – why now?

Kleppmann: Yeah, I’ve long wanted to do a second edition, but I also have a lot of other things I want to do, so it’s always a matter of prioritization. In the end, though, I felt that the technology had moved on enough. Most of the book focuses on fundamental concepts that don’t change that quickly.
They change on the timescale of decades, not months. In that sense, it’s a nice book to revise, because there aren’t that many big changes. At the same time, many details have changed. The biggest one is how people use cloud services today compared to ten years ago. Cloud existed back then, of course, but I think its impact on data systems architecture has only been felt in the last few years. That’s something we’re trying to work in throughout the second edition of the book.

How have database architectures evolved since the first edition?

Riccomini: I’ve been thinking about whether database architecture has changed now that we’re putting databases in more places – cloud, BYOC, on-prem, on-client, on-edge, etc. I think so. I have a hypothesis that successful future databases will be able to move or scale with you, from your laptop, to your server, to your cloud, and even to another cloud. We’re already seeing evidence of this with things like DuckDB and MotherDuck, which span embedded to cloud-based use cases. I think PGlite and Postgres are another example, where you see Postgres being embedded. SQLite is an obvious signal on the embedded side. On the query engine side, you see this with systems like Daft, which can run locally and in a distributed setting. So the answer is yes.

The biggest shift is the split between control, data, and compute planes. That architecture is widely accepted now and has proven flexible when moving between SaaS and BYOC. There’s also an operational aspect to this. When you talk about SaaS versus non-SaaS, you’re talking about things like multi-tenancy and how much you can leverage cloud-provider infrastructure. I had an interesting discussion about Confluent Freight versus WarpStream, two competing Kafka streaming systems. Freight is built to take advantage of a lot of the in-cloud SaaS infrastructure Confluent has, while WarpStream is built more like a BYOC system and doesn’t rely on things like a custom network plane.
Operationally, there’s a lot to consider regarding security and multi-tenancy, and I’m not sure we’re as far along as we could be. A lot of what SaaS companies are doing still feels proprietary and internal. That’s my read on the situation.

Kleppmann: I’d add a little to that. At the time of the first edition, the model for databases was that a node ran on a machine, storing data on its local file system. Storage was local disks, and replication happened at the application layer on top of that. Now we’re increasingly seeing a model where storage is an object store. It’s not a local file system; it’s a remote service, and it’s already replicated. Building on top of an abstraction like object storage lets you do fundamentally different things compared to local disk storage. I’m not saying one is better than the other – there are always tradeoffs. But this represents a new point in the tradeoff space that really wasn’t present at the time of the first edition. Of course, object stores existed back then, but far fewer databases took advantage of them in the way people do now.

We’ve seen a proliferation of specialized databases recently – do you think we’re moving toward consolidation?

Riccomini: With my investment hat on, this is the million-dollar…or billion-dollar…question. Which of these is it? I think the answer is probably both. The reality is that Postgres has really taken the world by storm lately, and its extension ecosystem has become pretty robust in recent versions. For most people, when they’re starting out, they’re naturally going to build on something like Postgres. They’ll use pg_search or something similar for their search index, pgvector for vector embeddings, and pg_analytics or pg_duckdb for their data lake. Then the question is: as they scale, will that still be okay? And in some cases, yes. In other cases, no.
My personal hypothesis is that as you not only scale up but also need features that are core to your product, you’re more likely to move to a specialized system. pgvector, for example, is a reasonable starting point. But if your entire product is something like Cursor or an IDE that does code completion, you probably need something more robust and scalable than pgvector can provide. At that point, you’d likely look at something like Pinecone or Turbopuffer or companies like that. So I think it’s both. And because Postgres is going to eat the bottom of the market, I do think there will be fewer specialized vendors, but I don’t think they’ll disappear entirely.

What are some of the key tradeoffs you see with streaming systems today?

Kleppmann: Streaming sits in a slightly weird space. A typical stream processor has a one-record-at-a-time, callback-based API. It’s very imperative. On top of that, you can build things like relational operators and query plans. But if you keep pushing in that direction, the result starts to look much more like a database that does incremental view maintenance. There are projects like Materialize that are aiming there. You just give it a SQL query, and the fact that it’s streaming is an implementation detail that’s almost hidden. Maybe that’s where many of these systems end up: if you have a query you can express in SQL, you hand it off to one of these systems and let it maintain the view. And what we currently think of as streaming, with the lower-level APIs, gets used for a more specialized set of applications. That might be very high-scale use cases, or queries that just don’t fit well into a relational style.

Riccomini: Another thing I’d add is the fundamental tradeoff between latency and throughput, which most streaming systems have to deal with. Ideally, you want the lowest possible latency. But when you do that, it becomes harder to get high throughput. The usual way to increase throughput is to batch writes.
But as soon as you start batching writes, you increase the latency between when a message is sent and when it’s received.

How is AI impacting data-intensive applications?

Kleppmann: There’s some AI-plus-data work we’ve been exploring in research (not really part of the book). The idea is this: if you want to give an AI some control over a system – if it’s allowed to press buttons that affect data, like editing or updating it – then the safest way to do that is through a well-defined API. That API defines which buttons the AI is allowed to press, and those actions correspond to things that make sense and maybe fulfill certain consistency properties. More generally, it seems important to have interfaces that allow AI agents and humans to work safely together, with the database as common ground. Humans can update data, the AI can update data, and both can see each other’s changes. You can imagine workflows where changes are reviewed, compared, and merged. Those kinds of processes will be necessary if we want good collaboration between humans and AI systems.

Riccomini: From an implementation perspective, storage formats are definitely going to evolve. We’re already seeing this with systems like LanceDB, which are trying to support multimodal data better. Arrow, for example, is built for columnar data, which may not be the best fit for some multimodal use cases. And this goes beyond storage into things like Arrow RPC as well. On the query engine side, there’s also a lot of ongoing work around query optimization and indexing. The idea is to build smarter databases that can look at query patterns and adjust themselves over time. There was a good paper from Google a while back that used more traditional machine learning techniques to do dynamic indexing based on query patterns. That line of work will continue. And then, of course, support for embeddings, vector search, and semantic search will become more common.
Good integrations with RAG (Retrieval-Augmented Generation) are also important. We’re still very much at the forefront of all of this, so it’s tricky.

Watch the 2026 Chat with Martin and Chris

Monster SCALE Summit 2026 Recap: From Database Elasticity to Nanosecond AI

It seemed like just yesterday I was here in the US hosting P99 CONF…and then suddenly I’m back again for Monster SCALE Summit 2026. The pace of it all says something – not so much about my travel between the US and Australia, but about how quickly this space is moving. Monster SCALE Summit delivered an overwhelming amount of content – so much that you need both time and distance to process it all. Sitting here now at Los Angeles International Airport, I can conclude this was the strongest summit to date.

Watch sessions on-demand

Day 1

To start with, the keynotes. As expected, Dor Laor set the tone with a strong opening – making a clear case that elasticity is the defining capability behind ScyllaDB operating at scale. “Elastic” is one of those overloaded terms in cloud infrastructure, but the distinction here was sharp. The focus was on a database that continuously adapts under load without breaking performance.

Hot on the heels of Dor was the team from Discord, showing us just how they operate at scale while automating everything from standing up new clusters to expanding them under load, all with ScyllaDB. What stood out wasn’t just the scale but also the level of automation. I mention it at every conference, but [Discord engineer] Bo Ingram’s ScyllaDB in Action book is required reading for anyone running databases at scale.

Since the conference was hosted by ScyllaDB, there was no shortage of deep dives into operating at scale. Felipe Cardeneti Mendes, author of Database Performance at Scale, expanded on the engineering behind ScyllaDB. Tablets play a key role as a core abstraction that links predictable performance with true elasticity. If you didn’t get a chance to join any of the live lounge events watching Felipe scale ScyllaDB to millions of operations per second, you can try it yourself. The gap between demo and production is enticingly small, so it’s well worth a whirl.
From the customer side, Freshworks and ShareChat both gave grounded presentations about super low latency and super low costs, focused on what actually matters in production.

Enough of ScyllaDB for a bit, though. Let me recommend Joy Gao’s presentation from ClickHouse about cross-system database replication at petabyte scale. As you would expect, that’s a difficult problem to solve. Or if you want to know about ad reporting at scale, check out the presentation from Pinterest.

The presentation of the day for me was Thea Aarrestad talking about nanosecond AI at the Large Hadron Collider. The metrics in this presentation were insane, with datasets far exceeding the size of the Netflix catalog being generated per second…all in the pursuit of capturing the rarest particle interactions in real time. It’s a useful reminder that at true extreme scale, you don’t store then analyze. Instead, you analyze inline, or you lose the data forever. A big thanks to Thea. We asked for monster scale, and you certainly delivered.

At the end of day one, we had another dense block of quality content from equally strong presenters. Long-time favourite Ben Cane from American Express is always entertaining and informative. We had some great research presented by Murat Demirbas from MongoDB around designing for distributed systems. And on the topic of distributed systems, Stephanie Wang showed us how to avoid some distributed traps with fast, simple analytics using DuckDB. Miguel Young de la Sota from Buf Technologies indulged us with speedrunning Super Mario 64 and its relationship to performance engineering. In parallel, Yaniv from ScyllaDB core engineering showed us what’s new and what’s next for our database.

Pat Helland and Daniel May finished Day 1 with a closing keynote on “Yours, Mine, and Ours,” which was all about set reconciliation in the distributed world. This is one of those topics that sounds niche but sits at the core of correctness at scale.
When systems diverge – and they always do – then reconciliation becomes the real problem, not replication. Pat is a legend in this space and his writing at https://pathelland.substack.com/ expands on many of these ideas.

Day 2

Day 2 got off to an early start with a pre-show lounge hosted by Tzach Livyatan and Attila Toth sharing tips for getting started with ScyllaDB. The opening keynote from Avi Kivity went deep into the mechanics behind ScyllaDB tablets, with the kind of under-the-hood engineering that makes extreme scale tractable rather than theoretical.

I then had the opportunity to host Camille Fournier, joining remotely for a live session on engineering leadership and platform strategy under increasing system complexity. AI was a recurring theme, but the sharper takeaway was how platform teams need to evolve their role as abstraction layers become more dynamic and less predictable. Camille also stayed on for a live Q&A, engaging directly with the audience on both leadership and the practical realities of operating modern platforms at scale. Big thanks to Camille!

The next block shifted back into deep technical content, starting with Dominik Tornow from Resonate HQ, sharing a first-principles approach to agentic systems. Alex Dathskovsky followed with a forward-looking session on the future of data consistency in ScyllaDB, reframing consistency not as a static tradeoff but as something increasingly adaptive and workload-aware. Szymon Wasik went deep on ScyllaDB vector search internals with real-time performance at billion-vector scale. It was great to see how ScyllaDB avoids the typical latency cliffs seen in ANN systems. From LinkedIn, Satya Lakshmikanth shared practical lessons from operating large-scale data systems in production, grounding the conversation in real-world constraints. Asias He showed us how incremental repair for ScyllaDB tablets is much more operationally feasible without the traditional overhead around consistency.
We also heard from Tyler Denton, showcasing ScyllaDB vector search in action across two use cases (RAG with Hybrid Memory and anomaly detection at scale). Before lunch, another top-quality keynote from TigerBeetle’s Joran Dirk Greef, who had plenty of quotable quotes in his talk, “You Can’t Scale When You’re Dead.” And crowd favorite antirez from Redis gave us his take on HNSW indexes and all the tradeoffs that need to be considered.

The final block in an already monster-sized conference followed up with Teiva Harsanyi from Google, sharing lessons drawn from large-scale distributed systems like Gemini. MoEngage brought a practitioner view on operating high-throughput, user-facing systems where latency directly impacts engagement. Benny Halevy outlined the path to tiered storage on ScyllaDB. Rivian discussed real-world automotive data platforms, and Brian Jones from SAS highlighted how they tackled the tension between batch and real-time workloads at scale by applying ScyllaDB. I also enjoyed the AWS talk from KT Tambe with Tzach Livyatan, tying cloud infrastructure back to database design decisions with the I8g family. To wrap the conference, we heard from Brendan Cox on how Sprig simplified their database infrastructure and achieved predictable low latency with – I’m sure you’ve guessed by now – ScyllaDB.

Throughout the event, attendees could also binge-watch talks in Instant Access, which featured gems from Martin Kleppmann and Chris Riccomini, Disney/Hulu, DBOS, Tiket, and more.

Prepare for Takeoff

Final call to board my monstrous 15-hour flight home… It’s a great privilege to host these virtual conferences live, backed by the team at ScyllaDB. The quality of this event ultimately comes down to the people – both the presenters who bring great depth and detail and the audience that keeps the conversation engaged and interesting. Thanks to everyone who contributed to making this the best Monster SCALE Summit to date.
Catch up on what you missed

If you want to chat with our engineers about how ScyllaDB might work with your use case, book a technical strategy session.

Announcing ScyllaDB Operator, with Red Hat OpenShift Certification

OpenShift users gain a trusted, validated path for installing and managing ScyllaDB Operator – backed by enterprise-grade support

ScyllaDB Operator is an open-source project that helps you run ScyllaDB on Kubernetes by managing ScyllaDB clusters deployed to Kubernetes and automating the tasks involved in operating them. For example, it automates installation, vertical and horizontal scaling, and rolling upgrades. The latest release (version 1.20) is now Red Hat certified and available directly in the Red Hat OpenShift ecosystem catalog. It also brings new features, stability improvements, and documentation updates.

Red Hat OpenShift Certification and Catalog Availability

OpenShift has become a cornerstone platform for enterprise Kubernetes deployments, and we’ve been working to ensure ScyllaDB Operator feels like a native part of that ecosystem. With ScyllaDB Operator 1.20, we’re taking a significant step forward: the operator is now Red Hat certified and available directly in the Red Hat OpenShift ecosystem catalog. See the ScyllaDB Operator project in the Red Hat Ecosystem Catalog.

This milestone gives OpenShift users a trusted, validated path for installing and managing ScyllaDB Operator – backed by enterprise-grade support. With this release, you can install ScyllaDB Operator through OLM (Operator Lifecycle Manager) using either the OpenShift Web Console or the CLI. For detailed installation instructions and OpenShift-specific configuration examples – including guidance for platforms like Red Hat OpenShift Service on AWS (ROSA) – see the Installing ScyllaDB Operator on OpenShift guide.

IPv6 support

ScyllaDB Operator 1.20 brings native support for IPv6 and dual-stack networking to your ScyllaDB clusters. With dual-stack support, your ScyllaDB clusters can operate with both IPv4 and IPv6 addresses simultaneously. That provides flexibility for gradual migration scenarios or environments requiring support for both protocols.
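As a rough illustration only: the source states that IP addressing is controlled through v1.ScyllaCluster’s .spec.network, but the field shown inside it below is hypothetical, not the actual API. Consult the IPv6 networking documentation for the real field names.

```yaml
apiVersion: scylla.scylladb.com/v1
kind: ScyllaCluster
metadata:
  name: scylla
spec:
  network:
    # Hypothetical field for illustration: expresses an IPv6-first
    # dual-stack preference. Check the IPv6 networking docs for the
    # actual schema under .spec.network.
    ipFamilyPreference: IPv6DualStack
```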
You can also configure IPv6-first deployments, where ScyllaDB uses IPv6 for internal communication while remaining accessible via both protocols. You can control the IP addressing behaviour of your cluster through v1.ScyllaCluster’s .spec.network. The API abstracts away the underlying ScyllaDB configuration complexity so you can focus on your networking requirements rather than implementation details.

With this release, IPv4-first dual-stack and IPv6-first dual-stack configurations are production-ready. IPv6-only single-stack mode is available as an experimental feature under active development; it’s not recommended for production use. See Production readiness for details. Learn more about IPv6 networking in the IPv6 networking documentation.

Other notable changes

Documentation improvements

This release includes a new guide on automatic data cleanups. It explains how ScyllaDB Operator ensures that your ScyllaDB clusters maintain storage efficiency and data integrity by removing stale data and preventing data resurrection.

Dependency updates

This release also includes regular updates of ScyllaDB Monitoring and the packaged dashboards to support the latest ScyllaDB releases (4.12.1->4.14.0, #3250), as well as its dependencies: Grafana (12.2.0->12.3.2) and Prometheus (v3.6.0->v3.9.1). For more changes and details, check out the GitHub release notes.

Upgrade instructions

For instructions on upgrading ScyllaDB Operator to 1.20, please refer to the Upgrading ScyllaDB Operator section of the documentation.
Supported versions
  • ScyllaDB 2024.1, 2025.1, 2025.3 – 2025.4
  • Red Hat OpenShift 4.20
  • Kubernetes 1.32 – 1.35
  • Container Runtime Interface API v1
  • ScyllaDB Manager 3.7 – 3.8

Getting started with ScyllaDB Operator
  • ScyllaDB Operator Documentation
  • Learn how to deploy ScyllaDB on Google Kubernetes Engine (GKE)
  • Learn how to deploy ScyllaDB on Amazon Elastic Kubernetes Service (EKS)
  • Learn how to deploy ScyllaDB on a Kubernetes Cluster

Related links
  • ScyllaDB Operator source (on GitHub)
  • ScyllaDB Operator image on DockerHub
  • ScyllaDB Operator Helm Chart repository
  • ScyllaDB Operator documentation
  • ScyllaDB Operator for Kubernetes lesson in ScyllaDB University

Report a problem

Your feedback is always welcome! Feel free to open an issue or reach out on the #scylla-operator channel in ScyllaDB User Slack.

Instaclustr product update: March 2026

Here’s a roundup of the latest features and updates that we’ve recently released.

If you have any particular feature requests or enhancement ideas that you would like to see, please get in touch with us.

Major announcements

Introducing AI Cluster Health: Smarter monitoring made simple

Turn complex metrics into clear, actionable insights with AI-powered health indicators—now available in the NetApp Instaclustr console. The new AI Cluster Health page simplifies cluster monitoring, making it easy to understand your cluster’s state at a glance without requiring deep technical expertise. This AI-driven analysis reviews recent metrics, highlights key indicators, explains their impact, and assigns an easy traffic-light health score for a quick status overview.

NetApp introduces Apache Kafka® Tiered Storage support on GCP in public preview

Tiered Storage is now available in public preview for Instaclustr for Apache Kafka on Google Cloud Platform. This feature enables customers to optimize storage costs and improve scalability by offloading older Kafka log segments from local disk to Google Cloud Storage (GCS), while keeping active data local for fast access. Kafka clients continue to consume data seamlessly with no changes required, allowing teams to reduce infrastructure costs, simplify cluster scaling, and extend retention periods for analytics or compliance.

Other significant changes

Apache Cassandra®
  • Released Apache Cassandra 4.0.19 and 5.0.6 into general availability on the NetApp Instaclustr Managed Platform, giving customers access to the latest stability, performance, and security improvements.
  • Multi–data center Apache Cassandra clusters can now be provisioned across public and private networks via the Instaclustr Console, API, and Terraform provider, enabling customers to provision multi-DC clusters from day one.
  • Single CA feature is now available for Apache Cassandra clusters. See Single CA for more details.
  • GCP n4 machine types are now supported for Apache Cassandra in the available regions.
Apache Kafka®
  • Apache Kafka and Kafka Connect 4.1.1 are now generally available.
  • Added Client Metrics and Observability feature support for Kafka in private preview.
  • Single CA feature is now available for Apache Kafka clusters. See Single CA for more details.
ClickHouse®
  • ClickHouse 25.8.11 has been added to our managed platform in General Availability.
  • Enabled the ClickHouse system.session_log table, which plays a key role in tracking session lifecycle and auditing user activity for enhanced session monitoring. This helps you troubleshoot client-side connectivity issues and provides insights into failed connections.
OpenSearch®
  • OpenSearch 2.19.4 and 3.3.2 have been released to general availability.
  • Added support for the OpenSearch Assistant feature in OpenSearch Dashboards for clusters with Dashboards and AI Search enabled.
PostgreSQL®

Instaclustr Managed Platform
  • Added self-service Tags Management feature—allowing users to add, edit, or delete tags for their clusters directly through the Instaclustr console, APIs, or Terraform provider for RIYOA deployments.
  • Added new region Germany West Central for Azure.
Future releases

Apache Kafka®
  • Following the private preview release, Kafka’s Client Telemetry feature is progressing toward general availability soon. Read more here.
ClickHouse®
  • We plan to extend the current ClickHouse integration with FSxN data sources by adding support for deployments across different VPCs, enabling broader enterprise lakehouse architectures.
  • Apache Iceberg and Delta Lake integration are planned to soon be available for ClickHouse on the NetApp Instaclustr Platform, giving you a practical way to run analytics on open table formats while keeping control of your existing data platforms.
  • We plan to soon introduce fully integrated AWS PrivateLink as a ClickHouse Add-On for secure and seamless connectivity with ClickHouse.
PostgreSQL®
  • We’re aiming to launch PostgreSQL integrated with FSx for NetApp ONTAP (FSxN) along with NVMe support into general availability soon. This enhancement is designed to combine enterprise-grade PostgreSQL with FSxN’s scalable, cost-efficient storage, enabling customers to optimize infrastructure costs while improving performance and flexibility. NVMe support is designed to deliver up to 20% greater throughput vs NFS.
OpenSearch®
  • An AI Search plugin for OpenSearch is being released to GA (currently in public preview) to enhance search experiences using AI‑powered techniques such as semantic, hybrid, and conversational search, enabling more relevant, context‑aware results and unlocking new use cases including retrieval‑augmented generation (RAG) and AI‑driven chatbots.
Instaclustr Managed Platform
  • Following the public preview release, Zero Inbound Access is progressing to General Availability, designed to deliver the most secure management connectivity by eliminating inbound internet exposure and removing the need for any routable public IP addresses, including bastion or gateway instances.

If you have any questions or need further assistance with these enhancements to the Instaclustr Managed Platform, please contact us.

SAFE HARBOR STATEMENT: Any unreleased services or features referenced in this blog are not currently available and may not be made generally available on time or at all, as may be determined in NetApp’s sole discretion. Any such referenced services or features do not represent promises to deliver, commitments, or obligations of NetApp and may not be incorporated into any contract. Customers should make their purchase decisions based upon services and features that are currently generally available. 

The post Instaclustr product update: March 2026 appeared first on Instaclustr.

Integrated Gauges: Lessons Learned Monitoring Seastar’s IO Stack

Why integrated gauges are a better approach for measuring frequently changing values

Many performance metrics and system parameters are inherently volatile or fluctuate rapidly. When using a monitoring system that periodically “scrapes” (polls) a target for its current metric value, the collected data point is merely a snapshot of the system’s state at that precise moment. It doesn’t reveal much about what’s actually happening in that area. Sometimes it’s possible to overcome this problem by accumulating those values somehow – for example, by using histograms or exporting a derived monotonically increasing counter. This article suggests yet another way to extend this approach for a broader set of frequently changing parameters.

The Problem with Instantaneous Metrics

For rapidly changing values such as queue lengths or request latencies, a single snapshot is most often misleading. If the scrape happens during a peak load, the value will appear high. If it happens during an idle moment, the value will appear low. Over time, the sampled data usually does not accurately represent the system’s true average behavior, leading to misinformed alerting or capacity planning.

This phenomenon is apparent when examining synthetically generated data. Take a look at the example below. On that plot, there’s a randomly generated XY series. The horizontal axis represents the event number; the vertical axis represents the frequently fluctuating value. The “value” changes on every event.

Fig.1 Random frequently changing data

Now let’s thin the data out and see what would happen if we show only every 40th point, as if it were a real monitoring system that captures the value at a notably lower rate. The next plot shows what it would look like.

Fig.2 Sampling every 40th point from the data

Apparently, this is a very poor approximation of the first plot. It becomes crystal clear how poor it is if we zoom into the range [80-120] of both plots.
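The sampling loss is easy to reproduce with a quick simulation. The data below is a synthetic stand-in for the article’s randomly generated series (the numbers are made up), with the same 40:1 ratio of updates to scrapes:

```python
import random

random.seed(42)

# A rapidly changing value: one data point per event, as in Fig.1.
events = [random.randint(0, 10) for _ in range(4000)]

# A monitoring system that scrapes only every 40th point, as in Fig.2.
scraped = events[::40]

true_mean = sum(events) / len(events)        # close to 5 for this series
scraped_mean = sum(scraped) / len(scraped)   # a noisy estimate at best

# Worse: a spike that lives entirely between two scrapes is invisible.
events_with_spike = list(events)
for i in range(81, 119):                     # inside the [80, 120] window
    events_with_spike[i] = 100
scraped_spiked = events_with_spike[::40]
assert scraped_spiked == scraped             # the scraper never saw it
```

With a realistic ratio — thousands of updates per multi-minute scrape period — the distortion is proportionally worse.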
For the latter (dispersed) plot, we’ll see a couple of dots with values of 2 and 0. But the former (original) plot would reveal drastically different information, shown below.

Fig.3 Zoom-in range [80-120] from the data

Now remember that the problem revealed itself with a real-rate-to-scrape-rate ratio of just 40. In real systems, internal parameters may change thousands of times per second while the scrape period is minutes. In that case, you end up scraping less than a fraction of a percent. A better way to monitor the frequently changing data is needed…badly.

Histograms

Request latency is one example of a statistic that usually contains many data points per second. Histograms can help you see how slow or fast the requests are – without falling into the problem described above. A monitoring histogram is a type of metric that samples observations and counts them in configurable buckets. Instead of recording just a single average or maximum latency value, a histogram provides a distribution of these values over a given period. By analyzing the bucket counts, you can calculate percentiles, including p50 (known as the median value) or p99 (the value below which 99% of all requests fell). This provides a much more robust and actionable view of performance volatility than simple instantaneous gauges or averages.

Fig.4 Histogram built from data X-points

It’s a static snapshot of the data at the last “time point.” If the values change over time, modern monitoring systems can offer various data views, such as time-series percentiles or heatmaps, to provide the necessary insight into the data. Histograms, however, have a drawback: They require system memory to work, node networking throughput to get reported, and monitoring disk space to be stored. A histogram of N buckets consumes N times more resources than a single value.
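To show how percentiles fall out of bucket counts, here is a minimal sketch. The bucket bounds and counts are invented, and the interpolation mirrors the general idea behind PromQL’s histogram_quantile() rather than its exact implementation:

```python
# Prometheus-style cumulative histogram: each bucket counts observations
# with value <= its upper bound ("le"). Bounds/counts here are made up.
buckets = [               # (le_ms, cumulative_count)
    (1, 520),
    (5, 910),
    (10, 975),
    (50, 998),
    (float("inf"), 1000),
]

def percentile(buckets, q):
    """Estimate the q-quantile (0 < q <= 1) from cumulative buckets,
    interpolating linearly inside the bucket that contains the target."""
    total = buckets[-1][1]
    target = q * total
    prev_le, prev_count = 0.0, 0
    for le, count in buckets:
        if count >= target:
            if le == float("inf"):
                return prev_le        # cannot interpolate into +Inf
            return prev_le + (le - prev_le) * (target - prev_count) / (count - prev_count)
        prev_le, prev_count = le, count

p50 = percentile(buckets, 0.50)   # median
p99 = percentile(buckets, 0.99)
```

Note how the accuracy of p50 and p99 depends entirely on the bucket layout — which is exactly the memory/accuracy trade-off the article describes.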
Addable latencies

When it comes to reporting latencies, a good compromise between efficiency and informativeness is “counter” metrics coupled with the rate() function.

Counters and rates

Counter metrics are one of the simplest and most fundamental metric types in the Prometheus monitoring system. A counter is a cumulative metric that represents a single monotonically increasing value, which can only increase or be reset to zero upon a restart of the monitored target. Because a counter itself is just an ever-increasing total, it is rarely analyzed directly. Instead, one derives another meaningful metric from counters, most notably a “rate.”

Here is a mathematical description of how the Prometheus rate() function works. Internally, the system calculates the total sum of the observed values. After scraping the value several times, the monitoring system calculates the “rate” of the value: the total value difference from the previous scrape, divided by the time passed since then. The rated value thus shows the average value over the specified period of time.

Now let’s get back to latencies. To report average latency over the scraping period using counters, the system should accumulate two of them: the total sum of request latencies (e.g., in milliseconds) and the total number of completed requests. The monitoring system would then “rate” both counters to get their averages and divide them, thus showing the average per-request latency. It looks like this:

average latency = rate(total latency) / rate(total requests)

Since the total number of requests served is useful on its own to see the IOPS value, this method is as efficient as exporting the immediate latency value. Yet, it provides a much more stable and representative metric than an instantaneous gauge.

We can try to apply counters to the artificial data that was introduced earlier. The plot below shows what the averaged values for the sub-ranges of length 40 (the sampling step from above) would look like.
It differs greatly from the sampled plot of Figure 2 and provides a more accurate approximation of the real data set.

Fig.5 Average values for sub-ranges of length 40

Of course, this method has its drawbacks when compared to histograms. It swallows latency spikes that are short relative to the scraping period. The shorter the spike duration, the more likely it is to go unnoticed.

Queue length

Summing up individual values is not always possible, though. Queue length is another example of a statistic that’s also volatile and may contain thousands of data points. For example, this could be the queue of tasks to be executed or the queue of IO requests. Whenever an entry is added to the queue, its length increases. When an entry is removed, the length decreases – but it makes little sense to add up the updated queue length elsewhere.

If we return to Figure 3 showing the real values in the range of [80-120], what would the average queue length over that period be? Apparently, it’s the sum of individual values divided by their number. But even though we understand what the “average request latency” is, the idea of “average queue length” is harder to accept, mainly because the X value of the above data is usually not an “event” but rather a “time.” And if, for example, a queue was empty most of the time and then changed to N elements for a few milliseconds, we’d have just two events that changed the queue. Seeing the average queue length of N/2 is counterintuitive. Some time ago, we researched how the “real” length of an IO queue can be implicitly derived from the net sum of request latencies. Here, we’ll show another approach to getting an idea of the queue length.

Integrated gauge

The approach is a generalization of averaging the sum of individual data points from the data set using counters and the “rate” function.
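The counter-plus-rate() scheme being generalized here can be sketched in a few lines. This is a minimal illustration with made-up names and numbers, not real Prometheus client code:

```python
class LatencyCounters:
    """Two monotonically increasing counters, as described above:
    total latency observed and total requests completed."""
    def __init__(self):
        self.latency_sum_ms = 0.0
        self.request_count = 0

    def observe(self, latency_ms):
        self.latency_sum_ms += latency_ms
        self.request_count += 1

def rate(prev, curr, dt_seconds):
    """Prometheus-style rate(): per-second increase between two scrapes."""
    return (curr - prev) / dt_seconds

c = LatencyCounters()
scrape1 = (c.latency_sum_ms, c.request_count)

for lat_ms in (2.0, 3.0, 10.0, 1.0):     # four requests between scrapes
    c.observe(lat_ms)

scrape2 = (c.latency_sum_ms, c.request_count)
dt = 15.0                                 # 15 s scrape interval

iops = rate(scrape1[1], scrape2[1], dt)                  # requests/s
avg_latency_ms = rate(scrape1[0], scrape2[0], dt) / iops # ms per request
```

Dividing the two rates cancels the time term, leaving the average per-request latency over the whole scrape interval — exactly the stable metric the previous section describes.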
Let’s return to our synthetic data when it was summed and rated (Figure 3) and look at the corresponding math from a slightly different angle. If we treat each value point as a bar with a width of 1 and a height equal to the value itself, we can see that:

  • The total value is the total sum of the bars’ areas
  • The difference between two scraped total values is the sum of the areas of the bars between two scraping points
  • The number of points in the range equals the distance between those two scraping points

Fig.6 Geometrical interpretation of the exported counters

Figure 6 shows this area in green. The average value is thus the height of an imaginary rectangle whose width equals the scraping period. If the distance between two adjacent points is not 1 but some duration, the interpretation still works. We can still calculate the area under the plot and divide it by the area’s width to get its “average” value. But we need to apply two changes.

First, the area of an individual bar is now the product of the value itself and the duration of time passed since the previous point. Think of it this way: previously, the duration was 1, so the multiplication was invisible. Now, it is explicit.

Second, we no longer need to count the number of points in the range. Instead, the denominator of the averaging function will be the duration between scraping points. In other words, previously the number of points was the sum of 1-s (ones) between scraping points; now, it’s the sum of durations – the total scraping period.

It’s now clear that the average value is the result of applying the rate() function to the total value.

Naming problem (a side note)

There are two hard problems in computer science: cache invalidation, naming, and off-by-one errors. Here we had to face one of them: how to name the newly introduced metric? It is the queue size multiplied by the time the queue exists in that state. A very close analogy comes from project management.
There’s a parameter called a “human-hour,” and it’s widely used to estimate the resources needed to accomplish some task. In physics, there are not many units that measure something multiplied by time. But there are plenty of parameters that measure something divided by time, like velocity (distance per second), electric current (charge per second), or power (energy per second). Even though it’s possible to define things the other way around (e.g., charge as current multiplied by time), it’s still charge that’s the primary unit and current that is secondary. So far, I’ve found quite a few examples of time-multiplied units. In classical physics, one is called “action” and measures energy times time; another is “impulse,” which is measured in newton-seconds. Finally, in kinematics there’s a thing called “absement,” which is literally meters times seconds. It describes the measure of an object’s displacement from its initial position.

Integrated queue length

The approach described above was recently applied to the Seastar IO stack – the length of IO classes’ request queues was patched to accumulate and export the integrated value. Below is a screenshot of a dashboard comparing the compaction class queue lengths, randomly scraped and integrated. Interesting (but maybe hard to accept) are the fractional values of the queue length. That’s OK, since the number shows the average queue length over a scrape period of a few minutes.

A notable advantage of the integrated metric can be seen in the bottom two plots, which show the length of the in-disk queue. Only the integrated metric (bottom right plot) shows that disk load actually decreased over time, while the randomly scraped one (bottom left plot) just tells us that the disk wasn’t idling…but not more than that. This effect of “hiding” reality from the observer can also be seen in the plot that shows how the commitlog class queue length changed over time.
Here we can see a clear rectangular “spike” in the integrated disk queue length plot. That spike was completely hidden by the randomly scraped one. Similarly, the software queue length dynamics (upper pair of plots) are only visible from the integrated counter (upper right plot).

Conclusion

Relying solely on instantaneous metrics, or gauges, to monitor rapidly fluctuating system parameters very often leads to misleading data that poorly reflects the system’s actual behavior over a period of time. While solutions like histograms offer better statistical insights, they incur notable resource overhead. For metrics where the individual values can be aggregated (like request latencies), converting them into cumulative counters and deriving a rate provides a much more stable and representative average over the scraping interval. This offers a very efficient compromise between informative granularity and resource consumption. For metrics where instantaneous values cannot be simply summed (such as queue lengths), the concept of the Integrated Gauge offers a generalization of the same efficiency. In essence, by treating the gauge not as a point-in-time value but as a continuously accumulating measure of time-at-value, integral gauges provide a highly reliable and definitive representation of a parameter’s average behavior across any given measurement interval.
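To make the idea concrete, the integrated-gauge bookkeeping boils down to a few lines. This is an illustrative sketch with synthetic numbers, not the actual Seastar implementation:

```python
class IntegratedGauge:
    """Time-weighted gauge: instead of exporting the instantaneous value,
    it accumulates value * time-at-that-value. A scraper then recovers
    the true average as delta(accumulator) / delta(time), i.e. rate()."""
    def __init__(self, now):
        self.value = 0
        self.accumulated = 0.0   # sum of value * duration ("value-seconds")
        self.last_update = now

    def set(self, value, now):
        # Fold in the time spent at the previous value before switching.
        self.accumulated += self.value * (now - self.last_update)
        self.value = value
        self.last_update = now

    def read(self, now):
        """The monotonically increasing counter exported to monitoring."""
        self.set(self.value, now)   # account for time since last change
        return self.accumulated

g = IntegratedGauge(now=0.0)
s1 = g.read(now=0.0)

# Queue empty for 9.9 s, then 50 entries for 0.1 s: the "hidden spike".
# A plain gauge scraped at t=0 and t=10 would read 0 both times.
g.set(50, now=9.9)
g.set(0, now=10.0)
s2 = g.read(now=10.0)

avg_queue_len = (s2 - s1) / 10.0   # 50 entries * 0.1 s / 10 s = 0.5
```

The fractional 0.5 is exactly the kind of "hard to accept" average queue length the dashboards show — and, unlike the randomly scraped value, it proves the spike happened.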

From ScyllaDB to Kafka: Natura’s Approach to Real-Time Data at Scale

How Natura built a real-time data pipeline to support orders, analytics, and operations

Natura, one of the world’s largest cosmetics companies, relies on a network of millions of beauty consultants generating a massive amount of orders, events, and business data every day. From an infrastructure perspective, this requires processing vast amounts of data to support orders, campaigns, online KPIs, predictive analytics, and commercial operations. Natura’s Rodrigo Luchini (Software Engineering Manager) and Marcus Monteiro (Senior Engineering Tech Lead) shared the technical challenges and architecture behind these use cases at Monster SCALE Summit 2025. Fitting for the conference theme, they explained how the team powers real-time sales insights at massive scale by building upon ScyllaDB’s CDC Source Connector.

About Natura

Rodrigo kicked off the talk with a bit of background on Natura. Natura was founded in 1969 by Antônio Luiz Seabra. It is a Brazilian multinational cosmetics and personal care company known for its commitment to sustainability and ethical sourcing. They were one of the first companies to focus on products connected to beauty, health, and self-care. Natura has three core pillars:

Sustainability: They are committed to producing products without animal testing and to supporting local producers, including communities in the Amazon rainforest.

People: They value diversity and believe in the power of relationships. This belief drives how they work at Natura and is reflected in their beauty consultant network.

Technology: They invest heavily in advanced engineering as well as product development.

The Technical Challenge: Managing Massive Data Volume with Real-Time Updates

The first challenge they face is integrating a high volume of data (just imagine millions of beauty consultants generating data and information every single day). The second challenge is having real-time updated data processing for different consumers across disconnected systems.
Rodrigo explained, “Imagine two different challenges: creating and generating real-time data, and at the same time, delivering it to the network of consultants and for various purposes within Natura.” This is where ScyllaDB comes in. Here’s a look at the initial architecture flow, focusing on the order and API flow. As soon as a beauty consultant places an order, they need to update and process the related data immediately. This is possible for three main reasons. As Rodrigo put it, “The first is that ScyllaDB is a fast database infrastructure that operates in a resilient way. The second is that it is a robust database that replicates data across multiple nodes, which ensures data consistency and reliability. The last but not least reason is scalability – it’s capable of supporting billions of data processing operations.”

The Natura team architected their system as follows: in addition to the order ingestion flow mentioned above, there’s an order metrics flow closely connected to ScyllaDB Cloud (ScyllaDB’s fully managed database-as-a-service offering), as well as the ScyllaDB CDC Source Connector (a Kafka source connector capturing ScyllaDB CDC changes). Together, this enables different use cases in their internal systems. For example, it’s used to determine each beauty consultant’s business plan and also to report data across the direct sales network, including beauty consultants, leaders, and managers. These real-time reports drive business metrics up the chain for accurate, just-in-time decisions. Additionally, the data is used to define strategy and determine what products to offer customers.

Contributing to the ScyllaDB CDC Source Connector

When Natura started testing the ScyllaDB Connector, they noticed a significant spike in the cluster’s resource consumption. This continued until the CDC log table was fully processed, then returned to normal. At that point, the team took a step back.
After reviewing the documentation, they learned that the connector operates with small windows (15 seconds by default) for reading the CDC log tables and sending the results to Kafka. However, at startup, these windows are actually based on the table TTL, which ranged from one to three days in Natura’s use case. Marcus shared: “Now imagine the impact. A massive amount of data, thousands of partitions, and the database reading all of it and staying in that state until the connector catches up to the current time window. So we asked ourselves: ‘Do we really need all the data?’ No. We had already run a full, readable load process for the ScyllaDB tables. What we really needed were just incremental changes, not the last three days, not the last 24 hours, just the last 15 minutes.” So, as ScyllaDB was adding this feature to the GitHub repo, the Natura team created a new option: scylla.custom.window.start. This let them tell the connector exactly where to start, so they could avoid unnecessary reads and relieve unnecessary load on the database. Marcus wrapped up the talk with the payoff: “This results in a highly efficient real-time data capture system that streams the CDC events straight to Kafka. From there, we can do anything—consume the data, store it, or move it to any database. This is a game changer. With this optimization, we unlocked a new level of efficiency and scalability, and this made a real difference for us.”
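As a hedged sketch of how the contributed option might appear in a connector configuration — only scylla.custom.window.start comes from the talk; the connector class, the other property names, and the value format are assumptions to be checked against the connector’s documentation:

```properties
name=natura-orders-cdc
# Assumed connector class for the ScyllaDB CDC Source Connector
connector.class=com.scylladb.cdc.debezium.connector.ScyllaConnector
scylla.cluster.ip.addresses=scylla-node-1:9042
scylla.name=orders-cdc
scylla.table.names=shop.orders
# The option contributed by the Natura team: start reading CDC windows
# from a chosen point in time (e.g. ~15 minutes ago) instead of replaying
# the whole TTL-bounded CDC log. The epoch-millis value format below is
# an illustrative assumption.
scylla.custom.window.start=1767139200000
```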