Rethinking “Designing Data-Intensive Applications”

How Martin Kleppmann’s iconic book evolved for AI and cloud-native architectures Since its release in 2017, Designing Data-Intensive Applications (DDIA) has become known as the bible for anyone working on large-scale, data-driven systems. The book’s focus on fundamentals (like storage engines, replication, and partitioning) has helped it age well. Still, the world of distributed systems has evolved substantially in the past decade. Cloud-native architectures are now the default. Object storage has become a first-class building block. Databases increasingly run everywhere from embedded and edge deployments to fully managed cloud services. And AI-driven workloads have made a mark on storage formats, query engines, and indexing strategies. So, after years of requests for a second edition, Martin Kleppmann revisited the book – and he enlisted his longtime collaborator Chris Riccomini as a co-author. Their goal: Keep the core intact while weaving in new ideas and refreshing the details throughout. With the second edition of DDIA now arriving at people’s doorsteps, let’s look at what Kleppmann and Riccomini shared about the project at Monster Scale Summit 2025. That conversation was recorded mid-revision. You can watch the entire chat or read highlights below.  Extra: Hear what Kleppmann and Riccomini had to say when the book went off to the printers this year.  They were featured again at Monster SCALE Summit 2026, which just concluded. This talk is available on-demand, alongside 60+ others by antirez, Pat Helland, Camille Fournier, Discord, Disney, and  more. People have been requesting a second edition for a while – why now? Kleppmann: Yeah, I’ve long wanted to do a second edition, but I also have a lot of other things I want to do, so it’s always a matter of prioritization. In the end, though, I felt that the technology had moved on enough. Most of the book focuses on fundamental concepts that don’t change that quickly. They change on the timescale of decades, not months. In that sense, it’s a nice book to revise, because there aren’t that many big changes. At the same time, many details have changed. The biggest one is how people use cloud services today compared to ten years ago. Cloud existed back then, of course, but I think its impact on data systems architecture has only been felt in the last few years. That’s something we’re trying to work in throughout the second edition of the book. How have database architectures evolved since the first edition? Riccomini: I’ve been thinking about whether database architecture has changed now that we’re putting databases in more places – cloud, BYOC, on-prem, on-client, on-edge, etc. I think so. I have a hypothesis that successful future databases will be able to move or scale with you, from your laptop, to your server, to your cloud, and even to another cloud. We’re already seeing evidence of this with things like DuckDB and MotherDuck, which span embedded to cloud-based use cases. I think PGlite and Postgres are another example, where you see Postgres being embedded. SQLite is an obvious signal on the embedded side. On the query engine side, you see this with systems like Daft, which can run locally and in a distributed setting. So the answer is yes. The biggest shift is the split between control, data, and compute planes. That architecture is widely accepted now and has proven flexible when moving between SaaS and BYOC. There’s also an operational aspect to this. When you talk about SaaS versus non-SaaS, you’re talking about things like multi-tenancy and how much you can leverage cloud-provider infrastructure. I had an interesting discussion about Confluent Freight versus WarpStream, two competing Kafka streaming systems. Freight is built to take advantage of a lot of the in-cloud SaaS infrastructure Confluent has, while WarpStream is built more like a BYOC system and doesn’t rely on things like a custom network plane. Operationally, there’s a lot to consider regarding security and multi-tenancy, and I’m not sure we’re as far along as we could be. A lot of what SaaS companies are doing still feels proprietary and internal. That’s my read on the situation. Kleppmann: I’d add a little to that. At the time of the first edition, the model for databases was that a node ran on a machine, storing data on its local file system. Storage was local disks, and replication happened at the application layer on top of that. Now we’re increasingly seeing a model where storage is an object store. It’s not a local file system; it’s a remote service, and it’s already replicated. Building on top of an abstraction like object storage lets you do fundamentally different things compared to local disk storage. I’m not saying one is better than the other – there are always tradeoffs. But this represents a new point in the tradeoff space that really wasn’t present at the time of the first edition. Of course, object stores existed back then, but far fewer databases took advantage of them in the way people do now. We’ve seen a proliferation of specialized databases recently – do you think we’re moving toward consolidation? Riccomini: With my investment hat on, this is the million-dollar…or billion-dollar…question. Which of these is it? I think the answer is probably both. The reality is that Postgres has really taken the world by storm lately, and its extension ecosystem has become pretty robust in recent versions. For most people, when they’re starting out, they’re naturally going to build on something like Postgres. They’ll use pg\_search or something similar for their search index, pgvector for vector embeddings, and PG analytics or pg\_duckdb for their data lake. Then the question is: as they scale, will that still be okay? And in some cases, yes. In other cases, no. My personal hypothesis is that as you not only scale up but also need features that are core to your product, you’re more likely to move to a specialized system. pgvector, for example, is a reasonable starting point. But if your entire product is like Cursor AI or an IDE that does code completion, you probably need something more robust and scalable than pgvector can provide. At that point, you’d likely look at something like Pinecone or Turbopuffer or companies like that. So I think it’s both. And because Postgres is going to eat the bottom of the market, I do think there will be fewer specialized vendors, but I don’t think they’ll disappear entirely. What are some of the key tradeoffs you see with streaming systems today? Kleppmann: Streaming sits in a slightly weird space. A typical stream processor has a one-record-at-a-time, callback-based API. It’s very imperative. On top of that, you can build things like relational operators and query plans. But if you keep pushing in that direction, the result starts to look much more like a database that does incremental view maintenance. There are projects like Materialize that are aiming there. You just give it a SQL query, and the fact that it’s streaming is an implementation detail that’s almost hidden. I don’t know if that means the result for many of these systems is this: if you have a query you can express in SQL, you hand it off to one of these systems and let it maintain the view. And what we currently think of as streaming, with the lower-level APIs, is used for a more specialized set of applications. That might be very high-scale use cases, or queries that just don’t fit well into a relational style. Riccomini: Another thing I’d add is the fundamental tradeoff between latency and throughput, which most streaming systems have to deal with. Ideally, you want the lowest possible latency. But when you do that, it becomes harder to get higher throughput. The usual way to increase throughput is to batch writes. But as soon as you start batching writes, you increase the latency between when a message is sent and when it’s received. How is AI impacting data-intensive applications? Kleppmann: There’s some AI-plus-data work we’ve been exploring in research (not really part of the book). The idea is this: if you want to give an AI some control over a system – if it’s allowed to press buttons that affect data, like editing or updating it – then the safest way to do that is through a well-defined API. That API defines which buttons the AI is allowed to press, and those actions correspond to things that make sense and maybe fulfill certain consistency properties. More generally, it seems important to have interfaces that allow AI agents and humans to work safely together, with the database as common ground. Humans can update data, the AI can update data, and both can see each other’s changes. You can imagine workflows where changes are reviewed, compared, and merged. Those kinds of processes will be necessary if we want good collaboration between humans and AI systems. Riccomini: From an implementation perspective, storage formats are definitely going to evolve. We’re already seeing this with systems like LanceDB, which are trying to support multimodal data better. Arrow, for example, is built for columnar data, which may not be the best fit for some multimodal use cases. And this goes beyond storage into things like Arrow RPC as well. On the query engine side, there’s also a lot of ongoing work around query optimization and indexing. The idea is to build smarter databases that can look at query patterns and adjust themselves over time. There was a good paper from Google a while back that used more traditional machine learning techniques to do dynamic indexing based on query patterns. That line of work will continue. And then, of course, support for embeddings, vector search, and semantic search will become more common. Good integrations with RAG (Retrieval-Augmented Generation)…that’s also important. We’re still very much at the forefront of all of this, so it’s tricky. Watch the 2026 Chat with Martin and Chris

Monster SCALE Summit 2026 Recap: From Database Elasticity to Nanosecond AI

It seemed like just yesterday I was here in the US hosting P99 CONF…and then suddenly I’m back again for Monster SCALE Summit 2026. The pace of it all says something – not so much about my travel between the US and Australia, but about how quickly this space is moving. Monster SCALE Summit delivered an overwhelming amount of content – so much that you need both time and distance to process it all. Sitting here now at Los Angeles International Airport, I can conclude this was the strongest summit to date. Watch sessions on-demand Day 1 To start with, the keynotes. As expected, Dor Laor set the tone with a strong opening – making a clear case that elasticity is the defining capability behind ScyllaDB operating at scale. “Elastic” is one of those overloaded terms in cloud infrastructure, but the distinction here was sharp. The focus was on a database that continuously adapts under load without breaking performance. Hot on the heels of Dor was the team from Discord, showing us just how they operate at scale while automating everything from standing up new clusters to expanding them under load, all with ScyllaDB. What stood out wasn’t just the scale but also the level of automation. I mention it at every conference, but [Discord engineer] Bo Ingram’s ScyllaDB in Action book is required reading for anyone running databases at scale. Since the conference was hosted by ScyllaDB, there was no shortage of deep dives into operating at scale. Felipe Cardeneti Mendes, author of Database Performance at Scale, expanded on the engineering behind ScyllaDB. Tablets play a key role as a core abstraction that links predictable performance with true elasticity. If you didn’t get a chance to participate in any of the live lounge events watching Felipe scale ScyllaDB to millions of operations per second, then you can try it yourself. The gap between demo and production is enticingly small, so it’s well worth a whirl. From the customer side, Freshworks and ShareChat, both with their own grounded presentations about super low latency and super low costs, focused on what actually matters in production. Enough of ScyllaDB for a bit, though. Let me recommend Joy Gao’s presentation from ClickHouse about cross-system database replication at petabyte scale. As you would expect, that’s a difficult problem to solve. Or if you want to know about ad reporting at scale, then check out the presentation from Pinterest. The presentation of the day for me was Thea Aarrestad talking about nanosecond AI at the Large Hadron Collider. The metrics in this presentation were insane, with datasets far exceeding the size of the Netflix catalog being generated per second… all in the pursuit of capturing the rarest particle interactions in real time. It’s a useful reminder that at true extreme scale, you don’t store then analyze. Instead, you analyze inline, or you lose the data forever. A big thanks to Thea. We asked for monster scale, and you certainly delivered. At the end of day one, we had another dense block of quality content from equally strong presenters. Long-time favourite, Ben Cane from American Express, is always entertaining and informative. We had some great research presented by Murat Demirbas from MongoDB around designing for distributed systems. And on the topic of distributed systems, Stephanie Wang showed us how to avoid some distributed traps with fast, simple analytics using DuckDB. Miguel Young de la Sota from Buf Technologies indulged us with speedrunning Super Mario 64 and its relationship to performance engineering. In parallel, Yaniv from ScyllaDB core engineering showed us what’s new and what’s next for our database. Pat Helland and Daniel May finished Day 1 with a closing keynote on “Yours, Mine, and Ours,” which was all about set reconciliation in the distributed world. This is one of those topics that sounds niche, but sits at the core of correctness at scale. When systems diverge – and they always do – then reconciliation becomes the real problem, not replication. Pat is a legend in this space and his writing at https://pathelland.substack.com/ expands on many of these ideas. Day 2 Day 2 got off to an early start with a pre-show lounge hosted by Tzach Livyatan and Attila Toth sharing tips for getting started with ScyllaDB. The opening keynote from Avi Kivity went deep into the mechanics behind ScyllaDB tablets, with the kind of under-the-hood engineering that makes extreme scale tractable rather than theoretical. I then had the opportunity to host Camille Fournier, joining remotely for a live session on engineering leadership and platform strategy under increasing system complexity. AI was a recurring theme, but the sharper takeaway was how platform teams need to evolve their role as abstraction layers become more dynamic and less predictable. Camille also stayed on for a live Q&A, engaging directly with the audience on both leadership and the practical realities of operating modern platforms at scale. Big thanks to Camille! The next block shifted back into deep technical content, starting with Dominik Tornow from Resonate HQ, sharing a first-principles approach to agentic systems. Alex Dathskovsky followed with a forward-looking session on the future of data consistency in ScyllaDB, reframing consistency not as a static tradeoff but as something increasingly adaptive and workload-aware. Szymon Wasik went deep on ScyllaDB vector search internals with real-time performance at billion-vector scale. It was great to see how ScyllaDB avoids the typical latency cliffs seen in ANN systems. From LinkedIn, Satya Lakshmikanth shared practical lessons from operating large-scale data systems in production, grounding the conversation in real-world constraints. Asias He showed us how incremental repair for ScyllaDB tablets is much more operationally feasible without the traditional overhead around consistency. We also heard from Tyler Denton, showcasing ScyllaDB vector search in action across two use cases (RAG with Hybrid Memory and anomaly detection at scale). Before lunch, another top quality keynote from TigerBeetle’s Joran Dirk Greef who had plenty of quotable quotes in his talk, “You Can’t Scale When You’re Dead.” And crowd favorite antirez from Redis gave us his take on HNSW indexes and all the tradeoffs that need to be considered. The final block in an already monster-sized conference followed up with Teiva Harsanyi from Google, sharing lessons drawn from large-scale distributed systems like Gemini. MoEngage brought a practitioner view on operating high-throughput, user-facing systems where latency directly impacts engagement. Benny Halevy outlined the path to tiered storage on ScyllaDB. Rivian discussed real-world automotive data platforms and Brian Jones from SAS highlighted how they tackled tension between batch and realtime workloads at scale by applying ScyllaDB. I also enjoyed the AWS talk from KT Tambe with Tzach Livyatan, tying cloud infrastructure back to database design decisions with the I8g family. To wrap the conference, we heard from Brendan Cox on how Sprig simplified their database infrastructure and achieved predictable low latency with – I’m sure you’ve guessed by now – ScyllaDB. Throughout the event, attendees could also binge-watch talks in Instant Access, which featured gems from Martin Kleppmann and Chris Ricommini, Disney/Hulu, DBOS, Tiket, and more. Prepare for Takeoff Final call to board my monstrous 15-hour flight home… It’s a great privilege to host these virtual conferences live backed by the team at ScyllaDB. The quality of this event ultimately comes down to the people – both the presenters who bring great depth and detail and the audience that keeps the conversation engaged and interesting. Thanks to everyone who contributed to making this the best Monster SCALE Summit to date. Catch up on what you missed If you want to chat with our engineers about how ScyllaDB might work with your use case, book a technical strategy session.

Announcing ScyllaDB Operator, with Red Hat OpenShift Certification

OpenShift users gain a trusted, validated path for installing and managing ScyllaDB Operator – backed by enterprise-grade support ScyllaDB Operator is an open-source project that helps you run ScyllaDB on Kubernetes by managing ScyllaDB clusters deployed to Kubernetes and automating tasks related to operating a ScyllaDB cluster. For example, it automates installation, vertical and horizontal scaling, as well as rolling upgrades. The latest release (version 1.20) is now Red Hat certified and available directly in the Red Hat OpenShift ecosystem catalog. Additionally, it brings new features, stability improvements and documentation updates. Red Hat OpenShift Certification and Catalog Availablity OpenShift has become a cornerstone platform for enterprise Kubernetes deployments, and we’ve been working to ensure ScyllaDB Operator feels like a native part of that ecosystem. With ScyllaDB Operator 1.20, we’re taking a significant step forward: the operator is now Red Hat certified and available directly in the Red Hat OpenShift ecosystem catalog. See the ScyllaDB Operator project in the Red Hat Ecosystem Catalog. This milestone gives OpenShift users a trusted, validated path for installing and managing ScyllaDB Operator – backed by enterprise-grade support. With this release, you can install ScyllaDB Operator through OLM (Operator Lifecycle Manager) using either the OpenShift Web Console or CLI. For detailed installation instructions and OpenShift-specific configuration examples – including guidance for platforms like Red Hat OpenShift Service on AWS (ROSA) – see the Installing ScyllaDB Operator on OpenShift guide. IPv6 support ScyllaDB Operator 1.20 brings native support for IPv6 and dual-stack networking to your ScyllaDB clusters. With dual-stack support, your ScyllaDB clusters can operate with both IPv4 and IPv6 addresses simultaneously. That provides flexibility for gradual migration scenarios or environments requiring support for both protocols. You can also configure IPv6-first deployments, where ScyllaDB uses IPv6 for internal communication while remaining accessible via both protocols. You can control the IP addressing behaviour of your cluster through v1.ScyllaCluster’s .spec.network. The API abstracts away the underlying ScyllaDB configuration complexity so you can focus on your networking requirements rather than implementation details. With this release, IPv4-first dual-stack and IPv6-first dual-stack configurations are production-ready. IPv6-only single-stack mode is available as an experimental feature under active development; it’s not recommended for production use. See Production readiness for details. Learn more about IPv6 networking in the IPv6 networking documentation. Other notable changes Documentation improvements This release includes a new guide on automatic data cleanups. It explains how ScyllaDB Operator ensures that your ScyllaDB clusters maintain storage efficiency and data integrity by removing stale data and preventing data resurrection. Dependency updates This release also includes regular updates of ScyllaDB Monitoring and the packaged dashboards to support the latest ScyllaDB releases (4.12.1->4.14.0, #3250), as well as its dependencies: Grafana (12.2.0->12.3.2) and Prometheus (v3.6.0->v3.9.1). For more changes and details, check out the GitHub release notes. Upgrade instructions For instructions on upgrading ScyllaDB Operator to 1.20, please refer to the Upgrading ScyllaDB Operator section of the documentation. Supported versions ScyllaDB 2024.1, 2025.1, 2025.3 – 2025.4 Red Hat OpenShift 4.20 Kubernetes 1.32 – 1.35 Container Runtime Interface API v1 ScyllaDB Manager 3.7 – 3.8 Getting started with ScyllaDB Operator ScyllaDB Operator Documentation Learn how to deploy ScyllaDB on Google Kubernetes Engine (GKE) Learn how to deploy ScyllaDB on Amazon Elastic Kubernetes Engine (EKS)  Learn how to deploy ScyllaDB on a Kubernetes Cluster Related links ScyllaDB Operator source (on GitHub) ScyllaDB Operator image on DockerHub ScyllaDB Operator Helm Chart repository ScyllaDB Operator documentation ScyllaDB Operator for Kubernetes lesson in ScyllaDB University Report a problem Your feedback is always welcome! Feel free to open an issue or reach out on the #scylla-operator channel in ScyllaDB User Slack.  

Instaclustr product update: March 2026

Here’s a roundup of the latest features and updates that we’ve recently released.

If you have any particular feature requests or enhancement ideas that you would like to see, please get in touch with us.

Major announcements Introducing AI Cluster Health: Smarter monitoring made simple

Turn complex metrics into clear, actionable insights with AI-powered health indicators—now available in the NetApp Instaclustr console. The new AI Cluster Health page simplifies cluster monitoring, making it easy to understand your cluster’s state at a glance without requiring deep technical expertise. This AI-driven analysis reviews recent metrics, highlights key indicators, explains their impact, and assigns an easy traffic-light health score for a quick status overview.

NetApp introduces Apache Kafka® Tiered Storage support on GCP in public preview

Tiered Storage is now available in public review for Instaclustr for Apache Kafka on Google Cloud Platform. This feature enables customers to optimize storage costs and improve scalability by offloading older Kafka log segments from local disk to Google Cloud Storage (GCS), while keeping active data local for fast access. Kafka clients continue to consume data seamlessly with no changes required, allowing teams to reduce infrastructure costs, simplify cluster scaling, and extend retention periods for analytics or compliance.

Other significant changes Apache Cassandra®
  • Released Apache Cassandra 4.0.19 and 5.0.6 into general availability on the NetApp Instaclustr Managed Platform, giving customers access to the latest stability, performance, and security improvements.
  • Multi–data center Apache Cassandra clusters can now be provisioned across public and private networks via the Instaclustr Console, API, and Terraform provider, enabling customers to provision multi-DC clusters from day one.
  • Single CA feature is now available for Apache Cassandra clusters. See Single CA for more details.
  • GCP n4 machine types are now supported for Apache Cassandra in the available regions.
Apache Kafka®
  • Apache Kafka and Kafka Connect 4.1.1 are now generally available.
  • Added Client Metrics and Observability feature support for Kafka in private preview.
  • Single CA feature is now available for Apache Kafka clusters. See Single CA for more details.
ClickHouse®
  • ClickHouse 25.8.11 has been added to our managed platform in General Availability.
  • Enabled ClickHouse system.session_log table that plays a key role in tracking session lifecycle and auditing user activities for enhanced session monitoring. This helps you with troubleshooting client-side connectivity issues and provides insights into failed connections.
OpenSearch®
  • OpenSearch 2.19.4 and 3.3.2 have been released to general availability.
  • Added support for the OpenSearch Assistant feature in OpenSearch Dashboards for clusters with Dashboards and AI Search enabled.
PostgreSQL® Instaclustr Managed Platform
  • Added self-service Tags Management feature—allowing users to add, edit, or delete tags for their clusters directly through the Instaclustr console, APIs, or Terraform provider for RIYOA deployments
  • Added new region Germany West Central for Azure
Future releases Apache Kafka®
  • Following the private preview release, Kafka’s Client Telemetry feature is progressing toward general availability soon. Read more here.
ClickHouse®
  • We plan to extend the current ClickHouse integration with FSxN data sources by adding support for deployments across different VPCs, enabling broader enterprise lakehouse architectures.
  • Apache Iceberg and Delta Lake integration are planned to soon be available for ClickHouse on the NetApp Instaclustr Platform, giving you a practical way to run analytics on open table formats while keeping control of your existing data platforms.
  • We plan to soon introduce fully integrated AWS PrivateLink as a ClickHouse Add-On for secure and seamless connectivity with ClickHouse.
PostgreSQL®
  • We’re aiming to launch PostgreSQL integrated with FSx for NetApp ONTAP (FSxN) along with NVMe support into general availability soon. This enhancement is designed to combine enterprise-grade PostgreSQL with FSxN’s scalable, cost-efficient storage, enabling customers to optimize infrastructure costs while improving performance and flexibility. NVMe support is designed to deliver up to 20% greater throughput vs NFS.
OpenSearch®
  • An AI Search plugin for OpenSearch is being released to GA (currently in public preview) to enhance search experiences using AI‑powered techniques such as semantic, hybrid, and conversational search, enabling more relevant, context‑aware results and unlocking new use cases including retrieval‑augmented generation (RAG) and AI‑driven chatbots.
Instaclustr Managed Platform
  • Following the public preview release, Zero Inbound Access is progressing to General Availability, designed to deliver the most secure management connectivity by eliminating inbound internet exposure and removing the need for any routable public IP addresses, including bastion or gateway instances.
Did you know?

If you have any questions or need further assistance with these enhancements to the Instaclustr Managed Platform, please contact us.

SAFE HARBOR STATEMENT: Any unreleased services or features referenced in this blog are not currently available and may not be made generally available on time or at all, as may be determined in NetApp’s sole discretion. Any such referenced services or features do not represent promises to deliver, commitments, or obligations of NetApp and may not be incorporated into any contract. Customers should make their purchase decisions based upon services and features that are currently generally available. 

The post Instaclustr product update: March 2026 appeared first on Instaclustr.

Integrated Gauges: Lessons Learned Monitoring Seastar’s IO Stack

Why integrated gauges are a better approach for measuring frequently changing values Many performance metrics and system parameters are inherently volatile or fluctuate rapidly. When using a monitoring system that periodically “scrapes” (polls) a target for its current metric value, the collected data point is merely a snapshot of the system’s state at that precise moment. It doesn’t reveal much about what’s actually happening in that area. Sometimes it’s possible to overcome this problem by accumulating those values somehow – for example, by using histograms or exporting a derived monotonically increasing counter. This article suggests yet another way to extend this approach for a broader set of frequently changing parameters. The Problem with Instantaneous Metrics For rapidly changing values such as queue lengths or request latencies, a single snapshot is most often misleading. If the scrape happens during a peak load, the value will appear high. If it happens during an idle moment, the value will appear low. Over time, the sampled data usually does not accurately represent the system’s true average behavior, leading to misinformed alerting or capacity planning. This phenomenon is apparent when examining synthetically generated data. Take a look at the example below. On that plot, there’s a randomly generated XY series. The horizontal axis represents the event number, the vertical axis represents the frequently fluctuating value. The “value” changes on every event. Fig.1 Random frequently changing data Now let’s thin the data out and see what would happen if we show only every 40th point, as if it were a real monitoring system that captures the value at a notably lower rate. The next plot shows what it would look like. Fig.2 Sampling every 40th point from the data Apparently, this is a very poor approximation of the first plot. It becomes crystal clear how poor it is if we zoom into the range [80-120] of both plots. For the latter (dispersed) plot, we’ll see a couple of dots with values of 2 and 0. But the former (original) plot would reveal drastically different information, shown below. Fig.3 Zoom-in range [80-120] from the data Now remember that the problem revealed itself at the ratio of the real rate to scrape rate being just 40. In real systems, internal parameters may change thousands of times per second while the scrape period is minutes. In that case, you end up scraping less than a fraction of a percent. A better way to monitor the frequently changing data is needed…badly. Histograms Request latency is one example of a statistic that usually contains many data points per second. Histograms can help you see how slow or fast the requests are – without falling into the problem described above. A monitoring histogram is a type of metric that samples observations and counts them in configurable buckets. Instead of recording just a single average or maximum latency value, a histogram provides a distribution of these values over a given period. By analyzing the bucket counts, you can calculate percentiles, including p50 (known as the median value) or p99 (the value below which 99% of all requests fell). This provides a much more robust and actionable view of performance volatility than simple instantaneous gauges or averages. Fig.4 Histogram built from data X-points It’s a static snapshot of the data at the last “time point.” If the values change over time, modern monitoring systems can offer various data views, such as time-series percentiles or heatmaps, to provide the necessary insight into the data. Histograms, however, have a drawback: They require system memory to work, node networking throughput to get reported, and monitoring disk space to be stored. A histogram of N buckets consumes N times more resources than a single value. Addable latencies When it comes to reporting latencies, a good compromise solution between efficiency and informativeness is “counter” metrics coupled with the rate() function. Counters and rates Counter metrics are one of the simplest and most fundamental metric types in the Prometheus monitoring system. It’s a cumulative metric that represents a single monotonically increasing counter whose value can only increase or be reset to zero upon a restart of the monitored target. Because a counter itself is just an ever-increasing total, it is rarely analyzed directly. Instead, one derives another meaningful metric from counters, most notably a “rate”. Below is a mathematical description of how the prometheus rate() function works. Internally, the system calculates the total sum of the observed values. After scraping the value several times, the monitoring system calculates the “rate” function of the value, which takes the total value difference from the previous scrape, divided by the time passed since then. The rated value thus shows the average value over the specified period of time. Now let’s get back to latencies. To report average latency over the scraping period using counters, the system should accumulate two of them: the total sum of request latencies (e.g., in milliseconds) and the total number of completed requests. The monitoring system would then “rate” both counters to get their averages and divide them, thus showing the average per-request latency. It looks like this: Since the total number of requests served is useful on its own to see the IOPS value, this method is as efficient as exporting the immediate latency value. Yet, it provides a much more stable and representative metric than an instantaneous gauge. We can try to adopt counters to the artificial data that was introduced earlier. The plot below shows what the averaged values for the sub-ranges of length 40 (the sampling step from above) would look like. It differs greatly from the sampled plot of Figure 2 and provides a more accurate approximation of the real data set. Fig.5 Average values for sub-ranges of length 40 Of course, this method has its drawbacks when compared to histograms. It swallows latency spikes that last for a short duration, as compared to the scraping period. The shorter the spike duration, the more likely it would go unnoticed. Queue length Summing up individual values is not always possible though. Queue length is another example of a statistic that’s also volatile and may contain thousands of data points. For example, this could be the queue of tasks to be executed or the queue of IO requests. Whenever an entry is added to the queue, its length increases. When an entry is removed, the length decreases – but it makes little sense to add up the updated queue length elsewhere. If we return to Figure 3 showing the real values in the range of [80-120], what would the average queue length over that period be? Apparently, it’s the sum of individual values divided by their number. But even though we understand what the “average request latency” is, the idea of “average queue length” is harder to accept, mainly because the X value of the above data is usually not an “event”, but it’s rather a “time”. And if, for example, a queue was empty most of the time and then changed to N elements for a few milliseconds, we’d have just two events that changed the queue. Seeing the average queue length of N/2 is counterintuitive. Some time ago, we researched how the “real” length of an IO queue can be implicitly derived from the net sum of request latencies. Here, we’ll show another approach to getting an idea of the queue length. Integrated gauge The approach is a generalization of averaging the sum of individual data points from the data set using counters and the “rate” function. Let’s return to our synthetic data when it was summed and rated (Figure 3) and look at the corresponding math from a slightly different angle. If we treat each value point as a bar with the width of 1 and the height being equal to the value itself, we can see that: The total value is the total sum of the bars’ squares The difference between two scraped total values is the sum of squares of bars between two scraping points The number of points in the range equals the distance between those two scraping points Fig.6 Geometrical interpretation of the exported counters Figure 6 shows this area in green. The average value is thus the height of the imaginary rectangular whose width equals the scraping period. If the distance between two adjacent points is not 1, but some duration, the interpretation still works. We can still calculate the square under the plot and divide it by the area width to get its “average” value. But we need to apply two changes. First, the square of the individual bar is now the product of the value itself and the duration of time passed from the previous point. Think of it this way: previously, the duration was 1, so the multiplication was invisible.Now, it is explicit. Second, we no longer need to count the number of points in the range. Instead, the denominator of the averaging function will be the duration between scraping points. In other words, previously the number of points was the sum of 1-s (one-s) between scraping points; now, it’s the sum of durations – total scraping period It’s now clear that the average value is the result of applying the “rate()” function to the total value. Naming problem (a side note) There are two hard problems in computer science: cache invalidation, naming, and off-by-one errors. Here we had to face one of them – how to name the newly introduced metrics? It is the queue size multiplied by the time the queue exists in such a state. A very close analogy comes from project management. There’s a parameter called “human-hour” and it’s widely used to estimate the resources needed to accomplish some task. In physics, there are not many units that measure something multiplied by time. But there are plenty of parameters that measure something divided by time, like velocity (distance-per-second), electric current (charge-per-second) or power (energy-per-second). Even though it’s possible to define (e.g., charge as current multiplied by time), it’s still charge that’s the primary unit and current is secondary. So far I’ve found quite a few examples of such units. In classical physics, one is called “action” and measures energy-times-time, and another one is “impulse,” which is measured in newtons-times-seconds. Finally, in kinematics there’s a thing called “absement,” which is literally meters-times-seconds. That describes the measure of an object displacement from its initial position. Integrated queue length The approach described above was recently [applied] to the Seastar IO stack – the length of IO classes’ request queues was patched to account for and export the integrated value. Below is the screenshot of a dashboard showing the comparison of both compaction class queue lengths, randomly scraped and integrated. Interesting (but maybe hard to accept) are the fraction values of the queue length. That’s OK, since the number shows the average queue length over a scrape period of a few minutes. A notable advantage of the integrated metric can be seen on the bottom two plots that show the length of the in-disk queue. Only the integrated metric (bottom right plot) shows that disk load actually decreased over time, while the randomly scraped one (bottom left plot) just tells us that the disk wasn’t idling…but not more than that. This effect of “hiding” the reality from the observer can also be seen in the plot that shows how the commitlog class queue length was changing over time. Here we can see a clear rectangular “spike” in the integrated disk queue length plot. That spike was completely hidden by the randomly scraped one. Similarly, the software queue length (upper pair of plots) dynamics is only visible from the integrated counter (upper right plot). Conclusion Relying solely on instantaneous metrics, or gauges, to monitor rapidly fluctuating system parameters very often leads to misleading data that poorly reflects the system’s actual behavior over a period of time. While solutions like histograms offer better statistical insights, they incur notable resource overhead. For metrics where the individual values can be aggregated (like request latencies), converting them into cumulative counters and deriving a rate provides a much more stable and representative average over the scraping interval. This offers a very efficient compromise between informative granularity and resource consumption. For metrics where instantaneous values cannot be simply summed (such as queue lengths), the concept of the Integrated Gauge offers a generalization of the same efficiency. In essence, by treating the gauge not as a point-in-time value but as a continuously accumulating measure of time-at-value, integral gauges provide a highly reliable and definitive representation of a parameter’s average behavior across any given measurement interval.

From ScyllaDB to Kafka: Natura’s Approach to Real-Time Data at Scale

How Natura built a real-time data pipeline to support orders, analytics, and operations Natura, one of the world’s largest cosmetics companies, relies on a network of millions of beauty consultants generating a massive amount of orders, events, and business data every day. From an infrastructure perspective, this requires processing vast amounts of data to support orders, campaigns, online KPIs, predictive analytics, and commercial operations. Natura’s Rodrigo Luchini (Software Engineering Manager) and Marcus Monteiro (Senior Engineering Tech Lead) shared the technical challenges and architecture behind these use cases at Monster SCALE Summit 2025. Fitting for the conference theme, they explained how the team powers real-time sales insights at massive scale by building upon ScyllaDB’s CDC Source Connector. About Natura Rodrigo kicked off the talk with a bit of background on Natura. Natura was founded in 1969 by Antônio Luiz Seabra. It is a Brazilian multinational cosmetics and personal care company known for its commitment to sustainability and ethical sourcing. They were one of the first companies to focus on products connected to beauty, health, and self-care. Natura has three core pillars: Sustainability: They are committed to producing products without animal testing and to supporting local producers, including communities in the Amazon rainforest. People: They value diversity and believe in the power of relationships. This belief drives how they work at Natura and is reflected in their beauty consultant network. Technology: They invest heavily in advanced engineering as well as product development. The Technical Challenge: Managing Massive Data Volume with Real-Time Updates The first challenge they face is integrating a high volume of data (just imagine millions of beauty consultants generating data and information every single day). The second challenge is having real-time updated data processing for different consumers across disconnected systems. Rodrigo explained, “Imagine two different challenges: creating and generating real-time data, and at the same time, delivering it to the network of consultants and for various purposes within Natura.” This is where ScyllaDB comes in. Here’s a look at the initial architecture flow, focusing on the order and API flow. As soon as a beauty consultant places an order, they need to update and process the related data immediately. This is possible for three main reasons: As Rodrigo put it, “The first is that ScyllaDB is a fast database infrastructure that operates in a resilient way. The second is that it is a robust database that replicates data across multiple nodes, which ensures data consistency and reliability. The last but not least reason is scalability – it’s capable of supporting billions of data processing operations.” The Natura team architected their system as follows: In addition to the order ingestion flow mentioned above, there’s an order metrics flow closely connected to ScyllaDB Cloud (ScyllaDB’s fully-managed database-as-a-service offering), as well as the ScyllaDB CDC Source Connector (A Kafka source connector capturing ScyllaDB CDC changes). Together, this enables different use cases in their internal systems. For example, it’s used to determine each beauty consultant’s business plan and also to report data across the direct sales network, including beauty consultants, leaders, and managers. These real-time reports drive business metrics up the chain for accurate, just-in-time decisions. Additionally, the data is used to define strategy and determine what products to offer customers. Contributing to the ScyllaDB CDC Source Connector When Natura started testing the ScyllaDB Connector, they noticed a significant spike in the cluster’s resource consumption. This continued until the CDC log table was fully processed, then returned to normal. At that point, the team took a step back. After reviewing the documentation, they learned that the connector operates with small windows (15 seconds by default) for reading the CDC log tables and sending the results to Kafka. However, at startup, these windows are actually based on the table TTL, which ranged from one to three days in Natura’s use case. Marcus shared: “Now imagine the impact. A massive amount of data, thousands of partitions, and the database reading all of it and staying in that state until the connector catches up to the current time window. So we asked ourselves: ‘Do we really need all the data?’ No. We had already run a full, readable load process for the ScyllaDB tables. What we really needed were just incremental changes, not the last three days, not the last 24 hours, just the last 15 minutes.” So, as ScyllaDB was adding this feature to the GitHub repo, the Natura team created a new option: scylla.custom.window.start. This let them tell the connector exactly where to start, so they could avoid unnecessary reads and relieve unnecessary load on the database. Marcus wrapped up the talk with the payoff: “This results in a highly efficient real-time data capture system that streams the CDC events straight to Kafka. From there, we can do anything—consume the data, store it, or move it to any database. This is a gamechanger. With this optimization, we unlocked a new level of efficiency and scalability, and this made a real difference for us.”

Optimizing a Fast Feature Store for Costs: Lessons Learned

How ShareChat reduced costs for its billion-feature scale feature store with ScyllaDB “Great system…now please make it 10 times cheaper.” That’s not exactly what the ShareChat team wanted to hear after completing a major engineering feat: scaling a real-time feature store 1000X without scaling their database (ScyllaDB). To stretch from supporting 1 million features per second to supporting 1 billion features per second, the team already… Redesigned their database schema to store all features together as protocol buffers and optimized tile configurations (reduced required rows from 2B to 73M/sec) Switched their database compaction strategy from incremental to leveled (doubled database capacity) Forked caching and gRPC libraries to eliminate mutex contention and connection bottlenecks Applied object pooling and garbage collector tuning to reduce memory allocation overhead You can read about those performance optimizations in Scaling an ML Feature Store From 1M to 1B Features per Second. But ShareChat – an Indian leader in a globally competitive social media market – is always looking to optimize. After reaching this scalability milestone, the team received a follow-up challenge: reducing the feature store’s costs by 10X (without compromising performance, of course). David Malinge (Sr. Staff Software Engineer at ShareChat) and Ivan Burmistrov (then Principal Software Engineer at ShareChat) shared how they approached this new challenge in a keynote at Monster Scale Summit 2025. Watch the complete talk, or read the highlights below. Note: Monster Scale Summit is a free + virtual conference on extreme scale engineering, with a focus on data-intensive applications. Learn from luminaries like antirez, creator of Redis; Camille Fournier, author of “The Manager’s Path” and “Platform Engineering”; Martin Kleppmann, author of “Designing Data-Intensive Applications” and more than 50 others, including engineers from Discord, Disney, Pinterest, Rivian, LinkedIn, Nextdoor, and American Express. Register (free) and join us March 11-12 for some lively chats Background ShareChat is one of the largest Indian media network, with over 300 million monthly active users. The app’s popularity stems from its powerful recommendation system – and that’s powered by their feature store. As Malinge explained, “We process more than 2 billion events daily to compute our features and serve close to a billion features per second at peak, with P99 latency under 20 milliseconds. We also read more than 30 billion rows per day in ScyllaDB, so we’re very happy that we’re not charged by the transaction.” Cloud Cost Optimization: Where Do You Even Start? When it came time to make this system 10 times cheaper, the team very quickly realized how complicated and confusing cloud cost optimization could be. For example, Burmistrov noted: Cloud billing is cloudy. AWS and GCP each have around 40,000 different SKUs, which makes it really hard to figure out where your money’s actually going. And cloud providers aren’t exactly motivated to make this easier for you to understand and debug. It’s easy to lose track of little things that add up. Maybe you forgot about an instance that’s still running. Maybe there’s an old deployment nobody uses anymore. Or perhaps you have storage buckets filled with data that should have expired by now (e.g., via Time to Live [TTL]). It all accumulates over time. Your cost intuition from on-prem doesn’t necessarily translate to the cloud. Something that was cost-efficient in your own data center might be expensive in the cloud, and what works well on one cloud provider might not be cost-effective on another. For the first optimization step, the team became sticklers for “hygiene”: clearing out anything they didn’t really need. This involved a deep cleaning for forgotten instances and deployments, unused buckets, data without proper TTLs, and overprovisioning. Having proper attribution was critical here. As Burmistrov explained, “Every workload, every instance, and every resource must be attributed to a specific service and a specific team. Without attribution, cost optimization is a global problem. With attribution, it becomes a local problem. Once each team can clearly see the costs associated with their own services, optimization becomes much easier. The team that owns the service has both the context and the incentive to improve its cost profile. They can see which components are expensive and decide what to do about them.” Cloud Trap 1: Getting Sucked Into Unsustainable Costs Next, they shifted focus to challenges unique to running applications in the cloud. To get the required cost savings, they had to navigate around a few cloud traps. The first trap was the allure and ease of using the cloud provider’s own products – particularly, databases. “They’re great to start with, but at scale they become painful in terms of cost,” explained Burmistrov. Scaling can be particularly costly for solutions that use a metered pricing model (e.g., based on traffic). “We used several GCP databases, including Bigtable and Spanner, but they became expensive and, more importantly, those costs were largely out of our control, we had little leverage over cost,” continued Burmistrov. “So we migrated many workloads to ScyllaDB. Today, more than 30 ScyllaDB clusters are deployed across the organization, including for our feature store use case. Besides stability and low latency, ScyllaDB gives us strong cost control. We can choose instance types, merge use cases into a single cluster, and run at high utilization. ScyllaDB works well at 80%, 90%, even close to 100% utilization, which gives us a very strong cost profile.” Cloud Trap 2: Kubernetes Wastage Next, the team tackled Kubernetes wastage. In Kubernetes, you deploy containers into pods, but you pay for nodes, which are actual VMs. You typically choose node types upfront – for example, nodes with 4 CPUs and 16 GB of memory. However, over time, workloads change, teams change, and these nodes are no longer optimal. Pods cannot fully occupy the nodes, either because of CPU, memory, or scheduling constraints. At that point, you’re paying for resources that aren’t being used. “You may decide, ‘Okay, let’s isolate workloads into separate node pools,’” explained Burmistrov. “But it’s not a solution because, as shown in the image below, we have one node completely wasted.” Simple isolation just moves waste around. The solution: dynamic node allocation based on workloads. Options include open source solutions like Karpenter, cloud-specific ones like GKE Autopilot, or commercial multi-cloud solutions like Cast AI. ShareChat likes Cast AI because it lets them fine-tune the configuration with respect to long-term commitments, spot instances, and other pricing impacts. Cloud Trap 3: Network Costs (“The Cloud Tax”) And then there’s network costs, which the ShareChat team calls “the cloud tax.” Massive apps like ShareChat deploy across multiple zones for high availability. However, multiple zones need to communicate with one another, and the traffic between zones comes with its own cost: Network Inter-Zone Egress (NIZE). That’s the outbound network traffic between availability zones within the same region. It’s billed separately by many cloud providers For example, if you deploy ScyllaDB in three GCP zones with two nodes per zone and use a non-ScyllaDB driver (not recommended), reads may cross zones both at the client level and inside the cluster. For traffic of volume T, you pay for T in NIZE. ScyllaDB drivers help here. As Burmistrov explained, “ Using ScyllaDB’s token-aware routing removes the extra hop inside the cluster, reducing inter-zone traffic. Using token-aware plus zone-aware routing ensures reads stay within the same zone, reducing inter-zone traffic to zero for reads.” Removing the extra hop cuts NIZE down to about 2/3 of T. On the write path, it’s a little different because replication is required. Burmistrov continued, “With three zones, writes generate 2 times T of inter-zone traffic. You can trade availability for cost by using two zones instead of three, reducing this to 1.5 times T. ScyllaDB also lets teams model each zone as a separate data center. In that case, for traffic T, you pay for T of inter-zone traffic. You generally can’t go lower unless you deploy in a single zone.” Tip: Oracle Cloud and Azure don’t charge for inter-zone traffic. If you use one of these cloud providers, you can deploy across three zones with zero inter-zone cost. Database Layer Optimization Another challenge: ShareChat’s generally stable read latencies would spike beyond their SLAs when write-heavy workloads would occur (like a backfill job, for example). The usual option – just scale up the database – would be simple, but the team wanted to reduce costs. Isolating reads and writes into separate data centers could technically improve performance, but that would be even worse from the cost perspective. That approach would double the infrastructure and probably lead to underutilization. Ultimately, they discovered and applied ScyllaDB’s workload prioritization. This lets them control how their various workloads compete for system resources. Malinge explained, “Under the hood, this relies on ScyllaDB’s thread-per-core architecture. Each core runs an async scheduler with multiple queues, each with its own priority. Workload prioritization maps directly to these priorities. Looking at the image, we have a very high share of compute for serving because we have strong SLAs on the read path, which is user facing. Most of the rest goes to less latency-sensitive workloads from asynchronous jobs, like writing the features to the database. We also keep a tiny share for manual queries (debugging, for example). This enables us to have exactly what we want – different latencies for reads and writes – and that allowed us to have the best possible serving for our features.” Feature Serving Layer Optimization That serving layer is a distributed gRPC (Google Remote Procedure Call) service responsible for caching and handling consumer requests. The team made some cost optimizations here too. One of the most impactful was a clever shortcut: instead of fully deserializing complex Protobuf messages, they treated repeated messages as serialized byte blobs. Since Protobuf encodes embedded messages using the same wire format as bytes, they could simply append these serialized records to merge data. With that, they could skip the heavy lift of fully unpacking and rebuilding the messages from scratch. That optimization alone could be the subject of an entire article; please see the video (starting at 18:15) for a detailed walkthrough of their approach. Malinge’s top lesson learned from optimizing this layer: “The first advice is the Captain Obvious one: don’t go blindly. We set up continuous profiling very early on. Nowadays, there are tons of tools available. Integration is super straightforward. You might recognize the vanilla GCP profiler in the image below. It’s just four lines of code to add in Go programs, and this has really helped us on our quest to reduce costs. We’ve actually unlocked more than a 50% reduction in compute by doing optimizations guided by continuous profiling.” Another lesson learned: if you’re using protobuf at high-RPS, serde is probably burning your CPU. Even with ~95% cache hit rates, the hot path was still deserializing protos, merging them, and serializing them again just to stitch responses together. Most of that work was pointless. The team ended up leveraging protobuf’s wire format, where repeated fields are just sequences of records and embedded messages can be treated as raw bytes (no deserialization). They switched to this “lazy” deserialization and merged cached protos without ever touching individual fields. For a practical example, see this repository david-sharechat/lazy-proto. And one parting tip: the wins compound fast. Benchmarking showed lazy merging was about 6 times faster, with about a third of the allocations. In production, that meant lower compute bills and better tail latency. Cost Optimization Actually Fostered Innovation The team’s smart work here underscores that cost optimizations don’t have to compromise the product. They can actually act as a catalyst for better system design. Malinge left us with this: “Our cost savings initiatives actually led to a lot of innovations. As I tried to represent in the left quadrant, there is a very unhealthy way to look at cost savings – one that focuses on shortcuts that negatively impact your product. The key is to stay on the right side of this quadrant, to aim for the sweet spot of reducing costs while improving the product.”

How Agoda Scaled Its Feature Store 50X

Lessons learned on data modeling, cache optimization, and hardware selection Agoda is the Singapore wing of Booking Holdings, the world’s leading provider of online travel (the brand behind Booking.com, Kayak, Priceline, etc.). From January 2023 to February 2025, Agoda server traffic spiked by 50 times. That’s fantastic business growth, but also the trigger for an interesting engineering challenge. Specifically, the team had to determine how to scale their ScyllaDB-backed online feature store to maintain 10ms P99 latencies despite this growth. Complicating the situation, traffic was highly bursty, cache hit rates were unpredictable and cold-cache scenarios could flood the database with duplicate read requests in a matter of seconds. At Monster Scale Summit 2025, Worakarn Isaratham, lead software engineer at Agoda,shared how they tackled the challenge. You can watch his entire talk or read the highlights below. Note: Monster Scale Summit is a free + virtual conference on extreme scale engineering, with a focus on data-intensive applications. Learn from luminaries like antirez, creator of Redis; Camille Fournier, author of “The Manager’s Path” and “Platform Engineering”; Martin Kleppmann, author of “Designing Data-Intensive Applications” and more than 50 others, including engineers from Discord, Disney, Pinterest, Rivian, LinkedIn, Nextdoor, and American Express. Register (free) and join us March 11-12 for some lively chats. Background: A feature store powered by ScyllaDB and DragonflyDB Agoda operates an in-house feature store that supports both offline model training and online inference. For anyone not familiar with feature stores, Isaratham provided a quick primer. A feature store is a centralized repository designed for managing and serving machine learning features. In the context of machine learning, a feature is a measurable property or characteristic of a data point used as input to models. The feature store helps manage features across the entire machine learning pipeline — from data ingestion to model training to inference. Feature stores are integral to Agoda’s business. Isaratham explained: “We’re a digital travel platform and some use cases are directly tied to our product. For example, we try to predict what users want to see, which hotels to recommend and what promotions to serve. On the more technical side, we use it for things like bot detection. The model uses traffic patterns to predict whether a user is a bot, and if so, we can block or deprioritize requests. So the feature store is essential for both product and engineering at Agoda. We’ve got tools to help create feature ingestion pipelines, model training, and the focus here: online feature serving.” One layer deeper into how it works: “We’re currently serving about 3.5 million entities per second (EPS) to our users. About half the features are served from cache within the client SDK, which we provide in Scala and Python. That means 1.7 million entities per second reach our application servers. These are written in Rust, running in our internal Kubernetes pods in our private cloud. From the app servers, we first check if features exist in the cache. We use DragonflyDB as a non-persistent centralized cache. If it’s not in the cache, then we go to ScyllaDB, our source of truth.” ScyllaDB is a high-performance database for workloads that require ultra-low latency at scale. Agoda’s current ScyllaDB cluster is deployed as six bare-metal nodes, replicated across four data centers. Under steady-state conditions, ScyllaDB serves about 200K entities per second across all data centers while meeting a service-level agreement (SLA) of 10ms P99 latency. (In practice, their latencies are typically even lower than their SLA requires.) Traffic growth and bursty workloads However, it wasn’t always that smooth and steady. Around mid-2023, they hit a major capacity problem when a new user wanted to onboard to the Agoda feature store. Their traffic pattern was super bursty: It was normally low, but occasionally flooded them with requests triggered by external signals. These were cold-cache scenarios, where the cache couldn’t help. Isaratham shared, “Bursts reached 120K EPS, which was 12 times the normal load back then.” Request duplication exacerbated the situation. Many identical requests arrived in quick succession. Instead of one request populating the cache and subsequent requests benefiting, all of them hit ScyllaDB at the same time — a classic cache stampede. They also retried failed requests until they succeeded – and that kept the pressure high. This load involved two data centers. One slowed down but remained online. The other was effectively taken out of service. More details from Worakarn: “On the bad DC, error rates were high and retries took 40 minutes to clear; on the good one, it only took a few minutes. Metrics showed that ScyllaDB read latency spiked into seconds instead of milliseconds.” Diagnosing the bottleneck So, they compared setups and found the difference: the problematic data center used SATA SSDs while the better one used NVMe SSDs. SATA (serial advanced technology attachment) was already old tech, even then. The team’s speed tests suggested that replacing the disks would yield a 10X read performance boost – and better write rates too. The team ordered new disks immediately. However, given that the disks wouldn’t arrive for months, they had to figure out a survival strategy until then. As Isaratham shared, “Capacity tests and projections showed that we would hit limits within eight or nine months even without new load – and sooner with it. So, we worked with users to add more aggressive client-side caching, remove unnecessary requests and smooth out bursts. That reduced the new load from 120K to 7K EPS. That was enough to keep things stable, but we were still close to the limit.” Surviving with SATA Given the imminent capacity cap, the team brainstormed ways to improve the situation while still on the existing SATA disks. Since you have to measure before you can improve, getting a clean baseline was the first order of business. “The earlier capacity numbers were from real-world traffic, which included caching effects,” Isaratham detailed. “We wanted to measure cold-cache performance directly. So, we created artificial load using one-time-use test entities, bypassed cache in queries and flushed caches before and after each run. The baseline read capacity on the bad DC was 5K EPS.” With that baseline set, the team considered a few different approaches. Data modeling All features from all feature sets were stored in a single table. The team hoped that splitting tables by feature set might improve locality and reduce read amplification. It didn’t. They were already partitioning by feature set and entity, so the logical reorganization didn’t change the physical layout. Compaction strategy Given a read-heavy workload with frequent updates, ScyllaDB documentation recommends the size-tiered compaction strategy to avoid write amplification. But the team was most concerned about read latency, so they took a different path. According to Worakarn: “We tried leveled compaction to reduce the number of SSTables per read. Tests showed fetching 1KB of data required reading 70KB from disk, so minimizing SSTable reads was key. Switching to leveled compaction improved throughput by about 50%.” Larger SSTable summaries ScyllaDB uses summary files to more efficiently navigate index files. Their size is controlled by the sstable_summary_ratio setting. Increasing the ratio enlarges the summary file and that reduces index reads at the cost of additional memory. The team increased the ratio by 20 times, which boosted capacity to 20K EPS. This yielded a nice 4X improvement, so they rolled it out immediately. What a difference a disk makes Finally, the NVMe disks arrived a few months later. This one change made a massive difference. Capacity jumped to 300K EPS, a staggering 50–60X improvement. The team rolled out improvements in stages: first, the summary ratio tweak (for 2–3X breathing room), then the NVMe upgrade (for 50X capacity). They didn’t apply leveled compaction in production because it only affects new tables and would require migration. Anyway, NVMe already solved the problem. After that, the team shifted focus to other areas: improving caching, rewriting the application in Rust and adding cache stampede prevention to reduce the load on ScyllaDB. They still revisit ScyllaDB occasionally for experiments. A couple of examples: New partitioning scheme: They tried partitioning by feature set only and clustering by entity. However, performance was actually worse, so they didn’t move forward with this idea. Data remodeling: The application originally stored one row per feature. Since all features for an entity are always read together, the team tested storing all features in a single row instead. This improved performance by 35%, but it requires a table migration. It’s on their list of things to do later. Lessons learned Isaratham wrapped it up as follows: “We’d been using ScyllaDB for years without realizing its full potential, mainly because we hadn’t set it up correctly. After upgrading disks, benchmarking and tuning data models, we finally reached proper usage. Getting the basics right – fast storage, knowing capacity, and matching data models to workload – made all the difference. That’s how ScyllaDB helped us achieve 50X scaling.”