ScyllaDB Customer Experience Spotlight: Faisal Saeed

Welcome to the second installment of a new blog series introducing some of the experts you might encounter when you work with ScyllaDB. (In the first, we met Tyler Denton, Solutions Architect). Today we’re featuring Faisal Saeed, Principal Customer Engineer on the Customer Experience team here at ScyllaDB. He lives in Singapore and has been at ScyllaDB for more than 2 years. Let’s learn a little about Faisal… What do you do here at ScyllaDB I have a hybrid role where I work with existing customers as their Principal Customer Engineer, helping them ensure their ScyllaDB Cloud / on-prem clusters are in good health and performing according to their expectations. Secondly, I work as a pre-sales Solutions Architect for clients who are not existing ScyllaDB customers and are evaluating ScyllaDB. Here, I often help with data modeling or planning their data migration from their existing database into ScyllaDB Enterprise / ScyllaDB Cloud clusters. Please share a little about your path to ScyllaDB I have worked in the IT industry for about 30 years and have extensive database experience. Before joining ScyllaDB, I was a Principal Solutions Architect with MariaDB for 6 years. Before that, I worked with ACI Worldwide as a database architect on projects for DBS Bank in Singapore. Before that, I spent many years at NCS, working as a database architect on DBS Bank projects. Tell me about one of the most interesting projects you’ve worked on here While I work with many amazing customers, the project I cherish the most is an in-house developed tool that automates ScyllaDB Enterprise/Cloud/X Cloud clusters with a single command, allowing the user to run various workloads and perform stress testing of multiple clusters. This is the ScyllaDB Automation Framework, and I have worked on this project for more than a year. This helps various team members in ScyllaDB with their day to day tasks, whether running a demo for a customer or simulating a customer use case. What’s the most impressive ScyllaDB feat you’ve seen a team accomplish If we talk about teams in ScyllaDB, X Cloud is an amazing ScyllaDB product that lets customers save costs while running at any scale. The team has done an outstanding job. Talking about customers, every one of them is unique in some way. JioStar from India uses ScyllaDB to support IPL, World Cup Cricket, and many other supporting events where millions of users concurrently log in to ScyllaDB clusters through their app — and ScyllaDB handles them gracefully without any lags. There are many others, but I can’t mention everyone. What do you like to do when you’re not working or on-call I spend time with my wife at home, go out for long walks, watch movies, and care for two bunnies who have been with us for more than 5 years. What’s your top tip for getting the most out of ScyllaDB I can’t recommend just one thing, but ScyllaDB is designed to run almost on autopilot. Rarely is there a need to tune any aspect of the ScyllaDB cluster. But if I had to pick one thing, it would be “proper NoSQL data modeling.” I have seen many teams struggle with performance because they had a poor data model. After spending some time with them and helping them fix their data model mistakes, their ScyllaDB cluster ran smoothly with the promised single-digit P99 latencies. I recommend everyone to join ScyllaDB University (it’s free) and take the beginner and advanced data modeling courses.

ScyllaDB Operator 1.21 Release — with Oracle Kubernetes Engine (OKE) Support

Introducing Oracle Kubernetes Engine support, stronger TLS, and a lighter dependency footprint ScyllaDB Operator 1.21.0 is now available. For background, ScyllaDB Operator is an open-source project that helps you run ScyllaDB on Kubernetes. It lets you manage ScyllaDB clusters deployed to Kubernetes and automate tasks related to operating a ScyllaDB cluster (e.g., installation, vertical and horizontal scaling, as well as rolling upgrades). ScyllaDB Operator 1.21 expands cloud platform support with OKE, adds ECDSA as an alternative key type for TLS certificates, and removes a hard dependency on Prometheus Operator. Oracle Kubernetes Engine (OKE) support ScyllaDB Operator 1.21 adds Oracle Container Engine for Kubernetes (OKE) as a supported platform. The new OKE support comes with comprehensive documentation covering the entire workflow , from provisioning the underlying OCI infrastructure (VCN, subnets, gateways, and node pools with Dense I/O shapes and local NVMe storage) to deploying a 3-node ScyllaDB cluster spread across fault domains. An automated setup script is also provided for one-command infrastructure provisioning. To get started with ScyllaDB on OKE, see the Set up an OKE cluster for ScyllaDB infrastructure guide and the OKE reference deployment. ECDSA support for TLS certificates ScyllaDB Operator manages TLS certificates internally for securing client-to-node communication. Until now, only RSA keys were supported for certificate generation. ScyllaDB Operator 1.21 adds elliptic curve cryptography (ECDSA) as an alternative key type. This allows smaller key sizes and faster cryptographic operations with strong security. You can opt in to ECDSA by setting the –crypto-key-type=ECDSA flag on the operator, with the curve bit-size configurable via –crypto-ecdsa-key-size (defaulting to P-384). RSA remains the default key type. The RSA key size is now configured with a dedicated –crypto-rsa-key-size flag; the previous –crypto-key-size flag is deprecated and remains accepted as an alias. Prometheus Operator is now an optional dependency Previously, ScyllaDB Operator required Prometheus Operator CRDs (monitoring.coreos.com/v1) to be installed in the cluster, even if you did not intend to use ScyllaDBMonitoring. Missing CRDs would result in error logs at startup. With ScyllaDB Operator 1.21, Prometheus Operator becomes a purely optional dependency. The operator auto-detects whether the CRDs are present at startup using Kubernetes API discovery. When they are absent, the ScyllaDBMonitoring controller is not started and no error logs are emitted. If you install Prometheus Operator after the ScyllaDB Operator is already running, restart the operator to pick up the new CRDs. Refer to the monitoring setup guide for details.

Dynamic Repartitioning for Time Series Workloads

By Rajiv Shringi, Kaidan Fullerton, Oleksii Tkachuk and Kartik Sathyanarayanan

Introduction

Netflix’s TimeSeries Abstraction is a scalable system for ingesting and querying petabytes of temporal event data with millisecond latency. We use Apache Cassandra 4.x as the underlying storage for these main reasons:

  • Throughput, latency, and cost: Cassandra can handle millions of low‑latency reads and writes in a cost-effective manner.
  • Operational maturity: Our data platform team has deep operational expertise running large Cassandra clusters in production.

However, using Cassandra at this scale introduces trade‑offs for TimeSeries workloads. A key challenge is wide partitions, as TimeSeries dataset partitions can grow quite large with events accumulating over time.

This problem is further compounded by the fact that TimeSeries servers routinely deal with a very high read throughput:

Reads/second for different datasets

This post walks through our journey to reduce the impact of wide partitions in our TimeSeries datasets, the solutions we built, and the lessons we learned.

Note: Although this post walks through re-partitioning in Cassandra, the same techniques can be applied more broadly to other data stores.

Impact of Wide Partitions

For most of our datasets, we observe an average read latency in the order of single-digit milliseconds:

Ideal Latency for Reads (ms)

However, in some datasets, as partitions grow too wide, we observe high read latencies in the order of seconds, especially towards the tail end:

High Tail Latency for Reads (seconds)

This can result in timeouts:

Read timeouts / second

In extreme cases, if most of the reads target wide partitions, we can see Garbage Collection pauses, high CPU utilization and thread queueing.

High CPU utilization and thread-queueing in Cassandra clusters

Scaling up the underlying Cassandra cluster is always an option, but we need smarter alternatives than just throwing more money at the problem.

TimeSeries Partitioning Strategy

The TimeSeries Abstraction was designed to solve the problem of wide partitions by dividing the data into discrete time chunks. For more in-depth information, refer to our previous blog.

To summarize, here is an illustration of how TimeSeries partitioning strategy helps us break up wide partitions into manageable chunks.

Time Series partitioning breaking up a dataset into Time slices, time buckets and event buckets

This strategy further allows us to efficiently query and drop data based on time, without having to deal with tombstones.

Picking the Partitioning Strategy

When a namespace (a.k.a. dataset) is created, users must specify their anticipated workload characteristics. This specification is then fed into our provisioning pipeline. The pipeline processes these inputs, runs Monte Carlo simulations, and produces an optimal infrastructure and partition configuration.

Provisioning picks optimal infra and configuration based on user inputs

You can learn more about our methodology of capacity planning in this insightful AWS re:Invent talk given by one of our stunning colleagues.

The Problem with the Current Approach

Although this method of provisioning is effective in many situations, it proves insufficient for TimeSeries workloads under these conditions:

  • Workload is unknown or inaccurately estimated: Early on in a project, users can lack a reliable picture of production traffic or simply misestimate key parameters.
  • Workload evolves over time: Traffic patterns, client behavior, and product requirements change. A “good” partitioning strategy on day one can become inefficient months later.
  • Data outliers exist: Not all TimeSeries IDs behave the same. A small percentage of IDs can receive a vastly higher volume of events than the rest.

Fortunately, our design with discrete Time Slices gives us a natural escape hatch for the first two scenarios; each new Time Slice can use a different partitioning strategy.

Each Time Slice can have a unique partition strategy

However, manually adjusting these configurations in a fleet that has thousands of TimeSeries datasets is not sustainable. We need automation.

Solution 1: Time Slice Re-Partitioning

Cassandra exposes useful introspection APIs for understanding data usage and access patterns. For example, nodetool tablehistograms provide percentile distributions for partition sizes in a table. Using these tools, we can detect cases of both over and under partitioning.

Below is an example of over‑partitioning, where the TimeSeries provisioning pipeline selected very small time_bucket intervals based on user provided inputs:

Provisioning selected 60s time buckets based on user inputs

causing partitions to have less than 10 KB of data, leading to high read amplification and thread queueing:

Histogram of the given Cassandra table showing partition size percentiles

In order to tune partition strategies efficiently, we added a background worker, which monitors partition histograms of Time Slices attached to a given application, and exposes it via a Cassandra virtual table:

Histograms exposed through a Cassandra Virtual table

It then computes an adjustment factor when it detects partition sizes not meeting a configured density. This configured density is often set between 2 MiB to 10 MiB depending on the workload.

DynamicTimeSliceConfigWorker: 
namespace: my_dataset_1
Observed: TimeSlices have p99 partitions below configured target of 10MB.
Proposed: time_bucket interval: 60s -> 604800s

The worker can then update future Time Slices with the new partition strategy:

Partitioning adjusted for future Time Slice(s)

This strategy has yielded real results in reducing our read latencies, as well as reducing the number of timeouts caused by thread queueing.

Reduction in tail latency and thread queueing for

However, this strategy only works if most of the data exhibits such behavior that warrants re-partitioning of the entire table. It does not work in cases where only a percentage of IDs within the table are wide.

We have a couple of options here:

  • Do Nothing: This is sometimes the right approach if there is no observed impact to the application’s top-level metrics.
  • Partial Returns: We implemented a ‘Partial Return’ feature, which aborts an inflight request if it has breached a configured latency SLO, while returning whatever data it has collected up until that point. This is a great option for clients who care more about latency than fetching all the data.
Tail latency drops around the SLO cutoff as Partial Returns are enabled
  • Block IDs: This is an extreme step but worth mentioning, because we do deal with bad data that occasionally seeps into the system e.g. test or spam IDs that can make the system unstable.
dgwts.config.<dataset>.block.Ids: "<tsid-1>, <tsid-2>, <tsid-3>"

Ultimately, we encounter scenarios where valid and important TimeSeries IDs accumulate a high enough volume of events, with callers needing to process all the related data. Simply tolerating elevated latencies or timeouts when querying these IDs is not a desirable outcome.

This is where dynamic partitioning comes into play.

Solution 2: Dynamic Partitioning per ID

Dynamic partitioning is an asynchronous pipeline that auto-detects and splits wide partitions on a TimeSeries ID level rather than at the table level.

It has three main stages:

  • Detection: Detects wide partitions for a given TimeSeries ID during the read path.
  • Planning & Splitting: Plans and executes splits of those partitions into optimal sizes asynchronously.
  • Serving Reads: Re-routes the read queries transparently to read data from the split partitions when ready.

This is how it works at a high level; we will dive into details after:

Dynamic Wide Partition Split Async Pipeline

Here are the different stages of the pipeline:

Detection

Every TimeSeries read operation tracks how many bytes are read for a given partition. If the bytes read exceed a configured threshold, the server emits a detection event to Kafka:

{
"time_slice": "data_20260328", // the Cassandra table this event was detected in
"time_series_id": "profileId:123", // the ID detected as wide
"time_bucket": 7, // the existing time_bucket partition
"event_bucket": 2, // the existing event_bucket partition
"immutable": true, // TimeSeries servers can compute if this partition is no longer receiving writes
"version": "0" // reserved for future use e.g. invalidate if partition is no longer immutable
}

Our decision to detect wide partitions on reads, as opposed to writes, is based on our observation that the majority of the data in the wild doesn’t need this treatment. The slight downside is that some reads on these large partitions may suffer sub-optimal performance for a very short duration (typically seconds) until this process catches up.

Immutability

Although splitting mutable partitions is possible, it is inherently more complex. As a first step towards solving this problem, we chose to reduce the surface area of this change by focusing on immutable partitions, while still meaningfully reducing caller timeouts.

Planning

Detection may occur based on a partial read, so the planner must still read the entire partition once to compute an accurate split plan. The checkpointing becomes crucial here. For planning reads that fail to process the entire partition, the process can always continue from the last saved checkpoint.

Checkpointing

The wide_row metadata table serves as the backbone for state transitions and checkpointing of partition splits. It also stores information that is used later by TimeSeries servers to properly route Read queries.

wide_row metadata for storing split states and checkpoints

Splitting

The Planner delegates the splitting of data to an appropriate split-strategy. For example, if EventBucketPartitionSplitStrategy is selected, we split the partition by assigning more event buckets to the same time bucket. If the partition is ultra-wide, we cap the number of event buckets we split into, in order to control the resultant read amplification. Spreading into multiple partitions in such cases is still beneficial in order to spread the read workload to multiple Cassandra replicas.

Split by assigning more event buckets for a given time bucket

Further, since the Splitter has the full view of the partition, it can ensure total sort order across all the split buckets.

Validating Splits

The Planner stores a pre-split checksum of a given partition during the planning phase, while the Splitter computes and stores the post-split checksum. The split status is marked as completed only if the two checksums match.

Ensure checksums match pre- and post-split before marking a split as COMPLETED

Tracking Splits

The pre- and post-split partition sizes across different datasets are tracked to see how effectively the partition splits are being planned and executed:

Track pre- and post-split partition sizes to ensure we are splitting optimally

Serving Reads

The TimeSeries servers load the partition-keys of completed splits periodically into in-memory Bloom filters. Every read operation checks the Bloom filter to see whether a query can be diverted to the split partitions.

Here is what the Read path looks like:

Read path for diverting reads to existing or split partitions

The size of the Bloom filters is monitored to ensure we have enough memory per server. Due to the compactness of partition keys, and ratio of wide partitions in a given dataset, the filters fit comfortably in each server instance.

Bloom filter approximate element count per namespace and time slice

The Bloom filter latency to check whether a given partition key is wide for every read request is typically in single-digit microseconds or better, making this diversion practically invisible to the callers.

Latency for checking Bloom filters is extremely small for callers to notice the diversion

For the cases that do end up with a Bloom filter hit, the TimeSeries servers lookup the wide_row metadata to see how to read a specific wide partition:

{
"pre_split_data": {
"time_slice": "data_20260328",
"time_series_id": "6313825", → What to read
"time_bucket": 0,
"event_bucket": 2

},
"post_split_data": {
"time_slice": "wide_data_20260328_0", → Where to read it from
"event_bucket_partition_strategy": { → Strategy to delegate to for reading
"target_event_buckets": 2,
"start_event_bucket": 32 → How should the strategy read it
}

}

This metadata read is backed by a read-through cache, making it quite performant:

Metadata fetch latency is quite low to affect read operations

Finally, the reads for the split partitions are delegated to our existing PartitionReader, which reads N smaller partitions in parallel, rather than 1 large partition, improving overall performance and stability!

Read much smaller partitions in parallel and merge results

Fallbacks

The existing wide partition from the original time slice is never deleted. This helps us in creating safe fallbacks in many different scenarios of partial failures and eventual consistency. The slightly larger storage space we use as a result is worth the operational safety we gain.

Building Additional Confidence

Serving incorrect reads would be disastrous. To establish trust beyond checksums, we leveraged additional mechanisms such as:

  • Using our existing Data Bridge pipelines to verify splits offline:
Spark job to ensure that the split data is an exact match to the original data
  • Implementing a phased rollout strategy to safely advance through stages as our confidence in the system grew:
Advance through Read modes once previous mode passes checks

A critical part of this phased rollout was the Comparison phase, which compared bytes served by old read path and the new read path while in shadow mode:

A chart of bytes match vs bytes differ in a given shadow period

Results

As a result of these dynamic splits, we see a huge improvement in the average read latency of most wide partitions, bringing it down from seconds:

Existing average latency for reading wide partitions

to low double-digit milliseconds!

Average latency for reading dynamically split partitions

Tail latencies of reading wide partitions dropped from several seconds:

Existing tail latency for reading wide partitions

to around 200 ms or better:

Tail latency for reading dynamically split partitions

resulting in a drop in read timeouts:

Overall, this has resulted in a more stable Cassandra cluster with lower CPU utilization and little to no thread queuing:

Low CPU utilization and no thread-queueing

Further, for extreme wide rows, where a dataset would face constant timeouts and unavailability blips, the service was able to paginate and query 500MB+ partitions while remaining available:

grpc … com.netflix.dgw.ts.TimeSeriesService/SearchEventRecords -d
'{"namespace": "...",
"search_query": {...},
"time_interval": {
"start": "2026–05–11T23:42:51.484398Z",
"end": "2026–05–12T00:13:50.694205Z"
},
"pageSize" : 1000,
}'
# Response:
{
"next_page_token" : ….,
"records": [
{

}
],
"response_context": [{
"namespace": "...",

# Trades elevated latency for being available
"time_taken": "41.072410142s"
}
]
}

Conclusion

There is more work planned around this feature, like splitting mutable wide partitions, or re-processing previously failed splits, but this has been a successful start in improving service performance and reducing our support burden.

Further, we would like to highlight some key lessons that we learned at different points in this journey.

  • Reducing Surface Area: As a first step, explore simpler solutions that can still deliver meaningful impact. Also, reducing the surface area of a complex change and deploying incrementally pays off operationally.
  • Building Confidence: Invest time and resources to build confidence in new features, especially when justified by the feature complexity, deployment blast radius, and/or potential impact.

Acknowledgements: Special thanks to our stunning colleagues who further contributed to this feature’s success: Tom DeVoe, Chris Lohfink, Sumanth Pasupuleti and Joey Lynch.


Dynamic Repartitioning for Time Series Workloads was originally published in Netflix TechBlog on Medium, where people are continuing the conversation by highlighting and responding to this story.

Dear cqlsh: Your dependencies were killing us (P.S. We rewrote you in Rust)

A story of rewriting cqlsh in Rust…with Claude Code and a lot of planning Dear cqlsh, I vouched for you. I told the team you were fine. I forked you, catered to you, vendored your dependencies and your dependencies’ dependencies. I patched things upstream that I knew you would never merge. I pinned your Python, re-pinned it after the OS upgraded, and explained to people (with a straight face) why that was totally normal and not a problem at all. I wrote you twice already. You never wrote back. I’m not even mad. I get it: you’re busy. 30+ CLI flags, 25 CQL types, a COPY engine with enough options to fill a man page…You’ve got a lot going on. But I found someone faster, someone who compiles to a static binary without a runtime, without vendoring. They don’t make me think about “which Python are we using today?” They just…work. I hope you understand. Yours (for now), Israel This is the story of cqlsh-rs – a ground-up Rust rewrite of the Python cqlsh, the interactive CQL shell used daily by everyone working with Cassandra and ScyllaDB. It’s also a story about what happens when you take the lessons from one AI-assisted project and apply them to another project. Why bother rewriting? Because packaging is a nightmare. ScyllaDB ships a relocatable package, a self-contained bundle with its own Python runtime baked in. The system Python can change, upgrade, or disappear entirely, and ScyllaDB’s startup scripts and cqlsh keep working because they’re running against a known, pinned Python version inside the bundle. Except cqlsh has to live inside that bundle. And cqlsh is a Python tool. It has dependencies, those dependencies’ dependencies have dependencies, and they all need to be vendored in alongside the bundled Python. Every time cqlsh or one of its dependencies needs updating (a bug fix, a new Cassandra protocol version, a security patch), you need to update the bundle, test the bundle, and ship the bundle. And if something conflicts or breaks inside that carefully pinned environment, it’s your problem to untangle. A static Rust binary sidesteps all of this. You compile once per target, you get a single file with zero runtime dependencies, and you ship it. Done. The second pain point is COPY TO/FROM, cqlsh‘s built-in feature for bulk-exporting and importing table data to CSV. It’s one of the most-used features, and it’s been carrying around a long list of bugs for years. It does have parallel workers (threads and processes), but the machinery is complicated, fragile, and notoriously hard to test. The bug list reflects that. Both of these are solvable in Rust. So, the question became: is now the time to actually solve them? It all started with a BIG plan (to the tune of The Big Bang Theory) In a previous post, I wrote about using GitHub Copilot to bring a 4-year-old Python idea (coodie, a Pydantic ODM for Cassandra) back to life. That project was relatively contained: give the AI a concept, come back to a working implementation. Fire and forget it, more or less. cqlsh-rs is a different category of project. The original Python cqlsh has been around for over a decade. It has hundreds of CLI flags, a compatibility matrix that spans multiple database versions, a COPY engine with 30+ options per direction, tab completion that must be schema-aware, and a type system covering 25+ CQL types with specific formatting rules. Shipping something that “mostly works” is not good enough if people are going to actually switch to it. Every muscle-memory command has to work the same way. So before writing a single line of Rust, I started with a plan. That plan started as one document. It grew, then it became a master design document plus sub-plans. By the time the architecture settled, there were 19 sub-plans (SP01 through SP19) covering everything from the CLI argument parser to the CQL type formatter to the COPY engine to a future --ai-help flag for offline CQL error diagnostics. Here’s what the roadmap looked like near the start: 5 out of 108 tasks. 0.4 tasks per day. The footer on that SVG read: “Approximately 8.9 months remaining… just like Windows said.” Reader, it did not take 8.9 months. “Wait, why is there a skill for that?” I started in Claude web, but not because that’s my comfort zone. With Copilot, I liked the browser because it made the conversation visible to the team, a kind of shared thinking space. I had the same instinct here. This way, design conversations, architecture decisions, trade-off explorations, etc all happened in the browser before a single file was created. Questions like What driver to use? How to structure the CLI argument parsing? Should we write a hand-rolled CQL parser or keep it simple with a line-buffer approach? are genuinely better answered in conversation than in code. The master plan came together there. So did the first sub-plans and the initial CI skeleton. Then I started exploring Claude Code, the CLI. Somewhere around phase 2, I closed that browser tab once and for all. One reason is the feedback loop: you’re in the same environment as the code, so cargo test runs immediately after a change, failures surface in context, and the next prompt can reference the actual output. Another reason is just familiarity: the more you use it, the more you learn to point it at exactly the right problem. Skills: write your conventions once, use them forever The skills library was also critical for this project: /rust-testing – What to test at the unit layer vs. the integration layer, how to use assert_cmd for CLI tests, when to reach for insta snapshots /rust-clippy – Run Clippy with strict settings and fix everything it complains about /rust-error-handling – Idiomatic error handling patterns for this codebase /development-process – The full loop: review the relevant sub-plan, design tests first, implement, run tests, update the plan, commit I carried the pattern directly from coodie. The specific skills are different (Python vs. Rust), but the idea is the same. Each skill you write makes every subsequent feature cheaper to build. Living documents (or, an outdated plan is worse than no plan) The 19 sub-plans are living documents that are updated when decisions are made (vs written upfront and then abandoned, like most docs). When a design decision changes mid-implementation, the plan changes too. When a task is done, the checkbox gets ticked. When a new edge case surfaces, it gets added. This matters more than it might seem. An outdated plan is worse than no plan because the AI will follow it faithfully…in the wrong direction. What’s in the box Nothing terribly exotic; there’s: Rust with Tokio for async. The scylla crate for the database driver. rustyline for the REPL and line editing. comfy-table and owo-colors for output formatting. testcontainers-rs for spinning up real Cassandra instances in CI. While the stack itself might not be exciting, the interesting part is what it takes to get every CQL type to format exactly like the Python implementation – right down to float precision and frozen collection syntax. That’s where most of the compatibility work lives. Where are we now? Here’s the same roadmap today: Phases 1 through 3 are done. The shell works: you can… Connect Run queries Get formatted output with colors and pagination Tab-complete keyspace and table names Run DESCRIBE on anything Use SOURCE to execute a file Phase 4 – COPY TO/FROM – is implemented. Phase 5 (testing) is in progress, with 327 tests and counting. Takeaways Planning pays (but living documents are a nice touch). A static plan written at the start and never touched again is a liability. A plan that gets updated as decisions are made is an asset – and the primary reason Claude can work effectively across multiple sessions on a project this size. Skills compound. A good amount of work is required to find the right skill for the task and adapt it to the project: the conventions, the patterns, the “this is how we do it here” info. But once that’s written down, it becomes easier to implement every feature. The workflow is never done. The pace of this space is genuinely disorienting. We now regularly use tools that didn’t even exist six months ago. This means that what works today might not work in a month. It’s still writing code, just differently. (I have a bit of trouble using the word “engineering” here.) Claude doesn’t replace judgment on architecture, on what actually matters to users, on “is this the right trade-off?” It removes the friction between having a clear idea of what you want and that thing existing. Whether that makes it better or worse probably depends on the day. Lessons from one project carry over to the next. The skills pattern from coodie was carried into cqlsh-rs with a different language and a different domain. You can start from what you already learned, and the AI follows the same process docs that you wrote last time. Things to look forward to One idea that popped up during this: an --ai-help flag that embeds a small local model to give offline diagnostics when your CQL query fails. In other words, building an AI-assisted tool with an AI assistant that will assist with AI-assisted queries. I’m going to stop thinking about that too hard. 😉 For the model routing, we’ll probably use LiteLLM. I heard it’s become quite popular lately. I had fun. Claude had fun too, probably. I didn’t ask.

High-Throughput Graph Abstraction at Netflix: Part I

By Oleksii Tkachuk, Kartik Sathyanarayanan, Rajiv Shringi

Introduction

Netflix has a diverse range of graph use cases, each serving specific business needs with unique functionality and performance requirements. These use cases fall into two broad categories:

  1. OLAP: These use cases typically involve open-ended and algorithmic exploration of large graph datasets. They often utilize industry-standard models and languages such as RDF with SPARQL, Property Graphs with Gremlin or openCypher, and even SQL. The primary focus in these situations is in-depth analysis, rather than achieving high throughput and low latency.
  2. OLTP: These use cases require extremely high throughput — up to millions of operations per second — while delivering traversal results within milliseconds. Achieving such a level of performance often requires making trade-offs, which can include accepting eventual consistency or restricting query complexity. For example, the service can demand a specified starting point for traversals and enforce a maximum traversal depth. Such use cases are often directly tied to streaming or user experiences and demand high global availability.

Netflix’s Graph Abstraction was designed specifically for this second category of use cases. As of this writing, the abstraction is handling close to 10 million operations per second across 650 TB of graph datasets with low latency and cost efficiency.

This post is the first in a multi-part series that explores the Graph Abstraction architecture in depth. We’ll cover how the abstraction indexes data for real-time and historical views, manages strongly typed graphs, performs efficient traversals, and integrates with the Netflix Big Data ecosystem.

Usage at Netflix

From a business standpoint, the primary driver for developing the Graph Abstraction was internal demand for supporting several key use cases:

  • Real-Time Distributed Graph (RDG): A graph capturing dynamic relationships across entities and interactions throughout the Netflix ecosystem. You can learn more about the initial RDG implementation in this insightful blog post. This functionality has since been integrated into the Graph Abstraction.
  • Social Graph: A graph of social connections within Netflix Gaming, designed to boost user engagement.
  • Service Topology: A graph of all internal Netflix services, used for real-time and historical analysis to improve root cause analysis during incidents.

Let’s examine the overall architecture of the Graph Abstraction and how it integrates with the Netflix Online Datastore ecosystem.

Architecture

Instead of building the persistence and caching layers from scratch, we chose to build taller on top of existing Netflix data abstractions.

The Key-Value (KV) Abstraction stores the latest view of nodes and edges, serving as the real-time index for all queries. Optionally, users can plug-in the TimeSeries (TS) Abstraction if they are interested in a historical view of how the graph evolves over time. Additionally, we use EVCache to achieve low-millisecond latencies and are actively experimenting with more specialized caching layers to further improve performance. Finally, the Graph Abstraction integrates with the Data Gateway Control Plane to manage graph schemas and automate the provisioning, deletion, and configuration of datasets in both KV and TS.

Property Graph Model

The Abstraction uses the Property Graph model to store its data. The graph consists of nodes and edges of various types, each with associated properties. These properties are strongly typed to enable efficient filtering and ensure consistent data exports. For semantic reasons, edges can be either unidirectional or bidirectional.

Namespaces

The Abstraction separates data into isolated units called “namespaces.” Each namespace is associated with a physical storage layer, as configured in the Data Gateway Control Plane, and can be deployed on either dedicated or shared hardware. The optimal, most cost-effective hardware configuration is determined by our provisioning automation, based on user-provided requirements such as throughput, latency, dataset size, and workload criticality. For more details on this topic, see this talk given by our stunning colleague Joey Lynch at AWS re:Invent.

Graph Schema

Each namespace is further associated with an explicit graph schema configured in the Control Plane. The graph schema defines node and edge types, allowed properties, permitted relationships, and directions.

The Graph schema is implemented as a collection of edge mappings that describe the nature of the relationship between given node types.

{
"edgeConfig": {
"edgeMappings": [
{
"edgeMappingKey": {
"fromNodeType": "account",
"edgeType": "owns",
"toNodeType": "profile"
},
"directionType": "UNIDIRECTIONAL"
},
{
"edgeMappingKey": {
"fromNodeType": "profile",
"edgeType": "linked_to",
"toNodeType": "device"
},
"directionType": "BIDIRECTIONAL"
}
]
}
}

Edge mappings are further extended with specification of property schema that consists of allowed property names and their type specification:

{
"edgeMappingKey":{
"fromNodeType":"profile",
"edgeType":"linked_to",
"toNodeType":"device"
},
"propertySchema":{
"propertyMappings":[
{ "propertyKey":"registration_time", "propertyValueType":"TIMESTAMP" },
{ "propertyKey":"status", "propertyValueType":"STRING" }
]
}
}

The Abstraction servers load this schema on startup and build an in-memory metadata graph of possible relationships, enabling several key optimizations:

  • Data Quality: The Abstraction rejects non-conforming nodes, edges, and properties during writes, ensuring high data quality and consistent exports.
  • Query Planning: The Abstraction uses the schema to quickly construct the possible traversal paths the service should take to answer a given user query.
  • Deduplication of Traversed Edges: For bidirectional traversals on edges between the same node type, the schema helps avoid redundant processing by deduplicating traversed paths.
  • Eliminating Traversal paths: For a given user query, the Abstraction removes traversal paths associated with impossible relationships, as well as those where filters or property types are incompatible.

Further, the Abstraction servers periodically poll the schema from the Data Gateway Control Plane in order to keep it updated with user changes. Looking ahead, we plan to leverage the graph schema for additional improvements, such as:

  • Minimizing Query Fanout: By using edge cardinality within edge mappings, we aim to select the most efficient traversal paths and minimize query fanout.
  • Improved Developer Experience: The schema will support generating a type-safe data access layer and enhance the Gremlin-like API with schema awareness.

Next, let’s look at how this data is organized in a real-time index within the KV Abstraction.

Real-Time Index: Key-Value Storage

Before we discuss how the data is organized into graph indexes, let’s discuss how KV organizes data within namespaces and provides idempotency guarantees:

  • Data partitioning: A namespace is associated with a table in the underlying storage layer. Within the table, data is partitioned into records by unique IDs, with each record holding multiple sorted items as key-value pairs. This structure effectively makes each namespace a map of sorted maps, providing flexibility for diverse access patterns.
  • Idempotency: Writes to a given ID and key are idempotent, enabling request hedging and safe retries. The idempotency token contains a timestamp, which KV uses to enforce Last-Write-Wins (LWW) semantics at the storage layer.

We use the KV as the underlying storage for all real-time graph indices on nodes and edges. For more on Netflix’s Key-Value Abstraction, see this excellent post published by our KeyValue team.

Node Storage

The two-tiered partitioning strategy works well for node storage. Each node type is isolated within its own KV namespace, which stores all the properties for nodes of that type.

This storage format enables several efficient access patterns for nodes:

  • Efficient reads: A given node and all its properties are fetched in a single partition lookup, achieving single-digit millisecond latency.
  • Property selection pushdown: Target property keys are pushed down to the KV layer, reducing the amount of data fetched and further decreasing latencies and network overhead.
  • Property filtering pushdown: Property keys and values can be efficiently filtered at the KV layer.
  • Efficient exports: This model supports highly parallelized node exports by node type.

Edge Storage

Links and Property Index

Edges utilize two distinct types of indexes: one exclusively for the edge connections (links), and one for edge properties.

The Edge links are arranged as an adjacency list mapping source nodes to their connected neighbors.

The Edge Property index stores information about properties of every edge.

Separating edge links from their properties brings several benefits, but also introduces a key trade-off:

Benefits:

  • Efficient property upserts: Allows individual properties to be upserted over time without needing to read the entire property set for an edge.
  • Wide row prevention: Decoupling edge links from their properties prevents large partitions in databases like Cassandra, enabling efficient storage and low-latency reads — even for edges with millions of connections.

Trade-off:

  • Non-atomic writes: Storing edges across multiple namespaces means that writes across these namespaces are not atomic. We’ll discuss how this is addressed in the Consistency Enforcement section.

Forward and Reverse Indexes

Additionally, edge indexes are separated into forward and reverse indexes to support traversals in either direction. The illustration below shows an example of the reverse index counterpart for the links namespace shown above.

To ensure consistent record identifiers when updating edge properties in either direction, the Abstraction lexicographically sorts and concatenates the source and destination node IDs to create a direction-agnostic identifier for property storage. This ensures that properties can be accessed or mutated in a single database call regardless of the direction specified in the request.

This storage format enables several efficient access patterns:

  • Point Reads: Given an edge id, all properties can be fetched in a single partition lookup on the properties index.
  • Range Reads: Given a source node, a range read on a partition in the links index can efficiently return all edges. Depending on the desired direction, the Abstraction can target the forward or reverse index.
  • Property Filtering: Properties are fetched only for the links that match the record or page limit criteria, minimizing the data exchanged over the network.
  • Sort Orders: By default, edge links are sorted lexicographically by their target node. To support fetching the latest connections, the Abstraction retrieves target edge links in memory, sorts them by their last-write time, and returns the results. In order to ensure optimal performance without exerting too much memory pressure, we aim to limit the number of edges per source node within the system.

Next, let’s explore the caching strategies used by the Abstraction.

Caching Strategies in Graph Abstraction

Although the Graph Abstraction already provides efficient reads and writes to durable storage, caching remains critical for the stability and performance of any graph datastore for two key reasons:

  • Write amplification: A single write on the fronting service can result in multiple writes to the backing durable storage due to the use of multiple indexes. Whenever possible, it’s best to avoid unnecessary writes — for example, by not writing an edge link that already exists.
  • Read amplification: A single traversal request on the fronting service may translate into thousands of fetch operations on the backend, especially for highly interconnected graphs.

To address these challenges, the Graph Abstraction employs two distinct caching strategies.

Write-aside Caching of Edge Links

An edge link contains no additional information beyond the link itself and its last-write timestamp. To reduce write amplification on durable storage, we cache edge links for short durations, helping to avoid writing a link that already exists. This mechanism is balanced with configurable TTL windows, cache invalidation on deletes, and lease acquisitions with exponential backoff. These strategies provide the necessary consistency guarantees while still allowing the last-write timestamp to be refreshed according to the predefined staleness.

Read-aside Caching of Properties

To reduce read amplification on the durable store, the Graph Abstraction leverages KV’s integration with EVCache. Multiple KV namespaces can share the same caching clusters for cost efficiency. The Abstraction first fetches data from durable storage, while subsequent reads are served from the cache. Caching is applied at both the record and item levels, benefiting all graph objects.

Graph Abstraction employs two invalidation strategies, selected based on write throughput and consistency requirements:

  • Invalidation on write: Both record and item caches are invalidated with every write, ensuring consistency across regions. This strategy is ideal for graphs that change infrequently and cannot tolerate data staleness, but comes with the tradeoff of pushing a higher throughput on the cache.
  • TTL-driven invalidation: Cache entries are invalidated only when their TTL expires. This approach works best for frequently modified objects that can tolerate some staleness.

Work In Progress: Write-Through Caching

We are also developing a write-through caching strategy designed to store most of the data required by the Abstraction during traversals. This caching mechanism can organize indexes by different sort orders (e.g., sorting data by last-write timestamp), at the cost of increased memory consumption. Stay tuned for more details on this approach.

Next, let’s examine the consistency guarantees in Graph Abstraction and how they are enforced for both reads and writes.

Consistency Enforcement

Enforcing data consistency in Graph Abstraction poses several challenges. The connected nature of the data, low-latency API requirements, and the need to handle intermittent failures have led to design choices that enforce strict eventual consistency across multiple regions.

Entropy Repair

Each write in the Abstraction persists data for both inward and outward indices in parallel to support high throughput. Further, each write happens on multiple KV namespaces. To prevent inconsistencies or lasting entropy from failures in any operation, the Abstraction uses a robust retry mechanism using Kafka:

Node Deletions

Deleting nodes in a highly connected graph is more complex than simply removing a KV record as each node may have thousands of connected edges that must be handled to maintain graph integrity. Further, synchronously deleting all such connections would introduce unacceptable latency for the Abstraction callers.

The Abstraction employs an asynchronous deletion strategy to manage this issue. The consequence of this approach, however, is that the observed mutated state is only eventually consistent. Further, to ensure correctness of asynchronous deletes during concurrent updates, the Last-Write-Wins (LWW) conflict resolution mechanism is essential.

Global Replication

The consistency guarantees of Graph Abstraction are shaped by its multi-region availability. As illustrated in the diagram below, both the caching layer and durable storage replicate data asynchronously across regions, resulting in an eventually consistent system.

Now that we’ve covered storing the real-time graph index, let’s see how it enables graph traversals.

Graph Traversals

The Abstraction provides a custom gRPC traversal API, inspired by Gremlin, which enables exploration of the distributed graph by letting users chain traversals, apply filter criteria, sort results, limit results, and more.

Let’s explore a hypothetical scenario where the Abstraction is used to recommend shows to users on a shared device, by considering the duration of the most recent viewing session for each show across all profiles and accounts associated with that device:

TraversalRequest.newBuilder()
.setNamespace("<graph-namespace>")
.setTraversalQuery(
TraversalQuery.newBuilder()
// Given id of the 'device' node type.
.setStartNode(node("device", "my-device-id"))
.setTraversal(
Traversal.newBuilder()
// fetch the first 5 connections
.setEdgeLimit(5)
.setDirectionTraversal(
DirectionTraversal.newBuilder()
// traverse in the IN direction
.setDirection(IN)
// minimize data exchange: only interested in certain properties
.addNodePropertiesSelections(propSelection("account", "created_at"))
.addNodePropertiesSelections(propSelection("profile", "last_active"))
.setDirectionFilter(
DirectionFilter.newBuilder()
// only interested in certain connected types
.setTypeMatchingStrategy(EXCLUDE_NON_TARGETED)
.addAllNodeFilters(typeFilters("account", "profile"))))
// chain traversals to the intermediate result
.addNextTraversals(
Traversal.newBuilder()
.setOrder(LATEST)
// limit to 200 connections for the 2nd hop
.setEdgeLimit(200)
.setDirectionTraversal(
DirectionTraversal.newBuilder()
// now traverse in the OUT direction
.setDirection(OUT)
.addEdgePropertiesSelections(propSelection("watched", "view_time"))
.addEdgePropertiesSelections(propSelection("has_plan", "active"))
.setDirectionFilter(
DirectionFilter.newBuilder()
.setTypeMatchingStrategy(EXCLUDE_NON_TARGETED)
.addAllNodeFilters(typeFilters("title", "plan")))))))
.build();

And let’s visualize the intended results set produced by the request above:

We’ll explore the design and implementation of traversal planning and execution, along with different traversal types, in the Part II of this blog series.

Now let’s look at the performance metrics of Graph Abstraction based on current production use cases.

Real World Performance

Across all applications at Netflix, Graph Abstraction ensures high availability while processing up to 10 million operations per second across all writes, individual edge / node reads and traversals at peak hours:

Edge and node persistence achieve single-digit millisecond latencies (p99 shown in red, p90 shown in orange, and p50 shown in green):

Traversal performance depends on the number of hops, the edge fanout at each stage, and associated filters and sort orders. We parallelize work as much as possible to reduce latencies. Typically 1-hop traversals are executed with single-digit millisecond latency:

1-hop traversal latencies

We also support a Count API that performs counting traversals at a very high rate with similar latencies, which we will cover in Part II of this series:

Currently, the RDG is powered by 2-hop traversals with a higher degree of fan-out. While these operations can reach upwards of 100 ms in latency, the 90th percentile (p90) latency remains under 50ms.

2-hop traversal latencies

We track the average and max edge fanout at different depths to give us insights into the traversal performance for different graph datasets.

Median edge fan-out
Max edge fan-out

Asynchronous operations such as node deletions can be slightly latent, but typically perform with sub-second latency:

At the moment, we are storing close to 650 TB of data globally across all our graph datasets.

Conclusion

As Netflix scales further into new verticals such as live content, games, and ads, Graph Abstraction will remain crucial for uncovering and leveraging rich connections — while continuing to support a high throughput and availability at low latencies.

Stay tuned for Part II of this blog series, where we’ll explore the implementation of graph traversals, counting and constraint mechanisms.

In Part III, we’ll take a closer look at the temporal index implementation and its integration with the Time Series Abstraction.

Acknowledgments

Special thanks to our stunning colleagues who contributed to Graph Abstraction’s success: Kaidan Fullerton, Joey Lynch, Sudhesh Suresh, Vinay Chella, Sumanth Pasupuleti, Vidhya Arvind, Raj Ummadisetty, Jordan West, Chris Lohfink, Joe Lee, Jingxi Huang, Jessica Walton, Prudhviraj Karumanchi, Akashdeep Goel, Sriram Rangarajan, Chris Van Vlack, Christopher Gray, Luis Medina, Ajit Koti, Mohidul Abedin.


High-Throughput Graph Abstraction at Netflix: Part I was originally published in Netflix TechBlog on Medium, where people are continuing the conversation by highlighting and responding to this story.

ScyllaDB Customer Experience Spotlight: Tyler Denton

Welcome to the first installment of a new blog series introducing some of the experts you’re likely to encounter when you work with ScyllaDB. Tyler Denton is a Solutions Architect on the Customer Experience team here at ScyllaDB. He lives in Fort Myers Florida, USA. He’s been at ScyllaDB for about a year, Let’s get to know a little about Tyler… What do you do here at ScyllaDB I’m a Solutions Architect, which is sometimes known as a Sales Engineer or Solutions Engineer. I help customers or prospects review their architectures and find the best place for ScyllaDB to be deployed. Does it make sense, and what’s the most efficient, impactful way to use ScyllaDB in their product or solution? I’m also our field AI subject Subject Matter Expert, so I do a lot with our vector search, a lot with our feature store deployments, agent, state management…things like that. Please share a little about your path to ScyllaDB My first job ever was as a machinist’s mate operating nuclear reactors in the US Navy. That might seem like an odd place to start for somebody who works as a Solutions Architect…but what that taught me was systems. How does the failure of a main steam root valve affect the starboard steam generators? Understanding how complex systems interact and work together taught me a lot about architecture and how to build systems that can survive failure. I started writing software in about the sixth grade and continued doing that, and so I’ve worked at companies like AWS, Couchbase, Rockset (acquired by OpenAI), and that all kind of led me here — where I can focus heavily on bringing large, distributed systems into production and focusing on AI. Tell me about one of the most interesting projects you’ve worked on here One of the most interesting projects I’ve worked on here is an AdTech company that used every feature of our flagship product, ScyllaDB X Cloud. We got to see the major, nearly instantaneous scaling of ScyllaDB. If anybody’s ever used Cassandra or earlier versions of ScyllaDB, you know that wide-table databases can be very hard to scale and can take a long time. Here, we were able to go from 6 nodes to 60 nodes in about 15 minutes, and the throughput and performance we saw from that was absolutely incredible. Watching this develop in real time was very cool and very rewarding. What’s the most impressive ScyllaDB feat you’ve seen a team accomplish Right now, I’m working on bringing a deployment into production where we were displacing another technology. By moving from their existing data model to one supported in ScyllaDB using static maps, we saw a huge cost reduction and a huge performance improvement. They were able to support queries across very complicated data structures in sub-millisecond latency across their entire corpus of data, and they were able to do that because they migrated to ScyllaDB. What do you like to do when you’re not working or on-call When I’m not working my day job, I focus a lot on building my AI knowledge. I do a lot of speaking engagements, development work, and community outreach. And when I’m not doing that, I’m working on my boat. Every now and then I actually get to take it out, but anybody who owns a boat knows that most of the time is spent actually working on it. What’s your top tip for getting the most out of ScyllaDB Follow the instructions. RTFM. Don’t try to be unique. ScyllaDB is designed to solve very specific use cases, and it does that incredibly well. When you try to get creative and build a database within a database, or start doing things ScyllaDB wasn’t designed for, it gets painful really fast, and ScyllaDB will punish you. So just read the manual, follow the best practices, and you’ll have a great time.

ScyllaDB Elastic Scaling in Action [Demo]

Watch along to see how fast ScyllaDB X Cloud scales from 10K to 1M ops/sec and back down again – with single-digit millisecond latency ScyllaDB X Cloud is ScyllaDB’s fully-managed database-as-a-service. It’s a truly elastic database designed to support variable/unpredictable workloads with consistent low latency as well as low costs. We’ve previously blogged about how users can scale out and scale in almost instantly to match actual usage. For example, you can scale all the way from 100K OPS to 2M OPS in just minutes, with consistent single-digit millisecond P99 latency. This means you don’t need to overprovision for the worst-case scenario or suffer the lag traditionally associated with ramping up capacity in response to a sudden surge. In this post, I want to show you how it looks in action: increasing capacity 10X, as well as scaling it back, in minutes. Part 1: Scaling 10X Fast, with Single-Digit Millisecond P99 Latency This first video provides a quick look at how fast ScyllaDB XCloud scales out to increase capacity. It shows you how ScyllaDB’s new tablets architecture lets you scale a cluster to support 10x or more workload capacity in minutes (vs. the usual hours or days). Simulating a massive sales event, we scale a cluster from a moderate 100K ops/sec up to 1M. As we start, the cluster is currently managing a moderate load of 100K ops/sec across three small nodes. Knowing that a surge of 1M ops/sec is imminent, we use the built-in calculator to precisely size our needs. By simply entering the desired read and write throughput and selecting the schema complexity, the system automatically determines the necessary vCPU requirement. In this case, we add three larger nodes to our existing setup. Once the new scaling policy is saved, you can watch the scaling happen as the nodes join and tablets are automatically streamed and rebalanced in parallel. In this demo, the entire scale-out process, including data rebalancing, completed in roughly 23 minutes—all while the cluster remained under load. You’ll see that the new nodes immediately start sharing the responsibility of serving requests even before the rebalancing is fully finished. Finally, we simulate the 10x load jump to 1 million operations per second. You can see that even with mixed instance sizes, ScyllaDB perfectly balances the workload, with the larger nodes serving more requests as expected. Most importantly, despite this massive increase in traffic, the cluster maintains impressive performance. It achieves single-digit millisecond P99 latencies throughout the entire event. Part 2: Achieving Rapid Parallel Scale-Down After Peak Workload This next video demonstrates the process of scaling the ScyllaDB cluster back down to its original size following a simulated high-traffic sales event. You can see how the system handles a drop from 1M ops/sec down to its baseline load. After running at 1M ops/sec for about 20 minutes, our simulated sale event has concluded. That means our load is dropping back to its original 100K ops/sec. Once the load stabilizes and the monitoring overview panel confirms that we are back to 20K writes and 80K reads, we’re ready to scale the cluster back to its original size of 24 vCPUs. To do this, we simply update the scaling policy back to 24 vCPUs. That leaves us with the same three 2x large nodes we had before the simulated sale event started. As the scaling progress begins, we can watch the nodes leave the cluster in real-time. By viewing the monitoring dashboard’s detailed panel, we can see an animation of the tablets streaming from the larger 8x nodes back to the original nodes. Once that’s completed, the cluster is back to its original configuration of three nodes. The scale-down process took about 22 or 23 minutes, which is nearly identical to the time it took to scale up earlier (in the other video). While scaling out has always been fast with tablets, scaling back down used to be a sequential process. Now, starting with version ScyllaDB 2026.1.3, we can scale the cluster in parallel both out and back. That makes it possible to handle a massive workload spike and return to baseline capacity all within about an hour. ScyllaDB Cloud – Free Trial

Apache Cassandra Performance Tuning: What We Learned

This blog post (tries to) consolidate what we’ve learned from years of tuning Apache Cassandra for performance Here at ScyllaDB, we often run internal and external performance comparisons. Internal testing helps ensure ScyllaDB’s performance advantage, track performance regressions, and maintain compatibility, including catching subtle API semantic-layer changes early. External comparisons are our way to aggregate the performance results for the general public every once in a while. Performance tuning can be a double-edged sword. Overlook one aspect, and you may end up under- or overestimating one’s performance numbers – and that may introduce deep ramifications down the road. While ScyllaDB and Cassandra both share a common API layer and feature set, both systems have fundamentally different architectures. This naturally adds to differences in how each system is tested and tuned. This blog post (tries to) consolidate what we’ve learned from years of tuning Apache Cassandra for performance. We spent a good amount of time hunting down the information we needed. Hopefully, the details described here help others improve their existing Cassandra cluster performance, as well as conduct more meaningful performance comparisons. Side-note: ScyllaDB shares how to reproduce our tests, including references on which settings and parameters we tuned. Check out our Cassandra 4 vs Cassandra 3.11 comparison, my recent talk on how ScyllaDB compares to Cassandra 5, and the comparison between Cassandra vNodes and ScyllaDB tablets as some concrete examples. Overview Perhaps the most relevant Apache Cassandra tuning source publicly available is Amy’s Cassandra 2.1 tuning guide. Despite its 2.1 reference (released in 2014), we find that most of the guidance (or, at least, the high-level concepts) provided there survived the ashes of time, including the array of settings that administrators need to configure by hand. Despite the over-a-decade-long difference, one of Amy’s particular thoughts stands out, and should guide you whenever you’re working with Apache Cassandra tuning:
“The inaccuracy of some comments in Cassandra configs is an old tradition, dating back to 2010 or 2011. (…) What you need to know is that a lot of the advice in the config commentary is misleading. Whenever it says “number of cores” or “number of disks” is a good time to be suspicious. (…)” – Excerpt from Amy’s Cassandra 2.1 tuning guide, cassandra.yaml section
Apache Cassandra was originally conceived to run on commodity hardware. It is shipped under the assumption that the end user will configure and tune it for their specific environment. And it also assumes users know what they’re doing. What’s counterintuitive about Apache Cassandra tuning is how small settings can have an outsized impact on performance. Figure 1 perfectly demonstrates this aspect. It shows how both throughput and latencies vary significantly under different GC, compaction, and disk read-ahead settings. Figure 1 – Apache Cassandra 5 performance under different settings One last note before we dive right into tuning specifics: our goal is not to replace Amy’s well-covered, exhaustive guide. Instead, take our words as a complementary reference. We also don’t claim to be experts in the art of Cassandra performance tuning or troubleshooting; rather, we’re practitioners who learned some things (the hard way). Cassandra-Specific Tuning At a minimum, focus your efforts on the following files: cassandra.yaml jvm[NN]-server.options jvm-server.options cassandra.yaml To help users get started, a stock Apache Cassandra installation ships with two config files. The first file – cassandra.yaml – is oriented for users upgrading from a previous Cassandra release and comes with backward-compatible settings. The second – cassandra_latest.yaml – “contains configuration defaults that enable the latest features of Cassandra, including improved functionality as well as higher performance. This version is provided for new users of Cassandra who want to get the most out of their cluster, and for users evaluating the technology.” Source: the Cassandra project. If you spin a fresh cassandra:5 container or simply initiate your tuning journey without taking this into consideration, you’ll end up running your deployment under compatibility mode. The following command demonstrates how a freshly spun Cassandra 5 container starts under compatibility mode, rather than enabling its latest features: root@container:/etc/cassandra# diff cassandra.yaml cassandra_latest.yaml | sed 's/^>/[cassandra_latest.yaml]/g;s/^</[cassandra.yaml]/g' | egrep 'compatibility|memtable' | sort [cassandra.yaml] memtable_allocation_type: heap_buffers [cassandra.yaml] storage_compatibility_mode: CASSANDRA_4 [cassandra_latest.yaml] memtable_allocation_type: offheap_objects [cassandra_latest.yaml] storage_compatibility_mode: NONE It’s beyond the scope of this write-up to provide an exhaustive list of settings you should pay attention to when setting up Cassandra. The stock cassandra.yaml is often irrelevant, and we ended up simply replacing it with the cassandra\_latest.yaml instead. If you are starting a fresh new cluster, we highly recommend you do the same. However, you probably want need to be extra cautious if you are an existing Cassandra user. Oftentimes the semantics of a particular setting may change entirely, making it particularly hard to track down. In one of our streaming performance tests, we noticed Cassandra’s streaming operations had a default cap of 24MiB/s per node, resulting in suboptimal transfer times. Upon raising those thresholds, we observed: Cassandra 4.0 docs mentioned tuning the stream_throughput_outbound_megabits_per_sec option Both Cassandra 4.1 and Cassandra 5.0 docs referenced the stream_throughput_outbound option Only reading this Instaclustr article (or carefully interpreting cassandra\_latest.yaml) eventually shed some light on the correct option: entire_sstable_stream_throughput_outbound. In other words, 3 distinct settings exist for tuning the previous 3 major releases of Apache Cassandra – and one of them was incorrectly documented under the official project’s page. This raises concerns about the feasibility of upgrading from older releases. Given these constraints, we highly encourage organizations to conduct a careful review and full round of testing on their own. This is not an edge case; others noted similar upgrade problems on the Apache Cassandra Mailing List. With that in mind, here are some examples of misleading Cassandra config comments and why upgrades deserve some extra diligence: CASSANDRA-16315 – Covers the concurrent_compactors setting CASSANDRA-7139 – Describes how that same concurrent_compactors setting default was production unsafe when introduced CASSANDRA-20692 – Describes how a commitlog correctness issue slipped through to Cassandra 5 JVM settings Test Kind Garbage Collector Read-ahead Compaction Throughput P99 Latency Throughput Cassandra RA4 Compaction256 ZGC 4KB 256MB/s 6.662ms 120K/s Cassandra RA4 Compaction0 ZGC 4KB Unthrottled 8.159ms 120K/s Cassandra RA8 Compaction256 ZGC 8KB 256MB/s 4.657ms 100K/s Cassandra RA8 Compaction0 ZGC 8KB Unthrottled 4.903ms 100K/s Cassandra G1GC G1GC 4KB 256MB/s 5.521ms 40K/s Tuning the JVM is the least fun part of operating a Cassandra cluster. It can be a journey on its own, really. The good news is that Cassandra 5 includes support for JDK17, and users may now opt-in for using ZGC rather than the decades-long G1 garbage collector. Unless you are a Java expert and know exactly what you are doing, this theLastPickle article is perhaps your best resource for tuning Cassandra’s JVM. You could read that and call it a day. Still, here are some details on what we’ve discovered along the way, since the DataStax (now IBM) Tuning Java resources page only advises under a remark of adjusting “settings gradually and test each incremental change”: We’ve consistently measured lower latencies and higher throughput using ZGC under a handful of different scenarios. Although we’ve seen some users reporting good G1 performance results, this doesn’t align with what we’ve experimented with in practice. Remember that Cassandra relies on both off-heap as well as on-heap memory. The heap size will depend on how much RAM your setup has. Since we primarily test on 128GB RAM machines, we found that allocating beyond 32G would be wasteful. theLastPickle‘s article mentioned earlier makes a good point about compressed OOPs, though we believe this should be relevant for RAM constrained systems. We didn’t observe any noticeable benefits/disadvantages from having 31G/32G in our results. Most of the JVM settings will sit under the jvm17-server.options file (if you’re using JDK17). However, there is yet another file (jvm-server.options, note there’s no Java version) that you should also edit. Apparently Cassandra has some built-in scriptology in cassandra.in.sh that looks up the latter and inherits options from it. Then, if your heap settings (-Xmx & -Xms) are unset, it will automatically define it for you: ################# # HEAP SETTINGS # ################# # Heap size is automatically calculated by cassandra-env based on this # formula: max(min(1/2 ram, 1024MB), min(1/4 ram, 8GB)) # That is: # - calculate 1/2 ram and cap to 1024MB # - calculate 1/4 ram and cap to 8192MB # - pick the max # # For production use you may wish to adjust this for your environment. # If that's the case, uncomment the -Xmx and Xms options below to override the # automatic calculation of JVM heap memory. # # It is recommended to set min (-Xms) and max (-Xmx) heap sizes to # the same value to avoid stop-the-world GC pauses during resize, and # so that we can lock the heap in memory on startup to prevent any # of it from being swapped out. #-Xms4G #-Xmx4G Therefore, uncomment and override the two lines above for your environment. After you are done, you may want to circle back to the cassandra.yaml file because there are some settings that influence your heap allocation. For example: networking_cache_size file_cache_size memtable_offheap_space repair_session_space among others… If you feel like Cassandra is choking and the system is not under heap pressure, then playing with these settings is probably your next step. Sadly, this is where things become trial-and-error, and even more time consuming. (Though, in Cassandra’s defense, tuning most of these parameters is workload specific). About Cassandra Caching Apache Cassandra ships two caching-related settings: row_cache_size and key_cache_size You should almost never enable either of these settings (0GiB means these are disabled). The only exception is when your workload has a (VERY) high cache hit ratio and is relatively static. The table below shows how both Row & Key caches have a negative performance impact in Cassandra during a scale-out: Kind Step Throughput Retries Cassandra 5.0 – Page Cache 3 > 6 nodes 56K ops/sec 2010 Cassandra 5.0 – Page Cache 6 > 9 nodes 112K ops/sec 0 Cassandra 5.0 – Row & Key Cache 3 > 6 nodes 56K ops/sec 5004 Cassandra 5.0 – Row & Key Cache 6 > 9 nodes 112K ops/sec 8779 Likewise, Figure 2 shows how throughput varies significantly under a fully cached workload:   Figure 2 – Cassandra Row Cache vs OS Page Cache performance (speedup falls between 1.14x to 1.5x) Figure 2 – Cassandra Row Cache vs OS Page Cache performance (speedup falls between 1.14x to 1.5x) An old DataStax (IBM) documentation page strongly discourages its use, noting that users should prefer using the OS page cache instead: Note: Utilizing the appropriate OS page cache will result in better performance than using row caching. Counterintuitively, DataStax (IBM) later recommends enabling the Row Cache when the number of reads dominate compared to writes: Tip: Enable a row cache only when the number of reads is much bigger (rule of thumb is 95%) than the number of writes. Consider using the operating system page cache instead of the row cache, because writes to a partition invalidate the whole partition in the cache. OS Tuning Operating system tuning for Cassandra shares many similarities with other databases. Preventing swapping, tuning the kernel via sysctl, setting disk read_ahead_kb settings, configuring user limits and enabling Transparent HugePages are the primary settings we touch when deploying Cassandra. This is (undoubtedly) a non-exhaustive list, although it should cover the strategies seen across most production Cassandra deployments in practice. Depending on your setup, you may want to further check: your clocksource – particularly under Xen hypervisors; whether cpupower supports setting the CPU scaling governor to “performance” mode; experimenting with jemalloc; configuring SMP IRQ Affinity; and pinning Cassandra to specific CPUs via taskset(1). Disks We primarily store Cassandra related files (including its related logs) on locally-attached NVMe disks, as commonly found within cloud hyperscalers. If there’s more than one attached disk to the VM, we combine them into a RAID-0 array using mdadm. In addition, we use XFS as the backing filesystem, particularly as it’s the same we use for ScyllaDB. We also set only one-hit merges, limit read_ahead_kb to just 4kB, and disable the IO scheduler (if any): MD_NAME=nvme1n1 sudo sh -c "echo 1 > /sys/block/$MD_NAME/queue/nomerges" sudo sh -c "echo 4 > /sys/block/$MD_NAME/queue/read_ahead_kb" sudo sh -c "echo none > /sys/block/$MD_NAME/queue/scheduler" Some important remarks: the scheduler command may “fail” in modern Cloud instances (and that’s fine); when using mdadm, tune each block device individually backing the RAID device; read_ahead_kb is a workload dependent setting. We often test small partition lookups, but workloads with larger wide-rows may benefit from increasing that setting. Memory We don’t configure swapping at all to keep matters simple. The rationale is that Cassandra already benefits from the OS page cache, and we leave over half of the server’s RAM just for it. During our tests, we also observed that enabling Transparent Huge pages, especially with ZGC, contributed positively to Cassandra’s performance. Although the improvement wasn’t remarkable, we observed positive results similar to what both Amy and Netflix reported. The provided links already go in-depth on how to enable THP, as well as how to configure Cassandra to benefit from it. Keep in mind, however, that we recommend you set the -XX:+AlwaysPreTouch JVM option regardless of whether THP is enabled or not. That’s because it’s known to improve overall JVM runtime performance at the expense of increased JVM startup times. Kernel and User limits Put simply, you don’t want Cassandra to be limited on either networking, memory allocation, or the number of files it can open. We set sysctl.conf.d/99-cassandra.conf to the following values: net.ipv4.tcp_keepalive_time=60 net.ipv4.tcp_keepalive_probes=3 net.ipv4.tcp_keepalive_intvl=10 net.core.rmem_default=16777216 net.core.wmem_default=16777216 net.core.optmem_max=40960 vm.max_map_count = 1048575 net.ipv4.tcp_rmem = 4096 87380 16777216 net.ipv4.tcp_wmem = 4096 65536 16777216 net.core.rmem_max = 16777216 net.core.wmem_max = 16777216 net.core.netdev_max_backlog = 2500 net.core.somaxconn = 65000 net.ipv4.tcp_ecn = 0 net.ipv4.tcp_window_scaling = 1 net.ipv4.ip_local_port_range = 10000 65535 net.ipv4.tcp_syncookies = 0 net.ipv4.tcp_timestamps = 0 net.ipv4.tcp_sack = 0 net.ipv4.tcp_fack = 1 net.ipv4.tcp_dsack = 1 net.ipv4.tcp_orphan_retries = 1 vm.dirty_background_bytes = 10485760 vm.dirty_bytes = 1073741824 vm.zone_reclaim_mode = 0 fs.file-max = 1073741824 vm.max_map_count = 1073741824 Lastly, the user running Cassandra must be allowed to allocate enough resources for the process to run. As our VMs are short-lived, we enable unlimited limits.conf consumption to all users: * - nofile 1000000 * - memlock unlimited * - fsize unlimited * - data unlimited * - rss unlimited * - stack unlimited * - cpu unlimited * - nproc unlimited * - as unlimited * - locks unlimited * - sigpending unlimited * - msgqueue unlimited Parting Thoughts As demonstrated, Apache Cassandra performance tuning is far from a one-size-fits-all solution. The settings described throughout this article represent what worked for our specific hardware setups and workload profiles. If your deployment spans different hardware, many of the values presented here will likely need to be revisited. This brings us to (perhaps) the most underappreciated cost in Cassandra operations: dependency. That is, every tuning decision is implicitly a contract with the underlying hardware. Adding more disks, increasing CPU/RAM, changing workloads are some overlooked aspects that will require entirely new tuning cycles and re-evaluating your previous decisions. ScyllaDB was designed with this problem in mind. Its shard-per-core architecture and self-tuning capabilities automatically adapt to the underlying hardware, eliminating much of the manual iteration and tuning described here. There’s no JVM at all, and most of the OS heavy lifting is carried out for you via an automated script shipped alongside the core database. If Cassandra performance has been a bottleneck, you’re concerned about the recent IBM acquisition, or you’ve simply spent too much time fighting tuning instead of building – give ScyllaDB a try. And if you want to have a technical discussion about your use case, let us know.