When to Use ScyllaDB vs MongoDB: Lessons Learned From 5+ Years in Production

MongoDB vs ScyllaDB: When to choose each

Numberly has been using both ScyllaDB and MongoDB in production for 5+ years. Learn which NoSQL database they rely on for different use cases and why.

Within the NoSQL domain, ScyllaDB and MongoDB are two totally different animals. MongoDB needs no introduction. Its simple adoption and extensive community/ecosystem have made it the de facto standard for getting started with NoSQL and powering countless web applications. ScyllaDB’s close-to-the-metal architecture enables predictable low latency at high throughput. This is driving a surge of adoption across teams such as Discord, TRACTIAN and many others who are scaling data-intensive applications and hitting the wall with their existing databases.

But database migrations are not the focus here. Instead, let’s look at how these two distinctly different databases might coexist within the same tech stack – how they’re fundamentally different, and the best use cases for each. Just like different shoes work better for running a marathon vs. scaling Mount Everest vs. attending your wedding, different databases work better for different use cases with different workloads and latency/throughput expectations.

So when should you use ScyllaDB vs. MongoDB and why? Rather than provide the vendor perspective, we’re going to share the insights from an open source enthusiast who has extensive experience using both ScyllaDB and MongoDB in production: Alexys Jacob, the CTO of Numberly. Alexys shared his perspective at ScyllaDB Summit 2019, and the video has been trending ever since.



Here are three key takeaways from his detailed tech talk:

Scaling Writes is More Complex on MongoDB

The base unit of a MongoDB topology is called a replica set, which is composed of one primary node and usually multiple secondary nodes (think of hot replicas). Only the primary node is allowed to write data. After you max out vertical write scaling on MongoDB, your only option to scale writes becomes what is called a sharded cluster. This requires adding new replica sets because you can’t have multiple primaries in a single replica set.

Sharding data across MongoDB’s replica sets requires using a special key to specify what data each replica set is responsible for, as well as creating a metadata replica set that tracks what slice of data lives on each replica (the blue triangle in the diagram below). Also, clients connecting to a MongoDB cluster need help determining what node to address. That’s why you also need to deploy and maintain MongoDB’s Smart Router instances (represented by the rectangles at the top of the diagram) connected to the replica sets.

The complexity of scaling writes in MongoDB
The complexity of scaling writes in MongoDB

Having all these nodes leads to higher operational and maintenance costs as well as wasted resources since you can’t tap the replica nodes’ IO for writes, which make sharded MongoDB clusters the worst enemy of your total cost of ownership as Alexys noted.

For ScyllaDB, scaling writes is much simpler. He explained, “On the ScyllaDB side, if you want to add more throughput, you just add nodes. End of story.”

The simplicity of scaling writes in ScyllaDB
The simplicity of scaling writes in ScyllaDB

Alexys tied up this scaling thread:

“Avoid creating MongoDB clusters, please! I could write a book with war stories on this very topic. The main reason why is the fact that MongoDB does not bind the workload to CPUs. And the sharding, the distribution of data between replica sets in a cluster is done by a background job (the balancer). This balancer is always running, always looking at how sharding should be done, and always ensuring that data is spread and balanced over the cluster. It’s not natural because it isn’t based on consistent hashing. It’s something that must be calculated over and over again. It splits the data into chunks and then moves it around. This has a direct impact on the performance of your MongoDB cluster because there is no isolation of this workload versus your actual production workload.”

MongoDB Favors Flexibility Over Performance, While ScyllaDB Favors Consistent Performance Over Versatility

ScyllaDB and MongoDB have distinctly different priorities when it comes to flexibility and performance.

On the data modeling front, MongoDB natively supports geospatial queries, text search, aggregation pipelines, graph queries and change streams. Although ScyllaDB – a wide- column store (a.k.a. key-key-value) – supports user-defined types, counters and lightweight transactions, the data modeling options are more restricted than on MongoDB. Alexys noted, “From a development perspective, interacting with an JSON object just feels more natural than interacting with a row.” Moreover, while MongoDB offers the option of enforcing schema validation before data insertion, ScyllaDB requires that data adhere to the defined schema.

Querying is also simpler with MongoDB since you’re just filtering and interacting with JSON. It’s also more flexible, for better or for worse. MongoDB lets you issue any type of query, including queries that cause suboptimal performance with your production workload. ScyllaDB won’t allow that. If you try, ScyllaDB will warn you. If you decide to proceed at your own risk, you can enter a qualifier indicating that you really do understand what you’re getting yourself into.

MongoDB data modeling

ScyllaDB data modeling

Alexys summed up the key differences from a development perspective:

“MongoDB favors flexibility over performance. It’s easy to interact with and it will not get in your way. But it will have impacts on performance – impacts that are fine for some workloads, but unacceptable for others. On the other hand, ScyllaDB favors consistent performance over versatility. It looks a bit more fixed and a bit more rigid on the outside. But once again, that’s for your own good so you can have consistent performance, operate well and interact well with the system. In my opinion, this makes a real difference when you have workloads that are latency- and performance-sensitive.”

It’s important to note that even queries that follow performance best practices will behave differently on MongoDB than on ScyllaDB. No matter how careful you are, you won’t overcome the performance penalty that stems from fundamental architectural differences.

Together, ScyllaDB and MongoDB are a Great NoSQL Combo

“It’s not a death match; we are happy users of both MongoDB and ScyllaDB,” Alexys continued.
Numberly selects the best database for each use case’s technical requirements.

At Numberly, MongoDB is used for two types of use cases:

  • Web backends with REST APIs and possibly flexible schemas.
  • Real-time queries over unpredictable behavioral data.

For example, some of Numberly’s applications get flooded with web tracking data that their clients collect and send (each client with their own internally-developed applications). Numberly doesn’t have a way to impose a strict schema on that data, but it needs to be able to query and process it. In Alexys’ words, “MongoDB is fine here; its flexibility is advantageous because it allows us to just store the data somewhere and query it easily.”

ScyllaDB is used for three types of use cases at Numberly:

  • Real-time latency-sensitive data pipelines. This involves a lot of data enrichment, where there are multiple sources of data that need to be correlated, in real time, on the data pipelines. According to Alexys, “That’s tricky to do…and you need strong latency guarantees to not break the SLAs [service-level agreements] of the applications and data processes which your clients rely on down the pipe.”
  • Mixed batch and real-time workloads. Numberly also mixes a lot of batch and real-time workloads in ScyllaDB because it provides the best of both worlds (as Numberly shared previously). “We had Hive on one path and MongoDB on the other. We put everything on ScyllaDB and its sustaining Hadoop-like batch workloads and real time pipeline workloads.”
  • Web backends using GraphQL, which imposes a strict schema. Some of Numberly’s web backends are implemented in GraphQL. When working with schema-based APIs, it makes perfect sense to have a schema-based database with low latency and high availability.

Alexys concluded: “A lot of our backend engineers, and frontend engineers as well, are adopting ScyllaDB. We see a trend of people adopting ScyllaDB, more and more tech people asking ‘I have this use case, would ScyllaDB be a good fit?’ Most of the time, the answer is ‘yes.’ So, ScyllaDB adoption is growing. MongoDB adoption is flat, but MongoDB is certainly here to stay because it has some really interesting features. Just don’t go as far as to create a MongoDB sharded cluster, please!”

Bonus: More Insights from Alexys Jacob

Alexys is an extremely generous contributor to open source communities, with respect to both code and conference talks. See more of his contributions at https://ultrabug.fr/

DataStax and Google Cloud Collaborate to Evolve Open Source Apache Cassandra for Generative AI

Companies everywhere are looking for ways to integrate AI into their business as popularity of generative AI technology continues to surge. While organizations need to think about how to implement AI into their business, they also need to think about what is needed to support their AI...

Introducing Vector Search: Empowering Cassandra and Astra DB Developers to Build Generative AI Applications

In the age of AI, Apache Cassandra® has emerged as a powerful and scalable distributed database solution. With its ability to handle massive amounts of data and provide high availability, Cassandra has become a go-to choice for many AI applications including Uber, Netflix, and Priceline. However,...

Security Advisory: CVE 2023-20601 Apache Cassandra®

Following the publication of CVE-2023-20601, Instaclustr began investigating its potential impact on our Instaclustr Managed Apache Cassandra® offering. This vulnerability affects Apache Cassandra from 4.0.0 through to 4.0.9, and from 4.1.0 through to 4.1.1 The vulnerability can be exploited with privilege escalation when enabling FQL/Audit logs, allowing users with JMX access to run arbitrary commands as the user running Apache Cassandra.  

The security controls that exist in our managed service—including but not limited to firewalls, intrusion detection, and compartmentalization practices—lower the risk of this vulnerability. However, our course of action will be to release Cassandra version 4.0.10 as a newer, patched version of Managed Cassandra and subsequently upgrade customers on an impacted version (i.e., any managed Cassandra 4.0.1, 4.0.4, and 4.0.9). Apache Cassandra version 4.0.10 contains the fix and will soon be made available on the Instaclustr Managed Platform. If you have any questions, please get in contact with Instaclustr Support. 

Mitigation for customers on Cassandra 4.0.1, 4.0.4, or 4.0.9: 

  • For customers using the managed service the Instaclustr Support team will be in contact with you to schedule an upgrade of your managed Cassandra clusters to version 4.0.10. 
  • For support only customers, you will need to upgrade your Cassandra clusters to 4.0.10, but in the short term if is advisable to close any remote JMX access to your clusters. 

As a further mitigation step, we will immediately be marking managed Cassandra versions 4.0.1, 4.0.4, and 4.0.9 as Legacy Support prior to these versions being marked as End of Life on 31 July 2023 as per our lifecycle policy. 

As always, customers who want to take a more proactive stance should limit access to their managed Cassandra cluster to only trusted clients and ensure those clients are secure. This is always good security practice in any case. 

If you have any further queries regarding this vulnerability and how it relates to Instaclustr services, please contact Instaclustr Support. 

References: https://nvd.nist.gov/vuln/detail/CVE-2023-30601 

The post Security Advisory: CVE 2023-20601 Apache Cassandra® appeared first on Instaclustr.

The Data Modeling Behind Social Media “Likes”

Did you ever wonder how Instagram, Twitter, Facebook, and  other social media platforms track who liked your posts? This post explains how to do it with ScyllaDB NoSQL. And if you want to go hands-on with these and other NoSQL strategies, join us at ScyllaDB Labs, a free 2 hour interactive event. 

Recently, I was invited to speak at an event called “CityJS.” But here’s the thing: I’m the PHP guy. I don’t do JS at all, but I accepted the challenge. To pull it off, I needed to find a good example to show how a highly scalable and low latency database works.

So, I asked one of my coworkers for examples. He told me to look for high numbers inside any platform, like counters or something like that. At that point, I realized that any type of metrics can fit this example. Likes, views, comments, follows, etc. could be queried as counters. Here’s what I learned about how do proper data modeling for these using ScyllaDB.

 

First things first, right? After deciding what to cover in my talk, I needed to understand how to build this data model.

We’ll need a posts table and also a post_likes table that relates who liked each post. So far, it seems enough to do our likes counter.

My first bet for a query to count all likes was something like:

Ok and if I just do a query with SELECT count(*) FROM social.post_likes it can work, right?

Well, it worked but it was not as performant as expected when I did a test with a couple thousands of likes in a post. As the number of likes grows, the query becomes slower and slower…

“But ScyllaDB can handle thousands of rows easily… why isn’t it performant?” That’s probably what you’re wondering right now.

ScyllaDB – even as a cool database with cool features – will not solve the problem of bad data modeling. We need to think about how to make things faster.

Researching Data Types

Ok, let’s think straight: the data needs to be stored and we need the relation between who liked our post, but we can’t use it for count. So what if I create a new row as integer in the posts table and increment/decrement it every time?

Well, that seems like a good idea, but there’s a problem: we need to keep track of every change on the posts table and if we start to INSERT or UPDATE data there, we’ll probably create a bunch of nonsense records in our database.

Using ScyllaDB, every time that you need to update something, you actually create new data.

You will have to track everything that changes in your data. So, for each increase, there will be one more row unless you don’t change your clustering keys or don’t care about timestamps (a really bad idea).

After that, I went into the ScyllaDB docs and found out that there’s a type called counter that fit our needs and is also ATOMIC!

Ok, it fit our needs but not our data modeling. To use this type, we have to follow a few rules but let’s focus on the ones that are causing trouble for us right now:

  • The only other columns in a table with a counter column can be columns of the primary key (which cannot be updated).
  • No other kinds of columns can be included.
  • You need to use UPDATE queries to handle tables that own a counter data type.
  • You only can INCREMENT or DECREMENT values, setting a specific value is not permitted.

This limitation safeguards correct handling of counter and non-counter updates by not allowing them in the same operation.

So, we can use this counter but not on the posts table… Ok then, it seems that we’re finding a way to get it done.

Proper Data Modeling

With the information that counter type should not be “mixed” with other data types in a table, the only option that is left to us is create a NEW TABLE and store this type of data.

So, I made a new table called post_analytics that will hold only counter types. For the moment, let’s work with only likes since we have a Many to Many relation (post_likes) created already.


These next queries are what you probably will run for this example that we created:

Now you might have new unanswered questions in your mind like: “So every time that I need a new counter related to some data, I’ll need a new table?” Well, it depends on your use case. In the social media case, if you want to store who saw the post, you will probably need a post_viewers table with session_id and a bunch of other stuff.

Having these simple queries that can be done without joins can be way faster than having count(*) queries.

Me talking at CityJS stage

Me on the CityJS stage talking about data modeling using Typescript

Go Hands-On with ScyllaDB and Data Modeling…at the ScyllaDB Labs Event

If you want to go deeper on this topic, with some real hands-on exercises, please join us for ScyllaDB Labs! It’s a free 2 hour virtual event where I’ll be presenting along with Guy Shtub, Head of Training and Felipe Cardeneti Mendes, Solutions Architect.

Join us at the Event

This is an interactive workshop where we’ll go hands-on to build and interact with high-performance apps using ScyllaDB. It will be a  great way to discover the NoSQL strategies used by top teams and apply them in a guided, supportive environment. As you go live with some sample applications, you’ll learn about the features and best practices that will enable your own applications to get the most out of ScyllaDB.

We’ll cover:

  • Understanding if your use case is a good fit for ScyllaDB
  • Achieving and maintaining low-latency NoSQL at scale
  • Building high performance applications with ScyllaDB
  • Data modeling, ingestion, and processing with the sample app
  • Navigating key decisions & logistics when getting started with ScyllaDB

Hope to see you there!

Build your First ScyllaDB Application: New Rust, Python & PHP Tutorials

A couple of years ago, we published the first version of CarePet, an example IoT project designed to help you get started with ScyllaDB. Recently, we’ve expanded the project to make it useful for a larger group of developers by including new examples in Rust, Python, and PHP. Additionally, we also implemented a Terraform example that allows you to spin up a ScyllaDB Cloud cluster with the initial CarePet schema.

This blog post goes over the components of the project and shows you how to get started. In the GitHub repository, you can find examples in multiple languages:

The CarePet app allows tracking of pets’ health indicators. Each tutorial builds the same app with these three components:

  • A virtual collar that reads and pushes sensor data
  • A web app for reading and analyzing the pets’ data
  • A database migration tool

The tutorials and other supporting material are in our documentation as well as GitHub

Keep in mind that this example is meant to be one of your first steps at trying ScyllaDB. It is not production ready and should be used for reference purposes only.

TL;DR High-Level Overview + Deploying to ScyllaDB Cloud with Terraform

To get your hands dirty and start building right away:

  1. Go to iot.scylladb.com
  2. Select your favorite programming language
  3. Follow the instructions in the documentation

If you have problems or questions while working on the app, feel free to post a message in our community forum and we’ll be happy to solve your problem or answer your questions.

Using the ScyllaDB Cloud Terraform provider you can interact with ScyllaDB Cloud – create, edit, or delete your instances – using Terraform. In the CarePet repository, you can find a starter Terraform configuration that sets up a new ScyllaDB Cloud cluster and creates the initial schema in the new database. If you want to complete the tutorial using ScyllaDB Cloud and Terraform, go to this documentation page and get started!

About This Tutorial

Now, a little more detail…

In this sample app tutorial, you’ll create a simple IoT application from scratch that uses ScyllaDB as the database. The application is called “Care Pet”; it collects and analyzes data from sensors attached to virtual pet collars. This data can be used to monitor a pet’s health and activity.

The tutorial walks you through a specific instance of the steps to create a typical IoT application from scratch. This includes gathering requirements, creating the data model, cluster sizing and hardware needed to match the requirements, and (finally) building and running the application.

Use Case Requirements

Each pet collar includes sensors that report four different measurements: Temperature, Pulse, Location, and Respiration. The collar reads the sensor’s data once a minute, aggregates it in a buffer, and sends measurements directly to the app once an hour. The application should scale to 10 million pets. It keeps each pet’s data history for a month. Thus, the database will need to scale to contain 43 billion data points in a month (60 × 24 × 30 × 10,000,000 = 43,200,000,000); 43,200 data samples per pet. If the data variance is low, it will be possible to further compact the data and reduce the number of data points.

Note that in a real world design you’d have a fan-out architecture. The end-node IoT devices (the collars) would communicate wirelessly to an MQTT gateway or equivalent, which would, in turn, send updates to the application via, say, Apache Kafka. We’ll leave out those extra layers of complexity for now, but if you are interested, check out our post on ScyllaDB and Confluent Integration for IoT Deployments.

Performance Requirements

The application has two parts:

  • Sensors: Writes to the database, throughput sensitive
  • Backend dashboard: Reads from the database, latency-sensitive

For this example, we assume 99% writes (sensors) and 1% reads (backend dashboard)

The desired Service Level Objectives (SLOs) are:

  • Writes: Throughput of 670K operations per second
  • Reads: Latency of up to 10 milliseconds per key for the 99th percentile

The application requires high availability and fault tolerance. Even if a ScyllaDB node goes down or becomes unavailable, the cluster is expected to remain available and continue to provide service. You can learn more about ScyllaDB’s high availability in this lesson.

You can also calculate what these requirements would cost using ScyllaDB Cloud vs. other providers using our pricing calculator.

To satisfy the requirements of the example, ScyllaDB needs to be able to do 670K write operations per second and 10K read operations per second with 10ms P99 latency where the average item size is 1KB and the full data set is 1 TB. As you can see, ScyllaDB can provide not just great performance benefits but also huge cost savings in cases like this if you migrate from a different database like DynamoDB.

See benchmarks

Design and Data Model

Now, let’s think about our queries, make the primary key and clustering key selection, and create the database schema. See more in the data model design document.

Here’s how we will create our data using CQL:

We hope you find the project useful to learn more about ScyllaDB. If you have questions or issues that are specific to this sample application, feel free to open an issue in the project’s GitHub repository.

Resources

Here’s what you need to get started with ScyllaDB:

For a deeper understanding of ScyllaDB and to get answers to your questions, go to:

Vector Search Is Coming to Apache Cassandra

There’s no artificial intelligence without data. And when your data is scattered all over the place, you’ll spend more time managing the implementation process instead of focusing on what’s most important: building the application. The world's most prominent applications already use Apache...

How to Maximize Database Concurrency

“ScyllaDB really loves concurrency…to a point.”

That was one of the key takeaways from the recent webinar, How Optimizely (Safely) Maximizes Database Concurrency, featuring Brian Taylor, Principal Software Engineer at Optimizely.

No database or system – digital or not – can tolerate unbounded concurrency. For the most part, high concurrency is essential for achieving impressive performance. But if the clients end up overwhelming the database with requests at a pace that it can’t handle, throughput suffers, then latency rises as a side effect.

As Brian explained, knowing exactly how hard you can push your database is key for facing throughput challenges, like Black Friday surges, without a hitch. Identifying that precise tipping point begins with the Universal Scaling Law, which states that as the number of users (N) increases, the system throughput (X) will:

  • Enjoy a period of near linear scaling
  • Eventually saturate some resource such that increasing the number of users doesn’t increase the throughput
  • Possibly encounter a coordination cost that drives down the throughput with further increasing number of users

Here’s how that USL applies to a database…

In the linear region, throughput is directly proportional to concurrency. The size of the cluster will determine how large the linear region is. You should engineer your system to take full advantage of the linear region.

In the saturation region, throughput is approximately constant, regardless of concurrency. At this point, assuming a well-tuned cluster and workload, you are writing to the system as fast as the durable storage can take it. Concurrency is no longer a mechanism to increase throughput and is now becoming a risk to stability. You should engineer your system to stay out of the saturation region.

In the retrograde region, increasing concurrency now decreases throughput. A system that enters this region is likely to get stuck here until demand declines. The harder you push, the less throughput you get and the more demand builds which makes you want to push harder. However, “pushing harder” consumes more client resources. This is a vicious cycle and you are now on “the road to hell.”

That’s just a very high level look at the first part of the talk. Watch the complete video to hear more about:

  • How Brian determines the bounds of these regions for his specific database and workload
  • How the database operates in each region and what it means for the team
  • Ways you can find more concurrency to throw at ScyllaDB when you’re in that magical linear region
  • His 3 top tips for getting the most out of ScyllaDB’s “awesome concurrency”

Watch the complete video

This talk stirred up some great discussion, so we wanted to share some of the highlights here.

What about adaptive database concurrency limits?

Q: Configuring fixed concurrency limits is not a new way to ensure one has a stable system. A situation which often happens is that the values you previously benchmarked through your performance testing could quickly become stale – due to data or coding changes and natural application autoscaling to meet newer real-world demands. For example, Netflix has a nice write-up around the topic where they implemented Adaptative Concurrency Limits in order to address this situation. Is this something that you already implement or have considered implementing at Optimizely, and do you have any tips for the audience?

A: Great question. So static concurrency limits are a hazard – you should know that in advance. A small subset of problems can work fine under static concurrency limits, but most can’t. Because most of our systems auto scale, most of our database clusters are shared by multiple things.

You need to recognize that the USL fit you made for your cluster is, unfortunately, also a snapshot in time. You got a snapshot of the USL on the day you measured it, or the days you measured it, but it’s going to change as you change your software, as hardware potentially degrades, as you add users to the cluster in the background that are soaking up some of that performance, as backups that you didn’t measure during your peak performance testing kick in and do their thing. The real safe concurrency limit is always dynamic.

The way I’ve tackled that is actually based on a genius suggestion from a former manager who loved digging into foundational tech. He suggested applying the TCP congestion control algorithm. That is the algorithm that operates way down on the network stack, deciding how big a chunk of data to send at once. It chooses the size and sends a chunk. Then if that chunk size fails, it chooses a smaller size and sends a chunk. If that succeeds, it scales up the chunk size. So you have this dynamic tuning that finds the chunk size that yields the maximum throughput for whatever network “weather” you’re sitting on. Network conditions are dynamic just like database conditions. They’re solving the same problem.

I applied the TCP congestion control algorithm to choosing concurrency. So if things are going fine, and the weather at Scylla land is good, then I scale up concurrency, do a little more. If the weather is getting kind of choppy, things are slowing down, I scale it down. So that’s the way that I dynamically manage concurrency. It works very well in this autoscaling, multiple clients sort of real world scenario.

What’s the impact of payload size on concurrency?

Q: What is the impact of payload record size? Correct me if I’m wrong, but I think that the larger your payload gets, the less concurrency you have to apply to your system, right?

A: Exactly. Remember, the fundamental limiter is the SSD. So bigger payloads means more throughput hitting the SSD per request. That’s a first order effect. My recommendation would be to base your testing on your average payload size if your workload is fairly smooth. If things are significantly variable over time (e.g. that average swings around dynamically throughout the day) then I would tune to the maximum payload size. The most pessimistic (aka safe) tuning would be to find the maximum payload size that your system allows, then fit the USL under those test conditions.

How do you address workloads that are spiky by nature?

Q: No system supports unlimited concurrency. Otherwise, no matter how fast your database is, you’ll eventually end up reaching the retrograde region. Do you have any recommendations for workloads that are unbound and/or spiky by nature?

A: Yeah, in my career, all workloads are spiky so you always get to solve this problem. You need to think about what happens when the wheels fall off. When work bunches up, what are you going to do? You have two choices. You can queue and wait or you can load shed. You need to make that choice consistently for your system. Your chosen scale chooses a throughput you can handle within SLA. Your SLA tells you how often you’re allowed to violate your latency target so make sure the reality you’ve observed violates your design throughput well less than the limit your SLA requires. And then when reality drives you outside that window, you either queue and wait and blow up your latency or you load shed and effectively force the user to implement the queue themselves. Load shedding is not magic. It’s just choosing to force the user to implement the queue.

Note: ScyllaDB supports configuring concurrency limiters per shard or throughput limits per partition. [Learn more – Retaining Goodput with Query Rate Limiting]

How can limiting concurrency help you achieve your desired throughput?

Q: One of the situations that we see fairly often happening is when your concurrency goes too high, and then you reach the retrograde region and just see your throughput going down. Latency goes up, your throughput goes down, you start seeing timeouts, errors, etc, and so on. And then the general feedback that we’re going to provide is that you have to fix your concurrency. People don’t understand how reducing their concurrency can actually help them to achieve their desired throughput. Could you elaborate a bit on that?

A: Well, first I’ll tell a story. The basis for this talk was a system that encountered high partition cardinality. This led to lots of concurrent requests to ScyllaDB. From my perspective, it looked like ScyllaDB’s performance just tanked, even though the problem was really on our side. It was a frustrating day. Another law of nature is that your bad day happens at the worst possible time. In my case, huge cardinality happened on Black Friday (because obviously there’s a lot going on that day…). The thing to know is that when you’re thinking about throughput, your primary knob is concurrency. So if you have a throughput problem, go back to your primary knob. You’re observing your concurrency right? (I wasn’t, but I am now.)

How to run open-loop testing with Rust and ScyllaDB?

Ok, this wasn’t part of the Q & A – but Brian developed a great tool, and we wanted to highlight it. We’ll let Brian explain:

With classic closed loop load testing, simulated users have up to one request in flight at any time. Users must receive the response to their request before issuing the next request. This is great for fitting the USL because we’re effectively choosing X when we choose users. But, in my opinion, it’s not directly useful for modern capacity testing.

Modern open loop load testing does not model users or think times. Instead, it models the load as a constant throughput load source. This is a match for capacity planning for internet connected systems where we typically know requests per second but don’t really know or care how many users are behind that load. Concurrency is theoretically unbounded in an open-loop test: This is the property that lets us open our eyes to how the saturation and retrograde regions will look when our system encounters them in real life.

As I was developing this talk, unfortunately, I couldn’t find an open loop load tester for ScyllaDB – so I wrote one. scylla-bench: constant cate (CRrate) simulates a constant rate load and delivers it to ScyllaDB with variable concurrency. The name is also a bad Rust pun.

It’s an open-loop ScyllaDB load testing tool developed in the creation of this talk. It is intended to complement the ScyllaDB-provided closed-loop testing tool, scylla-bench. It overcomes coordinated omission using the technique advocated in my P99 CONF talk.