Database Performance at Scale: Free Book & Masterclass

Discover new ways to optimize database performance – and avoid common pitfalls – with this free book and a video crash course by the authors

In case you haven’t heard, we wrote a book on database performance – and you can get it for free! It has already picked up quite a bit of buzz.

The 270-page book stems from an unconventional global collaboration between Felipe Cardeneti Mendes, Piotr Sarna, Pavel Emelyanov, and myself. Fun fact: we produced the entire book without a single all-author virtual meeting – and we’ve never all met in person.

We wrote the book to share our collective experience with performance-focused database engineering as well as performance-focused database users. It represents what we think teams striving for extreme database performance – low latency, high throughput, or both – should be thinking about, but often overlook. This includes the nuances of DB internals, drivers, infrastructure, topology, monitoring, and quite a lot more.

Digital Download + Request Printed Book

Although it was written by people from ScyllaDB, it’s not a book “about” ScyllaDB per se. We wanted to explore the topic of database performance at a broader level so the book would be relevant even beyond the ScyllaDB community. It doesn’t matter what database you’re currently using (MongoDB, MySQL, Postgres, Cassandra, DynamoDB…). If you’re experiencing some pain related to database latency and/or throughput – or you fear you will suffer soon – this is a book for you. And if you’re considering or already using ScyllaDB, you’ll definitely want to look at this book.

We invite you to dive into the book and to share your brutally honest feedback with us (find us on the socials). But we also know that not everyone has time to read a book these days. So, the authors – plus one of our tech reviewers – recently came together to deliver a Database Performance at Scale masterclass sharing key points from the book in just a couple of hours.

If you weren’t among the thousand people who attended live, you missed your chance to get your burning database performance questions answered during the author panel. But you can still watch the sessions on-demand for a crash course on database performance at scale. We covered how to:

  • Navigate the top performance challenges and tradeoffs that you’re likely to face with your project’s specific workload characteristics and technical/business requirements (Felipe Cardeneti Mendes)
  • Understand the impact of database internals – specifically, what to look for if you need latency and/or throughput improvements (Pavel Emelyanov and Botond Dénes)
  • Explore the critical role drivers play in performance, some of the commonly-overlooked pitfalls, and strategies to apply when things go wrong (Piotr Sarna)

And then came the panel discussion. With Felipe doing double duty as moderator and participant, the group covered a range of topics including:

  • The inspiration for the fantastic tales of Joan and poor Patrick, featured in Chapter 1
  • Why we don’t see DPDK used more often
  • A look at concepts of scheduling groups and isolation in Seastar
  • What key performance metrics teams should be monitoring and why
  • The database challenges that Piotr is tackling over at Turso
  • The most important factor related to database performance

You can watch how it all plays out here:

If you want to explore the three core videos, plus test your knowledge with our certification exam, access the full masterclass (it’s just 90 minutes).

Access the Masterclass

Keep Learning: Podcasts with Book Authors

Piotr Sarna and Felipe Cardeneti Mendes were recently invited to share their insights on various tech podcasts.

Piotr on The Distributed Fabric Pod

Piotr started the podcast ball rolling with an appearance on The Distributed Fabric Pod, where host Vipul Vaibhaw explores the fascinating world of distributed systems, database internals, deep learning, programming languages, and more. In this episode, Vipul and Piotr chatted about the book, database drivers and database internals, Zig vs Rust, the broader challenges of distributed systems, and advice for new engineers.

 

Felipe on The Geek Narrator

Felipe kept the podcast momentum going with The Geek Narrator (Kaivalya Apte), who just kicked off a new series focused on database internals – featuring ScyllaDB as well as DynamoDB, Cassandra, CockroachDB, DuckDB, Neo4j, TiDB, ClickHouse, and more. Kaivalya and Felipe took a deep dive into ScyllaDB’s unique close-to-the-hardware design, with an emphasis on why design decisions like our shard-per-core architecture, specialized cache, and IO scheduling matter for database users who require predictable performance at scale.

DynamoDB: When to Move Out

A look at the top reasons why teams decide to leave DynamoDB: throttling, latency, item size limits, and limited flexibility…not to mention costs

Full disclaimer: I love DynamoDB. It has been proven at scale and is the back-end that many industry leaders have been using to power business-critical workloads for many years. It also fits nicely within the AWS ecosystem, making it super easy to get started, integrate with other AWS products, and stick with it as your company grows.

But (hopefully!) there comes a time when your organization scales quite significantly. With massive growth, costs become a natural concern for most companies out there, and DynamoDB is often cited as one of the primary reasons for runaway bills due to its pay-per-operation pricing model.

You’re probably thinking: “Oh, this is yet another post trying to convince me that DynamoDB costs are too high, and I should switch to your database, which you’ll claim is less expensive.”

Hah! Smart guess, but wrong. Although costs are definitely an important consideration when selecting a database solution (and ScyllaDB does offer quite impressive price-performance vs DynamoDB), they are not always the most critical one. In fact, switching databases is often seen as “open-heart surgery”: something you never want to experience – and certainly not more than once.

If costs are a critical consideration for you, there are plenty of other resources you can review on that topic.

But let’s switch the narrative and discuss other aspects you should consider when deciding whether to shift away from DynamoDB.

DynamoDB Throttling

Throttling in any context is no fun. Seeing a ProvisionedThroughputExceededException makes people angry. If you’ve seen it, you’ve probably hated it and learned to live with it. Some may even claim they love it after understanding it thoroughly.

There are a variety of ways you can become a victim of DynamoDB throttling. In fact, it happens so often that AWS even has a dedicated troubleshooting page for it. One of the recommendations there is to “switch to on-demand mode” – before the page even gets to evenly distributed access patterns.
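
For reference, moving an existing table to on-demand mode is a single AWS CLI call (the table name below is a hypothetical placeholder):

aws dynamodb update-table \
    --table-name MyTable \
    --billing-mode PAY_PER_REQUEST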

The main problem with DynamoDB’s throttling is that it considerably harms latencies. Workloads with a Zipfian distribution are prone to it… and the results achieved in a recent benchmark aren’t pretty:


ScyllaDB versus DynamoDB – Zipfian workload

During that benchmark, it became apparent that throttling could cause the affected partition to become inaccessible for up to 2.5 seconds. That greatly magnifies your P99 tail latencies.

While it is true that most databases have some built-in throttling mechanism, DynamoDB throttling is simply too aggressive. It is unlikely that your application will be able to scale further if it frequently accesses a specific set of items that crosses DynamoDB’s partition or table limits.

Intuition might say: “Well, if we are hitting DynamoDB limits, then let’s put a cache in front of it!” Okay, but that will also eventually throttle you. In fact, it is particularly hard to find concrete numbers on DynamoDB Accelerator’s throttling criteria, which appear to be primarily tied to CPU utilization (not the best signal for load shedding).

DynamoDB Latency

You may argue that the Zipfian distribution shown above targets DynamoDB’s Achilles heel, and that the AWS solution is a better fit for workloads that follow a uniform distribution. We agree. We’ve also tested DynamoDB under that assumption:

ScyllaDB versus DynamoDB – Uniform distribution

Even within DynamoDB’s sweet spot, P99 latencies are considerably higher compared to ScyllaDB across a variety of scenarios.

Of course, you may be able to achieve lower DynamoDB latencies than the ones presented here. Yet DynamoDB falls short when required to deliver higher throughput, and latencies eventually start to degrade.

One way to avoid such spikes is to use DynamoDB Accelerator (or any other caching solution, really) for a great performance boost. As promised, I’m not going to get into the cost discussion here, but rather point out that DynamoDB’s promised single-digit millisecond latencies are harder to achieve in practice.

To learn more about why placing a cache in front of your database is often a bad idea, you might want to watch this video on Replacing Your Cache with ScyllaDB.

DynamoDB Item Size Limits

Every database has an item size limit, including DynamoDB. However, the 400KB limit may eventually become too restrictive for you. Alex DeBrie sheds some light on this and thoroughly explains some of the reasons behind it in his blog on DynamoDB limits:

“So what accounts for this limitation? DynamoDB is pointing you toward how you should model your data in an OLTP database.

Online transaction processing (or OLTP) systems are characterized by large amounts of small operations against a database. (…) For these operations, you want to quickly and efficiently filter on specific fields to find the information you want, such as a username or a Tweet ID. OLTP databases often make use of indexes on certain fields to make lookups faster as well as holding recently-accessed data in RAM.”

While it is true that smaller payloads (or Items) will generally provide improved latencies, imposing a hard limit may not be the best idea. ScyllaDB users who decided to abandon the DynamoDB ecosystem often mention this limit, positioning it as a major blocker for their application functionality.

If you can’t store such items in DynamoDB, what do you do? You likely come up with strategies such as compressing the relevant parts of your payload or storing them elsewhere, such as Amazon S3. Either way, the solution becomes sub-optimal: both approaches introduce additional latency and unnecessary application complexity.

At ScyllaDB, we take a different approach: A single mutation can be as large as 16 MB by default (this is configurable). We believe that you should be able to decide what works best for you and your application. We won’t limit you by any means.
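
As a rough sketch of where that limit comes from in a self-managed deployment: the mutation size ceiling is derived from the commit log segment size in scylla.yaml (the 16 MB default being half of the 32 MB segment size). Treat the exact parameter name and default as assumptions for your version and double-check the documentation before changing anything:

# Inspect the commit log segment size; the maximum mutation size is derived from it
grep commitlog_segment_size_in_mb /etc/scylla/scylla.yaml
# Expected to show something like: commitlog_segment_size_in_mb: 32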

Speaking of freedom: Let’s talk about …

DynamoDB’s Limited Flexibility

This one is fairly easy.

DynamoDB integrates perfectly within the AWS ecosystem. But that’s precisely where the benefit ends. As soon as you engage in a multi-cloud strategy, decide to switch to another cloud vendor, or maybe fall back to on-prem (don’t get me started on AWS Outposts!), you will inevitably need to find another place for your workloads.

Of course, you CANNOT predict the future, but you CAN avoid future problems. That’s what differentiates good engineers from brilliant ones: whether your solution is ready to tackle your organization’s growing demands.

After all, you don’t want to find yourself in a position where you need to learn a new technology, change your application, perform load testing in the new DBMS, migrate data over, potentially evaluate several vendors… Quite often, that can become a multi-year project depending on your organization size and the number of workloads involved.

DynamoDB workloads are naturally a good fit for ScyllaDB. More often than not, you can simply follow the very same data modeling you already have in DynamoDB and use it on ScyllaDB. We even have ScyllaDB Alternator – an open source DynamoDB-compatible API – to allow you to migrate seamlessly if you want to continue using the same API to communicate with your database, thus requiring fewer code changes.

Final Remarks

“Folks, don’t get me started on DynamoDB…” is a nice Reddit post you may want to check out. It covers a number of interesting real-life stories, some of which touch on the points we covered here. Beyond the colorful community feedback, it includes a fair list of other considerations you may want to be aware of before you stick with DynamoDB.

Still, the first sentence of this article remains true: I love DynamoDB. It is a great fit and an economical solution when your company or use case has low throughput requirements or isn’t bothered by some unexpected latency spikes.

As you grow, data modeling and ensuring you do not exceed DynamoDB’s hard limits can become particularly challenging. Plus, you’ll be locked in – which may eventually require considerable effort and planning to undo if your organization’s direction changes.

When selecting a database, it’s ultimately about choosing the right tool for the job. ScyllaDB is by no means a replacement for each and every DynamoDB use case out there. But when a team needs a NoSQL database for high throughput and predictable low latencies, ScyllaDB is invariably the right choice.

Debezium® Change Data Capture for Apache Cassandra®

The Debezium® Change Data Capture (CDC) source connector for Apache Cassandra® version 4.0.11 is now generally available on the Instaclustr Managed Platform. 

The Debezium CDC source connector will stream your Cassandra data (via Apache Kafka®) to a centralized enterprise data warehouse or any other downstream data consumer. Now, business information in your Cassandra database can be streamed much more simply than before!

Instaclustr has been supporting the Debezium CDC Cassandra source connector through a private offering since 2020. More recently, our Development and Open Source teams have been developing the CDC feature for use by all Instaclustr Cassandra customers. 

Instaclustr’s support for the Debezium Cassandra source connector was released to Public Preview earlier this year for our customers to trial in a test environment. Now with the general availability release, our customers can get a fully supported Cassandra CDC feature already integrated and tested on the Instaclustr platform, rather than performing the tricky integration themselves.

“The Debezium CDC source connector is a really exciting feature to add to our product set, enabling our customers to transform Cassandra database records into streaming events and easily pipe the events to downstream consumers.

Our team has been actively developing support for the Debezium CDC Cassandra source connector, both on our managed platform and through open source contributions by NetApp to the Debezium Cassandra connector project. We’re looking forward to seeing more of our Cassandra customers using CDC to enhance their data infrastructure operations.” – Jaime Borucinski, Apache Cassandra Product Manager

Instaclustr’s industry-leading SLAs and support for the CDC feature give our customers the confidence to rely on this solution in production. This means that Instaclustr’s managed Debezium source connector for Cassandra is a good fit for your most demanding production data workloads, increasing the value for our customers with integrated solutions across our product set. Our support document, Creating a Cassandra Cluster With Debezium Connector, provides step-by-step instructions to create an Instaclustr for Cassandra CDC solution for your business.

How Does Change Data Capture Operate with Cassandra?  

Change Data Capture is a native Cassandra setting that can be enabled on a table at creation, or with an ALTER TABLE command on an existing table (see the example after the list below). Once enabled, it will create logs that capture and track cluster data that has changed because of inserts, updates, or deletes. Once initial installation and setup have been completed, the CDC process will take these row-level changes and convert your database records into streaming events for downstream data integration consumers. When running, the Debezium Cassandra source connector will:

  1. Read the Cassandra commit log files in the cdc_raw directory, which capture the inserts, updates, and deletes applied on each node.
  2. Create a change event for every row-level change.
  3. For each table, publish change events in a separate Kafka topic. In practice this means that each CDC-enabled table in your Cassandra cluster will have its own Kafka topic.
  4. Delete the processed commit log files from the cdc_raw directory.
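
As a minimal sketch of the Cassandra side of this, CDC is toggled per table via CQL once cdc_enabled is set to true in cassandra.yaml (the keyspace and table names below are hypothetical):

# Enable CDC on an existing table...
cqlsh -e "ALTER TABLE my_keyspace.orders WITH cdc = true;"

# ...or at table creation time
cqlsh -e "CREATE TABLE my_keyspace.orders (id uuid PRIMARY KEY, total decimal) WITH cdc = true;"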

Apache Cassandra and Debezium Open Source Developments 

Support for CDC was introduced in Cassandra version 3.11 and has continued to mature through version 4.0. There have been notable open source contributions that improve CDC functionality, including splitting support for Cassandra 3.11 and Cassandra 4.0 into separate modules with shared common logic (a “core” module), authored by NetApp employee Stefan Miklosovic and committed by Gunnar Morling. The need for this split was identified during development of Cassandra version 4.0 CDC support.

As support for version 4.0 was added, certain components of Cassandra version 3.11 CDC would break. The introduction of the core module now allows Debezium features and fixes to be pinned to a specific Cassandra version when released, without breaking Debezium support for other Cassandra versions. This will be particularly valuable for future development of Debezium support for Cassandra version 4.1 and each newer Cassandra version. 

Stefan has continued to enhance Cassandra 4.0 support for Debezium with additional contributions, like changing the CQL schema of a node without interrupting streaming events to Kafka. Previously, propagating changes in the CQL schema back to the Debezium source connector required a restart of the Debezium connector. Stefan created a CQL Java driver schema listener and hooked it to a running node; as soon as someone adds a column, a table, or similar to their Cassandra cluster, these changes are now detected in the Debezium streamed events with no Debezium restarts!

Another notable improvement for CDC in Cassandra version 4.0 is the way the Debezium source connector reads the commit log. Cassandra version 3.11 CDC would buffer the CDC change events and only publish them in batch cycles, creating a processing delay. Cassandra version 4.0 CDC now continuously processes the commit logs as new data becomes available, achieving near real-time publishing.

If you already have Cassandra clusters that you want to stream into an enterprise data store, get started today using the Debezium source connector for Cassandra on Instaclustr’s Managed Platform.  

Contact our Support team to learn more about how Instaclustr’s managed Debezium connector for Cassandra can unlock the value of your new or existing Cassandra data stores.  


Set Up Your Own 1M OPS Benchmark with ScyllaDB Cloud & Terraform

How to run your own ScyllaDB benchmark on a heavy workload in just minutes  

ScyllaDB is a low-latency, high-performance NoSQL database that’s compatible with Apache Cassandra and DynamoDB. Whether you read our benchmarks or find third-party opinions, you’ll find similar results: ScyllaDB is monstrously fast. In this blog post, we help you run your own tests on ScyllaDB in a cloud environment so you can experience the power of ScyllaDB yourself.

You’ll learn how you can set up cloud infrastructure on top of ScyllaDB Cloud and Amazon Web Services to run a 1 million operations/second workload. We created a Terraform tool to help you set up the minimum infrastructure required to test a heavy workload using ScyllaDB – all in the cloud.

If you want to skip ahead and try the tool, you can find the repository on GitHub. You’ll need Terraform installed locally, an AWS account, and a ScyllaDB Cloud account.

Cloud infrastructure elements needed for the demo

In order to set up the environment needed for the demo, Terraform will build out the following items:

  • ScyllaDB Cloud cluster: Three nodes, each with i3en.24xlarge hardware (96 vCPU, 768 GB RAM)
  • Loader instances: Three EC2 instances based on ScyllaDB AMI (ami-0a23f9b62c17c53fe) with i4i.8xlarge hardware (32 vCPU, 256 GB RAM)
  • Cassandra-stress script: We use the cassandra-stress benchmarking tool to make 1 million requests/second

Additionally:

  • Security groups
  • Subnets
  • VPC for the EC2 instances
  • VPC peering connection for ScyllaDB Cloud

 

Get up and running

To run the project you need a ScyllaDB Cloud account, ScyllaDB API token, and an AWS account with the CLI tool.

Clone the repository from GitHub:


git clone https://github.com/zseta/scylladb-1m-ops-demo.git
cd scylladb-1m-ops-demo/
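
From there, the standard Terraform workflow applies. Terraform will prompt for, or read from variables.tf, values such as your ScyllaDB Cloud API token and AWS settings – check the repository’s variables.tf for the exact variable names:

# Initialize providers and modules, preview the changes, then build everything
terraform init
terraform plan
terraform apply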

Your ScyllaDB Cloud cluster will be up and running in about 15 minutes. After Terraform is finished, it takes another 2-3 minutes to reach 1M ops/sec.

See the workload live

Go to the ScyllaDB Cloud console, select your newly created cluster ScyllaDB-Cloud-Demo and open Monitoring to see the workload live:

After you analyze the results on the dashboard, you can run `terraform destroy` to avoid unnecessary costs. To re-run the benchmark with different parameters, change the values in the variables.tf file and modify the cassandra-stress profiles.

To learn how much you could save by using ScyllaDB Cloud to run 1 million ops/sec, head over to our pricing calculator.

Wrapping up

We hope that you find our Terraform project useful. Go to GitHub and try it yourself! If you have questions or feedback, head over to the ScyllaDB Forum and we’ll be happy to help. There are also plenty more resources available to help you learn about ScyllaDB.

Why You Can't Miss Cassandra Summit

Wondering how Netflix handles data partitions in Apache Cassandra? Looking for tips on how to build and manage huge Cassandra clusters? Then clear your calendar for Dec. 12 and Dec. 13, because Cassandra Summit, a gathering of the top minds in the Cassandra community, is happening in San Jose, Calif.…

Benchmarking MongoDB vs ScyllaDB: Caching Workload Deep Dive

benchANT’s comparison of ScyllaDB vs MongoDB in terms of throughput, latency, scalability, and cost for a caching workload

BenchANT recently benchmarked the performance and scalability of the market-leading general-purpose NoSQL database MongoDB and its performance-oriented challenger ScyllaDB. You can read a summary of the results in the blog Benchmarking MongoDB vs ScyllaDB: Performance, Scalability & Cost, see the key takeaways for various workloads in this technical summary, and access all results (including the raw data) from the benchANT site.

This blog offers a deep dive into the tests performed for the caching workload. The caching workload is based on YCSB Workload A. It creates a read-update workload with 50% read operations and 50% update operations. The workload is executed in two versions, which differ in their request distribution patterns (namely uniform and hotspot distribution). This workload is executed against the small database scaling size with a data set of 500 GB, the medium scaling size with a data set of 1 TB, and the large scaling size with a data set of 10 TB. In addition to the regular benchmark runtime of 30 minutes, a long-running benchmark over 12 hours is executed.
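
For orientation, a YCSB Workload A run with a uniform request distribution looks roughly like the following with the stock cassandra-cql binding. Hosts, record counts, and thread counts are placeholders, and benchANT’s exact harness and parameters may differ:

# Workload A: 50% reads / 50% updates; switch requestdistribution to "hotspot" for the second variant
bin/ycsb run cassandra-cql -P workloads/workloada \
    -p hosts=10.0.0.1,10.0.0.2,10.0.0.3 \
    -p requestdistribution=uniform \
    -p recordcount=500000000 -p operationcount=100000000 \
    -threads 200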

Before we get into the benchmark details, here is a summary of key insights for this workload.

  • ScyllaDB outperforms MongoDB with higher throughput and lower latency for all measured configurations of the caching workload.
  • Even a small 3-node ScyllaDB cluster performs better than a large 18-node MongoDB cluster.
  • ScyllaDB provides consistently higher throughput, and its advantage grows with the data size – up to 20 times higher throughput.
  • ScyllaDB provides significantly lower update latencies (up to 68 times lower) compared to MongoDB.
  • ScyllaDB read latencies are also lower for all scaling sizes and request distributions – up to 2.8 times lower.
  • ScyllaDB achieves near-linear scalability across the tests, while MongoDB achieves only 340% of the theoretically possible 600% and 900% of the theoretically possible 2400%.
  • ScyllaDB provides 12-16 times more operations/$ compared to MongoDB Atlas for the small scaling size and 18-20 times more operations/$ for the scaling sizes medium and large.

Throughput Results for MongoDB vs ScyllaDB

The throughput results for the caching workload with the uniform request distribution show that the small ScyllaDB cluster is able to serve 77 kOps/s at a cluster utilization of ~87%, while the small MongoDB cluster serves only 5 kOps/s at a comparable cluster utilization of 80-90%. For the medium cluster sizes, ScyllaDB achieves an average throughput of 306 kOps/s at ~89% cluster utilization, while MongoDB achieves 17 kOps/s. For the large cluster size, ScyllaDB achieves 894 kOps/s against MongoDB’s 45 kOps/s.

Note that client-side errors occurred when inserting the initial 10 TB on MongoDB large; as a result, only 5 TB of the specified 10 TB were inserted. However, this does not affect the results of the caching workload because the applied YCSB version only operates on the key range 1 – 2,147,483,647 (INTEGER_MAX_VALUE); for more details, see the complete report on the benchANT site. This fact actually gives MongoDB an advantage: MongoDB’s cache only had to deal with 2,100,000,000 accessed records (i.e. 2.1 TB), while ScyllaDB’s cache had to deal with the full 10,000,000,000 records (i.e. 10 TB).

MongoDB vs ScyllaDB - Throughput - Caching Uniform

The caching workload with the hotspot distribution is only executed against the small and medium scaling size. The throughput results for the hotspot request distribution show a similar trend, but with higher throughput numbers since the data is mostly read from the cache. The small ScyllaDB cluster serves 153 kOps/s while the small MongoDB only serves 8 kOps/s. For the medium cluster sizes, ScyllaDB achieves an average throughput of 559 kOps/s and MongoDB achieves 28 kOps/s.

MongoDB vs ScyllaDB - Throughput - Caching Hotspot

Scalability Results for MongoDB vs ScyllaDB

The throughput results allow us to compare the theoretical throughput scalability with the actually achieved scalability. For ScyllaDB, the maximum theoretical scaling factor for throughput for the uniform distribution is 1600% when scaling from small to large. For MongoDB, the theoretical maximal throughput scaling factor is 2400% when scaling from small to large.

The ScyllaDB scalability results show that scaling from small to medium is very close to achieving linear scalability by achieving a throughput scalability of 397% of the theoretically possible 400%. Considering the maximal scaling factor from small to large, ScyllaDB achieves 1161% of the theoretical 1600%.

For the hotspot distribution, the small and medium cluster sizes are benchmarked. ScyllaDB achieves a throughput scalability of 365% of the theoretical 400%.

ScyllaDB - Throughput Scalability - Caching Uniform

ScyllaDB - Throughput Scalability - Caching Hotspot

The MongoDB scalability results for the uniform distribution show that MongoDB scaled from small to medium achieves a throughput scalability of 340% of the theoretical 600%. Considering the maximal scaling factor from small to large, MongoDB achieves only 900% of the theoretically possible 2400%.

MongoDB achieves a throughput scalability of 350% of the theoretical 600% for the hotspot distribution.

MongoDB - Throughput Scalability - Caching Uniform

MongoDB - Throughput Scalability - Caching Hotspot

Price-Performance Results for MongoDB vs ScyllaDB

In order to compare the costs/month in relation to the provided throughput, we take the MongoDB Atlas throughput/$ as baseline (i.e. 100%) and compare it with the provided ScyllaDB Cloud throughput/$.

The results for the uniform distribution show that ScyllaDB provides 12 times more operations/$ compared to MongoDB Atlas for the small scaling size and 18 times more operations/$ for the scaling sizes medium and large.

MongoDB vs ScyllaDB - Price-Performance - Caching Uniform

For the hotspot distribution, the results show a similar trend where ScyllaDB provides 16 times more operations/$ for the small scaling size and 20 times for the medium scaling size.

MongoDB vs ScyllaDB - Price-Performance - Caching Hotspot

Latency Results for MongoDB vs ScyllaDB

The P99 latency results for the uniform distribution show that ScyllaDB and MongoDB provide stable P99 read latencies. Yet, the values for ScyllaDB are consistently lower than the MongoDB latencies. An additional insight is that the ScyllaDB read latency doubles from medium to large (from 8.1 to 16.1 ms). The MongoDB latency decreases by 1 millisecond (from 23.3 to 22.3 ms), but still does not match the ScyllaDB latency.

For the update latencies, the results show a similar trend as for the social workload where ScyllaDB provides stable and low update latencies while MongoDB provides up to 73 times higher update latencies.

For the hotspot distribution, the results show a similar trend as for the uniform distribution. Both databases provide stable read latencies for the small and medium scaling size with ScyllaDB providing the lower latencies.

For updates, the ScyllaDB latencies are stable across the scaling sizes and slightly lower than for the uniform distribution. Compared to ScyllaDB, the MongoDB update latencies are 25 times higher for the small scaling size and 44 times higher for the medium scaling size.

MongoDB vs ScyllaDB - P99 Latency - caching uniform

MongoDB vs ScyllaDB - P99 Latency - caching hotspot

Technical Nugget A – 12 Hour Benchmark

In addition to the default 30 minute benchmark run, we also select the scaling size large with the uniform distribution for a long-running benchmark of 12 hours.

For MongoDB, we select the previously determined 8 YCSB instances with 100 threads per YCSB instance and run the caching workload with the uniform distribution for 12 hours with a target throughput of 40 kOps/s.

MongoDB - Throughput - 12h run

The throughput results show that MongoDB provides the 40 kOps/s constantly over time as expected.

The P99 read latencies over the 12 hours show some peaks that reach 20 ms and 30 ms, and an increase in spikes after 4 hours of runtime. On average, the P99 read latency for the 12h run is 8.7 ms; for the regular 30-minute run, it is 5.7 ms.

The P99 update latencies show a spiky pattern over the entire 12 hours, with peak latencies of 400 ms. On average, the P99 update latency for the 12h run is 163.8 ms, while for the regular 30-minute run it is 35.7 ms.

MongoDB - Read Latency 99th - 12h run

MongoDB - Update Latency 99th - 12h run

For ScyllaDB, we select the previously determined 16 YCSB instances with 200 threads per YCSB instance and run the caching workload with the uniform distribution for 12 hours with a target throughput of 500 kOps/s.

The throughput results show that ScyllaDB provides the 500 kOps/s constantly over time as expected.

ScyllaDB - Throughput - 12h run

The P99 read latencies over the 12 hours stay constantly below 10 ms except for one peak of 12 ms. On average, the P99 read latency for the 12h run is 7.8 ms.

The P99 update latencies over the 12 hours show a stable pattern over the entire 12 hours with an average P99 latency of 3.9 ms.

ScyllaDB - Read Latency 99th - 12h run
ScyllaDB - Update Latency 99th - 12h run

Technical Nugget B – Insert Performance

In addition to the three defined workloads, we also measured the plain insert performance for the small scaling size (500 GB), medium scaling size (1 TB), and large scaling size (10 TB) into MongoDB and ScyllaDB. It needs to be emphasized that batch inserts were enabled for MongoDB but not for ScyllaDB (since YCSB does not support them for ScyllaDB).

The following results show that for the small scaling size, the achieved insert throughput is on a comparable level. However, for the larger data sets, ScyllaDB achieves a 3 times higher insert throughput for the medium scaling size. For the large-scale benchmark, MongoDB was not able to ingest the full 10 TB of data due to client-side errors, resulting in only 5 TB of inserted data. Yet, ScyllaDB outperforms MongoDB by a factor of 5.

MongoDB vs Scylla - Insert Throughput - caching uniform

Technical Nugget C – Client Consistency Performance Impact

In addition to the standard benchmark configurations, we also run the caching workload with the uniform distribution under weaker consistency settings. Namely, we allow MongoDB to read from the secondaries (readPreference=secondaryPreferred) and for ScyllaDB we set readConsistency=ONE.

The results show an expected increase in throughput: for ScyllaDB 23% and for MongoDB 14%. This throughput increase is lower compared to the client consistency impact for the social workload since the caching workload is only a 50% read workload and only the read performance benefits from the applied weaker read consistency settings. It is also possible to further increase the overall throughput by applying weaker write consistency settings.

MongoDB vs ScyllaDB - Client Consistency Impact Throughput - caching uniform
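
For reference, in stock YCSB these weaker settings map roughly onto the following client properties (illustrative only – benchANT’s exact configuration may differ):

# MongoDB: allow reads from secondaries via the connection string
-p mongodb.url="mongodb://mongo1,mongo2,mongo3/ycsb?readPreference=secondaryPreferred"

# ScyllaDB (cassandra-cql binding): relax the read consistency level to ONE
-p cassandra.readconsistencylevel=ONE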

Continue Comparing ScyllaDB vs MongoDB

There are plenty of additional resources for learning about the differences between MongoDB and ScyllaDB.

Top Mistakes with ScyllaDB: Configuration

Get an inside look at the most common ScyllaDB configuration mistakes – and how to avoid them.

In past blogs, we’ve already gone deep into the weeds of all the infrastructure components, their importance, and the considerations to take into account when selecting appropriate ScyllaDB infrastructure. Now, let’s shift focus to an actual ScyllaDB deployment and understand some of the major mistakes my colleagues and I have seen in real-world deployments.

This is the third installment of this blog series. If you missed them, you may want to read back on the important infrastructure and storage considerations.

1: Intro + Infrastructure

2: Storage

Also, you might want to look at this short video we created to assist you with defining your deployment type according to your requirements:



Now on to the configuration mistakes…

Running an Outdated ScyllaDB Release

Every once in a while, we see (and try to assist) users running on top of a version that’s no longer supported. Admittedly, ScyllaDB has a quick development cycle – the price paid to bring cutting-edge technology to our users – and it’s up to you to ensure that you’re running a current version. Otherwise, you might miss important correctness, stability, and performance improvements that are unlikely to be backported down to your current release branch.

ScyllaDB ships with the scylla-housekeeping utility enabled by default. This lets you know whenever a new patch or major release comes out. For example, the following message will be printed to the system logs when you are running the latest major release but are behind a few patch releases:

# /opt/scylladb/scripts/scylla-housekeeping --uuid-file /var/lib/scylla-housekeeping/housekeeping.uuid --repo-files '/etc/apt/sources.list.d/scylla*.list' version --mode cr
Your current Scylla release is 5.2.2 while the latest patch release is 5.2.9, update for the latest bug fixes and improvements

However, the following message will be displayed when you are behind a major release:

# /opt/scylladb/scripts/scylla-housekeeping --uuid-file /var/lib/scylla-housekeeping/housekeeping.uuid --repo-files '/etc/apt/sources.list.d/scylla*.list' version --mode cr
Your current Scylla release is 5.1.7, while the latest patch release is 5.1.18, and the latest minor release is 5.2.9 (recommended)

As you can see, there’s a clear distinction between major and patch (minor) releases. Similarly, whenever you deploy the ScyllaDB Monitoring Stack (more on that later), you’ll also have a clear view into which versions your nodes are currently running. This helps you determine when to start planning for an upgrade:


But when should you upgrade? That’s a fair question, which requires an explanation of how our release life cycle works.

Our release cycle begins on ScyllaDB Open Source (OSS), which brings production-ready features and improvements to our community. The OSS branch evolves and receives bug and stability fixes for a while, until it eventually gets branched to our next ScyllaDB Enterprise. From here, the Enterprise branch receives additional Enterprise-only features (such as Workload Prioritization, Encryption at Rest, Incremental Compaction Strategy, etc.) and runs through a variety of extended tests, such as longevity, regression, performance, and functionality testing. It continues to receive backports from its OSS sibling. As the Enterprise branch matures and passes through rigorous testing, it eventually becomes a ScyllaDB Enterprise release.

Another important aspect worth highlighting is the Open Source to Enterprise upgrade path. Notably, ScyllaDB Enterprise releases are supported for much longer than Open Source releases. We have previously blogged about the Enterprise life cycle. For Enterprise, it is worth highlighting that the versioning numbers (and their meaning) change slightly, although the overall idea remains.

Whether you are running ScyllaDB Enterprise or OSS, ScyllaDB releases are split into two categories:

  • Major releases: These introduce new features, enhancements, and newer functionality – all that’s good, really. Major releases are known by their first two digits, such as 2023.1 (for Enterprise) or 5.2 (for Open Source).
  • Minor (patch) releases: These primarily contain bug and stability fixes, although sometimes we might introduce new functionality. A patch release is known by the last digit(s) of the version number. For example, 5.1.18 indicates the eighteenth patch release on top of the 5.1 OSS branch.

ScyllaDB supports the two latest major releases. At the time of this writing, the latest ScyllaDB Open Source release is 5.2. Thus, we also support the 5.1 branch, but no release older than it. Therefore, it is important and recommended that you plan your upgrades in a timely manner in order to stay on a current version. Otherwise, if you report any problem, our response will likely begin with a request to upgrade.

If you’re upgrading from an Open Source version to a ScyllaDB Enterprise version, refer to our documentation for supported upgrade paths. For example, you may directly upgrade from ScyllaDB 5.2 to our latest ScyllaDB Enterprise 2023.1, but you can’t upgrade to an older Enterprise release.

Finally, let’s address a common question we often receive: “Can I directly upgrade from <super ancient release> to <fresh new good-looking version>?” No, you cannot. It is not that it won’t work (in fact, it might), but it is definitely not an upgrade path that we certify or test against. The proper upgrade involves running major upgrades one at a time, until you land on the target release you are after. This is yet another reason for you to do your due diligence and remain current.

Running a World Heritage OS

We get it, you love RHEL 7 or Ubuntu 18.04. We love them too! 🙂 However, it is time for us to let them go.

ScyllaDB’s close-to-the-hardware design greatly relies on kernel capabilities for performance. Over the years, we have seen performance problems arise from using outdated kernels. For example, back in 2016 we found out that the XFS support for AIO appending writes wasn’t great. More recently, we also reported a regression in AWS kernels with great potential to undermine performance.

The recommendation here is to ensure you always run a reasonably recent Linux distribution, rather than only upgrading your OS every now and then. Even better, if you are running in the cloud, you may want to check out our ScyllaDB EC2, GCP, and Azure images. ScyllaDB images go through rigorous testing to ensure that your ScyllaDB deployment will squeeze every last bit of performance from the VM you run it on top of.

The ScyllaDB images receive OS updates often, and we strive to keep them as current as possible. However, remember that once the image gets deployed, it becomes your responsibility to keep the base OS updated. Although it is perfectly fine to simply upgrade ScyllaDB and remain on the same cloud image version you originally provisioned, it is worth emphasizing that – over time – the image will become outdated, up to the point where it may be easier to simply replace it with a fresh new one.

Going through the right (or wrong) way to keep your existing OS/cloud images updated is beyond the scope of this article. There are many variables and processes to account for. It is still worth highlighting that we handle this for you in ScyllaDB Cloud, our fully managed database-as-a-service. There, all instances are frequently updated with both security and kernel improvements, and cloud images are frequently refreshed to bring you the latest in OSS technology.

Diverging Configuration

This is an interesting one, and you might be a victim of it. First, we need to explain what we mean by “diverging configuration”, and how it actually manifests itself.

In a ScyllaDB cluster, each node is independent from another. This means that a setting in one node won’t have any effect on other nodes until you apply the same configuration cluster-wide. Due to that, users sometimes end up with settings applied to only a few nodes out of their topology.

Under most circumstances, correcting the problem is fairly straightforward: simply replicate the missing parameters to the corresponding nodes and perform a rolling restart (or send SIGHUP to the process, if the parameter happens to be a LiveUpdatable one).

In other situations, however, the problem may manifest itself in a seemingly silent way. Imagine the following hypothetical scenario:

  1. You were properly performing your due diligence, kept your 3-node ScyllaDB cluster closely up to date for the past 2 or 3 years, just like we recommended earlier, and are now running the latest and greatest version of our database.
  2. Eventually one of your nodes died, and you decided to replace it. You spun up a new ScyllaDB AMI and started the node replacement procedure.
  3. Everything worked fine, until weeks later you noticed that the node you replaced has a different shard count in Monitoring than the rest of the nodes. You double-check everything, but can’t pinpoint the problem.

What happened?

Short answer: You probably forgot to re-run scylla_setup during the upgrades carried out in Step 1 and overlooked that its tuning logic changed between versions. When you replaced the node with an updated AMI, the new node automatically tuned itself, ending up with the correct configuration – and a different one from the rest of the cluster.

There are plenty of other situations where you may end up with a similar misconfigured node, such as forgetting to update your ScyllaCluster deployment definitions (such as CPU and Memory) upon scaling up your Kubernetes instances.

The main takeaway here is to always keep a consistent configuration across your ScyllaDB cluster and implement mechanisms to ensure you re-run scylla_setup whenever you perform major upgrades. Granted, we don’t change the setup logic all that often. However, when we do, it is really important for you to pick up its changes because it may greatly improve your overall performance.
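
One lightweight way to catch this kind of drift is to periodically compare the effective configuration across nodes. A sketch along these lines, where node names, file paths, and the specific files worth diffing are assumptions to adapt to your environment:

# Compare scylla.yaml and the CPU pinning configuration across all nodes
for node in node1 node2 node3; do
    echo "== $node =="
    ssh "$node" 'md5sum /etc/scylla/scylla.yaml /etc/scylla.d/cpuset.conf'
done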

Not Configuring Proper Monitoring

The worst offense that can be done to any distributed system is neglecting the need to monitor it. Yet, it happens all too often.

By monitoring, we definitely don’t mean you should stare at the screen. Rather, we simply recommend that users deploy the ScyllaDB Monitoring Stack in order to have insights when things go wrong.

Note that we specifically mentioned using ScyllaDB Monitoring rather than other third-party tools. There are plenty of reasons:

  • It is open and free to everyone
  • It is built on top of well-known technologies like Grafana and Prometheus (VictoriaMetrics support was introduced in 4.2.0)
  • Metrics and dashboards are updated regularly as we add more metrics
  • It is extremely easy to upload Prometheus data to our support team should you ever face any difficulties

Of course, you can monitor ScyllaDB with other solutions. But if you eventually want assistance from ScyllaDB and you can’t provide meaningful metrics, this can impact our ability to assist.

If you have already deployed the ScyllaDB Monitoring, remember to also upgrade it on a regular basis to fully benefit from the additional functionality, security fixes, and other goodies it brings.

In summary, allow me to quote a presentation from Henrik Rexed at our last Performance Engineering Masterclass: the main observability pillars for understanding the behavior of a distributed system involve having ready access and visibility into logs, events, metrics, and traces. So stop flying blind and just deploy our Monitoring stack 🙂
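
If you haven’t deployed it yet, getting the stack up is a short exercise – roughly the following, with the version flag and target file treated as illustrative (see the scylla-monitoring README for your release):

git clone https://github.com/scylladb/scylla-monitoring.git
cd scylla-monitoring
# List your nodes in prometheus/scylla_servers.yml, then start Grafana, Prometheus, and Alertmanager
./start-all.sh -v 5.2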

Ignoring Alerts

Here’s a funny story: Once, we were doing our due diligence with one of our on-premise Enterprise users and realized one of their nodes was unreachable from their monitoring. We asked whether they were aware of any problems – nope. A bit more digging, and we realized the cluster had been in that state for the past 2 weeks. Dang!

Fear not – the story had a happy ending, with the root cause identified as a network partition affecting the node.

Things like that really happen. While convention says that a database is pretty much a “set it and forget it” thing, other infrastructure components aren’t, and you must be ready to react quickly when things go wrong.

Although alerts such as a node down or high disk space utilization are relatively easy to spot, others – such as higher latencies and data imbalances – become much harder to catch unless you integrate your monitoring with an alerting solution.

Fortunately, it’s quite simple. When you deploy ScyllaDB Monitoring, several alerting conditions come built in, out of the box. Just be sure to connect AlertManager with your favorite alerting mechanism, such as Slack, PagerDuty, or even e-mail.
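
As a hedged sketch, pointing AlertManager at Slack boils down to adding a receiver to its configuration. The fragment below is illustrative only – the exact file location within the monitoring stack and the webhook URL are placeholders:

# AlertManager receiver fragment – merge into the stack's AlertManager configuration
# (typically found under the prometheus/ directory of scylla-monitoring)
cat <<'EOF'
receivers:
  - name: slack-alerts
    slack_configs:
      - api_url: https://hooks.slack.com/services/XXX/YYY/ZZZ   # placeholder webhook
        channel: '#scylla-alerts'
EOF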

No Automation

Unsurprisingly, most of the mistakes covered thus far could be avoided or addressed with automation.

Although ScyllaDB does not impose a particular automation solution on you (nor should we, as each organization has its own way of managing processes), we do provide open source tooling for you to work with so that you won’t have to start from scratch.

For example, the ScyllaDB Cloud Images support passing User-provided data during provisioning so you can easily integrate with your existing Terraform (OpenTofu anybody?) scripts.

Speaking of Terraform, you can rely on the ScyllaDB Cloud Terraform provider to manage most of the aspects related to your ScyllaDB Cloud provisioning. Not a Terraform user? No problem. Refer to the ScyllaDB Cloud API reference and start playing.

And what if you are not a ScyllaDB Cloud user and don’t use ScyllaDB Cloud images? We’ve still got you covered! You should definitely get started with our Ansible roles for managing, upgrading, and maintaining your ScyllaDB, Monitoring and Manager deployments.

Wrapping Up

This article covered most of the aspects and due diligence required for keeping a ScyllaDB cluster up to date, including examples of how remaining current may greatly boost your performance. We also covered the importance of observability in preventing problems and discussed several options for you to automate, orchestrate, and manage ScyllaDB.

The blog series up until now has primarily covered aspects tied to a ScyllaDB deployment. At this point – you should have a rock-solid ScyllaDB cluster running on adequate infrastructure.

Next, let’s shift focus and discuss how to properly use what we’ve set up. We’ll cover application-specific topics such as load balancing policies, concurrency, consistency levels, timeout settings, idempotency, token and shard awareness, speculative executions, replication strategies… Well, you can see where this is going. See you there!

Understanding and Mitigating Distributed Database Network Costs with ScyllaDB

Considerations and strategies that will help you optimize ScyllaDB’s performance as well as reduce networking costs

ScyllaDB is a high-performance NoSQL database that’s widely adopted for use cases that require high throughput and/or predictable low latency. As a distributed database, ScyllaDB frequently replicates data to other replicas, which inevitably translates to networking costs. Such costs reflect the additional network bandwidth necessary to uphold the consistency and availability of data across multiple nodes.

Factors that influence distributed database network traffic, and thus costs, include:

  • Replication factor, which determines the number of data copies in the cluster. More copies means higher availability and more data traversing the network (increasing cost).
  • Consistency level, which determines the number of nodes required to acknowledge the client operation before it is considered successful.
  • Other aspects like the payload size, Materialized Views, topology, and throughput.

Each factor plays a unique role in determining the overall network cost. Understanding each concept will help you optimize ScyllaDB’s performance as well as reduce networking costs.

Impact of Replication Factor and Consistency Level

ScyllaDB uses a distributed architecture that allows data to be stored and replicated across multiple nodes. The replication factor determines the number of copies of the data that are stored across different nodes. The higher the replication factor, the more network traffic, since more data needs to be replicated across the network to maintain consistency and availability.

During reads, ScyllaDB applies an optimization in order to reduce traffic: the query coordinator requests only a digest from all but one replica (if using consistency level LOCAL_QUORUM). If there is a mismatch, it then initiates a process known as read repair. This is done to reduce both network costs and latencies.

For write operations, the consistency level has minimal impact on network traffic because it only eliminates the write-ack from the replica, saving only a small amount of bandwidth.
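
To make the two knobs concrete: the replication factor is set when the keyspace is created, and the consistency level is set per session or per statement. A minimal sketch with made-up keyspace, datacenter, and table names:

# Replication factor: keep 3 copies of every partition in datacenter dc1
cqlsh -e "CREATE KEYSPACE shop WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': 3};"

# Consistency level: require 2 of the 3 local replicas to acknowledge the read
cqlsh -e "CONSISTENCY LOCAL_QUORUM; SELECT * FROM shop.orders WHERE id = 42;"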

Impact of Payload

Payload size refers to the amount of data that is sent or received over the network. The larger the payload size, the higher the network amplification since more bandwidth is required to transmit and (afterward) replicate the data.

Larger payloads can also increase your storage footprint, as well as your CPU and memory utilization, undermining latencies. The following strategies help ensure that your ScyllaDB deployment remains efficient and performant, even with large payloads:

  • Efficient Data Modeling: By optimizing your data model, you can store data more efficiently, reducing the size of payloads. This involves strategies like using appropriate data types, denormalizing data where necessary, and avoiding the storage of large blobs or unnecessary data by following a query-driven approach.
  • Caching Strategies: Smaller payloads mean more room for caching. Intelligent caching can mitigate the impact of larger payloads on memory usage. By avoiding expensive data scans (or by properly specifying the BYPASS CACHE clause – see the example after this list), you can ensure efficient memory usage, avoiding round-trips to disk for your most valuable queries.
  • Asynchronous Processing: Asynchronous operations can help in managing larger payloads without an impact on throughput. They allow other operations to continue before the data operation is completed, thereby optimizing CPU utilization and response times.
  • Load Balancing and Partitioning: Proper load balancing along with evenly distributed access patterns on high cardinality keys ensures that large payloads don’t overwhelm specific nodes, preventing potential hotspots.
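
For the caching point above, the BYPASS CACHE clause looks like this – it tells ScyllaDB not to populate the cache with rows touched by infrequent scans, keeping it free for your hot data (the table name is hypothetical):

# Keep a rare analytical scan from evicting frequently-read rows from the cache
cqlsh -e "SELECT * FROM shop.orders_history BYPASS CACHE;"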

Impact of Materialized Views

Materialized Views are a powerful feature in ScyllaDB, designed to improve read performance. They are essentially precomputed views that save the results of a query and are automatically updated whenever the base table changes. Given that, the use of Materialized Views results in:

  • Increased Data Transfer: Each Materialized View represents an additional copy of the data, which means that any write to the base table also results in writes to its corresponding view. This increases the volume of data transferred across the network.
  • Enhanced Read Performance: Although base tables paired up with a view increase the data transfer during write operations, they can significantly boost read performance by eliminating the need for complex joins and aggregations at read time.

Secondary Indexes are implemented on top of the same mechanism as Materialized Views and follow the same semantics. Although indexes will typically (but not always) be considerably smaller than views, they may also contribute to elevated network traffic during operations.
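
As a quick illustration of the extra write paths involved (the schema below is made up), every write to the base table is also propagated to each of its views and indexes:

# A view keyed by e-mail maintains a second, automatically updated copy of each row –
# and therefore a second write travelling over the network for every base-table write
cqlsh -e "
CREATE MATERIALIZED VIEW shop.users_by_email AS
    SELECT * FROM shop.users
    WHERE email IS NOT NULL AND user_id IS NOT NULL
    PRIMARY KEY (email, user_id);

CREATE INDEX users_by_country ON shop.users (country);"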

Multi-AZ, Multi-DC, and Single Availability Zone (SAZ) Topologies in ScyllaDB: Trade-offs and Considerations

With distributed databases like ScyllaDB, cluster topology is critical. It significantly influences data availability, resilience, and performance, as well as costs. Let’s evaluate three common setups: Multi-AZ, Multi-DC, and SAZ.

Multi-AZ Architecture

In a Multi-AZ setup, data is replicated across several availability zones but stays within a singular region.

Pros:

  • Enhanced Availability: By distributing data across multiple zones, you gain a safety net against zone-specific failures, ensuring uninterrupted data access.
  • Local Data Durability: Replicating data within the same region but across different zones enhances data durability.

Cons:

  • Increased Network Cost: Transmitting data across different zones, albeit within the same region, can elevate network costs, especially in cloud environments.

Multi-DC Setup

A Multi-DC configuration implies that data is replicated across several data centers, often spread across distinct geographical regions. Although it is perfectly possible to have multi-region clusters where data is confined to specific regions with no replication among them, this is beyond the scope of this article.

Pros:

  • Disaster Recovery: Should a catastrophic event strike a region, data remains accessible from other regions, providing robust business continuity capabilities.
  • Geo Replication: For globally distributed applications, having data replicated in multiple regions can reduce latency for end-users spread across the globe.

Cons:

  • Network Amplification Costs: Replicating data across regions intensifies data transfer volumes, leading to higher costs, especially in cloud deployments.

Single Availability Zone (SAZ)

In a SAZ configuration, all nodes of the ScyllaDB cluster reside within one availability zone.

Pros:

  • Reduced Network Costs: Keeping all nodes in one zone significantly curbs inter-node data transfer costs.
  • Simplified Configuration: Having all nodes in a unified zone simplifies setup and eliminates complexities of inter-zone data replication.
  • Minimized Latency: With nodes in close proximity, latencies are lower.

Cons:

  • Potential for Reduced Availability: Operating under a single zone introduces a single point of failure. If the zone faces an outage, downtime occurs.
  • Data Durability Concerns: Without replicas in other zones, there’s an elevated mean time to recovery (MTTR) in case of major zone-specific calamities.

For SAZ users, ScyllaDB Cloud ensures that nodes don’t reside on the same hardware via Placement Groups, further enhancing fault tolerance within the zone.

Considerations for High Availability with Single AZ and Two DCs

Deploying ScyllaDB in a Single AZ (SAZ) is the most cost-efficient model, but it comes with a trade-off in high availability (HA). On the other end of the spectrum, a Multi-DC setup offers the highest availability at the highest cost. However, note that combining Multi-DC with SAZ may not always be the perfect solution.

Understanding the SAZ and 2DC Setup

This approach involves operating one data center within a SAZ to minimize costs, while the second data center — potentially in a different geographical location — serves as a failover to ensure data availability. While this sounds like an optimal balance between cost efficiency and HA, there are significant caveats:

  • Dual SAZ in a Single Region: This setup essentially doubles your infrastructure costs. Although you’re cutting down on inter-AZ data transfers, you’re also doubling your infrastructure, negating the cost benefits you might get from avoiding multi-AZ setups.
  • Dual SAZ in Multi-Region: This becomes even more expensive, especially if the application doesn’t specifically require multi-region replication. It doubles the infrastructure and introduces cross-region data transfer costs.

When Does SAZ and 2DC Make Sense?

The only scenario where a SAZ combined with a 2DC strategy becomes cost-effective is when there’s a specific need for multi-region replication, perhaps due to business continuity requirements, geographic data regulations, or the need to serve a globally dispersed user base. In such cases, this setup allows you to leverage the benefits of geographical dispersion while avoiding constant cross-AZ data transfer costs.

Key Takeaway

It’s crucial for decision-makers to understand that while a hybrid approach of SAZ and 2DC might seem like a middle ground, it’s often closer to a high-cost model. This approach should only be considered when the application’s requirements justify the extra expense. Instead of looking at it as a default step up from SAZ, treat it as a specialized solution for particular operational needs.

Leveraging ScyllaDB Manager for Efficient Backup Strategies

Backup strategies in ScyllaDB require a careful balance between ensuring data durability and managing network costs. While backup-related network costs are hard to estimate up front due to factors like application-specific access patterns, ScyllaDB Manager simplifies the task by offering robust, efficient, and manageable backup tooling.

ScyllaDB Manager is a centralized management system that automates recurrent maintenance tasks, reduces human error, and improves the predictability and efficiency of ScyllaDB. One of its key features is streamlining and optimizing backup processes:

  • Automated Backups: ScyllaDB Manager automates the backup process, scheduling regular backups according to your needs. This automation ensures data durability without the operational overhead of manual backups.
  • Backup Validation: The ‘backup validate’ feature verifies the integrity of the backups. It checks that all the data can be read and that the files are not corrupted, ensuring the reliability of your backups.
  • Efficient Data Storage with Deduplication: ScyllaDB Manager uses deduplication, storing only the changes since the last backup instead of a full copy every time. This process significantly reduces storage requirements and associated network costs.
  • Flexible Retention Policies: You can configure policies to retain backups for a specified period. After that period, older backups are automatically deleted, freeing up storage space.
  • Optimized Network Utilization: By relying on incremental backups and deduplication, ScyllaDB Manager minimizes the amount of data sent across the network, reducing network costs and avoiding bandwidth saturation.

For a practical illustration of how ScyllaDB Manager handles backups, see the strategies covered in its official documentation. These examples provide real-world scenarios and best practices for managing backups efficiently and reliably.

Balancing Act: Data Durability vs. Network Costs

Backups are essential for maintaining data integrity and availability, but they must be managed judiciously, especially considering network costs in cloud environments. A few things to consider:

  • Frequency: Align the frequency of backups with your data’s volatility and your acceptable Recovery Point Objective (RPO). More frequent backups minimize potential data loss, but they also increase network consumption (a rough cost estimate is sketched after this list).
  • Backup Compression: ScyllaDB Manager utilizes rclone for efficient data transfers, which supports wire compression to reduce the size of the data being transferred over the network. Note that this is not the same as SSTable compression in ScyllaDB.
  • Backup Location: The proximity of your backup storage can significantly impact costs. Storing backups within the same data center or region is generally less expensive than cross-region or cross-provider transfers due to reduced data transfer costs.
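
To reason about these trade-offs, a back-of-the-envelope estimate of backup egress can help: multiply the deduplicated delta uploaded per run by the backup frequency and your provider’s transfer price. Every number in the sketch below is an illustrative assumption, not a measured ScyllaDB figure.

```python
# Back-of-the-envelope estimate of monthly backup transfer costs.
# All values are illustrative assumptions; substitute your own measurements
# and your cloud provider's pricing.
delta_per_backup_gb = 50      # deduplicated data uploaded per backup run
backups_per_day = 4           # driven by your acceptable RPO
transfer_price_per_gb = 0.02  # cross-AZ/cross-region price (provider-specific)

monthly_egress_gb = delta_per_backup_gb * backups_per_day * 30
monthly_transfer_cost = monthly_egress_gb * transfer_price_per_gb
print(f"~{monthly_egress_gb} GB/month of backup traffic, "
      f"~${monthly_transfer_cost:.2f} in transfer costs")
```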

Compression in ScyllaDB and Application-Level Strategies

ScyllaDB employs various data compression techniques, which are crucial for reducing network costs. The primary advantages are seen with client-side and node-to-node compression; both are pivotal in minimizing the data transmitted over the network. Additionally, application-level compression strategies can further enhance efficiency.

  • Client-Side Compression: ScyllaDB drivers support client-server compression, meaning data is compressed on the client side before being sent to the cluster, and responses are compressed before being sent back. This approach reduces the amount of data sent over the network, leading to lower bandwidth usage.
  • Node-to-Node Compression: ScyllaDB supports inter-node compression. Data exchanged between nodes within the cluster — for replication, satisfying consistency requirements, data streaming, as well as other operations — is compressed before transmission.
  • Application-Level Compression: Beyond ScyllaDB’s internal mechanisms, implementing compression at the application level can provide additional benefits. This involves compressing data payloads before they are sent to ScyllaDB, using standard compression libraries, or employing efficient serialization methods.

These strategies reduce the size of the data transmitted over the network, saving bandwidth; the savings are especially significant for applications dealing with large data volumes or operating over constrained networks.

By understanding and leveraging client-side, node-to-node, and application-level compression strategies, users can significantly reduce the amount of data transmitted over the network. This leads to lower bandwidth consumption and decreased network-related costs.
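
As a concrete illustration of the client-side and application-level points, here is a minimal sketch with the ScyllaDB/Cassandra Python driver. It enables wire compression on the connection and additionally compresses a large payload in the application before writing it. The contact point, keyspace, table, and column names are placeholders, and the table is assumed to store the payload in a blob column.

```python
import zlib

from cassandra.cluster import Cluster

# Client-side (wire) compression: frames exchanged between the driver and the
# cluster are compressed. 'lz4' requires the lz4 package; compression=True
# lets the driver negotiate whichever supported algorithm is available.
cluster = Cluster(["10.0.0.1"], compression="lz4")
session = cluster.connect("my_keyspace")  # placeholder keyspace

# Application-level compression: shrink a large payload before it leaves the
# application; ScyllaDB stores the compressed blob as-is.
insert = session.prepare("INSERT INTO events (id, payload) VALUES (?, ?)")

def store_event(event_id, payload_bytes):
    session.execute(insert, (event_id, zlib.compress(payload_bytes, 6)))
```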

Application-Specific Optimizations

Your application plays a crucial role in optimizing network amplification in ScyllaDB. By implementing certain strategies and best practices, the application deployment can help reduce the amount of data transmitted over the network, thereby minimizing network amplification. Here are some approaches to consider:

  • Data Modeling: Effective data modeling is essential to minimize network amplification. By carefully designing the data model and schema, you can avoid over-replication or excessive data duplication, leading to more efficient data storage and reduced network transmission. See the ScyllaDB University Data Modeling course to learn more.
  • Query Optimization: Optimizing queries can significantly impact network amplification. Ensure that queries are designed to retrieve only the necessary data, avoiding unnecessary data transfers over the network. This involves using appropriate WHERE clauses, limiting result sets, and optimizing data retrieval patterns (a paging sketch follows this list). See this blog on query optimization to learn more.
  • Connection Pooling: Efficiently reusing database connections reduces the overhead of establishing new connections for each request. This cuts down on unnecessary round trips and minimizes network amplification caused by connection setup and teardown. Learn more in this video about How Optimizely (Safely) Maximizes Database Concurrency
  • Application-Level Compression: Beyond the compression features provided by ScyllaDB and its drivers, application-level compression offers additional reductions in data transmission size. This involves integrating compression functionalities directly into your application, ensuring data is compressed before transmission to ScyllaDB.
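
To illustrate the query optimization point, here is a short sketch with the Python driver: the query restricts the partition, caps the result set, and pages results in small chunks, so only the rows the application actually consumes cross the network. Table, column, and value names are illustrative.

```python
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

cluster = Cluster(["10.0.0.1"])           # placeholder contact point
session = cluster.connect("my_keyspace")  # placeholder keyspace

# Restrict to one partition, cap the rows returned, and page in small chunks
# so each round trip carries only the data that is actually needed.
stmt = SimpleStatement(
    "SELECT id, status FROM orders WHERE customer_id = %s LIMIT 100",
    fetch_size=50,
)
for row in session.execute(stmt, ("c-42",)):
    print(row.id, row.status)  # placeholder for real application logic
```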

ScyllaDB’s Zone-Awareness

Zone awareness is a feature of several ScyllaDB drivers that plays an important role in minimizing cross-AZ data transfer costs. In a standard ScyllaDB deployment, nodes are strategically distributed across multiple Availability Zones (AZs) to ensure heightened availability and fault tolerance. This distribution inherently involves data movement across nodes in different AZs, a process that continues even when using a zone-aware driver.

Zone awareness differs from other load balancing policies in that it prioritizes coordinator nodes located in the same availability zone as the application. This preference for “local” nodes doesn’t eliminate cross-AZ data transfers, especially given replication and other inter-node communication needs, but it does significantly reduce the volume of data transfers initiated by application requests.

By prioritizing local nodes, the zone-aware driver achieves more cost-effective access patterns. It leverages the fact that data is already replicated across AZs due to ScyllaDB’s distribution strategies, allowing for local access to data despite its wide distribution for fault tolerance. This approach not only reduces data transfer costs but also enhances application performance. Local requests typically have lower latency compared to those traversing AZs, leading to faster data access.

In summary, the zone-aware driver doesn’t alter the way data is distributed across your ScyllaDB deployment; instead, it optimizes access patterns to this distributed data. By intelligently routing requests to local nodes whenever possible, it capitalizes on ScyllaDB’s cross-AZ replication, fostering both cost efficiency and performance improvement.
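
The exact policy name varies by driver and version, so treat the following as a sketch of the general pattern rather than a definitive API: with the ScyllaDB/Cassandra Python driver, you attach a locality-preferring load balancing policy to the default execution profile. The example uses the widely available DC-aware policy wrapped in a token-aware policy; driver versions that ship a rack/AZ-aware policy plug into the same spot to provide the zone awareness described above. The contact point and data center name are placeholders.

```python
from cassandra.cluster import Cluster, ExecutionProfile, EXEC_PROFILE_DEFAULT
from cassandra.policies import DCAwareRoundRobinPolicy, TokenAwarePolicy

# Prefer coordinators in the local data center and route to replica owners
# first. A rack/AZ-aware policy (where your driver version provides one) is
# wired in the same way to prefer nodes in the application's own AZ.
profile = ExecutionProfile(
    load_balancing_policy=TokenAwarePolicy(
        DCAwareRoundRobinPolicy(local_dc="us-east")  # placeholder DC name
    )
)
cluster = Cluster(
    ["10.0.0.1"],  # placeholder contact point
    execution_profiles={EXEC_PROFILE_DEFAULT: profile},
)
session = cluster.connect()
```
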
For further details on zone awareness, see Piotr Grabowski’s P99 CONF talk, Conquering Load Balancing: Experiences from ScyllaDB Drivers.

Conclusion

Optimizing network costs in ScyllaDB is a multifaceted endeavor that requires a strategic approach to both application deployment and database configuration. Through the adept implementation of strategies such as effective data modeling, application-level and built-in compression techniques, and thoughtful load balancing, it’s possible to drastically reduce the volume of data traversing the network. This reduction minimizes network amplification, enhancing the price-performance of your ScyllaDB deployment.

It’s vital to recognize that network amplification costs are not static. They fluctuate based on various factors, including workload characteristics, database settings, and the chosen topology. This dynamic nature necessitates regular monitoring and proactive adjustments to both application behavior and database configuration. Such diligence ensures your ScyllaDB ecosystem is not only high-performing but also cost-optimized.

By embracing a comprehensive approach — one that harmonizes application-level tactics with ScyllaDB’s zone-aware drivers, data distribution methodologies, and compression mechanisms — you position your database deployment to leverage the best of ScyllaDB’s efficiencies. This holistic strategy ensures robust performance, network cost savings, and overall operational excellence, enabling you to extract the maximum value from your ScyllaDB investment.