Retaining Database Goodput Under Stress with Per-Partition Query Rate Limiting
How we implemented ScyllaDB’s per-partition rate limit feature and the impact on “goodput”

Distributed database clusters operate best when data is spread across a large number of small partitions and reads/writes are spread uniformly across all shards and nodes. But misbehaving client applications (e.g., triggered by bugs, malicious actors, or unpredictable events that make something go “viral”) can suddenly upset a previously healthy balance, causing one partition to receive a disproportionate number of requests. In turn, this usually overloads the owning shards, creating a scenario called a “hot partition” and causing the total cluster latency to worsen.

To prevent this, ScyllaDB’s per-partition rate limit feature allows users to limit the rate of accepted requests on a per-partition basis. When a partition exceeds the configured limit of operations of a given type (reads/writes) per second, the cluster starts responding with errors to some of the operations for that partition. That way, the rate of accepted requests is statistically kept at the configured limit. Since rejected operations use fewer resources, this alleviates the “hot partition” situation.

This article details why and how we implemented this per-partition rate limiting and measures the impact on “goodput”: the amount of useful data being transferred between clients and servers.

Background: The “Hot Partition” Problem

To better understand the problem we were addressing, let’s first take a quick look at ScyllaDB’s architecture. ScyllaDB is a distributed database. A single cluster is composed of multiple nodes, and each node is responsible for storing and handling a subset of the total data set. Furthermore, each node is divided into shards. A single shard is basically a CPU core with a chunk of RAM assigned to it. Each of a node’s shards handles a subset of the data the node is responsible for.

Client applications that use our drivers open a separate connection to each shard in the cluster. When the client application wants to read or write some data, the driver determines which shard owns that data and sends the request directly to that shard on its dedicated connection. This way, the CPU cores rarely have to communicate with each other, minimizing the cross-core synchronization cost. All of this enables ScyllaDB to scale vertically as well as horizontally.

The most granular unit of partitioning is called a partition. A partition in a table is the set of all rows that share the same partition key. For each partition, a set of replica nodes is chosen. Each of the replicas is responsible for storing a copy of the partition. While read and write requests can be handled by non-replica nodes as well, the coordinator needs to contact the replicas in order to execute the operation. Replicas are equal in importance, and not all of them have to be available in order to handle the operation (the number required depends on the consistency level that the user set).

This data partitioning scheme works well when the data and load are evenly distributed across all partitions in the cluster. However, a problem can occur when one partition suddenly starts receiving a disproportionate amount of traffic compared to other partitions. Because ScyllaDB assigns only a part of its computing resources to each partition, the shards on replicas responsible for that partition can easily become overloaded.
The architecture doesn’t really allow other shards or nodes to “come to the rescue,” so even a powerful cluster will struggle to serve requests to that partition. We call this situation a “hot partition” problem. Moreover, the negative effects aren’t restricted to a single partition. Each shard is responsible for handling many partitions. If one of those partitions becomes hot, then any partition that shares at least a single replica shard with the overloaded partition will be impacted as well.

It’s the Schema’s Fault (Maybe?)

Sometimes, hot partitions occur because the cluster’s schema isn’t a great fit for your data. You need to be aware of how your data is distributed and how it is accessed and used, then model it appropriately so that it fits ScyllaDB’s strengths and avoids its weaknesses. That responsibility lies with you as the system designer; ScyllaDB won’t automatically re-partition the data for you.

It makes sense to optimize your data layout for the common case. However, the uncommon case often occurs because end-users don’t always behave as expected. It’s essential that unanticipated user behavior cannot bring the whole system to a grinding halt. What’s more, something inside your system itself could also stir up problems. No matter how well you test your code, bugs are an inevitable reality. Lurking bugs might cause erratic system behavior, resulting in highly unbalanced workloads. As a result, it’s important to think about overload protection at the per-component level as well as at the system level.

What is “Goodput” and How Do You Maintain It?

My colleague Piotr Sarna explained goodput as follows in the book Database Performance at Scale:

“A healthy distributed database cluster is characterized by stable goodput, not throughput. Goodput is an interesting portmanteau of good + throughput, and it’s a measure of useful data being transferred between clients and servers over the network, as opposed to just any data. Goodput disregards errors and other churn-like redundant retries, and is used to judge how effective the communication actually is. This distinction is important. Imagine an extreme case of an overloaded node that keeps returning errors for each incoming request. Even though stable and sustainable throughput can be observed, this database brings no value to the end user.”

With hot partitions, requests arrive faster than the replica shards can process them. Those requests form a queue that keeps growing. As the queue grows, requests need to wait longer and longer, ultimately reaching a point where most requests fail because they time out before processing even begins. The system has good throughput because it accepts a lot of work, but poor goodput because most of the work will ultimately be wasted.

That problem can be mitigated by rejecting some of the requests when we have reason to believe that we won’t be able to process all of them. Requests should be rejected as early as possible and as cheaply as possible in order to leave the most computing resources for the remaining requests.

Enter Per-Partition Rate Limiting

In ScyllaDB 5.1 we implemented per-partition rate limiting, which follows that idea. For a given table, users can specify a per-partition rate limit: one for reads, another for writes. If the cluster detects that the reads or writes for a given partition are starting to exceed that user-defined limit, the database begins rejecting some of the requests in an attempt to keep the rate of unrejected requests below the limit.
For example, the limit could be set as follows:

ALTER TABLE ks.tbl WITH per_partition_rate_limit = {
    'max_writes_per_second': 100,
    'max_reads_per_second': 200
};
We recommend that users
set the limit high enough to avoid reads/writes being rejected
during normal usage, but also low enough so that ScyllaDB will
reject most of the reads/writes to a partition when it becomes hot.
A simple guideline is to estimate the maximum expected reads/writes
per second and apply a limit that is an order of magnitude higher.
Although this feature tries to accurately apply the defined limit,
the actual rate of accepted requests may be higher due to the
distributed nature of ScyllaDB. Keep in mind that this feature is
not meant for enforcing strict limits (e.g. for API purposes).
Rather, it was designed as an overload protection mechanism. The
following sections will provide a bit more detail about why there’s
sometimes a discrepancy.

How Per-Partition Rate Limiting Works

To determine how many requests to reject, ScyllaDB needs to estimate
the request rate for each partition. Because partitions are
replicated on multiple nodes and not all replicas are usually
required to perform an operation, there is no obvious single place
to track the rate. Instead, we perform measurements on each replica
separately and use them to estimate the rate. Let’s zoom in on a
single replica and see what is actually being measured. Each shard
keeps a map of integer counters indexed by token, table, and
operation type. When a partition is accessed, the counter relevant
to the operation is increased by one. On the other hand, every
second we cut every counter in half, rounding towards zero. Due to
this, it is a simple mathematical exercise to show that if a
partition has a steady rate of requests, then the counter value
will eventually oscillate between X and 2*X, where X is the rate in
terms of requests per second. We managed to implement the counter
map as a hashmap that uses a fixed, statically allocated region of
memory that is independent from the number of tables or the size of
the data set. We employ several tricks to make it work without
losing too much accuracy. You can
review the implementation here.
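To build intuition for that oscillation, here is a minimal Python simulation (not ScyllaDB’s actual C++ code, just an illustration of the once-per-second halving scheme):

# Simulate one per-partition counter: +1 per accepted request,
# halved (rounded toward zero) once per second.
def simulate(rate_per_second, seconds):
    counter = 0
    samples = []
    for _ in range(seconds):
        counter += rate_per_second   # requests arriving during this second
        samples.append(counter)      # value observed just before the cut
        counter //= 2                # the once-per-second halving
    return samples

if __name__ == "__main__":
    # With a steady 100 req/s the pre-cut value converges to ~200 (2*X)
    # and the post-cut value to ~100 (X), i.e. it oscillates between X and 2*X.
    print(simulate(100, 10))  # [100, 150, 175, 187, 193, 196, 198, 199, 199, 199]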
Case 1: When the coordinator is a replica

Now, let’s see how these counters are used in the context of the whole operation. Let’s start with the case where the coordinator is also a replica. This is almost always the case when
an application uses a properly configured shard-aware driver for
ScyllaDB [learn
more about shard-aware drivers]. Because the coordinator is a
replica, it has direct access to one of the counters for the
partition relevant to the operation. We let the coordinator make a
decision based only on that counter. If it decides to accept the
operation, it then increments its counter and tells other replicas
to do the same. If it decides to reject it, then the counter is
only incremented locally and replicas are not contacted at all.
Although skipping the replicas causes some undercounting, the
discrepancy isn’t significant in practice – and it saves processing
time and network resources.

Case 2: When the coordinator is not a replica

When the coordinator is not a replica, it doesn’t have
direct access to any of the counters. We cannot let the coordinator
ask a replica for a counter because it would introduce additional
latency – and that’s unacceptable for a mechanism that is supposed
to be low cost. Instead, the coordinator proceeds with sending the
request to replicas and lets them decide whether to accept or
reject the operation. Unfortunately, we cannot guarantee that all
replicas will make the same decision. If they don’t, it’s not the
end of the world, but some work will go to waste. Here’s how we try
to reduce the chance of that happening. First, cutting counters in
half is scheduled to happen at the start of every second, according
to the system clock. ScyllaDB depends on an accurate system clock,
so cutting counters on each node should be synchronized relatively
well. Assuming this and that nodes accept a similar, steady rate of
replica work, the counter values should be close to each other most
of the time. Second, the coordinator chooses a random number from
range 0..1 and sends it along with a request. Each replica computes
the probability of rejection (calculated based on the counter
value) and rejects if the coordinator sends a number below it.
Because counters are supposed to be close together, replicas will
usually agree on the decision.
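Here is a toy Python sketch of that agreement mechanism. The rejection-probability formula below is invented purely for illustration – ScyllaDB’s real formula differs – but it shows why a single coordinator-chosen random number keeps replicas with similar counters in agreement:

import random

def rejection_probability(counter, limit):
    # Toy formula: reject the fraction of traffic above the limit.
    # (ScyllaDB's actual formula differs; this is only for illustration.)
    if counter <= limit:
        return 0.0
    return 1.0 - limit / counter

def replica_decision(counter, limit, coordinator_roll):
    # True = accept. All replicas use the same coordinator-provided roll,
    # so replicas with similar counters reach the same decision.
    return coordinator_roll >= rejection_probability(counter, limit)

# Three replicas whose counters have drifted slightly apart.
counters = [950, 1000, 1030]
limit = 500
roll = random.random()   # chosen once by the coordinator, sent with the request
decisions = [replica_decision(c, limit, roll) for c in counters]
print(roll, decisions)   # usually all True or all False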
Read vs Write Accuracy

There is one important difference in measurement accuracy between reads and
writes that is worth mentioning. When a write operation occurs, the
coordinator asks all live replicas to perform the operation. The
consistency level only affects the number of replicas the
coordinator is supposed to wait for. All live replicas increment
their counters for every write operation so our current
calculations are not affected. However, in the case of reads, the
consistency level also affects the number of replicas contacted. It
can lead to the shard counters being incremented fewer times than
in the case of writes. For example, with replication factor 3 and a
quorum consistency level, only 2 out of 3 replicas will be
contacted for each read. The cluster will think that the real read
rate is proportionally smaller, which can lead to a higher rate of
requests being accepted than the user limit allows. We found this
acceptable for an overload prevention solution where it’s more
important to prevent the cluster performance from collapsing rather
than enforcing some strict quota.

Results: Goodput Restored after Enabling Rate Limit

Here is a benchmark that demonstrates the
usefulness of per-partition rate limiting. We set up a cluster on 3
i3.4xlarge AWS nodes with ScyllaDB 5.1.0-rc1. We pre-populated it
with a small data set that fit in memory. First, we ran a uniform
read workload of 80k reads per second from a c5.4xlarge instance –
this represents the first section on the following charts. (Note
that the charts show per-shard measurements.) Next, we started
another loader instance. Instead of reading uniformly, this loader
performed heavy queries on a single partition only – 8 times as
much concurrency and 10 times the data fetched for each read. As
expected, three shards became overloaded. Because the benchmarking application uses fixed concurrency, it became bottlenecked on the
overloaded shards and its read rate dropped from 80k to 26k
requests per second. Finally, we applied a per-partition limit of
10 reads per second. Here the chart shows that the read rate
recovered. Even though the shards that were previously overloaded
now coordinate many more requests per second, nearly all of the
problematic requests are rejected. This is much cheaper than trying
to actually handle them, and it’s still within the capabilities of
the shards.

Summary

The article explored ScyllaDB’s solution to the
“hot partition” problem in distributed database clusters through
per-partition rate limiting. This capability was designed to
maintain stable performance under stress by rejecting excess requests, thereby preserving goodput. The implementation works by
estimating request rates for each partition and making decisions
based on those estimates. Benchmarks confirmed that rate
limiting restored goodput, even under stressful conditions, by
rejecting problematic requests. You can read the detailed design
notes at
https://github.com/scylladb/scylladb/blob/master/docs/dev/per-partition-rate-limit.md

IoT Overdrive Part 2: Where Can We Improve?
In our first blog, “IoT Overdrive Part 1: Compute Cluster Running Apache Cassandra® and Apache Kafka®”, we explored the project’s inception and Version 1.0. Now, let’s dive into the evolution of the project via a walkthrough of Version 1.4 and our plans with Version 2.0.
Some of our initial challenges that need to be addressed include:
- Insufficient hardware
- Unreliable WiFi
- Finding a way to power all the new and existing hardware without as many cables
- Automating the setup of the compute nodes
Upgrade: Version 1.4
Version 1.4 served as a steppingstone to Version 2.0, codenamed Ericht, which needed to be portable for shows and talks. Scaling up and out was a big necessity for this version, to make it operable at demonstrations in a timely manner.
Why the jump to 1.4? We made a few changes in between that were relatively minor, and so we rolled v1.1–1.3 into v1.4.
Let’s walk through the major changes:
Hardware
To resolve the insufficient hardware issue, we enhanced the hardware, implementing both vertical and horizontal scaling, and introduced a new method to power the Pi computers. This included setting the baseline for a node at a 4GB quad-core computer (vertical scaling) and increasing the number of nodes (horizontal scaling).
These nodes are comparable to smaller AWS instances, and the cluster could be compared to a free cluster on the Instaclustr managed service.
Orange Pi 5
We upgraded the 4 Orange Pi worker nodes to Orange Pi 5s with 8GB RAM and added 256GB M.2 storage drives to each Orange Pi for storing database and Kafka data.
Raspberry Pi Workers
The 4 Raspberry Pi workers were upgraded to 8GB RAM Raspberry Pi 4Bs, and the 4GB models were reassigned as Application Servers. We added 256GB USB drives to the Raspberry Pis for storing database and Kafka data.
After all that, the cluster plan looked like this:
Source: Kassian Wren
We needed a way to power all of this; luckily there’s a way to kill two birds with one stone here.
Power Over Ethernet (PoE) Switch
To eliminate power cable clutter and switch to Ethernet connections, I used a Netgear 16-port PoE switch. Each Pi has its own 5V/3A adapter, which splits the PoE connection out into USB-C power and Ethernet connectors.
This setup allowed for a single power plug to be used for the entire cluster instead of one for each Pi, greatly reducing the need for power strips and slimming down the cluster considerably.
The entire cluster plus the switch consume about 20W of power, which is not small but not unreasonable to power with a large solar panel.
Source: Kassian Wren
Software
I made quite a few tweaks on the software side in v1.4. However, the major software enhancement was the integration of Ansible.
Ansible automated the setup process for both Raspberry and Orange Pi computers, from performing apt updates to installing Docker and starting the swarm.
Making it Mobile
We made the enclosure more transport-friendly with configurations for both personal travel and secure shipping. An enclosure was designed from acrylic plates with screw holes to match the new heat sink/fan cases for the Pi computers.
Using a laser cutter/engraver, we cut the plates out of acrylic and used standoff screws for motherboards to stand the racks two high; although it can be configured to be four-high, it gets tall and ungainly.
The next challenge was coming up with a way to ship it for events.
There are two configurations: one where the cluster travels with me, and the other where the cluster is shipped to my destination.
For shipping, I use one large pelican case (hard-sided suitcase) with the pluck foam layers laid out to cradle the enclosures, with the switch in the layer above.
The “travel with me” configuration is still a work in progress; fitting everything into a case small enough to travel with is interesting! I use a carry-on pelican case and squish the cluster in. You can see the yellow case in this photo:
Source: Kassian Wren
More Lessons Learned
While Version 1.4 was much better, it did bring to light the need for more services and capabilities, and showed how my implementation of the cluster was pushing the limits of Docker.
Now that we’ve covered 1.4 and all of the work we’ve done so far, it’s time to look to the future with version 2.0, codenamed “Ericht.”
Next: Version 2.0 (Ericht)
With 2.0, I needed to build out a mobile, easy-to-move version of the cluster.
Version 2.0 focuses on creating a mobile, easily manageable version of the cluster, with enhanced automation for container deployment and management. It also implements a service to monitor the health of the cluster and pass and store the data using Kafka and Cassandra.
The problems we’re solving with 2.0 are:
- Container management overhaul
- Create mobility for the cluster
- Give the cluster something to do and show
- Monitor the health and status of the cluster
Hardware for 2.0
A significant enhancement in this new version is the integration of INA226 Voltage/Current sensors into each Pi. This addition is key in providing advanced power monitoring, and detailed insights into each unit’s energy consumption.
Health Monitoring
I will be implementing an advanced health monitoring system that tracks factors such as CPU and RAM usage, Docker status, and more. These will be passed into the Cassandra database using Kafka consumers and producers.
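As a rough sketch of what such a pipeline could look like (hypothetical code, not the project’s implementation; it assumes the psutil and kafka-python packages, a broker at localhost:9092, and a topic named cluster-health):

import json
import socket
import time

import psutil                      # pip install psutil
from kafka import KafkaProducer    # pip install kafka-python

# Hypothetical broker address and topic name for the cluster-health feed.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

while True:
    metrics = {
        "node": socket.gethostname(),
        "timestamp": int(time.time()),
        "cpu_percent": psutil.cpu_percent(interval=1),
        "memory_percent": psutil.virtual_memory().percent,
    }
    # A separate consumer would read this topic and write rows into Cassandra.
    producer.send("cluster-health", metrics)
    producer.flush()
    time.sleep(30)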
Software Updates
Significant updates are planned on the software side of things. As the Docker swarm requires a lot of manual maintenance, I wanted a better way to orchestrate the containers. One option we considered was Kubernetes.
Kubernetes (K8s)
We’re transitioning to Kubernetes, an open source container orchestration system, to manage our containers. Though Kubernetes is often used for more transient containers than Kafka and Cassandra, operators such as Strimzi for Kafka and K8ssandra for Cassandra are available to help make this integration more feasible and effective.
Moving Ahead
As we continue to showcase and demo this cluster more and more often (we were at Current 2023 and Community Over Code NA!), we are learning a lot. This project’s potential for broader technological applications is becoming increasingly evident.
As I move forward, I’ll continue to document my progress and share my findings here on the blog. Our next post will go into getting Apache Cassandra and Apache Kafka running on the cluster in Docker Swarm.
The post IoT Overdrive Part 2: Where Can We Improve? appeared first on Instaclustr.
Gemini Code Assist + Astra DB: Build Generative AI Apps Faster
Today at Google Cloud Next, DataStax is showcasing its integration with Gemini Code Assist to allow for automatic generation of Apache Cassandra-compliant CQL. This integration, a collaboration with Google Cloud, will enable developers to build applications that use Astra DB faster. Developers...

Unlearning Old Habits: From Postgres to NoSQL
Where an RDBMS pro’s intuition led him astray – and what he learned when our database performance expert helped him get up and running with ScyllaDB NoSQL

Recently, I was asked by the ScyllaDB team if I could document some of my learnings moving from relational to NoSQL databases. Of course, there are many (bad) habits to break along the way, and instead of just documenting these, I ended up planning a 3-part livestream series with my good friend Felipe from ScyllaDB.

ScyllaDB is a NoSQL database that is designed from the ground up with performance in mind – and by that, I mean optimizing for speed and throughput. I won’t go into the architecture details here; there’s a lot of material you might find interesting in the resources section of this site if you want to read more.

Scaling your database is a nice problem to have – it’s a good indicator that your business is doing well. Many customers of ScyllaDB have been on that exact journey: some compelling event has driven them to consider moving from one database to another, all with the goal of driving down latency and increasing throughput. I’ve been through this journey myself many times in the past, and I thought it would be great if I could re-learn, or perhaps just ditch some bad habits, by documenting the trials and tribulations of moving from a relational database to ScyllaDB. Even better if someone else benefits from these mistakes!

Fictional App, Real Scaling Challenges

So we start with a fictional application I have built. It’s the Rust-Eze Racing application, which you can find on my personal GitHub here: https://github.com/timkoopmans/rust-eze-racing

If you have kids, or are just a fan of the Cars movies, you’re no doubt familiar with Rust-Eze, which is one of the Piston Cup teams – and also a sponsor of the famous Lightning McQueen… Anyway, I digress. This app is built around telemetry, a key factor in modern motor racing. It allows race engineers to interpret data that is captured from car systems so that they can tune for optimum performance. I created a simple Rust application that could simulate these metrics being written to the database, while at the same time having queries that read different metrics in real time. For the proof of concept, I used Postgres for data storage and was able to get decent read and write throughput from it on my development machine.

Since we’re still in the world of make believe and cars that can talk, I want you to imagine that my PoC was massively successful, and I have been hired to write the whole back-end system for the Piston Cup. At this point in the scenario, with overwhelming demands from the business, I start to get nervous about my choice of database:

- Will it scale?
- What happens when I have millions or perhaps billions of rows?
- What happens when I add many more columns and rows with high cardinality to the database?
- How can I achieve the highest write throughput while maintaining predictable low latency?

All the usual questions a real full-stack developer might start to consider in a production scenario… This leads us to the mistakes I made and the journey I went through, spiking ScyllaDB as my new database of choice.

Getting my development environment up and running, and connecting to the database for the first time

In my first 1:1 livestreamed session with Felipe, I got my development environment up and running, using docker-compose to set up my first ScyllaDB node.
I was a bit over-ambitious and set up a 3-node cluster, since a lot of the training material references this as a minimum requirement for production. Felipe suggested it’s not really required for development and that one node is perfectly fine. There are some good lessons in there, though… I got a bit confused with port mapping and the way that works in general. In Docker, you can map a published port on the host to a port in the container – so I naturally thought: Let’s map each of the container’s 9042 ports to a unique number on the host to facilitate client communication. I did something like this:

ports:
  - "9041:9042"
And I
changed the port number for 3 nodes such that 9041 was node 1, 9042
was node 2 and 9043 was node 3. Then in my Rust code, I did
something like this:

SessionBuilder::new()
    .known_nodes(vec!["0.0.0.0:9041", "0.0.0.0:9042", "0.0.0.0:9043"])
    .build()
    .await
I did this thinking that the client would
then know how to reach each of the nodes. As it turns out, that’s
not quite true, depending on what system you’re working on. On my
Linux machine, there didn’t seem to be any problem with this, but
on macOS there are problems. Docker runs differently on macOS than on Linux – on Linux, Docker uses the host’s kernel directly, so these routes always work; macOS doesn’t have a Linux kernel, so Docker has to run inside a Linux virtual machine. When the client app connects to a node
in ScyllaDB, part of the discovery logic in the driver is to ask
for other nodes it can communicate with (read more on shard
aware drivers here). Since ScyllaDB will advertise other nodes
on 9042, they simply won’t be reachable. So on Linux, it looks like
it established TCP comms with three nodes:

❯ netstat -an | grep 9042 | grep 172.19
tcp   0   0   172.19.0.1:52272   172.19.0.2:9042   ESTABLISHED
tcp   0   0   172.19.0.1:36170   172.19.0.3:9042   ESTABLISHED
tcp   0   0   172.19.0.1:40718   172.19.0.4:9042   ESTABLISHED
But on macOS it looks a little different, with only one of the nodes having established TCP comms and the others stuck in SYN_SENT.
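A quick way to see the same thing without netstat is a throwaway Python check against each advertised address (the addresses below are examples; this is a diagnostic sketch, not part of the Rust app):

import socket

# Addresses the driver would try after peer discovery (example values).
nodes = ["127.0.0.1:9041", "127.0.0.1:9042", "127.0.0.1:9043"]

for node in nodes:
    host, port = node.split(":")
    try:
        # A short timeout is enough to distinguish reachable nodes
        # from connections that would otherwise hang in SYN_SENT.
        socket.create_connection((host, int(port)), timeout=2).close()
        print(f"{node}: reachable")
    except OSError as exc:
        print(f"{node}: unreachable ({exc})")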
The short of it is, you don’t really need to
do this in development if you’re just using a single node! I was
able to simplify my docker-compose file and avoid this problem. In
reality, production nodes would most likely be on separate
hosts/pods, so no need to map ports anyway. The nice thing about
running with one node instead of three is that you’ll also avoid
this type of problem, depending on which platform you’re using:
std::runtime_error
The most common cause is not enough request capacity in /proc/sys/fs/aio-max-nr
I was able to
circumvent this issue by using this flag:
--reactor-backend=epoll
This option switches Seastar threads to use epoll for event
polling, as opposed to the default linux-aio
implementation. This may be necessary for development workstations
(in particular Mac OS
deployments) where increasing the value for fs.aio-max-nr
on the host system may not turn out to be so easy. Note that
linux-aio (the default) is still the recommended option for
production deployments. Another mistake I made was using this
argument: --overprovisioned
As it turns out, this
argument needs a value, e.g. 1 to set this flag. However, it’s not
necessary since this argument and the following are already set
when you’re using ScyllaDB in docker: --developer-mode
The refactored code looks something like this:

scylladb1:
  image: scylladb/scylla
  container_name: scylladb1
  expose:
    - "19042"
    - "7000"
  ports:
    - "9042:9042"
  restart: always
  command:
    - --smp 1
    - --reactor-backend=epoll
The only additional argument worth
using in development is the --smp
command line
option to restrict ScyllaDB to a specific number of CPUs. You can
read more about that argument and other
recommendations when using Docker to run ScyllaDB here. As I
write all this out, I think to myself: This seems like pretty basic
advice. But you will see from our conversation that these learnings
are pretty typical for a developer new to ScyllaDB. So hopefully,
you can take something away from this and avoid making the same
mistakes as me. Watch
the First Livestream Session On Demand Next Up: NoSQL Data
Modeling In the next session with Felipe, we’ll dive deeper into my
first queries using ScyllaDB and get the app up and running for the
PoC. Join us as we walk through it on April 18 in Developer
Data Modeling Mistakes: From Postgres to NoSQL.

Benchmarking MongoDB vs ScyllaDB: IoT Sensor Workload Deep Dive
benchANT’s comparison of ScyllaDB vs MongoDB in terms of throughput, latency, scalability, and cost for an IoT sensor workload

benchANT recently benchmarked the performance and scalability of the market-leading general-purpose NoSQL database MongoDB and its performance-oriented challenger ScyllaDB. You can read a summary of the results in the blog Benchmarking MongoDB vs ScyllaDB: Performance, Scalability & Cost, see the key takeaways for various workloads in this technical summary, and access all results (including the raw data) from the benchANT site. This blog offers a deep dive into the tests performed for the IoT sensor workload.

The IoT sensor workload is based on the YCSB and its default data model, but with an operation distribution of 90% insert operations and 10% read operations that simulates a real-world IoT application. The workload is executed with the latest request distribution pattern. This workload is executed against the small database scaling size with a data set of 250GB and against the medium scaling size with a data set of 500GB.

Before we get into the benchmark details, here is a summary of key insights for this workload:

- ScyllaDB outperforms MongoDB with higher throughput and lower latency results for the sensor workload, except for the read latency in the small scaling size
- ScyllaDB provides consistently higher throughput that increases with growing data sizes, reaching up to 19 times that of MongoDB
- ScyllaDB provides lower update latency results (down to 20 times lower) compared to MongoDB
- MongoDB provides lower read latency for the small scaling size, but ScyllaDB provides lower read latencies for the medium scaling size

Throughput Results for MongoDB vs ScyllaDB

The throughput results for the sensor workload show that the small ScyllaDB cluster is able to serve 60 kOps/s with a cluster utilization of ~89%, while the small MongoDB cluster serves only 8 kOps/s under a comparable cluster utilization of 85-90%. For the medium cluster sizes, ScyllaDB achieves an average throughput of 236 kOps/s with ~88% cluster utilization and MongoDB 21 kOps/s with a cluster utilization of 75%-85%.

Scalability Results for MongoDB vs ScyllaDB

Analogous to the previous workloads, the throughput results allow us to compare the theoretical scale-up factor for throughput with the actually achieved scalability. For ScyllaDB, the maximal theoretical throughput scaling factor is 400% when scaling from small to medium. For MongoDB, the theoretical maximal throughput scaling factor is 600% when scaling from small to medium. The ScyllaDB scalability results show that ScyllaDB is able to nearly achieve linear scalability, reaching a throughput scalability of 393% of the theoretically possible 400%. The scalability results for MongoDB show that it achieves a throughput scalability factor of 262% out of the theoretically possible 600%.

Throughput per Cost Ratio

In order to compare the costs/month in relation to the provided throughput, we take the MongoDB Atlas throughput/$ as baseline (i.e. 100%) and compare it with the provided ScyllaDB Cloud throughput/$. The results show that ScyllaDB provides 6 times more operations/$ compared to MongoDB Atlas for the small scaling size and 11 times more operations/$ for the medium scaling size. Similar to the caching workload, MongoDB is able to scale the throughput with growing instance/cluster sizes, but the preserved operations/$ are decreasing.
Latency Results for MongoDB vs ScyllaDB

The P99 latency results for the sensor workload show that ScyllaDB and MongoDB provide consistently low P99 read latencies for the small and medium scaling sizes. MongoDB provides the lowest read latency for the small scaling size, while ScyllaDB provides the lowest read latency for the medium scaling size. For the insert latencies, the results show a similar trend as for the previous workloads: ScyllaDB provides stable and low insert latencies, while MongoDB experiences up to 21 times higher update latencies.

Technical Nugget – Performance Impact of the Data Model

The default YCSB data model is composed of a primary key and a data item with 10 fields of strings, which results in documents with 10 attributes for MongoDB and a table with 10 columns for ScyllaDB. We analyze how performance changes if a pure key-value data model is applied for both databases: a table with only one column for ScyllaDB and a document with only one field for MongoDB, keeping the same record size of 1 KB. Compared to the data model impact for the social workload, the throughput improvements for the sensor workload are clearly lower. ScyllaDB improves the throughput by 8%, while for MongoDB there is no throughput improvement. In general, this indicates that using a pure key-value model improves the performance of read-heavy workloads rather than write-heavy workloads.

Continue Comparing ScyllaDB vs MongoDB

Here are some additional resources for learning about the differences between MongoDB and ScyllaDB:

- Benchmarking MongoDB vs ScyllaDB: Results from benchANT’s complete benchmarking study that comprises 133 performance and scalability measurements comparing MongoDB against ScyllaDB.
- Benchmarking MongoDB vs ScyllaDB: Caching Workload Deep Dive: benchANT’s comparison of ScyllaDB vs MongoDB in terms of throughput, latency, scalability, and cost for a caching workload.
- A Technical Comparison of MongoDB vs ScyllaDB: benchANT’s technical analysis of how MongoDB and ScyllaDB compare with respect to their features, architectures, performance, and scalability.
- ScyllaDB’s MongoDB vs ScyllaDB page: Features perspectives from users – like Discord – who have moved from MongoDB to ScyllaDB.

Steering ZEE5’s Migration to ScyllaDB
Eliminating cloud vendor lockin and supporting rapid growth – with a 5X cost reduction

Kishore Krishnamurthy, CTO at ZEE5, recently shared his perspective on Zee’s massive, business-critical database migration as the closing keynote at ScyllaDB Summit. You can read highlights of his talk below. All of the ScyllaDB Summit sessions, including Mr. Krishnamurthy’s session and the tech talk featuring Zee engineers, are available on demand.

Watch On Demand

About Zee

Zee is a 30-year-old publicly-listed Indian media and entertainment company. We have interests in broadcast, OTT, studio, and music businesses. ZEE5 is our premier OTT streaming service, available in over 190 countries. We have about 150M monthly active users. We are available on web, Android, iOS, smart TVs, and various other television and OEM devices.

Business Pressures on Zee

The media industry around the world is under a lot of bottom-line pressure. The broadcast business is now moving to OTT. While a lot of this business is being effectively captured on the OTT side, the business models on OTT are not scaling up similarly to the broadcast business. In this interim phase, as these business models stabilize, there is a lot of pressure on us to run things very cost-efficiently.

This problem is especially pronounced in India. Media consumption is on the rise in India. The biggest challenge we face is in terms of monetizability: our revenue is in rupees while our technology expenses are in dollars. For example, what we make from one subscription customer in one year is what Netflix or Disney would make in a month. We are on the lookout for technology and vendor partners who have a strong India presence, who have a cost structure that aligns with the kind of scale we provide, and who are able to provide the cost efficiency we want to deliver.

Technical Pressures on the OTT Platform

A lot of the costs we had on the platform scale linearly with usage and user base. That’s particularly true with some aspects of the platform like our heartbeat API, which was the primary use case driving our consideration of ScyllaDB. The linear cost escalation limited how frequently we could run solutions like heartbeat. A lot of our other solutions – like our playback experience, our security, and our recommendation systems – leverage heartbeat in their core infrastructure. Given the cost limitation, we could never scale that up. We also had challenges in terms of the distributed architecture of the solution we had. We were working towards a multi-tenant solution. We were exploring cloud-neutral solutions, etc.

What Was Driving the Database Migration

Sometime last year, we decided that we wanted to get rid of cloud vendor lockin. Every solution we were looking for had to be cloud-neutral, and the database choice also needed to deliver on that. We were also in the midst of a large merger. We wanted to make sure that our stack was multitenant and ready to onboard multiple OTT platforms. So these were reasons why we were eagerly looking for a solution that was both cloud-neutral and multitenant.

Top Factors in Selecting ScyllaDB

We wanted to move away from the master-slave architecture we had. We wanted our solution to be infinitely scalable. We wanted the solution to be multi-region-ready. One of the requirements from a compliance perspective was to be ready for any kind of regional disaster. When we came up with a solution for multi-region, the cost became significantly higher.
We wanted a high-availability, multi-region solution, and ScyllaDB’s clustered architecture allowed us to do that and to move away from cloud vendor lockin. ScyllaDB allowed us to cut dependencies on the cloud provider. Today, we can run a multi-cloud solution on top of ScyllaDB. We also wanted to make sure that the migration would be seamless. ScyllaDB’s clustered architecture across clouds helped us when we were doing our recent cloud migration. It allowed us to make it very seamless.

From a support perspective, the ScyllaDB team was very responsive. They had local support in India, and they had local competency in terms of solution architects who held our hands along the way. So we were very confident we could deliver with their support. We found that operationally, ScyllaDB was very efficient. We could significantly reduce the number of database nodes when we moved to ScyllaDB. That also meant that the costs came down. ScyllaDB also happened to be a drop-in replacement for the current incumbents like Cassandra and DynamoDB. All of this together made it an easy choice for us to select ScyllaDB over the other database choices we were looking at.

Migration Impact

The migration to ScyllaDB was seamless. I’m happy to say there was zero downtime. After the migration to ScyllaDB, we also did a very large-scale cloud migration. From what we heard from the cloud providers, nobody else in the world had attempted this kind of migration overnight. And our migration was extremely smooth. ScyllaDB was a significant part of all the components we migrated, and that second migration was very seamless as well.

After the migration, we moved about 525M users’ data, including their references, login details, session information, watch history, etc., to ScyllaDB. We now have hundreds of millions of heartbeats recorded on ScyllaDB. The overall data we store on ScyllaDB is in the tens of terabytes range at this point. Our overall cost is a combination of the efficiency that ScyllaDB provides in terms of the reduction in the number of nodes we use, and the cost structure that ScyllaDB provides. Together, this has given us a 5x improvement in cost – that’s something our CFO is very happy with.

As I mentioned before, the support has been excellent. Through both the migrations – first the migration to ScyllaDB and subsequently the cloud migration – the ScyllaDB team was always available on demand during the peak periods. They were available on-prem to support us and hand-hold us through the whole thing. All in all, it’s a combination: the synergy that comes from the efficiency, the cost-effectiveness, and the scalability. The whole is more than the sum of the parts. That’s how I feel about our ScyllaDB migration.

Next Steps

ScyllaDB is clearly a favorite with the developers at Zee. I expect a lot more database workloads to move to ScyllaDB in the near future. If you’re curious about the intricacies of using heartbeats to track video watch progress, then catch the talk by Srinivas and Jivesh, where they explained the phenomenally efficient system that we have built.

Watch the Zee Tech Talk

Inside a DynamoDB to ScyllaDB Migration
A detailed walk-through of an end-to-end DynamoDB to ScyllaDB migration

We previously discussed ScyllaDB Migrator’s ability to easily migrate data from DynamoDB to ScyllaDB – including capturing events. This ensures that your destination table is consistent and abstracts much of the complexity involved during a migration. We’ve also covered when to move away from DynamoDB, exploring both technical and business reasons why organizations seek DynamoDB alternatives, and examined what a migration from DynamoDB looks like and how to accomplish it within the DynamoDB ecosystem.

Now, let’s switch to a more granular (and practical) level. Let’s walk through an end-to-end DynamoDB to ScyllaDB migration, building on top of what we discussed in the two previous articles. Before we begin, a quick heads up: the ScyllaDB Spark Migrator should still be your tool of choice for migrating from DynamoDB to ScyllaDB Alternator – ScyllaDB’s DynamoDB-compatible API. But there are some scenarios where the Migrator isn’t an option:

- If you’re migrating from DynamoDB to CQL – ScyllaDB’s Cassandra-compatible API
- If you don’t require a full-blown migration and simply want to stream a particular set of events to ScyllaDB
- If bringing up a Spark cluster is overkill at your current scale

You can follow along using the code in this GitHub repository.

End-to-End DynamoDB to ScyllaDB Migration

Let’s run through a migration exercise where:

- Our source DynamoDB table contains Items with an unknown (but up to 100) number of Attributes
- The application constantly ingests records to DynamoDB
- We want to be sure both DynamoDB and ScyllaDB are in sync before fully switching

For simplicity, we will migrate to ScyllaDB Alternator, our DynamoDB-compatible API. You can easily transform it to CQL as you go. Since we don’t want to incur any impact to our live application, we’ll back-fill the historical data via an S3 data export. Finally, we’ll use AWS Lambda to capture and replay events to our destination ScyllaDB cluster.

Environment Prep – Source and Destination

We start by creating our source DynamoDB table and ingesting data to it. The create_source_and_ingest.py
script will create a DynamoDB table called source and ingest 50K records to it. By default, the table will be created in On-Demand mode, and records will be ingested serially using the BatchWriteItem call. We also do not check whether all Batch Items were successfully processed, but this is something you should watch out for in a real production application. Ingesting the records should take a couple of minutes to complete.
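As a rough illustration of what that script does, a heavily simplified sketch might look like the following. It is hypothetical – the create_source_and_ingest.py in the repository is the reference – and the key and attribute names are made up:

import uuid
import boto3

dynamodb = boto3.resource("dynamodb", region_name="us-east-1")  # example region

# Create the source table in On-Demand (PAY_PER_REQUEST) mode.
table = dynamodb.create_table(
    TableName="source",
    KeySchema=[{"AttributeName": "pk", "KeyType": "HASH"}],
    AttributeDefinitions=[{"AttributeName": "pk", "AttributeType": "S"}],
    BillingMode="PAY_PER_REQUEST",
)
table.wait_until_exists()

# batch_writer() wraps BatchWriteItem and handles the batching for us;
# the real script drives BatchWriteItem directly and serially.
with table.batch_writer() as writer:
    for _ in range(50_000):
        writer.put_item(Item={"pk": str(uuid.uuid4()), "attr1": "value"})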
Next, spin up a ScyllaDB cluster. For demonstration purposes, let’s spin up a ScyllaDB container inside an EC2 instance:
Beyond spinning up a ScyllaDB container, our Docker command-line
does two things: Exposes port 8080
to the host OS to
receive external traffic, which will be required later on to bring
historical data and consume events from AWS Lambda Starts ScyllaDB
Alternator – our DynamoDB compatible API, which we’ll be using for
the rest of the migration. Once your cluster is up and running, the
next step is to create your destination table. Since we are using
ScyllaDB Alternator, simply run the create_target.py
script to create your destination table.
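For illustration, pointing an AWS SDK at Alternator is mostly a matter of overriding the endpoint URL. A minimal sketch (hypothetical endpoint address and key name; the actual create_target.py is the reference) could look like this:

import boto3

# ScyllaDB Alternator speaks the DynamoDB API on port 8080 in this setup.
alternator = boto3.resource(
    "dynamodb",
    endpoint_url="http://<scylladb-host>:8080",  # placeholder address
    region_name="us-east-1",                     # any value; ignored by Alternator
    aws_access_key_id="alternator",              # dummy credentials
    aws_secret_access_key="secret",
)

dest = alternator.create_table(
    TableName="dest",
    KeySchema=[{"AttributeName": "pk", "KeyType": "HASH"}],
    AttributeDefinitions=[{"AttributeName": "pk", "AttributeType": "S"}],
    BillingMode="PAY_PER_REQUEST",
)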
NOTE: Both source and destination tables share the same Key Attributes. In this guided
migration, we won’t be performing any data transformations. If you
plan to change your target schema, this would be the time for you
to do it. 🙂 Back-fill Historical Data For back-filling, let’s
perform a S3 Data Export. First, enable DynamoDB’s point-in-time
recovery: Next, request a full table export. Before running the
below command, ensure the destination S3 bucket exists: Depending
on the size of your actual table, the export may take enough time
for you to step away to grab a coffee and some snacks. In our
tests, this process took around 10-15 minutes to complete for our
sample table. To check the status of your full table export,
replace your table ARN and execute a describe-export call.
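For reference, the three steps used in this section – enabling point-in-time recovery, requesting the export, and checking its status – can also be driven from Python with boto3. A hedged sketch (region, table ARN, and bucket name are placeholders) might look like this:

import boto3

ddb = boto3.client("dynamodb", region_name="us-east-1")  # example region

# 1. Enable point-in-time recovery on the source table.
ddb.update_continuous_backups(
    TableName="source",
    PointInTimeRecoverySpecification={"PointInTimeRecoveryEnabled": True},
)

# 2. Request a full table export to an existing S3 bucket.
export = ddb.export_table_to_point_in_time(
    TableArn="arn:aws:dynamodb:us-east-1:123456789012:table/source",  # placeholder ARN
    S3Bucket="my-export-bucket",                                      # placeholder bucket
    ExportFormat="DYNAMODB_JSON",
)

# 3. Poll the export status until it reports COMPLETED.
status = ddb.describe_export(ExportArn=export["ExportDescription"]["ExportArn"])
print(status["ExportDescription"]["ExportStatus"])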
Once the process completes, modify the s3Restore.py
script accordingly (our
modified version of the
LoadS3toDynamoDB sample code) and execute it to load the
historical data: NOTE: To prevent mistakes and potential
overwrites, we recommend you use a different Table name for your
destination ScyllaDB cluster. In this example, we are migrating
from a DynamoDB Table called source to a destination named
dest. Remember that AWS allows you to request incremental
backups after a full export. If you feel like that process took
longer than expected, simply repeat it in smaller steps with later
incremental backups. This can be yet another strategy to overcome
the DynamoDB Streams 24-hour retention limit.

Consuming DynamoDB Changes

With the historical restore completed, let’s get your
ScyllaDB cluster in sync with DynamoDB. Up to this point, we
haven’t necessarily made any changes to our source DynamoDB table,
so our destination ScyllaDB cluster should already be in sync.
Create a Lambda function

Name your Lambda function
and select Python 3.12 as its Runtime: Expand the
Advanced settings option, and select the
Enable VPC option. This is required, given that
our Lambda will directly write data to ScyllaDB Alternator,
currently running in an EC2 instance. If you omit this option, your
Lambda function may be unable to reach ScyllaDB. Once that’s done,
select the VPC attached to your EC2 instance, and ensure that your
Security Group allows
Inbound traffic to ScyllaDB. In the screenshot below, we are
simply allowing inbound traffic to ScyllaDB Alternator’s port:
Finally, create the function. NOTE: If you are moving away from the
AWS ecosystem, be sure to attach your Lambda function to a VPC with
external traffic. Beware of the latency across different regions or
when traversing the Internet, as it can greatly delay your
migration time. If you’re migrating to a different protocol (such
as CQL), ensure that your Security Group allows routing traffic to
ScyllaDB’s relevant ports.

Grant Permissions

A
Lambda function needs permissions to be useful. We need to be able
to consume events from DynamoDB Streams and load them to ScyllaDB.
Within your recently created Lambda function, go to
Configuration > Permissions. From there, click
the IAM role defined for your Lambda: This will take you to the IAM
role page of your Lambda. Click Add permissions > Attach
policies: Lastly, proceed with attaching the
AWSLambdaInvocation-DynamoDB
policy.

Adjust the Timeout

By default, a Lambda function runs for only
about 3 seconds before AWS kills the process. Since we expect to
process many events, it makes sense to increase the timeout to
something more meaningful. Go to Configuration > General
Configuration, and Edit to adjust its
settings: Increase the timeout to a high enough value (we left it
at 15 minutes) that allows you to process a series of events.
Ensure that when you hit the timeout limit, DynamoDB Streams won’t
consider the Stream to have failed processing, which effectively
sends you into an infinite loop. You may also adjust other
settings, such as Memory, as relevant (we left it at 1Gi).
Deploy

After the configuration steps, it is time
to finally deploy our logic! The dynamodb-copy
folder
contains everything needed to help you do that. Start by editing
the dynamodb-copy/lambda_function.py
file and replace
the alternator_endpoint
value with the IP address and
port relevant to your ScyllaDB deployment. Lastly, run the
deploy.sh
script and specify the Lambda function to
update: NOTE: The Lambda function in question simply issues PutItem
calls to ScyllaDB in a serial way, and does nothing else. For a
realistic migration scenario, you probably want to handle
DeleteItem and UpdateItem API calls, as well as other aspects such
as TTL and error handling, depending on your use case.
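As a rough idea of the handler’s shape, an insert-only sketch could look like the following. It is hypothetical – the lambda_function.py in the dynamodb-copy folder is the reference – and the endpoint address and key layout are placeholders:

import boto3
from boto3.dynamodb.types import TypeDeserializer

# Hypothetical Alternator endpoint; replace with your ScyllaDB address and port.
alternator = boto3.resource(
    "dynamodb",
    endpoint_url="http://<scylladb-host>:8080",
    region_name="us-east-1",
    aws_access_key_id="alternator",
    aws_secret_access_key="secret",
)
dest = alternator.Table("dest")
deserializer = TypeDeserializer()

def lambda_handler(event, context):
    # Each stream record carries the item's NEW_IMAGE in DynamoDB's wire format.
    for record in event["Records"]:
        if record["eventName"] in ("INSERT", "MODIFY"):
            image = record["dynamodb"]["NewImage"]
            item = {k: deserializer.deserialize(v) for k, v in image.items()}
            dest.put_item(Item=item)   # serial PutItem calls, as described above
    return {"processed": len(event["Records"])}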
Capture DynamoDB Changes

Remember that our application is continuously
writing to DynamoDB, and our ultimate goal is to ensure that all
records ultimately exist within ScyllaDB, without incurring any
data loss. At this step, we’ll simply enable DynamoDB Streams to
Capture change events as they go. To accomplish that, simply turn
on DynamoDB streams to capture Item level changes: In View Type,
specify that you want a New Image capture, and proceed with
enabling the feature:

Create a Trigger

At this point, your Lambda
is ready to start processing events from DynamoDB. Within the
DynamoDB Exports and streams configuration, let’s
create a Trigger to invoke our Lambda function every time an item
gets changed: Next, choose the previously created Lambda function,
and adjust the Batch size as needed (we used 1000): Once you create
the trigger, data should start flowing from DynamoDB Streams to
ScyllaDB Alternator!

Generate Events

To show the
situation of an application frequently updating records, let’s
simply re-execute the initial
create_source_and_ingest.py
program: It will insert
another 50K records to DynamoDB, whose Attributes and values will
be very different from the existing ones in ScyllaDB: The program
found that the source table already exists and has simply
overwritten all its existing records. The new records were then
captured by DynamoDB Streams, which should now trigger our
previously created Lambda function, which will stream its records
to ScyllaDB.

Comparing Results

It may take a few minutes for your
Lambda to catch up and ingest all events to ScyllaDB (did we say
coffee?). Ultimately, both databases should get in sync after a few
minutes. Our last and final step is simply to compare both database
records. Here, you can either compare everything, or just a few
selected ones. To assist you with that, here’s our final program:
compare.py! Simply invoke it, and it will compare the first 10K
records across both databases and report any mismatches it finds:
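As a rough idea of what such a comparison involves, a simplified sketch might look like this (hypothetical; compare.py in the repository is the reference, and the key name pk is made up):

import boto3

source = boto3.resource("dynamodb", region_name="us-east-1").Table("source")
dest = boto3.resource(
    "dynamodb",
    endpoint_url="http://<scylladb-host>:8080",   # hypothetical Alternator address
    region_name="us-east-1",
    aws_access_key_id="alternator",
    aws_secret_access_key="secret",
).Table("dest")

def scan_items(table, limit=10_000):
    # Paginate through Scan results until we have `limit` items or run out.
    items, kwargs = {}, {}
    while len(items) < limit:
        page = table.scan(**kwargs)
        for item in page["Items"]:
            items[item["pk"]] = item              # "pk" is the hypothetical key name
        if "LastEvaluatedKey" not in page:
            break
        kwargs["ExclusiveStartKey"] = page["LastEvaluatedKey"]
    return items

src_items, dst_items = scan_items(source), scan_items(dest)
mismatches = [k for k, v in src_items.items() if dst_items.get(k) != v]
print(f"checked {len(src_items)} items, {len(mismatches)} mismatches")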
Congratulations! You moved away from DynamoDB! 🙂

Final Remarks

In
this article, we explored one of the many ways to migrate a
DynamoDB workload to ScyllaDB. Your mileage may vary, but the
general migration flow should be ultimately similar to what we’ve
covered here. If you are interested in how organizations such as
Digital Turbine or Zee migrated from DynamoDB to ScyllaDB, you may
want to see their recent ScyllaDB
Summit talks. Or perhaps you would like to learn more about
different DynamoDB migration approaches? In that case, watch my
talk in our
NoSQL Data Migration Masterclass. If you want to get your
specific questions answered directly, talk to us!