10 June 2025, 1:11 pm by
ScyllaDB
We built a new DynamoDB cost analyzer that helps developers understand what their workloads will really cost.

DynamoDB costs can blindside you. Teams regularly face “bill shock”: that
sinking feeling when you look at a shockingly high bill and realize
that you haven’t paid enough attention to your usage, especially
with on-demand pricing. Provisioned capacity brings a different
risk: performance. If you can’t accurately predict capacity or your
math is off, requests get throttled. It’s a delicate balancing act.
Although
AWS offers a
DynamoDB pricing calculator, it often misses the nuances of
real-world workloads (e.g., bursty traffic or uneven access
patterns, or using global tables or caching). We wanted something
better. In full transparency, we wanted something better to help
the teams considering
ScyllaDB
as a DynamoDB alternative. So we built a new DynamoDB cost
calculator that helps developers understand what their workloads
will really cost. Although we designed it for teams comparing
DynamoDB with ScyllaDB, we believe it’s useful for anyone looking
to more accurately estimate their DynamoDB costs, for any reason.
You can see the live version at:
calculator.scylladb.com
How We Built It

We wanted to build something that would work client-side, without the need for any server components. It’s a simple
JavaScript single page application that we currently host on GitHub
pages. If you want to check out the source code, feel free to take
a look at
https://github.com/scylladb/calculator
To be honest, working with the examples at
https://calculator.aws/ was a bit of
a nightmare, and when you “show calculations,” you get these walls
of text:
I was tempted to take a shorter approach, like:

Monthly WCU Cost = WCUs × Price_per_WCU_per_hour × 730 hours/month

But every
time I simplified this, I found it harder to get parity between
what I calculated and the final price in AWS’s calculation.
Sometimes the difference was due to rounding, other times it was
due to the mixture of reserved + provision capacity, and so on. So
to make it easier (for me) to debug, I faithfully followed their
calculations line by line and tried to replicate this in my own
rather ugly function:
https://github.com/scylladb/calculator/blob/main/src/calculator.js#L48
I may still refactor this into smaller functions. But for now, I
wanted to get parity between theirs and ours. You’ll see that there
are also some unit tests for these calculations at
https://github.com/scylladb/calculator/blob/main/src/calculator.test.js.
I use those to test for a bunch of different configurations. I will
probably expand on these in time as well.
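To make that naive formula concrete, here’s a rough sketch in Go (the per-unit prices are placeholder assumptions, not values taken from the calculator; the real code follows AWS’s line-by-line calculation to keep parity):

```go
package main

import "fmt"

// Placeholder prices (assumed, roughly us-east-1 provisioned capacity).
// The actual calculator replicates AWS's own step-by-step math instead.
const (
	hoursPerMonth   = 730.0
	wcuPricePerHour = 0.00065 // USD per WCU-hour (assumption)
	rcuPricePerHour = 0.00013 // USD per RCU-hour (assumption)
)

// provisionedMonthlyCost applies the shortcut formula:
// cost = units × price_per_unit_per_hour × 730 hours/month
func provisionedMonthlyCost(wcus, rcus float64) float64 {
	return wcus*wcuPricePerHour*hoursPerMonth + rcus*rcuPricePerHour*hoursPerMonth
}

func main() {
	// Example: 500 baseline WCUs and 1,000 baseline RCUs.
	fmt.Printf("~$%.2f/month\n", provisionedMonthlyCost(500, 1000))
}
```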
So that gets the job done for On Demand, Provisioned (and Reserved) capacity models. If
you’ve used AWS’s calculator, you know that you can’t specify
things like a peak (or peak width) in On Demand. I’m not sure about
their reasoning. I decided it would be easier for users to specify
both the baseline and peak for reads and writes (respectively) in
On Demand, much like Provisioned capacity. Another design decision
was to represent the traffic using a chart. I do better with
visuals, so seeing the peaks and troughs makes it easier for me to
understand – and I hope it does for you as well. You’ll also notice
that as you change the inputs, the URL query parameters change to
reflect those inputs. That’s designed to make it easier to share
and reference specific variations of costs. There’s some other math
in there, like figuring out the true cost of Global Tables and
understanding derived costs of things like network transfer or
DynamoDB Accelerator (DAX). However, explaining all that is a bit
too dense for this format. We’ll talk more about that in an
upcoming webinar (see the next section). The good news is that you
can estimate these costs in addition to your workload, as they can
be big cost multipliers when planning out your usage of DynamoDB.
Explore “what if” scenarios for your own workloads

Analyzing Costs in Real-World Scenarios

The ultimate goal of all this
tinkering and tuning is to help you explore various “what-if”
scenarios from a DynamoDB cost perspective. To get you started,
we’ll share the cost impacts of some of the more interesting
DynamoDB user scenarios we’ve come across at ScyllaDB. Join us June
26 for a deep dive into how factors like traffic surges,
multi-datacenter expansion, and the introduction of caching (e.g.,
DAX) impact DynamoDB costs. We’ll explore how a few (anonymized)
teams we work with ended up blindsided by their DynamoDB bills and
the various options they considered for getting costs back under
control.
Join us to
chat about DynamoDB Costs
4 June 2025, 6:31 am by
Datastax Technical HowTo's
Which NoSQL database works best for powering your GenAI use cases?
A look at Cassandra vs. MongoDB and which to use when.
3 June 2025, 12:48 pm by
ScyllaDB
How Blitz scaled their game coaching app with lower latency and leaner operations

Blitz is a fast-growing startup that provides personalized coaching for
games such as League of Legends, Valorant, and Fortnite. They aim
to help gamers become League of Legends legends through real-time
insights and post-match analysis. While players play, the app does
quite a lot of work. It captures live match data, analyzes it
quickly, and uses it for real-time game screen overlays plus
personalized post-game coaching. The guidance is based on each
player’s current and historic game activity, as well as data
collected across billions of matches involving hundreds of millions
of users. Thanks to growing awareness of Blitz’s popular stats and
game-coaching app, their steadily increasing user base pushed their
original Postgres- and Elixir-based architecture to its limits.
This blog post explains how they recently overhauled their League
of Legends data backend – using Rust and ScyllaDB.
TL;DR – In order to provide low latency, high availability, and horizontal scalability to their growing user base, they ultimately:
- Migrated backend services from Elixir to Rust.
- Replaced Postgres with ScyllaDB Cloud.
- Heavily reduced their Redis footprint.
- Removed their Riak cluster.
- Replaced queue processing with real-time processing.
- Consolidated infrastructure from over a hundred cores of microservices to four n4-standard-4 Google Cloud nodes (plus a small Redis instance for edge caching).

As an added bonus, these changes ended up cutting Blitz’s infrastructure costs and reducing the database burden on their engineering staff.

Blitz Background

As Naveed Khan (Head of
Engineering at Blitz) explained, “We collect a lot of data from
game publishers and during gameplay. For example, if you’re playing
League of Legends, we use Riot’s API to pull match data, and if you
install our app we also monitor gameplay in real time. All of this
data is stored in our transactional database for initial
processing, and most of it eventually ends up in our data lake.”
Scaling Past Postgres

One key part of Blitz’s system is the
Playstyles API, which analyzes pre-game data for both teammates and
opponents. This intensive process evaluates up to 20 matches per
player and runs nine separate times per game (once for each player
in the match). The team strategically refactored and consolidated
numerous microservices to improve performance. But the data volume
remained intense. According to Brian Morin (Principal Backend
Engineer at Blitz), “Finding a database solution capable of
handling this query volume was critical.” They originally used
Postgres, which served them well early on. However, as their
write-heavy workloads scaled, the operational complexity and costs
on Google Cloud grew significantly. Moreover, scaling Postgres
became quite complex. Naveed shared, “We tried all sorts of things
to scale. We built multiple services around Postgres to get the
scale we needed: a Redis cluster, a Riak cluster, and Elixir Oban
queues that occasionally overflowed. Queue management became a big
task.” To stay ahead of the game, they needed to move on. As
startups scale, they often switch from “just use Postgres” to “just
use NoSQL.” Fittingly, the Blitz team considered moving to MongoDB,
but eventually ruled it out. “We had lots of MongoDB experience in
the team and some of us really liked it. However, our workload is
very write-heavy, with thousands of concurrent players generating a
constant stream of data. MongoDB uses a single-writer architecture,
so scaling writes means vertically scaling one node.” In other
words,
MongoDB’s primary-secondary architecture would create a
bottleneck for their specific workload and anticipated growth. They
then decided to move forward with RocksDB because of its low
latency and cost considerations. Tests showed that it would meet
their latency needs, so they performed the required data
(re)modeling and migrated a few smaller games over from Postgres to
RocksDB. However, they ultimately decided against RocksDB due to
scale and high availability concerns. “Based on available data from
our testing, it was clear RocksDB wouldn’t be able to handle the
load of our bigger games – and we couldn’t risk vertically scaling
a single instance, and then having that one instance go down,”
Naveed explained.

Why ScyllaDB

One of their backend engineers
suggested ScyllaDB, so they reached out and ran a proof of concept.
They were primarily looking for a solution that could handle their write throughput, scale horizontally, and provide high availability. They tested it on their own hardware first, then
moved to ScyllaDB Cloud. Per Naveed, “The cost was pretty close to
self-hosting, and we got full management for free, so it was a
no-brainer. We now have a significantly reduced Redis cluster, plus
we got rid of the Riak cluster and Oban queues dependencies. Just
write to ScyllaDB and it all just works. The amount of time we
spend on infrastructure management has significantly decreased.”
Performance-wise, the shift met their goal of leveling up the user
experience … and also simplified life for their engineering teams.
Brian added, “ScyllaDB proved exceptional, delivering robust
performance with capacity to spare after optimization. Our League
product peaks at around 5k ops/sec with the cluster reporting under
20% load. Our biggest constraint has been disk usage, which we’ve
rolled out multiple updates to mitigate. The new system can now
often return results immediately instead of relying on cached data,
providing more up-to-date information on other players and even
identifying frequent teammates. The results of this migration have
been impressive: over a hundred cores of microservices have been
replaced by just four n4-standard-4 nodes and a minimal Redis
instance for caching. Additionally, a 3xn2-highmem ScyllaDB cluster
has effectively replaced the previous relational database
infrastructure that required significant computing resources.”
High-Level Architecture of Blitz Server with Rust and ScyllaDB

Rewriting Elixir Services into Rust

As part of a
major backend overhaul, the Blitz team began rethinking their
entire infrastructure – beyond the previously described shift from
Postgres to the high-performance and distributed ScyllaDB.
Alongside this database migration, they also chose to sunset their
Elixir-based services in favor of a more modern language. After
careful evaluation, Rust emerged as the clear choice. “Elixir is
great and it served its purpose well,” explained Naveed. “But we
wanted to move toward something with broader adoption and a
stronger systems-level ecosystem. Rust proved to be a robust and
future-proof alternative.” Now that the first batch of Rust-rewritten services is in production, Naveed and team aren’t
looking back: “Rust is fantastic. It’s fast, and the compiler
forces you to write memory-safe code upfront instead of debugging
garbage-collection issues later. Performance is comparable to C,
and the talent pool is also much larger compared to Elixir.”
29 May 2025, 1:13 pm by
ScyllaDB
How moving from mutation-based streaming to file-based streaming resulted in 25X faster streaming time

Data streaming – an internal operation that moves data from node to node
over a network – has always been the foundation of various ScyllaDB
cluster operations. For example, it is used by “add node”
operations to copy data to a new node in a cluster (as well as
“remove node” operations to do the opposite). As part of our
multiyear project to optimize ScyllaDB’s elasticity, we reworked
our approach to streaming. We recognized that when we moved to
tablets-based data distribution, mutation-based streaming would
hold us back. So we shifted to a new approach: stream the entire
SSTable files without deserializing them into mutation fragments
and re-serializing them back into SSTables on receiving nodes. As a
result, less data is streamed over the network and less CPU is
consumed, especially for data models that contain small cells.
Mutation-Based Streaming

In ScyllaDB, data streaming is a low-level
mechanism to move data between nodes. For example, when nodes are
added to a cluster, streaming moves data from existing nodes to the
new nodes. We also use streaming to decommission nodes from the
cluster. In this case, streaming moves data from the decommissioned
nodes to other nodes in order to balance the data across the
cluster. Previously, we were using a streaming method called
mutation-based streaming. On the sender side, we read the
data from multiple SSTables. We get a stream of mutations,
serialize them, and send them over the network. On the receiver
side, we deserialize and write them to SSTables.

File-Based Streaming

Recently, we introduced a new file-based streaming
method. The big difference is that we do not read the individual
mutations from the SSTables, and we skip all the parsing and
serialization work. Instead, we read and send the SSTable directly
to remote nodes. A given SSTable always belongs to a single tablet.
This means we can always send the entire SSTable to other nodes
without worrying about whether the SSTable contains unwanted data.
We implemented this by having the
Seastar
RPC stream interface stream SSTable files on the network for
tablet migration. More specifically, we take an internal snapshot
of the SSTables we want to transfer so the SSTables won’t be
deleted during streaming. Then, SSTable file readers are created
for them so we can use the Seastar RPC stream to send the SSTable
files over the network. On the receiver side, the file streams are
written into SSTable files by the SSTable writers.
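ScyllaDB’s streaming is implemented in C++ on top of Seastar, so the sketch below is purely conceptual (hypothetical Go interfaces, not real ScyllaDB or Seastar APIs). It just contrasts the two approaches: parsing and re-serializing every mutation versus copying the SSTable as opaque bytes.

```go
package streamsketch

import "io"

// Mutation is a stand-in for a parsed mutation fragment.
type Mutation interface {
	Serialize() []byte
}

// MutationReader is a stand-in for something that parses mutations out of
// SSTables on the sending side.
type MutationReader interface {
	Next() (Mutation, error) // returns io.EOF when exhausted
}

// streamMutations models mutation-based streaming: every fragment is
// parsed, re-serialized, and sent, costing CPU per mutation.
func streamMutations(src MutationReader, conn io.Writer) error {
	for {
		m, err := src.Next()
		if err == io.EOF {
			return nil
		}
		if err != nil {
			return err
		}
		if _, err := conn.Write(m.Serialize()); err != nil {
			return err
		}
	}
}

// streamFile models file-based streaming: since an SSTable belongs to a
// single tablet, the whole file can be sent as opaque bytes. No parsing,
// and the compact on-disk format is exactly what travels over the wire.
func streamFile(sstable io.Reader, conn io.Writer) error {
	_, err := io.Copy(conn, sstable)
	return err
}
```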
Why did we do this? First, it reduces CPU usage because we
do not need to read each and every mutation fragment from the
SSTables, and we do not need to parse mutations. The CPU reduction
is even more significant for small cells, where the ratio of the
amount of metadata parsed to real user data is higher. Second, the
format of the SSTable is much more compact than the mutation format
(since on-disk presentation of data is more compact than
in-memory). This means we have less data to send over the network.
As a result, it can boost the streaming speed rather significantly.
Performance Improvements

To quantify how this shift impacted performance, we compared mutation-based and file-based streaming when migrating tablets between nodes. The tests involved:
- 3 ScyllaDB nodes (i4i.2xlarge)
- 3 loaders (t3.2xlarge)
- 1 billion partitions

Here are the results: Note that
file-based streaming results in 25 times faster streaming time. We
also have much higher streaming bandwidth: the network bandwidth is
10 times faster with file-based streaming. As mentioned earlier, we
have less data to send with file streaming. The data sent on the
wire is almost three times less with file streaming. In addition,
we can also see that file-based streaming consumes many fewer CPU
cycles. Here’s a little more detail, in case you’re curious.

Disk IO Queue

The following sections show how the IO bandwidth compares across mutation-based and file-based streaming. Different colors represent different nodes. As expected, the throughput was higher with file-based streaming. Here are the detailed IO results for mutation-based streaming: The streaming
bandwidth is 30-40MB/s with mutation-based streaming. Here are the
detailed IO results for
file-based streaming: The
bandwidth for file streaming is much higher than with
mutation-based streaming. The pattern differs from the
mutation-based graph because file streaming completes more quickly
and can sustain a high speed of transfer bandwidth during
streaming.

CPU Load

We found that the overall CPU usage is much lower for the file-based streaming. Here are the detailed CPU
results for
mutation-based streaming: Note that
the CPU usage is around 12% for mutation-based streaming. Here are
the detailed CPU results for
file-based streaming:
Note that the CPU usage for the file-based streaming is less than
5%. Again, this pattern differs from the mutation-based streaming
graph because file streams complete much more quickly and can
maintain a high transfer bandwidth throughout.

Wrap Up

This new
file-based streaming makes data streaming in ScyllaDB faster and
more efficient. You can explore it in ScyllaDB Cloud or ScyllaDB
2025.1. Also, our CTO and co-founder Avi Kivity shares an extensive
look at our other recent and upcoming engineering projects in this
tech talk:
More engineering
blog posts
20 May 2025, 1:02 pm by
ScyllaDB
Migrating 300+ TB of data and 400+ services from a key-value database to ScyllaDB – with zero downtime

ReversingLabs recently completed the largest migration in their
history: migrating more than 300 TB of data, more than 400
services, and data models from their internally-developed key-value
database to ScyllaDB seamlessly, and with zero downtime. Services
using multiple tables — reading, writing, and deleting data, and
even using transactions — needed to go through a fast and seamless
switch. How did they pull it off? Martina recently shared their
strategy, including data modeling changes, the actual data
migration, service migration, and a peek at how they addressed
distributed locking. Here’s her complete tech talk: And you can read highlights below…

About ReversingLabs

ReversingLabs is a
security company that aims to analyze every enterprise software
package, container and file to identify potential security threats
and mitigate cybersecurity risks. They maintain a library of 20B
classified samples of known “goodware” (benign) and malware files
and packages. Those samples are supported by ~300 TB of metadata,
which are processed using a network of approximately 400
microservices. As Martina put it: “It’s a huge system, complex
system – a lot of services, a lot of communication, and a lot of
maintenance.”

Never build your own database (maybe?)

When the ReversingLabs team set out to select a database in 2011, the options were limited:
- Cassandra was at version 0.6, which lacked row-level isolation
- DynamoDB was not yet released
- ScyllaDB was not yet released
- MongoDB 1.6 had consistency issues between replicas
- PostgreSQL was struggling with multi-version concurrency control (MVCC), which created significant overhead

“That was an issue for
us—Postgres used so much memory,” Martina explained. “For a startup
with limited resources, having a database that ate all our memory
was a problem. So we built our own data store. I know, it’s
scandalous—a crazy idea today—but in this context, in this market,
it made sense.” The team built a simple key-value store tailored to
their specific needs—no extra features, just efficiency. It
required manual maintenance and was only usable by their
specialized database team. But it was fast, used minimal resources,
and helped ReversingLabs, as a small startup, handle massive
amounts of data (which became a core differentiator). However,
after 10 years, ReversingLabs’ growing complexity and expanding use
cases became overwhelming – to the database itself and the small
database team responsible for it. Realizing that they reached their
home-grown database’s tipping point, they started exploring
alternatives. Enter ScyllaDB. Martina shared: “After an extensive
search, we found ScyllaDB to be the most suitable replacement for
our existing database. It was fast, resilient, and scalable enough
for our use case. Plus, it had all the features our old database
lacked. So, we decided on ScyllaDB and began a major migration
project.”

Migration Time

The migration involved 300 TB of data,
hundreds of tables, and 400 services. The system was complex, so
the team followed one rule: keep it simple. They made minimal
changes to the data model and didn’t change the code at all. “We
decided to keep the existing interface from our old database and
modify the code inside it,” Martina shared. “We created an
interface library and adapted it to work with the ScyllaDB driver.
The services didn’t need to know anything about the change—they
were simply redeployed with the new version of the library,
continuing to communicate with ScyllaDB instead of the old
database.” Moving from a database with a single primary node to one
with a leaderless ring architecture did require some changes,
though. The team had to adjust the primary key structure, but the
value itself didn’t need to be changed. In the old key-value store,
data was stored as a packed protobuf with many fields. Although
ScyllaDB could unpack these protobufs and separate the fields, the
team chose to keep them as they were to ensure a smoother
migration. At this point, they really just wanted to make it work
exactly like before. The migration had to be invisible — they
didn’t want API users to notice any differences. Here’s an overview
of the migration process they performed once the models were ready:
1. Stream the old database output to Kafka. The first step was to set up a Kafka topic dedicated to capturing updates from the old database.
2. Dump the old database into a specified location. Once the streaming pipeline was in place, the team exported the full dataset from the old database.
3. Prepare a ScyllaDB table by configuring its structure and settings. Before loading the data, they needed to create a ScyllaDB table with the new schema.
4. Prepare and load the dump into the ScyllaDB table. With the table ready, the exported data was transformed as needed and loaded into ScyllaDB.
5. Continuously stream data to ScyllaDB. They set up a continuous pipeline with a service that listened to the Kafka topic for updates and loaded the data into ScyllaDB.
After the backlog was processed, the two databases were fully in
sync, with only a negligible delay between the data in the old
database and ScyllaDB. It’s a fairly straightforward process…but it
had to be repeated for 100+ tables.
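As a rough illustration of the continuous-streaming step (step 5 above), here’s a minimal Go sketch of a consumer that reads updates from the Kafka topic and writes them into a ScyllaDB table. The topic, table, and column names are hypothetical; ReversingLabs’ actual service is only described at a high level.

```go
package main

import (
	"context"
	"log"

	"github.com/gocql/gocql"
	"github.com/segmentio/kafka-go"
)

func main() {
	// Hypothetical topic/table names, for illustration only.
	reader := kafka.NewReader(kafka.ReaderConfig{
		Brokers: []string{"kafka:9092"},
		Topic:   "old-db-updates",
		GroupID: "scylla-loader",
	})
	defer reader.Close()

	cluster := gocql.NewCluster("scylla-node1")
	cluster.Keyspace = "migration"
	session, err := cluster.CreateSession()
	if err != nil {
		log.Fatal(err)
	}
	defer session.Close()

	for {
		msg, err := reader.ReadMessage(context.Background())
		if err != nil {
			log.Fatal(err)
		}
		// Keep the value as the original packed protobuf blob, mirroring the
		// "change as little as possible" approach described above.
		if err := session.Query(
			"INSERT INTO samples (sample_key, payload) VALUES (?, ?)",
			string(msg.Key), msg.Value,
		).Exec(); err != nil {
			log.Printf("write failed for key %q: %v", msg.Key, err)
		}
	}
}
```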
Next Up: Service Migration

The next challenge was migrating their ~400 microservices. Martina
introduced the system as follows: “We have master services that act
as data generators. They listen for new reports from static
analysis, dynamic analysis, and other sources. These services serve
as the source of truth, storing raw reports that need further
processing. Each master service writes data to its own table and
streams updates to relevant queues. The delivery services in the
pipeline combine data from different master services, potentially
populating, adding, or calculating something with the data, and
combining various inputs. Their primary purpose is to store the
data in a format that makes it easy for the APIs to read. The
delivery services optimize the data for queries and store it in
their own database, while the APIs then read from these new
databases and expose the data to users.” Here’s the 5-step approach
they applied to service migration:
1. Migrate the APIs one by one. The team migrated APIs incrementally. Each API was updated to use the new ScyllaDB-backed interface library. After redeploying each API, the team monitored performance and data consistency before moving on to the next one.
2. Prepare for the big migration day. Once the APIs were migrated, they had to prepare for the big migration day. Since all the services before the APIs are intertwined, they all had to be migrated at once.
3. Stop the master services. On migration day, the team stopped the master services (data generators), causing input queues to accumulate until the migration was complete. During this time, the APIs continued serving traffic without any downtime. However, the data in the databases was delayed for about an hour or two until all services were fully migrated.
4. Migrate the delivery services. After stopping the master services, the team waited for the queues between the master and delivery services to empty – ensuring that the delivery services processed all data and stopped writing. The delivery services were then migrated one by one to the new database. There was no data flowing at this point because the master services were stopped.
5. Migrate and start the master services. At last, it was time to migrate and start the master services. The final step was to shut down the old database because everything was now working on ScyllaDB.

“It worked great,” Martina shared. “We were happy with the latencies we achieved. If
you remember, our old architecture had a single master node, which
created a single point of failure. Now, with ScyllaDB, we had
resiliency and high availability, and we were quite pleased with
the results.”

And Finally…Resource Locking

One final challenge:
resource locking. Per Martina, “In the old architecture, resource
locking was simple because there was a single master node handling
all writes. You could just use a mutex on the master node, and that
was it—locking was straightforward. Of course, it needed to be tied
to the database connection, but that was the extent of it.”
ScyllaDB’s leaderless architecture meant that the team had to
figure out distributed locking. They leveraged ScyllaDB’s
lightweight transactions and built a distributed locking mechanism
on top of it. The team worked closely with ScyllaDB engineers,
going through several proofs of concept (POCs)—some successful,
others less so. Eventually, they developed a working solution for
distributed locking in their new architecture. You can read all the
details in Martina’s blog post,
Implementing distributed locking with ScyllaDB.
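Her post has the full details. As a general illustration of the technique (not ReversingLabs’ actual implementation), a lock built on lightweight transactions boils down to a conditional insert with a TTL, roughly like this Go/gocql sketch with a hypothetical locks table:

```go
package locks

import (
	"time"

	"github.com/gocql/gocql"
)

// Assumed schema (hypothetical):
//   CREATE TABLE locks (name text PRIMARY KEY, owner uuid);

// tryAcquire attempts to take the named lock for ttl using a lightweight
// transaction (Paxos under the hood). It returns true if the lock was won.
func tryAcquire(session *gocql.Session, name string, owner gocql.UUID, ttl time.Duration) (bool, error) {
	return session.Query(
		"INSERT INTO locks (name, owner) VALUES (?, ?) IF NOT EXISTS USING TTL ?",
		name, owner, int(ttl.Seconds()),
	).SerialConsistency(gocql.Serial).MapScanCAS(map[string]interface{}{})
}

// release deletes the lock only if the caller still owns it, again via LWT.
func release(session *gocql.Session, name string, owner gocql.UUID) error {
	_, err := session.Query(
		"DELETE FROM locks WHERE name = ? IF owner = ?",
		name, owner,
	).SerialConsistency(gocql.Serial).MapScanCAS(map[string]interface{}{})
	return err
}
```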
13 May 2025, 12:30 pm by
ScyllaDB
“Tablets” data distribution makes full table scans on ScyllaDB more performant than ever

Full scans are resource-intensive operations that read through an entire dataset.
They’re often required by analytical queries such as counting total
records, identifying users from specific regions, or deriving top-K
rankings. This article describes how ScyllaDB’s shift to tablets
significantly improves full scan performance and processing time,
as well as how it eliminates the complex tuning heuristics often
needed with the previous vNode-based approach. It’s been quite
some time since we last touched on the subject of handling full
table scans on ScyllaDB. Previously,
Avi Kivity described how the CQL token() function could be used
in a divide and conquer approach to maximize running analytics on
top of ScyllaDB. We also
provided sample Go code and demonstrated how easy and efficient
full scans could be done. With the
recent
introduction of tablets, it turns out that full scans are more
performant than ever.

Token Ring Revisited

Prior to tablets, nodes
in a ScyllaDB cluster owned fractions of the token ring, also known
as
token ranges. A token range is nothing more than a
contiguous segment represented by two (very large) numbers. By
default, each node used to own 256 ranges, also known as vNodes.
When data gets written to the cluster, the Murmur3 hashing function
is responsible for distributing data to replicas of a given
token range. A full table scan thus involved parallelizing
several token ranges until clients eventually traversed the entire ring. As a refresher, a scan involves iterating through multiple subranges (smaller vNode ranges) with the help of the token() function, like this:

```
SELECT ... FROM t WHERE token(key) >= ? AND token(key) < ?
```
To fully traverse the ring as fast as possible, clients needed to keep parallelism high enough (number of nodes × shard count × some smudge factor) to fully benefit from all available processing power. In other words, different cluster
topologies would require different parallelism settings, which
could often change as nodes got added or removed.
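For reference, a minimal Go sketch of that divide-and-conquer pattern over the Murmur3 ring could look like this (table and column names are made up; retries and result handling are omitted):

```go
package main

import (
	"fmt"
	"log"
	"math"
	"sync"

	"github.com/gocql/gocql"
)

func main() {
	cluster := gocql.NewCluster("scylla-node1") // hypothetical contact point
	cluster.Keyspace = "ks"
	session, err := cluster.CreateSession()
	if err != nil {
		log.Fatal(err)
	}
	defer session.Close()

	const parallelism = 64 // roughly nodes × shards × smudge factor
	const subranges = 4096 // how finely we slice the full token ring

	step := uint64(math.MaxUint64) / subranges
	work := make(chan [2]int64, subranges)
	var wg sync.WaitGroup

	for w := 0; w < parallelism; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for r := range work {
				iter := session.Query(
					"SELECT key FROM t WHERE token(key) >= ? AND token(key) < ?",
					r[0], r[1],
				).Iter()
				var key string
				for iter.Scan(&key) {
					// process the row
				}
				if err := iter.Close(); err != nil {
					log.Printf("range [%d, %d): %v", r[0], r[1], err)
				}
			}
		}()
	}

	// Slice [-2^63, 2^63) into equal subranges (the single max token is
	// glossed over here for brevity).
	start := int64(math.MinInt64)
	for i := 0; i < subranges; i++ {
		end := int64(math.MaxInt64)
		if i < subranges-1 {
			end = start + int64(step)
		}
		work <- [2]int64{start, end}
		start = end
	}
	close(work)
	wg.Wait()
	fmt.Println("scan complete")
}
```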
Traversing vNodes worked nicely, but the approach introduced some additional
drawbacks, such as:
- Sparse tables result in wasted work because most token ranges contain little or no data.
- Popular and high-density ranges could require fine-grained tuning to prevent uneven load distribution and resource contention. Otherwise, they would be prone to processing bottlenecks and suboptimal utilization.
- It was impossible to scan a token range owned by a single shard, and particularly difficult to even scan a range owned by a single replica. This increases coordination overhead and creates a performance ceiling on how fast a single token range could be processed.

The old way: system.size_estimates

To assist
applications during range scans, ScyllaDB provided a node-local
system.size_estimates table (something we
inherited from Apache Cassandra) whose schema looks like this:
```
CREATE TABLE system.size_estimates (
    keyspace_name text,
    table_name text,
    range_start text,
    range_end text,
    mean_partition_size bigint,
    partitions_count bigint,
    PRIMARY KEY (keyspace_name, table_name, range_start, range_end)
)
```
Every
token range owned by a given replica provides an estimated number
of partitions along with a mean partition size. The product of both
columns therefore provides a raw estimate on how much data needs to
be retrieved if a scan reads through the entire range. This design
works nicely under small clusters and when data isn’t frequently
changing. Since the data is node local, an application in charge of
the full scan would be required to keep track of 256 vNodes × node count entries to submit its queries. Therefore, larger
clusters could introduce higher processing overhead. Even then, (as
the table name suggests) the number of partitions and their sizes
are just
estimates, which can be underestimated or
overestimated. Underestimating a token range size makes a scan more
prone to timeouts, particularly when its data contains a few large partitions alongside many smaller keys. Overestimating it means a
scan may take longer to complete due to wasted cycles while
scanning through sparse ranges. Parsing the
system.size_estimates table’s data is precisely
what connectors like
Trino and
Spark do when you integrate them with either Cassandra or
ScyllaDB. To address estimate skews, these tools often allow you to
manually tune settings like
split-size in
a trial-and-error fashion until it
somewhat works for your
workload. Its rationale works like this:
- Clients parse the system.size_estimates data from every node in the cluster (since vNodes are non-overlapping ranges, fully describing the ring distribution)
- The size of a specific range is determined by partitionsCount * meanPartitionSize
- It then calculates the estimated number of partitions and the size of the table to be scanned
- It evenly splits each vNode range into subranges, taking its corresponding ring fraction into account
- Subranges are parallelized across workers and routed to natural replicas as an additional optimization
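In code, the sizing and splitting step amounts to something like the sketch below (the split-size target is the knob these connectors expose; the struct and numbers are illustrative only):

```go
package main

import "fmt"

// rangeEstimate mirrors one row of system.size_estimates.
type rangeEstimate struct {
	rangeStart, rangeEnd string
	partitionsCount      int64
	meanPartitionSize    int64 // bytes
}

// splitsFor decides how many subranges to carve a vNode range into,
// given a target split size in bytes.
func splitsFor(e rangeEstimate, targetSplitSize int64) int64 {
	estimatedBytes := e.partitionsCount * e.meanPartitionSize
	n := estimatedBytes / targetSplitSize
	if n < 1 {
		n = 1
	}
	return n
}

func main() {
	e := rangeEstimate{
		rangeStart:        "-9223372036854775808", // illustrative values
		rangeEnd:          "-9151314442816847873",
		partitionsCount:   1_200_000,
		meanPartitionSize: 2_048,
	}
	fmt.Println("splits for this range:", splitsFor(e, 64<<20)) // 64MB target
}
```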
Finally, prior to tablets there was no deterministic way to scan a particular range and target a specific ScyllaDB shard. vNodes have no 1:1 token/shard mapping, meaning a single coordinator request would often need to communicate with other replica shards, making CPU contention more likely.
A layer of indirection: system.tablets

Starting with
ScyllaDB
2024.2, tablets are production ready. Tablets are the
foundation behind ScyllaDB elasticity, while also effectively
addressing the drawbacks involved with full table scans under the
old vNode structure. In case you missed it, I highly encourage you to watch Avi Kivity’s talk on Tablets: Rethinking Replication for an in-depth understanding of how tablets evolved from the previous static vNode topology. During his talk, Avi mentions that tablets are implemented as a layer of indirection mapping a token range to a (replica, shard) tuple. This layer of indirection is exposed in
ScyllaDB as the system.tablets table, whose schema looks like this:
```
CREATE TABLE system.tablets (
    table_id uuid,
    last_token bigint,
    keyspace_name text STATIC,
    resize_seq_number bigint STATIC,
    resize_type text STATIC,
    table_name text STATIC,
    tablet_count int STATIC,
    new_replicas frozen<list<frozen<tuple<uuid, int>>>>,
    replicas frozen<list<frozen<tuple<uuid, int>>>>,
    session uuid,
    stage text,
    transition text,
    PRIMARY KEY (table_id, last_token)
)
```
A tablet represents a contiguous token range
owned by a group of replicas and shards. Unlike the previous static
vNode topology, tablets are created on a per table basis and get
dynamically split or merged on demand. This is important, because
workloads may vary significantly:
- Some are very throughput-intensive under frequently accessed (and small) data sets and will have fewer tablets. These take less time to scan.
- Others may become considerably storage-bound over time, spanning multiple terabytes (or even petabytes) of disk space. These take longer to scan.

A single tablet targets a geometric average size of 5GB
before it gets split. Therefore, splits are done when a tablet
reaches 10GB and merges at 2.5GB. Note that the average size is
configurable, and the default might change in the future. However,
scanning over each tablet owned range allows full scans to
deterministically determine up to how much data they are reading.
The only exception to this rule is when very large (larger than the
average) partitions are involved, although this is an edge case.
Consider the following set of operations: In this example, we start by defining that we want tables within the ks keyspace to start with 128 tablets each. After we create table t, observe that the tablet_count matches what we’ve set upfront. If we had asked for a non-base-2 number, the tablet_count would be rounded to the next base-2 number. The tablet_count represents the total number of tablets across the cluster, where the replicas column represents a tuple of host IDs/shards which are replicas of that tablet, matching our defined replication factor. Therefore,
the previous logic can be optimized like this:
- Clients parse the system.tablets table and retrieve the existing tablet distribution
- Tablet ranges spanning the same replica-shards get grouped and split together
- Workers route requests to natural replica/shard endpoints via shard awareness by setting a routingKey for every request

Tablet full scans have a lot to benefit from these improvements. By directly querying specific shards, we eliminate the cost of cross-CPU and node communication. Traversing the ring is not only more efficient, but effectively removes the problem with sparse ranges and different tuning logic for small and large tables. Finally, given that a tablet has a predetermined size, long gone are the days of fine-tuning splitSizes!
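Here’s a rough Go sketch of the first two steps: enumerating tablet boundaries from system.tablets and then scanning one tablet-owned range at a time. It assumes the ks.t table from the example above with a key column, filters the system table client-side to keep things simple, and leans on the driver’s token-aware (and, with ScyllaDB’s gocql fork, shard-aware) routing. The repo linked in the next section has a complete, scheduler-driven implementation.

```go
package main

import (
	"fmt"
	"log"
	"math"
	"sort"

	"github.com/gocql/gocql"
)

func main() {
	cluster := gocql.NewCluster("scylla-node1") // hypothetical contact point
	// Token-aware routing sends each range query to a natural replica.
	cluster.PoolConfig.HostSelectionPolicy = gocql.TokenAwareHostPolicy(gocql.RoundRobinHostPolicy())
	session, err := cluster.CreateSession()
	if err != nil {
		log.Fatal(err)
	}
	defer session.Close()

	// Collect tablet boundaries for ks.t. Each row's last_token is the
	// (inclusive) upper bound of that tablet's range.
	var boundaries []int64
	iter := session.Query("SELECT keyspace_name, table_name, last_token FROM system.tablets").Iter()
	var ksName, tblName string
	var lastToken int64
	for iter.Scan(&ksName, &tblName, &lastToken) {
		if ksName == "ks" && tblName == "t" {
			boundaries = append(boundaries, lastToken)
		}
	}
	if err := iter.Close(); err != nil {
		log.Fatal(err)
	}
	sort.Slice(boundaries, func(i, j int) bool { return boundaries[i] < boundaries[j] })

	// Scan one tablet at a time: each tablet owns (previousLast, last].
	prev := int64(math.MinInt64)
	rows := 0
	for _, last := range boundaries {
		it := session.Query(
			"SELECT key FROM ks.t WHERE token(key) > ? AND token(key) <= ?",
			prev, last,
		).Iter()
		var key string
		for it.Scan(&key) {
			rows++ // process the row
		}
		if err := it.Close(); err != nil {
			log.Printf("tablet range (%d, %d]: %v", prev, last, err)
		}
		prev = last
	}
	fmt.Println("rows scanned:", rows)
}
```

A production scanner would fan these per-tablet ranges out across workers grouped by replica/shard, which is what the boilerplate repo below does.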
Example

This GitHub
repo contains boilerplate code demonstrating how to carry out
these tasks efficiently. The process involves splitting tablets
into smaller pieces of work, and scheduling them evenly across its
corresponding replica/shards. The scheduler ensures that replica
shards are kept busy with at least 2 inflight requests each,
whereas the least loaded replica always consumes pending work for
processing. The code also simulates real-world latency variability
by introducing some jitter during each request processing.
[Access from the GitHub repo]

Conclusion

This is just the beginning of
our journey with tablets. The logic explained in this blog is
provided for application builders to follow as part of their full
scan jobs. It is worth mentioning that the previous vNode technique
is backward compatible and still works if you use tablets. Remember
that full scans often require reading through lots of data, and we
highly recommend using BYPASS CACHE to prevent invalidating
important cached rows. Furthermore,
ScyllaDB Workload Prioritization helps with isolation and
ensures latencies from concurrent workloads are kept low. Happy scanning!
9 May 2025, 4:18 am by
Datastax Technical HowTo's
In this post, we'll look at the benefits of using managed Cassandra
versus self-hosting, as well as what factors to assess before you
make a purchase decision.
5 May 2025, 12:30 pm by
ScyllaDB
Tech journalist George Anadiotis catches up on how
ScyllaDB’s latest releases deliver extreme elasticity and
price-performance — and shares a peek at what’s next (vector
search, object storage, and more)

This is a guest post
authored by tech journalist George Anadiotis. It’s
a follow-up to articles that he published in
2023 and
2022. In business, they say it takes ten years to
become an overnight success. In technology, they say it takes ten
years to build a file system. ScyllaDB is in the technology
business, offering a distributed NoSQL database that is monstrously
fast and scalable. It turns out that it also takes
ten
years or more to build a successful database. This is something
that Felipe Mendes and Guilherme Nogueira know well. Mendes and
Nogueira are Technical Directors at ScyllaDB, working directly on
the product as well as consulting clients. Recently, they presented
some of the things they’ve been working on at ScyllaDB’s
Monster
Scale Summit, and they shared their insights in an exclusive
fireside chat. You can also catch the podcast on
Apple,
Spotify,
and
Amazon.

The evolution of ScyllaDB

When
ScyllaDB started out, it was all about raw performance. The
goal was to be “the fastest NoSQL database available in the market,
and we did that – we still are” as Mendes put it. However, as he
added, raw speed alone does not necessarily make a good database.
Features such as materialized views, secondary indexes, and
integrations with third party solutions are really important as
well. Adding such features marked the second generation in
ScyllaDB’s evolution. ScyllaDB started as a performance-oriented
alternative to Cassandra, so inevitably,
evolution meant feature parity with Cassandra. The
third generation of ScyllaDB was marked by the move to the
cloud. ScyllaDB Cloud was introduced in 2019 and has been growing at
200% YoY. As Nogueira shared, even today there are daily signups of
new users ready to try the oddly-named database that’s used by
companies such as Discord, Medium, and Tripadvisor, all of which
the duo works with. The
next generation brought a radical break from what Mendes called
the inefficiencies in Cassandra, which involved introducing the
Raft protocol for node
coordination. Now ScyllaDB is moving to a new generation, by
implementing what Mendes and Nogueira referred to as hallmark
features: strong consistency and tablets.

Strong consistency and tablets

The combination of the new Raft and Tablets features
enables clusters to scale up in seconds because it enables nodes to
join in parallel, as opposed to sequentially which was the case for
the Gossip protocol in Cassandra (which ScyllaDB also relied on
originally). But it’s not just adding nodes that’s improved, it’s
also removing nodes.When a node goes down for maintenance, for
example, ScyllaDB’s strong consistency support means that the rest
of the nodes in the cluster will be immediately aware. By contrast,
in the previously supported regime of eventual consistency via a
gossip protocol, it could take such updates a while to propagate.
Using Raft means transitioning to a state machine mechanism, as
Mendes noted. A node leader is appointed, so when a change occurs
in the cluster, the state machine is updated and the change is
immediately propagated. Raft is used to propagate updates
consistently at every step of a topology change. It also allows for
parallel topology updates, such as adding multiple nodes at once.
This was not possible under the gossip-based approach. And this is
where
tablets come in. With tablets, instead of having one single
leader per cluster, there is one leader per tablet. A tablet is a
logical abstraction that partitions data in tables into smaller
fragments. Tablets are load-balanced after new nodes join, ensuring
consistent distribution across the cluster. Any changes to Tablets
ownership are also ensured to be consistent by using Raft to
propagate these changes. Each tablet is independent from the rest,
which means that ScyllaDB with Raft can move them to other nodes on
demand atomically and in a strongly consistent way as workloads
grow or shrink.

Speed, economy, elasticity

By breaking down tables
into smaller and more manageable units, data can be moved between
nodes in a cluster much faster. This means that clusters can be
scaled up rapidly, as Mendes demonstrated. When new nodes join a
cluster, the data is redistributed in minutes rather than hours,
which was the case previously (and is still the case with
alternatives like Cassandra). When we’re talking about machines
that have higher capacity, that also means that they have a higher
storage density to be used, as Mendes noted. Tablets balance out in
a way that utilizes storage capacity evenly, so all nodes in the
cluster will have a similar utilization rate. That’s because the
number of tablets at each node is determined according to the
number of CPUs, which is always tied to storage in cloud nodes. In
this sense, as storage utilization is more flexible and the cluster
can scale faster, it also allows users to run at a much higher
storage utilization rate. A typical storage utilization rate,
Mendes said, is 50% to 60%. ScyllaDB aims to run at up to 90%
storage utilization. That’s because tablets and cloud automations
enable ScyllaDB Cloud to rapidly scale the cluster once those
storage thresholds are exceeded, as ScyllaDB’s benchmarking shows.
Going from 60% to 90% storage utilization means an extra 30% of each node’s disk space can be utilized. At scale, that translates to
significant savings for users. Further to scaling speed and
economy, there is an additional benefit to tablets:
enabling the elasticity of cloud operations for cloud
deployments, without the complexity.

Something old, something new, something borrowed, something blue

Beyond strong consistency
and tablets, there is a wide range of new features and improvements
that the ScyllaDB team is working on. Some of these, such as
support
for S3 object storage, are efforts that are ongoing. Besides
offering users choice, as well as a way to economize even further
on storage, object storage support could also serve resilience.
Other features, such as workload prioritization or the
Alternator DynamoDB-compatible API, have been there for a while
but are being improved and re-emphasized. As Mendes shared, when
running a variety of workloads, it’s very hard for the database to
know which is which and how to prioritize. Workload prioritization
enables users to characterize and prioritize workloads, assigning
appropriate service levels to each. Last but not least, ScyllaDB is
also
adding
vector capabilities to the database engine. Vector data types,
data structures, and query capabilities have been implemented and
are being benchmarked. Initial results show great promise, even
outperforming pure-play vector databases. This will eventually
become a core feature, supported on both on-premises and cloud
offerings. Once again, ScyllaDB is keeping with the times in its
own characteristic way. As Mendes and Nogueira noted, there are
many ScyllaDB clients using ScyllaDB to power AI workloads, some of
them
like Clearview AI sharing their stories. Nevertheless, ScyllaDB
remains focused on database fundamentals,
taking calculated steps in the spirit of continuous improvement
that has become its trademark. After all, why change something
that’s so deeply ingrained in the organization’s culture, is
working well for them and appreciated by the ones who matter most –
users?