17 February 2026, 1:43 pm by
ScyllaDB
Monster Scale Summit is all about extreme scale engineering and
data-intensive applications. So here’s a big announcement: the
agenda is now available! Attendees can join 50+ tech talks,
including:
- Keynotes by antirez, Camille Fournier, Pat Helland, Joran Greef, Thea Aarrestad, Dor Laor, and Avi Kivity
- Martin Kleppmann & Chris Riccomini chatting about the second edition of Designing Data-Intensive Applications
- Tales of extreme scale engineering – Rivian, Pinterest, LinkedIn, Nextdoor, Uber Eats, Google, Los Alamos National Labs, CERN, and AmEx
- ScyllaDB customer perspectives – Discord, Disney, Freshworks, ShareChat, SAS, Sprig, MoEngage, Meesho, Tiket, and Zscaler
- Database engineering – Inside looks at ScyllaDB, turbopuffer, Redis, ClickHouse, DBOS, MongoDB, DuckDB, and TigerBeetle
- What’s new/next for ScyllaDB – Vector search, tablets, tiered storage, data consistency, incremental repair, Rust-based drivers, and more
Like other
ScyllaDB-hosted conferences (e.g.,
P99 CONF), the event will be
free and virtual so everyone can participate. Take
a look, register, and start choosing your own adventure across the
multiple tracks of tech talks.
Full Agenda
Register [free + virtual]
When you join us March 11 and 12, you can…
- Chat directly with speakers and connect with ~20K of your peers
- Participate in interactive trainings on topics like real-time AI, database performance at scale, high availability, and cloud cost optimization strategies
- Pick the minds of ScyllaDB engineers and architects, who are available to answer your toughest database performance questions
- Win conference swag, sea monster plushies, book bundles, and other cool giveaways
Details, Details
The agenda site has all the scheduling, abstracts, and speaker details.
Please note that times are shown in your local time zone. Be sure
to scroll down into the Instant Access section. This is one of the
best parts of Monster SCALE Summit. You can access these sessions
from the minute the event platform opens until the conference
wraps. Some teams have shared that they use Instant Access to build
their own watch parties beyond the live conference hours. If
you do this, please share photos! Another important
detail: books. Quite a few of our speakers are gluttons for punishment… that is, accomplished book authors:
- Camille Fournier – Platform Engineering, The Manager’s Path
- Martin Kleppmann and Chris Riccomini – Designing Data-Intensive Applications 2E
- Chris Riccomini – The Missing README
- Dominik Tornow – Think Distributed Systems
- Teiva Harsanyi – 100 Go Mistakes and How to Avoid Them, The Coder Cafe
- Felipe Cardeneti Mendes – Database Performance at Scale
We will have book giveaways
throughout the event, so be sure to attend live.
All registrants get 30-day access to the complete O’Reilly
platform, which includes all O’Reilly, Manning, and many other tech
books. This includes the shiny new second edition of
Designing
Data-Intensive Applications, which publishes this month.
Perfect timing…
Register now –
it’s free
16 February 2026, 1:09 pm by
ScyllaDB
Learn about ScyllaDB database-level encryption with
Customer-Managed Keys & see how to set up and manage encryption
with a customer key — or delegate encryption to ScyllaDB
ScyllaDB Cloud takes a proactive approach to ensuring the security
of sensitive data: we provide database-level encryption in addition
to the default storage-level encryption. With this added layer of
protection, customer data is always protected against attacks.
Customers can focus on their core operations, knowing that their
critical business and customer assets are well-protected. Our
clients can either use a customer-managed key (CMK, our version of
BYOK) or let ScyllaDB Cloud manage the CMK for them. The feature is
available in all cloud platforms supported by ScyllaDB Cloud. This
article explains how ScyllaDB Cloud protects customer data. It
focuses on the technical aspects of ScyllaDB database-level
encryption with Customer-Managed Keys (CMK).
Storage-level encryption
Encryption at rest is when data files are encrypted
before being written to persistent storage. ScyllaDB Cloud always
uses encrypted volumes to prevent data breaches caused by physical
access to disks.
Database-level encryption
Database-level
encryption is a technique for encrypting all data before it is
stored in the database.
The ScyllaDB Cloud feature is based on
the proven ScyllaDB Enterprise database-level encryption at rest,
extended with the Customer Managed Keys (CMK) encryption control.
This ensures that the data is securely stored – and the customer is
the one holding the key. The keys are stored and protected
separately from the database, substantially increasing security.
ScyllaDB Cloud provides full database-level encryption using the
Customer Managed Keys (CMK) concept. It is based on envelope
encryption to encrypt the data and decrypt only when the data is
needed. This is essential to protect the customer data at rest.
Some industries, like healthcare or finance, have strict data
security regulations. Encrypting all data helps businesses comply
with these requirements, avoiding the need to prove that all tables
holding sensitive personal data are covered by encryption. It also
helps businesses protect their corporate data, which can be even
more valuable. A key feature of CMK is that the customer has complete control of the encryption keys. (We will introduce data encryption keys a bit later in this article.) The customer can:
- Revoke data access at any time
- Restore data access at any time
- Manage the master keys needed for decryption
- Log all access attempts to keys and data
Customers can delegate all key management operations to the ScyllaDB Cloud support team if they prefer. To do so, the customer chooses the ScyllaDB key when creating the cluster, which still ensures customer data is secure and adheres to all privacy regulations. By
default, encryption uses the symmetrical algorithm AES-128, a solid
corporate encryption standard covering all practical applications.
Breaking AES-128 can take an immense amount of time, approximately
trillions of years. The strength can be increased to AES-256.
Note: Database-level encryption in ScyllaDB Cloud
is available for all clusters deployed in Amazon Web Services (AWS)
and Google Cloud Platform (GCP).
Encryption
To ensure all user data is protected, ScyllaDB will encrypt:
- All user tables
- Commit logs
- Batch logs
- Hinted handoff data
This ensures all customer data is
properly encrypted. The first step of the encryption process is to
encrypt every record with a data encryption key (DEK). Once the
data is encrypted with the DEK, it is sent to either AWS KMS or GCP
KMS, where the master key (MK) resides. The DEK is then encrypted
with the master key (MK), producing an encrypted DEK (EDEK or a
wrapped key). The master key remains in the KMS, while the EDEK is
returned and stored with the data. The DEK used to encrypt the data
is destroyed to ensure data protection. A new DEK will be generated
the next time new data needs to be encrypted.
Decryption
Because the original, unencrypted DEK is destroyed once the EDEK is produced, the stored data cannot be decrypted on its own. The EDEK cannot be used to decrypt the data directly because the DEK inside it is itself encrypted. It first has to be unwrapped, and that requires the master key (MK), which remains in the KMS. Once the DEK is unwrapped, the data can be decrypted. As you can see, the data cannot be decrypted without the master key – which is protected at all times in the KMS and cannot be “copied” outside of it. By revoking the master key, the customer can disable access to the data independently of any database or application authorization.
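To make the flow concrete, here is a toy sketch of envelope encryption. XOR stands in for real AES and KMS calls purely to keep it short and runnable; it is not how ScyllaDB or any KMS actually encrypts data:

```rust
// Toy envelope encryption: a DEK encrypts the data, the master key (inside
// the "KMS") wraps the DEK, and only the wrapped EDEK is stored with the data.
fn xor(key: &[u8], data: &[u8]) -> Vec<u8> {
    data.iter().enumerate().map(|(i, b)| b ^ key[i % key.len()]).collect()
}

struct Kms {
    master_key: Vec<u8>, // the MK never leaves this struct
}

impl Kms {
    fn wrap_dek(&self, dek: &[u8]) -> Vec<u8> { xor(&self.master_key, dek) }      // DEK -> EDEK
    fn unwrap_edek(&self, edek: &[u8]) -> Vec<u8> { xor(&self.master_key, edek) } // EDEK -> DEK
}

struct EncryptedRecord {
    ciphertext: Vec<u8>, // data encrypted with the DEK
    edek: Vec<u8>,       // DEK wrapped by the master key
}

fn write_record(kms: &Kms, dek: Vec<u8>, plaintext: &[u8]) -> EncryptedRecord {
    let ciphertext = xor(&dek, plaintext); // encrypt the data with a fresh DEK
    let edek = kms.wrap_dek(&dek);         // wrap the DEK with the master key
    // `dek` goes out of scope here: only the wrapped EDEK is stored with the data.
    EncryptedRecord { ciphertext, edek }
}

fn read_record(kms: &Kms, record: &EncryptedRecord) -> Vec<u8> {
    let dek = kms.unwrap_edek(&record.edek); // impossible without access to the MK
    xor(&dek, &record.ciphertext)            // decrypt the data with the recovered DEK
}

fn main() {
    let kms = Kms { master_key: vec![7; 16] };
    let record = write_record(&kms, vec![42; 16], b"sensitive row");
    assert_eq!(read_record(&kms, &record), b"sensitive row".to_vec());
}
```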
Multi-region deployment
Adding new data centers to the ScyllaDB cluster will create additional local keys in those
regions. All master keys support multi-region replication, and a copy of each key resides locally in each region – ensuring those multi-regional setups are protected against cloud provider regional outages and disasters. The keys are available in the same region as
the data center and can be controlled independently. In case you
use a Customer Key – cloud providers will charge you for the KMS.
AWS will charge $1/month, GCP will change you $0.06 for each
cluster prorated per hour. Each additional DC creates a replica
that is counted as an additional key. There is an additional
cost per key request. ScyllaDB Enterprise utilizes those requests
efficiently, resulting in an estimated monthly cost of up to $1 for
a 9-node cluster. Managing encryption keys adds another layer of
administrative work in addition to the extra cost. ScyllaDB Cloud
offers database clusters that can be encrypted using keys managed
by ScyllaDB support. They provide the same level of protection, but
our support team helps you manage the master keys. The ScyllaDB
keys are applied by default and are subsidized by ScyllaDB.
Creating a Cluster with Database-Level Encryption
Creating a cluster with database-level encryption requires:
- A ScyllaDB Cloud account – If you don’t have one, you can create a ScyllaDB Cloud account here.
- 10 minutes with a ScyllaDB Key, or 20 minutes creating your own key
To create a cluster with database-level encryption enabled,
we will need a master key. We can either create a customer-managed
key using ScyllaDB Cloud UI or skip this step completely and use a
ScyllaDB Managed Key, which will skip the next six steps. In both
cases, all the data will be protected by strong encryption at the
database level. Instructions for setting up the customer-managed key can be found in the database-level encryption documentation. Transparent
database-level encryption in ScyllaDB Cloud significantly boosts
the security of your ScyllaDB clusters and backups.
Next Steps
Start using this feature in
ScyllaDB
Cloud. Get your questions answered in our
community forum and
Slack channel. Or,
use our contact
form.
11 February 2026, 1:56 pm by
ScyllaDB
Lessons learned building a Rust-backed Node.js driver for
ScyllaDB: bridging JS and Rust, performance pitfalls, and benchmark
results This blog post explores the story of building a
new Node.js database driver as part of our Student Team Programming
Project. Up ahead: troubles with bridging Rust with JavaScript, a
new solution being initially a few times slower than the previous
one, and a few charts! Note: We cover the progress made until June
2025 as part of the
ZPP
project, which is a collaboration between ScyllaDB and
University of Warsaw. Since then, the ScyllaDB Driver team adopted
the project (and now it’s almost production ready). Motivation The
database speaks one language, but users want to speak to it in
multiple languages: Rust, Go, C++, Python, JavaScript, etc. This is
where a driver comes in, acting as a “translator” of sorts. All the
JavaScript developers of the world currently rely on the DataStax
Node.js driver. It is developed with the Cassandra database in
mind, but can also be used for connecting to ScyllaDB, as they use
the same protocol – CQL. This driver gets the job done, but it is
not designed to take full advantage of ScyllaDB’s features (e.g.,
shard-per-core architecture, tablets). A solution for that is
rewriting the driver and creating one that is in-house, developed
and maintained by ScyllaDB developers. This is a challenging task
requiring years of intensive development, with new tasks
interrupting along the way. An alternative approach is writing the
new driver as a wrapper around an existing one – theoretically
simplifying the task (
spoiler: not always) to just
bridging the interfaces. This concept was proven in the making of
the
ScyllaDB C / C++ driver, which is an overlay over the Rust
driver. We chose the ScyllaDB Rust driver as the backend of the
new JavaScript driver for a few reasons. ScyllaDB’s Rust driver is
developed and maintained by ScyllaDB. That means it’s always up to
date with the latest database features, bug fixes, and
optimizations. And since it’s written in Rust, it offers
native-level performance without sacrificing memory safety.
[More background on this approach] Development of such a solution
skips the implementation of complicated database handling logic,
but brings its own set of problems. We wanted our driver to be as
similar as possible to the Node.js driver so anyone wanting to
switch does not need to do much configuration. This was a
restriction on one side. On the other side, we have limitations of
the Rust driver interface. Driver implementations differ and the
API for communicating with them can vary in some places. Some give
a lot of responsibility to the user, requiring more effort but
giving greater flexibility. Others do most of the work without
allowing for much customization. Navigating these considerations is
a recurring theme when choosing to write a driver as a wrapper over
a different one. Despite the challenges during development, this
approach comes with some major advantages. Once the initial
integration is complete, adding new ScyllaDB features becomes much
easier. It’s often just a matter of implementing a few bridging
functions. All the complex internal logic is handled by the Rust
driver team. That means faster development, fewer bugs, and better
consistency across languages. On top of that, Rust is significantly
faster than Node.js. So if we keep the overhead from the bridging
layer low, the resulting driver can actually outperform existing
solutions in terms of raw speed.
The environment: NAPI vs NAPI-RS vs Neon
With the goal of creating a driver that uses the ScyllaDB Rust
Driver underneath, we needed to decide how we would be
communicating between languages. There are two main options when it
comes to communicating between JavaScript and other languages:
Use a
Node
API (NAPI for short) – an API built directly into the NodeJS
engine, or Interface the program through the V8 JavaScript engine.
While we could use one of those communication methods directly,
they are dedicated for C / C++, which would mean writing a lot of
unsafe code. Luckily, other options exist:
NAPI-RS and
Neon. Those libraries handle all the
unsafe code required for using the C / C++ APIs and expose (mostly
safe) Rust interfaces. The first option uses NAPI exclusively under
the hood, while the Neon option uses both of those interfaces.
After some consideration, we decided to use NAPI-RS over Neon. Here
are the things we considered when deciding which library to use:
– Library approach — In NAPI-RS, the library
handles the serialization of data into the expected Rust types.
This lets us take full advantage of Rust’s static typing and any
related optimizations. With Neon, on the other hand, we have to
manually parse values into the correct types. With NAPI-RS, writing
a simple function is as easy as adding a #[napi] tag:
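Below is a minimal sketch of such a function; the function name is purely illustrative:

```rust
use napi_derive::napi;

// NAPI-RS deserializes the JavaScript arguments into the declared Rust types
// and converts the return value back for us.
#[napi]
pub fn add(a: u32, b: u32) -> u32 {
  a + b
}
```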
And in Neon, we need to manually handle the JavaScript context:
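A roughly equivalent sketch in Neon (again illustrative, not the driver's actual code):

```rust
use neon::prelude::*;

fn add(mut cx: FunctionContext) -> JsResult<JsNumber> {
    // Manually extract and downcast each JavaScript argument.
    let a = cx.argument::<JsNumber>(0)?.value(&mut cx);
    let b = cx.argument::<JsNumber>(1)?.value(&mut cx);
    Ok(cx.number(a + b))
}

#[neon::main]
fn main(mut cx: ModuleContext) -> NeonResult<()> {
    // Export the function so JavaScript can call it as `add(a, b)`.
    cx.export_function("add", add)?;
    Ok(())
}
```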
– Simplicity of use — As a
result of the serialization model, NAPI-RS leads to cleaner and
shorter code. When we were implementing some code samples for the
performance comparison, we had serious trouble implementing code in
Neon just for a simple example. Based on that experience, we
assumed similar issues would likely occur in the future.
–
Performance — We made some simple tests comparing the
performance of library function calls and sending data between
languages. While both options were visibly slower than pure
JavaScript code, the NAPI-RS version had better performance. Since
driver efficiency is a critical requirement, this was an important
factor in our decision. You can read more about the benchmarks in
our thesis.
– Documentation — Although the
documentation for both tools is far from perfect, NAPI-RS’s
documentation is slightly more complete and easier to navigate.
Current state and capabilities Note: This represents the state as
of May 2025. More features have been introduced since then.
See the project
readme for a brief overview of current and planned features.
The driver supports regular statements (both select and insert) and
batch statements. It supports all CQL types, including encoding
from almost all allowed JS types. We support prepared statements
(when the driver knows the expected types based on the prepared
statement), and we support unprepared statements (where users can
either provide type hints, or the driver guesses expected value
types). Error handling is one of the few major functions that
behaves differently than the DataStax driver. Since the Rust driver
throws different types of errors depending on the situation, it’s
nearly impossible to map all of them reliably. To avoid losing
valuable information, we pass through the original Rust errors as
is. However, when errors are generated by our own logic in the
wrapper, we try to keep them consistent with the old driver’s error
types. In the DataStax driver, you needed to explicitly call
shutdown() to close the database connection. This generated some
problems: when the connection variable was dropped, the connection
sometimes wouldn’t stop gracefully, even keeping the program
running in some situations. We decided to switch this approach, so
that the connection is automatically closed when the variable
keeping the client is dropped. For now, it’s still possible to call
shutdown on the client. Note: We are still discussing the right
approach to handling a shutdown. As a result, the behavior
described here may change in the future. Concurrent execution The
driver has a dedicated endpoint for executing multiple queries
concurrently. While this endpoint gives you less control over
individual requests — for example, all statements must be prepared
and you can’t set different options per statement — these
constraints allow us to optimize performance. In fact, this
approach is already more efficient than manually executing queries
in parallel (around 35% faster in our internal testing), and we
have additional optimization ideas planned for future
implementation. Paging The Rust and DataStax drivers both have
built-in support for paging, a CQL feature that allows splitting
results of large queries into multiple chunks (pages).
Interestingly, although the DataStax driver has multiple endpoints
for paging, it doesn’t allow execution of unpaged queries. Our
driver supports the paging endpoints (for now, one of those
endpoints is still missing) and we also added the ability to
execute unpaged queries in case someone ever needs that. With the
current paging API, you have several options for retrieving paged
results:
Automatic iteration: You can iterate over
all rows in the result set, and the driver will automatically
request the next pages as needed.
Manual
paging: You can manually request the next page of
results when you’re ready, giving you more control over the paging
process.
Page state transfer: You can
extract the current page state and use it to fetch the next page
from a different instance of the driver. This is especially useful
in scenarios like stateless web servers, where requests may be
handled by different server instances. Prepared statements cache
Whenever executing multiple instances of the same statement, it’s
recommended to use prepared statements. In ScyllaDB Rust Driver, by
default, it’s the user’s responsibility to keep track of the
already prepared statements to avoid preparing them multiple times
(and, as a result, increasing both the network usage and execution
times). In the DataStax driver, it was the driver’s responsibility
to avoid preparing the same query multiple times. In the new
driver, we use the Rust driver’s caching session for most of the statement caching.
Optimizations
One of the initial goals for the
project was to have a driver that is faster than the DataStax
driver. While using NAPI-RS added some overhead, we hoped the
performance of the Rust driver would help us achieve this goal.
With the initial implementation, we didn’t put much focus on
efficient usage of the NAPI-RS layer. When we first benchmarked the
new driver, it turned out to be way slower compared to both the
DataStax JavaScript driver and the ScyllaDB Rust driver…
| Operations | scylladb-javascript-driver (initial version) [s] | Datastax-cassandra-driver [s] | Rust-driver [s] |
|---|---|---|---|
| 62500 | 4.08 | 3.53 | 1.04 |
| 250000 | 13.50 | 5.81 | 1.73 |
| 1000000 | 55.05 | 15.37 | 4.61 |
| 4000000 | 227.69 | 66.95 | 18.43 |

| Operations | scylladb-javascript-driver (initial version) [s] | Datastax-cassandra-driver [s] | Rust-driver [s] |
|---|---|---|---|
| 62500 | 1.63 | 2.61 | 1.08 |
| 250000 | 4.09 | 2.89 | 1.52 |
| 1000000 | 15.74 | 4.90 | 3.45 |
| 4000000 | 58.96 | 12.72 | 11.64 |
| Operations | scylladb-javascript-driver (initial version) [s] | Datastax-cassandra-driver [s] | Rust-driver [s] |
|---|---|---|---|
| 62500 | 1.96 | 3.11 | 1.31 |
| 250000 | 4.90 | 4.33 | 1.89 |
| 1000000 | 16.99 | 10.58 | 4.93 |
| 4000000 | 65.74 | 31.83 | 17.26 |

Those results were a bit of a surprise, as we didn’t fully
anticipate how much overhead NAPI-RS would introduce. It turns out
that using JavaScript Objects introduced much higher overhead than other built-in types or Buffers. You can see in
the following flame graph how much time was spent executing NAPI
functions (
yellow-orange highlight), which are related to
sending objects between languages.
Creating objects with NAPI-RS is as simple as adding the
#[napi] tag to the struct we want to expose to the
NodeJS part of the code. This approach also allows us to create
methods on those objects. Unfortunately, given its overhead, we
needed to switch the approach – especially in the most used parts
of the driver, like parsing parameters, results, or other parts of
executing queries. We can create a napi object like this:
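Here is a sketch using a made-up Point type (not one of the driver’s actual structs):

```rust
use napi_derive::napi;

// Hypothetical type, just to illustrate the mechanism.
#[napi]
pub struct Point {
  pub x: f64,
  pub y: f64,
}

#[napi]
impl Point {
  #[napi(constructor)]
  pub fn new(x: f64, y: f64) -> Self {
    Point { x, y }
  }

  // Exposed as a method on the generated JavaScript class.
  #[napi]
  pub fn length(&self) -> f64 {
    self.x.hypot(self.y)
  }
}
```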
NAPI-RS converts this struct into a corresponding JavaScript class (plus TypeScript definitions), and we can use it from both JavaScript and Rust. When accepting
values as arguments to Rust functions exposed in NAPI-RS, we can
either accept values of the types that implement the
FromNapiValue trait, or accept references to values of
types that are exposed to NAPI (these implement the default
FromNapiReference trait). We can do it like this:
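For instance (function names invented, and reusing the hypothetical Point type from the earlier sketch):

```rust
use napi_derive::napi;

// u32 implements FromNapiValue, so NAPI-RS converts the incoming JS number for us.
#[napi]
pub fn double(value: u32) -> u32 {
  value * 2
}

// A type exposed with #[napi] (like the hypothetical Point above) can be
// taken by reference instead of being re-serialized on every call.
#[napi]
pub fn distance_from_origin(point: &Point) -> f64 {
  point.x.hypot(point.y)
}
```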
When we call such a function from JavaScript, we can just pass a plain number.
FromNapiValue is implemented for built-in types like
numbers or strings, and the
FromNapiReference trait is created automatically
when using the
#[napi] tag on a Rust struct. Compared
to that, we need to manually implement
FromNapiValue for custom structs. However, this
approach allows us to receive those objects in functions exposed to
NodeJS, without the need for creating Objects – and thus
significantly improves performance. We used this mostly to improve
the performance of passing query parameters to the Rust side of the
driver. When it comes to returning values from Rust code, a type
must have a
ToNapiValue trait implemented. Similarly,
this trait is already implemented for built-in types, and is auto
generated with macros when adding the
#[napi] tag to
the object. And this auto generated implementation was causing most
of our performance problems. Luckily, we can also implement our own
ToNapiValue trait. If we return a raw value and create
an object directly in the JavaScript part of the code, we can avoid
almost all of the negative performance impacts that come from the
default implementation of
ToNapiValue. We can do it
like this:
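Here is a rough sketch of the idea, using a made-up wrapper type; exact trait paths and signatures may differ between napi-rs versions:

```rust
use napi::bindgen_prelude::*;
use napi::sys;

// Hypothetical wrapper type used only for illustration.
pub struct Timestamp {
  millis: i64,
}

// Instead of letting the macro build a full JS object, hand JavaScript just
// the raw number; the JS side of the driver wraps it in its own class.
impl ToNapiValue for Timestamp {
  unsafe fn to_napi_value(env: sys::napi_env, val: Self) -> Result<sys::napi_value> {
    i64::to_napi_value(env, val.millis)
  }
}
```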
This will return just the number instead of the whole struct. An
example of such places in the code was UUID. This type is used for
providing the UUID retrieved as part of any query, and can also be
used for inserts. In the initial implementation, we had a UUID wrapper: an object created in the Rust part of the code, with a default ToNapiValue implementation that handled all the logic for the UUID. When we changed the
approach to returning just a raw buffer representing the UUID and
handling all the logic on the JavaScript side, we shaved off about
20% of the CPU time we were using in the select benchmarks at that
point in time. Note: Since completing the initial project, we’ve
introduced additional changes to how serialization and
deserialization works. This means the current state may be
different from what we describe here. A new round of benchmarking
is in progress; stay tuned for those results. Benchmarks In the
previous section, we showed you some early benchmarks. Let’s talk a
bit more about how we tested and what we tested. All benchmarks
presented here were run on a single machine – the database was run
in a Docker container and the driver benchmarks were run without
any virtualization or containerization. The machine was running on
AMD Ryzen™ 7 PRO 7840U with 32GB RAM, with the database itself
limited to 8GB of RAM in total. We tested the driver both with
ScyllaDB and Cassandra (latest stable versions as of the time of
testing – May 2025). Both of those databases were run in a three
node configuration, with 2 shards per node in the case of ScyllaDB.
With this information on the benchmarks, let’s see the effect all
the optimizations we added had on the driver performance when
tested with ScyllaDB:
| Operations | Scylladb-javascript-driver [s] | Datastax-cassandra-driver [s] | Rust-driver [s] | scylladb-javascript-driver (initial version) [s] |
|---|---|---|---|---|
| 62500 | 1.89 | 3.45 | 0.99 | 4.08 |
| 250000 | 4.15 | 5.66 | 1.73 | 13.50 |
| 1000000 | 13.65 | 15.86 | 4.41 | 55.05 |
| 4000000 | 55.85 | 56.73 | 18.42 | 227.69 |

| Operations | Scylladb-javascript-driver [s] | Datastax-cassandra-driver [s] | Rust-driver [s] | scylladb-javascript-driver (initial version) [s] |
|---|---|---|---|---|
| 62500 | 2.83 | 2.48 | 1.04 | 1.63 |
| 250000 | 1.91 | 2.91 | 1.56 | 4.09 |
| 1000000 | 4.58 | 4.69 | 3.42 | 15.74 |
| 4000000 | 16.05 | 14.27 | 11.92 | 58.96 |

| Operations | Scylladb-javascript-driver [s] | Datastax-cassandra-driver [s] | Rust-driver [s] | scylladb-javascript-driver (initial version) [s] |
|---|---|---|---|---|
| 62500 | 1.50 | 3.04 | 1.33 | 1.96 |
| 250000 | 2.93 | 4.52 | 1.94 | 4.90 |
| 1000000 | 8.79 | 11.11 | 5.08 | 16.99 |
| 4000000 | 32.99 | 36.62 | 17.90 | 65.74 |

| Operations | Scylladb-javascript-driver [s] | Datastax-cassandra-driver [s] | Rust-driver [s] | scylladb-javascript-driver (initial version) [s] |
|---|---|---|---|---|
| 62500 | 1.42 | 3.09 | 1.25 | 1.45 |
| 250000 | 2.94 | 3.81 | 2.43 | 3.43 |
| 1000000 | 9.19 | 8.98 | 7.21 | 10.82 |
| 4000000 | 33.51 | 28.97 | 25.81 | 40.74 |

And here are the same benchmarks, without the initial driver version.
Here are the results of running the benchmark on Cassandra.
| Operations | Scylladb-javascript-driver [s] | Datastax-cassandra-driver [s] | Rust-driver [s] |
|---|---|---|---|
| 62500 | 2.48 | 14.50 | 1.25 |
| 250000 | 5.82 | 19.93 | 2.00 |
| 1000000 | 19.77 | 19.54 | 5.16 |

| Operations | Scylladb-javascript-driver [s] | Datastax-cassandra-driver [s] | Rust-driver [s] |
|---|---|---|---|
| 62500 | 1.60 | 2.99 | 1.48 |
| 250000 | 3.06 | 4.46 | 2.42 |
| 1000000 | 9.02 | 9.03 | 6.53 |

| Operations | Scylladb-javascript-driver [s] | Datastax-cassandra-driver [s] | Rust-driver [s] |
|---|---|---|---|
| 62500 | 2.32 | 4.03 | 2.11 |
| 250000 | 5.45 | 6.53 | 4.01 |
| 1000000 | 18.77 | 16.20 | 13.21 |

| Operations | Scylladb-javascript-driver [s] | Datastax-cassandra-driver [s] | Rust-driver [s] |
|---|---|---|---|
| 62500 | 1.86 | 4.15 | 1.57 |
| 250000 | 4.24 | 5.41 | 3.36 |
| 1000000 | 13.11 | 14.11 | 10.54 |

The test results across both ScyllaDB and
Cassandra show that the new driver has slightly better performance
on the insert benchmarks. For select benchmarks, it starts ahead
and the performance advantage decreases with time. Despite a series
of optimizations, the majority of the CPU time still comes from
NAPI communication and thread synchronization (according to
internal flamegraph testing). There is still some room for
improvement, which we’re going to explore. Since running those
benchmarks, we introduced changes that improve the performance of
the driver. With those improvements, the performance of the select benchmarks is much closer to the speed of the DataStax driver.
Again…please stay tuned for another blog post with updated results.
Shards and tablets Since the DataStax driver lacked
tablet and
shard support, we were curious if our new shard-aware and
tablet-aware drivers provided a measurable performance gain with
shards and tablets.
| Operations | ScyllaDB JS Driver, Shard-Aware [s] | ScyllaDB JS Driver, No Shards [s] | DataStax Driver, Shard-Aware [s] | DataStax Driver, No Shards [s] | Rust Driver, Shard-Aware [s] | Rust Driver, No Shards [s] |
|---|---|---|---|---|---|---|
| 62,500 | 1.89 | 2.61 | 3.45 | 3.51 | 0.99 | 1.20 |
| 250,000 | 4.15 | 7.61 | 5.66 | 6.14 | 1.73 | 2.30 |
| 1,000,000 | 13.65 | 30.36 | 15.86 | 16.62 | 4.41 | 8.33 |
| 4,000,000 | 55.85 | 134.90 | 56.73 | 77.68 | 18.42 | 42.64 |

| Operations | ScyllaDB JS Driver, Shard-Aware [s] | ScyllaDB JS Driver, No Shards [s] | DataStax Driver, Shard-Aware [s] | DataStax Driver, No Shards [s] | Rust Driver, Shard-Aware [s] | Rust Driver, No Shards [s] |
|---|---|---|---|---|---|---|
| 62,500 | 1.50 | 1.52 | 3.04 | 3.63 | 1.33 | 1.33 |
| 250,000 | 2.93 | 3.29 | 4.52 | 5.09 | 1.94 | 2.02 |
| 1,000,000 | 8.79 | 10.29 | 11.11 | 11.13 | 5.08 | 5.71 |
| 4,000,000 | 32.99 | 38.53 | 36.62 | 39.28 | 17.90 | 20.67 |

In insert benchmarks, there are
noticeable changes across all drivers when there is more than one shard. The Rust driver improved by around 36%, the new driver improved by around 46%, and the DataStax driver improved by only around 10% when compared to the single-shard version. While
sharding provides some performance benefits for the DataStax
driver, which is not shard aware, the new driver benefits
significantly more — achieving performance improvements comparable
to the Rust driver. This shows that the improvement doesn’t come only from introducing more shards; a major part of it is indeed shard-awareness.
| Operations | ScyllaDB JS Driver, No Tablets [s] | ScyllaDB JS Driver, Standard [s] | DataStax Driver, No Tablets [s] | DataStax Driver, Standard [s] | Rust Driver, No Tablets [s] | Rust Driver, Standard [s] |
|---|---|---|---|---|---|---|
| 62,500 | 1.76 | 1.89 | 3.67 | 3.45 | 1.06 | 0.99 |
| 250,000 | 3.91 | 4.15 | 5.65 | 5.66 | 1.59 | 1.73 |
| 1,000,000 | 12.81 | 13.65 | 13.54 | 15.86 | 3.74 | 4.41 |

| Operations | ScyllaDB JS Driver, No Tablets [s] | ScyllaDB JS Driver, Standard [s] | DataStax Driver, No Tablets [s] | DataStax Driver, Standard [s] | Rust Driver, No Tablets [s] | Rust Driver, Standard [s] |
|---|---|---|---|---|---|---|
| 62,500 | 1.46 | 1.50 | 2.92 | 3.04 | 1.33 | 1.33 |
| 250,000 | 2.76 | 2.93 | 4.03 | 4.52 | 1.94 | 1.94 |
| 1,000,000 | 8.36 | 8.79 | 7.68 | 11.11 | 4.84 | 5.08 |

When it comes to
tablets, the new driver and the Rust driver see only minimal
changes to the performance, while the performance of the DataStax
driver drops significantly. This behavior is expected. The DataStax
driver is not aware of the tablets. As a result, it is unable to
communicate directly with the node that will store the data – and
that increases the time spent waiting on network communication.
Interesting things happen, however, when we look at the network
traffic:

| WHAT | TOTAL | CQL | TCP | Total Size |
|---|---|---|---|---|
| New driver, 3 node, all | 412,764 | 112,318 | 300,446 | ∼ 43.7 MB |
| New driver, 3 node, driver ↔ database | 409,678 | 112,318 | 297,360 | – |
| New driver, 3 node, node ↔ node | 3,086 | 0 | 3,086 | – |
| DataStax driver, 3 node, all | 268,037 | 45,052 | 222,985 | ∼ 81.2 MB |
| DataStax driver, 3 node, driver ↔ database | 90,978 | 45,052 | 45,926 | – |
| DataStax driver, 3 node, node ↔ node | 177,059 | 0 | 177,059 | – |
This table shows the number of packets sent during the concurrent
insert benchmark on three-node ScyllaDB with 2 shards per node.
Those results were obtained with RF = 1. While running the database
with such a replication factor is not production-suitable, we
chose it to better visualize the results. When looking at those
numbers, we can draw the following conclusions: The new driver has
a different coalescing mechanism. It has a shorter wait time, which
means it sends more messages to the database and achieves
lower latencies. The new driver knows which node(s) will
store the data. This reduces internal traffic between database
nodes and lets the database serve more traffic with the same
resources. Future plans The goal of this project was to create a
working prototype, which we managed to successfully achieve. It’s
available at
https://github.com/scylladb/nodejs-rs-driver,
but it’s considered experimental at this point. Expect it to change
considerably, with ongoing work and refactors. Some features that were present in the DataStax driver, and that are expected before the driver can be considered deployment-ready, are not yet implemented.
The Drivers team is actively working to add those features. If
you’re interested in this project and would like to contribute,
here’s the
project’s GitHub
repository.
10 February 2026, 2:00 pm by
ScyllaDB
A bug hunt into why disk I/O performance failed to scale on
larger AWS instances The promise of cloud computing is
simple: more resources should equate to better, faster performance.
When scaling up our systems by moving to larger instances, we
naturally expect a proportional increase in capabilities,
especially in critical areas like disk I/O. However, ScyllaDB’s
experience enabling support for the AWS i7i and i7ie instance
families uncovered a puzzling performance bottleneck. Contrary to
expectations, bigger instances simply did not scale their I/O
performance as advertised. This blog post traces the challenging,
multi-faceted investigation into why
IOTune (a disk benchmarking tool shipped with
Seastar) was achieving a
fraction of the advertised disk bandwidth on larger instances. On
these machines, throughput plateaued at a modest
8.5GB/s and IOPS were much lower than expected on
increasingly beefy machines. What followed was a deep dive into the
internals of the ScyllaDB IO Scheduler, where we uncovered subtle
bugs and incorrect assumptions that conspired to constrain
performance scaling. Join us as we investigate the symptoms, pin
down the root cause, and share the hard-fought lessons learned on
this long journey. This blog post is the first in a three-part
series detailing our journey to fully harness the performance of
modern cloud instances. While this piece focuses on the initial set
of bottlenecks within the IO Scheduler, the story continues in two
subsequent posts. Part 2,
The deceptively simple act of
writing to disk, tracks down a mysterious write throughput
degradation we observed in realistic ScyllaDB workloads after
applying the fixes discussed here. Part 3,
Common
performance pitfalls of modern storage I/O, summarizes the
invaluable lessons learned and provides a consolidated list of
performance pitfalls to consider when striving for high-performance
I/O on modern hardware and cloud platforms. Problem description
Some time ago, ScyllaDB decided to support the AWS i7i and i7ie
families. Before we support a new instance type, we run extensive
tests to ensure ScyllaDB squeezes every drop of performance out of
the provisioned hardware. While measuring disk capabilities with
the Seastar IOTune tool, we noticed that the IOPS and bandwidth
numbers didn’t scale well with the size of the instance, and we
ended up with much lower values than AWS advertised. Read IOPS were
on par with AWS specs up to i7i.4xlarge, but they were getting
progressively worse, up to 25% lower than spec on i7i.48xlarge.
Write IOPS were worse, starting at around 25% less than spec for
i7i.4xlarge and up to 42% less on i7i.48xlarge. Bandwidth numbers
were even more interesting. Our IOTune measurements were similar to
fio
up to the i7i.4xlarge instance type. However, as we scaled up the
instance type, our IOTune bandwidth numbers were plateauing at
around
8.5GB/s while
fio was
managing to pull up to 40GB/s throughput for i7i.48xlarge
instances. Essential Toolkit The IOTune tool is a disk benchmarking
tool that ships with Seastar. When you run this tool on a storage
mount point, it outputs 4 values corresponding to the read/write
IOPS and read/write bandwidth of the underlying storage system.
These 4 values end up in a file called io-properties.yaml. When
provided with these values, the Seastar
IO
Scheduler will build a model of the disk, which it will use to
help ScyllaDB maximize the drive’s performance. The IO Scheduler
models the disk based on the IOPS and bandwidth properties using a
formula that looks something like:
read_bw/read_bw_max + write_bw/write_bw_max + read_iops/read_iops_max + write_iops/write_iops_max <= 1
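For illustration, with made-up numbers: if the disk maxima are 8 GB/s of read bandwidth and 1M read IOPS, a workload pushing 4 GB/s of reads at 500k read IOPS already consumes 4/8 + 500k/1M = 0.5 + 0.5 = 1.0 of the budget, so anything beyond that gets throttled.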
The internal mechanics of how the IO Scheduler works are described very thoroughly in the
blog post I linked above. The io_tester tool is another utility
within the Seastar framework. It’s used for testing and profiling
I/O performance, often in more controlled and customizable
scenarios than the automated IOTune. It allows users to simulate
specific I/O workloads (e.g., sequential vs. random, various
request sizes, and concurrency levels) and measure resulting
metrics like throughput and latency. It is particularly useful for:
Deep-dive analysis: Running experiments with fine-grained control
over I/O parameters (e.g.,
--io-latency-goal, request
size, parallelism) to isolate performance characteristics or
potential bottlenecks. Regression testing: Verifying that changes
to the IO Scheduler or underlying storage stack do not negatively
impact I/O performance under various conditions. Fair Queue
experimentation: As shown in this investigation, io_tester can be
used to observe the relationship between configured workload
parameters, the resulting in-disk queue lengths, and the throttling
behavior of the IO Scheduler. What this meant for ScyllaDB We
didn’t want to enable i7i instances if the IOTune numbers didn’t
accurately reflect the underlying disk performance of the instance
type. Lower io-properties numbers cause the IO Scheduler to
overestimate the cost of each request. This leads to more
throttling, making monstrous instances like i7i.48xlarge perform
like much cheaper alternatives (such as the i7i.4xlarge, for
example). Pinning the symptoms Early on, we noticed that the
observed symptoms pointed to two different problems. This helped us
narrow down the root causes much faster (well, fast here is a very
misleading term). We were chasing a lower-than-expected IOPS issue
and a different low-bandwidth issue. IOPS and bandwidth numbers
were behaving differently when scaling up instances. The former was
scaling, but with much lower values than we expected. The latter
would just plateau from one point and stay there, no matter how
much money you’d throw at the problem. We started with the
hypothesis that IOTune might misdetect the disk’s physical block
size from sysfs and that we issue requests with a different size
than what the disk “likes,” leading to lower IOPS. After some
debugging, we confirmed that IOTune indeed failed to detect the
block size, so it defaulted to using requests of 512bytes. There’s
no bug to fix on the IOTune side here, but we decided we needed to
be able to specify the disk block size for reads and writes
independently when measuring. This turned out to be quite helpful
later on. With 4K requests, we were able to measure the expected
~1M IOPS for writes compared to the ~650k IOPS we were getting with
the autodetected 512-byte requests (numbers relevant for the
i7i.12xlarge instance). We had a fix for the IOPS issue, but – as
we discovered later – we didn’t properly understand the actual root
cause. At that point, we thought the problem was specific to this
instance type and caused by IOTune misdetecting the block size. As
you’ll see in the next blog post in the series, the root cause is a
lot more interesting and complicated. The plateauing bandwidth
issue was still on the table. Unfortunately, we had no clue about
what could be going on. So, we started exploring the problem space,
concentrating our efforts as you’d imagine any engineer would.
Blaming the IO Scheduler We dug around, trying to see if IOTune
became CPU-limited for the bandwidth measurements. But that wasn’t
it. It’s somewhat amusing that our initial reaction was to point
the finger at the IO Scheduler. This bias stems from when the IO
Scheduler was first introduced in ScyllaDB. It had such a profound
impact that numerous performance issues over time – things that
were propagating downward to the storage team – were often (and
sometimes unfairly) attributed to it. Understanding the root cause
We went through a series of experiments to try to narrow down the
problem further and hopefully get a better understanding of what
was happening. Most of the experiments in this article, unless
explicitly specified, were run on an i7i.12xlarge instance. The
expected throughput was ~9.6GB/s while IOTune was measuring a write
throughput of 8.5GB/s. To rule out poor disk queue utilization, we
ran fio with various iodepths and block sizes, then recorded the
bandwidth.
We
noticed that the request needs to be ~4MB to fill the disk queue.
Next, we collected the same for io_tester with
--io-latency-goal=1000 to prevent the queue from splitting requests.
A larger latency goal means the scheduler can be more relaxed and
submit the requests as they come because it has plenty of time
(1000 ms) to complete each request in time. If the goal is smaller,
the IO Scheduler gets stressed because it needs to make each
request complete in that tight schedule. Sometimes it might just
split a request in half to take advantage of the in-disk
parallelism and hopefully make the original request fit the tight
latency goal.
The
fio tool seemed to be pulling the full bandwidth from the disk, but
our io_tester tool was not. The issue was definitely on our side.
The good news was that both io_tester and IOTune measured similar
write throughputs, so we weren’t chasing a bug in our measurement
tools. The conclusion of this experiment was that we saturated the
disk queue properly, but we still got low bandwidth. Next, we
pulled an ace out of the sleeve. A few months before this, we were
at a hackathon during our engineering summit. During that
hackathon, our Storage and Networking team built a prototype
Seastar IO Scheduler controller that would bring more transparency
and visibility into how the IO Scheduler works. One of the patches
from that project was a hack that would make the IO Scheduler drop
a lot of the IOPS/bandwidth throttling logic and just drain like
crazy whatever requests are queued. We applied that patch to
Seastar and ran the IOTune tool again. It was very rewarding to see
the following output: Measuring sequential write bandwidth: 9775
MB/s (deviation 6%) Measuring sequential read bandwidth: 13617 MB/s
(deviation 22%) The bandwidth numbers escaped the
8.5GB/s limit that was previously constraining our
measurements. This meant we were correct in blaming the IO
Scheduler. We were indeed experiencing throttling from the
scheduler, specifically from something in the fair queue math. At
that point, we needed to look more closely at the low-level
behavior. We patched Seastar with another home-brewed
project
that adds a low-overhead binary tracer to the IO Scheduler. The
plan was to run the tracer on both the master version and the one
with the hackathon patch applied – then try to understand why the
hackathon-patched scheduler performs better. We added a few traces
and we immediately started to see patterns like these in the slow
master trace:
Here
it took it
134-51=83us to dispatch one request.
The “Q” event is when a request arrives at the scheduler and gets
queued. “D” stands for when a request gets dispatched. For
reference, the patched IO scheduler spent
1us to
dispatch a request. The unexpected behavior suggested an issue with
the
token bucket accumulation, as requests should be dispatched
instantly when running without io-properties.yaml (effectively
providing unlimited tokens). This is precisely the scenario when
IOTune is running: it withholds io-properties.yaml from the IO
Scheduler. This allows the token bucket to operate with unlimited
tokens, stressing the disk to its maximum potential so IOTune can
compute, by itself, the required io-properties.yaml. The token
bucket seems to run out of tokens…but why? When the token bucket
runs out of tokens, it needs to wait for tokens to be replenished
when other requests are completed. This delays the dispatch of the
next request. That’s why the above request waited
83us to get dispatched when it should have
actually been dispatched in
1us. There wasn’t much
more we could do with the event tracer. We needed to get closer to
the fair queue math. We returned to io_tester to examine the
relationship between the parallelism of the test and the size of
the in-disk queues. We ran io_tester for requests sized within
[128k, 1MB] with parallelism within [1,2,4,8,16] fibers. We ran it
once for the master branch (slow) and once for the “hackathon”
branch (fast). Here are some plots from these results. The plots
are throughput (vertical axis) against parallelism (horizontal) for
two request sizes,
1MB and
128kB.
For
both request sizes, the “hackathon” branch outperformed the
“master” branch. Also, the 1MB request saturates the disk with much
lower parallelism than the 128k request. No surprises here, the
result wasn’t that valuable. In a follow-up test, we collected the
in-disk latencies as well. We plotted throughput against
parallelism for both the master and hackathon branches. The lines
crossing the bars represent the in-disk latencies measured.
This is already much better. After the disk is saturated,
increasing parallelism should create a proportional increase for
in-disk latency. That’s exactly what happens for the hackathon
branch. We couldn’t say the same about the master branch. Here, the
throughput plateaued around 4 fibers, and the in-disk latency
didn’t grow! For some reason, we didn’t end up stressing this disk.
To investigate further, we wanted to see the size of the actual
in-disk queues. So, we coded up a patch to make io_tester output
this information. We plotted the in-disk queue size alongside
parallelism for various request sizes. At this point, it became
clear that we weren’t sufficiently leveraging the in-disk
parallelism. Likely, the fair_queue math was making the IO
Scheduler throttle requests excessively. This is indeed what the
plots below show. In the master (slow) branch run, the in-disk
queue length for the 1MB request (which saturates the disk faster)
plateaus at around 4 requests once parallelism=4 and higher. That’s
definitely not alright.
Just
for fun, let’s look at
Little’s Law in
action. We plotted
disk_queue_length / latency for
each branch as follows.
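For reference, Little’s Law ties those three quantities together: L = λ × W, hence λ = L / W, where L is the average number of requests in the disk queue, W is the average in-disk latency, and λ is the resulting throughput. Plotting disk_queue_length / latency is therefore effectively plotting the IOPS each branch actually pulls from the disk.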
Next,
we wanted to (somehow) replicate this behavior without involving an
actual disk. This way, we could maybe create a regression test for
the IO Scheduler. The Seastar
ioinfo tool was
perfect for this job. i
oinfo can take an
io-properties.yaml file as an argument. It feeds the values to the
IO Scheduler, then the tool outputs the token bucket parameters
(which can be used to calculate the theoretical IOPS and throughput
values that the IO Scheduler can achieve). Our goal was to compare
these calculated values with what was configured in an
io-properties.yaml file and make sure the IO Scheduler could
deliver very close to what it was configured for. For reference,
here’s how the calculated IOPS/bandwidth looked compared to the
configured values.
The
values returned by the scheduler were within a 5% margin of the
configured one. This was fantastic news (in a way). It meant the
fair_queue math behaves correctly even with bandwidths above
8.xGB/s. We didn’t get the regression test we
hoped for, since the fair_queue math was
not
causing the throttling and disk underutilization we’d seen in the
previous experiment. However, we did add a test that would check if
this behavior changes in the future. We did get a huge win from
this, though. We came to the conclusion that something must be
wrong with the fair_queue math or something in the IO Scheduler
must be incorrect only when it’s not configured with an
io-properties file. At that point, the problem space narrowed
significantly. Playing around with the inputs from the
io-properties.yaml file, we uncovered yet another bug. For large
enough read IOPS/bandwidth numbers in the config file, the IO
Scheduler would report request costs of zero. After many
discussions, we learned that this is not really a bug. It’s how the
math should behave. With big io-properties numbers, the math should
plateau the costs at 0. It makes sense: the more resources you have
available, the single unit of effort becomes less significant. This
led us to an important realization: the unconfigured case (our
original issue) should also produce a cost of zero. A zero cost
means that the token bucket won’t consume any tokens. That gives us
unbounded output…which is exactly what IOTune wants. Now we needed
to figure out two things: Why doesn’t the IO Scheduler report a
cost of zero for the unconfigured case? In theory, it should. In
the issue linked above, costs became zero for values that weren’t
even close to UINT64_MAX. Was our code prepared to handle costs of
zero? We should ensure we don’t end up with weird overflows or
divisions by zero or any undefined behavior from code that assumes
costs can’t be zero. When things start to converge At this point,
we had no further leads, so we thought there must be something
wrong with the fair queue math. I reviewed the math from
Implementing a New IO Scheduler Algorithm for Mixed Read/Write
Workloads, but I didn’t find any obvious flaws that could
explain our unconfigured case. We hoped we’d find some formula
mistakes that made the bandwidth hit its theoretical limit at
8.5GB/s. We didn’t find any obvious issues here,
so we concluded there must be some flaw in the implementation of
the math itself. We started suspecting that there must be some
overflow that ends up shrinking the bandwidth numbers. After quite
some time tracking the math implementation in the code, we managed
to find the issue. Two internal IO Scheduler variables that were
storing the IOPS and bandwidth values configured via
io-properties.yaml had a default value set to
`std::numeric_limits<int>::max()`. It wasn’t that intuitive
to figure out – the variables weren’t holding the actual
io-properties values, but rather some values that derived from
them. This made the mistake harder to spot. There is some code that
recalculates those variables when the io-properties.yaml file is
provided and parsed by the Seastar code. However, in the
“unconfigured” case, those code paths are intentionally not hit.
So, the INT_MAX values were carried into the fair queue math,
crunched into the formulas, and resulted in the
8.xGB/s throughput limit we kept seeing. The fix
was as simple as changing the default value to
`std::numeric_limits<uint64_t>::max()`.
A
one-line fix for many weeks of work. It’s been a crazy
long journey chasing such a small bug, but it has been an
invaluable (and fun!) learning experience. It led to lots of
performance gains and enabled ScyllaDB to support highly efficient
storage instances like i7i, i7ie and i8g. Stay tuned for the next
episode in this series of blog posts. In Part 2, we will uncover that the performance gains after this work weren’t quite what we were expecting on realistic workloads. We will dive deep into some very dark corners of modern NVMe drives and filesystems to unlock a
significant chunk of write throughput.
4 February 2026, 3:24 pm by
ScyllaDB
When not to worry about scale, when to rearchitect
everything and why passionate criticism is a win
“There’s no funner game than the at-scale technology game. But
if you play it, some people will hate you for it. That’s
okay…that’s the game you chose to play.” – Adam Jacob At
Monster Scale Summit 2025,
Rachel Stephens,
research director at RedMonk, spoke with
Adam Jacob, co-founder
of Chef and CEO of System Initiative, about what it really means to
build and operate software at scale. Note: Monster SCALE Summit
2026 will go live March 11-12, featuring antirez, creator of Redis;
Camille Fournier, author of “The Manager’s Path” and “Platform
Engineering”; Martin Kleppmann, author of “Designing Data-Intensive
Applications” and more than 50 others. The event is free and
virtual.
Register for free
and join the community for some lively chats. The Existential
Question of Scale Stephens opened with an existential question:
“Does your software exist if your users can’t run it?” Yes, your
code still exists in GitHub even if us-east-1 goes down. But what
if … Your system crawls under load. Critical integrations
constantly break. You can’t afford the infrastructure costs.
“Software at scale isn’t just about throughput,” Stephens said.
“It’s about making sure that your code endures, adapts and remains
accessible no matter the load and location of where you’re running.
Because if your users can’t use it, your software may as well not
exist.” With that framing, Stephens brought in someone who’s spent
his career dealing with scale firsthand: Adam Jacob. Only Scale
When It Hurts Stephens asked Jacob how teams can balance quality,
speed and scale under uncertainty. How do you avoid both cutting
corners and premature optimization? Jacob argues that early on,
it’s fine not to worry much about scale. Most products fail for
other reasons before scalability ever becomes a problem. He
explained: “I think of it basically through the lens of
optionality. When you start building new things, it’s nice not to
worry too much about scale, because you may never reach it. Most
products don’t fail because they fail to scale. Think about how
badly Twitter failed to scale … and yet here we are.” The first
priority is to build a solid product. Once scale becomes a real
issue, that’s when it makes sense to refactor and remove
bottlenecks. But if you’ve been around the block a little, your
experience helps you make early choices that pay off later. Jacob
noted, “Premature optimization is real. But as you gain experience,
there are some decisions you make early because you know that if
things work out, you’ll be happier later — like factoring your code
so it can be broken apart across network boundaries over time, if
you need to.” Chef Scalability Horror Stories Next, Stephens asked
Jacob if he would share a scaling horror story from his Chef days.
Jacob obliged and offered two memorable ones. “The best was when we
launched the first version of Hosted Chef. The day before the
launch, we discovered it took about a minute and a half to create a
new user. It didn’t take that long when we were running it on a
laptop, but it did later … and we never really tested it. So, in
the final hours before launch, we changed it from ‘anyone can sign
up’ to a queue system with a little space robot saying, ‘Demand is
so high; we’ll get back to you.’ We just papered over the
scalability problem.” “Another example: that same Chef server (the
one that couldn’t create accounts quickly) eventually had to work
at Facebook. The original version was written in Rails, which was
great to work with, but not parallel enough. At Facebook scale, you
might have 40,000 or 50,000 things pointed at one Chef server. So
we rebuilt it in Erlang, which is great for that kind of problem. I
literally brought the Erlang version to Facebook on a USB stick.
When we installed it and bootstrapped a data center, we thought it
was broken because it was using less compute and finished almost
instantly.” Jacob explained that if they’d tried to build the Chef
server in Erlang from the start, the project probably wouldn’t have
gained traction. Starting in Rails made it possible to get Chef out
into the world and learn what the system really needed to do. Only
later, once they understood how the system really behaved, could
they rebuild it with the right architecture and runtime for scale.
Growth or Efficiency: Know Which Game You’re Playing At Chef,
scaling was ultimately required to land customers like Facebook and
JPMorgan Chase, which operate at massive scale. Jacob advised,
“Making it scale required major investment, but it worked. You
can’t buy your core. If it matters to customers, you have to build
it yourself. People often wait too long to realize they have a deep
architectural problem that’s also a business problem. Rebuilding
for scale takes months, so you have to start early.” Your own
approach to scale should ultimately be driven by what game you’re
playing: In the venture-capital game, growth and traction come
first. You can spend money to scale faster because you’re funded.
In the profitability game, efficiency comes first. Overspending on
compute or poor architecture hits the bottom line hard. Why Scaling
is the ‘Funnest’ Game Stephens mentioned that “when software
succeeds, it stops being yours – it becomes everyone’s.” She then
asked Jacob what it’s like when your tech scales to the point that
people have extremely strong opinions about it. His response: “It’s
hard to build things that people care about. If you’re lucky enough
to create something you love and share it with the world and people
love it back, that’s incredibly rewarding. Even when they don’t,
that’s still a gift.” “Someone once tapped me in a coffee shop and
said, ‘You wrote Chef? I hate Chef.’ I said, ‘I’m sorry; I didn’t
write it to hurt you.’ But at scale, that means he used it. It
mattered in his life. And that’s what you want: for people to
experience what you built.” “I love the technology, the problem,
the difficulty. Scaling adds more layers of complexity, more layers
of fun. There’s no funner game than the at-scale technology game.
But if you play it, some people will hate you for it. That’s
okay…that’s the game you chose to play.” You can watch the full
talk below.
28 January 2026, 1:46 pm by
ScyllaDB
Explore isolation mechanisms and prioritization strategies
that allow different database workloads to coexist without resource
contention issues Analytics (OLAP) and real-time (OLTP)
workloads serve distinctly different purposes. OLAP (online
analytical processing) is optimized for data analysis and
reporting, while OLTP (online transaction processing) is optimized
for real-time low-latency traffic. Most databases are designed to
primarily benefit from either OLAP or OLTP, but not both. Worse,
concurrently running both workloads under the same data store will
frequently introduce resource contention. The workloads end up
hurting each other, considerably dragging down the overall
distributed system’s performance. Let’s look at how this problem
arises, then consider a few ways to address it. OLTP vs OLAP
Databases There are basically two fundamental approaches to how databases store data on disk. We have row-oriented databases, often used for real-time workloads. These store all the data pertaining to a single row together on disk.
[Diagrams: row-oriented storage (ideal for OLTP) vs. column-oriented storage (ideal for OLAP)]
On the
other side of the spectrum, we have column-oriented databases,
which are often used for running analytics. These databases store
data vertically, column by column (versus the horizontal, row-by-row layout). This single design decision makes it much easier and more efficient for the database to run aggregations, perform calculations, and retrieve insights such as
"Top K"
metrics. OLTP vs. OLAP Workloads So the general consensus is
that if you want to run OLTP workloads, you use a row-oriented
database – and you use a columnar one for your analytics workloads.
However, contrary to popular belief, there are a variety of reasons
why people might actually want to run an OLAP workload on top of
their real-time databases. For example, this might be a good option
when organizations want to avoid data duplication or the complexity
and overhead associated with maintaining two data stores. Or maybe
they don’t extract insights all that often. The Latency Problem But
problems can arise when you try to bring OLAP to your real-time
database. We’ve studied this a lot with
ScyllaDB, a specialized NoSQL
database that’s primarily meant for high throughput and low-latency
real-time workloads. The following graphic from ScyllaDB monitoring
demonstrates what happens to latency when you try to run OLAP and
OLTP workloads alongside one another. The green line represents a
real-time workload, whereas the yellow one represents an analytics
job that’s running at the same time. While the OLTP workload is
running on its own, latencies are great. But as soon as the OLAP
workload starts, the real-time latencies dramatically rise to
unacceptable levels. The Throughput Problem Throughput is also an
issue in such scenarios. Looking at the throughput clarifies why
latencies climbed: The analytics process is consuming much higher
throughput than the OLTP one. You can even see that the real-time
throughput drops, which is a sign that the database got overloaded.
Unsurprisingly, as soon as the OLAP job finishes, the real-time
throughput increases and the database can then process its backlog
of queued requests from that workload. That’s how the contention
plays out in the database when you have two totally different
workloads competing for resources in an uncoordinated way. The
database is naively trying to process requests as they come in.
When Things Get Contentious But why does this contention happen in
the first place? If you overwhelm your database with too many
requests, it cannot keep up. Usually, that’s because your database
lacks either the CPU or I/O capacity that’s required to fulfill
your requests. As a result, requests queue up and latency climbs.
The workloads contribute to contention too. OLTP applications often
process many smaller transactions and are very latency sensitive.
OLAP workloads, by contrast, generally run fewer transactions, but each one scans and processes large amounts of data. So hopefully
that explains the problem. But how do we actually solve it? Option
A: Physical Isolation One option is to physically isolate these
resources. For example, in a Cassandra deployment, you would simply
add a new data center and separate your real-time processing from
your analytics. This saves you from having to stream data and work
with a different database. However, it considerably elevates your
costs. Some specific examples of this strategy: Instaclustr, a
managed services provider,
shared a benchmark after isolating its deployments (Apache
Spark and Apache Cassandra). GumGum shared the results of this
approach (with multiregion Cassandra) at a past
Cassandra Summit.
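To make Option A more concrete, here is a minimal sketch of the keyspace side of such a setup (keyspace and data center names are illustrative, not taken from the examples above): a keyspace replicated into a dedicated analytics data center in Cassandra or ScyllaDB, so heavy scans can be directed away from the real-time nodes.
CREATE KEYSPACE orders
  WITH replication = {
    'class': 'NetworkTopologyStrategy',
    'dc_realtime': 3,   -- serves the OLTP traffic
    'dc_analytics': 2   -- isolated data center for Spark/OLAP jobs
  };
Analytics clients (a Spark job, for example) would then connect only to dc_analytics, typically with a DC-aware load balancing policy and a LOCAL_* consistency level, so full scans never compete with the real-time data center for resources.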
There are definitely use cases and organizations running OLAP on
top of real-time databases. But are there any other alternatives to
resolve the problem altogether? Option B: Scheduled Isolation Other
teams take a different approach: They avoid running their OLAP
workloads during peak periods. They simply run their analytics pipelines during off-peak hours to mitigate the impact on latencies. For example, consider a food delivery company. Answering a question like, “How much did this merchant sell within the past
week?” is simple in OLTP. However, offering discounts to 10
top-selling restaurants within a given region is much more
complicated. In a wide-column database like Cassandra or ScyllaDB,
it inevitably requires a full table scan. Therefore, it would make
sense for such a company to run these analytics from after midnight
until around 10 a.m. – before its peak traffic hours. This is a
doable strategy, but it still doesn’t solve the problem. For
example, what if your dataset doubles or triples? Your pipeline
might overrun your time window. And you have to consider that your
business is still running at that time (people will still order
food at 2 a.m.). If you take this approach, you still need to tune
your analytics job and ensure it doesn’t kill your database. Option
C: Workload Prioritization ScyllaDB has developed an approach
called Workload Prioritization to address this problem. It lets
users define separate workloads and assign different resource
shares to them. For example, you might define two service levels:
The main one has 600 shares, and the secondary one has 200 shares.
CREATE SERVICE LEVEL main WITH shares = 600;
CREATE SERVICE LEVEL secondary WITH shares = 200;
ScyllaDB's internal scheduler will then process three times more tasks from the main workload than from the secondary one. Whenever the system is under contention, it prioritizes resource allocation accordingly. Why does this
kick in only during contention? Because if there’s no contention,
it means there is no bottleneck, so there is effectively nothing to
prioritize. [
Play with
an interactive animation] Workload Prioritization Under the
Hood Under the hood, ScyllaDB’s Workload Prioritization relies on
Seastar scheduling groups. Seastar is a C++ framework for
data-intensive applications. ScyllaDB, Redpanda, Ceph’s SeaStore
and other technologies are built on top of it. Scheduling groups
are effectively the way Seastar allows background operations to
have little impact on foreground activities. For example, in
ScyllaDB terms, there are several different scheduling groups within the database: a distinct group for compactions, streaming, memtables, and so on. With
Cassandra, you might end up in a situation where compactions impact
your workload performance. But in ScyllaDB, all compaction
resources are scheduled by Seastar. Based on the shares assigned to each group, the database allocates a proportional amount of resources to the background activity (compaction, in this case), ensuring that the latency of the primary user-facing workload doesn't suffer. Using scheduling groups in this way also
helps the database auto-tune. If the user workload is light, as during off-peak hours, the system automatically has more spare compute and I/O cycles to spend. The database will simply
speed up its background activities. Here’s a guided tour of how
Workload Prioritization actually plays out: OLTP and OLAP
Can Coexist Running OLAP alongside OLTP inevitably
involves anticipating and managing contention. You can control it
in a few ways: isolate analytics to its own cluster, run it in
off-peak windows, or enforce workload prioritization. And workload
prioritization isn’t just for letting OLAP coexist with your OLTP.
That same approach could also be used to assign different
priorities to reads vs. writes, for example. If you’d like to learn
more, take a look at my recent tech talk on this topic: “How to
Balance Multiple Workloads in a Cluster.”
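As a closing reference, here is a minimal sketch of how the service levels from the example above might be attached to separate application roles (role names and passwords are illustrative, and the exact statements can vary by ScyllaDB version):
-- One role per workload; each application authenticates with its own role
CREATE ROLE oltp_app WITH LOGIN = true AND PASSWORD = 'oltp-secret';
CREATE ROLE olap_app WITH LOGIN = true AND PASSWORD = 'olap-secret';
-- Reuse the service levels defined earlier (600 vs. 200 shares)
ATTACH SERVICE LEVEL main TO oltp_app;
ATTACH SERVICE LEVEL secondary TO olap_app;
The same pattern can be used to split reads from writes: give the writer application and the reader application their own roles and attach a different service level to each. ScyllaDB's scheduler then applies the corresponding shares whenever the two workloads contend for CPU or I/O.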
22 January 2026, 4:08 pm by
ScyllaDB
How ScyllaDB performs on the new I8g and I8ge instances,
across different workload types Let’s start with the
bottom line. For ScyllaDB, the new Graviton4-based i8g instances
improve i4i throughput by up to 2x with better latency – and the
i8ge improves i3en throughput by up to 2x with better latency.
Benchmarks also show single-digit millisecond latency during
maintenance operations like scaling. Fast and smooth scaling is an
important part of
the new ScyllaDB X Cloud offering. The chart below shows
ScyllaDB max throughput under a 10 ms latency SLA for different workloads, for the old i4i and i3en and the new i8g and i8ge.
AWS
recently launched the I8g and I8ge storage-optimized EC2
instances powered by
AWS Graviton4 CPUs and
3rd-generation AWS Nitro SSDs. They’re designed
for I/O-intensive workloads like real-time databases, search, and
analytics (so a nice fit for ScyllaDB).
Instance family | Use case | Number of vCPUs per instance | Storage
i8g | Compute bound | 2 to 96 | 0.5 to 22.5 TB
i8ge | Storage bound | 2 to 192 | 1.25 to 120 TB
Reduced TCO in ScyllaDB Cloud Based on our performance
results, ScyllaDB users migrating to Graviton4 can
reduce
infrastructure requirements by up to 50% compared to the previous-generation i4i and i3en instances. This translates into significantly
lower total cost of ownership (TCO) by requiring fewer nodes to
sustain the same workload. These improvements stem from a few
factors – both in the new instances themselves, and in the match
between ScyllaDB and these instances. The new I8g architecture
features: vCPU-to-core mapping: On x86, each vCPU is half a physical core (a hyperthread); on i8g (ARM), each vCPU maps to a full physical core. Larger caches: 64 KB instruction cache and 64 KB data cache, compared to 32/48 KB on Intel (shared between the two hyperthreads). Faster storage and networking (see the specs above). In
addition, ScyllaDB’s design allows it to take full advantage of the
new server types: The shard-per-core architecture scales linearly to any number of cores. The I/O scheduler can take full advantage of the 3rd-generation AWS Nitro SSDs, fully utilizing their higher I/O rates and lower latency without overloading them. ARM's relaxed memory model suits
Seastar applications. Since locks and
fences are rare, the memory subsystem has more opportunities to
reorder memory accesses to optimize performance. What this means
for you I8g and i8ge
are now available on ScyllaDB Cloud. If you’re running ScyllaDB
Cloud, the net impact is:
Compute-bound workloads:
Move from I4i to I8g. This should provide up to 2x throughput at
the same ScyllaDB Cloud price.
Storage-bound
workloads: Move from I3en to I8ge. Here, you should expect
up to 2x higher throughput at the same ScyllaDB Cloud price. Note
that using the new ScyllaDB dictionary-based compression can lower
the storage cost further. For both use cases, ScyllaDB can keep the
10ms P99 latency SLA during maintenance operations, including
scaling out and scaling down. What we measured
Max Throughput: The maximum requests per second the database can handle.
Max Throughput under SLA: The maximum requests per second under a P99 latency of 10 ms. Only throughput with latency below this SLA counts. This throughput can be sustained under any operation, like scaling and repair. This is the number you should use when sizing your ScyllaDB database on i8g instances.
P99 Latency: Measures the p99
latency for the Max Throughput under SLA.
Results
Read Workload – cached data
Cached data: working set size < available RAM, resulting in close to 100% cache hit rate.
Instance type | Max throughput | Max throughput under latency SLA | Improvement | P99 latency (ms)
i4i.4xlarge | 1,062,578 | 750,000 | 100% | 7.84
i8g.4xlarge | 1,434,215 | 1,300,000 | 135% | 6.29
i3en.3xlarge | 585,975 | 550,000 | 100% | 4.37
i8ge.3xlarge | 962,504 | 800,000 | 164% | 6.38
Read Workload – non-cached data, storage only
Non-cached data: working set size >> available RAM, resulting in 0% cache hit rate. When most of the data is not cached, storage becomes a significant factor for performance.
Instance type | Max throughput | Max throughput under latency SLA | Improvement | P99 latency (ms)
i4i.4xlarge | 218,674 | 210,000 | 100% | 4.56
i8g.4xlarge | 444,548 | 300,000 | 203% | 4.24
i3en.3xlarge | 145,702 | 140,000 | 100% | 6.83
i8ge.3xlarge | 259,693 | 255,000 | 178% | 7.95
Write Workload
Instance type | Max throughput | Max throughput under latency SLA | Improvement | P99 latency (ms)
i4i.4xlarge | 289,154 | 150,000 | 100% | 2.4
i8g.4xlarge | 689,474 | 600,000 | 238% | 4.02
i3en.3xlarge | 217,072 | 200,000 | 100% | 5.42
i8ge.3xlarge | 452,968 | 400,000 | 209% | 3.41
Tests under maintenance
operations ScyllaDB takes pride in testing under realistic use
cases, including scaling out and in, repair, backups, and various
failure tests. The following results show the average P99 latency (across all nodes) during different maintenance operations on a
3-node cluster of i8ge.3xlarge. It’s using the same setup as above.
Setup ScyllaDB version: 2025.3.1-20250907.2bbf3cf669bb DB node
amount: 3 DB instance types: i8ge.3xlarge Loader node amount: 4
Loader instance type: c5.2xlarge Throughput: Read 41K, write 81K,
Mixed 35K Results
Read Test: Read Latency
Operation Read P99 latency in ms Base: Steady State 0.95 During
Repair 4.92 During Add Node (out scale) 2.68 During Replace Node
3.10 During Decommission Node (downscale) 2.44
Write Test: Write Latency Operation Write
P99 latency in ms Steady State 2.22 During Repair 3.24 Add Node
(scale out) 2.49 Replace Node 3.07 Decommission Node (downscale)
2.37
Mixed Test: Write and Read
Latency Operation Write P99 Latency in ms Read P99
Latency in ms Steady state 2.03 2.11 During Repair 3.21 4.70 Add
Node (scale out) 2.19 2.71 Replace Node 3.00 3.37 Decommission Node
(downscale) 2.20 3.05 The results indicate that ScyllaDB can
meet the latency SLA under maintenance operations. This is critical
for ScyllaDB Cloud, and in particular ScyllaDB X Cloud, where
scaling out and in is automatic and can happen multiple
times per day. It’s also critical in unexpected failure cases, when
a node must be replaced rapidly without hurting availability or
the latency SLA.
Test Setup
ScyllaDB cluster: 3-node cluster; i4i.4xlarge vs. i8g.4xlarge, and i3en.3xlarge vs. i8ge.3xlarge
Loaders: 4 loader nodes, instance type c7i.8xlarge
Workload: Replication Factor (RF) 3, Consistency Level (CL) Quorum
Data size: 650 GB for read/mixed, 1.5 TB for write
21 January 2026, 3:36 pm by
ScyllaDB
Alan and Dor chat about high-performance databases & AI
trends Everything about
re:Invent
2025 screamed “massive” – from the exhibit hall’s towering
booths, to the overflowing keynotes, to product announcements at
every turn. ScyllaDB’s “scale fearlessly” message fit in perfectly.
See
ScyllaDB’s re:Invent videos But despite the crowds and chaos,
Alan Shimel (founder and CEO of Techstrong Group) and Dor Laor
(ScyllaDB co-founder and CEO) found a way to meet for a laid-back
chat. Topics ranged from ScyllaDB’s origin story, to OSS, to
ScyllaDB’s latest announcements for AI and extreme database
elasticity. Read highlights below, or enjoy the full interview:
ScyllaDB AI Use Cases: Vector Scale, Feature Store, AI Stack
Alan: What’s it like on the re:Invent floor? What
are the conversations like? What are you hearing?
Dor: There’s certainly no shortage of crowds at
the booth. A lot of the conversation is about AI. We’re seeing a
surge in AI-related use cases. At this point,
about half of the
use cases we see with ScyllaDB are directly related to AI.
Alan: Explain that to me. What’s the use case?
Dor: We usually split
our AI use cases into
three categories. The first is being part of the AI stack itself.
During training and serving, the stack needs to access a huge
number of objects, and it needs a fast database to do that. In this
case, we’re part of the core AI stack. It’s distributed databases
handling very high workloads – and these are very high workloads.
That can be for large LLM companies, or for much smaller companies
that are just starting their AI journey. That’s the first category.
The second category is the feature store. Feature stores are more
traditionally associated with machine learning, but they’re still
part of the AI world. A feature store lets people classify users,
or sometimes agents, automatically. That can be used for
recommendations in e-commerce, fraud detection, and a variety of
other use cases. In those cases, the feature store needs a fast
database to quickly determine how a user is classified and what’s
appropriate for them – what they might want to watch, what ad they
should see, and so on. The third category is vector search for
running LLMs on private datasets. That’s where RAG comes in, with
vector data. We added vector search ourselves, and we’re already
seeing a lot of interest. In January, we’ll be going live with the
general availability of our RAG and vector store.
Alan: So in essence, they could use ScyllaDB as
their vector database. They’re creating small language models or
RAG. That’s got to be big…that’s fantastic.
Dor:
Our vector search is the most scalable.
We can easily run models with a billion objects. Very few
vendors can even reach a billion. We can do that while handling
hundreds of thousands of requests per second, so we scale to very
high numbers. And for people with lower or medium demand – which is
most users, with models around 10 million or 100 million objects –
we can deliver the best latency at very low price points.
Alan: That’s fantastic. Look, there are a lot of
people saying we’ve scraped everything there is to scrape for these
LLMs. That continuing to make generative AI better by just
increasing model size or training data is starting to hit
diminishing returns. The thinking is that the way forward might be
smaller language models, more RAG. Some people even argue we should
move away from ML altogether and toward things like world models.
But I definitely believe there’s going to be a lot of activity in
the SLM and RAG space. And beyond that, as we build AI for specific
use cases, I don’t need the whole internet. I just need the data
that matters for that use case – especially if it’s my own
proprietary information. I don’t want to put that out there. I want
it right here. So I think that’s a huge business. Congratulations.
Dor: Thanks. It’s market demand. It’s not just an
opportunity, it’s also a defensive move. If we don’t do it,
customers will go elsewhere, to be frank. People now expect the
same ease of use they get from LLMs on public internet data when
they come to any vendor. They want to ask questions in free text,
in a single line, and immediately get the best results – without
digging through a complicated UI. That’s the power of LLMs. And
sometimes it won’t even be people doing that. It could be agents
that come in, automate things, and issue those queries on their
behalf. True Database Elasticity: Scaling Out and In, Fast
Alan: All right, let’s fast-forward past AWS for a
second. You have some new announcements coming soon. Share a bit,
if you don’t mind.
Dor: Thank you for the
opportunity. We’re moving from beta to general availability with
ScyllaDB X
Cloud, our managed platform. This is the new generation of our
core database, delivered as a database-as-a-service, including
management and consumption. The unique thing here is our new core
architecture, which we call tablets. It’s way more elastic than any
other database – or even infrastructure – out there. Before this,
we were okay in terms of scaling clusters out and scaling them back
in. We were about average. But there was demand to do it much
faster. And frankly, we also compete with DynamoDB. We’re
API-compatible
with DynamoDB. Up until now, DynamoDB has been the best in the
industry at scaling up and down quickly. If your workload changes
throughout the day, you don’t want to pay for peak capacity all the
time. You want the system to follow usage dynamically. That’s
exactly what X Cloud is. With tablets, we break a very large
database – say, a petabyte of data – into five-gigabyte chunks. We
can move those chunks around super quickly. That allows us to scale
extremely fast. We can increase capacity by four times in about ten
minutes. For example, you can go from 500,000 operations per second
to two million operations per second in ten minutes.
Alan: And back to 500K?
Dor:
That’s right.
Alan: Sometimes with these things,
it’s like blowing up a balloon. You know what I mean? It never
really goes back to the size it was before you blew it up.
Dor: So with this, we can also go back and shrink.
It’s complicated, but it works. User workloads come and go –
whether it’s Black Friday or just daily patterns. That leads to big
TCO improvements and usability improvements. It’s also pretty
unique. We have a shard-per-core engine. So if you have a machine
with 32 cores, you’ll have 32 independent threads in the server. If
you have a 64-core machine, you’ll have 64 threads, and it will
perform twice as well. Now, let’s say you have a 64-core machine,
but you actually need 66 threads. If you only had 64, would you buy
another 64-core machine? That’s expensive. Instead, we can mix and
match. You can run a 64-core machine together with a small two-vCPU
machine side by side. Because of the flexibility of our sharding
model, we can combine the two. I haven’t seen any other vendor that
can do that. What the user gets is efficiency. They have exactly
what they need, without having to buy oversized, expensive servers.
Alan: Really, what we’re talking about here is
almost a FinOps play. I think that’s where we are now, especially
with cloud usage. Look, we’re talking about spending five trillion
dollars on data center AI factories. But when I talk to people,
what they actually say is, I want to get control of my cloud bill.
They want to be more efficient in how they use these resources.
That’s why I made the balloon joke – that’s pretty much how the
cloud works. It never seems to go back down. People want insight.
They want to be able to turn the dial. And they want to ask, how
can I do this more efficiently?
Dor: Most
databases aren’t that loaded. I’m not talking about spikes. I’m
talking about normal daily usage, or overnight usage. Often it’s
only 10% or 20% utilized – but you’re paying for the entire thing.
Alan: That was always the promise of the cloud –
that elasticity would go up and down. In practice, it mostly just
went up. Tiered Storage at ScyllaDB
Alan: So, what
else is new at ScyllaDB?
Dor: We’re also working
on things like tiered storage and other technologies to reduce the
bill. Normally, we use NVMe for fast storage and performance. It’s
also relatively cheap compared to other high-performance storage
options. But S3 is cheaper. The problem with S3 is latency. It can
be 50 milliseconds, 100 milliseconds, which is prohibitive for many
workloads. With tiered storage, we can keep the hot data on fast
NVMe and automatically move cold data to S3. That lets us come up
with a good solution for common use cases. For example, you might
want to keep 30 days of data in ScyllaDB on NVMe, but keep a year
of data overall – and still access it through the same API, without
having to build a separate access path. That gives users a single
API and a very cost-effective solution. Learn more about what’s
next for ScyllaDB at
Monster SCALE
Summit — free and virtual.