14 January 2025, 9:55 am by
ScyllaDB
How a team of just two engineers tackled real-time
persisted events for hundreds of millions of players With
just two engineers, Supercell took on the daunting task of growing
their basic account system into a social platform connecting
hundreds of millions of gamers. Account management, friend
requests, cross-game promotions, chat, player presence tracking,
and team formation – all of this had to work across their five
major games. And they wanted it all to be covered by a single
solution that was simple enough for a single engineer to maintain,
yet powerful enough to handle massive demand in real-time.
Supercell’s Server Engineer, Edvard Fagerholm, recently shared how
their mighty team of two tackled this task. Read on to learn how
they transformed a simple account management tool into a
comprehensive cross-game social network infrastructure that
prioritized both operational simplicity and high performance.
Note: If you enjoy hearing about engineering
feats like this, join us at Monster Scale
Summit (free + virtual). Engineers from Disney+/Hulu, Slack,
Canva, Uber, Salesforce, Atlassian and more will be sharing
strategies and case studies. Background: Who’s Supercell?
Supercell is the Finland-based company behind the hit games Hay
Day, Clash of Clans, Boom Beach, Clash Royale and Brawl Stars. Each
of these games has generated $1B in lifetime revenue.
Somehow they manage to achieve this with a super small staff. Until
quite recently, all the account management functionality for games
servicing hundreds of millions of monthly active users was being
built and managed by just two engineers. And that brings us to
Supercell ID. The Genesis of Supercell ID Supercell ID was born as
a basic account system – something to help users recover accounts
and move them to new devices. It was originally implemented as a
relatively simple HTTP API. Edvard explained, “The client could
perform HTTP queries to the account API, which mainly returned
signed tokens that the client could present to the game server to
prove their identity. Some operations, like making friend requests,
required the account API to send a notification to another player.
For example, ‘Do you approve this friend request?’ For that
purpose, there was an event queue for notifications. We would post
the event there, and the game backend would forward the
notification to the client using the game socket.” Enter Two-Way
Communication After Edvard joined the Supercell ID project in late
2020, he started working on the notification backend – mainly for
cross-promotion across their five games. He soon realized that they
needed to implement two-way communication themselves, and built it
as follows: Clients connected to a fleet of proxy servers, then a
routing mechanism pushed events directly to clients (without going
through the game). This was sufficient for the immediate goal of
handling cross-promotion and friend requests. It was fairly simple
and didn’t need to support high throughput or low latency. But it
got them thinking bigger. They realized they could use two-way
communication to significantly increase the scope of the Supercell
ID system. Edvard explained, “Basically, it allowed us to implement
features that were previously part of the game server. Our goal was
to take features that any new games under development might need
and package them into our system – thereby accelerating their
development.” With that, Supercell ID began transforming into a
cross-game social network that supported features like friend
graphs, teaming up, chat, and friend state tracking. Evolving
Supercell ID into a Cross-Game Social Network At this point, the
Social Network side of the backend was still a single-person
project, so they designed it with simplicity in mind. Enter
abstraction. Finding the right abstraction “We wanted to have only
one simple abstraction that would support all of our uses and could
therefore be designed and implemented by a single engineer,”
explained Edvard. “In other words, we wanted to avoid building a
chat system, a presence system, etc. We wanted to build one thing,
not many.” Finding the right abstraction was key. And a
hierarchical key-value store with Change Data Capture fit the bill
perfectly. Here’s how they implemented it: The top-level keys in
the key-value store are topics that can be subscribed to. There’s a
two-layer map under each top-level key – map(string, map(string, string)). Any change to the data under a
top-level key is broadcast to all that key’s subscribers. The
values in the innermost map are also timestamped. Each data source
controls its own timestamps and defines the correct order. The
client drops any update with an older timestamp than what it
already has stored. A typical change in the data would be something
like ‘level equals 10’ changes to ‘level equals 11’. As players
play, they trigger all sorts of updates like this, which means a
lot of small writes are involved in persisting all the events.
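To make the abstraction more concrete, here is a minimal sketch of how such a hierarchical, timestamped map could behave (hypothetical Python for illustration only – the class and method names are assumptions, not Supercell’s actual code): each update carries a source-defined timestamp, stale updates are dropped, and every accepted change is broadcast to the topic’s subscribers.

from collections import defaultdict

class TopicStore:
    # topic -> outer key -> inner key -> (timestamp, value)
    def __init__(self):
        self._data = defaultdict(lambda: defaultdict(dict))
        self._subscribers = defaultdict(list)  # topic -> list of callbacks

    def subscribe(self, topic, callback):
        self._subscribers[topic].append(callback)

    def apply_update(self, topic, outer_key, inner_key, value, timestamp):
        # Keep only the newest value per inner key; drop stale or duplicate updates.
        current = self._data[topic][outer_key].get(inner_key)
        if current is not None and current[0] >= timestamp:
            return
        self._data[topic][outer_key][inner_key] = (timestamp, value)
        # Broadcast every accepted change to all subscribers of the topic.
        for callback in self._subscribers[topic]:
            callback(topic, outer_key, inner_key, value, timestamp)

# Example: a player's level changing from 10 to 11.
store = TopicStore()
store.subscribe("player-123", lambda *event: print("event:", event))
store.apply_update("player-123", "presence", "level", "10", timestamp=1)
store.apply_update("player-123", "presence", "level", "11", timestamp=2)
store.apply_update("player-123", "presence", "level", "10", timestamp=1)  # stale: dropped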
Finding the Right Database They needed a database that would
support their technical requirements and be manageable, given their
minimalist team. That translated to the following criteria: Handles
many small writes with low latency Supports a hierarchical data
model Manages backups and cluster operations as a service ScyllaDB
Cloud turned out to be a great fit. (ScyllaDB Cloud is the
fully-managed version of ScyllaDB, a database known for delivering
predictable low latency at scale). How it All Plays Out For an idea
of how this plays out in Supercell games, let’s look at two
examples. First, consider chat messages. A simple chat message
might be represented in their data model as follows:
<room ID> -> <timestamp_uuid> -> message   -> “hi there”
                                 metadata  -> …
                                 reactions -> …
Edvard explained, “The top-level key that’s subscribed to
is the chat room ID. The next level key is a timestamp-UID, so we
have an ordering of each message and can query chat history. The
inner map contains the actual message together with other data
attached to it.” Next, let’s look at “presence”, which is used
heavily in Supercell’s new (and highly anticipated) game, mo.co.
The goal of presence, according to Edvard: “When teaming up for
battle, you want to see in real-time the avatar and the current
build of your friends – basically the weapons and equipment of your
friends, as well as what they’re doing. If your friend changes
their avatar or build, goes offline, or comes online, it should
instantly be visible in the ‘teaming up’ menu.” Players’ state data
is encoded into Supercell’s hierarchical map as follows:
<player ID> -> “presence” -> weapon -> sword
                             level  -> 29
                             status -> in battle
Note that: The top level is the player ID, the
second level is the type, and the inner map contains the data.
Supercell ID doesn’t need to understand the data; it just forwards
it to the game clients. Game clients don’t need to know the friend
graph since the routing is handled by Supercell ID. Deeper into the
System Architecture Let’s close with a tour of the system
architecture, as provided by Edvard. “The backend is split into
APIs, proxies, and event routing/storage servers. Topics live on
the event routing servers and are sharded across them. A client
connects to a proxy, which handles the client’s topic subscription.
The proxy routes these subscriptions to the appropriate event
routing servers. Endpoints (e.g., for chat and presence) send their
data to the event routing servers, and all events are persisted in
ScyllaDB Cloud. Each topic has a primary and a backup shard. The primary shard maintains in-memory sequence numbers for each message to detect lost messages. The secondary will forward messages without sequence numbers. If the primary goes down and comes back up, it will trigger a refresh of state on the client, as well as a reset of the sequence numbers. The API for the
routing layers is a simple post-event RPC containing a batch of
topic, type, key, value tuples. The job of each API is just to
rewrite their data into the above tuple representation. Every event
is written in ScyllaDB before broadcasting to subscribers. Our APIs
are synchronous in the sense that if an API call gives a successful
response, the message was persisted in ScyllaDB. Sending the same
event multiple times does no harm since applying the update on the
client is an idempotent operation, with the exception of possibly
multiple sequence numbers mapping to the same message. When
connecting, the proxy will figure out all your friends and
subscribe to their topics, same for chat groups you belong to. We
also subscribe to topics for the connecting client. These are used
for sending notifications to the client, like friend requests and
cross promotions. A router reboot triggers a resubscription to
topics from the proxy. We use Protocol Buffers to save on bandwidth
cost. All load balancing is at the TCP level to guarantee that
requests over the same HTTP/2 connection are handled by the same
TCP socket on the proxy. This lets us cache certain information in
memory on the initial listen, so we don’t need to refetch on other
requests. We have enough concurrent clients that we don’t need to
separately load balance individual HTTP/2 requests, as traffic is
evenly distributed anyway, and requests are about equally expensive
to handle across different users. We use persistent sockets between
proxies and routers. This way, we can easily send tens of thousands
of subscriptions per second to a single router without an issue.”
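As a rough illustration of the flow described above – rewrite each event into a (topic, type, key, value) tuple, persist it, then fan it out to subscribers – here is a hedged Python sketch. The names are assumptions, not Supercell’s code: persist_event stands in for the actual ScyllaDB write, and connection.send for the push through a proxy.

from dataclasses import dataclass

@dataclass
class Event:
    topic: str    # e.g. a chat room ID or a player ID
    type: str     # e.g. "message" or "presence"
    key: str
    value: bytes  # opaque payload (e.g. a serialized Protocol Buffers message)

def handle_post_event(batch, persist_event, subscriptions):
    # Process one post-event RPC: a batch of (topic, type, key, value) tuples.
    for event in batch:
        # Persist before broadcasting: a successful reply means the event is stored.
        # Re-sending the same event later is harmless, because applying the update
        # on the client is idempotent.
        persist_event(event)
        # Fan the event out to every subscriber of its topic.
        for connection in subscriptions.get(event.topic, ()):
            connection.send(event)
    return "ok"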
But It’s Not Game Over If you want to watch the complete tech talk,
just press play below: And if you want to read more about
ScyllaDB’s role in the gaming world, you might also want to read:
Epic Games: How Epic Games uses ScyllaDB as a
binary cache in front of NVMe and S3 to accelerate global
distribution of large game assets used by Unreal Cloud DDC.
Tencent Games: How Tencent Games built service
architecture based on CQRS and event sourcing patterns with Pulsar
and ScyllaDB.
Discord: How Discord uses ScyllaDB to power
their massive growth, moving from a niche gaming platform to one of
the world’s largest communication platforms.
8 January 2025, 1:20 pm by
ScyllaDB
Monitoring tips that can help reduce cluster size 2-5X
without compromising latency Editor’s note: The
following is a guest post by Andrei Manakov, Senior Staff Software
Engineer at ShareChat. It was originally published
on Andrei’s blog. I had the privilege of giving
a talk at ScyllaDB Summit 2024, where I briefly addressed the
challenge of analyzing the remaining capacity in ScyllaDB clusters.
A good understanding of ScyllaDB internals is required to plan your
computation cost increase when your product grows or to reduce cost
if the cluster turns out to be heavily over-provisioned. In my
experience, clusters can be reduced by 2-5x without latency
degradation after such an analysis. In this post, I provide more
detail on how to properly analyze CPU and disk resources. How Does
ScyllaDB Use CPU? ScyllaDB is a distributed database, and one
cluster typically contains multiple nodes. Each node can contain
multiple shards, and each shard is assigned to a single core. The
database is built on the
Seastar framework and
uses a shared-nothing approach. All data is usually replicated in
several copies, depending on the
replication factor, and each copy is assigned to a specific
shard. As a result, every shard can be analyzed as an independent
unit and every shard efficiently utilizes all available CPU
resources without any overhead from contention or context
switching. Each shard has different tasks, which we can divide into
two categories: client request processing and maintenance tasks.
All tasks are executed by a scheduler in one thread pinned to a
core, giving each one its own CPU budget limit. Such clear task
separation allows
isolation and prioritization of latency-critical tasks for
request processing. As a result of this design, the cluster handles
load spikes more efficiently and provides gradual latency
degradation under heavy load. [
More details about
this architecture].
Another interesting result of this design is that
ScyllaDB supports
workload prioritization. In my experience, this approach
ensures that critical latency is not impacted during less critical
load spikes. I can’t recall any similar feature in other databases.
Such problems are usually tackled by having 2 clusters for
different workloads. But keep in mind that this feature is
available only in ScyllaDB Enterprise.
However, background tasks may occupy all remaining resources, and
overall CPU utilization in the cluster appears spiky. So, it’s not
obvious how to find the real cluster capacity. It’s easy to see
100%
CPU usage with no performance impact. If we increase the
critical load, it will consume the resources (CPU, I/O) from
background tasks. Background tasks’ duration can increase slightly,
but it’s totally manageable. The Best CPU Utilization Metric How
can we understand the remaining cluster capacity when CPU usage
spikes up to 100% throughout the day, yet the system remains
stable? We need to exclude maintenance tasks and remove all these
spikes from consideration. Since ScyllaDB distributes all the
data by shards and every shard has its own core, we take into
account the max CPU utilization by a shard excluding maintenance
tasks (you can find
other task types here). In my experience, you can keep the
utilization up to 60-70% without visible degradation in tail
latency. Example of a Prometheus query:
max(sum(rate(scylla_scheduler_runtime_ms{group!~"compaction|streaming"})) by (instance, shard)) / 10
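If you want to evaluate this outside of Grafana, the same PromQL can be sent to the Prometheus HTTP API. Below is a minimal sketch, assuming a Prometheus server at localhost:9090 and the Python requests library; note that a range selector (here [5m]) has to be added when running rate() directly, since Grafana normally supplies it via a dashboard variable.

import requests

PROMETHEUS = "http://localhost:9090"  # assumption: adjust to your monitoring stack
QUERY = ('max(sum(rate(scylla_scheduler_runtime_ms'
         '{group!~"compaction|streaming"}[5m])) by (instance, shard)) / 10')

resp = requests.get(f"{PROMETHEUS}/api/v1/query", params={"query": QUERY}, timeout=10)
resp.raise_for_status()
result = resp.json()["data"]["result"]
if result:
    utilization = float(result[0]["value"][1])
    print(f"Max shard CPU utilization (excluding maintenance): {utilization:.1f}%")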
You can find more details about the
ScyllaDB monitoring stack here. In this
article, PromQL queries are used to
demonstrate how to analyse key metrics effectively.
However, I don’t recommend rapidly downscaling the cluster to the
desired size just after looking at max CPU utilization excluding
the maintenance tasks. First, you need to look at average CPU
utilization excluding maintenance tasks across all shards. In an
ideal world, it should be close to the max value. In case of
significant skew, it definitely makes sense to find the root cause.
It can be an inefficient schema with an incorrect
partition key or an incorrect
token-aware/rack-aware configuration in the driver. Second, you
need to take a look at the average CPU utilization of the excluded tasks, since it depends on your workload specifics. It’s rarely more than 5-10%, but you might need to keep a larger buffer if these tasks use more CPU. Otherwise, compaction will be starved of resources, and reads will start to become more expensive in terms of CPU and disk. Third, it’s important to downscale
your cluster gradually. ScyllaDB has an in-memory row cache which is crucial for its read performance. The database allocates all remaining memory to this cache, so with a memory reduction the hit rate might drop more than you expected. Hence, CPU utilization can increase non-linearly, and a low cache hit rate can harm your tail latency. The cache hit rate can be tracked with:
1 - (sum(rate(scylla_cache_reads_with_misses{})) / sum(rate(scylla_cache_reads{})))
I haven’t mentioned RAM in this article as there are
not many actionable points. However, since memory cache is crucial
for efficient reading in ScyllaDB, I recommend always using
memory-optimized virtual machines. The more
memory, the better.
Disk Resources ScyllaDB is an LSMT-based database. That means it is optimized for writing by design, and any mutation will lead to appending new data to the disk. The
database periodically rewrites the data to ensure acceptable read
performance. Disk performance plays a crucial role in overall
database performance. You can find more details about the write
path and compaction in the
scylla
documentation. There are 3 important disk resources we will
discuss here: Throughput, IOPs and free disk space. All these
resources depend on the disk type we attached to our ScyllaDB nodes
and their quantity. But how can we understand the limit of the
IOPs/throughput? There are 2 possible options: Any cloud provider or manufacturer usually provides the performance of their disks; you can find it on their website. For example, NVMe disks from Google Cloud. The actual disk performance can differ from the numbers that manufacturers share. The best option might be just to measure it, and we can easily get the result: ScyllaDB performs a benchmark during installation on a node and stores the result in the file
io_properties.yaml. The database uses these limits
internally for achieving
optimal performance.
disks:
  - mountpoint: /var/lib/scylla/data
    read_iops: 2400000          # iops
    read_bandwidth: 5921532416  # throughput (bytes/s)
    write_iops: 1200000         # iops
    write_bandwidth: 4663037952 # throughput (bytes/s)
file: io_properties.yaml
Disk Throughput
sum(rate(node_disk_read_bytes_total{})) / (read_bandwidth * nodeNumber)
sum(rate(node_disk_written_bytes_total{})) / (write_bandwidth * nodeNumber)
In my experience, I haven’t seen any harm with utilization up to 80-90%.
Disk IOPs
sum(rate(node_disk_reads_completed_total{})) / (read_iops * nodeNumber)
sum(rate(node_disk_writes_completed_total{})) / (write_iops * nodeNumber)
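As a quick sanity check of the formulas above, here is a back-of-the-envelope calculation using the io_properties.yaml numbers. The node count and the observed read rate are made-up values for illustration:

read_bandwidth_per_node = 5_921_532_416     # bytes/s, from io_properties.yaml above
node_count = 3                              # assumption for illustration
observed_cluster_read_rate = 4_000_000_000  # bytes/s, e.g. sum(rate(node_disk_read_bytes_total{}))

utilization = observed_cluster_read_rate / (read_bandwidth_per_node * node_count)
print(f"Read throughput utilization: {utilization:.1%}")  # ~22.5%, far below the 80-90% comfort zone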
Disk free space It’s crucial to have a significant buffer in every node. If you run out of space, the node will be basically unavailable, and it will be hard to restore. Moreover, additional space is required for many operations: Every update, write, or delete will be written to the disk and allocate new space. Compaction requires some buffer while cleaning up space. The backup procedure also needs space. The best way to control disk
usage is to use
Time To Live in the tables if it matches your use case. In this
case, irrelevant data will expire and be cleaned during compaction.
I usually try to keep at least 50-60% of free space.
min(sum(node_filesystem_avail_bytes{mountpoint="/var/lib/scylla"}) by (instance)
  / sum(node_filesystem_size_bytes{mountpoint="/var/lib/scylla"}) by (instance))
Tablets Most apps have significant load
variations throughout the day or week. ScyllaDB is not elastic, and you need to provision the cluster for the peak load. So, you could waste a lot of resources during nights or weekends. But that
could change soon. A ScyllaDB cluster distributes data across its
nodes and the smallest unit of the data is a partition uniquely
identified by a
partition key. A
partitioner hash function computes tokens to determine which nodes store the data. Every node has its own token range,
and all nodes make a
ring. Previously, adding a new node wasn’t a fast procedure, because it required copying data to a new node (this is called streaming), adjusting the token ranges of neighbors, etc. In addition, it was a manual procedure. However, ScyllaDB introduced tablets in version 6.0, and they provide new opportunities. A tablet is a range of tokens in a table, and it includes partitions which can be replicated independently. This makes the overall process much smoother and increases elasticity significantly. Adding new
nodes
takes minutes and
a new node starts processing requests even before full data
synchronization. It looks like a significant step toward full
elasticity which can drastically reduce server cost for ScyllaDB
even more. You can
read more about
tablets here. I am looking forward to testing tablets closely
soon. Conclusion Tablets look like a solid foundation for future
pure elasticity, but for now, we’re planning clusters for peak
load. To effectively analyze ScyllaDB cluster capacity, focus on
these key recommendations: Target
max CPU
utilization (excluding maintenance tasks) per shard at
60–70%. Ensure sufficient
free disk
space to handle compaction and backups. Gradually
downsize clusters to avoid sudden cache
degradation.
2 January 2025, 1:09 pm by
ScyllaDB
It’s been a while since my last update. We’ve been busy improving
the existing ScyllaDB training material and adding new lessons and
labs. In this post, I’ll survey the latest developments and update
you on the live training event taking place later this month. You
can discuss these topics (and more!) on the community forum.
Say hello here. ScyllaDB University LIVE Training In addition
to the self-paced online courses you can take on ScyllaDB
University (see below), we host online live training events. These
events are a great opportunity to improve your NoSQL and ScyllaDB
skills, get hands-on practice, and get your questions answered by
our team of experts. The next event is ScyllaDB University LIVE,
which will take place on January 29. As usual, we’re planning on having two tracks: an Essentials track and an Advanced track. However,
this time we’ll change the format and make each track a complete
learning path. Stay tuned for more details, and I hope to see you
there.
Save
your spot at ScyllaDB University LIVE ScyllaDB University
Content Updates
ScyllaDB
University is our online learning platform where you can learn
about NoSQL and ScyllaDB and get some hands-on experience. It
includes many different self-paced lessons, meaning you can study
whenever you have some free time and continue where you left off.
The material is free and all you have to do is create a user
account. We recently added new lessons and updated many existing
ones. All of the following topics were added to the course
S201: Data
Modeling and Application Development. Start learning New in the
How To Write Better Apps Lesson General Data Modeling Guidelines
This lesson discusses key principles of NoSQL data modeling,
emphasizing a query-driven design approach to ensuring efficient
data distribution and balanced workloads. It highlights the
importance of selecting high-cardinality primary keys, avoiding bad
access patterns, and using ScyllaDB Monitoring to identify and
resolve issues such as Hot Partitions and Large Partitions.
Neglecting these practices can lead to slow performance,
bottlenecks, and potentially unreadable data – underscoring the
need for using best practices when creating your data model. To
learn more, you can explore
the complete lesson here. Large Partitions and Collections This
lesson provides insights into common pitfalls in NoSQL data
modeling, focusing on issues like large partitions, collections,
and improper use of ScyllaDB features. It emphasizes avoiding large
partitions due to the impact on performance and demonstrates this
with real-world examples and Monitoring data. Collections should
generally remain small to prevent high latency. The schema used
depends on the use case and on the performance requirements.
Practical advice and tools are offered for testing and monitoring.
You can learn more in
the complete lesson here. Hot Partitions, Cardinality and
Tombstones This lesson explores common challenges in NoSQL
databases, focusing on hot partitions, low cardinality keys, and
tombstones. Hot partitions cause uneven load and bottlenecks, often
due to misconfigurations or retry storms. Having many tombstones
can degrade read performance due to read amplification. Best
practices include avoiding retry storms, using efficient full-table
scans over low cardinality views and preferring partition-level
deletes to minimize tombstone buildup. Monitoring tools and
thoughtful schema design are emphasized for efficient database
performance. You can find
the complete lesson here. Diagnosis and Prevention This lesson
covers strategies to diagnose and prevent common database issues in
ScyllaDB, such as large partitions, hot partitions, and
tombstone-related inefficiencies. Tools like the nodetool
toppartitions command help identify hot partition problems, while
features like per-partition rate limits and shard concurrency
limits manage load and prevent contention. Properly configuring
timeout settings avoids retry storms that exacerbate hot partition
problems. For tombstones, using efficient delete patterns helps
maintain performance and prevent timeouts during reads. Proactive
monitoring and adjustments are emphasized throughout. You can see
the
complete lesson here. New in the Basic Data Modeling Lesson CQL
and the CQL Shell The lesson introduces the Cassandra Query
Language (CQL), its similarities to SQL, and its use in ScyllaDB
for data definition and manipulation commands. It highlights the
interactive CQL shell (CQLSH) for testing and interaction,
alongside a high level overview of drivers. Common data types and
collections like Sets, Lists, Maps, and User-Defined Types in
ScyllaDB are briefly mentioned. The “Pet Care IoT” lab example is
presented, where sensors on pet collars record data like heart rate
or temperature at intervals. This demonstrates how CQL is applied
in database operations for IoT use cases. This example is used in
labs later on. You can watch
the video and complete lesson here. Data Modeling Overview and
Basic Concepts The new video introduces the basics of data modeling
in ScyllaDB, contrasting NoSQL and relational approaches. It
emphasizes starting with application requirements, including
queries, performance, and consistency, to design models. Key
concepts such as clusters, nodes, keyspaces, tables, and
replication factors are explained, highlighting their role in
distributed data systems. Examples illustrate how tables and
primary keys (partition keys) determine data distribution across
nodes using consistent hashing. The lesson demonstrates creating
keyspaces and tables, showing how replication factors ensure data
redundancy and how ScyllaDB maps partition keys to replica nodes
for efficient reads and writes. You can find
the complete lesson here. Primary Key, Partition Key,
Clustering Key This lesson explains the structure and importance of
primary keys in ScyllaDB, detailing their two components: the
mandatory partition key and the optional clustering key. The
partition key determines the data’s location across nodes, ensuring
efficient querying, while the clustering key organizes rows within
a partition. For queries to be efficient, the partition key must be
specified to avoid full table scans. An example using pet data
illustrates how rows are sorted within partitions by the clustering
key (e.g., time), enabling precise and optimized data retrieval.
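For readers who want to see what such a table looks like in practice, here is an illustrative sketch using the Python driver; the keyspace, table, and column names are assumptions in the spirit of the Pet Care IoT example, not necessarily the exact schema used in the lesson.

import uuid
from datetime import datetime
from cassandra.cluster import Cluster  # the Python driver also works against ScyllaDB

session = Cluster(["127.0.0.1"]).connect()
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS pet_care
    WITH replication = {'class': 'NetworkTopologyStrategy', 'replication_factor': 3}
""")
session.execute("""
    CREATE TABLE IF NOT EXISTS pet_care.heartrate (
        pet_id      uuid,       -- partition key: determines which replicas own the data
        recorded_at timestamp,  -- clustering key: orders rows within the partition
        heart_rate  int,
        PRIMARY KEY (pet_id, recorded_at)
    )
""")

# Efficient query: the partition key is specified, so there is no full table scan,
# and rows come back ordered by the clustering key.
pet_id = uuid.uuid4()
rows = session.execute(
    "SELECT recorded_at, heart_rate FROM pet_care.heartrate "
    "WHERE pet_id = %s AND recorded_at >= %s",
    (pet_id, datetime(2025, 1, 1)),
)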
Find
the complete lesson here. Importance of Key Selection This
video emphasizes the importance of choosing partition and
clustering keys in ScyllaDB for optimal performance and data
distribution. Partition keys should have high cardinality to ensure
even data distribution across nodes and avoid issues like large or
hot partitions. Examples of good keys include unique identifiers
like user IDs, while low-cardinality keys like states or ages can
lead to uneven load and inefficiency. Clustering keys should align
with query patterns, considering the order of rows and prioritizing
efficient retrieval, such as fetching recent data for
time-sensitive applications. Strategic key selection prevents
resource bottlenecks and enhances scalability. Learn more in
the complete lesson. Data Modeling Lab Walkthrough (three
parts) The new three-part video lesson focuses on key aspects of
data modeling in ScyllaDB, emphasizing the design and use of
primary keys. It demonstrates creating a cluster and tables using
the CQL shell, highlighting how partition keys determine data
location and efficient querying while showcasing different queries.
Some tables use a Clustering key, which organizes data within
partitions, enabling efficient range queries. It explains compound
primary keys to enhance query flexibility. Next, an example of a
different clustering key order (ascending or descending) is given.
This enables query optimization and efficient retrieval of data.
Throughout the lab walkthrough, different challenges are presented,
as well as data modeling solutions to optimize performance,
scalability, and resource utilization. You can
watch the walkthrough here and also
take the lab yourself. New in the Advanced Data Modeling Lesson
Collections and Drivers The new lesson discusses advanced data
modeling in ScyllaDB, focusing on collections (Sets, Lists, Maps,
and User-defined types) to simplify models with multi-value fields
like phone numbers or emails. It introduces token-aware and
shard-aware drivers as optimizations to enhance query efficiency.
Token-aware drivers allow clients to send requests directly to
replica nodes, bypassing extra hops through coordinator nodes,
while shard-aware clients target specific shards within replica
nodes for improved performance. ScyllaDB supports drivers in
multiple languages like Java, Python, and Go, along with
compatibility with Cassandra drivers. An entire
course
on Drivers is also available. You can learn more in
the complete lesson here. New in the ScyllaDB Operations Course
Replica level Write/Read Path The lesson explains ScyllaDB’s read
and write paths, focusing on how data is written to Memtables
persisted as immutable SSTables. Because the SSTables are
immutable, they are compacted periodically. Writes, including
updates and deletes, are stored in a commit log before being
flushed to SSTables. This ensures data consistency. For reads, a
cache is used to optimize performance (also using bloom filters).
Compaction merges SSTables to remove outdated data, maintain
efficiency, and save storage. ScyllaDB offers different compaction
strategies and you can choose the most suitable one based on your
use case. Learn more in
the full lesson. Tracing Demo The lesson provides a practical
demonstration of ScyllaDB’s tracing using a three-node cluster. The
tracing tool is showcased as a debugging aid to track request flows
and replica responses. The demo highlights how data consistency
levels influence when responses are sent back to clients and
demonstrates high availability by successfully handling writes even
when a node is down, provided the consistency requirements are met.
You can find
the complete lesson here.
26 December 2024, 2:35 pm by
ScyllaDB
Let’s look back at the top 10 ScyllaDB blog posts written this year
– plus 10 “timeless classics” that continue to get attention.
Before we start, thank you to all the community members who
contributed to our blogs in various ways – from users sharing best
practices at ScyllaDB Summit, to engineers explaining how they
raised the bar for database performance, to anyone who has
initiated or contributed to the discussion on HackerNews, Reddit,
and other platforms. And if you have suggestions for 2025 blog
topics, please share them with us on our socials. With no further
ado, here are the most-read blog posts that we published in 2024…
We Compared ScyllaDB and Memcached and… We Lost?
By
Felipe Cardeneti Mendes Engineers behind ScyllaDB joined
forces with Memcached maintainer dormando for an in-depth look at
database and cache internals, and the tradeoffs in each. Read:
We
Compared ScyllaDB and Memcached and… We Lost? Related:
Why Databases Cache, but Caches Go to Disk Inside
ScyllaDB’s Internal Cache
By Pavel “Xemul”
Emelyanov Why ScyllaDB completely bypasses the Linux cache
during reads, using its own highly efficient row-based cache
instead. Read: Inside ScyllaDB’s Internal Cache Related:
Replacing Your Cache with ScyllaDB Smooth Scaling: Why
ScyllaDB Moved to “Tablets” Data Distribution
By Avi
Kivity The rationale behind ScyllaDB’s new “tablets”
replication architecture, which builds upon a multiyear project to
implement and extend Raft. Read:
Smooth Scaling:
Why ScyllaDB Moved to “Tablets” Data Distribution Related:
ScyllaDB Fast Forward: True Elastic Scale Rust vs. Zig
in Reality: A (Somewhat) Friendly Debate
By Cynthia
Dunlop A (somewhat) friendly P99 CONF popup debate with
Jarred Sumner (Bun.js), Pekka Enberg (Turso), and Glauber Costa
(Turso) on ThePrimeagen’s stream. Read:
Rust vs. Zig in Reality: A (Somewhat) Friendly Debate Related:
P99 CONF on demand
Database Internals: Working with IO
By Pavel “Xemul”
Emelyanov Explore the tradeoffs of different Linux I/O
methods and learn how databases can take advantage of a modern
SSD’s unique characteristics. Read:
Database Internals: Working with IO Related:
Understanding Storage I/O Under Load How We Implemented
ScyllaDB’s “Tablets” Data Distribution
By Avi
Kivity How ScyllaDB implemented its new Raft-based tablets
architecture, which enables teams to quickly scale out in response
to traffic spikes. Read:
How We
Implemented ScyllaDB’s “Tablets” Data Distribution Related:
Overcoming Distributed Databases Scaling Challenges with
Tablets How ShareChat Scaled their ML Feature Store
1000X without Scaling the Database
By Ivan Burmistrov and
Andrei Manakov How ShareChat engineers managed to meet
their lofty performance goal without scaling the underlying
database. Read:
How ShareChat Scaled their ML Feature Store 1000X without Scaling
the Database Related:
ShareChat’s Path to High-Performance NoSQL with ScyllaDB
New Google Cloud Z3 Instances: Early Performance Benchmarks
By Łukasz Sójka, Roy Dahan ScyllaDB had the
privilege of testing Google Cloud’s brand new Z3 GCE instances in
an early preview. We observed a 23% increase in write throughput,
24% for mixed workloads, and 14% for reads per vCPU – all at a
lower cost compared to N2. Read:
New
Google Cloud Z3 Instances: Early Performance Benchmarks
Related:
A Deep Dive into ScyllaDB’s Architecture Database
Internals: Working with CPUs
By Pavel “Xemul”
Emelyanov Get a database engineer’s inside look at how the
database interacts with the CPU…in this excerpt from the book,
“Database Performance at Scale.” Read:
Database
Internals: Working with CPUs Related:
Database
Performance at Scale: A Practical Guide [Free Book]
Migrating from Postgres to ScyllaDB, with 349X Faster Query
Processing
By Dan Harris and Sebastian Vercruysse
How Coralogix cut processing times from 30 seconds to 86
milliseconds with a PostgreSQL to ScyllaDB migration. Read:
Migrating from Postgres to ScyllaDB, with 349X Faster Query
Processing Related:
NoSQL Migration Masterclass Bonus: Top NoSQL Database
Blogs From Years Past Many of the blogs published in previous years
continued to resonate with the community. Here’s a rundown of 10
enduring favorites:
How io_uring and eBPF Will Revolutionize Programming in
Linux (Glauber Costa): How io_uring and eBPF will
change the way programmers develop asynchronous interfaces and
execute arbitrary code, such as tracepoints, more securely. [2020]
Benchmarking MongoDB vs ScyllaDB: Performance, Scalability
& Cost (Dr. Daniel Seybold): Dr. Daniel Seybold shares
how MongoDB and ScyllaDB compare on throughput, latency,
scalability, and price-performance in this third-party benchmark by
benchANT. [2023]
Introducing “Database Performance at Scale”: A Free, Open Source
Book (Dor Laor): Introducing a new book that provides
practical guidance for understanding the opportunities, trade-offs,
and traps you might encounter while trying to optimize
data-intensive applications for high throughput and low latency.
[2023]
DynamoDB: When to Move Out (Felipe Cardeneti Mendes):
A look at the top reasons why teams decide to leave DynamoDB:
throttling, latency, item size limits, and limited flexibility…not
to mention costs. [2023]
ScyllaDB vs MongoDB vs PostgreSQL: Tractian’s Benchmarking &
Migration (João Pedro Voltani): TRACTIAN shares their
comparison of ScyllaDB vs MongoDB and PostgreSQL, then provides an
overview of their MongoDB to ScyllaDB migration process, challenges
& results. [2023]
Benchmarking Apache Cassandra (40 Nodes) vs ScyllaDB (4
Nodes) (Juliusz Stasiewicz, Piotr Grabowski, Karol
Baryla): We benchmarked Apache Cassandra on 40 nodes vs ScyllaDB on
just 4 nodes. See how they stacked up on throughput, latency, and
cost. [2022]
How Numberly Replaced Kafka with a Rust-Based ScyllaDB
Shard-Aware Application (Alexys Jacob): How Numberly
used Rust & ScyllaDB to replace Kafka, streamlining the way all its
AdTech components send and track messages (whatever their form).
[2023]
Async Rust in Practice: Performance, Pitfalls,
Profiling (Piotr Sarna): How our engineers used
flamegraphs to diagnose and resolve performance issues in our Tokio
framework based Rust driver. [2022]
On Coordinated Omission (Ivan Prisyazhynyy): Your
benchmark may be lying to you! Learn why coordinated omissions are
a concern, and how we account for them in benchmarking ScyllaDB.
[2021]
Why Disney+ Hotstar Replaced Redis and Elasticsearch with ScyllaDB
Cloud (Cynthia Dunlop) – Get the inside perspective on
how Disney+ Hotstar simplified its “continue watching” data
architecture for scale. [2022]
20 December 2024, 9:45 am by
Datastax Technical HowTo's
When Spotify launched its wrapped campaign years ago, it tapped
into something we all love - taking a moment to look back and
celebrate achievements. As we wind down 2024, we thought it would
be fun to do our own "wrapped" for Apache Cassandra®. And what a
year it's been! From groundbreaking...
18 December 2024, 2:00 pm by
ScyllaDB
TL;DR ScyllaDB has decided to focus on a single release stream –
ScyllaDB Enterprise. Starting with the ScyllaDB Enterprise 2025.1
release (ETA February 2025): ScyllaDB Enterprise will change from
closed source to source available. ScyllaDB OSS AGPL 6.2 will stand
as the final OSS AGPL release. A free tier of the full-featured
ScyllaDB Enterprise will be available to the community. This
includes all the performance, efficiency, and security features
previously reserved for ScyllaDB Enterprise. For convenience, the
existing ScyllaDB Enterprise 2024.2 will gain the new source
available license starting from our next patch release (in
December), allowing easy migration of older releases. The source
available Scylla Manager will move to AGPL and the closed source
Kubernetes multi-region operator will be merged with the main
Apache-licensed ScyllaDB Kubernetes operator. Other ScyllaDB
components (e.g., Seastar, Kubernetes operator, drivers) will keep
their current licenses. Why are we doing this? ScyllaDB’s team has
always been extremely passionate about open source, low-level
optimizations, and the delivery of groundbreaking core technologies
– from hypervisors (KVM, Xen), to operating systems (Linux, OSv),
and the ScyllaDB database. Over our 12 years of existence, we
developed an OS, pivoted to the database space, developed Seastar
(the open source standalone core engine of ScyllaDB), and developed
ScyllaDB itself. Dozens of open source projects were created:
drivers, a Kubernetes operator, test harnesses, and various tools.
Open source is an outstanding way to share innovation. It is a
straightforward choice for projects that are not your core
business. However, it is a constant challenge for vendors whose
core product is open source. For almost a decade, we have been
maintaining two separate release streams: one for the open source
database and one for the enterprise product. Balancing the free vs.
paid offerings is a never-ending challenge that involves
engineering, product, marketing, and constant sales discussions.
Unlike other projects that decided to switch to source available or
BSL to protect themselves from “free ride” competition, we were
comfortable with AGPL. We took different paths, from the initial
reimplementation of the Apache Cassandra API, to an open source
implementation of a DynamoDB-compatible API. Beyond the license, we
followed the whole approach of ‘open source first.’ Almost every
line of code – from a new feature, to a bug fix – went to the open
source branch first. We were developing two product lines that
competed with one another, and we had to make one of them
dramatically better. It’s hard enough to develop a single database
and support Docker, Kubernetes, virtual and physical machines, and
offer a database-as-a-service. The value of developing two separate
database products, along with their release trains, ultimately does
not justify the massive overhead and incremental costs required. To
give you some idea of what’s involved, we have had nearly 60 public
releases throughout 2024. Moreover, we have been the single
significant contributor of the source code. Our ecosystem tools
have received a healthy amount of contributions, but not the core
database. That makes sense. The ScyllaDB internal implementation is
a C++, shard-per-core, future-promise code base that is extremely
hard to understand and requires full-time devotion. Thus
source-wise, in terms of the code, we operated as a full
open-source-first project. However, in reality, we benefited from this no more than we would have as a source-available project. “Behind the
curtain” tradeoffs of free vs paid Balancing our requirements (of
open source first, efficient development, no crippling of our OSS,
and differentiation between the two branches) has been challenging,
to say the least. Our open source first culture drove us to develop
new core features in the open. Our engineers released these
features before we were prepared to decide what was appropriate for
open source and what was best for the enterprise paid offering. For
example, Tablets, our recent architectural shift, was all developed
in the open – and 99% of its end user value is available in the OSS
release. As the Enterprise version branched out of the OSS branch,
it was helpful to keep a unified base for reuse and efficiency.
However, it reduced our paid version differentiation since all
features were open by default (unless flagged). For a while, we
thought that the OSS release would be the latest and greatest and
have a short lifecycle as a differentiation and a means of
efficiency. Although maintaining this process required a lot of
effort on our side, this could have been a nice mitigation option,
a replacement for a feature/functionality gap between free and
paid. However, the OSS users didn’t really use the latest and
didn’t always upgrade. Instead, most users preferred to stick to
old, end-of-life releases. The result was a lose-lose situation
(for users and for us). Another approach we used was to
differentiate by using peripheral tools – such as Scylla Manager,
which helps to operate ScyllaDB (e.g., running backup/restore and
managing repairs) – and having a usage limit on them. Our
Kubernetes operator is open source and we added a separate closed
source repository for multi-region support for Kubernetes. This is
a complicated path for development and also for our paying users.
The factor that eventually pushed us over the line is that our new
architecture – with Raft, tablets, and native S3 – moves peripheral
functionality into the core database: Our backup and restore
implementation moves from an agent and external manager into the
core database. S3 I/O access for backup and restore (and, in the
future, for tiered storage) is handled directly by the core
database. The I/O operations are controlled by our schedulers,
allowing full prioritization and bandwidth control. Later on,
“point in time recovery” will be provided. This is a large overhaul
unification change, eliminating complexity while improving control.
Repair becomes automatic. Repair is a full-scan, backend process
that merges inconsistent replica data. Previously, it was
controlled by the external Scylla Manager. The new generation core
database runs its own automatic repair with tablet awareness. As a
result, there is no need for an external peripheral tool; repair
will become transparent to the user, like compaction is today.
These changes are leading to a more complete core product, with
better manageability and functionality. However, they eat into the
differentiators for our paid offerings. As you can see, a
combination of architecture consolidations, together with multiple
release stream efforts, have made our lives extremely complicated
and slowed down our progress. Going forward After a tremendous
amount of thought and discussion on these points, we decided to
unify the two release streams as described at the start of this
post. This license shift will allow us to better serve our
customers as well as provide increased free tier value to the
community. The new model opens up access to previously-restricted
capabilities that: Achieve up to 50% higher throughput and 33%
lower latency via profile-guided optimization Speed up node
addition/removal by 30X via file-based streaming Balance multiple
workloads with different performance needs on a single cluster via
workload prioritization Reduce network costs with ZSTD-based
network compression (with a shard dictionary) for intra-node RPC
Combine the best of Leveled Compaction Strategy and Size-tiered
Compaction Strategy with Incremental Compaction Strategy –
resulting in 35% better storage utilization Use encryption at rest,
LDAP integration, and all of the other benefits of the previous
closed source Enterprise version Provide a single (all open source)
Kubernetes operator for ScyllaDB Enable users to enjoy a longer
product life cycle This was a difficult decision for us, and we
know it might not be well-received by some of our OSS users running
large ScyllaDB clusters. We appreciate your journey and we hope you
will continue working with ScyllaDB. After 10 years, we believe
this change is the right move for our company, our database, our
customers, and our early adopters. With this shift, our team will
be able to move faster, better respond to your needs, and continue
making progress towards the major milestones on our roadmap: Raft
for data, optimized tablet elasticity, and tiered (S3) storage.
Read the
FAQ
16 December 2024, 1:09 pm by
ScyllaDB
The following is an excerpt from Chapter 1 of Database
Performance at Scale (an Open Access book that’s available for
free). Follow Joan’s highly fictionalized adventures with some
all-too-real database performance challenges. You’ll laugh. You’ll
cry. You’ll wonder how we worked this “cheesy story” into a deeply
technical book. Get the
complete book, free Lured in by impressive buzzwords like
“hybrid cloud,” “serverless,” and “edge first,” Joan readily joined
a new company and started catching up with their technology stack.
Her first project recently started a transition from their in-house
implementation of a database system, which turned out to not scale
at the same pace as the number of customers, to one of the
industry-standard database management solutions. Their new pick was
a new distributed database, which, contrary to NoSQL, strives to
keep the original
ACID guarantees known in
the SQL world. Due to a few new data protection acts that tend to
appear annually nowadays, the company’s board decided that they
were going to maintain their own datacenter, instead of using one
of the popular cloud vendors for storing sensitive information. On
a very high level, the company’s main product consisted of only two
layers: The frontend, the entry point for users, which actually
runs in their own browsers and communicates with the rest of the
system to exchange and persist information. The everything-else,
customarily known as “backend,” but actually including load
balancers, authentication, authorization, multiple cache layers,
databases, backups, and so on. Joan’s first introductory task was
to implement a very simple service for gathering and summing up
various statistics from the database, and integrate that service
with the whole ecosystem, so that it fetches data from the database
in real-time and allows the DevOps teams to inspect the statistics
live. To impress the management and reassure them that hiring Joan
was their absolutely best decision this quarter, Joan decided to
deliver a proof-of-concept implementation on her first day! The
company’s unspoken policy was to write software in Rust, so she
grabbed the first driver for their database from a brief crates.io
search and sat down to her self-organized hackathon. The day went
by really smoothly, with Rust’s ergonomy-focused ecosystem
providing a superior developer experience. But then Joan ran her
first smoke tests on a real system. Disbelief turned to
disappointment and helplessness when she realized that every third
request (on average) ended up in an error, even though the whole
database cluster reported to be in a healthy, operable state. That
meant a debugging session was in order! Unfortunately, the driver
Joan hastily picked for the foundation of her work, even though
open-source on its own, was just a thin wrapper over precompiled,
legacy C code, with no source to be found. Fueled by a strong
desire to solve the mystery and a healthy dose of fury, Joan spent
a few hours inspecting the network communication with
Wireshark, and she made an
educated guess that the
bug must be in the hashing key implementation. In the database
used by the company, keys are hashed to later route requests to
appropriate nodes. If a hash value is computed incorrectly, a
request may be forwarded to the wrong node that can refuse it and
return an error instead. Unable to verify the claim due to missing
source code, Joan decided on a simpler path — ditching the
originally chosen driver and reimplementing the solution on one of
the officially supported, open-source drivers backed by the
database vendor, with a solid user base and regularly updated
release schedule. Joan’s diary of lessons learned, part I The
initial lessons include: Choose a driver carefully. It’s at the
core of your code’s performance, robustness, and reliability.
Drivers have bugs too, and it’s impossible to avoid them. Still,
there are good practices to follow: Unless there’s a good reason,
prefer the officially supported driver (if it exists); Open-source
drivers have advantages: They’re not only verified by the
community, but also allow deep inspection of their code, and even modification of the driver code to get more insights for debugging;
It’s better to rely on drivers with a well-established release
schedule since they are more likely to receive bug fixes (including
for security vulnerabilities) in a reasonable period of time.
Wireshark is a great open-source tool for interpreting network
packets; give it a try if you want to peek under the hood of your
program. The introductory task was eventually completed
successfully, which made Joan ready to receive her first real
assignment. The tuning Armed with the experience gained working on
the introductory task, Joan started planning how to approach her
new assignment: a misbehaving app. One of the applications
notoriously caused stability issues for the whole system,
disrupting other workloads each time it experienced any problems.
The rogue app was already based on an officially supported driver,
so Joan could cross that one off the list of potential root causes.
This particular service was responsible for injecting data backed
up from the legacy system into the new database. Because the
company was not in a great hurry, the application was written with
low concurrency in mind to have low priority and not interfere with
user workloads. Unfortunately, once every few days something kept
triggering an anomaly. The normally peaceful application seemed to
be trying to perform a denial-of-service attack on its own
database, flooding it with requests until the backend got
overloaded enough to cause issues for other parts of the ecosystem.
As Joan watched metrics presented in a Grafana dashboard, clearly
suggesting that the rate of requests generated by this application
started spiking around the time of the anomaly, she wondered how on
Earth this workload could behave like that. It was, after all,
explicitly implemented to send new requests only when less than 100
of them were currently in progress. Since collaboration was heavily
advertised as one of the company’s “spirit and cultural
foundations” during the onboarding sessions with an onsite coach,
she decided it’s best to discuss the matter with her colleague,
Tony. “Look, Tony, I can’t wrap my head around this,” she
explained. “This service doesn’t send any new requests when 100 of
them are already in flight. And look right here in the logs: 100
requests in progress, one returned a timeout error, and…,” she then
stopped, startled at her own epiphany. “Alright, thanks Tony,
you’re a dear – best
rubber duck ever!,” she
concluded and returned to fixing the code. The observation that led
to discovering the root cause was rather simple: the request didn’t
actually *return* a timeout error because the database server never
sent back such a response. The request was simply qualified as
timed out by the driver, and discarded. But the sole fact that the
driver no longer waits for a response for a particular request does
not mean that the database is done processing it! It’s entirely
possible that the request was instead just stalled, taking longer
than expected, and only the driver gave up waiting for its
response. With that knowledge, it’s easy to imagine that once 100
requests time out on the client side, the app might erroneously
think that they are not in progress anymore, and happily submit 100
more requests to the database, increasing the total number of
in-flight requests (i.e., concurrency) to 200. Rinse, repeat, and
you can achieve extreme levels of concurrency on your database
cluster—even though the application was supposed to keep it limited
to a small number! Joan’s diary of lessons learned, part II The
lessons continue: Client-side timeouts are convenient for
programmers, but they can interact badly with server-side timeouts.
Rule of thumb: make the client-side timeouts around twice as long
as server-side ones, unless you have an extremely good reason to do
otherwise. Some drivers may be capable of issuing a warning if they
detect that the client-side timeout is smaller than the server-side
one, or even amend the server-side timeout to match, but in general
it’s best to double-check. Tasks with seemingly fixed concurrency
can actually cause spikes under certain unexpected conditions.
Inspecting logs and dashboards is helpful in investigating such
cases, so make sure that observability tools are available both in
the database cluster and for all client applications. Bonus points
for distributed tracing, like
OpenTelemetry integration. With
client-side timeouts properly amended, the application choked much
less frequently and to a smaller extent, but it still wasn’t a
perfect citizen in the distributed system. It occasionally picked a
victim database node and kept bothering it with too many requests,
while ignoring the fact that seven other nodes were considerably
less loaded and could help handle the workload too. At other times,
its concurrency was reported to be exactly 200% larger than
expected by the configuration. Whenever the two anomalies converged
in time, the poor node was unable to handle all requests it was
bombarded with, and had to give up on a fair portion of them. A
long study of the driver’s documentation, which was fortunately
available in
mdBook format and kept
reasonably up-to-date, helped Joan alleviate those pains too. The
first issue was simply a misconfiguration of the non-default load
balancing policy, which tried too hard to pick “the least loaded”
database node out of all the available ones, based on heuristics
and statistics occasionally updated by the database itself.
Malheureusement, this policy was also “best effort,” and relied on
the fact that statistics arriving from the database were always
legit – but a stressed database node could become so overloaded
that it wasn’t sending back updated statistics in time! That led
the driver to falsely believe that this particular server was not
actually busy at all. Joan decided that this setup was a premature
optimization that turned out to be a footgun, so she just restored
the original default policy, which worked as expected. The second
issue (temporary doubling of the concurrency) was caused by another
misconfiguration: an overeager speculative retry policy. After
waiting for a preconfigured period of time without getting an
acknowledgment from the database, drivers would speculatively
resend a request to maximize its chances to succeed. This mechanism
is very useful to increase requests’ success rate. However, if the
original request also succeeds, it means that the speculative one
was sent in vain. In order to balance the pros and cons,
speculative retry should be configured to only resend requests if
it’s very likely that the original one failed. Otherwise, as in
Joan’s case, the speculative retry may act too soon, doubling the
number of requests sent (and thus also doubling concurrency)
without improving the success rate at all. Whew, nothing gives a
simultaneous endorphin rush and dopamine hit like a quality
debugging session that ends in an astounding success (except
writing a cheesy story in a deeply technical book, naturally).
Great job, Joan! The end.
Editor’s note: If you
made it this far and can’t get enough of cheesy database
performance stories, see what happened to poor old Patrick in
“
A Tale
of Database Performance Woes: Patrick’s Unlucky Green Fedoras.”
And if you appreciate this sense of humor, see Piotr’s
new
book on writing engineering blog posts.
10 December 2024, 1:05 pm by
ScyllaDB
How Joseph Shorter and Miles Ward led a fast, safe
migration with ScyllaDB’s DynamoDB-compatible API
Digital Turbine is a quiet but powerful player in the mobile ad
tech business. Their platform is preinstalled on Android phones,
connecting app developers, advertisers, mobile carriers, and device
manufacturers. In the process, they bring in $500M annually. And if
their database goes down, their business goes down. Digital Turbine
recently decided to standardize on Google Cloud – so continuing
with their DynamoDB database was no longer an option. They had to
move fast without breaking things. Joseph Shorter (VP, Platform
Architecture at Digital Turbine) teamed up with Miles Ward (CTO at
SADA) and devised a game plan to pull off the move. Spoiler: they
not only moved fast, but also ended up with an approach that was
even faster…and less expensive too. You can hear directly from Joe
and Miles in this conference talk: We’ve captured some highlights
from their discussion below. Why migrate from DynamoDB The tipping
point for the DynamoDB migration was Digital Turbine’s
decision to standardize on GCP following a series of
acquisitions. But that wasn’t the only issue. DynamoDB hadn’t been
ideal from a cost perspective or from a performance perspective.
Joe explained: “It can be a little expensive as you scale, to be
honest. We were finding some performance issues. We were doing a
ton of reads—90% of all interactions with DynamoDB were read
operations. With all those operations, we found that the
performance hits required us to scale up more than we wanted, which
increased costs.” Their DynamoDB migration requirements Digital
Turbine needed the migration to be as fast and low-risk as
possible, which meant keeping application refactoring to a minimum.
The main concern, according to Joe, was “How can we migrate without
radically refactoring our platform, while maintaining at least the
same performance and value, and avoiding a crash-and-burn
situation? Because if it failed, it would take down our whole
company.” They approached SADA, who helped them think through a
few options – including some Google-native solutions and ScyllaDB.
ScyllaDB stood out due to its DynamoDB API, ScyllaDB Alternator.
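To illustrate what “just use the DynamoDB API” means in practice: with ScyllaDB Alternator, an application typically keeps its existing DynamoDB SDK code and points it at the ScyllaDB cluster instead of AWS. Here is a minimal boto3 sketch; the endpoint, port, table name, and credentials are placeholders, not Digital Turbine’s actual configuration.

import boto3

dynamodb = boto3.resource(
    "dynamodb",
    endpoint_url="http://scylla-alternator.internal:8000",  # placeholder Alternator endpoint
    region_name="us-east-1",         # required by the SDK; not used for routing here
    aws_access_key_id="alternator",  # Alternator can be set up to verify or ignore credentials
    aws_secret_access_key="secret",
)

table = dynamodb.Table("user_profiles")  # placeholder table name
table.put_item(Item={"user_id": "123", "last_seen": "2024-12-10"})
item = table.get_item(Key={"user_id": "123"}).get("Item")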
What the DynamoDB migration entailed In summary, it was “as easy as
pudding pie” (quoting Joe here). But a little more detail: “There
is a DynamoDB API that we could just use. I won’t say there was no
refactoring. We did some refactoring to make it easy for engineers
to plug in this information, but it was straightforward. It took
less than a sprint to write that code. That was awesome. Everyone
had told us that ScyllaDB was supposed to be a lot faster. Our
reaction was, ‘Sure, every competitor says their product performs
better.’ We did a lot with DynamoDB at scale, so we were skeptical.
We decided to do a proper POC—not just some simple communication
with ScyllaDB compared to DynamoDB. We actually put up multiple
apps with some dependencies and set it up the way it actually
functions in AWS, then we pressure-tested it. We couldn’t afford
any mistakes—a mistake here means the whole company would go down.
The goal was to make sure, first, that it would work and, second,
that it would actually perform. And it turns out, it delivered on
all its promises. That was a huge win for us.” Results so far –
with minimal cluster utilization Beyond meeting their primary goal
of moving off AWS, the Digital Turbine team improved performance –
and they ended up reducing their costs a bit too, as an added
benefit. From Joe: “I think part of it comes down to the fact that
the performance is just better. We didn’t know what to expect
initially, so we scaled things to be pretty comparable. What we’re
finding is that it’s simply running better. Because of that, we
don’t need as much infrastructure. And we’re barely tapping the
ScyllaDB clusters at all right now. A 20% cost difference—that’s a
big number, no matter what you’re talking about. And when you
consider our plans to scale even further, it becomes even more
significant. In the industry we’re in, there are only a few major
players—Google, Facebook, and then everyone else. Digital Turbine
has carved out a chunk of this space, and we have the tools as a
company to start competing in ways others can’t. As we gain more
customers and more people say, ‘Hey, we like what you’re doing,’ we
need to scale radically. That 20% cost difference is already
significant now, and in the future, it could be massive. Better
performance and better pricing—it’s hard to ask for much more than
that. You’ve got to wonder why more people haven’t noticed this
yet.”
Learn more
about the difference between ScyllaDB and DynamoDB Compare
costs: ScyllaDB vs DynamoDB