Apache Cassandra 2024 Wrapped: A Year of Innovation and Growth
When Spotify launched its wrapped campaign years ago, it tapped into something we all love - taking a moment to look back and celebrate achievements. As we wind down 2024, we thought it would be fun to do our own "wrapped" for Apache Cassandra®. And what a year it's been! From groundbreaking...Why We’re Moving to a Source Available License
TL;DR ScyllaDB has decided to focus on a single release stream – ScyllaDB Enterprise. Starting with the ScyllaDB Enterprise 2025.1 release (ETA February 2025): ScyllaDB Enterprise will change from closed source to source available. ScyllaDB OSS AGPL 6.2 will stand as the final OSS AGPL release. A free tier of the full-featured ScyllaDB Enterprise will be available to the community. This includes all the performance, efficiency, and security features previously reserved for ScyllaDB Enterprise. For convenience, the existing ScyllaDB Enterprise 2024.2 will gain the new source available license starting from our next path release (in December), allowing easy migration of older releases. The source available Scylla Manager will move to AGPL and the closed source Kubernetes multi-region operator will be merged with the main Apache-licensed ScyllaDB Kubernetes operator. Other ScyllaDB components (e.g., Seastar, Kubernetes operator, drivers) will keep their current licenses. Why are we doing this? ScyllaDB’s team has always been extremely passionate about open source, low-level optimizations, and the delivery of groundbreaking core technologies – from hypervisors (KVM, Xen), to operating systems (Linux, OSv), and the ScyllaDB database. Over our 12 years of existence, we developed an OS, pivoted to the database space, developed Seastar (the open source standalone core engine of ScyllaDB), and developed ScyllaDB itself. Dozens of open source projects were created: drivers, a Kubernetes operator, test harnesses, and various tools. Open source is an outstanding way to share innovation. It is a straightforward choice for projects that are not your core business. However, it is a constant challenge for vendors whose core product is open source. For almost a decade, we have been maintaining two separate release streams: one for the open source database and one for the enterprise product. Balancing the free vs. paid offerings is a never-ending challenge that involves engineering, product, marketing, and constant sales discussions. Unlike other projects that decided to switch to source available or BSL to protect themselves from “free ride” competition, we were comfortable with AGPL. We took different paths, from the initial reimplementation of the Apache Cassandra API, to an open source implementation of a DynamoDB-compatible API. Beyond the license, we followed the whole approach of ‘open source first.’ Almost every line of code – from a new feature, to a bug fix – went to the open source branch first. We were developing two product lines that competed with one another, and we had to make one of them dramatically better. It’s hard enough to develop a single database and support Docker, Kubernetes, virtual and physical machines, and offer a database-as-a-service. The value of developing two separate database products, along with their release trains, ultimately does not justify the massive overhead and incremental costs required. To give you some idea of what’s involved, we have had nearly 60 public releases throughout 2024. Moreover, we have been the single significant contributor of the source code. Our ecosystem tools have received a healthy amount of contributions, but not the core database. That makes sense. The ScyllaDB internal implementation is a C++, shard-per-core, future-promise code base that is extremely hard to understand and requires full-time devotion. Thus source-wise, in terms of the code, we operated as a full open-source-first project. However, in reality, we benefitted from this no more than as a source-available project. “Behind the curtain” tradeoffs of free vs paid Balancing our requirements (of open source first, efficient development, no crippling of our OSS, and differentiation between the two branches) has been challenging, to say the least. Our open source first culture drove us to develop new core features in the open. Our engineers released these features before we were prepared to decide what was appropriate for open source and what was best for the enterprise paid offering. For example, Tablets, our recent architectural shift, was all developed in the open – and 99% of its end user value is available in the OSS release. As the Enterprise version branched out of the OSS branch, it was helpful to keep a unified base for reuse and efficiency. However, it reduced our paid version differentiation since all features were open by default (unless flagged). For a while, we thought that the OSS release would be the latest and greatest and have a short lifecycle as a differentiation and a means of efficiency. Although maintaining this process required a lot of effort on our side, this could have been a nice mitigation option, a replacement for a feature/functionality gap between free and paid. However, the OSS users didn’t really use the latest and didn’t always upgrade. Instead, most users preferred to stick to old, end-of-life releases. The result was a lose-lose situation (for users and for us). Another approach we used was to differentiate by using peripheral tools – such as Scylla Manager, which helps to operate ScyllaDB (e.g., running backup/restore and managing repairs) – and having a usage limit on them. Our Kubernetes operator is open source and we added a separate closed source repository for multi-region support for Kubernetes. This is a complicated path for development and also for our paying users. The factor that eventually pushed us over the line is that our new architecture – with Raft, tablets, and native S3 – moves peripheral functionality into the core database: Our backup and restore implementation moves from an agent and external manager into the core database. S3 I/O access for backup and restore (and, in the future, for tiered storage) is handled directly by the core database. The I/O operations are controlled by our schedulers, allowing full prioritization and bandwidth control. Later on, “point in time recovery” will be provided. This is a large overhaul unification change, eliminating complexity while improving control. Repair becomes automatic. Repair is a full-scan, backend process that merges inconsistent replica data. Previously, it was controlled by the external Scylla Manager. The new generation core database runs its own automatic repair with tablet awareness. As a result, there is no need for an external peripheral tool; repair will become transparent to the user, like compaction is today. These changes are leading to a more complete core product, with better manageability and functionality. However, they eat into the differentiators for our paid offerings. As you can see, a combination of architecture consolidations, together with multiple release stream efforts, have made our lives extremely complicated and slowed down our progress. Going forward After a tremendous amount of thought and discussion on these points, we decided to unify the two release streams as described at the start of this post. This license shift will allow us to better serve our customers as well as provide increased free tier value to the community. The new model opens up access to previously-restricted capabilities that: Achieve up to 50% higher throughput and 33% lower latency via profile-guided optimization Speed up node addition/removal by 30X via file-based streaming Balance multiple workloads with different performance needs on a single cluster via workload prioritization Reduce network costs with ZSTD-based network compression (with a shard dictionary) for intra-node RPC Combine the best of Leveled Compaction Strategy and Size-tiered Compaction Strategy with Incremental Compaction Strategy – resulting in 35% better storage utilization Use encryption at rest, LDAP integration, and all of the other benefits of the previous closed source Enterprise version Provide a single (all open source) Kubernetes operator for ScyllaDB Enable users to enjoy a longer product life cycle This was a difficult decision for us, and we know it might not be well-received by some of our OSS users running large ScyllaDB clusters. We appreciate your journey and we hope you will continue working with ScyllaDB. After 10 years, we believe this change is the right move for our company, our database, our customers, and our early adopters. With this shift, our team will be able to move faster, better respond to your needs, and continue making progress towards the major milestones on our roadmap: Raft for data, optimized tablet elasticity, and tiered (S3) storage. Read the FAQA Tale from Database Performance at Scale
The following is an excerpt from Chapter 1 of Database Performance at Scale (an Open Access book that’s available for free). Follow Joan’s highly fictionalized adventures with some all-too-real database performance challenges. You’ll laugh. You’ll cry. You’ll wonder how we worked this “cheesy story” into a deeply technical book. Get the complete book, free Lured in by impressive buzzwords like “hybrid cloud,” “serverless,” and “edge first,” Joan readily joined a new company and started catching up with their technology stack. Her first project recently started a transition from their in-house implementation of a database system, which turned out to not scale at the same pace as the number of customers, to one of the industry-standard database management solutions. Their new pick was a new distributed database, which, contrarily to NoSQL, strives to keep the original ACID guarantees known in the SQL world. Due to a few new data protection acts that tend to appear annually nowadays, the company’s board decided that they were going to maintain their own datacenter, instead of using one of the popular cloud vendors for storing sensitive information. On a very high level, the company’s main product consisted of only two layers: The frontend, the entry point for users, which actually runs in their own browsers and communicates with the rest of the system to exchange and persist information. The everything-else, customarily known as “backend,” but actually including load balancers, authentication, authorization, multiple cache layers, databases, backups, and so on. Joan’s first introductory task was to implement a very simple service for gathering and summing up various statistics from the database, and integrate that service with the whole ecosystem, so that it fetches data from the database in real-time and allows the DevOps teams to inspect the statistics live. To impress the management and reassure them that hiring Joan was their absolutely best decision this quarter, Joan decided to deliver a proof-of-concept implementation on her first day! The company’s unspoken policy was to write software in Rust, so she grabbed the first driver for their database from a brief crates.io search and sat down to her self-organized hackathon. The day went by really smoothly, with Rust’s ergonomy-focused ecosystem providing a superior developer experience. But then Joan ran her first smoke tests on a real system. Disbelief turned to disappointment and helplessness when she realized that every third request (on average) ended up in an error, even though the whole database cluster reported to be in a healthy, operable state. That meant a debugging session was in order! Unfortunately, the driver Joan hastily picked for the foundation of her work, even though open-source on its own, was just a thin wrapper over precompiled, legacy C code, with no source to be found. Fueled by a strong desire to solve the mystery and a healthy dose of fury, Joan spent a few hours inspecting the network communication with Wireshark, and she made an educated guess that the bug must be in the hashing key implementation. In the database used by the company, keys are hashed to later route requests to appropriate nodes. If a hash value is computed incorrectly, a request may be forwarded to the wrong node that can refuse it and return an error instead. Unable to verify the claim due to missing source code, Joan decided on a simpler path — ditching the originally chosen driver and reimplementing the solution on one of the officially supported, open-source drivers backed by the database vendor, with a solid user base and regularly updated release schedule. Joan’s diary of lessons learned, part I The initial lessons include: Choose a driver carefully. It’s at the core of your code’s performance, robustness, and reliability. Drivers have bugs too, and it’s impossible to avoid them. Still, there are good practices to follow: Unless there’s a good reason, prefer the officially supported driver (if it exists); Open-source drivers have advantages: They’re not only verified by the community, but also allow deep inspection of its code, and even modifying the driver code to get even more insights for debugging; It’s better to rely on drivers with a well-established release schedule since they are more likely to receive bug fixes (including for security vulnerabilities) in a reasonable period of time. Wireshark is a great open-source tool for interpreting network packets; give it a try if you want to peek under the hood of your program. The introductory task was eventually completed successfully, which made Joan ready to receive her first real assignment. The tuning Armed with the experience gained working on the introductory task, Joan started planning how to approach her new assignment: a misbehaving app. One of the applications notoriously caused stability issues for the whole system, disrupting other workloads each time it experienced any problems. The rogue app was already based on an officially supported driver, so Joan could cross that one off the list of potential root causes. This particular service was responsible for injecting data backed up from the legacy system into the new database. Because the company was not in a great hurry, the application was written with low concurrency in mind to have low priority and not interfere with user workloads. Unfortunately, once every few days something kept triggering an anomaly. The normally peaceful application seemed to be trying to perform a denial-of-service attack on its own database, flooding it with requests until the backend got overloaded enough to cause issues for other parts of the ecosystem. As Joan watched metrics presented in a Grafana dashboard, clearly suggesting that the rate of requests generated by this application started spiking around the time of the anomaly, she wondered how on Earth this workload could behave like that. It was, after all, explicitly implemented to send new requests only when less than 100 of them were currently in progress. Since collaboration was heavily advertised as one of the company’s “spirit and cultural foundations” during the onboarding sessions with an onsite coach, she decided it’s best to discuss the matter with her colleague, Tony. “Look, Tony, I can’t wrap my head around this,” she explained. “This service doesn’t send any new requests when 100 of them are already in flight. And look right here in the logs: 100 requests in progress, one returned a timeout error, and…,” she then stopped, startled at her own epiphany. “Alright, thanks Tony, you’re a dear – best rubber duck ever!,” she concluded and returned to fixing the code. The observation that led to discovering the root cause was rather simple: the request didn’t actually *return* a timeout error because the database server never sent back such a response. The request was simply qualified as timed out by the driver, and discarded. But the sole fact that the driver no longer waits for a response for a particular request does not mean that the database is done processing it! It’s entirely possible that the request was instead just stalled, taking longer than expected, and only the driver gave up waiting for its response. With that knowledge, it’s easy to imagine that once 100 requests time out on the client side, the app might erroneously think that they are not in progress anymore, and happily submit 100 more requests to the database, increasing the total number of in-flight requests (i.e., concurrency) to 200. Rinse, repeat, and you can achieve extreme levels of concurrency on your database cluster—even though the application was supposed to keep it limited to a small number! Joan’s diary of lessons learned, part II The lessons continue: Client-side timeouts are convenient for programmers, but they can interact badly with server-side timeouts. Rule of thumb: make the client-side timeouts around twice as long as server-side ones, unless you have an extremely good reason to do otherwise. Some drivers may be capable of issuing a warning if they detect that the client-side timeout is smaller than the server-side one, or even amend the server-side timeout to match, but in general it’s best to double-check. Tasks with seemingly fixed concurrency can actually cause spikes under certain unexpected conditions. Inspecting logs and dashboards is helpful in investigating such cases, so make sure that observability tools are available both in the database cluster and for all client applications. Bonus points for distributed tracing, like OpenTelemetry integration. With client-side timeouts properly amended, the application choked much less frequently and to a smaller extent, but it still wasn’t a perfect citizen in the distributed system. It occasionally picked a victim database node and kept bothering it with too many requests, while ignoring the fact that seven other nodes were considerably less loaded and could help handle the workload too. At other times, its concurrency was reported to be exactly 200% larger than expected by the configuration. Whenever the two anomalies converged in time, the poor node was unable to handle all requests it was bombarded with, and had to give up on a fair portion of them. A long study of the driver’s documentation, which was fortunately available in mdBook format and kept reasonably up-to-date, helped Joan alleviate those pains too. The first issue was simply a misconfiguration of the non-default load balancing policy, which tried too hard to pick “the least loaded” database node out of all the available ones, based on heuristics and statistics occasionally updated by the database itself. Malheureusement, this policy was also “best effort,” and relied on the fact that statistics arriving from the database were always legit – but a stressed database node could become so overloaded that it wasn’t sending back updated statistics in time! That led the driver to falsely believe that this particular server was not actually busy at all. Joan decided that this setup was a premature optimization that turned out to be a footgun, so she just restored the original default policy, which worked as expected. The second issue (temporary doubling of the concurrency) was caused by another misconfiguration: an overeager speculative retry policy. After waiting for a preconfigured period of time without getting an acknowledgment from the database, drivers would speculatively resend a request to maximize its chances to succeed. This mechanism is very useful to increase requests’ success rate. However, if the original request also succeeds, it means that the speculative one was sent in vain. In order to balance the pros and cons, speculative retry should be configured to only resend requests if it’s very likely that the original one failed. Otherwise, as in Joan’s case, the speculative retry may act too soon, doubling the number of requests sent (and thus also doubling concurrency) without improving the success rate at all. Whew, nothing gives a simultaneous endorphin rush and dopamine hit like a quality debugging session that ends in an astounding success (except writing a cheesy story in a deeply technical book, naturally). Great job, Joan! The end. Editor’s note: If you made it this far and can’t get enough of cheesy database performance stories, see what happened to poor old Patrick in “A Tale of Database Performance Woes: Patrick’s Unlucky Green Fedoras.” And if you appreciate this sense of humor, see Piotr’s new book on writing engineering blog posts.How Digital Turbine Moved DynamoDB Workloads to GCP – In Just One Sprint
How Joseph Shorter and Miles Ward led a fast, safe migration with ScyllaDB’s DynamoDB-compatible API Digital Turbine is a quiet but powerful player in the mobile ad tech business. Their platform is preinstalled on Android phones, connecting app developers, advertisers, mobile carriers, and device manufacturers. In the process, they bring in $500M annually. And if their database goes down, their business goes down. Digital Turbine recently decided to standardize on Google Cloud – so continuing with their DynamoDB database was no longer an option. They had to move fast without breaking things. Joseph Shorter (VP, Platform Architecture at Digital Turbine) teamed up with Miles Ward (CTO at SADA) and devised a game plan to pull off the move. Spoiler: they not only moved fast, but also ended up with an approach that was even faster…and less expensive too. You can hear directly from Joe and Miles in this conference talk: We’ve captured some highlights from their discussion below. Why migrate from DynamoDB The tipping point for the DynamoDB migration was Digital Turbine’s decision to standardize on GCP following a series of acquisitions. But that wasn’t the only issue. DynamoDB hadn’t been ideal from a cost perspective or from a performance perspective. Joe explained: “It can be a little expensive as you scale, to be honest. We were finding some performance issues. We were doing a ton of reads—90% of all interactions with DynamoDB were read operations. With all those operations, we found that the performance hits required us to scale up more than we wanted, which increased costs.” Their DynamoDB migration requirements Digital Turbine needed the migration to be as fast and low-risk as possible, which meant keeping application refactoring to a minimum. The main concern, according to Joe, was “How can we migrate without radically refactoring our platform, while maintaining at least the same performance and value, and avoiding a crash-and-burn situation? Because if it failed, it would take down our whole company. “ They approached SADA, who helped them think through a few options – including some Google-native solutions and ScyllaDB. ScyllaDB stood out due to its DynamoDB API, ScyllaDB Alternator. What the DynamoDB migration entailed In summary, it was “as easy as pudding pie” (quoting Joe here). But a little more detail: “There is a DynamoDB API that we could just use. I won’t say there was no refactoring. We did some refactoring to make it easy for engineers to plug in this information, but it was straightforward. It took less than a sprint to write that code. That was awesome. Everyone had told us that ScyllaDB was supposed to be a lot faster. Our reaction was, ‘Sure, every competitor says their product performs better.’ We did a lot with DynamoDB at scale, so we were skeptical. We decided to do a proper POC—not just some simple communication with ScyllaDB compared to DynamoDB. We actually put up multiple apps with some dependencies and set it up the way it actually functions in AWS, then we pressure-tested it. We couldn’t afford any mistakes—a mistake here means the whole company would go down. The goal was to make sure, first, that it would work and, second, that it would actually perform. And it turns out, it delivered on all its promises. That was a huge win for us.” Results so far – with minimal cluster utilization Beyond meeting their primary goal of moving off AWS, the Digital Turbine team improved performance – and they ended up reducing their costs a bit too, as an added benefit. From Joe: “I think part of it comes down to the fact that the performance is just better. We didn’t know what to expect initially, so we scaled things to be pretty comparable. What we’re finding is that it’s simply running better. Because of that, we don’t need as much infrastructure. And we’re barely tapping the ScyllaDB clusters at all right now. A 20% cost difference—that’s a big number, no matter what you’re talking about. And when you consider our plans to scale even further, it becomes even more significant. In the industry we’re in, there are only a few major players—Google, Facebook, and then everyone else. Digital Turbine has carved out a chunk of this space, and we have the tools as a company to start competing in ways others can’t. As we gain more customers and more people say, ‘Hey, we like what you’re doing,’ we need to scale radically. That 20% cost difference is already significant now, and in the future, it could be massive. Better performance and better pricing—it’s hard to ask for much more than that. You’ve got to wonder why more people haven’t noticed this yet.” Learn more about the difference between ScyllaDB and DynamoDB Compare costs: ScyllaDB vs DynamoDBInnovative data compression for time series: An open source solution
Introduction
There’s no escaping the role that monitoring plays in our everyday lives. Whether it’s from monitoring the weather or the number of steps we take in a day, or computer systems to ever-popular IoT devices.
Practically any activity can be monitored in one form or another these days. This generates increasing amounts of data to be pored over and analyzed–but storing all this data adds significant costs over time. Given this huge amount of data that only increases with each passing day, efficient compression techniques are crucial.
Here at NetApp® Instaclustr we saw a great opportunity to improve the current compression techniques for our time series data. That’s why we created the Advanced Time Series Compressor (ATSC) in partnership with University of Canberra through the OpenSI initiative.
ATSC is a groundbreaking compressor designed to address the challenges of efficiently compressing large volumes of time-series data. Internal test results with production data from our database metrics showed that ATSC would compress, on average of the dataset, ~10x more than LZ4 and ~30x more than the default Prometheus compression. Check out ATSC on GitHub.
There are so many compressors already, so why develop another one?
While other compression methods like LZ4, DoubleDelta, and ZSTD are lossless, most of our timeseries data is already lossy. Timeseries data can be lossy from the beginning due to under-sampling or insufficient data collection, or it can become lossy over time as metrics are rolled over or averaged. Because of this, the idea of a lossy compressor was born.
ATSC is a highly configurable, lossy compressor that leverages the characteristics of time-series data to create function approximations. ATSC finds a fitting function and stores the parametrization of that function—no actual data from the original timeseries is stored. When the data is decompressed, it isn’t identical to the original, but it is still sufficient for the intended use.
Here’s an example: for a temperature change metric—which mostly varies slowly (as do a lot of system metrics!)—instead of storing all the points that have a small change, we fit a curve (or a line) and store that curve/line achieving significant compression ratios.
Image 1: ATSC data for temperature
How does ATSC work?
ATSC looks at the actual time series, in whole or in parts, to find how to better calculate a function that fits the existing data. For that, a quick statistical analysis is done, but if the results are inconclusive a sample is compressed with all the functions and the best function is selected.
By default, ATSC will segment the data—this guarantees better local fitting, more and smaller computations, and less memory usage. It also ensures that decompression targets a specific block instead of the whole file.
In each fitting frame, ATSC will create a function from a pre-defined set and calculate the parametrization of said function.
ATSC currently uses one (per frame) of those following functions:
- FFT (Fast Fourier Transforms)
- Constant
- Interpolation – Catmull-Rom
- Interpolation – Inverse Distance Weight
Image 2: Polynomial fitting vs. Fast-Fourier Transform fitting
These methods allow ATSC to compress data with a fitting error within 1% (configurable!) of the original time-series.
For a more detailed insight into ATSC internals and operations check our paper!
Use cases for ATSC and results
ATSC draws inspiration from established compression and signal analysis techniques, achieving compression ratios ranging from 46x to 880x with a fitting error within 1% of the original time-series. In some cases, ATSC can produce highly compressed data without losing any meaningful information, making it a versatile tool for various applications (please see use cases below).
Some results from our internal tests comparing to LZ4 and normal Prometheus compression yielded the following results:
Method | Compressed size (bytes) | Compression Ratio |
Prometheus | 454,778,552 | 1.33 |
LZ4 | 141,347,821 | 4.29 |
ATSC | 14,276,544 | 42.47 |
Another characteristic is the trade-off between fast compression speed vs. slower compression speed. Compression is about 30x slower than decompression. It is expected that time-series are compressed once but decompressed several times.
Image 3: A better fitting (purple) vs. a loose fitting (red). Purple takes twice as much space.
ATSC is versatile and can be applied in various scenarios where space reduction is prioritized over absolute precision. Some examples include:
- Rolled-over time series: ATSC can offer significant space savings without meaningful loss in precision, such as metrics data that are rolled over and stored for long term. ATSC provides the same or more space savings but with minimal information loss.
- Under-sampled time series: Increase sample rates without losing space. Systems that have very low sampling rates (30 seconds or more) and as such, it is very difficult to identify actual events. ATSC provides the space savings and keeps the information about the events.
- Long, slow-moving data series: Ideal for patterns that are easy to fit, such as weather data.
- Human visualization: Data meant for human analysis, with minimal impact on accuracy, such as historic views into system metrics (CPU, Memory, Disk, etc.)
Image 4: ATSC data (green) with an 88x compression vs. the original data (yellow)
Using ATSC
ATSC is written in Rust as and is available in GitHub. You can build and run yourself following these instructions.
Future work
Currently, we are planning to evolve ATSC in two ways (check our open issues):
- Adding features to the core compressor
focused on
these functionalities:
- Frame expansion for appending new data to existing frames
- Dynamic function loading to add more functions without altering the codebase
- Global and per-frame error storage
- Improved error encoding
- Integrations with
additional
technologies (e.g.
databases):
- We are currently looking into integrating ASTC with ClickHouse® and Apache Cassandra®
CREATE TABLE sensors_poly ( sensor_id UInt16, location UInt32, timestamp DateTime, pressure Float64 CODEC(ATSC('Polynomial', 1)), temperature Float64 CODEC(ATSC('Polynomial', 1)), ) ENGINE = MergeTree ORDER BY (sensor_id, location, timestamp);
Image 5: Currently testing ClickHouse integration
Sound interesting? Try it out and let us know what you think.
ATSC represents a significant advancement in time-series data compression, offering high compression ratios with a configurable accuracy loss. Whether for long-term storage or efficient data visualization, ATSC is a powerful open source tool for managing large volumes of time-series data.
But don’t just take our word for it—download and run it!
Check our documentation for any information you need and submit ideas for improvements or issues you find using GitHub issues. We also have easy first issues tagged if you’d like to contribute to the project.
Want to integrate this with another tool? You can build and run our demo integration with ClickHouse.
The post Innovative data compression for time series: An open source solution appeared first on Instaclustr.
ScyllaDB 2024.2 Introduces New Efficiency & Elasticity via “Tablets”
New capabilities introduced in ScyllaDB Enterprise make scaling operations with Tablets up to 30X faster while reducing network costs by up to 50% ScyllaDB just released ScyllaDB 2024.2, the first enterprise release featuring ScyllaDB’s new “tablets” replication architecture. This new architecture, which builds upon a multiyear project to implement and extend the Raft consensus protocol, enables new levels of elasticity, speed, simplicity, and efficiency. The enterprise release offers new capabilities designed to help you reduce infrastructure costs and streamline operations: Tablets, a dynamic data distribution architecture that significantly improves elasticity and scalability (limited availability – details in release notes) File-based streaming for tablets further speeds up scaling operations (e.g., adding and removing nodes) Strongly consistent topology updates, authentication updates, service levels (workload prioritization) ZSTD-based network compression for intra-node RPC, with a shard dictionary for improved performance DynamoDB API (Alternator) enhancements such as role-based access control See the release notes for details The ScyllaDB Enterprise 2024.2 release is based on ScyllaDB 6.0; it includes *all* the features available in ScyllaDB 6.0 like: Tablets, a dynamic way to distribute data across nodes that significantly improves scalability Strongly consistent topology, Auth, and Service Level updates In addition, 2024.2 includes enterprise-only features such as: Improved network compression (see below) File-based streaming for tablets A new FIPS enabled Docker Image Related links: Read more about ScyllaDB Enterprise Get ScyllaDB Enterprise 2024.2 (customers only, or 30-day evaluation) Upgrade from ScyllaDB Enterprise 2024.1.x to 2024.2.y Upgrade from ScyllaDB Open Source 6.0 to ScyllaDB Enterprise 2024.2.x ScyllaDB Enterprise customers are encouraged to upgrade to ScyllaDB Enterprise 2024.2, and are welcome to contact our Support Team with questions. Tablets In this release, ScyllaDB enabled Tablets, a new data distribution algorithm as a better alternative to the legacy vNodes approach inherited from Apache Cassandra. While the vNodes approach statically distributes all tables across all nodes and shards based on the token ring, the Tablets approach dynamically distributes each table to a subset of nodes and shards based on its size. In the future, distribution will use CPU, OPS, and other information to further optimize the distribution. In particular, Tablets provide the following: Faster scaling and topology changes. New nodes can start serving reads and writes as soon as the first Tablet is migrated. Together with Strongly Consistent Topology Updates (below), this also allows users to add multiple nodes simultaneously. Automatic support for mixed clusters with different core counts. Manual vNode updates are not required. More efficient operations on small tables., since such tables are placed on a small subset of nodes and shards. Note that you can run a cluster with some of the Keyspaces with Tablets disabled, and some with tablets enabled for as long as you wish. The scaling improvement will be partial, and limited to Keyspaces with Tables enabled. Read the Tablets documentation Currently, tablets are ideal for new clusters that you frequently scale out or in and that have one main large table and potentially many tiny ones. ScyllaDB Support can help you determine if Tablets in release 2024.2 are a good solution for your use case. Learn more about tablets: Smooth Scaling: Why ScyllaDB Moved to “Tablets” Data Distribution: The rationale behind ScyllaDB’s new “tablets” replication architecture ScyllaDB Fast Forward: True Elastic Scale: Introducing major architectural shifts that enable new levels of elasticity and operational simplicity Elasticity, Speed & Simplicity: Get the Most Out of New ScyllaDB Capabilities: A technical walkthrough of exactly what’s changed from the user/operator perspective Monitoring Tablets To Monitor Tablets in real time, upgrade ScyllaDB Monitoring Stack to release 4.8, and use the new dynamic Tablet panels, below. Driver Support The Following Drivers support Tablets Java driver 4.x, from 4.18.0.2 Java driver 3.x, from 3.11.5.2 Python driver, from 3.26.6 Gocql driver, from 1.13.0 Rust driver from 0.13.0 Legacy ScyllaDB and Apache Cassandra drivers will continue to work with ScyllaDB but will be less efficient when working with tablet-based Keyspaces. Read the blog post “How We Updated ScyllaDB Drivers for Tablets Elasticity” File based streaming for Tablets File-based streaming is a ScyllaDB Enterprise-only feature that optimizes tablet migration. In ScyllaDB Open Source, migrating tablets is performed by streaming mutation fragments, which involves deserializing SSTable files into mutation fragments and re-serializing them back into SSTables on the other node. In ScyllaDB Enterprise, migrating tablets is performed by streaming entire SStables, which does not require (de)serializing or processing mutation fragments. As a result, less data is streamed over the network, and less CPU is consumed, especially for data models that contain small cells. File-based streaming is used for tablet migration in all keyspaces created with tablets enabled. Read the file-based streaming documentation Consistent Topology and Metadata Strongly Consistent Topology Updates With Raft-managed topology enabled, all topology operations are internally sequenced consistently. A centralized coordination process ensures that topology metadata is synchronized across the nodes on each step of a topology change procedure. This makes topology updates fast and safe, as the cluster administrator can trigger many topology operations concurrently, and the coordination process will safely drive all of them to completion. For example, multiple nodes can be bootstrapped concurrently, which couldn’t be done with the previous gossip-based topology. Strongly Consistent Topology Updates is now the default for new clusters, and should be enabled after upgrade for existing clusters. Read the blog post “ScyllaDB’s Safe Topology and Schema Changes on Raft” Strongly Consistent Auth Updates System-auth-2 is a reimplementation of the Authentication and Authorization systems in a strongly consistent way on top of the Raft sub-system. This means that Role-Based Access Control (RBAC) commands like create role or grant permission are safe to run in parallel without a risk of getting out of sync with themselves and other metadata operations, like schema changes. As a result, there is no need to update system_auth RF or run repair when adding a Data Center. Strongly Consistent Service Levels Service Levels allow you to define attributes like timeout per workload. Service levels are now strongly consistent using Raft, like Schema, Topology and Auth. Improved network compression for intra-node RPC This release adds new Enterprise only RPC compression improvements for node to node communication: Using zstd instead of lz4 Using a shared dictionary re-trained periodically on the traffic, instead of the message by message compression. Below is a comparison of compressions algorithms on different types of data. Note that dictionary based compression can be used with either lz4 or zstd. The actual compression is very much workload-dependent and can vary between use cases. Alternator Role-Based Access Control Alternator supports Role-Based Access Control (RBAC) for authorization. Control is done via the CQL API. Native Nodetool The nodetool utility provides simple command-line interface operations and attributes. ScyllaDB inherited the Java based nodetool from Apache Cassandra. In this release, the Java implementation was replaced with a backward-compatible native nodetool. The native nodetool works much faster. Unlike the Java version ,the native nodetool is part of the ScyllaDB repo, and allows easier and faster updates. With the native nodetool, the JMX server has become redundant and will no longer be part of the default ScyllaDB Installation or image. If you are using the JMX server directly, not via nodetool, you can either: Work directly with the ScyllaDB REST API (recommended) Install the JMX server yourself. See https://github.com/scylladb/scylla-jmx for instructions. As part of moving to native tooling and away from Java tools, we will deprecate SSTableloader in future versions of ScyllaDB Enterprise. You can use the Load and Stream to upload SSTables directly to ScyllaDB, either from Apache Cassandra or other ScyllaDB clusters. We are also deprecating the Java version of nodetool, which was replaced by a compatible native version (see above). Maintenance Maintenance mode is a new mode in which the node does not communicate with clients or other nodes and only listens to the local maintenance socket and the REST API. It can be used to fix damaged nodes – for example, by using nodetool compact or nodetool scrub. In maintenance mode, ScyllaDB skips loading tablet metadata if it is corrupted to allow an administrator to fix it. Also, the Maintenance Socket provides a new way to interact with ScyllaDB from within the node it runs on. It is mainly for debugging. You can use CQLSh with the Maintenance Socket as described in the Maintenance Socket docs. Improvements and Bug Fixes The latest release also features extensive improvements to stability, performance, monitoring and more. For details, see the release notes on the ScyllaDB Community Forum. See the details release notesDatabase Internals: Working with IO
Explore the tradeoffs of different Linux I/O methods and learn how databases can take advantage of a modern SSD’s unique characteristics The following blog is an excerpt from Chapter 3 of the Database Performance at Scale book, which is available for free. This book sheds light on often overlooked factors that impact database performance at scale. Unless the database engine is an in-memory one, it will have to keep the data on external storage. There can be many options to do that, including local disks, network-attached storage, distributed file- and object- storage systems, etc. The term “I/O” typically refers to accessing data on local storage – disks or file systems (that, in turn, are located on disks as well). And in general, there are four choices for accessing files on a Linux server: read/write, mmap, direct I/O (DIO) read/write, and asynchronous I/O (AIO/DIO, because this I/O is rarely used in cached mode). Traditional read/write The traditional method is to use the read(2) and write(2) system calls. In a modern implementation, the read system call (or one of its many variants – pread, readv, preadv, etc) asks the kernel to read a section of a file and copy the data into the calling process address space. If all of the requested data is in the page cache, the kernel will copy it and return immediately; otherwise, it will arrange for the disk to read the requested data into the page cache, block the calling thread, and when the data is available, it will resume the thread and copy the data. A write, on the other hand, will usually just copy the data into the page cache; the kernel will write-back the page cache to disk some time afterward. mmap An alternative and more modern method is to memory-map the file into the application address space using the mmap(2) system call. This causes a section of the address space to refer directly to the page cache pages that contain the file’s data. After this preparatory step, the application can access file data using the processor’s memory read and memory write instructions. If the requested data happens to be in cache, the kernel is completely bypassed and the read (or write) is performed at memory speed. If a cache miss occurs, then a page-fault happens and the kernel puts the active thread to sleep while it goes off to read the data for that page. When the data is finally available, the memory-management unit is programmed so the newly read data is accessible to the thread which is then awoken. Direct I/O (DIO) Both traditional read/write and mmap involve the kernel page cache and defer the scheduling of I/O to the kernel. When the application wishes to schedule I/O itself (for reasons that we will explain later), it can use direct I/O, shown in Figure 3-4. This involves opening the file with the O_DIRECT flag; further activity will use the normal read and write family of system calls. However, their behavior is now altered: instead of accessing the cache, the disk is accessed directly, which means that the calling thread will be put to sleep unconditionally. Furthermore, the disk controller will copy the data directly to userspace, bypassing the kernel. Figure 3-4: Direct IO involves opening the file with the O_DIRECT flag; further activity will use the normal read and write family of system calls, but their behavior is now altered Asynchronous I/O (AIO/DIO) A refinement of direct I/O, asynchronous direct I/O behaves similarly but prevents the calling thread from blocking (see Figure 3-5). Instead, the application thread schedules direct I/O operations using theio_submit(2)
system call, but the thread is not blocked; the I/O operation runs
in parallel with normal thread execution. A separate system call,
io_getevents(2)
, is used to wait for and collect the
results of completed I/O operations. Like DIO, the kernel’s page
cache is bypassed, and the disk controller is responsible for
copying the data directly to userspace. Figure 3-5: A
refinement of direct I/O, asynchronous direct I/O behaves similarly
but prevents the calling thread from blocking Note:
io_uring The API to perform asynchronous I/O appeared in
Linux long ago, and it was warmly met by the community. However, as
it often happens, real-world usage quickly revealed many
inefficiencies such as blocking under some circumstances (despite
the name), the need to call the kernel too often, and poor support
for canceling the submitted requests. Eventually, it became clear
that the updated requirements are not compatible with the existing
API and the need for a new one arose. This is how the io_uring()
API appeared. It provides the same facilities as what AIO does, but
in a much more convenient and performant way (also it has notably
better documentation). Without diving into implementation details,
let’s just say that it exists and is preferred over the legacy AIO.
Understanding the tradeoffs The different access methods share some
characteristics and differ in others. Table 3-1 summarizes these
characteristics, which are elaborated on below. Copying and MMU
activity One of the benefits of the mmap method is that if the data
is in cache, then the kernel is bypassed completely. The kernel
does not need to copy data from the kernel to userspace and back,
so fewer processor cycles are spent on that activity. This benefits
workloads that are mostly in cache (for example, if the ratio of
storage size to RAM size is close to 1:1). The downside of mmap,
however, occurs when data is not in the cache. This usually happens
when the ratio of storage size to RAM size is significantly higher
than 1:1. Every page that is brought into the cache causes another
page to be evicted. Those pages have to be inserted into and
removed from the page tables; the kernel has to scan the page
tables to isolate inactive pages, making them candidates for
eviction, and so forth. In addition, mmap requires memory for the
page tables. On x86 processors, this requires 0.2% of the size of
the mapped files. This seems low, but if the application has a
100:1 ratio of storage to memory, the result is that 20% of memory
(0.2% * 100) is devoted to page tables. I/O scheduling One of the
problems with letting the kernel control caching (with the mmap and
read/write access methods) is that the application loses control of
I/O scheduling. The kernel picks whichever block of data it deems
appropriate and schedules it for write or read. This can result in
the following problems: A write storm. When the
kernel schedules large amounts of writes, the disk will be busy for
a long while and impact read latency The kernel cannot
distinguish between “important” and “unimportant” I/O. I/O
belonging to background tasks can overwhelm foreground tasks,
impacting their latency By bypassing the kernel page cache, the
application takes on the burden of scheduling I/O. This doesn’t
mean that the problems are solved, but it does mean that the
problems can be solved – with sufficient attention and effort. When
using Direct I/O, each thread controls when to issue I/O However,
the kernel controls when the thread runs, so responsibility for
issuing I/O is shared between the kernel and the application. With
AIO/DIO, the application is in full control of when I/O is issued.
Thread scheduling An I/O intensive application using mmap or
read/write cannot guess what its cache hit rate will be. Therefore
it has to run a large number of threads (significantly larger than
the core count of the machine it is running on). Using too few
threads, they may all be waiting for the disk leaving the processor
underutilized. Since each thread usually has at most one disk I/O
outstanding, the number of running threads must be around the
concurrency of the storage subsystem multiplied by some small
factor in order to keep the disk fully occupied. However, if the
cache hit rate is sufficiently high, then these large numbers of
threads will contend with each other for the limited number of
cores. When using Direct I/O, this problem is somewhat mitigated.
The application knows exactly when a thread is blocked on I/O and
when it can run, so the application can adjust the number of
running threads according to runtime conditions. With AIO/DIO, the
application has full control over both running threads and waiting
I/O (the two are completely divorced), so it can easily adjust to
in-memory or disk-bound conditions or anything in between. I/O
alignment Storage devices have a block size; all I/O must be
performed in multiples of this block size which is typically 512 or
4096 bytes. Using read/write or mmap, the kernel performs the
alignment automatically; a small read or write is expanded to the
correct block boundary by the kernel before it is issued. With DIO,
it is up to the application to perform block alignment. This incurs
some complexity, but also provides an advantage: the kernel will
usually over-align to a 4096 byte boundary even when a 512-byte
boundary suffices. However, a user application using DIO can issue
512-byte aligned reads, which results in saving bandwidth on small
items. Application complexity While the previous discussions
favored AIO/DIO for I/O intensive applications, that method comes
with a significant cost: complexity. Placing the responsibility of
cache management on the application means it can make better
choices than the kernel and make those choices with less overhead.
However, those algorithms need to be written and tested. Using
asynchronous I/O requires that the application is written using
callbacks, coroutines, or a similar method, and often reduces the
reusability of many available libraries. Choosing the Filesystem
and/or Disk Beyond performing the I/O itself, the database design
must consider the medium against which this I/O is done. In many
cases, the choice is often between a filesystem or a raw block
device, which in turn can be a choice of a traditional spinning
disk or an SSD drive. In cloud environments, however, there can be
the third option because local drives are always ephemeral – which
imposes strict requirements on the replication. Filesystems vs raw
disks This decision can be approached from two angles: management
costs and performance. If accessing the storage as a raw block
device, all the difficulties with block allocation and reclamation
are on the application side. We touched on this topic slightly
earlier when we talked about memory management. The same set of
challenges apply to RAM as well as disks. A connected, though very
different, challenge is providing data integrity in case of
crashes. Unless the database is purely in-memory, the I/O should be
done in a way that avoids losing data or reading garbage from disk
after a restart. Modern file systems, however, provide both and are
very mature to trust the efficiency of allocations and integrity of
data. Accessing raw block devices unfortunately lacks those
features (so they will need to be implemented at the same quality
on the application side). From the performance point of view, the
difference is not that drastic. On one hand, writing data to a file
is always accompanied by associated metadata updates. This consumes
both disk space and I/O bandwidth. However, some modern file
systems provide a very good balance of performance and efficiency,
almost eliminating the IO latency. (One of the most prominent
examples is XFS. Another really good and mature piece of software
is Ext4). The great ally in this camp is the fallocate(2) system
call that makes the filesystem pre-allocate space on disk. When
used, filesystems also have a chance to make full use of the
extents mechanisms, thus bringing the QoS of using files to the
same performance level as when using raw block devices. Appending
writes The database may have a heavy reliance on appends to files
or require in-place updates of individual file blocks. Both
approaches need special attention from the system architect because
they call for different properties from the underlying system. On
one hand, appending writes requires careful interaction with the
filesystem so that metadata updates (file size, in particular) do
not dominate the regular I/O. On the other hand, appending writes
(being sort of cache-oblivious algorithms) handle the disk
overwriting difficulties in a natural manner. Contrary to this,
in-place updates cannot happen at random offsets and sizes because
disks may not tolerate this kind of workload, even if used in a raw
block device manner (not via a filesystem). That being said, let’s
dive even deeper into the stack and descend into the hardware
level. How modern SSDs work Like other computational resources,
disks are limited in the speed they can provide. This speed is
typically measured as a 2-dimensional value with Input/Output
Operations per Second (IOPS) and bytes per second
(throughput). Of course, these parameters are not cut in stone even
for each particular disk, and the maximum number of requests or
bytes greatly depends on the requests’ distribution, queuing and
concurrency, buffering or caching, disk age, and many other
factors. So when performing I/O, a disk must always balance between
two inefficiencies — overwhelming the disk with requests and
underutilizing it. Overwhelming the disk should be avoided because
when the disk is full of requests it cannot distinguish between the
criticality of certain requests over others. Of course, all
requests are important, but it makes sense to prioritize
latency-sensitive requests. For example, ScyllaDB serves real-time
queries that need to be completed in single-digit milliseconds or
less and, in parallel, it processes terabytes of data for
compaction, streaming, decommission, and so forth. The former have
strong latency sensitivity; the latter are less so. Good I/O
maintenance that tries to maximize the I/O bandwidth while keeping
latency as low as possible for latency-sensitive tasks is
complicated enough to become a standalone component called the
IO
Scheduler. When evaluating a disk, one would most likely
be looking at its 4 parameters — read/write IOPS and read/write
throughput (such as in MB/s). Comparing these numbers to one
another is a popular way of claiming one disk is better than the
other and estimating the aforementioned “bandwidth capacity” of the
drive by applying Little’s Law. With that, the IO Scheduler’s job
is to provide a certain level of concurrency inside the disk to get
maximum bandwidth from it, but not to make this concurrency too
high in order to prevent the disk from queueing requests internally
for longer than needed. For instance, Figure 3-6 illustrates how
read request latency depends on the intensity of small reads
(challenging disk IOPS capacity) vs the intensity of large writes
(pursuing the disk bandwidth). The latency value is color-coded,
and the “interesting area” is painted in cyan — this is where the
latency stays below 1 millisecond. The drive measured is the NVMe
disk that comes with the AWS EC2 i3en.3xlarge instance. Figure
3-6: Bandwidth/latency graphs showing how read request latency
depends on the intensity of small reads (challenging disk IOPS
capacity) vs the intensity of large writes (pursuing the disk
bandwidth) This drive demonstrates almost perfect half-duplex
behavior — increasing the read intensity several times requires
roughly the same reduction in write intensity to keep the disk
operating at the same speed. Tip: How to measure your own
disk behavior under load The better you understand how
your own disks perform under load, the better you can tune them to
capitalize on their “sweet spot.” One way to do this is with
Diskplorer, an open-source disk latency/bandwidth exploring
toolset. By using Linux fio under the hood it runs a battery of
measurements to discover performance characteristics for a specific
hardware configuration, giving you an at-a-glance view of how
server storage I/O will behave under load. For a walkthrough of how
to use this tool, see this Linux Foundation video, “Understanding
Storage I/O Under Load.”