Open source software (OSS) is the backbone of innovation. Building and releasing software would be a dramatically different process without the incredible contributions of the open source community — the many open-source-based tools that support development and release as well as the open source code bases and frameworks that enable developers to “stand on the shoulders of giants.”
However, open source contributors rarely get the recognition they deserve. That’s why ScyllaDB wanted to take the occasion of American Thanksgiving to express our gratitude to open source contributors everywhere. Whether you’ve built an open source tool or framework that ScyllaDB engineers rely on, you’ve helped shape our open source code base, or you’re advancing another amazing open source project, we thank you.
ScyllaDB’s engineering team is especially grateful for the following open source tools and frameworks:
- GCC, an optimizing compiler produced by the GNU Project supporting various programming languages, hardware architectures and operating systems
- Linux kernel, the open source operating system kernel that powers the ongoing revolution in cloud computing
- Ninja, a small build system with a focus on speed
- Golang, an open source programming language that makes it easy to build simple, reliable, and efficient software
- Git, a free and open source distributed version control system designed to handle everything from small to very large projects with speed and efficiency
- Grafana, an open source observability solution for every database
- Prometheus, an open-source monitoring system with a dimensional data model, flexible query language, efficient time series database and modern alerting approach
- Kubernetes, powering planet-scale container management
- Debezium, the open source platform under the hood of our CDC Kafka Connector
- Tokio, a framework at the heart of our Rust driver
- Wireshark, an open source “sniffer” for TCP/IP packet analysis
We also want to highlight how open source contributions have enabled many great things. For example, if you’ve ever contributed to the Seastar framework or Scylla database projects, or even one of our shard-aware drivers, you can take pride in knowing that your work is benefitting applications that:
- Accelerate pharmaceutical discovery by enabling unprecedented views of biology
- Help people build relationships globally and remain connected virtually
- Provide extremely flexible options for the new realities of travel
- Stop cybersecurity attacks before they can do damage
- Reduce traffic accidents through predictive driver alerts
- Produce AI-driven health insights to improve patient care
- Enable law enforcement to track, arrest, and prosecute child predators
- Harness the power of IoT to provide renewable energy and manage natural resources more sustainably
That’s just a small sampling of the positive impacts made possible by the contributors to ScyllaDB and Seastar open source projects — and that’s just a tiny sliver of the contributions made to the global open source community over the past decades. The overall value of open source contributions is immeasurable but certainly immense.
To every person who’s ever submitted a bug report, committed a bug fix, extended an existing open source project, or took the initiative to start a new one: we truly appreciate your efforts. Thank you for taking the time to contribute, and never lose that spirit of innovation.
The post Giving Thanks to Open Source Software Contributors appeared first on ScyllaDB.
ScyllaDB, a database designed to store exabytes of data and serve millions of operations per second, has to make the most of all available hardware resources: CPU, memory, and disk and network IO. Its data model requires maintaining a set of partitions, rows and cell values that provide fast lookup and sorted scans while keeping memory consumption as low as possible. One of the core components that greatly affects ScyllaDB’s performance is the in-memory cache of user data (we call it the row cache). And one of the key factors in achieving the performance goal is a good selection of collections — the data structures used to maintain the row cache objects. In this blog post I’ll demonstrate a facet of the row cache that lies at the intersection of academic computer science and practical programming — the trees.
In its early days, Scylla used standard implementations of key-value maps that were Red-Black (RB) trees behind the scenes. Although the standard implementation was time-proven to be stable and well performing, we noticed a set of performance-related problems with it: memory consumption could have been better, tree search took more CPU power than we expected, and some design decisions that were considered “corner cases” turned out to be critical for us. The need for a better implementation arose and, as part of this journey, we had to re-invent trees.
To B- or Not to B-Tree
An important characteristic of a tree is called cardinality. This is the maximum number of child nodes that another node may have. In the corner case of cardinality of two, the tree is called a binary tree. For other cases, there’s a wide class of so-called B-trees. The common belief about binary vs B-trees is that the former should be used when the data is stored in RAM, whilst the latter should live on disk. The justification for this split is that RAM access is much faster than disk access. Also, disk IO is performed in blocks, so it’s much better and faster to fetch several “adjacent” keys in one request. RAM, unlike disk, allows random access with almost any granularity, so it’s OK to have a dispersed set of keys pointing to each other.
However, there are many reasons why B-trees are often a good choice for in-memory collections. The first reason is cache locality. When searching for a key in a binary tree, the algorithm may visit up to logN elements that are very likely dispersed in memory. In a B-tree, the search consists of two phases — intra-node search and descending the tree — executed one after another. And while descending the tree doesn’t differ much from the binary tree in the aforementioned sense, intra-node search accesses keys that are located next to each other, thus making much better use of CPU caches.
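As a rough illustration (not Scylla's actual code), a two-phase lookup over nodes with contiguously stored keys might look like this; the node layout and capacity are assumptions for the sketch:

```cpp
// Hypothetical node layout: keys sit contiguously, so the intra-node
// phase of a search walks memory that is already in cache.
struct node {
    int nkeys = 0;
    int keys[8] = {};
    node* kids[9] = {};   // child pointers, used on inner nodes only
    bool leaf = true;
};

// Phase 1: linear intra-node search over adjacent keys.
// Phase 2: descend into the selected branch.
const int* btree_find(const node* n, int key) {
    while (n) {
        int i = 0;
        while (i < n->nkeys && n->keys[i] < key)
            ++i;                          // cache-friendly: adjacent keys
        if (i < n->nkeys && n->keys[i] == key)
            return &n->keys[i];           // found
        if (n->leaf)
            return nullptr;               // reached the bottom: not found
        n = n->kids[i];                   // descend, as in a binary tree
    }
    return nullptr;
}
```

The descent is the same pointer chase a binary tree performs, but there are far fewer levels, and each level's comparisons hit one cache line instead of many.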
The second reason also comes from the dispersed nature of binary trees and from how modern CPUs are designed. It’s well known that when executing a stream of instructions, CPU cores split the processing of each instruction into stages (loading instructions, decoding them, preparing arguments and doing the execution itself) and the stages are run in parallel in a unit called a conveyor. When a conditional branching instruction appears in this stream, the conveyor needs to guess which of two potential branches it will have to execute next and start loading it into the conveyor pipeline. If this guess fails, the conveyor is flushed and starts to work from scratch. Such failures are called branch mispredictions. They are very bad from the performance point of view and have direct implications for the binary search algorithm. When searching for a key in such a tree, the algorithm jumps left and right depending on the key comparison result without giving the CPU any chance to learn which direction is “preferred.” In many cases, the CPU conveyor is flushed.
The two-phased B-tree search can be made better with respect to branch predictions. The trick is in making the intra-node search linear, i.e. walking the array of keys forward key-by-key. In this case, there will be only a “should we move forward” condition that’s much more predictable. There’s even a nice trick of turning binary search into linear without sacrificing the number of comparisons.
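That trick can be sketched as follows (a generic illustration, not code from Scylla): each step conditionally advances a base pointer instead of jumping left or right, so the only branch left is the easily predicted loop condition.

```cpp
#include <cstddef>

// Branchless binary search with lower_bound semantics: same number of
// comparisons as classic binary search, but the comparison result only
// feeds a conditional pointer bump, which compilers turn into a cmov.
const int* lower_bound_branchless(const int* a, size_t n, int key) {
    const int* base = a;
    while (n > 1) {
        size_t half = n / 2;
        if (base[half - 1] < key)   // cmov-friendly, no taken/not-taken guess
            base += half;
        n -= half;
    }
    // base now points at the only remaining candidate slot
    return (n == 1 && *base < key) ? base + 1 : base;
}
```

The loop's "keep going" branch is taken the same way almost every iteration, so the predictor learns it immediately, unlike the data-dependent left/right branch of the classic version.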
Linear Search on Steroids
That linear search can be improved a bit more. Let’s carefully count the number of key comparisons that it may take to find a single key in a tree. For a binary tree, it’s well known that it takes log2N comparisons (on average) where N is the number of elements. I put the logarithm base here for a reason. Next, let’s consider a k-ary tree with k kids per node. Does it take fewer comparisons? (Spoiler: no.) To find the element, we have to do the same search — get a node, find which branch it sits in, then proceed to it. We have logkN levels in the tree, so we have to do that many descending steps. However, at each step we need to search within k elements which is, again, log2k if we’re doing a binary search. Multiplying both, we still need at least log2N comparisons.
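Spelling out the “multiplying both” step with the change-of-base rule:

```latex
\underbrace{\log_k N}_{\text{levels}} \times \underbrace{\log_2 k}_{\text{per-node search}}
  \;=\; \frac{\log_2 N}{\log_2 k} \times \log_2 k
  \;=\; \log_2 N
```

So a larger fan-out by itself buys nothing in comparison count; the win has to come from making each comparison cheaper.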
The way to reduce this number is to compare more than one key at a time when doing intra-node search. In case keys are small enough, SIMD instructions can compare up to 64 keys in one go. Although a SIMD compare instruction may be slower than a classic cmp one and requires additional instructions to process the comparison mask, linear SIMD-powered search wins on short enough arrays (and B-tree nodes can be short enough). For example, here are the times of looking up an integer in a sorted array using three techniques — linear search, binary search and SIMD-optimized linear search such as the x86 Advanced Vector Extensions (AVX).
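The idea can be approximated in portable C++ (a sketch of the SIMD-friendly structure, not actual AVX intrinsics code): the key's position equals the number of smaller elements, and summing 0/1 comparison results leaves no data-dependent branches, so the compiler is free to vectorize the loop.

```cpp
#include <cstddef>

// Branch-free linear search over a sorted array: count how many
// elements are smaller than the key. Each iteration adds 0 or 1, so
// there is nothing for the branch predictor to mispredict, and the
// loop auto-vectorizes into packed compares on SIMD-capable targets.
size_t search_pos(const int* keys, size_t n, int key) {
    size_t pos = 0;
    for (size_t i = 0; i < n; ++i)
        pos += (keys[i] < key);   // 0 or 1 per element, no branching
    return pos;
}
```

A hand-written AVX version would do the same counting with packed compares and a movemask/popcount, which is where the "up to 64 keys in one go" figure comes from.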
The test used a large number of randomly generated arrays of values dispersed in memory to eliminate differences in cache usage and a large number of random search keys to blur branch predictions. Shown above are the average times of finding a key in an array, normalized by the array length. Smaller results are faster (better).
Scanning the Tree
One interesting flavor of B-trees is called a B+-tree. In this tree, there are two kinds of keys — real keys and separation keys. The real keys live on leaf nodes, i.e. those that have no children, while separation keys sit on inner nodes and are used to select which branch to go to next when descending the tree. This difference has the obvious consequence that it takes more memory to keep the same number of keys in a B+-tree compared to a B-tree, but it’s not only that.
A great implicit feature of a tree is the ability to iterate over elements in a sorted manner (called scan below). To scan a classical B-tree, there are both recursive and state-machine algorithms that process the keys in a very non-uniform manner — the algorithm walks up and down the tree as it moves. Despite B-trees being described as cache-friendly above, scanning one needs to visit every single node, and inner nodes are visited in a cache-unfriendly manner.
Opposite to this, B+-trees’ scan only needs to loop through its leaf nodes, which, with some additional effort, can be implemented as a linear scan over a linked list of arrays.
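A sketch of that layout (illustrative only, not Scylla's implementation): leaves form a linked list of sorted arrays, and a full scan walks the list linearly without ever touching inner nodes.

```cpp
#include <vector>

// B+-tree leaves as a linked list of sorted arrays: inner nodes are
// only needed for point lookups, never for a full scan.
struct leaf {
    std::vector<int> keys;    // sorted keys stored in this leaf
    leaf* next = nullptr;     // link to the next leaf in key order
};

// Scan: a plain linear walk over the leaf list, concatenating keys.
std::vector<int> scan_all(const leaf* l) {
    std::vector<int> out;
    for (; l != nullptr; l = l->next)
        out.insert(out.end(), l->keys.begin(), l->keys.end());
    return out;
}
```

Each leaf's keys come out as one contiguous run, so the scan has the access pattern of iterating an array, plus one pointer hop per leaf.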
When the Tree Size Matters
Talking about memory, B-trees don’t provide all the above benefits for free (neither do B+-trees). As the tree grows, so does the number of nodes in it and it’s useful to consider the overhead needed to store a single key. For a binary tree, the overhead would be three pointers — to both left and right children as well as to the parent node. For a B-tree, it will differ for inner and leaf nodes. For both types, the overhead is one parent pointer and k pointers to keys, even if they are not inserted in the tree. For inner nodes there will be additionally k+1 pointers to child nodes.
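The counts above can be turned into back-of-the-envelope arithmetic (illustrative only, overhead measured in pointers, not a memory profiler):

```cpp
#include <cstddef>

// Binary tree: every key pays for parent + left + right pointers.
constexpr size_t binary_tree_overhead(size_t nkeys) {
    return 3 * nkeys;
}

// One B-tree leaf of capacity k holding `used` keys pays for one parent
// pointer plus k key slots, whether or not the slots are filled, so the
// per-key cost depends on how full the node is.
constexpr double btree_leaf_overhead_per_key(size_t k, size_t used) {
    return double(k + 1) / double(used);
}
```

A full leaf amortizes its fixed cost well below the binary tree's three pointers per key; a half-empty one gives much of that advantage back, which is exactly the underfill effect discussed later.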
The number of nodes in a B-tree is easy to estimate for a large number of keys. As the number of nodes grows, the per-key overhead blurs as keys “share” parent and children pointers. However, there’s a very interesting point at the beginning of a tree’s growth. When the number of keys becomes k+1 (i.e. the tree overgrows its first leaf node), the number of nodes triples because it becomes necessary to allocate one more leaf node and one inner node to link those two.
There is a good and pretty cheap optimization to mitigate this spike that we’ve called “linear root.” The leaf root node grows on demand, doubling each step like a std::vector in C++, and can overgrow the capacity of k up to some extent. Shown on the graph below is the per-key overhead for a 4-ary B-tree with 50% initial overgrow. Note the first split spike of the classical algorithm at 5 keys.
If talking about how B-trees work with small amounts of keys, it’s worth mentioning the corner case of 1 key. In Scylla, a B-tree is used to store clustering rows inside a partition. Since it’s allowed to have a schema without a clustering key, it’s thus possible to have partitions that always have just one row inside, so this corner case is not that “corner” for us. In the case of a binary tree, the single-element tree is equivalent to having a direct pointer from the tree owner to this element (plus the cost of two nil pointers to the left and right children). In case of a B-tree, the cost of keeping the single key is always in having a root node that implies extra pointer fetching to access this key. Even the linear root optimization is helpless here. Fixing this corner case was possible by re-using the pointer to the root node to point directly to the single key.
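One way to implement that re-use is pointer tagging (a sketch, not Scylla's actual code; it assumes keys are at least 2-byte aligned so the low pointer bit is free): the owner's root slot points either at a root node or, when the tree holds exactly one key, directly at that key.

```cpp
#include <cstdint>

struct node;   // a full B-tree node; details elided

// The tree owner's root slot. The low bit tags which case we are in,
// so the single-key case costs no node allocation and no extra hop.
struct root_slot {
    uintptr_t bits = 0;

    void set_node(node* n) { bits = reinterpret_cast<uintptr_t>(n); }
    void set_single_key(int* key) {
        bits = reinterpret_cast<uintptr_t>(key) | uintptr_t(1);  // tag
    }
    bool is_single_key() const { return (bits & 1) != 0; }
    int* single_key() const {
        return reinterpret_cast<int*>(bits & ~uintptr_t(1));
    }
};
```

With this, a one-row partition dereferences the owner's pointer straight to the key, matching the binary tree's cost for that case.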
The Secret Life of Separation Keys
Next, let’s dive into technical details of B+-tree implementation — the practical information you won’t read in books.
There are two different ways of managing separation keys in a B+-tree. The separation key at any level must be less than or equal to all the keys from its right subtree and greater than or equal to all the keys from its left subtree. Mind the “or” condition — the exact value of the separation key may or may not coincide with the value of some key from the respective branch (it’s clear that such a key would be the rightmost key of the left branch or the leftmost of the right one). Let’s name these two cases. If the tree balancing maintains the separation key to be independent from other key values, then it’s the light mode; if it must coincide with some of them, then it will be called the strict mode.
In the light separation mode, the insertion and removal operations are a bit faster because they don’t need to care about separation keys that much. It’s enough that they separate branches, and that’s it. A somewhat worse consequence of light separation is that separation keys are independent values that may appear in the tree by copying existing keys. If the key is simple, e.g. an integer, this will likely not cause any trouble. However, if keys are strings or, as in Scylla’s case, database partition or clustering keys, copying them might be both resource consuming and out-of-memory risky.
On the other hand, the strict separation mode makes it possible to avoid copying keys by implementing separation keys as references to real ones. This involves some complication of insertion and especially removal operations. In particular, upon real key removal, the relevant separation keys need to be found and updated. Another difficulty to care about is that moving a real key value in memory, if needed (e.g. in Scylla’s case keys are moved in memory as part of memory defragmentation hygiene), also requires updating the relevant reference from separation keys. However, it’s possible to show that each real key will be referenced by at most one separation key.
Speaking about memory consumption… although large B-trees were shown to consume less memory per key as they get filled, the real overhead will very likely be larger, since the nodes of the tree will typically be underfilled because of the way the balancing algorithm works. For example, this is what nodes look like in a randomly filled 4-ary B-tree:
It’s possible to define a compaction operation for a B-tree that picks several adjacent nodes and squashes them together, but this operation has its limitations. First, a certain amount of under-occupied nodes makes it possible to insert a new element into a tree without the need to rebalance, thus saving CPU cycles. Second, since each node cannot contain less than half of its capacity, squashing two adjacent nodes is impossible; even considering three adjacent nodes, the number of really squashable nodes would be less than 5% of leaves and less than 1% of inners.
In this blog post, I’ve only touched on the most prominent aspects of adopting B- and B+-trees for in-RAM usage. Lots of smaller points were tossed overboard for brevity — for example, the subtle difference between an odd and an even number of keys on a node. This exciting journey proved one more time that the actual implementation of an abstract math concept can be very different from its on-paper model.
B+-trees have been supported since Scylla Open Source 4.3, and our B-tree implementation was added in release 4.5. They are hidden, under-the-hood optimizations ScyllaDB users benefit from as we continue to evolve our infrastructure.
Instaclustr is pleased to release our Certification Report and updated Project Assessment Report for Apache Cassandra 4.0.
We first announced our Certification Framework and Apache Cassandra certification process in 2019. The Instaclustr Certification Framework formalizes the overall process of considering open source for inclusion in the Instaclustr SaaS platform. The two key stages are:
- Assessment, where we consider the governance, adoption, and liveliness of the project, and
- Certification, where we develop a test plan and undertake independent technical testing of a specific software release or version.

Both phases result in a detailed, publicly available report of the work we have undertaken.
Key highlights of the Cassandra 4.0 Certification Report include:
- Performance testing (latency and throughput) comparing the current version to the previous two released versions for multiple use cases
- 24-hour soak testing (including repairs and replaces) including garbage collection time
- Shakedown testing against popular drivers
- CVE scanning results and impact analysis of open CVEs
The overall conclusion of our testing is that we have found Cassandra 4.0 to be a positive step forward from Cassandra 3.x. In particular, we found significant improvements in tail latencies (p95 and p99) and throughput and latency improvements for most use cases. We did find one test where Cassandra 4.0 performed slightly worse than Cassandra 3.11.8, so we do recommend testing your individual use case before upgrading.
Dissecting a Real-Life Migration Process
It’s kind of common sense: database migration is one of the most complex and risky operations the “cooks in the kitchen” (platform / infrastructure / Ops / SRE and surroundings) can face — especially when you are dealing with the company’s “main” database. That’s a very abstract definition, I know. But if you’re from a startup, you’ve probably already faced such a situation.
Still — once in a while — the work and the risk are worth it, especially if there are considerable gains in terms of cost and performance.
The purpose of this series of posts is to describe the strategic decisions, steps taken and the real experience of migrating a database from Cassandra to ScyllaDB from a primarily technical point of view.
Let’s start with the use case.
Use Case: IoT (Vehicle Fleet Telemetry)
Here at Cobli, we work with vehicle fleet telematics. Let me explain: every vehicle in our fleet has an IoT device that continuously sends data packets every few seconds.
Overall, this data volume becomes significant. At peak usage, it reaches more than 4,000 packets per second. Each packet must go through triage, where it is processed and stored in the shortest possible time. This processing results in information relevant to our customers (for example, vehicle overspeed alerts, journey history, etc.) — and this information must also be saved. Everything must be done as close as possible to “real time”.
The main requirements of a “database” for our functionalities are high availability and write capacity, a mission successfully delegated to Cassandra.
I’m not going to elaborate on all the reasons for migrating to ScyllaDB. However, I can summarize it as the search for a faster and — above all — cheaper Cassandra.
For those who are using Cassandra and don’t know Scylla, this page is worth taking a look at. In short: ScyllaDB is a revamped Cassandra, focused on performance while maintaining a high level of compatibility with Cassandra (in terms of features, APIs, tools, CQL, table types and so on…).
Scylla is similar to Cassandra when it comes to the way it is distributed: there is an open source version, an enterprise version, and a SaaS version (database as a service or DBaaS). It’s already ingrained in Cobli’s culture: using SaaS enables us to focus on our core business. The choice for SaaS was unanimous.
The Scylla world is relatively new, and the SaaS that exists is the Scylla Cloud offering (we haven’t found any other… yet). We based our cluster size and cost projection calculations on the options provided by SaaS, which also simplified the process a bit.
Another point that made us comfortable was the form of integration between our infrastructure and SaaS, common to the AWS world: the Bring Your Own Account model. We basically delegate access to Scylla Cloud to create resources on AWS under a specific account, but we continue to own those resources.
We did a short discovery exercise with Scylla Cloud:
- We set up a free-tier cluster linked to our AWS account
- We configured connectivity to our environment (VPC peering) and tested it
- We validated the operating and monitoring interfaces provided by Scylla Cloud
Scylla Cloud provides a Scylla Monitoring interface built into the cluster. It’s not a closed solution — it’s part of the open source version of Scylla — but the advantage is that it is managed by them.
We did face a few speed bumps. One important difference compared to Cassandra: there are no metrics per table/keyspace: only global metrics. This restriction comes from the Scylla core, not the Scylla Cloud. It seems that there have been recent developments on that front, but we simply accepted this drawback in our migration.
Over time, we’ve also found that Scylla Cloud metrics are retained for around 40 to 50 days. Some analyses may need to look further back than that. Fortunately, it is possible to export the metrics in Prometheus format — accepted by a huge range of integrations — and replicate them in other software.
Lastly, we missed a backup administration interface (scheduling, requests on demand, deleting old backups etc). Backup settings go through ticket systems and, of course, interactions with customer support.
First Question: is it Feasible?
From a technical point of view, the first step of our journey was to assess whether Scylla was viable in terms of functionality, performance and whether the data migration would fit within our constraints of time and effort.
We brought up a container with the “target” version of Scylla (Enterprise 2020.1.4), ran our migrations (yes, we use migrations in Cassandra!) and voilà! Our database migrated to Scylla without changing a single line.
Disclaimer: It may not always be like this. Scylla keeps a compatibility information page that is worth visiting to avoid surprises. [Editor’s note: also read this recent article on Cassandra and ScyllaDB: Similarities and Differences.]
Our functional validation came down to re-running our complete set of tests that previously used Cassandra, pointing at the dockerized version of Scylla instead.
In most cases, we didn’t have any problems. However, one of the tests returned:
partition key Cartesian product size 4735 is greater than maximum 100
The query in question is a SELECT with an IN clause. This is a usage that Cassandra advises against, and one that Scylla decided to restrict more aggressively: the number of values inside the IN clause is limited by a configuration option.
We changed the configuration according to our use case, the test passed, and we moved on.
We instantiated a Cassandra and a Scylla cluster with their “production” settings. We also populated some tables with the help of dsbulk and ran some stress testing.
The tests were basically pre-existing read/write scenarios on our platform using cassandra-stress.
Before testing, we switched from Cassandra’s size-tiered compaction strategy to Scylla’s new incremental compaction to minimize space amplification. This was something that Scylla recommended. Note that this compaction strategy only exists within the enterprise version.
ScyllaDB delivered surprising numbers, decreasing query latencies by 30% to 40% even with half the hardware (in terms of CPU).
This test proved to be insufficient because the production environment is much more complex than we were able to simulate. We faced some mishaps during the migration worthy of another post, but nothing that hasn’t already been offset by the performance benefits that Scylla has shown after proper corrections.
Exploring the Data Migration Process
It was important to know how and how soon we would be able to have Scylla with all the data from good old Cassandra.
The result of this task was a guide for us to be able to put together a migration roadmap, in addition to influencing the decision on the adopted strategy.
- dsbulk: A CLI application that performs the Cassandra data unload operation into an intermediate format, which can be used for a load operation later. It’s a single JVM process, so it only scales vertically.
- scylla-migrator: A Spark job created and maintained by Scylla with various performance/parallelism settings to massively copy data. Like any Spark job, it can be configured with a virtually infinite number of clusters. It implements a savepoints mechanism, allowing process restart from the last successfully copied batch in case of failure.
The option of copying data files from Cassandra to Scylla was scrapped upon the recommendation of our contacts at ScyllaDB. It is important to let Scylla reorganize data on disk, as its partitioning system is different (done per CPU core, not per node).
In preliminary tests, already using Cassandra and Scylla in production, we got around three to four Gbytes of data migrated per hour using dsbulk, and around 30 Gbytes per hour via scylla-migrator. Obviously these results are affected by a number of factors, but they gave us an idea of the potential uses of each tool.
To try to measure the maximum migration time, we ran the scylla-migrator on our largest table (840G) and got about 10 Gbytes per hour, i.e. roughly three and a half days of migration running 24/7.
With all these results in hand, we decided to move forward with migration. The next question is “how?”. In the next part of this series we’re going to make the toughest decision in migrations of this type: downtime tracking/prediction.
See you soon!
Since the initial release of our Cassandra-compatible database, ScyllaDB has been perceived as “chasing Cassandra,” working to achieve feature parity.
This meant that through 2020, we were playing catch up. However, with Scylla Open Source 4.0 we went beyond feature completeness. We suddenly had features Cassandra didn’t have at all. We also introduced features that were named similarly, but implemented differently — often radically so.
At the same time, Cassandra has added and will keep adding commands, features and formats. For example, the SSTable format changed once between the 4.0 beta and release candidate 1, and then again in the final release.
This results in the following buckets of features. There are core features of Cassandra which Scylla has also implemented in its core — the same-same: same configuration, same command line inputs and outputs, same wire protocol, and so on.
Then there are some things that are unique to Cassandra, such as the Cassandra 4.0 features. Some of these we plan to add in due time, such as the new SSTable formats. Some simply may not be appropriate to implement because of the very different infrastructure and design philosophies — even the code bases. For instance, since Scylla is implemented in C++, you won’t find Java-specific features like you would have in Cassandra. Inversely, you’ll have some features in Scylla that they just won’t implement in Cassandra.
Lastly, there is a mix of features that may be called by the same name, or may sound quite similar, but are actually implemented uniquely across Cassandra and Scylla.
All of these are points of divergence which could become showstoppers for migration if you depended on them in your use case. Or they may be specific reasons to migrate if they represent features or capabilities that you really need, but the other database just will never offer you.
So while Scylla began by chasing Cassandra, now many of our features are beyond Cassandra, and some of their features diverge from our implementation. While we remain committed to making our database as feature complete and compliant to Cassandra as possible and pragmatic, it will be quite interesting to watch as current points of departure between the two become narrowed or widened over the coming years.
What’s the Same Between Scylla & Cassandra?
Let’s start with the common ground. If you are familiar with Cassandra today, what can you expect to feel as comfortable and natural in Scylla?
A Common Heritage: Bigtable and Dynamo
First, the common ancestry. Many of the principles between Cassandra and Scylla are directly correlated. In many ways, you could call Cassandra the “mother” of Scylla in our little database mythological family tree.
Both draw part of their ancestry from the original Google Bigtable and Amazon Dynamo whitepapers (note: Scylla also offers an Alternator interface for DynamoDB API compatibility; this pulls in additional DNA from Amazon DynamoDB).
Keyspaces, Tables, Basic Operations
With our Cassandra Query Language (CQL) interface, the basic methods of defining how the database is structured and how users interact with it remain the same:
These are all standard Cassandra Query Language (CQL). The same thing with basic CRUD operations:
- Create [INSERT]
- Read [SELECT]
- Update [UPDATE]
- Delete [DELETE]
Plus, there are other standard features across Scylla and Cassandra:
All comfortable and familiar as a favorite sweater. Also, for database developers who have never used NoSQL before, the whole syntax of CQL is deceptively similar to SQL, at least at a cursory glance. But do not be lulled into a false sense of familiarity. For example, you won’t find JOIN operations supported in CQL.
The high availability architecture that Cassandra is known for is likewise found in Scylla: peer-to-peer leaderless topology, replication factors and consistency levels set per request, and multi-datacenter replication that allows you to survive a full datacenter loss. All typical “AP”-mode database behavior.
Next, you have the same underlying ring architecture. The key-key-value scheme of a wide column database: partition keys and clustering keys, then data columns.
What else is the same? Nodes and vNodes, automatic sharding, token ranges, and the murmur3 partitioner. If you are familiar with managing Cassandra, all of this is quite familiar. (Though if it’s not, you’re encouraged to take the Scylla Fundamentals course in Scylla University.)
What’s Similar But Not the Same?
While there are still more features that are alike, let’s not be exhaustive. Let’s move on to what seems similar between the two, but really are just not the same.
Cassandra Query Language (CQL)
That’s right. The Cassandra Query Language implementation itself is often subtly or not so subtly different. While the CQL wire protocol and most of the basic CQL commands are the same, you will note Scylla may have implemented some CQL commands that do not appear in Cassandra. Or vice versa.
What are the specific differences between them? I’ll let you scour the docs as a homework assignment. A careful eye might notice a few of the post-3.4.0 features have already been added to Scylla. PER PARTITION LIMIT, a feature of CQL 3.4.2, was added to Scylla Open Source 3.1 and later.
Some of what you find may seem to be pretty trivial differences. But if you were migrating between the two databases, any unexpected discoveries might represent bumps in the road or unpleasant show-stoppers until Scylla finally reaches CQL parity and completeness again.
Scylla is compatible with Cassandra 3.11’s latest “md” format. But did you spot the difference with Cassandra 4.0?
// na (4.0-rc1): uncompressed chunks, pending repair session, isTransient, checksummed sstable metadata file, new
// nb (4.0.0): originating host id
In the first release candidate of Cassandra 4.0 they snuck out the “na” format, which added a bunch of small changes. And then when 4.0 itself shipped, they added a way to store the originating hostID in “nb” format SSTable files.
We’ve opened a GitHub issue (#8593) to make sure Scylla gains “na” and “nb” format compatibility in due time. But this is the sort of common, everyday feature chasing that happens whenever new releases of anything ship and everyone else needs to ensure compatibility. There’s always a little lag before implementations catch up.
Lightweight Transactions, or LWTs, work much the same way on both systems for compare-and-set and conditional updates. But on Scylla they are simply more performant: instead of the four round trips Cassandra requires, Scylla needs only three.
Cassandra LWT implementation vs. Scylla LWT implementation
What this has led to in practice is that some folks have tried LWTs on Cassandra, only to back them out when performance tanked or didn’t meet expectations. So if you experimented with LWTs in Cassandra, you might want to try them again with Scylla.
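As a reminder of what an LWT actually does, here is a toy compare-and-set in plain Python. The round-trip counts above are a property of each server’s Paxos implementation; this sketch only mirrors the conditional-update semantics of CQL’s IF clause:

```python
# Toy compare-and-set, i.e. the semantics of:
#   UPDATE t SET v = :new WHERE k = :k IF v = :expected;

table = {"k1": "old"}

def lwt_update(key, expected, new):
    """Apply the write only if the current value matches `expected`."""
    applied = table.get(key) == expected
    if applied:
        table[key] = new
    return applied  # CQL surfaces this as the [applied] result column

assert lwt_update("k1", expected="old", new="v2") is True
assert table["k1"] == "v2"
assert lwt_update("k1", expected="old", new="v3") is False  # lost the race
```

The whole point of the consensus round trips is that this check-then-write is atomic across replicas, not just on one node as in this sketch.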
Materialized Views, or MVs, are another case where Scylla put more polish into the apple. While Cassandra has had materialized views since 2017, they’ve been problematic since first introduced.
At Distributed Data Summit 2018 Cassandra PMC Chair Nate McCall told the audience that “If you have them, take them out.” I remember sitting in the audience absorbing the varied reactions as Nate spoke frankly and honestly about the shortcomings of the implementation.
Meanwhile, the following year Scylla introduced its own implementation of production-ready materialized views in Scylla Open Source 3.0. They served as the foundation for other features, such as secondary indexes.
While MVs in Scylla can still get out of sync with the base table, it is not as likely or easy for that to happen. ScyllaDB engineers have put a lot of effort over the past few years into getting materialized views “right,” and we consider the feature ready for production.
Speaking of secondary indexes, while you have them in Cassandra, they are only local secondary indexes — limited to the same base partition. They are efficient but they don’t scale.
Global secondary indexes, which are only present in Scylla, allow you to index across your entire dataset, but can be more complicated and lead to unpredictable performance. That means you want to be more judicious about how and when you implement them.
The good news is Scylla supports both local and global secondary indexes. You can apply both on a column to run your queries as narrow or as broad as you wish.
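A toy model of the difference, with made-up data: a local index answers queries only within one base partition, while a global index is in effect another table keyed by the indexed column (in Scylla, maintained via a materialized view):

```python
# Hypothetical sketch: local vs. global secondary indexes.

base = {  # partition_key -> [(clustering_key, city)]
    "user1": [(1, "Paris"), (2, "Oslo")],
    "user2": [(1, "Paris")],
}

def local_index_query(partition_key, city):
    # Local index: scoped to one partition. Efficient, but the query must
    # name the partition key -- it can't search the whole dataset.
    return [ck for ck, c in base[partition_key] if c == city]

# Global index: effectively another table, city -> [(pk, ck)], spanning
# every partition. Broader reach, but extra hops in a real cluster.
global_index = {}
for pk, rows in base.items():
    for ck, city in rows:
        global_index.setdefault(city, []).append((pk, ck))

assert local_index_query("user1", "Paris") == [1]
assert sorted(global_index["Paris"]) == [("user1", 1), ("user2", 1)]
```

The rule of thumb follows directly from the model: use the local index when you already know the partition, and reserve the global index for genuinely cross-partition lookups.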
Change Data Capture
Change Data Capture, or CDC, is one of the most dramatic differences between Cassandra and Scylla. Cassandra implements CDC as a commitlog-like structure. Each node gets a CDC log, and then, when you want to query them, you have to take these structures off-box, combine them and dedupe them.
Contrast that with the design decisions that went into Scylla’s CDC implementation. First, it uses a CDC table that resides on the same nodes as the base table data, shadowing any changes to those partitions. Those CDC tables are then queryable using standard CQL.
This means the results you get are already deduped for you; no log merges are necessary. You get a stream of data, whether that includes the diffs, the pre-images, and/or the post-images, and you can consume it however you want.
We also have a TTL set on the CDC tables so they don’t grow unbounded over time.
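Here is a rough sketch of what consuming the CDC log looks like from the client’s point of view. The table and column names in the comment follow Scylla’s documented conventions, but treat them as illustrative; the Python below only simulates the part that matters, namely that changes arrive as ordered, already-deduplicated rows per stream:

```python
# Sketch of CDC consumption. In Scylla the log is a regular CQL table
# (conventionally <base_table>_scylla_cdc_log), so reading it is roughly:
#   SELECT * FROM ks.t_scylla_cdc_log WHERE "cdc$stream_id" = ?;

cdc_rows = [  # (stream_id, time, operation, data) -- made-up rows
    ("s1", 1, "INSERT", {"k": 1, "v": "a"}),
    ("s1", 2, "UPDATE", {"k": 1, "v": "b"}),
    ("s2", 1, "INSERT", {"k": 9, "v": "z"}),
]

def consume(rows, stream_id):
    """Changes for one stream, already ordered and deduped by the server."""
    return [r for r in rows if r[0] == stream_id]

changes = consume(cdc_rows, "s1")
assert [op for _, _, op, _ in changes] == ["INSERT", "UPDATE"]
```

No off-box merge step, no dedupe pass: a consumer just pages through the log table with ordinary CQL.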
This made it very easy for us to implement a Kafka CDC Source Connector based on Debezium. It simply consumes the data from the CDC tables using CQL and pumps it out to Kafka topics. This makes it very easy to integrate Scylla into your event streaming architecture. No muss, no fuss. You can also read more about how we built our CDC Kafka connector using Debezium.
Zero Copy Streaming vs. Row-level Repair
Here’s another example of a point of departure. Cassandra historically had problems with streaming SSTables. This matters when you are doing topology changes and need to bring nodes up or down and rebalance your cluster. Zero copy streaming means you can take a whole SSTable — all of its partitions — and copy it over to another node without breaking the SSTable into objects, which creates unnecessary garbage that then needs to be collected. It also avoids bringing data into userspace on the transmitting and receiving nodes. The goal was to get closer to hardware IO bounds.
Scylla, however, had already radically changed how it does internode copying. We use row-level repair instead of the standard streaming methodology. It is more robust, allowing mid-point stops and restarts of transfers; more granular, sending only the needed rows instead of the entire table; and more efficient overall.
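The core idea of row-level repair can be sketched as a digest comparison: exchange per-row hashes and transfer only rows that are missing or stale. This is a deliberate simplification; the real implementation works over token ranges with its own hashing scheme:

```python
# Simplified sketch of the diffing step behind row-level repair.
import hashlib

def digest(row: bytes) -> str:
    return hashlib.sha256(row).hexdigest()

def rows_to_send(local: dict, remote_digests: dict) -> list:
    """Keys the remote replica is missing or holds a stale copy of."""
    return sorted(
        key for key, value in local.items()
        if remote_digests.get(key) != digest(value)
    )

local = {"k1": b"v1", "k2": b"v2", "k3": b"v3"}
remote = {"k1": digest(b"v1"), "k2": digest(b"stale")}

# Only k2 (stale) and k3 (missing) cross the wire -- not the whole table.
assert rows_to_send(local, remote) == ["k2", "k3"]
```

Because the unit of transfer is a row rather than an SSTable, an interrupted repair can resume from where it stopped instead of restarting the whole stream.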
So these are fundamentally different ways to solve a problem. You can read how these different designs impacted topology change performance in our comparison of Cassandra 4.0 vs. Scylla 4.4.
Async Messaging: Netty vs. AIO
Netty Async Messaging, new in Cassandra 4.0, is a good thing. Any way to avoid blocking and bottlenecking operations is awesome. Also, the way it handles thread pools means you aren’t setting a fixed number of threads per peer, which could mismatch actual real-world requirements.
Scylla has always believed in non-blocking IO. It is famous for its “async everywhere” C++ architecture. Plus, the shard-per-core design meant that you were minimizing inter-core communications as much as possible in the first place.
Again, these are good things. But for Cassandra, async design was an evolutionary realization woven into an existing design, whereas for Scylla it was a Day One design decision, one we’ve improved upon a lot since. You can read more about what we’ve learned over six years of IO scheduling.
In summation, both databases are sort of doing the same thing, but in very different ways.
On the Kubernetes front, for Scylla we have our own Scylla Operator. So yes, Kubernetes is available for both databases, but each operator is purpose-built for its respective database.
What’s Just Totally Different?
Now let’s look at things that are just simply… different. From fundamental design decisions to implementation philosophies to even the vision of what these database platforms are and can do.
A critical Day One decision for Scylla was to build a highly distributed database upon a shared-nothing, shard-per-core architecture — the Seastar framework.
Scale it up, or scale it out, or both. Scylla is a “greedy system,” and is designed to make maximum utilization out of all the hardware you can throw at it.
Because of this, Scylla can take advantage of any size server. 100 cores per server? Sure. 1,000 cores? Don’t laugh. I know of a company working on a 2,000 core system. Such hyperscale servers will be available before you know it.
In comparison, Cassandra shards per node, not per core. It also gets relatively low utilization out of the system it’s running on. That’s just the nature of a JVM: it doesn’t give you knowledge of, or control over, the underlying hardware. This is why people seek to run multi-tenant in the box, to use all those cycles that Cassandra can’t harness.
As an aside, this is why attempts to do an “apples-to-apples” comparison of Scylla and Cassandra on the same base hardware are often skewed. Cassandra prefers running on low-density boxes, because it isn’t really capable of taking advantage of large-scale multicore servers. Scylla, however, hits its stride on denser nodes that Cassandra fails to fully utilize. You can see this density preference reflected in our “4 vs. 40” benchmark published earlier this year.
Shard-Aware CQL Drivers
While most of our focus has been on the core database itself, we also have a series of shard-aware drivers that provide you an additional performance boost. For example, check out our articles on our shard-aware Python driver — Part 1 discusses the design, and Part 2 the implementation and the performance improvements — as well as our Rust driver update and benchmarks.
Scylla’s drivers are still backward compatible with Apache Cassandra. But when they are utilized with Scylla, they provide additional performance benefits — by as much as 15% to 25%.
Alternator: Scylla’s DynamoDB-compatible API
Alternator. It’s our name for the Amazon DynamoDB-compatible API we’ve built into Scylla. This means you now have freedom. You can still run your workloads on AWS, but you might find that you get a better TCO out of our implementation running on our Scylla Cloud Database-as-a-Service instead. Or you might use it to migrate your workload to Google Cloud, or Azure, or even put it on-premises.
An interesting example of the latter is AWS Outposts. These are cages with AWS servers installed in your own premises. These servers act as an on-premises extension of AWS.
Because Scylla can be deployed anywhere, Scylla Cloud was chosen as an AWS Service Ready way to deploy your DynamoDB workloads directly into an AWS Outposts environment.
Using our CDC feature as the underlying infrastructure, we also support DynamoDB Streams. Plus, we have a load balancer to round out the same-same expectations of existing DynamoDB users. Lastly, our Scylla Spark Migrator makes it easy to take those DynamoDB workloads and place them wherever you desire.
There are many, many other things I could have picked out, but I just wanted to show this as one more example of a “quality of life” feature for the database administrators.
Seedless gossip. There’s been a lot of pain and suffering when a seed node is lost: seeds require manual assignment, and a seed node won’t just bootstrap itself. That causes real-world, real-time frustration precisely when your cluster is at its most temperamental.
That’s why one of our engineers came up with the brilliant idea of just … getting rid of seed nodes entirely. We reworked the way gossip is implemented to be more symmetric and seamless. I hope you have a chance to read this article on how it was done; I promise…it’s pretty juicy!
Discover ScyllaDB for Yourself
This has been a cursory overview and a point-in-time glimpse of how Scylla (currently on release 4.5) and Cassandra (currently on 4.0) are often feature-by-feature the same, to maintain the greatest practical level of compatibility; how they sometimes differ slightly, whether from the logistics of keeping two different open source projects in sync or from design and implementation decisions; and how and when they sometimes diverge radically.
Yet however ScyllaDB engineers may have purposefully differed from Cassandra in design or implementation, it was always done with the hope that any changes we’ve made are in your favor, as the user, and not simply done as change for change’s sake.
The post Cassandra and ScyllaDB: Similarities and Differences appeared first on ScyllaDB.
From intrusion detection, to threat analysis, to endpoint security, the effectiveness of cybersecurity efforts often boils down to how much data can be processed — in real time — with the most advanced algorithms and models.
Many factors are obviously involved in stopping cybersecurity threats effectively. However, the databases responsible for processing the billions or trillions of events per day (from millions of endpoints) play a particularly crucial role. High throughput and low latency directly correlate with better insights as well as more threats discovered and mitigated in near real time. But cybersecurity data-intensive systems are incredibly complex; many span 4+ data centers with database clusters exceeding 1000 nodes and petabytes of heterogeneous data under active management.
How do expert engineers and architects at leading cybersecurity companies design, manage, and evolve data architectures that are up to the task? Here’s a look at 3 specific examples.
Accelerating real-time threat analysis by 1000% at FireEye
Cybersecurity use case
FireEye‘s Threat Intelligence application centralizes, organizes, and processes threat intelligence data to support analysts. It does so by grouping threats using analytical correlation, and by processing and recording vast quantities of data, including DNS data, RSS feeds, domain names, and URLs. Using this array of billions of properties, FireEye’s cybersecurity analysts can explore trillions of questions to provide unparalleled visibility into the threats that matter most.
Their legacy system used PostgreSQL with a custom graph database system to store and facilitate the analysis of threat intelligence data. As the team of analysts grew into the hundreds, system limitations emerged. The graph size grew to 500M nodes, with around 1.5B edges connecting them. Each node had more than 100 associated properties accumulated over several years. The PostgreSQL-based system became slow, proved difficult to scale, and was not distributed or highly available. FireEye needed to re-architect their system on a new technology base to serve the growing number of analysts and the businesses that rely on them.
To start, the team evaluated several graph databases and selected JanusGraph. FireEye’s functional criteria included traversing speed, full/free text search, and concurrent user support. Non-functional criteria included requirements for high availability and disaster recovery, plus a pluggable storage backend for flexibility. The team felt that JanusGraph met these criteria well, and also appreciated its user-controllable indexing, schema management, triggers, and OLAP capabilities for distributed graph processing.
Next, they shifted focus to evaluating compatible backend storage solutions. With analysis speed top of mind, they looked past Java-based options. FireEye selected ScyllaDB based on its raw performance and manageability. The FireEye team chose to run ScyllaDB themselves within a secure enclave guarded by an NGINX gateway. Today, the ScyllaDB solution is deployed on AWS i3.8xlarge instances in 7-node clusters. Each node is provisioned with 32 CPUs, 244 GB of memory, and 16 TB of SSD storage.
System architecture diagram – provided by FireEye
With the new JanusGraph + ScyllaDB system, FireEye achieved a performance improvement of 100X. For example, a query traversing 15,000 graph nodes now returns results in 300ms (vs 30 seconds to 3 minutes).
Query execution comparison diagram – provided by FireEye
Moreover, they were able to dramatically slash the storage footprint while preserving the 1000-2000% performance increase they had experienced by switching to ScyllaDB. Ultimately, they reduced AWS spend to 10% of the original cost.
Scaling security from 1.5M to 100M devices at Lookout
Cybersecurity use case
Lookout leverages artificial intelligence to provide visibility and protection from network threats, web-based threats, vulnerabilities, and other risks. To protect an enterprise’s mobile devices, they ingest device telemetry and feed it across a network of services to identify risks. Low-latency is key: if a user installs an app with malware and it takes minutes to detect, that device’s information is already compromised.
Lookout needed to scale from supporting 1.5 million devices to 100 million devices without substantially increasing costs. This required ingesting more and more telemetry as it came in. They needed a highly scalable and fault-tolerant streaming framework that could process device telemetry messages and persist these messages into a scalable, fault-tolerant persistent store with support for operational queries.
Their existing solution involved Spark, DynamoDB, and ElasticSearch. However, DynamoDB alone would cost them about $1M per month if they scaled it for 100M devices. Also, DynamoDB lacked the required level of sorting for time series data. With DynamoDB, they had to use a query API that introduced a significant performance hit, impacting latency.
Lookout replaced Spark with Kafka and migrated from DynamoDB to ScyllaDB. Kafka Connect servers send data from Kafka into Scylla, then it’s pushed into ElasticSearch.
System architecture diagram – provided by Lookout
They also discovered that Kafka’s default partitioner became less and less efficient with sharding as the number of partitions grew, so they replaced the default Kafka sharding function with a murmur3 hash and put it through a consistent hashing algorithm (jump hash) to get an even distribution across all partitions.
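Jump consistent hash is small enough to show in full. Below is a straightforward Python port of the published algorithm (Lamping and Veach, “A Fast, Minimal Memory, Consistent Hash Algorithm”); the murmur3 step is omitted since it isn’t in the standard library, so assume `key` is already a 64-bit hash of the record key:

```python
def jump_hash(key: int, num_buckets: int) -> int:
    """Map a 64-bit key hash onto [0, num_buckets) consistently."""
    b, j = -1, 0
    while j < num_buckets:
        b = j
        # 64-bit linear congruential step, masked to emulate C overflow.
        key = (key * 2862933555777941757 + 1) & 0xFFFFFFFFFFFFFFFF
        j = int((b + 1) * ((1 << 31) / ((key >> 33) + 1)))
    return b

# The "consistent" property: growing the partition count only ever moves
# a key forward into one of the new partitions, never reshuffles it.
for key in (1, 12345, 2**63 - 1):
    p10 = jump_hash(key, 10)
    p11 = jump_hash(key, 11)
    assert 0 <= p10 < 10
    assert p11 == p10 or p11 == 10
```

That stability is exactly what you want under repartitioning: adding Kafka partitions disturbs only the minimal fraction of keys, instead of remapping nearly everything the way `hash % n` does.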
Lookout met its goal of growing from 1.5 million devices to 100 million devices while reining in costs. What would have cost them nearly $1M per month with DynamoDB cost them under $50K per month with ScyllaDB. They also achieved the low latency vital to effective threat detection. Their message latency is currently in the milliseconds, on average.
Achieving database independence at ReversingLabs
Cybersecurity use case
ReversingLabs is a complete advanced malware analysis platform that speeds destructive file detection through automated static analysis — enabling analysts to prioritize the highest risks with actionable detail in milliseconds. Their TitaniumCloud Reputation Services provides powerful threat intelligence solutions with up-to-date threat classification and rich context on over 20B goodware and malware files.
ReversingLabs’ database stores the outcome of multiple analyses on billions of files so that end users can perform malware detection “in the wild”: immediately checking if a file they encounter is known to be safe or suspected to be malware. If the file’s status is unknown, it needs to be analyzed then entered into the ReversingLabs database. 50+ APIs and specialized feeds connect end users to that database.
System architecture diagram – provided by ReversingLabs
Years ago, ReversingLabs built their own highly available database for this application. It’s a key-value store, based on the LSM Tree architecture. The database needed to insert/update and read large amounts of data with low latency and stable response times, and it achieved those goals.
However, as their business expanded, they wanted “database independence” to eliminate dependencies on the aging legacy database. They built a library that enabled them to connect to other databases without modifying their application layer. But connecting OSS or COTS solutions to meet their specialized needs wasn’t exactly a plug-and-play endeavor. For example, they needed:
- Key-value native protobuf format
- LZ4 compression for reduced storage size
- Latency <2 ms
- Support for record sizes ranging from 1K to 500M
The first test of their database independence strategy was adopting Scylla, with an 8-node cluster, for their two high-volume APIs. They were able to meet their specialized needs by applying strategies like:
- Using blobs for keys as well as values
- Tuning the chunk size to improve compression by 49%
- Improving performance by using NVMe disks in a RAID 0 configuration and running inserts/updates and reads at consistency level QUORUM
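To see why chunk size matters for compression ratios, here is a small experiment using zlib as a stand-in for LZ4 (which isn’t in the Python standard library); the payload is synthetic and the numbers are illustrative only:

```python
# Why chunk size tuning affects on-disk size: larger chunks give the
# compressor more context, at the cost of decompressing a bigger chunk
# on every read.
import zlib

data = b"reversinglabs-file-reputation-record;" * 4096  # ~148 KB, repetitive

def compressed_size(payload: bytes, chunk_size: int) -> int:
    """Total size after compressing the payload chunk by chunk."""
    return sum(
        len(zlib.compress(payload[i:i + chunk_size]))
        for i in range(0, len(payload), chunk_size)
    )

small = compressed_size(data, 1024)       # 1 KB chunks
large = compressed_size(data, 64 * 1024)  # 64 KB chunks
assert large < small  # bigger chunks compress this payload better
```

The right setting is workload-dependent: read-heavy point lookups favor smaller chunks, while storage-sensitive workloads like this one favor larger chunks.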
ReversingLabs successfully moved beyond their legacy database, with a highly available and scalable system that meets their specialized needs. Within the ScyllaDB database, file reputation queries had average writes latencies of <6 milliseconds and sub-millisecond average reads. For p99 (long-tail) latencies, they achieved <12 millisecond writes, and <7 millisecond reads.
In end-to-end workflow testing (including and beyond the database), average latencies were less than 120 milliseconds, and p99 latencies were 166 milliseconds. This covered the user request, authentication, request validation, the round trip to query the database, and formatting and sending the response to the application, scaled to 32 workers running in parallel.
Databases: what’s essential for cybersecurity
To wrap up, here’s a quick recap of the core database capabilities that have helped these and many other companies stop cybersecurity threats faster and more accurately:
- Consistent low-latency performance for real-time streaming analytics — to identify and prevent digital threats in real-time
- Extreme throughput and rapid scaling — to ingest billions of events across threat analysis, malware protection, and intrusion detection
- Lower total cost of ownership, reduced complexity, automated tuning, and DevOps-friendly operational capabilities — to free resources for strategic cybersecurity projects
- The ability to run at extremely high levels of utilization — to ensure consistent performance and ultra-low latency without overprovisioning and waste
- Cloud, on-premises, and a variety of hybrid topologies to support data distribution and geographic replication — for high availability, resilience, and instant access to critical data anywhere in the world
Want advice on whether a database like ScyllaDB is a good fit for your environment and your use case? Chat with us or sign up for a free technical 1:1 consultation with one of our solution architects.
The post Stopping Cybersecurity Threats: Why Databases Matter appeared first on ScyllaDB.
ScyllaDB’s Annual Conference Focuses on Database Innovations for This Next Tech Cycle
Database monsters of the world, connect! Join us at Scylla Summit, our annual user conference — a free, online virtual event scheduled for February 09-10, 2022.
Connect. Discover. Disrupt.
At Scylla Summit, you’ll be able to hear from your peers, industry experts, and our own engineers on where this next tech cycle is heading and how you can take full advantage of the capabilities of ScyllaDB and related data technologies. This is an event by and for NoSQL distributed database experts, whether your role is an architect, data engineer, app developer, DevOps, SRE or DBA.
Scylla Summit will feature all the same great technical content that has been the hallmark of our in-person events from years past, as well as opportunities to network with your industry peers from all over the world. Now’s the time to sign up yourself and encourage your team members to join you.
Keynotes and Sessions
We will once again feature keynotes from ScyllaDB founders Dor Laor and Avi Kivity, as well as sessions by our engineering staff. Hear about our past year in review, as well as our roadmap for 2022.
Plus stay tuned for a great list of external speakers — your professional peers and technology industry leaders. For example, Oxide Computer CTO Bryan Cantrill will join us to share his vision for how hardware and software are co-evolving in this next tech cycle.
The response to our Call for Speakers has been tremendous! We will soon provide a full agenda, plus upcoming blogs profiling our speakers. Yet if you wish to submit your own session, there are still a couple of days left!
Scylla Summit will be held on the following days and times:
| Wednesday, February 09, 2022 | 8:00 AM – 2:00 PM Pacific |
| Thursday, February 10, 2022 | 8:00 AM – 2:00 PM Pacific |
Within our Scylla Summit event platform, which is accessible once you register, we have a chat channel where we will offer exclusive prizes and contests, as well as access to our engineers and the Summit speakers during the event.
Look for announcements, games and opportunities to connect with other Scylla Summit attendees after registering.
We look forward to seeing you online at our Scylla Summit in February! But don’t delay — sign up for our event today!
Scylla University LIVE is your chance to take a FREE half-day of online instructor-led courses to increase your understanding of NoSQL and distributed databases and applications. Our upcoming November Scylla University LIVE event is right around the corner. We will hold two different sessions aimed at different global audiences:
- AMERICAS – Tuesday, Nov 9th – 9AM-1PM PT | 12PM-4PM ET | 1PM-5PM BRT
- EMEA and APAC – Wednesday, Nov 10th – 8:00-12:00 UTC | 9AM-1PM CET | 1:30PM-5:30PM IST
All of our sessions are led by top ScyllaDB engineers and architects and aimed at different levels of expertise and tasks, from essentials such as setting up and configuring Scylla to more advanced topics such as Change Data Capture, Scylla and Kubernetes, and using the DynamoDB API.
As a reminder, Scylla University LIVE is a FREE, half-day, instructor-led training event. It includes two parallel tracks: one aimed at beginners, covering the basics and how to get started with Scylla, and one for more experienced users, covering advanced topics and new features. The sessions in the two tracks start concurrently so that you can jump back and forth between them.
Following the sessions, we will host a roundtable discussion where you’ll have the opportunity to talk with Scylla experts and network with other users.
We’ll host the live sessions in two different time zones to better support our global community of users. Our Nov 9th training is scheduled for a time convenient in North and South America, while Nov 10th will be the same sessions scheduled for attendees in Europe and Asia.
Detailed Agenda and How to Prepare
Here are the different event sessions and the recommended material you can use to prepare.
Essentials Track
- Scylla essentials: an intro to Scylla, basic concepts, Scylla architecture, and a hands-on demo
- Data modeling: basic data modeling, definitions, basic data types, primary key selection, clustering keys, Scylla drivers, and an overview of compaction and compaction strategies
- Advanced Data Modeling: TTL, counters, Materialized Views, Secondary Indexes, Lightweight Transactions, and when to use each

Advanced Track
- Scylla and Kubernetes: Scylla Operator, Scylla deployment, Alternator deployment, maintenance, recent updates, and a hands-on demo
- CDC and Kafka: what CDC is, consuming data, under the hood, and a hands-on example
- Scylla Alternator and the DynamoDB API: an intro, when to use it, under the hood, differences from DynamoDB, and a hands-on example

Suggested learning material for each session is available on Scylla University to help you prepare.
Swag and Certification
Participants who complete the training will have access to more free, online, self-paced learning material, such as our hands-on labs on Scylla University.
Additionally, those who complete the training can earn a certification and some cool swag!