Why Teams are Eliminating External Database Caches

Often-overlooked risks related to external caches — and how three teams replaced their core database plus external cache with a single solution (ScyllaDB)

Teams often consider external caches when the existing database cannot meet the required service-level agreement (SLA). This is a clear performance-oriented decision. Putting an external cache in front of the database is commonly used to compensate for subpar latency stemming from various factors, such as inefficient database internals, driver usage, infrastructure choices, traffic spikes and so on.

Caching might seem like a fast and easy solution because the deployment can be implemented without tremendous hassle and without incurring the significant cost of database scaling, database schema redesign or even a deeper technology transformation. However, external caches are not as simple as they are often made out to be. In fact, they can be one of the more problematic components of a distributed application architecture. In some cases, an external cache is a necessary evil, such as when you require frequent access to transformed data resulting from long and expensive computations, and you’ve tried all the other means of reducing latency. But in many cases, the performance boost just isn’t worth it: you solve one problem, but create others.

Here are some often-overlooked risks related to external caches and ways three teams achieved a performance boost plus cost savings by replacing their core database and external cache with a single solution. Spoiler: They adopted ScyllaDB, a high-performance database that achieves improved long-tail latencies by tapping a specialized internal cache.

Why Not Cache

At ScyllaDB, we’ve worked with countless teams struggling with the costs, hassles and limits of traditional attempts to improve database performance. Here are the top struggles we’ve seen teams experience with putting an external cache in front of their database.

An External Cache Adds Latency

A separate cache means another hop on the way. When a cache surrounds the database, the first access occurs at the cache layer. If the data isn’t in the cache, then the request is sent to the database. This adds latency to an already slow path of uncached data. One may claim that when the entire data set fits the cache, the additional latency doesn’t come into play. However, unless your data set is quite small, storing it entirely in memory considerably magnifies costs and is thus prohibitively expensive for most organizations.

An External Cache is an Additional Cost

Caching means expensive DRAM, which translates to a higher cost per gigabyte than solid-state disks (see this P99 CONF talk by Grafana’s Danny Kopping for more details on that). Rather than provisioning an entirely separate infrastructure for caching, it is often best to use the existing database memory, and even increase it for internal caching. Modern database caches can be just as efficient as traditional in-memory caching solutions when sized correctly. When the working set size is too large to fit in memory, databases often shine at optimizing I/O access to flash storage, making the database alone (no external cache) the preferred and cheaper option.

External Caching Decreases Availability

No cache’s high availability solution can match that of the database itself. Modern distributed databases have multiple replicas; they are also topology-aware and speed-aware and can sustain multiple failures without data loss.
For example, a common replication pattern is three local replicas, which generally allows reads to be balanced across those replicas to make efficient use of your database’s internal caching mechanism. Consider a nine-node cluster with a replication factor of three: essentially, every node will hold roughly a third of your total data set. As requests are balanced among different replicas, this gives you more room for caching your data, which could completely eliminate the need for an external cache. Conversely, if an external cache happens to invalidate entries right before a surge of cold requests, availability could be impeded for a while since the database won’t have that data in its internal cache (more on this below).

Caches often lack high availability properties and can easily fail or invalidate records depending on their heuristics. Partial failures, which are more common, are even worse in terms of consistency. When the cache inevitably fails, the database will get hit by the unmitigated firehose of queries and likely wreck your SLAs. In addition, even if a cache itself has some high availability features, it can’t coordinate handling such a failure with the persistent database it sits in front of. The bottom line: rely on the database, rather than making your latency SLAs dependent on a cache.

Application Complexity — Your Application Needs to Handle More Cases

External caches introduce application and operational complexity. Once you have an external cache, it is your responsibility to keep the cache up to date with the database. Irrespective of your caching strategy (such as write-through, cache-aside, etc.), there will be edge cases where your cache can fall out of sync with your database, and you must account for these during application development. Your client settings (such as failover, retry and timeout policies) need to match the properties of both the cache and your database to function when the cache is unavailable or goes cold. Usually, such scenarios are hard to test and implement.
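To make that consistency edge case concrete, here is a minimal sketch of the cache-aside pattern in Python. The cache and database clients, table, key naming and TTL are illustrative assumptions, not part of the original article; the point is that every write path must remember to invalidate (or update) the cache, and a crash between the two steps leaves a window where readers see stale data.

import json

CACHE_TTL_SECONDS = 60  # illustrative value

class CacheAsideRepository:
    """Cache-aside: read from the cache first, fall back to the database,
    and invalidate the cache entry on every write."""

    def __init__(self, cache, db):
        self.cache = cache  # any client offering get/set/delete (assumption)
        self.db = db        # any client offering query/execute (assumption)

    def get_user(self, user_id):
        key = f"user:{user_id}"
        cached = self.cache.get(key)
        if cached is not None:
            return json.loads(cached)  # cache hit: one hop
        # Cache miss: two hops (cache, then database) -- the added latency
        # described above.
        row = self.db.query(
            "SELECT user_id, fname, lname FROM users WHERE user_id = ?", user_id)
        self.cache.set(key, json.dumps(row), ttl=CACHE_TTL_SECONDS)
        return row

    def update_user(self, user_id, fname, lname):
        self.db.execute(
            "UPDATE users SET fname = ?, lname = ? WHERE user_id = ?",
            fname, lname, user_id)
        # If the process crashes before the next line, readers keep getting the
        # stale cached value until the TTL expires -- the "out of sync" edge
        # case that the application, not the database, has to own.
        self.cache.delete(f"user:{user_id}")

Even this tiny example needs failover, retry and timeout policies for two systems instead of one, which is exactly the operational complexity described above.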
External Caching Ruins the Database Caching

Modern databases have embedded caches and complex policies to manage them. When you place a cache in front of the database, most read requests will reach only the external cache and the database won’t keep these objects in its memory. As a result, the database cache is rendered ineffective. When requests eventually reach the database, its cache will be cold and the responses will come primarily from disk. As a result, the round trip from the cache to the database and then back to the application is likely to add latency.

External Caching Might Increase Security Risks

An external cache adds a whole new attack surface to your infrastructure. Encryption, isolation and access control on data placed in the cache are likely to be different from those at the database layer itself.

External Caching Ignores The Database Knowledge And Database Resources

Databases are quite complex and built for specialized I/O workloads. Many queries access the same data, and some amount of the working set can be cached in memory to save disk accesses. A good database should have sophisticated logic to decide which objects, indexes and accesses it should cache. The database should also have eviction policies that determine when new data should replace existing (older) cached objects. An example is scan-resistant caching. When scanning a large data set, say a large range or a full-table scan, a lot of objects are read from disk. The database can recognize that this is a scan (not a regular query) and choose to leave these objects outside its internal cache. However, an external cache (following a read-through strategy) would treat the result set just like any other and attempt to cache the results.

The database automatically synchronizes the content of its cache with the disk according to the incoming request rate, so the user and the developer do not need to do anything to make sure that lookups to recently written data are performant and consistent. Therefore, if, for some reason, your database doesn’t respond fast enough, it likely means that:

• The cache is misconfigured
• It doesn’t have enough RAM for caching
• The working set size and request pattern don’t fit the cache
• The database cache implementation is poor

A Better Option: Let the Database Handle It

How can you meet your SLAs without the risks of external database caches? Many teams have found that by moving to a faster database such as ScyllaDB with a specialized internal cache, they’re able to meet their latency SLAs with less hassle and lower costs. Results vary based on workload characteristics and technical requirements, of course. But for an idea of what’s possible, consider what these teams were able to achieve.

SecurityScorecard Achieves 90% Latency Reduction with $1 Million Annual Savings

SecurityScorecard aims to make the world a safer place by transforming the way thousands of organizations understand, mitigate and communicate cybersecurity. Its rating platform is an objective, data-driven and quantifiable measure of an organization’s overall cybersecurity and cyber risk exposure.

The team’s previous data architecture served them well for a while but couldn’t keep up with their growth. Their platform API queried one of three data stores: Redis (for faster lookups of 12 million scorecards), Aurora (for storing 4 billion measurement stats across nodes), or a Presto cluster on Hadoop Distributed File System (for complex SQL queries on historical results). As data and requests grew, challenges emerged. Aurora and Presto latencies spiked under high throughput. The largest possible instance of Redis still wasn’t sufficient, and they didn’t want the complexity of working with a Redis Cluster.

To reduce latencies at the new scale that their rapid business growth required, the team moved to ScyllaDB Cloud and developed a new scoring API that routed less latency-sensitive requests to Presto and S3 storage. The resulting architecture was considerably simpler. The move resulted in:

• 90% latency reduction for most service endpoints
• 80% fewer production incidents related to Presto/Aurora performance
• $1 million infrastructure cost savings per year
• 30% faster data pipeline processing
• Much better customer experience

Read more about the SecurityScorecard use case

IMVU Reins in Redis Costs at 100X Scale

A popular social community, IMVU enables people all over the world to interact with each other using 3D avatars on their desktops, tablets and mobile devices. To meet growing requirements for scale, IMVU decided it needed a more performant solution than its previous database architecture of Memcached in front of MySQL and Redis. The team looked for something that would be easier to configure, easier to extend and, if successful, easier to scale.
“Redis was fine for prototyping features, but once we actually rolled it out, the expenses started getting hard to justify,” said Ken Rudy, senior software engineer at IMVU. “ScyllaDB is optimized for keeping the data you need in memory and everything else on disk. ScyllaDB allowed us to maintain the same responsiveness for a scale a hundred times what Redis could handle.”

Comcast Reduces Long Tail Latencies 95% with $2.5 Million Annual Savings

Comcast is a global media and technology company with three primary businesses: Comcast Cable, one of the United States’ largest video, high-speed internet and phone providers to residential customers; NBCUniversal; and Sky.

Comcast’s Xfinity service serves 15 million households with more than 2 billion API calls (reads/writes) and over 200 million new objects per day. Over seven years, the project expanded from supporting 30,000 devices to more than 31 million. Cassandra’s long tail latencies proved unacceptable at the company’s rapidly increasing scale. To mask Cassandra’s latency issues from users, the team placed 60 cache servers in front of their database. Keeping this cache layer consistent with the database was causing major admin headaches. Since the cache and related infrastructure had to be replicated across data centers, Comcast needed to keep caches warm. They implemented a cache warmer that examined write volumes, then replicated the data across data centers.

After struggling with the overhead of this approach, Comcast soon moved to ScyllaDB. Designed to minimize latency spikes through its internal caching mechanism, ScyllaDB enabled Comcast to eliminate the external caching layer, providing a simple framework in which the data service connects directly to the data store. Comcast was able to replace 962 Cassandra nodes with just 78 ScyllaDB nodes. They improved overall availability and performance while completely eliminating the 60 cache servers. The result: 95% lower P99, P999 and P9999 latencies with the ability to handle over twice the requests – at 60% of the operating costs. This ultimately saved them $2.5 million annually in infrastructure costs and staff overhead.

Closing Thoughts

Although external caches are a great companion for reducing latencies (such as serving static content and personalization data not requiring any level of durability), they often introduce more problems than benefits when placed in front of a database. The top tradeoffs include elevated costs, increased application complexity, additional round trips to your database and an additional security surface area. By rethinking your existing caching strategy and switching to a modern database providing predictable low latencies at scale, teams can simplify their infrastructure and minimize costs. And at the same time, they can still meet their SLAs without the extra hassles and complexities introduced by external caches.

ScyllaDB as a DynamoDB Alternative: Frequently Asked Questions

A look at the top questions engineers are asking about moving from DynamoDB to ScyllaDB to reduce costs, avoid throttling, and avoid cloud vendor lock-in

A great thing about working closely with our community is that I get a chance to hear a lot about their needs and – most importantly – listen to and take in their feedback. Lately, we’ve seen growing interest from organizations considering ScyllaDB as a means to replace their existing DynamoDB deployments and, as happens with any new tech stack, some frequently recurring questions. 🙂

ScyllaDB provides you with multiple ways to get started: you can choose its CQL protocol or ScyllaDB Alternator. CQL refers to the Cassandra Query Language, a NoSQL interface that is intentionally similar to SQL. ScyllaDB Alternator is ScyllaDB’s DynamoDB-compatible API, aiming at full compatibility with the DynamoDB protocol. Its goal is to provide a seamless transition from AWS DynamoDB to a cloud-agnostic or on-premises infrastructure while delivering predictable performance at scale.

But which protocol should you choose? What are the main differences between ScyllaDB and DynamoDB? And what does a typical migration path look like? Fear no more, young sea monster! We’ve got you covered. I personally want to answer some of these top questions right here, and right now.

Why switch from DynamoDB to ScyllaDB?

If you are here, chances are that you fall under at least one of the following categories:

• Costs are running out of control
• Latency and/or throughput are suboptimal
• You are currently locked in and would like a bit more flexibility

ScyllaDB delivers predictable low latency at scale with less infrastructure required. For DynamoDB specifically, we have an in-depth article covering which pain points we address.

Is ScyllaDB Alternator a DynamoDB drop-in replacement?

In the term’s strict sense, it is not: notable differences exist across the two solutions. DynamoDB development is closed source and driven by AWS (with which ScyllaDB is not affiliated), which means there’s a chance that some specific features launched in DynamoDB may take some time to land in ScyllaDB.

A more accurate way to describe it is as an almost drop-in replacement. Whenever you migrate to a different database, some degree of change will always be required to get started with the new solution. We try to keep the level of changes to a minimum to make the transition as seamless as possible. For example, Digital Turbine easily migrated from DynamoDB to ScyllaDB within just a single two-week sprint, with the results showing significant performance improvements and cost savings.

What are the main differences between ScyllaDB Alternator and AWS DynamoDB?

• Provisioning: In ScyllaDB you provision nodes, not tables. In other words, a single ScyllaDB deployment is able to host several tables and serve traffic for multiple workloads combined.
• Load Balancing: Application clients do not route traffic through a single endpoint as in AWS DynamoDB (dynamodb.<region_name>.amazonaws.com). Instead, clients may use one of our load balancing libraries, or implement a server-side load balancer.
• Limits: ScyllaDB does not impose a 400KB limit per item, nor any partition access limits.
• Metrics and Integration: Since ScyllaDB is not a “native AWS service,” it naturally does not integrate in the same way as other AWS services (such as CloudWatch) do with DynamoDB. For metrics specifically, ScyllaDB provides the ScyllaDB Monitoring Stack with specific dashboards for DynamoDB deployments.
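To illustrate the load balancing difference above, here is a minimal sketch of how an application might point an existing DynamoDB client at Alternator instead of the AWS endpoint. It uses Python with boto3; the node address, table name and dummy credentials are placeholder assumptions, and production deployments would typically use ScyllaDB’s load balancing helper libraries rather than a single hard-coded node.

import boto3

# Instead of the single AWS endpoint (dynamodb.<region>.amazonaws.com),
# point the client at one of your ScyllaDB Alternator nodes.
# The address and credentials below are placeholders.
dynamodb = boto3.resource(
    "dynamodb",
    endpoint_url="http://10.0.0.1:8000",  # Alternator node, not AWS
    region_name="us-east-1",              # required by boto3, not used for routing
    aws_access_key_id="none",
    aws_secret_access_key="none",
)

# The rest of the application code stays DynamoDB-native.
# Assumes a table "user_events" with partition key "user_id" already exists.
table = dynamodb.Table("user_events")
table.put_item(Item={"user_id": "42", "event": "login"})
print(table.get_item(Key={"user_id": "42"})["Item"])

Swapping the endpoint (plus proper load balancing) is essentially what makes Alternator an “almost drop-in” replacement: the query code itself does not change.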
When should I use the DynamoDB API instead of CQL?

Whenever you’re interested in moving away from DynamoDB (either to remain in AWS or to move to another cloud), and you either:

• Have zero interest in refactoring your code to a new API, or
• Plan to get started with or evaluate ScyllaDB prior to major code refactoring.

For example, you would want to use the DynamoDB API when hundreds of independent Lambda services communicating with DynamoDB would require quite an effort to refactor, or when you rely on a connector that doesn’t provide compatibility with the CQL protocol. For all other cases, CQL is likely to be a better option. Check out our protocol comparison for more details.

What is the level of effort required to migrate to ScyllaDB?

Assuming that all features required by the application are supported by ScyllaDB (irrespective of which API you choose), the level of effort should be minimal. The process typically involves lifting your existing DynamoDB tables’ data and then replaying changes from DynamoDB Streams to ScyllaDB. Once that is complete, you update your application to connect to ScyllaDB.

I once worked with an AdTech company that chose CQL as their protocol. This obviously required code refactoring to adhere to the new query language specification. On the other hand, a mobile platform company decided to go with ScyllaDB Alternator, eliminating the need for data transformations during the migration and for application code changes.

Is there a tool to migrate from DynamoDB to ScyllaDB Alternator?

Yes. The ScyllaDB Migrator is a Spark-based tool available to perform end-to-end migrations. We also provide relevant material and hands-on assistance for migrating to ScyllaDB using alternative methods, as relevant to your use case.

I currently rely on DynamoDB autoscaling; how does that translate to ScyllaDB?

More often than not, you shouldn’t need it. Autoscaling is not free (there’s idle infrastructure reserved for you), and it requires considerable “time to scale,” which may end up ruining end users’ experience. A small ScyllaDB cluster alone should be sufficient to deliver tens to hundreds of thousands of operations per second – and a moderately sized one can easily achieve over a million operations per second. That being said, the best practice is to be provisioned for the peak.

What about DynamoDB Accelerator (DAX)?

ScyllaDB implements a row-based cache, which is just as fast as DAX. We follow a read-through caching strategy (unlike the DAX write-through strategy), resulting in less write amplification, simplified cache management, and lower latencies. In addition, ScyllaDB’s cache is bundled within the core database, not offered as a separate add-on like DynamoDB Accelerator.

Which features are (not) available?

The ScyllaDB Alternator Compatibility page contains a detailed breakdown of not-yet-implemented features. Keep in mind that some features might just be missing the DynamoDB API implementation; you might be able to achieve the same functionality in ScyllaDB in other ways. If any particular missing feature is critical for your ScyllaDB adoption, please let us know.

How safe is this? Really?

ScyllaDB Alternator has been production-ready since 2020, with leading organizations running it in production both on-premises and in ScyllaDB Cloud. Our DynamoDB-compatible API is extensively tested and its code is open source.

Next Steps

If you’d like to learn more about how to succeed during your DynamoDB to ScyllaDB journey, I highly encourage you to check out our recent ScyllaDB Summit talk.
For a detailed performance comparison across both solutions, check out our ScyllaDB Cloud vs DynamoDB benchmark. If you are still unsure whether a change makes sense, you might want to read the top reasons why teams decide to abandon the DynamoDB ecosystem. If you’d like a high-level overview of how to move away from DynamoDB, refer to our DynamoDB: How to Move Out? article. When you are ready for your actual migration, check out our in-depth walkthrough of an end-to-end DynamoDB to ScyllaDB migration. Chances are that I did not address some of your more specific questions (sorry about that!), in which case you can always book a 1:1 Technical Consultation with me so we can discuss your specific situation thoroughly. I’m looking forward to it!

Better “Goodput” Performance through C++ Exception Handling

Some very strange error reporting performance issues we uncovered while working on per-partition query rate limiting — and how we addressed them through C++ exception handling

In Retaining Goodput with Query Rate Limiting, I introduced why and how ScyllaDB implemented per-partition rate limiting, which maintains stable goodput under stress by rejecting excess requests. The implementation works by estimating request rates for each partition and making decisions based on those estimates. Benchmarks confirmed that rate limiting restored goodput, even under stressful conditions, by rejecting problematic requests.

However, there was actually an interesting behind-the-scenes twist in that story. After we first coded the solution, we discovered that the performance wasn’t as good as expected. In fact, we were achieving the opposite effect of what we expected: rejected operations consumed more CPU than successful ones. This was very strange because tracking per-partition hit counts shouldn’t have been compute-intensive. It turned out that the problem was related to how ScyllaDB reports failures: namely, via C++ exceptions. In this article, I will share more details about the error reporting performance issues that we uncovered while working on per-partition query rate limiting and explain how we addressed them.

On C++ (and Seastar) Exceptions

Exceptions are notorious in the C++ community. There are many problems with their design, but the most important and relevant one here is the unpredictable and potentially degraded performance that occurs when an exception is actually thrown. Some C++ projects disable exceptions altogether for those reasons. However, it’s hard to avoid exceptions because they’re thrown by the standard library and because they interact with other language features.

Seastar, an open-source C++ framework for I/O-intensive asynchronous computing, embraces exceptions and provides facilities for handling them correctly. Since ScyllaDB is built on Seastar, the ScyllaDB engineering team is rather accustomed to reporting failures via exceptions. They work fine, provided that errors aren’t very frequent. Under overload, it’s a different story, though. We noticed that throwing exceptions in large volumes can introduce a performance bottleneck. This problem affects both existing errors such as timeouts and the new “rate limit exception” that we tried to introduce.

This wasn’t the first time that we encountered performance issues with exceptions. In libstdc++, the standard library for GCC, throwing an exception involves acquiring a global mutex. That mutex protects some information important to the runtime that can be modified when a dynamic library is being loaded or unloaded. ScyllaDB doesn’t use dynamic libraries, so we were able to disable the mutex with some clever workarounds. As an unfortunate side effect, we disabled some caching functionality that speeds up subsequent exception throws. However, avoiding scalability bottlenecks was more important to us.

Exceptions are usually propagated up the call stack until a try..catch block is encountered. However, in programs with non-standard control flow, it sometimes makes sense to capture the exception and rethrow it elsewhere. For this purpose, std::exception_ptr can be used. Our Seastar framework is a good example of code that uses it. Seastar allows running concurrent, asynchronous tasks that can wait for each other’s results via objects called futures.
If a task results in an exception, it is stored in the future as std::exception_ptr and can be later inspected by a different task.

future<> do_thing() {
    return really_do_thing().finally([] {
        std::cout << "Did the thing\n";
    });
}

future<> really_do_thing() {
    if (fail_flag) {
        return make_exception_future<>(
                std::runtime_error("oh no!"));
    } else {
        return make_ready_future<>();
    }
}

Seastar nicely reduces the time spent in the exception handling runtime. In most cases, Seastar is not interested in the exception itself; it only cares about whether a given future contains an exception. Because of that, the common control flow primitives such as then, finally, etc. do not have to rethrow and inspect the exception. They only check whether an exception is present. Additionally, it’s possible to construct an “exceptional future” directly, without throwing the exception.

Unfortunately, the standard library doesn’t make it possible to inspect an std::exception_ptr without rethrowing it. Because each exception must be inspected and handled appropriately at some point (sometimes more than once), it’s impossible for us to eliminate throws if we want to use exceptions and have portable code. We had to provide a cheaper way to return timeouts and “rate limit exceptions.” We tried two approaches.

Approach 1: Avoid The Exceptions

The first approach we explored was quite heavy-handed. It involved plowing through the code and changing all the important functions to return a boost::result. That’s a type from the Boost library that holds either a successful result or an error. We also introduced a custom container for holding exceptions that allows inspecting the exception inside it without having to throw it.

future<result<>> do_thing() {
    return really_do_thing().then(
        [] (result<> res) -> result<> {
            if (res) {
                // handle success
            } else {
                // handle failure
            }
        }
    );
}

This approach worked, but had numerous drawbacks. First, it required a lot of work to adapt the existing code: return types had to be changed, and the boost::result values returned by the functions had to be explicitly checked to see whether they held an exception or not. Moreover, those checks introduced a slight overhead on the happy path. We applied this approach to the coordinator logic.

Approach 2: Implement the Missing Parts

The other approach was to create our own implementation for inspecting the value of an std::exception_ptr. We encountered a proposal to extend std::exception_ptr’s interface with the possibility to inspect its value without rethrowing it. The proposal includes an experimental implementation that uses standard-library-specific and ABI-specific constructs to implement it for some of the major compilers. Inspired by it, we implemented a small utility function that allows us to match on the exception’s type and inspect its value, and we replaced a bunch of try..catch blocks with it.

// The exception types shown here are illustrative; in practice the utility is
// used with ScyllaDB's own error types.
std::exception_ptr ep = get_exception();
if (auto* ex = try_catch<std::runtime_error>(ep)) {
    // ...
} else if (auto* ex = try_catch<std::invalid_argument>(ep)) {
    // ...
} else {
    // ...
}

This is how we eliminated throws in the most important parts of the replica logic. It was much less work than the previous approach since we only had to get rid of try..catch blocks and make sure that exceptions were propagated using the existing, non-throwing primitives. While the solution is not portable, it works for us since we use and support only a single toolchain at a time to build ScyllaDB.
However, it can be easy to accidentally reintroduce performance problems if you’re not careful and forget to use the non-throwing primitives, including the new try_catch. The first approach does not have this problem.

Results: Better Exception Handling, Better Throughput (and Goodput)

To measure the results of these optimizations on the exception handling path, we ran a benchmark. We set up a cluster of three i3.4xlarge AWS nodes with ScyllaDB 5.1.0-rc1 and pre-populated it with a small data set that fits in memory. Then we ran a write workload against a single partition. The data included in each request is very small, so the workload was CPU-bound, not I/O-bound. Each request had a short server-side timeout of 10 ms. The benchmarking application gradually increases its write rate from 50k writes per second to 100k.

The interesting part is what happens when ScyllaDB is no longer able to handle the throughput. We ran the same benchmark twice: once against ScyllaDB 5.0.3, which handles timeouts in the old, slow way, and once against ScyllaDB 5.1.0-rc1, which has the exception handling improvements.

The older version of ScyllaDB became overwhelmed somewhere around the 60k writes per second mark. We were pushing the shard to its limit, and some requests inevitably failed because we set a short timeout. Since processing a timed-out request is more costly than processing a regular request, the shard very quickly entered a state where all the requests it processed resulted in timeouts. The new version, with better exception handling performance, was able to sustain a higher throughput. Some timeouts occurred, but they did not overwhelm the shard completely. Timeouts did not become prominent until around the 100k requests per second mark.

Summary

We discovered unexpected performance issues while implementing a special case of ScyllaDB’s per-partition query rate limiting. Despite Seastar’s support for handling exceptions efficiently, throwing exceptions in large volumes during overload situations became a bottleneck, affecting both existing errors like timeouts and the new exceptions introduced for rate limiting. We solved the issue by using two different approaches in different parts of the code:

• Change the code to pass information about timeouts by returning boost::result instead of throwing exceptions.
• Implement a utility function that allows catching exceptions cheaply without throwing them.

Benchmarks with ScyllaDB versions 5.0.3 and 5.1.0-rc1 demonstrated significant improvements in throughput with the better exception handling. While the older version struggled and became overwhelmed around 60k writes per second, the newer version sustained higher throughput with fewer timeouts, even at 100k requests per second. Ultimately, these optimizations not only improved exception handling performance but also contributed to better throughput and “goodput” under high-load scenarios.

ScyllaDB University Updates, High Availability Lab & NoSQL Training Events

There are many paths to adopting and mastering ScyllaDB, but the vast majority of them include a visit to ScyllaDB University. I want to update the ScyllaDB community about recent changes to ScyllaDB University, introduce the High Availability lab, and share some future events and plans.

What is ScyllaDB University?

For those new to ScyllaDB University, it’s an online learning and training center for ScyllaDB. The material is self-paced and includes theory, quizzes, and hands-on labs that you can run yourself. Access is free; to get started, just create an account. Start Learning

ScyllaDB University has six courses and covers most ScyllaDB-related material. As the product evolves, the training material is updated and enhanced. In addition to the self-paced, on-demand material available on ScyllaDB University, we host live training events. These virtual and highly interactive events offer special opportunities for users to learn from leading ScyllaDB engineers and experts, ask questions, and connect with their peers.

The last ScyllaDB University LIVE event took place in March. It had two tracks: an Essentials track covering basic content, and an Advanced track covering the latest features and more complex use cases. The next event in the US/EMEA time zones will be on June 18th. Join us at ScyllaDB University LIVE

Also, if you’re in India, we’re hosting a special event for you! On May 29, 9:30am-11:30am IST, join us for an interactive workshop where we’ll go hands-on to build and interact with high-performance apps using ScyllaDB. This is a great way to discover the NoSQL strategies used by top teams and apply them in a guided, supportive environment. Register for ScyllaDB Labs India

Example Lab: High Availability

Here’s a taste of the type of content you will find on ScyllaDB University. The High Availability lab is one of the most popular labs on the platform. It demonstrates what happens when we read and write data at different consistency levels while simulating a situation where some of the replica nodes are unavailable.

ScyllaDB, at its core, is designed to be highly available. With the right settings, it will remain up and available at all times. The database features a leaderless, peer-to-peer architecture and uses the Gossip protocol for internode communication, which enhances resiliency and scalability. Upon startup and during topology changes such as node addition or removal, nodes automatically discover and communicate with each other, streaming data between them as needed.

Multiple copies of the data (replicas) are kept. This is determined by setting a replication factor (RF), which ensures data is stored on multiple nodes, thereby allowing for data redundancy and high availability. Even if a node fails, the data is still available on other nodes, reducing downtime risks.

Topology awareness is part of the design. ScyllaDB enhances its fault tolerance through rack and data center awareness, allowing distribution across multiple locations. It supports multi-datacenter replication with customizable replication factors for each site, ensuring data resiliency and availability even in the case of significant failures of entire data centers.

First, we’ll bring up a three-node ScyllaDB Docker cluster. Set up a Docker container with one node, called Node_X:

docker run --name Node_X -d scylladb/scylla:5.2.0 --overprovisioned 1 --smp 1

Create two more nodes, Node_Y and Node_Z, and add them to the cluster of Node_X.
The command "$(docker inspect --format='{{ .NetworkSettings.IPAddress }}' Node_X)" resolves to the IP address of Node_X:

docker run --name Node_Y -d scylladb/scylla:5.2.0 --seeds="$(docker inspect --format='{{ .NetworkSettings.IPAddress }}' Node_X)" --overprovisioned 1 --smp 1
docker run --name Node_Z -d scylladb/scylla:5.2.0 --seeds="$(docker inspect --format='{{ .NetworkSettings.IPAddress }}' Node_X)" --overprovisioned 1 --smp 1

Wait a minute or so and check the node status:

docker exec -it Node_Z nodetool status

You’ll see that eventually all the nodes have UN for status. U means up, and N means normal. Read more about nodetool status here.

Once the nodes are up and the cluster is set, we can use the CQL shell to create a table. Run a CQL shell:

docker exec -it Node_Z cqlsh

Create a keyspace called “mykeyspace” with a replication factor of three:

CREATE KEYSPACE mykeyspace WITH REPLICATION = { 'class' : 'NetworkTopologyStrategy', 'replication_factor' : 3};

Next, create a table with three columns – user id, first name, and last name:

use mykeyspace;
CREATE TABLE users ( user_id int, fname text, lname text, PRIMARY KEY((user_id)));

Insert two rows into the newly created table:

insert into users(user_id, fname, lname) values (1, 'rick', 'sanchez');
insert into users(user_id, fname, lname) values (4, 'rust', 'cohle');

Read the table contents:

select * from users;

Now we will see what happens when we try to read and write data to our table with varying consistency levels and when some of the nodes are down. Let’s make sure all our nodes are still up:

docker exec -it Node_Z nodetool status

Now, set the consistency level to QUORUM and perform a write:

docker exec -it Node_Z cqlsh
use mykeyspace;
CONSISTENCY QUORUM
insert into users (user_id, fname, lname) values (7, 'eric', 'cartman');

Read the data to see if the insert was successful:

select * from users;

The read and write operations were successful. What do you think would happen if we did the same thing with consistency level ALL?

CONSISTENCY ALL
insert into users (user_id, fname, lname) values (8, 'lorne', 'malvo');
select * from users;

The operations were successful again. CL=ALL means that we have to read/write to the number of nodes according to the replication factor, three in our case. Since all nodes are up, this works.

Next, we’ll take one node down and check read and write operations with consistency levels QUORUM and ALL. Take down Node_Y and check the status (it might take some time until the node is actually down):

exit
docker stop Node_Y
docker exec -it Node_Z nodetool status

Now, set the consistency level to QUORUM and perform a write:

docker exec -it Node_Z cqlsh
CONSISTENCY QUORUM
use mykeyspace;
insert into users (user_id, fname, lname) values (9, 'avon', 'barksdale');

Read the data to see if the insert was successful:

select * from users;

With CL = QUORUM, the read and write were successful. What will happen with CL = ALL?

CONSISTENCY ALL
insert into users (user_id, fname, lname) values (10, 'vm', 'varga');
select * from users;

Both the read and the write fail. CL = ALL requires that we read/write to three nodes (based on RF = 3), but only two nodes are up.
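The same experiment can also be run programmatically. Below is a minimal sketch using the Python driver (cassandra-driver); the contact point IP and the inserted values are placeholders, and it assumes the keyspace and table created above. It shows how the consistency level is set per statement, which determines how many replicas must acknowledge each operation.

from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

# Placeholder contact point: use the IP address of any node in the cluster.
cluster = Cluster(["172.17.0.2"])
session = cluster.connect("mykeyspace")

insert = SimpleStatement(
    "INSERT INTO users (user_id, fname, lname) VALUES (%s, %s, %s)",
    consistency_level=ConsistencyLevel.QUORUM,  # needs 2 of 3 replicas (RF=3)
)
session.execute(insert, (13, "omar", "little"))

read_all = SimpleStatement(
    "SELECT * FROM users",
    consistency_level=ConsistencyLevel.ALL,  # needs all 3 replicas to respond
)
try:
    print(list(session.execute(read_all)))
except Exception as exc:
    # With one replica down, CL=ALL fails with an unavailable/timeout error,
    # mirroring the failure seen in cqlsh above.
    print("CL=ALL failed:", exc)

cluster.shutdown()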
What happens if another node is down? Take down Node_Z and check the status:

exit
docker stop Node_Z
docker exec -it Node_X nodetool status

Now, set the consistency level to QUORUM and perform a read and a write:

docker exec -it Node_X cqlsh
CONSISTENCY QUORUM
use mykeyspace;
insert into users (user_id, fname, lname) values (11, 'morty', 'smith');
select * from users;

With CL = QUORUM, the read and write fail. Since RF=3, QUORUM means at least two nodes need to be up; in our case, just one is. What will happen with CL = ONE?

CONSISTENCY ONE
insert into users (user_id, fname, lname) values (12, 'marlo', 'stanfield');
select * from users;

With CL = ONE, the read and write succeed: only a single replica needs to acknowledge each operation, and one node is still up.

Check out the full High Availability lesson and lab on ScyllaDB University, with expected outputs, a full explanation of the theory, and an option for seamlessly running the lab within a browser, regardless of the platform you’re using.

Next Steps and Upcoming NoSQL Events

A good starting point on ScyllaDB University is the ScyllaDB Essentials – Overview of ScyllaDB and NoSQL Basics course. There, you can find more hands-on labs like the one above and learn about distributed databases, NoSQL, and ScyllaDB. ScyllaDB is backed by a vibrant community of users and developers. If you have any questions about this blog post or about a specific lesson, or want to discuss it, you can write on the ScyllaDB community forum. Say hello here.

The next live training events are taking place in:

• An Asia-friendly time zone on the 29th of May
• US/Europe, Middle East, and Africa time zones on the 18th of June

Stay tuned for more details!

Top 5 Questions We’re Asked About Apache Cassandra® 5.0

Apache Cassandra® version 5.0 Beta 1.0 is now available in public preview on the Instaclustr Managed Platform!   

Here at NetApp we are often asked about the latest releases in the open source space. This year, the biggest news is going to be the release of Apache Cassandra 5.0, and it has already garnered a lot of attention since its beta release. Apache Cassandra has been a go-to choice for distributed, highly scalable databases since its inception, and it has evolved over time, catering to the changing needs of its customers.  

As we draw closer to the general availability release of Cassandra 5.0, we have been receiving many questions from our customers regarding some of the features, use cases and benefits of opting for Cassandra 5.0 and how it can help support their growing needs for scalability, performance, and advanced data analysis.  

Let’s dive into the details of some of the frequently asked questions. 

#1: Why should I upgrade my clusters to Cassandra 5.0? 

Upgrading your clusters to Cassandra 5.0 offers several key advantages: 

  • New features: Cassandra 5.0 comes with a host of new and exciting features that include storage attached indexes, unified compaction strategy, and vector search. These features will significantly improve performance, optimize storage costs and pave the way for AI/ML applications.
  • Stability and Bug fixes: Cassandra 5.0 brings more stability through bug fixes and added support for more guardrails. Additional information on the added guardrails can be found here. 
  • Apache Cassandra versions 3.0 and 3.11 will reach end of life (EOL) with the General Availability release of Cassandra 5.0: as is standard practice, the project will no longer maintain these older versions (full details of the announcement can be found on the Apache Cassandra website). We understand the challenges this presents to customers, and NetApp will provide our customers with extended support for these versions for 12 months beyond the Apache Foundation project dates. Our extended support is provided to enable customers to plan their migrations with confidence. Read more about our lifecycle policies on the website. 

#2: Is it possible to upgrade my workload from Cassandra 3.x to Cassandra 5.0? 

You can upgrade Cassandra from one major version to any release of the next major version. For example, you can upgrade a cluster running any 3.x release to any 4.x release. However, upgrading across non-adjacent major versions is not supported.  

If you want to upgrade by more than one major version increment, you need to upgrade to an intermediate major version first. For example, to upgrade from a 3.x release to a 5.0 release, you must upgrade the entire cluster twice: first to 4.x (preferably the latest release: 4.1.4), then again to 5.0. 

#3: What are the key changes in Apache Cassandra 5.0 compared to previous versions?

For several years, Apache Cassandra 3.x has been a major version for many customers, noted for its stability and faster import/exports. Apache Cassandra 4 focused on significantly enhancing performance with a range of enterprise-grade feature additions. However, Cassandra 5.0 introduces new features that are future-driven and open up numerous new use cases.  

Let’s take a look at what’s new: 

  • Advanced data analysis and a pathway to AI through features such as vector search and storage-attached indexing (SAI). To learn more about SAI, visit our dedicated blog on the topic. 
  • Enhanced performance and efficiency:
    • The Unified Compaction Strategy and SAI directly address the need for more efficient data management and retrieval, optimizing resource utilization and improving overall system performance and efficiency. 
    • Trie-based memtables and SSTables optimize read/write operations and storage efficiency. 
  • Security and Flexibility:
    • Dynamic Data Masking (DDM) enhances data privacy and security by allowing sensitive data to be masked from unauthorized access.
    • While Cassandra 4.0 supported JDK 11, Cassandra 5.0 added experimental support for JDK 17, allowing its users to leverage the latest Java features and improve performance and security. It is not currently recommended to use JDK 17 in your production environment. 
    • The introduction of new mathematical CQL functions (abs, exp, log, log10 and round) and new collection-level aggregation functions (count, max/min and sum/avg over the elements of a collection) has expanded Cassandra’s capabilities to handle complex data operations. These functions will help developers handle numerical operations efficiently and contribute to application performance. Read more about the enhanced Mathematical CQL capability on the Apache website. (A short example follows this list.)
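As a quick illustration of the new mathematical functions, here is a hedged sketch that runs a couple of 5.0-style queries through the Python driver. The contact point, keyspace and table are assumptions made up for the example; the CQL strings are the part being demonstrated.

from cassandra.cluster import Cluster

# Placeholder contact point and schema, made up for this example.
cluster = Cluster(["127.0.0.1"])
session = cluster.connect()

session.execute("""
    CREATE KEYSPACE IF NOT EXISTS demo
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
""")
session.execute("""
    CREATE TABLE IF NOT EXISTS demo.readings (
        sensor_id int PRIMARY KEY,
        value double
    )
""")
session.execute("INSERT INTO demo.readings (sensor_id, value) VALUES (1, -2.7)")

# Mathematical scalar functions introduced in Cassandra 5.0.
row = session.execute(
    "SELECT abs(value), round(value), exp(value) "
    "FROM demo.readings WHERE sensor_id = 1"
).one()
print(row)  # e.g., abs = 2.7, round = -3, exp ≈ 0.067

cluster.shutdown()

The collection-level aggregates mentioned above work in a similar spirit, operating over the elements of a list, set or map column within a single row.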

Apache Cassandra 5.0 introduces hundreds of changes, including minor improvements to brand-new features. Visit the website for a complete list of changes. 

#4: What features in Cassandra 5.0 will help me save money? 

While Cassandra 5.0 promises exciting features to modernize and make you future-ready, it also includes some features that are going to help you manage your infrastructure better and eventually lower your operational costs. A couple of them include: 

  • Storage-attached indexes (SAI) can reduce the resource utilization associated with read operations by providing a more efficient way to index and query data, helping with managing increasing operational costs. For some use cases, smaller cluster sizes or node types may be able to achieve the same performance levels. 
  • The Unified Compaction Strategy optimizes the way Cassandra handles the compaction process. By automating and improving compactions, UCS can help reduce storage requirements and lower I/O overhead, resulting in lower operational costs and performance improvements. Reach out to Instaclustr Support to understand how you can adopt the Unified Compaction Strategy for your Cassandra clusters. 
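For reference, switching an existing table to the Unified Compaction Strategy is a regular schema change. The snippet below is a minimal sketch using the Python driver; the contact point and table name are placeholders, and you should validate UCS settings for your workload (or with Instaclustr Support) before applying them in production.

from cassandra.cluster import Cluster

# Placeholder contact point and table, assumed for illustration.
cluster = Cluster(["127.0.0.1"])
session = cluster.connect()

# Move an existing table from its current compaction strategy to UCS.
session.execute("""
    ALTER TABLE demo.readings
    WITH compaction = {'class': 'UnifiedCompactionStrategy'}
""")

# Confirm the change by reading the schema metadata.
print(session.execute(
    "SELECT compaction FROM system_schema.tables "
    "WHERE keyspace_name = 'demo' AND table_name = 'readings'"
).one())

cluster.shutdown()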

#5: How can Apache Cassandra help us in our AI/ML journey? With Cassandra 5.0, what are the applications and advantages of the new vector data type and similarity functions?

Some of our customers are well-ahead in their AI/ML journey, understand different use cases and know where they’d like to go, while others are still catching up.  

Apache Cassandra can help you get started on your AI journey with the introduction of vector search capabilities in Cassandra 5.0. The new vector data type and similarity functions, combined with Storage Attached Indexes (SAI), are designed to handle complex and high-dimensional data.  

Vector search is a powerful technique for finding relevant content within large datasets that are either structured (floats, integers, or full strings) or unstructured (such as audio, video, pictures).  

This isn’t something totally new; it has been around for many years. It has evolved from a theoretical mathematical model to a key foundational technology today underlying recent AI/ML work and data storage.  

Vector search plays a pivotal role in AI applications across industries by enabling efficient similarity-based querying in areas such as: 

  • Generative AI, Large Language Models (LLMs)
  • Retrieval-Augmented Generation (RAG)
  • Natural Language Processing (NLP)
  • Geographical Information Systems (GIS) applications 

These applications can benefit from advanced data analysis, creative content generation, semantic search and spatial analysis capabilities provided by vector search. 

Vector search bridges complex data processing with practical applications, transforming the user experience and operational efficiencies across sectors. Cassandra will be an exceptional choice for vector search and AI workloads due to its distributed, highly available and scalable architecture, which is crucial for handling large datasets. 

According to a report published by MarketsandMarkets:

“The demand for vector search is going to increase exponentially.

The global vector database market size is expected to grow from a little over USD 1.5 billion in 2024 to USD 4.3 billion by 2028 at a CAGR of 23.3% during the forecast period.”

As this feature is new to Cassandra 5.0, it is important to test it thoroughly before gradually integrating it into the production environment. This will allow you to explore the full capabilities of vector search while ensuring that it meets your specific needs. 
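As a starting point for that testing, here is a minimal sketch of the 5.0 vector search workflow using the Python driver (a recent driver version with vector type support is assumed). The schema, index name and embedding values are illustrative assumptions; the key pieces are the vector data type, the storage-attached index, and the ANN ordering clause.

from cassandra.cluster import Cluster

# Placeholder contact point and keyspace, made up for this sketch.
cluster = Cluster(["127.0.0.1"])
session = cluster.connect("demo")

session.execute("""
    CREATE TABLE IF NOT EXISTS products (
        id int PRIMARY KEY,
        name text,
        embedding vector<float, 3>   -- tiny dimension, for illustration only
    )
""")
# A storage-attached index (SAI) on the vector column enables ANN queries.
session.execute("""
    CREATE INDEX IF NOT EXISTS products_embedding_idx
    ON products (embedding) USING 'sai'
""")

session.execute(
    "INSERT INTO products (id, name, embedding) VALUES (%s, %s, %s)",
    (1, "red shirt", [0.1, 0.9, 0.0]),
)

# Approximate nearest neighbour search: order rows by similarity to a query vector.
rows = session.execute(
    "SELECT id, name FROM products "
    "ORDER BY embedding ANN OF [0.1, 0.8, 0.1] LIMIT 3"
)
print(list(rows))

cluster.shutdown()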

Stay tuned to learn more about Vector Search in Cassandra 5.0, its prerequisites, limitations, and suitable workloads. 

Instaclustr takes care of cluster upgrades for our customers to help them take advantage of the latest features without compromising stability, performance, or security. If you are not a customer yet and would like help with major version upgrades, contact our sales team. 

 Getting Started

  • If you are an existing customer and would like to try these new features in Cassandra 5.0, you can spin up a cluster today. If you don’t have an account yet, sign up for a free trial and experience the next generation of Apache Cassandra on the Instaclustr Managed Platform. 
  • Read all our technical documentation here. 
  • Discover the 10 rules you need to know when managing Apache Cassandra. 
  • If you are using a relational database and are interested in vector search, check out this blog on support for pgvector, which is available as an add-on for Instaclustr for PostgreSQL services. 

 


ScyllaDB Internal Engineering Summit in Cyprus

ScyllaDB is a remote-first company with associates working in distributed locations around the globe. We’re good at it, and we love it. Still, each year we look forward to in-person company summits to spend time together, share knowledge, and have fun. When the time comes, we say goodbye to our kids and cats and fly from all corners of the world to convene in one place. This April, we met in Limassol, a historical city on the southern coast of Cyprus.

Many People, One Goal

Over 120 attendees gathered at this year’s ScyllaDB Engineering Summit. With the company rapidly growing, it was a great opportunity for dozens of new associates to put faces to names, while ScyllaDB veterans could welcome the new people on board and reunite with their long-time workmates.

Peyia, Paphos area with the Edro III shipwreck in the background

In our everyday work life, we work in different teams, and we have different roles and responsibilities. The event gathered software engineers, QA engineers, product managers, technical support engineers, developer advocates, field engineers, and more. Some of us write the code, and others make sure it’s defect-free. Some take care of the ScyllaDB roadmap, while others provide assistance to ScyllaDB customers. If we were a Git project, we would live on different branches – but there, at the Summit, they would be merged into one. Sharing knowledge, brainstorming ideas, or discussing the things we’d like to improve, we had a profound sense of a common goal – to make a cool product we’re happy with and make the users happy, too.

Who-is-who team building activity

On the Road to Success

The goal is clear, but how do we get there? That question was the central theme of the event. The summit agenda was filled with technical sessions, each with ample time for Q&A. They were a great platform for engaging in discussions and brainstorming, as well as an opportunity to get insight into other teams’ areas of work. The speakers covered a wide range of topics, including:

• ScyllaDB roadmap
• Security features
• Data distribution with tablets
• Strong data consistency with Raft
• Observability and metrics
• Improving customer experience
• Streamlining the management of our GitHub projects
• And many more

Dor Laor kicking off the Summit

Let’s Get The Party Started

The sessions were inspiring and fruitful, but hey, we were in Cyprus! And what is Cyprus famous for if not its excellent food and wine, beautiful beaches, and holiday vibes? We could enjoy all those things at company dinners in beautifully located Cypriot restaurants – and at amazing beaches or quiet coves.

One of the many amazing beachside meals

Then came the party. Live music and a DJ playing all-time hit songs got the crowd going. Rumor has it that engineers don’t know how to party. Please don’t believe it 🙂

Party time!

Off-Road Adventure

Mid-week, we took a break to recharge the batteries and went on an off-road jeep safari. Fifteen 4x4s took us on a scenic tour along the sea coast of Paphos, where we could admire beautiful views.

Setting off on the jeep safari adventure

While enjoying a bumpy ride, we learned a lot about Cyprus’ history and culture from the jeep drivers, who willingly answered all our questions. From time to time, we stopped to see masterpieces of nature, such as a stunning gorge, waterfalls, and the Blue Lagoon, our favorite, where we swam in crystal-clear water.

A view over the lagoons, Akamas Peninsula

Together as One

It was an eventful Summit filled with many memorable moments.
It inspired us, fed our spirits, and gave us some food for thought. But most importantly, it brought us together and let us enjoy a shared sense of purpose. When the Summit was coming to a close, we could genuinely feel that we were one team. When we all got safely home, our CEO emailed us to say, “All of our previous summits were great, but this felt even better. The level of teamwork is outstanding, and many companies can just envy what we have here.” I couldn’t agree more.

Want to Join our Team?

ScyllaDB has a number of open positions worldwide, including many remote positions across Engineering, QA & Automation, Product, and more. Check out the complete list of open positions on our Careers page. Join the ScyllaDB Team!

Node throughput Benchmarking for Apache Cassandra® on AWS, Azure and GCP

At NetApp, we are often asked for comparative benchmarking of our Instaclustr Managed Platform across various cloud providers and the different node sizes we offer.  

For Apache Cassandra® on the Instaclustr Managed Platform, we have recently completed an extensive benchmarking exercise that will help our customers evaluate the node types to use and the differing performance between cloud providers. 

Each cloud service provider is continually introducing new and improved nodes, which we carefully select and curate to provide our customers with a range of options to suit their specific workloads. The results of these benchmarks can assist you in evaluating and selecting the best node sizes that provide optimal performance at the lowest cost. 

How Should Instaclustr Customers Use the Node Benchmark Data? 

Instaclustr node benchmarks provide throughput for each node performed under the same test conditions, serving as valuable performance metrics for comparison. 

As with any generic benchmarking results, performance for your own data model or application workload may vary from the benchmark. Instaclustr node benchmarks do not account for specific configurations of the customer environment and should only be used to compare the relative performance of node sizes.  

Instaclustr recommends customers complete testing and data modelling in their own environment across a selection of potential nodes.

Test Objective

Cassandra Node Benchmarking creates a Performance Metric value for each General Availability (GA) node type that is available for Apache Cassandra across AWS, GCP, and Azure AZ. Each node is benchmarked before being released to General Availability. The tests ensure that all nodes will deliver sufficient performance based on their configuration.  

All Cassandra production nodes Generally Available on the Instaclustr platform are stress-tested to find the operational capacity of the nodes. This is accomplished by measuring throughput to find the highest ops rate the test cluster can achieve without performance degradation.

Methodology

We used the following configuration to conduct this benchmarking: 

  • Cassandra-stress tool
  • QUORUM consistency level for data operations
  • 3-node cluster using Cassandra 4.0.10 with each node in a separate rack within a single datacenter

On each node size, we run the following testing procedure: 

  1. Fill Cassandra with enough records to approximate 5x the system memory of the node using the “small” size writes. This is done to ensure that the entire data model is not sitting in memory, and to give a better representation of a production cluster. 
  2. Allow the cluster to finish all compactions. This is done to ensure that all clusters are given the same starting point for each test run and to make test runs comparable and verifiable. 
  3. Allow compactions to finish between each step. 
  4. Perform multiple stress tests on the cluster, increasing the number of operations per second for each test, until we overload it. The cluster will be considered overloaded if one of three conditions is met: 
    1. The latency of the operation is above the provided threshold. This tells us that the cluster cannot keep up with the current load and is unable to respond quickly. 
    2. A node has more than 20 pending compactions one minute after the test completes. This tells us that the cluster is not keeping up with compactions and that this load is not sustainable in the long term. 
    3. The CPU load is above the provided threshold. This is why we pass the number of cores to the cassandra-stress tool. 
  5. Using this definition of “overloaded”, the following tests are run to measure maximum throughput: 
    1. Perform a combination of read and write operations in 30-minute test runs, mixing writes, simple reads, and range reads at a ratio of 10:10:1 and increasing the thread count until we reach a median read latency of 20 ms. The highest operation rate sustained without overload is recorded as the mixed-workload (MixedOPS) result; an illustrative cassandra-stress invocation is sketched after this list. 
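
For illustration only, a cassandra-stress invocation along the following lines could drive the fill and mixed-workload steps above. This is a sketch, not our exact benchmark configuration: the node address, thread counts, operation counts, column size, and the user profile file (stress-profile.yaml, with query names "simple" and "range") are hypothetical placeholders.

# Fill step: small writes until the data set is roughly 5x the node's system memory
cassandra-stress write n=500000000 cl=QUORUM -col size=FIXED(64) -rate threads=200 -node 10.0.0.1

# Mixed step: writes, simple reads and range reads at a 10:10:1 ratio for 30 minutes;
# rerun with increasing thread counts until the median read latency exceeds 20 ms
# or the cluster is otherwise overloaded
cassandra-stress user profile=stress-profile.yaml "ops(insert=10,simple=10,range=1)" duration=30m cl=QUORUM -rate threads=100 -node 10.0.0.1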

Results

Following this approach, the tables below show the results we measured on the different providers and different node sizes we offer.

AWS
Node Type (Node Size Used) RAM / Cores Result (MixedOPS)
t4g.small (CAS-DEV-t4g.small-30) 2 GiB / 2 728*
t4g.medium (CAS-DEV-t4g.medium-80) 4 GiB / 2 3,582
r6g.medium (CAS-PRD-r6g.medium-120) 8 GiB / 1 1,849
m6g.large (CAS-PRD-m6g.large-120) 8 GiB / 2 2,144
r6g.large (CAS-PRD-r6g.large-250) 16 GiB / 2 3,117
r6g.xlarge (CAS-PRD-r6g.xlarge-400) 32 GiB / 4 16,653
r6g.2xlarge (CAS-PRD-r6g.2xlarge-800) 64 GiB / 8 36,494
r6g.4xlarge (CAS-PRD-r6g.4xlarge-1600) 128 GiB / 16 62,195
r7g.medium (CAS-PRD-r7g.medium-120) 8 GiB / 1 2,527
m7g.large (CAS-PRD-m7g.large-120) 8 GiB / 2 2,271
r7g.large (CAS-PRD-r7g.large-120) 16 GiB / 2 3,039
r7g.xlarge (CAS-PRD-r7g.xlarge-400) 32 GiB / 4 20,897
r7g.2xlarge (CAS-PRD-r7g.2xlarge-800) 64 GiB / 8 39,801
r7g.4xlarge (CAS-PRD-r7g.4xlarge-800) 128 GiB / 16 70,880
c5d.2xlarge (c5d.2xlarge-v2) 16 GiB / 8 26,494
c6gd.2xlarge (CAS-PRD-c6gd.2xlarge-441) 16 GiB / 8 27,066
is4gen.xlarge (CAS-PRD-is4gen.xlarge-3492) 24 GiB / 4 19,066
is4gen.2xlarge (CAS-PRD-is4gen.2xlarge-6984) 48 GiB / 8 34,437
im4gn.2xlarge (CAS-PRD-im4gn.2xlarge-3492) 32 GiB / 8 31,090
im4gn.4xlarge (CAS-PRD-im4gn.4xlarge-6984) 64 GiB / 16 59,410
i3en.xlarge (i3en.xlarge) 32 GiB / 4 19,895
i3en.2xlarge (CAS-PRD-i3en.2xlarge-4656) 64 GiB / 8 40,796
i3.2xlarge (i3.2xlarge-v2) 61 GiB / 8 24,184
i3.4xlarge (CAS-PRD-i3.4xlarge-3538) 122 GiB / 16 42,234

* Nodes were overloaded and started dropping out during the stress test

Azure
Node Type (Node Size Used) RAM / Cores Result (MixedOPS)
Standard_DS12_v2 (Standard_DS12_v2-512-an) 28 GiB / 4 23,878
Standard_DS2_v2 (Standard_DS2_v2-256-an) 7 GiB / 2 1,535
L8s_v2 (L8s_v2-an) 64 GiB / 8 24,188
Standard_L8s_v3 (CAS-PRD-Standard_L8s_v3-1788) 64 GiB / 8 33,990
Standard_DS13_v2 (Standard_DS13_v2-2046-an) 56 GiB / 8 37,908
D15_v2 (D15_v2-an) 140 GiB / 20 68,226
Standard_L16s_v2 (CAS-PRD-Standard_L16s_v2-3576-an) 128 GiB / 16 33,969

GCP
Node Type (Node Size Used) RAM / Cores Result (MixedOPS)
n1-standard-1 (CAS-DEV-n1-standard-1-5) 3.75 GiB / 1 No data*
n1-standard-2 (CAS-DEV-n1-standard-2-80) 7.5 GiB / 2 3,159
t2d-standard-2 (CAS-PRD-t2d-standard-2-80) 8 GiB / 2 6,045
n2-standard-2 (CAS-PRD-n2-standard-2-120) 8 GiB / 2 4,199
n2-highmem-2 (CAS-PRD-n2-highmem-2-250) 16 GiB / 2 5,520
n2-highmem-4 (CAS-PRD-n2-highmem-4-400) 32 GiB / 4 19,893
n2-standard-8 (cassandra-production-n2-standard-8-375) 32 GiB / 8 37,437
n2-highmem-8 (CAS-PRD-n2-highmem-8-800) 64 GiB / 8 38,139
n2-highmem-16 (CAS-PRD-n2-highmem-16-800) 128 GiB / 16 73,918

* Cannot be tested due to the fill data requirement being greater than the available disk space

Conclusion: What We Discovered

In general, we see that clusters with more processing power (CPUs and RAM) produce higher throughput as expected. Some of the key takeaways include: 

  • When it comes to the price-to-performance ratio of Cassandra, there is a sweet spot around the XL/2XL node size (e.g., r6g.xlarge or r6g.2xlarge).
    • Moving from L to XL nodes more than doubles performance, and moving from XL to 2XL almost always doubles performance again.
    • However, moving from 2XL to 4XL, the increase in performance is less than double. This is expected as you move from a memory-constrained state to a state with more resources than can be utilized. These findings are specific to Cassandra. 
  • Different node families are tailored for different workloads, which significantly impacts their performance. Hence, nodes should be selected based on workload requirements and use cases. 

Overall, these benchmarks provide an indication of potential throughput for different environments and instance sizes. Performance can vary significantly depending on individual use cases, and we always recommend benchmarking with your own specific use case prior to production deployment. 

The easiest way to see all the new node offerings for your provider is to log into our Console. We ensure that you have access to the latest instance types by supporting different node types from each cloud provider on our platform to help you get high performance at reduced costs.  

Did you know that R8g instances from AWS deliver up to 30% better performance than Graviton3-based R7g instances? Keep an eye out, as we will make these new instances available on our platform as soon as they are released for general availability. 

If you are interested in migrating your existing Cassandra clusters to the Instaclustr Managed Platform, our highly experienced Technical Operations team can provide all the assistance you need. We have built several tried-and-tested node replacement strategies to provide zero-downtime, non-disruptive migrations. Read our Advanced Node Replacement blog for more details on one such strategy. 

If you want to know more about this benchmarking or need clarification on when to use which instance type for Cassandra, reach out to our Support team (if you are an existing customer) or contact our Sales team. Our support team can assist you in resizing your existing clusters. Alternatively, you can use our in-place data-center resizing feature to do it on your own. 


How to Build a Real-Time CDC pipeline with ScyllaDB & Redpanda

A step-by-step tutorial on capturing changed data for high-throughput, low-latency applications, using Redpanda and ScyllaDB. This blog is a guest post by Kovid Rathee. It was first published on the Redpanda blog.

CDC is a methodology for capturing change events from a source database. Once captured, the change events can be used for many downstream use cases, such as triggering events based on specific conditions, auditing, and analytics.

ScyllaDB is a distributed, wide-column NoSQL database known for its ultra-fast performance: it’s written in C++ to make the best use of low-level Linux primitives, which helps it attain much better I/O. It’s also an excellent alternative to DynamoDB as well as a popular alternative to Apache Cassandra. ScyllaDB’s modern approach to NoSQL and its architecture supporting data-intensive and low-latency applications make it an excellent choice for many applications. ScyllaDB shines when you need it to respond and scale super fast.

Redpanda is a source-available (BSL), Apache Kafka®-compatible streaming data platform designed from the ground up to be lighter, faster, and simpler to operate. It employs a single-binary architecture, free from ZooKeeper™ and JVMs, with a built-in Schema Registry and HTTP Proxy.

Since both ScyllaDB and Redpanda are purpose-built for data-intensive applications, it makes perfect sense to connect the two. This post will take you through setting up a CDC integration between ScyllaDB and Redpanda, using a Debezium CDC connector that’s compatible with Kafka Connect. Let’s dig in.

Note: Want an in-depth look at using Rust + Redpanda + ScyllaDB to build high-performance systems for real-time data streaming? Join our interactive Masterclass on May 15. You will learn how to:

  • Understand core concepts for working with event-driven architectures
  • Build a high-performance streaming app with Rust
  • Avoid common mistakes that surface in real-world systems at scale
  • Evaluate the pros, cons, and best use cases for different event streaming options

How to build a real-time CDC pipeline between ScyllaDB and Redpanda

This tutorial will first guide you through the process of setting up ScyllaDB and Redpanda. You’ll create a standalone Kafka Connect cluster and incorporate the ScyllaDB CDC Connector JAR files as plugins. This will allow you to establish a connection between ScyllaDB and Kafka Connect. This tutorial uses ScyllaDB’s CDC quickstart guide, which takes an arbitrary orders table for an e-commerce business and streams data from it to a sink using the Kafka Connect-compatible Debezium connector. The following image depicts the simplified architecture of the setup:

Connecting ScyllaDB and Redpanda using Kafka Connect

The orders table receives new orders and updates on previous orders. In this example, you’ll insert a few simple orders with an order_id, a customer_id, and a product. You’ll first insert a few records in the orders table and then perform a change on one of the records. All the data, including new records, changes, and deletes, will be available as change events on the Redpanda topic you’ve tied to the ScyllaDB orders table.

Prerequisites

To complete this tutorial, you’ll need the following:

  • Docker Engine and Docker Compose
  • A Redpanda instance on Docker
  • A ScyllaDB instance running on Docker
  • ScyllaDB Kafka Connect drivers
  • The jq CLI tool
  • The rpk CLI tool
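
As a quick, optional sanity check, you can confirm that the prerequisite tools are installed by printing their versions from a terminal; the commands below are only illustrative, and your version output will differ:

docker --version          # Docker Engine
docker compose version    # Docker Compose plugin
jq --version              # jq CLI tool
rpk version               # Redpanda's rpk CLI tool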
Note: The operating system used in this tutorial is macOS.
1. Run and configure ScyllaDB

After installing and starting Docker Engine on your machine, execute the following Docker command to get ScyllaDB up and running:

docker run --rm -ti \
  -p 127.0.0.1:9042:9042 scylladb/scylla \
  --smp 1 --listen-address 0.0.0.0 \
  --broadcast-rpc-address 127.0.0.1

This command will spin up a container with ScyllaDB, accessible on 127.0.0.1 on port 9042. To check ScyllaDB’s status, run the following command, which uses the ScyllaDB nodetool utility (replace 225a2369a71f with your own container ID):

docker exec -it 225a2369a71f nodetool status

This command should give you an output that looks something like the following:

Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address  Load    Tokens  Owns  Host ID                               Rack
UN  0.0.0.0  256 KB  256     ?     2597950d-9cc6-47eb-b3d6-a54076860321  rack1

The UN at the beginning of the table output means Up and Normal. These two represent the cluster’s status and state, respectively.

2. Configure CDC on ScyllaDB

To set up CDC on ScyllaDB, you first need to log in to the cluster using the cqlsh CLI. You can do that using the following command:

docker exec -it 225a2369a71f cqlsh

If you’re able to log in successfully, you’ll see the following message:

Connected to  at 0.0.0.0:9042.
[cqlsh 5.0.1 | Cassandra 3.0.8 | CQL spec 3.3.1 | Native protocol v4]
Use HELP for help.
cqlsh>

Keyspaces are high-level containers for all the data in ScyllaDB. A keyspace in ScyllaDB is conceptually similar to a schema or database in MySQL. Use the following command to create a new keyspace called quickstart_keyspace with SimpleStrategy replication and a replication_factor of 1:

CREATE KEYSPACE quickstart_keyspace WITH REPLICATION = {'class': 'SimpleStrategy', 'replication_factor': 1};

Now, use this keyspace for further CQL commands:

USE quickstart_keyspace;

Please note that the SimpleStrategy replication class is not recommended for production use. Instead, you should use the NetworkTopologyStrategy replication class. You can learn more about replication methodologies from ScyllaDB University.

3. Create a table

Use the following CQL statement to create the orders table described at the beginning of the tutorial:

CREATE TABLE orders(
    customer_id int,
    order_id int,
    product text,
    PRIMARY KEY(customer_id, order_id))
WITH cdc = {'enabled': true};

This statement creates the orders table with a composite primary key consisting of customer_id and order_id. The orders table will be CDC-enabled. ScyllaDB stores table data in a base table, but when you enable CDC on that table, an additional table that captures all the changed data is created as well. That table is called a log table.

4. Insert a few initial records

Populate a few records for testing purposes:

INSERT INTO quickstart_keyspace.orders(customer_id, order_id, product) VALUES (1, 1, 'pizza');
INSERT INTO quickstart_keyspace.orders(customer_id, order_id, product) VALUES (1, 2, 'cookies');
INSERT INTO quickstart_keyspace.orders(customer_id, order_id, product) VALUES (1, 3, 'tea');

During the course of the tutorial, you’ll insert three more records and perform an update on one record, for a total of seven events, four of which are change events after the initial setup.

5. Set up Redpanda

Use the docker-compose.yaml file in the Redpanda Docker quickstart tutorial to run the following command:

docker compose up -d

A Redpanda cluster and a Redpanda console will be up and running after a brief wait.
You can check the status of the Redpanda cluster using the following command:

docker exec -it redpanda-0 rpk cluster info

The output of the command should look something like the following:

CLUSTER
=======
redpanda.3fdc3646-9c9d-4eff-b5d6-854093a25b67

BROKERS
=======
ID    HOST        PORT
0*    redpanda-0  9092

Before setting up an integration between ScyllaDB and Redpanda, check if all the Docker containers you have spawned are running by using the following command:

docker ps --format "table {{.Image}}\t{{.Names}}\t{{.Status}}\t{{.Ports}}"

Look for the STATUS column in the table output:

IMAGE                                              NAMES             STATUS      PORTS
scylladb/scylla                                    gifted_hertz      Up 8 hours  22/tcp, 7000-7001/tcp, 9160/tcp, 9180/tcp, 10000/tcp, 127.0.0.1:9042->9042/tcp
docker.redpanda.com/vectorized/console:v2.2.4      redpanda-console  Up 9 hours  0.0.0.0:8080->8080/tcp
docker.redpanda.com/redpandadata/redpanda:v23.1.8  redpanda-0        Up 9 hours  8081-8082/tcp, 0.0.0.0:18081-18082->18081-18082/tcp, 9092/tcp, 0.0.0.0:19092->19092/tcp, 0.0.0.0:19644->9644/tcp

If these containers are up and running, you can start setting up the integration.

6. Set up Kafka Connect

Download and install Kafka using the following sequence of commands:

# Extract the binaries
wget https://downloads.apache.org/kafka/3.4.0/kafka_2.13-3.4.0.tgz && tar -xzf kafka_2.13-3.4.0.tgz && cd kafka_2.13-3.4.0

# Start Kafka Connect in standalone mode
bin/connect-standalone.sh config/connect-standalone.properties

The .properties files will be kept in the kafka_2.13-3.4.0/config folder in your download location. To use Kafka Connect, you must configure and validate two .properties files. You can name these files anything. In this tutorial, the two files are called connect-standalone.properties and connector.properties.

The first file contains the properties for the standalone Kafka Connect instance. For this file, you will only change the default values of two variables:

  • The default value for the bootstrap.servers variable is localhost:9092. As you’re using Redpanda and its broker is running on port 19092, you’ll replace the default value with localhost:19092.
  • Find the plugin path directory for your Kafka installation and set the plugin.path variable to that path. In this case, it’s /usr/local/share/kafka/plugins. This is where your ScyllaDB CDC Connector JAR file will be copied.

The comment-stripped version of the connect-standalone.properties file should look like the following:

bootstrap.servers=localhost:19092
key.converter.schemas.enable=true
value.converter.schemas.enable=true
offset.storage.file.filename=/tmp/connect.offsets
offset.flush.interval.ms=10000
plugin.path=/usr/local/share/kafka/plugins

The second .properties file will contain settings specific to the ScyllaDB connector. You need to change the scylla.cluster.ip.addresses variable to 127.0.0.1:9042.
The connector.properties file should then look like the following:

name = QuickstartConnector
connector.class = com.scylladb.cdc.debezium.connector.ScyllaConnector
key.converter = org.apache.kafka.connect.json.JsonConverter
value.converter = org.apache.kafka.connect.json.JsonConverter
scylla.cluster.ip.addresses = 127.0.0.1:9042
scylla.name = QuickstartConnectorNamespace
scylla.table.names = quickstart_keyspace.orders

Using both the .properties files, go to the Kafka installation directory and run the connect-standalone.sh script with the following command:

bin/connect-standalone.sh config/connect-standalone.properties config/connector.properties

When you created the orders table, you enabled CDC, which means that there’s a log table with all the records and changes. If the Kafka Connect setup is successful, you should now be able to consume these events using the rpk CLI tool. To do so, use the following command:

rpk topic consume --brokers 'localhost:19092' QuickstartConnectorNamespace.quickstart_keyspace.orders | jq .

The output should result in the following three records, as shown below:

{
  "topic": "QuickstartConnectorNamespace.quickstart_keyspace.orders",
  "key": "{\"schema\":{\"type\":\"struct\",\"fields\":[{\"type\":\"int32\",\"optional\":true,\"field\":\"customer_id\"},{\"type\":\"int32\",\"optional\":true,\"field\":\"order_id\"}],\"optional\":false,\"name\":\"QuickstartConnectorNamespace.quickstart_keyspace.orders.Key\"},\"payload\":{\"customer_id\":1,\"order_id\":1}}",
  "value": …output omitted…
  "timestamp": 1683357426891,
  "partition": 0,
  "offset": 0
}
{
  "topic": "QuickstartConnectorNamespace.quickstart_keyspace.orders",
  "key": "{\"schema\":{\"type\":\"struct\",\"fields\":[{\"type\":\"int32\",\"optional\":true,\"field\":\"customer_id\"},{\"type\":\"int32\",\"optional\":true,\"field\":\"order_id\"}],\"optional\":false,\"name\":\"QuickstartConnectorNamespace.quickstart_keyspace.orders.Key\"},\"payload\":{\"customer_id\":1,\"order_id\":2}}",
  "value": …output omitted…
  "timestamp": 1683357426898,
  "partition": 0,
  "offset": 1
}
{
  "topic": "QuickstartConnectorNamespace.quickstart_keyspace.orders",
  "key": "{\"schema\":{\"type\":\"struct\",\"fields\":[{\"type\":\"int32\",\"optional\":true,\"field\":\"customer_id\"},{\"type\":\"int32\",\"optional\":true,\"field\":\"order_id\"}],\"optional\":false,\"name\":\"QuickstartConnectorNamespace.quickstart_keyspace.orders.Key\"},\"payload\":{\"customer_id\":1,\"order_id\":3}}",
  "value": …output omitted…
  "timestamp": 1683357426898,
  "partition": 0,
  "offset": 2
}

If you get a similar output, you’ve successfully integrated ScyllaDB with Redpanda using Kafka Connect. Alternatively, you can go to the Redpanda console hosted on localhost:8080 and see if the topic corresponding to the ScyllaDB orders table is available:

You can now test whether data changes to the orders table can trigger CDC.
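
Optionally, before testing change events, you can confirm that the connector was registered and is running by querying Kafka Connect’s REST API, which standalone mode exposes on port 8083 by default. A minimal sketch, assuming default REST settings and the connector name configured in connector.properties:

# List the connectors registered with this Kafka Connect instance
curl -s localhost:8083/connectors | jq .

# Check the status of the Quickstart connector and its tasks
curl -s localhost:8083/connectors/QuickstartConnector/status | jq .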
7. Capture Change Data (CDC) from ScyllaDB

To test CDC for new records, insert the following two records in the orders table using the cqlsh CLI:

INSERT INTO quickstart_keyspace.orders(customer_id, order_id, product) VALUES (1, 4, 'chips');
INSERT INTO quickstart_keyspace.orders(customer_id, order_id, product) VALUES (1, 5, 'lollies');

If the insert is successful, the rpk topic consume command will give you the following additional records:

{
  "topic": "QuickstartConnectorNamespace.quickstart_keyspace.orders",
  "key": "{\"schema\":{\"type\":\"struct\",\"fields\":[{\"type\":\"int32\",\"optional\":true,\"field\":\"customer_id\"},{\"type\":\"int32\",\"optional\":true,\"field\":\"order_id\"}],\"optional\":false,\"name\":\"QuickstartConnectorNamespace.quickstart_keyspace.orders.Key\"},\"payload\":{\"customer_id\":1,\"order_id\":4}}",
  "value": …output omitted…
  "timestamp": 1683357768358,
  "partition": 0,
  "offset": 3
}
{
  "topic": "QuickstartConnectorNamespace.quickstart_keyspace.orders",
  "key": "{\"schema\":{\"type\":\"struct\",\"fields\":[{\"type\":\"int32\",\"optional\":true,\"field\":\"customer_id\"},{\"type\":\"int32\",\"optional\":true,\"field\":\"order_id\"}],\"optional\":false,\"name\":\"QuickstartConnectorNamespace.quickstart_keyspace.orders.Key\"},\"payload\":{\"customer_id\":1,\"order_id\":5}}",
  "value": …output omitted…
  "timestamp": 1683358068355,
  "partition": 0,
  "offset": 4
}

You’ll now insert one more record (with order_id 6) and the product value of pasta:

INSERT INTO quickstart_keyspace.orders(customer_id, order_id, product) VALUES (1, 6, 'pasta');

Later on, you’ll change this value with an UPDATE statement to spaghetti and trigger a CDC update event. The newly inserted record should be visible with your rpk topic consume command:

{
  "topic": "QuickstartConnectorNamespace.quickstart_keyspace.orders",
  "key": "{\"schema\":{\"type\":\"struct\",\"fields\":[{\"type\":\"int32\",\"optional\":true,\"field\":\"customer_id\"},{\"type\":\"int32\",\"optional\":true,\"field\":\"order_id\"}],\"optional\":false,\"name\":\"QuickstartConnectorNamespace.quickstart_keyspace.orders.Key\"},\"payload\":{\"customer_id\":1,\"order_id\":6}}",
  "value": …output omitted…
  "timestamp": 1683361158363,
  "partition": 0,
  "offset": 5
}

Now, execute the following UPDATE statement and see if a CDC update event is triggered:

UPDATE quickstart_keyspace.orders SET product = 'spaghetti' WHERE order_id = 6 and customer_id = 1;

After running this command, you’ll need to run the rpk topic consume command to verify the latest addition to the QuickstartConnectorNamespace.quickstart_keyspace.orders topic. The change event record should look like the following:

{
  "topic": "QuickstartConnectorNamespace.quickstart_keyspace.orders",
  "key": "{\"schema\":{\"type\":\"struct\",\"fields\":[{\"type\":\"int32\",\"optional\":true,\"field\":\"customer_id\"},{\"type\":\"int32\",\"optional\":true,\"field\":\"order_id\"}],\"optional\":false,\"name\":\"QuickstartConnectorNamespace.quickstart_keyspace.orders.Key\"},\"payload\":{\"customer_id\":1,\"order_id\":6}}",
  "value": …output omitted…
  "timestamp": 1683362868372,
  "partition": 0,
  "offset": 6
}
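
The tutorial’s introduction noted that deletes are also captured as change events, although the steps above only demonstrate inserts and an update. As an optional extra, you could delete one of the orders and watch a further change event arrive on the same topic. A minimal sketch, assuming the same ScyllaDB container ID used earlier in this tutorial:

# Delete an order; the full primary key (customer_id and order_id) is required
docker exec -it 225a2369a71f cqlsh -e "DELETE FROM quickstart_keyspace.orders WHERE customer_id = 1 AND order_id = 2;"

# Consume the topic again; a new change event for (customer_id=1, order_id=2)
# should appear at the next offset
rpk topic consume --brokers 'localhost:19092' QuickstartConnectorNamespace.quickstart_keyspace.orders | jq .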
8. Access the captured data using Redpanda Console

When you spin up Redpanda using Docker Compose, it creates a Redpanda cluster and a Redpanda Console. Using the console, you can visually inspect topics and messages. After all the change events on the orders table are captured, you can go to the Redpanda Console and select the QuickstartConnectorNamespace.quickstart_keyspace.orders topic to see the messages, as shown below:

Redpanda Console showing all events in the topic

You can also see the schema details, the payload, and other event metadata in the event message, as shown below:

Redpanda Console showing details of one of the events

You’ve now successfully set up CDC between ScyllaDB and Redpanda, consumed change events from the CLI, and accessed the same messages using the Redpanda Console!

Conclusion

This tutorial introduced you to ScyllaDB and provided step-by-step instructions to set up a connection between ScyllaDB and Redpanda for capturing changed data. Messaging between systems is one of the most prominent ways to create asynchronous, decoupled applications. Bringing CDC events to Redpanda allows you to integrate various systems. Using this CDC connection, you can consume data from Redpanda for various purposes, such as triggering workflows based on change events, integrating with other data sources, and creating a data lake.

By following the tutorial and using the supporting GitHub repository, you can easily get started with CDC. To explore Redpanda, check the documentation and browse the Redpanda blog for more tutorials. If you have any questions or want to chat with the team, join the Redpanda Community on Slack. Likewise, if you want to learn more about ScyllaDB, see our documentation, visit ScyllaDB University (free), and feel free to reach out to the team.