Scylla Monitoring Stack 2.0

Scylla Release

The Scylla team is pleased to announce the release of Scylla Monitoring Stack 2.0

Scylla Monitoring is an open source stack for monitoring Scylla Enterprise and Scylla Open Source, based on Prometheus and Grafana. The Scylla Monitoring 2.0 stack supports:

  • Scylla Open Source versions 2.1, 2.2, 2.3
  • Scylla Enterprise versions 2017.x and 2018.x
  • Scylla Manager 1.1

Scylla Monitoring 2.0 brings many improvements, both in dashboard usability and the underlying stack, in particular, moving to a new version of Prometheus. Please read the Monitoring upgrade guide carefully before upgrading.

Enterprise users are welcome to contact us for proactive help in the upgrade process.

Open Source users are welcome to use the User Slack or User Mailing list for any questions you may have.

New in Monitoring 2.0

  • Move to Prometheus version 2.3.2.
    Scylla Monitoring stack 1.0 was based on Prometheus 1.x. Moving to
    Prometheus 2.x brings many improvements, mostly in the storage format. Note that Prometheus 2.x is not backward compatible with Prometheus 1.x, which can make the monitoring stack upgrade process more complex. More here.

  • Support for Multi-cluster and multi-DC dashboards
    The Prometheus target files contain mapping information to map nodes to their respective data centers (DCs) and clusters. You can then use Prometheus to filter charts on the dashboard for either the cluster or DC by choosing DC or Cluster from the drop-down multi-select buttons. This is very useful in cases where you are using one monitoring stack to monitor more than one cluster.

Scylla Monitor 2.0 - selecting DCs

Example from prometheus/scylla_servers.yml, monitoring two clusters (cluster1, cluster2), the first with two DCs (dc1, dc2)

- targets:
  # (node IP:port entries elided)
  labels:
    cluster: cluster1
    dc: dc1

- targets:
  # (node IP:port entries elided)
  labels:
    cluster: cluster1
    dc: dc2

- targets:
  # (node IP:port entries elided)
  labels:
    cluster: cluster2

The same applies to prometheus/node_exporter_servers.yml, with the node_exporter port (9100).

Note that the Monitoring stack uses the data provided in the target file (or service discovery), and not the Scylla cluster topology as presented in “nodetool status”. We plan to fix this gap in a future release. #139

  • Identify nodes by their IP.
    Node IPs replace hostnames as node identifiers in all dashboards. This unifies the identifiers of Scylla and node_exporter (OS-level) metrics.

  • Support for Scylla Open Source 2.3.
    Use -v 2.3 to start with Scylla 2.3 Grafana dashboards. You can use multiple dashboards at the same time, for example, -v 2.2,2.3

    The following dashboards are available:

    • Scylla Overview Metrics 2.3
    • Scylla CPU Per Server Metrics 2.3
    • Scylla Per-Server Disk I/O 2.3
    • Scylla Per Server Metrics 2.3

  • Accept any instance name. Characters such as underscore and colon may now be used in instance names. #351

The post Scylla Monitoring Stack 2.0 appeared first on ScyllaDB.

Overheard at Distributed Data Summit 2018

Distributed Data Summit 2018


NoSQL practitioners from across the globe gathered this past Friday, 14 September 2018, in San Francisco for the latest Distributed Data Summit (#dds18). The attendees were as distributed as the data they manage, from the UK to Japan and all points between. While the event began years ago as a purely Apache Cassandra Summit, it currently bills itself as appealing to users of Apache Cassandra “and Friends.” The folks at ScyllaDB are definitely among those friends!

Distributed Data Summit Banner Graphic

Spinning up for the event

Our team’s own efforts kicked off the night before, spinning up a 3-node cluster to showcase Scylla’s performance: two million IOPS with 4 ms latency at the 99.9% level. (And note the Average CPU Load pegged at 100%.)

A Hot Keynote


The keynote was delivered by Nate McCall (@zznate), current Apache Cassandra Project Management Committee (PMC) Chair. He laid out the current state of Cassandra 4.0, including updates on Transient Replicas (14404) and Zero Copy Streaming (14556).

Nate McCall (@zznate) delivers keynote at DDS18

Nate acknowledged his own frustrations with Cassandra. One example was the “tick-tock” release model adopted in 2015, named after the practice originally promulgated by Intel. With a slide titled “Tick Toc Legacy: Bad,” Nate was pretty blunt: “I have to sacrifice a chicken [to get git to work].” Since Intel itself moved away from “tick-tock” in 2016, and the Cassandra community decided in 2017 to end tick-tock (coinciding with 3.10) in favor of 6-month release schedules, this is simply a legacy issue, but a painful one.


He also took a rather strident stance on Materialized Views, a feature Nate believes is not-quite-there-yet: “If you have them, take them out.”


When he mentioned the final nails in the coffin for Thrift (deprecated in favor of CQL), a hearty round of applause rose from the audience. Though, as an aside, Scylla will continue to support Thrift for legacy applications.


Nate also highlighted a number of sidecar proposals being made, such as The Last Pickle’s Cassandra Reaper for repairs, plus the Netflix and DataStax proposals.


Besides the specific technical updates, Nate also addressed maturity lifecycle issues that are starting to crop up in Cassandra. For the most part, he praised Cassandra staying close to its open source roots: “The features that are coming out are for users, by users… The marketing team is not involved.” But like Hadoop before it (which was split between Cloudera, Hortonworks and MapR), Cassandra is now witnessing increased divergence emerging between commercial vendors and code committers (such as DataStax and ScyllaDB).

Yahoo! Japan

The ScyllaDB team also had a chance to meet our friends Shogo Hoshii and Murukesh Mohanan. They presented on Cassandra and ScyllaDB at Yahoo! Japan, focusing on the evaluation they performed for exceedingly heavy traffic generated by their pre-existing Cassandra network of 1,700 nodes. Yahoo! Japan is no stranger to NoSQL technology, forming their NoSQL team in 2012. We hope to bring you more of their analyses and results in the future. For now, we’ll leave you with this one teaser slide Dor took during their presentation.


Was Cassandra the Right Base for Scylla?

ScyllaDB CEO Dor Laor asked this fundamental question for his session. If we knew then what we know now, four years ago, would we have still started with Cassandra as a base?


While Scylla mimics the best of Cassandra, it was engineered in C++ to avoid some of the worst of its pitfalls, such as JVM Garbage Collection, making it quite different under the hood.


Dor observed first “what worked well for us?” Particularly, inheriting such items as CQL, drivers, and the whole Cassandra ecosystem (Spark, KairosDB, JanusGraph, Presto, Kafka). Also, Cassandra’s “scale-out” capabilities, high availability and cross-datacenter replication. Backup and restore had reasonable solutions. (Though it is particularly nice to see people write Scylla-specific backup tools like Helpshift’s Scyllabackup.) Cassandra also has a rich data model, with key-value and wide rows. There was so much there with Cassandra. “Even hinted handoffs.” And, not to be left out of the equation, tools like Prometheus and Grafana.


On the other hand, there was a lot that wasn’t so great. Nodetool and JMX do not make a great API. The whole current sidecar debate lays bare that there was no built-in management console or strategy. Configurations based on IP are problematic. Dor’s opinion is that they should have been based on DNS instead. Secondary indexes in Cassandra are local and do not scale. “We fixed it” for Scylla, Dor noted.


Since 2014, a lot has changed in the NoSQL world. Dor cited a number of examples, from CosmosDB, which is multi-model and fully multi-tenant, to Dynamo’s point-in-time backups and streams, to CockroachDB’s ACID guarantees and SQL interface. Some of those features are definitely compelling. But would they have been a better place to start from?


So Dor took the audience on a step-by-step review of the criteria and requirements for what the founding team at ScyllaDB considered when choosing a baseline database.


Fundamental reliability, performance, and the ecosystem were givens for Cassandra. So those became quick checkboxes.


Dor instead focused on the next few bullets. Operational management could be better. Had it been part of Cassandra’s baseline, it would have obviated the need for sidecars.


For data and consistency, Dor cited Baidu’s use of CockroachDB for management of terabytes of data with 50 million inserts daily. But while that sounds impressive, a quick bit of math reveals that only equates to 578 operations per second.
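That back-of-the-envelope figure is easy to verify:

```python
# Quick sanity check of the arithmetic above: 50 million inserts per day
# averages out to only a few hundred operations per second.
inserts_per_day = 50_000_000
seconds_per_day = 24 * 60 * 60  # 86,400
print(inserts_per_day // seconds_per_day)  # -> 578
```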


While SSTables 3.0 provide a significant storage savings compared to 2.0, where Cassandra needs to improve in terms of storage and efficiency is to adopt a tiered-storage model (hot vs. cold storage).


“Cloud-native” is a term for which a specific definition is still quite arguable, especially when it comes to databases. Though there have been a few good takes at the topic, the term itself can still mean different things to different people, never mind the implementations. For instance, Dor brought up the example of what might happen if a split-brain divides the database and your Kubernetes (“k8s”) differently. Dor argued that Scylla/Cassandra’s topology awareness is better than Kubernetes, but the fact that they are different at all means there needs to be a reconciliation to heal the split-brain.


With regards to multi-tenancy, Dor saw the main problem here as one of idle resources, exacerbated by cloud vendors all-too-happy to over-provision. The opportunity here is to encapsulate workloads and keyspaces per tenant, and to provide full isolation, including system resources (CPU, memory, disk, network), security, and so on — without over-provisioning, and also while handling hot partitions well. It would require a single cluster for multiple tenants, consolidating all idle resources. Ideally, such a multi-tenant cluster would permit scale-outs in a millisecond. Dor also emphasized that there is a great need to define better SLAs that focus on real customer workloads and multi-tenancy needs.


So was Cassandra the right baseline to work from? Those who attended definitely got Dor’s opinion. But if you couldn’t attend, don’t worry! We plan on publishing an article delving into far more depth on this topic in the near future.

Thread per Core


ScyllaDB CTO Avi Kivity flew in from Israel to present on Scylla’s thread-per-core architecture.


The background to his talk was the industry momentum towards ever-increasing core counts. Utilizing these high-core-count machines is getting increasingly difficult: lock contention, optimizing for NUMA and NUCA, and asymmetric architectures, which can lead to extremes of over- or under-utilization.


Yet there are compelling reasons to deal with such difficulties to achieve significant advantages: improved mean time between failures (MTBF) and reduced management burdens from dealing with orders of magnitude fewer machines, far less space in your rack, fewer switch ports, and, of course, a reduction in overall costs.


So, Avi asked, “How do you get the most from a large fat node?” Besides system architecture, there are also considerations within the core itself. “What is the right number of threads” to run on a single core? If you run too few, you can end up with underutilization. If you run too many, you run into the inverse problems of thrashing, lock contention and high latencies.

The natural fit for this issue is a thread-per-core approach. Data has traditionally been partitioned across nodes. This simply drills the model down to a per-core basis. It avoids all locking and kernel scheduling. And it scales up linearly. However, if you say “no kernel scheduling,” you have to take on all scheduling yourself. And everything has to be asynchronous.

In fact, each thread has to be able to perform all tasks. Reads. Writes. Compactions. Repairs. And all administrative functions. The thread also has to be responsible for balancing CPU utilization across these tasks, based on policies, and with sufficient granularity. Thus, the thread has to be able to preempt any and every task. And, finally, threads have to work on separate data to avoid any locking—a true “shared-nothing” architecture.

Avi took us on a tour of a millisecond in a thread’s life. “We maintain queues of tasks we want to run.” This “gives us complete control of what we want to run next.” You’ll notice the thread maintains preemption and poll points, to occasionally change its behavior. Every computation must be preemptable at sub-millisecond resolution.

Whether you were in the middle of SSTable reads or writes, compactions, intranode partitioning, replicating data or metadata or mutating your mutable data structures, “Still you must be able to stop at any point and turn over the CPU.”
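The run loop Avi described can be sketched roughly like this. This is an illustrative Python toy, not Seastar code; the task-class names and the quantum value are made up:

```python
import time
from collections import deque

# One shard's cooperative run loop: per-task-class queues, with a preemption
# deadline checked between small work items so no class can hog the CPU.
QUANTUM_SECONDS = 0.0005  # sub-millisecond preemption resolution (assumed)

def run_loop(queues):
    """queues maps a task-class name (e.g. 'reads', 'compaction') to a
    deque of zero-argument work items; returns the execution order."""
    order = []
    while any(queues.values()):
        for name, q in queues.items():
            # poll point: give this class one quantum, then move on
            deadline = time.monotonic() + QUANTUM_SECONDS
            while q and time.monotonic() < deadline:
                q.popleft()()          # run one small work item
                order.append(name)
            # preemption point: quantum expired or queue drained
    return order
```

The essential property is that every work item is small, so the loop regains control at sub-millisecond resolution and can rebalance between classes by policy.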

Being able to implement this low-level thread-per-core architecture brought significant advantages to Scylla. In one use case, it allowed the contraction from 120 Cassandra i3.2xl nodes to just three (3) i3.16xl nodes for Scylla. Such a reduction in nodes maintained the exact same number of cores, yet required significantly lower administrative overhead.

OLTP or Analytics? Why not Both?

In a second talk, Avi asked the question that has been plaguing the data community for years.

Can databases support parallel workloads with conflicting demands? OLTP workloads ask for the fastest response times, benefit from high cache utilization, and have a random access pattern. OLAP, on the other hand, is batchy in nature; latency is less important, while throughput and efficient processing are paramount. An OLAP workload can scan a large amount of data and access it only once, so that data shouldn’t be cached.

As Chad Jones of Deep Information Sciences quipped at his Percona Live 2016 keynote, “Databases can’t walk and chew gum at the same time.” He observed you can optimize databases for reads, or writes, but not both. At least, that was his view in 2016.

So how do you get a single system to provide both Online Transaction Processing (OLTP) and Online Analytical Processing (OLAP)? Avi noted that workloads can compete with each other. Writes, reads, batch workloads, and even internal maintenance tasks like compactions or repairs can all dominate a system if left to their own devices.

Prior practices with Cassandra have focused on throttling, configuring maximum throughput. But this has often been a manual tuning effort, which could leave systems idle at certain points, while being unable to shift system resources during peak activities.

Finally, after years of investment in foreground and background workload isolation, Scylla is able to provide different SLA guarantees for end-user workloads.

Avi took the audience through the low-level mechanisms by which Scylla, built on the Seastar scheduler, allows users to create new scheduling classes and assign them relative resource shares. For example, imagine a role defined as follows:


CREATE ROLE analytics
   WITH LOGIN = true
   AND SERVICE_LEVEL = { 'shares': 200 };

This would create an analytics role constrained to a share of system resources. Note that the rule limits the analytics role to a portion of the resources; however, this is not a hard cap. If there are enough idle resources, the analytics workload can go beyond its share and thus get the best of both worlds: utilization for OLAP and low latency for OLTP.
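The share semantics can be sketched as follows. This is an illustrative Python toy with assumed share and demand values, not Scylla's actual scheduler: under contention each role gets CPU time proportional to its shares, but the split is work-conserving, so an under-loaded role's leftover capacity goes to the others.

```python
def cpu_split(shares, demand):
    """shares: role -> share count; demand: role -> fraction of CPU the
    role could use (0.0-1.0). Returns each role's granted fraction."""
    granted = {}
    total = sum(shares.values())
    spare = 1.0
    # First pass: cap every role at its proportional entitlement.
    for role in shares:
        granted[role] = min(demand[role], shares[role] / total)
        spare -= granted[role]
    # Second pass: hand spare capacity to roles that still want more,
    # so a share is a floor under contention, not a hard cap.
    hungry = [r for r in shares if demand[r] > granted[r]]
    for role in hungry:
        granted[role] += min(demand[role] - granted[role],
                             spare / len(hungry))
    return granted

# oltp idle at 10% demand: analytics (200 of 1000 shares) exceeds its
# nominal 20% and soaks up the slack.
print(cpu_split({"oltp": 800, "analytics": 200},
                {"oltp": 0.1, "analytics": 1.0}))
```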

Avi then shared the results of two experiments, creating oltp and analytics users to simulate performance under different scenarios. In the second, more realistic scenario, Avi set up a latency-sensitive online workload that would consume 50% of CPU, versus a batch load with three times the thread count that would try to achieve as much throughput as it could get.

The results of his experiments were promising: the oltp workload continued to perform exactly as before, despite the 3X competing workload from analytics.

Having a single datacenter handle both operational and Spark (analytics) workloads is a huge advance for Scylla. There is no longer a need to reserve an entire datacenter, together with 1x-3x data duplication, for Spark. If you want to hear more about this novel ability, we will be covering it at Scylla Summit in November.

Looking Towards an Official Cassandra Sidecar, Netflix

Focusing on Cassandra itself, there were a number of very interesting talks. One in particular was by the team at Netflix on “Looking towards an Official C* Sidecar.” They’ve kindly shared their slides after the event. Netflix is no stranger to sidecars, being the home of Priam (first committed to GitHub in 2011), as well as sidecars for other services like Raigad for Elasticsearch and Prana. We at ScyllaDB have our own vision of a Scylla sidecar, in the form of Scylla Manager, but it is vital to hear the views of major stakeholders in the Cassandra community like Netflix. If you didn’t attend DDS18, this is definitely a talk to check out!

See You at Scylla Summit!

It was great to meet so many of you at Distributed Data Summit. Let’s keep those conversations going. We’re looking forward to seeing you at our own event coming up in November, Scylla Summit 2018. There will be hands-on training plus in-depth talks on a wide range of technical subjects, practical use cases, and lessons learned by some of the industry’s leading practitioners. Don’t delay! Register today!

The post Overheard at Distributed Data Summit 2018 appeared first on ScyllaDB.

Scylla Open Source Release 2.3

Scylla Release

The Scylla team is pleased to announce the release of Scylla 2.3, a production-ready Scylla Open Source minor release.

The Scylla 2.3 release includes CQL enhancements, new troubleshooting tools, performance improvements and more. Experimental features include Materialized Views, Secondary Indexes, and Hinted Handoff (details below). Starting from Scylla 2.3, packages are also available for Ubuntu 18.04 and Debian 9.

Scylla is an open source, Apache Cassandra-compatible NoSQL database, with superior performance and consistently low latency. Find the Scylla 2.3 repository for your Linux distribution here.

Our open source policy is to support only the current active release and its predecessor. Therefore, with the release of Scylla 2.3, Scylla 2.2 is still supported, but Scylla 2.1 is officially retired.


New Distribution Support

Scylla packages are now available for:

  • Ubuntu 18.04
  • Debian 9

AWS – Scylla AMI

The Scylla 2.3 AMI is now optimized for i3.metal as well. Read more on the performance gains of using bare-metal i3.metal compared to virtualized i3.16xlarge.


  • CQL: Identify Large Partitions. One of the common anti-patterns in Scylla is large partitions. Recent releases greatly improved the handling of large partitions, but reading a large partition is still less efficient and does not scale as well. With this release, Scylla now identifies large partitions and makes the information available in a system table, so you can find large partitions and fix them. The threshold for large partitions can be set in scylla.yaml:

compaction_large_partition_warning_threshold_mb parameter (default 100MB)

Example:  SELECT * FROM system.large_partitions;

> More on large partitions support in Scylla 2.3

  • CQL Tracing: Added prepared statement parameters #1657
  • Scyllatop – The Scyllatop tool, originally introduced with the Scylla Collectd API, is now using the Scylla Prometheus API, same as the Scylla Monitoring Stack. Starting from Scylla 2.3, Collectd is not installed by default, but the Collectd API is still supported.  #1541 #3490
  • iotune v2. Iotune is a storage benchmarking tool that runs as part of the scylla_setup script. Iotune runs a short benchmark on the Scylla storage and uses the results to set the Scylla io_properties.yaml configuration file (formerly called io.conf). Scylla uses these settings to optimize I/O performance, specifically through setting max storage bandwidth and max concurrent requests. The new iotune output matches the new IO scheduler configuration, is time-limited (2 minutes), and produces more consistent results than the previous version.
  • Python re-write: All scripts, previously written in bash, such as scylla_setup, were re-written in Python. This change does not have any visible effect but will make future enhancements easier.  See Google Bash Style Guide on the same subject.

New Features in Scylla 2.3

  • CQL: Datetime Functions Support. Scylla now includes support for the following functions:


Example: SELECT * FROM myTable WHERE date >= currentDate() - 2d

  • CQL: JSON Function Support: Scylla now includes support for JSON operations: SELECT JSON, INSERT JSON, and the functions toJson() and fromJson(). It is compatible with Apache Cassandra 2.2, with the exception of tuples and user-defined types. #3708 Example:

CREATE TABLE test (
   a text PRIMARY KEY,
   b timestamp);

INSERT INTO test JSON '{  "a" : "Sprint", "b" : "2011-02-03 04:05+0000" }';

  • CQL: Different timeouts for reads and range scans can now be set #3013
  • CQL: TWCS support for using millisecond values in timestamps #3152 (used by KairosDB and others)
  • Storage: Scylla now supports Apache Cassandra 2.2 file format (la sstable format). Native support for Apache Cassandra 3.0 file format is planned for the next major Scylla release.

Performance Improvements in Scylla 2.3

We continue to invest in increasing Scylla throughput and in reducing latency, focusing on reducing high percentile latencies.

  • Dynamic controllers for additional compaction strategies. A dynamic controller for Size Tiered Compaction Strategy was introduced in Scylla 2.2. In Scylla 2.3, we have added controllers for Leveled Compaction Strategy and Time Window Compaction Strategy.
  • Enhanced tagging for scheduling groups to improve performance isolation

Experimental Features

You are welcome to try the following features on a test environment. We welcome your feedback.

  • Materialized Views (MV)
    With Scylla 2.3, MV is feature-compatible with Cassandra 3.0, but is not considered production ready. In particular, the following are part of the release:
    • Creating a MV based on any subset of columns of the base table, including the primary key columns
    • Updating a MV for base table DELETE or UPDATE.
    • Indexing of existing data when creating an MV
    • Support for MV hinted handoff
    • Topology changes, and migration of MV token ranges
    • Repair of MV tables
    • Sync of TTL between a base table and an MV table
    • nodetool viewbuildstatus

The following MV functions are not available in Scylla 2.3:

    • Backpressure – cluster may be overloaded and fail under a write workload with MV/SI – planned for the next release
    • Unselected columns keep MV row alive #3362, CASSANDRA-13826 – planned for the next release
    • MV on static and collection columns
  • Secondary Indexes (SI)
    Unlike Apache Cassandra, Scylla’s SI is based on MV. This means every secondary index creates a materialized view under the hood, using all the columns of the original base table’s primary key plus the required indexed columns. The following SI functions are *not* available in Scylla 2.3:
    • Intersection between more than one SI and between an SI and a Partition Key – planned for the next release
    • Support for ALLOW FILTERING with a secondary index – planned for the next release
    • Paging support
    • Indexing of Static and Collection columns (same as for MV above)

Metrics Updates from Scylla 2.2 to Scylla 2.3

Next Steps

  • Scylla Summit 2018 is around the corner. Register now!
  • Learn more about Scylla from our product page.
  • See what our users are saying about Scylla.
  • Download Scylla. Check out our download page to run Scylla on AWS, install it locally in a Virtual Machine, or run it in Docker.

The post Scylla Open Source Release 2.3 appeared first on ScyllaDB.

Assassinate - A Command of Last Resort within Apache Cassandra

The nodetool assassinate command is meant specifically to remove cosmetic issues after the nodetool decommission or nodetool removenode commands have been properly run and at least 72 hours have passed. It is not a command that should be run under most circumstances, nor included in your regular toolbox. Rather, the lengthier nodetool decommission process is preferred when removing nodes, to ensure no data is lost. Note that you can also use the nodetool removenode command if cluster consistency is not the primary concern.

This blog post will explain:

  • How gossip works and why assassinate can disrupt it.
  • How to properly remove nodes.
  • When and how to assassinate nodes.
  • How to resolve issues when assassination attempts fail.

Gossip: Cassandra’s Decentralized Topology State

Since all topological changes happen within Cassandra’s gossip layer, before we discuss how to manipulate the gossip layer, let’s discuss the fundamentals of how gossip works.

From Wikipedia:

A gossip (communication) protocol is a procedure or process of computer-computer communication that is based on the way social networks disseminate information or how epidemics spread… Modern distributed systems often use gossip protocols to solve problems that might be difficult to solve in other ways, either because the underlying network has an inconvenient structure, is extremely large, or because gossip solutions are the most efficient ones available.

The gossip state within Cassandra is the decentralized, eventually consistent, agreed-upon topological state of all nodes within a Cassandra cluster. Cassandra gossip heartbeats keep the topological gossip state updated, are emitted by each node in the cluster, and contain the following information:

  • What that node’s status is, and
  • What its neighbors’ statuses are.

When a node goes offline, its gossip heartbeat is no longer emitted. The node’s neighbors will detect that the node is offline (with help from a failure detection algorithm tuned by the phi_convict_threshold parameter defined within cassandra.yaml) and will broadcast an updated status saying that the node is unavailable until further notice.

However, as soon as the node comes online, two things will happen:

  1. The revived node will:
    • Ask a neighbor node what the current gossip state is.
    • Modify the received gossip state to include its own status.
    • Assume the modified state as its own.
    • Broadcast the new gossip state across the network.
  2. A neighbor node will:
    • Discover the revived node is back online, either by:
      • First-hand discovery, or
      • Second-hand gossiping.
    • Update the received gossip state with the new information.
    • Modify the received gossip state to include its own status.
    • Assume the modified state as its own.
    • Broadcast the new gossip state across the network.

The above gossip protocol is responsible for the UN/DN, or Up|Down/Normal, statuses seen within nodetool status and is responsible for ensuring requests and replicas are properly routed to the available and responsible nodes, among other tasks.
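The reconciliation steps above can be sketched as a merge rule. This is a hypothetical Python toy, not Cassandra's internals: each endpoint's entry carries a (generation, version) pair, and the numerically newer pair wins when two gossip states are reconciled, which is why a revived node's fresh generation overrides any stale status.

```python
def merge_states(local, remote):
    """Both arguments map endpoint -> (generation, version, status)."""
    merged = dict(local)
    for endpoint, entry in remote.items():
        if endpoint not in merged or entry[:2] > merged[endpoint][:2]:
            merged[endpoint] = entry  # remote entry is newer; adopt it
    return merged

local = {"10.0.0.1": (5, 12, "UN"), "10.0.0.2": (3, 40, "DN")}
# 10.0.0.2 restarted: a higher generation trumps any older version count.
remote = {"10.0.0.2": (4, 1, "UN")}
print(merge_states(local, remote)["10.0.0.2"])  # -> (4, 1, 'UN')
```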

Differences Between Assassination, Decommission, and Removenode

There are three main commands used to take a node offline: assassination, decommission, and removenode. All three ultimately mark the deprecated node with the LEFT state. Having the node in the LEFT state ensures that each node’s gossip state will eventually be consistent and agree that:

  • The deprecated node has in fact been deprecated.
  • The deprecated node was deprecated after a given timestamp.
  • The deprecated token ranges are now owned by a new node.
  • Ideally, the deprecated LEFT state will be automatically purged after 72 hours.

Underlying Actions of Decommission and Removenode on the Gossip Layer

When nodetool decommission and nodetool removenode commands are run, we are changing the state of the gossip layer to the LEFT state for the deprecated node.

Following the gossip protocol procedure in the previous section, the LEFT status will spread across the cluster as the new truth, since the LEFT status has a more recent timestamp than the previous status.

As more nodes begin to assimilate the LEFT status, the cluster will ultimately reach consensus that the deprecated node has LEFT the cluster.

Underlying Actions of Assassination

Unlike nodetool decommission and nodetool removenode above, nodetool assassinate updates the gossip state to the LEFT state, forces an increment of the gossip generation number, and explicitly sets the application state to LEFT, which then propagates as normal.

Removing Nodes: The “Proper” Way

When clusters grow large, an operator may need to remove a node, either due to hardware faults or horizontally scaling down the cluster. At that time, the operator will need to modify the topological gossip state with either a nodetool decommission command for online nodes or nodetool removenode for offline nodes.

Decommissioning a Node: While Saving All Replicas

The typical command to run on a live node that will be exiting the cluster is:

nodetool decommission

The nodetool decommission command will:

  • Extend certain token ranges within the gossip state.
  • Stream all of the decommissioned node’s data to the new replicas in a consistent manner (the opposite of bootstrap).
  • Report to the gossip state that the node has exited the ring.

While this command may take a while to complete, the extra time spent on this command will ensure that all owned replicas are streamed off the node and towards the new replica owners.

Removing a Node: And Losing Non-Replicated Replicas

Sometimes a node may be offline due to hardware issues and/or has been offline for longer than gc_grace_seconds within a cluster that ingests deletion mutations. In this case, the node needs to be removed from the cluster while remaining offline to prevent “zombie data” from propagating around the cluster due to already expired tombstones, as defined by the gc_grace_seconds window. In the case where the node will remain offline, the following command should be run on a neighbor node:

nodetool removenode $HOST_ID

The nodetool removenode command will:

  • Extend certain token ranges within the gossip state.
  • Report to the gossip state that the node has exited the ring.
  • NOT stream any of the removed node’s data to the new replicas.
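The zombie-data risk that motivates keeping such a node offline can be sketched like this. This is a simplified Python toy, not Cassandra's compaction code: once a tombstone is older than gc_grace_seconds it may be compacted away, so a replica that missed the delete and rejoins later would resurrect the old value.

```python
GC_GRACE_SECONDS = 864_000  # 10 days, the Cassandra default

def compact(cells, now):
    """Drop tombstones whose deletion time is past the grace window."""
    return [c for c in cells
            if not (c["tombstone"] and now - c["ts"] > GC_GRACE_SECONDS)]

cells = [{"key": "a", "tombstone": True, "ts": 0}]
# After the grace window the tombstone is purged...
print(compact(cells, now=GC_GRACE_SECONDS + 1))  # -> []
# ...so a rejoining node that still holds a live copy of "a"
# would re-propagate it as "zombie data".
```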

Increasing Consistency After Removing a Node

Typically a follow up repair is required in a rolling fashion around the data center to ensure each new replica has the required information:

nodetool repair -pr

Note that:

  • The above command will only repair replica consistencies if the replication factor is greater than 1 and one of the surviving nodes contains a replica of the data.
  • Running a rolling repair will generate disk, CPU, and network load proportional to the amount of data needing to be repaired.
  • Throttling a rolling repair by repairing only one node at a time may be ideal.
  • Using Reaper for Apache Cassandra can schedule, manage, and load balance the repair operations throughout the lifetime of the cluster.

How We Can Detect Assassination is Needed

In either of the above cases, sometimes the gossip state will continue to be out of sync. There will be echoes of past statuses that claim not only the node is still part of the cluster, but it may still be alive. And then missing. Intermittently.

When the gossip truth is continuously inconsistent, nodetool assassinate will resolve these inconsistencies, but should only be run after nodetool decommission or nodetool removenode have been run and at least 72 hours have passed.

These issues are typically cosmetic and appear as similar lines within the system.log:

2014-09-26 01:26:41,901 DEBUG [Reconnection-1] Cluster - Failed reconnection to /172.x.y.zzz:9042 ([/172.x.y.zzz:9042] Cannot connect), scheduling retry in 600000 milliseconds

Or may appear as UNREACHABLE within the nodetool describecluster output:

Cluster Information:
        Name: Production Cluster
       Snitch: org.apache.cassandra.locator.DynamicEndpointSnitch
       Partitioner: org.apache.cassandra.dht.Murmur3Partitioner
       Schema versions:
              65e78f0e-e81e-30d8-a631-a65dff93bf82: [172.x.y.z]
              UNREACHABLE: [172.x.y.zzz]

Sometimes you may find yourself looking even deeper and spot the deprecated node within nodetool gossipinfo months after removing the node:

TOKENS: not present

Note that the LEFT status should stick around for 72 hours to ensure all nodes come to the consensus that the node has been removed. So please don’t rush things if that’s the case. Again, it’s only cosmetic.

In all of these cases the truth may be slightly outdated and an operator may want to set the record straight with truth-based gossip states instead of cosmetic rumor-filled gossip states that include offline deprecated nodes.

How to Run the Assassination Command

Pre-2.2.0, operators had to use Java MBeans to assassinate a token (see below). Post-2.2.0, operators can use the nodetool assassinate command.

From an online node, run the command:

nodetool assassinate $IP_ADDRESS

Internally, the nodetool assassinate command will execute the unsafeAssassinateEndpoint command over JMX on the Gossiper MBean.

Java MBeans Assassination

If using a version of Cassandra that does not yet have the nodetool assassinate command, we’ll have to rely on jmxterm.

You can download jmxterm (for example, the jmxterm-1.0.0-uber.jar used below) from the jmxterm project’s releases page.


Then we’ll want to use the Gossiper MBean and run the unsafeAssassinateEndpoint command:

echo "run -b org.apache.cassandra.net:type=Gossiper unsafeAssassinateEndpoint $IP_TO_ASSASSINATE" \
    | java -jar jmxterm-1.0.0-uber.jar -l $IP_OF_LIVE_NODE:7199

Both assassination commands trigger the same MBean operation over JMX; however, the nodetool assassinate command is preferred for its ease of use and lack of additional dependencies.

Resolving Failed Assassination Attempts: And Why the First Attempts Failed

When clusters grow large enough, are geospatially distant enough, or are under intense load, the gossip state may become a bit out of sync with reality. Sometimes this causes assassination attempts to fail and while the solution may sound unnerving, it’s relatively simple once you consider how gossip states act and are maintained.

Because gossip states can be decentralized across high-latency nodes, gossip state updates can sometimes be delayed, causing a variety of race conditions that may show offline nodes as still being online. Most times these race conditions will be corrected in a relatively short period, as tuned by the phi_convict_threshold within cassandra.yaml (between a value of 8 for bare-metal hardware and 12 for virtualized instances). In almost all cases the gossip state will converge on a global truth.

However, because gossip states for nodes that are no longer participating in gossip heartbeat rounds have no explicit source and are instead fueled by rumors, dead nodes may sometimes continue to live within the gossip state even after the assassinate command has been called.

To solve these issues, you must ensure all race conditions are eliminated.

If a gossip state will not forget a node that was removed from the cluster more than a week ago:

  • Login to each node within the Cassandra cluster.
  • Download jmxterm on each node, if nodetool assassinate is not an option.
  • Run nodetool assassinate, or the unsafeAssassinateEndpoint command, multiple times in quick succession.
    • I typically recommend running the command 3-5 times within 2 seconds.
    • I understand that sometimes the command takes time to return, so the “2 seconds” suggestion is less of a requirement than it is a mindset.
    • Also, sometimes 3-5 times isn’t enough. In such cases, shoot for the moon and try 20 assassination attempts in quick succession.

What we are trying to do is to create a flood of messages requesting all nodes completely forget there used to be an entry within the gossip state for the given IP address. If each node can prune its own gossip state and broadcast that to the rest of the nodes, we should eliminate any race conditions that may exist where at least one node still remembers the given IP address.

As soon as all nodes agree that they don’t remember the deprecated node, the cosmetic issue will no longer appear in any system.log, nodetool describecluster output, or nodetool gossipinfo output.

Recap: How To Properly Remove Nodes Completely

Operators shouldn’t opt for the assassinate command as a first resort for taking a Cassandra node out, since it is a sledgehammer and most of the time operators are dealing with a screw.

However, when operators follow best practices and perform a nodetool decommission for live nodes or nodetool removenode for offline nodes, sometimes lingering cosmetic issues may lead the operator to want to keep the gossip state consistent.

After at least a week of inconsistent gossip state, nodetool assassinate or the unsafeAssassinateEndpoint command may be used to remove deprecated nodes from the gossip state.

When a single assassination attempt does not work across an entire cluster, the attempt sometimes needs to be repeated multiple times on all nodes within the cluster simultaneously. Doing so ensures that each node modifies its own gossip state to accurately reflect the deprecated node’s absence, and that no node will further broadcast rumors of a false gossip state.

Scylla Summit Preview: Keeping Your Latency SLAs No Matter What!

Scylla Summit 2018: Keeping Your Latency SLAs No Matter What!

In the run-up to Scylla Summit 2018, we’ll be featuring our speakers and providing sneak peeks at their presentations. The first interview in this series is with ScyllaDB’s own Glauber Costa.

Glauber, your talk is entitled “Keeping Your Latency SLAs No Matter What!” How did you come up with this topic?

Last year I gave a talk at Scylla Summit where I unequivocally stated that we view latency spikes as a bug. If they are a bug, that means we should fix them so I also talked about some of the techniques that we used to do that.

But you know, there are some things that are by design never truly finished and the more you search, the more you find. This year my team and I poured a lot more work into finding even more situations where latency creeps up, and we fixed those too. So I figured the Summit would be a great time to update the community on the improvements we have made in this area.

What do you believe are the hardest elements to maintain in latency SLAs?

The hardest part of keeping latencies under control is that events that cause latency spikes can occur anytime, anywhere. For Java-based systems we are already familiar with the infamous garbage collection, which Scylla gracefully sidesteps by being written in C++. But the hardware can introduce latencies, the OS kernel can introduce latencies, and even in the database itself they can come from the most unpredictable of places. It’s a battle against everyone.

What is a war story you’re free to share of a deployment that just wasn’t keeping its SLAs?

That’s actually a good question and a nice opportunity to show that SLAs are indeed complex beasts and latencies can come from anywhere. We have a customer with very strict SLAs for their p99, and those weren’t always being met. After much investigation we determined that the source of those latencies was not Scylla itself, but the Linux kernel (in particular the XFS filesystem). Thankfully we have on our bench a lot of people who know Linux deeply, having contributed to the kernel for more than a decade. We were then able to understand the problem and work on a solution ourselves.

That’s interesting! Is that the kind of thing people are expected to learn by coming to the talk?

Yes, I will cover that. I want to show people a 360º view of the work we do in Scylla to keep latencies low and predictable. Not all of that work is done in the Scylla codebase. The majority of course is, but it ends up sprawling down all the way to the Linux kernel. But this won’t be a Linux talk! We have many interesting pieces of technology that help us keep our latencies consistently low in Scylla, and I will be talking about them as well. For example, we redesigned our I/O Scheduler, we finalized our CPU Scheduler, added full controllers for all compaction strategies, and also took a methodical approach to find sources of latency spikes and get rid of them. It will certainly be a very extensive talk!

Thanks Glauber! We’re looking forward to your talk at the Summit!

It’s my pleasure! By the way, if anyone reading this article hasn’t registered yet, you can register with the code glauber25monster to get 25% off the current price.

The post Scylla Summit Preview: Keeping Your Latency SLAs No Matter What! appeared first on ScyllaDB.

Why We Built an Open Source Cassandra-Operator to Run Apache Cassandra on Kubernetes

As Kubernetes becomes the de facto standard for container orchestration, more and more developers (and enterprises) want to run Apache Cassandra on Kubernetes. It’s easy to get started with this – especially considering the capabilities that Kubernetes’ StatefulSets bring to the table. Kubernetes, though, certainly has room to improve when it comes to storing data in-state and understanding how different databases work.

For example, Kubernetes doesn’t know if you’re writing to a leader or a follower database, to a multi-sharded leader infrastructure, or to a single database instance. StatefulSets – workload API objects used to manage stateful applications – offer the building blocks required for stable unique network identifiers, stable persistent storage, ordered and smooth deployment and scaling, deletion and termination, and automated rolling updates. However, while getting started with Cassandra on Kubernetes might be easy, it can still be a challenge to run and manage (and running Docker is another challenge in itself).


To overcome some of these hurdles, we decided to build an open source Cassandra-operator that runs and operates Cassandra within Kubernetes. Think of it as Cassandra-as-a-Service on top of Kubernetes. We’ve made this Cassandra-operator open source and freely available on GitHub. It remains a work in progress between myself, others on my team, and a number of partner contributors – but it is functional and ready for use. The Cassandra-operator supports Docker images, which are open source and available as well (via the same link).


This Cassandra-operator is designed to provide “operations-free” Cassandra: it takes care of deployment and allows users to manage and run Cassandra, in a safe way, within Kubernetes environments. It also makes it simple to utilize consistent and reproducible environments.


While it’s possible for developers to build scripts for managing and running Cassandra on Kubernetes, the Cassandra-operator offers the advantage of providing the same consistent reproducible environment, as well as the same consistent reproducible set of operations through different production clusters. And this is true across development, staging, and QA environments. Furthermore, because best practices are already built into the operator, development teams are spared from operational concerns and are able to focus on their core capabilities.


What is a Kubernetes operator?


A Kubernetes operator consists of two components: a controller and a custom resource definition (CRD). The CRD allows devs to create Cassandra objects in Kubernetes; it extends Kubernetes with custom objects or resources whose definitions our controller can then watch for changes. Devs can define an object in Kubernetes that contains configuration options for Cassandra, such as cluster name, node count, JVM tuning options, etc. – all the information you want to give Kubernetes about how to deploy Cassandra.
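For illustration only, here is the kind of information such an object might carry, expressed as a Python dict. The field names and apiVersion are hypothetical, not the operator's actual CRD schema.

```python
# Hypothetical shape of a Cassandra custom resource, as a plain dict.
# Every field name here is an illustrative assumption.
cassandra_resource = {
    "apiVersion": "example.com/v1",    # hypothetical API group/version
    "kind": "Cassandra",
    "metadata": {"name": "my-cluster"},
    "spec": {                          # options the controller reads to
        "clusterName": "my-cluster",   # build matching StatefulSets
        "nodeCount": 3,
        "jvmOptions": ["-Xmx1g"],
    },
}
```

The controller watches objects of this kind and reconciles the cluster toward the declared spec.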


You can isolate the Cassandra-operator to a specific Kubernetes namespace, define what kinds of persistent volumes it should use, and more. The Cassandra-operator controller listens to state changes on the Cassandra CRD and will create its own StatefulSets to match those requirements. It will also manage those operations and can ensure repairs, backups, and safe scaling as specified via the CRD. In this way, it leverages the Kubernetes concept of building controllers upon other controllers in order to achieve intelligent and helpful behaviours.


How does it work?

Architecturally, the Cassandra controller connects to the Kubernetes Master. It listens to state changes and manipulates Pod definitions and CRDs. It then deploys those, waits for changes to occur, and repeats until all necessary changes are complete.


The Cassandra controller can, of course, perform operations within the Cassandra cluster itself. For example, want to scale down your Cassandra cluster? Instead of manipulating the StatefulSet to handle this task, the controller will first see the CRD change. The node count will change to a lower number (say from six to five). The controller will get that state change, and it will first run a decommission operation on the Cassandra node that’s going to be removed. This ensures that the Cassandra node stops gracefully and that it will redistribute and rebalance the data it held across the remaining nodes. Once the Cassandra controller sees that this has happened successfully, it will modify that StatefulSet definition to allow Kubernetes to finally decommission that particular Pod. Thus, the Cassandra controller brings needed intelligence to the Kubernetes environment to run Cassandra properly and ensure smoother operations.


As we continue this project and iterate on the Cassandra-operator, our goal is to add new components that will continue to expand the tool’s features and value. A good example is the Cassandra SideCar (included in the diagram above), which will begin to take responsibility for tasks like backups and repairs. Current and future features of the project can be viewed on GitHub. Our goal for the Cassandra-operator is to give devs a powerful open source option for running Cassandra on Kubernetes with a simplicity and grace that has not yet been all that easy to achieve.


Ben Bromhead is CTO at Instaclustr, which provides a managed service platform of open source technologies such as Apache Cassandra, Apache Spark, Elasticsearch and Apache Kafka.

The post Why We Built an Open Source Cassandra-Operator to Run Apache Cassandra on Kubernetes appeared first on Instaclustr.

Large Partitions Support in Scylla 2.3 and Beyond

Scylla 2.3 helps find your large partitions

Large partitions, although supported by Scylla, are also well known for causing performance issues. Fortunately, release 2.3 comes with a helping hand for discovering and investigating large partitions present in a cluster — system.large_partitions table.

Large partitions

CQL, as a data modeling language, aims towards very good readability and hiding unneeded implementation details from users. As a result, sometimes it’s not clear why a very simple data model suffers from unexpected performance problems. One of the potential suspects might be large partitions. Our blog entry on large partitions contains a detailed explanation on why coping with large partitions is important. We’ll use some of the same example tables from this article below.

The following table could be used in a distributed air quality monitoring system with multiple sensors:

CREATE TABLE air_quality_data (
   sensor_id text,
   time timestamp,
   co_ppm int,
   PRIMARY KEY (sensor_id, time)
);

With time being our table’s clustering key, it’s easy to imagine that partitions for each sensor can grow very large – especially if data is gathered every couple of milliseconds. Given that there is a hard limit on the number of clustering rows per partition (2 billion), this innocent-looking table can eventually become unusable – in this example, in about 50 days.
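The “about 50 days” figure is easy to sanity-check, assuming one reading every 2 milliseconds (a hedged reading of “every couple of milliseconds”):

```python
# Back-of-the-envelope check of the "about 50 days" figure.
MAX_CLUSTERING_ROWS = 2_000_000_000  # hard per-partition row limit
SAMPLE_INTERVAL_S = 0.002            # assumed: one reading every 2 ms

days_to_fill = MAX_CLUSTERING_ROWS * SAMPLE_INTERVAL_S / 86_400
print(round(days_to_fill, 1))  # ≈ 46.3 days, i.e. roughly 50 days
```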

A standard solution is to amend the data model to reduce the number of clustering keys per partition key. In this case, let’s take a look at amended table air_quality_data:

CREATE TABLE air_quality_data (
   sensor_id text,
   date text,
   time timestamp,
   co_ppm int,
   PRIMARY KEY ((sensor_id, date), time)
);

After the change, one partition holds the values gathered in a single day, which makes it less likely to overflow.

system.large_partitions table

Amending the data model may help with large partition issues. But sometimes you have such issues without realizing it. It’s useful to be able to see which tables have large partitions and how many of them exist in a cluster.

In order to track how many large partitions are created and to which table they belong, one can use the system.large_partitions table, which is implicitly created with the following schema:

CREATE TABLE system.large_partitions (
    keyspace_name text,
    table_name text,
    sstable_name text,
    partition_size bigint,
    partition_key text,
    compaction_time timestamp,
    PRIMARY KEY ((keyspace_name, table_name), sstable_name, partition_size, partition_key)
) WITH CLUSTERING ORDER BY (sstable_name ASC, partition_size DESC, partition_key ASC);

How it works

Partitions are written to disk during memtable flushes and compaction. If, during either of these actions, a large partition is written, an entry in system.large_partitions will be created (or updated). It’s important to remember that large-partition information is updated only when a row is actually written to disk; changes might not be visible immediately after a write operation is acknowledged to the user, since the data could still reside in a memtable for some time.

Each entry in the system.large_partitions table represents a partition written to a given sstable. Note that the large_partitions table is node-local – querying it will return large-partition information only for the node that serves the request.

Listing all local large partition info can be achieved with:

SELECT * FROM system.large_partitions;

Checking large partitions for a specific table:

SELECT * FROM system.large_partitions WHERE keyspace_name = 'ks' and table_name = 'air_quality_data';

Listing all large partitions in a given keyspace that exceeded 140MB:

SELECT * FROM system.large_partitions WHERE partition_size > 146800640 ALLOW FILTERING; *

  • Note: ALLOW FILTERING support is not part of 2.3; it will be present in the next release
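The byte value used in the query above is simply 140 MiB expressed in bytes:

```python
# 140 MiB in bytes, matching the partition_size threshold in the query.
threshold_bytes = 140 * 1024 * 1024
print(threshold_bytes)  # 146800640
```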

Listing all large partitions compacted today:

SELECT * FROM system.large_partitions WHERE compaction_time >= toTimestamp(currentDate()) ALLOW FILTERING; *

  • Note: ALLOW FILTERING support is not part of 2.3; it will be present in the next release

Since system.large_partitions can be read just like a regular CQL table, there are many more combinations of queries that return helpful results. Remember that keyspace_name and table_name act as the partition key, so some more complex queries, like the last example above, may involve filtering (hence the appended ALLOW FILTERING keywords). Filtering support for such queries is not part of 2.3 and will be available in the next release.

Aside from table name and size, system.large_partitions contains information on the offending partition key, when the compaction that led to the creation of this large partition occurred, and its sstable name (which makes it easy to locate its filename).


For both readability and performance reasons, not all partitions are registered in system.large_partitions table. The threshold can be configured with an already existing parameter in scylla.yaml:

compaction_large_partition_warning_threshold_mb: 100

Previously, this configuration option only triggered a warning that was logged each time a large-enough partition was written.

The large-partition warning threshold defaults to 100MiB, which means each partition larger than that will be registered in the system.large_partitions table the moment it is written, whether by memtable flush or as a result of compaction.

If the default value does not suit a specific use case – e.g. even 1MiB partitions are considered “too big” or, conversely, virtually every partition is bigger than 100MiB – you can modify compaction_large_partition_warning_threshold_mb accordingly.

Disabling system.large_partitions can effectively be done by setting the threshold to an extremely high value, say, 500GiB. However, it’s highly recommended to leave it at a reasonable level. Better safe than sorry.

In order to prevent stale data from appearing in system.large_partitions, each record is inserted with time-to-live of 30 days.


We promised back in 2016 for Release 1.3 that we’d continue to improve support for large partitions. This improvement for 2.3 is a follow-through on that commitment. As you can see, we already have some next-steps planned out with future support for ALLOW FILTERING.

For now, we’d like for you to try system.large_partitions, and let us know what you find. Are you already aware of large partitions in your database, or did it help you discover anything about your data you didn’t already know?

If large partitions are critical to you, feel free to contact us with your war stories and requirements, or bring them up when you see us at Scylla Summit this November.

The post Large Partitions Support in Scylla 2.3 and Beyond appeared first on ScyllaDB.

The 5 Things You Need to Know About Distributed Cloud Databases

Customer expectations are fast evolving based on companies who leverage cloud innovation. Key to being able to meet customer expectations is having a distributed cloud database to drive your applications. The faster and more reliable your applications are, the more seamless the customer experience. But it’s not all about speed. It’s also about providing a highly personalized interaction that makes your customers feel as if you anticipate their needs and tailor your services to their unique preferences.

Distributed cloud databases are databases where the operational data is spread across various physical locations — data centers, hybrid clouds, public cloud regions or different public clouds. They may be a native service within a public cloud provider or from a cloud-agnostic software vendor.

These databases are critical to the success of your applications. Why?

Here are five features of applications that are powered by distributed cloud databases. Make sure any cloud database provider you consider supports these functionalities:

1. Relevancy

Applications have to be able to provide contextual logic to customize an experience for the customer. Your cloud database needs to correlate customer transaction data across all touchpoints to create an individualized customer experience. This takes the shape of recommendations and curating products or solutions that would most likely resonate with the customer.

2. Availability

It used to be that 99.99% uptime was acceptable. Not anymore. Today’s customers expect 100% uptime and your business can’t afford to fail on this. Any downtime sends customers to your competitors—and they stay where they find available service. Your cloud database needs to provide continuous access to data without any downtime. It must be always on.

3. Responsiveness

Cloud databases need to process millions of transactions in real time in order to provide instant service to customers. Like downtime, customers have zero tolerance for lagging applications. They want their apps to work in the moment they need them.

4. Accessibility

Whether your company is national or global, chances are you have customers around the world accessing your apps. A distributed cloud database ensures that data can be replicated across multiple data centers and/or cloud regions. Applications are then available worldwide with localized data.

5. Engagement

As your business grows or experiences seasonal volume spikes, your cloud database needs to be able to automatically scale to handle more data quickly and cost-effectively. Scalability ensures you provide continuous service to customers while your business grows or experiences high volume traffic.

These are the five must-have features your distributed cloud database should offer. There’s one more feature you should look for: data autonomy. You need to be able to keep your data portable and avoid cloud lock-in.

DataStax Enterprise is the only geo-distributed data layer that provides data portability across public clouds, letting you take advantage of each cloud and giving you the agility to efficiently switch cloud providers when market changes require it. It also makes GDPR compliance much easier.

Powering Growth and Innovation in a Hybrid Cloud World (eBook)


Creating Value from Big Data, Right Now

About AMD: Advanced Micro Devices, Inc. (AMD) is an American multinational semiconductor company based in Santa Clara, California, that develops computer processors and related technologies for business and consumer markets.

The world is undergoing unprecedented change driven by big data and advances in communications technology.

The advent of big data has revolutionized analytics and data science by allowing enterprises to store, access, and analyze massive amounts of data of almost any type from any source. Simultaneously, billions of people (and things) came online, connecting to the internet and to each other, generating enormous amounts of new data in the process.

New generations of data-processing architectures took on the big data challenge by creating methods to store and access all this data in continuously growing data lakes. But the real challenge, of course, is to generate value from this data; to turn data into insight and insight into action. This, by itself, is a significant challenge, but the increasingly always-on, connected nature of society, with expectations to connect anywhere from any device with a seamless customer experience has given rise to a new class of systems.

DataStax Enterprise (DSE) implements a multi-workload, low-latency database at cloud-scale, enabling outstanding customer engagement and real-time personalization in order to create an unsurpassed customer experience from anywhere at any time.

The revolutionary AMD EPYC™ processor is perfectly suited to power this system. The AMD EPYC SoC brings a new balance to the data center. Utilizing an x86-architecture, EPYC brings together high core counts, large memory capacity, ample memory bandwidth, and massive I/O with the right ratios to help performance reach new heights.

DSE can handle transactional, search, and operational analytics workloads all running on the same cluster and possesses built-in workload isolation and replication capabilities that keep these workloads from competing for resources. EPYC’s innovative architecture is incredibly flexible, allowing you to match core count to the application need without compromising processor features. This balanced set of resources means more freedom to right-size the server configuration to the workload.

The revolutionary EPYC processor has gained significant momentum in the industry this year. It is truly exciting to see its adoption by major server vendors and cloud service providers. EPYC is now designed-in to over 50 server platforms, including those offered by HP, Dell EMC, Super Micro and Cisco.

Partnerships are critical to bringing the potential of EPYC for anyone who wants to leverage its unique blend of performance and features, and AMD is proud to partner with DataStax to create jointly validated and engineered solutions to meet today’s challenges head-on. For more information, check out the AMD and DataStax joint solution brief.

We are all here at Strata this week in New York. We invite you to stop by the AMD booth and come hear Jonathan Ellis, CTO of DataStax, on Thursday, September 13th at 3:30 pm as he discusses the strengths, weaknesses, and tradeoffs between open source Cassandra and cloud-hosted NoSQL offerings.

Incremental Repair Improvements in Cassandra 4

In our previous post, “Should you use incremental repair?”, we recommended using subrange full repairs instead of incremental repair, as CASSANDRA-9143 could generate severe instabilities on a running cluster. As the 4.0 release approaches, let’s see how incremental repair was modified for the next major version of Apache Cassandra in order to become reliable in production.

Incremental Repair in Pre-4.0 Clusters

Since Apache Cassandra 2.1, incremental repair was performed as follows:

  • The repair coordinator asks all replicas to build Merkle trees using only SSTables with a RepairedAt value of 0 (meaning they haven’t been part of a repair yet).
    Merkle trees are hash trees of the data they represent; they don’t store the original data.
  • Mismatching leaves of the Merkle trees get streamed between replicas.
  • When all streaming is done, anticompaction is performed on all SSTables that were part of the repair session.

But during the whole process, SSTables could still get compacted away as part of standard automatic compactions. If that happened, the SSTable would not get anticompacted, and the data it contained would not be marked as repaired. In the diagram below, SSTable 1 is compacted with 3, 4 and 5, creating SSTable 6 during the streaming phase. This happens before anticompaction is able to split apart the repaired and unrepaired data:

SSTable 1 gets compacted away before anticompaction could kick in
SSTable 1 gets compacted away before anticompaction could kick in.

If this happens on a single node, the next incremental repair run would find differences, as the previously repaired data would be skipped on all replicas but one, which could potentially lead to a lot of overstreaming. This happens because Merkle trees only contain hashes of data, and in Cassandra the height of the tree is bounded to prevent over-allocation of memory. The more data we use to build the tree, the larger it needs to be; limiting its height means the hashes in the leaves are responsible for bigger ranges of data.
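A quick illustration of why a height-bounded tree overstreams: with a fixed depth, each leaf hash covers total_rows / 2**depth rows, so one mismatching leaf forces streaming of that whole slice. The depth and row counts below are arbitrary examples, not Cassandra's actual defaults.

```python
# Rows covered by each leaf of a perfect binary Merkle tree of a given
# depth; a tree of depth d has 2**d leaves.
def rows_per_leaf(total_rows, depth):
    return total_rows / 2 ** depth

print(rows_per_leaf(1_000_000, 15))    # ~30.5 rows behind each leaf hash
print(rows_per_leaf(100_000_000, 15))  # ~3051.8 rows: one mismatch streams far more
```

Growing the data 100x without growing the tree makes every mismatching leaf 100x more expensive to reconcile.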

Already repaired data in SSTable 6 will be part of the Merkle tree computation
Already repaired data in SSTable 6 will be part of the Merkle tree computation.

If you wonder what troubles can be generated by this bug, I invite you to read my previous blog post on this topic.

Incremental repair in 4.0, the theory

The incremental repair process is now supervised by a transaction to guarantee its consistency. In the “Prepare phase”, anticompaction is performed before the Merkle trees are computed, and the candidate SSTables are marked as pending a specific repair. Note that they are not marked as repaired just yet, to avoid inconsistencies in case the repair session fails.

If a candidate SSTable is currently part of a running compaction, Cassandra will try to cancel that compaction and wait up to a minute. If the compaction successfully stops within that time, the SSTable will be locked for future anticompaction, otherwise the whole prepare phase and the repair session will fail.

Incremental repair in 4.0

SSTables marked as pending repair are only eligible to be compacted with other SSTables marked as pending.
SSTables in the pending repair pool are the only ones participating in both Merkle tree computations and streaming operations:

Incremental repair in 4.0

During repair, the pool of unrepaired SSTables receives newly flushed ones, and compaction takes place as usual within it. SSTables that are being streamed in become part of the “pending repair” pool. This prevents two potential problems:

  • If the streamed SSTables were put in the unrepaired pool, they could get compacted away as part of normal compaction tasks and would never be marked as repaired.
  • If the streamed SSTables were put in the repaired pool and the repair session failed, we would have data marked as repaired on some nodes but not others, which would generate overstreaming during the next repair.

Once the repair succeeds, the coordinator sends a request to all replicas to mark the SSTables in pending state as repaired, by setting the RepairedAt timestamp (since anticompaction already took place, Cassandra just needs to set this timestamp).

Incremental repair in 4.0

If some nodes failed during the repair, the “pending repair” SSTables will be released and become eligible for compaction (and repair) again. They will not be marked as repaired:

Incremental repair in 4.0
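The two possible outcomes of the session can be sketched together (illustrative Python, not Cassandra internals): on success, pending SSTables simply get their repairedAt timestamp set, since anticompaction already happened during the prepare phase; on failure, they are released back to the unrepaired pool:

```python
import time

# Sketch of session finalization. Success promotes pending SSTables by
# setting repairedAt (a metadata update, no data rewrite); failure just
# clears the pending marker so they become eligible for compaction again.
def finalize(sstables, session_id, succeeded):
    for s in sstables:
        if s["pending_repair"] != session_id:
            continue  # belongs to another session (or no session); untouched
        s["pending_repair"] = None
        if succeeded:
            s["repaired_at"] = int(time.time() * 1000)  # millis, like RepairedAt

ok = [{"pending_repair": "sess-1", "repaired_at": 0}]
finalize(ok, "sess-1", succeeded=True)
print(ok[0]["repaired_at"] > 0)   # promoted to the repaired pool

failed = [{"pending_repair": "sess-2", "repaired_at": 0}]
finalize(failed, "sess-2", succeeded=False)
print(failed[0])                   # pending cleared, still unrepaired
```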

The practice

Let’s see how this whole process takes place by running a repair and observing the behavior of Cassandra.

To that end, I created a 5-node CCM cluster running locally on my laptop and used tlp-stress to load some data with a replication factor of 2:

bin/tlp-stress run BasicTimeSeries -i 1M -p 1M -t 2 --rate 5000  --replication "{'class':'SimpleStrategy', 'replication_factor':2}"  --compaction "{'class': 'SizeTieredCompactionStrategy'}"  --host

One node was then stopped, and I deleted all the SSTables from the tlp_stress.sensor_data table:

Datacenter: datacenter1
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address    Load       Tokens       Owns    Host ID                               Rack
UN  247,07 KiB  1            ?       dbccdd3e-f74a-4b7f-8cea-e8770bf995db  rack1
UN  44,08 MiB  1            ?       3ce4cca5-da75-4ede-94b7-a37e01d2c725  rack1
UN  44,07 MiB  1            ?       3b9fd30d-80c2-4fa6-b324-eaecc4f9564c  rack1
UN  43,98 MiB  1            ?       f34af1cb-4862-45e5-95cd-c36404142b9c  rack1
UN  44,05 MiB  1            ?       a5add584-2e00-4adb-8949-716b7ef35925  rack1

I ran a major compaction on all nodes to easily observe the anticompactions. On node2, we then have a single SSTable on disk:

sensor_data-f4b94700ad1d11e8981cd5d05c109484 adejanovski$ ls -lrt *Data*
-rw-r--r--  1 adejanovski  staff  41110587 31 aoû 15:09 na-4-big-Data.db

The sstablemetadata tool gives us interesting information about this file:

sstablemetadata na-4-big-Data.db
SSTable: /Users/adejanovski/.ccm/inc-repair-fix-2/node2/data0/tlp_stress/sensor_data-f4b94700ad1d11e8981cd5d05c109484/na-4-big
Partitioner: org.apache.cassandra.dht.Murmur3Partitioner
Bloom Filter FP chance: 0.01
Minimum timestamp: 1535720482962762 (08/31/2018 15:01:22)
Maximum timestamp: 1535720601312716 (08/31/2018 15:03:21)
SSTable min local deletion time: 2147483647 (no tombstones)
SSTable max local deletion time: 2147483647 (no tombstones)
Compression ratio: 0.8694195642299255
TTL min: 0
TTL max: 0
First token: -9223352583900436183 (001.0.1824322)
Last token: 9223317557999414559 (001.1.2601952)
minClusteringValues: [3ca8ce0d-ad1e-11e8-80a6-91cbb8e39b05]
maxClusteringValues: [f61aabc1-ad1d-11e8-80a6-91cbb8e39b05]
Estimated droppable tombstones: 0.0
SSTable Level: 0
Repaired at: 0
Pending repair: --
Replay positions covered: {CommitLogPosition(segmentId=1535719935055, position=7307)=CommitLogPosition(segmentId=1535719935056, position=20131708)}
totalColumnsSet: 231168
totalRows: 231168
Estimated tombstone drop times: 
   Drop Time | Count  (%)  Histogram 
   50th      0 
   75th      0 
   95th      0 
   98th      0 
   99th      0 
   Min       0 
   Max       0 
Partition Size: 
   Size (bytes) | Count  (%)  Histogram 
   179 (179 B)  | 56330 ( 24) OOOOOOOOOOOOOOOOOOo
   215 (215 B)  | 78726 ( 34) OOOOOOOOOOOOOOOOOOOOOOOOOO.
   310 (310 B)  |   158 (  0) 
   372 (372 B)  |  1166 (  0) .
   446 (446 B)  |  1691 (  0) .
   535 (535 B)  |   225 (  0) 
   642 (642 B)  |    23 (  0) 
   770 (770 B)  |     1 (  0) 
   50th      215 (215 B)
   75th      258 (258 B)
   95th      258 (258 B)
   98th      258 (258 B)
   99th      372 (372 B)
   Min       150 (150 B)
   Max       770 (770 B)
Column Count: 
   Columns | Count   (%)  Histogram 
   2       |   3230 (  1) .
   3       |     34 (  0) 
   50th      1
   75th      1
   95th      1
   98th      1
   99th      2
   Min       0
   Max       3
Estimated cardinality: 222877
EncodingStats minTTL: 0
EncodingStats minLocalDeletionTime: 1442880000 (09/22/2015 02:00:00)
EncodingStats minTimestamp: 1535720482962762 (08/31/2018 15:01:22)
KeyType: org.apache.cassandra.db.marshal.UTF8Type
ClusteringTypes: [org.apache.cassandra.db.marshal.ReversedType(org.apache.cassandra.db.marshal.TimeUUIDType)]
RegularColumns: data:org.apache.cassandra.db.marshal.UTF8Type

It is worth noting the cool improvements sstablemetadata has gone through in 4.0, especially regarding the histograms rendering. So far, and as expected, our SSTable is not repaired and it is not pending a running repair.

Once the repair starts, the coordinator node executes the Prepare phase and anticompaction is performed:

sensor_data-f4b94700ad1d11e8981cd5d05c109484 adejanovski$ ls -lrt *Data*
-rw-r--r--  1 adejanovski  staff  20939890 31 aoû 15:41 na-6-big-Data.db
-rw-r--r--  1 adejanovski  staff  20863325 31 aoû 15:41 na-7-big-Data.db

SSTable na-6-big is marked as pending our repair:

sstablemetadata na-6-big-Data.db
SSTable: /Users/adejanovski/.ccm/inc-repair-fix-2/node2/data0/tlp_stress/sensor_data-f4b94700ad1d11e8981cd5d05c109484/na-6-big
Repaired at: 0
Pending repair: 8e584410-ad23-11e8-ba2c-0feeb881768f
Replay positions covered: {CommitLogPosition(segmentId=1535719935055, position=7307)=CommitLogPosition(segmentId=1535719935056, position=21103491)}

na-7-big remains in the “unrepaired” pool (it contains tokens that are not being repaired in this session):

sstablemetadata na-7-big-Data.db
SSTable: /Users/adejanovski/.ccm/inc-repair-fix-2/node2/data0/tlp_stress/sensor_data-f4b94700ad1d11e8981cd5d05c109484/na-7-big
Partitioner: org.apache.cassandra.dht.Murmur3Partitioner
Bloom Filter FP chance: 0.01
Repaired at: 0
Pending repair: --

Once the repair finishes, another look at sstablemetadata on na-6-big shows us that it is now marked as repaired:

sstablemetadata na-6-big-Data.db
SSTable: /Users/adejanovski/.ccm/inc-repair-fix-2/node2/data0/tlp_stress/sensor_data-f4b94700ad1d11e8981cd5d05c109484/na-6-big
Estimated droppable tombstones: 0.0
SSTable Level: 0
Repaired at: 1535722885852 (08/31/2018 15:41:25)
Pending repair: --

Again, I really appreciate not having to compute the repair date by myself thanks to an sstablemetadata output that is a lot more readable than it was before.

Reliable incremental repair

While Apache Cassandra 4.0 is being stabilized and there are still a few bugs to hunt down, incremental repair finally received the fix it deserved to make it production ready for all situations. The transaction that encloses the whole operation will shield Cassandra from inconsistencies and overstreaming, making cyclic repairs a fast and safe operation. Orchestration is still needed, though, as SSTables cannot be part of two distinct repair sessions running at the same time, and it is advised to use a topology-aware tool to perform the operation without hurdles.
It is worth noting that full repair in 4.0 doesn’t involve anticompaction anymore and does not mark SSTables as repaired. This brings full repair back to its 2.1 behavior and allows running it on several nodes at the same time without fearing conflicts between validation compactions and anticompactions.