Preventing Data Resurrection with Repair Based Tombstone Garbage Collection

Delayed repairs can lead to a data resurrection problem, where deleted data comes back to life. In this post, I explain a novel way that ScyllaDB removes this risk: deleting tombstones based on repair execution, not a fixed duration (known as gc_grace_seconds). This will be available in ScyllaDB Open Source 5.0.

Background: Tombstones in ScyllaDB and Other CQL-Compatible Databases

You might wonder what a tombstone is. In ScyllaDB, as well as in Apache Cassandra and other CQL-compatible databases, a tombstone is a special data marker written into the database to indicate that a user has deleted some data.

This marker is needed because ScyllaDB and these other databases use immutable SSTable files. So when you need to delete data, rather than alter a database record itself, you leave a marker to let the database know that data has been marked as deleted. In future reads, the database will spot the tombstone and ensure that the deleted data is not returned as part of a query result.

Eventually, after a compaction process, the deleted rows and their tombstones are discarded. Why not just keep all the tombstone markers? Because the database cannot keep deleted data and tombstones forever: with a heavy delete workload, it would accumulate an ever-increasing amount of dead data. So you generally want tombstones purged to save disk space.

The solution to this problem is to drop tombstones when they are no longer necessary. Currently, a tombstone is dropped when both of the following hold:

  • First, the tombstone and the data it covers can be compacted away together.
  • Second, the tombstone is old enough: older than the gc_grace_seconds option, which defaults to 10 days.
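As a rough sketch (our illustration, not ScyllaDB's actual implementation), the age condition of timeout-based GC can be expressed as a simple predicate over a tombstone's deletion time:

```python
import time

GC_GRACE_SECONDS = 10 * 24 * 3600  # default gc_grace_seconds: 10 days

def can_gc_tombstone(deletion_time, now=None, gc_grace_seconds=GC_GRACE_SECONDS):
    """A tombstone becomes eligible for garbage collection once it is
    older than gc_grace_seconds (timeout-based GC)."""
    now = time.time() if now is None else now
    return now - deletion_time > gc_grace_seconds

# A tombstone written 11 days ago is eligible; one written a day ago is not.
now = 1_000_000_000
print(can_gc_tombstone(now - 11 * 24 * 3600, now=now))  # True
print(can_gc_tombstone(now - 1 * 24 * 3600, now=now))   # False
```

The compaction-related condition (the tombstone and its covered data compacting together) is omitted here; it depends on SSTable layout, not just time.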

All of this becomes a non-trivial issue when data and tombstones are replicated to multiple nodes. There are cases where the tombstones might be missing on some of the replica nodes – for instance, if the node was down during a deletion and no repair was performed within gc_grace_seconds.  As a result, data resurrection could happen.

Here is an example:

  1. In a 3-node cluster, a deletion with consistency level QUORUM is performed.
  2. One node was down but the other two were up, so the deletion succeeds and tombstones are written on the two up nodes.
  3. Eventually the downed node rejoins the cluster. However, if it rejoins beyond the Hinted Handoff window, it does not get the message that one of its records was marked for deletion.
  4. When gc_grace_seconds is exceeded, the two nodes with tombstones do GC, so tombstones and the covered data are gone.
  5. The node that did not receive the tombstone still has the data that was supposed to be deleted.

When the user performs a read query, the database could return deleted data since one of the nodes still has the deleted data.
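The scenario above can be simulated with a toy model (hypothetical, heavily simplified; real replicas reconcile by write timestamp, but after GC the tombstone and its timestamp are gone):

```python
# 3-node cluster, each replica holding key "k" with value "v".
replicas = [{"k": "v"}, {"k": "v"}, {"k": "v"}]

# Steps 1-2: a QUORUM delete succeeds while node 2 is down,
# so only nodes 0 and 1 record the tombstone.
for node in (0, 1):
    replicas[node]["k"] = "TOMBSTONE"

# Step 4: after gc_grace_seconds, nodes 0 and 1 purge the tombstone
# together with the covered data.
for node in (0, 1):
    del replicas[node]["k"]

# Step 5: node 2 never saw the tombstone, so the deleted value survives.
# A later read that reconciles across replicas sees it again.
surviving = [r.get("k") for r in replicas]
print(surviving)  # [None, None, 'v'] -- the deleted value is "resurrected"
```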

Timeout-based Tombstone GC

Let’s call the current tombstone GC method “timeout based tombstone GC.”

With this method, users have to run full cluster-wide repair within gc_grace_seconds to make sure tombstones are synced to all the nodes in the cluster.

However, this GC method is not robust, since correctness depends on user operations. And there is no guarantee that repair will finish in time: repair is a maintenance operation with the lowest priority, so if more important tasks like user workload and compaction are running, repair is slowed down and might miss the deadline.

This adds pressure on ScyllaDB admins to finish repairs within gc_grace_seconds. Note that we encourage admins to use ScyllaDB Manager, which helps automate backups, repairs and compaction.

In practice, users may want to avoid repair to reduce performance impacts and serve more user workload during critical periods, e.g., holiday shopping.

So, we need a more robust solution.

Repair-based Tombstone GC

We implemented the repair based tombstone GC method to solve this problem.

The idea is that we automatically remove tombstones — perform GC — only after repair is performed. This guarantees that all replica nodes have the tombstones, whether or not the repair is finished within gc_grace_seconds.

This provides multiple benefits:

  • There is no need to figure out a proper gc_grace_seconds value; in fact, it is very hard to find one that works across different workloads.
  • There is no more data resurrection if repair is not performed in time, which means less pressure to run repairs on a strict schedule.
  • Since there is no longer a hard requirement to finish repairs in time, repair intensity can be throttled even further to reduce the latency impact on user workload.
  • Conversely, if repair is performed more frequently than gc_grace_seconds, tombstones can be GC'ed faster, which results in better read performance.


You can use ALTER TABLE and CREATE TABLE to turn on this feature using a new option: tombstone_gc.

For example (with a placeholder keyspace and table name):

ALTER TABLE ks.my_table WITH tombstone_gc = {'mode':'repair'};

CREATE TABLE ks.my_table (key blob PRIMARY KEY, val blob) WITH tombstone_gc = {'mode':'repair'};

You can specify the following modes: timeout, repair, disabled and immediate.

  • The timeout mode is the default. It runs GC after gc_grace_seconds, exactly as before this feature.
  • The repair mode is the one users should use: GC is conducted only after a repair is performed.
  • The disabled mode is useful when loading data into ScyllaDB. We do not want to run GC when only part of the data is available, since loading tools can generate out-of-order writes or writes with timestamps in the past. In such cases, set the mode to disabled while the tools are loading data.
  • The immediate mode is mostly useful for Time Window Compaction Strategy (TWCS) with no user deletes. It is much safer than setting gc_grace_seconds to zero. We are even considering rejecting user deletes if the mode is immediate, to be on the safe side.
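The four modes can be summarized as a decision function. The sketch below is purely illustrative (names and logic are ours, not ScyllaDB's internals):

```python
def can_gc(mode, tombstone_age, gc_grace_seconds, repaired_since_write):
    """Illustrative decision: may a tombstone be garbage-collected?

    mode                 -- one of 'timeout', 'repair', 'disabled', 'immediate'
    tombstone_age        -- seconds since the tombstone was written
    gc_grace_seconds     -- the table's gc_grace_seconds setting
    repaired_since_write -- True if a repair covering this data completed
                            after the tombstone was written
    """
    if mode == "timeout":
        return tombstone_age > gc_grace_seconds
    if mode == "repair":
        return repaired_since_write
    if mode == "disabled":
        return False
    if mode == "immediate":
        return True
    raise ValueError(f"unknown tombstone_gc mode: {mode}")

# In repair mode, even an old tombstone cannot be GC'ed until repair runs.
print(can_gc("timeout", tombstone_age=11 * 86400,
             gc_grace_seconds=10 * 86400, repaired_since_write=False))  # True
print(can_gc("repair", tombstone_age=11 * 86400,
             gc_grace_seconds=10 * 86400, repaired_since_write=False))  # False
```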

A new gossip feature bit TOMBSTONE_GC_OPTIONS is introduced. A cluster will not use this new feature until all the nodes have been upgraded. To keep maximum compatibility and avoid surprising users, the default mode is still timeout. Users have to enable experimental features and then set the mode to repair explicitly to turn on this feature.


Repair-based tombstone GC is a new ScyllaDB experimental feature, arriving in ScyllaDB Open Source 5.0, that provides a more robust way of ensuring that your dead data stays dead. It guarantees data consistency even if repair is not performed within gc_grace_seconds, and it makes ScyllaDB both easier to operate and safer in terms of data consistency.


Why Disney+ Hotstar Replaced Redis and Elasticsearch with ScyllaDB Cloud

Disney+ Hotstar, India’s most popular streaming service, accounts for 40% of the global Disney+ subscriber base. Disney+ Hotstar offers over 100,000 hours of content on demand, as well as livestreams of the world’s most-watched sporting events, such as the Indian Premier League, which has drawn over 25 million concurrent viewers. IPL viewership has grown an order of magnitude over the past six years. And with Ms. Marvel, the first South Asian heritage superhero, making her debut this month, Disney+ Hotstar’s rapid growth will certainly continue as it reaches new audiences and demographics.

The “Continue Watching” feature is critical to the on-demand streaming experience for the 300 million-plus monthly active users. That’s what lets you pause a video on one device and instantly pick up where you left off on any device, anywhere in the world. It’s also what entices you to binge-watch your favorite series: complete one episode of a show and the next one just starts playing automatically.

However, it’s not easy to make things so simple. In fact, the underlying data infrastructure powering this feature had grown overly complicated. It was originally built on a combination of Redis and Elasticsearch, connected to an event processor for Kafka streaming data. Having multiple data stores meant maintaining multiple data models, making each change a huge burden. Moreover, data doubling every six months required constantly increasing the cluster size, resulting in yet more admin and soaring costs.

This blog shares an inside look into how the Disney+ Hotstar team led by Vamsi Subash Achanta (architect) and Balakrishnan Kaliyamoorthy (senior data engineer) simplified this data architecture for agility at scale.

TL;DR First, the team adopted a new data model, then they moved to a high-performance low-latency database-as-a-service (ScyllaDB Cloud). This enabled them to free up resources for the many other priorities and projects on the team’s plate. It also lowered latencies for both reads and writes to ensure the snappy user experience that today’s streaming users expect – even with a rapidly expanding content library and skyrocketing subscriber base.

Inside Disney+ Hotstar’s ‘Continue Watching’ Functionality

At Disney+ Hotstar, “Continue Watching” promotes an engaging, seamless viewing experience in a number of ways:

  • If the user plays a video and then later pauses or stops it, the video is added to their “Continue Watching” tray.
  • Whenever the user is ready to resume watching the video on any device, they can easily find it on the home page and pick up exactly where they left off.
  • When the user completes one episode in a series, the next episode is added to their “Continue Watching” tray.
  • If new episodes are added to a series that the user previously completed, the next new episode is added to their “Continue Watching” tray.

Figure 1: One example of Disney+ Hotstar’s “Continue Watching” functionality in action

Disney+ Hotstar users watch an average of 1 billion minutes of video every day. The company also processes 100 to 200 gigabytes of data daily to ensure that the “Continue Watching” functionality is accurate for hundreds of millions of monthly users. Due to the volatile nature of user watching behavior, Disney+ Hotstar needed a database that could handle write-heavy workloads. They also needed a database that could scale appropriately during high-traffic times, when request volume increases by 10 to 20 times within a minute.

Figure 2 shows how the “Continue Watching” functionality was originally architected.

Figure 2: A look at the legacy architecture that Disney+ Hotstar decided to replace

First, the user’s client would send a “watch video” event to Kafka. From Kafka, the event would be processed and saved to both Redis and Elasticsearch. If a user opened the home page, the backend was called, and data was retrieved from Redis and Elasticsearch. Their Redis cluster held 500 GB of data, and the Elasticsearch cluster held 20 terabytes. Their key-value data ranged from 5 to 10 kilobytes per event. Once the data was saved, an API server read from the two databases and sent values back to the client whenever the user next logged in or resumed watching.

Redis provided acceptable latencies, but the increase in data size meant that they needed to horizontally scale their cluster. This increased their cost every three to four months. Elasticsearch latencies were on the high side, around 200 milliseconds. Moreover, the average cost of Elasticsearch was quite high considering the returns. They often experienced issues with node maintenance, and manual effort was required to resolve them.

Here’s the data model behind that legacy data architecture:

Not surprisingly, having two data stores led to some significant scaling challenges. They had multiple data stores with different data models for the same use case: one key-value and one document. With an influx of users joining Disney+ Hotstar daily, it was becoming increasingly difficult to manage all this data. Moreover, it became quite costly to maintain two data stores with different code bases and different query patterns at high scales. Every six months, they were almost doubling their data. This required an increase in clusters, which resulted in burdensome administration and spiraling costs.

Redesigning the Data Model

The first step in addressing these challenges was designing a new data model: a NoSQL key-value data store. To simplify, they aimed for a data model with only two tables.

The User table is used to retrieve the entire “Continue Watching” tray for the given user, all at once. If a new video needs to be added to the user’s “Continue Watching” tray, it is appended to the list for the same User-Id key.

The User-Content table is used for modifying specific Content-Id data. For example, when the user resumes the video and then pauses it, the updated Timestamp is stored. When the video is fully watched, the entry can be directly queried and deleted. In this table, User-Id is the primary key and Content-Id is the secondary (clustering) key.
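A hypothetical Python model of the two tables (our illustration, not Disney+ Hotstar's actual code; names like `resume_or_pause` are invented) shows the two access patterns side by side:

```python
from collections import OrderedDict

# User table: User-Id -> ordered set of Content-Ids (the whole tray at once).
# User-Content table: (User-Id, Content-Id) -> per-video state, e.g. Timestamp.
user_tray = {}
user_content = {}

def resume_or_pause(user_id, content_id, timestamp):
    """Add the video to the user's tray and record where they stopped."""
    tray = user_tray.setdefault(user_id, OrderedDict())
    tray[content_id] = True  # appended if new, position kept if already present
    user_content[(user_id, content_id)] = {"timestamp": timestamp}

def finish_video(user_id, content_id):
    """Fully watched: the entry is deleted directly by its full key."""
    user_tray.get(user_id, OrderedDict()).pop(content_id, None)
    user_content.pop((user_id, content_id), None)

resume_or_pause("u1", "ep1", 1320)   # paused episode 1 at 22:00
resume_or_pause("u1", "ep2", 45)     # started episode 2
finish_video("u1", "ep1")            # finished episode 1
print(list(user_tray["u1"]))  # ['ep2']
```

In the real schema, the User-Content lookups map naturally onto a partition key (User-Id) plus clustering key (Content-Id).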

Selecting a New Database

The team considered a number of alternatives, from Apache Cassandra and Apache HBase to Amazon DynamoDB to ScyllaDB. Why did they ultimately choose ScyllaDB? A few important reasons:

  • Performance: ScyllaDB’s deep architectural advancements deliver consistently low latencies for both reads and writes, ensuring a snappy user experience even when live events exceed 25 million concurrent viewers.
  • Operational simplicity: ScyllaDB was built from the ground up to deliver self-optimizing capabilities that deliver a range of benefits, including the ability to run operational and analytics workloads against unified infrastructure, higher levels of utilization that prevent wasteful overprovisioning and significantly lower administrative overhead.
  • Cost efficiency: ScyllaDB Cloud, a fully-managed database-as-a-service (NoSQL DBaaS), offers a much lower cost than the other options they considered.

Figure 3: Performance monitoring results of ScyllaDB showing sub-millisecond p99 latencies and average read and write latencies in the range of 150 – 200 microseconds

Migrating with Zero Downtime

From Redis and Elasticsearch to ScyllaDB Cloud

Disney+ Hotstar’s migration process began with Redis. The Redis to ScyllaDB migration was fairly straightforward because the data model was so similar. They captured a Redis snapshot in an RDB format file, which was then converted into comma-separated value (CSV) for uploading into ScyllaDB Cloud using cqlsh. A lesson learned from their experience: Watch for maximum useful concurrency of writes to avoid write timeouts.
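The conversion step can be sketched as follows. This is a simplified stand-in: we use a plain dict to represent key/value pairs already parsed out of the RDB snapshot (actual RDB parsing requires a dedicated tool), and the CSV layout is hypothetical.

```python
import csv
import io

# Stand-in for key/value pairs parsed out of the Redis RDB snapshot.
snapshot = {
    "user:1:content:42": '{"timestamp": 1320}',
    "user:2:content:7": '{"timestamp": 45}',
}

# Write a CSV that a cqlsh COPY FROM could load into a two-column table.
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["key", "value"])
for key, value in snapshot.items():
    writer.writerow([key, value])

lines = buf.getvalue().splitlines()
print(lines[0])   # key,value
print(len(lines)) # 3 (header + one row per key)
```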

Running with seven threads, they migrated 1 million records in 15 minutes. To speed up the process, they scaled up the number of threads and added more machines.

Figure 4: The Redis to ScyllaDB Cloud migration

A similar process was applied to the Elasticsearch migration. JSON documents were converted to CSV files, then CSV files were copied to ScyllaDB Cloud.

Once ScyllaDB Cloud had been loaded with the historical data from both Redis and Elasticsearch, it was kept in sync by:

  • Modifying their processor application to ensure that all new writes were also made to ScyllaDB.
  • Upgrading the API server so that all reads could be made from ScyllaDB as well.
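The dual-write phase above can be sketched like this (hypothetical interfaces; the real processor consumes events from Kafka and writes through database clients, not dicts):

```python
class DualWriteProcessor:
    """During migration, every event is written both to the legacy stores
    and to ScyllaDB, so the new cluster stays in sync with live traffic."""

    def __init__(self, legacy_stores, scylla):
        self.legacy_stores = legacy_stores  # e.g. Redis- and Elasticsearch-like
        self.scylla = scylla

    def handle_event(self, key, value):
        for store in self.legacy_stores:    # existing write path
            store[key] = value
        self.scylla[key] = value            # new write path, kept in sync

redis_like, elastic_like, scylla = {}, {}, {}
proc = DualWriteProcessor([redis_like, elastic_like], scylla)
proc.handle_event("user:1", {"content": "ep2", "timestamp": 45})
print(scylla == redis_like == elastic_like)  # True
```

Once reads are also served from ScyllaDB, the legacy stores can be dropped from the write path with no downtime.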

Figure 5: Moving from Redis and Elasticsearch to ScyllaDB Cloud

At that point, writes and reads could be completely cut out from the legacy Redis and Elasticsearch systems, leaving ScyllaDB to handle all ongoing traffic. This migration strategy completely avoided any downtime.

Figure 6: ScyllaDB Cloud now handles all ongoing traffic

ScyllaDB Open Source to ScyllaDB Cloud

The Disney+ Hotstar team had also done some work with ScyllaDB Open Source and needed to move that data into their managed ScyllaDB Cloud environment as well. There were two different processes they could use: SSTableloader or the ScyllaDB Spark Migrator.

Figure 7: SSTableloader based migration from ScyllaDB open source to ScyllaDB Cloud

SSTableloader uses a nodetool snapshot of each server in a cluster, and then uploads the snapshots to the new database in ScyllaDB Cloud. This can be run in batches or all at once. The team noted that this migration process slowed down considerably when they had a secondary (composite) key. To avoid this slowdown, the team implemented the ScyllaDB Spark Migrator instead.

Figure 8: Migrating from ScyllaDB Open Source to ScyllaDB Cloud with the ScyllaDB Spark migrator

In this process, the data was first backed up to S3 storage, then put onto a single node ScyllaDB Open Source instance (a process known as unirestore). From there, it was pumped into ScyllaDB Cloud using the ScyllaDB Spark Migrator.

Serving the Fastest-Growing Segment of Disney+

The team is now achieving sub-millisecond p99 latencies, with average read and write latencies in the range of 150 to 200 microseconds. Moreover, with a database-as-a-service relieving them of administrative burdens like database backups, upgrades and repairs, they can focus on delivering exceptional experiences across the fastest-growing segment of Disney+ global subscribers. For example, they recently rearchitected the platform’s recommendation features to use ScyllaDB Cloud. Additional projects on the short-term horizon include migrating their watchlist functionality to ScyllaDB Cloud.




Numberly: Learning Rust the Hard Way for Kafka + ScyllaDB in Production

Alexys Jacob is CTO of Numberly, a French digital data marketing powerhouse whose experts and systems help brands connect with their customers using all digital channels available. The developer community is quite familiar with Alexys, where he is known as “Ultrabug,” working on a variety of open source projects, such as ScyllaDB, Python and Gentoo Linux.

Alexys has a penchant for taking on deep technical challenges to squeeze more performance out of ScyllaDB, such as when he delved into what it takes to make a shard-aware Python driver (which you can read more about here and here), gaining between 15% and 25% better throughput, or when he got a Spark job processing with ScyllaDB down from 12 minutes to just over 1 minute.


At a recent webinar Alexys described a new performance challenge he set for himself and for Numberly: to move a key element of their code from Python to Rust, in order to accelerate a data pipeline powered by ScyllaDB and Apache Kafka for event streaming.

He began by describing the reasoning behind such a decision. It comes from Numberly’s use of event streaming and specialized data pipelining applications, which they call data processor applications. (You can read more about how Numberly uses Kafka with ScyllaDB.)

Each such data processor prepares and enriches the incoming data so that it is useful to the downstream business partner or client applications in a timely manner. Alexys emphasized, “availability and latency are business critical to us. Latency and resilience are the pillars upon which we have to build our platforms to make our business reliable in the face of our clients and partners.” Simply put, for Numberly to succeed Kafka and ScyllaDB can’t fail.

Over the past five years Numberly relied heavily upon three of the most demanding data processors which had been written in Python. Alexys noted the risks and natural reluctance to change them out: “They were battle tested and trustworthy. We knew them by heart.”

During that same time, Alexys kept tabs on the maturation of Rust as a development language. His natural curiosity and desire to improve his skills and Numberly’s capabilities drove him to consider switching out these data processors to Rust. It wasn’t a decision to be taken lightly.

Regarding Rust, Alexys commented, “it felt less intimidating to me than C or C++ — sorry Avi.” So when this opportunity came he went to his colleagues and suggested rewriting them in Rust. The internal response was, at first, less than enthusiastic. There was no Rust expertise at Numberly. This would be Alexys’ first project with the language. There were major risks with this particular bit of code. “Okay, I must admit that I lost my CTO badge for a few seconds when I saw their faces.”

Alexys needed to justify the decision with clear rationale, and delineated the promises Rust makes. “It’s supposed to be secure, easy to deploy, makes few or no [performance] compromises and it also plays well with Python. But furthermore, their marketing motto speaks to the marketer inside me: ‘a language empowering everyone to build reliable and efficient software.’”

This resonated strongly with Alexys. “That’s me. Reliable and efficient software.” Alexys noted that ‘efficient’ software is not precisely synonymous with ‘fastest’ software. “Brett Cannon, a Python core developer, advocates that selecting a programming language for being faster on paper is a form of premature optimization.”

Alexys enumerated just a few possible meanings of “fast:”

  • Fast to develop?
  • Fast to maintain?
  • Fast to prototype?
  • Fast to process data?
  • Fast to cover all failure cases?

“I agree with him in the sense that the word ‘fast’ has different meanings depending on your objectives. To me Rust can be said to be faster as a consequence of being efficient, which does not cover all the items on the list here.”

Applying them to the Numberly context: no, Rust would not be faster to develop than Python, since there was a learning curve involved, whereas Alexys had over 15 years of Python experience.

Nor would it be faster to maintain, since they had not yet made Rust an operational language in their production environment.

Would it be faster to prototype? Again, no, since unlike the immediacy of interpreted Python there would need to be compile times involved.

Would it be faster to process data? On paper, yes. That was the key reason to adopt Rust, and it was an educated guess that it would perform faster than Python. They still needed to prove it and measure the gains. Because right now, Python had proven to be “fast enough.”

Alexys asked the not-so-rhetorical questions he was facing. “So why would I want to lose time? The short answer is innovation. Innovation cannot exist if you don’t accept to lose time. The question is to know when and on what project.” Alexys had an inner surety this project was the right one at the right time. While Rust would make him slow at first, its unique design would provide more reliable software. Stronger software. Type safe. More predictable, readable and maintainable.

Alexys quipped, “It’s still more helpful to have a compiler error that is explained very well than a random Python exception.”

Plus, not to be overlooked, Rust would provide better dependency management. “It looks sane compared to what I’m used to in Python. Exhaustive pattern matching brings confidence that you’re not forgetting something while you code. And error management primitives checked at compile time: failure handling right in the language syntax.”

Rust’s bottom line for Alexys was clear: “I chose Rust because it provided me with the right level of abstraction and the right paradigms. This is what I needed to finally understand and better explain the reliability and performance of an application.”

Production is not a “Hello World”

Learning Rust the hard way meant more than just tackling semicolons and brackets. This wasn’t going to be a simple “hello world” science project. It meant dealing with high stakes and going straight into production. The data processing app written in Rust needed to be integrated into their Kubernetes orchestration mechanism and observable via Prometheus, Grafana, and Sentry. It needed error handling, latency optimization, integration with their Avro schema registry, a reliable bridge between Confluent Kafka and their multi-datacenter ScyllaDB deployment, and more.

Watch the Session in Full

This is just the beginning of the challenge. To see how Alexys and the team at Numberly solved it and successfully moved from Python to Rust, you can watch the full webinar on-demand below. You can also view the slides here.

Lastly, you can read the blog Alexys wrote regarding the Rust implementation on Numberly’s own blog here.

Get Started with ScyllaDB

If you’d like to learn more about using ScyllaDB in your own event streaming platform, feel free to contact us directly, or join our vibrant Slack community.

If you want to jump right on in you can take free courses in using Kafka and ScyllaDB on ScyllaDB University, and get started by downloading ScyllaDB Open Source, or creating an account on ScyllaDB Cloud.


Upgrades to Our Internal Monitoring Pipeline: Upgrading our Instametrics Cassandra cluster from 3.11.6 to 4.0

In this series of blogs, we have been exploring the various ways we pushed our metrics pipeline—mainly our Apache Cassandra® cluster named Instametrics—to the limit, and how we went about reducing the load it was experiencing on a daily basis. In this blog we go through how our team tackled a zero-downtime upgrade of our internal Cassandra 3.11.6 cluster to Cassandra 4.0, so that the team could gain real production experience and expertise in the upgrade process. We were even able to use our Kafka monitoring system to perform some blue-green testing, giving us greater confidence in the stability of the Cassandra 4.0 release.

One of the major projects for our Cassandra team over the last 12 months has been preparing for the release of Cassandra 4.0. This release brought improved built-in auditing out of the box, significant improvements to streaming speed, enhanced Java 11 compatibility, and more. However, before the version was deployed onto our customers’ production mission-critical clusters, we needed to be confident that it would have the same level of performance and reliability as the 3.X releases.

The first thing that we did was make sure that we had beta releases on our managed platform to trial, before the project had officially released 4.0. This meant that customers could provision a 4.0 cluster, in order to perform initial integration testing on their applications.

It is worth noting that as part of introducing the Cassandra 4.0 beta to the Instaclustr managed platform we did have to do some work on upgrading our tooling. This included ensuring that metrics collection continued to operate correctly, accounting for any metrics which had been renamed, as well as updates to our subrange repair mechanism. We also submitted some patches to the project in order to make the Cassandra Lucene plugin 4.0 compatible. 

We knew that as we got closer to the official project release we would want to have real-world experience upgrading, running, and supporting a Cassandra 4.0 cluster under load. For this, we turned to our own internal Instametrics Cassandra cluster, which stores all of our metrics for all our nodes under management. 

This was a great opportunity to put our money where our mouth was when it came to Cassandra 4.0 being ready to be used by customers. Our Cassandra cluster is a critical part of our infrastructure however, so we needed to be sure that our method for upgrading was going to cause no application downtime.

How We Configured Our Test Environment

Like most other organizations, Instaclustr maintains a separate but identical environment for our developers to test their changes in before they are released to our production environment. Our production environment supports around 7000 instances, whereas our test environment is usually somewhere around 70 developer test instances. So, whilst it is functionally identical, the load is not.

Part of this is a duplication of the Instametrics Cassandra Cluster, albeit at a much smaller scale of 3 nodes.

Our plan for testing in this environment was reasonably straightforward:

  1. Create an identical copy of our cluster by restoring from backup to a new cluster using the Instaclustr Managed Platform
  2. Upgrade the restored cluster from 3.X to 4.0, additionally upgrading the Spark add-on
  3. Test our aggregation Spark jobs on the 4.0 cluster, as well as reading and writing from other applications which integrate with the Cassandra cluster
  4. Switch over the test environment from using the 3.x cluster to the 4.0 cluster

Let’s break down these steps slightly, and outline why we chose to do each one.

The first step is all about having somewhere to test, and break, without affecting the broader test environment. This allowed our Cassandra team to take their time diagnosing any issues without affecting the broader development team, who may require a working metrics pipeline in order to progress their other tickets. The managed platform automated restore makes this a trivial task, and means we can test on the exact schema and data inside our original cluster.

When it came to upgrading the cluster from 3.X to 4.0, we discussed with our experienced Technical Operations team the best methodology to upgrade Cassandra Clusters. Our team is experienced in both major and minor version bumps with Cassandra, and outlined the methodology used on our managed platform. This meant that we could test that our process would still be applicable to the upgrade to 4.0. We were aware that the schema setting “read_repair_chance” had been removed as part of the 4.0 release, and so we updated our schema accordingly.

Finally, it was time to check that our applications, and their various Cassandra drivers, would continue to operate when connecting to Cassandra 4.0. Cassandra 4.0 has upgraded to using native protocol version 5, which some older versions of Cassandra drivers could not communicate with. 

There was a small amount of work required for us to upgrade our metric aggregation Spark jobs in order to work with the newer version of Spark, which was required for us upgrading our Spark Cassandra driver. Otherwise, all of our other applications continued to work without any additional changes. These included applications using the Java, Python, and Clojure drivers.

Once we had completed our integration testing in our test environment, we switched over all traffic in the test environment from our 3.X cluster to the new 4.0 cluster. In this situation we did not copy over any changes which were applied to the 3.X cluster in between restoring the backup, and switching over our applications. This was strictly due to this being a test environment, and these not being of high importance. 

We continued to leave this as the cluster being used in our test environment, in order to see if any issues would slowly be uncovered after an extended time. We began working on our plan for upgrading our production cluster to the 4.0 release.

Although we had initially intended to release the Cassandra 4.0.0 version to our Cassandra cluster as soon as it was available, unfortunately due to a nasty bug in the 4.0.0 release, we decided to delay this until a patch could be raised against it. 

The silver lining here was that due to additional work on our metrics pipeline being deployed, we had additional options for testing a production load on a Cassandra 4.0 cluster. As we covered in an earlier blog, we had deployed a Kafka architecture in our monitoring pipeline, including a microservice that writes all metrics to our Cassandra cluster.

What this architecture allows us to do is effectively have many application groups consume the same stream of metrics from Kafka, with a minimal performance impact. We have already seen how we had one consumer writing these metrics to Cassandra, and another which writes it to a Redis cache.

So, what’s the big benefit here? Well, we can duplicate the writes to a test Cassandra cluster while we perform the upgrade! This effectively gives us the ability to perform a test upgrade on our actual live production data, with no risk of downtime or customer-facing issues! All we have to do is create an additional Cassandra cluster and an additional group of writer applications.
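The fan-out property that makes this possible can be sketched with a toy model (our illustration; real Kafka consumer groups are more involved, with partitions and committed offsets, but the principle is the same: each group independently reads the full stream):

```python
from collections import defaultdict

class ToyTopic:
    """Minimal stand-in for a Kafka topic: every consumer group reads the
    full log independently, tracked by its own offset."""

    def __init__(self):
        self.log = []
        self.offsets = defaultdict(int)  # consumer group -> next offset

    def produce(self, metric):
        self.log.append(metric)

    def poll(self, group):
        records = self.log[self.offsets[group]:]
        self.offsets[group] = len(self.log)
        return records

topic = ToyTopic()
for m in ({"cpu": 40}, {"cpu": 75}):
    topic.produce(m)

# Both the live writers and the 4.0-test writers see every metric.
live_batch = topic.poll("cassandra-live-writers")
test_batch = topic.poll("cassandra-4-0-test-writers")
print(live_batch == test_batch)  # True
```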

So, we deployed an additional 3.11.6 Cassandra cluster configured identically to our existing Instametrics Cassandra cluster, and applied the same schema. We then configured a new group of writer applications to perform the exact same operations, on the same data, as our “live” production Cassandra cluster. We also set up a Cassandra Stress instance to place a small read load on the cluster, and left everything running for a number of days to accumulate an appropriate amount of data for the upgrade.

The Test Upgrade

Now came the fun part: upgrading our test Cassandra cluster from 3.X to 4.0 while under load. We applied our upgrade procedure, paying careful attention to any application-side errors from either of our writers or from Cassandra Stress. By design, Cassandra should be able to perform any upgrade, major or minor, without application-side errors if your data model is designed correctly.

We did not experience any issues or errors during our test upgrade, and the process completed successfully. We did see slightly elevated CPU usage and OS load during the upgrade, but that is to be expected given the sstable upgrades, repairs, and reduced number of available nodes while the upgrade is in progress.

To gain further confidence in the system, we also left this configuration running for a number of days, to ascertain whether there were any performance or other issues with longer-running operations such as compactions and repairs. Again, we did not see any noticeable impact or performance drop across the two clusters running side by side.

The Real Upgrade

Filled with confidence in our approach, we applied the exact same process to our live production cluster, with much the same result. There was no application downtime and no issues with any operation during the upgrade. Keeping a watchful eye on the cluster longer term, we did not see any application-side latency increases or other issues beyond the elevated CPU usage and OS load we had seen on the test cluster.

Once completed, we removed our additional writer infrastructure that had been created.

Wrapping Up

It has now been a number of months since we upgraded our Cassandra cluster to 4.0.1, and we have not experienced any performance or stability issues. Cassandra 4.0.1 continues to be our recommended version that customers should be using for their Cassandra workloads.

The post Upgrades to Our Internal Monitoring Pipeline: Upgrading our Instametrics Cassandra cluster from 3.11.6 to 4.0 appeared first on Instaclustr.

Announcing ScyllaDB University LIVE, Summer 2022

I’m happy to announce our upcoming ScyllaDB University LIVE Summer Session, taking place on Thursday, July 28th, 8AM-12PM PDT!


ScyllaDB University LIVE is a free, online, half-day, instructor-led training event with exclusive Scylla and NoSQL database content. I invite you to register now and save your spot.

We will have two parallel tracks: a beginners track focused on the basics of Scylla, and a track covering advanced topics such as best practices for CDC, Kafka, Kubernetes, Prepared Statements, and more.

Sessions will run in parallel, so you can bounce back and forth between tracks or drop in for the sessions that interest you. The sessions are conducted by some of our leading experts and engineers. They are live and interactive, and we welcome you to ask questions throughout. Following the training sessions, we will host an expert panel with special guests ready to answer your most pressing questions.

When choosing topics for the sessions, we mixed popular basic material for people new to Scylla with advanced material for experienced users who want to deepen their knowledge and see what’s new at the cutting edge.

After the training sessions, you will have the opportunity to take accompanying ScyllaDB University courses to get some hands-on experience, complete quizzes, receive certificates of completion, and earn some exclusive swag.


I’ll start with a quick welcome talk, briefly describing the different sessions and welcoming everyone. Afterward, we’ll host two parallel tracks of sessions.

Essentials Track:

  • ScyllaDB Essentials: Intro to Scylla, Basic concepts, Scylla Architecture, Hands-on Demo
  • ScyllaDB Basics: Basic Data Modeling, Definitions, Data Types, Primary Key Selection, Clustering key, Compaction, ScyllaDB Drivers Overview
  • Build Your First ScyllaDB-Powered App: A hands-on example of how to create a full-stack app powered by Scylla Cloud and implement CRUD operations using NodeJS and Express

Advanced Track:

  • Kafka and Change Data Capture (CDC): What is CDC? Consuming data, Under the hood, Hands-on Example
  • Running ScyllaDB on Kubernetes: ScyllaDB Operator, ScyllaDB Deployment, Alternator Deployment, Maintenance, Hands-on Demo, Recent Updates
  • Advanced Topics in ScyllaDB: Collections, UDT, MV, Secondary Index, Prepared Statements, Paging, Retries, Sizing, TTL, Troubleshooting


After the last two sessions, we’ll host a roundtable with our session presenters and other leading experts who will be available to answer your questions.

Before the event, you can check out the free courses on ScyllaDB University; they will help you better understand ScyllaDB and how the technology works.



Multidimensional Bloom Filter Secondary Index: The What, Why, and How


Bloom filters are space-efficient probabilistic data structures that can yield false positives but not false negatives. They were initially described by Burton Bloom in his 1970 paper  “Space/Time Trade-offs in Hash Coding with Allowable Errors“. They are used in many modern systems including the internals of the Apache® projects Cassandra®, Spark™, Hadoop®, Accumulo®, ORC™, and  Kudu™.

Multidimensional Bloom filters are data structures for searching collections of Bloom filters for matches. The simplest implementation is a plain list that is iterated over when searching; for small collections (n < 1000) this is the most efficient solution. When working with collections at scale, however, other solutions can be more efficient.
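To make the matching rule concrete, here is a minimal Python sketch of the list-based approach (illustrative only; the function names, filter size, and hashing are assumptions, not the Blooming Cassandra code):

```python
# Illustrative sketch of a list-based multidimensional Bloom filter index.
# Names, sizes, and hashing here are assumptions, not Blooming Cassandra code.

def bloom_filter(items, m=64, k=3):
    """Build an m-bit Bloom filter (stored as an int) using k hash functions."""
    bits = 0
    for item in items:
        for i in range(k):
            bits |= 1 << (hash((i, item)) % m)
    return bits

def matches(stored, target):
    """A stored filter matches when every bit set in the target is set in it."""
    return stored & target == target

def search(collection, target):
    """Linear scan over (key, filter) pairs: efficient for n < 1000."""
    return [key for key, stored in collection if matches(stored, target)]

rows = [("row1", bloom_filter(["alice", "nyc"])),
        ("row2", bloom_filter(["bob", "sfo"]))]
query = bloom_filter(["alice"])   # filter built from the search terms
candidates = search(rows, query)  # includes "row1"; may include false positives
```

A caller would then fetch the candidate rows and discard false positives, exactly as the scheme described above requires.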

We implemented a multidimensional Bloom filter as a Cassandra secondary index in a project called Blooming Cassandra. The code is released under the Apache License 2.0. The basic assumption is that the client constructs the Bloom filters and writes them to a table. The table has a BloomingIndex associated with it, so that when the Bloom filter column is written or updated the index is updated as well. To search, the client constructs a Bloom filter and executes a query with a WHERE clause like WHERE column = bloomFilter. All potential matches are returned, and the client is responsible for filtering out the false positives.

The CQL code would look something like this:

CREATE TABLE blah (a text, b int, c text, ..., bloomFilter blob);
CREATE CUSTOM INDEX ON blah (bloomFilter) USING 'BloomingIndex';
INSERT INTO blah (a, b, c, ..., bloomFilter) VALUES ('wow', 1, 'fun', ..., 0x1234e5ac);
SELECT a, b, c FROM blah WHERE bloomFilter = 0x1212a4ac;


Our initial use case was the desire to encrypt data on the client, write it to Cassandra, and search for it without decrypting it. Our solution is to create a Bloom filter comprising the original plain text of each searchable column, encrypt the columns, add a column for the unencrypted Bloom filter, and write the resulting row to Cassandra. To retrieve data, the desired unencrypted column values are used to create a Bloom filter; the database is then searched for matching filters and the encrypted data returned. The encrypted data is decrypted and the false positives are removed from the result. This strategy was presented in my 2020 FOSDEM talk “Indexing Encrypted Data Using Bloom Filters”.

The Bloom filter could also be used to produce a weak reference to another Cassandra table to simplify joins. Assume two tables, A and B, with a one-to-many correspondence between them, such that a row in A is associated with multiple rows in B. A Bloom filter can be created from the key value of A and inserted into B. Now we can query B for all rows that match a key from A. We still have to filter out false positives, but the search will be reasonably fast.

With the multidimensional Bloom filter index, it becomes feasible to query multiple columns in large-scale data sets. If we have a table with a large number of columns (for example, DrugBank, a dataset that contains 107 data fields), a column could be added to each row comprising the values of each data field. Queries could then be constructed that look for specific value combinations across those columns.


Several multidimensional Bloom filter strategies were tried before the Flat-Bloofi solution was selected. Flat-Bloofi, as described by Adina Crainiceanu and Daniel Lemire in their 2015 paper “Bloofi: Multidimensional Bloom filters”, uses a matrix approach where each bit position in the Bloom filter is a column and each Bloom filter is a row. To find the matches for a target filter, each matrix column that corresponds to an enabled bit in the target is scanned, and the row indexes of all rows that have that bit enabled are collected into a set. The intersection of those sets is the set of matching Bloom filters.
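As a rough sketch of that matching algorithm (illustrative Python only; each row's filter is held here as a set of enabled bit positions rather than the packed bit matrix a real Flat-Bloofi uses):

```python
def flat_bloofi_search(matrix, target_bits):
    """matrix[r] holds the enabled bit positions of Bloom filter r.
    For each bit enabled in the target, collect the rows enabling that bit;
    the matches are the intersection of those row sets."""
    result = None
    for bit in target_bits:
        rows_with_bit = {r for r, bits in enumerate(matrix) if bit in bits}
        result = rows_with_bit if result is None else result & rows_with_bit
    return result if result is not None else set(range(len(matrix)))

matrix = [{0, 3, 5}, {0, 3}, {1, 2, 5}]
print(flat_bloofi_search(matrix, {0, 3}))  # rows 0 and 1 enable both bits
```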

We also tested a “Chunked Flat-Bloofi”, which breaks the filter down by bytes rather than bits. This solution worked well for small data sets (n < 2 million rows); however, the query process suffers from query expansion during the search, as described below. On storage, the Bloom filter is decomposed into a byte array, and each non-zero byte, its array position, and the identifier of the Bloom filter are written to an index table. When searching, the target Bloom filter is decomposed into a byte array and each non-zero byte is expanded to the set of all “matching” bytes, where “matching” follows the Bloom filter matching algorithm. The index table is then queried for all identifiers associated with the matching bytes at each index position; the result is the set of all Bloom filters that have the target’s bits enabled at that byte position. Each such set is a partial solution; the complete solution, the set of all matching identifiers, is the intersection of the sets.

The query expansion problem occurs whenever a multidimensional Bloom filter strategy uses chunks rather than bits as the fundamental query block. Every target byte other than 0xFF has more than one matching value; for example, the byte 0xF1 matches 0xF1, 0xF3, 0xF5, 0xF7, 0xF9, 0xFB, 0xFD, and 0xFF. This has to be accounted for in the query processing or in the original indexing.
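The set of matching bytes for a chunk can be enumerated directly: a byte b matches the target chunk t exactly when b & t == t. A quick Python check (illustrative only):

```python
def matching_bytes(target):
    """All byte values whose enabled bits are a superset of the target's bits."""
    return [b for b in range(256) if b & target == target]

# 0xF1 has five enabled bits, leaving three free bits: 2**3 = 8 matching bytes.
assert [hex(b) for b in matching_bytes(0xF1)] == [
    '0xf1', '0xf3', '0xf5', '0xf7', '0xf9', '0xfb', '0xfd', '0xff']
```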

The Flat-Bloofi implementation was built on memory-mapped files to allow fast access to the data. It comprises four files: the Flat-Bloofi data structure, a list of active/inactive elements in that structure, a mapping of the Flat-Bloofi numerical ID to the data table key, and a list of active/inactive mapping entries. The standard SSTable index maps the data table key to the Flat-Bloofi ID. All code is available in the Blooming Cassandra code base.

The post Multidimensional Bloom Filter Secondary Index: The What, Why, and How appeared first on Instaclustr.

P99 CONF 2022 Registration Now Open, Speakers Announced

P99 CONF is the event for engineers who care about P99 percentiles and high-performance, low-latency applications. This free, virtual conference was created by engineers for engineers. P99 CONF brings together speakers from across the tech landscape spanning perspectives from architecture and design of distributed systems, to the latest techniques in operating systems and programming languages, to databases and streaming architectures, to real-time operations and observability.

Register Now!

P99 CONF is a free virtual event scheduled for Wednesday and Thursday, October 19th and 20th, 2022, from 8:00 AM to 1:00 PM Pacific Daylight Time (15:00 to 20:00 UTC).


Featured Speakers

Bryan Cantrill
Oxide Computer Company

Liz Rice
Chief Open Source Officer, Isovalent

Gil Tene
CEO and Co-Founder
Azul Systems

Steven Rostedt
Software Engineer, Google

Bryan Cantrill is one of Silicon Valley’s strongest opinionators and best raconteurs. His career spans from his stints at Sun Microsystems and later Oracle to being deeply embedded in the embedded systems that make Oxide Computer one of the strongest hardware manufacturers you’ll find in this next tech cycle.

Liz Rice loves making, understanding and explaining code. She is currently the Chief Open Source Officer at eBPF pioneer Isovalent, makers of the Cilium project. Throughout her career she has been a strong proponent of open source, either writing her own code, or co-chairing CNCF conferences like KubeCon and CloudNativeCon.

Gil Tene, CEO and Co-Founder of Azul Systems, was one of the people who got us to host P99 CONF in the first place. His seminal talk on How NOT to Measure Latency remains a lasting reminder to the industry to dig a little deeper into any performance numbers they get before they publish them.

Steven Rostedt is a veteran Linux kernel developer, focusing on the Real Time Stable releases and PREEMPT_RT patches, with career experience spanning such industry leaders as Red Hat, VMWare and now Google.

Additional Speakers

  • Charity Majors, CTO, Honeycomb
  • Avi Kivity, CTO & Co-Founder, ScyllaDB
  • Alexey Ivanov, Site Reliability Engineer, Dropbox
  • Omar Elgabry, Software Engineer, Block
  • Alex Hidalgo, Principal Reliability Advocate at Nobl9
  • Vlad Ilyushchenko, Co-Founder and CTO of QuestDB
  • Piotr Sarna, Principal Software Engineer, ScyllaDB
  • Peter Zaitsev, CEO and Co-Founder of Percona
  • Pavlo Stavytski, Senior Software Engineer, Lyft
  • Oren Eini, CEO of RavenDB
  • Max De Marzi, Jr., Developer, RageDB
  • Song Yuying, Database Performance Engineer, PingCAP
  • Matthew Lenhard, CTO of ContainIQ
  • Mark Gritter, Founding Engineer, Akita Software
  • Marek Galovic, Staff Software Engineer, Pinecone
  • Marc Richards, Performance Engineer, Talawah Solutions
  • Konstantin Osipov, Director of Software Engineering at ScyllaDB
  • Malte Ubl, Chief Architect, Vercel
  • Leandro Melendez, DevRel, Grafana k6
  • Sabina Smajlaj, Operations Developer, Hudson River Trading
  • Jeffery Utter, Staff Software Developer, theScore
  • Henrik Rexed, Senior Staff Engineer, Dynatrace
  • Garrett Hamelin, Developer Advocate, LogicMonitor
  • Felipe Oliveira, Senior Performance Engineer, Redis
  • Felipe Huici, CEO and Co-Founder of Unikraft UG
  • Dmitrii Dolgov, Senior Software Engineer, Red Hat
  • Daniel Salvatore Albano, Senior Software Engineer II, Microsoft
  • Cristian Velazquez, Staff Site Reliability Engineer, Uber
  • Chen Huansheng, Database Performance Engineer, PingCAP
  • Pavel Emelyanov, Developer, ScyllaDB
  • Brian Likosar, Senior Solution Architect, StormForge
  • Blain Smith, Senior Software Engineer, StackPath
  • Armin Ronacher, Creator of Flask and Principal Architect, Sentry
  • Antón Rodríguez, Principal Software Engineer, New Relic

Follow us on Twitter for announcements, including details of the talks and any additional speakers we add to our lineup.

Watch the Sessions from P99 CONF 2021

To get a taste of what our 2022 event will be like, you can catch up on any or all of last year’s sessions now available on demand.



How Palo Alto Networks Replaced Kafka with ScyllaDB for Stream Processing

Global cybersecurity leader Palo Alto Networks processes terabytes of network security events each day. They analyze, correlate, and respond to millions of events per second: many different types of events, using many different schemas, reported by many different sensors and data sources. One of their many challenges is understanding which of those events actually describe the same network “story” from different viewpoints.

Accomplishing this would traditionally require both a database to store the events and a message queue to notify consumers about new events that arrived into the system. But, to mitigate the cost and operational overhead of deploying yet another stateful component to their system, Palo Alto Networks’ engineering team decided to take a different approach.

This blog explores why and how Palo Alto Networks completely eliminated the MQ layer for a project that correlates events in near real time. Instead of using Kafka, Palo Alto Networks decided to use their existing low-latency distributed database (ScyllaDB) as an event data store and as a message queue – enabling them to eliminate Kafka. It’s based on the information that Daniel Belenky, Principal Software Engineer at Palo Alto Networks, recently shared at ScyllaDB Summit.

Watch On Demand

The Palo Alto Networks session from ScyllaDB Summit 2022 is available for you to watch right now on demand. You can also watch all the rest of the ScyllaDB Summit 2022 videos and check out the slides here.

Background: Events, Events Everywhere

Belenky’s team develops the initial data pipelines that receive the data from endpoints, clean the data, process it, and prepare it for further analysis in other parts of the system. One of their top priorities is building accurate stories. As Belenky explains, “We receive multiple event types from multiple different data sources. Each of these data sources might be describing the same network session, but from different points on the network. We need to know if multiple events – say, one event from the firewall, one event from the endpoint, and one event from the cloud provider– are all telling the same story from different perspectives.” Their ultimate goal is to produce one core enriched event that comprises all the related events and their critical details.

For example, assume a router’s sensor generates a message (here, it’s two DNS queries). Then, one second later, a custom system sends a message indicating that someone performed a log-in and someone else performed a sign-up. After 8 minutes, a third sensor sends another event: some HTTP logs.  All these events which arrived at different times might actually describe the same session and the same network activity.

Different events might describe the same network activity in different ways

The system ingests the data reported by the different devices at different times and normalizes it to a canonical form that the rest of the system can process. But there’s a problem: this results in millions of normalized but unassociated entries. There’s a ton of data across the discrete events, but not (yet) any clear insight into what’s really happening on the network and which of those events are cause for concern.

Palo Alto Networks needed a way to group unassociated events into meaningful stories
about network activity

Evolving from Events to Stories

Why is it so hard to associate discrete entries that describe the same network session?

  • Clock skew across different sensors:  Sensors might be located across different datacenters, computers, and networks, so their clocks might not be synchronized to the millisecond.
  • Thousands of deployments to manage:  Given the nature of their business, Palo Alto Networks provides each customer a unique deployment. This means that their solution must be optimized for everything from small deployments that process bytes per second to larger ones that process gigabytes per second.
  • Sensor’s viewpoint on the session:  Different sensors have different perspectives on the same session. One sensor’s message might report the transaction from point A to point B, and another might report the same transaction in the reverse direction.
  • Zero tolerance for data loss:  For a cybersecurity solution, data loss could mean undetected threats. That’s simply not an option for Palo Alto Networks.
  • Continuous out-of-order stream:  Sensors send data at different times, and the event time (when the event occurred) is not necessarily the same as the ingestion time (when the event was sent to the system) or the processing time (when they were able to start working on this event).

The gray events are related to one story, and the blue events are related to another story. Note that while the gray ones are received in order, the blue ones are not. 

From an application perspective, what’s required to convert the millions of discrete events into clear stories that help Palo Alto Networks protect their clients? From a technical perspective, the system needs to:

  1. Receive a stream of events
  2. Wait some amount of time to allow related events to arrive
  3. Decide which events are related to each other
  4. Publish the results
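As an illustration of those steps (the session key, payloads, and grouping rule here are invented for the sketch; they are not Palo Alto Networks' actual correlation logic, and the "wait" step is elided):

```python
from collections import defaultdict

def build_stories(events):
    """events: iterable of (event_time, session_key, payload), in any order."""
    buckets = defaultdict(list)
    for event_time, key, payload in events:   # 1. receive the stream of events
        buckets[key].append((event_time, payload))
    stories = []
    for key, group in buckets.items():        # 3. decide which events relate
        group.sort()                          # order by event time, not arrival
        stories.append((key, [payload for _, payload in group]))
    return stories                            # 4. publish the results

events = [(8, "sess-a", "http-log"),          # arrives first, happened last
          (0, "sess-a", "dns-query"),
          (1, "sess-b", "login")]
print(build_stories(events))
```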

Additionally, there are two key business requirements to address. Belenky explained, “We need to provide each client a single-tenant deployment to provide complete isolation. And we need to support deployments with everything from several KB per hour up to several GBs per second at a reasonable cost.”

Belenky and team implemented and evaluated four different architectural approaches for meeting this challenge:

  • Relational Database
  • NoSQL + Message Queue
  • NoSQL + Cloud-Managed Message Queue
  • NoSQL, No Message Queue

Let’s look at each implementation in turn.

Implementation 1: Relational Database

Using a relational database was the most straightforward solution – and also the easiest to implement. Here, normalized data is stored in a relational database, and some periodic tasks run complex queries to determine which events are part of the same story. It then publishes the resulting stories so other parts of the system can respond as needed.

Implementation 1: Relational Database


Pros:

  • The implementation was relatively simple. Palo Alto Networks deployed a database and wrote some queries, but didn’t need to implement complex logic for correlating stories.


Cons:

  • Since this approach required them to deploy, maintain, and operate a new relational database in their ecosystem, it would cause considerable operational overhead that adds up over time.
  • Performance was limited, since relational database queries are slower than queries on a low-latency NoSQL database like ScyllaDB.
  • They would incur higher operational costs, since complex queries require more CPU and are thus more expensive.

Implementation 2: NoSQL + Message Queue

Next, they implemented a solution with ScyllaDB as a NoSQL data store and Kafka as a message queue. Like the first solution, normalized data is stored in a database – but in this implementation, it’s a NoSQL database instead of a relational database. In parallel, they publish the keys that will later allow them to fetch those event records from the database. Each row represents one event from different sources.

Implementation 2: NoSQL + Message Queue

Multiple consumers read the data from a Kafka topic. Again, this data contains only the key – just enough data to allow those consumers to fetch those records from the database. These consumers then get the actual records from the database, build stories by determining the relations between those events, and publish the stories so that other system components can consume them.

Why not store the records and publish the records directly on Kafka? Belenky explained, “The problem is that those records can be big, several megabytes in size. We can’t afford to run this through Kafka due to the performance impact. To meet our performance expectations, Kafka must work from memory, and we don’t have much memory to give it.”


Pros:

  • Very high throughput compared to the relational database with batch queries
  • One less database to maintain (ScyllaDB was already used across Palo Alto Networks)


Cons:

  • Required implementing complex logic to identify correlations and build stories
  • Complex architecture and deployment, with data being sent to Kafka and the database in parallel
  • Providing an isolated deployment for each client meant maintaining thousands of Kafka deployments; even the smallest customer required two or three Kafka instances

Implementation 3: NoSQL + Cloud-Managed Message Queue

This implementation is largely the same as the previous one. The only exception is that they replaced Kafka with a cloud-managed queue.

Implementation 3: NoSQL + Cloud-Managed Message Queue


Pros:

  • Very high throughput compared to the relational database with batch queries
  • One less database to maintain (ScyllaDB was already used across Palo Alto Networks)
  • No need to maintain Kafka deployments


Cons:

  • Required implementing complex logic to identify correlations and build stories
  • Much slower performance compared to Kafka

They quickly dismissed this approach because it was essentially the worst of both worlds: slow performance as well as high complexity.

Implementation 4: NoSQL (ScyllaDB), No Message Queue

Ultimately, the solution that worked best for them was ScyllaDB NoSQL without a message queue.

Implementation 4: NoSQL, No Message Queue

Like all the previous solutions, it starts with normalized data in canonical form ready for processing; that data is split into hundreds of shards, and the records are sent to just one place: ScyllaDB. The partition key is the shard number, allowing different workers to work on different shards in parallel. insert_time is a timestamp with a certain resolution, say up to 1 second. The clustering key is event_id, which is used later to fetch individual events.

Belenky expanded, “We have our multiple consumers fetching records from ScyllaDB. They run a query that tells ScyllaDB, ‘Give me all the data that you have for this partition, for this shard, and with the given timestamp.’ ScyllaDB returns all the records to them, they compute the stories, and then they publish the stories for other parts or other components in the system to consume.”


Pros:

  • Since ScyllaDB was already deployed across their organization, they didn’t need to add any new technologies to their ecosystem
  • High throughput compared to the relational database approach
  • Comparable performance to the Kafka solution
  • No need to add or maintain Kafka deployments


Cons:

  • Their code became more complex
  • Producers and consumers must have synchronized clocks (up to a certain resolution)

Finally, let’s take an even deeper dive into how this solution works. The right side of this diagram shows Palo Alto Networks’ internal “worker” components that build the stories. When the worker components start, they query a special table in ScyllaDB, called read_offsets, where each worker component stores its last offset (the last timestamp it reached while reading). ScyllaDB returns the last state it had for each shard; for example, for shard 1 the read_offset is 1000, while shards 2 and 3 have different offsets.

Then the event producers run a query that inserts data, including the event_id as well as the actual payload, into the appropriate shard on ScyllaDB.

Next, the workers (which are continuously running in an endless loop) take the data from ScyllaDB, compute stories, and make the stories available to consumers.

When each of the workers is done computing a story, it commits the last read_offset to ScyllaDB.

When the next event arrives, it’s added to a ScyllaDB shard and processed by the workers…then the cycle continues.
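A toy in-memory model of this database-as-queue cycle (a Python simulation only; the table and column names mirror the description above, offsets are simplified to one worker per shard, and no actual ScyllaDB driver calls are made):

```python
from collections import defaultdict

shards = defaultdict(list)       # shard -> [(insert_time, event_id, payload)]
read_offsets = defaultdict(int)  # shard -> last committed insert_time

def produce(shard, insert_time, event_id, payload):
    """Event producers insert rows keyed by shard, time bucket, and event_id."""
    shards[shard].append((insert_time, event_id, payload))

def consume(shard, now):
    """Workers fetch events newer than the committed offset, then commit."""
    offset = read_offsets[shard]
    batch = sorted(e for e in shards[shard] if offset < e[0] <= now)
    read_offsets[shard] = now    # commit the new read_offset for this shard
    return batch                 # events to correlate into stories

produce(1, 1001, "e1", "dns-query")
produce(1, 1002, "e2", "http-log")
print(consume(1, 1002))  # both events
print(consume(1, 1003))  # nothing new yet
```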

Final Results

What were their final results? Belenky summed up, “We’ve been able to reduce the operational cost by a lot, actually. We reduced the operational complexity because we didn’t add another system – we actually removed a system [Kafka] from our deployment. And we’ve been able to increase our performance, which translates to reduced operational costs.”

Want to Learn More?

ScyllaDB is the monstrously fast and scalable database for industry gamechangers. We’d love to chat with you if you want to learn more about how you can use ScyllaDB in your own organization. Contact us directly, or join the conversation with our Slack community.