Be Part of Something BIG – Speak at Monster Scale Summit

A little sneak peek at something massive: a new virtual conference on extreme-scale engineering! Whether you’re designing, implementing, or optimizing systems that are pushed to their limits, we’d love to hear about your most impressive achievements and lessons learned – at Monster Scale Summit 2025.

Become a Monster Scale Summit Speaker

Register for Monster Scale Summit [Free]

What’s Monster Scale Summit?

Monster Scale Summit is a technical conference that connects the community of professionals working on performance-sensitive, data-intensive applications. Engineers, architects, and SREs from gamechangers around the globe will gather virtually to explore “monster scale” challenges with respect to extreme levels of throughput, data, and global distribution.

It’s a lot like P99 CONF (also hosted by ScyllaDB) – a two-day event that’s free, fully virtual, and highly interactive. The core difference is that it’s focused on extreme-scale engineering rather than all things performance.

We just opened the call for speakers, and the lineup already includes engineers from Slack, Salesforce, VISA, American Express, ShareChat, Cloudflare, and Disney. Keynote speakers include Gwen Shapira, Chris Riccomini, and Martin Kleppmann (author of Designing Data-Intensive Applications).

What About ScyllaDB Summit?

You might already be familiar with ScyllaDB Summit. Monster Scale Summit is the next evolution of that conference. We’re scaling it up and out to bring attendees more – and broader – insights on designing, implementing, and optimizing performance-sensitive, data-intensive applications. But don’t worry – ScyllaDB and sea monsters will still be featured prominently throughout the event. And speakers will get sea monster plushies as part of the swag pack. 😉

Details, please!

When: March 11 + 12
Where: Wherever you’d like! It’s intentionally virtual, so you can present and interact with attendees from anywhere around the world.
Topics: Core topics include:
  • Distributed databases
  • Streaming and real-time processing
  • Intriguing system designs
  • Approaches to a massive scaling challenge
  • Methods for balancing latency/concurrency/throughput
  • SRE techniques proven at scale
  • Infrastructure built for unprecedented demands

What we’re looking for: We welcome a broad spectrum of talks about tackling the challenges that arise in the most massive, demanding environments. The conference prioritizes technical talks sharing first-hand experiences. Sessions are just 15-20 minutes – so consider this your TED Talk debut!

Share Your Ideas

Clues in Long Queues: High IO Queue Delays Explained

How seemingly peculiar metrics might provide interesting insights into system performance

In large systems, you often encounter effects that seem weird at first glance but – when studied carefully – give an invaluable clue to understanding system behavior. When supporting ScyllaDB deployments, we observe many workload patterns that reveal themselves in various amusing ways. Sometimes what seems to be a system misbehaving stems from a bad configuration or a bug in the code. However, quite often what seems impossible at first sight turns out to be an interesting phenomenon. Previously we described one such effect, called “phantom jams.” In this post, we’re going to show another example of the same species.

As we’ve learned from many of the ScyllaDB deployments we track, sometimes a system appears to be lightly loaded and only a single parameter stands out – one that typically denotes a system bottleneck. The immediate response is usually to disregard the outlier and attribute it to a spurious system slow-down. However, a thorough and careful analysis of all the parameters, coupled with an understanding of the monitoring system architecture, shows that the system is indeed under-loaded but imbalanced – and that crazy parameter is how the problem actually surfaced.

Scraping metrics

Monitoring systems often follow a time-series approach. To avoid overwhelming their monitored targets and frequently populating a time-series database (TSDB) with redundant data, these solutions apply a concept known as a “scrape interval.” Although different monitoring solutions exist, we’ll mainly refer to Prometheus and Grafana throughout this article, given that these are what we use for ScyllaDB Monitoring.

Prometheus polls its monitored endpoints periodically and retrieves a set of metrics. This is called “scraping.” Metric samples collected in a single scrape consist of name:value pairs, where the value is a number. Prometheus supports four core types of metrics, but we are going to focus on two of those: counters and gauges. Counters are monotonically increasing metrics that reflect some value accumulated over time. When observed through Grafana, the rate() function is applied to counters, so they reflect the change since the previous scrape instead of the total accumulated value. Gauges, on the other hand, are metrics that can arbitrarily rise and fall. Importantly (and perhaps surprisingly), gauges reflect the metric’s state as observed at scrape time. This effectively means that any changes made between scrapes are overlooked, and are lost forever.

Before going further with the metrics, let’s take a step back and look at what makes it possible for ScyllaDB to serve millions and billions of user requests per second at sub-millisecond latency.

IO in ScyllaDB

ScyllaDB uses the Seastar framework to run its CPU, IO, and network activity. A task represents a ScyllaDB operation run in lightweight threads (reactors) managed by Seastar. IO is performed in terms of requests and goes through a two-phase process that happens inside the subsystem we call the IO scheduler. The IO scheduler plays a critical role in ensuring that IO gets both prioritized and dispatched in a timely manner, which often means predictability – some workloads require that submitted requests complete no later than within a given, rather small, amount of time.

To achieve that, the IO scheduler sits in the hot path – between the disks and the database operations – and is built with a good understanding of the underlying disk’s capabilities.

To perform an IO, a running task first submits a request to the scheduler. At that time, no IO happens; the request is put into a Seastar queue for further processing. Periodically, the Seastar reactor switches from running tasks to performing service operations, such as handling IO. This periodic switch is called polling, and it happens in two circumstances:

  • When there are no more tasks to run (such as when all tasks are waiting for IO to complete), or
  • When a timer known as the task-quota elapses – by default, every 0.5 milliseconds.

The second phase of IO handling involves two actions. First, the kernel is asked for any completed IO requests that were submitted previously. Second, outstanding requests in the ScyllaDB IO queues are dispatched to disk using the Linux kernel AIO API.

Dispatching requests into the kernel is performed at a rate derived from the pre-configured disk throughput and the previously mentioned task-quota parameter. The goal of this throttled dispatching is to make sure that dispatched requests complete within the duration of a task-quota, so urgent requests that pop up in the queue during that time don’t need to wait long for the disk to be able to serve them. For the scope of this article, let’s just say that dispatching happens at the disk throughput. For example, if the disk throughput is 10k operations per second and a poll happens every millisecond, then the dispatch rate will be 10 requests per poll.

IO Scheduler Metrics

Since the IO scheduler sits in the hot path of all IO operations, it is important to understand how it is performing. In ScyllaDB, we accomplish that via metrics. Seastar exposes many metrics, and several IO-related ones are included among them. All IO metrics are exported per class with the help of metric labels, and each represents a given IO class’s activity at a given point in time.

IO Scheduler Metrics for the commitlog class

Bandwidth and IOPS are two metrics that are easy to reason about. They show the rates at which requests get dispatched to disk. Bandwidth is a counter that gets increased by the request length every time a request is sent to disk. IOPS is a counter that gets incremented every time a request is sent to disk. When observed through Grafana, the aforementioned rate() function is applied and these counters are shown as BPS (bytes per second) and IO/s (IOs per second) under their respective IO classes.

Queue length metrics are gauges that represent the size of a queue. There are two kinds: one represents the number of outstanding requests in the IO class queue; the other represents the number of requests dispatched to the kernel. These queues are also easy to reason about. Every time ScyllaDB makes a request, the class queue length is incremented. When the request gets dispatched to disk, the class queue length gauge is decremented and the disk queue length gauge is incremented. Eventually, as the IO completes, the disk queue length gauge goes down.

When observing those metrics, it’s important to remember that they reflect the queue sizes as they were at the exact moment they got scraped – not how large (or small) the queue was over the scrape period. This common misconception may lead to the wrong conclusions about how the IO scheduler or the disks are performing.
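
To make the difference between these metric types concrete, here is a tiny Python sketch. It is not ScyllaDB or Prometheus code, and the metric names and numbers are invented for illustration; it simply mimics two consecutive scrapes, turning the counters into per-second rates the way Grafana’s rate() does, while the queue-length gauge only reports whatever value it happens to hold at the moment of the scrape.

# Two consecutive scrapes of one IO class, 15 seconds apart (illustrative numbers).
scrape_interval_s = 15

prev = {"io_ops_total": 120_000, "io_bytes_total": 1_500_000_000, "queue_length": 0}
curr = {"io_ops_total": 129_000, "io_bytes_total": 1_620_000_000, "queue_length": 2}

def rate(name):
    """What Grafana's rate() shows: counter delta divided by the scrape interval."""
    return (curr[name] - prev[name]) / scrape_interval_s

print("IO/s :", rate("io_ops_total"))      # 600.0 requests per second
print("BPS  :", rate("io_bytes_total"))    # 8,000,000 bytes per second
print("queue:", curr["queue_length"])      # just the instantaneous sample --
                                           # it says nothing about the 9,000
                                           # requests that flowed through the
                                           # queue between the two scrapes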

Lastly, we have latency metrics known as IO delays. There are two of those – one for the software queue and another for the disk. Each represents the average time requests spent waiting to get serviced.

In earlier ScyllaDB versions, latency metrics were represented as gauges. The value shown was the latency of the last dispatched request (from the IO class queue to disk) or the last completed request (a disk IO completion). Because of that, the latencies shown weren’t accurate and didn’t reflect reality: a single ScyllaDB shard can perform thousands of requests per second, and showing the latency of a single request scraped after a long interval omits important insights about what really happened since the previous scrape. That’s why we eventually replaced these gauges with counters. Since then, latencies have been shown as a rate between scrape intervals. To calculate the average request delay, the new counter metrics are divided by the total number of IOPS dispatched within the scrape period.

Disk can do more

When observing IO for a given class, it is common to see corresponding events that took place during a specific interval. Consider the following picture:

IO Scheduler Metrics – sl:default class

The exact numbers are not critical here. What matters is how the different plots correspond to each other. So what’s strange here? Observe the two rightmost panels – bandwidth and IOPS. On a given shard, bandwidth starts at 5 MB/s and peaks at 20 MB/s, whereas IOPS starts at 200 operations/sec and peaks at 800. These are really conservative numbers. The system from which those metrics were collected can sustain 1 GB/s of bandwidth at several thousand IOPS. Therefore, given that the numbers above are per-shard, the disk is using about 10% of its total capacity.

Next, observe that the queue length metric (second from the left) is empty most of the time. This is expected, partially because it’s a gauge: it represents the number of requests sitting in the queue as observed at scrape time, not the total number of requests that got queued. Since disk capacity is far from saturated, the IO scheduler dispatches all requests to disk shortly after they arrive in the scheduler queue. Given that IO polling happens at sub-millisecond intervals, in-queue requests get dispatched to disk within a millisecond.

So why do the latencies shown in the queue delay metric (the leftmost one) grow close to 40 milliseconds? In such situations, ScyllaDB users commonly wonder: “The disk can do more – why isn’t ScyllaDB’s IO scheduler consuming the remaining disk capacity?!”

IO Queue delays explained

To get an idea of what’s going on, let’s simplify the dispatching model described above and then walk through several thought experiments on an imaginary system. Assume that a disk can do 100k IOPS, and ignore its bandwidth for this exercise. Next, assume that the metrics scrape interval is 1 second, and that ScyllaDB polls its queues once every millisecond. Under these assumptions, according to the dispatching model described above, ScyllaDB will dispatch at most 100 requests at every poll. We’ll then see what happens when servicing 10k requests within a second – 10% of what our disk can handle.

  IOPS Capacity | Polling interval | Dispatch rate | Target request rate | Scrape interval
  100K          | 1ms              | 100 per poll  | 10K/second          | 1s

Even request arrival

In the first experiment, requests arrive at the queue evenly – one request every 1/10k = 0.1 millisecond.

By the end of each tick, there will be 10 requests in the queue, and the IO scheduler will dispatch them all to disk. By the time polling occurs, each request will have accumulated its own in-queue delay: the first request waited 0.9 ms, the second 0.8 ms, and so on down to 0 ms. The sum is approximately 5 ms of total in-queue delay per tick. After 1 second (or 1K ticks/polls), we’ll observe a total in-queue delay of about 5 seconds. When scraped, the metrics will show:

  • A rate of 10K IOPS
  • An empty queue
  • An average in-queue delay/latency of 0.5 ms (5 seconds total delay / 10K IOPS)

Single batch of requests

In the second experiment, all 10k requests arrive at the queue at the very beginning and queue up. Since the dispatch rate is 100 requests per tick, the IO scheduler needs 100 polls to fully drain the queue. The requests dispatched at the first tick contribute 1 millisecond each to the total in-queue delay, for a sum of 100 milliseconds. Requests dispatched at the second tick contribute 2 milliseconds each, for a sum of 200 milliseconds. In general, requests dispatched during the Nth tick contribute N*100 milliseconds to the delay counter. After 100 ticks, the total in-queue delay is 100 + 200 + … + 10000 ms ≈ 500000 ms = 500 seconds. Once the metrics endpoint gets scraped, we’ll observe:

  • The same rate of 10K IOPS – the order of arrival doesn’t influence the result
  • The same empty queue, given that all requests were dispatched within 100 ms (well before scrape time)
  • A 50-millisecond in-queue delay (500 seconds total delay / 10K IOPS)

The same work, done differently, resulted in much higher IO delays.

Multiple batches

If the submission of requests happens more evenly – say, 1k-request batches arriving every 100 ms – the situation is better, though still not perfect. Each tick dispatches 100 requests, fully draining the queue within 10 ticks. However, given our polling interval of 1 ms, the following batch arrives only after another 90 ticks, and the system idles in between. As in the previous example, the Nth tick of draining a batch contributes N*100 milliseconds to the total in-queue delay. By the time a batch is fully drained, its contribution is 100 + 200 + … + 1000 ms ≈ 5000 ms = 5 seconds. After 10 batches, this results in about 50 seconds of total delay. When scraped, we’ll observe:

  • The same rate of 10K IOPS
  • The same empty queue
  • A 5-millisecond in-queue delay (50 seconds / 10K IOPS)

To sum up: the above experiments demonstrate that the same workload may render a drastically different observable “queue delay” when averaged over a long enough period of time. It can be an “expected” delay of half a millisecond. Or it can look very much like the puzzle shown previously – the disk seemingly can do more, the software queue is empty, and yet the in-queue latency gets notably higher than the tick length.
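
The arithmetic above is easy to replay in code. The following Python sketch is not ScyllaDB code – just a back-of-the-envelope model of the same rules (at most 100 dispatches per 1 ms poll, dispatching at the end of each poll, over a 1-second scrape) applied to the three arrival patterns:

# Back-of-the-envelope model of the thought experiments above.
def simulate(arrival_times_ms, ticks=1000, dispatch_per_tick=100, tick_ms=1.0):
    pending = sorted(arrival_times_ms)   # FIFO queue of arrival timestamps (ms)
    total_delay_ms = 0.0                 # the counter-style "total delay" metric
    dispatched = 0
    for tick in range(ticks):
        now = (tick + 1) * tick_ms       # dispatch happens when the poll at the
        batch = [a for a in pending[:dispatch_per_tick] if a < now]   # end of the tick runs
        pending = pending[len(batch):]
        total_delay_ms += sum(now - a for a in batch)
        dispatched += len(batch)
    return total_delay_ms / 1000.0, total_delay_ms / dispatched   # seconds, ms/request

even    = [i * 0.1 for i in range(10_000)]        # one request every 0.1 ms
burst   = [0.0] * 10_000                          # all 10k requests up front
batches = [b * 100.0 for b in range(10)] * 1000   # 1k requests every 100 ms

for name, pattern in (("even", even), ("single batch", burst), ("10 batches", batches)):
    total_s, avg_ms = simulate(pattern)
    print(f"{name:12s}  total delay ≈ {total_s:6.1f} s   avg ≈ {avg_ms:5.1f} ms/request")

The printed totals and averages land close to the figures above (≈5 s / 0.5 ms, ≈500 s / 50 ms, ≈50 s / 5 ms); the small differences come from rounding arrivals to the model’s time grid.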

Average queue length over time

Queue length is naturally a gauge-type metric: it frequently rises and falls as IO requests arrive and get dispatched. Without collecting an array of all its values, it’s impossible to know how it changed over a given period of time. Therefore, sampling the queue length at long intervals is only reliable for very uniform incoming workloads.

There are many parameters of the same nature in the computing world. The most famous example is the load average in Linux. It denotes the length of the CPU run-queue (including tasks waiting for IO) averaged over the past 1, 5, and 15 minutes. It’s not a full history of run-queue changes, but it gives an idea of how the queue looked over time. Implementing a similar “queue length average” would improve the observability of IO queue length changes. Although possible, that would require sampling the queue length more regularly and exposing more gauges. But as we’ve demonstrated above, the accumulated total in-queue time is another option – one that requires only a single counter, yet still conveys some history.

Why is a scheduler needed?

Sometimes you may observe that doing no scheduling at all results in much better in-queue latency. Our second experiment clearly shows why. Suppose that, as in that experiment, 10k requests arrive in one large batch and ScyllaDB just forwards them straight to disk at the nearest tick. This results in a 10000 ms total latency counter – a 1 ms average queue delay. The initial results look great: at this point, the system is not overloaded, and – as we know – no new requests will arrive, so the disk has enough time and resources to queue and service all dispatched requests. In fact, the disk will probably perform IO even better than it would while being fed gradually with requests. Handing it the whole batch would likely maximize the disk’s internal parallelism and give it more opportunities to apply internal optimizations, such as request merging or batching FTL updates.

So why don’t we simply flush the whole queue to disk, whatever its length? The answer lies in the details – particularly in the “as we know” part. First of all, Seastar assigns different IO classes to different kinds of workloads. To reflect the fact that different workloads have different importance to the system, IO classes have different priorities called “shares.” It is then the IO scheduler’s responsibility to dispatch queued IO requests to the underlying disk according to each class’s shares. For example, any IO activity triggered by user queries runs under its own class, named “statement” in ScyllaDB Open Source and “sl:default” in Enterprise. This class usually has the largest shares, denoting its high priority. Similarly, any IO performed during compactions occurs in the “compaction” class, whereas memtable flushes happen inside the “memtable” class – and both typically have low shares. We say “typically” because ScyllaDB dynamically adjusts the shares of those two classes when it detects that more work is needed for the respective workflow (for example, when compaction is falling behind).

Next, after sending 10k requests to disk, we can expect them all to complete in about 10k/100k = 100 ms. So there isn’t much of a difference whether requests get queued by the IO scheduler or by the disk – the problem happens only when a new high-priority request pops up while we’re waiting for the batch to be serviced. Even if we dispatch this new urgent request instantly, it will likely need to wait for the first batch to complete. The chances that the disk will reorder it and service it earlier are too low to rely on, and that’s exactly the delay the scheduler tries to avoid. Urgent requests need to be prioritized accordingly and served much faster. With the IO scheduler’s dispatching model, we guarantee that a newly arrived urgent request gets serviced almost immediately.
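
The shares idea can be pictured with a toy dispatch loop. The sketch below is not Seastar’s actual scheduler and the share values are made up; it only illustrates splitting a fixed per-poll dispatch budget across classes in proportion to their shares, using the class names mentioned above.

# Toy proportional dispatcher: each poll, split a fixed dispatch budget across
# IO classes according to their shares. Not Seastar's algorithm -- just the idea.
from collections import deque

classes = {                      # shares are illustrative, not real defaults
    "sl:default": {"shares": 1000, "queue": deque(f"q{i}" for i in range(500))},
    "compaction": {"shares": 100,  "queue": deque(f"c{i}" for i in range(500))},
    "memtable":   {"shares": 100,  "queue": deque(f"m{i}" for i in range(500))},
}

DISPATCH_PER_POLL = 100          # derived from disk throughput and task-quota

def poll_once():
    total_shares = sum(c["shares"] for c in classes.values() if c["queue"])
    dispatched = []
    for name, c in classes.items():
        if not c["queue"] or total_shares == 0:
            continue
        budget = DISPATCH_PER_POLL * c["shares"] // total_shares
        for _ in range(min(budget, len(c["queue"]))):
            dispatched.append((name, c["queue"].popleft()))
    return dispatched

batch = poll_once()
print({name: sum(1 for n, _ in batch if n == name) for name in classes})
# -> {'sl:default': 83, 'compaction': 8, 'memtable': 8}

The high-shares class gets most of each poll’s budget, but the low-shares classes still make progress – and because only part of the backlog is handed to the disk per poll, an urgent request arriving mid-batch does not have to wait behind thousands of already-dispatched ones.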

Conclusion

Understanding metrics is crucial for understanding the behavior of complex systems. Queues are an essential element of any data processing, and seeing how data traverses queues is crucial for engineers solving real-life performance problems. Since it’s impossible to track every single data unit, compound metrics like counters and gauges become great companions for that task.

Queue length is a very important parameter. Observing how it changes over time reveals bottlenecks in the system, shedding light on performance issues that can arise in complex, highly loaded systems. Unfortunately, you cannot see the full history of queue length changes (as you can with many other parameters), and this can lead to misunderstanding the system’s behavior. This article described an attempt to map queue length from a gauge-type metric to a counter-type one – making it possible to accumulate a history of queue length changes over time. Even though the described “total delay” metric and its behavior are heavily tied to how ScyllaDB Monitoring and the Seastar IO scheduler work, this way of accumulating and monitoring latencies is generic enough to be applied to other systems as well.

Instaclustr for Apache Cassandra® 5.0 Now Generally Available

NetApp is excited to announce the general availability (GA) of Apache Cassandra® 5.0 on the Instaclustr Platform. This follows the release of the public preview in March.

NetApp was the first managed service provider to release the beta version, and now offers the generally available version, allowing deployment of Cassandra 5.0 across the major cloud providers – AWS, Azure, and GCP – as well as on-premises.

Apache Cassandra has been a leader in NoSQL databases since its inception and is known for its high availability, reliability, and scalability. The latest version brings many new features and enhancements, with a special focus on building data-driven applications through artificial intelligence and machine learning capabilities.

Cassandra 5.0 will help you optimize performance, lower costs, and get started on the next generation of distributed computing by: 

  • Helping you build AI/ML-based applications through Vector Search  
  • Bringing efficiencies to your applications through new and enhanced indexing and processing capabilities 
  • Improving flexibility and security 

With the GA release, you can use Cassandra 5.0 for your production workloads, which are covered by NetApp’s industry-leading SLAs. NetApp has conducted performance benchmarking and extensive testing, and removed the limitations that were present in the preview release, to offer a more reliable and stable version. Our GA offering is suitable for all workload types as it contains the most up-to-date range of features, bug fixes, and security patches.

Support for continuous backups and private network add-ons is available. Currently, Debezium is not yet compatible with Cassandra 5.0. NetApp will work with the Debezium community to add support for Debezium on Cassandra 5.0, and it will be available on the Instaclustr Platform as soon as it is supported.

Some of the key new features in Cassandra 5.0 include: 

  • Storage-Attached Indexes (SAI): A highly scalable, globally distributed index for Cassandra databases. With SAI, column-level indexes can be added, leading to unparalleled I/O throughput for searches across different data types, including vectors. SAI also enables lightning-fast data retrieval through zero-copy streaming of indices, resulting in unprecedented efficiency.  
  • Vector Search: This is a powerful technique for searching relevant content or discovering connections by comparing similarities in large document collections and is particularly useful for AI applications. It uses storage-attached indexing and dense indexing techniques to enhance data exploration and analysis.  
  • Unified Compaction Strategy: This strategy unifies compaction approaches, including leveled, tiered, and time-windowed strategies. It leads to a major reduction in SSTable sizes. Smaller SSTables mean better read and write performance, reduced storage requirements, and improved overall efficiency.  
  • Numerous stability and testing improvements: You can read all about these changes here. 

All these new features are available out-of-the-box in Cassandra 5.0 and do not incur additional costs.  
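
For readers who want a feel for what the two headline features look like in practice, here is a minimal, hedged sketch using the Python driver (cassandra-driver). The keyspace, table, and index names are invented for illustration, and the snippet assumes the Cassandra 5.0 CQL syntax for the vector type, SAI indexes, and ANN ordering; the exact syntax is best confirmed against the Apache Cassandra 5.0 documentation.

# Minimal sketch: a vector column, an SAI index, and an ANN query on Cassandra 5.0.
# Names (demo keyspace/table/index) are illustrative only.
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect()

session.execute("""
    CREATE KEYSPACE IF NOT EXISTS demo
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
""")
session.execute("""
    CREATE TABLE IF NOT EXISTS demo.products (
        id int PRIMARY KEY,
        name text,
        embedding vector<float, 3>   -- 3-dimensional vector for brevity
    )
""")
# Storage-Attached Index on the vector column enables similarity search
session.execute("CREATE INDEX IF NOT EXISTS products_ann ON demo.products (embedding) USING 'sai'")

session.execute("INSERT INTO demo.products (id, name, embedding) VALUES (1, 'guitar', [0.1, 0.9, 0.2])")
session.execute("INSERT INTO demo.products (id, name, embedding) VALUES (2, 'drums',  [0.8, 0.1, 0.3])")

# Approximate-nearest-neighbor search ordered by vector similarity
rows = session.execute(
    "SELECT id, name FROM demo.products ORDER BY embedding ANN OF [0.15, 0.85, 0.25] LIMIT 1"
)
print(list(rows))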

Our Development team has worked diligently to bring you a stable release of Cassandra 5.0. Substantial preparatory work was done to ensure you have a seamless experience with Cassandra 5.0 on the Instaclustr Platform. This includes updating the Cassandra YAML and Java environment and enhancing the monitoring capabilities of the platform to support new data types.  

We also conducted extensive performance testing and benchmarked version 5.0 against the existing stable Apache Cassandra 4.1.5 version. We will be publishing our benchmarking results shortly; the highlight so far is that Cassandra 5.0 improves responsiveness by reducing latencies by up to 30% during peak load times.

Through our dedicated Apache Cassandra committer, NetApp has contributed to the development of Cassandra 5.0 by enhancing the documentation for new features like Vector Search (CASSANDRA-19030), enabling Materialized Views (MV) with only partition keys (CASSANDRA-13857), fixing numerous bugs, and contributing to the improvements for the unified compaction strategy feature, among many other things.

Lifecycle Policy Updates 

As previously communicated, the project will no longer maintain Apache Cassandra 3.0 and 3.11 versions (full details of the announcement can be found on the Apache Cassandra website).

To help you transition smoothly, NetApp will provide extended support for these versions for an additional 12 months. During this period, we will backport any critical bug fixes, including security patches, to ensure the continued security and stability of your clusters. 

Cassandra 3.0 and 3.11 versions will reach end-of-life on the Instaclustr Managed Platform within the next 12 months. We will work with you to plan and upgrade your clusters during this period.  

Additionally, the Cassandra 5.0 beta version and the Cassandra 5.0 RC2 version, which were released as part of the public preview, are now end-of-life. You can check the lifecycle status of different Cassandra application versions here.

You can read more about our lifecycle policies on our website. 

Getting Started 

Upgrading to Cassandra 5.0 will allow you to stay current and start taking advantage of its benefits. The Instaclustr by NetApp Support team is ready to help customers upgrade clusters to the latest version.  

  • Wondering if it’s possible to upgrade your workloads from Cassandra 3.x to Cassandra 5.0? Find the answer to this and other similar questions in this detailed blog.
  • Click here to read about Storage Attached Indexes in Apache Cassandra 5.0.
  • Learn about 4 new Apache Cassandra 5.0 features to be excited about. 
  • Click here to learn what you need to know about Apache Cassandra 5.0. 

Why Choose Apache Cassandra on the Instaclustr Managed Platform? 

NetApp strives to deliver the best experience for supported applications. Whether it’s the latest application versions available on the platform or additional platform enhancements, we ensure high quality through thorough testing before entering General Availability.

NetApp customers have the advantage of accessing the latest versions—not just the major version releases but also minor version releases—so that they can benefit from any new features and are protected from any vulnerabilities.  

Don’t have an Instaclustr account yet? Sign up for a trial or reach out to our Sales team and start exploring Cassandra 5.0.  

With more than 375 million node hours of management experience, Instaclustr offers unparalleled expertise. Visit our website to learn more about the Instaclustr Managed Platform for Apache Cassandra.  

If you would like to upgrade your Apache Cassandra version or have any issues or questions about provisioning your cluster, please contact Instaclustr Support at any time.  

The post Instaclustr for Apache Cassandra® 5.0 Now Generally Available appeared first on Instaclustr.

Apache Cassandra 5.0 Is Generally Available!

As an Apache Cassandra® committer and long-time advocate, I’m really happy to talk about the release of Cassandra 5.0. This milestone represents not just an upgrade to Cassandra but a big leap in usability and capabilities for the world's most powerful distributed database. There’s something for...

Apache Cassandra® 5.0: Behind the Scenes

Here at NetApp, our Instaclustr product development team has spent nearly a year preparing for the release of Apache Cassandra 5.  

It started with one engineer tinkering at night with the Apache Cassandra 5 Alpha branch, and grew to five engineers working on various monitoring, configuration, testing, and functionality improvements to integrate the release with the Instaclustr Platform.

It’s been a long journey to the point we are at today, offering Apache Cassandra 5 Release Candidate 1 in public preview on the Instaclustr Platform. 

Note: the Instaclustr team has a dedicated open source committer to the Apache Cassandra project. His changes are not included in this document, as there were too many for us to cover here. Instead, this blog primarily focuses on the engineering effort to release Cassandra 5.0 onto the Instaclustr Managed Platform.

August 2023: The Beginning

We began experimenting with the Apache Cassandra 5 Alpha 1 branches using our build systems. There were several tools we built into our Apache Cassandra images that were not working at this point, but we managed to get a node to start even though it immediately crashed with errors.  

One of our early achievements was identifying and fixing a bug that impacted our packaging solution; this resulted in a small contribution to the project allowing Apache Cassandra to be installed on Debian systems with non-OpenJDK Java. 

September 2023: First Milestone 

The release of the Alpha 1 version allowed us to achieve our first running Cassandra 5 cluster in our development environments (without crashing!).  

Basic core functionalities like user creation, data writing, and backups/restores were tested successfully. However, several advanced features, such as repair and replace tooling, monitoring, and alerting were still untested.  

At this point we had to pause our Cassandra 5 efforts to focus on other priorities and planned to get back to testing Cassandra 5 after Alpha 2 was released. 

November 2023: Further Testing and Internal Preview

The project released Alpha 2, and we repeated the same build and test process we had run on Alpha 1. We also tested some more advanced procedures, like cluster resizes, with no issues.

We also started testing with some of the new 5.0 features: Vector Data types and Storage-Attached Indexes (SAI), which resulted in another small contribution.  

We launched Apache Cassandra 5 Alpha 2 for internal preview (basically for internal users). This allowed the wider Instaclustr team to access and use the Alpha on the platform.  

During this phase we found a bug in our metrics collector that was triggered when vector types were encountered; fixing it ended up being a major project for us.

If you see errors like the below, it’s time for a Java Cassandra driver upgrade to 4.16 or newer: 

java.lang.IllegalArgumentException: Could not parse type name vector<float, 5>  
Nov 15 22:41:04 ip-10-0-39-7 process[1548]: at com.datastax.driver.core.DataTypeCqlNameParser.parse(DataTypeCqlNameParser.java:233)  
Nov 15 22:41:04 ip-10-0-39-7 process[1548]: at com.datastax.driver.core.TableMetadata.build(TableMetadata.java:311)
Nov 15 22:41:04 ip-10-0-39-7 process[1548]: at com.datastax.driver.core.SchemaParser.buildTables(SchemaParser.java:302)
Nov 15 22:41:04 ip-10-0-39-7 process[1548]: at com.datastax.driver.core.SchemaParser.refresh(SchemaParser.java:130)
Nov 15 22:41:04 ip-10-0-39-7 process[1548]: at com.datastax.driver.core.ControlConnection.refreshSchema(ControlConnection.java:417)  
Nov 15 22:41:04 ip-10-0-39-7 process[1548]: at com.datastax.driver.core.ControlConnection.refreshSchema(ControlConnection.java:356)  
<Rest of stacktrace removed for brevity>

December 2023: Focus on new features and planning 

As the project released Beta 1, we began focusing on the features in Cassandra 5 that we thought were the most exciting and would provide the most value to customers. There are a lot of awesome new features and changes, so it took a while to find the ones with the largest impact.  

 The final list of high impact features we came up with was: 

  • A new data type: Vectors 
  • Trie memtables/Trie Indexed SSTables (BTI Formatted SStables) 
  • Storage-Attached Indexes (SAI) 
  • Unified Compaction Strategy 

A major new feature we considered deploying was support for JDK 17. However, due to its experimental nature, we have opted to postpone adoption and plan to support running Apache Cassandra on JDK 17 when it’s out of the experimentation phase. 

Once the holiday season arrived, it was time for a break, and we were back in force in February next year. 

February 2024: Intensive testing 

In February, we released Beta 1 into internal preview so we could start testing it in our preproduction test environments. As we started doing more intensive testing, we discovered issues in the interaction with our monitoring and provisioning setup.

We quickly fixed the issues identified as showstoppers for launching Cassandra 5. By the end of February, we initiated discussions about a public preview release. We also started to add more resourcing to the Cassandra 5 project. Up until now, only one person was working on it.  

Next, we broke down the work we needed to do. This included identifying monitoring agents that required upgrades and configuration defaults that needed to change.

From this point, the project split into 3 streams of work: 

  1. Project Planning – Deciding how all this work gets pulled together cleanly, ensuring other work streams have adequate resourcing to hit their goals, and informing product management and the wider business of what’s happening.  
  2. Configuration Tuning – Focusing on the new features of Apache Cassandra to include, how to approach the transition to JDK 17, and how to use BTI formatted SSTables on the platform.  
  3. Infrastructure Upgrades – Identifying what to upgrade internally to handle Cassandra 5, including Vectors and BTI formatted SSTables. 

A Senior Engineer was responsible for each workstream to ensure planned timeframes were achieved. 

March 2024: Public Preview Release 

In March, we launched Beta 1 into public preview on the Instaclustr Managed Platform. The initial release did not contain any opt-in features like Trie indexed SSTables.

However, this gave us a consistent base to test in our development, test, and production environments, and proved our release pipeline for Apache Cassandra 5 was working as intended. This also gave customers the opportunity to start using Apache Cassandra 5 with their own use cases and environments for experimentation.  

See our public preview launch blog for further details. 

There was not much time to celebrate as we continued working on infrastructure and refining our configuration defaults. 

April 2024: Configuration Tuning and Deeper Testing 

The first configuration updates were completed for Beta 1, and we started performing deeper functional and performance testing. We identified a few issues from this effort and remediated them. This default configuration was applied to all Beta 1 clusters moving forward.

This allowed users to start testing Trie Indexed SSTables and Trie memtables in their environment by default. 

"memtable": 
  { 
    "configurations": 
      { 
        "skiplist": 
          { 
            "class_name": "SkipListMemtable" 
          }, 
        "sharded": 
          { 
            "class_name": "ShardedSkipListMemtable" 
          }, 
        "trie": 
          { 
            "class_name": "TrieMemtable" 
          }, 
        "default": 
          { 
            "inherits": "trie" 
          } 
      } 
  }, 
"sstable": 
  { 
    "selected_format": "bti" 
  }, 
"storage_compatibility_mode": "NONE",

The snippet above shows an Apache Cassandra YAML configuration in which BTI-formatted SSTables are used by default (which enables Trie-indexed SSTables) and Trie memtables are the default memtable implementation. You can override this per table:

CREATE TABLE test WITH memtable = {'class' : 'ShardedSkipListMemtable'};

Note that you need to set storage_compatibility_mode to NONE to use BTI-formatted SSTables. See the Cassandra documentation for more information.

You can also reference the cassandra_latest.yaml  file for the latest settings (please note you should not apply these to existing clusters without rigorous testing). 

May 2024: Major Infrastructure Milestone 

We hit a very large infrastructure milestone when we released an upgrade to some of our core agents that were reliant on an older version of the Apache Cassandra Java driver. The upgrade to version 4.17 allowed us to start supporting vectors in certain keyspace level monitoring operations.  

At the time, this was considered the riskiest part of the entire project, as we had thousands of nodes to upgrade across many different customer environments. This upgrade took a few weeks, finishing in June. We broke the release up into 4 separate rollouts to reduce the risk of introducing issues into our fleet, focusing on a single key component of our architecture in each release. Each release had quality gates and tested rollback plans, which in the end were not needed.

June 2024: Successful Rollout of the New Cassandra Driver 

The Java driver upgrade project was rolled out to all nodes in our fleet, and no issues were encountered. At this point we had hit all the major milestones before Release Candidates became available. We started to look at updating our testing systems to use Apache Cassandra 5 by default.

July 2024: Path to Release Candidate 

We upgraded our internal testing systems to use Cassandra 5 by default, meaning our nightly platform tests began running against Cassandra 5 clusters and our production releases would smoke test against Apache Cassandra 5. We also started testing the upgrade path for clusters from 4.x to 5.0, which resulted in another small contribution to the Cassandra project.

The Apache Cassandra project released Apache Cassandra 5 Release Candidate 1 (RC1), and we launched RC1 into public preview on the Instaclustr Platform. 

The Road Ahead to General Availability 

We’ve just launched Apache Cassandra 5 Release Candidate 1 (RC1) into public preview, and there’s still more to do before we reach General Availability for Cassandra 5, including: 

  • Upgrading our own preproduction Apache Cassandra for internal use to Apache Cassandra 5 Release Candidate 1. This means we’ll be testing using our real-world use cases and testing our upgrade procedures on live infrastructure. 

At Launch: 

When Apache Cassandra 5.0 launches, we will perform another round of testing, including performance benchmarking. We will also upgrade our internal metrics storage production Apache Cassandra clusters to 5.0, and, if the results are satisfactory, we will mark the release as generally available for our customers. We want to have full confidence in running 5.0 before we recommend it for production use to our customers.  

For more information about our own usage of Cassandra for storing metrics on the Instaclustr Platform check out our series on Monitoring at Scale.  

What Have We Learned From This Project? 

  • Releasing limited, small and frequent changes has resulted in a smooth project, even if sometimes frequent releases do not feel smooth. Some thoughts: 
    • Releasing to a small subset of internal users allowed us to take risks and break things more often so we could learn from our failures safely.
    • Releasing small changes allowed us to more easily understand and predict the behaviour of our changes: what to look out for in case things went wrong, how to more easily measure success, etc. 
    • Releasing frequently built confidence within the wider Instaclustr team, which in turn meant we would be happier taking more risks and could release more often.  
  • Releasing to internal and public preview helped create momentum within the Instaclustr business and teams:  
    • This turned the Apache Cassandra 5.0 release from something that “was coming soon and very exciting” to “something I can actually use.”
  • Communicating frequently, transparently, and efficiently is the foundation of success:  
    • We used a dedicated Slack channel (very creatively named #cassandra-5-project) to discuss everything. 
    • It was quick and easy to go back to see why we made certain decisions or revisit them if needed. This had a bonus of allowing a Lead Engineer to write a blog post very quickly about the Cassandra 5 project. 

This has been a long-running but very exciting project for the entire team here at Instaclustr. The Apache Cassandra community is on the home stretch of this massive release, and we couldn’t be more excited to start seeing what everyone will build with it.

You can sign up today for a free trial and test Apache Cassandra 5 Release Candidate 1 by creating a cluster on the Instaclustr Managed Platform.  


The post Apache Cassandra® 5.0: Behind the Scenes appeared first on Instaclustr.

How to Model Leaderboards for 1M Player Game with ScyllaDB

Ever wondered how a game like League of Legends, Fortnite, or even Rockband models its leaderboards? In this article, we’ll explore how to properly model a schema for leaderboards… using a monstrously fast database (ScyllaDB)!

1. Prologue

Ever since I was a kid, I’ve been fascinated by games and how they’re made. My favorite childhood game was Guitar Hero 3: Legends of Rock. Well, more than a decade later, I decided to try to contribute to some games in the open source environment, like rust-ro (Rust Ragnarok Emulator) and YARG (Yet Another Rhythm Game).

YARG is another rhythm game, but this project is completely open source. It unites legendary contributors in game development and design. The game was being picked up and played mostly by Guitar Hero/Rockband streamers on Twitch. I thought: well, it’s an open-source project, so maybe I can use my database skills to create a monstrously fast leaderboard for storing past games. It started as a simple chat on their Discord, then turned into a long discussion about how to make this project grow faster. Ultimately, I decided to contribute to it by building a leaderboard with ScyllaDB. In this blog, I’ll show you some code and concepts!

2. Query-Driven Data Modeling

With NoSQL, you should first understand which query you want to run, depending on the paradigm (document, graph, wide-column, etc.). Focus on the query and create your schema based on that query. In this project, we will handle two types of paradigms:

  • Key-Value
  • Wide Column (Clusterization)

Now let’s talk about the queries/features of our modeling.

2.1 Feature: Storing the matches

Every time you finish a YARG gameplay, you want to submit your scores plus other in-game metrics. Basically, it will be a single query based on a main index.

SELECT score, stars, missed_notes, instrument, ...
FROM leaderboard.submissions
WHERE submission_id = 'some-uuid-here-omg'

2.2 Feature: Leaderboard

And now our main goal: a super cool leaderboard that you don’t need to worry about after you perform good data modeling. The leaderboard is per song: every time you play a specific song, your best score will be saved and ranked. The interface has filters that dictate exactly which leaderboard to bring:

  • song_id: required
  • instrument: required
  • modifiers: required
  • difficulty: required
  • player_id: optional
  • score: optional

Imagine our query looks like this, and it returns the results sorted by score in descending order:

SELECT player_id, score, ...
FROM leaderboard.song_leaderboard
WHERE instrument = 'guitar' AND
      difficulty = 'expert' AND
      modifiers = {'none'} AND
      track_id = 'dani-california'
LIMIT 100;

-- player_id   | score
-- ------------+-------
-- tzach       | 12000
-- danielhe4rt | 10000
-- kadoodle    |  9999

Can you already imagine what the final schema will look like? No? Ok, let me help you with that!

3. Data Modeling time!

It’s time to take a deep dive into data modeling with ScyllaDB and better understand how to scale it.

3.1 – Matches Modeling

First, let’s understand a little more about the game itself:

  • It’s a rhythm game;
  • You play a certain song at a time;
  • You can activate “modifiers” to make your life easier or harder before the game;
  • You must choose an instrument (e.g. guitar, drums, bass, and microphone).

Every aspect of the gameplay is tracked, such as:

  • Score;
  • Missed notes;
  • Overdrive count;
  • Play speed (1.5x ~ 1.0x);
  • Date/time of gameplay;
  • And other cool stuff.

Thinking about that, let’s start our data modeling.

It will turn into something like this:

CREATE TABLE IF NOT EXISTS leaderboard.submissions (
    submission_id uuid,
    track_id text,
    player_id text,
    modifiers frozen<set<text>>,
    score int,
    difficulty text,
    instrument text,
    stars int,
    accuracy_percentage float,
    missed_count int,
    ghost_notes_count int,
    max_combo_count int,
    overdrive_count int,
    speed int,
    played_at timestamp,
    PRIMARY KEY (submission_id, played_at)
);

Let’s skip all the int/text values and jump to the set<text>. The set type allows you to store a list of items of a particular type. I decided to use this list to store the modifiers because it’s a perfect fit. Look at how the queries are executed:

INSERT INTO leaderboard.submissions (
    submission_id,
    track_id,
    modifiers,
    played_at
) VALUES (
    some-cool-uuid-here,
    'starlight-muse',
    {'all-taps', 'hell-mode', 'no-hopos'},
    '2024-01-01 00:00:00'
);

With this type, you can easily store a list of items to retrieve later. Another cool piece of information is that this query is key-value like! What does that mean? Since you will always query it by the submission_id only, it can be categorized as key-value.
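
From the application side, writing one match with the Python driver looks roughly like the sketch below. This is a hypothetical illustration rather than code from the YARG project: it assumes the schema above and a locally reachable cluster (ScyllaDB speaks CQL, so the cassandra-driver package works against it).

# Hypothetical application-side write of one YARG submission (Python driver).
import uuid
from datetime import datetime, timezone
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("leaderboard")

insert = session.prepare("""
    INSERT INTO submissions (submission_id, track_id, player_id, modifiers,
                             score, difficulty, instrument, played_at)
    VALUES (?, ?, ?, ?, ?, ?, ?, ?)
""")

session.execute(insert, (
    uuid.uuid4(), "starlight-muse", "danielhe4rt",
    {"all-taps", "hell-mode"},          # Python set binds to the set<text> column
    133_700, "expert", "guitar",
    datetime.now(timezone.utc),
))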

Let’s take a look at the final modeling and talk about the clustering keys and the application layer:

CREATE TABLE IF NOT EXISTS leaderboard.song_leaderboard (
    submission_id uuid,
    track_id text,
    player_id text,
    modifiers frozen<set<text>>,
    score int,
    difficulty text,
    instrument text,
    stars int,
    accuracy_percentage float,
    missed_count int,
    ghost_notes_count int,
    max_combo_count int,
    overdrive_count int,
    speed int,
    played_at timestamp,
    PRIMARY KEY ((track_id, modifiers, difficulty, instrument), score, player_id)
) WITH CLUSTERING ORDER BY (score DESC, player_id ASC);

The partition key was defined as mentioned above, consisting of our REQUIRED PARAMETERS: track_id, modifiers, difficulty and instrument. For the clustering keys, we added score and player_id. Note that by default the clustering fields are ordered by score DESC, and just in case two players have the same score, the tie-break criterion is alphabetical ¯\(ツ)/¯.

First, it’s good to understand that we will have only ONE SCORE PER PLAYER. But with this modeling, if the player goes through the same track twice with different scores, it will generate two different entries:

INSERT INTO leaderboard.song_leaderboard (
    track_id, player_id, modifiers, score, difficulty, instrument, played_at
) VALUES (
    'starlight-muse', 'daniel-reis', {'none'}, 133700, 'expert', 'guitar', '2023-11-23 00:00:00'
);

INSERT INTO leaderboard.song_leaderboard (
    track_id, player_id, modifiers, score, difficulty, instrument, played_at
) VALUES (
    'starlight-muse', 'daniel-reis', {'none'}, 123700, 'expert', 'guitar', '2023-11-23 00:00:00'
);

SELECT player_id, score
FROM leaderboard.song_leaderboard
WHERE instrument = 'guitar' AND
      difficulty = 'expert' AND
      modifiers = {'none'} AND
      track_id = 'starlight-muse'
LIMIT 2;

-- player_id   | score
-- ------------+--------
-- daniel-reis | 133700
-- daniel-reis | 123700

So how do we fix this problem? Well, it’s not a problem per se. It’s a feature! As a developer, you have to create your own business rules based on the project’s needs, and this is no different. What do I mean by that? You can run a simple DELETE query before inserting the new entry. That guarantees that you will not keep any entry for this player_id with a lower score inside that specific group of partition keys.

-- Before inserting the new gameplay
DELETE FROM leaderboard.song_leaderboard
WHERE instrument = 'guitar' AND
      difficulty = 'expert' AND
      modifiers = {'none'} AND
      track_id = 'starlight-muse' AND
      player_id = 'daniel-reis' AND
      score <= 'your-new-score-here';

-- Now you can insert the new payload...

And with that, we finished our simple leaderboard system – the same one that runs in YARG and can also be used in games with MILLIONS of entries per second 😀

4. How to Contribute to YARG

Want to contribute to this wonderful open-source project? We’re building a brand new platform for all the players using:

  • Game: Unity3d (Repository)
  • Front-end: NextJS (Repository)
  • Back-end: Laravel 10.x (Repository)

We will need as many developers and testers as possible to discuss future implementations of the game together with the main contributors!

First, make sure to join this Discord Community. This is where all the technical discussions happen with the backing of the community before going to the development board. Also, outside of Discord, the YARG community is mostly focused on the EliteAsian (core contributor and project owner) X account for development showcases. Be sure to follow him there as well.
New replay viewer HUD for #YARG! There are still some issues with it, such as consistency, however we are planning to address them by the official stable release of v0.12. pic.twitter.com/9ACIJXAZS4 — EliteAsian (@EliteAsian123) December 16, 2023
And FYI, the Lead Artist of the game (aka Kadu) is also a Broadcast Specialist and Product Innovation Developer at Elgato who has worked with streamers like Ninja, Nadeshot, StoneMountain64, and the legendary DJ Marshmello. Kadu also uses his X account to share insights and early previews of new features and experiments for YARG. So don’t forget to follow him as well!
Here's how the replay venue looks like now, added a lot of details on the desk, really happy with the result so far, going to add a few more and start the textures pic.twitter.com/oHH27vkREe — ⚡Kadu Waengertner (@kaduwaengertner) August 10, 2023
Here are some useful links to learn more about the project:

  • Official Website
  • Github Repository
  • Task Board
Fun fact: YARG got noticed by Brian Bright, project lead on Guitar Hero, who liked the fact that the project was open source. Awesome, right?
5. Conclusion

Data modeling is sometimes challenging. This project involved learning many new concepts and a lot of testing together with my community on Twitch. I have also published a Gaming Leaderboard Demo, where you can get some insights on how to implement the same project using NextJS and ScyllaDB! Also, if you like ScyllaDB and want to learn more about it, I strongly suggest you watch our free Masterclass Courses or visit ScyllaDB University!

Will Your Cassandra Database Project Succeed?: The New Stack

Open source Apache Cassandra® continues to stand out as an enterprise-proven solution for organizations seeking high availability, scalability and performance in a NoSQL database. (And hey, the brand-new 5.0 version is only making those statements even more true!) There’s a reason this database is trusted by some of the world’s largest and most successful companies.

That said, effectively harnessing the full spectrum of Cassandra’s powerful advantages can mean overcoming a fair share of operational complexity. Some folks will find a significant learning curve, and knowing what to expect is critical to success. In my years of experience working with Cassandra, it’s when organizations fail to anticipate and respect these challenges that they set the stage for their Cassandra projects to fall short of expectations.

Let’s look at the key areas where strong project management and following proven best practices will enable teams to evade common pitfalls and ensure a Cassandra implementation is built strong from Day 1.

Accurate Data Modeling Is a Must

Cassandra projects require a thorough understanding of its unique data model principles. Teams that approach Cassandra like a relational database are unlikely to model data properly. This can lead to poor performance, excessive use of secondary indexes and significant data consistency issues.

On the other hand, teams that develop familiarity with Cassandra’s specific NoSQL data model will understand the importance of including partition keys, clustering keys and denormalization. These teams will know to closely analyze query and data access patterns associated with their applications and know how to use that understanding to build a Cassandra data model that matches their application’s needs step for step.

The post Will Your Cassandra Database Project Succeed?: The New Stack appeared first on Instaclustr.

How ShareChat Scaled their ML Feature Store 1000X without Scaling the Database

How ShareChat successfully scaled 1000X without scaling the underlying database (ScyllaDB)

The demand for low-latency machine learning feature stores is higher than ever, but actually implementing one at scale remains a challenge. That became clear when ShareChat engineers Ivan Burmistrov and Andrei Manakov took the P99 CONF 23 stage to share how they built a low-latency ML feature store based on ScyllaDB.

This isn’t a tidy case study where adopting a new product saves the day. It’s a “lessons learned” story, a look at the value of relentless performance optimization – with some important engineering takeaways. The original system implementation fell far short of the company’s scalability requirements. The ultimate goal was to support 1 billion features per second, but the system failed under a load of just 1 million. With some smart problem solving, the team pulled it off though. Let’s look at how their engineers managed to pivot from the initial failure to meet their lofty performance goal without scaling the underlying database.

Obsessed with performance optimizations and low-latency engineering? Join your peers at P99 CONF 24, a free, highly technical virtual conference on “all things performance.” Speakers include:

  • Michael Stonebraker, Postgres creator and MIT professor
  • Bryan Cantrill, Co-founder and CTO of Oxide Computer
  • Avi Kivity, KVM creator, ScyllaDB co-founder and CTO
  • Liz Rice, Chief open source officer with eBPF specialists Isovalent
  • Andy Pavlo, CMU professor
  • Ashley Williams, Axo founder/CEO, former Rust core team, Rust Foundation founder
  • Carl Lerche, Tokio creator, Rust contributor and engineer at AWS

Register Now – It’s Free

In addition to another great talk by Ivan from ShareChat, expect more than 60 engineering talks on performance optimizations at Disney/Hulu, Shopify, Lyft, Uber, Netflix, American Express, Datadog, Grafana, LinkedIn, Google, Oracle, Redis, AWS, ScyllaDB and more. Register for free.

ShareChat: India’s Leading Social Media Platform

To understand the scope of the challenge, it’s important to know a little about ShareChat, the leading social media platform in India. On the ShareChat app, users discover and consume content in more than 15 different languages, including videos, images, songs and more. ShareChat also hosts a TikTok-like short video platform (Moj) that encourages users to be creative with trending tags and contests. Between the two applications, they serve a rapidly growing user base that already has over 325 million monthly active users. And their AI-based content recommendation engine is essential for driving user retention and engagement.

Machine learning feature stores at ShareChat

This story focuses on the system behind the ML feature stores for the short-form video app Moj. It offers fully personalized feeds to around 20 million daily active users and 100 million monthly active users. Feeds serve 8,000 requests per second, and there’s an average of 2,000 content candidates being ranked on each request (for example, to find the 10 best items to recommend). “Features” are pretty much anything that can be extracted from the data. Ivan Burmistrov, principal staff software engineer at ShareChat, explained: “We compute features for different ‘entities.’ Post is one entity, User is another and so on. From the computation perspective, they’re quite similar. However, the important difference is in the number of features we need to fetch for each type of entity. When a user requests a feed, we fetch user features for that single user.
However, to rank all the posts, we need to fetch features for each candidate (post) being ranked, so the total load on the system generated by post features is much larger than the one generated by user features. This difference plays an important role in our story.” What went wrong At first, the primary focus was on building a real-time user feature store because, at that point, user features were most important. The team started to build the feature store with that goal in mind. But then priorities changed and post features became the focus too. This shift happened because the team started building an entirely new ranking system with two major differences versus its predecessor: Near real-time post features were more important The number of posts to rank increased from hundreds to thousands Ivan explained: “When we went to test this new system, it failed miserably. At around 1 million features per second, the system became unresponsive, latencies went through the roof and so on.” Ultimately, the problem stemmed from how the system architecture used pre-aggregated data buckets called tiles. For example, they can aggregate the number of likes for a post in a given minute or other time range. This allows them to compute metrics like the number of likes for multiple posts in the last two hours. Here’s a high-level look at the system architecture. There are a few real-time topics with raw data (likes, clicks, etc.). A Flink job aggregates them into tiles and writes them to ScyllaDB. Then there’s a feature service that requests tiles from ScyllaDB, aggregates them and returns results to the feed service. The initial database schema and tiling configuration led to scalability problems. Originally, each entity had its own partition, with rows timestamp and feature name being ordered clustering columns. [Learn more in this NoSQL data modeling masterclass]. Tiles were computed for segments of one minute, 30 minutes and one day. Querying one hour, one day, seven days or 30 days required fetching around 70 tiles per feature on average. If you do the math, it becomes clear why it failed. The system needed to handle around 22 billion rows per second. However, the database capacity was only 10 million rows/sec.   Initial optimizations At that point, the team went on an optimization mission. The initial database schema was updated to store all feature rows together, serialized as protocol buffers for a given timestamp. Because the architecture was already using Apache Flink, the transition to the new tiling schema was fairly easy, thanks to Flink’s advanced capabilities in building data pipelines. With this optimization, the “Features” multiplier has been removed from the equation above, and the number of required rows to fetch has been reduced by 100X: from around 2 billion to 200 million rows/sec. The team also optimized the tiling configuration, adding additional tiles for five minutes, three hours and five days to one minute, 30 minutes and one day tiles. This reduced the average required tiles from 70 to 23, further reducing the rows/sec to around 73 million. To handle more rows/sec on the database side, they changed the ScyllaDB compaction strategy from incremental to leveled. [Learn more about compaction strategies]. That option better suited their query patterns, keeping relevant rows together and reducing read I/O. The result: ScyllaDB’s capacity was effectively doubled. The easiest way to accommodate the remaining load would have been to scale ScyllaDB 4x. 
However, more/larger clusters would increase costs and that simply wasn’t in their budget. So the team continued focusing on improving the scalability without scaling up the ScyllaDB cluster. Improved cache locality One potential way to reduce the load on ScyllaDB was to improve the local cache hit rate, so the team decided to research how this could be achieved. The obvious choice was to use a consistent hashing approach, a well-known technique to direct a request to a certain replica from the client based on some information about the request. Since the team was using NGINX Ingress in their Kubernetes setup, using NGINX’s capabilities for consistent hashing seemed like a natural choice. Per NGINX Ingress documentation, setting up consistent hashing would be as simple as adding three lines of code. What could go wrong? A bit, as it turned out. This simple configuration didn’t work. Specifically: The client subset led to a huge key remapping – up to 100% in the worst case. Since the node keys can change in a hash ring, it was impossible to support real-life scenarios with autoscaling. [See the ingress implementation] It was tricky to provide a hash value for a request because Ingress doesn’t support the most obvious solution: a gRPC header. The latency suffered severe degradation, and it was unclear what was causing the tail latency. To support a subset of the pods, the team modified their approach. They created a two-step hash function: first hashing an entity, then adding a random prefix. That distributed the entity across the desired number of pods. In theory, this approach could cause a collision when an entity is mapped to the same pod several times. However, the risk is low given the large number of replicas. Ingress doesn’t support using a gRPC header as a variable, but the team found a workaround: using path rewriting and providing the required hash key in the path itself. The solution was admittedly a bit “hacky” … but it worked. Unfortunately, pinpointing the cause of latency degradation would have required considerable time, as well as observability improvements. A different approach was needed to scale the feature store in time. To meet the deadline, the team split the Feature service into 27 different services and manually split all entities between them on the client. It wasn’t the most elegant approach, but it was simple and practical – and it achieved great results. The cache hit rate improved to 95% and the ScyllaDB load was reduced to 18.4 million rows per second. With this design, ShareChat scaled its feature store to 1B features per second by March. However, this “old school” deployment-splitting approach still wasn’t the ideal design. Maintaining 27 deployments was tedious and inefficient. Plus, the cache hit rate wasn’t stable, and scaling was limited by having to keep a high minimum pod count in every deployment. So even though this approach technically met their needs, the team continued their search for a better long-term solution. The next phase of optimizations: consistent hashing, Feature service Ready for yet another round of optimization, the team revisited the consistent hashing approach using a sidecar, called Envoy Proxy, deployed with the feature service. Envoy Proxy provided better observability, which helped identify the latency tail issue. The problem: different request patterns to the Feature service caused a huge load on the gRPC layer and cache. That led to extensive mutex contention. The team then optimized the Feature service. 
They: Forked the caching library (FastCache from VictoriaMetrics) and implemented batch writes and better eviction to reduce mutex contention by 100x. Forked grpc-go and implemented a buffer pool across different connections to avoid contention during high parallelism. Used object pooling and tuned garbage collector (GC) parameters to reduce allocation rates and GC cycles. With Envoy Proxy handling 15% of traffic in their proof-of-concept, the results were promising: a 98% cache hit rate, which reduced the load on ScyllaDB to 7.4M rows/sec. They could even scale the feature store more: from 1 billion features/second to 3 billion features/second. Lessons learned Here’s what this journey looked like from a timeline perspective: To close, Andrei summed up the team’s top lessons learned from this project (so far): Use proven technologies. Even as the ShareChat team drastically changed their system design, ScyllaDB, Apache Flink and VictoriaMetrics continued working well. Each optimization is harder than the previous one – and has less impact. Simple and practical solutions (such as splitting the feature store into 27 deployments) do indeed work. The solution that delivers the best performance isn’t always user-friendly. For instance, their revised database schema yields good performance, but is difficult to maintain and understand. Ultimately, they wrote some tooling around it to make it simpler to work with. Every system is unique. Sometimes you might need to fork a default library and adjust it for your specific system to get the best performance. Watch their complete P99 CONF talk
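
To make the two-step hashing idea from the cache locality discussion above more concrete, here is a minimal sketch in Python. It is not ShareChat's implementation; the subset size, hash function, and pod-selection details are assumptions for illustration only.

import hashlib
import random

def stable_hash(value: str) -> int:
    # A deterministic hash (unlike Python's built-in hash(), which is salted per process)
    return int(hashlib.sha256(value.encode()).hexdigest(), 16)

def pick_pod(entity_id: str, pods: list[str], subset_size: int = 3) -> str:
    # Step 1: hash the entity so it always maps to the same small group of pods.
    # Step 2: add a random prefix so requests for that entity spread across the group.
    salt = random.randrange(subset_size)
    key = f"{salt}#{stable_hash(entity_id)}"
    return pods[stable_hash(key) % len(pods)]

pods = [f"feature-service-{i}" for i in range(32)]
# Repeated requests for the same post land on at most `subset_size` distinct pods,
# which keeps that post's tiles hot in those pods' local caches.
print({pick_pod("post:42", pods) for _ in range(1000)})

As the article notes, two salt values can occasionally map the same entity to the same pod; with a large number of replicas, the impact of such collisions is small.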

Simplifying Cassandra and DynamoDB Migrations with the ScyllaDB Migrator

Learn about the architecture of ScyllaDB Migrator, how to use it, recent developments, and upcoming features. ScyllaDB offers both a CQL-compatible API and a DynamoDB-compatible API, allowing applications that use Apache Cassandra or DynamoDB to take advantage of reduced costs and lower latencies with minimal code changes. We previously described the two main migration strategies: cold and hot migrations. In both cases, you need to backfill ScyllaDB with historical data. Either can be efficiently achieved with the ScyllaDB Migrator. In this blog post, we will provide an update on its status. You will learn about its architecture, how to use it, recent developments, and upcoming features. The Architecture of the ScyllaDB Migrator The ScyllaDB Migrator leverages Apache Spark to migrate terabytes of data in parallel. It can migrate data from various types of sources, as illustrated in the following diagram: We initially developed it to migrate from Apache Cassandra, but we have since added support for more types of data sources. At the time of writing, the Migrator can migrate data from either: A CQL-compatible source: An Apache Cassandra table. Or a Parquet file stored locally or on Amazon S3. Or a DynamoDB-compatible source: A DynamoDB table. Or a DynamoDB table export on Amazon S3. What’s so interesting about ScyllaDB Migrator? Since it runs as an Apache Spark application, you can adjust its throughput by scaling the underlying Spark cluster. It is designed to be resilient to read or write failures. If it stops prior to completion, the migration can be restarted from where it left off. It can rename item columns along the way. When migrating from DynamoDB, the Migrator can endlessly replicate new changes to ScyllaDB. This is useful for hot migration strategies. How to Use the ScyllaDB Migrator More details are available in the official Migrator documentation. The main steps are: Set Up Apache Spark: There are several ways to set up an Apache Spark cluster, from using a pre-built image on AWS EMR to manually following the official Apache Spark documentation to using our automated Ansible playbook on your own infrastructure. You may also use Docker to run a cluster on a single machine. Prepare the Configuration File: Create a YAML configuration file that specifies the source database, target ScyllaDB cluster, and any migration option. Run the Migrator: Execute the ScyllaDB Migrator using the spark-submit command. Pass the configuration file as an argument to the migrator. Monitor the Migration: The Spark UI provides logs and metrics to help you monitor the migration process. You can track the progress and troubleshoot any issues that arise. You should also monitor the source and target databases to check whether they are saturated or not. Recent Developments The ScyllaDB Migrator has seen several significant improvements, making it more versatile and easier to use: Support for Reading DynamoDB S3 Exports: You can now migrate data from DynamoDB S3 exports directly to ScyllaDB, broadening the range of sources you can migrate from. PR #140. AWS AssumeRole Authentication: The Migrator now supports AWS AssumeRole authentication, allowing for secure access to AWS resources during the migration process. PR #150. Schema-less DynamoDB Migrations: By adopting a schema-less approach, the Migrator enhances reliability when migrating to ScyllaDB Alternator, ScyllaDB’s DynamoDB-compatible API. PR #105. 
Dedicated Documentation Website: The Migrator’s documentation is now available on a proper website, providing comprehensive guides, examples, and throughput tuning tips. PR #166. Update to Spark 3.5 and Scala 2.13: The Migrator has been updated to support the latest versions of Spark and Scala, ensuring compatibility and leveraging the latest features and performance improvements. PR #155. Ansible Playbook for Spark Cluster Setup: An Ansible playbook is now available to automate the setup of a Spark cluster, simplifying the initial setup process. PR #148. Publish Pre-built Assemblies: You don’t need to manually build the Migrator from the source anymore. Download the latest release and pass it to the spark-submit command. PR #158. Strengthened Continuous Integration: We have set up a testing infrastructure that reduces the risk of introducing regressions and prevents us from breaking backward compatibility. PRs #107, #121, #127. Hands-on Migration Example The content of this section has been extracted from the documentation website. The original content is kept up to date. Let’s go through a migration example to illustrate some of the points listed above. We will perform a cold migration to replicate 1,000,000 items from a DynamoDB table to ScyllaDB Alternator. The whole system is composed of the DynamoDB service, a Spark cluster with a single worker node, and a ScyllaDB cluster with a single node, as illustrated below: To make it easier for interested readers to follow along, we will create all those services using Docker. All you need is the AWS CLI and Docker. The example files can be found at  https://github.com/scylladb/scylla-migrator/tree/b9be9fb684fb0e51bf7c8cbad79a1f42c6689103/docs/source/tutorials/dynamodb-to-scylladb-alternator Set Up the Services and Populate the Source Database We use Docker Compose to define each service. Our docker-compose.yml file looks as follows: Let’s break down this Docker Compose file. We define the DynamoDB service by reusing the official image amazon/dynamodb-local. We use the TCP port 8000 for communicating with DynamoDB. We define the Spark master and Spark worker services by using a custom image (see below). Indeed, the official Docker images for Spark 3.5.1 only support Scala 2.12 for now, but we need Scala 2.13. We mount the local directory ./spark-data to the Spark master container path /app so that we can supply the Migrator jar and configuration to the Spark master node. We expose the ports 8080 and 4040 of the master node to access the Spark UIs from our host environment. We allocate 2 cores and 4 GB of memory to the Spark worker node. As a general rule, we recommend allocating 2 GB of memory per core on each worker. We define the ScyllaDB service by reusing the official image scylladb/scylla. We use the TCP port 8001 for communicating with ScyllaDB Alternator. The Spark services rely on a local Dockerfile located at path ./dockerfiles/spark/Dockerfile. For the sake of completeness, here is the content of this file, which you can copy-paste: And here is the entry point used by the image, which needs to be executable: This Docker image installs Java and downloads the official Spark release. The entry point of the image takes an argument that can be either master or worker to control whether to start a master node or a worker node. 
Prepare your system for building the Spark Docker image with the following commands: mkdir spark-data chmod +x entrypoint.sh Finally, start all the services with the following command: docker compose up Your system’s Docker daemon will download the DynamoDB and ScyllaDB images and build our Spark Docker image. Check that you can access the Spark cluster UI by opening http://localhost:8080 in your browser. You should see your worker node in the workers list. Once all the services are up, you can access your local DynamoDB instance and your local ScyllaDB instance by using the standard AWS CLI. Make sure to configure the AWS CLI as follows before running the dynamodb commands: # Set dummy region and credentials aws configure set region us-west-1 aws configure set aws_access_key_id dummy aws configure set aws_secret_access_key dummy # Access DynamoDB aws --endpoint-url http://localhost:8000 dynamodb list-tables # Access ScyllaDB Alternator aws --endpoint-url http://localhost:8001 dynamodb list-tables The last preparatory step consists of creating a table in DynamoDB and filling it with random data. Create a file named create-data.sh, make it executable, and write the following content into it: This script creates a table named Example and adds 1 million items to it. It does so by invoking another script, create-25-items.sh, that uses the batch-write-item command to insert 25 items in a single call: Every added item contains an id and five columns, all filled with random data. Run the script: ./create-data.sh and wait for a couple of hours until all the data is inserted (or change the last line of create-data.sh to insert fewer items and make the demo faster). Perform the Migration Once you have set up the services and populated the source database, you are ready to perform the migration. Download the latest stable release of the Migrator into the spark-data directory: wget https://github.com/scylladb/scylla-migrator/releases/latest/download/scylla-migrator-assembly.jar \ --directory-prefix=./spark-data Create a configuration file in spark-data/config.yaml and write the following content: This configuration tells the Migrator to read the items from the table Example in the dynamodb service, and to write them to the table of the same name in the scylla service. Finally, start the migration with the following command: docker compose exec spark-master \ /spark/bin/spark-submit \ --executor-memory 4G \ --executor-cores 2 \ --class com.scylladb.migrator.Migrator \ --master spark://spark-master:7077 \ --conf spark.driver.host=spark-master \ --conf spark.scylla.config=/app/config.yaml \ /app/scylla-migrator-assembly.jar This command calls spark-submit in the spark-master service with the file scylla-migrator-assembly.jar, which bundles the Migrator and all its dependencies. In the spark-submit command invocation, we explicitly tell Spark to use 4 GB of memory; otherwise, it would default to 1 GB only. We also explicitly tell Spark to use 2 cores. This is not really necessary as the default behavior is to use all the available cores, but we set it for the sake of illustration. If the Spark worker node had 20 cores, it would be better to use only 10 cores per executor to optimize the throughput (big executors require more memory management operations, which decrease the overall application performance). We would achieve this by passing --executor-cores 10, and the Spark engine would allocate two executors for our application to fully utilize the resources of the worker node. 
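
Going back to the data-population step for a moment: if you prefer Python over shell, here is a rough, hypothetical equivalent of the create-25-items.sh idea, using boto3 against the local DynamoDB endpoint from this tutorial. The column names other than id are invented for illustration; the tutorial's actual script uses the AWS CLI batch-write-item command.

import random
import string
import boto3

client = boto3.client(
    "dynamodb",
    endpoint_url="http://localhost:8000",  # the local DynamoDB service from docker-compose.yml
    region_name="us-west-1",
    aws_access_key_id="dummy",
    aws_secret_access_key="dummy",
)

def rand(n: int = 12) -> str:
    return "".join(random.choices(string.ascii_letters + string.digits, k=n))

def put_25_random_items(table: str = "Example") -> None:
    # batch_write_item accepts at most 25 put/delete requests per call
    requests = [
        {"PutRequest": {"Item": {
            "id": {"S": rand()},
            **{f"col{i}": {"S": rand()} for i in range(1, 6)},  # five random columns (names assumed)
        }}}
        for _ in range(25)
    ]
    response = client.batch_write_item(RequestItems={table: requests})
    # A production script should retry anything reported back in UnprocessedItems
    if response.get("UnprocessedItems"):
        print("some items were not written:", response["UnprocessedItems"])

put_25_random_items()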
The migration process inspects the source table, replicates its schema to the target database if it does not exist, and then migrates the data. The data migration uses the Hadoop framework under the hood to leverage the Spark cluster resources. The migration process breaks down the data to transfer chunks of about 128 MB each, and processes all the partitions in parallel. Since the source is a DynamoDB table in our example, each partition translates into a scan segment to maximize the parallelism level when reading the data. Here is a diagram that illustrates the migration process: During the execution of the command, a lot of logs are printed, mostly related to Spark scheduling. Still, you should be able to spot the following relevant lines: 24/07/22 15:46:13 INFO migrator: ScyllaDB Migrator 0.9.2 24/07/22 15:46:20 INFO alternator: We need to transfer: 2 partitions in total 24/07/22 15:46:20 INFO alternator: Starting write… 24/07/22 15:46:20 INFO DynamoUtils: Checking for table existence at destination And when the migration ends, you will see the following line printed: 24/07/22 15:46:24 INFO alternator: Done transferring table snapshot During the migration, it is possible to monitor the underlying Spark job by opening the Spark UI available at http://localhost:4040 Example of a migration broken down in 6 tasks. The Spark UI allows us to follow the overall progress, and it can also show specific metrics such as the memory consumption of an executor. In our example the size of the source table is ~200 MB. In practice, it is common to migrate tables containing several terabytes of data. If necessary, and as long as your DynamoDB source supports a higher read throughput level, you can increase the migration throughput by adding more Spark worker nodes. The Spark engine will automatically spread the workload between all the worker nodes. Future Enhancements The ScyllaDB team is continuously improving the Migrator. Some of the upcoming features include: Support for Savepoints with DynamoDB Sources: This will allow users to resume the migration from a specific point in case of interruptions. This is currently supported with Cassandra sources only. Shard-Aware ScyllaDB Driver: The Migrator will fully take advantage of ScyllaDB’s specific optimizations for even faster migrations. Support for SQL-based Sources: For instance, migrate from MySQL to ScyllaDB. Conclusion Thanks to the ScyllaDB Migrator, migrating data to ScyllaDB has never been easier. With its robust architecture, recent enhancements, and active development, the migrator is an indispensable tool for ensuring a smooth and efficient migration process. For more information, check out the ScyllaDB Migrator lesson on ScyllaDB University. Another useful resource is the official ScyllaDB Migrator documentation. Are you using the Migrator? Any specific feature you’d like to see? For any questions about your specific use case or about the Migrator in general, tap into the community knowledge on the ScyllaDB Community Forum.

Inside ScyllaDB’s Continuous Optimizations for Reducing P99 Latency

How the ScyllaDB Engineering team reduced latency spikes during administrative operations through continuous monitoring and rigorous testing In the world of databases, smooth and efficient operation is crucial. However, both ScyllaDB and its predecessor Cassandra have historically encountered challenges with latency spikes during administrative operations such as repair, backup, node addition, decommission, replacement, upgrades, compactions etc.. This blog post shares how the ScyllaDB Engineering team embraced continuous improvement to tackle these challenges head-on. Protecting Performance by Measuring Operational Latency Understanding and improving the performance of a database system like ScyllaDB involves continuous monitoring and rigorous testing. Each week, our team tackles this challenge by measuring performance under three types of workload scenarios: write, read, and mixed (50% read/write). We focus specifically on operational latency: how the system performs during typical and intensive operations like repair, node addition, node termination, decommission or upgrade. Our Measurement Methodology To ensure accurate results, we preload each cluster with data at a 10:1 data-to-memory ratio—equivalent to inserting 650GB on 64GB memory instances. Our benchmarks begin by recording the latency during a steady state to establish a baseline before initiating various cluster operations. We follow a strict sequence during testing: Preload data to simulate real user environments. Baseline latency measurement for a stable reference point. Sequential operational tests involving: Repair operations via Scylla Manager. Addition of three new nodes. Termination and replacement of a node. Decommissioning of three nodes. Latency is our primary metric; if it exceeds 15ms, we immediately start investigating it. We also monitor CPU instructions per operation and track reactor stalls, which are critical for understanding performance bottlenecks. How We Measure Latency Measuring latency effectively requires looking beyond the time it takes for ScyllaDB to process a command. We consider the entire lifecycle of a request: Response time: The time from the moment the query is initiated to when the response is delivered back to the client. Advanced metrics: We utilize High Dynamic Range (HDR) Histograms to capture and analyze latency from each cassandra-stress worker. This ensures we can compute a true representation of latency percentiles rather than relying on simple averages. Results from these tests are meticulously compiled and compared with previous runs. This not only helps us detect any performance degradation but also highlights improvements. It keeps the entire team informed through detailed reports that include operation durations and latency breakdowns for both reads and writes. Better Metrics, Better Performance When we started to verify performance regularly, we mostly focused on the latencies. At that time, reports lacked many details (like HDR results), but were sufficient to identify performance issues. These included high latency when decommissioning a node, or issues with latencies during the steady state. Since then, we have optimized our testing approach to include more – and more detailed – metrics. This enables us to spot emerging performance issues sooner and root out the culprit faster. The improved testing approach has been a valuable tool, providing fast and precise feedback on how well (or not) our product optimization strategies are actually working in practice. 
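
The difference between averaging per-worker latency numbers and merging the underlying samples is easy to underestimate. The sketch below is not our test harness; it simulates a few stress workers with made-up latency distributions, but it shows why we aggregate HDR Histogram data from every worker before reporting percentiles.

import random

def p99(samples):
    ordered = sorted(samples)
    return ordered[int(0.99 * (len(ordered) - 1))]

random.seed(42)
# Three workers hitting fast replicas, one worker hitting a slower one (made-up numbers, in ms)
workers = [[random.lognormvariate(0.7, 0.3) for _ in range(50_000)] for _ in range(3)]
workers.append([random.lognormvariate(1.6, 0.4) for _ in range(50_000)])

average_of_p99s = sum(p99(w) for w in workers) / len(workers)   # naive: average each worker's P99
merged_p99 = p99([sample for w in workers for sample in w])     # true P99 over all samples

print(f"average of per-worker P99s: {average_of_p99s:.2f} ms")
print(f"P99 over merged samples:    {merged_p99:.2f} ms")
# The merged value is what clients actually experience, which is why the per-worker
# histograms are combined before the percentile is computed.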
Total metrics Our current reports include HDR Histogram details providing a comprehensive overview of system latency throughout the entire test. Number of reactor stalls (which are pauses in processing due to overloaded conditions)  prompts immediate attention and action when they increase significantly. We take a similar approach to kernel callstacks which are logged when kernel space holds locks for too long.   Management repair After populating the cluster with data, we start our test from a full cluster repair using Scylla Manager and measure the latencies:   During this period, the P99 latency was 3.87 ms for writes and 9.41 ms for reads. In comparison, during the “steady state” (when no operations were performed), the latencies were 2.23 ms and 3.87 ms, respectively. Cluster growth After the repair, we add three nodes to the cluster and conduct a similar latency analysis:   Each cycle involves adding one node sequentially. These results provide a clear view of how latency changes and the duration required to expand the cluster. Node Termination and Replacement Following the cluster growth, one node is terminated and replaced with another. Cluster Shrinkage The test concludes with shrinking the cluster back to its initial size by decommissioning random nodes one by one.   These tests and reports are invaluable, uncovering numerous performance issues like increased latencies during decommission, detecting long reactor stalls in row cache update or short but frequent ones in sstable reader paths that lead to crucial fixes, improvements, and insights. This progress is evident in the numbers, where current latencies remain in the single-digit range under various conditions. Looking Ahead Our optimization journey is ongoing. ScyllaDB 6.0 introduced tablets, significantly accelerating cluster resizing to market-leading levels. The introduction of immediate node joining, which can start in parallel with accelerated data streaming, shows significant improvements across all metrics. With these improvements, we start measuring and optimizing not only the latencies during these operations but also the operations durations. Stay tuned for more details about these advancements soon. Our proactive approach to tackling latency issues not only improves our database performance but also exemplifies our commitment to excellence. As we continue to innovate and refine our processes, ScyllaDB remains dedicated to delivering superior database solutions that meet the evolving needs of our users.

ScyllaDB Elasticity: Demo, Theory, and Future Plans

Watch a demo of how ScyllaDB’s Raft and tablets initiatives play out with real operations on a real ScyllaDB cluster — and get a glimpse at what’s next on our roadmap. If you follow ScyllaDB, you’ve likely heard us talking about Raft and tablets initiatives for years now. (If not, read more on tablets from Avi Kivity and Raft from Kostja Osipov) You might have even seen some really cool animations. But how does it play out with real operations on a real ScyllaDB cluster? And what’s next on our roadmap – particularly in terms of user impacts? ScyllaDB Co-Founder Dor Laor and Technical Director Felipe Mendes recently got together to answer those questions. In case you missed it or want a recap of the action-packed and information-rich session, here’s the complete recording:   In case you want to skip to a specific section, here’s a breakdown of what they covered when: 4:45 ScyllaDB already scaled linearly 8:11 Tablets + Raft = elasticity, speed, simplicity, TCO 11:45 Demo time! 30:23 Looking under the hood 46:19: Looking ahead And in case you prefer to read vs watch, here are some key points… Double Double Demo After Dor shared why ScyllaDB adopted a new dynamic “tablets-based” data replication architecture for faster scaling, he passed the mic over to Felipe to show it in action. Felipe covers: Parallel scaling operations (adding and removing nodes) – speed and impact on latency How new nodes can start servicing increased demand almost instantly Dynamic load balancing based on node capacity, including automated resharding for new/different instance types The demo starts with the following initial setup: 3-node cluster running on AWS i4i.xlarge Each node processing ~17,000 operations/second System load at ~50% Here’s a quick play-by-play… Scale out: Bootstrapped 3 additional i4i.large nodes in parallel New nodes start serving traffic once the first tablets arrive, before the entire data set is received. Tablets migration complete in ~3 minutes Writes are at sub-millisecond latencies; so are read latencies once the cache warms up (in the meantime, reads go to warmed up nodes, thanks to heat-weighted load balancing) Scale up: Added 3 nodes of a larger instance size (i4i.2xlarge, with double the capacity of the original nodes) and increased the client load The larger nodes receive more tablets and service almost twice the traffic than the smaller replicas (as appropriate for their higher capacity) The expanded cluster handles over 100,000 operations/second with the potential to handle 200,000-300,000 operations/second Downscale: A total of 6 nodes were decommissioned in parallel As part of the decommission process, tablets were migrated to other replicas Only 8 minutes were required to fully decommission 6 replicas while serving traffic A Special Raft for the ScyllaDB Sea Monster Starting with the ScyllaDB 6.0 release, topology metadata is managed by the Raft protocol. The process of adding, removing, and replacing nodes is fully linearized. This contributes to parallel operations, simplicity, and correctness. Read barriers and fencing are two interesting aspects of our Raft implementation. Basically, if a node doesn’t know the most recent topology, it’s barred from responding to related queries. This prevents, for example, a node from observing an incorrect topology state in the cluster – which could result in data loss. 
It also prevents a situation where a removed node or an external node using the same cluster name could silently come back or join the cluster simply by gossiping with another replica. Another difference: Schema versions are now linearized, and use a TimeUUID to indicate the most up-to-date schema. Linearizing schema updates not only makes the operation safer; it also considerably improves performance. Previously, a schema change could take a while to propagate via gossip – especially in large cluster deployments. Now, this is gone. TimeUUIDs provide an additional safety net. Since schema versions now contain a time-based component, ScyllaDB can ensure schema versioning, which helps with: Improved visibility on conditions triggering a schema change on logs Accurately restoring a cluster backup Rejecting out-of-order schema updates Tablets relieve operational pains The latest changes simplify ScyllaDB operations in several ways: You don’t need to perform operations one by one and wait in between them; you can just initiate the operation to add or remove all the nodes you need, all at once You no longer need to cleanup after you scale the cluster Resharding (the process of changing the shard count of an existing node) is simple. Since tablets are already split on a per-shard boundary, resharding simply updates the shard ownership Managing the system_auth keyspace (for authentication) is no longer needed. All auth-related data is now automatically replicated to every node in the cluster Soon, repairs will also be automated   Expect less: typeless, sizeless, limitless ScyllaDB’s path forward from here certainly involves less: typeless, sizeless, limitless. You could be typeless. You won’t have to think about instance types ahead of time. Do you need a storage-intensive instance like the i3ens, or a throughput-intensive instance like the i4is? It no longer matters, and you can easily transition or even mix among these. You could be sizeless. That means you won’t have to worry about capacity planning when you start off. Start small and evolve from there. You could also be limitless. You could start off anticipating a high throughput and then reduce it, or you could commit to a base and add on-demand usage if you exceed it.

Use Your Data in LLMs With the Vector Database You Already Have: The New Stack

Open source vector databases are among the top options out there for AI development, including some you may already be familiar with or even have on hand.

Vector databases allow you to enhance your LLM models with data from your internal data stores. Prompting the LLM with local, factual knowledge can allow you to get responses tailored to what your organization already knows about the situation. This reduces “AI hallucination” and improves relevance.

You can even ask the LLM to add references to the original data it used in its answer so you can check yourself. No doubt vendors have reached out with proprietary vector database solutions, advertised as a “magic wand” enabling you to assuage any AI hallucination concerns.

But, ready for some good news?

If you’re already using Apache Cassandra 5.0, OpenSearch or PostgreSQL, your vector database success is already primed. That’s right: There’s no need for costly proprietary vector database offerings. If you’re not (yet) using these free and fully open source database technologies, your generative AI aspirations make now a good time to migrate — they are all enterprise-ready and avoid the pitfalls of proprietary systems.

For many enterprises, these open source vector databases are the most direct route to implementing LLMs — and possibly leveraging retrieval augmented generation (RAG) — that deliver tailored and factual AI experiences.

Vector databases store embedding vectors, which are lists of numbers representing spatial coordinates corresponding to pieces of data. Related data will have closer coordinates, allowing LLMs to make sense of complex and unstructured datasets for features such as generative AI responses and search capabilities.

RAG, a process skyrocketing in popularity, involves using a vector database to translate the words in an enterprise’s documents into embeddings to provide highly efficient and accurate querying of that documentation via LLMs.
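
To make that concrete, here is a toy sketch of the retrieval step. The embeddings are made-up numbers rather than the output of a real embedding model, and the actual LLM call is omitted.

import numpy as np

# Toy "documents" with pretend embedding vectors; in practice these come from an
# embedding model and are stored in the vector database.
documents = [
    ("Refund policy", np.array([0.12, 0.88, 0.20])),
    ("On-call rotation", np.array([0.81, 0.09, 0.35])),
]
query_embedding = np.array([0.15, 0.83, 0.25])  # embedding of the user's question

def cosine_similarity(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Retrieve the closest document and hand it to the LLM as context
best_title, _ = max(documents, key=lambda d: cosine_similarity(query_embedding, d[1]))
prompt = (
    "Answer using only the context below and cite the source document.\n"
    f"Context: [{best_title}] ...\n"
    "Question: How long do customers have to request a refund?"
)
print(prompt)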

Let’s look closer at what each open source technology brings to the vector database discussion:

Apache Cassandra 5.0 Offers Native Vector Indexing

With its latest version (currently in preview), Apache Cassandra has added to its reputation as an especially highly available and scalable open source database by including everything that enterprises developing AI applications require.

Cassandra 5.0 adds native vector indexing and vector search, as well as a new vector data type for embedding vector storage and retrieval. The new version has also added specific Cassandra Query Language (CQL) functions that enable enterprises to easily use Cassandra as a vector database. These additions make Cassandra 5.0 a smart open source choice for supporting AI workloads and executing enterprise strategies around managing intelligent data.
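
As a rough illustration of those additions (not taken from the Cassandra documentation; the keyspace, table and index names below are invented, and the exact syntax may differ between builds), a vector table and an approximate-nearest-neighbor query look roughly like this when driven from Python:

from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("demo_keyspace")

# A table with a 3-dimensional embedding column (real embeddings are much wider)
session.execute("""
    CREATE TABLE IF NOT EXISTS products (
        id int PRIMARY KEY,
        embedding vector<float, 3>
    )
""")
# A storage-attached index (SAI) enables vector search on the column
session.execute("CREATE INDEX IF NOT EXISTS ON products (embedding) USING 'sai'")

session.execute("INSERT INTO products (id, embedding) VALUES (1, [0.1, 0.2, 0.3])")
session.execute("INSERT INTO products (id, embedding) VALUES (2, [0.9, 0.8, 0.7])")

# Approximate nearest-neighbor search: rows closest to the given vector come first
rows = session.execute(
    "SELECT id FROM products ORDER BY embedding ANN OF [0.1, 0.2, 0.25] LIMIT 1"
)
print([row.id for row in rows])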

OpenSearch Provides a Combination of Benefits

Like Cassandra, OpenSearch is another highly popular open source solution, one that many folks on the lookout for a vector database happen to already be using. OpenSearch offers a one-stop shop for search, analytics and vector database capabilities, while also providing exceptional nearest-neighbor search capabilities that support vector, lexical, and hybrid search and analytics.

With OpenSearch, teams can put the pedal down on developing AI applications, counting on the database to deliver the stability, high availability and minimal latency it’s known for, along with the scalability to account for vectors into the tens of billions. Whether developing a recommendation engine, generative AI agent or any other solution where the accuracy of results is crucial, those using OpenSearch to leverage vector embeddings and stamp out hallucinations won’t be disappointed.

The pgvector Extension Makes Postgres a Powerful Vector Store

Enterprises are no strangers to Postgres, which ranks among the most used databases in the world. Given that the database only needs the pgvector extension to become a particularly performant vector database, countless organizations are just a simple deployment away from harnessing an ideal infrastructure for handling their intelligent data.

pgvector is especially well-suited to providing exact nearest-neighbor search, approximate nearest-neighbor search and distance-based embedding search, and to using cosine distance (as recommended by OpenAI), L2 distance and inner product to recognize semantic similarities. Efficiency with those capabilities makes pgvector a powerful and proven open source option for training accurate LLMs and RAG implementations, while positioning teams to deliver trustworthy AI applications they can be proud of.
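
To give a feel for how little is involved, here is a minimal pgvector sketch using psycopg2; the connection settings and table name are placeholders.

import psycopg2

conn = psycopg2.connect("dbname=demo user=postgres host=localhost")
cur = conn.cursor()

cur.execute("CREATE EXTENSION IF NOT EXISTS vector")
cur.execute("CREATE TABLE IF NOT EXISTS docs (id bigserial PRIMARY KEY, embedding vector(3))")
cur.execute("INSERT INTO docs (embedding) VALUES ('[0.1, 0.2, 0.3]'), ('[0.9, 0.1, 0.4]')")

# pgvector distance operators: <=> cosine distance, <-> L2 distance, <#> negative inner product
cur.execute("SELECT id FROM docs ORDER BY embedding <=> '[0.1, 0.2, 0.25]' LIMIT 1")
print(cur.fetchone())

conn.commit()
cur.close()
conn.close()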

Was the Answer to Your AI Challenges in Front of You All Along?

The solution to tailored LLM responses isn’t investing in some expensive proprietary vector database and then trying to dodge the very real risks of vendor lock-in or a bad fit. At least it doesn’t have to be. Recognizing that available open source vector databases are among the top options out there for AI development — including some you may already be familiar with or even have on hand — should be a very welcome revelation.

The post Use Your Data in LLMs With the Vector Database You Already Have: The New Stack appeared first on Instaclustr.

How to Visualize ScyllaDB Tables and Run Queries with DBSchema

Learn how to connect DBSchema to ScyllaDB, visualize the keyspace, and run queries While ScyllaDB power users are quite accustomed to CQL, users getting started with ScyllaDB often ask if we offer a visual interface for designing database schema and running queries. That’s why we’re excited to share that DBSchema, a visual database design and management tool, just introduced support for ScyllaDB in its most recent release. With DBSchema, teams can centralize efforts to design database schemas and run queries across all major NoSQL and SQL databases (e.g. PostgreSQL, MongoDB, and Snowflake as well as ScyllaDB). DBSchema can be a great alternative to cqlsh, ScyllaDB’s standard command line tool. For example, in cqlsh, you can use the DESCRIBE KEYSPACES command to list all keyspaces in the database: Then, you can use the DESCRIBE TABLES command to list all tables per keyspace: With DBSchema this functionality is easier. Right after connecting to a database, you can see more than the table within a keyspace; you can also see the columns and column types. Both self-hosted and Cloud versions of ScyllaDB work with DBSchema. In this post, I will show you how to connect DBSchema to ScyllaDB, visualize the keyspace, and run queries. Before you get started, download DBSchema on Windows, Mac, or Linux. The free version also has ScyllaDB support. Connect to ScyllaDB in DBSchema To create a new ScyllaDB connection in DBSchema: Download DBSchema: https://dbschema.com/download.html Click the “Connect to Database” button, search for “ScyllaDB,” then click “Next.” Enter the database connection details as follows, then click “Connect.” Select the Keyspaces you want to use (you can select multiple), then click “OK.” DBSchema then reverse engineers your tables in the selected keyspaces. To write new queries, select “Editors -> “SQL Editor.” Don’t worry about the “SQL Editor” label; we’re writing CQL commands here, not SQL: Query examples Once the connection is set up, you can run all your CQL queries and see the result table immediately below your query. For example: Aside from simple SELECT queries, you can also run queries to create new objects in the database – for example, a materialized view: Wrapping up DBSchema is a unique tool in the sense that: It’s available in a free version You can run CQL The output is exactly the same as if you used cqlsh, but with a GUI It provides a seamless way to visualize your database keyspaces or test your CQL queries before using them in production Get started by downloading DBSchema and creating a new ScyllaDB Cloud cluster. Any questions? You can discuss this post and share your thoughts in our community forum.
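
The materialized view example in the original post is shown as a screenshot. As a generic stand-in (the keyspace, base table and view names below are invented), a statement of that kind looks like this when submitted through the Python driver; it behaves the same whether you run it from DBSchema's SQL Editor or from cqlsh.

from cassandra.cluster import Cluster

# Placeholder cluster address and names; assumes a base table
# demo_keyspace.users(user_id PRIMARY KEY, city, name) already exists.
session = Cluster(["127.0.0.1"]).connect("demo_keyspace")

session.execute("""
    CREATE MATERIALIZED VIEW IF NOT EXISTS demo_keyspace.users_by_city AS
    SELECT city, user_id, name
    FROM demo_keyspace.users
    WHERE city IS NOT NULL AND user_id IS NOT NULL
    PRIMARY KEY (city, user_id)
""")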

Who Did What to That and When? Exploring the User Actions Feature

NetApp recently released the user actions feature on the Instaclustr Managed Platform, allowing customers to search for user actions recorded against their accounts and organizations. We record over 100 different types of actions, with detailed descriptions of what was done, by whom, to what, and at what time. 

This provides customers with visibility into the actions users are performing on their linked accounts. NetApp has always collected this information in line with our security and compliance policies, but now, all important changes to your managed cluster resources have self-service access from the Console and the APIs.

In the past, this information was accessible only through support tickets when important questions such as “Who deleted my cluster?” and “When was the firewall rule removed from my cluster?” needed answers. This feature adds more self-discoverability of what your users are doing and what our support staff are doing to keep your clusters healthy. 

This blog post provides a detailed walkthrough of this new feature at a moderate level of technical detail, with the hope of encouraging you to explore and better find the actions you are looking for. 

For this blog, I’ve created two Apache Cassandra® clusters in one account and performed some actions on each. I’ve also created an organization linked to this account and performed some actions on that. This will allow a full example UI to be shown and demonstrate the type of “stories” that can emerge from typical operations via user actions. 

Introducing Global Directory 

During development, we decided to consolidate the other global account pages into a new centralized location, which we are calling the “Directory”.  

This Directory provides you with the consolidated view of all organizations and accounts that you have access to, collecting global searches and account functions into a view that does not have a “selected cluster” context (i.e., global).  For more information on how Organizations, Accounts and Clusters relate to each other, check out this blog.

Organizations serve as an efficient method to consolidate all associated accounts into a single unified, easily accessible location. They introduce an extra layer to the permission model, facilitating the management and sharing of information such as contact and billing details. They also streamline the process of Single Sign-On (SSO) and account creation. 

Let’s log in and click on the new button: 

This will take us to the new directory landing page: 

Here, you will find two types of global searches: accounts and user actions, as well as account creation. Selecting the new “User Actions” item will take us to the new page. You can also navigate to these directory pages directly from the top right ‘folder’ menu:

User Action Search Page: Walkthrough 

This is the new page we land on if we choose to search for user actions: 

When you first enter, it finds the last page of actions that happened in the accounts and organizations you have access to. It will show both organization and account actions on a single consolidated page, even though they are slightly different in nature. 

*Note: The accessible accounts and organisations are defined as those you are linked to as CLUSTER_ADMIN or OWNER.

*TIP: If you don’t want an account user to see user actions, give them READ_ONLY access. 

You may notice a brief progress bar display as the actions are retrieved. At the time of writing, we have recorded nearly 100 million actions made by our customers over a 6-month period.  

From here, you can increase the number of actions shown on each page and page through the results. Sorting is not currently supported on the actions table, but it is something we will be looking to add in the future. For each action found, the table will display: 

  • Action: What happened to your account (or organization)? There are over 100 tracked kinds of actions recorded. 
  • Domain: The specific account or organization name of the action targeted. 
  • Description: An expanded description of what happened, using context captured at the time of action. Important values are highlighted between square brackets, and the copy button will copy the first one into the clipboard. 
  • User: The user who performed the action, typically using the console/APIs or Terraform provider, but it can also be triggered by “Instaclustr Support” using our admin tools.
    • For those actions marked with user “Instaclustr Support”, please reach out to support for more information about those actions we’ve taken on your behalf or visit https://support.instaclustr.com/hc/en-us. 
  • Local time: The action time from your local web browser’s perspective. 

Additionally, for those who prefer programmatic access, the user action feature is fully accessible via our APIs, allowing for automation and integration into your existing workflows. Please visit our API documentation page here for more details.  

Basic (super-search) Mode 

Let’s say we only care about the LeagueOfNations organization domain; we can type ‘League’ and then click Search: 

The name patterns are simple partial string patterns we look for as being ‘contained’ within the name, such as “Car” in “Carlton”. These are case insensitive. They are not (yet!) general regular expressions.

Advanced “find a needle” Search Mode 

Sometimes, searching by names is not precise enough; you may want to provide more detailed search criteria, such as time ranges or narrowing down to specific clusters or kinds of actions. Expanding the “Advanced Search” section will switch the page to a more advanced search criteria form, disabling the basic search area and its criteria. 

Let’s say we only want to see the “Link Account” actions over the last week: 

We select it from the actions multi-chip selector using the cursor (we could also type it and allow autocomplete to kick in). Hitting search will give you your needle: time to go chase that Carl guy down and ask why he linked that darn account: 

The available criteria fields are as follows (additive in nature): 

  • Action: the kinds of actions, with a bracketed count of their frequency over the current criteria; if empty, all are included. 
  • Account: The account name of interest OR its UUID can be useful to narrow the matches to only a specific account. It’s also useful when user, organization, and account names share string patterns, which makes the super-search less precise. 
  • Organization: the organization name of interest or its UUID. 
  • User: the user who performed the action. 
  • Description: matches against the value of an expanded description variable. This is useful because most actions mention the ‘target’ of the action, such as cluster-id, in the expanded description. 
  • Starting At: match actions starting from this time; the start time cannot be older than 12 months ago. 
  • Ending At: match actions up until this time. 

Bonus Feature: Cluster Actions 

While it’s nice to have this new search page, we wanted to build a higher-order question on top of it: What has happened to my cluster?  

The answer can be found on the details tab of each cluster. When clicked on, it will take you directly to the user actions page with appropriate criteria to answer the question. 

* TIP: we currently support entry into this view with a descriptionFormat queryParam, allowing you to save bookmarks to particular action ‘targets’. Further queryParams may be supported in the future for the remaining criteria: https://console2.instaclustr.com/global/searches/user-action?descriptionContextPattern=acde7535-3288-48fa-be64-0f7afe4641b3

Clicking this provides you the answer:

Future Thoughts 

There are some future capabilities we will look to add, including the ability to subscribe to webhooks that trigger on some criteria. We would also like to add the ability to generate reports against a criterion or to run such things regularly and send them via email. Let us know what other feature improvements you would like to see! 

Conclusion 

This new capability allows customers to search for user actions directly without contacting support. It also provides improved visibility and auditing of what’s been changing on their clusters and who’s been making those changes. We hope you found this interesting and welcome any feedback for “higher-order” types of searches you’d like to see built on top of this new feature. What kind of common questions about user actions can you think of? 

If you have any questions about this feature, please contact Instaclustr Support at any time. If you are not a current Instaclustr customer and you’re interested in learning more, register for a free trial and spin up your first cluster for free!

 

The post Who Did What to That and When? Exploring the User Actions Feature appeared first on Instaclustr.

Benchmarking MongoDB vs ScyllaDB: Social Media Workload Deep Dive

benchANT’s comparison of ScyllaDB vs MongoDB in terms of throughput, latency, scalability, and cost for a social media workload BenchANT recently benchmarked the performance and scalability of the market-leading general-purpose NoSQL database MongoDB and its performance-oriented challenger ScyllaDB. You can read a summary of the results in the blog Benchmarking MongoDB vs ScyllaDB: Performance, Scalability & Cost, see the key takeaways for various workloads in this technical summary, and access all results (including the raw data) from the complete benchANT report. This blog offers a deep dive into the tests performed for the social workload. This workload is based on the YCSB Workload B. It creates a read-heavy workload, with 95% read operations and 5% update operations. We use two shapes of this workload, which differ in terms of the request distribution patterns, namely uniform and hotspot distribution. These workloads are executed against the small database scaling size with a data set of 500GB and against the medium scaling size with a data set of 1TB. Before we get into the benchmark details, here is a summary of key insights for this workload. ScyllaDB outperforms MongoDB with higher throughput and lower latency for all measured configurations of the social workload. ScyllaDB provides up to 12 times higher throughput ScyllaDB provides significantly lower (down to 47 times) update latencies compared to MongoDB ScyllaDB provides lower read latencies, down to 5 times Throughput Results for MongoDB vs ScyllaDB The throughput results for the social workload with the uniform request distribution show that the small ScyllaDB cluster is able to serve 60 kOps/s with a cluster CPU utilization of ~85% while the small MongoDB cluster serves only 10 kOps/s under a comparable cluster utilization of 80-90%. For the medium cluster sizes, ScyllaDB achieves an average throughput of 232 kOps/s showing ~85% cluster utilization while MongoDB achieves 42 kOps/s at a CPU utilization of ~85%. The throughput results for the social workload with the hotspot request distribution show a similar trend, but with higher throughput numbers since the data is mostly read from the cache. The small ScyllaDB cluster serves 152 kOps/s while the small MongoDB serves 14 kOps/s. For the medium cluster sizes, ScyllaDB achieves an average throughput of 587 kOps/s and MongoDB achieves 48 kOps/s. Scalability Results for MongoDB vs ScyllaDB These results also enable us to compare the theoretical throughput scalability with the actually achieved throughput scalability. For this, we consider a simplified scalability model that focuses on compute resources. It assumes the scalability factor is reflected by the increased compute capacity from the small to medium cluster size. For ScyllaDB, this means we double the cluster size from 3 to 6 nodes and also double the instance size from 8 cores to 16 cores per instance, resulting in a theoretical scalability of 400%. For MongoDB, we move from one replica set of three data nodes to a cluster with three shards and nine data nodes and increase the instance size from 8 cores to 16 cores, resulting in a theoretical scalability factor of 600%. The ScyllaDB scalability results for the uniform and hotspot distributions both show that ScyllaDB is close to achieving linear scalability by achieving a throughput scalability of 386% (of the theoretically possible 400%). 
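To make the scalability arithmetic concrete, here is the calculation for ScyllaDB's uniform-distribution result using the figures reported above (the helper function is our own illustration, not part of benchANT's tooling):

def scalability(nodes_small, cores_small, kops_small, nodes_medium, cores_medium, kops_medium):
    theoretical = (nodes_medium * cores_medium) / (nodes_small * cores_small)
    achieved = kops_medium / kops_small
    return theoretical, achieved

# ScyllaDB, social workload, uniform distribution: 3 x 8-core nodes -> 6 x 16-core nodes
theoretical, achieved = scalability(3, 8, 60, 6, 16, 232)
print(f"theoretical: {theoretical:.0%}, achieved: {achieved:.0%}")  # ~400% vs ~387%
# The report states 386%, which is computed from the exact (non-rounded) throughput measurements.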
With MongoDB, the gap between theoretical throughput scalability and the actually achieved throughput scalability is significantly higher. For the uniform distribution, MongoDB achieves a scaling factor of 420% (of the theoretically possible 600%). For the hotspot distribution, we measure 342% (of the theoretically possible 600%). Throughput per Cost Ratio In order to compare the costs/month in relation to the provided throughput, we take the MongoDB Atlas throughput/$ as baseline (i.e. 100%) and compare it with the provided ScyllaDB Cloud throughput/$. The results for the uniform distribution show that ScyllaDB provides five times more operations/$ compared to MongoDB Atlas for the small scaling size and 5.7 times more operations/$ for the medium scaling size. For the hotspot distribution, the results show an even better throughput/cost ratio for ScyllaDB, providing 9 times more operations/$ for the small scaling size and 12.7 times more for the medium scaling size. Latency Results for MongoDB vs ScyllaDB For the uniform distribution, ScyllaDB provides stable and low P99 latencies for the read and update operations for the scaling sizes small and medium. MongoDB generally has higher P99 latencies. Here, the read latencies are 2.8 times higher for the small scaling size and 5.5 times higher for the medium scaling size. The update latencies show an even more distinct difference; MongoDB’s P99 update latency in the small scaling size is 47 times higher compared to ScyllaDB and 12 times higher in the medium scaling size. For the hotspot distribution, the results show a similar trend for the stable and low ScyllaDB latencies. For MongoDB, read and update latencies increase from the small to medium scaling size. It is interesting that in contrast to the uniform distribution, the read latency only increases by a factor of 2.8 while the update latency increases by 970%. Technical Nugget – Performance Impact of the Data Model The default YCSB data model is composed of a primary key and a data item with 10 fields of strings that results in a document with 10 attributes for MongoDB and a table with 10 columns for ScyllaDB. We analyze how performance changes if a pure key-value data model is applied for both databases: a table with only one column for ScyllaDB and a document with only one field for MongoDB. The results show that for ScyllaDB the throughput improves by 24% while for MongoDB the throughput increase is only 5%. Technical Nugget – Performance Impact of the Consistency Level All standard benchmarks are run with the MongoDB client consistency writeConcern=majority/readPreference=primary and for ScyllaDB with writeConsistency=QUORUM/readConsistency=QUORUM. Besides these client-consistent configurations, we also analyze the performance impact of weaker read consistency settings. For this, we enable MongoDB to read from the secondaries (readPreference=secondarypreferred) and set readConsistency=ONE for ScyllaDB. The results show an expected increase in throughput: for ScyllaDB 56% and for MongoDB 49%. Continue Comparing ScyllaDB vs MongoDB Here are some additional resources for learning about the differences between MongoDB and ScyllaDB: Benchmarking MongoDB vs ScyllaDB: Results from benchANT’s complete benchmarking study that comprises 133 performance and scalability measurements that compare MongoDB against ScyllaDB. 
  • Benchmarking MongoDB vs ScyllaDB: Caching Workload Deep Dive: benchANT’s comparison of ScyllaDB vs MongoDB in terms of throughput, latency, scalability, and cost for a caching workload (50% read operations and 50% update operations).
  • Benchmarking MongoDB vs ScyllaDB: IoT Sensor Workload Deep Dive: benchANT’s comparison of ScyllaDB vs MongoDB in terms of throughput, latency, scalability, and cost for a workload simulating an IoT sensor (90% insert operations and 10% read operations).
  • A Technical Comparison of MongoDB vs ScyllaDB: benchANT’s technical analysis of how MongoDB and ScyllaDB compare with respect to their features, architectures, performance, and scalability.
  • ScyllaDB’s MongoDB vs ScyllaDB page: Features perspectives from users – like Discord – who have moved from MongoDB to ScyllaDB.

Powering AI Workloads with Intelligent Data Infrastructure and Open Source

In the rapidly evolving technological landscape, artificial intelligence (AI) is emerging as a driving force behind innovation and efficiency. However, to harness its full potential, enterprises need suitable data infrastructures that can support AI workloads effectively. 

This blog explores how intelligent data infrastructure, combined with open source technologies, is revolutionizing AI applications across various business functions. It outlines the benefits of leveraging existing infrastructure and highlights key open source databases that are indispensable for powering AI. 

The Power of Open Source in AI Solutions 

Open source technologies have long been celebrated for their flexibility, community support, and cost-efficiency. In the realm of AI, these advantages are magnified. Here’s why open source is indispensable for AI-fueled solutions: 

  1. Cost Efficiency: Open source solutions eliminate licensing fees, making them an attractive option for businesses looking to optimize their budgets.
  2. Community Support: A vibrant community of developers constantly improves these platforms, ensuring they remain cutting-edge.
  3. Flexibility and Customization: Open source tools can be tailored to meet specific needs, allowing enterprises to build solutions that align perfectly with their goals. 
  4. Transparency and Security: With open source, you have visibility into the code, which allows for better security audits and trustworthiness. 

Vector Databases: A Key Component for AI Workloads 

Vector databases are increasingly indispensable for AI workloads. They store data in high-dimensional vectors, which AI models use to understand patterns and relationships. This capability is crucial for applications involving natural language processing, image recognition, and recommendation systems. 

Vector databases use embedding vectors (lists of numbers) to represent data similarities and plot relationships spatially. For example, “plant” and “shrub” will have closer vector coordinates than “plant” and “car”. This allows enterprises to build their own LLMs, explore large text datasets, and enhance search capabilities. 
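As a rough illustration of that spatial closeness, here is a minimal Python sketch comparing made-up toy embeddings with cosine similarity; the three-dimensional vectors are invented for the example, whereas real embedding models produce vectors with hundreds or thousands of dimensions.

import numpy as np

def cosine_similarity(a, b):
    # 1.0 means the vectors point in the same direction; values near 0 mean unrelated.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy, hand-made vectors purely for illustration.
plant = np.array([0.9, 0.1, 0.3])
shrub = np.array([0.8, 0.2, 0.35])
car   = np.array([0.1, 0.9, 0.7])

print(cosine_similarity(plant, shrub))  # ~0.99 -- semantically close
print(cosine_similarity(plant, car))    # ~0.36 -- semantically distant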

Vector databases and embeddings also support retrieval augmented generation (RAG), which improves LLM accuracy by supplementing the model with new information. For example, RAG can let users query documentation: an enterprise’s documents are turned into embeddings, the user’s query is translated into a vector, the most similar passages in the documentation are found, and the relevant information is retrieved. This data is then provided to an LLM, enabling it to generate accurate text answers for users. 
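The sketch below shows one way that flow can be wired together in Python; the embed and llm callables are hypothetical placeholders for whatever embedding model and LLM an enterprise actually uses.

import numpy as np

def answer_with_rag(question, documents, embed, llm, top_k=3):
    # 1. Create embeddings for the enterprise's documents.
    doc_vectors = [(doc, np.asarray(embed(doc))) for doc in documents]
    # 2. Translate the user's question into a vector.
    q = np.asarray(embed(question))
    # 3. Retrieve the documents most similar to the question (cosine similarity).
    def similarity(v):
        return float(np.dot(q, v) / (np.linalg.norm(q) * np.linalg.norm(v)))
    best = sorted(doc_vectors, key=lambda dv: similarity(dv[1]), reverse=True)[:top_k]
    # 4. Provide the retrieved context to the LLM so it can generate a grounded answer.
    context = "\n".join(doc for doc, _ in best)
    return llm(f"Answer using only this context:\n{context}\n\nQuestion: {question}")

In production, the in-memory similarity scan in step 3 is exactly the part a vector database takes over.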

The Role of Vector Databases in AI: 

  1. Efficient Data Handling: Vector databases excel at handling large volumes of data efficiently, which is essential for training and deploying AI models. 
  2. High Performance: They offer high-speed retrieval and processing of complex data types, ensuring AI applications run smoothly. 
  3. Scalability: With the ability to scale horizontally, vector databases can grow alongside your AI initiatives without compromising performance. 

Leveraging Existing Infrastructure for AI Workloads 

Contrary to popular belief, it isn’t necessary to invest in new and exotic specialized data layer solutions. Your existing infrastructure can often support AI workloads with a few strategic enhancements: 

  1. Evaluate Current Capabilities: Start by assessing your current data infrastructure to identify any gaps or areas for improvement. 
  2. Upgrade Where Necessary: Consider upgrading components such as storage, network speed, and computing power to meet the demands of AI workloads. 
  3. Integrate with AI Tools: Ensure your infrastructure is compatible with leading AI tools and platforms to facilitate seamless integration. 

Open Source Databases for Enterprise AI 

Several open source databases are particularly well-suited for enterprise AI applications. Let’s look at the 3 free open source databases that enterprise teams can leverage as they scale their intelligent data infrastructure for storing those embedding vectors: 

PostgreSQL® and pgvector 

“The world’s most advanced open source relational database,” PostgreSQL is also one of the most widely deployed, meaning that most enterprises will already have a strong foothold in the technology. The pgvector extension turns Postgres into a high-performance vector store, offering a path of least resistance for organizations familiar with PostgreSQL to quickly stand up intelligent data infrastructure. 

From a RAG and LLM training perspective, pgvector excels at enabling distance-based embedding search, exact nearest neighbor search, and approximate nearest neighbor search. pgvector efficiently captures semantic similarities using L2 distance, inner product, and (the OpenAI-recommended) cosine distance. Teams can also harness OpenAI’s embeddings model (available as an API) to calculate embeddings for documentation and user queries. As an enterprise-ready open source option, pgvector is an already-proven solution for achieving efficient, accurate, and performant LLMs, helping equip teams to confidently launch differentiated and AI-fueled applications into production.
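Here is a minimal sketch of what that looks like in practice, assuming PostgreSQL with the pgvector extension installed and the psycopg2 driver; the connection details, table name, and tiny three-dimensional vectors are made up for the example.

import psycopg2

conn = psycopg2.connect("dbname=app user=app")  # hypothetical connection details
cur = conn.cursor()

# One-time setup: enable pgvector and create a table with an embedding column.
cur.execute("CREATE EXTENSION IF NOT EXISTS vector")
cur.execute("CREATE TABLE IF NOT EXISTS docs ("
            "  id bigserial PRIMARY KEY,"
            "  content text,"
            "  embedding vector(3))")  # real embeddings typically have hundreds of dimensions

# Optional index for approximate nearest neighbor search using cosine distance.
cur.execute("CREATE INDEX IF NOT EXISTS docs_embedding_idx ON docs "
            "USING ivfflat (embedding vector_cosine_ops)")

# Nearest-neighbor query: <=> is pgvector's cosine distance operator.
cur.execute("SELECT content FROM docs ORDER BY embedding <=> %s::vector LIMIT 5",
            ("[0.9, 0.1, 0.3]",))
print(cur.fetchall())
conn.commit()

Without the index, the same ORDER BY performs an exact nearest neighbor scan; with it, the search becomes approximate but much faster on large tables.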

OpenSearch® 

OpenSearch is a mature search and analytics engine that is already popular with a wide swath of enterprises, so new and current users alike will be glad to know that this open source solution is ready to accelerate AI application development as a single search, analytics, and vector database. 

OpenSearch has long offered low latency, high availability, and the scale to handle tens of billions of vectors while backing stable applications. It provides nearest-neighbor search functionality that supports vector, lexical, and hybrid search and analytics. These capabilities significantly simplify the implementation of AI solutions, from generative AI agents to recommendation engines that deliver trustworthy results with minimal hallucinations. 
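For illustration, here is a hedged sketch using the opensearch-py client to create an index with a knn_vector field and run an approximate nearest-neighbor query; the endpoint, index name, and tiny vectors are assumptions for the example.

from opensearchpy import OpenSearch

client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])  # hypothetical endpoint

# Create an index whose mapping declares a knn_vector field for embeddings.
client.indices.create(
    index="products",
    body={
        "settings": {"index": {"knn": True}},
        "mappings": {"properties": {
            "title": {"type": "text"},
            "embedding": {"type": "knn_vector", "dimension": 3},
        }},
    },
)

# Index a document carrying both its text and its embedding.
client.index(index="products",
             body={"title": "garden shrub", "embedding": [0.8, 0.2, 0.35]},
             refresh=True)

# Approximate nearest-neighbor (vector) search for the 5 closest documents.
results = client.search(index="products", body={
    "size": 5,
    "query": {"knn": {"embedding": {"vector": [0.9, 0.1, 0.3], "k": 5}}},
})
print(results["hits"]["hits"])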

Apache Cassandra® 5.0 with Native Vector Indexing

Known for its linear scalability and fault tolerance on commodity hardware or cloud infrastructure, Apache Cassandra is a reliable choice for enterprise-grade AI applications. The newest version of this highly popular open source database introduces several features built for AI workloads, including vector search and native vector indexing capabilities.

Additionally, there is a new vector data type specifically for saving and retrieving embedding vectors, along with new CQL functions for working with those vectors. With these additions, Apache Cassandra 5.0 has emerged as an ideal database for intelligent data strategies and for enterprises rapidly building out AI applications across myriad use cases.
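As a hedged illustration of those capabilities, the sketch below uses the Python driver to create a table with the vector type, add a storage-attached index (SAI), and run an ANN query; the keyspace, table, and tiny three-dimensional vectors are made up for the example and assume a Cassandra 5.0 cluster.

from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("shop")  # hypothetical cluster and keyspace

# The vector data type stores fixed-dimension embedding vectors.
session.execute("CREATE TABLE IF NOT EXISTS products ("
                "  id int PRIMARY KEY,"
                "  name text,"
                "  embedding vector<float, 3>)")

# A storage-attached index (SAI) enables approximate nearest-neighbor queries.
session.execute("CREATE INDEX IF NOT EXISTS products_ann "
                "ON products(embedding) USING 'sai'")

# ANN query: return the rows whose embeddings are closest to the given vector.
rows = session.execute("SELECT id, name FROM products "
                       "ORDER BY embedding ANN OF [0.9, 0.1, 0.3] LIMIT 5")
for row in rows:
    print(row.id, row.name)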

Cassandra’s earned reputation for high availability and scalability, now combined with this AI-specific functionality, makes it one of the most enticing open source options for enterprises. 

Open Source Opens the Door to Successful AI Workloads 

Clearly, given the tremendously rapid pace at which AI technology is advancing, enterprises cannot afford to wait to build out differentiated AI applications. But in this pursuit, engaging with the wrong proprietary data-layer solutions, and suffering the pitfalls of vendor lock-in or simply mismatched features, can easily be (and, for some, already is) a fatal setback. Instead, tapping into one of the very capable open source vector databases available will allow enterprises to put themselves in a more advantageous position. 

When leveraging open source databases for AI workloads, consider the following: 

  • Data Security: Ensure robust security measures are in place to protect sensitive data. 
  • Scalability: Plan for future growth by choosing solutions that can scale with your needs. 
  • Resource Allocation: Allocate sufficient resources, such as computing power and storage, to support AI applications. 
  • Governance and Compliance: Adhere to governance and compliance standards to ensure responsible use of AI. 

Conclusion 

Intelligent data infrastructure and open source technologies are revolutionizing the way enterprises approach AI workloads. By leveraging existing infrastructure and integrating powerful open source databases, organizations can unlock the full potential of AI, driving innovation and efficiency. 

Ready to take your AI initiatives to the next level? Leverage a single platform to help you design, deploy and monitor the infrastructure to support the capabilities of PostgreSQL with pgvector, OpenSearch, and Apache Cassandra 5.0 today.

And for more insights and expert guidance, don’t hesitate to contact us and speak with one of our open source experts! 
