Netflix’s Distributed Counter Abstraction
By: Rajiv Shringi, Oleksii Tkachuk, Kartik Sathyanarayanan
Introduction
In our previous blog post, we introduced Netflix’s TimeSeries Abstraction, a distributed service designed to store and query large volumes of temporal event data with low millisecond latencies. Today, we’re excited to present the Distributed Counter Abstraction. This counting service, built on top of the TimeSeries Abstraction, enables distributed counting at scale while maintaining similar low latency performance. As with all our abstractions, we use our Data Gateway Control Plane to shard, configure, and deploy this service globally.
Distributed counting is a challenging problem in computer science. In this blog post, we’ll explore the diverse counting requirements at Netflix, the challenges of achieving accurate counts in near real-time, and the rationale behind our chosen approach, including the necessary trade-offs.
Note: When it comes to distributed counters, terms such as ‘accurate’ or ‘precise’ should be taken with a grain of salt. In this context, they refer to a count very close to accurate, presented with minimal delays.
Use Cases and Requirements
At Netflix, our counting use cases include tracking millions of user interactions, monitoring how often specific features or experiences are shown to users, and counting multiple facets of data during A/B test experiments, among others.
At Netflix, these use cases can be classified into two broad categories:
- Best-Effort: For this category, the count doesn’t have to be very accurate or durable. However, this category requires near-immediate access to the current count at low latencies, all while keeping infrastructure costs to a minimum.
- Eventually Consistent: This category needs accurate and durable counts, and is willing to tolerate a slight delay in accuracy and a slightly higher infrastructure cost as a trade-off.
Both categories share common requirements, such as high throughput and high availability. The table below provides a detailed overview of the diverse requirements across these two categories.
Distributed Counter Abstraction
To meet the outlined requirements, the Counter Abstraction was designed to be highly configurable. It allows users to choose between different counting modes, such as Best-Effort or Eventually Consistent, while considering the documented trade-offs of each option. After selecting a mode, users can interact with APIs without needing to worry about the underlying storage mechanisms and counting methods.
Let’s take a closer look at the structure and functionality of the API.
API
Counters are organized into separate namespaces that users set up for each of their specific use cases. Each namespace can be configured with different parameters, such as Type of Counter, Time-To-Live (TTL), and Counter Cardinality, using the service’s Control Plane.
The Counter Abstraction API resembles Java’s AtomicInteger interface:
AddCount/AddAndGetCount: Adjusts the count for the specified counter by the given delta value within a dataset. The delta value can be positive or negative. The AddAndGetCount counterpart also returns the count after performing the add operation.
{
  "namespace": "my_dataset",
  "counter_name": "counter123",
  "delta": 2,
  "idempotency_token": {
    "token": "some_event_id",
    "generation_time": "2024-10-05T14:48:00Z"
  }
}
An idempotency token can be provided for counter types that support it. Clients can use this token to safely retry or hedge their requests. Failures in a distributed system are a given, and the ability to safely retry requests enhances the reliability of the service.
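To make this concrete, here is a minimal sketch (not our actual client library) of how a caller might wrap AddAndGetCount with a stable idempotency token; the CounterClient interface and its signature are illustrative assumptions:

import java.time.Instant;
import java.util.UUID;

// Hypothetical client interface; the real API is served via the Data Gateway.
interface CounterClient {
  long addAndGetCount(String namespace, String counterName, long delta,
                      String idempotencyToken, Instant generationTime) throws Exception;
}

class IdempotentAdd {
  // Retrying with the same token means the delta is applied at most once, even
  // if an earlier attempt actually succeeded before the response was lost.
  static long addWithRetry(CounterClient client, String namespace, String counter,
                           long delta, int maxAttempts) throws Exception {
    String token = UUID.randomUUID().toString();   // generated once per logical event
    Instant generationTime = Instant.now();
    Exception last = null;
    for (int attempt = 0; attempt < maxAttempts; attempt++) {
      try {
        return client.addAndGetCount(namespace, counter, delta, token, generationTime);
      } catch (Exception e) {
        last = e; // transient failure: safe to retry because the token is unchanged
      }
    }
    throw last != null ? last : new IllegalArgumentException("maxAttempts must be positive");
  }
}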
GetCount: Retrieves the count value of the specified counter within a dataset.
{
  "namespace": "my_dataset",
  "counter_name": "counter123"
}
ClearCount: Effectively resets the count to 0 for the specified counter within a dataset.
{
  "namespace": "my_dataset",
  "counter_name": "counter456",
  "idempotency_token": {...}
}
Now, let’s look at the different types of counters supported within the Abstraction.
Types of Counters
The service primarily supports two types of counters: Best-Effort and Eventually Consistent, along with a third experimental type: Accurate. In the following sections, we’ll describe the different approaches for these types of counters and the trade-offs associated with each.
Best-Effort Regional Counter
This type of counter is powered by EVCache, Netflix’s distributed caching solution built on the widely popular Memcached. It is suitable for use cases like A/B experiments, where many concurrent experiments are run for relatively short durations and an approximate count is sufficient. Setting aside the complexities of provisioning, resource allocation, and control plane management, the core of this solution is remarkably straightforward:
// counter cache key
counterCacheKey = <namespace>:<counter_name>

// add operation
return delta > 0
    ? cache.incr(counterCacheKey, delta, TTL)
    : cache.decr(counterCacheKey, Math.abs(delta), TTL);

// get operation
cache.get(counterCacheKey);

// clear counts from all replicas
cache.delete(counterCacheKey, ReplicaPolicy.ALL);
EVCache delivers extremely high throughput at low millisecond latency or better within a single region, and it enables a multi-tenant setup within a shared cluster, saving infrastructure costs. However, there are some trade-offs: it lacks cross-region replication for the increment operation and does not provide consistency guarantees, which may be necessary for an accurate count. Additionally, idempotency is not natively supported, making it unsafe to retry or hedge requests.
Edit: A note on probabilistic data structures:
Probabilistic data structures like HyperLogLog (HLL) can be useful for tracking an approximate number of distinct elements, like distinct views or visits to a website, but are not ideally suited for implementing arbitrary increments and decrements for a given key. Count-Min Sketch (CMS) is an alternative that can be used to adjust the values of keys by a given amount. Data stores like Redis support both HLL and CMS. However, we chose not to pursue this direction for several reasons:
- We chose to build on top of data stores that we already operate at scale.
- Probabilistic data structures do not natively support several of our requirements, such as resetting the count for a given key or having TTLs for counts. Additional data structures, including more sketches, would be needed to support these requirements.
- On the other hand, the EVCache solution is quite simple, requiring minimal lines of code and using natively supported elements. However, it comes at the trade-off of using a small amount of memory per counter key.
Eventually Consistent Global Counter
While some users may accept the limitations of a Best-Effort counter, others opt for precise counts, durability and global availability. In the following sections, we’ll explore various strategies for achieving durable and accurate counts. Our objective is to highlight the challenges inherent in global distributed counting and explain the reasoning behind our chosen approach.
Approach 1: Storing a Single Row per Counter
Let’s start simple by using a single row per counter key within a table in a globally replicated datastore.
Let’s examine some of the drawbacks of this approach:
- Lack of Idempotency: There is no idempotency key baked into the storage data model, preventing users from safely retrying requests. Implementing idempotency would likely require using an external system for such keys, which can further degrade performance or cause race conditions.
- Heavy Contention: To update counts reliably, every writer must perform a Compare-And-Swap operation for a given counter using locks or transactions. Depending on the throughput and concurrency of operations, this can lead to significant contention, heavily impacting performance.
Secondary Keys: One way to reduce contention in this approach would be to use a secondary key, such as a bucket_id, which allows for distributing writes by splitting a given counter into buckets, while enabling reads to aggregate across buckets. The challenge lies in determining the appropriate number of buckets. A static number may still lead to contention with hot keys, while dynamically assigning the number of buckets per counter across millions of counters presents a more complex problem.
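As a rough sketch of this bucketing idea (not tied to any specific datastore API), writes fan out across a fixed number of buckets and reads aggregate them back; the BucketStore interface and the static bucket count are illustrative assumptions:

import java.util.concurrent.ThreadLocalRandom;

// Hypothetical storage interface standing in for single-row updates in a replicated datastore.
interface BucketStore {
  void increment(String rowKey, long delta); // atomic add on one row
  long read(String rowKey);                  // point read of one row
}

class BucketedCounter {
  private static final int NUM_BUCKETS = 16; // static choice; sizing this well is the hard part

  // Writes fan out across buckets to reduce contention on a single hot row.
  static void add(BucketStore store, String namespace, String counter, long delta) {
    int bucketId = ThreadLocalRandom.current().nextInt(NUM_BUCKETS);
    store.increment(namespace + ":" + counter + ":" + bucketId, delta);
  }

  // Reads aggregate across all buckets to reconstruct the full count.
  static long get(BucketStore store, String namespace, String counter) {
    long total = 0;
    for (int bucketId = 0; bucketId < NUM_BUCKETS; bucketId++) {
      total += store.read(namespace + ":" + counter + ":" + bucketId);
    }
    return total;
  }
}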
Let’s see if we can iterate on our solution to overcome these drawbacks.
Approach 2: Per Instance Aggregation
To address issues of hot keys and contention from writing to the same row in real-time, we could implement a strategy where each instance aggregates the counts in memory and then flushes them to disk at regular intervals. Introducing sufficient jitter to the flush process can further reduce contention.
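Here is a minimal sketch of such per-instance aggregation, assuming a hypothetical FlushTarget sink standing in for the shared datastore; counts accumulate in memory and are flushed on a jittered interval:

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.LongAdder;

// Sketch: per-instance aggregation with a jittered periodic flush.
class InstanceAggregator {
  interface FlushTarget { void add(String counterKey, long delta); } // stand-in for the shared datastore

  private final ConcurrentHashMap<String, LongAdder> pending = new ConcurrentHashMap<>();
  private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

  InstanceAggregator(FlushTarget target, long baseIntervalMs) {
    // Jitter spreads flushes from different instances over time to reduce contention.
    long interval = baseIntervalMs + ThreadLocalRandom.current().nextLong(Math.max(1, baseIntervalMs / 4));
    scheduler.scheduleAtFixedRate(() -> flush(target), interval, interval, TimeUnit.MILLISECONDS);
  }

  void add(String counterKey, long delta) {
    pending.computeIfAbsent(counterKey, k -> new LongAdder()).add(delta);
  }

  // Everything accumulated since the last flush is written out as one delta per
  // counter; a crash before this point loses the in-memory deltas.
  private void flush(FlushTarget target) {
    for (Map.Entry<String, LongAdder> entry : pending.entrySet()) {
      long delta = entry.getValue().sumThenReset();
      if (delta != 0) target.add(entry.getKey(), delta);
    }
  }
}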
However, this solution presents a new set of issues:
- Vulnerability to Data Loss: The solution is vulnerable to data loss for all in-memory data during instance failures, restarts, or deployments.
- Inability to Reliably Reset Counts: Due to counting requests being distributed across multiple machines, it is challenging to establish consensus on the exact point in time when a counter reset occurred.
- Lack of Idempotency: Similar to the previous approach, this method does not natively guarantee idempotency. One way to achieve idempotency is by consistently routing the same set of counters to the same instance. However, this approach may introduce additional complexities, such as leader election, and potential challenges with availability and latency in the write path.
That said, this approach may still be suitable in scenarios where these trade-offs are acceptable. However, let’s see if we can address some of these issues with a different event-based approach.
Approach 3: Using Durable Queues
In this approach, we log counter events into a durable queuing system like Apache Kafka to prevent any potential data loss. By creating multiple topic partitions and hashing the counter key to a specific partition, we ensure that the same set of counters are processed by the same set of consumers. This setup simplifies facilitating idempotency checks and resetting counts. Furthermore, by leveraging additional stream processing frameworks such as Kafka Streams or Apache Flink, we can implement windowed aggregations.
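For illustration, a producer using the standard Apache Kafka client could key each event by its counter so that all events for a given counter land on the same partition; the topic name, value encoding, and configuration below are assumptions, not our production setup:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

class CounterEventProducer {
  private final KafkaProducer<String, String> producer;

  CounterEventProducer(String bootstrapServers) {
    Properties props = new Properties();
    props.put("bootstrap.servers", bootstrapServers);
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
    this.producer = new KafkaProducer<>(props);
  }

  // Keying the record by namespace:counter hashes all events for a given
  // counter to the same partition, so one consumer sees them in order.
  void logIncrement(String namespace, String counter, long delta, String eventId) {
    String key = namespace + ":" + counter;
    String value = "{\"delta\":" + delta + ",\"event_id\":\"" + eventId + "\"}";
    producer.send(new ProducerRecord<>("counter-events", key, value)); // topic name is illustrative
  }
}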
However, this approach comes with some challenges:
- Potential Delays: Having the same consumer process all the counts from a given partition can lead to backups and delays, resulting in stale counts.
- Rebalancing Partitions: This approach requires auto-scaling and rebalancing of topic partitions as counter cardinality and throughput increase.
Furthermore, all approaches that pre-aggregate counts make it challenging to support two of our requirements for accurate counters:
- Auditing of Counts: Auditing involves extracting data to an offline system for analysis to ensure that increments were applied correctly to reach the final value. This process can also be used to track the provenance of increments. However, auditing becomes infeasible when counts are aggregated without storing the individual increments.
- Potential Recounting: Similar to auditing, if adjustments to increments are necessary and recounting of events within a time window is required, pre-aggregating counts makes this infeasible.
Barring those few requirements, this approach can still be effective if we determine the right way to scale our queue partitions and consumers while maintaining idempotency. However, let’s explore how we can adjust this approach to meet the auditing and recounting requirements.
Approach 4: Event Log of Individual Increments
In this approach, we log each individual counter increment along with its event_time and event_id. The event_id can include the source information of where the increment originated. The combination of event_time and event_id can also serve as the idempotency key for the write.
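Sketched below is one possible shape for such an event; the field names mirror the ones mentioned above, but the record itself is illustrative rather than our actual data model:

import java.time.Instant;

// Illustrative shape of a single increment event; the combination of eventTime
// and eventId serves as the idempotency key for the write.
record CounterIncrementEvent(String namespace, String counterName, long delta,
                             Instant eventTime, String eventId) {
  String idempotencyKey() {
    return eventTime.toEpochMilli() + ":" + eventId;
  }
}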
However, in its simplest form, this approach has several drawbacks:
- Read Latency: Each read request requires scanning all increments for a given counter, potentially degrading performance.
- Duplicate Work: Multiple threads might duplicate the effort of aggregating the same set of counters during read operations, leading to wasted effort and subpar resource utilization.
- Wide Partitions: If using a datastore like Apache Cassandra, storing many increments for the same counter could lead to a wide partition, affecting read performance.
- Large Data Footprint: Storing each increment individually could also result in a substantial data footprint over time. Without an efficient data retention strategy, this approach may struggle to scale effectively.
The combined impact of these issues can lead to increased infrastructure costs that may be difficult to justify. However, adopting an event-driven approach seems to be a significant step forward in addressing some of the challenges we’ve encountered and meeting our requirements.
How can we improve this solution further?
Netflix’s Approach
We use a combination of the previous approaches, where we log each counting activity as an event, and continuously aggregate these events in the background using queues and a sliding time window. Additionally, we employ a bucketing strategy to prevent wide partitions. In the following sections, we’ll explore how this approach addresses the previously mentioned drawbacks and meets all our requirements.
Note: From here on, we will use the words “rollup” and “aggregate” interchangeably. They essentially mean the same thing, i.e., collecting individual counter increments/decrements and arriving at the final value.
TimeSeries Event Store:
We chose the TimeSeries Data Abstraction as our event store, where counter mutations are ingested as event records. Some of the benefits of storing events in TimeSeries include:
- High-Performance: The TimeSeries abstraction already addresses many of our requirements, including high availability and throughput, reliable and fast performance, and more.
- Reducing Code Complexity: We reduce a lot of code complexity in the Counter Abstraction by delegating a major portion of the functionality to an existing service.
TimeSeries Abstraction uses Cassandra as the underlying event store, but it can be configured to work with any persistent store. Here is what it looks like:
- Handling Wide Partitions: The time_bucket and event_bucket columns play a crucial role in breaking up a wide partition, preventing high-throughput counter events from overwhelming a given partition. For more information regarding this, refer to our previous blog.
- No Over-Counting: The event_time, event_id, and event_item_key columns form the idempotency key for the events for a given counter, enabling clients to retry safely without the risk of over-counting.
- Event Ordering: TimeSeries orders all events in descending order of time, allowing us to leverage this property for events like count resets.
- Event Retention: The TimeSeries Abstraction includes retention policies to ensure that events are not stored indefinitely, saving disk space and reducing infrastructure costs. Once events have been aggregated and moved to a more cost-effective store for audits, there’s no need to retain them in the primary storage.
Now, let’s see how these events are aggregated for a given counter.
Aggregating Count Events:
As mentioned earlier, collecting all individual increments for every read request would be cost-prohibitive in terms of read performance. Therefore, a background aggregation process is necessary to continually converge counts and ensure optimal read performance.
But how can we safely aggregate count events amidst ongoing write operations?
This is where the concept of Eventually Consistent counts becomes crucial. By intentionally lagging behind the current time by a safe margin, we ensure that aggregation always occurs within an immutable window.
Let’s break down what this looks like:
- lastRollupTs: This represents the most recent time when the counter value was last aggregated. For a counter being operated for the first time, this timestamp defaults to a reasonable time in the past.
- Immutable Window and Lag: Aggregation can only occur safely within an immutable window that is no longer receiving counter events. The “acceptLimit” parameter of the TimeSeries Abstraction plays a crucial role here, as it rejects incoming events with timestamps beyond this limit. During aggregations, this window is pushed slightly further back to account for clock skews (see the sketch after this list).
This does mean that the counter value will lag behind its most recent update by some margin (typically on the order of seconds). This approach does leave the door open for missed events due to cross-region replication issues; see the “Future Work” section at the end.
- Aggregation Process: The rollup process aggregates all events in the aggregation window since the last rollup to arrive at the new value.
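Here is a sketch of how that aggregation window can be derived; the parameter names follow the text, while the concrete values are illustrative:

import java.time.Duration;
import java.time.Instant;

class RollupWindow {
  // Events newer than (now - acceptLimit) may still arrive, so the window's
  // upper bound stays behind "now"; an extra buffer absorbs clock skew.
  static Instant[] aggregationWindow(Instant lastRollupTs, Instant now,
                                     Duration acceptLimit, Duration clockSkewBuffer) {
    Instant immutableUntil = now.minus(acceptLimit).minus(clockSkewBuffer);
    // Aggregate events in [lastRollupTs, immutableUntil); nothing in this
    // range can change anymore, which makes the rollup safe to repeat.
    return new Instant[] { lastRollupTs, immutableUntil };
  }

  public static void main(String[] args) {
    Instant[] window = aggregationWindow(
        Instant.parse("2024-10-05T14:48:00Z"), Instant.now(),
        Duration.ofSeconds(5), Duration.ofSeconds(2)); // illustrative values
    System.out.println("Aggregate from " + window[0] + " to " + window[1]);
  }
}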
Rollup Store:
We save the results of this aggregation in a persistent store. The next aggregation will simply continue from this checkpoint.
We create one such Rollup table per dataset and use Cassandra as our persistent store. However, as you will soon see in the Control Plane section, the Counter service can be configured to work with any persistent store.
LastWriteTs: Every time a given counter receives a write, we also log a last-write-timestamp as a columnar update in this table. This is done using Cassandra’s USING TIMESTAMP feature to predictably apply Last-Write-Wins (LWW) semantics. This timestamp is the same as the event_time for the event. In the subsequent sections, we’ll see how this timestamp is used to keep some counters in active rollup circulation until they have caught up to their latest value.
Rollup Cache:
To optimize read performance, these values are cached in EVCache for each counter. We combine the lastRollupCount and lastRollupTs into a single cached value per counter to prevent potential mismatches between the count and its corresponding checkpoint timestamp.
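A minimal sketch of that combined cache entry, with an illustrative encoding:

// The count and its checkpoint are cached as one value so a reader can never
// observe a count paired with the wrong rollup timestamp.
record CachedRollup(long lastRollupCount, long lastRollupTsMillis) {
  String serialize() {
    return lastRollupCount + "|" + lastRollupTsMillis; // illustrative encoding
  }
  static CachedRollup deserialize(String value) {
    String[] parts = value.split("\\|");
    return new CachedRollup(Long.parseLong(parts[0]), Long.parseLong(parts[1]));
  }
}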
But, how do we know which counters to trigger rollups for? Let’s explore our Write and Read path to understand this better.
Add/Clear Count:
An add or clear count request writes durably to the TimeSeries Abstraction and updates the last-write-timestamp in the Rollup store. If the durability acknowledgement fails, clients can retry their requests with the same idempotency token without the risk of overcounting. Once the write is durable, we send a fire-and-forget request to trigger a rollup for the requested counter.
GetCount:
We return the last rolled-up count as a quick point-read operation, accepting the trade-off of potentially delivering a slightly stale count. We also trigger a rollup during the read operation to advance the last-rollup-timestamp, enhancing the performance of subsequent aggregations. This process also self-remediates a stale count if any previous rollups had failed.
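Putting the read path together, a simplified sketch could look like the following; the cache, store, and trigger interfaces are stand-ins for EVCache, the Rollup store, and the rollup event respectively, not our actual service code:

import java.util.OptionalLong;
import java.util.concurrent.CompletableFuture;

class GetCountPath {
  interface RollupCache   { OptionalLong getCount(String counterKey); }       // EVCache stand-in
  interface RollupStore   { long readLastRollupCount(String counterKey); }    // Cassandra stand-in
  interface RollupTrigger { void enqueue(String namespace, String counter); } // fire-and-forget event

  // Returns the last rolled-up count immediately and nudges the rollup
  // pipeline so that subsequent reads observe a fresher value.
  static long getCount(RollupCache cache, RollupStore store, RollupTrigger trigger,
                       String namespace, String counter) {
    String counterKey = namespace + ":" + counter;
    long count = cache.getCount(counterKey)
        .orElseGet(() -> store.readLastRollupCount(counterKey));
    CompletableFuture.runAsync(() -> trigger.enqueue(namespace, counter)); // advance the checkpoint
    return count;
  }
}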
With this approach, the counts continually converge to their latest value. Now, let’s see how we scale this approach to millions of counters and thousands of concurrent operations using our Rollup Pipeline.
Rollup Pipeline:
Each Counter-Rollup server operates a rollup pipeline to efficiently aggregate counts across millions of counters. This is where most of the complexity in Counter Abstraction comes in. In the following sections, we will share key details on how efficient aggregations are achieved.
Light-Weight Roll-Up Event: As seen in our Write and Read paths above, every operation on a counter sends a light-weight event to the Rollup server:
rollupEvent: {
  "namespace": "my_dataset",
  "counter": "counter123"
}
Note that this event does not include the increment. This is only an indication to the Rollup server that this counter has been accessed and now needs to be aggregated. Knowing exactly which specific counters need to be aggregated prevents scanning the entire event dataset for the purpose of aggregations.
In-Memory Rollup Queues: A given Rollup server instance runs a set of in-memory queues to receive rollup events and parallelize aggregations. In the first version of this service, we settled on using in-memory queues to reduce provisioning complexity, save on infrastructure costs, and make rebalancing the number of queues fairly straightforward. However, this comes with the trade-off of potentially missing rollup events in case of an instance crash. For more details, see the “Stale Counts” section in “Future Work.”
Minimize Duplicate Effort: We use a fast non-cryptographic hash like XXHash to ensure that the same set of counters end up on the same queue. Further, we minimize duplicate aggregation work by running a separate rollup stack with fewer, beefier instances.
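The sketch below illustrates the queue-selection idea; it substitutes a small FNV-1a implementation for XXHash to stay dependency-free, since the property that matters is that a given counter key always maps to the same queue index:

import java.nio.charset.StandardCharsets;

class QueueSelector {
  // FNV-1a stands in for a fast non-cryptographic hash such as XXHash; the
  // property that matters is that a given counter always maps to the same queue.
  static long fnv1a64(String key) {
    long hash = 0xcbf29ce484222325L;
    for (byte b : key.getBytes(StandardCharsets.UTF_8)) {
      hash ^= (b & 0xffL);
      hash *= 0x100000001b3L;
    }
    return hash;
  }

  static int queueFor(String namespace, String counter, int numQueues) {
    return Math.floorMod(Long.hashCode(fnv1a64(namespace + ":" + counter)), numQueues);
  }
}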
Availability and Race Conditions: Having a single Rollup server instance can minimize duplicate aggregation work but may create availability challenges for triggering rollups. If we choose to horizontally scale the Rollup servers, we allow threads to overwrite rollup values while avoiding any form of distributed locking mechanisms to maintain high availability and performance. This approach remains safe because aggregation occurs within an immutable window. Although the concept of now() may differ between threads, causing rollup values to sometimes fluctuate, the counts will eventually converge to an accurate value within each immutable aggregation window.
Rebalancing Queues: If we need to scale the number of queues, a simple Control Plane configuration update followed by a re-deploy is enough to rebalance the number of queues.
"eventual_counter_config": {
"queue_config": {
"num_queues" : 8, // change to 16 and re-deploy
...
Handling Deployments: During deployments, these queues shut down gracefully, draining all existing events first, while the new Rollup server instance starts up with potentially new queue configurations. There may be a brief period when both the old and new Rollup servers are active, but as mentioned before, this race condition is managed since aggregations occur within immutable windows.
Minimize Rollup Effort: Receiving multiple events for the same counter doesn’t mean rolling it up multiple times. We drain these rollup events into a Set, ensuring a given counter is rolled up only once during a rollup window.
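For example, a drain step could collapse duplicates like this (a sketch, with rollup events simplified to counter-key strings):

import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.BlockingQueue;

class RollupDrainer {
  // Draining into a Set collapses repeated events for the same counter, so a
  // counter is rolled up at most once per rollup window.
  static Set<String> drainDistinctCounters(BlockingQueue<String> queue, int maxBatch) {
    Set<String> distinct = new HashSet<>();
    String counterKey;
    while (distinct.size() < maxBatch && (counterKey = queue.poll()) != null) {
      distinct.add(counterKey);
    }
    return distinct;
  }
}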
Efficient Aggregation: Each rollup consumer processes a batch of counters simultaneously. Within each batch, it queries the underlying TimeSeries abstraction in parallel to aggregate events within specified time boundaries. The TimeSeries abstraction optimizes these range scans to achieve low millisecond latencies.
Dynamic Batching: The Rollup server dynamically adjusts the number of time partitions that need to be scanned based on the cardinality of the counters, to prevent overwhelming the underlying store with many parallel read requests.
Adaptive Back-Pressure: Each consumer waits for one batch to complete before issuing the rollups for the next batch. It adjusts the wait time between batches based on the performance of the previous batch. This approach provides back-pressure during rollups to prevent overwhelming the underlying TimeSeries store.
Handling Convergence:
To prevent low-cardinality counters from lagging too far behind and subsequently requiring scans over too many time partitions, they are kept in constant rollup circulation. For high-cardinality counters, continuously circulating them would consume excessive memory in our Rollup queues. This is where the last-write-timestamp mentioned previously plays a crucial role. The Rollup server inspects this timestamp to determine whether a given counter needs to be re-queued, ensuring that we continue aggregating until it has fully caught up with the writes.
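The re-queue decision itself reduces to a timestamp comparison, sketched here with illustrative types:

import java.time.Instant;

class ConvergenceCheck {
  // Writes newer than the last rollup checkpoint mean the counter has not yet
  // caught up, so it goes back into rollup circulation.
  static boolean needsRequeue(Instant lastWriteTs, Instant lastRollupTs) {
    return lastWriteTs.isAfter(lastRollupTs);
  }
}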
Now, let’s see how we leverage this counter type to provide an up-to-date current count in near real-time.
Experimental: Accurate Global Counter
We are experimenting with a slightly modified version of the Eventually Consistent counter. Again, take the term ‘Accurate’ with a grain of salt. The key difference between this type of counter and its counterpart is that the delta, representing the counts since the last-rolled-up timestamp, is computed in real-time.
And then, currentAccurateCount = lastRollupCount + delta
Aggregating this delta in real-time can impact the performance of this operation, depending on the number of events and partitions that need to be scanned to retrieve this delta. The same principle of rolling up in batches applies here to prevent scanning too many partitions in parallel. Conversely, if the counters in this dataset are accessed frequently, the time gap for the delta remains narrow, making this approach of fetching current counts quite effective.
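A sketch of that computation, assuming a hypothetical read view over the TimeSeries event store:

import java.time.Instant;

class AccurateGetCount {
  // Hypothetical view over the TimeSeries event store: sums increments in
  // the half-open range [from, to).
  interface EventStore { long sumDeltas(String counterKey, Instant from, Instant to); }

  // currentAccurateCount = lastRollupCount + delta, where the delta covers
  // everything written since the last rollup checkpoint.
  static long currentAccurateCount(EventStore events, String counterKey,
                                   long lastRollupCount, Instant lastRollupTs, Instant now) {
    long delta = events.sumDeltas(counterKey, lastRollupTs, now);
    return lastRollupCount + delta;
  }
}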
Now, let’s see how all this complexity is managed by having a unified Control Plane configuration.
Control Plane
The Data Gateway Platform Control Plane manages control settings for all abstractions and namespaces, including the Counter Abstraction. Below is an example of a control plane configuration for a namespace that supports eventually consistent counters with low cardinality:
"persistence_configuration": [
{
"id": "CACHE", // Counter cache config
"scope": "dal=counter",
"physical_storage": {
"type": "EVCACHE", // type of cache storage
"cluster": "evcache_dgw_counter_tier1" // Shared EVCache cluster
}
},
{
"id": "COUNTER_ROLLUP",
"scope": "dal=counter", // Counter abstraction config
"physical_storage": {
"type": "CASSANDRA", // type of Rollup store
"cluster": "cass_dgw_counter_uc1", // physical cluster name
"dataset": "my_dataset_1" // namespace/dataset
},
"counter_cardinality": "LOW", // supported counter cardinality
"config": {
"counter_type": "EVENTUAL", // Type of counter
"eventual_counter_config": { // eventual counter type
"internal_config": {
"queue_config": { // adjust w.r.t cardinality
"num_queues" : 8, // Rollup queues per instance
"coalesce_ms": 10000, // coalesce duration for rollups
"capacity_bytes": 16777216 // allocated memory per queue
},
"rollup_batch_count": 32 // parallelization factor
}
}
}
},
{
"id": "EVENT_STORAGE",
"scope": "dal=ts", // TimeSeries Event store
"physical_storage": {
"type": "CASSANDRA", // persistent store type
"cluster": "cass_dgw_counter_uc1", // physical cluster name
"dataset": "my_dataset_1", // keyspace name
},
"config": {
"time_partition": { // time-partitioning for events
"buckets_per_id": 4, // event buckets within
"seconds_per_bucket": "600", // smaller width for LOW card
"seconds_per_slice": "86400", // width of a time slice table
},
"accept_limit": "5s", // boundary for immutability
},
"lifecycleConfigs": {
"lifecycleConfig": [
{
"type": "retention", // Event retention
"config": {
"close_after": "518400s",
"delete_after": "604800s" // 7 day count event retention
}
}
]
}
}
]
Using such a control plane configuration, we compose multiple abstraction layers using containers deployed on the same host, with each container fetching configuration specific to its scope.
Provisioning
As with the TimeSeries abstraction, our automation uses a set of user inputs regarding their workload and cardinalities to arrive at the right set of infrastructure and related control plane configuration. You can learn more about this process in a talk given by one of our stunning colleagues, Joey Lynch: How Netflix optimally provisions infrastructure in the cloud.
Performance
At the time of writing this blog, the service was processing close to 75K count requests/second globally across its different API endpoints and datasets, while providing single-digit millisecond latencies for all of its endpoints.
Future Work
While our system is robust, we still have work to do in making it more reliable and enhancing its features. Some of that work includes:
- Regional Rollups: Cross-region replication issues can result in missed events from other regions. An alternate strategy involves establishing a rollup table for each region, and then tallying them in a global rollup table. A key challenge in this design would be effectively communicating the clearing of the counter across regions.
- Error Detection and Stale Counts: Excessively stale counts can occur if rollup events are lost or if a rollup fails and isn’t retried. This isn’t an issue for frequently accessed counters, as they remain in rollup circulation. This issue is more pronounced for counters that aren’t accessed frequently. Typically, the initial read for such a counter will trigger a rollup, self-remediating the issue. However, for use cases that cannot accept potentially stale initial reads, we plan to implement improved error detection, rollup handoffs, and durable queues for resilient retries.
Conclusion
Distributed counting remains a challenging problem in computer science. In this blog, we explored multiple approaches to implement and deploy a Counting service at scale. While there may be other methods for distributed counting, our goal has been to deliver blazing fast performance at low infrastructure costs while maintaining high availability and providing idempotency guarantees. Along the way, we made various trade-offs to meet the diverse counting requirements at Netflix. We hope you found this blog post insightful.
Stay tuned for Part 3 of Composite Abstractions at Netflix, where we’ll introduce our Graph Abstraction, a new service being built on top of the Key-Value Abstraction and the TimeSeries Abstraction to handle high-throughput, low-latency graphs.
Acknowledgments
Special thanks to our stunning colleagues who contributed to the Counter Abstraction’s success: Joey Lynch, Vinay Chella, Kaidan Fullerton, Tom DeVoe, Mengqing Wang, Varun Khaitan
How We Updated ScyllaDB Drivers for Tablets Elasticity
Rethinking ScyllaDB’s shard-aware drivers for our new Raft-based tablets architecture ScyllaDB recently released new versions of our drivers (Rust, Go, Python…) that will provide a nice throughput and latency boost when used in concert with our new tablets architecture. In this blog post, I’d like to share details about how the drivers’ query routing scheme has changed, and why it’s now more beneficial than ever to use ScyllaDB drivers instead of Cassandra drivers. Before we dive into that, let’s take a few steps back. How do ScyllaDB drivers work? And what’s meant by “tablets”? When we say “drivers,” we’re talking about the libraries that your application uses to communicate with ScyllaDB. ScyllaDB drivers use the CQL protocol, which is inherited from Cassandra (ScyllaDB is compatible with Cassandra as well as DynamoDB). It’s possible to use Cassandra drivers with ScyllaDB, but we recommend using ScyllaDB’s drivers for optimal performance. Some of these drivers were forked from Cassandra drivers. Others, such as our Rust driver, were built from the ground up. The interesting thing about ScyllaDB drivers is that they are “shard-aware.” What does that mean? First, it’s important to understand that ScyllaDB is built with a shard-per-core architecture: every node is a multithreaded process whose every thread performs some relatively independent work. Each piece of data stored in a ScyllaDB database is bound to a specific CPU core. ScyllaDB shard-awareness allows client applications to perform load balancing following our shard-per-core architecture. Shard-aware drivers establish one connection per shard, allowing them to load balance and route queries directly to the single CPU core owning it. This optimization further optimizes latency and is also very friendly towards CPU caches. Before we get into how these shard-aware drivers support tablets, let’s take a brief look at ScyllaDB’s new “tablets” replication architecture in case you’re not yet familiar with it. We replaced vNode-based replication with tablets-based replication to enable more flexible load balancing. Each table is split into smaller fragments (“tablets”) to evenly distribute data and load across the system. Tablets are replicated to multiple ScyllaDB nodes for high availability and fault tolerance. This new approach separates token ownership from servers – ultimately allowing ScyllaDB to scale faster and in parallel. For more details on tablets, see these blog posts. How We’ve Modified Shard-Aware Drivers for Tablets With all that defined, let’s move to how we’ve modified our shard-aware drivers for tablets. Specifically, let’s look at how our query routing has changed. Before tablets Without tablets, when you made a select, insert, update, or delete statement, this query was routed to a specific node. We determined this routing by calculating the hashes of the partition key. The ring represented the data, the hashes were the partition keys, and this ring was split into many vNodes. When data was replicated, one piece of the data was stored on many nodes. The problem with this approach is that this mapping was fairly static. We couldn’t move the vNodes around and the driver didn’t expect this mapping to change. It was all very static. Users wanted to add and remove capacity faster than this vNodes paradigm allowed. With tablets With tablets, ScyllaDB now maintains metadata about the tablets. The tablet contains information about which table each tablet represents and the range of data it stores. 
Each tablet contains information about which replicas hold the data within the specified start and end range. Interestingly, we decided to have the driver start off without any knowledge of where data is located – even though that information is stored in the tablet mapping table. We saw that scanning that mapping would cause a performance hit for large deployments. Instead, we took a different approach: we learn the routing information “lazily” and update it dynamically. For example, assume that the driver wants to execute a query. The driver will make a request to some random node and shard – this initial request is guessing, so it might be the incorrect node and shard. All the nodes know which node owns which data. Even if the driver contacts the wrong node, it will forward the request to the correct node. It will also tell the driver which node owns that data, and the driver will update its tablet metadata to include that information. The next time the driver wants to access data on that tablet, it will reference this tablet metadata and send the query directly to the correct node. If the tablet containing that data moves (e.g., because a new node is added or removed), the driver will continue sending statements to the old replicas – but since at least one of them is not currently a replica, it will return the correct information about this tablet. The alternative would be to refresh all of the tablet metadata periodically (maybe every five seconds), which could place significant strain on the system. Another benefit of this approach: the driver doesn’t have to store metadata about all tablets. For example, if the driver only queries one table, it will only persist information about the tablets for that specific table. Ultimately, this approach enables very fast startup, even with 100K entries in the tablets table. Why use tablet-aware drivers When using ScyllaDB tablets, it’s more important than ever to use ScyllaDB shard-aware – and now also tablet-aware – drivers instead of Cassandra drivers. The existing drivers will still work, but they won’t work as efficiently because they won’t know where each tablet is located. Using the latest ScyllaDB drivers should provide a nice throughput and latency boost.Making Effective Partitions for ScyllaDB Data Modeling
Learn how to ensure your tables are perfectly partitioned to satisfy your queries – in this excerpt from the book “ScyllaDB in Action.” Editor’s note We’re thrilled to share the following excerpt from Bo Ingram’s informative – and fun! – new book on ScyllaDB: ScyllaDB in Action. It’s available now via Manning and Amazon. You can also access a 3-chapter excerpt for free, compliments of ScyllaDB. Get the first 3 book chapters, free You might have already experienced Bo’s expertise and engaging communication style in his blog How Discord Stores Trillions of Messages or ScyllaDB Summit talks How Discord Migrated Trillions of Messages from Cassandra to ScyllaDB and So You’ve Lost Quorum: Lessons From Accidental Downtime If not, you should 😉 And if you want to learn more from Bo, join him at our upcoming Masterclass: Data Modeling for Performance Masterclass. We’ve ordered boxes of print books and will be giving them out! Watch Bo at the “Data Modeling for Performance” Masterclass — Now On Demand The following is an excerpt from Chapter 3; it’s reprinted here with permission of the publisher. Read more from Chapter 3 here *** When analyzing the queries your application needs, you identified several tables for your schema. The next step in schema design is to ask how can these tables be uniquely identified and partitioned to satisfy the queries?. The primary key contains the row’s partition key — you learned in chapter 2 that the primary key determines a row’s uniqueness. Therefore, before you can determine the partition key, you need to know the primary key. PRIMARY KEYS First, you should check to see if what you’re trying to store contains a property that is popularly used for its uniqueness. Cars, for example, have a vehicle identification number, that’s unique per car. Books (including this one!) have an ISBN (international standard book number) to uniquely identify them. There’s no international standard for identifying an article or an author, so you’ll need to think a little harder to find the primary and partition keys. ScyllaDB does provide support for generating unique identifiers, but they come with some drawbacks that you’ll learn about in the next chapter. Next, you can ask if any fields can be combined to make a good primary key. What could you use to determine a unique article? The title might be reused, especially if you have unimaginative writers. The content would ideally be unique per article, but that’s a very large value for a key. Perhaps you could use a combination of fields — maybe date, author, and title? That probably works, but I find it’s helpful to look back at what you’re trying to query. When your application runs theRead Article
query, it’s trying to read only a single
article. Whatever is executing that query, probably a web server
responding to the contents of a URL, is trying to load that
article, so it needs information that can be stored in a URL. Isn’t
it obnoxious when you go to paste a link somewhere and the link
feels like it’s a billion characters long? To load an article, you
don’t want to have to keep track of the author ID, title, and date.
That’s why using an ID as a primary key is often a strong choice.
Providing a unique identifier satisfies the uniqueness requirement
of a primary key, and they can potentially encode other information
inside them, such as time, which can be used for relative ordering.
You’ll learn more about potential types of IDs in the next chapter
when you explore data types. What uniquely identifies an author?
You might think an email, but people sometimes change their email
addresses. Supplying your own unique identifier works well again
here. Perhaps it’s a numeric ID, or maybe it’s a username, but
authors, as your design stands today, need extra information to
differentiate them from other authors within your database. Article
summaries, like the other steps you’ve been through, are a little
more interesting. For the various article summary tables, if you
try to use the same primary key as articles, an ID, you’re going to
run into trouble. If an ID alone makes an article unique in the
articles
table, then presumably it suffices for the
index tables. That turns out to not be the case. An ID can still
differentiate uniqueness, but you also want to query by at least
the partition key to have a performant query, and if an ID is the
partition key, that doesn’t satisfy the use cases for querying by
author, date, or score. Because the partition key is contained
within the primary key, you’ll need to include those fields in your
primary key (figure 3.12). Taking
article_summaries_by_author
, your primary key for that
field would become author and your article ID. Similarly, the other
two tables would have the date and the article ID for
article_summaries_by_dat
e, and
article_summaries_by_score
would use the score and the
article ID for its primary key.
Figure 3.12 You sometimes need to add fields to an already-unique
primary key to use them in a partition key, especially when
denormalizing data. With your primary keys figured out, you can
move forward to determining your partition keys and potentially
adjusting your primary keys. PARTITION KEYS You learned in the last
chapter that the first entry in the primary key serves as the
partition key for the row (you’ll see later how to build a
composite primary key). A good partition key distributes data
evenly across the cluster and is used by the queries against that
table. It stores a relatively small amount of data (a good rule of
thumb is that 100 MB is a large partition, but it depends on the
use case). I’ve listed the tables and their primary key; knowing
the primary keys, you can now rearrange them to specify your
partition keys. What would be a good partition key for each of
these tables? Table 3.3 Each table has a primary key from
which the partition key is extracted For the articles
and authors
tables, it’s very straightforward. There
is one column in the primary key; therefore, the partition key is
the only column. If only everything was that easy!
TIP: By only having the ID as the primary key and
the partition key, you can look up rows only by their ID. A query
where you want to get multiple rows wouldn’t work with this
approach, because you’d be querying across partitions, bringing in
extra nodes and hampering performance. Instead, you want to query
by a partition key that would contain multiple rows, as you’ll do
with article summaries. The article summaries tables, however,
present you with a choice. Given a primary key of an ID and an
author, for article_summaries_by_author,
which should
be the partition key? If you choose ID, Scylla will distribute the
rows for this table around the cluster based on the ID of the
article. This distribution would mean that if you wanted to load
all articles by their author, the query would have to hit nodes all
across the cluster. That behavior is not efficient. If you
partitioned your data by the author, a query to find all articles
written by a given author would hit only the nodes that store that
partition, because you’re querying within that partition (figure
3.13). You almost always want to query by at least the partition
key — it is critical for good performance. Because the use case for
this table is to find articles by who wrote them, the author is the
choice for the partition key. This distribution makes your primary
key author_id, id
because the partition key is the
first entry in the primary key.
Figure 3.13 The partition key for each table should match the
queries you previously determined to achieve the best performance.
article_summaries_by_date
and
article_summaries_by_score,
in what might be the least
surprising statement of the book, present a similar choice as
article_summaries_by_author
with a similar solution.
Because you’re querying article_summaries_by_date
by
the date of the article, you want the data partitioned by it as
well, making the primary key date, id, or score, id for
article_summaries_by_score.
Tip: Right-sizing
partition keys: If a partition key is too large,
data will be unevenly distributed across the cluster. If a
partition key is too small, range scans that load multiple rows
might need several queries to load the desired amount of data.
Consider querying for articles by date — if you publish one article
a week but your partition key is per day, then querying for the
most recent articles will require queries with no results six times
out of seven. Partitioning your data is a balancing exercise.
Sometimes, it’s easy. Author is a natural partition key for
articles by author, whereas something like date might require some
finessing to get the best performance in your application. Another
problem to consider is one partition taking an outsized amount of
traffic — more than the node can handle. In the next chapters, as
you refine your design, you’ll learn how to adjust your partition
key to fit your database just right. After looking closely at the
tables, your primary keys are now ordered correctly to partition
the data. For the article summaries tables, your primary keys are
divided into two categories — partition key and not-partition key.
The leftover bit has an official name and performs a specific
function in your table. Let’s take a look at the not-partition
keys — clustering keys. CLUSTERING KEYS A clustering key is the
non-partition key part of a table’s primary key that defines the
ordering of the rows within the table. In the previous chapter,
when you looked at your example table, you noticed that Scylla had
filled in the table’s implicit configuration, ordering the
non-partition key columns — the clustering keys — by ascending
order. CREATE TABLE initial.food_reviews ( restaurant text,
ordered_at timestamp, food text, review text, PRIMARY KEY
(restaurant, ordered_at, food) ) WITH CLUSTERING ORDER BY
(ordered_at ASC, food ASC); Within the partition, each row is
sorted by its time ordered and then by the name of the food, each
in an ascending sort (so earlier order times and alphabetically by
food). Left to its own devices, ScyllaDB always defaults to the
ascending natural order of the clustering keys. When creating
tables, if you have clustering keys, you need to be intentional
about specifying their order to make sure that you’re getting
results that match your requirements. Consider
article_summaries_by_author
. The purpose of that table
is to retrieve the articles for a given author, but do you want to
see their oldest articles first or their newest ones? By default,
Scylla is going to sort the table by id ASC
, giving
you the old articles first. When creating your summary tables,
you’ll want to specify their sort orders so that you get the newest
articles first — specifying id DESC.
You now have
tables defined, and with those tables, the primary keys and
partition keys are set to specify uniqueness, distribute the data,
and provide an ordering based on your queries. Your queries,
however, are looking for more than just primary keys — authors have
named, and articles have titles and content. Not only do you need
to store these fields, you need to specify the structure of that
data. To accomplish this definition, you’ll use data types. In the
next chapter, you’ll learn all about them and continue practicing
query-first design. Database Internals: Optimizing Memory Management
How databases can get better performance by optimizing memory management The following blog post is an excerpt from Chapter 3 of the Database Performance at Scale book, which is available for free. This book sheds light on often overlooked factors that impact database performance at scale. Get the complete book, free Memory management is the central design point in all aspects of programming. Even comparing programming languages to one another always involves discussions about the way programmers are supposed to handle memory allocation and freeing. No wonder memory management design affects the performance of a database so much. Applied to database engineering, memory management typically falls into two related but independent subsystems: memory allocation and cache control. The former is in fact a very generic software engineering issue, so considerations about it are not extremely specific to databases (though they are crucial and are worth studying). Opposite to that, the latter topic is itself very broad, affected by the usage details and corner cases. Respectively, in the database world, cache control has its own flavor. Allocation The manner in which programs or subsystems allocate and free memory lies at the core of memory management. There are several approaches worth considering. As illustrated by Figure 3-2, a so-called “log-structured allocation” is known from filesystems where it puts sequential writes to a circular log on the persisting storage and handles updates the very same way. At some point, this filesystem must reclaim blocks that became obsolete entries in the log area to make some more space available for future writes. In a naive implementation, unused entries are reclaimed by rereading and rewriting the log from scratch; obsolete blocks are then skipped in the process. Figure 3-2: A log-structured allocation puts sequential writes to a circular log on the persisting storage and handles updates the very same way A memory allocator for naive code can do something similar. In its simplest form, it would allocate the next block of memory by simply advancing a next-free pointer. Deallocation would just need to mark the allocated area as freed. One advantage of this approach is the speed of allocation. Another is the simplicity and efficiency of deallocation if it happens in FIFO order or affects the whole allocation space. Stack memory allocations are later released in the order that’s reverse to allocation, so this is the most prominent and the most efficient example of such an approach. Using linear allocators as general-purpose allocators can be more problematic because of the difficulty of space reclamation. To reclaim space, it’s not enough to just mark entries as free. This leads to memory fragmentation, which in turn outweighs the advantages of linear allocation. So, as with the filesystem, the memory must be reclaimed so that it only contains allocated entries and the free space can be used again. Reclamation requires moving allocated entries around – a process that changes and invalidates their previously known addresses. In naive code, the locations of references to allocated entries (addresses stored as pointers) are unknown to the allocator. Existing references would have to be patched to make the allocator action transparent to the caller; that’s not feasible for a general-purpose allocator like malloc. Logging allocator use is tied to the programming language selection. Some RTTIs, like C++, can greatly facilitate this by providing move-constructors. 
However, passing pointers to libraries that are outside of your control (e.g., glibc) would still be an issue. Another alternative is adopting a strategy of pool allocators, which provide allocation spaces for allocation entries of a fixed size (see Figure 3-3). By limiting the allocation space that way, fragmentation can be reduced. A number of general-purpose allocators use pool allocators for small allocations. In some cases, those application spaces exist on a per-thread basis to eliminate the need for locking and improve CPU cache utilization. Figure 3-3: Pool allocators provide allocation spaces for allocation entries of a fixed size. Fragmentation is reduced by limiting the allocation space This pool allocation strategy provides two core benefits. First, it saves you from having to search for available memory space. Second, it alleviates memory fragmentation because it pre-allocates in memory a cache for use with a collection of object sizes. Here’s how it works to achieve that: The region for each of the sizes has fixed-size memory chunks that are suitable for the contained objects, and those chunks are all tracked by the allocator. When it’s time for the allocator to actually allocate memory for a certain type of data object, it’s typically possible to use a free slot (chunk) within one of the existing memory slabs. ( Note: We are using the term “slab” to mean one or more contiguous memory pages that contain pre-allocated chunks of memory.) When it’s time for the allocator to free the object’s memory, it can simply move that slot over to the containing slab’s list of unused/free memory slots. That memory slot (or some other free slot) will be removed from the list of free slots whenever there’s a call to create an object of the same type (or a call to allocate memory of the same size). The best allocation approach to pick heavily depends upon the usage scenario. One great benefit of a log-structured approach is that it handles fragmentation of small sub-pools in a more efficient way. Pool allocators, on the other hand, generate less background load on the CPU because of the lack of compacting activity. Cache control When it comes to memory management in a software application that stores lots of data on disk, you cannot overlook the topic of cache control. Caching is always a must in data processing, and it’s crucial to decide what and where to cache. If caching is done at the I/O level, for both read/write and mmap, caching can become the responsibility of the kernel. The majority of the system’s memory is given over to the page cache. The kernel decides which pages should be evicted when memory runs low, when pages need to be written back to disk, and controls read-ahead. The application can provide some guidance to the kernel using the madvise(2) and fadvise(2) system calls. The main advantage of letting the kernel control caching is that great effort has been invested by the kernel developers over many decades into tuning the algorithms used by the cache. Those algorithms are used by thousands of different applications and are generally effective. The disadvantage, however, is that these algorithms are general-purpose and not tuned to the application. The kernel must guess how the application will behave next. Even if the application knows differently, it usually has no way to help the kernel guess correctly. This results in the wrong pages being evicted, I/O scheduled in the wrong order, or read-ahead scheduled for data that will not be consumed in the near future. 
Next, doing the caching at the I/O level interacts with the topic often referred to as IMR – in memory representation. No wonder that the format in which data is stored on disk differs from the form the same data is allocated in memory as objects. The simplest reason why it’s not the same is byte-ordering. With that in mind, if the data is cached once it’s read from the disk, it needs to be further converted or parsed into the object used in memory. This can be a waste of CPU cycles, so applications may choose to cache at the object level. Choosing to cache at the object level affects a lot of other design points. With that, the cache management is all on the application side including cross-core synchronization, data coherence, invalidation, etc. Next, since objects can be (and typically are) much smaller than the average I/O size, caching millions and billions of those objects requires a collection selection that can handle it (we’ll get to this in a follow-up blog). Finally, caching on the object level greatly affects the way I/O is done. Read about ScyllaDB’s CachingNoSQL Data Modeling: Application Design Before Schema Design
Learn how to implement query-first design to build a ScyllaDB schema for a sample application – in this excerpt from the book “ScyllaDB in Action.” You might have already experienced Bo’s expertise and engaging communication style in his blog How Discord Stores Trillions of Messages or ScyllaDB Summit talks How Discord Migrated Trillions of Messages from Cassandra to ScyllaDB and So You’ve Lost Quorum: Lessons From Accidental Downtime If not, you should 😉 And if you want to learn more from Bo, join him at our upcoming Masterclass: Data Modeling for Performance Masterclass. We’ve ordered boxes of print books and will be giving them out! Watch Bo at the “Data Modeling for Performance” Masterclass — Now On Demand The following is an excerpt from Chapter 3; it’s reprinted here with permission of the publisher. *** When designing a database schema, you need to create a schema that is synergistic with your database. If you consider Scylla’s goals as a database, it wants to distribute data across the cluster to provide better scalability and fault tolerance. Spreading the data means queries involve multiple nodes, so you want to design your application to make queries that use the partition key, minimizing the nodes involved in a request. Your schema needs to fit these constraints: It needs to distribute data across the cluster It should also query the minimal amount of nodes for your desired consistency level There’s some tension in these constraints — your schema wants to spread data across the cluster to balance load between the nodes, but you want to minimize the number of nodes required to serve a query. Satisfying these constraints can be a balancing act — do you have smaller partitions that might require more queries to aggregate together, or do you have larger partitions that require fewer queries, but spread it potentially unevenly across the cluster? In figure 3.1, you can see the cost of a query that utilizes the partition key and queries across the minimal amount of nodes versus one that doesn’t use the partition key, necessitating scanning each node for matching data. Using the partition key in your query allows the coordinator — the node servicing the request — to direct queries to nodes that own that partition, lessening the load on the cluster and returning results faster. Figure 3.1 Using the partition key minimizes the number of nodes required to serve the request. The aforementioned design constraints are each related to queries. You want your data to be spread across the cluster so that your queries distribute the load amongst multiple nodes. Imagine if all of your data was clustered on a small subset of nodes in your cluster. Some nodes would be working quite hard, whereas others might not be taking much traffic. If some of those heavily utilized nodes became overwhelmed, you could suffer quite degraded performance, as because of the imbalance, many queries could be unable to complete. However, you also want to minimize the number of nodes hit per query to minimize the work your query needs to do; a query that uses all nodes in a very large cluster would be very inefficient. These query-centric constraints necessitate a query-centric approach to design. How you query Scylla is a key component of its performance, and since you need to consider the impacts of your queries across multiple dimensions, it’s critical to think carefully about how you query Scylla. 
When designing schemas in Scylla, it’s best to practice an approach called query-first design, where you focus on the queries your application needs to make and then build your database schema around that. In Scylla, you structure your data based on how you want to query it — query-first design helps you do exactly that.
Your query-first design toolbox
In query-first design, you take a set of application requirements and ask yourself a series of questions that guide you through translating the requirements into a schema in ScyllaDB. Each of these questions builds upon the next, iteratively guiding you through building an effective ScyllaDB schema. These questions include the following:
- What are the requirements for my application?
- What are the queries my application needs to run to meet these requirements?
- What tables are these queries querying?
- How can these tables be uniquely identified and partitioned to satisfy the queries?
- What data goes inside these tables?
- Does this design match the requirements?
- Can this schema be improved?
This process is visualized in figure 3.2, showing how you start with your application requirements and ask yourself a series of questions, guiding you from your requirements to the queries you need to run and, ultimately, to a fully designed ScyllaDB schema ready to store data effectively.
Figure 3.2 Query-first design guides you through taking application requirements and converting them to a ScyllaDB schema.
You begin with requirements and then use the requirements to identify the queries you need to run. These queries are seeking something — those “somethings” need to be stored in tables. Your tables need to be partitioned to spread data across the cluster, so you determine that partitioning to stay within your requirements and database design constraints. You then specify the fields inside each table, filling it out. At this point, you can check two things: Does the design match the requirements? Can it be improved?
This up-front design is important because in ScyllaDB changing your schema to match new use cases can be a high-friction operation. While Scylla supports new query patterns via some of its features (which you’ll learn about in chapter 7), these come at an additional performance cost, and if they don’t fit your needs, might necessitate a full manual copy of data into a new table. It’s important to think carefully about your design: not only what it needs to be, but also what it could be in the future. You start by extracting the queries from your requirements and expanding your design until you have a schema that fits both your application and ScyllaDB. To practice query-first design in Scylla, let’s take the restaurant review application introduced at the beginning of the chapter and turn it into a ScyllaDB schema.
The sample application requirements
In the last chapter, you took your restaurant reviews and stored them inside ScyllaDB. You enjoyed working with the database, and as you went to more places, you realized you could combine your two great loves — restaurant reviews and databases (if these aren’t your two great loves, play along with me). You decide to build a website to share your restaurant reviews. Because you already have a ScyllaDB cluster, you choose to use that (this is a very different book if you pick otherwise) as the storage for your website. The first step to query-first design is identifying the requirements for your application, as seen in figure 3.4.
Figure 3.4 You begin query-first design by determining your application’s requirements.
After ruminating on your potential website, you identify the features it needs, and most importantly, you give it a name — Restaurant Reviews. It does what it says! Restaurant Reviews has the following initial requirements:
- Authors post articles to the website
- Users view articles to read restaurant reviews
- Articles contain a title, an author, a score, a date, a gallery of images, the review text, and the restaurant
- A review’s score is between 1 and 10
- The home page contains a summary of articles sorted by most recent, showing the title, author name, score, and one image
- The home page links to articles
- Authors have a dedicated page containing summaries of their articles
- Authors have a name, bio, and picture
- Users can view a list of article summaries sorted by the highest review score
You have a hunch that as time goes by, you’ll want to add more features to this application and use those features to learn more about ScyllaDB. For now, these features give you a base to practice decomposing requirements and building your schema — let’s begin!
Determining the queries
Next, you ask what are the queries my application needs to run to meet these requirements?, as seen in figure 3.5. Your queries drive how you design your database schema; therefore, it is critical to understand your queries at the beginning of your design.
Figure 3.5 Next, you take your requirements and use them to determine your application’s queries.
For identifying queries, you can use the familiar CRUD operations — create, read, update, and delete — as verbs. These queries will act on nouns in your requirements, such as authors or articles. Occasionally, you’ll want to filter your queries — you can notate this filtering with “by” followed by the filter condition. For example, if your app needed to load events on a given date, you might use a Read Events by Date query. If you take a look at your requirements, you’ll see several queries you’ll need.
TIP: These aren’t the actual queries you’ll run; those are written in CQL, as seen in chapter 2, and look more like SELECT * FROM your_cool_table WHERE your_awesome_primary_key = 12;. These are descriptions of what you’ll need to query — in a later chapter when you finish your design, you’ll turn these into actual CQL queries.
The first requirement is “Authors post articles to the website,” which sounds an awful lot like a process that would involve inserting an article into a database. Because you insert articles in the database via a query, you will need a Create Article statement. You might be asking at this point — what is an article? Although other requirements discuss these fields, you should skip that concern for the moment. Focus first on what queries you need to run, and then you’ll later figure out the needed fields.
The second requirement is “Users view articles to read restaurant reviews.” Giving users unfettered access to the database is a security no-go, so the app needs to load an article to display to a user. This functionality suggests a Read Article query (which is different than the user perusing the article), which you can use to retrieve an article for a user.
The following two requirements refer to the data you need to store and not a novel way to access them:
- Articles contain a title, an author, a score, a date, a gallery of images, the review text, and the restaurant
- A review’s score is between 1 and 10
Articles need certain fields, and each article is associated with a review score that fits within specified parameters. You can save these requirements for later when you fill out what fields are needed in each table.
The next relevant requirement says “The home page contains a summary of articles sorted by most recent, showing the title, author name, score, and one image.” The home page shows article summaries, which you’ll need to load by their date, sorted by most recent – Read Article Summaries by Date. Article summaries, at first glance, look a lot like articles. Because you’re querying different data, and you also need to retrieve summaries by their time, you should consider them as different queries. Querying for an article:
- loads a title, author, score, date, a gallery of images, the review text, and the restaurant
- retrieves one specific article
On the other hand, loading the most recent article summaries:
- loads only the title, author name, score, and one image
- loads several summaries, sorted by their publishing date
Perhaps they can run against the same table, but you can see if that’s the case further on in the exercise. When in doubt, it’s best to not over-consolidate. Be expansive in your designs, and if there’s duplication, you can reduce it when refining your designs. As you work through implementing your design, you might discover reasons for what seems to be unnecessarily duplicated in one step to be a necessary separation later.
Following these requirements is “the home page links to articles”, which makes explicit that the article summaries link to the article — you’ll look closer at this one when you determine what fields you need.
The next two requirements are about authors. The website will contain a page for each author — presumably only you at the moment, but you have visions of a media empire. This author page will contain article summaries for each author — meaning you’ll need to Read Article Summaries by Author. In the last requirement, there’s data that each author has. You can study the specifics of these in a moment, but it means that you’ll need to read information about the author, so you will need a Read Author query.
For the last requirement — ”Users can view a list of article summaries sorted by the highest review score” — you’ll need a way to surface article summaries sorted by their scores. This approach requires a Read Article Summaries by Score query.
TIP: What would make a good partition key for reading articles sorted by score? It’s a tricky problem; you’ll learn how to attack it shortly.
Having analyzed the requirements, you’ve determined six queries your schema needs to support:
- Create Article
- Read Article
- Read Article Summaries by Date
- Read Article Summaries by Author
- Read Article Summaries by Score
- Read Author
You might notice a problem here — where do article summaries and authors get created? How can you read them if nothing makes them exist? Requirement lists often have implicit requirements — because you need to read article summaries, they have to be created somehow. Go ahead and add a Create Article Summary and Create Author query to your list. You now have eight queries from the requirements you created, listed in table 3.1.
There’s a joke that asks “How do you draw an owl?” — you draw a couple of circles, and then you draw the rest of the owl. Query-first design sometimes feels similar. You need to map your queries to a database schema that is not only effective in meeting the application’s requirements but is performant and uses ScyllaDB’s benefits and features. Drawing queries out of requirements is often straightforward, whereas designing the schema from those queries requires balancing both application and database concerns. Let’s take a look at some techniques you can apply as you design a database schema…
New cassandra_latest.yaml configuration for a top performant Apache Cassandra®
Welcome to our deep dive into the latest advancements in Apache Cassandra® 5.0, specifically focusing on the cassandra_latest.yaml configuration that is available for new Cassandra 5.0 clusters.
This blog post will walk you through the motivation behind these changes, how to use the new configuration, and the benefits it brings to your Cassandra clusters.
Motivation
The primary motivation for introducing cassandra_latest.yaml is to bridge the gap between maintaining backward compatibility and leveraging the latest features and performance improvements. This configuration file addresses the varying needs of the following groups for new Cassandra 5.0 clusters:
- Cassandra Developers: who want to push new features but face challenges due to backward compatibility constraints.
- Operators: who prefer stability and minimal disruption during upgrades.
- Evangelists and New Users: who seek the latest features and performance enhancements without worrying about compatibility.
Using cassandra_latest.yaml
Using cassandra_latest.yaml is straightforward. It involves copying the cassandra_latest.yaml content to your cassandra.yaml or pointing the cassandra.config JVM property to the cassandra_latest.yaml file.
This configuration is designed for new Cassandra 5.0 clusters (or those evaluating Cassandra), ensuring they get the most out of the latest features in Cassandra 5.0 and performance improvements.
Key changes and features
Key Cache Size
- Old: Defaults to the smaller of 5% of the heap or 100 MB
- Latest: Explicitly set to 0
Impact: Setting the key cache size to 0 in the latest configuration avoids performance degradation with the new SSTable format. This change is particularly beneficial for clusters using the new SSTable format, which doesn’t require key caching in the same way as the old format. Key caching was used to reduce the time it takes to find a specific key in Cassandra storage.
Commit Log Disk Access Mode
- Old: Set to legacy
- Latest: Set to auto
Impact: The auto setting optimizes the commit log disk access mode based on the available disks, potentially improving write performance. It can automatically choose the best mode (e.g., direct I/O) depending on the hardware and workload, leading to better performance without manual tuning.
Memtable Implementation
- Old: Skiplist-based
- Latest: Trie-based
Impact: The trie-based memtable implementation reduces garbage collection overhead and improves throughput by moving more metadata off-heap. This change can lead to more efficient memory usage and higher write performance, especially under heavy load.
create table … with memtable = {'class': 'TrieMemtable', … }
Memtable Allocation Type
- Old: Heap buffers
- Latest: Off-heap objects
Impact: Using off-heap objects for memtable allocation reduces the pressure on the Java heap, which can improve garbage collection performance and overall system stability. This is particularly beneficial for large datasets and high-throughput environments.
Trickle Fsync
- Old: False
- Latest: True
Impact: Enabling trickle fsync improves performance on SSDs by periodically flushing dirty buffers to disk, which helps avoid sudden large I/O operations that can impact read latencies. This setting is particularly useful for maintaining consistent performance in write-heavy workloads.
SSTable Format
- Old: big
- Latest: bti (trie-indexed structure)
Impact: The new BTI format is designed to improve read and write performance by using a trie-based indexing structure. This can lead to faster data access and more efficient storage management, especially for large datasets.
sstable:
  selected_format: bti
default_compression: zstd
compression:
  zstd:
    enabled: true
    chunk_length: 16KiB
    max_compressed_length: 16KiB
Default Compaction Strategy
- Old: STCS (Size-Tiered Compaction Strategy)
- Latest: Unified Compaction Strategy
Impact: The Unified Compaction Strategy (UCS) is more efficient and can handle a wider variety of workloads compared to STCS. UCS can reduce write amplification and improve read performance by better managing the distribution of data across SSTables.
default_compaction:
  class_name: UnifiedCompactionStrategy
  parameters:
    scaling_parameters: T4
    max_sstables_to_compact: 64
    target_sstable_size: 1GiB
    sstable_growth: 0.3333333333333333
    min_sstable_size: 100MiB
Concurrent Compactors
- Old: Defaults to the smaller of the number of disks and cores
- Latest: Explicitly set to 8
Impact: Setting the number of concurrent compactors to 8 ensures that multiple compaction operations can run simultaneously, helping to maintain read performance during heavy write operations. This is particularly beneficial for SSD-backed storage where parallel I/O operations are more efficient.
Default Secondary Index
- Old: legacy_local_table
- Latest: sai
Impact: SAI (Storage Attached Index) is a new index implementation that builds on the advancements made with the SSTable Attached Secondary Index (SASI). It enables users to index multiple columns on the same table without suffering scaling problems, especially at write time.
Stream Entire SSTables
- Old: Implicitly set to True
- Latest: Explicitly set to True
Impact: When enabled, it permits Cassandra to zero-copy stream entire eligible SSTables between nodes, including every component. This speeds up network transfer significantly, subject to the throttling specified by entire_sstable_stream_throughput_outbound and, for inter-DC transfers, entire_sstable_inter_dc_stream_throughput_outbound.
UUID SSTable Identifiers
- Old: False
- Latest: True
Impact: Enabling UUID-based SSTable identifiers ensures that each SSTable has a unique name, simplifying backup and restore operations. This change reduces the risk of name collisions and makes it easier to manage SSTables in distributed environments.
Storage Compatibility Mode
- Old: Cassandra 4
- Latest: None
Impact: Setting the storage compatibility mode to none enables all new features by default, allowing users to take full advantage of the latest improvements, such as the new sstable format, in Cassandra. This setting is ideal for new clusters or those that do not need to maintain backward compatibility with older versions.
Testing and validation
The cassandra_latest.yaml configuration has undergone rigorous testing to ensure it works seamlessly. Currently, the Cassandra project CI pipeline tests both the standard (cassandra.yaml) and latest (cassandra_latest.yaml) configurations, ensuring compatibility and performance. This includes unit tests, distributed tests, and DTests.
Future improvements
Future improvements may include enforcing password strength policies and other security enhancements. The community is encouraged to suggest features that could be enabled by default in cassandra_latest.yaml.
Conclusion
The cassandra_latest.yaml configuration for new Cassandra 5.0 clusters is a significant step forward in making Cassandra more performant and feature-rich while maintaining the stability and reliability that users expect. Whether you are a developer, an operations professional, or an evangelist/end user, cassandra_latest.yaml offers something valuable for everyone.
Try it out
Ready to experience the incredible power of the cassandra_latest.yaml configuration on Apache Cassandra 5.0? Spin up your first cluster with a free trial on the Instaclustr Managed Platform and get started today with Cassandra 5.0!
P99 CONF 24 Recap: Heckling on the Shoulders of Giants
As I sit here at the hotel breakfast bar contemplating what a remarkable couple of days it’s been for the fourth annual P99 CONF, I feel quite honored to have helped host it. While the coffee is strong and the presentations are fresh in my mind, let’s recap some of the great content we shared and reveal some of the behind-the-scenes efforts that made it all happen. Watch P99 CONF Talks On Demand
Day 1
While we warmed up the live hosting stage, we had a fright with Error 1016 origin DNS errors on our event platform. As we scrambled to potentially host live on an alternative platform, Cloudflare saved the day and the show was back on the road. DNS issues weren’t going to stop us from launching P99 CONF! Felipe Cardeneti Mendes got things started in the lounge with hundreds of people asking great questions about ScyllaDB pre-show. Co-founder of ScyllaDB, Dor Laor, opened the show with his keynote about ScyllaDB tablets. In the first few slides we were looking at assembly, then 128 fully utilized CPU cores not long after that. By the end of the presentation, we had throughput north of 1M ops/sec – complete with ScyllaDB’s now famous predictable low latency.
To help set the scene of P99 CONF, we heard from Pekka Enberg, CTO of Turso (sorry, I’m the one who overlooked the company name mistake in the original video). Pekka dove into the patterns of low latency. This generated more great conversation in chat. If you want all the details, then his book simply titled Latency is a must-read.
Since parallel programming is hard, we opened up 3 stages for you to choose from following the keynotes. Felipe returned, this time as a session speaker. Proving that not all benchmarks need to be institutionalized cheating, he paired with Alan “Dormando” of Memcached to see how ScyllaDB stacks up from a caching perspective. We also heard from Luc Lenôtre, a talented engineer, who toyed with a kernel written in Rust. Luc showed us lots of flame graphs and low-level tuning of Maestro. Continuing with the Rust theme was Amos Wenger in a very interesting look at making HTTP faster with io_uring.
There were other great talks from well-known companies. For example, Jason Rahman from Microsoft shared his insights on tracing Linux scheduler behavior using ftrace. Also, Christopher Peck from Uber shared their experience tuning with generational ZGC. This reflects much of the P99 CONF content – real world, production experience taming P99 latencies at scale.
Another expanding theme at this year’s P99 CONF was eBPF. And who better than Liz Rice from Isovalent to kick it off with her keynote, Zero-overhead Container Networking with eBPF and Netkit. I love listening to Liz explain in technical detail the concepts and benefits of using eBPF and will definitely be reading through her book, Learning eBPF, on the long flight home to Australia.
By the way, books! There are now so many authors associated with P99 CONF which, I think, is a testament to the quality and professionalism of the speakers. We were giving away book bundles to members of the community who were top contributors in the chat, answering and asking great questions (huge thank you!). Some of the books on offer – which you can grab for yourself – are:
- Database Performance at Scale – by Felipe Mendes (ScyllaDB) et al.
- Latency – by Pekka Enberg (Turso)
- Think Distributed Systems – by Dominik Tornow (Resonate HQ)
- Writing for Developers: Blogs that Get Read – by Piotr Sarna (poolside)
- ScyllaDB in Action – by Bo Ingram (Discord)
And if you’re truly a tech bookworm, see this blog post for an extensive reading list: 14 Books by P99 CONF Speakers: Latency, Wasm, Databases & More.
By mid-afternoon of day 1, Gunnar Morling from decodable lightened things up with his keynote on creating the 1 billion row challenge. I’m sure you’ve heard of it, and we had another speaker, Shraddha Agrawal, following up with her version in Golang. We enjoyed lots more great content in the afternoon, including Piotr Sarna (from poolside AI and co-author of Database Performance at Scale + Writing for Developers) taking us back to the long-standing database theme of the conference with performance perspectives on database drivers. Speaking of themes, Wasm returned with book authors Brian Sletten and Ramnivas Laddad looking at WebAssembly on the edge. And the two Adams from Zoo gave us unique insight into building a remote CAD solution that feels local. And showing that we can finish day 1 just as strong as we started it, Carl Lerche, creator of tokio-rs, returned to P99 CONF for the Day 1 closing keynote. This year, he highlighted how Rust – which is typically used at the infrastructure level for all the features we love, like safety, concurrency, and performance – is also applicable at higher levels in the stack. He also announced the first version of Toasty, an ORM for Rust.
Day 2
The second day kicked off with Andy Pavlo from CMU and his take on the tension between the database and operating system, with a unique research project, Tigger, a database proxy that pushes a database into kernel space using eBPF. Leading on from that, we had Bryan Cantrill, CTO of Oxide and creator of DTrace, reviewing DTrace’s 21-year history with plenty of insights into the origins and evolution of this framework. Bryan has presented at every P99 CONF and is one of the many industry giants whose talks you can now watch on demand.
5X to 40X Lower DynamoDB Costs — with Better P99 Latency
At our recent events, I’ve been fielding a lot of questions about DynamoDB costs. So, I wanted to highlight the cost comparison aspect of a recent benchmark comparing ScyllaDB and DynamoDB. This involved a detailed price-performance comparison analyzing:
- How cost compares across both DynamoDB pricing models under various workload conditions
- How latency compares across this set of workloads
I’ll share details below, but here’s a quick summary: ScyllaDB costs are significantly lower in all but one scenario. In realistic workloads, costs would be 5X to 40X lower — with up to 4X better P99 latency. Here’s a consolidated look at how DynamoDB and ScyllaDB compare on a Uniform distribution (DynamoDB’s sweet spot). Now, more details on the cost aspect of this comparison. For a deeper dive into the performance aspect, see this DynamoDB vs ScyllaDB price-performance comparison blog as well as the complete benchmark report.
How We Compared DynamoDB Costs vs ScyllaDB Costs
For our cost comparisons, we launched a small 3-node cluster in ScyllaDB Cloud and measured performance on a wide range of workload scenarios. Next, we calculated the cost of running the same workloads on DynamoDB. We used an item size of 1081 bytes, which translates to 2 WCUs per write operation and 1 RCU per read operation on DynamoDB. Our working data set size was 1 TB, with an approximate cost of ~$250/month in DynamoDB. We used the same ScyllaDB cluster through every testing scenario, thus simplifying ScyllaDB Cloud costs. Hourly rates (on-demand) were used. As ScyllaDB linearly scales with the amount of resources, you can predictably adjust costs to match your desired performance outcome. Annual pricing provides significant cost reduction but is out of the scope of this benchmark.
DynamoDB has two modes for non-annual pricing: provisioned and on-demand pricing. Provisioned mode is recommended if your workloads are reasonably predictable. On-demand pricing is significantly more expensive and is a fit for unpredictable, low-throughput workloads. It is possible to combine modes, add auto-scaling, and so forth. DynamoDB provides considerable flexibility around managing the cost and scale of the aforementioned options, but this also results in considerable complexity. For details on how we calculated costs, refer to the Cost Calculations section at the end of this article.
Throughout all tests, we ensured ScyllaDB had spare capacity at all times by keeping its load below 75%. Given that, note that it is possible to achieve higher traffic than the numbers reported here at no additional cost, in turn allowing for additional growth. The number of operations per second that the ScyllaDB cluster performs for each workload is reported under the X axis in the following graphs.
Provisioned Cost Comparison: DynamoDB vs ScyllaDB
Provisioned mode is recommended if your workloads are reasonably predictable. With DynamoDB, you need to be able to predict per-table capacity following the AWS DynamoDB read/write capacity unit pricing model. With just one exception, DynamoDB’s cost estimates were consistently higher than ScyllaDB – and much more so for the most write-heavy workloads. In the 1 out of 15 cases where DynamoDB turned out to be less expensive, ScyllaDB could actually drive more utilization to win over DynamoDB. However, we wanted to keep the results consistent and fair.
This is not surprising, given that DynamoDB charges 5X more for writes than for reads, while ScyllaDB does not differentiate between operations, and its pricing is based on the actual cluster size.
On-Demand Cost Comparison: DynamoDB vs ScyllaDB
On-demand pricing is best when the application’s workload is unclear, the application’s data traffic patterns are unknown, and/or your company prefers a pay-as-you-go option. However, as the results show, the convenience and flexibility of DynamoDB’s on-demand pricing often come at quite a cost. To see how we calculated costs, refer to the Cost Calculations section at the end of this article.
Here, the same general trends hold true. ScyllaDB cost is fixed across the board, and its cost advantage grows as write throughput increases. ScyllaDB’s cost advantage over on-demand DynamoDB tables is significantly greater when compared to provisioned capacity on DynamoDB. Why? Because DynamoDB’s on-demand pricing is significantly higher than its provisioned capacity counterpart. Therefore, workloads with unpredictable traffic spikes (which would justify not using provisioned capacity) may easily end up with runaway bills compared to costs with ScyllaDB Cloud.
Making ScyllaDB Even More Cost Effective
Unlike DynamoDB (where you provision tables), ScyllaDB is provisioned as a cluster, capable of hosting several tables – and therefore consolidating several workloads under a single deployment. Excess hardware capacity may be shared to power more use cases on that cluster. ScyllaDB Cloud and Enterprise users can also use ScyllaDB’s Workload Prioritization to prioritize specific access patterns and further drive consolidation.
For example, assume there are 10 use cases that require 100K OPS each. With DynamoDB, users would be forced to allocate a provisioned workload per table or to use the rather expensive on-demand mode. This introduces a set of caveats:
- If every workload consistently reaches its peak capacity, it will likely get throttled by AWS (provisioned mode), or result in runaway bills (on-demand mode).
- Likewise, the opposite also holds true: If most workloads are consistently idle, provisioned mode results in non-consumed capacity bills.
- On-demand mode doesn’t guarantee immediate capacity to support traffic surges. This, in turn, causes your applications to experience some degree of throttling.
A standard ScyllaDB deployment is not only more cost effective, but also simplifies management. It allows users to consolidate all workloads within a single cluster and share idle capacity among them. With ScyllaDB Cloud and Enterprise, users further have the flexibility to define priorities on a per-workload basis, allowing the database to make more informed decisions when two or more workloads compete against each other for resources.
Cost Calculation Details
Here’s how we calculated costs across the different databases and pricing models.
DynamoDB Cost Calculations
Provisioned Costs for DynamoDB
With DynamoDB’s provisioned capacity mode, planning is required. You specify the number of reads and writes per second that you expect your application to require. You can make use of auto-scaling to automatically adjust your table’s capacity based on a specific utilization target in order to sustain spikes outside of your projected planning. In provisioned mode, you need to provision DynamoDB with the expected throughput.
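Before walking through the detailed rates, here is a rough sketch of how the monthly cost arithmetic described in the next paragraphs works out. This is a hypothetical Java illustration only: the per-unit rates mirror those quoted below, the throughput numbers in main are made up, and actual AWS pricing varies by region and over time.
import java.util.Locale;

// Hypothetical sketch of the DynamoDB monthly cost formulas described in the article.
public class DynamoCostSketch {

    static final double WCU_PER_HOUR = 0.00065;      // provisioned write capacity unit
    static final double RCU_PER_HOUR = 0.00013;      // provisioned read capacity unit
    static final double PRICE_PER_WRU = 0.00000125;  // on-demand write request unit
    static final double PRICE_PER_RRU = 0.00000025;  // on-demand read request unit
    static final double HOURS_PER_MONTH = 730;
    static final double SECONDS_PER_MONTH = HOURS_PER_MONTH * 3600;

    // Provisioned: pay per capacity unit per hour, sized for expected ops/sec.
    static double provisionedMonthly(double writesPerSec, double readsPerSec,
                                     int wcusPerWrite, int rcusPerRead) {
        return (writesPerSec * wcusPerWrite * WCU_PER_HOUR
                + readsPerSec * rcusPerRead * RCU_PER_HOUR) * HOURS_PER_MONTH;
    }

    // On-demand: pay per request unit actually consumed over the month.
    static double onDemandMonthly(double writesPerSec, double readsPerSec,
                                  int wrusPerWrite, int rrusPerRead) {
        return writesPerSec * SECONDS_PER_MONTH * wrusPerWrite * PRICE_PER_WRU
                + readsPerSec * SECONDS_PER_MONTH * rrusPerRead * PRICE_PER_RRU;
    }

    public static void main(String[] args) {
        // Example only: 50K writes/sec and 50K reads/sec with a 1081-byte item
        // (2 WCUs or WRUs per write, 1 RCU or RRU per read).
        System.out.printf(Locale.US, "provisioned: $%,.0f/month%n",
                provisionedMonthly(50_000, 50_000, 2, 1));
        System.out.printf(Locale.US, "on-demand:   $%,.0f/month%n",
                onDemandMonthly(50_000, 50_000, 2, 1));
    }
}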
You set WCUs (Write Capacity Units) and RCUs (Read Capacity Units), which signify the allowed number of write and read operations per second, respectively. They are priced per hour:
- One WCU is $0.00065 per hour
- One RCU is $0.00013 per hour
This yields the following formula for calculating monthly costs: provisioned WCUs × $0.00065/hour × hours in the month, plus provisioned RCUs × $0.00013/hour × hours in the month.
On-Demand Costs for DynamoDB
With DynamoDB’s on-demand mode, no planning is required. You pay for the actual reads/writes that your application is using (the total number of actual writes or reads, not writes or reads per second). In this mode, you pay by usage and the cost is per request unit (rather than per capacity unit, as in the provisioned mode). AWS charges $1.25 per million write request units (WRU) and $0.25 per million read request units (RRU). Therefore, a single request unit costs one millionth of those prices, as follows:
- One WRU is $0.00000125 per write
- One RRU is $0.00000025 per read
This yields the following formula for calculating monthly costs: monthly write request units × $0.00000125, plus monthly read request units × $0.00000025.
ScyllaDB Cost Calculations
As stated previously, we used ScyllaDB’s on-demand pricing for all cost comparisons in this study. ScyllaDB’s on-demand costs were determined using our online pricing calculator as follows:
From the ScyllaDB Pricing Calculator
This calculator estimates the size and cost of a ScyllaDB Cloud cluster based on the specified technical requirements around throughput and item/data size. Note that in ScyllaDB, the primary aspect driving costs is the cluster size, unlike DynamoDB’s model, which is based on the volume of reads and writes. Once the cluster size is determined, the deployment can often exceed throughput requirements. For comparison, DynamoDB’s provisioned pricing structure requires users to explicitly specify sustained and peak throughput. Overprovisioning equivalent performance in DynamoDB would be significantly pricier compared to ScyllaDB. Without an annual commitment for cost savings, the estimated annual cost for the ScyllaDB Cloud cluster is $54,528, calculated at a monthly rate of $4,544.
Conclusion
As the results indicate, what might begin at a seemingly reasonable cost can quickly escalate into “bill shock” with DynamoDB – especially at high throughputs, and particularly with write-heavy workloads. This makes DynamoDB a suboptimal choice for data-intensive applications anticipating steady or rapid growth. ScyllaDB’s significantly lower costs – a reflection of ScyllaDB taking full advantage of modern infrastructure for high throughput and low latency – make it a more cost-effective solution for data-intensive applications. ScyllaDB – with its LSM-tree-based storage, unified caching, shard-per-core design, and advanced schedulers – allows you to maximize the advantages of modern hardware, from huge CPU chips to blazing-fast NVMe. Beyond the presented cost savings, ScyllaDB sustains 2X peaks and provides 2X-4X better P99 latency. Additionally, it can further reduce latency when idle – or enable spare resources to be shared across multiple tables. For larger workloads spanning 500K-1M OPS and beyond, this can result in a cost saving in the millions – with better performance and fewer query limitations.
Introducing Netflix’s TimeSeries Data Abstraction Layer
By Rajiv Shringi, Vinay Chella, Kaidan Fullerton, Oleksii Tkachuk, Joey Lynch
Introduction
As Netflix continues to expand and diversify into various sectors like Video on Demand and Gaming, the ability to ingest and store vast amounts of temporal data — often reaching petabytes — with millisecond access latency has become increasingly vital. In previous blog posts, we introduced the Key-Value Data Abstraction Layer and the Data Gateway Platform, both of which are integral to Netflix’s data architecture. The Key-Value Abstraction offers a flexible, scalable solution for storing and accessing structured key-value data, while the Data Gateway Platform provides essential infrastructure for protecting, configuring, and deploying the data tier.
Building on these foundational abstractions, we developed the TimeSeries Abstraction — a versatile and scalable solution designed to efficiently store and query large volumes of temporal event data with low millisecond latencies, all in a cost-effective manner across various use cases.
In this post, we will delve into the architecture, design principles, and real-world applications of the TimeSeries Abstraction, demonstrating how it enhances our platform’s ability to manage temporal data at scale.
Note: Contrary to what the name may suggest, this system is not built as a general-purpose time series database. We do not use it for metrics, histograms, timers, or any such near-real time analytics use case. Those use cases are well served by the Netflix Atlas telemetry system. Instead, we focus on addressing the challenge of storing and accessing extremely high-throughput, immutable temporal event data in a low-latency and cost-efficient manner.
Challenges
At Netflix, temporal data is continuously generated and utilized, whether from user interactions like video-play events, asset impressions, or complex micro-service network activities. Effectively managing this data at scale to extract valuable insights is crucial for ensuring optimal user experiences and system reliability.
However, storing and querying such data presents a unique set of challenges:
- High Throughput: Managing up to 10 million writes per second while maintaining high availability.
- Efficient Querying in Large Datasets: Storing petabytes of data while ensuring primary key reads return results within low double-digit milliseconds, and supporting searches and aggregations across multiple secondary attributes.
- Global Reads and Writes: Facilitating read and write operations from anywhere in the world with adjustable consistency models.
- Tunable Configuration: Offering the ability to partition datasets in either a single-tenant or multi-tenant datastore, with options to adjust various dataset aspects such as retention and consistency.
- Handling Bursty Traffic: Managing significant traffic spikes during high-demand events, such as new content launches or regional failovers.
- Cost Efficiency: Reducing the cost per byte and per operation to optimize long-term retention while minimizing infrastructure expenses, which can amount to millions of dollars for Netflix.
TimeSeries Abstraction
The TimeSeries Abstraction was developed to meet these requirements, built around the following core design principles:
- Partitioned Data: Data is partitioned using a unique temporal partitioning strategy combined with an event bucketing approach to efficiently manage bursty workloads and streamline queries.
- Flexible Storage: The service is designed to integrate with various storage backends, including Apache Cassandra and Elasticsearch, allowing Netflix to customize storage solutions based on specific use case requirements.
- Configurability: TimeSeries offers a range of tunable options for each dataset, providing the flexibility needed to accommodate a wide array of use cases.
- Scalability: The architecture supports both horizontal and vertical scaling, enabling the system to handle increasing throughput and data volumes as Netflix expands its user base and services.
- Sharded Infrastructure: Leveraging the Data Gateway Platform, we can deploy single-tenant and/or multi-tenant infrastructure with the necessary access and traffic isolation.
Let’s dive into the various aspects of this abstraction.
Data Model
We follow a unique event data model that encapsulates all the data we want to capture for events, while allowing us to query them efficiently.
Let’s start with the smallest unit of data in the abstraction and work our way up.
- Event Item: An event item is a key-value pair that users use to store data for a given event. For example: {“device_type”: “ios”}.
- Event: An event is a structured collection of one or more such event items. An event occurs at a specific point in time and is identified by a client-generated timestamp and an event identifier (such as a UUID). This combination of event_time and event_id also forms part of the unique idempotency key for the event, enabling users to safely retry requests.
- Time Series ID: A time_series_id is a collection of one or more such events over the dataset’s retention period. For instance, a device_id would store all events occurring for a given device over the retention period. All events are immutable, and the TimeSeries service only ever appends events to a given time series ID.
- Namespace: A namespace is a collection of time series IDs and event data, representing the complete TimeSeries dataset. Users can create one or more namespaces for each of their use cases. The abstraction applies various tunable options at the namespace level, which we will discuss further when we explore the service’s control plane.
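To make this hierarchy concrete, here is a minimal sketch of how these concepts could be modeled. The types below are hypothetical and greatly simplified; they are not the service’s actual implementation.
import java.time.Instant;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified model of the TimeSeries data hierarchy.
public class TimeSeriesModel {

    // Event item: a single key-value pair attached to an event, e.g. {"device_type": "ios"}.
    record EventItem(String key, byte[] value) {}

    // Event: identified by a client-generated timestamp and event id;
    // (eventTime, eventId) participates in the idempotency key.
    record Event(Instant eventTime, String eventId, List<EventItem> items) {}

    // Time series: all events appended for one time_series_id (e.g. a device_id)
    // over the namespace's retention period.
    record TimeSeries(String timeSeriesId, List<Event> events) {}

    // Namespace: the complete dataset, a collection of time series plus tunable options.
    record Namespace(String name, Map<String, TimeSeries> series) {}
}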
API
The abstraction provides the following APIs to interact with the event data.
WriteEventRecordsSync: This endpoint writes a batch of events and sends back a durability acknowledgement to the client. This is used in cases where users require a guarantee of durability.
WriteEventRecords: This is the fire-and-forget version of the above endpoint. It enqueues a batch of events without the durability acknowledgement. This is used in cases like logging or tracing, where users care more about throughput and can tolerate a small amount of data loss.
{
"namespace": "my_dataset",
"events": [
{
"timeSeriesId": "profile100",
"eventTime": "2024-10-03T21:24:23.988Z",
"eventId": "550e8400-e29b-41d4-a716-446655440000",
"eventItems": [
{
"eventItemKey": "deviceType",
"eventItemValue": "aW9z"
},
{
"eventItemKey": "deviceMetadata",
"eventItemValue": "c29tZSBtZXRhZGF0YQ=="
}
]
},
{
"timeSeriesId": "profile100",
"eventTime": "2024-10-03T21:23:30.000Z",
"eventId": "123e4567-e89b-12d3-a456-426614174000",
"eventItems": [
{
"eventItemKey": "deviceType",
"eventItemValue": "YW5kcm9pZA=="
}
]
}
]
}
ReadEventRecords: Given a combination of a namespace, a timeSeriesId, a timeInterval, and optional eventFilters, this endpoint returns all the matching events, sorted descending by event_time, with low millisecond latency.
{
"namespace": "my_dataset",
"timeSeriesId": "profile100",
"timeInterval": {
"start": "2024-10-02T21:00:00.000Z",
"end": "2024-10-03T21:00:00.000Z"
},
"eventFilters": [
{
"matchEventItemKey": "deviceType",
"matchEventItemValue": "aW9z"
}
],
"pageSize": 100,
"totalRecordLimit": 1000
}
SearchEventRecords: Given search criteria and a time interval, this endpoint returns all the matching events. These use cases are fine with eventually consistent reads.
{
"namespace": "my_dataset",
"timeInterval": {
"start": "2024-10-02T21:00:00.000Z",
"end": "2024-10-03T21:00:00.000Z"
},
"searchQuery": {
"booleanQuery": {
"searchQuery": [
{
"equals": {
"eventItemKey": "deviceType",
"eventItemValue": "aW9z"
}
},
{
"range": {
"eventItemKey": "deviceRegistrationTimestamp",
"lowerBound": {
"eventItemValue": "MjAyNC0xMC0wMlQwMDowMDowMC4wMDBa",
"inclusive": true
},
"upperBound": {
"eventItemValue": "MjAyNC0xMC0wM1QwMDowMDowMC4wMDBa"
}
}
}
],
"operator": "AND"
}
},
"pageSize": 100,
"totalRecordLimit": 1000
}
AggregateEventRecords: Given search criteria and an aggregation mode (e.g. DistinctAggregation), this endpoint performs the given aggregation within a given time interval. Similar to the Search endpoint, users can tolerate eventual consistency and a potentially higher latency (in seconds).
{
"namespace": "my_dataset",
"timeInterval": {
"start": "2024-10-02T21:00:00.000Z",
"end": "2024-10-03T21:00:00.000Z"
},
"searchQuery": {...some search criteria...},
"aggregationQuery": {
"distinct": {
"eventItemKey": "deviceType",
"pageSize": 100
}
}
}
In the subsequent sections, we will talk about how we interact with this data at the storage layer.
Storage Layer
The storage layer for TimeSeries comprises a primary data store and an optional index data store. The primary data store ensures data durability during writes and is used for primary read operations, while the index data store is utilized for search and aggregate operations. At Netflix, Apache Cassandra is the preferred choice for storing durable data in high-throughput scenarios, while Elasticsearch is the preferred data store for indexing. However, similar to our approach with the API, the storage layer is not tightly coupled to these specific data stores. Instead, we define storage API contracts that must be fulfilled, allowing us the flexibility to replace the underlying data stores as needed.
Primary Datastore
In this section, we will talk about how we leverage Apache Cassandra for TimeSeries use cases.
Partitioning Scheme
At Netflix’s scale, the continuous influx of event data can quickly overwhelm traditional databases. Temporal partitioning addresses this challenge by dividing the data into manageable chunks based on time intervals, such as hourly, daily, or monthly windows. This approach enables efficient querying of specific time ranges without the need to scan the entire dataset. It also allows Netflix to archive, compress, or delete older data efficiently, optimizing both storage and query performance. Additionally, this partitioning mitigates the performance issues typically associated with wide partitions in Cassandra. By employing this strategy, we can operate at much higher disk utilization, as it reduces the need to reserve large amounts of disk space for compactions, thereby saving costs.
Here is what it looks like:
Time Slice: A time slice is the unit of data retention and maps directly to a Cassandra table. We create multiple such time slices, each covering a specific interval of time. An event lands in one of these slices based on the event_time. These slices are joined with no time gaps in between, with operations being start-inclusive and end-exclusive, ensuring that all data lands in one of the slices. By utilizing these time slices, we can efficiently implement retention by dropping entire tables, which reduces storage space and saves on costs.
Why not use row-based Time-To-Live (TTL)?
Using TTL on individual events would generate a significant number of tombstones in Cassandra, degrading performance, especially during range scans. By employing discrete time slices and dropping them, we avoid the tombstone issue entirely. The tradeoff is that data may be retained slightly longer than necessary, as an entire table’s time range must fall outside the retention window before it can be dropped. Additionally, TTLs are difficult to adjust later, whereas TimeSeries can extend the dataset retention instantly with a single control plane operation.
Time Buckets: Within a time slice, data is further partitioned into time buckets. This facilitates effective range scans by allowing us to target specific time buckets for a given query range. The tradeoff is that if a user wants to read the entire range of data over a large time period, we must scan many partitions. We mitigate potential latency by scanning these partitions in parallel and aggregating the data at the end. In most cases, the advantage of targeting smaller data subsets outweighs the read amplification from these scatter-gather operations. Typically, users read a smaller subset of data rather than the entire retention range.
Event Buckets: To manage extremely high-throughput write operations, which may result in a burst of writes for a given time series within a short period, we further divide the time bucket into event buckets. This prevents overloading the same partition for a given time range and also reduces partition sizes further, albeit with a slight increase in read amplification.
Note: With Cassandra 4.x onwards, we notice a substantial improvement in the performance of scanning a range of data in a wide partition. See Future Enhancements at the end for the Dynamic Event bucketing work that aims to take advantage of this.
Storage Tables
We use two kinds of tables:
- Data tables: These are the time slices that store the actual event data.
- Metadata table: This table stores information about how each time slice is configured per namespace.
Data tables
The partition key enables splitting events for a time_series_id over a range of time_bucket(s) and event_bucket(s), thus mitigating hot partitions, while the clustering key allows us to keep data sorted on disk in the order we almost always want to read it. The value_metadata column stores metadata for the event_item_value such as compression.
Writing to the data table:
User writes will land in a given time slice, time bucket, and event bucket as a factor of the event_time attached to the event. This factor is dictated by the control plane configuration of a given namespace.
For example:
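The original post illustrates this with a figure. As a rough, hypothetical sketch of the idea in Java, using the sample namespace configuration shown later (secondsPerTimeSlice=129600, time buckets of 3600 seconds, eventBuckets=4); the exact slice/bucket math and the event-bucket assignment below are assumptions, not the service’s actual code:
import java.time.Instant;

// Hypothetical sketch: map an event to a time slice, time bucket, and event bucket.
public class BucketingSketch {

    static final long SECONDS_PER_TIME_SLICE = 129_600; // width of a time slice (one table)
    static final long SECONDS_PER_TIME_BUCKET = 3_600;  // width of a time bucket within the slice
    static final int EVENT_BUCKETS = 4;                 // event buckets per time bucket

    record Placement(long timeSliceStartEpochSec, long timeBucket, int eventBucket) {}

    static Placement place(Instant eventTime, String eventId) {
        long epochSec = eventTime.getEpochSecond();
        // The time slice determines which table the event lands in.
        long sliceStart = (epochSec / SECONDS_PER_TIME_SLICE) * SECONDS_PER_TIME_SLICE;
        // The time bucket narrows the partition range within that slice.
        long timeBucket = (epochSec % SECONDS_PER_TIME_SLICE) / SECONDS_PER_TIME_BUCKET;
        // Assumption: spread bursts for the same time series by hashing the event id.
        int eventBucket = Math.floorMod(eventId.hashCode(), EVENT_BUCKETS);
        return new Placement(sliceStart, timeBucket, eventBucket);
    }

    public static void main(String[] args) {
        System.out.println(place(Instant.parse("2024-10-03T21:24:23.988Z"),
                                 "550e8400-e29b-41d4-a716-446655440000"));
    }
}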
During this process, the writer makes decisions on how to handle the data before writing, such as whether to compress it. The value_metadata column records any such post-processing actions, ensuring that the reader can accurately interpret the data.
Reading from the data table:
The illustration below depicts, at a high level, how we scatter-gather the reads from multiple partitions and join the result set at the end to return the final result.
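In code form, a rough sketch of this scatter-gather pattern might look like the following. The names and the query function are assumptions for illustration; the real implementation differs.
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.function.Function;

// Hypothetical sketch of a scatter-gather read across partition buckets.
public class ScatterGatherSketch {

    record EventRow(long eventTimeMillis, String eventId, byte[] payload) {}

    // Query all candidate buckets in parallel (scatter), then merge (gather), newest first.
    static List<EventRow> readRange(List<String> partitionKeys,
                                    Function<String, CompletableFuture<List<EventRow>>> queryPartition) {
        List<CompletableFuture<List<EventRow>>> futures =
                partitionKeys.stream().map(queryPartition).toList();
        CompletableFuture.allOf(futures.toArray(CompletableFuture[]::new)).join();
        List<EventRow> merged = new ArrayList<>();
        for (CompletableFuture<List<EventRow>> f : futures) {
            merged.addAll(f.join());
        }
        // Return events sorted descending by event_time, matching the read API contract.
        merged.sort(Comparator.comparingLong(EventRow::eventTimeMillis).reversed());
        return merged;
    }
}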
Metadata table
This table stores the configuration data about the time slices for a given namespace.
Note the following:
- No Time Gaps: The end_time of a given time slice overlaps with the start_time of the next time slice, ensuring all events find a home.
- Retention: The status indicates which tables fall inside and outside of the retention window.
- Flexible: This metadata can be adjusted per time slice, allowing us to tune the partition settings of future time slices based on observed data patterns in the current time slice.
There is a lot more information that can be stored in the metadata column (e.g., compaction settings for the table), but we only show the partition settings here for brevity.
Index Datastore
To support secondary access patterns via non-primary key attributes, we index data into Elasticsearch. Users can configure a list of attributes per namespace that they wish to search and/or aggregate data on. The service extracts these fields from events as they stream in, indexing the resultant documents into Elasticsearch. Depending on the throughput, we may use Elasticsearch as a reverse index, retrieving the full data from Cassandra, or we may store the entire source data directly in Elasticsearch.
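As a rough illustration of that extraction step, the sketch below projects an event’s items into an index document using a per-namespace field mapping of the kind shown later in the control plane configuration. The method name, the plain-string item values, and the type handling are assumptions for illustration only.
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: project an event's items into an index document
// using the namespace's fieldMapping (only mapped keys are indexed).
public class IndexProjectionSketch {

    static Map<String, Object> toIndexDocument(String timeSeriesId,
                                               long eventTimeMillis,
                                               Map<String, String> eventItems,
                                               Map<String, String> fieldMapping) {
        Map<String, Object> doc = new HashMap<>();
        doc.put("time_series_id", timeSeriesId);
        doc.put("event_time", eventTimeMillis);
        for (Map.Entry<String, String> mapping : fieldMapping.entrySet()) {
            String key = mapping.getKey();     // e.g. "tags.nf.app"
            String type = mapping.getValue();  // e.g. "KEYWORD", "INTEGER", "BOOLEAN"
            String raw = eventItems.get(key);
            if (raw == null) continue;         // field not present on this event
            switch (type) {
                case "INTEGER" -> doc.put(key, Long.parseLong(raw));
                case "BOOLEAN" -> doc.put(key, Boolean.parseBoolean(raw));
                default -> doc.put(key, raw);  // KEYWORD and anything else as text
            }
        }
        return doc;
    }
}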
Note: Again, users are never directly exposed to Elasticsearch, just like they are not directly exposed to Cassandra. Instead, they interact with the Search and Aggregate API endpoints that translate a given query to that needed for the underlying datastore.
In the next section, we will talk about how we configure these data stores for different datasets.
Control Plane
The data plane is responsible for executing the read and write operations, while the control plane configures every aspect of a namespace’s behavior. The data plane communicates with the TimeSeries control stack, which manages this configuration information. In turn, the TimeSeries control stack interacts with a sharded Data Gateway Platform Control Plane that oversees control configurations for all abstractions and namespaces.
Separating the responsibilities of the data plane and control plane helps maintain the high availability of our data plane, as the control plane takes on tasks that may require some form of schema consensus from the underlying data stores.
Namespace Configuration
The below configuration snippet demonstrates the immense flexibility of the service and how we can tune several things per namespace using our control plane.
"persistence_configuration": [
{
"id": "PRIMARY_STORAGE",
"physical_storage": {
"type": "CASSANDRA", // type of primary storage
"cluster": "cass_dgw_ts_tracing", // physical cluster name
"dataset": "tracing_default" // maps to the keyspace
},
"config": {
"timePartition": {
"secondsPerTimeSlice": "129600", // width of a time slice
"secondPerTimeBucket": "3600", // width of a time bucket
"eventBuckets": 4 // how many event buckets within
},
"queueBuffering": {
"coalesce": "1s", // how long to coalesce writes
"bufferCapacity": 4194304 // queue capacity in bytes
},
"consistencyScope": "LOCAL", // single-region/multi-region
"consistencyTarget": "EVENTUAL", // read/write consistency
"acceptLimit": "129600s" // how far back writes are allowed
},
"lifecycleConfigs": {
"lifecycleConfig": [ // Primary store data retention
{
"type": "retention",
"config": {
"close_after": "1296000s", // close for reads/writes
"delete_after": "1382400s" // drop time slice
}
}
]
}
},
{
"id": "INDEX_STORAGE",
"physicalStorage": {
"type": "ELASTICSEARCH", // type of index storage
"cluster": "es_dgw_ts_tracing", // ES cluster name
"dataset": "tracing_default_useast1" // base index name
},
"config": {
"timePartition": {
"secondsPerSlice": "129600" // width of the index slice
},
"consistencyScope": "LOCAL",
"consistencyTarget": "EVENTUAL", // how should we read/write data
"acceptLimit": "129600s", // how far back writes are allowed
"indexConfig": {
"fieldMapping": { // fields to extract to index
"tags.nf.app": "KEYWORD",
"tags.duration": "INTEGER",
"tags.enabled": "BOOLEAN"
},
"refreshInterval": "60s" // Index related settings
}
},
"lifecycleConfigs": {
"lifecycleConfig": [
{
"type": "retention", // Index retention settings
"config": {
"close_after": "1296000s",
"delete_after": "1382400s"
}
}
]
}
}
]
Provisioning Infrastructure
With so many different parameters, we need automated provisioning workflows to deduce the best settings for a given workload. When users want to create their namespaces, they specify a list of workload desires, which the automation translates into concrete infrastructure and related control plane configuration. We highly encourage you to watch this ApacheCon talk by one of our stunning colleagues, Joey Lynch, on how we achieve this. We may go into detail on this subject in one of our future blog posts.
Once the system provisions the initial infrastructure, it then scales in response to the user workload. The next section describes how this is achieved.
Scalability
Our users may operate with limited information at the time of provisioning their namespaces, resulting in best-effort provisioning estimates. Further, evolving use-cases may introduce new throughput requirements over time. Here’s how we manage this:
- Horizontal scaling: TimeSeries server instances can auto-scale up and down as per attached scaling policies to meet the traffic demand. The storage server capacity can be recomputed to accommodate changing requirements using our capacity planner.
- Vertical scaling: We may also choose to vertically scale our TimeSeries server instances or our storage instances to get greater CPU, RAM and/or attached storage capacity.
- Scaling disk: We may attach EBS to store data if the capacity planner prefers infrastructure that offers larger storage at a lower cost rather than SSDs optimized for latency. In such cases, we deploy jobs to scale the EBS volume when the disk storage reaches a certain percentage threshold.
- Re-partitioning data: Inaccurate workload estimates can lead to over- or under-partitioning of our datasets. The TimeSeries control plane can adjust the partitioning configuration for upcoming time slices once we observe the nature of the data in the wild (via partition histograms). In the future, we plan to support re-partitioning of older data and dynamic partitioning of current data.
Design Principles
So far, we have seen how TimeSeries stores, configures and interacts with event datasets. Let’s see how we apply different techniques to improve the performance of our operations and provide better guarantees.
Event Idempotency
We prefer to bake in idempotency in all mutation endpoints, so that users can retry or hedge their requests safely. Hedging is when the client sends an identical competing request to the server, if the original request does not come back with a response in an expected amount of time. The client then responds with whichever request completes first. This is done to keep the tail latencies for an application relatively low. This can only be done safely if the mutations are idempotent. For TimeSeries, the combination of event_time, event_id and event_item_key form the idempotency key for a given time_series_id event.
SLO-based Hedging
We assign Service Level Objectives (SLO) targets for different endpoints within TimeSeries, as an indication of what we think the performance of those endpoints should be for a given namespace. We can then hedge a request if the response does not come back in that configured amount of time.
"slos": {
"read": { // SLOs per endpoint
"latency": {
"target": "0.5s", // hedge around this number
"max": "1s" // time-out around this number
}
},
"write": {
"latency": {
"target": "0.01s",
"max": "0.05s"
}
}
}
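As a rough illustration of the idea (not the actual client code), hedging against an SLO target might look something like the sketch below. It assumes the request is idempotent, as described above, and uses made-up names.
import java.time.Duration;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

// Hypothetical sketch of SLO-based hedging: if the first attempt has not completed
// within the SLO target, launch an identical (idempotent) request and take whichever
// response arrives first; give up entirely once the max latency is exceeded.
public class HedgingSketch {

    static <T> CompletableFuture<T> hedged(Supplier<CompletableFuture<T>> attempt,
                                           Duration target, Duration max) {
        CompletableFuture<T> winner = new CompletableFuture<>();
        CompletableFuture<T> first = attempt.get();
        first.whenComplete((result, error) -> {
            if (error == null) winner.complete(result); else winner.completeExceptionally(error);
        });
        // After the SLO target elapses, fire a competing identical request.
        CompletableFuture.delayedExecutor(target.toMillis(), TimeUnit.MILLISECONDS).execute(() -> {
            if (!first.isDone()) {
                attempt.get().whenComplete((result, error) -> {
                    if (error == null) winner.complete(result); // ignore hedge failures
                });
            }
        });
        // Time out the whole operation around the "max" SLO.
        return winner.orTimeout(max.toMillis(), TimeUnit.MILLISECONDS);
    }
}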
Partial Return
Sometimes, a client may be sensitive to latency and willing to accept a partial result set. A real-world example of this is real-time frequency capping. Precision is not critical in this case, but if the response is delayed, it becomes practically useless to the upstream client. Therefore, the client prefers to work with whatever data has been collected so far rather than timing out while waiting for all the data. The TimeSeries client supports partial returns around SLOs for this purpose. Importantly, we still maintain the latest order of events in this partial fetch.
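A minimal sketch of this partial-return behavior, assuming a hypothetical paging function and that pages arrive newest-first as in the read API: keep fetching pages until the data is exhausted or the latency budget is spent, then return whatever has been collected.
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Hypothetical sketch of a partial return around an SLO budget.
public class PartialReturnSketch {

    record Page(List<String> events, String nextPageToken) {}

    static List<String> readWithBudget(Function<String, Page> fetchPage, Duration budget) {
        long deadlineNanos = System.nanoTime() + budget.toNanos();
        List<String> collected = new ArrayList<>(); // pages are newest-first, so ordering is preserved
        String token = null;
        do {
            Page page = fetchPage.apply(token);
            collected.addAll(page.events());
            token = page.nextPageToken();
        } while (token != null && System.nanoTime() < deadlineNanos);
        return collected; // possibly partial, rather than timing out with nothing
    }
}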
Adaptive Pagination
All reads start with a default fanout factor, scanning 8 partition buckets in parallel. However, if the service layer determines that the time_series dataset is dense — i.e., most reads are satisfied by reading the first few partition buckets — then it dynamically adjusts the fanout factor of future reads in order to reduce the read amplification on the underlying datastore. Conversely, if the dataset is sparse, we may want to increase this limit with a reasonable upper bound.
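A simplified sketch of the adjustment loop, with assumed bounds and a deliberately naive update rule (the real service uses its own heuristics):
// Hypothetical sketch of adaptive fanout: track how many partition buckets recent
// reads actually needed, and nudge the default fanout for future reads toward that.
public class AdaptiveFanoutSketch {

    static final int MIN_FANOUT = 1;
    static final int MAX_FANOUT = 32;   // assumed reasonable upper bound

    private volatile int fanout = 8;    // default: scan 8 partition buckets in parallel

    int currentFanout() {
        return fanout;
    }

    // Called after each read with the number of buckets that actually produced data.
    void record(int bucketsNeeded) {
        if (bucketsNeeded < fanout) {
            fanout = Math.max(MIN_FANOUT, fanout - 1); // dense dataset: shrink the fanout
        } else {
            fanout = Math.min(MAX_FANOUT, fanout + 1); // sparse dataset: widen the fanout
        }
    }
}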
Limited Write Window
In most cases, the active range for writing data is smaller than the range for reading data — i.e., we want a range of time to become immutable as soon as possible so that we can apply optimizations on top of it. We control this by having a configurable “acceptLimit” parameter that prevents users from writing events older than this time limit. For example, an accept limit of 4 hours means that users cannot write events older than now() — 4 hours. We sometimes raise this limit for backfilling historical data, but it is tuned back down for regular write operations. Once a range of data becomes immutable, we can safely do things like caching, compressing, and compacting it for reads.
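The check itself is simple; a hedged sketch of the acceptLimit validation described above (hypothetical names):
import java.time.Duration;
import java.time.Instant;

// Hypothetical sketch of the "acceptLimit" check: reject events older than the
// configured write window so that past ranges can become immutable.
public class AcceptLimitSketch {

    static void validateEventTime(Instant eventTime, Duration acceptLimit) {
        Instant oldestAccepted = Instant.now().minus(acceptLimit);
        if (eventTime.isBefore(oldestAccepted)) {
            throw new IllegalArgumentException(
                "event_time " + eventTime + " is older than the accept limit of " + acceptLimit);
        }
    }

    public static void main(String[] args) {
        validateEventTime(Instant.now().minusSeconds(60), Duration.ofHours(4)); // accepted
    }
}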
Buffering Writes
We frequently leverage this service for handling bursty workloads. Rather than overwhelming the underlying datastore with this load all at once, we aim to distribute it more evenly by allowing events to coalesce over short durations (typically seconds). These events accumulate in in-memory queues running on each instance. Dedicated consumers then steadily drain these queues, grouping the events by their partition key, and batching the writes to the underlying datastore.
The queues are tailored to each datastore since their operational characteristics depend on the specific datastore being written to. For instance, the batch size for writing to Cassandra is significantly smaller than that for indexing into Elasticsearch, leading to different drain rates and batch sizes for the associated consumers.
While using in-memory queues does increase JVM garbage collection pressure, we have experienced substantial improvements by transitioning to JDK 21 with ZGC; ZGC has reduced our tail latencies by an impressive 86%.
Because we use in-memory queues, we are prone to losing events in case of an instance crash. As such, these queues are only used for use cases that can tolerate some amount of data loss, e.g., tracing/logging. For use cases that need guaranteed durability and/or read-after-write consistency, these queues are effectively disabled and writes are flushed to the data store almost immediately.
Dynamic Compaction
Once a time slice exits the active write window, we can leverage the immutability of the data to optimize it for read performance. This process may involve re-compacting immutable data using optimal compaction strategies, dynamically shrinking and/or splitting shards to optimize system resources, and other similar techniques to ensure fast and reliable performance.
The following section provides a glimpse into the real-world performance of some of our TimeSeries datasets.
Real-world Performance
The service can write data in the order of low single-digit milliseconds while consistently maintaining stable point-read latencies.
At the time of writing this blog, the service was processing close to 15 million events/second across all the different datasets at peak globally.
Time Series Usage @ Netflix
The TimeSeries Abstraction plays a vital role across key services at Netflix. Here are some impactful use cases:
- Tracing and Insights: Logs traces across all apps and micro-services within Netflix, to understand service-to-service communication, aid in debugging of issues, and answer support requests.
- User Interaction Tracking: Tracks millions of user interactions — such as video playbacks, searches, and content engagement — providing insights that enhance Netflix’s recommendation algorithms in real-time and improve the overall user experience.
- Feature Rollout and Performance Analysis: Tracks the rollout and performance of new product features, enabling Netflix engineers to measure how users engage with features, which powers data-driven decisions about future improvements.
- Asset Impression Tracking and Optimization: Tracks asset impressions ensuring content and assets are delivered efficiently while providing real-time feedback for optimizations.
- Billing and Subscription Management: Stores historical data related to billing and subscription management, ensuring accuracy in transaction records and supporting customer service inquiries.
and more…
Future Enhancements
As the use cases evolve, and the need to make the abstraction even more cost effective grows, we aim to make many improvements to the service in the upcoming months. Some of them are:
- Tiered Storage for Cost Efficiency: Support moving older, lesser-accessed data into cheaper object storage that has higher time to first byte, potentially saving Netflix millions of dollars.
- Dynamic Event Bucketing: Support real-time partitioning of keys into optimally-sized partitions as events stream in, rather than having a somewhat static configuration at the time of provisioning a namespace. This strategy has a huge advantage of not partitioning time_series_ids that don’t need it, thus saving the overall cost of read amplification. Also, with Cassandra 4.x, we have noted major improvements in reading a subset of data in a wide partition that could lead us to be less aggressive with partitioning the entire dataset ahead of time.
- Caching: Take advantage of immutability of data and cache it intelligently for discrete time ranges.
- Count and other Aggregations: Some users are only interested in counting events in a given time interval rather than fetching all the event data for it.
Conclusion
The TimeSeries Abstraction is a vital component of Netflix’s online data infrastructure, playing a crucial role in supporting both real-time and long-term decision-making. Whether it’s monitoring system performance during high-traffic events or optimizing user engagement through behavior analytics, TimeSeries Abstraction ensures that Netflix operates seamlessly and efficiently on a global scale.
As Netflix continues to innovate and expand into new verticals, the TimeSeries Abstraction will remain a cornerstone of our platform, helping us push the boundaries of what’s possible in streaming and beyond.
Stay tuned for Part 2, where we’ll introduce our Distributed Counter Abstraction, a key element of Netflix’s Composite Abstractions, built on top of the TimeSeries Abstraction.
Acknowledgments
Special thanks to our stunning colleagues who contributed to TimeSeries Abstraction’s success: Tom DeVoe, Mengqing Wang, Kartik Sathyanarayanan, Jordan West, Matt Lehman, Cheng Wang, and Chris Lohfink.
Instaclustr for Apache Cassandra® 5.0 Now Generally Available
NetApp is excited to announce the general availability (GA) of Apache Cassandra® 5.0 on the Instaclustr Platform. This follows the release of the public preview in March.
NetApp was the first managed service provider to release the beta version, and now the Generally Available version, allowing the deployment of Cassandra 5.0 across the major cloud providers: AWS, Azure, and GCP, and on-premises.
Apache Cassandra has been a leader in NoSQL databases since its inception and is known for its high availability, reliability, and scalability. The latest version brings many new features and enhancements, with a special focus on building data-driven applications through artificial intelligence and machine learning capabilities.
Cassandra 5.0 will help you optimize performance, lower costs, and get started on the next generation of distributed computing by:
- Helping you build AI/ML-based applications through Vector Search
- Bringing efficiencies to your applications through new and enhanced indexing and processing capabilities
- Improving flexibility and security
With the GA release, you can use Cassandra 5.0 for your production workloads, which are covered by NetApp’s industry-leading SLAs. NetApp has conducted performance benchmarking and extensive testing while removing the limitations that were present in the preview release to offer a more reliable and stable version. Our GA offering is suitable for all workload types as it contains the most up-to-date range of features, bug fixes, and security patches.
Support for continuous backups and private network add-ons is available. Currently, Debezium is not yet compatible with Cassandra 5.0. NetApp will work with the Debezium community to add support for Debezium on Cassandra 5.0, and it will be available on the Instaclustr Platform as soon as it is supported.
Some of the key new features in Cassandra 5.0 include:
- Storage-Attached Indexes (SAI): A highly scalable, globally distributed index for Cassandra databases. With SAI, column-level indexes can be added, leading to unparalleled I/O throughput for searches across different data types, including vectors. SAI also enables lightning-fast data retrieval through zero-copy streaming of indices, resulting in unprecedented efficiency.
- Vector Search: This is a powerful technique for searching relevant content or discovering connections by comparing similarities in large document collections and is particularly useful for AI applications. It uses storage-attached indexing and dense indexing techniques to enhance data exploration and analysis.
- Unified Compaction Strategy: This strategy unifies compaction approaches, including leveled, tiered, and time-windowed strategies. It leads to a major reduction in SSTable sizes. Smaller SSTables mean better read and write performance, reduced storage requirements, and improved overall efficiency.
- Numerous stability and testing improvements: You can read all about these changes here.
All these new features are available out-of-the-box in Cassandra 5.0 and do not incur additional costs.
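To make the first three features above concrete, here is a minimal CQL sketch; the keyspace, table, and index names and the tiny vector dimension are hypothetical, and exact options may differ for your cluster:

-- Storage-Attached Index on a regular column
CREATE INDEX IF NOT EXISTS reviews_rating_idx ON store.reviews (rating) USING 'sai';

-- Vector Search: a vector column, an SAI index on it, and an ANN query
CREATE TABLE IF NOT EXISTS store.review_embeddings (
    id uuid PRIMARY KEY,
    embedding vector<float, 3>
);
CREATE INDEX IF NOT EXISTS review_embeddings_ann_idx
    ON store.review_embeddings (embedding) USING 'sai';
SELECT id FROM store.review_embeddings
    ORDER BY embedding ANN OF [0.1, 0.2, 0.3] LIMIT 5;

-- Unified Compaction Strategy on an existing table
ALTER TABLE store.reviews
    WITH compaction = { 'class': 'UnifiedCompactionStrategy' };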
Our Development team has worked diligently to bring you a stable release of Cassandra 5.0. Substantial preparatory work was done to ensure you have a seamless experience with Cassandra 5.0 on the Instaclustr Platform. This includes updating the Cassandra YAML and Java environment and enhancing the monitoring capabilities of the platform to support new data types.
We also conducted extensive performance testing and benchmarked version 5.0 against the existing stable Apache Cassandra 4.1.5 version. We will be publishing our benchmarking results shortly; the highlight so far is that Cassandra 5.0 improves responsiveness by reducing latencies by up to 30% during peak load times.
Through our dedicated Apache Cassandra committer, NetApp has contributed to the development of Cassandra 5.0 by enhancing the documentation for new features like Vector Search (Cassandra-19030), enabling Materialized Views (MV) with only partition keys (Cassandra-13857), fixing numerous bugs, and contributing to the improvements for the unified compaction strategy feature, among many other things.
Lifecycle Policy Updates
As previously communicated, the project will no longer maintain Apache Cassandra 3.0 and 3.11 versions (full details of the announcement can be found on the Apache Cassandra website).
To help you transition smoothly, NetApp will provide extended support for these versions for an additional 12 months. During this period, we will backport any critical bug fixes, including security patches, to ensure the continued security and stability of your clusters.
Cassandra 3.0 and 3.11 versions will reach end-of-life on the Instaclustr Managed Platform within the next 12 months. We will work with you to plan and upgrade your clusters during this period.
Additionally, the Cassandra 5.0 beta version and the Cassandra 5.0 RC2 version, which were released as part of the public preview, are now end-of-life. You can check the lifecycle status of different Cassandra application versions here.
You can read more about our lifecycle policies on our website.
Getting Started
Upgrading to Cassandra 5.0 will allow you to stay current and start taking advantage of its benefits. The Instaclustr by NetApp Support team is ready to help customers upgrade clusters to the latest version.
- Wondering if it’s possible to upgrade your workloads from Cassandra 3.x to Cassandra 5.0? Find the answer to this and other similar questions in this detailed blog.
- Click here to read about Storage Attached Indexes in Apache Cassandra 5.0.
- Learn about 4 new Apache Cassandra 5.0 features to be excited about.
- Click here to learn what you need to know about Apache Cassandra 5.0.
Why Choose Apache Cassandra on the Instaclustr Managed Platform?
NetApp strives to deliver the best of supported applications. Whether it’s the latest and newest application versions available on the platform or additional platform enhancements, we ensure a high quality through thorough testing before entering General Availability.
NetApp customers have the advantage of accessing the latest versions—not just the major version releases but also minor version releases—so that they can benefit from any new features and are protected from any vulnerabilities.
Don’t have an Instaclustr account yet? Sign up for a trial or reach out to our Sales team and start exploring Cassandra 5.0.
With more than 375 million node hours of management experience, Instaclustr offers unparalleled expertise. Visit our website to learn more about the Instaclustr Managed Platform for Apache Cassandra.
If you would like to upgrade your Apache Cassandra version or have any issues or questions about provisioning your cluster, please contact Instaclustr Support at any time.
Apache Cassandra® 5.0: Behind the Scenes
Here at NetApp, our Instaclustr product development team has spent nearly a year preparing for the release of Apache Cassandra 5.
It started with one engineer tinkering at night with the Apache Cassandra 5 Alpha branch, and grew to 5 engineers working on various monitoring, configuration, testing, and functionality improvements to integrate the release with the Instaclustr Platform.
It’s been a long journey to the point we are at today, offering Apache Cassandra 5 Release Candidate 1 in public preview on the Instaclustr Platform.
Note: the Instaclustr team has a dedicated open source committer to the Apache Cassandra project. His changes are not included in this document as there were too many for us to include here. Instead, this blog primarily focuses on the engineering effort to release Cassandra 5.0 onto the Instaclustr Managed Platform.
August 2023: The Beginning
We began experimenting with the Apache Cassandra 5 Alpha 1 branches using our build systems. There were several tools we built into our Apache Cassandra images that were not working at this point, but we managed to get a node to start even though it immediately crashed with errors.
One of our early achievements was identifying and fixing a bug that impacted our packaging solution; this resulted in a small contribution to the project allowing Apache Cassandra to be installed on Debian systems with non-OpenJDK Java.
September 2023: First Milestone
The release of the Alpha 1 version allowed us to achieve our first running Cassandra 5 cluster in our development environments (without crashing!).
Basic core functionalities like user creation, data writing, and backups/restores were tested successfully. However, several advanced features, such as repair and replace tooling, monitoring, and alerting were still untested.
At this point we had to pause our Cassandra 5 efforts to focus on other priorities and planned to get back to testing Cassandra 5 after Alpha 2 was released.
November 2023: Further Testing and Internal Preview
The project released Alpha 2. We repeated the same build and test process we had used for Alpha 1. We also tested some more advanced procedures, like cluster resizes, with no issues.
We also started testing with some of the new 5.0 features: Vector Data types and Storage-Attached Indexes (SAI), which resulted in another small contribution.
We launched Apache Cassandra 5 Alpha 2 for internal preview (basically for internal users). This allowed the wider Instaclustr team to access and use the Alpha on the platform.
During this phase we found a bug in our metrics collector when vectors were encountered that ended up being a major project for us.
If you see errors like the below, it’s time for a Java Cassandra driver upgrade to 4.16 or newer:
java.lang.IllegalArgumentException: Could not parse type name vector<float, 5>
Nov 15 22:41:04 ip-10-0-39-7 process[1548]: at com.datastax.driver.core.DataTypeCqlNameParser.parse(DataTypeCqlNameParser.java:233)
Nov 15 22:41:04 ip-10-0-39-7 process[1548]: at com.datastax.driver.core.TableMetadata.build(TableMetadata.java:311)
Nov 15 22:41:04 ip-10-0-39-7 process[1548]: at com.datastax.driver.core.SchemaParser.buildTables(SchemaParser.java:302)
Nov 15 22:41:04 ip-10-0-39-7 process[1548]: at com.datastax.driver.core.SchemaParser.refresh(SchemaParser.java:130)
Nov 15 22:41:04 ip-10-0-39-7 process[1548]: at com.datastax.driver.core.ControlConnection.refreshSchema(ControlConnection.java:417)
Nov 15 22:41:04 ip-10-0-39-7 process[1548]: at com.datastax.driver.core.ControlConnection.refreshSchema(ControlConnection.java:356)
<Rest of stacktrace removed for brevity>
December 2023: Focus on new features and planning
As the project released Beta 1, we began focusing on the features in Cassandra 5 that we thought were the most exciting and would provide the most value to customers. There are a lot of awesome new features and changes, so it took a while to find the ones with the largest impact.
The final list of high impact features we came up with was:
- A new data type – Vectors
- Trie memtables/Trie Indexed SSTables (BTI Formatted SStables)
- Storage-Attached Indexes (SAI)
- Unified Compaction Strategy
A major new feature we considered deploying was support for JDK 17. However, due to its experimental nature, we have opted to postpone adoption and plan to support running Apache Cassandra on JDK 17 when it’s out of the experimentation phase.
Once the holiday season arrived, it was time for a break, and we were back in force in February next year.
February 2024: Intensive testing
In February, we released Beta 1 into internal preview so we could start testing it on our Preproduction test environments. As we started to do more intensive testing, we discovered issues in the interaction with our monitoring and provisioning setup.
We quickly fixed the issues identified as showstoppers for launching Cassandra 5. By the end of February, we initiated discussions about a public preview release. We also started to add more resourcing to the Cassandra 5 project; until then, only one person had been working on it.
Next, we broke down the work we needed to do. This included identifying monitoring agents requiring upgrade and config defaults that needed to change.
From this point, the project split into 3 streams of work:
- Project Planning – Deciding how all this work gets pulled together cleanly, ensuring other work streams have adequate resourcing to hit their goals, and informing product management and the wider business of what’s happening.
- Configuration Tuning – Focusing on the new features of Apache Cassandra to include, how to approach the transition to JDK 17, and how to use BTI formatted SSTables on the platform.
- Infrastructure Upgrades – Identifying what to upgrade internally to handle Cassandra 5, including Vectors and BTI formatted SSTables.
A Senior Engineer was responsible for each workstream to ensure planned timeframes were achieved.
March 2024: Public Preview Release
In March, we launched Beta 1 into public preview on the Instaclustr Managed Platform. The initial release did not contain any opt-in features like Trie-indexed SSTables.
However, this gave us a consistent base to test in our development, test, and production environments, and proved our release pipeline for Apache Cassandra 5 was working as intended. This also gave customers the opportunity to start using Apache Cassandra 5 with their own use cases and environments for experimentation.
See our public preview launch blog for further details.
There was not much time to celebrate as we continued working on infrastructure and refining our configuration defaults.
April 2024: Configuration Tuning and Deeper Testing
The first configuration updates were completed for Beta 1, and we started performing deeper functional and performance testing. We identified a few issues from this effort and remediated them. This updated default configuration was applied to all Beta 1 clusters moving forward.
This allowed users to start testing Trie Indexed SSTables and Trie memtables in their environment by default.
"memtable": { "configurations": { "skiplist": { "class_name": "SkipListMemtable" }, "sharded": { "class_name": "ShardedSkipListMemtable" }, "trie": { "class_name": "TrieMemtable" }, "default": { "inherits": "trie" } } }, "sstable": { "selected_format": "bti" }, "storage_compatibility_mode": "NONE",
The configuration above illustrates an Apache Cassandra YAML configuration where BTI-formatted SSTables are used by default (which allows Trie-indexed SSTables) and Trie memtables are the default memtable implementation. You can override this per table:
CREATE TABLE test WITH memtable = {'class' : 'ShardedSkipListMemtable'};
Note that you need to set storage_compatibility_mode to NONE to use BTI formatted sstables. See Cassandra documentation for more information.
You can also reference the cassandra_latest.yaml file for the latest settings (please note you should not apply these to existing clusters without rigorous testing).
May 2024: Major Infrastructure Milestone
We hit a very large infrastructure milestone when we released an upgrade to some of our core agents that were reliant on an older version of the Apache Cassandra Java driver. The upgrade to version 4.17 allowed us to start supporting vectors in certain keyspace level monitoring operations.
At the time, this was considered to be the riskiest part of the entire project as we had 1000s of nodes to upgrade across many different customer environments. This upgrade took a few weeks, finishing in June. We broke the release up into 4 separate rollouts to reduce the risk of introducing issues into our fleet, focusing on single key components in our architecture in each release. Each release had quality gates and tested rollback plans, which in the end were not needed.
June 2024: Successful Rollout of the New Cassandra Driver
The Java driver upgrade project was rolled out to all nodes in our fleet and no issues were encountered. At this point, we had hit all the major milestones before Release Candidates became available. We started to look at updating our testing systems to Apache Cassandra 5 by default.
July 2024: Path to Release Candidate
We upgraded our internal testing systems to use Cassandra 5 by default, meaning our nightly platform tests began running against Cassandra 5 clusters and our production releases would smoke test using Apache Cassandra 5. We started testing the upgrade path for clusters from 4.x to 5.0. This resulted in another small contribution to the Cassandra project.
The Apache Cassandra project released Apache Cassandra 5 Release Candidate 1 (RC1), and we launched RC1 into public preview on the Instaclustr Platform.
The Road Ahead to General Availability
We’ve just launched Apache Cassandra 5 Release Candidate 1 (RC1) into public preview, and there’s still more to do before we reach General Availability for Cassandra 5, including:
- Upgrading our own preproduction Apache Cassandra for internal use to Apache Cassandra 5 Release Candidate 1. This means we’ll be testing using our real-world use cases and testing our upgrade procedures on live infrastructure.
At Launch:
When Apache Cassandra 5.0 launches, we will perform another round of testing, including performance benchmarking. We will also upgrade our internal metrics storage production Apache Cassandra clusters to 5.0, and, if the results are satisfactory, we will mark the release as generally available for our customers. We want to have full confidence in running 5.0 before we recommend it for production use to our customers.
For more information about our own usage of Cassandra for storing metrics on the Instaclustr Platform check out our series on Monitoring at Scale.
What Have We Learned From This Project?
- Releasing limited, small, and frequent changes has resulted in a smooth project, even if sometimes frequent releases do not feel smooth. Some thoughts:
- Releasing to a small subset of internal users allowed us to take risks and break things more often so we could learn from our failures safely.
- Releasing small changes allowed us to more easily understand and predict the behaviour of our changes: what to look out for in case things went wrong, how to more easily measure success, etc.
- Releasing frequently built confidence within the wider Instaclustr team, which in turn meant we would be happier taking more risks and could release more often.
- Releasing to internal and public preview helped create momentum within the Instaclustr business and teams:
- This turned the Apache Cassandra 5.0 release from something that “was coming soon and very exciting” to “something I can actually use.”
- Communicating frequently, transparently, and efficiently is the foundation of success:
- We used a dedicated Slack channel (very creatively named #cassandra-5-project) to discuss everything.
- It was quick and easy to go back to see why we made certain decisions or revisit them if needed. This had a bonus of allowing a Lead Engineer to write a blog post very quickly about the Cassandra 5 project.
This has been a long-running but very exciting project for the entire team here at Instaclustr. The Apache Cassandra community is on the home stretch for this massive release, and we couldn’t be more excited to start seeing what everyone will build with it.
You can sign up today for a free trial and test Apache Cassandra 5 Release Candidate 1 by creating a cluster on the Instaclustr Managed Platform.
More Readings
- The Top 5 Questions We’re Asked about Apache Cassandra 5.0
- Vector Search in Apache Cassandra 5.0
- How Does Data Modeling Change in Apache Cassandra 5.0?
Will Your Cassandra Database Project Succeed?: The New Stack
Open source Apache Cassandra® continues to stand out as an enterprise-proven solution for organizations seeking high availability, scalability and performance in a NoSQL database. (And hey, the brand-new 5.0 version is only making those statements even more true!) There’s a reason this database is trusted by some of the world’s largest and most successful companies.
That said, effectively harnessing the full spectrum of Cassandra’s powerful advantages can mean overcoming a fair share of operational complexity. Some folks will find a significant learning curve, and knowing what to expect is critical to success. In my years of experience working with Cassandra, it’s when organizations fail to anticipate and respect these challenges that they set the stage for their Cassandra projects to fall short of expectations.
Let’s look at the key areas where strong project management and following proven best practices will enable teams to evade common pitfalls and ensure a Cassandra implementation is built strong from Day 1.
Accurate Data Modeling Is a Must
Cassandra projects require a thorough understanding of its unique data model principles. Teams that approach Cassandra like a relational database are unlikely to model data properly. This can lead to poor performance, excessive use of secondary indexes and significant data consistency issues.
On the other hand, teams that develop familiarity with Cassandra’s specific NoSQL data model will understand the importance of including partition keys, clustering keys and denormalization. These teams will know to closely analyze query and data access patterns associated with their applications and know how to use that understanding to build a Cassandra data model that matches their application’s needs step for step.
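For example, a team whose application always asks "show the most recent videos added by a given user" might model that query directly rather than normalizing as they would in a relational schema. A minimal sketch, with hypothetical table and column names:

CREATE TABLE videos_by_user (
    user_id uuid,
    added_date timestamp,
    video_id uuid,
    title text,
    PRIMARY KEY ((user_id), added_date, video_id)
) WITH CLUSTERING ORDER BY (added_date DESC, video_id ASC);

-- The partition key (user_id) routes the query to a single partition,
-- and the clustering order serves "most recent first" without extra filtering.
SELECT video_id, title
FROM videos_by_user
WHERE user_id = 5b6962dd-3f90-4c93-8f61-eabfa4a803e2
LIMIT 10;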
Configure Cassandra Clusters the Right Way
Accurate, expertly managed cluster configurations are pivotal to the success of Cassandra implementations. Get those cluster settings wrong and Cassandra can suffer from data inconsistencies and performance issues due to inappropriate node capacities, poor partitioning or replication strategies that aren’t up to the task.
Teams should understand the needs of their particular use case and how each cluster configuration setting affects Cassandra’s abilities to serve that use case. Attuning configurations to best support your application — including the right settings for node capacity, data distribution, replication factor and consistency levels — will ensure that you can harness the full power of Cassandra when it counts.
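As one small, hedged example of the kind of setting this covers, the replication strategy and factor are declared when a keyspace is created; the keyspace name, data center names, and factors below are placeholders to adapt to your own topology:

CREATE KEYSPACE IF NOT EXISTS orders
    WITH REPLICATION = {
        'class': 'NetworkTopologyStrategy',
        'us_east_dc': 3,
        'us_west_dc': 3
    };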
Take Advantage of Tunable Consistency
Cassandra gives teams the option to leverage the best balance of data consistency and availability for their use case. While these tunable consistency levels are a valuable tool in the right hands, teams that don’t understand the nuances of these controls can saddle their applications with painful latency and troublesome data inconsistencies.
Teams that learn to operate Cassandra’s tunable consistency levels properly and carefully assess their application’s needs — especially with read and write patterns, data sensitivity and the ability to tolerate eventual consistency — will unlock far more beneficial Cassandra experiences.
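As a brief sketch, the consistency level can be tuned per session in cqlsh (drivers expose the same control per request); the level and the table queried here are only examples:

-- cqlsh command: subsequent requests in this session use LOCAL_QUORUM
CONSISTENCY LOCAL_QUORUM;
SELECT * FROM orders.orders_by_customer
WHERE customer_id = 7a3c1f2e-9b4d-4a6e-8c2d-1f0e9d8c7b6a;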
Perform Regular Maintenance
Regular Cassandra maintenance is required to stave off issues such as data inconsistencies and performance drop-offs. Within their Cassandra operational procedures, teams should routinely perform compaction, repair and node-tool operations to prevent challenges down the road, while ensuring cluster health and performance are optimized.
Anticipate Capacity and Scaling Needs
By its nature, success will yield new needs. Be prepared for your Cassandra cluster to grow and scale well into the future — that is what this database is built to do. Starving your Cassandra cluster for CPU, RAM and storage resources because you don’t have a plan to seamlessly add capacity is a way of plucking failure from the jaws of success. Poor performance, data loss and expensive downtime are the rewards for growing without looking ahead.
Plan for growth and scalability from the beginning of your Cassandra implementation. Practice careful capacity planning. Look at your data volumes, write/read patterns and performance requirements today and tomorrow. Teams with clusters built for growth will be ready to do so far more easily and affordably.
Make Changes With a Careful Testing/Staging/Prod Process
Teams that think they’re streamlining their process efficiency by putting Cassandra changes straight into production actually enable a pipeline for bugs, performance roadblocks and data inconsistencies. Testing and staging environments are essential for validating changes before putting them into production environments and will save teams countless hours of headaches.
At the end of the day, running all data migrations, changes to schema and application updates through testing and staging environments is far more efficient than putting them straight into production and then cleaning up myriad live issues.
Set Up Monitoring and Alerts
Teams implementing monitoring and alerts to track metrics and flag anomalies can mitigate trouble spots before they become full-blown service interruptions. The speed at which teams become aware of issues can mean the difference between a behind-the-scenes blip and a downtime event.
Have Backup and Disaster Recovery at the Ready
In addition to standing up robust monitoring and alerting, teams should regularly test and run practice drills on their procedures for recovering from disasters and using data backups. Don’t neglect this step; these measures are absolutely essential for ensuring the safety and resilience of systems and data.
The less prepared an organization is to recover from issues, the longer and more costly and impactful downtime will be. Incremental or snapshot backup strategies, replication that’s based in the cloud or across multiple data centers and fine-tuned recovery processes should be in place to minimize downtime, stress and confusion whenever the worst occurs.
Nurture Cassandra Expertise
The expertise required to optimize Cassandra configurations, operations and performance will only come with a dedicated focus. Enlisting experienced talent, instilling continuous training regimens that keep up with Cassandra updates, turning to external support and ensuring available resources — or all of the above — will position organizations to succeed in following the best practices highlighted here and achieving all of the benefits that Cassandra can deliver.
Use Your Data in LLMs With the Vector Database You Already Have: The New Stack
Open source vector databases are among the top options out there for AI development, including some you may already be familiar with or even have on hand.
Vector databases allow you to enhance your LLM models with data from your internal data stores. Prompting the LLM with local, factual knowledge can allow you to get responses tailored to what your organization already knows about the situation. This reduces “AI hallucination” and improves relevance.
You can even ask the LLM to add references to the original data it used in its answer so you can check yourself. No doubt vendors have reached out with proprietary vector database solutions, advertised as a “magic wand” enabling you to assuage any AI hallucination concerns.
But, ready for some good news?
If you’re already using Apache Cassandra 5.0, OpenSearch or PostgreSQL, your vector database success is already primed. That’s right: There’s no need for costly proprietary vector database offerings. If you’re not (yet) using these free and fully open source database technologies, your generative AI aspirations are a good time to migrate — they are all enterprise-ready and avoid the pitfalls of proprietary systems.
For many enterprises, these open source vector databases are the most direct route to implementing LLMs — and possibly leveraging retrieval augmented generation (RAG) — that deliver tailored and factual AI experiences.
Vector databases store embedding vectors, which are lists of numbers representing spatial coordinates corresponding to pieces of data. Related data will have closer coordinates, allowing LLMs to make sense of complex and unstructured datasets for features such as generative AI responses and search capabilities.
RAG, a process skyrocketing in popularity, involves using a vector database to translate the words in an enterprise’s documents into embeddings to provide highly efficient and accurate querying of that documentation via LLMs.
Let’s look closer at what each open source technology brings to the vector database discussion:
Apache Cassandra 5.0 Offers Native Vector Indexing
With its latest version (currently in preview), Apache Cassandra has added to its reputation as an especially highly available and scalable open source database by including everything that enterprises developing AI applications require.
Cassandra 5.0 adds native vector indexing and vector search, as well as a new vector data type for embedding vector storage and retrieval. The new version has also added specific Cassandra Query Language (CQL) functions that enable enterprises to easily use Cassandra as a vector database. These additions make Cassandra 5.0 a smart open source choice for supporting AI workloads and executing enterprise strategies around managing intelligent data.
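A short sketch of how this might look for a RAG-style document store (the keyspace, table, and column names are made up, and the tiny embedding dimension is only for readability; real embedding models typically produce hundreds or thousands of dimensions):

CREATE TABLE docs.chunks (
    doc_id uuid,
    chunk_id int,
    body text,
    embedding vector<float, 3>,
    PRIMARY KEY (doc_id, chunk_id)
);
CREATE INDEX IF NOT EXISTS chunks_embedding_idx
    ON docs.chunks (embedding) USING 'sai';

-- Retrieve the chunks closest to a query embedding to feed into the LLM prompt
SELECT body FROM docs.chunks
    ORDER BY embedding ANN OF [0.12, 0.48, 0.91] LIMIT 3;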
OpenSearch Provides a Combination of Benefits
Like Cassandra, OpenSearch is another highly popular open source solution, one that many folks on the lookout for a vector database happen to already be using. OpenSearch offers a one-stop shop for search, analytics and vector database capabilities, while also providing exceptional nearest-neighbor search capabilities that support vector, lexical, and hybrid search and analytics.
With OpenSearch, teams can put the pedal down on developing AI applications, counting on the database to deliver the stability, high availability and minimal latency it’s known for, along with the scalability to account for vectors into the tens of billions. Whether developing a recommendation engine, generative AI agent or any other solution where the accuracy of results is crucial, those using OpenSearch to leverage vector embeddings and stamp out hallucinations won’t be disappointed.
The pgvector Extension Makes Postgres a Powerful Vector Store
Enterprises are no strangers to Postgres, which ranks among the most used databases in the world. Given that the database only needs the pgvector extension to become a particularly performant vector database, countless organizations are just a simple deployment away from harnessing an ideal infrastructure for handling their intelligent data.
pgvector is especially well-suited to provide exact nearest-neighbor search, approximate nearest-neighbor search and distance-based embedding search, and at using cosine distance (as recommended by OpenAI), L2 distance and inner product to recognize semantic similarities. Efficiency with those capabilities makes pgvector a powerful and proven open source option for training accurate LLMs and RAG implementations, while positioning teams to deliver trustworthy AI applications they can be proud of.
Was the Answer to Your AI Challenges in Front of You All Along?
The solution to tailored LLM responses isn’t investing in some expensive proprietary vector database and then trying to dodge the very real risks of vendor lock-in or a bad fit. At least it doesn’t have to be. Recognizing that available open source vector databases are among the top options out there for AI development — including some you may already be familiar with or even have on hand — should be a very welcome revelation.
Who Did What to That and When? Exploring the User Actions Feature
NetApp recently released the user actions feature on the Instaclustr Managed Platform, allowing customers to search for user actions recorded against their accounts and organizations. We record over 100 different types of actions, with detailed descriptions of what was done, by whom, to what, and at what time.
This provides customers with visibility into the actions users are performing on their linked accounts. NetApp has always collected this information in line with our security and compliance policies, but now, all important changes to your managed cluster resources have self-service access from the Console and the APIs.
In the past, this information was accessible only through support tickets when important questions such as “Who deleted my cluster?” and “When was the firewall rule removed from my cluster?” needed answers. This feature adds more self-discoverability of what your users are doing and what our support staff are doing to keep your clusters healthy.
This blog post provides a detailed walkthrough of this new feature at a moderate level of technical detail, with the hope of encouraging you to explore and better find the actions you are looking for.
For this blog, I’ve created two Apache Cassandra® clusters in one account and performed some actions on each. I’ve also created an organization linked to this account and performed some actions on that. This will allow a full example UI to be shown and demonstrate the type of “stories” that can emerge from typical operations via user actions.
Introducing Global Directory
During development, we decided to consolidate the other global account pages into a new centralized location, which we are calling the “Directory”.
This Directory provides you with the consolidated view of all organizations and accounts that you have access to, collecting global searches and account functions into a view that does not have a “selected cluster” context (i.e., global). For more information on how Organizations, Accounts and Clusters relate to each other, check out this blog.
Organizations serve as an efficient method to consolidate all associated accounts into a single unified, easily accessible location. They introduce an extra layer to the permission model, facilitating the management and sharing of information such as contact and billing details. They also streamline the process of Single Sign-On (SSO) and account creation.
Let’s log in and click on the new button:
This will take us to the new directory landing page:
Here, you will find two types of global searches: accounts and user actions, as well as account creation. Selecting the new “User Actions” item will take us to the new page. You can also navigate to these directory pages directly from the top right ‘folder’ menu:
User Action Search Page: Walkthrough
This is the new page we land on if we choose to search for user actions:
When you first enter, it finds the last page of actions that happened in the accounts and organizations you have access to. It will show both organization and account actions on a single consolidated page, even though they are slightly different in nature.
*Note: The accessible accounts and organizations are defined as those you are linked to as CLUSTER_ADMIN or OWNER.
*TIP: If you don’t want an account user to see user actions, give them READ_ONLY access.
You may notice a brief progress bar display as the actions are retrieved. At the time of writing, we have recorded nearly 100 million actions made by our customers over a 6-month period.
From here, you can increase the number of actions shown on each page and page through the results. Sorting is not currently supported on the actions table, but it is something we will be looking to add in the future. For each action found, the table will display:
- Action: What happened to your account (or organization)? There are over 100 tracked kinds of actions recorded.
- Domain: The specific account or organization name of the action targeted.
- Description: An expanded description of what happened, using context captured at the time of action. Important values are highlighted between square brackets, and the copy button will copy the first one into the clipboard.
- User: The user who performed the action, typically using the console/APIs or Terraform provider, but it can also be triggered by "Instaclustr Support" using our admin tools.
- For those actions marked with user “Instaclustr Support”, please reach out to support for more information about those actions we’ve taken on your behalf or visit https://support.instaclustr.com/hc/en-us.
- Local time: The action time from your local web browser’s perspective.
Additionally, for those who prefer programmatic access, the user action feature is fully accessible via our APIs, allowing for automation and integration into your existing workflows. Please visit our API documentation page here for more details.
Basic (super-search) Mode
Let’s say we only care about the “LeagueOfNations” organization domain; we can type ‘League’ and then click Search:
The name patterns are simple partial string patterns we look for as being 'contained' within the name, such as "Car" in "Carlton". These are case insensitive. They are not (yet!) general regular expressions.
Advanced “find a needle” Search Mode
Sometimes, searching by names is not precise enough; you may want to provide more detailed search criteria, such as time ranges or narrowing down to specific clusters or kinds of actions. Expanding the “Advanced Search” section will switch the page to a more advanced search criteria form, disabling the basic search area and its criteria.
Let’s say we only want to see the “Link Account” actions over the last week:
We select it from the actions multi-chip selector using the cursor (we could also type it and allow autocomplete to kick in). Hitting search will give you your needle; now it's time to go chase that Carl guy down and ask why he linked that darn account:
The available criteria fields are as follows (additive in nature):
- Action: the kinds of actions, with a bracketed count of their frequency over the current criteria; if empty, all are included.
- Account: The account name of interest OR its UUID can be useful to narrow the matches to only a specific account. It’s also useful when user, organization, and account names share string patterns, which makes the super-search less precise.
- Organization: the organization name of interest or its UUID.
- User: the user who performed the action.
- Description: matches against the value of an expanded description variable. This is useful because most actions mention the ‘target’ of the action, such as cluster-id, in the expanded description.
- Starting At: match actions starting from this time; this cannot be older than 12 months ago.
- Ending At: match actions up until this time.
Bonus Feature: Cluster Actions
While it’s nice to have this new search page, we wanted to build a higher-order question on top of it: What has happened to my cluster?
The answer can be found on the details tab of each cluster. When clicked on, it will take you directly to the user actions page with appropriate criteria to answer the question.
* TIP: we currently support entry into this view with a descriptionFormat queryParam, allowing you to save bookmarks to particular action 'targets'. Further queryParams may be supported in the future for the remaining criteria: https://console2.instaclustr.com/global/searches/user-action?descriptionContextPattern=acde7535-3288-48fa-be64-0f7afe4641b3
Clicking this provides you the answer:
Future Thoughts
There are some future capabilities we will look to add, including the ability to subscribe to webhooks that trigger on some criteria. We would also like to add the ability to generate reports against a criterion or to run such things regularly and send them via email. Let us know what other feature improvements you would like to see!
Conclusion
This new capability allows customers to search for user actions directly without contacting support. It also provides improved visibility and auditing of what’s been changing on their clusters and who’s been making those changes. We hope you found this interesting and welcome any feedback for “higher-order” types of searches you’d like to see built on top of this new feature. What kind of common questions about user actions can you think of?
If you have any questions about this feature, please contact Instaclustr Support at any time. If you are not a current Instaclustr customer and you’re interested to learn more, register for a free trial and spin up your first cluster for free!
Powering AI Workloads with Intelligent Data Infrastructure and Open Source
In the rapidly evolving technological landscape, artificial intelligence (AI) is emerging as a driving force behind innovation and efficiency. However, to harness its full potential, enterprises need suitable data infrastructures that can support AI workloads effectively.
This blog explores how intelligent data infrastructure, combined with open source technologies, is revolutionizing AI applications across various business functions. It outlines the benefits of leveraging existing infrastructure and highlights key open source databases that are indispensable for powering AI.
The Power of Open Source in AI Solutions
Open source technologies have long been celebrated for their flexibility, community support, and cost-efficiency. In the realm of AI these advantages are magnified. Here’s why open source is indispensable for AI-fueled solutions:
- Cost Efficiency: Open source solutions eliminate licensing fees, making them an attractive option for businesses looking to optimize their budgets.
- Community Support: A vibrant community of developers constantly improves these platforms, ensuring they remain cutting-edge.
- Flexibility and Customization: Open source tools can be tailored to meet specific needs, allowing enterprises to build solutions that align perfectly with their goals.
- Transparency and Security: With open source, you have visibility into the code, which allows for better security audits and trustworthiness.
Vector Databases: A Key Component for AI Workloads
Vector databases are increasingly indispensable for AI workloads. They store data in high-dimensional vectors, which AI models use to understand patterns and relationships. This capability is crucial for applications involving natural language processing, image recognition, and recommendation systems.
Vector databases use embedding vectors (lists of numbers) to represent data similarities and plot relationships spatially. For example, “plant” and “shrub” will have closer vector coordinates than “plant” and “car”. This allows enterprises to build their own LLMs, explore large text datasets, and enhance search capabilities.
Vector databases and embeddings also support retrieval augmented generation (RAG), which improves LLM accuracy by refining its understanding of new information. For example, RAG can let users query documentation by creating embeddings from an enterprise’s documents, translating words into vectors, finding similar words in the documentation, and retrieving relevant information. This data is then provided to an LLM, enabling it to generate accurate text answers for users.
The Role of Vector Databases in AI:
- Efficient Data Handling: Vector databases excel at handling large volumes of data efficiently, which is essential for training and deploying AI models.
- High Performance: They offer high-speed retrieval and processing of complex data types, ensuring AI applications run smoothly.
- Scalability: With the ability to scale horizontally, vector databases can grow alongside your AI initiatives without compromising performance.
Leveraging Existing Infrastructure for AI Workloads
Contrary to popular belief, it isn’t necessary to invest in new and exotic specialized data layer solutions. Your existing infrastructure can often support AI workloads with a few strategic enhancements:
- Evaluate Current Capabilities: Start by assessing your current data infrastructure to identify any gaps or areas for improvement.
- Upgrade Where Necessary: Consider upgrading components such as storage, network speed, and computing power to meet the demands of AI workloads.
- Integrate with AI Tools: Ensure your infrastructure is compatible with leading AI tools and platforms to facilitate seamless integration.
Open Source Databases for Enterprise AI
Several open source databases are particularly well-suited for enterprise AI applications. Let's look at the 3 free open source databases that enterprise teams can leverage as they scale their intelligent data infrastructure for storing those embedding vectors:
PostgreSQL® and pgvector
“The world’s most advanced open source relational database”, PostgreSQL is also one of the most widely deployed, meaning that most enterprises will already have a strong foothold in the technology. The pgvector extension turns Postgres into a high-performance vector store, offering a path of least resistance for organizations familiar with PostgreSQL to quickly stand up intelligent data infrastructure.
From a RAG and LLM training perspective, pgvector excels at enabling distance-based embedding search, exact nearest neighbor search, and approximate nearest neighbor search. pgvector efficiently captures semantic similarities using L2 distance, inner product, and (the OpenAI-recommended) cosine distance. Teams can also harness OpenAI’s embeddings model (available as an API) to calculate embeddings for documentation and user queries. As an enterprise-ready open source option, pgvector is an already-proven solution for achieving efficient, accurate, and performant LLMs, helping equip teams to confidently launch differentiated and AI-fueled applications into production.
OpenSearch®
Because OpenSearch is a mature search and analytics engine already popular with a wide swath of enterprises, new and current users will be glad to know that the open source solution is ready to up the pace of AI application development as a singular search, analytics, and vector database.
OpenSearch has long offered low latency, high availability, and the scale to handle tens of billions of vectors while backing stable applications. It provides great nearest-neighbor search functionality to support vector, lexical, and hybrid search and analytics. These capabilities significantly simplify the implementation of AI solutions, from generative AI agents to recommendation engines with trustworthy results and minimal hallucinations.
Apache Cassandra® 5.0 with Native Vector Indexing
Known for its linear scalability and fault-tolerance on commodity hardware or cloud infrastructure, Apache Cassandra is a reliable choice for enterprise-grade AI applications. The newest version of the highly popular open source Apache Cassandra database introduces several new features built for AI workloads. It now includes Vector Search and Native Vector indexing capabilities.
Additionally, there is a new vector data type specifically for saving and retrieving embedding vectors, and new CQL functions for easily executing on those capabilities. By adding these features, Apache Cassandra 5.0 has emerged as an especially ideal database for intelligent data strategies and for enterprises rapidly building out AI applications across myriad use cases.
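For instance, alongside ANN ordering, 5.0 exposes vector similarity functions directly in CQL. A one-statement sketch, assuming a hypothetical table with a vector column named embedding (names and values are placeholders for your own schema):

SELECT doc_id, similarity_cosine(embedding, [0.12, 0.48, 0.91]) AS score
FROM docs.chunks
ORDER BY embedding ANN OF [0.12, 0.48, 0.91] LIMIT 3;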
Cassandra’s earned reputation for delivering high availability and scalability now adds AI-specific functionality, making it one of the most enticing open source options for enterprises.
Open Source Opens the Door to Successful AI Workloads
Clearly, given the tremendously rapid pace at which AI technology is advancing, enterprises cannot afford to wait to build out differentiated AI applications. But in this pursuit, engaging with the wrong proprietary data-layer solutions—and suffering the pitfalls of vendor lock-in or simply mismatched features—can easily be (and, for some, already is) a fatal setback. Instead, tapping into one of the very capable open source vector databases available will allow enterprises to put themselves in a more advantageous position.
When leveraging open source databases for AI workloads, consider the following:
- Data Security: Ensure robust security measures are in place to protect sensitive data.
- Scalability: Plan for future growth by choosing solutions that can scale with your needs.
- Resource Allocation: Allocate sufficient resources, such as computing power and storage, to support AI applications.
- Governance and Compliance: Adhere to governance and compliance standards to ensure responsible use of AI.
Conclusion
Intelligent data infrastructure and open source technologies are revolutionizing the way enterprises approach AI workloads. By leveraging existing infrastructure and integrating powerful open source databases, organizations can unlock the full potential of AI, driving innovation and efficiency.
Ready to take your AI initiatives to the next level? Leverage a single platform to help you design, deploy and monitor the infrastructure to support the capabilities of PostgreSQL with pgvector, OpenSearch, and Apache Cassandra 5.0 today.
And for more insights and expert guidance, don’t hesitate to contact us and speak with one of our open source experts!
How Does Data Modeling Change in Apache Cassandra® 5.0 With Storage-Attached Indexes?
Data modeling in Apache Cassandra® is an important topic—how you model your data can affect your cluster’s performance, costs, etc. Today I’ll be looking at a new feature in Cassandra 5.0 called Storage-Attached Indexes (SAI), and how they affect the way you model data in Cassandra databases.
First, I’ll briefly cover what SAIs are (for more information about SAIs, check out this post). Then I’ll look at 3 use cases where your modeling strategy could change with SAI. Finally, I’ll talk about the benefits and constraints of SAIs.
What Are Storage-Attached Indexes?
From the Cassandra 5.0 Documentation, Storage-Attached Indexes (SAIs) “[provide] an indexing mechanism that is closely integrated with the Cassandra storage engine to make data modeling easier for users.” Secondary indexing, which is indexing values on properties that are not part of the Primary Key for that table, has been available in Cassandra in the past (as SASI and 2i). However, SAI will replace that existing functionality, which is deprecated in 5.0 and tentatively slated for removal in Cassandra 6.0.
This is because SAI improves upon the older methods in several key ways. For one, according to the developers, SAI is the fastest indexing method for Cassandra clusters, which makes indexing practical in production environments. SAI also carries lower data storage overhead than prior implementations, reducing storage and the operational costs that come with it, and it reduces the latency of index-backed reads, avoiding the drop in user engagement that high latency causes.
How Do SAIs Work?
SAIs are implemented as part of the SSTables, or Sorted String Tables, of a Cassandra database: SAI indexes Memtables and SSTables as they are written, then filters matches from both the in-memory and on-disk sources into a set of indexed columns at read time. I'm not going to go into too much detail here because there are a lot of existing resources on this topic: see the Cassandra 5.0 Documentation and the Instaclustr site for examples.
The main thing to keep in mind is that SAI is attached to Cassandra's storage engine, and it is much more performant as a result, from speed, scalability, and data storage angles. This means you can use indexing reliably in production beginning with Cassandra 5.0, which opens the door to better data models with very little effort.
To learn more about how SAIs work, check out this piece from the Apache Cassandra blog.
What Is SAI For?
SAI is a filtering engine, and while it does have some functional overlap with search engines, the documentation says outright that it is "not an enterprise search engine" (source).
SAI is meant for creating filters on non-primary-key columns or on composite partition key columns (source), essentially meaning that you can use a 'WHERE' clause on any column in your Cassandra 5.0 database. This makes queries a lot more flexible without sacrificing latency or storage space the way prior methods did; a minimal sketch of the pattern follows.
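As a quick illustration of that pattern, here is a minimal sketch; the demo keyspace, users table, and country column are made up for this example rather than taken from the article.

CREATE KEYSPACE IF NOT EXISTS demo
    WITH REPLICATION = { 'class' : 'SimpleStrategy', 'replication_factor' : 3 };

-- country is not part of the primary key, so it normally could not be filtered on without ALLOW FILTERING
CREATE TABLE IF NOT EXISTS demo.users (
    id int PRIMARY KEY,
    name text,
    country text
);

CREATE CUSTOM INDEX IF NOT EXISTS users_country_sai_idx ON demo.users (country)
    USING 'StorageAttachedIndex';

-- With the SAI index in place, this WHERE clause is served by the index
SELECT * FROM demo.users WHERE country = 'NZ';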
How Can We Use SAI When Data Modeling in Cassandra 5.0?
Because of the increased scalability and performance of SAIs, data modeling in Cassandra 5.0 will most definitely change.
For instance, you will be able to search collections more thoroughly and easily, and indexing becomes a realistic option when designing your Cassandra queries. This also allows new query types, which can improve your existing queries; and since Cassandra's design paradigm derives table design from queries, that in turn changes how you design your tables.
But what if you’re not on a greenfield project and want to use SAIs? No problem! SAI is backwards-compatible, and you can migrate your application one index at a time if you need.
How Do Storage-Attached Indexes Affect Data Modeling in Apache Cassandra 5.0?
Cassandra’s SAI was designed with data modeling in mind (source). It unlocks new query patterns that make data modeling easier in quite a few cases. In the Cassandra team’s words: “You can create a table that is most natural for you, write to just that table, and query it any way you want.” (source)
I think another great way to look at how SAIs affect data modeling is by looking at some queries that could be asked of SAI data. This is because Cassandra data modeling relies heavily on the queries that will be used to retrieve the data. I’ll take a look at 2 use cases: indexing as a means of searching a collection in a row and indexing to manage a one-to-many relationship.
Use Case: Querying on Values of Non-Primary-Key Columns
You may find that you often search a table for records with a particular value in a particular column. An example might be a search form for a large email inbox with lots of filters. You could find yourself looking at a record like:
- Subject
- Sender
- Receiver
- Body
- Time sent
Your table creation may look like:
CREATE KEYSPACE IF NOT EXISTS inbox
    WITH REPLICATION = { 'class' : 'SimpleStrategy', 'replication_factor' : 3 };

CREATE TABLE IF NOT EXISTS inbox.emails (
    id int,
    sender text,
    receivers text,
    subject text,
    body text,
    timeSent timestamp,
    PRIMARY KEY (id)
);
If you allow users to search for a particular subject or sender and the data set is large, not having an index makes this painful; in fact, Cassandra will only run a filter like the one below against a non-indexed, non-primary-key column if you append ALLOW FILTERING, which forces a scan across the table:

SELECT * FROM inbox.emails WHERE sender = 'sam.example@examplemail.com';
To fix this problem, we can create SAI indexes on our sender, receivers, subject, and body fields:
CREATE CUSTOM INDEX IF NOT EXISTS sender_sai_idx ON inbox.emails (sender)
    USING 'StorageAttachedIndex'
    WITH OPTIONS = {'case_sensitive': 'false', 'normalize': 'true', 'ascii': 'true'};

CREATE CUSTOM INDEX IF NOT EXISTS receivers_sai_idx ON inbox.emails (receivers)
    USING 'StorageAttachedIndex'
    WITH OPTIONS = {'case_sensitive': 'false', 'normalize': 'true', 'ascii': 'true'};

CREATE CUSTOM INDEX IF NOT EXISTS subject_sai_idx ON inbox.emails (subject)
    USING 'StorageAttachedIndex'
    WITH OPTIONS = {'case_sensitive': 'false', 'normalize': 'true', 'ascii': 'true'};

CREATE CUSTOM INDEX IF NOT EXISTS body_sai_idx ON inbox.emails (body)
    USING 'StorageAttachedIndex'
    WITH OPTIONS = {'case_sensitive': 'false', 'normalize': 'true', 'ascii': 'true'};
Once you've established the indexes, you can run the same query and it will automatically use the SAI index to find all emails with a sender of 'sam.example@examplemail.com'. Note that although the data model changed with the inclusion of the indexes, the SELECT query does not change, and the fields of the table stayed the same as well!
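The same goes for the other indexed columns. This is a minimal sketch with made-up values, assuming the indexes above are in place:

-- Served by subject_sai_idx; no ALLOW FILTERING required
SELECT * FROM inbox.emails WHERE subject = 'Quarterly report';

-- Predicates on several indexed columns can be combined in one query
SELECT * FROM inbox.emails
    WHERE sender = 'sam.example@examplemail.com' AND subject = 'Quarterly report';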
Use Case: Managing One-To-Many Relationships
Going back to the previous example, one email can have many recipients. Prior to secondary indexes, you would need to scan the recipient data of every row in the table in order to query on recipients. This could be solved in a few ways. One is to create a join table for recipients that contains an id, an email id, and a recipient; that becomes complicated once you add the constraint that each recipient should only appear once per email. With SAI, we now have an index-based solution: store the recipients for each row in a collection and create an index on that collection.
The script to create the table and indices changes a bit:
CREATE TABLE IF NOT EXISTS inbox.emails (
    id int,
    sender text,
    receivers set<text>,
    subject text,
    body text,
    timeSent timestamp,
    PRIMARY KEY (id)
);
The type of receivers changes from text to set<text>. A set is used because each recipient should only occur once per email. This takes the uniqueness logic you would otherwise have had to implement for the join-table solution and moves it into Cassandra.
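As a small illustration of that point (the values below are made up), writing the same address into the set twice still stores it only once:

INSERT INTO inbox.emails (id, sender, receivers, subject, body, timeSent)
    VALUES (1, 'sam.example@examplemail.com',
            {'ana@examplemail.com', 'bob@examplemail.com'},
            'Quarterly report', 'See attached.', toTimestamp(now()));

-- Adding an address that is already present leaves the set unchanged,
-- so per-email uniqueness is enforced by the data type itself
UPDATE inbox.emails SET receivers = receivers + {'ana@examplemail.com'} WHERE id = 1;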
The indexing code remains mostly the same, except for the creation of the index for receivers:
CREATE CUSTOM INDEX IF NOT EXISTS receivers_sai_idx ON inbox.emails (receivers) USING 'StorageAttachedIndex';
That’s it! One line of CQL and there’s now an index on receivers. We can query for emails with a particular receiver:
SELECT * FROM inbox.emails WHERE receivers CONTAINS 'sam.example@examplemail.com';
There are many one-to-many relationships that can be simplified in Cassandra with the use of secondary indexes and SAI.
What Are the Benefits of Data Modeling With Storage-Attached Indexes?
There are many benefits to using SAI when data modeling in Cassandra 5.0:
- Query performance: because of SAI's implementation, it has much faster query speeds than previous index implementations, and indexed data is generally faster to search than unindexed data. This gives you more flexibility to search within your data and write queries against non-primary-key columns and collections.
- Move over piecemeal: SAI's backwards compatibility, coupled with how little your table structure has to change to add SAI, means you can migrate your data models piece by piece, which makes moving over easier.
- Data storage overhead: SAI has much lower data overhead than previous secondary index implementations, meaning more flexibility in what you can store in your data models without impacting overall storage needs.
- More complex queries/features: SAI allows you to write much more thorough queries against indexed columns, and offers up a lot of new functionality (a sketch of some of these appears after this list), such as:
  - Vector Search
  - Numeric Range queries
  - AND queries within indexes
  - Support for map/set/list collections
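As a rough sketch of the Numeric Range and AND capabilities, plus a taste of vector search: the demo keyspace, products table, its columns, and the values below are hypothetical examples, not part of the article, and assume a Cassandra 5.0 cluster.

CREATE KEYSPACE IF NOT EXISTS demo
    WITH REPLICATION = { 'class' : 'SimpleStrategy', 'replication_factor' : 3 };

CREATE TABLE IF NOT EXISTS demo.products (
    id int PRIMARY KEY,
    name text,
    price double,
    rating int,
    embedding vector<float, 3>
);

CREATE CUSTOM INDEX IF NOT EXISTS products_price_sai_idx ON demo.products (price)
    USING 'StorageAttachedIndex';
CREATE CUSTOM INDEX IF NOT EXISTS products_rating_sai_idx ON demo.products (rating)
    USING 'StorageAttachedIndex';
CREATE CUSTOM INDEX IF NOT EXISTS products_embedding_sai_idx ON demo.products (embedding)
    USING 'StorageAttachedIndex';

-- Numeric range predicate combined with AND across two SAI indexes
SELECT * FROM demo.products WHERE price < 50.0 AND rating >= 4;

-- Approximate nearest neighbour search over the vector column (Cassandra 5.0 ANN syntax)
SELECT id, name FROM demo.products ORDER BY embedding ANN OF [0.1, 0.2, 0.3] LIMIT 5;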
What Are the Constraints of Storage-Attached Indexes?
While there are benefits to SAI, there are also a few constraints, including:
- Because SAI is attached to the SSTable mechanism, the performance of queries on indexed columns will be “highly dependent on the compaction strategy in use” (per the Cassandra 5.0 CEP-7)
- SAI is not designed for unlimited-size data sets, such as logs; indexing a dataset like that would cause performance issues. The reason is read latency at higher row counts spread across a cluster; it is also related to consistency level (CL), as the higher the CL, the more nodes you have to query on larger datasets (source).
- Query complexity: while you can query as many indexes as you like, doing so incurs a cost related to the number of index values processed, so design your queries to touch as few indexes as possible.
- You cannot index multiple columns in one index, as there is a 1-to-1 mapping of an SAI index to a column. You can, however, create separate indexes and query them together in a single statement.
- SAI in Cassandra 5.0 is a v1; some features, like the LIKE comparison for strings, the OR operator, and global sorting, are slated for v2.
- Disk usage: SAI uses an extra 20-35% disk space over unindexed data, although that is much less than previous index implementations consumed (source). Even so, you shouldn't make every column an index if you don't need to; being selective saves disk space and keeps query performance up.
Conclusion
SAI is a very robust solution for secondary indexes, and its addition to Cassandra 5.0 opens the door for several new data modeling strategies, from searching non-primary-key columns, to managing one-to-many relationships, to vector search. To learn more about SAI, read this post from the Instaclustr by NetApp blog, or check out the documentation for Cassandra 5.0.
If you’d like to test SAI without setting up and configuring Cassandra yourself, Instaclustr has a free trial and you can spin up Cassandra 5.0 clusters today through a public preview! Instaclustr also offers a bunch of educational content about Cassandra 5.0.
The post How Does Data Modeling Change in Apache Cassandra® 5.0 With Storage-Attached Indexes? appeared first on Instaclustr.
Cassandra Lucene Index: Update
**An important update regarding support of Cassandra Lucene Index for Apache Cassandra® 5.0 and the retirement of Apache Lucene Add-On on the Instaclustr Managed Platform.**
Instaclustr by NetApp has been maintaining its fork of the Cassandra Lucene Index plug-in since its announcement in 2018. After extensive evaluation, we have decided not to upgrade the Cassandra Lucene Index to support Apache Cassandra® 5.0. This decision aligns with the evolving needs of the Cassandra community and the capabilities offered by Storage-Attached Indexing (SAI) in Cassandra 5.0.
SAI introduces significant improvements in secondary indexing, while simplifying data modeling and creating new use cases in Cassandra, such as Vector Search. While SAI is not a direct replacement for the Cassandra Lucene Index, it offers a more efficient alternative for many indexing needs.
For applications requiring advanced indexing features, such as full-text search or geospatial queries, users can consider external integrations, such as OpenSearch®, that offer numerous full-text search and advanced analysis features.
We are committed to maintaining the Cassandra Lucene Index for currently supported and newer versions of Apache Cassandra 4 (including minor and patch-level versions) for users who rely on its advanced search capabilities. We will continue to release bug fixes and provide necessary security patches for the supported versions in the public repository.
Retiring Apache Lucene Add-On for Instaclustr for Apache Cassandra
Similarly, Instaclustr is commencing the retirement process of the Apache Lucene add-on on its Instaclustr Managed Platform. The offering will move to the Closed state on July 31, 2024. This means that the add-on will no longer be available for new customers.
However, it will continue to be fully supported for existing customers with no restrictions on SLAs, and new deployments will be permitted by exception. Existing customers should be aware that the add-on will not be supported for Cassandra 5.0. For more details about our lifecycle policies, please visit our website here.
Instaclustr will work with existing customers to ensure a smooth transition during this period. Support and documentation will remain in place for our customers running the Lucene add-on on their clusters.
For those transitioning to, or already using the Cassandra 5.0 beta version, we recommend exploring how Storage-Attached Indexing can help you with your indexing needs. You can try the SAI feature as part of the free trial on the Instaclustr Managed Platform.
We thank you for your understanding and support as we continue to adapt and respond to the community’s needs.
If you have any questions about this announcement, please contact us at support@instaclustr.com.
The post Cassandra Lucene Index: Update appeared first on Instaclustr.
Building a 100% ScyllaDB Shard-Aware Application Using Rust
I wrote a web transcript of the talk I gave with my colleagues Joseph and Yassir at [Scylla Su...
Learning Rust the hard way for a production Kafka+ScyllaDB pipeline
This is the web version of the talk I gave at [Scylla Summit 2022](https://www.scyllad...
On Scylla Manager Suspend & Resume feature
Disclaimer: This blog post is neither a rant nor intended to undermine the great work that...
Renaming and reshaping Scylla tables using scylla-migrator
We have recently faced a problem where some of the first Scylla tables we created on our main production cluster were not in line any more with the evolved s...
Python scylla-driver: how we unleashed the Scylla monster's performance
At Scylla summit 2019 I had the chance to meet Israel Fruchter and we dreamed of working on adding **shard...
Scylla Summit 2019
I've had the pleasure to attend again and present at the Scylla Summit in San Francisco and the honor to be awarded the...
Scylla: four ways to optimize your disk space consumption
We recently had to face free disk space outages on some of our scylla clusters and we learnt some very interesting things while outlining some improvements t...