DataStax and Google Cloud Collaborate to Evolve Open Source Apache Cassandra for Generative AI
Companies everywhere are looking for ways to integrate AI into their business as the popularity of generative AI technology continues to surge. While organizations need to think about how to implement AI into their business, they also need to think about what is needed to support their AI...

Introducing Vector Search: Empowering Cassandra and Astra DB Developers to Build Generative AI Applications
In the age of AI, Apache Cassandra® has emerged as a powerful and scalable distributed database solution. With its ability to handle massive amounts of data and provide high availability, Cassandra has become a go-to choice for many AI applications at companies including Uber, Netflix, and Priceline. However,...

Security Advisory: CVE-2023-30601 Apache Cassandra®
Following the publication of CVE-2023-30601, Instaclustr began investigating its potential impact on our Instaclustr Managed Apache Cassandra® offering. This vulnerability affects Apache Cassandra from 4.0.0 through 4.0.9 and from 4.1.0 through 4.1.1. The vulnerability can be exploited for privilege escalation when FQL/Audit logging is enabled, allowing users with JMX access to run arbitrary commands as the user running Apache Cassandra.
The security controls that exist in our managed service—including but not limited to firewalls, intrusion detection, and compartmentalization practices—lower the risk of this vulnerability. However, our course of action will be to release Cassandra version 4.0.10 as a newer, patched version of Managed Cassandra and subsequently upgrade customers on an impacted version (i.e., any managed Cassandra 4.0.1, 4.0.4, and 4.0.9). Apache Cassandra version 4.0.10 contains the fix and will soon be made available on the Instaclustr Managed Platform. If you have any questions, please get in contact with Instaclustr Support.
Mitigation for customers on Cassandra 4.0.1, 4.0.4, or 4.0.9:
- For customers using the managed service, the Instaclustr Support team will be in contact with you to schedule an upgrade of your managed Cassandra clusters to version 4.0.10.
- For support-only customers, you will need to upgrade your Cassandra clusters to 4.0.10; in the short term, it is advisable to close any remote JMX access to your clusters.
As a further mitigation step, we will immediately be marking managed Cassandra versions 4.0.1, 4.0.4, and 4.0.9 as Legacy Support prior to these versions being marked as End of Life on 31 July 2023 as per our lifecycle policy.
As always, customers who want to take a more proactive stance should limit access to their managed Cassandra cluster to only trusted clients and ensure those clients are secure. This is always good security practice in any case.
If you have any further queries regarding this vulnerability and how it relates to Instaclustr services, please contact Instaclustr Support.
References: https://nvd.nist.gov/vuln/detail/CVE-2023-30601
The Data Modeling Behind Social Media “Likes”
Did you ever wonder how Instagram, Twitter, Facebook, and other social media platforms track who liked your posts? This post explains how to do it with ScyllaDB NoSQL. And if you want to go hands-on with these and other NoSQL strategies, join us at ScyllaDB Labs, a free 2-hour interactive event.
Recently, I was invited to speak at an event called “CityJS.” But here’s the thing: I’m the PHP guy. I don’t do JS at all, but I accepted the challenge. To pull it off, I needed to find a good example to show how a highly scalable and low latency database works.
So, I asked one of my coworkers for examples. He told me to look for high numbers inside any platform, like counters or something like that. At that point, I realized that any type of metric can fit this example: likes, views, comments, follows, etc. could all be queried as counters. Here's what I learned about how to do proper data modeling for these using ScyllaDB.
First things first, right? After deciding what to cover in my talk, I needed to understand how to build this data model.
We'll need a posts table and also a post_likes table that relates who liked each post. So far, it seems enough to do our likes counter.
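In CQL, a minimal sketch of those two tables might look like this (the columns here are assumptions for illustration, not the exact schema from the original post):

```cql
-- Posts, keyed by the post id (illustrative columns)
CREATE TABLE social.posts (
    post_id    uuid,
    author_id  uuid,
    content    text,
    created_at timestamp,
    PRIMARY KEY (post_id)
);

-- Who liked which post: one row per (post, user) pair
CREATE TABLE social.post_likes (
    post_id  uuid,
    user_id  uuid,
    liked_at timestamp,
    PRIMARY KEY (post_id, user_id)
);
```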
My first bet for counting all the likes was a query like SELECT count(*) FROM social.post_likes. That can work, right?
Well, it worked, but it was not as performant as expected when I tested it with a couple thousand likes on a post. As the number of likes grows, the query becomes slower and slower…
“But ScyllaDB can handle thousands of rows easily… why isn’t it performant?” That’s probably what you’re wondering right now.
ScyllaDB – even as a cool database with cool features – will not solve the problem of bad data modeling. We need to think about how to make things faster.
Researching Data Types
Ok, let's think straight: the data needs to be stored, and we need the relation showing who liked our post, but we can't use it for counting. So what if I create a new integer column in the posts table and increment/decrement it every time?
Well, that seems like a good idea, but there’s a problem: we need to keep track of every change on the posts table and if we start to INSERT or UPDATE data there, we’ll probably create a bunch of nonsense records in our database.
Using ScyllaDB, every time that you need to update something, you actually create new data.
You will have to track everything that changes in your data. So, for each increment there will be one more row, unless you keep your clustering keys unchanged or ignore timestamps (a really bad idea).
After that, I went into the ScyllaDB docs and found out that there's a type called counter that fits our needs and is also ATOMIC!
Ok, it fits our needs, but not our data modeling. To use this type, we have to follow a few rules, but let's focus on the ones that are causing trouble for us right now:
- The only other columns in a table with a counter column can be columns of the primary key (which cannot be updated).
- No other kinds of columns can be included.
- You need to use UPDATE queries to handle tables that own a counter data type.
- You can only INCREMENT or DECREMENT values; setting a specific value is not permitted.
This limitation safeguards correct handling of counter and non-counter updates by not allowing them in the same operation.
So, we can use this counter but not on the posts table… Ok then, it seems that we’re finding a way to get it done.
Proper Data Modeling
Given that the counter type should not be "mixed" with other data types in a table, the only option left to us is to create a NEW TABLE and store this type of data there.
So, I made a new table called post_analytics that will hold only counter types. For the moment, let's work with only likes since we have a Many-to-Many relation (post_likes) created already.
These next queries are what you will probably run for the example we created.
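The exact snippets aren't preserved in this post, so here's a sketch of what they would look like (table and column names are illustrative):

```cql
-- A counter table: only primary key columns plus counter columns are allowed
CREATE TABLE social.post_analytics (
    post_id uuid,
    likes   counter,
    PRIMARY KEY (post_id)
);

-- Someone liked a post: record who liked it, then bump the counter
INSERT INTO social.post_likes (post_id, user_id, liked_at)
VALUES (?, ?, toTimestamp(now()));

UPDATE social.post_analytics SET likes = likes + 1 WHERE post_id = ?;

-- Someone removed their like
UPDATE social.post_analytics SET likes = likes - 1 WHERE post_id = ?;

-- Read the like count for a post: no count(*) needed
SELECT likes FROM social.post_analytics WHERE post_id = ?;
```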
Now you might have new unanswered questions in your mind, like: "So every time that I need a new counter related to some data, I'll need a new table?" Well, it depends on your use case. In the social media case, if you want to store who saw the post, you will probably need a post_viewers table with session_id and a bunch of other stuff.
These simple queries, which can be done without joins, can be way faster than count(*) queries.
Me on the CityJS stage talking about data modeling using Typescript
Go Hands-On with ScyllaDB and Data Modeling…at the ScyllaDB Labs Event
If you want to go deeper on this topic, with some real hands-on exercises, please join us for ScyllaDB Labs! It's a free 2-hour virtual event where I'll be presenting along with Guy Shtub, Head of Training, and Felipe Cardeneti Mendes, Solutions Architect.
This is an interactive workshop where we’ll go hands-on to build and interact with high-performance apps using ScyllaDB. It will be a great way to discover the NoSQL strategies used by top teams and apply them in a guided, supportive environment. As you go live with some sample applications, you’ll learn about the features and best practices that will enable your own applications to get the most out of ScyllaDB.
We’ll cover:
- Understanding if your use case is a good fit for ScyllaDB
- Achieving and maintaining low-latency NoSQL at scale
- Building high performance applications with ScyllaDB
- Data modeling, ingestion, and processing with the sample app
- Navigating key decisions & logistics when getting started with ScyllaDB
Hope to see you there!
Build your First ScyllaDB Application: New Rust, Python & PHP Tutorials
A couple of years ago, we published the first version of CarePet, an example IoT project designed to help you get started with ScyllaDB. Recently, we've expanded the project to make it useful for a larger group of developers by including new examples in Rust, Python, and PHP. We also implemented a Terraform example that allows you to spin up a ScyllaDB Cloud cluster with the initial CarePet schema.
This blog post goes over the components of the project and shows you how to get started. You can find the examples in multiple languages in the GitHub repository.
The CarePet app allows tracking of pets’ health indicators. Each tutorial builds the same app with these three components:
- A virtual collar that reads and pushes sensor data
- A web app for reading and analyzing the pets’ data
- A database migration tool
The tutorials and other supporting material are in our documentation as well as on GitHub.
Keep in mind that this example is meant to be one of your first steps at trying ScyllaDB. It is not production ready and should be used for reference purposes only.
TL;DR High-Level Overview + Deploying to ScyllaDB Cloud with Terraform
To get your hands dirty and start building right away:
- Go to iot.scylladb.com
- Select your favorite programming language
- Follow the instructions in the documentation
If you have problems or questions while working on the app, feel free to post a message in our community forum and we’ll be happy to solve your problem or answer your questions.
Using the ScyllaDB Cloud Terraform provider you can interact with ScyllaDB Cloud – create, edit, or delete your instances – using Terraform. In the CarePet repository, you can find a starter Terraform configuration that sets up a new ScyllaDB Cloud cluster and creates the initial schema in the new database. If you want to complete the tutorial using ScyllaDB Cloud and Terraform, go to this documentation page and get started!
About This Tutorial
Now, a little more detail…
In this sample app tutorial, you’ll create a simple IoT application from scratch that uses ScyllaDB as the database. The application is called “Care Pet”; it collects and analyzes data from sensors attached to virtual pet collars. This data can be used to monitor a pet’s health and activity.
The tutorial walks you through a specific instance of the steps to create a typical IoT application from scratch. This includes gathering requirements, creating the data model, cluster sizing and hardware needed to match the requirements, and (finally) building and running the application.
Use Case Requirements
Each pet collar includes sensors that report four different measurements: Temperature, Pulse, Location, and Respiration. The collar reads the sensors' data once a minute, aggregates it in a buffer, and sends the measurements directly to the app once an hour. The application should scale to 10 million pets and keeps each pet's data history for a month. That works out to 43,200 data samples per pet per month (60 × 24 × 30), so the database will need to scale to roughly 432 billion data points in a month (43,200 × 10,000,000 = 432,000,000,000). If the data variance is low, it will be possible to further compact the data and reduce the number of data points.
Note that in a real world design you’d have a fan-out architecture. The end-node IoT devices (the collars) would communicate wirelessly to an MQTT gateway or equivalent, which would, in turn, send updates to the application via, say, Apache Kafka. We’ll leave out those extra layers of complexity for now, but if you are interested, check out our post on ScyllaDB and Confluent Integration for IoT Deployments.
Performance Requirements
The application has two parts:
- Sensors: Writes to the database, throughput sensitive
- Backend dashboard: Reads from the database, latency-sensitive
For this example, we assume 99% writes (sensors) and 1% reads (backend dashboard).
The desired Service Level Objectives (SLOs) are:
- Writes: Throughput of 670K operations per second
- Reads: Latency of up to 10 milliseconds per key for the 99th percentile
The application requires high availability and fault tolerance. Even if a ScyllaDB node goes down or becomes unavailable, the cluster is expected to remain available and continue to provide service. You can learn more about ScyllaDB’s high availability in this lesson.
You can also calculate what these requirements would cost using ScyllaDB Cloud vs. other providers using our pricing calculator.
To satisfy the requirements of the example, ScyllaDB needs to be able to do 670K write operations per second and 10K read operations per second with 10ms P99 latency where the average item size is 1KB and the full data set is 1 TB. As you can see, ScyllaDB can provide not just great performance benefits but also huge cost savings in cases like this if you migrate from a different database like DynamoDB.
Design and Data Model
Now, let’s think about our queries, make the primary key and clustering key selection, and create the database schema. See more in the data model design document.
Here's how we will create our data using CQL.
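The exact statements live in the CarePet repository and documentation; a simplified sketch of the schema looks something like this:

```cql
CREATE KEYSPACE IF NOT EXISTS carepet
  WITH replication = {'class': 'NetworkTopologyStrategy', 'replication_factor': 3};

-- Owners and their pets
CREATE TABLE IF NOT EXISTS carepet.owner (
    owner_id uuid,
    name     text,
    address  text,
    PRIMARY KEY (owner_id)
);

CREATE TABLE IF NOT EXISTS carepet.pet (
    owner_id uuid,
    pet_id   uuid,
    name     text,
    breed    text,
    weight   float,
    PRIMARY KEY (owner_id, pet_id)
);

-- Sensors attached to a pet's collar
CREATE TABLE IF NOT EXISTS carepet.sensor (
    pet_id    uuid,
    sensor_id uuid,
    type      text,
    PRIMARY KEY (pet_id, sensor_id)
);

-- Time series of readings: one partition per sensor, rows ordered by time
CREATE TABLE IF NOT EXISTS carepet.measurement (
    sensor_id uuid,
    ts        timestamp,
    value     float,
    PRIMARY KEY (sensor_id, ts)
);
```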
We hope you find the project useful to learn more about ScyllaDB. If you have questions or issues that are specific to this sample application, feel free to open an issue in the project’s GitHub repository.
Resources
Here’s what you need to get started with ScyllaDB:
- Install ScyllaDB (self-hosted)
- Install ScyllaDB (in cloud)
- Getting started with ScyllaDB
- Community forum
- CarePet documentation site
For a deeper understanding of ScyllaDB and to get answers to your questions, go to:
- ScyllaDB Essentials course on ScyllaDB University.
- Data Modeling and Application Development course on ScyllaDB University.
- Join the ScyllaDB Users Slack channel
Vector Search Is Coming to Apache Cassandra
There's no artificial intelligence without data. And when your data is scattered all over the place, you'll spend more time managing the implementation process instead of focusing on what's most important: building the application. The world's most prominent applications already use Apache...

How to Maximize Database Concurrency
“ScyllaDB really loves concurrency…to a point.”
That was one of the key takeaways from the recent webinar, How Optimizely (Safely) Maximizes Database Concurrency, featuring Brian Taylor, Principal Software Engineer at Optimizely.
No database or system – digital or not – can tolerate unbounded concurrency. For the most part, high concurrency is essential for achieving impressive performance. But if the clients end up overwhelming the database with requests at a pace that it can’t handle, throughput suffers, then latency rises as a side effect.
As Brian explained, knowing exactly how hard you can push your database is key for facing throughput challenges, like Black Friday surges, without a hitch. Identifying that precise tipping point begins with the Universal Scaling Law (USL), which states that as the number of users (N) increases, the system throughput (X) will:
- Enjoy a period of near linear scaling
- Eventually saturate some resource such that increasing the number of users doesn’t increase the throughput
- Possibly encounter a coordination cost that drives down the throughput as the number of users increases further
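For reference, Gunther's standard formulation of the USL (not quoted in the talk itself) is:

X(N) = λN / (1 + σ(N − 1) + κN(N − 1))

where λ is the throughput of a single user on an otherwise idle system, σ is the contention (serialization) cost that produces the saturation region, and κ is the coherency (crosstalk) cost that produces the retrograde region.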
Here’s how that USL applies to a database…
In the linear region, throughput is directly proportional to concurrency. The size of the cluster will determine how large the linear region is. You should engineer your system to take full advantage of the linear region.
In the saturation region, throughput is approximately constant, regardless of concurrency. At this point, assuming a well-tuned cluster and workload, you are writing to the system as fast as the durable storage can take it. Concurrency is no longer a mechanism to increase throughput and is now becoming a risk to stability. You should engineer your system to stay out of the saturation region.
In the retrograde region, increasing concurrency now decreases throughput. A system that enters this region is likely to get stuck here until demand declines. The harder you push, the less throughput you get and the more demand builds which makes you want to push harder. However, “pushing harder” consumes more client resources. This is a vicious cycle and you are now on “the road to hell.”
That’s just a very high level look at the first part of the talk. Watch the complete video to hear more about:
- How Brian determines the bounds of these regions for his specific database and workload
- How the database operates in each region and what it means for the team
- Ways you can find more concurrency to throw at ScyllaDB when you’re in that magical linear region
- His 3 top tips for getting the most out of ScyllaDB’s “awesome concurrency”
This talk stirred up some great discussion, so we wanted to share some of the highlights here.
What about adaptive database concurrency limits?
Q: Configuring fixed concurrency limits is not a new way to ensure one has a stable system. A situation which often happens is that the values you previously benchmarked through your performance testing could quickly become stale – due to data or coding changes and natural application autoscaling to meet newer real-world demands. For example, Netflix has a nice write-up around the topic where they implemented Adaptive Concurrency Limits in order to address this situation. Is this something that you already implement or have considered implementing at Optimizely, and do you have any tips for the audience?
A: Great question. So static concurrency limits are a hazard – you should know that in advance. A small subset of problems can work fine under static concurrency limits, but most can’t. Because most of our systems auto scale, most of our database clusters are shared by multiple things.
You need to recognize that the USL fit you made for your cluster is, unfortunately, also a snapshot in time. You got a snapshot of the USL on the day you measured it, or the days you measured it, but it’s going to change as you change your software, as hardware potentially degrades, as you add users to the cluster in the background that are soaking up some of that performance, as backups that you didn’t measure during your peak performance testing kick in and do their thing. The real safe concurrency limit is always dynamic.
The way I’ve tackled that is actually based on a genius suggestion from a former manager who loved digging into foundational tech. He suggested applying the TCP congestion control algorithm. That is the algorithm that operates way down on the network stack, deciding how big a chunk of data to send at once. It chooses the size and sends a chunk. Then if that chunk size fails, it chooses a smaller size and sends a chunk. If that succeeds, it scales up the chunk size. So you have this dynamic tuning that finds the chunk size that yields the maximum throughput for whatever network “weather” you’re sitting on. Network conditions are dynamic just like database conditions. They’re solving the same problem.
I applied the TCP congestion control algorithm to choosing concurrency. So if things are going fine, and the weather at Scylla land is good, then I scale up concurrency, do a little more. If the weather is getting kind of choppy, things are slowing down, I scale it down. So that’s the way that I dynamically manage concurrency. It works very well in this autoscaling, multiple clients sort of real world scenario.
What’s the impact of payload size on concurrency?
Q: What is the impact of payload record size? Correct me if I’m wrong, but I think that the larger your payload gets, the less concurrency you have to apply to your system, right?
A: Exactly. Remember, the fundamental limiter is the SSD. So bigger payloads mean more throughput hitting the SSD per request. That's a first order effect. My recommendation would be to base your testing on your average payload size if your workload is fairly smooth. If things are significantly variable over time (e.g. that average swings around dynamically throughout the day) then I would tune to the maximum payload size. The most pessimistic (aka safe) tuning would be to find the maximum payload size that your system allows, then fit the USL under those test conditions.
How do you address workloads that are spiky by nature?
Q: No system supports unlimited concurrency. Otherwise, no matter how fast your database is, you’ll eventually end up reaching the retrograde region. Do you have any recommendations for workloads that are unbound and/or spiky by nature?
A: Yeah, in my career, all workloads are spiky so you always get to solve this problem. You need to think about what happens when the wheels fall off. When work bunches up, what are you going to do? You have two choices. You can queue and wait or you can load shed. You need to make that choice consistently for your system. Your chosen scale chooses a throughput you can handle within SLA. Your SLA tells you how often you’re allowed to violate your latency target so make sure the reality you’ve observed violates your design throughput well less than the limit your SLA requires. And then when reality drives you outside that window, you either queue and wait and blow up your latency or you load shed and effectively force the user to implement the queue themselves. Load shedding is not magic. It’s just choosing to force the user to implement the queue.
Note: ScyllaDB supports configuring concurrency limiters per shard or throughput limits per partition. [Learn more – Retaining Goodput with Query Rate Limiting]
How can limiting concurrency help you achieve your desired throughput?
Q: One of the situations that we see fairly often happening is when your concurrency goes too high, and then you reach the retrograde region and just see your throughput going down. Latency goes up, your throughput goes down, you start seeing timeouts, errors, etc, and so on. And then the general feedback that we’re going to provide is that you have to fix your concurrency. People don’t understand how reducing their concurrency can actually help them to achieve their desired throughput. Could you elaborate a bit on that?
A: Well, first I’ll tell a story. The basis for this talk was a system that encountered high partition cardinality. This led to lots of concurrent requests to ScyllaDB. From my perspective, it looked like ScyllaDB’s performance just tanked, even though the problem was really on our side. It was a frustrating day. Another law of nature is that your bad day happens at the worst possible time. In my case, huge cardinality happened on Black Friday (because obviously there’s a lot going on that day…). The thing to know is that when you’re thinking about throughput, your primary knob is concurrency. So if you have a throughput problem, go back to your primary knob. You’re observing your concurrency right? (I wasn’t, but I am now.)
How to run open-loop testing with Rust and ScyllaDB?
Ok, this wasn’t part of the Q & A – but Brian developed a great tool, and we wanted to highlight it. We’ll let Brian explain:
With classic closed loop load testing, simulated users have up to one request in flight at any time. Users must receive the response to their request before issuing the next request. This is great for fitting the USL because we’re effectively choosing X when we choose users. But, in my opinion, it’s not directly useful for modern capacity testing.
Modern open loop load testing does not model users or think times. Instead, it models the load as a constant throughput load source. This is a match for capacity planning for internet connected systems where we typically know requests per second but don’t really know or care how many users are behind that load. Concurrency is theoretically unbounded in an open-loop test: This is the property that lets us open our eyes to how the saturation and retrograde regions will look when our system encounters them in real life.
As I was developing this talk, unfortunately, I couldn't find an open-loop load tester for ScyllaDB – so I wrote one: CRrate ("constant rate") simulates a constant-rate load and delivers it to ScyllaDB with variable concurrency. The name is also a bad Rust pun.
It’s an open-loop ScyllaDB load testing tool developed in the creation of this talk. It is intended to complement the ScyllaDB-provided closed-loop testing tool, scylla-bench. It overcomes coordinated omission using the technique advocated in my P99 CONF talk.
Tencent Games’ Real-Time Event-Driven Analytics System Built with ScyllaDB + Pulsar
As a part of Tencent Interactive Entertainment Group Global (IEG Global), Proxima Beta is committed to supporting our teams and studios to bring unique, exhilarating games to millions of players around the world. You might be familiar with some of our current games, such as PUBG Mobile, Arena of Valor, and Tower of Fantasy.
Our team at Level Infinite (the brand for global publishing) is responsible for managing a wide range of risks to our business – for example, cheating activities and harmful content. From a technical perspective, this required us to build an efficient real-time analytics system to consistently monitor all kinds of activities in our business domain.
In this blog, we share our experience of building this real-time event-driven analytics system. First, we’ll explore why we built our service architecture based on Command and Query Responsibility Segregation (CQRS) and event sourcing patterns with Apache Pulsar and ScyllaDB. Next, we’ll look at how we use ScyllaDB to solve the problem of dispatching events to numerous gameplay sessions. Finally, we’ll cover how we use ScyllaDB keyspaces and data replication to simplify our global data management.
A Peek at the Use Case: Addressing Risks in Tencent Games
Let’s start with a real-world example of what we’re working with and the challenges we face.
This is a screenshot from Tower of Fantasy, a 3D-action role-playing game. Players can use this dialog to file a report against another player for various reasons. If you were to use a typical CRUD system for it, how would you keep those records for follow-ups? And what are the potential problems?
The first challenge would be determining which team is going to own the database to store this form. There are different reasons to make a report (including an option called “Others”), so a case might be handled by different functional teams. However, there is not a single functional team in our organization that can fully own the form.
That’s why it is a natural choice for us to capture this case as an event, like “report a case.” All the information is captured in this event as is. All functional teams only need to subscribe to this event and do their own filtering. If they think the case falls into their domain, they can just capture it and trigger further actions.
CQRS and Event Sourcing
The service architecture behind this example is based on the CQRS and event sourcing patterns. If these terms are new to you, don’t worry! By the end of this overview, you should have a solid understanding of these concepts. And if you want more detail at that point, take a look at our blog dedicated to this topic.
The first concept to understand here is event sourcing. The core idea behind event sourcing is that every change to a system’s state is captured in an event object and these event objects are stored in the order in which they were applied to the system state. In other words, instead of just storing the current state, we use an append-only store to record the entire series of actions taken on that state. This concept is simple but powerful as the events that represent every action are recorded so that any possible model describing the system can be built from the events.
The next concept is CQRS, which stands for Command Query Responsibility Segregation. CQRS was coined by Greg Young over a decade ago and originated from the Command and Query Separation Principle. The fundamental idea is to create separate data models for reads and writes, rather than using the same model for both purposes. By following the CQRS pattern, every API should either be a command that performs an action, or a query that returns data to the caller – but not both. This naturally divides the system into two parts: the write side and the read side.
This separation offers several benefits. For example, we can scale write and read capacity independently for optimizing cost efficiency. From a teamwork perspective, different teams can create different views of the same data with fewer conflicts.
The high-level workflow of the write side can be summarized as follows: events that occur in numerous gameplay sessions are fed into a limited number of event processors. The implementation is also straightforward, typically involving a message bus such as Pulsar, Kafka, or a simpler queue system that acts as an event store. Events from clients are persisted in the event store by topic and event processors consume events by subscribing to topics. If you’re interested in why we chose Apache Pulsar over other systems, you can find more information in the blog referenced earlier.
Although queue-like systems are usually efficient at handling traffic that flows in one direction (e.g. fan-in), they may not be as effective at handling traffic that flows in the opposite direction (e.g. fan-out). In our scenario, the number of gameplay sessions will be large, and a typical queue system doesn’t fit well since we can’t afford to create a dedicated queue for every game-play session. We need to find a practical way to distribute findings and metrics to individual gameplay sessions through Query APIs. This is why we use ScyllaDB to build another queue-like event store, which is optimized for event fan-out. We will discuss this further in the next section.
Before we move on, here’s a summary of our service architecture.
Starting from the write side, game servers keep sending events to our system through Command endpoints and each event represents a certain kind of activity that occurred in a gameplay session. Event processors produce findings or metrics against the event streams of each gameplay session and act as a bridge between two sides. On the read side, we have game servers or other clients that keep polling metrics and findings through Query endpoints and take further actions if abnormal activities have been observed.
Distributed Queue-Like Event Store for Time Series Events
Now let’s look at how we use ScyllaDB to solve the problem of dispatching events to numerous gameplay sessions. By the way, if you Google “Cassandra” and “queue”, you may come across an article from over a decade ago stating that using Cassandra as a queue is an anti-pattern. While this might have been true at that time, I would argue that it is only partially true today. We made it work with ScyllaDB (which is Cassandra-compatible).
To support the dispatch of events to each gameplay session, we use the session id as the partition key so that each gameplay session has its own partition and events belonging to a particular gameplay session can be located by the session id efficiently.
Each event also has a unique event id, which is a time UUID, as the clustering key. Because records within the same partition are sorted by the clustering key, the event id can be used as the position id in a queue. Finally, ScyllaDB clients can efficiently retrieve newly arrived events by tracking the event id of the most recent event that has been received.
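A sketch of what such an event table could look like in CQL (the names here are illustrative, not Proxima Beta's actual schema):

```cql
-- One partition per gameplay session; events ordered by a time-based UUID
CREATE TABLE analytics.session_events (
    session_id uuid,
    event_id   timeuuid,
    payload    blob,
    PRIMARY KEY (session_id, event_id)
) WITH CLUSTERING ORDER BY (event_id ASC);

-- A client polls for events that arrived after the last event id it has seen
SELECT event_id, payload
  FROM analytics.session_events
 WHERE session_id = ?
   AND event_id > ?;
```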
There is one caveat to keep in mind when using this approach: the consistency problem. Retrieving new events by tracking the most recent event id relies on the assumption that no event with a smaller id will be committed in the future. However, this assumption may not always hold true. For example, if two nodes generate two event identifiers at the same time, an event with a smaller id might be inserted later than an event with a larger id.
This problem, which I refer to as a “phantom read,” is similar to the phenomenon in the SQL world where repeating the same query can yield different results due to uncommitted changes made by another transaction. However, the root cause of the problem in our case is different. It occurs when events are committed to ScyllaDB out of the order indicated by the event id.
There are several ways to address this issue. One solution is to maintain a cluster-wide status, which I call a “pseudo now,” based on the smallest value of the moving timestamps among all event processors. Each event processor should also ensure that all future events have an event id greater than its current timestamp.
Another important consideration is enabling TimeWindowCompactionStrategy, which eliminates the negative performance impact caused by tombstones. Accumulation of tombstones was a major issue that prevented the use of Cassandra as a queue before TimeWindowCompactionStrategy became available.
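For example, switching the event table above to TWCS could look like this (the window settings and TTL are illustrative, not the values used in production):

```cql
ALTER TABLE analytics.session_events
  WITH compaction = {
    'class': 'TimeWindowCompactionStrategy',
    'compaction_window_unit': 'HOURS',
    'compaction_window_size': 1
  }
  AND default_time_to_live = 86400;  -- expire events after one day (illustrative)
```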
Now let’s shift to discussing other benefits beyond using ScyllaDB as a dispatching queue.
Simplifying Complex Global Data Distribution Challenges
Since we are building a multi-tenancy system to serve customers around the world, it is essential to ensure that customer configurations are consistent across clusters in different regions. Trust me, keeping a distributed system consistent is not a trivial task if you plan to do it all by yourself.
We solved this problem by simply enabling data replication on a keyspace across all data centers. This means any change made in one data center will eventually propagate to the others. Thanks to ScyllaDB (as well as DynamoDB and Cassandra) for the heavy lifting that makes this challenging problem seem trivial.
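A sketch of what that replication setting could look like (the data center names are made up):

```cql
-- Tenant configuration replicated to every region's data center
CREATE KEYSPACE tenant_config
  WITH replication = {
    'class': 'NetworkTopologyStrategy',
    'us-east': 3,
    'eu-west': 3,
    'ap-southeast': 3
  };
```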
You might be thinking that using any typical RDBMS could achieve the same result since most databases also support data replication. This is true if there is only one instance of the control panel running in a given region. In a typical primary/replica architecture, only the primary node supports read/write while replica nodes are read-only. However, when you need to run multiple instances of the control panel across different regions – for example, every tenant has a control panel running in its home region, or even every region has a control panel running for local teams – it becomes much more difficult to implement this using a typical primary/replica architecture.
If you have used AWS DynamoDB, you may be familiar with a feature called Global Table, which allows applications to read and write locally and access the data globally. Enabling replication on keyspaces with ScyllaDB provides a similar feature, but without vendor lock-in. You can easily extend global tables across a multi-cloud environment.
Keyspaces as Data Containers
Next, let’s look at how we use keyspaces as data containers to improve the transparency of global data distribution.
Let’s take a look at the diagram below. It shows a solution to a typical data distribution problem imposed by data protection laws. For example, suppose region A allows certain types of data to be processed outside of its borders as long as an original copy is kept in its region. As a product owner, how can you ensure that all your applications comply with this regulation?
One potential solution is to perform end-to-end (E2E) tests to ensure that applications send the right data to the right region. This approach requires application developers to take full responsibility for implementing data distribution correctly. However, as the number of applications grows, it becomes impractical for each application to handle this problem individually, and E2E tests also become increasingly expensive in terms of both time and money.
Let’s think twice about this problem. By enabling data replication on keyspaces, we can divide the responsibility for correctly distributing data into two tasks: 1) identifying data types and declaring their destinations, and 2) copying or moving data to the expected locations.
By separating these two duties, we can abstract away complex configurations and regulations from applications. This is because the process of transferring data to another region is often the most complicated part to deal with, such as passing through network boundaries, correctly encrypting traffic, and handling interruptions.
After separating these two duties, applications are only required to correctly perform the first step, which is much easier to verify through testing at earlier stages of the development cycle. Additionally, the correctness of configurations for data distribution becomes much easier to verify and audit. You can simply check the settings of keyspaces to see where data is going.
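As a sketch (with made-up keyspace and region names), the keyspace definition itself declares where each class of data lives, and that declaration is easy to inspect:

```cql
-- Data whose original copy must stay in region A
CREATE KEYSPACE region_a_records
  WITH replication = {'class': 'NetworkTopologyStrategy', 'region-a': 3};

-- Data that region A allows to be processed elsewhere as well
CREATE KEYSPACE shared_analytics
  WITH replication = {'class': 'NetworkTopologyStrategy', 'region-a': 3, 'region-b': 3};

-- Auditing where data goes is just a matter of reading the keyspace settings
DESCRIBE KEYSPACE shared_analytics;
```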
Tips for Others Taking a Similar Path
To conclude, we’ll leave you with important lessons that we learned, and that we recommend you apply if you end up taking a path similar to ours:
- When using ScyllaDB to handle time series data, such as using it as an event-dispatching queue, remember to use the Time-Window Compaction Strategy.
- Consider using keyspaces as data containers to separate the responsibility of data distribution. This can make complex data distribution problems much easier to manage.
Watch Tech Talks On-Demand
This article is based on a tech talk presented at ScyllaDB Summit 2023. You can watch this talk – as well as talks by engineers from Discord, Epic Games, Strava, ShareChat and more – on-demand.