FullContact: Improving the Graph by Transitioning to Scylla

In 2020, FullContact launched our Resolve product, backed by Cassandra. Initially, we were eager to move from our historical database HBase to Cassandra with its promises for scalability, high availability, and low latency on commodity hardware. However, we could never run our internal workloads as fast as we wanted — Cassandra didn’t seem to live up to expectations. Early on, we had a testing goal of hitting 1000 queries per second, and then soon after 10x-ing that to 10,000 queries per second through the API. We couldn’t get to that second goal due to Cassandra, even after lots of tuning.

Late last year, a small group of engineers at FullContact tried out ScyllaDB to replace Cassandra after hearing about it from one of our DevOps engineers. If you haven’t heard about Scylla before, I encourage you to check it out — it’s a drop in Cassandra replacement, written in C++, promising big performance improvements.

In this blog, we explore our experience starting from a hackathon and ultimately our transition to Scylla from Cassandra. The primary benchmark we use for performance testing is how many queries per second we can run through the API. While it’s helpful to measure a database by reads and writes per second, our database is only as good as our API can send its way, and vice versa.

The Problem with Cassandra

Our Resolve Cassandra cluster is relatively small: 3 c5.2xlarge EC2 instances, each with 2 TB of gp2 EBS storage. This cluster is relatively inexpensive and, apart from being limited primarily by EBS volume throughput (250MB/s), it gave us sufficient scale to launch Resolve. Using EBS as storage also lets us gain storage space by increasing the size of the EBS volumes without needing to redeploy or rebuild the database. Three nodes may be sufficient for now, but if we’re running low on disk, we can add a terabyte or two to each node while running and keep the same cluster.

After several production customer runs and the start of some large internal batch loads, our Cassandra Resolve tables grew from hundreds of thousands of rows to millions, and soon to over a hundred million. While we load-tested Cassandra before release and could sustain 1,000 API calls per second from one Kubernetes pod, that testing was primarily against an empty database, or at least one with a relatively small data set (a few million identifiers at most).

With customers calling our production Resolve API and internal loads both running at 1,000 requests per second, we saw API latencies starting to creep up: 100ms, 200ms, and 300ms under heavy load. For us, this is too slow. And under exceptionally heavy load for this cluster, we saw, more and more often, the dreaded:

DriverTimeoutException: Query timed out after PT2S

coming from the Cassandra Driver.
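For reference, that PT2S is an ISO-8601 duration: the client-side two-second request timeout. If you are on the DataStax Java driver 4.x (which formats durations that way), the timeout can be raised programmatically as a stopgap; a minimal sketch, which only treats the symptom rather than the underlying throughput problem (the 5-second value is just an example):

import java.time.Duration;

import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.config.DefaultDriverOption;
import com.datastax.oss.driver.api.core.config.DriverConfigLoader;

public class SessionFactory {
    public static CqlSession build() {
        // Raise the client-side request timeout from the 2s default (the "PT2S"
        // in the exception) to 5s.
        DriverConfigLoader loader = DriverConfigLoader.programmaticBuilder()
                .withDuration(DefaultDriverOption.REQUEST_TIMEOUT, Duration.ofSeconds(5))
                .build();
        return CqlSession.builder()
                .withConfigLoader(loader)
                .build();
    }
}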

Cassandra Tuning

One of the first areas where we found room to gain performance had to do with compaction strategies — the way Cassandra manages the size and number of backing SSTables. We were using the Size Tiered Compaction Strategy: the default setting, designed for “general use” and insert-heavy workloads. This strategy caused us to end up with single SSTables larger than several gigabytes, which means that on reads, for any SSTables that get through the bloom filter, Cassandra iterates through many large SSTables, reading them sequentially. Doing this at thousands of queries per second meant we could quite easily max out the EBS disk throughput, given sufficient traffic. 2 TB EBS volumes attached to an i3.2xlarge max out at a speed of ~250MB/s.

From the Cassandra nodes themselves, it was difficult to see any bottlenecks or why we saw timeouts. However, it was soon evident in the EC2 console that EBS write throughput was pegged at 250MB/s, while memory and CPU were well below their maximums. Additionally, since we were doing large reads and writes concurrently, huge files were being read, and the background compaction added further stress on the drives by continuously bucketing SSTables into different-sized tables.

We ended up moving to Leveled Compaction Strategy:

alter table mytable WITH compaction = { 'class' : 'LeveledCompactionStrategy' };

Then, after an hour or two of Cassandra shuffling data around into smaller SSTables, we were again able to handle a reasonably heavy workload.

Weeks after updating the table’s compaction strategy, Cassandra (now with so many small SSTables) struggled to keep up with heavy read operations. We realized the database likely needed more heap to run the bloom filtering in a reasonable amount of time, so we doubled the heap in

/opt/cassandra/env.sh:

MAX_HEAP_SIZE="8G"

HEAP_NEWSIZE="3G"

After a rolling restart of the Cassandra service, one instance at a time, the cluster was back to performing much as it did when it was smaller, handling up to a few thousand API calls per second.

Finally, we looked at tuning the size of the SSTables to make them even smaller than the 160MB default. In the end, we did seem to get a marginal performance boost after updating the size to something around 8MB. However, we still couldn’t get more than about 3,000 queries per second through the Cassandra database before we’d reach timeouts again. It continued to feel like we were approaching the limits of what Cassandra could do.

alter table mytable WITH compaction = { 'class' : 'LeveledCompactionStrategy', 'sstable_size_in_mb' : 80 };

Enter Scylla

After several months of seeing our Cassandra cluster needing frequent tuning (or more tuning than we’d like), we happened to hear about Scylla. From their website: “We reimplemented Apache Cassandra from scratch using C++ instead of Java to increase raw performance, better utilize modern multi-core servers and minimize the overhead to DevOps.”

This overview comparing ScyllaDB and Cassandra was enough to give it a shot, especially since it “provides the same CQL interface and queries, the same drivers, even the same on-disk SSTable format, but with a modern architecture.”

With Scylla billing itself as a drop-in replacement for Cassandra promising MUCH better performance on the same hardware, it sounded almost too good to be true!

As we explored in our previous Resolve blog, our database is primarily populated by loading SSTables built offline using Spark on EMR. Our initial attempt to load a Scylla database with the same files as our current production database left us a bit disappointed: loading the files into a fresh Scylla cluster required us to rebuild them with an older version of the Cassandra driver, forcing it to generate files in an older format.

After talking to the folks at Scylla, we learned that Scylla didn’t yet support Cassandra’s latest “MD” SSTable file format. However, you can rename the .md files to .mc, and this will supposedly allow these files to be read by Scylla. [Editor’s note: Scylla fully supports the “MD” SSTable format as of Scylla Open Source 4.3.]

Once we were able to get the SSTables loaded, we ran into another issue: getting the database started in a reasonable amount of time. On Cassandra, when you copy files to each node in the cluster and start it, the database starts up within a few seconds. In Scylla, after copying files and restarting the Scylla service, it would take hours for larger tables to be re-compacted, shuffled, and ready to go, even though our replication factor was 3 on a 3-node cluster; since we had copied all the files to every node, our thinking was the data shouldn’t need to be transformed at all.

Once the data was loaded, we could finally load test our APIs properly. And guess what? We hit 10,000 queries per second relatively easily!

Grafana dashboard showing our previous maximum from 13:30 – 17:30 running around 3,000 queries/second. We were able to hit 5,000, 7,500, and over 10,000 queries per second with a loaded Scylla cluster.

We’ve been very pleased with Scylla’s out-of-the-box performance: we achieved double the 10,000 queries per second goal we set earlier last year, peaking at over 20,000 requests per second, all while keeping our 98th percentile under 50ms, with no JVM or other tuning required! (The brief blips near 17:52, 17:55, and 17:56 are due to our load generator changing Kafka partition assignments as more load consumers are added.)

In addition to the custom dashboards we have from the API point of view, Scylla conveniently ships with Prometheus metrics support and lets us install its Grafana dashboards to monitor our clusters with minimal effort.

OS metrics dashboard from Scylla:

Scylla Advanced Dashboard:

Offline SSTables to Cassandra Streaming

After doing some quick math factoring in Scylla’s need to recompact and reshuffle data loaded from offline SSTables, we realized it would be faster to rework the database build and replace it with streaming inserts straight into the cluster using the spark-cassandra-connector.

In reality, rebuilding a database offline isn’t a primary use case that’s run regularly. Still, it is a useful tool for large schema changes and large internal data changes. This, combined with the fact that our SSTable build ultimately funneled all SSTable writes through a single executor, led us to abandon the offline SSTable build process.

We’ve updated our Airflow DAG to stream directly to a fresh Scylla cluster:

Version 1 of our Database Rebuild process, building SSTables offline.

Updated version 2 looks very similar, but it streams data directly to Scylla:

Conveniently, the code is pretty straightforward as well:

1. We create a Spark config and session:

val sparkConf = super.createSparkConfig()
       .set("spark.cassandra.connection.host", cassandraHosts)
       // any other settings we need/want to set: consistency level, throughput limits, etc.

val session = SparkSession.builder().config(sparkConf).getOrCreate()

val records = session.read
       .parquet(inputPath)
       .as[ResolveRecord]
       .cache()

2. For each table we need to populate, we map to a case class matching the table schema and save to the correct table name and keyspace:

records
       // map to a row
       .map(row => TableCaseClass(id1, id2, ...))
       .toDF()
       .write
       .format("org.apache.spark.sql.cassandra")
       .options(Map("keyspace" -> keyspace, "table" -> "mappingtable"))
       .mode(SaveMode.Append)
       // stream to scylla
       .save()

With some trial and error, we found the sweet spot for the number and size of EMR EC2 nodes: for our data sets, an 8-node cluster of c5.large instances kept the load running as fast as the EBS drives could handle without running into more timeout issues.

Cassandra and Scylla Performance Comparison

Our Cassandra cluster under heavy load

Our Scylla cluster on the same hardware, with the same type of traffic

The top graph shows the queries per second (white line; right Y-axis) we were able to push through our Cassandra cluster before we encountered timeout issues, with API speed measured at the mean, 95th, and 98th percentiles (blue, green, and red, respectively; left Y-axis). You can see we could push through about 7 times the number of queries per second while dropping the 98th percentile latency from around 2 seconds to 15 milliseconds!

Next Steps

As our data continues to grow, we are continuing to look for efficiencies around data loading. A few areas we are currently evaluating:

  • Using Scylla Migrator to load Parquet straight to Scylla, using Scylla’s partition aware driver
  • Exploring i3 class EC2 nodes
  • Network efficiencies with batching rows and compression on the Spark side
  • Exploring more, smaller instances for cluster setup

This article originally appeared on the FullContact website here and is republished with their permission. We encourage others who would like to share their Scylla success stories to contact us. Or, if you have questions, feel free to join our user community on Slack.

The post FullContact: Improving the Graph by Transitioning to Scylla appeared first on ScyllaDB.

Load Balancing in Scylla Alternator

In a previous post, Comparing CQL and the DynamoDB API, I introduced Scylla — an open-source distributed database which supports two popular NoSQL APIs: Cassandra’s query language (CQL) and Amazon’s DynamoDB API. The goal of that post was to outline some of the interesting differences between the two APIs.

In this post I want to look more closely at one of these differences: the fact that DynamoDB-API applications are not aware of the layout of the Scylla cluster and its individual nodes. This means that Scylla’s DynamoDB API implementation — Alternator — needs a load balancing solution which will redirect the application’s requests to many nodes. We will look at a few options for how to do this, recommend a client-side load balancing solution, and explain how we implemented it and how to use it.

Why Alternator Needs a Load Balancer

In the CQL protocol, clients know which Scylla nodes exist. Moreover, clients are usually token aware, meaning that a client is aware of which partition ranges (“token” ranges) are held by which of the Scylla nodes. Clients may even be aware of which specific CPU inside a node is responsible for each partition (we looked at this shard awareness in a recent post).

The CQL clients’ awareness of the cluster allows a client to send a request directly to one of the nodes holding the required data. Moreover, the clients are responsible for balancing the load between all the Scylla nodes (though the cluster also helps by further balancing the load internally – we explained how this works in an earlier post on heat-weighted load balancing).

The situation is very different in the DynamoDB API, where clients are not aware of which nodes exist. Instead, a client is configured with a single “endpoint address”, and all requests are sent to it. Amazon DynamoDB provides one endpoint per geographical region, as listed here. For example in the us-east-1 region, the endpoint is:

https://dynamodb.us-east-1.amazonaws.com

If, naïvely, we configure an application with the IP address of a single Scylla node as its single DynamoDB API endpoint address, the application will work correctly. After all, any Alternator node can answer any request by forwarding it to other nodes as necessary. However, this single node will be more loaded than the other nodes. It will also become a single point of failure: if it goes down, clients cannot use the cluster anymore.

So we’re looking for a better, less naïve, solution which will provide:

  • High availability — any single Alternator node can go down without loss of service.
  • Load balancing — clients sending DynamoDB API requests will cause nearly equal load on all Scylla nodes.

In this document we briefly survey a few possible load-balancing solutions for Alternator, starting with server-side options, before recommending a client-side load balancing solution and explaining how it works and how to use it.

Server-side Load Balancing

The easiest load-balancing solution, as far as the client application is concerned, is a server-side load balancer: The application code remains completely unchanged, configured with a single endpoint URL, and some server-side magic ensures that the different requests to this single URL somehow get routed to the different nodes of the Scylla cluster. There are different ways to achieve this request-routing magic, with different advantages and different costs:

The most straightforward approach is to use a load-balancer device — or in a virtualized network, a load-balancer service. Such a device or service accepts connections or HTTP requests at a single IP address (a TCP or HTTP load balancer, respectively), and distributes these connections or requests to the many individual nodes behind the load balancer. Setting up load-balancing hardware requires solving additional challenges, such as failover load-balancers to ensure high availability, and scaling the load balancer when the traffic grows. As usual, everything is easier in the cloud: In the cloud, adding a highly-available and scalable load balancer is as easy as signing up for another cloud service. For example, Amazon has its Elastic Load Balancer service, with its three variants: “Gateway Load Balancer” (an IP-layer load balancer), “Network Load Balancer” (a TCP load balancer) and “Application Load Balancer” (an HTTP load balancer).

As easy as deploying a TCP or HTTP load balancer is, it comes with a price tag. All connections need to flow through this single service in their entirety, so it needs to do a lot of work and also adds latency. Needing to be highly available and scalable further complicates the load balancer and increases its costs.

There are cheaper server-side load balancing solutions that avoid passing all the traffic through a single load-balancing device. One efficient technique is DNS load balancing, which works as follows: We observe that the endpoint address is not a numeric IP address, but rather a domain name, such as “dynamodb.us-east-1.amazonaws.com”. This gives the DNS server for this domain name an opportunity to return a different Scylla node each time a domain-name resolution is requested.

If we experiment with Amazon DynamoDB, we can see that it uses this DNS load-balancing technique: each time that “dynamodb.us-east-1.amazonaws.com” is resolved, a different IP address is returned, apparently chosen from a set of several hundred. The DNS responses have a short TTL (5 seconds) to encourage clients not to cache these responses for very long. For example:

$ while :
do
     dig +noall +answer dynamodb.us-east-1.amazonaws.com
     sleep 1
done

dynamodb.us-east-1.amazonaws.com. 5 IN    A    52.94.2.86
dynamodb.us-east-1.amazonaws.com. 5 IN    A    52.119.228.140
dynamodb.us-east-1.amazonaws.com. 5 IN    A    52.119.233.18
dynamodb.us-east-1.amazonaws.com. 3 IN    A    52.119.228.140
dynamodb.us-east-1.amazonaws.com. 5 IN    A    52.119.233.182
dynamodb.us-east-1.amazonaws.com. 1 IN    A    52.119.228.140
dynamodb.us-east-1.amazonaws.com. 1 IN    A    52.119.233.18
dynamodb.us-east-1.amazonaws.com. 5 IN    A    52.119.233.238
dynamodb.us-east-1.amazonaws.com. 5 IN    A    52.119.234.122
dynamodb.us-east-1.amazonaws.com. 5 IN    A    52.119.233.174
dynamodb.us-east-1.amazonaws.com. 5 IN    A    52.119.233.214
dynamodb.us-east-1.amazonaws.com. 5 IN    A    52.119.224.232
dynamodb.us-east-1.amazonaws.com. 4 IN    A    52.119.224.232
dynamodb.us-east-1.amazonaws.com. 5 IN    A    52.119.225.12
dynamodb.us-east-1.amazonaws.com. 1 IN    A    52.119.226.182
...

These different IP addresses may be different physical nodes, or be small load balancers fronting for a few nodes each.

The DNS-based load balancing method is cheap because only the DNS resolution goes through the DNS server — not the entire connection. This method is also highly available and scalable because it points clients to multiple Scylla nodes and there can be more than one DNS server.

However, DNS load balancing alone has a problem: When a Scylla node fails, clients who have already cached a DNS resolution may continue to send requests to this dead node for a relatively long time. The floating IP address technique can be used to solve this problem: We can have more than one IP address pointing to the same physical node. When one of the nodes fails, other nodes take over the dead node’s IP addresses — servicing clients who cached its IP addresses until those clients retry the DNS request and get a live node.

Here are two examples of complete server-side load balancing solutions. Both include a DNS server and virtual IP addresses for quick failover. The second example also adds small TCP load balancers; those are important when there are just a few client machines that cache just a few DNS resolutions, but they add additional costs and latency (requests go through another network hop).

Server-side load balancing, example 1:
DNS + virtual IP addresses

Server-side load balancing, example 2:
DNS + TCP load balancers + virtual IP addresses

Alternator provides two discovery services that these different load-balancing techniques can use to know which nodes are alive and should be balanced:

  1. Some load balancers can be configured with a fixed list of nodes plus a health-check service to periodically verify if each node is still alive. Alternator allows checking the health of a node with a trivial HTTP GET request, without any authentication needed:

$ curl http://1.2.3.4:8000/
healthy: 1.2.3.4:8000

  2. In other setups, the load balancer might not know which Alternator nodes exist or when additional nodes come up or go down. In this case, the load balancer will need to know of one live node (at least), and can discover the rest by sending a “/localnodes” request to a known node:

$ curl http://127.0.0.1:8000/localnodes
["127.0.0.1", “127.0.0.2”, “127.0.0.3”]

The response is a list of all the live nodes in this data center of the Alternator cluster, given as IP addresses in JSON format (the list does not repeat the protocol and port number, which are assumed to be the same for all nodes).
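For illustration, here is a minimal Java 11+ sketch of such a discovery call. The port, the plain-HTTP scheme, and the naive string parsing are assumptions made for brevity; a real client would use a proper JSON parser:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.ArrayList;
import java.util.List;

public class AlternatorDiscovery {
    // Ask one known Alternator node for the other live nodes in its data center.
    public static List<String> localNodes(String knownNode) throws Exception {
        HttpClient http = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(
                URI.create("http://" + knownNode + ":8000/localnodes")).GET().build();
        String body = http.send(request, HttpResponse.BodyHandlers.ofString()).body();
        // The body is a JSON array of IP address strings, e.g. ["127.0.0.1","127.0.0.2"].
        List<String> nodes = new ArrayList<>();
        for (String part : body.replaceAll("[\\[\\]\"\\s]", "").split(",")) {
            if (!part.isEmpty()) {
                nodes.add(part);
            }
        }
        return nodes;
    }
}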

This request is called localnodes because it returns the local nodes — the nodes in the same data center as the known node. This is usually what we need — we will have a separate load balancer per data center, just like AWS DynamoDB has a separate endpoint per AWS region.

Most DNS servers, load balancers and floating IP address solutions can be configured to use these discovery services. As a proof of concept, we also wrote a simple DNS server that uses the /localnodes request, in just 50 lines of Python.

Client-side Load Balancing

So, should this post about load balancing in Alternator end here? After all, we found some excellent server-side load balancing options. Don’t they solve all our problems?

Well not quite…

The first problem with these server-side solutions is their complexity. We started the previous section by saying that the main advantage of server-side load balancing was the simplicity of using it in applications — you only need to configure applications with a single endpoint URL. However, this simplicity only helps you if you are only responsible for the application, and someone else takes care of deploying the server for you  — as is the case in Amazon DynamoDB. Yet, many Alternator users are responsible for deploying both application and database. Such users do not welcome the significant complexity added to their database deployment, and the extra costs which come with this complexity.

The second problem with the server-side solutions is added latency. Solutions which involve a TCP or HTTP load balancer require another hop for each request – increasing not just the cost of each request, but also its latency.

So an alternative which does not require complex, expensive or latency-inducing server-side load balancing is client-side load balancing, an approach that has been gaining popularity, e.g., in microservice stacks with tools such as Netflix Eureka.

Client-side load balancing means that we modify the client application to be aware of the Alternator cluster. The application can use Alternator’s discovery services (described above) to maintain a list of available server nodes, and then open separate DynamoDB API connections to each of those endpoints. The application is then modified to send requests through all these connections instead of sending all requests to the same Scylla node.

The difficulty with this approach is that it requires many modifications to the client’s application. When an application is composed of many components which make many different requests to the database, modifying all these places is time-consuming and error-prone. It also makes it more difficult to keep using exactly the same application with both Alternator and DynamoDB.

Instead, we want to implement client-side load balancing with as few as possible changes to the client application. Ideally, all that would need to be changed in an application is to have it load an additional library, or initialize the existing library a bit differently; From there on, the usual unmodified AWS SDK functions will automatically use all of Alternator’s nodes instead of just one.

And this is exactly what we set out to do — and we will describe the result here, and recommend it as the preferred Alternator load-balancing solution.

Clearly, such a library-based solution will be different for different AWS SDKs in different programming languages. We implemented such a solution for six different SDKs in five programming languages: Java (SDK versions 1 and 2), Go, Javascript (Node.js), Python and C++. The result of this effort is collected in an open-source project, and this is the method we recommend for load balancing Alternator. The rest of this post will be devoted to demonstrating this method.

In this post, we will take a look at one example, using the Java programming language and Amazon’s original AWS SDK (AWS SDK for Java v1). The support for other languages and SDKs is similar – see https://github.com/scylladb/alternator-load-balancing/ for the instructions for each language.

An Example in Java

An application using AWS SDK for Java v1 creates a DynamoDB object and then uses methods on this object to perform various requests. For example, DynamoDB.createTable() creates a new table.

Had Alternator been just a single node, https://127.0.0.1:8043/, an application that wished to connect to this one node with the AWS SDK for Java v1 would use code that looks something like this to create the DynamoDB object:

URI node = URI.create("https://127.0.0.1:8043/");

AWSCredentialsProvider myCredentials =
     new AWSStaticCredentialsProvider(
          new BasicAWSCredentials("myusername", "mypassword"));

AmazonDynamoDB client =
     AmazonDynamoDBClientBuilder.standard()
     .withEndpointConfiguration(
          new AwsClientBuilder.EndpointConfiguration(
          node.toString(), "region-doesnt-matter"))
     .withCredentials(myCredentials)
     .build();

DynamoDB dynamodb = new DynamoDB(client);

And after this setup, the application can use this “dynamodb” object any time it wants to send a DynamoDB API request (the object is thread-safe, so every thread in the application can access the same object). For example, to create a table in the database, the application can do:

Table tab = dynamodb.createTable(tabName,
     Arrays.asList(
          new KeySchemaElement("k", KeyType.HASH),
          new KeySchemaElement("c", KeyType.RANGE)),
     Arrays.asList(
          new AttributeDefinition("k", ScalarAttributeType.N),
          new AttributeDefinition("c", ScalarAttributeType.N)),
     new ProvisionedThroughput(0L, 0L));

We want a way to create a DynamoDB object that can be used everywhere the application used the normal DynamoDB object (e.g., in the above createTable request), except that each request will go to a different Alternator node instead of all of them going to the same URL.

Our small library (see installation instructions) allows us to do exactly this, by creating the DynamoDB object as follows. The code is mostly the same as above; the new parts are the AlternatorRequestHandler import and its registration on the client builder:

import com.scylladb.alternator.AlternatorRequestHandler;

URI node = URI.create("https://127.0.0.1:8043/");

AlternatorRequestHandler handler =
     new AlternatorRequestHandler(node);

AmazonDynamoDB client = AmazonDynamoDBClientBuilder.standard()
     .withRegion("region-doesnt-matter")
     .withRequestHandlers(handler)
     .withCredentials(myCredentials)
     .build();

DynamoDB dynamodb = new DynamoDB(client);

The main novelty in this snippet is the AlternatorRequestHandler, which is passed into the client builder with withRequestHandlers(handler). The handler is created with the URI of one known Alternator node. The handler object then contacts this node to discover the rest of the Alternator nodes in this data-center. The handler also keeps a background thread which periodically refreshes the list of live nodes. After learning of more Alternator nodes, there is nothing special about the original node passed when creating the handler — and this original node may go down at any time without hurting availability.

The region passed to withRegion() does not matter (and can be any string), because AlternatorRequestHandler will override the chosen endpoint anyway. Unfortunately we can’t just drop the withRegion() call, because without it the library will expect to find a default region in the configuration file and complain when it is missing.

Each request made to this dynamodb object will now go to a different live Alternator node. The application code doing those requests does not need to change at all.
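For example, an ordinary request made through this object looks exactly the way it would against Amazon DynamoDB. A small sketch reusing the table from the createTable snippet above (the attribute name and values here are arbitrary examples):

Table tab = dynamodb.getTable(tabName);
tab.putItem(new Item().withPrimaryKey("k", 1, "c", 1)
     .withString("value", "hello"));
Item item = tab.getItem("k", 1, "c", 1);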

Conclusions

We started this post by explaining why Alternator, Scylla’s implementation of the DynamoDB API, needs a load-balancing solution for sending different requests to different Scylla nodes. We then surveyed both server-side and client-side load balancing solutions, and recommended a client-side load balancing solution.

We showed that this approach requires only very minimal modification to applications — adding a small library and changing the way that the DynamoDB object is set up, while the rest of the application remains unchanged: The application continues to use the same AWS SDK it used with Amazon DynamoDB, and the many places in the application which invoke individual requests remain exactly the same.

The client-side load balancing allows porting DynamoDB applications to Alternator with only trivial modifications, while not complicating the Scylla cluster with load balancing setups.


The post Load Balancing in Scylla Alternator appeared first on ScyllaDB.

Apache Cassandra Changelog #6 | April 2021


Our monthly roundup of key activities and knowledge to keep the community informed.

Release Notes

Released

A blocking issue was found in beta-2 which has delayed the release of rc-1. Also during rc-1 evaluation, some concerns were raised about the contents of the source distribution, but work to resolve that got underway quickly and is ready to commit.

For the latest status on Cassandra 4.0 GA, please check the Jira board (ASF login required). However, we expect GA to arrive very soon! Read the latest summary from the community here. The remaining tickets represent 1% of the total scope.

Join the Cassandra mailing list to stay up-to-date.

Changed

The release cadence for the Apache Cassandra project is changing. The community has agreed to one release every year, plus periodic trunk snapshots. The number of releases that will be supported in this agreement is three, and every incoming release will be supported for three years.

Community Notes

Updates on Cassandra Enhancement Proposals (CEPs), how to contribute, and other community activities.

Added

The PMC is pleased to announce that Berenguer Blasi has accepted the invitation to become a project committer. Thanks so much, Berenguer, for all the work you have done!

Added

As the community gets closer to the launch of 4.0, we are organizing a celebration with the help of ASF – Cassandra World Party 4.0 will be a one-day, no-cost virtual event on Wednesday, April 28 to bring the global community together in celebration of the upcoming release milestone. The CFP for 5-minute lightning talks is open now until April 9 – newcomers welcome! Register here.

Added

Apache Cassandra is taking part in the Google Summer of Code (GSoC) under the ASF umbrella as a mentoring organization. If you’re a post-secondary student and looking for an exciting opportunity to contribute to the project that powers your favorite Internet services then read Paulo Motta’s GSoC blog post for the details.

Changed

Recent updates to cass-operator in March by the Kubernetes SIG have seen the specification for seeds now supporting hostnames and separate seeds for separate data centers. Currently, the SIG is discussing whether cass-operator, the community-developed operator for Apache Cassandra, should have CRDs for keyspaces and roles, how to accomplish pod-specific configurations, and whether CRDs should represent schema; watch the discussion here.

The project is also looking at how to make the cass-operator multi-cluster by using the same approach used for Multi-CassKop. One idea is to use existing CassKop CRDs to manage cass-operator, and it could be a way to demonstrate how easy it is to migrate from one operator to another.

K8ssandra will be seeking to support Apache Cassandra 4.0 features, which involve some new configuration settings and require changes in the config builder. It will also be supporting JDK 11, the new garbage collectors, and the auditing features.


User Space

American Express

During last year’s ApacheCon, Laxmikant Upadhyay presented a 35-minute guide on the best practices and strategies for upgrading Apache Cassandra in production. This includes pre- and post-upgrade steps and rolling and parallel upgrade strategies for Cassandra clusters. - Laxmikant Upadhyay

Spotify

In a recent AMA, Spotify discussed Backstage, its open platform for building developer portals. Spotify elaborated on the database solutions it provides internally: “Spotify is mostly on GCP so our devs use a mix of Google-managed storage products and self-managed ones.[…] The unmanaged storage solutions Spotify devs start and operate themselves on GCE include Apache Cassandra, PostgreSQL, Memcached, Elastic Search, and Redis. We hope to support stateful workloads in the future. We’ve explored using PersistentVolumes backed by persistent disks.” - David Xia

Do you have a Cassandra case study to share? Email cassandra@constantia.io.

In the News

TFiR: How Apache Cassandra Works With Containers

Dataversity: Why 2021 Will Be a Big Year for Apache Cassandra (and Its Users)

ZDNet: Microsoft Ignite Data and Analytics Roundup: Platform Extensions Are the Key Theme

Techcrunch: Microsoft Azure Expands its NoSQL Portfolio with Managed Instances for Apache Cassandra

Cassandra Tutorials & More

How to Install Apache Cassandra on CentOS 8 - Shehroz Azam, LinuxHint

Cassandra With Java: Introduction to UDT - Otavio Santana, DZone

Apache Cassandra Horizontal Scalability for Java Applications [Book] - Otavio Santana, DZone

Cloud-native applications and data with Kubernetes and Apache Cassandra - Patrick McFadin, DataStax


Cassandra Changelog is curated by the community. Please send submissions to cassandra@constantia.io.

Capacity Planning with Style

Scylla Cloud now offers a new Scylla Cloud Calculator to help you estimate your costs based on your database needs. While it looks like a simple tool, anyone steeped in the art of database capacity planning knows that there is often far more to it than meets the eye. In this blog post we’ll show you how to use our handy new tool and then illustrate it with an example from a fictional company.

Considerations for the Cloud

One of the great things about using the cloud is that capacity planning mistakes can be cheaper: capacity is treated as an expense, adjustable monthly, with no impact on your annual capital budget. You’re not on the hook for equipment costs if you’ve underspecified or overspecified a system, and even if you get it wrong it’s relatively easy to change your capacity later. In many cases people don’t bother with precise capacity planning anymore, settling on rough estimates. For many situations, e.g., stateless applications, this is fine. Experimentation is certainly a good way to find the right capacity.

Yet in the case of stateful systems like databases, capacity planning is still vital. Although modern databases are elastic — allowing you to add and remove capacity — this elasticity is limited. For example, databases may need hours to adjust, making them unable to meet real-time burst traffic scaling requirements. Scaling a database is relatively slow and sometimes capped by the data model (e.g. partitioning) or data distribution methods (replication).

As well, many databases are made more affordable by signing longer term committed license agreements, such as annual contracts. In such cases, you want to make sure you are signing up for the actual capacity you need, and not overprovisioning (wasting your budget on excess capacity) or underprovisioning (risking hitting upper bounds on performance and storage capacity).

Using the Scylla Cloud Calculator

Now let’s get into the Scylla Cloud Calculator itself and see how it works. While some experienced hands may already know exactly what instance types they want and need — by the way, we still show you prices by instance types — others may only have a general idea of their requirements: throughput in reads or writes, or total unique data stored. In this latter case, you want Scylla Cloud to instead take in your requirements, recommend to you how many and what specific types of servers you will need to provision, and inform you of what that will cost based upon on-demand or 1-year reserved provisioning.

We have written a separate guide for how to use the Scylla Cloud Calculator, but let’s go through some of the basics here. With our calculator, you can specify the following attributes:

  • Read ops/sec
  • Write ops/sec
  • Average item size (in KB)
  • Data set size (in TB)
  • On demand or reserved (1 year)

See How Scylla Cloud Compares

You can also use the Scylla Cloud Calculator to compare Scylla Cloud pricing to the following NoSQL databases:

  • Amazon DynamoDB
  • DataStax Astra
  • Amazon Keyspaces

Note that while many features across these databases may be the same or similar, there will be instances where specific features are not available, or there may be similar features that vary widely in implementation across the different offerings. How Scylla refers to “Reserved” and “On-Demand” resources also may differ from how other vendors use these same terms. You are encouraged to contact us to find out more about your specific needs.

ScyllaDB will strive to maintain current pricing on these competitors in our calculator. Yet if you find any discrepancy between our calculator and what you find on a competitor’s website, please let us know.

Assumptions and Limitations

The pricing calculator today cannot take into account all the factors that can modify your capacity, production experience and costs. For example, the calculator does not model the CPU or IO requirements of, say, Materialized Views, Lightweight Transactions, highly variable payload sizes, or frequent full table or range scans. Nor can it specifically model behavior when you apply Scylla Cloud’s advanced features such as Workload Prioritization, where different classes of reads or writes may have different priorities. Further, note the pricing calculator by default sets the Replication Factor for the cluster to 3.

If you have a need to do more advanced sizing calculation, to model configurations and settings more tailored to your use case, or to conduct a proof-of-concept and your own benchmarking, please reach out and contact our sales and solutions teams directly.

For now, let’s give an example of a typical use case that the calculator can more readily handle.

Example Use Case: BriefBot.ai

Welcome to BriefBot.ai, a fictional service that is growing and about to launch its second generation infrastructure. The service creates and serves personalized briefs from a variety of data sources based on user preferences. Users will scroll through their feeds, seeing briefs in some order (we call it a “timestamp” but it can be virtual time to support algorithmic feed) from different bots. Briefs also have points, which are awarded by users and are used to enhance the ML algorithms powering the service. This service is expected to become very popular and we project it will have tens of millions of users. To support this scale, we have decided to use Scylla as the main datastore for storing and serving users’ feeds.

Each feed is limited to 1,000 daily items (we really don’t expect users to read more in a single day), and each item is about 1 KB. Further, for the sake of data retention, the scrollback in the feed is limited. Historical data will be served from some other, lower-priority storage that may have higher latencies than Scylla. A method such as Time To Live (TTL) can evict older data records; this also governs how swiftly the system will need to grow its storage over time.

Data Model

CREATE KEYSPACE briefbot WITH REPLICATION = {'class': 'SimpleStrategy', 'replication_factor': 3}

CREATE TABLE briefs (
   user_id UUID,
   brief_id UUID,
   bot_id UUID,
   timestamp TIMESTAMP,
   brief TEXT,
   points int,
   origin_time TIMESTAMP,
   PRIMARY KEY (user_id, timestamp)
)

The briefs table is naturally partitioned by user_id and has everything we need when loading the feed, so we only need to do one query when a user is looking at the feed:

SELECT * FROM briefs WHERE user_id=UUID AND timestamp<CURRENT_POSITION PER PARTITION LIMIT 100

This is a common pattern in NoSQL: denormalizing data so that we can read it with a single query. Elsewhere we will have another database containing briefs accessible by bot id, brief id, etc., but this database is specifically optimized for a fast feed-loading experience for our users. For example, if we need bot logos we might add another field “bot_avatar” to our table (right now we’ll fetch bot avatars from a CDN, so this isn’t important). Also note that in our example data model we specified SimpleStrategy to keep things simple; we advise you to choose NetworkTopologyStrategy for an actual production deployment.

While this table is being read synchronously, writes are done by a processing pipeline asynchronously. Every time a bot generates a new brief, the pipeline writes new brief items to all relevant users. Yes, we are writing the same data multiple times — this is the price of denormalization. Users don’t observe this and writes are controlled by us, so we can delay them if needed. We do need another subscriptions table which tells us which user is subscribed to which bot:

CREATE TABLE subscriptions (
   user_id UUID,
   bot_id UUID,
   PRIMARY KEY (bot_id, user_id)
)

Notice that the partition key of this table is bot_id; you might wonder how we can handle bots which may have millions of users subscribed to them. In a real system this would also cause a large cascade of updates when a popular bot has a new brief – the common solution for this problem would be to move the popular bots to a separate auxiliary system which has briefs indexed by bot_id and have users pull from it in real time. Instead, we will make bot_id a virtual identifier, with many UUIDs mapped to the same bot; every “virtual bot” will only have several thousand users subscribed to it.
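To illustrate the idea, here is a sketch of how such a mapping might look. The bucket count and the helper itself are hypothetical, not part of the original design:

import java.nio.charset.StandardCharsets;
import java.util.UUID;

public class VirtualBots {
    // How many "virtual bots" a single popular bot is split into (made-up value).
    private static final int BUCKETS = 1024;

    // Deterministically map a (bot, user) pair to one of the bot's virtual ids, so a
    // popular bot's subscribers are spread across many partitions of the subscriptions table.
    public static UUID virtualBotId(UUID botId, UUID userId) {
        int bucket = Math.floorMod(userId.hashCode(), BUCKETS);
        return UUID.nameUUIDFromBytes(
                (botId.toString() + ":" + bucket).getBytes(StandardCharsets.UTF_8));
    }
}

The write pipeline would then iterate over all of a bot’s virtual ids when fanning a new brief out, while each user is subscribed under exactly one of them.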

Workload Estimation

Now that we have a rough design and data model, we can start estimating the workload. Our product manager has also provided some estimates on the expected user base and growth of the service, which we can feed into a guesstimation tool.

Our initial guesses for the number of users are given as ranges based on market research. The model is based on data both from known sources (e.g. the number of briefs per bot which we control) and unknown sources outside of our control (e.g. the number of users and their activity). The assumptions we make need to take into account our uncertainty, and thus larger ranges are given to parameters we are less sure of. We can also use different distributions for different parameters, as some things have known distributions. The output of the model is also given as distributions, and for sizing our system we will take an upper percentile of the load — 90th percentile in our case. For systems that cannot be easily scaled after initial launch we would select a higher percentile to reduce the chances of error – but this will make our system substantially more expensive. In our example, the values for database load are:

               P50       P90       P95
Reads/sec      41,500    165,000   257,000
Writes/sec     195,000   960,750   1,568,000
Dataset (TB)   44        104       131

As you can see, there is a large difference between the median and the higher percentiles. Such is the cost of uncertainty! But the good news is that we can iteratively improve this model as new information arrives to reduce this expensive margin of error.
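To make the approach concrete, here is a toy Monte Carlo version of such a guesstimate in Java. Every range and rate in it is a placeholder invented for illustration, not BriefBot’s actual model:

import java.util.Arrays;
import java.util.Random;

public class Guesstimate {
    public static void main(String[] args) {
        Random rnd = new Random(42);
        int trials = 100_000;
        double[] readsPerSec = new double[trials];
        for (int i = 0; i < trials; i++) {
            double users = sample(rnd, 20e6, 80e6);        // total users (wide range: very unsure)
            double activeShare = sample(rnd, 0.05, 0.15);  // share of users online at peak
            double loadsPerHour = sample(rnd, 2, 6);       // feed loads per active user per hour
            readsPerSec[i] = users * activeShare * loadsPerHour / 3600.0;
        }
        Arrays.sort(readsPerSec);
        System.out.printf("P50=%.0f  P90=%.0f  P95=%.0f%n",
                readsPerSec[trials / 2],
                readsPerSec[(int) (trials * 0.90)],
                readsPerSec[(int) (trials * 0.95)]);
    }

    // Uniform sample in [lo, hi]; a real model would pick a suitable distribution per parameter.
    private static double sample(Random rnd, double lo, double hi) {
        return lo + rnd.nextDouble() * (hi - lo);
    }
}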

Growth over Time

It is unrealistic to assume that the system will serve the maximal load on day one, and we expect gradual growth over the lifetime of the system. There are several models we can use to predict the growth of the system, e.g. linear, logistic or exponential. Those kinds of models are very helpful for an existing system where we have data to extrapolate from; however, our system is new. Thankfully, Scylla allows scaling up and down, so we only need to make sure that we have enough capacity when the system launches; we can adjust the capacity based on real data later on.

Compaction

Compaction affects both the required storage and the required compute power. Different compaction strategies give different performance for reads and writes and for different situations. For this system, we choose the Incremental Compaction Strategy available as part of Scylla Enterprise which powers Scylla Cloud. This compaction strategy accommodates a write heavy scenario while keeping space amplification to a reasonable level. Our calculation of required resources will have to be adjusted for the requirements of compaction, and this is handled by the Scylla Cloud sizing calculator.

Deciding on Operating Margins

One of the important decisions that need to be made when sizing a cluster is how much operating margin we would like to have. Operating margin accounts for resilience, e.g. degraded operation when a node is lost, and also for load fluctuations, maintenance operations and other unforeseen events. The more operating margin we take, the easier it is to operate the cluster, but additional capacity can be expensive. We would like to choose a margin that will cover what is likely to happen, relying on our ability to add nodes in the case of unforeseen events.

The operating margin we choose will account for:

  • 1 node failure
  • Routine repair cycle
  • Peak load increase of 25%
  • Shards which are 20% hotter than average shard load

These margins can either be conservative and stacked on top of one another, or optimistic and overlapping. We choose to be optimistic and assume there is only a small chance of more than one or two of the above happening simultaneously. We will need to make sure our capacity has a margin of at least one node or 25%, whichever is higher. This margin will be applied to the throughput parameters, as storage already has its own margin for compaction, node failures and such – this is part of the design of Scylla.
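As a back-of-the-envelope illustration of checking such a margin (the per-node throughput figure below is purely an assumed placeholder, not a measured Scylla number):

public class Margins {
    public static void main(String[] args) {
        int nodes = 15;
        double opsPerNode = 150_000;           // assumed per-node capacity, for illustration only
        double p90PeakOps = 960_750 + 165_000; // P90 writes/sec + reads/sec from the table above
        double requiredWithPeakMargin = p90PeakOps * 1.25;         // +25% peak load increase
        double capacityWithOneNodeDown = (nodes - 1) * opsPerNode; // survive one node failure
        System.out.printf("need %.0f ops/s, have %.0f ops/s with one node down%n",
                requiredWithPeakMargin, capacityWithOneNodeDown);
    }
}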

Estimating Required Resources Using the Scylla Cloud Calculator

After having an estimate for the workload and operational margins, we still need to calculate the amount of resources needed to handle this workload. This can be quite challenging, as we have discussed in a previous blog post, and requires a performance model of the system. Knowing the required resources, we can use a few rules of thumb to derive the required configuration of servers and composition of the cluster. If you are running on EC2, the calculator already recommends a cluster composition of EC2 instances, currently i3 and i3en instance types only.

Again, the calculator is just an estimate — the exact performance of a complex system like Scylla depends on the actual data and usage patterns, but it does provide results which are good enough for initial design and cost estimation.

Selecting the Right Cluster

According to the calculator, a cluster of 15 i3en.12xlarge instances will fit our needs. This cluster has more than enough throughput capacity (more than 2 million ops/sec) to cover our operating margins – as the limiting factor in this case was storage capacity. But why did the calculator choose large 12xlarge nodes and not a large number of smaller nodes?

Scylla is a redundant cluster which is resilient to node failures, but that does not mean node failures have zero impact. For instance, in a recent disaster involving Kiwi.com, a fire took out an entire datacenter, resulting in the loss of 10 out of the total 30 nodes in their distributed cluster. While Scylla kept running, load on the other servers did increase significantly. As we have previously discussed we need to account for node failures in our capacity planning.

We should note that using larger nodes may be more efficient from a resource allocation point of view, but it results in a small number of large nodes which means that every failing node takes with it a significant part of cluster capacity. In a large cluster of small nodes each node failure will be less noticeable, but there is a larger chance of node failures — possibly more than one! This has been discussed further in The Math of Reliability.

We have a choice between a small cluster of large machines which emphasizes performance and cost effectiveness in a non-failing state and a large cluster of small machines which is more resilient to failures but also more likely to have some nodes failing. Like all engineering tradeoffs, this depends on the design goals of the system, but as a rule of thumb we recommend a balanced configuration with at least 6-9 nodes in the cluster, even if the cluster can run on 3 very large nodes.

In the case of BriefBot, we will use the calculator recommendation of 15 i3en.12xlarge nodes, which will give us ample capacity and redundancy for our workload.

Monitoring and Adjusting

Congratulations! We have launched our system. Unfortunately, this doesn’t mean our capacity planning work is done — far from it. We need to keep an eye on the cluster utilization metrics and make sure we are within our operational margin. We can even set some predictive alerts, perhaps using linear regression (available in Graphite and Prometheus) or more advanced models offered by various vendors. Some metrics, like the amount of storage used, are fairly stable; others, like utilization, suffer from seasonal fluctuations which may need to be “smoothed” or otherwise accounted for in prediction models. A simple yet surprisingly effective solution is to just take the daily/weekly peaks and run a regression on them — it works well enough in many systems.
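As a sketch of that peak-regression idea, a few lines of Java suffice for a simple least-squares trend. The utilization numbers here are invented sample data; in practice the peaks would come from monitoring:

public class PeakTrend {
    public static void main(String[] args) {
        // One value per day: the daily peak CPU utilization (fraction of capacity).
        double[] dailyPeaks = {0.41, 0.43, 0.42, 0.45, 0.47, 0.46, 0.49};
        int n = dailyPeaks.length;
        double sumX = 0, sumY = 0, sumXY = 0, sumXX = 0;
        for (int day = 0; day < n; day++) {
            sumX += day;
            sumY += dailyPeaks[day];
            sumXY += day * dailyPeaks[day];
            sumXX += (double) day * day;
        }
        // Ordinary least-squares fit: peak ~= intercept + slope * day
        double slope = (n * sumXY - sumX * sumY) / (n * sumXX - sumX * sumX);
        double intercept = (sumY - slope * sumX) / n;
        double projected = intercept + slope * (n - 1 + 30); // extrapolate 30 days ahead
        System.out.printf("trend: %+.2f%%/day, projected peak in 30 days: %.0f%%%n",
                slope * 100, projected * 100);
    }
}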

In the case of Scylla, the metrics we would like to predict growth on would be:

  • Storage used
  • CPU utilization (Scylla, not O/S)

These metrics and many more are available as part of Scylla Monitoring Stack.

Summary

Capacity planning is an important step in any large scale system design that often requires operational knowledge and experience of the systems involved. With the new sizing calculator we hope to make administrators’ lives easier when planning new clusters. If you are interested in discussing this further, as always we are available on the community Slack channel for questions, comments and suggestions.


The post Capacity Planning with Style appeared first on ScyllaDB.

Scylla’s New IO Scheduler

Like any other computational resource, disks are limited in the speed they can provide. This speed is typically measured as a two-dimensional value: Input/Output Operations per Second (IOPS) and bytes per second (throughput). Of course these parameters are not set in stone even for a particular disk, and the maximum number of requests or bytes greatly depends on the requests’ distribution, queuing and concurrency, buffering or caching, disk age, and many other factors.

So when performing IO, a system must always balance between two inefficiencies: overwhelming the disk with requests and underutilizing it.

Overwhelming the disk should be avoided, because when the disk is saturated with requests it cannot distinguish the criticality of certain requests over others. All requests are, of course, important, yet we care more about latency-sensitive requests. For example, Scylla serves real-time queries that need to complete in single-digit milliseconds or less while, in parallel, it processes terabytes of data for compaction, streaming, decommission and so forth. The former are strongly latency-sensitive; the latter much less so. The task of the IO scheduler is to maximize the IO bandwidth while keeping latency as low as possible for latency-sensitive tasks. The former IO scheduler’s design, implementation and priority classes are described here; in this blog we will cover the enhancements in the new scheduler that improve its performance in environments where IO is a bottleneck or work is unbalanced.

In the case of Seastar, the situation with not overwhelming the disk is additionally complicated by the fact that Seastar doesn’t do IO directly against the disk; instead, it talks to an XFS filesystem with the help of the Linux AIO API. However, the idea remains the same: Seastar tries hard to feed the disk with as many requests as it can handle, but not more than that. (We’ll put aside for now all the difficulties of defining what “as much as it can” means.)

Seastar uses the “shared nothing” approach, which means that decisions made by CPU cores (called shards) are not synchronized with each other. In rare cases, when one shard needs another shard’s help, they explicitly communicate with each other. Also keep in mind that Seastar schedules its tasks at a granularity of ~0.5 ms, and in each “tick” a new request may pop up on the queue.

A conceptual model of Seastar’s original IO Scheduler. Different tasks are represented by different colors, rectangle length represents the IO block length

When a sharded architecture runs the show, the attempt not to overwhelm the disk reveals itself from an interesting angle. Consider a 1GB/s disk on a 32-shard system. Since the shards do not communicate with each other, the best they can do in order not to overwhelm the disk, and to keep it ready to serve new requests within a 0.5 ms time-frame, is to submit requests no larger than 16k in size (the math is 1GB/s divided by 32 shards, multiplied by 0.5ms).

This sounds pretty decent, as the disk block is 512-4096 bytes, but in order to achieve maximum disk throughput one ought to submit requests of 128k and larger. To keep requests big enough not to underutilize the disk, yet small enough not to overwhelm it, shards needed to communicate; in keeping with the sharded design, this was solved by teaching shards to forward their requests to other shards. On start, some shards were allowed to do IO for real and were called “IO coordinators,” while the others sent their requests to these coordinators and waited for the completion to travel back from that shard. This solved the overwhelming problem, but led to several nasty issues.

Visual representation of a CPU acting as an IO coordinator

As non-coordinator shards use the Seastar cross-shard message bus, they all pay an unavoidable latency tax roughly equal to the bus’s round-trip time. Since the reactor’s scheduling slice is 0.5ms, in the worst case this value is 1ms, which is comparable to the latency of a 4k request on modern NVMe disks.

Also, with several IO coordinators configured, the disk becomes statically partitioned. So if one shard is trying to do IO-intensive work while the others are idling, the disk will not work at its full speed, even if large requests are in use. This, in turn, can be solved by always configuring a single IO coordinator regardless of anything else, but that takes the next problem to its extreme.

When a coordinator shard is loaded with offloaded IO from other shards, in the worst case that shard can spend a notable amount of its time doing IO, thus slowing down its other regular activities. When there’s only one CPU acting as an IO coordinator, this problem reveals itself at its best (or maybe worst).

The goal of the new scheduler is to kill two birds with one stone — to let all shards do their own IO work without predominantly burdening one CPU with it, and at the same time keep the disk loaded according to the golden rule of IO dispatching: “as much as it can, but not more than this” (a variant of Einstein’s apocryphal expression). Effectively this means that shards still need to communicate with each other, but using some technique other than sending Seastar messages to each other.

Here Come the IO Groups

An IO group is an object shared between several shards, in which those shards keep track of the capacity of a disk. There can be more than one group in the system, but more on this later. When a shard needs to dispatch a request, it first grabs the needed fraction of the capacity from the group. If the capacity requirement is satisfied, the shard dispatches; if not, it waits until capacity is released back to the group, which in turn happens when a previously submitted request completes.

IO groups ensure no one CPU is burdened with all IO coordination. The different CPUs intercommunicate to negotiate their use of the same shared disk

An important property of the new algorithm is “fairness,” meaning all shards should have equal chances to get the capacity they need. This sounds trivial, but simple algorithms fall short of this property. For example, early versions of Linux had a simple spinlock implementation that was unfair in the above sense, so the community had to re-implement it in the form of a ticket lock.

A Side Note

The simplest way to implement the shared capacity is to have a variable on the group object, initialized with a value equal to the disk capacity. When a shard needs to grab capacity, it decrements the value; when it returns the capacity, it increments the value back. When the counter becomes negative, the disk is over-dispatched and the shard needs to wait. This naive approach, however, cannot be fair.

Let’s imagine a group with a capacity of 3 units and three shards all trying to submit requests weighing 2 units each. Obviously, these requests cannot all be submitted at once, so one shard wins the race and dispatches, while the two other shards have to wait.

Now what would this “wait” look like? If both shards just poll the capacity counter until it becomes at least 2, they can do it forever. Clearly the shards need to “reserve” the capacity they need, but again, with a plain counter this just won’t work. Even if we allow the counter to go negative, denoting capacity starvation, then in the example above, once both waiters subtract 2 from the counter it will never become larger than -1 and the whole dispatching will get stuck. We’ll refer to this as the increment/decrement (“inc/dec”) waiting problem.
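
As a toy illustration of the inc/dec waiting problem (illustrative code only, not from Seastar), here is the example above replayed with a plain atomic counter and "reservations":

#include <atomic>
#include <cstdio>

int main() {
    std::atomic<int> capacity{3};   // the group's disk capacity, in units

    capacity.fetch_sub(2);          // shard A wins the race and dispatches: 1 left
    capacity.fetch_sub(2);          // shard B "reserves" while waiting:    -1
    capacity.fetch_sub(2);          // shard C "reserves" too:              -3

    capacity.fetch_add(2);          // shard A's request completes:         -1

    // With both reservations standing, the counter can never climb above -1,
    // so neither B nor C can tell when it becomes safe to dispatch.
    std::printf("counter = %d\n", capacity.load());
    return 0;
}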

Capacity Rovers

The algorithm that worked for us is called “capacity rovers” and it resembles the way TCP controls congestion with a sliding window. A group maintains two values called the tail rover and the head rover, initialized to zero and to the disk capacity, respectively. When a shard needs to dispatch a request, it increments the tail rover and, if the tail is still less than the head, dispatches. When the request finishes, the shard increments the head rover. All increments, of course, happen with the help of atomic instructions.

The interesting part of the algorithm is what a shard does if, after incrementing the tail rover, the value happens to be larger than the head one (in the code this is called being “ahead of” it). In this case the disk’s current usage versus its capacity doesn’t allow for more requests and the shard should wait, and that’s where the naive inc/dec waiting problem is solved.

First of all, a shard will be allowed to dispatch once the disk capacity allows for it, i.e. when the head rover is again ahead of the tail. Note that it can be ahead by any value, since the capacity demand has already been accounted for by incrementing the tail rover. Also, since all shards only increment the tail rover and never decrement it, they implicitly form a queue: the shard that incremented the tail rover first may also dispatch first. The waiting finishes once the head rover gets ahead of the tail rover value that a shard saw when incrementing it. The latter calls for an increment-and-get operation on an integer, which is available on all platforms.

One may have noticed that the tail and head rovers never decrement and thus must overflow at some point. That’s true, and there is no attempt to avoid the overflow; it is welcomed instead. To safely compare wrapping rovers, signed arithmetic is used. Having a Linux kernel development background, I cannot find a better example of this than Linux jiffies. Of course, this wrapping comparison only works if all shards check the rovers more often than they overflow. At current rates this means a shard should poll its IO queues at least every 15 minutes, which is a reasonable assumption.
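
Putting these pieces together, here is a minimal sketch of the capacity rovers in C++ (hypothetical names and shape, not the actual Seastar implementation): both rovers only ever grow, grabbing capacity returns a “ticket,” dispatch is allowed once the head rover catches up with that ticket, and wrap-around is tolerated through a signed comparison, jiffies-style.

#include <atomic>
#include <cstdint>

class capacity_rovers {
    std::atomic<uint64_t> _tail{0};   // total capacity ever requested by all shards
    std::atomic<uint64_t> _head;      // disk capacity + total capacity ever released
public:
    explicit capacity_rovers(uint64_t disk_capacity) : _head(disk_capacity) {}

    // Account for a request of `cost` units; returns the ticket (the new tail
    // value) that the caller must wait on before dispatching.
    uint64_t grab(uint64_t cost) {
        return _tail.fetch_add(cost) + cost;   // increment-and-get
    }

    // True once enough previously dispatched requests have completed, i.e. the
    // head rover has caught up with our ticket.
    bool may_dispatch(uint64_t ticket) const {
        // A signed difference keeps the comparison valid across wrap-around.
        return int64_t(_head.load() - ticket) >= 0;
    }

    // Called on request completion: return the capacity to the group.
    void release(uint64_t cost) {
        _head.fetch_add(cost);
    }
};

In this sketch a shard calls grab() when it wants to dispatch, polls may_dispatch() from its reactor loop, and calls release() from the request’s completion handler; because grab() only ever advances the tail rover, waiting shards implicitly line up in the order in which they grabbed.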

Also note that the shared rovers require each shard to alter two variables in memory with atomic instructions, and if one shard is luckier accessing those memory cells than another, it may get the capacity more easily. That’s also true, and the primary reason for such asymmetry is NUMA machines: if there’s only one IO group for the whole system, then shards sitting on the same NUMA node where the group object’s memory is allocated will enjoy faster rover operations and, likely, more of the disk capacity devoted to them.

To mitigate this, IO groups are allocated per NUMA node, which, of course, results in static disk partitioning again, but this time the partitioning comes in much larger chunks.

As a result, we now have a scheduler that allows all shards to dispatch requests on their own and use more disk capacity than they could if the system was partitioned statically. This improves the performance for imbalanced workloads and opens up a possibility for another interesting feature called “IO cancellation,” but that’s the topic for another post.

From a performance perspective, it will take some effort to see the benefits. For direct experiments with the IO scheduler there’s a tool called io_tester that lives in the Seastar repository.

The most apparent test to run is to compare the throughput one shard can achieve on a multi-shard system. The old scheduler statically caps the per-shard throughput to 1/Nth of the disk, while the new one allows a single shard to consume all the capacity.

Another test is to compare the behavior of a uniform 4k workload on the old scheduler with a single coordinator (achieved with the --num-io-queues option) and on the new one. Checking the latency distributions shows that the new scheduler produces more uniform values.

Now Available in Scylla Open Source

You can immediately start to benefit from our new IO Scheduler, which is present in our recently released Scylla Open Source 4.4. If you’d like to ask more questions about the IO Scheduler, or using Scylla in general, feel free to contact us directly, or join our user community in Slack.


Project Circe March Update


Springtime is here! It’s time for our monthly update on Project Circe, our initiative to make Scylla into an even more monstrous database. Monstrously more durable, stable, elastic, and performant. In March 2021 we released Scylla Open Source 4.4. This new software release provides a number of features and capabilities that fall under the key improvement goals we set out for Project Circe. Let’s home in on the recent performance and manageability improvements we’ve delivered.

New Scheduler

The Seastar I/O scheduler is used to maximize request throughput from all shards to the storage. Until now, the scheduler ran in a per-shard scope: each shard ran its own scheduler, balancing between its I/O tasks, such as reads, updates, and compactions. This works well when the workload between shards is approximately balanced; but when, as often happened, one shard was more loaded, it could not take more I/O, even if other shards were not using their share. I/O scheduler 2.0, included in Scylla 4.4, fixes this. As storage bandwidth and IOPS are shared, each shard can use the whole disk if required.

Drivers in the Fast Lane

While we’ve been optimizing our server performance, we also know the other side of the connection needs to be able to keep up. So in recent months we have been polishing our existing drivers and releasing all-new, shard-aware drivers.

We recently updated our shard-aware drivers for Java and Go (GoCQL) to support Change Data Capture (CDC). This makes such data updates more easily consumable and highly performant. We are committed to adding CDC shard-awareness to all our supported drivers, such as our Python driver. Speaking of new drivers, have you checked out the new shard-aware C/C++ driver? Or how about the Rust driver we have in development? They’ll get the CDC update in due time too.

We also introduced new reference examples of CDC consumer implementations.

You can use these examples when building a Go- or Java-based application that feeds from a Scylla CDC stream. Such an application can, for example, feed from a stream of IoT updates, maintaining the latest min and max values in an aggregation table in Scylla.

Manageability Improvements

Advisor for Scylla Monitoring Stack

We recently announced a Scylla Monitoring Advisor section. For example, it can help you diagnose issues with balance between your nodes. Are you seeing unacceptable latencies from a particular shard? Low cache hit rate? Connection problems? Scylla Advisor helps you pinpoint your exact performance bottlenecks so you can iron them out.

Look forward to additional capabilities and more in-depth documentation of this feature in Scylla Monitoring Stack 3.7, which is right around the corner.

New API for Scylla Manager

Scylla Manager 2.3 adds a new suspend/resume API, allowing users to set a maintenance window in which no recurrent or ad-hoc task, like repair or backup, is running.

Additional Performance Optimizations

Improved performance is often the result of many smaller improvements.  Here’s a list of ways we’ve recently further improved the performance of Scylla with our 4.4 release:

  • Repair works by having the repairing node (“repair master”) merge data from the other nodes (“repair followers”) and write the local differences to each node. Until now, the repair master calculated the difference with each follower independently and so wrote an SSTable corresponding to each follower. This created additional work for compaction and is now eliminated: the repair master writes a single SSTable containing all the data. #7525
  • When aggregating (for example, SELECT count(*) FROM tab), Scylla internally fetches pages until it is able to compute a result (in this case, it will need to read the entire table). Previously, each page fetch had its own timeout, starting from the moment the page fetch was initiated. As a result, queries continued to be processed long after the client gave up on them, wasting CPU and I/O bandwidth. This was changed to have a single timeout for all internal page fetches for an aggregation query, limiting the amount of time it can be alive. #1175
  • Scylla maintains system tables that track any SSTables that have large cells, rows, or partitions for diagnostics purposes. When tracked SSTables were deleted, Scylla deleted the records from the system tables. Unfortunately, this uses range tombstones, which are not (yet) well supported in Scylla. A series was merged to reduce the number of range tombstones and thus their impact on performance. #7668
  • Queries of Time Window Compaction Strategy tables now open SSTables in different windows as the query needs them, instead of all at once. This greatly reduces read amplification, as noted in the commit. #6418
  • For some time now, Scylla reshards (rearranges SSTables to contain data for one shard) on boot. This means we can stop considering multi-shard SSTables in compaction, as done here. This speeds up compaction a little. A similar change was done to cleanup compactions. #7748
  • During bootstrap, decommission, compaction, and reshape Scylla will separate data belonging to different windows (in Time Window Compaction Strategy) into different SSTables (to preserve the compaction strategy invariant). However, it did not do so for memtable flush, requiring a reshape if the node was restarted. It will now flush a single memtable into multiple SSTables, if needed. #4617

Reducing Latency Spikes

  • Large SSTables with many small partitions often require large bloom filters. The allocation for the bloom filters has been changed to allocate the memory in steps, reducing the chance of a reactor stall. #6974
  • The token_metadata structure describes how tokens are distributed across the cluster. Since each node has 256 tokens, this structure can grow quite large, and updating it can take time, causing reactor stalls and high latency. It is now updated by copying it in the background, performing the change, and swapping the new variant into the place of the old one atomically. #7220 #7313
  • A cleanup compaction removes data that the node no longer owns (after adding another node, for example) from SSTables and cache. Computing the token ranges that need to be removed from cache could take a long time and stall, causing latency spikes. This is now fixed. #7674
  • When deserializing values sent from the client, Scylla will no longer allocate contiguous space for this work. This reduces allocator pressure and latency when dealing with megabyte-class values. #6138
  • Continuing improvement of large blob support, we now no longer require contiguous memory when serializing values for the native transport, in response to client queries. Similarly, validating client bind parameters also no longer requires contiguous memory. #6138
  • The memtable flush process was made more preemptible, reducing the probability of a reactor stall. #7885

Get Scylla Open Source

Check out the release notes in full, and then head to our Download Center to get your own copy of Scylla Open Source 4.4.

DOWNLOAD SCYLLA NOW


Scylla Open Source Release 4.4

The Scylla team is pleased to announce the release of Scylla Open Source 4.4, a production-ready release of our open source NoSQL database.

Scylla is an open source, NoSQL database with superior performance and consistently low latencies.

Scylla 4.4 includes performance and stability improvements, as well as bug fixes (listed below).

Find the Scylla Open Source 4.4 repository for your Linux distribution here. Scylla 4.4 is also available as a Docker image, an EC2 AMI, and a GCP image.

Please note that only the last two minor releases of the Scylla Open Source project are supported. Starting today, only Scylla Open Source 4.4 and Scylla 4.3 are supported, and Scylla 4.2 is retired.


New Features

Timeout per Operation

There is now new syntax for setting timeouts for individual queries with “USING TIMEOUT”.  #7777

This is particularly useful when one has queries that are known to take a long time. Until now, you could either increase the timeout value for the entire system (with request_timeout_in_ms), or keep it low and see many timeouts for the longer queries. The new timeout-per-operation syntax allows you to define the timeout in a more granular way. Conversely, some queries might have tight latency requirements, in which case it makes sense to set their timeout to a small value. Such queries would time out faster, which means they won’t needlessly hold the server’s resources.

You can use the new TIMEOUT parameters for both queries (SELECT) and updates (INSERT, UPDATE, DELETE).

Examples:

SELECT * FROM t USING TIMEOUT 200ms;

INSERT INTO t(a,b,c) VALUES (1,2,3) USING TIMESTAMP 42 AND TIMEOUT 50ms;

Prepared statements work as usual: the timeout parameter can be explicitly defined or provided as a marker:

SELECT * FROM t USING TIMEOUT ?;

INSERT INTO t(a,b,c) VALUES (?,?,?) USING TIMESTAMP 42 AND TIMEOUT 50ms;

More Active Client Info

The system.clients table includes information on actively connected clients (drivers). Scylla 4.4 adds more fields to the table:

  • connection_stage
  • driver_name
  • driver_version
  • protocol_version

It also improves:

  • client_type – distinguishes CQL from Thrift just in case
  • username – now it displays the correct username if `PasswordAuthenticator` is configured.

#6946

Deployment and Packaging

  • Scylla is now available for Ubuntu 20
  • We now include node-exporter in rpm/deb/tar packages. For deb/rpm, this is the optional scylla-node-exporter subpackage. This simplifies installation, especially for air-gapped systems (without an Internet connection) #2190
  • Downgrading on Debian derivatives is now easier: installing the scylla metapackage with a specific version will pull in subpackages with the same version. #5514

Documentation

The project’s developer-oriented, in-tree documentation is now published to a new doc site using the Scylla Sphinx Theme. Over time we plan to move, and open source, all of the Scylla documentation from the current doc site to the Scylla project.

CDC

Change Data Capture (CDC) has been production ready since Scylla 4.3.

The CDC API has changed between Scylla 4.3 and Scylla 4.4. If you are using CDC with Scylla 4.3, please refer to the Scylla Docs for more info.

#8116

New reference examples of CDC consumer implementations (Go and Java) are available.

Note that only the Go consumer currently uses the end-of-record column.

Alternator

Now supports nested attribute paths in all expressions #8066

Additional Features

  • It is now possible to ALTER some properties of system tables, for example update the speculative_retry for the system_auth.roles table #7057
    Scylla will still prevent you from deleting columns that are needed for its operation, but will allow you to adjust other properties.
  • The removenode operation has been made safe. Scylla will now collect and merge data from the other nodes to preserve consistency with quorum queries.

Tools and APIs

  • API: There is now a new API to force removal of a node from gossip. This can be used if information about a node lingers in gossip after it is gone. Use with care as it can lead to data loss. #2134

Performance Optimizations – I/O Scheduler 2.0

The Seastar I/O scheduler is used to maximize request throughput from all shards to the storage. Until now, the scheduler ran in a per-shard scope: each shard ran its own scheduler, balancing between its I/O tasks, such as reads, updates, and compactions. This works well when the workload between shards is approximately balanced; but when, as often happened, one shard was more loaded, it could not take more I/O, even if other shards were not using their share.

I/O scheduler 2.0 included in Scylla 4.4 fixes this. As storage bandwidth and IOPS are shared, each shard can use the whole disk if required.

More Performance Optimizations

  • Repair works by having the repairing node (“repair master”) merge data from the other nodes (“repair followers”) and write the local differences to each node. Until now, the repair master calculated the difference with each follower independently and so wrote an SSTable corresponding to each follower. This created additional work for compaction and is now eliminated: the repair master writes a single SSTable containing all the data. #7525
  • When aggregating (for example, SELECT count(*) FROM tab), Scylla internally fetches pages until it is able to compute a result (in this case, it will need to read the entire table). Previously, each page fetch had its own timeout, starting from the moment the page fetch was initiated. As a result, queries continued to be processed long after the client gave up on them, wasting CPU and I/O bandwidth. This was changed to have a single timeout for all internal page fetches for an aggregation query, limiting the amount of time it can be alive. #1175
  • Scylla maintains system tables tracking SSTables that have large cells, rows, or partitions for diagnostics purposes. When tracked SSTables were deleted, Scylla deleted the records from the system tables. Unfortunately, this uses range tombstones, which are not (yet) well supported in Scylla. A series was merged to reduce the number of range tombstones and thus their impact on performance. #7668
  • Queries of Time Window Compaction Strategy tables now open SSTables in different windows as the query needs them, instead of all at once. This greatly reduces read amplification, as noted in the commit. #6418
  • For some time now, Scylla reshards (rearranges SSTables to contain data for one shard) on boot. This means we can stop considering multi-shard SSTables in compaction, as done here. This speeds up compaction a little. A similar change was done to cleanup compactions. #7748
  • During bootstrap, decommission, compaction, and reshape Scylla will separate data belonging to different windows (in Time Window Compaction Strategy) into different SSTables (to preserve the compaction strategy invariant). However, it did not do so for memtable flush, requiring a reshape if the node was restarted. It will now flush a single memtable into multiple SSTables, if needed. #4617

The following updates eliminate latency spikes:

  • Large SSTables with many small partitions often require large bloom filters. The allocation for the bloom filters has been changed to allocate the memory in steps, reducing the chance of a reactor stall. #6974
  • The token_metadata structure describes how tokens are distributed across the cluster. Since each node has 256 tokens, this structure can grow quite large, and updating it can take time, causing reactor stalls and high latency. It is now updated by copying it in the background, performing the change, and swapping the new variant into the place of the old one atomically.  #7220 #7313
  • A cleanup compaction removes data that the node no longer owns (after adding another node, for example) from SSTables and cache. Computing the token ranges that need to be removed from cache could take a long time and stall, causing latency spikes. This is now fixed. #7674
  • When deserializing values sent from the client, Scylla will no longer allocate contiguous space for this work. This reduces allocator pressure and latency when dealing with megabyte-class values. #6138
  • Continuing improvement of large blob support, we now no longer require contiguous memory when serializing values for the native transport, in response to client queries. Similarly, validating client bind parameters also no longer requires contiguous memory. #6138
  • The memtable flush process was made more preemptible, reducing the probability of a reactor stall. #7885
  • Fixed a large allocation in mutation_partition_view::rows(). #7918
  • Fixed a potential reactor stall on LCS compaction completion. #7758

Configuration

  • Hinted handoff configuration can now be changed without restart. #5634

Monitoring and Log Collection

  • Scylla Monitoring 3.6 adds support for log collection using Grafana Loki.
    A new setup utility in Scylla allows users to configure rsyslog to send logs to Loki or any other log collection server supporting the rsyslog format. #7589
  • New metrics report CQL errors #5859
  • There are now metrics for the different types of transport-level messages:
    • STARTUP
    • AUTH_RESPONSE
    • OPTIONS
    • QUERY
    • PREPARE
    • EXECUTE
    • BATCH
    • REGISTER
  • and for overload indicators:
    • Reads_shed_due_to_overload – The number of reads shed because the admission queue reached its max capacity. When the queue is full, excessive reads are shed to avoid overload
    • Writes_failed_due_to_too_many_in_flight_hints – number of CQL write requests which failed because the hinted handoff mechanism is overloaded and cannot store any more in-flight hints

#4888

  • For all metrics changes from Scylla 4.3 to Scylla 4.4, see here

Build

  • The compiler used to build Scylla has been changed from gcc to clang. With clang we get a working implementation of coroutines, which are required for our Raft implementation. #7531

Debugging

Scylla will now log a memory diagnostic report when it runs out of memory. Among other data, the report includes free and used memory of the LSA, Cache, Memtable and system, as well as memory pools. See an example report in the commit log. #6365
A new tool, scylla-sstable-index, is now available to examine SSTable Index.db files. The tool is not yet packaged.
Scylla uses a type called managed_bytes to store serialized data. The data is stored fragmented in cache and memtable, but contiguous outside them, with automatic conversion to contiguous storage on access. This made it difficult to detect when such conversions happen (so we can eliminate them), so a large patch series made all conversions explicit and removed the automatic conversion. #7490

Bugs Fixed in the Release

For a full list use git log

  • CQL: Invalid aggregation result on table with index: when using aggregates (COUNT(*)) on a table with an index on a clustering key and filtering on that clustering key, wrong results were returned. #7355
  • CQL: Keyspaces have a durable_writes attribute that says whether to use the commitlog or not. Now, when changing it, the effects take place immediately. #3034
  • Stability: Cleanup compaction in KA/LA SSTables may crash the node in some cases #7553 (also part of Scylla 4.2.2)
  • Stability: ‘ascii’ type column isn’t validated, and one could create illegal ascii values by using CQL without bind variables #5421
  • Alternator AttributesToGet breaks QueryFilter #6951 (also part of Scylla 4.2.2)
  • Stability: From time to time, Scylla needs to move SSTables from one directory to another. If Scylla crashed at exactly the wrong moment, it could leave valid SSTables in both the old and the new places. #7429
  • Install: Scylla does not start when kernel inotify limits are exceeded #7700 (also part of Scylla 4.2.2)
  • Install: missing /etc/systemd/system/scylla-server.service.d/dependencies.conf on scylla-4.4 rpm #7703 (also part of Scylla 4.2.2)
  • Stability: Query with multiple indexes, one date one boolean, fail with ServerError #7659
  • Thrift: The ability to write to non-thrift tables (tables which were created by CQL CREATE TABLE) from thrift has been removed, since it wasn’t supported correctly. Use CQL to write to CQL tables. #7568
  • Stability: SSTable reshape for Size-Tiered Compaction Strategy has been fixed. Reshape happens when the node starts up and finds its SSTables do not conform to the strategy invariants, or on import. It then compacts the data so the invariants are preserved, guaranteeing reasonable read amplification. A bug caused reshape to be ignored for this compaction strategy. #7774
  • Stability: When joining a node to a cluster, Scylla will now verify the correct snitch is used in the new node. #6832
  • Stability: Wild pointer dereferenced when try_flush_memtable_to_sstable on shutdown fails to create sstable in a dropped keyspace #7792
  • Stability: Scylla records large data cells in the system.large_rows table. If such a large cell was part of a static row, Scylla would crash. #6780
  • CQL: CQL prepared statements incomplete support for “unset” values #7740
  • Stability: Crash in row cache if a partition key is larger than 13k, with a managed_bytes::do_linearize() const: Assertion `lc._nesting' failed error  #7897
  • CQL: min/max aggregate functions are broken for timeuuid #7729
  • CQL: paxos_grace_seconds table option: prevent applying a negative value #7906
  • LWT: Provide default for serial_consistency even if not specified in a request #7850
  • Redis: redis: parse error message is broken #7861 #7114
  • CQL: Restrictions missing support for “IN” on tables with collections, added in Cassandra 4.0 #7743 #4251
  • Stability: Possible schema discrepancy while updating secondary index computed column #7857
  • Stability: filesystem error: open failed: Too many open files and coredump during resharding with 5000 tables #7439
  • Stability: leveled_compaction_strategy: fix boundary of maximum SSTable level #7833
  • Packaging: seastar-cpu-map.sh missing from PATH #6731
  • Stability: Scylla crashes on SELECT by indexed part of the pkey with WHERE like "key = X AND key = Y" #7772
  • Stability: a race condition can cause coredump in truncating a table after refresh #7732
  • Gossip ACK message in the wrong scheduling group cause latency spikes in some cases #7986
  • Stability: nodetool repair and nodetool removenode commands failed during repair process running #7965
  • Stability: [CDC] nodes aborting with coredump when writing large row #7994
  • Stability: Unexpected partition key when repairing with a different number of shards may result in an error similar to: “WARN  2020-11-04 19:06:41,168 [shard 1] repair - repair_writer: keyspace=ks, table=cf, multishard_writer failed: invalid_mutation_fragment_stream”.
    This error may occur when running repair between nodes with different core counts. #7552 Seastar #867
  • Scylla on AWS: When psutil.disk_partitions() reports / is /dev/root, aws_instance mistakenly reports the root partition as part of the ephemeral disks, and RAID construction will fail. #8055
  • Alternator: unexpected ValidationException in ConditionExpression and FilterExpression #8043
  • Stability: stream_session::prepare is missing a string format argument #8067
  • Scylla on GCE: scylla_io_setup did not configure pre-tuned GCE instances correctly #7341
  • install: dist/debian: node-exporter package does not install systemd unit #8054
  • Stability: Compaction Backlog controller will misbehave for time series use cases #6054
  • Monitoring: False 1 second latency reported in I/O queue delay metrics #8166
  • Stability: TWCS reshape was silently ignoring windows which contain at least min_threshold SSTables (can happen with data segregation). #8147
  • Packaging: dist/debian: node-exporter files mistakenly packaged in scylla-conf package #8163
  • scylla_setup failed to setup with error /etc/bashrc: line 99: TMOUT: readonly variable #8049
  • Performance: SSTables being compacted by Cleanup, Scrub, Upgrade can potentially be compacted by regular compaction in parallel #8155
  • Stability: Failing to start Scylla after a machine reboot: “Startup failed: seastar::rpc::timeout_error (rpc call timed out)” #8187
  • Stability: a regression introduced in 4.4 in cdc generation query for Alternator (Stream API) #8210
  • Stability: Split CDC streams table partitions into clustered rows #8116
  • Install: scylla_raid_setup returns ‘mdX is already using‘ even when the device is unused #8219
  • Stability: a regression on cleanup compaction’s space requirement introduced in Scylla 4.1, due to unlimited parallelism #8247
  • Stability: Scylla crash when the IN marker is bound to null #8265
  • Stability: segfault due to corrupt _last_status_change map in cql_transport::cql_server::event_notifier::on_down during shutdown #8143

DOWNLOAD SCYLLA NOW


Apache Cassandra World Party | April 2021

The COVID-19 pandemic has taken a toll on a lot of things, and one of those is our ability to interact as a community. There have been no in-person conferences or meetups for over a year now. The Cassandra community has always thrived on sharing with each other at places like ApacheCon and the Cassandra Summit. With Cassandra 4.0, we have a lot to celebrate!


This release will be the most stable database ever shipped, and Cassandra has become one of the most important databases running today. It is responsible for the biggest workloads in the world and because of that, we want to gather the worldwide community and have a party.

We thought we would do something different for this special event that reflects who we are as a community. Our community lives and works in every timezone, and we want to make it as easy as possible for everyone to participate, so we’ve decided to use an Ignite-style format. If you are new to this, here’s how it works:

  • Each talk is only 5 minutes long.
  • You get ten slides and they automatically advance every 30 seconds.
  • To get you started, we have a template ready here.
  • You can do your talk in the language of your choice. English is not required.
  • Format: PDF (Please no GIFs or videos)

When you are ready, you can use this link to submit your talk idea by April 9, 2021. It’s only five minutes but you can share a lot in that time. Have fun with the format and encourage other people in your network to participate. Diversity is what makes our community stronger so if you are a newcomer, don’t let that put you off – we’d love you to share your experiences.

What?

One-day virtual party on Wednesday, April 28, with three one-hour sessions so you can celebrate with the Cassandra community in your time zone – or attend all three!

When?

Join us at the time most convenient for you:

  • April 28 5:00am UTC
  • April 28 1:00pm UTC
  • April 28 9:00pm UTC

Register here. In the meantime, please submit your 5-minute Cassandra talk by April 9!

Whether you’re attending or speaking, all Apache Cassandra™ World Party participants must adhere to the code of conduct.

Got questions about the event? Please email events@constantia.io.