Connect Faster to Scylla with a Shard-Aware Port

In Scylla Open Source 4.3, we introduced a feature called the shard-aware port. In addition to listening for CQL driver connections on the usual port (by default 9042), Scylla now also listens for connections on another port (by default 19042). The new port offers a more robust method of connecting to the cluster, one that decreases the number of connection attempts the driver needs to make.

In this blog post, we will describe the problems that arise when using the old port and how the new port helps to overcome them.

Token-awareness, Shard-awareness

To understand the problem the shard-aware port solves, we need to remind ourselves what token-awareness and shard-awareness are.

Scylla achieves its high scalability by splitting data in two dimensions:

  • Horizontally – a single Scylla cluster consists of multiple nodes, each of them storing a portion of the total data.
  • Vertically – a single Scylla node is further split into multiple shards. Each shard is exclusively handled by a single CPU core and handles a subset of the node’s data.

(For a more in-depth look at the partitioning scheme, I refer you to another blog post here.)

All Scylla nodes and shards are equal in importance, and are capable of serving any read and write operations. We call the node which handles a given write or read request a coordinator. Because data in the cluster is replicated – a single piece of data is usually kept on multiple nodes — the coordinator has to contact multiple nodes in order to satisfy a read or write request. We call those nodes replicas.

It’s not hard to see why it is desirable for the coordinator to also be a replica. In such a case, the coordinator has one less node to contact over the network — it can serve the operation using local memory and/or disk. This reduces the overall network usage and the total latency of the operation.

Similarly, it is best to send the request to the right shard. If you choose a wrong shard, the right one will still be contacted to perform the operation, but this is more costly than using the right shard from the beginning and may incur a penalty in the form of higher latency.

The strategy of choosing the right node as the coordinator is called token-awareness, while choosing the right shard is called shard-awareness.

Usually, when doing common operations on single partitions using prepared statements, the driver has enough information to calculate the token of the partition. Based on the token, it’s easy to choose the optimal node and shard to act as the coordinator.

Problems with the Old Approach

For a driver to be token-aware, it is enough to have a connection open for every node in the cluster. However, in Scylla, each connection is handled by a single shard for its whole lifetime.

Therefore, to achieve shard-awareness, the driver needs to keep a connection open to each shard for every node.
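
To make this concrete, the connection pool of a shard-aware driver can be thought of as a map keyed by (node, shard). Below is a minimal illustrative sketch; the type and method names are assumptions, not taken from any particular driver:

#include <map>
#include <memory>
#include <string>
#include <utility>

// Illustrative placeholder for a CQL connection, which Scylla pins to one shard.
struct Connection;

// A shard-aware pool keeps one connection per (node, shard) pair, so a cluster
// of N nodes with S shards each needs N * S connections in total.
class ShardAwarePool {
    std::map<std::pair<std::string, unsigned>, std::shared_ptr<Connection>> conns_;
public:
    void add(const std::string& host, unsigned shard, std::shared_ptr<Connection> c) {
        conns_[{host, shard}] = std::move(c);
    }
    // Returns the connection owned by the given shard, or nullptr if that shard
    // is not covered yet (the request is then routed through some other shard).
    std::shared_ptr<Connection> find(const std::string& host, unsigned shard) const {
        auto it = conns_.find({host, shard});
        return it == conns_.end() ? nullptr : it->second;
    }
};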

The problem is that previously a driver had no way to communicate to a Scylla node which specific shard should handle the connection. Instead, Scylla chose the shard that handled the least number of connections at the moment and assigned it to the new connection.

This method works well in some scenarios — for example, consider a single application connecting to a node with N shards. It wants to establish one connection for each shard, so it opens N connections, either sequentially or in parallel. If the node does not handle any other clients at the moment, this method will assign a different shard for each connection — each time a new connection is processed, there will be a shard which doesn’t handle any connections at the moment, and it will be chosen. In this scenario, this algorithm of assigning shards to connections works great.

However, there are realistic scenarios in which this algorithm does not perform so well.

Consider a situation in which there are multiple clients connecting at once — for example, after a node was restarted. Connection requests from all clients will now interleave with each other, and for each client there is a chance that it will connect to the same shard more than once. If this happens, the driver has to retry connecting for every shard that was not covered by a previous attempt.

An even worse situation occurs if there are some non-shard-aware applications connected to a node such that one shard handles many more connections than the others. The shard assignment algorithm will then be very reluctant to assign the most loaded shard to a new connection. The driver will have to keep establishing excess connections to other shards until the connection counts on each shard equalize. Only then will the unlucky shard have a chance of being assigned.

All of this can cause a driver to take some time before it manages to cover all shards with connections. During that time, if it is not connected to a shard, all requests which are supposed to be routed to it must be sent to another shard. This impacts latency.

Introducing the “Shard-Aware” Port

In order to fix this problem we made it possible for the driver to choose the shard that handles a new connection. We couldn’t do it by extending the CQL protocol: due to implementation specifics of Seastar — the framework used to implement Scylla — migrating an already-established connection to another shard would be very hard to implement. Instead, Scylla now listens for connections on the shard-aware port. The new port uses a different algorithm to assign shards — the chosen shard is the connection’s source port modulo the number of shards.
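
For illustration, here is a minimal sketch of how a driver could pick a local source port so that a connection made to the shard-aware port lands on a desired shard. The ephemeral port range and the helper name are assumptions for the sketch, not the exact logic of any particular driver:

#include <cstdint>

// On the shard-aware port, Scylla assigns a connection to shard
// (source_port % nr_shards). A driver that wants to reach desired_shard can
// therefore bind its socket to a local port with that remainder before
// connecting. The ephemeral range below is an assumption for the sketch.
constexpr uint16_t kFirstPort = 49152;
constexpr uint16_t kLastPort  = 65535;

// Returns the first candidate source port >= kFirstPort whose remainder
// modulo nr_shards equals desired_shard. If binding that port fails because
// it is already taken, the caller would try port + nr_shards, and so on.
uint16_t source_port_for_shard(unsigned desired_shard, unsigned nr_shards) {
    uint16_t port = kFirstPort
                    + (nr_shards - kFirstPort % nr_shards) % nr_shards
                    + desired_shard;
    return port;  // next candidates: port + nr_shards, port + 2 * nr_shards, ...
}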

The drivers which support connecting over the shard-aware port will use it, if available, to establish connections to missing shards. This greatly reduces the number of connection attempts a driver has to make when multiple drivers connect at the same time. It is also immune to the problem caused by connections being unevenly distributed across shards due to non-shard-aware drivers.

An Experiment

To show that connecting to the cluster over the shard-aware port is more robust than the old approach, we ran a test which simulates a scenario in which multiple clients are connecting to the same node at the same time. We used our benchmarking application scylla-bench, which uses our fork of the GoCQL driver.

Our setup looked like this:

  • Cluster: 3 nodes, each using i3.8xlarge instance type with Scylla 4.3 installed. Each node had 30 shards.
  • Loaders: 3 instances of type c5.9xlarge. On each loader we ran 50 scylla-bench instances in parallel, each writing 5k rows per second with concurrency 100.

During the test we restarted one of the database nodes and observed how many connections it took until all scylla-bench instances managed to establish a connection per shard.

We ran the test with an older version of the driver that does not support the shard-aware port (1.4.3) and one that does (1.5.0).

Fig 1. Active connection count to a node after it was restarted. On the left — GoCQL version 1.4.3, on the right — version 1.5.0.

Fig 2. Total count of accepted connections during node’s runtime. Both left and right graphs show the number of accepted connections starting from node’s restart. On the left — GoCQL version 1.4.3, on the right — version 1.5.0.

We can see that the number of connection attempts is much lower for the version of GoCQL which supports the shard-aware port — it’s very close to the desired number of connections, which is 4,500 (3 loaders * 50 scylla-bench instances * 30 shards). It’s not exactly 4,500 because a small percentage of connections timed out, which triggered fallback logic that attempts connecting to the non-shard-aware port. The number 4,830 is still much better than 15,830 — more than three times fewer.

Limitations

The shard-aware port solution is not a perfect one, and in some circumstances it won’t work properly due to network configuration issues. While our drivers are designed to automatically detect such issues and fall back to the non-shard-aware port if necessary, it’s good to be aware of them so that you may either fix them, or disable this feature — so that your application doesn’t waste time detecting that the shard-aware port doesn’t work.

Unreachable shard-aware port. After establishing the first connection on the non-shard-aware port, a driver will try connecting to other shards using the shard-aware port. If the port is not reachable or connecting times out, it will fall back to the non-shard-aware port.

Client-side NAT. The power of the shard-aware port comes from the fact that the driver can choose which shard to connect to by specifying a particular source port. If your client application is behind a NAT which modifies source ports, it loses this ability. Our drivers should detect that they have connected to an incorrect shard and fall back to the non-shard-aware port.

If you cannot fix those issues and suspect that using the non-shard-aware port instead might help you, you can disable this feature (or not enable it in the first place — it depends on the driver). Please refer to the documentation of your driver in order to learn how to do it.

Supported drivers and further reading

We are planning to add this feature to all drivers that are maintained by us. As of writing this blog, two of our drivers have full support for this feature merged and released:

  • Supported: Our fork of GoCQL, starting from version 1.5.0. See the README for details.
  • Supported: C++ driver for Scylla, starting from version 2.15.2-1. See the documentation for details.
  • Mostly supported: Scylla Rust Driver, starting from version 0.1.0. The shard-aware port is detected automatically and used if enabled in Scylla, but the fallback mitigations described in the “Limitations” section are not implemented at the time of writing this blog. See the README for more information about the driver.
  • Not supported yet: This feature is not yet implemented in our Python driver or our Java driver, but we intend to support it in both in the near future.


On Coordinated Omission

Many of us have heard of Coordinated Omission (CO), but few understand what it means, how to mitigate the problem, and why anyone should bother. Or, perhaps, this is the first time you have heard of the term. So let’s begin with a definition of what it is and why you should be concerned about it in your benchmarking.

The shortest and most elegant explanation of CO I have ever seen was from Daniel Compton in Artillery #721.

Coordinated omission is a term coined by Gil Tene to describe the phenomenon when the measuring system inadvertently coordinates with the system being measured in a way that avoids measuring outliers.

To that, I would add: …and misses sending requests.

To describe it in layman’s terms, let’s imagine a coffee-fueled office. Each hour a worker has to make a coffee run to the local coffee shop. But what if there’s a road closure in the middle of the day? You have to wait a few hours to go on that run. Not only is that hour’s particular coffee runner late, but all the other coffee runs get backed up for hours behind that. Sure, it takes the same amount of time to get the coffee once the road finally opens, but if you don’t measure that gap caused by the road closure, you’re missing measuring the total delay in getting your team their coffee. And, of course, in the meanwhile you will be woefully undercaffeinated.

CO is a concern because if your benchmark or a measuring system is suffering from it, then the results will be rendered useless or at least will look a lot more positive than they actually are.

Over the last 10 years, most benchmark implementations have been corrected by their authors to account for coordinated omission. Even so, using benchmark tools still requires knowledge to produce meaningful results and to spot any CO occurring: by default they do NOT respect CO.

Let’s take a look at how Coordinated Omission arises and how to solve it.

The Problem

Let’s consider the most popular use case: measuring the performance characteristics of a web application – a system that consists of a client that makes requests to the database. There are different parameters we can look at. What are we interested in?

The first step in measuring the performance of a system is understanding what kind of system you’re dealing with.

For our purposes, there are basically two kinds of systems: open model and closed model. This 2006 research paper gives a good breakdown of the differences between the two. For us, the important feature of open-model systems is that new requests arrive independent of how quickly the system processes requests. By contrast, in closed-model systems new job arrivals are triggered only by job completions.

An example of a real-world open system is your typical website. Any number of users can show up at any time to make an HTTP request. In our coffee shop example, any number of users could show up at any time looking for their daily fix of caffeine.

In code, it may look as follows:

  • Variant 1. A client example of an open-model system (client-database). The client supplies requests at the speed of spawning threads; thus the request arrival rate does not depend on the database throughput.
for (;;) {
  std::thread([]() {
    make_request("make a request");
  }).detach();
}

As you can see, in this case you will open a thread for each request. The number of in-flight requests is unbounded.

An example of a closed system in real life is an assembly line. An item gets worked on, and moves to the next step only upon completion. The system admits only new items to be worked on once the former one moves on to the next step.

  • Variant 2. A client example of a closed-model system (client-database) where the request arrival rate equals the service rate (the rate at which the system can handle requests):
std::thread([]() {
  for (;;) {
    make_request("make a request");
  }
}).join();

Here, we open a thread and make requests one at a time. The maximum number of requests in flight at any moment is 1.

In our case, since there are clients that generate requests independently of the system, the system follows the open model. It turns out that the benchmarking load generation tools we all use have the same structure.

One way we can assess the performance of the system is by measuring the latency value for a specific level of utilization. To do that, let’s define a few terms we’ll use:

  • System Utilization is how busy the system is processing requests [time busy / time overall].
  • Throughput is the number of processed requests in a unit of time. The higher the Throughput, the higher the Utilization.
  • Latency is the total response time for a request, which consists of the time spent waiting to be processed (Waiting Time) and the processing itself (Service Time). (Latency = Waiting Time + Service Time.)

In general, throughput and latency in code look like this:

auto start = now();
for (int i = 0; i < total; i++) {
  auto request_start = now();
  make_request(i);
  auto request_end = now();

  auto latency = request_end - request_start;
  save(latency);
}

auto end = now();
auto duration = end - start;

auto throughput = total / duration;

As users we want our requests to be processed as fast as possible. The lower the latency, the better. The challenge for benchmarking is that latency varies with system utilization. For instance, consider this theoretical curve, which should map to the kinds of results you will see in actual testing:

Response time (Latency) vs Utilization [R = 1/(µ(1 − ρ))] for an open system
[from Rules in PAL: the Performance Analysis of Logs tool, figure 1]

With more requests and higher throughput, utilization climbs up to its limit of 100% and latency rises to infinity. The goal of benchmarking is to find the optimal point on that curve, where utilization (or throughput) is highest, with latency at or below the target level. To do that we need to ask questions like:

  • What is the maximum throughput of the system when the 99th latency percentile (P99) is less than 10 milliseconds (ms)?
  • What is the system latency when it handles 100,000 QPS?

It follows that to measure performance, we need to simulate a workload. The exact nature of that workload is up to the researcher’s discretion, but ideally, it should reflect typical application usage.

So now we have two problems: generating the load for the benchmark and measuring the resulting latency. This is where the problem of Coordinated Omission comes into play.

Load Generation

The easiest way to create a load is to generate a required number of requests (X) per second:

// fire X requests, one detached thread per request
for (int i = 0; i < X; i++) {
    std::thread([]() {
      make_request("make a request");
    }).detach();
}

The problem comes when X reaches a certain size. Modern NoSQL in-memory systems can handle tens of thousands of requests per second per CPU core. Each node will have several cores available, and there will be many nodes in a single cluster. We can expect target throughput (X) to be 100,000 QPS or even 1,000,000 QPS, requiring hundreds of thousands of threads. This approach has the following drawbacks:

  • Creating a thread is a relatively expensive operation: it can easily take 200 microseconds (⅕ ms) and more. It also eats kernel resources.
  • Scheduling many threads is expensive as they need to context switch (1-10 microseconds) and share a single limited resource – CPU.
  • There may be a maximum number of threads limit in place (see RLIMIT_NPROC and this thread on StackExchange)

Even if we can afford those costs, just creating 100,000 threads takes a lot of time:

#include <thread>

int main(int argc, char** argv) {
  for (int i = 0; i < 100000; i++) {
    std::thread([](){}).detach();
  }
  return 0;
}

$ time ./a.out

./a.out  0.30s user 1.38s system 183% cpu 0.917 total

We want to spend CPU resources making requests to the target system, not creating threads. We can’t afford to wait 1 second to make 100,000 requests when we need to issue 100,000 requests every second.

So, what to do? The answer is obvious: let’s stop creating threads. We can preallocate them and send requests like this:

// Every second generate X requests with only N threads where N << X:

for (int i = 0; i < N; i++) {
  std::thread([]() {
    for (;;) {
      make_request("make a request");
    }
  }).detach();
}

Much better. Here we allocate only N threads at the beginning of the program and send requests in a loop.

But now we have a new problem.

We meant to simulate a workload for an open-model system but wound up with a closed-model system instead. (Compare again to Variant 1 and Variant 2 mentioned above.)

What’s the problem with that? The requests are sent synchronously and sequentially. That means that every worker-thread may send only some bounded number of requests in a unit of time. This is the key issue: the worker performance depends on how fast the target system processes its requests.

For example, suppose:

Response time = 10 ms

A thread sending requests sequentially may send at most 1000 [ms/sec] / 10 [ms/req] = 100 [req/sec]

Outliers may take longer, so you need to provide a sufficient number of threads for your load to be covered. For example:

Target throughput = 100,000 QPS (expect it to be much more in real test)
Expected latency P99 = 10 ms

1 worker may expect a request to take up to 10 ms for 99% of cases

We expect 1 worker must be able to send 1,000 [ms/sec] / 10 [ms/req] = 100 [requests/sec]

Target throughput / Requests per worker = 100,000 [QPS] / 100 [QPS/Worker] = 1,000 workers

At this point, it must be clear that with this design a worker can’t send a new request before the previous one has been processed. That means we need a sufficient number of workers to cover our throughput goal. To make the workers fulfill that goal, we need a schedule that determines exactly when each worker has to send a request.
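
Following the arithmetic above, the required number of workers can be computed directly. Below is a minimal sketch; the function name and parameters are illustrative:

#include <cmath>
#include <cstddef>

// Workers needed so that a closed-model loader can sustain the target
// throughput, assuming each worker sends requests sequentially and a single
// request may take up to expected_latency_ms (e.g. the expected P99).
std::size_t workers_needed(double target_qps, double expected_latency_ms) {
    double per_worker_qps = 1000.0 / expected_latency_ms;  // requests per second per worker
    return static_cast<std::size_t>(std::ceil(target_qps / per_worker_qps));
}

// Example from the text: workers_needed(100000, 10) == 1000.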

Let’s define a schedule as a request generation plan that defines points in time when requests must be fired.

A schedule: four requests uniformly spread in a unit of time:
1000 [ms/sec] / 4 [req] = 250 ms between consecutive requests

There are two ways to implement a worker schedule: static and dynamic. A static schedule is a function of the start timestamp; the firing points don’t move. A dynamic schedule is one where the next firing starts after the last request has completed but not before the minimum time delay between requests.
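
As a minimal sketch of the two firing-point rules described above (names are illustrative; the static rule is the same one used by the queuing example later):

#include <algorithm>
#include <chrono>

using namespace std::chrono;

// Static schedule: the i-th firing point depends only on the start time.
steady_clock::time_point next_static(steady_clock::time_point start,
                                     milliseconds t_per_req, size_t i) {
    return start + (i - 1) * t_per_req;
}

// Dynamic schedule: the next firing point moves with the previous request;
// it starts after the last request completed, but not before the minimum
// delay between requests has elapsed.
steady_clock::time_point next_dynamic(steady_clock::time_point last_fired,
                                      steady_clock::time_point last_completed,
                                      milliseconds min_delay) {
    return std::max(last_completed, last_fired + min_delay);
}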

A key characteristic of a schedule is how it processes outliers—moments when the worker can’t keep up with the assigned plan, and the target system’s throughput appears to be lower than the loader’s throughput.

This is exactly the first part of the Coordinated Omission problem. There are two approaches to outliers: Queuing and Queueless.

A good queueing implementation tracks requests that have to be fired and tries to fire them as soon as it can to get the schedule back on track. A not-so-good implementation shifts the entire schedule by pushing delayed requests into the next trigger point.

By contrast, a Queueless implementation simply drops missed requests.

Queueing

The Queuing approach for a static schedule is the most reliable and correct. It is used in most implementations of the benchmark tools: YCSB, Cassandra-stress, wrk2, etc. It does not ignore requests it had to send. It queues and sends them as soon as possible, trying to get back on the schedule.

For example:

Our worker’s schedule: 250ms per request

A request took too long: 1 second instead of 250ms

To get back on track we have to send 3 additional requests

A static schedule with queuing uses a function that defines when each request needs to be sent:

// f(request) = start + (request - 1) * time_per_request

template <class Clock>
time_point<Clock> next(time_point<Clock> start, milliseconds t, size_t request) {
  return start + (request - 1) * t;
}

template <class Clock>
void wait_next(time_point<Clock> start, milliseconds t, size_t request) {
  sleep_until(next(start, t, request));
}

It also maintains a counter of requests sent. Those 2 objects are enough to implement queuing:

for (size_t worker = 0; worker < workers; worker++) {
    std::thread([=]() {
      for (size_t req = 1; req <= worker_requests; req++) {
        make_request(req);
        wait_next(start, t_per_req, req + 1);
      }
    }).detach();
}

This implementation fulfills all the desired properties: a) it does not skip requests even if it has missed their firing points; b) it tries to minimize schedule violations if something goes wrong and returns to the schedule as soon as it can.

Dynamic scheduling with queuing uses a global rate limiter, which gives each worker a token and counts requests as they’re sent. For example:

RateLimiter limiter(throughput, /* burst = */ 0);
for (size_t worker = 0; worker < workers; worker++) {
    // capture the limiter by reference so all workers share the same one
    std::thread([&]() {
      for (size_t req = 1; req <= worker_requests; req++) {
        limiter.acquire(1);
        make_request(req);
      }
    }).detach();
}

The drawback of this approach is that the global limiter is potentially a contention point, and the overall schedule constantly deviates from the desired value. Most implementations use "burst = 0," which makes them susceptible to coordinating with the target system when the system’s throughput turns out to be lower than the target.

For example, take a look at the Cockroach YCSB Go port and this.

Usage of local rate limiters makes the problem even worse because each of the threads can deviate independently.

Queueless

As the name suggests, a Queueless approach to scheduling doesn’t queue requests. It simply skips ones that aren’t sent on the proper schedule.

Our worker’s schedule: 250ms per request

Queueless implementation that ignores requests that it did not send

Queueless implementation that ignores schedule

This approach is relatively common, as it seems simpler than queuing. It is typically built around a rate limiter, regardless of the rate limiter implementation (see Buckets, Tokens), and it finishes at a certain time (for example, after 1 hour) regardless of how many requests were actually sent. With this approach, skipped requests must be simulated.

RateLimiter r(worker_throughput, /* burst = */ 0);

size_t req = 0;
while (now() < end) {
  r.acquire(1);          // wait for a token; requests missed in the meantime are simply skipped
  make_request(++req);
}

Overall, there are two approaches to generating load with a closed-model loader that simulates an open-model system. If you can, use queuing together with a static schedule: it does not omit requests that the schedule calls for, and it always converges your performance back to the expected plan.

Measuring latency

A straightforward way to measure an operation’s latency is the following:

auto start = now();
make_request(i);
auto end = now();

auto latency = end - start;

But this is not exactly what we want.

The system we are trying to simulate sends a request every [1 / throughput] unit of time regardless of how long it takes to process them. We simulate it by means of N workers (threads) that send a request every [1 / worker_throughput], where worker_throughput ≤ throughput and N * worker_throughput = throughput.

Thus, a worker has a [1/N] schedule that reflects a part of the schedule of the simulated system. The question we need to answer is how to map the latency measurements from the 1/N worker’s schedule to the full schedule of the simulated open-model system. It’s a straightforward process until requests start to take longer than expected.

Which brings us to the second part of the Coordinated Omission problem: what to do with latency outliers.

Basically, we need to account for workers that send their requests late. As far as the simulated system is concerned, the request should have been fired on time, so measuring only the service time makes latency look shorter than it really was. We need to adjust the latency measurement to account for the delay in sending the request.

Request latency = (now() – intended_time) + service_time

where

service_time = end – start
intended_time(request) = start + (request – 1) * time_per_request

This is called latency correction.

Latency consists of time spent waiting in the queue plus service time. A client’s service time is the time between sending a request and receiving the response, as in the example above. The waiting time for the client is the time between the point in time a request was supposed to be sent and the time it was actually sent.

We expect every request to take less than 250ms

1 request took 1 second

1 request took 1 second; o = missed requests and their intended firing time points

So the second request must be counted as:

1st request = 1 second
2nd request = [now() – (start + (request – 1) * time_per_request)] + service time =

[now() – (start + (2 – 1) * 250 ms)] + service time =
750ms + service time
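
Putting the formulas above into code, a corrected measurement for a statically scheduled worker could look like the following sketch, which reuses the next() schedule function from the queuing example; the remaining names are illustrative:

// Corrected latency = waiting time + service time, where waiting time is
// counted from the intended firing point of the schedule rather than from
// the moment the worker actually got around to sending the request.
template <class Clock>
typename Clock::duration corrected_latency(time_point<Clock> start,
                                           milliseconds t_per_req,
                                           size_t request,
                                           time_point<Clock> send_time,
                                           typename Clock::duration service_time) {
  auto intended = next(start, t_per_req, request);  // start + (request - 1) * t_per_req
  auto waiting = send_time - intended;              // how late the request actually fired
  return waiting + service_time;
}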

If you do not want to use latency correction for some reason in your implementation, for example because you do not fire the missed requests (Queueless), you have only one option — to pretend the requests were sent. It is done like so:

recordSingleValue(value);
if (expectedIntervalBetweenValueSamples <= 0)
      return;
for (long missingValue = value - expectedIntervalBetweenValueSamples;
      missingValue >= expectedIntervalBetweenValueSamples;
      missingValue -= expectedIntervalBetweenValueSamples) {
      recordSingleValue(missingValue);
}

As in HdrHistogram/recordSingleValueWithExpectedInterval.

By doing that, it compensates for the missed calls in the resulting latency distribution through simulation, adding the expected number of requests with their expected latencies.

How Not to YCSB

So let’s try a benchmark of Scylla and see what we get.

$ docker run -d --name scylladb scylladb/scylla --smp 1

to get container IP address

$ SCYLLA_IP=$(docker inspect -f "{{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}" scylladb)

to prepare keyspace and table for YCSB

$ docker exec -it scylladb cqlsh
> CREATE KEYSPACE ycsb WITH REPLICATION = {'class' : 'SimpleStrategy', 'replication_factor': 1 };
> USE ycsb;
> create table usertable (y_id varchar primary key,
    field0 varchar, field1 varchar, field2 varchar, field3 varchar, field4 varchar,
    field5 varchar, field6 varchar, field7 varchar, field8 varchar, field9 varchar);
> exit;

build and run YCSB

$ docker run -it debian
# apt update
# apt install default-jdk git maven
# git clone https://github.com/brianfrankcooper/YCSB.git
# cd YCSB
# mvn package -DskipTests=true -pl core,binding-parent/datastore-specific-descriptor,scylla

$ cd scylla/target/
$ tar xfz ycsb-scylla-binding-0.18.0-SNAPSHOT.tar.gz
$ cd ycsb-scylla-binding-0.18.0-SNAPSHOT

load initial data

$ apt install python
$ bin/ycsb load scylla -s -P workloads/workloada \
    -threads 1 -p recordcount=10000 \
    -p readproportion=0 -p updateproportion=0 \
    -p fieldcount=10 -p fieldlength=128 \
    -p insertstart=0 -p insertcount=10000 \
    -p scylla.writeconsistencylevel=ONE \
    -p scylla.hosts=${SCYLLA_IP}

run initial test that does not respect CO

$ bin/ycsb run scylla -s -P workloads/workloada \
-threads 1 -p recordcount=10000 \
-p fieldcount=10 -p fieldlength=128 \
-p operationcount=100000 \
-p scylla.writeconsistencylevel=ONE \
-p scylla.readconsistencylevel=ONE \
-p scylla.hosts=${SCYLLA_IP}

On my machine it results in:

[OVERALL], Throughput(ops/sec), 8743.551630672378
[READ], 99thPercentileLatency(us), 249
[UPDATE], 99thPercentileLatency(us), 238

The results seem to yield fantastic latency of 249µs for P99 reads and 238µs P99 writes while doing 8,743 OPS. But those numbers are a lie! We’ve applied a closed-model system to the database.

Try to make it an open model by setting a target for throughput.

$ bin/ycsb run scylla -s -P workloads/workloada \
    -threads 1 -p recordcount=10000 \
    -p fieldcount=10 -p fieldlength=128 \
    -p operationcount=100000 \
    -p scylla.writeconsistencylevel=ONE \
    -p scylla.readconsistencylevel=ONE \
    -p scylla.hosts=${SCYLLA_IP} \
    -target ${YOUR_THROUGHPUT_8000_IN_MY_CASE}

[OVERALL], Throughput(ops/sec), 6538.511834706421
[READ], 99thPercentileLatency(us), 819
[UPDATE], 99thPercentileLatency(us), 948

Now we see the system did not manage to keep up with my 8,000 OPS target per core. What a shame. It reached only 6,538 OPS. But take a look at P99 latencies. They are still good enough. Less than 1ms per operation. But these results are still a lie.

How to YCSB Better

By setting a target throughput we have made the loader simulate an open-model system, but the latencies are still incorrect. We need to request a latency correction in YCSB:

$ bin/ycsb run scylla -s -P workloads/workloada \
    -threads 1 -p recordcount=10000 \
    -p fieldcount=10 -p fieldlength=128 \
    -p operationcount=100000 \
    -p scylla.writeconsistencylevel=ONE \
    -p scylla.readconsistencylevel=ONE \
    -p scylla.hosts=${SCYLLA_IP} \
    -target ${YOUR_THROUGHPUT_8000_IN_MY_CASE} \
    -p measurement.interval=both

[OVERALL], Throughput(ops/sec), 6442.053726728081
[READ], 99thPercentileLatency(us), 647
[UPDATE], 99thPercentileLatency(us), 596
[Intended-READ], 99thPercentileLatency(us), 665087
[Intended-UPDATE], 99thPercentileLatency(us), 663551

Here we see the actual latencies of the open-model system: 665ms per operation for P99 – totally different from what we can see for a non-corrected variant.

As homework I suggest you compare the latency distributions by using “-p measurement.histogram.verbose=true”.

Conclusion

To sum up, we’ve defined coordinated omission (CO), explained why it appears, demonstrated its effects, and classified it into two categories: load generation and latency measurement. We have presented four solutions: queuing/queueless for load generation and correction/simulation for latency measurement.

We found that the best implementation involves a static schedule with queuing and latency correction, and we showed how those approaches can be combined together to effectively solve CO issues: queuing with correction or simulation, or queueless with simulation.

To mitigate CO effects you must:

  • Explicitly set the throughput target, the number of worker threads, the total number of requests to send or the total test duration
  • Explicitly set the mode of latency measurement
    • Correct for queueing implementations
    • Simulate non-queuing implementations

For example, for YCSB the correct flags are:

-target 120000 -threads 840 -p recordcount=1000000000 -p measurement.interval=both

For cassandra-stress they are (see):

duration=3600s -rate fixed=100000/s threads=840

Besides coordinated omission there are many other parameters that are hard to find the right values for (for example).

The good news is that you don’t have to go through this process alone. We invite you to contact us with your benchmarking questions, or drop by our user Slack to ask your NoSQL industry peers and practitioners for their advice and experience.



Getting the Most out of Scylla University LIVE

We are thrilled to see how many people have pre-registered for our upcoming Scylla University LIVE event on April 22, 2021. To help you get the most out of your experience, I’ll add some links to lessons you can take on Scylla University to brush up before the live online classes begin.

As we described in our prior blog post announcing the event, the Scylla University LIVE will be online and instructor-led. It will include two parallel tracks – one for Developers and Scylla Cloud users and one for DevOps and Architects. You can bounce back and forth between tracks or drop in for the sessions that most interest you. However, please be aware that we don’t plan to record the event. So please choose your learning path wisely!

Following the training sessions, we will host Birds of a Feather networking opportunities to meet your fellow database monsters.

Editor’s Note

Scylla University LIVE has come and gone, but you can log in to Scylla University to take all our online self-paced training. It’s free!


Agenda and Suggested Content

Here are details on each of the Scylla University Live sessions, along with suggestions for Scylla University lessons that may help you prepare and get the most out of each session. Whether you are a newcomer to Scylla or have years of experience under your belt, you’ll find a class suitable to your interests.

All times are Pacific Time. Each hour runs in two parallel tracks: Developers & Scylla Cloud Users, and DevOps & Architects.

  • 8:45-9:00am: Early Arrival & Agenda
  • 9:00-9:55am:
    Developers & Scylla Cloud Users: Getting Started with Scylla. An introduction to Scylla architecture, basic concepts, and basic data modeling, plus some hands-on labs. Presenter: Guy Shtub. Suggested Content (optional).
    DevOps & Architects: Best Practices for Change Data Capture. How Scylla implements CDC as standard CQL tables, and how to consume this data. Presenter: Calle Wilund. Suggested Content (optional).
  • 9:55-10:00am: Break/Passing Period
  • 10:00-10:55am:
    Developers & Scylla Cloud Users: How to Write Better Apps. Advanced data modeling and data types, Scylla shard-aware drivers, materialized views, global and local secondary indexes, and compaction strategies. Presenter: Tzach Livyatan. Suggested Content (optional).
    DevOps & Architects: Scylla on Kubernetes: CQL and DynamoDB compatible APIs. Learn to use Scylla Operator for cloud-native deployments, including use of the new helm charts. Presenter: Maciej Zimnoch. Suggested Content (optional).
  • 10:55-11:00am: Break/Passing Period
  • 11:00-11:55am:
    Developers & Scylla Cloud Users: Lightweight Transactions in Action. An overview of Scylla's implementation of LWTs and best practices for using them. Presenter: Kostja Osipov. Suggested Content (optional).
    DevOps & Architects: Advanced Monitoring and Admin Best Practices. An overview of the Scylla Monitoring Stack and its components. Presenter: Shlomi Livne. Suggested Content (optional).
  • 11:55am-12:00pm: Break/Passing Period
  • 12:00pm-1:00pm: Birds of a Feather Roundtables

Get Your Certificate!

Participants that complete the LIVE training event will receive a certificate.

UPDATE: Thank You!

Scylla University LIVE was a great success! But if you missed it, don’t worry! You can still take all our free online courses in Scylla University:



Speakers Announced for April 28 Cassandra 4.0 World Party

The Apache Cassandra World Party will be a virtual event with three sessions on Wednesday, April 28, at 5:00-6:00 AM UTC, 1:00-2:00 PM UTC, and 9:00-10:00 PM UTC. We want as many people around the world to be able to attend in their time zone, learn a little something about Cassandra, meet others, and participate in interactive content as we celebrate the upcoming launch of the 4.0 release milestone! Register now.

Talks will be fast, five-minute Ignite-style presentations. Sessions will also feature interactive content, and there will be giveaways.

Today we’re excited to share the confirmed speakers and schedule for the event:

5:00 - 6:00am UTC | Moderated by Jeremey Hanna

  • Introduction to Apache Cassandra™ 4.0
  • Downside of Incremental Repairs by Payal Kumari
  • Apache Cassandra™ with Quarkus by Ravindra Kulkarni
  • Cassandra: Now and For The Future by Nirmal KPS Singh
  • Understanding Cassandra by Pradeep Gopal
  • Fun and Games

1:00 - 2:00pm UTC | Moderated by Ekaterina Dimitrova + Patrick McFadin

  • Introduction to Apache Cassandra™ 4.0
  • Cassandra Robustness: Errors I Made and You Cannot Anymore! by Carlos Rolo
  • Raising the Bar on QA by Mick Semb Wever
  • Cassandra in Adobe Audience Manager by Serban Teodorescu
  • 11 Years of Cassandra by John Schulz
  • Fun and Games

9:00 - 10:00pm UTC | Moderated by Melissa Logan + Ben Bromhead

  • Introduction to Apache Cassandra™ 4.0
  • How Apache Cassandra™ Skills Help Women on a Path to Reentry in Tech by Autumn Capasso
  • Making Cassandra Easy by Rahul Xavier Singh
  • Optimizing Cassandra for Cloud Native Architecture by Subrata Ashe
  • Moving from Elastic to Cassandra by Charles Herring
  • Fun and Games

There’s so much to celebrate about Apache Cassandra 4.0. The release includes highly anticipated enterprise features, such as five-times faster streaming of data during scaling operations, improved incremental repair, enterprise-grade auditing, and Java 11 support.

We encourage you to join us as we celebrate the hard work that’s gone into making this possible and the beginning of a new, exciting phase for the project.

FullContact: Improving the Graph by Transitioning to Scylla

In 2020, FullContact launched our Resolve product, backed by Cassandra. Initially, we were eager to move from our historical database HBase to Cassandra with its promises for scalability, high availability, and low latency on commodity hardware. However, we could never run our internal workloads as fast as we wanted — Cassandra didn’t seem to live up to expectations. Early on, we had a testing goal of hitting 1000 queries per second, and then soon after 10x-ing that to 10,000 queries per second through the API. We couldn’t get to that second goal due to Cassandra, even after lots of tuning.

Late last year, a small group of engineers at FullContact tried out ScyllaDB to replace Cassandra after hearing about it from one of our DevOps engineers. If you haven’t heard about Scylla before, I encourage you to check it out — it’s a drop-in Cassandra replacement, written in C++, promising big performance improvements.

In this blog, we explore our experience starting from a hackathon and ultimately our transition to Scylla from Cassandra. The primary benchmark we use for performance testing is how many queries per second we can run through the API. While it’s helpful to measure a database by reads and writes per second, our database is only as good as our API can send its way, and vice versa.

The Problem with Cassandra

Our Resolve Cassandra cluster is relatively small: 3 instances of c5.2xlarge EC2 instances, each with 2 TB of gp2 EBS storage. This cluster is relatively inexpensive and, short of being primarily limited by the EBS volume speed limitation (250MB/s), it gave us sufficient scale to launch Resolve. Using EBS as storage also lets us increase the size of EBS volumes without needing to redeploy or rebuild the database and gain storage space. Three nodes may be sufficient for now, but if we’re running low on disk, we can add a terabyte or two to each node while running and keep the same cluster.

After several production customer-runs and some large internal batch loads began, our Cassandra Resolve tables grew from hundreds of thousands to millions and soon to over a hundred million rows. While we load-tested Cassandra before release and could sustain 1000 API calls per second from one Kubernetes pod, this was primarily an empty database or at least one with only a relatively small data set (~ a few million identifiers) max.

With both customers calling our production Resolve API and internal loads at 1000/second, we saw API response times starting to creep up: 100ms, 200ms, and 300ms under heavy load. For us, this is too slow. And under exceptionally heavy load for this cluster, we were seeing, more and more often, the dreaded:

DriverTimeoutException: Query timed out after PT2S

coming from the Cassandra Driver.

Cassandra Tuning

One of the first areas we found to gain performance had to do with Compaction Strategies — the way Cassandra manages the size and number of backing SSTables. We used the Size Tiered Compaction Strategy — the default setting, designed for "general use" and insert-heavy operations. This compaction strategy caused us to end up with single SSTables larger than several gigabytes. This means that on reads, for any SSTables that get through the bloom filter, Cassandra is iterating through many large SSTables, reading them sequentially. Doing this at thousands of queries per second means we were quite easily able to max out the EBS disk throughput, given sufficient traffic. 2 TB EBS volumes attached to an i3.2xlarge max out at a speed of ~250MB/s. From the Cassandra nodes, it was difficult to see any bottlenecks or why we saw timeouts. However, it was soon evident in the EC2 console that the EBS write throughput was pegged at 250MB/s, while memory and CPU were well below their maximums. Additionally, as we were doing large reads and writes concurrently, we had huge files being read, and the background compaction added additional stress on the drives by continuously bucketing SSTables into different-sized tables.

We ended up moving to the Leveled Compaction Strategy:

alter table mytable WITH compaction = { 'class' : 'LeveledCompactionStrategy' };

Then, after an hour or two of Cassandra shuffling data around into smaller SSTables, we were again able to handle a reasonably heavy workload.

Weeks after updating the table’s compaction strategies, Cassandra (having so many small SSTables) struggled to run as fast with heavy read operations. We realized that the database likely needed more heap to run the bloom filtering in a reasonable amount of time. Once we doubled the heap in

/opt/cassandra/env.sh:

MAX_HEAP_SIZE="8G"

HEAP_NEWSIZE="3G"

After a Cassandra service restart, one instance at a time, it was back to performing more closely to how it did when the cluster was smaller, handling up to a few thousand API calls per second.

Finally, we looked at tuning the size of the SSTables to make them even smaller than the 160MB default. In the end, we did seem to get a marginal performance boost after updating the size to something around 8MB. However, we still couldn’t get more than about 3,000 queries per second through the Cassandra database before we’d reach timeouts again. It continued to feel like we were approaching the limits of what Cassandra could do.

alter table mytable WITH compaction = { 'class' : 'LeveledCompactionStrategy', 'sstable_size_in_mb' : 80 };

Enter Scylla

After several months of seeing our Cassandra cluster needing frequent tuning (or more tuning than we’d like), we happened to hear about Scylla. From their website: “We reimplemented Apache Cassandra from scratch using C++ instead of Java to increase raw performance, better utilize modern multi-core servers and minimize the overhead to DevOps.”

This overview comparing ScyllaDB and Cassandra was enough to give it a shot, especially since it “provides the same CQL interface and queries, the same drivers, even the same on-disk SSTable format, but with a modern architecture.”

With Scylla billing itself as a drop-in replacement for Cassandra promising MUCH better performance on the same hardware, it sounded almost too good to be true!

As we’ve explored in our previous Resolve blog, our database is primarily populated by loading SSTables built offline using Spark on EMR. Our initial attempt to load a Scylla database with the same files as our current production database left us a bit disappointed: loading all the files to a fresh Scylla cluster required us to rebuild them with an older version of the Cassandra driver to force it to generate files using an older format.

After talking to the folks at Scylla, we learned that it didn’t support Cassandra’s latest MD file format. However, you can rename the .md files to .mc, and this will supposedly allow these files to be read by Scylla. [Editor’s note: Scylla fully supports the “MD” SSTable format as of Scylla Open Source 4.3.]

Once we were able to get SSTables loaded, we ran into another performance issue: starting the database in a reasonable amount of time. On Cassandra, when you copy files to each node in the cluster and start it, the database starts up within a few seconds. In Scylla, after copying files and restarting the Scylla service, it would take hours for larger tables to be re-compacted, shuffled, and ready to go, even though our replication factor was 3 on a 3-node cluster. Since we copied all the files to every node, our thinking was that the data shouldn’t need to be transformed at all.

Once data was loaded, we were finally able to properly load test our APIs! And guess what? We were able to hit 10,000 queries per second relatively easily!

Grafana dashboard showing our previous maximum from 13:30 – 17:30 running around 3,000 queries/second. We were able to hit 5,000, 7,500, and over 10,000 queries per second with a loaded Scylla cluster.

We’ve been very pleased with Scylla’s performance out of the box, achieving double the goal we set earlier last year of 10,000 queries per second and peaking at over 20,000 requests per second, all while keeping our 98th percentile under 50ms! And best of all — this is all out-of-the-box performance! No JVM or other tuning required! (The brief blips near 17:52, 17:55, and 17:56 are due to our load generator changing Kafka partitioning assignments as more load consumers are added.)

In addition to the custom dashboards we have from the API point of view, Scylla conveniently ships Prometheus metric support and lets us install their Grafana dashboards easily to monitor our clusters with minimal effort.

OS metrics dashboard from Scylla:

Scylla Advanced Dashboard:

Offline SSTables to Cassandra Streaming

After doing some quick math factoring in Scylla’s need to recompact and reshuffle all the data loaded from offline SSTables, we realized that reworking the database build to stream inserts directly into the database with the spark-cassandra-connector would be faster.

In reality, rebuilding a database offline isn’t the primary use case that’s run regularly. Still, it is a useful tool for large schema changes and large internal data changes. Because of this, combined with the fact that our SSTable build ultimately has all SSTables being written by a single executor, we’ve since abandoned the offline SSTable build process.

We’ve updated our Airflow DAG to stream directly to a fresh Scylla cluster:

Version 1 of our Database Rebuild process, building SSTables offline.

Updated version 2 looks very similar, but it streams data directly to Scylla:

Conveniently the code is pretty straightforward as well:

1. We create a Spark config and session:

val sparkConf = super.createSparkConfig()
  .set("spark.cassandra.connection.host", cassandraHosts)
  // any other settings we need/want to set: consistency level, throughput limits, etc.

val session = SparkSession.builder().config(sparkConf).getOrCreate()

val records = session.read
  .parquet(inputPath)
  .as[ResolveRecord]
  .cache()

2. For each table we need to populate, we can map to a case class matching the table schema and save to the correct table name and keyspace:

records
  // map to a row
  .map(row => TableCaseClass(id1, id2, ...))
  .toDF()
  // .write gets the DataFrameWriter used by the calls below
  .write
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> keyspace, "table" -> "mappingtable"))
  .mode(SaveMode.Append)
  // stream to scylla
  .save()

With some trial and error, we found the sweet spot for the number and size of EMR EC2 nodes: for our data sets, running an 8-node cluster of c5.large instances kept the load as fast as the EBS drives could handle without running into more timeout issues.

Cassandra and Scylla Performance Comparison

Our Cassandra cluster under heavy load

Our Scylla cluster on the same hardware, with the same type of traffic

The top graph shows the queries per second (white line; right Y-axis) we were able to push through our Cassandra cluster before we encountered timeout issues, with API latency measured at the mean, 95th, and 98th percentiles (blue, green, and red, respectively; left Y-axis). You can see we could push through about 7 times the number of queries per second while dropping the 98th percentile latency from around 2 seconds to 15 milliseconds!

Next Steps

As our data continues to grow, we are continuing to look for efficiencies around data loading. A few areas we are currently evaluating:

  • Using Scylla Migrator to load Parquet straight to Scylla, using Scylla’s partition aware driver
  • Exploring i3 class EC2 nodes
  • Network efficiencies with batching rows and compression, on the spark side
  • Exploring more, smaller instances for cluster setup

This article originally appeared on the FullContact website here and is republished with their permission. We encourage others who would like to share their Scylla success stories to contact us. Or, if you have questions, feel free to join our user community on Slack.


Apache Cassandra Changelog #6 | April 2021


Our monthly roundup of key activities and knowledge to keep the community informed.

Release Notes

Released

A blocking issue was found in beta-2 which has delayed the release of rc-1. Also during rc-1 evaluation, some concerns were raised about the contents of the source distribution, but work to resolve that got underway quickly and is ready to commit.

For the latest status on Cassandra 4.0 GA, please check the Jira board (ASF login required). However, we expect GA to arrive very soon! Read the latest summary from the community here. The remaining tickets represent 1% of the total scope.

Join the Cassandra mailing list to stay up-to-date.

Changed

The release cadence for the Apache Cassandra project is changing. The community has agreed to one release every year, plus periodic trunk snapshots. The number of releases that will be supported in this agreement is three, and every incoming release will be supported for three years.

Community Notes

Updates on Cassandra Enhancement Proposals (CEPs), how to contribute, and other community activities.

Added

The PMC is pleased to announce that Berenguer Blasi has accepted the invitation to become a project committer. Thanks so much, Berenguer, for all the work you have done!

Added

As the community gets closer to the launch of 4.0, we are organizing a celebration with the help of ASF – Cassandra World Party 4.0 will be a one-day, no-cost virtual event on Wednesday, April 28 to bring the global community together in celebration of the upcoming release milestone. The CFP for 5-minute lightning talks is open now until April 9 – newcomers welcome! Register here.

Added

Apache Cassandra is taking part in the Google Summer of Code (GSoC) under the ASF umbrella as a mentoring organization. If you’re a post-secondary student and looking for an exciting opportunity to contribute to the project that powers your favorite Internet services then read Paulo Motta’s GSoC blog post for the details.

Changed

Recent updates to cass-operator in March by the Kubernetes SIG have seen the specification for seeds now supporting hostnames and separate seeds for separate data centers. Currently, the SIG is discussing whether cass-operator, the community-developed operator for Apache Cassandra, should have CRDs for keyspaces and roles, how to accomplish pod-specific configurations, and whether CRDs should represent schema; watch here.

The project is also looking at how to make the cass-operator multi-cluster by using the same approach used for Multi-CassKop. One idea is to use existing CassKop CRDs to manage cass-operator, and it could be a way to demonstrate how easy it is to migrate from one operator to another.

K8ssandra will be seeking to support Apache Cassandra 4.0 features, which involve some new configuration settings and require changes in the config builder. It will also be supporting JDK 11, the new garbage collectors, and the auditing features.


User Space

American Express

During last year’s ApacheCon, Laxmikant Upadhyay presented a 35-minute guide on the best practices and strategies for upgrading Apache Cassandra in production. This includes pre- and post-upgrade steps and rolling and parallel upgrade strategies for Cassandra clusters. - Laxmikant Upadhyay

Spotify

In a recent AMA, Spotify discussed Backstage, its open platform for building developer portals. Spotify elaborated on the database solutions it provides internally: “Spotify is mostly on GCP so our devs use a mix of Google-managed storage products and self-managed ones.[…] The unmanaged storage solutions Spotify devs start and operate themselves on GCE include Apache Cassandra, PostgreSQL, Memcached, Elastic Search, and Redis. We hope to support stateful workloads in the future. We’ve explored using PersistentVolumes backed by persistent disks.” - David Xia

Do you have a Cassandra case study to share? Email cassandra@constantia.io.

In the News

TFiR: How Apache Cassandra Works With Containers

Dataversity: Why 2021 Will Be a Big Year for Apache Cassandra (and Its Users)

ZDNet: Microsoft Ignite Data and Analytics Roundup: Platform Extensions Are the Key Theme

Techcrunch: Microsoft Azure Expands its NoSQL Portfolio with Managed Instances for Apache Cassandra

Cassandra Tutorials & More

How to Install Apache Cassandra on CentOS 8 - Shehroz Azam, LinuxHint

Cassandra With Java: Introduction to UDT - Otavio Santana, DZone

Apache Cassandra Horizontal Scalability for Java Applications [Book] - Otavio Santana, DZone

Cloud-native applications and data with Kubernetes and Apache Cassandra - Patrick McFadin, DataStax


Cassandra Changelog is curated by the community. Please send submissions to cassandra@constantia.io.

Apache Cassandra World Party | April 2021

The COVID-19 pandemic has taken a toll on a lot of things, and one of those is our ability to interact as a community. There have been no in-person conferences or meetups for over a year now. The Cassandra community has always thrived on sharing with each other at places like ApacheCon and the Cassandra Summit. With Cassandra 4.0, we have a lot to celebrate!


This release will be the most stable database ever shipped, and Cassandra has become one of the most important databases running today. It is responsible for the biggest workloads in the world and because of that, we want to gather the worldwide community and have a party.

We thought we would do something different for this special event that reflects who we are as a community. Our community lives and works in every timezone, and we want to make it as easy as possible for everyone to participate so we’ve decided to use an Ignite-style format. If you are new to this here’s how it works:

  • Each talk is only 5 minutes long.
  • You get ten slides and they automatically advance every 30 seconds.
  • To get you started, we have a template ready here.
  • You can do your talk in the language of your choice. English is not required.
  • Format: PDF (Please no GIFs or videos)

When you are ready, you can use this link to submit your talk idea by April 9, 2021. It’s only five minutes but you can share a lot in that time. Have fun with the format and encourage other people in your network to participate. Diversity is what makes our community stronger so if you are a newcomer, don’t let that put you off – we’d love you to share your experiences.

What?

A one-day virtual party on Wednesday, April 28, with three one-hour sessions so you can celebrate with the Cassandra community in your time zone – or attend all three!

When?

Join us at the time most convenient for you:

  • April 28 5:00am UTC
  • April 28 1:00pm UTC
  • April 28 9:00pm UTC

Register here. In the meantime, please submit your 5-minute Cassandra talk by April 9!

Whether you’re attending or speaking, all Apache Cassandra™ World Party participants must adhere to the code of conduct.

Got questions about the event? Please email events@constantia.io.

Developing an Enterprise-Level Apache Cassandra Sink Connector for Apache Pulsar

Join Apache Cassandra for Google Summer of Code 2021

I have been involved with Apache Cassandra for the past eight years, and I’m very proud to mention that my open source journey started a little more than a decade ago with my participation in the Google Summer of Code (GSoC).

GSoC is a program sponsored by Google to promote open source development, where post-secondary students submit project proposals to open source organizations. Selected students receive community mentorship and a stipend from Google to work on the project for ten weeks during the northern hemisphere summer. Over 16,000 students from 111 countries have participated so far! More details about the program can be found on the official GSoC website.

The Apache Software Foundation (ASF) has been a GSoC mentor organization since the beginning of the program 17 years ago. The ASF acts as an “umbrella” organization, which means that students can submit project proposals to any subproject within the ASF. Apache Cassandra mentored a successful GSoC project in 2016 and we are participating again this year. The application period opens on March 29, 2021 and ends on April 13, 2021. It’s a highly competitive program, so don’t wait until the last minute to prepare!

How to Get Involved

Getting Started

The best way to get started if you’re new to Apache Cassandra is to get acquainted by reading the documentation and setting up a local development environment. Play around with a locally running instance via cqlsh and nodetool to get a feel for how to use the database. If you run into problems or roadblocks during this exercise, don’t be afraid to ask questions via the community channels like the developers mailing list or the #cassandra-dev channel on the ASF Slack.

GSoC Project Ideas

Once you have a basic understanding of how the project works, browse the GSoC ideas list to select ideas that you are interested in working on. Browse the codebase to identify components related to the idea you picked. You are welcome to propose other projects if they’re not on the ideas list.

Writing a proposal

Write a message on the JIRA ticket introducing yourself and demonstrating your interest in working on that particular idea. Sketch a quick proposal on how you plan to tackle the problem and share it with the community for feedback. If you don’t know where to start, don’t hesitate to ask for help!

Useful Resources

There are many good resources on the web on preparing for GSoC, particularly the ASF GSoC Guide and the Python community notes on GSoC expectations. The best GSoC students are self-motivated and proactive, and following the tips above should increase your chances of getting selected and delivering your project successfully. Good luck!

Apache Cassandra Changelog #5 | March 2021

Our monthly roundup of key activities and knowledge to keep the community informed.


Release Notes

Released

We are expecting the 4.0 RC to be released soon, so join the Cassandra mailing list to stay up-to-date.

For the latest status on Cassandra 4.0 GA please check the Jira board (ASF login required). We are within line-of-sight to closing out beta scope, with the remaining tickets representing 2.6% of the total scope. Read the latest summary from the community here.

Proposed

The community has been discussing release cadence after 4.0 reaches GA. An official vote has not been taken on this yet, but the current consensus is one major release every year. Also under discussion are bleeding-edge snapshots (where stability is not guaranteed) and the duration of support for releases.

Community Notes

Updates on Cassandra Enhancement Proposals (CEPs), how to contribute, and other community activities.

Added

We are pleased to announce that Paulo Motta has accepted the invitation to become a PMC member! This invite comes in recognition of all his contributions to the Apache Cassandra project over many years.

Added

Apache Cassandra is taking part in the Google Summer of Code (GSoC) under the ASF umbrella as a mentoring organization. We will be posting a separate blog soon detailing how post-secondary students can get involved.

Proposed

With 4.0 approaching completion, the idea of a project roadmap is also being discussed.

Changed

The Kubernetes SIG is looking at ways to invite more participants by hosting two meetings to accommodate people in different time zones. Watch here.


A community website dedicated to cass-operator, focused on documentation for the operator, is also in development. Going forward, the Kubernetes SIG is discussing release cadence and is looking at six releases a year.

K8ssandra 1.0, an open source production-ready platform for running Apache Cassandra on Kubernetes, was also released on February 25 and announced on its new community website. Read the community blog to find out more and what’s next. K8ssandra now has images for Cassandra 3.11.10 and 4.0-beta4 that run rootless containers with Reaper and Medusa functions.

User Space

Instana

“The Instana components are already containerized and run in our SaaS platform, but we still needed to create containers for our databases, Clickhouse, Cassandra, etc., and set up the release pipeline for them. Most of the complexity is not in creating a container with the database running, but in the management of the configuration and how to pass it down in a maintainable way to the corresponding component.” - Instana

Flant

“We were able to successfully migrate the Cassandra database deployed in Kubernetes to another cluster while keeping the Cassandra production installation in a fully functioning state and without interfering with the operation of applications.” - Flant

Do you have a Cassandra case study to share? Email cassandra@constantia.io.

In the News

CRN: Top 10 Highest IT Salaries Based On Tech Skills In 2021: Dice

TechTarget: Microsoft ignites Apache Cassandra Azure service

Dynamic Business: 5 Ways your Business Could Benefit from Open Source Technology

TWB: Top 3 Technologies which are winning the Run in 2021

Cassandra Tutorials & More

Data Operations Guide for Apache Cassandra - Rahul Singh, Anant

Introduction to Apache Cassandra: What is Apache Cassandra - Ksolves

What’s New in Apache Cassandra 4.0 - Deepak Vohra, Techwell

Reaper 2.2 for Apache Cassandra was released - Alex Dejanovski, The Last Pickle

What’s new in Apache Zeppelin’s Cassandra interpreter - Alex Ott



Cassandra Changelog is curated by the community. Please send submissions to cassandra@constantia.io.

Developing High Performance Apache Cassandra™ Applications in Rust (Part 1)

Apache Cassandra Changelog #4 | February 2021

Our monthly roundup of key activities and knowledge to keep the community informed.


Release Notes

Released

Apache Cassandra 3.0.24 (pgp, sha256 and sha512). This is a security-related release for the 3.0 series and was released on February 1. Please read the release notes.

Apache Cassandra 3.11.10 (pgp, sha256 and sha512) was also released on February 1. You will find the release notes here.

Apache Cassandra 4.0-beta4 (pgp, sha256 and sha512) is the newest version, released on December 30. Please pay attention to the release notes and let the community know if you encounter problems with any of the currently supported versions.

Join the Cassandra mailing list to stay updated.

Changed

A vulnerability rated Important was found when using the dc or rack internode_encryption setting. More details of CVE-2020-17516 (Apache Cassandra internode encryption enforcement vulnerability) are available on this user thread.

Note: The mitigation for 3.11.x users requires an update to 3.11.10, not 3.11.24 as originally stated in the CVE. (For anyone who has perfected a flux capacitor, we would like to borrow it.)

The current status of Cassandra 4.0 GA can be viewed on this Jira board (ASF login required). RC is imminent with testing underway. The remaining tickets represent 3.3% of the total scope. Read the latest summary from the community here.

Community Notes

Updates on Cassandra Enhancement Proposals (CEPs), how to contribute, and other community activities.

Added

Apache Cassandra will be participating in the Google Summer of Code (GSoC) under the ASF umbrella as a mentoring organization. This is a great opportunity to get involved, especially for newcomers to the Cassandra community.

We’re curating a list of JIRA tickets this month, which will be labeled as gsoc2021. This will make them visible in the Jira issue tracker so participants can find them and connect with mentors.

If you would like to volunteer to be a mentor for a GSoC project, please tag the respective JIRA ticket with the mentor label. Non-committers can volunteer to be a mentor as long as there is a committer as co-mentor. Projects can be mentored by one or more co-mentors.

Thanks to Paulo Motta for proposing the idea and getting the ticket list going.

Added

Apache Zeppelin 0.9.0 was released on January 15. Zeppelin is a collaborative data analytics and visualization tool for distributed, general-purpose data processing systems, and it supports Apache Cassandra among others. The release notes for the Cassandra CQL Interpreter are available here.

Changed

For the GA of Apache Cassandra 4.0, any claim of support for Python 2 will be dropped from the updated documentation. We will also introduce a warning when running in Python 2.7. Support for Python 3 will be backported to at least 3.11, due to existing tickets, but we will undertake the work needed to make packaging and internal tooling support Python 3.

Changed

The Kubernetes SIG is discussing how to encourage more participation and to structure SIG meetings around updates on Kubernetes and Cassandra. We also intend to invite other projects (like OpenEBS, Prometheus, and others) to discuss how we can make Cassandra and Kubernetes better. As well as updates, the group discussed handling large-scale backups inside Kubernetes and using S3 APIs to store images. Watch here.


User Space

Backblaze

“Backblaze uses Apache Cassandra, a high-performance, scalable distributed database to help manage hundreds of petabytes of data.” - Andy Klein

Witfoo

Witfoo uses Cassandra for big data needs in cybersecurity operations. In response to the recent licensing changes at Elastic, Witfoo decided to blog about its journey away from Elastic to Apache Cassandra in 2019. - Witfoo.com

Do you have a Cassandra case study to share? Email cassandra@constantia.io.

In the News

The New Stack: What Is Data Management in the Kubernetes Age?

eWeek: Top Vendors of Database Management Software for 2021

Software Testing Tips and Tricks: Top 10 Big Data Tools (Big Data Analytics Tools) in 2021

InfoQ: K8ssandra: Production-Ready Platform for Running Apache Cassandra on Kubernetes

Cassandra Tutorials & More

Creating Flamegraphs with Apache Cassandra in Kubernetes (cass-operator) - Mick Semb Wever, The Last Pickle

Apache Cassandra : The Interplanetary Database - Rahul Singh, Anant

How to Install Apache Cassandra on Ubuntu 20.04 - Jeff Wilson, RoseHosting

The Impacts of Changing the Number of VNodes in Apache Cassandra - Anthony Grasso, The Last Pickle

CASSANDRA 4.0 TESTING - Charles Herring, Witfoo



Cassandra Changelog is curated by the community. Please send submissions to cassandra@constantia.io.

Security Advisory: CVE-2020-17516

Earlier this week, a vulnerability describing a flaw in the way Apache Cassandra nodes perform encryption in some configurations was published to the Cassandra users mailing list.

The vulnerability description is reproduced here:

“Description:

When using ‘dc’ or ‘rack’ internode_encryption setting, a Cassandra instance allows both encrypted and unencrypted connections. A misconfigured node or a malicious user can use the unencrypted connection despite not being in the same rack or dc, and bypass mutual TLS requirement.”

The recommended mitigations are:

  • Users of ALL versions should switch the internode_encryption setting from ‘dc’ or ‘rack’ to ‘all’, as the former settings are inherently insecure (see the example cassandra.yaml snippet after this list)
  • 3.0.x users should additionally upgrade to 3.0.24
  • 3.11.x users should additionally upgrade to 3.11.10
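
For reference, here is a minimal sketch of what the corresponding cassandra.yaml change might look like. The keystore and truststore paths and passwords below are placeholders and must match your existing TLS setup; changes to server_encryption_options typically require a rolling restart to take effect on every node.

server_encryption_options:
    internode_encryption: all            # was 'dc' or 'rack' in the vulnerable configurations
    keystore: conf/.keystore             # placeholder path
    keystore_password: <keystore_password>
    truststore: conf/.truststore         # placeholder path
    truststore_password: <truststore_password>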

By default, all Cassandra clusters running in the Instaclustr environment are configured with internode_encryption set to ‘all’. To confirm that our clusters are unaffected by this vulnerability, Instaclustr has checked the configuration of all Cassandra clusters in our managed service fleet and none are using the vulnerable ‘dc’ or ‘rack’ configurations.

Instaclustr restricts access to Cassandra nodes to only those IP addresses and port combinations required for cluster management and customer use, further mitigating the risk of compromise. 

In line with the mitigation recommendation, Instaclustr is developing a plan to upgrade all 3.0.x and 3.11.x Cassandra clusters to 3.0.24 and 3.11.10. Customers will be advised when their clusters are due for upgrade. 

Instaclustr recommends that our Support Only customers check their configurations to ensure that they are consistent with this advice, and upgrade their clusters as necessary to maintain a good security posture. 

Should you have any questions regarding Instaclustr Security, please contact us by email at security@instaclustr.com.

If you wish to discuss scheduling of the upgrade to your system or have any other questions regarding the impact of this vulnerability, please contact support@instaclustr.com.


To report an active security incident, email support@instaclustr.com.


Apache Cassandra Changelog #3 | January 2021

Our monthly roundup of key activities and knowledge to keep the community informed.


Release Notes

Released

Apache Cassandra 4.0-beta4 (pgp, sha256 and sha512) was released on December 30. Please pay attention to release notes and let the community know if you encounter problems. Join the Cassandra mailing list to stay updated.

Changed

The current status of Cassandra 4.0 GA can be viewed on this Jira board (ASF login required). RC is imminent with testing underway. Read the latest summary from the community here.

Community Notes

Updates on Cassandra Enhancement Proposals (CEPs), how to contribute, and other community activities.

Added

The Cassandra community welcomed one new PMC member and five new committers in 2020! Congratulations to Mick Semb Wever who joined the PMC and Jordan West, David Capwell, Zhao Yang, Ekaterina Dimitrova, and Yifan Cai who accepted invitations to become Cassandra committers!

Changed

The Kubernetes SIG is discussing how to extend the group’s scope beyond the operator, as well as sharing an update on current operator merge efforts in the latest meeting. Watch here.


User Space

Keen.io

Under the covers, Keen leverages Kafka, the Apache Cassandra NoSQL database, and the Apache Spark analytics engine, adding a RESTful API and a number of SDKs for different languages. Keen enriches streaming data with relevant metadata and enables customers to stream enriched data to Amazon S3 or any other data store. - Keen.io

Monzo

Suhail Patel explains how Monzo prepared for the recent crowdfunding (run entirely through its app, using the very same platform that runs the bank) which saw more than 9,000 people investing in the first five minutes. He covers Monzo’s microservice architecture (on Go and Kubernetes) and how they profiled and optimized key platform components such as Cassandra and Linkerd. - Suhail Patel

In the News

ZDNet - Meet Stargate, DataStax’s GraphQL for databases. First stop - Cassandra

CIO - It’s a good day to corral data sprawl

TechTarget - Stargate API brings GraphQL to Cassandra database

ODBMS - On the Cassandra 4.0 beta release. Q&A with Ekaterina Dimitrova, Apache Cassandra Contributor

Cassandra Tutorials & More

Intro to Apache Cassandra for Data Engineers - Daniel Beach, Confessions of a Data Guy

Impacts of many columns in a Cassandra table - Alex Dejanovski, The Last Pickle

Migrating Cassandra from one Kubernetes cluster to another without data loss - Flant staff

Real-time Stream Analytics and User Scoring Using Apache Druid, Flink & Cassandra at Deep.BI - Hisham Itani, Deep.BI

User thread: Network Bandwidth and Multi-DC replication (Login required)



Cassandra Changelog is curated by the community. Please send submissions to cassandra@constantia.io.

The Instaclustr LDAP Plugin for Cassandra 2.0, 3.0, and 4.0

LDAP (Lightweight Directory Access Protocol) is a common vendor-neutral and lightweight protocol for organizing authentication of network services. Integrating with LDAP allows organizations to unify their security policies, so that a single user or entity can log in and authenticate against a variety of services.

There is a lot of demand from our enterprise customers to be able to authenticate to their Apache Cassandra clusters against LDAP. As the leading NoSQL database, Cassandra is typically deployed across the enterprise and needs this connectivity.

Instaclustr has previously developed our LDAP plugin to work with the latest Cassandra releases. However, with Cassandra 4.0 right around the corner, it was due for an update to ensure compatibility. Instaclustr takes a great deal of care to provide cutting-edge features and integrations for our customers, and our new LDAP plugin for Cassandra 4.0 showcases this commitment. We always use open source and maintain a number of Apache 2.0-licensed tools, and have released our LDAP plugin under the Apache 2.0 license. 

Modular Architecture to Support All Versions of Cassandra

Previously, the implementations for Cassandra 2.0 and 3.0 lived in separate branches, which resulted in some duplicative code. With our new LDAP plugin update, everything lives in one branch and we have modularized the whole solution so it aligns with earlier Cassandra versions and Cassandra 4.0.

The modularization of our LDAP plugin means that there is the “base” module that all implementations are dependent on. If you look into the codebase on GitHub, you see that the implementation modules consist of one or two classes at maximum, with the rest inherited from the base module. 

This way of organizing the project is beneficial from a long-term maintenance perspective. We no longer need to keep track of all changes and apply them to the branched code for the LDAP plugin for each Cassandra version. When we implement changes and improvements to the base module, all modules are updated automatically and benefit from them.

Customizable for Any LDAP Implementation

This plugin offers a default authenticator implementation for connecting to an LDAP server and authenticating a user against it. It also offers a way to implement custom logic for specific use cases. In the module, we provide the most LDAP server-agnostic implementation possible, but there is also scope for customization to meet specific LDAP server nuances.

If the default solution needs to be modified for a particular customer use case, it is possible to add in custom logic for that particular LDAP implementation. The implementation for customized connections is found in “LDAPPasswordRetriever” (DefaultLDAPServer being the default implementation from which one might extend and override appropriate methods). This is possible thanks to the SPI mechanism. If you need this functionality you can read more about it in the relevant section of our documentation.

Enhanced Testing for Reliability

Our GitHub build pipeline now tests the LDAP plugins for each supported Cassandra version on each merged commit. The integration tests spin up a standalone Cassandra node as part of the JUnit tests, along with an LDAP server, which is started in Docker as part of the Maven build before the actual JUnit tests run.

This testing framework lets us verify that any changes don’t break the authentication mechanism, by actually logging in both via the usual mechanism and via LDAP.

Packaged for Your Operating System

Last but not least, we have now added Debian and RPM packages with our plugin for each Cassandra version release. Until now, a user of this plugin had to install the JAR file into Cassandra’s libraries directory manually. With the introduction of these packages, you no longer need to perform this manual step. The plugin’s JAR, along with the configuration file, will be installed in the right place if the official Debian or RPM Cassandra package is installed too.

How to Configure LDAP for Cassandra

In this section we will walk you through the setup of the LDAP plugin and explain the most crucial parts of how the plugin works.

After placing the LDAP plugin JAR on Cassandra’s classpath—either by copying it over manually or by installing a package—you will need to modify the configuration file at /etc/cassandra/ldap.properties.

There are also changes that need to be applied to cassandra.yaml. For Cassandra 4.0, please be sure that your authenticator, authorizer, and role_manager are configured as follows:

authenticator: LDAPAuthenticator
authorizer: CassandraAuthorizer
role_manager: LDAPCassandraRoleManager

Before using this plugin, an operator of a Cassandra cluster should configure the system_auth keyspace to use NetworkTopologyStrategy.
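
For example, assuming a single data center named dc1 and a replication factor of 3 (both are assumptions—adjust them to your own topology), the keyspace could be altered like this, typically followed by a repair of system_auth so the existing authentication data is replicated accordingly:

ALTER KEYSPACE system_auth WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': 3};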

How the LDAP Plugin Works With Cassandra Roles

The LDAP plugin works via a “dual authentication” technique. If a user tries to log in with a role that already exists in Cassandra, separate from LDAP, it will authenticate against that role. However, if that role is not present in Cassandra, the plugin will reach out to the LDAP server and try to authenticate against it. If that succeeds, from the user’s point of view it looks as though the role was in Cassandra the whole time, because the user is logged in transparently.

If your LDAP server is down, you will not be able to authenticate with the specified LDAP user. You can enable caching for LDAP users—available in the Cassandra 3.0 and 4.0 plugins—to take some load off the LDAP server when authentication happens frequently.
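
To make the dual authentication flow above concrete, here is a small, purely illustrative Java sketch. None of the class or method names below belong to the plugin’s actual API; they are hypothetical stand-ins that only model the “try the local Cassandra role first, then fall back to LDAP” decision described above.

// Illustrative sketch only -- NOT the plugin's real classes or API.
import java.util.Map;
import java.util.Set;

public class DualAuthSketch {

    // Hypothetical stand-ins for roles and credentials stored locally in Cassandra.
    static final Set<String> LOCAL_ROLES = Set.of("cassandra");
    static final Map<String, String> LOCAL_PASSWORDS = Map.of("cassandra", "cassandra");

    static boolean authenticate(String role, String password) {
        if (LOCAL_ROLES.contains(role)) {
            // Role already exists in Cassandra: authenticate against the local store only.
            return password.equals(LOCAL_PASSWORDS.get(role));
        }
        // Role unknown locally: fall back to the LDAP server with the supplied credentials.
        return ldapBind(role, password);
    }

    // Hypothetical LDAP bind; a real implementation would use an LDAP client library.
    static boolean ldapBind(String role, String password) {
        return "ldap_user".equals(role) && "ldap_secret".equals(password);
    }

    public static void main(String[] args) {
        System.out.println(authenticate("cassandra", "cassandra"));   // true: local Cassandra role
        System.out.println(authenticate("ldap_user", "ldap_secret")); // true: authenticated via LDAP
        System.out.println(authenticate("ldap_user", "wrong"));       // false: LDAP bind fails
    }
}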

The Bottom Line

Our LDAP plugin meets the enterprise need for a consolidated security and authentication policy. 100% open source and supporting all major versions of Cassandra, the plugin works with all major LDAP implementations and can be easily customized for others. 

The plugin is part of our suite of supported tools for our support customers and Instaclustr is committed to actively maintaining and developing the plugin. Our work updating it to support the upcoming Cassandra 4.0 release is part of this commitment. You can download it here and feel free to get in touch with any questions you might have. Cassandra 4.0 beta 2 is currently in preview on our managed platform and you can use our free trial to check it out.


Apache Cassandra Changelog #2 | December 2020

Our monthly roundup of key activities and knowledge to keep the community informed.


Release Notes

Released

Apache Cassandra 4.0-beta3, 3.11.9, 3.0.23, and 2.2.19 were released on November 4 and are in the repositories. Please pay attention to the release notes and let the community know if you encounter problems. Join the Cassandra mailing list to stay updated.

Changed

Cassandra 4.0 is progressing toward GA. There are 1,390 total tickets and remaining tickets represent 5.5% of total scope. Read the full summary shared to the dev mailing list and take a look at the open tickets that need reviewers.

Cassandra 4.0 will be dropping support for older distributions of CentOS 5, Debian 4, and Ubuntu 7.10. Learn more.

Community Notes

Updates on Cassandra Enhancement Proposals (CEPs), how to contribute, and other community activities.

Added

The community weighed options to address read inconsistencies for Compact Storage as noted in ticket CASSANDRA-16217 (committed). The conversation continues in ticket CASSANDRA-16226 with the aim of ensuring there are no huge performance regressions for common queries when you upgrade from 2.x to 3.0 with Compact Storage tables or drop Compact Storage from a table on 3.0+.

Added

CASSANDRA-16222 is a Spark library that can compact and read raw Cassandra SSTables into SparkSQL. By reading the sstables directly from a snapshot directory, one can achieve high performance with minimal impact to a production cluster. It was used to successfully export a 32TB Cassandra table (46bn CQL rows) to HDFS in Parquet format in around 70 minutes, a 20x improvement on previous solutions.

Changed

Great news for CEP-2 (Kubernetes Operator): the community has agreed to create a community-based operator by merging cass-operator and CassKop. The work being done can be viewed on GitHub here.

Released

The Reaper community announced v2.1 of its tool that schedules and orchestrates repairs of Apache Cassandra clusters. Read the docs.

Released

Apache Cassandra 4.0-beta-1 was released on FreeBSD.

User Space

Netflix

“With these optimized Cassandra clusters in place, it now costs us 71% less to operate clusters and we could store 35x more data than our previous configuration.” - Maulik Pandey

Yelp

“Cassandra is a distributed wide-column NoSQL datastore and is used at Yelp for both primary and derived data. Yelp’s infrastructure for Cassandra has been deployed on AWS EC2 and ASG (Autoscaling Group) for a while now. Each Cassandra cluster in production spans multiple AWS regions.” - Raghavendra D Prabhu

In the News

DevPro Journal - What’s included in the Cassandra 4.0 Release?

JAXenter - Moving to cloud-native applications and data with Kubernetes and Apache Cassandra

DZone - Improving Apache Cassandra’s Front Door and Backpressure

ApacheCon - Building Apache Cassandra 4.0: behind the scenes

Cassandra Tutorials & More

Users in search of a tool for scheduling backups and performing restores with cloud storage support (archiving to AWS S3, GCS, etc) should consider Cassandra Medusa.

Apache Cassandra Deployment on OpenEBS and Monitoring on Kubera - Abhishek Raj, MayaData

Lucene Based Indexes on Cassandra - Rahul Singh, Anant

How Netflix Manages Version Upgrades of Cassandra at Scale - Sumanth Pasupuleti, Netflix

Impacts of many tables in a Cassandra data model - Alex Dejanovski, The Last Pickle

Cassandra Upgrade in production : Strategies and Best Practices - Laxmikant Upadhyay, American Express

Apache Cassandra Collections and Tombstones - Jeremy Hanna

Spark + Cassandra, All You Need to Know: Tips and Optimizations - Javier Ramos, ITNext

How to install the Apache Cassandra NoSQL database server on Ubuntu 20.04 - Jack Wallen, TechRepublic

How to deploy Cassandra on Openshift and open it up to remote connections - Sindhu Murugavel


Cassandra Changelog is curated by the community. Please send submissions to cassandra@constantia.io.