Scylla Operator 1.5

The Scylla team is pleased to announce the release of Scylla Operator 1.5.

Scylla Operator is an open-source project that helps Scylla Open Source and Scylla Enterprise users run Scylla on Kubernetes. Scylla Operator manages Scylla clusters deployed to Kubernetes and automates tasks related to operating them, such as installation, scaling out and down, and rolling upgrades.

Scylla Operator 1.5 improves stability and brings a few features. As with all of our releases, any API changes are backward compatible.

API changes

  • ScyllaCluster now supports specifying image pull secrets to enable using private registries. (#678)
    • To learn more about it, use kubectl explain scyllacluster.spec.imagePullSecrets (see the example below).
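
As an illustration only (the namespace, secret name and registry below are placeholders, not values from the release), such a pull secret can be created with kubectl and then referenced by name under spec.imagePullSecrets:

$ kubectl -n scylla create secret docker-registry my-registry-secret \
    --docker-server=registry.example.com \
    --docker-username=<user> \
    --docker-password=<password>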

Notable Changes

  • ScyllaCluster now allows changing resources, placement and repository. (#763)
  • The operator validation webhooks now chain errors, so you no longer have to fix them one by one if your CR contains more than one error. (#695)
  • Operator and webhook server pods now prefer to be scheduled on different nodes (within the same deployment) for better HA. (#700)
  • The Scylla Manager deployment got a readiness probe to better indicate its state. (#725)
  • The webhook server got a PodDisruptionBudget for better HA when pods are being evicted. (#715)

For more details, check out the GitHub release notes.

Supported Versions

  • Scylla ≥4.3, Scylla Enterprise ≥2021.1
  • Kubernetes ≥1.19.10
  • Scylla Manager ≥2.2
  • Scylla Monitoring ≥1.0

Upgrade Instructions

Upgrading from v1.4.x doesn’t require any extra action. Depending on your deployment method, use helm upgrade or kubectl apply to update the manifests from the v1.5.0 tag while substituting the released image.
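
As a rough sketch (assuming you deploy from the manifests in the scylla-operator repository, laid out as described in the “Deploying Scylla Operator with GitOps” post below, with the v1.5.0 tag checked out):

$ kubectl apply -f ./deploy/operator

# Wait for the updated operator deployment to roll out.
$ kubectl -n scylla-operator rollout status deployment.apps/scylla-operator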

Getting Started with Scylla Operator

  • Scylla Operator Documentation
  • Learn how to deploy Scylla on Google Kubernetes Engine (GKE) here
  • Learn how to deploy Scylla on Amazon Elastic Kubernetes Service (EKS) here
  • Learn how to deploy Scylla on a Kubernetes Cluster here (including a Minikube example)

Related Links

We welcome your feedback! Feel free to open an issue or reach out on the #kubernetes channel in Scylla User Slack.

The post Scylla Operator 1.5 appeared first on ScyllaDB.

AWS Graviton2: Arm Brings Better Price-Performance than Intel


Since the last time we took a look at Scylla’s performance on Arm, its expansion into the desktop and server space has continued: Apple introduced its M1 CPUs, Oracle Cloud added Ampere Altra-based instances to its offerings, and AWS expanded its selection of Graviton2-based machines. So now’s a perfect time to test Arm again — this time with SSDs.

Summary

We compared Scylla’s performance on m5d (x86) and m6gd (Arm) instances of AWS. We found that Arm instances provide 15%-25% better price-performance, both for CPU-bound and disk-bound workloads, with similar latencies.

Compared machines

For the comparison we picked m5d.8xlarge and m6gd.8xlarge instances, because they are directly comparable, and other than the CPU they have very similar specs:

                         Intel (x86-Based) Server        Graviton2 (Arm-based) Server
Instance type            m5d.8xlarge                     m6gd.8xlarge
CPUs                     Intel Xeon Platinum 8175M       AWS Graviton2
                         (16 cores / 32 threads)         (32 cores)
RAM                      128 GB                          128 GB
Storage (NVMe SSD)       2 x 600 GB (1,200 GB Total)     1 x 1,900 GB
Network bandwidth        10 Gbps                         10 Gbps
Price/Hour (us-east-1)   $1.808                          $1.4464

For m5d, both disks were used as one via a software RAID0.

Scylla’s setup scripts benchmarked the following stats for the disks:

                  m5d.8xlarge    m6gd.8xlarge
read_iops         514 k/s        445 k/s
read_bandwidth    2.502 GiB/s    2.532 GiB/s
write_iops        252 k/s        196 k/s
write_bandwidth   1.066 GiB/s    1.063 GiB/s

Their throughput was almost identical, while the two disks on m5d provided moderately higher IOPS.

Note: "m" class (General Purpose) instances are not the typical host for Scylla. Usually, "i3" or "i3en" (Storage Optimized) instances, which offer a lower cost per GB of disk, would be chosen. However, there are no Arm-based "i"-series instances available yet.

While this blog was being written, AWS released a new series of x86-based instances, the m6i, which boasts up to 15% better price-performance than m5. However, these don’t yet have SSD-based variants, so they are not a platform we would recommend for a low-latency persistent database like Scylla.

Benchmark Details

For both instance types, the setup consisted of a single Scylla node, 5 client nodes (c5.4xlarge) and 1 Scylla Monitoring node (c5.large), all in the same AWS availability zone.

Scylla was launched on 30 shards with 2GiB of RAM per shard. The remaining CPUs (2 cores for m6gd, 1 core with 2 threads for m5d) were dedicated to networking. Other than that, Scylla AMI’s default configuration was used.

The benchmark test we used was cassandra-stress, with a Scylla shard-aware driver. This version of cassandra-stress is distributed with Scylla. The default schema of cassandra-stress was used.
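
For reference, a mixed-phase invocation of cassandra-stress looks roughly like the following (the node address, population size, thread count and duration here are placeholders, not the exact settings used in this benchmark):

$ cassandra-stress mixed ratio\(write=1,read=1\) duration=15m \
    -pop dist=uniform\(1..1000000000\) \
    -rate threads=200 \
    -node 10.0.0.1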

The benchmark consisted of seven phases:

  1. Populating the cluster with freshly generated data.
  2. Writes (updates) randomly distributed over the whole dataset.
  3. Reads randomly distributed over the whole dataset.
  4. Writes and reads (mixed in 1:1 proportions) randomly distributed over the whole dataset.
  5. Writes randomly distributed over a small subset of the dataset.
  6. Reads randomly distributed over a small subset of the dataset.
  7. Writes and reads (mixed in 1:1 proportions) randomly distributed over a small subset of the dataset.

Phase 1 ran for as long as needed to populate the dataset, phase 2 ran for 30 minutes, and the remaining phases ran for 15 minutes each. There was a break after each write phase to wait for compactions to end.

The size of the dataset was chosen to be a few times larger than RAM, while the “small subset” was chosen small enough to fit entirely in RAM. Thus phases 1-4 test the more expensive (disk-touching) code paths, while phases 5-7 stress the cheaper (in-memory) paths.

Note that phases 3 and 4 are heavily bottlenecked by disk IOPS, not by the CPU, so their throughput is not very interesting for the general Arm vs x86 discussion. Phase 2, however, is bottlenecked by the CPU. Scylla is a log-structured merge-tree database, so even "random writes" are fairly cheap IO-wise.

Results

Phase              m5d.8xlarge [kop/s]   m6gd.8xlarge [kop/s]   Difference in throughput   Difference in throughput/price
1. Population      563                   643                    14.21%                     42.76%
2. Disk writes     644                   668                    3.73%                      29.66%
3. Disk reads      149                   148                    -0.67%                     24.16%
4. Disk mixed      205                   176                    -14.15%                    7.32%
5. Memory writes   1059                  1058                   -0.09%                     24.88%
6. Memory reads    1046                  929                    -8.92%                     13.85%
7. Memory mixed    803                   789                    -1.74%                     22.82%

We see roughly even performance across both servers, but because of the different pricing, this results in a ~20% price-performance advantage overall for m6gd. Curiously, random mixed read-writes are significantly slower on m6gd, even though both random writes and random reads are on par with m5d.

Service Times

The service times below were measured on a separate run, where cassandra-stress was fixed at 75% of the maximum throughput.

                              m5d p90   m5d p99   m5d p999   m5d p9999   m6gd p90   m6gd p99   m6gd p999   m6gd p9999
Disk writes                   1.38      2.26      9.2        19.5        1.24       2.04       9.5         20.1
Disk reads                    1.2       2.67      4.28       19.2        1.34       2.32       4.53        19.1
Disk mixed (write latency)    2.87      3.9       6.48       20.4        2.78       3.58       7.97        32.6
Disk mixed (read latency)     0.883     1.18      4.33       13.9        0.806      1.11       4.05        13.7
Memory writes                 1.43      2.61      11.7       23.7        1.23       2.8        13.2        25.7
Memory reads                  1.09      2.48      11.6       24.6        0.995      2.32       11          24.4
Memory mixed (read latency)   1.01      2.22      10.4       24          1.12       2.42       10.8        24.2
Memory mixed (write latency)  0.995     2.19      10.3       23.8        1.1        2.37       10.7        24.1

No abnormalities. The values for both machines are within 10% of each other, with the one notable exception of the 99.99% quantile of “disk mixed.”

Conclusion

AWS’ Graviton2 Arm-based servers are already on par with x86 or better, especially for price-performance, and continue to improve generation after generation.

By the way, Scylla doesn’t have official releases for Arm instances yet, but stay tuned. Their day is inevitably drawing near. If you have any questions about running Scylla in your own environment, please contact us via our website, or reach out to us via our Slack community.

The post AWS Graviton2: Arm Brings Better Price-Performance than Intel appeared first on ScyllaDB.

What We’ve Learned after 6 Years of IO Scheduling

This post is by P99 CONF speaker Pavel Emelyanov, a developer at NoSQL database company ScyllaDB. To hear more from Pavel and many more latency-minded developers, register for P99 CONF today.

Why Scheduling at All?

Scheduling requests of any kind always serves one purpose — gaining control over the priorities of those requests. In a priority-less system one doesn't need to schedule; just putting whatever arrives into the queue and waiting for it to finish is enough. When serving IO requests in Scylla, we cannot afford to just throw those requests at the disk and wait for them to complete. In Scylla, different types of IO flows have different priorities. For example, reading from disk to respond to a user query is likely a "synchronous" operation in the sense that a client is really waiting for it to happen, even though the CPU is most likely busy with something else. In this case, if some IO is already running when the query request comes in, Scylla must do its best to let the query request get served in a timely manner, even if this means submitting it to the disk ahead of something else. Generally speaking, OLTP workloads are synchronous in the aforementioned sense and are thus latency-sensitive. This is somewhat the opposite of OLAP workloads, which can tolerate higher latency as long as they get sufficient throughput.

Seastar’s IO Scheduler

Scylla implements its IO scheduler as a part of the Seastar framework library. When submitted, an IO request is passed into the seastar IO scheduler, where it finds its place in one of several queues and eventually gets dispatched into the disk.

Well, not exactly into the disk. Scylla does its IO over files that reside on a filesystem. So when we say "a request is sent to the disk," we really mean that the request is sent to the Linux kernel AIO, then it gets to the filesystem, then to the Linux kernel IO scheduler, and only then to the disk.

The scheduler's goal is to make sure that requests are served in a timely manner according to their assigned priorities. To maintain fairness between priorities, the IO scheduler maintains a set of request queues. When a request arrives, the target queue is selected based on the request's priority; later, when dispatching, the scheduler uses a virtual-runtime-like algorithm to balance between the queues (read — priorities), but that topic is beyond the scope of this blog post.

The critical parameter of the scheduler is called the "latency goal." This is the time period after which the disk is guaranteed to have processed all the requests submitted so far, so a new request, if it arrives, can be dispatched right away and, in turn, be completed no later than one "latency goal" after that. To make this work, the scheduler tries to predict how much data can be put into the disk so that the disk manages to complete it all within the latency goal. Note that meeting the latency goal does not mean that requests are never queued somewhere after dispatch. In fact, modern disks are so fast that the scheduler does dispatch more requests than the disk can handle without queuing. Still, the total execution time (including the time spent in the internal queue) is small enough not to violate the latency goal.

The above prediction is based on the disk model that’s wired into the scheduler’s brains, and the model uses a set of measurable disk characteristics. Modeling the disk is hard and a 100% precise model is impossible, since disks, as we’ve learned, are always very surprising.

The Foundations of IO

Most likely, when choosing a disk, one would be looking at its four parameters — read/write IOPS and read/write throughput (in Gbps). Comparing these numbers to one another is a popular way of claiming one disk is better than another, and in most cases real disk behavior meets the expectations based on these numbers. Applying Little's Law here makes it clear that the "latency goal" can be achieved at a certain level of concurrency (i.e. the number of requests in the disk at a time), and all the scheduler needs to do its job is to stop dispatching at some level of in-disk concurrency.
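
To make that concrete (the numbers below are purely illustrative, not measurements from this post): Little's Law says that concurrency = throughput × latency, so keeping a disk busy at 500k IOPS while staying within a 0.5 ms latency goal corresponds to roughly 500,000 × 0.0005 = 250 requests in flight; the scheduler's job is then to cap in-disk concurrency around that level.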

Actually, it may happen that the latency goal is violated even when a single request is dispatched. In that case the scheduler should stop dispatching before it submits even that single request, which in turn means that no IO should ever happen. Fortunately, this can be observed only on vintage spinning disks that may impose milliseconds-scale overhead per request. Scylla can work with these disks too, but the user's latency expectations must be greatly relaxed.

Share Almost Nothing

Let's get back for a while to the "feed the disk with as many requests as it can process in 'latency goal' time" concept and throw some numbers into the game. The latency goal is a value of millisecond magnitude; the default goal is 0.5 ms. An average disk doing 1 GB/s is capable of processing 500 kB during this time frame. Given a system of 20 shards, each gets 25 kB to dispatch in one tick. This value is in fact quite low. Partially because Scylla would need a lot of such small requests, adding noticeable overhead, but the main reason is that disks often require much larger requests to work at their maximum bandwidth. For example, the NVMe disks used by AWS instances might need 64 kB requests to reach peak bandwidth; using 25 kB requests will give you only ~80% of the bandwidth even when exploiting high concurrency.

This simple math shows that Seastar’s “shared nothing” approach doesn’t work well when it comes to disks, so shards must communicate when dispatching requests. In the old days Scylla came with the concept of IO coordinator shards; later this was changed to the IO-groups.

Why iotune?

When deciding whether or not to dispatch a request, the scheduler always asks itself — if I submit the next request, will it push the in-disk concurrency high enough to break the latency goal contract, or not? Answering this question, in turn, depends on the disk model that sits in the scheduler's brain. This model can be evaluated in two ways — in advance or on the fly (or some combination of the two).

Doing it on the fly is quite challenging. A disk, surprising as it may be, is not deterministic, and its performance characteristics change while it works. Even such a simple number as "bandwidth" doesn't have a definite fixed value, even if we allow for statistical error in our measurements. The same disk can show different read speeds depending on whether it's in so-called burst mode or under sustained load, and whether the IO is read, write or mixed; it's heavily affected by the disk's usage history, the air temperature in the server room and tons of other factors. Trying to estimate this model at runtime can be extremely difficult.

Contrary to this, Scylla measures disk performance in advance with the help of a tool called iotune. This tool literally measures a bunch of the disk's parameters and saves the result in a file we call "IO properties." The numbers are then loaded by Seastar on start and fed into the IO scheduler configuration. The scheduler thus has a 4-dimensional "capacity" space at hand and is allowed to operate inside a sub-area of it. The area is defined by four limits, one on each of the axes, and the scheduler must make sure it doesn't leave this area, in the mathematical sense, when submitting requests. But really these four points are not enough. Not only does the scheduler need a more elaborate configuration of the mentioned "safe area," it must also handle request lengths carefully.
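
For reference, on a typical Scylla installation this measurement runs once during setup (usually via the scylla_io_setup script). Invoked directly, iotune looks roughly like this (the paths below are common defaults and may differ on your system):

$ iotune --evaluation-directory /var/lib/scylla \
    --properties-file /etc/scylla.d/io_properties.yaml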

Pure Workloads

First, let's see how disks behave when fed with what we call "pure" loads, i.e. only reads or only writes. If one divides a disk's maximum bandwidth by its maximum IOPS rate, the result is some request size. When heavily loading the disk with requests smaller than that size, the disk will be saturated by IOPS and its bandwidth will be underutilized. When using requests larger than that threshold, the disk will be saturated by bandwidth and its IOPS capacity will be underutilized. But are all "large" requests good enough to utilize the disk's full bandwidth? Our experiments show that some disks deliver notably different bandwidth when using, for example, 64 kB requests vs 512 kB requests (of course, the larger the request size, the higher the bandwidth). So to get the maximum bandwidth from the disk one needs to use larger requests, and vice versa — with smaller requests one will never get the peak bandwidth out of the disk, even if the IOPS limit is still not hit. Fortunately, there's an upper limit on the request size above which the throughput no longer grows. We call this limit the "saturation length."
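
As a rough rule of thumb (the numbers here are round, illustrative values, not measurements from this post): the boundary request size is about max_bandwidth / max_IOPS, e.g. 1 GB/s ÷ 500,000 IOPS = 2 kB; below that size the disk is IOPS-bound, above it bandwidth-bound, and the "saturation length" mentioned above is the point beyond which making requests even larger stops paying off.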

This observation has two consequences. First, the saturation length can be measured by iotune and, if so, it is later advertised by the scheduler as the IO size that subsystems should use if they want to obtain the maximum throughput from the disk. The SSTables management code uses buffers of that length to read and write SSTables.

This advertised request size, however, shouldn't be too big. It must still be smaller than the largest one with which the disk still meets the latency goal. These two requirements — to be large enough to saturate the bandwidth and small enough to meet the latency goal — may be "out of sync," i.e. the latter may be lower than the former. We've met such disks; with those, the user will need to choose between latency and throughput. Otherwise, both can be enjoyed (provided other circumstances are favorable).

The second consequence is that if the scheduler sees medium-sized requests coming in, it must dispatch less data than it would if the requests were larger. This is because the effective disk bandwidth will be below the peak, and the latency goal requirement would otherwise not be met. Newer versions of Seastar model this behavior with the help of a staircase function, which seems to be both a good approximation and not too many configuration parameters to maintain.

Mixed Workloads

The next dimension of complexity comes with what we call "mixed workloads," when the disk has to execute both reads and writes at the same time. In this case both the total throughput and the IOPS will differ from what a simple linear combination of the pure-read and pure-write numbers would suggest. This difference is two-fold.

First, read flows and write flows degrade in different ways. Let's take a disk that can run 1 GB/s of reads or 500 MB/s of writes. It's no surprise that disks write slower than they read. Now let's try to saturate the disk with two equal unbounded read and write flows. What output bandwidth would we see? The linear ratio makes us think that each flow would get its half, i.e. reads would show 500 MB/s and writes would get 250 MB/s. In reality the result differs between disk models, and the common case seems to be that writes fare much better than reads. For example, we may see an equal rate of 200 MB/s for both flows, which is 80% of the expected share for writes but only 40% for reads. Or, in the worst (or maybe the best) case, writes can continue working at peak bandwidth while reads have to be content with whatever is left.

Second, this inhibition greatly depends on the request sizes used. For example, when a saturated read flow is disturbed by one write request at a time, the read throughput may become 2 times lower for small writes or 10 times lower for large writes. This observation imposes yet another limitation on the maximum IO length that the scheduler advertises to the system. When configured, the scheduler additionally limits the maximum write request length so that it has a chance to dispatch a mixed workload and still stay within the latency goal.

Unstable Workloads

Digging deeper, we'll see that there are actually twice as many speed numbers for a disk: each speed characteristic can be measured in two modes — bursted or sustained. EBS disks are even explicitly documented to work this way. This surprise is often the first thing a disk benchmark reveals — the throughput documented in ads is often the "bursted" one, i.e. the peak bandwidth the disk dies would show when measured under 100% refined circumstances. But once the workload lasts longer than a few seconds or becomes "random," background activity starts inside the disk and the resulting speed drops. So when benchmarking a disk it's often said that one must clearly distinguish between short and sustained workloads and mention which one was used in the test.

iotune, by the way, measures the sustained parameters, mostly because Scylla doesn't expect to exploit the burstable mode, and partially because it's hard to pin this "burst" down.

View the Full Agenda

To see the full breadth of speakers at P99 CONF, check out the published agenda, and register now to reserve your seat. P99 CONF will be held October 6 and 7, 2021, as a free online virtual conference. Register today to attend!

REGISTER FOR FREE

 

 

The post What We’ve Learned after 6 Years of IO Scheduling appeared first on ScyllaDB.

OlaCabs’ Journey with Scylla

OlaCabs is a mobility platform, providing ride sharing services spanning 250+ cities across India, Australia, New Zealand and the UK. OlaCabs is just one part of the overall Ola brand, which also includes OlaFoods, OlaPayment, and an electric vehicle startup, Ola Electric. Founded in 2010, OlaCabs is redefining mobility. Whether your delivery vehicle or ride sharing platform is a motorized bike, auto-rickshaw, metered taxi or cab, OlaCabs’s platform supports over 2.5 million driver-partners and hundreds of millions of consumers.

The variety of vehicles Ola supports reflects the diversity of localized transportation options.

At Scylla Summit 2021 we had the privilege of hosting Anil Yadav, Engineering Manager at Ola Cabs, who spoke about how Ola Cabs was using Scylla’s NoSQL database to power their applications.

OlaCabs began using Scylla in 2016, when it was trying to solve the problem of spiky intraday traffic in its ride-hailing services. Since then they have developed multiple applications that interface with it, including machine learning (ML) for analytics and financial systems.

The team at OlaCabs determined early that they did not require the ACID properties of an RDBMS, and instead needed a high-availability oriented system to meet demanding high throughput, low-latency, bursty traffic.

OlaCabs’ architecture combines Apache Kafka for data streaming from files stored in AWS S3, Apache Spark for machine learning and data pipelining, and Scylla to act as their real-time operational data store.

OlaCabs needs to coordinate data between the demand side of each transaction (the passengers requesting rides) and the supply side (the drivers), matching up as well as possible which drivers and vehicles are best suited for each route. It also makes these real-time decisions based on learning from prior traffic patterns, and even from the behavior of their loyal customers and the experiences of their drivers.

Anil shared some of the key takeaways from his experiences running Scylla in production.

You can watch the full video on demand from Scylla Summit 2021 to hear more about the context of these tips. And you can learn more about OlaCabs' journey with Scylla by reading their case study of how they grew their business as one of the earliest adopters of Scylla.

If you’d like to learn how you can grow your own business using Scylla as a foundational element of your data architecture, feel free to contact us or join our growing community on Slack.



The post OlaCabs’ Journey with Scylla appeared first on ScyllaDB.

Linear Algebra in Scylla

So, we all know that Scylla is a pretty versatile and fast database with lots of potential in real-time applications, especially ones involving time series. We've seen Scylla doing the heavy lifting in the telecom industry, advertising, nuclear research, IoT and banking; this time, however, let's try something out of the ordinary. How about using Scylla for… numerical calculations? Even better – for distributed numerical calculations? Computations on large matrices require enormous amounts of memory and processing power, and Scylla allows us to mitigate both of these issues – matrices can be stored in a distributed cluster, and the computations can be run out-of-place by several DB clients at once. Basically, by treating the Scylla cluster as RAM, we can operate on matrices that wouldn't fit on a hard disk!

But Why?

Linear algebra is a fundamental area of mathematics that many fields of science and engineering are based on. Its numerous applications include computer graphics, machine learning, ranking in search engines, physical simulations, and a lot more. Fast solutions for linear algebra operations are in demand and, since combining this with NoSQL is uncharted territory, we decided to start a research project and implement a library that could perform such computations using Scylla.

However, we didn’t want to design the interface from scratch but, instead, base it on an existing one – no need to reinvent the wheel. Our choice was BLAS, due to its wide adoption. Ideally, our library would be a drop-in replacement for existing codes that use OpenBLAS, cuBLAS, GotoBLAS, etc.

But what is “BLAS”? This acronym stands for “Basic Linear Algebra Subprograms” and means a specification of low-level routines performing the most common linear algebraic operations — things like dot product, matrix-vector and matrix-matrix multiplication, linear combinations and so on. Almost every numerical code links against some implementation of BLAS at some point.

The crown jewel of BLAS is the general matrix-matrix multiplication (gemm) function, which we chose to start with. Also, because the biggest real-world problems seem to be sparse, we decided to focus on sparse algebra (that is: our matrices will mostly consist of zeroes). And — of course! — we want to parallelize the computations as much as possible. These constraints have a huge impact on the choice of representation of matrices.

Representing Matrices in Scylla

In order to find the best approach, we tried to emulate several popular representations of matrices typically used in RAM-based calculations/computations. Note that in our initial implementations, all matrix data was to be stored in a single database table.

Most formats were difficult to reasonably adapt for a relational database due to their reliance on indexed arrays or iteration. The most promising and the most natural of all was the dictionary of keys.

Dictionary of Keys

Dictionary of keys (or DOK in short) is a straightforward and memory-efficient representation of a sparse matrix where all its non-zero values are mapped to their pairs of coordinates.

CREATE TABLE zpp.dok_matrix (
    matrix_id int,
    pos_y bigint,
    pos_x bigint,
    val double,
    PRIMARY KEY (matrix_id, pos_y, pos_x)
);

NB. A primary key of ((matrix_id, pos_y), pos_x) could be a slightly better choice from the clustering point of view but at the time we were only starting to grasp the concept.

The Final Choice

Ultimately, we decided to represent the matrices as:

CREATE TABLE zpp.matrix_{id} (
    block_row bigint,
    block_column bigint,
    row bigint,
    column  bigint,
    value   {type},
    PRIMARY KEY ((block_row, block_column), row, column)
);

The first visible change is our choice to represent each matrix as a separate table. The decision was mostly ideological as we simply preferred to think of each matrix as a fully separate entity. This choice can be reverted in the future.

In this representation, matrices are split into blocks. Row and column-based operations are possible with this schema but time-consuming. Zeroes are again not tracked in the database (i.e. no values less than a certain epsilon should be inserted into the tables).
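
Because each block is one partition, fetching a block is a single-partition query. A minimal sketch against the schema above (the matrix id and block coordinates are placeholders):

SELECT row, column, value
FROM zpp.matrix_1
WHERE block_row = 3 AND block_column = 1;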

Alright, but what are blocks?

Matrix Blocks and Block Matrices

Instead of fetching a single value at a time, we’ve decided to fetch blocks of values, a block being a square chunk of values that has nice properties when it comes to calculations.

Definition:

Let us assume there is a global integer n, which we will refer to as the “block size”. A block of coordinates (x, y) in matrix A = [ a_{ij} ] is defined as a set of those a_{ij} which satisfy both:

(x - 1) * n < i <= x * n

(y - 1) * n < j <= y * n

Such a block can be visualized as a square-shaped piece of a matrix. Blocks are non-overlapping and form a regular grid on the matrix. Mind that the rightmost and bottom blocks of a matrix whose size is indivisible by n may be shaped irregularly. To keep things simple, for the rest of the article we will assume that all our matrices are divisible into n×n-sized blocks.

NB. It can easily be shown that this assumption is valid, i.e. a matrix can be artificially padded with (untracked) zeroes to obtain a desired size and all or almost all results will remain the same.
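
To make the definition concrete: with n = 500, the value at coordinates (i, j) = (1234, 42) satisfies 1000 < 1234 <= 1500 and 0 < 42 <= 500, so it belongs to block (3, 1).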

A similar approach has been used to represent vectors of values.

It turns out that besides being good for partitioning a matrix, blocks can play an important role in computations, as we show in the next few paragraphs.

Achieving Concurrency

For the sake of clarity, let us recall the definitions of parallelism and concurrency, as we are going to use these terms extensively throughout the following chapters.

  • Concurrency: A condition that exists when at least two threads are making progress. A more generalized form of parallelism that can include time-slicing as a form of virtual parallelism.
  • Parallelism: A condition that arises when at least two threads are executing simultaneously.

Source: Defining Multithreading Terms (Oracle)

In the operations of matrix addition and matrix multiplication, as well as many others, each cell of the result matrix can be computed independently, as by definition it depends only on the original matrices. This means that the computation can obviously be run concurrently.

(Matrix multiplication. Source: Wikipedia)

Remember that we are dealing with large, sparse matrices? Sparse – meaning that our matrices are going to be filled mostly with zeroes. Naive iteration is out of the question!

It turns out that bulk operations on blocks give the same results as direct operations on values (mathematically, we could say that matrices of values and matrices of blocks are isomorphic). This way we can benefit from our partitioning strategy, downloading, computing and uploading entire partitions at once, rather than single values, and performing computation only on relevant values. Now that’s just what we need!
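
Concretely, if we view A and B as matrices whose entries are n×n blocks, then block (X, Y) of the product C = AB is the sum over K of A_XK * B_KY, i.e. the same formula as for scalar entries, just with scalar multiplication and addition replaced by block multiplication and block addition.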

Parallelism

Let’s get back to matrix multiplication. We already know how to perform the computation concurrently: we need to use some blocks of given matrices (or vectors) and compute every block of the result.

So how do we run the computations in parallel?

Essentially, we introduced three key components:

  • a scheduler
  • Scylla-based task queues
  • workers

To put the idea simply: we not only use Scylla to store the structures used in computations, but also “tasks”, representing computations to be performed, e.g. “compute block (i, j) of matrix C, where C = AB; A and B are matrices”. A worker retrieves one task at a time, deletes it from the database, performs a computation and uploads the result. We decided to keep the task queue in Scylla solely for simplicity: otherwise we would need a separate component like Kafka or Redis.
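
Purely as an illustration of the idea (this is not the library's actual schema), such a task queue could be modeled as a table along these lines:

CREATE TABLE zpp.multiplication_tasks (
    result_matrix_id int,
    block_row bigint,
    block_column bigint,
    PRIMARY KEY (result_matrix_id, block_row, block_column)
);

Each row would mean "compute block (block_row, block_column) of the given result matrix"; a worker picks a row, deletes it, computes the block and uploads the result.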

The scheduler is a class exposing the library’s public interface. It splits the mathematical operations into tasks, sends them to the queue in Scylla and monitors the progress signaled by the workers. Ordering multiplication of two matrices A and B is as simple as calling an appropriate function from the scheduler.

As you may have guessed already, workers are the key to parallelisation – there can be arbitrarily many of them and they’re independent from each other. The more workers you can get to run and connect to Scylla, the better. Note that they do not even have to operate all on the same machine!

A Step Further

With parallelism in hand, we had little to no trouble implementing most of the BLAS operations, as almost every BLAS routine can be split into a set of smaller, largely independent subproblems. Of all our implementations, the equation system solver (for triangular matrices) was the most difficult one.

Normally, such systems can easily be solved iteratively: each value xi of the result can be computed from the values ai1, …, aii, x1, …, xi-1 and bi as

xi = (bi - ai1 * x1 - … - ai,i-1 * xi-1) / aii

This method doesn’t allow for concurrency in the way we described before.

Conveniently, mathematicians have designed different methods of solving equation systems, which can be run in parallel. One of them is the so-called Jacobi method, which is essentially a series of matrix-vector multiplications repeated until an acceptable result is obtained.
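
In its standard textbook form (this is the general method, not code from our library), one splits A into its diagonal D and the remaining part R = A - D, and then iterates

x(k+1) = D^-1 * (b - R * x(k))

until successive iterates stop changing by more than some tolerance. Each iteration is exactly the kind of matrix-vector multiplication we already know how to distribute.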

One downside of this approach is that in some cases this method may never yield an acceptable result; fortunately for us, this shouldn’t happen too often for inputs representing typical real-life computational problems. The other is that the results we obtain are only (close) approximations of the exact solutions. We think it’s a fair price for the benefit of scalability.

Benchmarks

To measure the efficiency of our library, we performed benchmarks on the AWS cloud.

Our test platform consisted of:

  1. Instances for workers – 2 instances of c5.9xlarge:
    • 2 × (36 CPUs, 72GiB of RAM)
    • Running 70 workers in total.
  2. Instances for Scylla – 3 instances of i3.4xlarge:
    • 3 × (16 CPUs, 122GiB of RAM, 2×1,900GB NVMe SSD storage)
    • Network performance of each instance up to 10 Gigabit.

We have tested matrix-matrix multiplication with a variable number of workers.

The efficiency scaled well, at least up to a certain point, where database access became the bottleneck. Keep in mind that in this test we did not change the number of Scylla’s threads.

Matrix-matrix multiplication (gemm), time vs input size. Basic gemm implementations perform O(n³) operations, and our implementation does a good job of keeping the same time complexity.

Our main accomplishment was a multiplication of a big matrix (10^6×10^6, with 1% of the values not being 0) and a dense vector of length 10^6.

Such a matrix, stored naively as an array of values, would take up about 3.6TiB of space whereas in a sparse representation, as a list of <block_x, block_y, x, y, value>, it would take up about 335GiB.
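
Those figures are consistent with a quick back-of-the-envelope check (assuming 4-byte floating-point values and 8-byte integer coordinates, which is our assumption here rather than something stated above): dense storage needs 10^12 values × 4 B ≈ 3.6 TiB, while the sparse representation needs about 10^10 non-zero entries × 36 B each (4 × 8 B for the coordinates plus 4 B for the value) ≈ 335 GiB.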

The block size used in this example was 500×500. The test concluded successfully – the calculations took about 39 minutes.

Something Extra: Wrapper for cpp-driver

To develop our project we needed a driver to talk to Scylla. The official scylla-cpp-driver (https://github.com/scylladb/cpp-driver), despite its name, only exposes a C-style interface, which is really inconvenient to use in modern C++ projects. In order to avoid mixing C-style and C++-style code, we developed a wrapper for this driver that exposes a much more user-friendly interface, thanks to the RAII idiom, templates, parameter packs, and other mechanisms.

Wrapper’s code is available at https://github.com/scylla-zpp-blas/scylla_modern_cpp_driver.

Here is a short comparison of code required to set up a Scylla session. On the left cpp-driver, on the right our wrapper.

Summary

This research project was conducted jointly by ScyllaDB and the University of Warsaw's Faculty of Mathematics, Informatics and Mechanics, and constituted the BSc thesis of the authors. It was a bumpy but fun ride, and Scylla did not disappoint us: we built something so scalable that, although still merely a tidbit, it could bring new quality to the world of sparse computational algebra.

The code and other results of our work are available for access at the GitHub repository:

https://github.com/scylla-zpp-blas/linear-algebra

No matter whether you do machine learning, quantum chemistry, finite element method, principal component analysis or you are just curious — feel free to try our BLAS in your own project!

CHECK OUT THE SCYLLA BLAS PROJECT ON GITHUB

The post Linear Algebra in Scylla appeared first on ScyllaDB.

Changes to Incremental Repairs in Cassandra 4.0

What Are Incremental Repairs? 

Repairs allow Apache Cassandra users to fix inconsistencies in writes between different nodes in the same cluster. These inconsistencies can happen when one or more nodes fail. Because of Cassandra’s peer-to-peer architecture, Cassandra will continue to function (and return correct results within the promises of the consistency level used), but eventually, these inconsistencies will still need to be resolved. Not repairing, and therefore allowing data to remain inconsistent, creates significant risk of incorrect results arising when major operations such as node replacements occur. Cassandra repairs compare data sets and synchronize data between nodes. 

Cassandra has a peer-to-peer architecture. Each node is connected to all the other nodes in the cluster, so there is no single point of failure in the system should one node fail. 

As described in the Cassandra documentation on repairs, full repairs look at all of the data being repaired in the token range (in Cassandra, partition keys are converted to a token value using a hash function). Incremental repairs, on the other hand, look only at the data that’s been written since the last incremental repair. By using incremental repairs on a regular basis, Cassandra operators can reduce the time it takes to complete repairs. 

A History of Incremental Repairs

Incremental repairs have been a feature of Apache Cassandra since the release of Cassandra 2.2. At the time 2.2 was released, incremental repairs were made the default repair mechanism for Cassandra. 

But by the time of the release of Cassandra 3.0 and 3.11, incremental repairs were no longer recommended to the user community due to various bugs and inconsistencies. Fortunately, Cassandra 4.0 has changed this by resolving many of these bugs.

Resolving Cassandra Incremental Repair Issues

One such bug was related to how Cassandra marks which SSTables have been repaired (ultimately this bug was addressed in Cassandra-9143). This bug would result in overstreaming, which would essentially plug communication channels in Cassandra and slow down the entire system.

Another fix was in Cassandra-10446. This allows for forced repairs even when some replicas are down. It is also now possible to run incremental repairs when nodes are down (this was addressed in Cassandra-13818).

Other changes came with Cassandra-14939, including:

  • The user can see if pending repair data exists for a specific token range 
  • The user can force the promotion or demotion of data for completed sessions rather than waiting for compaction 
  • The user can get the most recent repairedAt timestamp for a specific token range

Incremental repairs are the default repair option in Cassandra. To run incremental repairs, use the following command:

nodetool repair

If you want to run a full repair instead, use the following command:

nodetool repair --full

Similar to Cassandra 4.0 diagnostic repairs, incremental repairs are intended primarily for nodes in a self-support Cassandra cluster. If you are an Instaclustr Managed Cassandra customer, repairs are included as a part of your deployment, which means you don’t need to worry about day-to-day repair tasks like this. 

In Summary

While these improvements should greatly increase the reliability of incremental repairs, we recommend a cautious approach to enabling them in production, particularly if you have been running subrange repair on an existing cluster. Repairs are a complex operation, and the impact of different approaches can depend significantly on the state of your cluster when you start the operation and even on the data model that you are using. To learn more about Cassandra 4.0, contact our Cassandra experts for a free consultation or sign up for a free trial of our Managed Cassandra service.

The post Changes to Incremental Repairs in Cassandra 4.0 appeared first on Instaclustr.

Deploying Scylla Operator with GitOps

There are many approaches to deploying applications on Kubernetes clusters. As there isn't a clear winner, scylla-operator tries to support a wide range of deployment methods. With Scylla Operator 1.1 we brought support for Helm charts (you can read more about how to use and customize them here), and since Scylla Operator 1.2 we are also publishing the manifests for deploying either manually or with GitOps automation.

What is GitOps?

GitOps is a set of practices to declaratively manage infrastructure and application configurations using Git as the single source of truth. (You can read more about GitOps in this Red Hat article.) A Git repository contains the entire state of the system making any changes visible and auditable.

In our case, the Kubernetes manifests represent the state of the system and are kept in a Git repository. Admins either apply the manifests with kubectl or use tooling that automatically reconciles the manifests from Git, like ArgoCD.

Deploying Scylla Operator from Kubernetes Manifests

In a GitOps flow, you’d copy over the manifests into an appropriate location in your own Git repository, but for the purposes of this demonstration, we’ll assume you have checked out the https://github.com/scylladb/scylla-operator repository at v1.4.0 tag and you are located at its root.

Prerequisites

Local Storage

ScyllaClusters use Kubernetes local persistent volumes exposing local disk as PersistentVolumes (PVs). There are many ways to set it up. All the operator cares about is a storage class and a matching PV being available. A common tool for mapping local disks into PVs is the Local Persistence Volume Static Provisioner. For more information have a look at our documentation. For testing purposes, you can use minikube that has an embedded dynamic storage provisioner.

(We are currently working on a managed setup by the operator, handling all this hassle for you.)

Cert-manager

Currently, the internal webhooks require the cert-manager to be installed. You can deploy it from the official manifests or use the snapshot in our repository:

$ kubectl apply -f ./examples/common/cert-manager.yaml

# Wait for the cert-manager to be ready.
$ kubectl wait --for condition=established crd/certificates.cert-manager.io crd/issuers.cert-manager.io
$ kubectl -n cert-manager rollout status deployment.apps/cert-manager-webhook

Deploying the ScyllaOperator

The deployment manifests are located in the ./deploy/operator and ./deploy/manager folders. For the first deployment, you'll need to deploy the manifests in sequence as they have interdependencies, like scylla-manager needing a ScyllaCluster, or CRDs having to be established (propagated to all apiservers) first. The following instructions will get you up and running:

$ kubectl apply -f ./deploy/operator

# Wait for the operator deployment to be ready.
$ kubectl wait --for condition=established crd/scyllaclusters.scylla.scylladb.com
$ kubectl -n scylla-operator rollout status deployment.apps/scylla-operator

$ kubectl apply -f ./deploy/manager/prod

# Wait for the manager deployment to be ready.

$ kubectl -n scylla-manager rollout status deployment.apps/scylla-manager
$ kubectl -n scylla-manager rollout status deployment.apps/scylla-manager-controller
$ kubectl -n scylla-manager rollout status statefulset.apps/scylla-manager-cluster-manager-dc-manager-rack

Customization

Customization is the beauty of using Git; you can change essentially anything in a manifest and keep your changes as a patch commit. Changing everything is probably not the best idea if you want to keep a supported deployment, but changing things like the log level or adjusting the ScyllaCluster resources for scylla-manager makes sense.

Summary

We are constantly trying to make deploying Scylla Operator easier, and the manifests allow you to do that without any extra tooling, using just kubectl and Git, or by hooking them into your GitOps automation.

For those using Operator Lifecycle Manager (OLM), we are also planning to ship an OLM bundle and publish it on the operatorhub.io so stay tuned.

LEARN MORE ABOUT SCYLLA OPERATOR

DOWNLOAD SCYLLA OPERATOR

The post Deploying Scylla Operator with GitOps appeared first on ScyllaDB.

Project Circe August Update

Project Circe is ScyllaDB’s year-long initiative to improve Scylla consistency and performance. Today we’re sharing our updates for the month of August 2021.

Toward Scylla “Safe Mode”

Scylla is a very powerful tool, with many features and options we’ve added over the years, some of which we modeled after Apache Cassandra and DynamoDB, and others that are unique to Scylla. In many cases, these options, or a combination of them, are not recommended to run in production. We don’t want to disable or remove them, as they are already in use, but we also want to move users away from them. That’s why we’re introducing Scylla Safe Mode.

Safe Mode is a collection of restrictions that make it harder for the user to use non-recommended options in production.

Some examples of Safe Mode that we added to Scylla in the last month:

More Safe Mode restrictions are planned.

Performance Tests

We are constantly improving Scylla performance, but other projects are, of course, doing the same. So it’s interesting to run updated benchmarks comparing performance and other attributes. In a recent 2-part blog series we compared the performance of Scylla with the latest release of its predecessor, Apache Cassandra.

Raft

We continue our initiative to combine strong consistency and high performance with Raft. Some of the latest updates:

  • Raft now has its own experimental flag in the configuration file: "experimental: raft"
  • The latest Raft pull request (PR) adds a Group 0 sub-service, which includes all the members of the cluster, and allows other services to update topology changes in a consistent, linearizable way.
  • This service brings us one step closer to strongly consistent topology changes in Scylla.
  • Follow-up services will bring consistent schema changes, and later consistent data changes (transactions).

20% Projects

You might think that working on a distributed database in cutting edge C++ is already a dream job for most developers, but the Scylla dev team allocates 20% of their time to personal projects.

One such cool project is Piotr Sarna's PR adding WebAssembly support to user-defined functions (UDF). While still in very early stages, this has already initiated an interesting discussion in the comment thread.

User-defined functions have been an experimental feature in Scylla since the 3.3 release. We originally supported Lua functions, and have now extended support to WASM. More languages can be added in the future.

Below is an example taken from the PR, of a CQL command to create a simple WASM fibonacci function:


CREATE FUNCTION ks.fibonacci (str text)
    CALLED ON NULL INPUT
    RETURNS boolean
    LANGUAGE wasm
    AS ' (module
        (func $fibonacci (param $n i32) (result i32)
            (if
                (i32.lt_s (local.get $n) (i32.const 2))
                (return (local.get $n))
            )
            (i32.add
                (call $fibonacci (i32.sub (local.get $n) (i32.const 1)))
                (call $fibonacci (i32.sub (local.get $n) (i32.const 2)))
            )
        )
        (export "fibonacci" (func $fibonacci))
    ) '

You can hear more about the great potential of UDF in Scylla in a talk by Avi.

Some Cool Additions to Git Master

These updates will be merged into upcoming Scylla releases, primarily Scylla Open Source 4.6:

  • Repair-based node operations are now enabled by default for the replace node operation. Repair-based node operations use repair instead of streaming to transfer data, making them resumable and more robust (but slower). A new parameter defines which node operations use repair. (Learn more)
  • User-Defined Aggregates (UDA) have been implemented. Note that UDAs are based on User-Defined Functions (UDF), which are still an experimental feature.
  • If Scylla stalls while reclaiming memory, it will now log memory-related diagnostics so it is easier to understand the root cause.
  • After adding a node, a cleanup process is run to remove data that was copied to the new node. This is a compaction process that compacts only one SSTable at a time. This fact was used to optimize cleanup. In addition, the check for whether a partition should be removed during cleanup was also improved.
  • When Scylla starts up, it checks if all SSTables conform to the compaction strategy rules, and if not, it reshapes the data to make it conformant. This helps keep reads fast. It is now possible to abort the reshape process in order to get Scylla to start more quickly.
  • Scylla uses reader objects to read sequential data. It caches those readers so they can be reused across multiple pages of the result set, eliminating the overhead of starting a new sequential read each time. However, this optimization was missed for internal paging used to implement aggregations (e.g. SUM(column)). Scylla now uses the optimization for aggregates too.
  • The row cache behavior was quadratic in certain cases where many range tombstones were present. This has been fixed.
  • The installer now offers to set up RAID 5 on the data disks in addition to RAID 0; this is useful when the disks can have read errors, such as on GCP local disks.
  • The install script now supports supervisord in addition to systemd. This was brought in from the container image, where systemd is not available, and is useful in some situations where root access is not available.
  • The limit of 10,000 connections per shard has been raised to 50,000 connections per shard, and made tunable.
  • The Docker image base has been switched from CentOS 7 to Ubuntu 20.04 (like the machine images), as CentOS 7 is getting old.
  • The SSTableloader tool now supports Zstd compression.
  • There is a new type of compaction: validation. A validation compaction will read all SSTables and perform some checks, but write nothing. This is useful to make sure all SSTables can be read and pass sanity checks.
  • SSTable index files are now cached, both at the page level and at an object level (index entry). This improves large partition workloads as well as intermediate size workloads where the entire SSTable index can be cached.
  • It was found that the very common single-partition query was treated as an IN query with a 1-element tuple. This caused extra work to be done (to post-process the IN query). We now specialize for this common case and avoid the extra post-processing work.

Monitoring News

Scylla Monitoring Stack continues to move forward fast.

We continue to invest in Scylla Advisor, which takes information from Scylla and OS-level metrics (via Prometheus) and logs (via Loki), combining them using policy rules to advise users on what to look at in a production system.

For example, Scylla Monitoring Stack 3.8 now warns about prepared-statement cache eviction.

Other August Releases

Just Plain Cool

A new Consistency Level Calculator helps you understand the impact of choosing different replication factors and consistency levels with Scylla.
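
For instance (a standard consistency-level calculation, not something specific to the calculator itself): with a replication factor of 3, QUORUM means floor(3/2) + 1 = 2 replicas, so writing and reading at QUORUM gives 2 + 2 > 3, meaning the write and read replica sets always overlap and a read is guaranteed to see the latest acknowledged write.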

The post Project Circe August Update appeared first on ScyllaDB.