Scylla Student Projects, Part I: Parquet

In 2019, ScyllaDB sponsored a program for Computer Science students organized by the University of Warsaw. Throughout the whole academic year, three teams of undergraduate students collaborated with and learned from ScyllaDB engineers to bring new features to Scylla and its underlying Seastar engine. The projects picked for the 2019 edition were:

  • Parquet support for Seastar and Scylla
  • SeastarFS: an asynchronous userspace file system for Seastar
  • Kafka client for Seastar and Scylla

We’re pleased to announce that the cooperation was very successful and we look forward to taking part in future editions of the program! Now, let’s see some details on the results of the first project on the list: Parquet support for Seastar and Scylla. This work is all to the credit of the students who wrote it, Samvel Abrahamyan, Michał Chojnowski, Adam Czajkowski and Jacek Karwowski, and their supervisor, Dr. Robert Dąbrowski.

Introduction

Apache Parquet is a well known columnar storage format, incorporated into Apache Arrow, Apache Spark SQL, Pandas and other projects. In its columns, it can store simple types as well as complex nested objects and data structures. Representing the data as columns brings interesting advantages over the classic row-based approach:

  • fetching specific columns from a table requires less I/O, since no redundant values from other columns are read from disk
  • values from a column are often the same or similar to each other, which increases the efficiency of compression algorithms
  • interesting data encoding schemes, like bit-packing integers, can be easily applied
  • more complex operations, like aggregating all values from a single column,
    can be implemented more efficiently (e.g. by leveraging vectorized CPU instructions)

An example of the Parquet file format, showing how it can optimize based on repeated values in columnar data.

Scylla uses SSTables as its native storage format, but we’re interested in allowing our users to pick another format — like Parquet — for certain workloads. That was the main motivation for pursuing this student project.

How to integrate ScyllaDB with Parquet?

Parquet is open-source, very popular and broadly used by many projects and companies, so why not use an existing C++ library and plug it right into Scylla? The short answer is “latency.”

Scylla is built on top of Seastar, an asynchronous high-performance C++ framework. Seastar was created in accordance with the shared-nothing principle and it has its own nonblocking I/O primitives, schedulers, priority groups and many other mechanisms designed specifically to ensure low latency and optimal hardware utilization. In the Seastar world, issuing a blocking system call (like read()) is an unforgivable mistake and a performance killer. That also means that many libraries which rely on traditional, blocking system calls (used without care) would create such performance regressions when used in a Seastar-based project — and Parquet’s C++ implementation was no exception.

There are multiple ways of adapting libraries for Seastar, but in this case the simplest answer turned out to be the best — let’s write our own! Parquet is well documented and its specification is quite short, so it was a great fit for a team of brave students to try and implement it from scratch in Seastar.

Implementing parquet4seastar

Spoiler alert: the library is already implemented and it works!

https://github.com/michoecho/parquet4seastar

The first iteration of the project was an attempt to simply copy the whole code from Arrow’s repository and replace all I/O calls with ones compatible with Seastar. That also meant rewriting everything to Seastar’s future/promise model, which is a boring and mechanical task, but also easy to do. Unfortunately, it quickly turned out that the Parquet implementation in Apache Arrow has quite a lot of dependencies within Arrow itself. Thus, in order to avoid rewriting more and more lines, a decision was made: let’s start over, take the Parquet documentation and write a simple library for reading and writing Parquet files, built from scratch on top of Seastar.

Other advantages of this approach cited by the students: by writing it from scratch, they would avoid carrying over any technical debt and minimize the number of lines of code added to the existing code base, and, most of all, they thought it would be more fun!

A block diagram of how parquet4seastar and parquet2cql were designed to interact with the Scylla database.

The library was written using state-of-the-art Seastar practices, which means that measures have been taken to maximize the performance while keeping the latencies low. The performance tests indicated that the reactor stalls all came from external compression libraries – which, of course, can be rewritten in Seastar as well.

We were also pleased to discover that Parquet’s C++ implementation in Apache Arrow comes with a comprehensive set of unit tests – which were adjusted for parquet4seastar and used for ensuring that our reimplementation is at least as correct as the original.

Still, our main goal was to make the library easy to integrate with existing Seastar projects, like Scylla. As a first step and a proof-of-concept for the library, a small application which reads Parquet files and translates them into CQL queries was created.

parquet2cql

parquet2cql is a small demo application which shows the potential of the parquet4seastar library. It reads Parquet files from disk, takes a CQL schema for a specific table and spits out CQL queries, ready to be injected into ScyllaDB via cqlsh or any CQL driver. Please find a cool graph which shows how parquet2cql works below. `p4s` stands for `parquet4seastar`.

parquet2cql can be used as a crude way of loading Parquet data straight into Scylla, but it’s still only a demo application – e.g. it does not support CQL prepared statements, which would make the process much more efficient. For those interested in migrating Parquet data to Scylla clusters, there’s a way to ingest Parquet files using our Scylla Spark Migrator.
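To give a feel for what “spitting out CQL queries” means in practice, here is a sketch of the kind of statements parquet2cql can emit. The table, column names and values below are made up for illustration and are not taken from the project’s test data; the real output depends on the Parquet file and the CQL schema you supply.

CREATE TABLE demo.readings (id bigint PRIMARY KEY, sensor text, value double);

INSERT INTO demo.readings (id, sensor, value) VALUES (1, 'sensor-a', 21.5);
INSERT INTO demo.readings (id, sensor, value) VALUES (2, 'sensor-b', 19.75);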

Integration with ScyllaDB

Allowing ScyllaDB to store its data directly in Parquet instead of the classic SSTable format was way out of scope for this project, but, nonetheless, a proof-of-concept demo which stores SSTable data files not only in the native MC format, but also in Parquet, was performed successfully! The implementation assumed that no complex types (lists, sets) were present in the table. This experiment allowed us to compare the performance and storage overhead of using Parquet vs. the SSTable MC format for various workloads. Here’s a diagram showing how the experiment was performed:

Results

The project was not only about coding – a vast part of it was running various correctness and performance tests and comparisons. One of the tests checked whether parquet4seastar library is faster than its older brother from Apache Arrow. Here are the sample results:

Reading time of parquet4seastar relative to Apache Arrow (less means that parquet4seastar was faster).

The results indicate that parquet4seastar is generally similar to Apache Arrow in terms of execution time (with the exception of the short strings scenario, which is a result of a design decision; please find more details in the paper below). The results are promising, because they mean that providing much better latency guarantees by using nonblocking I/O and the future/promise model did not result in any execution time overhead. Aside from comparing the library against Apache Arrow, many more test scenarios were run – measuring reactor stalls, comparing the sizes of SSTables stored in native MC format vs. in Parquet, etc.

Here we have a comparison of the disk usage between Parquet and SSTables. The chart above shows the results of the first test, in which the students inserted a million rows of random strings, with some values duplicated. The horizontal axis shows the number of duplicates for each value and the vertical axis shows the total size of the files. You can see that in this test Parquet is more efficient.

In this example the students tested a table with multiple NULL values, typical of a sparse data set. The horizontal axis shows the number of randomly selected columns that are not NULL and instead have a random value. In this case you can see that SSTables are a better format when most of the columns are NULL.

From these tests it can be concluded that with Parquet you can achieve significant disk space savings when the data is not null, but the number of unique values is not very large.

The Paper

Now, the coolest part. Each project was also a foundation for the Bachelor’s thesis of the students who took part in it. The thesis has already been reviewed and accepted by the University of Warsaw and is public to read. You can find a detailed description of the design, goals, performed tests and results in this document: zpp_parquet.pdf. We’re very proud of contributing to the creation of this academic paper – congrats to all brand new BSc degree holders! We are definitely looking forward to continuing our cooperation with the students and the faculty of the University of Warsaw in the future. Happy reading!

READ THIS: APACHE PARQUET SUPPORT FOR SCYLLA

The post Scylla Student Projects, Part I: Parquet appeared first on ScyllaDB.

Stadia Maps: Using Scylla to Serve Maps in Milliseconds

Ian Wagner and Luke Seelenbinder were consulting with a perfectionist client who demanded an aesthetically beautiful map for their website. It also needed to render lightning fast and be easy to configure. On top of that, Ian and Luke saw how even small business clients like theirs could spend as much as $700 a month for existing services from the global mapping giants. To address these deficiencies they founded Stadia Maps, with the goal of driving down prices using open mapping programs, combined with a far more satisfying look-and-feel and an overall better user experience.

Having accomplished what they set out to do after launching in 2017, their growth and success led to a new requirement: the ability to scale their systems to deliver maps on a global basis. The maps they generate come from a PostGIS service using OpenStreetMap geographical data. This system produces map tiles, comprised of either line-based vector map data, known as Mapbox Vector Tiles (MVTs), or alternatively, a bitmap rendering of that same area.

To scale this service globally, Stadia Maps does not serve all of these maps directly out of their PostGIS service. Instead they have a pipeline, with PostGIS providing these map tiles to Scylla. Scylla globally distributes these tiles and serves them out for rendering on the client.

How many tiles does it take to map the world?

This all depends on the zoom level you need to support. Zoom level 0 is a map of the entire world on a single map tile. Each level of zoom quadruples the number of tiles, so zoom level z requires 4^z tiles: the first zoom level has four tiles, the second sixteen, and so on. Thus, the nineteenth level of zoom, Z19, would be the equivalent of 4^19 tiles (roughly 275 billion).

It is impractical and inefficient to pre-render that many tiles, especially considering that 70% of the planet’s surface is water (making those tiles merely the same uniform blue). Instead, Stadia Maps pre-renders tiles to the Z14 scale — the equivalent of about a 1:35,000 scale map, requiring 268 million tiles to map the entire world — and produces the rest as needed in an on-demand fashion.

An example of a Stadia Map for Tallinn, Estonia at the Z14 level of zoom.

If a map object is already stored in Scylla, it is served out of Scylla directly. In this case, Scylla is acting like a persistent cache. However, if the object is not already in Scylla, it will make a request to the PostGIS system to render the desired tile, and then store it for future queries.

Each of these tiles is compressed before being stored in Scylla. A vector-based MVT tile might be ~50kb of data uncompressed, but can typically be shrunk down to 10-20kb for storage. A plain water tile can be shrunk even more — to as little as 1kb.

Raster data (bitmapped images) take up more space — about 20-50kb per object. However, these are not cached indefinitely; Stadia Maps evicts them based on a Time-to-Live (TTL) to keep their total data size under management relatively steady.
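In CQL, this kind of TTL-based eviction can be expressed as a per-table default. The sketch below is only illustrative; the keyspace, table, columns and the 30-day value are assumptions, not Stadia Maps’ actual schema:

CREATE TABLE tiles.raster (
    z int,
    x int,
    y int,
    png blob,
    PRIMARY KEY ((z, x, y))
) WITH default_time_to_live = 2592000;  -- expire raster tiles roughly 30 days after they are written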

Moving from CockroachDB to Scylla

When Stadia Maps first set out to deliver these map tiles at a global scale, they began with the distributed SQL database CockroachDB. However, they realized their use case did not need its strong consistency model, and that Cockroach had no way to adjust consistency settings. So what is generally considered one of Cockroach’s strengths was, for Stadia Maps, actually a hindrance. Instead of strongly consistent data (which introduces latencies), users wanted the maps to be rendered fast. With long-tail latencies of one second per tile, CockroachDB’s behavior was simply too slow. Users were impatiently watching a map painted before their eyes tile-by-tile in real time. That is, if it was served at all. Users were also getting hit with HTTP 503 timeout errors. Misbehaving Cockroach servers pushed to their limits were found to be the root cause of the problem.

Also, because most tiles were only requested once a month, or even once a year, Stadia Maps had a problem trying to implement a Least Recently Used (LRU) algorithm. For them, a Time To Live (TTL) was a more natural fit for data eviction.

This was when Stadia Maps decided to move to Scylla and its eventually consistent global distribution capabilities. For example a user planning a business trip to London from, say, Tokyo, would prefer that their map of London was served out close to them, from an Asian datacenter, for speed of response. If they had to fetch data about London from a UK-based server, that would necessitate long global round trips.

Long-tail latencies, known as p95 (and p99), are vitally important. This is the latency threshold below which 95% of the responses (or 99%) are made. Because many map tiles can be returned for even a single server query, mathematically it means there was a significant chance one or more tiles might appear late. As Luke notes, “If even one tile doesn’t load, that looks really bad.”

Stadia Maps’ end-to-end latency requirement is 250 ms for p95 — about the same as the blink of an eye. That includes everything, from making the request at a user’s browser, to acknowledging that request at the Stadia Maps API, processing the request, fetching the data from the database, then returning it and rendering it in the browser for the user. The database itself has only a fraction of that time to respond — less than 10 ms for the worst case.

To be able to meet this service level, Stadia Maps has three primary datacenters around the world. All data is replicated to all datacenters. However, the consistency requirements for that data are minimal — queries use a consistency level of ONE (CL=1). If a user is served a stale tile, it is usually only a few hours old. The likelihood of there being much change in those few hours since the last update is minimal.
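A minimal sketch of what such a setup looks like in CQL, reusing the hypothetical tiles.raster table from above; the keyspace name, data center labels, replication factors and tile coordinates are placeholders, not Stadia Maps’ actual configuration:

CREATE KEYSPACE tiles WITH replication = {
    'class': 'NetworkTopologyStrategy',
    'us_east': 3, 'eu_central': 3, 'ap_northeast': 3
};

-- in cqlsh, reads can then be issued at the lowest consistency level:
CONSISTENCY ONE;
SELECT png FROM tiles.raster WHERE z = 14 AND x = 9620 AND y = 4947;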

If there isn’t a copy of the tile in the local cluster, it will then try to fetch that tile from the other Scylla clusters, if it exists there. If that fails, it will make the request to the local PostGIS service to render it, and the result will then be put back into Scylla for future requests. In this latter case the responsiveness is still relatively slow, yet compared to CockroachDB the p99 response time for Scylla dropped by nearly half, to about 500 ms — well within timeout ranges.

Once the decision was made, Stadia Maps’ migration from CockroachDB to Scylla was swift. It only took one month from the first commit of code to the day Scylla was in production. Since moving to Scylla, according to Luke, “We have not found the limit” in regards to serving requests and throughput.

Working Side-by-Side

Scylla is a “greedy” system, generally requiring as much CPU and RAM as the server is capable of providing for fast performance. Yet Stadia Maps configured Scylla with resource limits, with the intention of deploying its map rendering servers side-by-side with Scylla on the same hardware. The map rendering engine requires a great deal of CPU and RAM, but not a lot of storage. Scylla, on the other hand, requires a great deal of storage but, in this case, comparatively little CPU and RAM. By configuring both systems optimally Stadia Maps gets the greatest utility out of their shared hardware resources.

Even now Stadia continues to optimize their deployment. Their application, written in Rust, is continually improved. “We keep trying to cut milliseconds all over the place,” Luke notes. The fewer cycles they need to steal from the raster renderer the better.

Discover Scylla for Yourself

Stadia Maps found Scylla in their research for NoSQL systems that could scale and perform in their demanding environment. We hope what you’ve read here will inspire you to try it out for your own applications and use cases. To get started, download Scylla today and join our open source community.

DOWNLOAD SCYLLA

The post Stadia Maps: Using Scylla to Serve Maps in Milliseconds appeared first on ScyllaDB.

Scylla Open Source Release 4.1.2

The ScyllaDB team announces Scylla Open Source 4.1.2, a bugfix release of the Scylla 4.1 stable branch. Scylla Open Source 4.1.2, like all past and future 4.x.y releases, is backward compatible and supports rolling upgrades.

These release notes also include defects fixed in Scylla Open Source 4.1.1.

Issues fixed in Scylla 4.1.2:

  • CQL: Impossible WHERE condition returns a non-empty slice #5799
  • Docker: add option to start Alternator with HTTPS. When using Scylla Docker, you can now use --alternator-https-port in addition to the existing --alternator-port. #6583
  • UX: scylla-housekeeping, a service which checks for the latest Scylla version, returns a “cannot serialize” error instead of propagating the actual error #6690
  • UX: scylla_setup: on the RAID disk prompt, typing the same disk twice causes a traceback #6711
  • Stability: The concurrency control of the view update generation process was broken: instead of maintaining a constant fixed concurrency as intended, it could either reduce concurrency to almost 0, or increase it towards infinity #6774
  • Stability: Counter write read-before-write is issued with no timeout, which may lead to unbounded internal concurrency if the enclosing write operation timed out. #5069
  • Scylla crashes when CDC Log is queried with a wrong stream_id #6570

Issues fixed in Scylla 4.1.1:

  • CQL: Using LWT (INSERT ... IF NOT EXISTS) with two or more tables in a BATCH may cause a boost::dynamic_bitset assertion #6332
  • Alternator: PutItem should output empty map, not empty JSON, when there are no attributes to return #6568
  • Alternator: missing UnprocessedKeys/Items attribute in BatchGetItem/WriteItems result #6569
  • UX: the CQL transport reports a broken pipe error to the log when a client closes its side of a connection abruptly #5661
  • Storage: SSTable upgrade has a 2x disk space requirement #6682
  • Stability: Scylla freezes when casting a decimal with a negative scale to a float #6720
  • Alternator: modifying write-isolation tag is not propagated to a node of another DC #6513
  • Stability: Stream ended with unclosed partition and coredump #6478
  • Stability: oversized allocation when calling get token range API, for example during a Scylla Manager initiate repair #6297
  • Build: gdb cannot find shared libraries #6494
  • Stability: In a case when using partition or clustering keys which have a representation in memory which is larger than 12.8 KB (10% of LSA segment size), linearization of the large (fragmented) keys may cause undefined behavior #6637

The post Scylla Open Source Release 4.1.2 appeared first on ScyllaDB.

Using Change Data Capture (CDC) in Scylla

Change Data Capture (CDC) allows users to track data updates in their Scylla database. While it is similar to the feature with the same name in Apache Cassandra and other databases, how it is implemented in Scylla is unique, more accessible, and more powerful. We recently added a lesson in Scylla University so users can understand how to get the most out of the changing nature of their data stored in Scylla. This lesson is based on a talk we gave at our Scylla Summit last year.

Overview

So what is Change Data Capture? It’s a record of modifications that happen to one or more tables in the database. We want to know when we write. We want to know when we delete. Obviously this could be done as triggers in an application, but the key feature is that we want to be able to consume this asynchronously, and for the consumer to see the changes from a data standpoint more than an application standpoint.

CDC is one of the key features in Scylla in 2020 and one of the major things that we’re working on.

Some of the prominent use cases include:

  • Fraud detection — you want to analyze transactions in some sort of batch manner to see if credit cards are being used from many places at the same time
  • Streaming Data — you want to plug it into your Kafka pipeline, you want some sort of analysis done on a transaction level and not on an aggregated level
  • Data duplication — you want to mirror a portion of your database without using multi-datacenter replication
  • Data replication — you want to somehow transform the data on a per-transaction basis to some other storage medium

How does Scylla’s CDC work?

CDC is enabled per table. You don’t want to store changes for everything. You want it to be as isolated as possible. For Scylla, that means changes are captured at the granularity of the modifications made to a CQL row.

Optionally, we want to record pre-image data: the current state of a row, if it exists, limited to the columns that are affected by the change. Plus we want to add a log for the modifications made:

  • The pre-image (original state of data before the change)
  • Changes per column (delta)
  • The post-image (final, current state after the change)

Or any combination thereof. We might only want one of the three or all three, etc. It’s all optional. It’s all controlled by you.
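In CQL, these choices map to per-table CDC options. Here is a minimal sketch of enabling them; the keyspace and table name are placeholders, and the option keys shown ('enabled', 'preimage', 'postimage') are the ones used by Scylla’s CDC:

> ALTER TABLE ks.orders WITH cdc = { 'enabled': true, 'preimage': true, 'postimage': true };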

Also, unlike the commitlog-like implementation in Apache Cassandra, CDC in Scylla is just another CQL table. It’s enabled per table and it creates another data table. It is stored, distributed, in the cluster, sharing all the same properties as you’re used to with your normal data. There’s no need for a separate specialized application to consume it.

It’s ordered by the timestamp of the original operation and the associated CQL batch sequence number. (Note that a CQL operation can break down into more than one modification.)

You have the columns for the pre-image and the delta records for the change. Every column records information about how we modified it and what TTL you put into it.

The topology of the CDC log is meant to match the original data, so you get the same data distribution. You potentially get the same consistency (depending on the options you select) as for your normal data writes. It’s synchronized with your writes. It shares the same properties and distribution, all to minimize the overhead of adding the log, and also to make sure that the logs and the actual data in the end result match each other as closely as possible.

Everything in a CDC log is transient, representing a window back in time that you can consume and look at. The default survival time-to-live (TTL) is 24 hours. But again it’s up to you. This is to ensure that if this log builds up, if you’re not consuming it, it’s not going to kill your database. It’s going to add to your total storage but it’s not going to kill your systems.

The downsides of this approach are that we’re going to add a read-before-write if you want pre-image data, and/or a read-after-write if you want a post-image. That will have a performance impact, and that is why we don’t want users to just generally enable CDC on all database transactions.

The log shares the same consistency as everything else in Scylla. Which is to say it is eventually consistent. It represents the change as viewed by the client. Because Scylla is a distributed database, there is no single truth of change of data. So it’s all based on how the coordinator, or the client talking to the coordinator, views the state and the change. (We’ll allow you to use different consistency levels for the base table than for the CDC table, although that is a special use case.)

Also note, it’s the change, it’s not what happened afterwards. So, for example, depending on your consistency level, if you lost a node at some point in time, what you read from the data might not be what you last wrote, because you lost the only replica that held that data or you can’t reach the required consistency level to get that value back. Obviously you could get partial logs in the case of severe node crashes.

Consuming the Data

How do you consume CDC table data? Well, that’s up to you. But it also depends on how we implemented CDC.

It’s going to be consumed on the lowest level through normal CQL. It’s just a table. Very easy to read. Very easy to understand. The data is already deduplicated. You don’t need to worry about aggregating data from different replicas and reconciling it. It’s been done already for you. Everything is normal CQL data. Easy to parse. Easy to manipulate. You don’t need to know how the server works on the internal level. Everything is made of well-known, documented components.

This allows for a layered approach. We can build adapters. We can build integrations on top of this for more advanced use cases and standardized connectors. We can adapt it to “push” models and we can integrate it with Kafka. We can provide an API that emulates the DynamoDB API via our Alternator project. And there’s going to be more.

We’re investing heavily in this and we’re going to be presenting a lot of integration projects to make sure that you can use CDC without relying on third-party connectors or appliances. It’s all going to come directly from Scylla.

Under the Hood

Here’s a brief example of what happens when you’re using this with a very simple table. We create a table and we turn on CDC:

> CREATE TABLE base_table (
     pk text,
     ck text,
     val1 text,
     val2 text,
     PRIMARY KEY (pk, ck)
) WITH cdc = { 'enabled': true, 'preimage': true };

In this case we have a primary key, we have a clustering key, we have two values and we enabled CDC. We also add that we want the pre-image.
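Because the log is just another CQL table, you can inspect it directly once CDC is enabled. As a sketch: in recent Scylla versions the log table is created next to the base table with a `_scylla_cdc_log` suffix, so a plain SELECT is enough to look at the captured changes.

> SELECT * FROM base_table_scylla_cdc_log;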

Now we do two inserts. You can see we’re inserting two CQL rows that share the same partition key.

> insert into base_table(pk, ck, val1, val2) values('foo', 'bar', 'val1', 'val2');

> insert into base_table(pk, ck, val1, val2) values('foo', 'baz', 'vaz1', 'vaz2');

So we get two update records in the CDC stream.

Stream_id |   time | batch_seq |  operation | ttl |   _pk |    _ck | _val1(op, value, ttl)|           _val2(...)
----------+--------+-----------+------------+-----+-------+--------+----------------------+---------------------
    UUID1 |        |         0 |     UPDATE |     | “foo” |  “bar” |  (ADD, “val1”, null) |  (ADD, “val2”, null)
    UUID1 |        |         0 |     UPDATE |     | “foo” |  “baz” |  (ADD, “vaz1”, null) |  (ADD, “vaz2”, null)

Again there’s no pre-image because nothing existed previously, it’s just two inserts.

Now we modify it by updating one of the CQL rows in the table:

> update base_table set val1 = 'val3' where pk = 'foo' and ck = 'bar';

Then we get a pre-image that only includes the value that we’re changing. Notice we try to limit the change log to make sure that it has as low an impact as possible on performance. Here’s the resultant change:

Stream_id |   time  | batch_seq |  operation | ttl |   _pk |    _ck |                _val1 |               _val2
----------+---------+-----------+------------+-----+-------+--------+----------------------+---------------------
    UUID1 |         |         0 |   PREIMAGE |     | “foo” |  “bar” |  (ADD, “val1”, null) |
    UUID1 |         |         1 |     UPDATE |     | “foo” |  “bar” |  (ADD, “val3”, null) |

To explain that tuple: we’re setting the value to 'val3' and we don’t have a TTL.

Performing a delete of one of the values is similar:

> delete val2 from base_table where pk = 'foo' and ck = 'bar';

We’re deleting one of the values. We’re not deleting the row, we’re setting a value to NULL. So we get a pre-image for the column that’s affected and then we get a delete delta.

Stream_id |   time  | batch_seq |  operation | ttl |   _pk |    _ck |                _val1 |               _val2
----------+---------+-----------+------------+-----+-------+--------+----------------------+---------------------
    UUID1 |         |         0 |   PREIMAGE |     | “foo” |  “bar” |                      | (ADD, “val2”, null)
    UUID1 |         |         1 |     UPDATE |     | “foo” |  “bar” |                      | (DEL, null, null)

For a row delete we get a pre-image and the values that exist:

> delete from base_table where pk = 'foo' and ck = 'bar';

And then we get a delete record for the row:

Stream_id |   time  | batch_seq |  operation | ttl |   _pk |    _ck |                _val1 |               _val2
----------+---------+-----------+------------+-----+-------+--------+----------------------+---------------------
    UUID1 |         |         0 |   PREIMAGE |     | “foo” |  “bar” | (ADD, “val3”, null)  |
    UUID1 |         |         1 |    ROW_DEL |     | “foo” |  “bar” |                      |

Of course the delete record for a row doesn’t contain any deltas for the columns because it’s gone.

For a partition delete, on the other hand, we only get the delete record for the partition; there’s no pre-image here:

> delete from base_table where pk = 'foo';

Stream_id |   time  | batch_seq |  operation | ttl |   _pk |    _ck |                _val1 |               _val2
----------+---------+-----------+------------+-----+-------+--------+----------------------+---------------------
    UUID1 |         |         1 |   PART_DEL |     |       |        |                      |

Summary

CDC in Scylla is going to be very easy to integrate and very easy to consume. Everything is plain old CQL tables. Nothing magic, nothing special. Just performant.

It’s going to be robust. Everything is replicated the same way normal data is replicated. It’s going to share all the same properties as your normal cluster. If your cluster works, CDC is going to work.

Our CDC implementation comes with a reasonable overhead. We coalesced the log writes into the same replica ranges, which means that, yes, you’re doing two writes, but it’s going to share the same request so it doesn’t add very much.

Data is TTL’d. It has a limited lifetime. Which means the CDC log is not going to overwhelm your system. You’re not going to get any node crashes because you enabled CDC. You’re just going to get some more data that you may or may not consume.

Here’s a short comparison with some NoSQL competitors:

                       |           Cassandra |            DynamoDB |             MongoDB |              Scylla
-----------------------+---------------------+---------------------+---------------------+---------------------
Consumer location      |             on-node |            off-node |            off-node |            off-node
Replication            |          duplicated |        deduplicated |        deduplicated |        deduplicated
Deltas                 |                 yes |                  no |             partial |                 yes
Pre-image              |                  no |                 yes |                  no |       yes, optional
Post-image             |                  no |                 yes |                 yes |       yes, optional
Slow consumer reaction |       Table stopped | Consumer loses data | Consumer loses data | Consumer loses data
Ordering               |                  no |                 yes |                 yes |                 yes

Where can you put the consumer? In Cassandra you have to do it on-node. For the other three, including Scylla, you can put the consumer separate from the database cluster. It can be somewhere else. It’s just over the wire.

Scylla data is deduplicated, you don’t have to worry about trying to reconcile different data streams from different replicas. It’s already done for you.

Do you get deltas? Yes. With Scylla you get deltas. You know the changes.

Do you get pre-image? With Scylla, optionally yes you do. You know the state before the delta and you know the delta itself.

Do you get post-image? With Scylla, optionally again, yes, we can give you that.

What happens if you don’t consume data? Yes, it is lost but it’s no catastrophe. If you didn’t consume it you probably didn’t want it. You have to take responsibility for consuming your data before you hit your TTLs.

And the ordering? With Scylla, yes. It’s just CQL. You’re going to be able to consume it the way you want. It’s ordered by timestamp — again, the perceived timestamp of the client. It’s a client view. It’s the database description of how your application perceived the state of data as it modified it.

Next Steps

Sign up for Scylla University to take the CDC lesson and get credit. You can also watch the session from Scylla Summit in full below or at our Tech Talks page. Also check out our CDC documentation. If you have more questions, make sure to bring them to our Slack community, or attend one of our live online Virtual Workshops.



REGISTER TO TAKE OUR CDC LESSON

The post Using Change Data Capture (CDC) in Scylla appeared first on ScyllaDB.

Scylla Open Source Release 4.0.4

The ScyllaDB team announces Scylla Open Source 4.0.4, a bugfix release of the Scylla 4.0 stable branch. Scylla Open Source 4.0.4, like all past and future 4.x.y releases, is backward compatible and supports rolling upgrades.

Note that the latest Scylla stable branch is Scylla 4.1, and it is highly recommended to upgrade to it.

Issues fixed in this release

  • CQL: avoid using shared_ptr’s in unrecognized_entity_exception #6287
  • Storage: SSTable upgrade has a 2x disk space requirement #6682
  • CQL: Default TTL is reset when not explicitly specified in alter table statement #5084
  • Alternator: modifying write-isolation tag is not propagated to a node of another DC #6513
  • Config: Internode (on the wire) compression is not enabled based on configuration #5963
  • UX: scylla-housekeeping, a service which checks for the latest Scylla version, returns a “cannot serialize” error instead of propagating the actual error #6690
  • UX: scylla_setup: on the RAID disk prompt, typing the same disk twice causes a traceback #6711
  • CQL: Impossible WHERE condition returns a non-empty slice #5799. For example, the following should always return zero rows:
    SELECT * FROM t1 WHERE r>10 AND r<10 ALLOW FILTERING;
  • Stability: Counter write read-before-write is issued with no timeout, which may lead to unbounded internal concurrency if the enclosing write operation timed out. #5069
  • Stability: the features service lifecycle has been extended, since it may be used by the storage_proxy until Scylla shuts down. #6250
  • Stability: Issuing a reverse query with multiple IN restrictions on the clustering key might result in incorrect results or a crash. For example:
    CREATE TABLE test (pk int, ck int, v int, PRIMARY KEY (pk, ck));
    SELECT * FROM test WHERE pk = ? AND ck IN (?, ?, ?) ORDER BY ck DESC;

    #6171
  • Compaction: Fix partition estimation with TWCS interposer #6214
  • Performance: A few of the local system tables from the `system` namespace, like large_partitions, do not set their gc grace period to 0, which may result in millions of tombstones being needlessly kept for these tables and can cause read timeouts. Local system tables use LocalStrategy replication, so they do not need to be concerned about gc grace period. #6325
  • Stability: nodetool listsnapshots failed when run in parallel with create/clear snapshot operations #5603
  • Stability: Potential use-after-free in truncate after restart #6230
  • Stability: Scylla freezes when casting a decimal with a negative scale to a float #6720
  • Stability: Stream ended with unclosed partition and coredump #6478

The post Scylla Open Source Release 4.0.4 appeared first on ScyllaDB.

Scylla Virtual Workshops

Scylla Virtual Workshops are a way to deepen your understanding of our database and make you more productive with it. We began the series in May and hold workshop sessions twice per month.

The format consists of a demonstration of how to deploy a fully functional cluster of Scylla using Docker, an overview of Scylla’s design principles and NoSQL-based practices from Scylla’s point of view, followed by a Q&A session. We also encourage attendees to fill out a survey based on the common sizing issues they encounter while scaling their workloads, so we can better understand our attendees’ real-world setups and situations.

Who should attend?

Scylla Virtual Workshops are oriented toward those who want to deepen their understanding of NoSQL, and how Scylla might fit their real-world big data use cases. The ideal attendee should have basic Linux/Docker experience as well as a familiarity with NoSQL, but no specific knowledge of Scylla is required.

If you are a database architect, developer, or engineering manager looking to build or migrate a current application using a NoSQL database (MongoDB, Couch, Redis, Cassandra), this will be a great session for you. If you are using an RDBMS (PostgreSQL, MySQL, Aurora) and your application is growing to the level of having issues with scalability, distribution, or performance induced from the database, then this session would be of great value for you.

Now that we are entering our third month for the series, we thought we’d share with you some of the most insightful Q&A’s from one of our recent sessions, hosted by Eyal Gutkind, our VP of Solutions.

Q: Are Scylla Monitoring Stack and Scylla Manager part of Scylla Open Source?

A: We deliver these tools separately from Scylla Open Source. Scylla Monitoring Stack is an open source offering itself. Scylla Manager is free-to-use for Scylla Open Source users, but is limited to five nodes. (Scylla Manager can manage an unlimited number of nodes for Scylla Enterprise.)

Q: Does Scylla Monitoring Stack support the DynamoDB-compatible API?

A: Yes. In fact, we provide a dashboard interface for Alternator, our DynamoDB-compatible API, in Scylla Monitoring Stack. You can even enable monitoring for only the DynamoDB interface if you are not using the CQL interface (read more about it here).

Q: Do the Cassandra client drivers work seamlessly with ScyllaDB?

A: Yes. Any Cassandra-compatible driver will work with Scylla. However, you will see performance improvements if you use one of the Scylla shard-aware drivers, such as our Go, Java or Python drivers. Refer to our documentation to find the latest available drivers.

Q: Are Scylla’s SSTables compatible with open source Apache Cassandra?

A: Yes. They are compatible with Apache Cassandra SSTables. So if you are using the “la,” “ka,” or “mc” format, it’s possible to move those files from Cassandra to Scylla. (Read more here.)

Q: Is it possible to upgrade from Apache Cassandra or DataStax? How?

A: Yes. It’s possible to migrate from both. The question isn’t whether you can migrate, but which of the various migration strategies is the best for your use case. Cold migration? Dual writes? One way is by snapshotting an existing Cassandra system and loading the files to Scylla via sstableloader; you can read about that in our migration guide. However, DataStax uses a proprietary SSTable format on disk, so instead, we provide an open source Apache Spark-based, Scylla Migrator.

Q: How are you benchmarking your disks/systems? How can we evaluate it on our own?

A: Perfect! You can do that. Behind the scenes, when you run scylla_setup, we use a tool called iotune. It will benchmark the number of IOPS that you’ll get for reads and writes, and you get the actual bandwidth of the system. We recommend that you use software RAID0. Don’t use a hardware RAID controller; it’s going to be less efficient from your perspective.

Q: We currently have a cluster of 30 nodes. How can I drop it down to 5 nodes in Scylla?

A: There are some guidelines that we give you. For example, we would like to see a ratio of 1 to 50 in terms of memory-to-disk. So, for every gigabyte of RAM, we want to see roughly 50 gigabytes of disk space available. CPU? Not that much. Give us enough CPU and you’ll be great. Think about it: if you have 30 nodes and each one has 8 vCPUs, that’s 240 vCPUs total, which means you are running 120 physical cores. For a system like Scylla, that’s way above what we need. If you have 30,000 to 50,000 transactions per second you can probably do that with 30 to 40 physical cores, so 80 vCPUs. You’ll be shrinking your cluster by 3x at least.

Also consider that Scylla can be used on denser nodes. The AWS I3en series can have up to 96 vCPUs per node. Which means you could reduce your deployment to just three nodes (with triple replication). That would shrink your cluster 10x.

If you have more questions about sizing, reach out to us.

Q: Can we use shared disk or is it necessary to use directly connected SSDs?

A: We highly recommend directly-connected SSDs. You can use shared disk, something like VMware or OpenStack. However, we want to make sure that you have the right throughput and latency and IOPS enabled.

Q: How do you expose the server’s metrics?

A: So we use Prometheus in Scylla Monitoring Stack to expose all the metrics that we have inside the system. By all means play around with it. You can even use Datadog if you want to push the data out.

Q: Can you use Scylla Manager in a Docker container?

A: The answer to that is ‘Yes.’

Q: How does Scylla behave if you don’t have TTLs and continually generate tombstones?

A: Tombstones are not a problem if you have properly defined your GC grace period and compactions are run properly. (You can learn more about tombstones and repairs in this Scylla University course.)
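For reference, the grace period mentioned here is a per-table CQL setting. A minimal sketch of adjusting it; the keyspace and table are placeholders, and 864000 seconds (10 days) is simply the CQL default rather than a recommendation:

ALTER TABLE ks.events WITH gc_grace_seconds = 864000;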

Q: What about Kubernetes support?

A: We provide an open source Scylla Kubernetes Operator, which is currently in beta. You can learn more about it by reading our documentation, watching this presentation from last year’s Scylla Summit, or taking the course in Scylla University.

Join a Virtual Workshop

Scylla’s next Virtual Workshop will be this Friday, July 24, 2020. Beyond that, always check our Webinars page to find out about our latest virtual events, including live and on-demand webinars. Also consider checking out videos from our Tech Talks from past live events.

SAVE YOUR SEAT AT OUR NEXT VIRTUAL WORKSHOP

The post Scylla Virtual Workshops appeared first on ScyllaDB.

Apache Cassandra 4.0 Beta Released

The beta release of Apache Cassandra 4.0 is finally here; it’s been two years in the making. We’ve had a preview release available to customers since March for testing. A wide range of improvements have been made.

Stability

The explicit goal of this release has been to be “the most stable major release ever” to accelerate adoption within the release cycle, which I blogged about in January. For this release a series of new testing frameworks were implemented, focusing on stability and performance, and they have paid off handsomely. The feeling of the team at Instaclustr is that we have never been more confident about a release (and we’ve seen quite a few!).

Performance

This release integrates the async event-driven networking code from Netty for communication between nodes. I blogged about Netty in February, but it’s worth reiterating what’s been achieved with this upgrade. It has enabled Cassandra 4.0 to have a single thread pool for all connections to other nodes instead of maintaining N threads per peer, which was cramping performance by causing lots of context switching. It has also facilitated zero copy streaming for SSTables, which now goes 5x faster than before.

This complete overhaul of the networking infrastructure has delivered some serious gains.

  • Tail-end (P99) latency has been reduced by 40%+ in initial testing
  • Node recovery time has been vastly reduced
  • Scaling large clusters is easier and faster

Auditing and Observability

Cassandra 4.0 introduces a powerful new set of enterprise class audit capabilities that I covered here in March. These help Cassandra operators meet their SOX and PCI requirements with a robust high level interface. Audit logging saves to the node, outside of the database, with configurable log rollover. Audit logs can be configured to attend to particular keyspaces, commands, or users and they can be inspected with the auditlogviewer utility.

Full Query Logging is also supported and the fqltool allows inspection of these logs.

Virtual Tables

Virtual tables, which I covered here in February, enable a series of metrics to be pulled from a node via CQL from read-only tables. This is a more elegant mechanism than JMX access as it avoids the additional configuration required. JMX access is not going anywhere soon, but this presents a really solid improvement to a number of metric monitoring tasks.
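As a quick sketch of what this looks like in practice, the virtual tables live in read-only virtual keyspaces and are queried with ordinary CQL against the node you are connected to, for example:

SELECT * FROM system_views.clients;
SELECT * FROM system_views.settings;
SELECT * FROM system_views.thread_pools;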

Community

Our community is our most powerful feature in all our releases and I can’t think of a better validation of open source under the Apache Foundation community model than this release. I just want to take this opportunity to congratulate and thank everyone in the community who has taken both Cassandra and its release processes to the next level with this beta release.

As always, you can spin up a free trial of Cassandra on our platform. Even with the performance gains delivered in this release our popular Cassandra Data Modeling Guide to Best Practices is always worth a read to get the most out of Cassandra.

The post Apache Cassandra 4.0 Beta Released appeared first on Instaclustr.

Introducing Apache Cassandra 4.0 Beta: Battle Tested From Day One

This is the most stable Apache Cassandra in history; you should start using Apache Cassandra 4.0 Beta today in your test and QA environments. Head to the downloads site to get your hands on it. The Cassandra community is on a mission to deliver a 4.0 GA release that is ready to be deployed to production. You can help guarantee this holds true by running your application workloads against the Beta release and contributing to the community’s validation effort to get Cassandra 4.0 to GA.

With over 1,000 bug fixes, improvements and new features and the project’s wholehearted focus on quality with replay, fuzz, property-based, fault-injection, and performance tests on clusters as large as 1,000 nodes and with hundreds of real world use cases and schemas tested, Cassandra 4.0 redefines what users should expect from any open or closed source database. With software, hardware, and QA testing donations from the likes of Instaclustr, iland, Amazon, and Datastax, this release has seen an unprecedented cross-industry collaboration towards releasing a battle-tested database with enterprise security features and an understanding of what it takes to deliver scale in the cloud.

There will be no new features or breaking API changes in future Beta or GA builds. You can expect the time you put into the beta to translate into transitioning your production workloads to 4.0 in the near future.

Quality in distributed infrastructure software takes time and this release is no exception. Open source projects are only as strong as the community of people that build and use them, so your feedback is a critical part of making this the best release in project history; share your thoughts on the user or dev mailing lists or in the #cassandra ASF slack channel.

Redefining the elasticity you should expect from your distributed systems with Zero Copy Streaming

5x faster scaling operations

Cassandra streams data between nodes during scaling operations such as adding a new node or datacenter during peak traffic times. Thanks to the new Zero Copy Streaming functionality in 4.0, this critical operation is now up to 5x faster without vnodes compared to previous versions, which means a more elastic architecture particularly in cloud and Kubernetes environments.

Globally distributed systems have unique consistency caveats and Cassandra keeps the data replicas in sync through a process called repair. Many of the fundamentals of the algorithm for incremental repair were rewritten to harden and optimize incremental repair for a faster and less resource intensive operation to maintain consistency across data replicas.

Giving you visibility and control over what’s happening in your cluster with real time Audit Logging and Traffic Replay

Enterprise-grade security & observability

To ensure regulatory and security compliance with SOX, PCI or GDPR, it’s critical to understand who is accessing data and when they are accessing it. Cassandra 4.0 delivers a long awaited audit logging feature for operators to track the DML, DDL, and DCL activity with minimal impact to normal workload performance. Built on the same underlying implementation, there is also a new fqltool that allows the capture and replay of production workloads for analysis.

There are new controls to enable use cases that require data access on a per data center basis. For example, if you have a data center in the United States and a data center in Europe, you can now configure a Cassandra role to only have access to a single data center using the new CassandraNetworkAuthorizer.
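A sketch of the new role syntax follows; the role name, password and data center label are placeholders, and this assumes the cluster is configured to use CassandraNetworkAuthorizer in cassandra.yaml:

CREATE ROLE eu_analyst
  WITH PASSWORD = 'choose-a-real-password'
  AND LOGIN = true
  AND ACCESS TO DATACENTERS {'dc_europe'};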

For years, the primary way to observe Cassandra clusters has been through JMX and open source tools such as Instaclustr’s Cassandra Exporter and DataStax’s Metrics Collector. In this most recent version of Cassandra you can selectively expose system metrics or configuration settings via Virtual Tables that are consumed like any other Cassandra table. This delivers flexibility for operators to ensure that they have the signals in place to keep their deployments healthy.

Looking to the future with Java 11 support and ZGC

One of the most exciting features of Java 11 is the new Z Garbage Collector (ZGC), which aims to reduce GC pause times to a maximum of a few milliseconds with no latency degradation as heap sizes increase. This feature is still experimental and thorough testing should be performed before deploying to production. These improvements significantly improve a cluster’s node availability profile with respect to garbage collection, which is why this feature has been included as experimental in the Cassandra 4.0 release.

Part of a vibrant and healthy ecosystem

The third-party ecosystem has its eyes on this release and a number of utilities have already added support for Cassandra 4.0. These include the client driver libraries, Spring Boot and Spring Data, Quarkus, the DataStax Kafka Connector and Bulk Loader, The Last Pickle’s Cassandra Reaper tool for managing repairs, Medusa for handling backup and restore, the Spark Cassandra Connector, The Definitive Guide for Apache Cassandra, and the list goes on.

Get started today

There’s no doubt that open source drives innovation and the Cassandra 4.0 Beta exemplifies the value in a community of contributors that run Cassandra in some of the largest deployments in the world.

To put it in perspective, if you use a website or a smartphone today, you’re probably touching a Cassandra-backed system.

To download the Beta, head to the Apache Cassandra downloads site.

Resources:

Apache Cassandra Blog: Even Higher Availability with 5x Faster Streaming in Cassandra 4.0

The Last Pickle Blog: Incremental Repair Improvements in Cassandra 4

Apache Cassandra Blog: Audit Logging in Apache Cassandra 4.0

The Last Pickle Blog: Cassandra 4.0 Data Center Security Enhancements

The Last Pickle Blog: Virtual tables are coming in Cassandra 4.0

The Last Pickle Blog: Java 11 Support in Apache Cassandra 4.0

Apache Cassandra Infographic