Spark + Cassandra Best Practices

Spark Overview

Spark was created in 2009 as a response to difficulties with map-reduce in Hadoop, particularly in supporting machine learning and other interactive data analysis. Spark simplifies the processing and analysis of data, reducing the number of steps and allowing ease of development. It also provides for reduced latency, since processing is done in-memory.

Spark can be used to process and analyze data moving to and from a variety of storage sources and destinations. In this blog, we will discuss Spark in conjunction with data stored in Cassandra.

Querying and manipulating data in Spark has several advantages over doing so directly in Cassandra, not the least of which is being able to join data performantly. This feature is useful for analytics projects.

Spark Use Cases

Typical use cases for Spark when used with Cassandra are: aggregating data (for example, calculating averages by day or grouping counts by category) and archiving data (for example, sending external data to cold storage before deleting from Cassandra). Spark is also used for batched inserts to Cassandra. Other use cases not particular to Cassandra include a variety of machine learning topics.

Spark in the Data Lifecycle

A data analysis project starts with data ingestion into data storage. From there, data is cleansed and otherwise processed. The resulting data is analyzed, reviewing for patterns and other qualities. It may then be further analyzed using a variety of machine learning methods. End-users will be able to run ad hoc queries and use interfaces to visualize data patterns. Spark has a star role within this data flow architecture.

Ingestion

Spark can be used independently to load data in batches from a variety of data sources (including Cassandra tables) into distributed data structures (RDDs) used in Spark to parallelize analytic jobs. Since one of the key features of RDDs is the ability to do this processing in memory, loading large amounts of data without server-side filtering will slow down your project. The spark-cassandra-connector has this filtering and other capabilities. (See https://github.com/datastax/spark-cassandra-connector/.) The limitation on memory resources also implies that, once the data is analyzed, it should be persisted (e.g., to a file or database).
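
As an illustration, reading a Cassandra table into a data frame with the connector's server-side filtering might look like the following PySpark sketch (the connection host, keyspace, table, and column names are placeholders, and the spark-cassandra-connector package is assumed to be on the classpath):

from pyspark.sql import SparkSession

# Build a session that knows how to reach the Cassandra cluster.
spark = (SparkSession.builder
         .appName("cassandra-ingest")
         .config("spark.cassandra.connection.host", "10.0.0.1")
         .getOrCreate())

# Read a Cassandra table into a data frame via the connector.
events = (spark.read
          .format("org.apache.spark.sql.cassandra")
          .options(keyspace="ks", table="events")
          .load())

# Filters on suitable columns are pushed down to Cassandra, so only matching
# rows are shipped to Spark instead of the whole table.
recent = events.filter("event_date >= '2020-06-01'")
recent.persist()  # keep the filtered data in memory for the analysis that follows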

To avoid some of the limitations of this batch processing, streaming functionality was added to Spark. In Spark 1, this functionality was offered through DStreams. (See https://spark.apache.org/docs/latest/streaming-programming-guide.html.)

Spark 2 — a more robust version of Spark in general — includes Structured Streaming. (See https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html.) With Structured Streaming, instead of creating a static table from a batch input, the input is treated as a table that grows as new data arrives from the source; the results are held in a data frame that is continuously updated. (Another benefit of using dataframes over RDDs is that the data is intuitively abstracted into columns and rows.)

Available data sources on the source side for streaming include the commonly used Apache Kafka. Kafka buffers the ingest, which is key for high-volume streams. See https://kafka.apache.org/ for more details on Kafka.
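
A minimal Structured Streaming sketch reading from Kafka could look like this (the bootstrap servers, topic, and output paths are placeholders; the spark-sql-kafka package is assumed to be available):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("kafka-ingest").getOrCreate()

# Treat the Kafka topic as a continuously growing input table.
raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "kafka1:9092")
       .option("subscribe", "events")
       .load())

# Kafka delivers keys and values as binary; cast the value to a string for parsing.
messages = raw.select(col("value").cast("string").alias("payload"))

# Persist the stream as files (Parquet here); a checkpoint location is required.
query = (messages.writeStream
         .format("parquet")
         .option("path", "/data/events")
         .option("checkpointLocation", "/data/checkpoints/events")
         .start())
query.awaitTermination()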

Data Storage

Although we are focusing on Cassandra as the data storage in this presentation, other storage sources and destinations are possible. Another frequently used data storage option is Hadoop HDFS. The previously mentioned spark-cassandra-connector has capabilities to write results to Cassandra, and in the case of batch loading, to read data directly from Cassandra.

Native data output formats available include both JSON and Parquet. The Parquet format in particular is useful for writing to AWS S3. See https://aws.amazon.com/about-aws/whats-new/2018/09/amazon-s3-announces-new-features-for-s3-select/ for more information on querying S3 files stored in Parquet format. A good use case for this is archiving data from Cassandra.
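
For example, continuing with the events data frame from the earlier sketch, archiving it to S3 as Parquet might look like this (the bucket name is a placeholder and the S3A connector is assumed to be configured with valid credentials):

# Write the data frame to S3 in Parquet format, partitioned by date so later
# S3 Select or Athena queries only touch the files they need.
(events.write
       .mode("append")
       .partitionBy("event_date")
       .parquet("s3a://my-archive-bucket/cassandra/events/"))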

Data Cleansing

Data cleansing involves dealing with questionable data (such as null values) and other preprocessing tasks (such as converting categorical data to mapped integers). Once data is stored in a data frame, it can be transformed into new dataframes based on filters. Other than the fact you have the capability to do this cleansing within the same code (e.g., the Scala script running Spark), Spark does not provide magic to clean data; after all, this takes knowledge about the data and the business to understand and code particular transformation tasks.
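
A few typical cleansing steps might look like this (the column names and the categorical mapping are illustrative):

from pyspark.sql.functions import col, when

# Drop rows that are missing mandatory fields and fill defaults elsewhere.
cleaned = (events
           .dropna(subset=["user_id", "event_date"])
           .fillna({"amount": 0.0}))

# Map a categorical column to integers; the mapping itself is business knowledge.
cleaned = cleaned.withColumn(
    "status_code",
    when(col("status") == "active", 1)
    .when(col("status") == "inactive", 0)
    .otherwise(-1))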

Pattern Analysis

Spark dataframes can be easily explored for basic patterns using commands like describe, which will show the count, mean, standard deviation, minimum value, and maximum value of selected columns. Dataframes can be further transformed with functions like groupBy, pivot, and agg (aggregate). Spark SQL can be used for complex joins and similar tasks, using the SQL language that will be familiar to many data analysts.
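
For instance, continuing with the cleaned data frame from above (column names are illustrative):

from pyspark.sql.functions import avg, count

# Quick summary statistics for selected columns.
cleaned.describe("amount", "status_code").show()

# Average amount per day, pivoted into one column per category.
daily = (cleaned.groupBy("event_date")
                .pivot("category")
                .agg(avg("amount"), count("*")))

# The same data can be queried with SQL after registering a temporary view.
cleaned.createOrReplaceTempView("events_clean")
spark.sql("""
    SELECT event_date, category, AVG(amount) AS avg_amount
    FROM events_clean
    GROUP BY event_date, category
""").show()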

Machine Learning and Data Mining

Machine learning and data mining encompass a broad range of data modeling algorithms intended to make predictions or to discover unknown meaning within data.

From Spark 2 onward, the main library for Spark machine learning is based on data frames instead of RDDs. You may see this new data frame-based library referred to as Spark ML, but the library name hasn’t changed — it is still MLlib. (See https://spark.apache.org/docs/latest/ml-guide.html.) Some things that are possible with Spark in this area are recommendation engines, anomaly detection, semantic analysis, and risk analysis.
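
As a small illustration of the data frame-based MLlib API, a basic pipeline might look like this (the feature and label columns are assumptions about the data; this is a sketch, not a complete model workflow):

from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

# Assemble numeric columns into a single feature vector and fit a simple classifier.
assembler = VectorAssembler(inputCols=["amount", "status_code"], outputCol="features")
lr = LogisticRegression(featuresCol="features", labelCol="label")

pipeline = Pipeline(stages=[assembler, lr])
model = pipeline.fit(cleaned)            # 'cleaned' must contain a numeric 'label' column
predictions = model.transform(cleaned)
predictions.select("label", "prediction").show(5)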

Ad Hoc Queries

Spark SQL is available to use within any code used with Spark, or from the command line interface; however, the requirement to run ad hoc queries generally implies that business end-users want to access a GUI to both ask questions of the data and create visualizations. This activity could take place using the eventual destination datastore as the backend. Directly from Spark, there are enterprise options such as Tableau, which has a Spark connector. For query speed, a memory-based cache such as Apache Ignite could be used as the analytics backend; to maintain that speed by avoiding disk i/o, the data being used for queries should fit into memory.

Visualization

Depending on the programming language and platform used, there may be libraries available to directly visualize results. For example, if Python is used in a Jupyter notebook, then Matplotlib will be available. (See https://matplotlib.org/.) In general, except for investigating data in order to clean it, etc., visualization will be done on the data after it is written to the destination. For business end-users, the above discussion in Ad Hoc Queries applies.
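
For example, a small aggregated result can be pulled back to the driver and plotted with Matplotlib (a sketch; only do this with data small enough to fit in driver memory):

import matplotlib.pyplot as plt

# Aggregate in Spark, then convert the small result to pandas for plotting.
pdf = (cleaned.groupBy("event_date")
              .count()
              .orderBy("event_date")
              .toPandas())

pdf.plot(x="event_date", y="count", kind="bar")
plt.title("Events per day")
plt.show()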

Architecture

The general architecture for a Spark + Cassandra project is apparent from the discussion of the Data Lifecycle above. The core elements are source data storage, a queueing technology, the Spark cluster, and destination data storage.

In the case of Cassandra, the source data storage is of course a cluster. Spark will only query a single data center, and to avoid load on a production cluster, this is what you want. Most installations will set up a second data center replicating data from the production data center cluster and attach Spark to this second data center. If you are unable to set up a separate data center to connect to Spark (and we strongly recommend setting it up), be sure to carefully tune the write variables in the spark-cassandra-connector. In addition, if Datastax Enterprise is being used, then DSE Search should be isolated on a third data center.

Another consideration is whether to set up Spark on dedicated machines. It is possible to run Spark on the same nodes as Cassandra, and this is the default with DSE Analytics. Again, this is not advised on the main production cluster, but can be done on a second, separate cluster.

Spark in the Cloud

Spark at Google Cloud

Google Cloud offers Dataproc as a fully managed service for Spark (and Hadoop):

https://cloud.google.com/dataproc/

Spark at AWS

AWS supports Spark on EMR: https://aws.amazon.com/emr/features/spark/.

Spark Development

Coding Language Options

Spark code can be written in Python, Scala, Java, or R. SQL can also be used within much of Spark code.

Tuning Notes

Spark Connector Configuration

Slowing down the write throughput (output.throughput_mb_per_sec) can alleviate latency problems by reducing pressure on the Cassandra cluster.

For writing, the Spark batch size (spark.cassandra.output.batch.size.bytes) should stay within the batch size configured in Cassandra (batch_size_fail_threshold_in_kb).

Writing more frequently reduces the size of each write, which reduces latency. Increasing the batch size reduces the number of times Cassandra has to deal with timestamp management, but batches that are too large can cause allocation failures.
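
These settings can be supplied when the Spark session is created, for example (the values shown are illustrative starting points rather than recommendations):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("cassandra-writes")
         .config("spark.cassandra.connection.host", "10.0.0.1")
         # Throttle per-core write throughput (MB/s) to ease pressure on Cassandra.
         .config("spark.cassandra.output.throughput_mb_per_sec", "5")
         # Keep Spark's batches within Cassandra's batch_size_fail_threshold_in_kb.
         .config("spark.cassandra.output.batch.size.bytes", "4096")
         .getOrCreate())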

For further documentation on connector settings, see https://github.com/datastax/spark-cassandra-connector/blob/master/doc/reference.md.

Spark Security

Security has to be explicitly configured in Spark; it is not on by default. However, the configuration doesn’t cover all risk vectors, so review the options carefully. Also, most of these settings can be overridden in code accessing Spark, so it is important to audit your codebase, and most important of all to limit connections to specific hosts that are further protected by user authentication.

Enable authentication

Authentication is turned off by default. It is enabled through two configuration variables, using a shared secret.

The two configuration variables are spark.authenticate (default is false; set to true) and spark.authenticate.secret (set to string of shared secret). If YARN is used, then much of this is done by default. If not, then set these two in the spark-defaults.conf file.

Then use this secret key when submitting jobs. Note that the secret key can be used to submit jobs by anyone with the key, so protect it well.

Enable logging for all submitted jobs

You can set spark.eventLog.enabled in the spark-defaults.conf file, but it can be overridden in a user’s code (e.g., in the SparkConf) or in shell commands, so it has to be enforced by business policy.

Note also that the log file itself (configured via spark.eventLog.dir) should be protected with filesystem permissions to avoid users snooping data within it.

Block Java debugging options in JVM

Make sure the JVM configuration does not include the following options: -Xdebug, -Xrunjdwp, -agentlib:jdwp.

Redact environment data in WebUI

You can disable the whole UI via spark.ui.enabled, but other than that, or overriding the EnvironmentListener with alternate custom code, there is no way to redact the information in the Environment tab of the UI specifically.

It is recommended to enable ACLs for both the WebUI and the history server, which will protect the entirety of the web-based information.

Enable and enforce SASL RPC encryption

The recommendation with Spark is to enable AES encryption since version 2.2, unless using an external Shuffle service. To enable SASL, set the following to true: spark.authenticate.enableSaslEncryption and spark.network.sasl.serverAlwaysEncrypt.

Enable encryption on all UIs and clients

To enable AES encryption for data going across the wire, in addition to turning on authentication as above, also set the following to true: spark.network.crypto.enabled. Choose a key length and set via spark.network.crypto.keyLength, and choose an algorithm from those available in your JRE and set via spark.network.crypto.keyFactoryAlgorithm.

Don’t forget to also configure encryption for the traffic between any database (e.g., Cassandra) and Spark.

Enable encryption on Shuffle service

In addition to the above encryption configuration, set the following to true: spark.network.crypto.saslFallback.

To encrypt the temporary files created by the Shuffle service, set this to true: spark.io.encryption.enabled. The key size and algorithm can also be set via spark.io.encryption.keySizeBits and spark.io.encryption.keygen.algorithm, but these have reasonable defaults.

Disable Spark REST server

The REST server presents a serious risk, as it does not allow for encryption. Set the following to false: spark.master.rest.enabled.
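
Pulling the settings above together, the relevant entries in spark-defaults.conf might look like the following (a sketch; the shared secret, log directory, and key length are placeholders to adapt to your environment):

spark.authenticate                        true
spark.authenticate.secret                 <shared-secret>
spark.eventLog.enabled                    true
spark.eventLog.dir                        /var/log/spark-events
spark.authenticate.enableSaslEncryption   true
spark.network.sasl.serverAlwaysEncrypt    true
spark.network.crypto.enabled              true
spark.network.crypto.keyLength            256
spark.network.crypto.saslFallback         true
spark.io.encryption.enabled               true
spark.master.rest.enabled                 false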

Operations

Monitoring Spark

Spark has built-in monitoring: https://spark.apache.org/docs/latest/monitoring.html

Ask Me Anything with Avi Kivity

 

Recently we hosted the first of our new series of AMA sessions with ScyllaDB CTO Avi Kivity. Join our Meetup to be notified of the next AMA.

Here’s how our first AMA went:

Q: Would you implement Scylla in Go, Rust or Javascript if you could?

Avi: Good question. I wouldn’t implement Scylla in Javascript. It’s not really a high-performance language, but I will note that Node.js and Seastar share many characteristics. Both are using a reactor pattern and designed for high concurrency. Of course the performance is going to be very different between the two, but writing code for Node.js and writing code for Seastar is quite similar.

Go also has an interesting take on concurrency. I still wouldn’t use it for something like Scylla. It is a garbage-collected language so you lose a lot of predictability, and you lose some performance. The concurrency model is great. The language lacks generics. I like generics a lot and I think they are required for complex software. I also hear that Go is getting generics in the next iteration. Go is actually quite close to being useful for writing a high-performance database. It still has the downside of having a garbage collector, so from that point-of-view I wouldn’t pick it.

If you are familiar with how Scylla uses the direct I/O and asynchronous I/O, this is not something that Go is great at right now. I imagine that it will evolve. So I wouldn’t pick Javascript or Go.

However, the other language you mentioned, Rust, does have all of the correct characteristics that Scylla requires. Precise control over what happens. It doesn’t have a garbage collector so it means that you have predictability over how much time your things take, like allocation. You don’t have pause times. And it is a well-designed language. I think it is better than C++ which we are currently using. So if we were starting at this point in time, I would take a hard look at Rust, and I imagine that we would pick it instead of C++. Of course, when we started Rust didn’t have the maturity that it has now, but it has progressed a long way since then and I’m following it with great interest. I think it’s a well-done language.

Q: Can I write in a plugin to extend Scylla?

Avi: Right now you cannot. We do have support for Lua, which can be used to write plugins, but we did not yet prepare any extension points that you can actually use. You can only use them to write User Defined Functions (UDFs) for now. But we do have Lua as infrastructure for plugins. So if you have an idea for something you would need a plugin for, just file an issue or write to the users mailing list and we can see if it’s something that will be useful for many users. Then we can consider writing this extension point. And then you’ll be able to write your extensions with our extension language, Lua.

Q: What did you take from the KVM/OSv projects to use in Scylla?

Avi: KVM and OSv taught us how to talk directly to the hardware and how to be mindful of every cycle; how to drive the efficiency. KVM was designed for use on large machines so we had to be very careful when architecting it for multi-core. And the core counts have only increased since then. So it taught us how to both deal with large SMP [Symmetric Multiprocessor] systems and also how to deal with highly asynchronous systems.

OSv is basically an extension of that so we used a lot of our learnings from KVM to write OSv in the same way, dealing with highly parallel systems and trying to do everything asynchronously. So that the system does not block on locks and instead just switches to do something else. These are all ideas that we used in Scylla. More in Seastar than Scylla. Seastar is the underlying engine.

There is also the other aspect, which is working on a large scale open source project. We are doing the same thing here with a very widely dispersed team of developers. I think our developers come from around fifteen countries. This is something that we’re used to from Linux, and it is a great advantage for us.

Q: How does the Scylla client driver target a specific shard on a target host node? Does it pick TCP source port to match target node NIC RSS [receive-side scaling] setting?

Avi: Actually we have an ongoing discussion about how to do that. The scheme that we currently use is a little bit awkward. So what the client does is it first negotiates the ability to connect with a particular shard. Just make sure it’s connecting to a server that has this capability. Then it just opens a bunch of connections which arrive at a target shard more or less randomly. And then it discards all of the connections that it has in excess. If it sounds like a very inefficient way to do things, then that’s the case. But it was the simplest for us and it did not depend on anything. It didn’t need to depend on the source ports being unchanged.

Because it is awkward we are now considering another iteration on that, depending on the source port. The advantage of that is it allows you to open exactly the number of connections you need. You don’t have to open excess connections and then discard them, which is wasteful. But there are some challenges. One of the challenges is that perhaps not all languages may allow you to select the source port. And if so, they wouldn’t be able to take advantage of that mechanism and will have to fall back on the older mechanism. The other possible problem is that network address translation [NAT] between the client and the server may alter the source port. If that happens it stops working, so we will have to make it optional for users that are behind network address translation devices.

I think the question also mentioned the hash computed by the NIC, and we’re not doing that. It is too hard to coordinate. Different NICs have different hash functions. Right now we have basically a random connection to a shard, then discarding the excess connections. We’re thinking of adding on top of that using the source port for directing a connection to a specific shard.

Q: Any plans for releasing Scylla on Azure and/or GCP?

Avi: First, you can install Scylla on Azure or on GCP. We don’t provide images right now so it’s a little bit more work. You have to select an OS image and then follow the installation procedure. Just as you would for on-prem. So it’s a little bit more work than is necessary.

We are working to provide images, which will save that work; you will just be able to select the Scylla image from the marketplace, launch it, and it will start working immediately.

The other half is Scylla Cloud. We are working on GCP and Azure but think that GCP is the one next in line. I don’t have the timelines but we’re trying to do it as quickly as possible so that people can use Scylla. Especially people who are migrating away from Amazon and want a replacement for Dynamo. They can find it in Scylla running on GCP.

Q: Do you plan at some point in the future to support an Alternator [Scylla’s DynamoDB-compatible API] that includes the feature of DynamoDB Transactions?

Avi: Yes. So DynamoDB has two kinds of transactions. The first kind is a transaction that is within a single item. That is already supported. It is implemented on top of our Lightweight Transactions [LWT] support that’s used for regular CQL operations.

There is another kind which is a multi-item transaction and that we do plan to support, but it is still in the future. It’s not short-term implementation work. What we are doing right now is implementing Raft. This is the kind of infrastructure for transactions. Then we will implement Alternator transactions on top of the Raft infrastructure.

Q: Scylla and Seastar took the approach of dynamically allocating memory. Other HPC [High Performance Computing] products sometimes choose the approach of pooling objects and pre-allocating the object. Did you consider the pooling approach for Scylla? Why did you choose this memory allocation approach?

Avi: Interesting question. Basically the approach of having object pools works when you have very, very regular workloads. In Scylla the workload is very irregular. You can have a workload that has lots of reads with, I don’t know, 40 byte objects, or you can have a workload that has one-megabyte objects, and you can have simple workloads where you just do simple reads and writes, and you can have complex workloads that use indexes and indirect indexes and build materialized views. So the pattern of memory use varies widely between different workloads.

If you had a pool for everything then you would end up with a large number of pools most of which are not in use. The advantage of dynamic use of memory is that you can repurpose all of the memory. The memory doesn’t have a specific role attached to it. This is even dynamic. So the workload can start with one kind of use for memory and then shift to another kind of use. We don’t even have a static allocation between memtables and cache. So if you have a read-only workload, then all of the memory gets used for cache. But if you have a read-write workload then half of the memory gets used for memtables and half of the memory gets used for cache.

There is also working memory that is used to manage connections and provide working memory for requests in-flight, and that is also shared with the cache. So if you have a workload that uses a lot of connections and has high concurrency, then that memory is taken from the cache and is used to serve those requests. For a workload that doesn’t have high concurrency, instead of that memory sitting in a pool and doing nothing, it becomes part of the cache.

The advantage is that everything is dynamic and you can use more of your memory at any given time. Of course it results in a much more complex application because you have to keep track of all of the users of the memory, and make sure that you don’t use more than 100%. With pools it’s easy if you take your memory and you divide it among the pools. But if you have dynamic use you have to be very careful to avoid running out of memory.

Q: Is it true that Scylla uses C++20 already?

Avi: Yes. We’re using C++20. We switched about a month ago now. The feature I was really anxious to start using C++20 for is coroutines, which fits very well with the asynchronous I/O that we do everywhere. It promises to really simplify our code base, but that feature is not ready. We’re waiting for some fixes in GCC, the compiler. As soon as those fixes hit the compiler we will start using coroutines. The other C++20 features are already in use and we’re enjoying them.


Q: It’s being said that wide column storage is good for messenger-like applications. High frequency of small updates. If so, could you explain why Scylla would work better than MongoDB for such workloads?

Avi: First I have to say that I think that all this, the labeling of wide column storage, is a misnomer. It comes back from the days, the early days of Cassandra where it was schemaless. You would allocate a partition and you could push as many cells as you wanted into a partition without any schema and those would be termed a “wide column.” But these days both Cassandra and Scylla don’t use this model. Instead you divide your partition into a large number of rows and each row has a moderate number of columns. So it’s not really accurate to refer to Scylla as a wide column. Of course the name stuck and it’s impossible to make it go unstuck.

To go back to the question about whether it fits a messaging application, yes, we actually have Discord, a very famous messaging application using Scylla. If it’s a good fit for them then it’s a good fit for any messaging application.

Comparing to Mongo, I don’t think the problem is the data model. It’s really about the scalability and the performance. If you have a messaging application that has a high throughput, if you’re successful and you have a high number of users, then you need to use something like Scylla in order to scale. If you have a small number of users and you don’t have high performance, then anything will work. It really depends on whether you have high scales or not.

Q: Can you compare Scylla to Aerospike? Do they optimize for the same use cases at all?

Avi: Yes. Though in some respects Aerospike is similar to Scylla I would say that the main differences are that Aerospike is partially in-memory. So Aerospike has to keep all of the keys in memory and Scylla does not. That allows Scylla to address much larger storage. If you have to keep all of the keys in memory then even if you have a lot of disk space as soon as you exhaust the memory for the keys then you need to start adding nodes. Whereas with Scylla we support a disk-to-memory ratio of 100 to 1.

So if you have 128 gigs of RAM then you can have 10 terabytes of disk. There are, of course, servers with more than 128 gigs of RAM, so we do have users who are running with dozens of terabytes of disk. I think this is something that Aerospike does not do well. I’m not an expert in Aerospike, so don’t take me up on that. My understanding is this is a point where Scylla would perform much better by allowing this workload to be supported.

If you have a smaller amount of data then I think they are similar, though I think we have a much wider range of features. We support more operations than Aerospike. Better cross datacenter replication. But again, I’m not really an expert in Aerospike so I can’t say a lot about that.

Q: Does Scylla also support persistent memory? Any performance comparisons with regular RAM or maybe PMEM as cache? Or future work planned?

Avi: We did do some benchmarks with persistent memory. I think it was a couple of years ago. You can probably find them on our blog. We did that with Intel disks. We currently don’t have anything planned with persistent memory, and the reason is it’s not widely available. You can’t find it easily. Of course you can buy it if you want, but you can’t just go to a cloud and select an instance with persistent memory.

When it becomes widely available then sure it makes a lot of sense to use it as a cache expander. This way you can get even larger ratios of disk-to-memory and so improve the resource utilization. It is very interesting for us but it doesn’t have enough presence in the market yet to make it worth our while to start optimizing for it.


Q: What’s the status of Clang MSVC compliance? Is that something you’re on?

Avi: Scylla does compile with Clang. We have a developer that keeps testing it and sends patches every time we drift away from Clang. Our main compiler is GCC, but we do compile with Clang.

Recently, because of some gaps in Clang’s C++20 support we drifted a bit away, but I imagine as soon as Clang comes back into compliance then we’ll be able to build with Clang again.

MSVC is pretty far away. And the reason here is not the compiler but the ecosystems. We’re heavily tied into Linux. I imagine we could port Seastar to Windows but there really isn’t any demand for running Scylla and other Seastar applications on Windows. Linux really is the performance king for such applications. So this is where we are focusing.

It could be done, so if you’re interested, you could look at porting Seastar to Windows. But I really don’t know in terms of compiler compliance whether MSVC is there or not. I hear that they’re getting better lately but I don’t have any first hand knowledge. [Editor’s note: users can read more about Clang and MSVC compliance here.]

Q: Does Scylla support running on AWS Graviton2 instances? If so, what’s your insights and experience on performance comparisons with instances with AMD/Intel instances specifically for a Scylla database workload?

Avi: Interesting question! So we did test, we did run benchmarks on Arm systems, not just Graviton but on Arm systems. Scylla builds and runs on those Arm systems. The Graviton2 systems that Amazon is providing are not really suitable for Scylla because they don’t have a fast NVMe storage. And of course, we need a lot of very fast storage. So you could run Scylla on EBS, Elastic Block Store, but that’s not really a good performance point for Scylla. Scylla really wants fast local storage.

When we get a Graviton-based instance that has fast NVMe storage, then I’m sure that Scylla will be a really good fit for it, and we will support it. We do compile on Arm and work on Arm without any problems. As to whether it will work well? It will. The shared-nothing architecture of Scylla works really well with the Arm relaxed memory model. It really gives the CPU the maximum flexibility in how it issues the read and write requests to memory, compared to other applications that use locking.

We do expect to get really good performance on Arm, and we did on the benchmarks that we performed in the past. I think they are on our website as well. So as soon as there are Arm-based instances with fast storage, then I expect we will provide AMIs and you will be able to use them. Even now you can compile it on your own for Arm and it’s supposed to just work.


Q: Does every memtable have a separate commitlog? I assume yes. If so, how are these commitlogs spread among the available disks?

Avi: Okay. So your assumption is incorrect. We have a shared commitlog for all of the memtables within a single shard, or a single core. The reason for that is to consolidate the writes. If you have a workload that writes to many tables, then instead of having to write to many files we just interleave the writes that all of those memtables need into a single file, which is more efficient. But we do have a separate commitlog for every shard, or for every core.

This is part of the shared-nothing architecture which tries to avoid any sharing of data or control between cores. Generally the number of active commitlogs is equal to the number of cores on the system. We keep cycling between them. There is one active segment. As soon as it gets filled we open a new active segment and start writing to that.

Q: I have a follow-up question to the commitlog question: If there is only a single commitlog for multiple memtables, then for interleaved workloads (for example, a lot of writes to one memtable followed by a single write to a different memtable), won’t it cause both memtables to flush? How do you process the entries for different memtables within a single commitlog internally?

Avi: It’s a good point. It can cause early flushes of memtables, but the way we handle it is simply by being very generous with the allocation of space to the commitlog. Our commitlog can be as large as the memory size, and our memtable is limited to half the memory size. So in practice the memtables flush because they hit their own limits before you hit this kind of forced flushing.

In theory it is possible to get forced flushing but in practice it’s never been a problem just by being generous with disk space. It’s easy to be generous with your user’s disk space, but really, when you usually have disk space that is several dozen times larger than memory you’re just sacrificing a few percent of your disk for a commitlog.

Q: So the answer is basically that you hopefully never reach the situation that you actually run out of space for the commitlog to flush all the entries, and you hope that until you get to that point then you will flush all the previous contiguous memtables anyway. Right?

Avi: Yes, and that’s what happens in practice because the commitlog is larger than memory. Then some memtable reaches its limit and starts flushing. It is possible for what you describe to happen, but it is not so bad. It’s okay. You have a memtable that is flushed a little bit earlier than it otherwise would be. It means that there will be a little bit more compaction later. But in practice it’s never been a cause of a problem.

Q: What is Scylla’s approach to debugging issues from the field? Do you rely on logs? On traces? How about issues that you cannot recreate on your machines, that come from clients?

Avi: First of all, it’s hard. There is no magic answer for that. We have a tracing facility that can be enabled dynamically. All of the interesting code paths are tagged with a kind of dynamic tracing system. You can select a particular request for tracing, and then all of the code paths that this request touches log some information. That is stored in a Scylla table. Then you can use that to understand what went wrong.

There is also a regular log with variable levels, which is pretty standard. This has limited usefulness because you cannot just switch the log to debug mode: if you have a production application that is generating a large number of requests, it will quickly spam the log and also impact the running requests. So there’s no magic answer. Sometimes tracing works.

What we do go to is the metrics. The metrics don’t give you any direct information about a particular request but they do give you a very good overall picture about what the system is doing. The state of various queues. Where memory is allocated. The rate at which various subsystems are operating.

There’s a lot more than just metrics that we expose in the dashboard. If you go to Prometheus then you have access to — I think it’s now in the thousands — thousands of metrics that describe different parts of the system. We use those to gain an understanding of what the customer system is doing. From that we can try to drill down. Whether we are tracing, or maybe by asking the customer to do different things.

In extreme cases where we are not able to make progress we will create a debug build with special instrumentation. But of course we try to avoid doing that. That’s an expensive way to debug. We prefer having enough debug information in advance present in the tracing system and in the metrics.

Q: How do you debug Seastar applications of this size? I considered Seastar for my pet project but it looks virtually undebuggable.

Avi: It’s not undebuggable. It does have a learning curve. It takes time to learn how to debug it. Once you do, then you need some experience and expertise. We rely a lot on the sanitizers that are part of the compiler. If you run the application in debug mode, the sanitizers find memory bugs with high precision and also pinpoint the problem, so it’s easy to resolve. We are also adding debug tools. We have error injection tools in Scylla and also in Seastar. We use those to try to flush out bugs before they hit users and customers.

Next Steps

Our developers are very responsive. If you have more questions, here are a few ways to get in touch.

 

SIGN UP FOR THE SCYLLA MEETUP


Scylla Open Source Release 4.1

The Scylla team is pleased to announce the availability of Scylla Open Source 4.1.0, a production-ready release of our open source NoSQL database.

Scylla is an open source NoSQL database with superior performance and consistently low latencies. Find the Scylla Open Source 4.1 repository for your Linux distribution here. A Scylla Open Source 4.1 Docker image is also available.

Scylla 4.1 is focused on stability and bug fixes, as well as adding more operations to the Alternator API.
In addition, we are adding support for Red Hat Enterprise Linux 8, CentOS 8, and Debian 10.

Please note that only the latest two minor releases of the Scylla Open Source project are supported. Starting today, only Scylla Open Source 4.1 and Scylla 4.0 will be supported, and Scylla 3.3 will be retired.


New features in Scylla 4.1

Deployment

  • Scylla is now available for Red Hat Enterprise Linux 8, CentOS 8, and Debian 10

Alternator

  • Alternator: add mandatory configurable write isolation mode #6452
    Alternator allows users to choose the isolation level per table. If they do not use one, a default isolation level is set. In Scylla 4.0 the default level was “always”. In Scylla 4.1, the user needs to explicitly set the default value in scylla.yaml:
    • forbid_rmw – Forbids write requests which require a read before a write. Will return an error on read-modify-write operations, e.g., conditional updates (UpdateItem with a ConditionExpression).
    • only_rmw_uses_lwt – This mode isolates only updates that require read-modify-write. Use this setting only if you do not mix read-modify-write and write-only updates to the same item, concurrently.
    • always – Isolate every write operation, even those that do not need a read before the write. This is the slowest choice, guaranteed to work correctly for every workload. This was the default in 4.0.

For example, the following configuration will set the default isolation mode to always, which was the default in Scylla 4.0:
alternator_write_isolation: always

Note: Alternator 4.1 will not run without choosing a default value!

  • Allow access to system tables from Alternator REST API. If a Query or Scan request references a table name with the following pattern: .scylla.alternator.KEYSPACE_NAME.TABLE_NAME, it will read
    the data from Scylla’s KEYSPACE_NAME.TABLE_NAME table.
    Example: in order to query the contents of Scylla’s system.large_rows, pass TableName='.scylla.alternator.system.large_rows' to a Query/Scan request.
    #6122
  • Add support for ScanIndexForward
    Alternator now supports the ScanIndexForward option of a query expression.
    By default, the query sort order is ascending; setting the ScanIndexForward parameter to False reverses the order.
    See AWS DynamoDB Query API
    #5153

LWT

Additional Features

  • Security: Support SSL Certificate Hot Reloading
    Scylla supports hot reloading of SSL Certificates. If SSL/TLS support is enabled, then whenever the files are changed on disk, Scylla will reload them and use them for subsequent connections.
  • CQL: support Filtering on counter columns #5635
  • Compaction: Allow Major compactions for TWCS. Make major compaction aware of time buckets for TWCS. That means that calling a major compaction with TWCS will not bundle all SSTables together, but rather split them based on their timestamps. #1431
  • Restore: Scylla no longer accepts online loading of SSTable files in the main directory. New tables are only accepted in the upload directory. Until 4.1 both options worked; starting with 4.1, the first will return an error.
  • Scylla-tools-java are now based on Apache Cassandra 3.11 tools and use Java 11.

Experimental features in Scylla 4.1

  • Change Data Capture (CDC). While functionally complete, we are still testing CDC to validate it is production ready towards GA in a following Scylla 4.x release. No API updates are expected.
    Selected updates from Scylla 4.0:
    • Use a special partitioner for CDC Log, allowing Scylla to ensure that we’re able to generate a CDC stream for each vnode.
    • Misleading error from CDC preimage when RF > number of nodes #5746
    • CDC: Frozen lists are serialized incorrectly in CDC log #6172
    • Segfault in CDC postimage #6143

See CDC Documentation

  • User Defined Functions #2204
    Scylla now has basic Lua-based UDFs. First released in Scylla 3.3, this feature is still under development and will stay experimental in 4.1.

Select updates from Scylla 4.0

Redis

The following Redis commands were added: setex, ttl, and exists.

Monitoring

Scylla 4.1 dashboards are available in the latest Scylla Monitoring release (3.4).

Other notable bug fixes and updates in this release

  • Security: upgrade bundled gnutls library to version 3.6.14, to fix gnutls vulnerability CVE-2020-13777 #6627
  • Stability: an integer overflow in the index reader, introduced in 3.1.0. The bug could be triggered on index files larger than 4GB (corresponding to very large sstables) and result in queries not seeing data in the file (data was not lost). #6040
  • Stability: a rare race condition between a node failure and streaming may cause abort #6414, #6394, #6296
  • Stability: When hinted handoff is enabled, commitlog positions are not removed from rps_set for discarded hints #6433 #6422. This issue would occur if there are hint files stored on disk that contain a large number of hints that are no longer valid – either older than gc_grace_seconds of their destination table, or their destination table no longer exists. When Scylla attempts to send hints from such a file, invalid hints will cause memory usage to increase, which may result in “Oversized allocation” warnings to appear in logs. This increase is temporary (ends when hint file processing completes), and shouldn’t be larger than the hint file itself times a small constant.
  • Stability: potential use after free in storage service when using API call /storage_service/describe_ring/ #6465
  • Stability: Issuing a reverse query with multiple IN restrictions on the clustering key might result in incorrect results or a crash. For example:
    CREATE TABLE test (pk int, ck int, v int, PRIMARY KEY (pk, ck));
    SELECT * FROM test WHERE pk = ? AND ck IN (?, ?, ?) ORDER BY ck DESC;
    #6171
  • Row cache is incorrectly invalidated in incremental compaction (used in ICS and LCS) #5956 #6275
  • Stability (CQL): Using prepare statements with collections of tuples can cause an exit #4209
  • LWT: In some rare cases, a failure may cause two reads with no write in between to return different values, which breaks linearizability #6299
  • Alternator: Alternator batch_put_item item with nesting greater than 16 yields ClientError #6366
  • Alternator: alternator: improve error message in case of forbid_rmw forbidden write #6421
  • Setup: Scylla saves coredumps on the same partition that Scylla is using #6300
  • Performance: A few of the local system tables in the `system` keyspace, like large_partitions, did not set their GC grace period to 0, which may result in millions of tombstones being
    needlessly kept for these tables and can cause read timeouts. Local system tables use LocalStrategy replication, so they do not need to be concerned about the GC grace period. #6325
  • Stability: online loading of SSTable files in the main directory is disabled.
    Until now, one could upload tables either in a dedicated upload directory or in the main data directory.
    The second option, also used by Apache Cassandra, proved to be dangerous and is disabled from this release forward.
  • Stability: multishard_writer can deadlock when the producer fails, for example when a node fails during a repair #6241
  • Stability: fix segfault when taking a snapshot without keyspace specified
  • Performance: bypass cache when generating view updates from streaming #6233
  • nodetool status returns wrong IPv6 addresses #5808
  • Docker: docker-entrypoint.py doesn’t gracefully shutdown scylla when the container is stopped #6150
  • CQL: Allow any type to be casted to itself #5102
  • CQL: ALTERing compaction settings for table also sets default_time_to_live to 0 #5048
  • Stability: Crash from storage_proxy::query_partition_key_range_concurrent during shutdown #6093
  • Stability: When speculative read is configured a write may fail even though enough replicas are alive #6123
  • Performance: Streaming during bootstrapping is slow due to compactions #5109
  • Cloud Formation: Missing i3.metal in Launch a Scylla Cluster using Cloud Formation. scylla-machine-image#31
  • scylla-tools-java: cassandra-stress fails to retry user profile insert and query operations #86
  • scylla-tools-java: sstableloader null clustering column gives InvalidQueryException #88
  • scylla-tools-java: sstableloader fails with uppercase KS trying to load into lowercase when importing few years old cassandra #126
  • scylla-tools-java: nodetool: permission issues when running nodetool #147
  • scylla-tools-java: via cqlsh not possible to escape special characters such as ‘ when used as FILTER for UPDATE #150
  • Alternator: KeyConditions with “bytes” value doesn’t work #6495
  • Alternator: alternator: wait for schema agreement after table creation #6361
  • Repair based node operations is disabled by default (same as 4.0)
  • Coredump setup: clean coredump files #6159
  • Coredump setup: enable coredump directory mount #6566
  • Stability: scylla init is stuck for more than a minute due to mlocking all memory at once #6460
  • Upgrade: when required, for example, during upgrade, to use an old SSTable format, Scylla will now default to LA format instead of KA format. New clusters will always use MC format. #6071


Scylla University Updates, June 2020

 

2020 has been a busy year so far at Scylla University. We published lots of new content to help you improve your Scylla and NoSQL skills. Here is an overview of the new content.

Language Lessons

People are passionate about their preferred programming languages, so Scylla University strives to provide content tailored to many predominant programming languages: Rust, Go, Python, Node.js, and C++.

  • Rust and Scylla lesson: Build a simple Rust application that connects to a Scylla cluster and performs basic queries. For establishing communication between our application and the Scylla server, you’ll use CDRS, which is an open-source Scylla driver for Rust.
  • Golang and Scylla Part 1: In this lesson, you’ll see how to use GoCQL, a ScyllaDB client for the Go language, to interact with a Scylla cluster. Go is an open-source programming language designed at Google in 2007. It’s syntactically similar to C but with garbage collection, structural typing, memory safety, and communicating sequential processes style concurrency.
  • Golang lesson Part 2: the lesson walks you through an example of how to store images in ScyllaDB using Golang.
  • Using GoCQLX to interact with a Scylla cluster: the lesson introduces the GoCQLX package and uses a hands-on example to show how it works in practice. GoCQLX is an extension to the GoCQL driver. It improves developer productivity while not sacrificing any performance.
  • Coding with Python Part 1: In this lesson, you’ll see how to connect to a Scylla cluster using the driver for Python with an example application.
  • Coding with Python Part 2: Prepared Statements: learn how to use Prepared Statements in your Python applications to improve performance.
  • Coding with Python Part 3 – Data Types: learn how to store binary files in Scylla with a simple Python application.
  • Node.js lesson: learn how to connect to a Scylla cluster using the Node.js driver
  • CPP lesson: the lesson covers the C++ driver, which is compatible with both Scylla and Cassandra. It goes over an example of how to use it to connect to a Scylla cluster and perform basic operations.

Operations and Management

Regardless of the programming language you use, these provide both the fundamentals and advanced deep dives that will allow you to get the most out of Scylla. We also introduce you to a whole new way to connect to Scylla, using our new DynamoDB-compatible API.

  • Operations course: designed with Administrators and Architects in mind.
    By the end of this course, you’ll gain a deep understanding of building, administering, and monitoring Scylla clusters, as well as how to troubleshoot Scylla.
  • Data Modeling course: this course explains basic and advanced data modeling techniques including application workflow, query analysis, denormalization, and other NoSQL data modeling topics while showing some concrete hands-on examples of how to do it.
  • Monitoring and Manager lab: you’ll learn how to start Scylla Manager and connect it to a Scylla cluster and set up Scylla Monitoring, using a hands-on example. The lesson is compatible with Manager 2.0.
  • DynamoDB API Compatibility – Project Alternator Basics: get started with Scylla Alternator, an open-source project that gives Scylla compatibility with DynamoDB. The lesson introduces the project. Afterward, it goes over a hands-on example of creating a one node Scylla cluster and performing some basic operations on it.
  • Expiring Data with TTL lesson: Time to Live (TTL) is a Scylla (and Apache Cassandra) feature that is often used but not always fully understood. Scylla provides the functionality to automatically delete expired data according to TTL. It can be set when defining a table, or when using INSERT and UPDATE queries. The lesson covers different use cases for using TTL with some hands-on examples you can try out yourself.
  • Compaction lesson: This lesson starts by giving an overview of the compaction process and how it works. It then goes on to cover the different compactions strategies: Size Tiered Compaction Strategy (STCS), Leveled Compaction Strategy (LCS), Time-Window Compaction Strategy (TWCS), and Incremental Compaction Strategy (ICS) and when to use each of them.

If you haven’t tried out Scylla University yet, just register as a user and start learning. All the material is available for free. Courses are completely online, self-paced, and include practical examples, quizzes, and hands-on labs.

University Dean’s List Awards

Out of the hundreds of active users a month using Scylla University, the three top users were:

  1. Felipe Santos, Dynasty Sports & Entertainment, “Scylla University is a great place to start if you are not familiar with NoSQL databases and planning to switch your current relational database to a very robust NoSQL solution.
    Moreover, its content helped me a lot on my first steps delving into NoSQL world and gave me many insights of how that would affect our product’s design.”
  2. Tho Vo, Weatherford, “The courses are very well laid-out, easy to follow. Top grade materials.”

     
  3. Vinicius Neves, “Scylla’s performance is fantastic! We achieved great results by implementing it in our streaming stream. Scylla University attracted our collaborators for its easy and intuitive language!” 

     

Cool Scylla T-shirts are on their way to you.

A reminder, points are awarded for:

  • Completing lessons including quizzes and hands-on labs, with a high score
  • Submitting a feedback form, for example, this one.
  • General suggestions, feedback, and input (just shoot me a message on the #scylla-university channel on Slack)

Get Certified

For each course you complete, you get a complimentary certificate. Completing the training helps you gain knowledge for your latest project or for your next role. Gain credibility as a ScyllaDB and NoSQL expert by sharing your certificates on your LinkedIn profile and showing off your achievements.

To see your certificates go to your profile page.

Check out more courses and join the #scylla-university channel on Slack for more training related discussions.


Scylla University: Using the DynamoDB API in Scylla

Alternator

Scylla University has many courses on how to configure, administer and get the most out of your Scylla cluster. Most of them are written within the context of using our Cassandra-compatible CQL interface. With the release of Scylla Open Source 4.0, we now provide content for users who are coming to Scylla in the context of using its new DynamoDB®-compatible API, which we call Project Alternator.

Our first lesson is called DynamoDB API Compatibility – Project Alternator Basics. If you are already familiar with tools like git and Docker, you should be able to complete the lesson in around 12 minutes.

Overview

Alternator is an open-source project that gives Scylla compatibility with Amazon DynamoDB.

In this lesson, we’ll start by introducing the project. Afterward, we’ll see a hands-on example of creating a one node Scylla cluster and performing some basic operations on it.

The goal of this project is to deliver an open-source alternative to DynamoDB, deployable wherever a user would want: on-premises, on other public clouds like Microsoft Azure or Google Cloud Platform, or still on AWS (for users who wish to take advantage of other aspects of Amazon’s market-leading cloud ecosystem, such as the high-density i3en instances). DynamoDB users can keep their same client code unchanged. Alternator is written in C++ and is a part of Scylla.

You can read more about it in this blog post and in the documentation.

The three main benefits Scylla Alternator provides to DynamoDB users are:

  1. Cost: DynamoDB charges for read and write transactions (RCUs and WCUs). A free, open-source solution doesn’t.
  2. Performance: Scylla was implemented in modern C++. It supports advanced features that enable it to improve latency and throughput significantly.
  3. Openness: Scylla is open-source. It can run on any suitable server cluster regardless of location or deployment method.

Setting up a Scylla Cluster

If you haven’t done so yet, download the example from git:

git clone https://github.com/scylladb/scylla-code-samples.git

Go to the directory of the alternator example:

cd scylla-code-samples/alternator/getting-started

Next, we’ll start a one-node cluster with Alternator enabled.

By default, Scylla does not listen to DynamoDB API requests. To enable such requests, we will set the alternator-port configuration option to the port (8000 in this example), which will listen for DynamoDB API requests.

Alternator uses Scylla’s LWT feature. You can read more about it in the documentation.

docker run --name some-scylla --hostname some-scylla -p 8000:8000 -d scylladb/scylla:4.0.0 --smp 1 --memory=750M --overprovisioned 1 --alternator-port=8000

Wait a few seconds and make sure the cluster is up and running:

docker exec -it some-scylla nodetool status

In this example, we will use Python to interact with Scylla via the Boto 3 SDK. It’s also possible to use the CLI or other languages such as Java, C#, Perl, PHP, Ruby, Erlang, and JavaScript.

Next, if you don’t already have it set up, install the boto3 Python library, which also contains the drivers for DynamoDB:

sudo pip install --upgrade boto3

In the three scripts, create.py, read.py, and write.py, change the value of “endpoint_url” to the IP address of the node.

Create a Table

Now, we’ll use the create.py script to create a table in our newly created cluster, using Alternator.

Authorization is not in the scope of this lesson, so we’ll use ‘None’ and revisit this in a future lesson.

We define a table called ‘mutant_data’ with the required properties such as the primary key “last_name” which is of a String data type. You can read about Boto 3 data types here.

The DynamoDB data model is similar to Scylla’s. Both databases have a partition key (also called “hash key” in DynamoDB) and an optional clustering key (called “sort key” or “range key” in DynamoDB), and the same notions of rows (which DynamoDB calls “items”) inside partitions. There are some differences in the data model. One of them is that in DynamoDB, columns (called “attributes”), other than the hash key and sort key, can be of any type, and can be different in each row. That means they don’t have to be defined in advance. You can learn more about data modeling in Alternator in more advanced lessons.

In this simple example, we use a one node Scylla cluster. In a production environment, it’s recommended to run a cluster of at least three nodes.

Also, in this example, we’ll send the queries directly to our single node. In a production environment, you should use a mechanism to distribute different DynamoDB requests to different Scylla nodes, to balance the load. More about that in future lessons.
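
The full script is in the repository; a minimal sketch of what create.py does with Boto 3 looks roughly like this (the endpoint URL is a placeholder for your node’s address, and the exact script may differ):

import boto3

# Point the DynamoDB client at the local Alternator port instead of AWS.
# Credentials are not checked in this lesson, so placeholder strings are used.
dynamodb = boto3.resource('dynamodb', endpoint_url='http://127.0.0.1:8000',
                          region_name='None',
                          aws_access_key_id='None', aws_secret_access_key='None')

# Create a table whose partition (hash) key is the string attribute last_name.
dynamodb.create_table(
    TableName='mutant_data',
    KeySchema=[{'AttributeName': 'last_name', 'KeyType': 'HASH'}],
    AttributeDefinitions=[{'AttributeName': 'last_name', 'AttributeType': 'S'}],
    BillingMode='PAY_PER_REQUEST')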

Run the script:

python create.py

Each Alternator table is stored in its own keyspace, which Scylla automatically creates. Table xyz will be in keyspace alternator_xyz. This keyspace is initialized when the first Alternator table is created (with a CreateTable request). The replication factor (RF) for this keyspace and all Alternator tables is chosen at that point, depending on the size of the cluster: RF=3 is used on clusters with three or more live nodes. RF=1 is used if the cluster is smaller, as in our case. Using a Scylla cluster of fewer than three nodes is not recommended for production.

Performing Basic Queries

Next, we will write and read some data from the newly created table.

In this script, we use the batch_write_item operation to write data to the table “mutant_data.” This allows us to write multiple items in one operation. Here we write two items using a PutRequest, which is a request to perform the PutItem operation on an item.

Notice that unlike in Scylla (and Cassandra, for that matter), writes in DynamoDB do not have a configurable consistency level; they use CL=QUORUM.

Execute the script to write the two items to the table:

python write.py

Next, we’ll read the data we just wrote, again using a batch operation, batch_get_item.
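A minimal sketch of such a read, again assuming the dynamodb resource and the illustrative keys used above:

# Fetch both items by their partition keys in a single batch request.
response = dynamodb.batch_get_item(RequestItems={
    'mutant_data': {'Keys': [{'last_name': 'Komarov'}, {'last_name': 'Jordan'}]}
})
print(response['Responses']['mutant_data'])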

The response is a dictionary with the result, the two entries we previously wrote.

Execute the read to see the results:

python read.py

DynamoDB supports two consistency levels for reads, “eventual consistency” and “strong consistency.” You can learn more about Scylla consistency levels here and here. Under the hood, Scylla implements strongly-consistent reads with LOCAL_QUORUM, while eventually-consistent reads are performed with LOCAL_ONE.
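For example, a single-item read can request strong consistency by setting the ConsistentRead flag; a sketch, reusing the illustrative key from above:

# Read one item with strong consistency.
table = dynamodb.Table('mutant_data')
response = table.get_item(Key={'last_name': 'Komarov'}, ConsistentRead=True)
print(response.get('Item'))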


Conclusion

In this lesson, we learned the basics of Alternator: the open-source DynamoDB Scylla API. We saw how to create a cluster, connect to it, write data, and read data. Future lessons will cover more advanced topics and more interesting examples, including data modeling, backup and restore, single region vs. multi-region, streams (CDC), encryption at rest, and more.

Sign Up for Scylla University

Once you’re done reading the blog, make sure you sign up for or log in to Scylla University to get credit for taking this course. You’ll get your own personalized certificate, which you can add to your LinkedIn profile.


Scylla Manager Release 2.1

The Scylla Manager team is pleased to announce the release of Scylla Manager 2.1, a production-ready version of Scylla Manager for Scylla Enterprise and Scylla Open-Source customers. Scylla Manager is a centralized cluster administration and recurrent tasks automation tool. Manager 2.1 brings improvements to Backup, Repair and Health Check tasks.

Manager 2.1 is backward compatible with the previous version of Scylla Manager, and in particular can continue a Manager 2.0 backup plan. The release includes a corresponding Manager Agent 2.1 release.

Scylla Enterprise customers are encouraged to upgrade to Scylla Manager 2.1 in coordination with the Scylla support team.


Backup Improvements

The speed of backup to AWS S3 is improved by changing the structure of the metadata and improving the read and purge speed of older backups.

  • Usability: Add Backup Delete command
  • Usability: In addition to data, Scylla Manager now also backs up the full schema. To enable this feature, provide CQL credentials by running sctool cluster update with the --username and --password flags.
  • Robustness: Scylla Manager can back up a cluster that has down nodes, as long as all ring tokens are covered by live nodes.
  • Performance: When there are many small tables, the total upload time is shortened: Manager uses long polling to move to the next table immediately once the current table’s data is uploaded.
  • Performance: Improved purge performance
  • Storage: Better disk space management
    • Old snapshots are deleted before taking a new one
    • Active snapshots are deleted before purge phase
  • Performance: The number of bytes sent from the Agent to the Server to report upload progress is reduced to just a few bytes (vs. up to 10MB in extreme cases).
  • Usability: Sctool backup list command shows the total backup size
  • Usability: Sctool backup files command can now work even when the number of tables or files in the backup is huge.
  • Usability: Config option poll_interval is removed.

Healthcheck Improvements

The Manager Health Check task is improved, making it more reliable and removing cases of false alerts.

  • Robustness: In Manager 2.1, the CQL ping from Scylla Manager can execute a CQL SELECT query instead of a CQL OPTIONS request. This gives more reliable latency and status reports.
    To enable this option, provide CQL credentials by running sctool cluster update --username user --password pass
  • Robustness: Nodes that are known to be in Down state are not pinged (REST and CQL), eliminating false alerts.
  • Usability: Sctool status command is enhanced to:
    • Show all clusters by default
    • Report the new statuses UNAUTHORISED and TIMEOUT, and show the HTTP error code in case of a REST check failure
    • Provide gossip information in nodetool format, e.g. UN, DN, UJ
    • Drop the SSL column; SSL information is added to the CQL output

Repair Improvements

More granular control of the repair operation lets you fine-tune repair speed.

  • Usability: Sctool repair command gained an --intensity flag that lets you control repair speed.
  • Usability: Resuming a repair now always works regardless of the repair’s age. This can be overridden using the `max_age` configuration option.

Other Improvements

  • Agent binary can take multiple config files (needed for docker)
  • Config files are better validated: values in config files that are not recognised as options will trigger errors.
  • Rclone base is changed from v1.50.2 to v1.51.0.
  • Go version upgrade from 1.13.5 to 1.13.8
  • Debug (pprof) port enabled by default, bound to localhost – from both Manager server and agent.
  • Logging of errors is improved: there is no errorVerbose field, just the error message. For errors logged at the ERROR level, we added errorStack, which contains a stack trace of the error.
  • Manager Server and Agent transport: retry backoff improvements
    • In some cases (e.g., when the request body was already read) we did not retry – this is fixed
    • For requests originating from sctool / the REST API, we use a different retry policy when the request has to be performed against a given host (1 retry, 1s wait).
  • Sctool --show-table flag takes no parameter
  • Sctool table headers start with capital letters

Monitoring

You can use Scylla Monitoring Stack 3.3 with the -M 2.1 option to get the Manager 2.1 dashboards.

The following metric was removed in Manager 2.1:

  • scylla_manager_backup_move_manifest_retries


Scylla Enterprise Release 2019.1.9

The ScyllaDB team announces the release of Scylla Enterprise 2019.1.9, a production-ready Scylla Enterprise patch release. As always, Scylla Enterprise customers are encouraged to upgrade to Scylla Enterprise 2019.1.9 in coordination with the Scylla support team.

The focus of Scylla Enterprise 2019.1.9 is improving stability and bug fixes. More below.


Fixed issues in this release are listed below, with open source references, if present:

  • Tooling: nodetool status returns wrong IPv6 addresses #5808
  • AWS: Update enhanced networking supported instance list #6540
  • CQL: Filtering on a static column in an empty partition is incorrect #5248
  • Stability: When a node fails during an ongoing node replace procedure and is then restarted with no data, parts of the token range might end up not assigned to any node #5172
  • API: Scylla returns the wrong error code (0000 – server internal error) in response to authentication/authorization operations that involve a non-existent role #6363
  • Stability: potential use after free in storage service #6465
  • Stability: When hinted handoff is enabled, commitlog positions are not removed from rps_set for discarded hints #6433 #6422
  • Stability: multishard_writer can deadlock when the producer fails, for example when a node fails during a repair #6241


Opera: Syncing Tens of Millions of Browsers with Scylla

Opera is a growing brand in the web browser industry, with hundreds of millions of monthly active users. In particular, its rate of growth has been greatly accelerated by adoption across Africa, where it offers its own content and news hubs, making Opera a vital factor in a faster and more accessible Internet experience.

At Scylla Summit 2019 we had the pleasure of hosting Rafal Furmanski and Piotr Olchawa of Opera, who explained how they use Scylla to synchronize user data across different browsers and platforms.

Founded in 1995 in Oslo, Norway, Opera is now a NASDAQ-traded company (OPRA), with offices in Poland, Sweden and China. Its portfolio ranges from lightweight mobile browsers like Opera Mini, Opera for Android and Opera Touch to desktop browsers like Opera GX, designed for gamers looking to engage with their favorite brands and content.

Opera GX is Opera’s new web browser for gamers, with direct integrations for sites like Twitch, Discord, Reddit and YouTube, as well as features like WebRTC, crypto wallets, and VPNs.

Users want a seamless experience across their desktop and mobile devices. Opera Sync is the service Opera offers to ensure users can find what they are looking for and use it in a way they are familiar with across their many browsers. It synchronizes data, from favorite sites, bookmarks, and browser histories, to passwords and preferences. While not everyone who uses an Opera browser relies upon the Opera Sync feature, each month tens of millions of users depend on it to make their access across platforms fast, familiar and easy.

Opera Sync runs on Google Cloud Platform, using Scylla as a backend source-of-truth data store on 26 nodes across two data centers: one in the U.S. and one in the Netherlands. It also uses Firebase Cloud Messaging to update mobile clients as changes occur.

Opera maintains a data model in Scylla for what information needs to be shared across browsers. This is an example of a bookmark:

# Imports assume the cqlengine model layer used by django_cassandra_engine.
from cassandra.cqlengine import columns
from cassandra.cqlengine.models import Model

class Bookmark(Model):
    user_id = columns.Text(partition_key=True)
    version = columns.BigInt(primary_key=True, clustering_order='ASC')
    id = columns.Text(primary_key=True)

    parent_id = columns.Text()
    position = columns.Bytes()
    name = columns.Text()
    ctime = columns.DateTime()
    mtime = columns.DateTime()
    deleted = columns.Boolean(default=False)
    folder = columns.Boolean(default=False)
    specifics = columns.Bytes()

The user_id is the partition key, so that all of a user’s data is collocated. The version and id of the object then complete the primary key, and the version acts as a clustering key to enable queries such as selecting all bookmarks for a user at a particular version, or, in another case, changing or removing bookmarks of a certain version. (Versions, in Opera’s case, are set as precise timestamps.)
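As an illustrative sketch (not Opera’s actual code), a query against the model above for all of a user’s bookmark changes since a given version might look like this, with the user_id and last_seen_version values as placeholders:

user_id = 'some-user-id'            # placeholder
last_seen_version = 1559000000000   # placeholder; Opera uses precise timestamps as versions

# All bookmark rows for this user with version >= last_seen_version,
# served efficiently because version is the first clustering column.
for bookmark in Bookmark.objects.filter(user_id=user_id, version__gte=last_seen_version):
    print(bookmark.id, bookmark.name, bookmark.deleted)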

Benchmarking Scylla vs. Cassandra

Opera had started using Cassandra with release 2.1, and according to Rafal, “we immediately got hit by some very, very ugly bugs like lost writes or ghost entries showing here and there. NullPointerExceptions, out-of-memory errors, crashes during repair. It was really a nightmare for us.”

“Besides all of these bugs, we observed very, very high read and write latencies,” Rafal noted. While general performance was okay, p99 latencies could be as high as five seconds. Cassandra suffered Java Garbage Collection (GC) pauses with a direct correlation to uptime: the longer the node was up, the longer the GC pause. Cassandra gave Opera many “stop the world” problems. For instance, there were failures of the gossip and CQL binary protocols: a node might not respond even though nodetool status reported it as okay.

Beyond this, Rafal recounted more headaches administering their Cassandra cluster: crashes without specific reasons, problems with bootstrapping new nodes, and never-ending repairs. “When you have to bootstrap a new node, it has to stream data from the other nodes. And if one of these nodes was down because of these problems I mentioned before, you basically had to start over with the bootstrap process.”

The “solutions” to these issues often just exacerbated the underlying problems, such as throwing more nodes at the cluster to maintain uptime. This was because as soon as a node reached 700 gigabytes of data, it became unresponsive. “We also tried to tune every piece of Java and Cassandra config. If you come from the Cassandra world, you know that there are a lot of things to configure there.” They sought help from Cassandra gurus, but the problems Opera was having seemed unique to them. As a stop-gap fix, Rafal’s team even added a cron job to periodically restart Cassandra nodes that died.

While they were dealing with their production problems, Rafal encountered Scylla at the Cassandra Summit where it debuted in 2015. Recalling sitting in the audience listening to Avi’s presentation, Rafal commented, “I was really amazed about its shard-per-core architecture.” Opera’s barrier to adoption was the wait for counters to be production-ready (available in Scylla 2.0 by late 2017). By July 2018 he had created Opera’s first Scylla cluster and performed benchmarks. Within a month, they made the decision to migrate all their user data to Scylla. By May 2019, Opera decommissioned its last Cassandra node.

The benchmark was performed on a 3-node bare-metal cluster with a mixed workload of 50% reads (GetUpdates) and 50% writes (Commits). These were the same node models Opera was using in production. Against this they ran a cassandra-stress test with a bookmark table identical to how it was defined in production. Their results were clear: Scylla sustained three times more operations per second, with far lower latencies — nearly an order of magnitude faster for p99.

In Opera’s benchmarks, Scylla provided nearly 3x the throughput of Apache Cassandra

In Opera’s benchmarks, Scylla’s mean (p50) latencies were less than half of Apache Cassandra, while long-tail (p99) latencies were 8x faster

In terms of one unanticipated bottleneck, Rafal was somewhat apologetic. “In the case of Scylla, I was throttled by our one gig networking card. So these results could be even better.”

Migrating from Cassandra to Scylla

Opera’s backend uses Python and the Django framework, so as a first step they modified their in-house library to connect to more than one database. They then prepared a 3-node cluster per datacenter, along with monitoring. They moved a few test users to Scylla (specifically, Rafal and his coworkers) to see if it worked. Once they proved there were no errors, they began moving all new users to Scylla (about 8,000 users daily, or a quarter million per month), keeping the existing users on Cassandra.

After a while, they migrated existing users to Scylla and began decommissioning nodes and performing final migration cleanup. These were the settings for their Django settings.py script, showing how it could simultaneously connect to Cassandra and Scylla, and how it was topology aware:

DATABASES = {
    'cassandra': {...},
    'scylla': {
        'ENGINE': 'django_cassandra_engine',
        'NAME': 'sync',
        'HOST': SCYLLA_DC_HOSTS[DC],
        'USER': SCYLLA_USER,
        'PASSWORD': SCYLLA_PASSWORD,
        'OPTIONS': {
            'replication': {
                'strategy_class': 'NetworkTopologyStrategy',
                'Amsterdam': 3,
                'Ashburn': 3
            },
            'connection': {..}
        }
    }
}

To determine which database to connect to for a specific user, Opera used the following Python code, which queried a memcached cluster:

def get_user_store(user_id):
    connection = UserStore.maybe_get_user_connection(user_id) # from cache

    if connection is not None: # We know exactly which connection to use
        with ContextQuery(UserStore, connection=connection) as US:
            return US.objects.get(user_id=user_id)
    else: # We have no clue which connection is correct for this user
        try:
            with ContextQuery(UserStore, connection='cassandra') as US:
                user_store = US.objects.get(user_id=user_id)
        except UserStore.DoesNotExist:
            with ContextQuery(UserStore, connection='scylla') as US:
                user_store = US.objects.get(user_id=user_id)

        user_store.cache_user_connection()

    return user_store

In the first step, they tried to get the user’s connection from the cache. On a cache hit, they could safely return the database to use. If that lookup failed, they checked the databases in order (Cassandra, then Scylla) and cached the location of the user’s data for subsequent queries.

Their migration scripts had the requirement of being able to move the user data from Cassandra to Scylla and back (in case they needed to revert for any reason). It also needed to perform a consistency check after migrating, and needed to support concurrency, so that multiple migration scripts could run in parallel. Additionally, it needed to “measure everything” — number of migrated users, migration time, migration errors (and the reasons), failures, etc.

Their migration script would pick a free user (one not currently synchronizing data) and mark it for migration by setting user_store_migration_pending = True, with a time-to-live (TTL). After the data was migrated and the consistency check was performed, they would remove the data from Cassandra and clear the connection cache. Then, finally, they would set user_store_migration_pending = False.
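A rough sketch of that flow (this is not Opera’s actual script; the helper functions set_migration_pending, copy_user_data_to_scylla, verify_consistency, delete_user_data_from_cassandra, and clear_connection_cache are hypothetical names standing in for the steps described above):

def migrate_user(user_id, ttl_seconds=3600):
    # Mark the user as being migrated; the TTL guards against a crashed script
    # leaving the flag set forever (ttl_seconds is an assumed value).
    set_migration_pending(user_id, True, ttl=ttl_seconds)

    copy_user_data_to_scylla(user_id)          # move the user's rows over
    verify_consistency(user_id)                # compare both copies before deleting
    delete_user_data_from_cassandra(user_id)   # drop the old copy
    clear_connection_cache(user_id)            # so the next lookup resolves to Scylla

    set_migration_pending(user_id, False)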

During the migration, Cassandra still caused problems in the form of timeouts and unavailability, even as the number of nodes and active users on that Cassandra cluster dwindled. Migrating huge accounts took time, and during the migration period a user was unable to use Opera Sync.

After the migration, Opera was able to shrink their cluster from 32 nodes down to 26, with plans to move to higher-density nodes and shrink further, down to 8 nodes. Bootstrapping, which used to take days, now takes hours. Performance, as stated before, was far better, with a significant drop in latencies. But best of all, their team had “no more sleepless nights.”

Initial sync time with Scylla dropped 6x over Cassandra

Long-tail latencies (p95 and p99), which had ranged as high as five seconds for reads on Cassandra, were now all safely in the single-digit millisecond range with Scylla.

Next Steps

If you want to view the full video, you can watch it below and read Opera’s slides on our Tech Talks page. In the full video you can hear more about Opera’s open source scylla-cli project for managing and repairing Scylla in a smart way.

 
