You’re Invited to Scylla Summit 2019!

Scylla Summit is returning to San Francisco this November 5th and 6th and you’re invited! Now in its fourth year, our annual community and industry conference promises to deliver even more big data insights, engaging speakers and momentous product announcements.

With more and more users deploying Scylla, the response to our Call for Speakers was amazing! Join us to hear about innovative real-time big data use cases, lessons learned, and technical deep dives from industry leaders like Capital One, Comcast, Opera, Numberly, Lookout, SAS Institute, Nauto, Tubi TV, iFood, and more!

Topics will cover the latest use cases and programming techniques across a variety of vertical markets and applications. Find out how Scylla is being used in conjunction with other groundbreaking big data and cloud-oriented technologies, from Apache Spark and Kafka to Kubernetes. Plus learn all aspects of architecture and operations, from design for scaling, to performance tuning, to migration strategies.

Keynotes from ScyllaDB’s leadership, including Dor Laor, Avi Kivity and Glauber Costa, will showcase our latest capabilities, and you’ll get a glimpse into our roadmap for our open source, enterprise and cloud products.

Scylla newbies and experts can also deepen their Scylla and NoSQL knowledge at our pre-summit Training Day on November 4th. With both basic and advanced sessions available, there’s something for everyone.

We will also hand out the latest Scylla User Awards to those who are pushing the boundaries of big data in terms of scale and innovation.

And of course, our Summit is your best chance to network and socialize with ScyllaDB’s engineering team and your industry peers.

This year’s event will be held at the Parc 55 San Francisco, just a few minutes walk to performing arts venues, the shops around Union Square and the museums by the Yerba Buena Gardens.

We hope you will join us! Save 25% when you register now – Early Bird registration for both the Scylla Summit conference and one-day training is open until the end of September. Regular pricing will begin October 1, but don’t wait too long. Training seats are limited!

REGISTER NOW FOR SCYLLA SUMMIT

The post You’re Invited to Scylla Summit 2019! appeared first on ScyllaDB.

Scylla Open Source 3.1: Efficiently Maintaining Consistency with Row-Level Repair

Repair is one of several anti-entropy mechanisms in Scylla. It is used to synchronize data across replicas. In this post, we introduce a new repair algorithm coming with Scylla Open Source 3.1 that improves performance by operating at the row-level, rather than across entire partitions.

What is Repair?

During regular operation, a Scylla cluster continues to function and remains ‘always-on’ even in the face of failures such as:

  • A down node
  • A network partition
  • Complete datacenter failure
  • Dropped mutations due to timeouts
  • Process crashes (before a flush)
  • A replica that cannot write due to a lack of resources

As long as the cluster can satisfy the required consistency level (usually a quorum), availability and consistency will be met. However, in order to automatically mitigate data inconsistency (entropy), Scylla, following Apache Cassandra, uses three processes:

  • Hinted Handoff: automatically recover intermediate failures by storing up to 3 hours of untransmitted data between nodes.
  • Read Repair: fix inconsistencies as part of the read path.
  • Repair: described below.

The Infinite War: Chaos Monster and Sun God. Chaos vs. Order; Entropy vs. Scylla Repair. (Source: Wikimedia Commons)

Scylla repair is a process that runs in the background and synchronizes the data between nodes so that all replicas eventually hold exactly the same data. Repairs are a necessary part of database maintenance because data can become inconsistent with other replicas over time. Scylla Manager optimizes repair and automates it by running the process according to a schedule.

In most cases, the proportion of data that is out of sync is very small. In other cases, for example when a node was down for a day, the difference can be much larger.

How Repair Works in Scylla Open Source 3.0

Running repair on node T-3

In Scylla 3.0, the repair master, the node that initiates the repair, selects a range of partitions to work on and executes the following steps:

  1. Detect mismatch
    1. The repair master splits the ranges into sub-ranges that contain around 100 partitions.
    2. The repair master computes the checksum of each sub-range and asks its peers to compute the checksum of the same sub-range.
  2. Sync mismatch
    1. If the checksums match, the data in this sub-range is already in sync.
    2. If the checksums do not match, the repair master fetches the data from the followers and sends the merged data back to the followers (a toy model of this flow follows this list).
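Here is that toy model: a self-contained Python sketch of the partition-level flow, where plain dictionaries stand in for replicas. The names and data structures are illustrative only, not Scylla’s actual APIs.

```python
import hashlib

# Toy model of partition-level repair (Scylla 3.0 style).
# Replicas are plain dicts of partition_key -> list of rows.

SUB_RANGE_SIZE = 100  # partitions per sub-range

def sub_range_checksum(replica, keys):
    """One checksum covers a whole sub-range, so any single-row difference
    changes it and forces the entire sub-range to be re-sent."""
    h = hashlib.sha256()
    for k in keys:
        for row in sorted(replica.get(k, [])):
            h.update(repr((k, row)).encode())
    return h.digest()

def repair(master, followers, all_keys):
    rows_streamed = 0
    keys = sorted(all_keys)
    for i in range(0, len(keys), SUB_RANGE_SIZE):
        sub = keys[i:i + SUB_RANGE_SIZE]
        master_sum = sub_range_checksum(master, sub)
        if all(sub_range_checksum(f, sub) == master_sum for f in followers):
            continue  # sub-range already in sync, nothing to transfer
        # Mismatch: fetch, merge and re-send *all* partitions in the
        # sub-range, even if only one row differed.
        for k in sub:
            merged = set(master.get(k, []))
            for f in followers:
                merged |= set(f.get(k, []))
            master[k] = sorted(merged)
            for f in followers:
                f[k] = sorted(merged)
                rows_streamed += len(merged)
    return rows_streamed
```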

What is the Problem with Scylla 3.0 Repair?

There are two problems with the implementation described above:

  1. Low granularity may cause a huge overhead: a mismatch of a single row in any of the 100 partitions causes 100 partitions to be transferred. Even a single partition can be very large, not to mention the size of 100 partitions. A difference of just a single byte can easily cause 10GB of data to be streamed.
  2. The data is read twice: once to calculate the checksum and find the mismatch and again to stream the data to fix the mismatch.

Clearly, the two issues above can significantly slow down the repair process and send a huge amount of unnecessary data over the wire. In the case of a cross-datacenter repair, this extra data translates into additional costs.

Basic repair unit in Scylla 3.0 and prior: 100 partitions

What is a Row-level Repair?

By now the solution for the above seems straightforward: repair should work at the row-level, not the partition-level. Repair should transfer only the mismatched rows.

 

Single-row repair in Scylla 3.1

 

Basic repair unit beginning in Scylla 3.1: A row

The new Row Level Repair improves Scylla in two ways:

Minimize data transfer

With Row Level Repair, Scylla calculates the checksum for each row and uses set reconciliation algorithms to find the mismatches between nodes. As a result, only the mismatched rows are exchanged, which eliminates unnecessary data transmission over the network.

Minimize disk reads

The new implementation manages to:

  • Read the data only once
  • Keep the data in a temporary buffer
  • Use the cached data both to calculate the checksums and to send rows to the replicas (sketched below).
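Here is a minimal sketch of that idea, with replicas modeled as Python dicts mapping partition keys to row sets. blake2b stands in for the faster 64-bit hash, and a plain set difference stands in for the set reconciliation algorithm; none of this is Scylla’s actual implementation.

```python
import hashlib

def row_hash(key, row):
    """64-bit hash of a single row (toy stand-in for Scylla's real hash)."""
    return hashlib.blake2b(repr((key, row)).encode(), digest_size=8).digest()

def repair_rows(master, follower):
    """Row-level sync between two replicas modeled as {key: set(rows)} dicts."""
    # Each side reads its rows once and hashes them; only the small hash
    # sets (and, later, the mismatched rows) cross the wire.
    master_rows = {row_hash(k, r): (k, r) for k, rows in master.items() for r in rows}
    follower_rows = {row_hash(k, r): (k, r) for k, rows in follower.items() for r in rows}

    missing_on_follower = set(master_rows) - set(follower_rows)
    missing_on_master = set(follower_rows) - set(master_rows)

    # Only the differing rows are transferred.
    for h in missing_on_follower:
        k, r = master_rows[h]
        follower.setdefault(k, set()).add(r)
    for h in missing_on_master:
        k, r = follower_rows[h]
        master.setdefault(k, set()).add(r)

    return len(missing_on_follower) + len(missing_on_master)
```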

Results

In a benchmark done on a three-node Scylla cluster on AWS using i3.4xlarge instances, each with 1 billion rows (1 TB of data, with 1 KB of data per row) and no background workload, we tested three use cases, representing three cluster failure types:

  1. 0% synced: node has completely lost its data. This use case is where the repair needs to sync *all* of its data. In practice, it is better to run a node replace procedure in this case.
  2. 100% synced: most likely there were zero failures since the last repair and the data is fully in sync. Still, we want to minimize the effort spent reading the data just to validate that this is the case.
  3. 99.9% synced: The nodes are almost in sync, but there are a few mismatched rows.

Below are the results for each of the above cases with the old and new repair algorithm.

| Test case | Description | Time to repair (Scylla 3.0) | Time to repair (Scylla 3.1) | Ratio |
|---|---|---|---|---|
| 0% synced | One of the nodes has zero data. The other two nodes have 1 billion identical rows. | 49.0 min | 37.07 min | 1.32x faster |
| 100% synced | All of the 3 nodes have 1 billion identical rows. | 47.6 min | 9.8 min | 4.85x faster |
| 99.9% synced | Each node has 1 billion identical rows and 1 billion * 0.1% distinct rows. | 96.4 min | 14.2 min | 6.78x faster |

The new row-level repair shines (6.78x faster) where a small percentage of the data is out of sync, which is the most likely use case.

For the last test case, the bytes sent over the wire are as follows:

| Direction | Scylla 3.0 | Scylla 3.1 | Transfer data ratio |
|---|---|---|---|
| TX | 120.52 GiB | 4.28 GiB | 3.6% |
| RX | 60.09 GiB | 2.14 GiB | 3.6% |

As expected, where the actual difference between nodes is small, sending just relevant rows, rather than 100 partitions at a time, makes a huge difference: less than 3.6% of the data transfer compared to 3.0 repair!

For the first and second test cases, the speedup comes from two factors:

  • More parallelism: In Scylla 3.0 repair, only one token range is repaired per shard at a time; in 3.1 repair, Scylla repairs 16 ranges in parallel.
  • Faster hash: Scylla 3.0 repair uses the 256-bit cryptographic SHA256 hash algorithm; Scylla 3.1 repair switches to a faster, 64-bit non-cryptographic hash. This is possible because 3.1 repair has less data to hash per hash group.

Conclusions

With the new row-level repair, we fixed two major issues in the previous repair implementation.
The new algorithm proves to work much faster and sends only a fraction of the data between nodes. This new implementation:

  • Shortens the repair time, reducing the chance of failure during the repair (which requires a restart)
  • Reduces the resource consumption of CPU, disk, and network, and thus frees up more resources for online requests.
  • Significantly reduces the data transfer amounts between nodes and reduces the costs in cloud deployments (which charge for data transfers between Regions).

We have even more repair improvements planned for upcoming versions of Scylla. Please stay tuned!

The post Scylla Open Source 3.1: Efficiently Maintaining Consistency with Row-Level Repair appeared first on ScyllaDB.

Apache Spark at Scylla Summit, Part 2: Tips for Building Resilient Pipelines

With continued and growing interest in Apache Spark, we had two speakers present at Scylla Summit 2018 on the topic. This is the second of a two-part article. Make sure you read the first part, which covered the talk about Scylla and Spark best practices by ScyllaDB’s Eyal Gutkind. This second installment covers the presentation by Google’s Holden Karau.

Holden Karau is an open source developer advocate at Google. She is an Apache Spark committer and PMC member, and a co-author of Learning Spark: Lightning-Fast Big Data Analysis and High Performance Spark: Best Practices for Scaling and Optimizing Apache Spark.

In her talk, entitled Building Recoverable (and Optionally Asynchronous) Pipelines in Spark, Holden provided an overview of Spark, how it can fail and, based on those different failures, she outlined a number of strategies for how pipelines can be recovered.

It’s no secret that Spark can fail in hour 23 of a 24-hour job. In her talk, Holden asks the question: “Have you ever had a Spark job fail in its second to last stage after a ‘trivial’ update? Have you been part of the way through debugging a pipeline to wish you could look at its data? Have you ever had an ‘exploratory’ notebook turn into something less exploratory?”

According to Holden, it’s surprisingly simple not only to build recoverable pipelines, but also to make them more debuggable. But the “simple” solution proposed turns out to have hidden flaws. Holden shows how to work around them, and ends with a reminder of how important it is to test your code.



In Holden’s view, there are a few different approaches that you can take towards making recoverable pipelines, each of which is subject to interesting problems.

Holden pointed out that Spark has many different components, with confusing redundancy among them. Many of the components have two copies, such as two different machine learning libraries; an old deprecated one that “sort of works” and a new one that doesn’t work yet, but isn’t deprecated. There are two different streaming engines with three different streaming execution engines. There is also a SQL engine, which, while it may not be cutting-edge, can provide a unified view and help you to work with the SQL lovers in your organization.

The Joy of Out of Memory Errors

Holden used the familiar Spark example, Wordcount.

But, she pointed out, Wordcount is limited. Real-world big data use cases span ETL, SQL on top of NoSQL, machine learning, and deep learning. Holden’s examples can plug into all of these use cases.

In Holden’s experience, Spark is not only going to fail, it’s usually going to fail late. There are several reasons for this. Lazy evaluation may be cool, but it also means that no errors are thrown until the results are saved out. If the input files don’t exist, the HDFS connection was improperly configured, or we don’t have permissions on the database we are querying, we won’t see those errors until the job finally tries to write its results.

“I mean, I write a lot of errors in my code,” said Holden. “When you change just one small piece, everything runs fine, since you only changed one thing. You deploy to production and then everything fails. The data size is increasing without the required tuning changes. As your database starts to grow, you may need to add more nodes to Spark. If you don’t add those nodes, Spark will try really hard to finish your job and then, 23 hours later, it will fail.”

Another scenario that Holden discussed is type-checking failures that can make Spark ‘catch on fire,’ such as key-skew (or data skew). The goal is to put the fire out and move on.

Why is making a Spark job recoverable so hard? Spark jobs tend to be scheduled on top of a cluster like Kubernetes or YARN. When jobs fail, all of the resources go back to the cluster manager and report failure. For those who use notebooks and persistent clusters like Apache Livy, that’s not quite the case; you can have your Spark context and cluster resources still hanging around, which lets you recover a bit more easily, but there are still downsides.

Building a “Recoverable” Wordcount

Holden provided a first take for an updated wordcount example that supports recovery:

A “Recoverable” Wordcount

In this example, if the file exists we load it, but we don’t compute our word count. Instead, we return the raw computation, which gives us a place to checkpoint. If things go wrong, the code can run again and resume from that checkpoint.
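The slide itself isn’t reproduced here, but a minimal PySpark sketch of the pattern described above (paths and names are illustrative, not the talk’s exact code) looks like this:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("recoverable-wordcount").getOrCreate()

CHECKPOINT = "/tmp/wordcount-words.parquet"  # illustrative recovery location

def words(spark):
    """The expensive part of the pipeline: split the input into words."""
    lines = spark.read.text("input.txt")
    return lines.select(F.explode(F.split(F.col("value"), r"\s+")).alias("word"))

def recoverable_words(spark, path=CHECKPOINT):
    """Load the checkpoint if a previous run saved it; otherwise compute,
    save (blocking), and return the raw words as a place to recover from."""
    try:
        return spark.read.parquet(path)
    except Exception:  # AnalysisException when the path doesn't exist yet
        df = words(spark)
        df.write.parquet(path)
        return spark.read.parquet(path)

counts = recoverable_words(spark).groupBy("word").count()
```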

This doesn’t fix everything, however. If a job fails in a ‘weird’ way, the files will exist but they’ll be in a bad state, meaning that you need to check to make sure that things are actually in a good state. If the entire job succeeded, temporary recovery points may be sitting in persistent storage and need to be cleaned up. Another problem is that it’s not async, so it will block while saving. We could do something smarter, like using functions so that we have general-purpose code.

In a second example Holden showed how we could also do this a little bit differently to check for Spark’s success marker.

“Recoverable” Wordcount 2

Checking the success marker helps, but it also has limitations. For example, cleaning up on success is a bit complicated. We have to wait for the files to finish writing, which slows the pipeline down just to make it recoverable, when one reason we wanted to make the pipeline recoverable in the first place is that it was kind of slow. We want to minimize the impact of making it recoverable.

Holden then showed a third take on the same problem. This version checks for the two different kinds of success files: it takes in data frames and checks that the success markers exist. If they do, we load the data and return that we’re reusing it. Otherwise we save it and return the default data frame. If you know Spark, you may realize that there are some sad problems with this solution.

“Recoverable” Wordcount, 3rd try

Problems Making a “Happy Path”

We have two options: make the “happy path” (as Holden dubbed it) async, or clean up on success. To make the saving async, instead of calling save directly you can use a non-blocking DataFrame save, where we kick off a thread and save in the background. According to Holden, “That’s kind of not great.” If we were working in the JVM, we would have proper async methods, but we don’t. Instead we can just use Python threads here and kick off our save, as follows:

Adding async
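A sketch of what that thread-based save might look like (illustrative, not the talk’s exact code):

```python
import threading

def non_blocking_save(df, path):
    """Kick off the DataFrame save on a plain Python thread and return
    immediately; the caller can join() the thread later if it cares."""
    t = threading.Thread(target=lambda: df.write.mode("overwrite").parquet(path))
    t.daemon = True  # don't keep the driver alive just for the save
    t.start()
    return t

def recoverable(df, path):
    non_blocking_save(df, path)
    return df  # the returned DataFrame is still only a plan; see below
```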

But this may break things. Now we need to examine the internals of Spark to understand why making this code async doesn’t just solve the problem.

Spark’s Core Magic: The DAG

In Spark, most work is done by transformations. At Spark’s core, what we ask it to do only happens when we finally force it to do so. When we do, it constructs a directed acyclic graph (DAG). Spark doesn’t perform the transformations as we declare them; it merely makes a note that it needs to do them. The problem is that, in Spark, the data doesn’t really exist until it absolutely has to.

Transformations return new RDDs or DataFrames representing the data. Those turn out to be just ‘plans’ that are executed when we command Spark to do so. While it looks like we’re saving a dataset that exists, it actually doesn’t exist until we call save. It also goes away right after we call save.

That’s unfortunate because it requires data to exist twice. But even if we make it exist twice there’s still a slight problem; the directed acyclic graph (DAG). While a DAG could be a simple sequence, it could also be a more complex job with multiple pieces coming together. When using DataFrames, the result is a query plan, which is similar to the DAG, but more difficult to read and contains things like code generation.

Example of the difference between a Directed Acyclic Graph (DAG) on the left and a DataFrame query plan on the right

So what’s gonna go boom? We cache the result, then call count, do a non-blocking DataFrame save, and return the DataFrame. This fixes the problem, to a degree.

A recoverable pipeline using cache, sync count, and async save
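In sketch form (again illustrative, reusing the non_blocking_save() helper sketched earlier):

```python
def recoverable(df, path):
    df.cache()                   # tell Spark we intend to reuse this data
    df.count()                   # blocking action: forces the data to exist in memory
    non_blocking_save(df, path)  # the save now reads from the cached blocks
    return df                    # downstream work reuses the same cached data
```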

If Spark gets two requests for the same piece of data, Spark isn’t ‘smart’ enough to realize that two different people want the same thing; it just computes the same result twice. And if you’re adding recovery points, you are probably adding them to the expensive part of your pipeline, which is exactly what you don’t want to compute twice.

“That’s fine for me, since I work at a cloud company, and we sell compute resources. But that might not be fine to your boss.”

By forcing this cache and count, we tell Spark we want to use the data twice, which should make it exist in memory (and keep it around). Spark should save and then return the result without blocking on the save, since the data already exists and the save can read out of memory. You don’t have to worry about the fact that we might ask for it a second time. Spark will just go back and read that same block of memory. The fact that we’re reading that same memory to write out is fine; you’re not duplicating an expensive operation. Instead, we have an inexpensive operation happening twice, which may not be ideal but works well enough. There are some minor downsides, but we now have a recoverable pipeline.

You probably only need a recoverable pipeline where you have complicated business logic or a service hitting an expensive API server. Don’t make everything recoverable; just do it on the parts that matter.

As a final note, Holden pointed out that one problem is that you need a concept of job IDs. Spark has a built-in understanding of what a job ID is, but it’s better to use your own to give you control. For example, if you want to run two different backfills, you don’t want them recovering from each other partway through. Instead, you want them to run separately. You can do that by giving them different job IDs, so they don’t interfere with each other’s data.
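A trivial way to do that is to fold your own job ID into every recovery path (a hypothetical helper, not a Spark API):

```python
def checkpoint_path(base, job_id, step):
    """Namespace recovery locations by job ID so separate runs never pick up
    each other's intermediate data."""
    return f"{base}/{job_id}/{step}"

# Two backfills running side by side write to disjoint locations:
# checkpoint_path("s3://bucket/recovery", "backfill-a", "words")
# checkpoint_path("s3://bucket/recovery", "backfill-b", "words")
```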

Register Now for Scylla Summit 2019!

If you enjoyed technical insights from industry leaders, and want to learn more about how to get the most out of Scylla and your big data infrastructure, sign up for Scylla Summit 2019, coming up this November 5-6, 2019 in San Francisco, California.

Register for Scylla Summit Now!

The post Apache Spark at Scylla Summit, Part 2: Tips for Building Resilient Pipelines appeared first on ScyllaDB.

Apache Spark at Scylla Summit, Part 1: Best Practices

With continued and growing interest in Apache Spark, we had two speakers present at Scylla Summit 2018 on the topic. This is the first of a two-part article, covering the talk by ScyllaDB’s Eyal Gutkind. The second part covers the talk by Google’s Holden Karau.

With business demanding more actionable insight and ROI out of their big data, it is no surprise that analytics are a common workload on Scylla. Nor is it a surprise that Spark is a perennial favorite on the Scylla Summit agenda, and our annual gathering last year proved to be no exception. The focus was on the practical side of using Scylla with Spark, understanding the interplay between cluster configuration between the two, and making a Spark installation more resilient to failures late in a long-running analytics job.

Apache Spark

Apache Spark bills itself as a unified analytics engine for large-scale data processing. It achieves its high performance using a directed acyclic graph (DAG) scheduler, along with a query optimizer and an execution engine.

In many ways Spark can be seen as a successor to Hadoop’s MapReduce. Invented by Google, MapReduce was the first framework to leverage large server clusters to index the web. Following Spark’s release, it was quickly adopted by enterprises. Today, Spark’s community ranges from individual developers to tech giants like IBM, and its use cases have expanded to traditional SQL batch jobs and ETL workloads across large data sets, as well as streaming machine data.

While Spark has achieved broad adoption, users often wrestle with a few common issues. The first issue is how Spark should be deployed in a heterogeneous big data environment, where many sources of data, including unstructured NoSQL data, feed analytics pipelines. The second issue is how to make a very long-running analytics pipeline more resilient to failures.

Best Practices for Running Spark with Scylla

First on the agenda was Eyal Gutkind, Scylla’s head of Solutions Architecture. In this role, Eyal helps Scylla customers successfully deploy and integrate Scylla in their heterogeneous data architectures — everything from helping them with sizing their cluster to hooking up Scylla with their Spark or Kafka infrastructure. Eyal pointed out that side-by-side Spark and Scylla deployments are a common theme in modern IT. Executing analytics workloads on transactional data provides insights to the business team. ETL workloads using Spark and Scylla are also common. In his presentation, Eyal covered different workloads that his team has seen in practice, and how they helped optimize both Spark and Scylla deployments to support a smooth and efficient workflow. Some of the best practices that Eyal covered included properly sizing the Spark and Scylla nodes, tuning partition sizes, setting connector concurrency, and setting Spark retry policies. Finally, Eyal addressed ways to use Spark and Scylla in migrations between different data models.



Mapping Scylla’s Token Architecture and Spark RDDs

Eyal homed in on the architectural differences between Scylla and Spark. Scylla shards differently than Spark. But the main difference is the way Spark consumes data out of Scylla.

Spark is a distributed system with a driver program—the main function. Each executor has different cache and memory settings. Data is organized into resilient distributed datasets (RDDs), whose partitions are stored on the nodes.

A basic example of how Scylla makes evenly-distributed partitions via token ranges

Scylla takes the partitioned data and shards it by cores, distributing data across the cluster using Murmur3 to make sure that every node has an even number of partitions, preventing hotspots. This guarantees the load on each node will be even. (Read more about the murmur3 partitioner in our Scylla Ring Architecture documentation.)

The driver hashes the partition key and sends the hash to a node. From there, the cluster replicates the data to other nodes based on your replication factor (RF).

An example of how the partition hash function is applied to data to insert it into a token range

In contrast, Spark is a lazy system. It consumes data, but doesn’t do anything with it; the actual execution happens only when Spark writes data. Spark reads data taken from Scylla into RDDs, which are distributed across different nodes.

How Apache Spark distributes data across different nodes using SparkContext

When running the two systems side-by-side, multiple partitions from Scylla will be written into multiple RDDs on different Spark nodes. For example:

How Apache Spark splits multiple RDDs across nodes into partitions

Spark was written to work with the Hadoop file system (HDFS), where the execution unit sits on top of the data. The idea is to minimize traffic on the network and make it efficient. Scylla works differently; partitions sit on different nodes in the cluster. So there is a mismatch in data locality between Scylla and Spark nodes.

Tables are stored on different nodes, and gathering data across different nodes is an expensive operation. The Cassandra-Spark connector creates a SparkContext, which gives the Spark executors an abstraction of the data stored inside Scylla. It writes information into Scylla in batches; there are nuances to how this happens. Currently, the Spark connector is packaged with the Java driver. “We are trying to bring in more efficient drivers so that we can connect better between Java applications and Scylla—the Spark connector benefits from this.”

When Spark Writes to Scylla: Strategies for Batching

The Spark connector batches information inside Spark. As Eyal outlined, when you deploy Spark on Scylla you need to look at the following settings:
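A representative set of those output options, as configured through the Spark Cassandra Connector (2.x property names; the values shown are the defaults discussed below, and the host name is illustrative):

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .appName("scylla-write-tuning")
    .config("spark.cassandra.connection.host", "scylla-node1")            # illustrative
    .config("spark.cassandra.output.batch.grouping.buffer.size", "1000")  # rows buffered before a write is triggered
    .config("spark.cassandra.output.batch.size.bytes", "1024")            # bytes-per-batch trigger
    .config("spark.cassandra.output.batch.grouping.key", "partition")     # partition | replica_set | none
    .config("spark.cassandra.output.concurrent.writes", "5")              # batches in flight at once
    .config("spark.cassandra.output.consistency.level", "LOCAL_QUORUM")   # consider LOCAL_ONE for very heavy writes
    .getOrCreate())
```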

For example, if you have to write 1,000 rows, it won’t start writing to Scylla until 1,001 rows have been collected (then the first 1,000 will be sent). There are ways to change the buffer size, though. (Default is 1,000.) Another trigger, the amount of data in buffer (measured in bytes), ensures you don’t overcommit Scylla nodes to throughput that cannot be sustained. You can also define the number of concurrent writes. The default is 5.

These settings determine how writes are grouped into batches. Batches are inefficient when they mix multiple partition keys, so you need to make sure each batch targets a single partition key. These settings allow multiple grouping strategies. The default is by partition key, but you can’t guarantee the same partition key for each batch.

Another option is batching by replica set, and the third option is NONE, which disregards grouping altogether. Once a batch is ready, it goes to all participating nodes or partitions.

The maximum batch size is 1 KB. If you’re going to change the size of the batch (and there are some key use cases where you want to), be aware that it might collide with your Scylla settings: you’ll see warnings in your log, and once you exceed five kilobytes the write will error out, which can be painful!

Eyal pointed out that it is important to make sure that the batch sizes you defined in your connector are aligned with the ones you actually use inside your system.

When the connector writes into the Scylla nodes, it uses a local quorum consistency level: with a replication factor of three, it waits for at least two nodes to acknowledge the write. If you have a very high write throughput, you might want to change it to LOCAL_ONE to prevent additional latency in the system.

The Spark connector will check the table size. For example, say the table has a gigabyte of data. The driver then checks how many Spark executors have been assigned to run the job, for example, 10.

That one gigabyte will be divided by 10, so 100 megabytes will have to be fetched by each of those executors. The connector says, “Wait! That might be too much!” For example, in a Cassandra cluster you may want to throttle it, or make sure you don’t overcommit the cluster.

The default setting for input.split.size_in_mb takes that RDD (that data) and chunks it into 64-megabyte reads from the cluster. Since the driver was written for Cassandra, its awareness is by node, not by shard. From the Scylla perspective, that means a single shard, over a single connection, will try to read 64 megabytes, which is inefficient. Our recommendation is to use a smaller chunk, which will create more reads from the cluster, but each of those reads will hit a different shard. This provides better concurrency on the read side. (For example, on an 8-CPU machine, where each CPU represents a shard, you could divide the default by the number of shards, setting input.split.size_in_mb to 64 MB / 8 = 8 MB.)

If you have a huge partition and you want to dice it by rows, you can configure the fetch size to be 50, 60, or 100 rows. The current default is 1,000 rows, which is efficient. Eyal recommended that Scylla users experiment with various input sizes to find the setting that works best for each use case.
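Putting the read-side knobs together, a hedged example (Spark Cassandra Connector 2.x property names; the 8 MB split assumes the 8-shard machine from the example above, so treat the values as a starting point rather than universal advice):

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .appName("scylla-read-tuning")
    .config("spark.cassandra.connection.host", "scylla-node1")    # illustrative
    .config("spark.cassandra.input.split.size_in_mb", "8")        # default 64; 64 / 8 shards = 8
    .config("spark.cassandra.input.fetch.size_in_rows", "1000")   # default; try smaller values for huge partitions
    .getOrCreate())
```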

To Collocate, or Not to Collocate?

Eyal discussed a common topic from his customer engagements: whether or not to collocate Scylla and Spark. He pointed out that other systems recommend that you put your Spark cluster together with your data cluster, which provides some efficiencies in that you manage fewer nodes overall. This arrangement is also more efficient when spinning systems up and down, since you don’t have to multiply your control plane.

There’s a cost to that, however. Every Spark node has executors. You need to define how many cores in each one of your servers will be an executor. For example, say you have a 16 core server — 8 cores for Scylla and 8 cores for the Spark executors, dividing the node in half. This may be efficient in cases where you have a constant analytics workload running against that specific cluster. It might be worthwhile to increase or scale-up the node to collocate Spark together with Scylla.

Eyal said that in his experience, “those workloads that are constantly running are fairly small, so the benefit of collocation might not be that beneficial. We do recommend separating the Spark cluster from the Scylla cluster. You’ll have to tune less. You have a more efficient Scylla that will provide better performance. And you might want to tear down or scale up/scale down your Spark cluster, so the dynamic of actually using the Spark is going to be more efficient.”

Since Spark is implemented in Java, you have to manage memory. If you’re going to set up some kind of off-heap memory, you have to consider Scylla’s usage of that memory to prevent collisions.

Fine Tuning

Spark exposes a setting that defines the number of parallel tasks you want the executor to run. The default setting of 1 can be increased to the number of cores you have in the Scylla node, creating more parallel connections. You can also reduce the split size from 64 mb to 1 mb.

There is a setting inside the Cassandra connector that determines the maximum number of connections to open. The default is calculated from the number of executors. Eyal recommended that Scylla users increase this setting to the number of cores (or more) on the Scylla side, to open more connections. If you don’t have any workload, only one connection is opened: the Java connector will open a single connection between your Spark executors and your Scylla nodes.

Concurrent writes, the number of write batches that can be in flight at once, default to 5. It might make sense to increase that number if you have a very high write load. The default for concurrent reads on each of those connections is 512; again, it makes sense to increase that number if you have a very heavy read workload.
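As a sketch, two of those knobs expressed as connector properties (2.x names as best documented; verify them against the connector version you deploy, and treat the values as assumptions for a roughly 16-core Scylla node rather than recommendations):

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .appName("scylla-fine-tuning")
    .config("spark.cassandra.connection.connections_per_executor_max", "16")  # ~ number of Scylla cores, per the advice above
    .config("spark.cassandra.output.concurrent.writes", "10")                 # raised from the default of 5 for a heavy write load
    .getOrCreate())
```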

To Conclude

Scylla does enable you to run analytics workloads, and ScyllaDB plans to improve the process further (see Scylla’s Workload Prioritization feature) to support analytics users who can benefit from a different data path and more efficient table scans.

Eyal recommended from his perspective that “if you have some questions about the most efficient way to deploy your Spark cluster with Scylla, talk to us. We have found that there are several unique use cases, and you can achieve better performance by minor tuning in the connector, or by replacing the Java driver underneath it. Resource management is the key for a performant cluster. Time after time, we see that if your resource management is correct, and you size it correctly from the Spark side and the Scylla side, you’ll be happy with the results.”

Part Two of this article covers Google’s Holden Karau and her Scylla Summit 2018 presentation on Building Recoverable (and Optionally Asynchronous) Pipelines in Spark.

Register Now for Scylla Summit 2019!

If you enjoy technical insights from industry leaders, and want to learn more about how to get the most out of Scylla and your big data infrastructure, sign up for Scylla Summit 2019, coming up this November 5-6, 2019 in San Francisco, California.

Register for Scylla Summit Now!

The post Apache Spark at Scylla Summit, Part 1: Best Practices appeared first on ScyllaDB.

Scylla Enterprise Releases 2018.1.12 and 2018.1.13

Scylla Enterprise Release Notes

The ScyllaDB team announces the release of Scylla Enterprise 2018.1.12 and 2018.1.13, which are production-ready Scylla Enterprise patch releases. Scylla Enterprise 2018.1.12 and 2018.1.13 are bug fix releases for the 2018.1 branch, a stable branch of Scylla Enterprise. As always, Scylla Enterprise customers are encouraged to upgrade to Scylla Enterprise 2018.1.13 in coordination with the Scylla support team.

The major fix in 2018.1.12 is an update to the Gossip protocol, improving the stability of a Scylla cluster while adding nodes. The following gossip-related bugs are fixed in this release:

  • When adding a node to a cluster, it announces its status through gossip, and other nodes start sending it write requests. At this time, it is possible the joining node hasn’t learned the tokens of other nodes, which can cause error messages like:
    token_metadata - sorted_tokens is empty in first_token_index!
    storage_proxy - Failed to apply mutation from 127.0.4.1#0:
    std::runtime_error (sorted_tokens is empty in first_token_index!) #3382
  • Waiting a fixed 30 seconds for gossip to stabilize is not enough. Instead, Scylla now waits for the number of nodes reported by gossip to stabilize (stay at the same value) for at least 30 seconds. #2866

The major fix in 2018.1.13 addresses a Scylla crash in rare cases while reading the old SSTable (ka/la) file formats. The root cause is a bug in the SSTable reader which, in some rare cases, makes it present data of the next partition as belonging to the previous partition. Only users using range deletions may be affected by this problem.

The error can manifest as incorrect query result (if clustering ranges for the involved partitions are not overlapping) or crashes due to violation of internal constraints. If this happens during compaction or streaming, the error will persist.

Other issues fixed in 2018.1.12

  • In some cases, when --abort-on-lsa-bad-alloc is enabled, Scylla aborts even though it’s not really out of memory #2924
  • Schema changes: schema change statement can be delayed indefinitely when there are constant schema pulls #4436
  • Move from Python 3.4 to Python 3.6 for Scylla scripts, following changes in CentOS EPEL repository
  • Operations which involve flush, like restart or drain, may take too long even under low load since the controller assigns a smaller number of shares than it can.
  • row_cache: potential abort when populating cache concurrently with MemTable flush #4236
  • Repair: repair failed with an error message: std::system_error (error system:98, Address already in use)
  • Serializing decimal and variant data types to JSON can cause an exception on the client side (e.g., cqlsh) #4348
  • Fix performance regression from solving an issue in Leveled Compaction Strategy
  • CQL: in rare cases, when executing a prepared select statement with a multicolumn IN, the system can return incorrect results due to a memory violation. For example:
    SELECT * FROM atable WHERE id=17 AND (id1, id2, id3) IN ((1, 2 , 3),(4, 5 , 6))
  • Scylla node crashes upon prepare request with multi-column IN restriction #3692, #3204


The post Scylla Enterprise Releases 2018.1.12 and 2018.1.13 appeared first on ScyllaDB.

Introducing Hybrid Solutions with VMware Cloud and DataStax

We’re excited to announce consistent data, infrastructure, and application management and operations for hybrid and multi-cloud deployments. This advanced DataStax integration builds directly on VMware’s vSphere and VMware Cloud platforms to deliver a uniform experience across on-premises, hybrid, and multi-cloud deployments and enterprise-grade availability.

With hundreds of joint customers, we’ve validated managing DataStax Enterprise (DSE) and DataStax Distribution of Apache Cassandra™ to provide simple prescriptive approaches to deploying with VMware, and today we’re making that even easier with additional deployment options. We are:

  • Adding to our joint on-premises solutions with prescriptive approaches for hybrid and multi-cloud. 
  • Empowering customers to choose the VM density and high availability (HA) options that work best for their use case, based on the number of DSE nodes per physical host and vSAN FTT, respectively. 

Hybrid and Multi-Cloud

Hybrid and multi-cloud is all the rage these days, but for our joint customers, it isn’t just hype. The challenges they’re facing in these deployment scenarios are limiting their growth in the following ways: 

  • Cloud Lock-In – Customers need a strong negotiating position with cloud vendors. Overdependence on the vendor’s ecosystem diminishes power in pricing discussions.
  • Uniform Deployments – Customers need to utilize the same tooling to deploy an application stack in development, test, and production environments.
  • Simplified Application Experience – Many customers do development work in the cloud and deploy on-premises for production (or the reverse). Application performance should be uniform across these environments. 

VMware solves these challenges by making the same hyperconverged infrastructure (HCI) stack available on-premises and in multiple clouds. Customers can deploy virtual machines in the same way, across data centers and clouds. Our new solution brief shows how DataStax customers can best leverage VMware’s hybrid and multi-cloud solutions. 

Solution Expansion

We launched our initial solution with VMware after a year of testing and collaboration. Both teams were methodical in our respective approaches and we provided a narrow solution set which we had the highest level of confidence in. The feedback was that a database with such broad use cases should have a broad set of integration options. Customers understand their application requirements and we can serve them better by helping them understand the broad solution set and when to use each. Customers originally wanted: 

  • The ability to scale resources UP (add more CPU or DRAM) as well as OUT (traditional node-based scale-out)
  • Non-niche management options for DataStax product infrastructure
  • Quickly delivered test clusters
  • Full-stack management options for security, storage, and administration

In a new white paper, we’re showcasing a more flexible range of options in terms of DSE nodes per ESX host as well as additional HA options.

We encourage everyone to join our webinar on August 14 where we’ll discuss these solutions in detail and provide a preview of what we’re showing off at VMworld 2019 in the HCI Zone of the Expo Hall and in our hybrid and multi-cloud talk. 

Running DataStax Enterprise in VMware Cloud and Hybrid Environments (webinar)

REGISTER NOW

tlp-stress 1.0 Released!

We’re very pleased to announce the 1.0 release of tlp-stress! tlp-stress is a workload-centric benchmarking tool we’ve built for Apache Cassandra to make performance testing easier. We introduced it around this time last year and have used it to do a variety of testing for ourselves, customers, and the Cassandra project itself. We’re very happy with how far it’s come and we’re excited to share it with the world as a stable release. You can read about some of the latest features in our blog post from last month. The main difference between tlp-stress and cassandra-stress is the inclusion of workloads out of the box. You should be able to get up and running with tlp-stress in under 5 minutes, testing the most commonly used data models and access patterns.

We’ve pushed artifacts to Bintray to make installation easier. Instructions to install the RPM and Debian packages can be found on the tlp-stress website. In the near future we’ll get artifacts uploaded to Maven as well as a Homebrew formula to make installs on macOS easier. For now, grab the tarball from the tlp-stress website. We anticipate following a short release cycle similar to how Firefox or Chrome would be released, so we expect to release improvements every few weeks.

We’ve set up a mailing list if you have questions or suggestions. We’d love to hear from you if you’re using the tool!

Scylla Open Source Release 3.0.9

Scylla Open Source Release Notes

The ScyllaDB team announces the release of Scylla Open Source 3.0.9, a bugfix release of the Scylla 3.0 stable branch. Scylla Open Source 3.0.9, like all past and future 3.x.y releases, is backward compatible and supports rolling upgrades.


Issues solved in this release:

  • Repair of a single shard range opens RPC connections for streaming on all shards. This is redundant and can exhaust the number of connections on a large machine. Note that Scylla Manager runs repair on a shard-by-shard basis. Running repairs from nodetool (nodetool repair) will make the issue even worse. #4708
  • CQL: Marshalling error when using Date with capital Z for timezone, for example, ‘2019-07-02T18:50:10Z’ #4641
  • Scylla init process: a possible race between role_manager and password_authenticator can cause Scylla to exit #4226
  • CQL: Using a type without a to_string() implementation as a clustering key, for example a tuple, will cause the large row detector to exit. #4633
  • Fix segmentation faults when replacing expired SSTables #4085

The post Scylla Open Source Release 3.0.9 appeared first on ScyllaDB.