How SkyElectric Uses Scylla to Power Its Smart Energy Platform

SkyElectric’s mission is “to provide sustainable energy to all” by revolutionizing how power is delivered for the world — “the great energy challenge of the 21st Century.” SkyElectric takes seriously this mission to bridge the gap in energy usage between the developed and developing nations of the world, and to address climate change. We at ScyllaDB were honored to have Jehannaz Khan, Director of Engineering of SkyElectric, and Meraj Rasool, SkyElectric DevOps Lead, speak at Scylla Summit 2019 about their use case.

The Challenge to Make Clean Energy Universally Available

In Pakistan electricity as a resource is stretched incredibly thin. This nation of 216 million people (about two thirds the population of the U.S.) produces 20 gigawatts (GW) of energy. In comparison, the U.S. produces over 1,072 GW — over 50 times the amount of electricity. Pakistan consistently suffers a 2,000MW shortfall in production behind demand, about 10% of overall production, which leads directly to load shedding — the deliberate shutdown of parts of the energy grid to prevent systemic collapse. Consequently, rural areas suffer up to 11 hours of blackouts daily and urban areas can suffer 8 hours of daily power cuts.

Unreliable energy adversely impacts the commercial and industrial sectors and the Pakistani economy overall. Jehannaz said “growth in GDP is directly linked to energy usage. The more energy you use, the more your GDP grows.” Indeed, an analysis by the U.S. Energy Information Agency (EIA) states the energy crisis costs Pakistan somewhere between 2-3% of GDP.

Pakistan has aggressive national goals to increase renewables use. Whereas wind, solar and biomass presently account for 4% of energy generation, the government seeks to increase that rate to 20-30% by 2030 through incentives as well as reforms such as allowing net-metering.

Pakistan also faces significant effects of climate change and has been described as risking being “ground zero for global warming consequences.” Extreme temperatures can reach over 50º C (122º F), not only stressing the power grid but also increasing mortality rates. So while the government can increase its energy through its abundant fossil fuel reserves (including coal and natural gas), Jehannaz emphasized “we wanted to know how we could provide this energy in a way that didn’t exacerbate climate change.”

To make clean energy universally available by building a distributed and intelligent solar and energy storage grid, managed via the Internet, across the world.

— SkyElectric’s vision and mission

Building a Cloud-Based System for Energy Management

To address these pressing challenges SkyElectric launched the Smart Energy System in February 2017, combining a solar hybrid inverter, grid interconnect and lithium-ion battery solution, “so you never have to worry about the lights going out.” These nanogrids — small power plants in homes and businesses — integrate data services to a cloud platform, including a mobile app so you could check on your power system at any time, and a network operations center (NOC) manned by a customer care team monitoring your system from afar.

SkyElectric October 2019 Statistics

An example of statistics from SkyElectric showing that for the monthly period ending 29 October 2019, 570 Megawatt hours (MWh) of their customers’ power was produced by solar, and 422 MWh came from the grid. This accounted for saving over 424 metric tonnes of CO2 emissions.

The Smart Energy System is held together by SmartFlow, an AI-driven energy management solution that controls how it operates. It is fed by historical, current and predicted information on grid availability – trying to anticipate when and if a customer’s power may shut off — plus tariff structures, sunlight, load and storage levels all to ensure that their customers maintain the highest levels of power availability at the lowest rates.

Even as they scaled to their first thousand systems installed in residences and businesses across the country SkyElectric needed to reengineer their back-end systems to deal with the massive amounts of data now under management as well as for capacity to meet their planned expansions. With over 300,000 data points per minute being sent to the cloud, that’s 157 billion data points in a year. They needed to store that data, run fault analysis on it, provide fault rectification, and perform AI algorithms to improve the service.

Evolution of the SkyElectric Cloud

The first prototype (v1) of their back end was built on the Java programming language and the MySQL database in an on-premises environment. Well before they hit their first 50 beta customers they saw problems with scaling. Database write latencies were hitting 2 seconds, and read latencies for analysis of just hours of data could take minutes. Scanning data for even a full day became prohibitive. There was no way to scale to their vision of thousands of systems given this version of their infrastructure. And there was no way to provide the security and peace of mind of their service to their customers if it was going down every few days.

The lesson they had learned was that they needed mutable data with ACID properties (users, systems configurations) as well as, separately, an immutable (read-only) time series data store for their energy and system health metrics.

In a second design (v2) SkyElectric considered MongoDB and Node.js instead. But they abandoned work on this iteration before reaching production.

Their efforts turned to a new design (v3) built with Elixir on top of Scylla and Elasticsearch (NoSQL), as well as PostgreSQL (SQL), deployed to AWS. Their use case focused on performance, scalability, throughput and availability. But also ease-of-use. With those as requirements, they searched into what were the latest databases available. “And that’s when we came across Scylla.”

Though they looked at Cassandra and RiakTS, these did not meet all of their criteria. Cassandra had serious resource issues, according to Meraj. He was also reticent to go into production with a Java-based solution given his past experiences. RiakTS seemed perfect, but Meraj saw it as a major flaw that you cannot update the schema. For example, he said that with every sprint they make minor changes. So this was a requirement of the application.

Meraj familiarized himself with Scylla through videos, benchmarks and other information online. He was greatly relieved to learn that Scylla was written in C++.

SkyElectric has now been in production with Scylla for a year and a half without any major problems. Scylla serves as the primary store for time series data, which has scaled both in request volume and overall data stored by an order of magnitude, now more than one terabyte. Meanwhile average latencies remain low and predictable: 1.4 ms for writes, and submillisecond for reads. “We are very confident we will be able to grow with it to thousands and thousands of systems.”

As a DevOps veteran Meraj also cited the ease of operations that came with Scylla. “I sleep well because of Scylla.” On adding nodes to his expanding cluster, or to replace a failed node, “It was painless. It was very easy to add a node without worrying about my data.” Software updates were also very easy, “Hassle free”.

Meraj noted he was using Scylla Manager to support weekly repairs, but looked forward to the future capabilities for Scylla Manager to also do backups and restores, which he currently does via a Python script to Amazon S3.

“Regarding performance, I will say I have seen an optimal usage of the hardware for Scylla.” For a company trying to maximize power, quite literally, utilizing every core and all the RAM in an instance is very important to Meraj and the team at SkyElectric. “This is a very big ‘yes’ for me.”

He was looking forward to the Change Data Capture (CDC) feature announced at the summit, which is coming up soon in an impending release of Scylla.

Lastly, as an open source user, Meraj was very appreciative of all the help he received through the community via Slack.

We look forward to hearing the next chapters of the SkyElectric saga. Would you like to share your own story of the mission-critical work you are doing with Scylla? If so, drop us a line.

The post How SkyElectric Uses Scylla to Power Its Smart Energy Platform appeared first on ScyllaDB.

Observability in Apache Cassandra 4.0 with Event Diagnostics

Several new observability features will be part of the next major Apache Cassandra 4.0 release. We covered virtual tables in an earlier blog post and now would like to give a preview of another new feature, Diagnostic Events, which provide real time insight into your Cassandra internals.

Observability is key to successfully operating Apache Cassandra, as it allows users and developers to find bugs and identify runtime issues. Log files and metrics are a very popular way to get insights into a cluster’s behaviour, but they are limited to small text representations or time series data. Often important information is missing from log files and can’t be added without changing the source code and rebuilding Cassandra. In addition, log text output is tailored to be readable by humans and is not designed to be machine readable without creating custom parsers.

Diagnostic Events have been designed to fill this gap by providing a way to observe all different types of changes that occur inside Cassandra as they happen. For example, testing frameworks can subscribe listeners that will block the server when change events happen, thus providing a continuous evaluation of invariants across testing scenarios and environments. While other observability tools could subscribe listeners without blocking the Cassandra server, providing third parties visibility into how Cassandra navigates a variety of changes and critical information about your cluster.

The idea of Event Diagnostics was first proposed in 2016 by Stefan Podwinksi and then implemented as part of CASSANDRA-12944.

Types of Diagnostic Events

Currently the types of observability implemented by Event Diagnostics are:

  • AuditEvents, capturing all audit log entries
  • BootstrapEvents, when a new node is bootstrapping into a cluster
  • GossiperEvents, state changes that are announced over gossip
  • HintEvents, the storage and delivery of hints
  • TokenMetadataEventsPendingRangeCalculatorServiceEvents, lifecycle changes and tasks when token ownership ranges are being calculated
  • ReadRepairEvents and PartitionRepairEvents, different types of repair events
  • SchemaAnnouncementEvents, schema changes proposed, received, and accepted
  • SchemaMigrationEvents, schema propagation across a cluster
  • SchemaEvents, high-level schema changes
  • TokenAllocatorEvents, allocation of token range ownerships, random or via an allocation strategy.

A little about the implementation

Diagnostic Events have been implemented with a JMX interface consisting of two MBeans: the DiagnosticEventService and the LastEventIdBroadcaster. The DiagnosticEventService provides methods to enable diagnostics on a per event type and to bulk read new events. The LastEventIdBroadcaster provides attributes for the last published offsets for each event type. Importanlty, the LastEventIdBroadcaster avoids the use of JMX notifications, a mechanism that too easily loses events, by maintaining an offset for each event type’s queue. Behind these JMX interfaces the persistence of events is regulated by DiagnosticEventStore and although an in-memory store is currently used, an implementation based on Chronicle Queue is planned.

Monitoring Cassandra Diagnostic Events

Reaper is one of the tools that has the ability to listen to and display Cassandra’s emitted Diagnostic Events in real time. The following section outlines how to implement this feature and gain better visiblity into your cluster.

Enabling Diagnostic Events server-side in Apache Cassandra 4.0

Diagnostic Events are not enabled (published) by default in Apache Cassandra version 4.0, but can be manually enabled. To activate the publishing of diagnostic events, enable the diagnostic_events_enabled flag on the Cassandra node:

# Diagnostic Events #
# If enabled, diagnostic events can be helpful for troubleshooting operational issues. Emitted events contain details
# on internal state and temporal relationships across events, accessible by clients via JMX.
diagnostic_events_enabled: true

Restarting the node is required after this change.

Using Reaper to display Event Diagnostics

In Reaper go to the “Live Diagnostics” page.

Select the cluster that is running Cassandra version 4.0 with diagnostic_events_enabled: true, using the “Filter cluster” field.

Expand the “Add Events Subscription” section and type in a description for the subscription to be created.

Select the node whose diagnostic events you want to observe and select the diagnostic events you want to observe. Check “Enable Live View”.

Add Diagnostic Subscription

Press “Save”. In the list of subscriptions there should now be a row that displays the information entered above.

Press “View”. A green bar will appear. Underneath this green bar the types of diagnostic events selected and the nodes subscribed to will be displayed. Press the green bar to stop showing live events.

Add Diagnostic Subscription

Event subscriptions is a way to understand what goes on inside Apache Cassandra, particularly at the cluster coordination level. Enabling Diagnostic Events has no performance overhead on the Cassandra node until subscriptions are also enabled. Reaper’s subscription may not be able to keep up and receive all events when traffic or load is significantly high on the Cassandra node as it keeps the event subscriptions in a fifo in-memory queue, of maximum length 200 per event type. The AuditEvent type can impose additional overhead on a cluster, and for that reason requires additional configuration in the auditing section of Cassandra’s yaml configuration file.

Happy diagnosing, hope you have some fun with this extra real-time insight into Cassandra internals. And if you would like to see more diagnostic event types added reach out, make your suggestions, and even better throw us a patch and get the ball rolling…

How Scylla Scaled to One Billion Rows a Second

Scylla is a highly scalable, highly performant NoSQL database. But just how fast can fast get? And what happens when you run it on a bare metal cloud like Packet? We set out to design a test that would showcase the combined abilities of Scylla as a database and Packet’s fastest instances. We’ll spill the beans early: we were able to scan more than 1,000,000,000 rows per second (that’s a billion, with a “b”). Now, if we were only doing a million rows per second per server, that’d need a cluster of 1,000 servers.

We did it with 83 nodes.

That means we were reading over 12 million rows per second per server.

Millies, Billies and Trillies, a Brief History of Benchmarking Scale

Before we get too far into the story of how we hit those numbers, let’s step back a little in history to see other milestones in NoSQL scalability. Not all of these tests are measuring the same thing, of course, but it is vital to see how we’ve measured progress in the Big Data industry. Each of these achievements is like summiting a tall mountain. Once one company or one open source project reached the milestone, it sent others scurrying to summit the peak themselves. And, no sooner was one summit reached than higher performance targets were set for the next generation of tests.

It was only eight years ago, in November 2011, when Netflix shared their achievement on hitting over one million writes per second on Cassandra. They did it with 288 servers (a mix of 4 and 8 CPU machines) grouped in 3 availability zones of 96 instances each. Each server could sustain only about 11,000 – 12,000 writes per second.

Fast forward less than three years, to May 2014, and Aerospike was able to achieve one million transactions per second on a single $5,000 commodity server. There was a key revolution going on in server hardware at the time that enabled this revolutionary advance: solid state drives (SSDs). Aerospike was optimized for Flash storage (SSDs), whereas the Netflix test used hard disk drives (HDD) in the m1.xlarge.

Scylla released the very next year, in 2015. With the Seastar engine at its heart, and being a complete re-write of Cassandra in C++, Scylla proved you could make a Cassandra-compatible database also hit a million transactions on a single server. Wait. Did we say only a million TPS? The very next month, with a bit more optimization, we hit 1.8 million TPS.

Across in the industry more “milly” scale tests pushed new limits. Redis first announced hitting 1.2 million ops on a single instance in November 2014. What’s more, it boasted its latencies were sub-millisecond (<1 msec). It later announced hitting 10 million ops at <1 ms latency with only 6 nodes at the very end of 2017. They’ve since scaled to 50 million ops on 26 nodes (September 2018) and recently 200 million ops on 40 instances (Jun 2019); this last test averaging about 5 million operations per second per server. The key to Redis’ performance numbers is that it operates in-memory (from RAM) for the most part, saving a trip to persistent storage.

As nodes become denser the capability to hit 1 million ops has no longer become an achievement; it’s now more of a database vendor’s basic ante into the performance game. So beyond a “milly,” we can look at the next three orders of magnitude up to the ‘billies;” operations on the order of billions of seconds.

Speaking about in-memory scalability, about a half a year before the Netflix blog, in April 2011, the in-memory OLAP system Apache Druid arrived on the scene in a big way. In a 40-node cluster, they were able to aggregate 1 billion rows in 950 milliseconds — a “billy.” But that pales in comparison to MemSQL claiming reading a trillion rows per second in-memory in 2018 — a “trilly.”

Which gets into the next dimension of these tests of scalability. What are you specifically measuring? Are you talking about transactions or operations per second? Are they reads or writes? Or are you talking about aggregations or joins? Separate single row reads or returning multiple rows-per-second results from a single query? Was this on persistent storage or in-memory? While scale is important, it is vital to note these are not all apples-to-apples comparisons. While large numbers are impressive, make sure you understand the nature of what each vendor is testing for. It goes to show that big data headlines claiming “millions,” “billions” and “trillions” all need to be looked at on their own merits.

Basic Methodology: Hiding a Needle in a Haystack

We set forth to create a database that simulated the scenario of a million homes, all equipped with smart home sensors capable of reporting thermometer readings every minute, for a year’s worth of data.

That equates to:

1,000,000 * 24 * 60 * 365 = 525.6 billion data samples for a single year.

To give an idea of how much data that is, if you were to analyze it at one million points per second (a “milly”), it would till take over 146 hours — almost a week — to analyze.

To generate the data, we used the IoT Temperature example application, available in the Scylla Code Examples repository. The application mimics an IoT application that stores temperature reading from many sensors that are sent to the database every minute, for a total of 1440 points per day. Each sensor has a sensor_id, that is used as part of a composite partition key together with a YY-MM-DD date. The schema is as follows:

CREATE TABLE billy.readings (
    sensor_id int,
    date date,
    time time,
    temperature int,
    PRIMARY KEY ((sensor_id, date), time)

This is a pretty standard schema for a time-series workload that allows for efficient queries that query a specific point in time for a sensor by specifying the entire primary key, and also queries for an entire day by specifying just the partition key components (sensor_id and date)

Aggregations are also possible by iterating over all possible sensor_id and date, if the dataset is non-sparse, or by means of a full table scan if the dataset is sparse (There is also a Full-table scan example in the Scylla Code Examples repository.)

We used the populator binary from our application to generate 365 days of data coming from 1,000,000 sensors. This is a total of 525 billion temperature points stored.

To create the data set we launched two dozen loader / worker nodes to generate temperatures for the entire simulated year’s worth of data according to a random distribution with minor deviations.

We then manually logged into the database and poisoned one of the days by inserting outlier data points that don’t conform to the original distribution. These data points serve as an anomaly that can be detected. In specific, we put in a pair of extreme temperatures that are far above and below normal bounds in the same house on the same day at nearly the same time. Such extremes could indicate either a malfunction in a house’s temperature sensor, or real-world emergency conditions (for example, going from freezing cold to very hot might indicate a power failure followed by a house fire).

Really, though, it just served as a “needle” to find in a “haystack” worth of data. Because finding this anomaly without prior knowledge of the partition it was stored in requires scanning the entire database.

We wanted to minimize network transfer and do as much work as possible in the server to highlight the fact that the consistent hashing algorithms used by Scylla excel at single-partition accesses. For each partition we then executed the following query:

SELECT count(temperature) as totalRows , sum(temperature) as sumTemperature , min(temperature) as minTemperature, max(temperature) as maxTemperature from readings where sensor_id = ? and date = ?`

To read the data we then executed the reader binary distributed with the application, connecting to 24 worker / loader machines. A twenty-fifth machine was used as the system monitor and coordinator for load daemons.

Our “Billy” test setup: the coordinator sends a batch of partition keys to each worker (in our execution, a batch contains 25,000 partition keys). The worker sends a query to Scylla for each partition to compute the number of rows, sum, max and min for each of those partitions and returns to the coordinator the total number of rows, sum, max and min for the entire batch. The coordinator then does the same for all workers and formats the data. In the process of doing that, we report the sensor id for which the max and min temperature were seen.

First Test: Scanning 3 Months of Data

Let’s assume the anomaly we’re looking for took place in a three-month period between May 1st and August 31st of 2019. Since the database stores data for an entire year and those dates are not the most recent, it is therefore unlikely to be frequently queried by other workloads. We can reasonably presume a significant fraction of this data is “cold,” and not resident in memory (not cached). For 1 million sensors each with 3 months worth of data that results in a scan of more than 130 billion temperature points!

Due to Scylla’s high performance all the 130 billion temperature points are scanned in less than 2 minutes. This averages close to 1.2 billion temperature points read per second. The application’s output tells us the greatest temperature anomalies (an extremely low minimum and extremely high maximum temperature) in that interval were recorded on August 27th by SensorID 473869.

# ./read -startkey 1 -endkey 1000000 -startdate 2019-06-01 -enddate 2019-08-31 -abortfailure -hosts /tmp/phosts.txt -step 25000 -sweep random
Finished Scanning. Succeeded 132,480,000,000 rows. Failed 0 rows. RPC Failures: 0. Took 110,892.91 ms
Processed 1,194,666,083 rows/s
Absolute min: 19.71, date 2019-08-27, sensorID 473869
Absolute max: 135.21, date 2019-08-27, sensorID 473869

And since a picture is worth a billion points, it’s even easier if we plot the output temperatures into a graph. It becomes obvious that something odd was going on that day!

Graph of the daily averages, minimum and maximum temperatures (in fahrenheit) from our sample data clearly show when sensor 473869 had an anomaly on a day in late August 2019.

By looking at our monitoring data, we can indeed see that a significant fraction of our dataset is served from cold storage:

A normal workload, where some rows are cached while some are not. The point-in-time request-rate depends on the cache hit rate, but overall 132 Billion rows are read in 110 seconds, averaging 1.194 Billion rows per second

Now we must ask ourselves: were there any other similar though lesser temperature anomalies anywhere that day? To find that out, we can run the application again, this time excluding the already-flagged SensorID. Because a similar scan just took place the data is now all cached and we now take less than one second to scan all 1.5 billion temperature points for that day – 1.5 billion data points per second.

# ./read -startkey 1 -endkey 1000000 -startdate 2019-08-27 -enddate 2019-08-27 -abortfailure -hosts /tmp/phosts.txt -step 25000 -sweep random --exclude=473869
Finished Scanning. Succeeded 1,439,998,560 rows. Failed 0 rows. RPC Failures: 0. Took 938.84 ms
Processed 1,533,812,721 rows/s
Absolute min: 68.38, date 2019-08-27, sensorID 459949
Absolute max: 75.75, date 2019-08-27, sensorID 458027

But what about the rest of the year? Inspired by the results of the first run, we are not afraid of just scanning the entire database, even if we know that this time all of the data will be cold and served from storage. In fact, we now have miss ratios of essentially 100% for the majority of the workload, and are still able to scan close to 1 billion temperature points per second.

# ./read -startkey 1 -endkey 1000000 -startdate 2019-01-01 -enddate 2019-12-31 -abortfailure -hosts /tmp/phosts.txt -step 25000 -sweep random -exclude=473869
Finished Scanning. Succeeded 525,599,474,400 rows. Failed 0 rows. RPC Failures: 0. Took 542,191.31 ms
Processed 969,398,554 rows/s
Absolute min: 68.00, date 2019-05-28, sensorID 82114
Absolute max: 79.99, date 2019-03-19, sensorID 152594

To do a complete data set scan at the “billy” scale (nearly a billion rows per second) took 542 seconds – just over 9 minutes.

The Infrastructure

To run this test at scale required a partner with hardware that was up to the challenge. Enter Packet, who provided the bare metal server test bed for our analysis. The database itself ran on their n2.xlarge.x86 nodes. These machines have 28 Intel Xeon Gold 5120 cores running at 2.2 GHz, 384 GB of RAM, and 3.8 TB of fast NVMe storage. With 4x 10 Gbps NICs, we wanted to make sure that the test was not bounded by network IO. Against this, the worker cluster comprised 24 c2.medium.x86 servers. These sport 24 cores running at 2.2 GHz using the AMD EPYC 7401p chip, plus 64 GB RAM and 960 GB SSD.

See the presentation from our Scylla Summit keynote here, beginning at 4:09 in the video:

Next Steps

We’ve now reached a new summit in terms of scale and performance, but there are always more mountains to climb. We are constantly astonished to hear what our users are accomplishing in their own networks. If you have your own “mountain climbing” stories to share, or questions on how address a mountain of data you are trying to summit yourself we’d love to hear them!

Also, to help you reach your own scalability goals you might want to view a few talks from our recent Scylla Summit. First, we showed at Scylla Summit some even more efficient ways to write applications like the test presented above using upcoming new features in Scylla Open Source 3.2 including CQL PER PARTITION LIMIT, GROUP BY and LIKE.

For Scylla Enterprise users, you might also be interested in our Scylla Summit session on workload prioritization. This feature, unique to Scylla, allows you to put multiple workloads on the same cluster, such as OLTP and OLAP, specifying priorities for resource utilization to ensure no single job starves others entirely. Because while it is important to design tests to show the maximal lab-condition capabilities of a system, it is also vital to have features to prevent bringing a production cluster to its knees!

Lastly, make sure you check out how to use UDFs, UDAs and other features now committed into the Scylla main branch. This talk by Scylla CTO Avi Kivity explains how you will eventually be able to use these to shift functions and aggregations directly to database server itself, offloading your client from having to write its own coordinator code.

The post How Scylla Scaled to One Billion Rows a Second appeared first on ScyllaDB.

Project Gemini: An Open Source Automated Random Testing Suite for Scylla and Cassandra Clusters

Today we are releasing a new data integrity testing suite for the open source community. Those who will have the most direct utility for this software will be those testing Scylla and Cassandra databases, or, more broadly, other CQL-compliant databases. This could either be software development QA teams working on their own software or for users performing acceptance testing (DevOps, SREs, etc.). The design pattern can also be of interest to others building software system testing suites.

To ensure database integrity and reliability in a distributed database, it is vital to test for data correctness or data loss. Writing a database is complicated. A distributed database with a rich API is an even more daunting task since there can be so many patterns of data access, lots of different failure patterns and the database can host enormous datasets with all kinds of schemas. As we add more features and functionality, a framework like Gemini allows us to proceed in a fast pace with a low data regression risk.

Let’s presume you want to make sure that there is no data integrity/loss or corruption when you develop a new Scylla release. You just want to know that everything you believe you wrote actually got written and reads return identical answers. If there are errors detected, you can deal with fault isolation and root cause analysis later. This is where Gemini comes into play.

Readers familiar with the Jepsen suite will find it similar. Jepsen concentrates on database transactional guarantees and consensus systems behavior in the face of distributed failures. Gemini together with Scylla Cluster Test (SCT) have a similar concept and chose to highlight on huge datasets and various fuzzy data operations. Gemini and SCT loop through data modification, event interruption and data validation steps in a loop.

Goal & Definitions

Gemini is a testing tool created by ScyllaDB engineers and used internally in our quality assurance processes to find hard-to-trigger bugs that can cause data corruption or data loss in our Open Source and Enterprise database releases. Gemini accomplishes this by applying random testing to a system under test and validating the results against a test oracle. While the primary goal for Gemini at ScyllaDB is to test our own database, more broadly it can also be used to test other CQL-compatible databases such as Apache Cassandra, or test two different CQL-compatible databases against each other (such as Cassandra and Scylla) to ensure compatibility of behavior.

System Under Test

In our case, system under test (SUT) is the term for the Scylla cluster that is being inspected. For example, it can be a development version of Scylla that needs to be validated for a release. The goal of running the Gemini tool is to find bugs in the SUT.

Topology for a typical Gemini test setup. Note that the test oracle is depicted here as a single node; it could also be a cluster.

Test Oracle

When you have a system under test, what do you compare it to? The comparative system is called the test oracle. (The term has nothing to do with that large Fortune 100 database company headquartered in Redwood City, California.) The test oracle will perform in an expected, predictable manner and is known to produce correct results.

For instance, start with a well-known, stable prior version of a product. It can be run on a small cluster, or topologically simplified to run on a single-node cluster (albeit a large one to handle the load), to isolate system components such as caches, sstables, and flushes.

Ideally, if you operate on both systems in the same manner, the system under test should produce the same exact results as the test oracle. If there is any difference in results between the system under test and the test oracle, for whatever reason, the test has failed.


Gemini is a program that performs automatic random testing to discover bugs. The tool connects to the system under test and a test oracle using the CQL protocol. At high-level, Gemini operates as follows:

  • Generates random operations using the CQL query language.
  • Applies the operations to the system under test and the test oracle.
  • Validates that the system under test and the test oracle are in the same state.

In the absence of bugs, the system under test and the test oracle are expected to be in the same client-visible state. That is, after mutating both databases in the exact same way (with INSERT, UPDATE, or DELETE), subsequent CQL queries (with SELECT) are supposed to return identical result sets. If Gemini detects a difference between the result sets from the two systems, there’s a difference in behavior between the two systems. As we assume the test oracle to behave correctly, the problem is in the system under test. Of course, in practice, test oracle can sometimes also be at fault, but even then, a difference in result sets is an indication of a problem that needs to be analyzed and addressed.

Gemini can also be run in an “oracle-less” mode, where you can just generate random workloads and throw abuses at a standalone system under test as quickly as possible to see if something breaks.


Gemini is implemented with the Go programming language, using the gocql driver to connect to CQL clusters.

A Gemini run starts with the execution of the program. For a test run, Gemini needs a CQL schema (i.e. CREATE KEYSPACE and CREATE TABLE). The tool can generate one for you or users can specify the schema using an external file. Gemini uses the schema information to produce random values for every column in every table in the schema. There are various knobs in the tool to control the kind of data generated. Gemini is able to generate tables with small and large partitions, for example.

After schema creation, the Gemini assigns partition key ranges between N concurrent goroutines to only a single goroutine update and read a given partition at a time. The tool then starts to generate mutation operations or validation/check operations at random. A mutation operation changes the state of both databases and a validation/check operation queries both databases and that the results match. If there’s a mismatch, the tool reports an error to the user together with the CQL query that we detected a difference with.

At a high-level, the Gemini main loop looks as follows:

while (1) {
  generate a random CQL operation
  apply to both clusters
  validate the data

The generated CQL operation can either be a mutation or a read query. Mutations change the state of the systems and read queries are used to verify the state. During the while loop a nemesis disruption can be applied on the SUT.

A random read CQL operation can be, for example:

SELECT a, b FROM tab WHERE pk = ? AND ck = ?

SELECT a, b FROM tab WHERE pk = ? AND ck > ? LIMIT ?

SELECT b FROM tab WHERE token(pk) >= ? LIMIT ?


with many more variations trying to use indexes, ordering, etc.

The mutation CQL operations are somewhat simple: either an INSERT or a DELETE, decided at random. The validation/check operations are slightly more complex and allow variation between a single or multiple partition queries and clustering range queries.

We also run another program during the Gemini runs to perform non-disruptive nemesis operations. For example, the program forces the cluster to perform major compactions, repair, and so on, while Gemini is running. The idea is to exercise edge cases in the Scylla code to find difficult-to-trigger bugs. Scylla Cluster Test framework has many additional disruptive nemesis tests and we intend to apply most of them to the Gemini suite.

Data Generation

The data generated by Gemini is inherently random. Gemini generates data for all the CQL types as well as synthetically generated UDT. There are some limits on the size of the data imposed by the type system itself and for some types such as text we have introduced artificial limits to avoid some extreme sizes. Gemini generates randomly distributed partitions. That means that the data will be randomly affected by both writes and reads. This is often a good outset but sometimes you may want to test other things like for example how Scylla behaves with hot partitions. To support this Gemini allows for the user to decide between 2 other distributions aside from the default random namely zipf and normal distributions.


There are some inherent complications in the validation procedure. The vast majority of queries executed on any Scylla cluster is based on the partition key in one way or another. There is a reason for this and it lies at the heart of how the database is designed. We use this as a simple way to avoid concurrent updates to the same data by different workers. Each worker simply gets a range of partitions upon which it operates in isolation. This works well up to a point.

For example, queries using a secondary index can potentially hit all partitions and are thus subject to a logical race, since any of the matched partitions could be updated by another worker at any given time. This would lead to a perfectly valid state but a state that Gemini would recognize as faulty simply because the results from the SUT and the test oracle differ, a false positive.

Another matter that complicates things is the fact that Scylla updates changes to indexes and materialized views asynchronously in the background. This means that there is a data race between when the two systems were updated and when they are fully in sync. Solving this is not easy and the only reasonable approach currently is to simply retry the validation a few times. If a valid state is found we are good, otherwise we try again until a maximum time has passed. This can still lead to false positives because we may simply give up too soon.

Examples of Found Bugs

Gemini has already found some issues in Scylla in our internal daily runs. You can find some of the issues reported on Scylla’s Github issue tracker by looking at issues with the label gemini.

Reactor stall is a serious quality of service problem in Scylla. Task scheduling Scylla is non-preemptive and cooperative, which means internal tasks run to completion unless the yield the CPU to other tasks. A reactor stall is a bug in the code, where an internal task takes a very long time to complete without yielding, which causes other tasks to wait. This waiting increases tail latency. Gemini has found multiple reactor stall bugs in Scylla by generating CQL queries that trigger edge cases. For example, Gemini found a bug in Scylla, where multiple IN restrictions (that cause the result set to be a cartesian product) would be processed without yielding (see issue #5010 for discussion).

Database crashes are another serious problem. Randomized testing really shows its power in finding edge cases in the code that are not covered by our other test suites (unit tests, functional tests, or longevity tests with nemesis testing). One common class of errors Gemini has found are “out of memory” errors, which can crash the database. You can find examples of such bugs in #4590 and #5014.


Gemini is a tool for automatic random testing of Scylla and Cassandra clusters. It generates random CQL operations, applies the operations to a system under test and a test oracle, and verifies that the user-visible state is the same for the two systems. If there’s a mismatch in their states, the system under test has a bug that needs to be addressed. Gemini has already found some issues in Scylla, and we continue to develop the tool further to discover even more issues.

The source code to Gemini is available on Github:

Please refer to the following document for a quick start on running Gemini:

For more information about Gemini’s architecture, please refer to this document:

You can also view the tech talk we gave at Scylla Summit on Gemini, including the video as well as our slides.


Bonus: Gemini Wallpaper!

If you are a fan of thorough testing of open source like we are, you can show your love by downloading our new free starry night Gemini wallpapers, perfect for your laptop or mobile.


The post Project Gemini: An Open Source Automated Random Testing Suite for Scylla and Cassandra Clusters appeared first on ScyllaDB.

Cassandra Reaper 2.0 was released

Cassandra Reaper 2.0 was released a few days ago, bringing the (long awaited) sidecar mode along with a refreshed UI. It also features support for Apache Cassandra 4.0, diagnostic events and thanks to our new committer, Saleil Bhat, Postgres can now be used for all distributed modes of Reaper deployments, including sidecar.

Sidecar mode

By default and for security reasons, Apache Cassandra restricts JMX access to the local machine, blocking any external request.
Reaper relies heavily on JMX communications when discovering a cluster, starting repairs and monitoring metrics. In order to use Reaper, ops has to change Cassandra’s default configuration to allow external JMX access, potentially breaking existing company security policies. All this makes unnecessary burden on Reaper’s out-of-the-box experience.

With its 2.0 release, Reaper can now be installed as a sidecar to the Cassandra process and communicate locally only, coordinating with other Reaper instances through the storage backend exclusively.

At the risk of stating the obvious, this means that all nodes in the cluster must have a Reaper sidecar running so that repairs can be processed.

In sidecar mode, several Reaper instances are likely to be started at the same time, which could lead to schema disagreement. We’ve contributed to the migration library Reaper uses and added a consensus mechanism based on LWTs to only allow a single process to migrate a keyspace at once.

Also, since Reaper can only communicate with a single node in this mode, clusters in sidecar are automatically added to Reaper upon startup. This allowed us to seamlessly deploy Reaper in clusters generated by the latest versions of tlp-cluster.

A few limitations and caveats of the sidecar in 2.0:

  • Reaper clusters are isolated and you cannot manage several Cassandra clusters with a single Reaper cluster.
  • Authentication to the Reaper UI/backend cannot be shared among Reaper instances, which will make load balancing hard to implement.
  • Snapshots are not supported.
  • The default memory settings of Reaper will probably be too high (2G heap) for the sidecar and should be lowered in order to limit the impact of Reaper on the nodes.

Postgres for distributed modes

So far, we had only implemented the possibility of starting multiple Reaper instances at once when using Apache Cassandra as storage backend.
We were happy to receive a contribution from Saleil Bhat allowing Postgres for deployments with multiple Reaper instances, which also allows it to be used for sidecar setups.
As recognition for the hard work on this feature, we welcome Saleil as a committer on the project.

Apache Cassandra 4.0 support

Cassandra 4.0 is now available as an alpha release and there have been many changes we needed to support in Reaper. It is now fully operational and we will keep working on embracing 4.0 new features and enhancements.
Reaper can now listen and display in real-time live diagnostic events transmitted by Cassandra nodes. More background informations can be found in CASSANDRA-12944, and stay tuned for an upcoming TLP blog post on this exciting feature.

Refreshed UI look

While the UI look of Reaper is not as important as its core features, we’re trying to make Reaper as pleasant to use as possible. Reaper 2.0 now brings five UI themes that can be switched from the dropdown menu. You’ll have the choice between 2 dark themes and 3 light themes, which were all partially generated using this online tool.

Superhero reaper theme

Solarized reaper theme

Yeti reaper theme

Flatly reaper theme

United reaper theme

And more

The docker image was improved to avoid running Reaper as root and to allow disabling authentication, thanks to contributions from Miguel and Andrej.
The REST API and spreaper can now forcefully delete a cluster which still has schedules or repair runs in history, making it easier to remove obsolete clusters, without having to delete each run in history. Metrics naming was adjusted to improve tracking repair state on a keyspace. They are now provided in the UI, in the repair run detail panel:

United reaper theme

Also, Inactive/unreachable clusters will appear last in the cluster list, ensuring active clusters display quickly. Lastly, we brought various performance improvements, especially for Reaper installations with many registered clusters.

Upgrade now

In order to reduce the initial startup of Reaper and as we were starting to have a lot of small schema migrations, we collapsed the initial migration to cover the schema up to Reaper 1.2.2. This means upgrades to Reaper 2.0 are possible from Reaper 1.2.2 and onwards, if you are using Apache Cassandra as storage backend.

The binaries for Reaper 2.0 are available from yum, apt-get, Maven Central, Docker Hub, and are also downloadable as tarball packages. Remember to backup your database before starting the upgrade.

All instructions to download, install, configure, and use Reaper 1.4 are available on the Reaper website.

Note: the docker image in 2.0 seems to be broken currently and we’re actively working on a fix. Sorry for the inconvenience.

Is Arm ready for server dominance?

For a long time, Arm processors have been the kings of mobile. But that didn’t make a dent in the server market, still dominated by Intel and AMD and their x86 instruction set. Major disruptive shifts in technology don’t come often. Displacing strong incumbents is hard and predictions about industry-wide replacement are, more often than not, wrong.

But this kind of major disruption does happen. In fact we believe the server market is seeing tectonic changes which will eventually favor Arm-based servers. Companies that prepare for it are ripe for extracting value from this trend.

This became clear this week, when AWS announced their new generation of Graviton2-based chips. The Graviton2 System on a Chip (SoC) is based on the Arm Neoverse N1 core. AWS claims they are much faster than their predecessors, a claim that we put to the test in this article.

There is also movement in other parts of the ecosystem: startups like Nuvia are also a sign of times to come. With a $53 million Series A just announced, with a founder team that packs years of experience in chip design and employs as VP of Software Jon Masters, a well-known Arm-advocate in the Linux community and previous Chief Arm Architect at Red Hat, Nuvia is a name you will hear a lot more about pretty soon.

At ScyllaDB, we believe these recent events are an indication of a fundamental shift and not just a new inconsequential offer. Companies that ride this trend are in a good position to profit from it.

The commoditization of servers and instruction sets

The infrastructure business is a numbers game. In the end, personalization matters little and those who can provide the most efficient service wins. For this reason, Arm-based processors, now the dominant force in the mobile world, have been perennially on the verge of a server surge. It’s easy to see why: Arm-based servers are known to be extremely energy efficient. With power accounting for almost 20% of datacenter costs, a move towards energy efficient is definitely a welcome one.

The explosion of mobile and IoT have been at the forefront of the dramatic shift in the eternal evolutionary battle of RISC (Arm) vs. CISC (x86). As Hennessy and Patterson observed last year “In today’s post-PC era, x86 shipments have fallen almost 10% per year since the peak in 2011, while chips with RISC processors have skyrocketed to 20 billion.” Still as recently as 2017 Arm only accounted for 1% of the server market. We believe that the market is now on the inflection point of change. We’re far from the only ones with that same thought.

There have been, however, challenges for adoption in practice. Unlike in the x86 world, there hasn’t so far been a dominant set of vendors offering a standardized platform. The Arm world is still mostly custom made, which is an advantage in mobile but a disadvantage for the server and consumer market. This is where startups like Nuvia can change the game, by offering a viable standard-based platform.

The cloud also radically changes the economics of platform selection: inertia and network effects are stronger in a market with many buyers that will naturally gravitate towards their comfort zone. But as more companies offload (and upload) their workloads to the cloud and refrain from running their own datacenters, innovation becomes easier if the cost is indeed justified.

By analogy if you look at the gaming market, there has been a strong lock-in based on your platform of choice: Xbox, PS4 or a high end gaming PC. But as cloud gaming emerges from a niche into the mainstream, like Project xCloud or any number of other cloud gaming platforms, enabling you to play your favorite games from just about any nominal device, that hardware lock-in becomes less prevalent. The power shifts from the hardware platform to the cloud.

Changes are easier when they are encapsulated. And that’s exactly what the cloud brings to the table. Compatibility of server applications are not a problem for new architectures: Linux runs just as well across multiple platforms and as applications become more and more high level, the moat provided by the instruction set gets demolished and the decision shifts to economic factors. In an age where most applications are serverless and/or microservices oriented interacting with cloud-native services, does it really matter what chipset goes underneath?

Arm’s first foray in the cloud: EC2 A1 instances

AWS announced in late 2018 the EC2 A1 instances, featuring their own AWS-manufactured Arm silicon. This was definitely a signal of a potential change, but back then, we took it for a spin and the results were underwhelming.

Executing a CPU benchmark in the EC2 A1 and comparing it to the x86-based M5d.metal hints just how big the gap is. As you can see in Table 1 below, the EC2 A1 instances perform much worse in any of the CPU benchmark tests conducted, with the exception of the cache benchmark. For most others, the difference is not only present but also huge, certainly much bigger than the 46% price difference that the A1 instance has compared to their M5 x86 counterparts.

Test EC2 A1 EC2 M5d.metal Difference
cache 1280 311 311.58%
icache 18209 34368 -47.02%
matrix 77932 252190 -69.10%
cpu 9336 24077 -61.22%
memcpy 21085 111877 -81.15%
qsort 522 728 -28.30%
dentry 1389634 2770985 -49.85%
timer 4970125 15367075 -67.66%

Table 1: Result of the stress command: stress-ng --metrics-brief --cache 16 --icache 16 --matrix 16 --cpu 16 --memcpy 16 --qsort 16 --dentry 16 --timer 16 -t 1m

But microbenchmarks can be misleading. At the end of the day, what truly matters is application performance. To put that to the test, we ran a standard read benchmark of the Scylla NoSQL database, in a single-node configuration. Using the m5.4xlarge as a comparison point — it has the same number of vCPUs as the EC2 A1 — we can see that while the m5.4xlarge sustains around 610,000 reads per second, the a1.metal is capable of doing only 102,000 reads/s. In both cases, all available CPUs are at 100% utilization.

This corresponds to an 84% decrease in performance, which doesn’t justify the lower price.

Figure1: Benchmarking a Scylla NoSQL database read workload with small metadata payloads, which makes it CPU-bound. EC2 m5.4xlarge vs EC2 a1.metal. Scylla is able to achieve 600,000 reads per second in this configuration for the x86-based m5.4xlarge, but the performance difference is 84% worse for the a1.metal, while the price is only 46% cheaper.

Aside from just the CPU power, the EC2 A1 instances are EBS-only instances, which means running a high performance database or any other data-intense application is a challenge on its own, since they lack the fast NVMe devices that are present in other instances like the M5d.

In summary, while the A1 is a nice wave to the Arm community, and may allow some interesting use cases, it does little to change the dynamics of the server market.

Arm reaches again: the EC2 M6 instances

This all changed this week when AWS during its annual re:Invent conference announced the availability of their new class of Arm-based servers, the M6g and M6gd instances among others, based on the Graviton2 processor.

We ran the same stress-ng benchmark set as before, but this time comparing the EC2 M5d.metal and EC2 M6g. The results are more inline with what we would expect from running a microbenchmark set against such different architectures: The Arm-based instance performs better, and sometimes much better in some tests, while the x86-based instance performs better in others.

Test EC2 M6g EC2 M5d.metal Difference
cache 218 311 -29.90%
icache 45887 34368 33.52%
matrix 453982 252190 80.02%
cpu 14694 24077 -38.97%
memcpy 134711 111877 20.53%
qsort 943 728 29.53%
dentry 3088242 2770985 11.45%
timer 55515663 15367075 261.26%

Table 2: Result of the stress command: stress-ng --metrics-brief --cache 16 --icache 16 --matrix 16 --cpu 16 --memcpy 16 --qsort 16 --dentry 16 --timer 16 -t 1m

Figure2 : EC2 M6g vs EC2 A1. The M6g class is 5 times faster than A1 for running reads in the Scylla NoSQL database, in the same workload presented in Figure 1.

Figure3: EC2 M6g vs x86-based M5, both of the same size. The performance of the Arm-based server is comparable to the x86 instance. With AWS claiming that prices will be 20% lower than x86, economic forces will push M6g ahead.

Figure4: CPU utilization during the read benchmark, for 14 CPUs. They are all operating at capacity. Shown in the picture is the data for M6g, but all 3 platforms achieve the same thing. Scylla uses two virtual CPUs for interrupt delivery, which are not shown, summing up to 16.

For database workloads the biggest change comes with the announcement of the new M6gd instance family. Just like what you get with the M5 and M5d x86-based families, the M6gd features fast local NVMe to serve demanding data-driven applications.

We took them for a spin as well using IOTune, a utility distributed with Scylla that is used to benchmark the storage system for database tuning once the database is installed.

We compared storage for each of the instances, in both cases using 2 NVMe cards set up in a RAID0 array:


Starting Evaluation. This may take a while...
Measuring sequential write bandwidth: 1517 MB/s
Measuring sequential read bandwidth: 3525 MB/s
Measuring random write IOPS: 381329 IOPS
Measuring random read IOPS: 765004 IOPS


Starting Evaluation. This may take a while...
Measuring sequential write bandwidth: 2027 MB/s
Measuring sequential read bandwidth: 5753 MB/s
Measuring random write IOPS: 393617 IOPS
Measuring random read IOPS: 908742 IOPS

M6gd.metal M5d.metal Difference
Write bandwidth (MB/s) 2027 1517 +33.62%
Read bandwidth (MB/s) 5753 3525 +63.21%
Write IOPS 393617 381329 +3.22%
Read IOPS 908742 765004 +18.79%

Table 3: Result of IOTune utility testing

M6gd NVMe cards are, surprisingly, even faster than the ones provided by the M5d.metal. This is likely in virtue of them being newer, but clearly shows that certainly there are no penalties posed by the new architecture.


Much has been said for years about the rise of Arm-based processors in the server market, but so far we still live in an x86-dominated world. However, key dynamics of the industry are changing: with the rise of cloud-native applications hardware selection is now the domain of the cloud provider, not of the individual organization.

AWS, the biggest of the existing cloud providers released an Arm-based offering in 2018 and now in 2019 catapults that offering to a world-class spot. With results comparable to x86-based instances and AWS’s sure ability to offer a lower price due to well known attributes of the Arm-based servers like power efficiency, we consider the new M6g instances to be a game changer in a red-hot market ripe for change.

Editor’s Note: The microbenchmarks in this article have been updated to reflect the fact that running a single instance of stress-ng would skew the results in favor of the x86 platforms, since in SMT architectures a single thread may not be enough to use all resources available in the physical core. Thanks to our readers for bringing this to our attention.

The post Is Arm ready for server dominance? appeared first on ScyllaDB.

Cassandra Elastic Auto-Scaling Using Instaclustr’s Dynamic Cluster Resizing

This is the third and final part of a mini-series looking at the Instaclustr Provisioning API, including the new Open Service Broker.  In the last blog we demonstrated a complete end to end example using the Instaclustr Provisioning API, which included dynamic Cassandra cluster resizing.  This blog picks up where we left off and explores dynamic resizing in more detail.

  1. Dynamic Resizing

Let’s recap how Instaclustr’s dynamic Cassandra cluster resizing works.  From our documentation:

  1. Cluster health is checked Instaclustr’s monitoring system including synthetic transactions.
  2. The cluster’s schema is checked to ensure it is configured for the required redundancy for the operation.
  3. Cassandra on the node is stopped, and the AWS instance associated with the node is switched to a smaller or larger size, retaining the EBS containing the Cassandra data volume, so no data is lost.
  4. Cassandra on the node is restarted. No restreaming of data is necessary.
  5. Monitor the cluster to wait until all nodes have come up cleanly and have been processing transactions for at least one minute (again, using our synthetic transaction monitoring) and then move on to the next nodes.

Steps 1 and 2 are performed once per cluster resize, steps 3 and 4 are performed for each node, and step 5 is performed per resize operation. Nodes can be resized one at a time, or concurrently, in which case multiple steps 3 and 4 are performed concurrently.  Concurrent resizing allows up to one rack at a time to be replaced for faster overall resizing.

Last blog we ran the Provisioning API demo for a 6 node cluster (3 racks, 2 nodes per rack), which included dynamic cluster resizing one node at a time (concurrency = 1).  Here it is again:

Welcome to the automated Instaclustr Provisioning API Demonstration

We’re going to Create a cluster, check it, add a firewall rule, resize the cluster, then delete it

STEP 1: Create cluster ID = 5501ed73-603d-432e-ad71-96f767fab05d

racks = 3, nodes Per Rack = 2, total nodes = 6

Wait until cluster is running…

progress = 0.0%………………….progress = 16.666666666666664%……progress = 33.33333333333333%…..progress = 50.0%….progress = 66.66666666666666%……..progress = 83.33333333333334%…….progress = 100.0%

Finished cluster creation in time = 708s

STEP 2: Create firewall rule

create Firewall Rule 5501ed73-603d-432e-ad71-96f767fab05d

Finished firewall rule create in time = 1s

STEP 3 (Info): get IP addresses of cluster: 

STEP 4 (Info): Check connecting to cluster…

TESTING Cluster via public IP: Got metadata, cluster name = DemoCluster1

TESTING Cluster via public IP: Connected, got release version = 3.11.4

Cluster check via public IPs = true

STEP 5: Resize cluster…

Resize concurrency = 1

progress = 0.0%………………………………………………………progress = 16.666666666666664%………………………..progress = 33.33333333333333%……………………….progress = 50.0%……………………….progress = 66.66666666666666%………………………………………..progress = 83.33333333333334%………………………….progress = 100.0%

Resized data centre Id = 50e9a356-c3fd-4b8f-89d6-98bd8fb8955c to resizeable-small(r5-xl)

Total resizing time = 2771s

STEP 6: Delete cluster…

Deleting cluster 5501ed73-603d-432e-ad71-96f767fab05d

Delete Cluster result = {“message”:”Cluster has been marked for deletion.”}

*** Instaclustr Provisioning API DEMO completed in 3497s, Goodbye!

This time we’ll run it again with concurrency = 2. Because we have 2 nodes per rack, this will resize all the nodes in each rack concurrently before moving onto the next racks.

Welcome to the automated Instaclustr Provisioning API Demonstration

We’re going to Create a cluster, check it, add a firewall rule, resize the cluster, then delete it

STEP 1: Create cluster ID = 2dd611fe-8c66-4599-a354-1bd2e94549c1

racks = 3, nodes Per Rack = 2, total nodes = 6

Wait until cluster is running…

progress = 0.0%………………..progress = 16.666666666666664%…..progress = 33.33333333333333%…..progress = 50.0%…..progress = 66.66666666666666%…….progress = 83.33333333333334%…….progress = 100.0%

Finished cluster creation in time = 677s

STEP 2: Create firewall rule

create Firewall Rule 2dd611fe-8c66-4599-a354-1bd2e94549c1

Finished firewall rule create in time = 1s

STEP 3 (Info): get IP addresses of cluster: 

STEP 4 (Info): Check connecting to cluster…

TESTING Cluster via public IP: Got metadata, cluster name = DemoCluster1

TESTING Cluster via public IP: Connected, got release version = 3.11.4

Cluster check via public IPs = true

STEP 5: Resize cluster…

Resize concurrency = 2

progress = 0.0%…………………….progress = 16.666666666666664%….progress = 33.33333333333333%………………………progress = 50.0%….progress = 66.66666666666666%……………………..progress = 83.33333333333334%…..progress = 100.0%

Resized data centre Id = ae984ecb-0455-42fc-ab6d-d7eccadc6f94 to resizeable-small(r5-xl)

Total resizing time = 1177s

STEP 6: Delete cluster…

Deleting cluster 2dd611fe-8c66-4599-a354-1bd2e94549c1

Delete Cluster result = {“message”:”Cluster has been marked for deletion.”}

*** Instaclustr Provisioning API DEMO completed in 1872s, Goodbye!


This graph compares these two results, and shows the total time (in minutes) for provisioning and dynamically resizing the cluster.  The provisioning times are similar, and the resizing times are longer. Resizing by rack is 2.35 times faster than resizing by node (19 minutes c.f. 47 minutes).

Dynamic Cluster Resizing - Total times by node vs by rack

This is a graph of the resize time for each unit of resize (nodes, or rack). There are 6 resize operations (by node) and 3 (by rack, 2 nodes at a time). Each rack resize operation is actually slightly shorter than each node resize operation. There is obviously some constant overhead per resize operation, on top of the actual time to resize each node. In particular (Step 5 above) the cluster is monitored for at least a minute after each resizing operation, before moving onto the next operation.

Dynamic Cluster Resizing - Resizing times by node vs by rack

In summary, cluster resizing by rack is significantly faster than by node, as multiple nodes in a rack can be resized concurrently (up to the maximum number of nodes in the rack), and each resize operation has an overhead, so a smaller number of resize operations are more efficient.

  1.  Dynamics of Resizing

What actually happens to the cluster during the resize?  The following graphs visualise the dynamics of the cluster resizing over time, by showing the change to the number of CPU cores for each node in the cluster.   For simplicity, this cluster has 2 racks (represented by blue and green bars) with 3 nodes per rack, so 6 nodes in total. We’ll start with the visually simplest case of resizing by rack. This is the start state. Each node in a rack starts with 2 CPU cores. 

Resize by Rack - Cores Per Node

At the start of the first rack resize all the nodes in the first rack (blue) are reduced to 0 cores.

Dynamics of Resizing - Cores Per Node

By the end of the first resize, all of the 2 core nodes in the first rack have been resized to 4 core nodes (blue).

Second Rack - Resize by Rack

The same process is repeated for the second rack (green).

Second Rack - Resize by Rack, 4 cores per node

So that we end up with a cluster than has 4 cores per node.

Second Rack - Resize by Rack, resize 2 end

Next we’ll show what happens doing resize by node. It’s basically the same process, but as only one node at a time is resized there are more operations (we only show the 1st and the 6th).  The initial state is the same, with all nodes having 2 cores per node.

Resize by Node - Start

At the start of the first resize, one node is reduced to 0 cores. 

Resize by Node - Resize 1 Start

And at the end of the operation, this node has been resized to 4 cores.

Resize by Node - Resize 1 end

This process is repeated (not shown) for the remaining cores, until we start the 6th and last resize operation:

Resize by Node - Resize 6 start

Again resulting in the cluster being resized to 4 cores per node.

Resize by Node - Resize 6 end

These graphs show that during the initial resize operation there is a temporary reduction in total cluster capacity. The reduction is more substantial when resizing by rack, as a complete rack is unavailable until it’s resized (3/6 nodes = 50% initial capacity for this example), than if resizing by node, as only 1 node at a time is unavailable (⅚ nodes = 83% initial capacity for this example). 

  1. Resizing Modelling

In order to give more insight into the dynamics of resizing we built a simple time vs. throughput model for dynamic resizing for Cassandra clusters. It shows the changing capacity of the cluster as nodes are resized up from 2 to 4 core nodes. It can be used to better understand the differences between dynamic resizing by node or by rack. We assume a cluster with 3 racks, 2 nodes per rack, and 6 nodes in total.

First, let’s look at resizing one node at a time. This graph shows time versus cluster capacity. We assume 2 units of time per node resize (to capture the start, reduced capacity, and end, increased capacity – states of each operation as in the above graphs). The initial cluster capacity is 100% (dotted red line), and the final capacity will be 200% (dotted green line). As resizing is by node, initially 1 node is turned off, and eventually replaced by a new node with double the number of cores. However, during the 1st resize, only 5 nodes are available, so the capacity of the cluster is temporarily reduced to ⅚th of the original throughput (83%, the dotted orange line). After the node has been replaced the theoretical total capacity of the cluster has increased to 116% of the original capacity. This process continues until all the nodes have been resized (note that for simplicity we’ve approximated the cluster capacity as a sine wave, in reality it’s an asymmetrical square wave).

Resizing by Node

The next graph shows resizing by rack. For each resize, 2 nodes are replaced concurrently. During the 1st rack resize the capacity of the cluster is reduced more than for the single node resize, to 4/6th of the original throughput (67%). After two nodes have been replaced the theoretical capacity of the cluster has increased to 133% of the original capacity. So, for rack resizing the final theoretical maximum capacity is still double the original, but it happens faster, and we lose more (⅓) of the starting capacity during the 1st rack resize.

Resize by Rack - 6 nodes, 3 racks

The model has 2 practical implications (irrespective of what type of resizing is used). First, the maximum practical capacity of the cluster during the initial resize operation will be less than 100% of the initial cluster capacity. Exceeding this load during the initial resize operation will overload the cluster and increase latencies. Second, even though the theoretical capacity of the cluster increases as more resize operations are completed, the practical maximum capacity of the cluster will be somewhat less. This is because Cassandra load balancing assumes equal sized nodes, and is therefore unable to perfectly load balance requests across the heterogeneous node sizes. As more nodes are resized there is an increased chance of a request hitting one of the resized nodes, but the useable capacity is still impacted by the last remaining original small node (resulting in increased latencies for requests directed to it). Eventually when all the nodes have been resized the load can safely be increased as load balancing will be back to normal due to homogeneous node sizes.

The difference between resizing by node and by rack is clearer in the following combined graph (resizing by node, blue, by rack, orange):

Resizing by node cf by rack

In summary, it shows that resizing by rack (orange) is faster but has a lower maximum capacity during the first resize operation.  This graph summaries these two significant differences (practical maximum capacity during first resize operation, and relative time to resize the cluster).

Practical max capacity and resize time

What happens if you have different sized clusters (number of nodes)? This graph shows that the maximum useable capacity during resizing by node, ranges from 0.5 to 0.93 of the original capacity with increasing numbers of nodes (from 2 to 15). Obviously the more nodes the cluster has the less impact there is if only one node is resized at a time.

Max Capacity during resizing (by node)

The problem with resizing by node is that the total cluster resizing time is linearly increasing, so prohibitively high for larger clusters.  

Linear increase in cluster resizing time

  1. Auto Scaling Cassandra Elasticity

Another use case for the Instaclustr Provisioning API is to use it for automating Cassandra Elasticity, by initiating dynamic resizing on demand.   A common use case for dynamic scaling is to increase capacity well in advance for predictable loads such as weekend batch jobs or peak shopping seasons. Another less common use case is when the workload unexpectedly increases and the resizing has to be more dynamic. However, to do this you need to know when to trigger a resize, and what type of resize to trigger (by node or rack). Another variable is how much to resize by, but for simplicity we’ll assume we only resize from one instance size up to the next size (the analysis can easily be generalised).

We assume that the Instaclustr monitoring API is used to monitor Cassandra cluster load and utilisation metrics. A very basic auto scaling mechanism could trigger a resize based on exceeding a simple threshold cluster utilization. The threshold would need to be set low enough so that the load doesn’t exceed the reduced load due to the initial resize operation. This obviously has to be lower for resize by rack than by node, which could result in the cluster having lower utilisation than economical, and “over eager” resizing.   A simple threshold trigger also doesn’t help you decide which type of resize to perform (although a simple rule of thumb may be sufficient to decide, e.g. if you have more than x nodes in the cluster then always resize by rack, else by node).

Is there a more sophisticated way of deciding when and what type of resize to trigger? Linear regression is a simple way of using past data to predict future trends, and has good support in the Apache Maths library.  Linear regression could therefore be run over the cluster utilisation metrics to predict how fast the load is changing.

This graph shows the maximum capacity of the cluster initially (100%). The utilisation of the cluster is measured for 60 minutes, and the load appears to be increasing. A linear regression line is computed to predict the expected load in the future (dotted line). The graph shows that we predict the cluster will reach 100% capacity around the 280 minute mark (220 minutes in the future). 

Utilisation and Regression

What can we do to preempt the cluster being overloaded? We can initiate a dynamic resize sufficiently ahead of the predicted time of overload to preemptively increase the cluster capacity.  The following graphs show auto scaling (up) using the Instaclustr Provisioning API to Dynamically Resize Cassandra clusters from one r5 instance size to the next size up in the same family. The capacity of the cluster is 100% initially, and is assumed to be double this (200%) once the resize is completed. The resize times are based on the examples above (but note that there may be some variation in practice), for a 6 node cluster with 3 racks. 

The first graph shows resize by node (orange). As we discovered above, the effective maximum capacity during a resize by node is limited to 83% of the initial capacity, and the total time to resize was measured at 47 minutes. We therefore need to initiate a cluster resize by node (concurrency=1) at least 47 minutes prior (at the 155 minute mark) to the time that the regression predicts we’ll hit a utilisation of 83% (200m mark), as shown in this graph:

Resize by Node

The second graph shows resizing by rack. As we discovered above, the maximum capacity during a resize by rack is limited to 67% of the initial capacity, and the total time to resize was measured at 20 minutes. We therefore need to initiate a cluster resize by rack at least 20 minutes prior (at the 105 minute mark) to the time that the regression predicts we’ll hit a utilisation of 67% (125m mark), as shown in this graph:

Resize by Rack

This graph shows the comparison, and reveals that resize by rack (blue) must be initiated sooner (but completes faster) than resize by node (orange).

Resize by Node vs by Rack

How well does this work in practice? Well, it depends on how accurate the linear regression is. In some cases the load may increase faster than predicted, in other cases slower than predicted, or may even drop off. For our randomly generated workload it turns out that the utilization actually increases faster than predicted (green), and both resizing approaches were initiated too late, as the actual load had increased to higher than safe load for either resize types.

Resize by Node vs by rack 2:2

What could we do better? Resizing earlier is obviously advantageous. A better way of using linear regression is to compute confidence intervals (and is included in Apache Maths). The following graph shows a 90% upper confidence interval (red dotted line). Using this line for the predictions to trigger resizes results in earlier cluster resizing (at the 60 and 90 minute marks. The resize by rack (blue) is now triggered immediately after the 60 minutes of measured data is evaluated), resulting in the resizing operations being completed by the time the load becomes critical as shown in this graph:

Resize by node vs by rack - Triggered with upper confidence interval

The potential downside of using a 90% confidence interval to trigger the resize is that it’s possible that the actual utilisation never reaches the predicted value. For example, here’s a different run of randomly generated load showing a significantly lower actual utilisation (green). In this case it’s apparent that we’ve resized the cluster far too early, and the cluster may even need to be resized down again depending on the eventual trend. 

Resize by node vs by rack - Triggered with upper confidence interval 2:2

The risk of resizing unnecessarily can be reduced by ensuring that the regressions are regularly recomputed (e.g. every 5 minutes), and resizing requests are only issued just in time. For example, in this case the resize by node would not have been requested if the regression was recomputed at the 90 minute mark, using only the previous 60 minutes of data, and extrapolated 60 minutes ahead, as the upper confidence interval is only 75%, which is less than the 83% trigger threshold for resizing by node.

  1. Resizing Rules – by node or rack

As a thought experiment (not tested yet) I’ve developed some pseudo code rules for deciding when and what type of upsize to initiate, assuming the variables are updated every 5 minutes, and correct values have been set for a 6 node (3 rack, 2 nodes per rack) cluster.

now = current time

load = current utilisation (last 5m)

rack_threshold = 67% (for 3 racks), in general = 100 – ((1/racks)*100)

node_threshold = 83% (for 6 nodes), in general = 100 – ((1/nodes)*100)

rack_resize_time = 25m (rounded up)

node_resize_time = 50m (rounded up)

predict_time_rack_threshold (time when rack_threshold utilisation is predicted to be reached)

predict_time_node_threshold (time when node_threshold utilisation is predicted to be reached)

predicted_load_increasing = true if regression positive else negative

sufficient_time_for_rack_resize = (predict_time_rack_threshold – rack_resize_time) > now

sufficient_time_for_node_resize = (predict_time_node_threshold – node_resize_time) > now

trigger_rack_resize_now = (predict_time_rack_threshold – rack_resize_time) <= now

trigger_node_resize_now = (predict_time_node_threshold – node_resize_time) <= now


1 Every 5 minutes recompute linear regression function (upper 90% confidence interval) from past 60 minutes of cluster utilisation metrics and update the above variables.


2 IF predict_load_increasing THEN

IF load < rack_threshold AND

NOT sufficient_time_for_node_resize AND

trigger_rack_resize_now THEN

TRIGGER RACK RESIZE to next size up (if available)

WAIT for resize to complete

ELSE IF load < node_threshold AND

trigger_node_resize_now THEN

TRIGGER NODE RESIZE to next size up (if available)

WAIT for resize to complete

How does the rule work? The general idea is that if the load is increasing we may need to upsize the cluster, but the type of resize (by rack or by node) depends on the current load and the time remaining to safely do the different types of resizes. We also want to leave a resize as late as possible in case the regression prediction changes signalling that we don’t really need to resize yet. 

If the current load is under the resize by rack threshold (67%), then both options are still possible, but we only trigger the rack resize if there is no longer sufficient time to do a node resize, and it’s the latest possible time to trigger it.

However, if the load has already exceeded the safe rack resize threshold and it’s the latest possible time to trigger a node resize, then trigger a node resize. 

If the load has already exceeded the safe node threshold then it’s too late to resize and an alert should be issued.  An alert should also be issued if the cluster is already using the largest resizeable instances available as dynamic resizing is not possible.

Also note that the rack and node thresholds above are the theoretical best case values, and should be reduced for a production cluster with strict SLA requirements to allow for more headroom during resizing, say by 10% (or more). Allowing an extra 10% headroom reduces the thresholds to 60% (by rack) and 75% (by node) for the 6 node, 3 rack cluster example.

If you resize a cluster up you should also implement a process for “downsizing” (resizing down). 

This works in a similar way to the above, but because downsizing (eventually) halves the capacity, you can be more conservative about when to downsize. You should trigger it when the load is trending down, and when the capacity is well under 50% so that an “upsize” isn’t immediately triggered after the downsize (e.g. triggering at 40% will mean that the downsized clusters are running at 80% at the end of the downsize, assuming the load doesn’t drop significantly during the downsize operation). Note that you also need to wait until the metrics for the resized cluster have stabilised otherwise the upsizing rules will be triggered (as it will incorrectly look like the load has increased).

After an upsizing has completed you should also wait for close to the billing period for the instances (e.g. an hour) before downsizing, as you have to pay for them for an hour anyway so there’s no point in downsizing until you’ve got value for money out of them, and this gives a good safety period for the metrics to stabilise with the new instance sizes.

Simple downsizing rules (not using any regression or taking into account resize methods) would look like this:

3 IF more than an hour size last resize up complete AND

   NOT predict_load_increasing THEN

IF load < 40% THEN

TRIGGER RACK RESIZE to next size down (if available)

WAIT for resize to complete + some safety margin

We’ve also assumed a simple resize step from one size to the next size up. The algorithm can be easily adapted to resize from/to any size instances in the same family, which would be useful if the load is increasing more rapidly and is predicted to exceed the capacity of a single jump in instances sizes in the available time (but in practice you should do some benchmarking to ensure that you know the actual capacity improvement for different instance sizes).  

  1. Resizing Rules – any concurrency

We above analysis assumes you can only size by node or rack. In practice, concurrency can be any integer value between 1 and the number of nodes per rack, so you can resize by part rack concurrently. In practice, resizing by part rack may be the best choice as it can mitigate against losing quorum and is reasonable fast (if a “live” node fails during resizing you can lose quorum, the more nodes resized at once, the greater the risk). These rules can therefore be generalised as follows, with resize by node and rack now just edge cases. We keep track of the Concurrency which starts from an initial value equal to the number of nodes per rack (resize by “rack”) and eventually decreases to 1 (resize by “node”). We also now have a couple of functions that compute values for different values of concurrency, basically the current concurrency and the next concurrency (concurrency – 1).  Depending on the number of nodes per rack, some values of concurrency will have different threshold values but the same resize time (if the number of nodes isn’t evenly divisible by concurrency). 

now = current time

load = current utilisation (last 5m)

racks = number of racks (e.g. 3)

nodes_per_rack = number of nodes per rack (e.g. 10)

total_nodes = racks * nodes_per_rack

concurrency = nodes_per_rack

threshold(C) = 100 – ((C/total_nodes) * 100)

resize_operations(C) = roundup(nodes_per_rack/C) * racks

resize _time(C) = resize_operations(C) * average time per resize operation (measured)

predict_time_threshold(C) = time when threshold(C) utilisation is predicted to be reached

predicted_load_increasing = true if regression positive else negative

sufficient_time_for_resize(C) = (predict_time_threshold(C) – resize_time(C)) > now

trigger_resize_now(C) = (predict_time_threshold(C) – resize_time(C)) <= now


1 Every 5 minutes recompute linear regression function (upper 90% confidence interval) from past 60 minutes of cluster utilisation metrics and update the above variables.


2 IF predict_load_increasing THEN

IF load < threshold(concurrency) AND

(concurrency == 1


resize_time(concurrency) == resize_time(concurrency – 1) 


 NOT sufficient_time_for_resize(concurrency – 1))  


Trigger_resize_now(concurrency) THEN

TRIGGER RESIZE(concurrency) to next size up (if available)

WAIT for resize to complete

ELSE IF concurrency == 1

exception(“load has exceeded node size threshold and insufficient time to resize”)

ELSE IF load < Threshold(Concurrency – 1) 

Concurrency = Concurrency – 1

This graph shows the threshold (U%) and cluster resize time for an example 30 node cluster, with 3 racks and 10 nodes per rack and a time for each resize operation of 3 minutes (faster than reality, but easier to graph). Starting with concurrency = 10 (resize by “rack’), the rules will trigger a resize if the current load is increasing, under the threshold (67%, in the blue region), there is just sufficient predicted time for the resize to take place, and the next concurrency (9) resize time is the same or there is insufficient time for the next concurrency (9) resize. However, if the load has already exceeded the threshold then the next concurrency level (9) is selected and the rules are rerun 5 minutes later.  The graph illustrates the fundamental tradeoff with the dynamic cluster resizing, which is that smaller concurrencies enable resizing with higher starting loads (blue), with the downside that resizes take longer (orange). 

Resize Concurrency vs. Threshold and Cluster resize time

These (or similar) rules may or may not be a good fit for specific use case. For example, they are designed to be conservative to prevent over eager resizing by waiting as late as possible to initiate a resize at the current concurrency level. This may mean that due to load increases you miss the opportunity to do a resize at higher concurrency level, so have to wait longer for a resize at a lower concurrency level. Instead you could choose to trigger a resize at the highest possible concurrency as soon as possible, by simply removing the check for there being insufficient time to resize at the next concurrency level.

Finally, these or similar rules could be implemented with monitoring data from the Instaclustr Cassandra Prometheus API, using Prometheus PromQL linear regression and rules to trigger Instaclustr provisioning API dynamic resizing.

The post Cassandra Elastic Auto-Scaling Using Instaclustr’s Dynamic Cluster Resizing appeared first on Instaclustr.

ApacheCon Berlin, 22-24 October 2019

ApacheCon Europe, October 22-24, 2019, Kulturbrauerei Berlin #ACEU19

What’s better than one ApacheCon? Another ApacheCon! This year there were two Apache Conferences, one in Las Vegas and then again in Berlin.

They were similar but different. What were some differences between ApacheCon Berlin and Las Vegas? The location. In contrast to the hyper-real gambling oasis in a bone dry desert of Las Vegas, Berlin is a real capital city steeped in history. The venue was a historic brewery (once the largest in the world, but unfortunately no longer brewing so also “dry”, the Kulturbrauerei):

ApacheCon Berlin 2019 - Kulturbrauerei

It has multiple night clubs, and the concrete-bunker like main auditorium of the Kesselhaus (boiler house) is the perfect venue for a heavy metal concert (and the conference keynotes).  You also can’t escape the history, as next door to the conference there was a museum of everyday life in East Germany before the wall came down 30 years ago.  The East Germans were often creative to cope with the restrictions and shortages of life under Communism, and came up with innovations such as this “camping car”!

ApacheCon - Camping Car

The Berlin ApacheCon was also smaller, but in a more compact location and with less tracks (General, Community, Machine Learning, IoT, Big Data), so on average the talks had more buzz than Las Vegas, with an environment more conducive to catching up with people more than once for ongoing conservations afterwards. There was also an Open Source Design workshop focussing on Usability). I’d met some of the people involved in this at the speakers reception and had a lively dinner conversation (perhaps because I was the only non-designer at the table so asked lots of silly questions). It’s good to see Open Source UX design getting the attention it deserves, as once upon a time “Open Source” was synonymous with “Badly Designed’!

This “State of the feather” talk by David Nalley (Executive Vice President, ASF), espoused the Apache Way resulting in vendor neutrality, independence, trust and safety for contributors and users (and the photo reveals something of the industrial ambience of the boiler house):

“State of the feather” talk by David Nalley

Instaclustr was proud to be one of the ApacheCon EU sponsors:

ApacheCon EU Sponsors 2019

I had the privilege of kicking off the Machine Learning track held in the (appropriately named) “Maschinenhaus” with my talk “Kafka, Cassandra and Kubernetes at Scale – Real-time Anomaly detection on 19 billion events a day”. 

ApacheCon Berlin 2019 - Paul Brebner's talk on Building a Scalable Streaming Iot Application using Apache Kafka

My second talk of the day was in the more intimate venue, the “Frannz Salon”, in the IoT track, on “Kongo: Building a Scalable Streaming IoT Application using Apache Kafka”.

I managed to attend some talks by other speakers at ApacheCon Berlin. These are some of the highlights.

This IoT talk intersected with mine in terms of the problem domain (real-time RFID asset tracking), but provided a good explanation of Apache Flink, including Flink pattern matching which is a powerful CEP library: “7 Reasons to use Apache Flink for your IoT Project – How We Built a Real-time Asset Tracking System”.

In the Big Data track, there was a talk on the hot topic of how to use Kubernetes to run Apache software: Patterns and Anti-Patterns of running Apache bigdata projects in Kubernetes.

And finally a talk about an impressive Use Case for Apache Kafka, monitoring the largest machine in the world, the Large Hadron Collider at CERN: “Open Source Big Data Tools accelerating physics research at CERN”. Monitoring data had to be collected from 100s of accelerators and detector controllers, experiment data catalogues, data centres and system logs. Kafka is used as the buffered transport to collect and direct monitoring data, Spark is used to perform intermediate processing including data enrichment and aggregation, storage is provided by HDFS, ElasticSearch etc, and data analysis by Kibana, Grafana and Zeppelin. This pipeline handles a peak of 500GB of monitoring data a day (with capacity for more). Another innovation is a system called SWAN, to provide ephemeral Spark clusters on Kubernetes, for user managed Spark resources (provision, control, use and dispose).  A similar scalable pipeline enabled by fully managed Kafka and ElasticSearch is available from Instaclustr.

Signboard - East and West Germany

Berlin was a thought-provoking historical location for ApacheCon Europe. For approaching 30 years (1961-1989) the wall had divided East and West Germany, with incompatible political, economic and social systems pitted in a stand-off with only metres separating them, and only a few tightly controlled points of access bridging them. However, 30 years ago the wall came down and Berlin was reunified resulting in rapid social and economic changes. In a similar way, over the last 20 years, Apache has shifted the software world from Licences and Lock-in to Free and Open Source, and created a vibrant ecosystem of projects, people, software and software. I look forward to repeating the experience next year.

After ApacheCon I had a chance to explore a different sort of history. The Berlin technology museum had an impressive display of old computers, including a reconstruction of the 1st programmable (using punched movie film) digital (but mechanical) computer built in 1938 by Konrad Zuse, the Z1.  Perhaps more significantly, in the 1940’s Zuse also designed the first high level programming language, Plankalkül, evidently conceptually decades ahead of other high level languages (he included array and parallel processing, and goal-directed execution), and it probably influenced Algol.

Konrad Zuse and the Z1 computer in the Berlin technology museum

Konrad Zuse and the Z1 computer in the Berlin technology museum

Finally, the friendly Instaclustr team at our stand at ApacheCon Berlin.  If you didn’t have a chance to talk to us in person at ApacheCon, contact us at

Instaclustr at ApacheCon Berlin 2019

The post ApacheCon Berlin, 22-24 October 2019 appeared first on Instaclustr.

Medusa - Spotify’s Apache Cassandra backup tool is now open source

Spotify and The Last Pickle (TLP) have collaborated over the past year to build Medusa, a backup and restore system for Apache Cassandra which is now fully open sourced under the Apache License 2.0.

Challenges Backing Up Cassandra

Backing up Apache Cassandra databases is hard, not complicated. You can take manual snapshots using nodetool and move them off the node to another location. There are existing open source tools such as tablesnap, or the manual processes discussed in the previous TLP blog post “Cassandra Backup and Restore - Backup in AWS using EBS Volumes”. However they all tend to lack some features needed in production, particularly when it comes to restoring data - which is the ultimate test of a backup solution.

Providing disaster recovery for Cassandra has some interesting challenges and opportunities:

  • The data in each SSTable is immutable allowing for efficient differential backups that only copy changes since the last backup.
  • Each SSTable contains data that node is responsible for, and the restore process must make sure it is placed on a node that is also responsible for the data. Otherwise it may be unreachable by clients.
  • Restoring to different cluster configurations, changes in the number of nodes or their tokens, requires that data be re-distributed into the new topology following Cassandra’s rules.

Introducing Medusa

Medusa is a command line backup and restore tool that understands how Cassandra works.
The project was initially created by Spotify to replace their legacy backup system. TLP was hired shortly after to take over development, make it production ready and open source it.
It has been used on small and large clusters and provides most of the features needed by an operations team.

Medusa supports:

  1. Backup a single node.
  2. Restore a single node.
  3. Restore a whole cluster.
  4. Selective restore of keyspaces and tables.
  5. Support for single token and vnodes clusters.
  6. Purging the backup set of old data.
  7. Full or incremental backup modes.
  8. Automated verification of restored data.

The command line tool that uses Python version 3.6 and needs to be installed on all the nodes you want to back up. It supports all versions of Cassandra after 2.1.0 and thanks to the Apache libcloud project can store backups in a number of platforms including:

Backup A Single Node With Medusa

Once medusa is installed and configured a node can be backed up with a single, simple command:

medusa backup --backup-name=<backup name>

When executed like this Medusa will:

  1. Create a snapshot using the Cassandra nodetool command.
  2. Upload the snapshot to your configured storage provider.
  3. Clear the snapshot from the local node.

Along with the SSTables, Medusa will store three meta files for each backup:

  • The complete CQL schema.
  • The token map, a list of nodes and their token ownership.
  • The manifest, a list of backed up files with their md5 hash.

Full And Differential Backups

All Medusa backups only copy new SSTables from the nodes, reducing the network traffic needed. It then has two ways of managing the files in the backup catalog that we call Full or Differential backups. For Differential backups only references to SSTables are kept by each new backup, so that only a single instance of each SStable exists no matter how many backups it is in. Differential backups are the default and in operations at Spotify reduced the backup size for some clusters by up to 80%.

Full backups create a complete copy of all SSTables on the node each time they run. Files that have not changed since the last backup will be copied in the backup catalog into the new backup (and not copied off the node). In contrast to the differential method which only creates a reference to files. Full backups are useful when you need to take a complete copy and have all the files in a single location.

Cassandra Medusa Full Backups

Differential backups take advantage of the immutable SSTables created by the Log Structured Merge Tree storage engine used by Cassanda. In this mode Medusa checks if the SSTable has previously being backed up, and only copies the new files (just like always). However all SSTables for the node are then stored in a single common folder, and the backup manifest contains only metadata files and references to the SSTables.

Cassandra Medusa Full Backups

Backup A Cluster With Medusa

Medusa currently lacks an orchestration layer to run a backup on all nodes for you. In practice we have been using crontab to do cluster wide backups. While we consider the best way to automate this (and ask for suggestions) we recommend using techniques such as:

  • Scheduled via crontab on each node.
  • Manually on all nodes using pssh.
  • Scripted using cstar.

Listing Backups

All backups with the same “backup name” are considered part of the same backup for a cluster. Medusa can provide a list of all the backups for a cluster, when they started and finished, and if all the nodes have completed the backup.

To list all existing backups for a cluster, run the following command on one of the nodes:

$ medusa list-backups
2019080507 (started: 2019-08-05 07:07:03, finished: 2019-08-05 08:01:04)
2019080607 (started: 2019-08-06 07:07:04, finished: 2019-08-06 07:59:08)
2019080707 (started: 2019-08-07 07:07:04, finished: 2019-08-07 07:59:55)
2019080807 (started: 2019-08-08 07:07:03, finished: 2019-08-08 07:59:22)
2019080907 (started: 2019-08-09 07:07:04, finished: 2019-08-09 08:00:14)
2019081007 (started: 2019-08-10 07:07:04, finished: 2019-08-10 08:02:41)
2019081107 (started: 2019-08-11 07:07:04, finished: 2019-08-11 08:03:48)
2019081207 (started: 2019-08-12 07:07:04, finished: 2019-08-12 07:59:59)
2019081307 (started: 2019-08-13 07:07:03, finished: Incomplete [179 of 180 nodes])
2019081407 (started: 2019-08-14 07:07:04, finished: 2019-08-14 07:56:44)
2019081507 (started: 2019-08-15 07:07:03, finished: 2019-08-15 07:50:24)

In the example above the backup called “2019081307” is marked as incomplete because 1 of the 180 nodes failed to complete a backup with that name.

It is also possible to verify that all expected files are present for a backup, and their content matches hashes generated at the time of the backup. All these operations and more are detailed in the Medusa README file.

Restoring Backups

While orchestration is lacking for backups, Medusa coordinates restoring a whole cluster so you only need to run one command. The process connects to nodes via SSH, starting and stopping Cassandra as needed, until the cluster is ready for you to use. The restore process handles three different use cases.

  1. Restore to the same cluster.
  2. Restore to a different cluster with the same number of nodes.
  3. Restore to a different cluster with a different number of nodes.

Case #1 - Restore To The Same Cluster

This is the simplest case: restoring a backup to the same cluster. The topology of the cluster has not changed and all the nodes that were present at the time the backup was created are still running in the cluster.

Cassandra Medusa Full Backups

Use the following command to run an in-place restore:

$ medusa restore-cluster --backup-name=<name of the backup> \

The seed target node will be used as a contact point to discover the other nodes in the cluster. Medusa will discover the number of nodes and token assignments in the cluster and check that it matches the topology of the source cluster.

To complete this restore each node will:

  1. Download the backup data into the /tmp directory.
  2. Stop Cassandra.
  3. Delete the commit log, saved caches and data directory including system keyspaces.
  4. Move the downloaded SSTables into to the data directory.
  5. Start Cassandra.

The schema does not need to be recreated as it is contained in the system keyspace, and copied from the backup.

Case #2 - Restore To A Different Cluster With Same Number Of Nodes

Restoring to a different cluster with the same number of nodes is a little harder because:

  • The destination cluster may have a different name, which is stored in system.local table.
  • The nodes may have different names.
  • The nodes may have different token assignments.

Cassandra Medusa Full Backups

Use the following command to run a remote restore:

$ medusa restore-cluster --backup-name=<name of the backup> \
                         --host-list <mapping file>

The host-list parameter tells Medusa how to map from the original backup nodes to the destination nodes in the new cluster, which is assumed to be a working Cassandra cluster. The mapping file must be a Command Separated File (without a heading row) with the following columns:

  1. is_seed: True or False indicating if the destination node is a seed node. So we can restore and start the seed nodes first.
  2. target_node: Host name of a node in the target cluster.
  3. source_node: Host name of a source node to copy the backup data from.

For example:


In addition to the steps listed for Case 1 above, when performing a backup to a remote cluster the following steps are taken:

  1. The system.local and system.peers tables are not modified to preserve the cluster name and prevent the target cluster from connecting to the source cluster.
  2. The system_auth keyspace is restored from the backup, unless the --keep-auth flag is passed to the restore command.
  3. Token ownership is updated on the target nodes to match the source nodes by passing the -Dcassandra.initial_token JVM parameter when the node is restarted. Which causes ownership to be updated in the local system keyspace.

Case #3 - Restore To A Different Cluster With A Different Number Of Nodes

Restoring to a different cluster with a different number of nodes is the hardest case to deal with because:

  • The destination cluster may have a different name, which is stored in system.local table.
  • The nodes may have different names.
  • The nodes may have different token assignments.
  • Token ranges can never be the same as there is a different number of nodes.

The last point is the crux of the matter. We cannot get the same token assignments because we have a different number of nodes, and the tokens are assigned to evenly distribute the data between nodes. However the SSTables we have backed up contain data aligned to the token ranges defined in the source cluster. The restore process must ensure the data is placed on the nodes which are replicas according to the new token assignments, or data will appear to have been lost.

To support restoring data into a different topology Medusa uses the sstableloader tool from the Cassandra code base. While slower than copying the files from the backup the sstableloader is able to “repair” data into the destination cluster. It does this by reading the token assignments and streaming the parts of the SSTable that match the new tokens ranges to all the replicas in the cluster.

Cassandra Medusa Full Backups

Use the following command to run a restore to a cluster with a different topology :

$ medusa restore-cluster --backup-name=<name of the backup> \

Restoring data using this technique has some drawbacks:

  1. The restore will take significantly longer.
  2. The amount of data loaded into the cluster will be the size of the backup set multiplied by the Replication Factor. For example, a backup of a cluster with Replication Factor 3 will have 9 copies of the data loaded into it. The extra replicas will be removed by compaction however the total on disk load during the restore process will be higher than what it will be at the end of the restore. See below for a further discussion.
  3. The current schema in the cluster will be dropped and a new one created using the schema from the backup. By default Cassandra will take a snapshot when the schema is dropped, a feature controlled by the auto_snapshot configuration setting, which will not be cleared up by Medusa or Cassandra. If there is an existing schema with data it will take extra disk space. This is a sane safety precaution, and a simple work around is to manually ensure the destination cluster does not have any data in it.

A few extra words on the amplification of data when restoring using sstableloader. The backup has the replicated data and lets say we have a Replication Factor of 3, roughly speaking there are 3 copies of each partition. Those copies are spread around the SSTables we collected from each node. As we process each SSTable the sstableloader repairs the data back into the cluster, sending it to the 3 new replicas. So the backup contains 3 copies, we process each copy, and we send each copy to the 3 new replicas, which means in this case:

  • The restore sends nine copies of data to the cluster.
  • Each node gets three copies of data rather than one.

The following sequence of operations will happen when running this type of restore:

  1. Drop the schema objects and re-create them (once for the whole cluster)
  2. Download the backup data into the /tmp directory
  3. Run the sstableloader for each of the tables in the backup

Available Now On GitHub

Medusa is now available on GitHub and is available through PyPi. With this blog post and the readme file in the repository you should be able to take a backup within minutes of getting started. As always if you have any problems create an issue in the GitHub project to get some help. It has been in use for several months at Spotify, storing petabytes of backups in Google Cloud Storage (GCS), and we thank Spotify for donating the software to the community to allow others to have the same confidence that their data is safely backed up.

One last thing, contributions are happily accepted especially to add support for new object storage providers.

Also, we’re looking to hire a Cassandra expert in America.

TLP Is Hiring Another Cassandra Expert

TLP is looking to hire someone in America to work with us on tough problems for some of the biggest and smallest companies in the world. We help our customers get the best out of Apache Cassandra, and we have a really simple approach: be smart, be nice, and try to make it fun. Our customers range from large corporates to well-known internet companies to small startups. They call on TLP to fix and improve the Cassandra clusters they use to deliver products and services to millions of users, and sometimes to hundreds of millions. We’ve been doing this since 2011 so if you have used the internet there is a good chance TLP has worked with at least one company you have visited.

You will be part of the globally-distributed Consulting team. In the team you will use your expert Cassandra experience and knowledge to work directly with our customers to solve their problems, improve their processes, and build their Cassandra skills. Your teammates will rely on you to review their work, question their assumptions, and help them when they get stuck; as you will of them. This is the only way we know works when the team is asked to solve problems no one else can. When given time you will enjoy participating in the Apache Cassandra community and contributing to our open source projects which have wide usage. In short you will be smart, like working with people of all skill levels, and get a buzz out of connecting with other people and helping them.

About The Role

  • You will be part of the Consulting team distributed in 7 countries in Asia-Pacific, America, and Europe with diverse backgrounds.
  • You will get to work with large and small companies from around the world and together we will find the best ways to help their teams be successful.
  • You will occasionally be required to work outside of normal business hours with customers or your team members. We provide limited on-call support during business hours, and occasional extended support for our customers.
  • You will understand the customer comes first, and will be able to get a quick response to them if they need one. Then you will follow-up with the full TLP knowledge dump :)
  • We provide our customers expert advice including troubleshooting, performance tuning, scaling and upgrading, data modelling, monitoring, automation, hardware selection, and training. With the aim of delivering documentation and video so that if the event happens again they are able to solve it by themselves. We work in a highly-collaborative way even though we are widely distributed. Most things we do for customers are checked by a team member, and we are comfortable admitting we are wrong in front of each other and our customers.

About You

  • You will have 3+ years experience using, operating, and working on Cassandra or be able to convince us you know enough already.
  • You will be fluent with Java, and ideally at least comfortable with Python or Bash (or a similar scripting language).
  • You will be able to demonstrate contributions to open source projects and be familiar with the workflow.
  • You will be comfortable speaking to an audience and ideally have previously spoken about Cassandra at an event. At TLP you will be able to speak about complex ideas to the rest of the team, our customers, or a full room at a conference.
  • You will be able to communicate complex ideas in writing or interpretive dance, and have previous examples of either. At TLP you will be able to write effective ticket updates or run books for customers, blog posts for TLP, or participate in a weekend dance workshop on the frailty of the human condition and consistency in partition tolerant distributed databases.

About TLP

The Last Pickle has been helping customers get the most out of Apache Cassandra since March 2011. We are a successful, self-funded startup with a great customer list, and a great team. We get to spend time on open source tools and researching how best to use Cassandra, as we strive to be a research driven consultancy.

If this sounds like the right job for you let us know by emailing

Open Sourcing Mantis: A Platform For Building Cost-Effective, Realtime, Operations-Focused…

Open Sourcing Mantis: A Platform For Building Cost-Effective, Realtime, Operations-Focused Applications

By Cody Rioux, Daniel Jacobson, Jeff Chao, Neeraj Joshi, Nick Mahilani, Piyush Goyal, Prashanth Ramdas, Zhenzhong Xu

Today we’re excited to announce that we’re open sourcing Mantis, a platform that helps Netflix engineers better understand the behavior of their applications to ensure the highest quality experience for our members. We believe the challenges we face here at Netflix are not necessarily unique to Netflix which is why we’re sharing it with the broader community.

As a streaming microservices ecosystem, the Mantis platform provides engineers with capabilities to minimize the costs of observing and operating complex distributed systems without compromising on operational insights. Engineers have built cost-efficient applications on top of Mantis to quickly identify issues, trigger alerts, and apply remediations to minimize or completely avoid downtime to the Netflix service. Where other systems may take over ten minutes to process metrics accurately, Mantis reduces that from tens of minutes down to seconds, effectively reducing our Mean-Time-To-Detect. This is crucial because any amount of downtime is brutal and comes with an incredibly high impact to our members — every second counts during an outage.

As the company continues to grow our member base, and as those members use the Netflix service even more, having cost-efficient, rapid, and precise insights into the operational health of our systems is only growing in importance. For example, a five-minute outage today is equivalent to a two-hour outage at the time of our last Mantis blog post.

Mantis Makes It Easy to Answer New Questions

The traditional way of working with metrics and logs alone is not sufficient for large-scale and growing systems. Metrics and logs require that you to know what you want to answer ahead of time. Mantis on the other hand allows us to sidestep this drawback completely by giving us the ability to answer new questions without having to add new instrumentation. Instead of logs or metrics, Mantis enables a democratization of events where developers can tap into an event stream from any instrumented application on demand. By making consumption on-demand, you’re able to freely publish all of your data to Mantis.

Mantis is Cost-Effective in Answering Questions

Publishing 100% of your operational data so that you’re able to answer new questions in the future is traditionally cost prohibitive at scale. Mantis uses an on-demand, reactive model where you don’t pay the cost for these events until something is subscribed to their stream. To further reduce cost, Mantis reissues the same data for equivalent subscribers. In this way, Mantis is differentiated from other systems by allowing us to achieve streaming-based observability on events while empowering engineers with the tooling to reduce costs that would otherwise become detrimental to the business.

From the beginning, we’ve built Mantis with this exact guiding principle in mind: Let’s make sure we minimize the costs of observing and operating our systems without compromising on required and opportunistic insights.

Guiding Principles Behind Building Mantis

The following are the guiding principles behind building Mantis.

  1. We should have access to raw events. Applications that publish events into Mantis should be free to publish every single event. If we prematurely transform events at this stage, then we’re already at a disadvantage when it comes to getting insight since data in its original form is already lost.
  2. We should be able to access these events in realtime. Operational use cases are inherently time sensitive by nature. The traditional method of publishing, storing, and then aggregating events in batch is too slow. Instead, we should process and serve events one at a time as they arrive. This becomes increasingly important with scale as the impact becomes much larger in far less time.
  3. We should be able to ask new questions of this data without having to add new instrumentation to your applications. It’s not possible to know ahead of time every single possible failure mode our systems might encounter despite all the rigor built in to make these systems resilient. When these failures do inevitably occur, it’s important that we can derive new insights with this data. You should be able to publish as large of an event with as much context as you want. That way, when you think of a new questions to ask of your systems in the future, the data will be available for you to answer those questions.
  4. We should be able to do all of the above in a cost-effective way. As our business critical systems scale, we need to make sure the systems in support of these business critical systems don’t end up costing more than the business critical systems themselves.

With these guiding principles in mind, let’s take a look at how Mantis brings value to Netflix.

How Mantis Brings Value to Netflix

Mantis has been in production for over four years. Over this period several critical operational insight applications have been built on top of the Mantis platform.

A few noteworthy examples include:

Realtime monitoring of Netflix streaming health which examines all of Netflix’s streaming video traffic in realtime and accurately identifies negative impact on the viewing experience with fine-grained granularity. This system serves as an early warning indicator of the overall health of the Netflix service and will trigger and alert relevant teams within seconds.

Contextual Alerting which analyzes millions of interactions between dozens of Netflix microservices in realtime to identify anomalies and provide operators with rich and relevant context. The realtime nature of these Mantis-backed aggregations allows the Mean-Time-To-Detect to be cut down from tens of minutes to a few seconds. Given the scale of Netflix this makes a huge impact.

Raven which allows users to perform ad-hoc exploration of realtime data from hundreds of streaming sources using our Mantis Query Language (MQL).

Cassandra Health check which analyzes rich operational events in realtime to generate a holistic picture of the health of every Cassandra cluster at Netflix.

Alerting on Log data which detects application errors by processing data from thousands of Netflix servers in realtime.

Chaos Experimentation monitoring which tracks user experience during a Chaos exercise in realtime and triggers an abort of the chaos exercise in case of an adverse impact.

Realtime Personally Identifiable Information (PII) data detection samples data across all streaming sources to quickly identify transmission of sensitive data.

Try It Out Today

To learn more about Mantis, you can check out the main Mantis page. You can try out Mantis today by spinning up your first Mantis cluster locally using Docker or using the Mantis CLI to bootstrap a minimal cluster in AWS. You can also start contributing to Mantis by getting the code on Github or engaging with the community on the users or dev mailing list.


A lot of work has gone into making Mantis successful at Netflix. We’d like to thank all the contributors, in alphabetical order by first name, that have been involved with Mantis at various points of its existence:

Andrei Ushakov, Ben Christensen, Ben Schmaus, Chris Carey, Danny Yuan, Erik Meijer, Indrajit Roy Choudhury, Josh Evans, Justin Becker, Kathrin Probst, Kevin Lew, Ram Vaithalingam, Ranjit Mavinkurve, Sangeeta Narayanan, Santosh Kalidindi, Seth Katz, Sharma Podila.

Open Sourcing Mantis: A Platform For Building Cost-Effective, Realtime, Operations-Focused… was originally published in Netflix TechBlog on Medium, where people are continuing the conversation by highlighting and responding to this story.

Reduce Operational Costs with AWS EC2 R5 Instance Types on the Instaclustr Managed Platform

AWS EC2 R5 instance types on the Instaclustr Managed Platform

Continuing our efforts in adding support for the latest generation AWS EC2 instances, Instaclustr is pleased to announce support for the EC2 R5 instance type for Apache Cassandra clusters on the Instaclustr Managed Platform. The R-series instances have been optimized for memory intensive applications that require high-performance databases, distributed in-memory caches, in-memory databases, and big data analytics. On the Instaclustr Managed Platform, they are suitable for Cassandra and Kafka clusters when they are required by customer applications to deliver very high performance I/O with minimal latency.

The latest generation of R-series (R5s) instances provide significantly better price-to-performance and price-per-GB over its predecessor. They are designed to gain the most out of the hardware capabilities allowing applications to attain the best value for money. R5s also have EBS-optimized burst support, delivering better peak throughput than R4s. Read more about AWS EC2 R5 instances here.

If you are interested in migrating your existing Cassandra and Kafka cluster on the Instaclustr Managed Platform from R4s to R5s, our highly experienced Technical Operations team can provide all the assistance you need. We have built several tried and tested node replacement strategies to provide zero-downtime, non-disruptive migrations. Read our Advanced Node Replacement blog for more details on one such strategy for Cassandra.

Apache Kafka Benchmarking Results

Our Kafka performance benchmarking methodology aims to find the maximum consumer throughput (messages consumed per second) for a given test set.  For this test a large number of producer load generators were established, enough to stress our Kafka cluster. Under this peak load, a consumer group was also established to consume messages from Kafka, essentially achieving the maximum consumer throughput. The results showed 16% increase in consumer throughput with R5s compared to R4s. 

The load test set was comprised of two 3-node Kafka clusters, one with R4.xlarge and the other with R5.xlarge instances. Each node had 750GB EBS backed SSDs attached. A 30 partition 3x replicated topic was created on both the clusters. The performance was calculated on small 100 byte messages. Average message consumption per second was 1,994,499 messages/sec with R4s, and 2,322,540 messages/sec with R5s.

AWS R5 benchmarking result

We are working on benchmarking R5s for Cassandra and will soon publish the results once the testing is completed. 

If you want to know more about this benchmarking or need clarification on when to use the R5 instance type for Cassandra of Kafka, reach out to our Support team (if you are an existing customer), or contact our Sales team. You can also refer to our support article that describes the nature of each offered instance type and their likely use cases.

R5 Pricing

R5 series is slightly cheaper than R4s and the cost savings are passed to our customers. But considering R5’s improved performance, the overall price-performance gain is significantly higher. You can access pricing details through the Instaclustr Console when you login, or contact our Support or Sales teams.

The post Reduce Operational Costs with AWS EC2 R5 Instance Types on the Instaclustr Managed Platform appeared first on Instaclustr.

TypeScript Support in the DataStax Node.js Drivers

TypeScript declarations of the driver are now contained in the same package and repository. These declarations will be maintained and kept in sync along with the JavaScript API. 

Additionally, you can now use DSE-specific types like geo, auth and graph types from TypeScript.

Getting started

To get started with the Node.js driver for Apache Cassandra in a TypeScript project, install the driver package:

npm install cassandra-driver

Removing Support for Outdated Encryption Mechanisms

At Instaclustr, security is the foundation of everything we do. We are continually working towards compliance with additional security standards as well as conducting regular risk reviews of our environments. This blog post outlines some technical changes we are making that both increases the security of our managed environment and enables compliance with a wider range of security standards.

From October 9, 2019 AEST newly provisioned Instaclustr clusters running recent versions of Cassandra and Kafka will have support for the SSLv3, TLSv1.0 and TLSv1.1 encryption protocols disabled and thus require the use of TLS 1.2 and above. From this date, we will also begin working with customers to roll this change out to existing clusters.

Instaclustr-managed clusters that will be affected are:

  • Apache Cassandra 3.11+
  • Apache Kafka 2.1+

Why are we doing this?

The protocols we are disabling are out of date, have known vulnerabilities and are not compliant with a range of public and enterprise security standards. All identified clients that support this version of Cassandra and Kafka support TLS1.2.

How can I test if I will be affected?


The cqlsh CLI will need to be changed to request TLSv1.2 (otherwise it defaults to TLSv1.0).  Assuming a cqlshrc file based on the Instaclustr example, the updated entry should be:


certfile = full_path_to_cluster-ca-certificate.pem

validate = true

factory = cqlshlib.ssl.ssl_transport_factory

version = TLSv1_2

Note: If running CQLSH from on Mac OS X, the system Python is not updated and will not support the TLSv1_2 option.  You should instead manually update your system Python or run cqlsh from a Docker container.

Clients built on the Datastax Apache Cassandra Java driver can create a custom SSLContext that requires that TLSv1.2 is used e.g.

ctx = SSLContext.getInstance("TLSv1.2", "SunJSSE");

If the client is able to successfully connect, then it confirms that your Java environment supports TLSv1.2 (i.e. is recent enough and is not configured to disable it).


If using the official Apache Kafka Java client (or the Instaclustr ic-kafka-topics tool), the client configuration can be updated to allow only TLSv1.2.  For example, based on the Instaclustr example configuration, the enabled protocols becomes:


If the client is able to successfully connect, then it confirms that your Java environment supports TLSv1.2 (i.e. is recent enough and is not configured to disable it).


We understand that testing and changing systems is a time-consuming process. Given the widespread support for TLSv1.2 we do not anticipate that this change will actually impact any current systems. 

If you have any questions or concerns, please do not hesitate to contact us at


The post Removing Support for Outdated Encryption Mechanisms appeared first on Instaclustr.

ApacheCon 2019: DataStax Announces Cassandra Monitoring Free Tier, Unified Drivers, Proxy for DynamoDB & More for the Community

It’s hard to believe that we’re celebrating the 20th year of the Apache Software Foundation. But here we are—and it’s safe to say open source has come a long way over the last two decades.

We just got back from ApacheCon, where DataStax—one of the major forces behind the powerful open source Apache Cassandra™ database—was a platinum sponsor this year. 

We don’t know about you, but we couldn’t be more excited about what the future holds for software development and open source technology in particular.

During CTO Jonathan Ellis’ keynote, we announced three exciting new developer tools for the Cassandra community:

  • DataStax Insights (Cassandra Performance Monitoring)
  • Unified Drivers
  • DataStax Spring Boot Starter for the Java Driver


While we’re at it, in my talk “Happiness is a hybrid cloud with Apache Cassandra,” I announced our preview release for another open source tool: DataStax Proxy for DynamoDB™ and Apache Cassandra.

This tool enables developers to run their AWS DynamoDB™ workloads on Cassandra. With this proxy, developers can run DynamoDB workloads on-premises to take advantage of the hybrid, multi-model, and scalability benefits of Cassandra.

These tools highlight our commitment to open source and will help countless Cassandra developers build transformative software solutions and modern applications in the months and years ahead. 

Let’s explore each of them briefly.

1. DataStax Insights (Cassandra Performance Monitoring)

Everyone who uses Cassandra—whether they’re developers or operators—stands to benefit from DataStax Insights, a next-generation performance management and monitoring tool that is included with DataStax Constellation, DataStax Enterprise, and open source Cassandra 3.x and higher. 

We’re now offering sign-ups for DataStax Insights (or better said: Cassandra Monitoring) for free, allowing Cassandra users to have an at-a-glance health index to get a single view of all clusters. The tool also enables users to optimize their clusters using AI to recommend solutions to issues, highlight anti-patterns, and identify performance bottlenecks, among other things. 

DataStax Insights is free for all Cassandra users for up to 50 nodes and includes one week of rolling retention. Interested in joining the DataStax Insights early access program? We’re taking sign-ups now (more on that below). 


2. Unified Drivers

Historically, DataStax has maintained two sets of drivers: one for DataStax Enterprise and one for open source Cassandra users. Moving forward, we are merging these two sets into a single unified DataStax driver for each supported programming language including C++, C#, Java, Python, and node.js. As a result, each unified driver will work for both Cassandra and DataStax products.

This move benefits developers by simplifying driver choice, which makes it easier to determine which driver to use when building applications. At the same time, developers using the open source version of Cassandra will now have free access to advanced features that initially shipped with our premium solutions. Further, developers that have previously used two different sets of drivers will now only need to use one driver for their applications across any DataStax platform and open source Cassandra. This will help with enhanced load balancing and reactive streams support. 

3. Spring Boot Starter 

The DataStax Java Driver Spring Boot Starter, which is now available in DataStax Labs, streamlines the process of building standalone Spring-based applications with Cassandra and DataStax databases.

Developers will enjoy that this tool centralizes familiar configuration in one place while providing easy access to the Java Driver in Spring applications.

It’s just one more way that makes the application development process easier.

4. DataStax Proxy for DynamoDB™ and Apache Cassandra™

With the DataStax Proxy for DynamoDB and Cassandra, developers can run DynamoDB workloads on-premises, taking advantage of the hybrid, multi-model, and scalability benefits of Cassandra.

The proxy is designed to enable users to back their DynamoDB applications with Cassandra. We determined that the best way to help users leverage this new tool and to help it flourish was to make it an open source Apache 2 licensed project.

The code consists of a scalable proxy layer that sits between your app and the database. It provides compatibility with the DynamoDB SDK which allows existing DynamoDB applications to read and write data to Cassandra without application changes.


Sign up for the DataStax Insights early access program today!

Are you interested in optimizing your on-premises or cloud-based Cassandra deployments using a platform that lets novices monitor and fine-tune their cluster performance like experts? 

If so, you may want to give DataStax Insights a try. 

We’re currently accepting sign-ups to our early access program. Click the button below to get started!


DataStax Labs

DataStax Labs provides the Apache Cassandra™ and DataStax communities with early access to product previews or enhancements for developers that are being considered for future production software; including tools, aids, and partner software designed to increase productivity. When you try out some of our new Labs technologies, we would love your feedback—good or bad—let us know!

Top 5 Reasons to Choose Apache Cassandra Over DynamoDB

Overview – Why Apache Cassandra Over DynamoDB

DynamoDB and Apache Cassandra are both very popular distributed data store technologies. Both are used successfully in many applications and production-proven at phenomenal scale. 

At Instaclustr, we live and breathe Apache Cassandra (and Apache Kafka). We have many customers at all levels of size and maturity who have built successful businesses around Cassandra-based applications. Many of those customers have undertaken significant evaluation exercises before choosing Cassandra over DynamoDB and several have migrated running applications from DynamoDB to Cassandra. 

This blog distills the top reasons that our customers have chosen Apache Cassandra over DynamoDB.

Reason 1: Significant Cost of Writes to DynamoDB

For many use cases, Apache Cassandra can offer a significant cost saving over DynamoDB. This is particularly the case of requirements that are write-heavy. The cost of write to DynamoDB is five times that cost of the read (reflected directly in your AWS bill). For Apache Cassandra, write are several times cheaper than reads (reflected in system resource usage).

Reason 2: Portability

DynamoDB is available in AWS and nowhere else. For multi-tenant SaaS offerings where only a single instance of the application will ever exist, then being all-in on AWS is not a major issue. However, many applications, for a lot of good reasons, still need to be installed and managed on a per-customer basis and many customers (often the largest ones!) will not want to run on AWS. Choosing Cassandra allows your application to run anywhere you can run a linux box.  

Reason 3: Design Without Having to Worry About Pricing Models

DynamoDB’s pricing is complex with two different pricing models and multiple pricing dimensions. Applying the wrong pricing models or designing your architecture without considering pricing can result in order of magnitude differences in costs. This also means that a seemingly innocuous change to your application can dramatically impact cost.  With Apache Cassandra, you have your infrastructure and you know your management fees, once you have completed performance testing and you know that your infrastructure can meet your requirements, you know your costs.

Reason 4: Multi-Region Functionality

Apache Cassandra was the first NoSQL technology to offer active-active multi-region support. While DynamoDB has added Global Tables, these have a couple of key limitations when compared to Apache Cassandra. The most significant in many cases is that you cannot add replicas to an existing global table. So, if you set up in two regions and then decide to add a third you need to completely rebuild from an empty table. With Cassandra, adding a region to a cluster is a normal, and fully online, operation. Another major limitation is that DynamoDB only offers eventual consistency across Global Tables, whereas Apache Cassandra’s tunable consistency levels can enforce strong consistency across multiple regions.

Reason 5: Avoiding Vendor Lock-In

Apache Cassandra is true open source software, owned and governed by the Apache Software Foundation to be developed and maintained for the benefit of the community and able to be run in any cloud or on-premise environment. DynamoDB is an AWS proprietary solution that not only locks you in to DynamoDB but also locks your application to the wider AWS ecosystem. 

While these are the headline reasons that people make the choice of Apache Cassandra over DynamoDB, there are also many advantages at the detailed functional level such as:

  • DynamoDB’s capacity is limited by partition with a maximum of 1,000 write capacity units and 3,000 read capacity units per partition. Cassandra’s capacity is distributed per node which typically provide a per-partition limit orders of magnitude higher than this.
  • Cassandra’s CQL query language provides a simple learning curve for developers familiar with SQL.
  • DynamoDB only allows single value partition and sort (called clustering in Cassandra) keys while Cassandra support multi-part keys. A minor difference but another way Cassandra reduces application complexity.
  • Cassandra supports aggregate functions which in some use cases can provide significant efficiencies.


The post Top 5 Reasons to Choose Apache Cassandra Over DynamoDB appeared first on Instaclustr.