Distributed Database Things to Know: Cassandra Datacenter & Racks

A datacenter, for instance, could be a cloud provider, an actual physical datacenter location, an availability zone in Azure, or a region in some other provider. What a datacenter actually is in Cassandra parlance can vary, but the origin of the name remains the same. The elements that make up a rack can also vary, yet the concept likewise holds.

Origins: Racks & Datacenters?

Let’s first cover what the industry at large calls datacenters and racks, independent of the Apache Cassandra terms.

Tantan Connects with Scylla to Help Singles Find Their Closest Match

Tantan and Scylla

About Tantan

Tantan is a locality-based dating app that provides a venue for people to connect and expand their social circles. More than 160 million men and women have used Tantan to chat, form new friendships, and, ideally, find their perfect match. We have more than 10 million daily active users.

Our app delivers profiles of people that might be interesting to our users. Those profiles incorporate photos, likes, and interests. After evaluating prospective matches, users can ‘like’ favorites with a swipe. If the swipe is reciprocated, a match is struck.

Once matched up, users can interact via texts, voice messages, pictures, and videos. Our app provides social privacy by blocking and filtering other users. Our new ‘ice-breaker’ feature lets our users get a better understanding of a potential match with a brief quiz.


“By adopting Scylla, we’ve completely side-stepped garbage collection and other issues that were interfering with the real-time nature of our application.”

Ken Peng, Backend Specialist, Tantan

The Challenge

The Tantan app periodically collects user locations, which are used to identify, in real-time, other users within a range of about 300 meters. Our peak usage hits about 50,000 writes per second.

Because we have so many geographically distributed users, we put a lot of strain on our database. Due to the real-time nature of our app, our number-one database requirement is low latency for mixed read and write workloads.

Initially we ran two 16-node Cassandra clusters: one cluster for our passby feature and one for storing user data. Yet even with this many nodes, we found ourselves running into serious latency issues. These issues were often caused by a common culprit in Cassandra: garbage collection. However, we also saw latency spikes resulting from other slow maintenance tasks, such as repair. Even after tuning Cassandra to our specific workloads, we saw little improvement.

The Solution

Scylla turned out to be just the right solution for our problems; it meets our needs perfectly. Today, we run the passby function against an 8-node Scylla cluster, a reduction in nodes that’s been a big cost savings for us. More importantly, by adopting Scylla, we’ve completely side-stepped garbage collection and other issues that were interfering with the real-time nature of our application. We also spend much less time managing our database infrastructure.

We found the Scylla teams and the broader Scylla community to be very responsive in getting us quickly up to speed and helping us to optimize our Scylla deployment.

The post Tantan Connects with Scylla to Help Singles Find Their Closest Match appeared first on ScyllaDB.

How to Tap Into the Power of a NoSQL Database

Businesses have been relying on relational database management systems (RDBMS) to store, process, and analyze their data for decades. These systems have been around since the 1970s and were created for a world where virtually all data was structured (e.g., names and ZIP codes).

However, today’s data comes in many more forms, and it arrives much faster and in far greater volume. To unlock the full potential of this high-volume, high-variability data, organizations simply cannot continue to rely solely on relational databases. A new kind of database is required.

This is why we’ve seen the emergence and increased adoption of NoSQL databases in recent years.

While NoSQL databases currently account for just 3% of the $46 billion database market, forward-thinking companies are increasingly migrating to them to future-proof their operations.

The Case for NoSQL Databases

Why are today’s leading companies moving to NoSQL databases?

For starters, NoSQL databases are perfect for running application deployments in hybrid or multi-cloud environments because of their flexibility in handling data and their easy scalability.

NoSQL databases help organizations better understand the relationships between all of the data they have in their possession—not just the structured data.

This enables them to target their customers more effectively, identify opportunities they might have otherwise missed, and detect problems as they materialize—or even before they occur—which sure beats waiting until something terrible happens and being forced to respond from a defensive position.

NoSQL databases also help companies lower their operating costs. Whereas RDBMS generally run on expensive servers, leading NoSQL databases can be delivered through the cloud, leveraging inexpensive commodity hardware. This makes it easier for companies to find room in their budgets to store and process large swaths of data.

What’s more, NoSQL databases provide the elastic scalability today’s leading organizations require. If an ecommerce site isn’t working properly because there are more concurrent users than the underlying database can handle, customers will be frustrated. If an internal app stalls for the same reason, employee productivity will grind to a halt. Because they scale outward instead of upward, NoSQL databases can accommodate traffic spikes with ease—ensuring optimal user experiences and productive employees.

NoSQL databases are also much easier to manage compared to RDBMS, requiring companies to devote fewer internal resources to their operation. Engineers and architects might still have to tweak things here and there, but they are nowhere near as resource-intensive as traditional RDBMS.

Finally, NoSQL databases that incorporate graph technology are ideal for fraud detection, maintaining compliance with often burdensome and complicated regulations, and keeping customer data safe.

Getting Started With NoSQL Databases

By now you may be sold on the transformative promise of NoSQL.

But how exactly do you get started?

It’s easier than you might think.

We built DataStax Enterprise (DSE), the always-on, active everywhere database designed for hybrid cloud environments, to help organizations realize their full potential by leveraging powerful technology created for the modern world.

DSE was designed to be deployed on any infrastructure, so you’ll be able to use the database on whatever hardware your business currently relies on.

DSE uses a masterless, active everywhere architecture to ensure high availability without requiring you to invest in failover solutions. It also provides data autonomy—meaning you can take your data with you wherever you go instead of being locked into a vendor’s solution—and enables you to manage workloads across data centers and cloud environments, increasing efficiency even more.

In today’s data-driven world, companies that refuse to migrate to modern database solutions will be left in the dust. On the flip side, those that are powered by NoSQL databases will be able to move faster, more affordably, and with more data and insights at their disposal.

Whitepaper: Architect’s Guide to NoSQL


SLAs – Watch Out for the Shades of Gray

The Service Level Agreement (SLA) is an integral part of a Managed Service Provider’s (MSP’s) business. Its purpose is to define the scope of services that the MSP offers, including:

  • guarantees on metrics relevant to their business and technology
  • customer responsibilities
  • issue management
  • compensation commitment when MSPs fail to deliver on the SLA.

It should set the customer’s expectations, be realistic, and be crystal clear, with no scope for misinterpretation. Well, that’s how they are meant to be. But unfortunately, in the quest for more sales, some MSPs tend to commit themselves to unrealistic SLAs. It’s tempting to buy into a service when an MSP offers you 100% availability. It is even more tempting when you see a compensation clause that gives you confidence in going ahead with that MSP. But hold on! Have you checked out the exclusion clauses? Have you checked out your responsibilities in order to get what you are entitled to in the SLA? Just as it is the MSP’s responsibility to define a crystal-clear SLA, it is the customer’s responsibility to thoroughly understand the SLA and be aware of every clause in it. That is how you will notice the shades of gray!

We have put together a list of things to look for in an SLA so that customers are aware of the nuances involved and avoid unpleasant surprises after signing on.

Priced SLAs

Some MSPs provide a baseline SLA for their service, and customers wishing to receive higher levels of commitment may need to fork out extra money.

At Instaclustr, we have a different take on this. We have four tiers of SLA, not to make customers pay more but because our SLA approach is based on the capabilities and limitations of the underlying technology we are servicing. The SLA tiers are based on the number of nodes in a cluster and a set of responsibilities that customers are prepared to commit to. Our customers do not pay extra for a higher level of SLA; they pay for the number of nodes in the cluster, and with more nodes come higher levels of SLA.

 

Compensation

While MSPs do their best to deliver on their SLA commitments, sometimes things go south. In scenarios where an SLA metric is not delivered, MSPs provide compensation, typically paid as credit or a reduction in the monthly bill. Customers should look into how the compensation is calculated. Some compensate based on the length of downtime, while others may compensate with a flat reduction in the monthly bill irrespective of the length of downtime.

Instaclustr compensates its customers fairly, with a flat reduction in their monthly bill no matter how small the downtime is. The exact rate varies based on the SLA tier the customer’s cluster falls in. For example, a customer with a cluster of 12 production nodes (Critical tier) is entitled to SLA credits of up to 100% of their monthly bill, with each violation of the availability SLA credited at up to 30% and each violation of the latency SLA at up to 10%. Every time we fail to deliver on an SLA metric there is a real impact on Instaclustr’s business, but we like the challenge. Continuously improving uptime and performance is at the core of our business.

Issue Management

Another important factor customers have to look at in an SLA is how the MSP handles issues. Issue management commonly includes communication touchpoints, named contacts, escalation procedure, issue impact and severity levels, first-response time and, in some cases, resolution time. Customers should familiarize themselves with each of these aspects and make their internal teams aware of them.

Instaclustr customers can get this information in our Support Policy document, which gives clear details on issue management and sets the right expectations.

Customer Responsibilities

Although the primary purpose of an SLA is for MSPs to provide guarantees on key metrics such as availability and latency, MSPs usually add a “help us help you” clause. It basically means: in order for MSPs to uphold those SLA guarantees, customers have certain responsibilities that they have to commit to. If your SLA doesn’t have customer responsibilities recorded, talk to your MSP and get it clarified first thing.

The Instaclustr SLA has a list of customer responsibilities that must be met in order for us to deliver on those SLA guarantees. Each SLA tier, which is based on the number of nodes in a cluster, has a different level of responsibilities; essentially, customers take on more responsibilities to receive higher levels of SLA guarantees. For example, the “Small” tier for Kafka requires customers to maintain a minimum of 3 replicas per Kafka topic to get the SLA guarantees that the tier promises, while the “Critical” tier requires customers not only to maintain 3 replicas per topic but also to maintain separate testing and production clusters. This transparency upfront avoids uncertainty and unpleasant surprises if an issue arises.

Service/Technology Dependent SLA

Instaclustr’s SLA guarantees are defined on the basis of the technology under management and its cluster configuration. For instance, a customer with a Kafka cluster comprising 5 production nodes is guaranteed 99.95% availability for writes but is given no guarantee for latency. However, if the same customer has another Kafka cluster with 12 production nodes (and meets other documented conditions), they will be guaranteed 99.999% availability and a 99th percentile latency target. Similarly, Cassandra clusters of different sizes come with different tiers of SLA: basically, the larger the cluster (more nodes), the higher the availability of data. This guarantee is backed by our experience in providing massive-scale technology as a fully managed service. Instaclustr simply offers the best SLA realistically possible for the technology and the cluster configuration.

If your MSP is promising a 100% availability SLA irrespective of the technology under management, its size and its configuration, that is simply not realistic. Be sure to check for exclusions and also review the compensation clause to make sure it is substantial.

Hold Harmless Clauses

This is probably the most important section to watch out for in an SLA. MSPs need to protect themselves from situations where an SLA guarantee isn’t met because of conditions outside their control. MSPs operate in several autonomous environments with several technologies and integrations. Even with best practices in place, sometimes something will go wrong due to no fault on the part of an MSP.

For example, Instaclustr has this clause: “All service levels exclude outages caused by non-availability of service at the underlying cloud provider region level or availability zone level in regions which only support two availability zones”. Clauses like this are critical from an MSP’s perspective, as they ensure the MSP isn’t held liable for conditions outside its control; we simply can’t do anything about such large-scale disruptions in the underlying cloud infrastructure. However, an unreasonable exclusion would be to exclude failures of virtual machines (VMs) in the cloud. With hundreds or thousands of VMs running in a cloud, it is normal to expect VM failures, and the large-scale technologies we operate can easily be designed to handle VM (node) failures without hurting availability. If your MSP has a VM failure exclusion clause, it is time to have a conversation.

Another necessary exclusion is significant, unexpected changes in application behavior. An MSP has limited visibility into changes in your application environment (either business-level or technical) that may impact the delivered services. However, we do have the shared objective with our customers of maximising their cluster’s performance and availability. Communicating, and in some cases testing, significant changes before deploying on production is necessary to ensure the service can cope with the change. Instaclustr SLAs for latency exclude “unusual/unplanned workload on cluster caused by customer’s application”, and generally exclude “issues caused by customer actions including but not limited to attempting to operate a cluster beyond available processing or storage capacity”. Customers can avoid being impacted by these clauses by contacting our technical operations team ahead of significant changes to manage the risks.

It is common for MSPs to add a third-party exclusion clause. This means that if an issue is found to be in a third-party application they integrate with, and over which they have no control, they are safeguarded. But Instaclustr does not have any third-party application exclusion. Given that we operate in the open source technology space, adding a clause like this would mean protecting ourselves from any issue in Kafka or Cassandra, which is exactly what our customers pay us to manage. If your MSP has third-party application exclusions, it is time to review them closely if you haven’t done so already.

Customers should understand these exclusions in depth and make sure they are not unreasonable. This will need thorough review from technical and legal teams.

It’s Not Just About Uptime

Over the years, uptime or availability has become the default term when discussing an SLA with MSPs. It is definitely the most important metric you would want covered. However, there are many other metrics that could be very valuable to a customer’s business. It is important to make sure that the SLA covers the right metrics—the ones that matter to the customer based on the application and the technology under management.

Instaclustr has always included availability and latency in its SLAs, as they are two key metrics for the technologies we manage. We are glad to announce the recent addition of the Recovery Point Objective (RPO) metric to our SLA. We take daily backups, with an option of continuous backup (5-minute intervals), and our technical operations team has always treated data recovery as a top-priority incident because it impacts customers’ businesses. So it just made sense to add an RPO guarantee to our SLA.

SLAs Are Negotiable

SLAs are a mutual contract—in fact, a legal agreement between two parties. Although it is primarily the MSP’s responsibility to draft the initial SLA, customers have the right to negotiate. Customers should review it thoroughly and negotiate if something important to their business is missing before signing up for it. Instaclustr’s Sales and Customer Success teams handle most of these discussions; we welcome any SLA requirements from customers and are open to negotiation.

SLAs Aren’t Everything

SLAs are a quantitative, contractual promise with financial penalties for failures. They are important for covering the most critical aspects of service delivery. However, SLAs can never capture every aspect of delivering a high-quality service that meets the customer’s needs. At Instaclustr, meeting our SLAs is a core focus, but it is just the first step in providing a service that meets and exceeds our customers’ expectations. For example, ticket response time guarantees tell you how quickly you will get a first response to a ticket, but not the quality of that response. At Instaclustr we have expert engineers as the first responders to all tickets to provide the best quality response possible.

Instaclustr SLA Update

As part of our commitment to keep our SLAs relevant and realistic, we regularly review them on the basis of quantitative evidence. We have recently updated our SLAs to strengthen our commitments. Here is a summary of the changes.

  • Cassandra
    • Recovery Point Objective (RPO) is added as a new metric, stating: “We will maintain backups to allow a restoration of data with less than 24 hours data loss for standard backups and less than 5 minutes data loss for our Continuous Back-ups option. Should we fail to meet this Recovery Point Objective, you will be eligible for SLA credits of 100% of monthly fees for the relevant cluster. If you have undertaken restore testing of your cluster in the last 6 months (using our automated restore functionality) and can demonstrate that data loss during an emergency restore is outside target RPO and your verification testing, then you will be eligible for SLA credits of 500% of monthly fees”.
    • Starter plan: best-effort availability target increased from 99.9% to 99.95%.
    • Small plan: 99.9% availability for consistency ONE has been increased to 99.95% availability for consistency LOCAL_QUORUM.
    • Enterprise plan: 99.95% availability has been increased to 99.99%.
  • Kafka
    • Starter plan: best-effort availability target increased from 99.9% to 99.95%.
    • Small plan: increase from 99.9% availability to 99.95%.
    • Enterprise plan: increase from 99.95% availability to 99.99%.
    • Critical plan: increase from 99.99% availability to 99.999%.

 

More details and the complete SLA can be found on the Policy documentation page on the Instaclustr website.

The post SLAs – Watch Out for the Shades of Gray appeared first on Instaclustr.

Cassandra Writes in Depth

There is no point in describing this concept again, so if you are not familiar with Cassandra’s architecture, check the resources below; otherwise, skip to the next section.

How to Get the Most Out of Apache Cassandra™

Since it first appeared in 2008, Apache Cassandra™—an open source distributed NoSQL database that originally began as a Facebook project—has become increasingly popular for enterprise applications that require high availability, high performance, and scalability.

As more and more enterprises deploy Cassandra, the demand for engineers skilled in the new database is increasing, too. Only 8% of respondents to a recent survey believe that there are enough qualified NoSQL experts on hand to meet the needs of today’s enterprises.

Unfortunately, you can’t just migrate to Cassandra and expect your wildest dreams to come true.

Getting the most out of Cassandra requires a well-thought-out game plan and a team of skilled and knowledgeable database administrators and operators who know exactly how to carry it out.

With that in mind, let’s take a look at four tips your organization can employ to ensure your deployment of Cassandra helps you achieve your business goals.

1. Train your team thoroughly

According to Gartner, skills shortages in data science remain a problem for many organizations.

Getting the most out of Cassandra starts with making sure your team knows the ins and outs of the technology and is comfortable using it. The easiest way to do this is to invest adequate resources into training and professional development.

For the best results, begin training your team a few weeks or even months before you roll out Cassandra. That way, they’ll have enough time to become familiar with the new technology before it’s deployed.

2. Give your team access to additional resources

Not everyone on your team will learn at the same pace.

In addition to regular training exercises, direct your team to additional resources they can leverage on their own time to become even more familiar with Cassandra’s functionality.

For example, DataStax Academy is a free resource that engineers can use to train themselves at their own pace. The academy features ad-hoc learning opportunities, how-tos, podcasts, and more. There’s also a developer blog that offers tips and tricks, a Slack channel for discussions, and, from time to time, in-person meetups held all over the country.

To sum up: after you’ve trained your team, point them to resources they can use to get up to speed on topics they might not be as comfortable with.

3. Pick the right data model

One of the hardest parts of using Cassandra is picking the right data model.

Generally speaking, your data model should help you achieve two main goals:

  1. Spreading data evenly around the cluster
  2. Minimizing the number of partitions read

To accomplish the first goal, you’ll need to pick a good primary key. To accomplish the second goal, model your data to fit your queries instead of modeling around relations or objects.
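As a purely illustrative sketch (the table and column names here are made up, not taken from the article), a time-series table might bucket rows by sensor and day, so writes spread evenly across the cluster while a typical query still reads a single partition:

-- Composite partition key (sensor_id, day) plus a clustering column
CREATE TABLE readings_by_sensor_day (
    sensor_id uuid,
    day       date,
    ts        timestamp,
    value     double,
    PRIMARY KEY ((sensor_id, day), ts)
) WITH CLUSTERING ORDER BY (ts DESC);

-- Touches exactly one partition: one sensor, one day
SELECT ts, value
FROM readings_by_sensor_day
WHERE sensor_id = 123e4567-e89b-12d3-a456-426655440000
  AND day = '2019-02-20';

The composite partition key addresses the first goal by distributing rows across nodes, and the clustering column ts addresses the second by keeping each query confined to a single partition.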

For more tips on how to pick the right data model, check this out.

4. Optimize your Cassandra implementation

As you scale your Cassandra deployment across the enterprise, managing the database can become more costly and increasingly complex without the right approach.

This is why we created DataStax Distribution of Apache Cassandra™, a production-ready implementation of Cassandra that is ready to go out of the box and is completely compatible with open source Cassandra.

The DataStax Distribution of Apache Cassandra also comes with best-in-class support. Choose between 24×7 and 8×5 support, depending on your needs. By leveraging these services, you’ll be able to reduce internal support costs considerably while ensuring SLAs are met.

eBook: The 5 Main Benefits of Apache Cassandra™


Introducing Scylla University

Scylla University

So you’re thinking of running your applications with Scylla? You’ve probably heard it’s a lightning fast, self-optimizing, highly available Apache Cassandra drop-in replacement. Yet you may still have questions like:

  • How do I upgrade from my current system?
  • How many nodes do I need?
  • How do I ensure that my data is consistent across the database?
  • How should applications be written to maximize database performance? How do I scale up or scale out?

To make it easier to find answers to these questions and many more, we have launched Scylla University. Anyone in your organization can now take advantage of our valuable new self-paced online training at no cost.

Visit Scylla University

Our goal at Scylla University is to improve the learning process by creating a destination for engaging, easy-to-use training, advice and best practices. We want to make sure that everyone has access to the knowledge needed to master the fastest database in the market.

The first course serves as a background for anyone who needs hands-on knowledge of Scylla — from developers to DBAs. Each lesson is chunked into small learning segments, allowing you to learn at your own pace. Each module contains quizzes to check your understanding as well as interactive exercises and hands-on labs so you can practice what you’re learning.

Consistency Level: Write Operations

Learn about consistency levels and replication factors in Scylla

The overview course explains the basics of Scylla. By the end of this course, you will understand the fundamental concepts of NoSQL databases and gain knowledge of Scylla’s features and advantages. This includes topics such as installation, the Scylla architecture, data model, and how to configure the system for high availability.

It sets the model for all Scylla University courses to come, including videos, slides, and hands-on labs where you get a chance to put theories into practice.

According to recent research, combining theoretical understanding and practical hands-on learning, with incremental testing for comprehension, is the most effective way to study.

Theory and Practice Make Perfect

New Courses in the Works

We’re currently developing new courses and learning material, including in-depth Scylla Administration, advanced architecture, troubleshooting, best practices and advanced data modeling. When you register for Scylla University we’ll notify you about new content as it becomes available. You’ll also be the first to know about software releases and other updates.

Our future plans include:

  • Courses specific to user levels ranging from novice to advanced
  • More content, case studies, videos, webinars, quizzes and hands-on labs to help you improve your skills
  • Certification paths for Scylla and NoSQL
  • Ongoing refinements to the user interface

Scylla University Sea Monster Honor Roll Challenge


Got what it takes to be the best of the best? Each completed quiz and lesson will earn you points. Think you can earn enough to be on the Honor Roll? How about the Head of the Class? The top 5 students will get a Scylla t-shirt and more cool swag. This challenge ends on March 20th, so get going!

Get Started Today!

Scylla Essentials

“You don’t have to be great to start, but you have to start to be great.” – Zig Ziglar

Our first online course, Scylla Essentials – Overview of Scylla, is now available. It’s free. All you have to do to get started is register.

Is there a course you would like us to add to our curriculum? Or perhaps you’ve got suggestions on how we can improve the University site? Please let us know.

The post Introducing Scylla University appeared first on ScyllaDB.

Upgraded Compute Intensive Node Sizes for AWS Clusters

Last year, Instaclustr upgraded its support for high-throughput, compute-intensive nodes on the AWS platform. This involved switching away from the aging c3 instance types to the newer c5d instance type. The new c5d instances provide better performance and more compute power, and are better value for money than the c3 instances.

The new Instaclustr node type is based on the c5d.2xlarge compute image and is provisioned with 8 vCPUs, 16 GiB RAM, and 200 GB of NVMe SSD storage. These nodes are suitable for high-throughput, low-latency deployments of Apache Cassandra.

But how much better are the c5d instances? To find out, we ran our standard Cassandra benchmarking tests on Cassandra clusters running on both c3 and c5d instances. These tests cover a combination of pure read, pure write, and mixed read/write operations, with varying data sizes, to evaluate Cassandra’s performance on these node sizes.

Below are the results for our basic latency and throughput tests on clusters with 3 nodes, running Cassandra version 3.11.3.

Latency

We measure latency by performing tests with a static number of operations per second across all test clusters. This number of operations is calibrated to represent a moderate load on the least powerful cluster under test. This shows a difference in the amount of time taken to process the same number of requests.
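A fixed-rate test of this kind can be run with cassandra-stress after an initial data load, for example; the flags and numbers below are purely illustrative, not the exact tooling used:

# Hold a constant, moderate rate so latency can be compared across clusters
cassandra-stress mixed ratio\(write=1,read=1\) duration=30m \
    -rate threads=100 throttle=10000/s \
    -node <cluster_contact_point>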

Small Operations

For small-sized operations, the c3s showed 87% higher median latency on insert operations, and 136% higher median latency on read operations.

Latency - Small Operations

Medium Operations

Medium-sized operations show 33% higher insert latency on the c3s, stretching out to 90% higher latency for read operations.

Latency - Medium Operations

Throughput

Then we measure a cluster’s throughput by starting at a small number of operations per second and increasing it until the cluster starts to become overloaded (indicated by median latency exceeding 20ms for reads or 5ms for writes, or by more than 20 pending compactions one minute after the load completes). This tells us whether a cluster is able to perform more or fewer requests in the same amount of time.
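One illustrative way to step the load is to repeat a throttled cassandra-stress run at increasing rates and watch for the overload thresholds above (again, names and numbers are only examples):

for rate in 10000 20000 30000 40000 50000; do
    # Stop stepping once read/write latency or pending compactions exceed the thresholds
    cassandra-stress mixed ratio\(write=1,read=1\) duration=10m \
        -rate threads=200 throttle=${rate}/s \
        -node <cluster_contact_point>
done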

Small Operations

For small-sized operations, we can see that c5d nodes allow for an 80% improvement in insert operations per second and a huge 125% improvement in read operations per second.

Throughput - Small Operations

Medium Operations

For medium data sizes, we see a slightly more modest 33% improvement in insert operations per second and a still-significant 93% improvement in read operations per second.

Throughput - Medium Operations

Conclusion

Comparing those performance figures against the modest additional price of the c5ds (approximately 15% more expensive, with some variance depending on the region), we can see that the c5ds represent a strong value-for-money improvement over the c3s.

Instaclustr supports the c5d instances in the following AWS regions:

  • US-East-1 (N. Virginia)
  • US-East-2 (Ohio)
  • US-West-1 (Northern California)
  • US-West-2 (Oregon)
  • Europe-Central-1 (Frankfurt)
  • Europe-West-1 (Ireland)
  • Europe-West-2 (London)
  • Asia-Pacific-Southeast-1 (Singapore)

If you already have a c3 cluster with Instaclustr, don’t worry: it will continue to be supported as normal.

Instaclustr supports a range of machine size configurations across all its cloud providers to make sure there is a size that’s perfect for your needs. To see what we’re offering with the new c5d nodes or any of our other existing node sizes, visit our console or contact sales@instaclustr.com.

The post Upgraded Compute Intensive Node Sizes for AWS Clusters appeared first on Instaclustr.

Apache Cassandra - Data Center Switch

Did you ever wonder how to change the hardware efficiently on your Apache Cassandra cluster? Or did you read this blog post from Anthony last week about how to set up a balanced cluster, find it as exciting as we do, but remain unsure how to change the number of vnodes?

A Data Center “Switch” might be what you need.

We will go through this process together hereafter. Additionally, if you’re just looking at adding or removing a data center from an existing cluster, those operations are described as part of the data center switch. I hope the structure of the post will make it easy for you to find the part you need!

Warning/Note: If you are unable to get hardware easily, this won’t work for you. The Data Center (DC) Switch is well adapted to cloud environments or to teams with spare/available hardware, as we will need to create a new data center before we can actually free up the previously used machines. Generally we use the same number of nodes in the new data center, but this is not mandatory; if the hardware changes, use a proportional number of machines to keep performance unchanged.

Definition

First things first, what is a “Data Center Switch” in our Apache Cassandra context?

The idea is to transition to a new data center, freshly added for this operation, and then to remove the old one. In between, clients need to be switched to the new data center.

The logical isolation between data centers in Cassandra’s topology helps keep this operation safe and allows you to roll back the operation at almost any stage with little effort. This technique does not generate any downtime when it is well executed.

Use Cases

A DC Switch can be used for changes that cannot be easily or safely performed within an existing data center, through JMX, or even with a rolling restart. It can also allow you to make changes such as modifying the number of vnodes in use for a cluster, or to change the hardware without creating unwanted transitional states where some nodes would have distinct hardware (or operating system).

It could also be used for a very critical upgrade. Note that with an upgrade it’s important to keep in mind that streaming in a cluster running mixed versions of Cassandra is not recommended. In this case, it’s better to add the new data center using the same Cassandra version, feed it with the data from the old data center, and only then upgrade the Cassandra version in the new data center. When the new data center is considered stable, the clients can be routed to it and the old one can be removed.

It might even fit your needs in some case I cannot think of right now. Once you know and understand this process, you might find yourself making original use of it.

Limitations and Risks

  • Again, this is a good fit for a cloud environment or when getting new servers. In some cases it might not be possible (or is not worth it) to double the hardware in use, even for a short period of time.
  • You cannot use the DC Switch to change global settings like the cluster_name, as this value is unique for the whole cluster.

How-to - DC Switch

This section provides a detailed runbook to perform a Data Center Switch.

We will cover the following phases of a data center switch:

  • Phase 0 - Prepare Configuration for Multiple Data Centers
  • Phase 1 - Add a New Data Center
  • Phase 2 - Switch Clients to the New DC
  • Phase 3 - Remove the Old DC

The phases described here are mostly independent, and at the end of each phase the cluster should be in a stable state. You can run only Phase 1 (add a DC) or Phase 3 (remove a DC), for example. You should always read about and pay attention to Phase 0, though.

Rollback

Before we start, it is important to note that until the very last step, anything can be easily rolled back. You can always safely and quickly go back to the previous state during the procedure. Simply stated, you should be able to run the opposite commands to the ones shown here, in reverse order, until the cluster is back in an acceptable state. It’s good to check (and store!) the value of any configuration we are about to change, before each step.
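For example, if you have already added the new data center to a keyspace’s replication settings (Step 8 below) and need to roll back, altering the keyspace back to its previous definition undoes that step (names are placeholders):

ALTER KEYSPACE <my_ks> WITH replication = {
    'class': 'NetworkTopologyStrategy',
    '<old_dc_name>': '<replication_factor>'
};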

Keep in mind that after each phase below the cluster is in a stable state and can remain like this for a long time, or forever, without problems.

Phase 0 - Prepare Configuration for Multiple Data Centers

First, the preparation phase. We want to make sure Cassandra and Clients will react as expected in a Multi-DC environment.

Server Side (Cassandra)

Step 1: All the keyspaces use NTS

  • Confirm that each user keyspace is using the NetworkTopologyStrategy:
$ cqlsh -e "DESCRIBE KEYSPACES;"
$ cqlsh -e "DESCRIBE KEYSPACE <my_ks>;" | grep replication
$ cqlsh -e "DESCRIBE KEYSPACE <my_other_ks>;" | grep replication
...

If not, change it to use NetworkTopologyStrategy.

  • Confirm that all the keyspaces (including system_*, possibly opscenter, …) also use either the NetworkTopologyStrategy or the LocalStrategy. To be clear:
    • keyspaces using SimpleStrategy (i.e. possibly system_auth, system_distributed, …) should be switched to NetworkTopologyStrategy.
    • system keyspace or any other keyspace, using LocalStrategy should not be changed.

Note: SimpleStrategy is rarely a good choice, and none of the keyspaces should use it in a multi-DC context. This is because client operations could then touch both data centers to answer reads, breaking the expected isolation between the data centers.

$ cqlsh -e "DESCRIBE KEYSPACE system_auth;" | grep replication
...

In case you need to change this, do something like:

-- Custom keyspace
ALTER KEYSPACE tlp_labs WITH replication = {
  'class': 'NetworkTopologyStrategy',
  'old-dc': '3'
};
[...]

-- system_* keyspaces from SimpleStrategy to NetworkTopologyStrategy
ALTER KEYSPACE system_auth WITH replication = {
  'class': 'NetworkTopologyStrategy',
  'old-dc': '3'  -- Or more... But it's not today's topic ;-)
};
ALTER KEYSPACE system_distributed WITH replication = {
  'class': 'NetworkTopologyStrategy',
  'old-dc': '3'
};
ALTER KEYSPACE system_traces WITH replication = {
  'class': 'NetworkTopologyStrategy',
  'old-dc': '3'
};
[...]

Warning:

This process might have consequences for data availability. Be sure to understand the impact on token ownership so that you do not break service availability.

You can avoid this problem by mirroring the distribution that SimpleStrategy was producing. By using NetworkTopologyStrategy in combination with GossipingPropertyFileSnitch, and copying the previous SimpleStrategy/SimpleSnitch behaviour (data center and rack names), you should be able to make this change a ‘non-event’, where actually nothing happens from the Cassandra topology perspective.

In other words, if both topologies result in the same logical placement of the nodes, then there is no movement and no risk. If the operation results in a topology change (for example, two clusters previously considered as one), it’s good to consider the consequences ahead of time and to run a full repair after the transition.
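For instance, SimpleSnitch reports every node as being in datacenter1 / rack1, so if that is what nodetool status currently shows for your cluster, a cassandra-rackdc.properties like the following (used with GossipingPropertyFileSnitch) keeps the logical placement identical; adjust the names to whatever your cluster actually reports:

dc=datacenter1
rack=rack1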

Step 2:

Make sure the existing nodes are not using the SimpleSnitch.

Instead, the snitch must be one that considers the data center (and racks).

For example:

 endpoint_snitch: Ec2Snitch

or

 endpoint_snitch: GossipingPropertyFileSnitch

If your cluster is using SimpleSnitch at the moment, be careful changing this value, as you might induce a change in the topology or where the data belongs. It is worth reading in detail about this specific topic if that is the case.

That’s it for now on Cassandra; we are done configuring the server side.

Client Side (Cassandra Drivers)

Since there are many clients out there, you might have to adapt this to your specific case or driver. I take the DataStax Java driver as an example here.

All the clients using the cluster should go through the checks/changes below.

Step 4: Use a DCAware policy and disable remote connections

Cluster cluster = Cluster.builder()
        .addContactPoint(<ip_contact_list_from_old_dc>)
        .withLoadBalancingPolicy(
                DCAwareRoundRobinPolicy.builder()
                        .withLocalDc("<old_dc_name>")
                        .withUsedHostsPerRemoteDc(0)
                        .build()
        ).build();

Step 5: Pin connections to the existing data center

It’s important to use a consistency level that retrieves data from the current data center rather than across the whole cluster. In general, use a consistency level of the form LOCAL_*. If you were using QUORUM before, change it to LOCAL_QUORUM, for example.
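If you also run manual checks from cqlsh during the switch, the same idea applies there; the session consistency can be set with cqlsh’s CONSISTENCY command (shown only as an illustration):

cqlsh> CONSISTENCY LOCAL_QUORUM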

At this point, all the clients should be ready to receive the new data center. Clients should now ignore the new data center’s nodes and only contact the local data center as defined above.

Phase 1 - Add a New Data Center

Step 6: Create and configure new Cassandra nodes

Choose the right hardware and number of nodes for the new data center, then bring the machines up.

Configure the new Cassandra nodes exactly like the old nodes, except for the configuration options you intended to change with the new DC, along with the data center name. How the data center name is defined depends on the snitch you picked: it can be determined by the IP address, by a file, or by the AWS region name.

To perform this change using GossipingPropertyFileSnitch, edit the cassandra-rackdc.properties file on all nodes:

dc=<new_dc>
...

To create a new data center in the same region in AWS, you have to set the dc_suffix option in the cassandra-rackdc.properties file on all nodes:

# to have us-east-1-awesome-cluster for example
dc_suffix=-awesome-cluster

Step 7: Add new nodes to the cluster

Nodes can now be added one at a time: just start Cassandra using the service start method specific to your operating system. For example, on Linux systems we can run:

service cassandra start

Notes:

  • Start with the seed nodes for this new data center. Using two or three nodes as seeds per DC is a standard recommendation.
    - seeds: "<old_ip1>, <old_ip2>, <old_ip3>, <new_ip1>, <new_ip2>, <new_ip3>"
    
  • There should be no streaming; adding a node should be quick. Check the logs to make sure of it: tail -fn 100 /var/log/cassandra/system.log
  • Due to the previous point, the nodes should join quickly as part of the new data center; check nodetool status. Make sure a node appears as UN (Up/Normal) before moving to the next node.

Step 8: Start accepting writes on the new data center

The next step is to accept writes for this data center by changing the topology so that the new DC is also part of the replication strategy:

ALTER KEYSPACE <my_ks> WITH replication = {
    'class': 'NetworkTopologyStrategy',
    '<old_dc_name>': '<replication_factor>',
    '<my_new_dc>': '<replication_factor>'
};
ALTER KEYSPACE <my_other_ks> WITH replication = {
    'class': 'NetworkTopologyStrategy',
    '<old_dc_name>': '<replication_factor>',
    '<my_new_dc>': '<replication_factor>'
};
[...]

Include system keyspaces that should now be using the NetworkTopologyStrategy for replication:

ALTER KEYSPACE system_auth WITH replication = {
  'class': 'NetworkTopologyStrategy',
  '<old_dc_name>': '<replication_factor>',
  '<new_dc_name>': '<replication_factor>'
};
ALTER KEYSPACE system_distributed WITH replication = {
  'class': 'NetworkTopologyStrategy',
  '<old_dc_name>': '<replication_factor>',
  '<new_dc_name>': '<replication_factor>'
};
ALTER KEYSPACE system_traces WITH replication = {
  'class': 'NetworkTopologyStrategy',
  '<old_dc_name>': '<replication_factor>',
  '<new_dc_name>': '<replication_factor>'
};
[...]

Note: Here again, do not modify keyspaces using LocalStrategy.

To make sure that the keyspaces were altered as expected, you can see the ownership with:

nodetool status <my_ks>

Note: this is a good moment to detect any imbalances in ownership and to fix any issue there, before actually streaming the data to the new data center. We detailed this part in the post “How To Set Up A Cluster With Even Token Distribution” that Anthony wrote. You should really check this if you plan to use a low number of vnodes; otherwise, you might go through an operational nightmare trying to handle imbalances.

Step 9: Stream historical data to the new data center

Stream the historical data to the new data center to fill the gap: the new data center is now receiving writes but is still missing all the past data. This is done by running the following on all the nodes of the new data center:

nodetool rebuild <old_dc_name>

The output (in the logs) should look like:

INFO  [RMI TCP Connection(8)-192.168.1.31] ... - rebuild from dc: ..., ..., (All tokens)
INFO  [RMI TCP Connection(8)-192.168.1.31] ... - [Stream ...] Executing streaming plan for Rebuild
INFO  [StreamConnectionEstablisher:1] ... - [Stream ...] Starting streaming to ...
...
INFO  [StreamConnectionEstablisher:2] ... - [Stream ...] Beginning stream session with ...

Note:

  • nodetool setstreamthroughput X can help reduce the burden that streaming places on the nodes answering requests or, the other way around, speed up the transfer.
  • A good way to keep track of the rebuild until it finishes is to run the command above from a screen or tmux session, for example: screen -R rebuild.

Phase 2 - Switch Clients to the new DC

At this point the new data center can be tested and should be a mirror of the previous one, except for the things you changed of course.

Step 10: Client Switch

The clients can now be routed to the new data center. To do so, change the contact points and the data center name. Doing this one client at a time, while observing the impact, is probably the safest way when there are many clients plugged into a single cluster. Back to our Java driver example, it would now look like this:

Cluster cluster = Cluster.builder()
        .addContactPoint(<ip_contact_list_from_new_dc>)
        .withLoadBalancingPolicy(
                DCAwareRoundRobinPolicy.builder()
                        .withLocalDc("<new_dc_name>")
                        .withUsedHostsPerRemoteDc(0)
                        .build()
        ).build();

Note: Before going forward you can (and probably should) make sure that no client is connected to the old nodes anymore. You can do this in a number of ways:

  • Look at netstat for open (native or Thrift) connections:
    netstat -tupawn | grep -e 9042 -e 9160
    
  • Check that the node does not receive local reads (i.e. ReadStage should not increase) and does not act as a coordinator (i.e. RequestResponseStage should not increase).
    watch -d "nodetool tpstats"
    
  • Monitoring system/Dashboards: Ensure there are no local reads heading to the old data center.

Phase 3 - Remove the old DC

Step 11: Stop replicating data in the old data center

Alter the keyspaces so they no longer reference the old data center:

ALTER KEYSPACE <my_ks> WITH replication = {
    'class': 'NetworkTopologyStrategy',
    '<my_new_dc>': '<replication_factor>'
};
ALTER KEYSPACE <my_other_ks> WITH replication = {
    'class': 'NetworkTopologyStrategy',
    '<my_new_dc>': '<replication_factor>'
};

[...]

ALTER KEYSPACE system_auth WITH replication = {
  'class': 'NetworkTopologyStrategy',
  '<new_dc_name>': '<replication_factor>'
};
ALTER KEYSPACE system_distributed WITH replication = {
  'class': 'NetworkTopologyStrategy',
  '<new_dc_name>': '<replication_factor>'
};
ALTER KEYSPACE system_traces WITH replication = {
  'class': 'NetworkTopologyStrategy',
  '<new_dc_name>': '<replication_factor>'
};

Step 12: Decommission old nodes

Finally, we need to get rid of the old nodes. Stopping the nodes is not enough, as Cassandra will continue to expect them to come back to life at any time. We want them out, safely, but once and for all. This is what a decommission does. To cleanly remove the old data center we need to decommission all of its nodes, one by one.

On the bright side, the operation should be almost instantaneous (or at least very quick), because this data center no longer owns any data from a Cassandra perspective. Thus, there should be no data to stream to other nodes. If streaming happens, you probably forgot about a keyspace using SimpleStrategy, or a NetworkTopologyStrategy keyspace that still references the old data center.

Sequentially, on each node of the old data center, run:

$ nodetool decommission

This should be fast, not to say immediate, as the command should trigger no streaming at all thanks to the changes we made to the keyspace replication configuration. This data center should not own any token ranges anymore, as we removed it from all the keyspaces in the previous step.

Step 13: Remove old nodes from the seeds

To remove any reference to the old data center, we need to update the cassandra.yaml file.

- seeds: "<new_ip1>, <new_ip2>, <new_ip3>"

And that’s it! You have now successfully switched over to a new data center. Of course, throughout the process, ensure that the changes you make are actually acknowledged in the new data center.

In-Memory Scylla, or Racing the Red Queen

“Now, here, you see, it takes all the running you can do, to keep in the same place. If you want to get somewhere else, you must run at least twice as fast as that!” — The Red Queen to Alice, Alice Through the Looking Glass

In the world of Big Data, if you are not constantly evolving you are already falling behind. This is at the heart of the Red Queen syndrome, which was first applied to the evolution of natural systems. It applies just as much to the evolution of technology. ‘Now! Now!’ cried the Queen. ‘Faster! Faster!’ And so it is with Big Data.

Over the past decade, many databases have shifted from storing all their data on Hard Disk Drives (HDDs) to Solid State Drives (SSDs) to drop latencies to just a few milliseconds. To get ever-closer to “now.” The whole industry continues to run “twice as fast” just to stay in place.

So as fast storage NVMe drives become commonplace in the industry, they practically relegate SATA SSDs to legacy status; they are becoming “the new HDDs”.

For some use cases even NVMe is still too slow, and users need to move their data to in-memory deployments instead, where speeds for Random Access Memory (RAM) are measured in nanoseconds. Maybe not in-memory for everything — first, because in-memory isn’t persistent, and also because it can be expensive! — but at least for their most speed-intensive data.

All this acceleration is certainly good news for I/O-intensive, latency-sensitive applications, which will now be able to use those storage devices as a substrate for workloads that used to need to be kept in memory for performance reasons. However, does the speed of access on those devices really match what is advertised? And which workloads are most likely to need the extra speed provided by hosting their data in-memory?

In this article we will examine the performance claims of latency-bound access on real NVMe devices and show that there is still a place for in-memory solutions for extremely latency-sensitive applications. To address those workloads, ScyllaDB added an in-memory option to Scylla Enterprise 2018.1.7. We will discuss how that all ties together in a real database like Scylla and how users can benefit from the new addition to the Scylla product.

Storage Speed Hierarchy

Various storage devices have different access speeds. Faster devices are usually more expensive and have less capacity. The table below shows a brief summary of devices in broad use in modern servers and their access latencies.

Device                  Latency
Register                1 cycle
Cache                   2-10 ns
DRAM                    100-200 ns
NVMe                    10-100 μs
SATA SSD                400 μs
Hard Disk Drive (HDD)   10 ms

It would of course be great to have all your data in the fastest storage available, registers or cache, but if your data fits there it is probably not a Big Data environment. On the other hand, if the workload is backed by a spinning disk, it is hard to expect good latencies for requests that need to access the underlying storage. Considering the size vs. speed tradeoff, NVMe does not look so bad here. Moreover, in real-life situations the workload needs to fetch data from various places in the storage array to compose a request. In a hypothetical scenario in which two files are accessed for every storage-bound request, with an access time of around 50μs each, the cost of a storage-bound request is around 100μs, which is not too bad at all. But how reliable are those access numbers in real life?

Real World Latencies

In practice, though, we see that NVMe latencies may be much higher than that, even higher than what spinning disks provide. There are a couple of reasons for this. First, a technology limitation: an SSD becomes slower as it fills up and data is written and rewritten. The reason is that an SSD has an internal Garbage Collection (GC) process that looks for free blocks, and it becomes more time consuming the less free space there is. We have seen that some disks may have latencies of hundreds of milliseconds in worst-case scenarios. To avoid this problem, freed blocks have to be explicitly discarded by the operating system to make GC unnecessary. This is done by running the fstrim utility periodically (which we absolutely recommend doing), but ironically, fstrim running in the background may cause latencies by itself. Another reason for larger-than-promised latencies is that a query does not run in isolation. In a real I/O-intensive system like a database, there are usually a lot of latency-sensitive accesses, such as queries, that run in parallel and consume disk bandwidth concurrently with high-throughput patterns like bulk writes and data reorganization (such as compactions in ScyllaDB). As a result, latency-sensitive requests may end up in a device queue, resulting in increased tail latency.

It is possible to observe all of these scenarios in practice with the ioping utility. ioping is very similar to the well-known ping networking utility, but instead of sending requests over the network it sends them to a disk. Here are the results of the test we ran on AWS:

No other IO:
99 requests completed in 8.54 ms, 396 KiB read, 11.6 k iops, 45.3 MiB/s
generated 100 requests in 990.3 ms, 400 KiB, 100 iops, 403.9 KiB/s
min/avg/max/mdev = 59.6 us / 86.3 us / 157.8 us / 27.2 us

Read/Write fio benchmark:
99 requests completed in 34.2 ms, 396 KiB read, 2.90 k iops, 11.3 MiB/s
generated 100 requests in 990.3 ms, 400 KiB, 100 iops, 403.9 KiB/s
min/avg/max/mdev = 73.0 us / 345.2 us / 5.74 ms / 694.3 us

fstrim:
99 requests completed in 300.3 ms, 396 KiB read, 329 iops, 1.29 MiB/s
generated 100 requests in 1.24 s, 400 KiB, 80 iops, 323.5 KiB/s
min/avg/max/mdev = 62.2 us / 3.03 ms / 83.4 ms / 14.5 ms
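Comparisons like this can be reproduced with a plain ioping run against the data directory; the path and request count below are illustrative:

$ ioping -c 100 /var/lib/scylla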

As we can see, under normal conditions the disk provides latencies in the promised range, but when the disk is under load, the maximum latency can be very high.

Scylla Node Storage Model

To understand the benefit of keeping all the data in memory, we need to consider how Scylla’s storage model works. Here is a schematic describing the storage model of a single node.

Scylla Storage Model

When the database is queried, a node tries to locate the requested data in the cache and memtables, both of which reside in RAM. If the data is in the cache, good: all that is needed is to combine the data from the cache with the data from the memtable (if any) and a reply can be sent right away. But what if the cache has no data (and no indication that the data is absent from permanent storage as well)? In that case, the bottom part of the diagram has to be invoked and storage has to be contacted.

The format the data is stored in is called an sstable. Depending on the configured compaction strategy, on how recently the queried data was written, and on other factors, multiple sstables may have to be read to satisfy a request. Let’s take a closer look at the sstable format.

Very Brief Description Of the SSTable Format

Each sstable consists of multiple files. Here is a list of files for a hypothetical non-compressed sstable.

la-1-big-CRC.db
la-1-big-Data.db
la-1-big-Digest.sha1
la-1-big-Filter.db
la-1-big-Index.db
la-1-big-Scylla.db
la-1-big-Statistics.db
la-1-big-Summary.db
la-1-big-TOC.txt

Most of those files are very small and their content is kept in memory while the sstable is open. But there are two exceptions: the Data and Index files.

SSTable Data Format

The Data file stores the actual data. It is sorted according to partition keys, which makes binary search possible. But searching for a specific key in a large file may require a lot of disk accesses, so to make the task more efficient there is another file, Index, that holds a sorted list of keys and offsets into the Data file where the data for those keys can be found.

As one can see, each access to an sstable requires at least two reads from disk (it may be more, depending on the size of the data that has to be read and the position of the key in the index file).
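
The general idea is easy to sketch in Python: binary-search the sorted index for the key, then seek into the Data file at the recorded offset. This is an illustration of the lookup pattern only, not the actual sstable binary format, and it assumes the index entries have already been loaded into a list.

import bisect

def lookup(key, index_entries, data_path):
    # index_entries: sorted list of (key, (offset, length)) pairs taken from the Index file.
    keys = [k for k, _ in index_entries]
    i = bisect.bisect_left(keys, key)       # binary search, possible because keys are sorted
    if i == len(keys) or keys[i] != key:
        return None                         # the key is not present in this sstable
    offset, length = index_entries[i][1]
    with open(data_path, "rb") as data:     # second read: fetch the row from the Data file
        data.seek(offset)
        return data.read(length)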

Benchmarking Scylla

Let’s look at how those maximum latencies can affect the behaviour of Scylla. The benchmark was run on a cluster in Google Compute Engine (GCE) with one NVMe disk. We have found NVMe on GCE to be somewhat slow, which in a way helps to emphasize the in-memory benefits. Below is a graph of the 99th percentile latency for accesses to two different tables: one is a regular table on an NVMe disk (red) and the other is in memory (green).

p99 Latencies In-Memory vs. SSD

The 99th percentile latency for the on-disk table is much higher and has much more variation in it. There is another line in the graph (in blue) that plots the number of compactions running in the system. It can be seen that the blue graph matches the red one, which means that the 99th percentile latency of an on-disk table is greatly affected by the compaction process. The high on-disk latencies here are a direct result of tail latencies that occurred because user reads were queued behind compaction reads.

A 20 ms P99 is not typical for Scylla, but in this case a single, not-so-fast NVMe disk was used. Adding more NVMe disks in a RAID 0 setup allows for more parallelism and mitigates the negative effects of queuing, but doing so also increases the price of the setup; at some point it erases the price benefits of using NVMe while not necessarily achieving the same performance as an in-memory setup. The in-memory setup lets you get low, consistent latency at a reasonable price point.

Configuration

Two new configuration steps are needed to make use of the feature. First, one needs to specify how much memory should be reserved for in-memory sstable storage. This can be done by adding in_memory_storage_size_mb to the scylla.yaml file or by specifying --in-memory-storage-size-mb on the command line. Once memory is reserved, an in-memory table can be created by executing:

CREATE TABLE ks.cf (
     key blob PRIMARY KEY,
     "C0" blob
) WITH compression = {}
  AND read_repair_chance = 0.0
  AND speculative_retry = 'ALWAYS'
  AND in_memory = 'true'
  AND compaction = {'class': 'InMemoryCompactionStrategy'};

Note the new in_memory property, which is set to true, and the new compaction strategy. Strictly speaking, InMemoryCompactionStrategy is not required for in-memory tables, but it compacts much more aggressively to get rid of data duplication as fast as possible and save memory.

Note that mixing in-memory and regular tables is supported.

Conclusion

Despite what the vendors may say, real-world storage devices can present high tail latencies in the face of competing requests, even newer technology like NVMe. Workloads that cannot tolerate a jump in latencies under any circumstances can benefit greatly from the new Scylla Enterprise in-memory feature. If, on the other hand, a workload can cope with occasionally higher latency for a small number of requests, it is beneficial to let Scylla manage what data is held in memory with its usual caching mechanism and to use regular on-disk tables with fast NVMe storage.

Find out more about Scylla Enterprise

The post In-Memory Scylla, or Racing the Red Queen appeared first on ScyllaDB.

Scylla Open Source Release 2.3.3

Scylla Software Release

The Scylla team announces the release of Scylla Open Source 2.3.3, a bugfix release of the Scylla Open Source 2.3 stable branch. Release 2.3.3, like all past and future 2.3.y releases, is backward compatible and supports rolling upgrades.

Note that the latest stable release of Scylla Open Source is release 3.0 and you are encouraged to upgrade to it.

Issues solved in this release:

  • A race condition between the “nodetool snapshot” command and Scylla running compactions may result in a nodetool error: Scylla API server HTTP POST to URL ‘/storage_service/snapshots’ failed: filesystem error: link failed: No such file or directory #4051
  • A bootstrapping node doesn’t wait for schema before joining the ring, which may result in the node failing to bootstrap with the error “storage_proxy – Failed to apply mutation”. In particular, this error manifests when a user-defined type is used. #4196
  • Counters: Scylla rejects SSTables that contain counters that were created by Cassandra 2.0 and earlier. Due to #4206, Scylla mistakenly rejected some SSTables that were created by Cassandra 2.1 as well.
  • Core: In very rare cases, the commit log replay fails. Commit log replay is used after a node is unexpectedly restarted. #4187
  • In some rare cases, Scylla exited during service stop. #4107

The post Scylla Open Source Release 2.3.3 appeared first on ScyllaDB.

How To Set Up A Cluster With Even Token Distribution

Apache Cassandra is fantastic for storing large amounts of data and being flexible enough to scale out as the data grows. This is all fun and games until the data that is distributed in the cluster becomes unbalanced. In this post we will go through how to set up a cluster with predictive token allocation using the allocate_tokens_for_keyspace setting, which will help to evenly distribute the data as it grows.

Unbalanced clusters are bad mkay

An unbalanced load on a cluster means that some nodes will contain more data than others. An unbalanced cluster can be caused by the following:

  • Hot spots - by random chance one node ends up responsible for a higher percentage of the token space than the other nodes in the cluster.
  • Wide rows - due to data modelling issues, for example a partition which grows significantly larger than the other partitions in the table.

The above issues can have a number of impacts on individual nodes in the cluster; however, this is a completely different topic and deserves a more detailed post. In summary though, a node that contains disproportionately more tokens and/or data than other nodes in the cluster may experience one or more of the following issues:

  • Run out of storage more quickly than the other nodes.
  • Serve more requests than the other nodes.
  • Suffer from higher read and write latencies than the other nodes.
  • Time to run repairs is longer than other nodes.
  • Time to run compactions is longer than other nodes.
  • Time to replace the node if it fails is longer than other nodes.

What about vnodes, don’t they help?

Both issues that cause data imbalance in the cluster (hot spots, wide rows) can be prevented by manual control. That is, specify the tokens using the initial_token setting in the cassandra.yaml file for each node and ensure your data model evenly distributes data across the cluster. The second control measure (data modelling) is something we always need to do when adding data to Cassandra. The first point, however, defining the tokens manually, is cumbersome when maintaining a cluster, especially when growing or shrinking it. As a result, token management was automated early on in Cassandra (version 1.2 - CASSANDRA-4119) through the introduction of Virtual Nodes (vnodes).

Vnodes break up the available range of tokens into smaller ranges, defined by the num_tokens setting in the cassandra.yaml file. The vnode ranges are randomly distributed across the cluster and are generally non-contiguous. If we use a large number for num_tokens to break up the token ranges, the random distribution means it is less likely that we will have hot spots. Statistical computation showed that clusters of any size consistently achieved a good token range balance once 256 vnodes were used. Hence, the num_tokens default value of 256 was recommended by the community to prevent hot spots in a cluster. The problem is that performance for operations requiring token range scans (e.g. repairs, Spark operations) tanks big time at that setting. It can also cause problems with bootstrapping due to the large number of SSTables generated. Furthermore, as Joseph Lynch and Josh Snyder pointed out in a paper they wrote, the higher the value of num_tokens in large clusters, the higher the risk of data unavailability.
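
A quick simulation shows why a small num_tokens combined with random allocation is risky. This is illustrative only: it ignores racks and replication and is not Cassandra's allocation code; it simply measures how much of the ring the most loaded node owns on average.

import random

def avg_worst_ownership(num_nodes=9, num_tokens=4, trials=200):
    # Randomly place num_tokens tokens per node and measure the worst node's share of the ring.
    ring_size = 2 ** 64
    worst = []
    for _ in range(trials):
        tokens = sorted((random.randrange(ring_size), node)
                        for node in range(num_nodes)
                        for _ in range(num_tokens))
        owned = [0] * num_nodes
        for i, (token, node) in enumerate(tokens):
            prev = tokens[i - 1][0] if i else tokens[-1][0] - ring_size
            owned[node] += token - prev     # each token owns the range back to the previous token
        worst.append(max(owned) / ring_size)
    return sum(worst) / trials

# A perfectly balanced 9-node cluster would give each node about 0.111 of the ring.
print("4 random vnodes  :", round(avg_worst_ownership(num_tokens=4), 3))
print("256 random vnodes:", round(avg_worst_ownership(num_tokens=256), 3))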

Token allocation gets smart

This paints a pretty grim picture of vnodes, and as far as operators are concerned, they are caught between a rock and a hard place when selecting a value for num_tokens. That was until Cassandra version 3.0 was released, which brought with it a more intelligent token allocation algorithm thanks to CASSANDRA-7032. Using a ranking system, the algorithm takes the replication factor of a keyspace, the number of tokens, and the partitioner as inputs to derive token ranges that are evenly distributed across the nodes in the cluster.

The algorithm is configured by settings in the cassandra.yaml configuration file. The configuration file already contained the settings the algorithm needs, with the exception of one to specify the keyspace name. When the algorithm was added, the allocate_tokens_for_keyspace setting was introduced into the configuration file. The setting allows a keyspace name to be specified so that, during the bootstrap of a node, the keyspace is queried for its replication factor, which is then passed to the token allocation algorithm.

However, therein lies the problem: for existing clusters, updating this setting is easy, as a keyspace already exists; but for a cluster starting from scratch we have a chicken-and-egg situation. How do we specify a keyspace that doesn’t exist!? And there are other caveats, too…

  • It works for only a single replication factor. As long as all the other keyspaces use the same replication factor as the one specified for allocate_tokens_for_keyspace, all is fine. However, if you have keyspaces with a different replication factor, they can potentially cause hot spots.
  • It works only when nodes are added to the cluster. The process for token distribution when a node is removed from the cluster remains unchanged, and hence can cause hot spots.
  • It works with only the default partitioner, Murmur3Partitioner.

Additionally, this is no silver bullet for all unbalanced clusters; we still need to make sure we have a data model that evenly distributes data across partitions. Wide partitions can still be an issue and no amount of token shuffling will fix this.

Despite these drawbacks, this feature gives us the ability to allocate tokens in a more predictable way whilst leveraging the advantages of vnodes. This means we can specify a small value for vnodes (e.g. 4) and still be able to avoid hot spots. The question then becomes, in the case of starting a brand new cluster from scratch, which comes first, the chicken or the egg?

One does not simply start a cluster… with evenly distributed tokens

While it might be possible to rectify an unbalanced cluster caused by unfortunate token allocations, it is better for the token allocation to be set up correctly when the cluster is created. To set up a brand new cluster that takes advantage of the allocate_tokens_for_keyspace setting we need to use the following steps. The method below takes into account a cluster with nodes spread across multiple racks. The examples used in each step assume that our cluster will be configured as follows:

  • 4 vnodes (num_tokens = 4).
  • 3 racks with a single seed node in each rack.
  • A replication factor of 3, i.e. one replica per rack.

1. Calculate and set tokens for the seed node in each rack

We will need to set the tokens for the seed nodes in each rack manually. This is to prevent each node from randomly calculating its own token ranges. We can calculate the token ranges that we will use for the initial_token setting using the following python code:

$ python

Python 2.7.13 (default, Dec 18 2016, 07:03:39)
[GCC 4.2.1 Compatible Apple LLVM 8.0.0 (clang-800.0.42.1)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> num_tokens = 4
>>> num_racks = 3
>>> print "\n".join(['[Node {}] initial_token: {}'.format(r + 1, ','.join([str(((2**64 / (num_tokens * num_racks)) * (t * num_racks + r)) - 2**63) for t in range(num_tokens)])) for r in range(num_racks)])
[Node 1] initial_token: -9223372036854775808,-4611686018427387905,-2,4611686018427387901
[Node 2] initial_token: -7686143364045646507,-3074457345618258604,1537228672809129299,6148914691236517202
[Node 3] initial_token: -6148914691236517206,-1537228672809129303,3074457345618258600,7686143364045646503
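
Note that the session above uses Python 2, where / performs integer division. A Python 3 equivalent (using //) that produces the same output would look like this:

num_tokens = 4
num_racks = 3
for r in range(num_racks):
    tokens = [(2**64 // (num_tokens * num_racks)) * (t * num_racks + r) - 2**63
              for t in range(num_tokens)]
    print("[Node {}] initial_token: {}".format(r + 1, ",".join(map(str, tokens))))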

We can then uncomment the initial_token setting in the cassandra.yaml file on each of the seed nodes, set it to the value generated by our Python command, and set the num_tokens setting to the number of vnodes. When the node first starts, the value of the initial_token setting will be used; subsequent restarts will use the num_tokens setting.

Note that we need to manually calculate and specify the initial tokens for only the seed node in each rack. All other nodes will be configured differently.

2. Start the seed node in each rack

We can start the seed nodes one at a time using the following command:

$ sudo service cassandra start

When we watch the logs, we should see messages similar to the following appear:

...
INFO  [main] ... - This node will not auto bootstrap because it is configured to be a seed node.
INFO  [main] ... - tokens manually specified as [-9223372036854775808,-4611686018427387905,-2,4611686018427387901]
...

After starting the first of the seed nodes, we can use nodetool status to verify that 4 tokens are being used:

$ nodetool status
Datacenter: dc1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address       Load       Tokens       Owns (effective)  Host ID                               Rack
UN  172.31.36.11  99 KiB     4            100.0%            5d7e200d-ba1a-4297-a423-33737302e4d5  rack1

We will wait for this message to appear in the logs before starting the next seed node in the cluster.

INFO  [main] ... - Starting listening for CQL clients on ...

Once all seed nodes in the cluster are up, we can use nodetool ring to verify the token assignments in the cluster. It should look something like this:

$ nodetool ring

Datacenter: dc1
==========
Address        Rack        Status State   Load            Owns                Token
                                                                              7686143364045646503
172.31.36.11   rack1       Up     Normal  65.26 KiB       66.67%              -9223372036854775808
172.31.36.118  rack2       Up     Normal  65.28 KiB       66.67%              -7686143364045646507
172.31.43.239  rack3       Up     Normal  99.03 KiB       66.67%              -6148914691236517206
172.31.36.11   rack1       Up     Normal  65.26 KiB       66.67%              -4611686018427387905
172.31.36.118  rack2       Up     Normal  65.28 KiB       66.67%              -3074457345618258604
172.31.43.239  rack3       Up     Normal  99.03 KiB       66.67%              -1537228672809129303
172.31.36.11   rack1       Up     Normal  65.26 KiB       66.67%              -2
172.31.36.118  rack2       Up     Normal  65.28 KiB       66.67%              1537228672809129299
172.31.43.239  rack3       Up     Normal  99.03 KiB       66.67%              3074457345618258600
172.31.36.11   rack1       Up     Normal  65.26 KiB       66.67%              4611686018427387901
172.31.36.118  rack2       Up     Normal  65.28 KiB       66.67%              6148914691236517202
172.31.43.239  rack3       Up     Normal  99.03 KiB       66.67%              7686143364045646503

We can then move to the next step.

3. Create only the keyspace for the cluster

On any one of the seed nodes we will use cqlsh to create the cluster keyspace using the following commands:

$ cqlsh NODE_IP_ADDRESS -u ***** -p *****

Connected to ...
[cqlsh 5.0.1 | Cassandra 3.11.3 | CQL spec 3.4.4 | Native protocol v4]
Use HELP for help.
cassandra@cqlsh>
cassandra@cqlsh> CREATE KEYSPACE keyspace_with_replication_factor_3
    WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': 3}
    AND durable_writes = true;

Note that this keyspace can have any name; it can even be the keyspace that will contain the tables we will use for our data.

4. Set the number of tokens and the keyspace for all remaining nodes

We will set the num_tokens and allocate_tokens_for_keyspace settings in the cassandra.yaml file on all of the remaining nodes as follows:

num_tokens: 4
...
allocate_tokens_for_keyspace: keyspace_with_replication_factor_3

We have assigned the allocate_tokens_for_keyspace value to be the name of the keyspace created in the previous step. Note that at this point the Cassandra service on all the other nodes is still down.

5. Start the remaining nodes in the cluster, one at a time

We can start the remaining nodes in the cluster using the following command:

$ sudo service cassandra start

When we watch the logs, we should see messages similar to the following, indicating that the new token allocation algorithm is being used:

INFO  [main] ... - JOINING: waiting for ring information
...
INFO  [main] ... - Using ReplicationAwareTokenAllocator.
WARN  [main] ... - Selected tokens [...]
...
INFO  ... - JOINING: Finish joining ring

As per step 2 when we started the seed nodes, we will wait for this message to appear in the logs before starting the next node in the cluster.

INFO  [main] ... - Starting listening for CQL clients on ...

Once all the nodes are up, our shiny, new, evenly-distributed-tokens cluster is ready to go!

Proof is in the token allocation

While we can learn a fair bit from talking about the theory behind the allocate_tokens_for_keyspace setting, it is still good to put it to the test and see what difference it makes when used in a cluster. I decided to create two clusters running Apache Cassandra 3.11.3 and compare the load distribution after inserting some data. For this test, I provisioned both clusters with 9 nodes using tlp-cluster and generated load using tlp-stress. Both clusters used 4 vnodes, but one of the clusters was set up using the even token distribution method described above.

Cluster using random token allocation

I started with a cluster that uses the traditional random token allocation system. For this cluster I set num_tokens: 4 and endpoint_snitch: GossipingPropertyFileSnitch in the cassandra.yaml on all the nodes. Nodes were split across three racks by specifying the rack in the cassandra-rackdc.properties file.

Once the cluster instances were up and Cassandra was installed, I started each node one at a time. After all nodes were started, the cluster looked like this:

ubuntu@ip-172-31-39-54:~$ nodetool status
Datacenter: dc1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address        Load       Tokens       Owns (effective)  Host ID                               Rack
UN  172.31.36.95   65.29 KiB  4            16.1%             4ada2c52-0d1b-45cd-93ed-185c92038b39  rack1
UN  172.31.39.79   65.29 KiB  4            20.4%             c282ef62-430e-4c40-a1d2-47e54c5c8685  rack2
UN  172.31.47.155  65.29 KiB  4            21.2%             48d865d7-0ad0-4272-b3c1-297dce306a34  rack1
UN  172.31.43.170  87.7 KiB   4            24.5%             27aa2c78-955c-4ea6-9ea0-3f70062655d9  rack1
UN  172.31.39.54   65.29 KiB  4            30.8%             bd2d745f-d170-4fbf-bf9c-be95259597e3  rack3
UN  172.31.35.165  70.36 KiB  4            25.5%             056e2472-c93d-4275-a334-e82f87c4b53a  rack3
UN  172.31.35.149  70.37 KiB  4            24.8%             06b0e1e4-5e73-46cb-bf13-626eb6ce73b3  rack2
UN  172.31.35.33   65.29 KiB  4            23.8%             137602f0-3248-459f-b07c-c0b3e647fa48  rack2
UN  172.31.37.129  99.03 KiB  4            12.9%             cd92c974-b32e-4181-9e14-fb52dd27b09e  rack3

I ran tlp-stress against the cluster using the command below. This generated a write-only load that randomly inserted 10 million unique key-value pairs into the cluster. tlp-stress inserted data into a newly created keyspace and table called tlp_stress.keyvalue.

tlp-stress run KeyValue --replication "{'class':'NetworkTopologyStrategy','dc1':3}" --cl LOCAL_QUORUM --partitions 10M --iterations 100M --reads 0 --host 172.31.43.170

After running tlp-stress the cluster load distribution for the tlp_stress keyspace looked like this:

ubuntu@ip-172-31-39-54:~$ nodetool status tlp_stress
Datacenter: dc1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address        Load       Tokens       Owns (effective)  Host ID                               Rack
UN  172.31.36.95   1.29 GiB   4            20.8%             4ada2c52-0d1b-45cd-93ed-185c92038b39  rack1
UN  172.31.39.79   2.48 GiB   4            39.1%             c282ef62-430e-4c40-a1d2-47e54c5c8685  rack2
UN  172.31.47.155  1.82 GiB   4            35.1%             48d865d7-0ad0-4272-b3c1-297dce306a34  rack1
UN  172.31.43.170  3.45 GiB   4            44.1%             27aa2c78-955c-4ea6-9ea0-3f70062655d9  rack1
UN  172.31.39.54   2.16 GiB   4            54.3%             bd2d745f-d170-4fbf-bf9c-be95259597e3  rack3
UN  172.31.35.165  1.71 GiB   4            29.1%             056e2472-c93d-4275-a334-e82f87c4b53a  rack3
UN  172.31.35.149  1.14 GiB   4            26.2%             06b0e1e4-5e73-46cb-bf13-626eb6ce73b3  rack2
UN  172.31.35.33   2.61 GiB   4            34.7%             137602f0-3248-459f-b07c-c0b3e647fa48  rack2
UN  172.31.37.129  562.15 MiB  4            16.6%             cd92c974-b32e-4181-9e14-fb52dd27b09e  rack3

I verified the data load distribution by checking the disk usage on all nodes using pssh (parallel ssh).

ubuntu@ip-172-31-39-54:~$ pssh -ivl ... -h hosts.txt "du -sh /var/lib/cassandra/data"
[1] ... [SUCCESS] 172.31.35.149
1.2G    /var/lib/cassandra/data
[2] ... [SUCCESS] 172.31.43.170
3.5G    /var/lib/cassandra/data
[3] ... [SUCCESS] 172.31.36.95
1.3G    /var/lib/cassandra/data
[4] ... [SUCCESS] 172.31.39.79
2.5G    /var/lib/cassandra/data
[5] ... [SUCCESS] 172.31.35.33
2.7G    /var/lib/cassandra/data
[6] ... [SUCCESS] 172.31.35.165
1.8G    /var/lib/cassandra/data
[7] ... [SUCCESS] 172.31.37.129
564M    /var/lib/cassandra/data
[8] ... [SUCCESS] 172.31.39.54
2.2G    /var/lib/cassandra/data
[9] ... [SUCCESS] 172.31.47.155
1.9G    /var/lib/cassandra/data

As we can see from the above results, there was a large variation in load distribution across the nodes. Node 172.31.37.129 held the smallest amount of data (roughly 560 MB), whilst node 172.31.43.170 held six times that amount (roughly 3.5 GB). Effectively, the difference between the smallest and largest data load is about 3.0 GB!!

Cluster using predictive token allocation

I then moved on to setting up the cluster with predictive token allocation. Similar to the previous cluster, I set num_tokens: 4 and endpoint_snitch: GossipingPropertyFileSnitch in the cassandra.yaml on all the nodes. These settings were common to all nodes in this cluster. Nodes were again split across three racks by specifying the rack in the cassandra-rackdc.properties file.

I set the initial_token setting for each of the seed nodes and started the Cassandra process on them one at a time, with one seed node allocated to each rack in the cluster.

The initial keyspace that would be specified in the allocate_tokens_for_keyspace setting was created via cqlsh using the following command:

CREATE KEYSPACE keyspace_with_replication_factor_3 WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': '3'} AND durable_writes = true;

I then set allocate_tokens_for_keyspace: keyspace_with_replication_factor_3 in the cassandra.yaml file for the remaining non-seed nodes and started the Cassandra process on them one at a time. After all nodes were started, the cluster looked like this:

ubuntu@ip-172-31-36-11:~$ nodetool status keyspace_with_replication_factor_3
Datacenter: dc1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address        Load       Tokens       Owns (effective)  Host ID                               Rack
UN  172.31.36.47   65.4 KiB   4            32.3%             5ece457c-a3af-4173-b9c8-e937f8b63d3b  rack2
UN  172.31.43.239  117.45 KiB  4            33.3%             55a94591-ad6a-48a5-b2d7-eee7ea06912b  rack3
UN  172.31.37.44   70.49 KiB  4            33.3%             93054390-bc83-487c-8940-b99e7b85e5c2  rack3
UN  172.31.36.11   104.3 KiB  4            35.4%             5d7e200d-ba1a-4297-a423-33737302e4d5  rack1
UN  172.31.39.186  65.41 KiB  4            31.2%             ecd00ff5-a90a-4d33-b7ab-bdd22e3e50b8  rack1
UN  172.31.38.137  65.39 KiB  4            33.3%             64802174-885a-4c04-b530-a9b4685b1b96  rack1
UN  172.31.40.56   65.39 KiB  4            33.3%             0846effa-e4ac-4a19-845e-2162cd2b7680  rack3
UN  172.31.36.118  104.32 KiB  4            35.4%             5ad47bc0-9bcc-4fc5-b5b0-0a15ad63345f  rack2
UN  172.31.41.196  65.4 KiB   4            32.3%             4128ca20-b4fa-4173-88b2-aac62539a6d8  rack2

I ran tlp-stress against the cluster using the same command that was used to test the cluster with random token allocation. After running tlp-stress the cluster load distribution for the tlp_stress keyspace looked like this:

ubuntu@ip-172-31-36-11:~$ nodetool status tlp_stress
Datacenter: dc1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address        Load       Tokens       Owns (effective)  Host ID                               Rack
UN  172.31.36.47   2.16 GiB   4            32.3%             5ece457c-a3af-4173-b9c8-e937f8b63d3b  rack2
UN  172.31.43.239  2.32 GiB   4            33.3%             55a94591-ad6a-48a5-b2d7-eee7ea06912b  rack3
UN  172.31.37.44   2.32 GiB   4            33.3%             93054390-bc83-487c-8940-b99e7b85e5c2  rack3
UN  172.31.36.11   1.84 GiB   4            35.4%             5d7e200d-ba1a-4297-a423-33737302e4d5  rack1
UN  172.31.39.186  2.01 GiB   4            31.2%             ecd00ff5-a90a-4d33-b7ab-bdd22e3e50b8  rack1
UN  172.31.38.137  2.32 GiB   4            33.3%             64802174-885a-4c04-b530-a9b4685b1b96  rack1
UN  172.31.40.56   2.32 GiB   4            33.3%             0846effa-e4ac-4a19-845e-2162cd2b7680  rack3
UN  172.31.36.118  1.83 GiB   4            35.4%             5ad47bc0-9bcc-4fc5-b5b0-0a15ad63345f  rack2
UN  172.31.41.196  2.16 GiB   4            32.3%             4128ca20-b4fa-4173-88b2-aac62539a6d8  rack2

I again verified the data load distribution by checking the disk usage on all nodes using pssh.

ubuntu@ip-172-31-36-11:~$ pssh -ivl ... -h hosts.txt "du -sh /var/lib/cassandra/data"
[1] ... [SUCCESS] 172.31.36.11
1.9G    /var/lib/cassandra/data
[2] ... [SUCCESS] 172.31.43.239
2.4G    /var/lib/cassandra/data
[3] ... [SUCCESS] 172.31.36.118
1.9G    /var/lib/cassandra/data
[4] ... [SUCCESS] 172.31.37.44
2.4G    /var/lib/cassandra/data
[5] ... [SUCCESS] 172.31.38.137
2.4G    /var/lib/cassandra/data
[6] ... [SUCCESS] 172.31.36.47
2.2G    /var/lib/cassandra/data
[7] ... [SUCCESS] 172.31.39.186
2.1G    /var/lib/cassandra/data
[8] ... [SUCCESS] 172.31.40.56
2.4G    /var/lib/cassandra/data
[9] ... [SUCCESS] 172.31.41.196
2.2G    /var/lib/cassandra/data

As we can see from the above results, there was little variation in the load distribution across nodes compared to the cluster that used random token allocation. Node 172.31.36.118 held the smallest amount of data (roughly 1.83 GB) and nodes 172.31.43.239, 172.31.37.44, 172.31.38.137, and 172.31.40.56 held the largest amount (roughly 2.32 GB each). The difference between the smallest and largest data load was roughly 500 MB, which is significantly less than the data size difference in the cluster that used random token allocation.

Conclusion

Having a perfectly balanced cluster takes a bit of work and planning. While there are some steps to set up and caveats to using the allocate_tokens_for_keyspace setting, the predictive token allocation is a definite must use when setting up a new cluster. As we have seen from testing, it allows us to take advantage of num_tokens being set to a low value without having to worry about hot spots developing in the cluster.

Distributed Database Things to Know: Consistent Hashing

As with Riak, which I wrote about in 2013, Cassandra remains one of the core active distributed database projects alive today that provides an effective and reliable consistent hash ring for a clustered, distributed database system. A hash function is an algorithm that maps data of variable length to data of a fixed length. Consistent hashing is a kind of hashing that uses this pattern to map keys to particular nodes around the ring in Cassandra.
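
The pattern is easy to sketch. The toy Python below builds a ring of hashed token positions and binary-searches it to find the owning node; it illustrates the idea only and is not Cassandra's implementation (Cassandra uses the Murmur3 partitioner and adds replication on top).

import bisect
import hashlib

class HashRing:
    # Toy consistent-hash ring: each node owns the keys that hash to the arc
    # between the previous token and its own token.
    def __init__(self, nodes, tokens_per_node=4):
        self.ring = sorted((self._hash("{}-{}".format(node, i)), node)
                           for node in nodes for i in range(tokens_per_node))
        self.tokens = [token for token, _ in self.ring]

    @staticmethod
    def _hash(value):
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    def node_for(self, key):
        i = bisect.bisect(self.tokens, self._hash(key)) % len(self.ring)
        return self.ring[i][1]

ring = HashRing(["node1", "node2", "node3"])
print(ring.node_for("user:42"))   # the same key always maps to the same node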

ValuStor — a memcached alternative built on Scylla

Derek Ramsey, Software Engineering Manager at Sensaphone, gave an overview of ValuStor at Scylla Summit 2018. Sensaphone is a maker of remote monitoring solutions for the Industrial Internet of Things (IIoT). Their products are designed to watch over your physical plant and equipment — such as HVAC systems, oil and gas infrastructure, livestock facilities, greenhouses, food, beverage and medical cold storage. Yet there is a lot of software behind the hardware of IIoT. ValuStor is an example of ostensible “hardware guys” teaching the software guys a thing or two.

Overview and Origins of ValuStor

Derek began his Scylla Summit talk with an overview: what is ValuStor? It is a NoSQL memory cache and a persistent database utilizing a Scylla back-end, designed for key-value and document store data models. It was implemented as a single header-only database abstraction layer and comprises three components: the Scylla database, a ValuStor client, and the Cassandra driver (which Scylla can use since it is CQL-compliant).

ValuStor is released as free open source under the MIT license. The MIT license is extremely permissive, allowing anyone to “use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the software.” Which means that you can bake it into your own software, services and infrastructure freely and without reservation.

Derek then went back a bit into history to describe how Sensaphone began down this path. What circumstances gave rise to the origins of ValuStor? It all started when Sensaphone launched a single-node MySQL database. As their system scaled to hundreds of simultaneous connections, they added memcached to help offload their disk write load on the backend database. Even so, they kept scaling, and their system was insufficient to handle the growing load. It wasn’t the MySQL database that was the issue. It was memcached that was unable to handle all the requests.

For the short term, they began by batching requests. Yet Sensaphone needed to address the fundamental architectural issues, including the need for redundancy and scalability. There was also a cold cache performance risk (also known as the “thundering herd” problem) if they ever needed to restart.

No one in big data takes the decision to replace infrastructure lightly. Yet when they did the cutover it was surprisingly quick. It took only three days from inception to get Scylla running in production. And, as of last November, Sensaphone had already been in production for a year.

Since its initial implementation, Sensaphone added two additional use cases: managing web sessions, and for a distributed message queuing fan out for a producer/consumer application (a publish-subscribe design pattern akin to an in-house RabbitMQ or ActiveMQ). Derek recommended that anyone interested check out the published Usage Guide on GitHub.

Comparisons to memcached

Derek made a fair but firm assessment of the limitations of memcached for Sensaphone’s environment. First he cited the memcached FAQ itself, where it says memcached is not recommended for sessions, since if the cache is ever lost you may lock users out of your site. While the PHP manual had a section on sessions in memcached, there is simply no guarantee of survivability of user data.

Second, Derek cited the minimal security implementation (e.g., SASL authentication). Memcached has been used in significant amplification (DDoS) attacks in recent years, and while there are ways to minimize the risks, there is no substitute for built-in end-to-end encryption.

Derek listed the basic, fundamental architectural limitations: “it has no encryption, no failover, no replication, and no persistence. Of course it has no persistence — it’s a RAM cache.” That latter point, while usually counted as a performance feature in memcached’s favor, is exactly what leads to the cold cache problem when your server inevitably crashes or has to be restarted.

Sensaphone was resorting to batching to maintain a semblance of performance, whereas batching is an antipattern for Scylla.

ValuStor: How does it work?

ValuStor Client

Derek described the client design, which kept ease of use first and foremost. There are only two API functions: get and store. (Deletes are not done directly. Instead, setting a time-to-live (TTL) of 1 second on data is effectively a delete.)

Because ValuStor is implemented as an abstraction layer, you can use your data in your native programming language with native data types: integers, floating point, strings, JSON, blobs, bytes, and UUIDs.
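
ValuStor itself is a C++ header-only library, but the same get/store pattern (with TTL-based deletes) is easy to sketch against a Scylla or Cassandra key-value table using the DataStax Python driver. The keyspace and table below are made-up names for illustration, not part of ValuStor.

from cassandra.cluster import Cluster

# Assumed schema: CREATE TABLE kv.cache (key text PRIMARY KEY, value text);
session = Cluster(["127.0.0.1"]).connect("kv")
store_stmt = session.prepare("INSERT INTO cache (key, value) VALUES (?, ?)")
expire_stmt = session.prepare("INSERT INTO cache (key, value) VALUES (?, ?) USING TTL 1")
get_stmt = session.prepare("SELECT value FROM cache WHERE key = ?")

def store(key, value):
    session.execute(store_stmt, (key, value))

def get(key):
    row = session.execute(get_stmt, (key,)).one()
    return row.value if row else None

def delete(key):
    # ValuStor-style delete: re-insert with a 1-second TTL instead of issuing a DELETE.
    session.execute(expire_stmt, (key, ""))

store("greeting", "hello")
print(get("greeting"))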

For fault tolerance, ValuStor also added a client-side write queue for a backlog function, and automatic adaptive consistency (more on this later).

Cassandra Driver

The Cassandra driver supports thread safety and is multi-threaded. “Practically speaking, that means you can throw requests at it and it will automatically scale to use more CPU resources as required and you don’t need to do any special locking,” Derek explained. “Unlike memcached… [where] the C driver is not thread-safe.”

ValuStor also offers connection control, so if a Scylla node goes down it will automatically re-establish the connection with a different node. It is also datacenter-aware and will choose your datacenters intelligently.

Scylla Database Server

The Scylla server at the heart of ValuStor offers various architectural advantages. “First and foremost is performance. With the obvious question, ‘How in the world can a persistent database compete with RAM-only caching?’”

Derek then described how Scylla offers its own async and userspace I/O schedulers. Such architectural features can, at times, result in Scylla responsiveness with sub-millisecond latencies.

Scylla also has its own cache and separate memtable, which acts as a sort of cache. “In our use case at Sensaphone we have 100% cache hits all the time. We never have to hit the disk, even though it has one, and since our database has never actually gone down we’ve never actually even had to load it from disk except for maintenance periods.”

In terms of cache warming, Derek provided some advice, “The cold cache penalty is actually less severe for Scylla if you use heat-weighted load balancing because Scylla will automatically warm up your cache for you for the nodes you restart.”

Derek then turned to the issues of security. His criticisms were sobering: “Memcached is what I call ‘vulnerable by design.’” In the latest major issue, “their solution was simply to disable UDP by default rather than fix the problem.”

“By contrast, ValuStor comes with complete TLS support right out of the box.” That includes client authentication and server certified verification by domain or IP, over-the-wire encryption within and across datacenters, and of course basic password authentication and access control. You can read more about TLS setup in the ValuStor documentation.

“Inevitably, though, the database is going to go offline from the client perspective. Either you have a network outage or you’ll have hardware issues on your database server.” Derek then dove into a key additional feature for fault tolerance: a client-side write queue on the producer side. It buffers requests and performs automatic retries. When the database is back up, it clears its backlog. The client keeps the requests serialized, so that data is not written in the wrong order. “Producers keep on producing and your writes are simply delayed. They aren’t lost.”

Derek then noted “Scylla has great redundancy. You can set your custom data replication factor per keyspace. It can be changed on the fly. And the client driver is aware of this and will route your traffic to the nodes that actually have your data.” You can also set different replication factors per datacenter, and the client is also aware of your multi-datacenter topology.

In terms of availability, Derek reminded the audience of the CAP theorem: “it states you can have Consistency, Availability or Partition tolerance. Pick any two.” This leads to the quorum problem (where you require n/2 + 1 nodes to be available), which can cause fragility issues in multi-datacenter deployments.

To illustrate, Derek showed the following series of graphics:

Quorum Problem: Example 1

Let’s say you have a primary datacenter with three nodes, and a secondary datacenter with two nodes. The outage of any two nodes will not cause a problem in quorum.

Quorum Problem: Example 1 (Primary Datacenter Failure)

However, if your primary datacenter goes offline, your secondary datacenter would not be able to operate on its own if strict adherence to a quorum of n/2 + 1 were required.

Quorum Problem: Example 2

In a second example Derek put forth, if you had a primary with three nodes and two secondary sites, you could still keep operating if the primary site went offline, since there would still be four nodes available, which meets the n/2 + 1 requirement.

Quorum Problem: Example 2 (Secondary site failures)

However, if both of your secondary datacenters went offline, Derek observed this failure would have the unfortunate effect of bringing your primary datacenter down with it, even if there was nothing wrong with your main cluster.
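
A few lines of arithmetic make these failure modes concrete. The node counts below assume two nodes per secondary site in the second example, consistent with the “four nodes” remaining mentioned above.

def quorum_ok(total_nodes, nodes_up):
    # Strict quorum: more than half of the nodes must be reachable.
    return nodes_up >= total_nodes // 2 + 1

# Example 1: primary DC with 3 nodes, one secondary DC with 2 nodes (5 total).
print(quorum_ok(5, 2))   # primary DC down -> 2 of 5 up -> False (unavailable)

# Example 2: primary DC with 3 nodes plus two secondary sites of 2 nodes each (7 total).
print(quorum_ok(7, 4))   # primary down -> 4 of 7 up -> True (still available)
print(quorum_ok(7, 3))   # both secondary sites down -> 3 of 7 up -> False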

“The solution to this problem is automatic adaptive consistency. This is done on the client side.” Since Scylla is an eventually consistent database with tunable consistency, this buys ValuStor “the ability to adaptively downgrade the consistency on a retry of the requests.” This dramatically reduces the likelihood of inconsistency. It also works well with hinted handoffs, which further reduce problems when individual nodes go offline.
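
The adaptive-consistency idea can be sketched with the DataStax Python driver: try the strictest consistency level first and downgrade on failure. This is an illustration of the concept only, not ValuStor's actual C++ implementation; the keyspace and table are the made-up ones from the earlier sketch.

from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("kv")
statement = session.prepare("INSERT INTO cache (key, value) VALUES (?, ?)")

def write_adaptive(key, value):
    # Try QUORUM first, then downgrade the consistency level on a retry.
    for level in (ConsistencyLevel.QUORUM, ConsistencyLevel.ONE, ConsistencyLevel.ANY):
        statement.consistency_level = level
        try:
            session.execute(statement, (key, value))
            return level                    # report the level that succeeded
        except Exception:                   # e.g. not enough replicas available
            continue
    raise RuntimeError("write failed at every consistency level")

write_adaptive("greeting", "hello")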

Derek took the audience on a brief refresher on consistency levels, including ALL, QUORUM, and ONE/ANY. You can learn more about these reading the Scylla documentation and even teach yourself more going through our console demo.

Next, Derek covered the scalability of Scylla. “The Scylla architecture itself is nearly infinitely scalable. Due to the shard-per-core design you can keep throwing new cores and new machines at it and it’s perfectly happy to scale up. With the driver shard-aware, it will automatically route traffic to the appropriate location.” This is contrasted with memcached, which requires manual sharding. “This is not ideal.”

Using ValuStor

Configuring ValuStor is accomplished in a C++ template class. Once you’ve created the table in the database, you don’t even need to write any other CQL queries.

ValuStor Configuration

This is an example of a minimal configuration. There are more options for SSL.

ValuStor Usage

Here is an example of taking a key-value pair and storing it, checking to see whether the operation was successful, performing automatic retries if it was not, and handling errors if the operation still fails.

Comparisons

ValuStor Comparisons

When designing ValuStor Derek emphasized, “We wanted all of the things on the left-hand side. In evaluating some of the alternatives, none of them really met our needs.”

In particular, Derek took a look at the complexity of Redis. It has dozens of commands. It has master-slave replication. And for Derek’s bottom line, it’s not going to perform as well as Scylla. He cited the recent change in licensing to Commons Clause, which has caused some confusion and consternation in the market. He also pointed out that if you do need the complexity of Redis, you can move to Pedis, which uses the Seastar engine at its heart for better performance.

What about MongoDB or CouchDB?

Derek also made comparisons to MongoDB and CouchDB, since ValuStor has full native JSON support and can also be used as a document store. “It’s not as full-featured, but depending on your needs, it might actually be a good solution.” He cited how Mongo also recently went through a widely-discussed licensing change (which we covered in a detailed blog).

Derek Ramsey at Scylla Summit 2018

What’s Next for ValuStor

Derek finished by outlining the feature roadmap for ValuStor.

  • SWIG bindings will allow it to connect to a wide variety of languages
  • Improvements to the command line will allow scripts to use ValuStor
  • Expose underlying Futures, to process multiple requests from a single thread for better performance, and lastly,
  • A non-template configuration option

To learn more, you can watch Derek’s presentation below, check out Derek’s slides, or peruse the ValuStor repository on Github.

The post ValuStor — a memcached alternative built on Scylla appeared first on ScyllaDB.

Testing Cassandra compatible APIs

In this quick blog post, I’m going to assess how the databases that advertise themselves as “Cassandra API-compatible” fare in the compatibility department.

But that is all I will do: only API testing, and not extensive testing, just the APIs I see used most often. Based on this, you can start building an informed decision on whether or not to change databases.

The contenders:

  • Apache Cassandra 4.0
    • Installation: Build from Source
  • Yugabyte – https://www.yugabyte.com/
    • “YCQL is a transactional flexible-schema API that is compatible with the Cassandra Query Language (CQL). “
    • Installation: Docker
  • ScyllaDB – https://www.scylladb.com/
    • “Apache Cassandra’s wire protocol, a rich polyglot of drivers, and integration with Spark, Presto, and Graph tools make for resource-efficient and performance-effective coding.”
    • Installation: Docker
  • Azure Cosmos – https://azure.microsoft.com/en-us/services/cosmos-db/
    • “Azure Cosmos DB provides native support for NoSQL and OSS APIs including MongoDB, Cassandra, Gremlin and SQL”
    • Installation: Azure Portal Wizard

All installations were done with the containers as provided (Cassandra was built from source). Cosmos DB used all the defaults provided by the wizard interface.

The CQL script used to test was this one: https://gist.github.com/cjrolo/f5f3cc02699c06ed1f4909d632d90f8f

What I’m not doing in this blog post: performance testing, feature comparison, and everything else that is not API testing. Those might all be more or less important for other use cases, but they are not in the scope of this post.

What was tested

In this test, the following CQL APIs were tested (see the sketch after this list for one way to script such a check):

  1. Keyspace Creation
  2. Table Creation
  3. Adding a Column to a table (Alter table)
  4. Data Insert
  5. Data Insert with TTL (Time-to-live)
  6. Data Insert with LWT (Lightweight Transactions)
  7. Select Data
  8. Select data with a full table scan (ALLOW FILTERING)
  9. Creating a Secondary Index (2I)
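
One way to script such a check (not the author's gist, just a few representative statements from the list above) is to run them through the DataStax Python driver against each system and record which ones fail:

from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect()

statements = [
    "CREATE KEYSPACE IF NOT EXISTS compat_test "
    "WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}",
    "CREATE TABLE IF NOT EXISTS compat_test.t (id int PRIMARY KEY, v text)",
    "ALTER TABLE compat_test.t ADD extra text",
    "INSERT INTO compat_test.t (id, v) VALUES (1, 'plain')",
    "INSERT INTO compat_test.t (id, v) VALUES (2, 'expiring') USING TTL 60",
    "INSERT INTO compat_test.t (id, v) VALUES (3, 'lwt') IF NOT EXISTS",
    "SELECT * FROM compat_test.t WHERE id = 1",
    "SELECT * FROM compat_test.t WHERE v = 'plain' ALLOW FILTERING",
    "CREATE INDEX IF NOT EXISTS ON compat_test.t (v)",
]

for cql in statements:
    try:
        session.execute(cql)
        print("OK   :", cql[:60])
    except Exception as exc:        # unsupported features surface as driver errors
        print("FAIL :", cql[:60], "->", type(exc).__name__)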

Cassandra 4.0

  • All statements worked (as expected)

CosmosDB

LWT Not supported

ALLOW FILTERING Not supported

2i Not supported

Scylla

LWT Not supported

Yugabyte

2i Not Supported

Results Table

So, with these results, which are not a full comparison (I have left out other parts offered in these systems), you can decide if it is compatible enough for you.

Reaper 1.4 Released

Cassandra Reaper 1.4 was just released with security features that now extend to the whole REST API.

Security Improvements

Reaper 1.2.0 integrated Apache Shiro to provide authentication capabilities in the UI. The REST API remained fully open though, which was a security concern. With Reaper 1.4.0, the REST API is now fully secured and managed by the very same Shiro configuration as the Web UI. JSON Web Tokens (JWTs) were introduced to avoid sending credentials over the wire too often. In addition, spreaper, Reaper’s command line tool, has been updated to provide a login operation and manipulate JWTs.

The documentation was updated with all the necessary information to handle authentication in Reaper and even some samples on how to connect LDAP directories through Shiro.

Note that Reaper doesn’t support authorization features and it is impossible to create users with different rights.
Authentication is now enabled by default for all new installs of Reaper.

Configurable JMX port per cluster

One of the annoying things with Reaper was that it was impossible to use a different port for JMX communications than the default one, 7199.
You could define specific ports per IP, but that was really for testing purposes with CCM.
That long overdue feature has now landed in 1.4.0, and a custom JMX port can be passed when declaring a cluster in Reaper:

Configurable JMX port

TWCS/DTCS tables blacklisting

In general, it is best to avoid repairing DTCS tables, as doing so can generate lots of small SSTables that could fall outside the compaction window and create performance problems. We tend to recommend not repairing TWCS tables either, to avoid replicating timestamp overlaps between nodes that can delay the deletion of fully expired SSTables.

When using the auto-scheduler though, it is impossible to specify blacklists, as all keyspaces and all tables get automatically scheduled by Reaper.

Based on the initial PR from Dennis Kline that was then re-worked by our very own Mick, a new configuration setting allows automatic blacklisting of TWCS and DTCS tables for all repairs:

blacklistTwcsTables: false

When set to true, Reaper will discover the compaction strategy of all tables in the keyspace and remove any table using either DTCS or TWCS from the repair, unless those tables are explicitly passed in the list of tables to repair.

Web UI improvements

The Web UI reported decommissioned nodes that still appeared in the gossip state of the cluster with a Left state. This has been fixed, and such nodes are no longer displayed.
Another bug was the number of tokens reported in the node detail panel, which was nowhere near matching reality. We now display the correct number of tokens, and clicking on this number opens a popup containing the list of tokens the node is responsible for:

Tokens

Work in progress

Work in progress will introduce the Sidecar Mode, which will colocate a Reaper instance with each Cassandra node and support clusters where JMX access is restricted to localhost.
This mode is actively being worked on, and the branch already has working repairs.
We’re now refactoring the code and porting other features to this mode, such as snapshots and metric collection.
This mode will also allow for adding new features and permit Reaper to scale better with the clusters it manages.

Upgrade to Reaper 1.4.0

The upgrade to 1.4 is recommended for all Reaper users. The binaries are available from yum, apt-get, Maven Central, Docker Hub, and are also downloadable as tarball packages. Remember to backup your database before starting the upgrade.

All instructions to download, install, configure, and use Reaper 1.4 are available on the Reaper website.