Scylla Open Source Release 2.3.5

Scylla Open Source Release Notes

The Scylla team announces the release of Scylla Open Source 2.3.5, a bugfix release of the Scylla Open Source 2.3 stable branch. Scylla Open Source Release 2.3.5, like all past and future 2.3.y releases, is backward compatible and supports rolling upgrades.

Note that the latest stable branch of Scylla Open Source is release 3.0; you are encouraged to upgrade to it.

Issues solved in this release:

  • Running scylla_setup on CentOS 7 failed with `ImportError: No module named 'yaml'`. The root cause is a change in EPEL that broke the scylla_setup Python script #4379
  • nodetool cleanup may double the used disk space while running #3735
  • Schema change statements can be delayed indefinitely when there are constant schema pulls #4436
  • Serialization of decimal and variant data types to JSON (in SELECT JSON statements) can cause an exception on the client side #4348
  • row_cache: potential abort when populating cache concurrently with MemTable flush #4236

The post Scylla Open Source Release 2.3.5 appeared first on ScyllaDB.

Scylla Open Source Release 3.0.5

Scylla Open Source Release Notes

The ScyllaDB team announces the release of Scylla Open Source 3.0.5, a bugfix release of the Scylla Open Source 3.0 stable branch. Scylla Open Source 3.0.5, like all past and future 3.x.y releases, is backward compatible and supports rolling upgrades.

The major fix in this release is for a regression introduced in Scylla 3.0.4, causing higher latency in the Scylla write path, for users with Hinted Handoff enabled. #4351, #4352, #4354

Other issues solved in this release:

  • Materialized Views: when a node is down, MV update generates redundant timeouts #3826, #3966, #4028
  • Migrating Apache Cassandra SSTable files to Scylla may fail if the file is in the MC format, for tables with compact storage and no clustering columns #4139
  • Rolling downgrade after a full upgrade may fail because new nodes will keep using the new features after a restart #4289
  • Restarting the cluster with some nodes still down causes less efficient streaming/repair and schema disagreement. #4225, #4341
  • Potential access to invalid memory when opening mutation fragment stream #4350
  • row_cache: potential abort when populating cache concurrently with MemTable flush #4236
  • On some HW types, scylla_prepare fails with a redundant error “Failed to write to /proc/irq/412/smp_affinity” #4057
  • Running scylla_setup on CentOS 7 failed with `ImportError: No module named 'yaml'`. The root cause is a change in EPEL that broke the scylla_setup Python script #4379
  • In very rare cases, a schema update might cause a segmentation fault #4410
  • Possible crash when RPC logs an error during compressed frame reads #4400
  • RPC can run out of local ports in large clusters with large nodes #4401
  • perftune.py doesn’t properly ban IRQs from irqbalance on Ubuntu 18 #4402
  • Name resolution (e.g. for seeds) can fail if the DNS server uses TCP instead of UDP #4404
  • Serialization of decimal and variant data types to JSON can cause an exception on the client side (e.g. cqlsh) #4348

The post Scylla Open Source Release 3.0.5 appeared first on ScyllaDB.

Natura’s Short and Straight Road from Cassandra to Scylla

Natura is a multibillion-dollar Brazil-based consumer beauty products company that prides itself on its ecological friendliness, bioethics and sustainable business practices. Its operations span over 70 countries, 3,200 stores, 17,000 employees, and a human network of 1.8 million sales consultants. These operations support sales to over 100 million consumers.

The global powerhouse comprises three major brands: Natura, founded out of Brazil; Aēsop, founded out of Australia; and The Body Shop, founded out of the United Kingdom.

Supporting Natura’s global operations requires managing a tremendous amount of consumer data. Felipe Moz, Big Data Engineer at Natura, took the time at Scylla Summit 2018 to describe Natura’s corporate data infrastructure, and why they migrated from DataStax Enterprise (a proprietary version of Apache Cassandra) to Scylla to continue scaling their business.

Natura’s architecture is front-ended with NGINX and Node.js. Data from user activity on their website is streamed over Apache Kafka into Apache Spark jobs, which process about 160,000 streaming RDDs as well as around 40 “long batch” jobs on a daily basis.

In their architecture, Scylla replaced DSE for key-value data, while MongoDB is still used for document-oriented data. DSE was replaced to avoid JVM issues; Natura had suffered significant performance problems with it.

Natura's Big Data Architecture

Figure 1: Natura’s Big Data Architecture, deployed and managed using Docker and Kubernetes, includes NGINX, Apache Kafka and Apache Spark, NoSQL databases including MongoDB and Scylla, as well as a Talend integration to an Oracle SQL database.

They aren’t our provider. They are a part of our team.

— Felipe Moz, Natura

Beyond the product aspects of Scylla, Felipe emphasized the value of ScyllaDB’s enterprise support. “They know our use case and data modeling.” He was also glad to be able to get direct access to developers when needed through Scylla’s Slack channel.

Then Felipe got down to the bottom line, favorably comparing the cost of Scylla versus DataStax Enterprise. To provision hardware for Datastax Enterprise on Microsoft Azure required five DS14 v2 servers, which cost $2,000 per node for 744 hours (equivalent to a 31-day month), resulting in a monthly cost of $10,000, thus an annual cost of over $120,000 for the required hardware. With Scylla running on AWS i3.4xlarge instances, the monthly cost per node dropped to $913.54, making the monthly hardware costs around $4,600, annualized to less than $55,000. This alone saved Natura over half their annual hardware expenditure.
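As a rough sanity check, the arithmetic behind those figures can be reproduced in a few lines. This is only a sketch using the per-node prices quoted above, assuming the five-node Scylla cluster implied by the totals; actual cloud pricing varies by region, instance generation and reservation model:

# Reproducing the cost comparison above, using the prices quoted in the talk.
DSE_NODES, DSE_PRICE_PER_NODE_MONTH = 5, 2000.00          # DS14 v2 on Azure, 744 hours
SCYLLA_NODES, SCYLLA_PRICE_PER_NODE_MONTH = 5, 913.54     # i3.4xlarge on AWS

dse_monthly = DSE_NODES * DSE_PRICE_PER_NODE_MONTH            # $10,000.00
scylla_monthly = SCYLLA_NODES * SCYLLA_PRICE_PER_NODE_MONTH   # ~$4,567.70

print(f"DSE:    ${dse_monthly:,.2f}/month, ${dse_monthly * 12:,.2f}/year")
print(f"Scylla: ${scylla_monthly:,.2f}/month, ${scylla_monthly * 12:,.2f}/year")
print(f"Savings: {100 * (1 - scylla_monthly / dse_monthly):.0f}% of the hardware bill")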

On top of the cost savings, Felipe noted significant performance benefits. Batch processing times on Scylla dropped on average to 10% of what they used to take on DataStax. In some cases, batch jobs that took 6 hours to run on DSE took less than 10 minutes to run on Scylla (roughly 1/36th of the time, or under 3%).

For streaming analytics, Spark jobs that used to take 76 ms to complete on DSE were accomplished in as little as 6 ms on Scylla.

And for pure database performance, p95 write latencies dropped from 220 milliseconds (ms) on DSE to around 500 microseconds (μs) on Scylla.

Scylla provided Natura with performance an order of magnitude greater than DataStax Enterprise for half the cloud server cost.

You can watch the Scylla Summit 2018 video below, or read Felipe’s slides on the related Tech Talk page.



If you’ve done comparisons of Scylla versus other NoSQL solutions in your own organization, or are about to conduct a performance-related comparison, we’d love to hear from you. Please drop us a line or feel free to join our Slack channel to discuss your results.

The post Natura’s Short and Straight Road from Cassandra to Scylla appeared first on ScyllaDB.

Scylla Cloud Onboarding

Scylla Cloud Migration

We recently launched Scylla Cloud, allowing you to get the most out of Scylla while not having to burden yourself with cluster management tasks.

Worried about scaling up or out? Replacing a node? Monitoring disk usage? Scheduling and keeping an eye on repairs and backups? Taking care of various security settings? Applying hotfixes? Well have no fear! We will take care of all these concerns for you and much more.

In this blog post we will cover the initial onboarding process, including creating an account and spinning up a cluster. In another upcoming blog we will cover specific migration options you have to get your data into Scylla Cloud.

Deploy a Cluster

After you sign up for Scylla Cloud, it’s time to sign in and create your cluster:

  1. Set your cluster name
  2. Set the allowed IPs: your client / driver IPs (IP ranges are supported)
  3. The cloud provider is AWS; select the region for your cluster deployment
  4. Select the instance type to be used (i3 family options, or t2 for developers)
  5. Select the Replication Factor (RF) and number of nodes (the node-count options correspond to the RF)
  6. Optional: Check the box if you wish to enable VPC peering, and enter the Cluster Network CIDR. For each cluster we create a separate VPC, and for each availability zone we create a subnet with 256 IPs. You can choose any private range: 10.x.x.x, 192.168.x.x, or 172.16.x.x–172.31.x.x

    Note: Please choose a CIDR that won’t overlap your existing VPCs (see the overlap-check sketch after these notes).

Note: The distribution across Availability Zones (AZs) corresponds to the RF. At the bottom of the screen you will see an illustration of the node distribution across AZs; nodes are evenly distributed between the AZs.
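Since the cluster CIDR must not overlap any VPC you plan to peer with, a quick check with Python’s standard ipaddress module can save a support ticket later. This is just a sketch; the CIDR values below are placeholders for your own ranges:

import ipaddress

# Placeholder values: the CIDR you plan to give Scylla Cloud and your existing VPC CIDRs.
scylla_cluster_cidr = ipaddress.ip_network("172.31.0.0/16")
existing_vpc_cidrs = ["10.0.0.0/16", "192.168.0.0/24"]

for cidr in existing_vpc_cidrs:
    if scylla_cluster_cidr.overlaps(ipaddress.ip_network(cidr)):
        raise SystemExit(f"{scylla_cluster_cidr} overlaps existing VPC {cidr}; pick another range")

print(f"{scylla_cluster_cidr} does not overlap any existing VPC CIDR")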

Create a Cluster

Launching the Cluster

Next you will be able to see the cluster bootstrapping in progress until it’s completed.

Launching Cluster

Cluster is ready

When the cluster is up and running, you can already observe 3 major metrics in the UI:

  • Requests (Ops/Sec)
  • Storage space – Consumed disk space
  • Load (Scylla cpu-reactor load) – Load generated by Scylla

Cluster status

Add VPC Peering

If you enabled VPC peering, you will now see the screen to configure it. There are 3 steps: Request, Accept and Route.

1st step – Request: To set up the VPC peering, we need you to provide the following details:

  1. AWS Account Number → for example: 868456412743
  2. VPC ID → for example: vpc-094d5a56043du234jf
  3. VPC CIDR → for example: 10.0.0.0/16
  4. Region → for example: us-east-1

AWS Details for VPC peering

Don’t forget to mark the check-box to add the VPC network to the cluster’s allowed IPs (if you do forget, you can still add it later manually).

Press the “Submit VPC Peering Request” button; this takes you to the 2nd step.

2nd step – Accept: Here you will find detailed explanations with screenshots on how to accept the request on your AWS UI.

Once you have completed the needed steps on AWS, go back to the Scylla Cloud UI and press the “Connect” button to get to the third step.

3rd step – Route: Here you will also find detailed steps and screenshots on how to set up the routing tables via the AWS UI.

Once you have completed the steps on AWS, go back to the Scylla Cloud UI and press “Done”.
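If you prefer scripting the Accept and Route steps instead of clicking through the AWS console, the same two actions can be done with boto3. This is only a sketch; the peering-connection ID, route table ID, and CIDR are placeholders for the values from the Scylla Cloud request and your own VPC:

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Step 2 (Accept): accept the peering request that Scylla Cloud created toward your account.
peering_id = "pcx-0123456789abcdef0"          # placeholder: the ID shown in your AWS console
ec2.accept_vpc_peering_connection(VpcPeeringConnectionId=peering_id)

# Step 3 (Route): send traffic destined for the Scylla cluster CIDR over the peering link.
ec2.create_route(
    RouteTableId="rtb-0123456789abcdef0",     # placeholder: your application VPC's route table
    DestinationCidrBlock="172.31.0.0/16",     # placeholder: the Cluster Network CIDR you chose
    VpcPeeringConnectionId=peering_id,
)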

Connect to Scylla Cloud using various drivers

If you did not enable VPC peering, once the cluster bootstrap is done, you will be redirected to the “Connect” tab. Here you can find client side code examples on how to connect to your Scylla Cloud cluster using various drivers.

The examples already include the relevant details for your cluster, like the IP addresses and datacenter name. Examples are provided for the following drivers, and a minimal Python example follows the list:

  • Cqlsh
  • Java
  • Python
  • Ruby
  • Others
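For reference, a minimal Python connection looks roughly like the sketch below; the contact points and credentials are placeholders for the values shown on your own “Connect” tab:

from cassandra.cluster import Cluster
from cassandra.auth import PlainTextAuthProvider

cluster = Cluster(
    contact_points=["10.0.0.1", "10.0.0.2", "10.0.0.3"],   # placeholder: node IPs from the Connect tab
    auth_provider=PlainTextAuthProvider(username="scylla", password="<your password>"),
)
session = cluster.connect()
print(session.execute("SELECT release_version FROM system.local").one())
cluster.shutdown()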

Connecting apps

Cluster view

Go to Cluster view and select your cluster to see a high level view of your cluster details:

  • Cluster ID
  • Scylla version
  • Cluster size
  • Replication Factor
  • CIDR Block (optional) → the CIDR block used to create your Scylla Cloud cluster, for example 10.0.0.0/16 (it must be different from your application VPC CIDR)
  • Allowed IPs → you can edit it from this view as well, adding or removing IPs / IP ranges
  • Public IP | Private IP | Status | Region | AZ | Instance Type
  • On the top right side you can observe the 3 major metrics mentioned earlier

Cluster view

Note: If you checked the “Enable VPC peering” check-box and have not completed the VPC peering yet, you will see the “Add VPC Peering” notice at the top.

Scylla Cloud Metrics

From your cluster view you can dive into the “Monitor” tab to see the following list of metrics:

  • CPU Load
  • Requests Served
  • Read Rate
  • Average Read Latency
  • Writes Rate
  • Average Write Latency

Note: We are constantly improving the metrics exposed to the UI, so over time the list will grow. Scylla DevOps are monitoring thousands of metrics per node for you in the background.

Scylla Cloud metrics

Resize / Delete cluster

Want to resize or delete your cluster? No worries, here’s how to do it.

  1. In your cluster view, click on the “Actions” tab; there you’ll find 2 options: Resize / Delete
  2. Submit the request and our DevOps will take care of it for you.

Resize Cluster / Delete Cluster

Resize cluster

We manage everything for you!

  • Monitor thousands of metrics per server
  • Set automatic recurring repairs
  • Set alerts: disk space, CQL connection (node down)
  • Backups (last 3 days, week + 2 weeks)
  • Security settings: authentication, encryption
  • Applying hotfixes and upgrades

Expert Cloud Support

Keep in mind that upon signing up for Scylla Cloud we create a Zendesk account for you, if you don’t have one already. This allows you to press the “Support” button in the Scylla Cloud user interface for immediate access to your Zendesk account and to submit a ticket.

The Scylla support response team is there for you 24/7/365. We have DevOps monitoring your cluster and taking care of any request for scaling-up/out you might have, as well as engineers (Tiers 1-4) around the globe to respond to any Critical and High priority tickets.

Submit a ticket

Something’s not right? Cluster unavailable? Performance issues? Here’s how to submit a ticket.

  1. Press the “Support” button and you will be redirected into Zendesk
  2. Use the “Submit a request” form.

Submit request

Ticket Priority

Please make sure to select the appropriate ticket priority. P1 and P2 tickets trigger on-call DevOps / Engineers to engage and help resolve the issue.

  • Urgent (P1)
    • Cluster unavailable
    • Possible data loss
    • Possible data corruption
  • High (P2)
    • Cluster is partially available (degraded)
    • Cluster suffers from major performance problems
  • Normal (P3)
    • Cluster suffers from minor performance issues
  • Low (P4)
    • Questions, Requests

Upgrades and Maintenance

Scylla Cloud runs the Scylla Enterprise edition. From time to time there are patch releases, and in some cases we will define an upgrade as critical, mandating that it be applied. Don’t worry, we’re not going to pull the rug out from under your feet: all upgrade procedures will be coordinated with the user.

Next Steps

After you get your cluster up and running, the next step is to load your data. We will be publishing a migration blog soon on migrating specifically to Scylla Cloud, but for now you might want to read our blog about migration strategies or watch the on-demand webinar.

If you do try Scylla Cloud, we’d love to hear your feedback, either via our Slack channel, or by dropping us a line.

Sign up for Scylla Cloud

The post Scylla Cloud Onboarding appeared first on ScyllaDB.

Introducing Scylla Cloud: The Fastest NoSQL Database as a Managed Service

Introducing Scylla Cloud

Today, we publicly announced Scylla Cloud, our fully-managed database as a service (DBaaS). Scylla Cloud is available immediately on Amazon Web Services (AWS) Elastic Compute Cloud (EC2) instances—including a developer instance that makes prototyping incredibly affordable. It will soon also run on Google Cloud and Microsoft Azure public cloud platforms.

Scylla Cloud runs the latest version of Scylla Enterprise, which is built from the ground up to make optimal use of the latest server technology and to run “close to the metal.” This means fewer servers are needed compared to other implementations in a cloud environment, saving on costs.

Scylla has been available and hardened for years by open source users, enterprise customers and a solid, growing community of developers. Now, we bring our enterprise product to a managed cloud solution that lets individual developers and teams get started quickly with small instances at low costs, and then to scale up as needed. With a full complement of tools and APIs, we make it easy for everyone from startups to large enterprises to migrate workloads from on-premises deployments or other cloud platforms over to Scylla’s managed cloud service.

Scylla Cloud provides managed Scylla clusters with automatic backup, upgrades, repairs, performance optimization, security hardening, 24*7 maintenance and support — all for a fraction of the cost of other DBaaS offerings. You can relieve your team of administrative tasks, allowing them to focus on creating applications that deliver engaging experiences to your customers. Scylla Cloud lets you prototype quickly, get into production fast and, once in production, scale smoothly over time.

Scylla Cloud currently runs on AWS, utilizing that platform’s highest performing systems for I/O intensive database use, the i3 series. With just a couple of clicks you can spin up a cluster of i3.16xlarge instances, with each Scylla node delivering millions of operations per second. For developers, Scylla Cloud has also made available the low-cost t2.micro instance. You can now get a Scylla Enterprise cluster running for less than $1 per day!

Benefits

Scylla Cloud brings unrivaled benefits in a number of areas:

  • Price-Performance: Scylla Cloud was designed with a fundamentally superior architecture compared with other NoSQL managed cloud solutions, such as Apache Cassandra, DataStax Enterprise, Amazon DynamoDB, Google BigTable, and Microsoft Cosmos DB. It provides blazing-fast performance, superior node utilization, and both horizontal and vertical scalability. With Scylla Cloud, the number of nodes required to support a given workload is reduced by 3-5X, resulting in significant savings.
  • Predictable Low Latencies: Scylla performs predictably, with consistent single-digit millisecond response times in the 99th-percentile of latencies, resulting in dependable service levels even as you scale up your application traffic. In fact, tests show Scylla delivers 1/3rd the p99 latencies of Amazon DynamoDB. When dealing with the unpredictable nature of Internet traffic, we also perform far better than other database alternatives under “hot partition” conditions.
  • No Vendor Lock-In: You can easily move your data between Scylla Cloud and Scylla running on your on-premises datacenter, or on other public and private cloud providers.
  • Cassandra Compatibility: Scylla Cloud is compatible with the entire Cassandra ecosystem of drivers. That means you can get applications to market faster by using Scylla Cloud. In addition, moving data from an existing Cassandra cluster is straightforward and quick.
  • Flexible Pricing and Sizing Options: Scylla Cloud allows you to choose what fits your organization’s workload and purchasing preferences. You can select from a range of servers as well as monthly or annual pricing options. You can start small with a cluster for development and then increase the size of the cluster for full-scale testing and QA, and then even further as you scale to meet your production deployment requirements.

Features

Scylla Cloud brings you a number of features that enable your application to get the most out of your systems:

  • Scaling Up and Out: Scylla’s performance grows linearly with larger systems that have additional cores, greater amounts of memory and storage capacity. This translates into lower costs compared to other DBaaS offerings.
  • Security: Scylla Cloud implements single-tenant occupancy along with encrypted backups and key management, so you can be assured your data is safe and only you have access to it. You can also use VPC peering to connect your application with your Scylla Cloud database.
  • Expert Operational Support: Scylla Cloud is managed and monitored by ScyllaDB engineers who are experts in keeping your service up and efficiently performant. By monitoring your cluster 24*7, they respond quickly to any issues. Automatic backups and repairs are scheduled without needing to dedicate your personnel to these tasks.
  • Hot Fixes: Updates and security fixes are applied to your running Scylla cluster automatically and transparently in a way that will not interfere with your database performance.
  • Highly Available: Scylla Cloud replicates your data across the availability zones within a region, so there is no single point of failure. Choose additional replicas if you require even more redundancy of your data. Soon, we will support replicating your data in different regions. You also have the ability to use Scylla Cloud in a Multi-Region configuration.
  • Easy to Integrate: We’ve made it quick and easy to hook Scylla Cloud in to your existing code base, with examples for cqlsh, Java, Python, Ruby and others.

How Do I Get Started?

Define your cluster, launch it and connect. That’s all that you have to do. The process is simple and fast. The first step after creating a Scylla Cloud account (cloud.scylladb.com) is to create your cluster by selecting your instance type and quantity as shown in Figure 1.

Create a Cluster

Figure 1: Select your instances for Scylla Cloud

You will also see a section for how to configure VPC peering. Use this to securely connect your application to Scylla Cloud.

VPC Peering

Figure 2: Configuration of VPC peering

After creating your cluster, which takes only a few minutes to provision, you will get the confirmation and the IP addresses of the servers to connect to. Figure 3 shows what this screen will look like.

Scylla Cloud: Cluster Ready

Figure 3: Confirmation that your Scylla Cloud cluster is ready

You can easily get a view of any of your provisioned clusters by selecting My Clusters from the left side menu, and then “View.” An example of the output is in Figure 4.

Scylla Cloud: Summary of Instances

Figure 4: Summary of the instances that you have provisioned

After you have connected your application to Scylla Cloud, and your application is running, you can observe some basic performance metrics. Figure 5 shows the performance of your write rate, average latency and 95th percentile latency.

Scylla Cloud: Cluster Performance

Figure 5: Sample output showing various write parameters of your Scylla Cloud instances.

Bottom Line

Scylla Cloud is ideal for teams that need maximum performance in a completely managed NoSQL database. It provides enterprise-grade features and support, allowing you to linearly scale your apps without requiring you to scale your staff. Scylla Cloud delivers consistent low latencies and high throughput with less use of hardware resources than other DBaaS solutions. The price-performance of Scylla Cloud leads the industry, meeting your most-demanding production requirements at a fraction of the cost.

Want to try it out? You can spin up t2.micro instances in minutes. Then deploy a cluster of i3’s when you’re ready for production loads. We’d love to hear about your experience and get your feedback.

Get Started Today
The post Introducing Scylla Cloud: The Fastest NoSQL Database as a Managed Service appeared first on ScyllaDB.

Even Higher Availability with 5x Faster Streaming in Cassandra 4.0

Streaming is a process where nodes of a cluster exchange data in the form of SSTables. Streaming can kick in during many situations such as bootstrap, repair, rebuild, range movement, cluster expansion, etc. In this post, we discuss the massive performance improvements made to the streaming process in Apache Cassandra 4.0.

High Availability

As we know, Cassandra is a highly available, eventually consistent database. The way it maintains its legendary availability is by storing redundant copies of data in nodes known as replicas, usually running on commodity hardware. During normal operations, these replicas may end up having hardware issues causing them to fail. As a result, we need to replace them with new nodes on fresh hardware.

As part of this replacement operation, the new Cassandra node streams data from the neighboring nodes that hold copies of the data belonging to this new node’s token range. Depending on the amount of data stored, this process can require substantial network bandwidth, taking some time to complete. The longer these types of operations take, the more we are exposing ourselves to loss of availability. Depending on your replication factor and consistency requirements, if another node fails during this replacement operation, availability will be impacted.

Increasing Availability

To minimize the failure window, we want to make these operations as fast as possible. The faster the new node completes streaming its data, the faster it can serve traffic, increasing the availability of the cluster. Towards this goal, Cassandra 4.0 saw the addition of Zero Copy streaming. For more details on Cassandra’s zero copy implementation, see this blog post and CASSANDRA-14556.
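The zero-copy idea itself is an operating-system-level one: instead of reading SSTable bytes into user space and writing them back out to a socket, the kernel can transfer them between file descriptors directly. The standalone Python sketch below (not Cassandra code, which implements this in Java inside its streaming subsystem) contrasts the two approaches on Linux:

import os
import socket

def stream_with_copies(path: str, sock: socket.socket, chunk: int = 1 << 16) -> None:
    # Classic streaming: every chunk is copied kernel -> user space -> kernel.
    with open(path, "rb") as f:
        while data := f.read(chunk):
            sock.sendall(data)

def stream_zero_copy(path: str, sock: socket.socket) -> None:
    # Zero-copy streaming: the kernel moves bytes from file to socket directly (sendfile).
    with open(path, "rb") as f:
        size = os.fstat(f.fileno()).st_size
        offset = 0
        while offset < size:
            offset += os.sendfile(sock.fileno(), f.fileno(), offset, size - offset)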

Talking Numbers

To quantify the results of these improvements, we at Netflix measured the performance impact of streaming in 4.0 vs 3.0, using our open source NDBench benchmarking tool with the CassJavaDriverGeneric plugin. Though we knew there would be improvements, we were still amazed with the overall results of a fivefold increase in streaming performance. The test setup and operations are all detailed below.

Test Setup

In our test setup, we used the following configurations:

  • 6-node clusters on i3.xl, i3.2xl, i3.4xl and i3.8xl EC2 instances, each on 3.0 and trunk (sha dd7ec5a2d6736b26d3c5f137388f2d0028df7a03).
  • Table schema
CREATE TABLE testing.test (
    key text,
    column1 int,
    value text,
    PRIMARY KEY (key, column1)
) WITH CLUSTERING ORDER BY (column1 ASC)
    AND bloom_filter_fp_chance = 0.01
    AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
    AND comment = ''
    AND compaction = {'class': 'org.apache.cassandra.db.compaction.LeveledCompactionStrategy'}
    AND compression = {'enabled': 'false'}
    AND crc_check_chance = 1.0
    AND dclocal_read_repair_chance = 0.1
    AND default_time_to_live = 0
    AND gc_grace_seconds = 864000
    AND max_index_interval = 2048
    AND memtable_flush_period_in_ms = 0
    AND min_index_interval = 128
    AND read_repair_chance = 0.0
    AND speculative_retry = '99PERCENTILE';
  • Data size per node: 500GB
  • No. of tokens per node: 1 (no vnodes)

To trigger the streaming process we used the following steps in each of the clusters:

  • Terminate a node
  • Add a new node as a replacement
  • Measure the time taken by the new node replacing the terminated node to complete streaming its data

For each cluster and version, we repeated this exercise multiple times to collect several samples.

Below is the distribution of streaming times we found across the clusters:

Benchmark results

Interpreting the Results

Based on the graph above, there are many conclusions one can draw. Some of them are:

  • 3.0 streaming times are inconsistent and show a high degree of variability (fat distributions across multiple samples)
  • 3.0 streaming is highly affected by the instance type and generally looks CPU bound
  • Zero Copy streaming is approximately 5x faster
  • Zero Copy streaming time shows little variability in its performance (thin distributions across multiple samples)
  • Zero Copy streaming performance is not CPU bound and remains consistent across instance types

It is clear from the performance test results that Zero Copy Streaming has a huge performance benefit over the current streaming infrastructure in Cassandra. But what does it mean in the real world? The following key points are the main takeaways.

MTTR (Mean Time to Recovery): MTTR is a KPI (Key Performance Indicator) that is used to measure how quickly a system recovers from a failure. Zero Copy Streaming has a very direct impact here with a five fold improvement on performance.

Costs: Zero Copy Streaming is ~5x faster. This translates directly into cost savings for some organizations, primarily as a result of reducing the need to maintain spare server or cloud capacity. In other situations where you’re migrating data to larger instance types or moving AZs or DCs, this means that instances that are sending data can be turned off sooner, saving costs. An added cost benefit is that now you don’t have to over-provision the instance. You get similar streaming performance whether you use an i3.xl or an i3.8xl, provided the bandwidth is available to the instance.

Risk Reduction: There is a great reduction in risk due to Zero Copy Streaming as well. Since a cluster’s recovery mainly depends on the streaming speed, Cassandra clusters with failed nodes will be able to recover much more quickly (5x faster). This means the window of vulnerability is reduced significantly, in some situations down to a few minutes.

Finally, a benefit that we generally don’t talk about is the environmental benefit of this change. Zero Copy Streaming enables us to move data very quickly through the cluster. It objectively reduces the number and sizes of instances that are used to build a Cassandra cluster. As a result, not only does it reduce Cassandra’s TCO (Total Cost of Ownership), it also helps the environment by consuming fewer resources!

Helpshift Shares Its Scylla Success Story

Today’s blog post was written by Mourjo Sen, Software Developer at Helpshift.

Introduction

In this post, we look at Helpshift’s journey of moving to a sharded model of data. We will look at the challenges we faced on the way and how Scylla emerged as our choice of database for the scale that Helpshift 2.0 is aiming to serve.

Before we get into the use case for which using Scylla was the answer, here is a simplified overview of the primary elements of our architecture:

  • MongoDB is used to store conversations between brands and their customers
  • Redis is used as a cache layer in front of MongoDB for faster retrieval of frequently accessed data
  • ScyllaDB is used to store user-related data
  • PostgreSQL is used to store auxiliary data
  • Elasticsearch is used to search through different types of data and as a segmentation engine
  • Apache Flink is used as a stream processing engine and for delayed trigger management
  • Apache Kafka is used as a messaging system between services

Scylla: The Helpshift Story

Background

Helpshift is a digital customer service platform that provides a means for users to engage in a conversation with brands. It comprises many features which can be applied to incoming conversations, to name a few: automated rule application, segregation by properties, custom bots, auto-detection of topics, SLA management, real-time statistics, and an extensive search engine. Helpshift’s platform currently handles 180K requests per second, and that scale keeps growing day by day. Sometimes, with growing scale, it is necessary to re-architect systems to facilitate the next phase of the company’s growth. One such area of re-evaluation and re-architecture was the management of user profiles.

Problem statement

Up until 2018, a user was registered on Helpshift when they had started a conversation on the Helpshift platform. At this point, a user ID would be generated for this user by the backend servers. The user ID would be used in all future calls to refer to this user. In this legacy user profile system, there are two important things to note:

  • A user was created by starting a conversation only when they interacted with the system
  • A user was bound to a device

While this system worked well for us for quite a while, we now wanted the user to be decoupled from the device. This meant that a user should be identified by some identifier outside of the Helpshift system. If someone logs into Helpshift from one device, their conversations should now be visible on any other device they log into. Along with this, we also wanted to change the time at which users were created. This would enable us to provide proactive support to users even when they haven’t started a conversation.

These two requirements meant that we had to re-engineer the user subsystem at Helpshift, mainly because:

  • The volume of CRUD operations on user data would be much higher
  • User authentication would have to be done via multiple parameters

With this change, we were expecting the number of users to be 150+ million and the number of requests to the users database to be 30K/sec. This was far too much for the database we were using at the time. Moreover, the nature of the data and access patterns warranted a very robust database. Firstly, the number of users would continue to grow over time. Secondly, the number of requests per second was unpredictable. Thus, we had to look to a class of databases that could scale as needed.

The database needed to support “sharding data” out of the box and adding capacity without skipping a beat in production. To “shard” data, essentially means to horizontally split data into smaller parts. The parts are then stored on dedicated database instances. This divides both the amount of data stored on each instance and the amount of traffic served by one instance.
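Conceptually, a sharded database routes each row to one shard by hashing its shard key, so every instance owns a predictable slice of the data. The toy Python sketch below illustrates that routing; Scylla and Cassandra do this for you with consistent hashing over a token ring, so a fixed modulus is only an illustration:

import hashlib

NUM_SHARDS = 4  # illustrative only; real clusters use a token ring, not a fixed modulus

def shard_for(key: str) -> int:
    # Map a shard key to a shard deterministically, so reads and writes agree on placement.
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS

for user_id in ["1234", "5678", "abcd"]:
    print(user_id, "-> shard", shard_for(user_id))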

Available Options

There are many varieties of sharded databases available in the industry today. We evaluated the most promising ones from the standpoint of properties we would like to have in our sharded databases. Our comparison included Cassandra, ScyllaDB, Voldemort, sharded MongoDB, and Google Cloud Spanner.

Our parameters of evaluation were:

  • Scalability: How much effort is required to add horizontal capacity to a cluster?
  • Data model: Is the data model able to support complex queries with “where” clauses?
  • Indexing: Are indexes on fields other than primary keys supported?
  • Query language: Is the query language expressive enough for production usage?
  • Throughput and latency: Are there benchmarks in the literature that study the throughput/latency?
  • Persistence: If a node suddenly dies, is it able to recover without data loss?
  • Upgrading to new versions: Are rolling restart/upgrades supported?

In most of the above categories, ScyllaDB emerged as a winner in the properties we wanted in our sharded database. At the time of this work, we used ScyllaDB 2.0.1 and the following were the reasons for choosing ScyllaDB over the other candidates:

  • The throughput was by far the highest (for a generic benchmark)
  • Adding a node to a live cluster was fairly easy
  • Features like secondary indexes were proposed for release in the near future
  • ScyllaDB and Cassandra are fundamentally sharded databases, and it is impossible to bypass the sharding (and thereby impossible to load the cluster with scan queries across nodes)

Schema design

Moving to a sharded database requires a change of the data model and access patterns. The main reason for that is that some fields are special in sharded systems. They act as keys to find the instance where a particular bit of data will be present. This means that these shard keys always have to be present as part of the query. This makes it imperative to design the data model with all possible query patterns in mind, as some queries are simply not serviceable in sharded databases.

For this, we created auxiliary tables to find the shard key of an entity. For example, if we want the data model to be able to find a user both by email and by user ID, then we need to be able to make both of the following queries:

  • SELECT * from users where user_id='1234';
  • SELECT * from users where email='abcd@hotmail.com';

This kind of query is not possible on sharded systems, since each table can only have one primary or shard key. For this we introduced intermediate tables which store the mapping from user_id to id and from email to id for every user, where id is the primary key of the actual users table. A simplified version of our schema looks like:

  • CREATE TABLE users (id text PRIMARY KEY, user_id text, email text, name text)
  • CREATE TABLE mapping_emails (email text PRIMARY KEY, id text)
  • CREATE TABLE mapping_userids (user_id text PRIMARY KEY, id text)

With this, we can find a user by both user ID and email with a two-hop query. For example, to find the user with user_id='1234' (a driver-level sketch of this lookup follows these queries):

  • SELECT id from mapping_userids where user_id='1234';
  • SELECT * from users where id='<id-from-last-query>';
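In application code, the two-hop lookup is simply two prepared statements executed back to back. Here is a minimal sketch with the Python driver, using the table and column names defined above; the keyspace name and contact point are placeholders:

from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("helpshift")  # placeholder contact point and keyspace

lookup_by_user_id = session.prepare("SELECT id FROM mapping_userids WHERE user_id = ?")
fetch_user = session.prepare("SELECT * FROM users WHERE id = ?")

def get_user_by_user_id(user_id: str):
    # Hop 1: resolve the internal id (the users table's shard key) via the mapping table.
    mapping = session.execute(lookup_by_user_id, [user_id]).one()
    if mapping is None:
        return None
    # Hop 2: fetch the actual user row by its primary key.
    return session.execute(fetch_user, [mapping.id]).one()

print(get_user_by_user_id("1234"))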

This data model and schema sufficed for all the requirements we have at Helpshift. It is highly important to check the feasibility of the schema with respect to all query requirements.

Journey to production

Having decided on the schema, we had to benchmark ScyllaDB with the amount of data and scale we anticipated in production. The process of benchmarking was as follows:

  • 150 million entities
  • 30,000 user reads per second (this will exclude the additional reads we have to do via the lookup tables)
  • 100 user writes per second (this will exclude the additional writes we have to do on the lookup tables)
  • Adding a node to a live loaded cluster
  • Removing a node from a live cluster
  • 12 hour period for benchmark with sudden bursts of traffic

We benchmarked with the following cluster (these settings are sketched in code after the list):

  • 10 nodes of size i3.2xlarge
  • Replication factor=3
  • Leveled compaction strategy
  • Quorum consistency
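For context, those cluster settings map onto ordinary CQL and driver options. The following is a hedged sketch; the keyspace, datacenter and table names are illustrative, not Helpshift’s actual definitions:

from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster, ExecutionProfile, EXEC_PROFILE_DEFAULT

# Quorum consistency configured at the driver level.
profile = ExecutionProfile(consistency_level=ConsistencyLevel.QUORUM)
session = Cluster(["127.0.0.1"],
                  execution_profiles={EXEC_PROFILE_DEFAULT: profile}).connect()

# Replication factor 3 on the keyspace, leveled compaction on the table.
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS bench
    WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': 3}
""")
session.execute("""
    CREATE TABLE IF NOT EXISTS bench.users (id text PRIMARY KEY, user_id text, email text, name text)
    WITH compaction = {'class': 'LeveledCompactionStrategy'}
""")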

With this setup, we were able to achieve:

  • 32,000 user reads/sec (the actual number of operations on Scylla was 150K/sec because of lookups and other business logic)
  • Average latency of 114 ms
  • Seastar reactor utilization 60%

As the benchmark results looked formidable, we went to production with ScyllaDB. It has been a year with ScyllaDB in production, and right now the cluster setup is as shown below:

  • 15 nodes i3.2xlarge
  • 73K user reads/sec at peak (which is 1151 IOPS on average)
  • Latency of 13 ms (95th percentile)
  • Average reactor load: 50%
  • Approximately 2 billion entries (which is 2.8 TB on disk)

Here are some of the metrics from our production ScyllaDB cluster:

Helpshift: IOPS

Figure 1: Disk IOPS across the ScyllaDB cluster: 959 reads/sec and 193 writes/sec

Helpshift: ScyllaDB requests/second

Figure 2: Number of incoming client requests and the ScyllaDB latency for each (note that a client request here is a request coming into the Helpshift infrastructure from outside and can comprise more than one operation on Scylla. The latency is for all Scylla operations combined for one client request.)

Helpshift: CQL Ops

Figure 3: The number of CQL operations on the ScyllaDB cluster grouped into reads, inserts and updates.

Conclusion

The decision to move to a sharded database comes as a natural evolution from the growth of a company, but choosing the right database to keep scaling can be both a daunting and a rewarding task. In our case, we chose Scylla. If we look at the results closely, the scale we benchmarked for was surpassed in production within the course of one year. That is, in just one year, the number of requests we had to serve went from 30K/s to 73K/s. This kind of growth did not get bottlenecked by our backend systems, solely because we were able to scale our Scylla cluster seamlessly as required. This kind of empowerment is crucial to the growth of an agile startup and, in hindsight, it was the perfect database for us.

You can read more about Helpshift’s use of Scylla and other technologies, such as their development of Scyllabackup and Spongeblob, at the Helpshift Engineering blog (engineering.helpshift.com).

The post Helpshift Shares Its Scylla Success Story appeared first on ScyllaDB.

Avengers Fan Frustration Highlights Importance of Database Technology

How do you know it’s time to upgrade your database technology?

When your fans can’t buy tickets to your latest blockbuster movie, for example (as recently happened with Avengers fans trying to buy tickets on Fandango and other sites), or buy the latest smartphone, or re-book a flight that got cancelled, or do one of the many other ‘I have to have it NOW’ things that millions and millions of consumers do every day.

One of the clearest indicators of using obsolete database technology, and likely a poor data architecture holding things together, is millions of frustrated fans wishing they had other ways to purchase your product.

The part that is so frustrating to those of us in the technology industry is that this is all completely avoidable with modern NoSQL technology like Apache Cassandra™, and the enterprise version of this, from DataStax.

Today’s data architectures need to accommodate millions of transactions in very short periods of time, and even more importantly (and harder), they need to be able to provide the data where and when their customers need it: locally at the endpoint device (think smartphones).

Here’s the problem: nobody knows where every consumer will be, exactly when they’ll engage to buy their movie ticket, or, even more importantly, how long they’ll be willing to wait for confirmation before they click away to another site. Because these consumers are highly distributed, have very short attention spans, and demand instant confirmation, and because there are millions and millions of them, enterprises today need technology that can keep up and handle real-time demands at cloud scale.

And here’s the ironic part: The vendors of these obsolete legacy technologies keep saying that all you need is more hardware, more instances of the database, and more add-on technology to replicate copies of the data all over the place.

The problem is that this legacy technology was never intended for today’s loads and expectations. And yet, companies try to keep it running with more glue and tape (you know the kind…) and bubble gum. And when it all comes crashing down, it’s devastating to reputation and brand satisfaction, or Net Promoter Score (NPS). Recovering from that mass of negative publicity takes years, if you are lucky enough to recover at all.

The good news is that there’s a relatively easy solution to fix these issues—and even better, prevent them from even happening in the first place. How? By overhauling your data architecture with a highly scalable, always-on, active everywhere database that allows you to take full advantage of hybrid and multi-cloud computing environments.

Modern applications running in hybrid cloud—as long as they’re built and running on the right kind of database—don’t go down and don’t leave your customers waiting or wanting more. Period.

For a closer look into what you can do to avoid becoming the next headline about a negative customer experience, contact DataStax. We can help you not only avoid the negative brand experience but accelerate your growth and innovation in this data-driven, “I need it now” world.
