Scylla University: Coding with Scala, Part 1

This lesson is taken from the Scylla Specific Drivers, Overview, Paging and Shard Awareness course in Scylla University, ScyllaDB’s free resource to learn NoSQL database development and Scylla administration. You can find the lesson here in Scylla University.

In a previous lesson, we explained how a Scylla Administrator backs up and restores a cluster. With the number of mutants on the rise, Division 3 decided that we must use more applications to connect to the mutant catalog and hired Scala developers to create powerful applications that can monitor the mutants. This lesson will explore how to connect to a Scylla cluster using the Phantom library for Scala: a Scala-idiomatic wrapper over the standard Java driver.

When creating applications that communicate with a database such as Scylla, it is crucial that the programming language being used includes support for database connectivity. Since Scylla is compatible with Cassandra, we can use any of the available Cassandra libraries. For example, in Go, there is GoCQL and GoCQLX. In Node.js, there is the cassandra-driver. For the JVM, we have the standard Java driver available. Scala applications running on the JVM can use it, but a library tailored for Scala’s features offers a more enjoyable and type-safe development experience. Since Division 3 wants to start investing in Scala, let’s begin by writing a sample Scala application.

Creating a Sample Scala Application

The sample application that we will create connects to a Scylla cluster, displays the contents of the Mutant Catalog table, inserts and deletes data, and shows the table’s contents after each action. First, we will go through each section of the code used and then explain how to run the code in a Docker container that accesses the Scylla Mutant Monitoring cluster. You can see the file here: scylla-code-samples/mms/scala/scala-app/src/main/scala/com/scylla/mms/App.scala.

As mentioned, we’re using the Phantom library for working with Scylla. The library’s recommended usage pattern involves modeling the database, table, and data service as classes. We’ll delve into those in a future lesson.

First, let’s focus on how we set up the connection itself. We start by importing the library:
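With Phantom, this is typically the single wildcard import of its DSL package:

import com.outworkers.phantom.dsl._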

This import brings all the necessary types and implicits into scope. The main entry point for our application is the App object’s main method. In it, we set up the connection to the cluster:

Next, we specify a list of the DNS names through which the application contacts Scylla, along with the keyspace to use.

Finally, we instantiate the MutantsDatabase and the MutantsService classes. These classes represent, respectively, the collection of tables in use by our application and a domain-specific interface to interact with those tables.
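Put together, the setup looks roughly like the sketch below. ContactPoints and keySpace are Phantom’s standard connector builders; the host names and the constructor signatures of MutantsDatabase and MutantsService are assumptions made for illustration.

object App {
  def main(args: Array[String]): Unit = {
    // DNS names of the Scylla nodes (placeholders) and the keyspace to use
    val hosts = Seq("scylla-node1", "scylla-node2", "scylla-node3")
    val connector: CassandraConnection = ContactPoints(hosts).keySpace("catalog")

    // The collection of tables and the domain-specific service built on top of it
    val db = new MutantsDatabase(connector)
    val service = new MutantsService(db)

    // ... the operations described below are sequenced here ...
  }
}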

We’re not operating on low-level types like ResultSet or Row objects with the Phantom library, but rather on strongly-typed domain objects. Specifically, for this lesson, we will be working with a Mutant data type defined as follows:
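A minimal sketch of that data type, matching the columns of the mutant_data table created later in this lesson, is a plain case class:

case class Mutant(
  firstName: String,
  lastName: String,
  address: String,
  pictureLocation: String
)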

You can find the file here: scylla-code-samples/mms/scala/scala-app/src/main/scala/com/scylla/mms/model/Mutant.scala

The MutantsService class contains implementations of the functionality required for this lesson. You can find it here:

scylla-code-samples/mms/scala/scala-app/src/main/scala/com/scylla/mms/service/MutantService.scala.

Let’s consider how we would fetch all the mutant definitions present in the table:
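Assuming the MutantsDatabase exposes the table as db.mutants, the query looks roughly like this sketch:

def getAllMutants: Future[List[Mutant]] = {
  db.mutants.select.all().fetch()
}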

Phantom provides a domain-specific language for interacting with Scylla. Instead of threading CQL strings through our code, we’re working with a strongly-typed interface that verifies, at compile-time, that the queries we are generating are correct (according to the data type definitions).

In this case, we are using the select method to fetch all Mutant records from Scylla. Note that by default, Phantom uses Scala’s Future data type to represent all Scylla operations. Behind the scenes, Phantom uses non-blocking network I/O to efficiently perform all of the queries. If you’d like to know more about Future, this is a good tutorial.

Moving forward, the following method allows us to store an instance of the Mutant data type into Scylla:
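A sketch of such a method, using the store method Phantom generates for the table:

def addMutant(mutant: Mutant): Future[ResultSet] = {
  db.mutants.store(mutant).future()
}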

Everything is strongly typed, and as such, we’re using actual instances of our data types to interact with Scylla. This gives us increased maintainability of our codebase. If we refactor the Mutant data type, this code will no longer compile, and we’ll be forced to fix it.

And lastly, here’s a method that deletes a mutant with a specific name:
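Sketched with the same assumed db.mutants table and the column names used above:

def deleteMutant(firstName: String, lastName: String): Future[ResultSet] = {
  db.mutants.delete
    .where(_.firstName eqs firstName)
    .and(_.lastName eqs lastName)
    .future()
}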

We can see that Phantom’s DSL can represent predicates in a type-safe fashion as well. We’re forced to use a matching type for comparing the firstName and lastName.

We can sequence the calls to these methods in our main entry point using a for comprehension on the Future values returned. The code below is from the following file:

scylla-code-samples/mms/scala/scala-app/src/main/scala/com/scylla/mms/App.scala
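The sequencing looks roughly like the sketch below; the service method names and the sample mutant are illustrative, not the exact code from the repository.

import scala.concurrent.ExecutionContext.Implicits.global

val program = for {
  _       <- service.getAllMutants.map(ms => println(s"Mutant catalog: $ms"))
  _       <- service.addMutant(Mutant("Peter", "Parker", "1515 Main St", "http://www.facebook.com/peterparker"))
  _       <- service.getAllMutants.map(ms => println(s"After insert: $ms"))
  _       <- service.deleteMutant("Peter", "Parker")
  mutants <- service.getAllMutants
} yield println(s"After delete: $mutants")

// Block until the whole pipeline completes before the JVM exits.
scala.concurrent.Await.result(program, scala.concurrent.duration.Duration.Inf)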

In this example, we execute all the futures sequentially (with some additional callbacks registered on them for printing out the information).

With the coding part done, let’s set up a Scylla Cluster and then run the sample application in Docker.

Set-up a Scylla Cluster

The example requires a single DC cluster. Follow this procedure to remove previous clusters and set up a new cluster.

Once the cluster is up, we’ll create the catalog keyspace and populate it with data.

The first task is to create the keyspace named catalog for the mutants’ catalog.

docker exec -it mms_scylla-node1_1 cqlsh
CREATE KEYSPACE catalog WITH REPLICATION = { 'class' : 'NetworkTopologyStrategy', 'DC1' : 3};

Now that the keyspace is created, it is time to create a table to hold the mutant data.

use catalog;
CREATE TABLE mutant_data (
    first_name text,
    last_name text,
    address text,
    picture_location text,
    PRIMARY KEY((first_name, last_name)));

Now let’s add a few mutants to the catalog with the following statements:

insert into mutant_data ("first_name","last_name","address","picture_location") VALUES ('Bob','Loblaw','1313 Mockingbird Lane', 'http://www.facebook.com/bobloblaw');
insert into mutant_data ("first_name","last_name","address","picture_location") VALUES ('Bob','Zemuda','1202 Coffman Lane', 'http://www.facebook.com/bzemuda');
insert into mutant_data ("first_name","last_name","address","picture_location") VALUES ('Jim','Jeffries','1211 Hollywood Lane', 'http://www.facebook.com/jeffries');

Building the Scala Example

If you previously built the Scala Docker image, you can skip directly to the section Running the Scala Example below.

Otherwise, to build the application in Docker, change into the scala subdirectory in scylla-code-samples:

cd scylla-code-samples/mms/scala

Now we can build and run the container:

docker build -t scala-app .
docker run -d --net=mms_web --name some-scala-app scala-app

To connect to the shell of the container, run the following command:

docker exec -it some-scala-app sh

Running the Scala Example

Finally, the sample Scala application can be run:

java -jar scala-app/target/scala-2.13/mms-scala-app-assembly-0.1.0-SNAPSHOT.jar

The output of the application will be:

Conclusion

In this lesson, we explained how to create a sample Scala application that executes a few basic CQL operations on a Scylla cluster using the Phantom library. These were only the basics, and there are more exciting topics that Division 3 wants developers to explore. In the next lesson, we will review the structure of an application using the Phantom library.

In the meantime, please be safe out there and continue to monitor the mutants!

*This lesson was written with the help of Itamar Ravid. Itamar is a distributed systems engineer with a decade of experience in a wide range of technologies. He focuses on JVM-based microservices in Scala using functional programming. Thank you, Itamar!

TAKE THIS LESSON IN SCYLLA UNIVERSITY


Scylla Summit 2021 Call for Speakers

Scylla Summit 2021 will be held as an online event this year, January 12 – 14, 2021. Our own sessions will showcase the latest Scylla features and capabilities, roadmap plans, and the view from the top, straight from our executives. We also want to extend this Call for Speakers, inviting you to tell your own story to the global community of your peers. Users’ voices and real-world perspectives are vital for the success of any software industry conference.

What Our Attendees Most Want to Hear

Our attendees appreciate hearing real-world examples from practitioners — developers and devops, architects, technical leaders, and product owners. We’re looking for compelling technical sessions: case studies, operational best practices, and more. Here’s a list of suggested topics, or feel free to recommend others.

Real-world Use Cases

Everything from design and architectural considerations to POCs, to production deployment and operational lessons learned, including…

  • Practical examples for vertical markets — IoT, AI/ML, security, e-commerce, fraud detection, adtech/martech, customer/user data, media/multimedia, social apps, B2B or B2C apps, and more.
  • Migrations — How did you get from where you were in the past to where you are today? What were the limitations of other systems you needed to overcome? How did Scylla address those issues?
  • Hard numbers — Clusters, nodes, CPUs, RAM and disk, data size, growth, throughput, latencies, benchmark and stress test results (think graphs, charts and tables). Your cluster can be any size — large or small. What attendees appreciate hearing most is how you got the best results out of your available resources.
  • Integrations — Scylla is just one of many big data systems running in user environments. What feeds it from upstream, or what does it populate downstream? Kafka? Spark? Parquet? Pandas? Other SQL or NoSQL systems? JanusGraph? KairosDB? Our users love block diagrams and want to understand data flows.
  • War stories — What were the hardest big data challenges you faced? How did you solve them? What lessons can you pass along?

Tips & Tricks

From getting started to performance optimization to disaster management, tell us your DevOps secrets, or unleash your chaos monkey!

  • Computer languages and dev tools — What are your favorite languages and tools? Are you a Pythonista? Doing something interesting in Golang or Rust?
  • Working on Open Source? — Got a GitHub repo to share? Our attendees would love to walk through your code.
  • Seastar — The Seastar infrastructure at the heart of Scylla can be used for other projects as well. What systems architecture challenges are you tackling with Seastar?

Schedule

  • CFP opens: Mon, Sep 21st, 2020
  • CFP closes (deadline for submissions): Thursday, Oct 22nd, 2020, 5:00 PM Pacific
  • Scylla Summit (event dates): Tue, Jan 12th – Thu, Jan 14th, 2021

Required Information

You’ll be asked to include the following information in your proposal:

  • Proposed title
  • Description
  • Suggested main topic
  • Audience information
  • Who is this presentation for?
  • What will the audience take away?
  • What prerequisite knowledge would they need?
  • Videos or live demos included in presentation?
  • Length of the presentation (10 or 15 minute slots or longer)

Ten Tips for a Successful Speaker Proposal

Please keep in mind this event is made by and for deeply technical professionals. All presentations and supporting materials must be respectful and inclusive.

  • Be authentic — Your peers want your personal experiences and examples drawn from real-world scenarios
  • Be catchy — Give your proposal a simple and straightforward but catchy title
  • Be interesting — Make sure the subject will be of interest to others; explain why people will want to attend and what they’ll take away from it
  • Be complete — Include as much detail about the presentation as possible
  • Don’t be “pitchy” — Keep proposals free of marketing and sales. We tend to ignore proposals submitted by PR agencies and require that we can reach the suggested participant directly.
  • Be understandable — While you can certainly cite industry terms, try to write a jargon-free proposal that contains clear value for attendees
  • Be deliverable — Sessions have a fixed length, and you will not be able to cover everything. The best sessions are concise and focused.
  • Be specific — Overviews aren’t great in this format; the narrower and more specific your topic is, the deeper you can dive into it, giving the audience more to take home
  • Be cautious — Live demos sometimes don’t go as planned, so we don’t recommend them
  • Be memorable — Leave your audience with takeaways they can bring back to their own organizations and work. Give them something they’ll remember for a good long time.

SUBMIT A SPEAKER PROPOSAL


Seedless NoSQL: Getting Rid of Seed Nodes in Scylla

Nodes in a Scylla cluster are symmetric, which means any node in the cluster can serve user read or write requests, and no special roles are assigned to a particular node. For instance, no primary nodes vs. secondary nodes or read nodes vs. write nodes.

This is a nice architectural property. However, there is one small thing that breaks the symmetry: the seed concept. Seed nodes are nodes that help with the discovery of the cluster and the propagation of gossip information. Users must assign this special role to some of the nodes in the cluster.

In addition to breaking the symmetric architecture property, do seed nodes introduce any real issues? In this blog post, we will walk through the pains of seed nodes and how we get rid of them to make Scylla easier and more robust to operate than ever before.

What are the real pains of the seed nodes?

First, seed nodes do not bootstrap. Bootstrap is the process by which a newly joining node streams data from existing nodes. Seed nodes skip the bootstrap process and stream no data. This is a common source of confusion for our users, who are surprised to see that no data was streamed after adding a new seed node.

Second, the user needs to decide which nodes will be assigned as the seed nodes. Scylla recommends 3 seed nodes per datacenter (DC) if the DC has more than 6 nodes, or 2 seed nodes per DC if it has fewer than 6 nodes. As the number of nodes grows, the user needs to update the seed node configuration and modify scylla.yaml on all nodes.

Third, it is quite complicated to add a new seed node or replace a dead one. A node cannot be added to the cluster as a seed node directly; the correct way is to add it as a non-seed node and then promote it to a seed node. This is because a seed node does not bootstrap, which means it does not stream data from existing nodes. When a seed node is dead and needs to be replaced, users cannot apply the regular replace-node procedure directly. They must first promote a non-seed node to act as a seed node and update the configuration on all nodes, and only then perform the replace-node procedure.

Those special treatments for seed nodes complicate the administration and operation of a Scylla cluster. Scylla needs to carefully document the differences between seed and non-seed nodes, and even with that documentation users can easily make mistakes.

Can we get rid of the seed concept to simplify things and make Scylla more robust?

Let’s first take a closer look at the special role that seed nodes play.

Seed nodes help in two ways:

1) Define the target nodes to talk to within the gossip shadow round

But what is a gossip shadow round? It is a routine used for a variety of purposes when a cluster boots up:

  • to get gossip information from existing nodes before normal gossip service starts
  • to get tokens and host IDs of the node to be replaced in the replacing operation
  • to get tokens and status of existing nodes for bootstrap operation
  • to get features known by the cluster to prevent an old version of a Scylla node that does not know any of those features from joining the cluster

2) Help to converge gossip information faster

Seed nodes help converge gossip information faster because, in each normal gossip round, a seed node is contacted once per second. As a result, the seed nodes are supposed to have more recent gossip information. Any nodes communicating with seed nodes will obtain the more recent information. This speeds up the gossip convergence.

How do we get rid of the seed concept?

To get rid of seed nodes entirely, we have to provide an alternative for each of the functions seed nodes currently serve:

1) Configuration changes

  • The seeds option

The parameter --seed-provider-parameters seeds= is now used only once, when a new node joins the cluster for the first time. Its only purpose is to find the existing nodes.

  • The auto-bootstrap option

In Scylla, the --auto-bootstrap option defaults to true and is not present in scylla.yaml; this was done intentionally to avoid misuse. The option exists to make initializing a fresh cluster faster: setting it to false skips streaming data from existing nodes.

In contrast, the Scylla AMI sets --auto-bootstrap to false, because most of the time AMI nodes start as part of a fresh cluster. When adding a new AMI node to an existing cluster, users must set the --auto-bootstrap option to true. It is easy to forget to do so and end up with a new node without any data streamed, which is annoying.

The new solution is that the --auto-bootstrap option will now be ignored, so that we have fewer dangerous options and fewer errors to make. With this change, all new nodes, both seed and non-seed, must bootstrap when joining the cluster for the first time.

One small exception is that the first node in a fresh cluster is not bootstrapped. This is because it is the first node and there are no other nodes to bootstrap from. The node with the smallest IP address is selected as the first node automatically, e.g., with seeds = “192.168.0.1,192.168.0.2,192.168.0.3”, node 192.168.0.1 will be selected as the first node and skips the bootstrap procedure. So when starting nodes to form a fresh cluster, always use the same list of nodes in the seeds config option.

2) Gossip shadow round target nodes selection

Before this change, only the seed nodes were contacted in the gossip shadow round, and the shadow round finished as soon as any one of the seed nodes responded.

After this change, if the node is a new node that knows only the nodes listed in the seeds config option, it talks to all of those nodes and waits for each of them to respond before finishing the shadow round.

If the node is an existing node, it talks to all the nodes saved in system.peers, without consulting the seeds config option.

3) Gossip normal round target nodes selection

Currently, in each normal gossip round, 3 types of nodes are selected to talk to:

  • Select 10% of live nodes randomly
  • Select a seed node if the seed node is not selected above
  • Select an unreachable node

After this change, the selection works as follows:

  • Select 10% of live nodes in a random shuffle + group fashion

For example, suppose there are 20 nodes in the cluster [n1, n2, .., n20]. The nodes are first shuffled randomly and then divided into 10 groups, so that each group contains 10% of the live nodes. In each round, the nodes in one group are selected to talk to. Once every group has been talked to, the nodes are shuffled randomly again and divided into 10 new groups, and the procedure repeats (a sketch of this selection follows the list below). Using a random shuffle to select target nodes is also done in other gossip implementations, e.g., SWIM.

This method converges gossip information in a more deterministic way, so we can drop the selection of one seed node in each gossip round.

  • Select an unreachable node

Unreachable node selection is not changed. An unreachable node is contacted in order to bring dead nodes back into the cluster.
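The random shuffle + group selection described above can be sketched as follows. This is only an illustration of the idea (Scylla itself is implemented in C++); the function and node names are invented for the example.

import scala.util.Random

// Endlessly yield groups of ~10% of the live nodes: shuffle, split into 10
// groups, hand out one group per gossip round, then reshuffle and repeat.
def gossipTargets(liveNodes: List[String]): Iterator[List[String]] = {
  val groupCount = 10
  Iterator.continually {
    val shuffled  = Random.shuffle(liveNodes)
    val groupSize = math.max(1, math.ceil(shuffled.size.toDouble / groupCount).toInt)
    shuffled.grouped(groupSize).toList
  }.flatten
}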

4) Improved message communication for shadow round

Currently, Scylla reuses the gossip SYN and ACK messages for the shadow round communication. Those messages are one-way and asynchronous, which means the sender does not wait for a response from the receiver. As a result, the gossip message handlers contain many special cases in order to support the shadow round.

After this change, a new two-way, synchronous RPC message has been introduced for communication in the gossip shadow round. This dedicated message makes it much easier to track which nodes have responded to the shadow round and to wait until all of them have responded. It also allows the sender to request only the gossip application states it is actually interested in, which reduces the message size transferred over the network. Thanks to the new RPC message, the special handling of the regular gossip SYN and ACK messages can be eliminated.

In a mixed cluster, for compatibility reasons, a node will fall back to the old shadow round method if the new message is not supported. We cannot introduce a gossip feature bit to decide whether the new shadow round method can be used, because the gossip service is not even started when the shadow round is conducted.

Summary

Getting rid of the seed concept simplifies Scylla cluster configuration and administration, makes Scylla nodes fully symmetric, and prevents unnecessary operation errors.

The work is merged in Scylla master and will be released in the upcoming Scylla 4.3 release.

Farewell seeds and hello seedless!

LEARN MORE ABOUT SCYLLA OPEN SOURCE


Apache Cassandra Usage Report 2020

Apache Cassandra is the open source NoSQL database for mission critical data. Today the community announced findings from a comprehensive global survey of 901 practitioners on Cassandra usage. It’s the first of what will become an annual survey, providing a baseline understanding of who uses Cassandra, and how and why organizations use it.

“I saw zero downtime at global scale with Apache Cassandra. That’s a powerful statement to make. For our business that’s quite crucial.” - Practitioner, London

Key Themes

Cassandra adoption is correlated with organizations in a more advanced stage of digital transformation.

People from organizations that self-identified as being in a “highly advanced” stage of digital transformation were more likely to be using Cassandra (26%) compared with those in an “advanced” stage (10%) or “in process” (5%).

Optionality, security, and scalability are among the key reasons Cassandra is selected by practitioners.

The top reasons practitioners use Cassandra for mission critical apps are “good hybrid solutions” (62%), “very secure” (60%), “highly scalable” (57%), “fast” (57%), and “easy to build apps with” (55%).

A lack of skilled staff and the challenge of migration deters adoption of Cassandra.

Thirty-six percent of practitioners currently using Cassandra for mission critical apps say that a lack of Cassandra-skilled team members may deter adoption. When asked what it would take for practitioners to use Cassandra for more applications and features in production, they said “easier to migrate” and “easier to integrate.”

Methodology

Sample. The survey consisted of 1,404 interviews of IT professionals and executives, including 901 practitioners, who are the focus of this usage report, conducted from April 13-23, 2020. Respondents came from 13 geographies (China, India, Japan, South Korea, Germany, United Kingdom, France, the Netherlands, Ireland, Brazil, Mexico, Argentina, and the U.S.) and the survey was offered in seven languages corresponding to those geographies. While margin of sampling error cannot technically be calculated for online panel populations where the relationship between sample and universe is unknown, the margin of sampling error for equivalent representative samples would be +/- 2.6% for the total sample, +/- 3.3% for the practitioner sample, and +/- 4.4% for the executive sample.

To ensure the highest quality respondents, surveys include enhanced screening beyond title and activities of company size (no companies under 100 employees), cloud IT knowledge, and years of IT experience.

Rounding and multi-response. Figures may not add to 100 due to rounding or multi-response questions.

Demographics

Practitioner respondents represent a variety of roles as follows: Dev/DevOps (52%), Ops/Architect (29%), Data Scientists and Engineers (11%), and Database Administrators (8%) in the Americas (43%), Europe (32%), and Asia Pacific (12%).

Cassandra roles

Respondents include both enterprise (65% from companies with 1k+ employees) and SMEs (35% from companies with at least 100 employees). Industries include IT (45%), financial services (11%), manufacturing (8%), health care (4%), retail (3%), government (5%), education (4%), telco (3%), and 17% were listed as “other.”

Cassandra companies

Cassandra Adoption

Twenty-two percent of practitioners are currently using or evaluating Cassandra with an additional 11% planning to use it in the next 12 months.

Of those currently using Cassandra, 89% are using open source Cassandra, including both self-managed (72%) and third-party managed (48%).

Practitioners using Cassandra today are more likely to use it for more projects tomorrow. Overall, 15% of practitioners say they are extremely likely (10 on a 10-pt scale) to use it for their next project. Of those, 71% are currently using or have used it before.

Cassandra adoption

Cassandra Usage

People from organizations that self-identified as being in a “highly advanced” stage of digital transformation were more likely to be using Cassandra (26%) compared with those in an “advanced” stage (10%) or “in process” (5%).

Cassandra predominates in very important or mission critical apps. Among practitioners, 31% use Cassandra for their mission critical applications, 55% for their very important applications, 38% for their somewhat important applications, and 20% for their least important applications.

“We’re scheduling 100s of millions of messages to be sent. Per day. If it’s two weeks, we’re talking about a couple billion. So for this, we use Cassandra.” - Practitioner, Amsterdam

Cassandra usage

Why Cassandra?

The top reasons practitioners use Cassandra for mission critical apps are “good hybrid solutions” (62%), “very secure” (60%), “highly scalable” (57%), “fast” (57%), and “easy to build apps with” (55%).

“High traffic, high data environments where really you’re just looking for very simplistic key value persistence of your data. It’s going to be a great fit for you, I can promise that.” - Global SVP Engineering

Top reasons practitioners use Cassandra

For companies in a highly advanced stage of digital transformation, 58% cite “won’t lose data” as the top reason, followed by “gives me confidence” (56%), “cloud native” (56%), and “very secure” (56%).

“It can’t lose anything, it has to be able to capture everything. It can’t have any security defects. It needs to be somewhat compatible with the environment. If we adopt a new database, it can’t be a duplicate of the data we already have.… So: Cassandra.” - Practitioner, San Francisco

However, 36% of practitioners currently using Cassandra for mission critical apps say that a lack of Cassandra-skilled team members may deter adoption.

“We don’t have time to train a ton of developers, so that time to deploy, time to onboard, that’s really key. All the other stuff, scalability, that all sounds fine.” – Practitioner, London

When asked what it would take for practitioners to use Cassandra for more applications and features in production, they said “easier to migrate” and “easier to integrate.”

“If I can get started and be productive in 30 minutes, it’s a no brainer.” - Practitioner, London

Conclusion

We invite anyone who is curious about Cassandra to test the 4.0 beta release. There will be no new features or breaking API changes in future Beta or GA builds, so you can expect the time you put into the beta to translate into transitioning your production workloads to 4.0.

We also invite you to participate in a short survey about Kubernetes and Cassandra that is open through September 24, 2020. Details will be shared with the Cassandra Kubernetes SIG after it closes.

Survey Credits

A volunteer from the community helped analyze the report, which was conducted by ClearPath Strategies, a strategic consulting and research firm, and donated to the community by DataStax. It is available for use under Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0).

AWS Outposts: Run Fully Managed NoSQL Workloads On-Premises Using Scylla

AWS Outposts is a fully managed on-premises extension of Amazon Web Services. It puts all of your familiar infrastructure and APIs right in your own local environment. Ideal for low-latency use cases, AWS Outposts allows you to run EC2 instances, Elastic Block Storage (EBS), and container-based services like Amazon’s EKS for Kubernetes.

Order what you need through your AWS Console, and a team of experts shows up on-site for the installation. Your local on-premises environment appears as your own private zone, and you store all of the physical assets in a traditional 80 x 48 x 24 inch 42U rack.

Scylla as an On-Premises Alternative to DynamoDB

Not all Amazon services are available in AWS Outposts, however. Notably, Amazon DynamoDB currently cannot run in an on-premises environment.

This is where Scylla serves a role. With Scylla’s DynamoDB-compatible API, Alternator, you can run your DynamoDB workloads in your local AWS Outpost, using Scylla Cloud, our fully managed database-as-a-service (DBaaS) running in your own AWS account. Want to use the same data on AWS Outposts and AWS public region? Run Scylla Cloud in a multi-datacenter, hybrid, setup: one DC on-prem, one DC on a public region.

By the way, if you are more familiar with Scylla’s CQL interface, the same Cassandra Query Language used in Apache Cassandra, that’s fine too. You can also use Scylla Cloud on AWS Outposts as an on-premises managed Cassandra-compatible service. (Learn more about the differences between the DynamoDB API and the Cassandra CQL interface here.)

Scylla Cloud has been tested and certified by Amazon as “AWS Outposts ready.” As a member of the AWS Partner Network you can rely on the experts at ScyllaDB to manage an enterprise-grade NoSQL database on-premises for your critical business needs.

You can provision standard EC2 instance types such as the storage-intensive I3en series, which can store up to 60 terabytes of data. Administrators can also divide larger physical servers into smaller virtual servers for separate workloads and use cases.

Afterwards, you can deploy Scylla Cloud on your AWS account. Once you have, you should be ready to run!

If you need to learn more in detail about how the Scylla DynamoDB-compatible API works, check out our lesson on using the DynamoDB API in Scylla from Scylla University. It’s completely free.

As a last point, you can also use Scylla Enterprise or Scylla Open Source on AWS Outposts if you want to manage your own database.

Next Steps

Contact us to learn more about using Scylla in AWS Outposts, or sign up for our Slack channel to discuss running on Outposts with our engineers and your big data industry peers.

ASK US ABOUT RUNNING SCYLLA ON AWS OUTPOSTS


CarePet: An Example IoT Use Case for Hands-On App Developers

We recently published CarePet, a project that demonstrates a generic Internet of Things (IoT) use case for Scylla. The application is written in Go and allows tracking of pets’ health indicators. It consists of three parts:

  • a collar that reads and pushes sensor data
  • a web app for reading and analyzing the pets’ data
  • a database migration tool

In this post, we’ll cover the main points of the guide. Check out the Github repo with the full code. The complete guides are here.

Keep in mind that this example is not fully optimized, is not production-ready, and should be used for reference purposes only.

Introduction

In this guided exercise, you’ll create an IoT app from scratch and configure it to use Scylla as the backend datastore. We’ll use, as an example, an application called CarePet, which collects and analyzes data from sensors attached to a pet’s collar and monitors the pet’s health and activity. The example can be used, with minimal changes, for any IoT-like application. We’ll go over the different stages of the development: gathering requirements, creating the data model, sizing the cluster and hardware to match the requirements, and finally building and running the application.

Use Case Requirements

Each pet collar includes sensors that report four different measurements: Temperature, Pulse, Location, and Respiration. The collar reads the sensor data once a minute, aggregates it in a buffer, and sends the measurements directly to the app once an hour. The application should scale to 10 million pets, and it keeps a pet’s data history for a month. Thus the database will need to scale to contain 432 billion data points per month (60 × 24 × 30 × 10,000,000 = 432,000,000,000); 43,200 data samples per pet. If the data variance is low, it will be possible to further compact the data and reduce the number of data points.

Note that in a real world design you’d have a fan-out architecture, with the end-node IoT devices (the collars) communicating wirelessly to an MQTT gateway or equivalent, which would, in turn, send updates to the application via, say, Apache Kafka. We’ll leave out those extra layers of complexity for now, but if you are interested check out our post on Scylla and Confluent Integration for IoT Deployments.

Performance Requirements

The application has two parts:

  • Sensors: writes to the database, throughput sensitive
  • Backend dashboard: reads from the database, latency-sensitive

For this example, we assume 99% writes (sensors) and 1% reads (backend dashboard).

Desired Service Level Objectives (SLOs):

  • Writes: throughput of 670K operations per second
  • Reads: latency of up to 10 milliseconds per key for the 99th percentile.

The application requires high availability and fault tolerance. Even if a Scylla node goes down or becomes unavailable, the cluster is expected to remain available and continue to provide service. You can learn more about Scylla high availability in this lesson.

Design and Data Model

In this part we’ll think about our queries, make the primary key and clustering key selection, and create the database schema. See more in the data model design document.

Here’s how we will create our data using CQL:

CREATE TABLE IF NOT EXISTS owner (
    owner_id  UUID,
    address   TEXT,
    name      TEXT,
    PRIMARY KEY (owner_id)
);

CREATE TABLE IF NOT EXISTS pet (
    owner_id  UUID,
    pet_id    UUID,
    chip_id   TEXT,
    species   TEXT,
    breed     TEXT,
    color     TEXT,
    gender    TEXT,
    age       INT,
    weight    FLOAT,
    address   TEXT,
    name      TEXT,
    PRIMARY KEY (owner_id, pet_id)
);

CREATE TABLE IF NOT EXISTS sensor (
    pet_id    UUID,
    sensor_id UUID,
    type      TEXT,
    PRIMARY KEY (pet_id, sensor_id)
);

CREATE TABLE IF NOT EXISTS measurement (
    sensor_id UUID,
    ts        TIMESTAMP,
    value     FLOAT,
    PRIMARY KEY (sensor_id, ts)
) WITH compaction = { 'class' : 'TimeWindowCompactionStrategy' };

CREATE TABLE IF NOT EXISTS sensor_avg (
    sensor_id UUID,
    date      DATE,
    hour      INT,
    value     FLOAT,
    PRIMARY KEY (sensor_id, date, hour)
) WITH compaction = { 'class' : 'TimeWindowCompactionStrategy' };

Deploying the App

Prerequisites: Docker and docker-compose, Go, and git (all of these are used by the commands below).

The example application uses Docker to run a three-node ScyllaDB cluster. It allows tracking of pets’ health indicators and consists of three parts:

  • migrate (/cmd/migrate) – creates the CarePet keyspace and tables
  • collar (/cmd/sensor) – generates pet health data and pushes it into the storage
  • web app (/cmd/server) – REST API service for tracking the pets’ health state

Download the example code from git:

$ git clone git@github.com:scylladb/care-pet.git

Start by creating a local Scylla cluster consisting of 3 nodes:

$ docker-compose up -d

Docker-compose will spin up a Scylla cluster consisting of 3 nodes: carepet-scylla1, carepet-scylla2 and carepet-scylla3. Wait for about two minutes, then check the status of the cluster:

$ docker exec -it carepet-scylla1 nodetool status

Once all the nodes are in UN – Up Normal status, initialize the database. This will create the keyspaces and tables:

$ go build ./cmd/migrate
$ NODE1=$(docker inspect -f '{{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}' carepet-scylla1)
$ ./migrate --hosts $NODE1

This will produce output in the following format:

2020/08/06 16:43:01 Bootstrap database...
2020/08/06 16:43:13 Keyspace metadata = {Name:carepet DurableWrites:true StrategyClass:org.apache.cassandra.locator.NetworkTopologyStrategy StrategyOptions:map[datacenter1:3] Tables:map[gocqlx_migrate:0xc00016ca80 measurement:0xc00016cbb0 owner:0xc00016cce0 pet:0xc00016ce10 sensor:0xc00016cf40 sensor_avg:0xc00016d070] Functions:map[] Aggregates:map[] Types:map[] Indexes:map[] Views:map[]}

Next, start the pet collar simulation. From a separate terminal execute the following command to generate the pet’s health data and save it to the database:

$ go build ./cmd/sensor
$ NODE1=$(docker inspect -f '{{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}' carepet-scylla1)
$ ./sensor --hosts $NODE1

This will produce output in the following format:

2020/08/06 16:44:33 Welcome to the Pet collar simulator
2020/08/06 16:44:33 New owner # 9b20764b-f947-45bb-a020-bf6d02cc2224
2020/08/06 16:44:33 New pet # f3a836c7-ec64-44c3-b66f-0abe9ad2befd
2020/08/06 16:44:33 sensor # 48212af8-afff-43ea-9240-c0e5458d82c1 type L new measure 51.360596 ts 2020-08-06T16:44:33+02:00
2020/08/06 16:44:33 sensor # 2ff06ffb-ecad-4c55-be78-0a3d413231d9 type R new measure 36 ts 2020-08-06T16:44:33+02:00|2020/08/06 16:44:33 sensor # 821588e0-840d-48c6-b9c9-7d1045e0f38c type L new measure 26.380281 ts 2020-08-06T16:44:33+02:00
...

Make a note of the pet’s Owner ID (the ID is the part after the # sign without trailing spaces). We will use it later.

Now, start the REST API service in a separate, third, terminal. This server exposes a REST API that allows for tracking the pets’ health state:

$ go build ./cmd/server
$ NODE1=$(docker inspect -f '{{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}' carepet-scylla1)
$ ./server --port 8000 --hosts $NODE1

Producing this expected output:

2020/08/06 16:45:58 Serving care pet at http://127.0.0.1:8000

Using the Application

Open http://127.0.0.1:8000/ in a browser or send an HTTP request from the CLI:

$ curl -v http://127.0.0.1:8000/

Produces this expected output:

GET / HTTP/1.1
> Host: 127.0.0.1:8000
> User-Agent: curl/7.71.1
> Accept: */*
>
* Mark bundle as not supporting multiuse
< HTTP/1.1 404 Not Found
< Content-Type: application/json
< Date: Thu, 06 Aug 2020 14:47:41 GMT
< Content-Length: 45
< Connection: close
<
* Closing connection 0
{"code":404,"message":"path / was not found"}

If you see this JSON at the end with a 404, it means everything works as expected. To read an owner’s data, use the previously saved owner_id as follows:

$ curl -v http://127.0.0.1:8000/api/owner/{owner_id}

for example:

$ curl http://127.0.0.1:8000/api/owner/a05fd0df-0f97-4eec-a211-cad28a6e5360

expected result:

{"address":"home","name":"gmwjgsap","owner_id":"a05fd0df-0f97-4eec-a211-cad28a6e5360"}

To list the owner’s pets run:

$ curl -v http://127.0.0.1:8000/api/owner/{owner_id}/pets

for example:

$ curl http://127.0.0.1:8000/api/owner/a05fd0df-0f97-4eec-a211-cad28a6e5360/pets

expected output:

[{"address":"home","age":57,"name":"tlmodylu","owner_id":"a05fd0df-0f97-4eec-a211-cad28a6e5360","pet_id":"a52adc4e-7cf4-47ca-b561-3ceec9382917","weight":5}]

To list a specific pet’s sensors:

$ curl -v http://127.0.0.1:8000/api/pet/{pet_id}/sensors

for example:

$ curl http://127.0.0.1:8000/api/pet/cef72f58-fc78-4cae-92ae-fb3c3eed35c4/sensors

produces this output:

[{"pet_id":"cef72f58-fc78-4cae-92ae-fb3c3eed35c4","sensor_id":"5a9da084-ea49-4ab1-b2f8-d3e3d9715e7d","type":"L"},{"pet_id":"cef72f58-fc78-4cae-92ae-fb3c3eed35c4","sensor_id":"5c70cd8a-d9a6-416f-afd6-c99f90578d99","type":"R"},{"pet_id":"cef72f58-fc78-4cae-92ae-fb3c3eed35c4","sensor_id":"fbefa67a-ceb1-4dcc-bbf1-c90d71176857","type":"L"}]

To review the data from a specific sensor:

$ curl "http://127.0.0.1:8000/api/sensor/{sensor_id}/values?from=2006-01-02T15:04:05Z07:00&to=2006-01-02T15:04:05Z07:00"

for example:

$ curl http://127.0.0.1:8000/api/sensor/5a9da084-ea49-4ab1-b2f8-d3e3d9715e7d/values\?from\="2020-08-06T00:00:00Z"\&to\="2020-08-06T23:59:59Z"

produces this expected output:

[51.360596,26.737432,77.88015,...]

Code Structure and Implementation

The code package structure is as follows:

Name          Purpose
/api          swagger API spec
/cmd          application executables
/cmd/migrate  installs the database schema
/cmd/sensor   simulates the pet’s collar
/cmd/server   web application backend
/config       database configuration
/db           database handlers (gocql/x)
/db/cql       database schema
/handler      swagger REST API handlers
/model        application models and ORM metadata

After data is collected from the pets via the sensors on their collars, it is delivered to the central database for analysis and health status checking.

The collar code sits in /cmd/sensor and uses the scylladb/gocqlx Go driver to connect to the database directly and publish its data. The collar gathers sensor measurements, aggregates the data in a buffer, and sends it every hour.

Overall all applications in this repository use scylladb/gocqlx for:

  • Object-Relational Mapping (ORM)
  • Building Queries
  • Migrating database schemas

The web application’s REST API server resides in /cmd/server and uses go-swagger that supports OpenAPI 2.0 to expose its API. API handlers reside in /handler. Most of the queries are reads.

The application caches sensor measurement data on an hourly basis. It uses lazy evaluation to manage sensor_avg, which can be viewed as an application-level, lazily evaluated materialized view.

The algorithm is simple and resides in /handler/avg.go:

  • read sensor_avg
  • if aggregated data is missing, read measurement data, aggregate in memory, and save.
  • serve the request

Additional Resources, Future Plans

You can find the project code in the Github repo. The guides are here.

We’re working on expanding this use case to include things like sizing, adding more languages, and adding more guides. Stay tuned!

Additionally, these resources will help you get started with ScyllaDB:


Improving Apache Cassandra’s Front Door and Backpressure

As part of CASSANDRA-15013, we have improved Cassandra’s ability to handle high throughput workloads, while having enough safeguards in place to protect itself from potentially going out of memory. In order to better explain the change we have made, let us understand, at a high level, how an incoming request was processed by Cassandra before the fix, what we changed, and the new relevant configuration knobs available.

How inbound requests were handled before

Let us take the scenario of a client application sending requests to a C* cluster. For the purpose of this blog, let us focus on one of the C* coordinator nodes.

Below is a microscopic view of the client-server interaction at the C* coordinator node. Each client connection to a Cassandra node happens over a Netty channel, and for efficiency purposes, each Netty eventloop thread is responsible for more than one Netty channel.

The eventloop threads read requests coming off the Netty channels and enqueue them into a bounded inbound queue in the Cassandra node.

A thread pool dequeues requests from the inbound queue, processes them asynchronously and enqueues the response into an outbound queue. There exist multiple outbound queues, one for each eventloop thread to avoid races.

The same eventloop threads that are responsible for enqueuing incoming requests into the inbound queue are also responsible for dequeuing responses from the outbound queue and shipping them back to the client.

Issue with this workflow

Let us take a scenario where there is a spike in operations from the client. The eventloop threads are now enqueuing requests at a much higher rate than the rate at which the requests are being processed by the native transport thread pool. Eventually, the inbound queue reaches its limit and cannot store any more requests.

Consequently, the eventloop threads get into a blocked state as they try to enqueue more requests into an already full inbound queue. They wait until they can successfully enqueue the request in hand, into the queue.

As noted earlier, these blocked eventloop threads are also supposed to dequeue responses from the outbound queue. Given that they are blocked, the outbound queue (which is unbounded) grows endlessly with all the responses, eventually resulting in C* going out of memory. This is a vicious cycle: since the eventloop threads are blocked, there is no one to ship responses back to the client; eventually client-side timeouts trigger, and clients may send more requests due to retries. This is an unfortunate situation to be in, since Cassandra is doing all the work of processing these requests as fast as it can, but there is no one to ship the produced responses back to the client.

So far, we have built a fair understanding of how the front door of C* works with regard to handling client requests, and how blocked eventloop threads can affect Cassandra.

What we changed

Backpressure

The essential root cause of the issue is that eventloop threads are getting blocked. Let us not block them by making the bounded inbound queue unbounded. If we are not careful here though, we could have an out of memory situation, this time because of the unbounded inbound queue. So we defined an overloaded state for the node based on the memory usage of the inbound queue.

We introduced two levels of thresholds: one at the node level, and the other, more granular, at the client IP level. The one at the client IP level helps isolate rogue client IPs without affecting other, well-behaved clients, if such a situation arises.

These thresholds can be set using cassandra yaml file.

native_transport_max_concurrent_requests_in_bytes_per_ip
native_transport_max_concurrent_requests_in_bytes

These thresholds can be further changed at runtime (CASSANDRA-15519).

Configurable server response to the client as part of backpressure

If C* happens to be in overloaded state (as defined by the thresholds mentioned above), C* can react in one of the following ways:

  • Apply backpressure by setting “Autoread” to false on the netty channel in question (default behavior).
  • Respond back to the client with an OverloadedException (if the client sets the “THROW_ON_OVERLOAD” connection startup option to “true”).

Let us look at the client request-response workflow again, in both these cases.

THROW_ON_OVERLOAD = false (default)

If the inbound queue is full (i.e., the thresholds are met):

C* sets autoread to false on the netty channel, which means it will stop reading bytes off of the netty channel.

Consequently, the kernel socket inbound buffer becomes full, since no bytes are being read off it by the Netty eventloop.

Once the kernel socket inbound buffer is full on the server side, things start piling up in the kernel socket outbound buffer on the client side, and once this buffer gets full, the client will start experiencing backpressure.

THROW_ON_OVERLOAD = true

If the inbound queue is full (i.e. the thresholds are met), eventloop threads do not enqueue the request into the Inbound Queue. Instead, the eventloop thread creates an OverloadedException response message and enqueues it into the flusher queue, which will then be shipped back to the client.

This way, Cassandra is able to serve very large throughput, while protecting itself from getting into memory starvation issues. This patch has been vetted through thorough performance benchmarking. Detailed performance analysis can be found here.

One-Step Streaming Migration from DynamoDB into Scylla

Last year, we introduced the ability to migrate DynamoDB tables to Scylla’s Dynamo-compatible interface — Alternator. Using the Scylla Migrator, this allows users to easily transfer data stored in DynamoDB into Scylla and enjoy reduced costs and lower latencies.

Transferring a snapshot of a live table is usually insufficient for a complete migration; setting up dual writes on the application layer is required for a safe migration that can also be rolled back in the face of unexpected errors.

Today, we are introducing the ability to perform live replication of changes applied to DynamoDB tables into Alternator tables after the initial snapshot transfer has completed. This feature is based on DynamoDB Streams and uses Spark Streaming to replicate the change data. Read on for a description of how this works and a short walkthrough!

DynamoDB Streams

Introduced in 2014, DynamoDB Streams can be enabled on any DynamoDB table to capture modification activities into a stream that can be consumed by user applications. Behind the scenes, a Kinesis stream is created into which modification records are written.
For example, given a DynamoDB table created using the following command:

aws dynamodb create-table \
--table-name migration_test \
--attribute-definitions AttributeName=id,AttributeType=S AttributeName=version,AttributeType=N \
--key-schema AttributeName=id,KeyType=HASH AttributeName=version,KeyType=RANGE \
--provisioned-throughput ReadCapacityUnits=5,WriteCapacityUnits=5

We can enable a DynamoDB Stream for the table like so:

aws dynamodb update-table \
--table-name migration_test \
--stream-specification StreamEnabled=true,StreamViewType=NEW_AND_OLD_IMAGES
{
  "TableDescription": {
    ...
    "StreamSpecification": {
      "StreamEnabled": true,
      "StreamViewType": "NEW_AND_OLD_IMAGES"
    }
  }
}

DynamoDB will now enable the stream for your table. You can monitor that process using describe-stream:

export STREAM_ARN=$(aws dynamodb describe-table --table-name migration_test | jq -r ".Table.LatestStreamArn")

aws dynamodbstreams describe-stream --stream-arn $STREAM_ARN
{
  "StreamDescription": {
    "StreamArn": "arn:aws:dynamodb:eu-west-1:277356164710:table/migration_test/stream/2020-08-19T19:26:06.164",
    "StreamLabel": "2020-08-19T19:26:06.164",
    "StreamStatus": "ENABLING",
    "StreamViewType": "NEW_IMAGE",
    "CreationRequestDateTime": "2020-08-19T22:26:06.161000+03:00",
    "TableName": "migration_test",
    ...
  }
}

Once the StreamStatus field switches to ENABLED, the stream is created and table modifications (puts, updates, deletes) will result in a record emitted to the stream. Because we chose NEW_AND_OLD_IMAGES for the stream view type, the maximal amount of information will be emitted for every modification (the IMAGE term refers to the item content): the OldImage field will contain the old contents of the modified item in the table (if it existed), and the NewImage field will contain the new contents.

Here’s a sample of the data record emitted upon item update:

{
  "awsRegion": "us-west-2",
  "dynamodb": {
    "ApproximateCreationDateTime": 1.46480527E9,
    "Keys": {
      "id": {"S": "id1"},
      "version": {"N": "10"}
    },
    "OldImage": {
      "id": {"S": "id1"},
      "version": {"N": "10"},
      "data": {"S": "foo"}
    },
    "NewImage": {
      "id": {"S": "id1"},
      "version": {"N": "10"},
      "data": {"S": "bar"}
    },
    "SequenceNumber": "400000000000000499660",
    "SizeBytes": 41,
    "StreamViewType": "NEW_AND_OLD_IMAGES"
  },
  "eventID": "4b25bd0da9a181a155114127e4837252",
  "eventName": "MODIFY",
  "eventSource": "aws:dynamodb",
  "eventVersion": "1.0"
}

Scylla Migrator and DynamoDB Streams

The functionality we are introducing today is aimed at helping you perform live migrations of DynamoDB Tables into your Scylla deployment without application downtime. Here’s a sketch of how this works:

  1. The migrator, on start-up, verifies that the target table exists (or creates it with the same schema as the source table) and enables the DynamoDB Stream of the source table. This causes inserts, modifications and deletions to be recorded on the stream;
  2. A snapshot of the source table is transferred from DynamoDB to Scylla;
  3. When the snapshot transfer completes, the migrator starts consuming the DynamoDB Stream and applies every change to the target table. This runs indefinitely until you stop it.

The order of steps in this process guarantees that no changes will be lost in the transfer. During the last step, old changes might be applied to the target table, but eventually the source and target tables should converge.

In contrast to our example before, which used NEW_AND_OLD_IMAGES for the stream view type, the Migrator uses the NEW_IMAGE mode, as we always apply the modifications without any conditions on the existing items.

There’s one important limitation to note here: DynamoDB Streams have a fixed retention of 24 hours. That means that the snapshot transfer has to complete within 24 hours, or some of the changes applied to the table during the snapshot transfer might be lost. Make sure that you allocate enough resources for the migration process for it to complete sufficiently quickly. This includes:

  • Provisioned read throughput (or auto-scaling) on the source DynamoDB table;
  • Sufficient executors and resources on the Spark cluster;
  • Sufficient resources on the Scylla cluster.

Walkthrough

Let’s do a walkthrough on how this process is configured on the Scylla Migrator. First, we need to configure the source section of the configuration file. Here’s an example:

source:
  type: dynamodb
  table: migration_test
  credentials:
    accessKey:
    secretKey:

  region: us-west-2
  scanSegments: 32
  readThroughput: 1
  throughputReadPercent: 1.0
  maxMapTasks: 8

Because we’re dealing with AWS services, authentication must be configured properly. The migrator currently supports static credentials and instance profile credentials. Role-based credentials (through role assumption) will be supported in the future. Refer to the previous post about DynamoDB integration for more details about the rest of the parameters. They are also well-documented in the example configuration file supplied with the migrator.

Next, we need to configure a similar section for the target table:

target:
  type: dynamodb
  table: mutator_table

  endpoint:
    host: http://scylla
    port: 8000

  credentials:
    accessKey: empty
    secretKey: empty

  scanSegments: 8
  streamChanges: true

Pretty similar to the source section, except this time, we’re specifying a custom endpoint that points to one of the Scylla nodes’ hostname and the Alternator interface port. We’re also specifying dummy static credentials as those are required by the AWS SDK. Finally, note the streamChanges parameter. This instructs the migrator to set up the stream and replicate the live changes.

Launching the migrator is done with the spark-submit script:

spark-submit --class com.scylladb.migrator.Migrator \
  --master spark://spark-master:7077 \
  --conf spark.driver.host=spark-master \
  --conf spark.scylla.config=./config.yaml.scylla \
  scylla-migrator-assembly-0.0.1.jar

Among the usual logs printed out while transferring data using the migrator, you should see the following lines that indicate that the migrator is setting up the DynamoDB Stream:

20/08/19 19:26:05 INFO migrator: Source is a Dynamo table and change streaming requested; enabling Dynamo Stream
20/08/19 19:26:06 INFO DynamoUtils: Stream not yet enabled (status ENABLING); waiting for 5 seconds and retrying
20/08/19 19:26:12 INFO DynamoUtils: Stream enabled successfully

Note that the migrator will abort if the table already has a stream enabled. This is due to the previously mentioned 24-hour retention imposed on DynamoDB Streams; we cannot guarantee the completeness of the migration if the stream already exists, so you’ll need to disable it.

Once the snapshot transfer has finished, the migrator will indicate that:

20/08/19 19:26:18 INFO migrator: Done transferring table snapshot. Starting to transfer changes

The migrator will then repeatedly print out the counts of the operation types applied to the target table. For example, if there’s a batch of 5 inserts/modifications and 2 deletions, the following table will be printed:

+---------------+-----+
|_dynamo_op_type|count|
+---------------+-----+
| DELETE        |    2|
| MODIFY        |    5|
+---------------+-----+

You may monitor the progress of the Spark application on the Streaming tab of the Spark UI (available at port 4040 of the machine running the spark-submit command on a client-based submission or through the resource manager on a cluster-based submission).

As mentioned, the migrator will continue running indefinitely at this point. Once you are satisfied with the validity of the data on the target table, you may switch over your application to write to the Scylla cluster. The DynamoDB Stream will eventually be drained by the migrator, at which point no more operations will be printed on the logs. You may then shut it down by hitting Ctrl-C or stopping the receiver job from the UI.

A Test Run

We’ve tested this new functionality with a load generation tool that repeatedly applies random mutations to a DynamoDB table on a preset number of keys. You may review the tool here: https://github.com/iravid/migrator-dynamo-mutator.

For our test run, we’ve used a dataset of 10,000 distinct keys. The tool applies between 10 and 25 mutations to the table every second; these mutations are item puts and deletions.

While the tool is running, we started the migrator, configured to transfer the table and its changes to Scylla. As we’ve explained, the migrator starts by enabling the stream, scanning the table and transferring a snapshot of it to Scylla.

Once the initial transfer is done, the migrator starts transferring the changes that have accumulated so far in the DynamoDB Stream. It takes a few polls to the stream to start, but eventually we see puts and deletions being applied.

Once the migrator has caught up with the changes, indicated by the smaller number of changes applied in each batch, we move the load generation tool to the next step: it stops applying mutations to the source table, and loads the contents of both tables into memory for a comparison. Because we’re applying both deletions and puts, it’s important to perform the comparison in both directions; the table contents should be identical. Luckily, this is the case.

The tool checks that the set-wise difference between the item sets on DynamoDB (labelled as remote) and Scylla (labelled as local) are both empty, which means that the contents are identical.

Summary

We’ve seen in this post how you may transfer the contents of live DynamoDB tables to Scylla’s Alternator interface using the Scylla Migrator. Please give this a try and let us know how it works!

LEARN MORE ABOUT SCYLLA ALTERNATOR

LEARN MORE ABOUT THE SCYLLA MIGRATOR
