Apache Kafka “Kongo” Part 4.2: Connecting Kafka to Cassandra with Kafka Connect

Here’s the Kongo code and sample connect property files for this blog.

Kafka Connect is an API and ecosystem of 3rd party connectors that enables Kafka to be easily integrated with other heterogeneous systems without having to write any extra code. This blog focuses on a use case extending the Kongo IoT application to stream events from Kafka to Apache Cassandra using a Kafka Connect Cassandra Sink.

Part 4.2 covers Distributed Workers for Production and useful Kafka Connect resources.

1. Distributed Workers for Production

A standalone worker is useful for testing, but for production you will probably need to run distributed workers on a Kafka Connect cluster.

Distributed mode (multiple workers) handles automatic balancing of work, allows you to scale up (or down) dynamically, and offers fault tolerance both in the active tasks and for configuration and offset commit data. Distributed workers that are configured with matching group.id values (set in the distributed properties file) automatically discover each other and form a cluster. You have to run the connect-distributed.sh script (with the connect-distributed.properties file) on every node you want to be part of the Kafka Connect cluster.

Note that in distributed mode the connector configurations are not passed on the command line. Instead, you can use the REST API to create, modify, and destroy connectors. A connector created on one worker will automatically be load balanced across the other workers. You normally run a worker on each server in the connect cluster, but for testing you can run multiple workers on the same server by copying the distributed properties file and changing the rest.port number.
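
The key settings in the distributed worker properties file are the group ID (identical on every worker that should join the same cluster) and the REST port (unique per worker on the same server). As a rough sketch (illustrative values only, not the Kongo project settings):

bootstrap.servers=localhost:9092
group.id=connect-cluster        # must be the same on every worker in the cluster
rest.port=8084                  # change this in the copied file so two workers can share a server
config.storage.topic=connect-configs
offset.storage.topic=connect-offsets
status.storage.topic=connect-status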

2. Managing Connectors via the REST API or connect-cli

Since Kafka Connect is intended to be run as a clustered service, it also provides a REST API for managing connectors. By default the REST server runs on port 8083 using the HTTP protocol. You can talk to any worker’s REST port to get a cluster-wide view. Here’s some documentation with examples.
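
For example, a few of the basic REST calls look like this (a sketch, assuming the default port and a connector named cassandra-sink; the exact configuration fields depend on the connector):

> curl http://localhost:8083/connectors                                # list running connectors
> curl http://localhost:8083/connectors/cassandra-sink/status         # connector and task states
> curl -X PUT http://localhost:8083/connectors/cassandra-sink/pause   # pause the connector
> curl -X DELETE http://localhost:8083/connectors/cassandra-sink      # remove the connector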

Let’s try running two workers and two task threads for the Kafka Cassandra Connector. In the sink properties file set tasks.max=2, then make a copy of the connect distributed properties file and set rest.port=8084 in the copy. Initially, just start one worker by running the distributed connect command with one of the distributed property files (the one with the default port):

> bin/connect-distributed.sh config/connect-distributed.properties

You can check to see what’s running in a browser with:

http://localhost:8083/connectors

[]

This is telling us that no connectors are running. Why? We haven’t actually started the connector and tasks yet, just the worker.

Landoop Lenses has a Connect Command Line Interface which wraps the Connect REST API and is easier to use for managing connectors. This is the equivalent command to see what’s running:

> bin/connect-cli ps

No running connectors

Before trying to run two connector tasks, the number of partitions for the violations topic must be increased to two or more, otherwise only one task will have anything to do (the other will be idle). You can do this by either creating a new topic with two (or more) partitions, or by altering the number of partitions on the existing violations topic:

> bin/kafka-topics.sh --zookeeper localhost --alter --topic violations-topic --partitions 2

Now you can start the connector and tasks with the command:

> bin/connect-cli create cassandra-sink < conf/cassandra-sink.properties

See what’s running with:

> bin/connect-cli ps

cassandra-sink

And check the task details with:

> bin/connect-cli status cassandra-sink

connectorState:  RUNNING

workerId: XXX:8083

numberOfTasks: 2

tasks:

 - taskId: 0

   taskState: RUNNING

   workerId: XXX:8083

 - taskId: 1

   taskState: RUNNING

   workerId: XXX:8083

This shows that there are two tasks running in the one worker (8083).

You can now start another worker by using the copy of the property file (with the changed port number):

> bin/connect-distributed.sh config/connect-distributed2.properties

Check to see what’s happening:

> bin/connect-cli status cassandra-sink

connectorState:  RUNNING

workerId: XXX:8083

numberOfTasks: 2

tasks:

 - taskId: 0

   taskState: RUNNING

   workerId: XXX:8083

 - taskId: 1

   taskState: RUNNING

   workerId: XXX:8084

There will still be two tasks running, but each task will be running on a different workerId (server:port).

Let’s see what happens if we kill a worker (simulating a real-life failure). Kill the newest worker and you’ll notice that the original one has two tasks running again. Of course, if you kill the remaining worker you will have no connector or tasks running. Kafka Connect is intended to be used with a cluster manager (e.g. Kubernetes, Mesos, etc.) to manage the workers (e.g. restart, autoscale, migrate, etc.). However, note that if you start the worker again then the connector and tasks will also start again. If you actually want to stop them you have to use the command:

> bin/connect-cli rm cassandra-sink

Or you can pause and resume connectors (another option, restart, actually stops and starts them again in whatever state they were in):

> bin/connect-cli pause cassandra-sink

...

> bin/connect-cli resume cassandra-sink

...

Also note that the actual REST API has finer grained controls and can, for example, pause and restart individual tasks for a connector.
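
For example (a sketch, again assuming the default port and the cassandra-sink connector used above):

> curl -X POST http://localhost:8083/connectors/cassandra-sink/restart          # restart the whole connector
> curl -X POST http://localhost:8083/connectors/cassandra-sink/tasks/1/restart  # restart just task 1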

Is that the full story on Kafka Connect? No, there’s more. For example Transformations!

A question I asked myself at the start of this exercise was “Do you need to run a schema registry in order to use Kafka Connect?” It turns out that the answer is “No”, as we’ve demonstrated here for several simple examples. However, for more complex schemas, to reduce the risk of run-time parsing errors due to data format exceptions (I saw a few, and they can kill the task thread), and to support schema evolution using Avro (which uses JSON for schemas and compact binary serialization), it may be a good idea.
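
If you do go down the Avro and schema registry path, the switch is made via the converter settings in the worker (or connector) configuration. A sketch of the two options, assuming a Confluent schema registry running on its default port (illustrative values, not the settings used for Kongo):

# plain JSON, no schema registry
value.converter=org.apache.kafka.connect.json.JsonConverter
value.converter.schemas.enable=false

# Avro with a schema registry
value.converter=io.confluent.connect.avro.AvroConverter
value.converter.schema.registry.url=http://localhost:8081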

3. Further Kafka Connect Resources

Lenses Cassandra Connector:

Running Kafka Connect in distributed mode:

Note that in distributed mode the connector configurations are not passed on the command line. Instead, use the REST API (or the connect-cli program) to create, modify, and destroy connectors.

Some Cassandra Connectors on github:

Here’s the Kongo code and sample connect property files for this blog.

4. The Biggest Sink

Australia is famous for “Big” tourist attractions, e.g. The Big Rock (Uluru), The Big Banana, etc. It turns out Australia also has the Biggest Sink (a bell-mouth or morning glory spillway): the Geehi Dam spillway in the Snowy Mountains hydro-electric scheme. It’s 32 m in diameter with a massive capacity of 1,557 cubic metres a second; it can drain an Olympic-sized swimming pool every 1.6 seconds. Note the abseiler in this photo for scale:

I couldn’t find a photo of the Geehi sink spilling, but here’s a similar one from California (drone video):

The post Apache Kafka “Kongo” Part 4.2: Connecting Kafka to Cassandra with Kafka Connect appeared first on Instaclustr.

What’s New in DSE Graph 6

Introduction

We’re pleased to introduce you to the great new features we’re rolling out with DataStax Enterprise (DSE) Graph 6.

DSE Graph is tightly integrated as an optional component within DSE, providing a unique experience where graph, search, transactional analytics, management, and developer tools are provided through the industry’s best distributed cloud database designed for hybrid cloud.

DataStax continues to lead innovation in the graph database market via the various graph innovations being released in DSE 6, which include:

  • Integration with DSE Advanced Performance
  • Smart analytics query routing
  • Advanced schema management
  • Batch Fluent API Gremlin support
  • Support for TinkerPop 3.3.0
  • Great user experience enhancements to Studio

And much more.

DSE Graph adoption is increasing rapidly, and we’re very excited to see all the positive feedback and real-world wins from DSE Graph usage.

With this increased adoption, we’ve been receiving more and more practical product requests and feedback. The release of DSE 6 puts DSE Graph on a very exciting path of ever-increasing unification with DSE. The future is bright for DSE users looking to build applications using the best distributed cloud database available.

Now, let’s review some of the great features introduced in DSE Graph 6.

Advanced Performance

DSE 6 provides differentiating performance through DSE Advanced Performance. DSE Graph receives many benefits from this game-changing innovation implemented in DataStax’s distribution of Apache Cassandra™ because DataStax’s distribution of Cassandra is the storage layer for DSE Graph.

We are seeing much better throughput for DSE Graph 6 compared to previous versions. For example, one-hop traversal performance (g.V().out() style traversals) has increased almost 50% while two-hop traversal (g.V().out().out() style traversals) performance has increased almost 60%. The extra throughput means DSE Graph users will receive even more value from DSE 6 as DSE Graph is able to handle more requests per node.

Smart Analytics Query Routing

DSE Graph 5.1 introduced a first-of-its-kind innovation with DSE Graph Frames. This feature provides a powerful method to work with full graphs, or large subgraphs, for transactional analytics purposes (think histograms of data contained in the graph), mass data transformation/movement, and Apache Spark™-based machine learning algorithms. DSE Graph Frames provides additional analytical graph functionality on top of the standard Gremlin OLAP features.

With DSE Graph 5.1, we quickly learned that some items like transactional analytics were much faster using DSE Graph Frames vs. Gremlin OLAP.  That’s why with the release of DSE 6, graph users will no longer have to choose which analytical implementation to use to perform a graph analytics operation. With DSE 6, the DSE Graph engine will automatically route a Gremlin OLAP traversal to the correct implementation, resulting in the fastest and best execution for end users.

With DSE 6, Graph users will receive a simplified experience that’s proven to provide the fastest execution for graph analytic traversals.

Advanced Schema Management

DSE Graph was built based on years of graph experience by the team that has built both Apache TinkerPop™ and the Titan graph database. The team knows how to solve distributed graph problems.

The initial releases of DSE Graph provided a schema that was very flexible to add to but otherwise fixed: to provide a simplified production experience, users were not given the ability to remove existing graph schema elements, like vertex labels or properties. With DSE Graph 6, users will now be able to remove any graph schema element they need, giving them a schema experience very similar to the rest of DSE.

Batch Fluent API Gremlin support

One of the major areas of innovation that DataStax is driving in the graph database market is the Apache TinkerPop graph processing framework, specifically the Gremlin query language. With the release of Gremlin version 3, Gremlin has evolved into the standard distributed graph processing language. One of the newer Gremlin features introduced over this past year is Gremlin bytecode, which has enabled Gremlin Language Variants (GLVs). DataStax is leading the way in implementing enterprise-ready drivers that know how to leverage GLVs to provide a superior graph experience, through a feature named the DataStax Driver Fluent API.

Scylla Enterprise Release 2018.1.1

The Scylla team is pleased to announce the release of Scylla Enterprise 2018.1.1, a production-ready Scylla Enterprise minor release. Scylla Enterprise 2018.1.1 is a bug fix release for the 2018.1 branch, the latest stable branch of Scylla Enterprise.

The 2018.1 branch is based on Scylla open source 2.1 and includes backported bug fixes from upstream releases (1.7, 2.0, 2.1) as well as enterprise-only bug fixes. Read more about Scylla Enterprise here.

Related Links

Scylla Enterprise customers are encouraged to upgrade to Scylla Enterprise 2018.1.1 in coordination with the Scylla support team.

Bug Fixes in this Release (With Open Source Issue Number Where Applicable):

  • Upgrading to latest version of the RHEL kernel causes Scylla to lose access to the RAID 0 data directory. #3437 A detailed notice has been sent to all relevant customers.
  • An upgrade from Scylla 2017.1 to 2018.1 may cause frequent and redundant schema updates.  #3394
  • Multi-DC writes may fail during schema changes. #3393
  • Installing Scylla 2018.1 on an old CentOS kernel results in the following error “systemd[5370]: Failed at step CAPABILITIES spawning /usr/bin/scylla: Invalid argument”. In Scylla 2018.1.1 the dependency on kernels later than kernel-3.10.0-514.el7 is explicitly added to Scylla Enterprise package spec, making Scylla installations on older kernels impossible. #3176

Known Issues

Scylla Enterprise 2018.1.1 was not validated on IBM POWER8 architecture. If you are using 2018.1 on POWER, please wait for future notices on upgrade availability. In the meantime, you should explicitly install Scylla Enterprise 2018.1.0 rather than the latest 2018.1 release.

Next Steps

  • Learn more about Scylla from our product page.
  • See what our users are saying about Scylla.
  • Download Scylla. Check out our download page to run Scylla on AWS, install it locally in a Virtual Machine, or run it in Docker.
  • Take Scylla for a Test drive. Our Test Drive lets you quickly spin-up a running cluster of Scylla so you can see for yourself how it performs.

The post Scylla Enterprise Release 2018.1.1 appeared first on ScyllaDB.

Announcing OpsCenter 6.5

DataStax OpsCenter is an easy-to-use visual management and monitoring solution enabling administrators, architects, and developers to quickly provision, monitor, and maintain DataStax Enterprise (DSE) clusters, which are built on the best distribution of Apache Cassandra™.

Today, I am pleased to announce the general availability of DataStax OpsCenter 6.5, fully supporting DataStax Enterprise (DSE) 6. Let me provide a quick tour of some of the enhancements found in OpsCenter 6.5.

Introduction to OpsCenter

Apache Cassandra has made great strides as a technology since its initial 2008 release, being deployed by companies to power their mission-critical applications. But as we know, with any distributed system, management can be difficult.

DataStax OpsCenter is the key piece of the tricky management puzzle, helping to surface critical issues and solve others that you couldn’t see with just open source Apache Cassandra. Whether it’s being notified of best practices, automating backups, or quickly understanding the health of your deployment, OpsCenter makes your management and monitoring push-button simple. In this blog, we’ll see what OpsCenter 6.5 offers to further simplify your operational needs.

Upgrade Service

Maintaining and upgrading databases can be a daunting task. At DataStax, we understand this and want to help you. As a sequel to the successful launch of the OpsCenter LifeCycle Manager Deploy feature, the Upgrade Service allows you to seamlessly perform patch [1] upgrades of DSE clusters without sacrificing robustness, simplicity, or peace of mind. The initial release will allow you to perform DSE patch version upgrades, a more frequently required operation than major version upgrades.

This release has several key features to simplify your upgrade experience:

  • Simple and Flexible upgrade options
  • Improved robustness and auditability
  • Optimal default cluster settings

Fig 1: OpsCenter Upgrade Feature

NodeSync Service

DSE NodeSync automatically and continuously synchronizes replicas of designated keyspaces and tables of a cluster as a background process. Unlike the Repair Service, the NodeSync Service does not build a Merkle tree for comparisons and stream differences for compaction. Rather, it incrementally scans a token range, compares data one page at a time, and sends data over the standard write path.

OpsCenter provides a window for you to view NodeSync task statuses and take corrective measures. With the click of a button, OpsCenter allows you to enable or disable NodeSync on tables and keyspaces. Furthermore, as an admin, you will also be able to drill down and view the details at a per-table level. This allows for a much more flexible way to view the tasks, understand which tables are not being synchronized before the gc_grace_seconds period, and take corrective measures to fix the issue.
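
Under the covers, NodeSync is a per-table option, so it can also be turned on (or off) with plain CQL; a sketch, with a made-up keyspace and table name:

ALTER TABLE my_keyspace.my_table WITH nodesync = {'enabled': 'true'};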

For more details, refer to our documentation to see all the features that improve operator visibility.

Fig 2: NodeSync Service

New DSE Advanced Performance Metrics

DSE 6 ships with a new suite of advanced performance optimizations and tools, called DSE Advanced Performance. In version 6, the new thread-per-core feature significantly boosts out-of-the-box performance for both read and write operations. The enhanced performance comes from a slick architectural design, which aims at having a non-blocking thread per CPU core, eliminating the need to coordinate with other cores and executing its assigned work at the maximum possible speed.

To further support this functionality, OpsCenter 6.5 adds thread-per-core metrics to enable users to assess the health of the DSE nodes and identify performance bottlenecks. We will continue to enhance our support for thread-per-core and add many more functionalities in upcoming releases to improve operator efficiency.

Fig 3: Thread per core (TPC)  Metrics

Support for DSE 6

OpsCenter 6.5 fully supports DSE 6 to provision, manage, and monitor your DSE clusters. We’ve also added support for new features such as AlwaysOn SQL that’s in DSE Analytics.

In addition to all of the above goodness, we’ve spent a lot of time focusing on general product stability improvements.

Fig 4: AlwaysOn SQL

Conclusion

In summary, this latest version of OpsCenter gives you a wealth of new features and enhancements that make it even easier to provision and monitor DSE clusters.

For more information on OpsCenter 6.5, see our online documentation. To try OpsCenter 6.5 in your own environment, download a copy today and be sure to let us know what you think.

DataStax is a registered trademark of DataStax, Inc. and its subsidiaries in the United States and/or other countries. Apache Cassandra and Cassandra are trademarks of the Apache Software Foundation or its subsidiaries in Canada, the United States and/or other countries.

[1] For example: DSE 5.0.7 to DSE 5.0.10 or DSE 5.1.2 to DSE 5.1.4

Cassandra information using nodetool

Cassandra nodetool provides several types of commands to manage your Cassandra cluster. See my previous post about Cassandra nodetool for an orientation to the types of things you can do with this helpful Cassandra administration tool. Here, I am sharing details about one type — getting Cassandra information about your installation using nodetool. These can be used to get different displays of the status and other insights into the Cassandra nodes and full cluster.

Cassandra nodetool is installed along with the database management software, and is used on the command line interface (e.g., inside the Terminal window).

Cassandra Information: Nodes

Let’s start with some very basic information about the node.

nodetool version 

This will show the version of Cassandra running on this node. Another way to get similar information is by using cassandra -v.

Example output:

ReleaseVersion: 3.11.2

nodetool info 

In the same way that the popular nodetool status (see below) provides a single-glance overview of the cluster, this command provides a quick overview of the node. It is a convenient way, for example, to see the memory usage.

Example output:

ID                     : 817788af-4209-44df-9ae8-dc345376c946
Gossip active          : true
Thrift active          : false
Native Transport active: true
Load                   : 749.07 KiB
Generation No          : 1526478319
Uptime (seconds)       : 15813
Heap Memory (MB)       : 72.76 / 95.00
Off Heap Memory (MB)   : 0.01
Data Center            : DC1
Rack                   : RAC1
Exceptions             : 0
Key Cache              : entries 35, size 2.8 KiB, capacity 4 MiB, 325 hits, 396 requests, 0.821 recent hit rate, 14400 save period in seconds
Row Cache              : entries 0, size 0 bytes, capacity 0 bytes, 0 hits, 0 requests, NaN recent hit rate, 0 save period in seconds
Counter Cache          : entries 0, size 0 bytes, capacity 2 MiB, 0 hits, 0 requests, NaN recent hit rate, 7200 save period in seconds
Percent Repaired       : 100.0%
Token                  : (invoke with -T/--tokens to see all 256 tokens)
 

Cassandra Information: Cluster

Similarly, nodetool can provide basic information about the full cluster:

nodetool describecluster 

This will show a quick view of some of the important cluster configuration values: name, snitch type, partitioner type, and schema version.

Example output:

Cluster Information:
Name: Dev_Cluster
Snitch: org.apache.cassandra.locator.GossipingPropertyFileSnitch
DynamicEndPointSnitch: enabled
Partitioner: org.apache.cassandra.dht.Murmur3Partitioner
Schema versions:
414b0faa-ac94-3062-808b-8f1e6776d456: [172.16.238.2, 172.16.238.3, 172.16.238.4, 172.16.238.5, 172.16.238.6, 172.16.238.7]

The “Schema versions” segment is especially important because it identifies any schema disagreements you might have between nodes. Every time the schema is changed, it is propagated to the other nodes. Sometimes you might see persistent schema disagreements, which could indicate that one of the nodes is down; this is often resolved by restarting that node or by doing a rolling restart of the entire cluster.

nodetool status 

If you run just one nodetool command on a server when you log in, this is it: a brief output of node state, address, and location.

Example output:

Datacenter: DC1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address       Load        Tokens  Owns (effective)  Host ID                               Rack
UN  172.16.238.2  749.07 KiB  256     31.8%             817788af-4209-44df-9ae8-dc345376c946  RAC1
UN  172.16.238.3  471.93 KiB  256     33.6%             a2e9a7b2-d665-4272-8327-ae7fbb0cf712  RAC2
UN  172.16.238.4  749.49 KiB  256     34.6%             603b610a-f8e3-476c-9952-5de57418ccff  RAC3
Datacenter: DC2
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address       Load        Tokens  Owns (effective)  Host ID                               Rack
UN  172.16.238.5  528.98 KiB  256     34.0%             f9bbf676-75c9-47ca-8826-1e8f0a3268e4  RAC1
UN  172.16.238.6  566.61 KiB  256     33.9%             93fd527c-e244-4aef-ab3f-2a6ee1f1d917  RAC2
UN  172.16.238.7  809.37 KiB  256     32.1%             25b1f103-85f9-4020-a4b3-8d1912443f55  RAC3
 

Cassandra Information: Backups

Other nodetool status commands can provide information about backups:

nodetool statusbackup 

Use this to view the status of incremental backups. If you have turned on incremental backups (e.g., via nodetool enablebackup), then the status will be running. If the incremental backups are disabled, the status will be not running.

Example output:

running

nodetool listsnapshots 

Use this command to view the snapshots that exist on this node. This assumes you have created at least one snapshot (e.g., with nodetool snapshot).

Example output:

Snapshot Details:
Snapshot name Keyspace name      Column family name True size Size on disk
1526402908000 system_distributed parent_repair_history          0 bytes 13 bytes
1526402908000 system_distributed repair_history                 0 bytes 13 bytes
1526402908000 system_distributed view_build_status              0 bytes 13 bytes
1526402908000 keyspace1          standard1 1.51 MiB  1.51 MiB
1526402908000 keyspace1          counter1 0 bytes   864 bytes
1526402908000 system_auth        roles 0 bytes   13 bytes
1526402908000 system_auth        role_members 0 bytes   13 bytes
1526402908000 system_auth        resource_role_permissons_index 0 bytes   13 bytes
1526402908000 system_auth        role_permissions 0 bytes   13 bytes
1526402908000 system_traces      sessions 0 bytes   13 bytes
1526402908000 system_traces      events 0 bytes   13 bytes
Total TrueDiskSpaceUsed: 1.51 MiB

Cassandra Information: Data and Schema

Understand the details of your schema and data around the cluster with the following commands:

nodetool ring 

This command will produce a very long output of all tokens (partition key hashes) on a given node.

Example output:

Datacenter: DC1
==========
Address       Rack Status State   Load Owns         Token
172.16.238.3  RAC2 Up     Normal 421.98 KiB     32.70% -9220489979566737719
172.16.238.2  RAC1 Up     Normal 589.17 KiB     34.48% -9115796826660667716
172.16.238.3  RAC2 Up     Normal 421.98 KiB     32.70% -9100537612334946272
...

nodetool describering 

View detailed information on tokens present on a given node. Use a keyspace name along with this command (e.g., nodetool describering keyspace1).

Example output:

Schema Version:d7d68b06-5c21-3aa4-a2e4-f44eff6e25e3
TokenRange: TokenRange(start_token:2266716358050113757, end_token:2267497540130521369, endpoints:[172.16.238.6, 172.16.238.3], rpc_endpoints:[172.16.238.6, 172.16.238.3], endpoint_details:[EndpointDetails(host:172.16.238.6, datacenter:DC1, rack:RAC2), EndpointDetails(host:172.16.238.3, datacenter:DC2, rack:RAC1)]) TokenRange(start_token:-3767342014734755518, end_token:-3764135679630864587, endpoints:[172.16.238.6, 172.16.238.5], rpc_endpoints:[172.16.238.6, 172.16.238.5], endpoint_details:[EndpointDetails(host:172.16.238.6, datacenter:DC1, rack:RAC2), EndpointDetails(host:172.16.238.5, datacenter:DC2, rack:RAC3)]) TokenRange(start_token:-7182326699472165951, end_token:-7168882311135889918, endpoints:[172.16.238.3, 172.16.238.6], rpc_endpoints:[172.16.238.3, 172.16.238.6], endpoint_details:[EndpointDetails(host:172.16.238.3, datacenter:DC2, rack:RAC1), EndpointDetails(host:172.16.238.6, datacenter:DC1, rack:RAC2)]) TokenRange(start_token:-4555990503674633274, end_token:-4543114046836888769, endpoints:[172.16.238.5, 172.16.238.4], rpc_endpoints:[172.16.238.5, 172.16.238.4], endpoint_details:[EndpointDetails(host:172.16.238.5, datacenter:DC2, rack:RAC3), EndpointDetails(host:172.16.238.4, datacenter:DC1, rack:RAC3)])...

nodetool rangekeysample

This command will display a distribution of keys around the cluster.

Example output:

RangeKeySample: 2401899971471489924 8125817615588445820 6180648043275199428 -7666714398617260110 -59419700177700973...

nodetool viewbuildstatus 

Materialized views are populated in the background. This command will show the status of this building process. Specify the keyspace and view name (e.g., nodetool viewbuildstatus keyspace1 mv1). As the process runs, the output will change accordingly. (Note that materialized views are not recommended to be used in production.)

Example output:

keyspace1.mv1 has not finished building; node status is below.
Host          Info
/172.16.238.4 STARTED
/172.16.238.2 STARTED
/172.16.238.7 UNKNOWN

Later:

keyspace1.mv1 has not finished building; node status is below.
Host          Info
/172.16.238.4 SUCCESS
/172.16.238.2 STARTED
/172.16.238.7 SUCCESS

Finally:

keyspace1.mv1 has finished building

nodetool getendpoints 

Use getendpoints to find the node(s) holding a particular partition key, given the keyspace, table name, and key.

First, retrieve the key:

select key from keyspace1.standard1 where [your search terms];
 key
------------------------
 0x3138324b305033384e30

Then use the key to find the node(s):

nodetool getendpoints keyspace1 standard1 3138324b305033384e30
172.16.238.7
172.16.238.2
 

Cassandra Information: Processes

The inner workings of the Cassandra cluster are made more clear with the following nodetool commands.

nodetool compactionstats 

This command will display active and pending compactions.

Compactions are often intensive in terms of I/O and CPU, so monitoring pending compactions is often useful if you see any performance degradation or want to track operations such as repairs, SSTable upgrades, or rebuilds.

Example output:

pending tasks: 0

nodetool compactionhistory 

Similarly, this command will show completed compactions.

Example output:

4ed95830-5907-11e8-a690-df4f403979ef keyspace1     standard1 2018-05-16T12:47:35.731 1461715  1461715 {1:6357}
2de781b0-5907-11e8-a690-df4f403979ef keyspace1     standard1 2018-05-16T12:46:40.459 1462110  1461715 {1:6357}

nodetool gcstats 

A summarized view of garbage collection (gc) for a given node, in milliseconds since the last gc, is shown with this command. This might be useful for a dashboard display or for quick insight into the server.

Example output:

Interval (ms)  Max GC Elapsed (ms)  Total GC Elapsed (ms)  Stdev GC Elapsed (ms)  GC Reclaimed (MB)  Collections  Direct Memory Bytes
     20073305                 9154                  71095                    703         4353376560          305                   -1

nodetool statusgossip 

The node reports briefly whether or not it is communicating metadata to/from other nodes with this command. The default will be on unless it has been turned off (e.g., with nodetool disablegossip), perhaps for maintenance. If it is on, the output will be running. If gossip is disabled, the output will be (as you might guess) not running.

Example output:

Running

nodetool gossipinfo 

Assuming gossip is enabled, this will show what this node is communicating to other nodes in the cluster about itself.

Example output:

/172.16.238.2
  generation:1526415339
  heartbeat:21284
  STATUS:21216:NORMAL,-1122832159022483270
  LOAD:21238:6569245.0
  SCHEMA:12719:d386650c-2a99-336d-a7a8-9c25a4f39801
  DC:8:DC1
  RACK:10:RAC1
  RELEASE_VERSION:4:3.11.2
  INTERNAL_IP:21218:172.16.238.2
  RPC_ADDRESS:3:172.16.238.2
  NET_VERSION:1:11
  HOST_ID:2:817788af-4209-44df-9ae8-dc345376c946
  RPC_READY:58:true
  TOKENS:21215:<hidden>
...

nodetool statushandoff 

Hinted handoff might be turned off manually if you think a node will be down for too long (longer than max_hint_window_in_ms), or to avoid the traffic of hints over the network when the node recovers. If it is running, the output will be Hinted handoff is running. If it has been disabled, the output will be Hinted handoff is not running.

Example output:

Hinted handoff is running

nodetool tpstats 

Usage statistics of the thread pools are shown with this command. The details of thread pool statistics warrant a separate blog entry, but meanwhile, the Cassandra documentation will provide a basic overview.

Example output:

Pool Name                         Active Pending Completed Blocked  All time blocked
ReadStage                              0 0 1789 0              0
MiscStage                              0 0 0 0              0
CompactionExecutor                     0 0 13314 0              0
MutationStage                          0 0 36069 0              0
MemtableReclaimMemory                  0 0 82 0              0
PendingRangeCalculator                 0 0 12 0              0
GossipStage                            0 0 132003 0              0
SecondaryIndexManagement               0 0 0 0              0
HintsDispatcher                        0 0 3 0              0
RequestResponseStage                   0 0 15060 0              0
Native-Transport-Requests              0 0 3390 0              2
ReadRepairStage                        0 0 6 0              0
CounterMutationStage                   0 0 0 0              0
MigrationStage                         0 0 10 0              0
MemtablePostFlush                      0 0 118 0              0
PerDiskMemtableFlushWriter_0           0 0 82 0              0
ValidationExecutor                     0 0 0 0              0
Sampler                                0 0 0 0              0
MemtableFlushWriter                    0 0 82 0              0
InternalResponseStage                  0 0 5651 0              0
ViewMutationStage                      0 0 0 0              0
AntiEntropyStage                       0 0 0 0              0
CacheCleanupExecutor                   0 0 0 0              0

Message type           Dropped
READ                         0
RANGE_SLICE                  0
_TRACE                       0
HINT                         0
MUTATION                   843
COUNTER_MUTATION             0
BATCH_STORE                  0
BATCH_REMOVE                 0
REQUEST_RESPONSE             0
PAGED_RANGE                  0
READ_REPAIR                  0
 

Cassandra Information: Performance Tuning

Finally, the following nodetool commands will provide information to support Cassandra performance tuning efforts.

nodetool proxyhistograms 

This command will give an overall sense of node performance, showing count and latency in microseconds. The values will be calculated as an average of the last 5 minutes.

Example output:

proxy histograms
Percentile       Read Latency   Write Latency   Range Latency   CAS Read Latency   CAS Write Latency   View Write Latency
                     (micros)        (micros)        (micros)           (micros)            (micros)             (micros)
50%                   5839.59         8409.01        30130.99               0.00                0.00                 0.00
75%                  14530.76        20924.30        89970.66               0.00                0.00                 0.00
95%                  52066.35       155469.30       268650.95               0.00                0.00                 0.00
98%                 107964.79       223875.79       268650.95               0.00                0.00                 0.00
99%                 155469.30       268650.95       268650.95               0.00                0.00                 0.00
Min                    263.21          315.85         5839.59               0.00                0.00                 0.00
Max                 386857.37       464228.84       268650.95               0.00                0.00                 0.00

nodetool tablehistograms 

This command displays count and latency as a measure of performance for a particular table. Specify the keyspace and table name when running tablehistograms. These values are also calculated over the last 5 minutes.

Example output:

keyspace1/standard1 histograms
Percentile      SSTables   Write Latency   Read Latency   Partition Size   Cell Count
                                (micros)       (micros)          (bytes)
50%                 0.00          152.32         454.83              NaN          NaN
75%                 0.00          654.95        1629.72              NaN          NaN
95%                 0.00         4866.32       10090.81              NaN          NaN
98%                 0.00         8409.01       25109.16              NaN          NaN
99%                 0.00         8409.01       30130.99              NaN          NaN
Min                 0.00           20.50          51.01              NaN          NaN
Max                 0.00        14530.76      155469.30              NaN          NaN

nodetool tablestats 

This will show the read/write count and latency per keyspace, and detailed count, space, latency, and indexing information per table. Note that this used to be called nodetool cfstats, so you will still see some references to that tool around (and cfstats is still aliased to tablestats).

nodetool toppartitions 

This command displays top partitions used for a table during a specified sampling period, in milliseconds. Cardinality is a count of unique operations in the sample. Count is the total number of operations during the sample per partition. The third column in the partition display (+/-) indicates the margin of error; an estimate rather than an exact count is captured to avoid overhead. Specify the keyspace, table name, and desired duration of sample.

Example output:

WRITES Sampler:
Cardinality: ~3 (256 capacity)
Top 10 partitions:
Partition  Count +/-
31344f50364f34343430 39 38
314c4c33373737333830 39 38
4b394e304e31334d4c30 39 38
4b32304e4d38304e3730 39 38
5034333737304b324d30 39 38
383335344e4b354c5030 39 38
343037374e33384b3331 39 38
3438504e324d4f345030 39 38
4f4f3834334f33333731 39 38
30353131333531343331 39 38

READS Sampler:
Cardinality: ~2 (256 capacity)
Top 10 partitions:
Partition Count +/-
4b3137304d3435303930 37 36
3431324b32334d354c30 37 36
4e31313739354c393930 37 36
34335039334c50303630 37 36
334e3437503839313131 37 36
314f31303933364c3130 37 36
4d4c36343533354b3431 37 36
4c4d334e4f4e31343530 37 36
4e324d394f3450343130 37 36
39364c30503634333930 37 36

The above has been a review of the informational commands available with Cassandra nodetool. Again, see my previous post for an orientation to Cassandra nodetool, and stay tuned for future posts on combining nodetool commands for administration tasks such as backups, performance tuning, and upgrades.

 

Apache Kafka Managed Service Now Available with Instaclustr

Instaclustr announces the immediate availability of Managed Apache Kafka on the Instaclustr Platform. Apache Kafka adds to Instaclustr’s existing offerings of Apache Cassandra, Apache Spark and Elassandra, providing customers with the opportunity to use a single managed service provider for a complete suite of leading open source data processing and storage technologies delivering reliability at scale.

Apache Kafka is the leading streaming and queuing technology for large-scale, always-on applications. Apache Kafka is widely used in application architectures to fill needs including:

  • Provide a buffering mechanism in front of a processing system (i.e. cope with a temporary incoming message rate greater than the processing application can handle)
  • Allow producers to publish messages with guaranteed delivery even if the consumers are down when the message is published
  • As an event store for event sourcing or Kappa architecture
  • Facilitating flexible, configurable architectures with many producers and many consumers by decoupling the details of who or what is consuming messages from the apps that produce them (and vice versa)
  • Performing stream analytics (with Kafka Streams)

Delivered through the Instaclustr Platform, Instaclustr’s Managed Apache Kafka provides the management features that have made Instaclustr’s Managed Apache Cassandra the leading managed service for Cassandra:

  • Support on AWS, Azure, GCP and IBM cloud
  • Automated provisioning and configuration of clusters with Kafka and Zookeeper
  • Run in our cloud provider account with a fixed, infrastructure-inclusive cost, or use your own cloud provider account
  • Provision, configure and monitor your cluster using the Instaclustr Console or REST APIs.
  • Management of access via IP ranges or security groups.
  • Option of connection to your cluster using public IPs or private IPs and VPC peering.
  • Private network clusters with no public IPs.
  • Covered by SOC2 certification (from GA release)
  • Highly responsive, enterprise-grade, 24×7 support from Instaclustr’s renowned support team.

Instaclustr has been running Apache Kafka in production for internal use since 2017 and for the last few months has been working with Early Access Program customers SiteMinder, Lendi, Kaiwoo and Paidy to ensure our offering is well suited to a range of customer requirements.

Our early access program has already delivered benefits to participating customers such as Lendi:

“We see Apache Kafka as a core capability for our architectural strategy as we scale our business. Getting set up with Instaclustr’s Kafka service was easy and significantly accelerated our timelines. Instaclustr consulting services were also instrumental in helping us understand how to properly use Kafka in our architecture.” Glen McRae, CTO, Lendi

and Siteminder:

“As very happy users of Instaclustr’s Cassandra and Spark managed services, we’re excited about the new Apache Kafka managed service. Instaclustr quickly got us up and running with Kafka and provided the support we needed throughout the process.” Mike Rogers, CTO, SiteMinder

The Public Preview period for Instaclustr Managed Kafka is expected to run until 25th June. Following Public Preview, full SLAs will apply. Instaclustr’s Managed Kafka is ready for full production usage with SLAs available up to 99.95%. Our technical operations team is ready to migrate existing Kafka clusters to Instaclustr with zero downtime.

For more information on Instaclustr’s Managed Apache Kafka offering please contact sales@instaclustr.com or sign up for a free trial.

The post Apache Kafka Managed Service Now Available with Instaclustr appeared first on Instaclustr.

Ola Cabs on Their First Two Years of Using Scylla in Production


This is a guest blog post by Soumya Simanta, an architect at Ola Cabs, the leading ride-hailing service in India.

On-demand ride-hailing is a real-time business where responding to quick spikes in demand patterns is a critical need. These spikes are more profound during unusual circumstances such as bad weather conditions, special events, holidays, etc. It is critical that our software systems are architected to support this real-time nature of our business. However, creating a distributed software system that can satisfy this requirement is a challenge. An important component of any web-scale system is the database. Given the rapid growth of our organization, we wanted our choice of database to support some critical quality attributes. Our primary concerns were support for high throughput, low latency, and high availability (multi-zone and multi-datacenter support). Other key requirements were that the product is open source, required minimal maintenance and administration, had a rich ecosystem of tools and, finally, a database that is battle-tested in production.

Like many web-scale companies, we quickly realized that we do not need an ACID-compliant database for all our use cases. With proper design, we could model many of our systems around “eventual consistency,” thereby trading off consistency while gaining all the other goodness that comes with an AP-compliant database such as Cassandra. Although we were already using Cassandra for some use cases, we were not sure if it was the best long-term fit for us, primarily because it leverages the JVM runtime and therefore has the well-known latency issues and tuning overhead that come with the JVM’s garbage collector.

Around early 2016 a new database, Scylla, caught our attention. The primary reason for our interest was that Scylla was advertised as a drop-in replacement for Cassandra, was written in native language (C++) and was developed by the creators of battle-tested software products such as the KVM hypervisor, OSv, and Seastar. While we recognized that Scylla did not yet support all the features of Cassandra and that it was not yet battle-tested in production, we were intrigued by the close-to-the-hardware approach they had taken in building their database. Very early on, we had discussions with Scylla’s core team and they assured us that they would soon add many important features. With this assurance, we were confident that a long-term bet on Scylla would yield a positive outcome for us. We created a roadmap where we would gradually adopt Scylla into our software ecosystem based on application criticality.

For our first use case in March of 2016, we deployed Scylla 1.0 in passive mode along with another database as our primary datastore. Only writes were performed to Scylla. As expected, Scylla being write-optimized by design performed reasonably well in terms of latency and throughput. However, what was surprising for us was the stability of the database. There were no errors or crashes and it ran without any maintenance for more than 3 months. This was very promising for us. Next, we started using Scylla 1.4 for both reads and writes. We had to modify our data model, which is a limitation of the Cassandra architecture itself and not specific to Scylla. For the initial few cases, we performed our installation of Scylla from scratch. However, we quickly realized that we were not getting the performance that was advertised. So we moved to an official AMI image that has tuned OS and database configurations provided by the ScyllaDB folks.

Our initial configurations were based on using EBS volumes. The advantage of using EBS was the protection against data loss in case of node failures. However, for certain use cases, we decided to move to ephemeral (NVMe) disks. In another system, we used Scylla as a persistent cache where we could tolerate the risk of data loss due to ephemeral node failures. We mitigated the risk of disk failures by setting the correct replication factor, storing data across multiple Availability Zones (AZs) and ensuring that we could regenerate data quickly. This gave us a significant performance boost in terms of latencies because Scylla is optimized for NVMe and an extra network hop is avoided as compared to using EBS.

We have deployed Scylla across multiple microservices and platforms in our organization. We currently run two types of workloads. First are real-time reads at high throughput along with batch writes from our machine learning pipelines and Spark jobs at daily or hourly intervals. The second type of workload is cases with a uniform mix of reads and writes. We initially used the standard Cassandra driver from Java. For the last year, we have been using the Gocql connector for Golang for most of our new use cases. In our experience, our overall application latencies provided by the Go-based apps are better with less jitter and significantly less application memory footprint when compared with Java-based applications. In most cases, we are using Scylla without a caching layer because by using the proper hardware for the server nodes we were able to get our target latency profiles (< 5ms 99-tile) with just Scylla. This not only resulted in a simplified architecture but also helped us save on the hardware cost for the caching layer.

Figure 1. Application Latency Profile (application Go code running on Docker, using the Gocql driver for Scylla 2.1 on a 5-node i3.8xlarge cluster)

Figure 2. Scylla Prometheus Dashboard (for the application shown in Figure 1)

Figure 3: Application Latency Profile

 

Another valuable feature that was introduced in Scylla was native metrics monitoring support for Prometheus. This was extremely helpful in debugging and identifying bottlenecks in our data model and sharing our results for debugging with the Scylla community. Sharing all the metrics with the Scylla community and core developers reduced the performance debugging cycle. Additionally, the metrics enabled us to make cost-performance tradeoffs by selecting the right hardware specs for our latency/throughput requirements. Another takeaway was that by tweaking the schema (e.g., reduction in partition sizes) we could get significant improvements in performance. We highly recommend that all Scylla users configure the out-of-box monitoring provided by Scylla from day one.

In our two-year journey of using it in production, Scylla has lived up to our expectations. We have graduated from using Scylla for very simple and non-critical use cases to deploying it for some of our mission-critical flows. Scylla’s development team has delivered on their core promise of doing continuous releases to push out new features but has never compromised on the quality or performance of their product. The team is also very receptive to its community’s needs and is always ready to help on Scylla’s Slack channel and mailing lists. Some features we are looking forward to are the secondary indexes (currently in experimental mode), incremental repair, support for Apache Cassandra 3.x storage (finally patches arrived in the dev mailing list) and user-defined types (UDTs).

The post Ola Cabs on Their First Two Years of Using Scylla in Production appeared first on ScyllaDB.

What’s New With Drivers for DSE 6

The last major DataStax Enterprise (DSE) release was really, really big, and after it we had to look in the mirror and ask ourselves: how can we continue to raise this bar? The answer? DSE 6.

With this most recent major DSE release, we took what was already the best distribution of Apache Cassandra™ and made it faster, more resilient, and more intuitive to operate. For a breakdown of these server-side improvements a la carte, head on over to the DataStax Enterprise 6 blog post.  In this DataStax Drivers blog post, we will give you the client-side scoop on what’s been added to facilitate interacting with DSE 6.

DSE Graph Fluent API Batches

DataStax Drivers now offer the ability to execute DSE Graph statements in batches. To accomplish this, they leverage CQL batches under the covers, so these DSE Graph Fluent API batches are subject to the same considerations. It’s advised to limit them to vertices and edges that share the same partition key, or that involve very few unique partitions, to reduce coordinator burden and the number of DSE nodes involved. The design was built with ease of use in mind (see Java Example, Python Example).

DSE Graph Fluent API – C# and Node.js

We are very proud to also announce that DSE Graph Fluent API support has extended to more of the DataStax Drivers-supported languages. This continues to build on the momentum that Java and Python started and now this same programmatic style of writing and executing DSE Graph queries through the DataStax Drivers can be accomplished through the C# and Node.js DataStax Drivers. Check out the C# blog post and Node.js documentation for more detail.

Driver Metadata

NodeSync Information in Table Metadata

Goodbye Repair, Hello DSE NodeSync. Repair in Apache Cassandra™ is the process of making sure that the data on disk across your cluster is in sync. Though this may sound simple, we know that operating and monitoring this action in previous DSE versions was a burden. With DSE 6, this burden is gone and NodeSync now does this for you automatically.

The DataStax Drivers allow you to view NodeSync information via the table metadata. See this example showing how to verify that NodeSync is enabled on a given table and enjoy the peace of mind that comes with knowing that the data you are receiving in your applications is correct and uniform across your underlying servers.

Ports Added to Host Metadata

To facilitate the development of applications that need to connect to both native storage and JMX ports, we added these to the Host class metadata. Happy coding.

Continuous Paging

For those who soaked up DSE 5.1, the concept of Continuous Paging may sound familiar. If you have not yet reaped the benefits via DSE Analytics, then give it some thought in DSE 6, because we have made this feature even sturdier and more efficient.

At a high level, the performance gains produced by this feature are sourced from the fact that we are continuously preparing result pages on the server side in response to a query. This removes the inefficient chitter chatter from client to server that occurs with normal paging when requesting the next pages of results.

With DSE 6 we’ve improved the communication mechanism in the paging solution such that the driver specifies a number of pages of rows with the initial request and then requests more as it consumes them. Tangibly, this allows Continuous Paging queries to be safely made alongside other queries over the same driver connection without disrupting the connection when the server is producing results faster than the client can handle.

The shiny new DataStax Bulk Loader also sports the improved performance profile that Continuous Paging delivers. I suggest giving that a look if you are loading or unloading data in large quantities.

Prepared Statement Robustness

Recent DataStax Drivers releases have made strides in making prepared statements more durable.

If you have multiple clients connected to the same host, we fixed a case where the driver metadata could become invalid if the underlying schema had been altered (i.e., a column has been added or dropped). If you had been working around the issue documented here by not preparing your SELECT * statements, you can now remove that workaround if you upgrade to the latest version of DSE and the DataStax Drivers. See this example to hammer this home.

Keyspace Per Query Support

To wrap up, we addressed the situation when your queries and keyspaces are independent of one another. In the most recent driver versions, you can now supply the keyspace as an option for your statements, rather than resorting to having separate sessions per keyspace or some other hackery. These examples illustrate this usability improvement.

Additional Search Conditionals

The integration of DSE Search was further tightened in DSE 5.1 with the ability to control search indexes through CQL. This journey continues with DSE 6 as native CQL queries can now leverage search indexes for a wider array of CQL query functionality and indexing support. The DataStax Drivers remain in lock step with the DSE server side functionality and the new search conditionals can be found in the Java Driver Query Builder ( see example ).

Get Started Now

You can download the new DSE 6 drivers now and check out our online documentation for help and guidance.

Mutant Monitoring System (MMS) Day 12 – Coding with Java Part 2

This is part 12 of a series of blog posts that provides a story arc for Scylla Training.

In the previous post, we explained how to create a sample Java application that executes a few basic CQL statements with a Scylla cluster using the DataStax Cassandra driver. After the code was deployed, we found that several citizens were murdered by mutants because the code was too static and not scalable. Changes must be made in order for Division 3 to protect people better by building highly scalable and performant applications to monitor mutants. In this post, we will explore how to optimize the existing Java code with prepared statements.

What Are Prepared Statements?

Prepared statements will enable developers at Division 3 to optimize our applications more efficiently. Most, if not all, of the Cassandra-compatible drivers support prepared statements, so what you learn here can benefit you regardless of the programming language used. A prepared statement is basically a query that is parsed by Scylla and then saved for later use. One of the useful benefits is that you can reuse that query over and over, supplying different values for variables such as names, addresses, and locations. Let’s dive a little deeper to see how it works.

When asked to prepare a CQL statement, a client library will send the CQL statement to Scylla. Scylla will then create a unique fingerprint for that CQL statement by MD5 hashing it. Scylla uses this hash to check its query cache to see if it has already cached that CQL statement. If Scylla has seen that CQL statement before, it will send back a reference to the cached statement. If Scylla does not have that unique query hash in its cache, it will proceed to parse the query and insert the parsed output into its cache.

INSERT INTO tb (key, val) VALUES (?,?)

The client will then be able to send an execute request specifying the statement id and providing the (bound) variables.

For more information on prepared statements, please click here. Now let’s go over how to change our existing Java application to support prepared statements.

Changing the Existing Java Application

To get started, we will first need to add a few libraries to our application:

import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.BoundStatement;

The PreparedStatement and BoundStatement libraries provide the functions to create prepared statements. Moving on, we can add two prepared statements to our application:

The first prepared statement is named insert. This statement will add data programmatically for first_name, last_name, address, and picture_location based on input from the application. The second prepared statement is named delete and will delete entries in the table based on input gathered for first_name and last_name. We will reuse these statements later to add and delete data in the mutant_data table.
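
A sketch of what preparing the two statements might look like (the session object and the catalog.mutant_data table are assumed from the earlier posts in this series, so treat the exact strings as illustrative):

PreparedStatement insert = session.prepare(
        "INSERT INTO catalog.mutant_data (first_name, last_name, address, picture_location) VALUES (?, ?, ?, ?)");
PreparedStatement delete = session.prepare(
        "DELETE FROM catalog.mutant_data WHERE first_name = ? AND last_name = ?");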

The first section of the application to replace is the insertQuery function as follows:

This function will take input for first_name, last_name, address, and picture_location and then bind to our prepared statement named insert and execute the query. By using prepared statements, we can reuse these functions over and over to add data to the catalog table.
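
A minimal sketch of such a function, assuming the insert prepared statement and session above:

private static void insertQuery(String firstName, String lastName, String address, String pictureLocation) {
    // Bind the supplied values to the insert prepared statement and execute it
    BoundStatement bound = insert.bind(firstName, lastName, address, pictureLocation);
    session.execute(bound);
}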

The second section of the application to replace is the deleteQuery function as follows:

In this function, we will take first_name and last_name inputs and then bind and execute the delete prepared statement. Using this prepared statement, we can reuse these functions over and over to delete data from the catalog table.
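
And a matching sketch for the delete, again assuming the delete prepared statement above:

private static void deleteQuery(String firstName, String lastName) {
    // Bind the name to the delete prepared statement and execute it
    BoundStatement bound = delete.bind(firstName, lastName);
    session.execute(bound);
}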

Finally, we need to modify the main function as follows to pass input to the functions when the application starts:
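A sketch of what the modified main method could look like; the sample mutants used here are purely illustrative, and the cluster field is assumed from the setup in Part 1:

public static void main(String[] args) {
    // Show the current contents of the catalog table
    selectQuery();
    // Add two mutants; the table contents are displayed again after each insert
    insertQuery("Mike", "Tyson", "1515 Main St", "http://www.facebook.com/mtyson");
    insertQuery("Alex", "Jones", "56 Townsend Rd", "http://www.facebook.com/ajones");
    // Delete the mutants that were just added; the table contents are shown after each delete
    deleteQuery("Mike", "Tyson");
    deleteQuery("Alex", "Jones");
    // Close the connection so the application exits cleanly
    cluster.close();
}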

First, the contents of the catalog table will be displayed, followed by two calls to the insertQuery function to add two additional mutants. After each insert, the contents of the table will be displayed. Finally, each mutant that was added is deleted, and the contents of the table are shown after each delete.

With the coding part done, let’s bring up the Scylla Cluster and then run the sample application in Docker.

Starting the Scylla Cluster

The Scylla Cluster should be up and running with the data imported from the previous blog posts.

The MMS Git repository has been updated to provide the ability to automatically import the keyspaces and data. If you already have the Git repository cloned, you can simply do a “git pull” in the scylla-code-samples directory; otherwise, clone it and change into the mms directory:

git clone https://github.com/scylladb/scylla-code-samples.git
cd scylla-code-samples/mms

Modify docker-compose.yml and add the following line under the environment: section of scylla-node1:

- IMPORT=IMPORT

Now we can build and run the containers:

docker-compose build
docker-compose up -d

After roughly 60 seconds, the existing MMS data will be automatically imported. When the cluster is up and running, we can run our application code.

Building and Running the Java Example

To build the application in Docker, change into the java subdirectory in scylla-code-samples:

cd scylla-code-samples/mms/java

Now we can build and run the container:

docker build -t java .
docker run -d --net=mms_web --name java java

To connect to the shell of the container, run the following command:

docker exec -it java sh

Finally, the sample Java application can be run:

java -jar App.jar

The application will print the contents of the catalog table after each insert and delete operation.

Conclusion

In this post we explained what prepared statements are and how they can help the developers at Division 3 make their applications more efficient. We also learned how to modify our existing Java application to take advantage of prepared statements. Division 3 recommends that you keep experimenting with prepared statements and continue to make your applications more efficient.

Stay safe out there!

Next Steps

  • Learn more about Scylla from our product page.
  • See what our users are saying about Scylla.
  • Download Scylla. Check out our download page to run Scylla on AWS, install it locally in a Virtual Machine, or run it in Docker.
  • Take Scylla for a Test drive. Our Test Drive lets you quickly spin-up a running cluster of Scylla so you can see for yourself how it performs.

The post Mutant Monitoring System (MMS) Day 12 – Coding with Java Part 2 appeared first on ScyllaDB.

Mutant Monitoring System (MMS) Day 11 – Coding with Java Part 1

This is part 11 of a series of blog posts that provides a story arc for Scylla Training.

In the previous post, we explained how a Scylla Administrator can back up and restore a cluster. With the number of mutants on the rise, Division 3 decided that we must use more applications to connect to the mutant catalog, and hired Java developers to create powerful applications that can monitor the mutants. In this post, we will explore how to connect to a Scylla cluster using the Cassandra driver for Java.

When creating applications that communicate with a database such as Scylla, it is crucial that the programming language being used has support for database connectivity. Since Scylla is compatible with Cassandra, we can use any of the available Cassandra libraries. For example, in Go there are Gocql and Gocqlx, in Node.js there is cassandra-driver, and for Java there is the driver available from Datastax. Since Division 3 wants to start investing in Java, let’s begin by writing a sample Java application.

Creating a Sample Java Application

The sample application that we will go over will connect to a Scylla cluster, display the contents of the Mutant Catalog table, insert and delete data, and show the contents of the table after each action. We will first go through each section of the code and then explain how to run the code in a Docker container that will access the Scylla Mutant Monitoring cluster.

In the application, we first need to import the Datastax Cassandra driver for Java:
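Based on the classes described in the next paragraph, the imports would look something like this (all of them live in the com.datastax.driver.core package):

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;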

The Cluster class provides the ability to connect to the cluster. The ResultSet class lets us run queries such as SELECT statements and store the output. The Row class allows us to get the values of a CQL row returned in a ResultSet. The Session class provides the ability to connect to a specific keyspace.

For this application, the main class is called App and the code should be stored in a file called App.java. After the class is defined, we can begin defining which cluster and keyspace to use:
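A minimal sketch of that setup; the contact point follows the scylla-node1 naming used in docker-compose.yml later in this post, and the catalog keyspace is the one used throughout the series:

public class App {

    // Connect to one of the Scylla nodes and use the catalog keyspace
    private static final Cluster cluster = Cluster.builder()
            .addContactPoint("scylla-node1")
            .build();
    private static final Session session = cluster.connect("catalog");
}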

To display the data, we will need to create a function that runs a SELECT statement against the Scylla cluster. The function below uses the ResultSet, Session, and Row classes to gather the data and print it on the screen:
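A sketch of such a function inside the App class, assuming the first_name, last_name, and address columns described elsewhere in the series:

public static void selectQuery() {
    System.out.println("Displaying the contents of the Mutant Catalog table:");
    ResultSet results = session.execute("SELECT * FROM catalog.mutant_data");
    for (Row row : results) {
        // Print a few columns from each returned row
        System.out.println(row.getString("first_name") + " "
                + row.getString("last_name") + ", " + row.getString("address"));
    }
}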

The next function is used to insert a new mutant into the Mutant Catalog table. Since this is a simple query, only the Session class is used. After the insert takes place, the previous function, selectQuery(), is called to display the contents of the table with the new data added:
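A sketch of the insert function; the sample mutant and picture URL are purely illustrative:

public static void insertQuery() {
    // A plain (non-prepared) INSERT executed directly through the Session
    session.execute("INSERT INTO catalog.mutant_data (first_name, last_name, address, picture_location) "
            + "VALUES ('Mike', 'Tyson', '1515 Main St', 'http://www.facebook.com/mtyson')");
    // Show the table contents now that the new mutant has been added
    selectQuery();
}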

After the data is added and displayed in the terminal, we will delete it and then display the contents of the table again:
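And a matching sketch of the delete function, under the same assumptions:

public static void deleteQuery() {
    // Remove the mutant that was just inserted
    session.execute("DELETE FROM catalog.mutant_data WHERE first_name = 'Mike' AND last_name = 'Tyson'");
    // Show the table contents again after the delete
    selectQuery();
}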

This is the main method of the application and the starting point for any Java application. In this function, we call the functions that were created above, in order. When all of them have completed, the application exits:
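A sketch of the main method under the same assumptions:

public static void main(String[] args) {
    // Run each step in order: show the table, add a mutant, then remove it
    selectQuery();
    insertQuery();
    deleteQuery();
    // Close the cluster connection so the application can exit
    cluster.close();
}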

With the coding part done, let’s bring up the Scylla Cluster and then run the sample application in Docker.

Starting the Scylla Cluster

The Scylla Cluster should be up and running with the data imported from the previous blog posts.

The MMS Git repository has been updated to provide the ability to automatically import the keyspaces and data. If you already have the Git repository cloned, you can simply do a “git pull” in the scylla-code-samples directory; otherwise, clone it and change into the mms directory:

git clone https://github.com/scylladb/scylla-code-samples.git
cd scylla-code-samples/mms

Modify docker-compose.yml and add the following line under the environment: section of scylla-node1:

- IMPORT=IMPORT

Now we can build and run the containers:

docker-compose build
docker-compose up -d

After roughly 60 seconds, the existing MMS data will be automatically imported. When the cluster is up and running, we can run our application code.

Building and Running the Java Example

To build the application in Docker, change into the java subdirectory in scylla-code-samples:

cd scylla-code-samples/mms/java

Now we can build and run the container:

docker build -t java .
docker run -d --net=mms_web --name java java

To connect to the shell of the container, run the following command:

docker exec -it java sh

Finally, the sample Java application can be run:

java -jar App.jar

The application will print the contents of the Mutant Catalog table after each step: once at startup, once after the insert, and once after the delete.

Conclusion

In this post, we explained how to create a sample Java application that executes a few basic CQL statements with a Scylla cluster using the Datastax Cassandra driver. This is only the basics and there are more interesting topics that Division 3 wants developers to explore. In the next post, we will go over prepared statements using the Java driver.

In the meantime, please be safe out there and continue to monitor the mutants!

Next Steps

  • Learn more about Scylla from our product page.
  • See what our users are saying about Scylla.
  • Download Scylla. Check out our download page to run Scylla on AWS, install it locally in a Virtual Machine, or run it in Docker.
  • Take Scylla for a Test drive. Our Test Drive lets you quickly spin-up a running cluster of Scylla so you can see for yourself how it performs.

The post Mutant Monitoring System (MMS) Day 11 – Coding with Java Part 1 appeared first on ScyllaDB.