TypeScript Support in the DataStax Node.js Drivers

TypeScript declarations for the driver are now contained in the same package and repository. These declarations will be maintained and kept in sync with the JavaScript API.

Additionally, you can now use DSE-specific types, such as the geo, auth, and graph types, from TypeScript.

Getting started

To get started with the Node.js driver for Apache Cassandra in a TypeScript project, install the driver package:

npm install cassandra-driver

Making Progress

We announced last week that ScyllaDB has closed another round of funding. And while funding on its own hasn’t been a goal of ours, it is a mechanism to implement many of our key goals. For example, our recent round will allow us to sharply grow our development team and double down on the database ecosystem.

Looking back at the 18 months since our last round of funding, it’s gratifying to recognize the progress we’ve made. We grew the number of paying customers by 5X! We’ve advanced 80 places on DB-Engines. We’ve launched an impressive set of new features and capabilities to tackle whole new use cases, as we’ve documented in more than 100 blog posts!

Progress in functionality

As I wrote earlier, we’ve added many awesome new features in the last year and a half, most notably Workload Prioritization. I’m really excited about this feature, and not just because no other NoSQL database is even close to offering this capability. Workload Prioritization is the culmination of years of effort we’ve put into our CPU schedulers, I/O schedulers, tagging of every function(!) in our code with a priority-class tag and much more.
In short, Workload Prioritization enables users to consolidate clusters, isolate workloads, and protect their production workloads from development workloads and explorations. If you use Scylla, it’s an excellent advanced feature that brings demonstrable benefits while being trivial to configure.

I’ll give an example. As you can see in the graphs below, a single 3-node cluster is running 3 different workloads with throughputs of 150k, 110k, and 50k ops (310k ops combined), while the latencies of the most important workload are the lowest of the three. Workload Prioritization comes into play when Scylla runs at 100% CPU; it’s worth mentioning that below the maximum CPU utilization point, none of the workloads is capped and all of them can use all of the available resources.

If it looks like I am getting carried away with Workload Prioritization, my apologies. It represents a lot of progress for Scylla. Other log-structured merge-tree databases still struggle with compaction, while for Scylla it is a solved problem. Repair in Scylla is also a solved problem; we implemented managed repair in Scylla Manager and also improved the Merkle tree algorithm with the introduction of row-level repair.

Scylla officially supports materialized views, and our secondary indexes are scalable and can be global and/or local. With the upcoming release of Scylla Manager, backup will be fully managed and another checkbox will be marked.

The team has made plenty of progress with auxiliary tools, from Spark-driven migration to much better Grafana monitoring and client enhancements. Scylla’s modified Cassandra drivers can hint to the Scylla server that a request should bypass the cache, so the in-memory resident workload stays cached while range scans or rare accesses don’t touch the cache. In addition, the Scylla drivers are shard-aware, which eliminates CPU core hot spots.

Project Alternator, our Amazon DynamoDB-compatible API

Last week we released Project Alternator, an open source project that delivers lots of advantages over the original DynamoDB. Since the launch, we released a new version of our monitor dashboards with support for the DynamoDB API and a Spark-based migration tool. There will be much more to come in the days ahead and more announcements at the Scylla Summit.

Scylla Cloud

In April we released Scylla Cloud, a fully managed service for Scylla on AWS. Today we have many customers who use it in production and we continue to see high demand for it. Scylla Cloud is multi-region and multi-zone with all of the capabilities of Scylla yet with zero maintenance overhead for the user. Our staff are responsible for the uptime, latencies, backups, repairs — everything.

Next month Scylla Cloud will gain the ability to run within a customer’s own AWS account and thus meet the privacy and security requirements of customers who cannot store data outside their premises. We also listen closely to our customers, so you can look forward to us adding Azure and GCP support in the first half of 2020. Thanks for your patience!

Near-term Scylla enhancements

There are several key features to come, some of which have been long anticipated. At the Scylla Summit we will unveil the first beta release of our Lightweight Transactions (LWT). LWT is initially implemented using Paxos and will later be replaced by a Raft-based implementation. We will also release a beta of Change Data Capture (CDC) at our Summit. We took a different approach here than Cassandra did: instead of a commitlog-based implementation, we create a table of changes. Not only is this an elegant solution that allows users to consume the updates through CQL, but the updates themselves are eventually consistent. You can find out more about this feature at the Summit!

Lastly, we have big plans around our User Defined Function (UDF) implementation and have been looking to base a map-reduce capability on top of it. Fundamental new features like CDC, UDF and LWT will allow us to take Scylla to the next level and be much more than just a data store.

Trends

As Scylla improves, we see more use-cases and growing demand from users who come from databases other than Cassandra. Quoting Alexys Jacob from Numberly: “Scylla entered our infrastructure on latency and scale-sensitive use cases. That’s what it does very well in the first place and it has proven to be successful. But as more and more people get their hands on Scylla and understand its capabilities they naturally challenge their usual database of choice and we now have web backends using Scylla as well.”

There are cases where Scylla is almost inevitable, such as replacing a large Cassandra deployment. However, we see growing demand from MySQL users. They do not come to Scylla because we simplify their lives — restructuring a SQL database into a NoSQL data model isn’t trivial — but they make the move either due to requirements for fault tolerance, always-on availability or horizontal scaling to keep up with their pace of growth.

More companies and industries are becoming data driven. For instance, we have several brick-and-mortar companies that have acquired data-driven companies that were customers of ours. There is a hunger for a system that can handle their massive data volumes while maintaining both low latency and high availability, all at an affordable price. Our goal is to be the best stateful infrastructure that provides for these needs.

Over this past year we’ve added several market-leading features to Scylla, and a growing number of Fortune 500 companies are now getting on board. The next two to three years will be extremely interesting. I invite you to give Scylla a try for the first time or, if you’re already familiar, I’d like to challenge you to expand your Scylla usage and consider it as an alternative to other databases.

To hear more from me on our company and our products or — more importantly — hear directly from our engineers and our customers, please join us at Scylla Summit 2019, November 5 & 6 in San Francisco. I look forward to meeting with you then! And don’t delay! Early bird discounts are about to end on September 30.

REGISTER FOR SCYLLA SUMMIT

The post Making Progress appeared first on ScyllaDB.

Removing Support for Outdated Encryption Mechanisms

At Instaclustr, security is the foundation of everything we do. We are continually working towards compliance with additional security standards as well as conducting regular risk reviews of our environments. This blog post outlines some technical changes we are making that both increases the security of our managed environment and enables compliance with a wider range of security standards.

From October 9, 2019 AEST newly provisioned Instaclustr clusters running recent versions of Cassandra and Kafka will have support for the SSLv3, TLSv1.0 and TLSv1.1 encryption protocols disabled and thus require the use of TLS 1.2 and above. From this date, we will also begin working with customers to roll this change out to existing clusters.

Instaclustr-managed clusters that will be affected are:

  • Apache Cassandra 3.11+
  • Apache Kafka 2.1+

Why are we doing this?

The protocols we are disabling are out of date, have known vulnerabilities, and are not compliant with a range of public and enterprise security standards. All identified clients that support these versions of Cassandra and Kafka also support TLS 1.2.

How can I test if I will be affected?

Cassandra

The cqlsh CLI will need to be changed to request TLSv1.2 (otherwise it defaults to TLSv1.0).  Assuming a cqlshrc file based on the Instaclustr example, the updated entry should be:

[ssl]
certfile = full_path_to_cluster-ca-certificate.pem
validate = true
factory = cqlshlib.ssl.ssl_transport_factory
version = TLSv1_2

Note: If running cqlsh on Mac OS X, the system Python is not up to date and will not support the TLSv1_2 option. You should instead manually update your system Python or run cqlsh from a Docker container.

Clients built on the DataStax Apache Cassandra Java driver can create a custom SSLContext that requires TLSv1.2 to be used, e.g.

ctx = SSLContext.getInstance("TLSv1.2", "SunJSSE");

If the client is able to successfully connect, then it confirms that your Java environment supports TLSv1.2 (i.e. is recent enough and is not configured to disable it).

Kafka

If using the official Apache Kafka Java client (or the Instaclustr ic-kafka-topics tool), the client configuration can be updated to allow only TLSv1.2. For example, based on the Instaclustr example configuration, the enabled protocols setting becomes:

ssl.enabled.protocols=TLSv1.2

If the client is able to successfully connect, then it confirms that your Java environment supports TLSv1.2 (i.e. is recent enough and is not configured to disable it).

Summary

We understand that testing and changing systems is a time-consuming process. Given the widespread support for TLSv1.2 we do not anticipate that this change will actually impact any current systems. 

If you have any questions or concerns, please do not hesitate to contact us at support@instaclustr.com.

 

The post Removing Support for Outdated Encryption Mechanisms appeared first on Instaclustr.

ApacheCon 2019: DataStax Announces Cassandra Monitoring Free Tier, Unified Drivers, Proxy for DynamoDB & More for the Community

It’s hard to believe that we’re celebrating the 20th year of the Apache Software Foundation. But here we are—and it’s safe to say open source has come a long way over the last two decades.

We just got back from ApacheCon, where DataStax—one of the major forces behind the powerful open source Apache Cassandra™ database—was a platinum sponsor this year. 

We don’t know about you, but we couldn’t be more excited about what the future holds for software development and open source technology in particular.

During CTO Jonathan Ellis’ keynote, we announced three exciting new developer tools for the Cassandra community:

  • DataStax Insights (Cassandra Performance Monitoring)
  • Unified Drivers
  • DataStax Spring Boot Starter for the Java Driver

 

While we’re at it, in my talk “Happiness is a hybrid cloud with Apache Cassandra,” I announced our preview release for another open source tool: DataStax Proxy for DynamoDB™ and Apache Cassandra.

This tool enables developers to run their AWS DynamoDB™ workloads on Cassandra. With this proxy, developers can run DynamoDB workloads on-premises to take advantage of the hybrid, multi-model, and scalability benefits of Cassandra.

These tools highlight our commitment to open source and will help countless Cassandra developers build transformative software solutions and modern applications in the months and years ahead. 

Let’s explore each of them briefly.

1. DataStax Insights (Cassandra Performance Monitoring)

Everyone who uses Cassandra—whether they’re developers or operators—stands to benefit from DataStax Insights, a next-generation performance management and monitoring tool that is included with DataStax Constellation, DataStax Enterprise, and open source Cassandra 3.x and higher. 

We’re now offering free sign-ups for DataStax Insights (or, better said, Cassandra Monitoring), giving Cassandra users an at-a-glance health index and a single view of all their clusters. The tool also enables users to optimize their clusters using AI to recommend solutions to issues, highlight anti-patterns, and identify performance bottlenecks, among other things.

DataStax Insights is free for all Cassandra users for up to 50 nodes and includes one week of rolling retention. Interested in joining the DataStax Insights early access program? We’re taking sign-ups now (more on that below). 


2. Unified Drivers

Historically, DataStax has maintained two sets of drivers: one for DataStax Enterprise and one for open source Cassandra users. Moving forward, we are merging these two sets into a single unified DataStax driver for each supported programming language, including C++, C#, Java, Python, and Node.js. As a result, each unified driver will work for both Cassandra and DataStax products.

This move benefits developers by simplifying driver choice, which makes it easier to determine which driver to use when building applications. At the same time, developers using the open source version of Cassandra will now have free access to advanced features that initially shipped with our premium solutions. Further, developers that have previously used two different sets of drivers will now only need to use one driver for their applications across any DataStax platform and open source Cassandra. This will help with enhanced load balancing and reactive streams support. 

3. Spring Boot Starter 

The DataStax Java Driver Spring Boot Starter, which is now available in DataStax Labs, streamlines the process of building standalone Spring-based applications with Cassandra and DataStax databases.

Developers will enjoy that this tool centralizes familiar configuration in one place while providing easy access to the Java Driver in Spring applications.

It’s just one more way that makes the application development process easier.

4. DataStax Proxy for DynamoDB™ and Apache Cassandra™

With the DataStax Proxy for DynamoDB and Cassandra, developers can run DynamoDB workloads on-premises, taking advantage of the hybrid, multi-model, and scalability benefits of Cassandra.

The proxy is designed to enable users to back their DynamoDB applications with Cassandra. We determined that the best way to help users leverage this new tool and to help it flourish was to make it an open source Apache 2 licensed project.

The code consists of a scalable proxy layer that sits between your app and the database. It provides compatibility with the DynamoDB SDK which allows existing DynamoDB applications to read and write data to Cassandra without application changes.


Sign up for the DataStax Insights early access program today!

Are you interested in optimizing your on-premises or cloud-based Cassandra deployments using a platform that lets novices monitor and fine-tune their cluster performance like experts? 

If so, you may want to give DataStax Insights a try. 

We’re currently accepting sign-ups to our early access program. Click the button below to get started!

GET STARTED

DataStax Labs

DataStax Labs provides the Apache Cassandra™ and DataStax communities with early access to developer-focused product previews and enhancements that are being considered for future production software, including tools, aids, and partner software designed to increase productivity. When you try out some of our new Labs technologies, we would love your feedback, good or bad, so let us know!

Making Ad Tech Work at Scale: MediaMath Innovates with Scylla


As it enters its 13th year of business, MediaMath is leading the charge to create an accountable, addressable, transparent supply chain that is more aligned to brands’ interests. MediaMath provides a globally scaled, enterprise-grade ad tech platform that delivers personalized content across “touchpoints” that include display ads, mobile, video, advanced TV, native, audio, and digital out-of-home.

Supporting advertisers in 42 countries around the world, MediaMath has customers in all verticals, including retail, consumer packaged goods, travel, and finance. Notable customers include IBM, Uber and a wide range of marquee clients who use MediaMath to elevate their marketing.

MediaMath plays a key role in the ad tech landscape by unifying two underlying platform technologies: a demand-side platform (DSP) and a data management platform (DMP).

A DSP is a system that enables marketers and agencies to buy media, globally, from a multitude of sources through open and private markets. A DSP enables marketers to optimize campaigns based on performance indicators like effective cost per click (eCPC), and effective cost per action (eCPA).

A DMP is essentially a data warehouse: it ingests, sorts, and stores information and presents it to marketers and publishers. DMPs are used for modeling, analyzing, and segmenting online customers in digital marketing.

MediaMath bridges data management and demand-side platforms

By unifying DMP and DSP, MediaMath is able to bridge data management and media activation. These combined capabilities enable programmatic buying strategies that scale campaigns and improve their overall performance.

MediaMath provides several other offerings. MediaMath Audiences is a data solution that identifies the best customers using predictive modeling. MediaMath Brain, a machine-learning algorithm, increases advertiser ROI by making millions of buying decisions per second in real time.

Programmatic advertising at global scale presents significant technical hurdles. MediaMath’s customers expect a real-time response to campaign activity along with segmented audiences at scale. The sheer data volume from media touchpoints can be staggering. The underlying technologies need to support massive throughput, measured in transactions per second. Bid-matching analytics must support queries with real-time operational performance characteristics.

Initially, MediaMath had 60 nodes running Apache Cassandra on AWS servers. The number of nodes and the complexity of Cassandra was too much, operationally, for the MediaMath team. Just keeping Cassandra up and running took three full-time site reliability engineers. When nodes went down, restoring them was an intensive manual process. Combined with the testing and maintenance drudgery around compactions, JVM garbage collection, and constant tuning, Cassandra soon wore out its welcome.

MediaMath found out about Scylla on Reddit. An engineer on the team read about this new drop-in replacement for Cassandra. Noting how Scylla is implemented in C++ instead of Java, he evangelized it inside MediaMath.

Deciding to take a look, the team settled on two fundamental criteria. First, the new database needed to deliver on its claim of Cassandra compatibility. Second, it needed to prove out performance benchmarks. “We wanted to make sure that what we were switching onto was not going to incur a lot of development work on our side, and that worked out great,” said Knight Fu, Director of Engineering at MediaMath.

The installation and evaluation process went smoothly. “Of all of the database migrations I’ve worked on in my career, Scylla was the smoothest, for sure,” Knight noted. “There weren’t any tooling changes that were required for our automation either, which was a huge plus. From our point of view, one day it was Cassandra, and the next day it was Scylla. Seemed as though it was as easy as that.”

A key driver in the decision to go with Scylla is its lower operational overhead. Ultimately, Scylla enabled MediaMath to realize efficiency gains and pivot resources to focus on other important projects.

Today, MediaMath runs 17 Scylla nodes on i3.metal AWS instances, handling about 200,000 events per second and performing about a million reads per second. Read latency is consistently below the company’s required 10-millisecond threshold.

“With Scylla, our uptime was tremendous throughout the last holiday. We saw 99.9999% availability with Scylla.”

One of the main benefits for MediaMath is Scylla’s compatibility with Cassandra. “Our access pattern into the database was relatively new, so being able to keep our schemas was a huge benefit. Not having to retool monitoring and active node management when going from Cassandra to Scylla was also really valuable.”

Emera Trujillo is Senior Product Manager at MediaMath. From her perspective, Scylla has helped MediaMath enhance the data management platform service for their customers. “Thanks to Scylla, we’re increasing data retention in our segmentation product, which lets us deliver new value to clients without increasing prices,” she said.

Want to learn more about Scylla first-hand and hear other success stories from our growing user base? Come meet us at Scylla Summit 2019!

REGISTER FOR SCYLLA SUMMIT

The post Making Ad Tech Work at Scale: MediaMath Innovates with Scylla appeared first on ScyllaDB.

Top 5 Reasons to Choose Apache Cassandra Over DynamoDB

Overview

DynamoDB and Apache Cassandra are both very popular distributed data store technologies. Both are used successfully in many applications and production-proven at phenomenal scale. 

At Instaclustr, we live and breathe Apache Cassandra (and Apache Kafka). We have many customers at all levels of size and maturity who have built successful businesses around Cassandra-based applications. Many of those customers have undertaken significant evaluation exercises before choosing Cassandra over DynamoDB and several have migrated running applications from DynamoDB to Cassandra. 

This blog distills the top reasons that our customers have chosen Apache Cassandra over DynamoDB.

Reason 1: Significant Cost of Writes to DynamoDB

For many use cases, Apache Cassandra can offer a significant cost saving over DynamoDB. This is particularly the case for write-heavy requirements. The cost of a write to DynamoDB is five times the cost of a read (reflected directly in your AWS bill). For Apache Cassandra, writes are several times cheaper than reads (reflected in system resource usage).

Reason 2: Portability

DynamoDB is available in AWS and nowhere else. For multi-tenant SaaS offerings where only a single instance of the application will ever exist, being all-in on AWS is not a major issue. However, many applications, for a lot of good reasons, still need to be installed and managed on a per-customer basis, and many customers (often the largest ones!) will not want to run on AWS. Choosing Cassandra allows your application to run anywhere you can run a Linux box.

Reason 3: Design Without Having to Worry About Pricing Models

DynamoDB’s pricing is complex, with two different pricing models and multiple pricing dimensions. Applying the wrong pricing model or designing your architecture without considering pricing can result in order-of-magnitude differences in cost. This also means that a seemingly innocuous change to your application can dramatically impact cost. With Apache Cassandra, you have your infrastructure and you know your management fees; once you have completed performance testing and confirmed that your infrastructure can meet your requirements, you know your costs.

Reason 4: Multi-Region Functionality

Apache Cassandra was the first NoSQL technology to offer active-active multi-region support. While DynamoDB has added Global Tables, these have a couple of key limitations when compared to Apache Cassandra. The most significant in many cases is that you cannot add replicas to an existing global table. So, if you set up in two regions and then decide to add a third you need to completely rebuild from an empty table. With Cassandra, adding a region to a cluster is a normal, and fully online, operation. Another major limitation is that DynamoDB only offers eventual consistency across Global Tables, whereas Apache Cassandra’s tunable consistency levels can enforce strong consistency across multiple regions.

Reason 5: Avoiding Vendor Lock-In

Apache Cassandra is true open source software, owned and governed by the Apache Software Foundation to be developed and maintained for the benefit of the community and able to be run in any cloud or on-premise environment. DynamoDB is an AWS proprietary solution that not only locks you in to DynamoDB but also locks your application to the wider AWS ecosystem. 

While these are the headline reasons that people make the choice of Apache Cassandra over DynamoDB, there are also many advantages at the detailed functional level such as:

  • DynamoDB’s capacity is limited by partition, with a maximum of 1,000 write capacity units and 3,000 read capacity units per partition. Cassandra’s capacity is distributed per node, which typically provides a per-partition limit orders of magnitude higher than this.
  • Cassandra’s CQL query language provides a simple learning curve for developers familiar with SQL.
  • DynamoDB only allows single-value partition and sort keys (sort keys are called clustering keys in Cassandra), while Cassandra supports multi-part keys; see the sketch just after this list. A minor difference, but another way Cassandra reduces application complexity.
  • Cassandra supports aggregate functions which in some use cases can provide significant efficiencies.
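To illustrate the multi-part key point, here is a sketch of a hypothetical Cassandra table whose partition key has two components and which has two clustering columns, something DynamoDB’s single hash key and single sort key cannot express directly (table and column names are invented for illustration):

CREATE TABLE orders_by_customer_and_country (
   customer_id uuid,
   country text,
   order_month date,
   order_id timeuuid,
   total decimal,
   PRIMARY KEY ((customer_id, country), order_month, order_id)
   );

-- (customer_id, country) is a two-part partition key; order_month and order_id
-- are clustering columns, so rows within a partition are sorted by month, then id.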

 

The post Top 5 Reasons to Choose Apache Cassandra Over DynamoDB appeared first on Instaclustr.

DataStax Proxy for DynamoDB™ and Apache Cassandra™ – Preview

Yesterday at ApacheCon, our very own Patrick McFadin announced the public preview of an open source tool that enables developers to run their AWS DynamoDB™ workloads on Apache Cassandra. With the DataStax Proxy for DynamoDB and Cassandra, developers can run DynamoDB workloads on premises, taking advantage of the hybrid, multi-model, and scalability benefits of Cassandra.

The Big Picture

Amazon DynamoDB is a key-value and document database which offers developers elasticity and a zero-ops cloud experience. However, the tight AWS integration that makes DynamoDB great for cloud is a barrier for customers that want to use it on premises.

Cassandra has always supported key-value and tabular data sets so supporting DynamoDB workloads just meant that DataStax customers needed a translation layer to their existing storage engine.

Today we are previewing a proxy that provides compatibility with the DynamoDB SDK, allowing existing applications to read/write data to DataStax Enterprise (DSE) or Cassandra without any code changes. It also provides the hybrid + multi-model + scalability benefits of Cassandra to DynamoDB users.

If you’re just here for the code you can find it in GitHub and DataStax Labs: https://github.com/datastax/dynamo-cassandra-proxy/

Possible Scenarios

Application Lifecycle Management: Many customers develop on premises and then deploy to the cloud for production. The proxy enables customers to run their existing DynamoDB applications using Cassandra clusters on-prem.

Hybrid Deployments: DynamoDB Streams can be used to enable hybrid workload management and transfers from DynamoDB cloud deployments to on-prem Cassandra-proxied deployments. This is supported in the current implementation and, like DynamoDB Global Tables, it uses DynamoDB Streams to move the data. For hybrid transfer to DynamoDB, check out the Cassandra CDC improvements which could be leveraged and stay tuned to the DataStax blog for updates on our Change Data Capture (CDC) capabilities.

What’s in the Proxy?

The proxy is designed to enable users to back their DynamoDB applications with Cassandra. We determined that the best way to help users leverage this new tool and to help it flourish was to make it an open source Apache 2 licensed project.

The code consists of a scalable proxy layer that sits between your app and the database. It provides compatibility with the DynamoDB SDK which allows existing DynamoDB applications to read and write data to Cassandra without application changes.

How It Works

A few design decisions were made when designing the proxy. As always, these are in line with the design principles that we use to guide development for both Cassandra and our DataStax Enterprise product.

Why a Separate Process?

We could have built this as a Cassandra plugin that would execute as part of the core process but we decided to build it as a separate process for the following reasons:

  1. Ability to scale the proxy independently of Cassandra
  2. Ability to leverage k8s / cloud-native tooling
  3. Developer agility and to attract contributors—developers can work on the proxy with limited knowledge of Cassandra internals
  4. Independent release cadence, not tied to the Apache Cassandra project
  5. Better AWS integration story for stateless apps (i.e., leverage CloudWatch alarm, autoscaling, etc.)

Why Pluggable Persistence?

On quick inspection, DynamoDB’s data model is quite simple. It consists of a hash key, a sort key, and a JSON structure which is referred to as an item. Depending on your goals, the DynamoDB data model can be persisted in Cassandra Query Language (CQL) in different ways. To allow for experimentation and pluggability, we have built the translation layer in a pluggable way that allows for different translators. We continue to build on this scaffolding to test out multiple data models and determine which are best suited for:

  1. Different workloads
  2. Different support for consistency / linearization requirements
  3. Different performance tradeoffs based on SLAs
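As a rough illustration only (the proxy’s actual translators may differ), one possible way to persist the DynamoDB model described above in CQL is to map the hash key to the partition key, the sort key to a clustering column, and the item to a JSON text blob:

CREATE TABLE dynamo_table (
   hash_key text,
   sort_key text,
   item text,               -- the DynamoDB item serialized as JSON
   PRIMARY KEY (hash_key, sort_key)
   );

-- Other translators could instead explode item attributes into a map<text, text>
-- or into real columns, trading write simplicity for queryability.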

Conclusion

If you have any interest in running DynamoDB workloads on Cassandra, take a look at the project. Getting started is easy and spelled out in the readme and DynamoDB sections. Features supported by the proxy are quickly increasing and collaborators are welcome.

https://github.com/datastax/dynamo-cassandra-proxy/

All product and company names are trademarks or registered trademarks of their respective owner. Use of these trademarks does not imply any affiliation with or endorsement by the trademark owner.


1. Often in the DynamoDB documentation, this key is referred to as a partition key, but since these are not one-to-one with DynamoDB partitions we will use the term hash key instead.

Why Developing Modern Applications Is Getting Easier

Historically, software was monolithic. In most cases, development teams would have to rewrite or rebuild an entire application to fix a bug or add a new feature. Building applications with any sense of speed or agility was largely out of the question, which is why software suites like Microsoft Office were generally released once a year.

Much has changed over the last decade or so. In the age of lightning-fast networks and instant gratification, leading software development teams are adopting DevOps workflows and prioritizing CI/CD so they can pump out stronger software releases much faster and much more frequently. 

Monthly, weekly, or even more frequent releases, for example, are becoming something closer to the norm. 

This accelerated release process is the result of the fact that—over several years—it’s become much easier to develop applications. 

Today, many engineering teams are utilizing new technologies to build better applications in less time, developing software with agility. Let’s take a look at four of the key technologies that have largely transformed the development process in recent years. 

1. Microservices

Microservices enable development teams to build applications that—you guessed it—are made up of several smaller services. 

Compared to the old-school monolithic approach, microservices speed up the development process considerably. Engineers can scale microservices independently of one another; updating or adding a feature no longer requires an entire rewrite of an application. 

Beyond that, microservices also bring more flexibility to developers. For example, developers can use their language of choice, building one service in Java and another in Node.js. 

The speed, flexibility, and agility microservices bring to the table have made it much easier to develop modern applications. Add it all up, and it comes as no surprise that a recent survey found that 91 percent of companies are using or plan to use microservices today.

2. Containers

Containers (think Docker) go hand-in-hand with microservices. Using containers, developers can create, deploy, and run applications in any environment.

At a very basic level, containers let developers “package” an application’s code and dependencies together as one unit. Once that package has been created, it can quickly be moved from a container to a laptop to a virtual server and back again. Containers enable developers to start, create, copy, and spin down applications rapidly.

It’s even easier to build modern applications with containers when you use Kubernetes to manage containerized workloads and services.

3. Open source tools

Docker and Kubernetes are both open source. So are Apache Cassandra™, Prometheus, and Grafana. There’s also Jenkins, which helps developers accelerate CI/CD workflows. With Jenkins, engineering teams can use automation to safely build, test, and deploy code changes, making it easier to integrate new features into any project.

Open source tools simplify the development process considerably. With open source, engineering teams get access to proven technologies that are built collaboratively by developers around the world to improve the coding process.

Not only does open source provide access to these tools, popular open source projects also have robust user communities that developers can turn to when they get stuck on something. 

4. Hybrid cloud

More and more companies are building applications in hybrid cloud environments because it enables them to leverage the best of what both the public and private cloud have to offer. 

For example, with hybrid cloud, you get the scalability of the public cloud while being able to use on-premises or private cloud resources to keep sensitive data secure (e.g., for HIPAA or GDPR compliance). What’s more, hybrid cloud also increases availability. In the event one provider gets knocked offline, application performance remains unchanged—so long as you have the right database in place.

The same sentiment holds true for multi-cloud or intercloud environments where organizations use several different cloud vendors to take advantage of each of their strengths, avoid vendor lock-in, or reduce the risk of service disruption. 

How does your development process compare?

If you’re not using microservices, containers, open source tools, and hybrid cloud environments to build applications, it’s time to reconsider your approach. 

The rise of these new technologies has given development teams the ability to pivot at a moment’s notice, incorporating user feedback to build new features and respond to incidents quickly and effectively.

Give them a try. It’s only a matter of time before you’ll start wondering why you didn’t think of it sooner.

Four Key Technologies That Enable Microservices (white paper)

READ NOW

How to Choose which Tombstones to Drop

Tombstones are notorious for causing issues in Apache Cassandra. They often become a problem when Cassandra is not able to purge them in a timely fashion. Delays in purging happen because there are a number of conditions that must be met before a tombstone can be dropped.

In this post, we are going to see how to make meeting these conditions more likely. We will achieve this by selecting which specific SSTables Cassandra should include in a compaction, which results in smaller and faster compactions that are more likely to drop the tombstones.

Before we start, there are a few parameters we need to note:

  • We will consider only cases where the unchecked_tombstone_compaction compaction option is turned off. Enabling this option makes Cassandra run compaction with only one SSTable. This achieves a similar result to our process, but in a far less controlled way.
  • We also assume the unsafe_aggressive_sstable_expiration option of TWCS is turned off. This option makes Cassandra drop entire SSTables once they expire without checking if the partitions appear in other SSTables.

How Cassandra Drops Tombstones

When we look at the source, we see that Cassandra will consider deleting a tombstone when an SSTable undergoes a compaction. Cassandra can only delete the tombstone if:

  • The tombstone is older than gc_grace_seconds (a table property).
  • There is no other SSTable outside of this compaction that:
    • Contains a fragment of the same partition the tombstone belongs to, and
    • Could contain data older than the tombstone (in practice, Cassandra compares the other SSTable’s minimum timestamp with the tombstone’s timestamp).

In other words, for a tombstone to be deleted, there cannot be any data that the tombstone suppresses outside of the planned compaction. Unfortunately, in many cases tombstones are not isolated and will touch other data.
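As an aside, since gc_grace_seconds gates all of this: it is an ordinary table option, so it can be checked and adjusted from cqlsh. A minimal sketch (the keyspace and table names are simply the ones that appear in the examples later in this post):

-- Show the table definition, including its current gc_grace_seconds:
DESCRIBE TABLE tlp_stress.sensor_data;

-- Lower it only with care: every replica must be repaired within
-- gc_grace_seconds of a delete, or deleted data can reappear.
ALTER TABLE tlp_stress.sensor_data WITH gc_grace_seconds = 86400;   -- 1 day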

The heavy weight solution to this issue is to run a major compaction. A major compaction will include all SSTables in one big compaction, so there are no SSTables not participating in it. However, this approach comes with a cost:

  • It can require at least 50% of the disk space to be free.
  • It consumes CPU and disk throughput at the expense of regular traffic.
  • It can take hours to compact TBs of data.

Instead, we can use a more lightweight solution that compacts only the SSTables containing a given partition, and no others.

Step 1 - Identify Problematic Partition

The first step is to find out which partition is the most problematic. There are various ways of doing this.

First, if we know our data model inside and out, we can tell straight away which partitions are tombstone-heavy. If it’s not obvious by looking at it, we can use one of the other options below.

For example, another option is to consult the Cassandra logs. When Cassandra encounters too many tombstones, it will log a line similar to this:

WARN  [SharedPool-Worker-4] 2019-07-15 09:24:15,971 SliceQueryFilter.java:308 - Read 2738 live and 6351 tombstone cells in tlp_stress.sensor_data for key: 55d44291-0343-4bb6-9ac6-dd651f543323 (see tombstone_warn_threshold). 5000 columns were requested, slices=[-]

Finally, we can use a tool like Instaclustr’s ic-purge to give us a detailed overview of the tombstone situation:

Summary:
+---------+---------+
|         | Size    |
+---------+---------+
| Disk    | 36.0 GB |
| Reclaim |  5.5 GB |
+---------+---------+

Largest reclaimable partitions:
+------------------+----------+----------+------------------------------+
| Key              | Size     | Reclaim  | Generations                  |
+------------------+----------+----------+------------------------------+
|     001.0.361268 |  32.9 MB |  15.5 MB |               [46464, 62651] |
|     001.0.618927 |   3.5 MB |   1.8 MB |               [46268, 36368] |
+------------------+----------+----------+------------------------------+

In the table above, we see which partitions take the most reclaimable space (001.0.361268 takes the most). We also see which SSTables these partitions live in (the Generation column). We have found the SSTables to compact. However, we can take this one step further and ask Cassandra for the absolute paths, not just their generation numbers.

Step 2 - List Relevant SSTables

With the partition key known, we can simply use the nodetool getsstables command. It will make Cassandra tell us the absolute paths of SSTables that a partition lives in:

ccm node1 nodetool "getsstables tlp_stress sensor_data 001.0.361268"

/Users/tlp/.ccm/2-2-9/node1/data0/tlp_stress/sensor_data-b260e6a0a7cd11e9a56a372dfba9b857/lb-46464-big-Data.db
/Users/tlp/.ccm/2-2-9/node1/data0/tlp_stress/sensor_data-b260e6a0a7cd11e9a56a372dfba9b857/lb-62651-big-Data.db

After we find all the SSTables, the last thing we need to do is to trigger a compaction.

Step 3 - Trigger a Compaction

Triggering a user-defined compaction is something Jon has described in this post. We will proceed the same way and use jmxterm to trigger the forceUserDefinedCompaction MBean. We will need to pass it a comma-separated list of the SSTables we got in the previous step:

SSTABLE_LIST="/Users/tlp/.ccm/2-2-9/node1/data0/tlp_stress/sensor_data-b260e6a0a7cd11e9a56a372dfba9b857/lb-46464-big-Data.db,\
/Users/tlp/.ccm/2-2-9/node1/data0/tlp_stress/sensor_data-b260e6a0a7cd11e9a56a372dfba9b857/lb-62651-big-Data.db"

JMX_CMD="run -b org.apache.cassandra.db:type=CompactionManager forceUserDefinedCompaction ${SSTABLE_LIST}"
echo ${JMX_CMD} | java -jar jmxterm-1.0-alpha-4-uber.jar -l localhost:7100
#calling operation forceUserDefinedCompaction of mbean org.apache.cassandra.db:type=CompactionManager
#operation returns:
null
$>

Despite getting a null as the invocation result, the compaction has most likely started. We can go and watch the nodetool compactionstats to see how it is going.

Once the compaction completes, we can repeat the process we used in Step 1 above to validate that the tombstones have been deleted.

Note for LeveledCompactionStrategy: This procedure only works with STCS and TWCS. If a table is using LCS, Cassandra does not allow invoking the forceUserDefinedCompaction MBean. For LCS, we could nudge Cassandra into compacting specific SSTables by resetting their levels. That, however, is complicated enough to deserve its own blog post.

Conclusion

In this post we saw how to trigger compactions for all of the SSTables a partition appears in. This is useful because the compaction will have a smaller footprint and is more efficient than running a major compaction, but will reliably purge droppable tombstones that can cause issues for Apache Cassandra.

ApacheCon 2019

Next week we’ll be in Las Vegas for the North American ApacheCon Conference. From the TLP consulting team Anthony will be giving a talk on the open source tooling we’ve developed to provision and stress test clusters. Aaron and I will be at our booth giving demos all week. We’d love to chat with you about Cassandra internals, application development, performance tuning, our open source tools, Cassandra best practices, and hear your story. We’ll also be more than happy to give you a demo of something special we’ve been working on this year.

Cassandra Data Partitioning

Introduction

When using Apache Cassandra, a strong understanding of the concept and role of partitions is crucial for design, performance, and scalability. This blog covers the key information you need to know about partitions to get started with Cassandra, including how to define partitions, how Cassandra uses them, best practices, and known issues.

Data partitioning is a common concept amongst distributed data systems. Such systems distribute incoming data into chunks called ‘partitions’. Features such as replication, data distribution, and indexing use a partition as their atomic unit. Data partitioning is usually performed using a simple mathematical function such as identity, hashing, etc. The function uses a configured data attribute called ‘partition key’ to group data in distinct partitions.

Consider an example where we have server logs as incoming data. This data can be partitioned using the log timestamp rounded to the hour, which results in data partitions holding one hour’s worth of logs each. Here the partitioning function is the identity function and the partition key is the timestamp rounded to the hour.

Apache Cassandra Data Partitions

Apache Cassandra is a NoSQL database that belongs to the big data family of applications, operates as a distributed system, and uses the principle of data partitioning explained above. Data partitioning is performed using a partitioning algorithm, which is configured at the cluster level, while the partition key is configured at the table level.


The Cassandra Query Language (CQL) builds on the SQL terminology of tables, rows, and columns. A table is configured with a ‘partition key’ as a component of its primary key. Let’s take a deeper look at the use of the primary key in the context of a Cassandra cluster.

Primary Key = Partition Key + [Clustering Columns]

A primary key in Cassandra identifies a unique data partition and the arrangement of data within that partition; the optional clustering columns handle the data arrangement part. A unique partition key value represents a set of rows in a table which are managed on a single server (including all servers managing its replicas).

A primary key has the following CQL syntax representations:

Definition 1:

CREATE TABLE server_logs(
   log_hour timestamp PRIMARY KEY,
   log_level text,
   message text,
   server text
   )

partition key: log_hour
clustering columns: none

 

Definition 2:

CREATE TABLE server_logs(
   log_hour timestamp,
   log_level text,
   message text,
   server text,
   PRIMARY KEY (log_hour, log_level)
   )

partition key: log_hour
clustering columns: log_level

 

Definition 3:

CREATE TABLE server_logs(
   log_hour timestamp,
   log_level text,
   message text,
   server text,
   PRIMARY KEY ((log_hour, server))
   )

partition key: log_hour, server
clustering columns: none

 

Definition 4:

CREATE TABLE server_logs(
   log_hour timestamp,
   log_level text,
   message text,
   server text,
   PRIMARY KEY ((log_hour, server), log_level)
   ) WITH CLUSTERING ORDER BY (log_level DESC);

partition key: log_hour, server
clustering columns: log_level

 

The set of rows identified by a unique partition key is generally referred to as a ‘partition’.

  • Definition 1 has all the rows sharing a ‘log_hour’ as a single partition.
  • Definition 2 has the same partition key as Definition 1, but arranges all rows within each partition in ascending order of ‘log_level’.
  • Definition 3 has all the rows sharing a ‘log_hour’ for each distinct ‘server’ as a single partition.
  • Definition 4 has the same partition key as Definition 3, but arranges the rows within each partition in descending order of ‘log_level’.

Cassandra read and write operations are performed using a partition key on a table. Cassandra uses ‘tokens’ (64-bit integers in the range -2^63 to +2^63 - 1) for data distribution and indexing. The tokens are mapped to partition keys using a ‘partitioner’, which applies a partitioning function to convert any given partition key to a token. Each node in a Cassandra cluster owns a set of data partitions through this token mechanism, and the data is then indexed on each node with the help of the partition key. The takeaway here is that Cassandra uses the partition key to determine which node to store data on and where to find data when it is needed.
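To see this mapping in practice, the partitioner is exposed in CQL through the token() function. For the Definition 3 and 4 tables, whose partition key is (log_hour, server), a query such as the following returns the token that determines where each partition lives (a simple illustration; the token values depend on the configured partitioner):

SELECT log_hour, server, token(log_hour, server)
FROM server_logs
LIMIT 3;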

See below diagram of Cassandra cluster with 3 nodes and token-based ownership. 

Cassandra Partitions - Representation of Tokens

*This is a simplified representation of tokens; the actual implementation uses vnodes.

Impacts of data partitioning on Cassandra clusters

Controlling the size of the data stored in each partition is essential to ensure even distribution of data across the cluster and to get good I/O performance. Below are the impacts partitioning has on different aspects of a Cassandra cluster:

  • Read Performance: Cassandra maintains caches, indexes and index summaries to locate partitions within SSTable files on disk. Large partitions make maintaining these data structures inefficient and degrade performance. The Cassandra project has made several improvements in this area, especially in version 3.6, where the storage engine was restructured to be more performant for large partitions and more resilient against memory issues and crashing.
  • Memory Usage: The partition size directly impacts the JVM heap size and garbage collection. Large partitions increase pressure on the JVM heap and make garbage collection inefficient.
  • Cassandra Repairs: Repair is a maintenance operation to make data consistent. It involves scanning data and comparing it with other replicas, followed by data streaming if required. Large partition sizes make it harder to repair data.
  • Tombstones Eviction: Cassandra uses unique markers called ‘tombstones’ to mark data deletion. Large partitions can make tombstone eviction difficult if the data deletion pattern and compaction strategy are not appropriately designed.

Being aware of these impacts helps in designing an optimal partition key when deploying Cassandra. It might be tempting to design the partition key to have only one row or a few rows per partition. However, a few other factors can influence the design decision, primarily the data access pattern and the ideal partition size.

The access pattern and its influence on partitioning key design are explained in-depth in one of our ‘Data modelling’ articles here – A 6 step guide to Apache Cassandra data modelling. In a nutshell, an ‘access pattern’ is the way a table is going to be queried, i.e. a set of all ‘select’ queries for a table. Ideal CQL select queries always have a single partition key in the ‘where’ clause. This means Cassandra works most efficiently when queries are designed to get data from a single partition. 
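For example, with the Definition 4 table above, an ideal query specifies the complete partition key in its ‘where’ clause and therefore reads from exactly one partition (the values shown are illustrative):

SELECT log_level, message
FROM server_logs
WHERE log_hour = '2019-09-19 10:00:00+0000'
  AND server = 'server_1';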

Partitioning Key design

Now let’s look into designing a partitioning key that leads to an ‘ideal partition size’. The practical limit on the size of a partition is two billion cells, but it is not ideal to have such large partitions. The maximum partition size in Cassandra should be under 100MB, and ideally less than 10MB. The application workload and its schema design have an effect on the optimal partition size; however, a maximum of 100MB is a good rule of thumb. A ‘large/wide partition’ is hence defined in the context of these standard mean and maximum values.

In the versions after 3.6, it may be possible to operate with larger partition sizes. However, thorough testing and benchmarking for each specific workload is required to ensure there is no impact of your partition key design on the cluster performance. 

Below are some best practices to consider when designing an optimal partition key:

  • A partition key for a table should be designed to satisfy its access pattern and with the ideal amount of data to fit into partitions. 
  • A partition key should not allow ‘unbounded partitions’. An unbounded partition grows indefinitely in size as time passes. 

In the server_logs table example, if the server column is used as a partition key it will create unbounded partitions as logs for a server will increase with time. The time attribute of log_hour, in this case, puts a bound on each partition to accommodate an hour worth of data.

  • A partition key should not create partition skew, in order to avoid uneven partitions and hotspots. A partition skew is a condition in which there is more data assigned to a partition as compared to other partitions and the partition grows indefinitely over time.

In the server_logs table example, suppose the partition key is server and if one server generates way more logs than other servers, it will create a skew.

Partition skew can be avoided by introducing some other attribute from the table in the partition key so that all partitions get even data. If it is not feasible to use a real attribute to remove skew, a dummy column can be created and introduced to the partition key. The dummy column then distinguishes partitions and it can be controlled from an application without disturbing the data semantics.

In the skew example above, consider a dummy column partition smallint introduced into the table, with the partition key altered to (server, partition). The application logic then sets the partition attribute to 1 until there are enough rows in a partition, and then sets partition to 2 for the same server; a sketch of this schema follows below.
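A minimal sketch of what this could look like in CQL (the table name is illustrative and differs from the earlier server_logs definitions):

CREATE TABLE server_logs_by_server (
   server text,
   partition smallint,      -- dummy "bucket" column controlled by the application
   log_hour timestamp,
   log_level text,
   message text,
   PRIMARY KEY ((server, partition), log_hour)
   );

-- The application writes with partition = 1 until the bucket holds enough rows,
-- then switches to partition = 2 for the same server; reads must query each
-- bucket value that has been used.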

  • Time Series data can be partitioned using a time element in the partition key along with other attributes. This helps in multiple ways –
    • it works as a safeguard against unbounded partitions
    • access patterns can use the time attribute to query specific data
    • data deletion can be performed for a time-bound etc.

In the server_logs table, all four definitions use the time attribute log_hour. All four definitions are good examples of bounded partitions by the hour value.

Summary

The important elements of the Cassandra partition key discussion are summarized below:

  1. Each Cassandra table has a partition key which can be standalone or composite. 
  2. The partition key determines data locality through indexing in Cassandra.
  3. The partition size is a crucial attribute for Cassandra performance and maintenance.
  4. The ideal size of a Cassandra partition is equal to or lower than 10MB with a maximum of 100MB.
  5. The partition key should be designed carefully to create bounded partitions with size in the ideal range. 
  6. It is essential to understand your data demographics and consider partition size and data distribution when designing your schema. There are several tools to test, analyse and monitor Cassandra partitions.
  7. The Cassandra version 3.6 and above incorporates significant improvements in the storage engine which provides much better partition handling.

The post Cassandra Data Partitioning appeared first on Instaclustr.

Instaclustr delivers improved Cassandra price and performance with AWS I3en Instance Types

Instaclustr has released support for a new AWS I3en instance type with our Apache Cassandra & Spark Managed Service.

I3en is the latest generation of AWS’s “Storage Optimised” family of EC2 instances, designed to support I/O-intensive workloads and backed by low-latency SSD storage. The i3en family of instances is also built on the new generation of AWS Nitro virtualisation, which offers improved security with continuous monitoring of instances, and dedicated hardware and software for the different virtualized resources, providing improved performance.

Instaclustr customers can now leverage these benefits with the release of the  i3en.xlarge instance type, which provides 4 vCPUs, 32 GiB memory, and 2500 GB of locally attached SSD. As a comparison, the i3.2xlarge offers 8 vCPUs, 61 GiB memory and 1 x 1900 GB SSDs. 

The increased local storage and higher performance virtualisation of the i3en.xl provides a significant cost advantage for many use cases. This advantage is particularly evident when used as a reserved instance and running in your AWS account where the i3en.xl offers the most cost effective storage of any of our offerings. To illustrate (AWS costs only):

  • An r5.xlarge instance costs $1,298 per year and provides roughly equivalent compute capacity (cores and memory) to an i3en.xl
  • An i3en.xlarge costs $2,517 per year but provides 2,500GB of storage for the $1,219 additional cost

This works out to a cost per GB of $0.041 per month, less than half the $0.10 per GB per month cost of GP2 EBS. The local SSD also provides better performance than EBS. Cassandra’s native replication of data across multiple instances deals with the issue of data on local SSDs being lost when an instance fails, and Instaclustr’s advanced node replacement also helps to deal with some of the operational issues inherent in working with local SSDs.
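To spell out the arithmetic: $1,219 additional cost ÷ 2,500 GB ÷ 12 months ≈ $0.041 per GB per month.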

Benchmarking of I3en Instance Type

We conducted Cassandra benchmarking of the i3en.xlarge type and compared the results with an i3.2xlarge cluster. This comparison isn’t exactly “apples to apples”, with the i3.2xlarge having double the cores and memory; however, it is still useful as a price/performance comparison to see how the two instances stack up.

Our testing procedure is:

  1. Insert data to fill disks to ~30% full.
  2. Wait for compactions to complete.
  3. Target median latency of 10ms.
  4. Run a sequence of tests comprising read, write and mixed read/write operations.
  5. Run the tests and measure the results including operations per second, pending compactions and median operation latency. 
  6. Quorum consistency for all operations.
  7. We incrementally increase the thread count and re-run the tests until we have hit the target median latency and are under the allowable pending compactions

As with any generic benchmarking results, performance may vary from the benchmark for different data models or application workloads. However, we have found this to be a reliable test for comparing relative performance that will be reflected in many practical use cases.

This benchmarking used Cassandra 3.11.4 and compared a 3-node i3.2xlarge cluster with a 3-node i3en.xlarge cluster. Driving operations to the point where latency on each node was similar, the 3-node i3en.xlarge cluster yielded 16,869 operations/sec, compared to 24,023 for the i3.2xlarge. So, the i3en.xlarge provided roughly 70% of the performance of the larger and more expensive i3.2xlarge cluster.

Although it has half the cores and memory, the i3en.xlarge’s relatively strong performance can be explained by a higher clock speed on each core, as well as gains from AWS’s new Nitro virtualisation technology.

However, raw performance is only part of the equation for this node size; price and storage size also play a part in making the i3en.xlarge a more attractive offering.

Instaclustr offers i3en instances at a price commensurate with their performance relative to the i3.2xlarge: around 70% of the price. The other key factor is disk size, with the i3en.xlarge having roughly 30% more disk available than its more expensive counterpart.

This makes the i3en.xlarge an extremely attractive option for customers with a use case that has a lower throughput (ops/sec) to stored data requirement than that provided by the i3.2xlarge (which has relatively more processing power per GB of data stored).

AWS Instance type   Operation type      Operations/s   Median latency (ms)
i3en.xlarge         write               16,869         2.4
i3en.xlarge         read                11,101         7.2
i3en.xlarge         mixed read/write    11,891         7.4
i3.2xlarge          write               24,023         3.2
i3.2xlarge          read                16,832         10
i3.2xlarge          mixed read/write    17,301         8

Table 1: Results summary

Note that the target latency of 10ms was not always reached, due to limitations of the stress tool and the high throughput of these instance types.

To assist customers in upgrading their Cassandra clusters from their current instance types to i3en instance type nodes, Instaclustr’s technical operations team has built several tried and tested node replacement strategies that provide zero-downtime, non-disruptive migrations. Reach out to our support team if you are interested in i3en instances for your Cassandra cluster.

For full pricing, sign up or log on to our console or contact our sales team.


The 3 Data Trends Developers Need to Know

Last weekend, I spoke at Data Con LA, an event DataStax has supported for several years, this year as a platinum sponsor.

For those who weren’t able to make it, I figured I’d share a high-level overview of my keynote presentation, which was about three major data trends that affect developers and what DataStax is doing in response to each of them. 

Data Con LA Keynote

(You can also read about other data management trends here.)

Let’s jump right in.

1. Working with data needs to be easy and obvious

According to a recent Stack Overflow survey, 41 percent of developers have less than five years of professional experience, while nearly two-thirds have less than nine years. In addition, we’re rapidly approaching the point where the majority of developers will be based outside of the U.S. and Europe.

With this massive influx of developers into the industry, it’s more important than ever for all of us to help make working with data easy and obvious. 

At DataStax, we work hard every day to make that a reality. While we’re best known as the driving force behind the Apache Cassandra™ distributed database and our enterprise version, DataStax Enterprise, we also have a history of providing great developer tools, including:

  • Drivers for Apache Cassandra and DataStax Enterprise, which are available in several programming languages, and
  • DataStax Studio, a notebook-style tool that brings together data scientists and engineers around data modeling, exploration, and visualization.

We’re also super excited about AppStax, a new tool for rapid application development that automatically generates microservices with REST or GraphQL APIs based on your data.

Add it all up, and DataStax is supporting developers of all experience levels by making it incredibly easy to work with data.


2. Data must be freely accessible in whatever cloud it’s needed

We recently found out that 75% of IT groups are pursuing a hybrid or multi-cloud strategy. This enables them to leverage the best of both worlds, i.e., the scalability and cost-effectiveness of the public cloud and the privacy and security of the private cloud.

At DataStax, we believe that Cassandra is the ideal database for replicating data across multi- and hybrid cloud environments—whether that’s in your own private data center, public cloud regions, or a combination thereof. 
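
As a minimal sketch of what this looks like at the database level, a single keyspace can be replicated across an on-premises datacenter and a cloud region. The datacenter names below are placeholders, not a specific DataStax deployment:

from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect()

# NetworkTopologyStrategy replicates the keyspace per datacenter; the names
# "onprem_dc" and "cloud_dc" are illustrative and must match the datacenter
# names configured for your cluster.
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS orders
    WITH replication = {
        'class': 'NetworkTopologyStrategy',
        'onprem_dc': 3,
        'cloud_dc': 3
    }
""")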

To this end, we’re thrilled to be bringing Cassandra as a service to market this year as part of our Constellation suite of cloud products.

If you need to operate your own clusters, DataStax can help there, too. We’re continuing to provide our DSE Docker containers and we’re working on a Kubernetes operator to make operations easier for you.


3. Data must be accessible through the right API for each application

There’s been a growing trend toward multi-model databases in recent years. These databases help developers build powerful applications more quickly by providing APIs that support multiple ways of accessing data, including tabular representations such as SQL or the Cassandra Query Language (CQL), document representations such as JSON, and graph representations such as Gremlin from the Apache TinkerPop project. Data needs to be accessible through the right API for each application you build.
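
As a small illustration of the tabular and document representations over the same data, here is a sketch using Cassandra’s built-in JSON support in CQL via the Python driver; the keyspace and table names are made up:

from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("store")  # hypothetical keyspace

# Tabular representation: ordinary CQL rows
for row in session.execute("SELECT id, name, price FROM products LIMIT 3"):
    print(row.id, row.name, row.price)

# Document representation: the same rows rendered as JSON via CQL's SELECT JSON
for row in session.execute("SELECT JSON id, name, price FROM products LIMIT 3"):
    print(row[0])  # e.g. {"id": 1, "name": "widget", "price": 9.99}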

We see increasing interest in the “graph way of thinking” in particular. Gartner says that the graph database market will double every year through 2022, for example. This is due to the fact that graph databases have the unique ability to leverage relationships between disparate data sets. 

DataStax helps here as well. 

The next generation of our DataStax Enterprise Graph database takes another step forward in multi-model flexibility, allowing you to access the same underlying data via the Gremlin graph language from Apache TinkerPop, CQL, or through SQL via Apache Spark. See the talk “DSE Graph.Next: The Cloud-Native Distributed Graph Database for Solving Graph Problems” by Jonathan Lacefield and Dr. Denise Gosnell from the DataStax Accelerate 2019 conference for a sneak preview.

Simply put, DataStax knows what the future of data management looks like and we’re sharply focused on making sure developers have the precise tools they need to build transformative applications—today and tomorrow.

Data’s Day in the Sun

Throughout the rest of the day at Data Con LA—which was held on the beautiful campus of the University of Southern California—we had a ton of conversations at our DataStax booth with students, local startups, and industry folks, and were able to hear more about their use cases and data challenges. It’s great to see this conference grow by leaps and bounds each year, and I can’t wait for next year’s event!


The 5 Tools You Need to Simplify Data Distribution

When’s the last time you sat there patiently when a website or app took forever to load—either for personal or professional purposes?

In the age of Amazon and smartphones, we all know how websites should work. When they don’t meet our expectations, we get frustrated. 

For customer-facing sites, inefficient applications lead to disgruntled customers. For employee-facing enterprise apps, slow loading times lead to productivity decreases and unhappy staffers. 

When we talk about expectations, we mean this: Nearly half of users expect sites to load in two seconds or less. Two seconds! 

In the age of distributed teams, everywhere access, and big data, meeting and exceeding these expectations can seem tricky. But with the right architecture and infrastructure in place, it’s something that’s within your reach. 

Here are five tools you need to simplify data distribution, increasing application performance and strengthening the user experience along the way.

1. Hybrid cloud

Simply put, modern applications are best suited for the hybrid cloud. In these environments, you get more control over your on-prem tech infrastructure while also being able to leverage public cloud resources to take care of your AI/ML needs. Every application is different. With the hybrid cloud, you can build the exact infrastructure you need to deliver exemplary experiences.

2. Microservices

According to IDC, 90% of applications will utilize microservices architecture by 2020. This is due to the fact that microservices enable organizations to rapidly build and scale applications, working on individual features as standalone units instead of engineering a massive monolith. Microservices, however, can present data distribution challenges—particularly when data is shared among several different microservices. With the right underlying technologies in place, however, microservices can help simplify data distribution and accelerate performance.

3. Containers

Containers make microservices architecture easy by enabling developers to create, deploy, and run applications in any kind of environment. The technology also makes applications compute-efficient, improving performance and data distribution while decreasing data center costs.

4. Container orchestration

Microservices-architected applications are generally built with several different containers. A container orchestration platform like Kubernetes makes it much easier to manage lots of containers. Still, when it comes to simplifying data management, Kubernetes is not a panacea—particularly when applications are built on top of traditional databases. Luckily, there’s an easy fix for this problem, which segues nicely into number five. 

5. Active Everywhere database

Simplifying data distribution in hybrid cloud environments in applications built with microservices architecture requires an underlying database that can keep pace with performance expectations. Since modern applications use tons of data, traditional active-active or active-passive databases no longer cut it. This is exactly why we built DataStax Enterprise, an Active Everywhere database built on Apache Cassandra’s masterless architecture. With an Active Everywhere database powering your applications, data can flow smoothly in every direction.

Simplify Data Distribution to Accelerate Your Applications

In the age of instant gratification and real-time experiences, velocity matters. Every company has a ton of data under its control, and volumes more are created every day. 

To move faster, increase team productivity, and strengthen user experiences, you need to build applications with microservices architecture that are designed for high performance in hybrid and multi-cloud environments. The easiest way to do that is by using containers, a container orchestration platform, and an Active Everywhere database like DataStax Enterprise.

Add it all up, and that’s the recipe for stronger applications that drive competitive advantage gains while impressing customers and engaging your team along the way.


tlp-stress 1.0 Released!

We’re very pleased to announce the 1.0 release of tlp-stress! tlp-stress is a workload-centric benchmarking tool we’ve built for Apache Cassandra to make performance testing easier. We introduced it around this time last year and have used it to do a variety of testing for ourselves, customers, and the Cassandra project itself. We’re very happy with how far it’s come and we’re excited to share it with the world as a stable release. You can read about some of the latest features in our blog post from last month. The main difference between tlp-stress and cassandra-stress is the inclusion of workloads out of the box. You should be able to get up and running with tlp-stress in under 5 minutes, testing the most commonly used data models and access patterns.

We’ve pushed artifacts to Bintray to make installation easier. Instructions to install the RPM and Debian packages can be found on the tlp-stress website. In the near future we’ll get artifacts uploaded to Maven, as well as a Homebrew formula to make installs on macOS easier. For now, grab the tarball from the tlp-stress website. We anticipate a short release cycle, similar to Firefox or Chrome, so we expect to release improvements every few weeks.

We’ve set up a mailing list if you have questions or suggestions. We’d love to hear from you if you’re using the tool!

Evolution of Netflix Conductor: v2.0 and beyond

By Anoop Panicker and Kishore Banala

Conductor is a workflow orchestration engine developed and open-sourced by Netflix. If you’re new to Conductor, this earlier blogpost and the documentation should help you get started and acclimatized to Conductor.

Netflix Conductor: A microservices orchestrator

In the last two years since inception, Conductor has seen wide adoption and is instrumental in running numerous core workflows at Netflix. Many of the Netflix Content and Studio Engineering services rely on Conductor for efficient processing of their business flows. The Netflix Media Database (NMDB) is one such example.

In this blog, we would like to present the latest updates to Conductor, address some of the frequently asked questions and thank the community for their contributions.

How we’re using Conductor at Netflix

Deployment

Conductor is one of the most heavily used services within Content Engineering at Netflix. Of the multitude of modules that can be plugged into Conductor, as shown in the image below, we use the Jersey server module, Cassandra for persisting execution data, Dynomite for persisting metadata, DynoQueues as the queuing recipe built on top of Dynomite, Elasticsearch as the secondary datastore and indexer, and Netflix Spectator + Atlas for metrics. Our cluster size ranges from 12 to 18 AWS EC2 m4.4xlarge instances, typically running at ~30% capacity.

Components of Netflix Conductor
* — Cassandra persistence module is a partial implementation.

We do not maintain an internal fork of Conductor within Netflix. Instead, we use a wrapper that pulls in the latest version of Conductor and adds Netflix infrastructure components and libraries before deployment. This allows us to proactively push changes to the open source version while ensuring that the changes are fully functional and well-tested.

Adoption

As of writing this blog, Conductor orchestrates 600+ workflow definitions owned by 50+ teams across Netflix. While we’re not (yet) actively measuring the nth percentiles, our production workloads speak for Conductor’s performance. Below is a snapshot of our Kibana dashboard which shows the workflow execution metrics over a typical 7-day period.

Typical Conductor usage at Netflix over a 7-day period.

Use Cases

Some of the use cases served by Conductor at Netflix can be categorized under:

  • Content Ingest and Delivery
  • Content Quality Control
  • Content Localization
  • Encodes and Deployments
  • IMF Deliveries
  • Marketing Tech
  • Studio Engineering

What’s New

gRPC Framework

One of the key features in v2.0 was the introduction of the gRPC framework as an alternative/auxiliary to REST. This was contributed by our counterparts at GitHub, thereby strengthening the value of community contributions to Conductor.

Cassandra Persistence Layer

To enable horizontal scaling of the datastore for large volume of concurrent workflow executions (millions of workflows/day), Cassandra was chosen to provide elastic scaling and meet throughput demands.

External Payload Storage

External payload storage was implemented to prevent the usage of Conductor as a data persistence system and to reduce the pressure on its backend datastore.

Dynamic Workflow Executions

For use cases that need to execute a large or arbitrary number of varying workflow definitions, or to run a one-time ad hoc workflow for testing or analytical purposes, registering definitions with the metadata store just to execute them once adds a lot of overhead. The ability to dynamically create and execute workflows removes this friction; a sketch follows below. This was another great addition that stemmed from our collaboration with GitHub.
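
A minimal, hedged sketch of what a dynamic execution could look like against Conductor’s REST API. The server URL, workflow and task names are assumptions, and the exact request shape should be checked against the Conductor documentation; the point is that the workflow definition travels with the start request instead of being registered first:

import requests

start_request = {
    "name": "adhoc_smoke_test",          # hypothetical workflow name
    "input": {"value": 42},
    # The definition is supplied inline rather than pre-registered:
    "workflowDef": {
        "name": "adhoc_smoke_test",
        "version": 1,
        "schemaVersion": 2,
        "tasks": [
            {
                "name": "simple_task",            # assumes this task definition exists
                "taskReferenceName": "simple_task_ref",
                "type": "SIMPLE",
            }
        ],
    },
}

requests.post("http://localhost:8080/api/workflow", json=start_request)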

Workflow Status Listener

Conductor can be configured to publish notifications to external systems or queues upon completion/termination of workflows. The workflow status listener provides hooks to connect to any notification system of your choice. The community has contributed an implementation that publishes a message on a dyno queue based on the status of the workflow. An event handler can be configured on these queues to trigger workflows or tasks to perform specific actions upon the terminal state of the workflow.

Bulk Workflow Management

There has always been a need for bulk operations at the workflow level from an operability standpoint. When running at scale, it becomes essential to perform workflow level operations in bulk due to bad downstream dependencies in the worker processes causing task failures or bad task executions. Bulk APIs enable the operators to have macro-level control on the workflows executing within the system.

Decoupling Elasticsearch from Persistence

The inter-dependency between Elasticsearch and the primary datastore was removed by moving the indexing layer into separate persistence modules, and a property (workflow.elasticsearch.instanceType) was exposed to choose the type of indexing engine. Further, the indexer and persistence layer have been decoupled by moving this orchestration out of the primary persistence layer and into a service layer through the ExecutionDAOFacade.

ES5/6 Support

Support for Elasticsearch versions 5 and 6 has been added as part of the major version upgrade to v2.x. This addition also provides the option to use the Elasticsearch RestClient instead of the Transport Client, which was enforced in the previous version. This opens the route to using a managed Elasticsearch cluster (a la AWS) as part of the Conductor deployment.

Task Rate Limiting & Concurrent Execution Limits

Task rate limiting helps achieve bounded scheduling of tasks. The task definition parameter rateLimitFrequencyInSeconds sets the duration window, while rateLimitPerFrequency defines the number of tasks that can be scheduled within that window. concurrentExecLimit, by contrast, limits scheduling without a time window: the total number of currently scheduled tasks at any given time stays under concurrentExecLimit. These parameters can be used in tandem to achieve the desired throttling and rate limiting.
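
As a hedged sketch, the three parameters might be combined in a task definition registered over Conductor’s metadata API. The task name, server URL and surrounding fields are assumptions; the three throttling parameters are the ones named above:

import requests

task_def = {
    "name": "encode_video",             # hypothetical task
    "retryCount": 3,
    "timeoutSeconds": 1200,
    "rateLimitFrequencyInSeconds": 60,  # duration window
    "rateLimitPerFrequency": 100,       # tasks schedulable per window
    "concurrentExecLimit": 25,          # max tasks scheduled at any given time
}

# The metadata endpoint accepts a list of task definitions.
requests.post("http://localhost:8080/api/metadata/taskdefs", json=[task_def])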

API Validations

Validation was one of the core features missing in Conductor 1.x. To improve usability and operability, we added validations, which in practice have greatly helped find bugs during the creation of workflow and task definitions. Validations require the user to create and register their task definitions before registering the workflow definitions that use those tasks. They also ensure that the workflow definition is well-formed, with correct wiring of inputs and outputs across the various tasks within the workflow. Any anomalies found are reported to the user with a detailed error message describing the reason for the failure.

Developer Labs, Logging and Metrics

We have been continually improving logging and metrics, and we revamped the documentation to reflect the latest state of Conductor. To provide a smooth onboarding experience, we have created developer labs, which guide the user through creating task and workflow definitions, managing a workflow lifecycle, configuring advanced workflows with eventing, and more, plus a brief introduction to the Conductor API, UI and other modules.

New Task Types

System tasks have proven to be very valuable in defining the Workflow structure and control flow. As such, Conductor 2.x has seen several new additions to System tasks, mostly contributed by the community:

Lambda

The Lambda task executes ad hoc logic at workflow run time, using the Nashorn JavaScript engine. Instead of creating workers for simple evaluations, the Lambda task lets the user express them inline using simple JavaScript expressions.
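
A minimal, hedged sketch of what a Lambda task entry in a workflow definition might look like. The task and parameter names, other than the LAMBDA type, are illustrative assumptions based on the description above, so check the Conductor documentation for the exact field names:

# A workflow task entry, expressed here as a Python dict for readability.
lambda_task = {
    "name": "check_threshold",                 # hypothetical
    "taskReferenceName": "check_threshold_ref",
    "type": "LAMBDA",
    "inputParameters": {
        "value": "${workflow.input.value}",
        # Inline JavaScript evaluated by the Nashorn engine at run time:
        "scriptExpression": "if ($.value > 10) { return {over: true}; } "
                            "else { return {over: false}; }",
    },
}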

Terminate

The Terminate task is useful when workflow logic should end with a given output. For example, if a decision task evaluates to false and we do not want to execute the remaining tasks in the workflow, placing a Terminate task in that branch can end the workflow execution directly, instead of modelling it as a DECISION task with a list of tasks in one case and an empty list in the other.

ExclusiveJoin

The Exclusive Join task helps capture task output from a DECISION task’s flow. This is useful for wiring task inputs from the outputs of one of the cases within a decision flow. This data is only available during workflow execution time, and the ExclusiveJoin task can be used to collect the output from one of the tasks in any of the decision branches.

For in-depth implementation details of the new additions, please refer to the documentation.

What’s next

There are a lot of features and enhancements we would like to add to Conductor. The wish list below can be considered a long-term roadmap. It is by no means exhaustive, and we very much welcome ideas and contributions from the community. Some of these, listed in no particular order, are:

Advanced Eventing with Event Aggregation and Distribution

At the moment, event generation and processing is a very simple implementation. An event task can create only one message, and a task can wait for only one event.

We envision an Event Aggregation and Distribution mechanism that would open up Conductor to a multitude of use-cases. A coarse idea is to allow a task to wait for multiple events, and to progress several tasks based on one event.

UI Improvements

While the current UI provides a neat way to visualize and track workflow executions, we would like to enhance this with features like:

  • Creating metadata objects from UI
  • Support for starting workflows
  • Visualize execution metrics
  • Admin dashboard to show outliers

New Task types like Goto, Loop etc.

Conductor has been using a Directed Acyclic Graph (DAG) structure to define a workflow. Goto and Loop tasks address valid use cases that would deviate from the DAG structure. We would like to add support for these tasks without violating the existing workflow execution rules. This would help unlock several other use cases, like streaming data to tasks and others that require repeated execution of a set of tasks within a workflow.

Support for reusable commonly used tasks like Email, DatabaseQuery etc.

Similarly, we’ve seen the value of shared, reusable tasks that each do one specific thing. In Netflix’s internal deployment of Conductor, we’ve added tasks specific to our services that users can leverage rather than recreating the tasks from scratch. For example, we provide a TitusTask which enables our users to launch a new Titus container as part of their workflow execution.

We would like to extend this idea such that Conductor can offer a repository of commonly used tasks.

Push based task scheduling interface

The current Conductor architecture is based on workers polling for the tasks they will execute. We want to enhance the gRPC modules to leverage the bidirectional channel to push tasks to workers as and when they are scheduled, thus reducing network traffic, server load and redundant client calls.

Validating Task inputKeys and outputKeys

This is to provide type safety for tasks and define a parameterized interface for task definitions such that tasks are completely re-usable within Conductor once registered. This provides a contract allowing the user to browse through available task definitions to use as part of their workflow where the tasks could have been implemented by another team/user. This feature would also involve enhancing the UI to display this contract.

Implementing MetadataDAO in Cassandra

As noted above, the Cassandra module provides a partial implementation that persists only the workflow executions. A metadata persistence implementation is not available yet and is something we are looking to add soon.

Pluggable Notifications on Task completion

Similar to the Workflow status listener, we would like to provide extensible interfaces for notifications on task execution.

Python client in Pypi

We have seen wide adoption of the Python client within the community. However, there is no official Python client on PyPI, and the existing client lacks some of the newer additions in the Java client. We would like to achieve feature parity, publish a client from the Conductor GitHub repository, and automate the client release to PyPI.

Removing Elasticsearch from critical path

While Elasticsearch is greatly useful in Conductor, we would like to make this optional for users who do not have Elasticsearch set-up. This means removing Elasticsearch from the critical execution path of a workflow and using it as an opt-in layer.

Pluggable authentication and authorization

Conductor doesn’t support authentication and authorization for the API or UI. This is something we feel would add great value, and it is a frequent request in the community.

Validations and Testing

Dry runs, i.e., the ability to evaluate workflow definitions without actually running them through worker processes and all the relevant set-up, would make it much easier to test and debug execution paths.

If you would like to be a part of the Conductor community and contribute to one of the Wishlist items or something that you think would provide a great value add, please read through this guide for instructions or feel free to start a conversation on our Gitter channel, which is Conductor’s user forum.

We also highly encourage you to polish, genericize, and share with the community any customizations that you may have built on top of Conductor.

We really appreciate and are extremely proud of the community’s involvement; community members have made several important contributions to Conductor. We would like to take this further and make Conductor widely adopted, with a strong community backing.

Netflix Conductor is maintained by the Media Workflow Infrastructure team. If you like the challenges of building distributed systems and are interested in building the Netflix Content and Studio ecosystem at scale, connect with Charles Zhao to get the conversation started.

Thanks to Alexandra Pau, Charles Zhao, Falguni Jhaveri, Konstantinos Christidis and Senthil Sayeebaba.

