ScyllaDB Developer Hackathon: Scylla + S3

ScyllaDB’s internal developer conference and hackathon was very informative and productive. Here we’ll share with you the project we came up with, which involves implementing Binary Large Object (“blob”) support in Scylla along with the S3 API and integration with many interesting features.

Motivation

It started with a simple idea: how cool would it be to access Scylla through the S3 API, which is popular, simple, and supported by plenty of open-source tools? Why can’t we store large objects in our database? Users sometimes want to store media files in the database. Could we implement that in a convenient way? It turned out we could.

From that moment the flywheel of our imagination started to spin faster. We dreamt about possible applications and their consequences. It turned out that storing large objects plays perfectly well with S3 API support, and from there we moved far beyond what we could imagine at the beginning. S3 enabled Scylla to support browser access. Browser access pushed us toward streaming live video right from the database via our S3 frontend, which in turn required support for ranged reads.

Then it turned out we could expose our work as a filesystem via Filesystem in Userspace (FUSE), and we made it happen. At that moment we truly understood the potential of what we were doing. It was like a vision of boundless freedom, a fresh breath of imagination, a borderless world of what could be achieved. In that process of collective collaboration, with a feeling that we were building something big, we came up with ideas that felt almost crazy: why not create a new compaction strategy for our large objects? Why not expose CDC streams via S3 and connect Scylla to Apache Spark as a Data Source?

It was like we were reinventing Scylla from new perspectives and it was so appealing to look at how seamlessly everything integrated together.

Everything we do we do because we love what we are doing. This project happened because we were able to summon all of our passion, face the unknown, and build together via a process of creativity in which we all became something more than just by ourselves. The further we proceeded the more it felt like we were going in the right direction. We had fun. We felt passionate. It was inspirational and that’s how great things must be born.

The Team

Our team, clockwise from the upper left: Developer Evangelist
Ivan Prisyanzhnyy, Software Developer Raphael S. Carvalho (or at least
the top of his head), Distinguished Engineer Nadav Har’el,
Software Developer Takuya Asada (in silhouette) and Software Engineers
Marcin Maliszkiewicz and Piotr Grabowski.

Collectively our team represented some of the most senior engineering talent at ScyllaDB (Nadav, Raphael, and Takuya have all been with the company for more than seven years) and some of the most recently added (Piotr G joined in 2020). The team spanned the world, from Brazil to Poland, Russia, Israel, and Japan.

Right before the start, one of the project’s authors suddenly found himself unable to participate, so the project was at risk of not happening at all.

But we still wanted to give the project a try and decided to move forward with it. We were pleasantly surprised when, on the first day of the hackathon, he came back to tell us that everything was okay, the circumstances had changed, and he was able to participate after all. It was the first piece of good news in the long queue that followed.

Implementation

The implementation consists of 4 sub-projects:

  • A Python proof of concept (PoC) that verifies that the data model works and the API is correct
  • A production-grade implementation in C++ based on a fork of the Alternator source code
  • An Object-Aware Compaction Strategy
  • Change Data Capture (CDC) S3 support

The foundation for it all is a data model that allows an efficient implementation of the required S3 API while taking advantage of Scylla internals.

Right next to it stands the question of how to store large binary objects. There are at least two obvious approaches: split the objects into small pieces and store them in the database, or store only the metadata in the database and keep the data itself on the filesystem. Each has its pros and cons. For the PoC we decided to split blobs into smaller pieces — chunks — and store them right in the database.

The process of splitting a blob into chunks is known as “chunkifying.”
No, we didn’t make that up.

When a user wants to upload an object we split it into many small pieces called chunks. Every chunk corresponds to a cell. Cells work best when small, so we keep them around 128 KB in size. Cells are then organized into a single, larger partition — for example, 64 MB. The advantage is that a partition can be read sequentially with paging, and a sufficiently large piece of data is served from a single node. This approach makes it possible to balance cell and partition sizes against how connections are distributed across nodes.

The key for a chunk can be calculated from the chunk’s number within the blob. This makes it possible to access chunks without an additional index and allows greater parallelism when uploading and downloading.

When a user wants to read an object we just need to calculate the chunk to start from and start streaming data from the corresponding partition. It’s that simple.
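To make this concrete, here is a minimal sketch of what such a chunked data model could look like in CQL. The keyspace, table, and column names are purely illustrative assumptions, not the schema the project actually uses:

CREATE TABLE s3.chunks (
    object_id    text,   -- identifier of the blob
    partition_no int,    -- partition_no = chunk_no / chunks_per_partition
    chunk_no     int,    -- chunk number within the blob
    data         blob,   -- ~128 KB of object data per cell
    PRIMARY KEY ((object_id, partition_no), chunk_no)
);

With a layout like this, the partition and chunk that contain a given byte offset can be computed directly from the offset and the chunk size, so a read can start streaming from the right partition without any index lookup.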

Knowing that large-object data is usually a) immutable, b) big, and c) stored in pieces that span contiguously, it is possible to reorganize the existing compaction strategy to make storage much more efficient for both reads and writes.

It’s fairly obvious that STCS or LCS are not a perfect fit for storing BLOBs (the object’s data). Using our knowledge about the data, we can reduce amplification factors dramatically. Basically, we can get rid of compaction in the old sense. Our new Object-Aware Compaction Strategy (OACS) has the following characteristics:

  • It keeps the pieces of a given object together. An SSTable stores pieces of only a single object, so it is much simpler to locate the required partition. This makes writing, reading, and garbage collecting objects much cheaper than with other compaction strategies. A user specifies a column that tells the compaction strategy which parts belong to the same object.
  • Flushing a memtable keeps different objects in separate SSTables from the beginning so no further sorting is required.

This approach gives the following advantages:

  • It allows for optimal write performance, because compaction will not have to rewrite any given object over and over again.
  • It allows for faster reads, because we can quickly locate the SSTable(s) which a particular object is stored at.
  • It allows for more efficient garbage collection, because deleting a given object through a tombstone will not require touching the other existing objects.

As always, that was not enough for us. We also implemented multipart upload to support scenarios involving very large objects. It’s an S3 feature that allows the user to upload up to 10,000 parts of an object, up to 5 terabytes in total size, in any order, with the ability to re-upload any part if something goes wrong. It’s worth noting that after finalizing the upload you need to stitch the parts together. To achieve that, one can either reshuffle the data chunks or keep using auxiliary metadata at the expense of extra database lookups. For the PoC we chose the latter, which means that the finalization step is very fast even for huge files.
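One hypothetical way to keep such auxiliary metadata — again, an illustrative sketch rather than the project’s actual schema — is a small per-upload table that records each uploaded part, so finalization only has to assemble a manifest instead of moving chunk data around:

CREATE TABLE s3.multipart_parts (
    upload_id text,     -- multipart upload identifier
    part_no   int,      -- part number (1 to 10,000)
    etag      text,     -- ETag returned when the part was uploaded
    size      bigint,   -- part size in bytes
    PRIMARY KEY (upload_id, part_no)
);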

Was it Enough?

We used the phrase “Was it enough?” so extensively in our final presentation that it became our slogan and went viral internally. It was emblematic of our team, repeatedly one-upping our accomplishments and our goals during the hackathon.

We only had the length of the hackathon to do our work, but consider all that we accomplished in that short time. We added S3 support to Scylla that enables the following usage scenarios:

  • Use Scylla as a large binary object store (upload files up to 5 TB in size, as in AWS S3)
  • Use Scylla via the S3 API and AWS S3 tools, with multipart upload (MPU)
  • Use Scylla as a file system via s3fs (FUSE)
  • Access Scylla data with curl or a browser
  • Stream media content from Scylla directly via the S3 frontend
  • Use Hadoop or Spark to read data right from the Scylla S3 API
  • Use the CDC log to write right to the Scylla S3 API
  • Connect the Scylla CDC log with Spark

While it’s still a long way off before our work shows up in a released version of Scylla, here’s a sneak peek at what we were able to accomplish with our S3 API:



If you want to follow along with our current and future work, you can find the code on GitHub: https://github.com/sitano/s3-scylla/tree/scylla.

If you have more questions, we invite you to ask us on our Slack channel, or contact us directly.

We’d also like to invite you to our upcoming Scylla Summit, where you can learn about the other features we’ve been working on, and from your industry peers on how they’ve used Scylla in their own organizations.



How Netflix Scales its API with GraphQL Federation (Part 2)

In our previous post and QConPlus talk, we discussed GraphQL Federation as a solution for distributing our GraphQL schema and implementation. In this post, we shift our attention to what is needed to run a federated GraphQL platform successfully — from our journey implementing it to lessons learned.

Netflix GraphQL Federation

Our Journey so Far

Over the past year, we’ve implemented the core infrastructure pieces necessary for a federated GraphQL architecture as described in our previous post:

Studio Edge Architecture

The first Domain Graph Service (DGS) on the platform was the former GraphQL monolith that we discussed in our first post (Studio API). Next, we worked with a few other application teams to make DGSs that would expose their APIs alongside the former monolith. We had our first Studio applications consuming the federated graph, without any performance degradation, by the end of 2019. Once we knew that the architecture was feasible, we focused on readying it for broader usage. Our goal was to open up the Studio Edge platform for self-service in April 2020.

April 2020 was a turbulent time with the pandemic and overnight transition to working remotely. Nevertheless, teams started to jump into the graph in droves. Soon we had hundreds of engineers contributing directly to the API on a daily basis. And what about that Studio API monolith that used to be a bottleneck? We migrated the fields exposed by Studio API to individually owned DGSs without breaking the API for consumers. The original monolith is slated to be completely deprecated by the end of 2020.

This journey hasn’t been without its challenges. The biggest challenge was aligning on this strategy across the organization. Initially, there was a lot of skepticism and dissent; the concept was fairly new and would require high alignment across the organization to be successful. Our team spent a lot of time addressing dissenting points and making adjustments to the architecture based on feedback from developers. Through our prototype development and proactive partnership with some key critical voices, we were able to instill confidence and close crucial gaps.

Once we achieved broad alignment on the idea, we needed to ensure that adoption was seamless. This required building robust core infrastructure, ensuring a great developer experience, and solving for key cross-cutting concerns.

Core Infrastructure

Our GraphQL Gateway is based on Apollo’s reference implementation and is written in Kotlin. This gives us access to Netflix’s Java ecosystem, while also giving us robust language features such as coroutines for efficient parallel fetches and an expressive type system with null safety.

The schema registry is developed in-house, also in Kotlin. For storing schema changes, we use an internal library that implements the event sourcing pattern on top of the Cassandra database. Using event sourcing allows us to implement new developer experience features such as the Schema History view. The schema registry also integrates with our CI/CD systems like Spinnaker to automatically set up cloud networking for DGSs.

Developer Education & Experience

In the previous architecture, only the monolith Studio API team needed to learn GraphQL. In Studio Edge, every DGS team needs to build expertise in GraphQL. GraphQL has its own learning curve and can get especially tricky for complex cases like batching & lookahead. Also, as discussed in the previous post, understanding GraphQL Federation and implementing entity resolvers is not trivial either.

We partnered with Netflix’s Developer Experience (DevEx) team to build out documentation, training materials, and tutorials for developers. For general GraphQL questions, we lean on the open source community plus cultivate an internal GraphQL community to discuss hot topics like pagination, error handling, nullability, and naming conventions.

DGS Framework & Developer Tools

To make it easy for backend engineers to build a GraphQL DGS, the DevEx team built a “DGS Framework” on top of GraphQL Java and Spring Boot. The framework takes care of all the cross-cutting concerns of running a GraphQL service in production while also making it easier for developers to write GraphQL resolvers. In addition, DevEx built robust tooling for pushing schemas to the Schema Registry and a Self Service UI for browsing the various DGS’s schemas. Check out their conference talk and expect a future blog post from our colleagues. The DGS framework is planned to be open-sourced in early 2021.

Schema Governance

Netflix’s studio data is extremely rich and complex. Early on, we anticipated that active schema management would be crucial for schema evolution and overall health. We had a Studio Data Architect already in the org who was focused on data modeling and alignment across Studio. We engaged with them to determine graph schema best practices to best suit the needs of Studio Engineering.

Our goal was to design a GraphQL schema that was reflective of the domain itself, not the database model. UI developers should not have to build Backends For Frontends (BFF) to massage the data for their needs, rather, they should help shape the schema so that it satisfies their needs. Embracing a collaborative schema design approach was essential to achieving this goal.

Schema Design Workflow

The collaborative design process involves feedback and reviews across team boundaries. To streamline schema design and review, we formed a schema working group and a managed technical program for on-boarding to the federated architecture. While reviews add overhead to the product development process, we believe that prioritizing the quality of the graph model will reduce the amount of future changes and reworking needed. The level of review varies based on the entities affected; for the core federated types, more rigor is required (though tooling helps streamline that flow).

We have a deprecation workflow in place for evolving the schema. We’ve leveraged GraphQL’s deprecation feature and also track usage stats for every field in the schema. Once the stats show that a deprecated field is no longer used, we can make a backward incompatible change to remove the field from the schema.

Clients with Deprecated Field Usage

We embraced a schema-first approach instead of generating our schema from existing models such as the Protobuf objects in our gRPC APIs. While Protobufs and gRPC are excellent solutions for building service APIs, we prefer decoupling our GraphQL schema from those layers to enable cleaner graph design and independent evolvability. In some scenarios, we implement generic mapping code from GraphQL resolvers to gRPC calls, but the extra boilerplate is worth the long-term flexibility of the GraphQL API.

Underlying our approach is a foundation of “context over control”, which is a key tenet of Netflix’s culture. Instead of trying to hold tight control of the entire graph, we give guidance and context to product teams so that they can apply their domain knowledge to make a flexible API for their domain. As this architecture matures, we will continue to monitor schema health and develop new tooling, processes, and best practices where needed.

Observability

In our previous architecture, observability was achieved through manual analysis and routing via the API team, which scaled poorly. For our federated architecture, we prioritized solving observability needs in a more scalable manner. We prioritized three areas:

  • Alerting — report when something goes awry
  • Discovery — easily determine what isn’t working
  • Diagnosis — debug why something isn’t working

Our guiding metrics in this space are mean time to resolution (MTTR) and service level objectives and indicators (SLO/SLI).

We teamed up with experts from Netflix’s Telemetry team. We integrated the Gateway and DGS architectural components with Zipkin, the internal distributed tracing tool Edgar, and application monitoring tool TellTale. In GraphQL, almost every response is a 200 with custom errors in the error block. We introspect these custom error codes from the response and emit them to our metrics server, Atlas. These integrations created a great foundation of rich visibility and insights for the consumers and developers of the GraphQL API.

Edgar Trace for a Federated Request Lifecycle
Timeline View for a Federated Request

Distributed Log Correlation helps with debugging more complex server issues. By surfacing the application level logging details for all systems involved in processing a request, we gain deeper insights into what happened across the stack. Developers can easily see what was happening around the same time as a given request, to inspect surrounding factors that might have impacted an interaction.

Log correlation across multiple services for a Federated Request

To solve the “who do I ask about…” routing problem, we integrated deep linking from GraphQL types and fields to their owning team’s support channels. Finding support is now as simple as clicking a link from a trace, which helps shorten MTTR and reduce the number of times the gateway team needs to get involved.

Securing the Federated Graph

Our goal is to enable robust and consistent security practices across the federated architecture. To achieve this, we partnered with the security experts at Netflix to build security into the graph. Let’s look at two essential parts of our security solution: AuthN and AuthZ.

Authentication

All of our product experiences in the Studio space require an authenticated account, so we restrict the GraphQL Gateway access to only trusted authenticated callers. Additionally, Graph Introspection is restricted to Netflix internal developers.

Authorization

Before Studio Edge, authorization logic was fragmented across teams. Some teams implemented authorization in their BFFs, some in microservices, and others did both for good measure. The result was often a different authorization story for a given piece of data depending on which UI a user was accessing it through. UI teams also found themselves needing to implement (and re-implement) authorization checks with each new frontend.

In Studio Edge, we delegated the authorization responsibility to DGS owners. This resulted in consistent authorization for the same user across different applications. Plus, Product Managers, Engineers and the Security team can easily get a bird’s eye view of who has access to each data type and how.

We have multiple authorization offerings within Netflix: from a simple system that grants access based on user identity to a more granular system that brings in the concept of roles and capabilities. DGS developers can choose a solution based on their needs. They then simply annotate their resolvers with the @Secured annotation and configure it to use one of the available systems. If needed, more complex authorization can be implemented in the resolver or in downstream systems.

Future of Authorization

We are currently prototyping a GraphQL-aware authorization solution. The Schema Registry automatically generates Access Control Groups (ACGs) for each field and its corresponding type when its schema is registered. Product managers & DGS Engineers decide membership and rules for these generated ACGs. Since the ACGs map to a field in GraphQL, the DGS framework then automatically applies the rules associated with the ACG during execution.

Architecting for Failure

The GraphQL Gateway is the single entry point for all requests; a failure on the gateway can cause significant disruptions. Following Netflix engineering best practices, we assume failures will happen and design ways to mitigate the impact of those failures. These are our design principles for ensuring the gateway layer is resilient:

  1. Single purpose
  2. Stateless service
  3. Demand controlled
  4. Multi-region
  5. Sharded by functionality

First, we focus the responsibilities of the gateway layer on a single purpose: parse client queries, then build and execute query plans. By reducing the scope, we limit the range of problems that can occur. We aim to perform any additional resource-intensive operations off-box with the exception of logging and metrics. Taking on additional unrelated logic in the gateway layer could increase surface area for failures in this critical tier.

Second, we run multiple stateless instances of the gateway service. Any gateway instance is able to generate and execute a query plan for any request. When we do code changes to the gateway layer, we rigorously test them before rolling out to production.

Third, we seek to balance the resources each request consumes through applying demand control. We rate-limit callers to avoid overloading the underlying databases that are the source of most of our domain elements. We also run a static query cost calculation on all incoming queries and reject expensive queries to avoid gridlock in gateway and DGS resources. Our partners understand these tradeoffs and work with us to meet these requirements, reworking expensive queries and reducing high volume callers.

Fourth, we deploy our gateway layer to multiple AWS regions around the world. This allows us to limit the blast radius for problems that inevitably arise. When problems happen, we can fail over to another region to ensure our clients are minimally impacted.

Last, we deploy multiple functional shards of our gateway layer. The code is the same in each shard and incoming requests are routed based on category. For example, GraphQL subscriptions generally result in long-lived connections while Queries & Mutations are short-lived. We use a separate fleet of instances for Subscriptions so “running out of connections” does not affect the availability of Queries and Mutations.

There is more we can do to improve resilience. We have plans to do canary deployments and analysis for gateway deployments and, eventually, schema changes. Today, our gateway dynamically updates its schema by polling the schema registry. We are in the process of decoupling these by storing the federation config in a versioned S3 bucket, making the gateway resilient to schema registry failures.

Closing Thoughts

GraphQL and Federation have been a productivity multiplier for Studio applications. Motivated by this, we’ve recently prototyped using GraphQL Federation for the Netflix consumer app search page on iOS & Android. To do this, we created three DGSs to provide the data for a minimal portion of the consumer graph. We are sending a small subset of users to this alternative stack and measuring high-level metrics. We are excited to see the results and explore further applicability in the Netflix consumer space.

Despite our positive experience, GraphQL Federation is early in its maturity lifecycle and may not be the best fit for every team or organization. Learning GraphQL and DGS development, running a federation layer, and doing a migration requires high commitment from partner teams and seamless cross-functional collaboration. If you’re considering going in this direction, we recommend checking out Apollo’s SaaS offering for Federation and the many online resources for learning GraphQL. For ecosystems like ours with a large swath of microservices that need to be aggregated together, the development velocity and improved operability has made the transition worth it.

In closing, we want to hear from you! If you have already implemented federation or tried to solve this problem with another approach, we would love to learn more. Sharing knowledge is one of the ways our industry learns and improves rapidly. Finally, if you’d like to be a part of solving complex and interesting problems like this at Netflix scale, check out our jobs page or reach out to us directly.

By Tejas Shikhare, Edited by Philip Fisher-Ogden

Additional Credits: Stephen Spalding, Jennifer Shin, Robert Reta, Antoine Boyer, Bruce Wang, David Simmer



Dynamic Status Timeouts in Scylla Manager 2.2

Repairing and backing up Scylla clusters are the two most important functions that Scylla Manager provides. But there is one more function that can easily be overlooked: probing clusters to establish health check status reports.

It may seem redundant that each node status is reported to Scylla Monitoring twice:

  • By each node directly, over Prometheus metric API
  • By Manager Health Check

This is useful in the (rare) cases when a node thinks it is up and running while, in practice, clients cannot connect to the CQL API or are experiencing high-latency responses. In such cases the node will report itself to be in a normal state, while the Manager Health Check, using the CQL API, identifies the issue and reports it.

This article will explore how health checks were improved with dynamic timeouts in Scylla Manager 2.2.

Dynamic Timeouts

Health Check status reports are provided by the sctool status command. Behind the scenes this command sends a request to the Scylla Manager REST API, which in turn performs all the necessary status checks. Checks are performed in parallel on all nodes of a cluster.

Scylla Manager can manage multiple clusters spanning multiple datacenters (DCs) at the same time. Latencies can vary greatly across all of them. Scylla Manager’s existing static latency configuration proved to be problematic for a use case with different DCs, each within different regions. To tackle this problem Scylla Manager 2.2 introduced dynamic timeouts.

The idea behind dynamic timeouts is to minimize manual configuration while still adapting to changes in the managed environments. This is done by calculating the timeout based on a series of past measurements.

The main advantage of dynamic over static timeouts is that the former can adapt to a gradual increase in response time.

Dynamic vs. Static Timeout Limits

Sudden spikes in response time are still going to hit the limits:


Sudden increase in response time with a down node

Here’s an example configuration for Health Check dynamic timeouts in the config file:

# Dynamic timeout calculates timeout based on past measurements.
# It takes recent RTTs, calculates mean (m), standard deviation (stddev),
# and returns timeout of next probe equal to m + stddev_multiplier * max(stddev, 1ms).
#
# Higher stddev_multiplier is recommended on stable network environments,
# because standard deviation may be close to 0.
# dynamic_timeout:
# enabled: true
# probes: 200
# max_timeout: 30s
# stddev_multiplier: 5

Fixed timeout configuration remains an option, but if healthcheck.dynamic_timeout.enabled is set to true, the fixed timeout is ignored and dynamic timeouts are used instead.
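For example, explicitly enabling dynamic timeouts would look roughly like this — a sketch assuming the options nest under the healthcheck section of the Scylla Manager config file, as the dotted option names below suggest:

healthcheck:
  dynamic_timeout:
    enabled: true
    probes: 200
    max_timeout: 30s
    stddev_multiplier: 5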

The dynamic timeout is calculated as follows:

  • m — mean across the N most recent probes
  • stddev — standard deviation across the N most recent probes
  • timeout = m + stddev_multiplier * max(stddev, 1ms)

Instead of setting a global fixed value for the timeout, users can tune the behavior of dynamic timeouts by turning a few knobs:

healthcheck.dynamic_timeout.probes — How many past measurements to consider in the timeout limit calculation. The more probes there are, the more the current limit is based on the trend of past values. Increase this if the limit should adjust slowly to changes, and decrease it for faster adaptation.

healthcheck.dynamic_timeout.max_timeout — If there are no past measurements, or fewer than 10% of the required number of measurements have been collected, this value is used as the default timeout. It should be set to a value large enough to make empirical sense for all managed clusters until the optimal number of probes is collected.

max_timeout behavior

healthcheck.dynamic_timeout.stddev_multiplier — Use this to configure the acceptable level of variation around the mean response time. This value sets the threshold for fluctuations under normal operation. Increasing it allows more erratic response times before the timeout is triggered.

Comparison of stddev_multiplier settings

Conclusion

Scylla Manager already had excellent options for ad hoc health checks of Scylla clusters. With dynamic timeouts it just became even more user friendly. Beyond health checks Scylla Manager 2.2 offers improved repair processes with features like parallel repairs, graceful stops, visibility into repair progress, small table optimization and more.

If you are a user of Scylla Manager we are always looking for feedback on how we can improve our product. Feel free to contact us directly, or bring up your issues in our Slack community.


The Instaclustr LDAP Plugin for Cassandra 2.0, 3.0, and 4.0

LDAP (Lightweight Directory Access Protocol) is a common vendor-neutral and lightweight protocol for organizing authentication of network services. Integration with LDAP allows users to unify the company’s security policies when one user or entity can log in and authenticate against a variety of services. 

There is a lot of demand from our enterprise customers to be able to authenticate to their Apache Cassandra clusters against LDAP. As the leading NoSQL database, Cassandra is typically deployed across the enterprise and needs this connectivity.

Instaclustr has previously developed our LDAP plugin to work with the latest Cassandra releases. However, with Cassandra 4.0 right around the corner, it was due for an update to ensure compatibility. Instaclustr takes a great deal of care to provide cutting-edge features and integrations for our customers, and our new LDAP plugin for Cassandra 4.0 showcases this commitment. We always use open source and maintain a number of Apache 2.0-licensed tools, and have released our LDAP plugin under the Apache 2.0 license. 

Modular Architecture to Support All Versions of Cassandra

Previously, the implementations for Cassandra 2.0 and 3.0 lived in separate branches, which resulted in some duplicative code. With our new LDAP plugin update, everything lives in one branch and we have modularized the whole solution so it aligns with earlier Cassandra versions and Cassandra 4.0.

The modularization of our LDAP plugin means that there is the “base” module that all implementations are dependent on. If you look into the codebase on GitHub, you see that the implementation modules consist of one or two classes at maximum, with the rest inherited from the base module. 

This way of organizing the project is beneficial from a long-term maintenance perspective. We no longer need to keep track of all changes and apply them to the branched code for the LDAP plugin for each Cassandra version. When we implement changes and improvements to the base module, all modules are updated automatically and benefit.

Customizable for Any LDAP Implementation

This plugin offers a default authenticator implementation for connecting to an LDAP server and authenticating a user against it. It also offers a way to implement custom logic for specific use cases. In the module, we provide the most LDAP-server-agnostic implementation possible, but there is also scope for customization to meet specific LDAP server nuances. 

If the default solution needs to be modified for a particular customer use case, it is possible to add in custom logic for that particular LDAP implementation. The implementation for customized connections is found in “LDAPPasswordRetriever” (DefaultLDAPServer being the default implementation from which one might extend and override appropriate methods). This is possible thanks to the SPI mechanism. If you need this functionality you can read more about it in the relevant section of our documentation.

Enhanced Testing for Reliability

Our GitHub build pipeline now tests the LDAP plugins for each supported Cassandra version on every merged commit. This update provides integration tests that spin up a standalone Cassandra node as part of the JUnit tests, as well as an LDAP server, which is started in Docker as part of the Maven build before the actual JUnit tests run. 

This testing framework enables us to make sure that any changes don’t break the authentication mechanism. This is achieved by actually logging in via the usual mechanism as well as via LDAP.

Packaged for Your Operating System

Last but not least, we have now added Debian and RPM packages of our plugin for each Cassandra version release. Until now, a user of this plugin had to install the JAR file into the Cassandra libraries directory manually. With the introduction of these packages, you no longer need to perform this manual step. The plugin’s JAR, along with the configuration file, will be installed in the right place if the official Debian or RPM Cassandra package is installed too.

How to Configure LDAP for Cassandra

In this section we will walk you through the setup of the LDAP plugin and explain the most crucial parts of how the plugin works.

After placing the LDAP plugin JAR on Cassandra’s classpath — either by copying it over manually or by installing a package — you will need to modify a configuration file in /etc/cassandra/ldap.properties.

There are also changes that need to be applied to cassandra.yaml. For Cassandra 4.0, please be sure that your authenticator, authorizer, and role_manager are configured as follows:

authenticator: LDAPAuthenticator
authorizer: CassandraAuthorizer
role_manager: LDAPCassandraRoleManager

Before using this plugin, an operator of a Cassandra cluster should configure the system_auth keyspace to use NetworkTopologyStrategy.
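As a rough sketch, that change is a single CQL statement; the data center name and replication factor below are placeholders, so adjust them to your topology (and repair system_auth afterwards):

ALTER KEYSPACE system_auth WITH replication = {'class': 'NetworkTopologyStrategy', 'DC1': 3};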

How the LDAP Plugin Works With Cassandra Roles

The LDAP plugin works via a “dual authentication” technique. If a user tries to log in with a role that already exists in Cassandra, separate from LDAP, it will authenticate against that role. However, if that role is not present in Cassandra, the plugin will reach out to the LDAP server and try to authenticate against it. If that succeeds, from the user’s point of view it looks as if the role was in Cassandra the whole time, as the user is logged in transparently. 

If your LDAP server is down, you will not be able to authenticate with the specified LDAP user. You can enable caching for LDAP users — available in the Cassandra 3.0 and 4.0 plugins — to take some load off the LDAP server when authentication happens frequently.

The Bottom Line

Our LDAP plugin meets the enterprise need for a consolidated security and authentication policy. 100% open source and supporting all major versions of Cassandra, the plugin works with all major LDAP implementations and can be easily customized for others. 

The plugin is part of our suite of supported tools for our support customers and Instaclustr is committed to actively maintaining and developing the plugin. Our work updating it to support the upcoming Cassandra 4.0 release is part of this commitment. You can download it here and feel free to get in touch with any questions you might have. Cassandra 4.0 beta 2 is currently in preview on our managed platform and you can use our free trial to check it out.


Announcing: Stargate 1.0 in Astra; REST, GraphQL, & Schemaless JSON for Your Cassandra Development

Apache Cassandra Changelog #2 | December 2020

Our monthly roundup of key activities and knowledge to keep the community informed.


Release Notes

Released

Apache Cassandra 4.0-beta3, 3.11.9, 3.0.23, and 2.2.19 were released on November 4 and are in the repositories. Please pay attention to the release notes and let the community know if you encounter problems. Join the Cassandra mailing list to stay updated.

Changed

Cassandra 4.0 is progressing toward GA. There are 1,390 total tickets and remaining tickets represent 5.5% of total scope. Read the full summary shared to the dev mailing list and take a look at the open tickets that need reviewers.

Cassandra 4.0 will be dropping support for older distributions of CentOS 5, Debian 4, and Ubuntu 7.10. Learn more.

Community Notes

Updates on Cassandra Enhancement Proposals (CEPs), how to contribute, and other community activities.

Added

The community weighed options to address reads inconsistencies for Compact Storage as noted in ticket CASSANDRA-16217 (committed). The conversation continues in ticket CASSANDRA-16226 with the aim of ensuring there are no huge performance regressions for common queries when you upgrade from 2.x to 3.0 with Compact Storage tables or drop it from a table on 3.0+.

Added

CASSANDRA-16222 is a Spark library that can compact and read raw Cassandra SSTables into SparkSQL. By reading the sstables directly from a snapshot directory, one can achieve high performance with minimal impact to a production cluster. It was used to successfully export a 32TB Cassandra table (46bn CQL rows) to HDFS in Parquet format in around 70 minutes, a 20x improvement on previous solutions.

Changed

Great news for CEP-2 (Kubernetes Operator): the community has agreed to create a community-based operator by merging cass-operator and CassKop. The work being done can be viewed on GitHub here.

Released

The Reaper community announced v2.1 of its tool that schedules and orchestrates repairs of Apache Cassandra clusters. Read the docs.

Released

Apache Cassandra 4.0-beta-1 was released on FreeBSD.

User Space

Netflix

“With these optimized Cassandra clusters in place, it now costs us 71% less to operate clusters and we could store 35x more data than our previous configuration.” - Maulik Pandey

Yelp

“Cassandra is a distributed wide-column NoSQL datastore and is used at Yelp for both primary and derived data. Yelp’s infrastructure for Cassandra has been deployed on AWS EC2 and ASG (Autoscaling Group) for a while now. Each Cassandra cluster in production spans multiple AWS regions.” - Raghavendra D Prabhu

In the News

DevPro Journal - What’s included in the Cassandra 4.0 Release?

JAXenter - Moving to cloud-native applications and data with Kubernetes and Apache Cassandra

DZone - Improving Apache Cassandra’s Front Door and Backpressure

ApacheCon - Building Apache Cassandra 4.0: behind the scenes

Cassandra Tutorials & More

Users in search of a tool for scheduling backups and performing restores with cloud storage support (archiving to AWS S3, GCS, etc) should consider Cassandra Medusa.

Apache Cassandra Deployment on OpenEBS and Monitoring on Kubera - Abhishek Raj, MayaData

Lucene Based Indexes on Cassandra - Rahul Singh, Anant

How Netflix Manages Version Upgrades of Cassandra at Scale - Sumanth Pasupuleti, Netflix

Impacts of many tables in a Cassandra data model - Alex Dejanovski, The Last Pickle

Cassandra Upgrade in production : Strategies and Best Practices - Laxmikant Upadhyay, American Express

Apache Cassandra Collections and Tombstones - Jeremy Hanna

Spark + Cassandra, All You Need to Know: Tips and Optimizations - Javier Ramos, ITNext

How to install the Apache Cassandra NoSQL database server on Ubuntu 20.04 - Jack Wallen, TechRepublic

How to deploy Cassandra on Openshift and open it up to remote connections - Sindhu Murugavel


Cassandra Changelog is curated by the community. Please send submissions to cassandra@constantia.io.

Announcing the Astra Service Broker: Tradeoff-Free Cassandra in Kubernetes

[Webcast] The 411 on Storage Attached Indexing in Apache Cassandra

Renaming and reshaping Scylla tables using scylla-migrator

We recently faced a problem where some of the first Scylla tables we created on our main production cluster were no longer in line with the evolved schemas that more recent tables use.

This typical engineering problem requires either keeping those legacy tables and data queries, or migrating to the more optimal model along with the bandwagon of applications that must be modified to query the data the new way… That’s something nobody likes doing but hey, we don’t like legacy at Numberly, so let’s kill this one!

To overcome this challenge we used the scylla-migrator project and I thought it could be useful to share this experience.

How and why our schema evolved

When we first approached ID matching tables we chose to answer two problems at the same time: query the most recent data and keep the history of the changes per source ID.

This means that those tables included a date as part of their PRIMARY KEY, while the partition key was obviously the matching table ID we wanted to look up from:

CREATE TABLE IF NOT EXISTS ids_by_partnerid(
partnerid text,
id text,
date timestamp,
PRIMARY KEY ((partnerid), date, id)
)
WITH CLUSTERING ORDER BY (date DESC)

Making a table with an ever changing date in the clustering key creates what we call a history table. In the schema above the uniqueness of a row is not only defined by a partner_id / id couple but also by its date!

Quick caveat: you have to be careful about the actual date timestamp resolution since you may not want to create a row for every second of the same partner_id / id couple (we use an hour resolution).

History tables are good for analytics and we also figured we could use them for batch and real time queries where we would be interested in the “most recent ids for the given partner_id” (sometimes flavored with a LIMIT):

SELECT id FROM ids_by_partnerid WHERE partnerid = 'AXZAZLKDJ' ORDER BY date DESC;

As time passed, real time Kafka pipelines started to query these tables hard and were mostly interested in “all the ids known for the given partner_id“.

A sort of DISTINCT(id) is out of the scope of our table! For this we need a table schema that represents a condensed view of the data. We call these compact tables, and the only difference from the history table is that the date timestamp is simply not part of the PRIMARY KEY:

CREATE TABLE IF NOT EXISTS ids_by_partnerid(
partnerid text,
id text,
seen_date timestamp,
PRIMARY KEY ((partnerid), id)
)

To make that transition happen we thus wanted to:

  • rename history tables with an _history suffix so that they are clearly identified as such
  • get a compacted version of the tables (by keeping their old name) while renaming the date column name to seen_date
  • do it as fast as possible since we will need to stop our feeding pipeline and most of our applications during the process…

STOP: it’s not possible to rename a table in CQL!

Scylla-migrator to the rescue

We decided to abuse the scylla-migrator to perform this perilous migration.

As it was originally designed to help users migrate from Cassandra to Scylla by leveraging Spark it seemed like a good fit for the task since we happen to own Spark clusters powered by Hadoop YARN.

Building scylla-migrator for Spark < 2.4

The latest scylla-migrator does not support older Spark versions. The trick is to look at the README.md git log and check out the (hopefully right) commit that supports your Spark cluster version.

In our case for Spark 2.3 we used git commit bc82a57e4134452f19a11cd127bd4c6a25f75020.

On Gentoo, make sure to use dev-java/sbt-bin since the non binary version is vastly out of date and won’t build the project. You need at least version 1.3.

The scylla-migrator plan

The documentation explains that we need a config file that points to a source cluster+table and a destination cluster+table as long as they have the same schema structure…

Renaming is then as simple as duplicating the schema using CQLSH and running the migrator!
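In our case that meant creating the destination _history table with exactly the same schema as the original, just under its new name — something along these lines:

CREATE TABLE IF NOT EXISTS ids_by_partnerid_history(
    partnerid text,
    id text,
    date timestamp,
    PRIMARY KEY ((partnerid), date, id)
)
WITH CLUSTERING ORDER BY (date DESC);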

But what about our compacted version of our original table? The schema is different from the source table!…

Good news is that as long as all your columns remain present, you can also change the PRIMARY KEY of your destination table and it will still work!

This makes the scylla-migrator an amazing tool to reshape or pivot tables!

  • the column date is renamed to seen_date: that’s okay, scylla-migrator supports column renaming (it’s a Spark dataframe after all)!
  • the PRIMARY KEY is different in the compacted table since we removed the ‘date‘ from the clustering columns: we’ll get a compacted table for free!

Using scylla-migrator

The documentation is a bit poor on how to submit your application to a Hadoop YARN cluster but that’s kind of expected.

It also did not mention how to connect to an SSL-enabled cluster (are there really people not using SSL on the wire in their production environment?)… anyway, let’s not start a flame war 🙂

The trick that will save you is to know that you can append all the usual Spark options that are available in the spark-cassandra-connector!

Submitting to a Kerberos protected Hadoop YARN cluster targeting a SSL enabled Scylla cluster then looks like this:

export JAR_NAME=target/scala-2.11/scylla-migrator-assembly-0.0.1.jar
export KRB_PRINCIPAL=USERNAME

spark2-submit \
 --name ScyllaMigratorApplication \
 --class com.scylladb.migrator.Migrator  \
 --conf spark.cassandra.connection.ssl.clientAuth.enabled=True  \
 --conf spark.cassandra.connection.ssl.enabled=True  \
 --conf spark.cassandra.connection.ssl.trustStore.path=jssecacerts  \
 --conf spark.cassandra.connection.ssl.trustStore.password=JKS_PASSWORD  \
 --conf spark.cassandra.input.consistency.level=LOCAL_QUORUM \
 --conf spark.cassandra.output.consistency.level=LOCAL_QUORUM \
 --conf spark.scylla.config=config.yaml \
 --conf spark.yarn.executor.memoryOverhead=1g \
 --conf spark.blacklist.enabled=true  \
 --conf spark.blacklist.task.maxTaskAttemptsPerExecutor=1  \
 --conf spark.blacklist.task.maxTaskAttemptsPerNode=1  \
 --conf spark.blacklist.stage.maxFailedTasksPerExecutor=1  \
 --conf spark.blacklist.stage.maxFailedExecutorsPerNode=1  \
 --conf spark.executor.cores=16 \
 --deploy-mode client \
 --files jssecacerts \
 --jars ${JAR_NAME}  \
 --keytab ${KRB_PRINCIPAL}.keytab  \
 --master yarn \
 --principal ${KRB_PRINCIPAL}  \
 ${JAR_NAME}

Note that we chose to apply a higher consistency level to our reads using a LOCAL_QUORUM instead of the default LOCAL_ONE. I strongly encourage you to do the same since it’s appropriate when you’re using this kind of tool!

Column renaming is simply expressed in the configuration file like this:

# Column renaming configuration.
renames:
  - from: date
    to: seen_date

Tuning scylla-migrator

While easy to use, tuning scylla-migrator to operate those migrations as fast as possible turned out to be a real challenge (remember we have some production applications shut down during the process).

Even using 300+ Spark executors I couldn’t get my Scylla cluster utilization to more than 50% and migrating a single table with a bit more than 1B rows took almost 2 hours…

We found the best knobs to play with thanks to the help of Lubos Kosco and this blog post from ScyllaDB:

  • Increase the splitCount setting: more splits means more Spark executors will be spawned and more tasks out of it. While it might be magic on a pure Spark deployment it’s not that amazing on a Hadoop YARN one where executors are scheduled in containers with 1 core by default. We simply moved it from 256 to 384.
  • Disable compaction on destination tables schemas. This gave us a big boost and saved the day since it avoids adding the overhead of compacting while you’re pushing down data hard!

To disable compaction on a table simply:

ALTER TABLE ids_by_partnerid_history WITH compaction = {'class': 'NullCompactionStrategy'};

Remember to run a manual compaction (nodetool compact <keyspace> <table>) and to enable compaction back on your tables once you’re done!
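Re-enabling compaction is just another ALTER TABLE. For example, to switch back to the default size-tiered strategy (substitute whichever compaction strategy your table used originally):

ALTER TABLE ids_by_partnerid_history WITH compaction = {'class': 'SizeTieredCompactionStrategy'};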

Happy Scylla tables mangling!

How to Keep Pirates “Hackers” Away From Your Booty “Data” with Cassandra RBAC

Sizing Matters: Sizing Astra for Apache Cassandra Apps

Clearing the Air on Cassandra Batches

Apache Cassandra Changelog #1 (October 2020)

Introducing the first Cassandra Changelog blog! Our monthly roundup of key activities and knowledge to keep the community informed.


Release Notes

Updated

The most current Apache Cassandra releases are 4.0-beta2, 3.11.8, 3.0.22, 2.2.18 and 2.1.22 released on August 31 and are in the repositories. The next cut of releases will be out soon―join the Cassandra mailing list to stay up-to-date.

We continue to make progress toward the 4.0 GA release with the overarching goal of it being at a state where major users should feel confident running it in production when it is cut. Over 1,300 Jira tickets have been closed and less than 100 remain as of this post. To gain this confidence, there are various ongoing testing efforts involving correctness, performance, and ease of use.

Added

With CASSANDRA-15013, the community improved Cassandra's ability to handle high throughput workloads, while having enough safeguards in place to protect itself from potentially going out of memory.

Added

The Harry project is a fuzz testing tool that aims to generate reproducible workloads that are as close to real-life as possible, while being able to efficiently verify the cluster state against the model without pausing the workload itself.

Added

The community published its first Apache Cassandra Usage Report 2020 detailing findings from a comprehensive global survey of 901 practitioners on Cassandra usage to provide a baseline understanding of who, how, and why organizations use Cassandra.

Community Notes

Updates on new and active Cassandra Enhancement Proposals (CEPs) and how to contribute.

Changed

CEP-2: Kubernetes Operator was introduced this year and is an active discussion on creation of a community-based operator with the goal of making it easy to run Cassandra on Kubernetes.

Added

CEP-7: Storage Attached Index (SAI) is a new secondary index for Cassandra that builds on the advancements made with SASI. It is intended to replace the existing built-in secondary index implementations.

Added

Cassandra was selected by the ASF Diversity & Inclusion committee to be included in a research project to evaluate and understand the current state of diversity.

User Space

Bigmate

"In vetting MySQL, MongoDB, and other potential databases for IoT scale, we found they couldn't match the scalability we could get with open source Apache Cassandra. Cassandra's built-for-scale architecture enables us to handle millions of operations or concurrent users each second with ease – making it ideal for IoT deployments." - Brett Orr

Bloomberg

"Our group is working on a multi-year build, creating a new Index Construction Platform to handle the daily production of the Bloomberg Barclays fixed income indices. This involves building and productionizing an Apache Solr-backed search platform to handle thousands of searches per minute, an Apache Cassandra back-end database to store millions of data points per day, and a distributed computational engine to handle millions of computations daily." - Noel Gunasekar

In the News

Solutions Review - The Five Best Apache Cassandra Books on Our Reading List

ZDNet - What Cassandra users think of their NoSQL DBMS

Datanami - Cassandra Adoption Correlates with Experience

Container Journal - 5 to 1: An Overview of Apache Cassandra Kubernetes Operators

Datanami - Cassandra Gets Monitoring, Performance Upgrades

ZDNet - Faster than ever, Apache Cassandra 4.0 beta is on its way

Cassandra Tutorials & More

A Cassandra user was in search of a tool to perform schema DDL upgrades. Another user suggested https://github.com/patka/cassandra-migration to ensure you don't get schema mismatches if running multiple upgrade statements in one migration. See the full email on the user mailing list for other recommended tools.

Start using virtual tables in Apache Cassandra 4.0 - Ben Bromhead, Instaclustr

Benchmarking Apache Cassandra with Rust - Piotr Kołaczkowski, DataStax

Open Source BI Tools and Cassandra - Arpan Patel, Anant Corporation

Build Fault Tolerant Applications With Cassandra API for Azure Cosmos DB - Abhishek Gupta, Microsoft

Understanding Data Modifications in Cassandra - Sameer Shukla, Redgate


Cassandra Changelog is curated by the community. Please send submissions to cassandra@constantia.io.

Building a Low-Latency Distributed Stock Broker Application: Part 4

In the fourth blog of the “Around the World” series we built a prototype of the application, designed to run in two georegions.

Recently I re-watched “Star Trek: The Motion Picture” (The original 1979 Star Trek Film). I’d forgotten how much like “2001: A Space Odyssey” the vibe was (a drawn out quest to encounter a distant, but rapidly approaching very powerful and dangerous alien called “V’ger”), and also that the “alien” was originally from Earth and returning in search of “the Creator”—V’ger was actually a seriously upgraded Voyager spacecraft!

Star Trek: The Motion Picture (V’ger)

The original Voyager 1 and 2 had only been recently launched when the movie came out, and were responsible for some remarkable discoveries (including the famous “Death Star” image of Saturn’s moon Mimas, taken in 1980). Voyager 1 has now traveled further than any other human artefact and has since left the solar system! It’s amazing that after 40+ years it’s still working, although communicating with it now takes a whopping 40 hours in round-trip latency (which happens via the Deep Space Network—one of the stations is at Tidbinbilla, near our Canberra office).

Canberra Deep Space Communication Complex (CSIRO)

Luckily we are only interested in traveling “Around the World” in this blog series, so the latency challenges we face in deploying a globally distributed stock trading application are substantially less than the 40 hours latency to outer space and back. In Part 4 of this blog series we catch up with Phileas Fogg and Passepartout in their journey, and explore the initial results from a prototype application deployed in two locations, Sydney and North Virginia.

1. The Story So Far

In Part 1 of this blog series we built a map of the world to take into account inter-region AWS latencies, and identified some “georegions” that enabled sub 100ms latency between AWS regions within the same georegion. In Part 2 we conducted some experiments to understand how to configure multi-DC Cassandra clusters and use java clients, and measured latencies from Sydney to North Virginia. In Part 3 we explored the design for a globally distributed stock broker application, and built a simulation and got some indicative latency predictions. 

The goal of the application is to ensure that stock trades are done as close as possible to the stock exchanges the stocks are listed on, to reduce the latency between receiving a stock ticker update, checking the conditions for a stock order, and initiating a trade if the conditions are met. 

2. The Prototype

For this blog I built a prototype of the application, designed to run in two georegions, Australia and the USA. We are initially only trading stocks available from stock exchanges in these two georegions: orders for stocks traded in New York will be directed to a Broker deployed in the AWS North Virginia region, and orders for stocks traded in Sydney will be directed to a Broker deployed in the AWS Sydney region. As this Google Earth image shows, the two regions are close to being antipodes (diametrically opposite each other, 15,500 km apart, so they pretty much satisfy the definition of traveling “Around the World”), with a measured inter-region latency (from blog 2) of 230ms.

Sydney to North Virginia (George Washington’s Mount Vernon house)

The design of the initial (simulated) version of the application was changed in a few ways to: 

  • Ensure that it worked correctly when instances of the Broker were deployed in multiple AWS regions
  • Measure actual rather than simulated latencies, and
  • Use a multi-DC Cassandra Cluster. 

The prototype isn’t a complete implementation of the application yet. In particular it only uses a single Cassandra table for Orders—to ensure that Orders are made available in both georegions, and can be matched against incoming stock tickers by the Broker deployed in that region. 

Some other parts of the application are “stubs”, including the Stock Ticker component (which will eventually use Kafka), and the checks/updates of Holdings and generation of Transactions records (which will eventually also be Cassandra tables). Currently only the asynchronous, non-market order types (Limit and Stop orders) are implemented. I realized that implementing Market orders (which must be traded immediately) using a Cassandra table would produce too many tombstones, because each order is deleted immediately upon being filled (rather than being marked as “filled” as in the original design) so that the Cassandra query can quickly find unfilled orders and the Orders table doesn’t grow indefinitely. For non-market orders, however, a Cassandra table is a reasonable design choice: Orders may exist for extended periods of time (hours, days, or even weeks) before being traded, and because there are only a small number of successful trades per second (10s to 100s) relative to the total number of waiting Orders (potentially millions), the number of deletions, and therefore tombstones, will be acceptable.

Let’s now look at the details of some of the more significant changes that were made to the application.

Cassandra

I created a multi-DC Cassandra keyspace as follows:

CREATE KEYSPACE broker WITH replication = {'class': 'NetworkTopologyStrategy', 'NorthVirginiaDC': '3', 'SydneyDC': '3'};

In the Cassandra Java driver, the application.conf file determines which Data Center the Java client connects to. For example, to connect to the SydneyDC the file has the following settings:

datastax-java-driver {
    basic.contact-points = [
        "1.2.3.4:9042"
    ]

    basic.load-balancing-policy {
        class = DefaultLoadBalancingPolicy
        local-datacenter = "SydneyDC"
    }
}
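
With this file on the classpath, the 4.x Java driver picks up both the contact point and the local Data Center automatically when the session is created, so the application code itself doesn’t need any connection details. A minimal sketch of what the session setup could look like (this is an assumption on my part; the prototype’s actual Cassandra helper class may differ):

import com.datastax.oss.driver.api.core.CqlSession;

public class Cassandra {
    // application.conf on the classpath supplies basic.contact-points and
    // local-datacenter, so no programmatic configuration is needed here.
    public static final CqlSession session = CqlSession.builder().build();
}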

The Orders table was created as follows (note that I couldn’t use “limit” for a column name as it’s a reserved word in Cassandra!):

CREATE TABLE broker.orders (
    symbol text,
    orderid text,
    buyorsell text,
    customerid text,
    limitthreshold double,
    location text,
    ordertype text,
    quantity bigint,
    starttime bigint,
    PRIMARY KEY (symbol, orderid)
);

For the primary key, the partition key is the stock symbol, so that all outstanding orders for a stock can be found when each stock ticker is received by the Broker, and the clustering column is the (unique) orderid, so that multiple orders for the same symbol can be written and read, and a specific order (for an orderid) can be deleted. Note that in a production environment, using a single partition per stock symbol may result in skewed and unbounded partitions, which is not recommended.

The prepared statements for creating, reading, and deleting orders are as follows:

PreparedStatement prepared_insert = Cassandra.session.prepare(
        "insert into broker.orders (symbol, orderid, buyorsell, customerid, limitthreshold, location, ordertype, quantity, starttime) values (?, ?, ?, ?, ?, ?, ?, ?, ?)");

PreparedStatement prepared_select = Cassandra.session.prepare(
        "select * from broker.orders where symbol = ?");

PreparedStatement prepared_delete = Cassandra.session.prepare(
        "delete from broker.orders where symbol = ? and orderid = ?");

I implemented a simplified “Place limit or stop order” operation (see Part 3), which uses the prepared_insert statement to create each new order, initially in the Cassandra Data Center local to the Broker where the order was created from, which is then automatically replicated in the other Cassandra Data Center. I also implemented the “Trade Matching Order” operation (Part 3), which uses the prepared_select statement to query orders matching each incoming Stock Ticker, checks the rules, and then if a trade is filled deletes the order.
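
As a rough sketch of how these prepared statements fit together in the “Trade Matching Order” path, the loop below shows the idea; matches() and fill() are hypothetical stand-ins for the rule checks and the (currently stubbed) trade recording, and the real prototype code may differ:

import com.datastax.oss.driver.api.core.cql.ResultSet;
import com.datastax.oss.driver.api.core.cql.Row;

// Called for each incoming stock ticker: find outstanding orders for the symbol,
// apply the limit/stop rules, and delete any order that gets filled.
void onStockTicker(String symbol, double price) {
    ResultSet orders = Cassandra.session.execute(prepared_select.bind(symbol));
    for (Row order : orders) {
        if (matches(order, price)) {   // hypothetical rule check (limit/stop threshold)
            fill(order, price);        // hypothetical trade recording (a stub for now)
            Cassandra.session.execute(
                    prepared_delete.bind(symbol, order.getString("orderid")));
        }
    }
}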

Deployment

I created a 3 node Cassandra cluster in the Sydney AWS region, and then added another identical Data Center in the North Virginia AWS region using Instaclustr Managed Cassandra for AWS. This gave me 6 nodes in total, running on t3.small instances (5 GB SSD, 2GB RAM, 2 CPU Cores). This is a small developer-sized cluster, but it is adequate for a prototype, and very affordable (2 cents an hour per node for AWS costs) given that the Brokers are currently only single threaded so don’t produce much load. We’re more interested in latency at this point of the experiment, and we may want to increase the number of Data Centers in the future. I also spun up an EC2 instance (t3a.micro) in each of the two AWS regions, and deployed an instance of the Stock Broker on each (it only used 20% CPU). Here’s what the complete deployment looks like:

3. The Results

For the prototype, the focus was on demonstrating that the design goal of minimizing latency for trading stop and limit orders (asynchronous trades) was achieved. For the prototype, the latency for these order types is measured from the time of receiving a Stock Ticker, to the time an Order is filled. We ran a Broker in each AWS region concurrently for an hour, with the same workload for each, and measured the average and maximum latencies. For the first configuration, each Broker is connected to its local Cassandra Data Center, which is how it would be configured in practice. The results were encouraging, with an average latency of 3ms, and a maximum of 60ms, as shown in this graph.  

During the run, across both Brokers, 300 new orders were created each second, 600 stock tickers were received each second, and 200 trades were carried out each second. 

Given that I hadn’t implemented Market Orders yet, I wondered how I could configure and approximately measure the expected latency for these synchronous order types between different regions (i.e. Sydney and North Virginia). The latency for Market orders in the same region will be comparable to that of the non-market orders. The solution turned out to be simple: just re-configure the Brokers to use the remote Cassandra Data Center, which introduces the inter-region round-trip latency that would also be encountered with Market Orders placed in one region and traded immediately in the other region. I could also have achieved a similar result by changing the consistency level to EACH_QUORUM (which requires a majority of nodes in each data center to respond). Not surprisingly, the latencies were higher, rising to 360ms average and 1200ms maximum, as shown in this graph with both configurations (Stop and Limit Orders on the left, and Market Orders on the right):
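
For reference, the consistency level alternative mentioned above is a one-line change per statement with the 4.x Java driver (a sketch; statements are immutable, so setConsistencyLevel returns a new statement):

import com.datastax.oss.driver.api.core.DefaultConsistencyLevel;
import com.datastax.oss.driver.api.core.cql.BoundStatement;

// Require a quorum in every Data Center, not just the local one.
BoundStatement insert = prepared_insert.bind(/* order fields */)
        .setConsistencyLevel(DefaultConsistencyLevel.EACH_QUORUM);
Cassandra.session.execute(insert);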

So our initial experiments are a success, and validate the primary design goal, as asynchronous stop and limit Orders can be traded with low latency from the Broker nearest the relevant stock exchanges, while synchronous Market Orders will take significantly longer due to inter-region latency. 

Write Amplification

I wondered what else could be learned from running this experiment. We can understand more about resource utilization in multi-DC Cassandra clusters. Using the Instaclustr Cassandra Console, I monitored the CPU Utilization on each of the nodes in the cluster: initially with only one Data Center and one Broker, then with two Data Centers and a single Broker, and then with both Brokers running. It turns out that the read load results in 20% CPU Utilization on each node in the local Cassandra Data Center, and the write load also results in 20% locally. Thus, for a single Data Center cluster the total load is 40% CPU. However, with two Data Centers things get more complex due to the replication of the local write loads to each other Data Center. This is also called “Write Amplification”.

The following table shows the measured total load for 1 and 2 Data Centers, and predicted load for up to 8 Data Centers, showing that for more than 3 Data Centers you need bigger nodes (or bigger clusters). A four CPU Core node instance type would be adequate for 7 Data Centers, and would result in about 80% CPU Utilization.  

  Number of Data Centres   Local Read Load   Local Write Load   Remote Write Load   Total Write Load   Total Load
  1                        20                20                 0                   20                 40
  2                        20                20                 20                  40                 60
  3                        20                20                 40                  60                 80
  4                        20                20                 60                  80                 100
  5                        20                20                 80                  100                120
  6                        20                20                 100                 120                140
  7                        20                20                 120                 140                160
  8                        20                20                 140                 160                180
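
The pattern behind these numbers is straightforward: every Data Center serves its own read and write load plus a replicated copy of every other Data Center’s writes, so as a rough rule of thumb for this workload:

Total Load (%) = Local Read Load + (N × Local Write Load) = 20 + 20 × N, where N is the number of Data Centers

which reproduces the table: 40% for a single Data Center, and 180% for eight.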

Costs

The total cost to run the prototype includes the Instaclustr Managed Cassandra nodes (3 nodes per Data Center x 2 Data Centers = 6 nodes), the two AWS EC2 Broker instances, and the data transfer between regions (AWS only charges for data out of a region, not in, but the prices vary depending on the source region). For example, data transfer out of North Virginia is 2 cents/GB, but Sydney is more expensive at 9.8 cents/GB. I computed the total monthly operating cost to be $361 for this configuration, broken down into $337/month (93%) for Cassandra and EC2 instances, and $24/month (7%) for data transfer, to process around 500 million stock trades. Note that this is only a small prototype configuration, but can easily be scaled for higher throughputs (with incrementally and proportionally increasing costs).

Conclusions

In this blog we built and experimented with a prototype of the globally distributed stock broker application, focussing on testing the multi-DC Cassandra part of the system, which enabled us to significantly reduce the impact of planetary scale latencies (from seconds to low milliseconds) and ensure greater redundancy (across multiple AWS regions) for the real-time stock trading function. Some parts of the application remain as stubs, and in future blogs I aim to replace them with suitable functionality (e.g. streaming, analytics) and non-functional capabilities (e.g. failover) from a selection of Kafka, Elasticsearch and maybe even Redis!

The post Building a Low-Latency Distributed Stock Broker Application: Part 4 appeared first on Instaclustr.

Understanding the Impacts of the Native Transport Requests Change Introduced in Cassandra 3.11.5

Summary

Recently, Cassandra made changes to the Native Transport Requests (NTR) queue behaviour. Through our performance testing, we found the new NTR change to be good for clusters that have a constant load causing the NTR queue to block. Under the new mechanism the queue no longer blocks, but throttles the load based on the queue size setting, which by default is 10% of the heap.

Compared to the Native Transport Requests queue length limit, this improves how Cassandra handles traffic when queue capacity is reached. The “back pressure” mechanism more gracefully handles the overloaded NTR queue, resulting in a significant lift of operations without clients timing out. In summary, clusters with later versions of Cassandra can handle more load before hitting hard limits.

Introduction

At Instaclustr, we are responsible for managing the Cassandra versions that we release to the public. This involves performing a review of Cassandra release changes, followed by performance testing. In cases where major changes have been made in the behaviour of Cassandra, further research is required. So without further delay let’s introduce the change to be investigated.

Change:
  • Prevent client requests from blocking on executor task queue (CASSANDRA-15013)
Versions affected:
  • Cassandra 3.11.5 and later

Background

Native Transport Requests

Native transport requests (NTR) are any requests made via the CQL Native Protocol. CQL Native Protocol is the way the Cassandra driver communicates with the server. This includes all reads, writes, schema changes, etc. There are a limited number of threads available to process incoming requests. When all threads are in use, some requests wait in a queue (pending). If the queue fills up, some requests are silently rejected (blocked). The server never replies, so this eventually causes a client-side timeout. The main way to prevent blocked native transport requests is to throttle load, so the requests are performed over a longer period.

Prior to 3.11.5

Prior to 3.11.5, Cassandra used the following configuration settings to set the size and throughput of the queue:

  • native_transport_max_threads is used to set the maximum threads for handling requests.  Each thread pulls requests from the NTR queue.
  • cassandra.max_queued_native_transport_requests is used to set queue size. Once the queue is full the Netty threads are blocked waiting for the queue to have free space (default 128).
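
For reference, these two settings live in different places: the thread count is a cassandra.yaml option, while the queue size is a JVM system property (hence the cassandra. prefix). A sketch, using only the values quoted above:

# cassandra.yaml
native_transport_max_threads: 128

# JVM system property, e.g. added to jvm.options (or JVM_OPTS in cassandra-env.sh)
-Dcassandra.max_queued_native_transport_requests=128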

Once the NTR queue is full, requests from all clients are not accepted. There is no strict ordering by which blocked Netty threads will process requests. Therefore, in 3.11.4, latency becomes effectively random once all Netty threads are blocked.

Native Transport Requests - Cassandra 3.11.4

Change After 3.11.5

In 3.11.5 and above, Cassandra no longer blocks when the NTR queue fills as previously described; it throttles instead. The NTR queue is throttled based on the heap size: native transport requests are limited by the total memory they occupy rather than by their count, and incoming requests are paused once the queue is full.

  • native_transport_max_concurrent_requests_in_bytes: a global limit on the total size of in-flight NTR requests, measured in bytes (default: heapSize / 10)
  • native_transport_max_concurrent_requests_in_bytes_per_ip: a per-client-IP limit on the total size of in-flight NTR requests, measured in bytes (default: heapSize / 40)
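
Both limits are ordinary cassandra.yaml settings. As a sketch, here are the defaults written out for an 8 GB heap (the heap size of the cluster used later in this post), i.e. heap / 10 and heap / 40:

# cassandra.yaml (3.11.5 and later)
native_transport_max_concurrent_requests_in_bytes: 858993459          # ~0.8 GB, heap / 10
native_transport_max_concurrent_requests_in_bytes_per_ip: 214748364   # ~0.2 GB, heap / 40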

Maxed Queue Behaviour

From previously conducted performance testing of 3.11.4 and 3.11.6, we noticed similar behaviour when traffic pressure had not yet reached the point of saturating the NTR queue. In this section, we discuss the expected behaviour when saturation does occur and the breaking point is reached.

In 3.11.4, when the queue is maxed out, client requests are refused. For example, trying to make a connection via cqlsh yields an error (see Figure 2).

Cassandra 3.11.4 - queue maxed out, client requests refused
Figure 2: Timed out request

Or, on a client that tries to run a query, you may see a NoHostAvailableException.

Where a 3.11.4 cluster previously got blocked NTRs, after upgrading to 3.11.6 NTRs are no longer blocked. The reason is that 3.11.6 doesn’t place a limit on the number of NTRs but rather on the total memory they occupy. Thus, when the new size limit is reached, NTRs are paused. The default settings in 3.11.6 result in a much larger effective NTR queue in comparison to the small 128-request limit in 3.11.4 (in normal situations where the payload size is not extremely large).

Benchmarking Setup

This testing procedure requires the NTR queue on a cluster to be at max capacity, with enough load to start blocking requests at a constant rate. To achieve this we used 12 test boxes to create multiple client connections to the test cluster. Once the cluster NTR queue was in constant contention, we monitored the performance using:

  • Client metrics: requests per second, latency from client perspective
  • NTR Queue metrics: Active Tasks, Pending Tasks, Currently Blocked Tasks, and Paused Connections.

For testing purposes we used two testing clusters with details provided in the table below:

Cassandra   Cluster size   Instance Type   Cores   RAM    Disk
3.11.4      3              m5xl-1600-v2    4       16GB   1600 GB
3.11.6      3              m5xl-1600-v2    4       16GB   1600 GB
Table 1: Cluster Details

To simplify the setup we disabled encryption and authentication. Multiple test instances were set up in the same region as the clusters. For testing purposes we used 12 KB blob payloads. To give each cluster node a balanced mixed load, we kept the number of test boxes generating write load equal to the number of instances generating read load. We ran the load against the cluster for 10 mins to temporarily saturate the queue with read and write requests and cause contention for the Netty threads.

Our test script used cassandra-stress for generating the load; you can also refer to Deep Diving cassandra-stress – Part 3 (Using YAML Profiles) for more information.

In the stressSpec.yaml, we used the following table definition and queries:

table_definition: |
 CREATE TABLE typestest (
       name text,
       choice boolean,
       date timestamp,
       address inet,
       dbl double,
       lval bigint,
       ival int,
       uid timeuuid,
       value blob,
       PRIMARY KEY((name,choice), date, address, dbl, lval, ival, uid)
 ) WITH compaction = { 'class':'LeveledCompactionStrategy' }
   AND comment='A table of many types to test wide rows'
 
columnspec:
 - name: name
   size: fixed(48)
   population: uniform(1..1000000000) # the range of unique values to select for the field 
 - name: date
   cluster: uniform(20..1000)
 - name: lval
   population: gaussian(1..1000)
   cluster: uniform(1..4)
 - name: value
   size: fixed(12000)
 
insert:
 partitions: fixed(1)       # number of unique partitions to update in a single operation
                                 # if batchcount > 1, multiple batches will be used but all partitions will
                                 # occur in all batches (unless they finish early); only the row counts will vary
 batchtype: UNLOGGED               # type of batch to use
 select: uniform(1..10)/10       # uniform chance any single generated CQL row will be visited in a partition;
                                 # generated for each partition independently, each time we visit it
 
#
# List of queries to run against the schema
#
queries:
  simple1:
     cql: select * from typestest where name = ? and choice = ? LIMIT 1
     fields: samerow             # samerow or multirow (select arguments from the same row, or randomly from all rows in the partition)
  range1:
     cql: select name, choice, uid  from typestest where name = ? and choice = ? and date >= ? LIMIT 10
     fields: multirow            # samerow or multirow (select arguments from the same row, or randomly from all rows in the partition)
  simple2:
     cql: select name, choice, uid from typestest where name = ? and choice = ? LIMIT 1
     fields: samerow             # samerow or multirow (select arguments from the same row, or randomly from all rows in the partition)

Write loads were generated with:

cassandra-stress user no-warmup 'ops(insert=10)' profile=stressSpec.yaml cl=QUORUM duration=10m -mode native cql3 maxPending=32768 connectionsPerHost=40 -rate threads=2048 -node file=node_list.txt

Read loads were generated by changing the ops argument to:

'ops(simple1=10,range1=1)'
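
Written out in full, and assuming every other flag stays the same as in the write command above, the read load command is:

cassandra-stress user no-warmup 'ops(simple1=10,range1=1)' profile=stressSpec.yaml cl=QUORUM duration=10m -mode native cql3 maxPending=32768 connectionsPerHost=40 -rate threads=2048 -node file=node_list.txt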

Comparison

3.11.4 Queue Saturation Test

The active NTR queue reached max capacity (at 128) and remained in contention under load. Pending NTR tasks remained above 128 throughout. At this point, timeouts were occurring when running 12 load instances to stress the cluster. Each node had 2 load instances performing reads and another 2 performing writes. 4 of the read load instances constantly logged NoHostAvailableExceptions as shown in the example below.

ERROR 04:26:42,542 [Control connection] Cannot connect to any host, scheduling retry in 1000 milliseconds
com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: ec2-18-211-4-255.compute-1.amazonaws.com/18.211.4.255:9042 (com.datastax.driver.core.exceptions.OperationTimedOutException: [ec2-18-211-4-255.compute-1.amazonaws.com/18.211.4.255] Timed out waiting for server response), ec2-35-170-231-79.compute-1.amazonaws.com/35.170.231.79:9042 (com.datastax.driver.core.exceptions.OperationTimedOutException: [ec2-35-170-231-79.compute-1.amazonaws.com/35.170.231.79] Timed out waiting for server response), ec2-35-168-69-19.compute-1.amazonaws.com/35.168.69.19:9042 (com.datastax.driver.core.exceptions.OperationTimedOutException: [ec2-35-168-69-19.compute-1.amazonaws.com/35.168.69.19] Timed out waiting for server response))

The client results we got from this stress run are shown in Table 2.

Box Op rate (op/s) Latency mean (ms) Latency median (ms) Latency 95th percentile (ms) latency 99th percentile (ms) Latency 99.9th percentile (ms) Latency max (ms)
1 700.00 2,862.20 2,078.30 7,977.60 11,291.10 19,495.10 34,426.80
2 651.00 3,054.50 2,319.50 8,048.90 11,525.90 19,528.70 32,950.50
3 620.00 3,200.90 2,426.40 8,409.60 12,599.70 20,367.50 34,158.40
4 607.00 3,312.80 2,621.40 8,304.70 11,769.20 19,730.00 31,977.40
5 568.00 3,529.80 3,011.50 8,216.60 11,618.20 19,260.20 32,698.80
6 553.00 3,627.10 3,028.30 8,631.90 12,918.50 20,115.90 34,292.60
Writes 3,699.00 3,264.55 2,580.90 8,264.88 11,953.77 19,749.57 34,426.80
7 469.00 4,296.50 3,839.90 9,101.60 14,831.10 21,290.30 35,634.80
8 484.00 4,221.50 3,808.40 8,925.50 11,760.80 20,468.20 34,863.10
9 Crashed due to time out
10 Crashed due to time out
11 Crashed due to time out
12 Crashed due to time out
Reads 953.00 4,259.00 3,824.15 9,092.80 14,800.40 21,289.48 35,634.80
Summary 4,652.00 3,761.78 3,202.53 8,678.84 13,377.08 20,519.52 35,634.80
Table 2: 3.11.4 Mixed Load Saturating The NTR Queue

* To calculate the total write operations, we summed the values from the 6 write instances. For max write latency we used the max value from all instances, and for the other latency values we calculated the average of the results. Write results are summarised in the Table 2 “Writes” row. We did the same for the read results, which are recorded in the “Reads” row. The last row of the table summarises the “Writes” and “Reads” rows.

The 6 write load instances finished normally, but the read instances struggled. Only 2 of the read load instances were able to send traffic through normally; the other clients received too many timeout errors, causing them to crash. Another observation we made is that the Cassandra timeout metrics, under client-request-metrics, did not capture any of the client timeouts we observed.

Same Load on 3.11.6

Next, we proceeded to test 3.11.6 with the same load. Using the default NTR settings, all test instances were able to finish the stress test successfully.

Box Op rate (op/s) Latency mean (ms) Latency median (ms) Latency 95th percentile (ms) latency 99th percentile (ms) Latency 99.9th percentile (ms) Latency max (ms)
1 677.00 2,992.60 2,715.80 7,868.50 9,303.00 9,957.30 10,510.90
2 658.00 3,080.20 2,770.30 7,918.80 9,319.70 10,116.70 10,510.90
3 653.00 3,102.80 2,785.00 7,939.80 9,353.30 10,116.70 10,510.90
4 608.00 3,340.90 3,028.30 8,057.30 9,386.90 10,192.20 10,502.50
5 639.00 3,178.30 2,868.90 7,994.30 9,370.10 10,116.70 10,510.90
6 650.00 3,120.50 2,799.70 7,952.40 9,353.30 10,116.70 10,510.90
Writes 3,885.00 3,135.88 2,828.00 7,955.18 9,347.72 10,102.72 10,510.90
7 755.00 2,677.70 2,468.30 7,923.00 9,378.50 9,982.40 10,762.60
8 640.00 3,160.70 2,812.30 8,132.80 9,529.50 10,418.70 11,031.00
9 592.00 3,427.60 3,101.70 8,262.80 9,579.80 10,452.20 11,005.90
10 583.00 3,483.00 3,160.40 8,279.60 9,579.80 10,435.40 11,022.60
11 582.00 3,503.60 3,181.40 8,287.90 9,588.20 10,469.00 11,047.80
12 582.00 3,506.70 3,181.40 8,279.60 9,588.20 10,460.60 11,014.20
Reads 3,734.00 3,293.22 2,984.25 8,194.28 9,540.67 10,369.72 11,047.80
Summary 7,619.00 3,214.55 2,906.13 8,074.73 9,444.19 10,236.22 11,047.80
Table 3: 3.11.6 Mixed Load

Default Native Transport Requests (NTR) Setting Comparison

Taking the summary row from both versions (Table 2 and Table 3), we produced Table 4.

Op rate (op/s) Latency mean (ms) Latency median (ms) Latency 95th percentile (ms) latency 99th percentile (ms) Latency 99.9th percentile (ms) Latency max (ms)
3.11.4 4652 3761.775 3202.525 8678.839167 13377.08183 20519.52228 35634.8
3.11.6 7619 3214.55 2906.125 8074.733333 9444.191667 10236.21667 11047.8
Table 4: Mixed Load 3.11.4 vs 3.11.6


Figure 2: Latency 3.11.4 vs 3.11.6

Figure 2 shows the latencies from Table 4. From the results, 3.11.6 had slightly better average latency than 3.11.4. Furthermore, in the worst case where contention is high, 3.11.6 handled the latency of a request better than 3.11.4. This is shown by the difference in Latency Max. Not only did 3.11.6 have lower latency but it was able to process many more requests due to not having a blocked queue.

3.11.6 Queue Saturation Test

The default native_transport_max_concurrent_requests_in_bytes is set to 1/10 of the heap size. The Cassandra max heap size of our cluster is 8 GB, so the default queue size is 0.8 GB. This turns out to be too large for this cluster size, as this configuration will run into CPU and other bottlenecks before we hit NTR saturation.

So we took the reverse approach to investigate full queue behaviour, which is setting the queue size to a lower number. In cassandra.yaml, we added:

native_transport_max_concurrent_requests_in_bytes: 1000000

This means we set the global queue size to be throttled at 1MB. Once Cassandra was restarted and all nodes were online with the new settings, we ran the same mixed load on this cluster, the results we got are shown in Table 5.
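
To put that limit in perspective: ignoring per-request overhead, a 1MB global limit with the 12 KB blob payloads used in this test allows only on the order of 1,000,000 / 12,000 ≈ 80 requests in flight across all clients, compared with tens of thousands at the 0.8 GB default, so throttling kicks in very quickly under the same load.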

3.11.6 Op rate (op/s) Latency mean (ms) Latency median (ms) Latency 95th percentile (ms) latency 99th percentile (ms) Latency 99.9th percentile (ms) Latency max (ms)
Write: Default setting 3,885.00 3,135.88 2,828.00 7,955.18 9,347.72 10,102.72 10,510.90
Write: 1MB setting 2,105.00 5,749.13 3,471.82 16,924.02 26,172.45 29,681.68 31,105.00
Read: Default setting 3,734.00 3,293.22 2,984.25 8,194.28 9,540.67 10,369.72 11,047.80
Read: 1MB setting 5,395.00 2,263.13 1,864.55 5,176.47 8,074.73 9,693.03 15,183.40
Summary: Default setting 7,619.00 3,214.55 2,906.13 8,074.73 9,444.19 10,236.22 11,047.80
Summary: 1MB setting 7,500.00 4,006.13 2,668.18 11,050.24 17,123.59 19,687.36 31,105.00

Table 5: 3.11.6 native_transport_max_concurrent_requests_in_bytes default and 1MB setting 

During the test, we observed a lot of paused connections and discarded requests—see Figure 3. For a full list of Instaclustr exposed metrics see our support documentation.

NTR Test - Paused Connections and Discarded Requests
Figure 3: 3.11.6 Paused Connections and Discarded Requests

After setting native_transport_max_concurrent_requests_in_bytes to a lower value, we started to see paused connections and discarded requests; write latency increased, resulting in fewer processed write operations, as shown in Table 5. The increased write latency is illustrated in Figure 4.

Cassandra 3.11.6 Write Latency Under Different Settings
Figure 4: 3.11.6 Write Latency Under Different Settings

On the other hand, read latency decreased, see Figure 5, resulting in a higher number of operations being processed.

Cassandra 3.11.6 Read Latency Under Different Settings
Figure 5: 3.11.6 Read Latency Under Different Settings
Cassandra 3.11.6 Operations Rate Under Different Settings
Figure 6: 3.11.6 Operations Rate Under Different Settings

As illustrated in Figure 6, the total number of operations decreased slightly with the 1MB setting, but the difference is very small and the effects on reads and writes almost “cancel each other out”. However, when we look at each type of operation individually, we can see that rather than each getting an equal share of the channel under the default setting of an “almost unlimited queue”, the lower queue size penalizes writes and favors reads. While our testing identified this outcome, further investigation will be required to determine exactly why this is the case.

Conclusion

In conclusion, the new NTR change offers an improvement over the previous NTR queue behaviour. Through our performance testing we found the change to be good for clusters that have a constant load causing the NTR queue to block. Under the new mechanism the queue no longer blocks, but throttles the load based on the amount of memory allocated to requests.

The results from testing indicated that the changed queue behaviour reduced latency and provided a significant lift in the number of operations without clients timing out. Clusters with our latest version of Cassandra can handle more load before hitting hard limits. For more information feel free to comment below or reach out to our Support team to learn more about changes to 3.11.6 or any of our other supported Cassandra versions.

The post Understanding the Impacts of the Native Transport Requests Change Introduced in Cassandra 3.11.5 appeared first on Instaclustr.

Apache Cassandra Usage Report 2020

Apache Cassandra is the open source NoSQL database for mission critical data. Today the community announced findings from a comprehensive global survey of 901 practitioners on Cassandra usage. It’s the first of what will become an annual survey, providing a baseline understanding of who uses Cassandra, and how and why organizations use it.

“I saw zero downtime at global scale with Apache Cassandra. That’s a powerful statement to make. For our business that’s quite crucial.” - Practitioner, London

Key Themes

Cassandra adoption is correlated with organizations in a more advanced stage of digital transformation.

People from organizations that self-identified as being in a “highly advanced” stage of digital transformation were more likely to be using Cassandra (26%) compared with those in an “advanced” stage (10%) or “in process” (5%).

Optionality, security, and scalability are among the key reasons Cassandra is selected by practitioners.

The top reasons practitioners use Cassandra for mission critical apps are “good hybrid solutions” (62%), “very secure” (60%), “highly scalable” (57%), “fast” (57%), and “easy to build apps with” (55%).

A lack of skilled staff and the challenge of migration deters adoption of Cassandra.

Thirty-six percent of practitioners currently using Cassandra for mission critical apps say that a lack of Cassandra-skilled team members may deter adoption. When asked what it would take for practitioners to use Cassandra for more applications and features in production, they said “easier to migrate” and “easier to integrate.”

Methodology

Sample. The survey consisted of 1,404 interviews of IT professionals and executives conducted from April 13-23, 2020, including 901 practitioners, who are the focus of this usage report. Respondents came from 13 geographies (China, India, Japan, South Korea, Germany, United Kingdom, France, the Netherlands, Ireland, Brazil, Mexico, Argentina, and the U.S.) and the survey was offered in seven languages corresponding to those geographies. While margin of sampling error cannot technically be calculated for online panel populations where the relationship between sample and universe is unknown, the margin of sampling error for equivalent representative samples would be +/- 2.6% for the total sample, +/- 3.3% for the practitioner sample, and +/- 4.4% for the executive sample.

To ensure the highest quality respondents, surveys include enhanced screening beyond title and activities of company size (no companies under 100 employees), cloud IT knowledge, and years of IT experience.

Rounding and multi-response. Figures may not add to 100 due to rounding or multi-response questions.

Demographics

Practitioner respondents represent a variety of roles as follows: Dev/DevOps (52%), Ops/Architect (29%), Data Scientists and Engineers (11%), and Database Administrators (8%) in the Americas (43%), Europe (32%), and Asia Pacific (12%).

Cassandra roles

Respondents include both enterprise (65% from companies with 1k+ employees) and SMEs (35% from companies with at least 100 employees). Industries include IT (45%), financial services (11%), manufacturing (8%), health care (4%), retail (3%), government (5%), education (4%), telco (3%), and 17% were listed as “other.”

Cassandra companies

Cassandra Adoption

Twenty-two percent of practitioners are currently using or evaluating Cassandra with an additional 11% planning to use it in the next 12 months.

Of those currently using Cassandra, 89% are using open source Cassandra, including both self-managed (72%) and third-party managed (48%).

Practitioners using Cassandra today are more likely to use it for more projects tomorrow. Overall, 15% of practitioners say they are extremely likely (10 on a 10-pt scale) to use it for their next project. Of those, 71% are currently using or have used it before.

Cassandra adoption

Cassandra Usage

People from organizations that self-identified as being in a “highly advanced” stage of digital transformation were more likely to be using Cassandra (26%) compared with those in an “advanced” stage (10%) or “in process” (5%).

Cassandra predominates in very important or mission critical apps. Among practitioners, 31% use Cassandra for their mission critical applications, 55% for their very important applications, 38% for their somewhat important applications, and 20% for their least important applications.

“We’re scheduling 100s of millions of messages to be sent. Per day. If it’s two weeks, we’re talking about a couple billion. So for this, we use Cassandra.” - Practitioner, Amsterdam

Cassandra usage

Why Cassandra?

The top reasons practitioners use Cassandra for mission critical apps are “good hybrid solutions” (62%), “very secure” (60%), “highly scalable” (57%), “fast” (57%), and “easy to build apps with” (55%).

“High traffic, high data environments where really you’re just looking for very simplistic key value persistence of your data. It’s going to be a great fit for you, I can promise that.” - Global SVP Engineering

Top reasons practitioners use Cassandra

For companies in a highly advanced stage of digital transformation, 58% cite “won’t lose data” as the top reason, followed by “gives me confidence” (56%), “cloud native” (56%), and “very secure” (56%).

“It can’t lose anything, it has to be able to capture everything. It can’t have any security defects. It needs to be somewhat compatible with the environment. If we adopt a new database, it can’t be a duplicate of the data we already have.… So: Cassandra.” - Practitioner, San Francisco

However, 36% of practitioners currently using Cassandra for mission critical apps say that a lack of Cassandra-skilled team members may deter adoption.

“We don’t have time to train a ton of developers, so that time to deploy, time to onboard, that’s really key. All the other stuff, scalability, that all sounds fine.” – Practitioner, London

When asked what it would take for practitioners to use Cassandra for more applications and features in production, they said “easier to migrate” and “easier to integrate.”

“If I can get started and be productive in 30 minutes, it’s a no brainer.” - Practitioner, London

Conclusion

We invite anyone who is curious about Cassandra to test the 4.0 beta release. There will be no new features or breaking API changes in future Beta or GA builds, so you can expect the time you put into the beta to translate into transitioning your production workloads to 4.0.

We also invite you to participate in a short survey about Kubernetes and Cassandra that is open through September 24, 2020. Details will be shared with the Cassandra Kubernetes SIG after it closes.

Survey Credits

A volunteer from the community helped analyze the report, which was conducted by ClearPath Strategies, a strategic consulting and research firm, and donated to the community by DataStax. It is available for use under Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0).