DataStax Developer Day London: Developers Are Moving to Multi-Cloud

So far, we have run five out of six DataStax Developer Day events this month. After four dates in the United States, we are now on the first leg of the European tour.

DataStax Developer Days are aimed at a mix of people with Cassandra skills—from those who are new to this database to those who want to expand what they do with search, analytics, and graph.

However, our Developer Day in London (on November 6) surprised me by how many people wanted to discuss multi-cloud.

Multi-Cloud is Real…Now What Was the Question?

Multi-cloud is the bright new shiny object for developers today. It involves running across more than one public cloud service. I thought this would be something that people might be interested in as a theoretical use case, but Developer Day London had multiple attendees doing this already.

One of the attendees—someone from a media company with millions of customers across the UK—stood up and explained why: to avoid lock-in to a particular vendor or location.

From a data perspective, this attendee discussed how he wants to get the best platform for each particular job, provide teams with flexibility in how they work, and avoid issues if and when something fails.

After this, other attendees stood up and said they were going through the same thinking. What surprised me was how quickly this is taking place in the developer community here. What is causing it? The quality of collaboration between developers, operations teams, and cloud teams, and London seems to be ahead of the rest of Europe and the US on this topic.

Looking Ahead With Cassandra

So far, we have seen a real mix of people attending DataStax Developer Days. Each event has thrown up some great discussions, and there have been some surprising questions that we learned from.

What is making the most difference to us is how we are approaching feedback. Each session has feedback-gathering tools available, and we use that input to understand what we are doing well and what we could do better. It’s all been constructive and is helping us improve what we offer to the developer community.

With one event to go in Paris, we want even more feedback on these specific events, and how we can broaden them out to more community members. If there is anything that you want to see from the DataStax Academy team, @datastaxacademy on Twitter is the way to go. We want to hear from you!

We hope to see you at one of our next DataStax Developer Days. Check out our calendar of events here.

Also, be sure to check out DataStax at DataStax Accelerate, The World’s Premier Apache Cassandra Conference, May 21-23, 2019 in the Washington D.C. Area.

To join us at DataStax Accelerate, register here.

Reaper 1.3 Released

Cassandra Reaper 1.3 was released a few weeks ago, and it’s time to cover its highlights.

Configurable pending compactions threshold

Reaper protects clusters from being overloaded by repairs by not submitting new segments that involve a replica with more than 20 pending compactions. A high number of pending compactions usually means that nodes aren’t keeping up with the streaming triggered by repair, and running more repairs could end up seriously harming the cluster.
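To make the gating rule concrete, here is a minimal sketch of the idea in Java (illustrative only, not Reaper’s actual code; the replica list and pending-compaction counts are assumed inputs gathered elsewhere, for example over JMX):

import java.util.List;
import java.util.Map;

// Sketch: skip a repair segment if any involved replica has too many pending compactions.
class SegmentGate {
    static boolean canSubmit(List<String> replicas,
                             Map<String, Integer> pendingCompactions,
                             int threshold) {
        // Submit only when every replica is at or below the pending-compactions threshold.
        return replicas.stream()
                .allMatch(replica -> pendingCompactions.getOrDefault(replica, 0) <= threshold);
    }
}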

While the default value is a sensible one, there may be cases where you want to tune it to match your specific needs. Reaper 1.3.0 allows this through a new setting that you can add to your YAML configuration file:

maxPendingCompactions: 50

Use this setting with care and for specific cases only.

More nodes metrics in the UI

Following up on the effort to expand Reaper’s capabilities beyond repair, the following information is now available in the UI (and through the REST API) when clicking on a node in the cluster view.

Progress of incoming and outgoing streams is now displayed:
Streaming progress

Compactions can be tracked:
Compaction progress

And both thread pool stats and client latency metrics are displayed in tables:
Thread pool stats and client latency metrics

What’s next for Reaper?

As shown in a previous blog post, we’re working on an integration with Spotify’s cstar in order to run and schedule topology aware commands on clusters managed by Reaper.

We are also looking to add a sidecar mode, which would collocate an instance of Reaper with each Cassandra node, making it possible to use Reaper on clusters where JMX access is restricted to localhost.

The upgrade to 1.3 is recommended for all Reaper users. The binaries are available from yum, apt-get, Maven Central, and Docker Hub, and are also downloadable as tarball packages. Remember to back up your database before starting the upgrade.

All instructions to download, install, configure, and use Reaper 1.3 are available on the Reaper website.

Overheard at Scylla Summit 2018

Dor Laor addresses Scylla Summit 2018

Scylla Summit 2018 was quite an event! Your intrepid reporter tried to keep up with the goings-on, live-tweeting the event from opening to close. If you missed my Tweetstream, you can pick it up here:

It’s impossible to pack two days and dozens of speakers into a few thousand words, so I’m going to give just the highlights and embed the SlideShare links for a select few talks. However, feel free to check out the ScyllaDB SlideShare page for all the presentations. And yes, in due time, we’ll post all the videos of the sessions for you to view!

Day One: Keynotes

Tuesday kicked off with ScyllaDB CEO Dor Laor giving a history of Scylla from its origins (both in mythology and as a database project) on through to its present-day capabilities in the newly-announced Scylla Open Source 3.0. He also announced the availability of early access to Scylla Cloud. Go to ScyllaDB.com/cloud to sign up! It’s on a first-come, first-served basis.


Dor was followed on stage by Avi Kivity, CTO of ScyllaDB. Avi gave an overview of Scylla’s newest capabilities, from “mc” format SSTables (to bring Scylla into parity with Cassandra 3.x), to in-memory and tiered storage options.

If you know anything about Avi, you know how excited he gets when talking about low-level systemic improvements and capabilities. So he spent much of his talk on schedulers that manage write rates to balance synchronous and asynchronous updates, on how to isolate workloads, and on system resource accounting, security, and other requirements for true multi-tenancy. He also touched on full-table scans, which are a prerequisite for any sort of analytics, as well as CQL “ALLOW FILTERING” and driver improvements. Each of these features had its own in-depth talk at Scylla Summit; Avi just went through the highlights.

Customer Keynotes: Comcast & GE Digital

Tuesday also featured customer keynotes. We were honored to have some of the biggest names in technology showcase how they’re using Scylla.

 

First came Comcast’s Vijay Velusamy, who described how the Xfinity X1 platform now uses Scylla to support features like “last watched,” “resume watching,” and parental controls. Those preferences, channels, shows, and timestamps are managed for thirteen million accounts on Scylla. Compared to their old system, they now connect Scylla directly to their REST API data services. This allows Comcast to simplify their infrastructure by getting rid of their cache and pre-cache servers, and has improved performance by 10x.

GE Digital’s Venkatesh Sivasubramanian & Arvind Singh came to the stage later in the morning to talk about embedding Scylla in their Predix platform, the world’s largest Industrial Internet-of-Things (IIoT) platform. From power to aviation to 3D printing industrial components, there are entirely different classes of data that GE Digital manages.

 

OLTP or OLAP — Why not both?

Glauber Costa, VP of Field Engineering at ScyllaDB, also gave a keynote discussing Scylla’s new support for analytics and transaction processing in the same database. The challenge has always been that OLTP relies on a rapid stream of small transactions with a mix of reads and writes, where latency is paramount, whereas OLAP is oriented towards broad data reads where throughput is paramount. Mixing those two types of loads has traditionally caused one operation or the other to suffer. Thus, in the past, organizations have simply maintained two different clusters: one for transactions and one for analytics. However, this is extremely inefficient and costly.


How can you engineer a database so that these different loads can work well together? On the same cluster? Avoiding doubling your servers and the necessity to keep two different clusters in synch? And make it work so well your database in fact becomes “boring?”

Scylla Summit 2018: OLAP or OLTP? Why Not Both? from ScyllaDB

Scylla’s new per-user SLA capability builds on the existing I/O scheduler along with a new CPU scheduler to adjust priority of low-level activities in each shard. Different tasks and users can then be granted, with precision, shares of system resources based on adjustable settings.

While you can still overload a system if you have insufficient overall resources, this new feature allows you to now mix traffic loads using a single, scalable cluster to handle all your data needs.

Breakout Sessions

Tuesday afternoon and Wednesday morning were jam-packed with session-after-session by both users presenting their use cases and war stories, as well as ScyllaDB’s own engineering staff taking deep dives into Scylla’s latest features and capabilities.

Amongst the highest-rated sessions at the conference:

Again, there’s just too much to get into a deep dive of each of these sessions. I highly encourage you to look at the dozens of presentations now up on SlideShare.

The Big Finish

Wednesday afternoon brought everyone back together for the closing general sessions, which were kicked off by Ľuboš Koščo & Michal Šenkýř of AdTech leader Sizmek. Their session on Adventures in AdTech: Processing 50 Billion User Profiles in Real Time with Scylla was impressive in many ways. They are managing Scylla across seven datacenters, serving up billions of ad bids per day in the blink of an eye in 70 countries around the world.

Real Time Bidding (RTB) process at Sizmek

Sizmek were followed by ScyllaDB’s Piotr Sarna, who delivered an in-depth presentation on three weighty and interrelated topics: Materialized Views, Secondary Indexes, and Filtering. Yes indeed, they are finally here!

Piotr Sarna at Scylla Summit 2018

Well, in fact, Materialized Views were experimental in 2.0, and Secondary Indexes were experimental in 2.1. We learned a lot and evolved their implementation working with customers using these features. With Scylla Open Source 3.0, they are both production-ready, along with the new Filtering feature.

There is a lot to digest in his talk, from handling reads-before-writes in automatic view updates for materialized views, and applying backpressure to prevent overloading clusters, to hinted handoffs for asynchronous consistency. He explained why we decided on global secondary indexes and how we implemented paging, and gave guidance on when to use a secondary index versus a materialized view. On top of all of that, he showed how to apply filtering to narrow down to just the data you want. When we publish it, this will definitely be one of those videos you want to digest in full.

Grab’s Aravind Srinivasan at Scylla Summit 2018

Grab’s Aravind Srinivasan talked about Grab and Scylla: Driving Southeast Asia Forward. Grab, beyond being the largest ride-sharing company in their geography, has expanded to provide a broader ecosystem for online shopping, payments, and delivery. Their fraud detection system is not just important to their business internally; it is vital for the trust of their community and the financial viability of their vendors.

Technologically, Grab’s use case highlighted how vital the combination of Apache Kafka and Scylla has been for its internal developers. Such a powerful tech combo was a common theme across many of the talks over the two days. Indeed, we were privileged to have Confluent’s own Hojjat Jafarpour speak about KSQL at the event.

The Scylla Feature Talks were the final tech presentations of the event: a series of four highly-rated short talks by ScyllaDB engineers spanning the gamut from SSTable 3.0 and Scylla-specific drivers to Scylla Monitor 2.0 and streaming and repairs.

We ended with an Ask Me Anything, featuring Dor, Avi and Glauber on stage. It was no surprise to us that many of the questions that peppered our team came from Kiwi’s Martin Strycek and Numberly’s Alexys Jacob! If there were any questions we didn’t get to answer for you, feel free to drop in on our Slack channel, mailing list, or Github and start a conversation.

We want to thank everyone who came to our Summit, from the open source community to the enterprise users, from the Bay Area to all parts of the globe. We also have to give special thanks to our many speakers, who brought us their incredible stories, remarkable achievements, and impressive solutions.


The post Overheard at Scylla Summit 2018 appeared first on ScyllaDB.

DataStax and the Cassandra Community

DataStax’s contributions to the Cassandra community cover a wide range of needs, all oriented towards your success in building Cassandra-based applications. I think of these as the “Three C’s of Community”: Code, Coaching, and Conferences.

Code

Cassandra

Obviously, DataStax continues to contribute to Apache Cassandra™ itself.  Besides contributing the majority of the “tick-tock” releases culminating in the current 3.11, DataStax has contributed multiple features to the upcoming 4.0 release. These include:

I’m happy to announce that as of today we are also releasing the DataStax Distribution of Apache Cassandra as a new subscription offering to help fit the feature and price point that the community has asked us for.

Drivers

DataStax also maintains the most popular Cassandra drivers for seven languages. Collectively, these represent an estimated 36 engineer-years of effort. The Java driver alone includes over 300,000 lines of code added or modified in over 800 commits since Jan 2017. These drivers are completely open source with a liberal Apache license.
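As a quick illustration of what using one of these drivers looks like, here is a minimal sketch with the Java driver’s 3.x API (the contact point is a placeholder for a local node; the query simply reads the node’s version from the system tables):

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

public class DriverExample {
    public static void main(String[] args) {
        // Connect to a (placeholder) local node and run a simple query.
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect()) {
            ResultSet rs = session.execute("SELECT release_version FROM system.local");
            Row row = rs.one();
            System.out.println("Cassandra version: " + row.getString("release_version"));
        }
    }
}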

DataStax also recently updated its contributor license agreement (CLA) to align more closely with the Apache Software Foundation’s ICLA. Non-DataStax contributions make up close to 100 commits across all drivers for this time frame.

TinkerPop

Besides the tabular, document, and key-value models supported by Cassandra, DataStax Enterprise also provides a Graph engine. DataStax contributes the majority of code to the graph framework Apache TinkerPop™, which provides standard, vendor-neutral ways of working with graph applications. TinkerPop is used by pure-play graph vendors like OrientDB as well as Amazon Neptune, IBM Graph and Microsoft Cosmos DB.
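For a flavor of the vendor-neutral traversal API that TinkerPop provides, here is a small sketch against the in-memory TinkerGraph reference implementation, using the sample “modern” graph that ships with TinkerPop (a generic Gremlin illustration, not DSE Graph-specific code):

import org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.GraphTraversalSource;
import org.apache.tinkerpop.gremlin.tinkergraph.structure.TinkerFactory;

public class TinkerPopExample {
    public static void main(String[] args) {
        // The "modern" toy graph ships with TinkerPop and is handy for experiments.
        GraphTraversalSource g = TinkerFactory.createModern().traversal();

        // Print the names of the people that "marko" knows.
        g.V().has("person", "name", "marko")
                .out("knows")
                .values("name")
                .forEachRemaining(System.out::println);
    }
}

The same traversal would run unchanged against any TinkerPop-enabled graph engine, which is the point of the standard.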

Other Contributions

DataStax’s engineers are also pleased to share some of the other great things that they work on with the open-source community.  These include:

  • EngineBlock, a load generation and workload characterization tool
  • CQL Data Modeler, a user-friendly front-end to cassandra-stress
  • Simulacron, a simulator of the Cassandra Native Protocol

Coaching

From DataStax Academy to DataStax Developer Day to worldwide meetups, we pride ourselves on our commitment to coaching and educating developers on everything Apache Cassandra.

DataStax Academy has long been known for high quality, free training on both Apache Cassandra and DataStax Enterprise (DSE). We’ve recently refreshed our course catalog to reflect updates to the latest versions, including two courses that focus almost exclusively on Cassandra concepts:

We continue to maintain an active DataStax Academy Slack community with hundreds of active community users and active participation from DataStax experts. We also produce and host the weekly Distributed Data Show and other how-to content directly and indirectly related to Cassandra and DSE.

Developer Days

The Developer Relations team at DataStax has been really excited to host this series of hands-on learning events this year in cities including San Francisco, New York, Washington DC, Chicago, Dallas, London, and Paris. We’ve offered dual tracks on both Cassandra and DSE at these events and the response has been extremely positive. Cassandra topics covered include data modeling, operations, drivers, and application development.

Meetups

DataStax’s Developer Relations team has also been active in the Cassandra meetup scene. In addition to working to inject new energy into Cassandra-focused meetups, we’re also getting the word out at other NoSQL / Big Data meetups, and hosting a monthly Distributed Systems meetup in Seattle. You can track where we’re headed next on our events page.

Conferences

In 2019, DataStax is bringing its experience in organizing the Cassandra Summit for six years to the Washington, DC metro area with a new conference: DataStax Accelerate.

Accelerate will feature cutting-edge content from DataStax experts and the Cassandra community for data architects, DBAs, Cassandra developers, IT administrators and architects, and IT executives. Accelerate will also include a full-day Cassandra Application Bootcamp that teaches the fundamentals for building an application with Apache Cassandra that can run everywhere.

Conclusion

DataStax has always been the home of the world’s top experts in Cassandra, and we continue to contribute to our community with the Three C’s: Code, Coaching, and Conferences. Check out our announcement of the new DataStax Distribution of Apache Cassandra and register here to attend or participate in the call for papers for DataStax Accelerate in May!

VMware and DataStax Unlock Big Data’s Potential

VMware and DataStax are proud to announce a joint collaboration to power big data environments on VMware’s vSAN. We took this effort on for our cloud-native customers that wanted the operational simplicity that comes with centrally managed storage, as well as the cost savings of hyper-converged environments.

DataStax Enterprise is underpinned by Apache Cassandra™ and utilizes a masterless architecture where data is replicated between nodes to achieve availability requirements. Most customers opt for fast local storage to fulfill these requirements without sacrificing performance, but this has limited the infrastructure choices for customers that want VMware’s enterprise management options. Partial solutions included:

  • Adopting a traditional vSAN infrastructure. This doubled the capacity requirements for a system already storing data replicas and risked performance SLAs.
  • Disabling replication at the vSAN layer. This eliminated the unneeded replicas but couldn’t guarantee that a virtual machine’s data was stored within the same host. Also, because storage access was still remote, performance SLAs were still at risk.

To solve this issue, VMware worked with DataStax to engineer a unique solution. Referred to as Host Affinity, this policy offers customers additional flexibility to configure vSAN placement and replication specific to the application that has been deployed.

Host Affinity delegates replication to DataStax Enterprise while maintaining data locality with DataStax Enterprise compute. The Host Affinity policy is available in addition to standard vSAN replication policies and intended to offer customers choice of deployment based on their criticality, uptime, and maintenance requirements.

This solution also gives you the benefits of simplified visibility, lifecycle, and consolidation of computing workloads. The result is a reference architecture that demonstrates how performance consolidation, failure mitigation, and linear scaling can be achieved using vSAN and DataStax together.

For more details on the benefits this collaboration brings, see:

DataStax Enterprise™ (DSE) and vSAN Reference Architecture 

Scaling Time Series Data Storage — Part II


by Dhruv Garg, Dhaval Patel, Ketan Duvedi

In January 2016 Netflix expanded worldwide, opening service to 130 additional countries and supporting 20 total languages. Later in 2016 the TV experience evolved to include video previews during the browsing experience. More members, more languages, and more video playbacks stretched the time series data storage architecture from Part 1 close to its breaking point. In Part 2, we will explore the limitations of that architecture and describe how we’re re-architecting for this next phase in our evolution.

Breaking Point

Part 1’s architecture treated all viewing data the same, regardless of type (full title plays vs. video previews) or age (how long ago a title was viewed). The ratio of previews to full views was growing rapidly as that feature rolled out to more devices. By the end of 2016 we were seeing 30% growth in one quarter for that data store; video preview roll-outs were being delayed because of their potential impact on this data store. The naive solution would be to scale the underlying viewing data Cassandra (C*) cluster to accommodate that growth, but it was already the biggest cluster in use and nearing cluster size limits that few C* users have gone past successfully. Something had to be done, and soon.

Rethinking Our Design

We challenged ourselves to rethink our approach and design one that would scale for at least 5x growth. We had patterns that we could reuse from part 1’s architecture, but by themselves those weren’t sufficient. New patterns and techniques were needed.

Analysis

We started by analyzing our data set’s access patterns. Three distinct categories of data emerged:

  • Full title plays
  • Video preview plays
  • Language preference (i.e., which subtitles/dubs were played, indicating the member’s preference when they play titles in a given language)

For each category, we discovered another pattern — the majority of access was to recent data. As the age of the data increased, the level of detail needed decreased. Combining these insights with conversations with our data consumers, we negotiated which data was needed at what detail and for how long.

Storage Inefficiency

For the fastest growing data sets, video previews and language information, our partners needed only recent data. Very short duration views of video previews were being filtered out by our partners, as they weren’t a positive or negative signal of a member’s intent for the content. Additionally, we found most members choose the same subs/dubs languages for the majority of the titles they watch. Storing the same language preference with each viewing record resulted in a lot of data duplication.

Client Complexity

Another limiting factor we looked into was how our viewing data service’s client library satisfied a caller’s particular need for specific data from a specific time duration. Callers could retrieve viewing data by specifying:

  • Video Type — Full title or video preview
  • Time Range — last X days/months/years with X being different for various use cases
  • Level of detail — complete or summary
  • Whether to include subs/dubs information

For the majority of use cases, these filters were applied on the client side after fetching the complete data from the back-end service. As you might imagine, this led to a lot of unnecessary data transfer. Additionally, for larger viewing data sets the performance degraded rapidly, leading to huge variations in the 99th percentile read latencies.

Redesign

Our goal was to design a solution that would scale to 5x growth, with reasonable cost efficiencies and improved as well as more predictable latencies. Informed by the analysis and understanding of the problems discussed above, we undertook this significant redesign. Here are our design guidelines:

Data Category

  • Shard by data type
  • Reduce data fields to just the essential elements

Data Age

  • Shard by age of data. For recent data, expire after a set TTL
  • For historical data, summarize and rotate into an archive cluster

Performance

  • Parallelize reads to provide a unified abstraction across recent and historical data

Cluster Sharding

Previously, we had all the data combined together into one cluster, with a client library that filtered the data based on type/age/level of detail. We inverted that approach and now have clusters sharded by type/age/level of detail. This decouples each data set’s different growth rates from one another, simplifies the client, and improves the read latencies.

Storage Efficiency

For the fastest growing data sets, video previews and language information, we were able to align with our partners on keeping only recent data. We do not store very short duration preview plays since they are not a good signal of a member’s interest in the content. Also, we now store the initial language preference and then store only the deltas for subsequent plays. For the vast majority of members, this means storing only a single record for language preference, resulting in huge storage savings. We also have a lower TTL for preview plays and for language preference data, thereby expiring it more aggressively than data for full title plays.

Where needed, we apply the live and compressed technique from part I, where a configurable number of recent records are stored in uncompressed form and the rest of the records are stored in compressed form in a separate table. For clusters storing older data, we store the data entirely in compressed form, trading off lower storage costs for higher compute costs at the time of access.

Finally, instead of storing all the details for historical full title plays, we store a summarized view with fewer columns in a separate table. This summary view is also compressed to further optimize for storage costs.

Overall, our new architecture looks like this:

Viewing Data Storage Architecture

As shown above, Viewing data storage is sharded by type — there are separate clusters for full title plays, preview title plays and language preferences. Within full title plays, storage is sharded by age. There are separate clusters for recent viewing data (last few days), past viewing data (few days to few years) and historical viewing data. Finally, there is only a summary view rather than detailed records for historical viewing data.

Data Flows

Writes

Data writes go into the most recent clusters. Filters are applied before entry, such as not storing very short video preview plays, or comparing the subs/dubs played to the previous preferences and only storing a record when there is a change from previous behavior.
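As a rough illustration of that write-path filtering, here is a minimal sketch (not Netflix’s actual code; the member IDs and the in-memory map are stand-ins for whatever lookup the real service performs):

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch: only write a language-preference record when it differs from the last stored value.
class LanguagePreferenceFilter {
    private final Map<String, String> lastStored = new ConcurrentHashMap<>();

    // Returns true if a new record should be written for this playback.
    boolean shouldStore(String memberId, String subsDubsPreference) {
        String previous = lastStored.put(memberId, subsDubsPreference);
        return previous == null || !previous.equals(subsDubsPreference);
    }
}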

Reads

Requests for the most recent data go directly to the most recent clusters. When more data is requested, parallel reads enable efficient retrieval.

Last few days of viewing data: For the large majority of use cases, which need only a few days of full title plays, information is read solely from the “Recent” cluster, with parallel reads to its LIVE and COMPRESSED tables. Continuing the pattern of live and compressed data sets detailed in Part 1 of this series, if the number of records read from LIVE exceeds a configurable threshold, the records are rolled up, compressed, and written to the COMPRESSED table as a new version with the same row key.
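The roll-up step can be pictured with a small sketch like the one below (illustrative only, not Netflix’s implementation; the threshold value and the newline-joined row encoding are assumptions made for the example):

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.zip.GZIPOutputStream;

// Sketch: once a member's LIVE rows exceed a threshold, combine and compress them into a
// single blob that the caller would write to the COMPRESSED table as a new version.
class LiveRollup {
    static final int ROLLUP_THRESHOLD = 30; // hypothetical configurable threshold

    // Returns the compressed blob, or null if the LIVE row count is still below the threshold.
    static byte[] maybeRollUp(List<String> liveRows) {
        if (liveRows.size() <= ROLLUP_THRESHOLD) {
            return null;
        }
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(bytes)) {
            gzip.write(String.join("\n", liveRows).getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray(); // written as a new version; the rolled-up LIVE rows are then removed
    }
}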

Additionally, if language preference information is needed, a parallel read to the “Language Preference” cluster is made. Similarly, if preview play information is needed, parallel reads are made to the LIVE and COMPRESSED tables in the “Preview Titles” cluster. As with full title viewing data, if the number of records in the LIVE table exceeds a configurable threshold, the records are rolled up, compressed, and written to the COMPRESSED table as a new version with the same row key.

Last few months of full title plays are enabled via parallel reads to the “Recent” and “Past” clusters.

Summarized viewing data is returned via parallel reads to the “Recent”, “Past” and “Historical” clusters. The data is then stitched together to get the complete summarized view. To reduce storage size and cost, the summarized view in “Historical” cluster does not contain updates from the last few years of member viewing and hence needs to be augmented by summarizing viewing data from the “Recent” and “Past” clusters.

Data Rotation

For full title plays, movement of records between the different age clusters happens asynchronously. On reading viewing data for a member from the “Recent” cluster, if it is determined that there are records older than a configured number of days, a task is queued to move the relevant records for that member from the “Recent” to the “Past” cluster. On task execution, the relevant records are combined with the existing records from the COMPRESSED table in the “Past” cluster. The combined recordset is then compressed and stored in the COMPRESSED table with a new version. Once the new version write is successful, the previous version record is deleted.

If the size of the compressed new version recordset is greater than a configurable threshold, the recordset is chunked and the multiple chunks are written in parallel. These background transfers of records from one cluster to another are batched so that they are not triggered on every read. All of this is similar to the data movement in the Live to Compressed storage approach detailed in Part 1.
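A sketch of that chunking step (again illustrative only; the size threshold and chunk layout are assumptions):

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sketch: split an oversized compressed recordset into fixed-size chunks
// that can be written in parallel and reassembled on read.
class RecordsetChunker {
    static List<byte[]> chunk(byte[] compressed, int maxChunkBytes) {
        List<byte[]> chunks = new ArrayList<>();
        for (int offset = 0; offset < compressed.length; offset += maxChunkBytes) {
            int end = Math.min(offset + maxChunkBytes, compressed.length);
            chunks.add(Arrays.copyOfRange(compressed, offset, end));
        }
        return chunks;
    }
}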

Data Rotation between clusters

Similar movement of records to the “Historical” cluster is accomplished while reading from the “Past” cluster. The relevant records are re-processed with the existing summary records to create new summary records. They are then compressed and written to the COMPRESSED table in the “Historical” cluster with a new version. Once the new version is written successfully, the previous version record is deleted.

Performance Tuning

Like in the previous architecture, LIVE and COMPRESSED records are stored in different tables and are tuned differently to achieve better performance. Since LIVE tables have frequent updates and a small number of viewing records, compactions are run frequently and gc_grace_seconds is kept small to reduce the number of SSTables and the data size. Read repair and full column family repair are run frequently to improve data consistency. Since updates to COMPRESSED tables are rare, manual and infrequent full compactions are sufficient to reduce the number of SSTables. Data is checked for consistency during the rare updates. This obviates the need for read repair as well as full column family repair.

Caching Layer Changes

Since we do a lot of parallel reads of large data chunks from Cassandra, there is a huge benefit to having a caching layer. The EVCache caching layer architecture was also changed to mimic the back-end storage architecture, as illustrated in the following diagram. All of the caches have close to a 99% hit rate and are very effective in minimizing the number of read requests to the Cassandra layer.

Caching Layer Architecture

One difference between the caching and storage architectures is that the “Summary” cache cluster stores the compressed summary of the entire viewing data for full title plays. With an approximately 99% cache hit rate, only a small fraction of total requests goes to the Cassandra layer, where parallel reads to three tables and stitching together of records are needed to create a summary across the entire viewing data.

Migration: Preliminary Results

The team is more than halfway through these changes. Use cases taking advantage of sharding by data type have already been migrated. So while we don’t have complete results to share, here are the preliminary results and lessons learned:

  • Big improvement in the operational characteristics (compactions, GC pressure and latencies) of Cassandra based just on sharding the clusters by data type.
  • Huge headroom for the full title viewing data Cassandra clusters, enabling the team to scale for at least 5x growth.
  • Substantial cost savings due to more aggressive data compression and data TTL.
  • The re-architecture is backward compatible. Existing APIs will continue to work and are projected to have better and more predictable latencies. New APIs created to access a subset of data would give significant additional latency benefits but need client changes. This makes it easier to roll out server-side changes independently of client changes, as well as to migrate various clients at different times based on their engagement bandwidth.

Conclusion

Viewing data storage architecture has come a long way over the last few years. We evolved to using a pattern of live and compressed data with parallel reads for viewing data storage and have re-used that pattern for other time-series data storage needs within the team. Recently, we sharded our storage clusters to satisfy the unique needs of different use cases and have used the live and compressed data pattern for some of the clusters. We extended the live and compressed data movement pattern to move data between the age-sharded clusters.

Designing these extensible building blocks scales our storage tier in a simple and efficient way. While we redesigned for 5x growth of today’s use cases, we know Netflix’s product experience continues to change and improve. We’re keeping our eyes open for shifts that might require further evolution.

If similar problems excite you, we are always on the lookout for talented engineers to join our team and help us solve the next set of challenges around scaling data storage and data processing.


Scaling Time Series Data Storage — Part II was originally published in Netflix TechBlog on Medium, where people are continuing the conversation by highlighting and responding to this story.

More Efficient Range Scan Paging with Scylla 3.0


In a previous blog post we examined how Scylla’s paging works, explained the problems with it, and introduced the stateful paging in Scylla 2.2 that solves these problems for singular partition queries.

In this second blog post we are going to look at how stateful paging was extended to support range scans as well. We were able to increase the throughput of range scans by 30%, while also significantly reducing the amount of data read from disk by 39% and the number of disk operations by 73%.

A range scan, or a full table scan, is a query which does not use the partition key in the WHERE clause. Such scans are less efficient than a single partition scan, but they are very useful for ad-hoc queries and analytics, where the selection criteria does not match the partition key.

How do Range Scans Work in Scylla 2.3?

Range scans work quite differently compared to singular partition queries. As opposed to singular partition queries, which read a single partition or a list of distinct partitions, range scans read all of the partitions that fall into the range specified by the client. The exact number of partitions that belong to a given range and their identity cannot be determined up-front, so the query has to read all of the data from all of the nodes that contain data for the range.

Tokens (and thus partitions) in Scylla are distributed on two levels. To quickly recap, a token is the hash value of the partition key, and is used as the basis for distributing the partitions in the cluster.

Tokens are distributed among the nodes, each node owning a configurable number of chunks of the token ring. These chunks are called vnodes. Note that when the replication factor (RF) of a keyspace is larger than 1, a single vnode can be found on a number of nodes equal to the replication factor.

On each node, the tokens of a vnode are further distributed among the shards of the node. The token of a partition key in Scylla (which uses the MurmurHash3 hashing algorithm) is a signed 64-bit integer. The sharding algorithm ignores the 12 most significant bits of this integer and maps the rest to a shard id. This results in a distribution that resembles round robin.
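As an illustration of that mapping (a sketch of the idea described above, not Scylla’s source), the token’s high bits are shifted out and the remaining value is scaled onto the shard count:

import java.math.BigInteger;

// Sketch: drop the 12 most significant bits of the 64-bit token, then scale the
// remaining unsigned value onto [0, shardCount), which spreads consecutive
// token ranges across shards in a round-robin-like fashion.
class ShardMapping {
    private static final int IGNORED_MSB_BITS = 12;
    private static final BigInteger TWO_POW_64 = BigInteger.ONE.shiftLeft(64);

    static int shardOf(long token, int shardCount) {
        BigInteger adjusted = BigInteger.valueOf(token)
                .shiftLeft(IGNORED_MSB_BITS)
                .mod(TWO_POW_64); // treat the shifted token as an unsigned 64-bit value
        // The top 64 bits of the 128-bit product adjusted * shardCount give a value in [0, shardCount).
        return adjusted.multiply(BigInteger.valueOf(shardCount))
                .shiftRight(64)
                .intValueExact();
    }
}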

Figure 1: Scylla’s distribution of tokens of a vnode across the shards of a node.

A range scan also works on two levels. The coordinator has to read all vnodes that intersect with the read range, and each contacted replica has to read all shard chunks that intersect with the read vnode. Both of these present an excellent opportunity for parallelism that Scylla exploits. As already mentioned, the amount of data contained in each vnode, and further down in each shard chunk, is unknown. Yet the read operates with a page limit that has to be respected on both levels. It is easy to see that it is impossible to find a fixed concurrency that works well on both sparse and dense tables, and everything in between. A low concurrency would be unbearably slow on a sparse table, as most requests would return very little data or no data at all. A high concurrency would overread on a dense table, and most of the results would have to be discarded.

To overcome this, an adaptive algorithm is used on both the coordinator and the replicas. The algorithm works in an iterative fashion. The first iteration starts with a concurrency of 1, and if the current iteration did not yield enough data to fill the page, the concurrency is doubled on the next iteration. This exponentially increasing concurrency works quite well for both dense and sparse tables. For dense tables, it will fill the page in a single iteration. For sparse tables, it will quickly reach high enough concurrency to fill the page in a reasonable time.
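In pseudocode-like form, a minimal sketch of that ramp-up might look like this (illustrative only; the chunk reader is a stand-in for the real vnode/shard reads and the parallelism is left implicit):

import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

// Sketch: issue `concurrency` chunk reads per iteration (conceptually in parallel);
// if the page is still not full, double the concurrency for the next iteration.
class AdaptivePageFill {
    static List<String> fillPage(int pageLimit, Supplier<List<String>> readNextChunk) {
        List<String> page = new ArrayList<>();
        int concurrency = 1;
        boolean exhausted = false;
        while (!exhausted && page.size() < pageLimit) {
            for (int i = 0; i < concurrency; i++) {
                List<String> rows = readNextChunk.get(); // next vnode / shard chunk
                if (rows.isEmpty()) {                    // simplification: empty result == range exhausted
                    exhausted = true;
                    break;
                }
                page.addAll(rows);                       // may overshoot; the excess would be discarded
            }
            concurrency *= 2;                            // exponential ramp-up helps sparse tables
        }
        return page;
    }
}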

Although this algorithm works reasonably well, it’s not perfect. It works best for dense tables, where the page is filled in a single iteration. For tables that don’t have enough data to fill a page in a single iteration, it suffers from a cold start at the beginning of each page while the concurrency is ramping up. The algorithm may also end up discarding data when the amount of data returned by the concurrent requests is above what was required to fill the page, which is quite common once the concurrency is above 1.

Figure 2: Flow diagram of an example of a page being filled on the coordinator. Note how the second iteration increases concurrency, reading two vnodes in parallel.

Figure 3: Flow diagram of an example of the stateless algorithm reading a vnode on the replica. Note the exponentially increasing concurrency. When the concurrency exceeds the number of shards, some shards (both in this example) will be asked for multiple chunks. When this happens the results need to be sorted and then merged as read ranges of the shards overlap.

Similarly to singular partition queries, the coordinator adjusts the read range (trimming the part that was already read) at the start of each page and saves the position of the page at the end.

To reiterate, all of this is completely stateless. Nothing is stored on the replicas or the coordinator. At the end of each page, all of the objects created and all of the work invested in serving the read are discarded, and on the next page it all has to be done again from scratch. The only state the query has is the paging-state cookie, which stores just enough information for the coordinator to compute the remaining range to be read at the beginning of each page.

Making Range Scans Stateful

To make range scans stateful we used the existing infrastructure, introduced for making singular partition queries stateful. To reiterate, the solution we came up with was to save the reading pipeline (queriers) on the replicas in a special cache, called the querier cache. Queriers are saved at the end of the page and looked up on the beginning of the next page and used to continue the query where it was left off. To ensure that the resources consumed by this cache stay bounded, it implements several eviction strategies. Queriers can be evicted if they stay in the cache for too long or if there is a shortage of resources.

Making range scans stateful proved to be much more challenging than it was for singular partition queries. We had to make significant changes to the reading pipeline on the replica to facilitate making it stateful. The vast majority of these changes revolved around designing a new algorithm for reading all data belonging to a range from all shards, one that can be suspended and later resumed from its saved state. The new algorithm is essentially a multiplexer that combines the output of readers opened on the affected shards into a single stream. The readers are created on demand, the first time a read from the shard is attempted. To ensure that the read won’t stall, the algorithm uses buffering and read-ahead.
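A simplified sketch of the multiplexing idea (illustrative only, not Scylla’s implementation; the real reader combines shard output in token order, which the round-robin loop below only approximates, and buffering and read-ahead are omitted):

import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.IntFunction;

// Sketch: per-shard readers are created lazily, and their output is drained into a
// single page until the page limit is reached or every shard is exhausted.
class ShardMultiplexer {
    static List<String> readPage(int shardCount, int pageLimit, IntFunction<Iterator<String>> readerFactory) {
        List<Iterator<String>> readers = new ArrayList<>();
        for (int shard = 0; shard < shardCount; shard++) {
            readers.add(null); // created on demand, on first use
        }
        List<String> page = new ArrayList<>();
        boolean progress = true;
        while (page.size() < pageLimit && progress) {
            progress = false;
            for (int shard = 0; shard < shardCount && page.size() < pageLimit; shard++) {
                if (readers.get(shard) == null) {
                    readers.set(shard, readerFactory.apply(shard));
                }
                Iterator<String> reader = readers.get(shard);
                if (reader.hasNext()) {
                    page.add(reader.next());
                    progress = true;
                }
            }
        }
        return page;
    }
}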

Figure 4: The new algorithm for reading the contents of a vnode on a replica.

This algorithm has several desirable properties with regard to suspending and resuming later. The most important of these is that it doesn’t need to discard data. Discarding data means that the reader from which the data originates cannot be saved, because its read position would be ahead of the position where the read should continue from. While the new algorithm can also overread (due to the buffering and read-ahead), it overreads less, and since the data is in raw form, it can be moved back to the originating readers, restoring them to a state as if they had stopped reading exactly at the limit. It also doesn’t need the complex exponentially increasing concurrency and the problems that come with it: no slow start, and no expensive sorting and merging for sparse tables.

When the page is filled only the shard readers are saved, buffered but unconsumed data is pushed back to them so there is no need to save the state of the reading algorithm. This ensures that the saved state, as a whole, is resilient to individual readers being evicted from the querier cache. Saving the state of the reading algorithm as well would have the advantage of not having to move already read data back to the originating shard when the page is over, at the cost of introducing a special state that, if evicted, would make all the shard readers unusable, as their read position would suddenly be ahead, due to data already read into buffers being discarded. This is highly undesirable, so instead we opted for moving buffered but unconsumed data back to the originating readers and saving only the shard readers. As a side note, saving the algorithm’s state would also tie the remaining pages of the query to be processed on the same shard, which is bad for load balancing.

Figure 5: Flow diagram of an example of the new algorithm filling a page. Readers are created on demand. There is no need to discard data as when reading shard chunk 5, the read stops exactly when the page is filled. Data that is not consumed but is already in the buffers is moved back to the originating shard reader. Read ahead and buffering is not represented on the diagram to keep it simple.

Diagnostics

As stateful range scans use the existing infrastructure, introduced for singular partition queries, for saving and restoring readers, the effectiveness of this caching can be observed via the same metrics, already introduced in the More Efficient Query Paging with Scylla 2.2 blog post.

Moving buffered but unconsumed data back to the originating shard can cause problems for partitions that contain loads of range tombstones. To help spot cases like this, two new metrics were added:

  1. multishard_query_unpopped_fragments counts the number of fragments (roughly rows) that had to be moved back to the originating reader.
  2. multishard_query_unpopped_bytes counts the number of bytes that had to be moved back to the originating reader.

These counters are soft badness counters: they will normally not be zero, but outstanding spikes in their values can explain problems and thus should be looked at when queries are slower than expected.

Saving individual readers can fail. Although this will not fail the read itself, we still want to know when it happens. To track these events, two additional counters were added:

  1. multishard_query_failed_reader_stops counts the number of times that stopping a reader (one executing a background read-ahead when the page ended) failed.
  2. multishard_query_failed_reader_saves counts the number of times saving a successfully stopped reader failed.

These counters are hard badness counters: they should be zero at all times. Any other value indicates either serious problems with the node (no available memory or I/O errors) or a bug.

Performance

To measure the performance benefits of making range scans stateful, we compared the recently released 2.3.0 (which doesn’t have this optimization) with current master, the future Scylla Open Source 3.0.

We populated a cluster of 3 nodes with roughly 1TB of data, then ran full scans against it. The nodes were n1-highmem-16 (16 vCPUs, 104GB memory) GCE nodes, with 2 local NVMe SSD disks in RAID0. The dataset was composed of roughly 1.6M partitions, of which 1% were large (1M-20M), around 20% medium (100K-1M), and the rest small (<100K). We also fully compacted the table to filter out any differences due to differences in the effectiveness of compaction. The measurements were done with the cache disabled.

We loaded the cluster with scylla-bench, which implements an efficient range scan algorithm. This algorithm runs the scan by splitting the range into chunks and executing scans for these chunks concurrently. We could fully load the Scylla Open Source 2.3.0 cluster with two loaders; adding a third loader resulted in reads timing out. In comparison, the Scylla Open Source 3.0 cluster could comfortably handle even five loaders, although individual scans of course took more time compared to a run with two loaders.

After normalizing the results of the measurement, we found that Scylla Open Source 3.0 can handle 1.3x the reads/s of Scylla Open Source 2.3.0. While this doesn’t sound very impressive, especially in light of the 2.5x improvement measured for singular partition queries, there is more to this than a single number.

In Scylla Open Source 3.0, the bottleneck during range scans moves from the disk to the CPU. This is because Scylla Open Source 3.0 needs to read up to 39% fewer bytes and issue up to 73% fewer disk operations per read, which allows the CPU cost of range scans to dominate the execution time.

In the case of a table that is not fully compacted, the improvements are expected to be even larger.

Figure 6: Chart for comparing normalized results for BEFORE (stateless scans) and AFTER (stateful scans).

Summary

Making range scans stateful delivers on the promise of reducing the strain on the cluster while also increasing the throughput. It is also evident that range scans are a lot more complex than singular partition queries and being stateless was a smaller factor in their performance as compared to singular partition queries. Nevertheless, the improvements are significant and they should allow you to run range scans against your cluster knowing that your application will perform better.

The post More Efficient Range Scan Paging with Scylla 3.0 appeared first on ScyllaDB.

Scylla Summit Preview: Adventures in AdTech: Processing 50 Billion User Profiles in Real Time with Scylla


In the run-up to Scylla Summit 2018, we’re featuring our speakers and providing sneak peeks at their presentations. This interview in our ongoing series is with Ľuboš Koščo and Michal Šenkýř, both senior software engineers in the Streaming Infrastructure Team at Sizmek, the largest independent Demand-Side Platform (DSP) for ad tech. Their session is entitled Adventures in AdTech: Processing 50 Billion User Profiles in Real Time with Scylla.

Thank you for taking the time to speak with me. I’d like to ask you each to describe your journey as technical professionals. How did you each get involved with ad tech and Sizmek in particular?

Ľuboš: I worked for Sun Microsystems and later for Oracle on applications and hosts in datacenter management software and cloud monitoring solutions. I am also interested in making source code readable and easily accessible, so I am part of the OpenGrok team.

I was looking for new challenges and the world of big data, artificial intelligence, mass processing and real-time processing were all interesting topics for me. AdTech is certainly an industry that converges all of them. So Sizmek was an obvious choice to fill in my curiosity and open new horizons.

Michal: Frankly, getting into AdTech was a bit of a coincidence in my case, since before Sizmek I joined Seznam, a media company not unlike Google in its offerings, but focusing only on the Czech market. I intended to join the search engine team but there was a mixup and I ended up in the ad platform team. I decided to stick around and soon, due to my expertise in Scala, got to work on my first Big Data project using Spark. With that, a whole new world of distributed systems opened up to me. Some time (and several projects) later, I got contacted by Rocket Fuel (now Sizmek) to work on their real-time bidding system. They got me with the much bigger scale of operations, with a platform spanning the whole globe. It was a challenge I gladly took.

Last year you talked about how quickly you got up and running with Scylla. “We picked Scylla and just got it done–seven data centers up and running in two months.” What have you been up to since?

Ľuboš: We took on a bigger challenge: replacing our user profile store with Scylla. We knew back then it wouldn’t be that easy, since it’s one of the core parts of our real-time infrastructure. A lot of flows depend on it, and the hardware takes a significant amount of space in our datacenters.

Preparing a proof of concept, doing capacity planning, and making sure all the pieces would work as designed were all tasks we had to do. However, we were able to tackle most of the challenges with the help of the Scylla guys, and we’re close to production now.

At the same time, a similar task happened within our other department, where we replaced our page context proxy cache with Scylla. That work is mostly getting to production now.

How about data management? How much data do you store, and what do your needs look like in terms of growth over time? How long do you keep your data?

Ľuboš: Currently we store roughly 30G per node, part of it in ramdisk, on a total of 21 nodes across the globe. This data can grow up to 50G per node by design. TTL here depends on the use case deployed, but it’s from a few hours to 3 or 7 days. The recent use case going to production will store much more data directly on SSD disks; right now it’s at 175GB per node on 20 nodes around the world, and we can grow up to 1.7TB. This storage is persistent. Michal will comment on the upcoming profile store, which we will talk about in more detail at the Summit.

Michal: In terms of user profiles, we currently store about 50 billion records, which amounts to about 150TB of replicated data. It fluctuates quite a bit with increases for new enhancements and decreases due to optimizations, legislative changes, etc. We keep them for just a few months unless we detect further activity.

What about AdTech makes NoSQL databases like Scylla so compelling for your architecture?

Ľuboš: Scylla’s read latency is very good, also bearing in mind the fit with SSDs, CPU and memory allocation, and the ease of node and cluster management. Scylla also seems to scale vertically nicely.

Michal: We have a huge amount of data that needs to be referenced in a very short amount of time. There simply is no way to do it other than a distributed storage system like Scylla. No centralized system can keep up with that. Using the profile data, we can make complex decisions when selecting and customising ads based on the audience.

How is Sizmek setting itself apart from other AdTech platforms?

Ľuboš: Sizmek is a high-performance platform. That means that we deliver on the campaign promise and we deliver with high quality. Sizmek’s AI models combined with real-time adjustments are one of the best in targeting for programmatic marketing. Sizmek was named the best innovator in this space by Gartner, and lots of our features that make us different are well described on our blog, so look it up. It’s very interesting reading.

Michal: We think deeply about which ads to show to which person and in what context. Internet advertising tends to have sort of a stigma because users can get annoying ads that follow them around, are displayed in inappropriate places, multiple times, etc. Our dedicated AI team is constantly working on improvements to our machine learning models to ensure that this is not the case and every advertisement is shown at precisely the right time to precisely the right user to maximize the effect our client wants to achieve.

Tell us about the SLAs you have for real time bidding (RTB).

Ľuboš: Right now we have 4 milliseconds maximum, but 1-2 milliseconds is the usual, for the page context caching service. For the proxy cache use case we are on 10 milliseconds as maximum, but generally this is around 3 milliseconds. [Editor’s note: These times are at the database level.]

Michal: For each given bid request, our [end-to-end] response needs to come in 70 milliseconds. Any longer than that and our bid is discarded. We allocate no more than 60% of that time to the actual lookup, which can involve multiple profile lookups if they are part of a cluster, as well as the subsequent transformation of the returned result. All in all, Scylla is left with less than 10 milliseconds at best to complete the actual query.

Anything you’re especially looking forward to at this year’s Scylla Summit?

Ľuboš: Scylla 3.0, in-memory Scylla, and Spark and Scylla debugging are on my list so far.

Michal: I am very interested to hear about the progress the Scylla team is making towards version 3.0. It is going to be a huge update with several features that we already plan to take advantage of. I am also looking forward to hearing from the other users of Scylla about all the different use cases they are using the technology on.

Thank you both for this glimpse into your talk! I am sure attendees are going to learn a lot.

This is it! Scylla Summit is coming up next week. The Pre-Summit Training is Monday, November 5, followed by two days of sessions, Tuesday and Wednesday, November 6-7, 2018. Thank you for following our series of Scylla Summit Previews. We’re now busily preparing and look forward to seeing all of you at the show. So if you haven’t registered yet, now’s your chance.

The post Scylla Summit Preview: Adventures in AdTech: Processing 50 Billion User Profiles in Real Time with Scylla appeared first on ScyllaDB.

Introduction to tlp-stress

If you’re a frequent reader of our blog, you may have noticed we’ve been spending a lot of time looking at performance tuning. We’ve looked at tuning Compression, Garbage Collection, and how you can use Flame Graphs to better understand Cassandra’s internals. To do any sort of reasonable performance tuning you need to be able to apply workloads to test clusters. With Cassandra, that means either writing a custom tool to mimic your data model or using cassandra-stress to try to put load on a cluster.

Unfortunately, in the wise words of Alex Dejanovski, modeling real life workloads with cassandra-stress is hard. Writing a stress profile that gives useful results can be very difficult, sometimes impossible. As Alex mentions in his post, cassandra-stress will do batches whether or not you’ve asked it to, and disabling that functionality is far from straightforward.

Outside of the world of Cassandra, I’ve done a fair bit of benchmarking and performance profiling. One of the tools I’ve been most impressed with is fio. The author, Jens Axboe, recognizes that when doing performance testing there’s a handful of patterns that come up frequently. With very little work, fio can be configured to benchmark reads, writes, mixed, and the operations can be random or sequential. The idea is that you start with a predefined idea of a job, and configure it via parameters to run the workload you’re interested in. Once you understand the tool, creating a new workload takes minutes, not hours, and is fairly straightforward. It can output the results as JSON, so when you’re done it’s easy to archive and process.

Benchmarking Cassandra is, of course, different from benchmarking a filesystem. Workloads vary significantly between applications, but ultimately we still see a lot of patterns. Time series and key-value workloads are very common, for instance. We decided to create a stress tool that ships with a variety of commonly run workloads and lets you tweak their behavior. If the desired workload can’t be configured from what we ship, it should be straightforward to create a new one.

Thus, tlp-stress was born.

tlp-stress is written in Kotlin, so it runs on the JVM. We use the DataStax Java Driver and follow its best practices to maximize query throughput. Metrics are tracked using the same instrumentation as the Java driver itself, which will make exporting them to tools like Prometheus a no-brainer in the long run (see related issue). We chose Kotlin because it gives us access to stable Java libraries and runs on the JVM without a lot of the mental overhead of Java. We’ve used Kotlin internally at TLP for over a year and have found it to be as easy to write as Python while still providing static typing and great IDE support, without giving up the libraries we rely on and know well.
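As a rough illustration of the driver best practices mentioned above (prepared statements plus asynchronous execution), here is a minimal Kotlin sketch assuming the 3.x Java driver API. The contact point, keyspace, and loop are assumptions for the example, not code taken from tlp-stress itself:

import com.datastax.driver.core.Cluster

fun main() {
    // Contact point and keyspace are placeholders for this sketch
    val cluster = Cluster.builder().addContactPoint("127.0.0.1").build()
    val session = cluster.connect("tlp_stress")

    // Prepare once, bind many times - a core driver best practice
    val insert = session.prepare("INSERT INTO keyvalue (key, value) VALUES (?, ?)")

    // Async execution keeps many requests in flight instead of blocking on each query;
    // real code would cap the number of in-flight requests
    val futures = (1..1000).map { i ->
        session.executeAsync(insert.bind("key$i", "value$i"))
    }
    futures.forEach { it.getUninterruptibly() }

    cluster.close()
}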

There’s documentation for tlp-stress, which includes examples generated by the tool itself.

Let’s take a look at a practical example. Say we want to understand how Cassandra will perform with a key-value style workload that’s 90% reads. This isn’t the use case people talk about most often - usually the discussion is about optimizing time series and similar write-heavy workloads.

We can put together this workload rather trivially.

Note: For demonstration purposes I’m only running 10 million operations - for real testing we’d want to set this up to run over several days.

We’ll fire up a test using the run subcommand, specifying one of the available tests - in this case KeyValue. We’ll limit ourselves to 10,000 partition keys by specifying -p 10k (note the human-friendly inputs), request 90% reads with -r .9, and specify the compaction option:

$ tlp-stress run KeyValue -n 10M -p 10k -r .9 --compaction "{'class':'LeveledCompactionStrategy'}" 
Creating schema
Executing 10000000 operations
Connected
Creating tlp_stress: 
CREATE KEYSPACE
 IF NOT EXISTS tlp_stress
 WITH replication = {'class': 'SimpleStrategy', 'replication_factor':3 }

Creating Tables
CREATE TABLE IF NOT EXISTS keyvalue (
                        key text PRIMARY KEY,
                        value text
                        ) WITH compaction = {'class':'LeveledCompactionStrategy'}
Preparing queries
Initializing metrics
Connecting
Preparing
1 threads prepared.
Running
                  Writes                                    Reads                  Errors
  Count  Latency (p99)  5min (req/s) |   Count  Latency (p99)  5min (req/s) |   Count  5min (errors/s)
  13415          15.61             0 |  122038          25.13             0 |       0                0
  33941          15.02        5404.6 |  306897           16.6       48731.6 |       0                0
  54414          15.56        5404.6 |  490757          24.15       48731.6 |       0                0

Note that we didn’t have to write a single query or a schema. That’s because tlp-stress includes common workloads and features out of the box. We can find out what those workloads are by running the list command:

$ tlp-stress list 
Available Workloads:

BasicTimeSeries 
Maps 
CountersWide 
MaterializedViews 
KeyValue 
LWT 

Done.

What if we have a use case that this test is close to but doesn’t match exactly? Perhaps we know our workload will be fairly heavy on each request. The next thing we can do is customize the data in specific fields. Note that the schema above has a table called keyvalue, which has a value field. Let’s suppose our workload is caching larger blobs of data, maybe 100,000 - 150,000 characters per request. Not a problem - we can tweak the field.

$ tlp-stress run KeyValue -n 10M -p 10k -r .9 --compaction "{'class':'LeveledCompactionStrategy'}" --field.keyvalue.value='random(100000,150000)' 
Creating schema
Executing 10000000 operations
Connected
Creating tlp_stress: 
CREATE KEYSPACE
 IF NOT EXISTS tlp_stress
 WITH replication = {'class': 'SimpleStrategy', 'replication_factor':3 }

Creating Tables
CREATE TABLE IF NOT EXISTS keyvalue (
                        key text PRIMARY KEY,
                        value text
                        ) WITH compaction = {'class':'LeveledCompactionStrategy'}
keyvalue.value, random(100000,150000)
Preparing queries
Initializing metrics
Connecting
Preparing
1 threads prepared.
Running
                  Writes                                    Reads                  Errors
  Count  Latency (p99)  5min (req/s) |   Count  Latency (p99)  5min (req/s) |   Count  5min (errors/s)
   1450          43.75             0 |   13648           40.7             0 |       0                0
   3100          35.29         512.8 |   28807          42.81        4733.8 |       0                0
   4655          49.32         512.8 |   42809          44.25        4733.8 |       0                0

Note in the above example, --field.keyvalue.value corresponds to the keyvalue table and its value field, and we’re using random values between 100,000 and 150,000 characters. We could have specified --field.keyvalue.value='cities()' to pick randomly from a list of world cities, or --field.keyvalue.value='firstname()' to pick from a supplied list of first names. This feature is mostly undocumented and will be receiving some attention soon to make it a lot more useful than the current options, but you should be able to see where we’re going with this.
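If we wanted the values to come from one of those built-in generators instead, only that one flag changes. The command below simply reuses the flags already shown and is meant as an illustration:

$ tlp-stress run KeyValue -n 10M -p 10k -r .9 --field.keyvalue.value='cities()'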

In addition to our stress tool we’ve also begun work on another tool, cleverly named tlp-cluster, to launch test clusters to aid in development and diagnostics. To be clear, this tool is in its infancy and we don’t expect it to be used by the general Cassandra population. It isn’t a substitute for provisioning tools like Salt and Ansible. With that warning out of the way, if you’re interested in checking it out, we’ve put the project on GitHub and we’re working on some documentation.

We’ll be building out these tools over the next few months to further help us understand existing Cassandra clusters as well as profile the latest and greatest features that get merged into trunk. By focusing our effort on what we know - tooling and profiling - our goal is to expose and share as much knowledge as possible. Check out the tlp-stress documentation; it’ll be the best source of information over time. Please let us know if you’re finding tlp-stress useful by giving us a shout on Twitter!

ProtectWise Delivers Cloud-Powered Network Detection and Response at Scale with DataStax

ProtectWise is on a mission to fundamentally change human experience in security. We believe there needs to be a new approach to how the enterprise acquires, manages, and operates security.

We launched our cloud-powered Network Detection and Response (NDR) platform, The ProtectWise Grid™, in 2015 to shift core network security functionality away from traditional fragmented hardware appliances to a fully on-demand model delivered entirely from the cloud.

How it Works

The ProtectWise Grid is deployed by placing free software sensors on any network segment where our customers need visibility, detection, and response capabilities, including enterprise, cloud, or industrial environments.

These sensors operate passively on the network, receiving an exact copy of all the network communications and transactions, which we compress, optimize, and stream to the cloud.

Once in the cloud, we run it through a suite of threat detection capabilities ranging from deterministic approaches, using things such as signatures and heuristics, to probabilistic ones, where we profile behavior and look for anomalies. We also wanted to store a copy of this data for an unlimited amount of time.

Network data is truly massive. It can contain lots of features to be extracted, such as IPs, ports, email, web traffic, files, URLs, domains, hashes, certificates, DNS queries, SMB transfers, geographic information, and on and on.

At scale, this means we are ingesting trillions of data points per day, all of which are candidates for threat detection. Being “NDR” means it’s not enough to simply detect attacks; we need to be able to integrate with the existing architecture, respond automatically, support rich investigation, and offer up that entire haystack for open-ended searching and querying.

Given that the median time for breach detection in most organizations is greater than six months, we offer a standard retention window of one year of storage.

How it’s Doing

After three years of commercial availability, ProtectWise has amassed one of the largest security data sets ever created. Today, The ProtectWise Grid ingests over half a trillion data points per hour and performs tens of millions of transactions per second.

With this kind of volume and velocity, our technology infrastructure must be purpose-built to store and manage time series data without impacting performance and availability. DataStax, and specifically its distributed database DataStax Enterprise (DSE), is a critical component in making this possible.

In the early stages of our development, the ProtectWise team discovered DSE through the DataStax Startup Program. At the time, we knew database performance would be key to keeping up with epic write speeds, and we also knew legacy relational or Hadoop-based database technologies weren’t built to support the level of performance we needed. As a fast, highly scalable distributed database built on Apache Cassandra, DSE allowed us to manage this tremendous volume of data at scale.

ProtectWise also needed to enable search at scale. DSE integrates with Apache Solr™, which let us fuse Cassandra and Solr together in a way that supported our demands for superior system performance and linear scalability.
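For readers unfamiliar with DSE Search, the Solr integration is exposed through CQL via a solr_query column, so search and storage live behind one interface. A hypothetical query (the table and field names here are invented for illustration, not ProtectWise’s actual schema) looks something like:

SELECT * FROM netflow.connections WHERE solr_query = 'dst_port:443 AND geo_country:GB' LIMIT 100;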

Ultimately, DSE enabled us to store, index, and search all of our customers’ data in real time, and it plays a key role in our ability to query all of that data going back a year or more. We built some very interesting technology that allows us to search even petabytes of data in seconds, and to do so cost-effectively, and DSE is an important part of that.

The DSE Solution

DSE has allowed us to solve a hard problem very quickly and continues to serve as the database backbone of our cloud-delivered NDR platform. As more organizations evolve their security strategies with NDR, our work together becomes even more important.

NDR is designed to dramatically accelerate threat detection and integrate incident response, with open querying on top. Complementing this, DataStax gives us the ability to search through billions of network communications very quickly and derive answers from these data points in mere seconds. Our integrations with other security products, including Endpoint Detection and Response (EDR) solutions, allow us to solve another hard problem: enabling complete visibility, detection, forensics, and response from the endpoint to the network.

Together, we are creating the future of security—one that is simpler, faster, more effective, and more affordable.

ProtectWise Revolutionizes Enterprise Network Security with DataStax (Case Study)

READ NOW