Planet Cassandra

All your NoSQL Apache Cassandra resources in one place.

DSE Advanced Performance: Apache Cassandra™ Goes Turbo

Introduction

Today, we are very excited to unveil some of the most critical performance enhancements we’ve ever made to DataStax Enterprise (DSE). Enterprise workloads are becoming more and more demanding each day, so we took the opportunity to channel the amazing engineering talent at DataStax to re-architect how we take advantage of compute, memory, and storage. It’s not just about speed, either; we’ve made DSE more efficient and resilient to meet the most demanding workloads.

We’ve named our new suite of performance optimizations, utilities, and tools  “DataStax Advanced Performance”. The best part? You just need to upgrade to enjoy these out-of-the-box benefits, which include:

  • New thread-per-core and asynchronous architecture, which results in double the read/write performance of open source Apache Cassandra™
  • Storage engine optimizations that cut read/write latencies of open source Cassandra in half
  • Faster analytics that can deliver up to 3x query speed-ups over open source Apache Spark™
  • DataStax Bulk Loader, which loads and unloads data up to 4x faster than current Cassandra load tools

Thread-Per-Core and Asynchronous Architecture

Apache Cassandra uses a traditional staged event-driven architecture (SEDA). With the SEDA architecture, Cassandra assigns thread pools to events or tasks and connects them via a message service. This architecture also uses multiple threads per task, meaning that threads need to be coordinated. Additionally, events in this architecture are synchronous, which can cause contention and slowdowns. Because of this, adding CPU cores eventually yields diminishing returns.


DSE 5.1 left, DSE 6 right – With the traditional SEDA architecture, we see much more context switching, which is expensive and degrades performance.

DSE 6 has a coordination-free, thread-per-core architecture that yields dramatic performance gains. The whole data layer across DSE, including search, analytics, and graph, benefits from this new architectural change.


cassandra-stress, 5 nodes, RF=3, CL=QUORUM, 500GB density

Each node in a cluster owns part of the token range; that’s not new. What’s new is that each node’s token range is divided evenly among CPU threads: one per CPU core, to be exact. Each thread is now responsible for incoming writes for the part of the token range it owns, and any available thread can be used to handle read requests. This means that evenly distributed data in the cluster results in evenly distributed CPU utilization on a server. This architecture also means that very little coordination is needed between threads, which ensures that a CPU core can be used to its fullest capabilities.


cassandra-stress, 5 nodes, RF=3, CL=ONE, 500GB density

Since a single thread owns the writes for its respective token range, what about contention? In DSE 6, we’ve moved reads, writes, and other tasks from synchronous operations to asynchronous ones. This eliminates thread contention and keeps threads always working. Combined with the thread-per-core architecture, it allows us to scale performance linearly with the number of CPU cores, which is extremely important as multi-socket motherboards and high core-count cloud instances have become the standard.

Storage Engine Optimizations

Besides ingesting and serving data faster with thread-per-core, we’ve also made storage engine improvements that reduce latency and optimize compaction, which can be a bottleneck for write-heavy workloads. In DSE 6, compaction performance is 22% faster than in DSE 5.1, which is already 2x faster than open source Cassandra. We’re also seeing latency improvements of 2x on reads and writes.


In DSE 5.1, we introduced 2x compaction performance over Apache Cassandra. In DSE 6, compaction is even faster.


40k fixed throughput test, 3 nodes, RF=3, CL=QUORUM

Faster Analytics Scans

Also included in DSE Advanced Performance is improved analytics read performance that is 3x better than open source Cassandra and Spark. This was made possible by a feature called Continuous Paging, an optimization designed specifically for DataStax analytics queries that scan a large portion of data. We have tested this in a number of scenarios: selecting all columns or only some, with or without a clustering-column predicate. In all scenarios we see a 2.5x to 3.5x performance improvement.


3x analytics read performance over open source Spark and Apache Cassandra.

DataStax Bulk Loader

Also new in DSE 6 is a bulk loader utility that greatly outpaces current Cassandra load/unload utilities. The command line tool handles both standard delimited formats as well as JSON and can load and unload data up to 4x faster than current tools.   

Conclusion

We’re extremely excited for our customers to experience the new Advanced Performance capabilities of DSE 6. With a 2x throughput improvement, massive latency improvements, a 22% compaction improvement, a 3x analytics improvement, and a crazy-fast bulk loader, we can’t wait to see the kinds of innovation and disruption our customers will continue to make.

To download DSE 6 and for more information on DSE Advanced Performance, check out this page.

DSE NodeSync: Operational Simplicity at its Best

Introduction

We’ve got something really special for administrators in DataStax Enterprise (DSE) 6: DSE NodeSync, designed with operational simplicity in mind, can virtually eliminate manual efforts required to run repair operations in a DataStax cluster.

NodeSync

To understand NodeSync, let’s talk about how we got here. One of the most important mechanisms for an administrator to run in Apache Cassandra™ is anti-entropy repair. Despite its name, repair is a process that should always be running in a cluster to ensure that data is consistent between nodes.

The fundamentals of repair haven’t changed since it was initially introduced many years ago. It’s designed as a single-process bulk operation that runs continuously for a long time, which means that when a failure occurs, you must begin the repair all over again. Repair is also computationally and network intensive, as it creates Merkle trees and streams them between nodes.


The longer classic repair runs, the more failure prone it is.

To help mitigate some of these problems, complex tools were built to orchestrate and add some structure and resiliency to repair. These tools try to split the repair process into multiple, more manageable pieces in an effort to improve operational simplicity, but in the end, these client-side tools were built to solve issues with a server-side mechanism. There’s only so much that can be done with tooling.

Enter NodeSync: NodeSync is a ground-up rethinking of how we do entropy resolution in a DataStax cluster. Once you install DSE 6, NodeSync automatically starts running in the background. You simply tell it which keyspace or tables you’d like managed with NodeSync, and it handles the rest. No more compute-intensive tasks, no more complex tooling, just hands-off repair operations.


Enabling NodeSync on a table is as easy as an ALTER TABLE command.
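
For example, here’s what that looks like (a minimal sketch, with a hypothetical table cycling.comments; the nodesync table option follows DSE 6’s documented syntax):

ALTER TABLE cycling.comments WITH nodesync = {'enabled': 'true'};

Setting 'enabled' to 'false' in the same option turns it back off.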

NodeSync is designed to be simple and reliable. It divides the work it must complete into small tasks. These tasks are always tracked so it knows which data has been synchronized and which hasn’t. It also acts as a checkpoint mechanism so that if a node goes down, NodeSync knows exactly where to start again. NodeSync is also self-managing in that it will prioritize what to synchronize based on the last time the data was synced and whether it failed or not.


Easily enable/disable NodeSync on tables through OpsCenter

While NodeSync is designed to be as hands-off as possible, we know how important it is for administrators to understand what’s happening in the cluster, so we’ve also updated OpsCenter to monitor NodeSync progress for you.


OpsCenter 6.5 lets you monitor NodeSync progress

Conclusion

We know our customers are going to love NodeSync, as it’s designed to make operations simpler with DataStax. Eliminating the need to orchestrate and manage repair means that administrators spend less time managing their DataStax clusters and more time on other important tasks. To download DSE 6, and to get more information about NodeSync, please check out this page.

Voice of Experience: Samsung SDS Chooses Scylla


Samsung SDS is a global IT services and solutions company with 57 offices spread across 31 countries. They are tasked with implementing highly performant and scalable systems for a number of Samsung businesses. However, they were experiencing a number of issues at the database layer. For example, their relational database couldn’t meet the performance requirements of several business use cases. As a result, they decided to conduct an in-depth technical evaluation of NoSQL databases.

Samsung SDS was looking for a database with high throughput, scalability, low latency, ease of deployment and maintenance, and reduced operational costs. They decided to compare Scylla against Apache Cassandra. In the proof-of-concept, Scylla delivered 3X better throughput and latency than Cassandra. The team also noted price and ease of maintenance as key factors when deciding on Scylla.

“Its excellent performance means we can save lots of money adopting Scylla over other NoSQL databases.” – Kuyul Noh, Principal Data Architect, Samsung SDS

Hear from Samsung SDS about their experiences with Scylla in the video below.

Next Steps

  • Read the entire Case Study.
  • Learn more about Scylla on our product page.
  • See what our users are saying about Scylla.
  • Download Scylla. Check out our download page to run Scylla on AWS, install it locally in a Virtual Machine, or run it in Docker.
  • Take Scylla for a Test drive. Our Test Drive lets you quickly spin-up a running cluster of Scylla so you can see for yourself how it performs.

The post Voice of Experience: Samsung SDS Chooses Scylla appeared first on ScyllaDB.

DataStax Enterprise 6 – the Distributed Cloud Database Designed for Hybrid Cloud

Each time we have a major release, I look back and think there’s no way our team can top it, that future releases will somehow be less than what just went out the door. But every time I’m proven wrong when our next release becomes GA, and there’s no better example of that than what we’re announcing today.

DataStax Enterprise (DSE) 6 represents a major win for our customers who require an always-on, distributed database to support their modern real-time (what we call ‘Right-Now’) applications, particularly in a hybrid cloud environment. Not only does it contain the best distribution of Apache Cassandra™, but it represents the only hybrid cloud database capable of maintaining and distributing your data in any format, anywhere – on-premises, in the cloud, multi-cloud, and hybrid cloud – in a truly data-autonomous fashion.

Let me take you on a quick tour of what’s inside the DSE 6 box, as well as OpsCenter 6.5, DataStax Studio 6, and DSE Drivers, and show you how our team has knocked yet another one out of the park.

Double the Performance

Enterprises with Right-Now applications know they have three seconds – just three seconds – to keep a customer waiting before almost half of them click away to a competitor. Because these apps are constantly interacting with a database that holds the contextual info needed for producing a personalized customer experience, it’s vital that the database not play a part in exceeding those three seconds.

Exceeding the high bar of speed expectations set by today’s digital consumer is tough, but DSE has been doing it for some time now, and with version 6, things only get better. DSE Advanced Performance is a new set of performance-related optimizations, technologies, and tools that dramatically increase DSE’s performance over its foundational open source components as well as its competitors.

To start, new functionality designed to make Cassandra more efficient with high-compute instances has resulted in a 2x or more out-of-the-box gain in throughput for both reads and writes. Note that these speed and throughput increases apply to all areas of DSE, including analytics, search, and graph. A new diagnostic testing framework developed by DataStax helped pinpoint performance optimization opportunities in Cassandra, with more enhancements coming in future releases.

Next, DSE 6 includes our first-ever advanced Apache Spark™ integration (going beyond the open source work we’ve done for Spark in the past) that delivers a number of improvements, as well as a 3x query performance increase.

Lastly, loading and unloading large volumes of data is still a very pressing need for many enterprises. DSE 6 answers this call with our new DataStax Bulk Loader that’s built to rapidly move data in and out of the platform at impressive rates – up to 4x faster than current data loading utilities.

All of these performance improvements have been designed with our customers in mind so that their Right-Now applications deliver a better-than-expected customer experience by processing more orders, fielding more queries, performing faster searches, and moving more data faster than ever before. If an app’s response time exceeds three seconds, it won’t be because of DSE.

Self-Driving Operational Simplicity

In designing DSE 6, we listened to both DataStax customers and the Cassandra community. While the interests of these groups sometimes diverge, they do have a few things in common.

It turns out that helping with Cassandra repair operations is a top priority for both. For some, Cassandra repairs aren’t a big deal, but for others they are a PITA (pain in the AHEM). Don’t get repair right in a busy and dynamic cluster, and it’s just a matter of time until you have production-threatening issues.

While we introduced an OpsCenter-based repair service some years ago, it was limited to repair functionality available at the Cassandra level. Knowing that a server-based approach is what Cassandra users want, our talented engineering team has delivered DSE NodeSync, which essentially makes DSE ‘repair free’ by operating in a transparent and continuous fashion to keep data synchronized in DSE clusters.

If you like your current repair setup, keep it. But if you want to eliminate scripting, manual intervention, and piloting repair operations, you can turn on NodeSync and be done. It works at the table level so you have strong flexibility and granularity with NodeSync, plus it can be enabled either with CQL or visually in OpsCenter.

Something else we’ve added to version 6 is DSE TrafficControl, which delivers advanced resiliency that ensures DSE nodes stay online under extreme workloads. Under severe concurrent request traffic, there have been cases of open source Cassandra nodes going offline due to the abnormal pressure. DSE TrafficControl has intelligent queueing, not found in open source, that prevents this from happening on DSE nodes.  

Another area for improvement on which open source users and DataStax customers agree is upgrades. No technical pro that I know looks forward to upgrading their database software, regardless of the vendor used.

I’m happy to say we now provide automated help for upgrades with our new Upgrade service that’s a part of OpsCenter 6.5. Our new upgrade functionality effortlessly handles patch upgrades by notifying you that an upgrade is available, downloading the software you need, applying it to a cluster in a rolling restart fashion so you experience zero downtime, and freeing you up to do other things.

These management improvements and others are directly aimed at increasing your team’s productivity and letting you focus on business needs vs. operational overhead. The operational simplicity allows even novice DBAs and DevOps professionals to run DSE 6 like seasoned professionals. Ultimately that means much easier enterprise-wide adoption of data management at scale.

Analyze (and Search) This!

Forrester ranked DataStax a leader in their Translytical Wave, and for good reason: DSE provides the translytical functionality needed by Right-Now apps that meld transactional and analytical data together. For years, DataStax has provided 100% of the development needed to freely integrate open source Spark and Cassandra, but with DSE 6, we’re kicking things up a notch (or two).  

For the first time, we’re introducing an advanced Spark SQL connectivity layer that provides a new AlwaysOn SQL Engine, which automates uptime for applications connecting to DSE Analytics. This makes DSE Analytics even more capable of handling around-the-clock analytics requests and better supports interactive end-user analytics, while leveraging your existing SQL investment in tools (e.g., BI, ETL) and expertise.

I’d also like to give a shout-out to the recently introduced DSE Analytics Solo, a subscription option that gives you a more cost-effective way to isolate analytics workloads in order to achieve predictable application performance.

We also have great news for analytics developers and others who want to directly query and interact with data stored in DSE Analytics. DataStax Studio 6 provides notebook support for Spark SQL, which means you now have a visual and intelligent interface and query builder that helps you write Spark SQL queries and review the results – a huge time saver! Plus you can now export/import any notebook (graph, CQL, Spark SQL) for easy developer collaboration as well as undo notebook changes with a new versioning feature.

Finally, let’s not forget the critical role search functionality plays in apps that rely on contextual and converged data. DSE Search has upped its game in this area by delivering CQL support for common search queries, such as those that use LIKE, IN, range searches, and more.
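
As a rough sketch of what this looks like in practice (the table and column names here are hypothetical, and CREATE SEARCH INDEX may take additional options in a real deployment):

CREATE SEARCH INDEX ON killrvideo.users;

SELECT * FROM killrvideo.users WHERE last_name LIKE 'Smi%';

The first statement builds a search index on the table; once it exists, CQL predicates like LIKE are served by the search engine instead of being rejected.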

Supporting Distributed Hybrid Cloud

Over 60% of DataStax customers currently deploy DSE in the cloud, which isn’t surprising given that our technology has been built from the ground up with limitless data distribution and the cloud in mind. Customers run DSE today on AWS, Azure, GCP, Oracle Cloud, and others, as well as private clouds of course.

DataStax Managed Cloud, which currently supports both AWS and Azure, will be updated to support DSE 6, so all the new functionality in our latest release is available in managed form. Whether fully managed or self-managed, our goal is to provide you with multi and hybrid cloud flexibility that supplies all the benefits of a distributed cloud database without public cloud lock-in.

Yes, There’s Actually More…

I’d be remiss if I didn’t also mention additions to our DSE Advanced Security package that contains new separation of duties capabilities and unified authentication support for DSE Analytics, the backup enhancements we’ve done for cloud operations, or all the updates to our DSE drivers. Like I mentioned at the beginning of this post, our team always delivers.

With DSE 6, we want you to enjoy all the heavy-lifting advantages of Cassandra with none of the complexities and also get double the power. Downloads, free online training, and other resources are now available, so give DSE 6 a try (also now available for non-production development environments via Docker Hub) and let us know what you think.

It’s Here! DataStax Docker Images for DSE

Let’s skip to the last page in this book. You can head over to DataStax Academy right now and take a guided tour of DataStax Enterprise 6 using Official DataStax Docker Images which are approved for non-production use.

We started our Docker journey in 2014 and began exploring orchestration technologies shortly thereafter. In the fall of 2015, we announced production support for customer-built Docker images and offered best-practice guidance for anyone integrating DataStax Enterprise (DSE) into Docker.

Today we are making DataStax-built images widely available for non-production use by hosting them in Docker Hub. We want the images to be as easy for you to use as they are for our internal teams.

Internally, we use Docker for many of our unit, integration, and functional tests. Doing so enables us to run many tests in parallel on a single machine and/or test cluster. With this approach we’ve crunched 15+ hours of total testing time into 20-to-60-minute testing rounds. The result is that our developers get feedback much faster, and we want our customers to have this same experience! To learn more about our testing strategy, check out Predrag Knezevic’s Cassandra Summit talk.

We also use the DataStax images to power our reference application, KillrVideo. The official images ensure we are using stable versions, make testing various configurations quick and easy, and, by eliminating much of the setup work, enable users to more quickly learn and understand the platform.

We want to see what you build from the images and will showcase examples within our GitHub account. Existing examples include partner integrations (StreamSets), security configuration (LDAP), KillrVideo, and advanced examples of using Docker Compose. Create a pull request to add your own.

This brings us to configuration and customization. We are also providing:

  • Docker Compose scripts to enable you to easily deploy clusters and expose the components (DSE/OpsCenter/Studio) to each other.
  • Access to the GitHub repo for developers who want to customize the images

We also want to make these images universally applicable to all your key use cases. For simple use cases, we’ve exposed common settings as environment variables. For advanced configuration management, we’re providing a simple mechanism to let you change or modify configurations without replacing or customizing the containers. You can add any of the approved config files to a mounted host volume and we’ll handle the hard work of mapping them within the container. You can read more about that feature here.

Lastly, adoption and feedback will drive these to approval for production use. Here are a few ways to provide input:

Stuff The Internet Says On Scalability For April 13th, 2018

Hey, it's HighScalability time:

 

Bathroom tile? Grandma's needlepoint? Nope. It's a diagram of the dark web. Looks surprisingly like a tumor.

If you like this sort of Stuff then please support me on Patreon. And I'd appreciate if you would recommend my new book—Explain the Cloud Like I'm 10—to anyone who needs to understand the cloud (who doesn't?). I think they'll learn a lot, even if they're already familiar with the basics. 

  • $23 billion: Amazon spend on R&D in 2017; $0.04: cost to unhash your email address; $35: build your own LIDAR; 66%: links to popular sites on Twitter come from bots; 60.73%: companies report JavaScript as primary language; 11,000+: object dataset provides real objects with associated depth information; 150 years: age of the idea of privacy; ~30%: AV1's better video compression; 100s of years: rare-earth materials found underneath Japanese waters; 67%: better image compression using Generative Adversarial Networks; 1000 bit/sec: data exfiltrated from air-gapped computers through power lines using conducted emissions

  • Quotable Quotes:
    • @Susan_Hennessey: Less than two months ago, Apple announced its decision to move mainland Chinese iCloud data to state-run servers.
    • @PaulTassi: Ninja's New 'Fortnite' Twitch Records: 5 Million Followers, 250,000 Subs, $875,000+ A Month via @forbes
    • @iamtrask: Anonymous Proof-of-Stake and Anonymous, Decentralized Betting markets are fundamentally rule by the rich. If you can write a big enough check, you can cause anything to happen. I fundamentally disagree that these mechanisms create fair and transparent markets.
    • David Rosenthal: The redundancy needed for protection is frequently less than the natural redundancy in the uncompressed file. The major threat to stored data is economic, so compressing files before erasure coding them for storage will typically reduce cost and thus enhance data survivability.
    • @mjpt777: The more I program with threads the more I come to realise they are a tool of last resort.
    • JPEG XS~ For the first time in the history of image coding, we are compressing less in order to better preserve quality, and we are making the process faster while using less energy. Expected to be useful for virtual reality, augmented reality, space imagery, self-driving cars, and professional movie editing.
    • Martin Thompson: 5+ years ago it was pretty common for folks to modify the Linux kernel or run cut down OS implementations when pushing the edge of HFT. These days the really fast stuff is all in FPGAs in the switches. However there is still work done on isolating threads to their own exclusive cores. This is often done by exchanges or those who want good predictable performance but not necessarily be the best. A simple way I have to look at it. You are either predator or prey. If predator then you are mostly likely on FPGAs and doing some pretty advanced stuff. If prey then you don't want to be at the back of the herd where you get picked off. For the avoidance of doubt if you are not sure if you are prey or predator then you are prey. ;-)
    • Brian Granatir: serverless now makes event-driven architecture and microservices not only a reality, but almost a necessity. Viewing your system as a series of events will allow for resilient design and efficient expansion. DevOps is dead. Serverless systems (with proper non-destructive, deterministic data management and testing) means that we’re just developers again! No calls at 2am because some server got stuck? 
    • @chrismunns: I think almost 90% of the best practices of #serverless are general development best practices. be good at DevOps in general and you'll be good at serverless with just a bit of effort
    • David Gerard: Bitcoin has failed every aspiration that Satoshi Nakamoto had for it. 
    • @joshelman: Fortnite is a giant hit. Will be bigger than most all movies this year. 
    • @swardley: To put it mildly, the reduction in obscurity of cost through serverless will change the way we develop, build, refactor, invest, monitor, operate, organise & commercialise almost everything. Micro services is a storm in a tea cup compared to this category 5.
    • James Clear: The 1 Percent Rule is not merely a reference to the fact that small differences accumulate into significant advantages, but also to the idea that those who are one percent better rule their respective fields and industries. Thus, the process of accumulative advantage is the hidden engine that drives the 80/20 Rule.
    • Ólafur Arnalds: MIDI is the greatest form of art.
    • Abraham Lincoln: Give me six hours to chop down a tree and I will spend the first four sharpening the axe.
    • @RichardWarburto: Pretty interesting that async/await is listed as essentially a sequential programming paradigm.
    • @PatrickMcFadin: "Most everyone doing something at scale is probably using #cassandra" Oh. Except for @EpicGames and @FortniteGame They went with MongoDB. 
    • Meetup: In the CloudWatch screenshot above, you can see what happened. DynamoDB (the graph on the top) happily handled 20 million writes per hour, but our error rate on Lambda (the red line in the graph on the bottom) was spiking as soon as we went above 1 million/hour invocations, and we were not being throttled. Looking at the logs, we quickly understood what was happening. We were overwhelming the S3 bucket with PUT requests
    • Sarah Zhang: By looking at the polarization pattern in water and the exact time and date a reading was taken, Gruev realized they could estimate their location in the world. Could marine animals be using these polarization patterns to navigate through the ocean? 
    • Vinod Khosla: I have gone through an exercise of trying to just see if I could find a large innovation coming out of big companies in the last twenty five years, a major innovation (there’s plenty of minor innovations, incremental innovations that come out of big companies), but I couldn’t find one in the last twenty five years.
    • Click through for lots more quotes.

Don't miss all that the Internet has to say on Scalability, click below and become eventually consistent with all scalability knowledge (which means this post has many more items to read so please keep on reading)...

Voice of Experience: IBM Cloud’s Compose for JanusGraph Uncovers Advantages of Scylla


IBM had previously used only Apache Cassandra and HBase as storage back-ends for the graph databases it makes available on IBM Cloud. Having heard about the advantages of Scylla, IBM’s Open Tech and Performance teams conducted a series of tests to compare Scylla with HBase and Apache Cassandra.

The IBM team learned quite a lot from their performance tests. In their first test, which generated a load of 40,000,000 vertices with two properties, Scylla displayed nearly 35% higher throughput than HBase and almost 3X Cassandra’s throughput. In their second test, which consisted of randomly picking 30,000,000 pairs of vertices and inserting 30,000,000 edges, each with one property, Scylla’s throughput was 160% better than HBase’s and more than 4X that of Cassandra.

In addition to its performance advantages, Scylla was also the easiest database to cluster, especially when adding multiple nodes to a cluster. The IBM Compose team was very pleased to see Scylla’s self-tuning capabilities, load balancing, and its ability to fully utilize the available system resources.

Hear what IBM’s teams have to say about their experiences with Scylla in this video.

 

Next Steps

  • Learn more about Scylla on our product page.
  • See what our users are saying about Scylla.
  • Download Scylla. Check out our download page to run Scylla on AWS, install it locally in a Virtual Machine, or run it in Docker.
  • Take Scylla for a Test drive. Our Test Drive lets you quickly spin-up a running cluster of Scylla so you can see for yourself how it performs.

The post Voice of Experience: IBM Cloud’s Compose for JanusGraph Uncovers Advantages of Scylla appeared first on ScyllaDB.

Role-based Access Control in Scylla


The next open-source release (version 2.2) of Scylla will include support for role-based access control. This feature was introduced in version 2.2 of Apache Cassandra. This post starts with an overview of the access control system in Scylla and some of the motivation for augmenting it with roles. We’ll explain what roles are and show an example of their use. Finally, we’ll cover how Scylla transitions existing access-control data to the new roles-based system when you upgrade a cluster.

Access Control in Scylla

There are two aspects of access control in Scylla: controlling client connections to a Scylla node (authentication), and controlling which operations a client can execute (authorization).

By default, no access control is enabled on Scylla clusters. This means that a client can connect to any node unrestricted, and that the client can execute any operation supported by the database.

When we enable access control (the procedure is described in Scylla’s documentation), there are two important changes to Scylla’s behavior:

  • A client cannot connect to a node unless it provides valid credentials for an identity known to the system (a username and password)
  • A CQL query can only be executed if the authenticated identity has been granted the applicable permissions on the database objects involved in the query

For example, a logged-in user jsmith will only be permitted to execute

SELECT * FROM events.ingress;

if jsmith has been granted (directly or indirectly) the SELECT permission on the events.ingress table.

One way to grant jsmith the permissions they need is to grant SELECT on the entirety of the events keyspace. This encompasses all tables in the keyspace as well.

GRANT SELECT ON KEYSPACE events TO jsmith;

We can verify the permissions granted to jsmith:

LIST ALL PERMISSIONS OF jsmith;

 role   | username | resource          | permission
--------+----------+-------------------+------------
 jsmith | jsmith   | <keyspace events> | SELECT

Limitations of User-based Access Control

Access control based only on users can quickly become unwieldy. To see why, consider a large set of resources on which all the analysts at an organization need similar permissions:

GRANT SELECT ON events.ingress TO jsmith;
GRANT MODIFY ON events.ingress TO jsmith;
GRANT SELECT ON events.egress TO jsmith;
GRANT MODIFY ON events.egress TO jsmith;
GRANT SELECT ON KEYSPACE endpoints TO jsmith;

The same permissions have been granted to users aburns, tpetty, and many others. If an analyst joins the company, an administrator needs to carefully grant them all the applicable permissions. If the set of resources changes, then the permissions of every analyst need to be updated.

To avoid this problem, an administrator might decide to create an “umbrella” user, like analyst, and have all analysts log in with that username and password whenever they interact with the system. That way, we only have to deal with a permission set for a single user. Unfortunately, by doing this we lose an important security property: non-repudiation. This roughly means that the origin of data can be traced to a particular identity. We may want to know who modified data or accessed a particular table (i.e., we want access auditing), and having a single shared user makes this impossible.

Introducing Roles

One solution to the complexity described above is the use of roles. A role is an identity with a permission set, just like a user. Roles generalize users, though, because a role can also be granted to other roles.

In our example, we could create an analyst role and grant it all of the permissions that analysts need to do their job. The analyst role has no credentials associated with it and cannot log in to the system. We grant analyst to aburns to give aburns all the permissions of analyst. If the permission set for analysts needs to change, we only need to change the analyst role.

A Concrete Example

We’ll briefly go through the example above to demonstrate the CQL syntax of the roles-based system. This particular example is from the master branch of Scylla (specifically at commit 4419e602074c8d647f492612979cd98c677d89d9), as we are preparing for the next release.

First, we create the analyst role and grant it the necessary permissions.

CREATE ROLE analyst;

GRANT SELECT ON events.ingress TO analyst;
GRANT MODIFY ON events.ingress TO analyst;
GRANT SELECT ON events.egress TO analyst;
GRANT MODIFY ON events.egress TO analyst;
GRANT SELECT ON KEYSPACE endpoints TO analyst;

Then we create a user (a role that can log in) for each of the analysts in our system.

CREATE ROLE jsmith WITH LOGIN = true AND PASSWORD = 'jsmith';
CREATE ROLE aburns WITH LOGIN = true AND PASSWORD = 'aburns';
CREATE ROLE tpetty WITH LOGIN = true AND PASSWORD = 'tpetty';

We grant analyst to each.

GRANT analyst TO jsmith;
GRANT analyst TO aburns;
GRANT analyst TO tpetty;

We can inspect the permissions of a user and see that they inherit those of analyst:

LIST ALL PERMISSIONS OF jsmith;

 role    | username | resource               | permission
---------+----------+------------------------+------------
 analyst | analyst  | <table events.egress>  | MODIFY
 analyst | analyst  | <table events.egress>  | SELECT
 analyst | analyst  | <table events.ingress> | MODIFY
 analyst | analyst  | <table events.ingress> | SELECT
 analyst | analyst  | <keyspace endpoints>   | SELECT

The Old USER CQL Statements

Astute readers may be wondering about the old user-based CQL statements: CREATE USER, ALTER USER, DROP USER, and LIST USERS. These still exist, with the same syntax as before.

What is important to understand is that roles generalize users. All roles can be granted permissions, can be granted to other roles, can have authentication credentials, and can be allowed to log in to the system. By convention, when a role is allowed to log in to the system, we call it a user. Therefore, all users are roles, but not all roles are users.

CREATE USER is just like CREATE ROLE (with different syntax), except CREATE USER implicitly sets LOGIN = true.
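
For instance, these two statements are equivalent ways of creating a login-enabled role (a sketch reusing the names from the example above):

CREATE USER tpetty WITH PASSWORD 'tpetty';

CREATE ROLE tpetty WITH LOGIN = true AND PASSWORD = 'tpetty';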

Executing LIST ROLES will display all the roles in the system, but LIST USERS will only display roles with LOGIN = true.
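
Continuing the example above (a sketch; the actual output formatting may differ):

LIST ROLES;  -- analyst, aburns, jsmith, tpetty

LIST USERS;  -- aburns, jsmith, tpetty (analyst has no login)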

Migrating Old Scylla Clusters

With the switch to role-based access control, Scylla internally uses a new schema for storing metadata. Scylla will automatically convert the old user-based metadata into the new format during a cluster upgrade.

When the first node in the cluster is restarted with the new Scylla version, the metadata will be converted with a log message like the following:

INFO 2018-04-05 09:53:53,061 [shard 0] password_authenticator - Starting migration of legacy authentication metadata.
INFO 2018-04-05 09:53:53,065 [shard 0] password_authenticator - Finished migrating legacy authentication metadata.
INFO 2018-04-05 09:53:54,005 [shard 0] standard_role_manager - Starting migration of legacy user metadata.
INFO 2018-04-05 09:53:54,015 [shard 0] standard_role_manager - Finished migrating legacy user metadata.
INFO 2018-04-05 09:53:54,681 [shard 0] default_authorizer - Starting migration of legacy permissions metadata.
INFO 2018-04-05 09:53:54,690 [shard 0] default_authorizer - Finished migrating legacy permissions metadata.

Importantly, we do not support modifying access-control data during a cluster upgrade.

If a client is connected to an already-upgraded node in the midst of an upgrade, all modification statements will fail with an error message about incomplete cluster upgrades.

If a client is connected to an un-upgraded node, then the modification statements will succeed but not be reflected in the upgraded cluster.

The following table describes the old and new metadata tables, with the correspondence between the two where it exists.

Old table                 New table
system_auth.users         system_auth.roles
                          system_auth.role_members
system_auth.credentials
system_auth.permissions   system_auth.role_permissions

Once the cluster has been fully upgraded and you have verified that all access-control information is correct, you can drop the legacy metadata tables:

DROP TABLE system_auth.users;
DROP TABLE system_auth.credentials;
DROP TABLE system_auth.permissions;

Conclusion and Acknowledgments

Roles can make it easier to achieve good security properties in your Scylla cluster and can simplify a lot of common operations.

Please give this new feature a try and provide feedback either as a GitHub issue (in the case of bugs), on the mailing list, or on our Slack Channel.

Adding roles support to Scylla also required restructuring existing support for access-control and many other parts of the system. Thanks to everyone involved for their careful review and input during this process.

Next Steps

  • Learn more about Scylla on our product page.
  • See what our users are saying about Scylla.
  • Download Scylla. Check out our download page to run Scylla on AWS, install it locally in a Virtual Machine, or run it in Docker.
  • Take Scylla for a Test drive. Our Test Drive lets you quickly spin-up a running cluster of Scylla so you can see for yourself how it performs.

The post Role-based Access Control in Scylla appeared first on ScyllaDB.

Which Cassandra Version Should I Use (2018)?

If you’re starting new or in the 3.0.x series: 3.11.2

  • Apache Cassandra 3.0 is supported until 6 months after 4.0 release (date TBD)

If you’re in 2.x, update to the latest in the series (2.1.20, 2.2.12)

  • Apache Cassandra 2.2 is supported until 4.0 release (date TBD)
  • Apache Cassandra 2.1 is supported until 4.0 release (date TBD). Critical fixes only

 

Long Version

– If you’re starting new or in the 3.0.x series: 3.11.2

Stability-wise, both 3.0.16 and 3.11.2 are stable at this point. The biggest advantage of 3.11.2 over 3.0.16 is the additional features that went into the 3.x series (with x>0).
Not all features are desirable, though: move away from Materialized Views, since they are marked as experimental in the latest releases.

Despite this, the Slow Query Log and Change-Data-Capture are examples of really useful features that might make you consider jumping to 3.11.2, as you will not get them in the 3.0.x series. JBOD users should also look at CASSANDRA-6696, which might be interesting.
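
As a quick illustration of the latter (a minimal sketch, with hypothetical keyspace and table names), Change-Data-Capture in the 3.x series is enabled per table with a single property change:

ALTER TABLE ks.events WITH cdc = true;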

– If you’re in 2.x, update to the latest in the series (2.1.20, 2.2.12)

As you might expect, these two releases are very stable, since they have a lot of development time on top of them. If a cluster is still running these Cassandra versions, the best option is to upgrade to the latest release in the respective series (either 2.1.20 or 2.2.12).

To me, the biggest downside of using these versions is the fact that they will probably be the last releases of either Cassandra series. Support for critical bugs is there until 4.0 is released (https://cassandra.apache.org/download/), but besides that no major changes or improvements will come.

An additional thing to consider is that there may not be a direct upgrade path to the 4.x series, so an upgrade may need to be done via 2.x -> 3.x -> 4.x.

But for now, I would stick with the recommendation: keep your current major version if you’re already there and don’t need anything new!

 


Find out how Pythian can help with all of your Cassandra needs.

Contributing to Open Source Software

Open source software is tricky business. One might think a volunteer project that gives you free software is the greatest thing ever; however, making such a project work is complex. Creating open source software is quite simple for a small project with only a few hobbyist maintainers, where making decisions comes down to one person or a very small group, and if users don’t like it they can fork the project and continue on their way without many hurdles. Things are not so simple for large projects with multiple stakeholders, where priorities frequently conflict but the health of the project still relies on all contributors behaving as, well, a community. This is what I’ll be writing about: more specifically, the do’s and don’ts of contributing to large open source projects.

But first, let’s talk about the kind of community that makes up an (large) open source project.

Four main types of contributors for an open source project

  1. The full timer who usually works for a company which utilizes/backs the project. This person is employed by the company to work on the project, usually directed by the company to work on specific bugs and features that are affecting them, and also on larger feature sets. They often work in a team within their company who are also working on the project. In some cases, these full-timers are not dedicated to writing code but more dedicated to the managerial side of the project. Part-timers similar to these also exist.
  2. The part-timer who has a vested interest in the project. Mostly these are consultants, but they could also be from companies that use the software but don’t have enough resources to contribute full-time. Generally, they contribute to the project because it heavily influences their day jobs, and they see users with a certain need. They usually have a very good understanding of the project and will contribute major features/fixes as well as smaller improvements. They may also just be very well-versed users who contribute to discussions, help other users, and document the software.
  3. The part-timer who has some interaction with the software during their day job, but is not dedicated to working on the software. These people often contribute patches related to specific issues they encounter while working with the software. Typically these people are sysadmins or developers. I’d sum these up as “the people that encounter something that annoys them and fix it”.
  4. The users. No point having all this software if there is no one to use it. Users contribute on mailing lists and ancient IRC’s, helping other users get on board with the software. They also give important feedback to the developers on improvements, bug fixes, and documentation, as well as testing and reporting bugs they find. Typically in a large project, they don’t drive features significantly, but it can happen.

There are many other types of contributors to a project, but these (to me) seem to be the main ones for large projects such as Apache Cassandra. You’ll note there is no mention of the hobbyist. While they do exist, in such large projects they usually only come about through extraneous circumstances. It’s quite hard to work on such a large project on the side, as it generally requires a lot of background knowledge, which you can only really grasp if you’ve spent countless hours working with the software already. It is possible to pick up a very small task and complete it without much knowledge of the project as a whole, but such tasks are rare, which results in fewer hobbyists working on the project.

It’s worth noting that all of these contributors are essentially volunteers. They may be employed full time to work on the project, but not by the project. The company employing them volunteers their employees to work on the project.

Now there are a few important things to consider about a large project with a contributor base like the above. For starters, priorities. Every contributor will come to the project with their own set of priorities. These may come from the company they work for, or may be itches they want to scratch, but generally contributors will be directed to work on certain bugs/features, and these will not always coincide with other contributors’ priorities. This is where managing the project gets complicated. The project has a bunch of volunteers, and these need to be organized in a way that will produce stable, functioning software that meets the needs of the user base, at least in a somewhat timely fashion. The project needs to be kept healthy and needs to continue satisfying its users’ needs if it is to survive. However, the needs of the users and the needs of the people writing the code often don’t intersect, and the two groups don’t always see eye to eye. On a project run by volunteers this is important to consider when you’re asking for something, because although you may have a valid argument, there might not be someone who wants to make the contribution, and even if there is, they might not have a chance to work on it for a long time/ever.

Do’s

  1. Take responsibility for your contributions. I’ve noted it’s a common opinion that developers are only beholden to their employer, but this is not true. If you wrote code in an open source project, you’re still responsible for the quality and performance of that code. Your code affects other people, and when it gets into an open source project you have no idea what that code could be used for. Just because you’re not liable doesn’t mean you shouldn’t do a good job.
  2. Be polite and respectful in all discussions. This is very important if you want to get someone to help you in the project. Being rude or arrogant will immediately get people off-side and you’ll have a very hard time making meaningful contributions.
  3. Be patient. Remember OSS is generally volunteer-based, and those volunteers typically have priorities of their own and may not be able to be prompt on your issues. They’ll get to it eventually; nudge them every now and again, just don’t do it all the time. I recommend picking up a number of things you can occupy yourself with while you wait.
  4. Contribute in any way you can. Every contribution is important. Helping people on the mailing list/public forums, writing documentation, testing, reporting bugs, verifying behavior, writing code, and contributing to discussions are all great ways to contribute. You don’t have to do all of them, and a little bit of help goes a long way. This will help keep open source software alive, and we all want free software, don’t we?

Don’ts

  1. Don’t assume that just because you have an idea that other people will think it’s good. Following that, don’t assume that even if it is good, someone else will be willing to implement it.
  2. Don’t assume that OSS is competing with any other software. If something better comes along (subject to licensing), it would make sense for the effort to be directed towards the new software. The only thing keeping the project alive is that people are using it. If it stops being relevant, it will stop being supported.
  3. Don’t expect other volunteers to work for you. If you have a great idea, you must still be prepared to wait and get involved yourself to get it implemented. The nature of large OSS projects is that there are always more ideas than there are people to implement them, and contributors are more likely to prioritize their own ideas over yours. If you can do some of the legwork toward getting your ideas in place (proofs of concept, design documents, validation, etc.), it will go a long way toward making your idea a reality.
  4. Don’t expect to show up and be listened to. It takes years of working with a large project before you have enough knowledge (and wisdom) to make significant improvements. If you just show up and throw your ideas about like they’re the best thing since sliced bread, you’ll likely put existing contributors on edge. Start small and incrementally build a reputation, so that people will give your ideas the consideration they deserve.
  5. Don’t waste people’s time. It may seem harsh, but things like not providing enough detail to diagnose problems when reporting bugs are huge time wasters and generally lead to your problems getting lost in the backlog. Make sure you always search the backlog for existing related issues, and be prepared to provide all relevant information for the best chance of your request being implemented.

Hopefully this gives a good overview of the kind of community that makes up an open source project and a good idea of what you’re dealing with when you’re looking to contribute to <insert favorite OSS software here>. If you follow these simple do’s and don’ts, you’ll have the best chance of success when making contributions. Don’t hold off, contribute today!

The post Contributing to Open Source Software appeared first on Instaclustr.