Today, we are very excited to unveil some of the most
critical performance enhancements we’ve ever made to
DataStax Enterprise (DSE). Enterprise workloads
are becoming more and more demanding each day, so we took the
opportunity to channel the amazing engineering talent at DataStax
to re-architect how we take advantage of compute, memory, and
storage. It’s not just about speed, either; we’ve made DSE more
efficient and resilient to meet the most demanding
workloads.
We’ve named our new suite of performance optimizations,
utilities, and tools “DataStax Advanced Performance”. The
best part? You just need to upgrade to enjoy these out-of-the-box
benefits, which include:
New thread-per-core and asynchronous architecture, which
results in double the read/write performance of open source Apache
Cassandra
Storage engine optimizations that cut read/write latencies in open source Cassandra in half
Faster analytics that can deliver up to 3x query
speed-ups over open source Apache Spark
DataStax Bulk Loader, which loads and unloads data up to
4x faster than current Cassandra load tools
Thread-Per-Core and Asynchronous Architecture
Apache Cassandra uses a traditional staged event-driven
architecture (SEDA). With the SEDA architecture, Cassandra assigns
thread pools to events or tasks and connects them via a message
service. This architecture also uses multiple threads per task,
meaning that threads need to be coordinated. Additionally, events
in this architecture are synchronous, which can cause contention
and slowdowns. Because of this, adding CPU cores eventually yields
diminishing returns.
DSE 5.1 left, DSE 6 right – With the traditional SEDA
architecture, we see much more context switching, which is expensive
and degrades performance.
DSE 6 has a coordination-free, thread-per-core architecture that yields incredible performance gains. The whole
data layer across DSE, including search, analytics, and graph,
benefits from this new architectural change.
cassandra-stress, 5 nodes, RF=3, CL=QUORUM, 500GB
density
Each node in a cluster owns part of the token range;
that’s not new. What’s new is that a respective node’s token range
is divided evenly among CPU threads: one per CPU core to be exact.
A respective thread is now responsible for incoming writes for the
part of the token range it owns, and any available thread can be
used to handle read requests. This means that evenly distributed
data in the cluster results in evenly distributed CPU utilization
on a server. This architecture also means that very little
coordination is needed between threads, which ensures that a CPU
core can be used to its fullest capabilities.
cassandra-stress, 5 nodes, RF=3, CL=One, 500GB
density
Since a single thread owns the writes for its respective
token range, what about contention? In DSE 6, we’ve moved reads,
writes, and other tasks from synchronous operations to
asynchronous. This allows us to eliminate thread contention, always
keeping threads working. Combined with the thread-per-core
architecture, this allows us to scale performance linearly as we
scale the number of CPU cores. This is extremely important as
multi-socket motherboards and high core-count cloud instances have
become the standard.
Storage Engine Optimizations
Besides ingesting and serving data faster with
thread-per-core, we’ve also made improvements to the storage engine
that improve latency and optimize compaction, which can also
be a bottleneck for write-heavy workloads. In DSE 6, compaction is 22% faster than in DSE 5.1, which is already 2x faster than open source Cassandra. We’re also seeing
latency improvements of 2x on reads and writes.
In DSE 5.1, we introduced 2x compaction performance over
Apache Cassandra. In DSE 6, compaction is even
faster.
Also included in DSE Advanced Performance is improved analytics read performance: 3x over open source Cassandra
and Spark. This was made possible by a feature called Continuous
Paging, an optimization designed specifically for DataStax
analytics queries that scan a large portion of data. We have tested
this in a number of scenarios: selecting all columns or some
columns, with or without a clustering-column predicate, and in all
scenarios we see a 2.5 to 3.5x performance improvement.
3x analytics read performance over open source Spark and
Apache Cassandra.
DataStax Bulk Loader
Also new in DSE 6 is a bulk loader utility that greatly
outpaces current Cassandra load/unload utilities. The command-line tool handles standard delimited formats as well as JSON and
can load and unload data up to 4x faster than current tools.
Conclusion
We’re extremely excited for our customers to experience
the new Advanced Performance capabilities of DSE 6. With a 2x
throughput improvement, massive latency improvements, a 22%
compaction improvement, a 3x analytics improvement, and a
crazy-fast bulk loader, we can’t wait to see the kinds of
innovation and disruption our customers will continue to
make.
To download DSE 6 and for more information on DSE Advanced
Performance, check out this
page.
We’ve got something really special for administrators
in DataStax Enterprise (DSE) 6: DSE NodeSync, designed with
operational simplicity in mind, can virtually eliminate manual
efforts required to run repair operations in a DataStax
cluster.
NodeSync
To understand NodeSync, let’s talk about how we got here.
One of the most important mechanisms for an administrator to run in
Apache Cassandra is anti-entropy repair. Despite its name, repair
is a process that should always be running in a cluster to ensure
that data between nodes is consistent.
The fundamentals of repair haven’t changed since it was
initially introduced many years ago. It’s designed as a single-process bulk operation that runs continuously for a long time, which means that when a failure occurs, you must begin the repair over again. Repair is also computationally and network intensive, as it creates Merkle trees and streams them between nodes.
The
longer classic repair runs, the more failure-prone it
is.
To help mitigate some of these problems, complex tools
were built to help orchestrate and add some structure and
resiliency to repair. These tools try to split the repair process
into multiple, more manageable pieces in an effort to improve
operational simplicity, but in the end, these client-side tools
were built to solve issues with a server-side mechanism. There’s
only so much that can be done with tooling.
Enter NodeSync: NodeSync is a ground-up rethinking of how
we do entropy resolution in a DataStax cluster. Once you install
DSE 6, NodeSync automatically starts running in the background. You
simply tell it which keyspaces or tables you’d like managed with
NodeSync, and it handles the rest. No more compute-intensive tasks,
no more complex tooling, just hands-off repair
operations.
Enabling NodeSync on a table is as easy as an ALTER TABLE command.
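As a rough sketch (demo.users is a hypothetical keyspace and table name), enabling NodeSync should look something like this:
ALTER TABLE demo.users WITH nodesync = { 'enabled' : 'true' };
Setting 'enabled' to 'false' in the same statement form should turn it back off for that table.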
NodeSync is designed to be simple and reliable. It divides
the work it must complete into small tasks. These tasks are always
tracked so it knows which data has been synchronized and which
hasn’t. It also acts as a checkpoint mechanism so that if a node
goes down, NodeSync knows exactly where to start again. NodeSync is
also self-managing in that it will prioritize what to synchronize
based on the last time the data was synced and whether it failed or
not.
Easily
enable/disable NodeSync on tables through OpsCenter
While NodeSync is designed to be as hands-off as possible,
we know how important it is for administrators to understand what’s
happening in the cluster, so we’ve also updated OpsCenter to monitor
NodeSync progress for you.
OpsCenter
6.5 lets you monitor NodeSync progress
Conclusion
We know our customers are going to love NodeSync as it’s
designed to make operations simpler with DataStax. Eliminating the
need to orchestrate and manage repair means that administrators
spend less time managing their DataStax clusters and more time
doing other important tasks. To download DSE 6, and to get more
information about NodeSync, please check out this
page.
Samsung SDS is a global IT services and solutions company with
57 offices spread across 31 countries. They are tasked with
implementing highly performant and scalable systems for a number of
Samsung businesses. However, they were experiencing a number of
issues at the database layer. For example, their relational
database couldn’t meet the performance requirements of several
business use cases. As a result, they decided to conduct an
in-depth technical evaluation of NoSQL databases.
Samsung SDS was looking for a database with high throughput,
scalability, low latency, ease of deployment and maintenance, and
reduced operational costs. They decided to compare Scylla against
Apache Cassandra. In the
proof-of-concept, Scylla delivered 3X better throughput and
latency than Cassandra. The team also noted price and ease of
maintenance as key factors when deciding on Scylla.
“Its excellent performance means we can save lots of money
adopting Scylla over other NoSQL databases.” – Kuyul Noh, Principal
Data Architect, Samsung SDS
Hear from Samsung SDS about their experiences with Scylla in the
video below.
Each time we have a major release, I look back and think
there’s no way our team can top it; that future releases will
somehow be less than what just went out the door. But every time
I’m proven wrong when our next release becomes GA, and there’s no
better example of that than what we’re announcing today.
DataStax
Enterprise (DSE) 6 represents a major win for our
customers who require an always-on, distributed database to support
their modern real-time (what we call ‘Right-Now’) applications,
particularly in a hybrid cloud environment. Not only does it
contain the best distribution of Apache Cassandra, but it represents the only hybrid cloud
database capable of maintaining and distributing your data in any
format, anywhere – on-premise, in the cloud, multi-cloud, and
hybrid-cloud – in truly data autonomous fashion.
Let me take you on a quick tour of what’s inside the DSE 6
box, as well as OpsCenter 6.5, DataStax Studio 6, and DSE Drivers, and show you how our team has knocked yet
another one out of the park.
Double the Performance
Enterprises with Right-Now applications know they
have three seconds – just three seconds – to keep a customer waiting before almost half of them click away to a competitor.
Because these apps are constantly interacting with a database that
holds the contextual info needed for producing a personalized
customer experience, it’s vital that the database not play a part
in exceeding those three seconds.
Exceeding the high bar of speed expectations set by
today’s digital consumer is tough, but DSE has been doing it for
some time now, and with version 6, things only get better. DSE Advanced Performance is a new set of
performance-related optimizations, technologies, and tools that
dramatically increase DSE’s performance over its foundational open
source components as well as its competitors.
To start, new functionality designed to make Cassandra
more efficient with high-compute instances has resulted in a 2x or
more out-of-the-box gain in throughput for both reads and writes.
Note that these speed and throughput increases apply to all areas
of DSE, including analytics, search, and graph. A new diagnostic
testing framework developed by DataStax helped pinpoint performance
optimization opportunities in Cassandra, with more enhancements
coming in future releases.
Next, DSE 6 includes our first ever advanced Apache
Spark integration (over the open source work we’ve
done for Spark in the past) that delivers a number of
improvements, as well as a 3x query performance
increase.
Lastly, loading and unloading large volumes of data is
still a very pressing need for many enterprises. DSE 6 answers this
call with our new DataStax Bulk Loader that’s built to rapidly move data in
and out of the platform at impressive rates – up to 4x faster than
current data loading utilities.
All of these performance improvements have been designed
with our customers in mind so that their Right-Now applications
deliver a better-than-expected customer experience by processing
more orders, fielding more queries, performing faster searches, and
moving more data faster than ever before. If an app’s response time
exceeds three seconds, it won’t be because of DSE.
Self-Driving Operational Simplicity
In designing DSE 6, we listened to both DataStax customers
and the Cassandra community. While the interests of these groups
sometimes diverge, they do have a few things in common.
It turns out that helping with Cassandra repair operations
is a top priority for both. For some, Cassandra repairs aren’t a
big deal, but for others they are a PITA (pain in the AHEM). Don’t
get repair right in a busy and dynamic cluster, and it’s just a
matter of time until you have production-threatening
issues.
While we introduced an OpsCenter-based repair service some
years ago, it was limited to repair functionality available at the
Cassandra level. Knowing that a server-based approach is what
Cassandra users want, our talented engineering team has
delivered DSE NodeSync, which essentially makes DSE ‘repair
free’ by operating in a transparent and continuous fashion to keep
data synchronized in DSE clusters.
If you like your current repair setup, keep it. But if you
want to eliminate scripting, manual intervention, and piloting
repair operations, you can turn on NodeSync and be done. It works
at the table level so you have strong flexibility and granularity
with NodeSync, plus it can be enabled either with CQL or visually
in OpsCenter.
Something else we’ve added to version 6 is DSE TrafficControl, which delivers advanced resiliency
that ensures DSE nodes stay online under extreme workloads. Under
severe concurrent request traffic, there have been cases of open
source Cassandra nodes going offline due to the abnormal pressure.
DSE TrafficControl has intelligent queueing, not found in open
source, that prevents this from happening on DSE nodes.
Another area for improvement on which open source users
and DataStax customers agree is upgrades. No technical pro that I
know looks forward to upgrading their database software, regardless
of the vendor used.
I’m happy to say we now provide automated help for
upgrades with our new Upgrade service that’s a part of OpsCenter 6.5. Our new
upgrade functionality effortlessly handles patch upgrades by
notifying you that an upgrade is available, downloading the
software you need, applying it to a cluster in a rolling restart
fashion so you experience zero downtime, and freeing you up to do
other things.
These management improvements and others are directly
aimed at increasing your team’s productivity and letting you focus
on business needs vs. operational overhead. The operational
simplicity allows even novice DBAs and DevOps professionals to run
DSE 6 like seasoned professionals. Ultimately that means much
easier enterprise-wide adoption of data management at
scale.
Analyze (and Search) This!
Forrester ranked DataStax a leader in their Translytical Wave, and for good reason: DSE
provides the translytical functionality needed by Right-Now apps
that meld transactional and analytical data together. For years,
DataStax has provided 100% of the development needed to freely
integrate open source Spark and Cassandra, but with DSE 6, we’re
kicking things up a notch (or two).
For the first time, we’re introducing our advanced Spark
SQL connectivity layer that provides a new AlwaysOn SQL Engine that automates uptime for applications connecting to DSE Analytics. This makes DSE Analytics even more capable of handling around-the-clock analytics requests and better able to support interactive end-user analytics, while leveraging your
existing SQL investment in tools (e.g. BI, ETL) and
expertise.
I’d also like to give a shout-out to the recently
introduced DSE Analytics Solo, a subscription option that gives a more cost-effective way to
isolate analytic workloads in order to achieve predictable
application performance.
We also have great news for analytics developers and
others who want to directly query and interact with data stored in
DSE Analytics. DataStax Studio 6 provides notebook support for
Spark SQL, which means you now have a visual and intelligent
interface and query builder that helps you write Spark SQL queries
and review the results – a huge time saver! Plus you can now
export/import any notebook (graph, CQL, Spark SQL) for easy
developer collaboration as well as undo notebook changes with a new
versioning feature.
Finally, let’s not forget the critical role search
functionality plays in apps that rely on contextual and converged
data.
DSE Search has upped its game in this area by
delivering CQL support for common search queries, such as those
that use LIKE, IN, range searches, and more.
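As a rough sketch of what that enables (products, name, price, and category are hypothetical table and column names, and a DSE Search index on the table is assumed), such queries can be written directly in CQL:
SELECT * FROM products WHERE name LIKE 'data%';
SELECT * FROM products WHERE price >= 10 AND price <= 100;
SELECT * FROM products WHERE category IN ('books', 'music');
Without a search index, queries like these on non-key columns would normally require ALLOW FILTERING, if they are accepted at all.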
Supporting Distributed Hybrid Cloud
Over 60% of DataStax customers currently deploy DSE in the
cloud, which isn’t surprising given that our technology has been
built from the ground up with limitless data distribution and the
cloud in mind. Customers run DSE today on AWS, Azure, GCP, Oracle
Cloud, and others, as well as private clouds of course.
DataStax
Managed Cloud, which currently supports both AWS
and Azure, will be updated to support DSE 6, so all the new
functionality in our latest release is available in managed form.
Whether fully managed or self-managed, our goal is to provide you
with multi and hybrid cloud flexibility that supplies all the
benefits of a distributed cloud database without public cloud
lock-in.
Yes, There’s Actually More…
I’d be remiss if I didn’t also mention additions to
our DSE Advanced Security package that contains
new separation of duties capabilities and unified authentication
support for DSE Analytics, the backup enhancements we’ve done for
cloud operations, or all the updates to our DSE drivers. Like I
mentioned at the beginning of this post, our team always
delivers.
With DSE 6, we want you to enjoy all the heavy-lifting
advantages of Cassandra with none of the complexities and also get
double the power. Downloads, free online training, and other resources are now available, so give DSE 6 a try (also now available
for non-production development environments via Docker Hub) and let us
know what you think.
We started our Docker journey in 2014 and began
exploring orchestration technologies shortly thereafter. In the
fall of 2015, we announced production support for customer-built
Docker images and offered best-practice guidance for anyone
integrating DataStax Enterprise (DSE) into Docker.
Today we are making DataStax-built images widely available
for non-production use by hosting them in Docker Hub. We want the
images to be as easy for you to use as they are for our internal
teams.
Internally, we use Docker for many of our unit,
integration, and functional tests. Doing so enables us to run many
tests in parallel on a single machine and/or test cluster. With
this approach we’ve crunched 15+ hours of total testing times into
20-to-60-minute testing rounds. The result is that our developers
get feedback much faster, and we want our customers to have this
same experience! To learn more about our testing strategy, check out
Predrag Knezevic’s
Cassandra Summit talk.
We also use the DataStax images to power our reference
application,KillrVideo.
The official images ensure we are using stable versions and make testing various configurations quick and easy, and by eliminating much of
the setup work we enable users to more quickly learn and understand
the platform.
We want to see what you build from the images and will
showcase examples within our GitHub account. Existing examples include partner
integrations (StreamSets), Security Configuration (LDAP),
KillrVideo, and advanced examples of using Docker Compose. Create a
pull request to add your own.
This brings us to configuration and customization. We are
also providing:
Docker Compose scripts to enable you to easily deploy
clusters and expose the components (DSE/Opscenter/Studio) to each
other.
Access to the GitHub repo for developers who want to customize the
images
We also want to make these images universally applicable
to all your key use cases. For simple use cases, we’ve exposed
common settings as environment variables. For advanced
configuration management, we’re providing a simple mechanism to let
you change or modify configurations without replacing or
customizing the containers. You can add any of the approved config
files to a mounted host volume and we’ll handle the hard work of
mapping them within the container. You can read more about that
feature here.
Lastly, adoption and feedback will drive these to approval
for production use. Here are a few ways to provide
input:
Bathroom tile? Grandma's needlepoint? Nope. It's a diagram of
the dark
web. Looks surprisingly like a
tumor.
If you like this sort of Stuff then please support me
on Patreon. And I'd
appreciate if you would recommend my new book—Explain the Cloud Like I'm
10—to anyone who needs to understand the cloud (who doesn't?).
I think they'll learn a lot, even if they're already familiar with
the basics.
$23
billion: Amazon spend on R&D in 2017; $0.04:
cost to unhash your email address;
$35: build your own LIDAR; 66%: links to
popular sites on Twitter come from bots;
60.73%: companies report JavaScript as primary
language; 11,000+: object
dataset provides real objects with associated depth
information; 150
years: age of the idea of privacy;
~30%: AV1's better video compression;
100s of years: rare-earth materials found underneath
Japanese waters; 67%: better image
compression using Generative Adversarial Networks; 1000 bit/sec: data
exfiltrated from air-gapped computers through power lines using
conducted emissions;
Quotable Quotes:
@Susan_Hennessey:
Less than two months ago, Apple announced its decision to move
mainland Chinese iCloud data to state-run servers.
@PaulTassi:
Ninja's New 'Fortnite' Twitch Records: 5 Million Followers, 250,000
Subs, $875,000+ A Month via @forbes
@iamtrask:
Anonymous Proof-of-Stake and Anonymous, Decentralized Betting
markets are fundamentally rule by the rich. If you can write a big
enough check, you can cause anything to happen. I fundamentally
disagree that these mechanisms create fair and transparent
markets.
David
Rosenthal: The redundancy needed for protection is frequently
less than the natural redundancy in the uncompressed file. The
major threat to stored data is economic, so compressing files
before erasure coding them for storage will typically reduce cost
and thus enhance data survivability.
@mjpt777: The
more I program with threads the more I come to realise they are a
tool of last resort.
JPEG XS: For the first time in the history of image coding, we
are compressing less in order to better preserve quality, and we
are making the process faster while using less energy. Expected to
be useful for virtual reality, augmented reality, space imagery,
self-driving cars, and professional movie editing.
Martin Thompson: 5+ years ago it was pretty common for folks to
modify the Linux kernel or run cut down OS implementations when
pushing the edge of HFT. These days the really fast stuff is all in
FPGAs in the switches. However there is still work done on
isolating threads to their own exclusive cores. This is often done
by exchanges or those who want good predictable performance but not
necessarily be the best. A simple way I have to look at it. You are
either predator or prey. If predator then you are mostly likely on
FPGAs and doing some pretty advanced stuff. If prey then you don't
want to be at the back of the herd where you get picked off. For
the avoidance of doubt if you are not sure if you are prey or
predator then you are prey. ;-)
Brian Granatir: serverless now makes event-driven architecture
and microservices not only a reality, but almost a necessity.
Viewing your system as a series of events will allow for resilient
design and efficient expansion. DevOps is dead. Serverless
systems (with proper non-destructive, deterministic data management
and testing) means that we’re just developers again! No calls at
2am because some server got stuck?
@chrismunns: I
think almost 90% of the best practices of #serverless are general
development best practices. be good at DevOps in general and you'll
be good at serverless with just a bit of effort
David Gerard: Bitcoin has failed every aspiration that Satoshi
Nakamoto had for it.
@joshelman: Fortnite
is a giant hit. Will be bigger than most all movies this
year.
@swardley: To
put it mildly, the reduction in obscurity of cost through
serverless will change the way we develop, build, refactor, invest,
monitor, operate, organise & commercialise almost everything. Micro
services is a storm in a tea cup compared to this category 5.
James Clear: The 1 Percent Rule is not merely a reference to
the fact that small differences accumulate into significant
advantages, but also to the idea that those who are one percent
better rule their respective fields and industries. Thus, the
process of accumulative advantage is the hidden engine that drives
the 80/20 Rule.
Abraham Lincoln: Give me six hours to chop down a tree and
I will spend the first four sharpening the axe.
@RichardWarburto: Pretty
interesting that async/await is listed as essentially a sequential
programming paradigm.
@PatrickMcFadin: "Most
everyone doing something at scale is probably using #cassandra" Oh.
Except for @EpicGames and @FortniteGame They went with
MongoDB.
Meetup: In the CloudWatch screenshot above, you can see
what happened. DynamoDB (the graph on the top) happily handled 20
million writes per hour, but our error rate on Lambda (the red line
in the graph on the bottom) was spiking as soon as we went above 1
million/hour invocations, and we were not being throttled. Looking
at the logs, we quickly understood what was happening. We were
overwhelming the S3 bucket with PUT requests.
Sarah Zhang: By looking at the polarization pattern in water
and the exact time and date a reading was taken, Gruev realized
they could estimate their location in the world. Could marine
animals be using these polarization patterns to navigate through
the ocean?
Vinod Khosla: I have gone through an exercise of trying to just
see if I could find a large innovation coming out of big companies
in the last twenty five years, a major innovation (there’s plenty
of minor innovations, incremental innovations that come out of big
companies), but I couldn’t find one in the last twenty five
years.
Click through for lots more quotes.
Don't miss all that the Internet has to say on Scalability,
click below and become eventually consistent with all scalability
knowledge (which means this post has many more items to read so
please keep on reading)...
IBM had previously used only Apache Cassandra and HBase as
storage back-ends for the graph databases it makes available on IBM
Cloud. Having heard about the advantages of Scylla, IBM’s Open Tech
and Performance teams conducted a series of tests to compare Scylla
with HBase and Apache Cassandra.
The IBM team learned quite a lot from their performance tests.
In their first test, which generated a load of 40,000,000 vertices
with two properties, Scylla displayed nearly 35% higher throughput
than HBase and almost 3X Cassandra’s throughput. In their second
test, which consisted of randomly picking 30,000,000 pairs of
vertices and entering 30,000,000 edges into it with one property,
Scylla’s throughput was 160% better than HBase and more than 4X
that of Cassandra.
In addition to its performance advantages, Scylla was also the
easiest database to cluster, especially when adding multiple nodes
to a cluster. The IBM Compose team was very pleased to see Scylla’s
self-tuning capabilities, load balancing, and its ability to fully
utilize the available system resources.
Hear what IBM’s teams have to say about their experiences with
Scylla in this video.
The next open-source release (version 2.2) of Scylla will
include support for role-based access control. This feature was
introduced in version 2.2 of Apache Cassandra. This post starts
with an overview of the access control system in Scylla and some of
the motivation for augmenting it with roles. We’ll explain what
roles are and show an example of their use. Finally, we’ll cover
how Scylla transitions existing access-control data to the new
roles-based system when you upgrade a cluster.
Access Control in Scylla
There are two aspects of access control in Scylla: controlling
client connections to a Scylla node
(authentication), and controlling which operations
a client can execute (authorization).
By default, no access control is enabled on Scylla clusters.
This means that a client can connect to any node unrestricted, and
that the client can execute any operation supported by the
database.
When we enable access-control (which is described in Scylla’s
documentation),
there are two important changes to Scylla’s behavior:
A client cannot connect to a node unless it provides valid
credentials for an identity known to the system (a username and
password)
A CQL query can only be executed if the authenticated identity
has been granted the applicable permissions on the database objects
involved in the query
For example, a logged-in user jsmith will only be permitted to
execute
SELECT * FROM events.ingress;
if jsmith has been granted (directly or indirectly)
the SELECT permission on the events.ingress table.
One way to grant jsmith the permissions they need
is to grant SELECT on the entirety of the
events keyspace. This encompasses all tables in the
keyspace as well.
GRANT SELECT ON KEYSPACE events TO jsmith;
We can verify the permissions granted to
jsmith:
LIST ALL PERMISSIONS OF jsmith;
 role   | username | resource          | permission
--------+----------+-------------------+------------
 jsmith | jsmith   | <keyspace events> | SELECT
Limitations of User-based Access Control
Access control based only on users can quickly become unwieldy. To
see why, consider a large set of resources that all analysts at an
organization need to have similar permissions on.
GRANT SELECT ON events.ingress TO jsmith;
GRANT MODIFY ON events.ingress TO jsmith;
GRANT SELECT ON events.egress TO jsmith;
GRANT MODIFY ON events.egress TO jsmith;
GRANT SELECT ON KEYSPACE endpoints TO jsmith;
The same permissions have been granted to users
aburns, tpetty, and many others. If an
analyst joins the company, then an administrator needs to carefully
grant them the applicable permissions. If the set of resources
changes, then all the analysts need to be modified with the updated
permissions.
To avoid this problem, an administrator might decide to
create an “umbrella” user, like analyst, and have all
analysts log in with that username and password whenever they
interact with the system. That way, we only have to deal with a
permission set for a single user. Unfortunately, by doing this, we
lose an important security property:
non-repudiation. This roughly means that the
origin of data can be traced to a particular identity. We may want
to know who modified data or accessed a particular table (i.e., we
want access auditing), and having a single user makes this
impossible.
Introducing Roles
One solution to the complexity described above is the use of
roles. A role is an identity with a permission set, just like a
user. Roles generalize users, though, because a role can also be
granted to other roles.
In our example, we could create an analyst role and
grant them all of the permissions that analysts need to do their
job. An analyst has no credentials associated with it
and cannot log in to the system. We grant analyst to
aburns to give aburns all the permissions
of analyst. If the permission set for analysts needs
to change, we only need to change the analyst
role.
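For example (reports is a hypothetical keyspace name here), if analysts later need read access to a new keyspace, a single grant to the role is enough, and every user who has been granted analyst picks it up automatically:
GRANT SELECT ON KEYSPACE reports TO analyst;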
A Concrete Example
We’ll briefly go through the example above to demonstrate the
CQL syntax of the roles-based system. This particular example is
from the master
branch of Scylla (specifically at commit
4419e602074c8d647f492612979cd98c677d89d9), as we are
preparing for the next release.
First, we create the analyst role and grant them
the necessary permissions.
CREATE ROLE analyst;
GRANT SELECT ON events.ingress TO analyst;
GRANT MODIFY ON events.ingress TO analyst;
GRANT SELECT ON events.egress TO analyst;
GRANT MODIFY ON events.egress TO analyst;
GRANT SELECT ON KEYSPACE endpoints TO analyst;
Then we create a user that can log in for each of the analysts in
our system.
CREATE ROLE jsmith WITH LOGIN = true AND PASSWORD = 'jsmith';
CREATE ROLE aburns WITH LOGIN = true AND PASSWORD = 'aburns';
CREATE ROLE tpetty WITH LOGIN = true AND PASSWORD = 'tpetty';
We grant analyst to each.
GRANT analyst TO jsmith;
GRANT analyst TO aburns;
GRANT analyst TO tpetty;
We can inspect the permissions of a user and see that they
inherit those of analyst:
LIST ALL PERMISSIONS OF jsmith;
 role    | username | resource               | permission
---------+----------+------------------------+------------
 analyst | analyst  | <table events.egress>  | MODIFY
 analyst | analyst  | <table events.egress>  | SELECT
 analyst | analyst  | <table events.ingress> | MODIFY
 analyst | analyst  | <table events.ingress> | SELECT
 analyst | analyst  | <keyspace endpoints>   | SELECT
The Old USER CQL Statements
Astute readers may be wondering about the old user-based CQL
statements: CREATE USER, ALTER USER, DROP USER, and LIST
USERS. These still exist, with the same syntax as they had before.
What is important to understand is that roles generalize users.
All roles can be granted permissions, can be granted to other
roles, have authentication credentials, and can be allowed to log in to the system. By convention, when a role is allowed to log in to the system, we call it a user. Therefore, all users are roles but
not all roles are users.
CREATE USER is just like CREATE ROLE
(with different syntax), except CREATE USER implicitly
sets LOGIN = true.
Executing LIST ROLES will display all the roles in
the system, but LIST USERS will only display roles
with LOGIN = true.
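As a quick illustration (jdoe is a hypothetical name), the two statements below are interchangeable ways of creating the same login-enabled role, and the result shows up in both LIST ROLES and LIST USERS:
CREATE USER jdoe WITH PASSWORD 'jdoe';
-- ...is equivalent to:
CREATE ROLE jdoe WITH LOGIN = true AND PASSWORD = 'jdoe';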
Migrating Old Scylla Clusters
With the switch to role-based access control, Scylla internally
uses a new schema for storing metadata. Scylla will automatically
convert the old user-based metadata into the new format during a
cluster upgrade.
When the first node in the cluster is restarted with the new
Scylla version, the metadata will be converted with a log message
like the following:
INFO 2018-04-05 09:53:53,061 [shard 0] password_authenticator -
Starting migration of legacy authentication metadata.
INFO 2018-04-05 09:53:53,065 [shard 0] password_authenticator -
Finished migrating legacy authentication metadata.
INFO 2018-04-05 09:53:54,005 [shard 0] standard_role_manager -
Starting migration of legacy user metadata.
INFO 2018-04-05 09:53:54,015 [shard 0] standard_role_manager -
Finished migrating legacy user metadata.
INFO 2018-04-05 09:53:54,681 [shard 0] default_authorizer -
Starting migration of legacy permissions metadata.
INFO 2018-04-05 09:53:54,690 [shard 0] default_authorizer -
Finished migrating legacy permissions metadata.
Importantly, we do not support modifying
access-control data during a cluster upgrade.
If a client is connected to an already-upgraded node in the
midst of an upgrade, all modification statements will fail with an
error message about incomplete cluster upgrades.
If a client is connected to an un-upgraded node, then the
modification statements will succeed but not be reflected in the
upgraded cluster.
The following table describes the old and new metadata tables,
with the correspondence between the two if it exists.
Old table               | New table
------------------------+------------------------------
system_auth.users       | system_auth.roles
                        | system_auth.role_members
system_auth.credentials |
system_auth.permissions | system_auth.role_permissions
Once the cluster has been fully upgraded and you have verified
that all access-control information is correct, you can drop the
legacy metadata tables:
DROP TABLE system_auth.users;
DROP TABLE system_auth.credentials;
DROP TABLE system_auth.permissions;
Conclusion and Acknowledgments
Roles can make it easier to achieve good security properties in
your Scylla cluster and can simplify a lot of common
operations.
Please give this new feature a try and provide feedback either
as a GitHub issue (in the case
of bugs), on the mailing list, or on our
Slack Channel.
Adding roles support to Scylla also required restructuring
existing support for access-control and many other parts of the
system. Thanks to everyone involved for their careful review and
input during this process.
If you’re starting new or in the 3.0.x series: 3.11.2
Apache Cassandra 3.0 is supported until 6 months after 4.0
release (date TBD)
If you’re in 2.x, update to the latest in the series (2.1.20,
2.2.12)
Apache Cassandra 2.2 is supported until 4.0 release (date
TBD)
Apache Cassandra 2.1 is supported until 4.0 release (date TBD).
Critical fixes only
Long Version
– If you’re starting new or in the 3.0.x series:
3.11.2
Stability-wise, both 3.0.16 and 3.11.2 are stable at this point.
The biggest advantage of 3.11.2 vs 3.0.16 is the additional
features that went into the 3.x series (with x>0).
Not all features are desirable though. (Move away from Materialized
Views, since they are marked as experimental on the latest
releases).
Despite this, the Slow Query Log and Change-Data-Capture are
examples of really useful ones that might make you consider jumping to 3.11.2, as you will not get them in the 3.0.x series. JBOD users should also look at CASSANDRA-6696, which might be interesting.
– If you’re in 2.x, update to the latest in the series
(2.1.20, 2.2.12)
As you might expect, these two releases are very stable, since
they have a lot of development time on top of them. If a cluster is
still running these Cassandra versions, the best option is to upgrade to
the latest releases in the respective series (either 2.1.20 or
2.2.12).
To me, the biggest downside of using these versions is the fact that they will probably be the last releases of either Cassandra series. Support for critical bugs continues until 4.0 is released (https://cassandra.apache.org/download/), but besides that, no major changes or improvements will come.
An additional thing to consider is that there may not be a direct upgrade path to the 4.x series, so an upgrade may need to be done via 2.x -> 3.x -> 4.x.
But for now, I would stick with the recommendation: keep your current major version if you’re already there and don’t need anything new!
Find out how Pythian can help with all of your
Cassandra needs.
Open source software is tricky business. One might think a
volunteer project that gives you free software is the greatest
thing ever; however, making such a project work is complex. Creating
open source software is quite simple for a small project with only
a few hobbyist maintainers, where making decisions comes down to
only one person or a very small group, and if users don’t like it
they can fork the project and continue on their way without many
hurdles. Things are not so simple for large projects with multiple
stakeholders, where priorities are frequently conflicting but the
health of the project still relies on all contributors behaving as,
well, a community. This is what I’ll be writing about, and more
specifically the do’s and don’ts when contributing to large open
source projects.
But first, let’s talk about the kind of community that
makes up a (large) open source project.
Four main types of contributors for an open source
project
The full-timer who usually works for a company that
utilizes/backs the project. This person is employed by the company
to work on the project, usually directed by the company to work on
specific bugs and features that are affecting them, and also on
larger feature sets. They often work in a team within their company
who are also working on the project. In some cases, these
full-timers are not dedicated to writing code but more dedicated to
the managerial side of the project. Part-timers similar to these
also exist.
The part-timer who has a vested interest in the project.
Mostly these are consultants, but could still be from companies who
use the software but don’t have enough resources to contribute
full-time. Generally, they contribute to the project because it
heavily influences their day jobs, and they see users with a
certain need. They usually have a very good understanding of the project
and will also contribute major features/fixes as well as smaller
improvements. They may also just be very well versed users who
contribute to discussions, helping other users, and documenting the
software.
The part-timer who has some interaction with the software
during their day job, but is not dedicated to working on the
software. These people often contribute patches related to specific
issues they encounter while working with the software. Typically
these people are sysadmins or developers. I’d sum these up as “the
people that encounter something that annoys them and fix
it”.
The users. No point having all this software if there is
no one to use it. Users contribute on mailing lists and ancient
IRCs, helping other users get on board with the software. They
also give important feedback to the developers on improvements, bug
fixes, and documentation, as well as testing and reporting bugs
they find. Typically in a large project, they don’t drive features
significantly, but it can happen.
There are many other types of contributors to a project,
but these (to me) seem to be the main ones for large projects such
as Apache Cassandra. You’ll note there is no mention of the hobbyist. While they do exist, in
such large projects they only usually come about through extraneous
circumstances. It’s quite hard to work on such a large project on
the side, as it generally requires a lot of background knowledge,
which you can only really grasp if you’ve spent countless hours
working with the software already. It is possible to pick up a very
small task and complete it without much knowledge about the project
as a whole, but these are rare, which results in fewer hobbyists
working on the
project. It’s worth noting that all of these contributors are essentially volunteers. They may be employed full time to work on the project, but not by the project. The company employing them volunteers their employees to work on the project.
Now there
are a few important things to consider about a large project with a
contributor-base like the above. For starters, priorities. Every
contributor will come to the project with their own set of
priorities. These may come from the company they work for, or may
be itches they want to scratch, but generally, they will be
directed to work on certain bugs/features and these bugs/features
will not always coincide with other contributors’ priorities. This is where managing the project gets complicated. The project has
a bunch of volunteers, and these need to be organized in a way that
will produce stable, functioning software that meets the needs of
the user base, at least in a somewhat timely fashion. The project
needs to be kept healthy and needs to continue satisfying the users’ needs if it is to survive. However, the users’ needs and the needs
of the people writing the code often don’t intersect, and they
don’t always see eye to eye. On a project run by volunteers this is
important to consider when you’re asking for something, because
although you may have a valid argument, there might not be someone
who wants to make the contribution, and even if there is, they
might not have a chance to work on it for a long
time/ever.
Do’s
Take responsibility for your contributions. I’ve noted
it’s a common opinion that developers are only beholden to their
employer, but this is not true. If you wrote code in an open source
project, you’re still responsible for the quality and performance
of that code. Your code affects other people, and when it gets into
an open source project you have no idea what that code could be
used for. Just because you’re not liable doesn’t mean you shouldn’t
do a good job.
Be polite and respectful in all discussions. This is very
important if you want to get someone to help you in the project.
Being rude or arrogant will immediately get people off-side and
you’ll have a very hard time making meaningful
contributions.
Be patient. Remember OSS is generally volunteer-based,
and those volunteers typically have priorities of their own, and
may not be able to be prompt on your issues. They’ll get to it
eventually; nudge them every now and again, just don’t do it all
the time. I recommend picking up a number of things to do that you
can occupy yourself with while you wait.
Contribute in any way you can. Every contribution is
important. Helping people on the mailing list/public forums,
writing documentation, testing, reporting bugs, verifying behavior,
writing code, contributing to discussions are all great ways to
contribute. You don’t have to do all of them, and a little bit of
help goes a long way. This will help keep Open Source Software
alive, and we all want free software, don’t we?
Don’ts
Don’t assume that just because you have an idea, other people will think it’s good. Following that, don’t assume
that even if it is good, someone else will be willing to implement
it.
Don’t assume that OSS is competing with any other
software. If something better comes along (subject to licensing),
it would make sense for the effort to be directed towards the new
software. The only thing keeping the project alive is that people
are using it. If it stops being relevant, it will stop being
supported.
Don’t expect other volunteers to work for you. If you
have a great idea you must still be prepared to wait and get
involved yourself to get it implemented. The nature of large OSS
projects is that there are always more ideas than there are people to implement them, and the contributors are more likely to prioritize their own ideas over yours. If you can do some of the legwork to get your ideas in place (proof of concepts, design documents,
validation, etc) it will go a long way to making your idea a
reality.
Don’t expect to show up and be listened to. It takes
years of working with a large project before you have enough
knowledge (and wisdom) to make significant improvements. If you
just show up and throw your ideas about like they are better than
sliced bread, you’ll likely put existing contributors on edge. Start small and incrementally build yourself a reputation so that people will give your ideas the consideration they deserve.
Don’t waste people’s time. It may seem harsh but things
like not providing enough detail when reporting bugs or asking for help diagnosing problems are huge time wasters and generally lead to your
problems getting lost in the backlog. Make sure you always search
the backlog for existing related issues and make sure you are
prepared to provide all relevant information for the maximum chance
of your request being implemented.
Hopefully, this gives a good overview of the kind of
community that makes up an open source project and gives you a good
idea of what you’re dealing with when you’re looking to contribute
to <insert favorite OSS software here>. If you follow these
simple do’s and don’ts you’ll have the best chances of success when
making contributions. Don’t hold off, contribute
today!