The Perfect Hybrid Cloud for IBMHAT

The acquisition of Red Hat by IBM caught many, including myself, by surprise. It’s not that such an option was never on the table. During the time I was at Red Hat (2008-2012) such ideas were tossed about. Funny to say, but in 2012 Red Hat seemed too expensive a play. Revenues have risen sharply since then, and so has the price.

Before we dive into whether this move will save IBM, let’s first tip our caps to Red Hat, one of the most important technology companies. Red Hat is an innovator of the open source business model. The leader of the free developer spirit. The altruistic fighting force of openness and innovation. Most open source vendors have striven to achieve Red Hat’s success, and we all look up to its purist attitude toward open source.

Without Red Hat, the world would likely be in a different place today. We see Microsoft embracing Linux and joining the Open Invention Network (OIN), of which Scylla is a proud member. Today Linux dominates the server world and many organizations have an open source strategy, but it was Red Hat that planted the seeds, nurtured them and proved to the world that they can be sold for profit.

Red Hat employees are strongly bound to their company by a force that’s hard to find in even the coolest, most innovative companies. Every business decision is subject to scrutiny along the lines of “What would the Shadowman have said about it?”

Je Suis Red Hat

It’s great IBM will allow Red Hat to continue as an independent business unit. The world needs Red Hat to continue to flourish. Eight percent of the Linux kernel code is developed by Red Hat. It used to be much more, but over time other companies joined in Linux’s success. Red Hat remains a leader in Kubernetes, OpenStack, Ansible and many, many other important projects, including my good old KVM hypervisor.

Let’s not forget, IBM is a big innovator, too, with a great legacy and great assets. People like to mock IBM as a slow-moving giant, but let’s not forget that Watson AI led the industry way before AI became widespread. The IBM Z series has hardware instructions for virtualization where x86 needs hypervisor hackery, and IBM Power has 8 hardware threads per core.

IBM’s customer base is also huge. They sell to brick-and-mortar enterprises. If you visited the IBM THINK conference after an AWS re:Invent, you would have been shocked by how different the audience is, mainly in age and outfits.

It’s not yet clear how the two companies will integrate and work together. During my tenure at Red Hat we worked closely with IBM, who were the best technology partner, but it won’t be easy to merge the bottom-up approach at Red Hat with the top-down approach at IBM.

Hybrid cloud opportunity

The joint mission is to win the hybrid cloud, a market anticipated to grow to nearly $100 billion by 2023. IBM has not become one of the three leading cloud vendors (a larger market, valued at over $186 billion today), so it seeks to win in an adjacent large market. Today, private data center spending is still significantly larger than cloud spending. That won’t last forever, but most companies rely on more than a single cloud vendor or have combined private and public cloud needs — and this is where IBMHAT should go.

The way to get there is not by lock-in (though they could, as Power + RHEL + OpenShift + Ansible make a lot of sense) but the opposite. Customers should go to the IBMHAT offering in order to have choice and flexibility. Now, theoretically, Red Hat alone and IBM alone already have such assets. OpenShift and ICP (IBM Cloud Private) offer such a Kubernetes marketplace today.

What’s missing?

Unlocking the attractiveness of hybrid cloud is easier said than done. My take is that it’s not enough to simply have a rich marketplace. A set of the best-of-breed services should be there out of the box, exactly as users are used to finding these services on public clouds.

There should be one central authentication service, one location for billing and EULAs, an out-of-the-box object store, out-of-the-box shared storage (EBS/PD/..), one security console, a unified key management solution, one relational database and one DynamoDB-style NoSQL database (ScyllaDB is shamelessly recommended), plus a container service that can run on-prem and on-cloud, serverless capabilities and so forth. Only once the base services are there does the marketplace come into play.

They all need to be provided with the same straightforward experience and at a dramatically improved cost over public clouds. This is the recipe for winning: simple, integrated default applications offered at a competitive price. Public cloud vendors are eating IT since they provide an amazing world of self-service, just a few clicks for bullet-proof functionality. That’s quite a high bar, one that on-premises vendors have so far failed to match, so truly, good luck IBMHAT.

How good can this opportunity be? Let’s take a look at the financials of an all-in cloud company, such as Snap, where 60% of its revenues are invested (lost) in provisioning its public cloud.

Snap earnings 2018-2018

Dropbox, on the other hand, moved from the public cloud to its own data centers (a move you make only after you’ve reached scale) and has far better margins.

Dropbox Revenue vs. COGS 2015-2017

With an independent, lock-in-free stack, IBMHAT would be able to help customers make their own choices and even run the software on AWS bare-metal i3.metal servers while retaining the ability to migrate at any given time.

Final thoughts

IBM does not live in a void. We may see EMC/Pivotal making a similar move and, in parallel, the cloud vendors realizing they need to go for private cloud. Other Linux vendors, such as SUSE and Canonical, may well be next.


Audit Logging in Apache Cassandra 4.0

Database audit logging is an industry standard tool for enterprises to capture critical data change events including what data changed and who triggered the event. These captured records can then be reviewed later to ensure compliance with regulatory, security and operational policies.

Prior to Apache Cassandra 4.0, the open source community did not have a good way of tracking such critical database activity. With this goal in mind, Netflix implemented CASSANDRA-12151 so that users of Cassandra would have a simple yet powerful audit logging tool built into their database out of the box.

Why are Audit Logs Important?

Audit logging database activity is one of the key components for making a database truly ready for the enterprise. Audit logging is generally useful but enterprises frequently use it for:

  1. Regulatory compliance with laws such as SOX, PCI and GDPR et al. These types of compliance are crucial for companies that are traded on public stock exchanges, hold payment information such as credit cards, or retain private user information.
  2. Security compliance. Companies often have strict rules for what data can be accessed by which employees, both to protect the privacy of users but also to limit the probability of a data breach.
  3. Debugging complex data corruption bugs such as those found in massively distributed microservice architectures like Netflix’s.

Why is Audit Logging Difficult?

Implementing a simple logger in the request (inbound/outbound) path sounds easy, but the devil is in the details. In particular, the “fast path” of a database, where audit logging must operate, strives to do as little as humanly possible so that users get the fastest and most scalable database system possible. While implementing Cassandra audit logging, we had to ensure that the audit log infrastructure does not take up excessive CPU or IO resources from the actual database execution itself. However, one cannot simply optimize only for performance because that may compromise the guarantees of the audit logging.

For example, if producing an audit record would block a thread, it could be dropped to maintain maximum performance. However, most compliance requirements prohibit dropping records. Therefore, the key to implementing audit logging correctly lies in allowing users to achieve both performance and reliability or, when both cannot be achieved, to let users make an explicit trade-off through configuration.
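
To make that trade-off concrete, here is a minimal, purely illustrative Java sketch (it is not Cassandra’s actual implementation) of the two policies a logging pipeline can choose between when its in-memory buffer fills up:

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class AuditBufferPolicies
{
    // a bounded buffer standing in for the audit log's in-memory queue
    private final BlockingQueue<String> buffer = new ArrayBlockingQueue<>(1024);

    // "Performance first": never stall the request thread; records may be lost.
    boolean offerDropIfFull(String record)
    {
        return buffer.offer(record); // returns false (drops the record) when the buffer is full
    }

    // "Reliability first": never lose a record; the request thread may block.
    void putBlockIfFull(String record) throws InterruptedException
    {
        buffer.put(record); // waits for free space, applying back-pressure to the caller
    }
}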


Audit Logging Design Goals

The design goals of the audit log are broadly categorized into three different areas:

Performance: Considering that the audit log injection points live in the request path, performance is an important goal in every design decision.

Accuracy: Accuracy is required by compliance and is thus a critical goal. Audit logging must be able to answer crucial auditor questions like “Is every write request to the database being audited?” As such, accuracy cannot be compromised.

Usability & Extensibility: The diverse Cassandra ecosystem demands that any frequently used feature be easily usable and pluggable (e.g., Compaction, Compression, SeedProvider, etc.), so the Audit Log interface was designed with this context in mind from the start.

Implementation

With these three design goals in mind, the OpenHFT libraries were an obvious choice due to their reliability and high performance. Earlier, in CASSANDRA-13983, OpenHFT’s Chronicle Queue library was introduced to the Apache Cassandra code base as a BinLog utility. The performance of Full Query Logging (FQL) was excellent, but it only instrumented the mutation and read query paths. It was missing a lot of critical data, such as when queries failed, where they came from, and which user issued the query. The FQL was also single purpose: preferring to drop messages rather than delay the process (which makes sense for FQL but not for audit logging). Lastly, the FQL didn’t allow for pluggability, which would make it harder to adopt in the codebase for this feature.

As shown in the architecture figure below, we were able to unify the FQL feature with the AuditLog functionality through the AuditLogManager and IAuditLogger abstractions. Using this architecture, we can support any output format: logs, files, databases, etc. By default, the BinAuditLogger implementation comes out of the box to maintain performance. Users can choose a custom audit logger implementation by dropping the jar file on the Cassandra classpath and customizing it with configuration options in the cassandra.yaml file.


Architecture

Fig 1. AuditLog Architecture Figure.


What does it log

Each audit log implementation has access to the following attributes. For the default text-based logger, these fields are concatenated with | to yield the final message.

  • user: User name (if available)
  • host: Host IP where the command is being executed
  • source ip address: Source IP address from which the request was initiated
  • source port: Source port number from which the request was initiated
  • timestamp: Unix timestamp
  • type: Type of the request (SELECT, INSERT, etc.)
  • category: Category of the request (DDL, DML, etc.)
  • keyspace: Keyspace (if applicable) on which the request is targeted to be executed
  • scope: Table / aggregate name / function name / trigger name etc., as applicable
  • operation: The CQL command being executed

Example of Audit log messages

Type: AuditLog
LogMessage: user:anonymous|host:127.0.0.1:7000|source:/127.0.0.1|port:53418|timestamp:1539978679457|type:SELECT|category:QUERY|ks:k1|scope:t1|operation:SELECT * from k1.t1 ;

Type: AuditLog
LogMessage: user:anonymous|host:127.0.0.1:7000|source:/127.0.0.1|port:53418|timestamp:1539978692456|type:SELECT|category:QUERY|ks:system|scope:peers|operation:SELECT * from system.peers limit 1;

Type: AuditLog
LogMessage: user:anonymous|host:127.0.0.1:7000|source:/127.0.0.1|port:53418|timestamp:1539980764310|type:SELECT|category:QUERY|ks:system_virtual_schema|scope:columns|operation:SELECT * from system_virtual_schema.columns ;

How to configure

Audit logging can be configured using cassandra.yaml. If you want to try it on one node, it can also be enabled and configured using nodetool.

cassandra.yaml configurations for AuditLog

  • enabled: Enables/disables the audit log
  • logger: Class name of the logger/custom logger
  • audit_logs_dir: Audit logs directory location; if not set, defaults to cassandra.logdir.audit or cassandra.logdir + /audit/
  • included_keyspaces: Comma-separated list of keyspaces to be included in the audit log; by default all keyspaces are included
  • excluded_keyspaces: Comma-separated list of keyspaces to be excluded from the audit log; by default no keyspace is excluded
  • included_categories: Comma-separated list of audit log categories to be included in the audit log; by default all categories are included
  • excluded_categories: Comma-separated list of audit log categories to be excluded from the audit log; by default no category is excluded
  • included_users: Comma-separated list of users to be included in the audit log; by default all users are included
  • excluded_users: Comma-separated list of users to be excluded from the audit log; by default no user is excluded

Note: BinAuditLogger configurations can be tuned using cassandra.yaml properties as well.

The list of available categories is: QUERY, DML, DDL, DCL, OTHER, AUTH, ERROR, PREPARE.
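
Putting these options together, the relevant cassandra.yaml section might look like the sketch below. Treat it as an illustrative assumption rather than a canonical reference: the audit_logging_options grouping, the directory path and the keyspace and user names are examples, and the exact layout can differ between Cassandra builds, so check the cassandra.yaml that ships with your version.

audit_logging_options:
    enabled: true
    logger: BinAuditLogger
    audit_logs_dir: /var/log/cassandra/audit
    included_keyspaces: app_ks
    excluded_categories: QUERY, PREPARE
    excluded_users: metrics_agent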

NodeTool command to enable AuditLog

enableauditlog: Enables AuditLog with the cassandra.yaml defaults; those defaults can be overridden using options passed to the nodetool command.

nodetool enableauditlog

Options:

  • --excluded-categories: Comma separated list of Audit Log Categories to be excluded from the audit log. If not set, the value from cassandra.yaml will be used
  • --excluded-keyspaces: Comma separated list of keyspaces to be excluded from the audit log. If not set, the value from cassandra.yaml will be used
  • --excluded-users: Comma separated list of users to be excluded from the audit log. If not set, the value from cassandra.yaml will be used
  • --included-categories: Comma separated list of Audit Log Categories to be included in the audit log. If not set, the value from cassandra.yaml will be used
  • --included-keyspaces: Comma separated list of keyspaces to be included in the audit log. If not set, the value from cassandra.yaml will be used
  • --included-users: Comma separated list of users to be included in the audit log. If not set, the value from cassandra.yaml will be used
  • --logger: Logger name to be used for audit logging (default: BinAuditLogger). If not set, the value from cassandra.yaml will be used
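
For example, to audit only DML and DDL statements on a single application keyspace while leaving out a service account (the keyspace and user names here are hypothetical), the options can be combined like so:

nodetool enableauditlog --included-keyspaces app_ks --included-categories DML,DDL --excluded-users metrics_agent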

NodeTool command to disable AuditLog

disableauditlog: Disables AuditLog.

nodetool disableauditlog

NodeTool command to reload AuditLog filters

enableauditlog: The nodetool enableauditlog command can also be used to reload audit log filters, by calling it with the default or previously used logger name and the updated filters

nodetool enableauditlog --loggername <Default/ existing loggerName> --included-keyspaces <New Filter values>

Conclusion

Now that Apache Cassandra ships with audit logging out of the box, users can easily capture data change events to a persistent record indicating what happened, when it happened, and where the event originated. This type of information remains critical to modern enterprises operating in a diverse regulatory environment. While audit logging represents one of many steps forward in the 4.0 release, we believe that it will uniquely enable enterprises to use the database in ways they could not previously.

Scylla Summit Preview: Scylla and KairosDB in Smart Vehicle Diagnostics


In the run-up to Scylla Summit 2018, we’ll be featuring our speakers and providing sneak peeks at their presentations. This interview in our ongoing series is with two speakers holding a joint session: Scylla and KairosDB in Smart Vehicle Diagnostics. The first part of the talk will feature Brian Hawkins speaking on the time-series database (TSDB) KairosDB, which runs atop Scylla or Cassandra. He’ll then turn the session over to Bin Wang of Faraday Future (FF), who will discuss his company’s use case in automotive real-time data collection.

Brian, Bin, thank you for taking the time to speak with me. First, tell our readers a little about yourselves and what you enjoy doing outside of work.

Brian: I recently purchased a new house and I’m still putting in the yard so I don’t have the luxury of doing anything I enjoy.

Bin: I adopted a small dog. I enjoy taking my family out in nature on the weekends.

Brian, people have been enthusiastic about KairosDB from the start. Out of the many time-series databases available, what do you believe sets it apart?

Brian: Time-series data is addictive; no matter how much you have, you want more. One way where Kairos stands out is its ability to scale as you want to store more data. Whether you want to store 100 metrics/sec or 1 million metrics/sec, Kairos can handle it.

Another way where Kairos is different is that it is extremely customizable with plugins. You can embed it or customize it to your specific use case.

Bin, I was curious. Had you looked at our 2017 blog or seen the 2018 webinar about using KairosDB with Scylla? If not, how did you decide on this path for implementation? How did you determine to use these two technologies together?

Bin: I didn’t see that webinar before I made this decision at the beginning of our project. We chose it because we need to write huge quantities of signal data into persistent storage for later use. Our usage is not only loading data for model training, but also loading historical values for our engineering team.

So we decided to store them in Cassandra. After that we found Scylla, which is written in C++ and has much better performance than traditional Cassandra. We switched our underlying DB to Scylla.

After running for some time, we found we were using a lot of storage for plain signal storage, and there were also much greater requirements for visualization of signals.

So next I looked at InfluxDB and Grafana. They are perfect for signal visualization, but InfluxDB doesn’t have the same performance as Scylla. My team uses Scylla a lot in our projects. I like it.

After that I found KairosDB. It is perfect for me: it works on Scylla, has great performance, stores signals more efficiently, and can work with Grafana for signal visualization. So my final solution is Scylla, KairosDB and Grafana.

What is the use case for Scylla and KairosDB at Faraday Future? Describe the data you need to manage, and the data ecosystem Scylla/KairosDB would need to integrate with.

Bin: Well, as I described before, KairosDB is the storage interface for the signals sent from FF vehicles. Each vehicle uploads hundreds to thousands of signals each second. We created a pipeline from RabbitMQ, to decoding in Spring Cloud, to a Kafka message queue, and we store the signals in KairosDB, persisted into Scylla.

On the consumer side, we use Grafana for raw signal visualization, Spark for model training and data analysis, and Spring Cloud for the REST APIs for the UI. KairosDB and Scylla are the core persistence and data access interface of the system.

Brian, when you listen to developers like Bin putting KairosDB to the test in IoT environments, what do you believe are important considerations for how they manage KairosDB and their real-time data?

Brian: Make sure you scale to keep ahead of your data flow. Have a strategy for either expiring your data or migrating it as things fill up. A lot of users underestimate how much time-series data they will end up collecting. Also make sure each user that will send data to Kairos understands tag cardinality. Basically if a metric has too many tag combinations it will take a long time to query and won’t be very useful.

Thank you both for your time!

Speaking of time series, if you have the time to see a series of great presentations, it is seriously time to sign up for Scylla Summit 2018, which is right around the corner: November 6-7, 2018.


New Maintenance Releases for Scylla Enterprise and Scylla Open Source


The Scylla team announces the availability of three maintenance releases: Scylla Enterprise 2018.1.6, Scylla Open Source 2.3.1 and Scylla Open Source 2.2.1. Scylla Enterprise and Scylla Open Source customers are encouraged to upgrade to these releases in coordination with the Scylla support team.

  • Scylla Enterprise 2018.1.6, a production-ready Scylla Enterprise maintenance release for the 2018.1 branch, the latest stable branch of Scylla Enterprise
  • Scylla Open Source 2.3.1, a bug fix release of the Scylla Open Source 2.3 stable branch
  • Scylla Open Source 2.2.1, a bug fix release of the Scylla Open Source 2.2 stable branch

Scylla Open Source 2.3.1 and Scylla Open Source 2.2.1, like all past and future 2.x.y releases, are backward compatible and support rolling upgrades.

These three maintenance releases fix a critical issue with nodetool cleanup:

nodetool cleanup is used after adding a node to a cluster, to clean partition ranges not owned by a node. The critical issue we found is that nodetool cleanup would wrongly erase up to 2 token ranges that were local to the node. This data loss becomes permanent if cleanup is executed on all replicas of the affected range. The root cause of the issue is #3872, an error in an internal function used by cleanup to identify which token ranges are owned by each node.

If you ran nodetool cleanup, and you have questions about this issue, please contact the Scylla support team (for Enterprise customers, submit a ticket; for Open Source users, please let us know via GitHub).


Additional fixed issues in Scylla Enterprise 2018.1.6 release, with open source references, if they exist:

  • CQL: Unable to count() a column with type UUID #3368, Enterprise #619
  • CQL: Missing a counter for reverse queries #3492
  • CQL: some CQL syntax errors can cause Scylla to exit #3740 #3764
  • Schema changes: race condition when dropping and creating a table with the same name #3797
  • Schema compatibility when downgrading to older Scylla or Apache Cassandra #3546
  • nodetool cleanup may double the used disk space while running #3735
  • Redundant Seastar “Exceptional future ignored” warnings during system shutdown. Enterprise #633 

Additional issues solved in Scylla Open Source 2.2.1:

  • CQL: DISTINCT was ignored with IN restrictions #2837
  • CQL: Selecting from a partition with no clustering restrictions (single partition scan) might have resulted in a temporary loss of writes #3608
  • CQL: Fixed a rare race condition when adding a new table, which could have generated an exception #3636
  • CQL: INSERT using a prepared statement with the wrong fields may have generated a segmentation fault #3688
  • CQL: failed to set an element on a null list #3703
  • CQL: some CQL syntax errors can cause Scylla to exit #3740 #3764
  • Performance: a mistake in static row digest calculations may have led to redundant read repairs #3753 #3755
  • CQL: MIN/MAX CQL aggregates were broken for timestamp/timeuuid values. For example SELECT MIN(date) FROM ks.hashes_by_ruid; where the date is of type timestamp #3789
  • CQL: a TRUNCATE request could have returned a success response even if it failed on some replicas #3796
  • scylla_setup: scylla_setup run in silent mode should fail with an appropriate error when mdadm fails #3433
  • scylla_setup: may report "NTP setup failed" after a successful setup #3485
  • scylla_setup: add an option to select server NIC #3658
  • improve protection against out-of-memory conditions while inserting a cell into an existing row #3678
  • Under certain, rare, circumstances a scan query can end prematurely, returning only parts of the expected data-set #3605

Additional issues solved in Scylla Open Source 2.3.1:

  • Gossip: non-zero shards may have stale (old) values of gossiper application states for some time #3798. This can create an issue with schema change propagation, for example, TRUNCATE TABLE #3694
  • CQL: some CQL syntax errors can cause Scylla to exit #3740 #3764
  • In Transit Encryption: Possible out of memory when using TLS (encryption) with many connections #3757
  • CQL: MIN/MAX CQL aggregates were broken for timestamp/timeuuid values. For example SELECT MIN(date) FROM ks.hashes_by_ruid; where the date is of type timestamp #3789
  • CQL: a TRUNCATE request could have returned a success response even if it failed on some replicas #3796
  • Prometheus: Fix histogram text representation #3827


The Last Pickle Is Hiring

The Last Pickle (TLP) intends to hire a team member in the United States to work directly with customers. You will be part of the TLP tech team, delivering high quality consulting services including expert advice, documentation and run books, diagnostics and troubleshooting, and proof-of-concept code.

Responsibilities Include

  • Delivering world-class consulting to The Last Pickle’s customers globally.
  • Ensuring timely response to customer requests.
  • Guaranteeing professional standards are maintained in both the content and delivery of consulting services to customers.
  • Continually developing skills and expertise to ensure The Last Pickle is able to deliver “Research-Driven Consultancy”.
  • Continuing to speak at conferences, author technical blog posts, and contribute to the Open Source community.

Skills and Experience

  • 3+ years of experience using and/or troubleshooting Apache Cassandra or DataStax Enterprise in production systems.
  • Code and/or other contributions to the Apache Cassandra community.
  • Spoken in public about Cassandra or related Big Data platforms.
  • Experience working remotely with a high-level of autonomy.

In return we offer:

  • Being part of a globally recognised team of experts.
  • Flexible workday and location.
  • Time to work on open source projects and support for public speaking.
  • As much or as little travel as you want.
  • No on-call roster.
  • A great experience helping companies big and small be successful.

If this sounds like the right job for you let us know by emailing careers@thelastpickle.com.

The Dark Side of MongoDB’s New License

SSPL vs. AGPL Side-by-Side

It’s never been simple to be an Open Source vendor. With the rise of the cloud and the emergence of software as a service, the open source monetization model continues to encounter risks and challenges. A recent example can be found in MongoDB, the most prevalent NoSQL OSS vendor, which just changed its license from the AGPL to a new, more restrictive license called the SSPL.

This article will cover why MongoDB made this change, and the problems and risks of the new model. We’ll show how SSPL broadens the definition of copyleft to an almost impossible extent and argue that MongoDB would have been better off with Commons Clause, or just swallowing a hard pill and staying with the AGPL.

Why a new license?

According to a recent post by MongoDB’s CTO, Elliot Horowitz, MongoDB suffers from unfair usage of their software by vendors who resell it as a service.

Elliot claims the AGPL clause that is supposed to protect the software from being abused in such a manner isn’t enough, since enforcing it would result in costly legal expenses.

As an OSS co-founder myself, these claims seem valid at first blush. Shouldn’t vendors and communities defend themselves against organizations that just consume technology without contributing anything back? How can an OSS vendor maintain a sustainable business and how can it grow?

As InfluxDB CTO Paul Dix said, someone needs to subsidize OSS, otherwise it cannot exist. There are cases where it’s being subsidized by a consortium of companies or foundations (Linux, Apache). There are cases where very profitable corporations (Google/Facebook), who rely on their profits from other business models, act as “patrons-of-the-arts” to open source code for various reasons — for instance, to attract top talent, to be considered a benevolent brand, or attract a large user base for their other proprietary products.

Pure OSS vendors are under constant pressure since their business model needs to subsidize their development and their margins are tight. Indeed, many OSS vendors are forced into an open core approach in which they hold back functionality from the community (Cloudera), provide some of the closed-source functionality as a service (Databricks) or even make a complete U-turn back to closed-source software (DataStax).

Could MongoDB and RedisLabs (who recently changed to Apache 2 + Commons Clause licensing; the core remained BSD) have found the perfect solution? These new solutions allow them to keep sharing the code while having an edge over opportunistic commercializers who take advantage of OSS with little to no contributions.

AGPL vs. SSPL: A side-by-side comparison

 

AGPL’s Section 13, “Remote Network Interaction,” was totally replaced by SSPL’s limitations around “Offering the Program as a Service,” with completely new text therein.

SSPL requires that if you offer software as a service you must make public, open source, practically almost everything related to your service:

“including, without limitation, management software, user interfaces, application program interfaces, automation software, monitoring software, backup software, storage software and hosting software, all such that a user could run an instance of the service using the Service Source Code you make available.”

What’s the risk of SSPL?

From a 30,000 foot view, it might sound fair. Community users will be able to read the software, and use it for internal purposes, while usage that directly competes with the OSS vendor’s service model will be disallowed. Have your cake and eat it, too!


Hmmm… not so fast. Life, software and law aren’t so simple.

On the surface, the intention is good and clear, the following two types of OSS usage will be handled fairly well:

  1. Valid OSS internal usage
    A company, let’s call it CatWalkingStartup, uses MongoDB OSS to store cat walking paths. It’s definitely not a MongoDB competitor and not a database service, thus a valid use of the license.
  2. MongoDB as a service usage
    A company, let’s call it BetterBiggerCloud, offers MongoDB as a service without contributing back a single line of code. This is not a valid use according to SSPL. In such a case, BetterBiggerCloud will either need to pay for an Enterprise license or open all of their code base (which is less likely to happen).

Here’s where things get complicated. Let’s imagine the usage of a hypothetical company like a Twilio or a PubNub (these are presented only as examples; this is not to assert whether they do or ever have used MongoDB). Imagine they use MongoDB and provide APIs on top of their core service. Would this be considered fair usage? They do provide a service and make money by using database APIs and offering additional/different APIs on top of it. At what point is the implementation far enough from the original?

GPL and the Linux kernel used a cleaner definition where usage is defined as derived work. Thus, linking with the Linux kernel is considered derived and userspace programs are independent. There is a small gray area with applications that share memory between userspace and kernel, but, for the most part, the definition of what is allowed is well understood.

With the goal of closing the loophole with services, AGPL defined the term ‘Remote Network Interactions.’ The problem with SSPL is that it barely has such boundaries. Now users must share their backup code, monitoring code and everything else. It doesn’t seem practical and is very hard to defend in non-trivial cases.

I wonder if folks at MongoDB have given this enough thought. What if a cloud service does not use the MongoDB query language and instead offers a slightly different interface to query and save JSON objects? Would it be allowed?

It's a mess!

Should others follow SSPL?

In a word, no.

If you do intend to sell MongoDB as a service, you have to open source your whole stack. That may be acceptable to smaller players, but you won’t find large businesses that will agree to this. It might as well just read, “You are not allowed to offer MongoDB as a service.”

MongoDB is foolishly overreaching.

The intent to control others offering MongoDB-as-a-[commercialized]-service is commendable. The desire to profit off your work when it is commercialized by others seems all well and good, and Commons Clause takes care of it (although it expands beyond the limits of services). But let’s face it, there is nothing that unique about services; it’s more about commercializing the $300M investment in MongoDB.

I actually do not think this is MongoDB trying to turn itself into a “second Oracle.” I believe the intentions of MongoDB’s technical team are honest. However, they may have missed a loophole with their SSPL and generated more problems than solutions. It would have been better to use the existing OSS/Enterprise toolset instead of creating confusion. The motivation, to keep as much code as possible open, is admirable and positive.

This is, of course, not the end of open source software vendors. Quite the contrary. The OSS movement keeps on growing. There are more OSS vendors than ever; one of them just recently IPOed (Elasticsearch) and others are on their way toward an IPO.

While Open Source is not going away, the business models around it must and will continue to evolve. SSPL was a stab at correcting a widely-perceived deficiency by MongoDB. However, we believe there are better, less-burdensome ways to address the issue.

Disclosure: Scylla is a reimplementation of Apache Cassandra in C++. ScyllaDB chose AGPL for its database product for the very same reasons MongoDB originally chose AGPL. Our core engine, Seastar, is licensed under Apache 2.0.

Editor’s Note: This article has been updated to clarify Redis Labs license model. (October 29, 2018)


Scylla Manager — Now even easier to maintain your Scylla clusters!

 

Last week we announced the release of Scylla Manager 1.2, a management system that automates maintenance tasks on a Scylla cluster. This release provides enhanced repair features that make it easier to configure repairs on a cluster. In this blog, we take a closer look at what’s new.

Efficient Repair

Scylla Manager provides a robust suite of tools aligned with Scylla’s shard-per-core architecture for the easy and efficient management of Scylla clusters. Clusters are repaired shard-by-shard, ensuring that each database shard performs exactly one repair task at a time. This gives the best repair parallelism on a node, shortens the overall repair time, and does not introduce unnecessary load.

General Changes

In order to simplify configuration of Scylla Manager we have removed the somewhat confusing “repair unit” concept. This concept served only as a static bridge between what to do (“tasks”) and when to run them, and was thus unnecessary. A task is defined as the set of hosts, datacenters, keyspaces and tables on which to perform repairs. Essentially it boils down to what you want to repair. Which tables a task operates on exactly is determined at runtime. This means that if you add a new table that matches a task definition filter, it will also be repaired by that task even though it did not exist at the time the task was added. This makes Scylla Manager easier for users while also making its code simpler, which will allow us to develop new features faster.

 

Scylla Manager Logical View

Multi-DC Repairs

One of the most sought-after features is the ability to isolate repairs. Scylla Manager 1.2 provides a simple yet powerful way to select a specific datacenter, or even to limit which nodes, tables, keyspaces, or token ranges are repaired.

You can furthermore decide with great precision when to perform repairs using timestamps and time deltas. For example, to repair all the shopping cart related tables in the Asian datacenter and to start the repair task in two hours, you would run a repair command such as this:

sctool repair -c 'cname' --dc 'dc_asia' --keyspace 'shopping.cart_*' -s now+2h

This command issues repair instructions only to the nodes located in the datacenter dc_asia and repairs only the tables matching the glob expression shopping.cart_*. This is a one-time repair. To make it recurrent, use the --interval-days flag to specify the number of days between repair tasks. If you want to repair multiple keyspaces at a time, simply add another glob pattern to the --keyspace argument, such as this:

sctool repair -c 'name' --dc 'dc_asia' --keyspace 'shopping.cart_*,audit.*' -s now+2h

This repairs all the tables in the audit keyspace as well, in the same repair task. If you want to skip repairing one of the tables (audit.old, for example), just add an exclusion like this:

sctool repair -c 'name' --dc 'dc_asia' --keyspace 'shopping.cart_*,audit.*,!audit.old' -s now+2h

This repairs all tables in the “audit” keyspace except for the table named “old”.

If you want to further control what the repair should use as its source of truth, you can use the --with-hosts flag, which specifies a list of hosts. This instructs Scylla to use only these hosts when repairing, rather than all of them, which is normally the case. To repair just a single host you can use the --host flag, which is particularly useful in combination with --with-hosts since you can quickly, and with minimal impact, repair a broken node.
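
For example, to quickly repair a single recovering node using two specific replicas as the source of truth (the cluster name and IP addresses below are placeholders), the two flags might be combined like this:

sctool repair -c 'name' --host 10.0.1.10 --with-hosts 10.0.1.11,10.0.1.12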

By default Scylla Manager instructs Scylla to repair “primary token ranges,” which means that only the token ranges owned by the node will be repaired. To change this behavior, you can invert it by simply adding the --npr argument, or use --all to repair all token ranges.
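
For instance, a task that repairs all token ranges, not just the primary ones, for everything in the audit keyspace might look like this:

sctool repair -c 'name' --keyspace 'audit.*' --all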

A note on glob patterns…

We chose to use glob patterns, for example keyspace.prefix*, for specifying filters as they provide a high degree of power and flexibility without the complexity of regular expressions which can easily lead to human errors. These patterns easily allow you to specify several keyspaces and tables without writing out all of them in a list which can be quite tedious in a large application with lots of different data types.

Simplified Installation

Scylla Manager uses the Scylla REST API for all of its operations. Scylla Manager supports SSH tunneling for its interactions with the Scylla database nodes for increased security. It is not mandatory to do so, but we recommend using SSH tunneling because it does not require any changes to the database configuration and can be implemented with a dedicated low-permission user. It also does not need any special software installed on the database machines. This simplifies operations since there is nothing extra to monitor, update, and otherwise manage. The concept of having a companion application installed together with another main application is known as a “sidecar.” We do not believe sidecars to be a good design pattern for Scylla Manager, since they bring additional operational burdens.

In Scylla Manager 1.2, we have made it very easy to setup the SSH connectivity for the Scylla Manager to talk to the Scylla database nodes.

One thing many users reported as being troublesome was the generation and distribution of the SSH keys necessary for this to work. To solve this problem, we have now added the scyllamgr_ssh_setup script that is available after you install Scylla Manager. This script does not simply copy key files; it discovers all the nodes in a cluster and, for every node, sets up the proper user that is needed for the SSH connectivity to work.

To run the script, make sure there is an admin user who has root privileges so that the script can use these permissions to perform the setup. This power user is not remembered or reused in any way but is simply used to perform the needed administrative functions to set up the required user and keys. The admin user is much like the Amazon ec2-user. The script creates a user, specified by the -m parameter, that Scylla Manager later uses for its SSH connection. This is a very restricted user, as you cannot get shell access with it.

Generating and distributing the needed keys is as simple as:

scyllamgr_ssh_setup -u ec2-user -i /tmp/amazon_key.pem -m scylla-manager -o /tmp/scyllamgr_cluster.pem -d <SCYLLA_IP>

This generates or reuses the file: /tmp/scyllamgr_cluster.pem which is distributed to all of the nodes in the cluster. In order to do this, the script uses the Scylla REST API to discover the other members of the cluster and sets up the needed users and keys on these nodes as well. If you later add a node to the cluster, you can re-run the script for just that node.
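
For example, after later adding a single new node (the placeholder IP below stands for that node), re-running the script against just that node should reuse the previously generated key:

scyllamgr_ssh_setup -u ec2-user -i /tmp/amazon_key.pem -m scylla-manager -o /tmp/scyllamgr_cluster.pem -d <NEW_NODE_IP>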

Further improvements

  • HTTPS – By default, the Scylla Manager server API is now deployed using HTTPS for additional safety. The default port is 56443 but this can be changed in the config file /etc/scylla-manager/scylla-manager.yaml.
  • The SSH config is now set per cluster in the cluster add command. This allows a very secure setup where different clusters have their own keys.
  • The global ssh configuration is now dropped, and any available data will be migrated as part of the upgrade.
  • cluster add is now topology aware in the sense that it will discover the cluster nodes if you just supply one node using the --host argument. There is no need to specify all the cluster nodes when registering a new cluster.
  • repair will obtain the number of shards for the cluster nodes dynamically when it runs so you do not need to know the shard count when you add the cluster. This can be very convenient in a cluster with different sized nodes.
  • Automated DC selection – Scylla Manager will ping the nodes, determine which DC is the closest, and then use that DC for its interactions with the cluster whenever possible.
  • The repair_auto_schedule task has been replaced by a standard repair task, like any other that you might add.
  • The visual layout of the progress reporting in sctool is greatly improved.


Scylla Summit Preview: Scylla: Protecting Children Through Technology


In the run-up to Scylla Summit 2018, we’ll be featuring our speakers and providing sneak peeks at their presentations. This interview in our ongoing series is with Jose Garcia-Fernandez, Senior Vice President of Technology at Child Rescue Coalition, a 501(c)(3) non-profit organization. His session is entitled Scylla: Protecting Children through Technology.

Jose, it is our privilege to have you at Scylla Summit this year. Before we get into the work of the Child Rescue Coalition, I’d like to know how you got on this path. What is your background, and how did you end up working on this project?

I have been developing software solutions for many years. I hold a Master’s in Computer Science and I work on software solutions in the areas of Big Data, Computer Networks and Cyber Security. I am responsible for the development, operation, and enhancement of the tools that make up CRC’s “Child Protection System” (CPS). CPS is the main tool that thousands of investigators, in all U.S. states and more than 90 countries, use on a daily basis to track, catch, and prosecute online pedophiles who use the Internet to harm children.

I started down this path when I was developing TLOXp, an investigative tool that fuses billions of records about people and companies for investigative purposes. About 10 years ago, a group of investigators showed us how pedophiles were using the Internet to harm children. I was shocked to learn that the same power we use on a daily basis to share information, connect with other people on social media, and enjoy all the benefits of the Internet was also being used by pedophiles to communicate, share illegal material, mentor each other and, worst of all, contact new victims in a way that was not possible before. We worked together and, as a result of that work, we developed a set of tools for Law Enforcement, and they have been using them successfully over the years to catch online child predators. More than a thousand kids have been rescued, and 10,000 pedophiles have been prosecuted, as a direct consequence of the use of the tools we created and maintain at CRC, along with the extraordinary work of committed law enforcement investigators. In 2014, that platform and the people involved in the project created the Child Rescue Coalition in order to further grow the platform and expand its reach to other countries.

For those not familiar with your work, can you describe the challenge and the goals of the Child Rescue Coalition, and the technology you are using to address the problem?

Child Rescue Coalition’s mission is to protect children by developing state-of-the-art online technology. We deal with more than 17 billion records; we combine them into target reports, rank them using several algorithms developed with law enforcement organizations, and provide the ranked targets in their respective jurisdictions to law enforcement agents, free of charge, through a web-based application. The technology comes in different tools for different purposes; we deal with a lot of open-source products as well as proprietary technology for big distributed systems.

In human terms, what is the scale of this issue?

People may not know how bad the problem is. We have programs called “bots” or “crawlers” on the Internet. These programs identify illegal activity and send events about it to our servers. We deal with more than 50 million leads per day. Every year we identify more than 5 million computers generating those leads; in other words, millions of pedophiles looking for ways to victimize children.

Let me quote you this statement: “This tool stores several billions of records in Scylla, and it is expected to grow in the tens of billions of records in the near future.” That’s a shocking thing to imagine; just the raw quantity of data. What are the main considerations you face in managing it?

Our main consideration, and the reason why we selected Scylla, was its efficient and optimized design for modern hardware. This means we can implement our solution with only 5 servers, where it would have required at least 20 servers using other technologies based on the JVM. Having fewer servers means lower hardware costs but, more importantly, less time spent dealing with server failures and more time for developing new projects. It also means we can horizontally scale as needed, with almost no impact.

Besides Scylla, what other technologies are critical to your mission’s success?

In order to grow, we have been working towards making our platform more flexible, efficient and scalable. Recently, we have implemented Kubernetes to containerize our tools and expand into the cloud. We have implemented Kafka and Apache NiFi for expanding our data flow to new sources and processing it with minimal impact, and we are standardizing on ScyllaDB for NoSQL storage for all the new tools.

Child Rescue Coalition is constantly sharing information about potential and actual crimes involving minors. Data privacy, retention and governance must be paramount. What special considerations do you have in that regard?

Our systems have been challenged in court over the years, and time after time we have proven, even with independent third-party validators, that our tools do not invade privacy or use any nefarious or invasive ways to obtain the information we gather, and that the processes law enforcement used with our tools were appropriate.

How can readers help if they want to support the Child Rescue Coalition?

We are a 501(c)(3) non-profit organization, meaning all received donations or funding could be tax-deductible. At the individual level, the easiest way is to become a coalition club member or make your contributions using this page: https://childrescuecoalition.org/donate/

Corporations can also become corporate sponsors, and/or fund specific projects or activities, or donate hardware, software, or services.

All funds are used primarily to bring awareness, training and certification of new investigators on the use of our tools, at no cost, to underserved communities. Funds are also used for the development and maintenance of new software as new online threats emerge. You can also follow us or find out more about our organization via these links:

Thank you for all you do. Your session at the Scylla Summit will certainly be riveting.


Finding Bugs in Cassandra’s Internals with Property-based Testing

As of September 1st, the Apache Cassandra community has shifted the focus of Cassandra 4.0 development from new feature work to testing, validation, and hardening, with the goal of releasing a stable 4.0 that every Cassandra user, from small deployments to large corporations, can deploy with confidence. There are several projects and methodologies that the community is undertaking to this end. One of these is the adoption of property-based testing, which was previously introduced here. This post will take a look at a specific use of this approach and how it found a bug in a new feature meant to ensure data integrity between the client and Cassandra.

Detecting Corruption is a Property

In this post, we demonstrate property-based testing in Cassandra through the integration of the QuickTheories library introduced as part of the work done for CASSANDRA-13304.

This ticket modifies the framing of Cassandra’s native client protocol to include checksums in addition to the existing, optional compression. Clients can opt-in to this new feature to retain data integrity across the many hops between themselves and Cassandra. This is meant to address cases where hardware and protocol level checksums fail (due to underlying hardware issues) — a case that has been seen in production. A description of the protocol changes can be found in the ticket but for the purposes of this discussion the salient part is that two checksums are added: one that covers the length(s) of the data (if compressed there are two lengths), and one for the data itself. Before merging this feature, property-based testing using QuickTheories was used to uncover a bug in the calculation of the checksum over the lengths. This bug could have led to silent corruption at worst or unexpected errors during deserialization at best.

The test used to find this bug is shown below. This example tests the property that when a frame is corrupted, that corruption should be caught by checksum comparison. The test is wrapped inside of a standard JUnit test case but, once called by JUnit, execution is handed over to QuickTheories to generate and execute hundreds of examples. These examples are dictated by the types of input that should be generated (the arguments to forAll). The execution of each individual example is done by checkAssert and its argument, the roundTripWithCorruption function.

@Test
public void corruptionCausesFailure()
{
    qt().withExamples(500)
        .forAll(inputWithCorruptablePosition(),
                integers().between(0, Byte.MAX_VALUE).map(Integer::byteValue),
                compressors(),
                checksumTypes())
        .checkAssert(this::roundTripWithCorruption);
}

The roundTripWithCorruption function is a generalization of a unit test that worked similarly but for a single case. It is given an input to transform and a position in the transformed output to insert corruption, as well as what byte to write to the corrupted position. The additional arguments (the compressor and checksum type) are used to ensure coverage of Cassandra’s various compression and checksumming implementations.

private void roundTripWithCorruption(Pair<String, Integer> inputAndCorruptablePosition,
                                     byte corruptionValue,
                                     Compressor compressor,
                                     ChecksumType checksum) {
    String input = inputAndCorruptablePosition.left;
    ByteBuf expectedBuf = Unpooled.wrappedBuffer(input.getBytes());
    int byteToCorrupt = inputAndCorruptablePosition.right;
    ChecksummingTransformer transformer = new ChecksummingTransformer(checksum, DEFAULT_BLOCK_SIZE, compressor);
    ByteBuf outbound = transformer.transformOutbound(expectedBuf);

    // make sure we're actually expecting to produce some corruption
    if (outbound.getByte(byteToCorrupt) == corruptionValue)
        return;

    if (byteToCorrupt >= outbound.writerIndex())
        return;
 
    try {
        int oldIndex = outbound.writerIndex();
        outbound.writerIndex(byteToCorrupt);
        outbound.writeByte(corruptionValue);
        outbound.writerIndex(oldIndex);
        ByteBuf inbound = transformer.transformInbound(outbound, FLAGS);

        // verify that the content was actually corrupted
        expectedBuf.readerIndex(0);
        Assert.assertEquals(expectedBuf, inbound);
    } catch(ProtocolException e) {
       return;
    }
}

The remaining piece is how those arguments are generated — the arguments to forAll mentioned above. Each argument is a function that returns an input generator. For each example, an input is pulled from each generator and passed to roundTripWithCorruption. The compressors() and checksumTypes() generators aren’t copied here. They can be found in the source and are based on built-in generator methods, provided by QuickTheories, that select a value from a list of values. The second argument, integers().between(0, Byte.MAX_VALUE).map(Integer::byteValue), generates non-negative numbers that fit into a single byte. These numbers will be passed as the corruptionValue argument.
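
For reference, both follow the same pick-from-a-fixed-list shape. The snippet below is a plausible reconstruction rather than the actual test code: the static imports are assumed to match the other snippets, and the specific compressor instances named here are illustrative assumptions only.

private Gen<ChecksumType> checksumTypes()
{
    // both Adler32 and CRC32 show up in the falsifying examples later in this post
    return arbitrary().enumValues(ChecksumType.class);
}

private Gen<Compressor> compressors()
{
    // null stands for "no compression"; the concrete compressor instances are assumptions
    return arbitrary().pick(Arrays.asList(null, LZ4Compressor.INSTANCE, SnappyCompressor.INSTANCE));
}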

The inputWithCorruptablePosition generator, copied below, generates strings to use as input to the transformation function and a position within the output byte stream to corrupt. Because compression prevents knowledge of the output size of the frame, the generator tries to choose a somewhat reasonable position to corrupt by limiting the choice to the size of the generated string (it’s uncommon for compression to generate a larger string and the implementation discards the compressed value if it does). It also avoids corrupting the first two bytes of the stream which are not covered by a checksum and therefore can be corrupted without being caught. The function above ensures that corruption is actually introduced and that corrupting a position larger than the size of the output does not occur.

private Gen<Pair<String, Integer>> inputWithCorruptablePosition()
{
    return inputs().flatMap(s -> integers().between(2, s.length() + 2)
                   .map(i -> Pair.create(s, i)));
}

With all those pieces in place, if the test were run before the bug were fixed, it would fail with the following output.

java.lang.AssertionError: Property falsified after 2 example(s) 
Smallest found falsifying value(s) :-
{(c,3), 0, null, Adler32}

Cause was :-
java.lang.IndexOutOfBoundsException: readerIndex(10) + length(16711681) exceeds writerIndex(15): UnpooledHeapByteBuf(ridx: 10, widx: 15, cap: 54/54)
    at io.netty.buffer.AbstractByteBuf.checkReadableBytes0(AbstractByteBuf.java:1401)
    at io.netty.buffer.AbstractByteBuf.checkReadableBytes(AbstractByteBuf.java:1388)
    at io.netty.buffer.AbstractByteBuf.readBytes(AbstractByteBuf.java:870)
    at org.apache.cassandra.transport.frame.checksum.ChecksummingTransformer.transformInbound(ChecksummingTransformer.java:289)
    at org.apache.cassandra.transport.frame.checksum.ChecksummingTransformerTest.roundTripWithCorruption(ChecksummingTransformerTest.java:106)
    ...
Other found falsifying value(s) :- 
{(c,3), 0, null, CRC32}
{(c,3), 1, null, CRC32}
{(c,3), 9, null, CRC32}
{(c,3), 11, null, CRC32}
{(c,3), 36, null, CRC32}
{(c,3), 50, null, CRC32}
{(c,3), 74, null, CRC32}
{(c,3), 99, null, CRC32}

Seed was 179207634899674

The output shows more than a single failing example. This is because QuickTheories, like most property-based testing libraries, comes with a shrinker, which performs the task of taking a failure and minimizing its inputs. This aids in debugging because there are multiple failing examples to look at often removing noise in the process. Additionally, a seed value is provided so the same series of tests and failures can be generated again — another useful feature when debugging. In this case, the library generated an example that contains a single byte of input, which will corrupt the fourth byte in the output stream by setting it to zero, using no compression, and using Adler32 for checksumming. It can be seen from the other failing examples that using CRC32 also fails. This is due to improper calculation of the checksum, regardless of the algorithm. In particular, the checksum was only calculated over the least significant byte of each length rather than all eight bytes. By corrupting the fourth byte of the output stream (the first length’s second-most significant byte not covered by the calculation), an invalid length is read and later used.
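
To see how a single byte per length could slip through, consider the purely illustrative java.util.zip example below (it is not the actual Cassandra patch): CRC32.update(int) consumes only the low eight bits of its argument, which is conceptually what the buggy calculation did, whereas the intended calculation must feed the checksum every byte of both lengths.

import java.nio.ByteBuffer;
import java.util.zip.CRC32;

public class LengthChecksumIllustration
{
    public static void main(String[] args)
    {
        // example lengths whose high bytes are non-zero, so the difference is visible
        int uncompressedLength = 0x00FF0001;
        int compressedLength = 0x00000F0F;

        CRC32 buggy = new CRC32();
        // conceptually the bug: update(int) checksums only the least significant byte,
        // leaving the high bytes of each length unprotected
        buggy.update(uncompressedLength);
        buggy.update(compressedLength);

        CRC32 fixed = new CRC32();
        // the intended behaviour: checksum every byte of both lengths
        fixed.update(ByteBuffer.allocate(8)
                               .putInt(uncompressedLength)
                               .putInt(compressedLength)
                               .array());

        System.out.printf("buggy=%d fixed=%d%n", buggy.getValue(), fixed.getValue());
    }
}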

Where to Find More

Property-based testing is a broad topic, much of which is not covered by this post. In addition to Cassandra, it has been used successfully in several places including car operating systems and suppliers’ products, GNOME GLib, distributed consensus, and other distributed databases. It can also be combined with other approaches such as fault-injection and memory leak detection. Stateful models can also be built to generate a series of commands instead of running each example on one generated set of inputs. Our goal is to evangelize this approach within the Cassandra developer community and encourage more testing of this kind as part of our work to deliver the most stable major release of Cassandra yet.

Scylla Summit Preview: Grab and Scylla – Driving Southeast Asia Forward


In the run-up to Scylla Summit 2018, we’ll be featuring our speakers and providing sneak peeks at their presentations. This interview in our ongoing series is with Aravind Srinivasan, Staff Software Engineer at Grab, Southeast Asia’s leading on-demand and same-day logistics company. His presentation at Scylla Summit will be on Grab and Scylla: Driving Southeast Asia Forward.

Aravind, before we get into the details of your talk, we’d like to get to know you a little better. Outside of technology, what do you enjoy doing? What are your interests and hobbies?

I love hiking and biking. But now my (and my wife’s) world revolves around our two-year-old son, who keeps us busy. 😃

How did you end up getting into database technologies? What path led you to getting hands-on with Scylla?

I started my career working on filesystems (for Isilon Systems, now EMC/Dell), so I was always close to storage. After I decided to get out of the kernel world and into the services world, I moved to Uber, where I was fortunate to work on a team building a queueing system from scratch. We used Cassandra as our metadata store, which worked OK for a while before we ran into lots of operational headaches. After I moved from Uber to the Data Platform team at Grab, we needed a high-performing, low-overhead metadata store; we bumped into ScyllaDB at that point, and that’s where our relationship with ScyllaDB started.

What will you cover in your talk?

This talk will give an overview of how Grab uses ScyllaDB, the reasons we chose ScyllaDB over the alternatives for these use cases, and our experience with ScyllaDB so far.

Can you describe Grab’s data management environment for us? What other technologies are you using? What does Scylla need to connect and work with?

First and foremost, Grab is an AWS shop, but our predominant use case for ScyllaDB is with Kafka, which is the ingestion point for ScyllaDB. A couple of use cases also have a Spark job that talks to ScyllaDB directly.

What is unique about your use case?

The most unique characteristic of our use case is the scale-up pattern of the traffic volume (TPS). Traffic generally spikes up quickly, so the store backing our use case should be able to scale fast and handle bursts.

Is there anything Scylla Summit attendees need to know in order to get the most out of your talk? What technology or tools should they be familiar with?

Kafka, stream processing, and some terminology like TPS, QPS, p99 and other stats.

Thanks, Aravind. By the way, that seems like a perfect segue to highlight that we will have Confluent’s Hojjat Jafarpour talking about Kafka.

If you are interested in learning more about how Grab scaled their hypergrowth across Asia, make sure you register for Scylla Summit today!

 


Registration Now Open for DataStax Accelerate!

I’m very excited to announce that registration is now open for DataStax Accelerate—the world’s premier Apache Cassandra™ conference. The call for papers is also open.

Today’s enterprises are facing massive shifts in the use of real-time data to create differentiated experiences through internally and externally facing applications. Simple, single-cloud deployments are insufficient for these globally distributed and scalable applications, which require a hybrid and multi-cloud approach.

This demands new ways to think about data management and how enterprises can make the most of their data while still keeping it portable and secure.

To discuss these trends and the technologies that underpin them, we are bringing together the best minds and most cutting-edge innovations in distributed database technology for DataStax Accelerate.  This premier conference will be held May 21-23, 2019 at the Gaylord National Resort and Convention Center in Oxon Hill, Maryland, just outside Washington, D.C.

Mark your calendar—because you will not want to miss it.

DataStax Accelerate will feature separate executive and technical tracks, as well as training, hands-on sessions, an exhibitor pavilion, networking, and a full presentation agenda from DataStax executives, customers, and partners.

You’ll be able to learn from your peers, industry experts, and thought leaders on how Apache Cassandra and DataStax can help you build and deploy game-changing enterprise applications and easily scale your data needs to fit your company’s growth and today’s hybrid and multi-cloud world.

Additionally, you can learn from:

  • Deep technical sessions for developers, administrators, and architects on DataStax and Apache Cassandra internals, theory, and operations
  • An informative and engaging showcase of some of the world’s largest enterprise companies that are running their strategic business initiatives with DataStax
  • In-depth and hands-on workshops around DataStax Enterprise 6.0 advanced technologies, like Graph, Solr, Kafka, and Spark
  • Face-time with DataStax engineers in various workshops and breakout sessions
  • Designer and innovator sessions on how to accelerate your hybrid and multi-cloud deployments using DataStax’s unique masterless database architecture that is always-on, active everywhere, and infinitely scalable

Register now!

Custom commands in cstar

Welcome to the next part of the cstar post series. The previous post introduced cstar and showed how it can run simple shell commands using various execution strategies. In this post, we will teach you how to build more complex custom commands.

Basic Custom Commands

Out of the box, cstar comes with three commands:

$ cstar
usage: cstar [-h] {continue,cleanup-jobs,run} ...

cstar

positional arguments:
  {continue,cleanup-jobs,run}
    continue            Continue a previously created job (*)
    cleanup-jobs        Cleanup old finished jobs and exit (*)
    run                 Run an arbitrary shell command

Custom commands allow extending these three with anything one might find useful. Adding a custom command to cstar is as easy as placing a file in ~/.cstar/commands or /etc/cstar/commands. For example, we can create ~/.cstar/commands/status that looks like this:

#!/usr/bin/env bash
nodetool status

With this file in place, cstar now features a brand new status command:

$ cstar
usage: cstar [-h] {continue,cleanup-jobs,run,status} ...

cstar

positional arguments:
  {continue,cleanup-jobs,run,status}
    continue            Continue a previously created job (*)
    cleanup-jobs        Cleanup old finished jobs and exit (*)
    run                 Run an arbitrary shell command
    status

A command like this allows us to stop using:

cstar run --command "nodetool status" --seed-host <host_ip>

And use a shorter version instead:

cstar status --seed-host <host_ip>

We can also declare the command description and default values for cstar’s options in the command file. We can do this by including commented lines with a special prefix. For example, we can include the following lines in our ~/.cstar/commands/status file:

#!/usr/bin/env bash
# C* cluster-parallel: true
# C* dc-parallel: true
# C* strategy: all
# C* description: Run nodetool status

nodetool status

Once we do this, the status command will show up with a proper description in cstar’s help, and running cstar status --seed-host <host_ip> will be equivalent to:

cstar status --seed-host <host_ip> --cluster-parallel --dc-parallel --strategy all

When cstar begins the execution of a command, it prints a unique ID for the job being run. This ID is needed for resuming a job, but more on that later. We also need the job ID to examine the output of the commands, which can be found in:

~/.cstar/jobs/<job_id>/<hostname>/out
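
A minimal sketch of inspecting a finished job, with the job ID and hostname as placeholders:

# list past jobs (newest first), then read the output captured for one host
ls -lt ~/.cstar/jobs/
cat ~/.cstar/jobs/<job_id>/<hostname>/out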

Parametrized Custom Commands

When creating custom commands, cstar allows declaring custom arguments as well. We will explain this feature by introducing a command that deletes snapshots older than a given number of days.

We will create a new file, ~/.cstar/commands/clear-snapshots, that will start like this:

#!/usr/bin/env bash
# C* cluster-parallel: true
# C* dc-parallel: true
# C* strategy: all
# C* description: Clear snapshots older than given number of days
# C* argument: {"option":"--days", "name":"DAYS", "description":"Snapshots older than this many days will be deleted", "default":"7", "required": false}

The new element here is the last line starting with # C* argument:. Upon seeing this prefix, cstar will parse the remainder of the line as a JSON payload describing the custom argument. In the case above, cstar will:

  • Use --days as the name of the argument.
  • Save the value of this argument into a variable named DAYS. We will see how to access this in a bit.
  • Associate a description with this argument.
  • Use 7 as a default value.
  • Not require this option to be provided.

With this file in place, cstar already features the command in its help:

$ cstar
usage: cstar [-h] {continue,cleanup-jobs,run,status,clear-snapshots} ...

cstar

positional arguments:
  {continue,cleanup-jobs,run,clear-snapshots}
    continue            Continue a previously created job (*)
    cleanup-jobs        Cleanup old finished jobs and exit (*)
    run                 Run an arbitrary shell command
    status              Run nodetool status
    clear-snapshots     Clear snapshots older than given number of days
    

$ cstar clear-snapshots --help
usage: cstar clear-snapshots [-h] [--days DAYS]
                             [--seed-host [SEED_HOST [SEED_HOST ...]]]
                             ...
                             <other default options omitted>

optional arguments:
 -h, --help            show this help message and exit
 --days DAYS
                       Snapshots older than this many days will be deleted
 --seed-host [SEED_HOST [SEED_HOST ...]]
                       One or more hosts to use as seeds for the cluster
 ...
 <other default options omitted>

Now we need to add the command which will actually clear the snapshots. This command needs to do three things:

  • Find the snapshots that are older than a given number of days.
    • We will use the -mtime filter of the find utility:
    • find /var/lib/cassandra/data/*/*/snapshots/ -mtime +"$DAYS" -type d
    • Note we are using "$DAYS" to reference the value of the custom argument.
  • Extract the snapshot names from the findings.
    • find gives us absolute paths to the directories; the snapshot name is the last portion of each path. We also make sure to keep each snapshot name only once:
    • sed -e 's#.*/##' | sort -u
  • Invoke nodetool clearsnapshot -t <snapshot_name> to clear each of the snapshots.

Putting this all together, the clear-snapshots file will look like this:

#!/usr/bin/env bash
# C* cluster-parallel: true
# C* dc-parallel: true
# C* strategy: all
# C* description: Clear up snapshots older than given number of days
# C* argument: {"option":"--days", "name":"DAYS", "description":"Snapshots older than this many days will be deleted", "default":"7", "required": false}

find /var/lib/cassandra/data/*/*/snapshots/ -mtime +"$DAYS" -type d |\
sed -e 's#.*/##' |\
sort -u |\
while read line; do nodetool clearsnapshot -t "${line}"; done

We can now run the clear-snapshots command like this:

$ cstar clear-snapshots --days 2 --seed-host <seed_host>

Complex Custom Commands

One of the main reasons we consider cstar so useful is that custom commands can be arbitrary shell scripts, not just the one-liners we have seen so far. To illustrate this, we are going to share two relatively complicated commands.

Upgrading Cassandra Version

The first command will cover a rolling upgrade of the Cassandra version. Generally speaking, the upgrade should happen as quickly as possible and with as little downtime as possible. This is the ideal application of cstar’s topology strategy: it will execute the upgrade on as many nodes as possible while ensuring a quorum of replicas stays up at any moment. Then, the upgrade of a node should follow these steps:

  • Create snapshots to allow rollback if the need arises.
  • Upgrade the Cassandra installation.
  • Restart the Cassandra process.
  • Check the upgrade happened successfully.

Clearing the snapshots or upgrading SSTables should not be part of the upgrade itself. Snapshots are just hardlinks, so they will not consume excessive space, and Cassandra (in most cases) can operate with older SSTable versions. Once all nodes are upgraded, these actions are easy enough to perform with dedicated cstar commands.
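
As a sketch of such a dedicated follow-up, a hypothetical ~/.cstar/commands/upgrade-sstables file could reuse the same custom-command format to rewrite SSTables once the whole cluster runs the new version:

#!/usr/bin/env bash
# C* cluster-parallel: true
# C* dc-parallel: true
# C* strategy: topology
# C* description: Rewrite SSTables to the current format after a version upgrade

nodetool upgradesstables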

The ~/.cstar/commands/upgrade command might look like this:

#!/usr/bin/env bash
# C* cluster-parallel: true
# C* dc-parallel: true
# C* strategy: topology
# C* description: Upgrade Cassandra package to given target version
# C* argument: {"option":"--snapshot-name", "name":"SNAPSHOT_NAME", "description":"Name of pre-upgrade snapshot", "default":"preupgrade", "required": false}
# C* argument: {"option":"--target-version", "name":"VERSION", "description":"Target version", "required": true}

# -x prints the executed commands to standard output
# -e fails the entire script if any of the commands fails
# -u fails the script if any of the variables is not bound
# -o pipefail instructs the interpreter to return the right-most non-zero status of a piped command in case of failure
set -xeuo pipefail

# exit if a node is already on the target version
if [[ $(nodetool version) == *$VERSION ]]; then
  exit 0
fi

# clear any stale snapshot with the same name, then take a fresh one to allow rollback in case of problems
nodetool clearsnapshot -t "$SNAPSHOT_NAME"
nodetool snapshot -t "$SNAPSHOT_NAME"

# upgrade Cassandra version
sudo apt-get install -y cassandra="$VERSION"

# gently stop the cassandra process
nodetool drain && sleep 5 && sudo service cassandra stop

# start the Cassandra process again
sudo service cassandra start

# wait for Cassandra to start answering JMX queries
for i in $(seq 60); do
    if nodetool version > /dev/null 2>&1; then
        break
    fi
    sleep 1s
done

# fail if the upgrade did not happen
if ! [[ $(nodetool version) == *$VERSION ]]; then
  exit 1
fi

When running this command, we can be extra-safe and use the --stop-after option:

$ cstar upgrade --seed-host <host_name> --target-version 3.11.2 --stop-after 1

This will instruct cstar to upgrade only one node and exit the execution. Once that happens, we can take our time to inspect the node to see if the upgrade went smoothly. When we are confident enough, we can resume the command. Output of each cstar command starts with a highlighted job identifier, which we can use with the continue command:

$ cstar continue <job_id>

Changing Compaction Strategy

The second command we would like to share performs a compaction strategy change in a rolling fashion.

Compaction configuration is a table property: changing it requires executing an ALTER TABLE CQL statement, which takes effect immediately across the cluster. This means that once we issue the statement, each node will react to the compaction change. The exact reaction depends on the change, but it generally translates to increased compaction activity. This is not always desirable: compaction can be an intrusive process and affect cluster performance.

Thanks to CASSANDRA-9965, since Cassandra 2.1.9 there is a way of altering the compaction configuration of a single node via JMX. We can set the CompactionParametersJson MBean attribute to change the compaction configuration the node uses. Once we know how to change one node, we can have cstar do the same across the whole cluster.
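
Before wiring this into a cstar command, the setting can be checked on a single node by reading the MBean attribute back with jmxterm, mirroring the set call used in the script below (a sketch; the jar path, keyspace and table are placeholders):

# read the current per-node value to confirm what the node is using
echo "get -b org.apache.cassandra.db:columnfamily=<table>,keyspace=<keyspace>,type=ColumnFamilies CompactionParametersJson" \
    | java -jar <jmxterm_jar> --url 127.0.0.1:7199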

Once we change the compaction settings, we should also manage the aftermath. Even though the change is effective immediately, it might take a very long time until each SSTable undergoes a newly configured compaction. The best way of doing this is to trigger a major compaction and wait for it to finish. After a major compaction, all SSTables are organised according to the new compaction settings and there should not be any unexpected compaction activity afterwards.

While cstar is excellent at checking which nodes are up or down, it does not check other aspects of node health, and it has no ability to monitor compaction activity. Therefore, we should include the wait for the major compaction in the command we are about to build. The command will follow these steps:

  • Stop any compactions that are currently happening.
  • Set the CompactionParametersJson MBean to the new value.
    • We will use jmxterm for this and assume the JAR file is already present on the nodes.
  • Run a major compaction to force Cassandra to organise SSTables according to the new setting and make cstar wait for the compactions to finish.
    • This step is not mandatory. Cassandra would re-compact the SSTables eventually.
    • Doing a major compaction will cost extra resources and possibly impact the node’s performance. We do not recommend doing this at all nodes in parallel.
    • We are taking advantage of the topology strategy which will guarantee a quorum of replicas free from this load at any time.

The ~/.cstar/commands/change-compaction command might look like this:

#!/usr/bin/env bash
# C* cluster-parallel: true
# C* dc-parallel: true
# C* strategy: topology
# C* description: Switch compaction strategy using jmxterm and perform a major compaction on a specific table
# C* argument: {"option":"--keyspace-name", "name":"KEYSPACE", "description":"Keyspace containing the target table", "required": true}
# C* argument: {"option":"--table", "name":"TABLE", "description":"Table to switch the compaction strategy on", "required": true}
# C* argument: {"option":"--compaction-parameters-json", "name":"COMPACTION_PARAMETERS_JSON", "description":"New compaction parameters", "required": true}
# C* argument: {"option":"--major-compaction-flags", "name":"MAJOR_COMPACTION_FLAGS", "description":"Flags to add to the major compaction command", "default":"", "required": false}
# C* argument: {"option":"--jmxterm-jar-location", "name":"JMXTERM_JAR", "description":"jmxterm jar location on disk", "required": true}

set -xeuo pipefail

echo "Switching compaction strategy on $KEYSPACE.$TABLE"
echo "Stopping running compactions"
nodetool stop COMPACTION
echo "Altering compaction through JMX..."
echo "set -b org.apache.cassandra.db:columnfamily=$TABLE,keyspace=$KEYSPACE,type=ColumnFamilies CompactionParametersJson $COMPACTION_PARAMETERS_JSON"  | java -jar $JMXTERM_JAR --url 127.0.0.1:7199 -e

echo "Running a major compaction..."
nodetool compact ${KEYSPACE} ${TABLE} $MAJOR_COMPACTION_FLAGS

The command requires options specifying which keyspace and table to apply the change to. The jmxterm location and the new value for the compaction parameters are two more required arguments. The command also allows passing flags to the major compaction. This is useful when switching to SizeTieredCompactionStrategy, where the -s flag will instruct Cassandra to produce several size-tiered files instead of a single big file.

The nodetool compact command does not return until the major compaction finishes, so the execution on a node will not complete until then. Consequently, cstar will see this long execution and dutifully wait for it to complete before moving on to other nodes.

Here is an example of running this command:

$ cstar change-compaction --seed-host <host_name> --keyspace tlp_stress --table KeyValue --jmxterm-jar-location /usr/share/jmxterm-1.0.0-uber.jar --compaction-parameters-json "{\"class\":\"LeveledCompactionStrategy\",\"sstable_size_in_mb\":\"120\"}"

This command also benefits from the --stop-after option. Moreover, once all nodes are changed, we should not forget to persist the schema change by issuing the actual ALTER TABLE statement.
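
Using the example invocation above, persisting that change could look like the following cqlsh call (a sketch; the host, keyspace and table names are taken from the example and may need adjusting for your cluster):

cqlsh <host_name> -e "ALTER TABLE tlp_stress.KeyValue WITH compaction = {'class': 'LeveledCompactionStrategy', 'sstable_size_in_mb': 120};"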

Conclusion

In this post we talked about cstar and its feature of adding custom commands. We have seen:

  • How to add a simple command to execute nodetool status on all nodes at once.
  • How to define custom parameters for our commands, which allowed us to build a command for deleting old snapshots.
  • That the custom commands are essentially regular bash scripts and can include multiple statements. We used this feature to do a safe and fast Cassandra version upgrade.
  • That the custom commands can call external utilities such as jmxterm, which we used to change compaction strategy for a table in a rolling fashion.

In the next post, we are going to look into cstar’s cousin called cstarpar. cstarpar differs in the way commands are executed on remote nodes and allows for heavier operations such as rolling reboots.

How to Tweak the Number of num_tokens (vnodes) in Live Cassandra Cluster

Some clients have asked us to change the num_tokens value as their requirements change.
For example, a lower num_tokens setting is recommended when using DSE Search.
The most important thing during this process is that the cluster stays up, and is healthy and fast. Anything we do needs to be deliberate and safe, as we have production traffic flowing through.

The process consists of adding a new DC with the changed num_tokens value, decommissioning the old DCs one by one, and letting Cassandra’s automatic mechanisms distribute the existing data onto the new nodes.

The procedure below assumes you have two datacenters, DC1 and DC2.

Procedure:


1. Run Repair To Keep Data Consistent Across Cluster

Make sure to run a full repair with nodetool repair. More detail about repairs can be found here. This ensures that all data is propagated from the datacenter which is being decommissioned.
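
A minimal sketch of what to run on each node, one node at a time (keyspace restrictions and options should be adjusted to your cluster and Cassandra version):

# run on every node in turn; --full forces a full (non-incremental) repair
nodetool repair --full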


2. Add New DC DC3 And Decommission Old Datacenter DC1

Step 1: Download and install the same Cassandra version as the other nodes in the cluster, but do not start it.

How to stop Cassandra

Note: Don’t stop any node in DC1 until DC3 has been added.

If you used the Debian package, Cassandra starts automatically. You must stop the node and clear the data.
Stop the node:
Packaged installations: $ sudo service cassandra stop
Tarball installations: nodetool stopdaemon
If for some reason the previous command doesn’t work, find the Cassandra Java process ID (PID), and then kill the process using its PID number:
$ ps auwx | grep cassandra
$ sudo kill pid

Step 2: Clear the data from the default directories once the node is down.

sudo rm -rf /var/lib/cassandra/*
Step 3: Configure the node’s settings to match the other nodes in the cluster.
Properties that should be set by comparing to the other nodes.
cassandra.yaml:
  • seeds: This should include nodes from a live DC because the new nodes have to stream data from them.
  • endpoint_snitch: Keep it the same as the nodes in the live DCs.
  • cluster_name: The same as the nodes in the other live DCs.
  • num_tokens: The number of vnodes required.
  • initial_token: Make sure this is commented out.

Set the local parameters below:

  • auto_bootstrap: false
  • listen_address: Local to the node
  • rpc_address: Local to the node
  • data_file_directories: Local to the node
  • saved_caches_directory: Local to the node
  • commitlog_directory: Local to the node

cassandra-rackdc.properties: Set the parameters for the new datacenter and rack:

  • dc: “dc name”
  • rack: “rack name”
Set the below configuration files, as needed:
cassandra-env.sh
logback.xml
jvm.options
Step 4: Start Cassandra on each node, one by one.
Step 5: Now that all nodes are up and running, alter the keyspaces to set the replication factor for the new datacenter as well.
ALTER KEYSPACE Keyspace_name WITH REPLICATION = {'class' : 'NetworkTopologyStrategy', 'dc1' : 3, 'dc2' : 3, 'dc3' : 3};

Step 6: Finally, now that the nodes are up and empty, we should run “nodetool rebuild” on each node to stream data from the existing datacenter.

nodetool rebuild "Existing DC Name"

Step 7: Remove "auto_bootstrap: false" from each node's cassandra.yaml, or set it to true, once the process is complete.

auto_bootstrap: true

Decommission DC1:

Now that we have added DC3 into the cluster, it’s time to decommission DC1. However, before decommissioning a datacenter in a production environment, the first step should be to prevent clients from connecting to it and to ensure reads and writes do not query this datacenter.
Step 1: Prevent clients from communicating with DC1
  • First of all, ensure that the clients point to an existing datacenter.
  • Configure DCAwareRoundRobinPolicy with the local datacenter so that no requests are routed to DC1.
Make sure to change QUORUM consistency level to LOCAL_QUORUM and ONE to LOCAL_ONE.
Step 2: ALTER KEYSPACE to not have a replica in decommissioning DC.
ALTER KEYSPACE "Keyspace_name" WITH REPLICATION = {'class' : 'NetworkTopologyStrategy', 'dc2' : 3, 'dc3' : 3};
Step 3: Decommission each node using nodetool decommission.
nodetool decommission
Step 4: Remove all data from the data, saved caches, and commitlog directories after all nodes are decommissioned to reclaim disk space.
sudo rm -rf <data_directory> <saved_caches_directory> <commitlog_directory>
Step 5: Finally, stop Cassandra as described in Step 1.
Step 6: Decommission each node in DC2 by following the above procedure.

3. Add New DC DC4 And Decommission Old DC2


Hopefully, this blog post will help you understand the procedure for changing the number of vnodes on a live cluster. Keep in mind that the time taken by the bootstrapping/rebuilding/decommissioning process depends on the data size.

Introduction to cstar

Spotify is a long-time user of Apache Cassandra at very large scale. It is also a creative company that tries to open source most of the tools it builds for internal needs. They released Cassandra Reaper a few years ago to give the community a reliable way of repairing clusters, which we now love and actively maintain. Their latest open sourced tool for Cassandra is cstar, a parallel-ssh equivalent (distributed shell) that is Cassandra topology aware. At TLP, we love it already and are sure you soon will too.

What is cstar?

Running distributed databases requires good automation, especially at scale. But even with small clusters, running the same command everywhere or rolling restarting a cluster can quickly get tedious. Sure, you can use tools like dsh and pssh, but they run commands on all servers at the same time (or on just a given number of them), and you need to keep a local list of the nodes to connect to. Each time your cluster scales out or in, or nodes get replaced, you need to update the list. If you forget to update it, you may, without noticing, run commands that don’t touch the whole cluster.

Not all commands can run on all nodes at the same time either. For instance, upgrading sstables, running cleanup, running a major compaction, or restarting nodes will have an impact on either latencies or availability, and such operations require more granularity of execution.

Cstar doesn’t suffer any of the above problems. It will discover the topology of the cluster dynamically and tune concurrency based on replication settings. In addition, cstar will run from a single machine (not necessarily within the cluster) that has SSH access to all nodes in the cluster, and perform operations through SSH and SFTP. It requires no dependency, other than nodetool, to be installed on the Cassandra nodes.

Installing cstar

You’ll need to have Python 3 and pip3 installed on your server/laptop and then follow the README instructions which will, in the simplest case, boil down to:

pip3 install cstar

Running cstar

Cstar is built with Python 3 and offers a straightforward way to run simple commands or complex scripts on an Apache Cassandra cluster using a single contact point.

The following command, for example, will perform a rolling restart of Cassandra in the cluster, one node at a time using the one strategy:

cstar run --command="sudo service cassandra restart" --seed-host=<contact_point_ip> --strategy=one

During the execution, cstar will update progress with a clear and pleasant output:

 +  Done, up      * Executing, up      !  Failed, up      . Waiting, up
 -  Done, down    / Executing, down    X  Failed, down    : Waiting, down
Cluster: Test Cluster
DC: dc1
+....
....
....
DC: dc2
+....
....
....
DC: dc3
*....
....
....
2 done, 0 failed, 1 executing

If we want to perform cleanup with topology awareness, having only one replica at a time running the command for each token range (leaving a quorum of unaffected replicas at RF=3), we can use the default topology strategy:

cstar run --command="nodetool cleanup" --seed-host=<contact_point_ip> --strategy=topology

This way, we’ll have several nodes processing the command to minimize the overall time spent on the operation and still ensure low impact on latencies:

 +  Done, up      * Executing, up      !  Failed, up      . Waiting, up
 -  Done, down    / Executing, down    X  Failed, down    : Waiting, down
Cluster: Test Cluster
DC: dc1
****.
....
....
DC: dc2
++++.
****
....
DC: dc3
+****
....
....
5 done, 0 failed, 12 executing

Finally, if we want to run a command that doesn’t involve pressure on latencies and display the outputs locally, we can use strategy all and add the -v flag to display the command outputs:

cstar run --command="nodetool getcompactionthroughput" --seed-host=<contact_point_ip> --strategy=all -v

Which will give us the following output:

 +  Done, up      * Executing, up      !  Failed, up      . Waiting, up
 -  Done, down    / Executing, down    X  Failed, down    : Waiting, down
Cluster: Test Cluster
DC: dc1
*****
****
****
DC: dc2
*****
****
****
DC: dc3
*****
****
****
0 done, 0 failed, 39 executing
Host node1.mycompany.com finished successfully
stdout:
Current compaction throughput: 0 MB/s

Host node21.mycompany.com finished successfully
stdout:
Current compaction throughput: 0 MB/s

Host node10.mycompany.com finished successfully
stdout:
Current compaction throughput: 0 MB/s

...
...

Host node7.mycompany.com finished successfully
stdout:
Current compaction throughput: 0 MB/s

Host node18.mycompany.com finished successfully
stdout:
Current compaction throughput: 0 MB/s

 +  Done, up      * Executing, up      !  Failed, up      . Waiting, up
 -  Done, down    / Executing, down    X  Failed, down    : Waiting, down
Cluster: Test Cluster
DC: dc1
+++++
++++
++++
DC: dc2
+++++
++++
++++
DC: dc3
+++++
++++
++++
39 done, 0 failed, 0 executing
Job cff7f435-1b9a-416f-99e4-7185662b88b2 finished successfully

How cstar does its magic

When you run a cstar command it will first connect to the seed node you provided and run a set of nodetool commands through SSH.

First, nodetool ring will give it the cluster topology with the state of each node. By default, cstar will stop the execution if one node in the cluster is down or unresponsive. If you’re aware that nodes are down and want to run a command nonetheless, you can add the --ignore-down-nodes flag to bypass the check.

Then cstar will list the keyspaces using nodetool cfstats and build a map of the replicas for all token ranges for each of them. This allows it to identify which nodes contain the same token ranges, using nodetool describering, and apply the topology strategy accordingly. As shown before, the topology strategy will not allow two nodes that are replicas for the same token to run the command at the same time. If the cluster does not use vnodes, the topology strategy will run the command on one out of every RF nodes in the ring. If the cluster uses vnodes but does not use NetworkTopologyStrategy (NTS) for all keyspaces nor spread replicas across racks, chances are only one node will be able to run the command at once, even with the topology strategy. If both NTS and racks are in use, the topology strategy will run the command on a whole rack at a time.

By default, cstar will process the datacenters in parallel, so 2 nodes being replicas for the same tokens but residing in different datacenters can be processed at the same time.

Once the cluster has been fully mapped, execution will start in token order. Cstar is very resilient because it uploads a script to each remote node through SFTP and runs it using nohup. Each execution writes output (stdout and stderr) files along with the exit code, which cstar checks on regularly. If the command is interrupted on the server that runs cstar, it can be resumed safely, as cstar will first check whether the script is still running or has already finished on each node that hasn’t gone through yet.
Note that interrupting the command on the cstar host will not stop it on the remote nodes that are already running it.
Resuming an interrupted command is done simply by executing: cstar continue <job_id>

Each time a node finishes running the command cstar will check if the cluster health is still good and if the node is up. This way, if you perform a rolling restart and one of the nodes doesn’t come back up properly, although the exit code of the restart command is 0, cstar will wait indefinitely to protect the availability of the cluster. That is unless you specified a timeout on the job. In such a case, the job will fail. Once the node is up after the command has run, cstar will look for the next candidate node in the ring to run the command.

A few handy flags

Two steps execution

Some commands may be scary to run on the whole cluster and you may want to run them on a subset of the nodes first, check that they are in the expected state manually, and then continue the execution on the rest of the cluster. The --stop-after=<number-of-nodes> flag will do just that. Setting it to --stop-after=1 will run the command on a single node and exit. Once you’ve verified that you’re happy with the execution on that one node you can process the rest of the cluster using cstar continue <job_id>.
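
For example, a cautious cleanup across the cluster could be staged like this, reusing the flags introduced earlier (the command and strategy are just example values):

# run the command on a single node first, inspect the result, then resume the job
cstar run --command="nodetool cleanup" --seed-host=<contact_point_ip> --strategy=topology --stop-after=1
cstar continue <job_id>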

Retry failed nodes

Some commands might fail mid-course due to transient problems. By default, cstar continue <job_id> will halt if there is any failed execution in the history of the job. In order to resume the job and retry the execution on the failed nodes, add the --retry-failed flag.

Run the command on a specific datacenter

To process only a specific datacenter add the --dc-filter=<datacenter-name> flag. All other datacenters will be ignored by cstar.

Datacenter parallelism

By default, cstar will process the datacenters in parallel. If you want only one datacenter to process the command at a time, add the --dc-serial flag.

Specifying a maximum concurrency

You can forcefully limit the number of nodes running the command at the same time, regardless of topology, by adding the --max-concurrency=<number-of-nodes> flag.

Wait between each node

You may want to delay executions between nodes in order to give the cluster some room to recover from the command. The --node-done-pause-time=<time-in-seconds> flag allows you to specify a pause that cstar will apply before looking for the next node to run the command on.
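
Combined with a concurrency cap, a gentler rollout might look like the following sketch (the command, the cap of 3 nodes, and the 120-second pause are arbitrary example values):

# at most 3 nodes at a time, pausing 2 minutes after each node completes
cstar run --command="nodetool upgradesstables" --seed-host=<contact_point_ip> --strategy=topology --max-concurrency=3 --node-done-pause-time=120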

Run the command regardless of down nodes

If you want to run a command while nodes are down in the cluster add the --ignore-down-nodes flag to cstar.

Run on specific nodes only

If the command is meant to run on some specific nodes only you can use either the --host or the --host-file flags.

Control the verbosity of the output

By default, cstar will only display the progress of the execution as shown above in this post. To get the output of the remote commands, add the -v flag. If you want more verbosity and debug logging, use either -vv (very verbose) or -vvv (extra verbose).

You haven’t installed it already?

Cstar is the tool that all Apache Cassandra operators have been waiting for to manage clusters of all sizes. We were happy to collaborate closely with Spotify to help them open source it. It has been built and matured at one of the smartest and most successful start-ups in the world and was developed to manage hundreds of clusters of all sizes. It requires no dependencies to be installed on the cluster and uses SSH exclusively. Thus, it will comply nicely with any security policy, and you should be able to run it within minutes on any cluster of any size.

We love cstar so much that we are already working on integrating it with Reaper, as you can see in the following video:

We’ve seen in this blog post how to run simple one line commands with cstar, but there is much more than meets the eye. In an upcoming blog post we will introduce complex command scripts that perform operations like upgrading a Cassandra cluster, selectively clearing snapshots, or safely switching compaction strategies in a single cstar invocation.

Assassinate - A Command of Last Resort within Apache Cassandra

The nodetool assassinate command is meant specifically to remove cosmetic issues after nodetool decommission or nodetool removenode commands have been properly run and at least 72 hours have passed. It is not a command that should be run under most circumstances nor included in your regular toolbox. Rather the lengthier nodetool decommission process is preferred when removing nodes to ensure no data is lost. Note that you can also use the nodetool removenode command if cluster consistency is not the primary concern.

This blog post will explain:

  • How gossip works and why assassinate can disrupt it.
  • How to properly remove nodes.
  • When and how to assassinate nodes.
  • How to resolve issues when assassination attempts fail.

Gossip: Cassandra’s Decentralized Topology State

Since all topological changes happen within Cassandra’s gossip layer, before we discuss how to manipulate the gossip layer, let’s discuss the fundamentals of how gossip works.

From Wikipedia:

A gossip (communication) protocol is a procedure or process of computer-computer communication that is based on the way social networks disseminate information or how epidemics spread… Modern distributed systems often use gossip protocols to solve problems that might be difficult to solve in other ways, either because the underlying network has an inconvenient structure, is extremely large, or because gossip solutions are the most efficient ones available.

The gossip state within Cassandra is the decentralized, eventually consistent, agreed upon topological state of all nodes within a Cassandra cluster. Cassandra gossip heartbeats keep the topological gossip state updated, are emitted via each node in the cluster, and contain the following information:

  • What that node’s status is, and
  • What its neighbors’ statuses are.

When a node goes offline, its gossip heartbeat will no longer be emitted and the node’s neighbors will detect that it is offline (with help from an algorithm tuned by the phi_convict_threshold parameter defined in cassandra.yaml). The neighbors will then broadcast an updated status saying that the node is unavailable until further notice.

However, as soon as the node comes online, two things will happen:

  1. The revived node will:
    • Ask a neighbor node what the current gossip state is.
    • Modify the received gossip state to include its own status.
    • Assume the modified state as its own.
    • Broadcast the new gossip state across the network.
  2. A neighbor node will:
    • Discover the revived node is back online, either by:
      • First-hand discovery, or
      • Second-hand gossiping.
    • Update the received gossip state with the new information.
    • Modify the received gossip state to include its own status.
    • Assume the modified state as its own.
    • Broadcast the new gossip state across the network.

The above gossip protocol is responsible for the UN/DN, or Up|Down/Normal, statuses seen within nodetool status and is responsible for ensuring requests and replicas are properly routed to the available and responsible nodes, among other tasks.

Differences Between Assassination, Decommission, and Removenode

There are three main commands used to take a node offline: assassination, decommission, and removenode. Having the node be in the LEFT state ensures that each node’s gossip state will eventually be consistent and agree that:

  • The deprecated node has in fact been deprecated.
  • The deprecated node was deprecated after a given timestamp.
  • The deprecated token ranges are now owned by a new node.
  • Ideally, the deprecated node’s LEFT state will be automatically purged after 72 hours.

Underlying Actions of Decommission and Removenode on the Gossip Layer

When nodetool decommission and nodetool removenode commands are run, we are changing the state of the gossip layer to the LEFT state for the deprecated node.

Following the gossip protocol procedure in the previous section, the LEFT status will spread across the cluster as the new truth, since the LEFT status has a more recent timestamp than the previous status.

As more nodes begin to assimilate the LEFT status, the cluster will ultimately reach consensus that the deprecated node has LEFT the cluster.

Underlying Actions of Assassination

Unlike nodetool decommission and nodetool removenode above, when nodetool assassinate is used we are updating the gossip state to the LEFT state, then forcing an incrementation of the gossip generation number, and updating the application state to the LEFT state explicitly, which will then propagate as normal.

Removing Nodes: The “Proper” Way

When clusters grow large, an operator may need to remove a node, either due to hardware faults or horizontally scaling down the cluster. At that time, the operator will need to modify the topological gossip state with either a nodetool decommission command for online nodes or nodetool removenode for offline nodes.

Decommissioning a Node: While Saving All Replicas

The typical command to run on a live node that will be exiting the cluster is:

nodetool decommission

The nodetool decommission command will:

  • Extend certain token ranges within the gossip state.
  • Stream all of the decommissioned node’s data to the new replicas in a consistent manner (the opposite of bootstrap).
  • Report to the gossip state that the node has exited the ring.

While this command may take a while to complete, the extra time spent on this command will ensure that all owned replicas are streamed off the node and towards the new replica owners.

Removing a Node: And Losing Non-Replicated Replicas

Sometimes a node may be offline due to hardware issues and/or has been offline for longer than gc_grace_seconds within a cluster that ingests deletion mutations. In this case, the node needs to be removed from the cluster while remaining offline to prevent “zombie data” from propagating around the cluster due to already expired tombstones, as defined by the gc_grace_seconds window. In the case where the node will remain offline, the following command should be run on a neighbor node:

nodetool removenode $HOST_ID

The nodetool removenode command will:

  • Extend certain token ranges within the gossip state.
  • Report to the gossip state that the node has exited the ring.
  • Will NOT stream any of the decommissioned node’s data to the new replicas.

Increasing Consistency After Removing a Node

Typically a follow up repair is required in a rolling fashion around the data center to ensure each new replica has the required information:

nodetool repair -pr

Note that:

  • The above command will only repair replica consistencies if the replication factor is greater than 1 and one of the surviving nodes contains a replica of the data.
  • Running a rolling repair will generate disk, CPU, and network load proportional to the amount of data needing to be repaired.
  • Throttling a rolling repair by repairing only one node at a time may be ideal.
  • Using Reaper for Apache Cassandra can schedule, manage, and load balance the repair operations throughout the lifetime of the cluster.
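
As a rough sketch of such a rolling repair, assuming SSH access from an operator machine and hypothetical hostnames:

# repair the primary ranges of each surviving node, one node at a time
for host in node1 node2 node3; do
    ssh "$host" "nodetool repair -pr"
done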

How We Can Detect Assassination is Needed

In either of the above cases, sometimes the gossip state will continue to be out of sync. There will be echoes of past statuses that claim not only the node is still part of the cluster, but it may still be alive. And then missing. Intermittently.

When the gossip truth is continuously inconsistent, nodetool assassinate will resolve these inconsistencies, but should only be run after nodetool decommission or nodetool removenode have been run and at least 72 hours have passed.

These issues are typically cosmetic and appear as similar lines within the system.log:

2014-09-26 01:26:41,901 DEBUG [Reconnection-1] Cluster - Failed reconnection to /172.x.y.zzz:9042 ([/172.x.y.zzz:9042] Cannot connect), scheduling retry in 600000 milliseconds

Or may appear as UNREACHABLE within the nodetool describecluster output:

Cluster Information:
        Name: Production Cluster
       Snitch: org.apache.cassandra.locator.DynamicEndpointSnitch
       Partitioner: org.apache.cassandra.dht.Murmur3Partitioner
       Schema versions:
              65e78f0e-e81e-30d8-a631-a65dff93bf82: [172.x.y.z]
              UNREACHABLE: [172.x.y.zzz]

Sometimes you may find yourself looking even deeper and spot the deprecated node within nodetool gossipinfo months after removing the node:

/172.x.y.zzz
generation:1486750581
heartbeat:9999
STATUS:131099:LEFT,-5747706879722151680,1487011227852
TOKENS: not present

Note that the LEFT status should stick around for 72 hours to ensure all nodes come to the consensus that the node has been removed. So please don’t rush things if that’s the case. Again, it’s only cosmetic.

In all of these cases the truth may be slightly outdated and an operator may want to set the record straight with truth-based gossip states instead of cosmetic rumor-filled gossip states that include offline deprecated nodes.

How to Run the Assassination Command

Pre-2.2.0, operators had to use Java MBeans to assassinate a token (see below). Post-2.2.0, operators can use the nodetool assassinate command.

From an online node, run the command:

nodetool assassinate $IP_ADDRESS

Internally, the nodetool assassinate command will execute the unsafeAssassinateEndpoint command over JMX on the Gossiper MBean.

Java Mbeans Assassination

If using a version of Cassandra that does not yet have the nodetool assassinate command, we’ll have to rely on jmxterm.

You can use the following command to download jmxterm:

wget http://downloads.sourceforge.net/cyclops-group/jmxterm-1.0.0-uber.jar

Then we’ll want to use the Gossiper MBean and run the unsafeAssassinateEndpoint command:

echo "run -b org.apache.cassandra.net:type=Gossiper unsafeAssassinateEndpoint $IP_TO_ASSASSINATE" \
    | java -jar jmxterm-1.0.0-uber.jar -l $IP_OF_LIVE_NODE:7199

Both of the assassination commands will trigger the same MBean command over JMX; however, the nodetool assassinate command is preferred for its ease of use without additional dependencies.

Resolving Failed Assassination Attempts: And Why the First Attempts Failed

When clusters grow large enough, are geospatially distant enough, or are under intense load, the gossip state may become a bit out of sync with reality. Sometimes this causes assassination attempts to fail and while the solution may sound unnerving, it’s relatively simple once you consider how gossip states act and are maintained.

Because gossip states can be decentralized across high latency nodes, sometimes gossip state updates can be delayed and cause a variety of race conditions that may show offline nodes as still being online. Most times these race conditions will be corrected in relatively short-time periods, as tuned by the phi_convict_threshold within the cassandra.yaml (between a value of 8 for hardware and 12 for virtualized instances). In almost all cases the gossip state will converge into a global truth.

However, because gossip state about nodes that are no longer participating in gossip heartbeat rounds does not have an explicit source and is instead fueled by rumors, dead nodes may sometimes continue to live within the gossip state even after the assassinate command has been called.

To solve these issues, you must ensure all race conditions are eliminated.

If a gossip state will not forget a node that was removed from the cluster more than a week ago:

  • Login to each node within the Cassandra cluster.
  • Download jmxterm on each node, if nodetool assassinate is not an option.
  • Run nodetool assassinate, or the unsafeAssassinateEndpoint command, multiple times in quick succession.
    • I typically recommend running the command 3-5 times within 2 seconds.
    • I understand that sometimes the command takes time to return, so the “2 seconds” suggestion is less of a requirement than it is a mindset.
    • Also, sometimes 3-5 times isn’t enough. In such cases, shoot for the moon and try 20 assassination attempts in quick succession.
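
A sketch of the repeated attempts described above, run on each node, with $IP_ADDRESS standing for the deprecated node’s address:

# fire several assassinate attempts in quick succession against the lingering endpoint
for i in $(seq 1 5); do
    nodetool assassinate "$IP_ADDRESS"
done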

What we are trying to do is to create a flood of messages requesting all nodes completely forget there used to be an entry within the gossip state for the given IP address. If each node can prune its own gossip state and broadcast that to the rest of the nodes, we should eliminate any race conditions that may exist where at least one node still remembers the given IP address.

As soon as all nodes come to agreement that they don’t remember the deprecated node, the cosmetic issue will no longer be a concern in any system.logs, nodetool describecluster commands, nor nodetool gossipinfo output.

Recap: How To Properly Remove Nodes Completely

Operators shouldn’t opt for the assassinate command as a first resort for taking a Cassandra node out, since it is a sledgehammer and most of the time operators are dealing with a screw.

However, when operators follow best practices and perform a nodetool decommission for live nodes or nodetool removenode for offline nodes, sometimes lingering cosmetic issues may lead the operator to want to keep the gossip state consistent.

After at least a week of inconsistent gossip state, nodetool assassinate or the unsafeAssassinateEndpoint command may be used to remove deprecated nodes from the gossip state.

When a single assassination attempt does not work across an entire cluster, sometimes the attempt needs to be repeated on all nodes within the cluster simultaneously. Doing so ensures that each node modifies its own gossip state to accurately reflect the deprecated node’s absence, and that no node will further broadcast rumors of a stale gossip state.

Incremental Repair Improvements in Cassandra 4

In our previous post, “Should you use incremental repair?”, we recommended using subrange full repairs instead of incremental repair, as CASSANDRA-9143 could generate severe instabilities on a running cluster. As the 4.0 release approaches, let’s see how incremental repair was modified for the next major version of Apache Cassandra in order to become reliable in production.

Incremental Repair in Pre-4.0 Clusters

Since Apache Cassandra 2.1, incremental repair was performed as follows:

  • The repair coordinator will ask all replicas to build Merkle trees using only SSTables with a RepairedAt value of 0 (meaning they haven’t been part of a repair yet).
    Merkle trees are hash trees of the data they represent; they don’t store the original data.
  • Mismatching leaves of the Merkle trees will get streamed between replicas.
  • When all streaming is done, anticompaction is performed on all SSTables that were part of the repair session.

But during the whole process, SSTables could still get compacted away as part of the standard automatic compactions. If that happened, the SSTable would not get anticompacted and all the data it contains would not be marked as repaired. In the below diagram, SSTable 1 is compacted with 3, 4 and 5, creating SSTable 6 during the streaming phase. This happens before anticompaction is able to split apart the repaired and unrepaired data:

SSTable 1 gets compacted away before anticompaction could kick in.

If this happens on a single node, the next incremental repair run will find differences, as the previously repaired data will be skipped on all replicas but one, potentially leading to a lot of overstreaming. This happens because Merkle trees only contain hashes of the data, and in Cassandra the height of the tree is bounded to prevent over-allocation of memory. The more data we use to build the tree, the larger it would need to be; limiting the height of the tree means the hashes in the leaves are responsible for bigger ranges of data.

Already repaired data in SSTable 6 will be part of the Merkle tree computation.

If you wonder what troubles can be generated by this bug, I invite you to read my previous blog post on this topic.

Incremental repair in 4.0, the theory

The incremental repair process is now supervised by a transaction to guarantee its consistency. In the “Prepare” phase, anticompaction is performed before the Merkle trees are computed, and the candidate SSTables are marked as pending a specific repair. Note that they are not marked as repaired just yet, to avoid inconsistencies in case the repair session fails.

If a candidate SSTable is currently part of a running compaction, Cassandra will try to cancel that compaction and wait up to a minute. If the compaction successfully stops within that time, the SSTable will be locked for future anticompaction, otherwise the whole prepare phase and the repair session will fail.

Incremental repair in 4.0

SSTables marked as pending repair are only eligible to be compacted with other SSTables marked as pending.
SSTables in the pending repair pool are the only ones participating in both Merkle tree computations and streaming operations:

Incremental repair in 4.0

During repair, the pool of unrepaired SSTables receives newly flushed ones and compaction takes place as usual within it. SSTables that are being streamed in are part of the “pending repair” pool. This prevents two potential problems:

  • If the streamed SSTables were put in the unrepaired pool, they could get compacted away as part of normal compaction tasks and would never be marked as repaired.
  • If the streamed SSTables were put in the repaired pool and the repair session failed, we would have data marked as repaired on some nodes and not others, which would generate overstreaming during the next repair.

Once the repair succeeds, the coordinator sends a request to all replicas to mark the SSTables in pending state as repaired, by setting the RepairedAt timestamp (since anticompaction already took place, Cassandra just needs to set this timestamp).

Incremental repair in 4.0

If some nodes failed during the repair, the “pending repair” SSTables will be released and become eligible for compaction (and repair) again. They will not be marked as repaired:

Incremental repair in 4.0

The practice

Let’s see how all of this process takes place by running a repair and observing the behavior of Cassandra.

To that end, I created a 5-node CCM cluster running locally on my laptop and used tlp-stress to load some data with a replication factor of 2:

bin/tlp-stress run BasicTimeSeries -i 1M -p 1M -t 2 --rate 5000  --replication "{'class':'SimpleStrategy', 'replication_factor':2}"  --compaction "{'class': 'SizeTieredCompactionStrategy'}"  --host 127.0.0.1

Node 127.0.0.1 was then stopped and I deleted all the SSTables from the tlp_stress.sensor_data table:

Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address    Load       Tokens       Owns    Host ID                               Rack
UN  127.0.0.1  247,07 KiB  1            ?       dbccdd3e-f74a-4b7f-8cea-e8770bf995db  rack1
UN  127.0.0.2  44,08 MiB  1            ?       3ce4cca5-da75-4ede-94b7-a37e01d2c725  rack1
UN  127.0.0.3  44,07 MiB  1            ?       3b9fd30d-80c2-4fa6-b324-eaecc4f9564c  rack1
UN  127.0.0.4  43,98 MiB  1            ?       f34af1cb-4862-45e5-95cd-c36404142b9c  rack1
UN  127.0.0.5  44,05 MiB  1            ?       a5add584-2e00-4adb-8949-716b7ef35925  rack1
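For reference, the stop-and-delete step described above can be scripted against the ccm cluster along the following lines. This is only a sketch: the cluster name matches the one used in this post, but the table directory suffix and data layout will differ on your machine.

$ ccm node1 stop
# Remove every SSTable component of tlp_stress.sensor_data on node1 (path is illustrative)
$ rm -f ~/.ccm/inc-repair-fix-2/node1/data0/tlp_stress/sensor_data-*/na-*
$ ccm node1 start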

I ran a major compaction on all nodes to easily observe the anticompactions. On node 127.0.0.2, we then have a single SSTable on disk:

sensor_data-f4b94700ad1d11e8981cd5d05c109484 adejanovski$ ls -lrt *Data*
-rw-r--r--  1 adejanovski  staff  41110587 31 aoû 15:09 na-4-big-Data.db

The sstablemetadata tool gives us interesting information about this file:

sstablemetadata na-4-big-Data.db
SSTable: /Users/adejanovski/.ccm/inc-repair-fix-2/node2/data0/tlp_stress/sensor_data-f4b94700ad1d11e8981cd5d05c109484/na-4-big
Partitioner: org.apache.cassandra.dht.Murmur3Partitioner
Bloom Filter FP chance: 0.01
Minimum timestamp: 1535720482962762 (08/31/2018 15:01:22)
Maximum timestamp: 1535720601312716 (08/31/2018 15:03:21)
SSTable min local deletion time: 2147483647 (no tombstones)
SSTable max local deletion time: 2147483647 (no tombstones)
Compressor: org.apache.cassandra.io.compress.LZ4Compressor
Compression ratio: 0.8694195642299255
TTL min: 0
TTL max: 0
First token: -9223352583900436183 (001.0.1824322)
Last token: 9223317557999414559 (001.1.2601952)
minClusteringValues: [3ca8ce0d-ad1e-11e8-80a6-91cbb8e39b05]
maxClusteringValues: [f61aabc1-ad1d-11e8-80a6-91cbb8e39b05]
Estimated droppable tombstones: 0.0
SSTable Level: 0
Repaired at: 0
Pending repair: --
Replay positions covered: {CommitLogPosition(segmentId=1535719935055, position=7307)=CommitLogPosition(segmentId=1535719935056, position=20131708)}
totalColumnsSet: 231168
totalRows: 231168
Estimated tombstone drop times: 
   Drop Time | Count  (%)  Histogram 
   Percentiles
   50th      0 
   75th      0 
   95th      0 
   98th      0 
   99th      0 
   Min       0 
   Max       0 
Partition Size: 
   Size (bytes) | Count  (%)  Histogram 
   179 (179 B)  | 56330 ( 24) OOOOOOOOOOOOOOOOOOo
   215 (215 B)  | 78726 ( 34) OOOOOOOOOOOOOOOOOOOOOOOOOO.
   258 (258 B)  | 89550 ( 39) OOOOOOOOOOOOOOOOOOOOOOOOOOOOOO
   310 (310 B)  |   158 (  0) 
   372 (372 B)  |  1166 (  0) .
   446 (446 B)  |  1691 (  0) .
   535 (535 B)  |   225 (  0) 
   642 (642 B)  |    23 (  0) 
   770 (770 B)  |     1 (  0) 
   Percentiles
   50th      215 (215 B)
   75th      258 (258 B)
   95th      258 (258 B)
   98th      258 (258 B)
   99th      372 (372 B)
   Min       150 (150 B)
   Max       770 (770 B)
Column Count: 
   Columns | Count   (%)  Histogram 
   1       | 224606 ( 98) OOOOOOOOOOOOOOOOOOOOOOOOOOOOOO
   2       |   3230 (  1) .
   3       |     34 (  0) 
   Percentiles
   50th      1
   75th      1
   95th      1
   98th      1
   99th      2
   Min       0
   Max       3
Estimated cardinality: 222877
EncodingStats minTTL: 0
EncodingStats minLocalDeletionTime: 1442880000 (09/22/2015 02:00:00)
EncodingStats minTimestamp: 1535720482962762 (08/31/2018 15:01:22)
KeyType: org.apache.cassandra.db.marshal.UTF8Type
ClusteringTypes: [org.apache.cassandra.db.marshal.ReversedType(org.apache.cassandra.db.marshal.TimeUUIDType)]
StaticColumns: 
RegularColumns: data:org.apache.cassandra.db.marshal.UTF8Type

It is worth noting the cool improvements sstablemetadata has gone through in 4.0, especially regarding the histograms rendering. So far, and as expected, our SSTable is not repaired and it is not pending a running repair.
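If you only care about the repair state, the same sstablemetadata output can simply be piped through grep to keep things short:

$ sstablemetadata na-4-big-Data.db | grep -E "Repaired at|Pending repair"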

Once the repair starts, the coordinator node executes the Prepare phase and anticompaction is performed:

sensor_data-f4b94700ad1d11e8981cd5d05c109484 adejanovski$ ls -lrt *Data*
-rw-r--r--  1 adejanovski  staff  20939890 31 aoû 15:41 na-6-big-Data.db
-rw-r--r--  1 adejanovski  staff  20863325 31 aoû 15:41 na-7-big-Data.db

SSTable na-6-big is marked as pending our repair:

sstablemetadata na-6-big-Data.db
SSTable: /Users/adejanovski/.ccm/inc-repair-fix-2/node2/data0/tlp_stress/sensor_data-f4b94700ad1d11e8981cd5d05c109484/na-6-big
...
Repaired at: 0
Pending repair: 8e584410-ad23-11e8-ba2c-0feeb881768f
Replay positions covered: {CommitLogPosition(segmentId=1535719935055, position=7307)=CommitLogPosition(segmentId=1535719935056, position=21103491)}

na-7-big remains in the “unrepaired pool” (it contains tokens that are not being repaired in this session):

sstablemetadata na-7-big-Data.db
SSTable: /Users/adejanovski/.ccm/inc-repair-fix-2/node2/data0/tlp_stress/sensor_data-f4b94700ad1d11e8981cd5d05c109484/na-7-big
Partitioner: org.apache.cassandra.dht.Murmur3Partitioner
Bloom Filter FP chance: 0.01
...
Repaired at: 0
Pending repair: --

Once repair finishes, another look at sstablemetadata on na-6-big shows us that it is now marked as repaired:

sstablemetadata na-6-big-Data.db
SSTable: /Users/adejanovski/.ccm/inc-repair-fix-2/node2/data0/tlp_stress/sensor_data-f4b94700ad1d11e8981cd5d05c109484/na-6-big
...
Estimated droppable tombstones: 0.0
SSTable Level: 0
Repaired at: 1535722885852 (08/31/2018 15:41:25)
Pending repair: --
…

Again, I really appreciate not having to compute the repair date by myself thanks to an sstablemetadata output that is a lot more readable than it was before.

Reliable incremental repair

While Apache Cassandra 4.0 is being stabilized and there are still a few bugs to hunt down, incremental repair finally received the fix it deserved to make it production-ready for all situations. The transaction that encloses the whole operation will shield Cassandra from inconsistencies and overstreaming, making recurring repairs a fast and safe operation. Orchestration is still needed though, as SSTables cannot be part of two distinct repair sessions running at the same time, and it is advised to use a topology-aware tool to perform the operation without hurdles.
It is worth noting that full repair in 4.0 doesn’t involve anticompaction anymore and does not mark SSTables as repaired. This brings full repair back to its 2.1 behavior and allows it to be run on several nodes at the same time without fearing conflicts between validation compactions and anticompactions.
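As a quick reminder of what this looks like from the command line, here is a minimal sketch using the keyspace and table from the example above. Incremental repair has been the default mode since 2.2, so it is simply a matter of omitting the -full flag:

# Incremental repair (the default mode)
$ nodetool repair tlp_stress sensor_data

# Full repair, which in 4.0 no longer performs anticompaction
$ nodetool repair -full tlp_stress sensor_data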

So you have a broken Cassandra SSTable file?

Every few months I have a customer come to me with the following concern: my compactions for one of my Cassandra tables are stuck or my repairs fail when referencing one of the nodes in my Cassandra cluster. I take a look or just ask a couple of questions and it becomes apparent that the problem is a broken SSTable file. Occasionally, they will come to me in a panic and tell me that they have looked at their logs and discovered they have a broken SSTable file.

Don’t panic. A broken SSTable file is not a crisis.

A broken SSTable file does not represent lost data or an unusable database. Well, that’s true unless you are using a Replication Factor (RF) of ONE. The cluster is still going to operate, and queries should be working just fine. But… it does need to be fixed. There are four ways to fix the problem which I will explain in this post, one of which I freely admit is not one of the community’s recommended ways, but is often the simplest and quickest with minimal downside risk.

Before I begin to explain the ways to repair an SSTable, I will spend a few lines to explain what an SSTable file is, then I will walk you through the four options from easiest and safest to the most difficult and risky.

An SSTable file is not a file. It’s a set of eight files. One of those eight contains the actual data. The others contain metadata used by Cassandra to find specific partitions and rows in the data file. Here is a sample list of the files:

mc-131-big-CRC.db Checksums of chunks of the data file.
mc-131-big-Data.db The data file that contains all of the rows and columns.
mc-131-big-Digest.crc32 Single checksum of the data file.
mc-131-big-Filter.db Bloom filter used to quickly determine whether a partition key is likely to be present in the data file.
mc-131-big-Index.db A single-level index of the partition and cluster keys in the data file.
mc-131-big-Statistics.db A bunch of metadata that Cassandra keeps about this file, including information about the columns, tombstones, etc.
mc-131-big-Summary.db An index into the index file, making this a second-level index.
mc-131-big-TOC.txt A list of the component file names. No idea why it exists.

The “mc” is the SSTable file version. This changes whenever a new release of Cassandra changes anything in the way data is stored in any of the files listed in the table above.

The number 131 is the sequence number of the SSTable file. It increases for each new SSTable file written through memtable flush, compaction, or streaming from another node.

The word “big” was added to Cassandra SSTable files starting in Cassandra 2.2. I have no idea what its purpose is.

The rest of the file name parts are explained in the list above.

When you get the dreaded error that an SSTable file is broken, it is almost always because an internal consistency check failed, such as “column too long” or “one of the checksums has failed to validate”. This has relatively little effect on normal reads against the table, except for the request where the failure took place. It has a serious effect on compactions and repairs, stopping them in their tracks.

Having repairs fail can result in long-term consistency issues between nodes and eventually the application returning incorrect results. Having compactions fail will degrade read performance in the short term and cause storage space problems in the long term.

So… what are the four options?

  1. Nodetool scrub command – Performed online with little difficulty. It usually has a low success rate in my own personal experience.
  2. Offline sstablescrub – Must be performed offline. The tool is in /usr/bin with a package install. Otherwise it’s in $CASSANDRA_HOME/bin. Its effectiveness rate is significantly better than the Nodetool scrub, but it requires the node to be down to work. And it takes forever…
  3. rm -f – Performed offline. It must also be followed immediately by a Nodetool repair when you bring the node back up. This is the method I have successfully used most often, but it also has some consistency risks while the repairs complete.
  4. Bootstrap the node – This is kind of like number 3 but it has less theoretical impact on consistency.

Let us get into the details

It starts out like this. You are running a Nodetool repair and you get an error:
$ nodetool repair -full

[2018-08-09 17:00:51,663] Starting repair command #2 (4c820390-9c17-11e8-8e8f-fbc0ff4d2cb8), repairing keyspace keyspace1 with repair options (parallelism: parallel, primary range: false, incremental: false, job threads: 1, ColumnFamilies: [], dataCenters: [], hosts: [], # of ranges: 768, pull repair: false)

error: Repair job has failed with the error message: [2018-08-15 09:59:41,659] Some repair failed

— StackTrace —

java.lang.RuntimeException: Repair job has failed with the error message: [2018-08-15 09:59:41,659] Some repair failed

You see the error. But it doesn’t tell you a whole lot. Just that the repair failed. Next step: look at the Cassandra system.log file to see the errors:
$ grep -n -A10 ERROR /var/log/cassandra/system.log

ERROR [RepairJobTask:8] 2018-08-08 15:15:57,726 RepairRunnable.java:277 - Repair session 2c5f89e0-9b39-11e8-b5ee-bb8feee1767a for range [(-1377105920845202291,-1371711029446682941], (-8865445607623519086,-885162575564883…. 425683885]]] Sync failed between /192.168.1.90 and /192.168.1.92

/var/log/cassandra/debug.log:ERROR [RepairJobTask:4] 2018-08-09 16:16:50,722 RepairSession.java:281 - [repair #25682740-9c11-11e8-8e8f-fbc0ff4d2cb8] Session completed with the following error

/var/log/cassandra/debug.log:ERROR [RepairJobTask:4] 2018-08-09 16:16:50,726 RepairRunnable.java:277 - Repair session 25682740-9c11-11e8-8e8f-fbc0ff4d2cb8…… 7115161941975432804,7115472305105341673], (5979423340500726528,5980417142425683885]]] Validation failed in /192.168.1.88

/var/log/cassandra/system.log:ERROR [ValidationExecutor:2] 2018-08-09 16:16:50,707 Validator.java:268 - Failed creating a merkle tree for [repair #25682740-9c11-11e8-8e8f-fbc0ff4d2cb8 on keyspace1/standard1,

The first error message Sync Failed is misleading, although sometimes it can be a clue. Looking further, you see Validation failed in /192.168.1.88. This tells us that the error occurred on 192.168.1.88, which just happens to be the node we are on. Finally, we get the message showing the keyspace and table the error occurred on. Depending on the message, you might see the table file number mentioned. In this case it was not mentioned.
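When the file is named, the exceptions in the logs (for example a CorruptSSTableException) will usually point straight at it, so a quick search such as the one below, assuming the default package log location, can save some digging:

$ grep -i corrupt /var/log/cassandra/system.log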

Looking in the directory tree we see that we have the following SSTable files:

4,417,919,455 mc-30-big-Data.db
8,831,253,280 mc-45-big-Data.db
374,007,490 mc-49-big-Data.db
342,529,995 mc-55-big-Data.db
204,178,145 mc-57-big-Data.db
83,234,470 mc-59-big-Data.db
3,223,224,985 mc-61-big-Data.db
24,552,560 mc-62-big-Data.db
2,257,479,515 mc-63-big-Data.db
2,697,986,445 mc-66-big-Data.db
5,285 mc-67-big-Data.db

At this point we have our repair options. I’ll take them one at a time.

Online SSTable repair – Nodetool scrub

This command is easy to perform. It is also the option least likely to succeed.

Steps:

  1. Find out which SSTable is broken.
  2. Run nodetool scrub keyspace tablename.
  3. Run nodetool repair.
  4. Run nodetool listsnapshots.
  5. Run nodetool clearsnapshot -t snapshotname to remove the pre-scrub snapshot.

We did the whole “find out what table is broken” thing just above, so we aren’t going to do it again. We will start with step 2.

Scrub will take a snapshot and rebuild your table files. The one(s) that are corrupt will disappear. You will lose at least a few rows and possibly all the rows from the corrupted SSTable files. Hence the need to do a repair.


$ nodetool scrub keyspace1 standard1

After the scrub, we have fewer SSTable files and their names have all changed. There is also less space consumed and very likely some rows missing.

2,257,479,515 mc-68-big-Data.db
342,529,995 mc-70-big-Data.db
3,223,224,985 mc-71-big-Data.db
83,234,470 mc-72-big-Data.db
4,417,919,455 mc-73-big-Data.db
204,178,145 mc-75-big-Data.db
374,007,490 mc-76-big-Data.db
2,697,986,445 mc-77-big-Data.db
1,194,479,930 mc-80-big-Data.db

So we do a repair.

$ nodetool repair -full

[2018-08-09 17:00:51,663] Starting repair command #2 (4c820390-9c17-11e8-8e8f-fbc0ff4d2cb8), repairing keyspace keyspace1 with repair options (parallelism: parallel, primary range: false, incremental: false, job threads: 1, ColumnFamilies: [], dataCenters: [], hosts: [], # of ranges: 768, pull repair: false)
[2018-08-09 18:14:09,799] Repair session 4cadf590-9c17-11e8-8e8f-fbc0ff4d2cb8 for range [(-1377105920845202291,…
[2018-08-09 18:14:10,130] Repair completed successfully
[2018-08-09 18:14:10,131] Repair command #2 finished in 1 hour 13 minutes 18 seconds

After the repair, we have almost twice as many SSTable files with data pulled in from other nodes to replace the corrupted data lost by the scrub process.

2,257,479,515 mc-68-big-Data.db
342,529,995 mc-70-big-Data.db
3,223,224,985 mc-71-big-Data.db
83,234,470 mc-72-big-Data.db
4,417,919,455 mc-73-big-Data.db
204,178,145 mc-75-big-Data.db
374,007,490 mc-76-big-Data.db
2,697,986,445 mc-77-big-Data.db
1,194,479,930 mc-80-big-Data.db
1,209,518,945 mc-88-big-Data.db
193,896,835 mc-89-big-Data.db
170,061,285 mc-91-big-Data.db
63,427,680 mc-93-big-Data.db
733,830,580 mc-95-big-Data.db
1,747,015,110 mc-96-big-Data.db
16,715,886,480 mc-98-big-Data.db
49,167,805 mc-99-big-Data.db

Once the scrub and repair are completed, you are almost done.

One of the side effects of the scrub is a snapshot called pre-scrub-<timestamp>. If you don’t want to run out of disk space, you are going to want to remove it, preferably with nodetool.

$ nodetool listsnapshots

Snapshot Details:

Snapshot name Keyspace name Column family name True size Size on disk

pre-scrub-1533897462847 keyspace1 standard1 35.93 GiB 35.93 GiB

$ nodetool clearsnapshot -t pre-scrub-1533897462847

Requested clearing snapshot(s) for [all keyspaces] with snapshot name [pre-scrub-1533897462847]

If the repair still fails to complete, we get to try one of the other methods.

Offline SSTable repair utility – sstablescrub

This option is a bit more complex to do but it often will work when the online version won’t work. Warning: it is very slow.

Steps:

  1. Bring the node down.
  2. Run the sstablescrub command.
  3. Start the node back up.
  4. Run nodetool repair on the table.
  5. Run nodetool clearsnapshot to remove the pre-scrub snapshot.

If the node is not already down, bring it down. I usually do the following commands:

$ nodetool drain

$ pkill java

$ ps -ef |grep cassandra

root 18271 14813 0 20:39 pts/1 00:00:00 grep --color=auto cassandra

Then issue the sstablescrub command with the -n option unless you have the patience of a saint. Without the -n option, every column in every row in every SSTable file will be validated. Single threaded. It will take forever. In preparing for this blog post, I forgot to use the -n and found that it took 12 hours to scrub 500 megabytes of a 30 GB table. Not willing to wait 30 days for the scrub to complete, I stopped it and switched to the -n option completing the scrub in only… hang on for this, 6 days. So, um, maybe this isn’t going to be useful in most real-world situations unless you have really small tables.

$ sstablescrub -n keyspace1 standard1

Pre-scrub sstables snapshotted into snapshot pre-scrub-1533861799166

Scrubbing BigTableReader(path='/home/cassandra/data/keyspace1/standard1-afd416808c7311e8a0c96796602809bc/mc-88-big-Data.db') (1.126GiB)…

Unfortunately, this took more time than I wanted to take for this blog post. Once you have the table scrubbed, you restart Cassandra, run a repair on the table, and clear the pre-scrub snapshot.

Delete the file and do a Nodetool repair – rm

This option works every time. It is no more difficult to do than the offline sstablescrub command and its success rate is 100%. It’s usually much faster than the offline sstablescrub option. In my prep for the blog post, this approach took only two hours for my 30 GB table. The only drawback I can see is that for the time it takes to do the repair on the table after the delete is performed, there is an increased risk of consistency problems, especially if you are reading at consistency level (CL) ONE, which should be a fairly uncommon use case.

Steps:

  1. Stop the node.
  2. cd to the offending keyspace and sstable directory.
  3. If you know which sstable file is bad (if you learned about the problem from stalled compactions, you will know) just delete it. If not, delete all files in the directory.
  4. Restart the node.
  5. Nodetool repair.

$ nodetool drain

$ pkill java

$ ps -ef |grep cassandra

root 18271 14813 0 20:39 pts/1 00:00:00 grep --color=auto cassandra

$ cd /var/lib/cassandra/data/keyspace1/standard1-afd416808c7311e8a0c96796602809bc/

$ pwd

/var/lib/cassandra/data/keyspace1/standard1-afd416808c7311e8a0c96796602809bc

If you know the SSTable file you want to delete, you can delete just that one with rm -f *nnn*. If not, as in this case, you do them all.

$ sudo rm -f *

rm: cannot remove ‘backups’: Is a directory

rm: cannot remove ‘snapshots’: Is a directory

$ ls

backups snapshots

$ systemctl start cassandra

$ nodetool status

Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address       Load       Tokens  Owns (effective)  Host ID                               Rack
UN  192.168.1.88  1.35 MiB   256     100.0%            c92d9374-cf3a-47f6-9bd1-81b827da0c1e  rack1
UN  192.168.1.90  41.72 GiB  256     100.0%            3c9e61ae-8741-4a74-9e89-cfa47768ac60  rack1
UN  192.168.1.92  30.87 GiB  256     100.0%            c36fecad-0f55-4945-a741-568f28a3cd8b  rack1

$ nodetool repair keyspace1 standard1 -full

[2018-08-10 11:23:22,454] Starting repair command #1 (51713c00-9cb1-11e8-ba61-01c8f56621df), repairing keyspace keyspace1 with repair options (parallelism: parallel, primary range: false, incremental: false, job threads: 1, ColumnFamilies: [standard1], dataCenters: [], hosts: [], # of ranges: 768, pull repair: false)
[2018-08-10 13:02:36,097] Repair completed successfully
[2018-08-10 13:02:36,098] Repair command #1 finished in 1 hour 39 minutes 13 seconds

The SSTable file list now looks like this:

229,648,710 mc-10-big-Data.db
103,421,070 mc-11-big-Data.db
1,216,169,275 mc-12-big-Data.db
76,738,970 mc-13-big-Data.db
773,774,270 mc-14-big-Data.db
17,035,624,448 mc-15-big-Data.db
83,365,660 mc-16-big-Data.db
170,061,285 mc-17-big-Data.db
758,998,865 mc-18-big-Data.db
2,683,075,900 mc-19-big-Data.db
749,573,440 mc-1-big-Data.db
91,184,160 mc-20-big-Data.db
303,380,050 mc-21-big-Data.db
3,639,126,510 mc-22-big-Data.db
231,929,395 mc-23-big-Data.db
1,469,272,390 mc-24-big-Data.db
204,485,420 mc-25-big-Data.db
345,655,100 mc-26-big-Data.db
805,017,870 mc-27-big-Data.db
50,714,125 mc-28-big-Data.db
11,578,088,555 mc-2-big-Data.db
170,033,435 mc-3-big-Data.db
1,677,053,450 mc-4-big-Data.db
62,245,405 mc-5-big-Data.db
8,426,967,700 mc-6-big-Data.db
1,979,214,745 mc-7-big-Data.db
2,910,586,420 mc-8-big-Data.db
14,097,936,920 mc-9-big-Data.db

Bootstrap the node

If you are reading at consistency level (CL) ONE, or you are really concerned about consistency overall, use this approach instead of the rm -f approach. It will ensure that the node with missing data does not participate in any reads until all data is restored. Depending on how much data the node has to recover, it will often take longer than any of the other approaches, although since bootstrapping can stream from multiple nodes in parallel, it may not.

Steps:

  1. Shut down the node.
  2. Remove all of the files under $CASSANDRA_HOME, usually /var/lib/cassandra.
  3. Modify /etc/cassandra/conf/cassandra-env.sh.
  4. Start Cassandra. When the server starts with no files, it will connect to one of its seeds, recreate the schema and request all nodes to stream data to it to replace the data it has lost. It will not re-select new token ranges unless you try to restart it with a different IP than it had before.
  5. Modify the /etc/cassandra/conf/cassandra-env.sh file to remove the change in Step 3.

$ nodetool drain

$ sudo pkill java

$ ps -ef |grep java

$ vi /etc/cassandra/conf/cassandra-env.sh

Add this line at the end of the file:

JVM_OPTS="$JVM_OPTS -Dcassandra.replace_address=192.168.1.88"

$ systemctl start cassandra

Wait for the node to join the cluster

During the bootstrap we see messages like this in the log:

INFO [main] 2018-08-10 13:39:06,780 StreamResultFuture.java:90 - [Stream #47b382f0-9cc4-11e8-a010-51948a7598a1] Executing streaming plan for Bootstrap

INFO [StreamConnectionEstablisher:1] 2018-08-10 13:39:06,784 StreamSession.java:266 - [Stream #47b382f0-9cc4-11e8-a010-51948a7598a1] Starting streaming to /192.168.1.90

Later on we see:

INFO [main] 2018-08-10 14:18:16,133 StorageService.java:1449 - JOINING: Finish joining ring

INFO [main] 2018-08-10 14:18:16,482 SecondaryIndexManager.java:509 - Executing pre-join post-bootstrap tasks for: CFS(Keyspace='keyspace1', ColumnFamily='standard1')

INFO [main] 2018-08-10 14:18:16,484 SecondaryIndexManager.java:509 - Executing pre-join post-bootstrap tasks for: CFS(Keyspace='keyspace1', ColumnFamily='counter1')

INFO [main] 2018-08-10 14:18:16,897 StorageService.java:2292 - Node /192.168.1.88 state jump to NORMAL

WARN [main] 2018-08-10 14:18:16,899 StorageService.java:2324 - Not updating token metadata for /192.168.1.88 because I am replacing it

When we do a nodetool status we see:

$ nodetool status

Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address       Load       Tokens  Owns (effective)  Host ID                               Rack
UN  192.168.1.88  30.87 GiB  256     100.0%            c92d9374-cf3a-47f6-9bd1-81b827da0c1e  rack1
UN  192.168.1.90  41.72 GiB  256     100.0%            3c9e61ae-8741-4a74-9e89-cfa47768ac60  rack1
UN  192.168.1.92  30.87 GiB  256     100.0%            c36fecad-0f55-4945-a741-568f28a3cd8b  rack1

The node is up and running in less than one hour. Quicker than any of the options. Makes you think about your choices, doesn’t it?

If you have a keyspace with RF=1 then options 3 and 4 are not viable. You will lose data. Although with RF=1 and a corrupted SSTable file you are going to lose some data anyway.

A last view at the list of SSTable files shows you this:

773,774,270 mc-10-big-Data.db
17,148,617,040 mc-11-big-Data.db
749,573,440 mc-1-big-Data.db
170,033,435 mc-2-big-Data.db
1,677,053,450 mc-3-big-Data.db
62,245,405 mc-4-big-Data.db
8,426,967,700 mc-5-big-Data.db
229,648,710 mc-6-big-Data.db
103,421,070 mc-7-big-Data.db
1,216,169,275 mc-8-big-Data.db
76,738,970 mc-9-big-Data.db

Conclusion

If you run into corrupted SSTable files, don’t panic. It won’t have any impact on your operations in the short term unless you are using RF=ONE or reading at CL=ONE.

Find out which node has the broken SSTable file.

Then, because it’s the easiest and lowest-risk option, try the online nodetool scrub command.

If that does not work, then you have three choices. Offline Scrub works but is usually too slow to be useful. Rebuilding the whole node seems to be overkill but it will work, and it will maintain consistency on reads. If you have a lot of data and you want to solve the problem fairly quickly, just remove the offending SSTable file and do a repair.

All approaches have an impact on the other nodes in the cluster.

The first three require a repair which computes merkle trees and streams data to the node being fixed. The amount to be streamed is most with the delete but the total time for the recovery was less in my example. That may not always be the case. In the bootstrap example, the total time was very similar to the delete case because my test case had only one large table. If there were several large tables, the delete approach would have been the fastest to get the node back to normal.

Approach        Scrub phase  Repair phase  Total recovery time
Online Scrub    1:06         1:36          2:42
*Offline Scrub  144:35       1:37          146:22
Delete files    0:05         1:36          1:41
Bootstrap       0:05         1:45          1:50

(All times are in hours:minutes.)

All sample commands show the user in normal Linux user mode. That is because in my test environment the Cassandra cluster belonged to my user id. Most production Cassandra clusters run as the Cassandra Linux user. In that case, some amount of user id switching or sudo operations would be required to do the work.
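For example, on a cluster running as the cassandra user, the file-level and service-level steps from the delete option would look more like the sketch below; the paths reuse the keyspace and table from this post and should be adjusted for your install:

# File operations need to run as the owner of the data files
$ sudo -u cassandra rm -f /var/lib/cassandra/data/keyspace1/standard1-afd416808c7311e8a0c96796602809bc/mc-*
# Service management typically needs root
$ sudo systemctl start cassandra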

The offline scrub time was estimated. I did not want to wait for six days to see if it was really going to take that long.

All sample output provided here was from a three-node cluster running Cassandra 3.11.2 running on Fedora 28 using a vanilla Cassandra install with pretty much everything in cassandra.env defaulted.

I corrupted the SSTable file using this command:

$ printf '\x31\xc0\xc3' | dd of=mc-8-big-Data.db bs=1 seek=0 count=100 conv=notrunc

The Fine Print When Using Multiple Data Directories

One of the longest-lived features in Cassandra is the ability to allow a node to store data on more than one directory or disk. This feature can help increase cluster capacity or prevent a node from running out of space if bootstrapping a new one will take too long to complete. Recently I was working on a cluster and saw how this feature has the potential to silently cause problems in a cluster. In this post we will go through some fine print when configuring Cassandra to use multiple disks.

Jay… what?

The feature which allows Cassandra to store data on multiple disks is commonly referred to as JBOD [pronounced jay-bod] which stands for “Just a Bunch Of Disks/Drives”. In Cassandra this feature is controlled by the data_file_directories setting in the cassandra.yaml file. In relation to this setting, Cassandra also allows its behaviour on disk failure to be controlled using the disk_failure_policy setting. For now I will leave the details of the setting alone, so we can focus exclusively on the data_file_directories setting.

Simple drives, simple pleasures

The data_file_directories feature is fairly straightforward in that it allows Cassandra to use multiple directories to store data. To use it, just specify the list of directories you want Cassandra to use for data storage. For example:

data_file_directories:
    - /var/lib/cassandra/data
    - /var/lib/cassandra/data-extra

The feature has been around from day one of Cassandra’s life and the way in which Cassandra uses multiple directories has mostly stayed the same. There are no special restrictions on the directories; they can be on the same volume/disk or different volumes/disks. As far as Cassandra is concerned, the paths specified in the setting are just the directories it has available to read and write data.

At a high level, the way the feature works is Cassandra tries to evenly split data into each of the directories specified in the data_file_directories setting. No two directories will ever have an identical SSTable file name in them. Below is an example of what you could expect to see if you inspected each data directory when using this feature. In this example the node is configured to use two directories: …/data0/ and …/data1/

$ ls .../data0/music/playlists-3b90f8a0a50b11e881a5ad31ff0de720/
backups                      mc-5-big-Digest.crc32  mc-5-big-Statistics.db
mc-5-big-CompressionInfo.db  mc-5-big-Filter.db     mc-5-big-Summary.db
mc-5-big-Data.db             mc-5-big-Index.db      mc-5-big-TOC.txt

$ ls .../data1/music/playlists-3b90f8a0a50b11e881a5ad31ff0de720/
backups                      mc-6-big-Digest.crc32  mc-6-big-Statistics.db
mc-6-big-CompressionInfo.db  mc-6-big-Filter.db     mc-6-big-Summary.db
mc-6-big-Data.db             mc-6-big-Index.db      mc-6-big-TOC.txt

Data resurrection

One notable change which modified how Cassandra uses the data_file_directories setting was CASSANDRA-6696. The change was implemented in Cassandra version 3.2. To explain this problem and how it was fixed, consider the case where a node has two data directories A and B. Prior to this change in Cassandra, you could have a node that had data for a specific token in one SSTable that was on disk A. The node could also have a tombstone associated with that token in another SSTable on disk B. If gc_grace_seconds passed and no compaction had processed the data and its tombstone together, there would be an issue if disk B failed. In this case if disk B did fail, the tombstone is lost and the data on disk A is still present! Running a repair in this case would resurrect the data by propagating it to other replicas! To fix this issue, CASSANDRA-6696 changed Cassandra so that a token range was always stored on a single disk.

This change did make Cassandra more robust when using the data_file_directories setting, however this change was no silver bullet and caution still needs to be taken when it is used. Most notably, consider the case where each data directory is mounted to a dedicated disk and the cluster schema results in wide partitions. In this scenario one of the disks could easily reach its maximum capacity due to the wide partitions while the other disk still has plenty of storage capacity.

How to lose a volume and influence a node

For a node running a Cassandra version less than 3.2 and using the data_file_directories setting, there are a number of vulnerabilities to watch out for. If each data directory is mounted to a dedicated disk, and one of the disks dies or the mount disappears, this can silently cause problems. To explain this problem, consider the case where we installed Cassandra and the data is located in /var/lib/cassandra/data. Say we want to add another directory to store data in, only this time the data will be on another volume. It makes sense to have the data directories in the same location, so we create the directory /var/lib/cassandra/data-extra. We then mount our volume so that /var/lib/cassandra/data-extra points to it. If the disk backing /var/lib/cassandra/data-extra died or we forgot to put the mount information in fstab and lose the mount on a restart, then we will effectively lose system table data. Cassandra will start because the directory /var/lib/cassandra/data-extra exists, however it will be empty.

Similarly, I have seen cases where a directory was manually added to a node that was managed by chef. In this case the node was running out of disk space and there was no time to wait for a new node to bootstrap. To avoid a node going down, an additional volume was attached, mounted, and the data_file_directories setting in the cassandra.yaml modified to include the new data directory. Some time later chef was executed on the node to deploy an update, and as a result it reset the cassandra.yaml configuration. Resetting the cassandra.yaml cleared the additional data directory that was listed under the data_file_directories setting. When the node was restarted, the Cassandra process never knew that there was another data directory it had to read from.

Either of these cases can lead to more problems in the cluster. Remember how earlier I mentioned that a complete SSTable file will always be stored in a single data directory when using the data_file_directories setting? This behaviour applies to all data stored by Cassandra, including its system data! So that means, in the above two scenarios, Cassandra could potentially lose system table data. This is a problem because the system tables store information about what data the node owns, what the schema is, and whether the node has bootstrapped. If the system tables are lost and the node is restarted, the node will think it is a new node, take on a new identity and new token ranges. This results in a token range movement in the cluster. We have covered this topic in more detail in our auto bootstrapping blog post. This problem gets worse when a seed node loses its system tables and comes back as a new node. This is because seed nodes never stream data and if a cleanup is run cluster-wide, data is then lost.
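One way to guard against the missing-mount variant of this problem is to refuse to start Cassandra when the extra directory is not actually a mounted volume. This is not a Cassandra feature, just a small operational guard you could wrap around the service start; the paths follow the example above:

# Abort the start if the extra data directory is not a mount point
if ! mountpoint -q /var/lib/cassandra/data-extra; then
  echo "data-extra is not mounted, refusing to start Cassandra" >&2
  exit 1
fi
systemctl start cassandra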

Testing the theory

We can test these above scenarios for different versions of Cassandra using ccm. I created the following script to setup a three node cluster in ccm with each node configured to be a seed node and use two data directories. We use seed nodes to show the worst case scenario that can occur when a node using multiple data directories loses one of the directories.

#!/bin/bash

# This script creates a three node CCM cluster to demo the data_file_directories
# feature in different versions of Cassandra.

set -e

CLUSTER_NAME="${1:-TestCluster}"
NUMBER_NODES="3"
CLUSTER_VERSION="${2:-3.0.15}"

echo "Cluster Name: ${CLUSTER_NAME}"
echo "Cluster Version: ${CLUSTER_VERSION}"
echo "Number nodes: ${NUMBER_NODES}"

ccm create ${CLUSTER_NAME} -v ${CLUSTER_VERSION}

# Modifies the configuration of a node in the CCM cluster.
function update_node_config {
  CASSANDRA_YAML_SETTINGS="cluster_name:${CLUSTER_NAME} \
                          num_tokens:32 \
                          endpoint_snitch:GossipingPropertyFileSnitch \
                          seeds:127.0.0.1,127.0.0.2,127.0.0.3"

  for key_value_setting in ${CASSANDRA_YAML_SETTINGS}
  do
    setting_key=$(echo ${key_value_setting} | cut -d':' -f1)
    setting_val=$(echo ${key_value_setting} | cut -d':' -f2)
    sed -ie "s/${setting_key}\:\ .*/${setting_key}:\ ${setting_val}/g" \
      ~/.ccm/${CLUSTER_NAME}/node${1}/conf/cassandra.yaml
  done

  # Create and configure the additional data directory
  extra_data_dir="/Users/anthony/.ccm/${CLUSTER_NAME}/node${1}/data1"

  sed -ie '/data_file_directories:/a\'$'\n'"- ${extra_data_dir}
    " ~/.ccm/${CLUSTER_NAME}/node${1}/conf/cassandra.yaml

  mkdir ${extra_data_dir}

  sed -ie "s/dc=.*/dc=datacenter1/g" \
    ~/.ccm/${CLUSTER_NAME}/node${1}/conf/cassandra-rackdc.properties
  sed -ie "s/rack=.*/rack=rack${1}/g" \
    ~/.ccm/${CLUSTER_NAME}/node${1}/conf/cassandra-rackdc.properties

  # Tune Cassandra memory usage so we can run multiple nodes on the one machine
  sed -ie 's/\#MAX_HEAP_SIZE=\"4G\"/MAX_HEAP_SIZE=\"500M\"/g' \
    ~/.ccm/${CLUSTER_NAME}/node${1}/conf/cassandra-env.sh
  sed -ie 's/\#HEAP_NEWSIZE=\"800M\"/HEAP_NEWSIZE=\"120M\"/g' \
    ~/.ccm/${CLUSTER_NAME}/node${1}/conf/cassandra-env.sh

  # Allow remote access to JMX without authentication. This is for
  # demo purposes only - Never do this in production
  sed -ie 's/LOCAL_JMX=yes/LOCAL_JMX=no/g' \
    ~/.ccm/${CLUSTER_NAME}/node${1}/conf/cassandra-env.sh
  sed -ie 's/com\.sun\.management\.jmxremote\.authenticate=true/com.sun.management.jmxremote.authenticate=false/g' \
    ~/.ccm/${CLUSTER_NAME}/node${1}/conf/cassandra-env.sh
}

for node_num in $(seq ${NUMBER_NODES})
do
  echo "Adding 'node${node_num}'"
  ccm add node${node_num} \
    -i 127.0.0.${node_num} \
    -j 7${node_num}00 \
    -r 0 \
    -b

  update_node_config ${node_num}

  # Set localhost aliases - Mac only
  echo "ifconfig lo0 alias 127.0.0.${node_num} up"
  sudo ifconfig lo0 alias 127.0.0.${node_num} up
done

sed -ie 's/use_vnodes\:\ false/use_vnodes:\ true/g' \
  ~/.ccm/${CLUSTER_NAME}/cluster.conf

I first tested Cassandra version 2.1.20 using the following process.

Run the script and check the nodes were created.
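Invoking the script could look like the following; the file name is simply whatever you saved the script as, and the two arguments map to the cluster name and Cassandra version variables it defines:

$ ./create_jbod_cluster.sh mutli-dir-test 2.1.20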

$ ccm status
Cluster: 'mutli-dir-test'
-------------------------
node1: DOWN (Not initialized)
node3: DOWN (Not initialized)
node2: DOWN (Not initialized)

Start the cluster.

$ for i in $(seq 1 3); do echo "Starting node${i}"; ccm node${i} start; sleep 10; done
Starting node1
Starting node2
Starting node3

Check the cluster is up and note the Host IDs.

$  ccm node1 nodetool status

Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address    Load       Tokens  Owns (effective)  Host ID                               Rack
UN  127.0.0.1  47.3 KB    32      73.5%             4682088e-4a3c-4fbc-8874-054408121f0a  rack1
UN  127.0.0.2  80.35 KB   32      71.7%             b2411268-f168-485d-9abe-77874eef81ce  rack2
UN  127.0.0.3  64.33 KB   32      54.8%             8b55a1c6-f971-4e01-a34b-bb37dd55bb89  rack3

Insert some test data into the cluster.

$ ccm node1 cqlsh
Connected to TLP-578-2120 at 127.0.0.1:9042.
[cqlsh 5.0.1 | Cassandra 2.1.20 | CQL spec 3.2.1 | Native protocol v3]
Use HELP for help.
cqlsh> CREATE KEYSPACE music WITH REPLICATION = { 'class' : 'NetworkTopologyStrategy', 'datacenter1' : 3 };
cqlsh> CREATE TABLE music.playlists (
   ...  id uuid,
   ...  song_order int,
   ...  song_id uuid,
   ...  title text,
   ...  artist text,
   ...  PRIMARY KEY (id, song_id));
cqlsh> INSERT INTO music.playlists (id, song_order, song_id, artist, title)
   ...  VALUES (62c36092-82a1-3a00-93d1-46196ee77204, 1,
   ...  a3e64f8f-bd44-4f28-b8d9-6938726e34d4, 'Of Monsters and Men', 'Little Talks');
cqlsh> INSERT INTO music.playlists (id, song_order, song_id, artist, title)
   ...  VALUES (62c36092-82a1-3a00-93d1-46196ee77205, 2,
   ...  8a172618-b121-4136-bb10-f665cfc469eb, 'Birds of Tokyo', 'Plans');
cqlsh> INSERT INTO music.playlists (id, song_order, song_id, artist, title)
   ...  VALUES (62c36092-82a1-3a00-93d1-46196ee77206, 3,
   ...  2b09185b-fb5a-4734-9b56-49077de9edbf, 'Lorde', 'Royals');
cqlsh> exit

Write the data to disk by running nodetool flush on all the nodes.

$ for i in $(seq 1 3); do echo "Flushing node${i}"; ccm node${i} nodetool flush; done
Flushing node1
Flushing node2
Flushing node3

Check we can retrieve data from each node.

$ for i in $(seq 1 3); do ccm node${i} cqlsh -e "SELECT id, song_order, song_id, artist, title FROM music.playlists"; done

 id          | song_order | song_id     | artist              | title
-------------+------------+-------------+---------------------+--------------
 62c36092... |          2 | 8a172618... |      Birds of Tokyo |        Plans
 62c36092... |          3 | 2b09185b... |               Lorde |       Royals
 62c36092... |          1 | a3e64f8f... | Of Monsters and Men | Little Talks

(3 rows)

  id          | song_order | song_id     | artist              | title
 -------------+------------+-------------+---------------------+--------------
  62c36092... |          2 | 8a172618... |      Birds of Tokyo |        Plans
  62c36092... |          3 | 2b09185b... |               Lorde |       Royals
  62c36092... |          1 | a3e64f8f... | Of Monsters and Men | Little Talks

(3 rows)

  id          | song_order | song_id     | artist              | title
 -------------+------------+-------------+---------------------+--------------
  62c36092... |          2 | 8a172618... |      Birds of Tokyo |        Plans
  62c36092... |          3 | 2b09185b... |               Lorde |       Royals
  62c36092... |          1 | a3e64f8f... | Of Monsters and Men | Little Talks

(3 rows)

Look for a node that has all of the system.local SSTable files in a single directory. In this particular test, there were no SSTable files in data directory data0 of node1.

$ ls .../node1/data0/system/local-7ad54392bcdd35a684174e047860b377/
$
$ ls .../node1/data1/system/local-7ad54392bcdd35a684174e047860b377/
system-local-ka-5-CompressionInfo.db  system-local-ka-5-Summary.db          system-local-ka-6-Index.db
system-local-ka-5-Data.db             system-local-ka-5-TOC.txt             system-local-ka-6-Statistics.db
system-local-ka-5-Digest.sha1         system-local-ka-6-CompressionInfo.db  system-local-ka-6-Summary.db
system-local-ka-5-Filter.db           system-local-ka-6-Data.db             system-local-ka-6-TOC.txt
system-local-ka-5-Index.db            system-local-ka-6-Digest.sha1
system-local-ka-5-Statistics.db       system-local-ka-6-Filter.db

Stop node1 and simulate a disk or volume mount going missing by removing the data1 directory entry from the data_file_directories setting.

$ ccm node1 stop

Before the change the setting entry was:

data_file_directories:
- .../node1/data0
- .../node1/data1

After the change the setting entry was:

data_file_directories:
- .../node1/data0

Start node1 again and check the logs. From the logs we can see the messages where the node has generated a new Host ID and taken ownership of new tokens.

WARN  [main] 2018-08-21 12:34:57,111 SystemKeyspace.java:765 - No host ID found, created c62c54bf-0b85-477d-bb06-1f5d696c7fef (Note: This should happen exactly once per node).
INFO  [main] 2018-08-21 12:34:57,241 StorageService.java:844 - This node will not auto bootstrap because it is configured to be a seed node.
INFO  [main] 2018-08-21 12:34:57,259 StorageService.java:959 - Generated random tokens. tokens are [659824738410799181, 501008491586443415, 4528158823720685640, 3784300856834360518, -5831879079690505989, 8070398544415493492, -2664141538712847743, -303308032601096386, -553368999545619698, 5062218903043253310, -8121235567420561418, 935133894667055035, -4956674896797302124, 5310003984496306717, -1155160853876320906, 3649796447443623633, 5380731976542355863, -3266423073206977005, 8935070979529248350, -4101583270850253496, -7026448307529793184, 1728717941810513773, -1920969318367938065, -8219407330606302354, -795338012034994277, -374574523137341910, 4551450772185963221, -1628731017981278455, -7164926827237876166, -5127513414993962202, -4267906379878550578, -619944134428784565]

Check the cluster status again. From the output we can see that the Host ID for node1 changed from 4682088e-4a3c-4fbc-8874-054408121f0a to c62c54bf-0b85-477d-bb06-1f5d696c7fef.

$ ccm node2 nodetool status

Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address    Load       Tokens  Owns (effective)  Host ID                               Rack
UN  127.0.0.1  89.87 KB   32      100.0%            c62c54bf-0b85-477d-bb06-1f5d696c7fef  rack1
UN  127.0.0.2  88.69 KB   32      100.0%            b2411268-f168-485d-9abe-77874eef81ce  rack2
UN  127.0.0.3  106.66 KB  32      100.0%            8b55a1c6-f971-4e01-a34b-bb37dd55bb89  rack3

Check we can retrieve data from each node again.

for i in $(seq 1 3); do ccm node${i} cqlsh -e "SELECT id, song_order, song_id, artist, title FROM music.playlists"; done

Cassandra 2.1.20 Results

When we run the above test against a cluster using Apache Cassandra version 2.1.20 and remove the additional data directory data1 from node1, we can see that our CQL statement fails when retrieving data from node1. The error produced shows that the song_order column is unknown to the node.

$ for i in $(seq 1 3); do ccm node${i} cqlsh -e "SELECT id, song_order, song_id, artist, title FROM music.playlists"; done

<stdin>:1:InvalidRequest: code=2200 [Invalid query] message="Undefined name song_order in selection clause"

  id          | song_order | song_id     | artist              | title
 -------------+------------+-------------+---------------------+--------------
  62c36092... |          2 | 8a172618... |      Birds of Tokyo |        Plans
  62c36092... |          3 | 2b09185b... |               Lorde |       Royals
  62c36092... |          1 | a3e64f8f... | Of Monsters and Men | Little Talks

(3 rows)

  id          | song_order | song_id     | artist              | title
 -------------+------------+-------------+---------------------+--------------
  62c36092... |          2 | 8a172618... |      Birds of Tokyo |        Plans
  62c36092... |          3 | 2b09185b... |               Lorde |       Royals
  62c36092... |          1 | a3e64f8f... | Of Monsters and Men | Little Talks

(3 rows)

As an interesting side note, if nodetool drain is run on node1 before it is shut down, the above error never occurs. Instead, the following output appears when we run our CQL statement to retrieve data from the nodes. As we can see below, the query that previously failed now returns no rows of data.

$ for i in $(seq 1 3); do ccm node${i} cqlsh -e "SELECT id, song_order, song_id, artist, title FROM music.playlists"; done

 id | song_order | song_id | artist | title
----+------------+---------+--------+-------

(0 rows)

 id          | song_order | song_id     | artist              | title
-------------+------------+-------------+---------------------+--------------
 62c36092... |          2 | 8a172618... |      Birds of Tokyo |        Plans
 62c36092... |          3 | 2b09185b... |               Lorde |       Royals
 62c36092... |          1 | a3e64f8f... | Of Monsters and Men | Little Talks

(3 rows)

 id          | song_order | song_id     | artist              | title
-------------+------------+-------------+---------------------+--------------
 62c36092... |          2 | 8a172618... |      Birds of Tokyo |        Plans
 62c36092... |          3 | 2b09185b... |               Lorde |       Royals
 62c36092... |          1 | a3e64f8f... | Of Monsters and Men | Little Talks

(3 rows)

Cassandra 2.2.13 Results

When we run the above test against a cluster using Apache Cassandra version 2.2.13 and remove the additional data directory data1 from node1, we can see that the CQL statement fails when retrieving data from node1. The error produced is similar to that produced in version 2.1.20, except this time the id column name is unknown.

$ for i in $(seq 1 3); do ccm node${i} cqlsh -e "SELECT id, song_order, song_id, artist, title FROM music.playlists"; done

<stdin>:1:InvalidRequest: Error from server: code=2200 [Invalid query] message="Undefined name id in selection clause"

  id          | song_order | song_id     | artist              | title
 -------------+------------+-------------+---------------------+--------------
  62c36092... |          2 | 8a172618... |      Birds of Tokyo |        Plans
  62c36092... |          3 | 2b09185b... |               Lorde |       Royals
  62c36092... |          1 | a3e64f8f... | Of Monsters and Men | Little Talks

(3 rows)

  id          | song_order | song_id     | artist              | title
 -------------+------------+-------------+---------------------+--------------
  62c36092... |          2 | 8a172618... |      Birds of Tokyo |        Plans
  62c36092... |          3 | 2b09185b... |               Lorde |       Royals
  62c36092... |          1 | a3e64f8f... | Of Monsters and Men | Little Talks

(3 rows)

Unlike Cassandra version 2.1.20, node1 never generated a new Host ID or calculated new tokens. This is because it replayed the commitlog and recovered most of the writes that had gone missing.

INFO  [main] ... CommitLog.java:160 - Replaying .../node1/commitlogs/CommitLog-5-1534865605274.log, .../node1/commitlogs/CommitLog-5-1534865605275.log
WARN  [main] ... CommitLogReplayer.java:149 - Skipped 1 mutations from unknown (probably removed) CF with id 5bc52802-de25-35ed-aeab-188eecebb090
...
INFO  [main] ... StorageService.java:900 - Using saved tokens [-1986809544993962272, -2017257854152001541, -2774742649301489556, -5900361272205350008, -5936695922885734332, -6173514731003460783, -617557464401852062, -6189389450302492227, -6817507707445347788, -70447736800638133, -7273401985294399499, -728761291814198629, -7345403624129882802, -7886058735316403116, -8499251126507277693, -8617790371363874293, -9121351096630699623, 1551379122095324544, 1690042196927667551, 2403633816924000878, 337128813788730861, 3467690847534201577, 419697483451380975, 4497811278884749943, 4783163087653371572, 5213928983621160828, 5337698449614992094, 5502889505586834056, 6549477164138282393, 7486747913914976739, 8078241138082605830, 8729237452859546461]

Cassandra 3.0.15 Results

When we run the above test against a cluster using Apache Cassandra version 3.0.15 and remove the additional data directory data1 from node1, we can see that the CQL statement returns no data from node1.

$ for i in $(seq 1 3); do ccm node${i} cqlsh -e "SELECT id, song_order, song_id, artist, title FROM music.playlists"; done

 id | song_order | song_id | artist | title
----+------------+---------+--------+-------

(0 rows)

 id          | song_order | song_id     | artist              | title
-------------+------------+-------------+---------------------+--------------
 62c36092... |          2 | 8a172618... |      Birds of Tokyo |        Plans
 62c36092... |          3 | 2b09185b... |               Lorde |       Royals
 62c36092... |          1 | a3e64f8f... | Of Monsters and Men | Little Talks

(3 rows)

 id          | song_order | song_id     | artist              | title
-------------+------------+-------------+---------------------+--------------
 62c36092... |          2 | 8a172618... |      Birds of Tokyo |        Plans
 62c36092... |          3 | 2b09185b... |               Lorde |       Royals
 62c36092... |          1 | a3e64f8f... | Of Monsters and Men | Little Talks

(3 rows)

Cassandra 3.11.3 Results

When we run the above test against a cluster using Apache Cassandra version 3.11.3 and remove the additional data directory data1 from node1, the node fails to start and we can see the following error message in the logs.

ERROR [main] 2018-08-21 16:30:53,489 CassandraDaemon.java:708 - Exception encountered during startup
java.lang.RuntimeException: A node with address /127.0.0.1 already exists, cancelling join. Use cassandra.replace_address if you want to replace this node.
    at org.apache.cassandra.service.StorageService.checkForEndpointCollision(StorageService.java:558) ~[apache-cassandra-3.11.3.jar:3.11.3]
    at org.apache.cassandra.service.StorageService.prepareToJoin(StorageService.java:804) ~[apache-cassandra-3.11.3.jar:3.11.3]
    at org.apache.cassandra.service.StorageService.initServer(StorageService.java:664) ~[apache-cassandra-3.11.3.jar:3.11.3]
    at org.apache.cassandra.service.StorageService.initServer(StorageService.java:613) ~[apache-cassandra-3.11.3.jar:3.11.3]
    at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:379) [apache-cassandra-3.11.3.jar:3.11.3]
    at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:602) [apache-cassandra-3.11.3.jar:3.11.3]
    at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:691) [apache-cassandra-3.11.3.jar:3.11.3]

In this case, the cluster reports node1 as down and still shows its original Host ID.

$ ccm node2 nodetool status

Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address    Load       Tokens  Owns (effective)  Host ID                               Rack
DN  127.0.0.1  191.96 KiB  32      100.0%            35a3c8ff-fa20-4f10-81cd-7284caeb00bd  rack1
UN  127.0.0.2  191.82 KiB  32      100.0%            2ebe4f0b-dc8f-4f46-93cd-37c410174a49  rack2
UN  127.0.0.3  170.46 KiB  32      100.0%            0384793e-7f59-40aa-a487-97f410dded4b  rack3

After inspecting the SSTables in both data directories we can see that a new Host ID 7d910c98-f69b-41b4-988a-f432b2e54b38 has been assigned to the node even though it failed to start.

$ ./tools/bin/sstabledump .../node1/data0/system/local-7ad54392bcdd35a684174e047860b377/mc-12-big-Data.db | grep host_id
      { "name" : "host_id", "value" : "35a3c8ff-fa20-4f10-81cd-7284caeb00bd", "tstamp" : "2018-08-21T06:24:08.106Z" },


$ ./tools/bin/sstabledump .../node1/data1/system/local-7ad54392bcdd35a684174e047860b377/mc-10-big-Data.db | grep host_id
      { "name" : "host_id", "value" : "7d910c98-f69b-41b4-988a-f432b2e54b38" },

Take away messages

As we have seen from testing, there are potential dangers with using multiple directories in Cassandra. By simply removing one of the data directories from the setting, a node can come back as a brand new node and affect the rest of the cluster. The JBOD feature can be useful in emergencies where disk space is urgently needed, however its usage in this case should be temporary.

The use of multiple disks in a Cassandra node is, I feel, better done at the OS or hardware layer. Systems like LVM and RAID were designed to allow multiple disks to be used together to make up a volume. Using something like LVM or RAID rather than Cassandra’s JBOD feature reduces the complexity of the Cassandra configuration and the number of moving parts on the Cassandra side that can go wrong. Using the JBOD feature in Cassandra subtly increases operational complexity and reduces the node’s ability to fail fast. In most cases I feel it is more useful for a node to fail outright rather than limp on and potentially impact the cluster in a negative way.

As a final thought, I think one handy feature that could be added to Apache Cassandra to help prevent issues associated with JBOD is the ability to check whether the data, commitlog, saved_caches and hints directories are all empty prior to bootstrapping. If they are empty, then the node proceeds as normal. If they contain data, then perhaps the node could fail to start and print an error message in the logs.
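A rough external approximation of that check can already be scripted today. The sketch below assumes the default package install paths and simply refuses to start Cassandra if some of the node's state directories are empty while others still contain data, which is exactly the half-lost state described above:

# Either all of the node's state directories are empty (fresh node) or none are
dirs="/var/lib/cassandra/data /var/lib/cassandra/commitlog /var/lib/cassandra/saved_caches /var/lib/cassandra/hints"
empty=0
populated=0
for d in ${dirs}; do
  if [ -z "$(ls -A "${d}" 2>/dev/null)" ]; then
    empty=$((empty + 1))
  else
    populated=$((populated + 1))
  fi
done
if [ "${empty}" -gt 0 ] && [ "${populated}" -gt 0 ]; then
  echo "Refusing to start Cassandra: some state directories are empty, others are not" >&2
  exit 1
fi
systemctl start cassandra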

Testing Apache Cassandra 4.0

With the goal of ensuring reliability and stability in Apache Cassandra 4.0, the project’s committers have voted to freeze new features on September 1 to concentrate on testing and validation before cutting a stable beta. Towards that goal, the community is investing in methodologies that can be performed at scale to exercise edge cases in the largest Cassandra clusters. The result, we hope, is to make Apache Cassandra 4.0 the best-tested and most reliable major release right out of the gate.

In the interests of communication (and hopefully more participation), here’s a look at some of the approaches being used to test Apache Cassandra 4.0:


Replay Testing

Workload Recording, Log Replay, and Comparison

Replay testing allows for side-by-side comparison of a workload using two versions of the same database. It is a black-box technique that answers the question, “did anything change that we didn’t expect?”

Replay testing is simple in concept: record a workload, then re-issue it against two clusters – one running a stable release and the second running a candidate build. Replay testing a stateful distributed system is more challenging. For a subset of workloads, we can achieve determinism in testing by grouping writes by CQL partition and ordering them via client-supplied timestamps. This also allows us to achieve parallelism, as recorded workloads can be distributed by partition across an arbitrarily-large fleet of writers. Though linearizing updates within a partition and comparing differences does not allow for validation of all possible workloads (e.g., CAS queries), this subset is very useful.

The suite of Full Query Logging (“FQL”) tools in Apache Cassandra enable workload recording. CASSANDRA-14618 and CASSANDRA-14619 will add fqltool replay and fqltool compare, enabling log replay and comparison. Standard tools in the Apache ecosystem such as Apache Spark and Apache Mesos can also make parallelizing replay and comparison across large clusters of machines straightforward.
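For example, on a 4.0 build the full query log can be toggled per node with nodetool. The path below is illustrative, and the replay and comparison steps depend on the fqltool changes referenced above:

# Record the workload on a node running a 4.0 build
$ nodetool enablefullquerylog --path /var/lib/cassandra/fql
# ... run the workload to be captured ...
$ nodetool disablefullquerylog
# The captured log can then be re-issued against a candidate cluster and the
# results compared with the fqltool replay and fqltool compare commands above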


Fuzz Testing and Property-Based Testing

Dynamic Test Generation and Fuzzing

Fuzz testing dynamically generates input to be passed through a function for validation. We can make fuzz testing smarter in stateful systems like Apache Cassandra to assert that persisted data conforms to the database’s contracts: acknowledged writes are not lost, deleted data is not resurrected, and consistency levels are respected. Fuzz testing of storage systems to validate these properties requires maintaining a record of responses received from the system; the development of a model representing valid legal states of data within the database; and a validation pass to assert that responses reflect valid states according to that model.

Property-based testing combines fuzz testing and assertions to explore a state space using randomly-generated input. These tests provide dynamic input to the system and assert that its fundamental properties are not violated. These properties can range from generic (e.g., “I can write data and read it back”) to specific (“range tombstone bounds synthesized during short-read-protection reads are properly closed”); and from local to distributed (e.g., “replacing every single node in a cluster results in an identical database”). To simplify debugging, property-based testing libraries like QuickTheories also provide a “shrinker,” which attempts to generate the simplest possible failing case after detecting input or a sequence of actions that triggers a failure.
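
To make that concrete, here is a minimal property written with QuickTheories asserting the generic “I can write data and read it back” property. The KeyValueStore interface and its in-memory implementation are invented for the example; in a real test the system under test would be a Cassandra cluster accessed through a driver.

// Minimal QuickTheories property: whatever key/value pair we write can be
// read back. KeyValueStore is a hypothetical test harness interface, not a
// Cassandra API; InMemoryStore stands in for the system under test.
import static org.quicktheories.QuickTheory.qt;
import static org.quicktheories.generators.SourceDSL.integers;
import static org.quicktheories.generators.SourceDSL.strings;

import java.util.HashMap;
import java.util.Map;

public class WriteReadProperty {

    interface KeyValueStore {
        void put(String key, int value);
        Integer get(String key);
    }

    static class InMemoryStore implements KeyValueStore {
        private final Map<String, Integer> data = new HashMap<>();
        public void put(String key, int value) { data.put(key, value); }
        public Integer get(String key) { return data.get(key); }
    }

    public static void main(String[] args) {
        KeyValueStore store = new InMemoryStore();
        // withExamples controls how much of the state space is sampled;
        // on failure QuickTheories shrinks towards the simplest failing input.
        qt().withExamples(1000)
            .forAll(strings().basicLatinAlphabet().ofLengthBetween(1, 20),
                    integers().all())
            .check((key, value) -> {
                store.put(key, value);
                return value.equals(store.get(key));
            });
    }
}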

Unlike model checkers, property-based tests don’t exhaust the state space – but explore it until a threshold of examples is reached. This allows for the computation to be distributed across many machines to gain confidence in code and infrastructure that scales with the amount of computation applied to test it.


Distributed Tests and Fault-Injection Testing

Validating Behavior Under Fault Scenarios

All of the above techniques can be combined with fault injection testing to validate that the system maintains availability where expected in fault scenarios, that fundamental properties hold, and that reads and writes conform to the system’s contracts. By asserting a series of invariants under fault scenarios using different techniques, we gain the ability to exercise edge cases in the system that may reveal unexpected failures in extreme scenarios. Injected faults can take many forms – network partitions, process pauses, disk failures, and more.
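
One common way to simulate a network partition on Linux is to drop traffic from a peer with iptables. The sketch below wraps that in a small injector interface so the same tests can be parameterized over different fault types; the commands and the interface are illustrative only, and the iptables calls require root privileges on the node they run on.

// Sketch of a fault injector used to parameterize tests over fault scenarios.
// The iptables commands simulate a network partition by dropping all traffic
// from a peer; they are illustrative only and need root privileges.
import java.io.IOException;

public class FaultInjection {

    interface Fault {
        void inject() throws IOException, InterruptedException;
        void heal() throws IOException, InterruptedException;
    }

    static class NetworkPartition implements Fault {
        private final String peer;

        NetworkPartition(String peer) { this.peer = peer; }

        public void inject() throws IOException, InterruptedException {
            run("iptables", "-A", "INPUT", "-s", peer, "-j", "DROP");
        }

        public void heal() throws IOException, InterruptedException {
            run("iptables", "-D", "INPUT", "-s", peer, "-j", "DROP");
        }

        private static void run(String... cmd) throws IOException, InterruptedException {
            Process p = new ProcessBuilder(cmd).inheritIO().start();
            if (p.waitFor() != 0) {
                throw new IOException("Command failed: " + String.join(" ", cmd));
            }
        }
    }
}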


Upgrade Testing

Ensuring a Safe Upgrade Path

Finally, it’s not enough to test one version of the database. Upgrade testing allows us to validate the upgrade path between major versions, ensuring that a rolling upgrade can be completed successfully, and that the contents of the resulting upgraded database are identical to the original. To perform upgrade tests, we begin by snapshotting a cluster and cloning it twice, resulting in two identical clusters. One of the clusters is then upgraded. Finally, we perform a row-by-row scan and comparison of all data in each partition to assert that all rows read are identical, logging any deltas for investigation. Like fault injection tests, upgrade tests can also be thought of as an operational scenario that all other types of tests can be parameterized against.
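
A crude sketch of that final comparison step, using the DataStax Java driver (3.x API), is shown below. The cluster addresses, keyspace and table names are placeholders, and comparing rows via toString() is a simplification of a real column-by-column comparison; it also assumes both clusters return rows in the same order, which holds for clones with identical token assignments.

// Crude sketch of the row-by-row comparison step of an upgrade test.
// Addresses, keyspace and table are placeholders; comparing rows via
// toString() stands in for a real column-by-column comparison.
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

import java.util.Iterator;

public class UpgradeComparison {

    public static void main(String[] args) {
        try (Cluster original = Cluster.builder().addContactPoint("10.0.0.1").build();
             Cluster upgraded = Cluster.builder().addContactPoint("10.0.1.1").build()) {

            Session a = original.connect();
            Session b = upgraded.connect();

            Iterator<Row> rowsA = a.execute("SELECT * FROM my_ks.my_table").iterator();
            Iterator<Row> rowsB = b.execute("SELECT * FROM my_ks.my_table").iterator();

            long position = 0;
            while (rowsA.hasNext() && rowsB.hasNext()) {
                String left = rowsA.next().toString();
                String right = rowsB.next().toString();
                if (!left.equals(right)) {
                    // Log deltas for investigation rather than failing fast.
                    System.err.println("Delta at row " + position + ": " + left + " != " + right);
                }
                position++;
            }
            if (rowsA.hasNext() || rowsB.hasNext()) {
                System.err.println("Clusters returned different row counts");
            }
        }
    }
}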


Wrapping Up

The Apache Cassandra developer community is working hard to deliver Cassandra 4.0 as the most stable major release to date, bringing a variety of methodologies to bear on the problem. We invite you to join us in the effort, deploying these techniques within your infrastructure and testing the release on your workloads. Learn more about how to get involved here.

The more that join, the better the release we’ll ship together.

Java 11 Support in Apache Cassandra 4.0

At the end of July, support for Java 11 was merged into the Apache Cassandra trunk, which will be shipped in the next major release, Cassandra 4.0. Prior to this, Cassandra 3.0 only ran on Java 8, since there were breaking changes in Java that prevented it from running on later versions. Cassandra now supports both Java 8 and Java 11.

To run Cassandra on Java 11, you’ll first need to download an early access build of JDK 11, since there’s no official release yet. I downloaded a build for my Mac and untarred the archive.

Next, you’ll need to set the environment variables. On my Mac I’ve set the following:

$ export JAVA_HOME="/Users/jhaddad/Downloads/jdk-11.jdk/Contents/Home"
$ export JAVA8_HOME="/Library/Java/JavaVirtualMachines/jdk1.8.0_181.jdk/Contents/Home"

You can get Cassandra by cloning the git repo and building using ant:

$ git clone https://github.com/apache/cassandra.git
$ cd cassandra
$ ant

You should see the build script finish with something like the following:

write-poms:
   [script] Warning: Nashorn engine is planned to be removed from a future JDK release

init:

maven-ant-tasks-localrepo:

maven-ant-tasks-download:

maven-ant-tasks-init:

maven-declare-dependencies:

_write-poms:

build-test:

jar:
      [jar] Building jar: /Users/jhaddad/dev/cassandra/build/tools/lib/stress.jar

BUILD SUCCESSFUL
Total time: 7 seconds

You can now start Cassandra with the following:

$ bin/cassandra -f

One feature that could be a big deal over time is the new garbage collection algorithm, ZGC. The goal of ZGC is to work on huge heaps while keeping pause times at 10 ms or less. If it delivers on that promise, we could avoid an entire GC tuning process that many teams struggle with. It can be enabled with these JVM flags:

-XX:+UnlockExperimentalVMOptions
-XX:+UseZGC

To use ZGC in Cassandra 4.0, you can add the JVM flags to the cassandra-env.sh file located in the conf directory of the repository, as shown below. Note that the flags are added above the JVM_OPTS="$JVM_OPTS $JVM_EXTRA_OPTS" line at the end of the file.

JVM_OPTS="$JVM_OPTS -XX:+UnlockExperimentalVMOptions"
JVM_OPTS="$JVM_OPTS -XX:+UseZGC"
JVM_OPTS="$JVM_OPTS $JVM_EXTRA_OPTS"

The Cassandra team intends to freeze the trunk branch in September, committing to bug fixes and stability improvements before releasing 4.0. We’d love feedback on the release during this period, especially with regard to performance on Java 11. We appreciate any help testing real-world workloads (in a staging environment!). Bugs can be reported to the Cassandra JIRA. We aim to make the 4.0 release stable on day one, and we encourage everyone to get involved early to ensure the high quality of this important release!