Migrating DynamoDB Workloads From AWS to Google Cloud – Simplified With ScyllaDB Alternator

Amazon’s DynamoDB must be credited for enabling broader adoption of NoSQL databases at scale. However, many developers want the flexibility to run workloads on different clouds, or across multiple clouds for high availability or disaster recovery purposes. This was a key reason Scylla introduced its DynamoDB-compatible API, Project Alternator. It allows you to run a Scylla cluster on your favorite public cloud, or even on-premises (either on your own equipment, or as part of an AWS Outposts deployment).

Let’s say that your favorite cloud is Google Cloud and that’s where you’d like to move your current DynamoDB workload. Moving from AWS to Google Cloud can be hard, especially if your application is tightly-coupled with the proprietary AWS DynamoDB API. With the introduction of ScyllaDB Cloud Alternator, our DynamoDB-compatible API as a service, this task became much easier.

This post will guide you through the database part of the migration process, ensuring minimal changes to your applications. What does it take? Let’s go step-by-step through the migration process.

Launch a ScyllaDB Cloud Alternator instance on Google Cloud

This part is easy:

Visit cloud.scylladb.com, sign in or sign up, and click “new cluster”.

Select GCP and Alternator, choose the instance type, click “Launch” and grab a coffee. You will have a cluster up and running in a few minutes.

Once the cluster is up and running, you can visit the cluster view to check its status.

Move to the New Cluster

For the scope of this post, I’m ignoring the migration of the application logic and the data transfer cost. Clearly you will need to consider both.

The first question you need to ask yourself is: can I tolerate downtime in service during the migration?

  • If yes, you need a cold / offline migration. You only need to migrate the historical data from DynamoDB to Scylla, a process also called forklifting.
  • If not, you need a hot / live migration. You will first need to extend your application to perform dual writes to both databases, and only then execute the forklift.

Cold Migration

Hot Migration

Real Time Sync

There are two alternatives for keeping the two databases in sync in real time:

  • Dual Writes — the application writes the same event to both databases. This can extend to dual reads as well, allowing the application to compare the reads in real time. The disadvantage is the need to update the application with non-trivial logic.
  • Consuming DynamoDB Streams to feed the new database — the disadvantage is the need to enable streams on all relevant DynamoDB tables, and the cost associated with them.

Both methods allow you to choose which updates you want to sync in real time. Often one can use cold migration for most of the data, and hot migration for the rest.

Dual Writes
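
To make the dual-write idea concrete, here is a minimal Python sketch using boto3. It is an illustration only: the endpoint URL, region, table name, and item are placeholders, and real code would add error handling, retries, and a strategy for when one of the two writes fails.

import boto3

# Same DynamoDB API, two targets: AWS DynamoDB and the ScyllaDB Alternator endpoint.
aws_dynamodb = boto3.resource("dynamodb", region_name="us-east-1")
alternator = boto3.resource(
    "dynamodb",
    endpoint_url="http://scylla-alternator.example.com:8000",  # placeholder Alternator endpoint
    region_name="us-east-1",
)

aws_table = aws_dynamodb.Table("events")      # hypothetical table name
scylla_table = alternator.Table("events")

def dual_write(item):
    # Write the same item to both databases during the migration window.
    aws_table.put_item(Item=item)
    scylla_table.put_item(Item=item)

dual_write({"id": "user-42", "event": "login"})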

Streams

More on using streams to sync real-time requests:
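
As a rough sketch of this approach (not a production consumer), the Python snippet below reads change records from a DynamoDB stream with boto3 and applies them to the Alternator table. It assumes a stream with the NEW_IMAGE view type is already enabled on the source table, handles only a single shard, and does no checkpointing; the stream ARN, endpoint, and table name are placeholders.

import boto3
from boto3.dynamodb.types import TypeDeserializer

streams = boto3.client("dynamodbstreams", region_name="us-east-1")
target = boto3.resource(
    "dynamodb",
    endpoint_url="http://scylla-alternator.example.com:8000",  # placeholder endpoint
    region_name="us-east-1",
).Table("events")                                              # hypothetical table name
deserializer = TypeDeserializer()

stream_arn = "arn:aws:dynamodb:us-east-1:123456789012:table/events/stream/LABEL"  # placeholder
shard = streams.describe_stream(StreamArn=stream_arn)["StreamDescription"]["Shards"][0]
iterator = streams.get_shard_iterator(
    StreamArn=stream_arn,
    ShardId=shard["ShardId"],
    ShardIteratorType="TRIM_HORIZON",
)["ShardIterator"]

while iterator:
    batch = streams.get_records(ShardIterator=iterator)
    for record in batch["Records"]:
        data = record["dynamodb"]
        if record["eventName"] in ("INSERT", "MODIFY"):
            # Convert the DynamoDB-typed NewImage into plain Python types before writing.
            item = {k: deserializer.deserialize(v) for k, v in data["NewImage"].items()}
            target.put_item(Item=item)
        elif record["eventName"] == "REMOVE":
            key = {k: deserializer.deserialize(v) for k, v in data["Keys"].items()}
            target.delete_item(Key=key)
    iterator = batch.get("NextShardIterator")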

Forklifting Historical Data

To upload historical data from AWS DynamoDB to Scylla Alternator we will use the Scylla Migrator, a Spark-based tool that can read from AWS DynamoDB (as well as Apache Cassandra and Scylla) and write to the Scylla Alternator API.

You will need to configure the source DynamoDB table and the target Scylla cluster, and launch a Spark migration job.

Below is an example of such a job in the Spark UI.

For more details, the recommended configuration, and how to validate that the migration succeeded, see Migrating From DynamoDB To Scylla.
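
One simple sanity check after the forklift is to compare item counts between the source table and the Alternator table. The Python sketch below is only a rough check (counts alone do not prove the data matches, and a full scan is expensive on large tables); the endpoint and table names are placeholders.

import boto3

def count_items(table):
    # Paginated COUNT scan; fine as a sanity check, expensive on very large tables.
    total, kwargs = 0, {"Select": "COUNT"}
    while True:
        page = table.scan(**kwargs)
        total += page["Count"]
        if "LastEvaluatedKey" not in page:
            return total
        kwargs["ExclusiveStartKey"] = page["LastEvaluatedKey"]

source = boto3.resource("dynamodb", region_name="us-east-1").Table("events")
target = boto3.resource(
    "dynamodb",
    endpoint_url="http://scylla-alternator.example.com:8000",  # placeholder endpoint
    region_name="us-east-1",
).Table("events")

print("source items:", count_items(source), "target items:", count_items(target))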

Update Your Application

When ready, update your application to use the Scylla Alternator endpoint instead of the AWS one.

Check out the Connect tab in the Scylla Cloud cluster view for examples.

Resources

GET STARTED ON SCYLLA CLOUD

The post Migrating DynamoDB Workloads From AWS to Google Cloud – Simplified With ScyllaDB Alternator appeared first on ScyllaDB.

Reaper 3.0 for Apache Cassandra was released

The K8ssandra team is pleased to announce the release of Reaper 3.0. Let’s dive into the main new features and improvements this major version brings, along with some notable removals.

Storage backends

Over the years, we regularly discussed dropping support for Postgres and H2 with the TLP team. The effort for maintaining these storage backends was moderate, despite our lack of expertise in Postgres, as long as Reaper’s architecture was simple. Complexity grew with more deployment options, culminating with the addition of the sidecar mode.
Some features require different consensus strategies depending on the backend, which sometimes led to implementations that worked well with one backend and were buggy with others.
In order to allow building new features faster, while providing a consistent experience for all users, we decided to drop the Postgres and H2 backends in 3.0.

Apache Cassandra and the managed DataStax Astra DB service are now the only production storage backends for Reaper. The free tier of Astra DB will be more than sufficient for most deployments.

Reaper does not generally require high availability - even complete data loss has mild consequences. Where Astra is not an option, a single Cassandra server can be started on the instance that hosts Reaper, or an existing cluster can be used as a backend data store.

Adaptive Repairs and Schedules

One of the pain points we observed when people start using Reaper is understanding the segment orchestration and knowing how the default timeout impacts the execution of repairs.
Repair is a complex choreography of operations in a distributed system. As such, and especially in the days when Reaper was created, the process could get blocked for several reasons and required a manual restart. The smart folks that designed Reaper at Spotify decided to put a timeout on segments to deal with such blockage, over which they would be terminated and rescheduled.
Problems arise when segments are too big (or have too much entropy) to process within the default 30-minute timeout, despite not being blocked. They are repeatedly terminated and recreated, and the repair appears to make no progress.
Reaper did a poor job of dealing with this, for two main reasons:

  • Each retry used the same timeout, possibly failing segments forever
  • Nothing obvious was reported to explain what was failing and how to fix the situation

We fixed the former by using a longer timeout on subsequent retries, which is a simple trick to make repairs more “adaptive”. If the segments are too big, they’ll eventually pass after a few retries. It’s a good first step to improve the experience, but it’s not enough for scheduled repairs as they could end up with the same repeated failures for each run.
This is where we introduce adaptive schedules, which use feedback from past repair runs to adjust either the number of segments or the timeout for the next repair run.

Adaptive Schedules

Adaptive schedules will be updated at the end of each repair if the run metrics justify it. The schedule can get a different number of segments or a higher segment timeout depending on the latest run.
The rules are as follows (sketched in code after the list):

  • if more than 20% of segments were extended, the number of segments will be raised by 20% on the schedule
  • if less than 20% of segments were extended (and at least one was), the timeout will be set to twice the current timeout
  • if no segment was extended and the maximum duration of segments is below 5 minutes, the number of segments will be reduced by 10% with a minimum of 16 segments per node.
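
Expressed as code, the adjustment logic looks roughly like the Python sketch below. This is only an illustration of the rules above, not Reaper's actual implementation (Reaper is written in Java), and all names are made up.

def adjust_schedule(total_segments, extended_segments, max_segment_minutes,
                    segment_count, timeout_minutes, nodes):
    # Illustrative only: adjust the next run based on the last run's metrics.
    extended_ratio = extended_segments / total_segments
    if extended_ratio > 0.20:
        # Too many segments needed extending: use 20% more (smaller) segments.
        segment_count = int(segment_count * 1.2)
    elif extended_segments > 0:
        # A few segments needed extending: double the segment timeout.
        timeout_minutes *= 2
    elif max_segment_minutes < 5:
        # Every segment finished quickly: use 10% fewer segments,
        # with a floor of 16 segments per node.
        segment_count = max(int(segment_count * 0.9), 16 * nodes)
    return segment_count, timeout_minutes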

This feature is disabled by default and is configurable on a per schedule basis. The timeout can now be set differently for each schedule, from the UI or the REST API, instead of having to change the Reaper config file and restart the process.

Incremental Repair Triggers

As we celebrate the long-awaited improvements in incremental repairs brought by Cassandra 4.0, it was time to embrace them with more appropriate triggers. One metric that incremental repair makes available is the percentage of repaired data per table. When running against too much unrepaired data, incremental repair can put a lot of pressure on a cluster due to the heavy anticompaction process.

The best practice is to run it on a regular basis so that the amount of unrepaired data is kept low. Since your throughput may vary from one table/keyspace to the other, it can be challenging to set the right interval for your incremental repair schedules.

Reaper 3.0 introduces a new trigger for the incremental schedules, which is a threshold of unrepaired data. This allows creating schedules that will start a new run as soon as, for example, 10% of the data for at least one table from the keyspace is unrepaired.

Those triggers are complementary to the interval in days, which could still be necessary for low traffic keyspaces that need to be repaired to secure tombstones.
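
Conceptually, the new trigger boils down to a check like the small Python sketch below (an illustration, not Reaper's API): a schedule starts a run as soon as any table in the keyspace crosses the configured unrepaired-data threshold.

def should_start_incremental_repair(percent_unrepaired_by_table, threshold=10):
    # Start a run as soon as any table crosses the unrepaired-data threshold.
    return any(pct >= threshold for pct in percent_unrepaired_by_table.values())

should_start_incremental_repair({"users": 3.2, "events": 12.5})  # -> True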

Percent unrepaired triggers

These new features will allow you to safely optimize tombstone deletions by enabling the only_purge_repaired_tombstones compaction subproperty in Cassandra, permitting you to reduce gc_grace_seconds down to 3 hours without fear of deleted data reappearing.

Schedules can be edited

That may sound like an obvious feature but previous versions of Reaper didn’t allow for editing of an existing schedule. This led to an annoying procedure where you had to delete the schedule (which isn’t made easy by Reaper either) and recreate it with the new settings.

3.0 fixes that embarrassing situation and adds an edit button to schedules, which allows you to change the mutable settings of schedules:

Edit Repair Schedule

More improvements

In order to protect clusters from running mixed incremental and full repairs in older versions of Cassandra, Reaper would disallow the creation of an incremental repair run/schedule if a full repair had been created on the same set of tables in the past (and vice versa).

Now that incremental repair is safe for production use, it is necessary to allow such mixed repair types. In case of conflict, Reaper 3.0 will display a pop-up informing you and allowing you to force create the schedule/run:

Force bypass schedule conflict

We’ve also added a special “schema migration mode” for Reaper, which will exit after the schema was created/upgraded. We use this mode in K8ssandra to prevent schema conflicts and allow the schema creation to be executed in an init container that won’t be subject to liveness probes that could trigger the premature termination of the Reaper pod:

java -jar path/to/reaper.jar schema-migration path/to/cassandra-reaper.yaml

There are many other improvements and we invite all users to check the changelog in the GitHub repo.

Upgrade Now

We encourage all Reaper users to upgrade to 3.0.0, while recommending that they carefully prepare their migration off Postgres/H2. Note that there is no export/import feature and schedules will need to be recreated after the migration.

All instructions to download, install, configure, and use Reaper 3.0.0 are available on the Reaper website.

Certificates management and Cassandra Pt II - cert-manager and Kubernetes

The joys of certificate management

Certificate management has long been a bugbear of enterprise environments, and expired certs have been the cause of countless outages. When managing large numbers of services at scale, it helps to have an automated approach to managing certs in order to handle renewal and avoid embarrassing and avoidable downtime.

This is part II of our exploration of certificates and encrypting Cassandra. In this blog post, we will dive into certificate management in Kubernetes. This post builds on a few of the concepts in Part I of this series, where Anthony explained the components of SSL encryption.

Recent years have seen the rise of some fantastic, free, automation-first services like letsencrypt, and no one should be caught flat-footed by certificate renewals in 2021. In this blog post, we will look at one Kubernetes-native tool that aims to make this process much more ergonomic on Kubernetes: cert-manager.

Recap

Anthony has already discussed several points about certificates. To recap:

  1. In asymmetric encryption and digital signing processes we always have public/private key pairs. We are referring to these as the Keystore Private Signing Key (KS PSK) and Keystore Public Certificate (KS PC).
  2. Public keys can always be openly published and allow senders to communicate to the holder of the matching private key.
  3. A certificate is just a public key - and some additional fields - which has been signed by a certificate authority (CA). A CA is a party trusted by all parties to an encrypted conversation.
  4. When a CA signs a certificate, this is a way for that mutually trusted party to attest that the party holding that certificate is who they say they are.
  5. CAs themselves use public certificates (Certificate Authority Public Certificate; CA PC) and private signing keys (the Certificate Authority Private Signing Key; CA PSK) to sign certificates in a verifiable way.

The many certificates that Cassandra might be using

In a moderately complex Cassandra configuration, we might have:

  1. A root CA (cert A) for internode encryption.
  2. A certificate per node signed by cert A.
  3. A root CA (cert B) for the client-server encryption.
  4. A certificate per node signed by cert B.
  5. A certificate per client signed by cert B.

Even in a three node cluster, we can envisage a case where we must create two root CAs and 6 certificates, plus a certificate for each client application; for a total of 8+ certificates!

To compound the problem, this isn’t a one-off setup. Instead, we need to be able to rotate these certificates at regular intervals as they expire.

Ergonomic certificate management on Kubernetes with cert-manager

Thankfully, these processes are well supported on Kubernetes by a tool called cert-manager.

cert-manager is an all-in-one tool that should save you from ever having to reach for openssl or keytool again. As a Kubernetes operator, it manages a variety of custom resources (CRs) such as (Cluster)Issuers, CertificateRequests and Certificates. Critically it integrates with Automated Certificate Management Environment (ACME) Issuers, such as LetsEncrypt (which we will not be discussing today).

The workflow reduces to:

  1. Create an Issuer (via ACME, or a custom CA).
  2. Create a Certificate CR.
  3. Pick up your certificates and signing keys from the secrets cert-manager creates, and mount them as volumes in your pods’ containers.

Everything is managed declaratively, and you can reissue certificates at will simply by deleting and re-creating the certificates and secrets.

Or you can use the kubectl plugin, which allows you to write a simple kubectl cert-manager renew. We won’t discuss this in depth here; see the cert-manager documentation for more information.

Java batteries included (mostly)

At this point, Cassandra users are probably about to interject with a loud “Yes, but I need keystores and truststores, so this solution only gets me halfway”. As luck would have it, from version 0.15, cert-manager also allows you to create JKS truststores and keystores directly from the Certificate CR.

The fine print

There are two caveats to be aware of here:

  1. Most Cassandra deployment options currently available (including StatefulSets, cass-operator or k8ssandra) do not currently support using a cert-per-node configuration in a convenient fashion. This is because the PodTemplate.spec portions of these resources are identical for each pod in the StatefulSet. This precludes the possibility of adding per-node certs via the environment or volume mounts.
  2. There are currently some open questions about how to rotate certificates without downtime when using internode encryption.
    • Our current recommendation is to use a CA PC per Cassandra datacenter (DC) and add some basic scripts to merge both CA PCs into a single truststore to be propagated across all nodes. By renewing the CA PC independently you can ensure one DC is always online, but you still do suffer a network partition. Hinted handoff should theoretically rescue the situation but it is a less than robust solution, particularly on larger clusters. This solution is not recommended when using lightweight transactions or non-LOCAL consistency levels.
    • One mitigation to consider is using non-expiring CA PCs, in which case no CA PC rotation is ever performed without a manual trigger. KS PCs and KS PSKs may still be rotated. When CA PC rotation is essential this approach allows for careful planning ahead of time, but it is not always possible when using a 3rd party CA.
    • Istio or other service mesh approaches can fully automate mTLS in clusters, but Istio is a fairly large commitment and can create its own complexities.
    • Manual management of certificates may be possible using a secure vault (e.g. HashiCorp Vault), sealed secrets, or similar approaches. In this case, cert-manager may not be involved.

These caveats are not trivial. To address (2) more elegantly you could also implement Anthony’s solution from part one of this blog series, but you’ll need to script this up yourself to suit your k8s environment.

We are also in discussions with the folks over at cert-manager about how their ecosystem can better support Cassandra. We hope to report progress on this front over the coming months.

These caveats present challenges, but there are also specific cases where they matter less.

cert-manager and Reaper - a match made in heaven

One case where we really don’t care if a client is unavailable for a short period is when Reaper is the client.

Cassandra is an eventually consistent system and suffers from entropy. Data on nodes can become out of sync with other nodes due to transient network failures, node restarts and the general wear and tear incurred by a server operating 24/7 for several years.

Cassandra contemplates that this may occur. It provides a variety of consistency level settings allowing you to control how many nodes must agree for a piece of data to be considered the truth. But even though properly set consistency levels ensure that the data returned will be accurate, the process of reconciling data across the network degrades read performance - it is best to have consistent data on hand when you go to read it.

As a result, we recommend the use of Reaper, which runs as a Cassandra client and automatically repairs the cluster in a slow trickle, ensuring that a high volume of repairs are not scheduled all at once (which would overwhelm the cluster and degrade the performance of real clients) while also making sure that all data is eventually repaired for when it is needed.

The set up

The manifests for this blog post can be found here.

Environment

We assume that you’re running Kubernetes 1.21, and we’ll be running with a Cassandra 3.11.10 install. The demo environment we’ll be setting up is a three-node environment, and we have tested this configuration against three nodes.

We will be installing the cass-operator and Cassandra cluster into the cass-operator namespace, while the cert-manager operator will sit within the cert-manager namespace.

Setting up kind

For testing, we often use kind to provide a local k8s cluster. You can use minikube or whatever solution you prefer (including a real cluster running on GKE, EKS, or AKS), but we’ll include some kind instructions and scripts here to ease the way.

If you want a quick fix to get you started, try running the setup-kind-multicluster.sh script from the k8ssandra-operator repository, with setup-kind-multicluster.sh --kind-worker-nodes 3. I have included this script in the root of the code examples repo that accompanies this blog.

A demo CA certificate

We aren’t going to use LetsEncrypt for this demo, firstly because ACME certificate issuance has some complexities (including needing control of DNS or a publicly hosted HTTP server) and secondly because I want to reinforce that cert-manager is useful to organisations who are bringing their own certs and don’t need one issued. This is especially useful for on-prem deployments.

First off, create a new private key and certificate pair for your root CA. Note that the file names tls.crt and tls.key will become important in a moment.

openssl genrsa -out manifests/demoCA/tls.key 4096
openssl req -new -x509 -key manifests/demoCA/tls.key -out manifests/demoCA/tls.crt -subj "/C=AU/ST=NSW/L=Sydney/O=Global Security/OU=IT Department/CN=example.com"

(Or you can just run the generate-certs.sh script in the manifests/demoCA directory - ensure you run it from the root of the project so that the secrets appear in .manifests/demoCA/.)

When running this process on macOS, be aware of this issue which affects the creation of self-signed certificates. The repo referenced in this blog post contains example certificates which you can use for demo purposes - but do not use these outside your local machine.

Now we’re going to use kustomize (which comes with kubectl) to add these files to Kubernetes as secrets. kustomize is not a templating language like Helm. But it fulfills a similar role by allowing you to build a set of base manifests that are then bundled, and which can be customised for your particular deployment scenario by patching.

Run kubectl apply -k manifests/demoCA. This will build the secrets resources using the kustomize secretGenerator and add them to Kubernetes. Breaking this process down piece by piece:

# ./manifests/demoCA
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: cass-operator
generatorOptions:
 disableNameSuffixHash: true
secretGenerator:
- name: demo-ca
  type: tls
  files:
  - tls.crt
  - tls.key

  • We use disableNameSuffixHash, because otherwise kustomize will add hashes to each of our secret names. This makes it harder to build these deployments one component at a time.
  • The tls type secret conventionally takes two keys with these names, as per the next point. cert-manager expects a secret in this format in order to create the Issuer which we will explain in the next step.
  • We are adding the files tls.crt and tls.key. The file names will become the keys of a secret called demo-ca.

cert-manager

cert-manager can be installed by running kubectl apply -f https://github.com/jetstack/cert-manager/releases/download/v1.5.3/cert-manager.yaml. It will install into the cert-manager namespace because a Kubernetes cluster should only ever have a single cert-manager operator installed.

cert-manager will install a deployment, as well as various custom resource definitions (CRDs) and webhooks to deal with the lifecycle of the Custom Resources (CRs).

A cert-manager Issuer

Issuers come in various forms. Today we’ll be using a CA Issuer because our components need to trust each other, but don’t need to be trusted by a web browser.

Other options include ACME based Issuers compatible with LetsEncrypt, but these would require that we have control of a public facing DNS or HTTP server, and that isn’t always the case for Cassandra, especially on-prem.

Dive into the truststore-keystore directory and you’ll find the Issuer; it is very simple, so we won’t reproduce it here. The only thing to note is that it takes a secret which has keys of tls.crt and tls.key - the secret you pass in must have these keys. These are the CA PC and CA PSK we mentioned earlier.

We’ll apply this manifest to the cluster in the next step.

Some cert-manager certs

Let’s start with the Cassandra-Certificate.yaml resource:

spec:
  # Secret names are always required.
  secretName: cassandra-jks-keystore
  duration: 2160h # 90d
  renewBefore: 360h # 15d
  subject:
    organizations:
    - datastax
  dnsNames:
  - dc1.cass-operator.svc.cluster.local
  isCA: false
  usages:
    - server auth
    - client auth
  issuerRef:
    name: ca-issuer
    # We can reference ClusterIssuers by changing the kind here.
    # The default value is `Issuer` (i.e. a locally namespaced Issuer)
    kind: Issuer
    # This is optional since cert-manager will default to this value however
    # if you are using an external issuer, change this to that `Issuer` group.
    group: cert-manager.io
  keystores:
    jks:
      create: true
      passwordSecretRef: # Password used to encrypt the keystore
        key: keystore-pass
        name: jks-password
  privateKey:
    algorithm: RSA
    encoding: PKCS1
    size: 2048

The first part of the spec here tells us a few things:

  • The keystore, truststore and certificates will be fields within a secret called cassandra-jks-keystore. This secret will end up holding our KS PSK and KS PC.
  • It will be valid for 90 days.
  • 15 days before expiry, it will be renewed automatically by cert-manager, which will contact the Issuer to do so.
  • It has a subject organisation. You can add any of the X509 subject fields here, but it needs to have one of them.
  • It has a DNS name - you could also provide a URI or IP address. In this case we have used the service address of the Cassandra datacenter which we are about to create via the operator. This has a format of <DC_NAME>.<NAMESPACE>.svc.cluster.local.
  • It is not a CA (isCA), and can be used for server auth or client auth (usages). You can tune these settings according to your needs. If you make your cert a CA you can even reference it in a new Issuer, and define cute tree-like structures (if you’re into that).

Outside the certificates themselves, there are additional settings controlling how they are issued and what format this happens in.

  • IssuerRef is used to define the Issuer we want to issue the certificate. The Issuer will sign the certificate with its CA PSK.
  • We are specifying that we would like a keystore created with the keystore key, and that we’d like it in jks format with the corresponding key.
  • passwordSecretRef references a secret and a key within it. It will be used to provide the password for the keystore (the truststore is unencrypted as it contains only public certs and no signing keys).

The Reaper-Certificate.yaml is similar in structure, but has a different DNS name. We aren’t configuring Cassandra to verify that the DNS name on the certificate matches the DNS name of the parties in this particular case.

Apply all of the certs and the Issuer using kubectl apply -k manifests/truststore-keystore.

Cass-operator

Examining the cass-operator directory, we’ll see that there is a kustomization.yaml which references the remote cass-operator repository and a local cassandraDatacenter.yaml. This applies the manifests required to run up a cass-operator installation namespaced to the cass-operator namespace.

Note that this installation of the operator will only watch its own namespace for CassandraDatacenter CRs. So if you create a DC in a different namespace, nothing will happen.

We will apply these manifests in the next step.

CassandraDatacenter

Finally, the CassandraDatacenter resource in the ./cass-operator/ directory will describe the kind of DC we want:

apiVersion: cassandra.datastax.com/v1beta1
kind: CassandraDatacenter
metadata:
  name: dc1
spec:
  clusterName: cluster1
  serverType: cassandra
  serverVersion: 3.11.10
  managementApiAuth:
    insecure: {}
  size: 1
  podTemplateSpec:
    spec:
      containers:
        - name: "cassandra"
          volumeMounts:
          - name: certs
            mountPath: "/crypto"
      volumes:
      - name: certs
        secret:
          secretName: cassandra-jks-keystore
  storageConfig:
    cassandraDataVolumeClaimSpec:
      storageClassName: standard
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 5Gi
  config:
    cassandra-yaml:
      authenticator: org.apache.cassandra.auth.AllowAllAuthenticator
      authorizer: org.apache.cassandra.auth.AllowAllAuthorizer
      role_manager: org.apache.cassandra.auth.CassandraRoleManager
      client_encryption_options:
        enabled: true
        # If enabled and optional is set to true encrypted and unencrypted connections are handled.
        optional: false
        keystore: /crypto/keystore.jks
        keystore_password: dc1
        require_client_auth: true
        # Set trustore and truststore_password if require_client_auth is true
        truststore: /crypto/truststore.jks
        truststore_password: dc1
        protocol: TLS
        # cipher_suites: [TLS_RSA_WITH_AES_128_CBC_SHA] # An earlier version of this manifest configured cipher suites but the proposed config was less secure. This section does not need to be modified.
      server_encryption_options:
        internode_encryption: all
        keystore: /crypto/keystore.jks
        keystore_password: dc1
        truststore: /crypto/truststore.jks
        truststore_password: dc1
    jvm-options:
      initial_heap_size: 800M
      max_heap_size: 800M

  • We provide a name for the DC - dc1.
  • We provide a name for the cluster - the DC would join other DCs if they already exist in the k8s cluster and we configured the additionalSeeds property.
  • We use the podTemplateSpec.volumes array to declare the volumes for the Cassandra pods, and we use the podTemplateSpec.containers.volumeMounts array to describe where and how to mount them.

The config.cassandra-yaml field is where most of the encryption configuration happens, and we are using it to enable both internode and client-server encryption, which both use the same keystore and truststore for simplicity. Remember that using internode encryption means your DC needs to go offline briefly for a full restart when the CA’s keys rotate.

  • We are not using authz/n in this case to keep things simple. Don’t do this in production.
  • For both encryption types we need to specify (1) the keystore location, (2) the truststore location and (3) the passwords for the keystores. The locations of the keystore/truststore come from where we mounted them in volumeMounts.
  • We are specifying JVM options just to make this run politely on a smaller machine. You would tune this for a production deployment.

Roll out the cass-operator and the CassandraDatacenter using kubectl apply -k manifests/cass-operator. Because the CRDs might take a moment to propagate, there is a chance you’ll see errors stating that the resource type does not exist. Just keep re-applying until everything works - this is a declarative system so applying the same manifests multiple times is an idempotent operation.

Reaper deployment

The k8ssandra project offers a Reaper operator, but for simplicity we are using a simple deployment (because not every deployment needs an operator). The deployment is standard Kubernetes fare, and if you want more information on how these work you should refer to the Kubernetes docs.

We are injecting the keystore and truststore passwords into the environment here, to avoid placing them in the manifests. cass-operator does not currently support this approach without an initContainer to pre-process the cassandra.yaml using envsubst or a similar tool.

The only other note is that we are also pulling down a Cassandra image and using it in an initContainer to create a keyspace for Reaper, if it does not exist. In this container, we are also adding a ~/.cassandra/cqlshrc file under the home directory. This provides SSL connectivity configurations for the container. The critical part of the cqlshrc file that we are adding is:

[ssl]
certfile = /crypto/ca.crt
validate = true
userkey = /crypto/tls.key
usercert = /crypto/tls.crt
version = TLSv1_2

The version = TLSv1_2 tripped me up a few times, as it seems to be a recent requirement. Failing to add this line will give you back the rather fierce Last error: [SSL] internal error in the initContainer. The commands run in this container are not ideal. In particular, the fact that we are sleeping for 840 seconds to wait for Cassandra to start is sloppy. In a real deployment we’d want to health check and wait until the Cassandra service became available.
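
For example, a more robust wait could poll the CQL port until it is reachable rather than sleeping for a fixed time. The Python sketch below is one naive way to do that (a port check is still weaker than a true CQL-level health check, and the hostname shown is the service DNS name used on the certificate above):

import socket
import time

def wait_for_cassandra(host, port=9042, timeout_seconds=900):
    # Poll until the CQL port accepts TCP connections instead of sleeping blindly.
    deadline = time.time() + timeout_seconds
    while time.time() < deadline:
        try:
            with socket.create_connection((host, port), timeout=5):
                return True
        except OSError:
            time.sleep(5)
    raise TimeoutError(f"Cassandra at {host}:{port} was not reachable in time")

wait_for_cassandra("dc1.cass-operator.svc.cluster.local")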

Apply the manifests using kubectl apply -k manifests/reaper.

Results

If you use a GUI, look at the logs for Reaper; you should see that it has connected to the cluster and provided some nice ASCII art to your console.

If you don’t use a GUI, you can run kubectl get pods -n cass-operator to find your Reaper pod (which we’ll call REAPER_PODNAME) and then run kubectl logs -n cass-operator REAPER_PODNAME to pull the logs.

Conclusion

While the above might seem like a complex procedure, we’ve just created a Cassandra cluster with both client-server and internode encryption enabled, all of the required certs, and a Reaper deployment which is configured to connect using the correct certs. Not bad.

Do keep in mind the weaknesses relating to key rotation, and watch this space for progress on that front.

Your Questions about Cassandra 4.0 vs. Scylla 4.4 Answered

We’ve gotten a lot of attention since we published our benchmark reports on the performance of Apache Cassandra 4.0 vs. Cassandra 3.11, as well as how Cassandra 4.0 compares to Scylla 4.4. We had so much interest that we organized a webinar to discuss all of our benchmarking findings. You can watch the entire webinar on-demand now:

WATCH NOW: COMPARING CASSANDRA 4.0, 3.0 AND SCYLLADB

You can read the blogs and watch the video to get the full details from our point of view, but what happened live on the webinar was rather unique. The questions kept coming! In fact, though we generally wrap a webinar in an hour, the Q&A session afterwards took an extra half-hour. ScyllaDB engineers Piotr Grabowski and Karol Baryla fielded all the inquiries with aplomb. So let’s now look at just a few of the questions raised by you, our audience.

Q: Was this the DataStax Enterprise (DSE) version of Cassandra?

Karol: No, our tests were conducted against Apache Cassandra open source versions 3.11 and 4.0.

Piotr: We actually started working on those benchmarks even before Cassandra 4 was released. Of course those numbers are from the officially released version of Cassandra.

[DataStax Enterprise is currently most closely comparable to Cassandra 3.11 — see here.]

Q: Did you try to use off-heap memtables in the tests of Cassandra?

Piotr: Yes. First of all, I don’t think it really improved the performance. The second point is that I had some stability issues with off-heap memtables. Maybe it would require more fine-tuning. We did try to fine-tune the Cassandra configuration as much as we could to get the best results. But for all the benchmarks we have shown, we did not use off-heap memtables.

Q: If Java 16 is not officially supported, is it not risky to use it with Cassandra 4.0 in production?

Karol: Correct. Java 16 is not officially supported by Cassandra. We used it in our benchmarks because we wanted to get the best performance possible for Cassandra. But yeah, if you wanted to use Cassandra 4.0 in production, then this is something you should take into consideration: your performance may not be the same as the performance in our benchmarks. Between Java 11 and Java 16 the ZGC garbage collector had a lot of performance improvements, so if you use Java 11, performance might not be as good.

Q: The performance of ScyllaDB looks so much better. What’s the biggest concern I need to pay attention to if I want to use it to replace current Cassandra deployments?

Piotr: With Scylla, we highly recommend using our shard-aware drivers. Of course, Scylla is compatible with all existing Cassandra drivers. However, we have modified a select portion of them — the Java driver, the Go driver, the C++ driver, [and the Python and Rust drivers] — to take advantage of the shard-aware architecture of Scylla. All of the requests that are sent from our shard-aware drivers come into the correct shard that holds the data.

We did use our own shard-aware drivers in the testing. However, when our shard-aware driver connects to Cassandra it falls back to the old [non-shard aware] implementation, which we didn’t modify. They are backwards compatible with Cassandra.
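
For Python applications, for example, switching to the shard-aware driver is mostly a dependency change: the scylla-driver package is a fork of the standard driver that keeps the same import path, so code like the sketch below stays unchanged while requests get routed to the shard owning the data (the contact point, keyspace, and table are placeholders).

# pip install scylla-driver   (shard-aware fork; same "cassandra" import path)
from cassandra.cluster import Cluster

cluster = Cluster(["scylla-node-1.example.com"])  # placeholder contact point
session = cluster.connect("my_keyspace")          # hypothetical keyspace
session.execute(
    "INSERT INTO events (id, payload) VALUES (%s, %s)",
    ("user-42", "login"),
)
cluster.shutdown()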

Q: With latencies so low (in milliseconds), it seems like ScyllaDB takes away the need for building in-memory caches. Is that the right way to look at this?

Piotr: It depends on your workload. For many workloads it might be possible that you don’t need to use an in-memory cache. And if you look at Scylla internals, we have our own row-based cache, which serves as an in-memory cache within the Scylla database. 

The best way to tell is to measure it yourself. Check out the difference in our benchmarks between the disk-intensive workloads and the memory intensive workloads.

[You can learn more about when an external cache might not be necessary by checking out this whitepaper.]

Q: In the benchmark graphs, the X scale shows 10k/s to 180k/s but they are called “operations” by the presenter. Is it really operations and not kilobytes/second?

Karol: Correct. Those are operations per second, not kilobytes per second.

Piotr: The payload size was the default for cassandra-stress, which is 300 bytes.

[Thus, for example, if a result was 40k/s ops, that would be 40,000 ops x 300 bytes, or 12 Mbytes/sec throughput.]

Piotr: You can read more of the specific test setup in the blog post.

Q: When adding a new node, can you remind us how much data is distributed across the nodes? E.g. surely it will take much longer to add a new node if there’s 100TB on each node compared with 1TB on each node…

Karol: In the 3-node test, each node has 1TB of data — 3TB data total in the cluster. In the 4 vs. 40 node test, the dataset was 40 TB. So, for Scylla it was 10TB of data for each node, and for Cassandra it was 1TB per node.

Q: What actually happens when you add a new node or double the cluster size? Why does Scylla need to do a compaction when it adds new nodes?

Piotr: So let’s say you have a three node cluster and you add a fourth node. When a new node is added, the database redistributes the data. That’s when streaming happens. Streaming is essentially copying the data from one node to another node. In the case of adding a new node, the data is streamed from all the existing nodes to the new node.

Compaction may be running while adding a new node, but the main reason we mentioned it is because using the Leveled Compaction Strategy (LCS) was supposed to have a greater advantage for Cassandra 4.0, because it has Zero Copy Streaming, which is supposed to work better with LCS strategy. Yet this doesn’t kick in during the streaming, but when we first populated those nodes. We added 1TB to each node, and periodically the database would compact different SSTables, and the LCS tables are better for Zero Copy Streaming in Cassandra.

Q: Did you compare your replace node test with the test published by Cassandra (where they declared 5x times improvement) — why was the difference in results so large?

Piotr: The schema might be different. We also ran background load, which might have introduced some differences, and we wanted to test a real case scenario. So I am not sure that a 5x improvement is an average performance gain.

Q: What main reasons make Scylla’s performance so much better than Cassandra?

Piotr: The big differential between Cassandra 3 and 4 was the JVM algorithms that were used to garbage collect. Scylla is written in C++. We use our own framework, Seastar, which runs close to the metal. We tuned it to run really close to the storage devices and to the network devices, while Cassandra has to deal with the JVM, with the garbage collection mechanism. So the first part is the language difference. Java has to have a runtime. However, as you have seen, Java is getting better and better, especially with the new garbage collection mechanisms.

The second part is that we really have tuned the C++ to be as fast as possible using a shard-per-core architecture. For example, in our sharded architecture, each core is a separate process that doesn’t share that much data between all other cores. So if you have a CPU that has many cores, it might be possible that you have many NUMA nodes. And moving data between those NUMA nodes can be quite expensive. From the first days of Scylla we really optimized the database to not share data between shards. And that’s why we recommend shard-aware drivers.

[Though it must be observed that newer garbage collectors like ZGC are also now NUMA-aware.]

These are just a few of the questions that our engaged audience had. It’s worth listening to the whole webinar in full — especially that last half hour! And if you have any questions of your own, we welcome them either via contacting us privately, or joining our Slack community to ask our engineers and your community peers.

WATCH NOW: COMPARING CASSANDRA 4.0, 3.0 AND SCYLLADB

The post Your Questions about Cassandra 4.0 vs. Scylla 4.4 Answered appeared first on ScyllaDB.

Scylla University: New Spark and Kafka Lessons

Scylla University is our free online resource for you to learn and master NoSQL skills. We’re always adding new lessons and updating existing lessons to keep the content fresh and engaging.

We’re also expanding the content to cover data ecosystems, because we understand that your database doesn’t operate in a vacuum. To that end we recently published two new lessons on Scylla University: Using Spark with Scylla and Kafka and Scylla.

Using Spark with Scylla

Whether you use on-premises hardware or cloud-based infrastructure, Scylla is a solution that offers high performance, scalability, and durability to your data. With Scylla, data is stored in a row-and-column, table-like format that is efficient for transactional workloads. In many cases, we see Scylla used for OLTP workloads.

But what about analytics workloads? Many users these days have standardized on Apache Spark. It accepts everything from columnar-format files like Apache Parquet to row-based Apache Avro. It can also be integrated with transactional databases like Scylla.

By using Spark together with Scylla, users can deploy analytics workloads on the information stored in the transactional system.
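
To make this concrete, here is a minimal PySpark sketch that reads a Scylla table into a DataFrame using the Spark Cassandra Connector (Scylla speaks CQL, so the connector works against it). The connector version, contact point, and keyspace/table names are placeholders to adapt to your environment.

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("scylla-analytics")
    # Placeholder connector coordinates; match the version to your Spark/Scala version.
    .config("spark.jars.packages",
            "com.datastax.spark:spark-cassandra-connector_2.12:3.1.0")
    .config("spark.cassandra.connection.host", "scylla-node-1.example.com")
    .getOrCreate()
)

orders = (
    spark.read.format("org.apache.spark.sql.cassandra")
    .options(keyspace="shop", table="orders")   # hypothetical keyspace and table
    .load()
)

# Run an analytics-style aggregation over the transactional data.
orders.groupBy("customer_id").count().show()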

The new Scylla University lesson “Using Spark with Scylla” covers:

  • An overview of Scylla, Spark, and how they can work together.
  • Scylla and Analytics workloads
  • Scylla token architecture, data distribution, hashing, and nodes
  • Spark intro: the driver program, RDDs, and data distribution
  • Considerations for writing and reading data using Spark and Scylla
  • What happens when writing data and what are the different configurable variables
  • How data is read from Scylla using Spark
  • How to decide if Spark should be collocated with Scylla
  • Best practices and considerations for configuring Spark to work with Scylla



Using Kafka with Scylla

This lesson provides an intro to Kafka and covers some basic concepts. Apache Kafka is an open-source distributed event streaming system. It allows you to:

  • Ingest data from a multitude of different systems, such as databases, your services, microservices or other software applications
  • Store them for future reads
  • Process and transform the incoming streams in real-time
  • Consume the stored data stream

Some common use cases for Kafka are:

  • Message broker (similar to RabbitMQ and others)
  • Serve as the “glue” between different services in your system
  • Provide replication of data between databases/services
  • Perform real-time analysis of data (e.g., for fraud detection)
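
As a tiny, concrete example of the produce/consume flow described above, here is a sketch using the kafka-python client (the broker address and topic name are placeholders; the University lessons use their own demos):

import json
from kafka import KafkaProducer, KafkaConsumer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",                     # placeholder broker
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("user-events", {"user_id": 1, "action": "click"})  # hypothetical topic
producer.flush()

consumer = KafkaConsumer(
    "user-events",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
for record in consumer:   # loops until interrupted
    print(record.value)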

The Scylla Sink Connector is a Kafka Connect connector that reads messages from a Kafka topic and inserts them into Scylla. It supports different data formats (Avro, JSON). It can scale across many Kafka Connect nodes. It has at-least-once semantics, and it periodically saves its current offset in Kafka.

The Scylla University lesson also provides a brief overview of Change Data Capture (CDC) and the Scylla CDC Source Connector. To learn more about CDC, check out this lesson.

The Scylla CDC Source Connector is a Kafka Connect connector that reads messages from a Scylla table (with Scylla CDC enabled) and writes them to a Kafka topic. It works seamlessly with standard Kafka converters (JSON, Avro). The connector can scale horizontally across many Kafka Connect nodes. Scylla CDC Source Connector has at-least-once semantics.

The lesson includes demos for quickly starting Kafka, using the Scylla Sink Connector, viewing changes on a table with CDC enabled, and downloading, installing, configuring, and using the Scylla CDC Source Connector.

To learn more about using Spark with Scylla and about Kafka and Scylla, check out the full lessons on Scylla University. These include quiz questions and hands-on labs.

Scylla University LIVE – Fall Event (November 9th and 10th)

Following the success of our previous Scylla University LIVE events, we’re hosting another event in November! We’ll conduct these informative live sessions in two different time zones to better support our global community of users. The November 9th training is scheduled for a time convenient in North and South America; November 10th will be the same sessions but better scheduled for users in Europe and Asia.

As a reminder, Scylla University LIVE is a FREE, half-day, instructor-led training event, with training sessions from our top engineers and architects. It will include sessions that cover the basics and how to get started with Scylla, as well as more advanced topics and new features. Following the sessions, we will host a roundtable discussion where you’ll have the opportunity to talk with Scylla experts and network with other users.

The event will be online and instructor-led. Participants that complete the LIVE training event will receive a certificate of completion.

REGISTER FOR SCYLLA UNIVERSITY LIVE

Next Steps

If you haven’t done so yet, register a user account in Scylla University and start learning. It’s free!

Join the #scylla-university channel on our community Slack for more training-related updates and discussions.

The post Scylla University: New Spark and Kafka Lessons appeared first on ScyllaDB.

Instaclustr Announces the General Availability of Managed Apache Cassandra 4.0

Instaclustr is pleased to announce the general availability of Cassandra 4.0 as part of our Managed Cassandra offering on the Instaclustr Managed Platform. Cassandra 4.0 delivers greater performance, unmatched stability, and new auditing capabilities, among other valuable features for the enterprise driven by the Apache Cassandra community.

Cassandra 4.0 beta releases have been available in preview on the Instaclustr Managed Platform since March of 2020. The general release of Cassandra 4.0 from the Apache Foundation has been exhaustively tested and is the most stable release of Cassandra to date. Instaclustr has already adopted Cassandra 4.0 for our internal critical infrastructure to realize its significant benefits for stability and performance. 

Cassandra 4.0 brings a wealth of improvements and new features including: 

As part of this release, we have also run our Certified Cassandra testing process. We are currently finalizing our Certified Cassandra test report for Cassandra 4.0 and plan to release this next week. This report provides detailed performance comparisons against previous versions of Cassandra as well as a summary of caveated features that we recommend treating with caution.

The Instaclustr Managed Platform makes it easy for organizations to deliver applications at scale by fully hosting and managing their open source data infrastructure. Our Managed Cassandra service gives customers using Cassandra 4.0, as well as supported earlier releases, the following benefits: 

  • Cluster provisioning, configuration, and monitoring using the Instaclustr Console or REST APIs
  • Support for hosting in the cloud on Amazon Web Services, Microsoft Azure, and Google Cloud Platform, or on-premises
  • Easy scaling
  • SOC 2 compliance
  • PCI-DSS compliant options
  • 99.99% enterprise SLAs 
  • Highly responsive 24×7 support from Instaclustr’s expert team (more details on support and support SLAs here)
  • Apache Spark 3.0 add-on
  • Apache Lucene add-on for full text search capabilities

If you’re a current customer, you can contact our Technical Operations or Customer Success teams for any questions you have about Cassandra 4.0. New customers can begin a free trial of Instaclustr’s Managed Cassandra here or contact the Instaclustr Sales team.

The post Instaclustr Announces the General Availability of Managed Apache Cassandra 4.0 appeared first on Instaclustr.

Changes to Incremental Repairs in Cassandra 4.0

What Are Incremental Repairs? 

Repairs allow Apache Cassandra users to fix inconsistencies in writes between different nodes in the same cluster. These inconsistencies can happen when one or more nodes fail. Because of Cassandra’s peer-to-peer architecture, Cassandra will continue to function (and return correct results within the promises of the consistency level used), but eventually, these inconsistencies will still need to be resolved. Not repairing, and therefore allowing data to remain inconsistent, creates significant risk of incorrect results arising when major operations such as node replacements occur. Cassandra repairs compare data sets and synchronize data between nodes. 

Cassandra has a peer-to-peer architecture. Each node is connected to all the other nodes in the cluster, so there is no single point of failure in the system should one node fail. 

As described in the Cassandra documentation on repairs, full repairs look at all of the data being repaired in the token range (in Cassandra, partition keys are converted to a token value using a hash function). Incremental repairs, on the other hand, look only at the data that’s been written since the last incremental repair. By using incremental repairs on a regular basis, Cassandra operators can reduce the time it takes to complete repairs. 

A History of Incremental Repairs

Incremental repairs have been a feature of Apache Cassandra since the release of Cassandra 2.2. At the time 2.2 was released, incremental repairs were made the default repair mechanism for Cassandra. 

But by the time of the release of Cassandra 3.0 and 3.11, incremental repairs were no longer recommended to the user community due to various bugs and inconsistencies. Fortunately, Cassandra 4.0 has changed this by resolving many of these bugs.

Resolving Cassandra Incremental Repair Issues

One such bug was related to how Cassandra marks which SSTables have been repaired (ultimately this bug was addressed in Cassandra-9143). This bug would result in overstreaming and essentially plug communication channels in Cassandra and slow down the entire system.

Another fix was in Cassandra-10446. This allows for forced repairs even when some replicas are down. It is also now possible to run incremental repairs when nodes are down (this was addressed in Cassandra-13818).

Other changes came with Cassandra-14939, including:

  • The user can see if pending repair data exists for a specific token range 
  • The user can force the promotion or demotion of data for completed sessions rather than waiting for compaction 
  • The user can get the most recent repairedAT timestamp for a specific token range

Incremental repairs are the default repair option in Cassandra. To run incremental repairs, use the following command:

nodetool repair

If you want to run a full repair instead, use the following command:

nodetool repair --full

Similar to Cassandra 4.0 diagnostic repairs, incremental repairs are intended primarily for nodes in a self-support Cassandra cluster. If you are an Instaclustr Managed Cassandra customer, repairs are included as a part of your deployment, which means you don’t need to worry about day-to-day repair tasks like this. 

In Summary

While these improvements should greatly increase the reliability of incremental repairs, we recommend a cautious approach to enabling in production, particularly if you have been running subrange repair on an existing cluster. Repairs are a complex operation and the impact of different approaches can depend significantly on the state of your cluster when you start the operation and even the data model that you are using. To learn more about Cassandra 4.0, contact our Cassandra experts for a free consultation or sign up for a free trial of our Managed Cassandra service. 

The post Changes to Incremental Repairs in Cassandra 4.0 appeared first on Instaclustr.

Full Query Logging With Apache Cassandra 4.0

The release of Apache Cassandra 4.0 comes with a bunch of new features to improve stability and performance. It also comes with a number of valuable new tools for operators to get more out of their Cassandra deployment. In this blog post, we’ll take a brief look at full query logging, one of the new features that comes with the Cassandra 4.0 release.

What Are Full Query Logs?

First off, we need to understand what counts as a full query log (FQL) in Cassandra. Full query logs record all successful Cassandra Query Language (CQL) requests. Audit logs (also a new feature of Cassandra 4.0), on the other hand, contain both successful and unsuccessful CQL requests. (To learn about the different forms of logging and diagnostic events in Cassandra 4.0, check out this blog by Instaclustr Co-Founder and CTO Ben Bromhead.)

The FQL framework was implemented to be lightweight from the very beginning, so there is no need to worry about its performance impact. This is achieved by a library called Chronicle Queue, which is designed for low-latency, high-performance messaging for critical applications.

Use Cases for Full Query Logs

There are a number of exciting use cases for full query logs in Cassandra. Full query logs allow you to log, replay, and compare CQL requests live without affecting the performance of your production environment. This allows you to:

  • Examine traffic to individual nodes to help with debugging if you notice a performance issue in your environment 
  • Compare performance between two different versions of Cassandra in different environments
  • Compare performance between nodes with different settings to help with optimizing your cluster
  • Audit logs for security or compliance purposes

Configuring Full Query Logs in Cassandra

The settings for full query logs are adjustable either in the Cassandra configuration file, cassandra.yaml, or with nodetool.

See the following example from the Cassandra documentation for one way to approach configuration settings:

# default options for full query logging - these can be overridden from command line
# when executing nodetool enablefullquerylog
#full_query_logging_options:
   # log_dir:
   # roll_cycle: HOURLY
   # block: true
   # max_queue_weight: 268435456 # 256 MiB
   # max_log_size: 17179869184 # 16 GiB
   # archive command is "/path/to/script.sh %path" where %path is replaced with the file being rolled:
   # archive_command:
   # max_archive_retries: 10

In this case, you would just need to add an existing directory that has permissions for reading, writing, and execution to log_dir. The log segments here are rolled hourly, but they can also be set to roll daily or minutely.

The max_queue_weight, which sets the maximum weight for in-memory queue records waiting to be written prior to blocking or dropping, is set here to 268435456 bytes (equivalent to 256 MiB). This is also the default value.

And the max_log_size option, which sets the maximum size of rolled files that can be retained on disk before the oldest file is deleted, is set here to 17179869184 bytes (equivalent to 16 GiB).

After configuring the settings for your full query logs in cassandra.yaml, you can execute this command using nodetool to enable full query logging:

$ nodetool enablefullquerylog --path /tmp/cassandrafullquerylog

You must do this on a node-by-node basis for each node where you want to have full Cassandra query logs. 

If you prefer to, you can also set the configuration details for the full query logs within the syntax of the nodetool enablefullquerylog. To learn how to do so, check out the Cassandra documentation. 

With the great power of the FQL framework to log all your queries, you might wonder what happens when you log a statement that contains some sensitive information in it. For example, if an operator creates a new role and specifies a password for the role, will the password be visible in the log? This seems like it would be a sensitive security issue. 

The answer is that, no, there will not be any passwords visible. The Cassandra implementation is quite aggressive when it comes to obfuscating queries that contain passwords: when it finds a password in a statement, it obfuscates the remaining part of the statement. It will also obfuscate passwords when a query containing them is not successful, which might happen when an operator makes a mistake in the CQL query syntax.

From an observability point of view, you can check the status of FQL with the nodetool getfullquerylog subcommand and disable it at runtime with the disablefullquerylog subcommand. You can achieve the same by calling the respective JMX methods on the StorageService MBean.
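For example:

$ nodetool getfullquerylog       # show whether FQL is enabled and its current options
$ nodetool disablefullquerylog   # stop full query logging at runtime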

How to View Full Query Logs

Now that you have your full Cassandra query logs configured and enabled on your chosen nodes, you will probably want to view them at some point. That’s where the fqltool command comes in. fqltool dump converts the binary logs into a human-readable format, fqltool replay replays the logged queries against a cluster, and fqltool compare reports any differences between the results of replayed runs.

Together, fqltool replay and fqltool compare let you revisit different sets of production traffic to help you analyze performance between different configurations or different nodes, or to help with debugging issues.
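As a minimal sketch (the log path matches the earlier nodetool example; replay and compare take additional options for target hosts and result directories, so check the fqltool help output for the exact flags in your version):

# Convert the binary Chronicle Queue segments into human-readable output
$ fqltool dump /tmp/cassandrafullquerylog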

Conclusion

With enhancements for stability and performance, along with cool new features like live full query logging, Cassandra 4.0 is a big step forward for the Cassandra community. To start using Cassandra 4.0, sign up for your free trial of Instaclustr Managed Cassandra and select the preview release of Cassandra 4.0 as your software version. 

Guide to Apache Cassandra Data and Disaster Recovery

Apache Cassandra is a distributed database management system where data is replicated amongst multiple nodes and can span across multiple data centers. Cassandra can run without service interruption even when one or more nodes are down. It is because of this architecture that Cassandra is highly available and highly fault tolerant. 

However, there are a number of common scenarios we encounter where nodes or data needs to be recovered in a Cassandra cluster, including:

  • Cloud instance retirement
  • Availability zone outages
  • Errors made from client applications and accidental deletions
  • Corrupted data/SStables
  • Disk failures
  • Catastrophic failures that require an entire cluster rebuild

Depending on the nature of the issue, the options available for recovery include the following:

Single Dead Node

Causes for this scenario include cloud instance retirement or physical server failure. 

For Instaclustr Managed Cassandra clusters, this is detected and handled automatically by our Support team. Outage or data loss is highly unlikely assuming Instaclustr’s recommended replication factor (RF) of three is being used. The node will simply be replaced and data will be streamed from other replicas in the cluster. Recovery time is dependent on the amount of data.

A Node That Was Unavailable for a Short Period of Time

This could be due to rolling restart, cloud availability zone outage, or a number of other general errors.

For this scenario:

  • It is handled automatically by Cassandra
  • Recovery time is dependent on downtime in this case
  • There is no potential for data loss and no setup required

If a node in Cassandra is unavailable for a short period of time, the data that would have been replicated to it is stored on a peer node. This data is called hints. Once the original node becomes available, the hints are transferred to it, and the node is caught up with the missed data. This process is known as hinted handoff within Cassandra.

There are time and storage restrictions for hints. If a node is unavailable for longer than the configured duration (i.e. the hinted handoff window), no further hints are saved for it. In such a case, Instaclustr will make sure repairs are run following best practices to ensure data remains consistent.
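The window is set per node in cassandra.yaml, as in this sketch (the file path assumes a package install; the default max_hint_window_in_ms is 10800000 ms, i.e. three hours):

# Check the hinted handoff settings configured on a node
$ grep -E "hinted_handoff_enabled|max_hint_window_in_ms" /etc/cassandra/cassandra.yaml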

Restore Keyspace, Corrupted Tables, Column Family, Accidentally Deleted Rows, Transaction Logs, and Archived Data

In this case, if you are a Managed Cassandra customer:

  • Email Instaclustr
  • Recovery time dependent on amount of data
  • Anything after the last backup will be lost
  • Setup involves backups, which is handled by Instaclustr
  • Causes include general error, human error, testing, and even malware or ransomware

Note:

  • Corrupted Tables/Column Family: This assumes corruption is at the global cluster level and not at the individual node level
  • Accidentally deleted rows: This is not as straightforward as you would potentially need to restore elsewhere (either to another table/keyspace or cluster) and read from there. Instaclustr can help with this.
  • Transaction logs: Not a common restore method
  • Archived data: Backups only go back 7 days, so another method to handle this must be used

Rebuild an Entire Cluster With the Same Settings

You may need to rebuild a cluster from backup due to accidental deletion or any other total data loss (e.g. human error). Another common reason for using restore-from-backup is to create a clone of a cluster for testing purposes. 

All Instaclustr-managed clusters have backups enabled, and we provide a self-service restore feature. Complex restore scenarios can be achieved by contacting our 24×7 support team.

When restoring a cluster, backups can be used to restore data from a point-in-time to a new cluster via the Instaclustr Console or Provisioning API. See the following from Instaclustr’s Cassandra Documentation:

“When restoring a cluster, you may choose to:

  • Restore all data or only selected tables
    • If restoring a subset of tables, the cluster schema will still be restored in its entirety and therefore the schemas for any non-restored keyspaces and tables will still be applied. However, only backup data for the selected tables will be restored.
  • Restore to a point-in-time
    • The restore process will use the latest backup data for each node, up to the specified point-in-time. If restoring a cluster with Continuous Backup enabled, then commit logs will also be restored.

Once the Restore parameters are submitted, a new cluster will be provisioned under the account with a topology that matches the designated point-in-time.”

For clusters with basic data backups enabled (once per node per 24 hour period), anything written after the last backup could potentially be lost. If the “continuous backup” feature is enabled, then this recovery point is reduced to five minutes.

Recovery time is dependent on the amount of data to be restored. The following figures detail approximate restore download speeds for one of the instance types offered in the Instaclustr managed service:

  • Instance: i3.2xlarge
  • Download speed*: 312 MB/s
  • Commit log replay speed: 10 GB (320 commit logs) in 12 seconds; 20 GB (640 commit logs) in 28 seconds; 30 GB (960 commit logs) in 1 minute 7 seconds
  • Startup speed (no commit logs): 82 seconds per node

*The download speed is the approximate rate that data is transferred from cloud provider storage service to the instance. It may vary depending on region, time of day, etc.

Application Effect

This really is a question about quorum, consistency level, and fault tolerance. Assuming a replication factor of three and LOCAL_QUORUM consistency is being used (as per Instaclustr’s Cassandra best practices), your application will not see any outage.

Instaclustr Managed Cassandra Backup and Restore Options

The following are some of the different Cassandra backup and restore options available for Instaclustr’s customers.

Snapshot Backup

By default, each node takes a full backup once per 24 hours and uploads the data to an S3 bucket in the same region. Backup times are staggered at random hours throughout the day. The purpose of these backups is to recover the cluster data in case of data corruption or developer error (the most common example being accidentally deleting data). Backups are kept for seven days (or according to the lifecycle policy set on the bucket). For clusters that are hosted in Instaclustr’s AWS account, we do not replicate data across regions.

Continuous Backup

Continuous Backup can optionally be enabled to perform backups more frequently. Enabling Continuous Backup for a cluster will increase the frequency of snapshot backups to once every three hours, and additionally, enable commit log backups once every five minutes. This option provides a reduced window of potential data loss (i.e. a lower RPO). This feature can be enabled on any cluster as part of the enterprise add-ons package, for an additional 20% of cluster monthly price, or as per your contract.

Restore (Same Region)

Cluster backups may be used to restore data from a point-in-time to a new cluster, via the Console or using the Provisioning API. Data will be restored to a new cluster in the same region. This feature is fully automated and self-serve. 

Customized network configurations created outside the Instaclustr Console are not automatically restored, including peering.

For more information, read our documentation on Cassandra cluster restore operations. 

Next up, we’ll review some common failure scenarios and how they are resolved in greater detail.

Scenario 1: AZ Failure

For maximum fault tolerance, a Cassandra cluster should be architected using three (3) racks, which are each mapped to AWS Availability Zones (AZs). This configuration allows the loss of one AZ (rack) and QUORUM queries will still be successful.
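As a minimal sketch (keyspace and data center names are placeholders; the data center name must match what nodetool status reports), the replication factor of three is set per data center at the keyspace level, and NetworkTopologyStrategy spreads those replicas across the racks:

# RF = 3 in one data center; with three racks this places one replica per rack
$ cqlsh -e "CREATE KEYSPACE my_keyspace
    WITH replication = {'class': 'NetworkTopologyStrategy', 'AWS_VPC_US_EAST_1': 3};"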

Testing AZ Failure 

Instaclustr Technical Operations can assist with testing a disaster recovery (DR) situation by simulating the loss of one or more AZ. This is achieved by stopping the service (Cassandra) on all nodes in one rack (AZ) to simulate a DR situation. 

Below are two tests to consider:

First test: Stop Cassandra on all nodes in one rack

  • Rack = AZ
  • Simulates case where some or all nodes become unreachable in an AZ, but not necessarily failed
  • Measures impact to the application (we assume you’ll run some traffic through during the simulated outage)

Second test: Terminate all instances in one rack

  • Simulate a complete AZ failure
  • Instaclustr will replace each instance until the cluster is recovered
  • Measures recovery time

Our process in the case of an instance failure is:

  • Alerts would be triggered to PagerDuty on detecting nodes down
  • The rostered engineer would investigate by:
    • Attempting to SSH to the instance
    • Checking the AWS instance status via the CLI
  • The instance would be replaced if it is unreachable (using restore from backup)
  • The customer would be notified of the actions taken

Scenario 2: Region Failure

Failover DC in Another Region

In this option, another data center (DC) is added to the cluster in a secondary region. All data, and new writes, are continuously replicated to the additional DC. Reads are serviced by the primary DC. This is essentially a “hot standby” configuration.
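A minimal sketch of the keyspace-level change (keyspace and data center names are placeholders, and the secondary data center must already have been added to the cluster):

# Replicate the keyspace to the new secondary data center as well
$ cqlsh -e "ALTER KEYSPACE my_keyspace
    WITH replication = {'class': 'NetworkTopologyStrategy', 'US_EAST_1': 3, 'US_WEST_2': 3};"

# Then stream the existing data to each node in the new data center
# (run on the US_WEST_2 nodes, naming the source data center)
$ nodetool rebuild -- US_EAST_1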

Advantages:

  • Full cross-region redundancy
  • Very low Recovery Point Objective (RPO) and Recovery Time Objective (RTO). 
    1. RPO: Frequency of backups
    2. RTO: Amount of tolerable downtime for business
  • Simple to set up
  • Guaranteed AWS capacity in secondary region should the primary region become unavailable
  • Each DC is backed up to S3 in the same region
  • Protects against data corruption (at SSTable level)

Disadvantages

  • Does not protect against accidental deletes or malicious queries (all mutations are replicated to the secondary DC)

Cost

  • There is a potential to run the secondary DC with fewer or smaller instance types. At the very minimum it will need to service the full write load of the primary DC, as well as having sufficient storage for the data set. 
  • Cost includes additional AWS infrastructure cost, as well as Instaclustr management fees

Testing Region Failure 

Similar to simulating an AZ failure, a region outage can be simulated by:

First test: Stop Cassandra on all nodes in primary region

  • Quickest simulation, allows you to test how your application will respond (we assume you’ll run some traffic through during the simulated outage).

Second test: Terminate all instances in one region

  • Simulate a complete region failure (particularly where ephemeral instances are used)
  • Allows you to test how your application will respond (we assume you’ll run some traffic through during the simulated outage).
  • Measures recovery time for primary region after it becomes available again
    • All instances replaced or rebuilt from secondary DC
    • A repair would (likely) be required to catch up on writes missed during the outage.  

Scenario 3: Accidental/Malicious Delete, Data Corruption, etc.

For these scenarios, typically a partial data restore is required.  The recovery method, as well as the amount of data that can be recovered, will invariably depend on the nature of the data loss. You should contact support@instaclustr.com as soon as possible for expert advice. 

We have extensive experience in complex data recovery scenarios.

Case Study: Accidentally Dropped Table

Problem: Human error caused a production table to be accidentally dropped. 

Solution: Instaclustr TechOps was able to quickly reinstate the table data from the snapshots stored on disk. 

Recovery time: Less than one hour. 

Case Study: Data Corruption Caused by Cluster Overload

Problem: Unthrottled data loads caused SSTable corruption on disk and cluster outage. 

Solution: Combination of online and offline SSTable scrubs. Some data was also able to be recovered from recent backups.  A standalone instance was used to complete some scrubs and data recovery, in order to parallelise the work and reduce impact to live prod clusters.

Recovery time: Approximately 90% of the data was available within the same business day. Several weeks to recover the remainder. A small percentage of data was not able to be recovered.

Conclusion

In this post, we discussed common failure scenarios within Apache Cassandra, their respective causes and recovery solutions, as well as other nuances involved. Instaclustr has invested the time and effort to gain industry-leading, expert knowledge on the best practices for Cassandra. We are the open source experts when it comes to architecting the right solution to protecting and managing your data within Cassandra. 

Have more questions on Cassandra disaster recovery and other best practices? Reach out to schedule a consultation with one of our experts. Or, sign up for a free trial to get started with Cassandra right away.

Understanding Apache Cassandra Memory Usage

This article describes Apache Cassandra components that contribute to memory usage and provides some basic advice on tuning.

Cassandra memory usage is split into JVM heap and offheap.

Heap is managed by the JVM’s garbage collector.

Offheap is manually managed memory, which is used for:

  • Bloom filters: Used to quickly test if a SSTable contains a partition
  • Index summary: A search lookup of index positions
  • Compression metadata
  • Key cache (key_cache_size_in_mb): Used to store SSTable positions for a given partition key (so you can skip Index summary and Index scanning for data position)
  • File cache (file_cache_size_in_mb): 32 MB of this is reserved for pooling buffers; the remainder serves as a cache holding uncompressed SSTable chunks
  • Row cache (row_cache_size_in_mb)
  • Counter cache (counter_cache_size_in_mb)
  • Offheap memtables (memtable_allocation_type)
  • Direct ByteBuffer (ByteBuffer.allocateDirect)
  • Memory mapped files (default disk_access_mode is mmap): Reading Data.db and Index.db files will use memory mapped files, which is where the operating system loads parts of the file into memory pages

Bloom Filters

Bloom filters are a data structure that lets you determine whether a specific element is present in a set. In Cassandra, a bloom filter tells you one of two things about a given partition:

1. It definitely does not exist in the given file, or:

2. It probably does exist in the file

If you want to make your bloom filters more accurate, configure them to consume more RAM. You can adjust this behavior for your bloom filters by changing bloom_filter_fp_chance to a float between 0 and 1. This parameter defaults to 0.1 for tables using LeveledCompactionStrategy, and 0.01 otherwise.

Bloom filters are stored offheap in RAM. As the bloom_filter_fp_chance gets closer to 0, memory usage increases, but does not increase in a linear fashion. 

Values for bloom_filter_fp_chance are usually set between 0.01 (a 1% false positive chance) and 0.1 (a 10% chance).

You should adjust the parameter for bloom_filter_fp_chance depending on your use case. If you need to avoid excess IO operations, you should set  bloom_filter_fp_chance to a low number like 0.01. If you want to save RAM and care less about IO operations, you can use a higher  bloom_filter_fp_chance number. If you rarely read, or read by performing range slices over all the data (like Apache Spark does when scanning a whole table), an even higher number may be optimal. 
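For example, the per-table value can be changed with an ALTER TABLE statement, as in this sketch (keyspace and table names are placeholders; existing SSTables only pick up the new value when they are rewritten, e.g. during compaction):

# Lower the false positive chance (uses more RAM) for one table
$ cqlsh -e "ALTER TABLE my_keyspace.my_table WITH bloom_filter_fp_chance = 0.01;"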

Index Summary 

Cassandra stores the offsets and index entries in offheap.

Compression Metadata

Cassandra stores the compression chunk offsets in offheap.

Key Cache

The key cache saves Cassandra from having to seek for the position of a partition. It saves a good deal of time relative to how little memory it uses, so it is worth using in almost all cases. The global limit for the key cache is controlled in cassandra.yaml by setting key_cache_size_in_mb.

There is also a per-table setting defined in the schema, in the property caching under keys, with the default set to ALL.

Row Cache

Compared to the key cache, the row cache saves more time but takes up more space. So the row cache should only be used for static rows or hot rows. The global limit for row cache is controlled in cassandra.yaml by setting row_cache_size_in_mb.

There is also a per-table setting defined in the schema, in the property caching under key rows_per_partition, with the default set to NONE.
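Both per-table cache settings live in the same caching property and can be changed with an ALTER TABLE statement, as in this sketch (placeholder keyspace and table names; the row cache additionally requires row_cache_size_in_mb to be greater than zero in cassandra.yaml):

# Cache all partition keys and the first 100 rows of each partition
$ cqlsh -e "ALTER TABLE my_keyspace.my_table
    WITH caching = {'keys': 'ALL', 'rows_per_partition': '100'};"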

Counter Cache

The counter cache helps cut down on lock contention for hot counter cells. Only the local (clock, count) tuple of a counter cell is cached, not the whole counter, so it is relatively cheap. You can adjust the global limit for the counter cache in cassandra.yaml by setting counter_cache_size_in_mb.

Offheap Memtables

Since Cassandra 2.1, offheap memory can be used for memtables.

This is set via memtable_allocation_type in cassandra.yaml. If you want to have the least impact on reads, use offheap_buffers to move the cell name and value to DirectBuffer objects. 

With offheap_objects you can move the cell offheap. Then you only have a pointer to the offheap data. 

Direct ByteBuffer

There are a few miscellaneous places where Cassandra allocates offheap, such as HintsBuffer, and certain compressors such as LZ4 will also use offheap when file cache is exhausted.

Memory Mapped Files

By default, Cassandra uses memory mapped files. If the operating system is unable to allocate memory to map the file to, you will see a message such as:

Native memory allocation (mmap) failed to map 12288 bytes for committing reserved memory.

If this occurs, you will need to reduce offheap usage, resize to hardware with more available memory, or enable swap.

Maximum Memory Usage Reached

This is a common log message about memory that often causes concern:

INFO o.a.c.utils.memory.BufferPool Maximum memory usage reached (536870912), cannot allocate chunk of 1048576.

Cassandra has a cache that is used to store decompressed SSTable chunks in offheap memory. In effect, this cache performs a similar job to the OS page cache, except the data doesn’t need to be decompressed every time it is fetched.

This log message just means that the cache is full. When this is full, Cassandra will allocate a ByteBuffer outside the cache, which can be a degradation of performance (since it has to allocate memory). This is why this message is only at INFO level and not WARN.

The default chunk cache size is 512 MB. It can be modified by altering file_cache_size_in_mb in cassandra.yaml.

Examining Memory Usage

Running nodetool info will provide heap and offheap memory usage.
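For example (the exact labels in the output may vary slightly between versions):

# Show the heap and offheap usage reported by Cassandra itself
$ nodetool info | grep -i "heap memory"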

However, comparing this with actual memory usage will usually show a discrepancy. 

For example: with Cassandra running in Docker, Cassandra was using 16.8 GB according to docker stats, while nodetool info reported 8 GB heap and 4 GB offheap. This leaves 4.8 GB unaccounted for. Where does the memory go?

This happens because offheap usage reported by nodetool info only includes:

  • Memtable offheap
  • Bloom filter offheap
  • Index summary offheap
  • Compression metadata offheap

Other sources of offheap usage are not included, such as file cache, key cache, and other direct offheap allocations.

To get started with Apache Cassandra, sign up for a free trial of Instaclustr Managed Cassandra today. Or, connect with one of our experts to get advice on optimizing Apache Cassandra memory usage for your unique environment.

Change Data Capture (CDC) With Kafka Connect and the Debezium Cassandra Connector (Part 2)

In Part 1 of this two-part blog series we discovered that Debezium is not a new metallic element (but the name was inspired by the periodic table), but a distributed open source CDC technology, and that CDC (in this context) stands for Change Data Capture (not Centers for Disease Control or the Control Data Corporation). 

We introduced the Debezium architecture and its use of Kafka Connect and explored how the Debezium Cassandra Connector (on the source side of the CDC pipeline) emits change events to Kafka for different database operations. 

In the second part of this blog series, we examine how Kafka sink connectors can use the change data, discover that Debezium also propagates database schema changes (in different ways), and summarize our experiences with the Debezium Cassandra Connector used for customer deployment. 

1. How Does a Kafka Connect Sink Connector Process the Debezium Change Data?

At the end of Part 1, we had a stream of Cassandra change data available in a Kafka topic, so the next question is how is this processed and applied to a target sink system? From reading the Debezium documentation, the most common deployment is by means of Kafka Connect, using Kafka Connect sink connectors to propagate the changes into sink systems:

But Debezium only comes with source connectors, not sink connectors; the sink connectors are not directly part of the Debezium ecosystem. Where do you get sink connectors that can work with the Debezium change data from?

Based on my previous experiences with Kafka Connect Connectors, I wondered which open source sink connectors are possibly suitable for this task, given that:

  • The Debezium change data format is a complex JSON structure
  • It isn’t based on a standard
  • Each database produces a different format/content, and
  • The actions required on the sink system side will depend on:
    • The sink system,
    • The operation type, and 
    • The data values 

What’s perhaps not so obvious from the example so far is that Debezium also detects and propagates schema changes. In the above example, the schema is just sent explicitly in each message as the Kafka key, which can change. For example, if columns are added to the database table that is being monitored, then the schema is updated to include the new columns. So the sink connector will also have to be able to interpret and act on schema changes. 

The Debezium documentation does address at least one part of this puzzle (the complexity of the JSON change data structure), by suggesting that:

“… you might need to configure Debezium’s new record state extraction transformation. This Kafka Connect SMT (Single Message Transform) propagates the after structure from Debezium’s change event to the sink connector. This is in place of the verbose change event record that is propagated by default.”

Another relevant feature (in incubation) is Debezium Event Deserialization:

“Debezium generates data change events in the form of a complex message structure. This message is later on serialized by the configured Kafka Connect converter and it is the responsibility of the consumer to deserialize it into a logical message. For this purpose, Kafka uses the so-called SerDes. Debezium provides SerDes (io.debezium.serde.DebeziumSerdes) to simplify the deserialization for the consumer either being it Kafka Streams pipeline or plain Kafka consumer. The JSON SerDe deserializes JSON encoded change events and transforms it into a Java class.”

I had a look at Kafka Streams Serdes in a previous blog, so the Debezium Deserializer looks useful for Kafka Streams processing or custom Kafka consumers, but not so much for off-the-shelf Kafka sink connectors.

2. Single Message Transforms

Let’s have a look at a simple SMT example from this blog.

As I discovered in my last blog, Kafka Connect JDBC Sink Connectors make different assumptions about the format of the Kafka messages. Some assume a flat record structure containing just the new names/values, others require a full JSON Schema and Payload, others allow custom JSON sub-structure extraction, and others (e.g. the one I ended up using after customizing the IBM sink connector) allow for arbitrary JSON objects to be passed through (e.g. for inserting into a PostgreSQL JSONB data type column).

These differences are due to the fact that they were designed for use in conjunction with different source systems and source connectors. All of the connectors I looked at only allowed for insert or upsert (insert if the row doesn’t exist, else update) operations, but not deletes. Some customization would therefore be required to cope with the full range of operations emitted by Debezium source connectors. 

For a simple use case where you only want to insert/upsert new records, Debezium provides a bridge between the complex change message format and simpler formats expected by JDBC sink connectors, in the form of a “UnwrapFromEnvelope” single message transform. This can be used in either the source or sink connector side by adding these two lines to the connector configuration:

"transforms": "unwrap",

transforms.unwrap.type": "io.debezium.transforms.UnwrapFromEnvelope"

Note that by default this SMT will filter out delete events, but there is an alternative, possibly more recent SMT called io.debezium.transforms.ExtractNewRecordState, which also allows optional metadata to be included.
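To make this concrete, here is a sketch of registering a sink connector with the unwrap SMT through the Kafka Connect REST API. The connector name, class, topic, and connection settings are placeholders for whichever JDBC sink connector you choose; drop.tombstones and delete.handling.mode are options of the ExtractNewRecordState SMT:

# Register a sink connector (placeholder name/class/settings) with the unwrap SMT
$ curl -s -X POST http://localhost:8083/connectors \
    -H "Content-Type: application/json" \
    -d '{
      "name": "analytics-jdbc-sink",
      "config": {
        "connector.class": "<your JDBC sink connector class>",
        "topics": "analytics_topic",
        "connection.url": "jdbc:postgresql://dbhost:5432/analytics",
        "transforms": "unwrap",
        "transforms.unwrap.type": "io.debezium.transforms.ExtractNewRecordState",
        "transforms.unwrap.drop.tombstones": "false",
        "transforms.unwrap.delete.handling.mode": "rewrite"
      }
    }'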

3. Data Type Mappings

The Debezium Cassandra connector represents changes with events that are structured like the table in which the row exists, and each event contains field values and Cassandra data types. However, this implies that we need a mapping between the Cassandra data types and the sink system data types. There is a table in the documentation that describes the mapping from Cassandra data types to literal types (schema type) and semantic types (schema name), but currently not to “logical types”, Java data types, or potential sink system data types (e.g. to PostgreSQL, Elasticsearch, etc.). Some extra research is likely to be needed to determine the complete end-to-end Cassandra to sink system data type mappings. 

4. Duplicates and Ordering

The modern periodic table was invented by Mendeleev in 1869 to organize the elements, ordered by atomic weight. Mendeleev’s table predicted many undiscovered elements, including Germanium (featured in Part 1), which he predicted had an atomic weight of 70, but which wasn’t discovered until 1886.

The Debezium Cassandra Connector documentation highlights some other limitations of the current implementation. In particular, both duplicate and out-of-order events are possible. This implies that the Kafka sink connector will need to understand and handle both of these issues, and the behavior will also depend on the ability of the sink system to cope with them.

If the target system is sensitive to either of these issues, then it’s possible to use Kafka streams to deduplicate events:

and re-order (or wait for out-of-order) events (within a window):

Germanium arrived “out of order” but luckily Mendeleev had left a gap for it, so he didn’t have to redo his table from scratch.

The missing element, Germanium (actual atomic weight 72.6), a shiny metalloid.
(Source: Shutterstock)

5. Scalability Mismatch

Another thing you will need to watch out for is a potential mismatch of scalability between the source and sink systems. Cassandra is typically used for high-throughput write workloads and is linearly scalable with more nodes added to the cluster. Kafka is also fast and scalable and is well suited for delivering massive amounts of events from multiple sources to multiple sinks with low latency. It can also act as a buffer to absorb unexpected load spikes.

However, as I discovered in my previous pipeline experiments, you have to monitor, tune, and scale the Kafka Connect pipeline to keep the events flowing smoothly end-to-end. Even then, it’s relatively easy to overload the sink systems and end up with a backlog of events and increasing lag.

So if you are streaming Debezium change events from a high-throughput Cassandra source database you may have to tune and scale the target systems, optimize the performance of the Kafka Connect sink connectors and number of connector tasks running, only capture change events for a subset of the Cassandra tables and event types (you can filter out events that you are not interested in), or even use Kafka Streams processing to emit only significant business events, in order for the sink systems to keep up! And as we discovered (see below), the Debezium Cassandra Connector is actually on the sink-side of Cassandra as well.

(Source: Shutterstock)

Germanium transistors ran “hot” and even with a heat sink couldn’t match the speed of the new silicon transistors used in the CDC 6600. But germanium has made a comeback: “Germanium Can Take Transistors Where Silicon Can’t”.

6. Our Experiences With the Debezium Cassandra Connector

As mentioned in Part 1, the Debezium Cassandra Connector is “special” — Cassandra is different from the other Debezium connectors since it is not implemented on top of the Kafka Connect framework. Instead, it is a single JVM process that is intended to reside on each Cassandra node and publish events to a Kafka cluster via a Kafka producer.

The Debezium Cassandra Connector has multiple components as follows:

  • Schema processor: handles updates to schemas (which had a bug, see below)
  • Commit log processor: reads and queues the commit logs (which had a throughput mismatch issue)
  • Snapshot processor: handles the initial snapshot on start up (potentially producing lots of data)
  • Queue processor: a Kafka producer that emits the change data to Kafka

We found and fixed some bugs (e.g. schema changes not being detected correctly or fast enough, see Report 1 and Report 2), and configured and optimized it for the customer use case. 

The Debezium Cassandra connector eventually achieved good performance, but it did need some effort to fix and configure it to work consistently and fast (i.e. changes consistently detected in under 1s). Some of these were already flagged in the Debezium documentation (“How the Cassandra Connector Works” and “When Things Go Wrong”), and relate to what happens when a Connector first starts up on a table producing an initial “snapshot” (potentially too much data), and the limitations of Cassandra commit logs (e.g. commit logs are per node, not per cluster; there is a delay between event logging and event capture, and they don’t record schema changes). 

We also had to ensure that the scalability mismatch flagged above between Cassandra and downstream systems wasn’t also an issue on the source side (the Debezium Cassandra Connector side). If the Debezium connector can’t process the Cassandra commit logs in a timely manner, then the Cassandra CDC directory fills up, and Cassandra will start rejecting writes, which is not a desirable outcome. To reduce the risk of the directory filling up under load, we changed how frequently the offsets were committed (at the cost of getting more duplicates if Debezium goes down on a node and has to reprocess a commit log) so that the Debezium Cassandra Connector could achieve better throughput and keep up. Just to be on the safe side, we also set up an alert to notify our technical operations team when the directory is getting too full.

Finally, in contrast to the examples in Part 1, which use the default Debezium human-readable JSON encoding for the schema and values, we used the more efficient binary encoding using Apache Avro. This means that you also have to have a Kafka Schema Registry running (which we provide as a Kafka Managed Service add-on). The Debezium Cassandra Connector encodes both the schema and the change data as Avro, but only puts a very short identifier to the schema in the Kafka key rather than the entire explicit schema, and sends schema changes to the Schema Registry. The Kafka Connect sink connector has to decode the Kafka record key and value from Avro, detect any changes in the schema, and get the new schema from the registry service. This reduces both the message size and Kafka cluster storage requirements significantly and is the recommended approach for high-throughput production environments.

There’s also another approach. For other connectors (e.g. MySQL), the schema changes can be propagated using a different mechanism, via a Kafka schema change topic.

7. Conclusions

The code for our forked Debezium Cassandra Connector can be found here, but we also contributed the code for the new schema evolution approach back to the main project.

Contact us if you are interested in running the Debezium Cassandra (or other) connectors, and check out our managed platform if you are interested in Apache Cassandra, Kafka, Kafka Connect, and Kafka Schema Registry.

What did we learn in this blog series? Debezium and Kafka Connect are a great combination for change data capture (CDC) across multiple heterogeneous systems, and can fill in the complete CDC picture – just as trapeziums can be used to tile a 2D plane (in Part 1 we discovered that “Trapezium”, but not Debezium, is the only official Scrabble word ending in “ezium”). M.C. Escher was well known for his clever tessellations (the trick of covering an area with repeated shapes without gaps). Here’s a well-known example, “Day and Night” (which is now held by the National Gallery of Victoria in Melbourne, Australia; see this link for a detailed picture).

“Day and Night” at the Escher Museum, The Hague. (Source: https://commons.wikimedia.org/wiki/File:Gevel_Escher_in_Het_Paleis_300_dpi.jpg)

Acknowledgements

Many thanks to the Instaclustr Debezium Cassandra project team who helped me out with information for this blog and/or worked on the project, including:

  • Chris Wilcox
  • Wesam Hasan
  • Stefan Miklosovic
  • Will Massey

Change Data Capture (CDC) With Kafka Connect and the Debezium Cassandra Connector (Part 1)

It’s Quiz time! What does CDC stand for?

  1. Centers for Disease Control and Prevention
  2. Control Data Corporation
  3. Change Data Capture

(1) In the last year the U.S. Centers for Disease Control and Prevention (cdc.gov) has certainly been in the forefront of the news, and their Apache Kafka Covid-19 pipeline was the inspiration for my ongoing Kafka Connect streaming data pipeline series. So when colleagues at work recently mentioned CDC for databases I was initially perplexed.

(2) The second CDC that came to mind was the Control Data Corporation. This CDC was a now-defunct 1960s mainframe computer company, perhaps most famous for the series of supercomputers built by Seymour Cray before he started his own company. The CDC 6600 was the first computer to hit 1 megaFLOPS (arguably making it the first “supercomputer”), which it achieved by replacing the older and hotter-running Germanium transistors (see section 5.5 below) with the new silicon transistor.

CDC 6600. (Source: https://www.computerhistory.org/revolution/supercomputers/10/33/57)

(3) But soon another word was introduced into the conversation which clarified the definition of CDC: Debezium! Although it sounds like a newly discovered metallic element, it’s actually a new open source distributed platform for Change Data Capture from multiple different databases. Like germanium, most metallic elements end in “ium”, and surprisingly the latest “ium” element, moscovium, was only officially named in 2016. The name Debezium (DBs + “ium”) was indeed inspired by the periodic table of elements.

1. What is Change Data Capture?

Change Data Capture has been around for almost as long as databases (I even wrote one for Oracle in the 1990s!) and is an approach used to capture changes in one database to propagate to other systems to reuse (often other databases, but not exclusively). 

Some basic approaches rely on SQL to detect rows with changes (e.g. using timestamps and/or monotonically increasing values), or internal database triggers. Oracle, for example, produces a “change table” that can be used by external applications to get changes of interest to them. These approaches typically put extra load on the database and were therefore often only run as batch jobs (e.g. at night when the normal load was low or even paused), so the change data was only available to downstream systems once every 24 hours—far from real time!

More practical CDC solutions involve the use of custom applications (or sometimes native functionality) on the database to access the commit logs and make changes available externally, so called log-based CDC. Debezium uses this approach to get the change data, and then uses Kafka and Kafka Connect to make it available scalably and reliably to multiple downstream systems.

2. CDC Use Cases

The Debezium GitHub repository has a good introduction to Debezium Change Data Capture use cases, and I’ve thought of a few more (including some enabled by Kafka Streams):

  1. Cache invalidation, updating or rebuilding Indexes, etc.
  2. Simplifying and decoupling applications. For example:
    1. Instead of an application having to write to multiple systems, it can write to just one, and then:
    2. Other applications can receive change data from the one system, enabling multiple applications to share a single database
  3. Data integration across multiple heterogeneous systems
  4. For triggering real-time event-driven applications
  5. To use database changes to drive streaming queries
  6. To recreate state in Kafka Streams/Tables
  7. To build aggregate objects using Kafka Streams

The last two use cases (or possibly design patterns) use the power of Kafka Streams. There is also a recent enhancement for Kafka Streams which assists in relational to streams mapping using foreign key joins.

3. Kafka Connect JDBC Source Connectors

Recently I’ve been experimenting with Kafka Connect JDBC and PostgreSQL sink connectors for extensions to my pipeline blogs. But what I hadn’t taken much notice of was that there were also some JDBC source connectors available. For example, the Aiven open source jdbc connector comes in sink and source flavors. It supports multiple database types and has configurations for setting the poll interval and the query modes (e.g. batch and various incremental modes). However, it will put extra load onto the database, and the change data will be out of date for as long as the poll interval, and it only works for SQL databases, not for NoSQL databases such as Cassandra. 

There is at least one open source Kafka Connect Cassandra source connector, which we mention on our support pages, but it is not suitable for production environments.

So how does Debezium work for production CDC scenarios?

4. How does Debezium Work? Connectors + Kafka Connect

Debezium currently has “Connectors” (to capture change data events) for the following databases:

  • MySQL
  • MongoDB
  • PostgreSQL
  • SQL Server
  • Oracle (preview)
  • Cassandra (preview)

Debezium requires custom database side applications/plugins to be installed for some of these (PostgreSQL and Cassandra), which means that you may need the cooperation of your cloud provider if you are running them in the cloud.

Debezium “Connectors” always write the change data events to a Kafka cluster, and most of them are in fact Kafka Connect source connectors, so you will most likely need a Kafka Connect cluster as well, which brings other advantages. Here’s an example of the Debezium architecture for a PostgreSQL source, which uses both Kafka Connect source and sink connectors.

Using Kafka and Kafka Connect clusters to capture and deliver change data capture messages has many potential benefits. Kafka is very fast and supports high message throughput. It works well as a buffer to absorb load spikes to ensure that the downstream systems are not overloaded and that no messages are lost. It also supports multiple producers, enabling change data capture from multiple sources. 

Moreover, multiple sinks can subscribe to the same messages, allowing for sharing and routing of the CDC data among heterogeneous target systems. The loose coupling between consumers and producers is beneficial for the CDC use case, as the consumers can consume from multiple heterogeneous data sources, and transform, filter, and act on the change types depending on the target systems. 

One of my favorite Kafka “super powers” is its ability to “replay” events. This is useful for the CDC use case for scenarios involving rebuilding the state of downstream systems (e.g. reindexing) without putting extra load on the upstream databases (essentially using Kafka as a cache). An earlier (pre-Kafka) heterogeneous open source CDC system, DataBus (from LinkedIn), also identified this as a vital feature, which they called “look-back”.

“Looking back” at event streams comes with Kafka by default

5. Change Data Capture With the Debezium Cassandra Connector

Instaclustr recently built, deployed, configured, monitored, secured, ran, tested, fixed, and tuned the first Debezium Cassandra Connector on our managed platform for a customer. The customer use case was to replicate some tables from Cassandra into another analytical database, in close to real time, and reliably at scale. The rest of this blog series looks at the Debezium Cassandra Connector in more detail, explores how the change data could be used by downstream applications, and summarises our experiences from this project. 

Because the Cassandra connector is “incubating” it’s important to understand that the following changes will be ignored:

  • TTL on collection-type columns
  • Range deletes
  • Static columns
  • Triggers
  • Materialized views
  • Secondary indices
  • Lightweight transactions

5.1 What “Change Data” is Produced by the Debezium Cassandra “Connector”?

Most Debezium connectors are implemented as Kafka Connect source connectors, except for the Debezium Cassandra Connector, which just uses a Kafka producer to send the change data to Kafka as shown in this diagram: 

From the documentation for the Debezium Cassandra “Connector”, I discovered that the Debezium Cassandra Connector:

  1. Writes events for all Cassandra insert, update, and delete operations
  2. On a single table
  3. To a single Kafka topic
  4. All change events have a key and value
  5. The Kafka key is a JSON object that contains:
    1. A “key schema”, an array of fields including the name, type and possibly a “logicalType” (which is not really explained anywhere), and
    2. A payload (name value pairs);
    3. For each column in the primary key of the table
  6. The Kafka value is a bit more complicated, see below

The value of a change event message is more complicated, is JSON, and contains the following fields:

  • op: The type of operation, i = insert, u = update, d = delete
  • after: The state of the row after the operation occurred
  • source: Source metadata for the event 
  • ts_ms: The time at which Debezium processed the event

Unlike some other Debezium connectors, there is no “before” field (as the Cassandra commit log doesn’t contain this information), and some fields are optional (after, and ts_ms). Note that the value of “after” fields is null if their state is unchanged, except for primary key fields which always have a value.

Here are some simplified examples for a table that keeps web page metrics.

First, an insert (where the table primary key is page, and there are also visits and conversions):

INSERT INTO analytics (page, visits, conversions) VALUES ('/the-power-of-kafka-partitions/', 4533, 55);

Results in this message value (simplified; the source fields are omitted for clarity):

{
 "op": "c",
 "ts_ms": 1562202942832,
 "after": {
  "page": {
    "value": "/the-power-of-kafka-partitions/",
    "deletion_ts": null,
    "set": true
  },
  "visits": {
    "value": 4533,
    "deletion_ts": null,
    "set": true
  },
  "conversions": {
    "value": 55,
    "deletion_ts": null,
    "set": true
  }
 },
 "source": {
  ...
 }
}

Note that each of the after fields has a new value as the record has been created.

Now let’s see what happens with an UPDATE operation:

UPDATE analytics
SET visits=4534
WHERE page='/the-power-of-kafka-partitions/';

An UPDATE results in a change message with this value:

{
 "op": "u",
 "ts_ms": 1562202944319,
 "after": {
  "page": {
   "value": "/the-power-of-kafka-partitions/",
   "deletion_ts": null,
   "set": true
  },
  "visits": {
   "value": 4534,
   "deletion_ts": null,
   "set": true
  },
  "conversions": null
 },
 "source": {
  ...
 }
}

Note that for this UPDATE change event, the op field is now “u”, the after visits field has the new value, the after conversions field is null as it didn’t change, but the after page value is included as it’s the primary key (even though it didn’t change).

And finally a DELETE operation:

DELETE FROM analytics
WHERE page='/the-power-of-kafka-partitions/';

Results in this message value:

{
 "op": "d",
 "ts_ms": 1562202946329,
 "after": {
  "page": {
  "value": "/the-power-of-kafka-partitions/",
  "deletion_ts": 1562202947401,
  "set": true
  },
  "visits": null,
  "conversions": null  }
 },
 "source": {
  ...
 }
}

The op is now “d”, the after field only contains a value for page (the primary key) and deletion_ts now has a value, and other field values are null.

To summarize:

  • For all operation types the primary key after fields have values
  • For updates, non-primary key after fields have values if the value changed, otherwise null
  • For deletes, non-primary key after fields have null values

And now it’s time for a brief etymological interlude, brought to you by the suffix “ezium”:

Oddly enough, there is only one word permitted in Scrabble which ends in “ezium”: trapezium! Sadly, Debezium hasn’t made it into the Scrabble dictionary yet. A trapezium (in most of the world) is a quadrilateral with at least one pair of parallel sides. But note that in the U.S. this is called a trapezoid, and a “trapezium” has no parallel sides. Here’s an “Australian” trapezium = U.S. trapezoid:

Read Next: Change Data Capture (CDC) With Kafka Connect and the Debezium Cassandra Connector (Part 2)

The post Change Data Capture (CDC) With Kafka Connect and the Debezium Cassandra Connector (Part 1) appeared first on Instaclustr.

The CAP Theorem With Apache Cassandra and MongoDB

MongoDB and Apache Cassandra are both popular NoSQL distributed database systems. In this article, I will review how the CAP and PACELC theorems classify these systems. I will then show how both systems can be configured to deviate from their classifications in production environments. 

The CAP Theorem

The CAP theorem states that a distributed system can provide only two of three desired properties: consistency, availability, and partition tolerance. 

Consistency (C): Every client sees the same data. Every read receives the data from the most recent write. 

Availability (A): Every request gets a non-error response, but the response may not contain the most recent data.

Partition Tolerance (P): The system continues to operate despite one or more breaks in inter-node communication caused by a network or node failure. 

Because a distributed system must be partition tolerant, the only real choice is between availability and consistency. If part of a cluster becomes unavailable, a system will either:

  • Safeguard data consistency by canceling the request even if it decreases the availability of the system. Such systems are called CP systems. 
  • Provide availability even though inconsistent data may be returned. These systems are AP distributed systems. 

According to the CAP theorem, MongoDB is a CP system and Cassandra is an AP system.

The CAP theorem provides an overly simplified view of today’s distributed systems such as MongoDB and Cassandra. Under normal operations, availability and consistency are adjustable and can be configured to meet specific requirements. However, in keeping with CAP, increasing one decreases the other. Hence, it would be more correct to describe the default behavior of MongoDB or Cassandra as CP or AP. This is discussed in more detail below.

PACELC Theorem

The PACELC theorem was proposed by Daniel J. Abadi in 2010 to address two major oversights of CAP:

  1. It considers the behavior of a distributed system only during a failure condition (the network partition)
  2. It fails to consider that in normal operations, there is always a tradeoff between consistency and latency

PACELC is summarized as follows: in the event of a partition (P), a distributed system must choose between availability (A) and consistency (C); else (E), when running normally, it must choose between latency (L) and consistency (C).

MongoDB is classified as a PC+EC system. During normal operations and during partition failures, it emphasizes consistency. Cassandra is a PA+EL system: during a partition failure it favors availability, and under normal operations it gives up consistency for lower latency. However, like CAP, PACELC describes a default behavior.

(As an aside, there are no distributed systems that are CA. That category describes stand-alone, ACID-compliant relational database management systems.)

Apache Cassandra vs. MongoDB Architectures

MongoDB is a NoSQL document database. It is a single-master distributed system that uses asynchronous replication to distribute multiple copies of the data for high availability. A MongoDB deployment is a group of instances running mongod and maintaining the same data. The MongoDB documentation refers to this grouping as a replica set, but for simplicity I will call it a MongoDB cluster.

A MongoDB cluster is composed of two types of data-bearing members: 

Primary: The primary is the master node and receives all write operations. 

Secondaries: The secondaries receive replicated data from the primary to maintain an identical data set. 

By default, the primary member handles all reads and writes. Optionally, a MongoDB client can route some or all reads to the secondary members. Writes must be sent to the primary. 

If the primary member fails, all writes are suspended until a new primary is selected from one of the secondary members. According to the MongoDB documentation, this process requires up to 12 seconds to complete. 

To increase availability, a cluster can be distributed across geographically distinct data centers. The maximum size of a MongoDB replica set is 50 members.

A MongoDB cluster.

A Cassandra cluster is a collection of instances, called nodes, connected in a peer-to-peer “share nothing” distributed architecture. There is no master node and every node can perform all database operations and each can serve client requests. Data is partitioned across nodes based on a consistent hash of its partitioning key. A partition has one or many rows and each node may have one or more partitions. However, a partition can reside only on one node.  

Data has a replication factor that determines the number of copies (replicas) that should be made. The replicas are automatically stored on different nodes. 

The node that first receives a request from a client is the coordinator. It is the job of the coordinator to forward the request to the nodes holding the data for that request and to send the results back to the client. Any node in the cluster can act as a coordinator.

Cassandra cluster showing coordinator. 

Consistency

By default, MongoDB is a strongly consistent system. Once a write completes, any subsequent read will return the most recent value.

Cassandra, by default, is an eventually consistent system. Once a write completes, the latest data eventually becomes available provided no subsequent changes are made.

It is important to keep in mind that MongoDB becomes an eventually consistent system when read operations are done on the secondary members. This happens because of replication lag (the delay between when the data is written to the primary and when that data is available on the secondary). The larger the lag, the greater chance that reads will return inconsistent data. 

Tunable Consistency

Both MongoDB and Cassandra have “tunable consistency.” That is, the levels of consistency and availability are adjustable to meet certain requirements. Individual read and write operations define the number of members or replicas that must acknowledge a request in order for that request to succeed. In MongoDB, this level is called read concern or write concern. In Cassandra, the level of acknowledgment is the consistency level of the operation. The definitions of each are shown in Tables 1 and 2. 

Read concerns:

  • Local: Returns data from the instance without guaranteeing the data has been written to a majority of the instances. This is equivalent to a read uncommitted isolation in a relational database.
  • Available: Same as local.
  • Majority: Guarantees that a majority of the cluster members acknowledged the request. A majority read returns only committed data.

Write concerns:

  • 0: Does not require an acknowledgment of the write.
  • 1: Requires acknowledgment from the primary member only.
  • <number>: Checks if the operation has replicated to the specified number of instances.
  • Majority: Checks if the operations have propagated to the majority.

Table 1: MongoDB Read and Write Concerns

Cassandra consistency levels:

  • ONE, TWO, THREE: How many replicas need to respond to a read or write request.
  • LOCAL_ONE: One replica in the same data center as the coordinator must successfully respond to the read or write request. Provides low latency at the expense of consistency.
  • LOCAL_QUORUM: A quorum (majority) of the replica nodes in the same data center as the coordinator must respond to the read or write request. Avoids the latency of inter-data-center communication.
  • QUORUM: A quorum (majority) of the replicas in the cluster need to respond to a read or write request for it to succeed. Used to maintain strong consistency across the entire cluster.
  • EACH_QUORUM: The read or write must succeed on a quorum of replica nodes in each data center. Used only for writes to a multi-data-center cluster to maintain the same level of consistency across data centers.
  • ALL: The request must succeed on all replicas. Provides the highest consistency and the lowest availability of any level.
  • ANY: A write must succeed on at least one node or, if all replicas are down, a hinted handoff must have been written. Guarantees that a write will never fail, delivering the highest availability at the expense of the lowest consistency.

Table 2: Cassandra Consistency Levels

One important limitation is that no combination of read and write concerns can make MongoDB strongly consistent once reads are permitted on the secondary members. 

In Cassandra, data reads can be made strongly consistent if operations follow the formula (R = read consistency, W = write consistency, and N = replication factor):

R + W > N

For example, with a replication factor of N = 3, QUORUM writes (W = 2) and QUORUM reads (R = 2) give 2 + 2 > 3, so every read overlaps with at least one replica that holds the latest write.

Read and write consistency can be adjusted to optimize a specific operation. For example:

  • Optimize for writes: read consistency ALL, write consistency ONE
  • Optimize for reads: read consistency ONE, write consistency ALL
  • Balanced: read consistency QUORUM, write consistency QUORUM
  • No optimization (high latency with low availability): read consistency ALL, write consistency ALL

Table 3: Achieving Strong Consistency in Cassandra

The last setting is an extreme example of how you can get very strong consistency but lose all fault tolerance. If only one replica becomes unavailable, the query fails. 

On the other hand, MongoDB could be configured with read and write concerns so low that only eventual consistency is possible. 

The majority read/write concern differs from Cassandra’s quorum consistency level. Majority will return only committed data from a majority of the nodes. Cassandra read at quorum can return uncommitted data. 

Both systems can submit statements with linearizable consistency (data must be read and written in sequential order across all processes) with some restrictions. 

MongoDB provides linearizable consistency when you combine “majority” write concern with “linearizable” read concern. However, to use the linearizable read concern you must read data from the primary. 
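
A rough pymongo sketch of that combination (the database and collection names are hypothetical); note that the read preference must be primary:

    # Minimal sketch: linearizable reads in MongoDB.
    # The URI, database, and collection names are hypothetical.
    from pymongo import MongoClient, ReadPreference
    from pymongo.read_concern import ReadConcern
    from pymongo.write_concern import WriteConcern

    client = MongoClient("mongodb://localhost:27017/?replicaSet=rs0")

    accounts = client["bank"].get_collection(
        "accounts",
        write_concern=WriteConcern("majority"),       # writes acknowledged by a majority
        read_concern=ReadConcern("linearizable"),     # reads reflect all acknowledged writes
        read_preference=ReadPreference.PRIMARY,       # linearizable reads must target the primary
    )

    accounts.update_one({"_id": 1}, {"$inc": {"balance": 100}}, upsert=True)
    print(accounts.find_one({"_id": 1}))   # the filter must identify a single document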

Cassandra uses a modified Paxos consensus protocol to implement lightweight transactions. Lightweight transactions use a consistency level of SERIAL or LOCAL_SERIAL (equivalent to QUORUM and LOCAL_QUORUM). 
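
A minimal sketch of a lightweight transaction with the Cassandra Python driver; the contact point and the shop.users table are hypothetical:

    # Minimal sketch: a Cassandra lightweight transaction (Paxos) with SERIAL consistency.
    # Contact point, keyspace, and table are hypothetical.
    from cassandra import ConsistencyLevel
    from cassandra.cluster import Cluster
    from cassandra.query import SimpleStatement

    session = Cluster(["127.0.0.1"]).connect()

    stmt = SimpleStatement(
        "INSERT INTO shop.users (username, email) VALUES (%s, %s) IF NOT EXISTS",
        consistency_level=ConsistencyLevel.QUORUM,          # consistency of the write itself
        serial_consistency_level=ConsistencyLevel.SERIAL,   # consistency of the Paxos phase
    )
    result = session.execute(stmt, ("alice", "alice@example.com"))
    print(result.was_applied)   # True only if the row did not already exist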

Losing Consistency During Normal Operations

Replicating multiple copies of data is how Cassandra and MongoDB increase availability. Data in these copies can become inconsistent during normal operations. However, as we shall see, Cassandra has background processes to resolve these inconsistencies.

When MongoDB secondary members become inconsistent with the primary due to replication lag, the only solution is waiting for the secondaries to catch up. MongoDB 4.2 added flow control, which throttles writes on the primary whenever the replication lag exceeds a configurable threshold.

However, if the data on a secondary becomes too stale, the only solution is to manually resynchronize the member: bootstrap it after deleting all of its data, copy a recent data directory from another member in the cluster, or restore a snapshot backup.

Cassandra has two background processes to synchronize inconsistent data across replicas without the need for bootstrapping or restoring data: read repairs and hints. 

If Cassandra detects that replicas return inconsistent data to a read request, a background process called read repair imposes consistency by selecting the most recently written data to return to the client and writing it back to the stale replicas.

If a Cassandra node goes offline, the coordinator attempting to write to the unavailable replica temporarily stores the failed writes as hints on its local filesystem. If the hints have not expired (three hours by default), they are replayed to the replica when it comes back online, a process known as hinted handoff.

Anti-entropy repair (a.k.a. repair) is a manual process that synchronizes data among the replicas. Running repairs is part of the routine maintenance of a Cassandra cluster.

Loss of Consistency During a Failure

If the primary member fails, MongoDB preserves consistency by suspending writes until a new primary is elected. Any writes to the failed primary that have not been replicated are rolled back when it rejoins the cluster as a secondary. MongoDB 4.0 and later also create rollback files containing the rolled-back data.

When a Cassandra node becomes unavailable, processing continues and failed writes are temporarily saved as hints on the coordinator. If the hints have not expired, they are applied to the node when it becomes available. 

Availability

Both MongoDB and Cassandra get high availability by replicating multiple copies of the data. The more copies, the higher the availability. Clusters can be distributed across geographically distinct data centers to further enhance availability. 

The maximum size of a MongoDB replica set is 50 members, with no more than seven voting members. There is no hard limit on the number of nodes in a Cassandra cluster, but there can be performance and storage penalties for setting the replication factor too high. A typical replication factor is three for most clusters, or five when there are very high availability requirements. Conversely, when data availability is less critical, say for data that can easily be recreated, the replication factor can be lowered to save space and improve performance.
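
For context, the replication factor in Cassandra is set per keyspace and per data center. A minimal sketch (the keyspace and data center names are hypothetical):

    # Minimal sketch: three replicas in each of two data centers.
    # Contact point, keyspace name, and data center names are hypothetical.
    from cassandra.cluster import Cluster

    session = Cluster(["127.0.0.1"]).connect()
    session.execute(
        "CREATE KEYSPACE IF NOT EXISTS orders_ks WITH replication = "
        "{'class': 'NetworkTopologyStrategy', 'dc1': 3, 'dc2': 3}"
    )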

Fault Tolerance

If the primary member of a MongoDB cluster becomes unavailable for longer than electionTimeoutMillis (10 seconds by default), the secondary members hold an election to determine a new primary as long as a majority of members are reachable.

During the election, which can take up to 12 seconds, a MongoDB cluster is only partially available: 

  1. No writes are allowed until a new primary is elected
  2. Data can still be read from the secondaries if a majority of the secondary members are online and reads from secondaries have been enabled (a read-preference sketch follows this list)
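
Whether secondary reads are enabled is a client-side choice made through the read preference. A minimal pymongo sketch (the URI, database, and collection names are hypothetical):

    # Minimal sketch: allow reads from secondaries so reads can continue during an election.
    # The URI, database, and collection names are hypothetical.
    from pymongo import MongoClient, ReadPreference

    client = MongoClient("mongodb://host1:27017,host2:27017,host3:27017/?replicaSet=rs0")

    # secondaryPreferred: read from a secondary when one is available,
    # fall back to the primary otherwise.
    orders = client["shop"].get_collection(
        "orders",
        read_preference=ReadPreference.SECONDARY_PREFERRED,
    )
    print(orders.count_documents({}))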

Because all Cassandra nodes are peers, a cluster can tolerate the loss of multiple replicas provided the consistency level of the operation is met. Even if a majority of replicas are lost, it is possible to continue operations by reverting to the default consistency level of ONE or LOCAL_ONE.

Fault Tolerance With Multiple Data Centers

Both MongoDB and Cassandra clusters can span geographically distinct data centers in order to increase high availability. If one data center fails, the application can rely on the survivors to continue operations. 

How MongoDB responds to the loss of a data center depends on the number and placement of the members among the data centers. If the data center containing the primary member is lost, writes are unavailable until a new primary is elected.

The data can still be available for reads, even if one data center fails, as long as the members are distributed over multiple data centers. If the data center with a minority of the members goes down, the cluster can continue to serve read and write operations; if the data center with the majority goes down, it can only serve reads.

With three data centers, if any data center goes down, the cluster remains writeable as the remaining members can hold an election.

The fault tolerance of a Cassandra cluster depends on the number of data centers, the replication factor, and how much consistency you are willing to sacrifice. As long as the consistency level can be met, operations can continue.

Number of Failed Nodes Without Compromising High Availability (1)

RF | Consistency Level | 1 DC | 2 DCs | 3 DCs
2 | ONE / LOCAL_ONE | 1 node | 3 nodes total | 5 nodes total
2 | LOCAL_QUORUM | 0 nodes | 0 nodes | 0 nodes
2 | QUORUM | 0 nodes | 1 node total | 2 nodes total
2 | EACH_QUORUM | 0 nodes | 0 nodes | 0 nodes
3 | LOCAL_ONE | 2 nodes | 2 nodes from each DC | 2 nodes from each DC
3 | ONE | 2 nodes | 5 nodes total | 8 nodes total
3 | LOCAL_QUORUM | 1 node | 2 nodes total | 1 node from each DC or 2 nodes total
3 | QUORUM | 1 node | 2 nodes total | 4 nodes total
3 | EACH_QUORUM | 2 nodes | 1 node from each DC | 1 node from each DC
5 | ONE | 4 nodes | 9 nodes total | 14 nodes total
5 | LOCAL_ONE | 4 nodes | 4 nodes from each DC | 4 nodes from each DC
5 | LOCAL_QUORUM | 2 nodes | 2 nodes from each DC | 2 nodes from each DC
5 | QUORUM | 2 nodes | 4 nodes total | 7 nodes total
5 | LOCAL_QUORUM | 2 nodes | Could survive the loss of a DC | Could survive the loss of 2 DCs
5 | EACH_QUORUM | 2 nodes from each DC | 2 nodes from each DC | 2 nodes from each DC
Table 4: Fault Tolerance of Cassandra

(1) The RF column is the replication factor used in each data center.

Conclusions on Cassandra vs. MongoDB

Under the CAP and PACELC theorems, MongoDB is categorized as a distributed system that guarantees consistency over availability, while Cassandra is classified as a system that favors availability. However, these classifications only describe the default behavior of each system. MongoDB remains strongly consistent only as long as all reads are directed to the primary member, and even then consistency can be loosened by the read and write concerns. Cassandra can also be made strongly consistent through adjustments to consistency levels, as long as a loss of availability and increased latency are acceptable.

Want to learn more about how to identify the best technology for your data layer? Contact us to schedule a time with our experts.

Why AWS PrivateLink Is Not Recommended With Kafka or Cassandra

AWS PrivateLink (also known as a VPC endpoint) is a technology that allows the user to securely access services using a private IP address. It is not recommended to configure an AWS PrivateLink connection with Apache Kafka or Apache Cassandra, mainly because of the single-entry-point problem: PrivateLink exposes only a single IP to the user and requires a load balancer between the user and the service. Realistically, the user would need an individual VPC endpoint per node, which is expensive and may not work.

Using PrivateLink, it is impossible by design to contact specific IPs within a VPC in the same way you can with VPC peering. VPC peering allows a connection between two VPCs, while PrivateLink publishes an endpoint that others can connect to from their own VPC.

Cassandra

The load-balancing policy for queries is set in the Cassandra driver (e.g. DCAwareRoundRobinPolicy or TokenAwarePolicy). This determines which node is selected as the coordinator node, and the driver connects directly to that node when making the query. AWS PrivateLink exposes only a single endpoint (IP), so all access to the nodes goes through the load balancer and the load-balancing policy in the driver is circumvented. For DCAwareRoundRobinPolicy this is not a real issue, because the load balancer can still spread requests effectively. For TokenAwarePolicy it is a problem: the query is no longer routed to a node that owns the data, so another node has to coordinate it, adding processing overhead and an extra network hop.
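
For reference, here is a minimal sketch of how these policies are configured in the Cassandra Python driver (the contact points and data center name are hypothetical); with PrivateLink in the middle, the token-aware routing set up here is effectively bypassed:

    # Minimal sketch: token-aware, DC-aware load balancing in the Cassandra Python driver.
    # Contact points and the data center name are hypothetical.
    from cassandra.cluster import Cluster, ExecutionProfile, EXEC_PROFILE_DEFAULT
    from cassandra.policies import DCAwareRoundRobinPolicy, TokenAwarePolicy

    profile = ExecutionProfile(
        # Route each query to a replica that owns the partition, preferring the local DC.
        load_balancing_policy=TokenAwarePolicy(DCAwareRoundRobinPolicy(local_dc="dc1"))
    )

    cluster = Cluster(
        contact_points=["10.0.0.1", "10.0.0.2", "10.0.0.3"],
        execution_profiles={EXEC_PROFILE_DEFAULT: profile},
    )
    session = cluster.connect()
    # Behind a PrivateLink load balancer the driver cannot reach individual nodes,
    # so the routing decision above never reaches the intended replica.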

[Figure: Kafka connection without AWS PrivateLink]

[Figure: Kafka connection with AWS PrivateLink]

Kafka partitions can have multiple replicas (the number depends on the replication factor) to increase durability and availability, allowing Kafka to fail over to a broker holding a replica of the partition if the leader's broker fails. Each partition has a single leader, and the client must connect to that broker directly for writes. When using AWS PrivateLink with Kafka, ports must be configured to route traffic from the load balancer.

In the graphics above, AWS PrivateLink cuts off the individual connections from the application to the brokers. Kafka clients are designed to discover the brokers and connect to each of them directly, so a load balancer in front of the cluster breaks this model unless every broker is exposed through its own PrivateLink endpoint, which is tedious to configure and duplicates routing that the Kafka client already handles. AWS PrivateLink therefore defeats the purpose of these direct connections between the application and Kafka and introduces a single point of failure if servers go down. While it is possible to connect Kafka through PrivateLink using Confluent, this comes with a lot of limitations. Simply using VPC peering instead of PrivateLink makes it much easier to establish connections in our case.
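
To see why the single endpoint is a problem, consider how a client normally connects. The following is a minimal sketch using the third-party kafka-python library (broker addresses and the topic name are hypothetical): the client bootstraps from any broker, fetches cluster metadata, and then opens direct connections to the leader of each partition it writes to; funneling every advertised broker address through one PrivateLink endpoint breaks that second step.

    # Minimal sketch: a Kafka producer using the kafka-python client library.
    # Broker addresses and the topic name are hypothetical.
    from kafka import KafkaProducer

    # The client bootstraps from these brokers, fetches cluster metadata, and then
    # connects directly to the broker that leads each partition it writes to.
    producer = KafkaProducer(bootstrap_servers=["broker1:9092", "broker2:9092", "broker3:9092"])

    producer.send("orders", b"order-created")
    producer.flush()
    # If every advertised broker address resolves to the same PrivateLink endpoint,
    # these per-broker connections are all funneled through a single load balancer.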

Conclusion 

AWS PrivateLink is intended to provide a secure, private connection between different accounts and VPCs and to simplify the structure of your network. But when used to connect to technologies like Apache Kafka and Apache Cassandra, PrivateLink makes things more complex. The single endpoint works against the direct, per-node connections these technologies rely on, which is why using PrivateLink with them is not recommended.

Have more questions on AWS PrivateLink or other best practices for Apache Cassandra and Apache Kafka? Reach out to schedule a consultation with one of our experts.

Analyzing Cassandra Data using GPUs, Part 2

Analyzing Cassandra Data using GPUs, Part 1