Enhanced Cluster Scaling for Apache Cassandra®

The Instaclustr Managed Platform now supports the zero-downtime scaling-up of local storage (also known as instance store or ephemeral storage) nodes running Apache Cassandra on AWS, Azure and GCP!

Enhanced Vertical Scaling 

NetApp has released an extension to the Cluster Scaling feature on the Instaclustr Managed Platform, which now supports scaling up the local storage of the Cassandra nodes. This enhancement builds upon the existing network block storage scaling option, offering greater flexibility and control over cluster configurations.

Customers can now easily scale up their Cassandra clusters on demand to respond to growing data needs. This development not only provides unparalleled flexibility and performance, but also distinguishes Instaclustr from competitors who lack support for local storage nodes. 

Local Storage 

Local storage refers to the physical storage directly attached to the node or instance. Unlike network-backed storage (such as EBS), local storage eliminates the need for data to travel over the network, leading to lower latency, higher throughput, and improved performance in data-intensive applications.

Moreover, the cost of local storage is included in the instance pricing. When combined with Reserved Instances and similar pricing models, this often makes local storage the most cost-effective infrastructure choice for many use cases. 

Whether you need to store large volumes of data or run complex computational tasks, the ability to scale up local storage nodes gives you the flexibility to adjust node sizes to your requirements. Scaling local storage nodes with minimal disruption is complex, so Instaclustr leverages advanced internal tooling for on-demand vertical scaling: our updated replace tool with “copy data” mode streamlines the process without compromising data integrity or cluster health. 

Additionally, this enhancement gives our customers the capability to switch between local storage (ephemeral) and network-based storage (persistent) while scaling up their clusters. Customer storage needs vary over time, ranging from the high I/O performance of local storage to the cost-effectiveness and durability of network-based storage. Instaclustr provides our customers with a variety of storage options to scale up their workloads based on performance and cost requirements.

Not sure where to start with your cluster scaling needs? Read about the best way to add capacity to your cluster here. 

Self-Service 

With this enhancement to the cluster scaling feature, we are expanding Instaclustr’s self-service capabilities, empowering customers with greater control and flexibility over their infrastructure. Scaling up Cassandra clusters has become more intuitive and is just a few clicks away.

This move towards greater autonomy is supported by production SLAs, ensuring scaling operations are completed without data loss or downtime. Cassandra nodes can be scaled up using the cluster scaling functionality available through the Instaclustr console, API, or Terraform provider. Visit our documentation for guidance on seamlessly scaling your storage for the Cassandra cluster. 

While this enhancement allows the scaling up of local storage, downscaling operations are not yet supported via self-service. Should you need to scale down the storage capacity of your cluster, our dedicated Support team is ready to assist. 

Scaling up improves performance and operational flexibility, but it also increases costs, so it is important to weigh the cost implications. You can review the pricing of the available nodes with the desired storage on the resize page of the console. Upon selection, a summary of current and new costs will be provided for easy comparison. 

Leverage the latest enhancements to our Cluster Scaling feature via the console, API or Terraform provider, and if you have any questions about this feature, please contact Instaclustr Support at any time. 

Unlock the flexibility and ease of scaling Cassandra clusters at the click of a button and sign in now! 


ScyllaDB Cloud Network Connectivity: Transit Gateway Connection

ScyllaDB Cloud is a managed NoSQL Database-as-a-Service based on the leading open source database engine ScyllaDB. It is designed for extreme performance, low latency, and high availability. It is compatible with the Cassandra Query Language (CQL) and Amazon DynamoDB APIs, making it a possible replacement for either database for solution architects pursuing performance. This article is about ScyllaDB Cloud network connectivity management: it will show how to connect ScyllaDB Cloud to a customer’s application using a transit gateway.

ScyllaDB Cloud Network Connectivity

In response to customer demand, ScyllaDB Cloud now offers two solutions for managing connectivity between ScyllaDB Cloud and customer application environments running in Amazon Web Services (AWS), or even in hybrid cloud setups. In addition to VPC peering, we now support Transit Gateway Connection.

Two Connectivity Options

VPC peering connection has been supported for years. It is an excellent solution for connectivity in non-complex network setups. However, managing many VPC peering connections can be cumbersome, error-prone, and challenging to audit. This problem becomes more severe at scale, and another tool from the AWS arsenal became necessary.

The Transit Gateway Connectivity feature was added to ScyllaDB Cloud in February 2024. It enables our customers to use Transit Gateways as part of their configuration and adds another way to connect to ScyllaDB clusters. The feature is available in clusters deployed in ScyllaDB Cloud as well as clusters deployed on the customer’s own AWS account via the bring-your-own-account (BYOA) feature.

Transit Gateway Connection allows connection through a centralized hub that simplifies connectivity and routing. It helps organizations connect multiple VPCs, Load Balancers, and VPNs within a single, scalable gateway, acting as a transit hub for routing traffic between various network endpoints and providing a unified solution for managing connectivity across complex or distributed environments. The Transit Gateway simplifies the management of intricate networks by introducing a more centralized hub-and-spoke model. That simplifies configuring and monitoring network connectivity, streamlines operations, and improves visibility. Moreover, it acts as a central place to apply security policies and access controls, thereby strengthening network security and improving auditability.

Configure Transit Gateway Connection in ScyllaDB Cloud

Configuring the Transit Gateway in ScyllaDB Cloud is straightforward.

Prerequisites

  • A ScyllaDB Cloud account. If you don’t have an account, you can use a free account. Get one here: ScyllaDB Cloud. 
  • Basic understanding of AWS networking concepts (VPCs, CIDR subnets, Resource Shares, route tables). 
  • An AWS account with sufficient permissions to create network configurations. 
  • 30 minutes of your time. 

Step 1: ScyllaDB Cloud Cluster

Log in to ScyllaDB Cloud and create a New Cluster. Depending on your account, you can either use the Free Trial or Dedicated VM option. Here I am using a free trial; if you choose a Dedicated VM, your screen might look slightly different. Select AWS as your Cloud Provider and click Next. As of the time of this blog, we support only the AWS Transit Gateway connection. Support for Network Connectivity Center (NCC) or similar technologies for Google Cloud Platform (GCP) or other clouds will be added in time. 
On the next screen, Cluster Properties, under Network Type you should select VPC Peering / Transit Gateway to enable network connectivity, and then confirm with the Create Cluster button below. The ScyllaDB cluster will be created shortly. Once it’s created, select the cluster.

Step 2: Provision AWS Transit Gateway

The following steps for provisioning the transit gateway have to be done in your AWS account. Although a transit gateway can connect to transit gateways in other regions (inter-region peering), it can only attach VPCs from the same region. The transit gateway should therefore be deployed in the same region as the ScyllaDB cluster; in this case, that is us-east-1. If you are not familiar with creating a transit gateway, the following guide can help: Configure AWS Transit Gateway (TGW) VPC Attachment Connection.

It is a good idea to set Auto accept shared attachments. It will make your TGW automatically accept attachment requests from other accounts, including ScyllaDB Cloud. For the next step, we will need the Transit Gateway ID and the RAM Resource Share ARN.

Step 3: Configuring Transit Gateway Connection in ScyllaDB Cloud

The next step is to configure the network connection. Select the Connections tab; you will be navigated to the Connections screen. Select Add Transit Gateway Connection and, when prompted, provide a custom Name. This is how the connection will be visualized in ScyllaDB Cloud. The Data Center should be the region where the TGW is located. Get the AWS RAM ARN from your AWS Resource Share as configured in the previous step, and the Transit Gateway ID from the AWS Transit Gateways console. Provide the chosen VPC CIDR to route the traffic and proceed with Add Transit Gateway.

The processing will take some time. If you followed my advice above, the connection will be accepted automatically; otherwise, you will have to go to your AWS account and accept the connection. Once the connection is active, the attached networks from the customer AWS account and ScyllaDB Cloud should see each other over the networks as defined. You can go to the ScyllaDB Documentation if you need help verifying the connection.

Cost & Billing

Amazon will charge the network components to the AWS account associated with each respective network component. This might be confusing at first: it means that bills will be sent to different accounts when connecting components from different accounts. In our cost example below, the database clusters are in ScyllaDB Cloud, but the applications are in the customer AWS account. For illustration purposes, we have databases and applications in different regions, with multiple transit gateways and peering. It is a complicated setup designed to show the different categories of charges. The expenses that occur when using Transit Gateway Connection are listed and explained below.

AWS Transit Gateway attachment hourly charges

AWS charges for each network component attached to the transit gateway. The typical cost is $0.05 per attachment per hour (us-east-1); the price varies by region, for example $0.07 in ap-northeast-2. Attachments include all VPCs from the diagram above (1, 2, 3, 4, 5, 6, 7, 8) and both transit gateways, TGW1 and TGW2. Other network components such as VPN connections, VPC peering, and gateways also count as attachments, but they are excluded from this setup. 
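To make the worked example that follows easier to check, here is a minimal sketch of the arithmetic in Python. The rates are the us-east-1 examples quoted above ($0.05 per attachment-hour, $0.02 per GB processed), and the 730-hour month is my own assumption, so adjust both for your region and billing period.

# Rough Transit Gateway cost estimator -- a sketch, not an official AWS pricing tool.
# Rates are the us-east-1 examples quoted in this post; adjust for your region.
ATTACHMENT_RATE_PER_HOUR = 0.05   # USD per attachment per hour (us-east-1)
DATA_PROCESSING_PER_GB = 0.02     # USD per GB processed by the TGW
HOURS_PER_MONTH = 730             # assumed average month length

def monthly_attachment_cost(num_attachments: int) -> float:
    """Hourly attachment charges accumulated over a month."""
    return num_attachments * ATTACHMENT_RATE_PER_HOUR * HOURS_PER_MONTH

def monthly_data_processing_cost(gb_per_vpc: float, num_vpcs: int) -> float:
    """Data processing charges, billed to the account that owns each VPC."""
    return gb_per_vpc * num_vpcs * DATA_PROCESSING_PER_GB

# Example: 4 VPCs attached to one TGW, each pushing 2 TB (2048 GB) per month.
print(f"Attachments:     ${monthly_attachment_cost(4):.2f} per month")   # 146.00
print(f"Data processing: ${monthly_data_processing_cost(2048, 4):.2f}")  # 163.84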
In our example, we have the following components:

Components connected to TGW1 (price per attachment $0.05, us-east-1):
  • VPC 1, 2, 3, 4 will be billed to ScyllaDB Cloud: 4 x $0.05 = $0.20 per hour 
  • VPC 5, 6, 7 and TGW2 will be billed to the Customer AWS Account: 5 x $0.05 = $0.25 per hour 

Components connected to TGW2 (price per attachment $0.07, ap-northeast-2):
  • VPC 8 and TGW1 will be billed to the Customer AWS Account: 2 x $0.07 = $0.14 per hour 

Transit Gateway data processing charges

Transit Gateway data processing costs $0.02 per 1 GB in most regions. Data processing is charged to the AWS account that owns the VPC. Assuming 2 TB of symmetrical traffic for simplicity:
  • VPC 1, 2, 3, 4 will be billed to ScyllaDB Cloud: 4 x 2048 GB x $0.02 = $163.84 
  • VPC 5, 6, 7, 8 will be billed to the Customer AWS Account: 4 x 2048 GB x $0.02 = $163.84 

Transit Gateway data processing charges across peering attachments

These are the peering fees between two or more Transit Gateways. Only outbound traffic is charged, as standard traffic between regions. Charges differ by region, but the typical cost is $0.02 per GB.

All charges billed to ScyllaDB Cloud (a total of $324.44 in this example) will be included in ScyllaDB Cloud’s monthly billing report and passed back to the customer. All charges billed to the Customer Account will appear in the customer’s AWS bill.

Bring Your Own Account (BYOA)

Customers can use ScyllaDB Cloud to deploy databases in their own AWS account. In this case, all charges will be applied to the customer’s account or accounts. Since there will be no charges in ScyllaDB Cloud, nothing will be passed on with the monthly bill. Combining Transit Gateways, VPC peering, and VPN connections in the same network setup is possible; ScyllaDB Cloud supports this configuration to provide flexibility and cost optimization.

ScyllaDB Cloud Network Connectivity Overview

ScyllaDB Cloud offers two network connectivity features. Both VPC Peering Connection and Transit Gateway Connection enable seamless and efficient connectivity between customer applications and highly performant ScyllaDB database clusters. We invite you to explore ScyllaDB Cloud and experiment with this feature and the other network connection options to match your unique application requirements and budget. We welcome your questions in our community forum and Slack channel; alternatively, you can use our contact form.

Cadence® Performance Benchmarking Using Apache Cassandra® Paxos V2

Overview 

Originally developed and open sourced by Uber, Cadence® is a workflow engine that greatly simplifies microservices orchestration for executing complex business processes at scale. 

Instaclustr has offered Cadence on the Instaclustr Managed Platform for more than 12 months now. Having made open source contributions to Cadence to allow it to work with Apache Cassandra® 4, we were keen to test how the new features in Cassandra 4.1 affect Cadence’s performance. 

Paxos is the consensus protocol used in Cassandra to implement lightweight transactions (LWT) that can handle concurrent operations. However, the initial version of Paxos (Paxos v1) achieves linearizable consistency at the high cost of 4 round trips per write operation. This can affect Cadence performance, given Cadence’s reliance on LWT for executing multi-row, single-shard conditional writes to Cassandra, as described in this documentation. 
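For readers less familiar with LWT, the sketch below shows what such a conditional write looks like from the client side, using the DataStax Python driver. The contact point, keyspace, and table are hypothetical placeholders, and the serial consistency level shown mirrors the LOCAL_SERIAL setting Cadence uses (discussed later in this post).

# Minimal lightweight transaction (compare-and-set) example with the DataStax
# Python driver. Contact point, keyspace, and table names are placeholders.
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

cluster = Cluster(["10.0.0.1"])             # hypothetical contact point
session = cluster.connect("demo_keyspace")  # hypothetical keyspace

# "IF NOT EXISTS" turns this INSERT into an LWT: Cassandra runs a Paxos round
# to agree on whether the row may be written.
stmt = SimpleStatement(
    "INSERT INTO executions (shard_id, run_id, state) VALUES (%s, %s, %s) IF NOT EXISTS",
    serial_consistency_level=ConsistencyLevel.LOCAL_SERIAL,
)
result = session.execute(stmt, (42, "run-123", "started"))
print(result.was_applied)  # True only if the conditional write won the Paxos round
cluster.shutdown()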

With the release of Cassandra 4.1, Paxos v2 was introduced, promising a 50% improvement in LWT performance, reduced latency, and a reduction in the number of roundtrips needed for consensus, as per the release notes. Consequently, we wanted to test the impact on Cadence’s performance with the introduction of Cassandra Paxos v2.  

This blog post focuses on benchmarking the performance impact of Cassandra Paxos v2 on Cadence, with a primary performance metric being the rate of successful workflow executions per second. 

Benchmarking Setup 

We used the cadence-bench tool to generate bench loads on the Cadence test clusters; to reduce variables in benchmarking, we used only the basic loads that don’t require the Advanced Visibility feature. 

Cadence bench tests require a Cadence server and bench workers. The following are their setups in this benchmarking: 

Cadence Server 

At Instaclustr, a managed Cadence cluster depends on a managed Cassandra cluster as the persistence layer. The test Cadence and Cassandra clusters were provisioned in their own VPCs and used VPC Peering for inter-cluster communication. For comparison, we provisioned two sets of Cadence and Cassandra clusters: a Baseline set and a Paxos v2 set. 

  • Baseline set: A Cadence cluster dependent on a Cassandra cluster with the default Paxos v1 
  • Paxos v2 set: A Cadence cluster dependent on a Cassandra cluster with Paxos upgraded to v2 

We provisioned the test clusters in the following configurations: 

Application  Version  Node Size  Number of Nodes 
Cadence  1.2.2  CAD-PRD-m5ad.xlarge-150 (4 CPU cores + 16 GiB RAM)  3 
Cassandra  4.1.3  CAS-PRD-r6g.xlarge-400 (4 CPU cores + 32 GiB RAM)  3 

Bench Workers 

We used AWS EC2 instances as the stressor boxes to run bench workers that generate bench loads on the Cadence clusters. The stressor boxes were provisioned in the VPC of the corresponding Cadence cluster to reduce network latency between the Cadence server and bench workers. On each stressor box, we ran 20 bench worker processes. 

 These are the configurations of the EC2 instances used in this benchmarking: 

EC2 Instance Size  Number of Instances 
c4.xlarge (4 CPU cores + 7.5 GiB RAM)  3 

Upgrade Paxos 

For the test Cassandra cluster in the Paxos v2 set, we upgraded Paxos to v2 after the cluster reached the running state. 

According to the section Steps for upgrading Paxos in NEWS.txt, upgrading Paxos to v2 on a Cassandra cluster requires setting 2 configuration properties and ensuring Paxos repairs are run regularly. We also planned to change the consistency level used for LWT in Cadence to fully benefit from the Paxos v2 improvements. 

We took the following actions to upgrade Paxos on the test Paxos v2 Cassandra cluster: 

Added these configuration properties to the Cassandra configuration file cassandra.yaml:
Configuration Property  Value 
paxos_variant  v2 
paxos_state_purging  repaired 

For Instaclustr managed Cassandra clusters, we use an automated service on Cassandra nodes to run Cassandra repairs regularly. 

Cadence sets LOCAL_SERIAL as the consistency level for conditional writes (i.e., LWT) to Cassandra as specified here. Because it’s hard-coded, we could not change it to ANY or LOCAL_QUORUM as suggested in the Steps for upgrading Paxos.  

The Baseline Cassandra cluster was set to use the default Paxos v1, so no configuration changes were required. 

Bench Loads 

We used the following configurations for the basic bench loads to be generated on both the Baseline and Paxos v2 Cadence clusters: 

{
  "useBasicVisibilityValidation": true,
  "contextTimeoutInSeconds": 3,
  "failureThreshold": 0.01,
  "totalLaunchCount": 100000,
  "routineCount": 15 or 20 or 25,
  "waitTimeBufferInSeconds": 300,
  "chainSequence": 12,
  "concurrentCount": 1,
  "payloadSizeBytes": 1024,
  "executionStartToCloseTimeoutInSeconds": 300
}

All the configuration properties except routineCount were kept constant. routineCount defines the number of parallel launch activities that start the stress workflows; in other words, it controls the rate at which concurrent test workflows are generated, and hence can be used to test Cadence’s capability to handle concurrent workflows. 

We ran 3 sets of bench loads with different routineCounts (i.e., 15, 20, and 25) in this benchmarking so we could observe the impacts of routineCount on Cadence performance.  

We used cron jobs on the stressor boxes to automatically trigger bench loads with different routineCount on the following schedules: 

  • Bench load 1 with routineCount=15: Runs at 0:00, 4:00, 8:00, 12:00, 16:00, 20:00 
  • Bench load 2 with routineCount=20: Runs at 1:15, 5:15, 9:15, 13:15, 17:15, 21:15 
  • Bench load 3 with routineCount=25: Runs at 2:30, 6:30, 10:30, 14:30, 18:30, 22:30 

Results 

At the beginning of the benchmarking, we observed an aging effect on Cadence performance. Specifically, the key metric, Average Workflow Success/Sec, gradually decreased for both the Baseline and Paxos v2 Cadence clusters as more bench loads were run. 

Why did this occur? Most likely because Cassandra serves requests largely from its in-memory caches while only a small amount of data is stored, meaning read and write latencies are lower when all data fits in memory. After more rounds of bench loads were executed, Cadence performance became more stable and consistent. 

Cadence Workflow Success/Sec  

Cadence Average Workflow Success/Sec is the key metric we use to measure Cadence performance.  

As demonstrated in the graphs below and contrary to what we expected, Baseline Cadence cluster achieved around 10% higher Average Workflow Success/Sec than Paxos v2 Cadence cluster in the sample of 10 rounds of bench loads. 

Baseline Cadence cluster successfully processed 36.2 workflow executions/sec on average: 

Paxos v2 Cadence cluster successfully processed 32.9 workflow executions/sec on average: 

Cadence History Latency 

Cadence history service is the internal service that communicates with Cassandra to read and write workflow data in the persistence store. Therefore, Average History Latency should in theory be reduced if Paxos v2 truly benefits Cadence performance.  

However, as shown below, Paxos v2 Cadence cluster did not consistently perform better than Baseline Cadence cluster on this metric: 

Baseline Cadence cluster 

Paxos v2 Cadence cluster 

Cassandra CAS Operations Latency 

Cadence only uses LWT for conditional writes (LWT is also called Compare and Set [CAS] operation in Cassandra). Measuring the latency of CAS operations reported by Cassandra clusters should reveal if Paxos v2 reduces latency of LWT for Cadence. 

As indicated in the graphs below and by the mean value of this metric, we did not observe consistently significant differences in Average CAS Write Latency between Baseline and Paxos v2 Cassandra clusters.  

The mean value of Average CAS Write Latency reported by Baseline Cassandra cluster is 105.007 milliseconds. 

The mean value of Average CAS Write Latency reported by Paxos v2 Cassandra cluster is 107.851 milliseconds. 

Cassandra Contended CAS Write Count 

One of the improvements introduced by Paxos v2 is reduced contention. Therefore, to ensure that Paxos v2 was truly running on Paxos v2 Cassandra cluster, we measured and compared the metric Contended CAS Write Count for Baseline and Paxos v2 Cassandra clusters. As clearly illustrated in the graphs below, Baseline Cassandra cluster experienced contended CAS writes while Paxos v2 Cassandra cluster did not, which means Paxos v2 was in effect on Paxos v2 Cassandra. 

Baseline Cassandra cluster 

Paxos v2 Cassandra cluster 

Conclusion 

During our benchmarking test, there was no noticeable improvement in the performance of Cadence attributable to Cassandra Paxos v2.   

This may be because we were unable to modify the consistency level in the Lightweight Transactions (LWT) as required to fully leverage Cassandra Paxos v2, or may simply be that other factors in a complex, distributed application like Cadence mask any improvement from Paxos v2.  

Future Work 

Although we did not see promising results in this benchmarking, future investigations could focus on assessing the latency of LWTs in Cassandra clusters with Paxos v2 both enabled and disabled and with a less complex client application. Such exploration would be valuable for directly evaluating the impact of Paxos v2 on Cassandra’s performance. 


New Google Cloud Z3 Instances: Early Performance Benchmarks on ScyllaDB Show up to 24% Better Throughput

ScyllaDB, a high-performance NoSQL database with a close-to-the-metal architecture, had the privilege of examining Google Cloud’s Z3 GCE instances in an early preview. The Z3 machine series is the first of the Storage Optimized VM GCE offerings, and it boasts a remarkable 36 TB of local SSD. The Z3 series is powered by the 4th Gen Intel Xeon Scalable processor (alias Sapphire Rapids) and DDR5 memory, as well as Google’s custom-built Infrastructure Processing Unit (IPU) that supports Hyperdisk. The Z3 amalgamates the most recent advancements in compute, networking, and storage technologies into a single platform, with a distinct emphasis on a new breed of high-density, high-performance local SSD.

The Z3 series is optimized for workloads that require low-latency, high-performance access to large data sets. Likewise, ScyllaDB is engineered to deliver predictable low latency, even with workloads exceeding 1M OPS per machine. Google, Intel, and ScyllaDB partnered to test ScyllaDB on the new instances because we were all curious to see how these innovations translated to performance gains with data-intensive use cases.

TL;DR

When we tested ScyllaDB on the new Z3 instances, ScyllaDB exhibited a significant throughput improvement across workloads versus the previous generation of N2 instances. We observed a 23% increase in write throughput, 24% for mixed workloads, and 14% for reads per vCPU (z3-highmem-88 vs n2-highmem-96), at a lower cost when compared to an N2 with additional fast disks of the same size. On these new instances, a cluster of just 3 ScyllaDB nodes can achieve around 2.2M OPS for writes and mixed workloads and around 3.6M OPS for reads.

Instance Selection: Google Cloud Z3 versus N2

Z3 instances come in 2 shapes, z3-highmem-88 and z3-highmem-176, with 88 and 176 4th Gen Intel(R) Xeon(R) Scalable vCPUs respectively. Each vCPU is paired with 8 GB of memory, culminating in a staggering 1,408 GB for the larger variant.

We conducted a comparative analysis between the Z3 instance and the N2 memory-optimized instances, which were our standard choice until now. The N2 instances are available in a variety of sizes and are designed around two Intel CPU architectures: 2nd and 3rd Gen Intel(R) Xeon(R) Scalable. The 3rd Gen Intel(R) Xeon(R) Scalable architecture is the default choice for larger machines (with 96 vCPUs or more). The n2-highmem also provides 8 GB of memory per vCPU. The N2 instance reaches its maximum size at 128 vCPUs. Thus, for an equitable comparison, we selected the n2-highmem-96, the closest N2 instance to the smaller Z3 instance, and equipped it with the maximum attachable 24 fast local NVMe disks.

ScyllaDB Benchmark Results: Z3 versus N2 Throughput

Setup and Configuration

Benchmarking such powerful machines requires considerable effort. To mimic user processes at this scale, we equipped 30 client instances, each with 30 processing units, to optimize outcomes. This necessitated the development of appropriate scripts for executing load and accumulating results. However, the scylla-cluster-tests testing framework facilitated this process, allowing us to execute all tests with remarkable efficiency. We measured maximum throughput using the cassandra-stress benchmark tool. To make the workload more realistic, we tuned the row size to 1 KB and set the replication factor to 3. 
Also, to measure the performance impact of the new generation of CPUs, we included workloads that read entirely from cache, removing the influence of disk speed disparities across the different instance types. All results show client-side values, so we measured the complete round trip and confirmed the ScyllaDB-side metric values.

Results

Because of ScyllaDB’s shard-per-core architecture, it is more meaningful to show results normalized by vCPU, as this gives a better sense of the new CPU platform’s capabilities. ScyllaDB exhibited a significant 23% increase in write throughput and a 24% increase in throughput for a mixed workload. Additionally, ScyllaDB achieved a 14% improvement in read throughput.

Workload  4th Gen Intel Xeon [op/s per vCPU]  3rd Gen Intel Xeon [op/s per vCPU]  diff 
Write only  8.45K op/s  6.85K op/s  +23% 
Read Only (Entirely from Cache)  13.59K op/s  11.93K op/s  +14% 
Mixed (50/50 Reads/Writes)  8.63K op/s  6.94K op/s  +24% 

The metrics showed a sustainable number of served requests:

Requests Served per Shard (Z3)

Careful readers will notice the graph shows 15K OPS/shard, which is higher than the numbers in the table. This is because 8 vCPUs are reserved exclusively for network and disk IRQ handling; they do not serve requests as part of the ScyllaDB node. Overall, a cluster of just 3 nodes can achieve around 2.2M OPS for write and mixed workloads and around 3.6M OPS for reads (all measured with QUORUM consistency level). Despite the Z3 instance having 8 fewer vCPUs than the N2 one, we achieved better performance in all tested workloads, which is an extraordinary accomplishment.

Workload  z3-highmem-88  n2-highmem-96  diff 
Write only  2.23M op/s  1.97M op/s  +13% 
Read Only (Entirely from Cache)  3.6M op/s  3.43M op/s  +5% 
Mixed (50/50 Reads/Writes)  2.28M op/s  2.00M op/s  +14% 

And this is how the Z3 read workload looks in ScyllaDB Monitoring:

Closing Thoughts

The results of this benchmark highlight how Google Cloud’s new Z3 platform family, based on the 4th Gen Intel(R) Xeon(R) Scalable processor, brings significant enhancements in CPU, disk, memory, and network performance. The included 36 TB of local SSD capacity also makes it more cost-effective than an N2 with additional fast disks of the same size. For ScyllaDB users, this translates to substantial gains in throughput while reducing costs for a variety of workloads. We recommend these instances to ScyllaDB users looking to further reduce their infrastructure footprint while improving performance.

Next steps:

  • Learn more about ScyllaDB’s close-to-the-metal architecture 
  • See what ScyllaDB customers are achieving with these infrastructure innovations 
  • Learn how ScyllaDB’s performance compares to other databases 

Notices & Disclaimers

Performance varies by use, configuration and other factors. Learn more at intel.com/performanceindex. Your costs and results may vary. Intel technologies may require enabled hardware, software or service activation. © Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others.

Preview Release of Apache Cassandra® version 5.0 on the Instaclustr Managed Platform

Apache Cassandra® version 5.0 Beta 1.0 is now available in public preview on the Instaclustr Managed Platform!  

This follows on the heels of the project release of Cassandra version 5.0 Beta 1.0. Instaclustr is proud to be the first managed service platform to release this version for deployment on all major cloud providers or on-premises.  

This release is designed to allow anyone to easily undertake any application-specific testing of Apache Cassandra 5.0 (Beta 1.0) in preparation for the forthcoming GA release of Apache Cassandra 5.0. 

Apache Cassandra 5.0 

The last major version of Apache Cassandra was released about three years ago, bringing numerous advantages with it. Cassandra 5.0 is another significant iteration with exciting features that will revolutionize the future of NoSQL databases.  

Cassandra 5.0 brings enhanced efficiency and scalability, performance, and memory optimizations to your applications. Additionally, it expands functionality support, accelerating your AI/ML journey and playing a pivotal role in the development of AI applications. 

Some of the key new features in Cassandra 5.0 include: 

  • Storage-Attached Indexes (SAI): A highly scalable, globally distributed index for Cassandra databases. With SAI, column-level indexes can be added, leading to unparalleled I/O throughput for searches across different data types, including vectors. SAI also enables lightning-fast data retrieval through zero-copy streaming of indices, resulting in unprecedented efficiency. 
  • Vector Search: This is a powerful technique for searching relevant content or discovering connections by comparing similarities in large document collections, and it is particularly useful for AI applications. It uses storage-attached indexing and dense indexing techniques to enhance data exploration and analysis (see the short sketch after this list). 
  • Unified Compaction Strategy: This unifies compaction approaches, including leveled, tiered, and time-windowed strategies. The strategy leads to a major reduction in SSTable sizes. Smaller SSTables mean better read and write performance, reduced storage requirements, and improved overall efficiency. 
  • Numerous stability and testing improvements: You can read all about these changes here. 
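For a feel of what these features look like in practice, here is a small, hedged CQL sketch run through the Python driver: it creates a table with a vector column and the Unified Compaction Strategy, adds an SAI index, and issues an approximate-nearest-neighbour query. The keyspace, table, and vector dimension are made up, and the exact syntax should be double-checked against the Cassandra 5.0 documentation for the build you are testing.

# Hypothetical sketch of SAI, vector search, and UCS on a Cassandra 5.0 preview
# cluster, via the DataStax Python driver. Names and dimensions are illustrative;
# verify the syntax against the Cassandra 5.0 docs.
from cassandra.cluster import Cluster

cluster = Cluster(["10.0.0.1"])   # placeholder contact point
session = cluster.connect()

session.execute(
    "CREATE KEYSPACE IF NOT EXISTS demo WITH replication = "
    "{'class': 'SimpleStrategy', 'replication_factor': 1}"
)
session.execute("""
    CREATE TABLE IF NOT EXISTS demo.documents (
        id uuid PRIMARY KEY,
        body text,
        embedding vector<float, 3>   -- tiny dimension, purely for illustration
    ) WITH compaction = {'class': 'UnifiedCompactionStrategy'}
""")

# A Storage-Attached Index on the vector column enables ANN queries.
session.execute(
    "CREATE INDEX IF NOT EXISTS documents_embedding_idx "
    "ON demo.documents (embedding) USING 'sai'"
)

# Approximate nearest-neighbour search ordered by vector similarity.
rows = session.execute(
    "SELECT id, body FROM demo.documents "
    "ORDER BY embedding ANN OF [0.1, 0.2, 0.3] LIMIT 5"
)
for row in rows:
    print(row.id, row.body)

cluster.shutdown()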

Important: Limitations of Cassandra 5.0 Beta 1 on the Instaclustr Managed Platform 

Customers can provision a cluster on the Instaclustr Platform with Cassandra version 5.0 for testing purposes. Note that while Cassandra version 5.0 Beta 1.0 is supported on our platform, there are some limitations in the functionality available, including:   

  • KeySpace/Table-level monitoring and metrics are not available when the vector data type is used. 
  • Trie-indexed SSTables and memtables are not yet supported, as we do not yet support the BTI format for SSTables and memtables. 

Apart from the application-specific limitations (read more about them here), the public preview release comes with the following conditions: 

  • It is not supported for production usage and is not covered by SLAs. The release should be used for testing purposes only. 
  • There is no support for add-ons such as Apache Lucene™, Continuous Backups and others. 
  • No PCI-compliant mode. 

What’s Next? 

Instaclustr will continue to conduct performance baselining and additional testing to offer industry-leading support and work on removing existing limitations in preparation for Cassandra 5.0’s GA release. 

How to Get Started 

Ready to try Cassandra 5.0 out yourself? Set up a non-production cluster using the Instaclustr console by following the easy instructions available here. Choose the Apache Cassandra 5.0 Preview version from the public preview category on the Cassandra setup page to begin your exploration. 

Don’t have an Instaclustr account yet? Sign up for a trial or reach out to our sales team to start exploring Cassandra 5.0. With over 300 million node-hours of management experience, Instaclustr offers unparalleled expertise. Visit our website to learn more about the Instaclustr Managed Platform for Apache Cassandra. 

If you have any issues or questions about provisioning your cluster, please contact Instaclustr Support at any time. 


A Tale of Database Performance Woes: Patrick’s Unlucky Green Fedoras

Database performance is serious business, but why not have a little fun exploring its challenges and complexities? 😉 Here’s a rather fanciful story we presented in Chapter 1 of Database Performance at Scale, a free Open Access book. The technical topics covered here are expanded on throughout the book. But this is the one and only time we talk about poor Patrick. Let his struggles bring you some valuable lessons, solace in your own database performance predicaments… and maybe a few chuckles as well.

***

After losing his job at a FAANG MAANG (MANGA?) company, Patrick decided to strike off on his own and founded a niche online store dedicated to trading his absolute favorite among headwear, green fedoras.

Noticing that a certain NoSQL database was recently trending on the front page of Hacker News, Patrick picked it for his backend stack. After some experimentation with the offering’s free tier, Patrick decided to sign a one-year contract with a major cloud provider to get a significant discount on its NoSQL database-as-a-service offering. With provisioned throughput capable of serving up to 1,000 customers every second, the technology stack was ready and the store opened its virtual doors to the customers.

To Patrick’s disappointment, fewer than ten customers visited the site daily. At the same time, the shiny new database cluster kept running, fueled by a steady influx of money from his credit card and waiting for its potential to be harnessed.

Patrick’s Diary of Lessons Learned, Part I

The lessons started right away:

  • Although some databases advertise themselves as universal, most of them perform best for certain kinds of workloads. The analysis before selecting a database for your own needs must include estimating the characteristics of your own workload: Is it likely to be a predictable, steady flow of requests (e.g., updates being fetched from other systems periodically)? Is the variance high and hard to predict, with the system being idle for potentially long periods of time, with occasional bumps of activity? 
  • Database-as-a-service offerings often let you pick between provisioned throughput and on-demand purchasing. Although the former is more cost-efficient, it incurs a certain cost regardless of how busy the database actually is. The latter costs more per request, but you only pay for what you use. 
  • Give yourself time to evaluate your choice and avoid committing to long-term contracts (even if lured by a discount) before you see that the setup works for you in a sustainable way. 

The First Spike

March 17th seemed like an extremely lucky day. Patrick was pleased to notice lots of new orders starting from the early morning. But as the number of active customers skyrocketed around noon, Patrick’s mood started to deteriorate. This was strictly correlated with the rate of calls he received from angry customers reporting their inability to proceed with their orders.

After a short brainstorming session with himself and a web search engine, Patrick realized, to his dismay, that he lacked any observability tools on his precious (and quite expensive) database cluster. Shortly after frantically setting up Grafana and browsing the metrics, Patrick saw that although the number of incoming requests kept growing, their success rate was capped at a certain level, way below today’s expected traffic. “Provisioned throughput strikes again,” Patrick groaned to himself, while scrolling through thousands of “throughput exceeded” error messages that started appearing around 11am. 
Patrick’s Diary of Lessons Learned, Part II

This is what Patrick learned:

  • If your workload is susceptible to spikes, be prepared for it and try to architect your cluster to be able to survive a temporarily elevated load. Database-as-a-service solutions tend to allow configuring the provisioned throughput in a dynamic way, which means that the threshold of accepted requests can occasionally be raised temporarily to a previously configured level. Or, respectively, they allow it to be temporarily decreased to make the solution slightly more cost-efficient. 
  • Always expect spikes. Even if your workload is absolutely steady, a temporary hardware failure or a surprise DDoS attack can cause a sharp increase in incoming requests. 
  • Observability is key in distributed systems. It allows the developers to retrospectively investigate a failure. It also provides real-time alerts when a likely failure scenario is detected, allowing people to react quickly and either prevent a larger failure from happening, or at least minimize the negative impact on the cluster. 

The First Loss

Patrick hadn’t even managed to recover from the trauma of losing most of his potential income on the only day of the year during which green fedoras experienced any kind of demand when the letter came. It included an angry rant from a would-be customer who had successfully proceeded with his order and paid for it (with a receipt from the payment processing operator as proof), but was now unable to see any details of his order, and was still waiting for the delivery!

Without further ado, Patrick browsed the database. To his astonishment, he didn’t find any trace of the order either. For completeness, Patrick also put his wishful thinking into practice by browsing the backup snapshot directory. It remained empty, as one of Patrick’s initial executive decisions was to save time and money by not scheduling any periodic backup procedures.

How did data loss happen to him, of all people? After studying the consistency model of his database of choice, Patrick realized that there’s a trade-off to make between consistency guarantees, performance, and availability. By configuring the queries, one can either demand linearizability at the cost of decreased throughput, or reduce the consistency guarantees and increase performance accordingly. Higher throughput capabilities were a no-brainer for Patrick a few days ago, but ultimately customer data landed on a single server without any replicas distributed in the system. Once this server failed (which happens to hardware surprisingly often, especially at large scale), the data was gone.

Patrick’s Diary of Lessons Learned, Part III

Further lessons include:

  • Backups are vital in a distributed environment, and there’s no such thing as setting backup routines “too soon.” Systems fail, and backups are there to restore as much of the important data as possible. 
  • Every database system has a certain consistency model, and it’s crucial to take that into account when designing your project. There might be compromises to make. In some use cases (think financial systems), consistency is the key. In other ones, eventual consistency is acceptable, as long as it keeps the system highly available and responsive. 

The Spike Strikes Again

Months went by and Patrick’s sleeping schedule was even beginning to show signs of stabilization. 
With regular backups, a redesigned consistency model, and a reminder set in his calendar for March 16th to scale up the cluster to manage elevated traffic, he felt moderately safe. If only he knew that a ten-second video of a cat dressed as a leprechaun had just gone viral in Malaysia… which, taking time zone into account, happened around 2am Patrick’s time, ruining the aforementioned sleep stabilization efforts.

On the one hand, the observability suite did its job and set off a warning early, allowing for a rapid response. On the other hand, even though Patrick reacted on time, databases are seldom able to scale instantaneously, and his system of choice was no exception in that regard. The spike in concurrency was very high and concentrated, as thousands of Malaysian teenagers rushed to bulk-buy green hats in pursuit of ever-changing Internet trends.

Patrick was able to observe a real-life instantiation of Little’s Law, which he vaguely remembered from his days at the university. With a beautifully concise formula, L = λW, the law can be simplified to the fact that concurrency equals throughput times latency.
TIP: For those having trouble remembering the formula, think units. Concurrency is just a number, latency can be measured in seconds, while throughput is usually expressed in 1/s. Then, it stands to reason that in order for the units to match, concurrency should be obtained by multiplying latency (seconds) by throughput (1/s). You’re welcome!
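Here is a quick numerical sketch of the same relationship; the throughput limit and latency figures are made up purely for illustration.

# Little's Law: L = lambda * W, i.e. concurrency = throughput * latency.
# The numbers below are illustrative only.
throughput_limit = 1_000   # requests/second the hardware can sustain
latency = 0.2              # seconds per request under light load

concurrency = throughput_limit * latency
print(concurrency)         # 200 in-flight requests keep the system at its limit

# Once throughput is pinned at its hardware limit, extra concurrency
# shows up as extra latency: W = L / lambda.
spike_concurrency = 5_000
latency_during_spike = spike_concurrency / throughput_limit
print(latency_during_spike)  # 5.0 seconds -- well past most users' patience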
Throughput depends on the hardware and naturally has its limits (e.g., you can’t expect an NVMe drive purchased in 2023 to serve data for you in terabytes per second, although we are crossing our fingers for this assumption to be invalidated in the near future!). Once the limit is hit, you can treat it as a constant in the formula. It’s then clear that as concurrency rises, so does latency. For the end users (Malaysian teenagers in this scenario), it means that the latency is eventually going to cross the magic barrier of average human perception, a few seconds. Once that happens, users get too frustrated and simply give up on trying altogether, assuming that the system is broken beyond repair. It’s easy to find online articles quoting that “Amazon found that 100ms of latency costs them 1 percent in sales”; although it sounds overly simplified, it is also true enough.

Patrick’s Diary of Lessons Learned, Part IV

The lessons continue:

  • Unexpected spikes are inevitable, and scaling out the cluster might not be swift enough to mitigate the negative effects of excessive concurrency. Expecting the database to handle it properly is not without merit, but not every database is capable of that. If possible, limit the concurrency in your system as early as possible. For instance, if the database is never touched directly by customers (which is a very good idea for multiple reasons) but instead is accessed through a set of microservices under your control, make sure that the microservices are also aware of the concurrency limits and adhere to them. 
  • Keep in mind that Little’s Law exists; it’s fundamental knowledge for anyone interested in distributed systems. Quoting it often also makes you appear exceptionally smart among peers. 

Backup Strikes Back

After redesigning his project yet again to take expected and unexpected concurrency fluctuations into account, Patrick happily waited for his fedora business to finally become ramen profitable. Unfortunately, the next March 17th didn’t go as smoothly as expected either. Patrick spent most of the day enjoying steady Grafana dashboards, which kept assuring him that the traffic was under control and well within what the cluster could handle, with a healthy safety margin. But then the dashboards stopped, kindly mentioning that the disks had become severely overutilized. This seemed completely out of place given the observed concurrency. While looking for the possible source of this anomaly, Patrick noticed, to his horror, that the scheduled backup procedure coincided with the annual peak load…

Patrick’s Diary of Lessons Learned, Part V

Concluding thoughts:

  • Database systems are hardly ever idle, even without incoming user requests. Maintenance operations often happen and you must take them into consideration because they’re an internal source of concurrency and resource consumption. 
  • Whenever possible, schedule maintenance operations for times with expected low pressure on the system. 
  • If your database management system supports any kind of quality-of-service configuration, it’s a good idea to investigate such capabilities. For instance, it might be possible to set a strong priority for user requests over regular maintenance operations, especially during peak hours. Respectively, periods with low user-induced activity can be utilized to speed up background activities. 
In the database world, systems that use a variant of LSM trees for the underlying storage need to perform quite a bit of compaction (a kind of maintenance operation on the data) in order to keep read/write performance predictable and steady.

The end. 

Instaclustr Product Update: March 2024

Check out all the new features and updates added to our platform over the last few months!     

As always, if you have specific feature requests or improvement ideas you’d love to see, please get in touch with us. 

Major Announcements: 

Apache Cassandra® 

NetApp is excited to announce that the Instaclustr Platform now supports the use of AWS PrivateLink to connect to Apache Cassandra clusters running on the platform. This new feature provides customers running on AWS with a more secure option for inter-VPC connectivity. Log into the console today to include support for AWS PrivateLink with your AWS Managed Cassandra clusters with just one click! 

OpenSearch® 

With the release of searchable snapshots, customers can now utilize the feature as a solution for storing and searching data in a more cost- and time-efficient way. Interested in trying out searchable snapshots for yourself? Reach out to our friendly Support team today and get started! 

Other Significant Changes: 

Apache Cassandra 

  • Customer-initiated resize (vertical scaling) with GCP and Azure is now available for Apache Cassandra® on the Instaclustr Managed Platform, allowing customers to add more disk space or scale processing capacity up and down on demand, without downtime to their services and without the intervention of Instaclustr Support. 

Apache Kafka® 

  • Instaclustr for Apache Kafka® with KRaft support and Apache Kafka® Connect 3.6.1 were released as generally available on the Instaclustr Platform. This is the first release on our managed platform to support KRaft mode as GA. 
  • We’ve added support for custom Subject Alternative Names (SAN) on AWS, making it easier for our customers to use internal DNS to connect to their managed Kafka clusters without needing to use IP addresses. Customers wanting to make use of this should reach out to us via our Support team. Self-serve capabilities are coming soon. 

OpenSearch 

PostgreSQL® 

  • Support for the pgvector open source extension for similarity search is now available as an add-on to Instaclustr for PostgreSQL. The pgvector extension provides the ability to store and search ML-generated vector data (a brief sketch follows this list). 
  • Azure Custom Virtual Network (VNet) support has been added to PostgreSQL clusters, enabling Instaclustr Managed PostgreSQL clusters and many types of Azure resources to securely communicate with each other, the internet, and on-premises networks. 
  • High-Availability Multi-Region Replication is now supported on Instaclustr Managed PostgreSQL. This feature provides customers with a robust, reliable backup, even in the face of a complete regional outage. 
  • PostgreSQL versions 16.1, 15.5, 14.10 and 13.13 are now in general availability. 
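As a quick taste of what the pgvector extension mentioned above enables, here is a minimal sketch using psycopg2; the connection string, table name, and vector dimension are placeholders, and on a managed service the extension may need to be enabled through the provider’s tooling rather than with CREATE EXTENSION.

# Minimal pgvector sketch via psycopg2. Connection details, table name, and
# vector dimension are placeholders; on a managed service the extension may
# need to be enabled through the provider's tooling instead.
import psycopg2

conn = psycopg2.connect("host=localhost dbname=demo user=demo password=demo")
cur = conn.cursor()

cur.execute("CREATE EXTENSION IF NOT EXISTS vector")
cur.execute("CREATE TABLE IF NOT EXISTS items (id serial PRIMARY KEY, embedding vector(3))")
cur.execute("INSERT INTO items (embedding) VALUES ('[1,2,3]'), ('[2,3,4]')")

# Nearest-neighbour search by Euclidean distance using pgvector's <-> operator.
cur.execute("SELECT id FROM items ORDER BY embedding <-> '[1,2,2]' LIMIT 1")
print(cur.fetchone())

conn.commit()
cur.close()
conn.close()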

Ocean for Apache Spark™ 

  • Since customers are now executing many Spark Streaming workloads on Ocean Spark, our product was optimized to accommodate these longer-running applications. The console was redesigned so that streaming workloads are more easily distinguished from batch workloads. The zoom-widget and axes on charts were upgraded so that data scientists can more easily identify errors and performance bottlenecks. Log collection techniques were enhanced so that more data is available for troubleshooting.   
  • We have completed our external audit and received a report that Ocean Spark complies with SOC 2 Type I. 

Instaclustr Managed Platform 

  • Added support for Microsoft Social SSO on the Instaclustr Platform, allowing users to securely and effortlessly sign up and access the Instaclustr Platform using their Microsoft credentials. 
  • Added support for Hyderabad (AWS, ap-south-2) and Dammam (GCP, me-central2) regions for Cassandra, OpenSearch, and Kafka. 

Upcoming Releases: 

Apache Cassandra 

  • Instaclustr by NetApp is currently developing support for Apache Cassandra 5.0. This new version will offer our customers access to exciting new features such as vector search and storage-attached indexes. The new version will soon be released for public preview on the Instaclustr Managed Platform and can be used for testing purposes in a non-production environment. Keep an eye out for updates and a release announcement. 

Cadence® 

  • Shared Production infrastructure is to be introduced soon, enabling a lower entry price point for Cadence production usage. This will extend the Developer Shared Infrastructure offering to support 2-node Cadence clusters with new node sizes. 

Ocean for Apache Spark 

  • Application Configuration Templates will be enhanced with finer-grained permissions. This will enable administrators to better partition memory and processor resources among teams, enforce standards for running applications, and better protect security configurations.

OpenSearch 

  • StorageGRID support for OpenSearch will be enabled on the Instaclustr Managed Platform, allowing Instaclustr by NetApp to offer an on-prem OpenSearch product with an S3-compatible, on-prem object storage solution for OpenSearch backups. 

Instaclustr Managed Platform 

  • Instaclustr will be soon accessible as a SaaS subscription on the Azure Marketplace, allowing customers to allocate Instaclustr spend towards any Microsoft commit they might have. Stay tuned for the announcements from Instaclustr. 
  • Instaclustr will soon release customer-viewable action logs—a new feature that allows permitted users to view and search the log of actions performed by all the users within an account or organization. This feature will be helpful for audit and compliance activities. 

Instaclustr Enterprise Support 

  •  We will soon offer Enterprise Support for Apache Spark™. Keep an eye out for the announcement on our website. 

Did you know? 

Recently, Instaclustr by NetApp completed the largest new customer onboarding migration exercise in our history—and it’s quite possibly the largest Apache Cassandra and Apache Kafka migration exercise ever completed by anyone! In this blog we walk through the overall process and provide details of our approach. 


IoT Overdrive Part 1: Compute Cluster Running Apache Cassandra® and Apache Kafka®

Throughout this series, we’ll be walking through the design, implementation, and use of a compute cluster composed (as completely as possible) of open source hardware and software. But what is a compute cluster? And what do I mean by open source hardware and software?  

In Part 1, we’ll give a high-level overview of the project, what it can do, and where we’ll be taking this series and the project in the future.  

Let’s start with some definitions that will make explaining the project a bit easier. 

What Is a Compute Cluster? 

A compute cluster is defined as 2 or more computers programmed to act as one computer. Our compute cluster will host several APIs, but users will not know what each node does, or even see the nodes; any request made to the cluster will return a single API response. A compute cluster does not need all of its computers to be the same type; it can be heterogeneous (like the cluster we’ll describe in this post). 

Open Source Hardware 

From the Open Source Hardware Association (OSHWA) website:

“Open source hardware is hardware whose design is made publicly available so that anyone can study, modify, distribute, make, and sell the design or hardware based on that design. The hardware’s source, the design from which it is made, is available in the preferred format for making modifications to it.  

“Ideally, open source hardware uses readily available components and materials, standard processes, open infrastructure, unrestricted content, and open source design tools to maximize the ability of individuals to make and use hardware.  

“Open source hardware gives people the freedom to control their technology while sharing knowledge and encouraging commerce through the open exchange of designs.” 

In short, open source hardware has its complete design publicly available and allows anyone to modify or re-distribute it.  

Now let’s take a closer look at a subset of hardware: single-board computers. 

Single-Board Computers 

These computers are usually credit-card sized and contain everything you need to run operating systems like Debian, Android, and more. 

You may have heard of the very notable Raspberry Pi; this is one of the more well-known examples. There are also other boards that meet this definition, such as the BeagleBone and Orange Pi series of computers. 

Raspberry Pi (Source: Wikipedia)

Now that we’ve covered the vocabulary, let’s look at what the project is in detail. 

What Is the Project About? 

The elevator pitch is this: a compute cluster made from open source hardware and software that hosts an Apache Cassandra® database and an Apache Kafka® messaging system. The objective was to see if a physical cluster could be created that even remotely represented what it’s like to manage clusters in the cloud.  

Not an easy task; doing it adds at least a couple of layers of complexity to the project. But rather than seeing that as a negative, continue reading to find out how it was turned into a positive.  

A Way to Learn About New Technologies 

One of the many things someone working in Developer Relations (DevRel) should know (in my non-expert opinion) is how they learn best and how to pick up new technologies quickly.  

I pick things up best by playing with them and building things. In many ways, this project offered me a lot of hands-on experience with many of the technologies needed in my day-to-day work including but not limited to: 

  • Apache Cassandra
  • Apache Kafka
  • KRaft
  • Ansible
  • Docker
  • Docker Compose
  • Docker Swarm
  • Linux utilities and command line tools
  • Networking

New technologies are being added to this list; it’s been a wonderful deep dive into deploying Cassandra clusters and Kafka swarms.  

At the beginning it was unclear exactly how many technologies would be added to my repertoire of “skills I know just enough to be dangerous with”, and that’s because of the nature of the project.

An Experiment in IoT  

This project should definitely be filed under “experimental”: Version 1.0 was all about not knowing for sure if it would even work.  

But what makes these kinds of projects exciting is the fear of failure. Luckily, it did work (sort of; more on that in a bit), and building onto it made it work well. I will be using a lot of IoT technologies in building and monitoring the cluster, including I2C drivers, a lot of CLI command parsing, and some embedded programming. 

A Fun Demo! 

Sometimes a cool demo is all it takes to bring people into a booth or a presentation. 

Creating four racks of computers that fit on a desk and network together to host a distributed database and messaging system was challenging, interesting, and most definitely a good time! 

What This Project Does 

This project hosts an Apache Cassandra database and an Apache Kafka message broker. The Cassandra database is accessible through a Stargate API (more on that below) and the Kafka system can be accessed directly. This means this project could be used as a locally deployed database and messaging setup, with possibilities for physically distributed hardware. 

Apache Cassandra 

The database is currently designed to hold health data sent from the nodes and the overall cluster itself. Visualizations and alerts using this data will be built at a later stage; watch out for a future blog post on this. 
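
As an illustration only, a health-data table for this kind of cluster might look like the sketch below, created here with the DataStax Python driver. The keyspace, table, and column names are assumptions made for the example, not the project's actual schema, and the contact point address is a placeholder.

    from cassandra.cluster import Cluster

    # Connect to one of the worker nodes running Cassandra (address is a placeholder).
    cluster = Cluster(["192.168.1.21"])
    session = cluster.connect()

    # A simple time-series layout: one partition per node, newest readings first.
    session.execute("""
        CREATE KEYSPACE IF NOT EXISTS cluster_health
        WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3}
    """)
    session.execute("""
        CREATE TABLE IF NOT EXISTS cluster_health.node_health (
            node_id       text,
            recorded_at   timestamp,
            cpu_percent   double,
            temperature_c double,
            PRIMARY KEY (node_id, recorded_at)
        ) WITH CLUSTERING ORDER BY (recorded_at DESC)
    """)
    cluster.shutdown()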

Stargate API 

The Stargate project gives us a GraphQL API over our Cassandra data without any extra programming steps. You can learn more on the project's website. 
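
As a rough sketch of what that looks like from a client: the ports and paths below are the defaults described in the Stargate documentation, while the host, credentials, keyspace, table, and columns are placeholders rather than the project's actual setup.

    import requests

    STARGATE = "http://192.168.1.20"  # placeholder address of the Stargate node

    # Obtain an auth token from Stargate's auth endpoint.
    auth = requests.post(
        f"{STARGATE}:8081/v1/auth",
        json={"username": "cassandra", "password": "cassandra"},
    )
    token = auth.json()["authToken"]

    # Query a hypothetical node_health table through the generated GraphQL schema.
    query = """
    {
      node_health(options: { limit: 5 }) {
        values { node_id cpu_percent recorded_at }
      }
    }
    """
    resp = requests.post(
        f"{STARGATE}:8080/graphql/cluster_health",
        json={"query": query},
        headers={"x-cassandra-token": token},
    )
    print(resp.json())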

Apache Kafka 

Kafka is the messaging system currently used to send data from users and cluster APIs to the Cassandra database. There are producers and consumers running on the cluster itself, creating and consuming topics and messages. 
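
For example, a minimal producer/consumer pair using the kafka-python library might look like this sketch (the broker address and topic name are placeholders):

    import json
    from kafka import KafkaProducer, KafkaConsumer

    BROKER = "192.168.1.21:9092"  # placeholder: any Kafka broker in the swarm

    # Produce a health reading onto a topic.
    producer = KafkaProducer(
        bootstrap_servers=BROKER,
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )
    producer.send("cluster-health", {"node_id": "worker-1", "cpu_percent": 37.2})
    producer.flush()

    # Consume it back, stopping after 5 seconds of silence.
    consumer = KafkaConsumer(
        "cluster-health",
        bootstrap_servers=BROKER,
        auto_offset_reset="earliest",
        consumer_timeout_ms=5000,
    )
    for message in consumer:
        print(message.value)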

Version 1.0: the Prototype 

The prototype was made mostly from my own boards that were lying around (which is a lot of boards; more details below). Wi-Fi was used for networking. 

Hardware 

Here is a list of the hardware: 

  • 3 Orange Pi Zero 2 computers
  • 3 Raspberry Pi 4 computers with 8GB memory

The Raspberry Pi computers were used as the “workers” (running the Cassandra and Kafka systems), and the Orange Pi computers as the “managers” (running the swarm management software; see Docker Swarm below for more).  

Each node was hand-configured with either Armbian (on the Orange Pi computers) or Raspberry Pi OS (on the Raspberry Pi computers). 

A fourth Raspberry Pi 4 was added to serve as an admin machine for the cluster. The initial setup looked something like the following picture: 

Source: Kassian Wren

Software

For the software, we needed something that could manage a cluster of nodes without a ton of overhead. We decided on using Docker, Docker Compose, and Docker Swarm for the prototype. These tools were chosen because they’re open source and very well-supported.

Docker

Docker is a container technology that lets us run containerized workloads on Linux operating systems such as the ones on our Pi computers. Docker also has some companion tools that make it exceptionally useful for this project: Docker Compose and Docker Swarm.

Docker Compose 

Docker Compose allows us to describe multi-container services, such as our Cassandra and Kafka clusters, in a single file. These services are then deployed as Docker containers on the Pi computers.  

Docker Swarm 

Docker Swarm allows us to network together all of our containers and run them as a swarm. We then use Docker Compose with Swarm to apply services to our Docker container cluster.  

A Compose file is used to describe the Cassandra and Kafka services as well as any others, with Dockerfiles defining the underlying container images where needed. We then used Docker Swarm with Compose to create services on the cluster from that file. 

Putting It All Together 

Getting this working was relatively straightforward, even if it appears a bit complicated at first: 

  • Write an operating system image to each board via a MicroSD card (the official Ubuntu image for the Orange Pi computers, Raspberry Pi OS for the Raspberry Pi computers)
  • Log into each board using SSH 
  • Perform basic setup and system updates 
  • Install Docker onto each board (except the admin machine)
  • Initialize a Docker Swarm and add the workers and managers using the CLI utility (see the sketch after this list)
  • Create a Compose file that deploys the Cassandra and Kafka services to the worker nodes
  • Run Docker Compose with that file to create the services
  • Test Cassandra with cqlsh
  • Test Kafka by creating a producer and a consumer from the CLI 
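
To make the swarm step a little more concrete, here is a minimal sketch of initializing a swarm and joining a worker using the Docker SDK for Python, which wraps the same operations as the docker CLI. The IP addresses are placeholders, the two functions are assumed to run on the manager and on each worker respectively, and the actual cluster was set up with the CLI over SSH rather than this script.

    import docker

    def init_swarm_on_manager() -> str:
        """Run on the first manager node: initialize the swarm, return the worker join token."""
        client = docker.from_env()
        client.swarm.init(advertise_addr="192.168.1.10", listen_addr="0.0.0.0:2377")
        return client.swarm.attrs["JoinTokens"]["Worker"]

    def join_swarm_as_worker(join_token: str) -> None:
        """Run on each worker node: join the swarm advertised by the manager."""
        client = docker.from_env()
        client.swarm.join(remote_addrs=["192.168.1.10:2377"], join_token=join_token)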

And after all that preparation, it all worked with minimal tweaking or configuration. Here’s a picture of an in-progress cluster running the software via Docker: 

Source: Kassian Wren

Up and Running 

Once everything was up and running, the next thought I had was how to test it. Cassandra has cqlsh, an easy way to connect to and access the database.   

However, it was…a bit slow. Very slow in fact (it took several seconds to run queries)!  The good news was that it looked like my prototype worked, but there were a few issues that made its operation less than ideal. 
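
To give a sense of the kind of quick check involved, here is a sketch of timing a trivial query from the admin machine with the DataStax Python driver; cqlsh does the same job interactively, and the contact point below is a placeholder.

    import time
    from cassandra.cluster import Cluster

    cluster = Cluster(["192.168.1.21"])  # placeholder worker-node address
    session = cluster.connect()

    start = time.perf_counter()
    row = session.execute("SELECT release_version FROM system.local").one()
    elapsed = time.perf_counter() - start

    # On the Wi-Fi prototype, this kind of round trip was painfully slow.
    print(f"Cassandra {row.release_version} answered in {elapsed:.3f}s")
    cluster.shutdown()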

Lessons Learned 

A few good lessons were learned in the time spent planning, creating, and building Version 1.0:  

Need More Compute Power! 

It turns out running a database and messaging system takes resources, and there just weren’t enough to go around in my prototype. It needed to scale, not just vertically but horizontally. 

Automation Is Your Friend…

Using Secure Shell (SSH) to access each node and configure it by hand was a bit of a nightmare. There needs to be a way to automate this process for the Pi computers in a repeatable, standard way. 
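
The technology list earlier in this post already includes Ansible, which is the standard tool for exactly this. Purely to illustrate the idea in Python, a throwaway script using the Fabric SSH library could loop over the boards and run the same setup commands on each; the hostnames, user, and package choices below are placeholders, not the project's actual configuration.

    from fabric import Connection

    HOSTS = ["pi-worker-1.local", "pi-worker-2.local", "pi-worker-3.local"]  # placeholders

    for host in HOSTS:
        conn = Connection(host, user="pi")
        # Basic setup: update packages and install Docker from the distro repository.
        conn.sudo("apt-get update")
        conn.sudo("apt-get install -y docker.io")
        conn.sudo("usermod -aG docker pi")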

…but Wi-Fi Isn’t 

Part of the speed issue was running the cluster on Wi-Fi; the network connections between nodes had high latency and failed intermittently. We will need to switch to ethernet for future iterations.

Too Many Cables! 

An issue in the prototype that would get in the way of any horizontal scaling was too many power cables. Each Pi needed one, and the router needed one too, so unless we wanted to get several power strips, we were going to have to come up with another solution. 

Summary 

Time for more Pi? 

It’s been an interesting experience learning all the technologies involved in building my prototype as well as getting it up and running. It hasn’t all been easy but it’s not over yet. 

Version 1.0 was only the beginning—the next evolution of the experiment is in progress. 

So, watch out for Part 2 in this series where I share what happened next! 

 

The post IoT Overdrive Part 1: Compute Cluster Running Apache Cassandra® and Apache Kafka® appeared first on Instaclustr.

DynamoDB – How to Move Out?

What does a DynamoDB migration really look like? Should you dual-write? Are there any tools to help?

Moving data from one place to another is conceptually simple. You simply read from one datasource and write to another. However, doing that consistently and safely is another story. There are a variety of mistakes you can make if you overlook important details.

We recently discussed the top reasons so many organizations are currently seeking DynamoDB alternatives. Beyond costs (the most frequently-mentioned factor), aspects such as throttling, hard limits, and vendor lock-in are frequently cited as motivation for a switch.

But what does a migration from DynamoDB to another database look like? Should you dual-write? Are there any available tools to assist you with that? What are the typical do’s and don’ts? In other words, how do you move out from DynamoDB?

In this post, let’s start with an overview of how database migrations work, cover specific and important characteristics related to DynamoDB migrations, and then discuss some of the strategies employed to integrate with and migrate data seamlessly to other databases.

How Database Migrations Work

Most database migrations follow a strict set of steps to get the job done. First, you start capturing all changes made to the source database. This guarantees that any data modifications (or deltas) can be replayed later. Second, you simply copy data over. You read from the source database and write to the destination one. A variation is to export a source database backup and simply side-load it into the destination database.

Past the initial data load, the target database will contain most of the records from the source database, except the ones that have changed during the period of time it took for you to complete the previous step. Naturally, the next step is to simply replay all deltas generated by your source database to the destination one. Once that completes, both databases will be fully in sync, and that’s when you may switch your application over.

To Dual-Write or Not?

If you are familiar with Cassandra migrations, then you have probably been introduced to the recommendation of simply “dual-writing” to get the job done. That is, you would proxy every writer mutation from your source database to also apply the same records to your target database.

Unfortunately, not every database implements the concept of allowing a writer to retrieve or manipulate the timestamp of a record like the CQL protocol allows. This prevents you from implementing dual-writes in the application while back-filling the target database with historical data. If you attempt to do that, you will likely end up with an inconsistent migration, where some target Items may not reflect their latest state in your source database.

Wait… Does it mean that dual-writing in a migration from DynamoDB is just wrong? Of course not! Consider that your DynamoDB table expires records (TTL) every 24h. In that case, it doesn’t make sense to back-fill your database: simply dual-write and, past the TTL period, switch your readers over. If your TTL is longer (say a year), then waiting for it to expire won’t be the most efficient way to move your data over.

Back-Filling Historical Data

Whether or not you need to back-fill historical data primarily depends on your use case. Yet, we can easily reason around the fact that it typically is a mandatory step in most migrations.

There are 3 main ways for you to back-fill historical data from DynamoDB:

ETL

ETL (extract-transform-load) is essentially what a tool like Apache Spark does. It starts with a Table Scan and reads a single page worth of results. The results are then used to infer your source table’s schema. Next, it spawns readers to consume from your DynamoDB table as well as writer workers to ingest the retrieved data into the destination database.

This approach is great for carrying out simple migrations and also lets you transform (the T in ETL 🙂) your data as you go. However, it is unfortunately prone to some problems. For example:

  • Schema inference: DynamoDB tables are schemaless, so it’s difficult to infer a schema. All of a table’s attributes (other than your hash and sort keys) might not be present within the first page of the initial scan. Plus, a given Item might not project all the attributes present within another Item.
  • Cost: Since extracting data requires a DynamoDB table full scan, it will inevitably consume RCUs. This will ultimately drive up migration costs, and it can also introduce an upstream impact to your application if DynamoDB runs out of capacity.
  • Time: The time it takes to migrate the data is proportional to your data set size. This means that if your migration takes longer than 24 hours, you may be unable to directly replay from DynamoDB Streams after, given that this is the period of time that AWS guarantees the availability of its events.

Table Scan

A table scan, as the name implies, involves retrieving all records from your source DynamoDB table and only then loading them into your destination database. Unlike the previous ETL approach, where both the “Extract” and “Load” pieces are coupled and data gets written as you go, here each step is carried out in a phased way.

The good news is that this method is extremely simple to wrap your head around. You run a single command. Once it completes, you’ve got all your data! For example:

$ aws dynamodb scan --table-name source > output.json

You’ll then end up with a single JSON file containing all existing Items within your source table, which you may then simply iterate through and write to your destination. Unless you are planning to transform your data, you shouldn’t need to worry about the schema (since you already know beforehand that all Key Attributes are present).

This method works very well for small to medium-sized tables, but – as with the previous ETL method – it may take considerable time to scan larger tables. And that’s not accounting for the time it will take you to parse it and later load it to the destination.

S3 Data Export

If you have a large dataset or are concerned with RCU consumption and the impact to live traffic, you might rely on exporting DynamoDB data to Amazon S3. This allows you to easily dump your tables’ entire contents without impacting your DynamoDB table performance. In addition, you can request incremental exports later, in case the back-filling process took longer than 24 hours.

To request a full DynamoDB export to S3, simply run:

$ aws dynamodb export-table-to-point-in-time --table-arn arn:aws:dynamodb:REGION:ACCOUNT:table/TABLE_NAME --s3-bucket BUCKET_NAME --s3-prefix PREFIX_NAME --export-format DYNAMODB_JSON

The export will then run in the background (assuming the specified S3 bucket exists).
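
If you are scripting this step rather than using the CLI, a roughly equivalent boto3 sketch might look like the following; the region, table ARN, bucket, and prefix are placeholders, not values taken from this article, and the table is assumed to have point-in-time recovery enabled (a prerequisite for S3 exports).

    import time
    import boto3

    dynamodb = boto3.client("dynamodb", region_name="us-east-1")

    # Request a full export to S3.
    export = dynamodb.export_table_to_point_in_time(
        TableArn="arn:aws:dynamodb:us-east-1:123456789012:table/source",
        S3Bucket="my-export-bucket",
        S3Prefix="dynamodb-exports/source",
        ExportFormat="DYNAMODB_JSON",
    )
    export_arn = export["ExportDescription"]["ExportArn"]

    # Poll until the export leaves the IN_PROGRESS state.
    while True:
        description = dynamodb.describe_export(ExportArn=export_arn)["ExportDescription"]
        if description["ExportStatus"] != "IN_PROGRESS":
            print("Export finished with status:", description["ExportStatus"])
            break
        time.sleep(30)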

To check for its completion from the CLI, run:

$ aws dynamodb list-exports --table-arn arn:aws:dynamodb:REGION:ACCOUNT:table/source
{
    "ExportSummaries": [
        {
            "ExportArn": "arn:aws:dynamodb:REGION:ACCOUNT:table/TABLE_NAME/export/01706834224965-34599c2a",
            "ExportStatus": "COMPLETED",
            "ExportType": "FULL_EXPORT"
        }
    ]
}

Once the process is complete, your source table’s data will be available within the S3 bucket/prefix specified earlier. Inside it, you will find a directory named AWSDynamoDB, under a structure that resembles something like this:

$ tree AWSDynamoDB/
AWSDynamoDB/
└── 01706834981181-a5d17203
    ├── _started
    ├── data
    │   ├── 325ukhrlsi7a3lva2hsjsl2bky.json.gz
    │   ├── 4i4ri4vq2u2vzcwnvdks4ze6ti.json.gz
    │   ├── aeqr5obfpay27eyb2fnwjayjr4.json.gz
    │   ├── d7bjx4nl4mywjdldiiqanmh3va.json.gz
    │   ├── dlxgixwzwi6qdmogrxvztxzfiy.json.gz
    │   ├── fuukigkeyi6argd27j25mieigm.json.gz
    │   ├── ja6tteiw3qy7vew4xa2mi6goqa.json.gz
    │   ├── jirrxupyje47nldxw7da52gnva.json.gz
    │   ├── jpsxsqb5tyynlehyo6bvqvpfki.json.gz
    │   ├── mvc3siwzxa7b3jmkxzrif6ohwu.json.gz
    │   ├── mzpb4kukfa5xfjvl2lselzf4e4.json.gz
    │   ├── qs4ria6s5m5x3mhv7xraecfydy.json.gz
    │   ├── u4uno3q3ly3mpmszbnwtzbpaqu.json.gz
    │   ├── uv5hh5bl4465lbqii2rvygwnq4.json.gz
    │   ├── vocd5hpbvmzmhhxz446dqsgvja.json.gz
    │   └── ysowqicdbyzr5mzys7myma3eu4.json.gz
    ├── manifest-files.json
    ├── manifest-files.md5
    ├── manifest-summary.json
    └── manifest-summary.md5

2 directories, 21 files

So how do you restore from these files? Well… you need to use the DynamoDB Low-level API. Thankfully, you don’t need to dig through its details since AWS provides the LoadS3toDynamoDB sample code as a way to get started. Simply override the DynamoDB connection with the writer logic of your target database, and off you go!

Streaming DynamoDB Changes

Whether or not you require back-filling data, chances are you want to capture events from DynamoDB to ensure both databases will get in sync with each other. Enter DynamoDB Streams. This can be used to capture changes performed in your source DynamoDB table. But how do you consume its events?

DynamoDB Streams Kinesis Adapter

AWS provides the DynamoDB Streams Kinesis Adapter to allow you to process events from DynamoDB Streams via the Amazon Kinesis Client Library (such as the kinesis-asl module in Apache Spark). Beyond the historical data migration, simply stream events from DynamoDB to your target database. After that, both datastores should be in sync.

Although this approach may introduce a steep learning curve, it is by far the most flexible one. It even lets you consume events from outside the AWS ecosystem (which may be particularly important if you’re switching to a different provider). For more details on this approach, AWS provides a walkthrough on how to consume events from a source DynamoDB table to a destination one.

AWS Lambda

Lambda functions are simple to get started with, handle all checkpointing logic on their own, and seamlessly integrate with the AWS ecosystem. With this approach, you simply encapsulate your application logic inside a Lambda function. That lets you write events to your destination database without having to deal with the Kinesis API logic, such as checkpointing or the number of shards in a stream.

When taking this route, you can load the captured events directly into your target database. Or, if the 24-hour retention limit is a concern, you can simply stream and retain these records in another service, such as Amazon SQS, and replay them later. The latter approach is well beyond the scope of this article.
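
As an illustration only, a Python handler wired to the table's stream might look like the sketch below. The event structure is the standard batch format Lambda receives from DynamoDB Streams (assuming a NEW_AND_OLD_IMAGES stream view); the two writer functions are placeholders for your destination database's client, not part of any AWS API.

    def write_to_destination(keys, new_image):
        # Placeholder: translate the DynamoDB-JSON image and upsert it into the target database.
        print("upsert", keys, new_image)

    def delete_from_destination(keys):
        # Placeholder: delete the corresponding record from the target database.
        print("delete", keys)

    def handler(event, context):
        # Lambda delivers DynamoDB Streams records in batches under event["Records"].
        for record in event["Records"]:
            change = record["dynamodb"]
            if record["eventName"] == "REMOVE":
                delete_from_destination(change["Keys"])
            else:  # INSERT or MODIFY
                write_to_destination(change["Keys"], change["NewImage"])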

For examples of how to get started with Lambda functions, see the AWS documentation.

Final Remarks

Migrating from one database to another requires careful planning and a thorough understanding of all steps involved during the process. Further complicating the matter, there’s a variety of different ways to accomplish a migration, and each variation brings its own set of trade-offs and benefits.

This article provided an in-depth look at how a migration from DynamoDB works, and how it differs from other databases. We also discussed different ways to back-fill historical data and stream changes to another database. Finally, we ran through an end-to-end migration, leveraging AWS tools you probably already know.

At this point, you should have all the tools and tactics required to carry out a migration on your own. If you have any questions or require assistance with migrating your DynamoDB workloads to ScyllaDB, remember: we are here to help.

Avi Kivity and Dor Laor AMA: “Tablets” replication, all things async, caching, object storage & more

Even though the ScyllaDB monster has tentacles instead of feet, we still like to keep our community members on their toes. From a lively ScyllaDB Lounge, to hands-on ScyllaDB Labs, to nonstop speaker engagement (we racked up thousands of chat interactions), this was not your typical virtual event. But it’s pretty standard for the conferences we host – both ScyllaDB Summit and P99 CONF. Still, we surprised even the seasoned sea monsters by launching a previously unannounced session after the first break: an “Ask Me Anything” session with ScyllaDB Co-founders Dor Laor (CEO) and Avi Kivity (CTO).

The discussion covered topics such as the new tablets replication algorithm (replacing Vnodes), ScyllaDB’s async shard-per-core architecture, strategies for tapping object storage with S3, the CDC roadmap, and comparing database deployment options. Here’s a taste of that discussion…

Why do tablets matter for me, the end user?

Avi: But that spoils my keynote tomorrow. 😉 One of the bigger problems that tablets can solve is the all-or-nothing approach to scaling that we have with Vnodes. Now, if you add a node, it’s all or nothing. You start streaming, and it can take hours, depending on the data shape. When it succeeds, you’ve added a node and only then it will be able to serve coordinator requests. With tablets, bootstrapping a node takes maybe a minute, and at the end of that minute, you have a new node in your cluster. But, that node doesn’t hold any data yet. The load balancer notices that an imbalance exists and starts moving tablets from the overloaded nodes toward that new node. From that, several things follow:

  • You will no longer have that all-or-nothing situation. You won’t have to wait for hours to know if you succeeded or failed.
  • As soon as you add a node, it starts shouldering part of the load. Imagine you’re stressed for CPU or you’re reaching the end of your storage capacity. As soon as you add the node, it starts taking part in the cluster workload, reducing CPU usage from your other nodes. Storage starts moving to the other nodes… you don’t have to wait for hours.
  • Since adding a node takes just a minute, you can add multiple nodes at the same time. Raft’s consistent topology allows us to linearize parallel requests and apply them almost concurrently.

After that, existing nodes will stream data to the new ones, and system load will eventually converge to a fair distribution as the process completes. So if you’re adding a node in response to an increase in your workload the database will be able to quickly react. You don’t get the full impact of adding the node immediately; that still takes time until the data is streamed. However, you get incremental capacity immediately, and it grows as you wait for data to be streamed. A new tablet, weighing only 5GB, scheduled on a new node will take 2 minutes to be operational and relieve the load on the original servers.

Dor: Someone else just asked about migrating from Vnodes to tablets for existing deployments. Migration won’t happen automatically. In ScyllaDB 6.0, we’ll have tablets enabled for new clusters by default. But for existing clusters, it’s up to you to enable it. In ScyllaDB 6.1, tablets will likely be the default. We’ll have an automatic process that shifts Vnodes to tablets. In short, you won’t need to have a new cluster and migrate the data to it. It’s possible to do it in place – and we’ll have a process for it.

Read our engineering blog on tablets

What are some of the distinctive innovations in ScyllaDB?

Avi: The primary innovation in ScyllaDB is its thread-per-core architecture. Thread-per-core was not invented by ScyllaDB. It was used by other products, mostly closed-source products, before – although I don’t know of any databases back then using this architecture. Our emphasis is on making everything asynchronous. We can sustain a very large amount of concurrency, so we can utilize modern disks or large node counts, large CPU counts, and very fast disks. All of these require a lot of concurrency to saturate the hardware and extract every bit of performance from it. So our architecture is really oriented towards that. And it’s not just the architecture – it’s also our mindset. Everything we do is oriented towards getting great performance while maintaining low latency.

Another outcome from that was automatic tuning. We learned very quickly that it’s impossible to manually tune a database for different workloads, manage compaction settings, and so forth. It’s too complex for people to manage if they are not experts – and difficult even if they are experts. So we invested a lot of effort in having the database tune itself. This self-tuning is not perfect. But it is automatic, and it is always there. You know that if your workload changes, or something in the environment changes, ScyllaDB will adapt to it and you don’t need to be paged in order to change settings. When the workload increases, or you lose a node, the database will react to it and adapt.

Another thing is workload prioritization. Once we had the ability to isolate different internal workloads (e.g., isolate the streaming workload from the compaction workload from the user workload), we exposed it to users as a capability called workload prioritization. That lets users define separate workloads, which could be your usual OLTP workload and an ETL workload or analytics workload. You can run those two workloads simultaneously on the same cluster, without them interfering with one another.

How does ScyllaDB approach caching?

Dor: Like many infrastructure projects, ScyllaDB has its own integrated, internal cache. The servers that we run on are pretty powerful and have lots of memory, so it makes sense to utilize this memory. The database knows best about the object lifetime and the access patterns – what is used most, what’s rarely used, what’s most expensive to retrieve from the disk, and so on. So ScyllaDB does its own caching very effectively. It’s also possible to monitor the cache utilization.

We even have algorithms like heat-weighted load balancing. That comes into play, for example, when you restart a server. If you were doing upgrades, the server comes back relatively fast, but with a completely cold cache. If clients naively route queries evenly to all of your cluster nodes, then there will be one replica in the cluster (the one with a cold cache), and this could introduce a negative latency impact. ScyllaDB nodes actually know the cache utilization of their peers since we propagate these stats via Gossip. And over time, gradually, it sends more and more requests as a function of the cache utilization to all of the servers in the cluster. That’s how upgrades are done painlessly without any impact to your running operations.

Other databases don’t have this mechanism. They require you to have an external cache. That’s extremely wasteful because all of the database servers have lots of memory, and you’re just wasting that, you’re wasting all its knowledge about object utilization and the recently used objects.

Those external caches, like Redis, usually aren’t great for high availability. And you need to make sure that the cache maintains the same consistency as the database – and it’s complicated.

Avi: Yes, caching is difficult, which is why it made it into that old joke about the two difficult things in computer science: naming things, cache invalidation, and off-by-one errors. It’s old, but it’s funny every time – maybe that says something about programmers’ sense of humor. 😉 The main advantage of having strong caching inside the database is that it simplifies things for the database user.

Read our engineering blog on ScyllaDB’s specialized cache

What are you hoping to achieve with the upcoming object storage capability, and do you plan to support multiple tiers?

Dor: We are adding S3 support: some of it is already in the master branch, the rest will follow. We see a lot of potential in S3. It can benefit total cost of ownership, and it will also improve backup and restore, letting you have a passive data center just by sending S3 to another data center but without nodes (unlike the active-active that we support today). There are multiple benefits, even faster scaling later on and faster recovery from a backup.

But the main advantage is its cost: compared to NVMe SSDs, it’s definitely cheaper. The cost advantage comes with a tradeoff; the price is high latency, especially with the implementation by cloud vendors. Object storage is still way slower than pure block devices like NVMe. We will approach it by offering different options.

One approach is basically tiered storage. You can have a policy where the hot data resides on NVMe and the data that you care less about, but you still want to keep, is stored on S3. That data on S3 will be cold, and its performance will be different. It’s simplest to model this with time series access patterns. We also want to get to more advanced access patterns (e.g., completely random access) and just cache the most recently used objects in NVMe while keeping the rarely accessed objects in S3. That’s one way to benefit from S3.

Another way is the ability to still cache your entire data set on top of NVMe SSDs. You can have one or two replicas as a cached replica (stored not in RAM but in NVMe), so storage access will still be very fast when data no longer fits in RAM. The remaining replicas can use S3 instead of NVMe, and this setup will be very cost-effective. You might be able to have a 67% cost reduction or a 33% cost reduction, depending on the ratio of cache replicas versus replicas that are backed by S3.

Can you speak to the pros and cons of ScyllaDB vs. Cassandra?

Dor: If there is a con, we define it as a bug and we address it. 😉 Really, that’s what we’ve been doing for years and years. As far as the benefits, there’s no JVM – no big fat layer existing or requiring tuning. Another benefit is the caching. And we have algorithms that use the cache, like heat-weighted load balancing. There’s also better observability with Grafana and Prometheus – you can even drill down per shard. Shard-per-core is another thing that guarantees performance. We also have “hints” where the client can say, “I’d like to bypass the cache because, for example, I’m doing a scan now and these values shouldn’t be cached.” Even the way that we index is different. Also, we have global indexes, unlike Cassandra. That’s just a few things – there are many improvements over Cassandra.

Co-Founder Keynotes: More from Dor and Avi

Want to learn more?

Access the co-founder keynotes here:

  • See Dor’s Talk & Deck
  • See Avi’s Talk & Deck
  • Access all of ScyllaDB Summit On-Demand (free)

Cloud Database Rewards, Risks & Tradeoffs

Considering a fully-managed cloud database? Consider the top rewards, risks, and trade-offs related to performance and cost.

What do you really gain – and give up – when moving to a fully managed cloud database? Now that managed cloud database offerings have been “battle tested” in production for a decade, how is the reality matching up to the expectation? What can teams thinking of adopting a fully managed cloud database learn from teams who have years of experience working with this deployment model?

We’ve found that most teams are familiar with the admin/management aspects of database as a service (a.k.a. “DBaaS”). But let’s zero in on the top risks, rewards and trade-offs related to two aspects that commonly catch teams off guard: performance and cost.

Bonus: Hear your peers’ firsthand cloud database experiences at ScyllaDB Summit on-demand

Cloud Database Performance

Performance Rewards

Using a cloud database makes it extremely easy for you to place your data close to your application and your end users. Most also support multiregion replication, which lets you deploy an always-on architecture with just a few clicks. This simplicity makes it feasible to run specialized use cases, such as “smart replication.” For example, think about a worldwide media streaming service where you have one catalog tailored to users living in Brazil and a different catalog for users in the United States. Or, consider a sports betting use case where you have users all around the globe and you need to ensure that as the game progresses, updated odds are rolled out to all users at the exact same time to “level the playing field” (this is the challenge that ZeroFlucs tackled very impressively).

The ease of scale is another potential performance-related reward. To reap this reward, be sure to test that your selected cloud database is resilient enough to sustain sudden traffic spikes. Most vendors let you quickly scale out your deployment, but beware of solutions that don’t let you transition between “tiers.” After all, you don’t want to find yourself in a situation where it’s Black Friday and your application can’t meet the demands. Managed cloud database options like ScyllaDB Cloud let you add as many nodes as needed to satisfy any traffic surges that your business is fortunate enough to experience.

Performance Risks

One performance risk is the unpredictable cost of scale. Know your throughput requirements and how much growth you anticipate. If you’re running up to a few thousand operations per second, a pay-per-operations service model probably makes sense. But as you grow to tens of thousands of operations per second and beyond, it can become quite expensive. Many high-growth companies opt for pricing models that don’t charge by the number of operations you run, but rather charge for the infrastructure you choose.

Also, be mindful of potential hidden limits or quotas that your cloud database provider may impose. For example, DynamoDB limits item size to 400KB per item; the operation is simply refused if you try to exceed that. Moreover, your throughput could be throttled down if you try to pass your allowance, or if the vendor imposes a hard limit on the number of operations you can run on top of a single partition. Throttling severely increases latency, which may be unacceptable for real-time applications. If this is important to you, look for a cloud database model that doesn’t impose workload restrictions.

With offerings that use infrastructure-based cost models, there aren’t artificial traffic limits; you can push as far as the underlying hardware can handle.

Performance Trade-offs

It’s crucial to remember that a fully-managed cloud database is fundamentally a business model. As your managed database vendor contributes to your growth, it also captures more revenue. Despite the ease of scalability, many vendors limit your scaling options to a specific range, potentially not providing the most performant infrastructure. For example, perhaps you have a real-time workload that reads a lot of cold data in such a way that I/O access is really important to you, but your vendor simply doesn’t support provisioning your database on top of NVMe (nonvolatile memory express) storage.

Having a third party responsible for all your core database tasks obviously simplifies maintenance and operations. However, if you encounter performance issues, your visibility into the problem could be reduced, limiting your troubleshooting capabilities. In such cases, close collaboration with your vendor becomes essential for identifying the root cause. If visibility and fast resolution matter to you, opt for cloud database solutions that offer comprehensive visibility into your database’s performance.

Cloud Database Costs

Cost Rewards

Adopting a cloud database eliminates the need for physical infrastructure and dedicated staff. You don’t have to invest in hardware or its maintenance because the infrastructure is provided and managed by the DBaaS provider. This shift results in significant cost savings, allowing you to allocate resources more effectively toward core operations, innovation and customer experience rather than spending on hardware procurement and management.

Furthermore, using managed cloud databases reduces staffing costs by transferring responsibilities such as DevOps and database administration to the vendor. This eliminates the need for a specialized in-house database team, enabling you to optimize your workforce and allocate relevant staff to more strategic initiatives.

There’s also a benefit in deployment flexibility. Leading providers typically offer two pricing models: pay-as-you-go and annual pricing. The pay-as-you-go model eliminates upfront capital requirements and allows for cost optimization by aligning expenses with actual database usage. This flexibility is particularly beneficial for startups or organizations with limited resources.

Most cloud database vendors offer a standard model where the customer’s database sits on the vendor’s cloud provider infrastructure. Alternatively, there’s a “bring your own account” model, where the database remains on your organization’s cloud provider infrastructure. This deployment is especially advantageous for enterprises with established relationships with their cloud providers, potentially leading to cost savings through pre-negotiated discounts. Additionally, by keeping the database resources on your existing infrastructure, you avoid dealing with additional security concerns. It also allows you to manage your database as you manage your other existing infrastructure.

Cost Risks

Although a cloud database offers scalability, the expense of scaling your database may not follow a straightforward or easily predictable pattern. Increased workload from applications can lead to unexpected spikes or sudden scaling needs, resulting in higher costs (as mentioned in the previous section).

As traffic or data volume surges, the resource requirements for the database may significantly rise, leading to unforeseen expenses as you need to scale up. It is crucial to closely monitor and analyze the cost implications of scaling to avoid budget surprises.

Additionally, while many providers offer transparent pricing, there may still be hidden costs. These costs often arise from additional services or specific features not covered by the base pricing. For instance, specialized support or advanced features for specific use cases may incur extra charges. It is essential to carefully review the service-level agreements and pricing documentation provided by your cloud database provider to identify any potential hidden costs.

Here’s a real-life example: One of our customers recently moved over from another cloud database vendor. At that previous vendor, they encountered massive unexpected variable costs, primarily associated with network usage. This “cloud bill shock” resulted in some internal drama; some engineers were fired in the aftermath.

Understanding and accounting for these hidden, unanticipated costs is crucial for accurate budgeting and effective cost management. This ensures a comprehensive understanding of the total cost of ownership and enables more informed decisions on the most cost-effective approach for your organization. Given the high degree of vendor lock-in involved in most options, it’s worth thinking about this long and hard. You don’t want to be forced into a significant application rewrite because your solution isn’t sustainable from a cost perspective and doesn’t have any API-compatible paths out.

Cost Trade-offs

The first cost trade-off associated with using a cloud database involves limited cost optimizations. While managed cloud solutions offer some cost-saving features, they might limit your ability to optimize costs to the same extent as self-managed databases. Constraints imposed by the service provider may restrict actions like optimizing hardware configurations or performance tuning. They also provide standardized infrastructure that caters to a broad variety of use cases. On the one hand, this simplifies operations. On the other hand, one size does not fit all. It could limit your ability to implement highly customized cost-saving strategies by fine-tuning workload-specific parameters and caching strategies. The bottom line here: carefully evaluate these considerations to determine the impact on your cost optimization efforts.

The second trade-off pertains to cost comparisons and total cost of ownership. When comparing costs between vendor-managed and self-managed databases, conducting a total cost of ownership analysis is essential. Consider factors such as hardware, licenses, maintenance, personnel and other operational expenses associated with managing a database in-house, and then compare these costs against ongoing subscription fees and additional expenses related to the cloud database solution. Evaluate the long-term financial impact of using a fully-managed cloud database versus managing the database infrastructure in-house. Then, with this holistic view of the costs, decide what’s best given your organization’s specific requirements and budget considerations.

Additional Database Deployment Model Considerations

Although a database as a service (DBaaS) deployment will definitely shield you from many infrastructure and hardware decisions through your selection process, a fundamental understanding of the generic compute resources required by any database is important for identifying potential bottlenecks that may limit performance.

For an overview of the critical considerations and tradeoffs when selecting CPUs, memory, storage and networking for your distributed database infrastructure, see Chapter 7 of the free book, “Database Performance at Scale.” After an introduction to the hardware that’s involved in every deployment model, whether you think about it or not, that book chapter shifts focus to different deployment options and their impact on performance. You’ll learn about the special considerations associated with cloud-hosted deployments, serverless, containerization and container orchestration technologies such as Kubernetes.

Access the complete “Database Performance at Scale” book free, courtesy of ScyllaDB.

New ScyllaDB Enterprise Release: Up to 50% Higher Throughput, 33% Lower Latency

Performance improvements, encryption at rest, Repair Based Node Operations, consistent schema management using Raft & more

ScyllaDB Enterprise 2024.1.0 LTS, a production-ready ScyllaDB Enterprise Long Term Support Major Release, is now available! It introduces significant performance improvements: up to 50% higher throughput, 35% greater efficiency, and 33% lower latency. It also introduces encryption at rest, Repair Based Node Operations (RBNO) for all operations, and numerous improvements and bug fixes. Additionally, consistent schema management using Raft will be enabled automatically upon upgrade (see below for more details). The new release is based on ScyllaDB Open Source 5.4.

In this blog, we’ll highlight the new capabilities that our users have been asking about most frequently. For the complete details, read the release notes.

  • Read the detailed release notes on our forum
  • Learn more about ScyllaDB Enterprise
  • Get ScyllaDB Enterprise 2024.1 (customers only, or 30-day evaluation)
  • Upgrade from ScyllaDB Enterprise 2023.1.x to 2024.1.y
  • Upgrade from ScyllaDB Enterprise 2022.2.x to 2024.1.y
  • Upgrade from ScyllaDB Open Source 5.4 to ScyllaDB Enterprise 2024.1.x

ScyllaDB Enterprise customers are encouraged to upgrade to ScyllaDB Enterprise 2024.1, and are welcome to contact our Support Team with questions.

Performance Improvements

2024.1 includes many runtime and build performance improvements that translate to:

  • Higher throughput per vCPU and server
  • Lower mean and P99 latency

ScyllaDB 2024.1 vs ScyllaDB 2023.1

Throughput tests: 2024.1 has up to 50% higher throughput than 2023.1. In some cases, this can translate to a 35% reduction in the number of vCPUs required to support a similar load. This enables a similar reduction in vCPU cost.

Latency tests: Latency tests were performed at 50% of the maximum throughput tested. As demonstrated below, the latency (both mean and P99) is 33% lower, even with the higher throughput.

Test setup:

  • Amazon EC2 instance_type_db: i3.2xlarge (8 cores)
  • instance_type_loader: c4.2xlarge
  • Test profiles: cassandra-stress [mixed|read|write] no-warmup cl=QUORUM duration=50m -schema ‘replication(factor=3)’ -mode cql3 native -rate threads=100 -pop ‘dist=gauss(1..30000000,15000000,1500000)’

Note that these results are for tests performed on a small i3 server (i3.2xlarge). ScyllaDB scales linearly with the number of cores and achieves much better results for the i4i instance type.

ScyllaDB Enterprise 2024.1 vs ScyllaDB Open Source 5.4

ScyllaDB Enterprise 2024.1 is based on ScyllaDB Open Source 5.4, but includes enterprise-only performance optimizations. As shown below, the throughput gain is significant and latency is lower. These tests use the same setup and parameters detailed above.

Encryption at Rest (EaR) Enhancements

This new release includes enhancements to Encryption at Rest (EaR), including new Amazon KMS integration and extended cluster-level encryption at rest. Together, these improvements allow you to easily use your own key for cluster-wide EaR. ScyllaDB Enterprise has supported Encryption at Rest (EaR) for some time. Until now, users could store the keys for EaR locally, in an encrypted table, or in an external KMIP server. This release adds the ability to:

  • Use Amazon KMS to store and manage keys.
  • Set default EaR parameters (including the new KMS) for *all* cluster tables.

These are both detailed below.

Amazon KMS Integration for Encryption at Rest

ScyllaDB can now use a Customer Managed Key (CMK), stored in KMS, to create, encrypt, and decrypt Data Keys (DEKs), which are then used to encrypt and decrypt the data in storage (such as SSTables, commit logs, batches, and hints logs).

KMS creates the DEK from the CMK:

The DEK (plain text version) is used to encrypt the data at rest:

Diagrams are from: https://docs.aws.amazon.com/kms/latest/developerguide/concepts.html#data-keys

Before using KMS, you need to set KMS as a key provider and validate that ScyllaDB nodes have permission to access and use the CMK you created in KMS. Once you do that, you can use the CMK in the CREATE and ALTER TABLE commands with KmsKeyProviderFactory, as follows:

CREATE TABLE myks.mytable (......) WITH scylla_encryption_options = {
    'cipher_algorithm' : 'AES/CBC/PKCS5Padding',
    'secret_key_strength' : 128,
    'key_provider': 'KmsKeyProviderFactory',
    'kms_host': 'my_endpoint'
}

Where 'my_endpoint' points to a section in scylla.yaml:

kms_hosts:
  my_endpoint:
    aws_use_ec2_credentials: true
    aws_use_ec2_region: true
    master_key: alias/MyScyllaKey

You can also use the KMS provider to encrypt system-level data. See more examples and info here.

Transparent Data Encryption

Transparent Data Encryption (TDE) adds a way to define Encryption at Rest parameters per cluster, not only per table. This allows the system administrator to enforce encryption of *all* tables using the same master key (e.g., from KMS) without specifying the encryption parameters per table. For example, with the following in scylla.yaml, all tables will be encrypted using the encryption parameters of my_kms1:

user_info_encryption:
  enabled: true
  key_provider: KmsKeyProviderFactory
  kms_host: my_kms1

See more examples and info here.

Repair Based Node Operations (RBNO)

RBNO provides more robust, reliable, and safer data streaming for node operations like node-replace and node-add/remove. In particular, a failed node operation can resume from the point it stopped – without sending data that has already been synced. In addition, with RBNO enabled, you don’t need to repair before or after node operations, such as replace or removenode. In this release, RBNO is enabled by default for all operations: remove node, rebuild, bootstrap, and decommission. The replace node operation was already enabled by default. For details, see the Repair Based Node Operations (RBNO) docs and the blog, Faster, Safer Node Operations with Repair vs Streaming.

Node-Aggregated Table Level Metrics

Most ScyllaDB metrics are per-shard, per-node, but not for a specific table. We now export some per-table metrics. These are exported once per node, not per shard, to reduce the number of metrics.

Guardrails

Guardrails is a framework to protect ScyllaDB users and admins from common mistakes and pitfalls. In this release, ScyllaDB includes a new guardrail on the replication factor. It is now possible to specify the minimum replication factor for new keyspaces via a new configuration item.

Security

In addition to the EaR enhancements above, the following security features were introduced in 2024.1:

Encryption in transit, TLS certificates

It is now possible to use TLS certificates to authenticate and authorize a user to ScyllaDB. The system can be configured to derive the user role from the client certificate and derive the permissions the user has from that role. #10099 Learn more in the certificate-authentication docs.

FIPS Tolerant

ScyllaDB Enterprise can now run on FIPS-enabled Ubuntu, using libraries that were compiled with FIPS enabled, such as OpenSSL, GnuTLS, and more.

Strongly Consistent Schema Management with Raft

Strongly Consistent Schema Management with Raft became the default for new clusters in ScyllaDB Enterprise 2023.1. In this release, it is enabled by default when upgrading existing clusters. Learn more in the blog, ScyllaDB’s Path to Strong Consistency: A New Milestone.

Read the detailed release notes

Support for AWS PrivateLink On Instaclustr for Apache Cassandra® is now GA

Instaclustr is excited to announce the general availability of AWS PrivateLink with Apache Cassandra® on the Instaclustr Managed Platform. This release follows the announcement of the new feature in public preview last year.  

Support for AWS PrivateLink with Cassandra provides our AWS customers with a simpler and more secure option for network cross-account connectivity, to expose an application in one VPC to other users or applications in another VPC.  

Network connections to an AWS PrivateLink service can only be one-directional from the requestor to the destination VPC. This prevents network connections being initiated from the destination VPC to the requestor and creates an additional measure of protection from potential malicious activity. 

All resources in the destination VPC are masked and appear to the requestor as a single AWS PrivateLink service. The AWS PrivateLink service manages access to all resources within the destination VPC. This significantly simplifies cross-account network setup compared to authorizing peering requests and configuring route tables and security groups when establishing VPC peering.  

The Instaclustr team has worked with care to integrate the AWS PrivateLink service for your AWS Managed Cassandra environment to give you a simple and secure cross-account network solution with just a few clicks.  

Fitting AWS PrivateLink to Cassandra is not a straightforward task as AWS PrivateLink exposes a single IP proxy per AZ, and Cassandra clients generally expect direct access to all Cassandra nodes. To solve this problem, the development of Instaclustr’s AWS PrivateLink service has made use of Instaclustr’s Shotover Proxy in front of your AWS Managed Cassandra clusters to reduce cluster IP addresses from one-per-node to one-per-rack, enabling the use of a load balancer as required by AWS PrivateLink.  

By managing database requests in transit, Shotover gives Instaclustr customers AWS PrivateLink’s simple and secure network setup with the benefits of Managed Cassandra. Keep a look out for an upcoming blog post with more details on the technical implementation of AWS PrivateLink for Managed Cassandra. 

AWS PrivateLink is offered as an Instaclustr Enterprise feature, available at an additional charge of 20% on top of the node cost for the first feature enabled by you. The Instaclustr console will provide a summary of node prices or management units for your AWS PrivateLink enabled Cassandra cluster, covering both Cassandra and Shotover node size options and prices when you first create an AWS PrivateLink enabled Cassandra cluster. Information on charges from AWS is available here. 

Log in to the Console today to enable support for AWS PrivateLink on your AWS Managed Cassandra clusters with just one click. Alternatively, support for AWS PrivateLink for Managed Cassandra is available via the Instaclustr API or Terraform.  

Please reach out to our Support team for any assistance with AWS PrivateLink for your AWS Managed Cassandra clusters. 

The post Support for AWS PrivateLink On Instaclustr for Apache Cassandra® is now GA appeared first on Instaclustr.

ScyllaDB Summit 2024 Recap: An Inside Look

A rapid rundown of the whirlwind database performance event Hello readers, it’s great to be back in the US to help host ScyllaDB Summit, 2024 edition. What a great virtual conference it has been – and once again, I’m excited to share the behind-the-scenes perspective as one of your hosts. First, let’s thank all the presenters once again for contributions from around the world. With 30 presentations covering all things ScyllaDB, it made for a great event. To kick things off, we had the now-famous Felipe Cardeneti Mendes host the ScyllaDB lounge and get straight into answering questions. Once the audience got a taste for it (and realized the technical acumen of Felipe’s knowledge), there was an unstoppable stream of questions for him. Felipe was so popular that he became a meme for the conference!
Felipe Mendes’s comparing MongoDB vs ScyllaDB using @benchant_com benchmarks !! What a vast leap by ScyllaDB. Loving it. @ScyllaDB #ScyllaDB pic.twitter.com/ObrZzMU9o2 — ktv (@RamnaniKartavya) February 14, 2024
The first morning, we also trialed something new by running hands-on labs aimed at both novice and advanced users. These live streams were a hit, with over 1000 people in attendance and everyone keen to get their hands on ScyllaDB. If you’d like to continue that experience, be sure to check out the self-paced ScyllaDB University and the interactive instructor-led ScyllaDB University LIVE event coming up in March. Both are free and virtual! Let’s recap some stand out presentations for me on the first day of the conference. The opening keynote, by CEO and co-founder Dor Laor, was titled ScyllaDB Leaps Forward. This is a must-see presentation. It provides the background context you need to understand tablet architecture and the direction that ScyllaDB is headed: not only the fastest database in terms of latency (at any scale), but also the quickest to scale in terms of elasticity. The companion keynote on day two from CTO and co-founder Avi Kivity completes the loop and explains in more detail why ScyllaDB is making this major architecture shift from vNodes replication to tablets. Take a look at Tablets: Rethink Replication for more insights. The second keynote, from Discord Staff Engineer Bo Ingram, opened with the premise So You’ve Lost Quorum: Lessons From Accidental Downtime and shared how to diagnose issues in your clusters and how to avoid making a fault too big to tolerate. Bo is a talented storyteller and published author. Be sure to watch this keynote for great tips on how to handle production incidents at scale. And don’t forget to buy his book, ScyllaDB in Action for even more practical advice on getting the most out of ScyllaDB. Download the first 4 chapters for free An underlying theme for the conference was exploring individual customers’ migration paths from other databases onto ScyllaDB. To that end, we were fortunate to hear from JP Voltani, Head of Engineering at Tractian, on their Experience with Real-Time ML and the reasons why they moved from MongoDB to ScyllaDB to scale their data pipeline. Working with over 5B samples from +50K IoT devices, they were able to achieve their migration goals. Felipe’s presentation on MongoDB to ScyllaDB: Technical Comparison and the Path to Success then detailed the benchmarks, processes, and tools you need to be successful for these types of migrations. There were also great presentations looking at migration paths from DynamoDB and Cassandra; be sure to take a look at them if you’re on any of those journeys. A common component in customer migration paths was the use of Change Data Capture (CDC) and we heard from Expedia on their migration journey from Cassandra to ScyllaDB. They cover the aspects and pitfalls the team needed to overcome as part of their Identity service project. If you are keen to learn more about this topic, then Guilherme’s presentation on Real-Time Event Processing with CDC is a must-see. Martina Alilović Rojnić gave us the Strategy Behind Reversing Labs’ Massive Key-Value Migration which had mind-boggling scale, migrating more than 300 TB of data and over 400 microservices from their bespoke key-value store to ScyllaDB – with ZERO downtime. An impressive feat of engineering! ShareChat shared everything about Getting the Most Out of ScyllaDB Monitoring. This is a practical talk about working with non-standard ScyllaDB metrics to analyze the remaining cluster capacity, debug performance problems, and more. Definitely worth a watch if you’re already running ScyllaDB in production. 
After a big day hosting the conference combined with the fatigue of international travel from Australia, assisted with a couple of cold beverages the night after, sleep was the priority. Well rested and eager for more, we launched early into day two of the event with more great content. Leading the day was Avi’s keynote, which I already mentioned above. Equally informative was the following keynote from Miles Ward and Joe Shorter on Radically Outperforming DynamoDB. If you’re looking for more reasons to switch, this was a good presentation to learn from, including details of the migration and using ScyllaDB Cloud with Google Cloud Platform. Felipe delivered another presentation (earning him MVP of the conference) about using workload prioritization features of ScyllaDB to handle both Real-Time and Analytical workloads, something you might not ordinarily consider compatible. I also enjoyed Piotr’s presentation on how ScyllaDB Drivers take advantage of the unique ScyllaDB architecture to deliver high-performance and ultra low-latencies. This is yet another engineering talk showcasing the strengths of ScyllaDB’s feature sets. Kostja Osipov set the stage for this on Day 1. Kostja consistently delivers impressive Raft talks year after year, and his Topology on Raft: An Inside Look talk is another can’t miss. There’s a lot there! Give it a (re)watch if you want all the details on how Raft is implemented in the new releases and what it all means for you, from the user perspective. We also heard from Kishore Krishnamurthy, CTO at ZEE5, giving us insights into Steering a High-Stakes Database Migration. It’s always interesting to hear the executive-level perspective on the direction you might take to reduce costs while maintaining your SLAs. There were also more fascinating insights from ZEE5 engineers on how they are Tracking Millions of Heartbeats on Zee’s OTT Platform. Solid technical content. In a similar vein, proving that “simple” things can still present serious engineering challenges when things are distributed and at scale, Edvard Fagerholm showed us how Supercell Persists Real-Time Events. Edvard illustrated how ScyllaDB helps them process real-time events for their games like Clash of Clans and Clash Royale with hundreds of millions of users. The day flew by. Before long, we were wrapping up the conference. Thanks to the community of 7500 for great participation – and thousands of comments – from start to finish. I truly enjoyed hosting the introductory lab and getting swarmed by questions. And no, I’m not a Kiwi! Thank you all for a wonderful experience.

Seastar, ScyllaDB, and C++23

Seastar now supports C++20 and C++23 (and has dropped support for C++17).

Seastar is an open-source (Apache 2.0 licensed) C++ framework for I/O-intensive asynchronous computing, using the thread-per-core model. Seastar underpins several high-performance distributed systems: ScyllaDB, Redpanda, and Ceph Crimson. The Seastar source is available on GitHub.

Background

As a C++ framework, Seastar must choose which C++ versions to support. The support policy is last-two-versions: at any given time, the most recently released version of the standard and the previous one are supported, but earlier versions cannot be expected to work. This policy gives users of the framework three years to upgrade to the next C++ edition while not constraining Seastar to ancient versions of the language. Now that C++23 has been ratified, Seastar officially supports C++20 and C++23. The previously supported C++17 is no longer supported.

New features in C++23

We will focus here on C++23 features that are relevant to Seastar users; this isn’t a comprehensive review of C++23 changes. For an overview of C++23 additions, consult the cppreference page.

std::expected

std::expected is a way to communicate error conditions without exceptions. This is useful since exception handling is very slow in most C++ implementations. In a way, it is similar to std::future and seastar::future: they are all variant types that can hold values and errors, though, of course, futures also represent concurrent computations, not just values. So far, ScyllaDB has used boost::outcome for the role that std::expected fills; this improved ScyllaDB’s performance under overload conditions. We’ll likely replace it with std::expected soon, and integration into Seastar itself is a good area for extending Seastar. (A brief illustrative sketch follows at the end of this section.)

std::flat_set and std::flat_map

These new containers reduce allocations compared to their traditional counterparts and are suitable for request processing in Seastar applications. Seastar itself won’t use them, since it still maintains C++20 compatibility, but Seastar users should consider them, along with the abseil containers.

Retiring C++17 support

As can be seen from the previous section, C++23 does not have a dramatic impact on Seastar. The retirement of C++17, however, does, because we can now freely use C++20-only features in Seastar itself.

Coroutines

C++20 introduced coroutines, which make asynchronous code both easier to write and more efficient (a very rare tradeoff). Seastar applications could already use coroutines freely, but Seastar itself could not, due to the need to support C++17. Since all supported C++ editions now have coroutines, continuation-style code will be replaced by coroutines where this makes sense.

std::format

Seastar has long been using the wonderful {fmt} library. Since it was standardized as std::format in C++20, we may drop this dependency in favor of the standard library version.

The std::ranges library

Another long-time dependency, the Boost.Range library, can now be replaced by its modern equivalent, std::ranges. This promises better compile times and, more importantly, better compile-time error reporting, as the standard library uses C++ concepts to point out programmer errors more directly.

Concepts

As concepts were introduced in C++20, they can now be used unconditionally in Seastar. Previously, they were only used when C++20 mode was active, which somewhat restricted what could be done with them.
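As promised above, here is a minimal, self-contained sketch of the std::expected pattern. It is not taken from Seastar or ScyllaDB code; parse_port, its inputs, and its choice of std::errc error codes are hypothetical, purely to show a fallible routine reporting failure through the error channel instead of throwing:

#include <expected>
#include <iostream>
#include <string>
#include <system_error>

// Hypothetical parser: reports failure through std::expected's error channel
// instead of throwing, avoiding the cost of exception unwinding on hot paths.
std::expected<int, std::errc> parse_port(const std::string& s) {
    if (s.empty() || s.size() > 5 ||
        s.find_first_not_of("0123456789") != std::string::npos) {
        return std::unexpected(std::errc::invalid_argument);
    }
    int value = std::stoi(s);           // cannot throw here: input is 1-5 digits
    if (value < 1 || value > 65535) {
        return std::unexpected(std::errc::result_out_of_range);
    }
    return value;                       // success: the expected value
}

int main() {
    for (const auto* input : {"9042", "banana", "70000"}) {
        auto port = parse_port(input);
        if (port) {
            std::cout << input << " -> port " << *port << '\n';
        } else {
            std::cout << input << " -> error "
                      << static_cast<int>(port.error()) << '\n';
        }
    }
}

The control flow is the same shape that boost::outcome supports today, which is why swapping in std::expected is largely a matter of dropping the third-party dependency.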
Conclusion C++23 isn’t a revolution for C++ users in general and Seastar users in particular, but, it does reduce the dependency on third-party libraries for common tasks. Concurrent with its adoption, dropping C++17 allows us to continue modernizing and improving Seastar.

Distributed Database Consistency: Dr. Daniel Abadi & Kostja Osipov Chat

Dr. Daniel Abadi (University of Maryland) and Kostja Osipov (ScyllaDB) discuss PACELC, the CAP theorem, Raft, and Paxos.

Database consistency has been a strongly consistent theme at ScyllaDB Summit over the past few years – and we guarantee that will continue at ScyllaDB Summit 2024 (free + virtual). Co-founder Dor Laor’s opening keynote on “ScyllaDB Leaps Forward” includes an overview of the latest milestones on ScyllaDB’s path to immediate consistency. Kostja Osipov (Director of Engineering) then shares the details behind how we’re implementing this shift with Raft and what the new consistent metadata management updates mean for users. Then on Day 2, Avi Kivity (Co-founder) picks up this thread in his keynote introducing ScyllaDB’s revolutionary new tablet architecture – which is built on the foundation of Raft.

Update: ScyllaDB Summit 2024 is now a wrap! Access ScyllaDB Summit On Demand.

ScyllaDB Summit 2023 featured two talks on database consistency. Kostja Osipov shared a preview of Raft After ScyllaDB 5.2: Safe Topology Changes (also covered in this blog series). And Dr. Daniel Abadi, creator of the PACELC theorem, explored The Consistency vs Throughput Tradeoff in Distributed Databases. After their talks, Daniel and Kostja got together to chat about distributed database consistency. You can watch the full discussion below. Here are some key moments from the chat…

What is the CAP theorem and what is PACELC?

Daniel: Let’s start with the CAP theorem. That’s the more well-known one, and the one that came first historically. Some say it’s a three-way tradeoff, some say it’s a two-way tradeoff. It was originally described as a three-way tradeoff: out of consistency, availability, and tolerance to network partitions, you can have two of them, but not all three. That’s the way it’s defined.

The intuition is that if you have a copy of your data in America and a copy of your data in Europe and you want to do a write in one of those two locations, you have two choices. You do it in America, and then you say it’s done before it gets to Europe. Or, you wait for it to get to Europe, and then you wait for it to occur there before you say that it’s done. In the first case, if you commit and finish a transaction before it gets to Europe, then you’re giving up consistency because the value in Europe is not the most current value (the current value is the write that happened in America). But if America goes down, you could at least respond with stale data from Europe to maintain availability.

PACELC is really an extension of the CAP theorem. The PAC of PACELC is CAP. Basically, it says that when there is a network partition, you must choose either availability or consistency. But the key point of PACELC is that network partitions are extremely rare. There are all kinds of redundant ways to get a message from point A to point B. So the CAP theorem is kind of interesting in theory, but in practice, there’s no real reason why you have to give up C or A. You can have both in most cases because there’s almost never a network partition. Yet we see many systems that do give up consistency. Why? The main reason you give up consistency these days is latency. Consistency just takes time. Consistency requires coordination. You have to have two different locations communicate with each other to be able to remain consistent with one another. If you want consistency, that’s great. But you have to pay in latency. And if you don’t want to pay that latency cost, you’re going to pay in consistency.
So the high-level explanation of the PACELC theorem is that when there is a partition, you have to choose between availability and consistency. But in the common case where there is no partition, you have to choose between latency and consistency. [Read more in Dr. Abadi’s paper, Consistency Tradeoffs in Modern Distributed Database System Design]

In ScyllaDB, when we talk about consensus protocols, there’s Paxos and Raft. What’s the purpose of each?

Kostja: First, I would like to second what Dr. Abadi said. This is a tradeoff between latency and consistency; consistency requires latency, basically. My take on the CAP theorem is that it was really oversold back in the 2010s. We were looking at it as a fundamental requirement, and we have been building systems as if we are never going to go back to strong consistency again. And now the train has turned around completely. Now many vendors are adding back strong consistency features.

For ScyllaDB, I’d say the biggest difference between Paxos and Raft is whether it’s a centralized algorithm or a decentralized algorithm. I think decentralized algorithms are just generally much harder to reason about. We use Raft for configuration changes, which we use as a basis for our topology changes (when we need the cluster to agree on a single state). The main reason we chose Raft was that it has been very well specified, very well tested and implemented, and so on. Paxos itself is not a multi-round protocol. You have to build on top of it; there are papers on how to build multi-Paxos on top of Paxos and how you manage configurations on top of that. If you are a practitioner, you need something very complete to build upon. Even when we were looking at Raft, we found quite a few open issues with the spec. That’s why both can co-exist. And I guess we also have eventual consistency – so we can take the best of all worlds.

For data, we are certainly going to run multiple Raft groups. But this means that every partition is going to be its own consensus – running independently, essentially having its own leader. In the end, we’re going to have, logically, many leaders in the cluster. However, if you look at our schema and topology, there’s still a single leader. So for schema and topology, we have all of the members of the cluster in the same group. We do run a single leader, but this is an advantage because the topology state machine is itself quite complicated. Running in a decentralized fashion without a single leader would complicate it quite a bit more.

For a layman, linearizable just means that you can very easily reason about what’s going on: one thing happens after another. And when you build algorithms, that’s a huge value. We build complex transitions of topology when you stream data from one node to another – you might need to abort this, you might need to coordinate it with another streaming operation, and having one central place to coordinate this is just much, much easier to reason about.

Daniel: Returning to what Kostja was saying: it’s not just that the trend away from consistency has started to reverse. I think it’s very true that people overreacted to CAP. It’s sort of like they used CAP as an excuse for why they didn’t create a consistent system. I think there are probably more systems than there should have been that might have been designed very differently if they hadn’t drunk the CAP Kool-Aid so much. I think it’s a shame, and as Kostja said, it’s starting to reverse now.
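To put rough numbers on the latency-versus-consistency tradeoff Daniel describes, here is a minimal sketch. It is not from the chat; the regions, delay values, and helper names ack_local and ack_all are invented purely for illustration of the two acknowledgment strategies:

#include <algorithm>
#include <cstdio>
#include <vector>

// Hypothetical one-way replication delays from the coordinator to each
// replica, in milliseconds. The numbers are made up for illustration.
struct Replica { const char* region; int delay_ms; };

// Acknowledge as soon as the local replica has the write: lowest latency,
// but a read in the other region may return stale data until replication
// catches up (paying in consistency).
int ack_local(const std::vector<Replica>& replicas) {
    return replicas.front().delay_ms;
}

// Acknowledge only after all replicas have the write: every region sees the
// latest value, but the client waits for the slowest link (paying in latency).
int ack_all(const std::vector<Replica>& replicas) {
    int worst = 0;
    for (const auto& r : replicas) worst = std::max(worst, r.delay_ms);
    return worst;
}

int main() {
    std::vector<Replica> replicas = {
        {"us-east (local)", 1},
        {"eu-west (remote)", 80},
    };
    std::printf("ack after local replica only: %d ms\n", ack_local(replicas));
    std::printf("ack after all replicas:       %d ms\n", ack_all(replicas));
}

Waiting on the slowest replica is what buys the stronger guarantee; skipping that wait is exactly where consistency is given up, even with no partition in sight.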
Daniel and Kostja on Industry Shifts

Daniel: We are seeing a lot of systems now giving you the best of both worlds. You don’t want to do consistency at the application level. You really want to have a database that can take care of the consistency for you. It can often do it faster than the application can deal with it. Also, you see bugs coming up all the time in the application layer. It’s hard to get all those corner cases right. It’s not impossible, but it’s just so hard. In many cases, it’s just worth paying the cost to get the consistency guaranteed in the system and be working with a rock-solid system. On the other hand, sometimes you need performance. Sometimes users can’t tolerate 20 milliseconds – it’s just too long. And sometimes you don’t need consistency. It makes sense to have both options. ScyllaDB is one example of this, and there are also other systems providing options for users. I think it’s a good thing.

Kostja: I want to say more about the complexity problem. There was a research study on Ruby on Rails, Python, and Go applications, looking at how they actually use strongly consistent databases and the different consistency levels in the SQL standard. It discovered that most of the applications have potential issues simply because they use the default settings of transactional databases, such as snapshot isolation rather than serializable isolation. Applied complexity has to be taken into account. Building applications is more difficult, and even more diverse, than building databases. So you have to push the problem down to the database layer and provide strong consistency there to make all the data layers simpler. It makes a lot of sense.

Daniel: Yes, that was Peter Bailis’ 2015 UC Berkeley Ph.D. thesis, Coordination Avoidance in Distributed Databases. Very nice comparison. What I was saying was that they at least know what they’re getting; they just tried to design around it and hit bugs. But what you’re saying is even worse: they don’t even know what they’re getting into. They’re just using the defaults, not getting full isolation and not getting full consistency – and they don’t even know what happened.

Continuing the Database Consistency Conversation

Intrigued by database consistency? Here are some places to learn more:

Join us at ScyllaDB Summit, especially for Kostja’s tech talk on Day 1
Look at Daniel’s articles and talks
Look at Kostja’s articles and talks