Cloud Database Rewards, Risks & Tradeoffs

Considering a fully managed cloud database? Here are the top rewards, risks, and trade-offs related to performance and cost.

What do you really gain – and give up – when moving to a fully managed cloud database?

Now that managed cloud database offerings have been “battle tested” in production for a decade, how is the reality matching up to the expectation? What can teams thinking of adopting a fully managed cloud database learn from teams who have years of experience working with this deployment model?

We’ve found that most teams are familiar with the admin/management aspects of database as a service (a.k.a. “DBaaS”). But let’s zero in on the top risks, rewards, and trade-offs related to two aspects that commonly catch teams off guard: performance and cost.

Bonus: Hear your peers’ firsthand cloud database experiences at ScyllaDB Summit on-demand

Cloud Database Performance

Performance Rewards

Using a cloud database makes it extremely easy for you to place your data close to your application and your end users. Most also support multiregion replication, which lets you deploy an always-on architecture with just a few clicks. This simplicity makes it feasible to run specialized use cases, such as “smart replication.” For example, think about a worldwide media streaming service where you have one catalog tailored to users living in Brazil and a different catalog for users in the United States. Or, consider a sports betting use case where you have users all around the globe and you need to ensure that as the game progresses, updated odds are rolled out to all users at the exact same time to “level the playing field” (this is the challenge that ZeroFlucs tackled very impressively).

The ease of scale is another potential performance-related reward. To reap this reward, be sure to test that your selected cloud database is resilient enough to sustain sudden traffic spikes. Most vendors let you quickly scale out your deployment, but beware of solutions that don’t let you transition between “tiers.” After all, you don’t want to find yourself in a situation where it’s Black Friday and your application can’t meet the demands. Managed cloud database options like ScyllaDB Cloud let you add as many nodes as needed to satisfy any traffic surges that your business is fortunate enough to experience.

Performance Risks

One performance risk is the unpredictable cost of scale. Know your throughput requirements and how much growth you anticipate. If you’re running up to a few thousand operations per second, a pay-per-operations service model probably makes sense. But as you grow to tens of thousands of operations per second and beyond, it can become quite expensive. Many high-growth companies opt for pricing models that don’t charge by the number of operations you run, but rather charge for the infrastructure you choose.

Also, be mindful of potential hidden limits or quotas that your cloud database provider may impose. For example, DynamoDB caps item size at 400KB; if you try to exceed that, the operation is simply refused. Moreover, your throughput could be throttled if you exceed your allowance, or if the vendor imposes a hard limit on the number of operations you can run against a single partition. Throttling severely increases latency, which may be unacceptable for real-time applications. If this matters to you, look for a cloud database model that doesn’t impose workload restrictions. With offerings that use infrastructure-based cost models, there are no artificial traffic limits; you can push as far as the underlying hardware can handle.

Performance Trade-offs

It’s crucial to remember that a fully-managed cloud database is fundamentally a business model. As your managed database vendor contributes to your growth, it also captures more revenue. Despite the ease of scalability, many vendors limit your scaling options to a specific range, potentially not providing the most performant infrastructure. For example, perhaps you have a real-time workload that reads a lot of cold data in such a way that I/O access is really important to you, but your vendor simply doesn’t support provisioning your database on top of NVMe (nonvolatile memory express) storage.

Having a third party responsible for all your core database tasks obviously simplifies maintenance and operations. However, if you encounter performance issues, your visibility into the problem could be reduced, limiting your troubleshooting capabilities. In such cases, close collaboration with your vendor becomes essential for identifying the root cause. If visibility and fast resolution matter to you, opt for cloud database solutions that offer comprehensive visibility into your database’s performance.

Cloud Database Costs

Cost Rewards

Adopting a cloud database eliminates the need for physical infrastructure and dedicated staff. You don’t have to invest in hardware or its maintenance because the infrastructure is provided and managed by the DBaaS provider. This shift results in significant cost savings, allowing you to allocate resources more effectively toward core operations, innovation and customer experience rather than spending on hardware procurement and management.

Furthermore, using managed cloud databases reduces staffing costs by transferring responsibilities such as DevOps and database administration to the vendor. This eliminates the need for a specialized in-house database team, enabling you to optimize your workforce and allocate relevant staff to more strategic initiatives.

There’s also a benefit in deployment flexibility. Leading providers typically offer two pricing models: pay-as-you-go and annual pricing. The pay-as-you-go model eliminates upfront capital requirements and allows for cost optimization by aligning expenses with actual database usage. This flexibility is particularly beneficial for startups or organizations with limited resources.

Most cloud database vendors offer a standard model where the customer’s database sits on the vendor’s cloud provider infrastructure. Alternatively, there’s a “bring your own account” model, where the database remains on your organization’s cloud provider infrastructure. This deployment is especially advantageous for enterprises with established relationships with their cloud providers, potentially leading to cost savings through pre-negotiated discounts. Additionally, by keeping the database resources on your existing infrastructure, you avoid dealing with additional security concerns. It also allows you to manage your database as you manage your other existing infrastructure.

Cost Risks

Although a cloud database offers scalability, the expense of scaling your database may not follow a straightforward or easily predictable pattern. Increased workload from applications can lead to unexpected spikes or sudden scaling needs, resulting in higher costs (as mentioned in the previous section). As traffic or data volume surges, the resource requirements for the database may significantly rise, leading to unforeseen expenses as you need to scale up. It is crucial to closely monitor and analyze the cost implications of scaling to avoid budget surprises.

Additionally, while many providers offer transparent pricing, there may still be hidden costs. These costs often arise from additional services or specific features not covered by the base pricing. For instance, specialized support or advanced features for specific use cases may incur extra charges. It is essential to carefully review the service-level agreements and pricing documentation provided by your cloud database provider to identify any potential hidden costs.

Here’s a real-life example: One of our customers recently moved over from another cloud database vendor. At that previous vendor, they encountered massive unexpected variable costs, primarily associated with network usage. This “cloud bill shock” resulted in some internal drama; some engineers were fired in the aftermath.

Understanding and accounting for these hidden, unanticipated costs is crucial for accurate budgeting and effective cost management. This ensures a comprehensive understanding of the total cost of ownership and enables more informed decisions on the most cost-effective approach for your organization. Given the high degree of vendor lock-in involved in most options, it’s worth thinking about this long and hard. You don’t want to be forced into a significant application rewrite because your solution isn’t sustainable from a cost perspective and doesn’t have any API-compatible paths out.

Cost Trade-offs

The first cost trade-off associated with using a cloud database involves limited cost optimizations. While managed cloud solutions offer some cost-saving features, they might limit your ability to optimize costs to the same extent as self-managed databases. Constraints imposed by the service provider may restrict actions like optimizing hardware configurations or performance tuning. They also provide standardized infrastructure that caters to a broad variety of use cases. On the one hand, this simplifies operations. On the other hand, one size does not fit all. It could limit your ability to implement highly customized cost-saving strategies by fine-tuning workload-specific parameters and caching strategies. The bottom line here: carefully evaluate these considerations to determine the impact on your cost optimization efforts.

The second trade-off pertains to cost comparisons and total cost of ownership. When comparing costs between vendor-managed and self-managed databases, conducting a total cost of ownership analysis is essential. Consider factors such as hardware, licenses, maintenance, personnel and other operational expenses associated with managing a database in-house, and then compare these costs against the ongoing subscription fees and additional expenses related to the cloud database solution. Evaluate the long-term financial impact of using a fully managed cloud database versus managing the database infrastructure in-house. Then, with this holistic view of the costs, decide what’s best given your organization’s specific requirements and budget considerations.

Additional Database Deployment Model Considerations

Although a database as a service (DBaaS) deployment will certainly shield you from many infrastructure and hardware decisions during your selection process, a fundamental understanding of the generic compute resources required by any database is important for identifying potential bottlenecks that may limit performance.

For an overview of the critical considerations and tradeoffs when selecting CPUs, memory, storage and networking for your distributed database infrastructure, see Chapter 7 of the free book, “Database Performance at Scale.”

After an introduction to the hardware that’s involved in every deployment model, whether you think about it or not, that book chapter shifts focus to different deployment options and their impact on performance. You’ll learn about the special considerations associated with cloud-hosted deployments, serverless, containerization and container orchestration technologies such as Kubernetes.

Access the complete “Database Performance at Scale” book free, courtesy of ScyllaDB.

New ScyllaDB Enterprise Release: Up to 50% Higher Throughput, 33% Lower Latency

Performance improvements, encryption at rest, Repair Based Node Operations, consistent schema management using Raft & more 

ScyllaDB Enterprise 2024.1.0 LTS, a production-ready ScyllaDB Enterprise Long Term Support Major Release, is now available! It introduces significant performance improvements: up to 50% higher throughput, 35% greater efficiency, and 33% lower latency. It also adds encryption at rest, Repair Based Node Operations (RBNO) for all operations, and numerous improvements and bug fixes. Additionally, consistent schema management using Raft will be enabled automatically upon upgrade (see below for more details). The new release is based on ScyllaDB Open Source 5.4.

In this blog, we’ll highlight the new capabilities that our users have been asking about most frequently. For the complete details, read the release notes.

ScyllaDB Enterprise customers are encouraged to upgrade to ScyllaDB Enterprise 2024.1, and are welcome to contact our Support Team with questions.

Read the detailed release notes

Performance Improvements

2024.1 includes many runtime and build performance improvements that translate to:

  • Higher throughput per vCPU and server
  • Lower mean and P99 latency

ScyllaDB 2024.1 vs ScyllaDB 2023.1

Throughput tests

2024.1 has up to 50% higher throughput than 2023.1. In some cases, this can translate to a 35% reduction in the number of vCPUs required to support a similar load. This enables a similar reduction in vCPU cost.

Latency tests

Latency tests were performed at 50% of the maximum throughput tested.
As demonstrated below, the latency (both mean and P99) is 33% lower, even with the higher throughput.

Test Setup

Amazon EC2
instance_type_db: i3.2xlarge (8 cores)
instance_type_loader: c4.2xlarge

Test profiles:
cassandra-stress [mixed|read|write] no-warmup cl=QUORUM duration=50m -schema 'replication(factor=3)' -mode cql3 native -rate threads=100 -pop 'dist=gauss(1..30000000,15000000,1500000)'

Note that these results are for tests performed on a small i3 server (i3.2xlarge). ScyllaDB scales linearly with the number of cores and achieves much better results for the i4i instance type.

ScyllaDB Enterprise 2024.1 vs ScyllaDB Open Source 5.4

ScyllaDB Enterprise 2024.1 is based on ScyllaDB Open Source 5.4, but includes enterprise-only performance optimizations. As shown below, the throughput gain is significant and latency is lower.

These tests use the same setup and parameters detailed above.

Encryption at Rest (EaR) Enhancements

This new release includes enhancements to Encryption at Rest (EaR), including a new Amazon KMS integration and extended cluster-level encryption at rest. Together, these improvements allow you to easily use your own key for cluster-wide EaR.

ScyllaDB Enterprise has supported Encryption at Rest (EaR) for some time. Until now, users could store the keys for EaR locally, in an encrypted table, or in an external KMIP server. This release adds the ability to:

  • Use Amazon KMS to store and manage keys.
  • Set default EaR parameters (including the new KMS) for *all* cluster tables.

These are both detailed below.

Amazon KMS Integration for Encryption at Rest

ScyllaDB can now use a Customer Managed Key (CMK), stored in AWS KMS, to create, encrypt, and decrypt data keys (DEKs), which are then used to encrypt and decrypt the data in storage (such as SSTables, commit logs, batches, and hint logs).

KMS creates DEK from CMK:

DEK (plain text version) is used to encrypt the data at rest:

Diagrams are from: https://docs.aws.amazon.com/kms/latest/developerguide/concepts.html#data-keys

Before using KMS, you need to set KMS as a key provider and validate that ScyllaDB nodes have permission to access and use the CMK you created in KMS. Once you do that, you can use the CMK in the CREATE and ALTER TABLE commands with KmsKeyProviderFactory, as follows:

CREATE TABLE myks.mytable (......) WITH
  scylla_encryption_options = {
    'cipher_algorithm': 'AES/CBC/PKCS5Padding',
    'secret_key_strength': 128,
    'key_provider': 'KmsKeyProviderFactory',
    'kms_host': 'my_endpoint'
  };

Where “my_endpoint” points to a section in scylla.yaml:

kms_hosts:
  my_endpoint:
    aws_use_ec2_credentials: true
    aws_use_ec2_region: true
    master_key: alias/MyScyllaKey

You can also use the KMS provider to encrypt system-level data. See more examples and info here.

Transparent Data Encryption

Transparent Data Encryption (TDE) adds a way to define Encryption at Rest parameters per cluster, not only per table.
This allows the system administrator to enforce encryption of *all* tables using the same master key (e.g., from KMS) without specifying the encryption parameters per table. For example, with the following in scylla.yaml, all tables will be encrypted using the encryption parameters of my_kms1:

user_info_encryption:
  enabled: true
  key_provider: KmsKeyProviderFactory
  kms_host: my_kms1

See more examples and info here.

Repair Based Node Operations (RBNO)

RBNO provides more robust, reliable, and safer data streaming for node operations like node replace and node add/remove. In particular, a failed node operation can resume from the point where it stopped – without re-sending data that has already been synced. In addition, with RBNO enabled, you don’t need to repair before or after node operations such as replace or removenode.

In this release, RBNO is enabled by default for all operations: remove node, rebuild, bootstrap, and decommission. It was already enabled by default for the replace node operation.

For details, see the Repair Based Node Operations (RBNO) docs and the blog, Faster, Safer Node Operations with Repair vs Streaming.

Node-Aggregated Table Level Metrics

Most ScyllaDB metrics are reported per shard and per node, but not per table. We now export some metrics per table. These are exported once per node, not per shard, to limit the total number of metrics.

Guardrails

Guardrails is a framework to protect ScyllaDB users and admins from common mistakes and pitfalls. In this release, ScyllaDB includes a new guardrail on the replication factor. It is now possible to specify the minimum replication factor for new keyspaces via a new configuration item.

Security

In addition to the EaR enhancements above, the following security features were introduced in 2024.1:

Encryption in transit: TLS certificates

It is now possible to use TLS certificates to authenticate and authorize a user to ScyllaDB. The system can be configured to derive the user role from the client certificate and derive the permissions the user has from that role. #10099
Learn more in the certificate-authentication docs.

FIPS Tolerant

ScyllaDB Enterprise can now run on FIPS-enabled Ubuntu, using libraries that were compiled with FIPS enabled, such as OpenSSL, GnuTLS, and more.

Strongly Consistent Schema Management with Raft

Strongly Consistent Schema Management with Raft became the default for new clusters in ScyllaDB Enterprise 2023.1. In this release, it is enabled by default when upgrading existing clusters. Learn more in the blog, ScyllaDB’s Path to Strong Consistency: A New Milestone.

 

Read the detailed release notes

Support for AWS PrivateLink On Instaclustr for Apache Cassandra® is now GA

Instaclustr is excited to announce the general availability of AWS PrivateLink with Apache Cassandra® on the Instaclustr Managed Platform. This release follows the announcement of the new feature in public preview last year.  

Support for AWS PrivateLink with Cassandra provides our AWS customers with a simpler and more secure option for network cross-account connectivity, to expose an application in one VPC to other users or applications in another VPC.  

Network connections to an AWS PrivateLink service can only be one-directional from the requestor to the destination VPC. This prevents network connections being initiated from the destination VPC to the requestor and creates an additional measure of protection from potential malicious activity. 

All resources in the destination VPC are masked and appear to the requestor as a single AWS PrivateLink service. The AWS PrivateLink service manages access to all resources within the destination VPC. This significantly simplifies cross-account network setup compared to authorizing peering requests and configuring route tables and security groups when establishing VPC peering.

The Instaclustr team has worked with care to integrate the AWS PrivateLink service for your AWS Managed Cassandra environment to give you a simple and secure cross-account network solution with just a few clicks.  

Fitting AWS PrivateLink to Cassandra is not a straightforward task as AWS PrivateLink exposes a single IP proxy per AZ, and Cassandra clients generally expect direct access to all Cassandra nodes. To solve this problem, the development of Instaclustr’s AWS PrivateLink service has made use of Instaclustr’s Shotover Proxy in front of your AWS Managed Cassandra clusters to reduce cluster IP addresses from one-per-node to one-per-rack, enabling the use of a load balancer as required by AWS PrivateLink.  

By managing database requests in transit, Shotover gives Instaclustr customers AWS PrivateLink’s simple and secure network setup with the benefits of Managed Cassandra. Keep a look out for an upcoming blog post with more details on the technical implementation of AWS PrivateLink for Managed Cassandra. 

AWS PrivateLink is offered as an Instaclustr Enterprise feature, available at an additional charge of 20% on top of the node cost for the first feature enabled by you. The Instaclustr console will provide a summary of node prices or management units for your AWS PrivateLink enabled Cassandra cluster, covering both Cassandra and Shotover node size options and prices when you first create an AWS PrivateLink enabled Cassandra cluster. Information on charges from AWS is available here. 

Log in to the Console to add support for AWS PrivateLink to your AWS Managed Cassandra clusters with just one click today. Alternatively, support for AWS PrivateLink for Managed Cassandra is available via the Instaclustr API or Terraform.

Please reach out to our Support team for any assistance with AWS PrivateLink for your AWS Managed Cassandra clusters. 


ScyllaDB Summit 2024 Recap: An Inside Look

A rapid rundown of the whirlwind database performance event

Hello readers, it’s great to be back in the US to help host ScyllaDB Summit, 2024 edition. What a great virtual conference it has been – and once again, I’m excited to share the behind-the-scenes perspective as one of your hosts.

First, let’s thank all the presenters once again for contributions from around the world. With 30 presentations covering all things ScyllaDB, it made for a great event.

To kick things off, we had the now-famous Felipe Cardeneti Mendes host the ScyllaDB lounge and get straight into answering questions. Once the audience got a taste for it (and realized the depth of Felipe’s technical acumen), there was an unstoppable stream of questions for him. Felipe was so popular that he became a meme for the conference!

The first morning, we also trialed something new by running hands-on labs aimed at both novice and advanced users. These live streams were a hit, with over 1000 people in attendance and everyone keen to get their hands on ScyllaDB. If you’d like to continue that experience, be sure to check out the self-paced ScyllaDB University and the interactive instructor-led ScyllaDB University LIVE event coming up in March. Both are free and virtual!

Let’s recap some standout presentations from the first day of the conference. The opening keynote, by CEO and co-founder Dor Laor, was titled ScyllaDB Leaps Forward. This is a must-see presentation. It provides the background context you need to understand tablet architecture and the direction that ScyllaDB is headed: not only the fastest database in terms of latency (at any scale), but also the quickest to scale in terms of elasticity. The companion keynote on day two from CTO and co-founder Avi Kivity completes the loop and explains in more detail why ScyllaDB is making this major architecture shift from vNodes replication to tablets. Take a look at Tablets: Rethink Replication for more insights.

The second keynote, from Discord Staff Engineer Bo Ingram, opened with the premise So You’ve Lost Quorum: Lessons From Accidental Downtime and shared how to diagnose issues in your clusters and how to avoid making a fault too big to tolerate. Bo is a talented storyteller and published author. Be sure to watch this keynote for great tips on how to handle production incidents at scale. And don’t forget to buy his book, ScyllaDB in Action for even more practical advice on getting the most out of ScyllaDB.

Download the first 4 chapters for free

An underlying theme for the conference was exploring individual customers’ migration paths from other databases onto ScyllaDB. To that end, we were fortunate to hear from JP Voltani, Head of Engineering at Tractian, on their Experience with Real-Time ML and the reasons why they moved from MongoDB to ScyllaDB to scale their data pipeline. Working with over 5B samples from +50K IoT devices, they were able to achieve their migration goals. Felipe’s presentation on MongoDB to ScyllaDB: Technical Comparison and the Path to Success then detailed the benchmarks, processes, and tools you need to be successful for these types of migrations. There were also great presentations looking at migration paths from DynamoDB and Cassandra; be sure to take a look at them if you’re on any of those journeys.

A common component in customer migration paths was the use of Change Data Capture (CDC) and we heard from Expedia on their migration journey from Cassandra to ScyllaDB. They cover the aspects and pitfalls the team needed to overcome as part of their Identity service project. If you are keen to learn more about this topic, then Guilherme’s presentation on Real-Time Event Processing with CDC is a must-see.

Martina Alilović Rojnić gave us the Strategy Behind ReversingLabs’ Massive Key-Value Migration, which had mind-boggling scale: migrating more than 300 TB of data and over 400 microservices from their bespoke key-value store to ScyllaDB – with ZERO downtime. An impressive feat of engineering!

ShareChat shared everything about Getting the Most Out of ScyllaDB Monitoring. This is a practical talk about working with non-standard ScyllaDB metrics to analyze the remaining cluster capacity, debug performance problems, and more. Definitely worth a watch if you’re already running ScyllaDB in production.

After a big day hosting the conference combined with the fatigue of international travel from Australia, assisted with a couple of cold beverages the night after, sleep was the priority. Well rested and eager for more, we launched early into day two of the event with more great content.

Leading the day was Avi’s keynote, which I already mentioned above. Equally informative was the following keynote from Miles Ward and Joe Shorter on Radically Outperforming DynamoDB. If you’re looking for more reasons to switch, this was a good presentation to learn from, including details of the migration and using ScyllaDB Cloud with Google Cloud Platform.

Felipe delivered another presentation (earning him MVP of the conference) about using workload prioritization features of ScyllaDB to handle both Real-Time and Analytical workloads, something you might not ordinarily consider compatible. I also enjoyed Piotr’s presentation on how ScyllaDB Drivers take advantage of the unique ScyllaDB architecture to deliver high-performance and ultra low-latencies. This is yet another engineering talk showcasing the strengths of ScyllaDB’s feature sets. Kostja Osipov set the stage for this on Day 1. Kostja consistently delivers impressive Raft talks year after year, and his Topology on Raft: An Inside Look talk is another can’t miss. There’s a lot there! Give it a (re)watch if you want all the details on how Raft is implemented in the new releases and what it all means for you, from the user perspective.

We also heard from Kishore Krishnamurthy, CTO at ZEE5, giving us insights into Steering a High-Stakes Database Migration. It’s always interesting to hear the executive-level perspective on the direction you might take to reduce costs while maintaining your SLAs. There were also more fascinating insights from ZEE5 engineers on how they are Tracking Millions of Heartbeats on Zee’s OTT Platform. Solid technical content.

In a similar vein, proving that “simple” things can still present serious engineering challenges when things are distributed and at scale, Edvard Fagerholm showed us how Supercell Persists Real-Time Events. Edvard illustrated how ScyllaDB helps them process real-time events for their games like Clash of Clans and Clash Royale with hundreds of millions of users.

The day flew by. Before long, we were wrapping up the conference. Thanks to the community of 7500 for great participation – and thousands of comments – from start to finish. I truly enjoyed hosting the introductory lab and getting swarmed by questions. And no, I’m not a Kiwi! Thank you all for a wonderful experience.

Seastar, ScyllaDB, and C++23

Seastar now supports C++20 and C++23 (and dropped support for C++17)

Seastar is an open-source (Apache 2.0 licensed) C++ framework for I/O-intensive asynchronous computing, using the thread-per-core model. Seastar underpins several high-performance distributed systems: ScyllaDB, Redpanda, and Ceph Crimson. Seastar source is available on GitHub.

Background

As a C++ framework, Seastar must choose which C++ versions to support. The support policy is last-two-versions. That means that at any given time, the most recently released version as well as the previous one are supported, but earlier versions cannot be expected to work. This policy gives users of the framework three years to upgrade to the next C++ edition while not constraining Seastar to ancient versions of the language.

Now that C++23 has been ratified, Seastar officially supports C++20 and C++23. The previously supported C++17 is no longer supported.

New features in C++23

We will focus here on C++23 features that are relevant to Seastar users; this isn’t a comprehensive review of C++23 changes. For an overview of C++23 additions, consult the cppreference page.

std::expected

std::expected is a way to communicate error conditions without exceptions. This is useful since exception handling is very slow in most C++ implementations. In a way, it is similar to std::future and seastar::future: they are all variant types that can hold values and errors, though, of course, futures also represent concurrent computations, not just values.

So far, ScyllaDB has used boost::outcome for the same role that std::expected fills. This improved ScyllaDB’s performance under overload conditions. We’ll likely replace it with std::expected soon, and integration into Seastar itself is a good area for extending Seastar.
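Here is a minimal, self-contained sketch (not taken from Seastar or ScyllaDB) of how std::expected reports a failure as a value rather than by throwing; the parse_port function and its string error type are hypothetical choices for illustration:

#include <charconv>
#include <cstdint>
#include <expected>
#include <iostream>
#include <string>
#include <string_view>

// Parse a TCP port, reporting failure as a value instead of throwing.
std::expected<uint16_t, std::string> parse_port(std::string_view s) {
    uint16_t port{};
    auto [ptr, ec] = std::from_chars(s.data(), s.data() + s.size(), port);
    if (ec != std::errc{} || ptr != s.data() + s.size()) {
        return std::unexpected("not a valid port: " + std::string(s));
    }
    return port;
}

int main() {
    for (std::string_view input : {"9042", "not-a-port"}) {
        auto port = parse_port(input);
        if (port.has_value()) {
            std::cout << "port = " << *port << '\n';
        } else {
            std::cout << "error: " << port.error() << '\n';  // no exception was thrown
        }
    }
}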

std::flat_set and std::flat_map

These new containers reduce allocations compared to their traditional variants and are suitable for request processing in Seastar applications. Seastar itself won’t use them since it still maintains C++20 compatibility, but Seastar users should consider them, along with abseil containers.
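A small illustrative sketch, assuming a standard library that already ships the C++23 flat containers (the per-datacenter replication-factor mapping is just a made-up example):

#include <flat_map>
#include <iostream>
#include <string>

int main() {
    // Keys and values live in sorted contiguous storage, so building and
    // querying this map allocates far less than a node-based std::map.
    std::flat_map<std::string, int> replication_factor;
    replication_factor.emplace("ap-south", 2);
    replication_factor.emplace("eu-west", 3);
    replication_factor.emplace("us-east", 3);

    if (replication_factor.contains("eu-west")) {
        std::cout << "eu-west RF = " << replication_factor.at("eu-west") << '\n';
    }
    for (const auto& [dc, rf] : replication_factor) {  // iterates in key order
        std::cout << dc << " -> " << rf << '\n';
    }
}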

Retiring C++17 support

As can be seen from the previous section, C++23 does not have a dramatic impact on Seastar. The retirement of C++17, however, does: we can now fully use C++20-only features in Seastar itself.

Coroutines

C++20 introduced coroutines, which make asynchronous code both easier to write and more efficient (a very rare tradeoff). Seastar applications could already use coroutines freely, but Seastar itself could not, due to the need to support C++17. Since all supported C++ editions now have coroutines, continuation-style code will be replaced by coroutines where this makes sense.
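As a rough sketch of what that shift looks like (illustrative code, not an excerpt from Seastar, and assuming a working Seastar build), the same asynchronous step can be written either with a continuation or as a coroutine:

#include <seastar/core/app-template.hh>
#include <seastar/core/coroutine.hh>
#include <seastar/core/sleep.hh>
#include <chrono>
#include <iostream>

using namespace std::chrono_literals;

// Continuation style: the logic is spread across lambdas passed to then().
seastar::future<int> answer_with_continuations() {
    return seastar::sleep(10ms).then([] {
        return seastar::make_ready_future<int>(42);
    });
}

// Coroutine style: the same behavior, reading top to bottom.
seastar::future<int> answer_with_coroutine() {
    co_await seastar::sleep(10ms);
    co_return 42;
}

int main(int argc, char** argv) {
    seastar::app_template app;
    return app.run(argc, argv, []() -> seastar::future<> {
        std::cout << co_await answer_with_continuations() << '\n';
        std::cout << co_await answer_with_coroutine() << '\n';
        co_return;
    });
}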

std::format

Seastar has long been using the wonderful {fmt} library. Since it was standardized as std::format in C++20, we may drop this dependency in favor of the standard library version.
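For example (a trivial sketch; this requires a compiler and standard library with C++20 formatting support):

#include <format>
#include <iostream>

int main() {
    // The formatting mini-language mirrors the {fmt} API that Seastar already uses.
    std::cout << std::format("shard {} handled {} requests\n", 0, 12345);
}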

The std::ranges library

Another long-time dependency, the Boost.Range library, can now be replaced by its modern equivalent std::ranges. This promises better compile times, and, more importantly, better compile-time error reporting as the standard library uses C++ concepts to point out programmer errors more directly.
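For instance, a Boost.Range-style pipeline translates almost mechanically to std::ranges (an illustrative sketch, not Seastar code):

#include <iostream>
#include <ranges>
#include <vector>

int main() {
    std::vector<int> values{1, 2, 3, 4, 5, 6, 7, 8};

    // Lazily filter the even values and square them; no intermediate containers.
    auto even_squares = values
                      | std::views::filter([](int v) { return v % 2 == 0; })
                      | std::views::transform([](int v) { return v * v; });

    for (int v : even_squares) {
        std::cout << v << ' ';  // prints: 4 16 36 64
    }
    std::cout << '\n';
}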

Concepts

As concepts were introduced in C++20, they can now be used unconditionally in Seastar. Previously, they were only used when C++20 mode was active, which somewhat restricted what could be done with them.
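A small generic sketch of the kind of constraint this enables (not an excerpt from Seastar):

#include <concepts>
#include <iostream>
#include <type_traits>

// Constrain a template at its interface: a caller passing a callable with the
// wrong result type gets a short diagnostic at the call site instead of a
// deep instantiation error.
template <typename Func>
requires std::invocable<Func> && std::convertible_to<std::invoke_result_t<Func>, int>
int run_task(Func&& f) {
    return f();
}

int main() {
    std::cout << run_task([] { return 42; }) << '\n';
    // run_task([] { return "oops"; });  // rejected by the concept, not deep inside run_task
}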

Conclusion

C++23 isn’t a revolution for C++ users in general and Seastar users in particular, but it does reduce the dependency on third-party libraries for common tasks. Concurrent with its adoption, dropping C++17 allows us to continue modernizing and improving Seastar.

Distributed Database Consistency: Dr. Daniel Abadi & Kostja Osipov Chat

Dr. Daniel Abadi (University of Maryland) and Kostja Osipov (ScyllaDB) discuss PACELC, CAP theorem, Raft, and Paxos

Database consistency has been a strongly consistent theme at ScyllaDB Summit over the past few years – and we guarantee that will continue at ScyllaDB Summit 2024 (free + virtual). Co-founder Dor Laor’s opening keynote on “ScyllaDB Leaps Forward” includes an overview of the latest milestones on ScyllaDB’s path to immediate consistency. Kostja Osipov (Director of Engineering) then shares the details behind how we’re implementing this shift with Raft and what the new consistent metadata management updates mean for users. Then on Day 2, Avi Kivity (Co-founder) picks up this thread in his keynote introducing ScyllaDB’s revolutionary new tablet architecture – which is built on the foundation of Raft.

Update: ScyllaDB Summit 2024 is now a wrap!

Access ScyllaDB Summit On Demand

ScyllaDB Summit 2023 featured two talks on database consistency. Kostja Osipov shared a preview of Raft After ScyllaDB 5.2: Safe Topology Changes (also covered in this blog series). And Dr. Daniel Abadi, creator of the PACELC theorem, explored The Consistency vs Throughput Tradeoff in Distributed Databases.

After their talks, Daniel and Kostja got together to chat about distributed database consistency. You can watch the full discussion below.

Here are some key moments from the chat…

What is the CAP theorem and what is PACELC

Daniel: Let’s start with the CAP theorem. That’s the more well-known one, and that’s the one that came first historically. Some say it’s a three-way tradeoff, some say it’s a two-way tradeoff. It was originally described as a three-way tradeoff: out of consistency, availability, and tolerance to network partitions, you can have two of them, but not all three. That’s the way it’s defined. The intuition is that if you have a copy of your data in America and a copy of your data in Europe and you want to do a write in one of those two locations, you have two choices.

You do it in America, and then you say it’s done before it gets to Europe. Or, you wait for it to get to Europe, and then you wait for it to occur there before you say that it’s done. In the first case, if you commit and finish a transaction before it gets to Europe, then you’re giving up consistency because the value in Europe is not the most current value (the current value is the write that happened in America). But if America goes down, you could at least respond with stale data from Europe to maintain availability.

PACELC is really an extension of the CAP theorem. The PAC of PACELC is CAP. Basically, that’s saying that when there is a network partition, you must choose either availability or consistency. But the key point of PACELC is that network partitions are extremely rare. There’s all kinds of redundant ways to get a message from point A to point B.

So the CAP theorem is kind of interesting in theory, but in practice, there’s no real reason why you have to give up C or A. You can have both in most cases because there’s never a network partition. Yet we see many systems that do give up on consistency. Why? The main reason why you give up on consistency these days is latency. Consistency just takes time. Consistency requires coordination. You have to have two different locations communicate with each other to be able to remain consistent with one another. If you want consistency, that’s great. But you have to pay in latency. And if you don’t want to pay that latency cost, you’re going to pay in consistency. So the high-level explanation of the PACELC theorem is that when there is a partition, you have to choose between availability and consistency. But in the common case where there is no partition, you have to choose between latency and consistency.

[Read more in Dr. Abadi’s paper, Consistency Tradeoffs in Modern Distributed Database System Design]

In ScyllaDB, when we talk about consensus protocols, there’s Paxos and Raft. What’s the purpose for each?

Kostja: First, I would like to second what Dr. Abadi said. This is a tradeoff between latency and consistency. Consistency requires latency, basically. My take on the CAP theorem is that it was really oversold back in the 2010s. We were looking at this as a fundamental requirement, and we have been building systems as if we are never going to go back to strong consistency again. And now the train has turned around completely. Now many vendors are adding back strong consistency features.

For ScyllaDB, I’d say the biggest difference between Paxos and Raft is whether it’s a centralized algorithm or a decentralized algorithm. I think decentralized algorithms are just generally much harder to reason about. We use Raft for configuration changes, which we use as a basis for our topology changes (when we need the cluster to agree on a single state). The main reason we chose Raft was that it has been very well specified, very well tested and implemented, and so on. Paxos itself is not a multi-round protocol. You have to build on top of it; there are papers on how to build multi-Paxos on top of Paxos and how you manage configurations on top of that. If you are a practitioner, you need some very complete thing to build upon. Even when we were looking at Raft, we found quite a few open issues with the spec. That’s why both can co-exist. And I guess, we also have eventual consistency – so we could take the best of all worlds.

For data, we are certainly going to run multiple Raft groups. But this means that every partition is going to be its own consensus – running independently, essentially having its own leader. In the end, we’re going to have, logically, many leaders in the cluster. However, if you look at our schema and topology, there’s still a single leader. So for schema and topology, we have all of the members of the cluster in the same group. We do run a single leader, but this is an advantage because the topology state machine is itself quite complicated. Running in a decentralized fashion without a single leader would complicate it quite a bit more. For a layman, linearizable just means that you can very easily reason about what’s going on: one thing happens after another. And when you build algorithms, that’s a huge value. We build complex transitions of topology when you stream data from one node to another – you might need to abort this, you might need to coordinate it with another streaming operation, and having one central place to coordinate this is just much, much easier to reason about.

Daniel: Returning to what Kostja was saying. It’s not just that the trend (away from consistency) has started to reverse. I think it’s very true that people overreacted to CAP. It’s sort of like they used CAP as an excuse for why they didn’t create a consistent system. I think there are probably more systems than there should have been that might have been designed very differently if they didn’t drink the CAP Kool-aid so much. I think it’s a shame, and as Kostja said, it’s starting to reverse now.

Daniel and Kostja on Industry Shifts

Daniel: We are seeing sort of a lot of systems now, giving you the best of both worlds. You don’t want to do consistency at the application level. You really want to have a database that can take care of the consistency for you. It can often do it faster than the application can deal with it. Also, you see bugs coming up all the time in the application layer. It’s hard to get all those corner cases right. It’s not impossible but it’s just so hard. In many cases, it’s just worth paying the cost to get the consistency guaranteed in the system and be working with a rock-solid system. On the other hand, sometimes you need performance. Sometimes users can’t tolerate 20 milliseconds – it’s just too long. Sometimes you don’t need consistency. It makes sense to have both options. ScyllaDB is one example of this, and there are also other systems providing options for users. I think it’s a good thing.

Kostja: I want to say more about the complexity problem. There was this research study on Ruby on Rails, Python, and Go applications, looking at how they actually use strongly consistent databases and different consistency levels that are in the SQL standard. It discovered that most of the applications have potential issues simply because they use the default settings for transactional databases, like snapshot isolation and not serializable isolation. Applied complexity has to be taken into account. Building applications is more difficult and even more diverse than building databases. So you have to push the problem down to the database layer and provide strong consistency in the database layer to make all the data layers simpler. It makes a lot of sense.

Daniel: Yes, that was Peter Bailis’ 2015 UC Berkeley Ph.D. thesis, Coordination Avoidance in Distributed Databases. Very nice comparison. What I was saying was that they know what they’re getting, at least, and they just tried to design around it and they hit bugs. But what you’re saying is even worse: they don’t even know what they’re getting into. They’re just using the defaults and not getting full isolation and not getting full consistency – and they don’t even know what happened.

Continuing the Database Consistency Conversation

Intrigued by database consistency? Here are some places to learn more:

 

 

ScyllaDB Summit Speaker Spotlight: Miles Ward, CTO at SADA

SADA CTO Miles Ward shares a preview of his ScyllaDB Summit keynote with Joseph Shorter (VP of Platform Architecture at Digital Turbine) 

ScyllaDB Summit is now just days away! If database performance at scale matters to your team, join us to hear about your peers’ experiences and discover new ways to alleviate your own database latency, throughput, and cost pains. It’s free, virtual, and highly interactive.

While setting the virtual stage, we caught up with ScyllaDB Summit keynote speaker, SADA CTO, and electric sousaphone player Miles Ward. Miles and the ScyllaDB team go way back, and we’re thrilled to welcome him back to ScyllaDB Summit – along with Joseph Shorter, VP of Platform Architecture at Digital Turbine.

Miles and team worked with Joseph and team on a high stakes (e.g., “if we make a mistake, the business goes down”) and extreme scale DynamoDB to ScyllaDB migration. And to quantify “extreme scale,” consider this:

The keynote will be an informal chat about why and how they pulled off this database migration in the midst of a cloud migration (AWS to Google Cloud).

Update: ScyllaDB Summit 24 is completed! That means you can watch this session on demand.

Watch Miles and Joe On Demand

Here’s what Miles shared in our chat…

Can you share a sneak peek of your keynote? What should attendees expect?

Lots of tech talks speak in the hypothetical about features yet to ship, about potential and capabilities. Not here! The engineers at SADA and Digital Turbine are all done: we switched from the old and dusted DynamoDB on AWS to the new hotness Alternator via ScyllaDB on GCP, and the metrics are in! We’ll have the play-by-play, lessons learned, and the specifics you can use as you’re evaluating your own adoption of ScyllaDB.

You’re co-presenting with an exceptional tech leader, Digital Turbine’s Joseph Shorter. Can you tell us more about him?

Joseph is a stellar technical leader. We met as we were first connecting with Digital Turbine through their complex and manifold migration. Remember: they’re built by acquisition so it’s all-day integrations and reconciliations.

Joe stood out as utterly bereft of BS, clear about the human dimension of all the change his company was going through, and able to grapple with all the layers of this complex stack to bring order out of chaos.

Are there any recent developments on the SADA and Google Cloud fronts that might intrigue ScyllaDB Summit attendees?

Three!

GenAI is all the rage! SADA is building highly efficient systems using the power of Google’s Gemini and Duet APIs to automate some of the most rote, laborious tasks from our customers. None of this works if you can’t keep the data systems humming; thanks ScyllaDB!

New instances from GCP with even larger SSDs (critical for ScyllaDB performance!) are coming very soon. Perhaps keep your eyes out around Google Next (April 9-11!)

SADA just got snapped up by the incredible folks at Insight, so now we can help in waaaaaay more places and for customers big and small. If what Digital Turbine did sounds like something you could use help with, let me know!

How did you first come across ScyllaDB, and how has your relationship evolved over time?

I met Dor in Tel Aviv a long, long, (insert long pause), LONG time ago, right when ScyllaDB was getting started. I loved the value prop, the origin story, and the team immediately. I’ve been hunting down good use cases for ScyllaDB ever since!

What ScyllaDB Summit 24 sessions (besides your own, of course) are you most looking forward to and why?

One is the Disney talk, which is right after ours. We absolutely have to see how Disney’s DynamoDB migration compares to ours.

Another is Discord, I am an avid user of Discord. I’ve seen some of their performance infrastructure and they have very, very, very VERY high throughput so I’m sure they have some pretty interesting experiences to share.

Also, Dor and Avi of course!

Are you at liberty to share your daily caffeine intake?

Editor’s note: If you’ve ever witnessed Miles’ energy, you’d understand why we asked this question.

I’m pretty serious about my C8H10N4O2 intake. It is a delicious chemical that takes care of me. I’m steady with two shots from my La Marzocco Linea Mini to start the day right, with typically a post-lunch-fight-the-sleepies cappuccino. Plus, my lovely wife has a tendency to only drink half of basically anything I serve her, so I get ‘dregs’ or sips of tasty coffee leftovers on the regular. Better living through chemistry!

***
We just met with Miles and Joe, and that chemistry is amazing too. This is probably your once-in-a-lifetime chance to attend a talk that covers electric sousaphones and motorcycles along with databases – with great conversation, practical tips, and candid lessons learned.

Trust us: You won’t want to miss their keynote, or all the other great talks that your peers across the community are preparing for ScyllaDB Summit.

Register Now – It’s Free

From Redis & Aurora to ScyllaDB, with 90% Lower Latency and $1M Savings

How SecurityScorecard built a scalable and resilient security ratings platform with ScyllaDB

SecurityScorecard is the global leader in cybersecurity ratings, with millions of organizations continuously rated. Their rating platform provides instant risk ratings across ten groups of risk factors, including DNS health, IP reputation, web application security, network security, leaked information, hacker chatter, endpoint security, and patching cadence.

Nguyen Cao, Staff Software Engineer at SecurityScorecard, joined us at ScyllaDB Summit 2023 to share how SecurityScorecard’s scoring architecture works and why they recently rearchitected it. This blog shares his perspective on how they decoupled the frontend and backend services by introducing a middle layer for improved scalability and maintainability.

Spoiler: Their architecture shift involved a migration from Redis and Aurora to ScyllaDB. And it resulted in:

  • 90% latency reduction for most service endpoints
  • 80% fewer production incidents related to Presto/Aurora performance
  • $1M infrastructure cost savings per year
  • 30% faster data pipeline processing
  • Much better customer experience

Curious? Read on as we unpack this.

Join us at ScyllaDB Summit 24 to hear more firsthand accounts of how teams are tackling their toughest database challenges. Disney, Discord, Expedia, Paramount, and more are all on the agenda.

Register Now – It’s Free

SecurityScorecard’s Data Pipeline

SecurityScorecard’s mission is to make the world a safer place by transforming the way organizations understand, mitigate, and communicate cybersecurity to their boards, employees, and vendors. To do this, they continuously evaluate an organization’s security profile and report on ten key risk factors, each with a grade of A-F.

Here’s an abstraction of the data pipeline used to calculate these ratings:

Starting with signal collection, their global networks of sensors are deployed across over 50 countries to scan IPs, domains, DNS, and various external data sources to instantly detect threats. That result is then processed by the Attribution Engine and Cyber Analytics modules, which try to associate IP addresses and domains with vulnerabilities. Finally, the scoring engines compute a rating score. Nguyen’s team is responsible for the scoring engines that calculate the scores accessed by the frontend services.

Challenges with Redis and Aurora

The team’s previous data architecture served SecurityScorecard well for a while, but it couldn’t keep up with the company’s growth.

The platform API (shown on the left side of the diagram) received requests from end users, then made further requests to other internal services, such as the measurements service. That service then queried datastores such as Redis, Aurora, and Presto on HDFS. The scoring workflow (an enhancement of Apache Airflow) then generated a score based on the measurements and the findings of over 12 million scorecards.

This architecture met their needs for years. One aspect that worked well was using different datastores for different needs. They used three main datastores: Redis (for faster lookups of 12 million scorecards), Aurora (for storing 4 billion measurement stats across nodes), and a Presto cluster on HDFS (for complex SQL queries on historical results).

However, as the company grew, challenges with Redis and Aurora emerged.

Aurora and Presto latencies spiked under high throughput. The Scoring Workflow was running batch inserts of over 4B rows into Aurora throughout the day. Aurora has a primary/secondary architecture, where the primary node is solely responsible for writes and the secondary replicas are read-only. The main drawback of this approach was that their write-intensive workloads couldn’t keep up. Because their writes weren’t able to scale, this ultimately led to elevated read latencies because replicas were also overwhelmed. Moreover, Presto queries became consistently overwhelmed as the number of requests and amount of data associated with each scorecard grew. At times, latencies spiked to minutes, and this caused problems that impacted the entire platform.

The largest possible instance of Redis still wasn’t sufficient. The largest available Redis instance supported only 12M scorecards, but they needed to grow beyond that. They tried Redis Cluster for increased cache capacity, but determined that this approach would bring excessive complexity. For example, at the time, only the Python driver supported consistent hashing-based routing. That meant they would need to implement their own custom drivers to support this critical functionality on top of their Java and Node.js services.

HDFS didn’t allow fast score updating. The company wants to encourage customers to rapidly remediate reported issues – and the ability to show them an immediate score improvement offers positive reinforcement for good behavior. However, their HDFS configuration (with data immutability) meant that data ingestion to HDFS had to go through the complete scoring workflow first. This meant that score updates could be delayed for 3 days.

Maintainability. Their ~50 internal services were implemented in a variety of tech stacks (Go, Java, Node.js, Python, etc.). All these services directly accessed the various datastores, so they had to handle all the different queries (SQL, Redis, etc.) effectively and efficiently. Whenever the team changed the database schema, they also had to update all the services.

Moving to a New Architecture with ScyllaDB

To reduce latencies at the new scale that their rapid business growth required, the team moved to ScyllaDB Cloud and developed a new scoring API that routes less latency-sensitive requests to Presto + S3 storage. Here’s a visualization of this new – and considerably simpler – architecture:

A ScyllaDB Cloud cluster replaced Redis and Aurora, and AWS S3 replaced HDFS (Presto remains) for storing the scorecard details. Also added: a scoring-api service, which works as a data gateway. This component routes particular types of traffic to the appropriate data store.

How did this address their challenges?

Latency
With a carefully-designed ScyllaDB schema, they can quickly access the data that they need based on the primary key. Their scoring API can receive requests for up to 100,000 scorecards. This led them to build an API capable of splitting and parallelizing these payloads into smaller processing tasks to avoid overwhelming a single ScyllaDB coordinator. Upon completion, the results are aggregated and then returned to the calling service. Also, their high read throughput no longer causes latency spikes, thanks largely to ScyllaDB’s eventually consistent architecture.

Scalability
With ScyllaDB Cloud, they can simply add more nodes to their clusters as demand increases, allowing them to overcome limits faced with Redis. The scoring API is also scalable since it’s deployed as an ECS service. If they need to serve requests faster, they just add more instances.

Fast Score Updating
Now, scorecards can be updated immediately by sending an upsert request to ScyllaDB Cloud.

Maintainability
Instead of accessing the datastore directly, services now send REST queries to the scoring API, which works as a gateway. It directs requests to the appropriate places, depending on the use case. For example, if the request needs a low-latency response it’s sent to ScyllaDB Cloud. If it’s a request for historical data, it goes to Presto.

Results & Lessons Learned

Nguyen then shared the KPIs they used to track project success. On the day that they flipped the switch, latency immediately decreased by over 90% on the two endpoints tracked in this chart (from 4.5 seconds to 303 milliseconds). They’re fielding 80% fewer incidents. And they’ve already saved over $1M USD a year in infrastructure costs by replacing Redis and Aurora. On top of that, they achieved a 30% speed improvement in their data processing.

Wrapping up, Nguyen shared their top 3 lessons learned:

  • Design your ScyllaDB schema based on data access patterns with a focus on query latency.
  • Route infrequent, complex, and latency-tolerant data access to OLAP engines like Presto or Athena (generating reports, custom analysis, etc.).
  • Build a scalable, highly parallel processing aggregation component to fully benefit from ScyllaDB’s high throughput.

Watch the Complete SecurityScorecard Tech Talk

You can watch Nguyen’s complete tech talk and skim through the deck in our tech talk library.

Watch the Full Tech Talk