ScyllaDB Student Projects: CQL-over-WebSocket

Introduction: University of Warsaw Student Projects

Since 2019, ScyllaDB has cooperated with the University of Warsaw by mentoring teams of students in their Bachelor’s theses. After SeastarFS, ScyllaDB Rust Driver and many other interesting projects, the 2021/2022 edition of the program brings us CQL-over-WebSocket: a way of interacting with ScyllaDB straight through your browser.

Motivation

The most popular tool for interacting with ScyllaDB is cqlsh. It’s a Python-based command line tool allowing you to send CQL requests, browse their results, and so on. It’s based on the Python driver for Cassandra, and thus brings in a few dependencies: Python itself, the driver library and multiple indirect dependencies. Forcing users to install extra packages is hardly a “batteries included” experience, so we decided to try and create a terminal for interacting with ScyllaDB right from the Web browser.

Sounds easy, but there are a few major roadblocks to implementing such a tool. First of all, browsers do not allow pages to establish raw TCP connections. Instead, all communication is done either by issuing HTTP(S) requests or via the WebSocket protocol. WebSocket is interesting because it allows full-duplex connections directly from the browser and can multiplex streams of messages over TCP. If you are a JavaScript developer, check out this page. Also, this article goes into greater detail on the specific differences between HTTP and WebSocket.

Secondly, while Cassandra already has a JavaScript driver, it’s meant for Node.js, and is not directly usable from the browser, not to mention that JavaScript lacks the type safety guarantees provided by more modern languages, e.g., TypeScript.

CQL is stateful by nature, and clients usually establish long-lived sessions in order to communicate with the database. That makes HTTP a rather bad candidate for a building block, given its stateless nature and verbosity. CQL is also a binary protocol, so transferring data on top of a text-based one would cause way too much pain for performance-oriented developers.

Disqualifying HTTP makes the choice rather clear — WebSocket it is! Once this crucial design decision had been made, we could move forward to implementing all the layers:

  1. WebSocket server implementation for Seastar
  2. Injecting a WebSocket server into ScyllaDB
  3. Making a TypeScript driver for CQL, intended specifically for web browsers
  4. Implementing an in-browser cqlsh terminal

Part I: WebSocket Server Implementation for Seastar

In order to be able to receive WebSocket communication in ScyllaDB, we first need to provide a WebSocket server implementation for Seastar, our asynchronous C++ framework. Seastar already has a functional HTTP server, which was very helpful, as the WebSocket handshake is based on HTTP. Like any other server implementation embedded in Seastar, this one also needed to be latency-friendly and fast.

Protocol Overview

The WebSocket protocol is specified in RFC 6455 and the W3C WebSocket Living Standard. We also used a great tutorial on writing WebSocket servers published by Mozilla. In short, a minimal WebSocket server implementation needs to support the following procedures:

  • an HTTP-based handshake, which verifies that the client wishes to upgrade the connection to WebSocket and optionally specifies a subprotocol for the communication (a sketch of the key computation involved follows this list)
  • exchanging data frames, which also includes decoding the frame headers, as well as masking and unmasking the data
  • handling PING and PONG messages, used in a WebSocket connection to provide heartbeat capabilities
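
For illustration, the only non-obvious step in that handshake is deriving the Sec-WebSocket-Accept response header from the client’s Sec-WebSocket-Key. A minimal Python sketch of that derivation (the GUID is the fixed constant from RFC 6455):

import base64, hashlib

WS_GUID = "258EAFA5-E914-47DA-95CA-C5AB0DC85B11"  # magic constant from RFC 6455

def accept_key(sec_websocket_key: str) -> str:
    # The server replies with base64(SHA-1(key + GUID)) in Sec-WebSocket-Accept.
    digest = hashlib.sha1((sec_websocket_key + WS_GUID).encode("ascii")).digest()
    return base64.b64encode(digest).decode("ascii")

# The worked example from RFC 6455:
assert accept_key("dGhlIHNhbXBsZSBub25jZQ==") == "s3pPLMBiTxaQ9kYGzzhZRbK+xOo="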

Interface

The Seastar WebSocket server implementation is still experimental and subject to change, but it’s already fully functional and allows developers to implement their own custom WebSocket servers.

At the core of implementing your own WebSocket server in Seastar is the concept of a “handler”, which lets you specify how to handle an incoming stream of WebSocket data. Seastar takes care of automatically decoding the frames and unmasking the data, so developers simply operate on incoming and outgoing streams, as if it were a plain TCP connection. Each handler is bound to a specific subprotocol. During the handshake, if a client declares that it wants to speak a particular subprotocol, it is routed to the matching handler. The canonical example is an echo server, which returns exactly the same data it receives.

A full demo application implementing an echo server can be found here: https://github.com/scylladb/seastar/blob/master/demos/websocket_demo.cc

And here’s a minimal WebSocket client, which can be used for testing the server, runnable directly in IPython:

Part II: Injecting a WebSocket server into ScyllaDB

Now that we have WebSocket support in Seastar, all that’s left to do server-side is to inject such a server into ScyllaDB. Fortunately, in the case of CQL-over-WebSocket, the implementation is rather trivial. ScyllaDB already has a fully functional CQL server, so all we need to do is reroute all the decoded WebSocket traffic into our CQL server, and then send all the responses back, as-is, via WebSocket straight to the client. At the time of writing this blog post, this support is still not merged upstream, but it’s already available for review: https://github.com/scylladb/scylla/pull/10921

In order to enable the CQL-over-WebSocket server, it’s enough to specify the chosen port in the scylla.yaml configuration file, e.g.:

cql_over_websocket_port: 8222

After that, ScyllaDB is capable of accepting WebSocket connections which declare their subprotocol to be “cql”.
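
To give a feel for what that looks like on the wire, here is a rough Python sketch that opens such a connection and sends a bare CQL OPTIONS frame. It assumes the server simply exchanges standard CQL binary protocol (v4) frames as binary WebSocket messages, which is what the pull request above describes; the host and port are the example values from the configuration snippet:

# pip install websocket-client
import struct
import websocket

ws = websocket.create_connection("ws://127.0.0.1:8222/", subprotocols=["cql"])

# CQL v4 frame header: version, flags, stream id, opcode, body length.
# Opcode 0x05 is OPTIONS; it has no body and should elicit a SUPPORTED reply.
options_frame = struct.pack(">BBhBI", 0x04, 0x00, 0, 0x05, 0)
ws.send(options_frame, opcode=websocket.ABNF.OPCODE_BINARY)

reply = ws.recv()
print(reply[:9])   # response header; an opcode byte of 0x06 means SUPPORTED
ws.close()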

Part III: TypeScript Driver for CQL

Neither ScyllaDB nor Cassandra had a native TypeScript driver, and the only existing JavaScript implementation was dedicated to server-side Node.js, which made it unusable from the browser. Thus, we jumped at the opportunity and wrote a TypeScript driver from scratch. The result is here: https://github.com/dfilimonow/CQL-Driver.

While still very experimental and notably lacking concurrency support, load balancing policies, and other vital features, the driver is functional and capable of sending and receiving CQL requests straight from the browser via a WebSocket connection, which is exactly what we needed in order to implement a browser-only client.

Part IV: In-browser cqlsh Terminal

The frontend of our project is an in-browser implementation of a cqlsh-like terminal. It’s a static webpage based on TypeScript code compiled to JavaScript, coded with React and Material UI. It supports authentication, maintains shell history and presents query results in a table, with paging support.

The source code is available here: https://github.com/gbzaleski/ZPP-ScyllaDB-Front

Ideally, in order to provide a full “batteries included” experience, the page could be served by ScyllaDB itself from a local address, but until that happens, here are some preview screen captures from the current build of the frontend:

Note that since this code isn’t available in a production release of ScyllaDB, it requires a custom-compiled version of ScyllaDB running locally to get this to work. If you are curious about running this yourself, chat me up in the ScyllaDB user Slack community.

Source Code and the Paper

One of the outcomes of this project, aside from upstreaming changes into Seastar and ScyllaDB, is a Bachelor’s thesis. The paper is available here:

DOWNLOAD THE THESIS

and its official description and abstract can be found here:

READ THE ABSTRACT

Congratulations to those students who put in the time and effort to make this project so successful: Bartłomiej Kozaryna, Daniel Filimonow, Andrzej Stalke, and Grzegorz Zaleski. Thanks, team! And, as always, thank you to the instructors and staff of the University of Warsaw for such a great partnership. We’re now looking forward to the 2022/2023 edition of the student projects program, bringing new cool innovations and improvements to our ScyllaDB open source community!

JOIN OUR OPEN SOURCE COMMUNITY

 

Performance Engineering Masterclass for Optimizing Distributed Systems: What to Expect

Performance engineering in distributed systems is a critical skill to master. Because while it’s one thing to get massive terabyte-to-petabyte scale systems up and running, it’s a whole other thing to make sure they are operating at peak efficiency. In fact, it’s usually more than just “one other thing.” Performance optimization of large distributed systems is usually a multivariate problem — combining aspects of underlying hardware, networking, tuning operating systems or finagling with layers of virtualization and application architectures.

Such a complex problem can’t be taught from just one perspective. So, for our upcoming Performance Engineering Masterclass, we’re bringing together three industry experts who want to pass along to you the insights they have learned over their careers. Attendees will benefit from their perspectives based on extensive production and systems testing experience.

During this free, online – and highly interactive – 3-hour masterclass, you will learn how to apply SLO-driven strategies to optimize observability – and ultimately performance – from the front end to the back end of your distributed systems. Specifically, we’ll cover how to measure and optimize performance for latency-sensitive distributed systems built using modern DevOps approaches with progressive delivery. At the end, you will have the opportunity to demonstrate what you learned and earn a certificate that shows your achievement.

You will learn best practices on how to:

  • Optimize your team’s pipelines and processes for continuous performance
  • Define, validate, and visualize SLOs across your distributed systems
  • Diagnose and fix subpar distributed database performance using observability methods

Our Performance Engineering Masterclass will be held on August 30th, 2022 at 8:00AM – 12:00PM PDT. (4:00PM – 8:00PM GMT.)

REGISTER NOW FOR FREE

Schedule

The Masterclass program runs for three hours, between 8:30 – 11:30 AM Pacific Daylight Time on August 30th, 2022. The platform will open at 8:00 AM and close at 12:00 PM (noon), so there will be time before and after the scheduled sessions for you to meet and mingle with your fellow attendees, and to ask questions in our live, staffed virtual lounges:

TIME (PDT)          CONTENT
8:00 – 8:30am       PLATFORM OPENS (TECHNOLOGY LOUNGES)
8:30 – 8:40am       Welcome & Opening Remarks (Peter Corless, ScyllaDB)
8:40 – 9:30am       Introduction to Modern Performance (Leandro Melendez, Grafana k6)
9:30 – 10:00am      Efficient Automation with the Help of SRE Concepts (Henrik Rexed, Dynatrace)
10:00 – 10:30am     Observability Best Practices in Distributed Databases (Felipe Cardeneti Mendes, ScyllaDB)
10:30 – 11:00am     Panel Discussion + Live Q&A
11:00 – 11:30am     Certification Examination and Awards
11:30am – 12:00pm   PLATFORM CLOSING (TECHNOLOGY LOUNGES)

Agenda

The day’s learning will be centered around three main lessons:

Introduction to Modern Performance

The relentless drive to cloud-native distributed applications means no one can sit back and hope applications are always going to perform at their peak capabilities. Whether your applications are built on microservices or serverless architectures, you need to be able to observe and improve the performance of your always-on, real-time systems. We’ll kick off our masterclass by providing a grounding in current expectations of what performance engineering and load testing entail.

This session defines the modern challenges developers face, including continuous performance principles, Service Level Objectives (SLOs) and Service Level Indicators (SLIs). It will delineate best practices, and provide hands-on examples using Grafana k6, an open source modern load testing tool built on Go and JavaScript.

Efficient Automation with the Help of SRE Concepts

Service Level Objectives (SLOs) are a key part of modern software engineering practices. They help quantify the quality of service provided to end users, and therefore maintaining them becomes important for modern DevOps approaches with progressive delivery.

SLOs have become a powerful tool allowing teams to manage the monitoring of the reliability of production systems and validate the test cycles automatically launched through various pipelines. We’ll walk you through how to measure, validate and visualize these SLOs using Prometheus, an open observability platform, to provide concrete examples.

Next, you will learn how to automate your deployment using Keptn, a cloud-native event-based life-cycle orchestration framework. Discover how it can be used for multi-stage delivery, remediation scenarios, and automating production tasks.

Observability Best Practices in Distributed Databases

The final session dives into distributed database observability, showing several real-world situations that may affect workload performance and how to diagnose them. Gaining proper visibility into your database is important in order to predict, analyze and observe patterns and critical problems as they emerge. Managing scenarios such as unbounded concurrency, alerting, queueing, anti-entropy and typical database background activities is often a production-grade requirement for mission-critical platforms. However, simply having access to these metrics is not enough. DevOps, database, and application professionals need a deeper understanding of them to know what to watch for.

To show how to put these principles into practice, this session uses the ScyllaDB NoSQL database and ScyllaDB Monitoring Stack built on open source Grafana, Grafana Loki, Alert Manager and Prometheus as an example of instrumentation, observability tools, and troubleshooting methodologies. Discover how to triage problems by priority and quickly gather live insights to troubleshoot application performance.

Your Instructors

A masterclass is distinguished by the experience of its instructors, and we are pleased to introduce our 3 amazing instructors:

Leandro Melendez
Grafana k6

Leandro Melendez, also known by his nom de guerre “Señor Performo,” has decades of experience as a developer, DBA, and project manager. He has spent the past decade focusing on QA for performance and load testing. Now a developer advocate at Grafana k6, Leandro seeks to share his experiences with broader audiences, such as hosting the Spanish-language edition of the PerfBytes podcast. He will also be speaking at the upcoming P99 CONF in October 2022.

Henrik Rexed
Dynatrace

Henrik Rexed is a cloud native advocate with a career of nearly two decades focused specifically on performance engineering. As a Senior Staff Engineer at Dynatrace, he is a frequent speaker at conferences, webinars and podcasts. His current passion is the movement to the emerging Cloud Native Computing Foundation (CNCF) standard for OpenTelemetry. Henrik will also be a speaker at the upcoming P99 CONF in October 2022.

Felipe Cardeneti Mendes
ScyllaDB

Felipe Cardeneti Mendes is a published author and solutions architect at ScyllaDB. Before ScyllaDB he spent nearly a decade as a systems administrator at one of the largest tech companies on the planet. He focuses on helping customers adapt their systems to meet the challenges of this next tech cycle, whether that means building low latency applications in Rust, or migrating or scaling their systems to the latest hardware, operating systems and cloud environments.

Achieve Certification

At the end of the prepared presentations, there will be an examination offered to all attendees. Those who complete the examination with a passing grade will be presented with a Certificate of Completion for the Masterclass.

Attendees who do not achieve a passing grade during the live event will have a chance to retake the examination at a later date to achieve their certification.

Sponsors

This Masterclass is brought to you for free through a joint collaboration of ScyllaDB, Grafana k6, and Dynatrace.

Register Now

The world of cloud-native distributed applications, observability and performance optimization is changing in real time, so there is no better time than right now to commit to deepening your knowledge. August 30th is right around the corner, so don’t delay — sign up today!

REGISTER NOW FOR THE MASTERCLASS

Introducing ScyllaDB Enterprise 2022.1

ScyllaDB is pleased to announce the availability of ScyllaDB Enterprise 2022.1, a production-ready ScyllaDB Enterprise major release. After more than 6,199 commits originating from five open source releases, we’re excited to now move forward with ScyllaDB Enterprise 2022.

ScyllaDB Enterprise builds on the proven features and capabilities of our ScyllaDB Open Source NoSQL database, providing greater reliability from additional vigorous testing. Between ScyllaDB Enterprise 2021 and this 2022 release, we’ve closed 1,565 issues. Plus, it provides a set of unique enterprise-only features such as LDAP authorization and authentication.

ScyllaDB Enterprise 2022 is immediately available for all ScyllaDB Enterprise customers, and will be deployed to ScyllaDB Cloud users as part of their fully-managed service. A 30-day trial version is available for those interested in testing its capabilities against their own use cases.

New Features in ScyllaDB Enterprise 2022

Support for AWS EC2 I4i Series Instances

ScyllaDB now supports the new AWS EC2 I4i series instances. The I4i series provides superior performance over the I3 series due to a number of factors: the 3rd generation Intel Xeon “Ice Lake” processors, the AWS Nitro System hypervisor, and low-latency Nitro NVMe SSDs. ScyllaDB can achieve 2x the throughput and lower latencies on I4i instances compared to comparable I3 instances.

Support for Arm-based Systems

With ScyllaDB Open Source 4.6 we implemented support for Arm-based architectures, including the new AWS Im4gn and Is4gen storage-optimized instances powered by Graviton2 processors. Release 4.6 also supports the low cost T4g burstable instance for development of cloud-based applications. Since ScyllaDB is now compiled to run on any Aarch64 architecture, you can even run it on an Arm-based M1-powered Macintosh using Docker for local development.

Change Data Capture (CDC)

This feature allows you to track changes made to a base table in your database for visibility or auditing. Changes made to the base table are stored in a separate table that can be queried by standard CQL. Our CDC implementation uses a configurable Time To Live (TTL) to ensure it does not occupy an inordinate amount of your disk. Note: This was one of the few features brought in through a maintenance release (2021.1.1). It is new since the last major release of ScyllaDB Enterprise 2021.
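
As a sketch of how this is typically used (the keyspace and table names are placeholders, and the setup uses the stock Python driver), CDC is toggled per table and the change log is queried like any other table:

# pip install scylla-driver  (or cassandra-driver)
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("my_keyspace")

# Enable CDC on an existing base table.
session.execute("ALTER TABLE orders WITH cdc = {'enabled': true}")

# Changes land in an auto-created log table named <base table>_scylla_cdc_log,
# which can be read with standard CQL.
for row in session.execute("SELECT * FROM orders_scylla_cdc_log LIMIT 10"):
    print(row)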

LDAP

ScyllaDB Enterprise 2021.1.2 introduced support for LDAP Authorization and LDAP Authentication.

  • LDAP Authentication allows users to manage the list of ScyllaDB Enterprise users and passwords in an external Directory Service (LDAP server), for example, MS Active Directory. ScyllaDB delegates authentication to a third-party utility named saslauthd, which, in turn, supports many different authentication mechanisms. Read the complete documentation on how to configure and use saslauthd and LDAP Authentication.
  • LDAP Authorization allows users to manage the roles and privileges of ScyllaDB users in an external Directory Service (LDAP server), for example, MS Active Directory. To do that, one needs a role_manager entry in scylla.yaml set to com.scylladb.auth.LDAPRoleManager. When this role manager is chosen, ScyllaDB forbids GRANT and REVOKE role statements (CQL commands), as all users get their roles from the contents of the LDAP directory. When LDAP Authorization is enabled and a ScyllaDB user authenticates to ScyllaDB, a query is sent to the LDAP server, whose response sets the user’s roles for that login session. The user keeps the granted roles until logout; any subsequent changes to the LDAP directory only take effect at the user’s next login to ScyllaDB. Read the complete documentation on how to configure and use LDAP Authorization.

Note: This is one of the few features brought in through a maintenance release (2021.1.2). It is new since the last major release of ScyllaDB Enterprise 2021.

Timeout per Operation

There is now new syntax for setting timeouts on individual queries with “USING TIMEOUT”. This is particularly useful when you have queries that are known to take a long time. Until now, you could either increase the timeout value for the entire system (with request_timeout_in_ms), or keep it low and see many timeouts for the longer queries. The new timeout per operation allows you to define the timeout in a more granular way. Conversely, some queries might have tight latency requirements, in which case it makes sense to set their timeout to a small value. Such queries would time out faster, which means that they won’t needlessly hold the server’s resources. You can use the new TIMEOUT parameters for both queries (SELECT) and updates (INSERT, UPDATE, DELETE). Note: This was one of the few features brought in through a maintenance release (2021.1.1). It is new since the last major release of ScyllaDB Enterprise 2021.
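
A quick sketch of the syntax (keyspace, table, and timeout values here are placeholders):

# pip install scylla-driver  (or cassandra-driver)
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("my_keyspace")

# Give a known-heavy scan more headroom than the cluster-wide default...
session.execute("SELECT * FROM events WHERE day = '2022-07-01' USING TIMEOUT 5s")

# ...and make a latency-sensitive write fail fast instead of holding server resources.
session.execute(
    "INSERT INTO events (day, id, payload) VALUES ('2022-07-01', 1, 'x') USING TIMEOUT 10ms"
)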

Improvements to Cloud Formation

A new variable, VpcCidrIp, allows you to set the CIDR IP range for the VPC. Previously, the range was hard-coded to 172.31.0.0/16. In addition, the Cloud Formation template was reordered for better readability. Note: This was one of the few features brought in through a maintenance release (2021.1.5). It is new since the last major release of ScyllaDB Enterprise 2021.

I/O Scheduler Improvements

A new I/O scheduler was integrated via a Seastar update. The new scheduler is better at restricting disk I/O in order to keep latency low. This implementation introduces a new dispatcher in the middle of a traditional producer-consumer model. The IO scheduler seeks to find what is known as the effective dispatch rate — the fastest rate at which the system can process data without running into internal queuing jams.

Improved Reverse Queries

Reverse queries are SELECT statements that use the reverse of the order defined in the table schema. If no order was defined, the default order is ascending (ASC). For example, imagine rows in a partition sorted by time in ascending order. A reverse query would return rows in descending order, with the newest rows first. Reverse queries were improved in ScyllaDB Open Source 4.6 and further improved in 5.0: first, to return short pages to limit memory consumption, and second, to leverage ScyllaDB’s row-based cache (before 5.0 they bypassed the cache).

New Virtual Tables for Configuration and Nodetool Information

A new system.config virtual table allows querying and updating a subset of configuration parameters over CQL. These updates are not persistent, and revert to the scylla.yaml values after a restart. Nodetool command information can also be accessed via virtual tables, including snapshots, protocol servers, runtime info, and a virtual table replacement for nodetool versions. Virtual tables allow remote access over CQL, including for ScyllaDB Cloud users.
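
For example, here is a rough sketch of reading and tweaking one parameter over CQL; the name/value column layout and the choice of a live-updatable parameter are assumptions based on the description above:

from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect()

# Read the current value of a configuration parameter.
row = session.execute(
    "SELECT value FROM system.config WHERE name = 'request_timeout_in_ms'"
).one()
print(row.value)

# Update it for the running node; the change is not persistent and
# reverts to the scylla.yaml value after a restart.
session.execute(
    "UPDATE system.config SET value = '20000' WHERE name = 'request_timeout_in_ms'"
)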

Repair-based Node Operations (RBNO)

We leveraged our row-based repair mechanism for other repair-based node operations (RBNO), such as node bootstraps, decommissions, and removals, and allowed it to use off-strategy compaction to simplify cluster management.

SSTable Index Caching

We also added SSTable Index Caching (4.6). Up to this release, ScyllaDB only cached data rows (values) from SSTables, not the indexes themselves. As a result, if the data was not in cache, readers had to touch the disk while walking the index. This was inefficient, especially for large partitions, increasing the load on the disk and adding latency. Now, index blocks can be cached in memory, shared between readers, populated on access, and evicted on memory pressure – reducing the I/O and decreasing latency. More info can be found in Tomasz Grabiec’s ScyllaDB Summit session, “SSTable Index Caching.”

SSTable Index Caching provides faster performance by avoiding unnecessary queries to SSD.

Granular Timeout Controls

Timeouts per operation were also added, allowing more granular control over latency requirements and freeing up server resources (4.4). We expanded upon this by adding Service Level Properties, allowing you to associate attributes to rules and users, so you can have granular control over session properties, like per-service-level timeouts and workload types (4.6).

Guardrails

ScyllaDB is a very powerful tool, with many features and options. In many cases, these options, such as experimental or performance-impacting features, or a combination of them, are not recommended for production. Guardrails are a collection of restrictions that make it harder for the user to run non-recommended options in production. A few examples:

  • Prevent users from using SimpleStrategy replication.
  • Warn about or prevent usage of DateTieredCompactionStrategy, which has long since been deprecated in favor of TimeWindowCompactionStrategy.
  • Disable Thrift, a legacy interface, by default.
  • Ensure that all nodes use the same snitch mode.

ScyllaDB administrators can use our default settings or customize guardrails for their own environment and needs.

Improvements to Alternator

The Amazon DynamoDB-compatible interface has been updated to include a number of new features:

  • Support for Cross-Origin Resource Sharing (CORS)
  • Full support for nested attribute paths
  • Support for attribute paths in ConditionExpression, FilterExpression, and ProjectionExpression

Related Innovations

Kubernetes Operator Improvements

Our Kubernetes story was production-ready with ScyllaDB Operator 1.0, released in January of 2020. Since that time, we have continuously improved on our capabilities and performance tuning with cloud orchestration.

Faster Shard-Aware Drivers

Also during the past two years we have learned a lot about making our shard-aware drivers even faster. Open source community contributions led to a shard-aware Python driver which you can read about here and here. Since then our developers have embraced the speed and safety of Rust, and we are now porting our database drivers to async Rust, which will provide a core of code for multiple drivers with language bindings to C/C++ and Python.

New Release Cycle for Enterprise Customers

ScyllaDB Enterprise 2022.1 is a Long Term Support (LTS) release, as per our new enterprise release policy. It used to be that Enterprise users would need to wait a year between feature updates in major ScyllaDB Enterprise releases. Starting later this year, we intend to provide additional new features in a Short Term Support (STS) release, versioned as 2022.2. This new mix of LTS and STS releases will allow us to provide new features more than once a year. For you, it means you don’t need to wait as long to see new innovations show up in a stable enterprise release.

For example, let’s look at this chart from the historical releases we made in the era of ScyllaDB Open Source 3.0 to 4.0, and how that tracked to ScyllaDB Enterprise 2019 to 2020.

There was a four month lag between the release of ScyllaDB Open Source 3.0 (January 2019) to the related ScyllaDB Enterprise 2019.1 release (May 2019). And there was a similar three month lag between ScyllaDB Open Source 4.0 and ScyllaDB Enterprise 2020.1.

During the interim between those major releases there were three minor releases of ScyllaDB Open Source — 3.1, 3.2 and 3.3. Enterprise users needed to wait until the next major annual release to utilize any of those innovations — apart from a point feature such as support for IPv6.

In comparison, ScyllaDB Open Source 5.0 was released on July 7th, 2022. And here, within a calendar month, we are announcing ScyllaDB Enterprise 2022.

Plus, when ScyllaDB Enterprise 2022.2 is released later this year, it will include within it additional production-ready innovations brought in from ScyllaDB Open Source without requiring users to wait until the new major release in calendar year 2023.

This narrows the delay in production-ready software features for our Enterprise users. For example, this is what the rest of 2022 could look like:

Note that ScyllaDB Enterprise 2022.2 will be a Short Term Support release, which means some Enterprise users may still decide to wait for the next ScyllaDB Enterprise LTS release in 2023. Existing ScyllaDB Enterprise customers are encouraged to discuss their upgrade policies and strategies with their support team.

Get ScyllaDB Enterprise 2022 Now

ScyllaDB Enterprise 2022 is immediately available. Existing ScyllaDB Enterprise customers can work with their support team to manage their upgrades. ScyllaDB Cloud users will see the changes made automatically to their cluster in the coming days.

For those who wish to try ScyllaDB Enterprise for the first time, there is a 30-day trial available.

You can check out the documentation and download it now from the links below. If this is your first foray into ScyllaDB, or into NoSQL in general, remember to sign up for our free online courses in ScyllaDB University.

GET STARTED WITH SCYLLADB ENTERPRISE

Implementing a New IO Scheduler Algorithm for Mixed Read/Write Workloads

Introduction

Being a large datastore, ScyllaDB runs many internal activities competing for disk I/O. There are data writers (for instance, the memtable flushing code or the commitlog) and data readers (most likely fetching data from SSTables to serve client requests). Beyond that, there are background activities, like compaction or repair, which can do I/O in both directions. This competition is important to tackle: if disk I/O were issued without any fairness or balancing considerations, a query serving a read request could find itself buried in the middle of a pile of background writes. By the time it had the opportunity to run, all that waiting would have translated into increased latency for the read.

While in-disk concurrency should be kept reasonably high to utilize the disk’s internal parallelism, it is better for the database not to send all of those requests to the disk in the first place, but to keep them queued inside the database. With the requests inside the database, ScyllaDB can tag them and, through prioritization, guarantee quality of service among the various classes of I/O requests it has to serve: CQL query reads, commitlog writes, background compaction I/O, etc.

That being said, there’s a component in the ScyllaDB low-level engine called the I/O scheduler that operates several I/O request queues and tries to maintain a balance between two inefficiencies — overwhelming the disk with requests and underutilizing it. With the former, we achieve good throughput but end up compromising latency. With the latter, latency is good but throughput suffers. Our goal is to find the balance where we can extract most of the available disk throughput while still maintaining good latency.

Current State / Previous Big Change

Last time we touched this subject, the problem was that disk throughput and IOPS capacity were statically partitioned between shards. Back then, the challenge was to do correct accounting and limiting of a shared resource within the shard-per-core design of the library.

A side note: Seastar uses the “share nothing” approach, which means that any decisions made by CPU cores (often referred to as shards) are not synchronized with each other. In rare cases, when one shard needs the other shard’s help, they explicitly communicate with each other.

The essential part of the solution was to introduce a shared pair of counters that helped achieve two goals: prevent shards from consuming more bandwidth than the disk can provide, and act as a “reservation” mechanism with which each shard could claim capacity and wait for it to become available if needed — all in a “fair” manner.

Just re-sorting the requests, however, doesn’t make the I/O scheduler a good one. Another essential part of I/O scheduling is maintaining the balance between queueing and executing requests. Understanding what overwhelms the disk and what doesn’t turns out to be very tricky.

Towards Modeling the Disk

Most likely when evaluating a disk one would be looking at its 4 parameters — read/write IOPS and read/write throughput (such as in MB/s). Comparing these numbers to one another is a popular way of claiming one disk is better than the other and estimating the aforementioned “bandwidth capacity” of the drive by applying Little’s Law. With that, the scheduler’s job is to provide a certain level of concurrency inside the disk to get maximum bandwidth from it, but not to make this concurrency too high in order to prevent disk from queueing requests internally for longer than needed.
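
As a reminder, Little’s Law here simply ties the three quantities together: the concurrency a drive needs to stay busy equals its throughput multiplied by its mean request latency, L = λ · W. For example, a drive sustaining 400,000 IOPS at 100 µs per request wants roughly 400,000 × 0.0001 s = 40 requests in flight (illustrative numbers, not measurements from any particular disk).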

In its early days ScyllaDB took this concept literally and introduced a configuration option to control the maximum number of requests that can be sent to the disk in one go. Later this concept was elaborated to take into account the disk bandwidth and the fact that reads and writes come at different costs. So the scheduler maintained not only the “maximum number of requests” but also the “maximum number of bytes” sent to the disk, and evaluated these numbers differently for reads vs. writes.

This model, however, didn’t take into account any potential interference read and write flows could have on each other. The assumption was that this influence should be negligible. Further experiments on mixed I/O showed that the model could be improved even further.

The Data

The model was elaborated by developing a tool, Diskplorer, to collect a full profile of how a disk behaves under all kinds of load. What the tool does is load the disk with reads and writes of different “intensities” (including pure workloads, where one of the dimensions is literally zero) and collect the resulting request latencies. The result is a five-dimensional space showing how disk latency depends on the { read, write } x { iops, bandwidth } values. Drawing such a complex thing on the screen is challenging in itself. Instead, the tool renders a set of flat slices, some examples of which are shown below.

For instance, this is how read request latency depends on the intensity of small reads (challenging disk IOPS capacity) vs intensity of large writes (pursuing the disk bandwidth). The latency value is color-coded, the “interesting area” is painted in cyan — this is where the latency stays below 1 millisecond. The drive measured is the NVMe disk that comes with the AWS EC2 i3en.3xlarge instance.

This drive demonstrates almost perfect half-duplex behavior — increasing the read intensity several times requires roughly the same reduction in write intensity to keep the disk operating at the same speed.

Similarly this is what the plot looks like for yet another AWS EC2 instance — the i3.3xlarge.

This drive demonstrates less than perfect, but still half-duplex, behavior. Increasing read intensity requires writes to slow down much more to keep up with the required speeds, but the drive is still able to sustain the mixed workload.

Less relevant for ScyllaDB, but still very interesting is the same plot for some local HDD. The axis and coloring are the same, but the scale differs, in particular, the cyan outlines ~100 milliseconds area.

This drive is also half-duplex, but with write bandwidth above ~100 MB/s the mixed workload cannot survive, even though a pure write workload at that rate is perfectly possible.

It’s seen that the “safety area,” in which we can expect the disk request latency to stay below the value we want, is a concave, roughly triangular region with its vertices located at zero and near the disk’s maximum bandwidth and IOPS.

The Math

If we try to approximate the above plots with a linear equation, the result would be something like

    bandwidth_r / Br + bandwidth_w / Bw + iops_r / Or + iops_w / Ow ≤ K

where Bx are the maximum read and write bandwidth values, Ox are the maximum IOPS values, and K is a constant with the exact value taken from the collected profile. Taking into account that bandwidth and IOPS are both time derivatives of the respective bytes and ops numbers, i.e.

    bandwidth = d(bytes) / dt

and

    iops = d(ops) / dt

and introducing relations between the maximum bandwidth and IOPS values

    Mb = Br / Bw and Mo = Or / Ow

the above equation turns into the simpler form of

    d/dt [ (bytes_r + Mb · bytes_w) / Br + (ops_r + Mo · ops_w) / Or ] ≤ K

Let’s measure each request with a two-value tuple T = {1, bytes} for reads and T = {Mo, Mb · bytes} for writes, and define the normalization operation to be

    N(T) = T.ops / Or + T.bytes / Br

Now the final equation looks like

    d/dt Σ N(T) ≤ K

Why is it better or simpler than the original one? The thing is that measuring bandwidth (or IOPS) is not simple. If you look at any monitoring tool, be it the plain Linux top command or a sophisticated Grafana stack, you’ll find that all such rate values are calculated over some predefined time frame, say one second. Limiting something that’s only measurable over a certain time window is an even more complicated task.

In its final form, the scheduling equation has a great benefit — it only needs to accumulate instant numerical values, the aforementioned request tuples. The only difficulty is that the algorithm should limit not the accumulated value itself, but the rate at which it grows. This task has a well-known solution.

The Code

There’s a well-developed algorithm out there called the token bucket. It originates from telecommunications and networking and is used to make sure that a packet flow doesn’t exceed the configured rate or bandwidth.

In its bare essence, the algorithm maintains a bucket with two inputs and one output. The first input is the flow of packets to be transmitted. The second input is the rate-limited flow of tokens. Rate-limited here means that, unlike packets, which flow into the bucket as they appear, tokens are put into the bucket at some fixed rate, say N tokens per second. The output is the rate-limited flow of packets that have recently arrived in the bucket. Each outgoing packet carries one or more tokens with it, and the number of tokens corresponds to the “measure” being rate-limited. If the token bucket needs to ensure a packets-per-second rate, then each packet takes away one token. If the goal is to limit the output in bytes per second, then each packet must carry as many tokens as its length in bytes.

This algorithm can be applied to our needs. Remember, the goal is to rate-limit the outgoing requests’ normalized cost-tuples. Accordingly, each request is only dispatched when it can take N(T) tokens with it, and the tokens flow into the bucket at the rate of K.
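
A toy sketch of that dispatch loop in Python (a single bucket with no feedback from request completions yet; cost() stands in for the N(T) normalization above, and all names are illustrative):

import time

class TokenBucket:
    """Tokens accrue at a fixed rate (the K above); a request may only be
    dispatched once the bucket holds its full normalized cost N(T)."""

    def __init__(self, rate: float, burst: float):
        self.rate = rate                  # tokens added per second
        self.burst = burst                # cap on accumulated tokens
        self.tokens = burst
        self.last = time.monotonic()

    def try_take(self, cost: float) -> bool:
        now = time.monotonic()
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False                      # not enough tokens yet; keep the request queued

def dispatch(queue, bucket, cost, submit):
    # Drain the queue only as fast as the bucket allows.
    while queue and bucket.try_take(cost(queue[0])):
        submit(queue.pop(0))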

In fact, the classical token bucket algorithm had to be slightly adjusted to play nicely with the unpredictable nature of modern SSDs. The thing is that, despite all the precautions and modeling, disks can still slow down unexpectedly. One of the reasons is background garbage collection performed by the FTL. Continuing to feed I/O into the disk at the initial rate while it is running its background hygiene should be avoided as much as possible. To compensate for this, we added a backlink to the algorithm — tokens that flow into the bucket do not appear from nowhere (even in a strictly rate-limited manner); instead, they originate from another bucket. In turn, this second bucket gets its tokens from the disk itself as requests complete.

The modified algorithm makes sure the I/O scheduler serves requests with a rate that doesn’t exceed two values — the one predicted by the model and the one that the disk shows under real load.

Results

The results can be seen from two angles. First, does the scheduler do its job and keep the bandwidth and IOPS in the “safety area” from the initial profile measurements? Second, does this really help, i.e., does the disk show good latencies or not? To check both, we ran cassandra-stress over two versions of ScyllaDB — 4.6 with its old scheduler and 5.0 with the new one. There were four runs in a row for each version — the first run populated ScyllaDB with data, and the subsequent runs queried this data back, “disturbed” by background compaction (because we know that an ordinary workload is too easy for a modern NVMe disk to handle). The querying was performed with different rates of client requests — 10k, 5k and 7.5k requests per second.

Bandwidth/IOPS Management

Let’s first look at how the scheduler maintains I/O flows and respects the expected limitations.

Population bandwidth: ScyllaDB 4.6 (total) vs 5.0 (split, green — R, yellow — W)

On the pair of plots above you can see bandwidth for three different flows — commitlog (write only), compaction (both reads and writes) and memtable flushing (write only). Note that ScyllaDB 5.0 can report read and write bandwidth independently.

ScyllaDB 4.6 maintains the total bandwidth to be 800 MB/s which is the peak of what the disk can do. ScyllaDB 5.0 doesn’t let the bandwidth exceed a somewhat lower value of 710 MB/s because the new formula requires that the balanced sum of bandwidth and IOPS, not their individual values, stays within the limit.

Similar effects can be seen during the reading phase of the test on the plots below. Since the reading part involves small read requests coming from CQL request processing, this time it’s worth looking not only at the bandwidth plots, but also at the IOPS ones. Also note that this time there are three different parts of the test — the incoming request rate was 10k, 5k and 7.5k, so there are three distinct parts on the plots. And last but not least — the memtable flushing and commitlog classes are not present here, because this part of the test was read-only, with the compaction job running in the background.

Query bandwidth: ScyllaDB 4.6 (total) vs 5.0 (split, green — R, yellow — W)

The first thing to note is that the 5.0 scheduler inhibits the compaction class much more heavily than 4.6 does, letting the query class get more disk throughput — which is one of the goals we wanted to achieve. Second, as during the population stage, the net bandwidth with the 5.0 scheduler is lower than with 4.6 — again because the new scheduler takes into account the IOPS value together with the bandwidth, not separately.

By looking at the IOPS plots one may notice that 4.6 seems to provide more or less equal IOPS for the different request rates in the query class, while 5.0’s allocated capacity is aligned with the incoming rate. That’s a correct observation; the explanation is that 4.6 didn’t manage to sustain even the rate of 5k requests per second, while 5.0 was comfortable at 7.5k and only gave up at 10k. That is a clear indication that the new scheduler does provide the expected I/O latencies.

Latency Implications

Seastar keeps track of two latencies — in-queue and in-disk. The scheduler’s goal is to maintain the latter one. The in-queue can grow as high as it wants; it’s up to upper layers to handle it. For example, if the requests happen to wait too long in the queue, ScyllaDB activates a so-called “backpressure” mechanism which involves several techniques and may end up canceling some requests sitting in the queue.

ScyllaDB 4.6 commitlog class delays (green — queue time, yellow — execution time)

ScyllaDB 5.0 commitlog class delays (green — queue time, yellow — execution time)

The scale of the plot doesn’t make it obvious, but the pop-up on the bottom right shows that the in-disk latency of the commitlog class dropped threefold — from 1.5 milliseconds to about half a millisecond. This is a direct consequence of the overall decreased bandwidth seen above.

When it comes to the querying workload, things get even better, because this time the scheduler keeps bulky writes away from the disk when the system needs short, latency-sensitive reads.

ScyllaDB 4.6 query class delays (green — queue time, yellow — execution time)

ScyllaDB 5.0 query class delays (green — queue time, yellow — execution time)

Yet again, the in-queue latencies are orders of magnitude larger than the in-disk ones, so an idea of the latter can be gotten from the pop-ups on the bottom right. Query in-disk latencies dropped twofold thanks to the suppressed compaction workload.

What’s Next

The renewed scheduler will come as part of lots of other great improvements in 5.0. However, some new features should be added on top. For example, the metrics list will be extended to report the accumulated costs described above that each scheduling class has collected so far. Another important feature is adjusting the rate-limiting factor (the K above) at run time, to account for disks aging out and for any potential mis-evaluations we might have made. And finally, to make the scheduler usable on slow drives like persistent cloud disks or spinning HDDs, we’ll need to tune the configuration of the latency goal the scheduler aims at.

Discover More in ScyllaDB V

ScyllaDB V is our initiative to make ScyllaDB, already the most monstrously fast and scalable NoSQL database available, into even more of a monster to run your most demanding workloads. You can learn more about this in three upcoming webinars:

Webinar: ScyllaDB V Developer Deep Dive Series – Performance Enhancements + AWS I4i Benchmarking

August 4, 2022 | Live Online | ScyllaDB

We’ll explore the new IO model and scheduler, which provide fine-tuned balancing of read/write requests based on your disk’s capabilities. Then, we’ll look at how ScyllaDB’s close-to-the-hardware design taps the full power of high-performance cloud computing instances such as the new EC2 I4i instances.

Register Now >

Webinar: ScyllaDB V Developer Deep Dive Series – Resiliency and Strong Consistency via Raft

August 11, 2022 | Live Online | ScyllaDB

ScyllaDB’s implementation of the Raft consensus protocol translates to strong, immediately consistent schema updates, topology changes, tables and indexes, and more. Learn how the Raft consensus algorithm has been implemented, what you can do with it today, and what radical new capabilities it will enable in the days ahead.

Register Now >

Webinar: ScyllaDB V Developer Deep Dive Series – Rust-Based Drivers and UDFs with WebAssembly

August 18, 2022 | Live Online | ScyllaDB

Rust and Wasm are both perfect for ScyllaDB’s obsession with high performance and low latency. Join this webinar to hear our engineers’ insights on how Rust and Wasm work with ScyllaDB and how you can benefit from them.

Register Now >

 

Data Mesh — A Data Movement and Processing Platform @ Netflix

By Bo Lei, Guilherme Pires, James Shao, Kasturi Chatterjee, Sujay Jain, Vlad Sydorenko

Background

Real-time processing technology (a.k.a. stream processing) is one of the key factors that enable Netflix to maintain its leading position in the competition to entertain our users. Our previous-generation streaming pipeline solution, Keystone, has a proven track record of serving many of our key business needs. However, as we expand our offerings and try out new ideas, there’s a growing need to unlock other emerging use cases that are not yet covered by Keystone. After evaluating the options, the team decided to create Data Mesh as our next-generation data pipeline solution.

Last year we wrote a blog post about how Data Mesh helped our Studio team enable data movement use cases. A year has passed, Data Mesh has reached its first major milestone, and its scope keeps increasing. As a growing number of use cases onboard to it, we have a lot more to share. We will deliver a series of articles that cover different aspects of Data Mesh and what we have learned from our journey. This article gives an overview of the system. The following ones will dive deeper into different aspects of it.

Data Mesh Overview

A New Definition Of Data Mesh

Previously, we defined Data Mesh as a fully managed, streaming data pipeline product used for enabling Change Data Capture (CDC) use cases. As the system evolves to solve more and more use cases, we have expanded its scope to handle not only the CDC use cases but also more general data movement and processing use cases such that:

  • Events can be sourced from more generic applications (not only databases).
  • The catalog of available DB connectors is growing (CockroachDB and Cassandra, for example).
  • More processing patterns are supported, such as filter, projection, union, join, etc.

As a result, today we define Data Mesh as a general purpose data movement and processing platform for moving data between Netflix systems at scale.

Overall Architecture

The Data Mesh system can be divided into the control plane (Data Mesh Controller) and the data plane (Data Mesh Pipeline). The controller receives user requests, deploys and orchestrates pipelines. Once deployed, the pipeline performs the actual heavy lifting data processing work. Provisioning a pipeline involves different resources. The controller delegates the responsibility to the corresponding microservices to manage their life cycle.

Pipelines

A Data Mesh pipeline reads data from various sources, applies transformations on the incoming events and eventually sinks them into the destination data store. A pipeline can be created from the UI or via our declarative API. On the creation/update request the controller figures out the resources associated with the pipeline and calculates the proper configuration for each of them.

Connectors

A source connector is a Data Mesh managed producer. It monitors the source database’s bin log and produces CDC events to the Data Mesh source fronting Kafka topic. It is able to talk to the Data Mesh controller to automatically create/update the sources.

Previously we only had RDS source connectors to listen to MySQL and Postgres using the DBLog library; now we have added CockroachDB source connectors and Cassandra source connectors. They use different mechanisms to stream events out of the source databases. We’ll publish blog posts that dive deeper into them.

In addition to managed connectors, application owners can emit events via a common library, which can be used in circumstances where a DB connector is not yet available or there is a preference to emit domain events without coupling with a DB schema.

Sources

Application developers can expose their domain data in a centralized catalog of Sources. This allows data sharing as multiple teams at Netflix may be interested in receiving changes for an entity. In addition, a Source can be defined as a result of a series of processing steps — for example an enriched Movie entity with several dimensions (such as the list of Talents) that further can be indexed to fulfill search use cases.

Processors

A processor is a Flink Job. It contains a reusable unit of data processing logic. It reads events from the upstream transports and applies some business logic to each of them. An intermediate processor writes data to another transport. A sink processor writes data to an external system such as Iceberg, ElasticSearch, or a separate discoverable Kafka topic.

We have provided a Processor SDK to help advanced users develop their own processors. Processors developed by Netflix developers outside our team can also be registered to the platform and work with other processors in a pipeline. Once a processor is registered, the platform automatically sets up a default alert UI and metrics dashboard.

Transports

We use Kafka as the transportation layer for the interconnected processors to communicate. The output events of the upstream processor are written to a Kafka topic, and the downstream processors read their input events from there.

Kafka topics can also be shared across pipelines. A topic in pipeline #1 that holds the output of its upstream processor can be used as the source in pipeline #2. We frequently see use cases where some intermediate output data is needed by different consumers. This design enables us to reuse and share data as much as possible. We have also implemented the features to track the data lineage so that our users can have a better picture of the overall data usage.

Schema

Data Mesh enforces schema on all the pipelines, meaning we require all the events passing through the pipelines to conform to a predefined template. We’re using Avro as a shared format for all our schemas, as it’s simple, powerful, and widely adopted by the community.

We make schema a first-class citizen in Data Mesh for the following reasons:

  • Better data quality: Only events that comply with the schema can be encoded. Gives the consumer more confidence.
  • Finer granularity of data lineage: The platform is able to track how fields are consumed by different consumers and surface it on the UI.
  • Data discovery: Schema describes data sets and enables the users to browse different data sets and find the dataset of interest.

On pipeline creation, each processor in that pipeline needs to define what schema it consumes and produces. The platform handles the schema validation and compatibility check. We have also built automation around handling schema evolution. If the schema is changed at the source, the platform tries to upgrade the consuming pipelines automatically without human intervention.

Future

Data Mesh initially started as a project to solve our Change Data Capture needs. Over the past year, we have observed increasing demand for all sorts of needs in other domains such as machine learning, logging, etc. Today, Data Mesh is still at an early stage and there are many interesting problems yet to be solved. Below are the highlights of some of the high-priority tasks on our roadmap.

Making Data Mesh The Paved Path (Recommended Solution) For Data Movement And Processing

As mentioned above, Data Mesh is meant to be the next generation of Netflix’s real-time data pipeline solution. As of now, we still have several specialized internal systems serving their own use cases. To streamline the offering, it makes sense to gradually migrate those use cases onto Data Mesh. We are currently working hard to make sure that Data Mesh can achieve feature parity to Delta and Keystone. In addition, we also want to add support for more sources and sinks to unlock a wide range of data integration use cases.

More Processing Patterns And Better Efficiency

People use Data Mesh not only to move data. They often also want to process or transform their data along the way. Another high-priority task for us is to make more common processing patterns available to our users. Since by default a processor is a Flink job, having each simple processor do its work in its own Flink job can be inefficient. We are also exploring ways to merge multiple processing patterns into a single Flink job.

Broader support for Connectors

We are frequently asked by our users if Data Mesh is able to get data out of datastore X and land it into datastore Y. Today we support certain sources and sinks, but it’s far from enough. The demand for more types of connectors is enormous, we see a big opportunity ahead of us, and that’s definitely something we want to invest in.

Data Mesh is a complex yet powerful system. We believe that as it gains its maturity, it will be instrumental in Netflix’s future success. Again, we are still at the beginning of our journey and we are excited about the upcoming opportunities. In the following months, we’ll publish more articles discussing different aspects of Data Mesh. Please stay tuned!

The Team

Data Mesh wouldn’t be possible without the hard work and great contributions from the team. Special thanks should go to our stunning colleagues:

Bronwyn Dunn, Jordan Hunt, Kevin Zhu, Pradeep Kumar Vikraman, Santosh Kalidindi, Satyajit Thadeshwar, Tom Lee, Wei Liu


Data Mesh — A Data Movement and Processing Platform @ Netflix was originally published in the Netflix TechBlog on Medium.

Instaclustr support for Apache Cassandra 3.0, 3.11 and 4.0

This blog provides a reference to Instaclustr’s support for Apache Cassandra and can be used to understand the status of each version. The table below displays the dates on which the Apache Cassandra project will end support for the corresponding major versions. Instaclustr will provide our customers with extended support for these versions for 12 months beyond the Apache project’s dates. During the extended support period, Instaclustr’s support will be limited to critical fixes only.

The Apache Cassandra versions supported on Instaclustr’s Managed Platform provided in this blog are also applicable for our Support-only customers.

The Apache Cassandra project has put in an impressive effort to support these versions thus far, and we’d like to take the opportunity to thank all the contributors and maintainers of these older versions.

Our extended support is provided to enable customers to plan their migrations with confidence. We encourage those of you considering upgrades to explore the significant advantages of Cassandra 4.0 which has been in full general availability on our managed platform since September 2021.

If you have any further concerns or wish to discuss migration, please get in touch with our Support team.

The post Instaclustr support for Apache Cassandra 3.0, 3.11 and 4.0 appeared first on Instaclustr.

NoSQL Summer Blockbusters: From Wrangling Rust to HA Trial by Fire

With summer blockbusters returning in full force this year, maybe you’re on a bit of a binge-watching, popcorn-munching kick? If your tastes veer towards the technical, we’ve got a marquee full of on-demand features that will keep you on the edge of your seat:

  • Escaping Jurassic NoSQL
  • Wrangling Rust
  • Daring database engineering feats
  • High availability trial by fire
  • Breaking up with Apache Cassandra

Here’s an expansive watchlist that crosses genres…

ScyllaDB Engineering

Different I/O Access Methods For Linux, What We Chose For ScyllaDB, And Why

When most server application developers think of I/O, they consider network I/O since most resources these days are accessed over the network: databases, object storage, and other microservices. However, the developer of a database must also consider file I/O.

In this video, ScyllaDB CTO Avi Kivity provides a detailed technical overview of the available choices for I/O access and their various tradeoffs. He then explains why ScyllaDB chose asynchronous direct I/O (AIO/DIO) as the access method for our high-performance low latency database and reviews how that decision has impacted our engineering efforts as well as our product performance.

Avi covers:

  • Four choices for accessing files on a Linux server: read/write, mmap, Direct I/O (DIO) read/write, and asynchronous direct I/O (AIO/DIO)
  • The tradeoffs among these choices with respect to core characteristics such as cache control, copying, MMU activity, and I/O scheduling
  • Why we chose AIO/DIO for ScyllaDB and a retrospective on that decision seven years later

WATCH NOW

Understanding Storage I/O Under Load

There’s a popular misconception that (modern) SSDs are easy to deal with: they supposedly work pretty much like RAM, just behind a “legacy” submit-complete API, and other than keeping a disk’s possible peak performance in mind and maybe maintaining priorities across different I/O streams, there’s not much to care about. This is not quite the case: SSDs do show non-linear behavior, and understanding a disk’s real abilities is crucial when it comes to squeezing as much performance from it as possible.

Diskplorer is an open-source disk latency/bandwidth exploration toolset. Using Linux fio under the hood, it runs a battery of measurements to discover the performance characteristics of a specific hardware configuration, giving you an at-a-glance view of how server storage I/O will behave under load. ScyllaDB CTO Avi Kivity shares an interesting approach to measuring disk behavior under load, gives a walkthrough of Diskplorer, and explains how it’s used.

With this elaborated model of the disk at hand, it becomes possible to build latency-oriented I/O scheduling that cherry-picks requests from the incoming queue while keeping the disk load perfectly balanced. ScyllaDB engineer Pavel Emelyanov also presents the scheduling algorithm developed for the Seastar framework and shares the results achieved with it.

WATCH NOW

ScyllaDB User Experiences

Eliminating Volatile Latencies: Inside Rakuten’s NoSQL Migration

Patience with Apache Cassandra’s volatile latencies was wearing thin at Rakuten, a global online retailer serving 1.5B worldwide members. The Rakuten Catalog Platform team architected an advanced data platform – with Cassandra at its core – to normalize, validate, transform, and store product data for their global operations. However, while the business was expecting this platform to support extreme growth with exceptional end-user experiences, the team was battling Cassandra’s instability, inconsistent performance at scale, and maintenance overhead. So, they decided to migrate.

Watch this video to hear Hitesh Shah’s firsthand account of:

  • How specific Cassandra challenges were impacting the team and their product
  • How they determined whether migration would be worth the effort
  • What processes they used to evaluate alternative databases
  • What their migration required from a technical perspective
  • Strategies (and lessons learned) for your own database migration

WATCH NOW

Real-World Resiliency: Surviving Datacenter Disaster

Disaster can strike even the most seemingly prepared businesses. In March 2021, a fire completely destroyed a popular cloud datacenter in Strasbourg, France, run by OVHcloud, one of Europe’s leading cloud providers. Because of the fire, millions of websites went down for hours and a tremendous amount of data was lost.

One company that fared better than others was Kiwi.com, the popular European travel site. Despite the loss of an entire datacenter, Kiwi.com was able to continue operations uninterrupted, meeting their stringent Service Level Agreements (SLAs) without fail.

What was the secret to their survival while millions of other sites caught in the same datacenter fire went dark?

Watch this video to learn about Kiwi.com’s preparedness strategy and get best practices for your own resiliency planning, including:

  • Designing and planning highly available services
  • The Kiwi.com team’s immediate disaster response
  • Long-term recovery and restoration
  • Choosing and implementing highly resilient systems

WATCH NOW

NoSQL Trends and Best Practices

Beyond Jurassic NoSQL: New Designs for a New World

We live in an age of rapid innovation, but our infrastructure shows signs of rust and old age. Under the shiny exterior of our new technology, some old and dusty infrastructure is carrying a weight it was never designed to hold. This is glaringly true in the database world, which remains dominated by databases architected for a different age. These legacy NoSQL databases were designed for a different (nascent) cloud, different hardware, different bottlenecks, and even different programming models.

Avishai Ish-Shalom shares his insights on the massive disconnect between the potential of a brave new distributed world and the reality of “modern” applications relying on fundamentally aged databases. This video covers:

  • The changes in the operating environment of databases
  • The changes in the business requirements for databases
  • The architecture gap of legacy databases
  • New designs for a new world
  • How the adoption of modern databases impacts engineering teams and the products you’re building

WATCH NOW

Learning Rust the Hard Way for a Production Kafka + ScyllaDB Pipeline

Numberly operates business-critical data pipelines and applications where failure and latency mean “lost money” in the best-case scenario. Most of their data pipelines and applications are deployed on Kubernetes and rely on Kafka and ScyllaDB, with Kafka acting as the message bus and ScyllaDB as the source of data for enrichment (a sketch of this pattern follows the list below). The availability and latency of both systems are thus very important for these data pipelines. While most of Numberly’s applications are developed in Python, they found a need to move high-performance applications to Rust in order to benefit from a lower-level programming language.

Learn the lessons from Numberly’s experience, including:

  • The rationale for selecting a lower-level language
  • Strategies for developing using a lower-level Rust code base
  • Observability and analyzing latency impacts with Rust
  • Tuning everything from Apache Avro to driver client settings
  • How to build a mission-critical system combining Apache Kafka and ScyllaDB
  • Feedback from 6 months of Rust in production
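For orientation, here is a minimal, hedged sketch of the consume-then-enrich pattern described above, written in Java with the Apache Kafka client and a CQL driver purely for illustration (Numberly’s implementation is in Rust, and the topic, keyspace, table, and column names below are made up):

    import com.datastax.oss.driver.api.core.CqlSession;
    import com.datastax.oss.driver.api.core.cql.PreparedStatement;
    import com.datastax.oss.driver.api.core.cql.Row;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    import java.net.InetSocketAddress;
    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;

    public class EnrichmentPipelineSketch {
        public static void main(String[] args) {
            // Kafka consumer: the message-bus side of the pipeline.
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            props.put("group.id", "enricher");
            props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
            consumer.subscribe(Collections.singletonList("events"));           // hypothetical topic

            // ScyllaDB session: the enrichment-lookup side of the pipeline.
            CqlSession session = CqlSession.builder()
                .addContactPoint(new InetSocketAddress("127.0.0.1", 9042))
                .withLocalDatacenter("datacenter1")
                .build();
            PreparedStatement lookup = session.prepare(
                "SELECT segment FROM demo.user_profiles WHERE user_id = ?");    // hypothetical table

            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> rec : records) {
                    // Enrich each event with reference data looked up in ScyllaDB.
                    Row row = session.execute(lookup.bind(rec.key())).one();
                    String segment = (row == null) ? "unknown" : row.getString("segment");
                    System.out.printf("event=%s user=%s segment=%s%n", rec.value(), rec.key(), segment);
                }
            }
        }
    }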

WATCH NOW

Latency and Consistency Tradeoffs in Modern Distributed Databases

Just over 10 years ago, Dr. Daniel Abadi proposed a new way of thinking about the engineering tradeoffs behind building scalable, distributed systems. According to Dr. Abadi, this new model, known as the PACELC theorem, comes closer to explaining the design of distributed database systems than the well-known CAP theorem.

Watch this video to hear Dr. Abadi’s reflections on PACELC ten years later, explore the impact of this evolution, and learn how ScyllaDB Cloud takes a unique approach to support modern applications with extreme performance and low latency at scale.

WATCH NOW

Getting the Most out of ScyllaDB University LIVE Summer School 2022

ScyllaDB University LIVE Summer Session for 2022 is right around the corner, taking place on Thursday, July 28th, 8AM-12PM PDT!

REGISTER NOW

As a quick reminder, ScyllaDB University LIVE is a FREE, half-day, instructor-led training event, led by our top engineers and architects. We will have two parallel tracks: a getting-started track focused on NoSQL and ScyllaDB Essentials, and a track covering advanced ScyllaDB strategies and best practices. Across the tracks, we’ll have hands-on training and introduce new features. Following the sessions, we will host a roundtable discussion where you’ll have the opportunity to talk with ScyllaDB experts and network with other users.

Participants who complete the training will have access to more free, online, self-paced learning material such as our hands-on labs on ScyllaDB University.

Additionally, those who complete the training can earn a certification and some cool swag!

Detailed Agenda and How to Prepare

You can prepare for the event by taking some of our courses at ScyllaDB University. They are completely free and will allow you to better understand ScyllaDB and how the technology works.

Here are the sessions on the ScyllaDB University LIVE agenda. Below each session, we list the related ScyllaDB University lessons you might want to explore in advance.

Essentials Track

ScyllaDB Essentials

Intro to Scylla, Basic Concepts, Scylla Architecture, Hands-on Demo

Suggested learning material:

ScyllaDB Basics

Basic Data Modeling, Definitions, Data Types, Primary Key Selection, Clustering Key, Compaction, ScyllaDB Drivers, Overview

Suggested learning material:

Build Your First ScyllaDB-Powered App

This session is a hands-on example of how to create a full-stack app powered by Scylla Cloud and implement CRUD operations using NodeJS and Express.

Suggested learning material:

Advanced Track

Kafka and Change Data Capture (CDC)

What is CDC? Consuming Data, Under the Hood, Hands-on Example

Suggested learning material:

Running ScyllaDB on Kubernetes

ScyllaDB Operator, ScyllaDB Deployment, Alternator Deployment, Maintenance, Hands-on Demo, Recent Updates

Suggested learning material:

Advanced Topics in ScyllaDB

Collections, UDT, MV, Secondary Index, Prepared Statements, Paging, Retries, Sizing, TTL, Troubleshooting

Suggested learning material:

REGISTER NOW FOR SCYLLADB UNIVERSITY LIVE