An Interview with Pekka Enberg: Modern Hardware, Old APIs

Pekka Enberg has been working on Scylla since before its inception in 2014. Prior to joining ScyllaDB he worked on a variety of technologies, ranging from High Frequency Trading (HFT) backends and web applications to the JVM runtime and operating system kernels. Pekka is currently working on a PhD in computer science, exploring new kernel abstractions and interfaces suitable for modern computer architectures.

Being fascinated by the evolution of hardware and software and the mismatch that sometimes happens between them — or rather, their intricate co-evolution — I couldn’t resist the opportunity to ask Pekka for his perspective on how kernel APIs and architecture are standing the test of time. After all, Linux is a 30-year-old project; older if you trace it back to the UNIX API from the 1970s. Pekka’s unique experience on both sides of the API brings an interesting perspective, which also sheds light on the rationale behind the Seastar framework and the Scylla database architecture.

Let’s dive in!

Avishai: Pekka, tell me a little bit about your PhD dissertation and what drove you to work on new kernel interfaces? What’s wrong with the ones we have?

Pekka: I am obviously not the first one to consider this topic, but I have a bit of a personal story behind it, related to Scylla. Some time in the spring of 2013, Avi Kivity (ScyllaDB CTO) approached me and wanted to talk about me maybe joining his newly founded company. I knew Avi from the Linux kernel virtualization scene, and had met him at Linux conferences. When he told me they were building an operating system from scratch to make things run faster in virtualized environments, I was immediately sold. So we built OSv, an operating system kernel purpose-built for hypervisor-based virtual machines (VMs), and integrated with the Java virtual machine (JVM) so that you could run any Java application tightly coupled with the kernel with improved performance — or at least that was the assumption.

I was performance testing Apache Cassandra running on top of OSv, which was one of the big application targets for us. The idea was that we would optimize the operating system layer and co-optimize it with the JVM — we had people with extensive JVM backgrounds who had worked on the JRockit JVM. However, we discovered that we couldn’t get a lot of gains, despite the obvious architectural edge of OSv, because of the way the application was structured.

Apache Cassandra was built with a traditional large thread pool architecture (staged event-driven architecture, or SEDA). With this kind of architecture, you break work into multiple stages. For example, you receive a request in one stage, then do some parsing in another stage, and so on. Each stage can run on a different thread, so you have these large thread pools to be able to take advantage of the parallelism in hardware. But what we saw in performance testing was that in some scenarios, Cassandra was spending a lot of its time waiting for locks to be released. Each stage handing over work to another had to synchronize, and locking showed up really high in CPU profiles.

Around that time (late 2014) Avi started to work on Seastar, which is sort of an antithesis to the thread pool architecture. You have just one kernel thread running on a CPU, and you try to partition application work and data, just as you would in a distributed system, so that threads don’t share anything and never synchronize with each other. So Seastar thinks about the machine as a distributed system rather than a shared-memory machine, which was the typical way of building multi-threaded services at the time.

But back to your question: why did the kernel become such a problem? We have a lot of different abstractions in the kernel, like threading, and they require crossing from user space to kernel space (context switching). This cost has gone up quite a bit, partly because of the deep security issues uncovered in CPUs (such as Meltdown and Spectre), but also relative to other hardware speeds, which have changed over the years.

For example, the network is getting faster all the time and we have very fast storage devices, which is why the relative cost of context switching is much higher than it was before. The other thing is the kernel has the same synchronization problem we saw in user-level applications. The Linux kernel is a monolithic shared-memory kernel, which means that all CPUs share many of the same data structures, which they have to synchronize. Of course, there’s tons of work going on in making those synchronization primitives very efficient, but fundamentally you still have the same synchronization problem. Take the Linux networking stack for example: you have some packet arriving on the NIC, which causes an interrupt or a poll event. The event is handled on some CPU, but it’s not necessarily the same CPU that actually is going to handle the network protocol processing of the packet, and the application-level message the packet contains might also be handled on another CPU. So not only do you have to synchronize and lock moving data around, you also invalidate caches, do context switches, etc.

Avishai: Suppose you are running the JVM on Linux. You have the networking stack, which has a packet on one kernel thread that is supposed to be handled by a JVM thread, but they don’t actually know about each other, is that correct? It sounds like the JVM scheduler is “fighting” the kernel scheduler for control and they don’t actually coordinate.

Pekka: Yes, that’s also a problem for sure. This is a mismatch between what the application thinks it’s doing and what the kernel decides to do. A similar and perhaps more fundamental issue is the virtual memory abstraction: the application thinks it has some memory allocated for it, but unless you specifically tell the kernel to never take this memory away (the mlock system call), then when you’re accessing some data structure it might not be in memory, triggering a page fault, which may result in unpredictable performance. And while that page fault is being serviced, the application’s kernel thread is blocked, and there is no way for the application to know this might happen.
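
To make the mlock point concrete, here is a minimal sketch (plain POSIX, not Scylla code) of how an application can pin its memory so the page faults described above can’t be triggered by reclaim:

// Hedged illustration: lock all current and future pages of this process in RAM
// so that touching an already-allocated data structure never has to wait for a
// major page fault.
#include <sys/mman.h>
#include <cstdio>

int main() {
    if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0) {
        perror("mlockall"); // usually requires CAP_IPC_LOCK or a raised RLIMIT_MEMLOCK
        return 1;
    }
    // ... allocate and use memory; locked pages will not be paged out ...
    return 0;
}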

The Seastar framework attempts to solve this issue by basically taking control over the machine, bypassing many OS abstractions. So Seastar is not just about eliminating context switches, synchronization and such; it’s also very much about control. Many people ask if the choice of C++ as the programming language is the reason why Scylla has a performance advantage over Cassandra. I think it is, but not because of the language itself; it’s because C++ provides more control.

The JVM generates really efficient code which can be as fast as C++ in most cases, but when it comes to control and predictability the JVM is more limited. Also, when Scylla processes a query it handles caching itself, as opposed to many other databases which use the kernel controlled page cache. All the caching in Scylla is controlled by Scylla itself, and so you know there’s some predictability in what’s going to happen. This translates into request processing latency which is very predictable in Scylla.

Avishai: You said that Seastar is not only about control, can you elaborate more on that?

Pekka: Seastar is built around an idea of how to program multi-core machines efficiently: avoiding coordination and not blocking kernel threads. Seastar has this future/promise model which allows you to write application code efficiently to take advantage of both concurrency and parallelism. A basic example: you write to the disk which is a blocking operation because there is some delay until the data hits whatever storage; the same for networking as well. For a threaded application which uses blocking I/O semantics you would have thread pools because this operation would block a thread for some time, so other threads can use the CPU in the meantime, and this switching work is managed by the kernel. With a thread-per-core model if a thread blocks that’s it — nothing can run on that CPU, so Seastar uses non-blocking I/O and a future/promise model which is basically a way to make it very efficient to switch to some other work. So Seastar is moving these concurrency interfaces or abstractions into user space where it’s much more efficient.
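
To make this concrete, here is a minimal sketch in the spirit of Seastar’s future/promise style (the names follow Seastar’s public tutorial API, but treat the exact headers and signatures as approximate):

#include <seastar/core/app-template.hh>
#include <seastar/core/seastar.hh>
#include <iostream>

// Dispatch a disk operation without ever blocking the kernel thread that owns
// this core: open_file_dma() returns a future immediately, and the lambdas
// passed to .then() are continuations that run on the same shard once the
// I/O actually completes.
seastar::future<> print_file_size(const char* name) {
    return seastar::open_file_dma(name, seastar::open_flags::ro)
        .then([] (seastar::file f) {
            // Capture f to keep the file handle alive until the size arrives.
            return f.size().then([f] (uint64_t size) {
                std::cout << "file size: " << size << " bytes\n";
                // (file left open for brevity; a real program would chain f.close())
            });
        });
}

int main(int argc, char** argv) {
    seastar::app_template app;
    // While the I/O above is in flight, the reactor keeps running other ready
    // tasks on this CPU instead of sleeping.
    return app.run(argc, argv, [] { return print_file_size("data.bin"); });
}

The point is the shape of the code: dispatch, return a future, and let the reactor schedule the continuation, rather than parking a kernel thread in a blocking read.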

Going back to your question about why the operating system became such a problem: the kernel provides concurrency and parallelism via kernel threads, but sometimes you have to block a thread for whatever reason, perhaps to wait for some event to happen. Often the application actually has to consider the fact that making a thread sleep can be much more expensive than just burning the CPU a little bit — for example, by polling. Making threads sleep and wake up takes time because there’s a lot the kernel has to do when a thread blocks: crossing between kernel and user space, updating and checking the thread data structures so that the CPU scheduler can run the thread on some other core, and so on, so it becomes really expensive. There’s also a delay in the wake-up when that event finally happens (maybe your I/O completed or a packet arrived); for whatever reason your application thread doesn’t run immediately, and this can become a problem for low-latency applications.

Avishai: So it’s all about dealing with waiting for data and running concurrent computations in the meantime on the CPU?

Pekka: Yes, and it’s a problem that was already encountered and solved in the 1950s. The basic problem was that the storage device was significantly slower than the CPU, and they wanted to improve the throughput of the machine by doing something useful while waiting for I/O to complete. So they invented something called “multi-programming”, which is more or less what we know as multithreading today. And this is what the POSIX programming model is to applications: you have a process, and this process can have multiple threads performing sequential work. You do some stuff in a thread and maybe some I/O, and you have to wait for that I/O before you can proceed with the computation.

But as we already discussed, this blocking model is expensive. Another issue is that hardware has changed over the decades. For example, not all memory accesses have equal cost because of something called NUMA (non-uniform memory access), but this isn’t really visible in the POSIX model. Also, system calls are quite expensive because of the crossing between kernel and user space. Today, you can dispatch an I/O operation on a fast NVMe storage device in the time it takes to switch between two threads on the CPU. So whenever you block, you probably missed an opportunity to do I/O, and that’s an issue. The question is: how do I efficiently take advantage of the fact that I/O is quite fast, but there is still some delay during which I want to do some useful work? You need to be able to switch tasks very fast, and this is exactly what Seastar aims to do. Seastar eliminates kernel/user space crossing and context switching costs as much as possible, and instead of using kernel threads for tasks it uses continuation chains or coroutines.
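
As a comparison with the continuation-chain sketch earlier, the same file-size example can be written in the coroutine style Pekka mentions (a sketch assuming Seastar’s co_await support for futures; again, treat the names as illustrative):

#include <seastar/core/coroutine.hh>
#include <seastar/core/seastar.hh>
#include <iostream>

// The waiting reads like sequential code, but each co_await simply suspends
// this task and lets the shard's single kernel thread run other ready work
// until the awaited I/O completes.
seastar::future<> print_file_size_coro(const char* name) {
    seastar::file f = co_await seastar::open_file_dma(name, seastar::open_flags::ro);
    uint64_t size = co_await f.size();
    std::cout << name << ": " << size << " bytes\n";
    co_await f.close();
}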

Avishai: It sounds like Seastar is quite a deviation from the POSIX model?

Pekka: The POSIX model really is something that was born in the 1960s and 1970s, to a large extent. It’s a simple CPU-centric model designed around a single CPU that does sequential computation (also known as the von Neumann model), which is easy to reason about. But CPUs internally haven’t worked that way since the beginning of the 1980s or even earlier. So it’s just an abstraction that programmers use — kind of a big lie in a sense, right?

POSIX tells programmers that you have this CPU that can run processes, which can have threads. It’s still the sequential computation model, but with multiple threads you need to remember to do some synchronization. But how things actually get executed is something completely different, and how you can efficiently take advantage of these capabilities is also something completely different. All abstractions are a lie to some degree, but that’s the point — it would be very difficult to program these machines if you didn’t have abstractions that everybody knows about.

Seastar is a different kind of programming model, and now you see these types of programming frameworks much more frequently. For example, you have async/await in Rust, which is very similar. When we started doing this in 2014, it was all a little bit new and a weird way of thinking about the whole problem, at least to me. Of course, if you write some application that is not so performance sensitive and you don’t care about latency very much, POSIX is more than fine, although you’ll probably want to use something even higher level.

Avishai: Can’t we just make existing models faster? Like using user-space POSIX threads running on top of something like Seastar?

Pekka: So user-space threading is not a new idea. In the 1990s, for example, the Solaris operating system did this. It had something called M:N scheduling, where you have N kernel threads and M user-level threads, which are time-multiplexed onto the kernel threads. So you could have the kernel set up a kernel thread per core and then, in user space, run hundreds of threads on top of those kernel threads, for example.

Seastar has the concept of a continuation, which is just a different incarnation of the same programming model. It’s just a different way of expressing the concurrency in your program. But yes, we could make thread context switching and maybe synchronization much faster in user space, though of course there are some additional problems that need to be solved too. There’s the issue of blocking system calls when doing I/O. There are known solutions to the problem, but they are not supported by Linux at the moment. In any case, this issue nicely ties back to my PhD topic: what kind of capabilities the OS should expose so you could implement POSIX abstractions in user space.

Avishai: I think right now most backend programmers are not actually familiar with the POSIX abstraction; POSIX is more popular with systems-level programmers. Backend developers are mostly familiar with the model presented by the runtime they use — JVM or Python or Golang, etc. — which is not exactly the same as POSIX. It raises an interesting question, especially now that we’re getting sandboxes with WebAssembly: perhaps we want to replace POSIX with a different model?

Pekka: So hopefully no kernel developer reads this, but I tend to think that POSIX is mostly obsolete… Tracing back the history of POSIX, the effort was really about providing a portable interface for applications, so you could write an application once and run it on different kinds of machine architectures and operating systems. For example, you had AT&T UNIX, SunOS, and BSD, and Linux later. If you wrote an application in the 1980s or 1990s, you probably wrote it in C or C++, and then POSIX was very relevant for portability because of all these capabilities that it provided. But with the emergence of runtimes like the Java Virtual Machine (JVM) and, more recently, Node.js and others, I think it’s a fair question to ask how relevant POSIX is. In any case, all of these runtimes are still largely built on top of the POSIX abstraction, but the integration is not perfect, right?

Take the virtual memory abstraction as an example. With memory-mapped files (mmap), a virtual memory region is transparently backed by files or anonymous memory. But there’s this funny problem that if you use something like Go programming language and you use its goroutines to express concurrency in your application — guess what happens if you need to take a page fault? The page fault screws up everything because it will basically stop the Go runtime scheduler, which is a user space thing.

Another interesting problem, which shouldn’t happen in theory but actually does, is file system calls that shouldn’t block but do. We use the asynchronous I/O interface of Linux, but it’s a known fact that some operations which, by the specification of the interface, should be non-blocking actually block, for various specific reasons.

For example, we recommend the XFS file system for Scylla, because it’s the file system that best implements non-blocking operations. However, in some rare cases even with XFS, when you’re writing and the file system has to allocate a new block or whatever, you hit a code path which takes a lock. If you happen to have two threads doing that, then you are blocked. There’s a good blog post by Avi about the topic, actually.

Anyway, this is one of the reasons why Seastar attempts to bypass anything it can. Seastar tells the Linux kernel “give me all the memory and don’t touch it.” It has its own I/O scheduler, and so on. Avi has sometimes even referred to Seastar as an “operating system in user space,” and I think that’s a good way to think about it.

Avishai: It sounds like one of the biggest problems here is that we have a lot of pre-existing libraries and runtimes that make a lot of assumptions about the underlying operating system, so if we change the abstraction to something else, we basically create a new world and would have to rewrite a lot of the libraries and runtimes.

Pekka: Yeah, when I started the work on my PhD I had this naive idea of throwing out POSIX. But when talking about my PhD work with Anil Madhavapeddy of MirageOS unikernel fame, he told me that he thought POSIX was a distraction, and that we probably can never get rid of it. And although POSIX doesn’t matter as much as it once did, you have to consider that it’s not just the POSIX specification or the interfaces, but all the underlying stuff like CPUs, which are highly optimized to run this type of sequential code.

A lot of work has been done to make this illusion of a shared memory system fast — and it’s amazingly fast. But for something like Scylla the question is can you squeeze more from the hardware by programming in a different way? I think that we might have reached a point where the cost of maintaining the old interfaces is too high because CPU performance is unlikely to get significantly better. The main reason being that CPU clock frequencies are not getting higher.

In the beginning of the 2000s, we were thinking that we’d have these 10 GHz CPUs soon, but of course that didn’t happen. By 2006 when we started to get multi-core chips, there was this thinking that soon we will have CPUs with hundreds of cores, but we actually don’t really have that today. We certainly have more cores, but not in the same order of magnitude as people thought we would.

But although CPU speeds aren’t growing as much, network speeds are already insane, with 40GbE NICs being commodity at least in cloud environments, and 100GbE and 200GbE NICs on the horizon.

It’s also interesting to see what’s happening in the storage space. With something like Intel Optane devices, which can connect directly to your memory controller, you have persistent memory that’s almost the same speed as DRAM.

I’m not an expert on storage, so I don’t know how far the performance can be improved, but it’s a big change nonetheless. This puts the idea of a memory hierarchy under pressure. You have this idea that as you move closer to the CPU, storage becomes faster, but smaller. So you start from the L1 cache, then move to DRAM, and finally to storage. But now we’re seeing the storage layer closing in on the DRAM layer in terms of speed.

This brings us back to the virtual memory abstraction, which is about providing an illusion of a memory space that is as large as storage space. But what do you gain from this abstraction, now that you can have this persistent storage which is almost as fast as memory?

Also, we have distributed systems, so if you need more memory than you can fit in one machine, you can use more machines over the network. So I think we are at the point in time where the price of this legacy abstraction is high. But whether it is high enough to justify rewriting the OS layer remains to be seen.

Want to Be Part of the Conversation?

If you want to have your own say in how the next generation of our NoSQL database gets designed, feel free to join our Slack community. Or, if you have the technical chops to contribute directly to the code base, check out the career opportunities at ScyllaDB.

 

The post An Interview with Pekka Enberg: Modern Hardware, Old APIs appeared first on ScyllaDB.

Using Helm Charts to Deploy Scylla on Kubernetes

Modern deployments are kept as code, and it is very common that development, staging and production environments differ only in a few properties like allocated resources. When Kubernetes is used as infrastructure, duplication of manifests for all types of deployments may be error prone. That’s why Helm templates are quite popular these days.

With Scylla Operator 1.1 we introduced three Helm Charts which will help you to deploy and customize Scylla products using Helm.

Helm Charts

Helm Charts are templated Kubernetes manifests combined into a single package that can be installed on your Kubernetes cluster. Once packaged, installing a Helm Chart into your cluster is as easy as running a single helm install command.

Release 1.1 provides three Helm Charts:

  • scylla-operator
  • scylla-manager
  • scylla

As the name suggests, each of them allows you to customize and deploy one of the Scylla products.

Helm Charts are shipped in two release channels:

  • The “latest” channel, as the name suggests, contains the latest charts, built after each merge and published on every successful end-to-end test run.
  • The “stable” channel contains only charts which have been released and approved by the QA team.

Note: only stable charts should be used in production environments. 

Users can override default parameters when installing a chart to suit their needs, or, if simplicity is preferred, they can use the default values we provide.

Deploy with a Few Clicks

Using Helm, Scylla can be installed in a matter of minutes. First we need to install Scylla Operator and its dependencies. To add the Scylla chart repository to Helm and update it, execute the following commands:

helm repo add scylla https://scylla-operator-charts.storage.googleapis.com/stable

helm repo update

One Scylla Operator dependency is Cert Manager. You can either install it using Helm from their Chart repository, or use a static copy stored in our GitHub repository:

kubectl apply -f https://raw.githubusercontent.com/scylladb/scylla-operator/master/examples/common/cert-manager.yaml

To install Scylla Operator use:

helm install scylla-operator scylla/scylla-operator --create-namespace --namespace scylla-operator

Now you’re ready to deploy the Scylla cluster. Let’s deploy a default single rack cluster just to play around with it.

helm install scylla scylla/scylla --create-namespace --namespace scylla

That’s it! Scylla Operator will coordinate each node joining the cluster, and soon your 3-node cluster should be ready to serve traffic. To check your Scylla pods, use the following:

kubectl -n scylla get pods -l "app.kubernetes.io/name=scylla"

Customization

Often the default values aren’t what you need. That’s where Helm templating comes in and solves the problem. You can either use the --set parameter to set individual parameters, or provide a YAML file with the values you want to override.
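
For example, an individual value can be overridden inline (an illustrative command; --set is a standard Helm flag, and the datacenter key is the same one used in the values file below):

helm install scylla scylla/scylla --create-namespace --namespace scylla --set datacenter=us-east-1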

Let’s say we want to deploy a 6 node Scylla cluster suitable for i3.xlarge instances and spread these instances between two racks.

datacenter: "us-east-1"
racks:
- name: "us-east-1a"
  members: 3
  storage:
    capacity: 800Gi
  resources:
    limits:
      cpu: 3
      memory: 24Gi
    requests:
      cpu: 3
      memory: 24Gi
- name: "us-east-1b"
  members: 3
  storage:
    capacity: 800Gi
  resources:
    limits:
      cpu: 3
      memory: 24Gi
    requests:
      cpu: 3
      memory: 24Gi

To preview the manifests that will be generated, you can append the --dry-run parameter to the install command.
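
For instance, assuming the overrides above are saved to the 6_i3xlarge.yaml file used below:

helm install scylla scylla/scylla --create-namespace --namespace scylla -f 6_i3xlarge.yaml --dry-run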

If you’re satisfied with the output, run the same command without the --dry-run flag:

helm install scylla scylla/scylla --create-namespace --namespace scylla -f 6_i3xlarge.yaml

To find out all of the available parameters, you can check the Helm Chart source code available in the Scylla Operator repository.

Summary

As you can see, we are constantly moving forward in order to make Scylla Operator as good as possible. If you find any issues, make sure to report them so we can fix them in one of the next releases.

FAQs

Where can I find Scylla Helm Charts?

There are two repositories, one for each of the release channels.
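
To add them to Helm, use commands along these lines (the stable URL matches the one used earlier in this post; the latest-channel URL and the repository aliases are assumed to follow the same pattern):

helm repo add scylla https://scylla-operator-charts.storage.googleapis.com/stable

helm repo add scylla-latest https://scylla-operator-charts.storage.googleapis.com/latest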

Source code of Helm Charts is available in Scylla Operator GitHub repository.

How do I upgrade my existing deployment using Scylla Operator 1.0?

You can find the upgrade procedure in Scylla Operator documentation.

Do I have to use Helm to deploy Scylla Operator?

No! It’s just an option. We also provide static YAML manifests which can be used to deploy Scylla Operator and the other components. Check out our GitHub repository or documentation for details.

Can I use Scylla Manager for free?

Yes. Scylla Manager is available for free for Scylla Enterprise customers and Scylla Open Source users. However, with Scylla Open Source, Scylla Manager is limited to 5 nodes. See the Scylla Manager Proprietary Software License Agreement for details.

GET STARTED WITH SCYLLA OPERATOR

The post Using Helm Charts to Deploy Scylla on Kubernetes appeared first on ScyllaDB.

Instaclustr’s Support for Cassandra 2.1, 2.2, 3.0, and 3.11

As the official release of Cassandra 4.0 approaches, some of our customers who are still using Cassandra 2.1 and 2.2 have asked us where we stand on support. The Apache Cassandra project will end support for the following versions on the following dates:

  • Apache Cassandra 3.11 will be fully maintained until April 30, 2022, and with critical fixes only until April 30, 2023
  • Apache Cassandra 3.0 will be maintained until April 30, 2022 with critical fixes only
  • Apache Cassandra 2.2 will be maintained until April 30, 2021 with critical fixes only
  • Apache Cassandra 2.1 will be maintained until April 30, 2021 with critical fixes only

Instaclustr will continue to support these versions for our customers for 12 months beyond these dates. During this period of extended support, fixes will be limited to critical issues only. This means Instaclustr customers will receive support until the following dates:

  • Apache Cassandra 3.11 will be maintained until April 30, 2024 with critical fixes only
  • Apache Cassandra 3.0 will be maintained until April 30, 2023 with critical fixes only
  • Apache Cassandra 2.2 will be maintained until April 30, 2022 with critical fixes only
  • Apache Cassandra 2.1 will be maintained until April 30, 2022 with critical fixes only

The Apache Cassandra project has put in an impressive effort to support these versions thus far, and we’d like to take the opportunity to thank all the contributors and maintainers of these older versions.

Our extended support is provided to enable customers to plan their migrations with confidence. We encourage those of you considering upgrades to explore the significant advantages of Cassandra 4.0 which currently has a beta in preview on our managed platform and is expected to be in full general availability release by June 2021.

If you have any further concerns or wish to discuss migration, please get in touch with your Customer Success representative.

The post Instaclustr’s Support for Cassandra 2.1, 2.2, 3.0, and 3.11 appeared first on Instaclustr.

Mapped: A New Way to Control Your Business via IIoT

Mapped launched their service this year to enable businesses to control and manage their facilities via a unified, AI-powered data infrastructure platform. Their modular and extensible platform brings together disparate data sets via various APIs related to the Industrial Internet of Things (IIoT). From your lobby to your elevators, from your HVAC and power systems to your industrial devices and security systems, it provides an all-in-one view.

Mapped has done precisely that — mapped over 30,000 different makes/models of devices across 900 different classes — using a GraphQL API and an industry-standard ontology, allowing developers and users to build applications tailored to their business needs. The team at Mapped has already dealt with the hard part — data normalization, data extraction, and relationship discovery — allowing users to get right to work applying business rules and logic, finding root causes, and establishing predictions about the future of their infrastructure.

To make such a system robust and scalable, Mapped chose JanusGraph for its graph data model, and Scylla as the underlying performant and reliable NoSQL data store.

I had the opportunity recently to interview Jose de Castro, CTO at Mapped, about the launch of their platform. I wanted to understand the nature of Mapped’s business model. He gave me the example of how they can be utilized in the realm of commercial real estate.

“Let’s say you own a high-rise building. Or lease a few floors. You want to ask business-level questions like, ‘How is my space being utilized? How can I be more energy efficient? Are all these investments paying off? How are my solar panels working?’ Right now, before Mapped, to get those answers you have to look at many varied and disparate systems. There’s no way to make correlations between them either. Elevators are elevators. Fire safety? That’s separate. We wanted to create a platform where you can bring all those systems together and see your entire business — its physical and system infrastructure — in a cohesive manner. Then we want to provide one unified API, based in GraphQL, to understand it all.”

“Plus, we use Machine Learning (ML) to infer connections and relationships between that data. For example, maybe two machines are having problems on the same floor of a building. Could that be an underlying electrical issue they share?”

“Or, let’s say you want to detect tailgating (also known as piggybacking) — that physical security breach when an unbadged person slides in just after a badged person, or worse, your well-meaning personnel politely hold the door open for them. You want to understand devices in motion — a certain door that is open for far longer than is usual.”

While Mapped is interested in working with the building and big property companies, they see their more natural customer base amongst the tenants themselves, who have a high desire to maintain visibility into the most current data across all of their environments and investments.

Mapped also has plans to branch into other specialized solutions: manufacturing and healthcare, freight and logistics, energy, utilities and even commercial insurance.

Open APIs for Developers

To make this broad and deep vision work, Mapped has established an ontology and built an initial set of connectors. They have also published a specification to allow other device manufacturers to create connectors for their own equipment.

Mapped has already done the hard part — normalization. As Jose explained, “Mapped is opinionated. We believe there are more consistent ways to do things. With many other middleware vendors, you have to figure it out for yourself.”

Jose credited the mind behind a lot of Mapped’s vision: Jason Koh, their Chief Data Scientist. Jason is one of the chief proponents behind Brick, the open source open data schema for buildings. Mapped is currently implemented as per Brick version 1.2, with its own extensions.

An example of the data model for an HVAC air handling unit designed in Brick

On top of Brick is BOT, the Building Topology Ontology, a W3C initiative that allows the physical layouts of buildings to be shared. This allows a logical representation of devices to be laid down on a physical topology model.

Mapped-in-a-Box: JanusGraph and Scylla Under the Hood

Powering all of this is, as usual, the tandem pair of JanusGraph for its graph data modeling and querying capabilities backed by a Scylla NoSQL database for its highly performant, highly-scalable, highly-available storage engine.

JanusGraph

Mapped’s path to Scylla began early in the development cycle. For development, they had initially deployed JanusGraph with Apache Cassandra as its data store using Docker Compose, plus some of their own magic for a stable internal environment.

However, as Jose noted, Cassandra “was a hog running in your laptop.” They switched to Scylla simply to allow their developer laptops to run smoother, lighter, and cooler. Since Scylla had worked so well in development, they asked, “Why not use it in production?”

To make their production deployment and management experience even easier, Mapped is using the new Scylla Operator for Kubernetes.

Beyond the initial JanusGraph use case, Mapped intends to use Scylla for all of its latency-sensitive workloads.

Scylla’s high availability capabilities are another key reason Mapped built their business on Scylla. Mapped has a requirement for “five nines” Service Level Agreements (SLAs) — meaning no more than 5 minutes and 15 seconds of downtime per year. The experience of other customers such as Kiwi.com bears out the non-stop durability of Scylla, even during times of disaster.

Learn More

If you wish to discover more about the services and developer APIs Mapped offers, please visit their website at mapped.com.

If you’d like to learn more about how to build your own business on Scylla, please contact us, or join our Slack channel.

The post Mapped: A New Way to Control Your Business via IIoT appeared first on ScyllaDB.

Scylla Manager 2.3 Suspend & Resume

Scylla Manager 2.3 introduces a new mechanism to suspend and resume operations that can be used to implement maintenance windows or an off-peak-hours strategy. In this blog post I will demonstrate how you can use this new feature to avoid running repairs during prime time.

When a cluster is in the suspended state, the only Scylla Manager tasks allowed to run are the health check tasks, that is, checking whether the CQL, Alternator and REST services are responding in a timely manner. All the other tasks are stopped. Scheduling new tasks to run within the next 8-hour period, or starting tasks manually, is not allowed.

To put a cluster into the suspended state, just execute the “sctool suspend” command against the cluster. To validate that the tasks are stopped, you can run the “sctool task list” command. In the example below, cluster “prod” was suspended, stopping an ongoing one-shot backup task.
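
For example, to suspend a cluster named “prod” and check that its tasks are stopped (the -c flag names the cluster, as in the crontab entries later in this post):

sctool suspend -c prod

sctool task list -c prod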

In the “sctool task list” command output, the [SUSPENDED] prefix indicates that the task will not run until the cluster is resumed. When resuming the cluster, the default behavior is not to resume the stopped tasks automatically; the tasks will run according to their schedule, that is, on “Next run”. If you wish to resume the stopped tasks immediately, add the “--start-tasks” flag.
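
For example, to resume the “prod” cluster and immediately restart its stopped tasks:

sctool resume -c prod --start-tasks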

To automate the process of suspending and resuming, in the context of reducing load during peak hours, we can use crontab. In the example below we put the cluster into the suspended state on weekdays from 7 PM to midnight.

Procedure:

  1. In the Scylla Manager server, save the current crontab content to a file by executing “crontab -l > ./my-crontab”
  2. Open the ./my-crontab file in an editor, and add the following lines:
    0 19 * * MON-FRI sctool suspend -c prod &>> ~/prod-suspend.log
    0 0 * * TUE-SAT sctool resume -c prod &>> ~/prod-resume.log
  3. Install the rules by executing “crontab ./my-crontab”

Alternatively you can automate that with an external scheduler using Scylla Manager API. The following example shows how to suspend and resume cluster “prod” using curl from the Scylla Manager server host.

Next Steps

Read the release notes for Scylla Manager 2.3, then check out our download center. Remember that Scylla Manager is freely available for Scylla Open Source users for up to five nodes, and unlimited for users of Scylla Enterprise.

DOWNLOAD SCYLLA MANAGER

The post Scylla Manager 2.3 Suspend & Resume appeared first on ScyllaDB.

Dial C* for Operator: Unlocking Advanced Cassandra Configurations

Project Circe April Update

Project Circe is ScyllaDB’s year-long initiative to make Scylla, already the best NoSQL database, even better. For the month of April we are going to take a look inside the organization and code base to see what it takes to bring major new features into a project as dynamic as Scylla. Currently there are nearly a half-million lines of code in the scylladb/scylla repository on Github (482.7k as of this writing). Of those, thousands of source lines of code are so far dedicated to the library implementing the new Raft consensus protocol.

Raft and the Logical Clock

We’ve already covered what Raft is and how we plan to use it in Scylla. The hardest part, of course, is actually getting it to work. So I had a recent chat with Kostja Osipov, who leads development on this key infrastructure. He gave me an overview of some of the supporting efforts going into making Raft ready for release.

While Raft is currently being wired up to work with RPC to permit topology changes (add/remove nodes) the breadth of testing in order to make Raft truly resilient goes far beyond the source lines of code of the database commits themselves. “We added over a hundred test cases,” Kostja noted, “unit tests, functional tests using the concept of nemesis, or failure injections, plus a randomized test inspired by the Jepsen approach to testing.”

If you have a keen eye for scouring Github, you may already have come across the scylla/test/raft subdirectory, which includes another 2,853 source lines of code. The most basic of all tests is fsm_test.cc (finite state machine test), which treats the Raft server as a “device under test.” It models specific chains of events which must not lead to protocol failure, e.g. receiving an outdated message from a deposed leader. The next level of testing includes a mock network and mock logical clock, found in replication_test. It allows us to test how different combinations of Raft options, such as pre-voting, non-voting members and graceful leader step-down, work together over a potentially slow network or with non-synchronized clocks. In etcd_raft.cc you’ll find a port of the etcd Raft implementation’s unit tests. The team studied lots of Raft implementations and found the etcd testing effort one of the most thorough.

This sort of aggressive testing led to some interesting results. “One test found that our library crashes when the network reorders packets and there is a failure of one of the members. Another crash was when the leader fails while trying to bring on board a new cluster member — so there is a leader change — but it shouldn’t lead to a crash.”

Raft, despite being widely considered a simple protocol, has infinitely many protocol states. To be able to radically increase the number of states our testing explores, the ScyllaDB engineering team came up with an implementation of a logical clock. “It’s a clock that is ticking with the speed at which the computer executes the test, not the speed of a wall clock.” Using this logical clock the team was able to squeeze a lot of things into a single test that runs at CPU speed — millions of events per second.

In Scylla’s implementation, “Every state machine has [its] own instance of logical clock; this enables tests when different state machines run at different clock speeds.”

Recent Interesting Commits

Every week our CTO Avi Kivity produces a roundup of changes to our codebase sent out on our user mailing list entitled “Last week in scylla.git master.” You can read the latest month’s roundups here:

Here are a few of the more salient commits mentioned this past month:

Compaction Strategies Reshaped

Better Memory Allocation to Further Reduce Latencies and Stalls

  • The SSTable parser was modified by adding new methods for parsing byte strings. This avoids creating large memory allocations for some cells, reducing related latencies.
  • More code paths can now work with non-contiguous memory for table columns and intermediate values: comparing values, and the CQL write path. This reduces CPU stalls due to memory allocation when large blobs are present.
  • Continuing on the path of allowing non-contiguous allocations for large blobs, memory linearizations have been removed from Change Data Capture (CDC). This reduces CPU stalls when CDC is used in conjunction with large blobs.

Scylla Monitoring Stack 3.7

We recently released Scylla Monitoring Stack 3.7. One feature in particular (#1258) provides visibility of the accumulation and sending of Hinted Handoffs — updates that are maintained during transient node failures, and sent when nodes come back online. While Hinted Handoffs have been around in Scylla for many years, this new dashboard provides immediate observability into this aspect of extra load due to transient node failures.

Learn More in Scylla University

While much of what we’ve been talking about here is deep in the heart of Scylla’s source code, many readers might be new to Scylla, or even new to NoSQL in general. For you, we’ve created Scylla University. It is an entirely free online resource for users to build your NoSQL database skills. It’s your first step into the journey of mastering the monstrously-fast, monstrously-scalable database which is Scylla.

You can start with an overview of Scylla and then, at your own pace, move up into advanced architectural concepts like consensus protocols, learning how these power user-oriented features like Lightweight Transactions and how they work under the hood.

GET STARTED IN SCYLLA UNIVERSITY

The post Project Circe April Update appeared first on ScyllaDB.

Incremental Compaction 2.0: A Revolutionary Space and Write Optimized Compaction Strategy

Introduction

Let’s go through a brief history of compaction strategies. It all started with the Size-Tiered Compaction Strategy (STCS), which is optimized for writes. Later, it was realized that STCS has bad space and read amplification with overwrite-intensive workloads. Then, the Leveled Compaction Strategy (LCS) was introduced to solve that problem, but it introduced another problem: its write amplification.

So what do I do if I care about space but cannot afford the write amplification of LCS?

Early in 2020, the Incremental Compaction Strategy (ICS) came to the rescue. ICS solved the temporary space requirement issue in STCS; however, it didn’t solve the space amplification issue with overwrite workloads, which was unfortunately inherited from STCS. For example, when STCS and ICS are facing an overwrite workload, a given partition may be redundantly found in all SSTables, leading to the space and read amplification issues mentioned above.

What if we come up with an idea for ICS where both space and write are optimized when running overwrite workloads?

Turns out that’s possible. The RUM conjecture states that only two out of the three amplification factors can be reduced at once. STCS, for example, optimizes for writes, while sacrificing read and space. LCS optimizes for space and read, while sacrificing writes.

With the Scylla Enterprise 2020.1.6 release, ICS gained a new feature called Space Amplification Goal (SAG), which enables space optimization without killing the write optimized nature of the strategy. In simpler words, SAG will allow you to maximize your disk utilization without killing the write performance of ICS. Maximizing disk utilization is very important, from a business perspective, because your cluster will be able to store more data without having to expand its storage capacity. You’ll learn in the next section how that was made possible.

Maximizing Disk Utilization in ICS with Overwrite Workloads

Before the space optimization with SAG is explained, let’s first understand the write optimized nature of both STCS and ICS.

The picture above describes the size-tiered structure of STCS and ICS. The low write amplification is achieved by allowing accumulation of SSTables in a size tier, and when the tier reaches a certain size, its SSTables are compacted into the next larger tier. Therefore, a given piece of data only has to be copied once per each existing tier, allowing for high write performance.

Last but not least, let’s understand the space optimized nature of LCS.

The picture above describes the leveled structure of LCS. The low space amplification is achieved by not allowing data redundancy in a given level L. When a SSTable from level L is promoted to L+1, it will be compacted with all the overlapping SSTables in level L+1. This approach comes with a high write amplification cost though.

With all this knowledge in mind, how could SAG possibly lead ICS to being both space and write optimized?

Turns out that’s possible with a hybrid approach, where ICS combines both the leveled and the size-tiered structures into a single one.

With SAG turned on, the largest tier in ICS will behave exactly like a level in LCS. Data from the second-largest tier will be compacted with the data in the largest one, when the time comes, in a process known as cross-tier compaction, very similar to when LCS compacts SSTables in level 0 with all SSTables in level 1. So the largest tier becomes space optimized, as there’s no data redundancy in it. Given that it contains most of the data in a given table, overall space amplification is significantly reduced as a result of this optimization.

All tiers but the largest one will keep the original size-tiered behavior, where SSTables can still be accumulated in them, so they’re still write optimized.

With the largest tier being space optimized, while all the others being write optimized, ICS becomes both write and space optimized. So ICS + SAG, or ICS 2.0, is now a potential candidate for users with overwrite workloads who care about space.

Configuring SAG to Enable Space Optimization in ICS

Space amplification goal (SAG) was implemented as a new strategy option for ICS, that can be easily configured to make the strategy both space and write optimized. That means space optimization is disabled by default, which is reasonable because write-only (no overwrites) workloads don’t need it.

As the name implies, SAG is a goal for space amplification imposed on the strategy. To configure it, you must choose a value between 1 and 2. A value of 1.5 implies the aforementioned cross-tier compaction when the second-largest tier is half the size of the largest one. In simpler words, with a goal value of 1.5, the strategy will continuously work to reduce the space amplification to below 1.5. Keep in mind that the configured value is not an actual upper bound on space amplification, but rather a goal which the strategy will strive to achieve.

Let’s see ICS + SAG in action with different configuration values:

In the graph, SAG=0 means that SAG was disabled, i.e. ICS was optimized only for writes.

In the graph above, it can be seen that the lower the SAG value the lower the disk usage but the higher the compaction aggressiveness. In other words, the lower the SAG value the lower the space amplification but the higher the write amplification.

A simple schema change as follows will enable the SAG behavior:

ALTER TABLE foo.bar
    WITH compaction = {
        'class': 'IncrementalCompactionStrategy',
        'space_amplification_goal': '1.75' };

For starters, set SAG to 1.75 and conservatively decrease it in steps of 0.25, checking in monitoring after each step that the cluster can sustain the write rate without compaction falling behind. It’s not recommended to set it below 1.25, unless you really know what you’re doing.
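
For example, assuming the same hypothetical foo.bar table as above, the first step down from 1.75 could look like:

ALTER TABLE foo.bar
    WITH compaction = {
        'class': 'IncrementalCompactionStrategy',
        'space_amplification_goal': '1.5' };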

Recommendation: write amplification can be further reduced by setting Scylla option --compaction-enforce-min-threshold true, which guarantees the minimum threshold, 4 by default, is respected.

ICS+SAG vs. LCS. Which one should I pick as my compaction strategy?

ICS + SAG isn’t a complete replacement for LCS. LCS’ write amplification may be good enough if you are running an overwrite workload with high time locality, where recently written data has a high probability of being modified again soon. But if LCS has poor write amplification for your particular workload, ICS + SAG definitely becomes the better choice.

Conclusion

Until recently, ICS users had to live with suboptimal space amplification in the face of overwrite-intensive workloads. With the Scylla Enterprise 2020.1.6 release, ICS gained a new feature called Space Amplification Goal (SAG), which makes the strategy both space and write optimized, fixing the aforementioned problem. The space optimization translates into maximized disk utilization, which in turn translates into your cluster being able to store more data without having to add more nodes. So everybody out there who is either unhappy with LCS or uses ICS on tables with overwrite workloads: please go ahead and try ICS + SAG in order to maximize your disk usage and save costs as a result, without having to give up good write performance.

Stay tuned!

The post Incremental Compaction 2.0: A Revolutionary Space and Write Optimized Compaction Strategy appeared first on ScyllaDB.