We had the pleasure of hosting GumGum’s Keith Sader at our Scylla Summit this year. Keith began with some background drawn from his career, which has spanned the adtech space at GumGum as well as prior stints at the automobile shopping site Edmunds.com and at Garmin, the maker of tracking and mapping devices.
GumGum’s Verity Platform
Keith guided Scylla Summit attendees through how GumGum’s advertising system works. “Verity is our contextual platform. What it does is use Natural Language Processing [NLP] along with computer vision to really analyze the context of a website. This includes not only just strictly looking at the images and contextualizing those but also the images in relation to the rest of the page to get really, really accurate marketing segments to find the correct context for an ad.” This makes GumGum less reliant on cookie-based matching, which is increasingly being curbed by both privacy regulations and public scrutiny.
GumGum’s Verity platform is able to not only read and analyze the text of this recipe article but also to understand the image of the cake in order to match it to their ad inventory and propose an ad for sugar, all in less than the blink of an eye.
GumGum’s highly scalable ad server platform responds to requests coming from supply-side platforms (SSPs) and advertisers using a mix of both relational and non-relational datastores. The relational data store tracks the campaigns, ads, targets and platforms. The NoSQL datastores include DynamoDB and Scylla; DynamoDB is used as a cookie store.
“Probably the most important one, and the most relevant to this talk, is our tally back-end which we originally implemented in Cassandra. Tally does some very cool things around ad performance. Originally our Cassandra cluster was 51 instances around the globe — Virginia, Oregon, Japan, Ireland — and they were pretty beefy machines, i3.2xlarge instances. Just the basic cost of that alone was almost $200,000 not including staff time.” On top of that, two engineers each dedicated 20% of their time to keeping the cluster running.
“Cassandra is fantastic if you can throw a dev team at it. But we’re not really a Cassandra focused shop. We’re making an ad server with computer vision, which is a different problem almost entirely.” This is why GumGum turned to Scylla Cloud instead.
“What Tally records — and honestly this is probably some of the most important things for ad serving — when you are serving an ad you are looking to figure out ‘does it match the slot it’s going into?’ On a page for an ad slot there’s a lot of filtering, a lot of brand awareness, but the other part is ‘have you served this ad already?’ How do you know that? Well Tally knows that because this is where we count everything.”
“What Scylla via Tally does is help us count all the impressions, all the views, all the clicks, all the revenue, all the video streams, and play percentages along the way in the course of serving advertisements. What this tells us is when we can actually stop serving an ad when it’s fulfilled its campaign goal and serve something else. Or we can compare two different ads just to serve in a slot and determine ‘Okay based upon this one’s proportions of impressions to this one’s proportions of impressions what makes the most sense for us to serve now?’”
Whether or not GumGum was able to bid on and serve an ad, where and how it was placed in the context of a page, whether the viewer could see it and click on it, and whether a video played and how far a viewer got through it all determine the success or failure of ad placement. Keith explained how GumGum can take data collected in Tally and report back factually to the advertiser, whether that means “by the way people are only watching ten percent of your video ad” or “Hey this ad, this video plays 100%. It’s a fantastic campaign!”
GumGum uses third-party data providers to ensure Tally stays current and accurate. These jobs seed the data at the start of the day and continue throughout it. As Keith explained, “We adjust it based upon the best and most accurate knowledge we have at the time to determine whether or not we should serve another ad.” This means Tally isn’t precisely real-time, “but we’re pretty darn close.”
Keith showed the prior deployment of 51 nodes of Cassandra, with the bulk of them, 39, in North America: 30 in Virginia and 9 in Oregon. “This was our original Tally deployment. You can see we had most of our instances in North America where we advertised. We had nine of them in Ireland and three of them in Japan. As GumGum is a worldwide company we had to have relevant data across the globe. This will become important later.”
However, GumGum was having issues. “Cassandra wasn’t really performing under our high loads and we’re not a Cassandra engineering place. We sort of struggled with that. Adding nodes for us was also a really manual procedure. It was tough to go in and add nodes because we always had to tweak some configuration or do something else to make Cassandra work for us.”
“Also after the initial installation we lost whatever little bit of vendor support we had. We were out in the weeds and also stuck on an old version of Cassandra 2.1.5. I’m sure it was great at the time it was released. It wasn’t really supporting our growth.” Readers will note that Cassandra 2.1.5 was released in 2015.
Keith lamented the common problem many companies trying to run their own Cassandra clusters face: “We didn’t have an engineering department to throw at it.” He described the “classic too many cooks in the kitchen” issues they’d face. “We’d have product engineering, data engineering and operations all come in and manage our Cassandra cluster with not exactly unpredictable results — but not results we wanted.”
GumGum Moves to Scylla Cloud
“We thought about replacing this and we went through a bunch of different vendor selections. Originally we looked at DataStax. Great. A little pricey. Looked at DynamoDB. Not really right for us in terms of cost of performance. Plus we have to change the entire Tally backend. Redis? We’re pretty read heavy and we need some consistency in our writes. So redis wasn’t quite it.”
By comparison, “Scylla was a drop-in replacement. It was also a managed system so we could really lay off a lot of the operational items to someone else that wasn’t us. And we could focus on making relevant contextual ads and serving them. So we dropped in Scylla and it was great.”
GumGum’s heterogeneous data environment uses a combination of SQL and NoSQL systems. Originally they used Apache Cassandra for their Tally engine; in time it was replaced by Scylla.
Keith was more than a tad ironic after showing the updated slide. “This is magic. We dropped it in. Put it in the diagram. We’re done, right? Maybe not.”
“Let’s talk about architectures for a little bit. Cassandra by itself allocates thread and processor pools to handle a request and this is all scheduled by the OS. Great. It’s pretty easy to set up but you’ve got a couple problems. One is that this OS allocation can starve certain resources that need to read from your data store, or need to write to your data store.”
Keith pointed out that your OS can starve requests coming in, or obfuscate a poor keyspace setup users might have had. “You know Cassandra. Easy to get going. Hard to really operate.”
By comparison, “Scylla has an architecture that’s shard per core. A shared-nothing architecture. There’s a single thread allocated to a single core plus a little bit of RAM to actually handle a request for a data segment.”
While this is great because Keith didn’t have to worry about thread and processor allocations, “part of the downside of this architecture, which we ran into, is it can run hot and unbalanced from time to time. That can be based upon how you divvy up your keyspace.” Which in turn is based upon just how user requests come in.
“We had some pretty good initial results — we’ve dropped it in. We didn’t really have to change much in our query to the Cassandra [CQL] protocol. That was fantastic. Yet we’d still experience issues when the system was under load. We would get timeouts during reads lacking consistency. That was a little disconcerting for us.”
Looking into the issue, Keith noted “part of that was because we were using an old legacy Cassandra driver. The old drivers don’t have this awareness of how the underlying infrastructure works — the shard per core element I talked about. The new driver does.”
If you are interested, you can read more about what goes into making a Scylla shard aware driver in this article, and see the difference it can make in performance here. You can also see what goes into supporting our new Change Data Capture (CDC) feature in this article.
Another issue was discovered moving from Proof of Concept (POC) to production. “Our initial POC wasn’t quite size proportional. We under-allocated what we needed in terms of nodes to move to Scylla.” Another problem was that they originally set a replication factor of five for three availability zones, so there was a mismatch when trying to allocate servers across disparate datacenters. “I’ll let you do the math, but you should have those things match.”
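The math Keith alludes to is worth spelling out. A hedged sketch (our own illustration, not GumGum’s code) of why a replication factor of five cannot be spread evenly across three availability zones:

```python
# Illustration: why replication factor should line up with the number
# of availability zones. With RF=5 across 3 AZs, one zone inevitably
# holds fewer replicas than the others, skewing load and failure impact.
def replicas_per_zone(replication_factor: int, zones: int) -> list:
    """Distribute replicas across zones as evenly as possible."""
    base, extra = divmod(replication_factor, zones)
    return [base + (1 if i < extra else 0) for i in range(zones)]

# RF=5 across 3 AZs cannot be balanced: two zones hold an extra replica.
print(replicas_per_zone(5, 3))  # -> [2, 2, 1]
# RF=3 across 3 AZs places exactly one replica per zone.
print(replicas_per_zone(3, 3))  # -> [1, 1, 1]
```

With a matched RF, losing any one zone costs exactly one replica of each partition; with a mismatched RF, some partitions lose two at once.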
For the hot partitions, Cassandra would just allocate additional threads and cores to obscure the imbalanced data problem. “We had to wind up modifying our schema [and] queries for tables that return over 100 rows. We just had to figure out how to really shard our keyspace better.”
Another issue they uncovered was properly setting consistency levels on queries. For example, they had initially asked for a consistency level of ALL on a global basis. “That slowed things down.” They set the consistency level to LOCAL_QUORUM instead. Keith’s pro tip: when you move from Cassandra to Scylla, check your queries and understand how your sharding works.
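The difference between ALL and LOCAL_QUORUM is easy to quantify. A hedged sketch (our own illustration, assuming a replication factor of 3 per datacenter, which GumGum’s talk did not specify) of how many replica acknowledgments each consistency level waits for:

```python
# How many replica acknowledgments a coordinator waits for at each
# consistency level, given a replication factor (RF) per datacenter.
def quorum(rf: int) -> int:
    """A quorum is a strict majority of replicas."""
    return rf // 2 + 1

def acks_required(level: str, rf_per_dc: dict, local_dc: str) -> int:
    total_rf = sum(rf_per_dc.values())
    if level == "ALL":
        return total_rf                      # every replica in every DC
    if level == "QUORUM":
        return quorum(total_rf)              # majority across all DCs
    if level == "LOCAL_QUORUM":
        return quorum(rf_per_dc[local_dc])   # majority in the local DC only
    raise ValueError(level)

# Hypothetical three-datacenter layout, RF=3 in each.
rf = {"us-east": 3, "eu-west": 3, "ap-northeast": 3}
print(acks_required("ALL", rf, "us-east"))           # 9: waits on cross-ocean links
print(acks_required("LOCAL_QUORUM", rf, "us-east"))  # 2: local replicas only
```

Waiting on 9 replicas across three continents puts intercontinental round trips on the critical path of every query; waiting on 2 local replicas does not.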
By moving to Scylla Cloud, GumGum was able to cut its server count in Europe by a third, and their US cluster nearly in half.
In the end, GumGum ended up with lower overhead and higher throughput. They reduced their cluster from 51 instances to only 30 — 21 in North America (15 in Virginia, 6 in Oregon), 6 in Ireland and 3 in Japan.
The cost of licensing and servers for Scylla Cloud was slightly higher than Cassandra, but it included support and cluster management, which they were able to offload from their team, for an actual savings in Total Cost of Ownership (TCO).
“To basically take 20 percent of two people’s staff time and an entire load off of our system — this was the big win for us. For just 13 grand we got a much better and more performant data system around the globe.”
Discover Scylla Cloud
If you’d like to discover Scylla Cloud for yourself, we recently published a few articles about capacity planning with style using the Scylla Pricing Calculator, and getting from zero to ~2 million operations per second in 5 minutes.
It all begins with creating an account on Scylla Cloud. Get started today!
Project Circe is ScyllaDB’s year-long initiative to make Scylla, already the best NoSQL database, even better. We’re sharing our updates for the month of May 2021.
Better failure detection
Failure detection is now done directly by nodes pinging each other rather than through the gossip protocol. This is more reliable and the information is available more rapidly. Impact on networking is low, since Scylla already maintains a fully connected mesh in all clusters smaller than 256 nodes per datacenter, which is much larger than the typical cluster.
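A quick back-of-the-envelope sketch (our own arithmetic) of why the fully connected mesh is affordable at typical cluster sizes:

```python
# Connection count in a fully connected mesh grows quadratically with
# cluster size, but stays modest at typical node counts.
def mesh_connections(nodes: int) -> int:
    """Number of distinct node-to-node links in a full mesh."""
    return nodes * (nodes - 1) // 2

print(mesh_connections(30))   # -> 435 links for a 30-node cluster
print(mesh_connections(256))  # -> 32640 links at the per-datacenter cap
```

Direct pings simply reuse links the mesh already keeps open, which is why the added networking cost is low.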
SLA per workload
Scylla supports multiple workloads: some have real-time guarantees, while others are batch in nature, where latency matters less. Prior work allows Scylla to prioritize workloads by defining roles and mapping them to different scheduling groups. We recently researched how Scylla behaves when it is overwhelmed with requests. In such situations, a process called workload shedding comes into play. But which workload should we shed? It’s best not to take a completely random approach. A good step forward is the new timeout-per-role feature, which allows the user to map workloads to timeouts so the system can choose which ones to shed automatically.
It is now possible to configure timeouts based on a role, using the SERVICE LEVEL infrastructure. These timeouts override the global timeouts in scylla.yaml, and can be overridden on a per-statement basis.
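As a rough illustration of what this looks like in practice (a sketch based on Scylla’s service level CQL; the role and table names are hypothetical, and exact syntax may differ by version):

```cql
-- Create a service level with a timeout and attach it to a batch role.
CREATE SERVICE LEVEL batch_sl WITH timeout = 30s;
ATTACH SERVICE LEVEL batch_sl TO batch_role;

-- A per-statement timeout overrides the role-level one.
SELECT * FROM ks.events USING TIMEOUT 50ms;
```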
Virtual table enhancements
Infrastructure for a new style of virtual tables has been merged. While Scylla already supported virtual tables, it was hard to populate them with data. The new infrastructure reuses memtables as a simple way to populate a virtual table.
Off-strategy compaction is now enabled for repair. After repair completes, the SSTables generated by repair will first be merged together, then incorporated into the set of SSTables used for serving data. This reduces read amplification due to the large number of SSTables that repair can generate, especially for range queries where the bloom filter cannot exclude those SSTables.
Off-strategy compaction is also important when Repair Based Node Operations (RBNO) are used (this will soon be the default). RBNO pushes repair everywhere: to streaming, node removal and decommission. It’s important to tame compaction accordingly and automatically, so that you as an end user aren’t even aware of it.
Repair is now delayed until hints for that table are replayed. This reduces the amount of work that repair has to do, since hint replay can fill in the gaps that a downed node misses in the data set.
This month we took another step towards Raft. Raft group 0 has now reached a functional stage; it resembles the functionality etcd provides (you can read more about how we tested this in our April update). The team had to answer an interesting question: how does a minimal group of cluster members find each other? Raft in Scylla, as opposed to eventual consistency, needs a minimal set of nodes. The seed mechanism will come into play, but with a new definition: the seeds’ UUIDs will be required to boot an existing cluster, a better approach than risking split brain right from the get-go.
- We fixed a performance problem with many range tombstones.
- Change Data Capture (CDC) uses a new internal table for maintaining the stream identifiers. The new table works better with large clusters.
- Authentication had a 15-second startup delay that worked around dependency problems. The workaround has long been unneeded and is now removed, speeding up node start.
- Repair allocates working memory for holding table rows, but did not consider memory bloat and could over-allocate memory. It is now more careful.
- Scylla uses a log-structured memory allocator (LSA) for memtable and cache. Recently, unintentional quadratic behavior in LSA was discovered, so as a workaround the memory reserve size is decreased. Since the quadratic cost is in terms of this reserve size, the bad behavior is eliminated. Note the reserves will automatically grow if the workload really needs them.
- We fixed a bug in the row cache that can cause large stalls on schemas with no clustering key.
- The setup scripts will now format the filesystem with 1024 byte blocks if possible. This reduces write amplification for lightweight transaction (LWT) workloads. Yes, the Scylla scripts take care of this for you too!
- SSTables will now automatically choose a buffer size that is compatible with achieving good latency, based on disk measurements by iotune.
- The ScyllaDB git repositories hold lots of gems; one of them is a new project called Scylla Stress Orchestrator – https://github.com/scylladb/scylla-stress-orchestrator/ – which allows you to test Scylla with clients and monitoring using a single command line.
- Another, even competing, method is the CloudFormation-based container client setup that allows our cloud team to reach two million requests per second in a trivial way. Check out the blog post here, and the load test demo for Scylla Cloud on GitHub.
- The perf_simple_query benchmark now reports how many instructions were executed by the CPU per query. This is just a unit-benchmark but cool to track!
- The tarball installer now works correctly when SELinux is enabled.
- There is now rudimentary support for code-coverage reports in unit tests. (Coverage may not be the coolest kid on the block, but it is not cool not to test properly!)
Scylla Operator for Kubernetes News
Scylla Operator 1.2 was released with helm charts (find it on GitHub; plus read our blog and the Release Notes). Now 1.3 and 1.4 are in the making. In addition, our Kubernetes deployment can autoscale! An internal demonstration using https://github.com/scylladb/scylla-cluster-autoscaler was presented, and you are welcome to play with it.
Scylla Enterprise News
As you may have heard, we released Scylla Enterprise 2021.1, bringing two long-anticipated features:
- Space Amplification Goal (SAG) for Incremental Compaction Strategy (ICS) allows a user to strike the balance they desire between write amplification (which is more CPU- and I/O-intensive) and space amplification (which is more storage-intensive). SAG can push ICS volume consumption lower, towards the domain of leveled compaction, while you still enjoy size-tiered behavior. ICS is the default and recommended compaction strategy.
- New deployment options for Scylla Enterprise now include our Scylla Unified Installer, allowing you to install anywhere, including in air-gapped environments.
At ScyllaDB, we take security seriously. In this day and age it is vital to establish trust with any 3rd party handling your data, and reports of unsecured databases are still far too common in the news. With our customers’ vital needs foremost in our minds, we set to work ensuring that we met stringent goals established by the industry and confirmed by third-party auditors. ScyllaDB is proud to announce that we have successfully completed a System and Organization Controls 2 (SOC2) Type II audit for Scylla Cloud. This applies to all regions where Scylla Cloud can be deployed. (Learn more)
Hands-on Labs for Scylla in Your Browser
Scylla University now includes interactive labs that allow trainees to have an immediate hands-on experience with our products without requiring a local installation or setup. The labs use Katacoda to run a virtual terminal in your browser without the need to configure anything. They are embedded into the relevant Scylla University lessons, which makes them even more engaging.
Check out one of the labs below and try it yourself!
Currently, we have the following labs available:
- Quick Wins Lab: In this lab, you will see how to quickly start Scylla by running a single instance. You will then see how to run the CQL Shell and perform some basic CQL operations such as creating a table, inserting data, and reading it.
- High Availability Lab: This lab demonstrates, using a hands-on example, how Availability works in Scylla. You’ll try setting the Replication Factor and Consistency Levels in a three-node cluster and you’ll see how they affect read and write operations when all of the nodes in the cluster are up, and also when some of them are unavailable.
- Basic Data Modeling Lab: Data modeling is the process of identifying the entities in our domain, the relationships between these entities, and how they will be stored in the database. In this lab, you’ll learn some important terms such as Keyspace, Table, Column, Row, Primary Key, Partition Key, Compound Key, and Clustering Key. You’ll run different CQL queries to understand those terms better and get some hands-on experience with a live cluster.
Scylla University LIVE – Summer School (July 28th & 29th)
Following the success of our first Scylla University Live in April, we’re hosting another event in July! This time we’ll conduct these informative live sessions in two different time zones to better support our global community of users. July 28th training is scheduled for a time convenient in Europe and Asia, while July 29th will be the same sessions but better scheduled for users in North and South America.
A reminder, the Scylla University LIVE Summer School is a FREE, half-day, instructor-led training event, with training sessions from our top engineers and architects. It will include sessions that cover the basics and how to get started with Scylla, as well as more advanced topics and new features. Following the sessions, we will host a roundtable discussion where you’ll have the opportunity to talk with Scylla experts and network with other users.
We learned a lot from running our first Scylla University LIVE in April. This time we’re going to split the tutorial tracks based on level of expertise, so our Summer School tracks are divided between Scylla Essentials and Advanced topics.
- Getting Started with Scylla: From installation and configuration to queries and basic data modeling
- Advanced Data Modeling: Scylla shard-aware drivers, Materialized Views & Secondary Indexes, Lightweight Transactions, tips and best practices
- How to Create an App on Scylla: Getting the most out of your client
- Working with Kafka and Scylla: How to use the Kafka Scylla Connectors, both sink (consumer) and source (producer)
- Working with Spark and Scylla: Learn how to migrate and stream data into Scylla, or export data from Scylla for your analytics jobs
- Improving Your Applications Using Scylla Monitoring: Learn common pitfalls, prepared statements, batching, retries, and more
Stay tuned for more updates and details about this upcoming event!
Getting your database running at the right scale and speed to handle your growing business should not be complex or confusing. As I wrote in a previous blog post, capacity planning is not to be discounted and providing tools to streamline it is part of our mission.
To help you get started, we recently released our Scylla Cloud pricing and sizing calculator. This handy tool takes your peak workload requirements in terms of transactions per second and total data stored to recommend the proper sized cluster to run sustained workload.
But the real test is to see if the recommended cluster can actually achieve the desired performance numbers. Luckily with Scylla Cloud it’s easy enough to run a quick test with just a few clicks! Let’s check how the calculator fares for a nice round sustained workload of 1 million operations per second, using a balanced 50:50 read/write ratio.
Estimating with the Calculator
We put in 500,000 reads and 500,000 writes per second, stick with the average 1 KB item size, and say our data set will consume 10 terabytes of disk (before replication).
Given those inputs, the calculator recommends a cluster of 6 i3.16xlarge nodes, which is actually much more than is required for a workload as simple as the benchmark we are about to run. The calculator is designed to support a wide range of workloads and usage patterns and errs on the side of caution by necessity, but we will take its recommendation at face value and launch this cluster on Scylla Cloud.
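A quick sanity check on the storage side helps explain the recommendation. This is our own back-of-the-envelope arithmetic, not the calculator’s internals; the 15.2 TB NVMe figure is the published AWS spec for i3.16xlarge, and replication factor 3 is Scylla Cloud’s usual default:

```python
# Does 6 x i3.16xlarge comfortably hold 10 TB of data at RF=3?
NVME_PER_NODE_TB = 15.2   # i3.16xlarge: 8 x 1.9 TB NVMe drives
NODES = 6
RF = 3                    # typical replication factor
DATA_TB = 10              # data set size before replication

raw_capacity = NVME_PER_NODE_TB * NODES   # 91.2 TB raw across the cluster
replicated_data = DATA_TB * RF            # 30 TB on disk after replication
utilization = replicated_data / raw_capacity
print(f"{utilization:.0%} disk utilization")  # ~33%
```

Roughly a third of the disk is used, leaving headroom for compaction, which temporarily needs extra space while rewriting SSTables.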
Note that the calculator’s workload projections are based on supporting peak sustained performance. There may be issues where you have a spike in your workload, or certain operations (range scans, lightweight transactions, etc.) that require extra processing or round trips. So a benchmark may vary significantly from your actual production experience.
This cluster will run us ~$30K/month, which might seem a lot until we compare it with other cloud databases: DynamoDB would run $133K/month with reserved capacity and DataStax Astra would cost a whopping $690K/month.
Scylla Cloud Formation
To run this load test, I will use a pre-configured load test cluster with automated provisioning using CloudFormation stacks, which can be found in this repository. This makes running the load test as easy as providing AWS CloudFormation with a URL of the template, filling in a few parameters and hitting “create.” But first, being good cloud citizens, we must create a VPC for our load test workers and configure VPC peering with Scylla Cloud’s VPC to allow the workers to communicate securely with the database cluster. The repository contains two templates, one for the VPC and another for the load test itself. I have preloaded the CloudFormation templates on S3 to allow anyone to run them easily, so all I need to do is paste the URL in the AWS console:
After the stack has finished creating the VPC, I’ve executed the peering procedure (see the docs for more details) and I’m ready to run the load test! Again, I’m using a template from S3:
But this time I have a few parameters to put in:
The VPCStackName parameter allows the load test stack to automatically configure itself with the VPC, and the Scylla cluster parameters — a list of nodes and the password — are copy-pasted directly from the “Connect” tab of the Scylla Cloud cluster; the number of workers can be changed to create more load, but in this case I’ll go with the default of 2 writers and 2 readers.
The workers’ CloudFormation stack creates 2 autoscaling groups (one for reader nodes, one for writers) which contain EC2 instances pre-configured to run scylla-bench with a concurrency setting of 500 for each node. You can view the exact parameters in the CloudFormation template, as well as change them if you wish to run a different workload.
Less than a minute after launching the workers stack, the load test is automatically starting and you can switch to Scylla Monitoring Stack dashboards to see what’s going on:
1.8M op/sec with a load of 42%! Impressive!
This cluster is clearly capable of higher loads than the calculator specified — nearly 2M OPS. That’s not surprising — this is, after all, a cluster that has very little data at the moment. Sustained operations need to account for things like compactions, which manifest when more data accumulates.
In the meantime, let’s take a look at the latency:
The P99 read latency is about 400µs (that’s right — microseconds) while running almost 2M operations per second! Not too shabby. Write latency P99 is even lower, about 250µs — which makes sense as Scylla was designed for fast writes.
The calculator is handy to get a starting point for the size of the cluster you need, but by no means a final step — it is always wise to benchmark with a workload similar to what will run in production.
In the cloud it is easy to bootstrap clusters for tests, so benchmarking a few configurations is pretty straightforward. The hard part, of course, is understanding what the workload would be like and designing the benchmark. Even simple load tests like this one have value, at the very least showing what Scylla clusters can do in terms of raw performance.
I encourage you to run your own benchmarks, using the templates we have published or your own tools — it’s only a few clicks away!
At ScyllaDB we have found a common thread within our user base: the need for developers to create and maintain low-latency, high-performance, highly available applications that can readily, reliably scale to meet their ever-growing data demands. While Scylla can be part of this big data revolution, it is not the sole sum or extent of what is required to power modern heterogeneous data environments.
We believe there is a growing community of engineers, developers and devops practitioners who are each seeking to find or, more so, to create their own innovative solutions to the most intractable big data problems. So we want to bring them all together in community. That is what the P99 CONF is all about.
Dor Laor, ScyllaDB CEO, was the chief proponent behind creating the P99 CONF. “At ScyllaDB, we have always wanted to launch an event for the community of developers who work on high performance distributed applications. Building on the success of our user conferences we’re now glad to initiate a vendor-neutral industry event centered around low latency, high performance, high availability and real-time technologies – ranging from operating systems, CPUs, middleware, databases and observability.”
Glauber Costa, Staff Engineer at Datadog, concurs on the need for such an event: “The engineering community that is concerned with low latencies is pretty diverse. You have everything from operating systems primitives to application interfaces to algorithms. Not to mention server components like drives and network interfaces. It is important to bring these different aspects together so we learn from best practices across a wide range of experiences. So I think the P99 CONF is a great idea.”
All Things P99
P99 CONF is sort of a self-selecting audience title. If you know, you know. If you are asking yourself “Why P99?” or “What is a P99?” this show may not be for you. But for those who do not know but are still curious, P99 is a latency threshold, often expressed in milliseconds (or even microseconds), under which 99% of transactions complete. The remaining 1% of transactions represent the long-tail connections that can lead to retries, timeouts and frustrated users. Given that even a single modern complex web page may be created out of dozens of assets and multiple database calls, hitting a snag due to even one of those long-tail events is far more likely than most people think. This conference is for professionals of all stripes who are looking to bring their latencies down as low as possible for reliable, fast applications of all sorts — consumer apps, enterprise architectures, or the Internet of Things.
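For the curious, here is a toy illustration (entirely our own, with simulated numbers) of what a p99 measurement looks like and why the tail dwarfs the median:

```python
# What "p99" means: the latency below which 99% of requests complete,
# computed over a sample of request latencies.
import random

random.seed(7)
# Simulate 10,000 request latencies in ms: mostly fast, with a long tail.
latencies = [random.expovariate(1 / 5) for _ in range(10_000)]

def percentile(samples, p):
    """Return the value below which p percent of samples fall."""
    ordered = sorted(samples)
    index = min(len(ordered) - 1, int(len(ordered) * p / 100))
    return ordered[index]

p50 = percentile(latencies, 50)
p99 = percentile(latencies, 99)
print(f"p50 = {p50:.1f} ms, p99 = {p99:.1f} ms")
# The p99 is several times the median: the tail dominates worst-case UX.
```

When a page fans out to dozens of backend calls, the slowest call sets the page’s latency, which is why tail percentiles matter far more than averages.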
P99 CONF is the place where industry leading engineers can present novel approaches for solving complex problems, efficiently, at speed. There’s no other event like this. A conference by engineers for engineers. P99 CONF is for a highly technical developer audience, focused on technology, not products, and will be vendor / tool agnostic. Vendor pitches are not welcome.
Sign Up Today!
P99 CONF will be held free and online October 6-7, 2021. You can sign up today for updates and reminders as the event approaches.
If your title includes terms like “IT Architect,” “Software Engineer,” “Developer,” “SRE,” or “DevOps,” this is your tribe. Your boss is not invited.
Call For Speakers
For P99 CONF to truly succeed it must be driven by leaders and stakeholders in the developer community, and we welcome your response to our Call for Speakers.
Are you a developer, architect, DevOps practitioner or SRE with a novel approach to high performance and low latency? Perhaps you’re a production-oriented practitioner looking to share your best practices for service level management, or a new approach to system management or monitoring that helps shave valuable milliseconds off application performance. If so, we’d love to hear from you.
1. What type of content are you looking for? Is there a theme?
We’re looking for compelling technical sessions, novel algorithms, exciting optimizations, case studies, and best practices.
P99 CONF is designed to highlight the engineering challenges and creative solutions required for low-latency, high performance, high availability distributed computing applications. We’d like to share your expertise with a highly technical audience of industry professionals. Here are the categories we are looking for talks in:
- Development — Techniques in programming languages and operating systems
- Architecture — High performance distributed systems, design patterns and frameworks
- Performance — Capacity planning, benchmarking and performance testing
- DevOps — Observability & optimization to meet SLAs
- Use Cases — Low-latency applications in production and lessons learned
Example topics include…
- Operating systems techniques — eBPF, io_uring, XDP and friends
- Framework development — Most exciting new enhancements of golang, Rust and JVMs
- Containerization, virtualization and other animals
- Databases, streaming platforms, and distributed computing technologies
- CPUs, GPUs and 1Us
- Tools — Observability tools, debugging, tracing and whatever helps your night vision
- Storage methods, filesystems, block devices, object store
- Methods and processes — Capacity planning, auto scaling and SLO management
2. Does my presentation have to include p99s, specifically?
No. It’s the name of the conference, not a strict technical requirement.
You are free to look at average latencies, p95s or even p99.9s. Or you might not even be focused on latencies per se, but on handling massive data volumes, throughputs, I/O mechanisms, or other intricacies of high performance, high availability distributed systems.
3. Is previous speaking experience required?
Absolutely not, first time speakers are welcome and encouraged!
Have questions before submitting? Feel free to ask us at firstname.lastname@example.org.
4. If selected, what is the time commitment?
We are looking for 15-20 minute sessions, which will be pre-recorded, followed by live Q&A during the event. We expect the total time commitment to be 4-5 hours.
You’ll need to attend a 30 min speaker briefing and a 1 hour session recording appointment, all done virtually. Additionally, we ask that all speakers log onto the virtual conference platform 30 minutes before your session and join for live Q&A on the platform during and following your session. And, of course, all speakers are encouraged to attend the entire online conference to see the other sessions as well.
5. What language should the content be in?
While we hope to expand language support in the future, we ask that all content be in English.
6. What makes for a good submission?
There’s usually not one thing alone that makes for a good submission; it’s a combination of general worthiness of the subject, novelty of approach to a solution, usability/applicability to others, deep technical insights, and/or real-world lessons learned.
Session descriptions should be no more than 250 words long. We’ll look at submissions using these criteria:
Be authentic — Your peers want your personal experiences and examples drawn from real-world scenarios
Be catchy — Give your proposal a simple and straightforward but catchy title
Be interesting — Make sure the subject will be of interest to others; explain why people will want to attend and what they’ll take away from it
Be complete — Include as much detail about the presentation as possible
Don’t be “pitchy” — Keep proposals free of marketing and sales.
Be understandable — While you can certainly cite industry terms, try to write a jargon-free proposal that contains clear value for attendees
Be deliverable — Sessions have a fixed length, and you will not be able to cover everything. The best sessions are concise and focused.
Be specific — Overviews aren’t great in this format; the narrower and more specific your topic is, the deeper you can dive into it, giving the audience more to take home
Be cautious — Live demos sometimes don’t go as planned, so we don’t recommend them
Be memorable — Leave your audience with takeaways they can bring back to their own organizations and work. Give them something they’ll remember for a good long time.
We’ll reject submissions that contain material that is vague, inauthentic, or plagiarized.
No vendor pitches will be accepted.
7. If selected, can I get help on my content & what equipment do I need?
Absolutely, our content team is here to help.
We will want to touch base with you 1-2 times to answer any content questions you may have and offer graphic design assistance. This conference will be a virtual event and we’ll provide you with a mic and camera if needed (and lots of IT help along the way). Need something else? Let us know and we’ll do our best.
Scylla is pleased to announce the availability of Scylla Enterprise 2021, a production-ready Scylla Enterprise major release. After more than 2,500 commits originating from five open source releases, we’re excited to now move forward with Scylla Enterprise 2021. This release marks a significant milestone for us. While we’ve said for years we are a drop-in replacement for Apache Cassandra, we are now also a drop-in replacement for Amazon DynamoDB.
Scylla Enterprise builds on the proven features and capabilities of Scylla Open Source NoSQL Database and provides greater reliability from additional rigorous testing, as well as a set of unique enterprise-only features. Scylla Enterprise 2021 is immediately available for all Scylla Enterprise customers, and a 30-day trial version is available for those interested in testing its capabilities against their own use cases.
Space Amplification Goal (SAG) for ICS
Incremental Compaction Strategy (ICS) was introduced in Scylla Enterprise 2019.1.4. While this feature provides significant disk space savings, it comes at the cost of write amplification. A new feature of Scylla Enterprise 2021.1 is Space Amplification Goal (SAG), a new property that allows users to fine-tune the balance between space amplification and write amplification. This new feature is aimed at Scylla users with overwrite-intensive workloads, automatically triggering compaction to deduplicate data whenever the compaction strategy finds that space amplification has crossed the threshold set by the user for SAG. LEARN MORE
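As an illustrative sketch (the keyspace, table name, and threshold value here are hypothetical; the option name follows the Scylla documentation for ICS), SAG is set per table through the compaction options:

```cql
-- Hypothetical table; allow at most roughly 1.5x space amplification
-- before ICS triggers a deduplicating compaction.
ALTER TABLE ks.events
  WITH compaction = {
    'class' : 'IncrementalCompactionStrategy',
    'space_amplification_goal' : '1.5'
  };
```

Lower values trade more write amplification for tighter disk usage, so the right threshold depends on how overwrite-heavy the workload is.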
Remove the Seed Concept in Gossip
The concept of a seed and the different behavior between seed nodes and non-seed nodes generated a lot of confusion, complication, and errors for users. Starting with this release, seed nodes are ignored in the Gossip protocol. They are still in use (for now) as part of the init process. LEARN MORE
Binary Search in SSTable Promoted Index
Binary search dramatically improves index lookups, making them 12x faster while reducing CPU utilization to a tenth (10%) and disk I/O to a twentieth (5%) of prior rates. Before Scylla Enterprise 2021.1, lookups in the promoted index were done by scanning the index linearly, so a lookup took O(n) time. This is inefficient for large partitions, consuming a great deal of CPU and I/O. Now the reader searches the SSTable promoted index with a binary search, reducing lookup time to O(log n). LEARN MORE
Alternator, Our DynamoDB-Compatible API
Amazon DynamoDB users can more easily switch to Scylla, deploying their database on-premises or on any cloud of their choice. Scylla significantly reduces the total cost of ownership, delivers lower and more consistent latencies, and expands the limitations DynamoDB places on object size, partition size, etc. The following new features have been added to Alternator since the 2020.1 release:
- Alternator now offers a Load Balancer to distribute requests smoothly across the entire cluster, rather than directing all queries to a single IP address connected to one node.
- New Alternator SSL options (Alternator client-to-node encryption in transit). Beginning with this release, the new alternator_encryption_options parameter is used to define Alternator SSL options.
- New option to start Alternator with HTTPS on Docker containers to provide data-in-transit security.
- Added FilterExpression – provides a newer syntax for filtering results of Query and Scan requests.
- Now allows users to choose the isolation level per table. If they do not choose one, a default isolation level is used.
- Allow access to system tables from the Alternator REST API.
- Now supports the ScanIndexForward option of a query expression. By default, the query sort order is ascending; setting ScanIndexForward to false reverses the order. See the AWS DynamoDB Query API.
Scylla Unified Installer
Scylla is now available as an all-in-one binary tar file. The Unified Installer should be used when one does not have root privileges on the server. For installation on an air-gapped server with no external access, it is recommended to download the standard packages, copy them to the air-gapped server, and install using the standard package manager. LEARN MORE
We are constantly improving product security. Scylla Enterprise 2021.1 addresses the following issues:
- A new optional CQL port (19042 by default) is now open for Scylla advanced shard-aware drivers. It works exactly like the typical port 9042 works for existing drivers (connector libraries), but it allows the client to choose the specific shard to connect to by precise binding of the client-side (ephemeral) port. A TLS alternative is also supported, under port 19142. LEARN MORE
- Now allows users to disable the unencrypted CQL native transport by setting its port to zero.
- Scylla now supports hot reloading of SSL Certificates. If SSL/TLS support is enabled, then whenever the files are changed on disk, Scylla will reload them and use them for subsequent connections.
- GnuTLS vulnerability: GNUTLS-SA-2020-09-04 #7212
To get started with Scylla Enterprise 2021, either as an existing customer or for a 30-day trial, head to our Download Center. You can also check out our Documentation, including the Getting Started and Upgrade Guide sections, as well as our Release Notes. You can also sign up for Scylla Cloud, our fully managed Database-as-a-Service (DBaaS) which is based on Scylla Enterprise.
If you have questions about using Scylla Enterprise in your own organization, feel free to contact us to speak with our sales and solutions team members, or join our Scylla Slack community to communicate directly with our engineers and your industry peers.
Pekka Enberg has been working on Scylla since before its inception in 2014. Prior to joining ScyllaDB he worked on a variety of technologies ranging from High Frequency Trading (HFT) backends through the JVM runtime to kernels, and also web applications. Pekka is currently working on a PhD in computer science exploring new kernel abstractions and interfaces suitable for modern computer architectures.
Being fascinated by the evolution of hardware and software and the mismatch that sometimes happens between them — or rather, their intricate co-evolution — I couldn’t resist the opportunity to ask Pekka for his perspective on how kernel APIs and architecture are standing the test of time. After all, Linux is a 30 year old project; older if you trace it back to the UNIX API from the 1970s. Pekka’s unique experience on both sides of the API brings an interesting perspective which also sheds light on the rationale behind the Seastar framework and the Scylla database architecture.
Let’s dive in!
Avishai: Pekka, tell me a little bit about your PhD dissertation and what drove you to work on new kernel interfaces? What’s wrong with the ones we have?
Pekka: I am obviously not the first one to consider this topic, but I have a bit of a personal story behind it, related to Scylla. Some time in Spring of 2013, Avi Kivity (ScyllaDB CTO) approached me, and wanted me to talk about maybe joining his newly founded company. I knew Avi from the Linux kernel virtualization scene, and had met him at Linux conferences. When he told me they were building an operating system from scratch to make things run faster in virtualized environments, I was immediately sold. So we built OSv, an operating system kernel purpose built for hypervisor-based virtual machines (VMs), and integrated with the Java virtual machine (JVM) so that you could run any Java application tightly coupled with the kernel with improved performance — or at least that was the assumption.
I was performance testing Apache Cassandra running on top of OSv, which was one of the big application targets for us and the idea was that we would optimize the operating system layer and co-optimize it with the JVM — we had people that had extensive JVM background and worked on jRockit JVM. However, we discovered that we couldn’t get a lot of gains despite the obvious architectural edge of OSv because of the way the application was structured.
Apache Cassandra was built with this traditional large thread pool architecture (staged event-driven architecture). With this kind of architecture, you break work into multiple stages. For example, you receive a request in one stage, then do some parsing in another stage, and so on. Each stage can run on a different thread, and so you have these large thread pools to be able to take advantage of the parallelism in hardware. But what we saw in performance testing was that in some scenarios, Cassandra was spending a lot of its time waiting for locks to be released. Each stage handing over work to another had to synchronize and we could see locking really high on CPU profiles.
Around that time (late 2014) Avi started to work on Seastar, which is sort of an antithesis to the thread pool architecture. You have just one kernel thread running on a CPU, and try to partition application work and data — just as you would in a distributed system so that threads don’t share anything, and never synchronize with each other. So Seastar is thinking about the machine as a distributed thing rather than a shared-memory machine which was really typical of building multi-threaded services at the time.
But back to your question, why did the kernel become such a problem? We have a lot of different abstractions like threading in the kernel and they require crossing from user space to kernel space (context switching). This cost has gone up quite a bit, partly because of the deep security issues that were uncovered in CPUs [more here] but also relative to other hardware speeds which have changed over the years.
For example, the network is getting faster all the time and we have very fast storage devices, which is why the relative cost of context switching is much higher than it was before. The other thing is the kernel has the same synchronization problem we saw in user-level applications. The Linux kernel is a monolithic shared-memory kernel, which means that all CPUs share many of the same data structures, which they have to synchronize. Of course, there’s tons of work going on in making those synchronization primitives very efficient, but fundamentally you still have the same synchronization problem. Take the Linux networking stack for example: you have some packet arriving on the NIC, which causes an interrupt or a poll event. The event is handled on some CPU, but it’s not necessarily the same CPU that actually is going to handle the network protocol processing of the packet, and the application-level message the packet contains might also be handled on another CPU. So not only do you have to synchronize and lock moving data around, you also invalidate caches, do context switches, etc.
Avishai: Suppose you are running JVM on Linux, you have the networking stack which has a packet on one kernel thread that is supposed to be handled by a JVM thread but they don’t actually know about each other, is that correct? It sounds like the JVM scheduler is “fighting” the kernel scheduler for control and they don’t actually coordinate.
Pekka: Yes, that’s also a problem for sure.
This is a mismatch between what the application thinks it’s doing and what the kernel decides to do. A similar and perhaps more fundamental issue is the virtual memory abstraction: the application thinks it has some memory allocated for it, but unless you specifically tell the kernel to never take this memory away (using the mlock system call), then when you’re accessing some data structure it might not be in memory, triggering a page fault which may result in unpredictable performance. And while that page fault is being serviced, the application’s kernel thread is blocked, and there is no way for the application to know this might happen.
The Seastar framework attempts to solve this issue by basically taking control over the machine, bypassing many OS abstractions. So Seastar is not just about eliminating context switches, synchronizations and such, it’s also very much about control. Many people ask if the choice of C++ as the programming language is the reason why Scylla has a performance advantage over Cassandra. I think it is, but not because of the language, but because C++ provides more control.
The JVM generates really efficient code which can be as fast as C++ in most cases, but when it comes to control and predictability the JVM is more limited. Also, when Scylla processes a query it handles caching itself, as opposed to many other databases which use the kernel controlled page cache. All the caching in Scylla is controlled by Scylla itself, and so you know there’s some predictability in what’s going to happen. This translates into request processing latency which is very predictable in Scylla.
Avishai: You said that Seastar is not only about control, can you elaborate more on that?
Pekka: Seastar is built around an idea of how to program multi-core machines efficiently: avoiding coordination and not blocking kernel threads. Seastar has this future/promise model which allows you to write application code efficiently to take advantage of both concurrency and parallelism. A basic example: you write to the disk which is a blocking operation because there is some delay until the data hits whatever storage; the same for networking as well. For a threaded application which uses blocking I/O semantics you would have thread pools because this operation would block a thread for some time, so other threads can use the CPU in the meantime, and this switching work is managed by the kernel. With a thread-per-core model if a thread blocks that’s it — nothing can run on that CPU, so Seastar uses non-blocking I/O and a future/promise model which is basically a way to make it very efficient to switch to some other work. So Seastar is moving these concurrency interfaces or abstractions into user space where it’s much more efficient.
Going back to your question about why the operating system became such a problem, the kernel provides concurrency and parallelism by kernel threads but sometimes you have to block the thread for whatever reason, perhaps wait for some event to happen. Often the application actually has to consider the fact that making the thread sleep can be much more expensive than just burning the CPU a little bit — for example, polling. Making threads sleep and wake up takes time because there’s a lot of things that the kernel has to do when the thread blocks — crossing between kernel and user space, some updating and checking the thread data structure so that the CPU scheduler can run the thread on some other core, etc, so it becomes really expensive. There’s a delay in the wake up when that event happened. Maybe your I/O completed or a packet arrived. For whatever reason your application thread doesn’t run immediately and this can become a problem for low latency applications.
Avishai: So it’s all about dealing with waiting for data and running concurrent computations in the meantime on the CPU?
Pekka: Yes, and it’s a problem that was already encountered and solved in the 1950s. The basic problem was that the storage device was significantly slower than the CPU, and they wanted to improve throughput of the machine by doing something useful while waiting for I/O to complete. So they invented something called “multi-programming”, which is more or less what we know as multithreading today. And this is what the POSIX programming model is to applications: you have a process and this process can have multiple threads performing sequential work. You do some stuff in a thread and maybe some I/O, and you have to wait for that I/O before you can proceed with the computation.
But as we already discussed, this blocking model is expensive. Another issue is that hardware has changed over the decades. For example, not all memory accesses have equal cost because of something called NUMA (non-uniform memory access), but this isn’t really visible in the POSIX model. Also, system calls are quite expensive because of the crossing between kernel and user space. Today, you can dispatch I/O operation on a fast NVMe storage device in the time it takes to switch between two threads on the CPU. So whenever you block you probably missed an opportunity to do I/O, so that’s an issue. The question is: how do I efficiently take advantage of the fact that I/O is quite fast but there is still some delay and I want to do some useful work? You need to be able to switch tasks very fast and this is exactly what Seastar aims to do. Seastar eliminates the CPU crossing cost as much as possible and context switching costs and instead of using kernel threads for tasks we use continuation chains or coroutines.
Avishai: It sounds like Seastar is quite a deviation from the POSIX model?
Pekka: The POSIX model really is something that was born in the 1960s and 1970s, to a large extent. It’s a simple CPU-centric model designed around a single CPU that does sequential computation (also known as the von Neumann model), which is easy to reason about. But CPUs internally haven’t worked that way since the beginning of the 1980s or even earlier. So it’s just an abstraction that programmers use — kind of a big lie in a sense, right?
POSIX tells programmers that you have this CPU that can run processes, which can have threads. It’s still the sequential computation model, but with multiple threads you need to remember to do some synchronization. But how things actually get executed is something completely different, and how you can efficiently take advantage of these capabilities is also something completely different. All abstractions are a lie to some degree, but that’s the point — it would be very difficult to program these machines if you didn’t have abstractions that everybody knows about.
Seastar is a different kind of programming model, and now you see these types of programming frameworks much more frequently. For example, you have async/await in Rust, which is very similar as well. When we started doing this in 2014, it was all a little bit new and a weird way of thinking about the whole problem, at least to me. Of course, if you write some application that is not so performance sensitive and you don’t care about latency very much, POSIX is more than fine, although you’ll want to use something even more high level.
Avishai: Can’t we just make existing models faster? Like using user-space POSIX threads running on top of something like Seastar?
Pekka: So user-space based threading is not a new idea. In the 1990s, for example, you had the Solaris operating system do this. They had something called M:N scheduling, where you have N number of kernel threads, and then you have M number of user-level threads, which are time-multiplexed on the kernel threads. So you could have the kernel set up a kernel thread per core and then in user space you run hundreds of threads on top of those kernel threads, for example.
Seastar has the concept of a continuation, which is just a different incarnation of the same programming model. It’s just a different way of expressing the concurrency in your program. But yes, we could make thread context switching and maybe synchronization much faster in user space, but of course there are some additional problems that need to be solved too. There’s the issue of blocking system calls when doing I/O. There are known solutions to the problem, but they are not supported by Linux at the moment. In any case, this issue nicely ties back to my PhD topic: what kind of capabilities the OS should expose so you could implement POSIX abstractions in user space.
Avishai: I think right now most backend programmers are not actually familiar with the POSIX abstraction, POSIX is more popular with system level programmers. Backend developers are mostly familiar with the model that is presented by the runtime they use — JVM or Python or Golang, etc. — which is not exactly the same as POSIX. It raises an interesting question, especially now that we’re getting sandboxes with WebAssembly, perhaps we want to replace POSIX with a different model?
Pekka: So hopefully no kernel developer reads this, but I tend to think that POSIX is mostly obsolete… Tracing back the history of POSIX, the effort was really about providing a portable interface for applications so you could write an application once and run it on different kinds of machine architectures and operating systems. For example, you had AT&T UNIX, SunOS, and BSD, and Linux later. If you wrote an application in the 1980s or 1990s, you probably wrote it in C or C++, and then POSIX was very relevant for portability because of all these capabilities that it provided. But with the emergence of runtimes like the Java Virtual Machine (JVM) and more recently Node.js and others, I think it’s a fair question to ask how relevant POSIX is. But in any case all of these runtimes are still largely built on top of the POSIX abstraction, but the integration is not perfect, right?
Take the virtual memory abstraction as an example. With memory-mapped files (mmap), a virtual memory region is transparently backed by files or anonymous memory. But there’s this funny problem that if you use something like the Go programming language and you use its goroutines to express concurrency in your application — guess what happens if you need to take a page fault? The page fault screws up everything because it will basically stop the Go runtime scheduler, which is a user space thing.
Another interesting problem that shouldn’t happen in theory, but actually does, is file system calls that should not block but do. We use the asynchronous I/O interface of Linux, but it’s a known fact that some operations which by the specification of the interface should be non-blocking actually block, for implementation-specific reasons.
For example, we recommend the XFS file system for Scylla because it’s the best file system at implementing non-blocking operations. However, in some rare cases even with XFS, when you’re writing and the file system has to allocate a new block, you hit a code path that takes a lock. If you happen to have two threads doing that, one of them blocks. There’s a good blog post by Avi about the topic, actually.
Anyway, this is one of the reasons why Seastar attempts to bypass anything it can. Seastar tells the Linux kernel “give me all the memory and don’t touch it.” It has its own I/O scheduler, and so on. Avi has sometimes even referred to Seastar as an “operating system in user space,” and I think that’s a good way to think about it.
Avishai: It sounds like one of biggest problems here is that we have a lot of pre-existing libraries and runtimes that make a lot of assumptions about the underlying operating system, so if we change the abstraction to whatever, we basically create a new world and we would have to rewrite a lot of the libraries and runtimes.
Pekka: Yeah, when I started the work on my PhD I had this naive idea of throwing out POSIX. But when talking about my PhD work with Anil Madhavapeddy of MirageOS unikernel fame, he told me that he thought POSIX was a distraction, and we probably can never get rid of it. And although POSIX doesn’t matter as much as it once did, you have to consider that it’s not just the POSIX specification or the interfaces but all the underlying stuff like CPUs, which are highly optimized to run this type of sequential code.
A lot of work has been done to make this illusion of a shared memory system fast — and it’s amazingly fast. But for something like Scylla the question is can you squeeze more from the hardware by programming in a different way? I think that we might have reached a point where the cost of maintaining the old interfaces is too high because CPU performance is unlikely to get significantly better. The main reason being that CPU clock frequencies are not getting higher.
In the beginning of the 2000s, we were thinking that we’d have these 10 GHz CPUs soon, but of course that didn’t happen. By 2006 when we started to get multi-core chips, there was this thinking that soon we will have CPUs with hundreds of cores, but we actually don’t really have that today. We certainly have more cores, but not in the same order of magnitude as people thought we would.
But although CPU speeds aren’t growing as much, network speeds are insane already, with 40GbE NICs that are commodity at least in cloud environments, and 100GbE and 200GbE NICs on the horizon.
It’s also interesting to see what’s happening in the storage space. With something like Intel Optane devices, which can connect directly to your memory controller, you have persistent memory that’s almost the same speed as DRAM.
I’m not an expert on storage, so I don’t know how far the performance can be improved, but it’s a big change nonetheless. This puts the idea of a memory hierarchy under pressure. You have this idea that as you move closer to the CPU, storage becomes faster, but smaller. So you start from the L1 cache, then move to DRAM, and finally to storage. But now we’re seeing the storage layer closing in on the DRAM layer in terms of speed.
This brings us back to the virtual memory abstraction, which is about providing an illusion of a memory space that is as large as storage space. But what do you gain from this abstraction, now that you can have this persistent storage which is almost as fast as memory?
Also, we have distributed systems, so if you need more memory than fits in one machine, you can use more machines over the network. So I think we are at the point in time where the price of this legacy abstraction is high. But whether it is too high to justify rewriting the OS layer remains to be seen.
Want to Be Part of the Conversation?
If you want to contribute your own say on how the next generation of our NoSQL database gets designed, feel free to join our Slack community. Or, if you have the technical chops to contribute directly to the code base, check out the career opportunities at ScyllaDB.
The post An Interview with Pekka Enberg: Modern Hardware, Old APIs appeared first on ScyllaDB.