Books by Monster SCALE Summit Speakers: Data-Intensive Applications & Beyond

Monster Scale Summit speakers have amassed a rather impressive list of publications, including quite a few books. If you’ve seen the Monster Scale Summit agenda, you know that the stars have aligned nicely. In just two half days, from anywhere you like, you can learn from 60+ outstanding speakers exploring extreme scale engineering challenges from a variety of angles (distributed databases, real-time data, AI, Rust…). If you read the bios of our speakers, you’ll note that many have written books. This blog highlights 11 Monster Scale reads.

Once you register for the conference (it’s free + virtual), you’ll gain 30-day full access to the complete O’Reilly library (thanks to O’Reilly, a conference media sponsor). Manning Publications is also a media sponsor; they are offering the Monster SCALE community a nice 45% discount on all Manning books. Conference attendees who participate in the speaker chat will be eligible to win print books (courtesy of O’Reilly) and eBook bundles (courtesy of Manning).

See the agenda and register – it’s free

Platform Engineering: A Guide for Technical, Product, and People Leaders
By Camille Fournier and Ian Nowland
October 2024
Bookshop.org | Amazon | O’Reilly

Until recently, infrastructure was the backbone of organizations operating software they developed in-house. But now that cloud vendors run the computers, companies can finally bring the benefits of agile customer-centricity to their own developers. Adding product management to infrastructure organizations is now all the rage. But how’s that possible when infrastructure is still the operational layer of the company? This practical book guides engineers, managers, product managers, and leaders through the shifts required to become a modern platform-led organization. You’ll learn what platform engineering is (and isn’t) and what benefits and value it brings to developers and teams.
You’ll understand what it means to approach your platform as a product and learn some of the most common technical and managerial barriers to success. With this book, you’ll:

- Cultivate a platform-as-product, developer-centric mindset
- Learn what platform engineering teams are and are not
- Start the process of adopting platform engineering within your organization
- Discover what it takes to become a product manager for a platform team
- Understand the challenges that emerge when you scale platforms
- Automate processes and self-service infrastructure to speed development and improve developer experience
- Build out, hire, manage, and advocate for a platform team

Camille is presenting “What Engineering Leaders Get Wrong About Scale”

Designing Data-Intensive Applications, 2nd Edition
By Martin Kleppmann and Chris Riccomini
February 2026
Bookshop.org | Amazon | O’Reilly

Data is at the center of many challenges in system design today. Difficult issues such as scalability, consistency, reliability, efficiency, and maintainability need to be resolved. In addition, there’s an overwhelming variety of tools and analytical systems, including relational databases, NoSQL datastores, data warehouses, and data lakes. What are the right choices for your application? How do you make sense of all these buzzwords? In this second edition, authors Martin Kleppmann and Chris Riccomini build on the foundation laid in the acclaimed first edition, integrating new technologies and emerging trends. You’ll be guided through the maze of decisions and trade-offs involved in building a modern data system, from choosing the right tools like Spark and Flink to understanding the intricacies of data laws like the GDPR.
- Peer under the hood of the systems you already use, and learn to use them more effectively
- Make informed decisions by identifying the strengths and weaknesses of different tools
- Navigate the trade-offs around consistency, scalability, fault tolerance, and complexity
- Understand the distributed systems research upon which modern databases are built
- Peek behind the scenes of major online services, and learn from their architectures

Martin and Chris are featured in “Fireside Chat: Designing Data-Intensive Applications, Second Edition”

The Coder Cafe: 66 Timeless Concepts for Software Engineers
By Teiva Harsanyi
ETA Summer 2026
Manning (use code SCALE2026 for 45%)

Great software developers—even the proverbial greybeards—get a little better every day by adding knowledge and skills continuously. This new book invites you to share a cup of coffee with senior Google engineer Teiva Harsanyi as he shows you how to make your code more readable, use unit tests as documentation, reduce latency, navigate complex systems, and more. The Coder Cafe introduces 66 vital software engineering concepts that will upgrade your day-to-day practice, regardless of your skill level. You’ll find focused explanations—each five pages or less—on everything from foundational data structures to distributed architecture. These timeless concepts are the perfect way to turn your coffee break into a high-impact career boost.

- Generate property-based tests to expose hidden edge cases automatically
- Explore the CAP and PACELC theorems to balance consistency and availability trade-offs
- Design graceful-degradation strategies to keep systems usable under failure
- Leverage Bloom filters to perform fast, memory-efficient membership checks
- Cultivate lateral thinking to uncover unconventional solutions

Teiva is presenting “Working on Complex Systems”

Think Distributed Systems
By Dominik Tornow
August 2025
Bookshop.org | Amazon | Manning (use code SCALE2026 for 45%)

All modern software is distributed.
Let’s say that again—all modern software is distributed. Whether you’re building mobile utilities, microservices, or massive cloud native enterprise applications, creating efficient distributed systems requires you to think differently about failure, performance, network services, resource usage, latency, and much more. This clearly written book guides you into the mindset you’ll need to design, develop, and deploy scalable and reliable distributed systems. In Think Distributed Systems you’ll find a beautifully illustrated collection of mental models for:

- Correctness, scalability, and reliability
- Failure tolerance, detection, and mitigation
- Message processing
- Partitioning and replication
- Consensus

Dominik is presenting “Systems Engineering for Agentic Applications,” which is the topic of his upcoming book

Database Performance at Scale
By Felipe Cardeneti Mendes, Piotr Sarna, Pavel Emelyanov, and Cynthia Dunlop
September 2023
Amazon | ScyllaDB (free book)

Discover critical considerations and best practices for improving database performance based on what has worked, and failed, across thousands of teams and use cases in the field. This book provides practical guidance for understanding the database-related opportunities, trade-offs, and traps you might encounter while trying to optimize data-intensive applications for high throughput and low latency. Whether you’re building a new system from the ground up or trying to optimize an existing use case for increased demand, this book covers the essentials. The ultimate goal of the book is to help you discover new ways to optimize database performance for your team’s specific use cases, requirements, and expectations.
- Understand often overlooked factors that impact database performance at scale
- Recognize data-related performance and scalability challenges associated with your project
- Select a database architecture that’s suited to your workloads, use cases, and requirements
- Avoid common mistakes that could impede your long-term agility and growth
- Jumpstart teamwide adoption of best practices for optimizing database performance at scale

Felipe is presenting “The Engineering Behind ScyllaDB’s Efficiency” and will be hosting the ScyllaDB Lounge

Writing for Developers: Blogs That Get Read
By Piotr Sarna and Cynthia Dunlop
December 2025
Bookshop.org | Amazon | Manning (use code SCALE2026 for 45%)

Think about how many times you’ve read an engineering blog that’s sparked a new idea, demystified a technology, or saved you from going down a disastrous path. That’s the power of a well-crafted technical article! This practical guide shows you how to create content your fellow developers will love to read and share. Writing for Developers introduces seven popular “patterns” for modern engineering blogs—such as “The Bug Hunt”, “We Rewrote It in X”, and “How We Built It”—and helps you match these patterns with your ideas. The book covers the entire writing process, from brainstorming, planning, and revising, to promoting your blog and leveraging it into further opportunities. You’ll learn through detailed examples, methodical strategies, and a “punk rock DIY attitude” how to:

- Pinpoint topics that make intriguing posts
- Apply popular blog post design patterns
- Rapidly plan, draft, and optimize blog posts
- Make your content clearer and more convincing to technical readers
- Tap AI for revision while avoiding misuses and abuses
- Increase the impact of all your technical communications

This book features the work of numerous Monster Scale Summit speakers, including Joran Greef and Sanchay Javeria. There’s also a reference to Pat Helland in Bryan Cantrill’s foreword.
ScyllaDB in Action
By Bo Ingram
October 2024
Bookshop.org | Amazon | Manning (use code SCALE2026 for 45%) | ScyllaDB (free chapters)

ScyllaDB in Action is your guide to everything you need to know about ScyllaDB, from your very first queries to running it in a production environment. It starts you with the basics of creating, reading, and deleting data and expands your knowledge from there. You’ll soon have mastered everything you need to build, maintain, and run an effective and efficient database. This book teaches you ScyllaDB the best way—through hands-on examples. Dive into the node-based architecture of ScyllaDB to understand how its distributed systems work, how you can troubleshoot problems, and how you can constantly improve performance. You’ll learn how to:

- Read, write, and delete data in ScyllaDB
- Design database schemas for ScyllaDB
- Write performant queries against ScyllaDB
- Connect and query a ScyllaDB cluster from an application
- Configure, monitor, and operate ScyllaDB in production

Bo’s colleagues Ethan Donowitz and Peter French are presenting “How Discord Automates Database Operations at Scale”

Latency: Reduce Delay in Software Systems
By Pekka Enberg
Bookshop.org | Amazon | Manning (use code SCALE2026 for 45%)

Slow responses can kill good software. Whether it’s recovering microseconds lost while routing messages on a server or speeding up page loads that keep users waiting, finding and fixing latency can be a frustrating part of your work as a developer. This one-of-a-kind book shows you how to spot, understand, and respond to latency wherever it appears in your applications and infrastructure. This book balances theory with practical implementations, turning academic research into useful techniques you can apply to your projects.
In Latency you’ll learn:

- What latency is—and what it is not
- How to model and measure latency
- Organizing your application data for low latency
- Making your code run faster
- Hiding latency when you can’t reduce it

Pekka and his Turso teammates are frequent speakers at Monster Scale Summit’s sister conference, P99 CONF

100 Go Mistakes and How to Avoid Them
By Teiva Harsanyi
August 2022
Bookshop.org | Amazon | Manning (use code SCALE2026 for 45%)

100 Go Mistakes and How to Avoid Them puts a spotlight on common errors in Go code you might not even know you’re making. You’ll explore key areas of the language such as concurrency, testing, data structures, and more—and learn how to avoid and fix mistakes in your own projects. As you go, you’ll navigate the tricky bits of handling JSON data and HTTP services, discover best practices for Go code organization, and learn how to use slices efficiently. This book shows you how to:

- Dodge the most common mistakes made by Go developers
- Structure and organize your Go application
- Handle data and control structures efficiently
- Deal with errors in an idiomatic manner
- Improve your concurrency skills
- Optimize your code
- Make your application production-ready and improve testing quality

Teiva is presenting “Working on Complex Systems”

The Manager’s Path: A Guide for Tech Leaders Navigating Growth and Change
By Camille Fournier
May 2017
Bookshop.org | Amazon | O’Reilly

Managing people is difficult wherever you work. But in the tech industry, where management is also a technical discipline, the learning curve can be brutal—especially when there are few tools, texts, and frameworks to help you. In this practical guide, author Camille Fournier (tech lead turned CTO) takes you through each stage in the journey from engineer to technical manager. From mentoring interns to working with senior staff, you’ll get actionable advice for approaching various obstacles in your path.
This book is ideal whether you’re a new manager, a mentor, or a more experienced leader looking for fresh advice. Pick up this book and learn how to become a better manager and leader in your organization.

- Begin by exploring what you expect from a manager
- Understand what it takes to be a good mentor, and a good tech lead
- Learn how to manage individual members while remaining focused on the entire team
- Understand how to manage yourself and avoid common pitfalls that challenge many leaders
- Manage multiple teams and learn how to manage managers
- Learn how to build and bootstrap a unifying culture in teams

Camille is presenting “What Engineering Leaders Get Wrong About Scale”

The Missing README: A Guide for the New Software Engineer
By Chris Riccomini and Dmitriy Ryaboy
August 2021
Bookshop.org | Amazon | O’Reilly

For new software engineers, knowing how to program is only half the battle. You’ll quickly find that many of the skills and processes key to your success are not taught in any school or bootcamp. The Missing README fills in that gap—a distillation of workplace lessons, best practices, and engineering fundamentals that the authors have taught rookie developers at top companies for more than a decade. Early chapters explain what to expect when you begin your career at a company. The book’s middle section expands your technical education, teaching you how to work with existing codebases, address and prevent technical debt, write production-grade software, manage dependencies, test effectively, do code reviews, safely deploy software, design evolvable architectures, and handle incidents when you’re on-call. Additional chapters cover planning and interpersonal skills such as Agile planning, working effectively with your manager, and growing to senior levels and beyond.
You’ll learn:

- How to use the legacy code change algorithm, and leave code cleaner than you found it
- How to write operable code with logging, metrics, configuration, and defensive programming
- How to write deterministic tests, submit code reviews, and give feedback on other people’s code
- The technical design process, including experiments, problem definition, documentation, and collaboration
- What to do when you are on-call, and how to navigate production incidents
- Architectural techniques that make code change easier
- Agile development practices like sprint planning, stand-ups, and retrospectives

Chris is featured in “Fireside Chat: Designing Data-Intensive Applications, Second Edition” (with Martin Kleppmann)

Common Performance Pitfalls of Modern Storage I/O

Performance pitfalls when aiming for high-performance I/O on modern hardware and cloud platforms

Over the past two blog posts in this series, we’ve explored real-world performance investigations on storage-optimized instances with high-performance NVMe RAID arrays. Part 1 examined how the Seastar IO Scheduler’s fair-queue issues inadvertently throttled bandwidth to a fraction of the advertised instance capacity. Part 2 dove into the filesystem layer, revealing how XFS’s block size and alignment strategies could force read-modify-write operations and cut a big chunk out of write throughput.

This third and final part synthesizes those findings into a practical performance checklist. Whether you’re optimizing ScyllaDB, building your own database system, or simply trying to understand why your storage isn’t delivering the advertised performance, understanding these three interconnected layers – disk, filesystem, and application – is essential. Each layer has its own assumptions about what constitutes an optimal request. When these expectations misalign, the consequences cascade, amplifying latency and degrading throughput. This post presents a set of subtle pitfalls we’ve encountered, organized by layer. Each includes concrete examples from production investigations as well as actionable mitigation strategies.

Mind the Disk — Block Size and Alignment Matter

Physical and logical sector sizes

Modern SSDs, particularly high-performance NVMe drives, expose two critical properties: physical sector size and logical sector size. The logical sector size is what the operating system sees. It typically defaults to 512 bytes for backward compatibility, though many modern drives are also capable of reporting 4K. The physical sector size reflects how data is actually stored on the flash chips. It’s the size at which the drive delivers peak performance.
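On Linux, both values are exposed through sysfs. Below is a minimal Python sketch for reading them; the device name is an illustrative placeholder, and the 512-byte “suspicious firmware” check is our own rule of thumb rather than any standard heuristic:

```python
from pathlib import Path

def sector_sizes(dev: str) -> tuple[int, int]:
    """Return (logical, physical) sector sizes in bytes for a block
    device such as 'nvme0n1', via the standard sysfs queue attributes."""
    q = Path("/sys/block") / dev / "queue"
    logical = int((q / "logical_block_size").read_text())
    physical = int((q / "physical_block_size").read_text())
    return logical, physical

def firmware_looks_suspicious(logical: int, physical: int) -> bool:
    """Heuristic: a modern NVMe reporting a 512-byte physical sector
    is worth cross-checking with a benchmark before trusting it."""
    return logical == 512 and physical == 512
```

If the firmware report and your own benchmarks disagree, trust the benchmarks.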
When you submit a write request that doesn’t align to the physical sector size, the SSD controller must:

- Read the entire physical page containing your data
- Modify the portion you’re writing to
- Write the entire sector back to the empty/erased flash page

In our investigation of AWS i7i instances, we were bitten by exactly this problem. First, IOTune, our disk benchmarking tool, was using 512-byte requests to run the benchmarks because the firmware in these new NVMes was incorrectly reporting the physical sector size as 512 bytes when it was actually 4 KB. This made us measure disks as having up to 25% less read IOPS and up to 42% less write IOPS. The measurements were used to configure the IO Scheduler, so we ended up using the disks as if they were a less performant model. That’s a very good way for your business to lose money. 🙂

If you’re wondering why I only explained the reasoning for slow writes with requests unaligned to the physical sector size, it’s because we still don’t fully understand why read requests are also hit by this issue. It’s an open question and we’re still researching some leads (which I hope will get us an answer).

Sustained performance

If your software relies on a disk hitting a certain performance number, account for the fact that even dedicated provisioned NVMes have peak and sustained performance values. It’s well known that elastic storage (like EBS, for instance) has baseline performance and peak performance, but it’s less intuitive for dedicated NVMes to behave like this. Measuring disk performance with workloads of 10+ minutes might show 3-5% lower IOPS/throughput than a short burst; using those sustained numbers allows your software to better predict how the disk behaves under sustained load.

Mind the RAID

If you’re using RAID0 for NVMe arrays, be aware that your app’s parallel architecture might resonate with the way requests end up distributed over the RAID array. RAIDs are made out of chunks and stripes.
A chunk is a block of data within a single disk; when the chunk size is exceeded, the driver moves on to the next disk in the array. A stripe contains all the chunks, one from each disk, that will get written sequentially. A RAID0 with 2 disks will get written like this: stripe0: chunk0 (disk0), chunk1 (disk1); stripe1; stripe2… Filesystems usually align files to the RAID stripe boundary. Depending on the write pattern of your application, you could end up stressing some of the disks more than others, and not leveraging the entire power of the RAID array.

Key Takeaways for the Disk Layer

- Detection: Always verify physical and logical sector sizes if you suspect your issue might be related to this. Don’t blindly trust firmware-reported values; cross-check with benchmarking tools and adjust your app and filesystem to use the physical sector size if possible.
- Measurement discipline: Increase measurement time when benchmarking disks. Even dedicated NVMes can have baseline vs. peak performance.
- RAID awareness: A RAID array is made out of blocks, addresses, and drivers managing them. It’s not a magic endpoint that will simply amplify your N-drive array into N times the performance of a single disk. Its architecture has its own set of assumptions and limitations which, together with the filesystem’s own limitations and assumptions, might interfere with your app’s.

Mind the Filesystem — Block Size, Alignment, and Metadata Operations

Filesystem block size and request alignment

Every filesystem has its own block size, independent of the disk’s physical sector size. XFS, for instance, can be formatted with block sizes ranging from 512 bytes to 64 KB. In ScyllaDB, we used to format with 1K block sizes because we wanted the fastest commitlog writes >= 1K. For older SSDs, the physical sector size was 512 bytes. On modern 4K-optimized NVMe arrays, this choice became a liability. We realized that 4K block sizes would bring us lots of extra write throughput.
This filesystem-level block size affects two critical aspects: how data is stored and aligned on disk, and how metadata is laid out.

Here’s a concrete example. When Seastar issues 128 KB sequential writes to a file on 1K-block XFS, the filesystem doesn’t seem to align these to 4K boundaries (maybe because the SSD firmware reported a physical sector size of 512 bytes). Using blktrace to inspect the actual disk I/O, we observed that approximately 75% of requests aligned only to 1K or 2K boundaries. For these requests, the drive controller would split each into at most 3 parts: a head, a 4K-aligned core, and a tail. For the head and tail, the disk would perform a read-modify-write, which is very slow. That would become the dominating factor for the entire request (consisting of the 3 parts). Reformatting the filesystem with a 4K block size completely transformed the alignment distribution of requests: 100% of them aligned to 4K. This brought a lot of throughput back for us.

Filesystem metadata operations and blocking

Consider this: when a file grows, XFS must:

- Allocate extents from the free-space tree, requiring B-tree modifications and mutex locks
- Update inode metadata to reflect the new file size
- Flush metadata periodically to ensure durability
- Update access/change times (ctimes) on every write

Each of these operations can block subsequent I/O submissions. In our research, we discovered that the RWF_NOWAIT flag (requesting non-blocking async I/O submission) was insufficiently effective when metadata operations were queued. Writes would be re-submitted from worker threads rather than the Seastar reactor, adding context-switch overhead and latency spikes.

When the final size of a file is known, it is beneficial to pre-allocate or pre-truncate the file to that size using functions like fallocate() or ftruncate(). This practice dramatically improves the alignment distribution across the file and helps to amortize the overhead associated with extent allocation and metadata updates.
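As a sketch, preallocation looks like this in Python (ScyllaDB does the equivalent from C++ via Seastar; the function name and fallback policy here are illustrative):

```python
import os

def preallocate(path: str, final_size: int) -> None:
    """Reserve a file's final size up front so later appends don't pay
    per-write extent allocation and size-update costs."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o644)
    try:
        try:
            # Allocates real extents (fallocate(2) under the hood).
            os.posix_fallocate(fd, 0, final_size)
        except (AttributeError, OSError):
            # Fallback: truncation only sets the size, which is cheaper
            # but still improves alignment of subsequent writes.
            os.ftruncate(fd, final_size)
    finally:
        os.close(fd)
```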
While effective, fallocate() can be an expensive operation. It can potentially impact latency, especially if the allocation groups are already busy. Truncation is significantly cheaper; this alone can offer substantial benefits. Another helpful technique is using approaches like Seastar’s sloppy_size=true, where a file’s size is doubled via truncation whenever the current limit is reached.

Key Takeaways for the Filesystem Layer

- Format correctly: Format XFS (or other filesystems) with a block size matching your SSD’s physical sector size if possible. Most modern NVMe drives are 4K-optimized. Go lower only if there are strong restrictions – like an inability to read 4K-aligned, potential read amplification, or benchmarks showing better performance for your disk with a smaller request size.
- Pre-allocation: When file sizes are known, pre-truncate or pre-allocate files to their final size using `fallocate()` or truncation. This amortizes extent allocation overhead and ensures uniform alignment across the file.
- Metadata flushing: Understand the filesystem’s metadata update behavior. File size and access time updates are expensive. If you’re doing AIO, use RWF_NOWAIT if possible and make sure it works by tracing the calls with `strace`. Data deletion has also proved to have very expensive side effects: on older-generation NVMes, TRIM requests that accumulated in the filesystem would flush all at once, overloading the SSD controller and causing huge latency spikes for the apps running on that machine.

Mind the Application — Parallelism and Request Size Strategy

Parallelism tuning

Modern NVMe storage devices can handle thousands of requests in flight simultaneously. Factors like the internal queue depth and the number of outstanding requests the device can accept determine the maximum achievable bandwidth and latency when properly loaded. However, application-level concurrency (threads, fibers, async tasks) must be sufficient to keep these queues full.
Generally, the bandwidth vs. latency dependency has two regimes. If you measure latency and bandwidth while increasing the app’s parallelism, latency stays constant and bandwidth grows – as long as the internal disk parallelism is not saturated. Once the disk is throughput-loaded, bandwidth stops growing or grows very little, while latency scales almost linearly with the increase in load.

The relationship between throughput, latency, and queue depth follows Little’s Law:

Average Queue Depth = Throughput * Average Latency

For a device delivering 14.5 GB/s with 128K requests, the number of requests/second the device can handle is 14.5 GB/s divided by 128K – roughly 113k req/s. If an individual request’s latency is, for example, 1 ms, the device queue needs at least 113 outstanding requests to sustain 14.5 GB/s.

This has practical implications: if you tune your application for, say, 40 concurrent requests and later upgrade to a faster device, you’ll need to increase concurrency or you’ll under-utilize the new device. Conversely, if you over-provision concurrency, you risk queue saturation and latency spikes because the SSD controllers only have so much compute power themselves.

In ScyllaDB, concurrency is expressed as the number of shards (1 per CPU core) and fibers submitting I/O in parallel. Because we want the database to perform exceptionally even in mixed workloads, we use Seastar’s IO Scheduler to modulate the number of requests we send to storage. The Scheduler is configured with a latency goal that it needs to meet and with the IOPS/bandwidth the disk can handle at peak performance. In real workloads, it’s very difficult to match the load with a static execution model you’ve built yourself, even with thorough benchmarking and testing. Hardware performance varies, and read/write patterns are usually surprising.
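The Little’s Law arithmetic above, written out (numbers from the example; using binary 128 KiB requests gives roughly 111 rather than the rounder ~113):

```python
def required_queue_depth(bandwidth_bps: float, request_bytes: int,
                         avg_latency_s: float) -> float:
    """Little's Law: average queue depth = throughput (req/s) * latency."""
    requests_per_sec = bandwidth_bps / request_bytes
    return requests_per_sec * avg_latency_s

# A 14.5 GB/s device, 128 KiB requests, 1 ms average request latency:
depth = required_queue_depth(14.5e9, 128 * 1024, 1e-3)
print(round(depth))  # ~111 outstanding requests to sustain 14.5 GB/s
```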
Your best bet is to use an IO scheduler built to leverage the maximum potential of the hardware while still obeying a latency limit you’ve set.

Request size strategy

Many small I/O requests often fail to saturate an NVMe drive’s throughput because bandwidth is the product of IOPS and request size, and small requests hit an IOPS/latency/CPU ceiling before they hit the PCIe/NAND bandwidth ceiling. For a given request size, bandwidth is essentially IOPS x request size. So, for instance, 4K I/Os need extremely high IOPS to reach GB/s-class throughput. A concrete example: 350k IOPS (typical for an AWS i7i.4xlarge, for instance) at a 4K request size gets you ~1.3 GB/s (an i7i.4xlarge can do 3.5 GB/s easily). This looks slow in throughput terms even though the device might be doing exactly what the workload allows.

NVMe devices reach peak bandwidth when they have enough outstanding requests (queue depth) to keep the many internal flash channels busy. If your workload issues many small requests but mostly one at a time (or with only a couple in flight), the drive spends time waiting between completions instead of streaming data. As a result, bandwidth stays low.

Another important aspect is that with small I/O sizes, the fixed per-I/O costs (syscalls, scheduling, interrupts, copying, etc.) can dominate the latency. That means the kernel/filesystem path becomes the limiter even before the NVMe hardware does. This is one reason why real applications often can’t reproduce vendors’ peak numbers. In practice, you usually need high concurrency (multiple threads/fibers/jobs) and true async I/O to maintain a deep in-flight queue and approach device limits. But, to repeat one idea from the section above, also remember that the storage controller itself has limited compute capacity. Overloading it results in higher latencies for both read and write paths. It is important to find the right request size for your application’s workload.
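A quick sanity check of the bandwidth-ceiling math above (instance numbers as quoted; a sketch, not a benchmark):

```python
def bandwidth_bps(iops: float, request_bytes: int) -> float:
    """Achievable bandwidth is simply IOPS * request size."""
    return iops * request_bytes

# 350k IOPS at 4 KiB: the IOPS ceiling caps throughput far below
# the ~3.5 GB/s the same device can stream with larger requests.
print(bandwidth_bps(350_000, 4 * 1024))  # 1433600000 B/s (~1.3 GiB/s)

# With 128 KiB requests, saturating 3.5 GB/s needs only ~27k IOPS:
print(round(3.5e9 / (128 * 1024)))  # 26703
```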
Go as big as your latency expectations will allow. If your workload is not very predictable, keep in mind that you’ll most likely need some dynamic logic that adjusts the I/O patterns so you can squeeze every last drop of performance out of that expensive storage.

Key Takeaways for the Application Layer

- Parallelism tuning: Benchmark your specific hardware and workload to find the optimal fiber/thread/shard count. Watch the latency: as you increase parallelism, you’ll notice throughput increasing. At some point, throughput will plateau and latency will start to increase. That’s the sweet spot you’re looking for.
- Request size strategy: Picking the right request size is important. Seastar defaults to 128K for sequential writes, which works well on most modern storage (but validate it via benchmarking). If your device prefers larger requests for throughput, the cost is latency – so design your workload accordingly. In practice, we’ve never seen SSDs that cannot be saturated for throughput with 128K requests, and ScyllaDB can achieve sub-millisecond latencies with this request size.

Conclusion

The path from application buffer to persistent storage on modern hardware is complex. Often, performance issues are counterintuitive and very difficult to track down. A 1 KB filesystem block size doesn’t “save space” on 4K-optimized SSDs; it wastes throughput by forcing read-modify-write operations. A perfectly tuned IO Scheduler can still throttle requests if given incorrect disk properties. Sufficient parallelism doesn’t guarantee high throughput if request sizes are too small to fill device queues. By minding the disk, the filesystem, and the application – and by understanding how they interact – you can take full advantage of modern storage hardware and build systems that reliably deliver the performance the storage vendor advertised.

Kelsey Hightower’s Take on Engineering at Scale

At ScyllaDB’s Monster SCALE Summit 25, Hightower shared why facing scale the hard way is the right way

“There’s a saying I like: ‘Some people have good ideas. Some ideas have people.’ When your idea outlives you, that’s success.” – Kelsey Hightower

Kelsey Hightower’s career is a perfect example of that. His ideas have taken on a life of their own, extending far beyond his work at Puppet, CoreOS, and Google and his appearances at KubeCon. And he continues to scale his impact with his signature unscripted keynotes as well as the definitive book, “Kubernetes: Up and Running.” We were thrilled that Hightower joined ScyllaDB’s Monster SCALE Summit to share his experiences and advice on engineering at scale. And to scale his insights beyond those who joined us for the live keynote, we’re capturing some of the most memorable moments and takeaways here.

Note: Monster SCALE Summit 2026 will go live March 11-12, featuring antirez, creator of Redis; Camille Fournier, author of “The Manager’s Path” and “Platform Engineering”; Martin Kleppmann, author of “Designing Data-Intensive Applications”; and more than 50 others. The event is free and virtual; register and join the community for some lively chats.

Monster Scale Summit 2026 – Free + Virtual

Fail Before You Scale

The interview with host Tim Koopmans began with a pointed warning for attendees of a conference that’s all about scaling: Don’t just chase scale because you’re fascinated by others’ scaling strategies and achievements. You need to really experience some pain personally first. As Hightower put it: “A lot of people go see a conference talk — I’m probably guilty of this myself — and then try to ‘do scale things’ before they even have experience with what they’re doing. Back at Puppet Labs, lots of people wrote shell scripts with bad error handling. Things went awry when conditions weren’t perfect. Then they moved on to configuration management, and those who made that journey could understand the trade-offs.
Those who started directly with Puppet often didn’t.” “Be sure you have a reason,” he said. “So before you over-optimize for a problem you may not even have, you should ask: ‘How bad is it really? Where are the metrics proving that you need a more scalable solution?’ Sometimes you can do nothing and just wait for scaling to become the default option.” Ultimately, you should hope for the “good problem” where increasing demand causes you to hit the limits of your tech, he said. That’s much better than having few customers and over-engineering for problems you don’t have. Make the Best Choice You Can … For Now The conversation shifted to what level of scale teams should target with their initial tech stack and each subsequent iteration. Should you optimize for a future state that hasn’t happened yet? Play it safe in case the market changes? “If you’re not sure whether you’re on the right stack … I promise you, it’s going to change,” Hightower said. “Make the best choice you can for now. You can spend all year optimizing for ‘the best thing,’ but it may not be the best thing 10 years from now. Say you pick a database, go all in, learn best practices. But put a little footnote in your design doc: ‘Here’s how we’d change this.’ Estimate the switching cost. If you do that, you won’t get stuck in sunk cost fallacy later.” Rather than trying to predict the future, think about how to avoid getting trapped. You don’t want dependencies or extensions to limit your ability to migrate when it’s time to take a better (or just different) path. “Change isn’t failure,” he emphasized. “Plan for it; don’t fear it.” In Hightower’s view, scaling decisions should start on a whiteboard, not in code. “When I was at Google, we’d do technical whiteboard sessions. Draw a line — that’s time. ‘Today, we’re here. Our platform allows us to do these things. Is that good or bad?’ Then draw ahead: ‘Where do we want to be in two years?’” He continued, “Usually that’s driven by teams and customer needs.
You can’t do everything at once. So plot milestones — six months, a year, etc. You can push things out in time for when new libraries or tools arrive. If something new shows up that gets you two years ahead instantly, great. Having a timeline gives you freedom without guilt that you can’t ship everything today.” Are You Really Prepared For a 747? Following up on the Google thread, Koopmans asked, “I’d love to hear practical ways Google avoids over-engineering when designing for scale.” To illustrate why “Google-scale” solutions don’t always fit everyone else, Hightower used a memorable analogy: “I had a customer once say, ‘We want BigQuery on-prem.’ I said, ‘You do? Really? OK, how much money do you have?’ And it was one of those companies that had plenty of capital, so that wasn’t the issue. I told them, ‘That would be like going to the airport, looking out the window, seeing a brand-new 747 and telling the airline that you want that plane. Even if they let you buy it, you don’t have a pilot’s license, and you don’t know how to fuel it. Where are you going to park it? Are you going to drive it down your subdivision, decapitating the roofs of your neighbors’ houses?’ Some things just aren’t meant for everyone.” Ultimately, whether it’s over-engineering or not depends on the target user. Understand who they are, how they work and what tools they use, then build with that in mind. Hightower also warned against treating “best practices” as universal truths: “One question that most customers show up with is, ‘What are the best practices?’ Not necessarily the best practices for me. They just want to know what everyone else is doing.
I think that might be another anti-pattern in the mix, where you only care about what everyone else is doing and you don’t bring the necessary context for a good recommendation.” How Leaders Should Think About Dev Tooling “Serializing engineering culture” (Hightower’s phrase) like Google did with its google3 monorepo makes it simple for thousands of new engineers to join the team and start contributing almost instantly. During his tenure at Google, everything from gRPC to deployment tools was integrated. Engineers just opened a browser, added code and reviews would start automatically. However, there’s a fine line between serializing and stifling. Hightower believes that prohibiting engineers from even installing Python on their laptops, for example, is overkill: “That’s like telling Picasso he can’t use his favorite brush.” He continued: “Everyone works differently. As a leader, learn what tools people actually use and promote sharing. Have engineers show their workflows — the shortcuts, setups and plugins that make them productive. That’s where creativity and speed come from. Share the nuance. Most people think their tricks are too small to matter, but they do. I want to see your dotfiles! You’ll inspire others.” Watch the Complete Talk As Hightower noted, “Some people have good ideas. Some ideas have people.” His approach to scale – pragmatic, context-driven and human – shows why some ideas really do outlive the people who created them. You can see the full talk below. Fun fact: it was truly an unscripted interview – Hightower insisted! The team met him in the hotel lobby that morning, chatted a bit during a coffee run, prepped the camera angles … and suddenly Hightower and Koopmans were broadcasting to 20,000 attendees around the world.

Claude Code Marketplace Now Available

Claude Code has become an indispensable part of my daily workflow. I use it for everything from writing code to debugging production issues. But while Claude is incredibly capable out of the box, there are areas where injecting specialized domain knowledge makes it dramatically more useful.

That’s why I built a plugin marketplace. Yesterday I released rustyrazorblade/skills, a collection of Claude Code plugins that extend Claude with expert-level knowledge in specific domains. The first plugin is something I’ve been talking about doing for a while: a Cassandra expert.

The Deceptively Simple Act of Writing to Disk

Tracking down a mysterious write throughput degradation From a high-level perspective, writing a file seems like a trivial operation: open, write data, close. Modern programming languages abstract this task into simple, seemingly instantaneous function calls. However, beneath this thin veneer of simplicity lies a complex, multi-layered gauntlet of technical challenges, especially when dealing with large files and high-performance SSDs. For the uninitiated, the path from application buffer to persistent storage is fraught with performance pitfalls and unexpected challenges. If your goal is to master the art of writing large files efficiently on modern hardware, understanding all the details under the hood is essential. This article will walk you through a case study of fixing a throughput performance issue. We’ll get into the intricacies of high-performance disk I/O, exploring the essential technical questions and common oversights that can dramatically affect reliability, speed, and efficiency. It’s part 2 of a 3-part series. Read part 1 Read part 3 When lots of work leads to a performance regression If you haven’t yet read part 1 (When bigger instances don’t scale), now’s a great time to do so. It will help you understand the origin of the problem we’re focusing on here. TL;DR: In that blog post, we described how we managed to figure out why a new class of highly performant machines didn’t scale as expected when instance sizes increased. We discovered a few bugs in our Seastar IO Scheduler (stick around a bit, I’ll give a brief description of that below). That helped us measure scalable bandwidth numbers. At the time, we believed these new NVMes were inclined to perform better with 4K requests than with 512 byte requests. We later discovered that the latter issue was not related to the scheduler at all. We were actually chasing a firmware bug in the SSD controller itself. These disks do, in fact, perform better with 4K requests. 
What we initially thought was a problem in our measurement tool (IOTune) turned out to be something else entirely. IOTune wasn’t misdetecting the disk’s physical sector size (this is the request size at which a disk can achieve the best IOPS). Instead, the disk firmware was reporting it incorrectly: it reported 512 bytes when, in reality, it was 4K. We worked around the IOPS issue since the cloud provider wasn’t willing to fix the firmware bug due to backward-compatibility concerns. We also deployed the IO Scheduler fixes, and the disk models (io-properties) we measured with IOTune scaled nicely with the size of the instance. Still, in real workload tests, ScyllaDB didn’t like it. Performance results of some realistic workloads showed a write throughput degradation of around 10% on some instances provisioned with quite new and very fast SSDs. While this wasn’t much, it was alarming because we were expecting an improvement after the last series of fixes. These first two charts give us a good indication of how well ScyllaDB utilizes the disk. In short, we’re looking for both of them to be as stable as possible and as close to 100% as possible. The “I/O Group consumption” metric tracks the amount of shared capacity currently taken by in-flight operations from the group (reads, writes, compaction, etc.). It’s expressed as a percentage of the configured disk capacity. The “I/O Group Queue flow ratio” metric in ScyllaDB measures the balance between incoming I/O request rates and the dispatch rate from the I/O queue for a given I/O group. It should be as close as possible to 1.0, because requests must not be allowed to accumulate ahead of the disk. If it jumps up, it means one of two things. The reactor might be constantly falling behind and not kicking the I/O queue in a timely manner. Or, it can mean that the disk is slower than we told ScyllaDB it was – and the scheduler is overloading it with requests.
The spikes here indicate that the IO Scheduler doesn’t provide a very good QoS. That led us to believe the disk was overloaded with requests, so we ended up not saturating the throughput. The following throughput charts for commitlog, memtable, and compaction groups reinforce this claim. The 4 NVMe RAID array we were testing against was capable of around 14.5GB/s throughput. We expected that at any point in time during the test, the sum of the bandwidths for those three groups would get close to the configured disk capacity. Please note that according to the formula described in the section below on IO Scheduler, bandwidth and IOPS have a competing relationship. It’s not possible to reach the maximum configured bandwidth because that would leave you with 0 space for IOPS. The reverse holds true: You cannot reach the maximum IOPS because that would mean your request size got so low that you’re most likely not getting any bandwidth from the disk. At the end of the chart, we would expect an abrupt drop in throughput for the commitlog and memtable groups because the test ends, with the compaction group rapidly consuming most of the 14.5GB/s of disk throughput. That was indeed the case, except that these charts are very spiky as well. In many cases, summing up the bandwidth for the three groups shows that they consume around 12.5GB/s of the disk total throughput. The Seastar IO Scheduler Seastar uses an I/O scheduler to coordinate shards for maximizing the disk’s bandwidth while still preserving great IOPS. Basically, the scheduler lets Seastar saturate a disk and at the same time fit all requests within a latency goal configured on the library (usually 0.5ms). A detailed explanation of how the IO Scheduler works can be found in this blog post. But here’s a summary of where max bandwidth/IOPS values come from and where they go to within the ScyllaDB IO Scheduler. I believe it will connect the problem description above with the rest of the case study below. 
The IOTune tool is a disk benchmarking tool that ships with Seastar. When you run this tool on a disk, it will output 4 values corresponding to read/write IOPS and read/write bandwidth. These 4 values end up in a file called io-properties.yaml. When provided with these values, the Seastar IO Scheduler will build a model of your disk. This model then helps ScyllaDB maximize the drive’s performance. The IO Scheduler models the disk based on the IOPS and bandwidth properties using a formula which looks something like: read_bw/read_bw_max + write_bw/write_bw_max + read_iops/read_iops_max + write_iops/write_iops_max <= 1 The internal mechanics of how the IO Scheduler works are detailed in the blog post linked above. Peak and sustained throughput We observed some bursty behavior in the SSDs under test. It wasn’t much; sustained iops/bandwidth would be around 5% lower than the values measured by running the benchmark for the default 2 minutes. The iops/bandwidth values would start stabilizing at around 30 minutes, and that’s what we call the sustained io-properties. We thought our IOTune runs might have recorded peak disk io-properties (i.e., we ran the tool for too short a duration – the default is 120 seconds). With a realistic workload, we are actually testing the disks at their sustained throughput, so the IO Scheduler builds an inflated model of the disk and ends up sending more requests than the disk can actually handle. This would cause the overload we saw in the charts. We then tested with newly measured sustained io-properties (with the IOTune duration configured to run 30 minutes, then 60 minutes). However, there wasn’t any noticeable improvement in the throughput degradation problem… and the charts were still spiky. Disk saturation length The disk saturation length, as defined and measured by IOTune, is the smallest request length that’s needed to achieve the maximum disk throughput. All the disks that we’ve seen so far had a measured saturation length of 128K.
This means that it should be perfectly possible to achieve maximum throughput with 128K requests. We noticed something quite odd while running tests on these performant NVMes: the Seastar IOTune tool would report a saturation length of 1MB. We immediately panicked because there are a few important assumptions that rely on us being able to saturate disks with 128K requests. The issue matched the symptoms we were chasing. A disk model built with the assumption that saturation length is 1MB would trick the IO Scheduler into allowing a higher number of 128K requests (the length Seastar uses for sequential writes) than the disk controller can handle efficiently. In other words, the IO Scheduler would try to achieve the high throughput measured with 1MB request length, but using 128K requests. This would make the disk appear overloaded, as we saw in the charts. Assume you’re trying to reach the maximum throughput on a common disk with 4K requests, for instance. You won’t be able to do it. And since the throughput would stay below the maximum, the IO Scheduler would stuff more and more requests into the disk. It’s hoping to reach the maximum – but again, it won’t be reached. The side effect is that the IO Scheduler ends up overloading the disk controller with requests, increasing in-disk latencies for all the other processes trying to use it. As is typical when you’re navigating muddy waters like these, this turned out to be a false lead. We were stepping on some bugs in our measuring tools, IOTune and io_tester. IOTune was running with lower parallelism than those disks needed for saturation. And io_tester was measuring overwriting a file rather than writing to a pre-truncated new file. The saturation length of this type of disk was still 128k, like we had seen in the past. Fortunately, that meant we didn’t need to make potential architectural changes in Seastar in order to accommodate larger requests.
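To make the bandwidth/IOPS trade-off behind all of this concrete, here is a minimal sketch of the scheduler's capacity formula described earlier. The io-properties numbers below are invented for illustration; this is a toy model, not Seastar's actual implementation:

```python
def capacity_used(read_bw, write_bw, read_iops, write_iops,
                  read_bw_max, write_bw_max, read_iops_max, write_iops_max):
    """Fraction of the modeled disk capacity taken by in-flight work.

    Mirrors the scheduler formula:
      read_bw/read_bw_max + write_bw/write_bw_max
        + read_iops/read_iops_max + write_iops/write_iops_max <= 1
    """
    return (read_bw / read_bw_max + write_bw / write_bw_max
            + read_iops / read_iops_max + write_iops / write_iops_max)

# Invented io-properties for a hypothetical NVMe RAID:
props = dict(read_bw_max=14.5e9, write_bw_max=14.5e9,
             read_iops_max=1_000_000, write_iops_max=800_000)

# A pure write stream at half the maximum bandwidth, issued as 128K requests:
write_bw = 7.25e9
write_iops = write_bw / (128 * 1024)
used = capacity_used(0, write_bw, 0, write_iops, **props)
# The scheduler admits more requests only while this sum stays <= 1, which is
# why maxing out bandwidth and maxing out IOPS are mutually exclusive.
```

Note how the bandwidth term alone already consumes half the budget here, and the IOPS term eats further into it: driving either term to 1 leaves zero headroom for the other, exactly the competing relationship described above.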
A nice observation we can make here, based on the tests we ran trying to prove or disprove this theory, is that extent allocation is a rather slow process. If the allocation group is already quite busy (a few big files already exist under the same directory, for instance), the effect on throughput when appending and extending a file is quite dramatic. Internal parallelism Another interesting finding was that the disk/filesystem seemed to suffer from internal parallelism problems. We ran io_tester with 128k requests, 8 fibers per shard, with 8 shards and 64 shards. The results were very odd. The expected bandwidth was ~12.7GB/s, but we were confused to see it drop when we increased the number of shards. Generally, the bandwidth vs. latency dependency has two regimes. If you measure latency and bandwidth while you increase the app parallelism, the latency is constant and the bandwidth grows as long as the internal disk parallelism is not saturated. Once the disk is throughput-loaded, the bandwidth stops growing or grows very little, while the latency scales almost linearly with the increase of the input. However, in the table above, we see something different. When the disk is overloaded, latency increases (because it holds more requests internally), but the bandwidth drops at the same time. This might explain why IOTune measured bandwidths around 12GB/s, while under the real workload (capped with IOTune-measured io-properties.yaml), the disk behaved as if it was overloaded (i.e., high latency and the actual bandwidth below the maximum bandwidth). When IOTune measures the disk, shards load the disk one-by-one. However, the real test workload sends parallel requests from all shards at once. The storage device was a RAID0 with 4 NVMes and a RAID chunk size of 1MB. The theory here was that since each shard writes via 8 fibers of 128k requests, it’s possible that many shards could end up writing to the same disk in the array.
The explanation is that XFS aligns files on 1MB boundaries. If all shards start at the same file offset and move at relatively the same speed, the shards end up picking the same drive from the array. That means we might not be measuring the full throughput of the RAID array. The measurements, however, confirm that the shards do not consistently favor the same drive. Single disk throughput was measured at 3.220 GB/s, while the entire 4-disk array achieved a throughput of 10.9 GB/s. If they were picking the same disk all the time, the throughput of the entire 4-disk array would’ve been equal to that of a single disk (i.e., 3.2GB/s). This lead ended up being a dead end. We tried to prove it in a simulator, but all requests ended up shuffled evenly between the disks. Sometimes, interesting theories that you bet can explain certain effects just don’t hold true in practice. In this case, something else sits at the base of this issue. XFS formatting Although the previous lead didn’t get us very far, it opened a very nice door. We noticed that the throughput drop is reproducible on 64 shards (rather than 8 shards) on a RAID mounted with the scylla_raid_setup script (a ScyllaDB script which does preparation work on a new machine, e.g., formats the filesystem, sets up the RAID array) – but not on a raw block device, and not on a RAID created with default parameters. Comparing the mkfs.xfs commands being run, we spotted a few differences. In the table below, notice how the XFS parameters differ between default-mounted XFS and XFS mounted by scylla_raid_setup. The 1K vs 4K data.bsize difference stands out. We also spotted – and this is an important observation – that truncating the test files to some large size seems to bring back the stolen throughput. The results coming from this observation are extremely surprising. Keep reading to see how this leads us to actually figuring out the root cause of the problem.
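The truncation observation above boils down to the difference between two write patterns. Here is a simplified sketch of both (paths and sizes are illustrative; this models the idea, not io_tester's actual code):

```python
import os

def write_preallocated(path, chunks, final_size):
    """Truncate the file to its final size up front, then fill it in place.

    This lets the filesystem reserve extents for the whole file at once
    instead of extending it allocation by allocation on every append.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o644)
    try:
        os.ftruncate(fd, final_size)      # declare the final size up front
        offset = 0
        for chunk in chunks:
            os.pwrite(fd, chunk, offset)  # positional write, no file extension
            offset += len(chunk)
    finally:
        os.close(fd)

def write_appending(path, chunks):
    """The append-and-extend pattern that triggered the slow path."""
    with open(path, "ab") as f:
        for chunk in chunks:
            f.write(chunk)                # every write may extend the file
```

Both functions produce byte-identical files; the difference is only in when the filesystem learns the file's size, which is exactly what the throughput table below is probing.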
The table below shows the throughput in MB/s when running tests on files that are being appended and extended versus files that are pre-truncated to their final size (both cases were run on XFS mounted with the Scylla script). We experimented with Seastar’s sloppy_size=true file option, which doubles the file’s size (via truncate) every time it sees an access past the current size of the file. However, while it improved the throughput numbers, it unfortunately still left half of the missing throughput on the table. RWF_NOWAIT and non-blocking IO The first lead that we got from here came from running our tests under strace. Apparently, all of our writes would get re-submitted out of the Seastar Reactor thread to XFS threads. Seastar uses RWF_NOWAIT to attempt non-blocking AIO submissions directly from the Seastar Reactor thread. It sets aio_rw_flags=RWF_NOWAIT on IOCBs during initial io_submit calls from the Reactor backend. On modern kernels, this flag requests immediate failure (EAGAIN) if the submission may block, preserving reactor responsiveness. On io_submit failure with RWF_NOWAIT, Seastar clears the flag and queues the IOCBs for retrying. The thread pool then executes the retry submissions without the RWF_NOWAIT flag, as these can tolerate blocking. We thought that avoiding these retries would increase throughput. Unfortunately, disabling the flag didn’t actually do that. The cause for the throughput drop is uncovered next. As for the RWF_NOWAIT issue, it’s still unclear why it doesn’t affect throughput. However, the fix was a kernel patch by our colleague Pavel Emelyanov which fiddles with inode timestamp updates when IOCBs carry the RWF_NOWAIT flag. More details on this would definitely exceed the scope of this blog post. blktrace and offset alignment distribution Returning to our throughput performance issue, we started running io_tester experiments with blktrace on, and we noticed something strange.
For around 25% of the requests, io_tester was submitting 128k requests and XFS would queue them as 256-sector requests (256 sectors x 512 bytes reported physical sector size equals 128k). However, it would split the requests and complete them in 2 parts. (Note the Q label on the first line of the output below; this indicates the request was queued.) The first part of the request would finish in 161 microseconds, while the second part would finish in 5249 microseconds. This dragged down the latency of the whole request to 5249 microseconds (the 4th column in the table is the timestamp of the event in seconds; the latency of a request is max(ts_completed – ts_queued)). The remaining 75% of requests were queued and completed in one go as 256-sector requests. They were also quite fast: 52 microseconds, as shown below. The explanation for the split is related to how 128k requests hit the file, given that XFS lays it out on disk considering a 1MB RAID chunk size. The split point occurs at address 21057335224 + 72, which translates to hex 0x9CE3AE00000. This reveals it is, in fact, a multiple of 0x100000 – the 1MB RAID chunk boundary. We can discuss optimizations for this, but that’s outside the scope of this article. Unfortunately, it was also out of scope for our throughput issue. However, here are some interesting charts showing what the request offset alignment distribution looks like based on the blktrace events we collected:
  • default formatted XFS with 4K block size
  • scylla_raid_setup XFS: 1K block
  • scylla_raid_setup XFS: 1K block + truncation
This result is significant! For XFS formatted with a 4K block size, most requests are 4K aligned. For XFS formatted with scylla_raid_setup (1K block size), most requests are 1K or 2K aligned. For XFS formatted with scylla_raid_setup (1K block size) and with test files truncated to their final size, all requests are 64K aligned (although in some cases we also saw them being 4k aligned).
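Alignment distributions like the ones in those charts can be derived from blktrace queue events with a few lines of Python. The sector numbers below are made up for illustration (blktrace reports offsets in 512-byte sectors):

```python
SECTOR = 512  # blktrace reports request offsets in 512-byte sectors

def alignment(offset_bytes):
    """Largest power of two (in bytes) that divides the request offset."""
    return offset_bytes & -offset_bytes  # isolate the lowest set bit

def alignment_histogram(queued_sectors):
    """Bucket queued request offsets by alignment, as in the charts above."""
    hist = {}
    for sector in queued_sectors:
        a = alignment(sector * SECTOR)
        hist[a] = hist.get(a, 0) + 1
    return hist

# Illustrative sample: two 1MB-aligned offsets and two 1K-aligned offsets
hist = alignment_histogram([2048, 6144, 2, 6])
# hist == {1048576: 2, 1024: 2}
```

A histogram dominated by small buckets (512, 1K, 2K) is exactly the signature of the unaligned-IO problem described in this section.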
It turns out that XFS lays out files on disk very differently when the file size is known in advance compared to the case when the file is grown on demand. The latter results in IO unaligned to the disk block size. Punchline Now here comes the explanation of the problem we’ve been chasing since the beginning. In the first part of the article, we saw that doing random writes with 1K requests produces worse IOPS than with 4K requests on these 4K-optimized NVMes. This happens because when executing the 1K request, the disk needs to perform a Read-Modify-Write (RMW) to land the data into chips. When we submit 128k requests (as ScyllaDB does) that are 1K or 2K aligned (see the alignment distributions in the charts above), the disk is forced to do RMW on the head and tail of the requests. This slows down all the requests (unrelated, but similar in concept to the RAID chunk alignment split we’ve seen above). Individually, the slowdown is probably tiny. But since most requests are 1k and 2k aligned on XFS formatted with a 1k block size (no truncation), the throughput hit is quite significant. It’s very interesting to note that, as shown in the last chart above, truncation also improved the alignment distribution quite significantly, and also improved throughput. It also appeared to significantly shorten the list of extents created for our test files. For ScyllaDB, the right solution was to format XFS with a 4K block size. Truncating to the final size of the file wasn’t really an option for us because we can’t predict how big an SSTable will grow. Since sloppy_size’ing the files didn’t provide great results, we agreed that 4K-formatted XFS was the way to go. The throughput degradation we got using the higher io-properties numbers seems to be solved. We initially expected to see “improved performance” compared to the original low io-properties case (i.e., a higher measured write throughput). The success wasn’t obvious, though. It was rather hidden within the dashboards, as shown below.
Here’s what the disk utilization and IO flow ratio charts look like. The disk is fully utilized and clearly not overloaded anymore. And here are the memtable and commitlog charts. These look very similar to the charts we got with the initial low io-properties numbers from the “When bigger instances don’t scale” article. Most likely, this means that this is simply all the test can do. The good news was hidden here. While the test went full speed ahead, the compaction job filled all the available throughput, from ~10GB/s (when commitlog+memtable were running at 4.5GB/s) to 14.5GB/s (when the commitlog and memtable flush processes were done). The only thing left to check was whether a filesystem formatted with the 4K block size would cause read amplification on older disks with a 512-byte physical sector size. It turns out it didn’t. We were able to achieve similar IOPS on a RAID with 4 NVMes of the older type. Next up: Part 3, Common performance pitfalls of modern storage I/O.

Inside the 2026 Monster Scale Summit Agenda

Monster Scale Summit is all about extreme scale engineering and data-intensive applications. So here’s a big announcement: the agenda is now available! Attendees can join 50+ tech talks, including:
  • Keynotes by antirez, Camille Fournier, Pat Helland, Joran Greef, Thea Aarrestad, Dor Laor, and Avi Kivity
  • Martin Kleppmann & Chris Riccomini chatting about the second edition of Designing Data-Intensive Applications
  • Tales of extreme scale engineering – Rivian, Pinterest, LinkedIn, Nextdoor, Uber Eats, Google, Los Alamos National Labs, CERN, and AmEx
  • ScyllaDB customer perspectives – Discord, Disney, Freshworks, ShareChat, SAS, Sprig, MoEngage, Meesho, Tiket, and Zscaler
  • Database engineering – Inside looks at ScyllaDB, turbopuffer, Redis, ClickHouse, DBOS, MongoDB, DuckDB, and TigerBeetle
  • What’s new/next for ScyllaDB – Vector search, tablets, tiered storage, data consistency, incremental repair, Rust-based drivers, and more
Like other ScyllaDB-hosted conferences (e.g., P99 CONF), the event will be free and virtual so everyone can participate. Take a look, register, and start choosing your own adventure across the multiple tracks of tech talks. Full Agenda Register [free + virtual] When you join us March 11 and 12, you can…
  • Chat directly with speakers and connect with ~20K of your peers
  • Participate in interactive trainings on topics like real-time AI, database performance at scale, high availability, and cloud cost optimization strategies
  • Pick the brains of ScyllaDB engineers and architects, who are available to answer your toughest database performance questions
  • Win conference swag, sea monster plushies, book bundles, and other cool giveaways
Details, Details The agenda site has all the scheduling, abstracts, and speaker details. Please note that times are shown in your local time zone. Be sure to scroll down into the Instant Access section. This is one of the best parts of Monster SCALE Summit.
You can access these sessions from the minute the event platform opens until the conference wraps. Some teams have shared that they use Instant Access to build their own watch parties beyond the live conference hours. If you do this, please share photos! Another important detail: books. Quite a few of our speakers are gluttons for punishment, er, accomplished book authors:
  • Camille Fournier – Platform Engineering, The Manager’s Path
  • Martin Kleppmann and Chris Riccomini – Designing Data-Intensive Applications 2E
  • Chris Riccomini – The Missing README
  • Dominik Tornow – Think Distributed Systems
  • Teiva Harsanyi – 100 Go Mistakes and How to Avoid Them, The Coder Cafe
  • Felipe Cardeneti Mendes – Database Performance at Scale
We will have book giveaways throughout the event, so be sure to attend live. All registrants get 30-day access to the complete O’Reilly platform, which includes all O’Reilly, Manning, and many other tech books. This includes the shiny new second edition of Designing Data-Intensive Applications, which publishes this month. Perfect timing… Register now – it’s free

Apache Cassandra® 5.0: Improving performance with Unified Compaction Strategy

Introduction

Unified Compaction Strategy (UCS), introduced in Apache Cassandra 5.0, is a versatile compaction framework that not only unifies the benefits of Size-Tiered (STCS) and Leveled (LCS) Compaction Strategies, but also introduces new capabilities like shard parallelism, density-aware SSTable organization, and safer incremental compaction, all of which deliver more predictable performance at scale. By utilizing a flexible scaling model, UCS allows operators to tune compaction behavior to match evolving workloads, spanning from write-heavy to read-heavy, without requiring disruptive strategy migrations in most cases.

In the past, operators had to choose between rigid strategies and accept significant trade-offs. UCS changes this paradigm, allowing the system to efficiently adapt to changing workloads with tuneable configurations that can be altered mid-flight and even applied differently across different compaction levels based on data density.

Why compaction matters

Compaction is the critical process that determines a cluster’s long-term health and cost-efficiency. When executed correctly, it produces denser nodes with highly organized SSTables, allowing each server to store more data without sacrificing speed. This efficiency translates to a smaller infrastructure footprint, which can lower cloud costs and resource usage.

Conversely, inefficient compaction is a primary driver of performance degradation. Poorly managed SSTables lead to fragmented data, forcing the system to work harder for every request. This overhead consumes excessive CPU and I/O, often forcing teams to try adding more nodes (horizontal scale) just to keep up with background maintenance noise.

Key concepts and terminology

To understand how UCS optimizes a cluster, it is necessary to understand the fundamental trade-offs it balances:

  • Read amplification: Occurs when the database must consult multiple SSTables to answer a single query. High read amplification acts as a “latency tax,” forcing extra I/O to reconcile data fragments.
  • Write amplification: A metric that quantifies the overhead of background processes (such as compactions). It represents the ratio between total data written to disk and the amount of data originally sent by an application. High write amplification wears out SSDs and steals throughput.
  • Space amplification: The ratio of disk space used to the actual size of the “live” data. It tracks data such as tombstones or overwritten rows that haven’t been purged yet.
  • Fan factor: The “growth dial” for the cluster data hierarchy. It defines how many files of a similar size must accumulate before they are merged into a larger tier.
  • Sharding: UCS splits data into smaller, independent token ranges (shards), allowing the system to run multiple compactions in parallel across CPU cores.
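
As a rough illustration of how the amplification metrics are computed, here is a short Python sketch. The numbers are made up purely to make the definitions above concrete:

```python
# Hypothetical numbers, used only to make the definitions above concrete.
app_writes_gb = 100.0    # data the application actually sent to the cluster
disk_writes_gb = 450.0   # memtable flushes plus every compaction rewrite

# Write amplification: bytes written to disk per byte the application sent.
write_amplification = disk_writes_gb / app_writes_gb

live_data_gb = 200.0     # the current logical ("live") data set
disk_used_gb = 260.0     # includes tombstones and overwritten rows not yet purged

# Space amplification: disk footprint relative to the live data.
space_amplification = disk_used_gb / live_data_gb

print(write_amplification)   # 4.5: every app byte cost 4.5 bytes of disk writes
print(space_amplification)   # 1.3: roughly 30% extra disk holds obsolete data
```

Read amplification has no single formula here; it is simply the number of SSTables consulted per query, which compaction keeps low by merging overlapping files.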
Performance gains by design

UCS provides baseline architectural improvements that were not available in older strategies:

Improved compaction parallelism

Older strategies often got stuck on a single thread during large merges. UCS sharding allows a server to use its full processing power. This significantly reduces the likelihood of compaction storms and keeps tail latencies (p99) predictable.

Reduced disk space amplification

Because UCS operates on smaller shards, it doesn’t need to double the entire disk space of a node to perform a major merge. This greatly reduces the risk of nodes running out of space during heavy maintenance cycles.

Density-based SSTable organization

UCS measures SSTables by density (token range coverage). This mitigates the huge SSTable problem where a single massive file becomes too large to compact, hindering read performance indefinitely.

Scaling parameter

The scaling parameter (denoted as W) is a configurable setting that determines the size ratio between compaction tiers. It helps balance write amplification and read performance by controlling how much data is rewritten during compaction operations. A lower scaling parameter value results in more frequent, smaller compactions, whereas a higher value leads to larger compaction groups.
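
To make the notation concrete, here is a small Python sketch that maps one scaling parameter to its mode and fan factor. It assumes the convention described in the Cassandra UCS documentation (positive W behaves tiered, negative W behaves leveled, with Tn/Ln shorthand for a fan factor of n); it is an illustration, not Cassandra’s actual parser:

```python
def parse_scaling_parameter(p: str):
    """Map one UCS scaling parameter to (mode, fan_factor).

    Sketch of the convention in the Cassandra UCS docs: positive W is
    tiered (Tn, W = n - 2), negative W is leveled (Ln, W = 2 - n), and
    W = 0 ("N", equivalently T2 or L2) is the middle ground.
    """
    p = p.strip()
    if p == "N":
        return ("middle", 2)
    if p[0] == "T":
        return ("tiered", int(p[1:]))
    if p[0] == "L":
        return ("leveled", int(p[1:]))
    w = int(p)  # a bare integer is W itself
    if w > 0:
        return ("tiered", w + 2)
    if w < 0:
        return ("leveled", 2 - w)
    return ("middle", 2)

# 'T4,T4,L10': tiered for the first two levels, leveled afterwards
print([parse_scaling_parameter(p) for p in "T4,T4,L10".split(",")])
# [('tiered', 4), ('tiered', 4), ('leveled', 10)]
```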

The strategy engine: tuning and parameters

UCS acts as a strategy engine: by adjusting the scaling parameter (W), it can mimic or outperform its predecessors, STCS and LCS.

At a high level, the scaling parameter influences the effective fan-out behavior at each compaction level. Tiered-style settings such as T4 allow more SSTables to accumulate before merging, favoring write efficiency, while leveled-style settings such as L10 keep SSTables more tightly organized, reducing read amplification at the cost of additional background work.

The numbers below are illustrative and not prescriptive:

UCS configuration guide:

  • Heavy writes / IoT: tiered behavior (STCS-like), positive W (e.g., 2). Primary benefit: lowest write amplification.
  • Heavy reads: leveled behavior (LCS-like), negative W (e.g., -8). Primary benefit: lowest read amplification.
  • Balanced: hybrid behavior, W = 0. Primary benefit: balanced performance for general apps.

Practical example

UCS allows operators to mix behaviors across the data lifecycle.

'scaling_parameters': 'T4, T4, L10'

Note that scaling_parameters takes a string in which each comma-separated value configures one compaction level, enabling per-level tuning.

This example instructs a cluster: “Use tiered compaction for the first two levels to keep up with the high write volume, but once data reaches the third level, reorganize it into a leveled structure so reads stay fast.”

Here’s a fuller, illustrative example of how one might structure their CQL to change the compaction strategy.

ALTER TABLE keyspace_name.table_name
WITH compaction = {
    'class': 'UnifiedCompactionStrategy',
    'scaling_parameters': 'T4,T4,L10'
};
Operational evolution: moving beyond major compactions

In older strategies and in Apache Cassandra versions prior to 5.0, operators often felt forced to run a major compaction to reclaim disk space or fix performance. This was a critical event that could impact a node’s I/O for extended periods of time and required substantial free disk space to complete.

Because UCS is density-aware and sharded, it effectively performs compactions constantly and granularly, so major compactions are rarely needed. It identifies overlapping data within specific token ranges (shards) and cleans it up incrementally. Operators no longer have to choose between a fragmented disk and a risky, resource-heavy manual compaction; UCS keeps data density more uniform across the cluster over time.

The migration advantage: “in-place” adoption

One of the key performance features of a UCS migration is in-place adoption, meaning that when a table is switched to UCS, it does not immediately force a massive data rewrite. Instead, it looks at the existing SSTables, calculates their density, and maps them into its new sharding structure.

This allows for moving from STCS or LCS to UCS with significantly less I/O overhead than any other strategy change.

Conclusion

UCS is an operational shift toward simplicity and predictability. By removing the need to choose between compaction trade-offs, UCS allows organizations to scale with confidence. Whether handling a massive influx of IoT data or serving high-speed user profiles, UCS helps clusters remain performant, cost-effective, and ready for the future.

On a newly deployed NetApp Instaclustr Apache Cassandra 5 cluster, UCS is already the default strategy (while Apache Cassandra 5.0 has STCS set as the default).

Ready to experience this new level of Cassandra performance for yourself? Try it with a free 30-day trial today!

The post Apache Cassandra® 5.0: Improving performance with Unified Compaction Strategy appeared first on Instaclustr.

Getting Started with Database-Level Encryption at Rest in ScyllaDB Cloud

Learn about ScyllaDB database-level encryption with Customer-Managed Keys & see how to set up and manage encryption with a customer key — or delegate encryption to ScyllaDB

ScyllaDB Cloud takes a proactive approach to ensuring the security of sensitive data: we provide database-level encryption in addition to the default storage-level encryption. With this added layer of protection, customer data remains protected against attacks, and customers can focus on their core operations, knowing that their critical business and customer assets are well-protected. Clients can either use a customer-managed key (CMK, our version of BYOK) or let ScyllaDB Cloud manage the CMK for them. The feature is available in all cloud platforms supported by ScyllaDB Cloud.

This article explains how ScyllaDB Cloud protects customer data, focusing on the technical aspects of ScyllaDB database-level encryption with Customer-Managed Keys (CMK).

Storage-level encryption

Encryption at rest means data files are encrypted before being written to persistent storage. ScyllaDB Cloud always uses encrypted volumes to prevent data breaches caused by physical access to disks.

Database-level encryption

Database-level encryption is a technique for encrypting all data before it is stored in the database.
The ScyllaDB Cloud feature is based on the proven ScyllaDB Enterprise database-level encryption at rest, extended with Customer-Managed Key (CMK) encryption control. This ensures that the data is securely stored and that the customer is the one holding the key. The keys are stored and protected separately from the database, substantially increasing security.

ScyllaDB Cloud provides full database-level encryption built on the CMK concept. It uses envelope encryption to encrypt the data and decrypt it only when the data is needed, which is essential for protecting customer data at rest.

Some industries, like healthcare or finance, have strict data security regulations. Encrypting all data helps businesses comply with these requirements, avoiding the need to prove that every table holding sensitive personal data is covered by encryption. It also helps businesses protect their corporate data, which can be even more valuable.

A key feature of CMK is that the customer has complete control of the encryption keys (data encryption keys are introduced later). The customer can:

  • Revoke data access at any time
  • Restore data access at any time
  • Manage the master keys needed for decryption
  • Log all access attempts to keys and data

Customers who prefer to delegate all key management operations to the ScyllaDB Cloud support team can do so by choosing the ScyllaDB key when creating the cluster. Either way, customer data remains secure and adheres to privacy regulations.

By default, encryption uses the symmetric AES-128 algorithm, a solid corporate encryption standard that covers all practical applications; breaking AES-128 by brute force would take approximately trillions of years. The strength can be increased to AES-256.

Note: Database-level encryption in ScyllaDB Cloud is available for all clusters deployed in Amazon Web Services (AWS) and Google Cloud Platform (GCP).
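
To make the envelope-encryption idea concrete, here is a minimal, runnable Python sketch. A toy XOR keystream stands in for AES, and the function names are illustrative only; they are not ScyllaDB’s or any real KMS API:

```python
import hashlib
import os

def toy_cipher(key: bytes, data: bytes) -> bytes:
    """XOR the data with a keystream derived from the key.

    A stand-in for AES used only to illustrate the flow; XOR-ing twice
    with the same key decrypts. Never use this for real data.
    """
    stream = b""
    counter = 0
    while len(stream) < len(data):
        stream += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(a ^ b for a, b in zip(data, stream))

# --- KMS side: the master key (MK) never leaves the KMS ---
master_key = os.urandom(16)

def kms_wrap(dek: bytes) -> bytes:
    return toy_cipher(master_key, dek)      # DEK -> EDEK

def kms_unwrap(edek: bytes) -> bytes:
    return toy_cipher(master_key, edek)     # EDEK -> DEK

# --- Database side: envelope encryption of one record ---
record = b"sensitive customer data"
dek = os.urandom(16)                  # fresh data encryption key
ciphertext = toy_cipher(dek, record)  # encrypt the record with the DEK
edek = kms_wrap(dek)                  # wrap the DEK; store EDEK with the data
del dek                               # the plaintext DEK is destroyed

# --- Later: decryption requires a KMS round-trip to unwrap the EDEK ---
plaintext = toy_cipher(kms_unwrap(edek), ciphertext)
assert plaintext == record
```

Revoking the master key in the KMS makes the unwrap step impossible, which is how CMK lets a customer cut off data access independently of database or application credentials.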
Encryption

To ensure all user data is protected, ScyllaDB encrypts:

  • All user tables
  • Commit logs
  • Batch logs
  • Hinted handoff data

The first step of the encryption process is to encrypt every record with a data encryption key (DEK). The DEK is then sent to AWS KMS or GCP KMS, where the master key (MK) resides, and is encrypted with the master key to produce an encrypted DEK (EDEK, also called a wrapped key). The master key never leaves the KMS, while the EDEK is returned and stored with the data. The plaintext DEK used to encrypt the data is then destroyed to keep the data protected; a new DEK is generated the next time data needs to be encrypted.

Decryption

Because the plaintext DEK is destroyed once the EDEK is produced, the data cannot be decrypted directly. The EDEK cannot decrypt the data either, since the DEK inside it is itself encrypted; it must first be unwrapped, and that requires the master key (MK) in the KMS. Once the DEK is unwrapped, the data can be decrypted. In short, the data cannot be decrypted without the master key, which is protected at all times in the KMS and cannot be copied outside it. By revoking the master key, the customer can disable access to the data independently of database or application authorization.

Multi-region deployment

Adding new data centers to the ScyllaDB cluster creates additional local keys in those regions. All master keys support multiple regions, and a copy of each key resides locally in each region, protecting multi-regional setups from regional cloud provider outages and disasters. The keys are available in the same region as the data center and can be controlled independently. If you use a customer key, cloud providers will charge you for the KMS.
AWS charges $1/month per key; GCP charges $0.06 per cluster, prorated per hour. Each additional data center creates a key replica that is counted as an additional key. There is also a cost per key request, but ScyllaDB Enterprise uses those requests efficiently, resulting in an estimated monthly cost of up to $1 for a 9-node cluster.

Managing encryption keys adds another layer of administrative work on top of the extra cost. As an alternative, ScyllaDB Cloud offers database clusters encrypted with keys managed by ScyllaDB support. These provide the same level of protection, but our support team helps you manage the master keys. The ScyllaDB keys are applied by default and are subsidized by ScyllaDB.

Creating a Cluster with Database-Level Encryption

Creating a cluster with database-level encryption requires:

  • A ScyllaDB Cloud account – if you don’t have one, you can create a ScyllaDB Cloud account here
  • 10 minutes with a ScyllaDB key, or 20 minutes to create your own key

To create a cluster with database-level encryption enabled, we need a master key. We can either create a customer-managed key using the ScyllaDB Cloud UI, or use a ScyllaDB Managed Key and skip the key-creation steps entirely. In both cases, all the data will be protected by strong encryption at the database level. Setting up the customer-managed key is covered in the database-level encryption documentation.

Transparent database-level encryption in ScyllaDB Cloud significantly boosts the security of your ScyllaDB clusters and backups.

Next Steps

Start using this feature in ScyllaDB Cloud. Get your questions answered in our community forum and Slack channel. Or, use our contact form.