Books by Monster SCALE Summit Speakers: Data-Intensive Applications & Beyond
Monster Scale Summit speakers have amassed a rather impressive list of publications, including quite a few books. If you’ve seen the Monster Scale Summit agenda, you know that the stars have aligned nicely. In just two half days, from anywhere you like, you can learn from 60+ outstanding speakers exploring extreme scale engineering challenges from a variety of angles (distributed databases, real-time data, AI, Rust…). If you read the bios of our speakers, you’ll note that many have written books. This blog highlights 11 Monster Scale reads. Once you register for the conference (it’s free + virtual), you’ll gain 30-day full access to the complete O’Reilly library (thanks to O’Reilly, a conference media sponsor). And Manning Publications is also a media sponsor. They are offering the Monster SCALE community a nice 45% discount on all Manning books. Conference attendees who participate in the speaker chat will be eligible to win print books (courtesy of O’Reilly) and eBook bundles (courtesy of Manning). March 4 update: There’s also a special Manning offer (50% off!) for Monster Scale bundles. See the agenda and register – it’s free
Platform Engineering: A Guide for Technical, Product, and People Leaders By Camille Fournier and Ian Nowland October 2024 Bookshop.org | Amazon | O’Reilly Until recently, infrastructure was the backbone of organizations operating software they developed in-house. But now that cloud vendors run the computers, companies can finally bring the benefits of agile customer-centricity to their own developers. Adding product management to infrastructure organizations is now all the rage. But how’s that possible when infrastructure is still the operational layer of the company? This practical book guides engineers, managers, product managers, and leaders through the shifts required to become a modern platform-led organization. You’ll learn what platform engineering is (and isn’t) and what benefits and value it brings to developers and teams. You’ll understand what it means to approach your platform as a product and learn some of the most common technical and managerial barriers to success. With this book, you’ll: Cultivate a platform-as-product, developer-centric mindset Learn what platform engineering teams are and are not Start the process of adopting platform engineering within your organization Discover what it takes to become a product manager for a platform team Understand the challenges that emerge when you scale platforms Automate processes and self-service infrastructure to speed development and improve developer experience Build out, hire, manage, and advocate for a platform team Camille is presenting “What Engineering Leaders Get Wrong About Scale”
Designing Data-Intensive Applications, 2nd Edition By Martin Kleppmann and Chris Riccomini Bookshop.org | Amazon | O’Reilly February 2026 Data is at the center of many challenges in system design today. Difficult issues such as scalability, consistency, reliability, efficiency, and maintainability need to be resolved. In addition, there’s an overwhelming variety of tools and analytical systems, including relational databases, NoSQL datastores, data warehouses, and data lakes. What are the right choices for your application? How do you make sense of all these buzzwords? In this second edition, authors Martin Kleppmann and Chris Riccomini build on the foundation laid in the acclaimed first edition, integrating new technologies and emerging trends. 
You’ll be guided through the maze of decisions and trade-offs involved in building a modern data system, from choosing the right tools like Spark and Flink to understanding the intricacies of data laws like the GDPR. Peer under the hood of the systems you already use, and learn to use them more effectively Make informed decisions by identifying the strengths and weaknesses of different tools Navigate the trade-offs around consistency, scalability, fault tolerance, and complexity Understand the distributed systems research upon which modern databases are built Peek behind the scenes of major online services, and learn from their architectures Martin and Chris are featured in “Fireside Chat: Designing Data-Intensive Applications, Second Edition” The Coder Cafe: 66 Timeless Concepts for Software Engineers By Teiva Harsanyi ETA Summer 2026 Manning (use code SCALE2026 for 45%) Great software developers—even the proverbial greybeards—get a little better every day by adding knowledge and skills continuously. This new book invites you to share a cup of coffee with senior Google engineer Teiva Harsanyi as he shows you how to make your code more readable, use unit tests as documentation, reduce latency, navigate complex systems, and more. The Coder Cafe introduces 66 vital software engineering concepts that will upgrade your day-to-day practice, regardless of your skill level. You’ll find focused explanations—each five pages or less—on everything from foundational data structures to distributed architecture. These timeless concepts are the perfect way to turn your coffee break into a high-impact career boost. Generate property-based tests to expose hidden edge cases automatically Explore the CAP and PACELC theorems to balance consistency and availability trade-offs Design graceful-degradation strategies to keep systems usable under failure Leverage Bloom filters to perform fast, memory-efficient membership checks Cultivate lateral thinking to uncover unconventional solutions Teiva is presenting “Working on Complex Systems” Think Distributed Systems By Dominik Tornow August 2025 Bookshop.org | Amazon | Manning (use code SCALE2026 for 45%) All modern software is distributed. Let’s say that again—all modern software is distributed. Whether you’re building mobile utilities, microservices, or massive cloud native enterprise applications, creating efficient distributed systems requires you to think differently about failure, performance, network services, resource usage, latency, and much more. This clearly-written book guides you into the mindset you’ll need to design, develop, and deploy scalable and reliable distributed systems. In Think Distributed Systems you’ll find a beautifully illustrated collection of mental models for: Correctness, scalability, and reliability Failure tolerance, detection, and mitigation Message processing Partitioning and replication Consensus Dominik is presenting “Systems Engineering for Agentic Applications,” which is the topic of his upcoming book Database Performance at Scale By Felipe Cardeneti Mendes, Piotr Sarna, Pavel Emelyanov, and Cynthia Dunlop September 2023 Amazon | ScyllaDB (free book) Discover critical considerations and best practices for improving database performance based on what has worked, and failed, across thousands of teams and use cases in the field. 
This book provides practical guidance for understanding the database-related opportunities, trade-offs, and traps you might encounter while trying to optimize data-intensive applications for high throughput and low latency. Whether you’re building a new system from the ground up or trying to optimize an existing use case for increased demand, this book covers the essentials. The ultimate goal of the book is to help you discover new ways to optimize database performance for your team’s specific use cases, requirements, and expectations. Understand often overlooked factors that impact database performance at scale Recognize data-related performance and scalability challenges associated with your project Select a database architecture that’s suited to your workloads, use cases, and requirements Avoid common mistakes that could impede your long-term agility and growth Jumpstart teamwide adoption of best practices for optimizing database performance at scale Felipe is presenting “The Engineering Behind ScyllaDB’s Efficiency” and will be hosting the ScyllaDB Lounge Writing for Developers: Blogs That Get Read By Piotr Sarna and Cynthia Dunlop December 2025 Bookshop.org | Amazon | Manning (use code SCALE2026 for 45%) Think about how many times you’ve read an engineering blog that’s sparked a new idea, demystified a technology, or saved you from going down a disastrous path. That’s the power of a well-crafted technical article! This practical guide shows you how to create content your fellow developers will love to read and share. Writing for Developers introduces seven popular “patterns” for modern engineering blogs—such as “The Bug Hunt”, “We Rewrote It in X”, and “How We Built It”—and helps you match these patterns with your ideas. The book covers the entire writing process, from brainstorming, planning, and revising, to promoting your blog and leveraging it into further opportunities. You’ll learn through detailed examples, methodical strategies, and a “punk rock DIY attitude!”: Pinpoint topics that make intriguing posts Apply popular blog post design patterns Rapidly plan, draft, and optimize blog posts Make your content clearer and more convincing to technical readers Tap AI for revision while avoiding misuses and abuses Increase the impact of all your technical communications This book features the work of numerous Monster Scale Summit speakers, including Joran Greef and Sanchay Javeria. There’s also a reference to Pat Helland in Bryan Cantrill’s foreword. ScyllaDB in Action Bo Ingram October 2024 Bookshop.org | Amazon | Manning (use code SCALE2026 for 45%) | ScyllaDB (free chapters) ScyllaDB in Action is your guide to everything you need to know about ScyllaDB, from your very first queries to running it in a production environment. It starts you with the basics of creating, reading, and deleting data and expands your knowledge from there. You’ll soon have mastered everything you need to build, maintain, and run an effective and efficient database. This book teaches you ScyllaDB the best way—through hands-on examples. Dive into the node-based architecture of ScyllaDB to understand how its distributed systems work, how you can troubleshoot problems, and how you can constantly improve performance. 
You’ll learn how to: Read, write, and delete data in ScyllaDB Design database schemas for ScyllaDB Write performant queries against ScyllaDB Connect and query a ScyllaDB cluster from an application Configure, monitor, and operate ScyllaDB in production Bo’s colleagues Ethan Donowitz and Peter French are presenting “How Discord Automates Database Operations at Scale” Latency: Reduce Delay in Software Systems By Pekka Enberg Bookshop.org | Amazon | Manning (use code SCALE2026 for 45%) Slow responses can kill good software. Whether it’s recovering microseconds lost while routing messages on a server or speeding up page loads that keep users waiting, finding and fixing latency can be a frustrating part of your work as a developer. This one-of-a-kind book shows you how to spot, understand, and respond to latency wherever it appears in your applications and infrastructure. This book balances theory with practical implementations, turning academic research into useful techniques you can apply to your projects. In Latency you’ll learn: What latency is—and what it is not How to model and measure latency Organizing your application data for low latency Making your code run faster Hiding latency when you can’t reduce it Pekka and his Turso teammates are frequent speakers at Monster Scale Summit’s sister conference, P99 CONF 100 Go Mistakes and How to Avoid Them By Teiva Harsanyi August 2022 Bookshop.org | Amazon | Manning (use code SCALE2026 for 45%) 100 Go Mistakes and How to Avoid Them puts a spotlight on common errors in Go code you might not even know you’re making. You’ll explore key areas of the language such as concurrency, testing, data structures, and more—and learn how to avoid and fix mistakes in your own projects. As you go, you’ll navigate the tricky bits of handling JSON data and HTTP services, discover best practices for Go code organization, and learn how to use slices efficiently. This book shows you how to: Dodge the most common mistakes made by Go developers Structure and organize your Go application Handle data and control structures efficiently Deal with errors in an idiomatic manner Improve your concurrency skills Optimize your code Make your application production-ready and improve testing quality Teiva is presenting “Working on Complex Systems” The Manager’s Path: A Guide for Tech Leaders Navigating Growth and Change By Camille Fournier May 2017 Bookshop.org | Amazon | O’Reilly Managing people is difficult wherever you work. But in the tech industry, where management is also a technical discipline, the learning curve can be brutal—especially when there are few tools, texts, and frameworks to help you. In this practical guide, author Camille Fournier (tech lead turned CTO) takes you through each stage in the journey from engineer to technical manager. From mentoring interns to working with senior staff, you’ll get actionable advice for approaching various obstacles in your path. This book is ideal whether you’re a new manager, a mentor, or a more experienced leader looking for fresh advice. Pick up this book and learn how to become a better manager and leader in your organization. 
Begin by exploring what you expect from a manager Understand what it takes to be a good mentor, and a good tech lead Learn how to manage individual members while remaining focused on the entire team Understand how to manage yourself and avoid common pitfalls that challenge many leaders Manage multiple teams and learn how to manage managers Learn how to build and bootstrap a unifying culture in teams Camille is presenting “What Engineering Leaders Get Wrong About Scale” The Missing README: A Guide for the New Software Engineer By Chris Riccomini and Dmitriy Ryaboy Bookshop.org | Amazon | O’Reilly August 2021 For new software engineers, knowing how to program is only half the battle. You’ll quickly find that many of the skills and processes key to your success are not taught in any school or bootcamp. The Missing README fills in that gap—a distillation of workplace lessons, best practices, and engineering fundamentals that the authors have taught rookie developers at top companies for more than a decade. Early chapters explain what to expect when you begin your career at a company. The book’s middle section expands your technical education, teaching you how to work with existing codebases, address and prevent technical debt, write production-grade software, manage dependencies, test effectively, do code reviews, safely deploy software, design evolvable architectures, and handle incidents when you’re on-call. Additional chapters cover planning and interpersonal skills such as Agile planning, working effectively with your manager, and growing to senior levels and beyond. You’ll learn: How to use the legacy code change algorithm, and leave code cleaner than you found it How to write operable code with logging, metrics, configuration, and defensive programming How to write deterministic tests, submit code reviews, and give feedback on other people’s code The technical design process, including experiments, problem definition, documentation, and collaboration What to do when you are on-call, and how to navigate production incidents Architectural techniques that make code change easier Agile development practices like sprint planning, stand-ups, and retrospectives Chris is featured in “Fireside Chat: Designing Data-Intensive Applications, Second Edition” (with Martin Kleppmann)
Common Performance Pitfalls of Modern Storage I/O
Performance pitfalls when aiming for high-performance I/O on modern hardware and cloud platforms
Over the past two blog posts in this series, we’ve explored real-world performance investigations on storage-optimized instances with high-performance NVMe RAID arrays. Part 1 examined how the Seastar IO Scheduler’s fair-queue issues inadvertently throttled bandwidth to a fraction of the advertised instance capacity. Part 2 dove into the filesystem layer, revealing how XFS’s block size and alignment strategies could force read-modify-write operations and cut a big chunk out of write throughput. This third and final part synthesizes those findings into a practical performance checklist. Whether you’re optimizing ScyllaDB, building your own database system, or simply trying to understand why your storage isn’t delivering the advertised performance, understanding these three interconnected layers – disk, filesystem, and application – is essential. Each layer has its own assumptions about what constitutes an optimal request. When these expectations misalign, the consequences cascade down, amplifying latency and degrading throughput. This post presents a set of subtle pitfalls we’ve encountered, organized by layer. Each includes concrete examples from production investigations as well as actionable mitigation strategies.
Mind the Disk — Block Size and Alignment Matter
Physical and Logical sector sizes
Modern SSDs, particularly high-performance NVMe drives, expose two critical properties: physical sector size and logical sector size. The logical sector size is what the operating system sees. It typically defaults to 512 bytes for backward compatibility, though many modern drives are also capable of reporting 4K. The physical sector size reflects how data is actually stored on the flash chips. It’s the size at which the drive delivers peak performance. When you submit a write request that doesn’t align to the physical sector size, the SSD controller must:
Read the entire physical page containing your data
Modify the portion you’re writing to
Write the entire sector back to an empty/erased flash page
In our investigation of AWS i7i instances, we were bitten by exactly this problem. IOTune, our disk benchmarking tool, was using 512-byte requests to run the benchmarks because the firmware in these new NVMes was incorrectly reporting the physical sector size as 512 bytes when it was actually 4 KB. This made us measure disks as having up to 25% lower read IOPS and up to 42% lower write IOPS. The measurements were used to configure the IO Scheduler, so we ended up using the disks as if they were a less performant model. That’s a very good way for your business to lose money. 🙂 If you’re wondering why I only explained the reasoning for slow writes with requests unaligned to the physical sector size, it’s because we still don’t fully understand why read requests are also hit by this issue. It’s an open question and we’re still researching some leads (which I hope will get us an answer).
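If you want to cross-check what the kernel (and ultimately the firmware) believes, the standard Linux block-device ioctls expose both values. Here is a minimal sketch: the device path is just an example, and keep in mind that the physical value comes straight from the firmware, which is exactly what misled us here, so a benchmark comparing 512-byte and 4K requests is the real arbiter.

```cpp
// Minimal sketch: query the logical and physical sector sizes of a block device.
// Note: BLKPBSZGET just reports what the firmware claims, so treat it as a hint.
#include <fcntl.h>
#include <linux/fs.h>   // BLKSSZGET, BLKPBSZGET
#include <sys/ioctl.h>
#include <unistd.h>
#include <cstdio>

int main() {
    const char* dev = "/dev/nvme0n1";   // example device path
    int fd = open(dev, O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    int logical = 0;            // what the OS sees (often 512 for compatibility)
    unsigned int physical = 0;  // what the flash is actually optimized for
    ioctl(fd, BLKSSZGET, &logical);
    ioctl(fd, BLKPBSZGET, &physical);

    printf("%s: logical sector = %d bytes, physical sector = %u bytes\n",
           dev, logical, physical);
    close(fd);
    return 0;
}
```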
Sustained performance
If your software relies on a disk hitting a certain performance number, try to account for the fact that even dedicated provisioned NVMes have peak and sustained performance values. It’s well known that elastic storage (like EBS, for instance) has baseline performance and peak performance, but it’s less intuitive for dedicated NVMes to behave like this. Measuring disk performance with runs of 10 minutes or more might yield 3-5% lower IOPS/throughput. That allows your software to better predict how the disk behaves under sustained load.
Mind the RAID
If you’re using RAID0 for NVMe arrays, be aware that your app’s parallel architecture might resonate with the way requests end up distributed over the RAID array. RAIDs are made out of chunks and stripes. A chunk is a block of data within a single disk; when the chunk size is exceeded, the driver moves on to the next disk in the array. A stripe is the set of chunks, one from each disk, that get written sequentially. A RAID0 with 2 disks will get written like this: stripe0: chunk0 (disk0), chunk1 (disk1); stripe1; stripe2… Filesystems usually align files to the RAID stripe boundary. Depending on the write pattern of your application, you could end up stressing some of the disks more, and not leveraging the entire power of the RAID array.
Key Takeaways for the Disk Layer
Detection: Always verify physical and logical sector sizes if you suspect your issue might be related to this. Don’t blindly trust firmware-reported values; cross-check with benchmarking tools and adjust your app and filesystem to use the physical sector size if possible.
Measurement discipline: Increase measurement time when benchmarking disks. Even dedicated NVMes can have baseline vs. peak performance.
RAID awareness: A RAID array is made out of blocks, addresses, and drivers managing them. It’s not a magic endpoint that will just amplify your N-drive array into N times the performance of a single disk. Its architecture has its own set of assumptions and limitations which, together with the filesystem’s own limitations and assumptions, might interfere with your app’s.
Mind the Filesystem — Block Size, Alignment, and Metadata Operations
Filesystem Block Size and Request Alignment
Every filesystem has its own block size, independent of the disk’s physical sector size. XFS, for instance, can be formatted with block sizes ranging from 512 bytes to 64 KB. In ScyllaDB, we used to format with a 1K block size because we wanted the fastest commitlog writes (>= 1K). For older SSDs, the physical sector size was 512 bytes. On modern 4K-optimized NVMe arrays, this choice became a liability. We realized that 4K block sizes would bring us lots of extra write throughput. This filesystem-level block size affects two critical aspects: how data is stored and aligned on disk, and how metadata is laid out. Here’s a concrete example. When Seastar issues 128 KB sequential writes to a file on 1K-block XFS, the filesystem doesn’t seem to align these to 4K boundaries (maybe because the SSD firmware reported a physical sector size of 512 bytes). Using blktrace to inspect the actual disk I/O, we observed that approximately 75% of requests aligned to 1K or 2K boundaries. For these requests, the drive controller would split each of them into at most 3 parts: a head, a 4K-aligned core, and a tail. For the head and tail, the disk would perform RMW, which is very slow. That would become the dominating factor for the entire request (consisting of the 3 parts). Reformatting the filesystem with a 4K block size completely transformed the alignment distribution of requests, and 100% of them aligned to 4K. This brought a lot of throughput back for us.
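To make the head/core/tail split concrete, here is a small illustrative sketch (not ScyllaDB or firmware code) of how a write that isn’t aligned to the physical sector size breaks apart; the example assumes a 4K physical sector and a 128 KB write landing on a 1K boundary, as in the blktrace observations above.

```cpp
// Sketch: split a write request into an unaligned head, an aligned core,
// and an unaligned tail. The head and tail force read-modify-write inside
// the SSD, and they end up dominating the latency of the whole request.
#include <cstdint>
#include <cstdio>

struct Split { uint64_t head, core, tail; };

Split split_request(uint64_t offset, uint64_t len, uint64_t phys = 4096) {
    uint64_t end = offset + len;
    uint64_t first_aligned = (offset + phys - 1) / phys * phys;  // round up
    uint64_t last_aligned = end / phys * phys;                   // round down
    if (first_aligned >= last_aligned)   // request fits inside one physical sector
        return {len, 0, 0};
    return {first_aligned - offset,       // head: needs RMW
            last_aligned - first_aligned, // core: aligned, fast path
            end - last_aligned};          // tail: needs RMW
}

int main() {
    Split s = split_request(1024, 128 * 1024);   // 128 KB write at a 1K offset
    printf("head=%llu core=%llu tail=%llu\n",
           (unsigned long long)s.head, (unsigned long long)s.core,
           (unsigned long long)s.tail);          // head=3072 core=126976 tail=1024
    return 0;
}
```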
Filesystem Metadata Operations and Blocking
Consider this: when a file grows, XFS must:
Allocate extents from the freespace tree, requiring B-Tree modifications and mutex locks
Update inode metadata to reflect the new file size
Flush metadata periodically to ensure durability
Update access/change times (ctimes) on every write
Each of these operations can block subsequent I/O submissions. In our research, we discovered that the RWF_NOWAIT flag (requesting non-blocking async I/O submission) was insufficiently effective when metadata operations were queued. Writes would be re-submitted from worker threads rather than the Seastar reactor, adding context-switch overhead and latency spikes. When the final size of files is known, it is beneficial to pre-allocate or pre-truncate the file to that size using functions like fallocate() or ftruncate() (see the sketch after the takeaways below). This practice dramatically improves the alignment distribution across the file and helps to amortize the overhead associated with extent allocation and metadata updates. While effective, fallocate() can be an expensive operation. It can potentially impact latency, especially if the allocation groups are already busy. Truncation is significantly cheaper; this alone can offer substantial benefits. Another helpful technique is using approaches like Seastar’s sloppy_size=true, where a file’s size is doubled via truncation whenever the current limit is reached.
Key Takeaways for the Filesystem Layer
Format Correctly: Format XFS (or other filesystems) with block sizes matching your SSD’s physical sector size if possible. Most modern NVMe drives are 4K-optimized. Go lower only if there are strong reasons – for example, your workload cannot issue 4K-aligned reads, a smaller block size avoids read amplification for your access pattern, or you have benchmarks showing the disk performs better with a smaller request size.
Pre-allocation: When file sizes are known, pre-truncate or pre-allocate files to their final size using `fallocate()` or truncation. This amortizes extent allocation overhead and ensures uniform alignment across the file.
Metadata Flushing: Understand the filesystem’s metadata update behavior. File size and access time updates are expensive. If you’re doing AIO, use RWF_NOWAIT if possible and make sure it works by tracing the calls with `strace`.
Data deletion has also proved to have very expensive side effects. For example, on older-generation NVMes, TRIM requests that accumulated in the filesystem would flush all at once, overloading the SSD controller and causing huge latency spikes for the apps running on that machine.
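Here is a minimal sketch of the pre-sizing idea (plain POSIX calls rather than Seastar’s API; the path and size are made up for illustration). In practice you would pick either the cheap ftruncate() or the heavier fallocate(), not both.

```cpp
// Sketch: pre-size a file before streaming sequential writes into it, so the
// filesystem can lay it out up front instead of growing it write by write.
#include <fcntl.h>      // open (fallocate is also declared here on Linux)
#include <unistd.h>     // ftruncate, close
#include <cstdio>

int main() {
    const char* path = "/mnt/data/example.bin";   // illustrative path
    const off_t final_size = 1LL << 30;           // assume we know it will be ~1 GiB

    int fd = open(path, O_WRONLY | O_CREAT, 0644);
    if (fd < 0) { perror("open"); return 1; }

    // Cheap option: just declare the final size. Blocks are still allocated
    // lazily, but the alignment and extent layout of later writes improve.
    if (ftruncate(fd, final_size) != 0) perror("ftruncate");

    // Heavier option: actually reserve the extents now. Can hurt latency if
    // the allocation groups are busy, so benchmark before adopting it.
    // if (fallocate(fd, 0, 0, final_size) != 0) perror("fallocate");

    // ... sequential, aligned writes go here ...
    close(fd);
    return 0;
}
```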
Mind the Application — Parallelism and Request Size Strategy
Parallelism tuning
Modern NVMe storage devices can handle thousands of requests in flight simultaneously. Factors like the internal queue depth and the number of outstanding requests the device can accept determine the maximum achievable bandwidth and latency when properly loaded. However, application-level concurrency (threads, fibers, async tasks) must be sufficient to keep these queues full. Generally, the bandwidth vs. latency relationship has two regimes. If you measure latency and bandwidth while increasing the app’s parallelism, latency stays roughly constant and bandwidth grows – as long as the disk’s internal parallelism is not saturated. Once the disk is loaded to its throughput limit, bandwidth stops growing or grows very little, while latency scales almost linearly with the additional load. The relationship between throughput, latency, and queue depth follows Little’s Law:
Average Queue Depth = Throughput * Average Latency
For a device delivering 14.5 GB/s with 128K requests, the number of requests per second the device can handle is 14.5 GB/s divided by 128K – roughly 113k req/s. If an individual request’s latency is, for example, 1 ms, the device queue needs at least 113 outstanding requests to sustain 14.5 GB/s. This has practical implications: if you tune your application for, say, 40 concurrent requests and later upgrade to a faster device, you’ll need to increase concurrency or you’ll under-utilize the new device. Conversely, if you over-provision concurrency, you risk queue saturation and latency spikes because the SSD controllers only have so much compute power themselves. In ScyllaDB, concurrency is expressed as the number of shards (1 per CPU core) and fibers submitting I/O in parallel. Because we want the database to perform exceptionally even in mixed workloads, we’re using Seastar’s IO Scheduler to modulate the number of requests we send to storage. The Scheduler is configured with a latency goal that it needs to follow and with the IOPS/bandwidth the disk can handle at peak performance. In real workloads, it’s very difficult to match the load with a static execution model you’ve built yourself, even with thorough benchmarking and testing. Hardware performance varies, and read/write patterns are usually surprising. Your best bet is to use an IO scheduler built to leverage the maximum potential of the hardware while still obeying a latency limit you’ve set.
Request size strategy
Many small I/O requests often fail to saturate an NVMe drive’s throughput because bandwidth is the product of IOPS and request size, and small requests hit an IOPS/latency/CPU ceiling before they hit the PCIe/NAND bandwidth ceiling. For a given request size, bandwidth is essentially IOPS x request size. So, for instance, 4K I/Os would need extremely high IOPS to reach GB/s-class throughput. A concrete example: 350k IOPS (typical for an AWS i7i.4xlarge, for instance) at 4K request size will get you ~1.3 GB/s (an i7i.4xlarge can do 3.5 GB/s easily). This looks slow in throughput terms even though the device might be doing exactly what the workload allows. NVMe devices reach peak bandwidth when they have enough outstanding requests (queue depth) to keep the many internal flash channels busy. If your workload issues many small requests but mostly one at a time (or only a couple in flight), the drive spends time waiting between completions instead of streaming data. As a result, bandwidth stays low. Another important aspect is that with small I/O sizes, the fixed per-I/O costs (syscalls, scheduling, interrupts, copying, etc.) can dominate the latency. That means the kernel/filesystem path becomes the limiter even before the NVMe hardware does. This is one reason why real applications often can’t reproduce vendors’ peak numbers. In practice, you usually need high concurrency (multiple threads/fibers/jobs) and true async I/O to maintain a deep in-flight queue and approach device limits. But, to repeat one idea from the section above, also remember that the storage controller itself has limited compute capacity. Overloading it results in higher latencies for both read and write paths. It is important to find the right request size for your application’s workload. Go as big as your latency expectations will allow. If your workload is not very predictable, keep in mind that you’ll most likely need some dynamic logic that adjusts the I/O patterns so you can squeeze every last drop of performance out of that expensive storage.
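Before the takeaways, here is a small sketch that re-runs the arithmetic from this section (Little’s Law and bandwidth = IOPS x request size), so it is easy to re-check when you change devices or request sizes. The numbers are the examples used above; the 1 ms latency is an assumption.

```cpp
// Back-of-the-envelope sizing: bandwidth = IOPS x request size, and
// Little's Law gives the queue depth needed to sustain a target bandwidth
// at a given per-request latency. Numbers are the examples from the text.
#include <cstdio>

int main() {
    // How many outstanding requests does 14.5 GB/s need at 128K and ~1 ms?
    double target_bw  = 14.5e9;        // bytes/s
    double req_size   = 128.0 * 1024;  // bytes per request
    double latency_s  = 1e-3;          // assumed per-request latency

    double req_per_s   = target_bw / req_size;     // ~110-113k req/s
    double queue_depth = req_per_s * latency_s;    // Little's Law: ~110+

    printf("req/s: %.0f, required queue depth: %.0f\n", req_per_s, queue_depth);

    // And the other direction: what a 4K, IOPS-bound workload can deliver.
    double small_bw = 350e3 * 4096;    // 350k IOPS x 4 KiB
    printf("4K workload: %.2f GB/s (%.2f GiB/s)\n",
           small_bw / 1e9, small_bw / (1 << 30));  // ~1.4 GB/s, ~1.3 GiB/s
    return 0;
}
```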
Key Takeaways for the Application Layer
Parallelism tuning: Benchmark your specific hardware and workload to find the optimal fiber/thread/shard count. Watch the latency. While you increase the parallelism, you’ll notice throughput increasing. At some point, throughput will plateau and latency will start to increase. That’s the sweet spot you’re looking for.
Request size strategy: Picking the right request size is important. Seastar defaults to 128K for sequential writes, which works well on most modern storage (but validate it via benchmarking). If your device prefers larger or smaller requests for throughput, the cost is latency – so design your workload accordingly. In practice, we’ve never seen SSDs that cannot be saturated for throughput with 128K requests, and ScyllaDB can achieve sub-millisecond latencies with this request size.
Conclusion
The path from application buffer to persistent storage on modern hardware is complex. Often, performance issues are counterintuitive and very difficult to track down. A 1 KB filesystem block size doesn’t “save space” on 4K-optimized SSDs; it wastes throughput by forcing read-modify-write operations. A perfectly tuned IO Scheduler can still throttle requests if given incorrect disk properties. Sufficient parallelism doesn’t guarantee high throughput if request sizes are too small to fill device queues. By minding the disk, the filesystem, and the application – and by understanding how they interact – you can fully take advantage of modern storage hardware and build systems that reliably deliver the performance the storage vendor advertised.
Kelsey Hightower’s Take on Engineering at Scale
At ScyllaDB’s Monster SCALE Summit 25, Hightower shared why facing scale the hard way is the right way
“There’s a saying I like: ‘Some people have good ideas. Some ideas have people.’ When your idea outlives you, that’s success.” – Kelsey Hightower
Kelsey Hightower’s career is a perfect example of that. His ideas have taken on a life of their own, extending far beyond his work at Puppet and CoreOS to KubeCon and Google. And he continues to scale his impact with his signature unscripted keynotes as well as the definitive book, “Kubernetes Up and Running.” We were thrilled that Hightower joined ScyllaDB’s Monster SCALE Summit to share his experiences and advice on engineering at scale. And to scale his insights beyond those who joined us for the live keynote, we’re capturing some of the most memorable moments and takeaways here. Note: Monster SCALE Summit 2026 will go live March 11-12, featuring antirez, creator of Redis; Camille Fournier, author of “The Manager’s Path” and “Platform Engineering”; Martin Kleppmann, author of “Designing Data-Intensive Applications”; and more than 50 others. The event is free and virtual; register and join the community for some lively chats. Monster Scale Summit 2026 – Free + Virtual
Fail Before You Scale
The interview with host Tim Koopmans began with a pointed warning for attendees of a conference that’s all about scaling: Don’t just chase scale because you’re fascinated by others’ scaling strategies and achievements. You need to really experience some pain personally first. As Hightower put it: “A lot of people go see a conference talk — I’m probably guilty of this myself — and then try to ‘do scale things’ before they even have experience with what they’re doing. Back at Puppet Labs, lots of people wrote shell scripts with bad error handling. Things went awry when conditions weren’t perfect. Then they moved on to configuration management, and those who made that journey could understand the trade-offs. Those who started directly with Puppet often didn’t.” “Be sure you have a reason,” he said. “So before you over-optimize for a problem you may not even have, you should ask: ‘How bad is it really? Where are the metrics proving that you need a more scalable solution?’ Sometimes you can do nothing and just wait for scaling to become the default option.” Ultimately, you should hope for the “good problem” where increasing demand causes you to hit the limits of your tech, he said. That’s much better than having few customers and over-engineering for problems you don’t have.
Make the Best Choice You Can … For Now
The conversation shifted to what level of scale teams should target with their initial tech stack and each subsequent iteration. Should you optimize for a future state that hasn’t happened yet? Play it safe in case the market changes? “If you’re not sure whether you’re on the right stack … I promise you, it’s going to change,” Hightower said. “Make the best choice you can for now. You can spend all year optimizing for ‘the best thing,’ but it may not be the best thing 10 years from now. Say you pick a database, go all in, learn best practices. But put a little footnote in your design doc: ‘Here’s how we’d change this.’ Estimate the switching cost. If you do that, you won’t get stuck in sunk cost fallacy later.” Rather than trying to predict the future, think about how to avoid getting trapped. You don’t want dependencies or extensions to limit your ability to migrate when it’s time to take a better (or just different) path. 
“Change isn’t failure,” he emphasized. “Plan for it; don’t fear it.” In Hightower’s view, scaling decisions should start on a whiteboard, not in code. “When I was at Google, we’d do technical whiteboard sessions. Draw a line — that’s time. “Today, we’re here. Our platform allows us to do these things. Is that good or bad?” Then draw ahead: “Where do we want to be in two years?” He continued, “Usually that’s driven by teams and customer needs. You can’t do everything at once. So plot milestones — six months, a year, etc. You can push things out in time for when new libraries or tools arrive. If something new shows up that gets you two years ahead instantly, great. Having a timeline gives you freedom without guilt that you can’t ship everything today.” Are You Really Prepared For a 747? Following up on the Google thread, Koopmans asked, “I’d love to hear practical ways Google avoids over-engineering when designing for scale.” To illustrate why “Google-scale” solutions don’t always fit everyone else, Hightower used a memorable analogy: “I had a customer once say, ‘We want BigQuery on-prem.’ I said, ‘You do? Really? OK, how much money do you have?’ And it was one of those companies that had plenty of capital, so that wasn’t the issue. I told them, ‘That would be like going to the airport, looking out the window, seeing a brand-new 747 and telling the airline that you want that plane. Even if they let you buy it, you don’t have a pilot’s license, you don’t know how to fuel it. Where are you going to park it? Are you going to drive it down your subdivision, decapitating the roofs of your neighbors’ houses?” Some things just aren’t meant for everyone.” Ultimately, whether it’s over-engineering or not depends on the target user. Understand who they are, how they work and what tools they use, then build with that in mind. Hightower also warned against treating “best practices” as universal truths: “One question that most customers show up with is, ‘What are the best practices?’ Not necessarily the best practices for me. They just want to know what everyone else is doing. I think that might be another anti-pattern in the mix, where you only care about what everyone else is doing and you don’t bring the necessary context for a good recommendation.” How Leaders Should Think About Dev Tooling “Serializing engineering culture” (Hightower’s phrase) like Google did with its google3 monorepo makes it simple for thousands of new engineers to join the team and start contributing almost instantly. During his tenure at Google, everything from gRPC to deployment tools was integrated. Engineers just opened a browser, added code and reviews would start automatically. However, there’s a fine line between serializing and stifling. Hightower believes that prohibiting engineers from even installing Python on their laptops, for example, is overkill: “That’s like telling Picasso he can’t use his favorite brush.” He continued: “Everyone works differently. As a leader, learn what tools people actually use and promote sharing. Have engineers show their workflows — the shortcuts, setups and plugins that make them productive. That’s where creativity and speed come from. Share the nuance. Most people think their tricks are too small to matter, but they do. I want to see your dotfiles! You’ll inspire others.” Watch the Complete Talk As Hightower noted, “Some people have good ideas. 
Some ideas have people.” His approach to scale – pragmatic, context-driven and human – shows why some ideas really do outlive the people who created them. You can see the full talk below. Fun fact: it was truly an unscripted interview – Hightower insisted! The team met him in the hotel lobby that morning, chatted a bit during a coffee run, prepped the camera angles … and suddenly Hightower and Koopmans were broadcasting to 20,000 attendees around the world.
Claude Code Marketplace Now Available
Claude Code has become an indispensable part of my daily workflow. I use it for everything from writing code to debugging production issues. But while Claude is incredibly capable out of the box, there are areas where injecting specialized domain knowledge makes it dramatically more useful.
That’s why I built a plugin marketplace. Yesterday I released rustyrazorblade/skills, a collection of Claude Code plugins that extend Claude with expert-level knowledge in specific domains. The first plugin is something I’ve been talking about doing for a while: a Cassandra expert.
The Deceptively Simple Act of Writing to Disk
Tracking down a mysterious write throughput degradation
From a high-level perspective, writing a file seems like a trivial operation: open, write data, close. Modern programming languages abstract this task into simple, seemingly instantaneous function calls. However, beneath this thin veneer of simplicity lies a complex, multi-layered gauntlet of technical challenges, especially when dealing with large files and high-performance SSDs. For the uninitiated, the path from application buffer to persistent storage is fraught with performance pitfalls and unexpected challenges. If your goal is to master the art of writing large files efficiently on modern hardware, understanding all the details under the hood is essential. This article will walk you through a case study of fixing a throughput performance issue. We’ll get into the intricacies of high-performance disk I/O, exploring the essential technical questions and common oversights that can dramatically affect reliability, speed, and efficiency. It’s part 2 of a 3-part series. Read part 1 | Read part 3
When lots of work leads to a performance regression
If you haven’t yet read part 1 (When bigger instances don’t scale), now’s a great time to do so. It will help you understand the origin of the problem we’re focusing on here. TL;DR: In that blog post, we described how we managed to figure out why a new class of highly performant machines didn’t scale as expected when instance sizes increased. We discovered a few bugs in our Seastar IO Scheduler (stick around a bit, I’ll give a brief description of that below). That helped us measure scalable bandwidth numbers. At the time, we believed these new NVMes were inclined to perform better with 4K requests than with 512-byte requests. We later discovered that the latter issue was not related to the scheduler at all. We were actually chasing a firmware bug in the SSD controller itself. These disks do, in fact, perform better with 4K requests. What we initially thought was a problem in our measurement tool (IOTune) turned out to be something else entirely. IOTune wasn’t misdetecting the disk’s physical sector size (this is the request size at which a disk can achieve the best IOPS). Instead, the disk firmware was reporting it incorrectly. It was reporting it as 512 bytes. However, in reality, it was 4K. We worked around the IOPS issue since the cloud provider wasn’t willing to fix the firmware bug due to backward-compatibility concerns. We also deployed the IO Scheduler fixes, and the disk models (io-properties) we measured with IOTune scaled nicely with the size of the instance. Still, in real workload tests, ScyllaDB didn’t like it. Performance results of some realistic workloads showed a write throughput degradation of around 10% on some instances provisioned with quite new and very fast SSDs. While this wasn’t much, it was alarming because we were kind of expecting an improvement after the last series of fixes. These first two charts give us a good indication of how well ScyllaDB utilizes the disk. In short, we’re looking for both of them to be as stable as possible and as close to 100% as possible. The “I/O Group consumption” metric tracks the amount of shared capacity currently taken by in-flight operations from the group (reads, writes, compaction, etc.). It’s expressed as a percentage of the configured disk capacity. The “I/O Group Queue flow ratio” metric in ScyllaDB measures the balance between incoming I/O request rates and the dispatch rate from the I/O queue for a given I/O group. 
It should be as close as possible to 1.0, because requests should not accumulate in the disk. If it jumps up, it means one of two things. The reactor might be constantly falling behind and not kicking the I/O queue in a timely manner. Or, it can mean that the disk is slower than we told ScyllaDB it was – and the scheduler is overloading it with requests. The spikes here indicate that the IO Scheduler doesn’t provide very good QoS. That led us to believe the disk was overloaded with requests, so we ended up not saturating the throughput. The following throughput charts for the commitlog, memtable, and compaction groups reinforce this claim. The 4-NVMe RAID array we were testing against was capable of around 14.5 GB/s throughput. We expected that at any point in time during the test, the sum of the bandwidths for those three groups would get close to the configured disk capacity. Please note that according to the formula described in the section below on the IO Scheduler, bandwidth and IOPS have a competing relationship. It’s not possible to reach the maximum configured bandwidth because that would leave you with 0 space for IOPS. The reverse holds true: you cannot reach the maximum IOPS because that would mean your request size got so low that you’re most likely not getting any bandwidth from the disk. At the end of the chart, we would expect an abrupt drop in throughput for the commitlog and memtable groups because the test ends, with the compaction group rapidly consuming most of the 14.5 GB/s of disk throughput. That was indeed the case, except that these charts are very spiky as well. In many cases, summing up the bandwidth for the three groups shows that they consume around 12.5 GB/s of the disk’s total throughput.
The Seastar IO Scheduler
Seastar uses an I/O scheduler to coordinate shards for maximizing the disk’s bandwidth while still preserving great IOPS. Basically, the scheduler lets Seastar saturate a disk and at the same time fit all requests within a latency goal configured on the library (usually 0.5ms). A detailed explanation of how the IO Scheduler works can be found in this blog post. But here’s a summary of where the max bandwidth/IOPS values come from and where they go within the ScyllaDB IO Scheduler. I believe it will connect the problem description above with the rest of the case study below. The IOTune tool is a disk benchmarking tool that ships with Seastar. When you run this tool on a disk, it will output 4 values corresponding to read/write IOPS and read/write bandwidth. These 4 values end up in a file called io-properties.yaml. When provided with these values, the Seastar IO Scheduler will build a model of your disk. This model then helps ScyllaDB maximize the drive’s performance. The IO Scheduler models the disk based on the IOPS and bandwidth properties using a formula that looks something like:
read_bw/read_bw_max + write_bw/write_bw_max + read_iops/read_iops_max + write_iops/write_iops_max <= 1
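To connect the formula to code, here is a toy rendition of that capacity check. It is a deliberate simplification of the real scheduler, and the io-properties numbers below are made up.

```cpp
// Toy version of the disk model above: a request mix "fits" the disk if the
// sum of its normalized bandwidth and IOPS fractions stays at or below 1.
#include <cstdio>

struct IoProperties {   // the four values IOTune writes into io-properties.yaml
    double read_bw_max, write_bw_max;       // bytes/s
    double read_iops_max, write_iops_max;   // requests/s
};

bool fits(const IoProperties& p, double read_bw, double write_bw,
          double read_iops, double write_iops) {
    double load = read_bw / p.read_bw_max + write_bw / p.write_bw_max +
                  read_iops / p.read_iops_max + write_iops / p.write_iops_max;
    return load <= 1.0;
}

int main() {
    IoProperties disk{7.0e9, 5.0e9, 800e3, 400e3};   // made-up io-properties
    // A pure 128K sequential write stream at 3 GB/s is roughly 23k write IOPS.
    double bw = 3.0e9, iops = bw / (128 * 1024);
    printf("fits: %s\n", fits(disk, 0, bw, 0, iops) ? "yes" : "no");
    return 0;
}
```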
The internal mechanics of how the IO Scheduler works are detailed
in the blog post linked above.
Peak and sustained throughput
We
observed some bursty behavior in the SSDs under test. It wasn’t
much; iops/bandwidth would be around 5% lower than the values
measured by running the benchmark for the default 2 minutes. The
iops/bandwidth values would start stabilizing at around 30 minutes
and that’s what we call sustained io-properties. We thought our
IOTune runs might have recorded peak disk io-properties (i.e., we
ran the tool for too short of a duration – the default is 120
seconds). With a realistic workload, we are actually testing the
disks at their sustained throughput, so the IO Scheduler builds an
inflated model of the disk and ends up sending more requests than
the disk can actually handle. This would cause the overload we saw
in the charts. We then tested with newly measured sustained
io-properties (with the IOTune duration configured to run 30
minutes, then 60 minutes). However, there wasn’t any noticeable
improvement in the throughput degradation problem…and the charts
were still spiky.
Disk saturation length
The disk saturation
length, as defined and measured by IOTune, is the smallest request
length that’s needed to achieve the maximum disk throughput. All
the disks that we’ve seen so far had a measured saturation length
of 128K. This means that it should be perfectly possible to achieve
maximum throughput with 128K requests. We noticed something quite
odd while running tests on these performant NVMes: the Seastar
IOTune tool would report a saturation length of 1MB. We immediately
panicked because there are a few important assumptions that rely on
us being able to saturate disks with 128K requests. The issue
matched the symptoms we were chasing. A disk model built with the
assumption that saturation length is 1MB would trick the IO
Scheduler into allowing a higher number of 128K requests (the
length Seastar uses for sequential writes) than the disk controller
can handle efficiently. In other words, the IO Scheduler would try
to achieve the high throughput measured with 1MB request length,
but using 128K requests. This would make the disk appear
overloaded, as we saw in the charts. Assume you’re trying to reach
the maximum throughput on a common disk with 4K requests, for
instance. You won’t be able to do it. And since the throughput
would stay below the maximum, the IO Scheduler would stuff more and
more requests into the disk. It’s hoping to reach the maximum – but
again, it won’t be reached. The side effect is that the IO
Scheduler ends up overloading the disk controller with requests,
increasing in-disk latencies for all the other processes trying to
use it. As is typical when you’re navigating muddy waters like
these, this turned out to be a false lead. We were stepping on some
bugs in our measuring tools, IOTune and io_tester. IOTune was
running with lower parallelism than those disks needed for
saturation. And io_tester was measuring overwriting a file rather
than writing to a pre-truncated new file. The saturation length of
this type of disk was still 128K, as we had seen in the past.
Fortunately, that meant we didn’t need to make potential
architectural changes in Seastar in order to accommodate larger
requests. A nice observation we can make here based on the tests we
ran trying to dis/prove this theory is that extent allocation is a
rather slow process. If the allocation group is already quite busy
(a few big files already exist under the same directory, for
instance), the effect on throughput when appending and extending a
file is quite dramatic.
Internal parallelism
Another interesting
finding was that the disk/filesystem seemed to suffer from internal
parallelism problems. We ran io_tester with 128k requests, 8 fibers
per shard with 8 shards and 64 shards. The results were very odd.
The expected bandwidth was ~12.7GB/s, but we were confused to see
it drop when we increased the number of shards.
Generally, the bandwidth vs. latency dependency is defined by two
parts. If you measure latency and bandwidth while you increase the
app parallelism, the latency is constant and the bandwidth grows as
long as the internal disk parallelism is not saturated. Once the
disk is throughput loaded, the bandwidth stops growing or grows
very little, while the latency scales almost linearly with the
increase of the input. However, in the table above, we see
something different. When the disk is overloaded, latency increases
(because it holds more requests internally), but the bandwidth
drops at the same time. This might explain why IOTune measured
bandwidths around 12GB/s, while under the real workload (capped
with IOTune-measured io-properties.yaml), the disk behaved as if it
was overloaded (i.e., high latency and the actual bandwidth below
the maximum bandwidth). When IOTune measures the disk, shards load
the disk one-by-one. However, the real test workload sends parallel
requests from all shards at once. The storage device was a RAID0
with 4 NVMes and a RAID chunk size of 1MB. The theory here was that
since each shard writes via 8 fibers of 128k requests, it’s
possible that many shards could end up writing to the same disk in
the array. The explanation is that XFS aligns files on 1MB
boundaries. If all shards start at the same file offset and move at
relatively the same speed, the shards end up picking the same drive
from the array. That means we might not be measuring the full
throughput of the raid array. The measurements confirm that the
shards do not consistently favor the same drive. Single disk
throughput was measured at 3.220 GB/s while the entire 4-disk array
achieved a throughput of 10.9 GB/s. If they were picking the same
disk all the time, the throughput of the entire 4-disk array
would’ve been equal to that of a single disk (i.e. 3.2GB/s). This
lead ended up being a dead end. We tried to prove it in a
simulator, but all requests ended up shuffled evenly between the
disks. Sometimes, interesting theories that you bet can explain
certain effects just don’t hold true in practice. In this case,
something else sits at the base of this issue.
XFS formatting
Although the previous lead didn’t get us very far, it opened a very
nice door. We noticed that the throughput drop is reproducible with
64 shards (rather than 8) on a RAID mounted with the
scylla_raid_setup script (a ScyllaDB script which does
preparation work on a new machine, e.g., formats the filesystem and
sets up the RAID array), but not on a raw block device and not on a
RAID created with default parameters.
Comparing the running mkfs.xfs commands, we spotted a few
differences. In the table below, notice how the XFS parameters
differ between default-mounted XFS and XFS mounted by
scylla_raid_setup. The 1K vs 4K data.bsize difference
stands out. We
also spotted – and this is an important
observation – that truncating the test files to some large
size seems to bring back the stolen throughput. The results coming
from this observation are extremely surprising. Keep reading to see
how this leads us to actually figuring out the root cause of the
problem. The table below shows the throughput in MB/s when running
tests on files that are being appended and extended and files that
are being pre-truncated to their final size (both cases were run on
XFS mounted with the Scylla script).
We’ve experimented with Seastar’s sloppy_size=true file option,
which truncates the file’s size to double every time it sees an
access past the current size of the file. However, while it
improved the throughput numbers, it unfortunately still left half
of the missing throughput on the table.
RWF_NOWAIT and non-blocking IO
The first lead that we got from here was by running our tests
under strace. Apparently, all of our writes would
get re-submitted out of the Seastar Reactor thread to XFS threads.
Seastar uses RWF_NOWAIT to attempt non-blocking AIO
submissions directly from the Seastar Reactor thread. It sets
aio_rw_flags=RWF_NOWAIT on IOCBs during initial
io_submit calls from the Reactor backend. On modern
kernels, this flag requests immediate failure (EAGAIN)
if the submission may block, preserving reactor responsiveness. On
io_submit failure with RWF_NOWAIT,
Seastar clears the flag and queues the IOCBs for retrying. The
thread pool then executes the retry submissions without the
RWF_NOWAIT flag, as these can tolerate blocking. We
thought reducing the number of retries would increase throughput.
Unfortunately, it didn’t actually do that when we disabled the
flag. The cause for the throughput drop is uncovered next. As for
the RWF_NOWAIT issue, it’s still unclear why it
doesn’t affect throughput. However, the fix was a kernel patch by
our colleague Pavel Emelyanov which fiddles with inode time update
when IOCBs carry the RWF_NOWAIT flag. More details on
this would definitely exceed the scope of this blog post.
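As an aside, the submit-without-blocking pattern itself is easy to picture. The sketch below is an illustration only: Seastar does this through io_submit() and per-IOCB flags, but the same RWF_NOWAIT flag is accepted by the simpler pwritev2() entry point on reasonably recent kernels, which keeps the example self-contained.

```cpp
// Illustration of "try a non-blocking write first, fall back to a path that
// may block". Seastar's real implementation uses linux-aio (io_submit) and
// retries from a worker thread pool; this sketch uses pwritev2() instead.
#include <fcntl.h>
#include <sys/uio.h>    // pwritev2, RWF_NOWAIT (glibc exposes them with _GNU_SOURCE)
#include <cerrno>

ssize_t write_prefer_nonblocking(int fd, const void* buf, size_t len, off_t off) {
    struct iovec iov { const_cast<void*>(buf), len };

    // Fast path: fail with EAGAIN instead of blocking on locks or metadata.
    ssize_t n = pwritev2(fd, &iov, 1, off, RWF_NOWAIT);
    if (n >= 0 || errno != EAGAIN)
        return n;

    // Slow path: in a reactor this would be handed off to a worker thread;
    // here we simply retry without the flag and accept that it may block.
    return pwritev2(fd, &iov, 1, off, 0);
}
```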
blktrace and offset alignment distribution
Returning to our throughput
performance issue, we started running io_tester experiments with
blktrace on, and we noticed something strange. For
around 25% of the requests, io_tester was submitting 128k requests
and XFS would queue them as 256 sector requests (256 sectors x 512
bytes reported physical sector size equals 128k). However, it
would split the requests and complete them in 2 parts. (Note the Q
label on the first line of the output below; this indicates the
request was queued) The first part of the request would finish in
161 microseconds, while the second part would finish in 5249
microseconds. This dragged down the latency of the whole request to
5249 microseconds (the 4th column in the table is the timestamp of
the event in seconds; the latency of a request is max(ts_completed
– ts_queued)). The
remaining 75% of requests were queued and completed in one go as
256 sector requests. They were also quite fast: 52 microseconds, as
shown below.
The explanation for the split is related to how 128k requests hit
the file, given that XFS lays it out on disk, considering a 1MB
RAID chunk size. The split point occurs at address 21057335224 +
72, which translates to hex 0x9CE3AE00000. This reveals it is, in
fact, a multiple of 0x100000 – the 1MB RAID chunk boundary. We can
discuss optimizations for this, but that’s outside the scope of
this article. Unfortunately, it was also out of scope for our
throughput issue. However, here are some interesting charts showing
how request offsets alignment looks based on the
blktrace events we collected.
default formatted XFS with 4K block size
scylla_raid_setup XFS: 1K block
scylla_raid_setup XFS: 1K block + truncation
This result is
significant! For XFS formatted with a 4K block size, most
requests are 4K aligned. For XFS formatted with
scylla_raid_setup (1K block size), most requests are
1K or 2K aligned. For XFS formatted with
scylla_raid_setup (1K block size) and with test files
truncated to their final size, all requests are 64K aligned
(although in some cases we also saw them being 4K aligned). It turns out that XFS lays out files on disk very differently when the final file size is known in advance compared to when the file is grown on demand – and on-demand growth results in IO that is not aligned to the disk block size.
Punchline
Now here comes the explanation of the problem we’ve
been chasing since the beginning. In
the first part of the article, we saw that doing random writes
with 1K requests produces worse IOPS than with 4K requests on these
4K optimized NVMes. This happens because when executing the 1K
request, the disk needs to perform Read-Modify-Write to land the
data into chips. When we submit 128k requests (as ScyllaDB does)
that are 1K or 2K aligned (see the alignment distributions in the
charts above), the disk is forced to do RMW on the head and tail of
the requests. This slows down all the requests (unrelated, but
similar in concept to the raid chunk alignment split we’ve seen
above). Individually, the slowdown is probably tiny. But since most
requests are 1k and 2k aligned on XFS formatted with 1k block size
(no truncation), the throughput hit is quite significant. It’s very
interesting to note that, as shown in the last chart above,
truncation also improved the alignment distribution quite
significantly, and also improved throughput. It also appeared to
significantly shorten the list of extents created for our test
files. For ScyllaDB, the right solution was to format XFS with a 4K
block size. Truncating to the final size of the file wasn’t really
an option for us because we can’t predict how big an SSTable will
grow. Since sloppy_size’ing the files didn’t provide great results,
we agreed that 4K-formatted XFS was the way to go. The throughput
degradation we got using the higher io-properties numbers seems to
be solved. We initially expected to see “improved performance”
compared to the original low io-properties case (i.e., a higher
measured write throughput). The success wasn’t obvious, though. It
was rather hidden within the dashboards, as shown below. Here’s
what disk utilization and IO flow ratio charts look like. The disk
is fully utilized and clearly not overloaded anymore.
And here are memtable and commitlog charts. These look very similar
to the charts we got with the initial low io-properties numbers
from the
“When bigger instances don’t scale” article. Most likely, this
means that’s what the test can do.
The
good news was hidden here. While the test went full speed ahead,
the compaction job filled all the available throughput, from
~10GB/s (when commitlog+memtable were running at 4.5GB/s) to
14.5GB/s (when commitlog and memtable flush processes were done).
The
only thing left to check was whether the filesystem formatted with
the 4K block size would cause read amplification on older disks
with 512 bytes physical sector size. It turns out it didn’t. We
were able to achieve similar IOPS on a RAID with 4 NVMes of the
older type.
Next up:
Part 3, Common performance pitfalls of modern storage I/O.
Inside the 2026 Monster Scale Summit Agenda
Monster Scale Summit is all about extreme scale engineering and data-intensive applications. So here’s a big announcement: the agenda is now available! Attendees can join 50+ tech talks, including:
- Keynotes by antirez, Camille Fournier, Pat Helland, Joran Greef, Thea Aarrestad, Dor Laor, and Avi Kivity
- Martin Kleppmann & Chris Riccomini chatting about the second edition of Designing Data-Intensive Applications
- Tales of extreme scale engineering – Rivian, Pinterest, LinkedIn, Nextdoor, Uber Eats, Google, Los Alamos National Labs, CERN, and AmEx
- ScyllaDB customer perspectives – Discord, Disney, Freshworks, ShareChat, SAS, Sprig, MoEngage, Meesho, Tiket, and Zscaler
- Database engineering – Inside looks at ScyllaDB, turbopuffer, Redis, ClickHouse, DBOS, MongoDB, DuckDB, and TigerBeetle
- What’s new/next for ScyllaDB – Vector search, tablets, tiered storage, data consistency, incremental repair, Rust-based drivers, and more
Like other ScyllaDB-hosted conferences (e.g., P99 CONF), the event will be free and virtual so everyone can participate. Take a look, register, and start choosing your own adventure across the multiple tracks of tech talks. Full Agenda Register [free + virtual]
When you join us March 11 and 12, you can…
- Chat directly with speakers and connect with ~20K of your peers
- Participate in interactive trainings on topics like real-time AI, database performance at scale, high availability, and cloud cost optimization strategies
- Pick the minds of ScyllaDB engineers and architects, who are available to answer your toughest database performance questions
- Win conference swag, sea monster plushies, book bundles, and other cool giveaways
Details, Details
The agenda site has all the scheduling, abstracts, and speaker details. Please note that times are shown in your local time zone. Be sure to scroll down into the Instant Access section. This is one of the best parts of Monster SCALE Summit. You can access these sessions from the minute the event platform opens until the conference wraps. Some teams have shared that they use Instant Access to build their own watch parties beyond the live conference hours. If you do this, please share photos! Another important detail: books. Quite a few of our speakers are published authors.
Apache Cassandra® 5.0: Improving performance with Unified Compaction Strategy
Introduction
Unified Compaction Strategy (UCS), introduced in Apache Cassandra 5.0, is a versatile compaction framework that not only unifies the benefits of Size-Tiered (STCS) and Leveled (LCS) Compaction Strategies, but also introduces new capabilities like shard parallelism, density-aware SSTable organization, and safer incremental compaction, all of which deliver more predictable performance at scale. By utilizing a flexible scaling model, UCS allows operators to tune compaction behavior to match evolving workloads, spanning from write-heavy to read-heavy, without requiring disruptive strategy migrations in most cases.
In the past, operators had to choose between rigid strategies and accept significant trade-offs. UCS changes this paradigm, allowing the system to efficiently adapt to changing workloads with tuneable configurations that can be altered mid-flight and even applied differently across different compaction levels based on data density.
Why compaction matters
Compaction is the critical process that determines a cluster’s long-term health and cost-efficiency. When executed correctly, it produces denser nodes with highly organized SSTables, allowing each server to store more data without sacrificing speed. This efficiency translates to a smaller infrastructure footprint, which can lower cloud costs and resource usage.
Conversely, inefficient compaction is a primary driver of performance degradation. Poorly managed SSTables lead to fragmented data, forcing the system to work harder for every request. This overhead consumes excessive CPU and I/O, often forcing teams to try adding more nodes (horizontal scale) just to keep up with background maintenance noise.
Key concepts and terminology
To understand how UCS optimizes a cluster, it is necessary to understand the fundamental trade-offs it balances:
- Read amplification: Occurs when the database must consult multiple SSTables to answer a single query. High read amplification acts as a “latency tax,” forcing extra I/O to reconcile data fragments.
- Write amplification: A metric that quantifies the overhead of background processes (such as compactions). It represents the ratio between total data written to disk and the amount of data originally sent by an application; for example, if an application writes 100 GB and compaction later rewrites that data three more times, roughly 400 GB hits the disk and the write amplification is about 4. High write amplification wears out SSDs and steals throughput.
- Space amplification: The ratio of disk space used to the actual size of the “live” data. It tracks data such as tombstones or overwritten rows that haven’t been purged yet.
- Fan factor: The “growth dial” for the cluster data hierarchy. It defines how many files of a similar size must accumulate before they are merged into a larger tier.
- Sharding: UCS splits data into smaller, independent token ranges (shards), allowing the system to run multiple compactions in parallel across CPU cores.
UCS provides baseline architectural improvements that were not available in older strategies:
Improved compaction parallelism
Older strategies often got stuck on a single thread during large merges. UCS sharding allows a server to use its full processing power. This significantly reduces the likelihood of compaction storms and keeps tail latencies (p99) predictable.
Reduced disk space amplification
Because UCS operates on smaller shards, it doesn’t need to double the entire disk space of a node to perform a major merge. This greatly reduces the risk of nodes running out of space during heavy maintenance cycles.
Density-based SSTable organization
UCS measures SSTables by density (token range coverage). This mitigates the huge SSTable problem where a single massive file becomes too large to compact, hindering read performance indefinitely.
Scaling parameter
The scaling parameter (denoted as W) is a configurable setting that determines the size ratio between compaction tiers. It helps balance write amplification and read performance by controlling how much data is rewritten during compaction operations. A lower scaling parameter value results in more frequent, smaller compactions, whereas a higher value leads to larger compaction groups.
The strategy engine: tuning and parameters
UCS acts as a strategy engine by adjusting the scaling parameter (W), allowing UCS to mimic, or outperform, its predecessors STCS and LCS.
At a high level, the scaling parameter influences the effective fan-out behavior at each compaction level. Tiered-style settings such as T4 allow more SSTables to accumulate before merging, favoring write efficiency, while leveled-style settings such as L10 keep SSTables more tightly organized, reducing read amplification at the cost of additional background work.
The numbers below are illustrative and not prescriptive:
UCS configuration guide:
- Heavy writes / IoT – Strategy target: STCS (Tiered); Scaling (W): positive / tiered (e.g., T4); Primary benefit: lowest write amplification
- Heavy reads – Strategy target: LCS (Leveled); Scaling (W): negative / leveled (e.g., L10); Primary benefit: lowest read amplification
- Balanced – Strategy target: Hybrid; Scaling (W): zero (0); Primary benefit: balanced performance for general apps
Practical example
UCS allows operators to mix behaviors across the data lifecycle.
'scaling_parameters': 'T4, T4, L10'
Note that scaling_parameters takes a string format that can accommodate parameters for per-level tuning.
This example instructs a cluster: “Use tiered compaction for the first two levels to keep up with the high write volume, but once data reaches the third level, reorganize it into a leveled structure so reads stay fast.”
Here’s a fuller, illustrative example of how one might structure their CQL to change the compaction strategy.
ALTER TABLE keyspace_name.table_name WITH compaction = { 'class': 'UnifiedCompactionStrategy', 'scaling_parameters': 'T4,T4,L10' };
Operational evolution: moving beyond major compactions
In older strategies and in Apache Cassandra versions prior to 5.0, operators often felt forced to run a major compaction to reclaim disk space or fix performance. This was a critical event that could impact a node’s I/O for extended periods of time and required substantial free disk space to complete.
Because UCS is density-aware and sharded, it effectively performs compactions constantly and granularly so major compactions are rarely needed. It identifies overlapping data within specific token ranges (shards) and cleans them up incrementally. Operators no longer must choose between a fragmented disk and a risky, resource-heavy manual compaction; UCS keeps data density more uniform across the cluster over time.
The migration advantage: “in-place” adoption
One of the key performance features of a UCS migration is in-place adoption, meaning that when a table is switched to UCS, it does not immediately force a massive data rewrite. Instead, it looks at the existing SSTables, calculates their density, and maps them into its new sharding structure.
This allows for moving from STCS or LCS to UCS with significantly less I/O overhead than any other strategy change.
Conclusion
UCS is an operational shift toward simplicity and predictability. By removing the need to choose between compaction trade-offs, UCS allows organizations to scale with confidence. Whether handling a massive influx of IoT data or serving high-speed user profiles, UCS helps clusters remain performant, cost-effective, and ready for the future.
On a newly deployed NetApp Instaclustr Apache Cassandra 5 cluster, UCS is already the default strategy (while Apache Cassandra 5.0 has STCS set as the default).
Ready to experience this new level of Cassandra performance for yourself? Try it with a free 30-day trial today!
The post Apache Cassandra® 5.0: Improving performance with Unified Compaction Strategy appeared first on Instaclustr.
Getting Started with Database-Level Encryption at Rest in ScyllaDB Cloud
Learn about ScyllaDB database-level encryption with Customer-Managed Keys & see how to set up and manage encryption with a customer key — or delegate encryption to ScyllaDB ScyllaDB Cloud takes a proactive approach to ensuring the security of sensitive data: we provide database-level encryption in addition to the default storage-level encryption. With this added layer of protection, customer data is always protected against attacks. Customers can focus on their core operations, knowing that their critical business and customer assets are well-protected. Our clients can either use a customer-managed key (CMK, our version of BYOK) or let ScyllaDB Cloud manage the CMK for them. The feature is available in all cloud platforms supported by ScyllaDB Cloud. This article explains how ScyllaDB Cloud protects customer data. It focuses on the technical aspects of ScyllaDB database-level encryption with Customer-Managed Keys (CMK). Storage-level encryption Encryption at rest is when data files are encrypted before being written to persistent storage. ScyllaDB Cloud always uses encrypted volumes to prevent data breaches caused by physical access to disks. Database-level encryption Database-level encryption is a technique for encrypting all data before it is stored in the database. The ScyllaDB Cloud feature is based on the proven ScyllaDB Enterprise database-level encryption at rest, extended with the Customer Managed Keys (CMK) encryption control. This ensures that the data is securely stored – and the customer is the one holding the key. The keys are stored and protected separately from the database, substantially increasing security. ScyllaDB Cloud provides full database-level encryption using the Customer Managed Keys (CMK) concept. It is based on envelope encryption to encrypt the data and decrypt only when the data is needed. This is essential to protect the customer data at rest. Some industries, like healthcare or finance, have strict data security regulations. Encrypting all data helps businesses comply with these requirements, avoiding the need to prove that all tables holding sensitive personal data are covered by encryption. It also helps businesses protect their corporate data, which can be even more valuable. A key feature of CMK is that the customer has complete control of the encryption keys. Data encryption keys will be introduced later (it is confusing to cover them at the beginning). The customer can: Revoke data access at any time Restore data access at any time Manage the master keys needed for decryption Log all access attempts to keys and data Customers can delegate all key management operations to the ScyllaDB Cloud support team if they prefer this. To achieve this, the customer can choose the ScyllaDB key when creating the cluster. To ensure customer data is secure and adheres to all privacy regulations. By default, encryption uses the symmetrical algorithm AES-128, a solid corporate encryption standard covering all practical applications. Breaking AES-128 can take an immense amount of time, approximately trillions of years. The strength can be increased to AES-256. Note: Database-level encryption in ScyllaDB Cloud is available for all clusters deployed in Amazon Web Services (AWS) and Google Cloud Platform (GCP). Encryption To ensure all user data is protected, ScyllaDB will encrypt: All user tables Commit logs Batch logs Hinted handoff data This ensures all customer data is properly encrypted. 
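To make the flow that the next paragraphs describe concrete, here is a minimal, self-contained sketch of envelope encryption. The kms_wrap_dek and toy_encrypt functions are placeholders invented for illustration; they are not ScyllaDB’s or any KMS provider’s real API, and the XOR “cipher” only stands in for AES.

fn kms_wrap_dek(dek: &[u8]) -> Vec<u8> {
    // Placeholder: in reality the DEK is sent to AWS KMS or GCP KMS and
    // encrypted there with the master key (MK), which never leaves the KMS.
    dek.iter().map(|b| b ^ 0xAA).collect()
}

fn toy_encrypt(data: &[u8], dek: &[u8]) -> Vec<u8> {
    // Placeholder symmetric cipher; the real system uses AES-128 or AES-256.
    data.iter().zip(dek.iter().cycle()).map(|(d, k)| d ^ k).collect()
}

fn main() {
    let record = b"sensitive row data";

    // 1. Generate a fresh data encryption key (DEK) for this data.
    let dek: Vec<u8> = (0u8..16).collect(); // placeholder key material

    // 2. Encrypt the record locally with the DEK.
    let ciphertext = toy_encrypt(record, &dek);

    // 3. Wrap the DEK with the master key in the KMS, producing the EDEK.
    let edek = kms_wrap_dek(&dek);

    // 4. Store ciphertext + EDEK together; the plaintext DEK is discarded.
    drop(dek);

    println!("stored {} ciphertext bytes and a {}-byte EDEK", ciphertext.len(), edek.len());
}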
The first step of the encryption process is to encrypt every record with a data encryption key (DEK). Once the data is encrypted, the DEK itself is sent to either AWS KMS or GCP KMS, where the master key (MK) resides. There the DEK is encrypted with the master key, producing an encrypted DEK (EDEK, also called a wrapped key). The master key remains in the KMS, while the EDEK is returned and stored with the data. The plaintext DEK used to encrypt the data is then destroyed to ensure data protection; a new DEK will be generated the next time new data needs to be encrypted.
Decryption
Because the original non-encrypted DEK is destroyed once the EDEK is produced, the stored EDEK cannot be used to decrypt the data directly – it must first be unwrapped, and that can only be done with the master key (MK) in the KMS. Once the DEK is unwrapped, the data can be decrypted. As you can see, the data cannot be decrypted without the master key – which is protected at all times in the KMS and cannot be “copied” outside the KMS. By revoking the master key, the customer can disable access to the data independently from database or application authorization.
Multi-region deployment
Adding new data centers to the ScyllaDB cluster will create additional local keys in those regions. All master keys support multiple regions, and a copy of each key resides locally in each region – ensuring that multi-regional setups are protected against regional cloud provider outages and disasters. The keys are available in the same region as the data center and can be controlled independently. If you use a Customer Key, cloud providers will charge you for the KMS: AWS charges $1/month, and GCP charges $0.06 for each cluster, prorated per hour. Each additional DC creates a replica that is counted as an additional key. There is also a cost per key request; ScyllaDB Enterprise uses those requests efficiently, resulting in an estimated monthly cost of up to $1 for a 9-node cluster. Managing encryption keys adds another layer of administrative work in addition to the extra cost. ScyllaDB Cloud therefore also offers database clusters encrypted using keys managed by ScyllaDB support. They provide the same level of protection, but our support team helps you manage the master keys. The ScyllaDB keys are applied by default and are subsidized by ScyllaDB.
Creating a Cluster with Database-Level Encryption
Creating a cluster with database-level encryption requires:
- A ScyllaDB Cloud account – if you don’t have one, you can create a ScyllaDB Cloud account here
- 10 minutes with a ScyllaDB Key, or 20 minutes creating your own key
To create a cluster with database-level encryption enabled, we will need a master key. We can either create a customer-managed key using the ScyllaDB Cloud UI, or skip this step completely and use a ScyllaDB Managed Key, which will skip the next six steps. In both cases, all the data will be protected by strong encryption at the database level. Instructions for setting up the customer-managed key can be found in the database-level encryption documentation. Transparent database-level encryption in ScyllaDB Cloud significantly boosts the security of your ScyllaDB clusters and backups.
Next Steps
Start using this feature in ScyllaDB Cloud. Get your questions answered in our community forum and Slack channel. Or, use our contact form.
We Built a Better Cassandra + ScyllaDB Driver for Node.js – with Rust
Lessons learned building a Rust-backed Node.js driver for ScyllaDB: bridging JS and Rust, performance pitfalls, and benchmark results This blog post explores the story of building a new Node.js database driver as part of our Student Team Programming Project. Up ahead: troubles with bridging Rust with JavaScript, a new solution being initially a few times slower than the previous one, and a few charts! Note: We cover the progress made until June 2025 as part of the ZPP project, which is a collaboration between ScyllaDB and University of Warsaw. Since then, the ScyllaDB Driver team adopted the project (and now it’s almost production ready). Motivation The database speaks one language, but users want to speak to it in multiple languages: Rust, Go, C++, Python, JavaScript, etc. This is where a driver comes in, acting as a “translator” of sorts. All the JavaScript developers of the world currently rely on the DataStax Node.js driver. It is developed with the Cassandra database in mind, but can also be used for connecting to ScyllaDB, as they use the same protocol – CQL. This driver gets the job done, but it is not designed to take full advantage of ScyllaDB’s features (e.g., shard-per-core architecture, tablets). A solution for that is rewriting the driver and creating one that is in-house, developed and maintained by ScyllaDB developers. This is a challenging task requiring years of intensive development, with new tasks interrupting along the way. An alternative approach is writing the new driver as a wrapper around an existing one – theoretically simplifying the task (spoiler: not always) to just bridging the interfaces. This concept was proven in the making of the ScyllaDB C / C++ driver, which is an overlay over the Rust driver. We chose the ScyllaDB Rust driver as the backend of the new JavaScript driver for a few reasons. ScyllaDB’s Rust driver is developed and maintained by ScyllaDB. That means it’s always up to date with the latest database features, bug fixes, and optimizations. And since it’s written in Rust, it offers native-level performance without sacrificing memory safety. [More background on this approach] Development of such a solution skips the implementation of complicated database handling logic, but brings its own set of problems. We wanted our driver to be as similar as possible to the Node.js driver so anyone wanting to switch does not need to do much configuration. This was a restriction on one side. On the other side, we have limitations of the Rust driver interface. Driver implementations differ and the API for communicating with them can vary in some places. Some give a lot of responsibility to the user, requiring more effort but giving greater flexibility. Others do most of the work without allowing for much customization. Navigating these considerations is a recurring theme when choosing to write a driver as a wrapper over a different one. Despite the challenges during development, this approach comes with some major advantages. Once the initial integration is complete, adding new ScyllaDB features becomes much easier. It’s often just a matter of implementing a few bridging functions. All the complex internal logic is handled by the Rust driver team. That means faster development, fewer bugs, and better consistency across languages. On top of that, Rust is significantly faster than Node.js. So if we keep the overhead from the bridging layer low, the resulting driver can actually outperform existing solutions in terms of raw speed. 
The environment: Napi vs Napi-Rs vs Neon With the goal of creating a driver that uses ScyllaDB Rust Driver underneath, we needed to decide how we would be communicating between languages. There are two main options when it comes to communicating between JavaScript and other languages: Use a Node API (NAPI for short) – an API built directly into the NodeJS engine, or Interface the program through the V8 JavaScript engine. While we could use one of those communication methods directly, they are dedicated for C / C++, which would mean writing a lot of unsafe code. Luckily, other options exist: NAPI-RS and Neon. Those libraries handle all the unsafe code required for using the C / C++ APIs and expose (mostly safe) Rust interfaces. The first option uses NAPI exclusively under the hood, while the Neon option uses both of those interfaces. After some consideration, we decided to use NAPI-RS over Neon. Here are the things we considered when deciding which library to use: – Library approach — In NAPI-RS, the library handles the serialization of data into the expected Rust types. This lets us take full advantage of Rust’s static typing and any related optimizations. With Neon, on the other hand, we have to manually parse values into the correct types. With NAPI-RS, writing a simple function is as easy as adding a #[napi] tag: Simple a+b example And in Neon, we need to manually handle JavaScript context: A+b example in Neon – Simplicity of use — As a result of the serialization model, NAPI-RS leads to cleaner and shorter code. When we were implementing some code samples for the performance comparison, we had serious trouble implementing code in Neon just for a simple example. Based on that experience, we assumed similar issues would likely occur in the future. – Performance — We made some simple tests comparing the performance of library function calls and sending data between languages. While both options were visibly slower than pure JavaScript code, the NAPI-RS version had better performance. Since driver efficiency is a critical requirement, this was an important factor in our decision. You can read more about the benchmarks in our thesis. – Documentation — Although the documentation for both tools is far from perfect, NAPI-RS’s documentation is slightly more complete and easier to navigate. Current state and capabilities Note: This represents the state as of May 2025. More features have been introduced since then. See the project readme for a brief overview of current and planned features. The driver supports regular statements (both select and insert) and batch statements. It supports all CQL types, including encoding from almost all allowed JS types. We support prepared statements (when the driver knows the expected types based on the prepared statement), and we support unprepared statements (where users can either provide type hints, or the driver guesses expected value types). Error handling is one of the few major functions that behaves differently than the DataStax driver. Since the Rust driver throws different types of errors depending on the situation, it’s nearly impossible to map all of them reliably. To avoid losing valuable information, we pass through the original Rust errors as is. However, when errors are generated by our own logic in the wrapper, we try to keep them consistent with the old driver’s error types. In the DataStax driver, you needed to explicitly call shutdown() to close the database connection. 
This generated some problems: when the connection variable was dropped, the connection sometimes wouldn’t stop gracefully, even keeping the program running in some situations. We decided to switch this approach, so that the connection is automatically closed when the variable keeping the client is dropped. For now, it’s still possible to call shutdown on the client. Note: We are still discussing the right approach to handling a shutdown. As a result, the behavior described here may change in the future. Concurrent execution The driver has a dedicated endpoint for executing multiple queries concurrently. While this endpoint gives you less control over individual requests — for example, all statements must be prepared and you can’t set different options per statement — these constraints allow us to optimize performance. In fact, this approach is already more efficient than manually executing queries in parallel (around 35% faster in our internal testing), and we have additional optimization ideas planned for future implementation. Paging The Rust and DataStax drivers both have built-in support for paging, a CQL feature that allows splitting results of large queries into multiple chunks (pages). Interestingly, although the DataStax driver has multiple endpoints for paging, it doesn’t allow execution of unpaged queries. Our driver supports the paging endpoints (for now, one of those endpoints is still missing) and we also added the ability to execute unpaged queries in case someone ever needs that. With the current paging API, you have several options for retrieving paged results: Automatic iteration: You can iterate over all rows in the result set, and the driver will automatically request the next pages as needed. Manual paging: You can manually request the next page of results when you’re ready, giving you more control over the paging process. Page state transfer: You can extract the current page state and use it to fetch the next page from a different instance of the driver. This is especially useful in scenarios like stateless web servers, where requests may be handled by different server instances. Prepared statements cache Whenever executing multiple instances of the same statement, it’s recommended to use prepared statements. In ScyllaDB Rust Driver, by default, it’s the user’s responsibility to keep track of the already prepared statements to avoid preparing them multiple times (and, as a result, increasing both the network usage and execution times). In the DataStax driver, it was the driver’s responsibility to avoid preparing the same query multiple times. In the new driver, we use Rust’s Driver Caching Session for (most) of the statement caching. Optimizations One of the initial goals for the project was to have a driver that is faster than the DataStax driver. While using NAPI-RS added some overhead, we hoped the performance of the Rust driver would help us achieve this goal. With the initial implementation, we didn’t put much focus on efficient usage of the NAPI-RS layer. 
When we first benchmarked the new driver, it turned out to be way slower compared to both the DataStax JavaScript driver and the ScyllaDB Rust driver…
Operations | scylladb-javascript-driver (initial version) [s] | Datastax-cassandra-driver [s] | Rust-driver [s]
62500 | 4.08 | 3.53 | 1.04
250000 | 13.50 | 5.81 | 1.73
1000000 | 55.05 | 15.37 | 4.61
4000000 | 227.69 | 66.95 | 18.43
Operations | scylladb-javascript-driver (initial version) [s] | Datastax-cassandra-driver [s] | Rust-driver [s]
62500 | 1.63 | 2.61 | 1.08
250000 | 4.09 | 2.89 | 1.52
1000000 | 15.74 | 4.90 | 3.45
4000000 | 58.96 | 12.72 | 11.64
Operations | scylladb-javascript-driver (initial version) [s] | Datastax-cassandra-driver [s] | Rust-driver [s]
62500 | 1.63 | 2.61 | 1.08
250000 | 4.09 | 2.89 | 1.52
1000000 | 15.74 | 4.90 | 3.45
4000000 | 58.96 | 12.72 | 11.64
Operations | scylladb-javascript-driver (initial version) [s] | Datastax-cassandra-driver [s] | Rust-driver [s]
62500 | 1.96 | 3.11 | 1.31
250000 | 4.90 | 4.33 | 1.89
1000000 | 16.99 | 10.58 | 4.93
4000000 | 65.74 | 31.83 | 17.26
Those results were a bit of a surprise, as we didn’t fully anticipate how much overhead NAPI-RS would introduce. It turns out that using JavaScript Objects introduced way higher overhead compared to other built-in types or Buffers. You can see on the following flame graph how much time was spent executing NAPI functions (yellow-orange highlight), which are related to sending objects between languages. Creating objects with NAPI-RS is as simple as adding the #[napi] tag to the struct we want to expose to the
NodeJS part of the code. This approach also allows us to create
methods on those objects. Unfortunately, given its overhead, we
needed to switch the approach – especially in the most used parts
of the driver, like parsing parameters, results, or other parts of
executing queries. We can create a napi object like this:
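Here is a minimal sketch; the Rectangle type and its fields are made up for illustration and are not the driver’s actual types:

use napi_derive::napi;

// The #[napi] attribute generates the glue code that exposes this struct to
// Node.js as a JavaScript class (with getters/setters for the pub fields).
#[napi]
pub struct Rectangle {
    pub width: u32,
    pub height: u32,
}

#[napi]
impl Rectangle {
    // Available in JavaScript as: const r = new Rectangle(3, 4)
    #[napi(constructor)]
    pub fn new(width: u32, height: u32) -> Self {
        Rectangle { width, height }
    }

    // Available in JavaScript as: r.area()
    #[napi]
    pub fn area(&self) -> u32 {
        self.width * self.height
    }
}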
NAPI-RS converts this into a corresponding JavaScript class, and we can use the struct from both JavaScript and Rust. When accepting
values as arguments to Rust functions exposed in NAPI-RS, we can
either accept values of the types that implement the
FromNapiValue trait, or accept references to values of
types that are exposed to NAPI (these implement the default
FromNapiReference trait). For example, with the following Rust function, the JavaScript code can just pass a plain number.
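A minimal sketch (the function name double is made up for illustration):

use napi_derive::napi;

// Because u32 implements FromNapiValue, this function can be called from
// JavaScript with a plain number, e.g. double(21).
#[napi]
pub fn double(value: u32) -> u32 {
    value * 2
}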
FromNapiValue is implemented for built-in types like
numbers or strings, and the
FromNapiReference trait is created automatically
when using the #[napi] tag on a Rust struct. Compared
to that, we need to manually implement
FromNapiValue for custom structs. However, this
approach allows us to receive those objects in functions exposed to
NodeJS, without the need for creating Objects – and thus
significantly improves performance. We used this mostly to improve
the performance of passing query parameters to the Rust side of the
driver. When it comes to returning values from Rust code, a type
must have a ToNapiValue trait implemented. Similarly,
this trait is already implemented for built-in types, and it is auto-generated by the macros when adding the #[napi] tag to the object. This auto-generated implementation was causing most of our performance problems. Luckily, we can also provide our own ToNapiValue implementation. If we return a raw value and create
an object directly in the JavaScript part of the code, we can avoid
almost all of the negative performance impacts that come from the
default implementation of ToNapiValue. We can do it
like this:
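A minimal sketch, assuming napi-rs v2; the Counter type is made up for illustration (in the real driver the same idea is applied to types like the UUID wrapper discussed below):

use napi::bindgen_prelude::ToNapiValue;
use napi::{sys, Result};

pub struct Counter {
    value: u32,
}

// Instead of letting #[napi] derive a full JavaScript object, convert the
// wrapper into a plain JS number and let the JavaScript side build any
// richer object around it.
impl ToNapiValue for Counter {
    unsafe fn to_napi_value(env: sys::napi_env, val: Self) -> Result<sys::napi_value> {
        // Delegate to the built-in implementation for u32.
        u32::to_napi_value(env, val.value)
    }
}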
This will return just the number instead of the whole struct. An
example of such places in the code was UUID. This type is used for
providing the UUID retrieved as part of any query, and can also be
used for inserts. In the initial implementation, we had a UUID wrapper: an object created in the Rust part of the code, with a default ToNapiValue implementation, that handled all the logic for the UUID. When we changed the
approach to returning just a raw buffer representing the UUID and
handling all the logic on the JavaScript side, we shaved off about
20% of the CPU time we were using in the select benchmarks at that
point in time. Note: Since completing the initial project, we’ve
introduced additional changes to how serialization and
deserialization works. This means the current state may be
different from what we describe here. A new round of benchmarking
is in progress; stay tuned for those results.
Benchmarks
In the
previous section, we showed you some early benchmarks. Let’s talk a
bit more about how we tested and what we tested. All benchmarks
presented here were run on a single machine – the database was run
in a Docker container and the driver benchmarks were run without
any virtualization or containerization. The machine was running on
AMD Ryzen™ 7 PRO 7840U with 32GB RAM, with the database itself
limited to 8GB of RAM in total. We tested the driver both with
ScyllaDB and Cassandra (latest stable versions as of the time of
testing – May 2025). Both of those databases were run in a three
node configuration, with 2 shards per node in the case of ScyllaDB.
With this information on the benchmarks, let’s see the effect all
the optimizations we added had on the driver performance when
tested with ScyllaDB:
Operations | Scylladb-javascript-driver [s] | Datastax-cassandra-driver [s] | Rust-driver [s] | scylladb-javascript-driver (initial version) [s]
62500 | 1.89 | 3.45 | 0.99 | 4.08
250000 | 4.15 | 5.66 | 1.73 | 13.50
1000000 | 13.65 | 15.86 | 4.41 | 55.05
4000000 | 55.85 | 56.73 | 18.42 | 227.69
Operations | Scylladb-javascript-driver [s] | Datastax-cassandra-driver [s] | Rust-driver [s] | scylladb-javascript-driver (initial version) [s]
62500 | 2.83 | 2.48 | 1.04 | 1.63
250000 | 1.91 | 2.91 | 1.56 | 4.09
1000000 | 4.58 | 4.69 | 3.42 | 15.74
4000000 | 16.05 | 14.27 | 11.92 | 58.96
Operations | Scylladb-javascript-driver [s] | Datastax-cassandra-driver [s] | Rust-driver [s] | scylladb-javascript-driver (initial version) [s]
62500 | 1.50 | 3.04 | 1.33 | 1.96
250000 | 2.93 | 4.52 | 1.94 | 4.90
1000000 | 8.79 | 11.11 | 5.08 | 16.99
4000000 | 32.99 | 36.62 | 17.90 | 65.74
Operations | Scylladb-javascript-driver [s] | Datastax-cassandra-driver [s] | Rust-driver [s] | scylladb-javascript-driver (initial version) [s]
62500 | 1.42 | 3.09 | 1.25 | 1.45
250000 | 2.94 | 3.81 | 2.43 | 3.43
1000000 | 9.19 | 8.98 | 7.21 | 10.82
4000000 | 33.51 | 28.97 | 25.81 | 40.74
And here are
the same benchmarks, without the initial driver version.
Here are the results of running the benchmark on Cassandra.
Operations | Scylladb-javascript-driver [s] | Datastax-cassandra-driver [s] | Rust-driver [s]
62500 | 2.48 | 14.50 | 1.25
250000 | 5.82 | 19.93 | 2.00
1000000 | 19.77 | 19.54 | 5.16
Operations | Scylladb-javascript-driver [s] | Datastax-cassandra-driver [s] | Rust-driver [s]
62500 | 1.60 | 2.99 | 1.48
250000 | 3.06 | 4.46 | 2.42
1000000 | 9.02 | 9.03 | 6.53
Operations | Scylladb-javascript-driver [s] | Datastax-cassandra-driver [s] | Rust-driver [s]
62500 | 2.32 | 4.03 | 2.11
250000 | 5.45 | 6.53 | 4.01
1000000 | 18.77 | 16.20 | 13.21
Operations | Scylladb-javascript-driver [s] | Datastax-cassandra-driver [s] | Rust-driver [s]
62500 | 1.86 | 4.15 | 1.57
250000 | 4.24 | 5.41 | 3.36
1000000 | 13.11 | 14.11 | 10.54
The test results across both ScyllaDB and
Cassandra show that the new driver has slightly better performance
on the insert benchmarks. For select benchmarks, it starts ahead
and the performance advantage decreases with time. Despite a series
of optimizations, the majority of the CPU time still comes from
NAPI communication and thread synchronization (according to
internal flamegraph testing). There is still some room for
improvement, which we’re going to explore. Since running those
benchmarks, we introduced changes that improve the performance of
the driver. With those improvements, the performance of the select benchmarks is much closer to the speed of the DataStax driver.
Again…please stay tuned for another blog post with updated results.
Shards and tablets
Since the DataStax driver lacked tablet and
shard support, we were curious if our new shard-aware and
tablet-aware drivers provided a measurable performance gain with
shards and tablets.
Operations | ScyllaDB JS Driver [s] (Shard-Aware / No Shards) | DataStax Driver [s] (Shard-Aware / No Shards) | Rust Driver [s] (Shard-Aware / No Shards)
62,500 | 1.89 / 2.61 | 3.45 / 3.51 | 0.99 / 1.20
250,000 | 4.15 / 7.61 | 5.66 / 6.14 | 1.73 / 2.30
1,000,000 | 13.65 / 30.36 | 15.86 / 16.62 | 4.41 / 8.33
4,000,000 | 55.85 / 134.90 | 56.73 / 77.68 | 18.42 / 42.64
Operations | ScyllaDB JS Driver [s] (Shard-Aware / No Shards) | DataStax Driver [s] (Shard-Aware / No Shards) | Rust Driver [s] (Shard-Aware / No Shards)
62,500 | 1.50 / 1.52 | 3.04 / 3.63 | 1.33 / 1.33
250,000 | 2.93 / 3.29 | 4.52 / 5.09 | 1.94 / 2.02
1,000,000 | 8.79 / 10.29 | 11.11 / 11.13 | 5.08 / 5.71
4,000,000 | 32.99 / 38.53 | 36.62 / 39.28 | 17.90 / 20.67
In insert benchmarks, there are
noticeable changes across all drivers when having more than one
shard. The Rust driver improved by around 36%, the new driver
improved by around 46%, and the DataStax driver improved by only
around 10% when compared to the single sharded version. While
sharding provides some performance benefits for the DataStax
driver, which is not shard aware, the new driver benefits
significantly more — achieving performance improvements comparable
to the Rust driver. This shows that it’s not just the extra shards that provide an improvement in this case; a major part of the performance gain comes from shard-awareness itself.
Operations | ScyllaDB JS Driver [s] (No Tablets / Standard) | DataStax Driver [s] (No Tablets / Standard) | Rust Driver [s] (No Tablets / Standard)
62,500 | 1.76 / 1.89 | 3.67 / 3.45 | 1.06 / 0.99
250,000 | 3.91 / 4.15 | 5.65 / 5.66 | 1.59 / 1.73
1,000,000 | 12.81 / 13.65 | 13.54 / 15.86 | 3.74 / 4.41
Operations | ScyllaDB JS Driver [s] (No Tablets / Standard) | DataStax Driver [s] (No Tablets / Standard) | Rust Driver [s] (No Tablets / Standard)
62,500 | 1.46 / 1.50 | 2.92 / 3.04 | 1.33 / 1.33
250,000 | 2.76 / 2.93 | 4.03 / 4.52 | 1.94 / 1.94
1,000,000 | 8.36 / 8.79 | 7.68 / 11.11 | 4.84 / 5.08
When it comes to
tablets, the new driver and the Rust driver see only minimal
changes to the performance, while the performance of the DataStax
driver drops significantly. This behavior is expected. The DataStax
driver is not aware of the tablets. As a result, it is unable to
communicate directly with the node that will store the data – and
that increases the time spent waiting on network communication.
Interesting things happen, however, when we look at the network
traffic:
WHAT | TOTAL | CQL | TCP | Total Size
New driver 3 node all | 412,764 | 112,318 | 300,446 | ∼ 43.7 MB
New driver 3 node, driver ↔ database | 409,678 | 112,318 | 297,360 | –
New driver 3 node, node ↔ node | 3,086 | 0 | 3,086 | –
DataStax driver 3 node all | 268,037 | 45,052 | 222,985 | ∼ 81.2 MB
DataStax driver 3 node, driver ↔ database | 90,978 | 45,052 | 45,926 | –
DataStax driver 3 node, node ↔ node | 177,059 | 0 | 177,059 | –
This table shows the number of packets sent during the concurrent
insert benchmark on three-node ScyllaDB with 2 shards per node.
Those results were obtained with RF = 1. While running the database
with such a replication factor is not production-suitable, we
chose it to better visualize the results. When looking at those
numbers, we can draw the following conclusions:
- The new driver has a different coalescing mechanism. It has a shorter wait time, which means it sends more messages to the database and achieves lower latencies.
- The new driver knows which node(s) will store the data. This reduces internal traffic between database nodes and lets the database serve more traffic with the same resources.
Future plans
The goal of this project was to create a
working prototype, which we managed to successfully achieve. It’s
available at https://github.com/scylladb/nodejs-rs-driver,
but it’s considered experimental at this point. Expect it to change
considerably, with ongoing work and refactors. Some of the features that were present in the DataStax driver, and that are expected before the new driver can be considered deployment-ready, are not yet implemented.
The Drivers team is actively working to add those features. If
you’re interested in this project and would like to contribute,
here’s the project’s GitHub
repository.
Exploring the key features of Cassandra® 5.0
Apache Cassandra has become one of the most broadly adopted distributed databases for large-scale, highly available applications since its launch as an open source project in 2008. The 5.0 release in September 2024 represents the most substantial advancement to the project since 4.0 was released in July 2021. Multiple customers (and our own internal Cassandra use case) have now been happily running on Cassandra 5 for up to 12 months, so we thought the time was right to explore the key features they are leveraging to power their modern applications.
An overview of new features in Apache Cassandra 5.0
Apache Cassandra 5.0 introduces core capabilities aimed at AI-driven systems, low-latency analytical workloads, and environments that blend operational and analytical processing.
Highlights include:
- The new vector data type and an Approximate Nearest Neighbor (ANN) index based on Hierarchical Navigable Small World (HNSW), which is integrated into the Storage-Attached Index (SAI) architecture
- Trie-based memtables and the Big Trie-Index (BTI) SSTable format, delivering better memory efficiency and more consistent write performance
- The Unified Compaction Strategy, a tunable density-based approach that can align with leveled or tiered compaction patterns.
Additional enhancements include expanded mathematical CQL functions, dynamic data masking, and experimental support for Java 17.
At NetApp, Apache Cassandra 5.0 is fully supported, and we are actively assisting customers as they transition from 4.x.
A deeper look at Cassandra 5.0’s key features
Storage-Attached Indexes (SAI)
Storage-Attached Indexes bring a modern, storage-integrated approach to secondary indexing in Apache Cassandra, resolving many of the scalability and maintenance challenges associated with earlier index implementations. Legacy Secondary Indexes (2i) and SASI remain available, but SAI offers a more robust and predictable indexing model for a broad range of production workloads.
SAI operates per-SSTable, allowing data to be indexed locally rather than requiring the cluster-wide coordination of earlier approaches. This model supports diverse CQL data types, enables efficient numeric and text range filters, and provides more consistent performance characteristics than 2i or SASI. The same storage-attached foundation is also used for Cassandra 5’s vector indexing mechanism, allowing ANN search to operate within the same storage and query framework.
SAI supports combining filters across multiple indexed columns and works seamlessly with token-aware routing to reduce unnecessary coordinator work. Public evaluations and community testing have shown faster index builds, more predictable read paths, and improved disk utilization compared with previous index formats.
Operationally, SAI functions as part of the storage engine itself: indexes are defined using standard CQL statements and are maintained automatically during flush and compaction, with no cluster-wide rebuilds required. This provides more flexible query options and can simplify application designs that previously relied on manual denormalization or external indexing systems.
Native Vector Search capabilities
Apache Cassandra 5.0 introduces native support for high-dimensional vector embeddings through the new vector data type. Embeddings represent semantic information in numerical form, enabling similarity search to be performed directly within the database. The vector type is integrated with the database’s storage-attached index architecture, which uses HNSW graphs to efficiently support ANN search across cosine, Euclidean, and dot-product similarity metrics.
With vector search implemented at the storage layer, applications involving semantic matching, content discovery, and retrieval-oriented workflows are supported while maintaining the system’s established scalability and fault-tolerance characteristics.
After upgrading to 5.0, existing schemas can add vector columns and store embeddings through standard write operations. For example:
UPDATE products SET embedding = [0.1, 0.2, 0.3, 0.4, 0.5] WHERE id = <id>;
To create a new table with a vector type column:
CREATE TABLE items (
    product_id UUID PRIMARY KEY,
    embedding VECTOR<FLOAT, 768>  // 768 denotes dimensionality
);
Because vector indexes are attached to SSTables, they participate automatically in the compaction and repair processes and do not require an external indexing system. ANN queries can be combined with regular CQL filters, allowing similarity searches and metadata conditions to be evaluated within a unified distributed query workflow. This brings vector retrieval into Apache Cassandra’s native consistency, replication, and storage model.
Unified Compaction Strategy (UCS)
Unified Compaction Strategy in Apache Cassandra 5 introduces a density-aware approach to organizing SSTables that blends the strengths of Leveled Compaction Strategy (LCS) and Size Tiered Compaction Strategy (STCS). UCS aims to provide the predictable read amplification associated with LCS and the write efficiency of STCS, without many of the workload-specific drawbacks that previously made compaction selection difficult. Choosing an unsuitable compaction strategy in earlier releases could lead to operational complexity and long-term performance issues, which UCS is designed to mitigate.
UCS exposes a set of tunable parameters like density thresholds and per-level scaling that let operators adjust compaction behavior toward read-heavy, write-heavy, or time-series patterns. This flexibility also helps smooth the transition from existing strategies, as UCS can adopt and improve the current SSTable layout without requiring a full rewrite in most cases. The introduction of compaction shards further increases parallelism and reduces the impact of large compactions on cluster performance.
Although LCS and STCS remain available (and while STCS remains the default strategy in 5.0, UCS is the default strategy on newly deployed NetApp Instaclustr’s managed Apache Cassandra 5 clusters), UCS supports a broader range of workloads, reduces the operational burden of compaction tuning, and aligns well with other storage engine improvements in Apache Cassandra 5 such as trie-based SSTables and Storage-Attached Indexes.
Trie Memtables and Trie-Indexed SSTables
Trie Memtables and Trie-indexed SSTables (Big Trie-Index, BTI) are significant storage engine enhancements released in Apache Cassandra 5. They are designed to reduce memory overhead, improve lookup performance, and increase flush efficiency. A trie data structure stores keys by shared prefixes instead of repeatedly storing full keys, which lowers object count and improves CPU cache locality compared with the legacy skip-list memtable structure. These benefits are particularly visible in high-ingestion, IoT, and time-series workloads.
Skip-list memtables store full keys for every entry, which can lead to large heap usage and increased garbage collection activity under heavy write loads. Trie Memtables substantially reduce this overhead by compacting key storage and avoiding pointer-heavy layouts. On disk, the BTI SSTable format replaces the older BIG index with a trie-based partition index that removes redundant key material and reduces the number of key comparisons needed during partition lookups.
Using Trie memtables requires enabling both the trie-based memtable implementation and the BTI SSTable format. Existing BIG SSTables are converted to BTI through normal compaction or by rebuilding data. On NetApp Instaclustr’s managed Apache Cassandra clusters Trie Memtables and BTI are enabled by default, but when upgrading major versions to 5.0, data must be converted from BIG to BTI first to utilize Trie structures.
Other new features
Mathematical CQL functions
Apache Cassandra 5.0 added a rich set of math functions allowing developers to perform computations directly within queries. This reduces data transfer overhead and client-side post-processing, among many other benefits. From fundamental functions like ABS(), ROUND(), or SQRT() to more complex operations like SIN(), COS(), and TAN(), these math functions apply across a multitude of domains, from financial data and scientific measurements to spatial data.
Dynamic Data Masking
Dynamic Data Masking (DDM) is a new feature that obscures sensitive column-level data at query time, or permanently attaches the masking to a column so that the data is always returned obfuscated. Stored data values are not altered in this process, and administrators can control access through role-based access control (RBAC) to ensure only those with the right permissions can see the data, while also tuning the visibility of the obscured data. This feature helps with adherence to data privacy regulations such as GDPR, HIPAA, and PCI DSS without needing external redaction systems.
Apache Cassandra 5.0 packs a punch with game-changing features that meet the needs of modern workloads and applications. Features like vector search capabilities and Storage-Attached Indexes stand out, as they will inevitably shape how data can be leveraged within the same database while maintaining speed, scale, and resilience.
When you deploy a managed cluster on NetApp Instaclustr’s Managed Platform, you get the benefits of all these amazing features without worrying about configuration and maintenance.
Ready to experience the power of Apache Cassandra 5.0 for yourself? Try it free for 30 days today!
The post Exploring the key features of Cassandra® 5.0 appeared first on Instaclustr.
Instaclustr product update: December 2025
Here’s a roundup of the latest features and updates that we’ve recently released.
If you have any particular feature requests or enhancement ideas that you would like to see, please get in touch with us.
Major announcements
OpenSearch®
AI Search for OpenSearch®: Unlocking next-generation search
AI Search for OpenSearch, which is now available in Public Preview on the NetApp Instaclustr Managed Platform, is designed to bring semantic, hybrid, and multimodal search capabilities to OpenSearch deployments—turning them into an end-to-end AI-powered search solution within minutes. With built-in ML models, vector indexing, and streamlined ingestion pipelines, next-generation search can be enabled in minutes without adding operational complexity. This feature powers smarter, more relevant discovery experiences backed by AI—securely deployed across any cloud or on-premises environment.
ClickHouse®
FSx for NetApp ONTAP and Managed ClickHouse® integration is now
available
We’re excited to announce that NetApp has introduced seamless
integration between Amazon FSx for NetApp ONTAP and Instaclustr
Managed ClickHouse, to enable customers to build a truly hybrid
lakehouse architecture on AWS. This integration is designed to
deliver lightning-fast analytics without the need for complex data
movement, while leveraging FSx for ONTAP’s unified file and object
storage, tiered performance, and cost optimization. Customers can
now run zero-copy lakehouse analytics with ClickHouse directly on
FSx for ONTAP data—to simplify operations, accelerate
time-to-insight, and reduce total cost of ownership.
Instaclustr for PostgreSQL® on Amazon FSx for ONTAP: A new
era
We’re excited to announce the public preview of Instaclustr Managed
PostgreSQL integrated with Amazon FSx for NetApp ONTAP—combining
enterprise-grade storage with world-class open source database
management. This integration is designed to deliver higher IOPS,
lower latency, and advanced data management without increasing
instance size or adding costly hardware. Customers can now run
PostgreSQL clusters backed by FSx for ONTAP storage, leveraging
on-disk compression for cost savings and paving the way for
ONTAP-powered features, such as instant snapshot backups, instant
restores, and fast forking. These ONTAP-enabled features are
planned to unlock huge operational benefits and will be launched
with our GA release.
- Released Apache Cassandra 5.0.5 into general availability on the NetApp Instaclustr Managed Platform.
- Transitioned Apache Cassandra v4.1.8 to CLOSED lifecycle state; scheduled to reach End of Life (EOL) on December 20, 2025.
- Kafka on Azure now supports v5 generation nodes, available in General Availability.
- Instaclustr Managed Apache ZooKeeper has moved from General Availability to closed status.
- Kafka and Kafka Connect 3.1.2 and 3.5.1 are retired; 3.6.2, 3.7.1, and 3.8.1 are in legacy support. The next set of lifecycle state changes for Kafka and Kafka Connect, at the end of March 2026, will see all supported versions 3.8.1 and below marked End of Life.
- Karapace Rest Proxy and Schema Registry 3.15.0 are closed. Customers are advised to move to version 5.x.
- Kafka Rest Proxy 5.0.0 and Kafka Schema Registry 5.0.0, 5.0.4 have been moved to end of life. Affected customers have been contacted by Support to schedule a migration to a supported version as soon as possible.
- ClickHouse 25.3.6 has been added to our managed platform in General Availability.
- Kafka Table Engine integration with ClickHouse has added support to enable real-time data ingestion, streamline streaming analytics, and accelerate insights.
- New ClickHouse node sizes, powered by AWS m7g, r7i, and r7g instances, are now in Limited Availability for cluster creation.
- Cadence is now available to be provisioned with Cassandra 5.x, designed to deliver improved performance, enhanced scalability, and stronger security for mission-critical workflows.
- OpenSearch 2.19.3 and 3.2.0 have been released to General Availability.
- PostgreSQL AWS PrivateLink support has been added, enabling connectivity between VPCs using AWS PrivateLink.
- PostgreSQL version 18.0 has now been released to General Availability, alongside PostgreSQL version 16.10, 17.6.
- Added new PostgreSQL metrics for connection states and wait event types.
- PostgreSQL Load Balancer add-on is now available, providing a unified endpoint for cluster access, simplifying failover handling, and ensuring node health through regular checks.
- We’re working on enabling multi-datacenter (multi-DC) cluster provisioning via API and console, designed to make it easier to deploy clusters across regions with secure networking and reduced manual steps.
- We’re working on adding Kafka Tiered Storage for clusters running in GCP— designed to bring affordable, scalable retention, and instant access to historical data, to ensure flexibility and performance across clouds for enterprise Kafka users.
- We’re planning to extend our Managed ClickHouse to allow it to work with on-prem deployments.
- Following the success of our public preview, we’re preparing to launch PostgreSQL integrated with FSx for NetApp ONTAP (FSxN) into General Availability. This enhancement is designed to combine enterprise-grade PostgreSQL with FSxN’s scalable, cost-efficient storage, enabling customers to optimize infrastructure costs while improving performance and flexibility.
- As part of our ongoing advancements in AI for OpenSearch, we are planning to enable adding GPU nodes into OpenSearch clusters, aiming to enhance the performance and efficiency of machine learning and AI workloads.
- We’re working on a self-service Tags Management feature that will allow users to add, edit, or delete tags for their clusters directly through the Instaclustr console, APIs, or Terraform provider for RIYOA deployments.
- Cadence Workflow, the open source orchestration engine created by Uber, has officially joined the Cloud Native Computing Foundation (CNCF) as a Sandbox project. This milestone ensures transparent governance, community-driven innovation, and a sustainable future for one of the most trusted workflow technologies in modern microservices and agentic AI architectures. Uber donates Cadence Workflow to CNCF: The next big leap for the open source project—read the full story and discover what’s next for Cadence.
- Upgrading ClickHouse® isn’t just about new features—it’s essential for security, performance, and long-term stability. In ClickHouse upgrade: Why staying updated matters, you’ll learn why skipping upgrades can lead to technical debt, missed optimizations, and security risks. Then, explore A guide to ClickHouse® upgrades and best practices for practical strategies, including when to choose LTS releases for mission-critical workloads and when stable releases make sense for fast-moving environments.
- Our latest blog, AI Search for OpenSearch®: Unlocking next-generation search, explains how this new solution enables smarter discovery experiences using built-in ML models, vector embeddings, and advanced search techniques—all fully managed on the NetApp Instaclustr Platform. Ready to explore the future of search? Read the full article and see how AI can transform your OpenSearch deployments.
If you have any questions or need further assistance with these enhancements to the Instaclustr Managed Platform, please contact us.
SAFE HARBOR STATEMENT: Any unreleased services or features referenced in this blog are not currently available and may not be made generally available on time or at all, as may be determined in NetApp’s sole discretion. Any such referenced services or features do not represent promises to deliver, commitments, or obligations of NetApp and may not be incorporated into any contract. Customers should make their purchase decisions based upon services and features that are currently generally available.
The post Instaclustr product update: December 2025 appeared first on Instaclustr.
Freezing streaming data into Apache Iceberg™—Part 1: Using Apache Kafka®Connect Iceberg Sink Connector
Introduction
Ever since the first distributed system—i.e. 2 or more computers networked together (in 1969)—there has been the problem of distributed data consistency: How can you ensure that data from one computer is available and consistent with the second (and more) computers? This problem can be uni-directional (one computer is considered the source of truth, others are just copies) or bi-directional (data must be synchronized in both directions across multiple computers).
Some approaches to this problem I’ve come across in the last 8 years include Kafka Connect (for elegantly solving the heterogeneous many-to-many integration problem by streaming data from source systems to Kafka and from Kafka to sink systems, some earlier blogs on Apache Camel Kafka Connectors and a blog series on zero-code data pipelines), MirrorMaker2 (MM2, for replicating Kafka clusters, a 2 part blog series), and Debezium (Change Data Capture/CDC, for capturing changes from databases as streams and making them available in downstream systems, e.g. for Apache Cassandra and PostgreSQL)—MM2 and Debezium are actually both built on Kafka Connect.
Recently, some “sink” systems have been taking over responsibility for streaming data from Kafka into themselves, e.g. OpenSearch pull-based ingestion (c.f. OpenSearch Sink Connector), and the ClickHouse Kafka Table Engine (c.f. ClickHouse Sink Connector). These “pull-based” approaches are potentially easier to configure and don’t require running a separate Kafka Connect cluster and sink connectors, but some downsides may be that they are not as reliable or independently scalable, and you will need to carefully monitor and scale them to ensure they perform adequately.
And then there’s “zero-copy” approaches—these rely on the well-known computer science trick of sharing a single copy of data using references (or pointers), rather than duplicating the data. This idea has been around for almost as long as computers, and is still widely applicable, as we’ll see in part 2 of the blog.
The distributed data use case we’re going to explore in this 2-part blog series is streaming Apache Kafka data into Apache Iceberg, or “Freezing streaming Apache Kafka data into an (Apache) Iceberg”! In part 1 we’ll introduce Apache Iceberg and look at the first approach for “freezing” streaming data using the Kafka Connect Iceberg Sink Connector.
What is Apache Iceberg?
Apache Iceberg is an open source, open table format specification optimized for column-oriented workloads, supporting huge analytic datasets. It supports multiple concurrent engines that can insert and query table data using SQL—and Iceberg is organized like, well, an iceberg!
The tip of the Iceberg is the Catalog. An Iceberg Catalog acts as a central metadata repository, tracking the current state of Iceberg tables, including their names, schemas, and metadata file locations. It serves as the “single source of truth” for a data Lakehouse, enabling query engines to find the correct metadata file for a table to ensure consistent and atomic read/write operations.
Just under the water, the next layer is the metadata layer. The Iceberg metadata layer tracks the structure and content of data tables in a data lake, enabling features like efficient query planning, versioning, and schema evolution. It does this by maintaining a layered structure of metadata files, manifest lists, and manifest files that store information about table schemas, partitions, and data files, allowing query engines to prune unnecessary files and perform operations atomically.
The data layer is at the bottom. The Iceberg data layer is the storage component where the actual data files are stored. It supports different storage backends, including cloud-based object storage like Amazon S3 or Google Cloud Storage, or HDFS. It uses file formats like Parquet or Avro. Its main purpose is to work in conjunction with Iceberg’s metadata layer to manage table snapshots and provide a more reliable and performant table format for data lakes, bringing data warehouse features to large datasets.

As shown in the above diagram, Iceberg supports multiple different engines, including Apache Spark and ClickHouse. Engines provide the “database” features you would expect, including:
- Data Management
- ACID Transactions
- Query Planning and Optimization
- Schema Evolution
- And more!
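To make the three layers concrete, here is a minimal, hedged sketch using the PyIceberg library (a Python client, not one of the engines named above) to read an Iceberg table. The catalog name, REST catalog URI, warehouse location, table identifier, and filter column are all illustrative assumptions.

from pyiceberg.catalog import load_catalog

# Resolve the table via the catalog layer (names and URIs here are assumptions).
catalog = load_catalog(
    "demo",
    **{"uri": "http://iceberg-rest-catalog:8181", "warehouse": "s3://my-warehouse/"},
)
table = catalog.load_table("analytics.events")

# Schema and snapshots come from the metadata layer.
print(table.schema())
print(table.current_snapshot())

# A filtered scan: the metadata layer prunes manifests and data files
# before any Parquet/Avro files are read from the data layer.
df = table.scan(row_filter="publication_date >= '2025-01-01'").to_pandas()
print(len(df))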
I’ve recently been reading an excellent book on Apache Iceberg (“Apache Iceberg: The Definitive Guide”), which explains the philosophy, architecture, design, and operation of Iceberg. For example, it says that it’s best practice to treat data lake storage as immutable—data should only be added to a Data Lake, not deleted. So, in theory at least, writing infinite, immutable Kafka streams to Iceberg should be straightforward!
But because it’s a complex distributed system (which looks like a database from above water but is really a bunch of files below water!), there is some operational complexity. For example, it handles change and consistency by creating new snapshots for every modification, enabling time travel, isolating readers from writes, and supporting optimistic concurrency control for multiple writers. But you need to manage snapshots (e.g. expiring old snapshots). And chapter 4 (performance optimisation) explains that you may need to worry about compaction (consolidating too many small files), partitioning approaches (which can impact read performance), and handling row-level updates. The first two issues may be relevant for Kafka, but probably not the last one. So, it looks like it’s a good fit for streaming Kafka use cases, but we may need to watch out for Iceberg management issues.
“Freezing” streaming data with the Kafka Iceberg Sink Connector
But Apache Iceberg is “frozen”—what’s the connection to fast-moving streaming data? You certainly don’t want to collide with an iceberg from your speedy streaming “ship”—but you may want to freeze your streaming data for long-term analytical queries in the future. How can you do that without sinking? Actually, a “sink” is the first answer: A Kafka Connect Iceberg Sink Connector is the most common way of “freezing” your streaming data in Iceberg!
Kafka Connect is the standard framework provided by Apache Kafka to move data from multiple heterogeneous source systems to multiple heterogeneous sink systems, using:
- A Kafka cluster
- A Kafka Connect cluster (running connectors)
- Kafka Connect source connectors
- Kafka topics and
- Kafka Connect Sink Connectors
That is, a highly decoupled approach. It provides real-time data movement with high scalability, reliability, error handling and simple transformations.
Here’s the Kafka Connect Iceberg Sink Connector official documentation.
It appears to be reasonably complicated to configure this sink connector; you will need to know something about Iceberg. For example, what is a “control topic”? It’s apparently used to coordinate commits for exactly-once semantics (EOS).
The connector supports fan-out (writing to multiple Iceberg tables from one topic), fan-in (writing to one Iceberg table from multiple topics), static and dynamic routing, and filtering.
As is the case with many technologies you may want to use as Kafka Connect sinks, Iceberg does not yet have full support for Kafka metadata: the KafkaMetadata Transform (which injects topic, partition, offset, and timestamp properties) is only experimental at present.
How are Iceberg tables created with the correct metadata? If you have JSON record values, then schemas are inferred by default (but may not be correct or optimal). Alternatively, explicit schemas can be included in-line or referenced from a Kafka Schema Registry (e.g. Karapace), and, as an added bonus, schema evolution is supported. Also note that Iceberg tables may have to be manually created prior to use if your Catalog doesn’t support table auto-creation.
From what I understood about Iceberg, to use it (e.g. for writes), you need support from an engine (e.g. to add raw data to the Iceberg warehouse, create the metadata files, and update the catalog). How does this work for Kafka Connect? From this blog I discovered that the Kafka Connect Iceberg Sink connector is functioning as an Iceberg engine for writes, so there really is an engine, but it’s built into the connector.
As is the case with all Kafka Connect Sink Connectors, records are available as soon as they are written to Kafka topics by Kafka producers and Kafka Connect Source Connectors, i.e. records in active segments can be copied immediately to sink systems. But is the Iceberg Sink Connector real-time? Not really! The default time to write to Iceberg is every 5 minutes (iceberg.control.commit.interval-ms) to prevent a proliferation of small files—something that Iceberg(s) doesn’t/don’t like (“melting”?). In practice, this is because every data file must be tracked in the metadata layer, which impacts performance in many ways—proliferation of small files is typically addressed by optimization and compaction (e.g. Apache Spark supports Iceberg management, including these operations).
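To make the configuration discussion concrete, here is a minimal, hedged sketch that registers the connector via the Kafka Connect REST API from Python. The connector class name and property keys other than iceberg.control.commit.interval-ms (mentioned above) should be treated as assumptions to verify against the documentation for your connector version, and the topic, table, and catalog settings are purely illustrative.

import requests

connector = {
    "name": "events-iceberg-sink",  # hypothetical connector name
    "config": {
        # Connector class as shipped in the Apache Iceberg build (verify for your version).
        "connector.class": "org.apache.iceberg.connect.IcebergSinkConnector",
        "topics": "events",                          # source Kafka topic (assumption)
        "iceberg.tables": "analytics.events",        # target Iceberg table (assumption)
        # Catalog settings depend entirely on your deployment (assumptions).
        "iceberg.catalog.type": "rest",
        "iceberg.catalog.uri": "http://iceberg-rest-catalog:8181",
        # Commit interval discussed above: the default is 5 minutes; lowering it
        # reduces end-to-end lag but produces more small files to compact later.
        "iceberg.control.commit.interval-ms": "60000",
    },
}

# Kafka Connect REST API: POST a new connector definition to the Connect cluster.
resp = requests.post("http://localhost:8083/connectors", json=connector, timeout=30)
resp.raise_for_status()
print(resp.json())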
So, unlike most Kafka Connect sink connectors, which write as quickly as possible, there will be lag before records appear in Iceberg tables (“time to freeze” perhaps)!
The systems are separate (Kafka and Iceberg are independent), records are copied to Iceberg, and that’s it! This is a clean separation of concerns and ownership. Kafka owns the source data (with Kafka controlling data lifecycles, including record expiry), Kafka Connect Iceberg Sink Connector performs the reading from Kafka and writing to Iceberg, and is independently scalable to Kafka. Kafka doesn’t handle any of the Iceberg management. Once the data has landed in Iceberg, Kafka has no further visibility or interest in it. And the pipeline is purely one way, write only – reads or deletes are not supported.
Here’s a summary of this approach to freezing streams:
- Kafka Connect Iceberg Sink Connector shares all the benefits of the Kafka Connect framework, including scalability, reliability, error handling, routing, and transformations.
- At a minimum, JSON record values are required; ideally, full schemas are provided and referenced in Karapace, but not all schemas are guaranteed to work.
- Kafka Connect doesn’t “manage” Iceberg (e.g. automatically aggregate small files, remove snapshots, etc.)
- You may have to tune the commit interval – 5 minutes is the default.
- But it does have a built-in engine that supports writing to Iceberg.
- You may need to use an external tool (e.g. Apache Spark) for Iceberg management procedures.
- It’s write-only to Iceberg; reads and deletes are not supported.

But what’s the best thing about the Kafka Connect Iceberg Sink Connector? It’s available now (as part of the Apache Iceberg build) and works on the NetApp Instaclustr Kafka Connect platform as a “bring your own connector” (instructions here).
In part 2, we’ll look at Kafka Tiered Storage and Iceberg Topics!
The post Freezing streaming data into Apache Iceberg™—Part 1: Using Apache Kafka®Connect Iceberg Sink Connector appeared first on Instaclustr.
Stay ahead with Apache Cassandra®: 2025 CEP highlights
Apache Cassandra® committers are working hard, building new features to help you more seamlessly ease operational challenges of a distributed database. Let’s dive into some recently approved CEPs and explain how these upcoming features will improve your workflow and efficiency.
What is a CEP?
CEP stands for Cassandra Enhancement Proposal. A CEP is the process for outlining, discussing, and gathering endorsements for a new feature in Cassandra. It’s more than a feature request; those who put forth a CEP intend to build the feature, and the proposal encourages close collaboration with the Cassandra contributors.
The CEPs discussed here were recently approved for implementation or have had significant progress in their implementation. As with all open-source development, inclusion in a future release is contingent upon successful implementation, community consensus, testing, and approval by project committers.
CEP-42: Constraints framework
With collaboration from NetApp Instaclustr, CEP-42 (and subsequent iterations) delivers schema-level constraints, giving Cassandra users and operators more control over their data. Adding constraints at the schema level means that data can be validated at write time and an appropriate error returned when data is invalid.
Constraints are defined in-line or as a separate definition. The inline style allows for only one constraint while a definition allows users to define multiple constraints with different expressions.
The initial scope of CEP-42 supported a few constraints that covered the majority of cases, but follow-up efforts expanded support to include scalar comparisons (>, <, >=, <=), LENGTH(), OCTET_LENGTH(), NOT NULL, JSON(), and REGEX(). Users can also define their own constraints by implementing them and placing them on Cassandra’s classpath.
A simple example of an in-line constraint:
CREATE TABLE users (
username text PRIMARY KEY,
age int CHECK age >= 0 and age < 120
);
Constraints are not supported for UDTs (User-Defined Types) nor collections (except for using NOT NULL for frozen collections).
Enabling constraints closer to the data is a subtle but mighty way for operators to ensure that data goes into the database correctly. Defining rules just once simplifies application code, makes it more robust, and prevents validation from being bypassed. Those who have worked with MySQL, Postgres, or MongoDB will enjoy this addition to Cassandra.
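As a quick illustration of write-time validation, here is a hedged sketch using the Python driver against the table above. The keyspace name is a placeholder, and the exact exception surfaced for a constraint violation is an assumption; the point is simply that invalid data is rejected by the server at write time.

from cassandra.cluster import Cluster
from cassandra import InvalidRequest

cluster = Cluster(["127.0.0.1"])
session = cluster.connect("my_keyspace")  # hypothetical keyspace holding the users table

# A row that satisfies the CHECK constraint is accepted.
session.execute("INSERT INTO users (username, age) VALUES ('alice', 34)")

# A row that violates age < 120 is rejected by the server.
try:
    session.execute("INSERT INTO users (username, age) VALUES ('bob', 150)")
except InvalidRequest as exc:
    print(f"Write rejected by constraint: {exc}")

cluster.shutdown()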
CEP-51: Support “Include” Semantics for cassandra.yaml
The cassandra.yaml file holds important settings for storage, memory, replication, compaction, and more. It’s no surprise that the average size of the file is around 1,000 lines (though, yes—most are comments). CEP-51 enables splitting the cassandra.yaml configuration into multiple files using include semantics. From the outside, this feels like a small change, but the implications are huge if a user chooses to opt in.
In general, the size of the configuration file makes it difficult to manage and coordinate changes. It’s often the case that multiple teams manage various aspects of the single file. In addition, cassandra.yaml is readable by anyone with access to the file, meaning private information like credentials is commingled with all other settings. There is risk from an operational and security standpoint.
Enabling the new semantics, and therefore modularity for the configuration file, improves management, deployment, handling of environment-specific settings, and security in one shot. Once cassandra.yaml is broken up into smaller, well-defined files, the configuration follows the principle of least privilege: sensitive configuration settings are separated from general settings, with fine-grained access for the individual files. With the feature enabled, different development teams are better equipped to deploy safely and independently.
If you’ve deployed your Cassandra cluster on the NetApp Instaclustr platform, the cassandra.yaml file is already configured and managed for you. We pride ourselves on making it easy for you to get up and running fast.
CEP-52: Schema annotations for Apache Cassandra
With extensive review by the NetApp Instaclustr team and Stefan Miklosovic, CEP-52 introduces schema annotations in CQL, allowing in-line comments and labels on schema elements such as keyspaces, tables, columns, and User Defined Types (UDTs). Users can easily define and alter comments and labels on these elements, and they can be copied over when desired using CREATE TABLE LIKE syntax. Comments are stored as plain text, while labels are stored as structured metadata.
Comments and labels serve different annotation purposes: Comments document what a schema object is for, whereas labels describe how sensitive or controlled it is meant to be. For example, labels can be used to identify columns as “PII” or “confidential”, while the comment on that column explains usage, e.g. “Last login timestamp.”
Users can query these annotations. CEP-52 defines two new read-only tables (system_views.schema_comments and system_views.schema_security_labels) to store comments and security labels, so objects with comments can be returned as a list, or a user or machine process can query for specific labels—beneficial for auditing and classification. Note that security labels are descriptive metadata and do not enforce access control on the data.
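As a hedged sketch of what querying those virtual tables might look like from the Python driver, selecting everything avoids guessing at column names (which are not spelled out here and should be treated as unknown until the feature lands):

from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])
session = cluster.connect()

# List all schema comments and security labels via the read-only virtual tables
# named by CEP-52. SELECT * avoids assuming specific column names.
for row in session.execute("SELECT * FROM system_views.schema_comments"):
    print(row)

for row in session.execute("SELECT * FROM system_views.schema_security_labels"):
    print(row)

cluster.shutdown()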
CEP-53: Cassandra rolling restarts via Sidecar
Sidecar is an auxiliary component in the Cassandra ecosystem that exposes cluster management and streaming capabilities through APIs. This feature introduces rolling restarts through Sidecar and is designed to give operators more efficient and safer restarts without cluster-wide downtime. More specifically, operators can monitor, pause, resume, and abort restarts, all through an API with configurable options if restarts fail.
Rolling restarts bring operators a step closer to cluster-wide operations and lifecycle management via Sidecar. Operators will be able to configure the number of nodes to restart concurrently with minimal risk, as this CEP defines clear states as a node progresses through a restart. Accounting for a variety of edge cases, an operator can feel assured that, for example, a non-functioning Sidecar won’t derail operations.
The current process for restarting a node is a multi-step, manual process, which does not scale for large cluster sizes (and is also tedious for small clusters). Restarting clusters previously lacked a streamlined approach as each node needed to be restarted one at a time, making the process time-intensive and error-prone.
Though Sidecar is still considered WIP, it’s got big plans to improve operating large clusters!
The NetApp Instaclustr Platform, in conjunction with our expert TechOps team, already orchestrates these laborious tasks for our Cassandra customers with a high level of care to ensure their cluster stays online. Restarting or upgrading your Cassandra nodes is a huge pain point for operators, but it doesn’t have to be when using our managed platform (with round-the-clock support!).
CEP-54: Zstd with dictionary SSTable compression
CEP-54, with NetApp Instaclustr’s collaboration, aims to add support for Zstd with dictionary compression for SSTables. Zstd, or Zstandard, is a fast, lossless data compression algorithm that boasts impressive ratio and speed and has been supported in Cassandra since 4.0. Certain workloads can benefit from significantly faster read/write performance, reduced storage footprint, and increased storage device lifetime when using dictionary compression.
At a high level, operators choose a table they want to compress with a dictionary. A dictionary must first be trained on a small amount of already present data (no more than 10MiB is recommended). The result of training is a dictionary, which is stored cluster-wide for all nodes to use; this dictionary is then used for all subsequent writes of SSTables to disk.
Workloads with structured data of similar rows benefit most from Zstd with dictionary compression. Some examples of ideal workloads include event logs, telemetry data, and metadata tables with templated messages. Think: repeated row data. If the table data is too unstructured or random, dictionary compression likely won’t be optimal; however, plain Zstd will still be an excellent option.
New SSTables with dictionaries are readable across nodes and support streaming, repair, and backup. Existing tables are unaffected if dictionary compression is not enabled. Too many unique dictionaries hurt decompression; use minimal dictionaries (the recommended dictionary size is about 100KiB, with one dictionary per table) and only adopt new ones when they’re noticeably better.
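The dictionary-training workflow and its exact syntax are defined by CEP-54 and not reproduced here, but plain Zstd compression (available since Cassandra 4.0, as noted above) can already be enabled per table today. Here is a hedged sketch using the Python driver, with placeholder keyspace and table names; verify the supported compression options for your Cassandra version.

from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])
session = cluster.connect("my_keyspace")  # hypothetical keyspace

# Enable plain Zstd SSTable compression on an existing (hypothetical) table.
# compression_level trades CPU for compression ratio.
session.execute("""
    ALTER TABLE events
    WITH compression = {'class': 'ZstdCompressor', 'compression_level': '3'}
""")

cluster.shutdown()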
CEP-55: Generated role names
CEP-55 adds support for creating users/roles without supplying a name, simplifying user management, especially when generating users and roles in bulk. With an example syntax, CREATE GENERATED ROLE WITH GENERATED PASSWORD, new keys are placed in a newly introduced configuration section in cassandra.yaml under “role_name_policy.”
Stefan Miklosovic, our Cassandra engineer at NetApp Instaclustr, created this CEP as a logical follow up to CEP-24 (password validation/generation), which he authored as well. These quality-of-life improvements let operators spend less time doing trivial tasks with high-risk potential and more time on truly complex matters.
Manual name selection seems trivial until a hundred role names need to be generated; then there is a security risk if the new usernames—or, worse, passwords—are easily guessable. With CEP-55, the generated role name will be UUID-like, with optional prefix/suffix and size hints; however, a pluggable policy is also available to generate and validate names. This is an opt-in feature with no effect on the existing method of generating role names.
The future of Apache Cassandra is bright
These Cassandra Enhancement Proposals demonstrate a strong commitment to making Apache Cassandra more powerful, secure, and easier to operate. By staying on top of these updates, we ensure our managed platform seamlessly supports future releases that accelerate your business needs.
At NetApp Instaclustr, our expert TechOps team already orchestrates laborious tasks like restarts and upgrades for our Apache Cassandra customers, ensuring their clusters stay online. Our platform handles the complexity so you can get up and running fast.
Learn more about our fully managed and hosted Apache Cassandra offering and try it for free today!
The post Stay ahead with Apache Cassandra®: 2025 CEP highlights appeared first on Instaclustr.
Vector search benchmarking: Embeddings, insertion, and searching documents with ClickHouse® and Apache Cassandra®
Welcome back to our series on vector search benchmarking. In part 1, we dove into setting up a benchmarking project and explored how to implement vector search in PostgreSQL from the example code in GitHub. We saw how a hands-on project with students from Northeastern University provided a real-world testing ground for Retrieval-Augmented Generation (RAG) pipelines.
Now, we’re continuing our journey by exploring two more powerful open source technologies: ClickHouse and Apache Cassandra. Both handle vector data differently and understanding their methods is key to effective vector search benchmarking. Using the same student project as our guide, this post will examine the code for embedding, inserting, and retrieving data to see how these technologies stack up.
Let’s get started.
Vector search benchmarking with ClickHouse
ClickHouse is a column-oriented database management system known for its incredible speed in analytical queries. It’s no surprise that it has also embraced vector search. Let’s see how the student project team implemented and benchmarked the core components.
Step 1: Embedding and inserting data
scripts/vectorize_and_upload.py
This is the file that handles Step 1 of the pipeline for ClickHouse. Embeddings in this file (scripts/vectorize_and_upload.py) are used as vector representations of Guardian news articles for the purpose of storing them in a database and performing semantic search. Here’s how embeddings are handled step-by-step (the steps look similar to PostgreSQL).
First up is the generation of embeddings. The same SentenceTransformer model used in part 1 (all-MiniLM-L6-v2) is loaded in the class constructor. In the method generate_embeddings(self, articles), for each article:
- The article’s title and body are concatenated into a text string.
- The model generates an embedding vector (self.model.encode(text_for_embedding)), which is a numerical representation of the article’s semantic content.
- The embedding is added to the article’s dictionary under the key embedding.
Then the embeddings are stored in ClickHouse as follows.
- The database table guardian_articles is created with an embedding Array(Float64) NOT NULL column specifically to store these vectors.
- In upload_to_clickhouse_debug(self, articles_with_embeddings), the script inserts articles into ClickHouse, including the embedding vector as part of each row.
services/clickhouse/clickhouse_dao.py
The steps to search are the same as for PostgreSQL in part 1. Here’s part of the related_articles method for ClickHouse:
def related_articles(self, query: str, limit: int = 5):
    """Search for similar articles using vector similarity"""
    ...
    query_embedding = self.model.encode(query).tolist()
    search_query = f"""
        SELECT url, title, body, publication_date,
               cosineDistance(embedding, {query_embedding}) as distance
        FROM guardian_articles
        ORDER BY distance ASC
        LIMIT {limit}
    """
    ...
When searching for related articles, it encodes the query into an embedding, then performs a vector similarity search in ClickHouse using cosineDistance between stored embeddings and the query embedding, and results are ordered by similarity, returning the most relevant articles.
Vector search benchmarking with Apache Cassandra
Next, let’s turn our attention to Apache Cassandra. As a distributed NoSQL database, Cassandra is designed for high availability and scalability, making it an intriguing option for large-scale RAG applications.
Step 1: Embedding and inserting data
scripts/pull_docs_cassandra.py
As in the above examples, embeddings in this file are used to convert article text (body) into numerical vector representations for storage and later retrieval in Cassandra.
For each article, the code extracts the body and computes the embeddings:
embedding = model.encode(body)
embedding_list = [float(x) for x in embedding]
- model.encode(body) converts the text to a NumPy array of 384 floats.
- The array is converted to a standard Python list of floats for Cassandra storage.
Next, the embedding is stored in the vector column of the articles table using a CQL INSERT:
insert_cql = SimpleStatement("""
    INSERT INTO articles (url, title, body, publication_date, vector)
    VALUES (%s, %s, %s, %s, %s)
    IF NOT EXISTS;
""")
result = session.execute(insert_cql, (url, title, body, publication_date, embedding_list))
The schema for the table specifies vector vector<float, 384>, meaning each article has a corresponding 384-dimensional embedding. The code also creates a custom index for the vector column:
session.execute("""
    CREATE CUSTOM INDEX IF NOT EXISTS ann_index
    ON articles(vector)
    USING 'StorageAttachedIndex';
""")
This enables efficient vector (ANN: Approximate Nearest Neighbor) search capabilities, allowing similarity queries on stored embeddings.
A key part of the setup is the schema and indexing. The Cassandra schema in services/cassandra/init/01-schema.cql defines the vector column.
Being a NoSQL database, Cassandra schemas are a bit different to normal SQL databases, so it’s worth taking a closer look. This Cassandra schema is designed to support Retrieval-Augmented Generation (RAG) architectures, which combine information retrieval with generative models to answer queries using both stored data and generative AI. Here’s how the schema supports RAG:
- Keyspace and table structure
  - Keyspace (vectorembeds): Analogous to a database, this isolates all RAG-related tables and data.
  - Table (articles): Stores retrievable knowledge sources (e.g., articles) for use in generation.
- Table columns
  - url TEXT PRIMARY KEY: Uniquely identifies each article/document, useful for referencing and deduplication.
  - title TEXT and body TEXT: Store the actual content and metadata, which may be retrieved and passed to the generative model during RAG.
  - publication_date TIMESTAMP: Enables filtering or ranking based on recency.
  - vector VECTOR<FLOAT, 384>: Stores the embedding representation of the article. The new Cassandra vector data type is documented here.
- Indexing
  - Sets up an Approximate Nearest Neighbor (ANN) index using Cassandra’s Storage Attached Index.
More information about Cassandra vector support is in the documentation.
Step 2: Vector search and retrieval
The retrieval logic in services/cassandra/cassandra_dao.py showcases the elegance of Cassandra’s vector search capabilities.
The code to create the query embeddings and perform the query is similar to the previous examples, but the CQL query to retrieve similar documents looks like this:
query_cql = """
    SELECT url, title, body, publication_date
    FROM articles
    ORDER BY vector ANN OF ?
    LIMIT ?
"""
prepared = self.client.prepare(query_cql)
rows = self.client.execute(prepared, (emb, limit))
What have we learned?
By exploring the code from this RAG benchmarking project we’ve seen distinct approaches to vector search. Here’s a summary of key takeaways:
- Critical steps in the process:
- Step 1: Embedding articles and inserting them into the vector databases.
- Step 2: Embedding queries and retrieving relevant articles from the database.
- Key design pattern:
- The DAO (Data Access Object) design pattern provides a clean, scalable way to support multiple databases.
- This approach could extend to other databases, such as OpenSearch, in the future.
- Additional insights:
- It’s possible to perform vector searches over the latest documents, pre-empting queries, and potentially speeding up the pipeline.
So far, we have only scratched the surface. The students built a complete benchmarking application with a GUI (using Streamlit); used multiple other interesting components (e.g. LangChain, LangGraph, FastAPI, and uvicorn); used Grafana and LangSmith for metrics and Claude to answer questions from the retrieved articles; and added Docker support for the components. They also revealed some preliminary performance results! Here’s what the final system looked like (this and the previous blog focused on the bottom boxes only).

In a future article, we will examine the rest of the application code, look at the preliminary performance results the students uncovered, and discuss what they tell us about the trade-offs between these different databases.
Ready to learn more right now? We have a wealth of resources on vector search. You can explore our blogs on ClickHouse vector search and Apache Cassandra Vector Search (here, here, and here) to deepen your understanding.
The post Vector search benchmarking: Embeddings, insertion, and searching documents with ClickHouse® and Apache Cassandra® appeared first on Instaclustr.
Optimizing Cassandra Repair for Higher Node Density
This is the fourth post in my series on improving the cost efficiency of Apache Cassandra through increased node density. In the last post, we explored compaction strategies, specifically the new UnifiedCompactionStrategy (UCS) which appeared in Cassandra 5.
- Streaming Throughput
- Compaction Throughput and Strategies
- Repair (you are here)
- Query Throughput
- Garbage Collection and Memory Management
- Efficient Disk Access
- Compression Performance and Ratio
- Linearly Scaling Subsystems with CPU Core Count and Memory
Now, we’ll tackle another aspect of Cassandra operations that directly impacts how much data you can efficiently store per node: repair. Having worked with repairs across hundreds of clusters since 2012, I’ve developed strong opinions on what works and what doesn’t when you’re pushing the limits of node density.
Building easy-cass-mcp: An MCP Server for Cassandra Operations
I’ve started working on a new project that I’d like to share, easy-cass-mcp, an MCP (Model Context Protocol) server specifically designed to assist Apache Cassandra operators.
After spending over a decade optimizing Cassandra clusters in production environments, I’ve seen teams consistently struggle with how to interpret system metrics, configuration settings, schema design, and system configuration, and most importantly, how to understand how they all impact each other. While many teams have solid monitoring through JMX-based collectors, extracting and contextualizing specific operational metrics for troubleshooting or optimization can still be cumbersome. The good news is that we now have the infrastructure to make all this operational knowledge accessible through conversational AI.
easy-cass-stress Joins the Apache Cassandra Project
I’m taking a quick break from my series on Cassandra node density to share some news with the Cassandra community: easy-cass-stress has officially been donated to the Apache Software Foundation and is now part of the Apache Cassandra project ecosystem as cassandra-easy-stress.
Why This Matters
Over the past decade, I’ve worked with countless teams struggling with Cassandra performance testing and benchmarking. The reality is that stress testing distributed systems requires tools that can accurately simulate real-world workloads. Many tools make this difficult by requiring the end user to learn complex configurations and nuance. While consulting at The Last Pickle, I set out to create an easy-to-use tool that lets people get up and running in just a few minutes.
Azure fault domains vs availability zones: Achieving zero downtime migrations
Among the challenges of operating production-ready enterprise systems in the cloud is ensuring applications remain up to date, secure, and able to benefit from the latest features. This can include operating system or application version upgrades, but it also extends to advancements in cloud provider offerings and the retirement of older ones. Recently, NetApp Instaclustr undertook a migration activity, moving (almost) all our Azure fault domain customers to availability zones and away from Basic SKU IP addresses.
Understanding Azure fault domains vs availability zones
“Azure fault domain vs availability zone” reflects a critical distinction in ensuring high availability and fault tolerance. Fault domains offer physical separation within a data center, while availability zones expand on this by distributing workloads across data centers within a region. This enhances resiliency against failures, making availability zones a clear step forward.
The need for migrating from fault domains to availability zones
NetApp Instaclustr has supported Azure as a cloud provider for our managed open source offerings since 2016. Originally this offering was distributed across fault domains to ensure high availability, using “Basic SKU public IP addresses”, but this solution had some drawbacks when performing particular types of maintenance. Once released by Azure in several regions, we extended our Azure support to availability zones, which have a number of benefits including more explicit placement of additional resources, and we leveraged “Standard SKU public IPs” as part of this deployment.
When we introduced availability zones, we encouraged customers to provision new workloads in them. We also supported migrating workloads to availability zones, but we had not pushed existing deployments to do the migration. This was initially due to the reduced number of regions that supported availability zones.
In early 2024, we were notified that Azure would be retiring support for Basic SKU public IP addresses in September 2025. Notably, no new Basic SKU public IPs would be created after March 1, 2025. For us and our customers, this had the potential to impact cluster availability and stability – as we would be unable to add nodes, and some replacement operations would fail.
Very quickly we identified that we needed to migrate all customer deployments from Basic SKU to Standard SKU public IPs. Unfortunately, this operation involves node-level downtime, as we needed to stop each individual virtual machine, detach the IP address, upgrade the IP address to the new SKU, and then reattach it and start the instance. For customers who are operating their applications in line with our recommendations, node-level downtime does not impact overall application availability; however, it can increase strain on the remaining nodes.
Given that we needed to perform this potentially disruptive maintenance by a specific date, we decided to evaluate the migration of existing customers to Azure availability zones.
Key migration considerations for Cassandra clusters
As with any migration, we were looking to perform this with zero application downtime, minimal additional infrastructure cost, and as safely as possible. For some customers, we also needed to ensure that we did not change the contact IP addresses of the deployment, as this could require application updates on their side. We quickly worked out several ways to achieve this migration, each with its own set of pros and cons.
For our Cassandra customers, our go-to method for changing cluster topology is a data center migration. This is our zero-downtime migration method that we have completed hundreds of times and have vast experience in executing. The benefit here is that we can be extremely confident of application uptime through the entire operation, and confident in the ability to pause and reverse the migration if issues are encountered. The major drawback to a data center migration is the increased infrastructure cost during the migration period, as you effectively need to have both your source and destination data centers running simultaneously throughout the operation. The other item of note is that you will need to update your cluster contact points to the new data center.
For clusters running other applications, or customers who are more cost conscious, we evaluated doing a “node by node” migration from Basic SKU IP addresses in fault domains to Standard SKU IP addresses in availability zones. This does not have any short-term increased infrastructure cost; however, the upgrade from Basic SKU public IP to Standard SKU is irreversible, and different types of public IPs cannot coexist within the same fault domain. Additionally, this method comes with reduced rollback abilities. Therefore, we needed to devise a plan to minimize risks for our customers and ensure a seamless migration.
Developing a zero-downtime node-by-node migration strategy
To achieve a zero-downtime “node by node” migration, we explored several options, one of which involved building tooling to migrate the instances in the cloud provider but preserve all existing configurations. The tooling automates the migration process as follows:
- Begin with stopping the first VM in the cluster. For cluster availability, ensure that only 1 VM is stopped at any time.
- Create an OS disk snapshot and verify its success, then do the same for data disks
- Ensure all snapshots are created and generate new disks from snapshots
- Create a new network interface card (NIC) and confirm its status is green
- Create a new VM and attach the disks, confirming that the new VM is up and running
- Update the private IP address and verify the change
- The public IP SKU will then be upgraded, making sure this operation is successful
- The public IP will then be reattached to the VM
- Start the VM
Even though the disks are created from snapshots of the original disks, our testing uncovered several discrepancies between the settings of the original VM and the new VM. For instance, certain configurations, such as caching policies, did not automatically carry over, requiring manual adjustments to align with our managed standards.
Recognizing these challenges, we decided to extend our existing node replacement mechanism to streamline our migration process. This is done so that a new instance is provisioned with a new OS disk with the same IP and application data. The new node is configured by the Instaclustr Managed Platform to be the same as the original node.
The next challenge: our existing solution was built so that the replacement node was provisioned to be exactly the same as the original. However, for this operation we needed the new node to be placed in an availability zone instead of the same fault domain. This required us to extend the replacement operation so that when we triggered the replacement, the new node was placed in the desired availability zone. Once this work was complete, we had a replacement tool that ensured the new instance was correctly provisioned in the availability zone, with a Standard SKU, and without data loss.
Now that we had two very viable options, we went back to our existing Azure customers to outline the problem space, and the operations that needed to be completed. We worked with all impacted customers on the best migration path for their specific use case or application and worked out the best time to complete the migration. Where possible, we first performed the migration on any test or QA environments before moving onto production environments.
Collaborative customer migration success
Some of our Cassandra customers opted to perform the migration using our data center migration path; however, most customers opted for the node-by-node method. We successfully migrated the existing Azure fault domain clusters over to the availability zones we were targeting, with only a very small number of clusters remaining. These clusters are operating in Azure regions which do not yet support availability zones, but we were able to successfully upgrade their public IPs from Basic SKUs, which are set for retirement, to Standard SKUs.
No matter what provider you use, the pace of development in cloud computing can require significant effort to support ongoing maintenance and feature adoption to take advantage of new opportunities. For business-critical applications, being able to migrate to new infrastructure and leverage these opportunities while understanding the limitations and impact they have on other services is essential.
NetApp Instaclustr has a depth of experience in supporting business-critical applications in the cloud. You can read more about another large-scale migration we completed, The World’s Largest Apache Kafka and Apache Cassandra Migration, or head over to our console for a free trial of the Instaclustr Managed Platform.
The post Azure fault domains vs availability zones: Achieving zero downtime migrations appeared first on Instaclustr.
Integrating support for AWS PrivateLink with Apache Cassandra® on the NetApp Instaclustr Managed Platform
Discover how NetApp Instaclustr leverages AWS PrivateLink for secure and seamless connectivity with Apache Cassandra®. This post explores the technical implementation, challenges faced, and the innovative solutions we developed to provide a robust, scalable platform for your data needs.
Last year, NetApp achieved a significant milestone by fully integrating AWS PrivateLink support for Apache Cassandra® into the NetApp Instaclustr Managed Platform. Read our AWS PrivateLink support for Apache Cassandra General Availability announcement here. Our Product Engineering team made remarkable progress in incorporating this feature into various NetApp Instaclustr application offerings. NetApp now offers AWS PrivateLink support as an Enterprise Feature add-on for the Instaclustr Managed Platform for Cassandra, Kafka®, OpenSearch®, Cadence®, and Valkey.
The journey to support AWS PrivateLink for Cassandra involved considerable engineering effort and numerous development cycles to create a solution tailored to the unique interaction between the Cassandra application and its client driver. After extensive development and testing, our product engineering team successfully implemented an enterprise ready solution. Read on for detailed insights into the technical implementation of our solution.
What is AWS PrivateLink?
PrivateLink is a networking solution from AWS that provides private connectivity between Virtual Private Clouds (VPCs) without exposing any traffic to the public internet. This solution is ideal for customers who require a unidirectional network connection (often due to compliance concerns), ensuring that connections can only be initiated from the source VPC to the destination VPC. Additionally, PrivateLink simplifies network management by eliminating the need to manage overlapping CIDRs between VPCs. The one-way connection allows connections to be initiated only from the source VPC to the managed cluster hosted in our platform (target VPC)—and not the other way around.
To get an idea of what major building blocks are involved in making up an end-to-end AWS PrivateLink solution for Cassandra, take a look at the following diagram—it’s a simplified representation of the infrastructure used to support a PrivateLink cluster:

In this example, we have a 3-node Cassandra cluster at the far right with one Cassandra node per Availability Zone (or AZ). Next, we have the VPC Endpoint Service and a Network Load Balancer (NLB). The Endpoint Service is essentially the AWS PrivateLink, and by design AWS needs it to be backed by an NLB–that’s pretty much what we have to manage on our side.
On the customer side, they must create a VPC Endpoint that enables them to privately connect to the AWS PrivateLink on our end; naturally, customers will also have to use a Cassandra client(s) to connect to the cluster.
AWS PrivateLink support with Instaclustr for Apache Cassandra
While incorporating AWS PrivateLink support with Instaclustr for Apache Cassandra on our platform, we came across a few technical challenges. First and foremost, the primary challenge was relatively straightforward: Cassandra clients need to talk to each individual node in a cluster.
However, the problem is that nodes in an AWS PrivateLink cluster are only assigned private IPs; that is what the nodes would announce by default when Cassandra clients attempt to discover the topology of the cluster. Cassandra clients cannot do much with the received private IPs as they cannot be used to connect to the nodes directly in an AWS PrivateLink setup.
We devised a plan of attack to get around this problem:
- Make each individual Cassandra node listen for CQL queries on unique ports.
- Configure the NLB so it can route traffic to the appropriate node based on the relevant unique port.
- Let clients implement the AddressTranslator interface from the Cassandra driver. The custom address translator will need to translate the received private IPs to one of the VPC Endpoint Elastic Network Interface (or ENI) IPs without altering the corresponding unique ports.
To understand this approach better, consider the following example:
Suppose we have a 3-node Cassandra cluster. According to the proposed approach, we will need to do the following:
- Let the nodes listen on ports 172.16.0.1:6001 (in AZ1), 172.16.0.2:6002 (in AZ2), and 172.16.0.3:6003 (in AZ3)
- Configure the NLB to listen on the same set of ports
- Define and associate target groups based on the port. For instance, the listener on port 6002 will be associated with a target group containing only the node that is listening on port 6002.
- As for how the custom address translator is expected to work, let’s assume the VPC Endpoint ENI IPs are 192.168.0.1 (in AZ1), 192.168.0.2 (in AZ2), and 192.168.0.3 (in AZ3). The address translator should translate received addresses like so (see the sketch after this list):
  - 172.16.0.1:6001 --> 192.168.0.1:6001
  - 172.16.0.2:6002 --> 192.168.0.2:6002
  - 172.16.0.3:6003 --> 192.168.0.3:6003
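Here is a minimal, hedged sketch of this (since superseded) address-translation idea using the Python driver’s AddressTranslator. The IP mapping mirrors the example above and is hard-coded purely for illustration; a real deployment would discover the VPC Endpoint ENI IPs dynamically. Note that the Python driver translates only the IP, whereas the Java driver’s AddressTranslator (referenced in this post) carries the port as well, which the unique-port scheme relies on.

from cassandra.policies import AddressTranslator

# Static mapping from node private IPs to VPC Endpoint ENI IPs (illustrative only).
PRIVATE_TO_ENI = {
    "172.16.0.1": "192.168.0.1",  # AZ1
    "172.16.0.2": "192.168.0.2",  # AZ2
    "172.16.0.3": "192.168.0.3",  # AZ3
}

class PrivateLinkTranslator(AddressTranslator):
    def translate(self, addr):
        # Fall back to the original address if there is no mapping for it.
        return PRIVATE_TO_ENI.get(addr, addr)

# Hypothetical usage (contact point and port are placeholders):
# from cassandra.cluster import Cluster
# cluster = Cluster(["192.168.0.1"], port=6001, address_translator=PrivateLinkTranslator())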
The proposed approach not only solves the connectivity problem but also allows for connecting to appropriate nodes based on query plans generated by load balancing policies.
Around the same time, we came up with a slightly modified approach as well: we realized the need for address translation can be mostly mitigated if we make the Cassandra nodes return the VPC Endpoint ENI IPs in the first place.
But the excitement did not last for long! Why? Because we quickly discovered a key problem: an AWS NLB supports a maximum of just 50 listeners.
While 50 is certainly a decent limit, the way we designed our solution meant we wouldn’t be able to provision a cluster with more than 50 nodes. This was quickly deemed to be an unacceptable limitation as it is not uncommon for a cluster to have more than 50 nodes; many Cassandra clusters in our fleet have hundreds of nodes. We had to abandon the idea of address translation and started thinking about alternative solution approaches.
Introducing Shotover Proxy
We were disappointed but did not lose hope. Soon after, we devised a practical solution centred around using one of our open source products: Shotover Proxy.
Shotover Proxy is used with Cassandra clusters to support AWS PrivateLink on the Instaclustr Managed Platform. What is Shotover Proxy, you ask? Shotover is a layer 7 database proxy built to allow developers, admins, DBAs, and operators to modify in-flight database requests. By managing database requests in transit, Shotover gives NetApp Instaclustr customers AWS PrivateLink’s simple and secure network setup with the many benefits of Cassandra.
Below is an updated version of the previous diagram that introduces some Shotover nodes in the mix:

As you can see, each AZ now has a dedicated Shotover proxy node.
In the above diagram, we have a 6-node Cassandra cluster. The Cassandra cluster sitting behind the Shotover nodes is an ordinary Private Network Cluster. The role of the Shotover nodes is to manage client requests to the Cassandra nodes while masking the real Cassandra nodes behind them. To the Cassandra client, the Shotover nodes appear to be Cassandra nodes, and it is only them that make up the entire cluster! This is the secret recipe for AWS PrivateLink for Instaclustr for Apache Cassandra that enabled us to get past the challenges discussed earlier.
So how is this model made to work?
Shotover can alter certain requests from—and responses to—the client. It can examine the tokens allocated to the Cassandra nodes in its own AZ (aka rack) and claim to be the owner of all those tokens. This essentially makes each Shotover node appear to be an aggregation of the nodes in its own rack.
Given the purposely crafted topology and token allocation metadata, while the client directs queries to the Shotover node, the Shotover node in turn can pass them on to the appropriate Cassandra node and then transparently send responses back. It is worth noting that the Shotover nodes themselves do not store any data.
Because we only have 1 Shotover node per AZ in this design and there may be at most about 5 AZs per region, we only need that many listeners in the NLB to make this mechanism work. As such, the 50-listener limit on the NLB was no longer a problem.
The use of Shotover to manage client driver and cluster interoperability may sound straightforward to implement, but developing it was a year-long undertaking. As described above, the initial months of development were devoted to engineering CQL queries on unique ports and the AddressTranslator interface from the Cassandra driver to gracefully manage client connections to the Cassandra cluster. While this solution did successfully provide support for AWS PrivateLink with a Cassandra cluster, we knew that the 50-listener limit on the NLB was a barrier to adoption and wanted to provide our customers with a solution that could be used for any Cassandra cluster, regardless of node count.
The next few months of engineering were then devoted to a proof of concept of an alternative solution, with the goal of investigating how Shotover could manage client requests for a Cassandra cluster with any number of nodes. Once a solution supporting a cluster with any number of nodes was successfully proven, subsequent effort was devoted to stability testing, the result of that engineering being the stable solution described above.
We have also conducted performance testing to evaluate the relative performance of a PrivateLink-enabled Cassandra cluster compared to its non-PrivateLink counterpart. Multiple iterations of performance testing were executed; adjustments to Shotover were identified from the test cases, resulting in PrivateLink-enabled Cassandra cluster throughput and latency measuring close to that of a standard Cassandra cluster.
Related content: Read more about creating an AWS PrivateLink-enabled Cassandra cluster on the Instaclustr Managed Platform
The following was our experimental setup for identifying the maximum throughput, in operations per second, of a PrivateLink-enabled Cassandra cluster in comparison to a non-PrivateLink Cassandra cluster:
- Baseline node size: i3en.xlarge
- Shotover Proxy node size on Cassandra cluster: CSO-PRD-c6gd.medium-54
- Cassandra version: 4.1.3
- Shotover Proxy version: 0.2.0
- Other configuration: repair and backup disabled, client encryption disabled
Across different cluster sizes, we observed no significant difference in operation throughput between PrivateLink and non-PrivateLink configurations.
Latency results
Latency benchmarks were conducted at ~70% of the observed peak throughput (as above) to simulate realistic production traffic.
- Mixed-small (3 nodes), 11630 ops/second:
  - Non-PrivateLink: mean 9.90 ms, median 3.2 ms, P95 53.7 ms, P99 119.4 ms
  - PrivateLink: mean 9.50 ms, median 3.6 ms, P95 48.4 ms, P99 118.8 ms
- Mixed-small (6 nodes), 23510 ops/second:
  - Non-PrivateLink: mean 6 ms, median 2.3 ms, P95 27.2 ms, P99 79.4 ms
  - PrivateLink: mean 9.10 ms, median 3.4 ms, P95 45.4 ms, P99 104.9 ms
- Mixed-small (9 nodes), 36255 ops/second:
  - Non-PrivateLink: mean 5.5 ms, median 2.4 ms, P95 21.8 ms, P99 67.6 ms
  - PrivateLink: mean 11.9 ms, median 2.7 ms, P95 77.1 ms, P99 141.2 ms
Results indicate that for lower to mid-tier throughput levels, AWS PrivateLink introduced minimal to negligible overhead. However, at higher operation rates, we observed increased latency, most notably at the p99 mark—likely due to network level factors or Shotover.
The increase in latency is expected as AWS PrivateLink introduces an additional hop to route traffic securely, which can impact latencies, particularly under heavy load. For the vast majority of applications, the observed latencies remain within acceptable ranges. However, for latency-sensitive workloads, we recommend adding more nodes (for high load cases) to help mitigate the impact of the additional network hop introduced by PrivateLink.
As with any generic benchmarking results, performance may vary depending on specific data model, workload characteristics, and environment. The results presented here are based on specific experimental setup using standard configurations and should primarily be used to compare the relative performance of PrivateLink vs. Non-PrivateLink networking under similar conditions.
Why choose AWS PrivateLink with NetApp Instaclustr?
NetApp’s commitment to innovation means you benefit from cutting-edge technology combined with ease of use. With AWS PrivateLink support on our platform, customers gain:
- Enhanced security: All traffic stays private, never touching the internet.
- Simplified networking: No need to manage complex CIDR overlaps.
- Enterprise scalability: Handles sizable clusters effortlessly.
By addressing challenges such as the NLB listener cap and private-to-VPC IP translation, we’ve created a solution that balances efficiency, security, and scalability.
Experience PrivateLink today
The integration of AWS PrivateLink with Apache Cassandra® is now generally available with production-ready SLAs for our customers. Log in to the Console to create a Cassandra cluster with support for AWS PrivateLink in just a few clicks today. Whether you’re managing sensitive workloads or demanding performance at scale, this feature delivers unmatched value.
Want to see it in action? Book a free demo today and experience the Shotover-powered magic of AWS PrivateLink firsthand.
Resources
- Getting started: Visit the documentation to learn how to create an AWS PrivateLink-enabled Apache Cassandra cluster on the Instaclustr Managed Platform.
- Connecting clients: Already created a Cassandra cluster with AWS PrivateLink? Click here to read about how to connect Cassandra clients in one VPC to an AWS PrivateLink-enabled Cassandra cluster on the Instaclustr Platform (see the illustrative connection sketch after this list).
- General availability announcement: For more details, read our General Availability announcement on AWS PrivateLink support for Cassandra.
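The linked documentation is the authoritative guide for connecting clients; purely as an illustration of what connecting through a single PrivateLink endpoint can look like from the client side, here is a hedged Java driver sketch. The endpoint hostname, port, datacenter name, and credentials below are placeholders, not values from the Instaclustr platform.

```java
import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.cql.Row;

import java.net.InetSocketAddress;

// Hypothetical example only: the contact point, port, datacenter name, and credentials
// are placeholders; follow the Instaclustr documentation for the actual values.
public class PrivateLinkClient {
    public static void main(String[] args) {
        try (CqlSession session = CqlSession.builder()
                .addContactPoint(new InetSocketAddress(
                        "vpce-0123-example.vpce-svc-example.us-east-1.vpce.amazonaws.com", 9042))
                .withLocalDatacenter("AWS_VPC_US_EAST_1")        // placeholder datacenter name
                .withAuthCredentials("iccassandra", "changeme")  // placeholder credentials
                .build()) {
            Row row = session.execute("SELECT release_version FROM system.local").one();
            System.out.println("Connected, Cassandra version: " + row.getString("release_version"));
        }
    }
}
```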
The post Integrating support for AWS PrivateLink with Apache Cassandra® on the NetApp Instaclustr Managed Platform appeared first on Instaclustr.
Compaction Strategies, Performance, and Their Impact on Cassandra Node Density
This is the third post in my series on optimizing Apache Cassandra for maximum cost efficiency through increased node density. In the first post, I examined how streaming operations impact node density and laid out the groundwork for understanding why higher node density leads to significant cost savings. In the second post, I discussed how compaction throughput is critical to node density and introduced the optimizations we implemented in CASSANDRA-15452 to improve throughput on disaggregated storage like EBS.
Cassandra Compaction Throughput Performance Explained
This is the second post in my series on improving node density and lowering costs with Apache Cassandra. In the previous post, I examined how streaming performance impacts node density and operational costs. In this post, I’ll focus on compaction throughput, and a recent optimization in Cassandra 5.0.4 that significantly improves it, CASSANDRA-15452.
This post assumes some familiarity with Apache Cassandra storage engine fundamentals. The documentation has a nice section covering the storage engine if you’d like to brush up before reading this post.
How Cassandra Streaming, Performance, Node Density, and Cost Are All Related
This is the first post of several I have planned on optimizing Apache Cassandra for maximum cost efficiency. I’ve spent over a decade working with Cassandra and have spent tens of thousands of hours data modeling, fixing issues, writing tools for it, and analyzing its performance. I’ve always been fascinated by database performance tuning, even before Cassandra.
A decade ago I filed one of my first issues with the project, where I laid out my target goal of 20TB of data per node. This wasn’t possible for most workloads at the time, but I’ve kept this target in my sights.
Cassandra 5 Released! What's New and How to Try it
Apache Cassandra 5.0 has officially landed! This highly anticipated release brings a range of new features and performance improvements to one of the most popular NoSQL databases in the world. Having recently hosted a webinar covering the major features of Cassandra 5.0, I’m excited to give a brief overview of the key updates and show you how to easily get hands-on with the latest release using easy-cass-lab.
You can grab the latest release on the Cassandra download page.
easy-cass-lab v5 released
I’ve got some fun news to start the week off for users of easy-cass-lab: I’ve just released version 5. There are a number of nice improvements and bug fixes in here that should make it more enjoyable, more useful, and lay groundwork for some future enhancements.
- When the cluster starts, we wait for the storage service to reach NORMAL state, then move to the next node. This is in contrast to the previous behavior, where we waited for 2 minutes after starting a node. This queries JMX directly using Swiss Java Knife and is more reliable than the 2-minute method. Please see packer/bin-cassandra/wait-for-up-normal to read through the implementation (a rough sketch of this kind of JMX polling follows this list).
- Trunk now works correctly. Unfortunately, AxonOps doesn’t support trunk (5.1) yet, and using the agent was causing a startup error. You can test trunk out, but for now the AxonOps integration is disabled.
- Added a new repl mode. This saves keystrokes, provides some auto-complete functionality, and keeps SSH connections open. If you’re going to do a lot of work with ECL, this will help you be a little more efficient. You can try this out with ecl repl.
- Power user feature: Initial support for profiles in AWS regions other than us-west-2. We only provide AMIs for us-west-2, but you can now set up a profile in an alternate region and build the required AMIs using easy-cass-lab build-image. This feature is still under development and requires using an easy-cass-lab build from source. Credit to Jordan West for contributing this work.
- Power user feature: Support for multiple profiles. Setting the EASY_CASS_LAB_PROFILE environment variable allows you to configure alternate profiles. This is handy if you want to use multiple regions or have multiple organizations.
- The project now uses Kotlin instead of Groovy for Gradle configuration.
- Updated Gradle to 8.9.
- When using the list command, don’t show the alias “current”.
- Project cleanup: removed the old unused pssh, cassandra build, and async profiler subprojects.
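For readers curious what "querying JMX directly" for the NORMAL state involves, here is a rough, hypothetical Java sketch that polls Cassandra's StorageService MBean. The real easy-cass-lab implementation uses Swiss Java Knife in packer/bin-cassandra/wait-for-up-normal, so treat this only as an approximation of the idea; the host argument and polling interval are assumptions.

```java
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

// Hypothetical sketch: poll Cassandra's StorageService MBean over JMX until the node
// reports NORMAL (i.e., it has joined the ring), roughly the check easy-cass-lab performs.
public class WaitForNormal {
    public static void main(String[] args) throws Exception {
        String host = args.length > 0 ? args[0] : "127.0.0.1"; // node to poll (assumption)
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://" + host + ":7199/jmxrmi"); // default Cassandra JMX port
        ObjectName storageService = new ObjectName("org.apache.cassandra.db:type=StorageService");

        while (true) {
            try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
                MBeanServerConnection mbs = connector.getMBeanServerConnection();
                String mode = (String) mbs.getAttribute(storageService, "OperationMode");
                System.out.println(host + " operation mode: " + mode);
                if ("NORMAL".equals(mode)) {
                    return; // node has joined the ring and is serving traffic
                }
            } catch (Exception e) {
                System.out.println("JMX not reachable yet: " + e.getMessage());
            }
            Thread.sleep(5_000); // poll every 5 seconds (arbitrary interval)
        }
    }
}
```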
The release has been published to the project’s GitHub page and to Homebrew. The project is largely driven by my own consulting needs and my training. If you’re looking to have some features prioritized, please reach out and we can discuss a consulting engagement.