Real-Time Write Heavy Database Workloads: Considerations & Tips

Let's look at the performance-related complexities that teams commonly face with write-heavy workloads and discuss your options for tackling them.

Write-heavy database workloads bring a distinctly different set of challenges than read-heavy ones. For example:

  • Scaling writes can be costly, especially if you pay per operation and writes are 5X more costly than reads
  • Locking can add delays and reduce throughput
  • I/O bottlenecks can lead to write amplification and complicate crash recovery
  • Database backpressure can throttle the incoming load

While cost matters – quite a lot, in many cases – it's not a topic we want to cover here. Rather, let's focus on the performance-related complexities that teams commonly face and discuss your options for tackling them.

What Do We Mean by "a Real-Time Write Heavy Workload"?

First, let's clarify what we mean by a "real-time write-heavy" workload. We're talking about workloads that:

  • Ingest a large amount of data (e.g., over 50K OPS)
  • Involve more writes than reads
  • Are bound by strict latency SLAs (e.g., single-digit millisecond P99 latency)

In the wild, they occur across everything from online gaming to real-time stock exchanges. A few specific examples:

Internet of Things (IoT) workloads tend to involve small but frequent append-only writes of time series data. Here, the ingestion rate is primarily determined by the number of endpoints collecting data. Think of smart home sensors or industrial monitoring equipment constantly sending data streams to be processed and stored.

Logging and Monitoring systems also deal with frequent data ingestion, but they don't have a fixed ingestion rate. They are not necessarily append-only, and they can be prone to hotspots, such as when one endpoint misbehaves.

Online Gaming platforms need to process real-time user interactions, including game state changes, player actions, and messaging. The workload tends to be spiky, with sudden surges in activity. They're extremely latency sensitive since even small delays can impact the gaming experience.

E-commerce and Retail workloads are typically update-heavy and often involve batch processing. These systems must maintain accurate inventory levels, process customer reviews, track order status, and manage shopping cart operations, which usually require reading existing data before making updates.

Ad Tech and Real-time Bidding systems require split-second decisions. These systems handle complex bid processing, including impression tracking and auction results, while simultaneously monitoring user interactions such as clicks and conversions. They must also detect fraud in real time and manage sophisticated audience segmentation for targeted advertising.

Real-time Stock Exchange systems must support high-frequency trading operations, constant stock price updates, and complex order matching processes – all while maintaining absolute data consistency and minimal latency.

Next, let's look at key architectural and configuration considerations that impact write performance.

Storage Engine Architecture

The choice of storage engine architecture fundamentally impacts write performance in databases. Two primary approaches exist: LSM trees and B-trees.

Databases known to handle writes efficiently – such as ScyllaDB, Apache Cassandra, HBase, and Google BigTable – use Log-Structured Merge Trees (LSM). This architecture is ideal for handling large volumes of writes: because writes are immediately appended to memory, initial storage is very fast. Once the "memtable" in memory fills up, the recent writes are flushed to disk in sorted order, which reduces the need for random I/O. For example, here's what the ScyllaDB write path looks like:

With B-tree structures, each write operation requires locating and modifying a node in the tree – and that involves both sequential and random I/O. As the dataset grows, the tree can require additional nodes and rebalancing, leading to more disk I/O, which can impact performance. B-trees are generally better suited for workloads involving joins and ad-hoc queries.

Payload Size

Payload size also impacts performance. With small payloads, throughput is good and CPU processing is the primary bottleneck. As the payload size increases, overall throughput drops and disk utilization rises. Ultimately, a small write usually fits in all the buffers and everything can be processed quite quickly – that's why it's easy to get high throughput. Larger payloads require larger (or multiple) buffers, and the larger the payloads, the more network and disk resources are required to service them.

Compression

Disk utilization is something to watch closely with a write-heavy workload. Although storage is continuously becoming cheaper, it's still not free. Compression can help keep things in check – so choose your compression strategy wisely. Faster compression speeds are important for write-heavy workloads, but also consider your available CPU and memory resources.

Be sure to look at the compression chunk size parameter. Compression splits your data into smaller blocks (or chunks) and then compresses each block separately. When tuning this setting, keep in mind that larger chunks are better for reads while smaller ones are better for writes, and take your payload size into consideration.

Compaction

For LSM-based databases, the compaction strategy you select also influences write performance. Compaction involves merging multiple SSTables into fewer, more organized files to optimize read performance, reclaim disk space, reduce data fragmentation, and maintain overall system efficiency.

When selecting a compaction strategy, you could aim for low read amplification, which makes reads as efficient as possible. Or, you could aim for low write amplification by preventing compaction from being too aggressive. Or, you could prioritize low space amplification and have compaction purge data as efficiently as possible. For example, ScyllaDB offers several compaction strategies (and Cassandra offers similar ones):

  • Size-tiered compaction strategy (STCS): Triggered when the system has enough (four by default) similarly sized SSTables.
  • Leveled compaction strategy (LCS): The system uses small, fixed-size (by default 160 MB) SSTables distributed across different levels.
  • Incremental compaction strategy (ICS): Shares the same read and write amplification factors as STCS, but fixes STCS's 2x temporary space amplification issue by breaking huge SSTables into SSTable runs, which are comprised of a sorted set of smaller (1 GB by default), non-overlapping SSTables.
  • Time-window compaction strategy (TWCS): Designed for time series data.

For write-heavy workloads, we warn users to avoid leveled compaction at all costs. That strategy is designed for read-heavy use cases. Using it can result in a regrettable 40x write amplification.
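To make the compression and compaction knobs above concrete, here is a hedged CQL sketch. The keyspace and table (metrics.samples) are invented for illustration, and option names can differ slightly between ScyllaDB and Cassandra versions, so check your documentation before copying:

-- Smaller compression chunks tend to favor a write-heavy, small-payload pattern.
ALTER TABLE metrics.samples
  WITH compression = {'sstable_compression': 'LZ4Compressor', 'chunk_length_in_kb': 4};

-- Time series data is usually a good match for time-window compaction.
ALTER TABLE metrics.samples
  WITH compaction = {'class': 'TimeWindowCompactionStrategy',
                     'compaction_window_unit': 'DAYS',
                     'compaction_window_size': 1};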
Batching

In databases like ScyllaDB and Cassandra, batching can actually be a bit of a trap – especially for write-heavy workloads. If you're used to relational databases, batching might seem like a good option for handling a high volume of writes. But it can actually slow things down if it's not done carefully. Mainly, that's because large or unstructured batches end up creating a lot of coordination and network overhead between nodes – which is really not what you want in a distributed database like ScyllaDB.

Here's how to think about batching when you're dealing with heavy writes:

  • Batch by the partition key: Group your writes by the partition key so the batch goes to a coordinator node that also owns the data. That way, the coordinator doesn't have to reach out to other nodes for extra data. Instead, it just handles its own, which cuts down on unnecessary network traffic.
  • Keep batches small and targeted: Breaking up large batches into smaller ones by partition keeps things efficient. It avoids overloading the network and lets each node work on only the data it owns. You still get the benefits of batching, but without the overhead that can bog things down.
  • Stick to unlogged batches: Provided you follow the earlier points, it's best to use unlogged batches. Logged batches add extra consistency checks, which can really slow down the write.

So, if you're in a write-heavy situation, structure your batches carefullyly to avoid the delays that big, cross-node batches can introduce – a minimal sketch follows below.
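For illustration only, reusing the hypothetical metrics.samples table from the earlier sketch, a small single-partition unlogged batch could look like this in CQL:

BEGIN UNLOGGED BATCH
  INSERT INTO metrics.samples (sensor_id, ts, value) VALUES (42, '2025-01-01 00:00:00', 1.0);
  INSERT INTO metrics.samples (sensor_id, ts, value) VALUES (42, '2025-01-01 00:00:01', 1.2);
  INSERT INTO metrics.samples (sensor_id, ts, value) VALUES (42, '2025-01-01 00:00:02', 1.1);
APPLY BATCH;

Because every row shares the same partition key (sensor_id = 42), the whole batch can be handled by the replicas that own that partition, avoiding the cross-node coordination described above.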
Wrapping Up

We offered quite a few warnings, but don't worry. It was easy to compile a list of lessons learned because so many teams are extremely successful working with real-time write-heavy workloads. Now you know many of their secrets, without having to experience their mistakes. 🙂

If you want to learn more, here are some firsthand perspectives from teams who tackled quite interesting write-heavy challenges:

  • Zillow: Consuming records from multiple data producers, which resulted in out-of-order writes that could lead to incorrect updates
  • Tractian: Preparing for 10X growth in high-frequency data writes from IoT devices
  • Fanatics: Heavy write operations like handling orders, shopping carts, and product updates for this online sports retailer

Also, take a look at the following video, where we go into even greater depth on these write-heavy challenges and walk you through what these workloads look like on ScyllaDB.

Inside Tripadvisor’s Real-Time Personalization with ScyllaDB + AWS

See the engineering behind real-time personalization at Tripadvisor's massive (and rapidly growing) scale.

What kind of traveler are you? Tripadvisor tries to assess this as soon as you engage with the site, then offer you increasingly relevant information on every click – within a matter of milliseconds. This personalization is powered by advanced ML models acting on data that's stored on ScyllaDB running on AWS.

In this article, Dean Poulin (Tripadvisor Data Engineering Lead on the AI Service and Products team) provides a look at how they power this personalization. Dean shares a taste of the technical challenges involved in delivering real-time personalization at Tripadvisor's massive (and rapidly growing) scale. It's based on the following AWS re:Invent talk:

Pre-Trip Orientation

In Dean's words …

Let's start with a quick snapshot of who Tripadvisor is, and the scale at which we operate. Founded in 2000, Tripadvisor has become a global leader in travel and hospitality, helping hundreds of millions of travelers plan their perfect trips. Tripadvisor generates over $1.8 billion in revenue and is a publicly traded company on the NASDAQ stock exchange.

Today, we have a talented team of over 2,800 employees driving innovation, and our platform serves a staggering 400 million unique visitors per month – a number that's continuously growing. On any given day, our system handles more than 2 billion requests from 25 to 50 million users. Every click you make on Tripadvisor is processed in real time. Behind that, we're leveraging machine learning models to deliver personalized recommendations – getting you closer to that perfect trip.

At the heart of this personalization engine is ScyllaDB running on AWS. This allows us to deliver millisecond latency at a scale that few organizations reach. At peak traffic, we hit around 425K operations per second on ScyllaDB, with P99 latencies for reads and writes around 1-3 milliseconds.

I'll be sharing how Tripadvisor is harnessing the power of ScyllaDB, AWS, and real-time machine learning to deliver personalized recommendations for every user. We'll explore how we help travelers discover everything they need to plan their perfect trip: whether it's uncovering hidden gems, must-see attractions, unforgettable experiences, or the best places to stay and dine. This [article] is about the engineering behind that – how we deliver seamless, relevant content to users in real time, helping them find exactly what they're looking for as quickly as possible.

Personalized Trip Planning

Imagine you're planning a trip. As soon as you land on the Tripadvisor homepage, Tripadvisor already knows whether you're a foodie, an adventurer, or a beach lover – and you're seeing spot-on recommendations that seem personalized to your own interests. How does that happen within milliseconds?

As you browse around Tripadvisor, we start to personalize what you see using machine learning models which calculate scores based on your current and prior browsing activity. We recommend hotels and experiences that we think you would be interested in. We sort hotels based on your personal preferences. We recommend popular points of interest near the hotel you're viewing. These are all tuned based on your own personal preferences and prior browsing activity.

Tripadvisor's Model Serving Architecture

Tripadvisor runs on hundreds of independently scalable microservices in Kubernetes on-prem and in Amazon EKS. Our ML Model Serving Platform is exposed through one of these microservices.
This gateway service abstracts over 100 ML Models from the Client Services – which lets us run A/B tests to find the best models using our experimentation platform. The ML Models are primarily developed by our Data Scientists and Machine Learning Engineers using Jupyter Notebooks on Kubeflow. They're managed and trained using MLflow, and we deploy them on Seldon Core in Kubernetes. Our Custom Feature Store provides features to our ML Models, enabling them to make accurate predictions.

The Custom Feature Store

The Feature Store primarily serves User Features and Static Features. Static Features are stored in Redis because they don't change very often. We run data pipelines daily to load data from our offline data warehouse into our Feature Store as Static Features.

User Features are served in real time through a platform called Visitor Platform. We execute dynamic CQL queries against ScyllaDB, and we do not need a caching layer because ScyllaDB is so fast. Our Feature Store serves up to 5 million Static Features per second and half a million User Features per second.

What's an ML Feature?

Features are input variables to the ML Models that are used to make a prediction. There are Static Features and User Features. Some examples of Static Features are awards that a restaurant has won, or amenities offered by a hotel (like free Wi-Fi, pet friendly, or a fitness center).

User Features are collected in real time as users browse around the site. We store them in ScyllaDB so we can get lightning-fast queries. Some examples of User Features are the hotels viewed over the last 30 minutes, restaurants viewed over the last 24 hours, or reviews submitted over the last 30 days.

The Technologies Powering Visitor Platform

ScyllaDB is at the core of Visitor Platform. We use Java-based Spring Boot microservices to expose the platform to our clients, deployed on AWS ECS Fargate. We run Apache Spark on Kubernetes for our daily data retention jobs and our offline-to-online jobs, which load data from our offline data warehouse into ScyllaDB so that it's available on the live site. We also use Amazon Kinesis for processing streaming user tracking events.

The Visitor Platform Data Flow

The following graphic shows how data flows through our platform in four stages: produce, ingest, organize, and activate. Data is produced by our website and our mobile apps. Some of that data includes our Cross-Device User Identity Graph, behavior tracking events (like page views and clicks), and streaming events that go through Kinesis. Also, audience segmentation gets loaded into our platform.

Visitor Platform's microservices are used to ingest and organize this data. The data in ScyllaDB is stored in two keyspaces:

  • The Visitor Core keyspace, which contains the Visitor Identity Graph
  • The Visitor Metric keyspace, which contains Facts and Metrics (the things that people did as they browsed the site)

We use daily ETL processes to maintain and clean up the data in the platform. We produce Data Products, stamped daily, in our offline data warehouse – where they are available for other integrations and other data pipelines to use in their processing. Here's a look at Visitor Platform by the numbers:

Why Two Databases?

Our online database is focused on the real-time, live website traffic. ScyllaDB fills this role by providing very low latencies and high throughput.
We use short-term TTLs to prevent the data in the online database from growing indefinitely, and our data retention jobs ensure that we only keep user activity data for real visitors. Tripadvisor.com gets a lot of bot traffic, and we don't want to store bot data or try to personalize bots – so we delete and clean up all that data.

Our offline data warehouse retains historical data used for reporting, creating other data products, and training our ML Models. We don't want large-scale offline data processes impacting the performance of our live site, so we have two separate databases used for two different purposes.

Visitor Platform Microservices

We use 5 microservices for Visitor Platform:

  • Visitor Core manages the cross-device user identity graph based on cookies and device IDs.
  • Visitor Metric is our query engine, which gives us the ability to expose facts and metrics for specific visitors. We use a domain-specific language called Visitor Query Language, or VQL. For example, a VQL query can return the latest commerce click facts over the last three hours.
  • Visitor Publisher and Visitor Saver handle the write path, writing data into the platform. Besides saving data in ScyllaDB, we also stream data to the offline data warehouse. That's done with Amazon Kinesis.
  • Visitor Composite simplifies publishing data in batch processing jobs. It abstracts Visitor Saver and Visitor Core to identify visitors and publish facts and metrics in a single API call.

Roundtrip Microservice Latency

This graph illustrates how our microservice latencies remain stable over time. The average latency is only 2.5 milliseconds, and our P999 is under 12.5 milliseconds. This is impressive performance, especially given that we handle over 1 billion requests per day. Our microservice clients have strict latency requirements: 95% of the calls must complete in 12 milliseconds or less. If they go over that, we get paged and have to find out what's impacting the latencies.

ScyllaDB Latency

Here's a snapshot of ScyllaDB's performance over three days. At peak, ScyllaDB is handling 340,000 operations per second (including writes, reads, and deletes) and the CPU is hovering at just 21%. This is high scale in action! ScyllaDB delivers microsecond writes and millisecond reads for us. This level of blazing-fast performance is exactly why we chose ScyllaDB.

Partitioning Data into ScyllaDB

This image shows how we partition data into ScyllaDB. The Visitor Metric keyspace has two tables: Fact and Raw Metrics. The primary key on the Fact table is Visitor GUID, Fact Type, and Created At Date. The composite partition key is the Visitor GUID and Fact Type. The clustering key is Created At Date, which allows us to sort data in partitions by date. The attributes column contains a JSON object representing the event that occurred. Some example Facts are Search Terms, Page Views, and Bookings.
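Based only on the key structure Dean describes, a hedged CQL sketch of the Fact table could look roughly like the following (the column types, exact names, and keyspace are assumptions for illustration, not Tripadvisor's actual schema):

-- Hedged sketch of the Fact table described above; names and types are assumed.
CREATE TABLE visitor_metric.fact (
    visitor_guid uuid,
    fact_type    text,
    created_at   date,
    attributes   text,          -- JSON object describing the event
    PRIMARY KEY ((visitor_guid, fact_type), created_at)
) WITH CLUSTERING ORDER BY (created_at DESC);

With this shape, a query for one visitor's recent facts of a given type stays within a single partition, which is part of what keeps the read path fast.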
We use ScyllaDB's Leveled Compaction Strategy because:

  • It's optimized for range queries
  • It handles high cardinality very well
  • It's better for read-heavy workloads, and we have about 2-3X more reads than writes

Why ScyllaDB?

Our solution was originally built using Cassandra on-prem. But as the scale increased, so did the operational burden. It required dedicated operations support for us to manage database upgrades, backups, and so on. Also, our solution requires very low latencies for core components. Our User Identity Management system must identify the user within 30 milliseconds – and for the best personalization, we require our Event Tracking platform to respond in 40 milliseconds. It's critical that our solution doesn't block rendering the page, so our SLAs are very low.

With Cassandra, garbage collection impacted performance – primarily the tail latencies, the P999 and P9999 latencies. We ran a proof of concept with ScyllaDB and found the throughput to be much better than Cassandra, and the operational burden was eliminated. ScyllaDB gave us a monstrously fast live serving database with the lowest possible latencies.

We wanted a fully managed option, so we migrated from Cassandra to ScyllaDB Cloud, following a dual-write strategy. That allowed us to migrate with zero downtime while handling 40,000 operations or requests per second. Later, we migrated from ScyllaDB Cloud to ScyllaDB's "Bring Your Own Account" model, where the ScyllaDB team deploys the ScyllaDB database into your own AWS account. This gave us improved performance as well as better data privacy.

This diagram shows what ScyllaDB's BYOA deployment looks like. In the center of the diagram, you can see a 6-node ScyllaDB cluster running on EC2, plus two additional EC2 instances: ScyllaDB Monitor gives us Grafana dashboards as well as Prometheus metrics, and ScyllaDB Manager takes care of infrastructure automation like triggering backups and repairs. With this deployment, ScyllaDB could be co-located very close to our microservices to give us even lower latencies as well as much higher throughput and performance.

Wrapping up, I hope you now have a better understanding of our architecture, the technologies that power the platform, and how ScyllaDB plays a critical role in allowing us to handle Tripadvisor's extremely high scale.

How to Build a High-Performance Shopping Cart App with ScyllaDB

Build a shopping cart app with ScyllaDB – and learn how to use ScyllaDB's Change Data Capture (CDC) feature to query and export the history of all changes made to the tables.

This blog post showcases one of ScyllaDB's sample applications: a shopping cart app. The project uses FastAPI as the backend framework and ScyllaDB as the database. By cloning the repository and running the application, you can explore an example of an API server built on top of ScyllaDB for a CRUD app. Additionally, you'll see how to use ScyllaDB's Change Data Capture (CDC) feature to query and export the history of all changes made to the tables.

What's inside the shopping cart sample app?

The application has two components: an API server and a database.

API server: Python + FastAPI

The backend is built with Python and FastAPI, a modern Python web framework known for its speed and ease of use. FastAPI ensures that you have a framework that can deliver relatively high performance if used with the right database. At the same time, due to its exceptional developer experience, you can easily understand the code of the project and how it works even if you've never used it before. The application exposes multiple API endpoints to perform essential operations like:

  • Adding products to the cart
  • Removing products from the cart
  • Uploading new products
  • Updating product information (e.g. price)

Database: ScyllaDB

At the core of this application is ScyllaDB, a low-latency NoSQL database that provides predictable performance. ScyllaDB excels in handling large volumes of data with single-digit millisecond latency, making it ideal for large-scale real-time applications.

ScyllaDB acts as the foundation for a high-performance, low-latency app. Moreover, it has additional capabilities that can help you maintain low P99 latency as well as analyze user behavior. ScyllaDB's CDC feature tracks changes in the database, and you can query historical operations. For e-commerce applications, this means you can capture insights into user behavior:

  • What products are being added or removed from the cart, and when?
  • How do users interact with the cart?
  • What does a typical journey look like for a user who actually buys something?

These and other insights are invaluable for personalizing the user experience, optimizing the buying journey, and increasing conversion rates.

Using ScyllaDB for an ecommerce application

As studies have shown, low latency is critical for achieving high conversion rates and delivering a smooth user experience. For instance, shopping cart operations – such as adding, updating, and retrieving products – require high performance to prevent cart abandonment. Data modeling, being the foundation for high-performance web applications, must remain a top priority. So let's start with the process of creating a performant data model.

Design a Shopping Cart data model

We emphasize a practical "query-first" approach to NoSQL data modeling: start with your application's queries, then design your schema around them. This method ensures your data model is optimized for your specific use cases and that the database can provide reliable, single-digit P99 latency at any scale. Let's review the specific CRUD operations and queries a typical shopping cart application performs.

Products

List, add, edit and remove products.

  • GET /products?limit=? → SELECT * FROM product LIMIT {limit}
  • GET /products/{product_id} → SELECT * FROM product WHERE id = ?
  • POST /products → INSERT INTO product () values ()
  • PUT /products/{product_id} → UPDATE product SET ? WHERE id = ?
  • DELETE /products/{product_id} → DELETE FROM product WHERE id = ?

Based on these requirements, you can create a table to store products. Notice which value is most often used to filter products: the product id. This is a good indicator that product id should be the partition key, or at least part of it.

The Product table:

Our application is simple, so a single column will suffice as the partition key. However, if your use case requires additional queries and filtering by additional columns, you can consider using a composite partition key or adding a clustering key to the table.
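The original post shows the Product table schema as an image. As a hedged approximation based on the queries above (the keyspace, column names, and types are assumptions – check the GitHub repository for the real schema), it might look something like this:

-- Illustrative only; the sample app's actual schema may differ.
CREATE TABLE shopping.product (
    id    uuid PRIMARY KEY,   -- partition key: products are always looked up by id
    name  text,
    price decimal,
    img   text
);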
Cart

List, add, and remove products from the user's cart.

  • GET /cart/{user_id} → SELECT * FROM cart_items WHERE user_id = ? AND cart_id = ?
  • POST /cart/{user_id} → INSERT INTO cart() VALUES () (Here we don't need a cart id in the endpoint because the user can only have one active cart at a time. You could also build another endpoint to list past purchases by the user – that endpoint would require the cart id as well.)
  • DELETE /cart/{user_id} → DELETE FROM cart_items WHERE user_id = ? AND cart_id = ? AND product_id = ?
  • POST /cart/{user_id}/checkout → UPDATE cart SET is_active = false WHERE user_id = ? AND cart_id = ?

The cart-related operations contain slightly more complicated logic behind the scenes. We have two values that we query by: user id and cart id. Those can be used together as a composite partition key.

Additionally, one user can have multiple carts – one they're using right now to shop, and possibly others that they had in the past and already paid for. For this reason, we need a way to efficiently find the user's active cart. This query requirement will be handled by a secondary index on the is_active column.

The Cart table:

Additionally, we also need to create a table which connects the Product and Cart tables. Without this table, it would be impossible to retrieve products from a cart.

The Cart_items table:

We enable Change Data Capture for this table. This feature logs all data operations performed on the table into another table, cart_items_scylla_cdc_log. Later, we can query this log to retrieve the table's historical operations. This data can be used to analyze user behavior, such as the products users add or remove from their carts.

Final database schema:
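In the original post, the Cart, Cart_items, and final schema appear as images. A hedged reconstruction of the cart-related tables, building on the Product table sketched earlier, could look like the following (column names, types, and the index are assumptions based on the description above):

-- Illustrative sketch; check the GitHub repository for the real schema.
CREATE TABLE shopping.cart (
    user_id   uuid,
    cart_id   uuid,
    is_active boolean,
    PRIMARY KEY ((user_id, cart_id))
);

CREATE INDEX ON shopping.cart (is_active);   -- to find a user's active cart

CREATE TABLE shopping.cart_items (
    user_id    uuid,
    cart_id    uuid,
    product_id uuid,
    quantity   int,
    PRIMARY KEY ((user_id, cart_id), product_id)
) WITH cdc = {'enabled': true};              -- changes also land in cart_items_scylla_cdc_log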
Now that we've covered the data modeling aspect of the project, you can clone the repository and get started with building.

Getting started

Prerequisites:

  • Python 3.8+
  • ScyllaDB cluster (with ScyllaDB Cloud or using Docker)

Connect to your ScyllaDB cluster using CQLSH and create the schema. Then, install the Python requirements in a new environment, modify config.py to match your database credentials, run the server, and generate sample user data. The sample-data script populates your ScyllaDB tables, which is necessary for the next step, where you will run CDC queries to analyze user behavior.

Analyze user behavior with CDC

CDC records every data change, including deletes, offering a comprehensive view of your data evolution without affecting database performance. For a shopping cart application, some potential use cases for CDC include:

  • Analyzing a specific user's buying behavior
  • Tracking user actions leading to checkout
  • Evaluating product popularity and purchase frequency
  • Analyzing active and abandoned carts

Beyond these business-specific insights, CDC data can also be exported to external platforms, such as Kafka, for further processing and analysis. Here are a couple of useful tips when working with CDC:

  • The CDC log table contains timeuuid values, which can be converted to readable timestamps using the toTimestamp() CQL function.
  • The cdc$operation column helps filter operations by type. For instance, a value of 2 indicates an INSERT operation.
  • The most efficient and scalable way to query CDC data is to use the ScyllaDB source connector and set up an integration with Kafka.

Now, let's explore a couple of quick questions that CDC can help answer:

  • How many times did users add more than 2 of the same product to the cart?
  • How many carts contain a particular product?

Set up ScyllaDB CDC with Kafka Connect

To provide a scalable way to analyze ScyllaDB CDC logs, you can use Kafka to receive messages sent by ScyllaDB. Then, you can use an analytics tool, like Elasticsearch, to get insights. To send CDC logs to Kafka, you need to install the ScyllaDB CDC source connector on the machine/container that's running Kafka, then create a new ScyllaDB connection in Kafka Connect using the ScyllaDB-related connection parameters. Make sure to enable CDC on each table you want to send messages from (via the cdc table option, as sketched earlier).

Try it out yourself

If you are interested in trying out this application yourself, check out the dedicated documentation site, shopping-cart.scylladb.com, and the GitHub repository. If you have any questions about this project or ScyllaDB, submit a question in the ScyllaDB forum.

A Tiny Peek at Monster SCALE Summit 2025

Big things have been happening behind the scenes for the premier Monster SCALE Summit. Ever since we introduced it at P99 CONF, the community response has been overwhelming. We're now faced with the "good" problem of determining how to fit all the selected speakers into the two half-days we set aside for the event. 😅

If you missed the intro last year, Monster Scale Summit is a highly technical conference that connects the community of professionals designing, implementing, and optimizing performance-sensitive data-intensive applications. It focuses on exploring "monster scale" engineering challenges with respect to extreme levels of throughput, data, and global distribution. The two-day event is free, intentionally virtual, and highly interactive.

Register – it's free and virtual

We'll be announcing the agenda next month. But we're so excited about the speaker lineup that we can't wait to share a taste of what you can expect. Here's a preview of 12 of the 60+ sessions that you can join on March 11 and 12…

Designing Data-Intensive Applications in 2025
Martin Kleppmann and Chris Riccomini (Designing Data-Intensive Applications book)

Join us for an informal chat with Martin Kleppmann and Chris Riccomini, who are currently revising the famous book Designing Data-Intensive Applications. We'll cover how data-intensive applications have evolved since the book was first published, the top tradeoffs people are negotiating today, and what they believe is next for data-intensive applications. Martin and Chris will also provide an inside look at the book writing and revision process.

The Nile Approach: Re-engineering Postgres for Millions of Tenants
Gwen Shapira (Nile)

Scaling relational databases is a notoriously challenging problem. Doing so while maintaining consistent low latency, efficient use of resources and compatibility with Postgres may seem impossible. At Nile, we decided to tackle the scaling challenge by focusing on multi-tenant applications. These applications require not only scalability, but also a way to isolate tenants and avoid the noisy neighbor problem. By tackling both challenges, we developed an approach, which we call "virtual tenant databases", which gives us an efficient way to scale Postgres to millions of tenants while still maintaining consistent performance. In this talk, I'll explore the limitations of traditional scaling for multi-tenant applications and share how Nile's virtual tenant databases address these challenges. By combining the best of Postgres existing capabilities, distributed algorithms and a new storage layer, Nile re-engineered Postgres for multi-tenant applications at scale.

The Mechanics of Scale
Dominik Tornow (Resonate HQ)

As distributed systems scale, the complexity of their development and operation skyrockets. A dependable understanding of the mechanics of distributed systems is our most reliable parachute. In this talk, we'll use systems thinking to develop an accurate and concise mental model of concurrent, distributed systems, their core challenges, and the key principles to address these challenges. We'll explore foundational problems such as the tension between consistency and availability, and essential techniques like partitioning and replication. Whether you are building a new system from scratch or scaling an existing system to new heights, this talk will provide the understanding to confidently navigate the intricacies of modern, large-scale distributed systems.
Feature Store Evolution Under Cost Constraints: When Cost is Part of the Architecture
Ivan Burmistrov and David Malinge (ShareChat)

At P99 CONF 23, the ShareChat team presented the scaling challenges for the ML Feature Store so it could handle 1 billion features per second. Once the system was scaled to handle the load, the next challenge the team faced was extreme cost constraints: it was required to make the same quality system much cheaper to run. Ivan and David will talk about approaches the team implemented in order to optimize for cost in the Cloud environment while maintaining the same SLA for the service. The talk will touch on such topics as advanced optimizations on various levels to bring down the compute, minimizing the waste when running on Kubernetes, autoscaling challenges for stateful Apache Flink jobs, and others. The talk should be useful for those who are either interested in building or optimizing an ML Feature Store or in general looking into cost optimizations in the cloud environment.

Time Travelling at Scale
Richard Hart (Antithesis)

Antithesis is a continuous reliability platform that autonomously searches for problems in your software within a simulated environment. Every problem we find can be perfectly reproduced, allowing for efficient debugging of even the most complex problems. But storing and querying histories of program execution at scale creates monster large cardinalities. Over a ~10 hour test run, we generate ~1bn rows. The solution: our own tree-database.

30B Images and Counting: Scaling Canva's Content-Understanding Pipelines
Dr. Kerry Halupka (Canva)

As the demand for high-quality, labeled image data grows, building systems that can scale content understanding while delivering real-time performance is a formidable challenge. In this talk, I'll share how we tackled the complexities of scaling content understanding pipelines to support monstrous volumes of data, including backfilling labels for over 30 billion images. At the heart of our system is an extreme label classification model capable of handling thousands of labels and scaling seamlessly to thousands more. I'll dive into the core components: candidate image search, zero-shot labelling using highly trained teacher models, and iterative refinement with visual critic models. You'll learn how we balanced latency, throughput, and accuracy while managing evolving datasets and continuously expanding label sets. I'll also discuss the tradeoffs we faced—such as ensuring precision in labelling without compromising speed—and the techniques we employed to optimise for scale, including strategies to address data sparsity and performance bottlenecks. By the end of the session, you'll gain insights into designing, implementing, and scaling content understanding systems that meet extreme demands. Whether you're working with real-time systems, distributed architectures, or ML pipelines, this talk will provide actionable takeaways for pushing large-scale labelling pipelines to their limits and beyond.

How Agoda Scaled 50x Throughput with ScyllaDB
Worakarn Isaratham (Agoda)

In this talk, we will explore the performance tuning strategies implemented at Agoda to optimize ScyllaDB. Key topics include enhancing disk performance, selecting the appropriate compaction strategy, and adjusting SSTable settings to match our usage profile.

Who Needs One Database Anyway?
Glauber Costa (Turso)

Developers need databases. That's how you store your data.
And that's usually how it goes: you have your large fleet of services, and they connect to one database. But what if it wasn't like that? What if instead of one database, one application would create one million databases, or even more? In this talk, we'll explore the market trends that give rise to use cases where this pattern is beneficial, and the infrastructure changes needed to support it.

How We Boosted ScyllaDB's Data Streaming by 25x
Asias He (ScyllaDB)

To improve elasticity, we wanted to speed up streaming, the process of scaling out/in to other nodes used to analyze every partition. Enter file-based streaming, a new feature that optimizes tablet movement. This new approach streams the entire SSTable files without deserializing SSTable files into mutation fragments and re-serializing them back into SSTables on receiving nodes. As a result, significantly less data is streamed over the network, and less CPU is consumed – especially for data models that contain small cells. This session will share the engineering behind this optimization and look at the performance impact you can expect in common situations.

Evolving Atlassian Confluence Cloud for Scale, Reliability, and Performance
Bhakti Mehta (Atlassian)

This session covers the journey of Confluence Cloud – the team workspace for collaboration and knowledge sharing used by thousands of companies – and how we aim to take it to the next level, with scale, performance, and reliability as the key motivators. This session presents a deep dive to provide insights into how the Confluence architecture has evolved into its current form. It discusses how Atlassian deploys, runs, and operates at scale and all challenges encountered along the way. I will cover performance and reliability at scale, starting with the fundamentals of measuring everything, re-defining metrics to be insightful of actual customer pain, and auditing end-to-end experiences. Beyond just dev-ops and best practices, this means empowering teams to own product stability through practices and tools.

Two Leading Approaches to Data Virtualization: Which Scales Better?
Dr. Daniel Abadi (University of Maryland)

You have a large dataset stored in location X, and some code to process or analyze it in location Y. What is better: move the code to the data, or move the data to the code? For decades, it has always been assumed that the former approach is more scalable. Recently, with the rise of cloud computing, and the push to separate resources for storage and compute, we have seen data increasingly being pushed to code, flying in the face of conventional wisdom. What is behind this trend, and is it a dangerous idea? This session will look at this question from academic and practical perspectives, with a particular focus on data virtualization, where there exists an ongoing debate on the merits of push-based vs. pull-based data processing.

Scaling a Beast: Lessons from 400x Growth in a High-Stakes Financial System
Dmytro Hnatiuk (Wise)

Scaling a system from 66 million to over 25 billion records is no easy feat—especially when it's a core financial system where every number has to be right, and data needs to be fresh right now. In this session, I'll share the ups and downs of managing this kind of growth without losing my sanity. You'll learn how to balance high data accuracy with real-time performance, optimize your app logic, and avoid the usual traps of database scaling.
This isn’t about turning you into a database expert—it’s about giving you the practical, no-BS strategies you need to scale your systems without getting overwhelmed by technical headaches. Perfect for engineers and architects who want to tackle big challenges and come out on top.

Innovative data compression for time series: An open source solution

Introduction

There's no escaping the role that monitoring plays in our everyday lives, whether we're monitoring the weather, the number of steps we take in a day, computer systems, or ever-popular IoT devices.

Practically any activity can be monitored in one form or another these days. This generates increasing amounts of data to be pored over and analyzed – but storing all this data adds significant costs over time. Given this huge amount of data that only increases with each passing day, efficient compression techniques are crucial.

Here at NetApp® Instaclustr, we saw a great opportunity to improve the current compression techniques for our time series data. That's why we created the Advanced Time Series Compressor (ATSC) in partnership with the University of Canberra through the OpenSI initiative.

ATSC is a groundbreaking compressor designed to address the challenges of efficiently compressing large volumes of time-series data. Internal test results with production data from our database metrics showed that, on average across the dataset, ATSC compresses ~10x more than LZ4 and ~30x more than the default Prometheus compression. Check out ATSC on GitHub.

There are so many compressors already, so why develop another one? 

While other compression methods like LZ4, DoubleDelta, and ZSTD are lossless, most of our timeseries data is already lossy. Timeseries data can be lossy from the beginning due to under-sampling or insufficient data collection, or it can become lossy over time as metrics are rolled over or averaged. Because of this, the idea of a lossy compressor was born. 

ATSC is a highly configurable, lossy compressor that leverages the characteristics of time-series data to create function approximations. ATSC finds a fitting function and stores the parametrization of that function; no actual data from the original timeseries is stored. When the data is decompressed, it isn't identical to the original, but it is still sufficient for the intended use.

Here's an example: for a temperature change metric, which mostly varies slowly (as do a lot of system metrics!), instead of storing all the points that have a small change, we fit a curve (or a line) and store that curve/line, achieving significant compression ratios.

Image 1: ATSC data for temperature 

How does ATSC work? 

ATSC looks at the actual time series, in whole or in parts, to determine how best to calculate a function that fits the existing data. For that, a quick statistical analysis is done; if the results are inconclusive, a sample is compressed with all of the available functions and the best one is selected.

By default, ATSC will segment the data; this guarantees better local fitting, more and smaller computations, and less memory usage. It also ensures that decompression targets a specific block instead of the whole file.

In each fitting frame, ATSC will create a function from a pre-defined set and calculate the parametrization of said function. 

ATSC currently uses one of the following functions per frame:

  • FFT (Fast Fourier Transforms) 
  • Constant 
  • Interpolation – Catmull-Rom 
  • Interpolation – Inverse Distance Weight 

Image 2: Polynomial fitting vs. Fast-Fourier Transform fitting 

These methods allow ATSC to compress data with a fitting error within 1% (configurable!) of the original time-series.
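To make the idea of per-frame function fitting concrete, here is a minimal sketch in Python. This is not ATSC's actual Rust implementation: the polynomial choice, the error check, and the raw-data fallback are simplified assumptions purely for illustration.

# Minimal sketch of per-frame function approximation (not the real ATSC code).
import numpy as np

def compress_frame(values, degree=3, max_error=0.01):
    """Fit a polynomial to one frame and keep only its coefficients."""
    x = np.arange(len(values))
    coeffs = np.polyfit(x, values, degree)          # the stored "parametrization"
    fitted = np.polyval(coeffs, x)
    rel_error = np.max(np.abs(fitted - values)) / (np.max(np.abs(values)) + 1e-12)
    # Fall back to the raw frame if the fit is outside the error budget.
    return ("poly", coeffs) if rel_error <= max_error else ("raw", np.asarray(values))

def decompress_frame(kind, payload, frame_len):
    if kind == "raw":
        return payload                              # raw fallback: exact values
    return np.polyval(payload, np.arange(frame_len))  # reconstruct from coefficients

The real ATSC chooses among FFT, constant, and interpolation-based functions per frame and keeps the fitting error within the configurable bound described above.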

For a more detailed insight into ATSC internals and operations check our paper! 

 Use cases for ATSC and results 

ATSC draws inspiration from established compression and signal analysis techniques, achieving compression ratios ranging from 46x to 880x with a fitting error within 1% of the original time-series. In some cases, ATSC can produce highly compressed data without losing any meaningful information, making it a versatile tool for various applications (please see use cases below). 

Our internal tests comparing ATSC to LZ4 and standard Prometheus compression yielded the following results:

Method   Compressed size (bytes)  Compression Ratio 
Prometheus  454,778,552  1.33 
LZ4  141,347,821  4.29 
ATSC  14,276,544   42.47 

Another characteristic is the trade-off between compression and decompression speed: compression is about 30x slower than decompression. That is expected, since time series are typically compressed once but decompressed several times.

Image 3: A better fitting (purple) vs. a loose fitting (red). Purple takes twice as much space.

ATSC is versatile and can be applied in various scenarios where space reduction is prioritized over absolute precision. Some examples include: 

  • Rolled-over time series: metrics data that is rolled over and stored for the long term. ATSC provides the same or greater space savings as rolling over, but with minimal information loss and no meaningful loss in precision.
  • Under-sampled time series: increase sample rates without increasing storage. Systems with very low sampling rates (30 seconds or more) make it very difficult to identify actual events; ATSC provides the space savings while keeping the information about those events.
  • Long, slow-moving data series: Ideal for patterns that are easy to fit, such as weather data. 
  • Human visualization: Data meant for human analysis, with minimal impact on accuracy, such as historic views into system metrics (CPU, Memory, Disk, etc.)

Image 4: ATSC data (green) with an 88x compression vs. the original data (yellow)   

Using ATSC 

ATSC is written in Rust and is available on GitHub. You can build and run it yourself by following these instructions.

Future work 

Currently, we are planning to evolve ATSC in two ways (check our open issues): 

  1. Adding features to the core compressor focused on these functionalities:
    • Frame expansion for appending new data to existing frames 
    • Dynamic function loading to add more functions without altering the codebase 
    • Global and per-frame error storage 
    • Improved error encoding 
  2. Integrations with additional technologies (e.g. databases):
    • We are currently looking into integrating ATSC with ClickHouse® and Apache Cassandra®
CREATE TABLE sensors_poly (
    sensor_id UInt16,
    location UInt32,
    timestamp DateTime,
    pressure Float64 CODEC(ATSC('Polynomial', 1)),
    temperature Float64 CODEC(ATSC('Polynomial', 1))
)
ENGINE = MergeTree
ORDER BY (sensor_id, location, timestamp);

Image 5: Currently testing ClickHouse integration 

Sound interesting? Try it out and let us know what you think.  

ATSC represents a significant advancement in time-series data compression, offering high compression ratios with a configurable accuracy loss. Whether for long-term storage or efficient data visualization, ATSC is a powerful open source tool for managing large volumes of time-series data. 

But don't just take our word for it: download and run it!

Check our documentation for any information you need and submit ideas for improvements or issues you find using GitHub issues. We also have easy first issues tagged if you’d like to contribute to the project.   

Want to integrate this with another tool? You can build and run our demo integration with ClickHouse. 


New cassandra_latest.yaml configuration for a top performant Apache Cassandra®

Welcome to our deep dive into the latest advancements in Apache Cassandra® 5.0, specifically focusing on the cassandra_latest.yaml configuration that is available for new Cassandra 5.0 clusters.  

This blog post will walk you through the motivation behind these changes, how to use the new configuration, and the benefits it brings to your Cassandra clusters. 

Motivation 

The primary motivation for introducing cassandra_latest.yaml is to bridge the gap between maintaining backward compatibility and leveraging the latest features and performance improvements. The yaml addresses the varying needs of different groups working with new Cassandra 5.0 clusters:

  1. Cassandra Developers: who want to push new features but face challenges due to backward compatibility constraints. 
  2. Operators: who prefer stability and minimal disruption during upgrades. 
  3. Evangelists and New Users: who seek the latest features and performance enhancements without worrying about compatibility. 

Using cassandra_latest.yaml 

Using cassandra_latest.yaml is straightforward. It involves copying the cassandra_latest.yaml content to your cassandra.yaml or pointing the cassandra.config JVM property to the cassandra_latest.yaml file.  

This configuration is designed for new Cassandra 5.0 clusters (or those evaluating Cassandra), ensuring they get the most out of Cassandra 5.0's latest features and performance improvements.
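For example, assuming the file lives under /etc/cassandra (the path here is illustrative), the JVM property approach amounts to adding a line like the following to your JVM options; note that cassandra.config expects a URL:

-Dcassandra.config=file:///etc/cassandra/cassandra_latest.yaml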

Key changes and features 

Key Cache Size 

  • Old: Evaluated as the smaller of 5% of the heap and 100MB
  • Latest: Explicitly set to 0

Impact: Setting the key cache size to 0 in the latest configuration avoids performance degradation with the new SSTable format. This change is particularly beneficial for clusters using the new SSTable format, which doesn’t require key caching in the same way as the old format. Key caching was used to reduce the time it takes to find a specific key in Cassandra storage. 

Commit Log Disk Access Mode 

  • Old: Set to legacy
  • Latest: Set to auto

Impact: The auto setting optimizes the commit log disk access mode based on the available disks, potentially improving write performance. It can automatically choose the best mode (e.g., direct I/O) depending on the hardware and workload, leading to better performance without manual tuning.

Memtable Implementation 

  • Old: Skiplist-based
  • Latest: Trie-based

Impact: The trie-based memtable implementation reduces garbage collection overhead and improves throughput by moving more metadata off-heap. This change can lead to more efficient memory usage and higher write performance, especially under heavy load.

create table … with memtable = {'class': 'TrieMemtable', … }

Memtable Allocation Type 

  • Old: Heap buffers 
  • Latest: Off-heap objects 

Impact: Using off-heap objects for memtable allocation reduces the pressure on the Java heap, which can improve garbage collection performance and overall system stability. This is particularly beneficial for large datasets and high-throughput environments. 

Trickle Fsync 

  • Old: False 
  • Latest: True 

Impact: Enabling trickle fsync improves performance on SSDs by periodically flushing dirty buffers to disk, which helps avoid sudden large I/O operations that can impact read latencies. This setting is particularly useful for maintaining consistent performance in write-heavy workloads. 

SSTable Format 

  • Old: big 
  • Latest: bti (trie-indexed structure) 

Impact: The new BTI format is designed to improve read and write performance by using a trie-based indexing structure. This can lead to faster data access and more efficient storage management, especially for large datasets. 

sstable:
  selected_format: bti
  default_compression: zstd
  compression:
    zstd:
      enabled: true
      chunk_length: 16KiB
      max_compressed_length: 16KiB

Default Compaction Strategy 

  • Old: STCS (Size-Tiered Compaction Strategy) 
  • Latest: Unified Compaction Strategy 

Impact: The Unified Compaction Strategy (UCS) is more efficient and can handle a wider variety of workloads compared to STCS. UCS can reduce write amplification and improve read performance by better managing the distribution of data across SSTables. 

default_compaction:
  class_name: UnifiedCompactionStrategy
  parameters:
    scaling_parameters: T4
    max_sstables_to_compact: 64
    target_sstable_size: 1GiB
    sstable_growth: 0.3333333333333333
    min_sstable_size: 100MiB

Concurrent Compactors 

  • Old: Defaults to the smaller of the number of disks and cores
  • Latest: Explicitly set to 8

Impact: Setting the number of concurrent compactors to 8 ensures that multiple compaction operations can run simultaneously, helping to maintain read performance during heavy write operations. This is particularly beneficial for SSD-backed storage where parallel I/O operations are more efficient. 

Default Secondary Index 

  • Old: legacy_local_table
  • Latest: sai

Impact: SAI is a new index implementation that builds on the advancements made with SSTable-Attached Secondary Indexes (SASI). It provides a solution that enables users to index multiple columns on the same table without suffering scaling problems, especially at write time.
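As a quick illustration (the table and column names are invented), creating an SAI index in Cassandra 5.0 looks roughly like this:

-- Hypothetical table; the USING 'sai' clause is the point here.
CREATE INDEX user_email_idx ON app.users (email) USING 'sai';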

Stream Entire SSTables 

  • Old: implicitly set to True
  • Latest: explicitly set to True

Impact: When enabled, it permits Cassandra to zero-copy stream entire eligible SSTables between nodes, including every component. This speeds up network transfer significantly, subject to the throttling specified by

entire_sstable_stream_throughput_outbound

and

entire_sstable_inter_dc_stream_throughput_outbound

for inter-DC transfers. 

UUID SSTable Identifiers 

  • Old: False
  • Latest: True

Impact: Enabling UUID-based SSTable identifiers ensures that each SSTable has a unique name, simplifying backup and restore operations. This change reduces the risk of name collisions and makes it easier to manage SSTables in distributed environments. 

Storage Compatibility Mode 

  • Old: Cassandra 4
  • Latest: None

Impact: Setting the storage compatibility mode to none enables all new features by default, allowing users to take full advantage of the latest improvements in Cassandra, such as the new SSTable format. This setting is ideal for new clusters or those that do not need to maintain backward compatibility with older versions.
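Pulling several of these changes together, the corresponding cassandra.yaml entries look roughly like the sketch below. This is an illustrative excerpt, not a copy of the shipped file – always check the cassandra_latest.yaml bundled with your Cassandra 5.0 distribution for the exact keys and values:

# Illustrative excerpt of the settings discussed above (verify against the shipped file)
key_cache_size: 0MiB
commitlog_disk_access_mode: auto
memtable_allocation_type: offheap_objects
trickle_fsync: true
concurrent_compactors: 8
default_secondary_index: sai
stream_entire_sstables: true
uuid_sstable_identifiers_enabled: true
storage_compatibility_mode: NONE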

Testing and validation 

The cassandra_latest.yaml configuration has undergone rigorous testing to ensure it works seamlessly. Currently, the Cassandra project CI pipeline tests both the standard (cassandra.yaml) and latest (cassandra_latest.yaml) configurations, ensuring compatibility and performance. This includes unit tests, distributed tests, and DTests. 

Future improvements 

Future improvements may include enforcing password strength policies and other security enhancements. The community is encouraged to suggest features that could be enabled by default in cassandra_latest.yaml. 

Conclusion 

The cassandra_latest.yaml configuration for new Cassandra 5.0 clusters is a significant step forward in making Cassandra more performant and feature-rich while maintaining the stability and reliability that users expect. Whether you are a developer, an operations professional, or an evangelist/end user, cassandra_latest.yaml offers something valuable for everyone.

Try it out 

Ready to experience the incredible power of the cassandra_latest.yaml configuration on Apache Cassandra 5.0? Spin up your first cluster with a free trial on the Instaclustr Managed Platform and get started today with Cassandra 5.0!


Cassandra 5 Released! What's New and How to Try it

Apache Cassandra 5.0 has officially landed! This highly anticipated release brings a range of new features and performance improvements to one of the most popular NoSQL databases in the world. Having recently hosted a webinar covering the major features of Cassandra 5.0, I’m excited to give a brief overview of the key updates and show you how to easily get hands-on with the latest release using easy-cass-lab.

You can grab the latest release on the Cassandra download page.

Instaclustr for Apache Cassandra® 5.0 Now Generally Available

NetApp is excited to announce the general availability (GA) of Apache Cassandra® 5.0 on the Instaclustr Platform. This follows the release of the public preview in March.

NetApp was the first managed service provider to release the beta version, and now the Generally Available version, allowing the deployment of Cassandra 5.0 across the major cloud providers: AWS, Azure, and GCP, as well as on-premises.

Apache Cassandra has been a leader in NoSQL databases since its inception and is known for its high availability, reliability, and scalability. The latest version brings many new features and enhancements, with a special focus on building data-driven applications through artificial intelligence and machine learning capabilities.

Cassandra 5.0 will help you optimize performance, lower costs, and get started on the next generation of distributed computing by: 

  • Helping you build AI/ML-based applications through Vector Search  
  • Bringing efficiencies to your applications through new and enhanced indexing and processing capabilities 
  • Improving flexibility and security 

With the GA release, you can use Cassandra 5.0 for your production workloads, which are covered by NetApp’s industry-leading SLAs. NetApp has conducted performance benchmarking and extensive testing while removing the limitations that were present in the preview release to offer a more reliable and stable version. Our GA offering is suitable for all workload types as it contains the most up-to-date range of features, bug fixes, and security patches.  

Support for continuous backups and private network add-ons is available. Currently, Debezium is not yet compatible with Cassandra 5.0. NetApp will work with the Debezium community to add support for Debezium on Cassandra 5.0, and it will be available on the Instaclustr Platform as soon as it is supported. 

Some of the key new features in Cassandra 5.0 include: 

  • Storage-Attached Indexes (SAI): A highly scalable, globally distributed index for Cassandra databases. With SAI, column-level indexes can be added, leading to unparalleled I/O throughput for searches across different data types, including vectors. SAI also enables lightning-fast data retrieval through zero-copy streaming of indices, resulting in unprecedented efficiency.  
  • Vector Search: This is a powerful technique for searching relevant content or discovering connections by comparing similarities in large document collections and is particularly useful for AI applications. It uses storage-attached indexing and dense indexing techniques to enhance data exploration and analysis.  
  • Unified Compaction Strategy: This strategy unifies compaction approaches, including leveled, tiered, and time-windowed strategies. It leads to a major reduction in SSTable sizes. Smaller SSTables mean better read and write performance, reduced storage requirements, and improved overall efficiency (see the CQL sketch after this list).  
  • Numerous stability and testing improvements: You can read all about these changes here. 
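
For illustration, a minimal CQL sketch (table name hypothetical) of selecting the Unified Compaction Strategy for an existing table might look like this:

-- Hypothetical table; switch its compaction to the new unified strategy
ALTER TABLE metrics.events
    WITH compaction = {'class': 'UnifiedCompactionStrategy'};

New tables can opt in the same way by passing the compaction map to CREATE TABLE.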

All these new features are available out-of-the-box in Cassandra 5.0 and do not incur additional costs.  

Our Development team has worked diligently to bring you a stable release of Cassandra 5.0. Substantial preparatory work was done to ensure you have a seamless experience with Cassandra 5.0 on the Instaclustr Platform. This includes updating the Cassandra YAML and Java environment and enhancing the monitoring capabilities of the platform to support new data types.  

We also conducted extensive performance testing and benchmarked version 5.0 with the existing stable Apache Cassandra 4.1.5 version. We will be publishing our benchmarking results shortly; the highlight so far is that Cassandra 5.0 improves responsiveness by reducing latencies by up to 30% during peak load times.  

Through our dedicated Apache Cassandra committer, NetApp has contributed to the development of Cassandra 5.0 by enhancing the documentation for new features like Vector Search (Cassandra-19030), enabling Materialized Views (MV) with only partition keys (Cassandra-13857), fixing numerous bugs, and contributing to the improvements for the unified compaction strategy feature, among many other things. 

Lifecycle Policy Updates 

As previously communicated, the project will no longer maintain Apache Cassandra 3.0 and 3.11 versions (full details of the announcement can be found on the Apache Cassandra website).

To help you transition smoothly, NetApp will provide extended support for these versions for an additional 12 months. During this period, we will backport any critical bug fixes, including security patches, to ensure the continued security and stability of your clusters. 

Cassandra 3.0 and 3.11 versions will reach end-of-life on the Instaclustr Managed Platform within the next 12 months. We will work with you to plan and upgrade your clusters during this period.  

Additionally, the Cassandra 5.0 beta version and the Cassandra 5.0 RC2 version, which were released as part of the public preview, are now end-of-life. You can check the lifecycle status of different Cassandra application versions here.  

You can read more about our lifecycle policies on our website. 

Getting Started 

Upgrading to Cassandra 5.0 will allow you to stay current and start taking advantage of its benefits. The Instaclustr by NetApp Support team is ready to help customers upgrade clusters to the latest version.  

  • Wondering if it’s possible to upgrade your workloads from Cassandra 3.x to Cassandra 5.0? Find the answer to this and other similar questions in this detailed blog.
  • Click here to read about Storage Attached Indexes in Apache Cassandra 5.0.
  • Learn about 4 new Apache Cassandra 5.0 features to be excited about. 
  • Click here to learn what you need to know about Apache Cassandra 5.0. 

Why Choose Apache Cassandra on the Instaclustr Managed Platform? 

NetApp strives to deliver the best of its supported applications. Whether it’s the newest application versions available on the platform or additional platform enhancements, we ensure high quality through thorough testing before entering General Availability.  

NetApp customers have the advantage of accessing the latest versions—not just the major version releases but also minor version releases—so that they can benefit from any new features and are protected from any vulnerabilities.  

Don’t have an Instaclustr account yet? Sign up for a trial or reach out to our Sales team and start exploring Cassandra 5.0.  

With more than 375 million node hours of management experience, Instaclustr offers unparalleled expertise. Visit our website to learn more about the Instaclustr Managed Platform for Apache Cassandra.  

If you would like to upgrade your Apache Cassandra version or have any issues or questions about provisioning your cluster, please contact Instaclustr Support at any time.  

The post Instaclustr for Apache Cassandra® 5.0 Now Generally Available appeared first on Instaclustr.

Apache Cassandra® 5.0: Behind the Scenes

Here at NetApp, our Instaclustr product development team has spent nearly a year preparing for the release of Apache Cassandra 5.  

It started with one engineer tinkering at night with the Apache Cassandra 5 Alpha branch, and grew to five engineers working on various monitoring, configuration, testing, and functionality improvements to integrate the release with the Instaclustr Platform.  

It’s been a long journey to the point we are at today, offering Apache Cassandra 5 Release Candidate 1 in public preview on the Instaclustr Platform. 

Note: the Instaclustr team has a dedicated open source committer to the Apache Cassandra project. His changes are not included in this document, as there were too many for us to cover here. Instead, this blog primarily focuses on the engineering effort to release Cassandra 5.0 onto the Instaclustr Managed Platform. 

August 2023: The Beginning

We began experimenting with the Apache Cassandra 5 Alpha 1 branches using our build systems. There were several tools we built into our Apache Cassandra images that were not working at this point, but we managed to get a node to start even though it immediately crashed with errors.  

One of our early achievements was identifying and fixing a bug that impacted our packaging solution; this resulted in a small contribution to the project allowing Apache Cassandra to be installed on Debian systems with non-OpenJDK Java. 

September 2023: First Milestone 

The release of the Alpha 1 version allowed us to achieve our first running Cassandra 5 cluster in our development environments (without crashing!).  

Basic core functionalities like user creation, data writing, and backups/restores were tested successfully. However, several advanced features, such as repair and replace tooling, monitoring, and alerting were still untested.  

At this point we had to pause our Cassandra 5 efforts to focus on other priorities and planned to get back to testing Cassandra 5 after Alpha 2 was released. 

November 2023: Further Testing and Internal Preview 

The project released Alpha 2. We repeated the same build and test process we ran on Alpha 1. We also tested some more advanced procedures like cluster resizes with no issues.  

We also started testing with some of the new 5.0 features: Vector Data types and Storage-Attached Indexes (SAI), which resulted in another small contribution.  

We launched Apache Cassandra 5 Alpha 2 for internal preview (basically for internal users). This allowed the wider Instaclustr team to access and use the Alpha on the platform.  

During this phase we found a bug in our metrics collector that surfaced when vector data types were encountered; fixing it ended up being a major project for us. 

If you see errors like the below, it’s time for a Java Cassandra driver upgrade to 4.16 or newer: 

java.lang.IllegalArgumentException: Could not parse type name vector<float, 5>  
Nov 15 22:41:04 ip-10-0-39-7 process[1548]: at com.datastax.driver.core.DataTypeCqlNameParser.parse(DataTypeCqlNameParser.java:233)  
Nov 15 22:41:04 ip-10-0-39-7 process[1548]: at com.datastax.driver.core.TableMetadata.build(TableMetadata.java:311)
Nov 15 22:41:04 ip-10-0-39-7 process[1548]: at com.datastax.driver.core.SchemaParser.buildTables(SchemaParser.java:302)
Nov 15 22:41:04 ip-10-0-39-7 process[1548]: at com.datastax.driver.core.SchemaParser.refresh(SchemaParser.java:130)
Nov 15 22:41:04 ip-10-0-39-7 process[1548]: at com.datastax.driver.core.ControlConnection.refreshSchema(ControlConnection.java:417)  
Nov 15 22:41:04 ip-10-0-39-7 process[1548]: at com.datastax.driver.core.ControlConnection.refreshSchema(ControlConnection.java:356)  
<Rest of stacktrace removed for brevity>

December 2023: Focus on new features and planning 

As the project released Beta 1, we began focusing on the features in Cassandra 5 that we thought were the most exciting and would provide the most value to customers. There are a lot of awesome new features and changes, so it took a while to find the ones with the largest impact.  

The final list of high-impact features we came up with was: 

  • A new data type: Vectors 
  • Trie memtables/Trie Indexed SSTables (BTI formatted SSTables) 
  • Storage-Attached Indexes (SAI) 
  • Unified Compaction Strategy 

A major new feature we considered deploying was support for JDK 17. However, due to its experimental nature, we have opted to postpone adoption and plan to support running Apache Cassandra on JDK 17 when it’s out of the experimentation phase. 

Once the holiday season arrived, it was time for a break, and we were back in force in February next year. 

February 2024: Intensive testing 

In February, we released Beta 1 into internal preview so we could start testing it in our preproduction test environments. As we started to do more intensive testing, we discovered issues in the interaction with our monitoring and provisioning setup. 

We quickly fixed the issues identified as showstoppers for launching Cassandra 5. By the end of February, we initiated discussions about a public preview release. We also started to add more resourcing to the Cassandra 5 project. Up until now, only one person was working on it.  

Next, we broke down the work we needed to do. This included identifying monitoring agents requiring upgrades and config defaults that needed to change. 

From this point, the project split into 3 streams of work: 

  1. Project Planning – Deciding how all this work gets pulled together cleanly, ensuring other work streams have adequate resourcing to hit their goals, and informing product management and the wider business of what’s happening.  
  2. Configuration Tuning – Focusing on the new features of Apache Cassandra to include, how to approach the transition to JDK 17, and how to use BTI formatted SSTables on the platform.  
  3. Infrastructure Upgrades – Identifying what to upgrade internally to handle Cassandra 5, including Vectors and BTI formatted SSTables. 

A Senior Engineer was responsible for each workstream to ensure planned timeframes were achieved. 

March 2024: Public Preview Release 

In March, we launched Beta 1 into public preview on the Instaclustr Managed Platform. The initial release did not contain any opt-in features like Trie Indexed SSTables. 

However, this gave us a consistent base to test in our development, test, and production environments, and proved our release pipeline for Apache Cassandra 5 was working as intended. This also gave customers the opportunity to start using Apache Cassandra 5 with their own use cases and environments for experimentation.  

See our public preview launch blog for further details. 

There was not much time to celebrate as we continued working on infrastructure and refining our configuration defaults. 

April 2024: Configuration Tuning and Deeper Testing 

The first configuration updates were completed for Beta 1, and we started performing deeper functional and performance testing. We identified a few issues from this effort and remediated them. This default configuration was applied to all Beta 1 clusters moving forward.  

This allowed users to start testing Trie Indexed SSTables and Trie memtables in their environment by default. 

"memtable": 
  { 
    "configurations": 
      { 
        "skiplist": 
          { 
            "class_name": "SkipListMemtable" 
          }, 
        "sharded": 
          { 
            "class_name": "ShardedSkipListMemtable" 
          }, 
        "trie": 
          { 
            "class_name": "TrieMemtable" 
          }, 
        "default": 
          { 
            "inherits": "trie" 
          } 
      } 
  }, 
"sstable": 
  { 
    "selected_format": "bti" 
  }, 
"storage_compatibility_mode": "NONE",

The above snippet illustrates an Apache Cassandra YAML configuration where BTI formatted SSTables are used by default (which allows Trie Indexed SSTables) and Trie is the default memtable implementation. You can override this per table: 

-- Select the 'sharded' memtable configuration (ShardedSkipListMemtable) defined in the YAML above
CREATE TABLE test (id int PRIMARY KEY) WITH memtable = 'sharded';

Note that you need to set storage_compatibility_mode to NONE to use BTI formatted SSTables. See the Cassandra documentation for more information.

You can also reference the cassandra_latest.yaml file for the latest settings (please note you should not apply these to existing clusters without rigorous testing). 

May 2024: Major Infrastructure Milestone 

We hit a very large infrastructure milestone when we released an upgrade to some of our core agents that were reliant on an older version of the Apache Cassandra Java driver. The upgrade to version 4.17 allowed us to start supporting vectors in certain keyspace-level monitoring operations.  

At the time, this was considered to be the riskiest part of the entire project, as we had thousands of nodes to upgrade across many different customer environments. This upgrade took a few weeks, finishing in June. We broke the release up into 4 separate rollouts to reduce the risk of introducing issues into our fleet, with each rollout focusing on a single key component of our architecture. Each release had quality gates and tested rollback plans, which in the end were not needed. 

June 2024: Successful Rollout of the New Cassandra Driver 

The Java driver upgrade project was rolled out to all nodes in our fleet, and no issues were encountered. At this point we had hit all the major milestones before Release Candidates became available. We started looking at updating our testing systems to use Apache Cassandra 5 by default. 

July 2024: Path to Release Candidate 

We upgraded our internal testing systems to use Cassandra 5 by default, meaning our nightly platform tests began running against Cassandra 5 clusters and our production releases would smoke test using Apache Cassandra 5. We started testing the upgrade path for clusters from 4.x to 5.0. This resulted in another small contribution to the Cassandra project.  

The Apache Cassandra project released Apache Cassandra 5 Release Candidate 1 (RC1), and we launched RC1 into public preview on the Instaclustr Platform. 

The Road Ahead to General Availability 

We’ve just launched Apache Cassandra 5 Release Candidate 1 (RC1) into public preview, and there’s still more to do before we reach General Availability for Cassandra 5, including: 

  • Upgrading our own preproduction Apache Cassandra for internal use to Apache Cassandra 5 Release Candidate 1. This means we’ll be testing using our real-world use cases and testing our upgrade procedures on live infrastructure. 

At Launch: 

When Apache Cassandra 5.0 launches, we will perform another round of testing, including performance benchmarking. We will also upgrade our internal metrics storage production Apache Cassandra clusters to 5.0, and, if the results are satisfactory, we will mark the release as generally available for our customers. We want to have full confidence in running 5.0 before we recommend it for production use to our customers.  

For more information about our own usage of Cassandra for storing metrics on the Instaclustr Platform, check out our series on Monitoring at Scale.  

What Have We Learned From This Project? 

  • Releasing limited, small and frequent changes has resulted in a smooth project, even if sometimes frequent releases do not feel smooth. Some thoughts: 
    • Releasing to a small subset of internal users allowed us to take risks and break things more often so we could learn from our failures safely.
    • Releasing small changes allowed us to more easily understand and predict the behaviour of our changes: what to look out for in case things went wrong, how to more easily measure success, etc. 
    • Releasing frequently built confidence within the wider Instaclustr team, which in turn meant we would be happier taking more risks and could release more often.  
  • Releasing to internal and public preview helped create momentum within the Instaclustr business and teams:  
    • This turned the Apache Cassandra 5.0 release from something that “was coming soon and very exciting” to “something I can actually use.”
  • Communicating frequently, transparently, and efficiently is the foundation of success:  
    • We used a dedicated Slack channel (very creatively named #cassandra-5-project) to discuss everything. 
    • It was quick and easy to go back to see why we made certain decisions or revisit them if needed. This had a bonus of allowing a Lead Engineer to write a blog post very quickly about the Cassandra 5 project. 

This has been a long-running but very exciting project for the entire team here at Instaclustr. The Apache Cassandra community is on the home stretch for this massive release, and we couldn’t be more excited to start seeing what everyone will build with it.  

You can sign up today for a free trial and test Apache Cassandra 5 Release Candidate 1 by creating a cluster on the Instaclustr Managed Platform.  


The post Apache Cassandra® 5.0: Behind the Scenes appeared first on Instaclustr.

Will Your Cassandra Database Project Succeed?: The New Stack

Open source Apache Cassandra® continues to stand out as an enterprise-proven solution for organizations seeking high availability, scalability and performance in a NoSQL database. (And hey, the brand-new 5.0 version is only making those statements even more true!) There’s a reason this database is trusted by some of the world’s largest and most successful companies.

That said, effectively harnessing the full spectrum of Cassandra’s powerful advantages can mean overcoming a fair share of operational complexity. Some folks will find a significant learning curve, and knowing what to expect is critical to success. In my years of experience working with Cassandra, it’s when organizations fail to anticipate and respect these challenges that they set the stage for their Cassandra projects to fall short of expectations.

Let’s look at the key areas where strong project management and following proven best practices will enable teams to evade common pitfalls and ensure a Cassandra implementation is built strong from Day 1.

Accurate Data Modeling Is a Must

Cassandra projects require a thorough understanding of its unique data model principles. Teams that approach Cassandra like a relational database are unlikely to model data properly. This can lead to poor performance, excessive use of secondary indexes and significant data consistency issues.

On the other hand, teams that develop familiarity with Cassandra’s specific NoSQL data model will understand the importance of partition keys, clustering keys and denormalization. These teams closely analyze the query and data access patterns associated with their applications and use that understanding to build a Cassandra data model that matches their application’s needs, as in the sketch below.
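
As a simple hypothetical sketch (table and column names are invented for illustration), a table designed around the query "fetch the most recent readings for a given sensor" might look like this:

-- Partition by sensor, cluster by time so the newest readings come first
CREATE TABLE readings_by_sensor (
    sensor_id   uuid,
    reading_ts  timestamp,
    value       double,
    PRIMARY KEY ((sensor_id), reading_ts)
) WITH CLUSTERING ORDER BY (reading_ts DESC);

-- The access pattern the model was built for: one partition, already sorted
SELECT reading_ts, value
FROM readings_by_sensor
WHERE sensor_id = 123e4567-e89b-12d3-a456-426614174000
LIMIT 100;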

The post Will Your Cassandra Database Project Succeed?: The New Stack appeared first on Instaclustr.

Use Your Data in LLMs With the Vector Database You Already Have: The New Stack

Open source vector databases are among the top options out there for AI development, including some you may already be familiar with or even have on hand.

Vector databases allow you to enhance your LLM models with data from your internal data stores. Prompting the LLM with local, factual knowledge can allow you to get responses tailored to what your organization already knows about the situation. This reduces “AI hallucination” and improves relevance.

You can even ask the LLM to add references to the original data it used in its answer so you can check yourself. No doubt vendors have reached out with proprietary vector database solutions, advertised as a “magic wand” enabling you to assuage any AI hallucination concerns.

But, ready for some good news?

If you’re already using Apache Cassandra 5.0, OpenSearch or PostgreSQL, your vector database success is already primed. That’s right: there’s no need for costly proprietary vector database offerings. If you’re not (yet) using these free and fully open source database technologies, your generative AI aspirations are a good reason to migrate: they are all enterprise-ready and avoid the pitfalls of proprietary systems.

For many enterprises, these open source vector databases are the most direct route to implementing LLMs — and possibly leveraging retrieval augmented generation (RAG) — that deliver tailored and factual AI experiences.

Vector databases store embedding vectors, which are lists of numbers representing spatial coordinates corresponding to pieces of data. Related data will have closer coordinates, allowing LLMs to make sense of complex and unstructured datasets for features such as generative AI responses and search capabilities.

RAG, a process skyrocketing in popularity, involves using a vector database to translate the words in an enterprise’s documents into embeddings to provide highly efficient and accurate querying of that documentation via LLMs.

Let’s look closer at what each open source technology brings to the vector database discussion:

Apache Cassandra 5.0 Offers Native Vector Indexing

With its latest version (currently in preview), Apache Cassandra has added to its reputation as an especially highly available and scalable open source database by including everything that enterprises developing AI applications require.

Cassandra 5.0 adds native vector indexing and vector search, as well as a new vector data type for embedding vector storage and retrieval. The new version has also added specific Cassandra Query Language (CQL) functions that enable enterprises to easily use Cassandra as a vector database. These additions make Cassandra 5.0 a smart open source choice for supporting AI workloads and executing enterprise strategies around managing intelligent data.
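
To make this concrete, here is a minimal, hypothetical CQL sketch of the new vector data type, an SAI index on the vector column, and an approximate-nearest-neighbor query (real embeddings would have hundreds or thousands of dimensions rather than three):

-- Table with a small illustrative 3-dimensional embedding column
CREATE TABLE documents (
    doc_id    int PRIMARY KEY,
    body      text,
    embedding vector<float, 3>
);

-- Vector search requires a storage-attached index on the vector column
CREATE INDEX documents_embedding_idx ON documents (embedding) USING 'sai';

INSERT INTO documents (doc_id, body, embedding)
VALUES (1, 'hello world', [0.1, 0.2, 0.3]);

-- Approximate nearest-neighbor query for the most similar documents
SELECT body FROM documents
ORDER BY embedding ANN OF [0.12, 0.21, 0.29] LIMIT 2;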

OpenSearch Provides a Combination of Benefits

Like Cassandra, OpenSearch is another highly popular open source solution, one that many folks on the lookout for a vector database happen to already be using. OpenSearch offers a one-stop shop for search, analytics and vector database capabilities, while also providing exceptional nearest-neighbor search capabilities that support vector, lexical, and hybrid search and analytics.

With OpenSearch, teams can put the pedal down on developing AI applications, counting on the database to deliver the stability, high availability and minimal latency it’s known for, along with the scalability to account for vectors into the tens of billions. Whether developing a recommendation engine, generative AI agent or any other solution where the accuracy of results is crucial, those using OpenSearch to leverage vector embeddings and stamp out hallucinations won’t be disappointed.

The pgvector Extension Makes Postgres a Powerful Vector Store

Enterprises are no strangers to Postgres, which ranks among the most used databases in the world. Given that the database only needs the pgvector extension to become a particularly performant vector database, countless organizations are just a simple deployment away from harnessing an ideal infrastructure for handling their intelligent data.

pgvector is especially well-suited to providing exact nearest-neighbor search, approximate nearest-neighbor search and distance-based embedding search, and to using cosine distance (as recommended by OpenAI), L2 distance and inner product to recognize semantic similarities. Efficiency with those capabilities makes pgvector a powerful and proven open source option for training accurate LLMs and RAG implementations, while positioning teams to deliver trustworthy AI applications they can be proud of.

Was the Answer to Your AI Challenges in Front of You All Along?

The solution to tailored LLM responses isn’t investing in some expensive proprietary vector database and then trying to dodge the very real risks of vendor lock-in or a bad fit. At least it doesn’t have to be. Recognizing that available open source vector databases are among the top options out there for AI development — including some you may already be familiar with or even have on hand — should be a very welcome revelation.

The post Use Your Data in LLMs With the Vector Database You Already Have: The New Stack appeared first on Instaclustr.

easy-cass-lab v5 released

I’ve got some fun news to start the week off for users of easy-cass-lab: I’ve just released version 5. There are a number of nice improvements and bug fixes in here that should make it more enjoyable, more useful, and lay groundwork for some future enhancements.

  • When the cluster starts, we wait for the storage service to reach NORMAL state, then move to the next node. This is in contrast to the previous behavior where we waited for 2 minutes after starting a node. This queries JMX directly using Swiss Java Knife and is more reliable than the 2-minute method. Please see packer/bin-cassandra/wait-for-up-normal to read through the implementation.
  • Trunk now works correctly. Unfortunately, AxonOps doesn’t support trunk (5.1) yet, and using the agent was causing a startup error. You can test trunk out, but for now the AxonOps integration is disabled.
  • Added a new repl mode. This saves keystrokes and provides some auto-complete functionality and keeps SSH connections open. If you’re going to do a lot of work with ECL this will help you be a little more efficient. You can try this out with ecl repl.
  • Power user feature: Initial support for profiles in AWS regions other than us-west-2. We only provide AMIs for us-west-2, but you can now set up a profile in an alternate region, and build the required AMIs using easy-cass-lab build-image. This feature is still under development and requires using an easy-cass-lab build from source. Credit to Jordan West for contributing this work.
  • Power user feature: Support for multiple profiles. Setting the EASY_CASS_LAB_PROFILE environment variable allows you to configure alternate profiles. This is handy if you want to use multiple regions or have multiple organizations.
  • The project now uses Kotlin instead of Groovy for Gradle configuration.
  • Updated Gradle to 8.9.
  • When using the list command, don’t show the alias “current”.
  • Project cleanup, remove old unused pssh, cassandra build, and async profiler subprojects.

The release has been published to the project’s GitHub page and to Homebrew. The project is largely driven by my own consulting and training needs. If you’re looking to have some features prioritized please reach out, and we can discuss a consulting engagement.

easy-cass-lab updated with Cassandra 5.0 RC-1 Support

I’m excited to announce that the latest version of easy-cass-lab now supports Cassandra 5.0 RC-1, which was just made available last week! This update marks a significant milestone, providing users with the ability to test and experiment with the newest Cassandra 5.0 features in a simplified manner. This post will walk you through how to set up a cluster, SSH in, and run your first stress test.

For those new to easy-cass-lab, it’s a tool designed to streamline the setup and management of Cassandra clusters in AWS, making it accessible for both new and experienced users. Whether you’re running tests, developing new features, or just exploring Cassandra, easy-cass-lab is your go-to tool.

easy-cass-lab now available in Homebrew

I’m happy to share some exciting news for all Cassandra enthusiasts! My open source project, easy-cass-lab, is now installable via a homebrew tap. This powerful tool is designed to make testing any major version of Cassandra (or even builds that haven’t been released yet) a breeze, using AWS. A big thank-you to Jordan West who took the time to make this happen!

What is easy-cass-lab?

easy-cass-lab is a versatile testing tool for Apache Cassandra. Whether you’re dealing with the latest stable releases or experimenting with unreleased builds, easy-cass-lab provides a seamless way to test and validate your applications. With easy-cass-lab, you can ensure compatibility and performance across different Cassandra versions, making it an essential tool for developers and system administrators. easy-cass-lab is used extensively for my consulting engagements, my training program, and to evaluate performance patches destined for open source Cassandra. Here are a few examples:

Cassandra Training Signups For July and August Are Open!

I’m pleased to announce that I’ve opened training signups for Operator Excellence to the public for July and August. If you’re interested in stepping up your game as a Cassandra operator, this course is for you. Head over to the training page to find out more and sign up for the course.

Streaming My Sessions With Cassandra 5.0

As a long time participant with the Cassandra project, I’ve witnessed firsthand the evolution of this incredible database. From its early days to the present, our journey has been marked by continuous innovation, challenges, and a relentless pursuit of excellence. I’m thrilled to share that I’ll be streaming several working sessions over the next several weeks as I evaluate the latest builds and test out new features as we move toward the 5.0 release.

Streaming Cassandra Workloads and Experiments

Streaming

In the world of software engineering, especially within the realm of distributed systems, continuous learning and experimentation are not just beneficial; they’re essential. As a software engineer with a focus on distributed systems, particularly Apache Cassandra, I’ve taken this ethos to heart. My journey has led me to not only explore the intricacies of Cassandra’s distributed architecture but also to share my experiences and findings with a broader audience. This is why my YouTube channel has become an active platform where I stream at least once a week, engaging with viewers through coding sessions, trying new approaches, and benchmarking different Cassandra workloads.

Live Streaming On Tuesdays

As I promised in December, I redid my presentation from the Cassandra Summit 2023 on a live stream. You can check it out at the bottom of this post.

Going forward, I’ll be live-streaming on Tuesdays at 10AM Pacific on my YouTube channel.

Next week I’ll be taking a look at tlp-stress, which is used by the teams at some of the biggest Cassandra deployments in the world to benchmark their clusters. You can find that here.

Cassandra Summit Recap: Performance Tuning and Cassandra Training

Hello, friends in the Apache Cassandra community!

I recently had the pleasure of speaking at the Cassandra Summit in San Jose. Unfortunately, we ran into an issue with my screen refusing to cooperate with the projector, so my slides were pretty distorted and hard to read. While the talk is online, I think it would be better to have a version with the right slides as well as a little more time. I’ve decided to redo the entire talk via a live stream on YouTube. I’m scheduling this for 10am PST on Wednesday, January 17 on my YouTube channel. My original talk was done in a 30-minute slot; this will be a full hour, giving plenty of time for Q&A.

Cassandra Summit, YouTube, and a Mailing List

I am thrilled to share some significant updates and exciting plans with my readers and the Cassandra community. As we draw closer to the end of the year, I’m preparing for an important speaking engagement and mapping out a year ahead filled with engaging and informative activities.

Cassandra Summit Presentation: Mastering Performance Tuning

I am honored to announce that I will be speaking at the upcoming Cassandra Summit. My talk, titled “Cassandra Performance Tuning Like You’ve Been Doing It for Ten Years,” is scheduled for December 13th, from 4:10 pm to 4:40 pm. This session aims to equip attendees with advanced insights and practical skills for optimizing Cassandra’s performance, drawing from a decade’s worth of experience in the field. Whether you’re new to Cassandra or a seasoned user, this talk will provide valuable insights to enhance your database management skills.

Uncover Cassandra's Throughput Boundaries with the New Adaptive Scheduler in tlp-stress

Introduction

Apache Cassandra remains the preferred choice for organizations seeking a massively scalable NoSQL database. To guarantee predictable performance, Cassandra administrators and developers rely on benchmarking tools like tlp-stress, nosqlbench, and ndbench to help them discover their cluster’s limits. In this post, we will explore the latest advancements in tlp-stress, highlighting the introduction of the new Adaptive Scheduler. This brand-new feature allows users to more easily uncover the throughput boundaries of Cassandra clusters while remaining within specific read and write latency targets. First though, we’ll take a brief look at the new workload designed to stress test the new Storage Attached Indexes feature coming in Cassandra 5.

AxonOps Review - An Operations Platform for Apache Cassandra

Note: Before we dive into this review of AxonOps and their offerings, it’s important to note that this blog post is part of a paid engagement in which I provided product feedback. AxonOps had no influence or say over the content of this post and did not have access to it prior to publishing.

In the ever-evolving landscape of data management, companies are constantly seeking solutions that can simplify the complexities of database operations. One such player in the market is AxonOps, a company that specializes in providing tooling for operating Apache Cassandra.

Benchmarking Apache Cassandra with tlp-stress

This post will introduce you to tlp-stress, a tool for benchmarking Apache Cassandra. I started tlp-stress back when I was working at The Last Pickle. At the time, I was spending a lot of time helping teams identify the root cause of performance issues and needed a way of benchmarking. I found cassandra-stress to be difficult to use and configure, so I ended up writing my own tool that worked in a manner that I found to be more useful. If you’re looking for a tool to assist you in benchmarking Cassandra, and you’re looking to get started quickly, this might be the right tool for you.

Back to Consulting!

Saying “it’s been a while since I wrote anything here” would be an understatement, but I’m back, with a lot to talk about in the upcoming months.

First off - if you’re not aware, I continued writing, but on The Last Pickle blog. There are quite a few posts there; here are the most interesting ones:

Now the fun part - I’ve spent the last 3 years at Apple, then Netflix, neither of which gave me much time to continue my writing. As of this month, I’m officially no longer at Netflix and have started Rustyrazorblade Consulting!

Building a 100% ScyllaDB Shard-Aware Application Using Rust

I wrote a web transcript of the talk I gave with my colleagues Joseph and Yassir at [Scylla Su...

Learning Rust the hard way for a production Kafka+ScyllaDB pipeline

This is the web version of the talk I gave at [Scylla Summit 2022](https://www.scyllad...

On Scylla Manager Suspend & Resume feature

Disclaimer: This blog post is neither a rant nor intended to undermine the great work that...

Renaming and reshaping Scylla tables using scylla-migrator

We have recently faced a problem where some of the first Scylla tables we created on our main production cluster were not in line any more with the evolved s...

Python scylla-driver: how we unleashed the Scylla monster's performance

At Scylla summit 2019 I had the chance to meet Israel Fruchter and we dreamed of working on adding **shard...

Scylla Summit 2019

I've had the pleasure to attend again and present at the Scylla Summit in San Francisco and the honor to be awarded the...

A Small Utility to Help With Extracting Code Snippets

It’s been a while since I’ve written anything here. Part of the reason has been due to the writing I’ve done over on the blog at The Last Pickle. In the last few years, I’ve written about our tlp-stress tool, tips for new Cassandra clusters, and a variety of performance posts related to Compaction, Compression, and GC Tuning.

The other reason is the eight blog posts I’ve got in the draft folder. One of the reasons why there are so many is the way I write. If the post is programming related, I usually start with the post, then start coding, pull snippets out, learn more, rework the post, then rework snippets. It’s an annoying, manual process. The posts sitting in my draft folder have incomplete code, and reworking the code is a tedious process that I get annoyed with, leading to abandoned posts.

Scylla: four ways to optimize your disk space consumption

We recently had to face free disk space outages on some of our scylla clusters and we learnt some very interesting things while outlining some improvements t...