From Raw Performance to Price Performance: A Decade of Evolution at ScyllaDB

Tech journalist George Anadiotis catches up on how ScyllaDB’s latest releases deliver extreme elasticity and price-performance — and shares a peek at what’s next (vector search, object storage, and more) This is a guest post authored by tech journalist George Anadiotis. It’s a follow-up to articles that he published in 2023 and 2022 In business, they say it takes ten years to become an overnight success. In technology, they say it takes ten years to build a file system. ScyllaDB is in the technology business, offering a distributed NoSQL database that is monstrously fast and scalable. It turns out that it also takes ten years or more to build a successful database. This is something that Felipe Mendes and Guilherme Nogueira know well. Mendes and Nogueira are Technical Directors at ScyllaDB, working directly on the product as well as consulting clients. Recently, they presented some of the things they’ve been working on at ScyllaDB’s Monster Scale Summit, and they shared their insights in an exclusive fireside chat. You can also catch the podcast on AppleSpotify, and Amazon The evolution of ScyllaDB When ScyllaDB started out, it was all about raw performance. The goal was to be “the fastest NoSQL database available in the market, and we did that – we still are” as Mendes put it. However, as he added, raw speed alone does not necessarily make a good database. Features such as materialized views, secondary indexes, and integrations with third party solutions are really important as well. Adding such features marked the second generation in ScyllaDB’s evolution. ScyllaDB started as a performance-oriented alternative to Cassandra, so inevitably, evolution meant feature parity with Cassandra. The third generation of ScyllaDB was marked by the move to the cloud. ScyllaDB Cloud was introduced in 2019, has been growing at 200% YoY. As Nogueira shared, even today there are daily signups of new users ready to try the oddly-named database that’s used by companies such as Discord, Medium, and Tripadvisor, all of which the duo works with. The next generation brought a radical break from what Mendes called the inefficiencies in Cassandra, which involved introducing the Raft protocol for node coordination. Now ScyllaDB is moving to a new generation, by implementing what Mendes and Nogueira referred to as hallmark features: strong consistency and tablets. Strong consistency and tablets The combination of the new Raft and Tablets features enables clusters to scale up in seconds because it enables nodes to join in parallel, as opposed to sequentially which was the case for the Gossip protocol in Cassandra (which ScyllaDB also relied on originally). But it’s not just adding nodes that’s improved, it’s also removing nodes.When a node goes down for maintenance, for example, ScyllaDB’s strong consistency support means that the rest of the nodes in the cluster will be immediately aware. By contrast, in the previously supported regime of eventual consistency via a gossip protocol, it could take such updates a while to propagate. Using Raft means transitioning to a state machine mechanism, as Mendes noted. A node leader is appointed, so when a change occurs in the cluster, the state machine is updated and the change is immediately propagated. Raft is used to propagate updates consistently at every step of a topology change. It also allows for parallel topology updates, such as adding multiple nodes at once. This was not possible under the gossip-based approach. And this is where tablets come in. With tablets, instead of having one single leader per cluster, there is one leader per tablet. A tablet is a logical abstraction that partitions data in tables into smaller fragments. Tablets are load-balanced after new nodes join, ensuring consistent distribution across the cluster. Any changes to Tablets ownership are also ensured to be consistent by using Raft to propagate these changes. Each tablet is independent from the rest, which means that ScyllaDB with Raft can move them to other nodes on demand atomically and in a strongly consistent way as workloads grow or shrink. Speed, economy, elasticity By breaking down tables into smaller and more manageable units, data can be moved between nodes in a cluster much faster. This means that clusters can be scaled up rapidly, as Mendes demonstrated. When new nodes join a cluster, the data is redistributed in minutes rather than hours, which was the case previously (and is still the case with alternatives like Cassandra). When we’re talking about machines that have higher capacity, that also means that they have a higher storage density to be used, as Mendes noted. Tablets balance out in a way that utilizes storage capacity evenly, so all nodes in the cluster will have a similar utilization rate. That’s because the number of tablets at each node is determined according to the number of CPUs, which is always tied to storage in cloud nodes. In this sense, as storage utilization is more flexible and the cluster can scale faster, it also allows users to run at a much higher storage utilization rate. A typical storage utilization rate, Mendes said, is 50% to 60%. ScyllaDB aims to run at up to 90% storage utilization. That’s because tablets and cloud automations enable ScyllaDB Cloud to rapidly scale the cluster once those storage thresholds are exceeded, as ScyllaDB’s benchmarking shows. Going from 60% to 90% storage utilization means an extra 30% per node disk space can be utilized. At scale, that translates to significant savings for users. Further to scaling speed and economy, there is an additional benefit to tablets: enabling the elasticity of cloud operations for cloud deployments, without the complexity. Something old, something new, something borrowed, something blue Beyond strong consistency and tablets, there is a wide range of new features and improvements that the ScyllaDB team is working on. Some of these, such as support for S3 object storage, are efforts that are ongoing. Besides offering users choice, as well as a way to economize even further on storage, object storage support could also serve resilience. Other features, such as workload prioritization or the Alternator DynamoDB-compatible API, have been there for a while but are being improved and re-emphasized. As Mendes shared, when running a variety of workloads, it’s very hard for the database to know which is which and how to prioritize. Workload prioritization enables users to characterize and prioritize workloads, assigning appropriate service levels to each. Last but not least, ScyllaDB is also adding vector capabilities to the database engine. Vector data types, data structures, and query capabilities have been implemented and are being benchmarked. Initial results show great promise, even outperforming pure-play vector databases. This will eventually become a core feature, supported on both on-premise and cloud offerings. Once again, ScyllaDB is keeping with the times in its own characteristic way. As Mendes and Nogueira noted, there are many ScyllaDB clients using ScyllaDB to power AI workloads, some of them like Clearview AI sharing their stories. Nevertheless, ScyllaDB remains focused on database fundamentals, taking calculated steps in the spirit of continuous improvement that has become its trademark. After all, why change something that’s so deeply ingrained in the organization’s culture, is working well for them and appreciated by the ones who matter most – users?

How to Use Testcontainers with ScyllaDB

Learn how to use Testcontainers to create lightweight, throwaway instances of ScyllaDB for testing Why wrestle with all the complexities of database configuration for each round of integration testing? In this blog post, we will explain how to use the Testcontainers library to provide lightweight, throwaway instances of ScyllaDB for testing. We’ll go through a hands-on example that includes creating the database instance and testing against it. Testcontainers: A Valuable Tool for Integration Testing with ScyllaDB You automatically unit test your code and (hopefully) integration test your system…but what about your database? To rest assured that the application works as expected, you need to extend beyond unit testing. You also need to automatically test how the units interact with one another and how they interact with external services and systems (message brokers, data stores, and so on). But running those integration tests requires the infrastructure to be configured correctly, with all the components set to the proper state. You also need to ensure that the tests are isolated and don’t produce any side effects or “test pollution.” How do you reduce the pain…and get it all running in your CI/CD process? This is where Testcontainers comes into play. Testcontainers is an open source library for throwaway, lightweight instances of databases (including ScyllaDB), message brokers, web browsers, or just about anything that can run in a Docker container. You define your dependencies in code, which makes it well-suited for CI/CD processes. When you run your tests, a ScyllaDB container will be created and then deleted. This allows you to test your application against a real instance of the database without having to worry about complex environment configurations. It also ensures that the database setup has no effect on the production environment. Some of the advantages of using Testcontainers with ScyllaDB: It launches Dockerized databases on demand, so you get a fresh environment for every test run. It isolates tests with throwaway containers. There’s no test interference or state leakage since each test gets a pristine database state Tests are fast and realistic, since the container starts in seconds, ScyllaDB responds fast, and actual CQL responses are used. Tutorial: Building a ScyllaDB Test Step-by-Step The Testcontainers ScyllaDB integration works with JavaGo, Python (see example here), and Node.js. Here, we’ll walk through an example of how to use it with Java.The steps described below are applicable to any programming language and its corresponding testing framework. In our specific Java example, we will be using the JUnit 5 testing framework. The integration between Testcontainers and ScyllaDB uses Docker. You can read more about using ScyllaDB with Docker, and learn the Best Practices for Running ScyllaDB on Docker. Step 1: Configure Your Project Dependencies Before we begin, make sure you have: Java 21 or newer installed Docker installed and running (required for Testcontainers) Gradle 8 Note: If you are more comfortable with Maven, you can still follow this tutorial, but the setup and test execution steps will be different. To verify that Java 21 or newer is installed, run: java --version To verify that Docker is installed and running correctly, run: docker run hello-world To verify that Gradle 8 or newer is installed, run: gradle --version Once you have verified that all of the relevant project dependencies are installed and ready, you can move on to creating a new project. mkdir testcontainers-scylladb-java cd testcontainers-scylladb-java gradle init A series of prompts will appear. Here are the relevant choices you need to select: Select application Select java Enter Java version: 21 Enter project name: testcontainers-scylladb-java Select application structure: Single application project Select build script DSL: Groovy Select test framework: JUnit Jupiter For “Generate build using new APIs and behavior” select no After that part is finished, to verify the successful initialization of the new project, run: ./gradlew --version If everything goes well, you should see a build.gradle file in the app folder. You will need to add the following dependencies in your app/build.gradle file: dependencies { // Use JUnit Jupiter for testing. testImplementation libs.junit.jupiter testRuntimeOnly 'org.junit.platform:junit-platform-launcher' // This dependency is used by the application. implementation libs.guava // Add the required dependencies for the test testImplementation 'org.testcontainers:scylladb:1.20.5' testImplementation 'com.scylladb:java-driver-core:4.18.1.0' implementation 'ch.qos.logback:logback-classic:1.4.11' } Also, to get test report output in the terminal, you will need to add testLogging to app/build.gradle file as well: tasks.named('test') { // Use JUnit Platform for unit tests. useJUnitPlatform() // Add this testLogging configuration to get // the test results in terminal testLogging { events "passed", "skipped", "failed" showStandardStreams = true exceptionFormat = 'full' showCauses = true } } Once you’re finished editing the app/build.gradle file, you need to install the dependencies by running this command in the terminal: ./gradlew build You should see the BUILD SUCCESSFUL output in the terminal. The final preparation step is to create a ScyllaDBExampleTest.java file somewhere in the src/test/java folder. JUnit will find all tests in the src/test/java folder. For example: touch src/test/java/org/example/ScyllaDBExampleTest.java Step 2: Launch ScyllaDB in a Container Once the dependencies are installed and the ScyllaDBExampleTest.java file created, you can copy and paste the code provided below to the ScyllaDBExampleTest.java file. This code will start a fresh ScyllaDB instance for every test in this file in the setUp method. To ensure the instance will get shut down after every test, we’ve created the tearDown method, too. Step 3: Connect via the Java Driver You’ll connect to the ScyllaDB container by creating a new session. To do so, you’ll need to update your setUp method in the ScyllaDBExampleTest.java file: Step 4: Define Your Schema Now that you have the code to run ScyllaDB and connect to it, you can use the connection to create the schema for the database. Let’s define the schema by updating your setUp method in the ScyllaDBExampleTest.java file: Step 5: Insert and Query Data Once you have prepared the ScyllaDB instance, you can run operations on it. To do so, let’s add a new method to our ScyllaDBExampleTest class in the ScyllaDBExampleTest.java file: Step 6: Run and Validate the Test Your test is now complete and ready to be executed! Use the following command to execute the test: ./gradlew clean test --no-daemon If the execution is successful, you’ll notice the container starting in the logs, and the test will pass if the assertions are met. Here’s an example of what a successful terminal output might look like: 12:05:26.708 [Test worker] DEBUG com.github.dockerjava.zerodep.shaded.org.apache.hc.client5.http.impl.io.PoolingHttpClientConnectionManager -- ep-00000012: connection released [route: {}->unix://localhost:2375][total available: 1; route allocated: 1 of 2147483647; total allocated: 1 of 2147483647] 12:05:26.708 [Test worker] DEBUG org.testcontainers.utility.ResourceReaper -- Removed container and associated volume(s): scylladb/scylla:2025.1 ScyllaDBExampleTest > testScyllaDBOperations() PASSED BUILD SUCCESSFUL in 35s 3 actionable tasks: 3 executed Full code example The repository for the full code example can be found here:  https://github.com/scylladb/scylla-code-samples/tree/master/java-testcontainers Level Up: Extending Your ScyllaDB Tests That’s just the basics. Here are some additional uses you might want to explore on your own: Test schema migrations – Verify that your database evolution scripts work correctly Simulate multi-node clusters – Use multiple containers to test your application with multi-node and multi-dc  scenarios Benchmark performance – Measure your application’s throughput under various workloads Test failure scenarios – Simulate how your application handles network partitions or node failures Wrap-Up: Master ScyllaDB Testing with Confidence You’ve built a fast, real ScyllaDB test in Java that provides realistic database behavior without the overhead of a permanent installation. This approach should give you confidence that your code will work correctly in production. You can try it with an example app on ScyllaDB University, customize it to your project and specific needs, and share your experience with the community! Resources: Dive Deeper ScyllaDB Documentation ScyllaDB with Docker Best Practices for Running ScyllaDB on Docker Testcontainers GitHub Repository with Examples ScyllaDB University – Free courses to master ScyllaDB

Why Teams Are Ditching DynamoDB

Teams sometimes need lower latency, lower costs (especially as they scale) or the ability to run their applications somewhere other than AWS It’s easy to understand why so many teams have turned to Amazon DynamoDB since its introduction in 2012. It’s simple to get started, especially if your organization is already entrenched in the AWS ecosystem. It’s relatively fast and scalable, with a low learning curve. And since it’s fully managed, it abstracts away the operational effort and know-how traditionally required to keep a database up and running in a healthy state. But as time goes on, drawbacks emerge, especially as workloads scale and business requirements evolve. Teams sometimes need lower latency, lower costs (especially as they scale), or the ability to run their applications somewhere other than AWS. In those cases, ScyllaDB, which offers a DynamoDB-compatible API, is often selected as an alternative. Let’s explore the challenges that drove three teams to leave DynamoDB. Multi-Cloud Flexibility and Cost Savings Yieldmo is an online advertising platform that connects publishers and advertisers in real-time using an auction-based system, optimized with ML. Their business relies on delivering ads quickly (within 200-300 milliseconds) and efficiently, which requires ultra-fast, high-throughput database lookups at scale. Database delays directly translate to lost business. They initially built the platform on DynamoDB. However, while DynamoDB had been reliable, significant limitations emerged as they grew. As Todd Coleman, Technical Co-Founder and Chief Architect, explained, their primary concerns were twofold: escalating costs and geographic restrictions. The database was becoming increasingly expensive as they scaled, and it locked them into AWS, preventing true multi-cloud flexibility. While exploring DynamoDB alternatives, they were hoping to find an option that would maintain speed, scalability, and reliability while reducing costs and providing cloud vendor independence. Yieldmo first considered staying with DynamoDB and adding a caching layer. However, caching couldn’t fix the geographic latency issue. Cache misses would be too slow, making this approach impractical. They also explored Aerospike, which offered speed and cross-cloud support. However, Aerospike’s in-memory indexing would have required a prohibitively large and expensive cluster to handle Yieldmo’s large number of small data objects. Additionally, migrating to Aerospike would have required extensive and time-consuming code changes. Then they discovered ScyllaDB. And ScyllaDB’s DynamoDB-compatible API (Alternator) was a game changer. Todd explained, “ScyllaDB supported cross cloud deployments, required a manageable number of servers, and offered competitive costs. Best of all, its API was DynamoDB compatible, meaning we could migrate with minimal code changes. In fact, a single engineer implemented the necessary modifications in just a few days.” The migration process was carefully planned, leveraging their existing Kafka message queue architecture to ensure data integrity. They conducted two proof-of-concept (POC) tests: first with a single table of 28 billion objects, and then across all five AWS regions. The results were impressive. Todd shared, “Our database costs were cut in half, even with DynamoDB reserved capacity pricing.” And beyond cost savings, Yieldmo gained the flexibility to potentially deploy across different cloud providers. Their latency improved, and ScyllaDB was as simple to operate as DynamoDB. Wrapping up, Todd concluded: “One of our initial concerns was moving away from DynamoDB’s proven reliability. However, ScyllaDB has been an excellent partner. Their team provides monitoring of our clusters, alerts us to potential issues, and advises us when scaling is needed in terms of ongoing maintenance overhead. The experience has been comparable to DynamoDB, but with greater independence and substantial cost savings.” Hear from Yieldmo  Migrating to GCP with Better Performance and Lower Costs Digital Turbine, a major player in mobile ad tech with $500 million in annual revenue, faced growing challenges with its DynamoDB implementation. While its primary motivation for migration was standardizing on Google Cloud Platform following acquisitions, the existing DynamoDB solution had been causing both performance and cost concerns at scale. “It can be a little expensive as you scale, to be honest,” explained Joseph Shorter, vice president of Platform Architecture at Digital Turbine. “We were finding some performance issues. We were doing a ton of reads — 90% of all interactions with DynamoDB were read operations. With all those operations, we found that the performance hits required us to scale up more than we wanted, which increased costs.” Digital Turbine needed the migration to be as fast and low-risk as possible, which meant keeping application refactoring to a minimum. The main concern, according to Shorter, was “How can we migrate without radically refactoring our platform, while maintaining at least the same performance and value – and avoiding a crash-and-burn situation? Because if it failed, it would take down our whole company. “ After evaluating several options, Digital Turbine moved to ScyllaDB and achieved immediate improvements. The migration took less than a sprint to implement and the results exceeded expectations. “A 20% cost difference — that’s a big number, no matter what you’re talking about,” Shorter noted. “And when you consider our plans to scale even further, it becomes even more significant.” Beyond the cost savings, they found themselves “barely tapping the ScyllaDB clusters,” suggesting room for even more growth without proportional cost increases. Hear from Digital Turbine High Write Throughput with Low Latency and Lower Costs The User State and Customizations team for one of the world’s largest media streaming services had been using DynamoDB for several years. As they were rearchitecting two existing use cases, they wondered if it was time for a database change. The two use cases were: Pause/resume: If a user is watching a show and pauses it, they can pick up where they left off – on any device, from any location. Watch state: Using that same data, determine whether the user has watched the show. Here’s a simple architecture diagram: Every 30 seconds, the client sends heartbeats with the updated playhead position of the show and then sends those events to the database. The Edge Pipeline loads events in the same region as the user, while the Authority (Auth) Pipeline combines events for all five regions that the company serves. Finally, the data has to be fetched and served back to the client to support playback. Note that the team wanted to preserve separation between the Auth and Edge regions, so they weren’t looking for any database-specific replication between them. The two main technical requirements for supporting this architecture were: To ensure a great user experience, the system had to remain highly available, with low-latency reads and the ability to scale based on traffic surges. To avoid extensive infrastructure setup or DBA work, they needed easy integration with their AWS services. Once those boxes were checked, the team also hoped to reduce overall cost. “Our existing infrastructure had data spread across various clusters of DynamoDB and Elasticache, so we really wanted something simple that could combine these into a much lower cost system” explained their backend engineer. Specifically, they needed a database with: Multiregion support, since the service was popular across five major geographic regions. The ability to handle over 170K writes per second. Updates didn’t have a strict service-level agreement (SLA), but the system needed to perform conditional updates based on event timestamps. The ability to handle over 78K reads per second with a P99 latency of 10 to 20 milliseconds. The use case involved only simple point queries; things like indexes, partitioning and complicated query patterns weren’t a primary concern. Around 10TB of data with room for growth. Why move from DynamoDB? According to their backend engineer, “DynamoDB could support our technical requirements perfectly. But given our data size and high (write-heavy) throughput, continuing with DynamoDB would have been like shoveling money into the fire.” Based on their requirements for write performance and cost, they decided to explore ScyllaDB. For a proof of concept, they set up a ScyllaDB Cloud test cluster with six AWS i4i 4xlarge nodes and preloaded the cluster with 3 billion records. They ran combined loads of 170K writes per second and 78K reads per second. And the results? “We hit the combined load with zero errors. Our P99 read latency was 9 ms and the write latency was less than 1 ms.” These low latencies, paired with significant cost savings (over 50%) convinced them to leave DynamoDB. Beyond lower latencies at lower cost, the team also appreciated the following aspects of ScyllaDB: ScyllaDB’s performance-focused design (being built on the Seastar framework, using C++, being NUMA-aware, offering shard-aware drivers, etc.) helps the team reduce maintenance time and costs. Incremental Compaction Strategy helps them significantly reduce write amplification. Flexible consistency level and replication factors helps them support separate Auth and Edge pipelines. For example, Auth uses quorum consistency while Edge uses a consistency level of “1” due to the data duplication and high throughput. Their backend engineer concluded: “Choosing a database is hard. You need to consider not only features, but also costs. Serverless is not a silver bullet, especially in the database domain. “In our case, due to the high throughput and latency requirements, DynamoDB serverless was not a great option. Also, don’t underestimate the role of hardware. Better utilizing the hardware is key to reducing costs while improving performance.” Learn More Is Your Team Next? If your team is considering a move from DynamoDB, ScyllaDB might be an option to explore. Sign up for a technical consultation to talk more about your use case, SLAs, technical requirements and what you’re hoping to optimize. We’ll let you know if ScyllaDB is a good fit and, if so, what a migration might involve in terms of application changes, data modeling, infrastructure and so on. Bonus: Here’s a quick look at how ScyllaDB compares to DynamoDB

Cassandra Compaction Throughput Performance Explained

This is the second post in my series on improving node density and lowering costs with Apache Cassandra. In the previous post, I examined how streaming performance impacts node density and operational costs. In this post, I’ll focus on compaction throughput, and a recent optimization in Cassandra 5.0.4 that significantly improves it, CASSANDRA-15452.

This post assumes some familiarity with Apache Cassandra storage engine fundamentals. The documentation has a nice section covering the storage engine if you’d like to brush up before reading this post.

CEP-24 Behind the scenes: Developing Apache Cassandra®’s password validator and generator

Introduction: The need for an Apache Cassandra® password validator and generator

Here’s the problem: while users have always had the ability to create whatever password they wanted in Cassandra–from straightforward to incredibly complex and everything in between–this ultimately created a noticeable security vulnerability.

While organizations might have internal processes for generating secure passwords that adhere to their own security policies, Cassandra itself did not have the means to enforce these standards. To make the security vulnerability worse, if a password initially met internal security guidelines, users could later downgrade their password to a less secure option simply by using “ALTER ROLE” statements.

When internal password requirements are enforced for an individual, users face the additional burden of creating compliant passwords. This inevitably involved lots of trial-and-error in attempting to create a compliant password that satisfied complex security roles.

But what if there was a way to have Cassandra automatically create passwords that meet all bespoke security requirements–but without requiring manual effort from users or system operators?

That’s why we developed CEP-24: Password validation/generation. We recognized that the complexity of secure password management could be significantly reduced (or eliminated entirely) with the right approach–and improving both security and user experience at the same time.

The Goals of CEP-24

A Cassandra Enhancement Proposal (or CEP) is a structured process for proposing, creating, and ultimately implementing new features for the Cassandra project. All CEPs are thoroughly vetted among the Cassandra community before they are officially integrated into the project.

These were the key goals we established for CEP-24:

  • Introduce a way to enforce password strength upon role creation or role alteration.
  • Implement a reference implementation of a password validator which adheres to a recommended password strength policy, to be used for Cassandra users out of the box.
  • Emit a warning (and proceed) or just reject “create role” and “alter role” statements when the provided password does not meet a certain security level, based on user configuration of Cassandra.
  • To be able to implement a custom password validator with its own policy, whatever it might be, and provide a modular/pluggable mechanism to do so.
  • Provide a way for Cassandra to generate a password which would pass the subsequent validation for use by the user.

The Cassandra Password Validator and Generator builds upon an established framework in Cassandra called Guardrails, which was originally implemented under CEP-3 (more details here).

The password validator implements a custom guardrail introduced as part of CEP-24. A custom guardrail can validate and generate values of arbitrary types when properly implemented. In the CEP-24 context, the password guardrail provides CassandraPasswordValidator by extending ValueValidator, while passwords are generated by CassandraPasswordGenerator by extending ValueGenerator. Both components work with passwords as String type values.

Password validation and generation are configured in the cassandra.yaml file under the password_validator section. Let’s explore the key configuration properties available. First, the class_name and generator_class_name parameters specify which validator and generator classes will be used to validate and generate passwords respectively.

Cassandra ships CassandraPasswordValidator and CassandraPasswordGenerator out of the box. However, if a particular enterprise decides that they need something very custom, they are free to implement their own validators, put it on Cassandra’s class path and reference it in the configuration behind class_name parameter. Same for the validator.

CEP-24 provides implementations of the validator and generator that the Cassandra team believes will satisfy the requirements of most users. These default implementations address common password security needs. However, the framework is designed with flexibility in mind, allowing organizations to implement custom validation and generation rules that align with their specific security policies and business requirements.

password_validator: 
 # Implementation class of a validator. When not in form of FQCN, the 
 # package name org.apache.cassandra.db.guardrails.validators is prepended. 
 # By default, there is no validator. 
 class_name: CassandraPasswordValidator 
 # Implementation class of related generator which generates values which are valid when 
 # tested against this validator. When not in form of FQCN, the 
 # package name org.apache.cassandra.db.guardrails.generators is prepended. 
 # By default, there is no generator. 
 generator_class_name: CassandraPasswordGenerator

Password quality might be looked at as the number of characteristics a password satisfies. There are two levels for any password to be evaluated – warning level and failure level. Warning and failure levels nicely fit into how Guardrails act. Every guardrail has warning and failure thresholds. Based on what value a specific guardrail evaluates, it will either emit a warning to a user that its usage is discouraged (but ultimately allowed) or it will fail to be set altogether.

This same principle applies to password evaluation – each password is assessed against both warning and failure thresholds. These thresholds are determined by counting the characteristics present in the password. The system evaluates five key characteristics: the password’s overall length, the number of uppercase characters, the number of lowercase characters, the number of special characters, and the number of digits. A comprehensive password security policy can be enforced by configuring minimum requirements for each of these characteristics.

# There are four characteristics: 
 # upper-case, lower-case, special character and digit. 
 # If this value is set e.g. to 3, a password has to 
 # consist of 3 out of 4 characteristics. 

 # For example, it has to contain at least 2 upper-case characters, 
 # 2 lower-case, and 2 digits to pass, 
 # but it does not have to contain any special characters. 
 # If the number of characteristics found in the password is 
 # less than or equal to this number, it will emit a warning. 
 characteristic_warn: 3 
 # If the number of characteristics found in the password is 
 #less than or equal to this number, it will emit a failure. 
 characteristic_fail: 2

Next, there are configuration parameters for each characteristic which count towards warning or failure:

# If the password is shorter than this value, 
# the validator will emit a warning. 
length_warn: 12 
# If a password is shorter than this value, 
# the validator will emit a failure. 
length_fail: 8 
# If a password does not contain at least n 
# upper-case characters, the validator will emit a warning. 
upper_case_warn: 2 
# If a password does not contain at least 
# n upper-case characters, the validator will emit a failure. 
upper_case_fail: 1 
# If a password does not contain at least 
# n lower-case characters, the validator will emit a warning. 
lower_case_warn: 2 
# If a password does not contain at least 
# n lower-case characters, the validator will emit a failure. 
lower_case_fail: 1 
# If a password does not contain at least 
# n digits, the validator will emit a warning. 
digit_warn: 2 
# If a password does not contain at least 
# n digits, the validator will emit a failure. 
digit_fail: 1 
# If a password does not contain at least 
# n special characters, the validator will emit a warning. 
special_warn: 2 
# If a password does not contain at least 
# n special characters, the validator will emit a failure. 
special_fail: 1

It is also possible to say that illegal sequences of certain length found in a password will be forbidden: 

# If a password contains illegal sequences that are at least this long, it is invalid. 
# Illegal sequences might be either alphabetical (form 'abcde'), 
# numerical (form '34567'), or US qwerty (form 'asdfg') as well 
# as sequences from supported character sets. 
# The minimum value for this property is 3, 
# by default it is set to 5. 
illegal_sequence_length: 5

Lastly, it is also possible to configure a dictionary of passwords to check against. That way, we will be checking against password dictionary attacks. It is up to the operator of a cluster to configure the password dictionary:

# Dictionary to check the passwords against. Defaults to no dictionary. 
# Whole dictionary is cached into memory. Use with caution with relatively big dictionaries. 
# Entries in a dictionary, one per line, have to be sorted per String's compareTo contract. 
dictionary: /path/to/dictionary/file

Now that we have gone over all the configuration parameters, let’s take a look at an example of how password validation and generation look in practice.

Consider a scenario where a Cassandra super-user (such as the default ‘cassandra’ role) attempts to create a new role named ‘alice’.

cassandra@cqlsh> CREATE ROLE alice WITH PASSWORD = 'cassandraisadatabase' AND LOGIN = true; 

InvalidRequest: Error from server: code=2200 [Invalid query] 
message="Password was not set as it violated configured password strength 
policy. To fix this error, the following has to be resolved: Password 
contains the dictionary word 'cassandraisadatabase'. You may also use 
'GENERATED PASSWORD' upon role creation or alteration."

The password is not found in the dictionary, but it is not long enough. When an operator sees this, they will try to fix it by making the password longer:

cassandra@cqlsh> CREATE ROLE alice WITH PASSWORD = 'T8aum3?' AND LOGIN = true; 
InvalidRequest: Error from server: code=2200 [Invalid query] 
message="Password was not set as it violated configured password strength 
policy. To fix this error, the following has to be resolved: Password 
must be 8 or more characters in length. You may also use 
'GENERATED PASSWORD' upon role creation or alteration."

The password is finally set, but it is not completely secure. It satisfies the minimum requirements but our validator identified that not all characteristics were met.

cassandra@cqlsh> CREATE ROLE alice WITH PASSWORD = 'mYAtt3mp' AND LOGIN = true; 

Warnings: 

Guardrail password violated: Password was set, however it might not be 
strong enough according to the configured password strength policy. 
To fix this warning, the following has to be resolved: Password must be 12 or more 
characters in length. Passwords must contain 2 or more digit characters. Password 
must contain 2 or more special characters. Password matches 2 of 4 character rules, 
but 4 are required. You may also use 'GENERATED PASSWORD' upon role creation or alteration.

The password is finally set, but it is not completely secure. It satisfies the minimum requirements but our validator identified that not all characteristics were met. 

When an operator saw this, they noticed the note about the ‘GENERATED PASSWORD’ clause which will generate a password automatically without an operator needing to invent it on their own. This is a lot of times, as shown, a cumbersome process better to be left on a machine. Making it also more efficient and reliable.

cassandra@cqlsh> ALTER ROLE alice WITH GENERATED PASSWORD; 

generated_password 
------------------ 
   R7tb33?.mcAX

The generated password shown above will satisfy all the rules we have configured in the cassandra.yaml automatically. Every generated password will satisfy all of the rules. This is clearly an advantage over manual password generation.

When the CQL statement is executed, it will be visible in the CQLSH history (HISTORY command or in cqlsh_history file) but the password will not be logged, hence it cannot leak. It will also not appear in any auditing logs. Previously, Cassandra had to obfuscate such statements. This is not necessary anymore.

We can create a role with generated password like this:

cassandra@cqlsh> CREATE ROLE alice WITH GENERATED PASSWORD AND LOGIN = true; 

or by CREATE USER: 

cassandra@cqlsh> CREATE USER alice WITH GENERATED PASSWORD;

When a password is generated foralice (out of scope of this documentation), she can log in: 

$ cqlsh -u alice -p R7tb33?.mcAX 
... 
alice@cqlsh>

Note: It is recommended to save password to ~/.cassandra/credentials, for example: 

[PlainTextAuthProvider] 
username = cassandra
password = R7tb33?.mcAX

and by setting auth_provider in ~/.cassandra/cqlshrc 

[auth_provider] 
module = cassandra.auth 
classname = PlainTextAuthProvider

It is also possible to configure password validators in such a way that a user does not see why a password failed. This is driven by configuration property for password_validator called detailed_messages. When set to false, the violations will be very brief:

alice@cqlsh> ALTER ROLE alice WITH PASSWORD = 'myattempt'; 

InvalidRequest: Error from server: code=2200 [Invalid query] 
message="Password was not set as it violated configured password strength policy. 
You may also use 'GENERATED PASSWORD' upon role creation or alteration."

The following command will automatically generate a new password that meets all configured security requirements.

alice@cqlsh> ALTER ROLE alice WITH GENERATED PASSWORD;

Several potential enhancements to password generation and validation could be implemented in future releases. One promising extension would be validating new passwords against previous values. This would prevent users from reusing passwords until after they’ve created a specified number of different passwords. A related enhancement could include restricting how frequently users can change their passwords, preventing rapid cycling through passwords to circumvent history-based restrictions.

These features, while valuable for comprehensive password security, were considered beyond the scope of the initial implementation and may be addressed in future updates.

Final thoughts and next steps

The Cassandra Password Validator and Generator implemented under CEP-24 represents a significant improvement in Cassandra’s security posture.

By providing robust, configurable password policies with built-in enforcement mechanisms and convenient password generation capabilities, organizations can now ensure compliance with their security standards directly at the database level. This not only strengthens overall system security but also improves the user experience by eliminating guesswork around password requirements.

As Cassandra continues to evolve as an enterprise-ready database solution, these security enhancements demonstrate a commitment to meeting the demanding security requirements of modern applications while maintaining the flexibility that makes Cassandra so powerful.

Ready to experience CEP-24 yourself? Try it out on the Instaclustr Managed Platform and spin up your first Cassandra cluster for free.

CEP-24 is just our latest contribution to open source. Check out everything else we’re working on here.

The post CEP-24 Behind the scenes: Developing Apache Cassandra®’s password validator and generator appeared first on Instaclustr.

Introduction to similarity search: Part 2–Simplifying with Apache Cassandra® 5’s new vector data type

In Part 1 of this series, we explored how you can combine Cassandra 4 and OpenSearch to perform similarity searches with word embeddings. While that approach is powerful, it requires managing two different systems.

But with the release of Cassandra 5, things become much simpler.

Cassandra 5 introduces a native VECTOR data type and built-in Vector Search capabilities, simplifying the architecture by enabling Cassandra 5 to handle storage, indexing, and querying seamlessly within a single system.

Now in Part 2, we’ll dive into how Cassandra 5 streamlines the process of working with word embeddings for similarity search. We’ll walk through how the new vector data type works, how to store and query embeddings, and how the Storage-Attached Indexing (SAI) feature enhances your ability to efficiently search through large datasets.

The power of vector search in Cassandra 5

Vector search is a game-changing feature added in Cassandra 5 that enables you to perform similarity searches directly within the database. This is especially useful for AI applications, where embeddings are used to represent data like text or images as high-dimensional vectors. The goal of vector search is to find the closest matches to these vectors, which is critical for tasks like product recommendations or image recognition.

The key to this functionality lies in embeddings: arrays of floating-point numbers that represent the similarity of objects. By storing these embeddings as vectors in Cassandra, you can use Vector Search to find connections in your data that may not be obvious through traditional queries.

How vectors work

Vectors are fixed-size sequences of non-null values, much like lists. However, in Cassandra 5, you cannot modify individual elements of a vector — you must replace the entire vector if you need to update it. This makes vectors ideal for storing embeddings, where you need to work with the whole data structure at once.

When working with embeddings, you’ll typically store them as vectors of floating-point numbers to represent the semantic meaning.

Storage-Attached Indexing (SAI): The engine behind vector search

Vector Search in Cassandra 5 is powered by Storage-Attached Indexing, which enables high-performance indexing and querying of vector data. SAI is essential for Vector Search, providing the ability to create column-level indexes on vector data types. This ensures that your vector queries are both fast and scalable, even with large datasets.

SAI isn’t just limited to vectors—it also indexes other types of data, making it a versatile tool for boosting the performance of your queries across the board.

Example: Performing similarity search with Cassandra 5’s vector data type

Now that we’ve introduced the new vector data type and the power of Vector Search in Cassandra 5, let’s dive into a practical example. In this section, we’ll show how to set up a table to store embeddings, insert data, and perform similarity searches directly within Cassandra.

Step 1: Setting up the embeddings table

To get started with this example, you’ll need access to a Cassandra 5 cluster. Cassandra 5 introduces native support for vector data types and Vector Search, available on Instaclustr’s managed platform. Once you have your cluster up and running, the first step is to create a table to store the embeddings. We’ll also create an index on the vector column to optimize similarity searches using SAI.

CREATE KEYSPACE aisearch WITH REPLICATION = {{'class': 'SimpleStrategy',         '       replication_factor': 1}}; 

 

CREATE TABLE IF NOT EXISTS embeddings ( 
    id UUID, 
    paragraph_uuid UUID, 
    filename TEXT, 
    embeddings vector<float, 300>, 
    text TEXT, 
    last_updated timestamp, 
    PRIMARY KEY (id, paragraph_uuid) 
); 
 

CREATE INDEX IF NOT EXISTS ann_index 
  ON embeddings(embeddings) USING 'sai';

This setup allows us to store the embeddings as 300-dimensional vectors, along with metadata like file names and text. The SAI index will be used to speed up similarity searches on the embedding’s column.

You can also fine-tune the index by specifying the similarity function to be used for vector comparisons. Cassandra 5 supports three types of similarity functions: DOT_PRODUCT, COSINE, and EUCLIDEAN. By default, the similarity function is set to COSINE, but you can specify your preferred method when creating the index:

CREATE INDEX IF NOT EXISTS ann_index 
    ON embeddings(embeddings) USING 'sai' 
WITH OPTIONS = { 'similarity_function': 'DOT_PRODUCT' };

Each similarity function has its own advantages depending on your use case. DOT_PRODUCT is often used when you need to measure the direction and magnitude of vectors, COSINE is ideal for comparing the angle between vectors, and EUCLIDEAN calculates the straight-line distance between vectors. By selecting the appropriate function, you can optimize your search results to better match the needs of your application.

Step 2: Inserting embeddings into Cassandra 5

To insert embeddings into Cassandra 5, we can use the same code from the first part of this series to extract text from files, load the FastText model, and generate the embeddings. Once the embeddings are generated, the following function will insert them into Cassandra:

import time  
from uuid import uuid4, UUID
from cassandra.cluster import Cluster  
from cassandra.query import SimpleStatement  
from cassandra.policies import DCAwareRoundRobinPolicy  
from cassandra.auth import PlainTextAuthProvider  
from google.colab import userdata  

# Connect to the single-node cluster 
cluster = Cluster( 
# Replace with your IP list 
["xxx.xxx.xxx.xxx", "xxx.xxx.xxx.xxx ", " xxx.xxx.xxx.xxx "], # Single-node cluster address 
load_balancing_policy=DCAwareRoundRobinPolicy(local_dc='AWS_VPC_US_EAST_1'), # Update the local data centre if needed 
port=9042, 
auth_provider=PlainTextAuthProvider ( 
username='iccassandra', 
password='replace_with_your_password' 
) 
) 
session = cluster.connect() 

print('Connected to cluster %s' % cluster.metadata.cluster_name) 

def insert_embedding_to_cassandra(session, embedding, id=None, paragraph_uuid=None, filename=None, text=None, keyspace_name=None):
try:
embeddings = list(map(float, embedding))

# Generate UUIDs if not provided  
if id is None:
id = uuid4()  
if paragraph_uuid is None:
paragraph_uuid = uuid4()  
# Ensure id and paragraph_uuid are UUID objects
if isinstance(id, str):
id = UUID(id)  
if isinstance(paragraph_uuid, str):  
paragraph_uuid = UUID(paragraph_uuid)  

# Create the query string with placeholders
insert_query = f"""  
INSERT INTO {keyspace_name}.embeddings (id, paragraph_uuid, filename, embeddings, text, last_updated)
VALUES (?, ?, ?, ?, ?, toTimestamp(now()))
"""  

# Create a prepared statement with the query  
prepared = session.prepare(insert_query)

# Execute the query  
session.execute(prepared.bind((id, paragraph_uuid, filename, embeddings, text)))

return None # Successful insertion

except Exception as e:  
error_message = f"Failed to execute query:\nError: {str(e)}"
return error_message # Return error message on failure

def insert_with_retry(session, embedding, id=None, paragraph_uuid=None,
filename=None, text=None, keyspace_name=None, max_retries=3,
retry_delay_seconds=1):
retry_count = 0 
while retry_count < max_retries: 
result = insert_embedding_to_cassandra(session, embedding, id, paragraph_uuid, filename, text, keyspace_name) 
if result is None: 
return True # Successful insertion 
else: 
retry_count += 1 
print(f"Insertion failed on attempt {retry_count} with error: {result}") 
if retry_count < max_retries: 
time.sleep(retry_delay_seconds) # Delay before the next retry 
return False # Failed after max_retries 

# Replace the file path pointing to the desired file 
file_path = "/path/to/Cassandra-Best-Practices.pdf" 
paragraphs_with_embeddings =
extract_text_with_page_number_and_embeddings(file_path)

from tqdm import tqdm 

for paragraph in tqdm(paragraphs_with_embeddings, desc="Inserting paragraphs"): 
if not insert_with_retry( 
session=session, 
embedding=paragraph['embedding'], 
id=paragraph['uuid'], 
paragraph_uuid=paragraph['paragraph_uuid'], 
text=paragraph['text'], 
filename=paragraph['filename'], 
keyspace_name=keyspace_name, 
max_retries=3, 
retry_delay_seconds=1 
): 
# Display an error message if insertion fails 
tqdm.write(f"Insertion failed after maximum retries for UUID
{paragraph['uuid']}: {paragraph['text'][:50]}...")

This function handles inserting embeddings and metadata into Cassandra, ensuring that UUIDs are correctly generated for each entry.

Step 3: Performing similarity searches in Cassandra 5

Once the embeddings are stored, we can perform similarity searches directly within Cassandra using the following function:

import numpy as np 
# ------------------ Embedding Functions ------------------ 
def text_to_vector(text): 
"""Convert a text chunk into a vector using the FastText model.""" 
words = text.split() 
vectors = [fasttext_model[word] for word in words if word in fasttext_model.key_to_index] 
return np.mean(vectors, axis=0) if vectors else np.zeros(fasttext_model.vector_size) 

def find_similar_texts_cassandra(session, input_text, keyspace_name=None, top_k=5): 
# Convert the input text to an embedding 
input_embedding = text_to_vector(input_text) 
input_embedding_str = ', '.join(map(str, input_embedding.tolist())) 

# Adjusted query without the ORDER BY clause and correct comment syntax 
query = f""" 
SELECT text, filename, similarity_cosine(embeddings, ?) AS similarity 
FROM {keyspace_name}.embeddings 
ORDER BY embeddings ANN OF [{input_embedding_str}] 
LIMIT {top_k}; 
""" 

prepared = session.prepare(query) 
bound = prepared.bind((input_embedding,)) 
rows = session.execute(bound) 

# Sort the results by similarity in Python 
similar_texts = sorted([(row.similarity, row.filename, row.text) for row in rows], key=lambda x: x[0], reverse=True) 

return similar_texts[:top_k] 

from IPython.display import display, HTML 

# The word you want to find similarities for 
input_text = "place" 

# Call the function to find similar texts in the Cassandra database 
similar_texts = find_similar_texts_cassandra(session, input_text, keyspace_name="aisearch", top_k=10)

This function searches for similar embeddings in Cassandra and retrieves the top results based on cosine similarity. Under the hood, Cassandra’s vector search uses Hierarchical Navigable Small Worlds (HNSW). HNSW organizes data points in a multi-layer graph structure, making queries significantly faster by narrowing down the search space efficiently—particularly important when handling large datasets.

Step 4: Displaying the results

To display the results in a readable format, we can loop through the similar texts and present them along with their similarity scores:

# Print the similar texts along with their similarity scores 
for similarity, filename, text in similar_texts: 
html_content = f""" 
<div style="margin-bottom: 10px;"> 
<p><b>Similarity:</b> {similarity:.4f}</p> 
<p><b>Text:</b> {text}</p> 
<p><b>File:</b> {filename}</p> 
</div> 
<hr/> 
""" 

display(HTML(html_content))

This code will display the top similar texts, along with their similarity scores and associated file names.

Cassandra 5 vs. Cassandra 4 + OpenSearch®

Cassandra 4 relies on an integration with OpenSearch to handle word embeddings and similarity searches. This approach works well for applications that are already using or comfortable with OpenSearch, but it does introduce additional complexity with the need to maintain two systems.

Cassandra 5, on the other hand, brings vector support directly into the database. With its native VECTOR data type and similarity search functions, it simplifies your architecture and improves performance, making it an ideal solution for applications that require embedding-based searches at scale.

Feature  Cassandra 4 + OpenSearch  Cassandra 5 (Preview) 
Embedding Storage  OpenSearch  Native VECTOR Data Type 
Similarity Search  KNN Plugin in OpenSearch  COSINE, EUCLIDEAN, DOT_PRODUCT 
Search Method  Exact K-Nearest Neighbor  Approximate Nearest Neighbor (ANN) 
System Complexity  Requires two systems  All-in-one Cassandra solution 

Conclusion: A simpler path to similarity search with Cassandra 5

With Cassandra 5, the complexity of setting up and managing a separate search system for word embeddings is gone. The new vector data type and Vector Search capabilities allow you to perform similarity searches directly within Cassandra, simplifying your architecture and making it easier to build AI-powered applications.

Coming up: more in-depth examples and use cases that demonstrate how to take full advantage of these new features in Cassandra 5 in future blogs!

Ready to experience vector search with Cassandra 5? Spin up your first cluster for free on the Instaclustr Managed Platform and try it out!

The post Introduction to similarity search: Part 2–Simplifying with Apache Cassandra® 5’s new vector data type appeared first on Instaclustr.

Introduction to similarity search with word embeddings: Part 1–Apache Cassandra® 4.0 and OpenSearch®

Word embeddings have revolutionized how we approach tasks like natural language processing, search, and recommendation engines.

They allow us to convert words and phrases into numerical representations (vectors) that capture their meaning based on the context in which they appear. Word embeddings are especially useful for tasks where traditional keyword searches fall short, such as finding semantically similar documents or making recommendations based on textual data.

scatter plot graph

For example: a search for “Laptop” might return results related to “Notebook” or “MacBook” when using embeddings (as opposed to something like “Tablet”) offering a more intuitive and accurate search experience.

As applications increasingly rely on AI and machine learning to drive intelligent search and recommendation engines, the ability to efficiently handle word embeddings has become critical. That’s where databases like Apache Cassandra come into play—offering the scalability and performance needed to manage and query large amounts of vector data.

In Part 1 of this series, we’ll explore how you can leverage word embeddings for similarity searches using Cassandra 4 and OpenSearch. By combining Cassandra’s robust data storage capabilities with OpenSearch’s powerful search functions, you can build scalable and efficient systems that handle both metadata and word embeddings.

Cassandra 4 and OpenSearch: A partnership for embeddings

Cassandra 4 doesn’t natively support vector data types or specific similarity search functions, but that doesn’t mean you’re out of luck. By integrating Cassandra with OpenSearch, an open-source search and analytics platform, you can store word embeddings and perform similarity searches using the k-Nearest Neighbors (kNN) plugin.

This hybrid approach is advantageous over relying on OpenSearch alone because it allows you to leverage Cassandra’s strengths as a high-performance, scalable database for data storage while using OpenSearch for its robust indexing and search capabilities.

Instead of duplicating large volumes of data into OpenSearch solely for search purposes, you can keep the original data in Cassandra. OpenSearch, in this setup, acts as an intelligent pointer, indexing the embeddings stored in Cassandra and performing efficient searches without the need to manage the entire dataset directly.

This approach not only optimizes resource usage but also enhances system maintainability and scalability by segregating storage and search functionalities into specialized layers.

Deploying the environment

To set up your environment for word embeddings and similarity search, you can leverage the Instaclustr Managed Platform, which simplifies deploying and managing your Cassandra cluster and OpenSearch. Instaclustr takes care of the heavy lifting, allowing you to focus on building your application rather than managing infrastructure. In this configuration, Cassandra serves as your primary data store, while OpenSearch handles vector operations and similarity searches.

Here’s how to get started:

  1. Deploy a managed Cassandra cluster: Start by provisioning your Cassandra 4 cluster on the Instaclustr platform. This managed solution ensures your cluster is optimized, secure, and ready to store non-vector data.
  2. Set up OpenSearch with kNN plugin: Instaclustr also offers a fully managed OpenSearch service. You will need to deploy OpenSearch, with the kNN plugin enabled, which is critical for handling word embeddings and executing similarity searches.

By using Instaclustr, you gain access to a robust platform that seamlessly integrates Cassandra and OpenSearch, combining Cassandra’s scalable, fault-tolerant database with OpenSearch’s powerful search capabilities. This managed environment minimizes operational complexity, so you can focus on delivering fast and efficient similarity searches for your application.

Preparing the environment

Now that we’ve outlined the environment setup, let’s dive into the specific technical steps to prepare Cassandra and OpenSearch for storing and searching word embeddings.

Step 1: Setting up Cassandra

In Cassandra, we’ll need to create a table to store the metadata. Here’s how to do that:

  1. Create the Table:
    Next, create a table to store the embeddings. This table will hold details such as the embedding vector, related text, and metadata:CREATE KEYSPACE IF NOT EXISTS aisearch WITH REPLICATION = {‘class’: ‘SimpleStrategy’, ‘
CREATE KEYSPACE IF NOT EXISTS aisearch WITH REPLICATION = {'class': 'SimpleStrategy',          '
replication_factor': 3};

USE file_metadata;
 
DROP TABLE IF EXISTS file_metadata; 
    CREATE TABLE IF NOT EXISTS file_metadata ( 
      id UUID, 
      paragraph_uuid UUID, 
      filename TEXT, 
      text TEXT, 
      last_updated timestamp, 
      PRIMARY KEY (id, paragraph_uuid) 
    );

Step 2: Configuring OpenSearch

In OpenSearch, you’ll need to create an index that supports vector operations for similarity search. Here’s how you can configure it:

  1. Create the index:
    Define the index settings and mappings, ensuring that vector operations are enabled and that the correct space type (e.g., L2) is used for similarity calculations.
{ 
  "settings": { 
   "index": { 
     "number_of_shards": 2, 
      "knn": true, 
      "knn.space_type": "l2" 
    } 
  }, 
  "mappings": { 
    "properties": { 
      "file_uuid": { 
        "type": "keyword" 
      }, 
      "paragraph_uuid": { 
        "type": "keyword" 
      }, 
      "embedding": { 
        "type": "knn_vector", 
        "dimension": 300 
      } 
    } 
  } 
}

This index configuration is optimized for storing and searching embeddings using the k-Nearest Neighbors algorithm, which is crucial for similarity search.

With these steps, your environment will be ready to handle word embeddings for similarity search using Cassandra and OpenSearch.

Generating embeddings with FastText

Once you have your environment set up, the next step is to generate the word embeddings that will drive your similarity search. For this, we’ll use FastText, a popular library from Facebook’s AI Research team that provides pre-trained word vectors. Specifically, we’re using the crawl-300d-2M model, which offers 300-dimensional vectors for millions of English words.

Step 1: Download and load the FastText model

To start, you’ll need to download the pre-trained model file. This can be done easily using Python and the requests library. Here’s the process:

1. Download the FastText model: The FastText model is stored in a zip file, which you can download from the official FastText website. The following Python script will handle the download and extraction:

import requests 
import zipfile 
import os 

# Adjust file_url  and local_filename  variables accordingly 
file_url = 'https://dl.fbaipublicfiles.com/fasttext/vectors-english/crawl-300d-2M.vec.zip' 
local_filename = '/content/gdrive/MyDrive/0_notebook_files/model/crawl-300d-2M.vec.zip' 
extract_dir = '/content/gdrive/MyDrive/0_notebook_files/model/' 

def download_file(url, filename): 
    with requests.get(url, stream=True) as r: 
        r.raise_for_status() 
        os.makedirs(os.path.dirname(filename), exist_ok=True) 
        with open(filename, 'wb') as f: 
            for chunk in r.iter_content(chunk_size=8192): 
                f.write(chunk) 
 

def unzip_file(filename, extract_to): 
    with zipfile.ZipFile(filename, 'r') as zip_ref: 
        zip_ref.extractall(extract_to) 

# Download and extract 
download_file(file_url, local_filename) 
unzip_file(local_filename, extract_dir)

2. Load the model: Once the model is downloaded and extracted, you’ll load it using Gensim’s KeyedVectors class. This allows you to work with the embeddings directly: 

from gensim.models import KeyedVectors 

# Adjust model_path variable accordingly
model_path = "/content/gdrive/MyDrive/0_notebook_files/model/crawl-300d-2M.vec"
fasttext_model = KeyedVectors.load_word2vec_format(model_path, binary=False)

Step 2: Generate embeddings from text

With the FastText model loaded, the next task is to convert text into vectors. This process involves splitting the text into words, looking up the vector for each word in the FastText model, and then averaging the vectors to get a single embedding for the text.

Here’s a function that handles the conversion:

import numpy as np 
import re 

def text_to_vector(text): 
    """Convert text into a vector using the FastText model.""" 
    text = text.lower() 
    words = re.findall(r'\b\w+\b', text) 
    vectors = [fasttext_model[word] for word in words if word in fasttext_model.key_to_index] 

    if not vectors: 
        print(f"No embeddings found for text: {text}") 
        return np.zeros(fasttext_model.vector_size) 

    return np.mean(vectors, axis=0)

This function tokenizes the input text, retrieves the corresponding word vectors from the model, and computes the average to create a final embedding.

Step 3: Extract text and generate embeddings from documents

In real-world applications, your text might come from various types of documents, such as PDFs, Word files, or presentations. The following code shows how to extract text from different file formats and convert that text into embeddings:

import uuid 
import mimetypes 
import pandas as pd 
from pdfminer.high_level import extract_pages 
from pdfminer.layout import LTTextContainer 
from docx import Document 
from pptx import Presentation 

def generate_deterministic_uuid(name): 
    return uuid.uuid5(uuid.NAMESPACE_DNS, name) 

def generate_random_uuid(): 
    return uuid.uuid4() 

def get_file_type(file_path): 
    # Guess the MIME type based on the file extension 
    mime_type, _ = mimetypes.guess_type(file_path) 
    return mime_type 

def extract_text_from_excel(excel_path): 
    xls = pd.ExcelFile(excel_path) 
    text_list = [] 

for sheet_index, sheet_name in enumerate(xls.sheet_names): 
        df = xls.parse(sheet_name) 
        for row in df.iterrows(): 
            text_list.append((" ".join(map(str, row[1].values)), sheet_index + 1))  # +1 to make it 1 based index 

return text_list 

def extract_text_from_pdf(pdf_path): 
    return [(text_line.get_text().strip().replace('\xa0', ' '), page_num) 
            for page_num, page_layout in enumerate(extract_pages(pdf_path), start=1) 
            for element in page_layout if isinstance(element, LTTextContainer) 
            for text_line in element if text_line.get_text().strip()] 

def extract_text_from_word(file_path): 
    doc = Document(file_path) 
    return [(para.text, (i == 0) + 1) for i, para in enumerate(doc.paragraphs) if para.text.strip()] 

def extract_text_from_txt(file_path): 
    with open(file_path, 'r') as file: 
        return [(line.strip(), 1) for line in file.readlines() if line.strip()] 

def extract_text_from_pptx(pptx_path): 
    prs = Presentation(pptx_path) 
    return [(shape.text.strip(), slide_num) for slide_num, slide in enumerate(prs.slides, start=1) 
            for shape in slide.shapes if hasattr(shape, "text") and shape.text.strip()] 

def extract_text_with_page_number_and_embeddings(file_path, embedding_function): 
    file_uuid = generate_deterministic_uuid(file_path) 
    file_type = get_file_type(file_path) 

    extractors = { 
        'text/plain': extract_text_from_txt, 
        'application/pdf': extract_text_from_pdf, 
        'application/vnd.openxmlformats-officedocument.wordprocessingml.document': extract_text_from_word, 
        'application/vnd.openxmlformats-officedocument.presentationml.presentation': extract_text_from_pptx, 
        'application/zip': lambda path: extract_text_from_pptx(path) if path.endswith('.pptx') else [], 
        'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet': extract_text_from_excel, 
        'application/vnd.ms-excel': extract_text_from_excel
    }

    text_list = extractors.get(file_type, lambda _: [])(file_path) 

    return [ 
      { 
          "uuid": file_uuid, 
          "paragraph_uuid": generate_random_uuid(), 
          "filename": file_path, 
          "text": text, 
          "page_num": page_num, 
          "embedding": embedding 
      } 
      for text, page_num in text_list 
      if (embedding := embedding_function(text)).any()  # Check if the embedding is not all zeros 
    ] 

# Replace the file path with the one you want to process 

file_path = "../../docs-manager/Cassandra-Best-Practices.pdf"
paragraphs_with_embeddings = extract_text_with_page_number_and_embeddings(file_path)

This code handles extracting text from different document types, generating embeddings for each text chunk, and associating them with unique IDs.

With FastText set up and embeddings generated, you’re now ready to store these vectors in OpenSearch and start performing similarity searches.

Performing similarity searches

To conduct similarity searches, we utilize the k-Nearest Neighbors (kNN) plugin within OpenSearch. This plugin allows us to efficiently search for the most similar embeddings stored in the system. Essentially, you’re querying OpenSearch to find the closest matches to a word or phrase based on your embeddings.

For example, if you’ve embedded product descriptions, using kNN search helps you locate products that are semantically similar to a given input. This capability can significantly enhance your application’s recommendation engine, categorization, or clustering.

This setup with Cassandra and OpenSearch is a powerful combination, but it’s important to remember that it requires managing two systems. As Cassandra evolves, the introduction of built-in vector support in Cassandra 5 simplifies this architecture. But for now, let’s focus on leveraging both systems to get the most out of similarity searches.

Example: Inserting metadata in Cassandra and embeddings in OpenSearch

In this example, we use Cassandra 4 to store metadata related to files and paragraphs, while OpenSearch handles the actual word embeddings. By storing the paragraph and file IDs in both systems, we can link the metadata in Cassandra with the embeddings in OpenSearch.

We first need to store metadata such as the file name, paragraph UUID, and other relevant details in Cassandra. This metadata will be crucial for linking the data between Cassandra, OpenSearch and the file itself in filesystem.

The following code demonstrates how to insert this metadata into Cassandra and embeddings in OpenSearch, make sure to run the previous script, so the “paragraphs_with_embeddings” variable will be populated:

from tqdm import tqdm 

# Function to insert data into both Cassandra and OpenSearch 
def insert_paragraph_data(session, os_client, paragraph, keyspace_name, index_name): 
    # Insert into Cassandra 
    cassandra_result = insert_with_retry( 
        session=session, 
        id=paragraph['uuid'], 
        paragraph_uuid=paragraph['paragraph_uuid'], 
        text=paragraph['text'], 
        filename=paragraph['filename'], 
        keyspace_name=keyspace_name, 
        max_retries=3, 
        retry_delay_seconds=1 
    ) 

    if not cassandra_result: 
        return False  # Stop further processing if Cassandra insertion fails 

    # Insert into OpenSearch 
    opensearch_result = insert_embedding_to_opensearch( 
        os_client=os_client, 
        index_name=index_name, 
        file_uuid=paragraph['uuid'], 
        paragraph_uuid=paragraph['paragraph_uuid'], 
        embedding=paragraph['embedding'] 
    ) 

    if opensearch_result is not None: 
        return False  # Return False if OpenSearch insertion fails 

    return True  # Return True on success for both 

# Process each paragraph with a progress bar 
print("Starting batch insertion of paragraphs.") 

for paragraph in tqdm(paragraphs_with_embeddings, desc="Inserting paragraphs"): 
    if not insert_paragraph_data( 
        session=session, 
        os_client=os_client, 
        paragraph=paragraph, 
        keyspace_name=keyspace_name, 
        index_name=index_name 
    ): 

        print(f"Insertion failed for UUID {paragraph['uuid']}: {paragraph['text'][:50]}...") 

print("Batch insertion completed.")

Performing similarity search

Now that we’ve stored both metadata in Cassandra and embeddings in OpenSearch, it’s time to perform a similarity search. This step involves searching OpenSearch for embeddings that closely match a given input and then retrieving the corresponding metadata from Cassandra.

The process is straightforward: we start by converting the input text into an embedding, then use the k-Nearest Neighbors (kNN) plugin in OpenSearch to find the most similar embeddings. Once we have the results, we fetch the related metadata from Cassandra, such as the original text and file name.

Here’s how it works:

  1. Convert text to embedding: Start by converting your input text into an embedding vector using the FastText model. This vector will serve as the query for our similarity search.
  2. Search OpenSearch for similar embeddings: Using the KNN search capability in OpenSearch, we find the top k most similar embeddings. Each result includes the corresponding file and paragraph UUIDs, which help us link the results back to Cassandra.
  3. Fetch metadata from Cassandra: With the UUIDs retrieved from OpenSearch, we query Cassandra to get the metadata, such as the original text and file name, associated with each embedding.

The following code demonstrates this process:

import uuid 
from IPython.display import display, HTML 

def find_similar_embeddings_opensearch(os_client, index_name, input_embedding, top_k=5): 
    """Search for similar embeddings in OpenSearch and return the associated UUIDs.""" 
    query = { 
        "size": top_k, 
        "query": { 
            "knn": { 
                "embedding": { 
                    "vector": input_embedding.tolist(), 
                    "k": top_k 
                } 
            } 
        } 
    }

        response = os_client.search(index=index_name, body=query) 

    similar_uuids = [] 
    for hit in response['hits']['hits']: 
        file_uuid = hit['_source']['file_uuid'] 
        paragraph_uuid = hit['_source']['paragraph_uuid'] 
        similar_uuids.append((file_uuid, paragraph_uuid))  

    return similar_uuids 

def fetch_metadata_from_cassandra(session, file_uuid, paragraph_uuid, keyspace_name): 
    """Fetch the metadata (text and filename) from Cassandra based on UUIDs.""" 
    file_uuid = uuid.UUID(file_uuid) 
    paragraph_uuid = uuid.UUID(paragraph_uuid) 

    query = f""" 
    SELECT text, filename 
    FROM {keyspace_name}.file_metadata 
    WHERE id = ? AND paragraph_uuid = ?; 
    """ 
    prepared = session.prepare(query) 
    bound = prepared.bind((file_uuid, paragraph_uuid)) 
    rows = session.execute(bound)    

    for row in rows: 
        return row.filename, row.text 
    return None, None 

# Input text to find similar embeddings 
input_text = "place" 

# Convert input text to embedding 
input_embedding = text_to_vector(input_text) 

# Find similar embeddings in OpenSearch 
similar_uuids = find_similar_embeddings_opensearch(os_client, index_name=index_name, input_embedding=input_embedding, top_k=10) 

# Fetch and display metadata from Cassandra based on the UUIDs found in OpenSearch 
for file_uuid, paragraph_uuid in similar_uuids: 
    filename, text = fetch_metadata_from_cassandra(session, file_uuid, paragraph_uuid, 
keyspace_name)

    if filename and text: 
        html_content = f""" 
        <div style="margin-bottom: 10px;"> 
            <p><b>File UUID:</b> {file_uuid}</p> 
            <p><b>Paragraph UUID:</b> {paragraph_uuid}</p> 
            <p><b>Text:</b> {text}</p> 
            <p><b>File:</b> {filename}</p> 
        </div> 

        <hr/> 
        """ 

        display(HTML(html_content))

This code demonstrates how to find similar embeddings in OpenSearch and retrieve the corresponding metadata from Cassandra. By linking the two systems via the UUIDs, you can build powerful search and recommendation systems that combine metadata storage with advanced embedding-based searches.

Conclusion and next steps: A powerful combination of Cassandra 4 and OpenSearch

By leveraging the strengths of Cassandra 4 and OpenSearch, you can build a system that handles both metadata storage and similarity search. Cassandra efficiently stores your file and paragraph metadata, while OpenSearch takes care of embedding-based searches using the k-Nearest Neighbors algorithm. Together, these two technologies enable powerful, large-scale applications for text search, recommendation engines, and more.

Coming up in Part 2, we’ll explore how Cassandra 5 simplifies this architecture with built-in vector support and native similarity search capabilities.

Ready to try vector search with Cassandra and OpenSearch? Spin up your first cluster for free on the Instaclustr Managed Platform and explore the incredible power of vector search.

The post Introduction to similarity search with word embeddings: Part 1–Apache Cassandra® 4.0 and OpenSearch® appeared first on Instaclustr.

How Cassandra Streaming, Performance, Node Density, and Cost are All related

This is the first post of several I have planned on optimizing Apache Cassandra for maximum cost efficiency. I’ve spent over a decade working with Cassandra and have spent tens of thousands of hours data modeling, fixing issues, writing tools for it, and analyzing it’s performance. I’ve always been fascinated by database performance tuning, even before Cassandra.

A decade ago I filed one of my first issues with the project, where I laid out my target goal of 20TB of data per node. This wasn’t possible for most workloads at the time, but I’ve kept this target in my sights.

IBM acquires DataStax: What that means for customers–and why Instaclustr is a smart alternative

IBM’s recent acquisition of DataStax has certainly made waves in the tech industry. With IBM’s expanding influence in data solutions and DataStax’s reputation for advancing Apache Cassandra® technology, this acquisition could signal a shift in the database management landscape.

For businesses currently using DataStax, this news might have sparked questions about what the future holds. How does this acquisition impact your systems, your data, and, most importantly, your goals?

While the acquisition proposes prospects in integrating IBM’s cloud capabilities with high-performance NoSQL solutions, there’s uncertainty too. Transition periods for acquisitions often involve changes in product development priorities, pricing structures, and support strategies.

However, one thing is certain: customers want reliable, scalable, and transparent solutions. If you’re re-evaluating your options amid these changes, here’s why NetApp Instaclustr offers an excellent path forward.

Decoding the IBM-DataStax link-up

DataStax is a provider of enterprise solutions for Apache Cassandra, a powerful NoSQL database trusted for its ability to handle massive amounts of distributed data. IBM’s acquisition reflects its growing commitment to strengthening data management and expanding its footprint in the open source ecosystem.

While the acquisition promises an infusion of IBM’s resources and reach, IBM’s strategy often leans into long-term integration into its own cloud services and platforms. This could potentially reshape DataStax’s roadmap to align with IBM’s broader cloud-first objectives. Customers who don’t rely solely on IBM’s ecosystem—or want flexibility in their database management—might feel caught in a transitional limbo.

This is where Instaclustr comes into the picture as a strong, reliable alternative solution.

Why consider Instaclustr?

Instaclustr is purpose-built to empower businesses with a robust, open source data stack. For businesses relying on Cassandra or DataStax, Instaclustr delivers an alternative that’s stable, high-performing, and highly transparent.

Here’s why Instaclustr could be your best option moving forward:

1. 100% open source commitment

We’re firm believers in the power of open source technology. We offer pure Apache Cassandra, keeping it true to its roots without the proprietary lock-ins or hidden limitations. Unlike proprietary solutions, a commitment to pure open source ensures flexibility, freedom, and no vendor lock-in. You maintain full ownership and control.

2. Platform agnostic

One of the things that sets our solution apart is our platform-agnostic approach. Whether you’re running your workloads on AWS, Google Cloud, Azure, or on-premises environments, we make it seamless for you to deploy, manage, and scale Cassandra. This differentiates us from vendors tied deeply to specific clouds—like IBM.

3. Transparent pricing

Worried about the potential for a pricing overhaul under IBM’s leadership of DataStax? At Instaclustr, we pride ourselves on simplicity and transparency. What you see is what you get—predictable costs without hidden fees or confusing licensing rules. Our customer-first approach ensures that you remain in control of your budget.

4. Expert support and services

With Instaclustr, you’re not just getting access to technology—you’re also gaining access to a team of Cassandra experts who breathe open source. We’ve been managing and optimizing Cassandra clusters across the globe for years, with a proven commitment to providing best-in-class support.

Whether it’s data migration, scaling real-world workloads, or troubleshooting, we have you covered every step of the way. And our reliable SLA-backed managed Cassandra services mean businesses can focus less on infrastructure stress and more on innovation.

5. Seamless migrations

Concerned about the transition process? If you’re currently on DataStax and contemplating a move, our solution provides tools, guidance, and hands-on support to make the migration process smooth and efficient. Our experience in executing seamless migrations ensures minimal disruption to your operations.

Customer-centric focus

At the heart of everything we do is a commitment to your success. We understand that your data management strategy is critical to achieving your business goals, and we work hard to provide adaptable solutions.

Instaclustr comes to the table with over 10 years of experience in managing open source technologies including Cassandra, Apache Kafka®, PostgreSQL®, OpenSearch®, Valkey,® ClickHouse® and more, backed by over 400 million node hours and 18+ petabytes of data under management. Our customers trust and rely on us to manage the data that drives their critical business applications.

With a focus on fostering an open source future, our solutions aren’t tied to any single cloud, ecosystem, or bit of red tape. Simply put: your open source success is our mission.

Final thoughts: Why Instaclustr is the smart choice for this moment

IBM’s acquisition of DataStax might open new doors—but close many others. While the collaboration between IBM and DataStax might appeal to some enterprises, it’s important to weigh alternative solutions that offer reliability, flexibility, and freedom.

With Instaclustr, you get a partner that’s been empowering businesses with open source technologies for years, providing the transparency, support, and performance you need to thrive.

Ready to explore a stable, long-term alternative to DataStax? Check out Instaclustr for Apache Cassandra.

Contact us and learn more about Instaclustr for Apache Cassandra or request a demo of the Instaclustr platform today!

The post IBM acquires DataStax: What that means for customers–and why Instaclustr is a smart alternative appeared first on Instaclustr.

Innovative data compression for time series: An open source solution

Introduction

There’s no escaping the role that monitoring plays in our everyday lives. Whether it’s from monitoring the weather or the number of steps we take in a day, or computer systems to ever-popular IoT devices.  

Practically any activity can be monitored in one form or another these days. This generates increasing amounts of data to be pored over and analyzed–but storing all this data adds significant costs over time. Given this huge amount of data that only increases with each passing day, efficient compression techniques are crucial.  

Here at NetApp® Instaclustr we saw a great opportunity to improve the current compression techniques for our time series data. That’s why we created the Advanced Time Series Compressor (ATSC) in partnership with University of Canberra through the OpenSI initiative. 

ATSC is a groundbreaking compressor designed to address the challenges of efficiently compressing large volumes of time-series data. Internal test results with production data from our database metrics showed that ATSC would compress, on average of the dataset, ~10x more than LZ4 and ~30x more than the default Prometheus compression. Check out ATSC on GitHub. 

There are so many compressors already, so why develop another one? 

While other compression methods like LZ4, DoubleDelta, and ZSTD are lossless, most of our timeseries data is already lossy. Timeseries data can be lossy from the beginning due to under-sampling or insufficient data collection, or it can become lossy over time as metrics are rolled over or averaged. Because of this, the idea of a lossy compressor was born. 

ATSC is a highly configurable, lossy compressor that leverages the characteristics of time-series data to create function approximations. ATSC finds a fitting function and stores the parametrization of that functionno actual data from the original timeseries is stored. When the data is decompressed, it isn’t identical to the original, but it is still sufficient for the intended use. 

Here’s an example: for a temperature change metricwhich mostly varies slowly (as do a lot of system metrics!)instead of storing all the points that have a small change, we fit a curve (or a line) and store that curve/line achieving significant compression ratios. 

Image 1: ATSC data for temperature 

How does ATSC work? 

ATSC looks at the actual time series, in whole or in parts, to find how to better calculate a function that fits the existing data. For that, a quick statistical analysis is done, but if the results are inconclusive a sample is compressed with all the functions and the best function is selected.  

By default, ATSC will segment the datathis guarantees better local fitting, more and smaller computations, and less memory usage. It also ensures that decompression targets a specific block instead of the whole file. 

In each fitting frame, ATSC will create a function from a pre-defined set and calculate the parametrization of said function. 

ATSC currently uses one (per frame) of those following functions: 

  • FFT (Fast Fourier Transforms) 
  • Constant 
  • Interpolation – Catmull-Rom 
  • Interpolation – Inverse Distance Weight 

Image 2: Polynomial fitting vs. Fast-Fourier Transform fitting 

These methods allow ATSC to compress data with a fitting error within 1% (configurable!) of the original time-series.

For a more detailed insight into ATSC internals and operations check our paper! 

 Use cases for ATSC and results 

ATSC draws inspiration from established compression and signal analysis techniques, achieving compression ratios ranging from 46x to 880x with a fitting error within 1% of the original time-series. In some cases, ATSC can produce highly compressed data without losing any meaningful information, making it a versatile tool for various applications (please see use cases below). 

Some results from our internal tests comparing to LZ4 and normal Prometheus compression yielded the following results: 

Method   Compressed size (bytes)  Compression Ratio 
Prometheus  454,778,552  1.33 
LZ4  141,347,821  4.29 
ATSC  14,276,544   42.47 

Another characteristic is the trade-off between fast compression speed vs. slower compression speed. Compression is about 30x slower than decompression. It is expected that time-series are compressed once but decompressed several times. 

Image 3: A better fitting (purple) vs. a loose fitting (red). Purple takes twice as much space.

ATSC is versatile and can be applied in various scenarios where space reduction is prioritized over absolute precision. Some examples include: 

  • Rolled-over time series: ATSC can offer significant space savings without meaningful loss in precision, such as metrics data that are rolled over and stored for long term. ATSC provides the same or more space savings but with minimal information loss. 
  • Under-sampled time series: Increase sample rates without losing space. Systems that have very low sampling rates (30 seconds or more) and as such, it is very difficult to identify actual events. ATSC provides the space savings and keeps the information about the events. 
  • Long, slow-moving data series: Ideal for patterns that are easy to fit, such as weather data. 
  • Human visualization: Data meant for human analysis, with minimal impact on accuracy, such as historic views into system metrics (CPU, Memory, Disk, etc.)

Image 4: ATSC data (green) with an 88x compression vs. the original data (yellow)   

Using ATSC 

ATSC is written in Rust as and is available in GitHub. You can build and run yourself following these instructions. 

Future work 

Currently, we are planning to evolve ATSC in two ways (check our open issues): 

  1. Adding features to the core compressor focused on these functionalities:
    • Frame expansion for appending new data to existing frames 
    • Dynamic function loading to add more functions without altering the codebase 
    • Global and per-frame error storage 
    • Improved error encoding 
  2. Integrations with additional technologies (e.g. databases):
    • We are currently looking into integrating ASTC with ClickHouse® and Apache Cassandra® 
CREATE TABLE sensors_poly (   
    sensor_id UInt16,   
    location UInt32,
    timestamp DateTime,
    pressure Float64
CODEC(ATSC('Polynomial', 1)),
    temperature Float64 
CODEC(ATSC('Polynomial', 1)),
) 
ENGINE = MergeTree 
ORDER BY (sensor_id, location,
timestamp);

Image 5: Currently testing ClickHouse integration 

Sound interesting? Try it out and let us know what you think.  

ATSC represents a significant advancement in time-series data compression, offering high compression ratios with a configurable accuracy loss. Whether for long-term storage or efficient data visualization, ATSC is a powerful open source tool for managing large volumes of time-series data. 

But don’t just take our word for itdownload and run it!  

Check our documentation for any information you need and submit ideas for improvements or issues you find using GitHub issues. We also have easy first issues tagged if you’d like to contribute to the project.   

Want to integrate this with another tool? You can build and run our demo integration with ClickHouse. 

The post Innovative data compression for time series: An open source solution appeared first on Instaclustr.

New cassandra_latest.yaml configuration for a top performant Apache Cassandra®

Welcome to our deep dive into the latest advancements in Apache Cassandra® 5.0, specifically focusing on the cassandra_latest.yaml configuration that is available for new Cassandra 5.0 clusters.  

This blog post will walk you through the motivation behind these changes, how to use the new configuration, and the benefits it brings to your Cassandra clusters. 

Motivation 

The primary motivation for introducing cassandra_latest.yaml is to bridge the gap between maintaining backward compatibility and leveraging the latest features and performance improvements. The yaml addresses the following varying needs for new Cassandra 5.0 clusters: 

  1. Cassandra Developers: who want to push new features but face challenges due to backward compatibility constraints. 
  2. Operators: who prefer stability and minimal disruption during upgrades. 
  3. Evangelists and New Users: who seek the latest features and performance enhancements without worrying about compatibility. 

Using cassandra_latest.yaml 

Using cassandra_latest.yaml is straightforward. It involves copying the cassandra_latest.yaml content to your cassandra.yaml or pointing the cassandra.config JVM property to the cassandra_latest.yaml file.  

This configuration is designed for new Cassandra 5.0 clusters (or those evaluating Cassandra), ensuring they get the most out of the latest features in Cassandra 5.0 and performance improvements. 

Key changes and features 

Key Cache Size 

  • Old: Evaluated as a minimum from 5% of the heap or 100MB
  • Latest: Explicitly set to 0

Impact: Setting the key cache size to 0 in the latest configuration avoids performance degradation with the new SSTable format. This change is particularly beneficial for clusters using the new SSTable format, which doesn’t require key caching in the same way as the old format. Key caching was used to reduce the time it takes to find a specific key in Cassandra storage. 

Commit Log Disk Access Mode 

  • Old: Set to legacy
  • Latest: Set to auto

Impact: The auto setting optimizes the commit log disk access mode based on the available disks, potentially improving write performance. It can automatically choose the best mode (e.g., direct I/O) depending on the hardware and workload, leading to better performance without manual tuning.

Memtable Implementation 

  • Old: Skiplist-based
  • Latest: Trie-based

Impact: The trie-based memtable implementation reduces garbage collection overhead and improves throughput by moving more metadata off-heap. This change can lead to more efficient memory usage and higher write performance, especially under heavy load.

create table … with memtable = {'class': 'TrieMemtable', … }

Memtable Allocation Type 

  • Old: Heap buffers 
  • Latest: Off-heap objects 

Impact: Using off-heap objects for memtable allocation reduces the pressure on the Java heap, which can improve garbage collection performance and overall system stability. This is particularly beneficial for large datasets and high-throughput environments. 

Trickle Fsync 

  • Old: False 
  • Latest: True 

Impact: Enabling trickle fsync improves performance on SSDs by periodically flushing dirty buffers to disk, which helps avoid sudden large I/O operations that can impact read latencies. This setting is particularly useful for maintaining consistent performance in write-heavy workloads. 

SSTable Format 

  • Old: big 
  • Latest: bti (trie-indexed structure) 

Impact: The new BTI format is designed to improve read and write performance by using a trie-based indexing structure. This can lead to faster data access and more efficient storage management, especially for large datasets. 

sstable:
  selected_format: bti
  default_compression: zstd
  compression:
    zstd:
      enabled: true
      chunk_length: 16KiB
      max_compressed_length: 16KiB

Default Compaction Strategy 

  • Old: STCS (Size-Tiered Compaction Strategy) 
  • Latest: Unified Compaction Strategy 

Impact: The Unified Compaction Strategy (UCS) is more efficient and can handle a wider variety of workloads compared to STCS. UCS can reduce write amplification and improve read performance by better managing the distribution of data across SSTables. 

default_compaction:
  class_name: UnifiedCompactionStrategy
  parameters:
    scaling_parameters: T4
    max_sstables_to_compact: 64
    target_sstable_size: 1GiB
    sstable_growth: 0.3333333333333333
    min_sstable_size: 100MiB

Concurrent Compactors 

  • Old: Defaults to the smaller of the number of disks and cores
  • Latest: Explicitly set to 8

Impact: Setting the number of concurrent compactors to 8 ensures that multiple compaction operations can run simultaneously, helping to maintain read performance during heavy write operations. This is particularly beneficial for SSD-backed storage where parallel I/O operations are more efficient. 

Default Secondary Index 

  • Old: legacy_local_table
  • Latest: sai

Impact: SAI is a new index implementation that builds on the advancements made with SSTable Storage Attached Secondary Index (SASI). Provide a solution that enables users to index multiple columns on the same table without suffering scaling problems, especially at write time. 

Stream Entire SSTables 

  • Old: implicity set to True
  • Latest: explicity set to True

Impact: When enabled, it permits Cassandra to zero-copy stream entire eligible, SSTables between nodes, including every component. This speeds up the network transfer significantly subject to throttling specified by

entire_sstable_stream_throughput_outbound

and

entire_sstable_inter_dc_stream_throughput_outbound

for inter-DC transfers. 

UUID SSTable Identifiers 

  • Old: False
  • Latest: True

Impact: Enabling UUID-based SSTable identifiers ensures that each SSTable has a unique name, simplifying backup and restore operations. This change reduces the risk of name collisions and makes it easier to manage SSTables in distributed environments. 

Storage Compatibility Mode 

  • Old: Cassandra 4
  • Latest: None

Impact: Setting the storage compatibility mode to none enables all new features by default, allowing users to take full advantage of the latest improvements, such as the new sstable format, in Cassandra. This setting is ideal for new clusters or those that do not need to maintain backward compatibility with older versions. 

Testing and validation 

The cassandra_latest.yaml configuration has undergone rigorous testing to ensure it works seamlessly. Currently, the Cassandra project CI pipeline tests both the standard (cassandra.yaml) and latest (cassandra_latest.yaml) configurations, ensuring compatibility and performance. This includes unit tests, distributed tests, and DTests. 

Future improvements 

Future improvements may include enforcing password strength policies and other security enhancements. The community is encouraged to suggest features that could be enabled by default in cassandra_latest.yaml. 

Conclusion 

The cassandra_latest.yaml configuration for new Cassandra 5.0 clusters is a significant step forward in making Cassandra more performant and feature-rich while maintaining the stability and reliability that users expect. Whether you are a developer, an operator professional, or an evangelist/end user, cassandra_latest.yaml offers something valuable for everyone. 

Try it out 

Ready to experience the incredible power of the cassandra_latest.yaml configuration on Apache Cassandra 5.0? Spin up your first cluster with a free trial on the Instaclustr Managed Platform and get started today with Cassandra 5.0!

The post New cassandra_latest.yaml configuration for a top performant Apache Cassandra® appeared first on Instaclustr.

Cassandra 5 Released! What's New and How to Try it

Apache Cassandra 5.0 has officially landed! This highly anticipated release brings a range of new features and performance improvements to one of the most popular NoSQL databases in the world. Having recently hosted a webinar covering the major features of Cassandra 5.0, I’m excited to give a brief overview of the key updates and show you how to easily get hands-on with the latest release using easy-cass-lab.

You can grab the latest release on the Cassandra download page.

Instaclustr for Apache Cassandra® 5.0 Now Generally Available

NetApp is excited to announce the general availability (GA) of Apache Cassandra® 5.0 on the Instaclustr Platform. This follows the release of the public preview in March.

NetApp was the first managed service provider to release the beta version, and now the Generally Available version, allowing the deployment of Cassandra 5.0 across the major cloud providers: AWS, Azure, and GCP, and onpremises.

Apache Cassandra has been a leader in NoSQL databases since its inception and is known for its high availability, reliability, and scalability. The latest version brings many new features and enhancements, with a special focus on building data-driven applications through artificial intelligence and machine learning capabilities.

Cassandra 5.0 will help you optimize performance, lower costs, and get started on the next generation of distributed computing by: 

  • Helping you build AI/ML-based applications through Vector Search  
  • Bringing efficiencies to your applications through new and enhanced indexing and processing capabilities 
  • Improving flexibility and security 

With the GA release, you can use Cassandra 5.0 for your production workloads, which are covered by NetApp’s industryleading SLAs. NetApp has conducted performance benchmarking and extensive testing while removing the limitations that were present in the preview release to offer a more reliable and stable version. Our GA offering is suitable for all workload types as it contains the most up-to-date range of features, bug fixes, and security patches.  

Support for continuous backups and private network addons is available. Currently, Debezium is not yet compatible with Cassandra 5.0. NetApp will work with the Debezium community to add support for Debezium on Cassandra 5.0 and it will be available on the Instaclustr Platform as soon as it is supported. 

Some of the key new features in Cassandra 5.0 include: 

  • Storage-Attached Indexes (SAI): A highly scalable, globally distributed index for Cassandra databases. With SAI, column-level indexes can be added, leading to unparalleled I/O throughput for searches across different data types, including vectors. SAI also enables lightning-fast data retrieval through zero-copy streaming of indices, resulting in unprecedented efficiency.  
  • Vector Search: This is a powerful technique for searching relevant content or discovering connections by comparing similarities in large document collections and is particularly useful for AI applications. It uses storage-attached indexing and dense indexing techniques to enhance data exploration and analysis.  
  • Unified Compaction Strategy: This strategy unifies compaction approaches, including leveled, tiered, and time-windowed strategies. It leads to a major reduction in SSTable sizes. Smaller SSTables mean better read and write performance, reduced storage requirements, and improved overall efficiency.  
  • Numerous stability and testing improvements: You can read all about these changes here. 

All these new features are available out-of-the-box in Cassandra 5.0 and do not incur additional costs.  

Our Development team has worked diligently to bring you a stable release of Cassandra 5.0. Substantial preparatory work was done to ensure you have a seamless experience with Cassandra 5.0 on the Instaclustr Platform. This includes updating the Cassandra YAML and Java environment and enhancing the monitoring capabilities of the platform to support new data types.  

We also conducted extensive performance testing and benchmarked version 5.0 with the existing stable Apache Cassandra 4.1.5 version. We will be publishing our benchmarking results shortly; the highlight so far is that Cassandra 5.0 improves responsiveness by reducing latencies by up to 30% during peak load times.  

Through our dedicated Apache Cassandra committer, NetApp has contributed to the development of Cassandra 5.0 by enhancing the documentation for new features like Vector Search (Cassandra-19030), enabling Materialized Views (MV) with only partition keys (Cassandra-13857), fixing numerous bugs, and contributing to the improvements for the unified compaction strategy feature, among many other things. 

Lifecycle Policy Updates 

As previously communicated, the project will no longer maintain Apache Cassandra 3.0 and 3.11 versions (full details of the announcement can be found on the Apache Cassandra website).

To help you transition smoothly, NetApp will provide extended support for these versions for an additional 12 months. During this period, we will backport any critical bug fixes, including security patches, to ensure the continued security and stability of your clusters. 

Cassandra 3.0 and 3.11 versions will reach end-of-life on the Instaclustr Managed Platform within the next 12 months. We will work with you to plan and upgrade your clusters during this period.  

Additionally, the Cassandra 5.0 beta version and the Cassandra 5.0 RC2 version, which were released as part of the public preview, are now end-of-life You can check the lifecycle status of different Cassandra application versions here.  

You can read more about our lifecycle policies on our website. 

Getting Started 

Upgrading to Cassandra 5.0 will allow you to stay current and start taking advantage of its benefits. The Instaclustr by NetApp Support team is ready to help customers upgrade clusters to the latest version.  

  • Wondering if it’s possible to upgrade your workloads from Cassandra 3.x to Cassandra 5.0? Find the answer to this and other similar questions in this detailed blog.
  • Click here to read about Storage Attached Indexes in Apache Cassandra 5.0.
  • Learn about 4 new Apache Cassandra 5.0 features to be excited about. 
  • Click here to learn what you need to know about Apache Cassandra 5.0. 

Why Choose Apache Cassandra on the Instaclustr Managed Platform? 

NetApp strives to deliver the best of supported applications. Whether it’s the latest and newest application versions available on the platform or additional platform enhancements, we ensure a high quality through thorough testing before entering General Availability.  

NetApp customers have the advantage of accessing the latest versions—not just the major version releases but also minor version releases—so that they can benefit from any new features and are protected from any vulnerabilities.  

Don’t have an Instaclustr account yet? Sign up for a trial or reach out to our Sales team and start exploring Cassandra 5.0.  

With more than 375 million node hours of management experience, Instaclustr offers unparalleled expertise. Visit our website to learn more about the Instaclustr Managed Platform for Apache Cassandra.  

If you would like to upgrade your Apache Cassandra version or have any issues or questions about provisioning your cluster, please contact Instaclustr Support at any time.  

The post Instaclustr for Apache Cassandra® 5.0 Now Generally Available appeared first on Instaclustr.

Apache Cassandra® 5.0: Behind the Scenes

Here at NetApp, our Instaclustr product development team has spent nearly a year preparing for the release of Apache Cassandra 5.  

Starting with one engineer tinkering at night with the Apache Cassandra 5 Alpha branch, and then up to 5 engineers working on various monitoring, configuration, testing and functionality improvements to integrate the release with the Instaclustr Platform.  

It’s been a long journey to the point we are at today, offering Apache Cassandra 5 Release Candidate 1 in public preview on the Instaclustr Platform. 

Note: the Instaclustr team has a dedicated open source committer to the Apache Cassandra projectHis changes are not included in this document as there were too many for us to include here. Instead, this blog primarily focuses on the engineering effort to release Cassandra 5.0 onto the Instaclustr Managed Platform. 

August 2023: The Beginning

We began experimenting with the Apache Cassandra 5 Alpha 1 branches using our build systems. There were several tools we built into our Apache Cassandra images that were not working at this point, but we managed to get a node to start even though it immediately crashed with errors.  

One of our early achievements was identifying and fixing a bug that impacted our packaging solution; this resulted in a small contribution to the project allowing Apache Cassandra to be installed on Debian systems with non-OpenJDK Java. 

September 2023: First Milestone 

The release of the Alpha 1 version allowed us to achieve our first running Cassandra 5 cluster in our development environments (without crashing!).  

Basic core functionalities like user creation, data writing, and backups/restores were tested successfully. However, several advanced features, such as repair and replace tooling, monitoring, and alerting were still untested.  

At this point we had to pause our Cassandra 5 efforts to focus on other priorities and planned to get back to testing Cassandra 5 after Alpha 2 was released. 

November 2023 Further Testing and Internal Preview 

The project released Alpha 2. We repeated the same build and test we did on alpha 1. We also tested some more advanced procedures like cluster resizes with no issues.  

We also started testing with some of the new 5.0 features: Vector Data types and Storage-Attached Indexes (SAI), which resulted in another small contribution.  

We launched Apache Cassandra 5 Alpha 2 for internal preview (basically for internal users). This allowed the wider Instaclustr team to access and use the Alpha on the platform.  

During this phase we found a bug in our metrics collector when vectors were encountered that ended up being a major project for us. 

If you see errors like the below, it’s time for a Java Cassandra driver upgrade to 4.16 or newer: 

java.lang.IllegalArgumentException: Could not parse type name vector<float, 5>  
Nov 15 22:41:04 ip-10-0-39-7 process[1548]: at com.datastax.driver.core.DataTypeCqlNameParser.parse(DataTypeCqlNameParser.java:233)  
Nov 15 22:41:04 ip-10-0-39-7 process[1548]: at com.datastax.driver.core.TableMetadata.build(TableMetadata.java:311)
Nov 15 22:41:04 ip-10-0-39-7 process[1548]: at com.datastax.driver.core.SchemaParser.buildTables(SchemaParser.java:302)
Nov 15 22:41:04 ip-10-0-39-7 process[1548]: at com.datastax.driver.core.SchemaParser.refresh(SchemaParser.java:130)
Nov 15 22:41:04 ip-10-0-39-7 process[1548]: at com.datastax.driver.core.ControlConnection.refreshSchema(ControlConnection.java:417)  
Nov 15 22:41:04 ip-10-0-39-7 process[1548]: at com.datastax.driver.core.ControlConnection.refreshSchema(ControlConnection.java:356)  
<Rest of stacktrace removed for brevity>

December 2023: Focus on new features and planning 

As the project released Beta 1, we began focusing on the features in Cassandra 5 that we thought were the most exciting and would provide the most value to customers. There are a lot of awesome new features and changes, so it took a while to find the ones with the largest impact.  

 The final list of high impact features we came up with was: 

  • A new data type Vectors 
  • Trie memtables/Trie Indexed SSTables (BTI Formatted SStables) 
  • Storage-Attached Indexes (SAI) 
  • Unified Compaction Strategy 

A major new feature we considered deploying was support for JDK 17. However, due to its experimental nature, we have opted to postpone adoption and plan to support running Apache Cassandra on JDK 17 when it’s out of the experimentation phase. 

Once the holiday season arrived, it was time for a break, and we were back in force in February next year. 

February 2024: Intensive testing 

In February, we released Beta 1 into internal preview so we could start testing it on our Preproduction test environments. As we started to do more intensive testing, wdiscovered issues in the interaction with our monitoring and provisioning setup. 

We quickly fixed the issues identified as showstoppers for launching Cassandra 5. By the end of February, we initiated discussions about a public preview release. We also started to add more resourcing to the Cassandra 5 project. Up until now, only one person was working on it.  

Next, we broke down the work we needed to do This included identifying monitoring agents requiring upgrade and config defaults that needed to change. 

From this point, the project split into 3 streams of work: 

  1. Project Planning – Deciding how all this work gets pulled together cleanly, ensuring other work streams have adequate resourcing to hit their goals, and informing product management and the wider business of what’s happening.  
  2. Configuration Tuning – Focusing on the new features of Apache Cassandra to include, how to approach the transition to JDK 17, and how to use BTI formatted SSTables on the platform.  
  3. Infrastructure Upgrades Identifying what to upgrade internally to handle Cassandra 5, including Vectors and BTI formatted SSTables. 

A Senior Engineer was responsible for each workstream to ensure planned timeframes were achieved. 

March 2024: Public Preview Release 

In March, we launched Beta 1 into public preview on the Instaclustr Managed Platform. The initial release did not contain any opt in features like Trie indexed SSTables. 

However, this gave us a consistent base to test in our development, test, and production environments, and proved our release pipeline for Apache Cassandra 5 was working as intended. This also gave customers the opportunity to start using Apache Cassandra 5 with their own use cases and environments for experimentation.  

See our public preview launch blog for further details. 

There was not much time to celebrate as we continued working on infrastructure and refining our configuration defaults. 

April 2024: Configuration Tuning and Deeper Testing 

The first configuration updates were completed for Beta 1, and we started performing deeper functional and performance testing. We identified a few issues from this effort and remediated. This default configuration was applied for all Beta 1 clusters moving forward.  

This allowed users to start testing Trie Indexed SSTables and Trie memtables in their environment by default. 

"memtable": 
  { 
    "configurations": 
      { 
        "skiplist": 
          { 
            "class_name": "SkipListMemtable" 
          }, 
        "sharded": 
          { 
            "class_name": "ShardedSkipListMemtable" 
          }, 
        "trie": 
          { 
            "class_name": "TrieMemtable" 
          }, 
        "default": 
          { 
            "inherits": "trie" 
          } 
      } 
  }, 
"sstable": 
  { 
    "selected_format": "bti" 
  }, 
"storage_compatibility_mode": "NONE",

The above graphic illustrates an Apache Cassandra YAML configuration where BTI formatted sstables are used by default (which allows Trie Indexed SSTables) and defaults use of Trie for memtables You can override this per table: 

CREATE TABLE test WITH memtable = {‘class’ : ‘ShardedSkipListMemtable’};

Note that you need to set storage_compatibility_mode to NONE to use BTI formatted sstables. See Cassandra documentation for more information

You can also reference the cassandra_latest.yaml  file for the latest settings (please note you should not apply these to existing clusters without rigorous testing). 

May 2024: Major Infrastructure Milestone 

We hit a very large infrastructure milestone when we released an upgrade to some of our core agents that were reliant on an older version of the Apache Cassandra Java driver. The upgrade to version 4.17 allowed us to start supporting vectors in certain keyspace level monitoring operations.  

At the time, this was considered to be the riskiest part of the entire project as we had 1000s of nodes to upgrade across may different customer environments. This upgrade took a few weeks, finishing in June. We broke the release up into 4 separate rollouts to reduce the risk of introducing issues into our fleet, focusing on single key components in our architecture in each release. Each release had quality gates and tested rollback plans, which in the end were not needed. 

June 2024: Successful Rollout New Cassandra Driver 

The Java driver upgrade project was rolled out to all nodes in our fleet and no issues were encountered. At this point we hit all the major milestones before Release Candidates became available. We started to look at the testing systems to update to Apache Cassandra 5 by default. 

July 2024: Path to Release Candidate 

We upgraded our internal testing systems to use Cassandra 5 by default, meaning our nightly platform tests began running against Cassandra 5 clusters and our production releases will smoke test using Apache Cassandra 5. We started testing the upgrade path for clusters from 4.x to 5.0. This resulted in another small contribution to the Cassandra project.  

The Apache Cassandra project released Apache Cassandra 5 Release Candidate 1 (RC1), and we launched RC1 into public preview on the Instaclustr Platform. 

The Road Ahead to General Availability 

We’ve just launched Apache Cassandra 5 Release Candidate 1 (RC1) into public preview, and there’s still more to do before we reach General Availability for Cassandra 5, including: 

  • Upgrading our own preproduction Apache Cassandra for internal use to Apache Cassandra 5 Release Candidate 1. This means we’ll be testing using our real-world use cases and testing our upgrade procedures on live infrastructure. 

At Launch: 

When Apache Cassandra 5.0 launches, we will perform another round of testing, including performance benchmarking. We will also upgrade our internal metrics storage production Apache Cassandra clusters to 5.0, and, if the results are satisfactory, we will mark the release as generally available for our customers. We want to have full confidence in running 5.0 before we recommend it for production use to our customers.  

For more information about our own usage of Cassandra for storing metrics on the Instaclustr Platform check out our series on Monitoring at Scale.  

What Have We Learned From This Project? 

  • Releasing limited, small and frequent changes has resulted in a smooth project, even if sometimes frequent releases do not feel smooth. Some thoughts: 
    • Releasing to a small subset of internal users allowed us to take risks and break things more often so we could learn from our failures safely.
    • Releasing small changes allowed us to more easily understand and predict the behaviour of our changes: what to look out for in case things went wrong, how to more easily measure success, etc. 
    • Releasing frequently built confidence within the wider Instaclustr team, which in turn meant we would be happier taking more risks and could release more often.  
  • Releasing to internal and public preview helped create momentum within the Instaclustr business and teams:  
    • This turned the Apache Cassandra 5.0 release from something that “was coming soon and very exciting” to “something I can actually use.”
  • Communicating frequently, transparently, and efficiently is the foundation of success:  
    • We used a dedicated Slack channel (very creatively named #cassandra-5-project) to discuss everything. 
    • It was quick and easy to go back to see why we made certain decisions or revisit them if needed. This had a bonus of allowing a Lead Engineer to write a blog post very quickly about the Cassandra 5 project. 

This has been a longrunning but very exciting project for the entire team here at Instaclustr. The Apache Cassandra community is on the home stretch for this massive release, and we couldn’t be more excited to start seeing what everyone will build with it.  

You can sign up today for a free trial and test Apache Cassandra 5 Release Candidate 1 by creating a cluster on the Instaclustr Managed Platform.  

More Readings 

 

The post Apache Cassandra® 5.0: Behind the Scenes appeared first on Instaclustr.

easy-cass-lab v5 released

I’ve got some fun news to start the week off for users of easy-cass-lab: I’ve just released version 5. There are a number of nice improvements and bug fixes in here that should make it more enjoyable, more useful, and lay groundwork for some future enhancements.

  • When the cluster starts, we wait for the storage service to reach NORMAL state, then move to the next node. This is in contrast to the previous behavior where we waited for 2 minutes after starting a node. This queries JMX directly using Swiss Java Knife and is more reliable than the 2-minute method. Please see packer/bin-cassandra/wait-for-up-normal to read through the implementation.
  • Trunk now works correctly. Unfortunately, AxonOps doesn’t support trunk (5.1) yet, and using the agent was causing a startup error. You can test trunk out, but for now the AxonOps integration is disabled.
  • Added a new repl mode. This saves keystrokes and provides some auto-complete functionality and keeps SSH connections open. If you’re going to do a lot of work with ECL this will help you be a little more efficient. You can try this out with ecl repl.
  • Power user feature: Initial support for profiles in AWS regions other than us-west-2. We only provide AMIs for us-west-2, but you can now set up a profile in an alternate region, and build the required AMIs using easy-cass-lab build-image. This feature is still under development and requires using an easy-cass-lab build from source. Credit to Jordan West for contributing this work.
  • Power user feature: Support for multiple profiles. Setting the EASY_CASS_LAB_PROFILE environment variable allows you to configure alternate profiles. This is handy if you want to use multiple regions or have multiple organizations.
  • The project now uses Kotlin instead of Groovy for Gradle configuration.
  • Updated Gradle to 8.9.
  • When using the list command, don’t show the alias “current”.
  • Project cleanup, remove old unused pssh, cassandra build, and async profiler subprojects.

The release has been released to the project’s GitHub page and to homebrew. The project is largely driven by my own consulting needs and for my training. If you’re looking to have some features prioritized please reach out, and we can discuss a consulting engagement.

easy-cass-lab updated with Cassandra 5.0 RC-1 Support

I’m excited to announce that the latest version of easy-cass-lab now supports Cassandra 5.0 RC-1, which was just made available last week! This update marks a significant milestone, providing users with the ability to test and experiment with the newest Cassandra 5.0 features in a simplified manner. This post will walk you through how to set up a cluster, SSH in, and run your first stress test.

For those new to easy-cass-lab, it’s a tool designed to streamline the setup and management of Cassandra clusters in AWS, making it accessible for both new and experienced users. Whether you’re running tests, developing new features, or just exploring Cassandra, easy-cass-lab is your go-to tool.

easy-cass-lab now available in Homebrew

I’m happy to share some exciting news for all Cassandra enthusiasts! My open source project, easy-cass-lab, is now installable via a homebrew tap. This powerful tool is designed to make testing any major version of Cassandra (or even builds that haven’t been released yet) a breeze, using AWS. A big thank-you to Jordan West who took the time to make this happen!

What is easy-cass-lab?

easy-cass-lab is a versatile testing tool for Apache Cassandra. Whether you’re dealing with the latest stable releases or experimenting with unreleased builds, easy-cass-lab provides a seamless way to test and validate your applications. With easy-cass-lab, you can ensure compatibility and performance across different Cassandra versions, making it an essential tool for developers and system administrators. easy-cass-lab is used extensively for my consulting engagements, my training program, and to evaluate performance patches destined for open source Cassandra. Here are a few examples:

Cassandra Training Signups For July and August Are Open!

I’m pleased to announce that I’ve opened training signups for Operator Excellence to the public for July and August. If you’re interested in stepping up your game as a Cassandra operator, this course is for you. Head over to the training page to find out more and sign up for the course.

Streaming My Sessions With Cassandra 5.0

As a long time participant with the Cassandra project, I’ve witnessed firsthand the evolution of this incredible database. From its early days to the present, our journey has been marked by continuous innovation, challenges, and a relentless pursuit of excellence. I’m thrilled to share that I’ll be streaming several working sessions over the next several weeks as I evaluate the latest builds and test out new features as we move toward the 5.0 release.

Streaming Cassandra Workloads and Experiments

Streaming

In the world of software engineering, especially within the realm of distributed systems, continuous learning and experimentation are not just beneficial; they’re essential. As a software engineer with a focus on distributed systems, particularly Apache Cassandra, I’ve taken this ethos to heart. My journey has led me to not only explore the intricacies of Cassandra’s distributed architecture but also to share my experiences and findings with a broader audience. This is why my YouTube channel has become an active platform where I stream at least once a week, engaging with viewers through coding sessions, trying new approaches, and benchmarking different Cassandra workloads.

Live Streaming On Tuesdays

As I promised in December, I redid my presentation from the Cassandra Summit 2023 on a live stream. You can check it out at the bottom of this post.

Going forward, I’ll be live-streaming on Tuesdays at 10AM Pacific on my YouTube channel.

Next week I’ll be taking a look at tlp-stress, which is used by the teams at some of the biggest Cassandra deployments in the world to benchmark their clusters. You can find that here.

Cassandra Summit Recap: Performance Tuning and Cassandra Training

Hello, friends in the Apache Cassandra community!

I recently had the pleasure of speaking at the Cassandra Summit in San Jose. Unfortunately, we ran into an issue with my screen refusing to cooperate with the projector, so my slides were pretty distorted and hard to read. While the talk is online, I think it would be better to have a version with the right slides as well as a little more time. I’ve decided to redo the entire talk via a live stream on YouTube. I’m scheduling this for 10am PST on Wednesday, January 17 on my YouTube channel. My original talk was done in 30 minute slot, this will be a full hour, giving plenty of time for Q&A.

Cassandra Summit, YouTube, and a Mailing List

I am thrilled to share some significant updates and exciting plans with my readers and the Cassandra community. As we draw closer to the end of the year, I’m preparing for an important speaking engagement and mapping out a year ahead filled with engaging and informative activities.

Cassandra Summit Presentation: Mastering Performance Tuning

I am honored to announce that I will be speaking at the upcoming Cassandra Summit. My talk, titled “Cassandra Performance Tuning Like You’ve Been Doing It for Ten Years,” is scheduled for December 13th, from 4:10 pm to 4:40 pm. This session aims to equip attendees with advanced insights and practical skills for optimizing Cassandra’s performance, drawing from a decade’s worth of experience in the field. Whether you’re new to Cassandra or a seasoned user, this talk will provide valuable insights to enhance your database management skills.

Uncover Cassandra's Throughput Boundaries with the New Adaptive Scheduler in tlp-stress

Introduction

Apache Cassandra remains the preferred choice for organizations seeking a massively scalable NoSQL database. To guarantee predictable performance, Cassandra administrators and developers rely on benchmarking tools like tlp-stress, nosqlbench, and ndbench to help them discover their cluster’s limits. In this post, we will explore the latest advancements in tlp-stress, highlighting the introduction of the new Adaptive Scheduler. This brand-new feature allows users to more easily uncover the throughput boundaries of Cassandra clusters while remaining within specific read and write latency targets. First though, we’ll take a brief look at the new workload designed to stress test the new Storage Attached Indexes feature coming in Cassandra 5.

AxonOps Review - An Operations Platform for Apache Cassandra

Note: Before we dive into this review of AxonOps and their offerings, it’s important to note that this blog post is part of a paid engagement in which I provided product feedback. AxonOps had no influence or say over the content of this post and did not have access to it prior to publishing.

In the ever-evolving landscape of data management, companies are constantly seeking solutions that can simplify the complexities of database operations. One such player in the market is AxonOps, a company that specializes in providing tooling for operating Apache Cassandra.

Benchmarking Apache Cassandra with tlp-stress

This post will introduce you to tlp-stress, a tool for benchmarking Apache Cassandra. I started tlp-stress back when I was working at The Last Pickle. At the time, I was spending a lot of time helping teams identify the root cause of performance issues and needed a way of benchmarking. I found cassandra-stress to be difficult to use and configure, so I ended up writing my own tool that worked in a manner that I found to be more useful. If you’re looking for a tool to assist you in benchmarking Cassandra, and you’re looking to get started quickly, this might be the right tool for you.

Back to Consulting!

Saying “it’s been a while since I wrote anything here” would be an understatement, but I’m back, with a lot to talk about in the upcoming months.

First off - if you’re not aware, I continued writing, but on The Last Pickle blog. There’s quite a few posts there, here are the most interesting ones:

Now the fun part - I’ve spent the last 3 years at Apple, then Netflix, neither of which gave me much time to continue my writing. As of this month, I’m officially no longer at Netflix and have started Rustyrazorblade Consulting!

Building a 100% ScyllaDB Shard-Aware Application Using Rust

Building a 100% ScyllaDB Shard-Aware Application Using Rust

I wrote a web transcript of the talk I gave with my colleagues Joseph and Yassir at [Scylla Su...

Learning Rust the hard way for a production Kafka+ScyllaDB pipeline

Learning Rust the hard way for a production Kafka+ScyllaDB pipeline

This is the web version of the talk I gave at [Scylla Summit 2022](https://www.scyllad...

On Scylla Manager Suspend & Resume feature

On Scylla Manager Suspend & Resume feature

!!! warning "Disclaimer" This blog post is neither a rant nor intended to undermine the great work that...

Renaming and reshaping Scylla tables using scylla-migrator

We have recently faced a problem where some of the first Scylla tables we created on our main production cluster were not in line any more with the evolved s...

Python scylla-driver: how we unleashed the Scylla monster's performance

At Scylla summit 2019 I had the chance to meet Israel Fruchter and we dreamed of working on adding **shard...