DataStax AI PaaS Is Now Enhanced with State-of-the-Art Retrieval Embedding with NVIDIA NeMo Retriever Integration

DataStax Enterprise 6.9, DataStax Hyper-Converged Database (HCD), and DataStax Astra DB combine the best of NoSQL and vector databases, powered by Apache Cassandra®.  We provide the premier solution for modern generative AI applications that supports diverse data formats required by these apps. The...

Benchmarking MongoDB vs ScyllaDB: Social Media Workload Deep Dive

benchANT’s comparison of ScyllaDB vs MongoDB in terms of throughput, latency, scalability, and cost for a social media workload BenchANT recently benchmarked the performance and scalability of the market-leading general-purpose NoSQL database MongoDB and its performance-oriented challenger ScyllaDB. You can read a summary of the results in the blog Benchmarking MongoDB vs ScyllaDB: Performance, Scalability & Cost, see the key takeaways for various workloads in this technical summary, and access all results (including the raw data) from the complete benchANT report. This blog offers a deep dive into the tests performed for the social workload. This workload is based on the YCSB Workload B. It creates a read-heavy workload, with 95% read operations and 5% update operations. We use two shapes of this workload, which differ in terms of the request distribution patterns, namely uniform and hotspot distribution. These workloads are executed against the small database scaling size with a data set of 500GB and against the medium scaling size with a data set of 1TB. Before we get into the benchmark details, here is a summary of key insights for this workload. ScyllaDB outperforms MongoDB with higher throughput and lower latency for all measured configurations of the social workload. ScyllaDB provides up to 12 times higher throughput ScyllaDB provides significantly lower (down to 47 times) update latencies compared to MongoDB ScyllaDB provides lower read latencies, down to 5 times Throughput Results for MongoDB vs ScyllaDB The throughput results for the social workload with the uniform request distribution show that the small ScyllaDB cluster is able to serve 60 kOps/s with a cluster CPU utilization of ~85% while the small MongoDB cluster serves only 10 kOps/s under a comparable cluster utilization of 80-90%. For the medium cluster sizes, ScyllaDB achieves an average throughput of 232 kOps/s showing ~85% cluster utilization while MongoDB achieves 42 kOps/s at a CPU utilization of ~85%. The throughput results for the social workload with the hotspot request distribution show a similar trend, but with higher throughput numbers since the data is mostly read from the cache. The small ScyllaDB cluster serves 152 kOps/s while the small MongoDB serves 14 kOps/s. For the medium cluster sizes, ScyllaDB achieves an average throughput of 587 kOps/s and MongoDB achieves 48 kOps/s. Scalability Results for MongoDB vs ScyllaDB These results also enable us to compare the theoretical throughput scalability with the actually achieved throughput scalability. For this, we consider a simplified scalability model that focuses on compute resources. It assumes the scalability factor is reflected by the increased compute capacity from the small to medium cluster size. For ScyllaDB, this means we double the cluster size from 3 to 6 nodes and also double the instance size from 8 cores to 16 cores per instance, resulting in a theoretical scalability of 400%. For MongoDB, we move from one replica set of three data nodes to a cluster with three shards and nine data nodes and increase the instance size from 8 cores to 16 cores, resulting in a theoretical scalability factor of 600%. The ScyllaDB scalability results for the uniform and hotspot distributions both show that ScyllaDB is close to achieving linear scalability by achieving a throughput scalability of 386% (of the theoretically possible 400%). With MongoDB, the gap between theoretical throughput scalability and the actually achieved throughput scalability is significantly higher. For the uniform distribution, MongoDB achieves a scaling factor of 420% (of the theoretically possible 600%). For the hotspot distribution, we measure 342% (of the theoretically possible 600%). Throughput per Cost Ratio In order to compare the costs/month in relation to the provided throughput, we take the MongoDB Atlas throughput/$ as baseline (i.e. 100%) and compare it with the provided ScyllaDB Cloud throughput/$. The results for the uniform distribution show that ScyllaDB provides five times more operations/$ compared to MongoDB Atlas for the small scaling size and 5.7 times more operations/$ for the medium scaling size. For the hotspot distribution, the results show an even better throughput/cost ratio for ScyllaDB, providing 9 times more operations/$ for the small scaling size and 12.7 times more for the medium scaling size. Latency Results for MongoDB vs ScyllaDB For the uniform distribution, ScyllaDB provides stable and low P99 latencies for the read and update operations for the scaling sizes small and medium. MongoDB generally has higher P99 latencies. Here, the read latencies are 2.8 times higher for the small scaling size and 5.5 times higher for the medium scaling size. The update latencies show an even more distinct difference; MongoDB’s P99 update latency in the small scaling size is 47 times higher compared to ScyllaDB and 12 times higher in the medium scaling size. For the hotspot distribution, the results show a similar trend for the stable and low ScyllaDB latencies. For MongoDB, read and update latencies increase from the small to medium scaling size. It is interesting that in contrast to the uniform distribution, the read latency only increases by a factor of 2.8 while the update latency increases by 970%. Technical Nugget – Performance Impact of the Data Model The default YCSB data model is composed of a primary key and a data item with 10 fields of strings that results in a document with 10 attributes for MongoDB and a table with 10 columns for ScyllaDB. We analyze how performance changes if a pure key-value data model is applied for both databases: a table with only one column for ScyllaDB and a document with only one field for MongoDB The results show that for ScyllaDB the throughput improves by 24% while for MongoDB the throughput increase is only 5%.   Technical Nugget – Performance Impact of the Consistency Level All standard benchmarks are run with the MongoDB client consistency writeConcern=majority/readPreference=primary and for ScyllaDB with writeConsistney=QUORUM/readConsistency=QUORUM. Besides these client consistent configurations, we also analyze the performance impact of weaker read consistency settings. For this, we enable MongoDB to read from the secondaries (readPreference=secondarypreferred) and set readConsistency=ONE for ScyllaDB. The results show an expected increase in throughput: for ScyllaDB 56% and for MongoDB 49%. Continue Comparing ScyllaDB vs MongoDB Here are some additional resources for learning about the differences between MongoDB and ScyllaDB: Benchmarking  MongoDB vs ScyllaDB: Results from benchANT’s complete benchmarking study that comprises 133 performance and scalability measurements that compare MongoDB against ScyllaDB. Benchmarking MongoDB vs ScyllaDB: Caching Workload Deep Dive: benchANT’s comparison of  ScyllaDB vs MongoDB in terms of throughput, latency, scalability, and cost for a caching workload (50% read operations and 50% update operations). Benchmarking MongoDB vs ScyllaDB: IoT Sensor Workload Deep Dive: benchANT’s comparison of  ScyllaDB vs MongoDB in terms of throughput, latency, scalability, and cost for a workload simulating an IoT sensor (90% insert operations and 10% read operations). A Technical Comparison of MongoDB vs ScyllaDB: benchANT’s technical analysis of how MongoDB and ScyllaDB compare with respect to their features, architectures, performance, and scalability. ScyllaDB’s MongoDB vs ScyllaDB page: Features perspectives from users – like Discord – who have moved from MongoDB to ScyllaDB.

Powering AI Workloads with Intelligent Data Infrastructure and Open Source

In the rapidly evolving technological landscape, artificial intelligence (AI) is emerging as a driving force behind innovation and efficiency. However, to harness its full potential, enterprises need suitable data infrastructures that can support AI workloads effectively. 

This blog explores how intelligent data infrastructure, combined with open source technologies, is revolutionizing AI applications across various business functions. It outlines the benefits of leveraging existing infrastructure and highlights key open source databases that are indispensable for powering AI. 

The Power of Open Source in AI Solutions 

Open source technologies have long been celebrated for their flexibility, community support, and cost-efficiency. In the realm of AI these advantages are magnified. Here’s why open source is indispensable for AI-fueled solutions: 

  1. Cost Efficiency: Open source solutions eliminate licensing fees, making them an attractive option for businesses looking to optimize their budgets.
  2. Community Support: A vibrant community of developers constantly improves these platforms, ensuring they remain cutting-edge.
  3. Flexibility and Customization: Open source tools can be tailored to meet specific needs, allowing enterprises to build solutions that align perfectly with their goals. 
  4. Transparency and Security: With open source, you have visibility into the code, which allows for better security audits and trustworthiness. 

Vector Databases: A Key Component for AI Workloads 

Vector databases are increasingly indispensable for AI workloads. They store data in high-dimensional vectors, which AI models use to understand patterns and relationships. This capability is crucial for applications involving natural language processing, image recognition, and recommendation systems. 

Vector databases use embedding vectors (lists of numbers) to represent data similarities and plot relationships spatially. For example, “plant” and “shrub” will have closer vector coordinates than “plant” and “car”. This allows enterprises to build their own LLMs, explore large text datasets, and enhance search capabilities. 

Vector databases and embeddings also support retrieval augmented generation (RAG), which improves LLM accuracy by refining its understanding of new information. For example, RAG can let users query documentation by creating embeddings from an enterprise’s documents, translating words into vectors, finding similar words in the documentation, and retrieving relevant information. This data is then provided to an LLM, enabling it to generate accurate text answers for users. 

The Role of Vector Databases in AI: 

  1. Efficient Data Handling: Vector databases excel at handling large volumes of data efficiently, which is essential for training and deploying AI models. 
  2. High Performance: They offer high-speed retrieval and processing of complex data types, ensuring AI applications run smoothly. 
  3. Scalability: With the ability to scale horizontally, vector databases can grow alongside your AI initiatives without compromising performance. 

Leveraging Existing Infrastructure for AI Workloads 

Contrary to popular belief, it isn’t necessary to invest in new and exotic specialized data layer solutions. Your existing infrastructure can often support AI workloads with a few strategic enhancements: 

  1. Evaluate Current Capabilities: Start by assessing your current data infrastructure to identify any gaps or areas for improvement. 
  2. Upgrade Where Necessary: Consider upgrading components such as storage, network speed, and computing power to meet the demands of AI workloads. 
  3. Integrate with AI Tools: Ensure your infrastructure is compatible with leading AI tools and platforms to facilitate seamless integration. 

Open Source Databases for Enterprise AI 

Several open source databases are particularly well-suited for enterprise AI applications. Let‘s look at the 3 free open source databases that enterprise teams can leverage as they scale their intelligent data infrastructure for storing those embedding vectors: 

PostgreSQL® and pgvector 

“The world’s most advanced open source relational database, PostgreSQL is also one of the most widely deployed, meaning that most enterprises will already have a strong foothold in the technology. The pgvector extension turns Postgres into a high-performance vector store, offering a path of least resistance for organizations familiar with PostgreSQL to quickly stand-up intelligent data infrastructure. 

From a RAG and LLM training perspective, pgvector excels at enabling distance-based embedding search, exact nearest neighbor search, and approximate nearest neighbor search. pgvector efficiently captures semantic similarities using L2 distance, inner product, and (the OpenAI-recommended) cosine distance. Teams can also harness OpenAI’s embeddings model (available as an API) to calculate embeddings for documentation and user queries. As an enterprise-ready open source option, pgvector is an already-proven solution for achieving efficient, accurate, and performant LLMs, helping equip teams to confidently launch differentiated and AI-fueled applications into production.

OpenSearch® 

Because OpenSearch is a mature search and analytics engine already popular with a wide swath of enterprises, new and current users will be glad to know that the open source solution is ready to up the pace of AI application development as a singular search, analytics, and vector database.  

OpenSearch has long offered low latency, high availability, and the scale to handle tens of billions of vectors while backing stable applications. It provides great nearest-neighbor search functionality to support vector, lexical, and hybrid search and analytics. These capabilities significantly simplify the implementation of AI solutions, from generative AI  agents to recommendation engines with trustworthy results and minimal hallucinations. 

Apache Cassandra® 5.0 with Native Vector Indexing

Known for its linear scalability and fault-tolerance on commodity hardware or cloud infrastructure, Apache Cassandra is a reliable choice for enterprise-grade AI applications. The newest version of the highly popular open source Apache Cassandra database introduces several new features built for AI workloads. It now includes Vector Search and Native Vector indexing capabilities.

Additionally, there is a new vector data type specifically for saving and retrieving embedding vectors, and new CQL functions for easily executing on those capabilities. By adding these features, Apache Cassandra 5.0 has emerged as an especially ideal database for intelligent data strategies and for enterprises rapidly building out AI applications across myriad use cases.

Cassandra’s earned reputation for delivering high availability and scalability now adds AI-specific functionality, making it one of the most enticing open source options for enterprises. 

Open Source Opens the Door to Successful AI Workloads 

Clearly, given the tremendously rapid pace at which AI technology is advancing, enterprises cannot afford to wait to build out differentiated AI applications. But in this pursuit, engaging with the wrong proprietary data-layer solutionsand suffering the pitfalls of vendor lock-in or simply mismatched featurescan easily be (and, for some, already is) a fatal setback. Instead, tapping into one of the very capable open source vector databases available will allow enterprises to put themselves in a more advantageous position. 

When leveraging open source databases for AI workloads, consider the following: 

  • Data Security: Ensure robust security measures are in place to protect sensitive data. 
  • Scalability: Plan for future growth by choosing solutions that can scale with your needs. 
  • Resource Allocation: Allocate sufficient resources, such as computing power and storage, to support AI applications. 
  • Governance and Compliance: Adhere to governance and compliance standards to ensure responsible use of AI. 

Conclusion 

Intelligent data infrastructure and open source technologies are revolutionizing the way enterprises approach AI workloads. By leveraging existing infrastructure and integrating powerful open source databases, organizations can unlock the full potential of AI, driving innovation and efficiency. 

Ready to take your AI initiatives to the next level? Leverage a single platform to help you design, deploy and monitor the infrastructure to support the capabilities of PostgreSQL with pgvector, OpenSearch, and Apache Cassandra 5.0 today.

And for more insights and expert guidance, don’t hesitate to contact us and speak with one of our open source experts! 

The post Powering AI Workloads with Intelligent Data Infrastructure and Open Source appeared first on Instaclustr.

ScyllaDB in Action Book Excerpt: ScyllaDB, a Distributed Database

How does ScyllaDB provide scalability and fault tolerance by distributing its data across multiple nodes? Read what Bo Ingram (Staff Engineer at Discord) has to say – in this excerpt from the book “ScyllaDB in Action.” Editor’s note: We’re honored to share the following excerpt from Bo Ingram’s informative – and fun! – new book on ScyllaDB: ScyllaDB in Action. You might have already experienced Bo’s expertise and engaging communication style in his blog How Discord Stores Trillions of Messages or ScyllaDB Summit talks How Discord Migrated Trillions of Messages from Cassandra to ScyllaDB and  So You’ve Lost Quorum: Lessons From Accidental Downtime  If not, you should 😉 You can purchase the full 370-page book from Manning.com. You can also access a 122-page early-release digital copy for free, compliments of ScyllaDB. The book excerpt includes a discount code for 45% off the complete book. Get the 122-page PDF for free The following is an excerpt from Chapter 1; it’s reprinted here with permission of the publisher. *** ScyllaDB runs multiple nodes, making it a distributed system. By spreading its data across its deployment, it uses that to achieve its desired availability and consistency, which, when combined, differentiates the database from other systems. All distributed systems have a bar to meet: they must deliver enough value to overcome the introduced complexity. ScyllaDB, designed to be a distributed system, achieves its scalability and fault tolerance through this design. When users write data to ScyllaDB, they start by contacting any node. Many systems follow a leader-follower topology, where one node is designated as a leader, giving it special responsibilities within the system. If the leader dies, a new leader is elected, and the system continues operating. ScyllaDB does not follow this model; each node is as special as any other. Without a centralized coordinator deciding who stores what, each node must know where any given piece of data should be stored. Internally, Scylla can map a given row to the node that owns it, forwarding requests to the appropriate nodes by calculating its owner using the hash ring that you’ll learn about in chapter 3. To provide fault tolerance, ScyllaDB not only distributes data but replicates it across multiple nodes. The database stores a row in multiple locations – the amount depends upon the configured replication factor. In a perfect world, each node acknowledges every request instantly every time, but what happens if it doesn’t? To help with unexpected trouble, the database provides tunable consistency. How you query data is dependent on what degree of consistency you’re looking to get. ScyllaDB is an eventually consistent database, and you perhaps will see inconsistent data as the system converges toward consistency. Developers must keep this eventual consistency in mind when working with the database. To facilitate the various needs of consistency, ScyllaDB provides a variety of consistency levels for queries, including those listed in table 1.1. With a consistency level of ALL, you can require that all replicas for a key acknowledge a query, but this setting harms availability. You can no longer tolerate the loss of a node. With a consistency level of ONE, you require a single replica for a key to acknowledge a query, but this greatly increases our chances of inconsistent results. Luckily, some options aren’t as extreme. ScyllaDB lets you tune consistency via the concept of quorums. A quorum is when a group has a majority of members. Legislative bodies, such as the US Senate, do not operate when the number of members present is below the quorum threshold. When applied to ScyllaDB, you can achieve intermediate forms of consistency. With a QUORUM consistency level, the database requires a majority of replicas for a key to acknowledge a query. If you have three replicas, two of them must accept every read and every write. If you lose one node, you can still rely on the other two to keep serving traffic. You additionally guarantee that a majority of your nodes get every update, preventing inconsistent data if you use the same consistency level when reading. Once you have picked your consistency level, you know how many replicas you need to execute a successful query. A client sends a request to a node, which serves as the coordinator for that query. Your coordinator node reaches out to the replicas for the given key, including itself if it is a replica. Those replicas return results to the coordinator, and the coordinator evaluates them according to our consistency. If it finds the result satisfies the consistency requirements, it returns the result to the caller. The CAP theorem (https://en.wikipedia.org/wiki/CAP_theorem) classifies distributed systems by saying that they cannot provide all three of these properties – consistency, availability, and network partition tolerance, as seen in figure 1.8. For the CAP theorem’s purposes, we define consistency as every request reading the most recent write; it’s a measure of correctness within the database. Availability is whether the system can serve requests, and network partition tolerance is the ability to handle a disconnected node. Figure 1.8 The CAP theorem says a database can only provide two of three properties — consistency, availability, and partition tolerance. ScyllaDB is classified as an AP system. According to the CAP theorem, a distributed system must have partition tolerance, so it ultimately chooses between consistency and availability. If a system is consistent, it must be impossible to read inconsistent data. To achieve consistency, it must ensure that all nodes receive all necessary copies of data. This requirement means it cannot tolerate the loss of a node, therefore losing availability. TIP: In practice, systems aren’t as rigidly classified as the CAP theorem suggests. For a more nuanced discussion of these properties, you can research the PACELC theorem (https://en.wikipedia.org/wiki/PACELC_theorem), which illustrates how systems make partial tradeoffs between latency and consistency. ScyllaDB is typically classified as an AP system. When encountered with a network partition, it chooses to sacrifice consistency and maintain availability. You can see this in its design – ScyllaDB repeatedly makes choices, via quorums and eventual consistency, to keep the system up and running in exchange for potentially weaker consistency. By emphasizing availability, you see one of ScyllaDB’s differentiators against its most popular competition — relational databases.

Getting Started with DataStax Enterprise 6.9

DataStax recently announced the latest version of its widely-used enterprise grade distributed database, DataStax Enterprise 6.9 (DSE 6.9). With the battle-tested Apache Cassandra® at its core, DSE 6.9 gives you all of the stability that your organization has come to expect, while also delivering...

Getting Started with Database-Level Encryption at Rest in ScyllaDB Cloud

Learn about ScyllaDB database-level encryption with Customer-Managed Keys & see how to set up and manage encryption with a customer key — or delegate encryption to ScyllaDB ScyllaDB Cloud takes a proactive approach to ensuring the security of sensitive data: we provide database-level encryption in addition to the default EC2 storage-level encryption. With this added layer of protection, customer data is always protected against attacks. Customers can focus on their core operations, knowing that their critical business and customer assets are well-protected. Customers can either use a customer-managed key or let ScyllaDB Cloud manage a key for them. This article explains how ScyllaDB Cloud protects customer data. It focuses on the technical aspects of ScyllaDB database-level encryption with Customer-Managed Keys (CMK). Specifically, it includes a walkthrough of how to set up and manage encryption on ScyllaDB Cloud clusters with a customer key or how to delegate encryption to ScyllaDB. Storage-level encryption Encryption at rest is when data files are encrypted before being written to persistent storage. ScyllaDB Cloud always uses encrypted volumes to prevent data breaches caused by physical access to disks. Database-level encryption Database-level encryption is a technique for encrypting all data before it is stored in the database. 
 The ScyllaDB Cloud feature is based on the proven ScyllaDB Enterprise database-level encryption at rest, extended with the Customer Managed Keys (CMK) encryption control. This ensures that the data is securely stored – and the customer is the one holding the key. The keys are stored and protected separately from the database, substantially increasing security. ScyllaDB Cloud provides full database-level encryption using the Customer Managed Keys (CMK) concept. It is based on envelope encryption to encrypt the data and decrypt only when the data is needed. This is essential to protect the customer data at rest. Some industries, like healthcare or finance, have strict data security regulations. Encrypting all data helps businesses comply with these requirements, avoiding the need to prove that all tables holding sensitive personal data are covered by encryption. It also helps businesses protect their corporate data, which can be even more valuable. A key feature of CMK is that the customer has complete control of the encryption keys. Data encryption keys will be introduced later (it is confusing to cover them at  the beginning). The customer can: Revoke data access at any time Restore data access at any time Manage the master keys needed for decryption Log all access attempts to keys and data Customers can delegate all key management operations to the ScyllaDB Cloud support team if they prefer this. To do this, the customer can choose the ScyllaDB key when creating the cluster. To ensure customer data is secure and adheres to all privacy regulations. By default, encryption uses the symmetrical algorithm AES-128, a solid corporate encryption standard covering all practical applications. Breaking AES-128 can take an immense amount of time, approximately trillions of years. The strength can be increased to to AES-256. Note: Database-level encryption in ScyllaDB Cloud is available for all clusters deployed in Amazon Web Services. Support for other cloud services, like Google Cloud KMS, will come later this year. Encryption To ensure all user data is protected, ScyllaDB will encrypt: All user tables Commit logs Batch logs Hinted handoff data This ensures all customer data is properly encrypted. The first step of the encryption process is to encrypt every record with a data encryption key (DEK). Once the data is encrypted with the DEK, it is sent to AWS KMS, where the master key (MK) resides. The DEK is then encrypted with the master key (MK), producing an encrypted DEK (EDEK or a wrapped key). The master key remains in the KMS, while the EDEK is returned and stored with the data. The DEK used to encrypt the data is destroyed to ensure data protection. A new DEK will be generated the next time new data needs to be encrypted. Decryption Because the original non-encrypted DEK is destroyed when the EDEK was produced, the data cannot be decrypted. The EDEK cannot be used to decrypt the data directly because the DEK key is encrypted. It has to be decrypted, and for that, the master key will be required again. This can only be decrypted with the master key(MK) in the KMS. Once the DEK is unwrapped, the data can be decrypted. As you can see, the data cannot be decrypted without the master key – which is protected at all times in the KMS and cannot be “copied” outside of KMS. By revoking the master key, the customer can disable access to the data independently from the database or application authorization. Multi-region deployment Adding new data centers to the ScyllaDB cluster will create additional local keys in those regions. All master keys support multi-regions, and a copy of each key resides locally in each region – ensuring those multi-regional setups are protected from regional outages for the cloud provider and against disaster. The keys are available in the same region as the data center and can be controlled independently. In case you use a Customer Key – AWS KMS will charge $1/month for each cluster prorated per hour. Each additional DC creates a replica that is counted as an additional key. There is an additional cost per key request. ScyllaDB Enterprise utilizes those requests efficiently, resulting in an estimated monthly cost of up to $1 for a 9-node cluster. Managing encryption keys adds another layer of administrative work in addition to the extra cost. ScyllaDB Cloud offers database clusters that can be encrypted using keys managed by ScyllaDB support. They provide the same level of protection, but our support team helps you manage the master keys. The ScyllaDB keys are applied by default and are free to our customers. Creating a Cluster with Database-Level Encryption Creating a cluster with database-level encryption requires: A ScyllaDB Cloud account – If you don’t have one, you can create a ScyllaDB Cloud account here. 10 minutes with ScyllaDB Key 20 minutes creating your own key To create a cluster with database-level encryption enabled, we will need a master key. We can either create a customer-managed key using ScyllaDB Cloud UI or skip this step completely and use a ScyllaDB Managed Key, which will skip the next six steps. In both cases, all the data will be protected by strong encryption at the database level. Create a Customer Managed Key After logging into the ScyllaDB Cloud portal, select the “Security” option from the user menu. Then, we can use the “Add Key” option. The Add Key menu will be presented. It allows selection of cloud providers and regions. Currently, only the region can be selected. Select the region where you plan to deploy your cluster. For multi-DC setups, choose the cluster where your first cluster will be. For my cluster, it will be US East. Click the “Set Key.” The key is successfully configured, but it also has to be created in AWS on my behalf so I can have full control over it. This is done using CloudFormation Stack, which allows you to execute a cloud stack with your AWS permissions that will provision the key directly in AWS KMS. Click Launch Stack in the summary pop-up to open the CloudFormation Stack in a new tab and provision your key in AWS. The AWS page will request a login and “StackPrincipal.” The role of the “StackPrincipal” has to be provided in ARN format. Assuming your user has enough permissions, the following cloudshell command will return the ARN as the last line of the response. [CloudShell-user@ip-10-136-55-116 ~]$ aws sts get-caller-identity { "UserId": "AR**********************************domain.com", "Account": "023***********", "Arn": "arn:aws:sts::023**************************domain.com" } Once you successfully execute the stack, the key should be created, and ScyllaDB will get permission to use it with a cluster of your choice. In ScyllaDB, you will see the key name and a green “available,” meaning the key is ready and can be assigned. You can still manage the key. However, if you disable it, it can no longer encrypt or decrypt the data. The next step is to create a cluster with the new key. Choose the Dedicated VM or the Free Trial option from the New Cluster menu. Then, select Customer Key, and you can select the key we created. The key will show only if the cluster and key regions match. Use ScyllaDB Managed Key If you prefer to use ScyllaDB Managed Key, skip all the above steps and choose ScyllaDB Key instead. This is an easier option that encrypts the data on the database level with the same encryption but the key will be managed by ScyllaDB Cloud agents. Create an Encrypted Cluster Once the master key is chosen, click Next and wait a few minutes until the cluster is created with the selected encryption option. You can see the indicator for encryption at rest will be enabled. That’s it! Your cluster is now using database-level encryption with the selected master key. Transparent database-level encryption in ScyllaDB Cloud significantly boosts the security of your ScyllaDB clusters and backups.   Next Steps Start using this feature in ScyllaDB Cloud. Get your questions answered in our community forum and Slack channel. Or, use our contact form.

How Does Data Modeling Change in Apache Cassandra® 5.0 With Storage-Attached Indexes?

Data modeling in Apache Cassandra® is an important topic—how you model your data can affect your cluster’s performance, costs, etc. Today I’ll be looking at a new feature in Cassandra 5.0 called Storage -Attached Indexes (SAI), and how they affect the way you model data in Cassandra databases. 

First, I’ll briefly cover what SAIs are (for more information about SAIs, check out this post). Then I’ll look at 3 use cases where your modeling strategy could change with SAI. Finally, I’ll talk about benefits and constraints of SAIs. and constraints of SAIs. 

What Are StorageAttached Indexes? 

From the Cassandra 5.0 Documentation, Storage Attached Indexes (SAIs) “[provide] an indexing mechanism that is closely integrated with the Cassandra storage engine to make data modeling easier for users. Secondary Indexing, which is indexing values on properties that are not part of the Primary Key for that table, has been available for Cassandra in the past (called SASI and 2i). However, SAIs will replace the existing functionality, as it will be deprecated in 5.0, and then tentatively removed in Cassandra 6.0. 

This is because SAIs improve upon the older methods in a lot of key ways. For one, according to the devs, SAIs are the fastest indexing method for Cassandra clusters. This performance boost was a plus for using indexing in production environments. It also lowered the data storage overhead over prior implementations, which lowers costs by reducing the need for database storage, which induces operational costs, and by reducing latency when dealing with indexes, lowering a loss of user interaction due to high latency. 

How Do SAIs work? 

SAIs are implemented as part of the SSTables, or Sorted String Tables, of a Cassandra database. This is because SAIs index Memtables and SSTables as they are written. It filters from both in-memory and on-disk sources, filtering them out into a series of indexed columns at read time. I’m not going to go into too much detail here because there are a lot of existing resources on this exciting topic: see the Cassandra 5.0 Documentation and the Instaclustr site for examples. 

The main thing to keep in mind is that SAIs are attached to Cassandra’s storage engine, and it’s much more performant from speed, scalability, and data storage angles as a result. This means that you can use indexing reliably in production beginning with Cassandra 5.0, which allows data modeling to be improved very quickly. 

To learn more about how SAIs work, check out this piece from the Apache Cassandra blog. 

What Is SAI For? 

SAI is a filtering engine, and while it does have some functionality overlap with search engines, it directly says it is not an enterprise search engine” (source). 

SAI is meant for creating filters on non-primary-key or composite partition keys (source), essentially meaning that you can enable a ‘WHERE’ clause on any column in your Cassandra 5.0 database. This makes queries a lot more flexible without sacrificing latency or storage space as with prior methods.  

How Can We Use SAI When Data Modeling in Cassandra 5.0? 

Because of the increased scalability and performance of SAIs, data modeling in Cassandra 5.0 will most definitely change 

You will be able to search collections more thoroughly and easily, for instance, indexing is more of an option when designing your Cassandra queries. This will also allow new query types, which can improve your existing querieswhich by Cassandra’s design paradigm changes your table design. 

But what if you’re not on a greenfield project and want to use SAIs? No problem! SAI is backwards-compatible, and you can migrate your application one index at a time if you need. 

How Do StorageAttached Indexes Affect Data Modeling in Apache Cassandra 5.0? 

Cassandra’s SAI was designed with data modeling in mind (source). It unlocks new query patterns that make data modeling easier in quite a few cases. In the Cassandra team’s words: “You can create a table that is most natural for you, write to just that table, and query it any way you want.” (source) 

I think another great way to look at how SAIs affect data modeling is by looking at some queries that could be asked of SAI data. This is because Cassandra data modeling relies heavily on the queries that will be used to retrieve the data. I’ll take a look at 2 use cases: indexing as a means of searching a collection in a row and indexing to manage a one-to-many relationship. 

Use Case: Querying on Values of Non-Primary-Key Columns 

You may find you’re searching for records with a particular value in a particular column often in a table. An example may be a search form for a large email inbox with lots of filters. You could find yourself looking at a record like: 

  • Subject 
  • Sender 
  • Receiver 
  • Body 
  • Time sent 

Your table creation may look like: 

CREATE KEYSPACE IF NOT EXISTS inbox 

WITH REPLICATION = { 

  'class' : 'SimpleStrategy', 

  'replication_factor' : 3 

}; 

CREATE TABLE IF NOT EXISTS emails ( 

  id int,  

  sender text,  

  receivers text,  

  subject text,  

  body text, 

  timeSent timestamp, 

  PRIMARY KEY (id)); 

};

If you allow users to search for a particular subject or sender, and the data set is large, not having SAIs could make query times painful: 

SELECT * FROM emails WHERE emails.sender == “sam.example@example.com”

To fix this problem, we can create secondary indexes on our sender, receiver, and body fields: 

CREATE CUSTOM INDEX sender_sai_idx ON Inbox.emails (sender)  

USING 'StorageAttachedIndex'  

WITH OPTIONS = {'case_sensitive': 'false', 'normalize': 'true', 'ascii': 'true'}; 

CREATE INDEX IF NOT EXISTS receiver_sai_idx on Inbox.emails (receiver) 

USING 'StorageAttachedIndex'  

WITH OPTIONS = {'case_sensitive': 'false', 'normalize': 'true', 'ascii': 'true'}; 

CREATE CUSTOM INDEX body_sai_idx ON Inbox.emails (body)  

USING 'StorageAttachedIndex'  

WITH OPTIONS = {'case_sensitive': 'false', 'normalize': 'true', 'ascii': 'true'}; 

CREATE CUSTOM INDEX subject_sai_idx ON Inbox.emails (subject)  

USING 'StorageAttachedIndex'  

WITH OPTIONS = {'case_sensitive': 'false', 'normalize': 'true', 'ascii': 'true'};

Once you’ve established the indexes, you can run the same query and it will automatically use the SAI index to find all emails with a sender of “sam.example@examplemail.com OR by subject match/body match.  Note that although the data model changed with the inclusion of the indexes, the SELECT query does not change, and the fields of the table stayed the same as well! 

Use Case: Managing One-To-Many Relationships 

Going back to the previous example, one email could have many recipients. Prior to secondary indexes, you would need to scan every row in the collection of every row in the table in order to query on recipients. This could be solved in a few ways. One is to create a join table for recipients that contains an id, email id, and recipient. This becomes complicated when the constraint that each email should only appear once per email is added. With SAI, we now have an index-based solutioncreate an index on a collection of recipients for each row. 

The script to create the table and indices changes a bit: 

id int,  

  sender text,  

  receivers set<text>,  

  subject text,  

  body text, 

  timeSent timestamp, 

  PRIMARY KEY (id)); 

};

The text type of receivers changes to a set<text>. A set is used because each email should only occur once. This takes the logic you would have had to implement for the join table solution and moves it to Cassandra.  

The indexing code remains mostly the same, except for the creation of the index for receivers: 

CREATE INDEX IF NOT EXISTS receivers_sai_idx on Inbox.emails (receivers)

That’s it! One line of CQL and there’s now an index on receivers. We can query for emails with a particular receiver: 

SELECT * FROM emails WHERE emails.receievers CONTAINS “sam.example@examplemail.com”

There are many one-to-many relationships that can be simplified in Cassandra with the use of secondary indexes and SAI. 

What Are the Benefits of Data Modeling With Storage Attached Indexes? 

There are many benefits to using SAI when data modeling in Cassandra 5.0: 

  • Query performance: because of SAI’s implementation, it has much faster query speeds than previous implementations, and indexed data is generally faster to search than unindexed data. This means you have more flexibility to search within your data and write queries that search non-primary-key columns and collections. 
  • Move over piecemeal: SAI’s backwards compatibility, coupled with how little your table structure has to change to add SAIs, means you can move over your data models piece by piece, meaning moving is easier.  
  • Data storage overhead: SAI has much lower data overhead than previous secondary index implementations, meaning more flexibility in what you can store in your data models without impacting overall storage needs. 
  • More complex queries/features: SAI allows you to write much more thorough queries when looking through SAIs, and offers up a lot of new functionality, like: 
    • Vector Search 
    • Numeric Range queries 
    • AND queries within indexes 
    • Support for map/set/ 

What Are the Constraints of StorageAttached Indexes? 

While there are benefits to SAI, there are also a few constraints, including: 

  • Because SAI is attached to the SSTable mechanism, the performance of queries on indexed columns will be “highly dependent on the compaction strategy in use” (per the Cassandra 5.0 CEP-7) 
  • SAI is not designed for unlimited-size data sets, such as logs; indexing on a dataset like that would cause performance issues. The reason for this is read latency at higher row counts spread across a cluster. It is also related to consistency level (CL), as the higher the CL is, the more nodes you’ll have to ping on larger datasets. (Source). 
  • Query complexity: while you can query as many indexes as you like, when you do so, you incur a cost related to the number of index values processed. Be aware when designing queries to select from as few indexes as necessary. 
  • You cannot index multiple columns in one index, as there is a 1-to-1 mapping of an SAI index to a column. You can however create separate indexes and query them simultaneously. 

This is a v1, and some features, like the LIKE comparison for strings, the OR operator, and global sorting are all slated for v2. 

Disk usage: SAI uses an extra 20-35% disk space over unindexed data; note that over previous versions of indexing, it consumes much less (source). You shouldn’t just make every column an index if you don’t need to, saving disk space and maintaining query performance. 

Conclusion 

SAI is a very robust solution for secondary indexes, and their addition to Cassandra 5.0 opens the door for several new data modelling strategiesfrom searching non-primary-key columns, to managing one-to-many relationships, to vector search. To learn more about SAI, read this post from the Instaclustr by NetApp blog, or check out the documentation for Cassandra 5.0. 

If you’d like to test SAI without setting up and configuring Cassandra yourself, Instaclustr has a free trial and you can spin up Cassandra 5.0 clusters today through a public preview! Instaclustr also offers a bunch of educational content about Cassandra 5.0. 

 

The post How Does Data Modeling Change in Apache Cassandra® 5.0 With Storage-Attached Indexes? appeared first on Instaclustr.

Cassandra Lucene Index: Update

**An important update regarding support of Cassandra Lucene Index for Apache Cassandra® 5.0 and the retirement of Apache Lucene Add-On on the Instaclustr Managed Platform.** 

Instaclustr by NetApp has been maintaining the new fork of the Cassandra Lucene Index plug-in since its announcement in 2018After extensive evaluation, we have decided not to upgrade the Cassandra Lucene Index to support Apache Cassandra® 5.0. This decision aligns with the evolving needs of the Cassandra community and the capabilities offered by the StorageAttached Indexing (SAI) in Cassandra 5.0.  

SAI introduces significant improvements in secondary indexing, while simplifying data modeling and creating new use cases in Cassandra, such as Vector Search. While SAI is not a direct replacement for the Cassandra Lucene Index, it offers a more efficient alternative for many indexing needs.  

For applications requiring advanced indexing features, such as full-text search or geospatial queries, users can consider external integrations, such as OpenSearch®, that offer numerous full-text search and advanced analysis features. 

We are committed to maintaining the Cassandra Lucene Index for currently supported and newer versions of Apache Cassandra 4 (including minor and patch-level versions) for users who rely on its advanced search capabilities. We will continue to release bug fixes and provide necessary security patches for the supported versions in the public repository. 

Retiring Apache Lucene™ Add-On for Instaclustr for Apache Cassandra 

Similarly, Instaclustr is commencing the retirement process of the Apache Lucene add-on on its Instaclustr Managed Platform. The offering will move to the Closed state on July 31, 2024. This means that the add-on will no longer be available for new customers.  

However, it will continue to be fully supported for existing customers with no restrictions on SLAs, and new deployments will be permitted by exception. Existing customers should be aware that the add-on will not be supported for Cassandra 5.0. For more details about our lifecycle policies, please visit our website here.  

Instaclustr will work with the existing customers to ensure a smooth transition during this period. Support and documentation will remain in place for our customers running the Lucene addon on their clusters.  

For those transitioning to, or already using the Cassandra 5.0 beta version, we recommend exploring how Storage-Attached Indexing can help you with your indexing needs. You can try the SAI feature as part of the free trial on the Instaclustr Managed Platform.  

We thank you for your understanding and support as we continue to adapt and respond to the community’s needs. 

If you have any questions about this announcement, please contact us at support@instaclustr.com. 

The post Cassandra Lucene Index: Update appeared first on Instaclustr.