A New Chapter: Introducing Our New Brand

This week was a huge milestone for DataStax. We announced the expansion of our mandate to unlock real-time AI at scale for all organizations. After more than a decade building applications used by millions with Apache Cassandra®, we’ve identified what makes the best applications. It’s not only...

Introducing ScyllaDB Enterprise 2022.2

ScyllaDB Enterprise 2022.2 is here! It offers new features, plus over 100 stability and performance fixes for our popular NoSQL database. ScyllaDB Enterprise is available as a standalone self-hosted product, and serves as the engine used within our fully-managed Database-as-a-Service, ScyllaDB Cloud.

New Release Cycle for Enterprise Customers

ScyllaDB Enterprise 2022.1 was introduced last year for stand-alone self-hosted use, whether on the public cloud or on premises. ScyllaDB Enterprise also serves as the engine at the heart of ScyllaDB Cloud, our fully-managed Database-as-a-Service (DBaaS). As we promised, we have now delivered a feature-based release to keep ScyllaDB Enterprise more closely in sync with the release cadence of our ScyllaDB Open Source project. Thus, ScyllaDB Enterprise 2022.2 mirrors the production-ready features available in ScyllaDB Open Source 5.1.

Alternator (DynamoDB Compatible) TTL

In ScyllaDB Open Source 5.0, we introduced Time To Live (TTL) to the DynamoDB compatible API (Alternator) as an experimental feature. In ScyllaDB Enterprise 2022.2, we promote it to production ready.

Like in DynamoDB, Alternator items that are set to expire at a specific time will not disappear precisely at that time, but only after some delay. DynamoDB guarantees that the expiration delay will be less than 48 hours (though for small tables, the delay is often much shorter). In Alternator, the expiration delay is configurable — it defaults to 24 hours, but can be set with the --alternator-ttl-period-in-seconds configuration option.
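To get a feel for why a longer period means a longer worst-case delay, here is a minimal Python sketch. It assumes a simplified model in which a background scan runs once per period and removes whatever has already expired; the function name and the fixed scan schedule are illustrative assumptions, not Alternator's actual implementation.

```python
import math

DAY = 24 * 60 * 60  # seconds


def deletion_time(expires_at: float, scan_period: float) -> float:
    """Time of the first periodic scan at or after the item's expiration.

    Assumes scans run at t = scan_period, 2 * scan_period, ... (a simplification).
    """
    scans_needed = max(1, math.ceil(expires_at / scan_period))
    return scans_needed * scan_period


expires_at = 10 * 60  # the item is set to expire 10 minutes after time zero

# How long the expired item lingers with a daily scan versus an hourly scan.
delay_daily = deletion_time(expires_at, scan_period=DAY) - expires_at
delay_hourly = deletion_time(expires_at, scan_period=60 * 60) - expires_at
print(delay_daily, delay_hourly)
```

In this toy model the item outlives its expiration time by up to one full scan period, so a shorter period trades extra background work for prompter removal.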

Rate Limit Per Partition

You can now set per-partition rate limits on reads and writes per second. Consider the following CQL example:

CREATE TABLE tab ( ... )
WITH per_partition_rate_limit = {
   'max_writes_per_second': 150,
   'max_reads_per_second': 400
};

You can set different rates for writes and reads per partition. Queries exceeding these limits will be rejected. This helps the database avoid hot partition problems or mitigate issues external to the database, such as spam bots. This feature pairs well with ScyllaDB’s shard-aware drivers because rejected requests will have the least cost. You can read more about this feature in the ScyllaDB documentation and in the feature design note on GitHub.
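The general idea of a per-partition limit can be sketched in a few lines of Python. This is only an illustration of the concept using a fixed one-second counting window; the class and method names are hypothetical, and ScyllaDB's real implementation is different and runs inside the database.

```python
from collections import defaultdict


class PerPartitionRateLimiter:
    """Toy limiter: at most `max_ops_per_second` operations per partition per second."""

    def __init__(self, max_ops_per_second: int):
        self.max_ops_per_second = max_ops_per_second
        # (partition_key, whole second) -> operations seen in that second.
        # Old windows are never evicted here; a real limiter would bound this.
        self._counts = defaultdict(int)

    def allow(self, partition_key: str, now: float) -> bool:
        window = (partition_key, int(now))
        if self._counts[window] >= self.max_ops_per_second:
            return False  # over the limit: reject instead of queueing
        self._counts[window] += 1
        return True


writes = PerPartitionRateLimiter(max_ops_per_second=150)

# 200 writes hit the same partition within a single second.
accepted = sum(writes.allow("hot-key", now=12.0 + i / 1000) for i in range(200))
print(accepted, "of 200 writes accepted for hot-key")
```

Other partitions are unaffected by the hot one, which is the point of limiting per partition rather than per table.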

Load and Stream

Historically when restoring from backup, you needed to place the restored SSTables on the same number of nodes as the original cluster. But sometimes you may find the cluster topology has changed radically by adding or removing nodes from the topology used when the backup was made. If that’s the case, then the new token distribution will be mismatched to the SSTables. With ScyllaDB’s Load and Stream feature, you don’t need to worry about the details of what the cluster topology was when you made your backup. You can just place the SSTables on one of the nodes of the current cluster, run nodetool refresh, and the system will automatically determine how to reshard and rebalance the partitions across the cluster, streaming data to the new owning nodes.
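Conceptually, the node that receives the SSTables has to work out which node currently owns each partition and stream it there. The Python sketch below illustrates that routing step with a toy token ring. The hash function, the one-token-per-node ring and all names are assumptions made for illustration; replication is ignored, and this is not how ScyllaDB's partitioner or streaming code is written.

```python
import bisect
import hashlib


def token(key: str) -> int:
    # Stand-in hash for illustration only; real clusters use their own partitioner.
    return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")


def build_ring(nodes):
    # One token per node keeps the sketch short.
    return sorted((token(f"node:{name}"), name) for name in nodes)


def owner(ring, partition_key: str) -> str:
    tokens = [t for t, _ in ring]
    # First node whose token is >= the key's token, wrapping around the ring.
    index = bisect.bisect_left(tokens, token(partition_key)) % len(ring)
    return ring[index][1]


def plan_streams(ring, sstable_keys):
    """Group the partitions found in restored SSTables by their current owner."""
    plan = {}
    for key in sstable_keys:
        plan.setdefault(owner(ring, key), []).append(key)
    return plan


# The backup came from some older topology; the current cluster has five nodes.
current_nodes = ["n1", "n2", "n3", "n4", "n5"]
restored_keys = [f"user-{i}" for i in range(20)]
plan = plan_streams(build_ring(current_nodes), restored_keys)
for node, keys in sorted(plan.items()):
    print(node, len(keys))
```

The key property is that the plan depends only on the current ring, so the topology at backup time no longer matters.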

Performance: Eliminate Exceptions from Read and Write Path

When a coordinator times out, it generates an exception which is then caught in a higher layer and converted to a protocol message. Since exceptions are slow, this can make a node that experiences timeouts become even slower. To prevent that, the coordinator write path and read path have been converted not to use exceptions for timeout cases, treating them as another kind of result value instead. Further work on the read path and on the replica reduces the cost of timeouts, so that goodput is preserved while a node is overloaded. This results in the following performance improvement:

You can read more about the commands to generate this exception elimination in the release notes.
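The shape of that change, a timeout returned as a value rather than thrown as an exception, can be shown with a small Python sketch. ScyllaDB itself is written in C++, where the cost difference is what matters; the types and function names below are hypothetical and only illustrate the control flow.

```python
from dataclasses import dataclass
from typing import Optional


class WriteTimeout(Exception):
    """Old style: a timeout travels up the stack as an exception."""


def write_raising(acks_received: int, acks_required: int) -> None:
    if acks_received < acks_required:
        raise WriteTimeout("write_timeout")


@dataclass(frozen=True)
class WriteResult:
    """New style: a timeout is just another kind of result value."""
    ok: bool
    error: Optional[str] = None


def write_returning(acks_received: int, acks_required: int) -> WriteResult:
    if acks_received < acks_required:
        return WriteResult(ok=False, error="write_timeout")
    return WriteResult(ok=True)


def to_protocol_message(result: WriteResult) -> str:
    # The upper layer inspects the value; nothing is thrown or caught.
    return "OK" if result.ok else f"ERROR {result.error}"


print(to_protocol_message(write_returning(acks_received=1, acks_required=2)))
print(to_protocol_message(write_returning(acks_received=2, acks_required=2)))
```

Both styles report the same failure to the client; the difference is that the second never unwinds the stack on the hot path.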

Prune Materialized Views

Another new feature is a CQL extension, PRUNE MATERIALIZED VIEW. This statement can be used to remove inconsistent rows, known as “ghost rows,” from materialized views.

ScyllaDB’s materialized views have been production-ready since ScyllaDB Open Source 3.0, and we continuously strive to make them even more robust and problem-free. A ghost row is an inconsistency issue which manifests itself by having rows in a materialized view which do not correspond to any base table rows. Such inconsistencies should be prevented altogether and ScyllaDB strives to avoid them. Yet if they happen, this statement can be used to restore a materialized view to a fully consistent state without rebuilding it from scratch.

Example usage:

PRUNE MATERIALIZED VIEW my_view WHERE token(v) > 7 AND token(v) < 1535250;
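To make the notion of a ghost row concrete, here is a toy Python model. It assumes a base table keyed by `pk` with a single value column `v`, and a view keyed by `(v, pk)`; the data and names are made up, and the real statement works on token ranges inside the database rather than on in-memory sets.

```python
# Base table: primary key -> value of the column the view is keyed on.
base_table = {1: "red", 2: "blue"}

# View rows as (v, pk) pairs. Two of them no longer match the base table.
view_rows = {
    ("red", 1),
    ("blue", 2),
    ("green", 2),  # stale: base row 2 now has v = "blue"
    ("red", 3),    # stale: base row 3 no longer exists
}


def find_ghost_rows(base, view):
    """A view row is a ghost if no base row with that key has that value."""
    return {(v, pk) for (v, pk) in view if base.get(pk) != v}


ghosts = find_ghost_rows(base_table, view_rows)
pruned_view = view_rows - ghosts
print(sorted(ghosts))
print(sorted(pruned_view))
```

Pruning removes only the rows that fail the check, which is why it can be cheaper than rebuilding the whole view.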

Getting ScyllaDB Enterprise 2022.2

If you are an existing ScyllaDB Enterprise customer, we encourage you to upgrade your clusters in coordination with our support team.

Note that since ScyllaDB Cloud is a fully-managed solution, all clusters will be automatically upgraded by our support team in due time.

Finally, if you are interested in ScyllaDB Enterprise, you can try out our 30 day trial offer. However, trials and proof of concepts (POCs) for enterprise software have the highest success rate when conducted in coordination with our team of experts. Before you start the trial timer ticking, make sure you contact us directly so we can help you achieve your goals.


ScyllaDB University’s Journey Featured at the OEB Conference in Berlin

Last month, I attended the OEB conference in Berlin, Germany. Its focus was on technology-supported learning and training. I was invited to give a talk about my experience and what I’ve learned from creating ScyllaDB University.

In this post, I’ll cover the gist of my talk. In a future post, I’ll share some of my experiences and what I learned at the conference.

You can discuss this blog post and ask me questions in the community forum.

© OEB Learning Technologies Europe GmbH used with permission

What We Learned from Building ScyllaDB University

My talk focused on our journey of creating ScyllaDB University, our online, free, self-paced learning and training center. When I joined ScyllaDB, back in 2018, we already had a great product, but the supporting material was lacking. Without proper guidance, the highly-technical product was hard to get started with.

As a start-up, we had limited capacity for face-to-face training. And while our users enjoyed the in-person training we could deliver, there was a massive pent-up demand for online training delivery. This initiative began even before the COVID-19 pandemic, but with the pandemic, demand for online training increased even more.

Our potential user base was spread around the world. Because we are an open-source-first company with a very technical solution, delivering in-person (or online, but live) training did not scale and was not feasible given the number of users we had. So I was tasked with building a new online self-paced training program from scratch.

We needed to do something different and create a solution that was free to users, affordable for the company to maintain, and that would allow us to deliver virtual, online, self-paced training at scale. The solution would have to be flexible enough to allow us to start off with something small and grow with the company and community.

There was a lot of complex technical content to write, and this is all very specialized content. Code that actually has to work. Technical terms that are specialized to the database industry, and to our product in particular.

I used my previous product management experience to work on the challenge, from the discovery phase of understanding our goals and user needs to the development stage of making sure that we built the right product. Our approach was lean and flexible.

Here are the criteria I had for the platform we needed:

  • Affordable — we would be able to scale the content and learning community without running into cost overruns
  • Flexible — we would be able to customize the learning platform for our content, things like hands-on labs that use code etc.
  • Facile — For us, it would be easy to get started and to build curriculum. For users, it would be easy to take courses and progress through content
  • Secure — User login credentialing, privately store student progress
  • Multimedia — Support textual content as well as presentation graphics and video
  • Progressive — Our users would take many small learning steps in a self-paced learning experience; each step should be remembered so users come back to where they last left off.
  • No vendor lock-in – easy and straightforward to switch to another LMS if we choose to do so in the future.

Some of the LMS solutions I evaluated offered a complete turn-key hosted product. Yet they were more expensive, and I didn’t like the vendor lock-in and the lack of flexibility.

After evaluating different solutions, we decided to adopt WordPress with the LearnDash LMS. Both are open source, which aligns with our company culture and values. They are well-supported, have a relatively high adoption rate, and offer a flexible and robust solution.

We moved fast. After a few weeks, we had an initial platform running with some basic content.

Starting out with something small, getting feedback, and then expanding based on that, proved itself to be an effective strategy.

Initially, ScyllaDB University only had one course and 3 lessons. Now it has multiple courses with 327 lessons that include videos, hands-on labs, and quizzes. Plus, there are learning paths based on technical roles, such as database administrators, application developers, and systems architects.

As an example of blended learning, we have also expanded to ScyllaDB University LIVE, a free, half-day online training event. It started as a one-off, but since we saw there was a lot of interest and demand, we decided to do it quarterly.

It has two tracks: one for beginners and one for advanced users. The sessions are live hosted presentations by our top engineers and experts, and usually include hands-on examples. Users can interact with instructors using a built-in chat feature of the Hopin event hosting platform we use for these special events.

After the event we encourage participants to view the slides and labs on ScyllaDB University where this content is made privately and exclusively available to them. They can do some hands-on labs and learn some more.

ScyllaDB University has been a success. Besides empowering a growing user community, it’s also a major source of data for the company. Because our software is open source, we don’t usually know who is using it or what they are doing with it. Knowing how users progress through ScyllaDB University, we have a better idea of where they are in their journey.

We have thousands of active users on ScyllaDB University, and we’re getting great feedback:

  • Rahul S. Gaikwad, FireEye (now at AWS), “Scylla University is a very good starting point to learn ScyllaDB. The courses are very informative and well explained. The best part is that it’s free and online which covers conceptual and hands-on knowledge.” 
  • Vinicius Neves, Mobiauto, “Scylla University attracted our collaborators for its easy and intuitive language!”
  • Tho Vo, Weatherford, “The courses are very well laid-out, easy to follow. Top grade materials.”
  • Felipe Santos, Dynasty Sports & Entertainment, “Scylla University is a great place to start if you are not familiar with NoSQL databases and planning to switch your current relational database to a very robust NoSQL solution. Moreover, its content helped me a lot on my first steps delving into NoSQL world and gave me many insights of how that would affect our product’s design.”
  • Satyam Mittal, Database Software Engineer, Grab, “The content at Scylla University is very good quality and quizzes at the end of the section encourage me to figure more things out and actually learn. It not only helped me in learning specific concepts about Scylla but also helped me in figuring out differences with other NoSQL databases. I will be looking at learning more courses from Scylla University in my free time.”

Another positive outcome is that for some search terms, ScyllaDB University lessons come up as the first result, and this is for broader database industry search results, not necessarily for terms related to ScyllaDB. This creates awareness for our product and brings potential users to the website.

In retrospect, I wouldn’t have done things very differently. Choosing an open-source LMS gave us a lot of freedom to develop what we want without any vendor lock-in. It was also more affordable. Creating a mix of LIVE events with on-demand learning proved popular and successful especially in strange times (COVID-19).

Our strategy was perfect for meeting our goals. Not waiting for the perfect solution, but rather starting with something small and developing from there, worked well for a fast growing start-up.

What’s Next?

The work is not done yet. We are constantly improving ScyllaDB University and have many things we’d like to do. For example:

  • Get better insights into what users are doing on the website, which lessons are more popular, and which content needs to be improved
  • Increase community engagement, in keeping with our open-source roots, by allowing trainees to interact with one another
  • Improve the platform to make it more engaging and fun
  • Add single sign-on across ScyllaDB University and the ScyllaDB Forums for a seamless experience

Our technology is constantly evolving. Since we have a lot of content now, one of the challenges is keeping it up to date, removing obsolete content, and adding content for new features.

Personally, I enjoyed my work in developing ScyllaDB University, and I’m getting great feedback about it, both from my internal peers, stakeholders and managers and from enthusiastic learners who really enjoyed their learning experience.

Prof. Dr. Anja Schmitz, of Pforzheim University, Germany, and myself. Prof. Schmitz was the moderator of the panel I took part in.

My talk was well received, the room was full, and I didn’t notice anyone leave as I was speaking. 😄

In a future post, I plan to share some of my experiences from the conference and what I learned.



5 Factors when Selecting a High Performance, Low Latency Database

How to Tell When a Database is Right for Your Project

When you are selecting databases for your latest use case (or replacing one that’s not meeting your current needs), the good news these days is that you have a lot of options to choose from. Of course, that’s also the bad news. You have a lot to sort through.

There are far more databases to consider and compare than ever before. In December 2012, at the end of the first year in which DB-Engines.com ranked databases, its list covered 73 systems (up significantly from the 18 it started with). As of December 2022, the list is just shy of 400 systems. This represents a Cambrian explosion of database technologies over the past decade. There is a vast sea of options to navigate: SQL, NoSQL, and “multi-model” databases that can combine SQL and NoSQL, or multiple NoSQL data models (two or more of document, key-value, wide column, graph and so on).

Further, users should not confuse outright popularity with fitness for their use case. While network effects definitely have advantages (“Can’t go wrong with X if everyone is using it”), they can also lead to groupthink, stifling innovation and competition.

In a recent webinar, my colleague Arthur Pesa and I walked through five factors to keep foremost in mind when shortlisting and comparing databases.


The Five Factors

Let’s get straight into the list.

  1. Software Architecture — Does the database use the most efficient data structures, flexible data models, and rich query languages to support your workloads and query patterns?
  2. Hardware Utilization — Can it take full advantage of modern hardware platforms? Or will you be leaving a significant amount of CPU cycles underutilized?
  3. Interoperability — How easy is it to integrate into your development environment? Does it support your programming languages, frameworks and projects? Was it designed to integrate into your microservices and event streaming architecture?
  4. RASP — Does it have the necessary qualities of Reliability, Availability, Scalability, Serviceability and, of course, Performance?
  5. Deployment — Does this database only work in a limited environment, such as only on-premises, or only in a single datacenter or a single cloud vendor? Or does it lend itself to being deployed exactly where and how you want around the globe?

Any such breakdown is subjective. You may have your own list of 4 factors, or 12 or 20 or 100 criteria. And, of course, each of these factors, such as software architecture, breaks down into subcategories, such as “storage engine,” “distributed processing architecture,” and even “query language.” But this is how I’d bucketize them into general categories.

Software Architecture

The critical consideration here is “does the database use the most efficient data structures, flexible data models and rich query languages to support your specific workloads and query patterns?”

Workload — Do you need to do a write-heavy or mixed read-write transactional workload? Or are you going to do a mostly-read analytical workload? Do you need to have a hybrid workload with a mix of transactions and analytics? Is that workload real-time, batched or a mix? Is it a steady stream of events per second, or are there smooth, regular intraday predictable rises and falls? Or maybe do you need to plan to deal with stochastic shocks of sudden bursts of traffic (for example, breaking news, or any other sudden popularity of a record)?

Data Model — Are you dealing with key-value pairs? Wide column stores (row-based “key-key-value” data)? A column store (columnar data)? Document? Graph? RDBMS (with tables and JOINs)? Or something else entirely? Do you really have the time and need to do fully normalized data, or will you be absorbing so much unstructured data so quickly that normalization is a fool’s errand, and you’d be better served with a denormalized data model to start with? There’s no singular “right” answer here. “It depends” should be embraced as your mantra.

Query Language — Here there is definitely more of a bias. While your data engineering team may be able to mask or hide the back-end query model, many of your users have their own biases and preferences. This is one of the main reasons why SQL remains such a lock-in. At the same time, new query languages are available. Some are SQL-like, such as the Cassandra Query Language (CQL) that is used by Cassandra and ScyllaDB. It looks passingly familiar to SQL users, but don’t be fooled: there are no table JOINs! Then there is a series of new-school query languages that may use, for example, JSON. This is how Amazon DynamoDB queries work. Here again, ScyllaDB supports such a JSON query model through our Alternator interface, which is compatible with DynamoDB. Regardless of which way you lean, query language should not be an afterthought in your consideration.

Transactions / Operations / CAP — Which is more important to you? Fully consistent ACID transactions? Or highly performant, highly available basic CRUD operations? The CAP theorem says you can have any two of three: consistency, availability or partition tolerance. Considering that distributed databases always need to be partition-tolerant, that leaves you with the choice between so-called “CP”-mode consistency-oriented systems, or “AP”-mode availability-oriented systems. And within these modes, there are implementation details to consider. For example, how you achieve strong consistency within a distributed system can vary widely. Consider even the choice of various consensus algorithms to ensure linearizability, like Paxos, Raft, Zookeeper (ZAB) and so on. Besides the different algorithms, each implementation can vary significantly from another.
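In databases that let you tune consistency per operation, a common rule of thumb is that a read is guaranteed to overlap with the latest acknowledged write when the number of replicas read (R) plus the number written (W) exceeds the replication factor (N). The Python sketch below works through that arithmetic; the function names are illustrative, and actual guarantees depend on each database's implementation.

```python
def quorum(replication_factor: int) -> int:
    """Smallest majority of the replicas."""
    return replication_factor // 2 + 1


def reads_see_latest_write(n: int, r: int, w: int) -> bool:
    """True when every read set must intersect every write set (R + W > N)."""
    return r + w > n


N = 3
print("quorum for N=3:", quorum(N))
print("R=QUORUM, W=QUORUM:", reads_see_latest_write(N, quorum(N), quorum(N)))
print("R=1, W=1:", reads_see_latest_write(N, 1, 1))
print("R=1, W=ALL:", reads_see_latest_write(N, 1, N))
```

Lower R and W favor availability and latency, while higher values favor consistency, which is the CP versus AP trade-off expressed as two small integers.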

Data Distribution — When you say “distributed system,” what do you mean exactly? Are we talking about a local, single-datacenter cluster? Or are we talking multi-datacenter clustering? How do cross-cluster updates occur? Is it considered all one logical cluster, or does it require inter-cluster syncs? How does it handle data localization and, for example, GDPR compliance?

Hardware Utilization

We are in the middle of an ongoing revolution in underlying hardware that continues to push the boundaries of software. A lot of software applications, and many databases in particular, are still rooted in decades-old origins, designs and presumptions.

CPU Utilization / Efficiency — A lot of software is said to be running poorly if CPU utilization goes up beyond, say, 40% or 50%. That means you are supposed to run that software inefficiently, leaving half of your box underutilized on a regular basis. In effect, you’re paying for twice the infrastructure you actually need (or more). So it behooves you to look at the way your system handles distributed processing.
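The cost of a low utilization ceiling is easy to put into numbers. The Python sketch below uses made-up figures: a workload that would keep 10 nodes fully busy, sized under different maximum utilization targets.

```python
import math


def nodes_needed(fully_busy_nodes: float, max_utilization: float) -> int:
    """Nodes to provision so that average utilization stays at or below the cap."""
    return math.ceil(fully_busy_nodes / max_utilization)


workload = 10  # hypothetical: the work would saturate 10 nodes at 100% CPU

for cap in (0.4, 0.5, 0.9):
    print(f"cap {cap:.0%}: {nodes_needed(workload, cap)} nodes")
```

With these assumed numbers, a 50% cap means provisioning 20 nodes for 10 nodes' worth of work, while a 90% cap needs 12.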

RAM Utilization / Efficiency — Is your database consistently memory-bound? Is its caching too aggressive, or too bloated (such as having multiple layers of caching), keeping unneeded data in memory? How does it optimize its read and write paths?

Storage Utilization / Efficiency — What storage format does your database use? Does it have compact mutable tables that may require heavyweight file locking mechanisms? Or does it use immutable tables that may produce fast writes, but come at a cost of space and read amplification? Does it allow for tiered storage? How does it handle concurrency? Are files stored row-wise (good for transactional use cases) or column-wise (good for analytics on highly repetitive data)? Note that there isn’t just one “right” answer. Each solution is optimizing for different use cases.

Network Utilization / Efficiency — Here you should think about the efficiency of both client-server communications and intra-cluster communications. Client/server models can be made more efficient with concurrency, connection pooling, and so on. Intra-cluster communications range from typical operational/transactional chatter (replicating data in an update or a write) to administrative tasks such as streaming and balancing data between nodes during a topology change.


Interoperability

No database is an island. How easy is it to integrate into your development environment? Does it support your programming languages, frameworks and projects? Was it designed to integrate into your microservices and event streaming architecture?

Programming Languages / Frameworks — Over and over you hear “We’re an X shop,” where X stands for your preferred programming language or framework. If your database doesn’t have the requisite client, SDK, library, ORM, and/or other packages to integrate it into that language, it might as well not exist. To be fair, the massive explosion of databases is concurrent to the massive explosion in programming languages. Yet it pays to look at programming language support for the client. Note that this is not the same as what language the database may be written in itself (which may factor into its software architecture and efficiency). This is purely about what languages you can write apps in to connect to that back end database.

Event Streaming / Message Queuing — Databases may be a single source of truth, but they are not the only systems running in your company. In fact, you may have different databases all transacting, analyzing and sharing different slices of your company’s overall data and information space. Event streaming is the increasingly common medium for modern enterprises to avoid data silos, and these days your database is only as good as its integration with real-time event streaming and message queuing technologies. Can your database act as both a sink and a source of data? Does it have Change Data Capture (CDC)? Does it connect to your favorite event streaming and message queuing technologies such as Apache Kafka, or Apache Pulsar or RabbitMQ?

APIs — To facilitate your database integration into your application and microservices architecture, does your database support one or more APIs, such as a RESTful interface, or GraphQL? Does it have an administrative API so you can programmatically provision it rather than do everything via a GUI? Using the GUI might seem convenient at first, until you have to manage and automate your deployment systems.

Other Integrations — What about CI/CD toolchains? Observability platforms? How about using your database as a pluggable storage engine or underlying element of a broader architecture? How well does it serve as infrastructure, or fit into the infrastructure you already use?


RASP

The RASP acronym goes back decades and is generally used in a hardware context. It stands for Reliability, Availability, Serviceability (or Scalability) and Performance. Basically these “-ilities” are “facilities”: things that make it easy to run your system. In a database, they determine how much manual intervention and “plate-spinning” you might need to perform to keep your system up and stable. They represent how much the database can take care of itself under general operating conditions, and even mitigate failure states as much as possible.

Reliability — How much tinkering do you need to put in to keep this thing from collapsing, or from data disappearing? Does your database have good durability capabilities? How survivable is it? What anti-entropy mechanisms does it include to get a cluster back in sync? How good are your backup systems? Even more important, how good are your restore systems? And are there operational guardrails to keep individuals from bringing the thing down with a single “Oops!”?

Availability — What does your database do when you have short term network partitions and transient node unavailability? What happens when a node fully fails? What if that network failure stretches out to more than a few hours?

Serviceability — These days the common buzzword is “observability,” which generally encompasses the three pillars of logging, tracing and metrics. Sure, your database needs to have observability built-in. Yet serviceability goes beyond that. How easy is it to perform upgrades without downtime? How pain-free are maintenance operations?

Scalability — Some databases are great to get started with. Then… you hit walls. Hard. Scalability means you don’t have to worry about hitting limits either in terms of total data under management, total operations per second, or geographic limits — such as going beyond a single datacenter to truly global deployability. Plus, there’s horizontal scalability — the scale out of adding more nodes to a cluster — as well as vertical scalability — putting your database on servers that have ever increasing numbers of CPUs, ever more RAM and more storage (refer back to the hardware section above).

Performance — Bottom line: if the database just can’t meet your latency or throughput SLAs, it’s just not going to fly in production. Plus, linked to scalability, many databases seem like they’ll meet your performance requirements at small scale or based on a static benchmark using test data but, when hit with real-world production workloads, just can’t keep up with the increasing frequency, variability and complexity of queries. So performance requires a strong correlation to linear scale.
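One reason small benchmarks mislead is that averages hide tail latency, and SLAs are usually written against percentiles. The Python sketch below uses made-up latency samples and a nearest-rank percentile to show a workload whose mean looks healthy while its P99 does not; the sample values and the 10 ms target are assumptions for illustration.

```python
def percentile(samples, p: int):
    """Nearest-rank percentile for an integer p between 1 and 100."""
    ordered = sorted(samples)
    rank = (p * len(ordered) + 99) // 100  # ceil(p * n / 100) in integer math
    return ordered[rank - 1]


# Hypothetical request latencies in milliseconds: mostly fast, with a slow tail.
latencies_ms = [2] * 98 + [250, 400]

mean_ms = sum(latencies_ms) / len(latencies_ms)
p99_ms = percentile(latencies_ms, 99)
sla_p99_ms = 10

print(f"mean = {mean_ms:.2f} ms, P99 = {p99_ms} ms")
print("meets P99 SLA:", p99_ms <= sla_p99_ms)
```

When you evaluate a database, compare the percentiles your SLA names under production-like load, not the average from a short test.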


Deployment

All of the above then needs to run where you need it to. Does this database only work in a limited environment, such as only on-premises, or only in a single datacenter or a single cloud vendor? Or does it lend itself to being deployed exactly where and how you want around the globe? Ask yourself these questions:

Lock-ins — Can this run on-premises? Or, conversely, is it limited to only on-premises deployments? Is it limited to only a certain public cloud vendor, or can this run in the cloud vendor of your choice? What are your hybrid cloud or multicloud deployment options?

Management / Control — Similarly, is this only available as a self-managed database, or can it be consumed as a fully-managed Database-as-a-Service (DBaaS)? The former allows teams full control of the system, and the latter relieves teams of administrative burden. Both have their tradeoffs. Can you select only one, or does the database allow users to switch between these two business models?

Automation and Orchestration — Does it have a Kubernetes Operator to support it in production? Terraform and Ansible scripts? While this is the last item in the list, rest assured, this should not be an afterthought in any production consideration.

So How Do You Evaluate Databases on This Rubric?

With this as a general rubric, you can watch the on-demand webinar to learn how ScyllaDB compares to such consideration criteria. ScyllaDB was architected from the ground up to take advantage of modern software paradigms and hardware architecture. ScyllaDB is based on the underlying Seastar framework, designed for building shard-per-core, shared nothing, highly asynchronous applications. There’s a lot going on under the hood, so I hope you watch the video to get the inside scoop!

Also, if you want to evaluate your current database against these criteria and discuss with us how your existing technology stacks up, feel free to contact us privately, or sign up for our user forums or Slack community to ask more questions and compare your results with your data engineering peers.


Bryan Cantrill on What’s Next for Infrastructure, Open Source & Rust

“As technologists, we live partially in the future: We are always making implicit bets based on our predictions of what the future will bring. To better understand our predictions of the future, it can be helpful to understand the past – and especially our past predictions of the future.”
– Bryan Cantrill at ScyllaDB Summit 2022

If you know Bryan Cantrill, you know that his mind works in mysterious ways to dare mighty things. So it shouldn’t surprise you that Cantrill’s take on the age-old practice of New Year’s predictions is a bit unexpected…and yields predictably perspicacious results.

For nearly a decade at Sun Microsystems (which his 10-year-old daughter suspected was a microbrewery) Cantrill and a dozen or so fellow infrastructure technologists made it a habit to cast their one-, three-, and six-year predictions. What did they get wildly wrong? Uncannily correct? And what does it all mean for the future? Let’s take a look.

Cantrill crafted this talk for ScyllaDB Summit, a virtual conference for exploring what’s needed to power instantaneous experiences with massive distributed datasets. You can watch his complete session below. Also, register now (free + virtual) to join us live for ScyllaDB Summit 2023 featuring experts from Discord, Hulu, Strava, ShareChat, Percona, ScyllaDB and more, plus industry leaders on the latest in WebAssembly, Rust, NoSQL, SQL, and event streaming trends.




Looking Back to Future Technology Predictions from 2000-2007

Here are some of the more notable one-, three- and six-year predictions that Cantrill & Co made during the early 2000s – exactly as they recorded them:

  • Six-year, 2000: “Most CPUs have four or more cores.”
  • Three-year, 2003: “Apple develops new ‘must-have’ gadget: iPhone. Digital camera/MP3 player/cell phone.”
  • Six-year, 2003: “Internet bandwidth grows to the point that TV broadcasters become largely irrelevant; former TV networks begin internet broadcasts.”
  • One-year, 2005: “Spam turns corner, less of a problem than year before.”
  • One-year, 2006: “Google embarrassed by revelation of unauthorized U.S. government spying at Gmail.”
  • Six-year, 2006: “Volume CPUs still less than 5 GHz.”

Many of these predictions nailed the trend, but were a bit off on the timing. Let’s review each in turn.

Six-Year, 2000: ‘Most CPUs Have Four or More Cores’

From the perspective of 2022, where any laptop has six or eight cores, this seems like a no-brainer. But it wasn’t yet a reality by 2006. File under “Right trend, wrong timing.”

Three-Year, 2003: ‘Apple Develops New ‘Must-Have’ Gadget: iPhone. Digital Camera/MP3 Player/Cell Phone’

This prescient prediction was Cantrill’s own (and yes, he did actually predict the name “iPhone”). But he was a bit off in one not-so-minor respect. He admits:

“I was almost making fun of myself for making this prediction because I thought this thing would be ridiculous and that nobody would want it. So this prediction was correct, but it was also really, really, deeply wrong.”

Six-Year, 2003: ‘Internet Bandwidth Grows to the Point That TV Broadcasters Become Largely Irrelevant; Former TV Networks Begin Internet Broadcasts’

Cantrill remembers his disbelief when his colleague shared this one. It’s now hard to believe that we once lived in a world where you had to sit in front of a television to learn about breaking news. Nevertheless, in 2003, the whole concept of getting news online felt like an impossible future.

One-Year, 2005: ‘Spam Turns Corner, Less of a Problem Than Year Before’

Difficult as it may be to believe, Cantrill assures us that the spam problem was previously much worse than it is today. Around 2005, it felt hopeless. Yet, it did turn the corner right around 2006 – exactly as predicted. It’s probably worth noting that this precise short-term prediction came from a technologist who worked on a mail server, and was thus intimately involved with the spam problem.

One-Year, 2006: ‘Google Embarrassed by Revelation of Unauthorized U.S. Government Spying at Gmail’

We have witnessed a variety of scandals involving the government having unauthorized access to large services, so this prediction did capture that general zeitgeist. However, the specific details were off.

Six-Year, 2006: ‘Volume CPUs Still Less than 5 GHz’

Before you dismiss this one as obvious, realize that in 2006 it wasn’t yet clear when Dennard scaling (the idea that transistors get faster as they get smaller) was going to end. But, as Cantrill’s colleague predicted, it did end – maybe sooner than anticipated (around 2006, 2007). And we did top out at less than 5 GHz: more like 4 or even 3.

So What? And What About Missing that Whole ‘Cloud Computing’ Thing?

As we’re on the cusp of 2023, why are we looking back at technology predictions from the early 2000s? Cantrill’s response: “The thing that is so interesting about them is that they tell us so much about what we were thinking at the time. I think predictions tell us much more about the present than they do about the future, and that’s a bit of a paradox.”

In retrospect, looking at the types of predictions that came true was ultimately more intriguing than the fate of the individual predictions. Their longer-term predictions were often more accurate than their one-year ones. Even though that one-year horizon is right in front of your eyes, so much can change in a year that it’s difficult to predict.

Even more interesting: megatrends that this group of infrastructure technologists overlooked. Cantrill explains, “Yes, we predicted the end of Dennard scaling… Yes, we predicted, albeit mockingly, the iPhone. But, we did not predict cloud computing or Software as a Service at all anywhere over that period of time.”

Then, the epiphany: “The things that we missed were the ramifications of the broadening of things that we already knew about at the time.” The list of their megatrend misses is populated by technologies that were right under their noses in the early 2000s – just not (yet!) at the scope that tapped their potential and made them truly transformational. For instance, they underestimated the impact of:

  • The internet
  • Distributed version control
  • Moore’s Law
  • Open source

A little more color on this, taking the example of open source: The technologists making the predictions were users of open source. They had open sourced their own software. They were ardent supporters of open source. However, according to Cantrill, they underestimated its power to truly transform the industry because they “just didn’t fully understand what it meant for everything to be open source.”

Back to the Future

So how does all this analysis of past predictions inform Cantrill’s expectations for the future?

He’s focusing less on what new things will be created and more on evolutions that tap the power of things that already exist today. Some specifics…

Compute is Becoming Ubiquitous

Cantrill’s first prediction is that powerful compute will become even more broadly available – not just with respect to moving computers into new places (à la IoT) but also having CPUs where we once thought of components. For example, open 32-bit CPUs are replacing hidden, closed 8-bit microcontrollers. We’re already seeing CPUs on the NIC (SmartNIC), CPUs next to flash (open-channel SSD) and also on the spindle (WD’s SweRV). Cantrill is confident that this compute proliferation will bring new opportunities for hardware/software co-design. (See “Bryan Cantrill on Rust and the Future of Low-Latency Systems” for more details on this thread.)

Open FPGAs/HDLs are Real

Field programmable gate arrays (FPGAs) are integrated circuits that can be programmed, post manufacturing, to do arbitrary things. To change an FPGA’s functionality, you reconfigure it with a bitstream that uses Hardware Description Language (HDL) designs.

Historically, these bitstreams were entirely proprietary, so anyone programming them was entirely dependent on proprietary toolchains that were completely closed. Claire Wolf changed this. Wolf’s terrific work of reverse engineering the Lattice iCE40 bitstream and other bitstreams opened the door to truly open FPGAs: FPGAs where you can synthesize the bitstream, where you can synthesize what you’re going to program onto that FPGA with 100% open source tools.

Cantrill believes this will be a game changer. Just as having open source development tools has democratized software development, the same will happen with FPGAs. With the toolchains opening up, many more people can actually synthesize bitstreams. In Cantrill’s words, “This stuff is amazing. It’s not the solution to all problems, for certain. But if you have a problem that’s amenable to special-purpose compute, FPGA can provide you a quick and easy way there.”

Likewise, HDLs are also opening up, and Cantrill believes this too will be transformative.

HDLs have traditionally been dominated by Verilog and (later) SystemVerilog. Their compilers have been historically proprietary, and the languages themselves are error-prone. But the past few years have yielded an explosion of new, open HDLs; for example, Chisel, nMigen, Bluespec, SpinalHDL, Mamba (PyMTL 3) and HardCaml.

Of these, Bluespec is the most interesting to the team at Oxide Computer, where Cantrill is co-founder and CTO. He explains, “The way one of our engineers describes it, ‘Bluespec is to SystemVerilog what Rust is to Assembly. It is a much higher-level way of thinking about the system, using types in the compiler to actually generate a reliable system, a verifiable system.’”

Open Source EDA is Becoming Real

Proprietary software has historically dominated electronic design automation (EDA), but open source options are coming to light in this domain as well.

Open source alternatives have existed for years, but one in particular, KiCad, has enjoyed sufficiently broad sponsorship to close the gaps with professional-grade software. The maturity of KiCad (especially KiCad 6), coupled with the rise of quick turn printed circuit board (PCB) manufacturing/assembly, has allowed for astonishing speed. It’s now feasible to go from conception to manufacture in hours, then from manufacture to shipping board in a matter of days.

Oxide has been using KiCad for its smaller boards (prototype boards), but envisions a future in which it can use KiCad for its bigger boards – and move off of proprietary EDA software. Cantrill explains, “This proprietary EDA software has all of the problems that proprietary software has. Like many shops, we have lost time because a license server crashed or a license server needed to be restarted…No one should be blocked from their work because a license server is down. The quality that we’re getting at KiCad now is really professional grade, which allows us to iterate so much faster on hardware, to go from that initial conception to a manufacturer in just hours. When that manufacturer can ship a board to you in days, and you can go from something that existed in your head to a PCB in your hand in a week, it’s remarkable. It’s a whole new era.”

Open Source Firmware is (Finally!) Happening

The Oxide team is just as bullish about open source firmware as they are about KiCad.

Cantrill laments, “The open source revolution has been so important all the way through the stack. There’s open source databases, with ScyllaDB and many others, open source system software, and open source operating systems. But the firmware itself has been resistant to it.”

The result? All the same problems that tend to plague other proprietary software. They take a long time to develop. When they come out, they’re buggy. They’ve got security problems. Cantrill continues, “We know that open source software is the way to deliver economic software, reliable software, secure software. We need to get that all the way into firmware.”

He believes that we’re finally getting there, though. The software that runs closest to the hardware is increasingly open, with drivers almost always open. The firmware of unseen parts of the system is also increasingly becoming open (see the Open Source Firmware Conference). This trend is slower in the 7nm SoCs, but it is indeed happening. The only real straggler is the boot ROMs. Even in putatively open architectures, the boot ROMs remain proprietary. This is a problem, but Cantrill is confident that we’ll get beyond it soon.

Rust is Revolutionary for Deeply Embedded Systems

Last but not least, Rust. Rust has proven to be a revolution for systems software, thanks to how its rich type system, algebraic types and ownership model allow for fast, correct code. Rust’s somewhat unanticipated ability to get small – coupled with its lack of a runtime – means it can fit practically everywhere. Cantrill believes that with its safety and expressive power, Rust represents a quantum leap over C – and without losing performance or sacrificing size. And embedded Rust is a prime example of the potential for hardware-software co-design.

The Oxide team members are big believers in Rust. They don’t use Rust by fiat, but they have found that Rust is the right tool for many of their needs.

Cantrill’s personal take on Rust: “Speaking personally as someone who was a C programmer for 2+ decades, Rust is emphatically the biggest revolution in system software since C. It is a very, very, very big deal. It is hard to overstate how important Rust is for system software. I’m shocked – and delighted – that a programming language is so revolutionary for us in system software. For so long, all we had really was C and then this offshoot in terms of C++ that … well … we’ll leave C++ alone. Suffice it to say that I was in a bad relationship with C++.

“But Rust solves so many of those problems and especially for this embedded use case. Where we talked about that ubiquitous compute earlier, Rust allows us to get into those really tiny spaces. At Oxide, we’ve developed a new Rust-embedded operating system called Hubris. The debugger, appropriately enough, is called Humility, and I definitely encourage folks to check that out.”

Evenly Distributing Our Present into the Future

The technologies featured in this latest batch of predictions are not new. In some cases, they’ve actually been around for decades. But, Cantrill believes they’ve all reached an inflection point where they are ready to take off and become (with a nod to the famous quote attributed to William Gibson) much more “evenly distributed.”

Cantrill concludes, “We believe that the future is one in which hardware and software are co-designed, and again, we are seeing that very concretely. And the fact that all of these technologies are open assures that they will survive. So we can quibble with the timing, but these technologies will endure. It may take time for them to broaden, but their trajectory seems likely, and we very much look forward to evenly distributing our present into the future.”

ScyllaDB Innovation Awards: Nominate Your Team

Get your team’s amazing achievements the recognition they deserve — tell us why you should win a 2023 ScyllaDB Innovation Award!


The ScyllaDB Innovation Awards shine a spotlight on ScyllaDB users who went above and beyond to deliver exceptional data-intensive applications. All ScyllaDB users are eligible: ScyllaDB Cloud, Enterprise, and Open Source.

This year, there are 8 categories that honor technical achievements, business impact, community contributions, and more. Specifically:

  • Gamechanger: Got a use case that pushes the bounds of what’s possible? What ground-breaking data-intensive app did you create? What sets it apart from the others? Tell us everything you can about your system and how you’re using ScyllaDB.
  • Business Impact: We love to hear about people that built their business on ScyllaDB. Did you fundamentally change your top line revenue or your bottom line profits using our database? We’d love to hear your stories of ROI and savings on TCO.
  • Technical Accomplishment: Now’s your chance to show off the technical chops of your team! What innovative technical challenge did you tackle and beat using ScyllaDB?
  • New ScyllaDB User: If you’ve hit the ground running, getting ScyllaDB up and into production this past year, we’d love to hear your story! Tell us how you beat expectations on reaching time-to-production, and what you’ve been able to achieve in your first year as a user.
  • ScyllaDB Expansion: You’re not a new user, but you’ve taken ScyllaDB to new use cases within your organization – or you’ve really expanded the scope of your existing use case. Tell us why and how you went big with ScyllaDB.
  • Integrator: No database is an island. Have you pulled off an impressive integration into other parts of your data pipeline or your DevOps pipeline? Share what magic you were able to perform.
  • Security Vanguard: Trusting that your database provides enterprise-grade security in the cloud is critical. This award honors the year’s most innovative and collaborative security projects with ScyllaDB.
  • Top ScyllaDB Open Source Contributor: Who stands out as a champion and technical leader of the ScyllaDB community? Someone who’s knee-deep in Github, and who’s always been there to aid you via Slack? This award is a great chance to recognize and nominate your professional colleagues.

Winners receive an award and a special ScyllaDB swag pack — plus recognition in a ScyllaDB Summit keynote, blog, press release, and social media posts.

Interested? Tell us why you should win before the January 13, 2023 deadline, then wait for the big announcement at ScyllaDB Summit, February 15-16, 2023.

The 2023 award winners will join a rather distinguished group of past honorees. For example, earlier in 2022 we recognized the following organizations and individuals:

  • Palo Alto Networks: For architecting a solution using ScyllaDB as a high-performance low-latency database for network events and as a message queue. Their solution achieves near real-time correlation of millions of different types of network security events per second, from multiple different sources.
  • Instacart: For their rapid implementation of ScyllaDB as a unified feature store for their company-wide Machine Learning (ML) initiative. Faster ingestion of company-wide ML pipeline data translates to more helpful recommendations for both customers and shoppers.
  • The Janssen Pharmaceutical Companies of Johnson & Johnson: For developing, through its R&D Data Science team, an integrated, artificial intelligence (AI)-driven graph of biomedical knowledge to help researchers accelerate drug discovery. This innovative approach, recently presented at BioIT World, has taken knowledge graphs beyond the convention of standalone network visualizations and applied them across the company’s therapeutic areas to help enhance their understanding of the underlying mechanisms of diseases and interpretation of study results.
  • IBM: For nearly doubling cluster storage capacity with zero request rejections despite internal system challenges such as server memory issues and disks with bad sectors. This feat was orchestrated by adding higher storage capacity nodes, decommissioning lower storage capacity nodes, updating ScyllaDB releases, and working closely with the ScyllaDB team.
  • Happn: For a strategically planned and flawlessly executed migration of 68B rows of data from Cassandra to ScyllaDB. The move from Cassandra to ScyllaDB reduced Happn’s TCO by 75%.
  • China Mobile: For their use of Alternator, ScyllaDB’s DynamoDB-compatible API, to store metadata that is critical for realizing low-latency and high-performance metadata storage for the company’s next generation architecture. They were also an important contributor to the API’s development; they started using it in 2019 (pre-production), put it to the test, and helped make it an even better option for companies seeking a more flexible, cost-efficient alternative to DynamoDB.
  • Meraj Rasool, SkyElectric: For his success completing the core ScyllaDB University courses plus his participation in ScyllaDB University Live. Meraj also invited his colleagues to ScyllaDB University and applied their lessons learned to more efficiently utilize ScyllaDB for SkyElectric’s production load.

Nominate your team or colleagues for the ScyllaDB Innovation Awards, and then sign up to attend ScyllaDB Summit 2023 — our free virtual conference — to see who won!



Integrating Apache Cassandra® and Kubernetes through K8ssandra

Scalability, high performance, and fault tolerance are key features that most enterprises aim to integrate into their database architecture. Apache Cassandra® is the database of choice for large-scale cloud applications, while Kubernetes (K8s) has emerged as the leading orchestration platform for...

Apache Cassandra® 4.1: Discover What’s New

Apache Cassandra® 4.1 is bringing some seriously useful new features to the table! 

First up: the new Guardrails framework is specifically designed to help operators avoid configuration and usage pitfalls that could potentially degrade the performance and availability of their clusters. This is a big deal, as it means that you will have more control over how Cassandra is used and can proactively prevent issues from arising. You can disable certain features, disallow certain configuration values, and even set soft and hard limits for certain database magnitudes. 

But that is not all. Cassandra 4.1 is also introducing the Partition Denylist feature, giving you options for dealing with problematic partitions. With Partition Denylists, you can now choose between providing access to the entire data set with reduced performance or reducing the available data set to ensure that performance is not affected. This is a game-changer, as it means that you will have more control over how problematic partitions impact other reads and writes; you can even prevent slow and resource-intensive operations before they have a chance to start. 

And if that was not enough, the Paxos optimizations in this release are set to improve latency and halve the number of round trips needed to achieve consensus. Plus, they guarantee linearizability across range movements – something you would normally only expect from a database with strong consistency. 

As for updates to the Cassandra Query Language (CQL), developers can now group by time range, use CONTAINS and CONTAINS KEY conditions in conditional updates, and even use IF EXISTS and IF NOT EXISTS in ALTER statements. All these updates are going to make it even easier and more efficient to work with Cassandra.  

All in all, Apache Cassandra 4.1 is shaping up to be a great release with some truly impressive new features – so many, in fact, we cannot even list them all in this article! 

If you are a Cassandra user, then you do not want to miss out on everything 4.1 has to offer! 

The post Apache Cassandra® 4.1: Discover What’s New appeared first on Instaclustr.

Top NoSQL Blogs of 2022

As the year winds down, let’s look back at our top 10 blogs written this year – plus 10 perennial favorites.

Before we start, thank you to the community members who contributed to our blogs in various ways – from users sharing best practices at ScyllaDB Summit, to open source contributors and ScyllaDB engineers explaining how they raised the bar on what’s possible for NoSQL performance, to anyone who has initiated or contributed to the discussion on HackerNews, Reddit, and other platforms. And if you have suggestions for 2023 blog topics, we welcome you to share them in this thread on our new ScyllaDB Community Forum.

With no further ado, here are the most read NoSQL blogs that we published in 2022…

How Palo Alto Networks Replaced Kafka with ScyllaDB for Stream Processing

How cybersecurity leader, Palo Alto Networks, used their existing ScyllaDB database to eliminate the MQ layer (Kafka) for a project that correlates events in near real time.


Async Rust in Practice: Performance, Pitfalls, Profiling

How our engineers used flamegraphs to diagnose and resolve performance issues in our Tokio-based Rust driver.


Shaving 40% Off Google’s B-Tree Implementation with Go Generics

How we got a 40% performance gain in an already well optimized package, the Google B-Tree implementation, using Go generics.


A New ScyllaDB Go Driver: Faster Than GoCQL and Its Rust Counterpart

How we built a new Go ScyllaDB driver that’s almost 4x faster than its GoCQL predecessor and 2x faster than its Rust counterpart.


Benchmarking Apache Cassandra (40 Nodes) vs ScyllaDB (4 Nodes)

We benchmarked Apache Cassandra on 40 nodes vs ScyllaDB on just 4 nodes. See how they stacked up on throughput, latency, and cost.


Why Disney+ Hotstar Replaced Redis and Elasticsearch with ScyllaDB Cloud

The inside perspective on how Disney+ Hotstar simplified its “continue watching” data architecture for scale.


We’re Porting Our Database Drivers to Async Rust

The ScyllaDB Rust Driver beats even the reference C++ driver in terms of raw performance. That gave us an idea: Why not unify all our drivers to use Rust underneath?


Implementing a New IO Scheduler Algorithm for Mixed Read/Write Workloads

A deep under-the-hood view of our NoSQL database engine. Learn how our new IO scheduler improved latencies in mixed workloads.


ScyllaDB on the New AWS EC2 I4i Instances: Twice the Throughput & Lower Latency

How ScyllaDB achieves 2.7x higher throughput with a 40% reduction in average latency on the new AWS I4i series, which uses the Intel Ice Lake processors and AWS Nitro SSD.


Wasmtime: Supporting UDFs in ScyllaDB with WebAssembly

How you can use WebAssembly to call user-defined functions when querying the database – plus, get a sneak peek at what else ScyllaDB is doing with Wasm.


Bonus: Top NoSQL Database Blogs From Years Past

Many of the blogs published in previous years continued to resonate with readers. Here’s a rundown of 10 ongoing favorites:

More Insights from NoSQL and Distributed Data System Experts

Want to learn more from these and other database experts? Join us at ScyllaDB Summit 2023: an immersive and highly-interactive opportunity to:

  • Discover the latest distributed database advancements
  • Hear how your peers are solving their toughest database challenges
  • Learn what’s new with ScyllaDB
  • Explore the latest trends across the broader data ecosystem (event streaming, graph databases, …)

It’s free and virtual — two half days, February 15 and 16.


Cutting Database Costs: Lessons from Comcast, Rakuten, Expedia & iFood

Dealing with infrastructure costs typically isn’t high on an R&D team’s priority list. But these aren’t typical times, and lowering costs is unfortunately yet another burden that’s now being placed on already overloaded teams.

For those responsible for data-intensive applications, reducing database costs can be low-hanging fruit for significant cost reduction. If you’re managing terabytes to petabytes of data with millions of read/write operations per second, the total cost of operating a highly-available database and keeping it humming along can be formidable – whether you’re working with open source on-prem, fully-managed database-as-a-service, or anything in between. Too many teams have been sinking too much into their databases. But, looking on the bright side, this means there’s a lot to be gained by rethinking your database strategy.

For some inspiration, here’s a look at how several dev teams significantly reduced database costs while actually improving database performance.

Comcast: 60% Cost Savings by Replacing 962 Cassandra Nodes + 60 Cache Servers with 78 ScyllaDB Nodes

“We reduced our P99, P999, and P9999 latencies by 95%–resulting in a snappier interface while reducing CapEx and OpEx.” – Phil Zimich, Senior Director of Engineering at Comcast

Comcast is a global media and technology company with three primary businesses: Comcast Cable (one of the United States’ largest video, high-speed internet, and phone providers to residential customers), NBCUniversal, and Sky.


Comcast’s Xfinity service serves 15M households with 2B+ API calls (reads/writes) and 200M+ new objects per day. Over the course of 7 years, the project expanded from supporting 30K devices to over 31M devices.

They first began with Oracle, then later moved to Apache Cassandra (via DataStax). When Cassandra’s long tail latencies proved unacceptable at the company’s rapidly-increasing scale, they began exploring new options. In addition to lowering latency, the team also wanted to reduce complexity. To mask Cassandra’s latency issues from users, they placed 60 cache servers in front of their database. Keeping this cache layer consistent with the database was causing major admin headaches.


Moving to ScyllaDB enabled Comcast to completely eliminate the external caching layer, providing a simple framework in which the data service connected directly to the data store. The result was reduced complexity and higher performance, with a much simpler deployment model.


Since ScyllaDB is architected to take full advantage of modern infrastructure — allowing it to scale up as much as scale out — Comcast was able to replace 962 Cassandra nodes with just 78 nodes of ScyllaDB.

They improved overall availability and performance while completely eliminating the 60 cache servers. The result: a 10x latency improvement with the ability to handle over twice the requests – at a fraction of the cost. This translates to 60% savings over Cassandra operating costs – saving $2.5M annually in infrastructure costs and staff overhead.


Rakuten: 2.5x Lower Infrastructure Costs From a 75% Node Reduction

“Cassandra was definitely horizontally scalable, but it was coming at a stiff cost. About two years ago, we started internally realizing that Cassandra was not the answer for our next stage of growth.” – Hitesh Shah, Engineering Manager at Rakuten

Rakuten allows its 1.5B members to earn cash back for shopping at over 3,500 stores. Stores pay Rakuten a commission for sending members their way, and Rakuten shares that commission with its members.


Rakuten Catalog Platform provides ML-enriched catalog data to improve search, recommendations, and other functions to deliver a superior user experience to both members and business partners. Their data processing engine normalizes, validates, transforms, and stores product data for their global operations.

While the business was expecting this platform to support extreme growth with exceptional end-user experiences, the team was battling Apache Cassandra’s instability, inconsistent performance at scale, and maintenance overhead. They faced JVM issues, long Garbage Collection (GC) pauses, and timeouts – plus they learned the hard way that a single slow node can bring down the entire cluster.


Rakuten replaced 24 nodes of Cassandra with 6 nodes of ScyllaDB. ScyllaDB now lies at the heart of their core technology stack, which also involves Spark, Redis, and Kafka. Once data undergoes ML-enrichment, it is stored in ScyllaDB and sent out to partners and internal customers. ScyllaDB processes 250M+ items daily, with a read QPS of 10k-15k per node and write QPS of 3k-5k per node.

One ScyllaDB-specific capability that increases Rakuten’s database cost savings is Incremental Compaction Strategy (ICS). ICS allows greater disk utilization than standard Cassandra compaction strategies, so the same amount of total data requires less hardware. With traditional compaction strategies, users need to set aside half of their total storage for compaction. With ICS, Rakuten can use 85% or more of their total storage for data, enabling far better hardware utilization.
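To make the storage arithmetic concrete, here is a minimal sketch of how the usable fraction of disk translates into raw capacity. The 50% and 85% figures come from the paragraph above; the 100 TB dataset size and the function name are hypothetical, chosen only for illustration, and the calculation ignores replication and other overheads.

```python
def raw_storage_needed(data_tb: float, usable_fraction: float) -> float:
    """Raw disk capacity (TB) needed when only `usable_fraction` of disk can hold data."""
    if not 0 < usable_fraction <= 1:
        raise ValueError("usable_fraction must be in (0, 1]")
    return data_tb / usable_fraction


DATA_TB = 100.0  # hypothetical dataset size, not a Rakuten figure

# Traditional compaction: half of the disk is reserved for compaction.
traditional_tb = raw_storage_needed(DATA_TB, 0.50)

# ICS: 85% (or more) of the disk can hold data.
ics_tb = raw_storage_needed(DATA_TB, 0.85)

# Fraction of raw capacity saved by moving from 50% to 85% utilization.
savings = 1 - ics_tb / traditional_tb

print(f"traditional: {traditional_tb:.1f} TB raw")
print(f"ICS:         {ics_tb:.1f} TB raw")
print(f"raw capacity saved: {savings:.1%}")
```

Under these assumptions, the same dataset needs roughly 41% less raw disk, which is where the hardware savings come from.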


Rakuten can now publish items up to 5x faster, enabling faster turnaround for catalog changes. This is especially critical for peak shopping periods like Black Friday. They are achieving predictably low latencies, which allows them to commit to impressive internal and external SLAs. Moreover, they are enjoying 2.5x lower infrastructure costs following the 4x node reduction.


Expedia: 35% Cost Savings by Replacing Redis + Cassandra

“We no longer have to worry about ‘stop-the-world’ garbage collection pauses. Also, we are able to store more data per node and achieve more throughput per node, thereby saving significant dollars for the company.” – Singaram Ragunathan, Cloud Data Architect at Expedia Group

Expedia is one of the world’s leading full-service online travel brands helping travelers easily plan and book their whole trip with a wide selection of vacation packages, flights, hotels, vacation rentals, rental cars, cruises, activities, attractions, and services.


One of Expedia’s core applications provides information about geographical entities and the relationships between them. It aggregates data from multiple systems, like hotel location info, third-party data, etc. This rich geography dataset enables different types of data searches using a simple REST API with the goal of single-digit millisecond P99 read response time.

The team was using a multilayered approach with Redis as a first cache layer and Apache Cassandra as a second persistent data store layer, but they grew increasingly frustrated with Cassandra’s technical challenges. Managing garbage collection and making sure it was appropriately tuned for the workload at hand required significant time, effort, and expertise. Also, burst traffic and workload peaks impacted the P99 response time – requiring buffer nodes to handle peak capacity, which drove up infrastructure costs.


The team migrated from Cassandra to ScyllaDB without modifying their data model or application drivers. As Singaram Ragunathan, Cloud Data Architect at Expedia Group put it: “From an Apache Cassandra code base, it’s frictionless for developers to switch over to ScyllaDB. There weren’t any data model changes necessary. And the ScyllaDB driver was compatible, and a swap-in replacement with Cassandra driver dependency. With a few tweaks to our automation framework that provisions an Apache Cassandra cluster, we were able to provision a ScyllaDB cluster.”


With Cassandra, P99 read latency was previously spiky, varying from 20 to 80 ms per day. With ScyllaDB, it’s consistently around 5 ms. ScyllaDB throughput is close to 3x Cassandra’s. Moreover, ScyllaDB is providing 35% infrastructure cost savings.


More from Expedia


iFood: Moving Off DynamoDB to Scale with 9X Cost Savings

“One thing that’s really relevant here is how fast iFood grew. We went from 1M orders a month to 20M a month in less than 2 years.” – Thales Biancalana, Backend Developer at iFood

iFood is the largest food delivery company in Latin America. It began as a Brazilian startup, and has since grown into the clear leader, with a market share of 86%. After becoming synonymous with ‘food delivery’ in Brazil, iFood expanded its operations into Colombia and Mexico.


The short answer: online ordering at scale, with PostgreSQL as well as DynamoDB.

Each online order represents about 5 events in their database, producing well over 100M events on a monthly basis. Those events are sent to restaurants via the iFood platform, which uses SNS and SQS. Since internet connections are spotty in Brazil, they rely on an HTTP-based polling service that fires off every 30 seconds for each device. Each of those polls invokes a database query.

After they hit 10 million orders a month and were impacted by multiple PostgreSQL failures, the team decided to explore other options. They moved to NoSQL and selected DynamoDB for their Connection-Polling service. They quickly discovered that DynamoDB’s autoscaling was not fast enough for their use case. iFood’s bursty intraday traffic naturally spikes around lunch and dinner times. Slow autoscaling meant that they could not meet those daily bursts of demand unless they left a high minimum throughput (which was expensive) or managed scaling themselves (which is work that they were trying to avoid by paying for a fully-managed service).


iFood transitioned their Connection-Polling service to ScyllaDB Cloud. They were able to keep the same data model that they built when migrating from PostgreSQL to DynamoDB. Even though DynamoDB uses a document-based JSON notation, and ScyllaDB uses the SQL-like CQL, they could use the same query strategy across both.


iFood’s ScyllaDB deployment easily met their throughput requirements and enabled them to reach their mid-term goal of scaling to support 500K connected merchants with 1 device each. Moreover, moving to ScyllaDB reduced their database expenses for the Connection-Polling service from $54K to $6K, a 9x savings.
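
As a rough back-of-the-envelope check, the figures quoted above (a poll every 30 seconds per device, one database query per poll, and a goal of 500K connected devices) imply the average query rate below. The variable names are illustrative only, and the calculation assumes polls are spread evenly over time.

```python
# Figures taken from the case study text above.
POLL_INTERVAL_SECONDS = 30   # each device polls every 30 seconds
TARGET_DEVICES = 500_000     # mid-term goal: 500K merchants, 1 device each

# Each poll invokes one database query, so the average query rate is:
queries_per_second = TARGET_DEVICES / POLL_INTERVAL_SECONDS

# Cost figures quoted for the Connection-Polling service.
cost_before = 54_000
cost_after = 6_000
savings_factor = cost_before / cost_after

print(f"average polling load: {queries_per_second:,.0f} queries/second")
print(f"cost reduction: {savings_factor:.0f}x")
```

Real traffic is bursty around lunch and dinner, so peak load would sit above this average.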

More from iFood


Wrap Up

These examples are just the start. There are quite a few ways to reduce your database spend:

  • Improve your price-performance with more powerful hardware – and a database that’s built to squeeze every ounce of power out of it, such as ScyllaDB’s highly efficient shard-per-core architecture and custom compaction strategy for maximum storage utilization.
  • Reduce admin costs by simplifying your infrastructure (eliminating external caches, reducing cluster size, or moving to a database that requires less tuning and babysitting).
  • Tap technologies like workload prioritization to run OLTP and OLAP workloads on the same cluster without sacrificing latency or throughput.
  • Consider DBaaS options that allow flexibility in your cloud spend rather than lock you into one vendor’s ecosystem.
  • Move to a DBaaS provider whose pricing is better aligned with your workload, data set size, and budget (compare multiple vendors’ prices with this DBaaS pricing calculator).

And if you’d like advice on which – if any – of these options might be a good fit for your team’s particular workload, use case, and ecosystem, our architects would be happy to provide a technical consultation.


Accelerate your Cassandra development and automation with the Astra CLI

We are pleased to announce the release of the new DataStax Astra Command Line Interface (CLI). For Apache Cassandra developers using DataStax Astra DB, the Astra CLI is a one stop shop for managing your Astra resources through scripts or in your local terminal. It covers a wide variety of...

It’s time to upgrade to Cassandra 4.1

Database infrastructure, particularly for security and operations minded team members, shouldn’t actually be very exciting. In fact, it should be as boring as possible, particularly for a database that’s been powering massive scale infrastructure for over a decade. Let’s save the excitement for...

High Performance NoSQL Masterclass: Watch Now on Demand

There are literally dozens upon dozens of NoSQL databases available these days. A lot of choices for users to consider. Yet databases are not all built the same. There are times when you are looking to build specifically for high performance and high scalability — to the tune of hundreds of thousands or millions of operations per second. What criteria do you use to short list a database to consider for your specific use case?

In our most recent Masterclass, hosted in partnership between the database experts at ScyllaDB and Pythian, we dove into the factors for decision-making, and made the case for wide column NoSQL databases. How and why are they used for modern high performance applications? How do you get the most out of a wide column database?


The Case for Wide Column NoSQL

I was the host for the first session: A Survey of High Performance NoSQL Systems. I began by looking at the most popular databases these days: SQL, NoSQL, and so-called “multimodel” databases that can be a mix of both SQL and NoSQL, or multiple NoSQL data models. Then, I broke down the most popular types of NoSQL data models: key value stores, document databases, graph, and wide column.

After laying out that orientation, I made the case for using wide column NoSQL databases, such as Apache Cassandra or ScyllaDB, for high performance use cases. What attributes and capabilities do they provide that make them ideally suited for massive scale out, and, in the case of ScyllaDB, scale up as well?

Modeling Data and Queries for Wide Column NoSQL

Next up was Pythian’s lead database consultant Allan Mason. His session focused on how to be successful with wide column NoSQL by understanding the nature of its data model and query structure. While many users are familiar with Structured Query Language (SQL) popular across RDBMS systems, Cassandra Query Language (CQL) is deceptively similar but fundamentally different. With RDBMS, you usually begin with a schema-first design. With a wide column database like ScyllaDB or Cassandra, you begin with a query-first design. Then, step-by-step, Allan showed how to model tables around your query patterns.

Scaling for Performance

The third session was hosted by ScyllaDB Solutions Architect and author Felipe Cardeneti Mendes. His session focused on the features and capabilities that let you scale for your specific use case, from using workload prioritization to balance the read-write workloads of OLTP and the read-heavy workloads of OLAP in the same cluster, to selecting the right compaction strategy to match your workload.

Felipe also covered aspects of deployment and production readiness, reinforcing the points made in Allan’s talk about data modeling, but also including testing, application design, sizing and observability.

A Winning Formula: ScyllaDB’s Masterclass Format

This is the third of a series of Masterclasses we offered over the course of 2022. Prior Masterclass sessions included the Distributed Data System Masterclass in partnership with StreamNative, and the Performance Engineering Masterclass, which we hosted in conjunction with Grafana k6 and Dynatrace. For those who have not attended one of our Masterclasses before, the format is as follows:

  • Three expert video presentations, grounding you in the topic
  • A live panel talk between the three experts, where you can ask questions
  • A test; if you pass, you can share your achievement on your LinkedIn profile

It’s more than a webinar or YouTube video, but short of a full tech conference in length. Plus, you can’t just bury the browser tab in the background and sort of half-listen. You need to be paying attention, because there is going to be a test!

Rest assured, if you don’t pass on the first try, we offer Masterclass attendees a chance to retake the test. With a careful re-watching of the materials you’ll find the correct answers were all covered in the materials presented.

This tough but fair test format has been key to the enthusiastic user response to the ScyllaDB Masterclass series. Users need to keep on their toes and stay engaged in the content. They feel challenged intellectually, and then have a chance to show that they’ve integrated the key lessons from the materials.

We look forward to hosting more Masterclass sessions in the future. If you have any ideas, feel free to join our user Slack and drop your ideas in the #events channel, or post your thoughts in a thread on our new user Forum.

Curious? Watch Now!

If you missed the live event, never fear! We have you covered. You can watch the High Performance NoSQL Masterclass now on demand. Totally online and totally free. Enjoy!


Register for ScyllaDB Summit 2023

ScyllaDB’s Annual Conference for NoSQL Gamechangers

“Database monsters of the world, connect!” Once again we will be hosting the next annual ScyllaDB Summit on February 15th and 16th, 2023. It will be, as usual, all free and all online. We’ll have dozens of speakers, from your professional peers from Discord, Hulu, Strava, Sharechat and more, to industry experts from Intel, Confluent and Rackspace, as well as ScyllaDB’s own engineering staff and leaders.

Want to hear how your peers are solving their toughest data-intensive application challenges? Explore the latest trends across the broader data ecosystem? Learn practical tips you can bring back into your own organization? You’ll find plenty of tech talks to keep you engaged across the conference. It’s also a great opportunity to learn about ScyllaDB’s latest innovations and best practices that will help your team get the most out of ScyllaDB.


This is an event by and for NoSQL distributed database experts. Whether your role is an architect, data engineer, application engineer, platform engineer, DevOps, SRE or DBA, this is your tribe. From keynotes, to tech talks, to lounges, chats, and games, there are lots of opportunities to level up your database skills and expertise about building real-time data systems.

Just some of the many speakers for this year’s event

Who’s Speaking This Year?

We will have real-world use case talks from ScyllaDB users. Speakers from Discord, Hulu, Strava, Sharechat, iFood, Numberly, GumGum, Optimizely, ZeroFlucs and more to be announced.

ScyllaDB Summit will also feature talks by our leadership, including ScyllaDB CEO Dor Laor, CTO Avi Kivity, our new VP of R&D, Yaniv Kaul, and our VP of Product, Tzach Livyatan. They will be delivering a recap of the progress we’ve made to date, the current state-of-the-art and the direction of where we are taking our NoSQL database in the future.

Plus you’ll hear detailed talks on specific features and capabilities directly from the ScyllaDB engineers working on the code.

As well, we will have talks from industry luminaries on how ScyllaDB fits into broader trends in the data landscape like event streaming, open data platforms, cloud computing, high performance benchmarking, DevOps, site reliability and more. You’ll hear from speakers from leading tech companies such as Intel, Confluent, Rackspace, dbt, Percona, StreamNative, Anant, and BenchANT.


A Highly Interactive Virtual Conference

“They are really showing folks how to do it with respect to on-line conferences.”

— Bryan Cantrill, Oxide Computer

ScyllaDB online events are highly interactive experiences where you can connect with speakers and thousands of like-minded “database monsters” across the community.

Within the platform, you will have a chance to chat with your industry peers, ask questions directly of the speakers and ScyllaDB experts, plus attend the “Speaker’s Lounge” to hear more live insights from our VIP guests.

For a taste of what you can expect from the keynotes and tech talks, take a look at some highlights from ScyllaDB Summit 2022.



ScyllaDB Summit will be held on Wednesday and Thursday, February 15-16, 2023, from 8 AM to 12 PM Pacific Time. We’ll also offer some cool ways for attendees in other time zones to binge-watch talks they might have missed.

Register Now!

We look forward to seeing you online at our ScyllaDB Summit in February! But don’t delay — sign up for our event today!


Apache Cassandra for the SQL Server Developer

I started my career as a DBA with SQL Server 6.0. around about 2015.  I was a new hire who was scheduled to support our SQL Server and MySQL databases. However, soon after my arrival, the CTO called me into his office to tell me I was the new Cassandra DBA. My reply was “That’s great. What’s Cassandra?”  

The first 6 months were rough. The cluster had been in operation for more than 6 months but was not doing too well. Performance was poor and, worse, it frequently crashed. It was not a fun time. But eventually, the problems got fixed. 

There were several issues (including my inexperience) that caused these problems, but the core one was that the original developer had treated it like another relational database. 

You will not learn Cassandra by reading this. Rather, my goal is to assist with your transition from SQL Server to Apache Cassandra® and hopefully help you avoid most of the pitfalls I encountered when making my transition. 

This article will focus on Cassandra itself and not go into how to query Cassandra. 

What’s Cassandra?

Apache Cassandra is an open source, NoSQL, distributed database. Facebook originally developed it and released it as open source in 2008, as a “merger of Amazon’s Dynamo distributed storage and replication techniques and Google’s Bigtable data and storage engine model.” The current stable version as of August 2022 is 4.0.6.

Cassandra is written in Java and runs within a Java virtual machine (JVM). 

Apache Cassandra is licensed under the very permissive Apache License 2.0. It can be freely downloaded from the Apache Cassandra download site. Source code can be downloaded from the Apache Git repository. Cassandra runs on most Linux distributions. Microsoft Windows is no longer supported.

NoSQL is a broad term used to describe non-relational databases. Cassandra stores data as rows and columns in tables. Each row is a collection of columns where a column is a mapping of a key, a value, and metadata such as creation time. Relationships between tables are not supported.

A Cassandra cluster is a group of Cassandra installations, called nodes. All nodes are peers and can perform any operation. Data is sharded across the nodes in partitions. Cassandra clusters are easily scaled out by adding nodes which can be done while the cluster is running. Large-scale clusters can handle petabytes of data. 

See our content library for our e-book, A Guide to Apache Cassandra.  For a comprehensive guide to Cassandra Architecture, look here.  

Why on Earth Would I Use Cassandra? 

This was my question after asking “what’s Cassandra?” The simple answer was senior management wanted a database that could easily scale and be highly available. The marketing department was projecting customer usage that would require 20+ TB of storage and it had to survive potential hardware failures. We were a Microsoft SQL Server shop so that was the first place we looked. In the end, Cassandra was chosen.

 A Very Simplified View of Cassandra Data Distribution

Every node added to the cluster is assigned a hash value, or token (in actual practice, it is a range of tokens, but let’s keep it simple). When a new record is created, Cassandra generates a consistent hash, a token, from the partition key, which is part of the record’s primary key. The record is then stored on the node that “owns” that token. A collection of records with the same partition key forms a partition. All rows in a partition must reside on the same node.
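
The token lookup described above can be sketched in a few lines of Python. This is a simplified illustration, not Cassandra's implementation: the `Ring` class and `token_for` helper are hypothetical names, each node holds a single token, and MD5 stands in for the Murmur3 hash that Cassandra's default partitioner uses.

```python
import bisect
import hashlib


def token_for(partition_key):
    """Hash a partition key to an integer token (MD5 is a stand-in hash here)."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class Ring:
    """A toy token ring: each node owns the tokens up to and including its own."""

    def __init__(self, node_tokens):
        # node_tokens maps a node name to the single token assigned to it.
        self._entries = sorted((tok, node) for node, tok in node_tokens.items())
        self._tokens = [tok for tok, _ in self._entries]

    def owner_of_token(self, token):
        # First node whose token is >= the given token, wrapping around the ring.
        i = bisect.bisect_left(self._tokens, token) % len(self._tokens)
        return self._entries[i][1]

    def owner(self, partition_key):
        return self.owner_of_token(token_for(partition_key))


ring = Ring({"node-a": 100, "node-b": 200, "node-c": 300})
```

Because the token depends only on the partition key, every row with the same partition key lands on the same node, which is why a partition never spans nodes.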

Glossary of Terms

Node   An instance of Cassandra. 

Cluster   A collection of nodes working together. 

Datacenter  A subdivision of a cluster.  A cluster can have multiple datacenters which can be geographically separated. Data is replicated among the datacenters.

Keyspace  The Cassandra equivalent of a database.  It serves as the namespace for its tables. It is also used to define the replication factor of the keyspace for each datacenter. 

Replication Factor   The number of data replicas made.  Each replica is stored on a different node to ensure fault tolerance. 

Partition  A group of rows that share the same partition key. It is how Cassandra shards data.   The location of the partition is a function of its hash value.  For more information see Apache Cassandra Data Partitioning.

Consistency Level  The number of replicas that must successfully respond to a request. 

Tombstone  A special marker written whenever data is deleted, the NULL value is inserted, or a non-frozen collection is used. Excess tombstones can cause long GC pauses, latency, read failures, or out-of-heap errors. For more information see Managing Tombstones in Cassandra. 

SSTable (Sorted String Table)  An immutable data file that Cassandra uses for persisting data on a disk.

Memtable   A memory structure where Cassandra buffers write data. In general, there is one active memtable per table. Eventually, a memtable is flushed onto the disk and becomes an SSTable.

Major Differences From SQL Server

Data is stored as a Log-Structured Merge (LSM) tree. The use of this structure avoids the need for a read before a write. Cassandra groups inserts and updates in memory, and, at intervals, sequentially writes the data to disk in append mode. Once written to disk, the data is immutable and is never overwritten. Write latency is generally sub-millisecond and Cassandra can ingest large amounts of data very quickly. 

Cassandra does not follow the relational model. Data is modeled on the query. Joins and sub-queries are not supported. Data denormalization and table duplication are required. 

The design of tables is driven by queries. Tables, specifically the primary keys, must conform to the restrictions placed on data searching by the architecture. 

Searching data is restricted. Only primary key columns can be used to select data. By default, Cassandra will allow only search arguments that return a single partition, which makes the partition key columns mandatory in a search argument.  

Sort order is a design decision. The sort order available on queries is fixed and is determined entirely by the selection of clustering columns. 

Cassandra is not ACID compliant. Only one command (BATCH) guarantees Atomicity and isolation. 

Data Consistency is a design decision. The level of consistency can be adjusted at the query level. 

Indexing is limited.  Cassandra provides Secondary Indexes and SSTable Attached Secondary Index (SASI) for querying on non-key fields but both types should be used with caution.

CQL is not T-SQL.  Although the syntax is very similar, familiar commands such as DELETE and UPDATE have different behaviors.   

The Relational Model Does Not Work

I was going to title this section “Put Away Your E.F. Codd” but I am not sure if he is still referenced today. If you sweated to create a correctly normalized database, congratulations. However, if you try to use it with Cassandra it will not work. Many years ago, I went to work for a company whose SQL Server data was described to me as “a collection of Excel spreadsheets”. Thinking back, it might have made a good Cassandra database.

Designing tables in Cassandra is shaped by the following limits:

  • Queries that would scan all partitions are prohibited by default. This means the entire partition key must be used in all searches. 
  • Table joins are not supported. (see above).
  • You cannot rely on secondary indexes. They do not perform well and should be used only in limited use cases. 
  • Queries should return only a single partition. Writing a query that returns multiple partitions is expensive because Cassandra must visit multiple nodes. 
  • Searches are restricted. Only primary key columns are allowed. Partition key columns allow only equality searches. There are restrictions on how the clustering columns may be searched. 

Tables must be modeled on the query, not the data. You must identify all possible queries at the BEGINNING of the design process not towards the end. 

Data denormalization is required.  If you need data from more than one table, they must be merged into a single table.

Tables must be duplicated. If a table needs to be queried by a non-key column, a duplicate table with a new primary key must be created. 
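
To make the idea of duplicated, query-shaped tables concrete, here is a minimal Python sketch in which two dictionaries stand in for two Cassandra tables holding the same data under different primary keys. The names `users_by_id`, `users_by_email`, and `save_user` are hypothetical, and the sketch ignores what happens if one of the two writes fails.

```python
# Each dict stands in for one table, keyed by that table's primary key.
users_by_id = {}      # answers: "find the user with this id"
users_by_email = {}   # answers: "find the user with this email"


def save_user(user_id, email, name):
    """Write the same denormalized row to every table that serves a query."""
    row = {"user_id": user_id, "email": email, "name": name}
    users_by_id[user_id] = dict(row)
    users_by_email[email] = dict(row)


save_user(1, "ada@example.com", "Ada")
save_user(2, "linus@example.com", "Linus")

# Each query is a single-key lookup against the table designed for it.
by_id = users_by_id[2]
by_email = users_by_email["ada@example.com"]
```

The cost of this design is that the application, not the database, keeps the copies in step on every write.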

See data modeling in the Apache Cassandra Documentation for a quick overview.

For a more in-depth treatment, see our e-book  “Apache Cassandra® Data Modeling Guide to Best Practices”, in our content library.

Cassandra Is Not ACID-Compliant

Cassandra is not an ACID (Atomicity, Consistency, Isolation, and Durability) compliant system. Rather, its behavior is explained by the CAP theorem and its extension, the PACELC theorem. 

What this means is that there is a tradeoff between Availability (every request receives a non-error response) and Consistency (every read receives the most recent write). In addition, there is a tradeoff between consistency and latency. 

By default, Cassandra provides eventual consistency. Simply stated, this means that a read that follows a write is not guaranteed to return the latest data. Once the data is finished replicating, consistency is restored. At least until the next write.  

However, it offers “tunable” consistency. Developers can change (“tune”)  this default behavior at the query level to increase data consistency at the expense of availability. In an extreme case, a developer could make Cassandra behave like SQL Server.
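
One common way to reason about tunable consistency is the overlap rule: a read is guaranteed to see the latest acknowledged write when the number of replicas read plus the number of replicas written exceeds the replication factor. The helper names below are hypothetical, and the sketch only does the arithmetic.

```python
def quorum(replication_factor):
    """Number of replicas that make up a majority of the replicas."""
    return replication_factor // 2 + 1


def reads_see_latest_write(read_replicas, write_replicas, replication_factor):
    """True when every read set must overlap every write set (R + W > RF)."""
    return read_replicas + write_replicas > replication_factor


rf = 3
# Quorum writes and quorum reads overlap on at least one replica.
strong = reads_see_latest_write(quorum(rf), quorum(rf), rf)
# Reading and writing a single replica each gives only eventual consistency.
eventual = reads_see_latest_write(1, 1, rf)
```

Requiring more replicas to respond raises consistency but means a request fails when too few replicas are reachable, which is the availability cost mentioned above.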

Cassandra does not guarantee isolation or atomicity. If 2 processes update the same data at the same time, Cassandra follows the principle of “last write wins”. Exceptions to this behavior can be found in using the BATCH command and lightweight transactions. However, there is a performance penalty when either of these commands is used.
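
“Last write wins” can be illustrated with a small sketch in which every written value carries a write timestamp and the newer one survives. The `Cell` class and `reconcile` function are hypothetical names, and breaking timestamp ties by comparing values is an assumption made to keep the example deterministic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cell:
    value: str
    write_timestamp: int  # e.g. microseconds since the epoch


def reconcile(a, b):
    """Keep the cell with the newer timestamp; ties fall back to the larger value."""
    return max(a, b, key=lambda cell: (cell.write_timestamp, cell.value))


first = Cell("shipped", write_timestamp=1_000)
second = Cell("cancelled", write_timestamp=1_005)

winner = reconcile(first, second)
```

Note that the order in which the two updates arrive does not matter, only their timestamps, so the earlier update is silently discarded.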


Cassandra stores data in a table organized into rows and columns: 

  • A row is a collection of one or more columns uniquely identified by a primary key
  • A column is a mapping of the column name (the key), the key value, and metadata such as creation time or expiration date. 

Primary Key | Columns
            | Col1      | Col2      | Col3
            | Value1    | Value2    | Value3
            | Metadata1 | Metadata2 | Metadata3

You must define a schema (column names and data types) for a table with the CREATE TABLE command. Cassandra does not support column constraints. 

It is not necessary for a row to contain all of the defined columns. If data does not exist for a column, the practice in Cassandra is to not set a NULL value. 

Primary Key | Columns
            | Col1      | Col3
            | Value1    | Value3
            | Metadata1 | Metadata3

A Primary Key Has an Expanded Role

The way primary keys work in Cassandra is an important concept to grasp and it works somewhat differently than in SQL Server.  

The primary key has 2 parts: the partition key and the clustering key. The partition key is mandatory. The nearest equivalent is the clustered primary key. The entire primary key uniquely identifies a row. However, the primary key does not enforce the uniqueness constraint. If a duplicate key is entered, the existing row is replaced.

The goal of the partition key is to evenly distribute data across the cluster and to query the data efficiently. A partition key that does not do this severely affects cluster performance and scalability.

The clustering key physically sorts rows within a partition. It determines and fixes the sort order available for queries. The order cannot be changed by using the ORDER BY in your query.

The primary key cannot be dropped or altered after table creation. The only way to resolve problems stemming from poor design choices is to create a new table with a better key and then copy the data from the old table.  
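
The two behaviors described in this section, rows kept sorted by clustering key inside a partition and a duplicate primary key replacing the existing row, can be modeled with a toy table. The `ToyTable` class is a hypothetical in-memory stand-in, not how Cassandra stores data.

```python
import bisect


class ToyTable:
    """Partition key -> list of (clustering key, row), kept sorted by clustering key."""

    def __init__(self):
        self._partitions = {}

    def insert(self, partition_key, clustering_key, row):
        rows = self._partitions.setdefault(partition_key, [])
        keys = [ck for ck, _ in rows]
        i = bisect.bisect_left(keys, clustering_key)
        if i < len(rows) and rows[i][0] == clustering_key:
            # Same primary key: the existing row is replaced, no error is raised.
            rows[i] = (clustering_key, row)
        else:
            # New primary key: the row is placed in clustering-key order.
            rows.insert(i, (clustering_key, row))

    def select(self, partition_key):
        """Return the partition's rows in their fixed clustering order."""
        return [row for _, row in self._partitions.get(partition_key, [])]


readings = ToyTable()
readings.insert("sensor-1", 3, "21.5 C")
readings.insert("sensor-1", 1, "20.0 C")
readings.insert("sensor-1", 3, "22.0 C")  # duplicate primary key ("sensor-1", 3)
```

Selecting the "sensor-1" partition returns two rows, ordered by clustering key, with the later write for key 3 in place of the earlier one.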

Data Types

There is some overlap between SQL Server and Cassandra. Many of the common data types for integer and decimal values are found in CQL. All text strings are defined by the text type.  There are no char or varchar types. All binary data is defined by the blob type. 

XML data is not supported.

For a complete list of available data types, look here.

Table Features Not Supported by Cassandra

  • Constraints 
  • Computed columns  
  • The NULL property
  • Foreign Keys
  • Encryption
  • Identity property.  The CQL Counter type is the closest analogy but can be used only on counter tables.
  • XML data types and functions. 

The Memtable and SSTables

The write path in Cassandra somewhat resembles that of SQL Server. The data is first written to an on-disk commitlog and then to an on-heap structure called a memtable and, when that is full, the contents are flushed to disk in the form of a Sorted String Table (SSTable).

The contents of an SSTable are never overwritten. 
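
A minimal sketch of this write path, assuming a tiny flush threshold and leaving out the commitlog entirely, might look like the following. `TinyLSM` is a hypothetical class for illustration only.

```python
class TinyLSM:
    """Toy log-structured store: a mutable memtable plus immutable SSTables."""

    def __init__(self, memtable_limit=2):
        self.memtable = {}
        self.sstables = []  # oldest first; each is an immutable sorted tuple
        self.memtable_limit = memtable_limit

    def write(self, key, value):
        # Writes only touch memory; nothing on "disk" is read or modified.
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # A full memtable becomes a new sorted, immutable SSTable.
        if self.memtable:
            self.sstables.append(tuple(sorted(self.memtable.items())))
            self.memtable = {}

    def read(self, key):
        # Newest data wins: check the memtable, then SSTables newest to oldest.
        if key in self.memtable:
            return self.memtable[key]
        for sstable in reversed(self.sstables):
            for k, v in sstable:
                if k == key:
                    return v
        return None


db = TinyLSM(memtable_limit=2)
db.write("a", 1)
db.write("b", 2)   # memtable is full, so it is flushed to an SSTable
db.write("a", 3)   # the newer value lives in the memtable; the SSTable is untouched
```

The old value of "a" is still present in the first SSTable, which is why reads must consult the newest data first.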

CQL Versus T-SQL

The syntax of Cassandra Query Language is very similar to Transact-SQL. However, there are some familiar commands that behave differently. 

  • Object names in CQL are case-sensitive only when enclosed in double quotes (unquoted names are folded to lowercase), while DDL and DML commands are not case-sensitive. 
  • The DELETE command does not remove data from the table.  Instead, the command inserts a special marker called a tombstone to indicate which data is being deleted.  Deleted data will be removed at a later time during the compaction process. 
  • UPDATE will insert a new record that contains the columns which were modified. Data is never overwritten in Cassandra.
  • Filtering Data with the WHERE clause is restricted to using primary key columns and must conform to a number of restrictions. For example, the partition keys allow only equality searches. 
  • The ORDER BY clause can use only clustering key columns.
  • CREATE FUNCTION defines a user-defined function. By default, Cassandra supports defining functions in Java and JavaScript. 
  • CREATE INDEX creates a secondary index on a non-key column. Do not use them as you would a non-clustered index. They do not perform well and should be used sparingly and only in limited use cases.
  • CREATE VIEW creates a materialized view. Originally intended as a method to replace data redundancy, the feature has numerous problems and is disabled by default in Cassandra 4. Do not use it in a production cluster.
  • CREATE TRIGGER supports the Java language.
  • ALTER TABLE cannot alter PRIMARY KEY columns.
  • The standard aggregate functions of min, max, avg, sum, and count are built-in functions.  Cassandra will generate a warning message when an aggregate function is used with no WHERE clause. If this is done on a very large table, there is the possibility that one or more nodes will crash. 
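
The DELETE behavior in the list above can be sketched as follows: a delete appends a tombstone marker rather than removing anything, reads treat the marker as “no data”, and a later compaction drops both the marker and the values it shadows. The names here are hypothetical, and the sketch drops tombstones immediately at compaction, whereas Cassandra keeps them until the table's gc_grace_seconds period has passed.

```python
TOMBSTONE = object()  # marker written in place of a deleted value


def read(sstables, key):
    """Return the newest value for key, or None if it is missing or deleted."""
    for sstable in reversed(sstables):  # newest SSTable first
        if key in sstable:
            value = sstable[key]
            return None if value is TOMBSTONE else value
    return None


def compact(sstables):
    """Merge SSTables (oldest first) into one, discarding deleted keys."""
    merged = {}
    for sstable in sstables:
        merged.update(sstable)  # newer entries overwrite older ones
    return {k: v for k, v in sorted(merged.items()) if v is not TOMBSTONE}


sstables = [
    {"a": 1, "b": 2},   # original inserts
    {"a": TOMBSTONE},   # DELETE of "a" is recorded as a new entry
    {"b": 5},           # UPDATE of "b" is recorded as a new entry
]

compacted = compact(sstables)
```

Until compaction runs, every read of "a" still has to step over the tombstone, which is how excess tombstones end up hurting read latency.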

Below is a quick guide to how SQL Server objects map to Cassandra.

Table 1: SQL Server Object Mapping to CQL

SQL Server Object | Cassandra Object | CQL Command
Server or instance | Node | None – see notes
Database/schema | Keyspace | CREATE KEYSPACE
Table | Table (originally called Column Family) |
Primary Key | Clustered Index | See the section on the primary key
Clustered Index | Clustering Key | See the section on the primary key
Foreign Key | Not Available | None
Nonclustered Index | Secondary Index, SASI Index |
Stored Procedure | Not Available |
User Defined Function | User-defined Functions | CREATE FUNCTION
View | Materialized Views (experimental, not for production) |
Trigger | Triggers | CREATE TRIGGER (has been deprecated)
Dynamic Management Views | Virtual Tables (only in Cassandra 4) |
Login / Roles | Roles | CREATE ROLE
Permissions | GRANT | GRANT

Concluding Thoughts

If there is one thing you must get right when using Cassandra, it is designing the primary key—particularly the partition key. Instaclustr offers a variety of services that can help the first-time Cassandra user. To see how we can help the first-time Cassandra developer, please see our pages on Cassandra consulting services and training.

The post Apache Cassandra for the SQL Server Developer appeared first on Instaclustr.

Join in and Learn: Cassandra Summit Registration is Open!

Last week was a pretty amazing one for the Apache Cassandra® community.  First of all, among all the hype of the AWS re:Invent conference, bigDATAwire (formerly known as Datanami) announced its Readers’ Choice awards—and Cassandra was chosen as a top three “data and AI open source project to...