ScyllaDB in 2023: Incremental Changes are Just the Tip of the Iceberg
This article was written by tech journalist George Anadiotis.
Is incremental change a bad thing? The answer, as with most things in life, is “it depends.” In the world of technology specifically, the balance between innovation and tried-and-true concepts and solutions seems to have tipped in favor of the former. Or at least, that’s the impression the headlines give. Good thing there’s more to life than headlines.
Innovation does not happen overnight, and is not applied overnight either. In most creative endeavors, teams work relentlessly for long periods until they are ready to share their achievements with the world. Then they go back to their garage and keep working until the next milestone is achieved. If we were to peek in the garage intermittently, we’d probably call what we’d see most of the time “incremental change.”
The ScyllaDB team works with its garage doors up and is not necessarily after making headlines. The team believes that incremental change is nothing to shun if it leads to steady progress. Compared to the release of ScyllaDB 5.0 at ScyllaDB Summit 2022, “incremental change” could be the theme of ScyllaDB Summit 2023 in February. But this is just the tip of the iceberg, as there’s more here than meets the eye.
I caught up with ScyllaDB CEO and co-founder Dor Laor to discuss what kept the team busy in 2022, how people are using ScyllaDB, as well as trends and tradeoffs in the world of high-performance compute and storage.
Note: In addition to reading the article, you can hear the complete conversation in this podcast:
Data is Going to the Cloud in Real Time, and so is ScyllaDB
The ScyllaDB team have their ears tuned to what their clients are doing with their data. What I noted in 2022 was that data is going to the cloud in real time, and so is ScyllaDB 5.0. Following up, I wondered whether those trends have kept pace with the way they manifested previously.
The answer, Laor confirmed, is simple: absolutely yes. ScyllaDB Cloud, the company’s database-as-a-service, has been growing over 100% year over year in 2022. In just three years since its introduction in 2019, ScyllaDB Cloud is now the major source of revenue for ScyllaDB, exceeding 50%.
“Everybody needs to have the core database, but the service is the easiest and safest way to consume it. This theme is very strong not just with ScyllaDB, but also across the board with other databases and sometimes beyond databases with other types of infrastructure. It makes lots of sense,” Laor noted.
Similarly, ScyllaDB’s support for real-time updates via its change data capture (CDC) feature is seeing lots of adoption. All CDC events go to a table that can be read like a regular table. Laor noted that this makes CDC easy to use, including in conjunction with the Kafka connector. Furthermore, CDC opens the door to another possibility: using ScyllaDB not just as a database, but also as an alternative to Kafka.
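For readers curious what this looks like in practice, here is a minimal CQL sketch; the keyspace, table, and column names are hypothetical, not from the interview. Enabling CDC on a table makes ScyllaDB maintain a companion log table that can be queried like any other table.

```sql
-- Illustrative sketch (names are hypothetical): enable CDC on a table.
CREATE TABLE shop.orders (
    order_id uuid PRIMARY KEY,
    status   text
) WITH cdc = {'enabled': true};

-- Change events land in an auto-created log table named
-- <table>_scylla_cdc_log; consumers (such as the Kafka connector)
-- simply read from it like a regular table.
SELECT * FROM shop.orders_scylla_cdc_log;
```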
“It’s not that ScyllaDB is a replacement for Kafka. But if you have a database plus Kafka stack, there are cases that instead of queuing stuff in Kafka and pushing them to the database, you can just also do the queuing within the database itself,” Laor said.
This is not because Kafka is bad per se. The motivation here is to reduce the number of moving parts in the infrastructure. Palo Alto Networks did this and others are following suit too. Numberly is another example in which ScyllaDB was used to replace both Kafka and Redis.
High Performance and Seamless Migration
Numberly was one of the many use cases presented at ScyllaDB Summit 2023. Others included the likes of Discord, Epic Games, Optimizely, ShareChat and Strava. Browsing through them, two key themes emerge: high performance and migration.
Migration is a typical adoption path for ScyllaDB, as Laor shared. Many users come to ScyllaDB from other databases in search of scalability. Because ScyllaDB sees a lot of migrations, it offers two compatible APIs, one for Cassandra and one for DynamoDB. There are also several migration tools, such as the Spark Migrator, which scans the source database and writes to the target database. CDC may also help there.
While each migration has its own intricacies, when organizations like Discord or ShareChat migrate to ScyllaDB, it’s all about scale. Discord migrated trillions of messages. ShareChat migrated dozens of services. Things can get complicated and users will make their own choices. Some users rewrite their stack without keeping API compatibility, or even rewrite parts of their codebase in another programming language like Go or Rust.
Either way, ScyllaDB is used to dealing with this, Laor said. Another thing ScyllaDB is used to dealing with is delivering high performance. After all, this was the premise it was built on. Epic Games, Optimizely and Strava all presented high-performance use cases. Laor pointed out that as an avid gamer and mountain bike rider, having ScyllaDB be part of the Epic Games stack and power Strava was gratifying.
Epic Games is the creator of the Unreal game engine, whose evolution reflects the way modern software has evolved. Back in the day, using the Unreal engine was as simple as downloading a single binary file. Nowadays, everything is distributed. Epic Games works with game makers by providing a reference architecture and a recommended stack, and makers choose how to consume it. ScyllaDB is used as a distributed cache in this stack, providing fast access to objects stored in AWS S3.
A Sea of Storage, Raft and Serverless
ScyllaDB is increasingly being used as a cache. For Numberly, that happened because ScyllaDB does not need an external cache, which made Redis obsolete. For Epic Games, the need was to add a fast serving layer on top of S3.
S3 works great and is elastic and economical, but if your application has stringent latency requirements, then you need a cache, Laor pointed out. This is something a lot of people in the industry are aware of, including ScyllaDB engineering. As Laor shared, there is an ongoing R&D effort in ScyllaDB to use S3 for storage too. As he put it:
“S3 is a sea of extremely cheap storage, but it’s also slow. If you can marry the two, S3 and fast storage, then you manage to break the relationship between compute and storage. That gives you lots of benefits, from extreme flexibility to lower TCO.”
This is a key project that ScyllaDB’s R&D is working on these days, but not the only one. Yaniv Kaul, who just joined ScyllaDB as vice president of R&D coming from Red Hat, where he was senior director of engineering, has a lot to keep him busy. The team is growing, and recently ScyllaDB held its R&D Summit bringing everyone together to discuss what’s next.
ScyllaDB comes in two flavors, open source and enterprise. There are not many differences between the two, primarily security features and a couple of performance- and TCO (total cost of ownership)-related features. However, the enterprise version also comes with two and a half years of support, and it is the one the DBaaS offering is based on. The current open source version is 5.1, while the current enterprise version is 2022.2.
In the upcoming open source version 5.2, ScyllaDB will have consistent transactional schema operations based on Raft. In the next release, 5.3, transactional topology changes will also be supported. Strong metadata consistency is essential for sophisticated users who programmatically scale the cluster and issue Data Definition Language (DDL) operations.
In addition, these changes will enable ScyllaDB to pass the Jepsen test with many topology and schema changes. More importantly, this paves the way toward changes in the way ScyllaDB shards data, making it more dynamic and leading to better load balancing.
Many of these efforts come together in the push toward serverless. There is a big dedicated team in ScyllaDB working on Kubernetes, which is already used in the DBaaS offering. This work will also be leveraged in the serverless offering. This is a major direction ScyllaDB is headed toward and a cross-team project. A free trial based on serverless was made available at the ScyllaDB Summit and will become generally available later this year.
P99, Rust and Beyond
Laor does not live in a ScyllaDB-only world. Being someone with a hypervisor and Red Hat Linux background, he appreciates the nuances of P99. P99 latency is the 99th latency percentile: 99% of requests will be faster than the given latency number, and only 1% of requests will be slower than your P99 latency.
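To make the definition concrete, here is a small Python sketch computing P99 from a set of latency samples using the nearest-rank method; production monitoring systems typically use histograms instead, and the sample values below are invented for illustration.

```python
# Estimate the P99 latency from a list of request latencies (in ms).
# Nearest-rank method: the smallest value such that at least 99%
# of the samples are less than or equal to it.
def p99(latencies):
    ordered = sorted(latencies)
    rank = max(1, -(-len(ordered) * 99 // 100))  # ceil(0.99 * n)
    return ordered[rank - 1]

# 1,000 samples: 995 fast requests (1-5 ms) and 5 slow outliers (500 ms).
samples = [1, 2, 3, 4, 5] * 199 + [500] * 5
print(p99(samples))  # → 5
```

Note how the handful of 500 ms outliers fall into the slowest 1% and therefore do not affect P99; they would, however, dominate P99.9.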
P99 CONF is also the name of an event on all things performance, organized by the ScyllaDB team but certainly not limited to them. P99 CONF is for developers who obsess over P99 percentiles and high-performance, low-latency applications. It’s where developers can get down in the weeds, share their knowledge and focus on performance in real time.
P99 CONF is not necessarily a ScyllaDB event or a database event. Laor said that participants are encouraged to present databases, including competitors, as well as talk about operating systems development, programming languages, special libraries and more. It’s not even just about performance. It’s also about performance predictability, availability and tradeoffs. In his talk, Laor emphasized that it’s expensive to have a near-perfect system, and not everybody needs a near-perfect system all the time.
Among many distinguished P99 CONF speakers, one who stood out was Bryan Cantrill. Cantrill has had stints at Sun Microsystems, Oracle and Joyent, and has a loyal following among developers and beyond. In his P99 CONF 2021 talk, Cantrill shared his experience and thinking on “Rust, Wright’s Law and The Future of Low-Latency Systems.” In it, Cantrill praised Rust as being everything he has come to expect from C and C++ and then some.
ScyllaDB’s original premise was to be a faster implementation of the Cassandra API, written in C++ as opposed to Java. The Seastar library that was developed as part of this effort has taken on a life of its own and is attracting users and contributors far and wide (such as Redpanda and Ceph). Dor weighed in on the “what is the best programming language today” conversation, always a favorite among developers.
Although ScyllaDB has invested heavily in its C++ codebase and perfected it over time, Laor also gives Rust a lot of credit. So much, in fact, that he said it’s quite likely that if they were to start the implementation effort today, they would have done it using Rust. Not so much for performance reasons, but more for the ease of use. In addition, many ScyllaDB users like Discord and Numberly are moving to Rust.
Even though a codebase that has stood the test of time is not something any wise developer would want to get rid of, ScyllaDB is embracing Rust too. Rust is the language of choice for the new generation of ScyllaDB’s drivers. As Laor explained, going forward the core of ScyllaDB’s drivers will be written in Rust. From that, other language-specific versions will be derived. ScyllaDB is also embracing Go and Wasm for specific parts of its codebase.
To come full circle, there’s a lot of incremental change going on. Perhaps if the garage door weren’t up, and we only got to see the finished product in carefully orchestrated demos, those changes would make more of an impression. Apparently, that’s not what matters most for Laor and ScyllaDB.
What’s Trending on the ScyllaDB Community Forum: March 2023
In case you haven’t heard, ScyllaDB recently launched a new community forum where our users can connect and learn from one another’s experiences. It’s a great place to:
- Get quick access to the most common getting started questions
- Troubleshoot any issues you come across
- Engage in in-depth discussions about new features, configuration tradeoffs, and deployment options
- Search the archives to see how your peers are setting up similar integrations (e.g., ScyllaDB + JanusGraph + Tinkerpop)
- Propose a new topic for us to cover in ScyllaDB University
- Share your perspective on a ScyllaDB blog, ask questions about on-demand videos, or tell us more about what types of resources your team is looking for
- Engage with the community, share how you’re using ScyllaDB, what you learned along the way, and get ideas from your peers
Say hello & share how you’re using ScyllaDB
When Should You Use The Forum And When Should You Use Slack?
The forum is better for deeper technical issues and long-form articles. Its discussions are searchable and last forever.
Slack is better for short, quick answers and real-time chat interactions. If the question would benefit other community members over time, it’s better to ask on the forum. If you prefer to chat directly with peers and experts, or have a real-time discussion, Slack is the best option. You’re encouraged to look for the answer on the forum before posting a new question.
Behavior Guidelines and Best Practices
We emphasize the importance of respectful communication and consideration for others. Harassment of any kind is strictly prohibited, and participants are encouraged to report any misconduct immediately. Some best practices include:
- Be respectful
- Communicate and collaborate
- Be kind
- Report any misconduct
The forum is a resource for community members to help each other. While ScyllaDB employees may participate and offer their insights, they are not obligated to do so.
Recent Trending Discussions on the Community Forum
Here’s a look at some of the forum’s recent trending discussions.
Is ScyllaDB Right for My Application? ScyllaDB’s Sweet Spot
Discussion recap: ScyllaDB is a NoSQL database that is designed to handle large amounts of data and high throughput. The sweet spot for using ScyllaDB is when an application requires low latency, high availability, and the ability to handle large amounts of data. ScyllaDB is particularly useful for applications that require real-time analytics, fast data ingestion, and low-latency queries.
The post provides examples of use cases where ScyllaDB is a good fit, such as online gaming, AdTech, and financial services. However, the post also notes that ScyllaDB might not be the best choice for applications that require strong consistency (ACID) or have low data volumes.
Replication Strategy Change from Simple to Network
Discussion recap: This thread discusses the process of changing the replication strategy of a ScyllaDB cluster from SimpleStrategy to NetworkTopologyStrategy. It explains the benefits of using NetworkTopologyStrategy for more complex deployments and provides step-by-step instructions for making the switch. The discussion also highlights potential issues that might arise during the migration process and provides recommendations for avoiding them.
Knowing ScyllaDB Data Modeling and Limitations: data per row, number of index columns per table, consistency.
Discussion recap: The article explores some of the data modeling recommendations for working with ScyllaDB. The recommendations discussed include the use of Materialized Views, Indexing, the need for careful tuning of compaction and repair processes, the impact of large partitions on performance, and the challenges of scaling write-heavy workloads. It suggests using denormalization and careful monitoring of the repair process to optimize performance, and considering data modeling changes to reduce the impact of large partitions. Additionally, it suggests using asynchronous writes and optimizing the write path for write-heavy workloads. Finally, it suggests evaluating the trade-offs between data consistency and availability when designing a ScyllaDB deployment.
What table options should be used for fast writes & reads and no deletes
Discussion recap: This thread covers ScyllaDB options that can be used to optimize performance for the above use case (fast writes and reads, no deletes). Using TimeWindowCompactionStrategy (TWCS) together with a table TTL of X days, instead of periodically dropping and recreating the table, ensures that tombstones won’t be an issue: ScyllaDB effectively throws the expired data away thanks to the TTL and the time windows. If there are no updates and no deletes, this is a perfect TTL + TWCS situation, assuming you really want to drop all your data older than, say, two days.
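As a hedged illustration of the TTL + TWCS setup described above (the keyspace, table, column names and retention numbers are illustrative, not from the thread):

```sql
-- Illustrative sketch: time-series events kept for 2 days, with TWCS
-- windows aligned to the TTL so whole SSTables expire together and
-- tombstones never need to be compacted away.
CREATE TABLE metrics.events (
    sensor_id uuid,
    ts        timestamp,
    value     double,
    PRIMARY KEY (sensor_id, ts)
) WITH default_time_to_live = 172800        -- 2 days, in seconds
  AND compaction = {
    'class': 'TimeWindowCompactionStrategy',
    'compaction_window_unit': 'DAYS',
    'compaction_window_size': 1
  };
```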
Is there an API for ScyllaDB Nodetool?
Discussion recap: The ScyllaDB server indeed has a REST API. You can find a Swagger UI when you start the ScyllaDB server as well: https://docs.scylladb.com/operating-scylla/rest/
But, please note that nodetool, for example, does not use the API directly. Instead, it talks to the ScyllaDB JMX proxy, which is a Java process that implements Cassandra-compatible JMX API. You can still use the REST API directly, but you have to figure out the mapping between the JMX operations and the REST API yourself.
Building a Media Understanding Platform for ML Innovations
By Guru Tahasildar, Amir Ziai, Jonathan Solórzano-Hamilton, Kelli Griggs, Vi Iyengar
Netflix leverages machine learning to create the best media for our members. Earlier we shared the details of one of these algorithms, introduced how our platform team is evolving the media-specific machine learning ecosystem, and discussed how data from these algorithms gets stored in our annotation service.
Much of the ML literature focuses on model training, evaluation, and scoring. In this post, we will explore an understudied aspect of the ML lifecycle: integration of model outputs into applications.
Specifically, we will dive into the architecture that powers search capabilities for studio applications at Netflix. We discuss specific problems that we have solved using Machine Learning (ML) algorithms, review different pain points that we addressed, and provide a technical overview of our new platform.
At Netflix, we aim to bring joy to our members by providing them with the opportunity to experience outstanding content. There are two components to this experience. First, we must provide the content that will bring them joy. Second, we must make it effortless and intuitive to choose from our library. We must quickly surface the most stand-out highlights from the titles available on our service in the form of images and videos in the member experience.
Here is an example of such an asset created for one of our titles:
These multimedia assets, or “supplemental” assets, don’t just come into existence. Artists and video editors must create them. We build creator tooling to enable these colleagues to focus their time and energy on creativity. Unfortunately, much of their energy goes into labor-intensive pre-work. A key opportunity is to automate these mundane tasks.
Use case #1: Dialogue search
Dialogue is a central aspect of storytelling. One of the best ways to tell an engaging story is through the mouths of the characters. Punchy or memorable lines are a prime target for trailer editors. The manual method for identifying such lines is a watchdown (aka breakdown).
An editor watches the title start-to-finish, transcribes memorable words and phrases with a timecode, and retrieves the snippet later if the quote is needed. An editor can choose to do this quickly and only jot down the most memorable moments, but will have to rewatch the content if they miss something they need later. Or, they can do it thoroughly and transcribe the entire piece of content ahead of time. In the words of one of our editors:
Watchdowns / breakdown are very repetitive and waste countless hours of creative time!
Scrubbing through hours of footage (or dozens of hours if working on a series) to find a single line of dialogue is profoundly tedious. In some cases editors need to search across many shows and manually doing it is not feasible. But what if scrubbing and transcribing dialogue is not needed at all?
Ideally, we want to enable dialogue search that supports the following features:
- Search across one title, a subset of titles (e.g. all dramas), or the entire catalog
- Search by character or talent
- Multilingual search
Use case #2: Visual search
A picture is worth a thousand words. Visual storytelling can help make complex stories easier to understand, and as a result, deliver a more impactful message.
Artists and video editors routinely need specific visual elements to include in artworks and trailers. They may scrub for frames, shots, or scenes of specific characters, locations, objects, events (e.g. a car chasing scene in an action movie), or attributes (e.g. a close-up shot). What if we could enable users to find visual elements using natural language?
Here is an example of the desired output when the user searches for “red race car” across the entire content library.
Use case #3: Reverse shot search
Natural-language visual search offers editors a powerful tool. But what if they already have a shot in mind, and they want to find something that just looks similar? For instance, let’s say that an editor has found a visually stunning shot of a plate of food from Chef’s Table, and she’s interested in finding similar shots across the entire show.
Prior engineering work
Approach #1: on-demand batch processing
Our first approach to surface these innovations was a tool to trigger these algorithms on-demand and on a per-show basis. We implemented a batch processing system for users to submit their requests and wait for the system to generate the output. Processing took several hours to complete. Some ML algorithms are computationally intensive. Many of the samples provided had a significant number of frames to process. A typical 1 hour video could contain over 80,000 frames!
After waiting for processing, users downloaded the generated algo outputs for offline consumption. This limited pilot system greatly reduced the time spent by our users to manually analyze the content. Here is a visualization of this flow.
Approach #2: enabling online request with pre-computation
After the success of this approach, we decided to add online support for a couple of algorithms. For the first time, users were able to discover matches across the entire catalog, oftentimes finding moments they never even knew existed. They didn’t need any time-consuming local setup, and there were no delays since the data was already pre-computed.
The following quote exemplifies the positive reception by our users:
“We wanted to find all the shots of the dining room in a show. In seconds, we had what normally would have taken 1–2 people hours/a full day to do, look through all the shots of the dining room from all 10 episodes of the show. Incredible!”
Dawn Chenette, Design Lead
This approach had several benefits for product engineering. It allowed us to transparently update the algo data without users knowing about it. It also provided insights into query patterns and algorithms that were gaining traction among users. In addition, we were able to perform a handful of A/B tests to validate or negate our hypotheses for tuning the search experience.
Our early efforts to deliver ML insights to creative professionals proved valuable. At the same time we experienced growing engineering pains that limited our ability to scale.
Maintaining disparate systems posed a challenge. They were first built by different teams on different stacks, so maintenance was expensive. Whenever ML researchers finished a new algorithm they had to integrate it separately into each system. We were near the breaking point with just two systems and a handful of algorithms. We knew this would only worsen as we expanded to more use cases and more researchers.
The online application unlocked the interactivity for our users and validated our direction. However, it was not scaling well. Adding new algos and onboarding new use cases was still time-consuming and required the effort of too many engineers. These investments in one-to-one integrations were volatile, with implementation timelines varying from a few weeks to several months. Due to the bespoke nature of the implementation, we lacked catalog-wide searches for all available ML sources.
In summary, this model was a tightly-coupled application-to-data architecture, where machine learning algos were mixed with the backend and UI/UX software code stack. To address the variance in the implementation timelines we needed to standardize how different algorithms were integrated — starting from how they were executed to making the data available to all consumers consistently. As we developed more media understanding algos and wanted to expand to additional use cases, we needed to invest in system architecture redesign to enable researchers and engineers from different teams to innovate independently and collaboratively. Media Search Platform (MSP) is the initiative to address these requirements.
Although we were just getting started with media-search, search itself is not new to Netflix. We have a mature and robust search and recommendation functionality exposed to millions of our subscribers. We knew we could leverage learnings from our colleagues who are responsible for building and innovating in this space. In keeping with our “highly aligned, loosely coupled” culture, we wanted to enable engineers to onboard and improve algos quickly and independently, while making it easy for Studio and product applications to integrate with the media understanding algo capabilities.
Making the platform modular, pluggable and configurable was key to our success. This approach allowed us to keep the distributed ownership of the platform. It simultaneously enabled different specialized teams to contribute relevant components of the platform. We used services already available for other use cases and extended their capabilities to support new requirements.
Next we will discuss the system architecture and describe how different modules interact with each other for end-to-end flow.
Netflix engineers strive to iterate rapidly and prefer the “MVP” (minimum viable product) approach to receive early feedback and minimize the upfront investment costs. Thus, we didn’t build all the modules completely. We scoped the pilot implementation to ensure immediate functionalities were unblocked. At the same time, we kept the design open enough to allow future extensibility. We will highlight a few examples below as we discuss each component separately.
Interfaces - API & Query
Starting at the top of the diagram, the platform allows apps to interact with it using either gRPC or GraphQL interfaces. Having diversity in the interfaces is essential to meet the app developers where they are. At Netflix, gRPC is predominantly used in backend-to-backend communication. With active GraphQL tooling provided by our developer productivity teams, GraphQL has become a de facto choice for UI-to-backend integration. You can find more about what the team has built and how it is getting used in these blog posts. In particular, we have been relying on the Domain Graph Service Framework for this project.
During the query schema design, we accounted for future use cases and ensured that the schema would allow future extensions. We aimed to keep it generic enough to hide the implementation details of the actual search systems used to execute the query. Additionally, it is intuitive and easy to understand, yet feature-rich, so that it can be used to express complex queries. Users have the flexibility to perform multimodal search, with the input being a simple text term, an image or a short video. As discussed earlier, search could be performed against the entire Netflix catalog, or it could be limited to specific titles. Users may prefer results that are organized in some way, such as grouped by movie or sorted by timestamp. When there are a large number of matches, we allow users to paginate the results (with a configurable page size) instead of fetching all or a fixed number of results.
The client-generated input query is first given to the query processing system. Since most of our users perform targeted queries, such as searching for the dialogue “friends don’t lie” (from the above example), today this stage performs lightweight processing and provides a hook to integrate A/B testing. In the future, we plan to evolve it into a “query understanding system” that supports free-form searches, reducing the burden on users and simplifying client-side query generation.
The query processing stage modifies queries to match the target data set. This includes “embedding” transformation and translation. For queries against embedding-based data sources, it transforms the input, such as text or an image, into the corresponding vector representation. Each data source or algorithm could use a different encoding technique, so this stage ensures that the corresponding encoding is applied to the provided query. One example of why we need different encoding techniques per algorithm: an image consists of a single frame, while a video contains a sequence of frames, so each requires different processing.
With global expansion, we have users for whom English is not the primary language. All of the text-based models in the platform are trained on the English language, so we translate non-English text to English. Although the translation is not always perfect, it has worked well in our case and has expanded the eligible user base for our tool to non-English speakers.
Once the query is transformed and ready for execution, we delegate search execution to one or more of the searcher systems. First we need to determine which query should be routed to which system. This is handled by the query router and searcher-proxy module. For the initial implementation, we relied on a single searcher for executing all the queries. Our extensible approach meant the platform could support additional searchers, which have already been used to prototype new algorithms and experiments.
A search may intersect or aggregate the data from multiple algorithms so this layer can fan out a single query into multiple search executions. We have implemented a “searcher-proxy” inside this layer for each supported searcher. Each proxy is responsible for mapping input query to one expected by the corresponding searcher. It then consumes the raw response from the searcher before handing it over to the Results post-processor component.
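The fan-out pattern described above might be sketched like this. It is a hypothetical illustration rather than the actual platform code: the searcher names, result shapes, and the naive merge are all invented, and a real system would re-rank in the post-processing layer.

```python
# Hypothetical sketch of query fan-out: a gateway maps one query into
# per-searcher requests via proxies, then merges the raw responses.
from typing import Dict, List

class SearcherProxy:
    """Adapts the platform query to one searcher's expected input."""
    def __init__(self, name: str, index: Dict[str, List[str]]):
        self.name = name
        self.index = index  # stand-in for a real search backend

    def search(self, query: str) -> List[str]:
        return self.index.get(query, [])

def fan_out(query: str, proxies: List[SearcherProxy]) -> List[str]:
    # Execute the query against every searcher and merge the results,
    # de-duplicated, preserving first-seen order.
    merged: List[str] = []
    for proxy in proxies:
        for hit in proxy.search(query):
            if hit not in merged:
                merged.append(hit)
    return merged

dialogue = SearcherProxy("dialogue", {"red race car": ["ep1@00:12"]})
visual = SearcherProxy("visual", {"red race car": ["ep1@00:12", "ep3@07:40"]})
print(fan_out("red race car", [dialogue, visual]))
```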
The results post-processor works on the results returned by one or more searchers. It can rank results by applying custom scoring and populate search recommendations based on other similar searches. Another functionality we are evaluating with this layer is dynamically creating different views from the same underlying data.
For ease of coordination and maintenance, we abstracted the query processing and response handling into a module called the Search Gateway.
As mentioned above, query execution is handled by the searcher system. The primary searcher used in the current implementation is called Marken, a scalable annotation service built at Netflix. It supports different categories of searches, including full-text and embedding-vector-based similarity searches. It can store and retrieve temporal (timestamp) as well as spatial (coordinate) data. The service leverages Cassandra and Elasticsearch for data storage and retrieval. When onboarding embedding vector data, we performed extensive benchmarking to evaluate the available datastores. One takeaway here is that even when there is a datastore that specializes in a particular query pattern, for ease of maintainability and consistency we decided not to introduce it.
We have identified a handful of common schema types and standardized how data from different algorithms is stored. Each algorithm still has the flexibility to define a custom schema type. We are actively innovating in this space and recently added the capability to intersect data from different algorithms. This will unlock creative ways of superimposing data from multiple algorithms on each other to quickly get to the desired results.
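One simple way to picture intersecting results from multiple algorithms is to intersect the matched entity ids and merge their per-algorithm scores. This is an illustrative sketch, not the platform's actual implementation:

```python
# Illustrative sketch: intersect matches from two algorithms by entity id.
def intersect_results(results_a, results_b):
    """Keep only entities matched by both algorithms, combining their scores."""
    by_id_b = {r["id"]: r for r in results_b}
    merged = []
    for r in results_a:
        other = by_id_b.get(r["id"])
        if other is not None:
            merged.append({"id": r["id"],
                           "scores": [r["score"], other["score"]]})
    return merged
```

A query like “shots where algorithm A found a beach AND algorithm B found a sunset” reduces to exactly this kind of id-level intersection.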
Algo Execution & Ingestion
So far we have focused on how the data is queried, but there is an equally complex machinery powering algorithm execution and the generation of the data. This is handled by our dedicated media ML Platform team, which specializes in building a suite of media-specific machine learning tooling. It facilitates seamless access to media assets (audio, video, image, and text) in addition to media-centric feature storage and compute orchestration.
For this project we developed a custom sink that indexes the generated data into Marken according to predefined schemas. Special care is taken when the data is backfilled for the first time so as to avoid overwhelming the system with huge amounts of writes.
Last but not least, our UI team has built a configurable, extensible library to simplify integrating this platform with end-user applications. A configurable UI makes it easy to customize query generation and response handling to the needs of individual applications and algorithms. Future work involves building native widgets to minimize the UI work even further.
The media understanding platform serves as an abstraction layer between machine learning algorithms and various applications and features. It has already allowed us to seamlessly integrate search and discovery capabilities into several applications. We believe that maturing its different parts will unlock value for more use cases and applications. We hope this post has offered insights into how we approached its evolution. We will continue to share our work in this space, so stay tuned.
Do these types of challenges interest you? If yes, we’re always looking for engineers and machine learning practitioners to join us.
Special thanks to Vinod Uddaraju, Fernando Amat Gil, Ben Klein, Meenakshi Jindal, Varun Sekhri, Burak Bacioglu, Boris Chen, Jason Ge, Tiffany Low, Vitali Kauhanka, Supriya Vadlamani, Abhishek Soni, Gustavo Carmo, Elliot Chow, Prasanna Padmanabhan, Akshay Modi, Nagendra Kamath, Wenbing Bai, Jackson de Campos, Juan Vimberg, Patrick Strawderman, Dawn Chenette, Yuchen Xie, Andy Yao, and Chen Zheng for designing, developing, and contributing to different parts of the platform.
Building a Media Understanding Platform for ML Innovations was originally published in Netflix TechBlog on Medium, where people are continuing the conversation by highlighting and responding to this story.
Inside ScyllaDB University LIVE: Q & A with Guy Shtub
ScyllaDB University LIVE is back in session! Whether you’re just curious about ScyllaDB or an experienced user looking to optimize your deployment, this event is a fast and focused way to learn strategies for getting the most out of ScyllaDB. You can mix and match sessions from two parallel tracks:
- ScyllaDB Essentials: ScyllaDB architecture, key components, and building your first ScyllaDB-powered app.
- Advanced Topics and Integrations: Deep dives into optimizing performance, ScyllaDB on Kubernetes, and advanced data modeling.
To give you a better understanding of what the event involves, we recently sat down for a conversation with ScyllaDB University’s fearless leader: Guy Shtub, Head of Training at ScyllaDB.
Register now to save your spot
How would you describe ScyllaDB University LIVE to someone who has never experienced it?
The event is a half-day of live, instructor-led training. It’s online and free. Our goal is to make the event as interactive as possible. The sessions are NOT pre-recorded, and the speakers include code examples in their talks.
How is it different than ScyllaDB University?
ScyllaDB University is a self-paced training and resource center. Users can take a single lesson or lab whenever they have some free time.
On the other hand, ScyllaDB University LIVE is first and foremost a live event, as the name suggests 🙂 It’s an interactive event that enables engagement with the speakers and with the larger ScyllaDB user community.
After the live event, we publish the slide decks and related resources in a special ScyllaDB University course that’s available to people who participated in the live event. That way, attendees can review the material afterward and get their own hands-on practice with the examples presented during the event.
What inspired ScyllaDB to create ScyllaDB University LIVE, and how has it evolved?
Before COVID-19, we had in-person training days. Since face-to-face events were suddenly on hold – but users still really wanted interactive events – we came up with the concept of doing a half-day, virtual, live training event…and ScyllaDB University LIVE was born.
Based on the feedback we received from the first event, plus the surprisingly high number of participants, we decided to hold the event every quarter.
Who generally attends, and what do they get out of it?
Usually most of the attendees are beginners: people who have heard about ScyllaDB and want to learn more about it. But each event also offers a whole track of sessions designed for more experienced users: people who are already using ScyllaDB but want to learn about more advanced topics and about integrations with other software, such as Kubernetes, Kafka, Spring Boot, and so on.
As far as roles, we see many application developers, DevOps engineers, and architects – as well as people in higher management interested in improving their database’s cost effectiveness.
Do you have any tips or advice for attendees to get the most out of the event?
Yes! Before the event, take some lessons at ScyllaDB University. ScyllaDB University is self-paced, completely free, and a centralized place to explore a broad range of core ScyllaDB topics. You can roll up your sleeves and “get your hands dirty” by running some of the labs and seeing how ScyllaDB actually runs.
What’s your favorite part of the event?
While I love the talks, my favorite part is the roundtable that occurs right after the three talks in each track conclude. This is a chance for attendees to ask the speakers, as well as other experts, any questions they have. Having Avi Kivity (our CTO and co-founder) and Dor Laor (our CEO and co-founder) on camera together usually makes for interesting times. 😉
Here’s an excerpt from a previous roundtable:
What are some future plans and ideas for ScyllaDB University LIVE?
We’ve had a lot of requests for a training event in a more Asia-friendly timezone. Another idea is to make the event even more interactive by having attendees run hands-on labs while we’re all online together. Right now, the instructor runs the labs during the event, then we provide attendees everything they need to repeat the labs on their own afterwards.
We’re always looking to improve and your feedback is very much appreciated. Please let us know what you’d like to see in future events on the community forum.
Join us for the next session of ScyllaDB University LIVE
Elasticsearch Indexing Strategy in Asset Management Platform (AMP)
By Burak Bacioglu, Meenakshi Jindal
Asset Management at Netflix
At Netflix, all of our digital media assets (images, videos, text, etc.) are stored in secure storage layers. We built an asset management platform (AMP), codenamed Amsterdam, in order to easily organize and manage the metadata, schema, relations and permissions of these assets. It is also responsible for asset discovery, validation, sharing, and for triggering workflows.
The Amsterdam service utilizes various solutions such as Cassandra, Kafka, Zookeeper, EVCache, etc. In this blog, we will focus on how we utilize Elasticsearch for indexing and searching assets.
Amsterdam is built on top of three storage layers.
The first layer, Cassandra, is our source of truth. It consists of close to a hundred tables (column families), the majority of which are reverse indices that help query the assets in a more optimized way.
The second layer is Elasticsearch, which is used to discover assets based on user queries. This is the layer we’d like to focus on in this blog: more specifically, how we index and query over 7TB of data in a read-heavy and continuously growing environment while keeping our Elasticsearch cluster healthy.
And finally, we have an Apache Iceberg layer which stores assets in a denormalized fashion to help answer heavy queries for analytics use cases.
Elasticsearch is one of the most widely adopted distributed, open source search and analytics engines for all types of data, including textual, numerical, geospatial, structured, and unstructured data. It provides simple APIs for creating indices and for indexing or searching documents, which makes it easy to integrate. Whether you use in-house deployments or hosted solutions, you can quickly stand up an Elasticsearch cluster and start integrating it from your application using one of the provided clients for your programming language (Elasticsearch supports a rich set of languages: Java, Python, .NET, Ruby, Perl, etc.).
One of the first decisions when integrating with Elasticsearch is designing the indices, their settings, and mappings. Settings include index-specific properties like the number of shards, analyzers, etc. Mappings define how documents and their fields are stored and indexed: you define the data type for each field, or use dynamic mapping for unknown fields. You can find more information on settings and mappings on the Elasticsearch website.
Most applications in content and studio engineering at Netflix deal with assets: videos, images, text, etc. These applications are built on a microservices architecture, and the Asset Management Platform provides asset management to dozens of these services across various asset types. Each asset type is defined in a centralized schema registry service responsible for storing asset type taxonomies and relationships. It therefore initially seemed natural to create a different index for each asset type. When creating index mappings in Elasticsearch, one has to define the data type for each field, and since different asset types could have fields with the same name but different data types, a separate index for each type prevents such type collisions. We therefore created around a dozen indices, one per asset type, with field mappings based on the asset type schema. As we onboarded new applications to our platform, we kept creating new indices for the new asset types. Our schema management microservice, which stores the taxonomy of each asset type, programmatically created a new index whenever a new asset type was registered. All assets of a specific type use that type’s index to create or update asset documents.
As Netflix is now producing significantly more originals than it was when we started this project a few years ago, not only did the number of assets grow dramatically but the number of asset types also grew from dozens to several thousand. Hence the number of Elasticsearch indices (one per asset type), as well as asset document indexing and search RPS (requests per second), grew over time. Although this indexing strategy worked smoothly for a while, interesting challenges started coming up and we began to notice performance issues: CPU spikes, long-running queries, and indices going yellow/red in status.
Usually the first thing to try is to scale the Elasticsearch cluster horizontally, by increasing the number of nodes, or vertically, by upgrading instance types. We tried both; in many cases this helps, but it is sometimes a short-term fix, and for us the performance problems came back after a while. When that happens, you know it is time to dig deeper and understand the root cause.
It was time to take a step back and reevaluate our ES data indexing and sharding strategy. Each index was assigned a fixed number of 6 shards and 2 replicas (defined in the index template). With the increase in the number of asset types, we ended up with approximately 900 indices (thus 16,200 shards). Some of these indices had millions of documents, whereas many were very small, with only thousands of documents. We found the root cause of the CPU spikes was unbalanced shard sizes. Elasticsearch nodes storing the large shards became hot spots, and queries hitting those instances were timing out or very slow due to busy threads.
We changed our indexing strategy and decided to create indices based on time buckets rather than asset types. What this means is that assets created between t1 and t2 go to the T1 bucket, assets created between t2 and t3 go to the T2 bucket, and so on. So instead of persisting assets based on their asset types, we use their ids (and thus their creation time, because an asset id is a time-based UUID generated at asset creation) to determine which time bucket the document should be persisted to. Elasticsearch recommends each shard to be under 65GB (AWS recommends under 50GB), so we could create time-based indices where each index holds somewhere between 16–20GB of data, leaving some buffer for data growth. Existing assets can be redistributed appropriately to these precreated indices, and new assets always go to the current index. Once the size of the current index exceeds a certain threshold (16GB), we create a new index for the next bucket (minute/hour/day) and start indexing assets to it. We created an index template in Elasticsearch so that new indices always use the same settings and mappings stored in the template.
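The rollover decision above can be sketched in a few lines. This is a simplified illustration, assuming the size-based trigger and the timestamp-derived index naming described in this post; the threshold constant and function name are made up:

```python
from datetime import datetime

SIZE_THRESHOLD_GB = 16  # roll over well before the ~50GB shard guidance

def maybe_roll_over(current_index, current_size_gb, now=None):
    """Return the index to write to: keep the current index while it is
    under the threshold, otherwise start a new time bucket named after
    its start time (yyyyMMddHHmmss)."""
    if current_size_gb < SIZE_THRESHOLD_GB:
        return current_index
    now = now or datetime.utcnow()
    return now.strftime("%Y%m%d%H%M%S")
```

Naming each bucket after its start time is what later lets a sorted list of index names double as a lookup structure for asset creation times.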
We chose to index all versions of an asset in the same bucket: the one that holds its first version. Therefore, even though new assets can never be persisted to an old index (due to our time-based id generation logic, they always go to the latest/current index), existing assets can be updated, causing additional documents for those new asset versions to be created in older indices. We therefore chose a lower rollover threshold so that older shards stay well under 50GB even after those updates.
For searching purposes, we have a single read alias that points to all indices. When performing a query, we always execute it against the alias. This ensures that, no matter where documents live, all documents matching the query are returned. For indexing and updating documents, though, we cannot use an alias; we use the exact index name to perform index operations.
To avoid querying Elasticsearch for the list of indices on every indexing request, we keep the list of indices in a distributed cache. We refresh this cache whenever a new index is created for the next time bucket, so that new assets are indexed appropriately. For every asset indexing request, we consult the cache to determine the corresponding time-bucket index for the asset. The cache stores all time-based indices in sorted order (for simplicity we named our indices after their start time, in the format yyyyMMddHHmmss), so we can easily determine exactly which index should be used based on the asset creation time. Without the time-bucket strategy, the same asset could have been indexed into multiple indices, because an Elasticsearch doc id is unique per index, not per cluster. Alternatively, we would have had to perform two API calls: first to identify the specific index, and then to perform the asset update/delete operation on it.
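Because the cached index names are bucket start times in sorted order, picking the index for an asset reduces to a binary search. A minimal sketch, with an illustrative set of bucket names:

```python
import bisect

# Cached, sorted list of time-bucket index names (bucket start times,
# formatted yyyyMMddHHmmss); the values here are illustrative.
indices = ["20220101000000", "20220301000000", "20220601000000"]

def index_for(creation_time_str, sorted_indices):
    """Find the time-bucket index whose start time is the latest one
    not after the asset's creation time."""
    pos = bisect.bisect_right(sorted_indices, creation_time_str)
    if pos == 0:
        raise ValueError("creation time precedes the first bucket")
    return sorted_indices[pos - 1]
```

The fixed-width timestamp format matters here: it makes lexicographic string order agree with chronological order, so no date parsing is needed.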
It is still possible to exceed 50GB in those older indices if millions of updates occur within that time-bucket index. To address this, we added an API that programmatically splits an old index into two. To split a given bucket T1 (which stores all assets between t1 and t2), we choose a time t1.5 between t1 and t2, create a new bucket T1_5, and reindex all assets created between t1.5 and t2 from T1 into the new bucket. While the reindexing is happening, queries/reads are still answered by T1, so any new document created (via asset updates) whose timestamp falls between t1.5 and t2 is dual-written into both T1 and T1_5. Finally, once the reindexing is complete, we enable reads from T1_5, stop the dual write, and delete the reindexed documents from T1.
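A minimal sketch of the write routing during such a split might look like this. The index names and the `reindex_done` flag are illustrative, not the actual implementation:

```python
def write_targets(doc_time, split_start, reindex_done):
    """Route a write for bucket T1 while it is being split at `split_start`
    (the t1.5 boundary). Documents in the second half are dual-written to
    T1 and T1_5 until the reindex completes; afterwards they go only to
    T1_5. Times are yyyyMMddHHmmss strings, so string comparison works."""
    if doc_time < split_start:
        return ["T1"]
    return ["T1_5"] if reindex_done else ["T1", "T1_5"]
```

The dual write is what keeps T1 answering reads consistently until the cutover, at which point only the new bucket receives the second half's writes.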
In fact, Elasticsearch provides an index rollover feature to handle the growing-index problem: https://www.elastic.co/guide/en/elasticsearch/reference/6.0/indices-rollover-index.html. With this feature, a new index is created when the current index size hits a threshold, and through a write alias, all future index calls point to the new index. However, this would create a problem for our update flow, because we would have to query multiple indices to determine which one contains a particular document before we could update it. Calls to Elasticsearch may not arrive in order: an asset a1 created at T1 can be indexed after another asset a2 created at T2, where T2 > T1, so the older asset a1 can end up in the newer index while the newer asset a2 is persisted in the old one. In our implementation, by contrast, simply looking at the asset id (and thus the asset creation time) tells us deterministically which index to go to.
One thing to mention: Elasticsearch has a default limit of 1000 fields per index. If we indexed all types into a single index, wouldn’t we easily exceed this number? And what about the data-type collisions we mentioned above? Having a single index for all data types could cause collisions when two asset types define different data types for the same field. We changed our mapping strategy to overcome these issues. Instead of creating a separate Elasticsearch field for each metadata field defined in an asset type, we created a single nested type with a mandatory field called `key`, which represents the name of the field on the asset type, and a handful of data-type-specific fields, such as `string_value`, `long_value`, `date_value`, etc. We populate the corresponding data-type-specific field based on the actual data type of the value. Below you can see a part of the index mapping defined in our template, and an example from a document (asset) which has four metadata fields:
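The mapping snippet from the original post is not reproduced in this text, but based on the description above, a sketch of the nested mapping and a sample document might look like the following. The sample document’s field names (other than `cameraId`, taken from the query example below) are hypothetical:

```python
# Sketch of the nested `metadata` mapping described above. Only the
# key/typed-value structure is from the post; details are assumptions.
mapping = {
    "properties": {
        "metadata": {
            "type": "nested",
            "properties": {
                "key":          {"type": "keyword"},
                "string_value": {"type": "keyword"},
                "long_value":   {"type": "long"},
                "date_value":   {"type": "date"},
            },
        }
    }
}

# An asset document with four metadata fields, each stored under the
# data-type-specific value field matching its actual type.
document = {
    "metadata": [
        {"key": "cameraId",     "long_value": 42323243},
        {"key": "locationName", "string_value": "stage-7"},
        {"key": "shotDate",     "date_value": "2022-06-01"},
        {"key": "colorSpace",   "string_value": "rec709"},
    ]
}
```

Because every asset type reuses the same four typed value fields, the mapping’s field count stays fixed regardless of how many asset types exist.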
As you see above, all asset properties go under the same nested field `metadata` with a mandatory `key` field, and the corresponding data-type specific field. This ensures that no matter how many asset types or properties are indexed, we would always have a fixed number of fields defined in the mapping. When searching for these fields, instead of querying for a single value (cameraId == 42323243), we perform a nested query where we query for both key and the value (key == cameraId AND long_value == 42323243). For more information on nested queries, please refer to this link.
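The nested query described above can be written out as an Elasticsearch request body. A sketch using the `metadata`/`key`/`long_value` naming from the mapping description:

```python
# Nested query equivalent to: key == "cameraId" AND long_value == 42323243.
# Both conditions sit inside one nested clause so they must match on the
# SAME metadata entry, not across different entries of the same document.
query = {
    "query": {
        "nested": {
            "path": "metadata",
            "query": {
                "bool": {
                    "must": [
                        {"term": {"metadata.key": "cameraId"}},
                        {"term": {"metadata.long_value": 42323243}},
                    ]
                }
            },
        }
    }
}
```

Placing both `term` clauses inside the single `nested` clause is the crucial detail; two separate nested clauses would match documents where the key and the value come from different metadata entries.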
After these changes, the indices we create are balanced in terms of data size. CPU utilization is down from an average of 70% to 10%. In addition, we were able to reduce the refresh interval on these indices from our earlier setting of 30 seconds to 1 second, supporting read-after-write use cases that let users search for and retrieve a document a second after it was created.
We had to do a one-time migration of the existing documents to the new indices. Thankfully we already had a framework in place that can query all assets from Cassandra and index them in Elasticsearch. Since full table scans in Cassandra are generally not recommended on large tables (due to potential timeouts), our Cassandra schema contains several reverse indices that help us query all the data efficiently. We also utilize Kafka to process these assets asynchronously without impacting our real-time traffic. This infrastructure is used not only to index assets into Elasticsearch, but also to perform administrative operations on all or some assets, such as bulk-updating assets or scanning for and fixing problems. Since we focused only on Elasticsearch indexing in this blog, we plan to write another blog about this infrastructure later.
Elasticsearch Indexing Strategy in Asset Management Platform (AMP) was originally published in Netflix TechBlog on Medium, where people are continuing the conversation by highlighting and responding to this story.