ScyllaDB Summit: For the ScyllaDB Curious + Serious Sea Monsters

If your team is exploring or using ScyllaDB, we hope you’ll join us for a deep dive into the innovations and strategies that will help you get the most out of it. Attend ScyllaDB Summit (free + virtual) on February 15 and 16.

REGISTER FOR FREE

With 2 days of 25+ sessions, this is your chance to quickly:

  • Discover ScyllaDB’s latest innovations for data-intensive applications and learn what’s on the short-term and long-term roadmap
  • Hear how ScyllaDB is being used by engineers at Discord, Epic Games, Strava, ShareChat & other gamechangers
  • Get the ScyllaDB engineering perspective on topics like serverless, observability, drivers, consistency algorithms, compaction strategies & our DynamoDB API
  • Explore real-world ScyllaDB integrations and best practices: Pulsar, Flink, Ansible, Quarkus, Quine, all-ARM clusters & more

If you haven’t attended a ScyllaDB virtual conference before, be prepared – this isn’t your typical virtual conference that inflicts death by PowerPoint. It’s highly interactive, including the ability to engage with speakers during their sessions and continue the conversation afterward. There will also be contests, opportunities to connect with your peers from around the world, the announcement of this year’s ScyllaDB Innovation Award winners, and more.

Throughout both days, our tech experts will be available to answer your questions in the ScyllaDB lounge. If you’re a ScyllaDB open source user, this is a great opportunity to get your top questions addressed!

This year’s ScyllaDB Summit agenda features talks across a broad array of topics related to data-intensive applications – NoSQL, SQL, event streaming, Rust, WebAssembly, and more. Here’s a spotlight on the slice of sessions that are geared specifically toward ScyllaDB users and the “ScyllaDB curious.”

Leadership Keynotes

To Serverless and Beyond

Dor Laor, Co-Founder and CEO
This year, a set of technologies we’ve been developing for years graduated together to bring you uncompromised performance, availability and elasticity. In this session, Dor Laor covers the major innovations coming next to ScyllaDB core and ScyllaDB Cloud.

The Path to ScyllaDB 5.2

Avi Kivity, Co-Founder and CTO

ScyllaDB co-founder and CTO Avi Kivity will cover 2022 accomplishments and deliveries, including full repair-based node operations, distributed aggregation, a focus on goodput in the face of overload, and many other changes.

Use ScyllaDB Alternator to Use Amazon DynamoDB API, Everywhere, Better, More Affordable, All at Once

Tzach Livyatan, VP of Product

Alternator is ScyllaDB's Amazon DynamoDB-compatible API. Alternator allows you to go beyond the AWS perimeter and run your workload anywhere, on any cloud or on premises, without changing a single line of code. The session will show how to save costs by moving workloads from Amazon DynamoDB to ScyllaDB Alternator using real-life examples. We will give insight into the Alternator development process and roadmap, and demonstrate how to use it on ScyllaDB Cloud for production and on your local machine for testing.

ScyllaDB Cloud Goes Serverless

Yaniv Kaul, VP of R&D

Learn how ScyllaDB Cloud is moving to serverless, transforming its single-tenant deployment model into a multi-tenant architecture based on Kubernetes. Discover the engineering innovation required and the user value of the new architecture, including encryption (both in flight and at rest), performance isolation, and the capability to scale elastically.

User Strategies

How Discord Stores Trillions of Messages on ScyllaDB

Bo Ingram, Discord

Learn why and how Discord’s persistence team recently completed their most ambitious migration yet: moving their massive set of trillions of messages from Cassandra to ScyllaDB. Bo Ingram, Senior Software Engineer at Discord, provides a technical look, including:

  • Their reasons for moving from Apache Cassandra to ScyllaDB
  • Their strategy for migrating trillions of messages
  • How they designed a new storage topology – using a hybrid-RAID1 architecture – for extremely low latency on GCP
  • The role of their existing Rust messages service, new Rust data service library, and new Rust data migrator in this project
  • What they’ve achieved so far, lessons learned, and what they’re tackling next

ScyllaDB at Strava

Phani Teja Nallamothu, Strava

How Strava uses ScyllaDB Enterprise, including a look at how ScyllaDB integrates with their architecture and a deep dive into several use cases.

Using ScyllaDB for Distribution of Game Assets in Unreal Engine

Joakim Lindqvist, Epic Games

How Epic Games is using ScyllaDB for distribution of large game assets used by Unreal Engine across the world, enabling game developers to build great games more quickly.

Worldwide Local Latency With ScyllaDB

Carly Christensen, ZeroFlucs
How ZeroFlucs uses ScyllaDB and Go to offer low-latency data processing in a geographically distributed way, with each customer’s data always locally available.

ShareChat’s Journey Migrating 100TB of Data to ScyllaDB with NO Downtime

Chinmoy Mahapatra & Anuraj Jain, ShareChat

How ShareChat built a live migration framework moving 100TB of data into ScyllaDB (for cost and performance benefits) without any downtime.

How Proxima Beta Implemented CQRS and Event Sourcing on Top of Apache Pulsar and ScyllaDB

Lei Shi, Zhiwei Peng, Zhihao Chen – Proxima Beta, Tencent IEG Global

How Level Infinite uses ScyllaDB as the state store of the Proxima Beta gaming platform’s service architecture, including strategies for globally replicating data to simplify configuration management and using time window compaction strategy to power a distributed queue-like event store.

Building a 100% ScyllaDB Shard-Aware Application using Rust

Alexys Jacob, Yassir Barchi, Joseph Perez – Numberly

Numberly’s experience designing and operating a distributed, idempotent, and predictable application 100% based on ScyllaDB’s low-level shard-aware topology using Rust.

Making the Most Out of ScyllaDB’s Awesome Concurrency at Optimizely

Brian Taylor, Optimizely

Optimizely’s client-side strategies for taking full advantage of ScyllaDB’s concurrency while also guaranteeing correctness and protecting quality of service.

Scalable and Resilient Security Ratings Platform with ScyllaDB

Nguyen Cao, SecurityScorecard

How SecurityScorecard, a global leader in cybersecurity ratings, uses ScyllaDB for a database with low query latency, real-time data ingestion, fault tolerance, and high scalability—including their migration process from Redis + Presto + Aurora and lessons learned.

From Postgres to ScyllaDB: Migration Strategies and Performance Gains

Sebastian Vercruysse & Dan Harris, Coralogix

How Coralogix shrank query processing times from 30 seconds (not a typo) to 86 ms by moving from Postgres to ScyllaDB.

Key-Key-Value Store: Generic NoSQL Datastore with Tombstone Reduction & Automatic Partition Splitting

Stephen Ma, Discord

Discover Discord's approach to more quickly and simply onboarding new data storage use cases with their key-value store service that hides many ScyllaDB-specific complexities – like schema design and performance impacts from tombstones and large partitions – from developers.

Aggregations at Scale for ShareChat Using Kafka Streams and ScyllaDB

Charan Movva, ShareChat

How ShareChat handles the aggregations of a post’s engagement metrics/counters at scale with sub-millisecond P99 latencies for reads and writes.

ScyllaDB Engineering

Raft After ScyllaDB 5.2: Safe Topology Changes

Konstantin Osipov, ScyllaDB

ScyllaDB's drive towards strongly consistent features continues, and in this talk I will cover the upcoming implementation of the safe topology changes feature: our rethinking of adding nodes to and removing nodes from a ScyllaDB cluster.

Quickly assembling a fresh cluster, performing topology and schema changes concurrently, quickly restarting a node with a different IP address or configuration – all of this has become possible thanks to a centralized – yet fault-tolerant – topology change coordinator, the new algorithm we implemented for ScyllaDB 5.3. The next step is automatically changing data placement to adjust to the load and distribution of data – future plans which I will touch upon as well.

Squeezing the Most Out of the Storage Engine with State of the Art Compaction

Raphael S. Carvalho, ScyllaDB

Log Structured Merge (LSM) tree storage engines are known for very fast writes. ScyllaDB uses this LSM tree structure to write immutable Sorted String Tables (SSTables) to disk. These fast writes come with a tradeoff in read and space amplification. While compaction processes can help mitigate this, the RUM conjecture states that only two amplification factors can be optimized at the expense of a third. Learn how ScyllaDB leverages the RUM conjecture and control theory to deliver state-of-the-art LSM-tree compaction for its users.

Optimizing ScyllaDB Performance via Observability

Amnon Heiman, ScyllaDB

ScyllaDB already does a great job at basic performance optimization at install time and run-time, with IO tuning for your SSDs and real-time dynamic IO and CPU scheduling. Yet there’s more that you, as a user, can do by observing ScyllaDB’s operations. Learn how to get the most out of your database by using these open source tools, techniques and best practices.

Building Next Generation Drivers: Optimizing Performance in Go and Rust

Piotr Grabowski

Optimizing shard-aware drivers for ScyllaDB has taken multiple initiatives, often requiring a complete rewrite from scratch. Learn about the work undertaken to improve the performance of ScyllaDB drivers for both Go and Rust, plus how the Rust code base will be used as a core for drivers with other language bindings going forward. The session highlights performance increases obtained using techniques available in the respective programming languages, including speeding up Google's B-tree implementation with Go generics, and using the asynchronous Tokio framework as the basis of a new Rust driver.

Retaining Goodput with Query Rate Limiting

Piotr Dulikowski, ScyllaDB

Distributed systems are usually optimized with particular workloads in mind. At the same time, the system should still behave in a sane way when the assumptions about the workload do not hold – notably, one user shouldn't be able to ruin the whole system's performance. Buggy parts of the system can be a source of overload as well, so it is worth considering overload protection on a per-component basis. For example, ScyllaDB's shared-nothing architecture gives it great scalability, but at the same time makes it prone to a "hot partition" problem: a single partition accessed with disproportionate frequency can ruin performance for other requests handled by the same shards. This talk describes how we implemented rate limiting on a per-partition basis to reduce the performance impact in such cases, and how we reduced the CPU cost of handling failed requests such as timeouts (spoiler: it's about C++ exceptions).

Integrations and Ecosystem

CI/CD for Data – Building Data Development Environment with lakeFS

Vinodhini S Duraisamy, Treeverse

How to use lakeFS and ScyllaDB to quickly set up a development environment and use it to develop/test data pipelines and products, including best practices for safety and automation.

Integrating ScyllaDB with Quarkus

Joao Martins, iFood

How Joao Martins from iFood, Brazil’s top food delivery company, integrates two powerful cloud native technologies: ScyllaDB and Quarkus.

Sink Your Teeth into Streaming at Any Scale

Timothy Spann & David Kjerrumgaard, StreamNative

How to build a low-latency scalable platform for today’s massively data-intensive real-time streaming applications using ScyllaDB, Pulsar, and Flink.

x86-less ScyllaDB: Exploring an All-ARM Cluster

Keith McKay, ScaleFlux & Mike Bennet, Ampere

How ScyllaDB performs on Ampere ARM-powered servers and ScaleFlux fast SSDs, as well as how to get the most out of your storage using ScaleFlux’s built-in compression and ScyllaDB’s Incremental Compaction Strategy.

Maximum Uptime ScyllaDB Cluster Orchestration with Ansible

Ryan Ross

How to orchestrate your ScyllaDB clusters with confidence using Ansible, including tangible code snippets to help operators and developers safely make changes to their ScyllaDB clusters. These are tips learned the hard way in production so you don’t have to.

Build Low Latency, Windowless Event Processing Pipelines with Quine and ScyllaDB

Matthew Cullum, thatDot

How to build an event processing pipeline that scales to millions of events per second with sub-millisecond latencies while ingesting multiple streams and remaining resilient in the face of host failures.

Developing Enterprise Consciousness: Building Modern Open Data Platforms

Rahul Xavier Singh, Anant

How modern open source tools can help synchronize data across applications and open data platforms using low-code ETL/ReverseETL tools. Anant introduces a reference stack that can help you integrate ScyllaDB in your data platform.

REGISTER FOR FREE

 

Scalable Annotation Service — Marken

Scalable Annotation Service — Marken

by Varun Sekhri, Meenakshi Jindal

Introduction

At Netflix, we have hundreds of microservices, each with its own data models or entities. For example, we have a service that stores a movie entity's metadata or a service that stores metadata about images. At a later point, all of these services want to annotate their objects or entities. Our team, Asset Management Platform, decided to create a generic service called Marken which allows any microservice at Netflix to annotate their entities.

Annotations

Sometimes people describe annotations as tags, but that is a limited definition. In Marken, an annotation is a piece of metadata which can be attached to an object from any domain. There are many different kinds of annotations our client applications want to generate. A simple annotation, like the one below, would describe that a particular movie has violence.

  • Movie Entity with id 1234 has violence.

But there are more interesting cases where users want to store temporal (time-based) or spatial data. In Pic 1 below, we have an example of an application used by editors to review their work. They want to change the color of the gloves to rich black, so they want to be able to mark up that area (in this case using a blue circle) and store a comment for it. This is a typical use case for a creative review application.

An example of storing both time- and space-based data would be an ML algorithm that can identify characters in a frame and wants to store the following for a video:

  • In a particular frame (time)
  • In some area in image (space)
  • A character name (annotation data)
Pic 1: Editors requesting changes by drawing shapes, like the blue circle shown above.

Goals for Marken

We wanted to create an annotation service with the following goals.

  • Allows annotating any entity. Teams should be able to define their own data model for annotations.
  • Annotations can be versioned.
  • The service should be able to serve real-time (UI) applications, so CRUD and search operations must be achieved with low latency.
  • All data should also be available for offline analytics in Hive/Iceberg.

Schema

Since the annotation service would be used by anyone at Netflix, we needed to support different data models for the annotation object. A data model in Marken can be described using a schema — just like how we create schemas for database tables.

Our team, Asset Management Platform, owns a different service that has a JSON-based DSL to describe the schema of a media asset. We extended this service to also describe the schema of an annotation object.

{
  "type": "BOUNDING_BOX", ❶
  "version": 0, ❷
  "description": "Schema describing a bounding box",
  "keys": {
    "properties": { ❸
      "boundingBox": {
        "type": "bounding_box",
        "mandatory": true
      },
      "boxTimeRange": {
        "type": "time_range",
        "mandatory": true
      }
    }
  }
}

In the above example, the application wants to represent a rectangular area in a video spanning a range of time.

  1. The schema's name is BOUNDING_BOX.
  2. Schemas can have versions. This allows users to add/remove properties in their data model. We don't allow incompatible changes; for example, users cannot change the data type of a property.
  3. The data stored is represented in the "properties" section. In this case, there are two properties:
     • boundingBox, with type "bounding_box". This is basically a rectangular area.
     • boxTimeRange, with type "time_range". This allows us to specify the start and end time for this annotation.

Geometry Objects

To represent spatial data in an annotation we use the Well Known Text (WKT) format. We support the following objects:

  • Point
  • Line
  • MultiLine
  • BoundingBox
  • LinearRing

Our model is extensible, allowing us to easily add more geometry objects as needed.

Temporal Objects

Several applications need to store annotations for videos that have a time component. We allow applications to store time as frame numbers or nanoseconds.

To store data in frames, clients must also store frames per second. We call this SampleData, with the following components (illustrated in the sketch after the list):

  • sampleNumber aka frame number
  • sampleNumerator
  • sampleDenominator
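
For illustration, a SampleData value for frame 1424 of a 23.976 fps video might look roughly like this (the JSON field layout is an assumption; the post only lists the component names):

{
  "sampleNumber": 1424,
  "sampleNumerator": 24000,
  "sampleDenominator": 1001
}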

Annotation Object

Just like the schema, an annotation object is also represented in JSON. Here is an example of an annotation for the BOUNDING_BOX schema which we discussed above.

{
  "annotationId": { ❶
    "id": "188c5b05-e648-4707-bf85-dada805b8f87",
    "version": "0"
  },
  "associatedId": { ❷
    "entityType": "MOVIE_ID",
    "id": "1234"
  },
  "annotationType": "ANNOTATION_BOUNDINGBOX", ❸
  "annotationTypeVersion": 1,
  "metadata": { ❹
    "fileId": "identityOfSomeFile",
    "boundingBox": {
      "topLeftCoordinates": {
        "x": 20,
        "y": 30
      },
      "bottomRightCoordinates": {
        "x": 40,
        "y": 60
      }
    },
    "boxTimeRange": {
      "startTimeInNanoSec": 566280000000,
      "endTimeInNanoSec": 567680000000
    }
  }
}
  1. The first component is the unique id of this annotation. An annotation is an immutable object, so its identity always includes a version. Whenever someone updates this annotation, we automatically increment its version.
  2. An annotation must be associated with some entity which belongs to some microservice. In this case, this annotation was created for a movie with id "1234".
  3. We then specify the schema type of the annotation. In this case it is BOUNDING_BOX.
  4. The actual data is stored in the metadata section of the JSON. As discussed above, there is a bounding box and a time range in nanoseconds.

Base schemas

Just like in object-oriented programming, our schema service allows schemas to inherit from each other. This allows our clients to create an "is-a-type-of" relationship between schemas. Unlike Java, we support multiple inheritance as well.

We have several ML algorithms which scan Netflix media assets (images and videos) and create very interesting data, for example identifying characters in frames or identifying match cuts. This data is then stored as annotations in our service.

As a platform service, we created a set of base schemas to ease creating schemas for different ML algorithms. One base schema (TEMPORAL_SPATIAL_BASE) has the following optional properties. This base schema can be used by any derived schema and is not limited to ML algorithms.

  • Temporal (time related data)
  • Spatial (geometry data)

Another one, BASE_ALGORITHM_ANNOTATION, has the following optional properties, which are typically used by ML algorithms.

  • label (String)
  • confidenceScore (double) — denotes the confidence of the generated data from the algorithm.
  • algorithmVersion (String) — version of the ML algorithm.

By using multiple inheritance, a typical ML algorithm schema derives from both the TEMPORAL_SPATIAL_BASE and BASE_ALGORITHM_ANNOTATION schemas (a sketch of such a derived schema follows the example below).

{
  "type": "BASE_ALGORITHM_ANNOTATION",
  "version": 0,
  "description": "Base Schema for Algorithm based Annotations",
  "keys": {
    "properties": {
      "confidenceScore": {
        "type": "decimal",
        "mandatory": false,
        "description": "Confidence Score"
      },
      "label": {
        "type": "string",
        "mandatory": false,
        "description": "Annotation Tag"
      },
      "algorithmVersion": {
        "type": "string",
        "description": "Algorithm Version"
      }
    }
  }
}
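
For illustration, a derived ML schema combining both bases might look roughly like the following sketch (the post does not show the DSL's inheritance syntax, so the "extends" key below is an assumption):

{
  "type": "CHARACTER_RECOGNITION",
  "version": 0,
  "description": "Hypothetical derived schema for a character-recognition algorithm",
  "extends": ["TEMPORAL_SPATIAL_BASE", "BASE_ALGORITHM_ANNOTATION"],
  "keys": {
    "properties": {
      "characterName": {
        "type": "string",
        "mandatory": true
      }
    }
  }
}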

Architecture

Given the goals of the service, we had to keep the following in mind.

  • Our service will be used by a lot of internal UI applications, hence the latency for CRUD and search operations must be low.
  • Besides applications, we will have ML algorithm data stored. Some of this data can be at the frame level for videos, so the amount of data stored can be large. The databases we pick should be able to scale horizontally.
  • We also anticipated that the service will have a high RPS.

Some other goals came from search requirements.

  • Ability to search the temporal and spatial data.
  • Ability to search with different associated and additional associated Ids as described in our Annotation Object data model.
  • Full text searches on many different fields in the Annotation Object
  • Stem search support

As time progressed, the requirements for search only increased; we will discuss these requirements in detail in the Search section below.

Given the requirements and the expertise in our team, we decided to choose Cassandra as the source of truth for storing annotations. To support the different search requirements, we chose Elasticsearch. In addition, to support various features, we have a bunch of internal auxiliary services, e.g., a ZooKeeper service, an internationalization service, etc.

Marken architecture

The picture above shows the block diagram of the architecture of our service. On the left, we show the data pipelines which are created by several of our client teams to automatically ingest new data into our service. The most important such data pipeline is the one created by the Machine Learning team.

One of the key initiatives at Netflix, Media Search Platform, now uses Marken to store annotations and perform the various searches explained below. Our architecture makes it possible to easily onboard and ingest data from media algorithms. This data is used by various teams, e.g., creators of promotional media (trailers, banner images), to improve their workflows.

Search

The success of the Annotation Service (data labels) depends on effective search of those labels without needing to know much about the input algorithms' details. As mentioned above, we use the base schemas for every new annotation type (depending on the algorithm) indexed into the service. This helps our clients search across the different annotation types consistently. Annotations can be searched either simply by data labels or with additional filters like movie id.

We have defined a custom query DSL to support searching, sorting, and grouping of the annotation results. Different types of search queries are supported using Elasticsearch as the backend search engine.

  • Full Text Search — Clients may not know the exact labels created by the ML algorithms. As an example, the label can be 'shower curtain'. With full text search, clients can find the annotation by searching with the label 'curtain'. We also support fuzzy search on the label values; for example, if a client wants to search for 'curtain' but mistypes it as 'curtian', the annotation with the 'curtain' label will still be returned (see the query sketch after this list).
  • Stem Search — With global Netflix content supported in different languages, our clients need stem search across those languages. The Marken service contains subtitles for the full catalog of Netflix titles, which can be in many different languages. As an example of stem search, 'clothing' and 'clothes' can be stemmed to the same root word, 'cloth'. We use Elasticsearch to support stem search for 34 different languages.
  • Temporal Annotation Search — Annotations for videos are more relevant when they are defined along with temporal information (a time range with start and end time). A time range within a video is also mapped to frame numbers. We also support label search for temporal annotations within a provided time range/frame number.
  • Spatial Annotation Search — Annotations for a video or image can also include spatial information, for example a bounding box which defines the location of the labeled object in the annotation.
  • Temporal and Spatial Search — An annotation for a video can have both a time range and spatial coordinates. Hence, we support queries which search annotations within a provided time range and spatial coordinate range.
  • Semantic Search — Annotations can be searched after understanding the intent of the user-provided query. This type of search provides results based on conceptually similar matches to the text in the query, unlike traditional tag-based search, which expects exact keyword matches with the annotation labels. To support this, ML algorithms also ingest annotations with vectors instead of actual labels. User-provided text is converted into a vector using the same ML model, and then a search is performed to find the vectors closest to the query vector. Based on client feedback, such searches provide more relevant results and don't return empty results when no annotations exactly match the user-provided query labels. We support semantic search using Open Distro for Elasticsearch. We will cover more details on semantic search support in a future blog article.
Semantic search
  • Range Intersection — We recently started supporting range intersection queries across multiple annotation types for a specific title in real time. This allows clients to search with multiple data labels (resulting from different algorithms, so they are different annotation types) within a specific time range of a video, or the complete video, and get the list of time ranges or frames where the provided set of data labels is present. A common example of this query is to find `James in the indoor shot drinking wine`. For such queries, the query processor finds the results for both data labels (James, indoor shot) and the vector search (drinking wine), and then finds the intersection of the resulting frames in memory.
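
As a sketch of what the fuzzy full-text case above can look like at the Elasticsearch level (the "label" field name is an assumption; Marken's actual index mappings and query DSL are internal):

{
  "query": {
    "match": {
      "label": {
        "query": "curtian",
        "fuzziness": "AUTO"
      }
    }
  }
}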

Search Latency

Our client applications are studio UI applications, so they expect low latency for search queries. As highlighted above, we support such queries using Elasticsearch. To keep latency low, we have to make sure that all the annotation indices are balanced and that no hotspot is created when algorithm backfill data is ingested for older movies. We followed the rollover indices strategy to avoid such hotspots in the cluster (as described in our blog on the asset management application), which can cause spikes in CPU utilization and slow down query responses. Search latency for generic text queries is in milliseconds. Semantic search queries have comparatively higher latency than generic text searches. The following graphs show the average latency for generic search and for semantic search (including KNN and ANN search).

Average search latency
Semantic search latency

Scaling

One of the key challenges in designing the annotation service is handling the scaling requirements of the growing Netflix movie catalog and ML algorithms. Video content analysis plays a crucial role in the utilization of content across studio applications in movie production and promotion. We expect the number of algorithm types to grow widely in the coming years. With the growing number of annotations and their usage across studio applications, prioritizing scalability becomes essential.

Data ingestion from the ML data pipelines is generally in bulk, specifically when a new algorithm is designed and annotations are generated for the full catalog. We have set up a separate stack (fleet of instances) to control the data ingestion flow and hence provide consistent search latency to our consumers. In this stack, we control the write throughput to our backend databases using Java threadpool configurations.

The Cassandra and Elasticsearch backend databases support horizontal scaling of the service with growing data size and query load. We started with a 12-node Cassandra cluster and scaled up to 24 nodes to support the current data size. This year, annotations were added for approximately the full Netflix catalog. Some titles have more than 3M annotations (most of them related to subtitles). Currently the service has around 1.9 billion annotations with a data size of 2.6 TB.

Analytics

Annotations can be searched in bulk across multiple annotation types to build data facts for a title or across multiple titles. For such use cases, we persist all the annotation data in Iceberg tables so that annotations can be queried in bulk along different dimensions without impacting the CRUD latency of the real-time applications.

One of the common use cases is when the media algorithm teams read subtitle data in different languages (annotations containing subtitles on a per frame basis) in bulk so that they can refine the ML models they have created.

Future work

There is a lot of interesting future work in this area.

  1. Our data footprint keeps increasing with time. Often, an algorithm is revised, and the annotations related to the new version are more accurate and in use; so we need to clean up large amounts of old data without affecting the service.
  2. Intersection queries over data at large scale, returning results with low latency, is an area where we want to invest more time.

Acknowledgements

Burak Bacioglu and other members of the Asset Management Platform contributed to the design and development of Marken.


Scalable Annotation Service — Marken was originally published in Netflix TechBlog on Medium, where people are continuing the conversation by highlighting and responding to this story.

How to Get the Most Out of Cassandra Summit 2023

What if someone told you that spending three days in San Jose could change your organization’s trajectory with Apache Cassandra®? If you’re the kind of person who’d answer, “sign me up,” then you might have already registered for the rapidly approaching Cassandra Summit. Hosted by the Linux...

How to Use Change Data Capture with Apache Kafka and ScyllaDB

Today's blog is taken from a lab in ScyllaDB University. Log in or register now for ScyllaDB University, where you can take this lab and the entire course and get credit for it free online, plus have access to all our other free online courseware.

TAKE THE LAB IN SCYLLADB UNIVERSITY

Overview

In this hands-on lab, you will learn how to use the ScyllaDB CDC source connector to push row-level change events from the tables of a ScyllaDB cluster to a Kafka server.

What's ScyllaDB CDC?

To recap, Change Data Capture (CDC) is a feature that allows you to query not only the current state of a database's table but also the history of all changes made to the table. CDC is production-ready (GA) starting from ScyllaDB Enterprise 2021.1.1 and ScyllaDB Open Source 4.3.

In ScyllaDB, CDC is optional and enabled on a per-table basis. The history of changes made to a CDC-enabled table is stored in a separate associated table.

You can enable CDC when creating or altering a table using the cdc option, for example:

CREATE TABLE ks.t (pk int, ck int, v int, PRIMARY KEY (pk, ck, v)) WITH cdc = {'enabled':true};
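
The change history then lands in an automatically created log table associated with the base table; ScyllaDB names it with a _scylla_cdc_log suffix, so you can inspect it with a query like:

SELECT * FROM ks.t_scylla_cdc_log;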

ScyllaDB CDC Source Connector

The ScyllaDB CDC Source Connector is a source connector that captures row-level changes in the tables of a ScyllaDB cluster. It is a Debezium connector, compatible with Kafka Connect (with Kafka 2.6.0+). The connector reads the CDC log for the specified tables and produces Kafka messages for each row-level INSERT, UPDATE, or DELETE operation. The connector is fault-tolerant, retrying reads from ScyllaDB in case of failure. It periodically saves the current position in the ScyllaDB CDC log using Kafka Connect offset tracking. Each generated Kafka message contains information about the source, such as the timestamp and the table name.

Notice that at the time of writing, there is no support for collection types (LIST, SET, MAP) or UDTs – columns with those types are omitted from generated messages. Stay up to date on this enhancement request and other developments in the GitHub project.

Confluent and Kafka Connect

Confluent is a full-scale data streaming platform that enables you to easily access, store, and manage data as continuous, real-time streams. It expands the benefits of Apache Kafka with enterprise-grade features. Confluent makes it easy to build modern, event-driven applications, and gain a universal data pipeline, supporting scalability, performance, and reliability.

Kafka Connect is a tool for scalably and reliably streaming data between Apache Kafka and other data systems. It makes it simple to define connectors that move large data sets in and out of Kafka. It can ingest entire databases or collect metrics from application servers into Kafka topics, making the data available for stream processing with low latency.

Kafka Connect includes two types of connectors:

  • Source connector: Source connectors ingest entire databases and stream table updates to Kafka topics. Source connectors can also collect metrics from application servers and store the data in Kafka topics–making the data available for stream processing with low latency.
  • Sink connector: Sink connectors deliver data from Kafka topics to secondary indexes, such as Elasticsearch, or batch systems, such as Hadoop, for offline analysis.

Service Setup With Docker

In this lab, you’ll use Docker.

Please ensure that your environment meets the following prerequisites:

  1. Docker for Linux, Mac, or Windows. Please note that running ScyllaDB in Docker is only recommended for evaluating and trying ScyllaDB.
  2. ScyllaDB Open Source. For best performance, a regular install is recommended.
  3. 8GB of RAM or greater for Kafka and ScyllaDB services.
  4. docker-compose.
  5. Git.

ScyllaDB Install And Init Table

First, you’ll launch a three-node ScyllaDB cluster and create a table with CDC enabled.
If you haven’t done so yet, download the example from git:

git clone https://github.com/scylladb/scylla-code-samples.git
cd scylla-code-samples/CDC_Kafka_Lab

This is the docker-compose file you’ll use. It starts a three-node ScyllaDB Cluster:
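
The file ships with the repository you just cloned; a minimal sketch of its shape follows (the image tag and seed flags are illustrative assumptions, not the exact file contents):

version: "3"
services:
  scylla-node1:
    image: scylladb/scylla:5.1.0
    container_name: scylla-node1
  scylla-node2:
    image: scylladb/scylla:5.1.0
    container_name: scylla-node2
    command: --seeds=scylla-node1
  scylla-node3:
    image: scylladb/scylla:5.1.0
    container_name: scylla-node3
    command: --seeds=scylla-node1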

Launch the ScyllaDB cluster:

docker-compose -f docker-compose-scylladb.yml up -d

Wait for a minute or so, and check that the ScyllaDB cluster is up and in normal status:

docker exec scylla-node1 nodetool status

Next, you'll use cqlsh to interact with ScyllaDB. Create a keyspace and a table with CDC enabled, and insert a row into the table:

docker exec -ti scylla-node1 cqlsh

CREATE KEYSPACE ks WITH replication = {'class': 'NetworkTopologyStrategy', 'replication_factor' : 1};

CREATE TABLE ks.my_table (pk int, ck int, v int, PRIMARY KEY (pk, ck, v)) WITH cdc = {'enabled':true};

INSERT INTO ks.my_table(pk, ck, v) VALUES (1, 1, 20);

exit

Confluent Setup And Connector Configuration

To launch a Kafka server, you'll use the Confluent platform, which provides a user-friendly web GUI to track topics and messages. The Confluent platform provides a docker-compose.yml file to set up the services. Notice that this is not how you would use Apache Kafka in production; the example is useful for training and development purposes only. Get the file:

wget -O docker-compose-confluent.yml https://raw.githubusercontent.com/confluentinc/cp-all-in-one/7.3.0-post/cp-all-in-one/docker-compose.yml

Next, download the ScyllaDB CDC connector:

wget -O scylla-cdc-plugin.jar https://github.com/scylladb/scylla-cdc-source-connector/releases/download/scylla-cdc-source-connector-1.0.1/scylla-cdc-source-connector-1.0.1-jar-with-dependencies.jar

Add the ScyllaDB CDC connector to the Confluent connect service's plugin directory using a Docker volume by editing docker-compose-confluent.yml to add the two lines below, replacing the directory with the directory of your scylla-cdc-plugin.jar file:
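
As a sketch, the addition under the connect service looks like this (the in-container target path is an assumption based on the image's default plugin path; adjust it to your setup):

    volumes:
      - <directory_of_your_jar>/scylla-cdc-plugin.jar:/usr/share/java/kafka/scylla-cdc-plugin.jar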

Launch the Confluent services:

docker-compose -f docker-compose-confluent.yml up -d

Wait a minute or so, then access http://localhost:9021 for the Confluent web GUI.

Add the ScyllaConnector using the Confluent dashboard:

Add the Scylla Connector by clicking the plugin:

Fill in "Hosts" with the IP address of one of the ScyllaDB nodes (you can see it in the output of the nodetool status command) and port 9042, which the ScyllaDB service listens on.

The "Namespace" is the keyspace you created earlier in ScyllaDB.

Notice that it might take a minute or so for ks.my_table to appear.
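
If you prefer to skip the GUI, the same connector settings can also be posted to the Kafka Connect REST API. A sketch follows; the property names reflect the scylla-cdc-source-connector documentation as best understood, and the node address placeholder must be filled in, so treat the details as assumptions to verify:

curl -X POST http://localhost:8083/connectors \
  -H "Content-Type: application/json" \
  -d '{
    "name": "ScyllaConnector",
    "config": {
      "connector.class": "com.scylladb.cdc.debezium.connector.ScyllaConnector",
      "scylla.name": "MyScyllaCluster",
      "scylla.cluster.ip.addresses": "<scylla-node-ip>:9042",
      "scylla.table.names": "ks.my_table"
    }
  }'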

 

Test Kafka Messages

You can see that MyScyllaCluster.ks.my_table is the topic created by the ScyllaDB CDC connector.

Now, check for Kafka messages from the Topics panel:

Select the topic which is the same as the keyspace and table name that you created in ScyllaDB.

From the Overview tab, you can see the topic info. At the bottom, it shows this topic is on partition 0.

A partition is the smallest storage unit that holds a subset of the records owned by a topic. Each partition is a single log file where records are written in an append-only fashion. The records in a partition are each assigned a sequential identifier called the offset, which is unique for each record within the partition. The offset is an incremental and immutable number maintained by Kafka.
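
If you'd rather verify from the command line instead of the GUI, the stock console consumer works as well (the broker container name and internal port come from the Confluent compose file and are assumptions to check against your setup):

docker exec -it broker kafka-console-consumer --bootstrap-server localhost:9092 --topic MyScyllaCluster.ks.my_table --from-beginning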

As you already know, the ScyllaDB CDC messages are sent to the ks.my_table topic, and the partition id of the topic is 0, so next, go to the Messages tab and enter partition id 0 into the offset field:

You can see from the output of the Kafka topic messages that the ScyllaDB table INSERT event and the data were transferred to Kafka messages by the Scylla CDC Source Connector. Click on the message to view the full message info:

The message contains the ScyllaDB table name and keyspace name with the time, as well as the state of the data before and after the action. Since this is an insert operation, the data before the insert is null.
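
As a rough sketch, the value of such a message follows the Debezium envelope shape, along these lines (the exact field names and nesting inside the envelope vary by connector version, so treat this as an assumption):

{
  "source": {
    "name": "MyScyllaCluster",
    "keyspace_name": "ks",
    "table_name": "my_table"
  },
  "before": null,
  "after": { "pk": 1, "ck": 1, "v": 20 },
  "op": "c"
}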

Next, insert another row into the ScyllaDB table:

docker exec -ti scylla-node1 cqlsh

INSERT INTO ks.my_table(pk, ck, v) VALUES (200, 50, 70);

Now, in Kafka, wait for a few seconds and you can see the details of the new Message:

Cleanup

Once you are done working on this lab, you can stop and remove the Docker containers and images.

To view a list of all container IDs:

docker container ls -aq

Then you can stop and remove the containers you are no longer using:

docker stop <ID_or_Name>

docker rm <ID_or_Name>

Later, if you want to rerun the lab, you can follow the steps and use docker-compose as before.

Summary

With the CDC source connector, a Kafka plugin compatible with Kafka Connect, you can capture all ScyllaDB table row-level changes (INSERT, UPDATE, or DELETE) and convert those events to Kafka messages. You can then consume the data from other applications or perform any other operation with Kafka.

You can discuss this lab on the ScyllaDB community forum.

My Experience at the OEB Conference in Berlin

© OEB Learning Technologies Europe GmbH used with permission

In late November I was invited to speak at the OEB learning technologies conference in Berlin. You can read more about the talk I gave in this blog post.

In this post, I’ll share some of my experiences from the conference and what I learned. It was my first time at this conference, and I had a wonderful experience. I feel like I learned more than I shared.

Berlin in November was snowy, cold, and beautiful. Quite a change for me, coming from sunny Israel. The conference took place in the Ku’damm area with lively Christmas markets.

My attendance was valuable. I connected with colleagues, saw what other companies are doing, and learned about new technologies and industry terms.

The conference had an exhibition with many companies demonstrating their technology.

  • I learned from D-ID about using AI to artificially generate videos from text and photos of the speaker. While the technology seems very interesting, the generated videos I saw still felt, well, artificial, and I don't see us using this technology in the near future.
  • Some companies presented gamification solutions that can be incorporated into training material. I found this very interesting and got some ideas for how we can add more gamification to ScyllaDB University. One example is Seppo. An example usage I can think of is the gamification of a monitoring dashboard with different results.
  • A few companies offered platforms to create and consume video. An impressive example is Class, a company that offers a Zoom addon for teaching. I could see how they add value for live classes, but I didn’t see any features related to writing code or features that were focused on IT labs.
  • A few large companies offering LMS solutions were present as well, among them D2L, Moodle, and Cypher Learning. It was interesting to see their offerings and latest features and compare them to our chosen open-source LMS solution.
  • Some companies offering proctoring solutions were present, like Proctorexam and Proctorio. We previously considered offering a proctored certification exam for different ScyllaDB skills but didn't pursue this for various reasons. If this is something you'd like to see, share your input on the community forum.
  • Many companies offer to integrate with other LMS systems using Learning Tools Interoperability (LTI). Unfortunately, LearnDash, our current solution, does not support LTI, and we hope they will add this support in the future.

Many of the things I learned came from talking with colleagues between sessions. I got some ideas from my colleague Yanay Zaguri of Appsflyer about how to get engineers and developers to create training material as part of the development process instead of doing it as an afterthought.

An interesting talk by Katrina Kennedy, a training consultant based in the US, dealt with how to be more engaging as a trainer. It was mainly focused on in-person training, but some of the points were relevant for online training as well. Some of my takeaways were:

  • Get the trainees to be active and not just read or listen.
  • If live and online, ask them to respond to a question or a poll, or use an interactive quiz game.
  • If doing face-to-face, ask the trainees to discuss a topic in groups of 2-3 people.
  • Strive for 5 minutes or less of “passive” learning before each engaging activity, like quizzes, polls, and questions.
  • For the most engaged students, ask them to support peers and moderate or suggest content.

Another interesting talk was by Laura Pomares of the UN. She shared her experience using a personalized email drip campaign to make sure her users stayed engaged. In the campaign, users were split into three buckets: ones that don't engage (red bucket), ones that do engage (green), and those in between (yellow). The campaign reached out to the users in each bucket with personalized messages and tried to get them to move to the next bucket, or to make sure that the ones in the green bucket stayed there.

© OEB Learning Technologies Europe GmbH used with permission

It was interesting to participate as a trainee in some experiential learning talks and feel firsthand what activities kept me focused on the subject and which didn’t work so well.

There were many more talks I found interesting, covering topics like measuring the impact of learning, using avatars, AI, gamification, analytics, and more. I won’t cover them all.

The conference had over 2,000 attendees and dozens of talks. Some of the topics focused on academia, while others focused on corporate learning and training. In a given time slot, there were sometimes 3-4 talks I was interested in, and it was a hard choice which one to attend. In other time slots, on the other hand, all the talks were about topics that were not relevant to me.

If there is one thing I'd improve about the organization of the conference, it would be to spread the talks out more evenly so that at any given time there would be talks relevant to everyone.

I would personally attend the conference again and recommend it to people in the field.

Check Out ScyllaDB University

My trip to OEB and the related research described here are all part of our constant efforts to make ScyllaDB University the best learning platform in the NoSQL industry. If you'd like to check out the lessons for yourself, you can register now. All our courses are online and free.

REGISTER FOR SCYLLA UNIVERSITY

DISCUSS THIS IN OUR COMMUNITY FORUMS

The Year in Real-Time for Apache Pulsar and Cassandra

Data is the lifeblood of every business and the very reason behind naming the industry's collective efforts as “information technology.” The rate of growth and the need for global deployment of data is skyrocketing. Yet the most powerful data of all is real-time data. So why are open source...

3 Things I Learned from The Apache Cassandra® Corner Podcast

People know Apache Cassandra® as the world’s most scalable database, one that’s relied upon by some of the world’s most successful brands including Netflix, Apple, Uber, and FedEx, among many others. But Cassandra and the community that surrounds it is full of surprises, a few of which I discovered...

ScyllaDB Summit 2023 Agenda Announced

If you haven’t already, take a moment to register for ScyllaDB Summit 2023 right now. Because it’s an online event you won’t want to miss!

REGISTER NOW FOR SCYLLADB SUMMIT

For those who have attended our prior ScyllaDB Summits, or one of our P99 CONF events, you know ScyllaDB can throw a great online show! Thousands of you registered for our past events, and we look forward to meeting up with all our old friends and industry colleagues. Yet we also want to make ScyllaDB Summit inclusive to newcomers, because more people are discovering our database than ever before.

For those who are new to ScyllaDB, the NoSQL database, and to ScyllaDB, the company behind it, we’d love to make your acquaintance! So let’s go over what we have in store for you.

ScyllaDB Summit is our annual conference, filled with talks from our open source users and commercial customers, as well as ScyllaDB’s own technical leaders and innovators. Over two days you will be introduced to gamechangers and their industry-revolutionizing use cases, to new architectures and programming paradigms emerging in cloud computing, and to the latest developments in ScyllaDB that enable your innovation.

Keynotes

Wednesday, February 15th

Dor Laor, ScyllaDB Co-Founder and CEO, will be talking about To Serverless and Beyond. Discover all the innovations coming to ScyllaDB and ScyllaDB Cloud.

Discord’s Senior Software Engineer Bo Ingram will be presenting on How Discord Migrated Trillions of Messages from Cassandra to ScyllaDB.

Strava’s Phani Teja Nallamothu will discuss ScyllaDB at Strava. Learn how they came to embrace it, shrinking their clusters and easing their administrative burden.

Avi Kivity, ScyllaDB Co-Founder and CTO, will discuss The Path to ScyllaDB 5.2, covering Repair Based Node Operations (RBNO) and how ScyllaDB distributes aggregations.

Tzach Livyatan, ScyllaDB VP of Product, will show how to Use ScyllaDB Alternator to Use Amazon DynamoDB API, Everywhere, Better, More Affordable, All at Once.

Thursday, February 16th

Yaniv Kaul, ScyllaDB's VP of R&D, is talking about ScyllaDB Cloud Goes Serverless, transforming its deployment into a multi-tenant architecture based on Kubernetes.

Hulu’s Ramiro del Corro will discuss How Hulu Serves Dynamic Live Streams at Scale. Discover how and why they moved their systems to ScyllaDB.

Epic Games' Joakim Lindqvist will reveal how they are Using ScyllaDB for Distribution of Game Assets in Unreal Engine to revolutionize the way games are produced.

Percona Founder Peter Zaitsev will focus on industry-wide transformations in The Database Trends that are Transforming Your Database Infrastructure Forever.

Check out the Whole Agenda!

We have many more speakers in concurrent sessions as well as extensive on-demand content. Tons of lessons learned, insights, and technical breakthroughs you won’t want to miss. Check out the full agenda and register now.

REGISTER NOW FOR SCYLLADB SUMMIT