Today, we’re announcing DataStax Constellation, our new cloud data platform designed to make hybrid and multi-cloud data management ‘easy and obvious’ in a world where building applications can often be complex and confusing. With Constellation, enterprises will now get the extreme power of our data solutions with push-button simplicity. That power-made-simple will allow our customers to rapidly develop cloud applications with on-prem compatibility, along with smart services for performance monitoring that will span all Cassandra-based deployments, whether on-prem or in the cloud.
DataStax’s masterless architecture uniquely positions us to empower fast application development, no matter the infrastructure configuration or scale and performance needs. In fact, no other database can support hybrid or multi-cloud like we can, which is why 60% of our customers already self-manage DataStax deployments in a public cloud.
The benefits from DataStax technology are perfectly suited for modern applications:
- No single point of failure,
- Global data distribution and availability for hybrid and multi-cloud deployments,
- Predictable performance with linear scale, and now,
- Simplified, cloud-native development, deployment, and management.
Hybrid and multi-cloud infrastructures are here to stay, and we will now provide a full spectrum of offerings to match any development and deployment needs. For example, recently we made it even easier for those who want to self-manage in the cloud by partnering with the major marketplace providers. Just two weeks ago, we announced the availability of the DataStax Distribution of Apache Cassandra in Azure, and last month we announced a partnership with Google. These offerings are necessary for customers who want our technology via a cloud partner, but they are not enough. A complete cloud-native DataStax experience is also required.
With Constellation, we will deliver just that.
Simple and Powerful
Constellation is a modern cloud data platform with smart services that enable rapid cloud application development. It is where data management happens with push-button ease and developer solutions are easy and obvious. The goal? To be the data cloud of choice in a world where data has to be available wherever, whenever, and however people want it. Constellation is designed to take advantage of the best the cloud has to offer and eliminate the worst of what the cloud can create.
The first offerings on Constellation will be DataStax Apache Cassandra as a Service, along with DataStax Insights, our next-generation hosted performance monitoring solution.
Through these offerings, we are making it easier than ever for developers to build powerful applications with familiar tools. DataStax Studio—an intuitive tool designed to help developers build and collaborate using the Cassandra Query Language, DataStax Enterprise Graph, and Spark SQL code—will be integrated with Constellation. We are also making available pre-configured deployment tools that can be downloaded with a click of a button, including our drivers, CQL query tool, and fast bulk-loading utility for moving data from on-premises to the cloud.
Constellation lets you get started with Cassandra in just a few clicks, simplifies development, and makes management easy with a single, integrated console.
With Insights, we provide fast and accurate problem resolution with centralized, scalable, and flexible performance monitoring that uses a combination of AI and our years of human expertise working with the largest deployments in the world. At-a-glance health indexes provide an intuitive, single view across all cloud and on-premises deployments. Experts will love the ability to drill down via click-throughs, and AI-powered analysis and recommendations make it easy for novices and busy seasoned professionals alike to tune for better performance. Insights constantly learns about a cluster’s performance to smartly recommend and even automate fixes so that everything runs like a well-oiled machine.
A New Era
Constellation is a leap into a new era at DataStax and a whole new way of making it easy for customers to buy, build, deploy, and manage modern applications in any type of computing environment or infrastructure.
Constellation makes easy what many, if not most, application developers currently find difficult. It makes obvious what companies find challenging and confusing. It simplifies the use of open source software with one of the most powerful distributed databases in the world, and in doing so, empowers enterprises in entirely new ways. Its on-demand pricing is flexible and transparent.
We are excited to see the profound impact Constellation will have on modern enterprises and cannot wait to release Constellation to unleash the power of Cassandra with cloud-native simplicity. We have no doubt that what our customers will build with this technology will be spectacular!
We live in a post-meltdown world. After the discovery of Meltdown and Spectre, it was only a matter of time before bad actors realized that side-channels were a promising attack vector and started looking for more.
Once the floodgates opened, the natural prediction was that side-channel attacks would keep coming. And, as predicted, there is a new processor vulnerability, or rather, a family of vulnerabilities, in town. Because these flaws were discovered by several different research teams, you may see them referred to by many names: RIDL, Fallout, MDS, or my personal favorite, “ZombieLoad.”
Meltdown and Spectre exploited the speculative execution unit in the processor, allowing unprivileged applications to read memory belonging to the operating system. All operating systems were patched to be stricter about how the operating system’s memory is mapped, causing a slowdown in most applications. Scylla, due to its mostly-userspace architecture, escaped relatively unscathed.
ZombieLoad is an attack that exploits Intel’s Hyperthreading implementation, allowing attackers to infer the values of protected data held in microarchitectural components of the processor such as the store buffer, fill buffer, or load ports. Some vendors rushed to recommend that users disable Hyperthreading, and shortly afterward Intel released a microcode update to protect against the issue.
While savvy organizations will be quick to follow such instructions and apply patches to protect against the known vulnerabilities through code, protecting against the vulnerabilities yet to come is possible only through solid and safe architectural and operational decisions. For example, if you fundamentally can’t trust your neighbors, you’re better off not having any.
One good example of how wise decisions can increase security is the decision by the major cloud providers to not oversubscribe CPUs. Because these new attacks rely on threads on the same core being made visible to each other, the decision of never assigning parts of a core, or sharing the same core with different VMs, mitigates these particular attacks.
Against future attacks, security-conscious developers and infrastructure maintainers can protect themselves by minimizing the amount of shared infrastructure; this includes VM and container infrastructure. Shared infrastructure was always a theoretical concern, but the new flaws catapult it front and center.
As usual, security is in a constant tug-of-war with economics and flexibility: at the application layer, the microservices architecture is just too convenient and powerful to pass up, so developers may want to look at alternative ways to make sure they are safe in the presence of side-channel attacks.
For infrastructure, the situation is different. Modern databases adopt horizontally scalable architectures, and as a side effect of that usually don’t worry much about scaling up. On the other hand, public cloud providers have been trying to benefit from economies of scale and get the most out of their instances, which often means they are trying to encourage multi-tenancy in large physical servers.
Users also have the choice of being the single tenant on their own bare-metal instances, which implicitly protects against side-channel attacks by enforcing isolation. But if horizontal scaling is imposed by the database’s inability to scale up well, the extra resources of a large physical server will not translate into better performance or a reduced node count; they will simply be wasted, tipping the balance in favor of enduring a systemic security risk.
This is one of the reasons Scylla is such an attractive piece of infrastructure for engineers who want to protect against side-channel attacks: Scylla scales linearly both up and out, and the user can choose between doubling the size of their machines while halving their number, or the other way around. It can efficiently utilize all the cores of a big bare-metal box, meaning service providers aren’t losing out on idle resources, and customers, seeing high system utilization, get the most return on their infrastructure investments.
In Scylla, all operations, from foreground ingestion and reads to maintenance tasks like data streaming and compaction, scale linearly with the amount of resources (disk speed, number of cores, etc.).
Another key benefit of scaling up is minimizing your attack surface. A few large nodes are far easier to defend than a sprawling horizontal cluster. Thus, if you redeploy from a sprawling containerized horizontal topology to a vertically scaled bare-metal or reserved-instance topology, many of the risks associated with ZombieLoad and other side-channel attacks are greatly reduced or even effectively eliminated.
At Scylla, we believe users should never have to make the fool’s choice between economics and security. Scylla’s ability to scale up and out ensures that users can run larger machines, effectively utilizing the entire physical server, without breaking the bank. With nobody listening to the side channel, side-channel attacks, present or future, are of no concern.
ZombieLoad Survival Pocket Guide
- If you can’t trust your neighbors, don’t have any; avoid multi-tenancy where possible.
- If you need hyperthreading for performance, make sure you are running systems that are isolated; again, avoid multi-tenancy if you can and use applications that dominate entire cores.
- Reduce your threat surface by having fewer, taller boxes. But make sure your systems, like your database, can utilize those resources and scale vertically.
The post The ZombieLoad Pragmatist: Tips for Surviving in a Post-Meltdown World appeared first on ScyllaDB.
Managing the trade-off between consistency and availability is nothing new in distributed databases. It’s such a well-known issue that there is a theorem to describe it.
While modern databases don’t tend to fall neatly into categories, the “CAP” theorem (also known as Brewer’s theorem) is still a useful place to start. The CAP theorem states that a database can’t simultaneously guarantee consistency, availability, and partition tolerance. Partition tolerance refers to the idea that a database can continue to run even if network connections between groups of nodes are down or congested.
Since network failures are a fact of life, we pretty much need partition tolerance, so, from a practical standpoint, distributed databases tend to be either “CP” (meaning they prioritize consistency over availability) or “AP” (meaning they prioritize availability over consistency).
Apache Cassandra is usually described as an “AP” system, meaning it errs on the side of ensuring data availability even if this means sacrificing consistency. This is a bit of an over-simplification because Cassandra seeks to satisfy all three requirements simultaneously and can be configured to behave much like a “CP” database.
Replicas ensure data availability
When Cassandra writes data, it typically writes multiple copies (usually three) to different cluster nodes. This ensures that data isn’t lost if a node goes down or becomes unavailable. A replication factor, specified when a keyspace is created, controls how many copies of data are written.
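In CQL, for example, the replication factor is specified when a keyspace is created (the keyspace and datacenter names below are illustrative):

```sql
-- 'my_ks' and 'dc1' are placeholder names for this example.
CREATE KEYSPACE my_ks
  WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': 3};
```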
When data is written, it takes time for updates to propagate across networks to remote hosts. Sometimes hosts will be temporarily down or unreachable. Cassandra is described as “eventually consistent” because it doesn’t guarantee that all replicas will always have the same data. This means there is no guarantee that the data you read is up to date. For example, if a data value is updated, and another user queries a replica to read the same data a few milliseconds later, the reader may end up with an older version of the data.
Tunable consistency in Cassandra
To address this problem, Cassandra maintains tunable consistency. When performing a read or write operation a database client can specify a consistency level. The consistency level refers to the number of replicas that need to respond for a read or write operation to be considered complete.
For reading non-critical data (the number of “likes” on a social media post, for example), it’s probably not essential to have the very latest data. You can set the consistency level to ONE and Cassandra will simply retrieve a value from the closest replica. If you’re concerned about accuracy, however, you can specify a higher consistency level, like TWO, THREE, or QUORUM. If a QUORUM (essentially a majority) of replicas reply, and if the data was written with similarly strong consistency, users can be confident that they have the latest data. If there are inconsistencies between replicas when data is read, Cassandra will internally manage a process to ensure that replicas are synchronized and contain the most recent data.
The same process applies to write operations. Specifying a higher consistency level forces multiple replicas to be written before a write operation can complete. For example, if “ALL” or “THREE” are specified when updating a table with three replicas, data will need to be updated to all replicas before a write can complete.
There is a trade-off between consistency and availability here, as well. If one of the replicas is down or unreachable, the write operation will fail since Cassandra cannot meet the required consistency level. In this case, Cassandra sacrifices availability to guarantee consistency.
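The arithmetic behind “essentially a majority” is simple, and the usual rule of thumb is that reads are strongly consistent when the read and write replica sets must overlap, i.e. R + W > RF. A quick sketch of that arithmetic (not part of Cassandra’s API, just the math):

```python
def quorum(replication_factor: int) -> int:
    """Quorum is a majority of replicas: floor(RF / 2) + 1."""
    return replication_factor // 2 + 1

def is_strongly_consistent(read_replicas: int, write_replicas: int,
                           replication_factor: int) -> bool:
    """Reads are guaranteed to see the latest write when the read and
    write replica sets must overlap in at least one replica."""
    return read_replicas + write_replicas > replication_factor

# With RF = 3, QUORUM is 2, so QUORUM reads plus QUORUM writes overlap.
print(quorum(3))                        # 2
print(is_strongly_consistent(2, 2, 3))  # True
print(is_strongly_consistent(1, 1, 3))  # False
```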
Trade-offs between performance and consistency
So far we haven’t talked about performance, but there is also a strong relationship between consistency and performance. While using a high consistency level helps ensure data accuracy, it significantly impacts latency. For example, in the case of a read operation, rather than retrieving data that is possibly cached on the closest replica, Cassandra needs to check with multiple replicas, some of which may be in remote data centers.
Additional consistency levels address other considerations impacting performance and consistency, such as whether a quorum reached in a single data center is sufficient or a quorum needs to be reached across multiple data centers.
Cassandra can be tailored to application requirements
The good news for developers and database administrators is that these behaviors are highly configurable. Consistency can be set individually for each read and write operation, allowing developers to precisely control how they wish to manage trade-offs between consistency, availability, and performance.
Geospatial Anomaly Detection: Part 1 – Massively Scalable Geospatial Anomaly Detection with Apache Kafka and Cassandra
Recently we announced that we had reached the end of the Anomalia Machina blog series, but like Star Trek, the universe just keeps on expanding with endless new versions (more than 726 Star Trek episodes and movies as of 2016!). I guess this makes this blog an anomaly of sorts.
Part 1: The Problem and Initial Ideas
This blog will introduce the problem of Geospatial Anomaly Detection and investigate some initial Cassandra data models based on using latitude and longitude locations.
1. Space: The Final Frontier
Space: the final frontier. These are the voyages of the starship Enterprise. Its continuing mission: to explore strange new worlds. To seek out new life and new civilizations. To boldly go where no one has gone before! [Captain Picard]
Can you imagine what we’ll find? [Captain Kirk]
Alien despots hell-bent on killing us? Deadly spaceborne viruses and bacteria? Incomprehensible cosmic anomalies that could wipe us out in an instant! [Doctor ‘Bones’ McCoy]
It’s gonna be so much fun. [Captain Kirk]
In recent news (10 April), it is apparent that we don’t need a starship to encounter cosmic anomalies! Just a planet-sized telescope (the Event Horizon Telescope) producing data at 64 gigabits per second for 7 days, time-stamped with hydrogen maser atomic clocks (which lose only one second every 100 million years), resulting in 5 petabytes of data, transferred on half a ton of hard disks to a single location, stitched together on two “correlator” supercomputers, and then processed on commodity cloud and GPU computers. After two years, all this data was finally turned into an actual photograph of a real cosmic anomaly: the supermassive black hole at the centre of the distant (53 million light years) M87 galaxy. This black hole definitely counts as a cosmic anomaly; it’s billions of times more massive than the Sun, larger than our solar system, and swallows 90 Earths a day. Degree of difficulty? Like “taking a picture of a doughnut placed on the surface of the moon” (from the earth, I guess, as it would be easy if you were on the moon next to the doughnut).
Also in the last month, I started watching “Project Blue Book”, about the shadowy world of UFO sightings and (maybe) alien encounters, which astronomer J. Allen Hynek and his colleague Captain Michael Quinn were tasked to investigate, classify (Hynek’s UFO Classification System has six categories, but only three kinds of “close encounter”), and (officially at least) debunk. Coincidentally, Canberra became the “drone capital of the world” with the launch of a world-first household drone delivery operation! I predict that the rate of UFO sightings will go up substantially. Time to start plotting those close encounters on a map!
A hand drawn map of close encounters from the Project Blue Book archives.
This combination of events was the stimulus for developing an extension of Anomalia Machina, our scalable real-time Anomaly Detection application using Apache Kafka and Cassandra and Kubernetes, for Geospatial Anomaly Detection. My Latin’s not very good, but I think that makes this new series: “Terra Locus Anomalia Machina”. Who knows? Maybe we’ll discover strange anomalies in drone deliveries, or even a new type of black hole?!
2. The Geospatial Anomaly Detection Problem
Space is big. Really big. You just won’t believe how vastly, hugely, mind-bogglingly big it is. I mean, you may think it’s a long way down the road to the chemist, but that’s just peanuts to space. Douglas Adams, The Hitchhiker’s Guide to the Galaxy
How would a geospatial anomaly detector work? The geospatial problem we want to solve is challenging. Let’s assume that we want to detect anomalies for events that are related in both space and time, but making no assumptions about where in space or how densely clustered they are. The events may be very sparse, spread uniformly across the whole of the planet, or very dense, with many occurring within metres of each other, or a mixture of both. As a result we can’t assume a finite number of fixed locations, or a minimum or maximum distance between events as being in “the same location”. We have both location and scale challenges.
To usefully represent location we really need some sort of coordinate system and a map with a scale. This map (from “Treasure Island”) is a good start as it has coordinates, scale, and potentially interesting locations marked (e.g. “Bulk of treasure here”):
Instead of a single integer ID (as in the original Anomalia Machina), each event will need location coordinates. We’ll use <latitude, longitude> coordinates, with 10m accuracy (assuming that location is determined by a GPS). Each event also has a single value corresponding to some measurement of interest at that location, e.g. radiation, the number of UFOs detected, or the trip length (of a drone or an Uber, etc.). We also assume that the events are potentially being produced by moving devices, so events may or may not come from the same exact location, and will definitely come from “new” locations.
For each new event coming into the system we want to find the nearest 50 events, in reverse order of time, and then run the anomaly detector over them. This means that nearer events will take precedence over more distant events, even if the more distant events are more recent. It also means that there may not be many events in a particular location (initially, or ever), so the system may have to look an arbitrary distance away to find the nearest 50 events.
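As an in-memory sketch of this selection rule (illustrative only; the real system must do this with database queries at scale): pick the k nearest events by distance, then return them newest-first.

```python
import math

def nearest_events(events, x, y, k=50):
    """events: list of (x, y, timestamp, value) tuples on a flat 2D plane.
    Select the k nearest events by distance, then order them newest-first."""
    by_distance = sorted(events, key=lambda e: math.hypot(e[0] - x, e[1] - y))
    return sorted(by_distance[:k], key=lambda e: e[2], reverse=True)

events = [
    (0.0, 0.0, 100, 1.5),  # near, old
    (0.1, 0.1, 200, 2.0),  # near, newer
    (5.0, 5.0, 300, 9.9),  # far, newest
]
# Distance wins over recency: the far-but-recent event is excluded.
print(nearest_events(events, 0.0, 0.0, k=2))
```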
For example, if the events are sparsely distributed, then the nearest (4 for simplicity) events (blue) are further away from the latest event (red) (we assume these are all on a flat 2d plane with locations as <x,y> cartesian coordinates similar to the Treasure Island Map):
If events are denser around a particular location (red) then the (4) nearest events (blue) can be found nearby:
Note that as events build up over time there may be “hotspots” where the number of events becomes too high and the older events become irrelevant sooner. We assume that there is a mechanism that can expire events to prevent this becoming a problem. As we will be using Cassandra for data storage, the Cassandra TTL mechanism may be adequate, or we may need a more sophisticated approach to expire events in hotspots faster than other more sparsely populated locations.
However, this all depends on how we implement notions of location and proximity. The above examples used a 2D cartesian coordinate system, which looks similar to the Mercator map projection invented in 1569. It’s a cylindrical projection: parallels and meridians are straight and perpendicular to each other, which makes the earth look flat and rectangular. However, only sizes along the equator are correct, and it hugely distorts the size of Greenland and Antarctica, but it’s actually very good for navigation, and a variant is widely used by online mapping services.
So let’s see how far we get with what has the appearance of a flat earth theory!
3. Latitude and Longitude
To modify the Anomalia Machina application to work with geospatial data we need to (1) modify the Kafka load generator so that it produces data with a geospatial location as well as a value (i.e. we replace the original ID integer key with a geospatial key), (2) write the new data type to Cassandra, and (3) for a given geospatial key, query Cassandra for the nearest 50 events in reverse time order.
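As a sketch of step (1), each generated event can carry a <latitude, longitude> key plus a measured value (the field names and value ranges here are made up for illustration):

```python
import random

def make_event(rng: random.Random) -> dict:
    """Generate one synthetic event with a geospatial key and a value.
    Coordinates are rounded to 4 decimal places, roughly 10m of latitude."""
    return {
        "lat": round(rng.uniform(-90.0, 90.0), 4),
        "long": round(rng.uniform(-180.0, 180.0), 4),
        "value": rng.uniform(0.0, 100.0),  # e.g. a radiation reading
    }

rng = random.Random(42)  # seeded for reproducibility
print(make_event(rng))
```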
The obvious way of representing location is with <latitude, longitude> coordinates, so an initial naive implementation used a pair of decimal <latitude, longitude> values as the geospatial key. But in order to store the new data type, we need to think about what the Cassandra table schema should be.
As is usual with Cassandra, the important question is: what should the primary key look like? Cassandra primary keys are made up of two parts: a partition key (simple or compound, which Cassandra uses to determine which node the row is stored on), and zero or more clustering keys (which determine the order of the results). We could use a compound primary key using latitude and longitude, but this would only enable us to look up locations that match exactly with the key. An alternative is to use a different partition key, perhaps the name of the country (assuming we have a simple way of converting <latitude, longitude> into countries, perhaps including oceans to cope with locations at sea), and include <latitude, longitude> as extra columns as follows:
CREATE TABLE latlong (
    country text,
    time timestamp,
    lat double,
    long double,
    PRIMARY KEY (country, time)
) WITH CLUSTERING ORDER BY (time DESC);
Country is the partition key, and time is the clustering key. To find events in an exact location we tried this query:
select * from latlong where country='nz' and lat=-39.1296 and long=175.6358 limit 50;
However, this query produces a message saying:
“Cannot execute this query as it might involve data filtering and thus may have unpredictable performance. If you want to execute this query despite the performance unpredictability, use ALLOW FILTERING”
Adding “allow filtering” to the query:
select * from latlong where country='nz' and lat=-39.1296 and long=175.6358 limit 50 allow filtering;
This works, and returns up to 50 records at the same exact location in reverse time order. (The location is Mount Tongariro in New Zealand; perhaps we want to check if it’s about to erupt. Its last major eruption was in 1975, which I remember well, as I lived “nearby” on a volcanic scale, less than 100km away.)
However, how does this help us find the 50 nearest events to a location? “Nearest” requires an ability to compute distance. How do you compute the distance between two locations? This would be easy if we went along with the illusion of the above maps, that the earth is flat and we are using <x, y> cartesian coordinates: something like d = sqrt((x2 - x1)^2 + (y2 - y1)^2).
It’s time to move beyond a Flat Earth theory (with the associated dangers of falling off the edges and difficulty finding the South Pole). Latitude and Longitude are actually angles (degrees, minutes and seconds) and the earth is really a sphere (approximately). Latitude is degrees North and South of the Equator, Longitude is degrees East and West (of somewhere).
Degrees, minutes and seconds, can be converted to/from decimal degrees. The location of Mount Tongariro in degrees, minutes and seconds is: Latitude: -39° 07′ 60.00″ S Longitude: 175° 38′ 59.99″ E, which in decimal is: <-39.1296, 175.6358>.
This diagram shows the decimal degrees (latitude is φ, longitude is λ). A latitude of 0 is at the equator. Positive latitudes are north of the equator (90 is the North Pole), negative latitudes are south of the equator (-90 is the South Pole). Latitudes are also called parallels, as they are circles parallel to the equator. Positive longitudes are east of an arbitrary line, the “Prime Meridian” (which runs through Greenwich in London, with a longitude of 0), and negative longitudes are west of the Prime Meridian. Latitudes range from -90 to 90 degrees, and longitudes range from -180 to 180 degrees.
The Mercator map is actually a type of cylindrical projection of a sphere onto a cylinder like this:
The theoretically correct calculation of distance between two latitude (φ) and longitude (λ) points is non-trivial. It uses the Haversine formula, where r is the radius of the earth (6371 km) and the lat/long angles are in radians: d = 2r · arcsin( sqrt( sin²((φ2 − φ1)/2) + cos(φ1) · cos(φ2) · sin²((λ2 − λ1)/2) ) ).
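In code, the Haversine calculation looks like this (a straightforward sketch using r = 6371 km):

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two <latitude, longitude> points,
    given in decimal degrees, using the Haversine formula."""
    r = 6371.0  # mean radius of the earth in km
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))

# One degree of longitude along the equator is roughly 111 km.
print(haversine_km(0.0, 0.0, 0.0, 1.0))  # ~111.19
```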
But even assuming we can perform this calculation (maybe with a Cassandra UDF), the use of “allow filtering” makes it impractical to use in production as the processing time will continue to grow with partition size.
4. Bounding Box
Using a simpler approximation for distance, such as a bounding box calculation, means we can use inequalities (>=, <=) to compute whether a point (x2, y2) is approximately within some distance (d) of another point (x, y). This is straightforward for simple (x, y) coordinates; the calculation for latitude and longitude is more complex. It requires converting latitude and longitude to distance (each degree of latitude is approximately 111km and constant, as latitudes are always parallel, but a degree of longitude is 111km at the equator and shrinks to zero at the poles), and careful handling of boundary conditions near the poles (90 and -90 degrees latitude) and near -180 and 180 degrees longitude (the “antimeridian”, which is the basis for the International Date Line, directly opposite the Prime Meridian).
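A sketch of the degree-to-distance conversion (ignoring the pole and antimeridian edge cases noted above):

```python
import math

def bounding_box(lat, lon, half_side_km):
    """Return (lat_min, lat_max, lon_min, lon_max) for a box centred on (lat, lon).
    One degree of latitude is ~111 km; a degree of longitude shrinks by cos(lat).
    Edge cases near the poles and the antimeridian are NOT handled here."""
    dlat = half_side_km / 111.0
    dlon = half_side_km / (111.0 * math.cos(math.radians(lat)))
    return (lat - dlat, lat + dlat, lon - dlon, lon + dlon)

# A box with ~100 km sides centred on Mount Tongariro:
print(bounding_box(-39.1296, 175.6358, 50.0))
```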
Here’s a bounding box query with sides of approximately 100km centred on Mount Tongariro, however, because we still need “allow filtering” it isn’t practical for production:
select * from latlong where country='nz' and lat >= -39.58 and lat <= -38.67 and long >= 175.18 and long <= 176.08 limit 50 allow filtering;
I wondered if indexing the latitude and longitude columns would remove the need for “allow filtering”. Indexing allows rows in Cassandra to be queried by columns other than those in the partition key. Some Cassandra indexing options include clustering columns, secondary indexes, and SASI indexes.
5.1 Clustering Columns
Even though we are using time as a clustering column already, we can also add latitude and longitude as clustering columns as follows:
CREATE TABLE latlong (
    country text,
    time timestamp,
    lat double,
    long double,
    PRIMARY KEY (country, lat, long, time)
) WITH CLUSTERING ORDER BY (lat DESC, long DESC, time DESC);
However, I discovered that a limitation of clustering columns is that to use an inequality on a clustering column in a “where” clause, all of the preceding columns must have equality relationships, so this isn’t going to work for the bounded box calculation. I.e. this query will work without “allow filtering” (but isn’t correct as it only matches exact latitudes, not ranges):
select * from latlong where country='nz' and lat = -39.58 and long >= 175.18 and long <= 176.08 limit 50;
5.2 Secondary Indexes
create index i1 on latlong (lat);
create index i2 on latlong (long);
However, secondary indexes appear to have the same select-query restrictions as clustering columns, so this didn’t get us any further.
There is another type of index available, called SASI (SSTable Attached Secondary Index), which is included in Cassandra by default. SASI supports complex queries more efficiently than the default secondary indexes, including:
- AND or OR combinations of queries.
- Wildcard search in string values.
- Range queries.
SASI indexes are used as follows. Note that you can’t have secondary and SASI indexes on the same column, so you need to drop the secondary indexes first.
create custom index i3 on latlong (long) using 'org.apache.cassandra.index.sasi.SASIIndex';
create custom index i4 on latlong (lat) using 'org.apache.cassandra.index.sasi.SASIIndex';
As expected, this did enable more powerful queries, but the complete bounding box query still needed “allow filtering”. Upon further investigation, the SASI documentation on Compound Queries says that even though “allow filtering” must be used with 2 or more column inequalities, there is actually no filtering taking place, so you could probably use a bounded box query efficiently in practice.
Over the next blog(s) we’ll continue to explore feasible solutions, try some of them out in the Anomalia Machina application, and see how well they work in practice. Given that latitude and longitude locations have proved tricky to use as a basis for proximity queries, we’ll next explore an approach that needs only single-column locations and equality in the query: Z-order curves, more commonly known as Geohashes. These map multidimensional data to one dimension while preserving the locality of the data points. They are not to be confused with Geohashing, an (extreme?) sport that involves going to random geohash locations and proving that you’ve been there! In the meantime, can you find these geohashes: ‘rchcm5brvd4cn’, ‘sunny’, ‘fur’, ‘reef’, ‘geek’, ‘ebenezer’ and ‘queen’? (Travelling to them is optional.)
Next Blog: Terra-Locus Anomalia Machina – Part 2: Geohashes, and 3 dimensions
It looks like we’ve been beaten to discovering a new type of cosmic anomaly, as another black hole was recently discovered which is eating its companion star, and spinning so fast it’s pulling space and time around with it (‘Bones’ was correct).
Connect to your cluster using any of the drivers for Apache Cassandra™, which come in a variety of languages, including Java, Python, C++, C#, Node.js, Ruby, and PHP.
Below we'll go through the steps to create a simple Java application using version 3.7.1 of the DataStax Java Driver for Apache Cassandra™.
Note: There are API changes for newer versions of the Java driver (4.0+). Please make sure you use the appropriate version for this example.
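A minimal sketch of what connecting looks like with the 3.x-series API (the contact point is a placeholder; a real application would use your cluster’s addresses and credentials):

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

public class QuickStart {
    public static void main(String[] args) {
        // Build a Cluster pointing at one or more contact points,
        // then open a Session for executing queries.
        try (Cluster cluster = Cluster.builder()
                .addContactPoint("127.0.0.1")
                .build();
             Session session = cluster.connect()) {
            // A simple sanity-check query against the system keyspace
            ResultSet rs = session.execute("SELECT release_version FROM system.local");
            Row row = rs.one();
            System.out.println(row.getString("release_version"));
        }
    }
}
```

In driver 4.0+ the entry point changed (a single `CqlSession` replaces `Cluster`/`Session`), which is why the version note above matters.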
If you’re reading this blog, you’re probably already aware of the power and speed offered by Scylla. Maybe you’re already using Scylla in production to support high I/O real-time applications. But what happens when you come across a problem that requires some features that aren’t easy to come by with native Scylla?
I’ve recently been working on a Master Data Management (MDM) project with a large cybersecurity company. Their analytics team is looking to deliver a better, faster, and more insightful view of customer and supply chain activity to users around the business.
But anyone who’s tried to build such a solution knows that one of the chief difficulties is encompassing the sheer number and complexity of existing data sources. These data sources often serve as the backends to systems that were designed and implemented independently. Each holds a piece of the overall picture of customer activity. In order to deliver a true solution, we need to be able to bring this disparate data together.
This process requires not just performance and scalability, but also flexibility and the ability to iterate quickly. We seek to surface important relationships, and make them an explicit part of our data model itself.
A graph data system, built with JanusGraph and backed by the power of Scylla, is a great fit for solving this problem.
What is a graph data system?
We’ll break it down into 2 pieces:
- Graph – we’re modeling our data as a graph, with vertices and edges representing each of our entities and relationships
- Data system – we’re going to use several components to build a single system with which to store and retrieve our data
There are several options for graph databases out there on the market, but when we need a combination of scalability, flexibility, and performance, we can look to a system built from JanusGraph, Scylla, and Elasticsearch.
At a high level, here’s how it looks:
Let’s highlight 3 core areas in which this data system really shines:
- Flexibility
- Schema Enforcement
- OLTP + OLAP Support
1. Flexibility
While we certainly care about data loading and query performance, the killer feature of a graph data system is flexibility. Most databases lock you into a data model once you’ve started. You define some tables that support your system’s business logic, and then you store and retrieve data from those tables. When you need to add new data sources, deliver a new application feature, or ask innovative questions, you had better hope it can be done within the existing schema! If you have to pop the hood on the existing data schema, you’ve begun a time-consuming and error-prone change management process.
Unfortunately, this isn’t the way businesses grow. After all, the most valuable questions to ask today are the ones we didn’t even conceive of yesterday.
Our graph data system, on the other hand, allows us the flexibility to evolve our data model over time. This means that as we learn more about our data, we can iterate on our model to match our understanding, without having to start from scratch. (Check out this article for a more complete walkthrough of the process).
What does this get us in practice? It means we can incorporate fresh data sources as new vertices and edges, without breaking existing workloads on our graph. We can also immediately write query results into our graph — eliminating repetitive OLAP workloads that run daily only to produce the same results. Each new analysis can build on those that came before, giving us a powerful way of sharing production-quality results between teams around the business. All this means that we can answer ever-more insightful questions with our data.
2. Schema Enforcement
While schema-lite may seem nice at first glance, using such a database means that we’re off-loading a lot of work into our application layer. The first- and second-order effects are replicated code across multiple consumer applications, written by different teams in different languages. It’s a huge burden to enforce logic that should really be contained within our database layer.
JanusGraph offers flexible schema support that saves us pain without becoming a hassle. It has solid datatype support out of the box, with which we can pre-define the properties that a given vertex or edge may contain, without requiring that each vertex contain all of the defined properties. Likewise, we can define which edge types are allowed to connect a pair of vertices, without forcing every such pair to actually have that edge. When we decide to define a new property for an existing vertex label, we aren’t forced to write that property for every vertex already stored in the graph; instead we can include it only on the applicable vertex insertions.
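As an illustrative sketch (the labels and property here are invented for the example, not taken from the repo), the JanusGraph management API lets us declare these constraints explicitly:

```groovy
mgmt = graph.openManagement()
person  = mgmt.makeVertexLabel('Person').make()
company = mgmt.makeVertexLabel('Company').make()
worksAt = mgmt.makeEdgeLabel('WORKS_AT').multiplicity(Multiplicity.MANY2ONE).make()
title   = mgmt.makePropertyKey('title').dataType(String.class).make()

// Constrain which elements MAY carry which properties and edges --
// without requiring that every Person vertex actually has them
mgmt.addProperties(person, title)
mgmt.addConnection(worksAt, person, company)
mgmt.commit()
```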
This method of schema enforcement is immediately beneficial to managing a large dataset – especially one that will be used for MDM workloads. It simplifies testing requirements as our graph sees new use cases, and cleanly separates between data integrity maintenance and business logic.
3. OLTP + OLAP Support
Just like with any data system, we can separate our workloads into 2 categories – transactional and analytical. JanusGraph follows the Apache TinkerPop project’s approach to graph computation. Big picture, our goal is to “traverse” our graph, traveling from vertex to vertex by means of connecting edges. We use the Gremlin graph traversal language to do this. Luckily, we can use the same Gremlin traversal language to write both OLTP and OLAP workloads.
Transactional workloads begin with a small number of vertices (found with the help of an index), and then traverse across a reasonably small number of edges and vertices to return a result or add a new graph element. We can describe these transactional workloads as graph local traversals. Our goal with these traversals is to minimize latency.
Analytical workloads require traversing a substantial portion of the vertices and edges in the graph to find our answer. Many classic analytical graph algorithms fit into this bucket. We can describe these as graph global traversals. Our goal with these traversals is to maximize throughput.
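To make the distinction concrete, here are two Gremlin sketches against the campaign-contribution schema used later in this post (the amount property and the candidate name are illustrative assumptions, not guaranteed to match the repo’s schema):

```groovy
// OLTP: graph-local traversal -- start from one indexed vertex,
// walk a small number of edges, minimize latency
g.V().has('Candidate', 'name', 'SANDERS, BERNARD').
    in('CONTRIBUTION_TO').
    values('amount').sum()

// OLAP: graph-global traversal -- touches every Contribution
// vertex in the graph, maximize throughput
g.V().hasLabel('Contribution').
    groupCount().by(out('CONTRIBUTION_TO').values('name'))
```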
With our JanusGraph – Scylla graph data system, we can blend both capabilities. Backed by the high-IO performance of Scylla, we can achieve scalable, single-digit millisecond response for transactional workloads. We can also leverage Spark to handle large scale analytical workloads.
Deploying our Data System
This is all well and good in theory, so let’s go about actually deploying this graph data system. We’ll walk through a deployment on Google Cloud Platform, but everything described below should be replicable on any platform you choose.
Here’s the design of the graph data system we’ll be deploying:
There are three key pieces to our architecture:
- Scylla – our storage backend, the ultimate place where our data gets stored
- Elasticsearch – our index backend, speeding up some searches, and delivering powerful range and fuzzy-match capabilities
- JanusGraph – provides our graph itself, either as a server or embedded in a standalone application
We’ll be using Kubernetes for as much of our deployment as possible. This makes scaling and deployment easy and repeatable regardless of where exactly the deployment is taking place.
Whether we choose to use Kubernetes for Scylla deployment itself depends on how adventurous we are! The Scylla team has been hard at work putting together a production-ready deployment for Kubernetes: Scylla Operator. Currently available as an Alpha release, Scylla Operator follows the CoreOS Kubernetes “Operator” paradigm. While I think this will eventually be a fantastic option for a 100% k8s deployment of JanusGraph, for now we’ll look at a more traditional deployment of Scylla on VMs.
In order to follow along, you can find all of the required setup and load scripts in https://github.com/EnharmonicAI/scylla-janusgraph-examples. You’ll want to change some options as you move to production, but this starting point should demonstrate the concepts and get you moving along quickly.
It’s best to run this deployment from a GCP VM with full access to Google Cloud APIs. You can use one you already have, or create a fresh VM.
gcloud compute instances create deployment-manager \
  --zone us-west1-b \
  --machine-type n1-standard-1 \
  --scopes=https://www.googleapis.com/auth/cloud-platform \
  --image 'centos-7-v20190423' --image-project 'centos-cloud' \
  --boot-disk-size 10 --boot-disk-type "pd-standard"
Then ssh into the VM:
gcloud compute ssh deployment-manager
[ryan@deployment-manager ~]$ ...
We’ll assume everything else is run from this GCP VM. Let’s install some prereqs:
sudo yum install -y bzip2 kubectl docker git
sudo systemctl start docker
curl -O https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
sh Miniconda3-latest-Linux-x86_64.sh
We can use Scylla’s own Google Compute Engine deployment script as a starting point to our Scylla cluster deployment.
git clone https://github.com/scylladb/scylla-code-samples.git
cd scylla-code-samples/gce_deploy_and_install_scylla_cluster
Create a new conda environment and install a few required packages, including Ansible.
conda create --name graphdev python=3.7 -y
conda activate graphdev
pip install ruamel-yaml==0.15.94 ansible==2.7.10 gremlinpython==3.4.0 absl-py==0.7.1
We’ll also do some ssh key housekeeping and apply project-wide metadata, which will simplify connecting to our instances.
touch ~/.ssh/known_hosts
SSH_USERNAME=$(whoami)
KEY_PATH=$HOME/.ssh/id_rsa
ssh-keygen -t rsa -f $KEY_PATH -C $SSH_USERNAME
chmod 400 $KEY_PATH
gcloud compute project-info add-metadata \
  --metadata ssh-keys="$SSH_USERNAME:$(cat $KEY_PATH.pub)"
We’re now ready to set up our cluster. From within the gce_deploy_and_install_scylla_cluster directory we cloned above, we’ll run the gce_deploy_and_install_scylla_cluster.sh script. We’ll be creating a cluster of three Scylla 3.0 nodes, each as an n1-standard-16 VM with 2 NVMe local SSDs.
./gce_deploy_and_install_scylla_cluster.sh \
  -p symphony-graph17038 \
  -z us-west1-b \
  -t n1-standard-4 \
  -n -c2 \
  -v3.0
It will take a few minutes for configuration to complete, but once this is done we can move on to deploying the rest of our JanusGraph components.
git clone https://github.com/EnharmonicAI/scylla-janusgraph-examples cd scylla-janusgraph-examples
Every command that follows will be run from this top-level directory of the cloned repo.
Setting up a Kubernetes cluster
To keep our deployment flexible and not locked into any cloud provider’s infrastructure, we’ll be deploying everything else via Kubernetes. Google Cloud provides a managed Kubernetes cluster through their Google Kubernetes Engine (GKE) service.
Let’s create a new cluster with enough resources to get started.
gcloud container clusters create graph-deployment \
  --project [MY-PROJECT] \
  --zone us-west1-b \
  --machine-type n1-standard-4 \
  --num-nodes 3 \
  --cluster-version 1.12.7-gke.10 \
  --disk-size=40
We also need to create a firewall rule to allow GKE pods to access other non-GKE VMs.
CLUSTER_NETWORK=$(gcloud container clusters describe graph-deployment \
  --format=get"(network)" --zone us-west1-b)
CLUSTER_IPV4_CIDR=$(gcloud container clusters describe graph-deployment \
  --format=get"(clusterIpv4Cidr)" --zone us-west1-b)
gcloud compute firewall-rules create "graph-deployment-to-all-vms-on-network" \
  --network="$CLUSTER_NETWORK" \
  --source-ranges="$CLUSTER_IPV4_CIDR" \
  --allow=tcp,udp,icmp,esp,ah,sctp
There are a number of ways to deploy Elasticsearch on GCP – we’ll choose to deploy our ES cluster on Kubernetes as a Stateful Set. We’ll start out with a 3 node cluster, with 10 GB of disk available to each node.
kubectl apply -f k8s/elasticsearch/es-storage.yaml
kubectl apply -f k8s/elasticsearch/es-service.yaml
kubectl apply -f k8s/elasticsearch/es-statefulset.yaml
(Thanks to Bayu Aldi Yansyah and his Medium article for laying out this framework for deploying Elasticsearch)
Running Gremlin Console
We now have our storage and indexing backend up and running, so let’s define an initial schema for our graph. An easy way to do this is by launching a console connection to our running Scylla and Elasticsearch clusters.
Build and deploy a JanusGraph docker image to your Google Container Registry.
scripts/setup/build_and_deploy_janusgraph_image.sh -p [MY-PROJECT]
Update the k8s/gremlin-console/janusgraph-gremlin-console.yaml file with your project name to point to your GCR repository image name, and add the correct hostname of one of your Scylla nodes. You’ll notice in the YAML file that we use environment variables to help create a JanusGraph properties file, which we’ll use to instantiate our JanusGraph object in the console with a JanusGraphFactory.
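Those environment variables get rendered into an ordinary JanusGraph properties file. A hypothetical example of what the rendered file might contain (the hostnames here are made up and will differ in your deployment):

```properties
# janusgraph.properties -- storage backend is our Scylla cluster (CQL protocol),
# index backend is the Elasticsearch service running in Kubernetes
storage.backend=cql
storage.hostname=10.138.0.3
index.search.backend=elasticsearch
index.search.hostname=elasticsearch.default.svc.cluster.local
```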
Create and connect to a JanusGraph Gremlin Console:
kubectl create -f k8s/gremlin-console/janusgraph-gremlin-console.yaml
kubectl exec -it janusgraph-gremlin-console -- bin/gremlin.sh

         \,,,/
         (o o)
-----oOOo-(3)-oOOo-----
...
gremlin> graph = JanusGraphFactory.open('/etc/opt/janusgraph/janusgraph.properties')
We can now go ahead and create an initial schema for our graph. We’ll go through this at a high level here, but I walk through a more detailed discussion of the schema creation and management process in this article.
For this example, we’ll look at a sample of Federal Election Commission data on contributions to 2020 presidential campaigns. The sample data has already been parsed and had a bit of cleaning performed on it (shoutout to my brother, Patrick Stauffer, for allowing me to leverage some of his work on this dataset) and is included in the repo as resources/Contributions.csv.
Here’s the start of the schema definition for our contributions dataset:
mgmt = graph.openManagement()

// Define Vertex labels
Candidate = mgmt.makeVertexLabel("Candidate").make()

// Define Property keys
name = mgmt.makePropertyKey("name").
    dataType(String.class).cardinality(Cardinality.SINGLE).make()
filerCommitteeIdNumber = mgmt.makePropertyKey("filerCommitteeIdNumber").
    dataType(String.class).cardinality(Cardinality.SINGLE).make()

mgmt.addProperties(Candidate, name, filerCommitteeIdNumber)
mgmt.commit()
You can find the complete schema definition code in the scripts/load/define_schema.groovy file in the repository. Simply copy and paste it into the Gremlin Console to execute.
Once our schema has been loaded, we can close our Gremlin Console and delete the pod.
kubectl delete -f k8s/gremlin-console/janusgraph-gremlin-console.yaml
Deploying JanusGraph Server
Finally, let’s deploy JanusGraph as a server, ready to take client requests. We’ll be leveraging JanusGraph’s built-in support for Apache TinkerPop’s Gremlin Server, which means our graph will be accessible to a wide range of client languages, including Python.
Update the k8s/janusgraph/janusgraph-server-service.yaml file to point to your correct GCR repository image name. Deploying our JanusGraph Server is now as simple as:
kubectl apply -f k8s/janusgraph/janusgraph-server-service.yaml kubectl apply -f k8s/janusgraph/janusgraph-server.yaml
We’ll demonstrate access to our JanusGraph Server deployment by loading some initial data into our graph.
Anytime we load data into any type of database — Scylla, relational, graph, etc — we need to define how our source data will map to the schema that has been defined in the database. For graph, I like to do this with a simple mapping file. An example has been included in the repo, and here’s a small sample:
vertices:
  - vertex_label: Candidate
    lookup_properties:
      FilerCommitteeIdNumber: filerCommitteeIdNumber
    other_properties:
      CandidateName: name

edges:
  - edge_label: CONTRIBUTION_TO
    out_vertex:
      vertex_label: Contribution
      lookup_properties:
        TransactionId: transactionId
    in_vertex:
      vertex_label: Candidate
      lookup_properties:
        FilerCommitteeIdNumber: filerCommitteeIdNumber
This mapping is designed for demonstration purposes, so you may notice that there is repeated data in the mapping definition. This simple structure allows the client to remain relatively “dumb”, without forcing it to pre-process the mapping file.
Our example repo includes a simple Python script, load_from_csv.py, that takes a CSV file and a mapping file as input, then loads every row into the graph. It is generalized to take any mapping file and CSV you’d like, but it is single-threaded and not built for speed – it’s designed to demonstrate the concepts of data loading from a client.
python scripts/load/load_from_csv.py \
  --data ~/scylla-janusgraph-examples/resources/Contributions.csv \
  --mapping ~/scylla-janusgraph-examples/resources/campaign_mapping.yaml \
  --hostname [MY-JANUSGRAPH-SERVER-LOAD-BALANCER-IP] \
  --row_limit 1000
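The core of such a loader is translating each CSV row through the mapping into a vertex spec. Here is a simplified, self-contained sketch of just that translation step (the real script then turns each spec into Gremlin upserts over the server connection; the sample row values are hypothetical):

```python
import csv
import io

# A pared-down version of the vertex mapping shown above:
# CSV column names on the left, graph property names on the right.
MAPPING = {
    "vertex_label": "Candidate",
    "lookup_properties": {"FilerCommitteeIdNumber": "filerCommitteeIdNumber"},
    "other_properties": {"CandidateName": "name"},
}

def row_to_vertex(row: dict, mapping: dict) -> dict:
    """Translate one CSV row into a vertex spec via the mapping's column -> property names."""
    lookup = {prop: row[col] for col, prop in mapping["lookup_properties"].items()}
    other = {prop: row[col] for col, prop in mapping["other_properties"].items()}
    return {"label": mapping["vertex_label"], "lookup": lookup, "properties": other}

# Hypothetical sample data, standing in for resources/Contributions.csv
csv_data = io.StringIO(
    'FilerCommitteeIdNumber,CandidateName\n'
    'C0001234,"DOE, JANE"\n'
)
for row in csv.DictReader(csv_data):
    print(row_to_vertex(row, MAPPING))
```

The lookup properties are what the loader would use to find an existing vertex before inserting, which is what makes repeated loads idempotent.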
With that, we have our data system up and running, including defining a data schema and loading some initial records.
I hope you enjoyed this short foray into a JanusGraph – Scylla graph data system. It should provide you a nice starting point and demonstrate how easy it is to deploy all the components. Remember to shut down any cloud resources you don’t want to keep to avoid incurring charges.
We really just scratched the surface of what you can accomplish with this powerful system; hopefully your appetite has been whetted for more. Please send me any thoughts and questions, and I’m looking forward to seeing what you build!
The post Powering a Graph Data System with Scylla + JanusGraph appeared first on ScyllaDB.
In previous posts, we’ve talked about the benefits of the hybrid cloud. We’ve also offered some guidance about making the transition to the hybrid cloud. So you know about the “whys” and “hows” of moving to the hybrid cloud.
But those are really just two parts of a trifecta. The third part, and the topic of this post, focuses on one of the main keys to success with hybrid cloud: open source software.
The age of open source has arrived
As a recent article in Linux Journal observed, attitudes about open source software have significantly evolved over the last 30 years. Once regarded quite dubiously by mainstream enterprises, open source is now freely accepted and utilized by many of the world’s most successful organizations, including Netflix, Apple, and the United States government. Some of the leading tech companies that were early enemies of open source are now among its most strident and impactful supporters.
Back in 2009, the open source software market was forecast to top $8 billion within a few years. But the market blew past $11 billion in 2017, and the current MarketsandMarkets forecast predicts that the open source software services market will hit nearly $33 billion by 2022. That would represent a compound annual growth rate of almost 24%.
So the age of open source has truly arrived. Case in point: Google’s recently announced partnerships with many open source service providers, including DataStax. But why is open source so essential for hybrid cloud success?
Enabling hybrid cloud success with open source software
The power of the hybrid cloud can be captured in a single, simple sentence: it lets organizations utilize the optimal set of resources for each application workload. Because of this, hybrid cloud offers businesses and public sector agencies the ability to maximize the benefits of the cloud while minimizing costs.
How DataStax leverages open source
DataStax offers the only active everywhere database for hybrid cloud. It is, quite simply, the best available distributed database for hybrid cloud environments. But without open source, the DataStax solution wouldn’t be possible. That’s because DataStax utilizes the power of Apache Cassandra, an open source, distributed NoSQL database that provides:
- Scalability: Cassandra has the ability to both scale up (adding capacity to a single machine) and scale out (adding more servers) to tens of thousands of nodes.
- High availability: Cassandra’s masterless architecture enables the quick replication of data across multiple data centers and geographies. This feature powers the always-on benefit offered by DataStax.
- High fault tolerance: Cassandra provides automated fault tolerance. Cassandra’s masterless, peer-to-peer architecture and data replication capabilities support a level of system redundancy that ensures full application speed and availability, even when nodes go offline. And this capability is fully automated—no manual intervention is required.
- High performance: In today’s business climate speed is essential. Cassandra architecture minimizes the instances of high latency and bottlenecks that so often stifle productivity, and frustrate both internal users and customers. This is further enhanced in DSE, which offers more than 2x the performance of standalone open source Cassandra.
- Multi-data center and hybrid cloud support: Designed as a distributed system, Cassandra enables the deployment of large numbers of nodes across multiple data centers. It also supports cluster configurations optimal for geographic distributions, providing redundancy for failover and disaster recovery.
Simply put, the very best database solution for the hybrid cloud just wouldn’t exist without open source software. And that’s just one example among countless others that serves to illustrate how essential open source is in enabling the best benefits of the hybrid cloud.
Remember: there’s no such thing as free open source software
As everyone likely knows, the acronym “FOSS” stands for free, open source software. But putting open source software to work for your organization’s hybrid cloud initiative will not be free. As this ComputerWeekly article observed, “Open source is not ‘free’ because every deployment has a cost associated with it, but what it does allow for is choice, and an organization can go down one of two routes in adopting it.”
Those two routes, or choices, in implementing open source are:
- Install, manage, and support the open source software yourself and incur the training and learning-curve costs
- Pay a professional open source services provider to implement and support the open source solution
And sometimes, a combo of the two approaches might be best.
But even though FOSS isn’t free, it can provide you with a wonderful range of lower-cost options for transitioning to hybrid cloud. You can get started with free training on DataStax Academy, and you can also download the DataStax Distribution of Apache Cassandra for free.
I am an architect at SteelHouse, an Ad Tech platform. Our ad serving platform is distributed across the United States and the EU, and we strive to deliver the best and highest-performing real-time digital advertising possible.
Our ability to serve ads and log web events, to consume metadata and make decisions on it, is what drives the whole platform. There’s no point in serving an ad if we don’t analyze all the data up front to ensure we’re serving an appropriate ad on behalf of our customers. It’s also vital we deliver an experience that’s relevant to a consumer’s interests and increases their engagement.
Our stack is composed of NoSQL databases, with Kafka for all of our messaging. Our services run on the public cloud, with Kubernetes handling container management for our applications. The biggest driver of SteelHouse’s data requirements is what we call our pixel server. It handles billions of requests a month, and being so complex, the system is extremely sensitive to latency.
We used Cassandra for seven years as the backend data store for our globally distributed systems. Our Cassandra clusters habitually generated reams of timeouts, and it wasn’t possible for us to maintain our expected timeout SLA. At that point, we knew it was time to try something else. Things were not looking good with the Cassandra cluster, and if we switched over, they probably couldn’t get any worse!
One of our senior engineers, Dako Bogdanov, who’s well-versed in databases and C++, took a close look at Scylla and said, “Okay. These guys are the real deal.” Since he spends a lot of time tuning our data warehouse, our Postgres databases, he gave the rest of us confidence. When he sees something that he likes, it’s easy for the rest of us to get on board.
We had gotten used to the extensive and sometimes painful tuning required by Cassandra and the JVM. But once we set up the first Scylla cluster, we realized that Scylla mostly tunes itself. That seemed like a pretty impressive feat, knowing that the system could be installed, auto-tune, and be ready for a production workload. With Cassandra, the JVM tweaks and tuning are all manual and quite brittle. We were heartened to know that the engineering team at Scylla had put a lot of effort into making sure their product will perform well out of the box.
Once we saw how easy Scylla is to install, how easy it is to tune, and how great the performance is, we consolidated a few workloads from Cassandra. With Scylla, there’s less hardware to manage. That means fewer things can go wrong. With fewer boxes that might fail, our lives suddenly became much simpler.
Scylla’s performance was as good as advertised. As soon as we deployed Scylla, we saw a great improvement in terms of fewer timeouts being logged. The responsiveness and performance of applications improved as well. Once we switched from Cassandra over to Scylla, timeouts for those very sensitive systems simply disappeared. It was like night and day.
The switchover was a little intense and we were all nervous, but it ultimately worked out really well. Over the next few months, we started migrating some of the remaining clusters to Scylla. Black Friday/Cyber Monday is SteelHouse’s peak season, driving three to four times our normal volume. Scylla provided consistent, reliable performance throughout the entire holiday season.
People on the business side commented that it was our smoothest holiday season on record. It also turned out to be our largest holiday season ever. The fact that from a business perspective we performed better than we had ever before, and from a technical perspective, the systems were more stable with more load than ever before, is a true testament to Scylla.
Today, Scylla is handling the billions of requests we serve every month through reads and writes. For every request that comes to our pixel server, there are multiple reads and writes. Scylla easily handles the nine billion or so requests that might be interacting with our backend at any given time.
Our team now saves three to four hours a week that are no longer being spent on Cassandra maintenance. Between that scheduled and unscheduled work, we’re getting back roughly a 20% improvement in team productivity.
We knew we were taking a risk when we migrated from Cassandra to Scylla in production. But Cassandra pushed us to the edge, and our faith in Scylla turned out to be well-founded. Production migrations aren’t for everyone, so I do recommend that you look into migrating over to Scylla while you still have some time, a margin for error, and a safety net!
The post SteelHouse: How Our Real-Time Switchover from Cassandra to Scylla Saved the Day appeared first on ScyllaDB.