Massively Scalable Geospatial Anomaly Detection with Apache Kafka and Cassandra
In this blog, we continue exploring how to build a scalable Geospatial Anomaly Detector. In the previous blog, we introduced the problem and tried an initial Cassandra data model with locations based on latitude and longitude. We now try another approach, Geohashes (to start with, of the 2D kind), and have some close encounters of another kind (Space vs. Time). Note that Geohashes are easily confused with Geohashing, an outdoor adventure sport based on trying to reach a random Geohash, and Geocaching, a worldwide treasure hunt to find geocaches hidden at different locations. So, on with the adventure of the Geohashes kind!
1 Geohashes are “everywhere”
In the previous blog (Geospatial Anomaly Detection: Part 1) we discovered that efficient proximity querying over <latitude, longitude> location coordinates using inequalities is challenging in Cassandra.
Is there an alternative? Perhaps the experiments in the previous blog, using a country string as the partition key, give us a valuable clue? The Earth can be divided up in a lot of different (but somewhat arbitrary) ways such as by countries, continents, or even tectonic plates!
The major tectonic plates
What if there was a more “scientific” way of dividing the world up into different named areas, ideally hierarchical, with decreasing sizes, with a unique name for each? Perhaps more like how an address works. For example, using a combination of plate, continent, and country, what famous building is located here?
North American Plate ->
Continent of North America ->
District of Columbia ->
Pennsylvania Avenue ->
It turns out that there are many options for systematically dividing the planet up into nested areas. One popular method is a Geohash. Geohashes use “Z-order curves” to reduce multiple dimensions to a single dimension with fixed ordering (which turns out to be useful for database indexing). A geohash is a variable length string of alphanumeric characters in base-32 (using the digits 0-9 and lower case letters except a, i, l and o). E.g. “gcpuuz2x” (try pronouncing that!) is the geohash for Buckingham Palace, London. A geohash identifies a rectangular cell on the Earth: at each level, each extra character identifies one of 32 sub-cells as shown in the following diagram.
Shorter geohashes have larger areas, while longer geohashes have smaller areas. A single character geohash represents a huge 5,000 km by 5,000 km area, while an 8 character geohash is a much smaller area, 40m by 20m. Greater London is “gcpu”, while “gc” includes Ireland and most of the UK. To use geohashes you encode <latitude, longitude> locations to a geohash (which is not a point, but an area), and decode a geohash to an approximate <latitude, longitude> (the accuracy depends on the geohash length). Some geohashes are even actual English words. You can look up geohashes online (which is an easier type of adventure than geohashing or geocaching): “here” is in Antarctica, “there” is in the Middle East, and “everywhere” is (a very small spot) in Morocco! And here’s an online zoomable geohash map which makes it easy to see the impact of adding more characters to the geohash.
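To make the encoding concrete, here is a minimal sketch of the standard geohash algorithm in Python (for illustration only; it is not the Java library the application actually uses). Even-numbered bits refine longitude, odd-numbered bits refine latitude, and every group of 5 bits becomes one base-32 character:

```python
# Minimal geohash encoder (illustration only). Even bits refine
# longitude, odd bits latitude; every 5 bits map to one base-32 char.
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # no a, i, l or o

def geohash_encode(lat, lon, precision=8):
    lat_range = [-90.0, 90.0]
    lon_range = [-180.0, 180.0]
    bits = []
    for i in range(precision * 5):
        r = lon_range if i % 2 == 0 else lat_range
        v = lon if i % 2 == 0 else lat
        mid = (r[0] + r[1]) / 2
        if v >= mid:
            bits.append(1)
            r[0] = mid
        else:
            bits.append(0)
            r[1] = mid
    # pack each 5-bit group into a base-32 character
    return "".join(BASE32[int("".join(map(str, bits[i:i + 5])), 2)]
                   for i in range(0, precision * 5, 5))

# The classic reference point (57.64911, 10.40744) starts with "u4pru"
print(geohash_encode(57.64911, 10.40744, 5))  # → u4pru
```

The prefix property (a shorter geohash of a point is always a prefix of a longer geohash of the same point) is exactly what the Cassandra data models below rely on.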
Geohashes are a simpler version of bounding boxes that we tried in the previous blog, as locations with the same geohash will be near each other, as they are in the same rectangular area. However, there are some limitations to watch out for including edge cases and non-linearity near the poles.
To use geohashes with our Anomalia Machina application, we needed a Java implementation, so we picked this one (which was coincidentally developed by someone geospatially nearby to the Instaclustr office in Canberra). We modified the original Anomalia Machina code as follows. The Kafka key is a <latitude, longitude> pair. Once an event reaches the Kafka consumer at the start of the anomaly detection pipeline we encode it as geohash and write it to Cassandra. The query to find nearby events now uses the geohash. The modified anomaly detection pipeline looks like this:
But how exactly are we going to use the geohash in Cassandra? There are a number of options.
2 Implementation Alternatives
2.1 Geohash Option 1 – Multiple indexed geohash columns
The simplest approach I could think of for using geohashes with Cassandra was to have multiple secondary indexed columns, one column for each geohash length, from 1 character to 8 characters long (which gives a precision of +/- 19m, which we assume is adequate for this example).
The schema is as follows, with the 1 character geohash as the partition key, time as the clustering key, and the longer (but smaller area) geohashes as secondary indexes:
CREATE TABLE geohash1to8 (
  geohash1 text,
  time timestamp,
  geohash2 text,
  geohash3 text,
  geohash4 text,
  geohash5 text,
  geohash6 text,
  geohash7 text,
  geohash8 text,
  value double,
  PRIMARY KEY (geohash1, time)
) WITH CLUSTERING ORDER BY (time DESC);

CREATE INDEX i8 ON geohash1to8 (geohash8);
CREATE INDEX i7 ON geohash1to8 (geohash7);
CREATE INDEX i6 ON geohash1to8 (geohash6);
CREATE INDEX i5 ON geohash1to8 (geohash5);
CREATE INDEX i4 ON geohash1to8 (geohash4);
CREATE INDEX i3 ON geohash1to8 (geohash3);
CREATE INDEX i2 ON geohash1to8 (geohash2);
In practice the multiple indexes are used by searching from smallest to largest areas. To find the (approximately) nearest 50 events to a specific location (e.g. “everywhere”, shortened to an 8 character geohash, “everywhe”), we start querying with smallest area first, the 8 character geohash, and increase the area by querying over shorter geohashes, until 50 events are found, then stop:
select * from geohash1to8 where geohash1='e' and geohash8='everywhe' limit 50;
select * from geohash1to8 where geohash1='e' and geohash7='everywh' limit 50;
select * from geohash1to8 where geohash1='e' and geohash6='everyw' limit 50;
select * from geohash1to8 where geohash1='e' and geohash5='every' limit 50;
select * from geohash1to8 where geohash1='e' and geohash4='ever' limit 50;
select * from geohash1to8 where geohash1='e' and geohash3='eve' limit 50;
select * from geohash1to8 where geohash1='e' and geohash2='ev' limit 50;
select * from geohash1to8 where geohash1='e' limit 50;
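In client code, this cascade can be generated mechanically from the full-length geohash and executed until one query returns enough rows. A hedged sketch (it only builds the query strings; the table and column names follow the schema above, and actually running them requires a Cassandra session):

```python
def fallback_queries(geohash8, limit=50):
    """One query per geohash length, smallest area (8 chars) first,
    ending with the whole single-character partition."""
    queries = [
        f"select * from geohash1to8 where geohash1='{geohash8[0]}' "
        f"and geohash{n}='{geohash8[:n]}' limit {limit};"
        for n in range(8, 1, -1)  # geohash8 down to geohash2
    ]
    # final fallback: the entire single-character partition
    queries.append(f"select * from geohash1to8 "
                   f"where geohash1='{geohash8[0]}' limit {limit};")
    return queries

for q in fallback_queries("everywhe"):
    print(q)
```

A real client would execute these in order and stop at the first query that returns the full 50 rows.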
Spatial Distribution and Spatial Density
What are the tradeoffs with this approach? The extra data storage overhead of having multiple geohash columns, the overhead of multiple secondary indexes, the overhead of multiple reads, due to (potentially) searching multiple areas (up to 8) to find 50 events, and the approximate nature of the spatial search due to the use of geohashes. How likely we are to find 50 nearby events on the first search depends on spatial distribution (how spread out in space events are, broad vs. narrow) and spatial density (how many events there are in a given area, which depends on how sparse or clumped together they are).
For example, broad distribution and sparse density:
Broad distribution and clumped:
Narrow distribution and sparse:
Narrow distribution and clumped:
There’s a potentially nice benefit of using geohashes for location in our anomaly detection application. Because they are areas rather than highly specific locations, once an anomaly has been detected it’s automatically associated with an area which can be included with the anomaly event reporting. This may be more useful than just a highly specific location in some cases (e.g. for setting up buffer and exclusion zones, triggering more sophisticated but expensive anomaly detection algorithms on all the data in the wider area, etc.). This is the flip side of the better-known fact that geohashes are good for privacy protection by virtue of anonymizing the exact location of an individual. Depending on the hash length, the actual location of an event can be made as vague as required to hide the location (and therefore the identity) of the event producer.
Note that in theory, we don’t have to include the partition key in the query if we are using secondary indexes, i.e. this will work:
select * from geohash1to8 where geohash8=’everywhe’ limit 50;
The downside of this query is that every node is involved. The upside is that we are free to choose a partition key with sufficient cardinality to avoid having a few large partitions, which is not a good idea in Cassandra (see note below).
2.2 Geohash Option 2 – Denormalized Multiple Tables
There are a couple of different possible implementations of this basic idea. One is to denormalise the data and use multiple Cassandra tables, one for each geohash length. Denormalisation, duplicating data across multiple tables to optimise for queries, is common in Cassandra – “In Cassandra, denormalization is, well, perfectly normal” – so we’ll definitely try this approach.
We create 8 tables, one for each geohash length:
CREATE TABLE geohash1 (
  geohash text,
  time timestamp,
  value double,
  PRIMARY KEY (geohash, time)
) WITH CLUSTERING ORDER BY (time DESC);

…

CREATE TABLE geohash8 (
  geohash text,
  time timestamp,
  value double,
  PRIMARY KEY (geohash, time)
) WITH CLUSTERING ORDER BY (time DESC);
For each new event, we compute geohashes from 1 to 8 characters long, and write the geohash and the value to each corresponding table. This is fine as Cassandra is optimised for writes. The queries are now directed to each table from smallest to largest area geohashes until 50 events are found:
select * from geohash8 where geohash='everywhe' limit 50;
select * from geohash7 where geohash='everywh' limit 50;
select * from geohash6 where geohash='everyw' limit 50;
select * from geohash5 where geohash='every' limit 50;
select * from geohash4 where geohash='ever' limit 50;
select * from geohash3 where geohash='eve' limit 50;
select * from geohash2 where geohash='ev' limit 50;
select * from geohash1 where geohash='e' limit 50;
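The write-side fan-out is straightforward: compute every prefix of the 8-character geohash and write one row to each corresponding table. A sketch (statement strings only; the timestamp and the session are assumed):

```python
def geohash_prefixes(geohash8):
    """All prefixes, shortest first: one per denormalized table."""
    return [geohash8[:n] for n in range(1, len(geohash8) + 1)]

def insert_statements(geohash8, time_ms, value):
    """One INSERT per table geohash1..geohash8 (Cassandra is
    write-optimised, so 8x write amplification is acceptable)."""
    return [
        f"insert into geohash{n} (geohash, time, value) "
        f"values ('{p}', {time_ms}, {value});"
        for n, p in enumerate(geohash_prefixes(geohash8), start=1)
    ]

print(geohash_prefixes("everywhe"))
```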
2.3 Geohash Option 3 – Multiple Clustering Columns
Did you know that Cassandra supports multiple clustering columns? I had forgotten. So, another idea is to use clustering columns for the geohashes, i.e. instead of having multiple indexes, one for each geohash length, we could have multiple clustering columns:
CREATE TABLE geohash1to8_clustering (
  geohash1 text,
  time timestamp,
  geohash2 text,
  geohash3 text,
  geohash4 text,
  geohash5 text,
  geohash6 text,
  geohash7 text,
  geohash8 text,
  value double,
  PRIMARY KEY (geohash1, geohash2, geohash3, geohash4, geohash5, geohash6, geohash7, geohash8, time)
) WITH CLUSTERING ORDER BY (geohash2 DESC, geohash3 DESC, geohash4 DESC, geohash5 DESC, geohash6 DESC, geohash7 DESC, geohash8 DESC, time DESC);
Clustering columns work well for modelling and efficiently querying hierarchically organised data, so geohashes are a good fit. A single clustering column is often used to retrieve data in a particular order (e.g. for time series data), but multiple clustering columns are good for nested relationships, as Cassandra stores and locates clustering column data in nested sort order. The data is stored hierarchically, which the query must traverse (either partially or completely). To avoid “full scans” of the partition (and to make queries more efficient), a select query must restrict the higher level columns (in the sort order) with the equals operator. Ranges are only allowed on the last column in the query. A query does not need to include all the clustering columns, as it can omit the lower level ones.
Time Series data is often aggregated into increasingly longer bucket intervals (e.g. seconds, minutes, hours, days, weeks, months, years), and accessed via multiple clustering columns in a similar way. However, maybe only time and space (and other as yet undiscovered dimensions) are good examples of the use of multiple clustering columns? Are there other good examples? Maybe organisational hierarchies (e.g. military ranks are very hierarchical after all), biological systems, linguistics and cultural/social systems, and even engineered and built systems. Unfortunately, there doesn’t seem to be much written about modelling hierarchical data in Cassandra using clustering columns, but it does look as if you are potentially only limited by your imagination.
The query is then a bit trickier as you have to ensure that to query for a particular length geohash, all the previous columns have an equality comparison. For example, to query a length 3 geohash, all the preceding columns (geohash1, geohash2) must be included first:
select * from geohash1to8_clustering where geohash1='e' and geohash2='ev' and geohash3='eve' limit 50;
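Building such a query programmatically just means emitting one equality clause per geohash length up to the target. A sketch (query string only, against the table defined above):

```python
def clustering_query(geohash, limit=50):
    """Build a query against geohash1to8_clustering: every clustering
    column down to the target geohash length gets an equality
    restriction, so Cassandra can walk the nested sort order."""
    clauses = [
        f"geohash{n}='{geohash[:n]}'" for n in range(1, len(geohash) + 1)
    ]
    return (f"select * from geohash1to8_clustering "
            f"where {' and '.join(clauses)} limit {limit};")

print(clustering_query("eve"))
```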
2.4 Geohash Option 4 – Single Geohash Clustering Column
Another approach is to have a single full-length geohash as a clustering column. This blog (Z Earth, it is round?! In which visualization helps explain how good indexing goes weird) explains why this is a good idea:
“The advantage of [a space filling curve with] ordering cells is that columnar data stores [such as Cassandra] provide range queries that are easily built from the linear ordering that the curves impose on grid cells.”
Got that? It’s easier to understand an example:
CREATE TABLE geohash_clustering (
  geohash1 text,
  time timestamp,
  geohash8 text,
  lat double,
  long double,
  PRIMARY KEY (geohash1, geohash8, time)
) WITH CLUSTERING ORDER BY (geohash8 DESC, time DESC);
Note that we still need a partition key, and we will use the shortest geohash with 1 character for this. There are two clustering keys, geohash8 and time. This enables us to use an inequality range query with decreasing length geohashes to replicate the above search from smallest to largest areas as follows:
select * from geohash_clustering where geohash1='e' and geohash8='everywhe' limit 50;
select * from geohash_clustering where geohash1='e' and geohash8>='everywh0' and geohash8<='everywhz' limit 50;
select * from geohash_clustering where geohash1='e' and geohash8>='everyw0' and geohash8<='everywz' limit 50;
select * from geohash_clustering where geohash1='e' and geohash8>='every0' and geohash8<='everyz' limit 50;
select * from geohash_clustering where geohash1='e' and geohash8>='ever0' and geohash8<='everz' limit 50;
select * from geohash_clustering where geohash1='e' and geohash8>='eve0' and geohash8<='evez' limit 50;
select * from geohash_clustering where geohash1='e' and geohash8>='ev0' and geohash8<='evz' limit 50;
select * from geohash_clustering where geohash1='e' limit 50;
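The range bounds are produced by padding the target prefix with the smallest and largest base-32 characters ('0' and 'z'). A sketch of how a client might build this cascade:

```python
def range_query(prefix, limit=50):
    """Range over the geohash8 clustering column covering every
    full-length geohash starting with `prefix` ('0' and 'z' are the
    smallest and largest base-32 characters)."""
    if len(prefix) == 8:
        cond = f"geohash8='{prefix}'"
    else:
        cond = f"geohash8>='{prefix}0' and geohash8<='{prefix}z'"
    return (f"select * from geohash_clustering "
            f"where geohash1='{prefix[0]}' and {cond} limit {limit};")

for n in range(8, 1, -1):
    print(range_query("everywhe"[:n]))
```

For a 1-character prefix the range can be dropped entirely, since the partition key alone already restricts the search (the final query above).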
3 Space vs. Time (Potential Partition Problems)
Just like some office partitions, there can be issues with Cassandra partition sizes. In the above geohash approaches we used partition keys at sub-continental (or even continental: the geohash “6” covers almost the whole of South America!) scale. Is there any issue with having large partitions like this?
Recently I went along to our regular Canberra Big Data Meetup and was reminded of some important Cassandra schema design rules in Jordan’s talk (“Storing and Using Metrics for 3000 nodes – How Instaclustr use a Time Series Cassandra Data Model to store 1 million metrics a minute”).
Even though some of the above approaches may work (but possibly only briefly as it turns out), another important consideration in Cassandra is how long they will work for, and if there are any issues with long term cluster maintenance and node recovery. The relevant rule is related to the partition key. In the examples above we used the shortest geohash with a length of 1 character, and therefore a cardinality of 32, as the partition key. This means that we have a maximum of 32 partitions, some of which may be very large (depending on the spatial distribution of the data). In Cassandra you shouldn’t have unbounded partitions (partitions that keep on growing fast forever), partitions that are too large (> 100MB), or uneven partitions (partitions that have a lot more rows than others). Bounded, small, even partitions are Good. Unbounded, large, uneven partitions are Bad. It appears that we have broken all three rules.
However, there are some possible design refinements that come to the rescue to stop the partitions closing in on you, including (1) a composite partition key to reduce partition sizes, (2) a longer geohash as the partition key to increase the cardinality, (3) TTLs to reduce the partition sizes, and (4) sharding to reduce the partition sizes. We’ll have a brief look at each.
3.1 Composite Partitions
A common approach in Cassandra to limit partition size is to use a composite partition key, with a bucket as the 2nd column. For example:
CREATE TABLE geohash_clustering (
  geohash1 text,
  bucket text,
  time timestamp,
  geohash8 text,
  lat double,
  long double,
  PRIMARY KEY ((geohash1, bucket), geohash8, time)
) WITH CLUSTERING ORDER BY (geohash8 DESC, time DESC);
The bucket represents a fixed-duration time range (e.g. from minutes to potentially days), chosen based on the write rate to the table, to keep the partitions under the recommended 100MB. To query, you now have to use both geohash1 and the time bucket, and multiple queries are used to go back further in time. Assuming we have a “day” length bucket:
select * from geohash_clustering where geohash1='e' and bucket='today-date' and geohash8='everywhe' limit 50;
select * from geohash_clustering where geohash1='e' and bucket='yesterday-date' and geohash8='everywhe' limit 50;
etc.
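Walking back in time is then just a loop over bucket values. A sketch with a hypothetical day-granularity bucket encoded as an ISO date string (the date is fixed here for reproducibility):

```python
from datetime import date, timedelta

def bucketed_queries(geohash1, geohash8, days_back=3, limit=50):
    """One query per day bucket, today first, walking back in time."""
    today = date(2019, 7, 1)  # fixed for reproducibility
    return [
        f"select * from geohash_clustering where geohash1='{geohash1}' "
        f"and bucket='{today - timedelta(days=d)}' "
        f"and geohash8='{geohash8}' limit {limit};"
        for d in range(days_back)
    ]

for q in bucketed_queries("e", "everywhe"):
    print(q)
```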
Space vs. Time
But this raises the important (and ignored so far) question of Space vs. Time. Which is more important for Geospatial Anomaly Detection? Space? Or Time?
Space-time scales of major processes occurring in the ocean (spatial and temporal sampling on a wide range of scales)
The answer is that it really depends on the specific use case, what’s being observed, how and why. In the natural world some phenomena are Big but Short (e.g. tides), others are Small but Longer (e.g. biological), while others are Planetary in scale and very long (e.g. climate change).
So far we have assumed that space is more important, and that the queries will find the 50 nearest events no matter how far back in time they may be. This is great for detecting anomalies in long term trends (and potentially over larger areas), but not so good for more real-time, rapidly changing and localised problems. If we relax the assumption that space is more important than time, we can then add a time bucket and solve the partition size problem. For example, assuming the maximum sustainable throughput we achieved for the benchmark in Anomalia Machina Blog 10 of 200,000 events per second, 24 bytes per row, uniform spatial distribution, and an upper threshold of 100MB per partition, then the time bucket for the shortest geohash (geohash1) can be at most 10 minutes, geohash2 can be longer at just over 5 hours, and geohash3 is 7 days.
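These bucket lengths can be reproduced with simple arithmetic (the exact values depend on whether 100MB is taken as 10^8 or 100 x 2^20 bytes; the sketch below uses 10^8, which lands close to the figures quoted above):

```python
def max_bucket_seconds(geohash_len, events_per_sec=200_000,
                       row_bytes=24, partition_cap=100_000_000):
    """Longest time bucket before the average partition hits the cap,
    assuming events are spread uniformly over all 32**n partitions."""
    partitions = 32 ** geohash_len
    bytes_per_partition_per_sec = events_per_sec * row_bytes / partitions
    return partition_cap / bytes_per_partition_per_sec

print(max_bucket_seconds(1) / 60)     # geohash1: roughly 11 minutes
print(max_bucket_seconds(2) / 3600)   # geohash2: roughly 6 hours
print(max_bucket_seconds(3) / 86400)  # geohash3: roughly 8 days
```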
What impact does this have on the queries? The question then is how far back in time do we go before we decide to give up on the current search area and increase the area? Back to the big bang? To the dinosaurs? Yesterday? Last minute? The logical possibilities are as follows.
We could just go back to some maximum constant threshold time for each area before increasing the search area. For example, this diagram shows searches going from left to right, starting from the smallest area (1), increasing the area for each search, but with a constant maximum time threshold of 100 for each:
Constant time search with increasing space search
Alternatively, we could alternate between increasing time and space. For example, this diagram shows searches going from left to right, again starting from the smallest area and increasing, but with an increasing time threshold for each:
Increasing time search with increasing space search
However, a more sensible approach (based on the observation that the larger the geohash area, the more events there will be) is to change the time threshold based on the space being searched (i.e. the time scale inversely depends on the space scale). Very small areas have longer times (as there may not be very many events in a minuscule location), but bigger areas have shorter times (as there will be massively more events at larger scales). This diagram again shows searches increasing from the smaller areas on the left to larger areas on the right, but starting with the longest time period, and reducing it as space increases:
Decreasing time search with increasing space search
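One principled way to derive these inverse space/time thresholds is from the expected fill time of a cell: under a uniform distribution, a cell one character longer receives 1/32 of the events, so it takes 32 times longer to accumulate the same 50 events. A hedged sketch (the event rate is the benchmark figure from above; the policy itself is an assumption for illustration, not taken from the original application):

```python
def expected_fill_seconds(geohash_len, limit=50, events_per_sec=200_000):
    """Expected time for one geohash cell of the given length to
    accumulate `limit` events, assuming a uniform spatial distribution.
    Longer geohash = smaller cell = slower fill = longer search window."""
    return limit * 32 ** geohash_len / events_per_sec

# smaller areas justify longer lookbacks, larger areas shorter ones
for n in (8, 6, 4, 2):
    print(n, expected_fill_seconds(n))
```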
3.2 Partition Cardinality
Another observation is that in practice the shorter geohashes are only useful for bootstrapping the system. As more and more data is inserted the longer geohashes will increasingly have sufficient data to satisfy the query, and the shorter geohashes are needed less frequently. So, another way of thinking about the choice of correct partition key is to compute the maximum cardinality. A minimum cardinality of 100,000 is recommended for the partition key. Here’s a table of the cardinality of each geohash length:
| geohash length | cardinality (32^n) | cardinality > 100,000 |
| --- | --- | --- |
| 1 | 32 | no |
| 2 | 1,024 | no |
| 3 | 32,768 | no |
| 4 | 1,048,576 | yes |
| 5 | 33,554,432 | yes |
| 6 | 1,073,741,824 | yes |
| 7 | 34,359,738,368 | yes |
| 8 | 1,099,511,627,776 | yes |
From this data, we see that a minimum geohash length of 4 (an area of roughly 40 km by 20 km) is required to satisfy the cardinality requirement. In practice, we could therefore make geohash4 the partition key. At a rate of 220,000 checks per second the partitions could hold 230 days of data before exceeding the maximum partition size threshold. Although we note that the partitions are still technically unbounded, so a composite key and/or TTL (see next) may also be required.
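Both figures are easy to check: cardinality is just 32^n, and retention follows from total capacity divided by write rate. A sketch using the numbers quoted above:

```python
def cardinality(n):
    return 32 ** n  # base-32 alphabet, so 32^n distinct geohashes

# smallest geohash length whose cardinality clears the 100,000 guideline
min_len = next(n for n in range(1, 9) if cardinality(n) >= 100_000)
print(min_len)  # → 4

def retention_days(geohash_len, events_per_sec=220_000, row_bytes=24,
                   partition_cap=100_000_000):
    """Days until the average partition reaches the cap, assuming a
    uniform spatial distribution over all partitions."""
    total_bytes = cardinality(geohash_len) * partition_cap
    return total_bytes / (events_per_sec * row_bytes) / 86_400

print(round(retention_days(4)))  # → 230
```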
3.3 Time To Live (TTL)
A refinement of these approaches, which still allows for queries over larger areas, is to use different TTLs for each geohash length. This would work where we have a separate table for each geohash length. The TTLs are set for each table to keep the average partition size under the threshold: the larger areas have shorter TTLs to limit the partition sizes, while the smaller areas can have much longer TTLs before they get too big (on average; there may still be issues with specific partitions being too big if data is clumped in a few locations). For the longer geohashes the times are in fact so long that disk space will become the problem well before partition sizes (e.g. geohash5 could retain data for 20 years before the average partition size exceeds 100MB).
3.4 Manual Sharding
By default, Cassandra is already “sharded” (the word appears to originate with Shakespeare) by the partition key. I wondered how hard it would be to add manual sharding to Cassandra in a similar way to using a compound partition key (option 1), but where the extra sharding key is computed by the client to ensure that partitions are always close to some fixed size. It turns out that this is possible, and could provide a way to ensure that the system works dynamically irrespective of data rates and clumped data. Here’s a couple of good blogs on sharding in Cassandra (avoid pitfalls in scaling, synthetic sharding). Finally, partitions in Cassandra do take some effort to design, so here’s some advice on data modelling recommended practices.
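A minimal sketch of the client-side sharding idea (the shard count and hash function are illustrative assumptions; a size-aware scheme would instead pick the shard from observed partition sizes). The shard becomes part of a composite partition key, e.g. ((geohash4, shard), geohash8, time), and a read for an area must fan out across all shards and merge the results:

```python
import hashlib

SHARDS = 16  # illustrative; chosen to keep each sub-partition small

def shard_key(event_id, shards=SHARDS):
    """Stable client-side shard: a hash of the event id spreads writes
    evenly across `shards` sub-partitions of each geohash."""
    digest = hashlib.md5(str(event_id).encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % shards

# writes for the same geohash land on one of 16 sub-partitions;
# a read for that geohash must query all 16 shards and merge
print(shard_key("event-12345"))
```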
Next blog: We add a Dimension of the Third Kind to geohashes – i.e. we go Up and Down!
The post Geospatial Anomaly Detection (Terra-Locus Anomalia Machina) Part 2: Geohashes (2D) appeared first on Instaclustr.
For a highly technical product like Scylla, the success or failure of its adoption rests heavily in knowledge transfer to the community. Documentation is fundamental to that. To find out what’s new and what’s hot in Scylla documentation, I went to the source and had this exchange with Laura Novich, Senior Technical Writer for ScyllaDB.
Laura, you’re not just a technical writer. You’ve also been a certified Red Hat systems administrator. What are the commonalities, in skills and habits, between being a sysadmin and being a technical writer for a company like ScyllaDB?
When instructing users on systems such as ScyllaDB, you need to think about the entire package. When working on Linux systems there are multiple ways to reach the same goal, so our documentation needs to respect that. I wasn’t always a technically oriented person, and learning operating systems such as RHEL and databases such as Scylla has helped me understand our users and customers. I am passionate about the open source community and strive to deliver the best instructions for our products. The Scylla Docs team uses the philosophy of the open source community to “publish early, publish often.” As such we aim to get documentation out as soon as it is ready. Our documentation is publicly available on our website.
Aside from technology, what other experiences do you bring? How do those experiences impact your technical writing?
In addition to my SysAdmin experience, I have experience in Education where I was an ESL teacher in public schools. This experience helps me understand our users’ and customers’ needs from a methodological point of view.
My hobbies include cake decorating and writing, and I feel these experiences help me break instructions down into small chunks that are easier to understand, and help me build use cases that are not only useful for our users, but also easy to follow.
What recent major changes to Scylla documentation do you want to bring readers’ attention to?
In general, we have incorporated an improved search capability to quickly find any information which is stored on our documentation server. On each page, there is also a link you can click to leave feedback about a change or improvement you would like to see. Also this year, we added a Glossary of terminology with terms highlighted throughout the documentation suite.
Specifically, there are pages about the latest Scylla features such as Workload Prioritization, Role Based Access Control, In-memory Tables, Global Secondary Indexes, Materialized Views, Hinted Handoff, Scylla’s Compaction Strategies, and more.
We have extensive CQL pages and examples throughout the documentation suite.
What is your goal when you set out to write documentation for Scylla?
What we strive to do with our documentation is to not only focus on “How to do X”, but also on “Why to do X versus Y” as well as “How to do X better”.
We hope you will find our documentation easy to follow and understand as well as technically accurate and informative.
What are the ten most popular sections in our documentation? And what might you say about each one?
From what I can see:
- Getting Started – If this is your first time installing Scylla, preparing for production, or creating tables with CQL, this is the place to start.
- Scylla for Developers – Has what you need to take your application and integrate it with Scylla as your backend database. Includes Drivers, and Scylla features.
- CQLSh: the CQL shell – command line shell used to interact with Cassandra through CQL (the Cassandra Query Language).
- Scylla for Administrators – Create users, usernames, passwords, give users roles and permissions. Also includes procedures and information about Scylla Manager and Monitoring
- Data Manipulation – CQL commands for reading data from your database, with and without filtering.
- Data Definition – create tables and keyspaces with fully documented CQL commands.
- Best Practices for Running Scylla on Docker – you can run Scylla on a public cloud (AWS, GCE), on-premises (bare metal), and on Scylla Cloud as well, but this focuses on how to run Scylla in a Docker container.
- Scylla Architecture – Scylla under the hood, how Scylla maintains Consistency Availability and Partition tolerance.
- Scylla Ring Architecture – Overview – Part of Scylla Architecture, the ring is the basis for Scylla’s distributed design.
- Data Types – all the data types supported in CQL, from ascii to varint.
What is the best way for our community to get documentation requests, comments or corrections to you?
On the bottom of every page in the documentation suite is the following link:
Clicking on the link opens an issue in GitHub and automatically cites the page’s origin. There is a form to fill in where you provide us with information about the issue. Click Submit and your issue is submitted. You can follow the issue’s progress, and all our documentation issues, in GitHub.
All you need to report an issue is a free GitHub account.
We strive to make our documentation better and hope you will help us in this endeavor by giving us suggestions for improvement. For example, if you have broader comments besides a specific page drop us a line or join our Slack channel. Let your voice be heard!
Thanks very much, Laura, for your time today, and for maintaining our documentation site, which is such a valuable resource to our users.
The open source software market is expected to reach $32.95 billion by 2022, nearly tripling in growth in five years. Certainly the fact that 98% of developers use open source tools—even when they’re not supposed to—has something to do with this.
A successful open source project requires the support of a strong community, including documentation, code contribution, testing, and bug fixing. DataStax recently announced its continued commitment to open source with expanded support of open source Apache Cassandra.
DataStax is the top maintainer of the Cassandra open source project, contributing the majority of the commits and continuing to take the lead in Cassandra development and engagement via events like DataStax Accelerate.
Suffice it to say that we believe in the promise of open source, big time. But before we explore what the future of open source looks like, let’s look at where it comes from.
A Brief History of Open Source
In the 1950s and ‘60s, computers came preloaded with software. It wasn’t until the 1970s and ‘80s that software companies started selling licenses.
At the time, Richard M. Stallman, a staff programmer at MIT, saw the move to proprietary software as a betrayal of hacker culture. In September 1983, he announced GNU, a free Unix-compatible operating system, ostensibly giving birth to the modern open source movement.
Some of the more well-known open source projects include:
- Linux, an open source operating system similar to Unix
- Java, a popular open source programming language
- Git, a distributed version control platform for coding
Enterprises Buy into the Promise of Open Source
While some companies were initially scared to use open source solutions, that’s no longer the case. A 2018 survey by The Linux Foundation found that 72% of companies were using open source software in one form or another. This widespread adoption has translated into serious returns, and now many companies based on open source software are the innovation leaders in their industries.
The Open Source Projects Changing the World Today
From databases and operating systems to team messaging platforms and containerization solutions, there’s no shortage to the way open source software is transforming our lives.
Here are six different kinds of projects to get you thinking about what’s possible in the world of open source.
1. Apache Cassandra
Cassandra is an open source distributed database that delivers the high availability, performance, and linear scalability modern applications demand.
We believe in Cassandra so much that we built a whole company around it.
If you’re looking for a production-certified distribution of Cassandra that’s 100% open source compatible and comes with support from the Cassandra experts, check out DataStax Distribution of Apache Cassandra.
2. Android
Android is an open source mobile operating system developed by Google and released in 2008.
If you don’t have an iPhone, chances are your phone runs on Android; in May 2017, Google announced that there were more than two billion active Android devices.
Lots of engineers prefer Android because they can speed up the development process by leveraging tools and frameworks built by the open source community.
3. Docker
Docker is an open source containerization platform that lets developers create, deploy, and run applications anywhere. This accelerates the development process considerably, which is why many of today’s top-performing DevOps teams use containers to build applications.
4. Mattermost
Mattermost is an open source team messaging platform that gives companies more control over their data by enabling them to host their server on-premises or in a private cloud. This contrasts with other popular proprietary platforms, which host data on the vendor’s cloud.
Due to its open source nature, Mattermost integrates with DevOps tools and can be extended and customized with plugins.
5. Kubernetes
Kubernetes (also called k8s) is an open source container orchestration platform originally developed by Google. The Cloud Native Computing Foundation currently maintains the project. Kubernetes automates container deployment, scaling, and management. Because it’s a pluggable platform, Kubernetes has an extremely rich ecosystem: a collaboration across many industries and organizations that goes far beyond one open source project.
6. WordPress, Drupal, and Joomla!
WordPress, Drupal, and Joomla! are the three most popular content management systems (CMS), and they all happen to be open source. While WordPress commands nearly 60% of the market, Drupal and Joomla! have a 6.7% and 4.7% share, respectively.
Chances are the bulk of the content you consume every day is posted via an open source CMS.
The Future Is Open Source
From robust user communities and customizability to source code access and more control, open source software delivers a number of benefits to businesses of all sizes.
Open source software has already come a long way in a short period of time. And it remains to be seen what the future holds. But here at DataStax, we’re excited to keep contributing to the open source world and working with user communities to build powerful platforms that help organizations like yours change the world.
What’s new in Apache Cassandra 4.0? (video)
I recently had the pleasure of interviewing Hang Chan of Dstillery. Hang wears many hats at Dstillery, including responsibilities for IT infrastructure, systems, network and site reliability engineering. Hang drew from his breadth of experience to answer a few questions for me regarding Dstillery’s use of Scylla.
If you could, please briefly describe your use of Scylla.
Dstillery is the marketing and advertising industry’s leading custom audience solutions provider. Top agencies and brands rely on Dstillery’s high quality audiences for branding and direct response initiatives to thrive. To do this with accuracy and scale, Dstillery uses massive and diverse device-level data on a daily basis. We use Scylla to look up various attributes for devices and to map devices between our systems and external partners.
What hard technical problems is Dstillery solving through use of Scylla?
We need to be able to read and write at scale to the tune of hundreds of billions of requests a day with a timeout of 25 milliseconds.
What specific issues led you to migrate from Cassandra to Scylla?
We were using Cassandra previously to provide us these capabilities. Scylla is a drop-in replacement for Cassandra.
We were reaching a barrier where no matter how many nodes we added to the Cassandra cluster, we could not lower the failure rate. Even during periods of very low traffic, we would still see 0.1% failure rate using Cassandra, which amounts to thousands of failures. Scylla brings it down to nearly 0.0%, or hundreds of failures. In one extreme case, we were able to reduce failure rate from 14% to 0.8% by replacing Cassandra with Scylla.
Also, any details you can share about your Cassandra clusters?
We still have around 200 Cassandra nodes that we didn’t migrate because either they have a lot of data in them, or they don’t have the performance requirements of the clusters we migrated. These are mostly write-heavy workloads that don’t impact us as much.
One Cassandra cluster is on 3.11.3, and the other clusters are on 2.1.13.
On one cluster, we were able to reduce our server count by about 40% by switching to Scylla.
What did you guys use Cassandra for and is Scylla being used for the same thing? For example, a real-time bidding system?
We use Scylla and Cassandra the same way – to be the backend datastore to model our audiences, create insights, and power our RTB platform.
What do you like most about Scylla? You’ve mentioned low latencies and self-optimizing, is there anything else?
With Cassandra, we needed to tune around 40 different settings for garbage collection, JVM, memtables, compaction, and cache sizes. These settings would also need to be constantly adjusted over time. Since Scylla tunes itself according to the workload, we just set the host IP, cluster name, seeds, and let it go knowing that the performance will normalize itself over time without any manual intervention.
The support on the slack channel and Google group is top notch. A little while back, we were having an issue where compaction was utilizing too many resources and causing failures. Glauber suggested we try a new setting which would throttle the compaction process. Updating the setting reduced our failures. I really appreciate that engineers like Glauber are there when you have issues and always ready to help.
Some other questions we have: how many CPU cores does each node have?
Our nodes are bare metal – they have 40 CPU cores.
Do you have Scylla monitoring installed?
Yes, we collect Scylla metrics via Prometheus, implemented using Scylla Monitoring Stack.
Have you checked out Scylla Open Source 3.0?
We’re really excited about the SSTable format change. Reducing storage usage would be huge as that’s one of the biggest reasons we add more nodes.
How do you backup and restore? What is your repair process like?
We take snapshots for backups and restores, but usually do it as and when needed, such as before an upgrade. We don’t do repairs as it’s a rather intensive operation – at least in Cassandra it was. We haven’t really used repairs in Scylla. Latency and throughput are the most important for our use case, and some tradeoffs needed to be made to achieve an acceptable level of performance for both metrics.
How many nodes are you running? And how are they organized into clusters?
We have 8 clusters of around 130 nodes in 2 DCs. This was around 250 nodes before a hardware refresh. Our applications handle the replication.
I read about Dstillery’s method for determining geodata authenticity, detecting inaccurate or falsified GPS coordinates for mobile users, and I was wondering: how is your use of Scylla related to that capability?
Scylla provides the device location history that we apply our algorithms against to detect inaccurate location data. We also use Scylla for device web and app history, which is used to further validate location data quality.
No database operates in a vacuum. What other systems and infrastructure does Scylla integrate with at Dstillery?
We generally use Scylla in cases where we need to keep a lot of data and make simple lookups very quickly. The most common query is usually a select * from table where key =. For cases where more logic is required, we would use a relational db like MySQL or PostgreSQL.
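The access pattern Hang describes, fetching a whole row for a single key, can be sketched as a plain keyed map. This is an illustrative in-memory stand-in, not Dstillery's schema or a real driver API; the table and column names are hypothetical.

```python
# In-memory sketch of the point-lookup pattern described above: return the
# full row for one key, the dict analogue of "select * from table where key =".
# Table and column names are invented for illustration.

device_attributes = {
    "device-123": {"os": "android", "last_seen": "2019-04-01"},
    "device-456": {"os": "ios", "last_seen": "2019-04-02"},
}

def lookup(table, key):
    """Return the whole row for a single key, or None if absent."""
    return table.get(key)

row = lookup(device_attributes, "device-123")
assert row["os"] == "android"
assert lookup(device_attributes, "device-999") is None
```

The point of this shape is that every read resolves against exactly one key, which is the kind of query a wide-column store can serve in constant time; anything needing joins or richer predicates goes to the relational side, as described above.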
Mercuryd is middleware we created in-house which allows our apps to route to different datasources.
Dstillery’s Big Data adtech architecture, incorporating both SQL (MySQL) and NoSQL (memcached, Cassandra, and Scylla) databases, Hadoop/MapReduce for analytics, plus both Apache Kafka and Apache Flume for streaming data. Mercuryd is Dstillery’s own in-house middleware solution.
My thanks to Hang for his time, insights and leadership. And from all of us here at ScyllaDB, our thanks and best wishes to everyone at Dstillery for choosing Scylla.
Do you have a Big Data story of your own to share? If so, please contact us and let us know about your use case.
Data modeling provides a means of planning and blueprinting the complex relationship between an application and its data. It has become increasingly important as the “three Vs” of data—volume, variety, and velocity—continue to explode.
But as the importance of data modeling, in general, has grown, data modeling with a relational database has rapidly become less relevant and effective.
Relational databases simply aren’t designed to handle all the different types of data that are integral to modern databases. The mismatch between relational database capabilities and modern needs prompted former federal CIO Vivek Kundra, back in 2009, to declare: “This notion of thinking about data in a structured, relational database is dead.”
Ten years beyond that statement, it has become apparent that Kundra was correct. With every passing day, it seems, more forms of unstructured and semi-structured data must be stored, manipulated, and utilized in making business decisions—and those decisions must often be delivered with lightning-fast speed.
The Complexities of Relational Data Modeling
Data modeling with relational databases can be challenging. It’s a slow and cumbersome process. Making the smallest, simplest changes to a relational database—even changing just one field in one table—can set off a cascading domino effect of additional changes that can be very labor intensive to enact, and quite expensive to complete.
Although relational data modeling has remained a challenge, data modeling itself is very much alive and well. Data modeling is essential because it helps define both the data structure and the business requirements of an application, which brings us to our next topic: Apache Cassandra data modeling.
Relational Data Modeling vs. Cassandra Data Modeling
Where relational databases base their data models on the data itself, Cassandra bases the model on what you want to do with the data (i.e., the application).
This is a key distinction.
Cassandra data modeling is far more applicable and useful to today’s business environment. Data modeling with Cassandra, however, is quite different from relational data modeling and requires a mindset adjustment that leaves behind many of the restrictions of relational data modeling.
When transitioning from relational modeling rules to Cassandra modeling rules, two of the most important relational modeling restrictions that may be discarded are:
- Minimizing writes
- Minimizing data duplication
Since writes are expensive in relational databases, relational data modelers typically seek to restrict writes as much as possible.
That restriction is not applicable to Cassandra; writes can be quite inexpensive, and Cassandra is optimized to perform virtually all writes efficiently. And since Cassandra is typically architected around low-cost and abundant data storage, denormalization and duplication of data are common. Read efficiency, in fact, can often be maximized in Cassandra by intentionally using duplicate data.
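The duplicate-data idea can be made concrete with a small sketch: the same row is written to two query-specific "tables" so each read touches exactly one of them. This is an illustrative in-memory model, not driver code; the table and field names are invented.

```python
# Sketch of Cassandra-style denormalization: one logical user row is written
# to two query-shaped "tables". Writes are doubled (cheap in Cassandra);
# each read is then a single-key lookup in the table built for that query.
# Names are hypothetical.

users_by_id = {}
users_by_email = {}

def insert_user(user_id, email, name):
    row = {"id": user_id, "email": email, "name": name}
    users_by_id[user_id] = row          # serves "find user by id"
    users_by_email[email] = dict(row)   # duplicate copy serves "find user by email"

insert_user(1, "ada@example.com", "Ada")

# Each query path reads exactly one table, by exactly one key:
assert users_by_id[1]["name"] == "Ada"
assert users_by_email["ada@example.com"]["id"] == 1
```

The tradeoff is explicit: storage and write volume go up, but no read ever has to scan or join, which is the direction Cassandra is optimized for.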
With Cassandra data modeling, there are two primary goals that are quite different from relational database modeling. These goals are:
- To spread data evenly around the cluster: Rows are spread around the cluster based on the partition key, which is the first element of the primary key. So, in order to spread data evenly, you need to pick a good partition key.
- To minimize the number of partitions read: Partitions are groups of rows that share the same partition key. When you issue a read query, you want to read rows from as few partitions as possible.
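The effect of the partition key on data spread can be illustrated with a toy hash-placement sketch. Cassandra actually uses Murmur3 tokens and a ring topology; here md5 and a simple modulo stand in, and the key values and node count are made up.

```python
# Toy model of partition placement: hash the partition key, assign the whole
# partition to one node. A high-cardinality key spreads load across nodes;
# a low-cardinality key concentrates it. (Cassandra uses Murmur3 tokens on a
# ring; md5 % num_nodes is a simplification for illustration.)

import hashlib

def node_for(partition_key, num_nodes=4):
    h = int(hashlib.md5(str(partition_key).encode()).hexdigest(), 16)
    return h % num_nodes

# High-cardinality key (e.g. a per-user id): rows land on many nodes.
nodes_used = {node_for(f"user-{i}") for i in range(1000)}
# With 1000 distinct keys, every node should receive data.

# Low-cardinality key (e.g. a country code): every such row maps to the same
# node, creating a hot partition.
assert node_for("US") == node_for("US")
```

This is why "pick a good partition key" in the first goal above usually means picking a key with many distinct values, so the hash can distribute partitions evenly.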
And unlike with relational databases, any type of data can be stored in a Cassandra database. Cassandra also differs from relational databases in its ability to handle:
- Massive volume: Multiple petabytes of data? Trillions of data entities? Not a problem in Cassandra.
- Virtually unlimited velocity: Cassandra can handle millions of transactions per second, including real time and streams.
- Infinite variety: Cassandra can accommodate all forms of data, including structured, unstructured, and semi-structured.
Why It’s Important to Get the Data Model Right
Done correctly, data modeling can provide benefits throughout the development lifecycle.
In addition to improving performance, when done correctly data modeling also accelerates application development. It can ensure a more structured, less haphazard development process that contributes toward maximizing the quality of the end product as well as help lower long-term maintenance costs.
Conversely, a flawed data model, or a data model that simply isn’t a proper fit for the application, can lead to a cascade of problems:
- Complicating and slowing the development effort
- Over-complicating data access
- Indecision and uncertainty about how application data will be stored and accessed
- Inflexibility in responding to evolving requirements
- Excessively complicated code that makes ongoing maintenance more time-consuming and expensive
Each of these problems is likely to result in busted budgets and blown deadlines.
Transitioning from Relational Data Modeling to Cassandra Data Modeling
Transitioning from a relational database to Cassandra may seem a daunting challenge. But that’s a common misconception. Companies that manage some of the largest databases on the planet, like Netflix, have made the transition. And there’s plenty of guidance available for making that transition. It, of course, helps that Cassandra uses the SQL-like Cassandra Query Language (CQL).
Similarly, transitioning from relational data modeling to Cassandra data modeling can be easy. So easy, in fact, that you can complete the process in just five simple steps.
5 Steps to an Awesome Apache Cassandra Data Model (free webinar)
Get better price-to-performance with the new AWS M5 instance types on the Instaclustr Managed Platform
Instaclustr is pleased to announce support for the AWS EC2 M5 instance type for Apache Cassandra clusters on the Instaclustr Managed Platform.
The M-series AWS EC2 instances are general purpose compute instances and are quite popular due to their versatile configurations, which can cater to most generic application workloads. In case you are wondering what the 5 in M5 means, it’s the generation number. Every few years AWS announces a new generation of instance type for each series as hardware becomes cheaper and economies of scale apply. This 5th-generation M-series is much more powerful and cheaper than its predecessor. It’s a no-brainer for technology teams to upgrade their cloud deployments to the latest generation available, and this is an ongoing pursuit with cloud environments.
To assist customers in upgrading their Cassandra clusters from M4 to M5 instance type nodes, Instaclustr’s technical operations team has built several tried-and-tested node replacement strategies that provide zero-downtime, non-disruptive migrations for our customers. Read the Advanced Node Replacement blog for more details on one such strategy.
To compare how well M5s perform against M4s, we ran our standard Cassandra benchmarking tests. The results showed significant performance gains with M5s compared to M4s. The improvements ranged from 33% for a medium-sized write-only test to a massive 375% for a medium-sized read-only test. Below are the benchmarking details.
We had two 3-node clusters, one with m4.large and the other with m5.large instance type nodes. Each node had 250 GB EBS-backed SSDs attached. The graph below shows the throughput difference between M4 and M5 for the small-sized I/O.
Small writes (insert operations) saw around 81% throughput increase, small reads saw around 47% throughput increase and small reads+writes (mixed workload) saw around 70% throughput increase. (We benchmark throughput achieved for a constant median latency between tests. For details, see our Apache Cassandra certification test plan.)
We noticed that the performance gains increased with I/O size. The small I/O test had a read or write size of 0.6 KB per row. The medium I/O tests, which had a read/write size of 12 KB per row, showed severalfold better performance for read and mixed workloads but only a meagre improvement in writes (insert operations). The graph below shows the comparison for medium-sized I/O.
Medium writes saw only around 33% throughput increase, whereas medium reads and medium mixed workload saw around 375% and 300% throughput increase respectively.
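To make the percentage figures above concrete, a 375% throughput increase means the new number is 4.75× the old one, not 3.75×. The arithmetic, with a made-up baseline (the real ops/sec figures are not given in this post):

```python
# Percentage-increase arithmetic behind the benchmark claims above.
# The baseline throughput figure is invented purely for illustration.

def pct_increase(old, new):
    """Throughput increase of `new` over `old`, as a percentage."""
    return (new - old) / old * 100.0

baseline = 10_000            # hypothetical M4 medium-read ops/sec
improved = baseline * 4.75   # a "375% increase" is a 4.75x multiplier
assert pct_increase(baseline, improved) == 375.0

# Likewise the 33% write improvement is a 1.33x multiplier:
assert round(pct_increase(baseline, baseline * 1.33), 1) == 33.0
```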
M5 instance pricing
We have added four variants of the M5 instance type to the platform: one m5.large and three m5.xlarge variants with different storage capacities. All of them come with EBS-backed SSD drives. Compared to M4s, there is a slight increase in price. The biggest increase, 23%, is for the lowest configuration, i.e. the m5.large instance. But evidently, from the benchmarks, the price-to-performance ratio is much better for m5.large compared to m4.large. What’s even better for customers is that the higher configs of M5 are only marginally more expensive than their M4 equivalents, which means even better price-to-performance.
Below are their specifications and pricing (for the US West – Oregon region). Pricing for the rest of the regions is available on the Instaclustr Console when you log in.
| # | Instaclustr Instance Type | AWS Instance Name | CPU Cores | Memory | Storage Capacity | Price/node/month (US East (Ohio)) |
|---|---------------------------|-------------------|-----------|--------|------------------|-----------------------------------|
| 1 | Tiny                      | m5.large          | 2         | 8 GiB  | 250 GB           |                                   |
| 2 | Small                     | m5.xlarge         | 4         | 16 GiB | 400 GB           |                                   |
| 3 | Balanced                  | m5.xlarge         | 4         | 16 GiB | 800 GB           |                                   |
| 4 | Bulk                      | m5.xlarge         | 4         | 16 GiB | 1600 GB          |                                   |
To know more about this benchmarking, or if you need clarification on when to use the M5 instance type for your clusters, reach out to our Support team (if you are an existing customer), or contact our Sales team.
The post Get better price-to-performance with the new AWS M5 instance types on the Instaclustr Managed Platform appeared first on Instaclustr.
I recently had the pleasure of exchanging a few questions and answers with Guy Shtub, Manager of Scylla University. Guy had some exciting news about a new module available for Scylla University users plus shared his insights into what else is in the works.
Tell us about the new Data Modeling section of Scylla University.
Data modeling is an important topic when approaching databases in general, and Scylla specifically. What we did with the University is create a new lesson which is quite extensive. It starts with the basics of data modeling and gives an overview of how to get started and how to do it correctly, going over concepts such as the primary key, partition key, and clustering key, and the importance of choosing those keys. We saw that many users, even some more experienced ones, ran into issues because they got those concepts confused, or struggled with the data modeling itself and with choosing the right keys or the correct way to model.
An image from the new data modeling module of Scylla Essentials
So is this a whole new course, or is it part of a larger course?
It’s part of a larger course, the Scylla Essentials – Overview of Scylla course. We previously had a much shorter, less extensive data modeling lesson; however, we completely rewrote it from scratch and created one that goes into more detail, has hands-on exercises, and explains the subject in more depth.
What are the most crucial concepts to understand about data modeling in Scylla or even Cassandra?
I think, since most people that approach data modeling in Scylla already have some database experience, whether it’s SQL or NoSQL, the most important concept is to understand that when approaching data modeling in Scylla, or Cassandra for that matter, we have to think about the application and the actual queries that will be performed at the beginning of the data modeling process. On the other hand, the typical data modeling process with relational databases starts with our domain. That includes identifying the different entities in the domain, the relationships between them, and breaking that down until finally the database schema is created, the physical data model is created. The queries and the application are something that is thought about only at the end of the process. So the queries are more of an afterthought.
When we talk about data modeling in Scylla or Cassandra, by contrast, we have to think about the application and about the queries that will be performed very early on in the process. That’s a very important point. It’s not that we don’t think about the domain, the entities, and the relationships between those entities at all; we just do that in addition to thinking about the application and the different queries that will be performed. And then we base our data model on both of those things: on the domain and also on the application.
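The query-first approach described above can be sketched in a few lines: start from the query ("the latest N events for a device"), then shape storage so that query is one keyed, ordered read. This is an in-memory stand-in for a table with a partition key per device and a clustering key ordering rows newest-first; all names are illustrative.

```python
# Query-first modeling sketch: storage shaped for one query, "latest N events
# for a device". One partition per device; rows kept sorted newest-first,
# mimicking a clustering key with descending timestamp order. Names invented.

import bisect
from collections import defaultdict

events_by_device = defaultdict(list)

def record_event(device_id, ts, payload):
    # Negating the timestamp keeps the list sorted newest-first via bisect.
    bisect.insort(events_by_device[device_id], (-ts, payload))

def latest_events(device_id, n):
    """The target query: a single keyed, already-ordered read."""
    return [payload for _, payload in events_by_device[device_id][:n]]

record_event("d1", 100, "open")
record_event("d1", 200, "click")
assert latest_events("d1", 2) == ["click", "open"]
```

Because the model was derived from the query rather than from the domain entities, answering it never requires sorting or filtering at read time; that work was paid on the write path.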
Scylla University was announced at Scylla Summit 2018, and was featured in a blog in February 2019. How many people have gone through Scylla University in these first months of availability?
We’ve had quite a success with the University. We’ve had hundreds of people register and go through thousands of steps on the University.
What are the next courses that you’re working on?
One of the things we are doing right now is developing a course called the “Mutant Monitoring System.” What we did there is provide a story arc for Scylla training. In each lesson, we talk about a different topic, starting from the very basics and moving on to more advanced topics. The first lessons deal with how to get a Scylla cluster up and running, and with high availability, for example, consistency levels and replication factor. Then we talk about using multiple datacenters, that is, replicating data across different geographical locations. As we move forward in the course we go into more interesting and more advanced topics. The students can take each one of these lessons at their own pace.
It’s not an extensive course, like the Scylla Essentials, which covers everything from the beginning. So we do expect the trainees to have some prior knowledge about Scylla. However, in every place that we introduce a new term, we reference either the documentation or other lessons that explain that term. It’s a pretty cool way to dive into a specific issue that’s interesting for the student. For example, monitoring multiple datacenters, repair, and other interesting subjects.
The Mutant Monitoring System is based on a series of blog posts that were released a few months ago. We’re updating those posts and making sure that everything is up to date and working.
I saw that there’s also a course that complements on-site training. Can you tell us more about that?
Right. The University is relatively new, and we wanted to put as much material on it as possible, as soon as possible, and make it available to our trainees and Scylla users. What we did is take slides from some of the more advanced courses that we typically use when we provide on-site training. As a first stage, we put those slides on the website as a separate course, updated them, and added some quiz questions. We published that as an initial version that creates value for users as it is.
Our plan for the future is to take that course and those lessons and make them more interactive and more hands-on, and to improve the material. But until we get that done, we didn’t want to have people waiting for it. So we put an initial version online, which still creates value and which will be improved in the future.
Is there anything else you wanted to tell us about Scylla University?
For those that haven’t tried it out yet, I would encourage you to just register as a user. Have a look at the available courses and the different lessons. Just have a go at it. See how it works.
I didn’t mention it but all of the material is available for free. So all you have to do is just register as a user. The courses are completely online. They’re self-paced, which means that whenever you have some free time you can advance at your own pace. So it doesn’t require you to finish it at one go.
We’ll keep our users updated. We’re constantly adding more material. Expect more to come!
The post Scylla University: Data Modeling in Scylla Essentials appeared first on ScyllaDB.
Hybrid cloud, or the combination of on-premises data centers with one or more public or private cloud, is quickly becoming the go-to computing environment for enterprises looking to build modern applications.
Enterprises are now investing in cloud technologies at a faster rate than they are investing in other software technologies combined, according to McKinsey & Company. In fact, the research firm estimates that cloud spending will grow more than six times faster than other IT expenditures through next year. The advantages of the cloud are clear, including being able to deploy scalable, fast, always-available apps.
But how companies arrive at a hybrid cloud environment is not always so straightforward. Enterprises often struggle with how best to achieve the full benefits of hybrid cloud and many take very different paths to get there.
That said, as a company with many customers now using hybrid cloud, here is the typical path we see to hybrid cloud adoption.
1. No Hybrid Cloud Strategy
Companies that haven’t embraced hybrid cloud architectures run their main applications on-premises, with no cloud strategy evident organization-wide.
Individual lines of business might have deployed cloud applications for groups of users, but they are limited in scope. For example, they may use Gmail or other readily available productivity apps.
Such companies run the risk of getting left behind by faster-moving, more nimble competitors since legacy on-premises applications take more time to adapt to evolving organizational needs and don’t easily scale.
It’s time to start looking for opportunities to migrate to hybrid clouds.
2. Opportunistic Cloud Adoption
At this stage, organizations have defined their first processes—the low-hanging fruit—that could benefit from cloud adoption. This is the first real step on the road to hybrid cloud adoption.
Here, organizations may use Software as a Service (SaaS) applications for more sophisticated functions beyond email to branch into more cloud apps, such as CRMs. Managers looking for more opportunities may even go “rogue” and make forays into development or use without involving the IT department.
Meanwhile, IT itself may begin to define an approach to cloud adoption and identify areas in which to create pilot projects for applying it.
Overall, however, adoption is not widely accepted throughout the organization, and overlapping legacy approaches still exist. Developing a hybrid cloud strategy is the key to reaching the next stage.
3. Standardized Cloud Adoption
Here, the cloud has been widely reviewed and accepted by both IT and lines of business that can benefit from it. The organization has also settled on tools and automation that make cloud adoption more efficient. The cloud adoption strategy has been documented, development standards are in place, and non-cloud approaches are now on the way out.
However, organizations at this stage still have many applications that don’t share data, creating data silos and limiting their effectiveness.
4. Optimized Cloud App Adoption
At the app convergence stage, cloud applications begin to simplify organizational processes. They also function closer to real time, unlike legacy on-premises applications. Overall, at this stage, organizations have fewer applications working more efficiently, improved security, and less maintenance overhead. They can now measure business impact in terms of agility and time to market.
But the data that apps require remains chaotic, likely siloed in different, incompatible systems, including legacy relational database management systems. As a result, data management overhead increases. Applications also reveal performance issues and have trouble scaling.
Corralling data is the job of the next stage.
5. Optimized Cloud Data Adoption
Organizations at this stage continue to roll out cloud-aware applications, increasing the need to integrate public, private, and hybrid cloud platforms. Data convergence makes this possible by eliminating silos of disparate data technologies and replacing them with a unified data fabric.
Now the full value of hybrid clouds can be realized as the organization discovers new value in a data fabric that eliminates data redundancy. This leads to new process models, such as microservices, that increase business efficiency and new business models that lead to increased revenue.
Data governance and autonomy now become the major challenges. Cloud vendor lock-in is a real risk, as are increasing challenges to compliance with data regulations.
It’s time to move to the ultimate stage of hybrid cloud maturity.
6. Better Multi-Cloud Adoption
Making full use of optimized, interoperable clouds gives organizations at this stage application and data agility while preserving data autonomy as well as increased availability if one cloud provider has an outage. This means that rather than remaining locked into any one provider, fully cloud-mature organizations take advantage of multiple clouds depending on availability and real-time business needs.
Effective data governance lets organizations maintain, optimize, and eventually retire apps and data to meet ever-changing business needs. It also aids regulatory compliance thanks to increased visibility.
Organizations that have reached this pinnacle of hybrid cloud maturity can’t afford to stand still, however. As the relentless pace of innovation persists, organizations must continue to invest in their cloud assets to stay ahead of competitors and on top of the shifting sands of customer needs.
Recognize your organization in any of these stages of hybrid cloud adoption? If so, the time may be ripe for you to advance to the next level.