Momentum, Change, and Moving to the Future

For several years DataStax has worked with customers in hybrid and multi-cloud environments. Recently, our roadmap has intensified and accelerated in response to customer demands. We all know that cloud is changing every aspect of business—both for our customers, and for ourselves. We must continue to challenge ourselves to rethink the products we build and develop new ways to take them to market.

To that end, we made some big product announcements in May at our Accelerate conference. Our most exciting news by far was the announcement of DataStax Constellation, our new cloud data platform designed to make it easy for developers to build modern applications, and easier for ops teams to manage those applications—regardless of where they are deployed. These are entirely new opportunities for DataStax and have required us to make changes in every part of our business.

Realizing the full potential of what we can accomplish with Constellation will require adjusting every function to this new landscape. As with all meaningful change, some will be fun and exciting, and some will be incredibly hard. In our case, we need to quickly add many new positions in order to expand our cloud-native offerings. Our belief as a company is, and always has been, that we will grow the business responsibly—and that means that we cannot just add new costs and increase expenses. Therefore, we have made the difficult decision to restructure our organization.

Today, we said a heartfelt goodbye to some employees who have made incredible contributions to our success. This is not something that we take lightly. We said goodbye to people who are valued colleagues and, in many cases, close friends. Where we could, we moved people from previous roles into new teams that are part of the future plan. Unfortunately, at our size that does not work for all positions, and as a result, we needed to remove a little less than 10% of the company as part of the restructuring. This has been a time of tough decisions and we have done everything we can to transition these valued people with dignity and compassion.

Going forward, we are now structured and hiring to successfully bring the Constellation platform and cloud-native offerings to market. Working with our cloud partners, this fall we will deliver as-a-service offerings for both Cassandra and Insights, and we will continue to contribute to open-source developer communities to pioneer advances in Apache Cassandra™ and our own portfolio. To accomplish all of this, as part of the restructuring we need to hire about 100 new employees to bring in the talent needed to deliver our roadmap.

Our next generation of products is off to a good start. At Accelerate, we ran our developer bootcamp using an early, hands-on preview of our first offering on Constellation, DataStax Apache Cassandra as a Service. The result was a tremendous amount of enthusiasm and positive feedback on its power and ease of use. That real-world reaction to Constellation accelerated our own decision to invest even more aggressively.

So we move forward, with deep gratitude to all those who helped us accomplish all that has been done, and with excitement for all that is to come.

SkyElectric Uses Scylla’s NoSQL Database to Scale Its Smart Energy Platform


Smart Energy Algorithms Need Lots of Data

People in the developing world, in countries such as Pakistan, India, Bangladesh, and Nigeria, often struggle with unstable power grids. In these countries ‘rolling blackouts’ (or ‘load-shedding’) are quite common. These occur when electricity delivery is deliberately interrupted to avoid excessive load on the power plant. SkyElectric provides a solution to this problem in the form of continuous electricity amid frequent power outages, while also delivering the benefits of the solar solutions that homes and businesses enjoy in the developed world.

Using lithium-ion batteries controlled by patented electronics and artificial intelligence, SkyElectric’s algorithms make intelligent decisions that maximize energy availability and minimize cost. This results in 24×7 sustainable, affordable energy availability for all.

Unlike many other companies in this space, SkyElectric engineers develop all of their products in-house, and customer systems are monitored 24×7 by these products. SkyElectric’s customer base is growing rapidly, and the company plans to expand internationally this year.

Traditional Databases Can’t Scale

In 2017, SkyElectric was running legacy Java software against MySQL. As is common with IoT use cases, SkyElectric’s data size was growing rapidly. MySQL was unable to keep up; queries routinely took up to five minutes to return results.

Rasool, SkyElectric

Evaluating alternatives, the team focused on solutions appropriate to time series data, with a goal of efficiently organizing their machine data. The team initially looked at Apache Cassandra and Riak TS. After some testing, the team realized that Riak would be too restrictive; once the schema went live, they would be unable to make any changes. That proved to be a disqualifier.

While researching Cassandra, a software architect recommended Scylla to SkyElectric’s DevOps lead, Meraj Rasool. Familiar with Java’s operational challenges, Rasool was intrigued to learn that Scylla is implemented in C++.

“I was very reluctant to go into production with Cassandra. So many articles and research papers emphasize Cassandra’s massive performance issues. We dreaded continuously tweaking and tuning the cluster to make sure it operates at the proper maximum resources. I was glad to find a superior alternative,” said Rasool.

A Quick and Painless Migration

SkyElectric compared benchmarks on ScyllaDB’s website with their own Cassandra test runs.

“I was very impressed with Scylla’s speed and ease of deployment,” said Rasool. “Unlike Cassandra, Scylla didn’t need any tuning and was up and running very quickly and painlessly.”

Rasool identified operational ease as the key benefit of Scylla. While SkyElectric saw 10x better performance with Scylla, operational ease mattered even more, and it ultimately proved to be the criterion that drove the team’s decision.

“After running a production cluster of Scylla for a year, I just cannot stress this enough; I sleep very well without worrying that something might go wrong. We don’t need to worry about performance at all.”

-Meraj Rasool, DevOps Lead, SkyElectric

Starting out with three small AWS nodes, the team followed the documentation and launched bigger i3.large instances. SkyElectric currently runs three nodes but plans to double or triple those clusters this year.

For support, the SkyElectric team relied on ScyllaDB’s Slack channel. “I’m very happy with ScyllaDB’s support,” said Rasool.  “There were a few times when I really needed help and got a response within a few hours. But they weren’t just boilerplate responses. They were actual solutions for my questions. I was able to directly apply the solutions to our running clusters.”

“Once we adopted Scylla, we had the freedom to think about other things,” said Rasool. “For example, now we’re building a search engine and a states engine. We are very confident that Scylla will provide the necessary support, and that it will survive the load. The overall performance of the system backed by Scylla, which was previously measured in minutes, is now measured in milliseconds.”

Rasool summed up his experience with Scylla. “Best of all, I sleep well thanks to Scylla. Having worked in DevOps for the last three years, combined with IoT, I’ve found that you need extremely reliable technology. The only way to sleep well and basically live a good life is to select the best of the best. In that way, Scylla has helped a lot.”


Encryption at Rest in Scylla Enterprise


Since you are using a web browser to read this, you are probably at least aware of Transport Layer Security (TLS), which is used in wire protocols such as HTTPS. In Scylla, TLS is used to secure network connections between database nodes and/or client endpoints. Encrypting network traffic, also known as data in transit encryption, is essential both to protect sensitive data and to ensure that traffic actually originates from where you think it did.

Scylla already supports data in transit encryption, supporting both node-to-node (intra-cluster) encryption as well as client-to-node encryption.

But securing data in transit is sometimes not enough. What if your servers themselves are compromised? This is where data at rest encryption comes into play. Data at rest encryption secures the information persisted in a computer, such as on an SSD or HDD volume.

When you deal with sensitive data or multi-tenant deployments where you require isolation between clients (sets of keyspaces/tables), having data in clear text on disk can be a security problem. In a physical deployment, you have the risk of the actual disks being stolen or misplaced, which is why most organizations have strict routines for disposing of discarded storage media.

To alleviate the issue you can use whole-disk encryption solutions like LUKS, BitLocker, or AWS disk encryption, but these do not let you encrypt different tables with different keys.

To solve this, Scylla now supports per-table and per-node transparent data at rest encryption.

Encryption at rest for Scylla Enterprise, available starting with release 2019.1.1, protects data in its persisted state on disk, such as SSTables and commit logs. You can configure any user table, as well as the parts of the system storage that include client data, to use any symmetric key algorithm, key width, and block mode supported by OpenSSL in a file block encryption scheme.

In this post, we will explain some of the key elements and workflows involved in using transparent data encryption in Scylla.

Why not just rely on my cloud vendor for data-at-rest encryption?

While AWS provides on-disk encryption, there are also some limitations to it. First, you have no direct access to the keys used to encrypt the data. In case you need to do forensics, your data on disk is encrypted even from your own view. Also, by doing your own encryption, you can use the same keys across all nodes and backups. You can also use the Scylla-native implementation to manage multi-tenant data access, where different groups may have access to different parts of your data.

In summary:

  • you gain control over the encryption keys
  • you can encrypt each table with a different key
  • data is still transparently encrypted if it gets moved to a different volume, for instance in case of a backup

Plus, of course, relying on a cloud vendor to do disk encryption isn’t even an option for your own on-premises deployments.

Encryption keys

Encryption keys are used for encrypting system data, such as commit logs, hints, user tables and/or other user table keys. (To be clear, “encryption keys” are external to the database, and refer to cryptographic keys. They are not related to the term “key” as used within the database such as primary, partition, or clustering keys.)

An encryption key is either stored as the contents of a local file with a single pre-generated key or, in future versions of Scylla Enterprise, with a named Key Management Interoperability Protocol (KMIP) format key stored on a separate (third-party) KMIP server.

Encryption keys are identified by name. For local file keys a file with the same name as the key is expected to exist, with the appropriate access rights, in the directory designated by the scylla.yaml configuration option:

system_key_directory: <folder>

Transparent user data encryption

Transparent user data encryption enables encrypted storage of persisted SSTable data. Tables are configured for encryption by setting CQL properties on the table while creating or updating the table.

CREATE TABLE <keyspace>.<table_name> (...<columns>...) WITH
      scylla_encryption_options = {
            'cipher_algorithm' : '<alg/mode/padding>',
            'secret_key_strength' : <128/256...>,
            'key_provider': '<provider>',
            [<provider options>...]
      }
;

The default key provider is the local file key storage provider. With no extra options, Scylla will read or create a local key file in the key directory configured in scylla.yaml. Scylla supports three storage providers: local (available at the initial release of this feature), plus replicated and KMIP key storage (in future versions of Scylla Enterprise).

Note that since data insertion in Scylla typically passes through the commit log and/or batch/hints log before being fully committed to a final SSTable on disk, you should also configure system-level encryption to be sure all sensitive data is protected. You can read the documentation on how to configure system-level encryption here.

The keys used can be either pre-generated or created on demand on first write to disk. File based key management will rely on the same key being available on disk in the named file whereas, in a future version, providers such as KMIP or the replicated key provider use an id-based key lookup to name and later retrieve the keys used by any given disk file.

Local key storage

This initial implementation, available beginning in Scylla Enterprise Release 2019.1.1, is a simple file-based key storage scheme where each key is kept in clear-text files on disk, in a key storage directory configured in scylla.yaml.

This is the default key storage manager in Scylla, since in its simplest usage it requires very little extra configuration. However, care should be taken so that no outside party can easily access the key data from the file system, i.e. you should take care of setting the permissions for the key directory.
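A minimal sketch of locking the directory down (the directory path and the service user name here are assumptions; adjust them to your installation):

# Assumed key directory and service account; restrict access to the Scylla user only
chown -R scylla:scylla /etc/scylla/encryption_keys
chmod 700 /etc/scylla/encryption_keys
chmod 600 /etc/scylla/encryption_keys/*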

You could also consider keeping the key directory on a network drive (using TLS for the file sharing) to avoid having keys and data on the same storage media, should that media be stolen or discarded.

To use this provider for user data encryption, set the following in the scylla_encryption_options attributes (with optionally pre-created key(s)):

'key_provider': 'LocalFileSystemKeyProviderFactory',
'secret_key_file': <key storage file>

If secret_key_file is not specified, it defaults to a file named data_encryption_keys in the system configuration directory. If this file does not exist and contain an appropriate key, the file and/or the directory must be writable by the scylla process, and a new key will be generated.

System level encryption

System level encryption is applied to semi-transient on-disk data, such as the commit log, batch log, and hinted handoff data. In each of these, user table data is temporarily stored until it is fully persisted to a final SSTable on disk.

System encryption is configured in scylla.yaml:

system_info_encryption:
     enabled: <bool>
     key_provider: (optional) <key provider type>

Depending on key provider, additional arguments declaring provider properties are also required. Note the replicated key provider is not allowed here.

Scylla also allows you to encrypt sensitive parts of the scylla.yaml configuration file, such as KMIP server passwords. The configuration_encryptor tool accepts a system key file on disk and encrypts or decrypts your Scylla configuration file automatically.

Example 1: Creating a new, encrypted table

Using the simplest use case of encrypting a single table using local key file provider:

  1. Create an encryption key
    > <scylla-tools/bin>/local_file_key_generator
  2. Copy the key file to the same path on all nodes in the cluster.
  3. Create the table
    > cqlsh
    
    cqlsh> CREATE KEYSPACE ks WITH replication={ 'class' : 'SimpleStrategy', 'replication_factor' : 1 } ;
    
    cqlsh> CREATE TABLE ks.test (pk text primary key, c0 int) WITH
    scylla_encryption_options = { 'key_provider' : 'LocalFileSystemKeyProviderFactory', 'secret_key_file' : '<path-to-key-file>' };
    
  4. Insert and select some data
    cqlsh> INSERT INTO ks.test (pk, c0) VALUES ('apa', 1);
    
    cqlsh> SELECT * from ks.test;
    
     pk  | c0
    -----+----
     apa |  1
  5. Flush the sstable to disk
    > nodetool flush

The created CQL table behaves as a regular table, but all data in SSTable files written for it is now encrypted on disk. You can verify this by, for example, copying an SSTable to a different machine and trying to read it with tools such as sstabledump; this will fail.
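For instance (the SSTable file name here is illustrative), on a machine that does not have the key:

> sstabledump md-1-big-Data.db

The tool will fail to decode the encrypted contents rather than printing the rows.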

Example 2: Encrypting an existing table

Using the simplest use case of encrypting a single table using local key file provider:

  1. Create an encryption key
    > <scylla-tools/bin>/local_file_key_generator
  2. Copy the key file to the same path on all nodes in the cluster.
  3. Enable encryption of the table
    > cqlsh
    cqlsh> ALTER TABLE ks.test WITH scylla_encryption_options = { 'key_provider' : 'LocalFileSystemKeyProviderFactory', 'secret_key_file' : '<path-to-key-file>' };
  4. Upgrade existing sstables to encrypt them
    > nodetool upgradesstables -a ks test

Note that until you re-write the sstables in step 4, all already existing disk files will remain unencrypted, even though newly created files will be encrypted.

Example 3: Disable encryption for an encrypted table

To un-encrypt an encrypted table:

  1. Disable table encryption
    cqlsh> ALTER TABLE ks.test WITH scylla_encryption_options = { 'key_provider' : 'none' };
  2. Upgrade existing sstables to decrypt them
    Note: until all existing sstables are re-written unencrypted, the encryption key used needs to remain available or data loss will occur.
    
    > nodetool upgradesstables -a ks test

Conclusion

Transparent data encryption provides a flexible and low-overhead option to increase data security. It typically has minimal CPU overhead (depending on the encryption algorithm selected), generally much lower than, for example, SSTable compression, and no additional disk footprint.

There are, however, no silver bullets. When deploying an encryption solution you need to decide which type of attack you are guarding against: loss of disk media, or intrusion attacks. To ensure data safety you need to protect disk-based keys properly, using proper file permissions, non-local storage, etc.

Lastly, be sure to backup all your encryption keys (including those stored in Scylla tables) to prevent data loss, as there will generally be no way to recover data without them.

Properly applied, transparent data encryption can help secure sensitive data, even when deployed in semi-public environments such as cloud solutions or shared server farms.


Why in a Hybrid Cloud World, Developers Need Apache Cassandra™

Say what you will about fads or trends, cloud is here to stay, and for many companies “the cloud” has now evolved into “hybrid cloud.”

But most organizations didn’t expect to be in hybrid cloud, and now that they’re here, their application developers are facing issues they’ve never faced before—issues that require a powerful, flexible, always-on database like Apache Cassandra to resolve.

But before we discuss why Cassandra is such an ideal database for hybrid cloud, we need to understand the how and the why of entire organizations ending up with a hybrid cloud infrastructure when they never set out to be there in the first place. 

The Three Ways Companies Commonly End Up in Hybrid Cloud

1. All-In! Uh … Wait a Second

In recent years, many companies have announced going all-in on cloud, but if you’ve been around for a while, you know that multi-year projects rarely end as originally conceived. 

What typically happens is something changes the equation and that “all-in” never quite materializes. Leadership changes. Priorities change. And before you know it you’re in a completely new world and stuck with new technical debt.

A few years ago, Target, after going all-in on Amazon Web Services (AWS), announced they’d be migrating all workloads to Google Cloud. This was strictly a competitive decision, not a technical one. The result? The entire operation had to be migrated to another cloud provider, but in reality the chances of everything being completely moved over were probably small. 

2. Not Us! (Uh … Wait a Second)

What about companies that claim that they will never be in the cloud? 

Again, experience has shown us that eventually some parts of the company will find their way to a cloud. Maybe they’ll just do some development and testing at first, then a little bit more, and then a lot more, until they find they have equal amounts of cloud and on-premises workloads in a phenomenon known by some as “cloud leakage.”

3. Till Marriage Do Us Part

The third common way companies tend to find themselves in hybrid cloud situations is via a merger or acquisition. There they are, happily managing their infrastructure in a single cloud or on premises, and then they acquire a company with a completely different environment, leaving them with the choice to either migrate everything from the other company’s environment to theirs or just learn to live with it. 

Most of the time, as technical debt goes, companies just live with it for the short term and then slowly start migrating things over.

Either way—they’ve just landed themselves in hybrid cloud. 

Cassandra to the Rescue

By the time most companies realize they’re using hybrid cloud, they haven’t even begun to make the deep technological revamps they need to make at the database level to make hybrid cloud work. They’re probably still using legacy technology, or they’re using outdated NoSQL technology designed around the time “hybrid” primarily referred to gas-saving cars. 

Here’s the thing: it’s not too late for these companies, and it’s certainly not over. In fact, the game—or journey—is just beginning. They just need to make the right choices. 

Cassandra was purpose-built for the type of workloads that hybrid cloud demands, because it offers:

  • Masterless/shared-nothing architecture
  • Low latency
  • Flexible consistency guarantees

What do these give organizations? 

For one—the ability to withstand a single server failure (hard enough for most databases). With Cassandra’s architecture, a company can withstand an entire data center outage with no data loss. More importantly, those data centers could be a cloud data center and an on-premises data center, or even multiple cloud data centers. Companies can create highly resilient architectures that are in lockstep with how Cassandra was built from the beginning. 
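As a minimal sketch (the data center names here are hypothetical), a keyspace can be replicated across a cloud region and an on-premises data center simply by naming both in its replication settings:

CREATE KEYSPACE orders
  WITH replication = {
    'class': 'NetworkTopologyStrategy',
    'aws_us_east': 3,   -- cloud data center
    'onprem_dc1': 3     -- on-premises data center
  };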

Highly replicated systems, such as Cassandra’s, also have the added benefit of providing better ways to manage latency. By keeping data as close to users as possible, they get a much better experience. Users in India should not feel the latency hit for a request destined for a server in North America. An added benefit of this is potentially better compliance with local regulations.

Also, because Cassandra is a replicated database designed for developers building applications, it gives companies a variety of choices when it comes to managing consistency. They can dial down to weak consistency, which requires very little coordination and can be very fast, or bring it up to exclusively strong coordinated consistency to guarantee they are the only ones writing a record into the database. 
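For example, in cqlsh the consistency level can be switched for a session (drivers also let you set it per statement); a quick sketch:

cqlsh> CONSISTENCY ONE;
cqlsh> CONSISTENCY LOCAL_QUORUM;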

In short: Cassandra provides the flexibility and control developers need to maintain data integrity as well as latency and distribution.

Hybrid Cloud Use Cases 

Now that you’ve got Cassandra on your side and working for you—what can you do with all of this power? Most workloads that run on a relational database can also run on Cassandra because that’s what it’s designed to do. 

But what about specific use cases that are far better suited to Cassandra and hybrid cloud? 

1. State Management

One use case that’s particularly hard with relational databases or even cloud vendor databases is state management across multiple data centers. Pushing state down to the data layer and letting Cassandra do its own replication, as it’s designed to do, will give you an enormous amount of flexibility as you choose how to deploy your applications. 

2. Microservices

Another hybrid cloud use case is distributed microservices. 

Imagine you have a common tier of microservices deployed with the same API to multiple places. There are organizations doing that now with Cassandra. Think about a single database spanning clouds and on-premises data centers with the data common to every API consumer, where data written by a SET operation in AWS is immediately available to a GET from an on-premises application. These aren’t exotic additions to an existing database or things that require a lot of glue code to make work: they’re how Cassandra works in its most basic form and why Cassandra gives organizations that use it a clear competitive advantage.

Make Hybrid Cloud the Solution—Not the Problem

Let’s be honest: Hybrid cloud is the kind of thing most developers don’t want to think about because they want to see it as someone else’s problem. 

But the reality is: hybrid cloud is their problem, but it’s also their solution, if they are using the right technology to support it. 

Cassandra was designed for hybrid cloud and for the current and future application development world developers are going to have to deal with. 

DataStax engineers have contributed the majority of code commits to Cassandra and have more experience than anyone else deploying large Cassandra projects. That’s why we can confidently say that we are the Cassandra experts. But don’t just take our word for it; listen to some of our customers.


A Dozen New Features in tlp-stress

It’s been a while since we introduced our tlp-stress tool, and we wanted to give a status update to showcase the improvements we’ve made since our first blog post in October. In this post we’ll show you what we’ve been working on these last 8 months.

  1. Duration based test runs

    One of the first things we realized was that tlp-stress needed a way of specifying a test’s duration, rather than a number of operations. We added a -d flag which can be combined with human readable shorthand, such as 4h for four hours or 3d for three days. The feature allows the user to specify any combination of days (d), hours (h), minutes (m), and seconds (s). This has been extremely helpful when running tests over several days to evaluate performance during development of Cassandra 4.0.

     tlp-stress run KeyValue -d 3d
    
  2. Dynamic Workload Parameters

    When writing a workload, it became very apparent that we would have to either create thousands of different workloads or make workload parameters customizable to the user. We opted for the latter. Workloads can now annotate their variables to be overridden at runtime, allowing for extremely flexible operations.

    For example, one of our workloads, BasicTimeSeries, has a SELECT that looks like this:

     getPartitionHead = session.prepare("SELECT * from sensor_data_udt WHERE sensor_id = ? LIMIT ?")
    

    The limit clause above was hard coded initially:

     var limit = 500
    

    Changing it required recompiling tlp-stress. This was clearly not fun. We’ve added an annotation that can be applied to variables in the stress workload:

     @WorkloadParameter("Number of rows to fetch back on SELECT queries")
     var limit = 500
    

    Once this annotation is applied, the field can be overridden at run time via the --workload command line parameter:

     $ tlp-stress run BasicTimeSeries --workload.limit=1000
    

    You can find out what workload parameters are available by running tlp-stress info with the workload name:

     $ tlp-stress info BasicTimeSeries
     CREATE TABLE IF NOT EXISTS sensor_data (
                                 sensor_id text,
                                 timestamp timeuuid,
                                 data text,
                                 primary key(sensor_id, timestamp))
                                 WITH CLUSTERING ORDER BY (timestamp DESC)
     Default read rate: 0.01 (override with -r)
    
     Dynamic workload parameters (override with --workload.name=X)
    
     Name  | Description                                    | Type
     limit | Number of rows to fetch back on SELECT queries | kotlin.Int
    
  3. TTL Option

    When tlp-stress finally matured to be usable for testing Cassandra 4.0, we immediately realized that the biggest problem with long running tests was that if they were constantly inserting new data, we’d eventually run out of disk. This happens sooner rather than later - tlp-stress can easily saturate a 1GB/s network card. We needed an option to add a TTL to the table schema. Providing a --ttl with a time in seconds will add a default_time_to_live option to the tables created. This limits the amount of data that will live in the cluster and allows for long running tests.
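    For example, a sketch of a multi-day run where inserted data expires after one day (the --ttl value is in seconds, as described above; the other flags are arbitrary):

     $ tlp-stress run KeyValue -d 3d --ttl 86400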

  4. Row and Key Cache Support

    On many workloads we found disabling key cache can improve performance by up to 20%. Being able to dynamically configure this to test different workloads has been extremely informative. The nuance here is that the data model matters a lot. A key value table with a 100 byte text field has a much different performance profile than a time series table spanning 12 SSTables. Figuring out the specific circumstances where these caches can be beneficial (or harmful) is necessary to correctly optimize your cluster.

    You can configure row cache or key cache by doing the following when running tlp-stress:

     $ tlp-stress run KeyValue --rowcache 'ALL' --keycache 'NONE' 
    

    We’ll be sure to follow up with a detailed post on this topic to show the results of our testing of these caches.

  5. Populate Data Before Test

    It’s not useful to run a read heavy workload without any data, so we added a --populate flag that can load up the cluster with data before starting a measured workload. Consider a case where we want to examine heap allocations when running a read dominated workload across ten thousand partitions:

     $ tlp-stress run KeyValue --populate 10k -p 10k -r .99 -d 1d
    

    The above will run the KeyValue workload, performing 99% reads for one day, but only after inserting the initial ten thousand rows.

    Workloads can define their own custom logic for prepopulating fields or allow tlp-stress to do it on its own.

  6. Coordinator Only Mode

    This allows us to test a lesser known and used feature known as “coordinator nodes”. Eric Lubow described how using dedicated nodes as coordinators could improve performance at the 2016 Cassandra summit (slides here). We thought it would be useful to be able to test this type of workload. Passing the --coordinatoronly flag allows us to test this scenario.
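    For example, a sketch of running an existing workload against dedicated coordinators (the workload and duration here are arbitrary):

     $ tlp-stress run KeyValue --coordinatoronly -d 1h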

    We will post a follow up with detailed information on how and why this works.

  7. Pass Extra CQL

    We know sometimes it would be helpful to test a specific feature in the context of an existing workload. For example, we might want to test how a Materialized View or a SASI index works with a time series workload. If we were to create an index on our keyvalue table we might want to do something like this:

    --cql "CREATE CUSTOM INDEX fn_prefix ON keyvalue(value) USING 'org.apache.cassandra.index.sasi.SASIIndex';

    For now, we won’t be able to query the table using this workload. We’re only able to see the impact of index maintenance, but we’re hoping to improve this in the future.

  8. Random Access Workload

    As you can see above, the new workload parameters have already been extremely useful, allowing you to customize workloads even further than before. This will allow us to model a new pattern and data model: the random access workload.

    Not every Cassandra use case is time series. In fact, we see a significant number of non time series workloads. A friends list is a good example of this pattern. We may want to select a single friend out of a friends list, or read the entire partition out at once. This new workload allows us to do either.

    We’ll check the workload to see what workload specific parameters exist:

     $ tlp-stress info RandomPartitionAccess
     CREATE TABLE IF NOT EXISTS random_access (
                                partition_id text,
                                row_id int,
                                value text,
                                primary key (partition_id, row_id)
                             )
     Default read rate: 0.01 (override with -r)
        
     Dynamic workload parameters (override with --workload.name=X)
        
     Name   | Description                                                                   | Type
     rows   | Number of rows per partition, defaults to 100                                 | kotlin.Int
     select | Select random row or the entire partition.  Acceptable values: row, partition | kotlin.String
    

    We can override the number of rows per partition as well as how we’ll read the data. We can either select a single row (a fast query) or the entire partition, which gets slower and more memory hungry as the partition gets bigger. If we consider a case where the users in our system have 500 friends, and we want to test the performance of selecting the entire friends list, we can do something like this:

     $ tlp-stress run RandomPartitionAccess --workload.rows=500 --workload.select=partition
    

    We could, of course, run a second workload that queries individual rows if we wanted to, throttling it, having it be read heavy, or whatever we needed that will most closely simulate our production environment.

  9. LWT Locking Workload

    This workload is a mix of a pattern and a feature - using LWT for locking. We did some work researching LWT performance for a customer who was already making heavy use of them. The goal was to identify the root cause of performance issues on clusters using a lot of lightweight transactions. We were able to find some very interesting results which led to the creation of CASSANDRA-15080.

    We will dive into the details of tuning a cluster for LWTs in a separate post; there’s a lot going on there.

  10. CSV Output

    Sometimes it’s easiest to run basic reports across CSV data rather than using reporting dashboards, especially if there’s only a single stress server. One stress server can push a 9 node cluster pretty hard, so for small tests you’ll usually only use a single instance. CSV is easily consumed from Python with Pandas or gnuplot, so we added it in as an option. Use the --csv option to log all results to a text file in CSV format which you can save and analyze later.
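    A sketch of what that might look like (the flags other than --csv are arbitrary, and we assume --csv takes an output file path):

     $ tlp-stress run KeyValue -d 1h --csv results.txt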

  11. Prometheus Metrics

    Prometheus has quickly become the new standard in metrics collection, and we wanted to make sure we could aggregate and graph statistics from multiple stress instances.

    We expose an HTTP endpoint on port 9500 to let Prometheus scrape metrics.
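    For instance, a minimal Prometheus scrape job for a single stress instance might look like this (the host name here is hypothetical):

     scrape_configs:
       - job_name: 'tlp-stress'
         static_configs:
           - targets: ['stress-box-1:9500']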

  12. Special Bonus: We have a mailing list!

    Last, but not least, we’ve set up a mailing list on Google Groups for Q&A as well as development discussion. We’re eager to hear your questions and feedback, so please join and help us build a better tool and community.

    This list is also used to discuss another tool we’ve been working on, tlp-cluster, which we’ll cover in a follow up blog post.

We’re committed to improving the tlp-stress tooling and are looking forward to our first official release of the software. We’ve got a bit of cleanup and documentation to write. We don’t expect the architecture or functionality to significantly change before then.

KillrVideo Python Pt. 6— Cassandra with Python: Simple to Complex

Keeping it simple

Geospatial Anomaly Detection (Terra-Locus Anomalia Machina) Part 4: Cassandra, meet Lucene! – The Cassandra Lucene Index Plugin

Massively Scalable Geospatial Anomaly Detection with Apache Kafka and Cassandra

After being Lost in (3D) Space in the last blog, in this final part of the geospatial anomaly detection series we come back down to Earth and try out the Cassandra Lucene Plugin (this was our destination all along, but it’s a long way to Alpha Centauri!). We’ll also reveal how well a subset of the alternative solutions, from this and the previous parts (1-3), actually worked.

1. The Cassandra Lucene Index

The Cassandra Lucene Index is a plugin for Apache Cassandra:

“… that extends its index functionality to provide near real-time search, including full-text search capabilities and free multivariable, geospatial and bitemporal search. It is achieved through an Apache Lucene based implementation of Cassandra secondary indexes, where each node of the cluster indexes its own data.”

Recently we announced that Instaclustr is supporting the Cassandra Lucene Index plugin, and providing it as part of our managed Cassandra service.  The full documentation for the plugin is here. 

The documentation says that “the plugin is not intended to replace Apache Cassandra denormalized tables, indexes, or secondary indexes, as it is just a tool to perform some kind of queries which are really hard to be addressed using Apache Cassandra out of the box features.”

Given that in the past 3 blogs we’ve investigated using all of these approaches to add geospatial queries to Cassandra, it’s finally time to take the Cassandra Lucene Index for a spin. 

How does Lucene work? Lucene is built on a very old manual data mining technique, originally used by monks in 1230 to build a concordance of the Bible; it helped them find all the verses which had specific words in them (e.g. “Nephilim” – not many verses, “a” – lots of verses). A more recent (1890s) concordance was Strong’s “Exhaustive Concordance”, so-called not because it was heavy (it was – at 3 kg it was bigger and heavier than the book it indexed, the Bible) but because it indexed every word, including “a”:

Geospatial Anomaly Detection 4 - Exhaustive Concordance of the Bible

This is a good overview of Lucene (Lucene: The Good Parts), which helps make sense of this overview of the Lucene Cassandra plugin.

How is the Lucene Cassandra plugin used? It’s supported as an optional add-on in Instaclustr’s managed service offering, just select it in the Instaclustr Management Console when you are creating a Cassandra cluster and it is automatically deployed for you. 

Lucene indexes are an extension of the Cassandra secondary indexes, so they are created using the CQL CREATE CUSTOM INDEX statement like this:

CREATE CUSTOM INDEX (IF NOT EXISTS)? <index_name>
ON <table_name> ()
USING 'com.stratio.cassandra.lucene.Index'
WITH OPTIONS = <options>

To search using the index, this syntax is used.

SELECT * FROM test WHERE expr(<idx_name>, '<expr>');

Note that <options> and <expr> are JSON objects which actually specify what indexes are created and how they are searched. Also note that you only need to create a single Lucene index for each Cassandra table, as Lucene handles indexing all the requested columns for you.

Lucene indexes are queried using a custom JSON syntax defining the kind of search to be performed:

SELECT ( <fields> | * ) FROM <table_name> WHERE expr(<index_name>, '{
   (filter: ( <filter> )* )?
   (, query: ( <query>  )* )?
   (, sort: ( <sort>   )* )?
   (, refresh: ( true | false ) )?
}');

You can combine multiple filters, queries, and sorts (including sorting by geo distance).  Note that queries are sent to all nodes in the Cassandra cluster. However, filters may find the results from a subset of the nodes, so filters will have better throughput than queries.

2. Geospatial searches

Geospatial Anomaly Detection 4 - Throughput Results - Geospatial searches

How does this help for geospatial queries? The Lucene plugin has very rich geospatial semantics including support for geo points, geo shapes, geo distance search, geo bounding box search, geo shape search, as well as multiple distance units, geo transformations, and complex geo shapes. We only need to use a subset of these for our geospatial anomaly detection use case.

The simplest concept is the geo point, which is a <latitude, longitude> coordinate pair. Interestingly, under the hood indexing is done using a tree structure with geohashes (with configurable precision). Here’s how to create an index over latitude and longitude columns:

CREATE CUSTOM INDEX test_idx ON test()
USING 'com.stratio.cassandra.lucene.Index'
WITH OPTIONS = {
   'refresh_seconds': '1',
   'schema': '{
      fields: {
         geo_point: {
            type: "geo_point",
            validated: true,
            latitude: "lat",
            longitude: "long",
            max_levels: 8
         }
      }
   }'
};

This index can be used to search for rows within a distance range from a specific point with this syntax:

SELECT ( <fields> | * ) FROM <table> WHERE expr(<index_name>, '{
   (filter | query): {
      type: "geo_distance",
      field: <field_name> ,
      latitude: <latitude> ,
      longitude: <longitude> ,
      max_distance: <max_distance>
      (, min_distance: <min_distance> )?
   }
}');

There’s also a bounding box search with syntax like this. Don’t forget you first have to convert distance to latitude and longitude for the bounding box corners:

SELECT ( <fields> | * ) FROM <table> WHERE expr(<index_name>, '{
   (filter | query): {
      type: "geo_bbox",
      field: <field_name>,
      min_latitude: <min_latitude>,
      max_latitude: <max_latitude>,
      min_longitude: <min_longitude>,
      max_longitude: <max_longitude>
   }
}');

Partition-directed searches will be routed to a single partition, increasing performance. However, token range searches without filters over the partitioning column will be routed to all the partitions, with a slightly lower performance. This example fetches all nodes and all partitions so will be slower:

SELECT ( <fields> | * ) FROM <table> WHERE expr(<index_name>, '<expr>');

But this example fetches a single partition and will be faster:

SELECT ( <fields> | * ) FROM <table> WHERE expr(<index_name>, '<expr>') AND <partition_key> = <value>;

You can also limit the result set:

SELECT ( <fields> | * ) FROM <table> WHERE expr(<index_name>, '<expr>') AND <partition_key> = <value> limit <limit>;

Sorting is sophisticated, but note that you can’t use the CQL ORDER BY clause with the Lucene indexes. Here’s what the documentation says:

  1. When searching by filter, without any query or sort defined, then the results are returned in Cassandra’s natural order, which is defined by the partitioner and the column name comparator.
  2. When searching by query, results are returned sorted by descending relevance.
  3. Sort option is used to specify the order in which the indexed rows will be traversed.
  4. When simple_sort_field sorting is used, the query scoring is delayed.

Finally, prefix search is useful for searching larger areas over a single geohash column as you can search for a substring:

SELECT ( <fields> | * ) FROM <table> WHERE expr(<index_name>, '{
   (filter | query): {
      type: "prefix",
      field: <field_name> ,
      value: <value>
   }
}');

For example, to search for a 5 character geohash:

filter: {
      type: "prefix",
      field: "geohash",
      value: "every"
}
Here are the Cassandra table and Lucene indexes we created to evaluate the performance:

CREATE TABLE latlong_lucene (
   geohash1 text,
   value double,
   time timestamp,
   latitude double,
   longitude double,
   Primary key (geohash1, time)
) WITH CLUSTERING ORDER BY (time DESC);

CREATE CUSTOM INDEX latlong_index ON latlong_lucene ()
USING 'com.stratio.cassandra.lucene.Index'
WITH OPTIONS = {
   'refresh_seconds': '1',
   'schema': '{
      fields: {
         geohash1: {type: "string"},
         value: {type: "double"},
         time: {type: "date", pattern: "yyyy/MM/dd HH:mm:ss.SSS"},
         place: {type: "geo_point", latitude: "latitude", longitude: "longitude"}
      }
   }'
};

The simplest possible search just uses the partition key and a sort, where <lat> and <long> are the location of the current event to check for an anomaly.

SELECT value FROM latlong_lucene WHERE expr(latlong_index,
'{ sort: [ {field: "place", type: "geo_distance", latitude: <lat>, longitude: <long>}, {field: "time", reverse: true} ] }') and geohash1='<geohash>' limit 50;

The prefix search starts from geohash8 and increases the area until 50 rows are found:

SELECT value FROM latlong_lucene WHERE expr(latlong_index, '{ filter: [ {type: "prefix", field: "geohash1", value: <geohash>} ] }') limit 50;

This query shows how the geo distance search works (distances are increased incrementally until 50 rows are found):

SELECT value FROM latlong_lucene WHERE expr(latlong_index, '{ filter: { type: "geo_distance", field: "place", latitude: <lat>, longitude: <long>, max_distance: "<distance>km" } }') and geohash1='<hash1>' limit 50;

Finally, here’s the bounding box search, where the min and max bounding box latitudes and longitudes are computed based on increasing distance until 50 rows are found:

SELECT value FROM latlong_lucene WHERE expr(latlong_index, '{ filter: { type: "geo_bbox", field: "place", min_latitude: <minLat>, max_latitude: <maxLat>, min_longitude: <minLon>, max_longitude: <maxLon> } }') limit 50;

3. And the Winner is …

We tested a subset of the geospatial implementation alternatives we explored in this geospatial anomaly detection blog series, and ran throughput tests on identical Cassandra, Kafka and Kubernetes clusters to those we used in Anomalia Machina Blog 9.

Some high-level observations are that:

  1. There are many alternative ways of implementing geospatial proximity queries in Cassandra
  2. The geospatial alternatives show significant variation (the graph below shows best, average and worst, the best geospatial results are 62 times better than the worst)
  3. The original non-geospatial results (18,000 TPS) are better than all the geospatial results, and
  4. The best geospatial results only achieve ⅓ the throughput of the original results, the average geospatial results are only 1/10th the original results.   

However, this is not entirely unexpected, as the geospatial problem is a lot more demanding: (1) the queries are more complex, being over 2 or 3 dimensions and requiring calculation of proximity between locations, and (2) more than one query is typically needed, due to increasing the search area until 50 events are found.

Following are the detailed results. For some options, we tried both spatially uniform distributions of data and dense data (clumped within 100km of a few locations); otherwise the default was a uniform distribution. We also tried a dynamic optimisation which kept track of the distance that returned 50 results and tried this distance first (which only works for uniform distributions).

3.1 Cassandra and Geohashes

The following graph shows the results for the “pure” Cassandra implementations (Blog 2) including basic secondary indexes, all using geohashes. For most of the options both 2D and 3D geohashes are supported (Blog 3), with the exception of the option using “inequality over hash8 clustering column” which is 2D only (it relies on ordering which is not maintained for the 3D geohash). The results from worst to best are: 

  • Using 1 index for each geohash length 2D/3D (uniform, 1300 TPS, dense, 1690 TPS)
  • 1 table for each geohash length 2D/3D (uniform, 2700 TPS), inequality over hash8 clustering column (2D only) (uniform, 2700 TPS)
  • 1 table for each geohash length 2D/3D (dense, 3600 TPS)
  • And the winner is, inequality over hash8 clustering column (2D only) (dense, 6200 TPS). 

3.2 Cassandra Lucene Plugin/SASI and latitude/longitude

In this section, we evaluate the Cassandra Lucene Plugin (this blog) and SASI options (5.3). These options enabled the spatial searches over full latitude/longitude coordinates, using either bounding boxes or increasing distances.  The exception was the Lucene prefix option (see above) which used a single geohash and the Lucene prefix operator to search shorter geohashes (and larger areas).

The most obvious thing to note is that the SASI options were the worst:  SASI lat long clustering columns bounded box (dense, 100TPS, and uniform, 120TPS). 

The Lucene options involved combinations of dense or uniform distributions, filtering by distance or bounding box, and sorted or not (in which case the client can sort the results). The results (from worst to best) were:

  • filter by increasing distances, distance sort (uniform data, 200TPS)
  • distance sort, no filter (uniform data, 300TPS)
  • filter by increasing distances, no sort (uniform data, 400TPS)
  • filter by bounded box increasing distance (700TPS)
  • filter by bounded box, automatically optimised for most likely distance (1000TPS), equal with filter by increasing distances, no sort (dense data, 1000TPS)
  • prefix filter (using geohashes, uniform data, 2300TPS, and with dense data 4600TPS).  

Not surprisingly, the best result from this bunch used geohashes in conjunction with the prefix filter and Lucene indexes. There was not much to pick between the bounding box vs. distance filter options, with both achieving 1000 TPS (with distance optimisation or densely distributed data), probably because the geo point indexes use geohashes and both distance search types can take advantage of search optimisations using them.

3.3 Best Geohash Results

The following graph shows the best six results (all using geohashes) from worst to best:

  • The Lucene plugin prefix filter (uniform data, 2300 TPS) performed well, but was beaten by both the
  • Cassandra 1 table per hash 2D/3D (uniform data, 2700 TPS) and Cassandra inequality over a single geohash8 clustering column 2D (uniform, 2700 TPS).
  • The impact of assumptions around the data density are apparent as dense data reduces the number of queries required and increases the throughput dramatically with Cassandra 1 table per hash 2D/3D (dense data, 3600 TPS),
  • and Lucene plugin prefix filter (dense data, 4600 TPS).
  • And the overall winner is Cassandra inequality over a single geohash8 clustering column 2D (dense, 6200 TPS).

However, the worst and best results from this bunch are limited to 2D geospatial queries (but see below) whereas the rest all work just as well for 3D as 2D. The best result is the simplest implementation, using a single geohash as a clustering column. 

Lucene options that use full latitude/longitude coordinates are all slower than geohash options. This is to be expected as the latitude/longitude spatial queries are more expensive, even with the Lucene indexing. There is, however, a tradeoff between the speed and approximation of geohashes, and the power but computational complexity of using latitude/longitude. For simple spatial proximity geohashes may be sufficient for some use cases. However, latitude/longitude has the advantage that spatial comparisons can be more precise and sophisticated, and can include arbitrary polygon shapes and checking if shapes overlap.

Finally, higher throughput can be achieved for any of the alternatives by adding more nodes to the Cassandra, Kafka and Kubernetes (worker nodes) clusters, which is easy with the Instaclustr managed Apache Cassandra and Kafka services. 

4. “Drones Causing Worldwide Spike In UFO Sightings!”

If you really want to check for Drone proximity anomalies correctly, and not cause unnecessary panic in the general public by claiming to have found UFOs instead, then it is possible to use a more theoretically correct 3D version of geohashes. If the original 2D geohash algorithm is directly modified so that all three spatial coordinates (latitude, longitude, altitude) are used to encode a geohash, then geohash ordering is also ensured. Here’s example code in gist for a 3D geohash encoding which produces valid 3D geohashes for altitudes from 13km below sea level to geostationary satellite orbit. Note that it’s just sample code and only implements the encoding method; a complete implementation would also need a decoder and other helper functions.

The tradeoff between using the previous “2D geohash + rounded altitude” approach and this one is that: (1) this approach is theoretically correct, whereas for the previous approach (2) the 2D part is still compatible with the original 2D geohashes (this 3D geohash is incompatible with it), and (3) the altitude is explicit in the geohash string (here it is encoded into, and therefore only implicit in, the 3D geohash).

Geospatial Anomaly Detection 4 - UFO Sighting

5. Conclusions and Further Information

In conclusion, using geohashes with a Cassandra clustering column is the fastest approach for simple geospatial proximity searches, and also gives you the option of 3D for free, but Cassandra with the Lucene plugin may be more powerful for more complex use cases (though it doesn’t support 3D). If the throughput is not sufficient, a solution is to increase the cluster size.

The Instaclustr Managed Platform includes Apache Cassandra and Apache Kafka, and the (optional add-on) Cassandra Lucene Plugin can be selected at cluster creation and will be automatically provisioned along with Cassandra.

Try the Instaclustr Managed Platform with our Free Trial.


Geospatial Anomaly Detection (Terra-Locus Anomalia Machina) Part 3: 3D Geohashes (and Drones)

Massively Scalable Geospatial Anomaly Detection with Apache Kafka and Cassandra

In this blog we discover that we’ve been trapped in Flatland, and encounter a Dimension of the Third Kind. We introduce 3D geohashes, see how far up (and down) we can go (geostationary satellite orbits, the moon, and beyond), revisit Cassandra partition sizes, and come back down to earth with a suggested 3D geohash powered hazard proximity anomaly detector application for Drones!

1. What goes Up…

Geospatial Anomaly Detector 3 - Flatland

As if changing from believing that the earth is “flat” to round isn’t hard enough for some people, there’s also the challenge of moving from 2D to 3D!

“Either this is madness or it is Hell.” “It is neither,” calmly replied the voice of the Sphere, “it is Knowledge; it is Three Dimensions: open your eye once again and try to look steadily.”
Edwin A. Abbott, Flatland: A Romance of Many Dimensions

In the world of Flatland there are only 2 dimensions, and the Sphere (who inhabits 3D Spaceland) tries to explain height to the 2D creatures that inhabit it!

So let’s try and imagine a 3D world, so we can go Up:

Geospatial Anomaly Detector 3 - Disney Pixar - Up

And Down! (No, there wasn’t actually a sequel):

Geospatial Anomaly Detector 3 - Disney Pixar - Down

It’s a big jump from a flatworld (e.g. a 2D map) to even just a surface on a round ball (i.e. a sphere). But introducing a 3rd, vertical, dimension has more challenges. Where do you measure elevation from? The centre of the earth? The surface of the earth? And guess what, the earth isn’t actually spherical (white line), it’s an ellipsoid (orange line), and even has dips and bumps (geoid height, red line). This diagram shows the vertical complications of a (cross section of a) real planet:

Geospatial Anomaly Detector 3 - Vertical complications of a planet

Does the vertical dimension matter? Well, there are actually lots of things below the nominal surface of the earth. For example, the deepest borehole, the Kola Superdeep Borehole, has a depth of 12,226m (as the top of the hole proclaims):

Geospatial Anomaly Detector 3 - Kola Superdeep Borehole

It’s 11km to the bottom of the sea – the deepest dive ever was made recently to 10,928 m down in the Mariana Trench, 16m deeper than the previous record set way back in 1960, finding a plastic bag and a few weird fish:

Geospatial Anomaly Detector 3 - Eel at the bottom of Mariana Trench

An arrowtooth eel swims at the bottom of the Mariana Trench.

By comparison, it’s only 4km down the deepest mine, 2.2km down the deepest cave, and the Dead Sea is 400m below sea level.

Geospatial Anomaly Detector 3 - Dead Sea

Going in the other direction, altitude above the earth can range over essentially infinite distances (the galaxy is big), but coming back closer to earth there are some more useful altitudes. Planes and some drones operate up to 10km up, jet planes up to 37km, balloons up to 40km, and rocket planes up to 112km!

Geospatial Anomaly Detector 3 - Altitude

Further out, the orbits of satellites can extend more than 30,000 km above sea level:

Geospatial Anomaly Detector 3 - Orbits of a satellite

2. 3D Geohashes

How can we add altitude (and depth) to geohashes? A theoretically correct approach would be to modify the geohash encoding and decoding algorithms to incorporate altitude directly into the geohash string, making it a true 3D geohash (See postscript below). A simpler approach, used by Geohash-36, is to append altitude data onto the end of the 2D geohash. However, given that we want the ability to use different length geohashes to find nearby events, just using the raw altitude won’t work. We also want the 3D geohash cells to have similar lengths in each dimension (including altitude), and in proportion to the geohash length. The approach we used was to round the altitude value to the same precision as the 2D geohash (which depends on the length) and append it to the 2D geohash with a special character as a separator (e.g. “+”/“-” for altitudes above/below sea level). Note that this approach is only approximate as cubes further away from the earth will be larger than those nearer. The function to round an altitude (a) to a specified precision (p) is as follows:

roundToPrecision(a, p) = round( a / p ) x p
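As a concrete illustration, here is a minimal Java sketch of the “2D geohash + rounded altitude” encoding (not the production code from the Anomalia Machina application). The precision lookup values follow the table below, and the 2D geohash string is assumed to have already been computed by an existing geohash library:

```java
// Minimal sketch of the "2D geohash + rounded altitude" 3D geohash described above.
// Assumption: the 2D geohash string has already been computed (e.g. by a geohash library).
public class Geohash3D {

    // Approximate altitude precision (metres) for 2D geohash lengths 1-8,
    // matching the precision column in the table below.
    private static final long[] PRECISION_M =
            {5_000_000, 1_260_000, 156_000, 40_000, 4_800, 1_220, 152, 38};

    // roundToPrecision(a, p) = round(a / p) * p
    static long roundToPrecision(double altitude, long precision) {
        return Math.round(altitude / precision) * precision;
    }

    // Append the rounded altitude to the 2D geohash, using "+" for altitudes at or
    // above sea level and "-" for altitudes below it.
    static String encode(String geohash2D, double altitudeMetres) {
        long precision = PRECISION_M[geohash2D.length() - 1];
        long rounded = roundToPrecision(altitudeMetres, precision);
        String separator = rounded < 0 ? "-" : "+";
        return geohash2D + separator + Math.abs(rounded);
    }
}
```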

The following table shows the 2D geohash lengths, precision, area and volume for the corresponding 3D geohashes:

| 2D geohash length | precision (m) | precision (km) | area (km^2) | volume (km^3) |
|---|---|---|---|---|
| 1 | 5000000 | 5000 | 25000000 | 1.25E+11 |
| 2 | 1260000 | 1260 | 1587600 | 2000376000 |
| 3 | 156000 | 156 | 24336 | 3796416 |
| 4 | 40000 | 40 | 1600 | 64000 |
| 5 | 4800 | 4.8 | 23.04 | 110.592 |
| 6 | 1220 | 1.22 | 1.4884 | 1.815848 |
| 7 | 152 | 0.152 | 0.023104 | 0.003511808 |
| 8 | 38 | 0.038 | 0.001444 | 0.000054872 |

Note that we refer to the length of the 3D geohash as the base 2D geohash length, even though the 3D geohash is obviously longer due to the appended altitude characters. For an 8 character 2D geohash, the 3D geohash represents a cube with 38m sides. At the other extreme, for 1 character, the cube is a massive 5,000km on each side. For a 2 character geohash, the cube (1,260km on each side) is big enough to hold all the world’s water (roughly 1.4 billion km^3), which looks like this:

Geospatial Anomaly Detector 3 - Earth's Water

Some example altitudes encoded as a 3D geohash are as follows:

| altitude (m) | 3D geohash | notes |
|---|---|---|
| 0 | “sv95xs01+0” | Dead Sea, sea level |
| -500 | “sv95xs01-494” | Dead Sea, below sea level |
| 50 | “sv95xs01+38” | Dead Sea, 50m altitude |
| 100 | “sv95xs01+114” | Dead Sea, 100m altitude |
| 8848 | “tuvz4+9600” | Mount Everest, cube with 4.8km sides |
| 8848 | “tuvz+0” | Mount Everest, 3D geohash cube with 40km sides |
| 35786000 | “x+35000000” | geostationary satellite orbit, 3D geohash cube with 5,000km sides |
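Assuming the Geohash3D sketch above, a few of these table entries can be reproduced like this (the calls and expected outputs simply illustrate the rounding rule):

```java
// Usage of the hypothetical Geohash3D sketch above:
public class Geohash3DExamples {
    public static void main(String[] args) {
        System.out.println(Geohash3D.encode("sv95xs01", 0));        // sv95xs01+0
        System.out.println(Geohash3D.encode("sv95xs01", 100));      // sv95xs01+114 (100m rounds to 3 x 38m)
        System.out.println(Geohash3D.encode("tuvz4", 8848));        // tuvz4+9600 (Everest, 4.8km precision)
        System.out.println(Geohash3D.encode("x", 35_786_000));      // x+35000000 (geostationary orbit)
    }
}
```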

Geostationary satellites appear to be fixed over the same position on the earth, which has to be over the equator, and because they all have to be 35,786km up it gets a bit crowded (an average 73km separation). So the above geohash example, “x+35000000”, could “net” around 68 satellites in a cube 5,000km on each side.

Because the 3D geohash is based on latitude/longitude it’s an example of an earth-centred rotational coordinate system, as the coordinates remain fixed to the earth’s surface (i.e. the frame of reference is the earth’s surface, and rotates with it). For objects in space, it’s easier to use an Earth-centred inertial coordinate system, which doesn’t move with the earth’s rotation, so (for objects further away) their location doesn’t change rapidly.

I was curious about what coordinate systems were used for the Apollo moon landings, now almost exactly 50 years ago.

Geospatial Anomaly Detector 3 - Apollo Lunar Mission

The answer was “lots”. All the stages of the rocket, including the command/service module and the lunar module, each used a local coordinate system. This made it easier for the astronauts to compute course correction burns. But there were also many other coordinate systems (the guidance computer translated between them): Earth basic, Moon basic, Earth launch site, passive thermal control (the “barbecue roll”), preferred lunar orbital insertion, lunar landing site, preferred lunar orbital plane change, LM ascent, preferred trans-Earth injection, and Earth entry.

Geospatial Anomaly Detector 3 - Rendezvous radar and CSM target orientation

And given that everyone knows the earth is not really the centre of the universe, there are also sun-centred coordinate systems (and Galactic and Supergalactic ones!). This brings us full circle back to 2D again, as distance typically isn’t relevant for astronomical objects (because it’s unknown or too vast to matter), so stars, black holes, galaxies etc. are assumed to be located on the imaginary celestial sphere (the glass sphere in this old example):

Geospatial Anomaly Detector 3 - Celestial Globe

A celestial globe enclosing a terrestrial globe, from 1739

3. Do 3D geohashes impact partition cardinality?

Using 3D geohashes increases the cardinality, which may go some way towards solving the issue we identified at the end of Blog 2 (Geospatial Anomaly Detection: Geohashes (2D)), where partitions based on shorter geohashes were too large. Some calculations of the 3D geohash cardinality reveal that the increase depends on the altitude range actually in use. For example, if the altitude range extends as far as the geostationary satellite orbits, then the cardinality of the 3D geohash based on the 3 character 2D geohash is now high enough to use as the partition key (previously the 4 character 2D geohash was the shortest that would work by itself).

| 2D geohash length | 2D cardinality | 2D cardinality > 100,000 | 3D cardinality (0-30,000km altitude) | 3D cardinality > 100,000 |
|---|---|---|---|---|
| 1 | 32 | FALSE | 192 | FALSE |
| 2 | 1024 | FALSE | 24576 | FALSE |
| 3 | 32768 | FALSE | 6324224 | TRUE |
| 4 | 1048576 | TRUE | 786432000 | TRUE |
| 5 | 33554432 | TRUE | 2.09715E+11 | TRUE |
| 6 | 1073741824 | TRUE | 2.64044E+13 | TRUE |
| 7 | 34359738368 | TRUE | 6.78155E+15 | TRUE |
| 8 | 1.09951E+12 | TRUE | 8.68036E+17 | TRUE |

However, this is highly sensitive to the subrange of altitudes in use. A 0-100km altitude subrange has no useful impact on the cardinality, but a subrange of 0-400,000km (the approximate distance from the earth to the moon) gives the 3D geohash based on the 2 character 2D geohash high enough cardinality to act as the partition key.
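The calculation behind these numbers is simple enough to sketch: the 3D cardinality is (approximately) the 2D cardinality multiplied by the number of altitude cells in the range of interest. A rough Java version, an estimate only, reusing the same precision values as above:

```java
// Rough 3D geohash cardinality estimate: 2D cardinality x number of altitude cells.
// An approximation only; precision values as in the earlier table.
static double cardinality3D(int geohashLength, double altitudeRangeMetres) {
    long[] precisionM = {5_000_000, 1_260_000, 156_000, 40_000, 4_800, 1_220, 152, 38};
    double cardinality2D = Math.pow(32, geohashLength);   // 32 possible characters per position
    double altitudeCells = Math.ceil(altitudeRangeMetres / precisionM[geohashLength - 1]);
    return cardinality2D * altitudeCells;
}

// cardinality3D(3, 30_000_000)  ~ 6.3 million  (0-30,000km range, enough for a partition key)
// cardinality3D(2, 400_000_000) ~ 326,000      (0-400,000km, earth-to-moon range)
```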

4. A Drone Proximity Anomaly Detection Example

Geospatial Anomaly Detector 3 - Drone Proximity Anomaly Detection Example

To test the 3D geohash with the Anomalia Machina pipeline we generated random 3D location data, but rather than shooting for the moon, we limited it to the low altitudes legally permitted for drones (a few hundred metres high). At these altitudes the 8 and 7 character geohashes can adequately distinguish locations with nearby altitudes (with a precision of 38m and 152m respectively). This is a realistic simulation of future piloted or automated drone delivery or monitoring platforms, perhaps working in complex built-up environments like the one in this diagram, where there are many hazards that drones need to navigate around:

Geospatial Anomaly Detector 3 - Low altitude airspace

A more concrete drone anomaly detection problem is when the value to be checked is a count of the number of proximity alerts for each drone and location. The proximity alerts could jump if drones started operating near built-up or congested areas, near exclusion zones, or near temporary flight restriction areas. The rules for drone proximity are complex, as the permitted distances from drones to different types of hazards vary (and vary across the world). Some hazards are likely to be in fixed locations (e.g. airports and built-up areas), while others will themselves be moving, as will the drones (e.g. vehicles). We’ll focus on fixed hazards only (moving hazards could be processed by a Kafka Streams and state store application), and assume that proximity is measured in 3D. The rules for fixed hazards (in the UK) are: (1) > 50m from people and property, (2) > 150m from congested areas, (3) > 1,000m from airports, and (4) > 5,000m (a guess) from emergency exclusion zones (e.g. fires).

Geospatial Anomaly Detector 3 - Drone Safety

How could we generate proximity alerts for rules like this? Geohashes are again a good fit due to their natural ability to model locations at different scales. A different Cassandra table can be used for each type of hazard, with a 3D geohash as the partition key, where the partition key for each table is based on the 2D geohash length closest to the hazard scale. We can then simply query each table with the drone’s location to check if (and how many) hazards of each type are too close to the drone: geohash8 for people and property (38m precision), geohash7 for congested areas (152m precision), geohash6 for airports (1220m precision), and geohash5 for emergency exclusion zones (4800m precision). The resulting counts would then be fed to the 3D geospatial anomaly detector for checking.
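Here is a sketch of what this could look like using the DataStax Java driver. The table names, columns, and the Geohash3D helper (from the earlier sketch) are assumptions for illustration, not the actual Anomalia Machina schema. Each hazard table might be created with something like `CREATE TABLE hazards_people (geohash3d text, hazard_id uuid, PRIMARY KEY (geohash3d, hazard_id));`, and is queried with the drone’s current 3D geohash cell:

```java
import com.datastax.oss.driver.api.core.CqlSession;
import java.util.Map;

// Sketch only: per-hazard-type tables keyed by a 3D geohash, and per-drone lookups.
public class DroneProximityCheck {

    // Hazard table -> 2D geohash length whose precision best matches the rule distance.
    private static final Map<String, Integer> HAZARD_GEOHASH_LENGTH = Map.of(
            "hazards_people", 8,      // > 50m rule, 38m precision
            "hazards_congested", 7,   // > 150m rule, 152m precision
            "hazards_airports", 6,    // > 1000m rule, 1220m precision
            "hazards_emergency", 5);  // > 5000m rule, 4800m precision

    // Count hazards of each type in the same 3D geohash cell as the drone.
    public static void checkDrone(CqlSession session, String droneGeohash2D, double altitude) {
        HAZARD_GEOHASH_LENGTH.forEach((table, length) -> {
            // Truncate the drone's 2D geohash to the hazard scale and append the rounded altitude.
            String key = Geohash3D.encode(droneGeohash2D.substring(0, length), altitude);
            // A prepared statement would be used in practice; string concatenation keeps the sketch short.
            long count = session
                    .execute("SELECT count(*) FROM " + table + " WHERE geohash3d = '" + key + "'")
                    .one().getLong(0);
            // The counts would then be sent on to the anomaly detector pipeline.
            System.out.println(table + " near " + key + ": " + count);
        });
    }
}
```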

Also note that if we want to increase the altitude precision for drone operations, perhaps to within 1m so that drones can hover outside the correct floor of a skyscraper (e.g. to support emergency services, maintenance, etc.), then a 3D geohash constructed from a 10 character 2D geohash, with the altitude rounded to 1m appended, can be used (roughly a 1 cubic metre box).


Next blog: We investigate some of the powerful geospatial features of the Lucene Cassandra index plugin, implement a subset of the alternatives from all the blogs, run load tests with the Anomalia Machina system, and reveal the winner.

Postscript – Example 3D Geohash code

If you really want to check for drone proximity anomalies correctly, and not cause unnecessary panic in the general public by claiming to have found UFOs instead, then it is possible to use a more theoretically correct 3D version of geohashes. If the original 2D geohash algorithm is modified directly so that all three spatial coordinates (latitude, longitude, altitude) are used to encode the geohash, then geohash ordering is also preserved. Here’s example code in a gist for a 3D geohash encoding which produces valid 3D geohashes for altitudes from 13km below sea level up to geostationary satellite orbit. Note that it’s just sample code and only implements the encoding method; a complete implementation would also need a decoder and other helper functions.
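For reference, here is a minimal sketch of that idea (not the code from the linked gist): longitude, latitude, and altitude are bisected in turn, the resulting bits are interleaved, and every 5 bits are emitted as a base-32 character, so that geohashes sharing a prefix share a nested 3D cell. The altitude range below is an assumption matching the description above (13km below sea level up to geostationary orbit):

```java
// Minimal sketch of a "true" 3D geohash encoder: longitude, latitude and altitude are
// bisected in turn, the bits are interleaved, and every 5 bits become one base-32 character.
public class TrueGeohash3D {

    private static final String BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz";

    public static String encode(double lat, double lon, double alt, int length) {
        double[] lonRange = {-180.0, 180.0};
        double[] latRange = {-90.0, 90.0};
        // Assumed altitude range: 13km below sea level up to geostationary orbit.
        double[] altRange = {-13_000.0, 35_786_000.0};
        double[][] ranges = {lonRange, latRange, altRange};
        double[] values = {lon, lat, alt};

        StringBuilder geohash = new StringBuilder();
        int bit = 0, ch = 0, dimension = 0;
        while (geohash.length() < length) {
            double[] range = ranges[dimension];
            double mid = (range[0] + range[1]) / 2;
            if (values[dimension] >= mid) {
                ch = (ch << 1) | 1;   // value in upper half -> 1 bit
                range[0] = mid;
            } else {
                ch = ch << 1;         // value in lower half -> 0 bit
                range[1] = mid;
            }
            dimension = (dimension + 1) % 3;  // cycle lon -> lat -> alt
            if (++bit == 5) {                 // 5 bits per base-32 character
                geohash.append(BASE32.charAt(ch));
                bit = 0;
                ch = 0;
            }
        }
        return geohash.toString();
    }
}
```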

The tradeoffs between the previous “2D geohash + rounded altitude” approach and this one are: (1) this approach is theoretically correct; but with the previous approach (2) the 2D part remains compatible with original 2D geohashes (this true 3D geohash is not), and (3) the altitude is explicit in the geohash string (in the true 3D geohash it is encoded, and therefore only implicit).

Geospatial Anomaly Detector 3 - 3D Hilbert Cube

A 3D Hilbert Cube (a space-filling curve, related to the maths behind geohashes) made from pipes!

