Reaper 1.4 Released

Cassandra Reaper 1.4 was just released, with security features that now extend to the whole REST API.

Security Improvements

Reaper 1.2.0 integrated Apache Shiro to provide authentication capabilities in the UI. The REST API remained fully open though, which was a security concern. With Reaper 1.4.0, the REST API is now fully secured and managed by the very same Shiro configuration as the Web UI. JSON Web Tokens (JWT) were introduced to avoid sending credentials over the wire too often. In addition, spreaper, Reaper's command line tool, has been updated to provide a login operation and manipulate JWTs.
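For illustration, the new flow with spreaper looks roughly like the following; the exact subcommand arguments shown here are an assumption, so check the spreaper help output and the Reaper documentation for the precise syntax:

# log in once against a secured Reaper; spreaper obtains a JWT and reuses it
# for subsequent calls (illustrative syntax, the real arguments may differ)
spreaper login admin
# authenticated requests can then be made without resending credentials
spreaper list-clusters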

The documentation was updated with all the necessary information to handle authentication in Reaper, including samples showing how to connect to LDAP directories through Shiro.

Note that Reaper doesn't support authorization features, so it is not possible to create users with different rights.
Authentication is now enabled by default for all new installs of Reaper.

Configurable JMX port per cluster

One of the annoying things with Reaper was that it was impossible to use a different port for JMX communications than the default one, 7199.
You could define specific ports per IP, but that was really for testing purposes with CCM.
That long-overdue feature has now landed in 1.4.0, and a custom JMX port can be passed when declaring a cluster in Reaper:

Configurable JMX port

TWCS/DTCS tables blacklisting

In general, it is best to avoid repairing DTCS tables, as repairs can generate lots of small SSTables that stay out of the compaction window and create performance problems. We tend to recommend not repairing TWCS tables either, to avoid replicating timestamp overlaps between nodes that can delay the deletion of fully expired SSTables.

When using the auto-scheduler though, it is impossible to specify blacklists, as all keyspaces and all tables get automatically scheduled by Reaper.

Based on the initial PR from Dennis Kline, which was then re-worked by our very own Mick, a new configuration setting allows automatic blacklisting of TWCS and DTCS tables for all repairs:

blacklistTwcsTables: false

When set to true, Reaper will discover the compaction strategy of each table in the keyspace and remove any table using either DTCS or TWCS from the repair, unless it is explicitly passed in the list of tables to repair.

Web UI improvements

The Web UI reported decommissioned nodes that still appeared in the cluster's Gossip state with a Left status. This has been fixed and such nodes are no longer displayed.
Another bug was the number of tokens reported in the node detail panel, which was nowhere near reality. We now display the correct number of tokens, and clicking on this number opens a popup listing the tokens the node is responsible for:

Tokens

Work in progress

Work in progress will introduce the Sidecar Mode, which will colocate a Reaper instance with each Cassandra node and support clusters where JMX access is restricted to localhost.
This mode is being actively worked on and the branch already has working repairs.
We’re now refactoring the code and porting other features to this mode like snapshots and metric collection.
This mode will also allow for adding new features and permit Reaper to better scale with the clusters it manages.

Upgrade to Reaper 1.4.0

The upgrade to 1.4 is recommended for all Reaper users. The binaries are available from yum, apt-get, Maven Central, Docker Hub, and are also downloadable as tarball packages. Remember to backup your database before starting the upgrade.

All instructions to download, install, configure, and use Reaper 1.4 are available on the Reaper website.

The Complex Path for a Simple Portable Python Interpreter, or Snakes on a Data Plane

We needed a Python interpreter that can be shipped everywhere. You won’t believe what happened next!

“When I said I wanted portable Python, this is NOT what I meant!”

In theory, Python is a portable language. You can write your script locally and distribute it to other machines with the Python interpreter. In practice, things can go wrong for a variety of reasons.

The first and simplest problem is the module system: for a script to run, all of the modules it uses must be installed. For Python-savvy users, installing them is not a problem. But for a software vendor that wants to guarantee a hassle-free experience for all sorts of users, the module dependency is not always easily met.

Second, Python is not one language but actually two: under the name Python, there are Python2 and Python3. And while Python2 is set to be deprecated soon (which has been a true statement for the past decade, by the way!) and bleeding-edge organizations will deal with that nicely, the situation is much different in old-school enterprises: RHEL7, for instance, does not ship with a Python3 interpreter at all and will still be supported under Red Hat policies for years to come.

Using Python3 in such distributions is possible: third-party repositories like EPEL produce RHEL-compatible binaries for the interpreter. But once more, old-school enterprise usually means security policies that either disallow installing packages from untrusted sources, or require installing the Python scripts on machines with no connection to the internet. EPEL would need internet connectivity to update dependencies whenever some packages are a few minor versions behind, which makes this a no-go.

Scylla is the Real-Time Big Data Database. In organizations with strict security policies, the database nodes tend to be even stricter than average. While our main database codebase is written in native C++, we adopt higher-level languages like Python for our configuration and deployment scripts. Working with enterprise-grade organizations, we have had to solve the problem of how to ship our Python scripts to our customers in a way that works across a wide variety of setups and security policies.

Harry Potter, a Python Interpreter

One of many approaches to a Python interpreter. Requires broomstick package for portability.

We considered many approaches: should we just rewrite everything in C++? Should we make sure that everything we do works with Python2 as well, and uses as few modules as possible? Should we compile the Python scripts to C with Cython or Nuitka? Should we rely on virtualenv or PyInstaller to ship a complete environment? All of those solutions had pros and cons, and ultimately none of them would work across the wide variety of scenarios in which our database seeks to be installed.

In this article we will very briefly discuss those alternatives, then describe in detail the solution we ended up employing: we now distribute a full Python3 interpreter together with our scripts, and it can be executed on any Linux distribution.

“Why not X?” Or, The Top Three Reasons We Didn’t Use Your Pet Suggestion!

Ophidiophobia, or Fear of Snakes. Image courtesy SantiMB.Photos; used with permission.

Computer science is hard. While cosmologists tackle hard questions like “where do we come from?”, anyone working with code has to deal with much harder questions like “why didn’t you use X instead?”, where X is any other approach that the reader has in mind. So let’s get that out of the way and discuss the alternatives!

1. Rewrite our scripts

ScyllaDB is an Open Source database mainly written in C++. Since we already have to deploy our C++ code in a portable manner, it wouldn’t be a stretch to just rewrite the Python scripts. We know what you are thinking and sure, we could also have rewritten it in Go, Lua, Rust, Bash, Fortran or Cobol.

However, we wrote those scripts in Python for a reason: configuration and deployment are not performance critical, writing that kind of code in Python is easy, and Python is well known among developers from many backgrounds.

Rewriting something that already does its job just fine would also be time we could be spending somewhere else. Not a chance.

2. Write everything to also work with Python2

This would be like coding with shackles: many modules that we already use are no longer available for Python2, and soon the entire Python2 language will be no more. We would rather be free to code as we wish, without having to worry about that. Making sure that changes to the scripts don't break Python2 compatibility would also require testing infrastructure (we still use human coders, the kind that every now and then forgets something).

3. Just compile it with Nuitka, Cython, use PyInstaller, or whatever!

This is where things get interesting: we did very seriously consider those alternatives. The problem is that all of them generate some standalone installer that ships with its dependencies, and such an installer cannot be shipped everywhere.

Let’s take a look for instance at what cython generates. Consider the following script:
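A minimal stand-in for that script, based on the discussion that follows (a hello program that also pulls in the PyYAML module), could look like this:

#!/usr/bin/python3
# hello.py -- minimal stand-in: prints a greeting and uses a third-party
# module (PyYAML) so the binary has an execution-time library dependency
import yaml

print("hello world")
print(yaml.dump({"greeting": "hello world"}))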

We can tell Cython to compile it with the environment embedded into a single binary, and in theory that's what we want: we could then distribute the resulting binary. But what does the result look like in the end? Cython allows us to compile the Python script and generate an ELF executable binary:
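As a sketch, assuming the hello.py above, the compile step looks roughly like this (the exact flags depend on the Python version and distribution):

# generate C code with an embedded main() entry point
cython --embed -3 hello.py
# compile and link against the system Python; flags come from python3-config
gcc hello.c -o hello $(python3-config --cflags --ldflags)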

ELF binaries in Linux can consume shared libraries in two ways: they can be loaded at program startup time, or dynamically loaded during execution. The ldd tool can be used to inspect which libraries the program will need during startup. Let's see what ldd reports for the cython-generated binary:

There are two problems with ldd's output. First, the list is deceptively small: since hello uses the YAML library, we would expect it to depend on it. The strace utility can be used to inspect all calls to the operating system issued by a program. If we use it to see which files are being opened, we can confirm our suspicion:
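An illustrative way to do that with strace (the yaml shared object only shows up here, once the program is already running):

# trace file-open syscalls and look for the YAML library being pulled in
strace -f -e trace=open,openat ./hello 2>&1 | grep -i yaml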

That is because libyaml is being loaded during execution time. Cython has no knowledge of which libraries will be loaded during execution time and will just trust that those are found in the system.

Another problem is that the resulting binary depends on system libraries from the host system (like the GNU libc, libpython, etc.), at their specific versions. So while it can be transferred to a system similar to the host, it can't be transferred to an arbitrary system.

The situation with PyInstaller is a bit better. The shared libraries the script uses during execution time are discovered and added to the final bundle:
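As a sketch, building a single-file bundle of the same hello.py looks like this; PyInstaller's analysis step is what discovers and packs the execution-time shared libraries:

# build a one-file bundle; the result lands in dist/hello
pyinstaller --onefile hello.py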

But we still have the issue that the resulting binary depends on the basic system libraries needed during startup, as the ldd program will tell us:

This is actually discussed in the PyInstaller FAQ, from which we quote the relevant part for simplicity (highlight is ours):

The executable that PyInstaller builds is not fully static, in that it still depends on the system libc. Under Linux, the ABI of GLIBC is backward compatible, but not forward compatible. So if you link against a newer GLIBC, you can’t run the resulting executable on an older system. The supplied binary bootloader should work with older GLIBC. However, the libpython.so and other dynamic libraries still depends on the newer GLIBC. The solution is to compile the Python interpreter with its modules (and also probably bootloader) on the oldest system you have around, so that it gets linked with the oldest version of GLIBC.

As the PyInstaller FAQ notes, a fully static binary (one that doesn't use shared libraries at all) is the usual way to overcome shared-library dependencies. However, such a method presents problems of its own, the most obvious being the final size of the application: if we ship 10 scripts, each script has to be compiled into its own multi-MB bundle. This solution quickly stops scaling.

The proposed solution, building on the oldest system available, also doesn't quite work for our use case: the oldest systems available are exactly the ones in which installing Python3 can be a challenge and tools are not up to date. We rely on modern tools to build, so we often want to do exactly the opposite.

We tried Nuitka as well, which is an awesome project and can operate as a mixture of what Cython and PyInstaller offer with its --standalone mode. While we won't detail our efforts here for brevity, the end result has drawbacks similar to PyInstaller's. On a side note, both of these tools seem to have issues with syntax like __import__("some-string") (since it is not possible to know what will be imported until this is called at execution time), and modules may have to be passed explicitly on the command line in that case. You never know when a dependency-of-your-dependency may do that, so that's an added risk for our deployments.

Virtualenv has similar issues. It solves the module-packaging problem nicely, but it still creates symlinks to core Python functionality in the base system. It is just not intended for portable, cross-system deployments.

Taming the wild Python — Our solution:

Animalist: Fear of Pythons

At this point in our exploration, we realized that since our requirements are a bit unique, maybe we should invest in our own solution. And while we could have invested in enhancing some of the existing solutions (the reader will notice that some of the techniques we used could also address some of the shortcomings of PyInstaller and Nuitka), we realized that for the same effort we could ship the entire Python interpreter in a way that doesn't depend on anything in the system, and then just use it to execute the scripts. This means not having to worry about any compatibility issue: every single piece of Python syntax will work, there is no need to compile the scripts statically, and we don't have to compile the code and lose access to the source on the destination machine.

We did that by creating a relocatable interpreter: in simple terms, the interpreter we ship also carries all the libraries it needs, at relative paths in the filesystem, and everything the interpreter does refers to those paths. This is similar to what PyInstaller does, with the exception that we also handle glibc, the dynamic loader, and the system libraries.

Another advantage of our solution: if we package 30 scripts with PyInstaller, each of them will carry its own bundle of libraries, because each resulting bundle has its own copy of the interpreter and dependencies. In our solution, because we relocate the interpreter and all scripts share it, each needed library is naturally present only once.

What to include in the interpreter?

The code to generate the relocatable interpreter is now part of the Scylla git tree and is available under the AGPL like the rest of Scylla (although we would be open to moving it to its own project under a more permissive license if there is community interest). It can be found in our github repository. The script used to generate the relocatable interpreter works on any modern Fedora system (since we rely on Fedora tools to map dependencies). Although it has to be built on Fedora, it generates an archive that can then be copied to any distribution. We pass as input to the script the list of modules we will use, for example:
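Purely as a hypothetical sketch (the real script name and options live in the Scylla tree and may differ), the input is simply the list of packages providing the modules we need:

# hypothetical invocation -- the actual script and flags are in the Scylla
# git tree; the important part is the list of module packages passed in
./build-relocatable-python.sh python3-pyyaml python3-psutil python3-requests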

We then use standard rpm management utilities in the distribution (which is why the generation of the relocatable interpreter is confined to Fedora) to obtain the list of all files that these modules need, together with their dependencies:
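On Fedora this can be done with the stock tooling, roughly as follows (the package name is just an example):

# files shipped by a module's package
rpm -ql python3-pyyaml
# resolve the package's own dependencies so their files can be pulled in too
dnf repoquery --requires --resolve python3-pyyaml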

We then copy the resulting files to a temporary location, with some heuristics to skip things like documentation and configuration files. We organize it so that the binaries go to libexec/ and the libraries go to lib/.

At this point, the Python binary still refers to hard coded system library paths. People familiar with the low level inner workings of Linux shared objects will by now certainly think of the usual crude trick to get around this: setting the environment variable LD_LIBRARY_PATH, which tells the ELF loader to search for shared objects in an alternative path first.

The problem is that, as an environment variable, it will be inherited by child processes. Any call to an external program would then try to find its libraries in that same path. So code like this:

output = subprocess.check_output(['ls'])

wouldn’t work, since the system’s `ls` needs to use its own shared libraries, not the ones we ship with the Python interpreter.

A better approach is to patch the interpreter binary so that it doesn't depend on environment variables. The ELF format specifies that lookup directories can be set in the DT_RUNPATH or DT_RPATH dynamic section attributes. And this being 2019, thankfully we have an app for that: the patchelf utility can add that attribute to an existing binary where there was none, so that's our next step ($ORIGIN is a variable that the ELF loader expands to the directory where the application binary lives):

patchelf --set-rpath '$ORIGIN/../lib' <python_binary>

Things are starting to take shape: now all the libraries will be taken from ../lib and we can move things around. This works for libraries loaded both at startup and at execution time. The remaining problem is that the ELF loader itself has to be replaced as well, since in practice it has to match the libc used by the application (which is where the API to load shared libraries at execution time lives).

To solve that, we ship the ELF loader too! We place the ELF loader (here called ld.so) in libexec/, like the actual Python binary, and inside our bin/ directory, instead of the Python binary itself, we have a trampoline-like shell script that looks like this:
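A minimal sketch of such a trampoline, reconstructed from the description below (the variable names follow that description; the real script in the Scylla tree may differ in detail):

#!/bin/bash
# resolve symlinks to find the real location of this wrapper
x="$(readlink -f "$0")"
b="$(basename "$x")"
# the wrapper lives in <root>/bin, so go up one level to find the root
d="$(dirname "$(dirname "$x")")"
realexe="$d/libexec/$b"
# run the bundled binary through our own ELF loader, with PYTHONPATH
# pointing at the bundled libraries
PYTHONPATH="$d/lib" exec "$d/libexec/ld.so" "$realexe" "$@"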

This shell script finds the real path of the executable (stored in "x"), in case you are calling it through a symlink, and then splits it into its basename ("b") and the root of the relocatable interpreter ("d"). Note that because the binary lives inside ./bin/, we need to go up one level from its dirname.

In the last two lines, once we know the real location of the script ("d"), we find the location of the dynamic loader (ld.so) and the Python binary ($realexe), which we know will always be in libexec/, and force the invocation to happen through the ELF loader we provide, while setting PYTHONPATH in a way that makes sure the interpreter will find its dependencies. Does it work? Oh yes!

But how do we know it's really taking its libraries from the right place? Well, aside from trying it on an old target system, we can just look at the output of the ldd tool again. We installed an interpreter into /tmp/python/, and this is what it says:

All of them are coming from their relocatable location. Well, all except for ld-linux-x86_64.so.2, which is the ELF loader. But remember we will force the override of the ELF loader by executing ours explicitly, so we are fine here. The libraries loaded at execution time are fine as well:

And with that, we now have a Python interpreter that can be installed with no dependencies whatsoever, in any Linux distribution!

But wait… no way this works with unmodified scripts

“OMG! That’s not a Python!”

If you are thinking that there is no way this works with unmodified scripts, you are technically correct. The default shebang will point to /usr/bin/python3, and we still have to change every single script to point to the new location. But nothing else has to change. Still, is it even worth the trouble?

It is, if everything can happen automatically. We wrote a script that, given a set of Python scripts, modifies their shebangs to point to `/usr/bin/env python3`. Each actual script is then replaced by a bash trampoline script much like the one we used for the interpreter itself. That guarantees that the relocatable interpreter precedes everything else in the PATH (so env ends up picking our interpreter, without having to mess with the system's PATH), and then calls the real script, which now lives in the `./libexec/` directory.

We set PYTHONPATH as well to make sure we look for imports inside libexec/, so that local imports keep working. See for instance what happens with scylla_blocktune.py, one of the scripts we ship, after relocating it like this:
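A hedged sketch of what the relocated wrapper could look like (not the literal output of our relocation script):

#!/bin/bash
# wrapper installed as bin/scylla_blocktune.py; the original script now
# lives in libexec/ next to the bundled interpreter
x="$(readlink -f "$0")"
d="$(dirname "$(dirname "$x")")"
# put the bundled bin/ first in PATH so /usr/bin/env picks our python3,
# and point PYTHONPATH at libexec/ so local imports keep resolving
PATH="$d/bin:$PATH" PYTHONPATH="$d/libexec" exec "$d/libexec/$(basename "$x")" "$@"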

And there we have it: we can now distribute the entire /tmp/test directory with as many Python3 scripts as we want and unpack it anywhere: they will all run using the interpreter that ships inside it, which in turn can run on any Linux distribution. Total Python freedom!

But how extensible is it really??

An example of fully extensible Python

For people used to very flexible Python environments with user modules installed via pip, our method of specifying up front the modules we want in the relocatable package doesn't seem very flexible or extensible. But what if, instead of packaging those modules, we were to package pip itself?

To demonstrate that, we ran our script again with the following invocation:

Note that since the pip binary in the base system is itself a Python script, we have to modify it as well before copying it to the destination environment. Once we unpack it, the environment is ready. PyYAML was not included in the modules list and is not available, so our previous `hello.py` example won't work:

But once we install it via pip, will we be luckier?
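Installing it is then an ordinary pip invocation against the relocated pip (the binary name and the /tmp/test layout here are assumptions carried over from the earlier example):

# use the pip that ships inside the relocatable directory
/tmp/test/bin/pip3 install PyYAML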

Where did it go? We can see that it is now installed in a relative path inside our relocatable directory:

Which, due to the way the interpreter is started through the trampoline scripts, is included in the search path:
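A quick way to verify that, under the same assumed layout:

# the relative site-packages directory should show up on sys.path
/tmp/test/bin/python3 -c 'import sys; print("\n".join(sys.path))'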

And now what? Just run it!

Conclusion

In this article we have detailed ScyllaDB's uncommon approach to a common problem: how to distribute software, in particular software written in Python, without having to worry about the destination environment. We did that by shipping a relocatable interpreter that has everything it needs at relative paths and can be passed around without depending on the system libraries. The scripts can then be executed with that interpreter through just a minor indirection.

The core of the solution we adopted could have been used with other third-party tools like PyInstaller and Nuitka as well. But in our analysis, at that point it was simply easier to provide the entire Python interpreter and its environment. It makes for a more robust solution (easier handling of execution-time dependencies, for example) without restricting access to the source code, and it is fully extensible: we demonstrated that it is even possible to run pip in the destination environment and install whatever one needs from there.

Since we believe the full depth of this solution is very specific to our needs, we wrote it in a way that plays well with our build system and stopped there. In particular, we only support creating the relocatable interpreter on Fedora. Extending the script to also support Ubuntu would not be very hard, though. The relocatable interpreter is also part of the Scylla repository and not a standalone solution at the moment. If you think we're wrong in our analysis that this is a narrow and specialized use case, and you could benefit from it too, we would love to hear from you. We are certainly open to making changes to accommodate the wider needs of the community. Reach out to us on Twitter, @ScyllaDB, or drop us a message via our web site.

No actual Pythons were harmed in the writing of this blog.

The post The Complex Path for a Simple Portable Python Interpreter, or Snakes on a Data Plane appeared first on ScyllaDB.

Scylla Open Source Release 3.0.3

Scylla Release

The Scylla team announces the release of Scylla Open Source 3.0.3, a bugfix release of the Scylla Open Source 3.0 stable branch. Scylla 3.0.3, like all past and future 3.x.y releases, is backward compatible and supports rolling upgrades.

Related links:

Issues solved in this release:

  • Counters: Scylla rejects SSTables that contain counters that were created by Cassandra 2.0 and earlier. Due to #4206, Scylla mistakenly rejected some SSTables that were created by Cassandra 2.1 as well.
  • TLS: Scylla now disables TLS 1.0 by default and forces a minimum of 128-bit ciphers #4010. More on encryption in transit (client to server) here.
  • Core: In very rare cases, the commit log replay fails. Commit log replay is used after a node was unexpectedly restarted #4187
  • Streaming: In some cases, Scylla exits due to a failed streaming operation #4124
  • A rare race condition between a node restart and schema updates may cause Scylla to exit #4148

The post Scylla Open Source Release 3.0.3 appeared first on ScyllaDB.

Scylla Users Share Their Stories

Customer Stories: Comcast, Zenly, Grab

“It worked out of the box. We didn’t have to tune anything. It was all just easy.”

— Derek Ramsey, Sensaphone

Our favorite and most important discussions are always with our users. They tell us about their experiences with our database, how they came about using our technology, what they like about it, where we could improve, and much more. These conversations take place any number of ways — on our busy Slack channel, Zoom meetings, emails, phone calls, you name it. Of course, our most customer-centric event is our Scylla Summit user conference.

“What Cassandra can do with 600 nodes, Scylla can do with 60.”

— Murukesh Muhanan, Yahoo! Japan

At our Scylla Summit last November, we took the opportunity to interview a number of users on camera. We recently added these interview clips to our site — you'll see a carousel of these videos on our Users page. The fun part for us isn't just hearing all the great things they have to say about Scylla, it's seeing the innovative solutions our users are able to create with our technology: from world-renowned entertainment apps, to taxi hailing, to social apps, to tech industry leaders and a wide variety of IoT systems.

Customer Stories: Yahoo Japan, Numberly, Sensaphone

“With Scylla, there’s no JVM so cutting out those problems from the list and just having to configure things properly for a Cassandra-type system or a Dynamo-type system is great.”

— Keith Lohnes, Software Engineer, IBM

We encourage you to watch what some of our users have to say in these videos. And have a look at our library of written case studies as well — we now have almost 40 use cases collected on our Users page.

“It’s critical we have a system we can trust, that’s efficient and that maintains consistent low latencies.”

— Alexys Jacob, CTO, Numberly

Interested in sharing your use of Scylla? Please let us know. We’d be glad to share your Scylla story with the world!

The post Scylla Users Share Their Stories appeared first on ScyllaDB.

How to build your very own Cassandra 4.0 release

Over the last few months, I have been seeing references to Cassandra 4.0 and some of its new features. When that happens with a technology I am interested in, I go looking for preview releases to download and test. Unfortunately, so far, there are no such releases. But I am still interested, so I found it necessary to build my own Cassandra 4.0 release. In my humble opinion, this is not the most desirable way to do things, since there is no Cassandra 4.0 branch yet; the 4.0 code lives on trunk. So if you do two builds a commit or two apart, and there are typically at least three or four commits a week right now, you get slightly different builds. It is, in essence, a moving target.

All that said and done, I decided that if I could do it, then the least I could do is write about how to do it and let everyone who wants to try it learn how to avoid a couple of dumb things I did on my first attempt.

Building your very own Cassandra 4.0 release is actually pretty easy. It consists of five steps:

  1. Make sure you have your prerequisites
    1. Java JDK 1.8 or Java 11 (OpenJDK or Oracle)
    2. Ant 1.8
    3. Git CLI client
    4. Python >= 2.7 and < 3.0
  2. Download the GIT repository
    1. git clone http://git-wip-us.apache.org/repos/asf/cassandra.git
  3. Build your new Cassandra release
    1. cd cassandra
    2. ant
  4. Run Cassandra
    1. cd ./bin
    2. ./cassandra
  5. Have fun
    1. ./nodetool status
    2. ./cqlsh

I will discuss each step in a little bit more detail:

Step 1) Verify, and if necessary, install your prerequisites

For Java, you can confirm the JDK presence by typing in:


john@Lenny:~$ javac -version
javac 1.8.0_191

For ant:


john@Lenny:~$ ant -version
Apache Ant(TM) version 1.9.6 compiled on July 20 2018

For git:


john@Lenny:~$ git --version
git version 2.7.4

For Python:


john@Lenny:~$ python --version
Python 2.7.12

If you have all of the right versions, you are ready for the next step. If not, you will need to install the required software which I am not going to go into here.

Step 2) Clone the repository

Verify you do not already have an older copy of the repository:


john@Lenny:~$ ls -l cassandra
ls: cannot access 'cassandra': No such file or directory

If you found a Cassandra directory, you will want to delete it, move it, or change to a different working directory. Otherwise:


john@Lenny:~$ git clone http://git-wip-us.apache.org/repos/asf/cassandra.git
Cloning into 'cassandra'...
remote: Counting objects: 316165, done.
remote: Compressing objects: 100% (51450/51450), done.
remote: Total 316165 (delta 192838), reused 311524 (delta 189005)
Receiving objects: 100% (316165/316165), 157.78 MiB | 2.72 MiB/s, done.
Resolving deltas: 100% (192838/192838), done.
Checking connectivity... done.
Checking out files: 100% (3576/3576), done.

john@Lenny:~$ du -sh *
294M cassandra

At this point, you have used up 294 MB and have an honest-for-real git repo clone on your host – in my case, a Lenovo laptop running the Windows 10 Linux subsystem.

And your repository looks something like this:


  john@Lenny:~$ ls -l cassandra
  total 668
  drwxrwxrwx 1 john john    512 Feb  6 15:54 bin
  -rw-rw-rw- 1 john john    260 Feb  6 15:54 build.properties.default
  -rw-rw-rw- 1 john john 101433 Feb  6 15:54 build.xml
  -rw-rw-rw- 1 john john   4832 Feb  6 15:54 CASSANDRA-14092.txt
  -rw-rw-rw- 1 john john 390460 Feb  6 15:54 CHANGES.txt
  drwxrwxrwx 1 john john    512 Feb  6 15:54 conf
  -rw-rw-rw- 1 john john   1169 Feb  6 15:54 CONTRIBUTING.md
  drwxrwxrwx 1 john john    512 Feb  6 15:54 debian
  drwxrwxrwx 1 john john    512 Feb  6 15:54 doc
  -rw-rw-rw- 1 john john   5895 Feb  6 15:54 eclipse_compiler.properties
  drwxrwxrwx 1 john john    512 Feb  6 15:54 examples
  drwxrwxrwx 1 john john    512 Feb  6 15:54 ide
  drwxrwxrwx 1 john john    512 Feb  6 15:54 lib
  -rw-rw-rw- 1 john john  11609 Feb  6 15:54 LICENSE.txt
  -rw-rw-rw- 1 john john 123614 Feb  6 15:54 NEWS.txt
  -rw-rw-rw- 1 john john   2600 Feb  6 15:54 NOTICE.txt
  drwxrwxrwx 1 john john    512 Feb  6 15:54 pylib
  -rw-rw-rw- 1 john john   3723 Feb  6 15:54 README.asc
  drwxrwxrwx 1 john john    512 Feb  6 15:54 redhat
  drwxrwxrwx 1 john john    512 Feb  6 15:54 src
  drwxrwxrwx 1 john john    512 Feb  6 15:54 test
  -rw-rw-rw- 1 john john  17215 Feb  6 15:54 TESTING.md
  drwxrwxrwx 1 john john    512 Feb  6 15:54 tools

Step 3) Build your new Cassandra 4.0 release

Remember what I said in the beginning? There is no branch for Cassandra 4.0 at this point, so building from the trunk is quite simple:


john@Lenny:~$ cd cassandra
john@Lenny:~/cassandra$ ant
Buildfile: /home/john/cassandra/build.xml

BUILD SUCCESSFUL
Total time: 1 minute 4 seconds

That went quickly enough. Let’s take a look and see how much larger the directory has gotten:


john@Lenny:~$ du -sh *
375M cassandra

Our directory grew by 81 MB, pretty much all of it in the new build directory, which now has 145 new files including ./build/apache-cassandra-4.0-SNAPSHOT.jar. I am liking that version 4.0 right in the middle of the filename.

Step 4) Start Cassandra up. This one is easy if you do the sensible thing


john@Lenny:~/cassandra$ cd ..
john@Lenny:~$ cd cassandra/bin
john@Lenny:~/cassandra/bin$ ./cassandra
john@Lenny:~/cassandra/bin$ CompilerOracle: dontinline org/apache/cassandra/db/Columns$Serializer.deserializeLargeSubset (Lorg/apache/cassandra/io/util/DataInputPlus;Lorg/apache/cassandra/db/Columns;I)Lorg/apache/cassandra/db/Columns;
CompilerOracle: dontinline org/apache/cassandra/db/Columns$Serializer.serializeLargeSubset (Ljava/util/Collection;ILorg/apache/cassandra/db/Columns;ILorg/apache/cassandra/io/util/DataOutputPlus;)V
CompilerOracle: dontinline org/apache/cassandra/db/Columns$Serializer.serializeLargeSubsetSize (Ljava/util/Collection;ILorg/apache/cassandra/db/Columns;I)I

INFO [MigrationStage:1] 2019-02-06 21:26:26,222 ColumnFamilyStore.java:407 - Initializing system_auth.role_members
INFO [MigrationStage:1] 2019-02-06 21:26:26,234 ColumnFamilyStore.java:407 - Initializing system_auth.role_permissions
INFO [MigrationStage:1] 2019-02-06 21:26:26,244 ColumnFamilyStore.java:407 - Initializing system_auth.roles

We seem to be up and running. It's time to try some things out:

Step 5) Have fun

We will start out by making sure we are up and running, using nodetool to connect and display the cluster status. Then we will go into the CQL shell to see something new. It is important to note that since you are likely to have nodetool and cqlsh already installed on your host, you need to use ./ in front of your commands to ensure you are using the 4.0 versions. I have learned the hard way that forgetting the ./ can result in some very real confusion.


  john@Lenny:~/cassandra/bin$ ./nodetool status
  Datacenter: datacenter1
  =======================
  Status=Up/Down
  |/ State=Normal/Leaving/Joining/Moving
  --  Address    Load       Tokens       Owns (effective)  Host ID                               Rack
  UN  127.0.0.1  115.11 KiB  256          100.0%            f875525b-3b78-49b4-a9e1-2ab0cf46b881  rack1
            
  john@Lenny:~/cassandra/bin$ ./cqlsh
  Connected to Test Cluster at 127.0.0.1:9042.
  [cqlsh 5.0.1 | Cassandra 4.0-SNAPSHOT | CQL spec 3.4.5 | Native protocol v4]
  Use HELP for help.
  cqlsh> desc keyspaces;

  system_traces  system_auth  system_distributed     system_views
  system_schema  system       system_virtual_schema

  cqlsh>

We got a nice cluster with one node and we see the usual built-in keyspaces. Well, um… not exactly. We see two new keyspaces: system_virtual_schema and system_views. Those look very interesting.
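As a quick, hedged example (the exact virtual table names can vary between trunk snapshots), the contents of system_views can be queried from cqlsh like regular tables:

  cqlsh> SELECT * FROM system_views.clients;
  cqlsh> SELECT * FROM system_views.sstable_tasks;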

In my next blog, I’ll be talking more about Cassandra’s new virtual table facility and how very useful it is going to be someday soon. I hope.

What Apache Cassandra™ Developers Want

One of the fun things we were able to do at the conferences we attended in 2018 was to use some “conversation starters”. We asked our booth visitors at Strata NYC and AWS re:Invent to contribute to a community comment block (aka sticky notes on our booth wall). We had four prompts on the post-its:

Welcome to the Context Economy

I like the analysts at the 451 Research Group. They’re smart, experienced, no-nonsense, and have a good track record of identifying meaningful movements in the technology space. So when they released their report on 2019 IT trends, I made sure to get my copy and thoroughly go through it.

Some of their identified trends will be instantly recognized by anyone keeping up with our industry—data becoming increasingly important among successful organizations, a ‘no-trust’ security model being implemented by smart companies, and the increasing tilt toward hybrid and multicloud deployments for modern applications and data platforms.

No surprises there.

But one of their trends held my attention more than the others: “The Context Economy Begins to Flourish”. I found their term ‘Context Economy’ particularly insightful and the content on why it’s important to be spot on.

What is the ‘Context Economy’?

451 Group defines the context economy as a paradigm in which data without context is mostly worthless to organizations; only contextual data (data and its associated relationships with other data elements) delivers the lifeblood needed by today's digital applications, whose success hinges on providing the personalized experiences that attract and retain customers.

451 Group’s research shows that those leading the pack in their respective markets are the ones (naturally) who prioritize contextual experiences:

What are the Requirements for the Context Economy?

No doubt some of you are thinking, “Well, duh… Of course data relationships and context matters—isn’t this what relational databases have been doing for 40 years now?”

Actually, no. This is exemplified by 451 Group saying that only now will the context economy begin to flourish.

RDBMSs do indeed store and enforce relationships between data elements; however, they are crippled in supporting today's modern cloud-style applications in several ways.

First, RDBMSs cannot handle the multi-workload requirements. Data context demands a blending and blurring of transactional, analytical, search, in-memory, and graph operations in the same database, with little to no resource or data contention. For example, today's fraud applications must digest incoming transactions, analyze the current request while searching historical buying behavior, map the result to other highly connected data to determine whether to approve or deny the transaction, and then send the response back to the requesting application in the blink of an eye.

Next, data context requires multi-model support. Today's applications use microservices architectures to componentize the app's various functions. Each component has its own best-of-breed data model requirements, which then must be stored and accessed alongside other models in real time. Further, today's data relationships differ from those of past generations in that they are highly connected, meaning they outstrip the ability of an RDBMS to efficiently store and query the connected data. A graph database, however, is built for this very purpose.

Lastly, data context needs multi-location capabilities for all supported data models. And make no mistake: this requirement goes beyond the master/slave or multi-master architectures of RDBMSs and other NoSQL vendors, which are painfully hard to administer and woefully inadequate at supporting continuous availability, read and write functions at any location, hybrid and multi-cloud deployments, and high-speed data synchronization.

The Context Database for the Context Economy

At DataStax, our customers are building contextual applications on DataStax Enterprise (DSE) because it natively supports multi-workload, multi-model, and multi-location operation. Multi-workload is handled via DSE Analytics, DSE Search, in-memory functionality, and distributed graph capabilities, all of which run in the same cluster/database.

DSE also supports all key data models, including tabular, key-value, document, and graph, which can all exist in the same cluster/database. Finally, DSE and its underlying foundation of open source Apache Cassandra™ are the gold standard when it comes to globally distributing and synchronizing data everywhere as well as transparently running in hybrid and multi-cloud fashion.

The 451 Group’s report on 2019 trends, and in particular their point on the context economy, dovetails well with how we see DataStax customers beating out their competition. For more information on DSE, see our Resources website page, get registered for our upcoming Accelerate Conference, and also be sure to download the latest version of DSE to try in your environment.

Powering A Next Generation Streaming TV Experience

Sling TV is an over-the-top (OTT) live streaming content platform that instantly delivers live TV and on-demand entertainment via the internet on customer-owned and managed devices. These include a variety of smart televisions, tablets, game consoles, computers, smartphones, and streaming devices (16 platforms in total). It is currently the number one live TV streaming service, with approximately 2.3 million customers.

Outperforming the Competition

With so many options available to potential “cord cutters” (i.e., people seeking a wireless- or internet-based service over a paid TV subscription), it is important to provide a first-class experience that makes your product stand out in a market that is becoming more and more saturated.

As such, it is critical for Sling TV to provide a highly resilient service that is personalized to each user and scales on demand to keep up with our expanding customer base and with changes on the internet. This includes the need to be highly available and resilient while being able to centralize business logic across our sixteen platforms to deliver a common experience to our customers.

We also want to reduce the time to market for features in a continuous deployment model and ultimately enable a deployment unit of “datacenter,” allowing another instance of the Sling TV backend to be built on demand in a hybrid cloud environment.

On the backend, we needed a common data store for our core customer data and personalized content, available in all data centers serving our middleware stack. The solution would need to provide media distribution capabilities including authentication, program catalogs, and feature metadata storage. We had a big list of needs to fulfill, and we wanted a proven solution that would support our next-gen architecture goals.

Why Sling Chose DataStax and DSE Built on Apache Cassandra

We chose DataStax Enterprise (DSE) for three main reasons:

  1. Not only was it important for us to have a proven solution to support our goals, we were also looking for a partner—as opposed to just a vendor—to help us achieve the success we had envisioned.
  2. Having a database designed for hybrid cloud infrastructure was a key part of our strategy, as we needed to be able to replicate data across remote data centers.
  3. DSE would give us virtually an unlimited ability to scale horizontally to keep up with future growth.

DSE has become a key part of our hybrid cloud strategy and has enabled our software to run in private and public clouds with close to the same tooling. With DSE, we are now able to replicate data across the country in less than two seconds, which is a big win for us. We look forward to leveraging DSE to power the future growth of Sling TV.

eBook: The Way of Customer 360

READ NOW