Monday, 16 September 2013

Cassandra on Windows Azure Virtual Machines

In this guide I will show how to deploy Cassandra on an Azure Virtual Machine. I will also describe how to make a handy image of it, which will later be used to deploy a Cassandra cluster. I assume that anyone who has found their way here knows what Cassandra and Windows Azure are. Just in case, here are the references for Cassandra and Azure.

Deploying Cassandra on a single Azure Virtual Machine

Creating virtual machine

Creating a new virtual machine through the Windows Azure management portal is fairly easy. Just click the FROM GALLERY option, as in the image below:

The operating system I used is Ubuntu 12.04 LTS. It can be picked straight from the gallery:

Next, just follow the steps of the virtual machine creation wizard. It is important to open the SSH port (it should be open by default), because SSH will be the main way of communicating with the machine.

Connecting to the machine

Once the VM is up, we can connect to it using a widely known tool, PuTTY. Just run it and create an SSH session with the host name and port shown in the Azure management portal:

We will be prompted for credentials. Use the same ones as during machine creation. After that we should be logged in to the machine:

Installation of Cassandra on VM

Cassandra can be installed in many ways; in this guide I will describe how to build it straight from source. First we have to install Oracle's Java. It is needed both for running Cassandra (JRE) and for building it with Ant (JDK). The easiest way to install Java is through a Debian package. We have to add the appropriate repository to the Advanced Packaging Tool (apt) sources list:

$ sudo add-apt-repository ppa:webupd8team/java

This step always has to be followed by:

$ sudo apt-get update

It downloads the package lists from the newly added repository (and also refreshes the package lists of the other repositories already in use). After these steps Oracle's Java package will be available to us. To install it, use the following command:

$ sudo apt-get install oracle-java7-installer

During installation we will have to accept the license:

The next step is to install the Apache Ant build tool and Git:

$ sudo apt-get install ant

$ sudo apt-get install git
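Before moving on, it can be worth confirming that everything installed so far is actually on the PATH. The following is only a minimal sketch for a POSIX shell; the helper name missing_tools is my own invention, not part of any package:

```shell
#!/bin/sh
# Print the name of every listed command that cannot be found on the PATH.
# missing_tools is a hypothetical helper name used only for this example.
missing_tools() {
    for tool in "$@"; do
        # command -v succeeds only if the shell can resolve the name
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Report which of the tools used in this guide are still missing.
missing_tools java ant git
```

If the call prints nothing, all three tools were found.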

To download the latest Cassandra sources from its repository, type:

$ sudo git clone http://git-wip-us.apache.org/repos/asf/cassandra.git

It will clone the Cassandra git repository into the “cassandra” folder under the current location.
To build Cassandra, go to that folder:

$ cd cassandra

and use:

$ sudo ant build

Cassandra needs two folders, one for data and one for logs. Create them with:

$ sudo mkdir /var/lib/cassandra /var/log/cassandra

And change their owner to the current user and group:

$ sudo chown -R "$USER":"$(id -gn)" /var/lib/cassandra
$ sudo chown -R "$USER":"$(id -gn)" /var/log/cassandra
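To confirm that both folders exist and that the current user can write to them, a check along these lines could be used. It is a sketch only, and unwritable_dirs is a hypothetical helper name:

```shell
#!/bin/sh
# Print each listed directory that is missing or not writable by the current user.
# unwritable_dirs is a hypothetical helper name used only for this example.
unwritable_dirs() {
    for dir in "$@"; do
        if [ ! -d "$dir" ] || [ ! -w "$dir" ]; then
            printf '%s\n' "$dir"
        fi
    done
}

# No output means both directories are ready for Cassandra.
unwritable_dirs /var/lib/cassandra /var/log/cassandra
```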

We now have a machine with a fresh Cassandra installation. Do not start Cassandra yet: we want to capture the machine in this untouched state.

The last step is to create a startup script that adds the machine name to the /etc/hosts file. Why do it? It is a kind of workaround. When we capture an image of this machine and use it to create another machine with a different name, Azure won't add the new name for us, and we will run into:

Error: Exception thrown by the agent : java.net.MalformedURLException: Local host name unknown: java.net.UnknownHostException: <name>:<name>

while starting Cassandra.
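Whether a machine name is present in a hosts file can be checked with a few lines of awk. This is a rough sketch; hosts_file_has_name is a hypothetical helper, and it only looks at the file itself, not at DNS:

```shell
#!/bin/sh
# Succeed if the hosts-format file $1 maps some address to the name $2.
# hosts_file_has_name is a hypothetical helper name used only for this example.
hosts_file_has_name() {
    awk -v name="$2" '
        { sub(/#.*/, "") }    # drop comments before looking at the fields
        { for (i = 2; i <= NF; i++) if ($i == name) found = 1 }
        END { exit !found }
    ' "$1"
}

# Check the name of the current machine against /etc/hosts.
if hosts_file_has_name /etc/hosts "$(uname -n)"; then
    echo "machine name is present in /etc/hosts"
else
    echo "machine name is missing from /etc/hosts"
fi
```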

To create the startup script, open a text editor:

$ nano

And paste the following code into it:

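#!/bin/bash
# Runs at boot: registers the current host name in /etc/hosts and points
# Cassandra at this machine's internal address.

# hostname -I can print several addresses; keep only the first one.
local_address=$(hostname -I | awk '{print $1}')

# Assumes $HOME is the home directory that holds the cassandra checkout.
cassandra_yaml="$HOME/cassandra/conf/cassandra.yaml"

# Append the host name to the localhost entry so that Java can resolve it.
sed -i "/^127.0.0.1 localhost/ c\127.0.0.1 localhost $HOSTNAME" /etc/hosts

# Bind both the client (rpc) address and the inter-node (listen) address.
sed -i "/^rpc_address: / c\rpc_address: $local_address" "$cassandra_yaml"
sed -i "/^listen_address: / c\listen_address: $local_address" "$cassandra_yaml"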

Save it with a .sh extension. We have to make this script executable:

$ sudo chmod +x <script_name>.sh

To make it run during boot, create a symbolic link in /etc/init.d:

$ sudo ln -s /path/to/script/<script_name>.sh /etc/init.d/<script_name>.sh

And register the script to run at startup:

$ sudo update-rc.d <script_name>.sh defaults
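Before rebooting, the sed expressions from the startup script can be tried safely on scratch copies instead of the real files. The sketch below assumes GNU sed, as shipped with Ubuntu; the host name node1 and the address 10.0.0.4 are made-up sample values:

```shell
#!/bin/sh
# Try the startup script's substitutions on scratch copies of the two files.
# "node1" and "10.0.0.4" are made-up sample values, not real settings.
scratch=$(mktemp -d)
printf '127.0.0.1 localhost\n' > "$scratch/hosts"
printf 'listen_address: localhost\nrpc_address: localhost\n' > "$scratch/cassandra.yaml"

# Same expressions as in the startup script, with the variables filled in by hand.
sed -i "/^127.0.0.1 localhost/ c\127.0.0.1 localhost node1" "$scratch/hosts"
sed -i "/^rpc_address: / c\rpc_address: 10.0.0.4" "$scratch/cassandra.yaml"
sed -i "/^listen_address: / c\listen_address: 10.0.0.4" "$scratch/cassandra.yaml"

# Show the rewritten files.
cat "$scratch/hosts" "$scratch/cassandra.yaml"
```

Each matching line is replaced as a whole, which is why the script can be run on every boot without accumulating duplicates.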

As you have probably noticed, the startup script does more than I described above. It also sets some Cassandra configuration options that are mandatory for running Cassandra later:

listen_address – the address on which a Cassandra node communicates with other nodes
rpc_address – the address on which the node listens for clients

The script sets both options to the virtual machine's internal address. Why set rpc_address to the internal address and not 0.0.0.0? It depends on the client, and more specifically on the protocol. If you are going to use a client that supports only the Thrift protocol, set it to 0.0.0.0, and Thrift will listen on all interfaces. But if you are going to use the new CQL native protocol and place the client within the cluster's service, use the internal machine address. The Cassandra client drivers provided by DataStax (for Java, C# and Python) support the CQL native protocol and are fully asynchronous. Moreover, you can configure retry, reconnection and load balancing policies, so you have full control over cluster traffic.

But there is still one more mandatory option that is not configured: seeds, which is a comma-delimited list of host addresses. Cassandra nodes use this list to find each other and learn the topology of the ring. At this point a virtual machine created from the image won't know anything about other nodes in the service subnet, so later you will have to add the seed node addresses manually in cassandra/conf/cassandra.yaml.
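Editing the seeds by hand on every node gets tedious, so the change could be scripted. The sketch below assumes that the file contains a line of the form `- seeds: "127.0.0.1"` under seed_provider, as in the default cassandra.yaml I am familiar with, and that GNU sed is available; set_seeds is a hypothetical helper name:

```shell
#!/bin/sh
# Replace the value of the "- seeds:" entry in a cassandra.yaml file.
# set_seeds is a hypothetical helper: $1 is the file, $2 the comma-delimited list.
# Assumes the addresses contain no "/" or "&", which sed would treat specially.
set_seeds() {
    # Keep the original indentation and key, swap only the quoted value.
    sed -i "s/^\([[:space:]]*- seeds:\).*/\1 \"$2\"/" "$1"
}

# Usage with made-up internal addresses:
# set_seeds "$HOME/cassandra/conf/cassandra.yaml" "10.0.0.4,10.0.0.5"
```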

Now our default Cassandra node is almost finished. The last thing to do is to prepare it for capturing by undoing the provisioning customization. The following command does the trick:

$ sudo waagent -deprovision

And we are ready for capturing.

Capturing Virtual Machine

In the Azure management portal, shut down the machine we were working on. Once it is off, the Capture icon becomes enabled.

Clicking it displays the following window:

Set the name for the image and tick the “I have run the Windows Azure Linux Agent on the virtual machine” checkbox. As stated in the IMPORTANT NOTE section, this virtual machine will be deleted.

Creating Cassandra cluster

Now that we have an image of a machine with Cassandra on it, we can start deploying a cluster. In this example I will describe the scenario of deploying a cluster in a single service.

Creating nodes

This step is almost the same as Creating virtual machine above; the difference is that we will use our captured image. You can find it in the VM gallery under MY IMAGES. Remember to use the same username as during image creation! This prevents a new profile from being created on that machine. Repeat this step once for every node you want in your cluster, but remember to use the same service for all of them.

Running the cluster

Once all of the machines are up, you will have to connect to each of them and configure the cassandra.yaml file by adding seeds to it. You have to choose which nodes will be treated as seeds, get their internal addresses and put them into cassandra.yaml. Next, run Cassandra on each machine, starting with the seed machines. To run Cassandra, type:

$ cassandra/bin/cassandra

Once Cassandra is up on all machines, you can check with nodetool whether all of the created nodes are in the ring. Just execute the following command on any machine:

$ cassandra/bin/nodetool ring
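With more than a handful of nodes, reading the ring listing by eye gets error-prone, so the output could be piped through a small counter. This is a sketch under an assumption: each node line starts with an IPv4 address and has a status column reading Up or Down. The exact layout may differ between Cassandra versions, and count_up_nodes is a hypothetical helper name:

```shell
#!/bin/sh
# Count distinct node addresses reported as "Up" in nodetool ring output on stdin.
# Assumes node lines start with an IPv4 address and contain a separate "Up" field;
# count_up_nodes is a hypothetical helper name used only for this example.
count_up_nodes() {
    awk '$1 ~ /^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+$/ {
             for (i = 2; i <= NF; i++) if ($i == "Up") up[$1] = 1
         }
         END { n = 0; for (a in up) n++; print n }'
}

# Usage:
# cassandra/bin/nodetool ring | count_up_nodes
```

Addresses are counted once even if a node shows up on several lines, which can happen when a node owns more than one token.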

That’s it. Using this image we can add new nodes to the cluster quickly and easily. There is still a lot of room for automation in the process of adding new nodes, but that is a subject for another article.

Now Fluent Editor matches not only your needs! Regular expressions.

Since version 2.3.17 Fluent Editor has a new feature: a regular expression matcher.
This functionality allows you to specify as an attribute not only one particular string, but also a whole set (or class) of strings defined by a regular expression.

Regular expressions

Regular expressions have a strong theoretical background in computer science and mathematical linguistics. They are connected with the Chomsky hierarchy of languages: the languages they describe are exactly the regular languages (see Chomsky hierarchy).
You can learn about regular expressions and their syntax e.g. here.
Broadly speaking, with regular expressions you can specify a pattern that matches one, many or even an infinite number of strings. This is an easy way to validate, for example, e-mail addresses and phone numbers.

Regular expressions in Fluent Editor

Since Fluent Editor version 2.3.17 it is possible to define string attributes as regular expression patterns, both in ontologies and in questions. This is done with a new keyword:

that-matches-pattern

This keyword may appear instead of the "equal-to" keyword. You can attach such a patterned attribute to an instance or a concept, or use it in a question.
Now it is time for a really quick "Hello World" for regular expressions in Fluent Editor.
Let's create an ontology:

Tom has-name equal-to 'Tommy'.

Jerry has-name equal-to 'Jerry'.

Every-single-thing that has-name that-matches-pattern '.*rry' is a mouse.

Every cat has-name that-matches-pattern '(T|J)o(n|m){1,2}y'.

Max is a cat.

Max has-name equal-to 'Max'.

In the above ontology there are 2 regular expressions. The first one ('.*rry') describes all strings that end with 'rry'. So it matches 'Jerry' as well as 'blabla23424rry'.
The second one ('(T|J)o(n|m){1,2}y') matches any string that starts with either 'T' or 'J', then has 'o', then one or two characters that are each either 'n' or 'm', and then ends with 'y'. So the strings matched by this regular expression are exactly the following twelve: Tony, Tomy, Tonny, Tonmy, Tomny, Tommy, Jony, Jomy, Jonny, Jonmy, Jomny and Jommy.
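The set of strings matched by the second pattern is small enough to enumerate mechanically. The sketch below uses grep -E with -x (whole-line match) as a stand-in for the matcher; Fluent Editor itself is not involved, and cat_name_strings is a hypothetical helper name:

```shell
#!/bin/sh
# List every candidate string that grep accepts as a whole-line match of
# (T|J)o(n|m){1,2}y. cat_name_strings is a hypothetical helper name.
cat_name_strings() {
    for first in T J; do
        # "nnn" is included on purpose to show a candidate that gets rejected.
        for middle in n m nn nm mn mm nnn; do
            printf '%s\n' "${first}o${middle}y"
        done
    done | grep -Ex '(T|J)o(n|m){1,2}y'
}

cat_name_strings
```

Mixed middles such as 'Tonmy' and 'Jomny' pass the filter, because each repetition of the group (n|m) picks its letter independently.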
Let's ask some questions. The first one is a warm-up:

Who-Or-What has-name that-matches-pattern '.*ry'?

It returns only 'Jerry'. Why not 'mouse'? Because '.*ry' also matches e.g. 'Kery', which is not matched by '.*rry', so Kery is not necessarily a mouse (although it may be one).
The second question is a slightly modified version of the first one:

Who-Or-What has-name that-matches-pattern '.*erry'?

Now it returns 'Jerry' and 'mouse'. 'Jerry' as a result is obvious. 'mouse' is obvious too, because every string that is matched by '.*erry' (ends with 'erry') is also matched by '.*rry' (ends with 'rry'), the pattern used in the ontology.
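The relation between the three suffix patterns can be tried out on a few sample words. This is only an illustration with grep -E as a stand-in matcher; matches is a hypothetical helper, and it says nothing about how Fluent Editor reasons internally:

```shell
#!/bin/sh
# Succeed if the whole string $2 matches the extended regular expression $1.
# matches is a hypothetical helper built on grep, used only for this example.
matches() {
    printf '%s\n' "$2" | grep -Eqx -- "$1"
}

# Report, for each sample word, which of the suffix patterns it satisfies.
for word in Jerry Kery blabla23424rry; do
    for pattern in '.*ry' '.*rry' '.*erry'; do
        if matches "$pattern" "$word"; then
            echo "$word matches $pattern"
        fi
    done
done
```

'Kery' is reported only for '.*ry', which is why the first question cannot conclude that everything it matches is a mouse.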
The third question:

Who-Or-What has-name that-matches-pattern '[A-Z]+[a-z]*'?

The regular expression used above matches all strings that start with one or more uppercase letters followed by zero or more lowercase letters.
This query returns 'Tom', 'Jerry' and 'Max' (which are obvious) and 'cat' (because if a name satisfies '(T|J)o(n|m){1,2}y', it also matches '[A-Z]+[a-z]*').

Now it is time for the tricky question:

Who-Or-What has-name that-matches-pattern '[TJonm]*y?'?

The regular expression '[TJonm]*y?' matches strings that consist of zero or more 'T', 'J', 'o', 'n' or 'm' letters and may end with 'y' ('?' indicates zero or one occurrence of 'y'). So this regular expression matches such strings as 'Tommy' or 'omnomnom'. It also matches all strings generated by '(T|J)o(n|m){1,2}y'. So it is no surprise that 'Tom' and 'cat' are returned by this query.
But why did this query return 'Max'? Admittedly, Max's stated name does not match the regular expression in the question, but the truth is that all cats fit the query, and Max is a cat. Every cat has a name matching '(T|J)o(n|m){1,2}y', and under the Open World Assumption (see OWA) the ontology need not list all of Max's names, so Max has another, unknown name that matches the regular expression in the query.
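The claim that the question's pattern covers every possible cat name can be checked by brute force, since the cat pattern generates only a handful of strings. Again this is a sketch with grep -E standing in for the matcher, and matches is a hypothetical helper:

```shell
#!/bin/sh
# Succeed if the whole string $2 matches the extended regular expression $1.
# matches is a hypothetical helper built on grep, not a Fluent Editor feature.
matches() {
    printf '%s\n' "$2" | grep -Eqx -- "$1"
}

question='[TJonm]*y?'

# Count the strings generated by (T|J)o(n|m){1,2}y that the question does NOT match.
outside=0
for first in T J; do
    for middle in n m nn nm mn mm; do
        matches "$question" "${first}o${middle}y" || outside=$((outside + 1))
    done
done
echo "cat-name strings not covered by the question: $outside"

# The names stated explicitly in the ontology, checked directly.
for name in Tommy Jerry Max; do
    if matches "$question" "$name"; then echo "$name: match"; else echo "$name: no match"; fi
done
```

The string 'Max' itself does not match, so Max can only appear in the answer through the axiom about cats, not through his stated name.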

If you want to learn more about the Fluent Editor CNL-EN grammar, visit this link.

*) FluentEditor 2, an ontology editor, is a comprehensive tool for editing and manipulating complex ontologies that uses Controlled Natural Language. Fluent Editor provides an alternative to XML-based OWL editors that is more suitable for human users. Its main feature is the use of Controlled English as a knowledge modeling language. Supported by the Predictive Editor, it prevents the user from entering any sentence that is grammatically or morphologically incorrect and actively helps during sentence writing. Controlled English is a subset of Standard English with restricted grammar and vocabulary, designed to reduce the ambiguity and complexity inherent in full English.