Mobile BI and Big Data – How to use AWS Elastic MapReduce results with Roambi Mobile BI Analytics

So far we have covered the server-side/cloud components – how to process data with MapReduce running in the cloud or on our own Hadoop cluster. This time it is about the client side.

If you have a look at Mary Meeker’s latest brilliant presentation on Internet trends, one of the key messages is the significant increase in mobile 3G subscriptions and the mind-boggling sales figures for tablets (read: iPad) and smartphones (read: iPhone and Android):

The Internet is going mobile and applications follow the trend – this can be seen in mobile business intelligence, too, which has shown significant momentum recently. People are on the move with mobile devices whose performance is similar to that of a notebook from a few years ago, see the Geekbench results here. It is time to use this power at hand for business intelligence as well. The tools are already out there to analyse big data and then publish the results to mobile devices.

Amazon Elastic MapReduce

In the March post we covered Amazon Elastic MapReduce. Having talked about mobile Internet subscriptions and the enormous growth in that area, this time we will analyse mobile subscriptions data from the World Bank. This data is about subscriptions to a public mobile telephone service using cellular technology, postpaid and prepaid subscriptions included.

Creating an AWS Elastic MapReduce job requires 3 steps: upload the input data to an S3 bucket/folder, run an EMR job (e.g. Hive, Pig or custom Java), and download the output from an S3 folder.

The S3 storage looks like this for our test: there is a mobilesubscriptions bucket containing two folders, one for the Hive scripts (hive-scripts) and one for the data (mobilesubs). In the mobilesubs folder there is an input folder where we upload the mobile_subscriptions.csv file. The output will be created under the s3://mobilesubscriptions/mobilesubs/output folder in csv format.

Its format looks like this:

Country Name,Country Code,2010
American Samoa,ASM,

(2010 is the last year for which we had data; a blank value, as in the American Samoa row above, means the figure is missing)
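When the file is processed, the blank trailing field has to be treated as a missing value. A minimal Python sketch of parsing one row of this format (the function name is just for illustration, and it assumes country names contain no embedded commas):

```python
def parse_subscription_row(line):
    """Parse one data row of the csv: country name, country code, subscriptions."""
    country_name, country_code, subscriptions = line.rstrip("\n").split(",")
    # A blank last field means the 2010 figure is missing for that country.
    value = float(subscriptions) if subscriptions else None
    return country_name, country_code, value
```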

The Hive script that we use for data processing is shown below – it will return the top 100 countries with the highest number of subscriptions:

CREATE EXTERNAL TABLE mobilesubs (
    country_name STRING, country_code STRING, subscriptions FLOAT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://mobilesubscriptions/mobilesubs/input/';

CREATE TABLE top100_mobilesubs (
    country_name STRING, country_code STRING, subscriptions FLOAT
);

INSERT OVERWRITE TABLE top100_mobilesubs
SELECT country_name, country_code, subscriptions
FROM  mobilesubs
ORDER BY subscriptions DESC
LIMIT 100;

INSERT OVERWRITE DIRECTORY 's3://mobilesubscriptions/mobilesubs/output/'
SELECT * from top100_mobilesubs;
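The SELECT with ORDER BY … DESC LIMIT 100 simply sorts the table by the subscriptions column in descending order and keeps the first 100 rows. The same logic sketched in Python on a few sample tuples (the data values here are made up for illustration):

```python
def top_n_by_subscriptions(rows, n=100):
    """Equivalent of ORDER BY subscriptions DESC LIMIT n on (name, code, subs) tuples."""
    # Rows with a missing figure sort last; everything else descends by value.
    return sorted(rows, key=lambda r: (r[2] is None, -(r[2] or 0)))[:n]

rows = [("A", "AAA", 10.0), ("B", "BBB", None), ("C", "CCC", 30.0)]
print(top_n_by_subscriptions(rows, n=2))  # [('C', 'CCC', 30.0), ('A', 'AAA', 10.0)]
```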

The job that will process the data using AWS EMR is configured as follows:

Once we run the job, it creates a 000000_0 file under the s3://mobilesubscriptions/mobilesubs/output directory.

This output file needs to be downloaded and processed to replace the SOH characters (Ctrl-A, Hive's default output delimiter) with commas (,), in order to be able to publish it with Roambi Mobile BI Analytics. This can be done with any text processing tool (e.g. Notepad++).
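The conversion can also be scripted instead of done in an editor – a small Python sketch that rewrites the \x01 delimiters as commas (the file names in the usage comment are just examples):

```python
def soh_to_csv(text):
    """Replace the SOH (\x01) field delimiter in Hive output with commas."""
    return text.replace("\x01", ",")

# Usage on the downloaded EMR output file, e.g.:
# with open("000000_0") as src, open("mobilesubs_result.csv", "w") as dst:
#     dst.write(soh_to_csv(
```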

Roambi Analytics

Roambi Analytics offers a cloud-based publishing service and a mobile BI visualizer tool available for iPad and iPhone. The application can be installed on mobile devices from the Apple App Store for free.

The Roambi publisher comes in 3 versions: Roambi Lite, which is free and has limited functionality (support for csv, excel and html formats); Roambi Pro (with additional Google Docs support); and Roambi Enterprise (with support for Oracle, SAP BusinessObjects, SAS, Microsoft, IBM Cognos, etc.).

This demo is based on Roambi Lite. First you need to create an account or log in using a Google Account (OpenID) at

Then click on Publish:

Select the appropriate view (e.g. CataList) and import the data (this will be the mobilesubs_result.csv file that we downloaded from the AWS EMR s3://mobilesubscriptions/mobilesubs/output folder and prepared for Roambi Analytics as described above).

You can refine the data if you wish and then publish it:

The file will be pushed to the mobile devices (iPad or iPhone). In the case of Roambi Lite, you can push it to your own device.

Roambi Analytics Visualizer

On the handset you can retrieve the result using the Roambi Analytics Visualizer. You can create an email or a screenshot from the report, add it to your favorites, etc.

iPhone screenshots:

iPad screenshot:

Email sent from Roambi Analytics Visualizer:

As you can see, mobile BI and BigData in the cloud can free users from being desktop slaves: no need for datacenter infrastructure and no need for a traditional desktop – just the joy of mobility spiced with the power of cloud computing.


Spring Data – Apache Hadoop


Spring for Apache Hadoop is a Spring project that supports writing applications benefiting from the integration of the Spring Framework and Hadoop. This post describes how to use Spring Data Apache Hadoop in an Amazon EC2 environment using the “Hello World” equivalent of Hadoop programming – a Wordcount application.

1./ Launch an Amazon Web Services EC2 instance.

– Navigate to the AWS EC2 Console:

– Select Launch Instance, then Classic Wizard, and click on Continue. My test environment was a “Basic Amazon Linux AMI 2011.09”, 32-bit; instance type: Micro (t1.micro, 613 MB); security group quick-start-1, which enables ssh to be used for login. Select your existing key pair (or create a new one). Obviously you can select another AMI and other instance types, depending on your favourite flavour. (Should you vote for a Windows 2008 based instance, you also need to have cygwin installed as an additional Hadoop prerequisite besides Java JDK and ssh, see the “Install Apache Hadoop” section.)

2./ Download Apache Hadoop – as of writing this article, 1.0.0 is the latest stable version of Apache Hadoop, and that is what was used for testing purposes. I downloaded hadoop-1.0.0.tar.gz and copied it into the /home/ec2-user directory using the pscp command from my PC running Windows:

c:\downloads>pscp -i mykey.ppk hadoop-1.0.0.tar.gz ec2-user@<public-dns-name>:/home/ec2-user

(the computer name – the <public-dns-name> placeholder above – can be found on the AWS EC2 console, Instance Description, Public DNS field)

3./ Install Apache Hadoop:

As prerequisites, you need to have Java JDK 1.6 and ssh installed, see the Apache Single-Node Setup Guide. (ssh is automatically installed with the Basic Amazon AMI.) Then install Hadoop itself:

$ cd  ~   # change directory to ec2-user home (/home/ec2-user)

$ tar xvzf hadoop-1.0.0.tar.gz

$ ln -s hadoop-1.0.0  hadoop

$ cd hadoop/conf

$ vi   # edit as below

export JAVA_HOME=/opt/jdk1.6.0_29

$ vi core-site.xml    # edit as below – this defines the namenode to be running on localhost and listening on port 9000.

<configuration>
  <property>
    <name></name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

$ vi hdfs-site.xml  # edit as below – this defines that the file system replication factor is 1 (in a production environment it is supposed to be 3 by default)

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

$ vi mapred-site.xml  # edit as below – this defines the jobtracker to be running on localhost and listening on port 9001.

<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
</configuration>

$ cd ~/hadoop

$ bin/hadoop namenode -format

$ bin/

At this stage all Hadoop jobs are running in pseudo-distributed mode; you can verify this by running:

$ ps -ef | grep java

You should see 5 java processes: namenode, secondarynamenode, datanode, jobtracker and tasktracker.

4./ Install Spring Data Hadoop

Download the Spring Data Hadoop package from the SpringSource community download site. As of writing this article, the latest version is 1.0.0.M1.

$ cd ~

$ tar xzvf spring-data-hadoop-1.0.0.M1.tar.gz

$ ln -s spring-data-hadoop-1.0.0.M1 spring-data-hadoop

5./ Build and Run Spring Data Hadoop Wordcount example

$ cd spring-data-hadoop/spring-data-hadoop-1.0.0.M1/samples/wordcount

Spring Data Hadoop uses gradle as its build tool. Check the build.gradle build file. The original version packaged in the tar.gz file does not compile; it complains about thrift, version 0.2.0, and jdo2-api, version 2.3-ec.

Add a maven repository to the build.gradle file to support jdo2-api (the datanucleus maven repo).

Unfortunately, there seems to be no maven repo for thrift 0.2.0. You should download the thrift-0.2.0.jar and thrift-0.2.0.pom files from a repo and then add them to your local maven repo:

$ mvn install:install-file -DgroupId=org.apache.thrift  -DartifactId=thrift  -Dversion=0.2.0 -Dfile=thrift-0.2.0.jar  -Dpackaging=jar

$ vi build.gradle  # modify the build file to refer to the datanucleus maven repo for jdo2-api and the local repo for thrift

repositories {
    // Public Spring artefacts
    maven { url "" }
    maven { url "" }
    maven { url "" }
    maven { url "" }
    maven { url "file:///home/ec2-user/.m2/repository" }
}

I also modified the META-INF/spring/context.xml file in order to run hadoop file system commands manually:

$ cd /home/ec2-user/spring-data-hadoop/spring-data-hadoop-1.0.0.M1/samples/wordcount/src/main/resources

$ vi META-INF/spring/context.xml   # remove the clean-script and also the dependency on it for JobRunner.

<?xml version="1.0" encoding="UTF-8"?>
<context:property-placeholder location=""/>

<hdp:job id="wordcount-job" validate-paths="false"
    input-path="${wordcount.input.path}" output-path="${wordcount.output.path}"/>

<!-- simple job runner -->
<bean id="runner" class="" p:jobs-ref="wordcount-job"/>


Copy the sample file – nietzsche-chapter-1.txt – to the Hadoop file system (/user/ec2-user/input directory):

$ cd src/main/resources/data

$ hadoop fs -mkdir /user/ec2-user/input

$ hadoop fs -put nietzsche-chapter-1.txt /user/ec2-user/input/data

$ cd ../../../..   # go back to samples/wordcount directory

$ ../gradlew run

Verify the result:

$ hadoop fs -cat /user/ec2-user/output/part-r-00000 | more

“BY 1
“Beyond 1
“By 2
“Cheers 1
“DE 1
“Everywhere 1
“FROM” 1
“Flatterers 1
“Freedom 1
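The logic the job computes can be sketched in a few lines of plain Python – this is only the wordcount itself, not the Spring Data/Hadoop machinery. Tokens are split on whitespace, which is why the quotation marks stick to the words in the output above:

```python
from collections import Counter

def wordcount(text):
    """Count whitespace-separated tokens, like the classic Hadoop wordcount."""
    return Counter(text.split())

counts = wordcount('"Beyond Good and Evil" - "Beyond the mere words')
print(counts['"Beyond'])  # 2
```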

Welcome to BigHadoop

BigData, and particularly Hadoop/MapReduce, represents a quickly growing part of Business Intelligence and Data Analytics. In his frequently quoted article on O’Reilly Radar, Edd Dumbill gives a good introduction to the big data landscape: What is big data?

Three V-words recur when experts attempt to define what big data is all about: Volume (terabytes and petabytes of information), Velocity (data is literally streaming in at unprecedented speed) and Variety (structured and unstructured data). You can convert these V-words into a fourth one: Value. BigData promises insights about things that remained hidden until now.

The intention of this blog is to cover various technologies: from cloud computing, which provides the infrastructure, through Hadoop distributions, which are used to crunch the numbers, to mobile analytics, which can provide easy access to the results of the complex algorithms and enormous computing capacity.