Integrating R with Cloudera Impala for Real-Time Queries on Hadoop


Introduction

Cloudera Impala supports low-latency, interactive queries on Hadoop data sets either stored in Hadoop Distributed File System (HDFS) or HBase, the distributed NoSQL database for Hadoop. Impala’s notion is to use Hadoop as a storage engine but move away from MapReduce algorithms. Instead, Impala uses distributed queries, a concept inherited from massive parallel processing databases. As a result, Impala supports SQL-like query languange (in the same way way as Apache Hive), but can execute the queries 10-100 times fasters than Hive that converts them into MapReduce. You can find more details on Impala in one of the previous posts.

is one of the most popular open source statistical computing and graphical software. It can work with various data sources from comma separated files to web contents referred by URLs to relational databases to NoSQL (e.g. MongoDB or Cassandra) and Hadoop.

Thanks to the generic Impala ODBC driver, R can be integrated with Impala, too. The solution will provide fast, interactive queries running on top of Hadoop data sets and then the data can be further processed or visualized within R.

Cloudera Impala ODBC drivers

As we can see in the diagram below, Impala runs on the top of dataset stored in HDFS or HBase and the users can interact with it in multiple ways.

impala-architecture

One option is to use impala-shell which is part of the impala package and provides a command line interface. Other option is to use Hue (Cloduera’s Hadoop User Experience product) that is a web browser based UI offering a query editor among other functions that is capable of run queries against Pig, Hive or Impala.The third option is to use ODBC driver and connect some of the well-known popular BI tools to Impala.

Cloudera provides connectors for some of the most popular leading analytics and data visualization tools such as Tableau, QlikView or Microstrategy. It can also offer a generic ODBC driver that can be used to connect various tools. This is the software component that we will use in the post to demonstrate how to integrate R with Cloudera Impala.

Install R, RStudio Server, Impala ODBC and RODBC

Impala installation was covered in this post. To install R on a Linux environment (for now Fedora 19 will be used ) we need to execute the following commands:

# Install EPEL package - EPEL stands for Extra package for Enterprise Linux
$ sudo rpm -ivh http://mirror.chpc.utah.edu/pub/epel/5/x86_64/epel-release-5-4.noarch.rpm

$ sudo yum install R
================================================================================
 Package                Arch           Version               Repository    Size
================================================================================
Updating:
 R                      x86_64         3.0.2-1.el6           epel          20 k
Updating for dependencies:
 R-core                 x86_64         3.0.2-1.el6           epel          46 M
 R-core-devel           x86_64         3.0.2-1.el6           epel          90 k
 R-devel                x86_64         3.0.2-1.el6           epel          19 k
 R-java                 x86_64         3.0.2-1.el6           epel          20 k
 R-java-devel           x86_64         3.0.2-1.el6           epel          20 k
 libRmath               x86_64         3.0.2-1.el6           epel         116 k
 libRmath-devel         x86_64         3.0.2-1.el6           epel          24 k

Transaction Summary
================================================================================
Upgrade       8 Package(s)

R comes with a command line interpreter but if you want to have a more convenient development environment, you may prefer to use RStudio. RStudio has a desktop version as well as a web browser based alternative called RStudio Server. They can be downloaded for free from RStudio website. We will use RStudio Server in this post.

To install RStudio Server, you need to execute the following command:

$ sudo yum install --nogpgcheck rstudio-server-0.97.551-x86_64.rpm

================================================================================
 Package           Arch   Version         Repository                       Size
================================================================================
Installing:
 rstudio-server    x86_64 0.97.551-1      /rstudio-server-0.97.551-x86_64  96 M
...

Transaction Summary
===================================================================
Install       3 Package(s)

To ensure that Impala ODBC driver will work and RODBC package can be installed within R (as it will be shown later on in this post), you also need to install unixODBC and unixODBC-devel packages:

$ sudo yum install unixODBC
$ sudo yum install unixODBC-devel

Finally you have to install Cloudera Impala ODBC driver. You can download it from Cloudera website, as of writing the post the latest version is 2.5 (the driver file name is ClouderaImpalaODBC-2.5.5.1005-1.el6.x86_64.rpm). To install Impala ODBC driver, you need to run the following command after downloading the driver:

$ yum --nogpgcheck localinstall ClouderaImpalaODBC-2.5.5.1005-1.el6.x86_64.rpm

Impala ODBC driver requires a couple of files configured properly (the driver package has templates files embedded that needs to be edited and copied to the correct directory). The two key configuration files are odbc.init and cloudera.impalaodbc.ini.

odbc.ini should look something like this:

[Impala]
# Description: DSN Description.
# This key is not necessary and is only to give a description of the data source.
Description=Cloudera ODBC Driver for Impala (64-bit) DSN

# Driver: The location where the ODBC driver is installed to.
Driver=/opt/cloudera/impalaodbc/lib/64/libclouderaimpalaodbc64.so

# Values for HOST, PORT, KrbFQDN, and KrbServiceName should be set here.
# They can also be specified on the connection string.
HOST=localhost
PORT=21050
Database=default

In cloudera.impalaodbc.ini configuration file we have the following settings:

# SimbaDN / unixODBC
ODBCInstLib=libodbcinst.so

In addition, we need to define the environment variables as follows:

$ export LD_LIBRARY_PATH=/usr/local/lib:/opt/cloudera/impalaodbc/lib/64
$ export ODBCINI=/etc/odbc.ini
$ export SIMBADN=/etc/cloudera.impalaodbc.ini

The final step is to install RODBC package for R. You can do it using R command line tool:

$ R
>install.packages("RODBC")

Analyzing Hadoop datasets with R and Impala

Now we are ready to start analyzing our Hadoop data set with R and Impala. We will demonstrate how they work together using stock price information. You can download e.g. Google stock prices from http://finance.yahoo.com (symbol: GOOG). Once you have the spreadsheet downloaded, you need to remove the first line (header) from the file and then load it into HDFS using Hadoop file system shell.

$ hadoop fs -mkdir /user/cloudera/stock
$ hadoop fs -put google.csv /user/cloudera/stock
$ hadoop fs -ls /user/cloudera/stock
Found 1 items
-rw-r--r--   3 cloudera cloudera     126379 2013-11-22 12:22 /user/cloudera/stock/google.csv

Now we can login to impala shell to create our table. Impala has a SQL-like query language so you can use the familiar CREATE TABLE command. The external clause indicates that the physical data files are managed outside Impala; even if you drop the table, the files will be kept in the HDFS directory.

After the table is created, we can run SHOW TABLES statement to verify if the table is accessible from Impala. We can also run a SELECT statement from impala-shell to display a couple of rows from the stock table.

$ impala-shell
[localhost.localdomain:21000] > create external table stock (stock_date string, stock_open float, stock_high float, stock_low float, stock_close_ float, stock_volume int, stock_adjclose float) row format delimited fields terminated by ',' lines terminated by '\n' location '/user/cloudera/stock/';
...
[localhost.localdomain:21000] > show tables;
Query: show tables

+-------+
| name  |
+-------+
| stock |
+-------+
Returned 1 row(s) in 0.01s
[localhost.localdomain:21000] > select * from stock limit 3;
...
+------------+-------------------+-------------------+-------------------+-------------------+--------------+-------------------+
| stock_date | stock_open        | stock_high        | stock_low         | stock_close_      | stock_volume | stock_adjclose    |
+------------+-------------------+-------------------+-------------------+-------------------+--------------+-------------------+
| 2013-11-21 | 1027              | 1038.31005859375  | 1026              | 1034.069946289062 | 1091800      | 1034.069946289062 |
| 2013-11-20 | 1029.949951171875 | 1033.359985351562 | 1020.359985351562 | 1022.309997558594 | 963700       | 1022.309997558594 |
| 2013-11-19 | 1031.719970703125 | 1034.75           | 1023.049987792969 | 1025.199951171875 | 1116400      | 1025.199951171875 |
+------------+-------------------+-------------------+-------------------+-------------------+--------------+-------------------+
Returned 3 row(s) in 0.37s

The next step is to start R command line interpreter. In order to run the same Impala SELECT statement from R, we need to execute the following commands from R:

$ R
> library("RODBC");
> conn <- odbcConnect("Impala")
> result <- sqlQuery(conn, "select * from stock limit 3")
> result
  stock_date stock_open stock_high stock_low stock_close_ stock_volume
1 2013-11-21    1027.00    1038.31   1026.00      1034.07      1091800
2 2013-11-20    1029.95    1033.36   1020.36      1022.31       963700
3 2013-11-19    1031.72    1034.75   1023.05      1025.20      1116400
  stock_adjclose
1        1034.07
2        1022.31
3        1025.20

As mentioned above, if you want to use a more convenient R development environment with various advanced features such as debugging, package management, file system navigation, etc. then RStudio is an excellent choice. It can be run as a desktop application or via a web browser if you have installed RStudio Server.  In case of the second option, RStudio can be accessed via http://hostname:8787 and you can login using your Linux username and password.

Before you use RStudio from your browser, you also need to set the following environment variables in .Renviron file n your home directory:

$ cat .Renviron 
LD_LIBRARY_PATH=/usr/local/lib:/opt/cloudera/impalaodbc/lib/64
ODBCINI=/etc/odbc.ini
SIMBAINI=/etc/cloudera.impalaodbc.ini

Now you can login to RStudio and execute the same R commands as we have shown from the command line interpreter, see the figure below.

Impala-R-2

You can also plot a graphical representation of your data set, as show below. The diagram illustrates a line chart for Google stock prices in 2013:

Impala-R-1

The actual R commands to generate this plot are as follows:

> library("RODBC");
> conn <- odbcConnect("Impala")
> result <- sqlQuery(conn, "select stock_date, stock_close from stock where stock_date > '2013' order by stock_date asc limit 300")
> result
    stock_date stock_close
1   2013-01-02      723.25
2   2013-01-03      723.67
3   2013-01-04      737.97
...

> plot(result$stock_close, lwd="1", xlab="Days", ylab="Price (USD)")
> lines(result$stock_close, lwd="2")
> axis(1, result$stock_date, labels=result$stock_date)

Conclusion

Cloudera Impala is an exciting new technology to provide real-time, interactive queries in Hadoop environment. It supports ODBC connectors and this makes it possible to integrate it with many popular BI tools and statistical software such as R. Together R and Impala provide an excellent combination for data analyst to process massive data sets efficiently and they can also support graphical representation of the result sets.

If you are interested to learn more about Impala, please, check out our book, Impala in Action at Manning Publishing.

5 thoughts on “Integrating R with Cloudera Impala for Real-Time Queries on Hadoop

  1. Hi and thanks for the tutorial, it’s very helpful !
    However, when I try it I get an error when connecting to Impala :
    > conn <- odbcConnect("Impala")
    Warning messages:
    1: In odbcDriverConnect("DSN=Impala") :
    [RODBC] ERROR: state HY000, code -1, message [unixODBC][DSI] The error message NoSQLGetPrivateProfileString could not be found in the en-US locale. Check that /en-US/ODBCMessages.xml exists.
    2: In odbcDriverConnect("DSN=Impala") : ODBC connection failed

    I double checked and I did all you said.
    I do get that I have a problem with the error message but the error itself seems to come from the name "Impala", altough I don't know where to set it or find the proper name to use here.

    • Hi,
      Do you have odbc.ini configured properly? That is where the “Impala” name comes from:

      [Impala]
      # Description: DSN Description.
      # This key is not necessary and is only to give a description of the data source.
      Description=Cloudera ODBC Driver for Impala (64-bit) DSN

      # Driver: The location where the ODBC driver is installed to.
      Driver=/opt/cloudera/impalaodbc/lib/64/libclouderaimpalaodbc64.so

      # Values for HOST, PORT, KrbFQDN, and KrbServiceName should be set here.
      # They can also be specified on the connection string.
      HOST=localhost
      PORT=21050
      Database=default

      And if you run R from command line, make sure that you have the correct environment variables (e.g. ODBCINI) set as described in the post. (export ODBCINI=/etc/odbc.ini)

Leave a reply to Praveen Sripati Cancel reply