NewSQL at Cloud Scale – Spring, Hibernate and Amazon Web Services with NuoDB


Introduction

In the previous articles, we presented the NuoDB NewSQL architecture and its key components, and showed how to scale it easily at the transaction and storage tiers. We also demonstrated JDBC and Hibernate with NuoDB. In this closing post of the three-article series, we demonstrate Spring and Hibernate with NuoDB at cloud scale, using AWS capabilities (AWS EC2 and CloudFormation).

Spring and Hibernate with NuoDB

Spring PetClinic is a sample application that used to be distributed with the Spring Framework. It is designed to show how the Spring application frameworks can be used to build simple but powerful database-oriented applications. This year it was refactored onto a new architecture, and the source code can be downloaded from GitHub. Out of the box, Spring PetClinic supports the HSQL and MySQL databases; in this post we port it to NuoDB.

To get the code we need to run:

$ git clone https://github.com/SpringSource/spring-petclinic.git

The application can be built and run with a single Maven command, but first we need to implement the NuoDB-related changes, all of which are about configuring the database and updating the Hibernate and Spring configurations. In other words, no application sources needed modification.

The scripts for the database layer can be found under the ~/spring/spring-petclinic/src/main/resources/db directory. We created a new nuodb directory there with two SQL scripts, initDB.sql and populateDB.sql.

$ cd ~/spring/spring-petclinic/src/main/resources/db
$ mkdir nuodb
$ vi initDB.sql  # edit initDB.sql
DROP TABLE vet_specialties IF EXISTS;
DROP TABLE vets IF EXISTS;
DROP TABLE specialties IF EXISTS;
DROP TABLE visits IF EXISTS;
DROP TABLE pets IF EXISTS;
DROP TABLE types IF EXISTS;
DROP TABLE owners IF EXISTS;

CREATE TABLE vets (
  id         INTEGER primary key  generated always as identity,
  first_name VARCHAR(30),
  last_name  VARCHAR(30)
);
CREATE INDEX vets_last_name ON vets (last_name);

CREATE TABLE specialties (
  id   INTEGER primary key  generated always as identity,
  name VARCHAR(80)
);
CREATE INDEX specialties_name ON specialties (name);

CREATE TABLE vet_specialties (
  vet_id       INTEGER NOT NULL,
  specialty_id INTEGER NOT NULL
);
ALTER TABLE vet_specialties ADD CONSTRAINT fk_vet_specialties_vets FOREIGN KEY (vet_id) REFERENCES vets (id);
ALTER TABLE vet_specialties ADD CONSTRAINT fk_vet_specialties_specialties FOREIGN KEY (specialty_id) REFERENCES specialties (id);

CREATE TABLE types (
  id   INTEGER primary key  generated always as identity,
  name VARCHAR(80)
);
CREATE INDEX types_name ON types (name);

CREATE TABLE owners (
  id         INTEGER primary key  generated always as identity,
  first_name VARCHAR(30),
  last_name  VARCHAR(30),
  address    VARCHAR(255),
  city       VARCHAR(80),
  telephone  VARCHAR(20)
);
CREATE INDEX owners_last_name ON owners (last_name);

CREATE TABLE pets (
  id         INTEGER primary key  generated always as identity,
  name       VARCHAR(30),
  birth_date DATE,
  type_id    INTEGER NOT NULL,
  owner_id   INTEGER NOT NULL
);
ALTER TABLE pets ADD CONSTRAINT fk_pets_owners FOREIGN KEY (owner_id) REFERENCES owners (id);
ALTER TABLE pets ADD CONSTRAINT fk_pets_types FOREIGN KEY (type_id) REFERENCES types (id);
CREATE INDEX pets_name ON pets (name);

CREATE TABLE visits (
  id          INTEGER primary key  generated always as identity,
  pet_id      INTEGER NOT NULL,
  visit_date  DATE,
  description VARCHAR(255)
);
ALTER TABLE visits ADD CONSTRAINT fk_visits_pets FOREIGN KEY (pet_id) REFERENCES pets (id);
CREATE INDEX visits_pet_id ON visits (pet_id);
$ vi populateDB.sql  # edit populateDB.sql
INSERT INTO vets VALUES (NULL, 'James', 'Carter');
INSERT INTO vets VALUES (NULL, 'Helen', 'Leary');
INSERT INTO vets VALUES (NULL, 'Linda', 'Douglas');
INSERT INTO vets VALUES (NULL, 'Rafael', 'Ortega');
INSERT INTO vets VALUES (NULL, 'Henry', 'Stevens');
INSERT INTO vets VALUES (NULL, 'Sharon', 'Jenkins');

INSERT INTO specialties VALUES (NULL, 'radiology');
INSERT INTO specialties VALUES (NULL, 'surgery');
INSERT INTO specialties VALUES (NULL, 'dentistry');

INSERT INTO vet_specialties VALUES (2, 1);
INSERT INTO vet_specialties VALUES (3, 2);
INSERT INTO vet_specialties VALUES (3, 3);
INSERT INTO vet_specialties VALUES (4, 2);
INSERT INTO vet_specialties VALUES (5, 1);

INSERT INTO types VALUES (NULL, 'cat');
INSERT INTO types VALUES (NULL, 'dog');
INSERT INTO types VALUES (NULL, 'lizard');
INSERT INTO types VALUES (NULL, 'snake');
INSERT INTO types VALUES (NULL, 'bird');
INSERT INTO types VALUES (NULL, 'hamster');

INSERT INTO owners VALUES (NULL, 'George', 'Franklin', '110 W. Liberty St.', 'Madison', '6085551023');
INSERT INTO owners VALUES (NULL, 'Betty', 'Davis', '638 Cardinal Ave.', 'Sun Prairie', '6085551749');
INSERT INTO owners VALUES (NULL, 'Eduardo', 'Rodriquez', '2693 Commerce St.', 'McFarland', '6085558763');
INSERT INTO owners VALUES (NULL, 'Harold', 'Davis', '563 Friendly St.', 'Windsor', '6085553198');
INSERT INTO owners VALUES (NULL, 'Peter', 'McTavish', '2387 S. Fair Way', 'Madison', '6085552765');
INSERT INTO owners VALUES (NULL, 'Jean', 'Coleman', '105 N. Lake St.', 'Monona', '6085552654');
INSERT INTO owners VALUES (NULL, 'Jeff', 'Black', '1450 Oak Blvd.', 'Monona', '6085555387');
INSERT INTO owners VALUES (NULL, 'Maria', 'Escobito', '345 Maple St.', 'Madison', '6085557683');
INSERT INTO owners VALUES (NULL, 'David', 'Schroeder', '2749 Blackhawk Trail', 'Madison', '6085559435');
INSERT INTO owners VALUES (NULL, 'Carlos', 'Estaban', '2335 Independence La.', 'Waunakee', '6085555487');

INSERT INTO pets VALUES (NULL, 'Leo', '2010-09-07', 1, 1);
INSERT INTO pets VALUES (NULL, 'Basil', '2012-08-06', 6, 2);
INSERT INTO pets VALUES (NULL, 'Rosy', '2011-04-17', 2, 3);
INSERT INTO pets VALUES (NULL, 'Jewel', '2010-03-07', 2, 3);
INSERT INTO pets VALUES (NULL, 'Iggy', '2010-11-30', 3, 4);
INSERT INTO pets VALUES (NULL, 'George', '2010-01-20', 4, 5);
INSERT INTO pets VALUES (NULL, 'Samantha', '2012-09-04', 1, 6);
INSERT INTO pets VALUES (NULL, 'Max', '2012-09-04', 1, 6);
INSERT INTO pets VALUES (NULL, 'Lucky', '2011-08-06', 5, 7);
INSERT INTO pets VALUES (NULL, 'Mulligan', '2007-02-24', 2, 8);
INSERT INTO pets VALUES (NULL, 'Freddy', '2010-03-09', 5, 9);
INSERT INTO pets VALUES (NULL, 'Lucky', '2010-06-24', 2, 10);
INSERT INTO pets VALUES (NULL, 'Sly', '2012-06-08', 1, 10);

INSERT INTO visits VALUES (NULL, 7, '2013-01-01', 'rabies shot');
INSERT INTO visits VALUES (NULL, 8, '2013-01-02', 'rabies shot');
INSERT INTO visits VALUES (NULL, 8, '2013-01-03', 'neutered');
INSERT INTO visits VALUES (NULL, 7, '2013-01-04', 'spayed');

Then we had to modify business-config.xml under the ~/spring/spring-petclinic/src/main/resources/spring directory to refer to the NuoDB Hibernate dialect via the hibernate.dialect property:

  # business-config.xml

   
        <bean id="entityManagerFactory" class="org.springframework.orm.jpa.LocalContainerEntityManagerFactoryBean"
              p:dataSource-ref="dataSource">
            <property name="jpaVendorAdapter">
                <bean class="org.springframework.orm.jpa.vendor.HibernateJpaVendorAdapter" />
            </property>
            <property name="jpaPropertyMap">
              <map>
                <entry key="hibernate.show_sql" value="${jpa.showSql}" />
                <entry key="hibernate.dialect" value="${hibernate.dialect}" />
                <entry key="hibernate.temp.use_jdbc_metadata_defaults" value="false" />
              </map>
           </property>
           <property name="persistenceUnitName" value="petclinic"/>
           <property name="packagesToScan" value="org.springframework.samples.petclinic"/>
        </bean>

In order to support Hibernate 4, we also set hibernate.temp.use_jdbc_metadata_defaults to false.

The database properties are configured in the data-access.properties file – this file contains the appropriate JDBC parameters (used in datasource-config.xml) as well as the Hibernate dialect. The sample below references a local NuoDB instance running on an Ubuntu virtual machine, and an AWS EC2 instance (commented out) that will be configured later in this article:

jdbc.driverClassName=com.nuodb.jdbc.Driver
jdbc.url=jdbc:com.nuodb://192.168.80.128/spring?schema=user
#jdbc.url=jdbc:com.nuodb://ec2-46-51-162-14.eu-west-1.compute.amazonaws.com/spring?schema=user
jdbc.username=spring
jdbc.password=spring

# Properties that control the population of schema and data for a new data source
jdbc.initLocation=classpath:db/nuodb/initDB.sql
jdbc.dataLocation=classpath:db/nuodb/populateDB.sql

# Property that determines which Hibernate dialect to use
# (only applied with "applicationContext-hibernate.xml")
hibernate.dialect=com.nuodb.hibernate.NuoDBDialect

# Property that determines which database to use with an AbstractJpaVendorAdapter
jpa.showSql=true

The datasource-config.xml file defines the data source bean using JDBC:

    
    <bean id="dataSource" class="org.apache.commons.dbcp.BasicDataSource" destroy-method="close"
          p:driverClassName="${jdbc.driverClassName}" p:url="${jdbc.url}"
          p:username="${jdbc.username}" p:password="${jdbc.password}"/>

The original pom.xml from GitHub refers to Hibernate 4, which NuoDB supports, so our pom.xml looked like this:

# pom.xml              
          <properties>
                
                <hibernate.version>4.1.0.Final</hibernate.version>
                
                <hibernate-validator.version>4.2.0.Final</hibernate-validator.version>
                <nuodb.version>1.1</nuodb.version>
           </properties>

           <dependencies>
                
                <dependency>
                        <groupId>com.nuodb</groupId>
                        <artifactId>nuodb-hibernate</artifactId>
                        <version>${nuodb.version}</version>
                </dependency>
                <dependency>
                        <groupId>com.nuodb</groupId>
                        <artifactId>nuodb-jdbc</artifactId>
                        <version>${nuodb.version}</version>
                </dependency>
           </dependencies>

Finally, we need to change web.xml in the ~/spring/spring-petclinic/src/main/webapp/WEB-INF directory to use jpa – the out-of-the-box configuration uses jdbc.

   <context-param>
        <param-name>spring.profiles.active</param-name>
        <param-value>jpa</param-value>
    </context-param>

Now we can run the Maven command, which compiles the code, initializes and populates the NuoDB tables, and then listens on port 9966 for requests.

 $ mvn tomcat7:run

and we can connect to Spring PetClinic at http://localhost:9966/petclinic/.

Spring-PetClinic-1

Spring PetClinic using NuoDB on AWS

Until now we have used a local NuoDB instance; it is time to migrate the database layer to the AWS cloud. To run NuoDB on AWS EC2 instances, one of the simplest options is to use AWS CloudFormation. CloudFormation templates are available on GitHub.

To download them, we can run

$ git clone https://github.com/nuodb/cloudformation.git

Then we need to open the AWS Management Console, go to CloudFormation, select Create Stack and upload the NuoDB template (NuoDB-1.1.template in our case). Then we can define the number of agents (in addition to the broker), the domain name, the domain admin username and password, and the EC2 instance type.

NuoDB-CF-1

Once we click Continue, the EC2 instances are created; the status will first be CREATE_IN_PROGRESS and then CREATE_COMPLETE.

As a result, we get three EC2 instances (since we selected two agents): one dedicated to the NuoDB broker and the other two to NuoDB agents. We also get three EBS volumes, one for each EC2 instance.

CAUTION: The NuoDB CloudFormation script uses an EC2 auto-scaling group which ensures that the number of agents you selected will always be running. As a result, if you just stop the EC2 instances at the end of your test, AWS will restart them. In order to stop everything properly and clean up your test environment, open the AWS Management Console, go to CloudFormation, select the NuoDB stack and click “Delete Stack.” This will terminate all instances that were started by the CloudFormation template. More on AWS auto-scaling and how to use the command-line tool can be found on the AWS website.
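
If you prefer the command line, the same cleanup can be sketched with the AWS CLI, assuming it is installed and configured; the stack name below is a placeholder for whatever name you gave your stack:

$ aws cloudformation delete-stack --stack-name NuoDB-test
$ aws cloudformation describe-stacks --stack-name NuoDB-test  # the stack status should switch to DELETE_IN_PROGRESS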

Now we can connect to the EC2 host that runs the NuoDB broker:

NuoDB-Console-1

Once we have logged in, we can start a database:

NuoDB-Console-2

In the next steps we can define the database name (spring), leave the "allow non-durable database" option unselected, and set the archive and journal directories for the storage manager running on one of the hosts (/home/ec2-user/nuodb/data and /home/ec2-user/nuodb/journal, respectively). Please note that the /home/ec2-user directory has to have the appropriate permissions to allow the creation of the data and journal directories (e.g. -rwxrwxrwx in our test).

NuoDB-Console-3

After that we can define the transaction engine running on the other EC2 host:

NuoDB-Console-4

Now we have three EC2 hosts: one dedicated to the NuoDB broker to serve client connections, one for the storage manager with filesystem storage, and one for the transaction engine.

Now, if we change the JDBC connection string in Spring PetClinic (remember, it is defined in data-access.properties), we can connect our application to NuoDB running on the AWS servers. The connection is possible because the AWS security group allows connections from any server.

$ vi data-access.properties
#jdbc.url=jdbc:com.nuodb://192.168.80.128/spring?schema=user
jdbc.url=jdbc:com.nuodb://ec2-46-51-162-14.eu-west-1.compute.amazonaws.com/spring?schema=user

We can then rerun the Spring PetClinic command, this time with the database in the AWS cloud:

$ mvn tomcat7:run

Once the Tomcat server is up and running, we can go to http://ec2-46-51-162-14.eu-west-1.compute.amazonaws.com:9966/petclinic/ and search for vets, add new owners, search for them, etc.

PetClinic-3

PetClinic-1

PetClinic-2

Scale out and Resilience

Scale-out is a method of adding computing resources by adding more computers to the system, rather than increasing the resources of the existing computers. Resilience is the ability to provide and maintain an acceptable level of service in the presence of faults.

If we want to scale out our database and also make it resilient, we can simply go back to the NuoDB console and add a new process using the Add Process menu. For instance, we can add a transaction engine to the EC2 server that was originally running the storage manager, and we can also add a storage manager to the EC2 server previously running only a transaction engine. This way, we have two EC2 servers, each running one instance of the transaction engine and one of the storage manager.

Scaling out the database is a seamless process for the Spring PetClinic application; we just start up another EC2 server with an agent and add a transaction engine or a storage manager.

As can be seen in the NuoDB Performance Report, NuoDB scales almost linearly. The diagram below shows how the number of transactions per second (TPS) increases as new nodes are added:

NuoDBPerf

Conclusion

As we have seen in this series, NuoDB combines standard SQL and ACID properties with elastic scalability, which makes it well suited to be a robust cloud data management system for the 21st century. Its unique architectural approach provides high-performance reads and writes and geo-distributed 24/7 operations with built-in resilience. Moreover, applications using well-known frameworks such as Spring and Hibernate can be easily ported to or developed with NuoDB as a NewSQL database that meets cloud-scale demands.


Large Scale Data Analytics with XtremeData Parallel SQL Database Engine


Introduction

A few months ago I posted an article about Amazon Web Services Redshift – Amazon’s data warehouse solution in the cloud. XtremeData dbX is a similar massively parallel SQL database engine that can run on-premises as well as in the cloud. It is a purpose-built, high-performance and scalable solution for data warehouse applications and large-scale analytics, using innovative technologies such as a vector execution model, dynamic data distribution and automatic load balancing.

XtremeData dbX Architecture

XtremeData dbX is a fully SQL- and ACID-compliant database engine based on shared-nothing, massively parallel query execution. The underlying technology relies on PostgreSQL (similarly to AWS Redshift). In essence, the key architecture components are the head node and multiple data nodes – a fairly common pattern in massively parallel execution scenarios.

The head node manages the client connections, parses and plans the queries and sends the result back to the clients. The data nodes manage data storage and execute queries.

dbx-query-exec

XtremeData dbX in AWS cloud

Besides the on-premises deployment option, XtremeData dbX is available in the AWS cloud. We need to go to AWS Marketplace to register for it. There are different editions: one for dbX Head and another for dbX Data nodes.

dbx-awsmarketplace

XtremeData provides a document describing how to set up dbX head and data nodes in the AWS cloud, but I decided to implement AWS CloudFormation templates to make the deployment easier. Since there are two different AMIs for head and data nodes, I created two templates.

The AWS CloudFormation dbX-Head template:

{
  "AWSTemplateFormatVersion": "2010-09-09",
  "Description": "XtremeData dbX-Head CloudFormation",
  "Parameters" : {
    "ClusterName" : {
      "Description" : "Name of the XtremeData Cluster",
      "Type" : "String",
      "MinLength": "1",
      "MaxLength": "64",
      "AllowedPattern" : "[-_ a-zA-Z0-9]*",
      "ConstraintDescription" : "can contain only alphanumeric characters, spaces, dashes and underscores."
    },
    "KeyName" : {
      "Description" : "Name of an existing EC2 KeyPair to enable SSH access to the instances",
      "Type" : "String",
      "MinLength": "1",
      "MaxLength": "64",
      "AllowedPattern" : "[-_ a-zA-Z0-9]*",
      "ConstraintDescription" : "can contain only alphanumeric characters, spaces, dashes and underscores."
    },
    "InstanceType" : {
        "Description" : "XtremeData dbX-Head EC2 instance type",
        "Type" : "String",
        "Default" : "m1.medium",
        "AllowedValues" : [ "m1.medium","m1.large","m1.xlarge","m2.xlarge","m2.2xlarge","m2.4xlarge", "c1.xlarge","hi1.4xlarge"],
        "ConstraintDescription" : "must be a valid EC2 instance type."
      }
  },
  "Mappings" : {
    "AWSRegion2AMI" : {
      "us-east-1"      : {"AMI" : "ami-4c2f5d25"},
      "us-west-2"      : {"AMI" : "ami-c127b7f1"},
      "us-west-1"      : {"AMI" : "ami-84cde4c1"}
    }
  },
  "Resources": {
    "XtremeDataSecurityGroup": {
      "Type": "AWS::EC2::SecurityGroup",
      "Properties": {
        "GroupDescription": "Enable XTremedata Access",
        "SecurityGroupIngress": [
          {
            "IpProtocol": "tcp",
            "FromPort": "22",
            "ToPort": "22",
            "CidrIp": "0.0.0.0/0"
          }
        ]
      }
    },
	"XtremeDataSecurityGroupIngress" : {
	   "Type": "AWS::EC2::SecurityGroupIngress",
	   "Properties": {
            "GroupName": { "Ref": "XtremeDataSecurityGroup" },
            "IpProtocol": "tcp",
            "FromPort": "0",
            "ToPort": "65535",
            "SourceSecurityGroupName": { "Ref": "XtremeDataSecurityGroup" }
       }
	},
    "XtremeDataHeadInstance": {
      "Type": "AWS::EC2::Instance",
      "Properties": {
		"UserData" : { "Fn::Base64" : { "Ref" : "ClusterName" }},
        "SecurityGroups": [ { "Ref": "XtremeDataSecurityGroup" } ],
		"ImageId" : { "Fn::FindInMap" : [ "AWSRegion2AMI", { "Ref" : "AWS::Region" }, "AMI" ]},
        "InstanceType": {"Ref" : "InstanceType"},
        "KeyName": { "Ref" : "KeyName" },
		"BlockDeviceMappings" : [
               {
                  "DeviceName" : "/dev/sdf",
                  "Ebs" : {
                     "VolumeType" : "standard",
                     "DeleteOnTermination" : "true",
                     "VolumeSize" : "100"
                  }
               },
               {
                  "DeviceName" : "/dev/sdg",
                  "Ebs" : {
                     "VolumeType" : "standard",
                     "DeleteOnTermination" : "true",
                     "VolumeSize" : "100"
                  }
               },
			   {
                  "DeviceName" : "/dev/sdh",
                  "Ebs" : {
                     "VolumeType" : "standard",
                     "DeleteOnTermination" : "true",
                     "VolumeSize" : "100"
                  }
               },
               {
                  "DeviceName" : "/dev/sdi",
                  "Ebs" : {
                     "VolumeType" : "standard",
                     "DeleteOnTermination" : "true",
                     "VolumeSize" : "100"
                  }
               }
            ],
        "Tags" : [
          { "Key" : "Name", "Value" : "XtremeData dxX-Head"}
        ]
      }
    }
  }
}

This will create the security group for dbX node communications, define the user data (the cluster name) and allocate four EBS volumes (100 GB each) to the dbX Head node.
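
For reference, the same stack can also be created from the command line. This is only a sketch assuming the AWS CLI is available; the stack name, template file name and parameter values are placeholders, while the parameter keys (ClusterName, KeyName, InstanceType) come from the template above:

$ aws cloudformation create-stack \
    --stack-name dbx-head \
    --template-body file://dbx-head.template \
    --parameters ParameterKey=ClusterName,ParameterValue=dbxtest \
                 ParameterKey=KeyName,ParameterValue=my-keypair \
                 ParameterKey=InstanceType,ParameterValue=m1.medium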

The AWS CloudFormation dbX-Data template looks as follows:

{
  "AWSTemplateFormatVersion": "2010-09-09",
  "Description": "XtremeData dbX-Data CloudFormation",
  "Parameters" : {
    "ClusterNameAnddbXHead" : {
      "Description" : "Name of the XtremeData Cluster and the dbX-Head EC2 hostname concatenated with semi-colon",
      "Type" : "String",
      "MinLength": "1",
      "MaxLength": "64"
    },
	"XtremeSecurityGroup" : {
      "Description" : "Name of the XtremeData SecurityGroup",
      "Type" : "String",
      "MinLength": "1",
      "MaxLength": "64",
      "AllowedPattern" : "[-_ a-zA-Z0-9]*",
      "ConstraintDescription" : "can contain only alphanumeric characters, spaces, dashes and underscores."
    },
	"NumberOfDataNodes" : {
      "Description" : "Max. Number of Xtreme Data Instances to start (in addition to the Head Instance)",
      "Type" : "Number",
      "Default" : "2"
    },
    "HeadAvailZone" : {
      "Description" : "Name of the Availability Node where dbX-Head instance is running",
      "Type" : "String",
	  "MinLength": "1",
      "MaxLength": "64",
      "AllowedPattern" : "[-_ a-zA-Z0-9]*",
      "ConstraintDescription" : "can contain only alphanumeric characters, spaces, dashes and underscores."
    }, 
    "KeyName" : {
      "Description" : "Name of an existing EC2 KeyPair to enable SSH access to the instances",
      "Type" : "String",
      "MinLength": "1",
      "MaxLength": "64",
      "AllowedPattern" : "[-_ a-zA-Z0-9]*",
      "ConstraintDescription" : "can contain only alphanumeric characters, spaces, dashes and underscores."
    },
    "InstanceType" : {
        "Description" : "XtremeData dbX-Data EC2 instance type",
        "Type" : "String",
        "Default" : "m1.large",
        "AllowedValues" : [ "m1.large","m1.xlarge","m2.xlarge","m2.2xlarge","m2.4xlarge", "c1.xlarge","hi1.4xlarge"],
        "ConstraintDescription" : "must be a valid EC2 instance type."
      }
  },
  "Mappings" : {
    "AWSRegion2AMI" : {
      "us-east-1"      : {"AMI" : "ami-442c5e2d"},
      "us-west-2"      : {"AMI" : "ami-1d28b82d"},
      "us-west-1"      : {"AMI" : "ami-66cce523"}
    }
  },
  "Resources": {
    "XtremeDataNodeGroup" : {
      "Type" : "AWS::AutoScaling::AutoScalingGroup",
      "Properties" : {
        "AvailabilityZones" : [ { "Ref" : "HeadAvailZone" } ],
        "LaunchConfigurationName" : { "Ref" : "DataNodeLaunchConfig" },
        "MinSize" : "0",
        "MaxSize" : {"Ref" : "NumberOfDataNodes"},
		"DesiredCapacity" : {"Ref" : "NumberOfDataNodes"},
        "Tags" : [
          { "Key" : "Name", "Value" : "XtremeData dbX-Data", "PropagateAtLaunch" : "true" }
        ]
      }
    },

	"DataNodeLaunchConfig" : {
       "Type" : "AWS::AutoScaling::LaunchConfiguration",
	   "Properties": {
	      "UserData" : { "Fn::Base64" : { "Ref" : "ClusterNameAnddbXHead" }},
          "KeyName" : { "Ref" : "KeyName" },
          "SecurityGroups" : [ { "Ref" : "XtremeSecurityGroup" } ],
          "InstanceType" : { "Ref" : "InstanceType" },
		  "ImageId" : { "Fn::FindInMap" : [ "AWSRegion2AMI", { "Ref" : "AWS::Region" }, "AMI" ]},
		  "BlockDeviceMappings" : [
               {
                  "DeviceName" : "/dev/sdf",
                  "Ebs" : {
                     "VolumeType" : "standard",
                     "DeleteOnTermination" : "true",
                     "VolumeSize" : "100"
                  }
               },
               {
                  "DeviceName" : "/dev/sdg",
                  "Ebs" : {
                     "VolumeType" : "standard",
                     "DeleteOnTermination" : "true",
                     "VolumeSize" : "100"
                  }
               },
			   {
                  "DeviceName" : "/dev/sdh",
                  "Ebs" : {
                     "VolumeType" : "standard",
                     "DeleteOnTermination" : "true",
                     "VolumeSize" : "100"
                  }
               },
               {
                  "DeviceName" : "/dev/sdi",
                  "Ebs" : {
                     "VolumeType" : "standard",
                     "DeleteOnTermination" : "true",
                     "VolumeSize" : "100"
                  }
               }
          ]
		}
    }
  }
}

This template uses the cluster name and the dbX-Head EC2 hostname concatenated with a semicolon as user data; it also uses the availability zone of the dbX Head node and the security group that we just defined for it. In addition, it allows us to define the number of data nodes, which is used to configure an auto-scaling group ensuring that the requested number of data instances is running.

CAUTION: Please note that it is not enough just to terminate the nodes if auto-scaling is defined, since AWS will launch new ones. We need to terminate the instances using the as-terminate-instance-in-auto-scaling-group command and then delete the auto-scaling configuration with as-delete-auto-scaling-group and as-delete-launch-config, as sketched below. More information about AWS auto-scaling can be found in the AWS documentation.
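
A hedged sketch of that cleanup sequence with the legacy Auto Scaling command-line tools named above; the instance ID and the group and launch configuration names (which CloudFormation generates) are placeholders:

$ as-terminate-instance-in-auto-scaling-group i-0123abcd --decrement-desired-capacity
$ as-delete-auto-scaling-group dbx-data-node-group
$ as-delete-launch-config dbx-data-launch-config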

To start an XtremeData dbX Head node, we need to go to the CloudFormation section of the AWS console and select Create Stack:

dbx-2

Once we submit the configuration, the status will become CREATE_IN_PROGRESS and then after a while CREATE_COMPLETE.

Then we can create another stack for the data nodes using the second CloudFormation template:

dbx-8

After a while we should see one head node and the requested number of data nodes running in the AWS EC2 console:

dbx-10

Creating a database in XtremeData dbX

Once we have fired up the nodes, we can log in to the head node as ec2-user using ssh (or e.g. PuTTY on Windows), as illustrated below.
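
For example (the key file name and the public DNS name below are placeholders):

$ ssh -i my-keypair.pem ec2-user@ec2-xx-xx-xx-xx.compute-1.amazonaws.com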

First we need to initialize the cluster (please note that this is a destructive operation; no data will be preserved), then start the cluster and the dbX database engine:

[ec2-user@ip-10-224-122-90 home]$ cluster_init -i 2
init started
examining nodes
initializing head ebs disks: xvdf xvdg xvdh xvdi
mdadm: Defaulting to version 1.2 metadata
mdadm: array /dev/md0 started.
mkfs.xfs: Specified data stripe unit 512 is not the same as the volume stripe unit 128
initializing node ebs disks: xvdf xvdg xvdh xvdi
clearing head ephemeral disks
clearing nodes ephemeral disks
cluster_init done

[ec2-user@ip-10-224-122-90 home]$ cluster_start
initializing head ephemeral disks
  head ephemeral disks:  xvdb
    raid-ing
mdadm: /dev/xvdb appears to contain an ext2fs file system
    size=419395584K  mtime=Sat Jul  6 21:52:00 2013
mdadm: Defaulting to version 1.2 metadata
mdadm: array /dev/md1 started.
mkfs.xfs: Specified data stripe unit 512 is not the same as the volume stripe unit 128
mkfs.xfs: Specified data stripe width 512 is not the same as the volume stripe width 128
examining nodes
assembling node EBSs: xvdf xvdg xvdh xvdi
initializing node tmp disks: xvdb xvdc
  raid-ing
cluster_start done

[ec2-user@ip-10-224-122-90 home]$ dbx_start
new dbx installation initialized
Initializing kernel nfsd:                                  [  OK  ]
Starting NFS services:                                     [  OK  ]
Starting NFS mountd:                                       [  OK  ]
Stopping RPC idmapd:                                       [  OK  ]
Starting RPC idmapd:                                       [  OK  ]
Starting NFS daemon:                                       [  OK  ]
rpc.svcgssd is stopped
rpc.mountd (pid 3402) is running...
nfsd (pid 3467 3466 3465 3464 3463 3462 3461 3460) is running...
Generating binary repository file ...
file 'xdapp.conf2.txt.conf0' generated.
starting dbx nodes
Starting xdu daemon...                                     [  OK  ]
dbx startup done
[ec2-user@ip-10-224-122-90 home]$
[ec2-user@ip-10-224-122-90 home]$

Now we can log in as the dbxdba user (again with ssh or PuTTY, since the key is copied over from ec2-user). Then we need to change the dbxdba password (using the passwd Linux command) in order to be able to log in using the dbX tools.

Now we are ready to create a server and, once the server is created, a database. The simplest way to manage the database and run SQL queries is to use a command-line tool called xdudb:

$ xdudb create aws_server 32145
||==================||

[dbxdba@ip-10-224-122-90 ~]$ xdudb list
||==================||
aws_server
[dbxdba@ip-10-224-122-90 ~]$ xdudb status
||==================||
ip-10-224-122-90 configuration: 1 head node and 2 data nodes
running dbx services          : 1 head node and 2 data nodes

[dbxdba@ip-10-224-122-90 ~]$ xdudb start aws_server
||==================||

[dbxdba@ip-10-224-122-90 ~]$ xdudb info aws_server
||==================||
Name        : aws_server
NodeSet     : Default_NS
Owner       : dbxdba
Port        : 32145
H|0   |head    |10.224.122.90  |head |/volumes/data/dbstore/dbxdba/aws_server/head
D|1   |node00  |10.154.137.31  |n0   |/hd/dbxdba/aws_server/n0
D|2   |node01  |10.165.7.225   |n1   |/hd/dbxdba/aws_server/n1

[dbxdba@ip-10-224-122-90 ~]$ xdudb dbcreate aws_server aws_db
CREATE DATABASE

Now we can create a table, insert some rows and run select statements.

[dbxdba@ip-10-224-122-90 ~]$ xdudb sql aws_server aws_db
Welcome to dbX psql 3.2.5, the interactive SQL terminal.

Type:  \copyright for distribution terms
       \h for help with SQL commands
       \? for help with psql commands
       \g or terminate with semicolon to execute query
       \q to quit

aws_db=# CREATE TABLE products (
aws_db(# product_no INTEGER,
aws_db(# name VARCHAR(30),
aws_db(# price NUMERIC(10,2)
aws_db(# );
CREATE TABLE
aws_db=# INSERT INTO products VALUES(1, 'Product1', 100.00);
INSERT 0 1
aws_db=# SELECT * FROM products;
 product_no |   name   | price
------------+----------+--------
          1 | Product1 | 100.00
(1 row, Query Total: 1)

With the help of the xdudb command-line tool we can check the dbX version, disk usage, mounted filesystems, etc.

[dbxdba@ip-10-224-122-90 ~]$ xdudb version
ip-10-224-122-90        2013-07-07 00:03:10
  DB           : dbX SQL 3.2.5
               : fpga.run r4347
  command line : xdudb 1.4
  GUI          : xdadm 3.0.2
  daemon       : xdutils 4.5.8
  OS           : Linux 3.4.43-43.43.amzn1.x86_64

[dbxdba@ip-10-224-122-90 ~]$ xdudb  du aws_server
||==================||
H|0   |head    |10.224.122.90  |head |60M  |/volumes/data/dbstore/dbxdba/aws_server/head
D|1   |node00  |10.154.137.31  |n0   |58M  |/hd/dbxdba/aws_server/n0
D|2   |node01  |10.165.7.225   |n1   |58M  |/hd/dbxdba/aws

[dbxdba@ip-10-224-122-90 ~]$ xdudb  df
head : 10.224.122.90
Filesystem            Size  Used Avail Use% Mounted on
/dev/md0              400G   95M  400G   1% /volumes/data
node00 : 10.154.137.31
Filesystem            Size  Used Avail Use% Mounted on
/dev/md0              400G   93M  400G   1% /volumes/data
node01 : 10.165.7.225
Filesystem            Size  Used Avail Use% Mounted on
/dev/md0              400G   93M  400G   1% /volumes/data

xdAdm web-based management GUI

In addition to the command-line tool that we described in the AWS cloud deployment scenario, there is also a web-based GUI to manage the appliance – it is based on Apache Tomcat. We just go to the http://dbx-headnode:8080/xdadm URL and log in. Then we can manage the servers and databases, check nodesets and monitor the system:

dbx-15

XtremeData dbX virtual machine

If you just want to try out the dbX database engine, you can download a virtual machine from the XtremeData website. The instance can be run using, for example, VMware Player. Once logged in to the virtual machine as dbxdba, you can start up a server:

$ xdudb start dbxtest

Amazon Web Services Redshift – Data Warehouse in the Cloud


Introduction

Amazon Web Services made its fully managed, petabyte-scale data warehouse cloud service publicly available in February 2013. It promises a high-performance, secure, easily scalable data warehouse solution that costs one tenth of a traditional data warehouse (less than 1,000 USD/TB/year, according to the AWS Introduction to Redshift presentation: http://aws.amazon.com/redshift/), is compatible with traditional BI tools, and can be up and running within minutes. As of writing this article the service is available in the US East region only, but it is supposed to be rolled out to other regions, too. The service is manageable via the regular AWS tools: the AWS Management Console, command-line tools (aws commands based on Python) and an API based on HTTP requests/responses.

Under the hood

Under the hood, AWS Redshift is based on PostgreSQL 8.0.2. The architecture consists of one leader node – responsible for managing the communication with the clients, developing the execution plan and distributing the compiled code to the compute nodes – and one or more compute nodes that execute the code and send the results back to the leader node for aggregation. A compute node is either a 2-core, 15 GB RAM, 2 TB storage node (dubbed XL) or a 16-core, 120 GB RAM, 16 TB storage node (dubbed 8XL). More details about the Redshift architecture can be found at http://docs.aws.amazon.com/redshift/latest/dg/c_internal_arch_system_operation.html

AWS-Redshift-Arch

Launching a cluster

The easiest way to launch a cluster is via AWS console.

We need to define the basic attributes like cluster identifier, database name, database port, master username and password:

AWS-RedShift1

Then we need to select the node type (XL or 8XL) and the number of compute nodes. A cluster can be single-node or multi-node; the minimum configuration is a single XL node, while the maximum is sixteen 8XL nodes – you can do the math in terms of cores, memory and storage (see the quick calculation below).
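
As a quick, illustrative calculation for the maximum configuration, based on the per-node XL/8XL figures quoted earlier:

$ echo "cores: $((16 * 16)), RAM: $((16 * 120)) GB, storage: $((16 * 16)) TB"
cores: 256, RAM: 1920 GB, storage: 256 TB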

AWS-Redshift2

Then we can configure additional parameters (like database encryption or security groups):

AWS-Redshift4

We can then review the configuration and are ready to launch the service:

AWS-RedShift5

The status will first be "creating" for a while, then it will become "available". This is when the JDBC URL becomes known and can be used for configuring the clients.

AWS-RedShift8

In order to make the service accessible, we need to configure the security options: either a security group, if Redshift is going to be accessed from EC2, or a CIDR/IP (Classless Inter-Domain Routing IP range), if Redshift is to be accessed from the public Internet. The system will automatically recognize the IP address of the client connected to the AWS console.

AWS-RedShift6

And that is it! From then on the client can connect to the Redshift cluster.

We used SQLWorkbench to test the service, as suggested by the AWS Redshift documentation. It is a Java-based open-source SQL tool. The connection parameters are the standard JDBC attributes:

AWS-Redshift9

The PostgreSQL version can be checked using:

select version();

version
PostgreSQL 8.0.2 on i686-pc-linux-gnu, compiled by GCC gcc (GCC) 3.4.2 20041017 (Red Hat 3.4.2-6.fc3), Redshift 1.0.546

We tested the service with Amazon stock prices downloaded from Yahoo Finance.

The content has been uploaded to an S3 bucket called stockprice (s3://stockprice/amzn.csv). We had to make it accessible for everyone to read (open/download), as illustrated below.
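
One hedged way to do the upload and set the permission from the command line, assuming the AWS CLI is configured (at the time this could equally be done through the S3 console):

$ aws s3 cp amzn.csv s3://stockprice/amzn.csv --acl public-read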

Then we could create the appropriate table using standard SQL command:

CREATE TABLE stockprice (
     stockdate date not null,
     stockopen decimal(6,2),
     stockhigh decimal(6,2),
     stocklow decimal(6,2),
     stockclose decimal(6,2),
     stockvolume integer,
     stockadjclose decimal(6,2)
     );

Table 'stockprice' created

Execution time: 0.15s

desc stockprice

COLUMN_NAME     DATA_TYPE      PK   NULLABLE   DEFAULT   AUTOINCREMENT   REMARKS   POSITION
stockdate       date           NO   YES                  NO                        1
stockopen       numeric(6,2)   NO   YES                  NO                        2
stockhigh       numeric(6,2)   NO   YES                  NO                        3
stocklow        numeric(6,2)   NO   YES                  NO                        4
stockclose      numeric(6,2)   NO   YES                  NO                        5
stockvolume     integer        NO   YES                  NO                        6
stockadjclose   numeric(6,2)   NO   YES                  NO                        7

To load the data into the stockprice table, we had to use the copy command with the S3 source file (it could also be an Amazon DynamoDB source).

copy stockprice from 's3://stockprices/amzn.csv' CREDENTIALS 'aws_access_key_id=XXXXXXX;aws_secret_access_key=XXXXXXX' delimiter ',';

If there is any error during the load operation (e.g. incorrect data format), it can be inspected by running a select statement on the stl_load_errors table.

And then we can run our SQL statements to analyse the data.

select * from stockprice order by stockadjclose desc limit 100;

stockdate	stockopen	stockhigh	stocklow	stockclose	stockvolume	stockadjclose
2013-01-25	275.00	284.72	274.40	283.99	4968100	283.99
2013-01-28	283.78	284.48	274.40	276.04	4321400	276.04
2013-03-05	274.00	276.68	269.99	275.59	3686200	275.59
2013-03-13	275.24	276.50	272.64	275.10	1884200	275.10
2013-03-08	275.00	275.44	271.50	274.19	1879800	274.19
2013-03-12	271.00	277.40	270.36	274.13	3246200	274.13
2013-03-07	274.10	274.80	271.85	273.88	1939000	273.88
2013-03-06	275.76	276.49	271.83	273.79	2050700	273.79
2013-01-24	269.37	276.65	269.37	273.46	3417000	273.46
2013-03-04	265.36	273.30	264.14	273.11	3453000	273.11
2013-01-30	283.00	284.20	267.11	272.76	13075400	272.76
2013-01-14	268.00	274.26	267.54	272.73	4275000	272.73
2013-01-18	270.83	274.50	269.60	272.12	2942000	272.12
2013-01-15	270.68	272.73	269.30	271.90	2326900	271.90
2013-03-11	273.43	273.99	270.40	271.24	1904900	271.24

The AWS console supports various management functions for the cluster: we can reboot it, modify parameters, and resize it by selecting a different node type (XL -> 8XL) or by increasing/decreasing the number of nodes. We can also delete the cluster via the AWS console. Equivalent operations are available from the command line, as sketched below.
AWS-Redshift11
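
For completeness, here are hedged examples of the same operations with the aws command-line tools mentioned in the introduction; the cluster identifier and the node count are placeholders:

$ aws redshift reboot-cluster --cluster-identifier my-redshift-cluster
$ aws redshift modify-cluster --cluster-identifier my-redshift-cluster --number-of-nodes 4
$ aws redshift delete-cluster --cluster-identifier my-redshift-cluster --skip-final-cluster-snapshot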

Conclusion

Amazon Web Services Redshift is another big step to make cloud services available for enterprise computing. It offers a data warehouse capability with minimal effort to start up and scale as operations demand. It is a great complement to other database services such as DynamoDB for NoSQL requirements and RDS for relational database services.