Large Scale Data Analytics with XtremeData Parallel SQL Database Engine


Introduction

A few months ago I have posted an article about Amazon Web Services Redshift – Amazon’s Data Warehouse solution in the cloud. XtremeData dbX is a similar massive parallel SQL database engine that can be run on-premise as well as in the cloud. It is a purpose-built high-performance and scalable solution for data warehouse applications and large scale analytics using innovative technologies such as vector execution model, dynamic data distribution and automatic load balancing.

XtremeData dbX Architecture

XtremeData dbX is a full SQL and ACID compliant database engine based on shared-nothing, massive parallel query execution. The underlying technology relies on PostgreSQL (similarly to AWS Redshift). In essence, the key architecture components are the head node and multiple data nodes – this is a fairly common patterns in massive parallel execution scenarios.

The head node manages the client connections, parses and plans the queries and sends the result back to the clients. The data nodes manage data storage and execute queries.

dbx-query-exec

XtremeData dbX in AWS cloud

Besides the on-premise deployment option, XtremeData dbX is available in AWS cloud. We need to go to AWS Marketplace to register for it.There are different editions; one for dbX Head and another for dbX Data nodes.

dbx-awsmarketplace

XtremeData provides a document how to setup dbX head and data nodes in the AWS cloud but I decided to implement AWS CloudFormation templates to make the deployment easier. Since there are two different AMIs for head and data nodes, I created two templates.

The AWS CloudFormation dbX-Head template:

{
  "AWSTemplateFormatVersion": "2010-09-09",
  "Description": "XtremeData dbX-Head CloudFormation",
  "Parameters" : {
    "ClusterName" : {
      "Description" : "Name of the XtremeData Cluster",
      "Type" : "String",
      "MinLength": "1",
      "MaxLength": "64",
      "AllowedPattern" : "[-_ a-zA-Z0-9]*",
      "ConstraintDescription" : "can contain only alphanumeric characters, spaces, dashes and underscores."
    },
    "KeyName" : {
      "Description" : "Name of an existing EC2 KeyPair to enable SSH access to the instances",
      "Type" : "String",
      "MinLength": "1",
      "MaxLength": "64",
      "AllowedPattern" : "[-_ a-zA-Z0-9]*",
      "ConstraintDescription" : "can contain only alphanumeric characters, spaces, dashes and underscores."
    },
    "InstanceType" : {
        "Description" : "XtremeData dbX-Head EC2 instance type",
        "Type" : "String",
        "Default" : "m1.medium",
        "AllowedValues" : [ "m1.medium","m1.large","m1.xlarge","m2.xlarge","m2.2xlarge","m2.4xlarge", "c1.xlarge","hi1.4xlarge"],
        "ConstraintDescription" : "must be a valid EC2 instance type."
      }
  },
  "Mappings" : {
    "AWSRegion2AMI" : {
      "us-east-1"      : {"AMI" : "ami-4c2f5d25"},
      "us-west-2"      : {"AMI" : "ami-c127b7f1"},
      "us-west-1"      : {"AMI" : "ami-84cde4c1"}
    }
  },
  "Resources": {
    "XtremeDataSecurityGroup": {
      "Type": "AWS::EC2::SecurityGroup",
      "Properties": {
        "GroupDescription": "Enable XTremedata Access",
        "SecurityGroupIngress": [
          {
            "IpProtocol": "tcp",
            "FromPort": "22",
            "ToPort": "22",
            "CidrIp": "0.0.0.0/0"
          }
        ]
      }
    },
	"XtremeDataSecurityGroupIngress" : {
	   "Type": "AWS::EC2::SecurityGroupIngress",
	   "Properties": {
            "GroupName": { "Ref": "XtremeDataSecurityGroup" },
            "IpProtocol": "tcp",
            "FromPort": "0",
            "ToPort": "65535",
            "SourceSecurityGroupName": { "Ref": "XtremeDataSecurityGroup" }
       }
	},
    "XtremeDataHeadInstance": {
      "Type": "AWS::EC2::Instance",
      "Properties": {
		"UserData" : { "Fn::Base64" : { "Ref" : "ClusterName" }},
        "SecurityGroups": [ { "Ref": "XtremeDataSecurityGroup" } ],
		"ImageId" : { "Fn::FindInMap" : [ "AWSRegion2AMI", { "Ref" : "AWS::Region" }, "AMI" ]},
        "InstanceType": {"Ref" : "InstanceType"},
        "KeyName": { "Ref" : "KeyName" },
		"BlockDeviceMappings" : [
               {
                  "DeviceName" : "/dev/sdf",
                  "Ebs" : {
                     "VolumeType" : "standard",
                     "DeleteOnTermination" : "true",
                     "VolumeSize" : "100"
                  }
               },
               {
                  "DeviceName" : "/dev/sdg",
                  "Ebs" : {
                     "VolumeType" : "standard",
                     "DeleteOnTermination" : "true",
                     "VolumeSize" : "100"
                  }
               },
			   {
                  "DeviceName" : "/dev/sdh",
                  "Ebs" : {
                     "VolumeType" : "standard",
                     "DeleteOnTermination" : "true",
                     "VolumeSize" : "100"
                  }
               },
               {
                  "DeviceName" : "/dev/sdi",
                  "Ebs" : {
                     "VolumeType" : "standard",
                     "DeleteOnTermination" : "true",
                     "VolumeSize" : "100"
                  }
               }
            ],
        "Tags" : [
          { "Key" : "Name", "Value" : "XtremeData dxX-Head"}
        ]
      }
    }
  }
}

This will create the security group for dbX node communications, define userdata (cluster name) and allocate 4 EBS volumes (100 GB each) to dbX Head node.

The AWS CloudFormation dbX-Data template looks as follows:

{
  "AWSTemplateFormatVersion": "2010-09-09",
  "Description": "XtremeData dbX-Data CloudFormation",
  "Parameters" : {
    "ClusterNameAnddbXHead" : {
      "Description" : "Name of the XtremeData Cluster and the dbX-Head EC2 hostname concatenated with semi-colon",
      "Type" : "String",
      "MinLength": "1",
      "MaxLength": "64"
    },
	"XtremeSecurityGroup" : {
      "Description" : "Name of the XtremeData SecurityGroup",
      "Type" : "String",
      "MinLength": "1",
      "MaxLength": "64",
      "AllowedPattern" : "[-_ a-zA-Z0-9]*",
      "ConstraintDescription" : "can contain only alphanumeric characters, spaces, dashes and underscores."
    },
	"NumberOfDataNodes" : {
      "Description" : "Max. Number of Xtreme Data Instances to start (in addition to the Head Instance)",
      "Type" : "Number",
      "Default" : "2"
    },
    "HeadAvailZone" : {
      "Description" : "Name of the Availability Node where dbX-Head instance is running",
      "Type" : "String",
	  "MinLength": "1",
      "MaxLength": "64",
      "AllowedPattern" : "[-_ a-zA-Z0-9]*",
      "ConstraintDescription" : "can contain only alphanumeric characters, spaces, dashes and underscores."
    }, 
    "KeyName" : {
      "Description" : "Name of an existing EC2 KeyPair to enable SSH access to the instances",
      "Type" : "String",
      "MinLength": "1",
      "MaxLength": "64",
      "AllowedPattern" : "[-_ a-zA-Z0-9]*",
      "ConstraintDescription" : "can contain only alphanumeric characters, spaces, dashes and underscores."
    },
    "InstanceType" : {
        "Description" : "XtremeData dbX-Data EC2 instance type",
        "Type" : "String",
        "Default" : "m1.large",
        "AllowedValues" : [ "m1.large","m1.xlarge","m2.xlarge","m2.2xlarge","m2.4xlarge", "c1.xlarge","hi1.4xlarge"],
        "ConstraintDescription" : "must be a valid EC2 instance type."
      }
  },
  "Mappings" : {
    "AWSRegion2AMI" : {
      "us-east-1"      : {"AMI" : "ami-442c5e2d"},
      "us-west-2"      : {"AMI" : "ami-1d28b82d"},
      "us-west-1"      : {"AMI" : "ami-66cce523"}
    }
  },
  "Resources": {
    "XtremeDataNodeGroup" : {
      "Type" : "AWS::AutoScaling::AutoScalingGroup",
      "Properties" : {
        "AvailabilityZones" : [ { "Ref" : "HeadAvailZone" } ],
        "LaunchConfigurationName" : { "Ref" : "DataNodeLaunchConfig" },
        "MinSize" : "0",
        "MaxSize" : {"Ref" : "NumberOfDataNodes"},
		"DesiredCapacity" : {"Ref" : "NumberOfDataNodes"},
        "Tags" : [
          { "Key" : "Name", "Value" : "XtremeData dbX-Data", "PropagateAtLaunch" : "true" }
        ]
      }
    },

	"DataNodeLaunchConfig" : {
       "Type" : "AWS::AutoScaling::LaunchConfiguration",
	   "Properties": {
	      "UserData" : { "Fn::Base64" : { "Ref" : "ClusterNameAnddbXHead" }},
          "KeyName" : { "Ref" : "KeyName" },
          "SecurityGroups" : [ { "Ref" : "XtremeSecurityGroup" } ],
          "InstanceType" : { "Ref" : "InstanceType" },
		  "ImageId" : { "Fn::FindInMap" : [ "AWSRegion2AMI", { "Ref" : "AWS::Region" }, "AMI" ]},
		  "BlockDeviceMappings" : [
               {
                  "DeviceName" : "/dev/sdf",
                  "Ebs" : {
                     "VolumeType" : "standard",
                     "DeleteOnTermination" : "true",
                     "VolumeSize" : "100"
                  }
               },
               {
                  "DeviceName" : "/dev/sdg",
                  "Ebs" : {
                     "VolumeType" : "standard",
                     "DeleteOnTermination" : "true",
                     "VolumeSize" : "100"
                  }
               },
			   {
                  "DeviceName" : "/dev/sdh",
                  "Ebs" : {
                     "VolumeType" : "standard",
                     "DeleteOnTermination" : "true",
                     "VolumeSize" : "100"
                  }
               },
               {
                  "DeviceName" : "/dev/sdi",
                  "Ebs" : {
                     "VolumeType" : "standard",
                     "DeleteOnTermination" : "true",
                     "VolumeSize" : "100"
                  }
               }
          ]
		}
    }
  }
}

This template uses cluster name and AWS EC2 hostname concatenated with ; for userdata, it also uses the availability zones for dbX Head node and the security group that we just defined for dbX Head node. In addition, it allows to define the number of data nodes – it will be used to configure an auto-scaling group to ensure that we have the requested amount of data instances running.

CAUTION: Please, note that it is not enough just to terminate the nodes if you have auto-scaling defined since AWS will initiate new nodes. We need to terminate the instances using as-terminate-instance-in-auto-scaling-group command and then delete the auto-scaling config with as-delete-auto-scaling-group and as-delete-launch-config. More information about AWS auto-scaling can be found here.

To start an XtremeData dbX Head node, we need to go to AWS console Cloud Formation section, then select Create Stack:

dbx-2

Once we submit the configuration, the status will become CREATE_IN_PROGRESS and then after a while CREATE_COMPLETE.

Then we can create another stack for the data nodes using the second CloudFormation template:

dbx-8

After a while we should see one head node and the requested number of datanodes running in AWS EC2 console:

dbx-10

Creating a database in ExtremeData dbX

Once we have fired up the nodes, we can login to the head node as ec2-user using ssh (or e.g. putty on Windows).

First we need to initiate the cluster (please, note that this is a destructive operation, no data will be preserved), then start the cluster and the dbX database engine:

[ec2-user@ip-10-224-122-90 home]$ cluster_init -i 2
init started
examining nodes
initializing head ebs disks: xvdf xvdg xvdh xvdi
mdadm: Defaulting to version 1.2 metadata
mdadm: array /dev/md0 started.
mkfs.xfs: Specified data stripe unit 512 is not the same as the volume stripe unit 128
initializing node ebs disks: xvdf xvdg xvdh xvdi
clearing head ephemeral disks
clearing nodes ephemeral disks
cluster_init done

[ec2-user@ip-10-224-122-90 home]$ cluster_start
initializing head ephemeral disks
  head ephemeral disks:  xvdb
    raid-ing
mdadm: /dev/xvdb appears to contain an ext2fs file system
    size=419395584K  mtime=Sat Jul  6 21:52:00 2013
mdadm: Defaulting to version 1.2 metadata
mdadm: array /dev/md1 started.
mkfs.xfs: Specified data stripe unit 512 is not the same as the volume stripe unit 128
mkfs.xfs: Specified data stripe width 512 is not the same as the volume stripe width 128
examining nodes
assembling node EBSs: xvdf xvdg xvdh xvdi
initializing node tmp disks: xvdb xvdc
  raid-ing
cluster_start done

[ec2-user@ip-10-224-122-90 home]$ dbx_start
new dbx installation initialized
Initializing kernel nfsd:                                  [  OK  ]
Starting NFS services:                                     [  OK  ]
Starting NFS mountd:                                       [  OK  ]
Stopping RPC idmapd:                                       [  OK  ]
Starting RPC idmapd:                                       [  OK  ]
Starting NFS daemon:                                       [  OK  ]
rpc.svcgssd is stopped
rpc.mountd (pid 3402) is running...
nfsd (pid 3467 3466 3465 3464 3463 3462 3461 3460) is running...
Generating binary repository file ...
file 'xdapp.conf2.txt.conf0' generated.
starting dbx nodes
Starting xdu daemon...                                     [  OK  ]
dbx startup done
[ec2-user@ip-10-224-122-90 home]$
[ec2-user@ip-10-224-122-90 home]$

Now we can login as dbxdba user (we can do it again with ssh or putty since the key is copied over from ec2-user). Then we need to change the dbxdba password (using passwd Linux commad) in order to be able to login using dbX tools.

Now we are ready to create a server and once the server is created, we can then create a database. The simplest way to manage the database and run SQL queries is to use a command line tool called xdudb:

$ xdudb create aws_server 32145
||==================||

[dbxdba@ip-10-224-122-90 ~]$ xdudb list
||==================||
aws_server
[dbxdba@ip-10-224-122-90 ~]$ xdudb status
||==================||
ip-10-224-122-90 configuration: 1 head node and 2 data nodes
running dbx services          : 1 head node and 2 data nodes

[dbxdba@ip-10-224-122-90 ~]$ xdudb start aws_server
||==================||

[dbxdba@ip-10-224-122-90 ~]$ xdudb info aws_server
||==================||
Name        : aws_server
NodeSet     : Default_NS
Owner       : dbxdba
Port        : 32145
H|0   |head    |10.224.122.90  |head |/volumes/data/dbstore/dbxdba/aws_server/head
D|1   |node00  |10.154.137.31  |n0   |/hd/dbxdba/aws_server/n0
D|2   |node01  |10.165.7.225   |n1   |/hd/dbxdba/aws_server/n1

[dbxdba@ip-10-224-122-90 ~]$ xdudb dbcreate aws_server aws_db
CREATE DATABASE

Now we can create a table, insert some rows and run select statements.

[dbxdba@ip-10-224-122-90 ~]$ xdudb sql aws_server aws_db
Welcome to dbX psql 3.2.5, the interactive SQL terminal.

Type:  \copyright for distribution terms
       \h for help with SQL commands
       \? for help with psql commands
       \g or terminate with semicolon to execute query
       \q to quit

aws_db=# CREATE TABLE products (
aws_db(# product_no INTEGER,
aws_db(# name VARCHAR(30),
aws_db(# price NUMERIC(10,2)
aws_db(# );
CREATE TABLE
aws_db=# INSERT INTO products VALUES(1, 'Product1', 100.00);
INSERT 0 1
aws_db=# SELECT * FROM products;
 product_no |   name   | price
------------+----------+--------
          1 | Product1 | 100.00
(1 row, Query Total: 1)

WIth the help of the xdudb command line tool we can check dbX version, check disk usage and mounted filesystems, etc.

[dbxdba@ip-10-224-122-90 ~]$ xdudb version
ip-10-224-122-90        2013-07-07 00:03:10
  DB           : dbX SQL 3.2.5
               : fpga.run r4347
  command line : xdudb 1.4
  GUI          : xdadm 3.0.2
  daemon       : xdutils 4.5.8
  OS           : Linux 3.4.43-43.43.amzn1.x86_64

[dbxdba@ip-10-224-122-90 ~]$ xdudb  du aws_server
||==================||
H|0   |head    |10.224.122.90  |head |60M  |/volumes/data/dbstore/dbxdba/aws_server/head
D|1   |node00  |10.154.137.31  |n0   |58M  |/hd/dbxdba/aws_server/n0
D|2   |node01  |10.165.7.225   |n1   |58M  |/hd/dbxdba/aws

[dbxdba@ip-10-224-122-90 ~]$ xdudb  df
head : 10.224.122.90
Filesystem            Size  Used Avail Use% Mounted on
/dev/md0              400G   95M  400G   1% /volumes/data
node00 : 10.154.137.31
Filesystem            Size  Used Avail Use% Mounted on
/dev/md0              400G   93M  400G   1% /volumes/data
node01 : 10.165.7.225
Filesystem            Size  Used Avail Use% Mounted on
/dev/md0              400G   93M  400G   1% /volumes/data

xdAdm web based management GUI

In addition to the command line tool that we described in the AWS cloud deployment scenario, there is also a web base GUI to manage the applicance – it is based on Apache Tomcat. We just go to http://dbx-headnode:8080/xdadm url and login. The we can manage the servers and databases, check nodesets and monitor the system:

dbx-15

XtremeData dbX virtual machine

If someone just wants to try out dbX database engine then she can download a virtual machine from XtremeData web site. The instance can be run e.g using vmplayer. Once the user logged in to the virtual machine as dbxdba, he can startup a server:

$ xdudb start dbxtest