The goal of this exercise is to set up Apache Hadoop on the Amazon Web Services (AWS) cloud and to demonstrate Hadoop on a typical Big Data problem: counting the number of words in one or more documents. The technologies involved are introduced briefly below:
· Amazon Web Services (AWS; aka Amazon Cloud) delivers a set of services that together form a reliable, scalable, and inexpensive computing platform “in the cloud”. AWS’s Elastic Compute Cloud (EC2) service provides the servers that are used to set up and execute Hadoop jobs.
· Apache Hadoop is an open-source software framework that supports data-intensive distributed applications, licensed under the Apache v2 license. It supports the running of applications on large clusters of commodity hardware. Hadoop was derived from Google's MapReduce and Google File System (GFS) papers.
· The Apache Crunch Java library provides a framework for writing, testing, and running MapReduce pipelines. Its goal is to make pipelines that are composed of many user-defined functions simple to write, easy to test, and efficient to run.
Set up the instance in the AWS Management Console with the following parameters:
· Operating System: Ubuntu LTS 64-bit
· Instance type: t1.micro (insufficient for any serious Hadoop work, but chosen because this exercise is experimental and ‘t1.micro’ is the only instance type available under the ‘Free’ tier)
· Storage: Instance store (do not choose EBS, since transfer in and out of EBS is charged)
· Make sure the following software is installed on the server:
o JDK version 1.6 or above
o Maven 2
o ‘tree’ package, which can be installed using the following command:
$ sudo apt-get install tree
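A quick way to confirm the prerequisites above is to check that each tool is on the PATH. The sketch below is only an illustration: the function name `missing_tools` is made up for this example, `javac` is used as a stand-in for "a JDK is installed", and no version check (JDK 1.6+, Maven 2) is attempted.

```shell
#!/bin/sh
# Print the name of every command in the argument list that is not on PATH.
# (Hypothetical helper, not part of Hadoop or the EC2 tools.)
missing_tools() {
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            printf '%s\n' "$tool"
        fi
    done
}

# Check the three prerequisites; prints nothing when all are installed.
missing_tools javac mvn tree
```

Any name printed still needs to be installed before continuing.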
EC2 tools are used by Hadoop’s EC2 setup utilities to create instances and launch them.
· Get the EC2 tools using the following command:
· Unzip the compressed file
$ unzip ec2-api-tools.zip
· Set up the ec2-init.sh file with the following configuration:
· Execute the following at the prompt to set the parameters:
$ source ec2-init.sh
· Make sure the EC2 init file parameters are set at system startup by adding the above command to the .profile file
· Add your private key file as shown below. Make sure the file’s permissions are set to ‘400’ with ‘chmod’.
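The permission requirement can be applied and verified in one step. This is a minimal sketch, not the original listing: `lock_key_file` is a hypothetical helper name, and the key path is whatever location you chose for your private key.

```shell
#!/bin/sh
# Restrict a private key file to owner read-only (mode 400) and confirm it.
# (Hypothetical helper; the caller supplies the key path.)
lock_key_file() {
    key_file="$1"
    chmod 400 "$key_file" || return 1
    # The first ten characters of 'ls -ld' are the file type and permission bits.
    [ "$(ls -ld "$key_file" | cut -c1-10)" = "-r--------" ]
}

# Usage: lock_key_file /path/to/your-private-key && echo "key file is locked down"
```

SSH refuses private keys that are readable by other users, which is why the mode matters here.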
· Retrieve the Hadoop tools (use a standard Apache download URL if the one below doesn’t work)
$ tar -xzvf hadoop-1.2.1.tar.gz
· Create the hadoop-ec2 initialization script
$ vi hadoop-ec2-init.sh
· Add the following lines to the initialization script:
· Initialize the variables:
$ source hadoop-ec2-init.sh
Put this command in ~/.profile to have it run automatically on login.
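When adding the `source` lines to ~/.profile, it helps to avoid appending the same line twice on repeated runs. The sketch below shows one way to do that; `append_once` is a hypothetical helper, and the script locations in the usage comments are assumptions (adjust them to wherever you saved the files). The usage lines use `.`, the portable spelling of `source`, since ~/.profile may be read by shells other than bash.

```shell
#!/bin/sh
# Append a line to a file only if that exact line is not already present.
# (Hypothetical helper, not part of the Hadoop or EC2 tooling.)
append_once() {
    line="$1"
    file="$2"
    touch "$file"
    # -x matches whole lines, -F treats the text literally, -q suppresses output.
    grep -qxF -- "$line" "$file" || printf '%s\n' "$line" >> "$file"
}

# Usage (assumed locations in the home directory):
# append_once '. "$HOME/ec2-init.sh"' "$HOME/.profile"
# append_once '. "$HOME/hadoop-ec2-init.sh"' "$HOME/.profile"
```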
· Configure Hadoop with your EC2 account
$ vi ~/hadoop-1.2.1/src/contrib/ec2/bin/hadoop-ec2-env.sh
◦ AWS_ACCESS_KEY_ID=< Get this from your account page >
Looks like AKIAJ5U4QYDDZCNDDY5Q
◦ AWS_SECRET_ACCESS_KEY=<Get this from your account page>
Looks like FtDMaAuSXwzD7pagkR3AfIVTMjc6+pdab2/2iITL
The same key pair you set up earlier at ~/.ec2/id_rsa-<key name>
· Check the Hadoop configuration files in the ‘hadoop-1.2.1’ directory to make sure every file uses a login that the OS supports. These configuration files use ‘root’ by default, but Ubuntu on EC2 does not allow root login, so ‘root’ needs to be changed to ‘ubuntu’.
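The login change can be scripted rather than edited by hand. The sketch below is an illustration under stated assumptions: the text does not list which files mention ‘root’, so the example assumes the references take the form `root@` (as in ssh and scp targets) and leaves every other occurrence of the word alone. `replace_login` is a hypothetical helper; review the result before launching a cluster.

```shell
#!/bin/sh
# Rewrite 'root@' to 'ubuntu@' in every file under a directory that contains it.
# A '.bak' copy of each changed file is kept next to it.
# (Hypothetical helper; the 'root@' pattern is an assumption.)
replace_login() {
    dir="$1"
    grep -rl --exclude='*.bak' -- 'root@' "$dir" | while IFS= read -r f; do
        sed -i.bak 's/root@/ubuntu@/g' "$f"
    done
}

# Usage: replace_login ~/hadoop-1.2.1/src/contrib/ec2/bin
```

Running `grep -rn 'root' <dir>` afterwards shows any remaining references that need a manual decision.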
· Create/launch cluster
$ hadoop-ec2 launch-cluster <group>-cluster 2
· Test login to master node
$ hadoop-ec2 login test-cluster
· Retrieve Apache Crunch
$ tar -xzvf apache-crunch-0.6.0-src.tar.gz
· Make sure ‘JAVA_HOME’ points to an OpenJDK installation (the JDK, not the JRE). The command to install the package on Ubuntu is:
$ sudo apt-get install openjdk-6-jdk
· Generate a simple project and answer the prompts as shown below:
$ mvn archetype:generate -Dfilter=org.apache.crunch:crunch-archetype
1: remote -> org.apache.crunch:crunch-archetype (Create a basic, self-contained job with the core library.)
Choose a number or apply filter (format: [groupId:]artifactId, case sensitive contains): : 1
Define value for property 'groupId': : com.example
Define value for property 'artifactId': : crunch-demo
Define value for property 'version': 1.0-SNAPSHOT: : [HIT ENTER]
Define value for property 'package': com.example: : [HIT ENTER]
Confirm properties configuration:
Y: : [HIT ENTER]
· Change directory using:
$ cd crunch-demo
· See the directory structure:
· Build the code. Maven sometimes fails with an error indicating that too many files have unapproved licenses. To eliminate the error, set the unapproved-license limit to a high number.
$ mvn -Drat.numUnapprovedLicenses=100 package
· Execute the following command to run the sample word count application. ‘Input’ can be either a file or a directory. ‘Output’ should be a directory.
$ ~/hadoop-1.2.1/bin/hadoop jar ./target/crunch-demo-1.0-SNAPSHOT-job.jar input output
$ cat input
dnasmdad daksj d kdjkadmsak d james john james john john
$ cd output
$ cat part-r-00000
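To sanity-check the job's result on a small input, the same counts can be computed locally with standard Unix tools. This is only a cross-check sketch, not the Crunch pipeline itself: `word_count` is a hypothetical helper that splits on whitespace, and the generated Crunch job may tokenize or filter words differently, so its output file need not match line for line.

```shell
#!/bin/sh
# Count whitespace-separated words on standard input.
# Prints one "word<TAB>count" line per distinct word, sorted by word.
# (Hypothetical helper for cross-checking; not part of Crunch or Hadoop.)
word_count() {
    tr -s '[:space:]' '\n' | grep -v '^$' | LC_ALL=C sort | uniq -c |
        awk '{ print $2 "\t" $1 }'
}

# The sample input used above.
printf 'dnasmdad daksj d kdjkadmsak d james john james john john\n' | word_count
```

For the sample input, this reports ‘john’ three times, ‘james’ and ‘d’ twice each, and every other word once.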
Apache Hadoop provides built-in EC2 support that allows a user to set up a Hadoop cluster very easily. Although there are other options for setting up Hadoop clusters (such as Cloudera Server Manager), Apache Hadoop’s built-in capability allows greater control.
During the installation process, some of the default options may need to be adapted for the type of operating system under use.