1 Background
The goal of this exercise is to set up Apache Hadoop on the Amazon Web Services (AWS) cloud and demonstrate using Hadoop for a typical Big Data problem, i.e. counting the number of words in document(s). A brief introduction to the technologies involved is given below:
· Amazon Web Services (AWS; aka Amazon Cloud) delivers a set of services that together form a reliable, scalable, and inexpensive computing platform “in the cloud”. AWS’s Elastic Compute Cloud (EC2) service provides the servers that are used to set up and execute Hadoop jobs.
· Apache Hadoop is an open-source software framework that supports data-intensive distributed applications, licensed under the Apache v2 license. It supports running applications on large clusters of commodity hardware. Hadoop was derived from Google's MapReduce and Google File System (GFS) papers.
· The Apache Crunch Java library provides a framework for writing, testing, and running MapReduce pipelines. Its goal is to make pipelines that are composed of many user-defined functions simple to write, easy to test, and efficient to run.
2 Steps
2.1 Start a new instance on AWS
Set up the instance using the AWS Management Console with the following parameters:
· Operating System: Ubuntu LTS 64-bit
· Instance type: t1.micro (insufficient for any serious Hadoop work, but chosen for this exercise because it is experimental and ‘t1.micro’ is the only instance type available under the Free tier)
· Storage: Instance store (do not choose EBS, since transfer in and out of EBS is charged)
· Make sure the following software is set up on the server:
o JDK version 1.6 or above
o Maven 2
o ‘tree’ package, which can be installed using the following command:
$ sudo apt-get install tree
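The JDK and Maven can be installed the same way; the package names below are assumptions for the Ubuntu release in use and may differ:
$ sudo apt-get update
$ sudo apt-get install openjdk-6-jdk maven2   # package names assumed; adjust for your Ubuntu release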
2.2 Setup EC2 account and tools
EC2 tools are used by Hadoop’s EC2 setup utilities to create instances
and launch them.
· Get the EC2 tools using the following command:
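(The exact download command was not captured; the URL below is assumed to be Amazon's standard download location for the EC2 API tools.)
$ wget http://s3.amazonaws.com/ec2-downloads/ec2-api-tools.zip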
· Unzip the compressed file:
$ unzip ec2-api-tools.zip
· Set up the ec2-init.sh file with the following configuration:
export JAVA_HOME=/usr
export EC2_HOME=~/ec2-api-tools-1.6.10.0
export PATH=$PATH:$EC2_HOME/bin
export AWS_ACCESS_KEY=<your AWS access key>
export AWS_SECRET_KEY=<your AWS secret key>
export EC2_PRIVATE_KEY=~/.ec2/pk-unencrypt-test.pem
export EC2_CERT=~/.ec2/cert-test.pem
· Execute the following at the prompt to set the parameters:
$ source ec2-init.sh
· Make sure the EC2 init parameters are set at system startup by adding the above command to the .profile file.
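One minimal way to do this, assuming ec2-init.sh sits in the home directory:
$ echo "source ~/ec2-init.sh" >> ~/.profile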
· Add your private key file as shown below. Make sure the file has its permissions set to ‘400’ with ‘chmod’.
.ec2/id_rsa-<key name>
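For example, using the key path from the step above:
$ chmod 400 ~/.ec2/id_rsa-<key name>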
2.3 Setup Apache Hadoop on EC2
· Retrieve the Hadoop tools and extract the archive (use a standard Apache mirror URL if the download command sketched below doesn’t work):
$ tar -xzvf hadoop-1.2.1.tar.gz
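A possible download command (run it before the extraction step); the URL assumes the standard Apache archive layout for the 1.2.1 release:
$ wget http://archive.apache.org/dist/hadoop/core/hadoop-1.2.1/hadoop-1.2.1.tar.gz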
· Create a hadoop-ec2 initialization script:
$ vi hadoop-ec2-init.sh
· Add the following lines to the initialization script:
export HADOOP_EC2_BIN=~/hadoop-1.2.1/src/contrib/ec2/bin
export PATH=$PATH:$HADOOP_EC2_BIN
· Initialize the variables (put the command in ~/.profile to have it run automatically on login):
$ source hadoop-ec2-init.sh
· Configure Hadoop with the EC2 account:
$ vi ~/hadoop-1.2.1/src/contrib/ec2/bin/hadoop-ec2-env.sh
◦ AWS_ACCOUNT_ID=283072064258
◦ AWS_ACCESS_KEY_ID=<Get this from your account page>
Looks like AKIAJ5U4QYDDZCNDDY5Q
◦ AWS_SECRET_ACCESS_KEY=<Get this from your account page>
Looks like FtDMaAuSXwzD7pagkR3AfIVTMjc6+pdab2/2iITL
◦ KEY_NAME=<group>-keypair
The same keypair you set up earlier at ~/.ec2/id_rsa-<key name>
· Check the Hadoop configuration files in the ‘hadoop-1.2.1’ directory to make sure all files use a login supported by the OS. These configuration files use ‘root’ by default; however, Ubuntu on EC2 does not allow root logins, so ‘root’ needs to be changed to ‘ubuntu’.
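One way to locate the scripts that reference ‘root’ (the EC2 helper scripts under src/contrib/ec2/bin are the usual place; the path is assumed from the extraction step above):
$ grep -rl root ~/hadoop-1.2.1/src/contrib/ec2/bin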
· Create/launch the cluster:
$ hadoop-ec2 launch-cluster <group>-cluster 2
· Test login to the master node:
$ hadoop-ec2 login <group>-cluster
2.4 Setup Apache Crunch on EC2
· Retrieve Apache Crunch (a download sketch follows) and extract it:
$ tar -xzvf apache-crunch-0.6.0-src.tar.gz
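A possible download command; the exact path on the Apache archive is an assumption and may need adjusting:
$ wget http://archive.apache.org/dist/crunch/apache-crunch-0.6.0/apache-crunch-0.6.0-src.tar.gz   # path assumed; check the archive listing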
· Make sure ‘JAVA_HOME’ points to the OpenJDK JDK (and not a JRE). The command to install the package on Ubuntu is:
$ sudo apt-get install openjdk-6-jdk
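A quick way to confirm that a JDK (and not just a JRE) is installed and to see where ‘javac’ resolves:
$ readlink -f $(which javac)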
· Generate a simple project and answer the questions as shown below:
$ mvn archetype:generate -Dfilter=org.apache.crunch:crunch-archetype
[...]
1: remote -> org.apache.crunch:crunch-archetype (Create a basic, self-contained job with the core library.)
Choose a number or apply filter (format: [groupId:]artifactId, case sensitive contains): : 1
Define value for property 'groupId': : com.example
Define value for property 'artifactId': : crunch-demo
Define value for property 'version': 1.0-SNAPSHOT: : [HIT ENTER]
Define value for property 'package': com.example: : [HIT ENTER]
Confirm properties configuration:
groupId: com.example
artifactId: crunch-demo
version: 1.0-SNAPSHOT
package: com.example
Y: : [HIT ENTER]
[...]
$
· Change directory using:
$ cd crunch-demo
· See the directory structure:
$ tree
· Build the code. Sometimes Maven gives an error indicating that too many files with unapproved licenses are in use; to eliminate the error, set the unapproved-license limit to a high number:
$ mvn -Drat.numUnapprovedLicenses=100 package
2.5 Execute sample Hadoop job using Crunch
· Execute the following command to run the sample word-count application. ‘input’ can be either a file or a directory; ‘output’ should be a directory that does not already exist.
$ ~/hadoop-1.2.1/bin/hadoop jar ./target/crunch-demo-1.0-SNAPSHOT-job.jar input output
$ cat input
dnasmdad
daksj d kdjkadmsak d james john james john john
$ cd output
$ ls
part-r-00000
$ cat part-r-00000
[james,2]
[d,2]
[daksj,1]
[dnasmdad,1]
[john,3]
[kdjkadmsak,1]
3 Summary
Apache Hadoop provides built-in EC2 support that allows users to set up a Hadoop cluster very easily. Although there are other options for setting up Hadoop clusters (such as Cloudera Manager), Apache Hadoop's built-in capability allows greater control.
During the installation process, some of the default options may need to be adapted to the operating system in use.