Sunday, May 11, 2014

Setting up Hadoop 1.2.1 on CentOS 6.4 with OpenJDK 1.6

Before we begin I would like to give a little intro. We have been working on a lot of “Big Data” and machine learning type projects for some future products and services we intend to offer, so we would like to share some of our initial findings and code from this realm. Haven’t heard of Hadoop yet? Read on.
As the title indicates, we will be installing Hadoop 1.2.1 on a CentOS 6.4 Linux server. This will be a “single node” install; I will write a separate article on installing Hadoop on a cluster. If you are just getting into Hadoop, this tutorial will help you get your first installation out of the way.


First, let’s get started with some Java.

 Installing OpenJDK 1.6

Luckily, Hadoop runs well on OpenJDK (from what we have tested), so there is no need to download any JDK binaries from Oracle.
yum install java-1.6.0-openjdk.x86_64
Hadoop will run just fine with the vanilla OpenJDK JRE. However, for Maven to run properly later on, we are going to need the devel JDK (which includes javac) installed as well.
yum install java-1.6.0-openjdk-devel.x86_64
Let’s make sure Java is registered on your system. First, let’s see if the correct version is set up in our path:
java -version

java version "1.6.0_28"
OpenJDK Runtime Environment (IcedTea6 1.13.0pre) (rhel-1.66.1.13.0.el6-x86_64)
OpenJDK 64-Bit Server VM (build 23.25-b01, mixed mode)
javac -version
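If the devel package installed correctly, this should report a matching compiler version, along the lines of:
javac 1.6.0_28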
If you do not see this version of Java, you probably had a previous install of Java on your system. Luckily, we are using CentOS, so we can easily change this by running the alternatives config.
alternatives --config java
There is 1 program that provides 'java'.
Selection Command
 -----------------------------------------------
 *+ 1 /usr/lib/jvm/jre-1.6.0-openjdk.x86_64/bin/java
If you had any previous versions of Java installed, make sure the 1.6 JDK is selected. We will need this version for our later examples.

Now that we have all that Java stuff out of the way, let’s get down to installing and configuring Hadoop.
Note: As with anything, there are many ways to do this; automating these steps will save you a lot of time when we get into installing Hadoop on a cluster. However, sometimes it’s best to do things the hard way first to familiarize yourself with something new.
Before we begin let’s create some credentials for Hadoop to use.
useradd hadoop
passwd hadoop
You do not have to create a specific user account for Hadoop to run properly; of course, root works just fine. You can even enable key-based login for your cluster.
Note: For this demo we ran everything as the hadoop account with no key login.
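If you do want key-based login (the start and stop scripts ssh into each node, even localhost), a minimal sketch of setting it up for the hadoop user looks like this; it assumes the default ~/.ssh paths:
su - hadoop
ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys
ssh localhost
If the last command logs you in without a password prompt, you are set.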
Let’s create a directory for all the Hadoop binaries (okay, mostly jars) to live in.

Create a Hadoop Directory

mkdir /opt/hadoop
cd /opt/hadoop
wget http://apache.cs.utah.edu/hadoop/common/hadoop-1.2.1/hadoop-1.2.1-bin.tar.gz
Note: Default minimal installs of CentOS don’t include wget, so install it first if needed.
yum install wget
tar -xzf hadoop-1.2.1-bin.tar.gz
mv hadoop-1.2.1 hadoop
chown -R hadoop:hadoop /opt/hadoop
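A quick sanity check that the ownership took; both the owner and group columns should read hadoop:
ls -ld /opt/hadoop/hadoop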

 Edit Hadoop Configs

Now that we have everything extracted and proper ownership applied, there are a few Hadoop configs we will need to change. They all live in the conf directory of the extracted tree, so change into it first.
 cd /opt/hadoop/hadoop
 vi conf/core-site.xml
# Add the following inside the configuration tag
<property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000/</value>
</property>
<property>
    <name>dfs.permissions</name>
    <value>false</value>
</property>
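These property blocks go inside the existing configuration element; don’t add a second one. For reference, the finished core-site.xml should look roughly like this (the two header lines are already in the file that ships with Hadoop):
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
    <property>
        <name>fs.default.name</name>
        <value>hdfs://localhost:9000/</value>
    </property>
    <property>
        <name>dfs.permissions</name>
        <value>false</value>
    </property>
</configuration>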
Edit hdfs-site.xml
 vi conf/hdfs-site.xml
# Add the following inside the configuration tag
<property>
 <name>dfs.data.dir</name>
 <value>/opt/hadoop/hadoop/dfs/name/data</value>
 <final>true</final>
</property>
<property>
 <name>dfs.name.dir</name>
 <value>/opt/hadoop/hadoop/dfs/name</value>
 <final>true</final>
</property>
<property>
 <name>dfs.replication</name>
 <value>1</value>
</property>
Note: Since this is a single-node install, set dfs.replication to 1; with only one DataNode, any higher value would leave every block permanently under-replicated.
Edit mapred-site.xml
 vi conf/mapred-site.xml
# Add the following inside the configuration tag
<property>
 <name>mapred.job.tracker</name>
 <value>localhost:9001</value>
</property>
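Malformed XML in any of these files is a common reason the daemons refuse to start. If libxml2 is installed (it usually is on CentOS), you can sanity-check all three in one shot; xmllint prints nothing when the files are well-formed:
xmllint --noout conf/core-site.xml conf/hdfs-site.xml conf/mapred-site.xml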
Edit hadoop-env.sh
 vi conf/hadoop-env.sh
export JAVA_HOME=/usr/lib/jvm/jre-1.6.0-openjdk.x86_64/
Set the JAVA_HOME path as per your system’s Java configuration.
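If you are not sure where your JDK lives, resolving the java symlink will show you; JAVA_HOME is everything above the bin directory (the path below is just an example, yours may differ):
readlink -f $(which java)
/usr/lib/jvm/java-1.6.0-openjdk-1.6.0.0.x86_64/jre/bin/java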
Let’s format our first namenode!
 su - hadoop
 cd /opt/hadoop/hadoop
 bin/hadoop namenode -format

 Start Hadoop

 bin/start-all.sh
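Once the scripts finish, run jps (it ships with the devel JDK we installed earlier) to confirm everything came up. On a healthy single node you should see NameNode, SecondaryNameNode, DataNode, JobTracker, and TaskTracker listed, plus Jps itself.
 jps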
Each Service Has Its Own Status Page
  http://hnode1.vaurent.com:50030/   for the Jobtracker
  http://hnode1.vaurent.com:50070/   for the Namenode
  http://hnode1.vaurent.com:50060/   for the Tasktracker
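Before stopping anything, a quick smoke test proves the whole stack works end to end. This runs the wordcount example that ships in the Hadoop 1.2.1 tarball against our own config files; the /tmp HDFS paths are just illustrative:
 bin/hadoop fs -mkdir /tmp/input
 bin/hadoop fs -put conf/*.xml /tmp/input
 bin/hadoop jar hadoop-examples-1.2.1.jar wordcount /tmp/input /tmp/output
 bin/hadoop fs -cat /tmp/output/part-r-00000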

To stop Hadoop

bin/stop-all.sh
That about sums it all up. I will update this article with links to setting up Maven and running our first Hadoop test with Apache Pig.
