In this post, we’ll look at how to install the latest stable version of Apache Hadoop on a laptop running Linux Mint 13. The setup should be very similar on most other distributions.
Before we begin, you should check out the hadoop website here: http://hadoop.apache.org/ and familiarize yourself with the Getting Started documentation.
// GET HADOOP
The first thing you need to do is acquire the hadoop package. At the time of writing, Hadoop was available for download from the following link: http://www.apache.org/dyn/closer.cgi/hadoop/common/. The latest stable version at this point is 1.0.4, so I opted to download the file hadoop-1.0.4.tar.gz and install from the tarball instead of the rpm. I chose this route because the laptop I am installing hadoop on will not be used solely for hadoop, and a tarball install keeps everything in a single working directory and makes future upgrades easy.
Once you have downloaded the tarball, extract it with the following command:
$ tar zxvf hadoop-1.0.4.tar.gz
Optionally, you can move it somewhere on the filesystem where hadoop will reside (maybe in your home directory).
Inside the newly created hadoop-1.0.4 directory, there is a docs directory that has an outline of all the steps we will be working through. This post will walk through the ‘Single Node Setup’ instructions inside the docs directory.
First, we will look at Standalone Operation. It does not provide HDFS support; it simply gets a hadoop node up and running to process simple jobs. SSH access is not required for Standalone Operation.
To get Standalone Operation working, we only need to add one line of configuration that tells hadoop where Java is installed on our machine. Apache recommends using the Sun flavor of the Java JDK, but since this is not a production setup, I am going to use the OpenJDK installation that is already installed on my Linux Mint distribution. Further information about supported flavors of the JDK can be found here: http://wiki.apache.org/hadoop/HadoopJavaVersions.
Since I am running the 64 bit version of Mint, my JDK is installed at /usr/lib/jvm/java-6-openjdk-amd64. Add this configuration to the hadoop environment variables like this:
$ cd hadoop-1.0.4/conf
$ vim hadoop-env.sh
Somewhere around the top of this file you will see the following lines:
# The java implementation to use. Required.
# export JAVA_HOME=/usr/lib/j2sdk1.5-sun
We are going to change this to:
# The java implementation to use. Required.
export JAVA_HOME=/usr/lib/jvm/java-6-openjdk-amd64
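If your JDK lives somewhere else, the path can usually be worked out from wherever the java binary on your PATH really points. Here is a minimal sketch of that idea. The helper name java_home_from_binary is my own invention and not part of hadoop, and it assumes the usual layout where the binary sits under bin/ (or jre/bin/ for the OpenJDK packages):

```shell
# java_home_from_binary (hypothetical helper): given the resolved path of a
# java binary, print the directory that JAVA_HOME should point at.
java_home_from_binary() {
    home="${1%/bin/java}"   # strip the trailing /bin/java
    home="${home%/jre}"     # OpenJDK packages nest the binary under jre/
    printf '%s\n' "$home"
}

# Only attempt detection when java is actually on the PATH.
if command -v java >/dev/null 2>&1; then
    # readlink -f follows the /etc/alternatives symlinks to the real binary.
    java_home_from_binary "$(readlink -f "$(command -v java)")"
fi
```

Whatever path this prints is only a candidate. Check that the directory exists before pasting it into hadoop-env.sh.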
Congratulations! You should now be able to fire up hadoop in Standalone Operation mode. You can start by taking a look at the usage for the script that starts hadoop by going back up to the main hadoop folder and typing the following:
$ bin/hadoop
Now let's run the example hadoop script to actually push a full job through our Standalone server. Again in the hadoop folder:
$ mkdir input
$ cp conf/*.xml input
$ bin/hadoop jar hadoop-examples-*.jar grep input output 'dfs[a-z.]+'
$ cat output/*
This should run the example script and put the output from the job in the output folder.
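To get a feel for what the grep example is doing, here is a rough local equivalent using ordinary command line tools. It is only a sketch: count_matches is a name I made up, the sample lines stand in for the copied XML files, and the exact formatting of hadoop's own output may differ.

```shell
# count_matches (hypothetical helper): reads text on stdin and prints
# "<count> <match>" for every distinct string matching the pattern,
# most frequent first.
count_matches() {
    grep -oE "$1" | sort | uniq -c | sort -rn | awk '{ print $1, $2 }'
}

# Sample input standing in for the copied conf/*.xml files.
printf '%s\n' \
    '<name>dfs.replication</name>' \
    '<name>dfs.name.dir</name>' \
    '<!-- dfs.replication again -->' |
    count_matches 'dfs[a-z.]+'
```

The pattern matches the literal text dfs followed by one or more lowercase letters or dots, so property names like dfs.replication are picked up and counted.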
// PSEUDO-DISTRIBUTED OPERATION
Although Pseudo-Distributed Operation is not required, it will provide you with an environment that more closely resembles what you will encounter in production. I recommend performing the extra configuration to provide the web-based job tracker, web-based node tracker, and HDFS that you will find in hadoop production servers.
On Linux Mint, we must start by installing the SSH server and starting up the ssh daemon. Install the server with the following command:
$ sudo apt-get install openssh-server
After the server is installed, check its status with:
$ /etc/init.d/ssh status
If the server is not started, start it with:
$ sudo /etc/init.d/ssh start
Now that the SSH server is running, we must enable passphraseless ssh.
THIS IS DANGEROUS. YOU SHOULD NOT ENABLE PASSPHRASELESS SSH WHEN YOU ARE NOT USING HADOOP. EITHER DISABLE SSH ALTOGETHER OR REMOVE THE AUTHORIZED_KEYS WHEN NOT USING HADOOP.
$ ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
$ cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
You should now be able to ssh into the localhost without entering a password. Try it with the following command (you might have to type yes to accept the host key fingerprint; this is OK):
$ ssh localhost
$ exit
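If ssh still asks for a password at this point, one common cause is that the .ssh directory or the authorized_keys file is writable by other users, which sshd refuses to trust when StrictModes is on. The sketch below tightens those permissions. The helper name tighten_ssh_perms is hypothetical, and it assumes the keys live in the default ~/.ssh location unless you pass another directory:

```shell
# tighten_ssh_perms (hypothetical helper): restrict an .ssh directory and
# its authorized_keys file to the owning user.
tighten_ssh_perms() {
    ssh_dir="${1:-$HOME/.ssh}"
    chmod 700 "$ssh_dir"
    # Only touch authorized_keys if it exists.
    if [ -f "$ssh_dir/authorized_keys" ]; then
        chmod 600 "$ssh_dir/authorized_keys"
    fi
}

# Usage: tighten_ssh_perms            (defaults to ~/.ssh)
```

This does not change which keys are authorized. It only changes who can read and write the files.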
Inside the conf directory, we need to modify three files to add configuration for Pseudo-Distributed Operation. In each file, the snippet should be added between the <configuration> and </configuration> tags.
In conf/core-site.xml add the following:
<property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:9000</value>
</property>
In conf/hdfs-site.xml add the following:
<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>
In conf/mapred-site.xml add the following:
<property>
  <name>mapred.job.tracker</name>
  <value>localhost:9001</value>
</property>
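To make it clear where each property ends up inside its file, here is a small sketch that writes out complete, minimal versions of all three files. The helper write_site_file and the CONF_DIR variable are my own names, not something hadoop provides. By default the files go into a scratch directory so nothing in your real conf folder is overwritten, and the files that ship with hadoop contain an extra stylesheet line that I have left out here.

```shell
# write_site_file (hypothetical helper): writes a minimal Hadoop site file
# containing a single property. Arguments: file path, property name, value.
write_site_file() {
    cat > "$1" <<EOF
<?xml version="1.0"?>
<configuration>
  <property>
    <name>$2</name>
    <value>$3</value>
  </property>
</configuration>
EOF
}

# Write into a scratch directory by default so conf/ is not clobbered.
CONF_DIR="${CONF_DIR:-$(mktemp -d)}"
write_site_file "$CONF_DIR/core-site.xml"   fs.default.name    hdfs://localhost:9000
write_site_file "$CONF_DIR/hdfs-site.xml"   dfs.replication    1
write_site_file "$CONF_DIR/mapred-site.xml" mapred.job.tracker localhost:9001
echo "Wrote site files to $CONF_DIR"
```

You can compare the generated files against your hand-edited ones to confirm that every property block sits inside the configuration element.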
Now we need to format the distributed filesystem with the following command, from the hadoop folder:
$ bin/hadoop namenode -format
And now we're ready to go! Start up the hadoop daemons with the following command:
$ bin/start-all.sh
You should now be able to browse the NameNode and JobTracker web interfaces (after a short delay for startup) at the following URLs, which are the Hadoop 1.x defaults:
NameNode: http://localhost:50070/
JobTracker: http://localhost:50030/
At this point your Pseudo-Distributed Mode should be fully operational.
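If something does not look right, a quick sanity check is to list the running Java processes with jps (which ships with the JDK) and see whether all five Hadoop 1.x daemons are there. The sketch below does the comparison for you. The helper name missing_daemons is hypothetical, and it assumes jps prints one "pid ClassName" pair per line:

```shell
# Daemons expected on a Hadoop 1.x pseudo-distributed node.
EXPECTED_DAEMONS="NameNode DataNode SecondaryNameNode JobTracker TaskTracker"

# missing_daemons (hypothetical helper): reads jps output on stdin and
# prints each expected daemon that is not listed.
missing_daemons() {
    running="$(cat)"
    for daemon in $EXPECTED_DAEMONS; do
        # Match the class name column exactly, so that NameNode does not
        # accidentally match SecondaryNameNode.
        printf '%s\n' "$running" |
            awk -v d="$daemon" '$2 == d { found = 1 } END { exit !found }' ||
            printf '%s\n' "$daemon"
    done
}

# Usage: jps | missing_daemons
```

No output means every expected daemon was found. Anything that is printed is worth looking up in the logs directory inside the hadoop folder.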
// RUN THE EXAMPLE ON THE PSEUDO-DISTRIBUTED CLUSTER
Now we're going to run the exact same example we ran above, but this time inside of the Pseudo-Distributed Cluster. We start by copying our input data set into HDFS:
$ bin/hadoop fs -put conf input
Run the example:
$ bin/hadoop jar hadoop-examples-*.jar grep input output 'dfs[a-z.]+'
You should now see the job executing in the web based Job Tracker, and when the job completes you will be able to get the report from the job tracker. To copy the output from HDFS back to the local file system for examination, we can use the following command:
$ bin/hadoop fs -get output output
The output from the job that just completed will now be in a folder called ‘output’ in the current working directory.
// STOP THE HADOOP DAEMON
Once you are finished, you should stop the hadoop daemons with the following command:
$ bin/stop-all.sh