Hadoop Use

From Medialab

Here are simple steps to follow to perform a map[reduce] job.

I'll refer Hadoop installation directory as

  $HADOOP_HOME (/usr/local/hadoop on paleo):
  • Format your hadoop-distributed-file-system:
$HADOOP_HOME/bin/hadoop namenode -format
  • Startup all nodes (things will be a bit more tricky with a full working cluster...):
  • Copy your inputfile to HDFS hdfsFile:

(Remember that your HOME in HDFS is

    /user/YOURNAME: the full HDFS path of your copied file
    will be /user/YOURNAME/hdfsFile)
$HADOOP_HOME/bin/hadoop dfs -copyFromLocal inputFile hdfsFile
  • You can look at the content of your HDFS and
 check whether file you created exists:
$HADOOP_HOME/bin/hadoop dfs -ls
  • Call your mapper function on copiedHdfsFile and puts results in outputDirName;

this is just a map task, we don't need a reduce one because we don't want our data to be sorted:

$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/contrib/hadoop-0.15.3-streaming.jar
   -file yourScriptPath -mapper "yourScriptName"
   -jobconf mapred.reduce.tasks=0
   -input copiedHdfsFile -output outputDirName

If everything worked properly, your output would be placed in outputDirName, split into several files (one for each map task performed by Hadoop).

The output of dfs -ls outputDirName looks like:

Found 4 items
/user/YOURNAME/outputDirName/part-00000   <r 1>   SIZE CREATION_TIME
/user/YOURNAME/outputDirName/part-00001   <r 1>   ...  ...
/user/YOURNAME/outputDirName/part-00002   <r 1>   ...  ...
/user/YOURNAME/outputDirName/part-00003   <r 1>   ...  ...

To read and recombine these files you can use dfs -cat option (look at /project/piqasso/workspaces/tamberi/catter.sh)

To copy a single file from HDFS to the local filesystem do:

$HADOOP_HOME/bin/hadoop dfs -copyToLocal pathToHdfsFile pathToLocalFile 

To stop the Hadoop cluster:


Sample Session

In the output below, we use DIR to stand for /project/piqasso/workspaces/tamberi.


$ $HADOOP_HOME/bin/hadoop namenode -format
08/02/24 19:03:17 INFO dfs.NameNode: STARTUP_MSG: 
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = paleo.di.unipi.it/
STARTUP_MSG:   args = [-format]
08/02/24 19:03:18 INFO dfs.Storage: 
  Storage directory /tmp/hadoop-tamberi/dfs/name has been
  successfully formatted.
08/02/24 19:03:18 INFO dfs.NameNode: SHUTDOWN_MSG: 
SHUTDOWN_MSG: Shutting down NameNode at paleo.di.unipi.it/

Starting Hadoop:

$ $HADOOP_HOME/bin/start-all.sh
starting namenode, logging to
localhost: starting datanode, logging to
localhost: starting secondarynamenode, logging to
starting jobtracker, logging to
localhost: starting tasktracker, logging to

Copying local file to HDFS:

$ $HADOOP_HOME/bin/hadoop dfs -copyFromLocal $HOME/hadoopTest/wiki.xml wiki.xml

Listing HDFS content:

$ $HADOOP_HOME/bin/hadoop dfs -ls
Found 1 items
/user/tamberi/wiki.xml  <r 1>   1000000 2008-02-24 19:05

Launching job (/project/piqasso/workspaces/tamberi/allTools.sh is the simple mapper):

$ $HADOOP_HOME/bin/hadoop jar 
   -file DIR/allTools.sh
   -mapper "allTools.sh" -jobconf mapred.reduce.tasks=0
   -input /user/tamberi/wiki.xml -output wiki-output
packageJobJar: [/project/piqasso/workspaces/tamberi/allTools.sh,
   /tmp/hadoop-tamberi/hadoop-unjar45697/] []
   /tmp/streamjob45698.jar tmpDir=null
08/02/24 19:09:27 INFO mapred.FileInputFormat: Total input paths to process : 1
08/02/24 19:09:28 INFO streaming.StreamJob: getLocalDirs():
08/02/24 19:09:28 INFO streaming.StreamJob: Running job: job_200802241904_0001
08/02/24 19:09:28 INFO streaming.StreamJob: To kill this job, run:
08/02/24 19:09:28 INFO streaming.StreamJob: ~tamberi/hadoop/bin/../bin/hadoop job
    -Dmapred.job.tracker=localhost:55556 -kill job_200802241904_0001
08/02/24 19:09:28 INFO streaming.StreamJob: Tracking URL:
08/02/24 19:09:29 INFO streaming.StreamJob:  map 0%  reduce 0%
08/02/24 19:09:38 INFO streaming.StreamJob:  map 25%  reduce 0%
08/02/24 19:09:42 INFO streaming.StreamJob:  map 50%  reduce 0%
08/02/24 19:09:45 INFO streaming.StreamJob:  map 75%  reduce 0%
08/02/24 19:09:55 INFO streaming.StreamJob:  map 100%  reduce 0%
08/02/24 19:09:58 INFO streaming.StreamJob:  map 100%  reduce 100%
08/02/24 19:09:58 INFO streaming.StreamJob: Job complete: job_200802241904_0001
08/02/24 19:09:58 INFO streaming.StreamJob: Output: wiki-output

Listing output directory content:

$ $HADOOP_HOME/bin/hadoop dfs -ls wiki-output
Found 4 items
/user/tamberi/wiki-output/part-00000    <r 1>   118330  2008-02-24 19:09
/user/tamberi/wiki-output/part-00001    <r 1>   12056   2008-02-24 19:09
/user/tamberi/wiki-output/part-00002    <r 1>   26952   2008-02-24 19:09
/user/tamberi/wiki-output/part-00003    <r 1>   137785  2008-02-24 19:09

Merging output files into single one:

$ ../workspace/catter.sh wiki-output
file(s) merged into /home/medialab/tamberi/hadoop-out.txt

Stopping Hadoop:

$ $HADOOP_HOME/bin/stop-all.sh
stopping jobtracker
localhost: stopping tasktracker
stopping namenode
localhost: stopping datanode
localhost: stopping secondarynamenode


There's an annoying bug related to HDFS, copying file can throw this exception:

copyFromLocal: java.io.IOException: File filePath could only
be replicated to x nodes, instead of y

The only way to get around that seems to be this:

rm -Rf /tmp/hadoop-$USER*
$HADOOP_HOME/bin/hadoop namenode -format

Which erases the HDFS from disk and then recreates it.. Obviously you'll lose every HDFS data so be careful!