== Hand-made streaming ==
 
Hadoop's streaming facility is intended for running an existing tool over a distributed architecture.
   
The problem is that the tool to be run has usually not been written for Hadoop and therefore cannot handle <key>/<value> pairs.
   
For that reason, in a text-processing scenario the usual approach is to write an InputFormat that outputs <text to be processed>/<empty value> pairs, do the text processing in the map phase, and skip the reduce phase. The processing itself is correct, but the resulting output is necessarily shuffled: you get one file per mapper task, with no information about the position each split had in the original file.
   
Our solution to this problem is a "hand-made" streaming setup that keeps control of the <key>/<value> pairs during the whole process (each step is sketched in code after the list):
* The InputFormat's next() method outputs <offset in file>/<text to be processed> pairs;
* The Mapper runs the tool with an exec() call, passes the text split to the tool's stdin, and grabs the tool's stdout and stderr with two threads; the processed text is then re-coupled with the original <key>;
* The reduce phase (a single reducer, so that there is a single output file) sorts the pairs by <key>;
* The OutputFormat discards the <key> and writes the <value> only.
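
As a concrete illustration of the first step, here is a minimal sketch against the old org.apache.hadoop.mapred API (the one current for this page's Hadoop generation); the class name OffsetTextInputFormat is ours, not from this wiki. For line-oriented input, the stock LineRecordReader already keys each record by its byte offset, so next() outputs exactly the <offset in file>/<text to be processed> pairs we need:

<pre>
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.LineRecordReader;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

/** Emits <offset in file>/<text to be processed> pairs, one line per record. */
public class OffsetTextInputFormat extends FileInputFormat<LongWritable, Text> {

    public RecordReader<LongWritable, Text> getRecordReader(
            InputSplit split, JobConf job, Reporter reporter) throws IOException {
        // LineRecordReader already uses the byte offset of each line as the
        // key, which is exactly the <offset in file> needed to restore order.
        return new LineRecordReader(job, (FileSplit) split);
    }
}
</pre>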
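The mapper is where the hand-made streaming happens. A minimal sketch under the same assumptions (the tool path /usr/local/bin/mytool is hypothetical, and error handling is reduced to the bare minimum): it launches the external tool with exec(), starts two threads draining stdout and stderr so the child cannot block on a full pipe, feeds the text to the tool's stdin, and re-emits the processed text under the original offset key:

<pre>
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class ExecMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, LongWritable, Text> {

    // Hypothetical tool path: any stdin-to-stdout filter works here.
    private static final String[] TOOL = {"/usr/local/bin/mytool"};

    public void map(LongWritable offset, Text split,
                    OutputCollector<LongWritable, Text> out, Reporter reporter)
            throws IOException {
        Process p = Runtime.getRuntime().exec(TOOL);

        // Drain stdout and stderr in two threads so the child never blocks
        // on a full pipe buffer while we are still writing to its stdin.
        StringBuffer stdout = new StringBuffer();
        StringBuffer stderr = new StringBuffer();
        Thread outDrainer = drainer(p.getInputStream(), stdout);
        Thread errDrainer = drainer(p.getErrorStream(), stderr);

        // Pass the text split to the tool's stdin, then signal end of input.
        OutputStream stdin = p.getOutputStream();
        stdin.write(split.getBytes(), 0, split.getLength());
        stdin.close();

        try {
            outDrainer.join();
            errDrainer.join();
            p.waitFor();
        } catch (InterruptedException e) {
            throw new IOException("tool interrupted: " + e.getMessage());
        }

        if (stderr.length() > 0) {
            reporter.setStatus("tool stderr: " + stderr); // surface diagnostics
        }

        // Re-couple the processed text with the original <key> so that the
        // reduce phase can restore the original file order.
        out.collect(offset, new Text(stdout.toString()));
    }

    private static Thread drainer(final InputStream in, final StringBuffer sink) {
        Thread t = new Thread() {
            public void run() {
                try {
                    BufferedReader r = new BufferedReader(new InputStreamReader(in));
                    String line;
                    while ((line = r.readLine()) != null) {
                        sink.append(line).append('\n');
                    }
                } catch (IOException ignored) {
                }
            }
        };
        t.start();
        return t;
    }
}
</pre>

A real implementation would probably start the tool once per task (in configure()) rather than once per record, to amortize the exec() cost discussed below; the per-record version just keeps the sketch short.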
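For the last two steps, the stock IdentityReducer is enough on the reduce side (with a single reduce task, Hadoop's sort delivers the pairs ordered by <key>), so all that is left to write is an OutputFormat that drops the key. A sketch under the same assumptions (the class name is ours):

<pre>
import java.io.DataOutputStream;
import java.io.IOException;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordWriter;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.util.Progressable;

/** Writes the <value> only; the <key> has done its job once sorting is over. */
public class ValueOnlyOutputFormat extends FileOutputFormat<LongWritable, Text> {

    public RecordWriter<LongWritable, Text> getRecordWriter(
            FileSystem ignored, JobConf job, String name, Progressable progress)
            throws IOException {
        Path file = FileOutputFormat.getTaskOutputPath(job, name);
        FileSystem fs = file.getFileSystem(job);
        final DataOutputStream out = fs.create(file, progress);

        return new RecordWriter<LongWritable, Text>() {
            public void write(LongWritable key, Text value) throws IOException {
                out.write(value.getBytes(), 0, value.getLength()); // key dropped
            }
            public void close(Reporter reporter) throws IOException {
                out.close();
            }
        };
    }
}
</pre>

A driver wiring the pieces together might look like this (input and output paths taken from the command line):

<pre>
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.IdentityReducer;

public class HandMadeStreamingJob {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(HandMadeStreamingJob.class);
        conf.setJobName("hand-made-streaming");

        conf.setInputFormat(OffsetTextInputFormat.class);
        conf.setMapperClass(ExecMapper.class);
        conf.setReducerClass(IdentityReducer.class); // sorting by <key> is enough
        conf.setNumReduceTasks(1);                   // one reducer => one output file
        conf.setOutputFormat(ValueOnlyOutputFormat.class);
        conf.setOutputKeyClass(LongWritable.class);
        conf.setOutputValueClass(Text.class);

        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);
    }
}
</pre>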
   
This can be a little tricky and expensive (because of the exec() calls), but it lets you process a file over Hadoop with a streaming tool written in any language without shuffling the original file.
   
 
== Other resources ==
 