Hadoop Streaming is intended for running an external tool over a distributed architecture.
The problem is that the tool to be run has usually not been written for Hadoop and therefore cannot handle <key>/<value> pairs.
For that reason, the usual approach in a text-processing scenario is to write an InputFormat that emits <text to be processed>/<empty value> pairs, do the text processing in the map phase, and skip the reduce phase. This yields correct processing, but the resulting output is necessarily shuffled: you get one file per mapper task and no information about the position each split occupied in the original file.
Our solution to this problem is a "hand-made" streaming setup that keeps control of the <key>/<value> pairs during the whole process:
- The InputFormat's next() method emits <offset in file>/<text to be processed> pairs;
- The Mapper launches the tool with an exec() call, passes the text split to the tool, and grabs the tool's stdout and stderr with two threads; the processed text is then re-coupled with the original <key>;
- The reduce phase (a single reduce task, so there is just one output file) sorts the pairs by <key>;
- The OutputFormat discards the <key> and writes only the <value>.
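The Mapper's exec step can be sketched in plain Java, outside of any Hadoop API. This is only a minimal illustration of the pattern, with `tr` standing in for the streaming tool (any command-line filter works); the two drain threads are what prevent the child process from blocking on a full stdout or stderr pipe buffer:

```java
import java.io.*;
import java.nio.charset.StandardCharsets;

public class StreamExec {
    // Runs an external tool, feeds it the split text on stdin, and drains
    // stdout/stderr on two separate threads to avoid pipe-buffer deadlock.
    static String runTool(String[] cmd, String splitText) throws Exception {
        Process p = new ProcessBuilder(cmd).start();
        StringBuilder out = new StringBuilder();
        StringBuilder err = new StringBuilder();
        Thread outDrain = drain(p.getInputStream(), out);
        Thread errDrain = drain(p.getErrorStream(), err);
        try (Writer w = new OutputStreamWriter(p.getOutputStream(), StandardCharsets.UTF_8)) {
            w.write(splitText);            // pass the text split to the tool
        }                                  // closing stdin signals EOF to the tool
        p.waitFor();
        outDrain.join();
        errDrain.join();
        return out.toString();             // stderr (in err) could go to task logs
    }

    private static Thread drain(InputStream in, StringBuilder sink) {
        Thread t = new Thread(() -> {
            try (Reader r = new InputStreamReader(in, StandardCharsets.UTF_8)) {
                int c;
                while ((c = r.read()) != -1) sink.append((char) c);
            } catch (IOException ignored) {}
        });
        t.start();
        return t;
    }

    public static void main(String[] args) throws Exception {
        long offset = 0L;                  // the original <key> from the InputFormat
        String processed = runTool(new String[]{"tr", "a-z", "A-Z"}, "hello world\n");
        System.out.print(offset + "\t" + processed); // re-couple key with processed value
    }
}
```

In a real Mapper the `offset` key comes in with each record, and the re-coupled pair is written to the output collector instead of stdout.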
This may be a little tricky and expensive (because of the exec() calls), but it allows a file to be processed on Hadoop by a streaming tool written in any language without shuffling the original file.
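The reduce and output steps amount to reordering by key and dropping it. A minimal stand-alone sketch (offsets and split contents here are hypothetical, and a TreeMap plays the role of the single reducer's sort):

```java
import java.util.*;

public class Reassemble {
    // Single reducer: sort <offset, processedText> pairs by offset (the key);
    // the OutputFormat then emits only the values, restoring file order.
    static String reassemble(Map<Long, String> mapperOutput) {
        StringBuilder sb = new StringBuilder();
        for (String value : new TreeMap<>(mapperOutput).values()) {
            sb.append(value);              // <key> is discarded, <value> only
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Pairs arrive out of order across mapper tasks.
        Map<Long, String> shuffled = new HashMap<>();
        shuffled.put(64L, "SECOND SPLIT\n");
        shuffled.put(0L, "FIRST SPLIT\n");
        System.out.print(reassemble(shuffled)); // splits come back in file order
    }
}
```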