Working through the examples in Hadoop in Action (printed in 2010), I've found they are frustratingly out of date. That's the nature of the Hadoop project, since so much changes from version to version.

I wasn't able to find updated samples for the book, so I'm converting them myself for Hadoop 0.20.2-cdh3u4 (they will likely work on the 1.0.x releases as well).

The first map/reduce job, which inverts the patent citation associations, is:

/* src/main/java/com/deploymentzone/hadoop/MyJob.java */  
package com.deploymentzone.hadoop;

import java.io.IOException;

import org.apache.hadoop.conf.*;

import org.apache.hadoop.fs.Path;  
import org.apache.hadoop.io.*;

import org.apache.hadoop.mapreduce.Job;  
import org.apache.hadoop.mapreduce.Mapper;  
import org.apache.hadoop.mapreduce.Reducer;  
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;  
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;  
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;  
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

import org.apache.hadoop.thirdparty.guava.common.base.Splitter;  
import org.apache.hadoop.thirdparty.guava.common.collect.Iterables;  
import org.apache.hadoop.util.Tool;  
import org.apache.hadoop.util.ToolRunner;  
import org.apache.hadoop.util.GenericOptionsParser;

public class MyJob extends Configured implements Tool {

  // Mapper: takes a "citing,cited" line and emits (cited, citing)
  public static class Map extends Mapper<LongWritable,Text,Text,Text> {

    @Override
    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
      // split on the first comma only, leaving two fields: citing and cited
      Iterable<String> split = Splitter.on(',').limit(2).split(value.toString());

      // invert the association: key = cited patent, value = citing patent
      context.write(new Text(Iterables.getLast(split)), new Text(Iterables.getFirst(split, "")));
    }

  }

  // Reducer: joins every citing patent for a given cited patent into one comma-separated value
  public static class Reduce extends Reducer<Text,Text,Text,Text> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
      StringBuilder csv = new StringBuilder(16);
      for (Text value : values) {
        if (csv.length() > 0) csv.append(",");
        csv.append(value.toString());
      }
      String result = csv.toString();
      context.write(key, new Text(result));
    }

  }

  @Override
  public int run(String[] args) throws Exception {
    // use the configuration ToolRunner/GenericOptionsParser prepared for us
    Job job = new Job(getConf(), "MyJob");
    job.setJarByClass(MyJob.class);

    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);

    job.setMapperClass(Map.class);
    job.setReducerClass(Reduce.class);

    Path in = new Path(args[0]);
    Path out = new Path(args[1]);
    FileInputFormat.setInputPaths(job, in);
    FileOutputFormat.setOutputPath(job, out);

    // return a non-zero exit code if the job fails
    return job.waitForCompletion(true) ? 0 : 1;
  }

  public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
      int result = ToolRunner.run(conf, new MyJob(), otherArgs);
      System.exit(result);
   }
}
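To sanity-check the inversion logic outside of Hadoop, here is a minimal stand-alone sketch. It assumes each record in cite75_99.txt has the "citing,cited" layout the book describes, and it swaps plain String.split in for the Guava Splitter, so the class name and sample record are only illustrative:

/* InvertSketch.java -- hypothetical stand-alone check of the mapper's logic */
public class InvertSketch {
  public static void main(String[] args) {
    String record = "3858241,956203";           // hypothetical "citing,cited" line
    String[] fields = record.split(",", 2);     // same effect as Splitter.on(',').limit(2)
    String citing = fields[0];
    String cited = fields[1];
    // the mapper emits (cited, citing); the reducer then gathers every citing patent per cited patent
    System.out.println(cited + "\t" + citing);  // prints: 956203	3858241
  }
}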

The cite75_99.txt and apat63_99.txt ASCII data files come from the NBER U.S. Patent Data Files, distributed as zipped archives (kind of a pain to hunt down).

I have a Maven project set up on GitHub for this. If you clone it, you can run mvn package to get a *.jar in the target directory. Once you've unzipped the data files you can put them in HDFS:

hadoop fs -mkdir hdfs://localhost:8020/input  
hadoop fs -copyFromLocal ~/Downloads/cite75_99.txt hdfs://localhost:8020/input  
hadoop fs -mkdir hdfs://localhost:8020/output  

Then to run the MyJob map/reduce:

hadoop jar target/hadoop-first-example-1.0-SNAPSHOT.jar com.deploymentzone.hadoop.MyJob hdfs://localhost:8020/input/cite75_99.txt hdfs://localhost:8020/output/cite
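
When the job finishes you can spot-check the inverted output; the part file name below assumes the default single reducer:

hadoop fs -cat hdfs://localhost:8020/output/cite/part-r-00000 | head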