Hadoop Tutorial: Creating MapReduce Jobs in Java

April 24, 2014

This tutorial will go over the steps to create Hadoop applications in Java using Eclipse. My platform: Mac OS X Mountain Lion, Java 1.6, Hadoop 2.3, and Eclipse IDE (ADT release v22.3.0).

Prerequisites

There are a few things you'll need before you can start coding Hadoop MapReduce jobs.

  • Hadoop

    This goes without saying, but you do need a Hadoop environment up and running. I'm using Hadoop v2.3.

  • Hadoop Map/Reduce Job Overview

    If you're just starting out with Hadoop Map/Reduce applications, you should take a look at the documentation and try to understand how Hadoop Map/Reduce applications work with data.

    A Map/Reduce job usually splits the input data-set into independent chunks which are processed by the map tasks in a completely parallel manner. The framework sorts the outputs of the maps, which are then input to the reduce tasks. Typically both the input and the output of the job are stored in a file-system. The framework takes care of scheduling tasks, monitoring them and re-executes the failed tasks.

    Source: http://hadoop.apache.org/docs/r0.18.3/mapred_tutorial.html
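
    The data flow described above can be sketched in a few lines of plain Java, with no Hadoop involved (class and method names here are purely illustrative):

    ```java
    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;
    import java.util.TreeMap;

    // A toy, single-process illustration of the map -> sort/shuffle -> reduce
    // data flow, using word counting as the example job.
    public class MapReduceFlow {
      public static Map<String, Integer> wordCount(List<String> lines) {
        // Map phase: emit a (word, 1) pair for every token in the input.
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines)
          for (String token : line.toLowerCase().split("\\s+"))
            if (!token.isEmpty())
              pairs.add(Map.entry(token, 1));

        // Shuffle/sort phase: group values by key; a TreeMap keeps the keys
        // sorted, mirroring the framework's sorted map output.
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs)
          grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());

        // Reduce phase: sum the grouped values for each key.
        Map<String, Integer> counts = new TreeMap<>();
        grouped.forEach((word, ones) ->
            counts.put(word, ones.stream().mapToInt(Integer::intValue).sum()));
        return counts;
      }

      public static void main(String[] args) {
        System.out.println(wordCount(List.of("hello world!", "hello there!")));
        // prints: {hello=2, there!=1, world!=1}
      }
    }
    ```

    In real Hadoop the map and reduce phases run as separate, parallel tasks on different nodes; this sketch only shows how the data is transformed at each step.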

  • Hadoop Libraries

    In order to build Hadoop applications you'll need to reference the Hadoop Common and Client libraries, which may or may not have shipped with your version of Hadoop. If you used Homebrew to install Hadoop, you most likely won't have these files installed. You'll need the following jar files:

    • hadoop-common-2.3.0.jar
    • hadoop-mapreduce-client-core-2.0.2-alpha.jar
    • hadoop-test-1.2.1.jar
    • slf4j-api-1.7.7.jar
    You can download the above libraries for your version of Hadoop from the Maven Repository site. If you're using Hadoop v2.3, you can download the files from my site.

  • SLF4J Library

    You may not need this, but I was getting an error related to the Simple Logging Facade for Java (SLF4J) when running one of the jobs. It turns out there's a dependency on the SLF4J framework, so I had to include it in my build. The library (v1.7.7) can be downloaded from here.

Setup

We'll be using Eclipse and the two popular MapReduce jobs from the Hadoop: The Definitive Guide book: MaxTemperature and WordCount. Both applications will be part of the same Eclipse project but will have different build and input files. Full source is available for download.

  • Create Java Project

    Create a new Java project in Eclipse:

  • Reference Hadoop Libraries

    Add the Hadoop library jar file references (Project Properties -> Java Build Path -> Libraries -> Add External JARs)

  • Create Application Files: Mapper, Reducer, and Driver

    We'll need three files to run the WordCount Hadoop Map/Reduce job: Mapper, Reducer, and Driver.
    WordCount Mapper:
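
    The original code isn't reproduced here; a minimal sketch of the mapper, assuming the classic WordCount example and the older org.apache.hadoop.mapred API (consistent with the part-00000 output file shown later), might look like this:

    ```java
    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    // Tokenizes each input line and emits (word, 1) for every token.
    public class WordCountMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {

      private static final IntWritable ONE = new IntWritable(1);
      private final Text word = new Text();

      @Override
      public void map(LongWritable key, Text value,
          OutputCollector<Text, IntWritable> output, Reporter reporter)
          throws IOException {
        // Lowercase first, consistent with the lowercased sample output below.
        StringTokenizer tokens = new StringTokenizer(value.toString().toLowerCase());
        while (tokens.hasMoreTokens()) {
          word.set(tokens.nextToken());
          output.collect(word, ONE);
        }
      }
    }
    ```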

    WordCount Reducer:
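
    Along the same lines, a sketch of the reducer (again assuming the classic example and the older org.apache.hadoop.mapred API):

    ```java
    import java.io.IOException;
    import java.util.Iterator;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;

    // Sums the counts emitted by the mapper for each word.
    public class WordCountReducer extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {

      @Override
      public void reduce(Text key, Iterator<IntWritable> values,
          OutputCollector<Text, IntWritable> output, Reporter reporter)
          throws IOException {
        int sum = 0;
        while (values.hasNext()) {
          sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));
      }
    }
    ```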

    WordCount Driver:
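
    And a sketch of the driver, assuming mapper and reducer classes named WordCountMapper and WordCountReducer (illustrative names, not necessarily those in the downloadable source):

    ```java
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    // Configures the job and submits it; args[0] is the input path,
    // args[1] the (not yet existing) output directory.
    public class WordCount {
      public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(WordCount.class);
        conf.setJobName("wordcount");

        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);

        conf.setMapperClass(WordCountMapper.class);
        conf.setReducerClass(WordCountReducer.class);

        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        JobClient.runJob(conf);
      }
    }
    ```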

    Full source is available here.

  • Create Ant Buildfile

    I'm using Apache Ant to compile and build both projects (WordCount and MaxTemperature). In order to build your project using an Ant buildfile, you'll need to change the settings and define a new Java Builder (Project Properties -> Builders -> New -> Ant Builder -> Browse Workspace). The buildfile will create a build directory and generate a jar file for each project (two in total) that will be used to run the Hadoop applications.

    You can download the file from here.
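
    For reference, a minimal buildfile along these lines (the directory layout, target names, and lib/ jar location are assumptions, not the contents of the downloadable file) would compile the sources against the Hadoop jars and package them:

    ```xml
    <project name="WordCount" default="jar" basedir=".">
      <property name="src.dir" value="src"/>
      <property name="build.dir" value="build"/>
      <property name="classes.dir" value="${build.dir}/classes"/>
      <property name="jar.dir" value="${build.dir}/jar"/>

      <!-- The Hadoop jars listed under Prerequisites, assumed to be in lib/ -->
      <path id="hadoop.classpath">
        <fileset dir="lib" includes="*.jar"/>
      </path>

      <target name="compile">
        <mkdir dir="${classes.dir}"/>
        <javac srcdir="${src.dir}" destdir="${classes.dir}"
               classpathref="hadoop.classpath" includeantruntime="false"/>
      </target>

      <target name="jar" depends="compile">
        <mkdir dir="${jar.dir}"/>
        <jar destfile="${jar.dir}/WordCount.jar" basedir="${classes.dir}">
          <manifest>
            <attribute name="Main-Class" value="WordCount"/>
          </manifest>
        </jar>
      </target>
    </project>
    ```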

  • Input Files

    In order to run the Hadoop applications we'll need to supply input data. Create a new directory (input) under your project folder and place a few files there (e.g. sample.txt, test.txt, helloworld.txt). We'll use these files for the WordCount Map/Reduce job.

    The MaxTemperature Hadoop application will need the following input data (e.g. /input/sample.txt):

    0067011990999991950051507004+68750+023550FM-12+038299999V0203301N00671220001CN9999999N9+00001+99999999999
    0043011990999991950051512004+68750+023550FM-12+038299999V0203201N00671220001CN9999999N9+00221+99999999999
    0043011990999991950051518004+68750+023550FM-12+038299999V0203201N00261220001CN9999999N9-00111+99999999999
    0043012650999991949032412004+62300+010750FM-12+048599999V0202701N00461220001CN0500001N9+01111+99999999999
    0043012650999991949032418004+62300+010750FM-12+048599999V0202701N00461220001CN0500001N9+00781+99999999999
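
    Each line is a fixed-width NCDC record: the year sits at character offsets 15-19 and the signed air temperature (in tenths of a degree Celsius) at offsets 87-92, which is what the MaxTemperature job extracts. A standalone sketch of just the parsing step (class and method names here are illustrative):

    ```java
    public class NcdcParseDemo {
      // Pulls the year and air temperature out of one fixed-width NCDC record.
      // Returns { year, temperature in tenths of a degree Celsius }.
      static int[] parse(String line) {
        int year = Integer.parseInt(line.substring(15, 19));
        // A leading '+' would trip up Integer.parseInt on older JDKs, so skip it.
        int temp = line.charAt(87) == '+'
            ? Integer.parseInt(line.substring(88, 92))
            : Integer.parseInt(line.substring(87, 92));
        return new int[] { year, temp };
      }

      public static void main(String[] args) {
        String record =
            "0067011990999991950051507004+68750+023550FM-12+038299999"
            + "V0203301N00671220001CN9999999N9+00001+99999999999";
        int[] parsed = parse(record);
        System.out.println(parsed[0] + "\t" + parsed[1]); // prints: 1950	0
      }
    }
    ```

    The mapper emits (year, temperature) pairs parsed this way, and the reducer keeps the maximum temperature seen for each year.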

  • Create Runner Scripts

    To run the Map/Reduce jobs I created two quick shell scripts: runner-temperature.sh to run the MaxTemperature job and runner-wordcount.sh to run the WordCount application. The shell scripts clear out the output directory, run the Hadoop application with the required parameters, then display the contents of the input and output directories.
    MaxTemperature application script (runner-temperature.sh)

    rm -rf output
    /usr/local/bin/hadoop jar build/jar/Temperature.jar input/sample.txt output
    echo 'Job Input'
    echo '----------'
    echo ''
    cat input/sample.txt
    echo ''
    echo 'Job Output'
    echo '----------'
    cat output/part-r-00000
    WordCount application script (runner-wordcount.sh)
    rm -rf output
    /usr/local/bin/hadoop jar build/jar/WordCount.jar input/ output
    echo 'Job Input'
    echo '----------'
    echo ''
    ls -la input
    echo ''
    echo 'Job Output'
    echo '----------'
    cat output/part-00000

Running Map/Reduce Jobs

We're now ready to run the jobs. The source files (Mapper, Reducer, and Driver) for both projects are available for download.

  • Build Jar Files

    Build your project and make sure you have the jar files for both projects.

  • Start Hadoop

    Make sure Hadoop is up and running. To start HDFS and YARN you can run the HADOOP_HOME/libexec/sbin/start-dfs.sh and HADOOP_HOME/libexec/sbin/start-yarn.sh scripts (version 2.3).
    Verify that Hadoop is running:

    $ jps
    6061 ResourceManager
    45173 
    6155 NodeManager
    6233 Jps

  • Run WordCount Map/Reduce Job

    Let's run the WordCount job first. Using the runner script above (runner-wordcount.sh), we'll execute the job and then read the output from the application. If you're not using the runner script, make sure to remove the "output" directory prior to running the job. The WordCount Hadoop application counts the number of words in the supplied input files or directory. You should see the following output if everything went well.

    $ ./runner-wordcount.sh 
    14/04/24 16:44:02 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    14/04/24 16:44:02 INFO Configuration.deprecation: session.id is deprecated. Instead, use dfs.metrics.session-id
    14/04/24 16:44:02 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
    14/04/24 16:44:02 INFO jvm.JvmMetrics: Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized
    14/04/24 16:44:02 WARN mapreduce.JobSubmitter: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
    14/04/24 16:44:02 INFO mapred.FileInputFormat: Total input paths to process : 3
    ....
    Job Input
    ----------
    
    total 24
    drwxr-xr-x   5 glebp  Domain Users  170 Apr 16 17:38 .
    drwxr-xr-x  18 glebp  Domain Users  612 Apr 24 16:44 ..
    -rw-r--r--   1 glebp  Domain Users   12 Apr 16 14:31 helloworld.txt
    -rw-r--r--   1 glebp  Domain Users  529 Apr 16 10:44 sample.txt
    -rw-r--r--   1 glebp  Domain Users   12 Apr 16 14:32 test.txt
    
    Job Output
    ----------
    0043011990999991950051512004+68750+023550fm-12+038299999v0203201n00671220001cn9999999n9+00221+99999999999	1
    0043011990999991950051518004+68750+023550fm-12+038299999v0203201n00261220001cn9999999n9-00111+99999999999	1
    0043012650999991949032412004+62300+010750fm-12+048599999v0202701n00461220001cn0500001n9+01111+99999999999	1
    0043012650999991949032418004+62300+010750fm-12+048599999v0202701n00461220001cn0500001n9+00781+99999999999	1
    0067011990999991950051507004+68750+023550fm-12+038299999v0203301n00671220001cn9999999n9+00001+99999999999	1
    hello	2
    there!	1
    world!	1
    

  • Run MaxTemperature Map/Reduce Job

    Let's now run the MaxTemperature job. Using the runner script (runner-temperature.sh), we'll execute the job and then read the output from the application. If you're not using the runner script, make sure to remove the "output" directory prior to running the job. The MaxTemperature Hadoop application reads the input file (input/sample.txt), which contains 5 rows of NCDC weather data, and then calculates the maximum temperature for each year. You should see the following output if everything went well.

    $ ./runner-temperature.sh 
    14/04/24 16:51:26 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    14/04/24 16:51:26 INFO Configuration.deprecation: session.id is deprecated. Instead, use dfs.metrics.session-id
    14/04/24 16:51:26 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
    14/04/24 16:51:26 WARN mapreduce.JobSubmitter: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
    14/04/24 16:51:27 INFO input.FileInputFormat: Total input paths to process : 1
    ...
    Job Input
    ----------
    
    0067011990999991950051507004+68750+023550FM-12+038299999V0203301N00671220001CN9999999N9+00001+99999999999
    0043011990999991950051512004+68750+023550FM-12+038299999V0203201N00671220001CN9999999N9+00221+99999999999
    0043011990999991950051518004+68750+023550FM-12+038299999V0203201N00261220001CN9999999N9-00111+99999999999
    0043012650999991949032412004+62300+010750FM-12+048599999V0202701N00461220001CN0500001N9+01111+99999999999
    0043012650999991949032418004+62300+010750FM-12+048599999V0202701N00461220001CN0500001N9+00781+99999999999
    Job Output
    ----------
    1949	111
    1950	22

That's about it. Feel free to download and distribute the source code for these apps.

Email me () if you have any questions or suggestions.