Big Data and Hadoop

April 21, 2014

I am very curious about big data and the tools and platforms used to process and aggregate it. After reading a few tutorials and installing Hadoop on my workstation, I decided to document the steps I took and the pains I experienced installing, configuring, and working with Hadoop, Hive, and HBase on Mac OS X Mountain Lion.

Installing Hadoop on Mac OS X Mountain Lion 10.8.5 64-Bit

Below are the steps to install and configure Hadoop 2.3 using Homebrew.

Hadoop Prerequisites

There are a few things you'll need before installing Hadoop.

  • Java Version 1.6.*
    Make sure you have Java 1.6+ installed. It's OK to have Java 1.7, but you'll need some extra configuration and tweaking. Ideally, you should have Java 1.6.* since that's what Hadoop uses. You can find out which version you're running with 'java -version' from the console.
    $ java -version
    java version "1.6.0_65"
    Java(TM) SE Runtime Environment (build 1.6.0_65-b14-462-11M4609)
    Java HotSpot(TM) 64-Bit Server VM (build 20.65-b04-462, mixed mode)
    					
  • Ruby & Homebrew
    Make sure you have the Ruby runtime installed, then install Homebrew - an awesome little package manager for OS X. To install Homebrew, open up a terminal and run the command below. Source: http://brew.sh/
    $ ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"
    					
  • SSH & Remote Login
    Hadoop requires SSH, so you'll need to enable it. First, make sure Remote Login under System Preferences -> Sharing is checked (this enables SSH on Mac OS X). Then generate a passwordless SSH key and authorize it (a quick sanity check of all three prerequisites follows this list):
    $ ssh-keygen -t rsa -P ""
    $ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
    					
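If you want to sanity-check all three prerequisites in one pass, the commands below should do it. This is a minimal sketch of my own, not part of the official setup; the exact output will vary by machine.

    $ java -version                       # should report 1.6 or later
    $ brew --version                      # confirms Homebrew is on your PATH
    $ sudo systemsetup -getremotelogin    # should print "Remote Login: On"
    $ ssh -o BatchMode=yes localhost true && echo "passwordless SSH works"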

Hadoop Installation

Homebrew (as of April 21, 2014) installs Hadoop v2.3 so the steps below are specific to that version of Hadoop.

  • SSH into localhost
    To begin the installation you'll need to SSH into your workstation. Open up a terminal and fire up the SSH client.
    $ ssh localhost
    					
  • Install Hadoop
    We're going to install Hadoop as a single-node cluster using Homebrew. To install a different version of Hadoop, navigate to the Hadoop project site, pick a version, and follow the manual installation instructions. (A quick version check follows this list.)
    $ brew install hadoop
    					
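Once Homebrew finishes, it's worth confirming which version actually landed. The hadoop version subcommand reports this; below is roughly what I'd expect for a 2.3.0 install (the build details on your machine will differ).

    $ hadoop version
    Hadoop 2.3.0
    ...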

Hadoop Configuration

Almost done... now let's configure Hadoop. Homebrew installs Hadoop to the /usr/local/Cellar/hadoop/2.3.0 directory. Once you've edited the files below, the verification sketch at the end of this section can confirm the settings took.

  • The important config files are:
    • hadoop-env.sh
    • mapred-env.sh
    • yarn-env.sh
    • core-site.xml
    • hdfs-site.xml
    • mapred-site.xml
    • yarn-site.xml
All of these files live in /usr/local/Cellar/hadoop/2.3.0/libexec/etc/hadoop.

  • hadoop-env.sh
    Thanks to Homebrew, most of the Hadoop configuration will already be in place; however, I had to tweak a few things to make Hadoop work. Specifically, I was getting the error "Unable to load realm info from SCDynamicStore" when trying to use Hadoop, so I changed the environment variables slightly in the Hadoop, MapReduce, and YARN environment config files. My changes included adding the following lines:
    • export JAVA_HOME="$(/usr/libexec/java_home)"
    • export HADOOP_OPTS="${HADOOP_OPTS} -Djava.security.krb5.realm= -Djava.security.krb5.kdc="
    • export HADOOP_OPTS="${HADOOP_OPTS} -Djava.security.krb5.conf=/dev/null"
    Note that I'm using Java 1.6.0_65. Below is my hadoop-env.sh configuration.
    # The java implementation to use.
    export JAVA_HOME="$(/usr/libexec/java_home)"
    #export JAVA_HOME=`/usr/libexec/java_home -v 1.6`
    
    # The jsvc implementation to use. Jsvc is required to run secure datanodes.
    #export JSVC_HOME=${JSVC_HOME}
    
    export HADOOP_CONF_DIR=${HADOOP_CONF_DIR:-"/etc/hadoop"}
    
    # Extra Java CLASSPATH elements.  Automatically insert capacity-scheduler.
    for f in $HADOOP_HOME/contrib/capacity-scheduler/*.jar; do
      if [ "$HADOOP_CLASSPATH" ]; then
        export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:$f
      else
        export HADOOP_CLASSPATH=$f
      fi
    done
    
    # The maximum amount of heap to use, in MB. Default is 1000.
    #export HADOOP_HEAPSIZE=
    #export HADOOP_NAMENODE_INIT_HEAPSIZE=""
    
    # Extra Java runtime options.  Empty by default.
    #export HADOOP_OPTS="$HADOOP_OPTS -Djava.net.preferIPv4Stack=true"
    export HADOOP_OPTS="${HADOOP_OPTS} -Djava.security.krb5.realm= -Djava.security.krb5.kdc="
    export HADOOP_OPTS="${HADOOP_OPTS} -Djava.security.krb5.conf=/dev/null"
    
    # Command specific options appended to HADOOP_OPTS when specified
    export HADOOP_NAMENODE_OPTS="-Dhadoop.security.logger=${HADOOP_SECURITY_LOGGER:-INFO,RFAS} -Dhdfs.audit.logger=${HDFS_AUDIT_LOGGER:-INFO,NullAppender} $HADOOP_NAMENODE_OPTS"
    export HADOOP_DATANODE_OPTS="-Dhadoop.security.logger=ERROR,RFAS $HADOOP_DATANODE_OPTS"
    
    export HADOOP_SECONDARYNAMENODE_OPTS="-Dhadoop.security.logger=${HADOOP_SECURITY_LOGGER:-INFO,RFAS} -Dhdfs.audit.logger=${HDFS_AUDIT_LOGGER:-INFO,NullAppender} $HADOOP_SECONDARYNAMENODE_OPTS"
    
    # The following applies to multiple commands (fs, dfs, fsck, distcp etc)
    export HADOOP_CLIENT_OPTS="-Xmx512m $HADOOP_CLIENT_OPTS"
    #HADOOP_JAVA_PLATFORM_OPTS="-XX:-UsePerfData $HADOOP_JAVA_PLATFORM_OPTS"
    
    # On secure datanodes, user to run the datanode as after dropping privileges
    export HADOOP_SECURE_DN_USER=${HADOOP_SECURE_DN_USER}
    
    # Where log files are stored.  $HADOOP_HOME/logs by default.
    #export HADOOP_LOG_DIR=${HADOOP_LOG_DIR}/$USER
    
    # Where log files are stored in the secure data environment.
    export HADOOP_SECURE_DN_LOG_DIR=${HADOOP_LOG_DIR}/${HADOOP_HDFS_USER}
    
    # The directory where pid files are stored. /tmp by default.
    # NOTE: this should be set to a directory that can only be written to by 
    #       the user that will run the hadoop daemons.  Otherwise there is the
    #       potential for a symlink attack.
    export HADOOP_PID_DIR=${HADOOP_PID_DIR}
    export HADOOP_SECURE_DN_PID_DIR=${HADOOP_PID_DIR}
    
    # A string representing this instance of hadoop. $USER by default.
    export HADOOP_IDENT_STRING=$USER
    					
  • mapred-env.sh
    I had to tweak the MapReduce environment configuration file as well. Specifically, I added the export JAVA_HOME="$(/usr/libexec/java_home)" line. Here's the full configuration:
    export JAVA_HOME="$(/usr/libexec/java_home)"
    
    export HADOOP_JOB_HISTORYSERVER_HEAPSIZE=1000
    
    export HADOOP_MAPRED_ROOT_LOGGER=INFO,RFA
    					
  • yarn-env.sh
    I also made changes to the YARN environment configuration file. I added the following line: YARN_OPTS="$YARN_OPTS -Djava.security.krb5.realm=OX.AC.UK -Djava.security.krb5.kdc=kdc0.ox.ac.uk:kdc1.ox.ac.uk" (the Oxford realm and KDC values are dummy entries from a commonly cited workaround for the SCDynamicStore error). Here's the full configuration:
    export HADOOP_YARN_USER=${HADOOP_YARN_USER:-yarn}
    
    # resolve links - $0 may be a softlink
    export YARN_CONF_DIR="${YARN_CONF_DIR:-$HADOOP_YARN_HOME/conf}"
    
    # some Java parameters
    export JAVA_HOME="$(/usr/libexec/java_home)"
    if [ "$JAVA_HOME" != "" ]; then
      #echo "run java in $JAVA_HOME"
      JAVA_HOME=$JAVA_HOME
    fi
      
    if [ "$JAVA_HOME" = "" ]; then
      echo "Error: JAVA_HOME is not set."
      exit 1
    fi
    
    JAVA=$JAVA_HOME/bin/java
    JAVA_HEAP_MAX=-Xmx1000m 
    
    # For setting YARN specific HEAP sizes please use this
    # Parameter and set appropriately
    # YARN_HEAPSIZE=1000
    
    # check envvars which might override default args
    if [ "$YARN_HEAPSIZE" != "" ]; then
      JAVA_HEAP_MAX="-Xmx""$YARN_HEAPSIZE""m"
    fi
    
    # so that filenames w/ spaces are handled correctly in loops below
    IFS=
    
    
    # default log directory & file
    if [ "$YARN_LOG_DIR" = "" ]; then
      YARN_LOG_DIR="$HADOOP_YARN_HOME/logs"
    fi
    if [ "$YARN_LOGFILE" = "" ]; then
      YARN_LOGFILE='yarn.log'
    fi
    
    # default policy file for service-level authorization
    if [ "$YARN_POLICYFILE" = "" ]; then
      YARN_POLICYFILE="hadoop-policy.xml"
    fi
    
    # restore ordinary behaviour
    unset IFS
    
    YARN_OPTS="$YARN_OPTS -Djava.security.krb5.realm=OX.AC.UK -Djava.security.krb5.kdc=kdc0.ox.ac.uk:kdc1.ox.ac.uk"
    YARN_OPTS="$YARN_OPTS -Dhadoop.log.dir=$YARN_LOG_DIR"
    YARN_OPTS="$YARN_OPTS -Dyarn.log.dir=$YARN_LOG_DIR"
    YARN_OPTS="$YARN_OPTS -Dhadoop.log.file=$YARN_LOGFILE"
    YARN_OPTS="$YARN_OPTS -Dyarn.log.file=$YARN_LOGFILE"
    YARN_OPTS="$YARN_OPTS -Dyarn.home.dir=$YARN_COMMON_HOME"
    YARN_OPTS="$YARN_OPTS -Dyarn.id.str=$YARN_IDENT_STRING"
    YARN_OPTS="$YARN_OPTS -Dhadoop.root.logger=${YARN_ROOT_LOGGER:-INFO,console}"
    YARN_OPTS="$YARN_OPTS -Dyarn.root.logger=${YARN_ROOT_LOGGER:-INFO,console}"
    if [ "x$JAVA_LIBRARY_PATH" != "x" ]; then
      YARN_OPTS="$YARN_OPTS -Djava.library.path=$JAVA_LIBRARY_PATH"
    fi  
    YARN_OPTS="$YARN_OPTS -Dyarn.policy.file=$YARN_POLICYFILE"
    					
  • core-site.xml
    By default this file is blank. The core-site.xml config file allows you to override any site-specific entries. Here's a sample file for development (uses localhost and pseudo-distributed mode); note that fs.default.name is the older, deprecated name for fs.defaultFS, but Hadoop 2.3 still honors it:
    <configuration>
      <property>
        <name>hadoop.tmp.dir</name>
        <value>/tmp/hadoop-${user.name}</value>
        <description>A base for other temporary directories.</description>
      </property>
      <property>
        <name>fs.default.name</name>
        <value>hdfs://localhost:9000</value>
      </property>
    </configuration>
    					
  • hdfs-site.xml
    By default this file is blank. The hdfs-site.xml config file allows you to override Hadoop Distributed File System (HDFS) parameters. Here's a sample file that sets the replication factor to 1, which is appropriate for a single-node cluster:
    <configuration>
      <property>
        <name>dfs.replication</name>
        <value>1</value>
      </property>
    </configuration>
    					
  • mapred-site.xml
    By default this file is blank. The mapred-site.xml config file allows you to configure the MapReduce job tracker host & port. Note that mapred.job.tracker is a Hadoop 1.x property; under Hadoop 2.x jobs go through YARN instead (or run in local mode when mapreduce.framework.name is left at its default of local). Here's a sample file:
    <configuration>
      <property>
        <name>mapred.job.tracker</name>
        <value>localhost:9001</value>
      </property>
    </configuration>
    					
  • yarn-site.xml
    By default this file is blank. The yarn-site.xml config file allows you to customize the YARN configuration; the full list of properties is documented in yarn-default.xml on the Apache Hadoop site. Here's a sample file (replace the $resourcemanager.full.hostname placeholders with your ResourceManager's hostname, e.g. localhost on a single-node setup):
    <configuration>
      <property>
        <name>yarn.resourcemanager.resource-tracker.address</name>
        <value>$resourcemanager.full.hostname:8025</value>
        <description>Enter your ResourceManager hostname.</description>
      </property>

      <property>
        <name>yarn.resourcemanager.scheduler.address</name>
        <value>$resourcemanager.full.hostname:8030</value>
        <description>Enter your ResourceManager hostname.</description>
      </property>

      <property>
        <name>yarn.resourcemanager.address</name>
        <value>$resourcemanager.full.hostname:8050</value>
        <description>Enter your ResourceManager hostname.</description>
      </property>

      <property>
        <name>yarn.resourcemanager.admin.address</name>
        <value>$resourcemanager.full.hostname:8041</value>
        <description>Enter your ResourceManager hostname.</description>
      </property>

      <property>
        <name>yarn.nodemanager.local-dirs</name>
        <value>/grid/hadoop/hdfs/yarn,/grid1/hadoop/hdfs/yarn</value>
        <description>Comma-separated list of paths. Use the list of directories from $YARN_LOCAL_DIR.
                     For example, /grid/hadoop/hdfs/yarn,/grid1/hadoop/hdfs/yarn.</description>
      </property>

      <property>
        <name>yarn.nodemanager.log-dirs</name>
        <value>/var/log/hadoop/yarn</value>
        <description>Use the list of directories from $YARN_LOG_DIR.
                     For example, /var/log/hadoop/yarn.</description>
      </property>
    </configuration>
    					
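Before starting any daemons, you can confirm the settings took effect. This is a quick sketch assuming the paths used throughout this post: /usr/libexec/java_home shows which JDK JAVA_HOME will resolve to, and hdfs getconf reads the XML config directly, so nothing needs to be running yet. The outputs below are from my machine and will differ on yours.

    $ /usr/libexec/java_home
    /System/Library/Java/JavaVirtualMachines/1.6.0.jdk/Contents/Home
    $ hdfs getconf -confKey fs.default.name
    hdfs://localhost:9000
    $ hdfs getconf -confKey dfs.replication
    1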

Running Hadoop

We should be ready to run Hadoop now.

  • Formatting HDFS
    Prior to using Hadoop you need to format HDFS to initialize the file & metadata storage. (In Hadoop 2.x the hadoop namenode -format form is deprecated in favor of hdfs namenode -format, but both still work.) If you're seeing the "Unable to load realm info from SCDynamicStore" error here, check your JAVA_HOME and YARN_OPTS entries in the config files above.
    $ hadoop namenode -format
    					
  • Running Hadoop
    OK, we're finally ready to run Hadoop. Execute the following commands (please note the start-all.sh command has been deprecated; the matching stop-dfs.sh and stop-yarn.sh scripts shut everything down again, as shown in the sketch after this list):
    $ cd /usr/local/Cellar/hadoop/2.3.0/libexec/sbin
    $ ./start-dfs.sh
    $ ./start-yarn.sh
    					
  • Verifying Hadoop
    You can use the jps (JVM Process Status) utility to verify that Hadoop is running:
    $ jps
    79942 Jps
    45971 NodeManager
    45877 ResourceManager
    50488 RunJar
    					
  • Testing Hadoop
    Hadoop ships with a few sample MapReduce jobs that you can run to check whether your Hadoop instance is up and accepting jobs. Here's a sample MapReduce job that estimates the value of pi (a wordcount example that also exercises HDFS follows this list):
    $ hadoop jar /usr/local/Cellar/hadoop/2.3.0/libexec/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.3.0.jar pi 2 5
    Wrote input for Map #0
    Wrote input for Map #1
    Starting Job
    ...
    Job Finished in 1.685 seconds
    Estimated value of Pi is 3.60000000000000000000
    					
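The examples jar also ships a classic wordcount job, which is a nice follow-up because it exercises HDFS reads and writes end to end. Here's a minimal sketch; the input file and directory names are placeholders I picked, not anything Hadoop requires.

    $ echo "hadoop is as hadoop does" > /tmp/words.txt
    $ hdfs dfs -mkdir -p /user/$USER/input        # relative paths resolve to /user/$USER
    $ hdfs dfs -put /tmp/words.txt /user/$USER/input/
    $ hadoop jar /usr/local/Cellar/hadoop/2.3.0/libexec/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.3.0.jar wordcount input output
    $ hdfs dfs -cat output/part-r-00000

When you're done experimenting, shut the daemons down in the reverse order you started them:

    $ cd /usr/local/Cellar/hadoop/2.3.0/libexec/sbin
    $ ./stop-yarn.sh
    $ ./stop-dfs.sh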

Monitoring Hadoop

Hadoop provides a few web-based tools for monitoring HDFS, MapReduce, and tasks in pseudo-distributed single-node mode (a terminal-based check follows this list):

  • HDFS Administrator: http://localhost:50070
  • ResourceManager Administrator: http://localhost:8088
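If you'd rather check from the terminal, a quick curl against each port (assuming the default ports above) tells you whether the UIs are serving; a 200 status means the daemon is up.

    $ curl -s -o /dev/null -w "%{http_code}\n" http://localhost:50070
    200
    $ curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8088
    200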

Useful Links

That's about it when it comes to Hadoop installation. Check out my other Hadoop tutorials on implementing MapReduce jobs and processing big data with Hive.

Email me if you have any questions or suggestions.