Big Data and Hadoop

April 21, 2014

I am very curious about big data and the tools and platforms for big data processing and aggregation. After reading a few tutorials and installing Hadoop on my workstation, I decided to document the steps that I took and the pains I experienced installing, configuring, and working with Hadoop, Hive, and Hbase on Mac OS X Mountain Lion.

Installing Hadoop on Mac OS X Mountain Lion 10.8.5 64-Bit

Below are the steps to install and configure Hadoop 2.3 using Homebrew.

Hadoop Prerequisites

There are a few things you'll need before installing Hadoop.

  • Java Version 1.6.*
    Make sure you have Java 1.6+ installed. It's OK to have Java 1.7, but you'll need some extra configuration and tweaking. Ideally you should have Java v1.6.* since that's what Hadoop targets. You can find out the version of your Java software by running 'java -version' from the console.
    $ java -version
    java version "1.6.0_65"
    Java(TM) SE Runtime Environment (build 1.6.0_65-b14-462-11M4609)
    Java HotSpot(TM) 64-Bit Server VM (build 20.65-b04-462, mixed mode)
  • Ruby & Homebrew
    Make sure you have the Ruby runtime installed, then install Homebrew - an awesome little package manager for OS X. To install Homebrew, open up the console and run the one-line Ruby installer from the Homebrew home page (brew.sh):
    $ ruby -e "$(curl -fsSL <installer script URL from brew.sh>)"
  • SSH & Remote Login
    Hadoop requires SSH so you'll need to enable it. First, make sure Remote Login under System Preferences -> Sharing is checked (this enables SSH on Mac OS X). Then generate and authorize SSH keys:
    $ ssh-keygen -t rsa -P ""
    $ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
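The two key-setup commands above can be sketched as a small idempotent helper - this is my own convenience wrapper, not part of Hadoop or OS X; it only assumes the stock ssh-keygen tool:

```shell
# generate an RSA key pair at the given path (if one doesn't already exist)
# and authorize it for passwordless login by appending the public half to
# authorized_keys in the same directory
setup_ssh_key() {
  keyfile="$1"                      # e.g. $HOME/.ssh/id_rsa
  if [ ! -f "$keyfile" ]; then
    ssh-keygen -q -t rsa -P "" -f "$keyfile"
  fi
  cat "$keyfile.pub" >> "$(dirname "$keyfile")/authorized_keys"
}
```

After running it against $HOME/.ssh/id_rsa, 'ssh localhost' should no longer prompt for a password (assuming Remote Login is enabled).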

Hadoop Installation

Homebrew (as of April 21, 2014) installs Hadoop v2.3 so the steps below are specific to that version of Hadoop.

  • SSH into localhost
    To begin the installation you'll need to SSH into your workstation. Open up the console and fire up the SSH client.
    $ ssh localhost
  • Install Hadoop
    We're going to install Hadoop in a single-node cluster using Homebrew. To install a different version of Hadoop, navigate to the Hadoop project site and pick a version to install then follow the manual installation instructions.
    $ brew install hadoop
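Before moving on to configuration, it's worth confirming where Homebrew put everything. Here's a small sketch (my own helper, not part of Hadoop or Homebrew) that echoes the config directory for a given Cellar prefix and fails loudly if the layout isn't what this tutorial assumes:

```shell
# print the Hadoop config directory under a Homebrew Cellar prefix;
# for Hadoop 2.3 the layout is <prefix>/libexec/etc/hadoop
hadoop_conf_dir() {
  prefix="$1"                       # e.g. /usr/local/Cellar/hadoop/2.3.0
  dir="$prefix/libexec/etc/hadoop"
  if [ -d "$dir" ]; then
    echo "$dir"
  else
    echo "config dir not found under $prefix" >&2
    return 1
  fi
}
```

You can also simply run 'hadoop version' after the install to confirm the hadoop binary is on your PATH.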

Hadoop Configuration

Now let's configure Hadoop. Homebrew installs Hadoop into the /usr/local/Cellar/hadoop/2.3.0 directory.

  • The important config files are:
    • core-site.xml
    • hdfs-site.xml
    • mapred-site.xml
    • yarn-site.xml
    All four files live in /usr/local/Cellar/hadoop/2.3.0/libexec/etc/hadoop.

    Thanks to Homebrew you'll have most of the Hadoop configuration already in place; however, I had to tweak a few things to make Hadoop work. Specifically, I was getting the error "Unable to load realm info from SCDynamicStore" when trying to use Hadoop, so I ended up changing the environment variables slightly in the Hadoop, MapReduce, and YARN environment config files (hadoop-env.sh, mapred-env.sh, and yarn-env.sh). My changes included adding the following lines:
    • export JAVA_HOME="$(/usr/libexec/java_home)"
    • export HADOOP_OPTS="${HADOOP_OPTS} -Djava.security.krb5.realm="
    • export HADOOP_OPTS="${HADOOP_OPTS} -Djava.security.krb5.kdc="
    Note that I'm using Java 1.6.0_65. Below is my hadoop-env.sh configuration.
    # The java implementation to use.
    export JAVA_HOME="$(/usr/libexec/java_home)"
    #export JAVA_HOME=`/usr/libexec/java_home -v 1.6`
    # The jsvc implementation to use. Jsvc is required to run secure datanodes.
    #export JSVC_HOME=${JSVC_HOME}
    export HADOOP_CONF_DIR=${HADOOP_CONF_DIR:-"/etc/hadoop"}
    # Extra Java CLASSPATH elements.  Automatically insert capacity-scheduler.
    for f in $HADOOP_HOME/contrib/capacity-scheduler/*.jar; do
      if [ "$HADOOP_CLASSPATH" ]; then
        export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:$f
      else
        export HADOOP_CLASSPATH=$f
      fi
    done
    # The maximum amount of heap to use, in MB. Default is 1000.
    #export HADOOP_HEAPSIZE=
    # Extra Java runtime options.  Empty by default.
    export HADOOP_OPTS="${HADOOP_OPTS} -Djava.security.krb5.realm= -Djava.security.krb5.kdc="
    # Where log files are stored.  $HADOOP_HOME/logs by default.
    #export HADOOP_LOG_DIR=${HADOOP_LOG_DIR}/$USER
    # The directory where pid files are stored. /tmp by default.
    # NOTE: this should be set to a directory that can only be written to by
    #       the user that will run the hadoop daemons.  Otherwise there is the
    #       potential for a symlink attack.
    #export HADOOP_PID_DIR=${HADOOP_PID_DIR}
    # A string representing this instance of hadoop. $USER by default.
    export HADOOP_IDENT_STRING=$USER
    I also had to tweak the MapReduce environment configuration file (mapred-env.sh). Specifically, I added the export JAVA_HOME="$(/usr/libexec/java_home)" line. Here's the relevant line:
    export JAVA_HOME="$(/usr/libexec/java_home)"
    I also made changes to the YARN environment configuration file (yarn-env.sh). I added the JAVA_HOME export and appended the same Kerberos flags to YARN_OPTS: YARN_OPTS="$YARN_OPTS -Djava.security.krb5.realm= -Djava.security.krb5.kdc=". Here's the relevant part of the configuration:
    # some Java parameters
    export JAVA_HOME="$(/usr/libexec/java_home)"
    if [ "$JAVA_HOME" != "" ]; then
      #echo "run java in $JAVA_HOME"
      JAVA_HOME=$JAVA_HOME
    fi
    if [ "$JAVA_HOME" = "" ]; then
      echo "Error: JAVA_HOME is not set."
      exit 1
    fi
    # For setting YARN specific HEAP sizes please use this
    # Parameter and set appropriately
    # YARN_HEAPSIZE=1000
    # check envvars which might override default args
    if [ "$YARN_HEAPSIZE" != "" ]; then
      JAVA_HEAP_MAX="-Xmx""$YARN_HEAPSIZE""m"
    fi
    # so that filenames w/ spaces are handled correctly in loops below
    IFS=
    # default log directory & file
    if [ "$YARN_LOG_DIR" = "" ]; then
      YARN_LOG_DIR="$HADOOP_YARN_HOME/logs"
    fi
    if [ "$YARN_LOGFILE" = "" ]; then
      YARN_LOGFILE='yarn.log'
    fi
    # default policy file for service-level authorization
    if [ "$YARN_POLICYFILE" = "" ]; then
      YARN_POLICYFILE="hadoop-policy.xml"
    fi
    # restore ordinary behaviour
    unset IFS
    YARN_OPTS="$YARN_OPTS -Dhadoop.log.dir=$YARN_LOG_DIR"
    YARN_OPTS="$YARN_OPTS -Dyarn.log.dir=$YARN_LOG_DIR"
    YARN_OPTS="$YARN_OPTS -Dhadoop.log.file=$YARN_LOGFILE"
    YARN_OPTS="$YARN_OPTS -Dyarn.log.file=$YARN_LOGFILE"
    YARN_OPTS="$YARN_OPTS -Dhadoop.root.logger=${YARN_ROOT_LOGGER:-INFO,console}"
    YARN_OPTS="$YARN_OPTS -Dyarn.root.logger=${YARN_ROOT_LOGGER:-INFO,console}"
    if [ "x$JAVA_LIBRARY_PATH" != "x" ]; then
      YARN_OPTS="$YARN_OPTS -Djava.library.path=$JAVA_LIBRARY_PATH"
    fi
    YARN_OPTS="$YARN_OPTS -Dyarn.policy.file=$YARN_POLICYFILE"
    YARN_OPTS="$YARN_OPTS -Djava.security.krb5.realm= -Djava.security.krb5.kdc="
  • core-site.xml
    By default this file is blank. The core-site.xml config file allows you to override site-specific entries, such as the default file system and the base temporary directory. Here's a sample file for development (localhost, pseudo-distributed mode; the hadoop.tmp.dir value is just an example - point it at any directory your user can write to):
    <configuration>
      <property>
        <name>hadoop.tmp.dir</name>
        <value>/usr/local/Cellar/hadoop/hdfs/tmp</value>
        <description>A base for other temporary directories.</description>
      </property>
      <property>
        <name>fs.default.name</name>
        <value>hdfs://localhost:9000</value>
      </property>
    </configuration>
  • hdfs-site.xml
    By default this file is blank. The hdfs-site.xml config file allows you to override Hadoop Distributed File System parameters. For a single-node setup the main thing to change is the replication factor, which should be 1:
    <configuration>
      <property>
        <name>dfs.replication</name>
        <value>1</value>
      </property>
    </configuration>
  • mapred-site.xml
    By default this file is blank. The mapred-site.xml config file allows you to configure the MapReduce job tracker host & port. Here's a sample file (the host:port value is just an example for a local setup):
    <configuration>
      <property>
        <name>mapred.job.tracker</name>
        <value>localhost:9010</value>
      </property>
    </configuration>
  • yarn-site.xml
    By default this file is blank. The yarn-site.xml config file allows you to customize the YARN configuration; the full list of parameters is documented on the Apache Hadoop project site. Here's a sample file (the port numbers and paths are examples for a local setup):
    <configuration>
      <property>
        <name>yarn.resourcemanager.resource-tracker.address</name>
        <value>localhost:8025</value>
        <description>Enter your ResourceManager hostname.</description>
      </property>
      <property>
        <name>yarn.resourcemanager.scheduler.address</name>
        <value>localhost:8030</value>
        <description>Enter your ResourceManager hostname.</description>
      </property>
      <property>
        <name>yarn.resourcemanager.address</name>
        <value>localhost:8050</value>
        <description>Enter your ResourceManager hostname.</description>
      </property>
      <property>
        <name>yarn.resourcemanager.admin.address</name>
        <value>localhost:8141</value>
        <description>Enter your ResourceManager hostname.</description>
      </property>
      <property>
        <name>yarn.nodemanager.local-dirs</name>
        <value>/usr/local/Cellar/hadoop/hdfs/yarn/local</value>
        <description>Comma separated list of paths. Use the list of directories from $YARN_LOCAL_DIR.
                     For example, /grid/hadoop/hdfs/yarn,/grid1/hadoop/hdfs/yarn.</description>
      </property>
      <property>
        <name>yarn.nodemanager.log-dirs</name>
        <value>/usr/local/Cellar/hadoop/hdfs/yarn/log</value>
        <description>Use the list of directories from $YARN_LOG_DIR.
                     For example, /var/log/hadoop/yarn.</description>
      </property>
    </configuration>
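A stray or unclosed tag in any of these four files produces fairly cryptic startup errors, so it's worth sanity-checking that each file is well-formed XML before starting the daemons. Here's a sketch of such a check - my own helper, not part of Hadoop; it assumes a python3 interpreter is available to do the XML parse:

```shell
# report whether each of the four Hadoop config files in a directory
# parses as well-formed XML
check_hadoop_conf() {
  conf_dir="$1"   # e.g. /usr/local/Cellar/hadoop/2.3.0/libexec/etc/hadoop
  for f in core-site.xml hdfs-site.xml mapred-site.xml yarn-site.xml; do
    if python3 -c 'import sys, xml.dom.minidom as x; x.parse(sys.argv[1])' \
         "$conf_dir/$f" 2>/dev/null; then
      echo "$f: OK"
    else
      echo "$f: missing or malformed"
    fi
  done
}
```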

Running Hadoop

We should be ready to run Hadoop now.

  • Formatting HDFS
    Prior to using Hadoop you need to format HDFS to initialize the file & metadata storage. If you're seeing the "Unable to load realm info from SCDynamicStore" error, check your JAVA_HOME, HADOOP_OPTS, and YARN_OPTS entries in the config files above.
    $ hadoop namenode -format
    (In Hadoop 2.x this form prints a deprecation warning; the newer equivalent is 'hdfs namenode -format', but both work.)
  • Running Hadoop
    Ok, we're finally ready to run Hadoop. Execute the following commands (note that the old all-in-one start-all.sh script has been deprecated in favor of the two scripts below):
    $ cd /usr/local/Cellar/hadoop/2.3.0/libexec/sbin
    $ ./start-dfs.sh
    $ ./start-yarn.sh
  • Verifying Hadoop
    You can use the jps (java process monitor) utility to verify that Hadoop is running:
    $ jps
    79942 Jps
    45971 NodeManager
    45877 ResourceManager
    50488 RunJar
  • Testing Hadoop
    Hadoop ships with a few sample MapReduce jobs that you can run to see whether your Hadoop instance is running and can accept jobs. Here's a sample MapReduce job that estimates the value of Pi:
    $ hadoop jar /usr/local/Cellar/hadoop/2.3.0/libexec/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.3.0.jar pi 2 5
    Wrote input for Map #0
    Wrote input for Map #1
    Starting Job
    Job Finished in 1.685 seconds
    Estimated value of Pi is 3.60000000000000000000
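Another common smoke test is the wordcount job from the same examples jar. Since the distributed run needs HDFS up, those commands are shown as comments below; the helper underneath is a plain-shell sketch of my own (not part of Hadoop) of what wordcount computes, handy for checking expected output on a tiny input:

```shell
# the distributed version (same examples jar as the pi job above):
#   hadoop fs -put input.txt in
#   hadoop jar hadoop-mapreduce-examples-2.3.0.jar wordcount in out
#   hadoop fs -cat 'out/part-r-*'
# local equivalent of the computation, for small inputs:
wordcount() {
  awk '{ for (i = 1; i <= NF; i++) n[$i]++ }
       END { for (w in n) print w "\t" n[w] }' "$1"
}
```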

Monitoring Hadoop

Hadoop provides a few web-based tools to monitor HDFS, MapReduce, and Tasks in a pseudo-distributed single node mode:

  • HDFS Administrator: http://localhost:50070
  • ResourceManager Administrator: http://localhost:8088
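If you prefer checking from the console, both UIs can be probed with curl. This is a small sketch of my own (the port numbers are the ones listed above):

```shell
# probe a web UI and report whether it responds within 5 seconds
check_ui() {
  name="$1"; url="$2"
  if curl -s -o /dev/null --max-time 5 "$url"; then
    echo "$name: up ($url)"
  else
    echo "$name: not responding ($url)"
    return 1
  fi
}
check_ui "HDFS NameNode" "http://localhost:50070" || true
check_ui "ResourceManager" "http://localhost:8088" || true
```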

Useful Links

That's about it when it comes to Hadoop installation. Check out my other Hadoop tutorials on implementing MapReduce jobs and processing big data with Hive.

Email me if you have any questions or suggestions.