INSTALLING HADOOP ON MAC OS X LION

June 18, 2013, 18:25

Step 1: Installing Hadoop

brew install hadoop
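
To confirm the install worked (assuming Homebrew linked the hadoop binary onto your PATH), you can print the version:

hadoop version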

Step 2: Edit Configurations
Step 2.1: Add the following line to /usr/local/Cellar/hadoop/1.0.1/libexec/conf/hadoop-env.sh. This line is required to work around an error related to “SCDynamicStore”, specifically “Unable to load realm info from SCDynamicStore”.

export HADOOP_OPTS="-Djava.security.krb5.realm=OX.AC.UK -Djava.security.krb5.kdc=kdc0.ox.ac.uk:kdc1.ox.ac.uk"
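
One way to append that line from the terminal (a sketch, assuming the Homebrew path used throughout this guide):

echo 'export HADOOP_OPTS="-Djava.security.krb5.realm=OX.AC.UK -Djava.security.krb5.kdc=kdc0.ox.ac.uk:kdc1.ox.ac.uk"' >> /usr/local/Cellar/hadoop/1.0.1/libexec/conf/hadoop-env.sh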

Step 2.2: Add the following content to /usr/local/Cellar/hadoop/1.0.1/libexec/conf/core-site.xml. One key property is hadoop.tmp.dir. Note that we are placing the HDFS data in the current user’s home folder and naming it hadoop-store. You don’t need to create this folder; it will be created automatically in a later stage.

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/Users/${user.name}/hadoop-store</value>
        <description>A base for other temporary directories.</description>
    </property>
    <property>
        <name>fs.default.name</name>
        <value>hdfs://localhost:8020</value>
    </property>
</configuration>
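
Hadoop expands ${user.name} at runtime, so for a user named alice (a hypothetical example) the value resolves to /Users/alice/hadoop-store. After the namenode is formatted in Step 4, you can verify that the folder was created:

ls -d /Users/$USER/hadoop-store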

Step 2.3: Add the following content to /usr/local/Cellar/hadoop/1.0.1/libexec/conf/mapred-site.xml.

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
    <property>
      <name>mapred.job.tracker</name>
      <value>localhost:9001</value>
    </property>

    <property>
        <name>mapred.tasktracker.map.tasks.maximum</name>
        <value>2</value>
    </property>

    <property>
        <name>mapred.tasktracker.reduce.tasks.maximum</name>
        <value>2</value>
    </property>
</configuration>
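
Once the cluster is up (Step 4), the JobTracker web UI listed under Additional Notes reports the cluster’s map and reduce task capacity, so you can confirm that the two slot limits above took effect.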

Step 2.4: Add the following content to /usr/local/Cellar/hadoop/1.0.1/libexec/conf/hdfs-site.xml.

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
    <property>
      <name>dfs.replication</name>
      <value>1</value>
    </property>
</configuration>
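
Replication is set to 1 because this is a single-node cluster. Once HDFS is running (Step 4), one way to confirm the effective replication factor is fsck, which lists each file along with its replication:

hadoop fsck / -files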

Step 3: Enable SSH to localhost
Make sure that you have SSH private (~/.ssh/id_rsa) and public (~/.ssh/id_rsa.pub) keys already set up. If you are missing these two files, run the following command (thanks to Ryan Rosario for pointing this out). Instead of an RSA key, you can also use a DSA key (replace rsa with dsa in the command below); however, the instructions below assume an RSA key.

ssh-keygen -t rsa
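
If you prefer to generate the key non-interactively with an empty passphrase (a common choice for a local single-node setup, so that ssh localhost never prompts), a sketch:

ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa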

Step 3.1: Make sure that “Remote Login” is enabled in your system preferences. Go to “System Preferences” -> “Sharing” and verify that “Remote Login” is checked.

Step 3.2: From the terminal, run the following command. Make sure that authorized_keys has 0600 permissions (see Raj Bandyopadhay’s comment).

cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
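
To set the 0600 permission mentioned above:

chmod 0600 ~/.ssh/authorized_keys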

Step 3.3: Try logging in to localhost. If you get an error, remove (or rename) ~/.ssh/known_hosts and retry connecting to localhost.

ssh localhost
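
Instead of deleting the whole known_hosts file, you can remove just the stale localhost entry:

ssh-keygen -R localhost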

Step 4: Start and Test Hadoop

hadoop namenode -format
/usr/local/Cellar/hadoop/1.0.1/bin/start-all.sh
hadoop jar /usr/local/Cellar/hadoop/1.0.1/libexec/hadoop-examples-1.0.1.jar pi 10 100
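
The pi job above exercises MapReduce; as a quick smoke test of HDFS itself, you can also list the root of the new filesystem:

hadoop fs -ls /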

To make sure that all Hadoop processes have started, use the following command:

ps ax | grep hadoop | wc -l
# expected output is 6

There are 5 processes related to Hadoop; the sixth match counted above is the grep command itself. If you see fewer than 6 processes, check the log files, located at /usr/local/Cellar/hadoop/1.0.1/libexec/logs/*.log.
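
Alternatively, if a JDK is on your PATH, jps lists the five Hadoop daemons (NameNode, DataNode, SecondaryNameNode, JobTracker, TaskTracker) by name:

jps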

Additional Notes

Namenode info: http://localhost:50070/dfshealth.jsp
Jobtracker: http://localhost:50030/jobtracker.jsp
Start the hadoop cluster: /usr/local/Cellar/hadoop/1.0.1/bin/start-all.sh
Stop the hadoop cluster: /usr/local/Cellar/hadoop/1.0.1/bin/stop-all.sh
Verify Hadoop started properly: use ps ax | grep hadoop | wc -l and make sure you see 6 as output. There are 5 processes associated with Hadoop and one match from the grep command itself.