Installing Hadoop in Pseudo-Distributed Mode

In this post we’ll see how to install Hadoop in pseudo-distributed mode (single node cluster). With the steps given here you will have Hadoop Common, HDFS, MapReduce and YARN installed.

The Hadoop release used for this installation is Hadoop 2.9.0 and it is installed on Ubuntu 16.04.

Modes in Hadoop

The steps given here install Hadoop in pseudo-distributed mode, but there are other modes too. Hadoop can be run in any of the following modes.

  1. Local (Standalone) mode – By default Hadoop is configured to run in a non-distributed mode. In this mode no daemons are running and Hadoop runs as a single Java process. This mode is easy to set up and useful for debugging.
  2. Pseudo-distributed mode – You can also run Hadoop on a single node in a pseudo-distributed mode. In this mode each daemon runs in a separate Java process. Pseudo-distributed mode mimics a cluster on a small scale as all the daemons run locally.
  3. Fully-distributed mode – In this mode Hadoop runs on a cluster of machines. A cluster may range from a few nodes to extremely large clusters with thousands of nodes.

Prerequisites for Hadoop installation

  1. Java must be installed. To check which Java versions are supported refer to https://wiki.apache.org/hadoop/HadoopJavaVersions
  2. ssh must be installed and sshd must be running.

Steps for Hadoop installation

You will have to perform the following steps in order to have a pseudo-distributed Hadoop installation.

  1. Make sure Java is installed.
  2. Download the Hadoop tarball.
  3. Install and configure SSH.
  4. Configure the XML files.
  5. Format the HDFS filesystem.
  6. Start the daemons.

Installing Hadoop

Let’s go through the steps now and make the required changes and configurations in order to install Hadoop in Pseudo-distributed mode.

1- Make sure Java is installed
Hadoop requires Java to be installed. If you are not sure whether Java is installed or not, check with the java -version command. If the output shows the version of the installed Java then you already have Java.
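If the command is not found you will need to install a JDK first. For example, on Ubuntu 16.04 you could install OpenJDK 8 (this particular package is just one option; any Java version supported by Hadoop will work):

sudo apt-get update
sudo apt-get install openjdk-8-jdk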

2- Download Hadoop

Download a stable version of Hadoop from here – http://hadoop.apache.org/releases.html

You can download the binary tarball from the given location.

Now you can decide whether you want to create a new user for Hadoop or use an existing user account. I am using an existing user account and just creating a directory for the Hadoop files.

Creating directory

You can create a new directory /usr/hadoop using the following command. That’s where you can keep your Hadoop installation files.

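For example (the ownership change is just so that your regular user account can write to the directory; adjust it to the account you are using):

sudo mkdir /usr/hadoop
sudo chown $USER:$USER /usr/hadoop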

Move and untar files

By default the Hadoop tarball will be downloaded to the Downloads directory; from there move it to /usr/hadoop and untar the installation files.

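For example, assuming the tarball was saved in your Downloads directory:

mv ~/Downloads/hadoop-2.9.0.tar.gz /usr/hadoop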

For unpacking the gzipped tar files, run the following commands.

cd /usr/hadoop
tar zxvf hadoop-2.9.0.tar.gz

At this point you should have a directory hadoop-x.x.x (hadoop-2.9.0 for the tarball I have used) at the location /usr/hadoop.

Setting paths

Hadoop needs to know which Java installation it has to use. For that, edit the file etc/hadoop/hadoop-env.sh to define the JAVA_HOME parameter as follows.

export JAVA_HOME=/usr/java/jdk1.8.0_151

You can also create an environment variable, HADOOP_HOME, to point to your Hadoop installation, and add the bin and sbin directories to the PATH. Run the following command to open the environment file.

sudo gedit /etc/environment

Then add the following to the already existing PATH variable

:/usr/hadoop/hadoop-2.9.0/bin:/usr/hadoop/hadoop-2.9.0/sbin

and add HADOOP_HOME environment variable at the end –

HADOOP_HOME="/usr/hadoop/hadoop-2.9.0"

Please make sure to change the path as per your Hadoop installation directory.

Scripts to run the daemons (start-dfs.sh and start-yarn.sh) are in the sbin directory. By appending it to the PATH you can execute these scripts from anywhere; you don’t need to go to $HADOOP_HOME/sbin every time you need to start the daemons.

Run the following command to reload the environment.

source /etc/environment

I prefer adding them to /etc/environment; if you want, you can add them to the ~/.bashrc file instead –

export HADOOP_HOME=/usr/hadoop/hadoop-2.9.0
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:$HADOOP_HOME/sbin

Just to ensure that everything is fine so far, you can run the hadoop version command. You should get output similar to that given here.

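For example (the first line of the output shows the installed version; the build details on the remaining lines will vary):

hadoop version
Hadoop 2.9.0
...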

3- Installing and configuring SSH

Even in pseudo-distributed mode Hadoop connects to the host and starts the daemon processes there, though in this mode the host is always localhost. For that it uses the ssh command, which is normally used to connect to a remote host. So we need to ensure that Hadoop can ssh to localhost and connect without entering a password.

To install ssh run the following command –

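On Ubuntu the ssh package pulls in both the ssh client and the sshd server:

sudo apt-get install ssh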

Now, to ensure that a password is not needed when logging in to the host, generate an SSH key with an empty passphrase.

ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa

Add the generated SSH key to the list of authorized keys.

cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys

Try to ssh to localhost now; you should not be asked to enter a password.

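For example (type exit to come back to your original shell session):

ssh localhost
exit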

4- Configuring XML files

You will need to change the configuration files placed inside the etc/hadoop directory of your Hadoop installation ($HADOOP_HOME/etc/hadoop).

core-site.xml

Add the following between the <configuration></configuration> tags. Here hadoop.tmp.dir is the path of the directory Hadoop uses for temporary files. Based on your user account privileges you may need to create the specified directory yourself and give it read and write permissions using the chmod command.
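The standard pseudo-distributed settings look like this; the NameNode port 9000 and the /usr/hadoop/tmp path are just typical choices, adjust them to your setup:

<property>
  <name>fs.defaultFS</name>
  <value>hdfs://localhost:9000</value>
</property>
<property>
  <name>hadoop.tmp.dir</name>
  <!-- assumed location; create this directory and give it read/write permissions -->
  <value>/usr/hadoop/tmp</value>
</property>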

hdfs-site.xml

Add the following between the <configuration></configuration> tags. This changes the replication factor to 1 as you have a single node cluster.
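A minimal sketch matching the standard single-node configuration:

<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>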

mapred-site.xml

Since we are going to use YARN to run MapReduce jobs, you need to add the following between the <configuration></configuration> tags.
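A minimal sketch (in Hadoop 2.x this file may not exist yet, in which case you can create it by copying etc/hadoop/mapred-site.xml.template):

<property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>
</property>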

yarn-site.xml

Add the following between the <configuration></configuration> tags.
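A minimal sketch that enables the MapReduce shuffle service for the NodeManager:

<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle</value>
</property>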

5- Formatting HDFS filesystem

You also need to format the filesystem once before HDFS can be used.
Run the following command for that-

hdfs namenode -format

6- Starting daemons

Run sbin/start-dfs.sh to start the NameNode, Secondary NameNode and DataNode daemons:

knpcode:sbin$ start-dfs.sh

Run sbin/start-yarn.sh to start the ResourceManager and NodeManager daemons.

knpcode:sbin$ start-yarn.sh

You can also use the start-all.sh script to start all the daemons with one script, though this script is deprecated.

You can use the jps command to verify that all the daemons are running.

knpcode:sbin$ jps

You should see the following five daemons running as Java processes: NameNode, DataNode, SecondaryNameNode, ResourceManager and NodeManager.
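Sample output (the process IDs will differ on your machine; jps lists itself as well):

2849 NameNode
2995 DataNode
3190 SecondaryNameNode
3349 ResourceManager
3480 NodeManager
3714 Jps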

If you want to write your first MapReduce program, refer to Word Count Program Using MapReduce in Hadoop.

Browse the web interface for the NameNode; by default it is available at-

  • NameNode – http://localhost:50070/


Browse the web interface for the ResourceManager; by default it is available at-

  • ResourceManager – http://localhost:8088/

Stopping the daemons

You can stop the running daemons by using the stop-dfs.sh and stop-yarn.sh scripts. There is also a stop-all.sh script to stop all the daemons with one script, though it is deprecated.

That’s all for the topic Installing Hadoop in Pseudo-Distributed Mode. If something is missing or you have something to share about the topic please write a comment.

