In this post we’ll see how to install Hadoop in pseudo-distributed mode (a single-node cluster). With the steps given here you will have Hadoop Common, HDFS, MapReduce and YARN installed.
The Hadoop release used for this installation is Hadoop 2.9.0, and it is installed on Ubuntu 16.04.
Modes in Hadoop
The steps given here install Hadoop in pseudo-distributed mode, but there are other modes too. Hadoop can be run in any of the following modes.
- Local (Standalone) mode – By default Hadoop is configured to run in a non-distributed mode. In this mode no daemons run and Hadoop runs as a single Java process. This mode is easy to set up and useful for debugging.
- Pseudo-Distributed mode – You can also run Hadoop on a single node in pseudo-distributed mode. In this mode each daemon runs in a separate Java process. Pseudo-distributed mode mimics a cluster on a small scale, as all the daemons run locally.
- Fully-Distributed mode – In this mode Hadoop runs on a cluster of machines. A cluster may range from a few nodes to thousands of nodes.
Prerequisites for Hadoop installation
- Java must be installed. For the supported Java versions, refer to https://wiki.apache.org/hadoop/HadoopJavaVersions
- ssh must be installed and sshd must be running.
Steps for Hadoop installation
You will have to perform the following steps to get a pseudo-distributed Hadoop installation.
- Make sure Java is installed.
- Download the Hadoop tarball.
- Install and configure SSH.
- Configure the XML files.
- Format the HDFS filesystem.
- Start the daemons.
Installing Hadoop on a single node
Let’s go through the steps now and make the required changes and configurations in order to install Hadoop in Pseudo-distributed mode.
1- Make sure Java is installed
Hadoop requires Java to be installed. If you are not sure whether Java is installed, check with the java -version command. If it prints the version of the installed Java, you already have what you need.
- Refer to https://knpcode.com/java/installation-java/installing-java-in-ubuntu/ to see how to install Java in Ubuntu.
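The check above can be scripted. This is only a minimal sketch: the java_status variable is a name made up for this example, and the script reports whether a java binary is on PATH, not whether its version is one Hadoop supports.

```shell
# Report whether a Java runtime is reachable on PATH.
# java -version writes to stderr, so redirect it to stdout to show the first line.
if command -v java >/dev/null 2>&1; then
    java_status="found"
    java -version 2>&1 | head -n 1
else
    java_status="missing"
    echo "java not found on PATH; install a JDK before continuing"
fi
```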
2- Download Hadoop
Download a stable version of Hadoop from http://hadoop.apache.org/releases.html
You can download the binary tarball from that location.
Now decide whether you want to create a new user for Hadoop or use an existing user account. I am using my existing user account and just creating a directory for the Hadoop files.
You can create a new directory /usr/hadoop, for example with sudo mkdir /usr/hadoop. That’s where you can keep your Hadoop installation files.
Move and untar files
By default the Hadoop tarball is downloaded to the Downloads directory. Move it from there to /usr/hadoop and untar the installation files.
To unpack the gzipped tar file, run the following commands.
cd /usr/hadoop
tar zxvf hadoop-2.9.0.tar.gz
At this point you should have a directory hadoop-x.x.x (hadoop-2.9.0 for the tarball I have used) under /usr/hadoop.
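The move-and-untar step can also be wrapped in a small function. This is a sketch only: unpack_hadoop is a hypothetical helper name, and it assumes you have write access to the target directory (for /usr/hadoop you would normally need sudo or a directory owned by your user).

```shell
# Hypothetical helper: unpack a Hadoop tarball into an install directory.
# $1 = path to the downloaded tarball, $2 = target directory (e.g. /usr/hadoop)
unpack_hadoop() {
    tarball="$1"
    target="$2"
    mkdir -p "$target" || return 1                 # may need sudo for /usr/hadoop
    tar -xzf "$tarball" -C "$target" || return 1   # -C extracts into the target directory
    ls "$target"                                   # should list hadoop-x.x.x
}

# Example call, assuming the tarball is in ~/Downloads and /usr/hadoop is writable:
# unpack_hadoop "$HOME/Downloads/hadoop-2.9.0.tar.gz" /usr/hadoop
```

Using tar with -C avoids having to move the tarball first; the result is the same hadoop-2.9.0 directory under /usr/hadoop.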
Hadoop needs to know which Java installation it has to use. For that, edit the file etc/hadoop/hadoop-env.sh (inside the Hadoop installation directory) and set the JAVA_HOME parameter to the location of your Java installation.
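If you are not sure where Java lives, one way to find a candidate value for JAVA_HOME is to follow the java symlink. The sketch below is an assumption-laden example: guess_java_home is a hypothetical helper, and the path in the final comment is only a typical Ubuntu location, not necessarily yours. With some Java 8 packages the result may end in /jre.

```shell
# Hypothetical helper: derive a JAVA_HOME candidate from the java binary.
# $1 = path to a java binary (defaults to the one found on PATH)
guess_java_home() {
    java_bin="${1:-$(command -v java)}"
    [ -n "$java_bin" ] || return 1
    # readlink -f follows the whole symlink chain to the real binary.
    real_bin=$(readlink -f "$java_bin") || return 1
    # Strip the trailing /bin/java to get the installation directory.
    echo "${real_bin%/bin/java}"
}

# The printed value would then go into etc/hadoop/hadoop-env.sh, for example:
# export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64   # assumed path, yours may differ
```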
You can also create an environment variable that points to your Hadoop installation, conventionally named HADOOP_HOME, and add the bin and sbin directories to the PATH. Run the following command to open the environment file.
sudo gedit /etc/environment
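Then append the following to the value of the already existing PATH variable:
:/usr/hadoop/hadoop-2.9.0/bin:/usr/hadoop/hadoop-2.9.0/sbin
and add the HADOOP_HOME environment variable at the end:
HADOOP_HOME="/usr/hadoop/hadoop-2.9.0"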
Please make sure to change the path as per your Hadoop installation directory.
The scripts that run the daemons (start-dfs.sh and start-yarn.sh) are in the sbin directory. With sbin on the PATH you can execute these scripts from anywhere, so you don’t need to go to $HADOOP_HOME/sbin every time you start the daemons.
To reload the environment, log out and log back in, or run source /etc/environment in the current shell.
I prefer adding the variables to /etc/environment, but if you want you can add them to the ~/.bashrc file instead:
export HADOOP_HOME=/usr/hadoop/hadoop-2.9.0
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:$HADOOP_HOME/sbin
To check that everything is fine so far, run the hadoop version command. The output should report the Hadoop release you installed (2.9.0 here).
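If the hadoop command is not found, the usual cause is that bin or sbin did not make it onto the PATH. A quick way to check is sketched below; check_hadoop_path is a hypothetical helper, and the default path is the one used in this post.

```shell
# Hypothetical helper: report whether the Hadoop bin and sbin directories are on PATH.
# $1 = Hadoop installation directory
check_hadoop_path() {
    hadoop_home="$1"
    for dir in "$hadoop_home/bin" "$hadoop_home/sbin"; do
        # Wrap PATH in colons so every entry can be matched as ":entry:".
        case ":$PATH:" in
            *":$dir:"*) echo "on PATH: $dir" ;;
            *)          echo "missing from PATH: $dir" ;;
        esac
    done
}

# Falls back to this post's install location when HADOOP_HOME is not set.
check_hadoop_path "${HADOOP_HOME:-/usr/hadoop/hadoop-2.9.0}"
```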
3- Installing and configuring SSH
Even in pseudo-distributed mode, Hadoop connects to a host and starts the daemon processes there, although in this mode the host is always localhost. It uses the ssh command for that, the same command used to connect to a remote host. So we need to ensure that Hadoop can ssh to localhost without entering a password.
To install ssh, run the following command:
sudo apt-get install ssh
Now, to ensure that no password is needed when logging in to the host, generate an SSH key with an empty passphrase.
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
Add the generated SSH key to the list of authorized keys.
cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
Now try ssh localhost. You should not be asked to enter a password.
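The two SSH commands above can be made safe to re-run. The following is a sketch, not the only way to do it: setup_local_ssh_key is a hypothetical helper that skips key generation when a key already exists, avoids duplicate entries in authorized_keys, and tightens the file permissions, since sshd may ignore an authorized_keys file that is writable by others.

```shell
# Hypothetical helper: set up a passphrase-less key for ssh to localhost
# without overwriting an existing key.
# $1 = ssh directory (defaults to ~/.ssh)
setup_local_ssh_key() {
    ssh_dir="${1:-$HOME/.ssh}"
    mkdir -p "$ssh_dir" || return 1
    chmod 700 "$ssh_dir" || return 1
    # Generate a key pair only if none exists yet.
    if [ ! -f "$ssh_dir/id_rsa" ]; then
        ssh-keygen -q -t rsa -P '' -f "$ssh_dir/id_rsa" || return 1
    fi
    # Append the public key only if it is not already authorized.
    touch "$ssh_dir/authorized_keys"
    grep -qxF "$(cat "$ssh_dir/id_rsa.pub")" "$ssh_dir/authorized_keys" ||
        cat "$ssh_dir/id_rsa.pub" >> "$ssh_dir/authorized_keys"
    chmod 600 "$ssh_dir/authorized_keys"
}

# Example call: setup_local_ssh_key
```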
4- Configuring XML files
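You will need to change the configuration files placed inside the etc/hadoop directory of your Hadoop installation (for example /usr/hadoop/hadoop-2.9.0/etc/hadoop).
In etc/hadoop/core-site.xml, add the following between the <configuration></configuration> tags.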
<property>
  <name>hadoop.tmp.dir</name>
  <value>/usr/hadoop/tmp</value>
</property>
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://localhost:9000</value>
</property>
Here hadoop.tmp.dir is the path of the tmp directory. Depending on your user account privileges, you may need to create the specified directory yourself and use the chmod command to give it read and write permissions.
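Creating the tmp directory could look like the sketch below. prepare_tmp_dir is a hypothetical helper; it assumes the parent directory is writable by your user. If /usr/hadoop is owned by root, you would create the directory with sudo and then hand it over with chown instead.

```shell
# Hypothetical helper: create the hadoop.tmp.dir directory and make sure
# the current user can read and write it.
# $1 = directory configured as hadoop.tmp.dir
prepare_tmp_dir() {
    tmp_dir="$1"
    mkdir -p "$tmp_dir" || return 1     # prefix with sudo if the parent is root-owned
    chmod u+rwx "$tmp_dir" || return 1
    [ -w "$tmp_dir" ] && echo "writable: $tmp_dir"
}

# Example for the value used in this post:
# prepare_tmp_dir /usr/hadoop/tmp
```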
In etc/hadoop/hdfs-site.xml, add the following between the <configuration></configuration> tags. This sets the replication factor to 1, as you have a single-node cluster.
<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>
Since we are going to use YARN to run MapReduce jobs, add the following between the <configuration></configuration> tags in etc/hadoop/mapred-site.xml.
<property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>
</property>
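One thing to watch for with the MapReduce setting: as far as I know, Hadoop 2.x releases ship only a mapred-site.xml.template in etc/hadoop, so mapred-site.xml may not exist yet. The sketch below copies the template when the real file is missing. ensure_mapred_site is a hypothetical helper, and the template file name is an assumption you should verify in your own installation.

```shell
# Hypothetical helper: create mapred-site.xml from the shipped template if needed.
# $1 = Hadoop configuration directory (e.g. /usr/hadoop/hadoop-2.9.0/etc/hadoop)
ensure_mapred_site() {
    conf_dir="$1"
    if [ -f "$conf_dir/mapred-site.xml" ]; then
        echo "mapred-site.xml already exists"
    elif [ -f "$conf_dir/mapred-site.xml.template" ]; then
        cp "$conf_dir/mapred-site.xml.template" "$conf_dir/mapred-site.xml" &&
            echo "created mapred-site.xml from template"
    else
        echo "no mapred-site.xml or template found in $conf_dir" >&2
        return 1
    fi
}

# Example call:
# ensure_mapred_site /usr/hadoop/hadoop-2.9.0/etc/hadoop
```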
In etc/hadoop/yarn-site.xml, add the following between the <configuration></configuration> tags.
<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle</value>
</property>
5- Formatting HDFS filesystem
You also need to format the filesystem once before HDFS can be used.
Run the following command for that:
hdfs namenode -format
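Formatting is meant to be done once, because re-formatting discards the existing HDFS metadata. A simple guard is sketched below. It assumes the NameNode directory is left at its default location under hadoop.tmp.dir (dfs/name); namenode_is_formatted and the HADOOP_TMP_DIR variable are names made up for this example.

```shell
# Hypothetical helper: succeed if a NameNode directory under the given
# hadoop.tmp.dir has already been formatted.
# Assumes the default dfs.namenode.name.dir, which is <hadoop.tmp.dir>/dfs/name.
namenode_is_formatted() {
    [ -f "$1/dfs/name/current/VERSION" ]
}

# HADOOP_TMP_DIR is a made-up variable; the default matches core-site.xml in this post.
tmp_dir="${HADOOP_TMP_DIR:-/usr/hadoop/tmp}"
if namenode_is_formatted "$tmp_dir"; then
    echo "NameNode under $tmp_dir looks formatted already; skip hdfs namenode -format"
else
    echo "no NameNode metadata under $tmp_dir; run hdfs namenode -format"
fi
```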
6- Starting daemons
Run sbin/start-dfs.sh to start the NameNode, Secondary NameNode and DataNode daemons.
Run sbin/start-yarn.sh to start the ResourceManager and NodeManager daemons.
You can also use the start-all.sh script to start all the daemons at once, but that script is deprecated.
Use the jps command to verify that all the daemons are running.
knpcode:sbin$ jps
14370 NodeManager
14020 SecondaryNameNode
13655 NameNode
13817 DataNode
14234 ResourceManager
14698 Jps
You should see the five daemons listed above, each running as a Java process.
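Instead of reading the jps listing by eye, you can filter it for the five expected daemons. This is a small sketch; missing_daemons is a hypothetical helper that reads jps output on standard input and prints the name of every daemon it does not find.

```shell
# Hypothetical helper: read jps output on stdin and print each expected
# daemon that is absent. Prints nothing when all five are running.
missing_daemons() {
    jps_output=$(cat)
    for daemon in NameNode DataNode SecondaryNameNode ResourceManager NodeManager; do
        # Compare against the whole second field so that NameNode
        # does not match SecondaryNameNode.
        echo "$jps_output" |
            awk -v d="$daemon" '$2 == d { found = 1 } END { exit !found }' ||
            echo "$daemon"
    done
}

# Usage: jps | missing_daemons
```

If a daemon is reported missing, its log file under the logs directory of the Hadoop installation is the first place to look.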
Browse the web interface for the NameNode; by default it is available at:
- NameNode – http://localhost:50070/
Browse the web interface for the ResourceManager; by default it is available at:
- ResourceManager – http://localhost:8088/
Stopping the daemons
You can stop the running daemons with the stop-dfs.sh and stop-yarn.sh scripts. There is also a stop-all.sh script that stops all the daemons at once, but that script is deprecated.
That’s all for the topic Installing Hadoop in Pseudo-Distributed Mode. If something is missing or you have something to share about the topic please write a comment.