Installing a Hadoop 3.3.1 Two-Node Cluster on Ubuntu 20.04


Introduction

Apache Hadoop is a platform used to create a clustered environment. Supporting big data requires a distributed system or environment, and Hadoop is what we use to build such a cluster. Hadoop comes with two core components: HDFS (Hadoop Distributed File System) and YARN (Yet Another Resource Negotiator). HDFS is responsible for managing storage, and YARN is responsible for managing computation.

This is a tutorial on how to install Hadoop 3.3.1 on Ubuntu 20.04 with two nodes. Once we know how to do it with two nodes, increasing the number of nodes is straightforward.

We need to have Ubuntu 20.04 installed on each virtual machine.

Create two machines

The two machines: Node1 and Node2

ISO file used: ubuntu-20.04.2.0-desktop-amd64.iso

Network for both machines: NAT

The username should be the same on both machines (magna).

Start machine 1

Open a terminal and check the IP address.

ifconfig

Run the following commands

sudo su

Enter your password and then run

cd                         #(to go to the home directory)
sudo nano /etc/hosts

Add both machines’ IP addresses and hostnames (192.168.254.129 Node1 and 192.168.254.130 Node2).
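For reference, with the addresses used in this walkthrough, the relevant part of /etc/hosts would look roughly like this (your IP addresses will differ):

127.0.0.1       localhost
192.168.254.129 Node1
192.168.254.130 Node2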


Start Machine 2

Do the same as we did for machine 1

ifconfig

Run the following commands

sudo su

Enter your password and then run

cd                       #(to go to the home directory)
sudo nano /etc/hosts

Add both machines’ IP addresses and hostnames (192.168.254.129 Node1 and 192.168.254.130 Node2).

Checking if the machines are able to ping each other and themselves

ping Node1
ping Node2

Disable the firewall for both machines

ufw disable

Installing SSH and Java on both machines

sudo apt install openssh-server
sudo apt install openjdk-8-jdk

Checking that Java is installed properly on both machines

java -version

Download the file below on machine 1

Copy and paste this link into the Firefox browser on Ubuntu: https://downloads.apache.org/hadoop/common/hadoop-3.3.1/hadoop-3.3.1.tar.gz

All of this is to be done on machine 1.

While the file is downloading, give sudo privileges to your user using the following command:

visudo

Scroll down to the line that starts with root and add the following below it:

magna         ALL=(ALL:ALL) ALL

Now run in the terminal:

sudo nano .bashrc

Now, near the top of this file (around the fourth line), add the following lines:

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export PATH=$PATH:$JAVA_HOME/bin

After this, run

source .bashrc
java -version
echo $JAVA_HOME                 #(check that Java is installed and JAVA_HOME shows the proper path)
sudo apt install vim
su - magna                      #(to switch back to the normal user)
nano .bashrc

Now, near the top of this file (around the fourth line), add the following lines:

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export PATH=$PATH:$JAVA_HOME/bin

After this, run

source .bashrc
java -version
echo $JAVA_HOME                 #(check that JAVA_HOME shows the proper path)
cd /usr/local

Extract the downloaded Hadoop archive. The tar command decompresses the file.

sudo tar -xvf /home/magna/Downloads/hadoop-3.3.1.tar.gz

Enter your password

Run

ls

The decompressed folder is now present in /usr/local.

sudo ln -s hadoop-3.3.1 hadoop
sudo chown -R magna:magna hadoop*
ls -all

Ownership of the extracted folder and the hadoop link is now given to the user magna. The name of the symbolic link is hadoop; the link lets us refer to the versioned folder simply as hadoop.
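As an optional sanity check, confirm that the link resolves to the versioned folder:

readlink -f /usr/local/hadoop             #(should print /usr/local/hadoop-3.3.1)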

Run:

cd hadoop
ls
ls bin                              #(contains the command binaries used for execution)
ls sbin                             #(contains the shell scripts used to start and stop the Hadoop daemons)
ls etc/hadoop                       #(contains the configuration files)
cd

Setting environment variables for machine 1

nano .bashrc

Type in the following after the export PATH line added earlier:

export HADOOP_INSTALL=/usr/local/hadoop
export PATH=$PATH:$HADOOP_INSTALL/bin
export PATH=$PATH:$HADOOP_INSTALL/sbin
source .bashrc                         #(To implement the changes)
hadoop version #(To verify if the changes are done)

SSH settings for machine 1

sudo ufw disable                        #(disable the firewall)
ssh-keygen -t rsa -P ""
ls -all .ssh
ssh-copy-id -i $HOME/.ssh/id_rsa.pub magna@Node2

Go to machine 2 and run the following

sudo su
cd
visudo

Scroll down to the line that starts with root and add the following below it:

magna        ALL=(ALL:ALL) ALL

Now run in the terminal:

su - magna
sudo ufw disable                        #(Disable the firewall)
ssh-keygen -t rsa -P ""
ls -all .ssh
ssh-copy-id -i $HOME/.ssh/id_rsa.pub magna@Node1
ls -all .ssh

Go to machine 1 and run

ssh Node2
exit

Go to machine 2 and run

ssh Node1
exit

Run the following command on both the machines

cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
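If SSH still prompts for a password after this, it is usually a permissions problem: sshd expects the .ssh directory and the authorized_keys file to be private. An optional fix on both machines:

chmod 700 $HOME/.ssh
chmod 600 $HOME/.ssh/authorized_keys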

Go to Machine 1 and run

scp /home/magna/Downloads/hadoop-3.3.1.tar.gz magna@Node2:/tmp

Go to machine 2 and run

ls /tmp                             #(to see if the file is there)
cd /usr/local

Extract the Hadoop archive sent from Node1 to Node2. The tar command decompresses the file.

sudo tar -xvf /tmp/hadoop-3.3.1.tar.gz
ls -all

The decompressed folder is now present in /usr/local.

sudo ln -s hadoop-3.3.1 hadoop
sudo chown -R magna:magna hadoop*
ls -all

Go to Machine 1 and run

Now send the .bashrc file from Node1 to Node2:

scp .bashrc magna@Node2:/home/magna

Go to machine 2 and run

cd
source .bashrc #(To implement the changes)
hadoop version                  #(To verify if the changes are done)
ls /usr/local/hadoop/etc/hadoop

Creating directories on both machines

Create directories/folders

cd /usr/local
sudo mkdir hdfs
sudo mkdir hdfs/datanode
sudo mkdir hdfs/namenode
sudo mkdir hadoop/logs
sudo mkdir yarn
sudo mkdir yarn/local                 #(referenced later by yarn-site.xml on Node1)
sudo mkdir yarn/logs
ls
cd

Run the following commands on both the machines

nano .bashrc

After the export statements add these statements:

export HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop
export HDFS_NAMENODE_USER=magna
export HDFS_DATANODE_USER=magna
export HDFS_SECONDARYNAMENODE_USER=magna
export HADOOP_MAPRED_HOME=/usr/local/hadoop
export HADOOP_COMMON_HOME=/usr/local/hadoop
export HADOOP_HDFS_HOME=/usr/local/hadoop
export YARN_HOME=/usr/local/hadoop

After this, run

source .bashrc
cd /usr/local
sudo gedit hadoop/etc/hadoop/hadoop-env.sh

Update hadoop-env.sh so that JAVA_HOME is set explicitly (uncomment the JAVA_HOME line near the top of the file).

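Based on the Java path used earlier in .bashrc, the uncommented line in hadoop-env.sh would read:

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64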
cd
hadoop version

Master Node Configuration (Node1)

Run

cd /usr/local/hadoop/etc/hadoop
ls
sudo nano hdfs-site.xml

The file must look like:

<configuration>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:///usr/local/hdfs/namenode</value>
    <description>NameNode directory for namespace and transaction logs storage.</description>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:///usr/local/hdfs/datanode</value>
    <description>DataNode directory</description>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
  <property>
    <name>dfs.permissions</name>
    <value>false</value>
  </property>
  <property>
    <name>dfs.datanode.use.datanode.hostname</name>
    <value>false</value>
  </property>
  <property>
    <name>dfs.namenode.datanode.registration.ip-hostname-check</name>
    <value>false</value>
  </property>
</configuration>
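Note that with only two DataNodes, a replication factor of 3 can never actually be met, and HDFS will report blocks as under-replicated. The cluster still works, but if you prefer clean reports you could set the value to match the cluster size, for example:

  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>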

After adding these lines, run

sudo nano core-site.xml

The file must look like:

<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://Node1:9820/</value>
    <description>NameNode URI</description>
  </property>
  <property>
    <name>io.file.buffer.size</name>
    <value>131072</value>
    <description>Buffer size</description>
  </property>
</configuration>

After this, edit the below file

sudo nano yarn-site.xml

The file must look like:

<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
    <description>Yarn Node Manager Aux Service</description>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>
  <property>
    <name>yarn.nodemanager.local-dirs</name>
    <value>file:///usr/local/yarn/local</value>
  </property>
  <property>
    <name>yarn.nodemanager.log-dirs</name>
    <value>file:///usr/local/yarn/logs</value>
  </property>
</configuration>
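One property this file does not set is the ResourceManager address; by default a NodeManager looks for the ResourceManager on 0.0.0.0, so the NodeManager on Node2 may fail to register with YARN. If that happens, adding the following property to yarn-site.xml on both nodes points them at Node1:

  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>Node1</value>
  </property>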

Next,

sudo nano mapred-site.xml

The file must look like:

<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
    <description>MapReduce framework name</description>
  </property>
  <property>
    <name>mapreduce.jobhistory.address</name>
    <value>Node1:10020</value>
    <description>Default port is 10020.</description>
  </property>
  <property>
    <name>mapreduce.jobhistory.webapp.address</name>
    <value>Node1:19888</value>
    <description>Default port is 19888.</description>
  </property>
  <property>
    <name>mapreduce.jobhistory.intermediate-done-dir</name>
    <value>/mr-history/tmp</value>
    <description>Directory where history files are written by MapReduce jobs.</description>
  </property>
  <property>
    <name>mapreduce.jobhistory.done-dir</name>
    <value>/mr-history/done</value>
    <description>Directory where history files are managed by the MR JobHistory Server.</description>
  </property>
</configuration>
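The JobHistory Server referenced by these properties is not started by start-dfs.sh or start-yarn.sh. If you want it running later, it can be started separately (optional):

mapred --daemon start historyserver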

Next,

sudo chmod -R 777 /usr/local/hadoop/logs
sudo chmod -R 777 /usr/local/hdfs             #(-R so that the namenode and datanode sub-directories are writable too)
sudo chmod -R 777 /usr/local/yarn             #(the NodeManager writes to /usr/local/yarn/local and /usr/local/yarn/logs)
hdfs namenode -format
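If the format succeeds, the NameNode directory configured in hdfs-site.xml should now be populated with metadata; this optional check should list a VERSION file and an fsimage:

ls /usr/local/hdfs/namenode/current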
ls /usr/local/hadoop/etc/hadoop/workers
sudo gedit /usr/local/hadoop/etc/hadoop/workers

Add the IP addresses of both machines, as shown below.

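With the addresses used in this setup, the workers file would contain just these two lines (hostnames from /etc/hosts would also work):

192.168.254.129
192.168.254.130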
ssh Node2
exit

Configuring Data Nodes

Go to machine 2 (Node2) and run

ssh Node1
exit
sudo gedit /usr/local/hadoop/etc/hadoop/hdfs-site.xml

The file must look like:

<configuration>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:///usr/local/hdfs/datanode</value>
    <description>DataNode directory</description>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
  <property>
    <name>dfs.permissions</name>
    <value>false</value>
  </property>
  <property>
    <name>dfs.datanode.use.datanode.hostname</name>
    <value>false</value>
  </property>
</configuration>

Run

sudo gedit /usr/local/hadoop/etc/hadoop/core-site.xml

The file must look like:

<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://Node1:9820/</value>
    <description>NameNode URI</description>
  </property>
</configuration>

Next,

sudo gedit /usr/local/hadoop/etc/hadoop/yarn-site.xml

The file must look like:

<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
    <description>Yarn Node Manager Aux Service</description>
  </property>
</configuration>

Next,

sudo gedit /usr/local/hadoop/etc/hadoop/mapred-site.xml

The file must look like:

<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
    <description>MapReduce framework name</description>
  </property>
</configuration>

Next,

sudo chmod 777 /usr/local/hadoop/logs
sudo chmod 777 /usr/local/hdfs
sudo chown magna /usr/local/hdfs/datanode

Start Hadoop

After finishing the steps above, from the name node (Node1) we have to execute the following commands to start the NameNode, DataNodes, SecondaryNameNode, and the YARN daemons:

sudo chown magna /usr/local/hdfs/datanode
ls -l /usr/local/hdfs
start-dfs.sh
start-yarn.sh
jps
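Since both machines are listed in the workers file, Node1 acts as both master and worker, so jps on Node1 should report roughly the following daemons (each line is preceded by a process ID):

NameNode
SecondaryNameNode
DataNode
ResourceManager
NodeManager
Jps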

Go to machine 2 (Node2) and run

jps
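Node2 runs only the worker daemons, so jps there should show roughly:

DataNode
NodeManager
Jps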

Accessing Hadoop from the browser and the terminal

Open the browser on machine 1 (Node1).

Go to the following URL: http://Node1:9870/

Go to Utilities (we can upload files using this) and then go to Browse the file system.

Enter the new directory name (Demo) and click on Create.

Click on Demo.

Click on browse, select the file you wish to upload and click on Upload.

To see where the file’s blocks have gone, go to /usr/local/hdfs/datanode/current/(Block Pool ID of the file)/current/finalized/subdir0/subdir0/blk_(Block ID of the file) in the file system of machine 1.

On machine 2, we will be able to see another copy of this file by going to /usr/local/hdfs/datanode/current/(Block Pool ID of the file)/current/finalized/subdir0/subdir0/blk_(Block ID of the file) in the file system.
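Instead of walking the directory tree by hand, you can list every block file stored on a DataNode with a quick find (run it on either machine):

find /usr/local/hdfs/datanode -name 'blk_*'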

Now we will upload a file using the terminal.

Go to machine 1 and run

hdfs dfs -ls /

Creating a new directory

hdfs dfs -mkdir /Demo1
hdfs dfs -ls /

Creating a subdirectory in Demo1 directory

hdfs dfs -mkdir /Demo1/Demo2
hdfs dfs -ls /Demo1
hdfs dfs -copyFromLocal /home/magna/Desktop/Test1 /Demo1/Demo2

(/home/magna/Desktop/Test1 is the source and /Demo1/Demo2 is the destination)
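Test1 is assumed to be a small text file that already exists on the Desktop; if you do not have one, create it first, for example:

echo "Hello HDFS" > /home/magna/Desktop/Test1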

hdfs dfs -ls /Demo1/Demo2
hdfs dfs -ls /Demo1/Demo2/Test1
hdfs dfs -cat /Demo1/Demo2/Test1

Now when we check in the browser, we can see the new directory.

copyFromLocal copies the file from the source folder to the destination folder.

moveFromLocal will remove the file from the source and put the file in the destination.

copyToLocal will copy the file from the hdfs source path to the local destination.

hdfs dfs -moveFromLocal /home/magna/Desktop/Test1 /Demo1
hdfs dfs -cat /Demo1/Test1
hdfs dfs -copyToLocal /Demo1/Test1 /home/magna/Desktop
hdfs dfs -mkdir /Demo1_copied
hdfs dfs -ls /
hdfs dfs -cp /Demo1 /Demo1_copied
hdfs dfs -ls /Demo1
hdfs dfs -ls /Demo1_copied
hdfs dfs -ls /Demo1_copied/Demo1

With the help of the cp command, we copy the directory Demo1 into Demo1_copied.

hdfs dfs -rm -r /Demo1_copied         #(-rm -r removes the directory recursively)
hdfs dfs -ls /Demo1_copied

Hope this tutorial helped you!

Thank you for taking the time to read 😊
