What is Hadoop

Apache Hadoop is an open source framework for storing data and processing of data set of big data on a cluster of nodes (commodity hardware) in parallel.

Hadoop framework is designed to scale up from single server to thousand of machines with each machine offering both storage and computation. It is also reliable and fault tolerant, framework itself is designed to detect and handle failures at the application layer, that way Hadoop framework provides a highly-available service using a cluster of nodes.

Modules of Hadoop

Hadoop framework is written in Java and it includes these modules –

  1. Hadoop Common – This module contains libraries and utilities used by other Hadoop modules.
  2. Hadoop Distributed File System (HDFS) – This is the storage part of the Hadoop framework. It is a distributed file system that works on the concept of breaking the huge file into blocks and storing those blocks in different nodes. That way HDFS provides high-throughput access to application data.
  3. Hadoop Yarn (Yet Another Resource Negotiator) – This module is responsible for scheduling jobs and managing cluster resources. Refer YARN in Hadoop to read more about YARN.
  4. Hadoop MapReduce – This is the implementation of the MapReduce programming model to process the data in parallel.

Brief history of Hadoop

Hadoop was created by Doug Cutting and it has its origins in Nutch which is an open source web crawler. When Doug Cutting and Mike Cafarella were working on Nutch and trying to scale it they came across two google white papers about GFS (Google’s Distributed File System) and MapReduce. Using the architecture described in those white papers Nutch’s developers came up with open source implementation of distributed file system NDFS (Nutch Distributed File System) and MapReduce.

It was realized that NDFS and MapReduce can be created as a separate project and that way Hadoop initially became a sub-project. Yahoo also helped by providing resources and team to develop Hadoop by improving scalability, performance, and reliability and adding many new features. In 2008 Hadoop became a top-level project in Apache rather than being a sub-project and now it is a widely used framework with its own ecosystem.

How Hadoop works

Here I’ll try to explain how Hadoop works in very simple terms without going into the complexities what all daemons like NameNode or Resource Manager do.

Once you copy a huge file into HDFS, framework splits the file into blocks and distribute those blocks across nodes in a cluster.

Then you write a MapReduce program having some logic to process that data. You package your code as a jar and that packaged code is transferred to nodes. That way your MapReduce code work on the part of the file (HDFS block that resides on the node where code is running) and process data in parallel.

Other advantage is that rather than sending data to code (like traditional programming where data is fetched from DB server) you send the code to data. Obviously data is much larger in size so that way Hadoop uses network bandwidth more proficiently.

Here is a high level diagram which tells in a simple way how Hadoop framework works.

How Hadoop works

Hadoop HDFS and MapReduce

That’s all for the topic What is Hadoop. If something is missing or you have something to share about the topic please write a comment.

You may also like


  1. Pingback: What is Big Data - KnpCode

  2. Pingback: CapacityScheduler in Yarn - KnpCode

  3. Pingback: Fair Scheduler in Yarn - KnpCode

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.