Apache Hadoop is an open source framework for storing and processing big data sets in parallel on a cluster of nodes (commodity hardware).
The Hadoop framework is designed to scale from a single server to thousands of machines, with each machine offering both storage and computation. It is also reliable and fault tolerant: the framework itself is designed to detect and handle failures at the application layer, so Hadoop provides a highly available service on top of a cluster of nodes.
Modules of Hadoop
The Hadoop framework is written in Java and includes these modules –
- Hadoop Common – This module contains the libraries and utilities used by the other modules.
- Hadoop Distributed File System (HDFS) – This is the storage part of the Hadoop framework. It is a distributed file system that breaks a huge file into blocks and stores those blocks on different nodes. That way HDFS provides high-throughput access to application data.
- Hadoop YARN (Yet Another Resource Negotiator) – This module is responsible for scheduling jobs and managing cluster resources. Refer to YARN in Hadoop to read more about YARN.
- Hadoop MapReduce – This is the implementation of the MapReduce programming model, used to process the data in parallel.
Brief history of Hadoop
Hadoop was created by Doug Cutting and has its origins in Nutch, an open source web crawler. While Doug Cutting and Mike Cafarella were working on Nutch and trying to scale it, they came across two Google white papers, one about GFS (the Google File System) and one about MapReduce. Using the architecture described in those papers, Nutch’s developers built open source implementations of a distributed file system, NDFS (Nutch Distributed File System), and of MapReduce.
It was later realized that NDFS and MapReduce could stand as a separate project, and so Hadoop initially became a sub-project. Yahoo also helped by providing resources and a team to develop the framework, improving its scalability, performance, and reliability and adding many new features. In 2008 Hadoop became a top-level Apache project rather than a sub-project, and it is now a widely used framework with its own ecosystem.
How Hadoop works
Here I’ll try to explain how Hadoop works in very simple terms, without going into the details of what daemons like the NameNode or ResourceManager do.
Once you copy a huge file into HDFS, the framework splits the file into blocks and distributes those blocks across the nodes in the cluster.
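The splitting step can be pictured with a small simulation. This is only a sketch in plain Python, not HDFS code: the function names, the tiny block size, the node names, and the round-robin placement are all assumptions made for illustration. In real HDFS the block size is configurable (128 MB by default in recent versions) and the NameNode decides where blocks go.

```python
from __future__ import annotations


def split_into_blocks(data: bytes, block_size: int) -> list[bytes]:
    """Cut data into consecutive blocks of at most block_size bytes."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_blocks(blocks: list[bytes], nodes: list[str]) -> dict[str, list[int]]:
    """Assign block indexes to nodes round-robin (a simplification,
    not the placement policy HDFS actually uses)."""
    placement = {node: [] for node in nodes}
    for index in range(len(blocks)):
        placement[nodes[index % len(nodes)]].append(index)
    return placement


if __name__ == "__main__":
    file_content = b"x" * 1000  # stand-in for a huge file
    blocks = split_into_blocks(file_content, block_size=256)
    print([len(b) for b in blocks])  # [256, 256, 256, 232]
    print(place_blocks(blocks, ["node1", "node2", "node3"]))
```

Note that the last block is smaller than the others; a file does not have to be an exact multiple of the block size.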
Then you write a MapReduce program containing the logic to process that data. You package your code as a jar, and that packaged code is transferred to the DataNodes where the data blocks are stored. That way your MapReduce code works on one part of the file (the HDFS block that resides on the node where the code is running), and the blocks are processed in parallel.
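To make the idea concrete, here is a rough word-count sketch of the map and reduce steps in plain Python. It does not use the Hadoop API; a real job would be written against Hadoop's Mapper and Reducer classes and shipped as a jar. The function names are hypothetical, threads stand in for cluster nodes, and each "block" is assumed to hold whole words so that nothing is cut at a block boundary.

```python
from __future__ import annotations

from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def map_block(block: str) -> Counter:
    """Map step: count the words in one block, independently of other blocks."""
    return Counter(block.split())


def reduce_counts(partials: list[Counter]) -> Counter:
    """Reduce step: merge the per-block counts into one result."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


def word_count(blocks: list[str]) -> Counter:
    # One worker per block, mimicking one map task per HDFS block.
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(map_block, blocks))
    return reduce_counts(partials)


if __name__ == "__main__":
    blocks = ["hadoop stores data", "hadoop processes data", "data in parallel"]
    print(word_count(blocks)["data"])  # 3
```

The point of the sketch is that `map_block` never needs to see any block other than its own, which is what lets the work run in parallel.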
Another advantage is that rather than sending data to the code (as in traditional programming, where data is fetched from a DB server), you send the code to the data. The data is usually much larger than the code, so this way Hadoop uses network bandwidth more efficiently.
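A back-of-the-envelope comparison shows why this matters. The figures below (a 1 TB file, a 10 MB jar, 100 nodes) are made-up numbers for illustration, and the model is deliberately crude: it ignores intermediate data and results that also travel over the network.

```python
from __future__ import annotations


def bytes_over_network(data_bytes: int, code_bytes: int, nodes: int) -> dict[str, int]:
    """Compare network traffic for the two strategies (rough model)."""
    return {
        # Traditional approach: every data byte travels to where the code runs.
        "data_to_code": data_bytes,
        # Hadoop approach: a copy of the packaged code travels to every node.
        "code_to_data": code_bytes * nodes,
    }


if __name__ == "__main__":
    # Hypothetical sizes: 1 TB of data, a 10 MB jar, 100 nodes.
    traffic = bytes_over_network(data_bytes=10**12, code_bytes=10**7, nodes=100)
    print(traffic["data_to_code"] // traffic["code_to_data"])  # 1000
```

Under these assumptions, moving the code costs about a thousandth of the traffic that moving the data would.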
Here is a high-level diagram that shows, in simple terms, how the Hadoop framework works.
That’s all for the topic What is Hadoop. If something is missing or you have something to share about the topic please write a comment.