In the post WordCount MapReduce Program we saw how to write a MapReduce program in Java, create a jar and run it. There is a lot that you do to create a MapReduce job, and the Hadoop framework also does a lot of processing internally. In this post we’ll see in detail how MapReduce works in Hadoop, using the word count MapReduce program as an example.
What is MapReduce
Hadoop MapReduce is a framework for writing applications that can process huge amounts of data in parallel, by working on small chunks of the data across a cluster of nodes. The framework ensures that this distributed processing happens in a reliable, fault-tolerant manner.
Map and Reduce
A MapReduce job in Hadoop consists of two phases-
- Map phase– It has a Mapper class which has a map function specified by the developer. The input and output for the Map phase is a (key, value) pair. When you copy the file that has to be processed to HDFS, it is split into independent chunks. The Hadoop framework creates one map task for each chunk, and these map tasks run in parallel.
- Reduce phase– It has a Reducer class which has a reduce function specified by the developer. The input and output for the Reduce phase is also a (key, value) pair. The output of the Map phase, after some further processing by the Hadoop framework (known as shuffling and sorting), becomes the input for the Reduce phase. So the output of the Map phase is the intermediate output, and it is processed by the Reduce phase to generate the final output.
Since the input and output for both the map and reduce functions are (key, value) pairs, if we say the input for map is (K1, V1) and the output is (K2, V2), then the map function input and output will have the following form-
(K1, V1) -> list(K2, V2)
The intermediate output of the map function goes through some further processing within the framework, known as the shuffle and sort phase, before being passed as input to the reduce function. The general form of the reduce function can be depicted as follows-
(K2, list(V2)) -> list(K3, V3)
Note that the types of the reduce input match the types of the map output.
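These forms are easier to see with a concrete illustration. The following plain-Java sketch (an illustration only, not Hadoop code or the Hadoop API) writes the word count map and reduce steps as ordinary methods with exactly these input and output shapes-

import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class MapReduceForms {
  // (K1, V1) -> list(K2, V2): (byte offset, line) -> list of (word, 1) pairs
  static List<Map.Entry<String, Integer>> map(long offset, String line) {
    List<Map.Entry<String, Integer>> out = new ArrayList<>();
    for (String word : line.split("\\s+")) {
      out.add(new SimpleEntry<>(word, 1));
    }
    return out;
  }

  // (K2, list(V2)) -> (K3, V3): (word, list of counts) -> (word, total count)
  static Map.Entry<String, Integer> reduce(String word, List<Integer> counts) {
    int sum = 0;
    for (int count : counts) {
      sum += count;
    }
    return new SimpleEntry<>(word, sum);
  }

  public static void main(String[] args) {
    // Prints [This=1, is=1, a=1, test=1, file=1]
    System.out.println(map(0, "This is a test file"));
    // Prints is=2
    System.out.println(reduce("is", List.of(1, 1)));
  }
}

In the actual Hadoop program the same logic lives in the Mapper and Reducer classes shown below.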
MapReduce explanation with example
Let’s take the word count MapReduce code as an example and see what happens in both the Map and Reduce phases and how MapReduce works in Hadoop.
When we put the input text file into HDFS it is split into chunks of data. For simplicity’s sake let’s say we have two lines in the file and it is split into two parts, with each part having one line.
If the text file has the following two lines-
This is a test file
This is a Hadoop MapReduce program file
Then there will be two splits and two map tasks will get those two splits as input.
Mapper class
// Map function
public static class WordMapper extends Mapper<LongWritable, Text, Text, IntWritable>{
  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // Splitting the line on spaces
    String[] stringArr = value.toString().split("\\s+");
    for (String str : stringArr) {
      word.set(str);
      context.write(word, one);
    }
  }
}
In the Mapper class you can see that it has four type parameters, the first two specifying the input types of the map function and the last two specifying the output types.
In this word count program the input (key, value) pair for the map function will be as follows-
Key– the byte offset into the file at which the line starts.
Value– the content of the line.
As we assumed, there will be two splits (each having one line of the file) and two map tasks, let’s say Map-1 and Map-2, so the input to Map-1 and Map-2 will be as follows.
Map-1– (0, This is a test file)
Map-2– (20, This is a Hadoop MapReduce program file)
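The key for Map-2 is 20 because, as noted above, the key is the byte offset into the whole file: the second line starts right after the 19 characters of the first line plus its newline. A quick plain-Java check of that arithmetic (assuming single-byte characters and a '\n' line ending)-

public class OffsetCheck {
  public static void main(String[] args) {
    String line1 = "This is a test file";
    // First line starts at byte 0; second line starts right after line1 and its newline
    long offsetLine2 = line1.length() + 1;   // 19 + 1 = 20
    System.out.println("Map-1 key: 0, Map-2 key: " + offsetLine2);
  }
}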
The logic in the map function is to split the line on spaces and then write each word to the context with the value 1.
So output from Map-1 will be as follows-
(This, 1) (is, 1) (a, 1) (test, 1) (file, 1)
And output from Map-2 will be as follows-
(This, 1) (is, 1) (a, 1) (Hadoop, 1) (MapReduce, 1) (program, 1) (file, 1)
Reducer class
// Reduce function
public static class CountReducer extends Reducer<Text, IntWritable, Text, IntWritable>{
  private IntWritable result = new IntWritable();
  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable val : values) {
      sum += val.get();
    }
    result.set(sum);
    context.write(key, result);
  }
}
In the Reducer class again there are four type parameters, two for the input types and two for the output types of the reduce function.
Note that the input types of the reduce function must match the output types of the map function.
This intermediate output from the map tasks is further processed by the Hadoop framework in the shuffle phase, where it is sorted and grouped by key. After this internal processing the input to reduce will look like this-
[Hadoop, (1)] [MapReduce, (1)] [This, (1, 1)] [a, (1, 1)] [file, (1, 1)] [is, (1, 1)] [program, (1)] [test, (1)]
You can see that the input to the reduce function is in the form (key, list(values)). In the reduce function the list of values for each key is iterated and the values are added, which gives the count for that word.
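The grouping itself happens inside the framework, but to visualize what it does, here is a small plain-Java sketch (not Hadoop code, just an illustration) that takes the (word, 1) pairs emitted by Map-1 and Map-2 and groups them by key in sorted order, the way the shuffle and sort phase does-

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class ShuffleSketch {
  public static void main(String[] args) {
    // (word, 1) pairs emitted by Map-1 and Map-2
    List<String> mapOutput = List.of(
        "This", "is", "a", "test", "file",
        "This", "is", "a", "Hadoop", "MapReduce", "program", "file");

    // Group the values by key, keeping the keys sorted, like the shuffle and sort phase
    Map<String, List<Integer>> grouped = new TreeMap<>();
    for (String word : mapOutput) {
      grouped.computeIfAbsent(word, k -> new ArrayList<>()).add(1);
    }
    // Prints {Hadoop=[1], MapReduce=[1], This=[1, 1], a=[1, 1], file=[1, 1], is=[1, 1], program=[1], test=[1]}
    System.out.println(grouped);
  }
}

Running the reduce function over each of these (key, list(values)) entries and summing the values gives the final output of the job-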
Hadoop 1
MapReduce 1
This 2
a 2
file 2
is 2
program 1
test 1
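For completeness, here is a minimal sketch of a driver class that wires the Mapper and Reducer shown above into a job. The class name WordCount and the way the input and output paths are taken from program arguments are assumptions here (and WordMapper and CountReducer are assumed to be nested inside it); refer to the WordCount MapReduce Program post for the actual driver used there.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    // Plug in the Mapper and Reducer shown above
    job.setMapperClass(WordMapper.class);
    job.setReducerClass(CountReducer.class);
    // Output key/value types of the job (the types written by the reduce function)
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    // Input and output paths passed as program arguments
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Once packaged into a jar, the job is run with the hadoop jar command as described in the WordCount MapReduce Program post.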
That's all for the topic How MapReduce Works in Hadoop. If something is missing or you have something to share about the topic, please write a comment.