Once you have installed Hadoop on your system and done the initial verification, you will want to write your first MapReduce program. Before digging deeper into the intricacies of MapReduce programming, the first step is the word count MapReduce program in Hadoop, also known as the "Hello World" of the Hadoop framework.
So here is a simple Hadoop MapReduce word count program written in Java to get you started with MapReduce programming.
What you need
- An IDE like Eclipse is helpful for writing the Java code, though not required.
- A text file which is your input file; it should be copied to HDFS. This is the file the Map task will process, producing its output as (key, value) pairs, and that Map output then becomes the input for the Reduce task, as sketched just after this list.
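As a quick sketch of that flow, here is roughly what the (key, value) pairs look like for a line such as "This is a test file." appearing twice in the input (the grouping of values per key is handled by the framework):

Map output:    (This, 1) (is, 1) (a, 1) (test, 1) (file., 1) ... repeated for the second line
Reduce input:  (This, [1, 1]) (is, [1, 1]) (a, [1, 1]) (test, [1, 1]) (file., [1, 1])
Reduce output: (This, 2) (is, 2) (a, 2) (test, 2) (file., 2)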
Process
These are the steps you need to execute your word count MapReduce program in Hadoop.
- Start the daemons by executing the start-dfs.sh and start-yarn.sh scripts.
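Assuming a standard Hadoop installation layout, both scripts are in the sbin directory of your Hadoop installation:

sbin/start-dfs.sh
sbin/start-yarn.sh

You can verify that the daemons are up by running the jps command.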
- Create an input directory in HDFS where you will keep your text file.
bin/hdfs dfs -mkdir /user
bin/hdfs dfs -mkdir /user/input
- Copy the text file you created to the /user/input directory.
bin/hdfs dfs -put /home/knpcode/Documents/knpcode/Hadoop/count /user/input
I have created a text file called count with the following content
This is a test file. This is a test file.
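If you prefer to create the file from the shell, something like this works (the path and the exact line breaks are just an example):

echo "This is a test file. This is a test file." > /home/knpcode/Documents/knpcode/Hadoop/count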
If you want to verify that the file was copied, you can run the following command-
bin/hdfs dfs -ls /user/input
Found 1 items
-rw-r--r--   1 knpcode supergroup         42 2017-12-22 18:12 /user/input/count
Word count MapReduce Java code
package org.knpcode;

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
  // Map function
  public static class WordMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    @Override
    public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      // Splitting the line on whitespace
      String[] stringArr = value.toString().split("\\s+");
      for (String str : stringArr) {
        word.set(str);
        // Emit (word, 1) for every token in the line
        context.write(word, one);
      }
    }
  }

  // Reduce function
  public static class CountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      // Sum all the counts received for this word
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(WordMapper.class);
    job.setReducerClass(CountReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    // Input and output paths come from the command line arguments
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
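As an optional optimization that is not in the listing above, the same reducer can also run as a combiner, so word counts are partially summed on the map side before being shuffled across the network. One extra line in main(), before the job is submitted, enables it:

// Optional: use the reducer as a combiner to pre-aggregate map output
job.setCombinerClass(CountReducer.class);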
You will need at least the Hadoop jars (hadoop-common and the hadoop-mapreduce-client jars) on the classpath to compile your MapReduce code; you will find them in the share directory of your Hadoop installation.
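If you want to compile from the command line, one convenient option (assuming you run it from the Hadoop installation directory and WordCount.java is in the current directory; adjust the paths for your setup) is to let the hadoop classpath command supply all of the required jars:

mkdir classes
javac -cp "$(bin/hadoop classpath)" -d classes WordCount.java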
Running the word count MapReduce program
Once your code compiles successfully, create a jar. If you are using the Eclipse IDE, you can create the jar by right-clicking the project – Export – Java (JAR file).
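If you compiled from the command line as above, the jar tool that ships with the JDK can package the classes directory instead (the jar name is just an example):

jar cf wordcount.jar -C classes .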
Once the jar is created, run the following command to execute your MapReduce code.
bin/hadoop jar /home/knpcode/Documents/knpcode/Hadoop/wordcount.jar org.knpcode.WordCount /user/input /user/output
In the above command
/home/knpcode/Documents/knpcode/Hadoop/wordcount.jar is the path to your jar.
org.knpcode.WordCount is the fully qualified name of the Java class that you need to run.
/user/input is the path to the input directory.
/user/output is the path to the output directory.
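One thing to keep in mind is that the output directory must not already exist when the job is submitted; otherwise the job fails immediately with an "output directory already exists" error. If you re-run the job, delete the old output first:

bin/hdfs dfs -rm -r /user/output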
In the Java program, in the main method, there are these two lines-
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
That's where the input and output directories are set.
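Note that FileInputFormat.addInputPath can be called more than once, so a single job can read from several input locations. A minimal sketch, where the second directory is purely hypothetical:

FileInputFormat.addInputPath(job, new Path(args[0]));
// Hypothetical additional input directory, added the same way
FileInputFormat.addInputPath(job, new Path("/user/input2"));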
To see a detailed explanation of how the word count MapReduce program works, check this post- How MapReduce Works in Hadoop
After execution you can check the output directory for the output.
bin/hdfs dfs -ls /user/output
Found 2 items
-rw-r--r--   1 knpcode supergroup          0 2017-12-22 18:15 /user/output/_SUCCESS
-rw-r--r--   1 knpcode supergroup         31 2017-12-22 18:15 /user/output/part-r-00000
The output can be verified by displaying the content of the created output file. Note that the reducer emits keys in sorted order; since Text keys sort by byte value, the upper-case "This" comes before the lower-case words.
bin/hdfs dfs -cat /user/output/part-r-00000
This	2
a	2
file.	2
is	2
test	2
That's all for the topic Hadoop MapReduce Word Count Program. If something is missing or you have something to share about the topic please write a comment.
You may also like
- MapReduce Execution Internal Steps in YARN
- Input Split in Hadoop MapReduce
- NameNode, Secondary Namenode and Datanode in HDFS
- How to Compress MapReduce Job Output
- How to Create Custom Exception Class in Java
- Difference Between sleep() And wait() Methods in Java
- Java Ternary Operator With Examples
- while Loop in Java With Examples
- Spring Boot Application Using Spring Initializr