In a Hadoop MapReduce job you can opt to compress the output of the map phase. Because the output of a map task is written to local disk and then transferred across the network to the reducer nodes, compressing it reduces both disk I/O and network traffic, which should help your MapReduce job run faster.
You can use a fast compressor such as Snappy or LZ4 for map output. Whether the codec is splittable does not matter for intermediate map output.
This tutorial gives the configuration steps for compressing map output using the Snappy codec.
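To get a rough feel for why compressing intermediate data pays off, here is a minimal sketch that uses only the JDK. It builds a hypothetical word-count style map output (repeated `word<TAB>1` records) and compresses it with `java.util.zip.Deflater` at its fastest level. The class name, the sample words and the record count are assumptions made for illustration, and DEFLATE stands in for Snappy here because Snappy is not part of the JDK. Actual ratios depend on your data and codec.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.Deflater;

public class MapOutputCompressionSketch {

    // Builds a hypothetical word-count style map output: many "word\t1" records.
    public static byte[] sampleMapOutput(int records) {
        String[] words = {"hadoop", "map", "reduce", "shuffle", "snappy"};
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < records; i++) {
            sb.append(words[i % words.length]).append('\t').append(1).append('\n');
        }
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }

    // Compresses the bytes with DEFLATE at the fastest level (not Snappy).
    public static byte[] compress(byte[] input) {
        Deflater deflater = new Deflater(Deflater.BEST_SPEED);
        deflater.setInput(input);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        while (!deflater.finished()) {
            int n = deflater.deflate(buffer);
            out.write(buffer, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] raw = sampleMapOutput(10_000);
        byte[] packed = compress(raw);
        // Fewer bytes here means fewer bytes spilled to disk and shuffled to reducers.
        System.out.println("raw bytes:        " + raw.length);
        System.out.println("compressed bytes: " + packed.length);
    }
}
```

Map output tends to contain many repeated keys, so even a fast, low-ratio compressor usually shrinks it noticeably.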
If the native Snappy library is not installed, you can install it on Ubuntu using the following command. Using native libraries makes compression faster and helps improve the performance of the MapReduce job.
$ sudo apt-get install libsnappy-dev
- Refer to How to Check For Which Compressors Native Libraries Are Present to see how to check whether the native libraries for the compressors are present.
Required config changes
If you want to compress output of the map phase using Snappy compression at the whole cluster level, set the following properties in mapred-site.xml:
<property>
  <name>mapreduce.map.output.compress</name>
  <value>true</value>
</property>
<property>
  <name>mapreduce.map.output.compress.codec</name>
  <value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>
The properties are described below:
- mapreduce.map.output.compress: Whether the outputs of the maps should be compressed before being sent across the network. Default is false.
- mapreduce.map.output.compress.codec: The codec to use if the map outputs are compressed. Default is org.apache.hadoop.io.compress.DefaultCodec.
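As a sanity check on the settings above, the following sketch parses a configuration fragment with the JDK's XML parser and resolves the two properties, falling back to the defaults listed in the descriptions. Hadoop's own `Configuration` class loads mapred-site.xml for you, so this is only an illustration of how the values resolve. The class name `MapredSiteCheck`, the helper `readProperties` and the inline `SAMPLE_XML` string are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class MapredSiteCheck {

    // Hypothetical stand-in for the contents of mapred-site.xml.
    public static final String SAMPLE_XML =
        "<configuration>"
        + "<property><name>mapreduce.map.output.compress</name><value>true</value></property>"
        + "<property><name>mapreduce.map.output.compress.codec</name>"
        + "<value>org.apache.hadoop.io.compress.SnappyCodec</value></property>"
        + "</configuration>";

    // Reads <property><name>..</name><value>..</value></property> entries into a map.
    public static Map<String, String> readProperties(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList props = doc.getElementsByTagName("property");
            Map<String, String> result = new LinkedHashMap<>();
            for (int i = 0; i < props.getLength(); i++) {
                Element p = (Element) props.item(i);
                String name = p.getElementsByTagName("name").item(0).getTextContent().trim();
                String value = p.getElementsByTagName("value").item(0).getTextContent().trim();
                result.put(name, value);
            }
            return result;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse configuration XML", e);
        }
    }

    public static void main(String[] args) {
        Map<String, String> props = readProperties(SAMPLE_XML);
        // Fall back to the documented defaults when a property is absent.
        boolean compress = Boolean.parseBoolean(
            props.getOrDefault("mapreduce.map.output.compress", "false"));
        String codec = props.getOrDefault(
            "mapreduce.map.output.compress.codec",
            "org.apache.hadoop.io.compress.DefaultCodec");
        System.out.println("map output compression enabled: " + compress);
        System.out.println("codec: " + codec);
    }
}
```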
If you want to compress the map output on a per-job basis instead, add the following lines to your job.
// Set these before passing conf to Job.getInstance(conf), which copies the configuration
Configuration conf = new Configuration();
// Enable compression of intermediate map output
conf.setBoolean("mapreduce.map.output.compress", true);
// Use Snappy as the map output codec
conf.set("mapreduce.map.output.compress.codec", "org.apache.hadoop.io.compress.SnappyCodec");
That’s all for the topic How to Compress Map Phase Output in Hadoop MapReduce. If something is missing or you have something to share about the topic please write a comment.