How to Compress Intermediate Output of the Map Phase

In a Hadoop MapReduce job you can opt to compress the output of the map phase. Because the output of each map task is written to local disk and then transferred across the network to the reducer nodes, compressing this intermediate output reduces both disk I/O and shuffle traffic, which usually helps the job run faster.

You can use a fast compressor such as Snappy or LZ4 for map output. Whether the codec is splittable does not matter here, because intermediate output is consumed by the reducers rather than divided into input splits.

The configuration steps below compress map output using the Snappy codec.

If the native Snappy library is not installed on your nodes, install it first through your package manager (for example, apt on Ubuntu). Using native compression libraries is faster than the alternatives and improves the performance of the MapReduce job.

Required config changes

To compress map output with Snappy for the whole cluster, set mapreduce.map.output.compress to true and mapreduce.map.output.compress.codec to org.apache.hadoop.io.compress.SnappyCodec in mapred-site.xml.
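As a rough sketch, the cluster-wide settings would look like the fragment below. It assumes the Snappy codec class org.apache.hadoop.io.compress.SnappyCodec, and the two property elements go inside the existing configuration element of mapred-site.xml.

```xml
<!-- mapred-site.xml: compress intermediate map output cluster-wide -->
<property>
  <name>mapreduce.map.output.compress</name>
  <value>true</value>
</property>
<property>
  <!-- Assumed codec: Snappy; requires the native Snappy library on every node -->
  <name>mapreduce.map.output.compress.codec</name>
  <value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>
```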

The properties are described below:

  • mapreduce.map.output.compress: whether the map outputs should be compressed before being sent across the network. Default is false.
  • mapreduce.map.output.compress.codec: the codec to use when map outputs are compressed. Default is org.apache.hadoop.io.compress.DefaultCodec.

If you want to compress map output on a per-job basis instead, set the same two properties on the job's Configuration in your driver, before the job is submitted.
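A minimal sketch of the per-job settings follows. The class name MapOutputCompression and its methods are hypothetical, introduced only for illustration; the property names come from the list above, and the Snappy codec class name is assumed to be org.apache.hadoop.io.compress.SnappyCodec. The helper applies the settings through any (key, value) setter, so it compiles without Hadoop on the classpath.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.BiConsumer;

/** Hypothetical helper: keeps the two map-output compression settings in one place. */
final class MapOutputCompression {
    static final String COMPRESS = "mapreduce.map.output.compress";
    static final String CODEC = "mapreduce.map.output.compress.codec";
    // Assumption: Hadoop's Snappy codec class; swap in another codec class name if needed.
    static final String SNAPPY = "org.apache.hadoop.io.compress.SnappyCodec";

    /** Returns the property names and values in insertion order. */
    static Map<String, String> settings() {
        Map<String, String> props = new LinkedHashMap<>();
        props.put(COMPRESS, "true");
        props.put(CODEC, SNAPPY);
        return props;
    }

    /** Passes each (name, value) pair to the given setter. */
    static void applyTo(BiConsumer<String, String> setter) {
        settings().forEach(setter);
    }
}
```

In a driver, assuming a Job instance named job, the call would be MapOutputCompression.applyTo(job.getConfiguration()::set), which has the same effect as calling set on the job's Configuration once for each property.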


