In a Hadoop MapReduce job you can opt to compress the output of the Map phase. Since the output of each Map task is written to local disk and then transferred across the network to the reducer nodes, compressing map phase output reduces both disk I/O and network traffic, which usually helps your MapReduce job run faster.
You can use a fast compressor like Snappy or LZ4 for compressing map output. Whether the compressor is splittable or not doesn't matter for intermediate Map output, since splittability only matters for files that are read as input by map tasks.
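To get a feel for why compressing intermediate data pays off, here is a minimal sketch in plain Java. It is not Hadoop code: it uses the JDK's DEFLATE implementation (java.util.zip) as a stand-in for a map-output codec, and the class name, the helper methods and the word-count style sample records are all made up for illustration. Real ratios depend on your data and on the codec you choose.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Illustration only: NOT Hadoop code. JDK DEFLATE stands in for a map-output codec.
class MapOutputCompressionSketch {

    // Builds fake word-count style map output: one "word<TAB>1" line per record.
    static byte[] sampleMapOutput(int records) {
        String[] words = {"hadoop", "map", "reduce", "shuffle", "codec"};
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < records; i++) {
            sb.append(words[i % words.length]).append('\t').append(1).append('\n');
        }
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }

    // Compresses the bytes, favouring speed over ratio (as you would for intermediate data).
    static byte[] compress(byte[] input) {
        Deflater deflater = new Deflater(Deflater.BEST_SPEED);
        deflater.setInput(input);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        while (!deflater.finished()) {
            int n = deflater.deflate(buffer);
            out.write(buffer, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    // Restores the original bytes, which is what the reduce side has to do after the shuffle.
    static byte[] decompress(byte[] input) {
        Inflater inflater = new Inflater();
        inflater.setInput(input);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buffer);
                if (n == 0 && inflater.needsInput()) {
                    throw new IllegalStateException("compressed data is truncated");
                }
                out.write(buffer, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalStateException("compressed data is corrupt", e);
        } finally {
            inflater.end();
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] raw = sampleMapOutput(100_000);
        byte[] packed = compress(raw);
        System.out.println("raw bytes:        " + raw.length);
        System.out.println("compressed bytes: " + packed.length);
        System.out.println("round trip ok:    " + Arrays.equals(raw, decompress(packed)));
    }
}
```

Repetitive key/value records like these shrink a lot, so fewer bytes are written to local disk and shuffled across the network, at the cost of some CPU time on both the map and the reduce side.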
This tutorial gives the configuration steps for compressing Map output using the Snappy codec.
If you don't have the native Snappy compressor library, you can install it on Ubuntu using the following command. Using native libraries for compression is faster and helps improve the performance of the MapReduce job.
$ sudo apt-get install libsnappy-dev
- Refer to How to Check For Which Compressors Native Libraries Are Present to know how to check whether the native libraries for the compressors are present.
Required config changes
If you want to compress output of the map phase using Snappy compression at the whole cluster level, set the following properties in mapred-site.xml:
<property>
  <name>mapreduce.map.output.compress</name>
  <value>true</value>
</property>
<property>
  <name>mapreduce.map.output.compress.codec</name>
  <value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>
The properties are described as follows:
- mapreduce.map.output.compress: Whether the outputs of the maps should be compressed before being sent across the network. Default is false.
- mapreduce.map.output.compress.codec: The codec to use if the map outputs are compressed. Default is org.apache.hadoop.io.compress.DefaultCodec.
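If you want to sanity-check that these two properties are really present in a site file, a small stand-alone check along the following lines can help. This is only a sketch using the JDK's XML parser, not a Hadoop utility. The class name MapredSiteCheck and its methods are hypothetical, and the XML is an inline sample wrapped in the usual <configuration> root element rather than your actual mapred-site.xml.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of Hadoop: reads name/value pairs from site-file XML.
class MapredSiteCheck {

    // Collects every <property> element into a name -> value map.
    static Map<String, String> readProperties(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList props = doc.getElementsByTagName("property");
            Map<String, String> result = new LinkedHashMap<>();
            for (int i = 0; i < props.getLength(); i++) {
                Element p = (Element) props.item(i);
                String name = p.getElementsByTagName("name").item(0).getTextContent().trim();
                String value = p.getElementsByTagName("value").item(0).getTextContent().trim();
                result.put(name, value);
            }
            return result;
        } catch (Exception e) {
            throw new IllegalStateException("could not parse configuration XML", e);
        }
    }

    // A missing property counts as false, matching the documented default.
    static boolean mapOutputCompressionEnabled(Map<String, String> props) {
        return Boolean.parseBoolean(
                props.getOrDefault("mapreduce.map.output.compress", "false"));
    }

    public static void main(String[] args) {
        // Inline sample; in practice you would read the file contents instead.
        String xml = "<configuration>"
                + "<property><name>mapreduce.map.output.compress</name>"
                + "<value>true</value></property>"
                + "<property><name>mapreduce.map.output.compress.codec</name>"
                + "<value>org.apache.hadoop.io.compress.SnappyCodec</value></property>"
                + "</configuration>";
        Map<String, String> props = readProperties(xml);
        System.out.println("compression enabled: " + mapOutputCompressionEnabled(props));
        System.out.println("codec: " + props.get("mapreduce.map.output.compress.codec"));
    }
}
```

Note that this only inspects one file. The value a job actually sees can still be overridden elsewhere, for example in the job's own configuration.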
If you want to compress the map output on a per-job basis instead, add the following lines to your job driver.
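Configuration conf = new Configuration();
// Enable compression of intermediate map output for this job only
conf.setBoolean("mapreduce.map.output.compress", true);
// Use Snappy for the map output
conf.set("mapreduce.map.output.compress.codec", "org.apache.hadoop.io.compress.SnappyCodec");
// Use this conf when creating the Job so that the settings take effect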
That's all for the topic How to Compress Map Phase Output in Hadoop MapReduce. If something is missing or you have something to share about the topic please write a comment.