Java Program to Compress File in gzip Format in Hadoop

In this post we’ll see a Java program that shows how to compress a file using the gzip format in Hadoop.

The gzip compression format does not support splitting, so a MapReduce job cannot create more than one input split for a gzip file, although the compressed file is still stored as separate HDFS blocks (128 MB each by default).
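The difference between the number of HDFS blocks and the number of input splits can be illustrated with a little arithmetic. The following is a minimal sketch in plain Java, not Hadoop code: the class and method names are hypothetical, the 200 MB file size is made up for illustration, and the split rule is deliberately simplified (it ignores configured minimum and maximum split sizes).

```java
// Simplified model of HDFS blocks vs. MapReduce input splits for one file.
// Names and the split rule are illustrative assumptions, not Hadoop APIs.
class BlockSplitEstimate {
    static final long DEFAULT_BLOCK_SIZE = 128L * 1024 * 1024; // 128 MB

    // Number of HDFS blocks needed to store fileSize bytes (ceiling division).
    static long blockCount(long fileSize, long blockSize) {
        if (fileSize <= 0 || blockSize <= 0) {
            throw new IllegalArgumentException("sizes must be positive");
        }
        return (fileSize + blockSize - 1) / blockSize;
    }

    // Simplified split count: one split per block for a splittable format,
    // exactly one split for a non-splittable format such as gzip.
    static long splitCount(long fileSize, long blockSize, boolean splittable) {
        return splittable ? blockCount(fileSize, blockSize) : 1;
    }

    public static void main(String[] args) {
        long fileSize = 200L * 1024 * 1024; // hypothetical 200 MB compressed file
        System.out.println("blocks: " + blockCount(fileSize, DEFAULT_BLOCK_SIZE));               // blocks: 2
        System.out.println("splits (gzip): " + splitCount(fileSize, DEFAULT_BLOCK_SIZE, false)); // splits (gzip): 1
    }
}
```

For the hypothetical 200 MB gzip file this reports two blocks but only one split.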

Java program to compress file using gzip format

The Hadoop compression codec to use for gzip is “org.apache.hadoop.io.compress.GzipCodec”.

To get that codec, use the getCodecByClassName method of the CompressionCodecFactory class.

To create a CompressionOutputStream, use the createOutputStream(OutputStream out) method of the codec class. You then use the CompressionOutputStream to write the file data to the stream in compressed form.
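The stream-wrapping pattern described above can be tried without a cluster. The sketch below uses the JDK's java.util.zip.GZIPOutputStream as a stand-in for the CompressionOutputStream that the gzip codec would return, and in-memory byte arrays in place of HDFS files. The class name GzipStreamSketch and its helper methods are hypothetical, and this is not the Hadoop program itself.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// JDK-only stand-in for the Hadoop flow: wrap a raw output stream in a
// compressing stream, then copy the input through it.
class GzipStreamSketch {
    // Copies everything from in to out using a small buffer.
    static void copy(InputStream in, OutputStream out) throws IOException {
        byte[] buffer = new byte[4096];
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
        }
    }

    // In Hadoop, the wrapper would come from codec.createOutputStream(rawOut)
    // and rawOut would be a stream opened on an HDFS path.
    static void compress(InputStream in, OutputStream rawOut) throws IOException {
        try (OutputStream gzipOut = new GZIPOutputStream(rawOut)) {
            copy(in, gzipOut);
        } // closing the wrapper writes the gzip trailer
    }

    // Convenience wrapper: compress a byte array in memory.
    static byte[] gzip(byte[] data) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try {
            compress(new ByteArrayInputStream(data), sink);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.toByteArray();
    }

    // Convenience wrapper: decompress a gzip byte array in memory.
    static byte[] gunzip(byte[] gz) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (InputStream in = new GZIPInputStream(new ByteArrayInputStream(gz))) {
            copy(in, out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 1000; i++) {
            sb.append("hello hadoop ");
        }
        byte[] original = sb.toString().getBytes(StandardCharsets.UTF_8);
        byte[] gz = gzip(original);
        System.out.println(original.length + " bytes in, " + gz.length + " bytes compressed");
        System.out.println("round trip ok: " + Arrays.equals(original, gunzip(gz)));
    }
}
```

The compress method takes plain InputStream and OutputStream arguments so that the same copy loop would work whether the wrapper comes from the JDK or from a Hadoop codec.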

Java code

Executing program in Hadoop environment

To execute the above Java program in a Hadoop environment, you need to add the directory containing the program’s .class file to Hadoop’s classpath.

export HADOOP_CLASSPATH='/huser/eclipse-workspace/knpcode/bin'

My GzipCompress.class file is in /huser/eclipse-workspace/knpcode/bin, so that is the path I have exported.

Then you can run the program using the hadoop command followed by the fully qualified name of the class.

The input file used in the program is large enough that the file is still larger than 128 MB after compression, which ensures that it is stored as two separate blocks in HDFS.

You can check that by using the hdfs fsck command, for example with its -files and -blocks options, which list the blocks that make up each file.

Since gzip doesn’t support splitting, using this compressed file as input for a MapReduce job means that only one input split is created for the map task.
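One way to see why a single split is the result: a standard gzip reader expects to start at the stream header, so a map task could not begin reading at an arbitrary block boundary in the middle of the file. The sketch below demonstrates this with JDK classes only (no Hadoop involved). The class name, method names and generated sample data are all hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Shows that a gzip stream decodes from its start but not from the middle.
class GzipMidStreamSketch {
    // Compresses a byte array in memory with the JDK gzip stream.
    static byte[] gzip(byte[] data) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(sink)) {
            out.write(data);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.toByteArray();
    }

    // Generates some compressible text to stand in for a real input file.
    static byte[] sampleData() {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 50000; i++) {
            sb.append("line ").append(i).append('\n');
        }
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }

    // Returns true if a gzip reader that starts at the given byte offset
    // can decode the rest of the data without an error.
    static boolean canDecodeFrom(byte[] gz, int offset) {
        try (GZIPInputStream in = new GZIPInputStream(
                new ByteArrayInputStream(gz, offset, gz.length - offset))) {
            byte[] buffer = new byte[4096];
            while (in.read(buffer) != -1) {
                // discard the decoded bytes; only success or failure matters
            }
            return true;
        } catch (IOException e) {
            return false; // e.g. missing gzip header or corrupt data
        }
    }

    public static void main(String[] args) {
        byte[] gz = gzip(sampleData());
        System.out.println("decode from start:  " + canDecodeFrom(gz, 0));
        System.out.println("decode from middle: " + canDecodeFrom(gz, gz.length / 2));
    }
}
```

Reading from offset 0 succeeds, while the attempt that starts halfway through the compressed bytes is expected to fail, which mirrors what would happen if a second map task tried to start at the second HDFS block.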

To test how many input splits are created, I gave this compressed gzip file as input to the wordcount MapReduce program.

As the console line mapreduce.JobSubmitter: number of splits:1 shows, only one input split is created for the MapReduce job even though there are two HDFS blocks, because a gzip-compressed file is not splittable.

That’s all for the topic Java Program to Compress File in gzip Format in Hadoop. If something is missing or you have something to share about the topic please write a comment.

