Java Program to Compress File in bzip2 Format in Hadoop

This post shows how to write a Java program that compresses a file in HDFS using bzip2 compression. The program takes an input file from the local file system and writes a bzip2-compressed file as output in HDFS.

Java program to compress file in bzip2 format

The Hadoop compression codec to use for bzip2 is “org.apache.hadoop.io.compress.BZip2Codec”.

To get that codec, the getCodecByClassName method of the CompressionCodecFactory class is used.

To create a CompressionOutputStream, the createOutputStream(OutputStream out) method of the codec class is used. You then use the CompressionOutputStream to write the file data to the stream in compressed form.

Java code
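What follows is a minimal sketch of such a program, not necessarily the exact listing this post was written around. The class name HDFSCompressWrite comes from the post. The compress helper method and the use of command-line arguments for the local input path and the HDFS output path are assumptions made for illustration. When the program is launched with the hadoop command, the Configuration object is expected to pick up the cluster settings, so the output path resolves against HDFS.

```java
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;
import org.apache.hadoop.io.compress.CompressionOutputStream;

class HDFSCompressWrite {

    // Hypothetical helper: reads a local file and writes it, bzip2-compressed,
    // to the given path on the file system configured in conf.
    static void compress(Configuration conf, String localInput, String output)
            throws IOException {
        FileSystem fs = FileSystem.get(conf);

        // Look up the bzip2 codec by its class name.
        CompressionCodecFactory factory = new CompressionCodecFactory(conf);
        CompressionCodec codec =
                factory.getCodecByClassName("org.apache.hadoop.io.compress.BZip2Codec");
        if (codec == null) {
            throw new IOException("bzip2 codec not found");
        }

        try (InputStream in = new BufferedInputStream(new FileInputStream(localInput));
             CompressionOutputStream out =
                     codec.createOutputStream(fs.create(new Path(output)))) {
            // Copy the input through the compressing stream; do not let
            // copyBytes close the streams, try-with-resources handles that.
            IOUtils.copyBytes(in, out, 4096, false);
            // Flush any remaining compressed data before the stream is closed.
            out.finish();
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("Usage: HDFSCompressWrite <local input file> <output path>");
            System.exit(1);
        }
        compress(new Configuration(), args[0], args[1]);
    }
}
```

Giving the output path a .bz2 extension is a sensible choice, because Hadoop input formats usually pick the decompression codec from the file extension.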

Executing program in Hadoop environment

To execute the above Java program in a Hadoop environment, you need to add the directory containing the program's .class file to Hadoop’s classpath.

export HADOOP_CLASSPATH='/huser/eclipse-workspace/knpcode/bin'

My HDFSCompressWrite.class file is in /huser/eclipse-workspace/knpcode/bin, so that is the path I have exported.

Then you can run the program using the following command-

The input file used in the program is large enough that the compressed file is still larger than 128 MB (the default HDFS block size in Hadoop 2 and later), which ensures that it is stored as two separate blocks in HDFS. Since the bzip2 format supports splitting, a MapReduce job with this compressed file as input should be able to create two separate input splits, one for each block.

First, check whether the compressed output file in bzip2 format was created.

You can see that the compressed file size is around 228 MB, so it should be stored as two separate blocks in HDFS.
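The block count follows from simple arithmetic: divide the file size by the block size and round up. Here is a small sketch of that calculation, assuming a 128 MB block size. The BlockCount class and blocksNeeded method are hypothetical names used only for this illustration and are not part of the Hadoop API.

```java
class BlockCount {

    // Number of blocks needed to hold a file of the given size,
    // computed as ceil(fileSize / blockSize) using integer arithmetic.
    static long blocksNeeded(long fileSizeBytes, long blockSizeBytes) {
        if (blockSizeBytes <= 0) {
            throw new IllegalArgumentException("block size must be positive");
        }
        if (fileSizeBytes <= 0) {
            return 0; // an empty file occupies no blocks
        }
        return (fileSizeBytes + blockSizeBytes - 1) / blockSizeBytes;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        // A 228 MB file with a 128 MB block size needs 2 blocks.
        System.out.println(blocksNeeded(228 * mb, 128 * mb)); // prints 2
    }
}
```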

You can check that using the HDFS fsck command.

If you give this compressed file as input to a MapReduce job, the job should be able to create two input splits, because the bzip2 format supports splitting. To check that, I gave this file as input to a wordcount MapReduce program.

As you can see from the statement displayed on the console, “mapreduce.JobSubmitter: number of splits:2”, two splits are created for the map tasks.
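If you want to check from code whether a compressed file can be split, one option is to look up the codec for the file and test whether it implements SplittableCompressionCodec. The sketch below does that. The SplittableCheck class, the isSplittable method and the file names are made up for illustration. The logic is similar in spirit to the check a text input format performs, where a file with no matching codec is treated as uncompressed and therefore splittable.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;
import org.apache.hadoop.io.compress.SplittableCompressionCodec;

class SplittableCheck {

    // Returns true if a file with this name can be divided into input splits.
    // The codec is chosen from the file extension (for example .bz2 or .gz).
    static boolean isSplittable(Configuration conf, String fileName) {
        CompressionCodec codec =
                new CompressionCodecFactory(conf).getCodec(new Path(fileName));
        if (codec == null) {
            return true; // no codec matched, so the file is treated as uncompressed
        }
        return codec instanceof SplittableCompressionCodec;
    }

    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Hypothetical file names, used only to show the extension-based lookup.
        System.out.println("data.bz2 splittable: " + isSplittable(conf, "data.bz2"));
        System.out.println("data.gz splittable: " + isSplittable(conf, "data.gz"));
    }
}
```

With the default codecs, the bzip2 file should be reported as splittable and the gzip file as not splittable, which is why bzip2 is a reasonable choice for large compressed inputs to MapReduce.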

That’s all for the topic Java Program to Compress File in bzip2 Format in Hadoop. If something is missing or you have something to share about the topic please write a comment.

