In this post we'll see a Java program that shows how to compress a file using the gzip format in Hadoop.
The gzip compression format does not support splitting, so a MapReduce job cannot create multiple input splits for a gzip file, although the compressed file is still stored as separate HDFS blocks (128 MB each by default).
Java program to compress file using gzip format
The Hadoop compression codec to use for gzip is org.apache.hadoop.io.compress.GzipCodec. To obtain that codec, the getCodecByClassName() method of the CompressionCodecFactory class is used.
To create a CompressionOutputStream, the createOutputStream(OutputStream out) method of the codec class is used. You then use the CompressionOutputStream to write the file data to the stream in compressed form.
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;
import org.apache.hadoop.io.compress.CompressionOutputStream;

public class GzipCompress {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    InputStream in = null;
    OutputStream out = null;
    try {
      FileSystem fs = FileSystem.get(conf);
      // Input file from local file system
      in = new BufferedInputStream(new FileInputStream("/home/knpcode/Documents/knpcode/Hadoop/Test/data.txt"));
      // Compressed output file
      Path outFile = new Path("/user/compout/test.gz");
      // Verification
      if (fs.exists(outFile)) {
        System.out.println("Output file already exists");
        throw new IOException("Output file already exists");
      }
      out = fs.create(outFile);

      // For gzip compression
      CompressionCodecFactory factory = new CompressionCodecFactory(conf);
      CompressionCodec codec = factory.getCodecByClassName("org.apache.hadoop.io.compress.GzipCodec");
      CompressionOutputStream compressionOutputStream = codec.createOutputStream(out);

      try {
        IOUtils.copyBytes(in, compressionOutputStream, 4096, false);
        compressionOutputStream.finish();
      } finally {
        IOUtils.closeStream(in);
        IOUtils.closeStream(compressionOutputStream);
      }
    } catch (IOException e) {
      // TODO Auto-generated catch block
      e.printStackTrace();
    }
  }
}
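If you want to try the wrap, copy, finish pattern used in the program above without a Hadoop cluster, the following is a minimal local sketch. It is only a stand-in: it uses the JDK's java.util.zip classes rather than Hadoop's GzipCodec, and the class name LocalGzipRoundTrip and its helper methods are hypothetical names chosen for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical local stand-in: JDK classes only, no Hadoop involved.
class LocalGzipRoundTrip {

  // Wrap the raw output stream in a compressing stream, copy the bytes, then finish.
  static byte[] gzip(byte[] plain) {
    ByteArrayOutputStream raw = new ByteArrayOutputStream();
    try (InputStream in = new ByteArrayInputStream(plain);
         GZIPOutputStream gz = new GZIPOutputStream(raw)) {
      byte[] buffer = new byte[4096];
      int n;
      while ((n = in.read(buffer)) != -1) {
        gz.write(buffer, 0, n);
      }
      // Flushes remaining compressed data and writes the gzip trailer
      gz.finish();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return raw.toByteArray();
  }

  // Read the compressed bytes back through a decompressing stream.
  static byte[] gunzip(byte[] compressed) {
    ByteArrayOutputStream plain = new ByteArrayOutputStream();
    try (GZIPInputStream gz = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
      byte[] buffer = new byte[4096];
      int n;
      while ((n = gz.read(buffer)) != -1) {
        plain.write(buffer, 0, n);
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return plain.toByteArray();
  }

  public static void main(String[] args) {
    byte[] data = "some text to compress, some text to compress".getBytes(StandardCharsets.UTF_8);
    byte[] packed = gzip(data);
    System.out.println("Round trip ok: " + Arrays.equals(data, gunzip(packed)));
  }
}
```

The call to finish() plays the same role as compressionOutputStream.finish() in the Hadoop program: it completes the compressed stream before the underlying stream is closed.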
Executing program in Hadoop environment
To execute the above Java program in a Hadoop environment, you need to add the directory containing the program's .class file to Hadoop's classpath.
export HADOOP_CLASSPATH='/huser/eclipse-workspace/knpcode/bin'
I have my GzipCompress.class file in /huser/eclipse-workspace/knpcode/bin, so I have exported that path.
Then you can run the program using the following command:
$ hadoop org.knpcode.GzipCompress
18/03/11 12:59:49 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
18/03/11 12:59:49 INFO compress.CodecPool: Got brand-new compressor [.gz]
The input file used in the program is large enough that the file is still larger than 128 MB after compression, which ensures that it is stored as two separate blocks in HDFS.
You can check that by using the hdfs fsck command.
$ hdfs fsck /user/compout/test.gz
.Status: HEALTHY
 Total size: 233963084 B
 Total dirs: 0
 Total files: 1
 Total symlinks: 0
 Total blocks (validated): 2 (avg. block size 116981542 B)
FSCK ended at Wed Mar 14 21:07:46 IST 2018 in 6 milliseconds
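The block count in the fsck output follows from simple arithmetic on the file size and the default 128 MB block size. A small sketch of that calculation is shown here; the class name HdfsBlockMath and the method blocksFor are hypothetical helpers, not part of any Hadoop API.

```java
// Hypothetical helper that reproduces the block arithmetic from the fsck output.
class HdfsBlockMath {

  // Number of blocks needed for a file: ceiling of fileSize / blockSize.
  static long blocksFor(long fileSizeBytes, long blockSizeBytes) {
    return (fileSizeBytes + blockSizeBytes - 1) / blockSizeBytes;
  }

  public static void main(String[] args) {
    long fileSize = 233963084L;           // "Total size" reported by hdfs fsck
    long blockSize = 128L * 1024 * 1024;  // default HDFS block size, 134217728 bytes

    long blocks = blocksFor(fileSize, blockSize);
    System.out.println("Blocks: " + blocks);                        // prints "Blocks: 2"
    System.out.println("Average block size: " + fileSize / blocks); // prints "Average block size: 116981542"
  }
}
```

The file is larger than one block (134217728 bytes) but smaller than two, so it occupies two blocks, and the average of 116981542 bytes matches the value reported by fsck.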
Since gzip doesn't support splitting, using this compressed file as input for a MapReduce job means only one split is created for the Map task.
To test how many input splits are created, I gave this gzip-compressed file as input to the wordcount MapReduce program.
$ hadoop jar /home/knpcode/Documents/knpcode/Hadoop/wordcount.jar org.knpcode.WordCount /user/compout/test.gz /user/output3
18/03/11 13:09:23 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
18/03/11 13:09:23 WARN mapreduce.JobResourceUploader: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
18/03/11 13:09:23 INFO input.FileInputFormat: Total input files to process : 1
18/03/11 13:09:24 INFO mapreduce.JobSubmitter: number of splits:1
As you can see from the line mapreduce.JobSubmitter: number of splits:1 displayed on the console, only one input split is created for the MapReduce job even though there are two HDFS blocks, because a gzip-compressed file is not splittable.
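To get a feel for why a reader cannot simply start at the second block, here is a rough local sketch using only the JDK's java.util.zip classes (not Hadoop). It compresses some sample text and then tries to decompress starting from the middle of the compressed bytes. The class name GzipMidStreamDemo, its methods, and the sample data are all hypothetical and exist only for this illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical demo: a gzip stream can be decoded from its start, not from the middle.
class GzipMidStreamDemo {

  // Made-up sample input, large enough to give a compressed stream of some length.
  static byte[] sampleText() {
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < 20000; i++) {
      sb.append("record ").append(i).append(" of the sample input\n");
    }
    return sb.toString().getBytes(StandardCharsets.UTF_8);
  }

  static byte[] gzip(byte[] plain) {
    ByteArrayOutputStream raw = new ByteArrayOutputStream();
    try (GZIPOutputStream gz = new GZIPOutputStream(raw)) {
      gz.write(plain);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return raw.toByteArray();
  }

  // Returns true only if a complete gzip stream can be decoded starting at the given offset.
  static boolean decodableFrom(byte[] compressed, int offset) {
    try (GZIPInputStream gz = new GZIPInputStream(
        new ByteArrayInputStream(compressed, offset, compressed.length - offset))) {
      byte[] buffer = new byte[4096];
      while (gz.read(buffer) != -1) {
        // discard the decompressed bytes, only success or failure matters here
      }
      return true;
    } catch (IOException e) {
      // Missing gzip header or corrupt data when starting mid-stream
      return false;
    }
  }

  public static void main(String[] args) {
    byte[] packed = gzip(sampleText());
    System.out.println("Decodable from start:  " + decodableFrom(packed, 0));
    System.out.println("Decodable from middle: " + decodableFrom(packed, packed.length / 2));
  }
}
```

A mapper assigned only the second HDFS block would be in the same position as the mid-stream read above, which is why the whole file goes to a single Map task.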
That's all for the topic Java Program to Compress File in gzip Format in Hadoop. If something is missing or you have something to share about the topic please write a comment.