This post shows how to write a Java program to compress a file in HDFS using bzip2 compression. The program reads an input file from the local file system and writes a BZip2-compressed file as output in HDFS.
Java program to compress file in bzip2 format
The Hadoop compression codec that has to be used for bzip2 is org.apache.hadoop.io.compress.BZip2Codec.

To get that codec, the getCodecByClassName() method of the CompressionCodecFactory class is used.

To create a CompressionOutputStream, the createOutputStream(OutputStream out) method of the codec class is used. The CompressionOutputStream is then used to write the file data in compressed form to the stream.
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;
import org.apache.hadoop.io.compress.CompressionOutputStream;

public class HDFSCompressWrite {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    InputStream in = null;
    OutputStream out = null;
    try {
      FileSystem fs = FileSystem.get(conf);
      // Input file from local file system
      in = new BufferedInputStream(new FileInputStream("/home/knpcode/Documents/knpcode/Hadoop/Test/data.txt"));
      // Compressed output file in HDFS
      Path outFile = new Path("/user/compout/test.bz2");
      // Verify output file doesn't already exist
      if (fs.exists(outFile)) {
        System.out.println("Output file already exists");
        throw new IOException("Output file already exists");
      }
      out = fs.create(outFile);
      // For bzip2 compression
      CompressionCodecFactory factory = new CompressionCodecFactory(conf);
      CompressionCodec codec = factory.getCodecByClassName("org.apache.hadoop.io.compress.BZip2Codec");
      CompressionOutputStream compressionOutputStream = codec.createOutputStream(out);
      try {
        IOUtils.copyBytes(in, compressionOutputStream, 4096, false);
        compressionOutputStream.finish();
      } finally {
        IOUtils.closeStream(in);
        IOUtils.closeStream(compressionOutputStream);
      }
    } catch (IOException e) {
      e.printStackTrace();
    }
  }
}
Executing program in Hadoop environment
To execute the above Java program in a Hadoop environment, you will need to add the directory containing the .class file for the Java program to Hadoop’s classpath.
export HADOOP_CLASSPATH='/huser/eclipse-workspace/knpcode/bin'

I have my HDFSCompressWrite.class file in the location /huser/eclipse-workspace/knpcode/bin, so I have exported that path.
Then you can run the program using the following command-
$ hadoop org.knpcode.HDFSCompressWrite

18/03/09 17:10:04 INFO bzip2.Bzip2Factory: Successfully loaded & initialized native-bzip2 library system-native
18/03/09 17:10:04 INFO compress.CodecPool: Got brand-new compressor [.bz2]
The input file used in the program is large enough that even after compression the file size is more than 128 MB; that way we can ensure it is stored as two separate blocks in HDFS. Since the bzip2 format in Hadoop supports splitting, a MapReduce job that takes this compressed file as input should be able to create two separate input splits corresponding to the two blocks.
First, check whether the compressed output file in bzip2 format has been created.
$ hdfs dfs -ls /user/compout

Found 1 items
-rw-r--r--   1 knpcode supergroup  228651107 2018-03-09 17:11 /user/compout/test.bz2
You can see the compressed file size is around 228 MB, so it should be stored as two separate blocks in HDFS.
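As a quick sanity check, the expected block count is just ceiling division of the file size by the HDFS block size. The sketch below assumes the default 128 MB block size; your cluster's dfs.blocksize setting may differ.

```java
public class BlockCountCheck {
    public static void main(String[] args) {
        long fileSize = 228_651_107L;         // size of test.bz2 as reported by hdfs dfs -ls
        long blockSize = 128L * 1024 * 1024;  // default HDFS block size: 134,217,728 bytes

        // Ceiling division: number of HDFS blocks needed to hold the file
        long blocks = (fileSize + blockSize - 1) / blockSize;

        System.out.println(blocks); // prints 2
    }
}
```

Since 228,651,107 bytes is more than one 128 MB block but less than two, the file occupies two blocks, and a splittable format like bzip2 lets MapReduce assign one input split per block.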
You can check that using HDFS fsck command.
$ hdfs fsck /user/compout/test.bz2

 Status: HEALTHY
  Total size:	228651107 B
  Total dirs:	0
  Total files:	1
  Total symlinks:	0
  Total blocks (validated):	2 (avg. block size 114325553 B)
  Minimally replicated blocks:	2 (100.0 %)
  Over-replicated blocks:	0 (0.0 %)
  Under-replicated blocks:	0 (0.0 %)
FSCK ended at Fri Mar 09 17:17:13 IST 2018 in 3 milliseconds
If you give this compressed file as input to a MapReduce job, the job should be able to create two input splits, as the bzip2 format supports splitting. To check that, I gave this file as input to a wordcount MapReduce program.
$ hadoop jar /home/knpcode/Documents/knpcode/Hadoop/wordcount.jar org.knpcode.WordCount /user/compout/test.bz2 /user/output2

18/03/11 12:48:28 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
18/03/11 12:48:29 WARN mapreduce.JobResourceUploader: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
18/03/11 12:48:30 INFO input.FileInputFormat: Total input files to process : 1
18/03/11 12:48:30 INFO mapreduce.JobSubmitter: number of splits:2
As you can see from the statement displayed on the console, "mapreduce.JobSubmitter: number of splits:2", two splits are created for the map tasks.
That's all for the topic Java Program to Compress File in bzip2 Format in Hadoop. If something is missing or you have something to share about the topic please write a comment.
You may also like
- How to Use LZO Compression in Hadoop
- MapReduce Execution Internal Steps in YARN
- Java Program to Write a File in HDFS
- Type Casting And Type Conversion in Java
- Java Consumer Functional Interface Examples
- How to Iterate a Java HashSet
- Java Program to Check Given String Palindrome or Not
- Spring Boot Microservices Eureka + Ribbon