How to Use LZO Compression in Hadoop

This post shows how to use LZO compression in Hadoop. The topics covered are:

  1. Installing LZO packages.
  2. Downloading and packaging hadoop-lzo. Using hadoop-lzo makes LZO compressed files splittable when they are used as input to a MapReduce job.
  3. Configuring LZO packages so that you can use LZO compression in Hadoop.
  4. A Java program that compresses a file using LzopCodec.
  5. An example showing LZO compression in Hadoop MapReduce.
  6. How to index a .lzo file to make it splittable.

Installing LZO packages

To install the LZO packages on Ubuntu, use the following command.
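
The command is not shown in this copy of the post. A minimal sketch, assuming the LZO runtime library and development headers are packaged on Ubuntu as liblzo2-2 and liblzo2-dev (check the package names for your release):

```shell
# Assumed Ubuntu package names: LZO runtime library and development headers
sudo apt-get install liblzo2-2 liblzo2-dev
```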

Downloading and packaging hadoop-lzo

You need the hadoop-lzo jar in order to make LZO files splittable. To get it, clone the hadoop-lzo repository and build it.

Another option is to use the rpm package, which you can download from https://code.google.com/archive/p/hadoop-gpl-packing/downloads

This post shows the steps for cloning and building it. Refer to https://github.com/twitter/hadoop-lzo for further details.

Maven is also required for packaging the cloned code. If you don’t have Maven installed, you can install it on your system using the following command.
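
The original command is missing here. On Ubuntu, Maven is normally available from the standard repositories under the package name maven, so the install would look something like this:

```shell
# Install Maven from the Ubuntu repositories (package name assumed to be "maven")
sudo apt install maven

# Confirm that Maven is on the PATH
mvn -version
```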

Clone the hadoop-lzo repository.
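
The clone command is not shown in this copy. Using the repository URL given above, it would be along these lines:

```shell
# Clone the hadoop-lzo sources into a directory named hadoop-lzo
git clone https://github.com/twitter/hadoop-lzo.git
```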

To compile the code and build the hadoop-lzo jar, change directory to your cloned hadoop-lzo directory and use the following commands.

This should create a target folder containing the built jar: hadoop-lzo-0.4.21-SNAPSHOT.jar.
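
The build commands are missing from this copy. A sketch using standard Maven goals, assuming the repository was cloned into a directory named hadoop-lzo and that a JDK and the LZO development headers are already installed:

```shell
# Move into the cloned repository (directory name assumed from the clone step)
cd hadoop-lzo

# Remove any previous build output, then compile, test and package the jar
mvn clean install

# List the jar produced by the build
ls target/hadoop-lzo-*.jar
```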

Configuration for using LZO compression with Hadoop

Since you are going to use LZO compression with a MapReduce job, copy the hadoop-lzo jar to /share/hadoop/mapreduce/lib in your $HADOOP_INSTALLATION_DIR.
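
As a sketch, the copy could look like this. HADOOP_LZO_HOME is a hypothetical variable introduced here for the directory where hadoop-lzo was cloned and built; it is not something Hadoop defines.

```shell
# Hypothetical: replace with the directory where you cloned and built hadoop-lzo
HADOOP_LZO_HOME=/path/to/hadoop-lzo

# Copy the built jar into the MapReduce lib directory of the Hadoop installation
cp "$HADOOP_LZO_HOME/target/hadoop-lzo-0.4.21-SNAPSHOT.jar" \
   "$HADOOP_INSTALLATION_DIR/share/hadoop/mapreduce/lib/"
```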

Also add the jar to the Hadoop classpath. For that, add the following to $HADOOP_INSTALLATION_DIR/etc/hadoop/hadoop-env.sh
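
The lines to add are not shown in this copy. A minimal sketch, again using the hypothetical HADOOP_LZO_HOME variable for the hadoop-lzo build directory; it appends the jar to any classpath that is already set instead of overwriting it:

```shell
# Hypothetical: replace with the directory where you cloned and built hadoop-lzo
HADOOP_LZO_HOME=/path/to/hadoop-lzo

# Append the hadoop-lzo jar to HADOOP_CLASSPATH, keeping existing entries if any
export HADOOP_CLASSPATH="${HADOOP_CLASSPATH:+$HADOOP_CLASSPATH:}$HADOOP_LZO_HOME/target/hadoop-lzo-0.4.21-SNAPSHOT.jar"
```

Depending on how the jar was built, the native LZO libraries may also need to be made visible to Hadoop. The hadoop-lzo README covers that case.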

You will also need to update the configuration file $HADOOP_INSTALLATION_DIR/etc/hadoop/core-site.xml to register the external codecs.
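
The configuration itself is missing from this copy. The fragment below follows the property names documented in the hadoop-lzo README. The list of built-in codecs in front of the two LZO entries is an assumption and should match whatever codecs your cluster already registers. Place the properties inside the existing <configuration> element.

```xml
<!-- Register the LZO codecs alongside the built-in ones (built-in list is an assumption) -->
<property>
  <name>io.compression.codecs</name>
  <value>org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.BZip2Codec,com.hadoop.compression.lzo.LzoCodec,com.hadoop.compression.lzo.LzopCodec</value>
</property>
<!-- Class that implements the LZO codec -->
<property>
  <name>io.compression.codec.lzo.class</name>
  <value>com.hadoop.compression.lzo.LzoCodec</value>
</property>
```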

Example Java program to use LZO compression in Hadoop

Here is a Java program that compresses a file using LzopCodec. The input file is in the local file system, whereas the compressed output file is stored in HDFS.

Make sure that you have added the hadoop-lzo jar you built to the Java build path.

Java Code

Executing program in Hadoop environment

To execute the above Java program in a Hadoop environment, you need to add the directory containing the .class file for the Java program to Hadoop’s classpath.

My LzoCompress.class file is in /huser/eclipse-workspace/knpcode/bin, so I have exported that path.
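
The export is not shown in this copy. With the path mentioned above it would presumably be:

```shell
# Make the directory holding the compiled class visible to the hadoop launcher
export HADOOP_CLASSPATH=/huser/eclipse-workspace/knpcode/bin
```

This sets the variable for the current shell session only. The entry added in hadoop-env.sh is still appended when the hadoop command starts, provided that file extends the existing value instead of replacing it.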

Then you can run the program using the following command:
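
The command is missing here. The hadoop launcher accepts a class name to run, so a sketch would be the following. The package of LzoCompress is not stated in the post, so the bare class name is an assumption; if the class is declared in a package, use its fully qualified name instead.

```shell
# Run the compiled class with Hadoop's classpath and configuration
# (use the fully qualified name if LzoCompress is declared in a package)
hadoop LzoCompress
```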

You can also check how many HDFS blocks the compressed file occupies.
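
One way to do that check is with fsck. The path /user/compout/data.lzo is inferred from the index file name that appears later in this post:

```shell
# Report the files and block list for the compressed output file
hdfs fsck /user/compout/data.lzo -files -blocks
```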

The file is big enough to occupy 4 HDFS blocks, which helps in checking whether MapReduce is able to create splits for the compressed file.

Using LZO compression in Hadoop MapReduce

Let’s create a simple MapReduce job that uses the created .lzo file as input. To use an LZO compressed file as input in Hadoop MapReduce, the input format has to be LzoTextInputFormat.

If you run this MapReduce job you can see that only one split is created.

Without an index, the LZO compressed file cannot be split, so the whole file is used as a single input split, which means only one map task processes the entire file. To make the LZO file splittable you have to run the indexer. You can run the LZO indexer as a Java program or as a MapReduce job.

Running lzo indexer as Java program
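
The command is not shown in this copy. Based on the hadoop-lzo README, the in-process indexer is the LzoIndexer class. The jar location uses the hypothetical HADOOP_LZO_HOME variable, and the file path is inferred from the index file name mentioned below.

```shell
# Hypothetical: replace with the directory where you cloned and built hadoop-lzo
HADOOP_LZO_HOME=/path/to/hadoop-lzo

# Index the .lzo file in a single local process
hadoop jar "$HADOOP_LZO_HOME/target/hadoop-lzo-0.4.21-SNAPSHOT.jar" \
  com.hadoop.compression.lzo.LzoIndexer /user/compout/data.lzo
```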

Running lzo indexer as MapReduce job
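
The command is not shown in this copy. The hadoop-lzo README documents DistributedLzoIndexer for running the indexing as a MapReduce job. As before, HADOOP_LZO_HOME is a hypothetical variable and the file path is inferred from the index file name mentioned below.

```shell
# Hypothetical: replace with the directory where you cloned and built hadoop-lzo
HADOOP_LZO_HOME=/path/to/hadoop-lzo

# Index the .lzo file using a MapReduce job
hadoop jar "$HADOOP_LZO_HOME/target/hadoop-lzo-0.4.21-SNAPSHOT.jar" \
  com.hadoop.compression.lzo.DistributedLzoIndexer /user/compout/data.lzo
```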

Either way it should create a .index file (/user/compout/data.lzo.index), which means your .lzo file is successfully indexed and is now splittable. To check it, run the MapReduce job again.

In the console you can see that the job now creates 4 input splits, corresponding to the 4 HDFS blocks.

Reference: https://gist.github.com/zedar/c43cbc7ff7f98abee885

https://github.com/twitter/hadoop-lzo

That’s all for the topic How to Use LZO Compression in Hadoop. If something is missing or you have something to share about the topic please write a comment.

