How to Read And Write SequenceFile in Hadoop

This post shows how to read and write a SequenceFile in Hadoop using the Java API and using a MapReduce job, and how to provide compression options for a SequenceFile.


Writing a sequence file Java program

SequenceFile provides a static method createWriter() to create a writer, which is used to write a SequenceFile in Hadoop. There are many overloaded variants of the createWriter() method (many of them deprecated now), but the variant used here is the following one.

Java Code

In the program the compression option is also set, and the compression codec used is GzipCodec.
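The original code listing did not survive extraction, so the following is a sketch of what such a writer program might look like. The class name SFWrite matches the class run later in the post; the IntWritable/Text key-value types and the sample records are assumptions.

```java
// Sketch of a SequenceFile writer. IntWritable/Text types and the
// sample records are assumptions, not confirmed by the post.
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.GzipCodec;

public class SFWrite {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    Path path = new Path(args[0]);
    SequenceFile.Writer writer = null;
    try {
      // Non-deprecated createWriter() variant taking Writer.Option varargs;
      // compression is passed as one of the options
      writer = SequenceFile.createWriter(conf,
          SequenceFile.Writer.file(path),
          SequenceFile.Writer.keyClass(IntWritable.class),
          SequenceFile.Writer.valueClass(Text.class),
          SequenceFile.Writer.compression(
              SequenceFile.CompressionType.BLOCK, new GzipCodec()));
      IntWritable key = new IntWritable();
      Text value = new Text();
      for (int i = 0; i < 10; i++) {
        key.set(i);
        value.set("Record number " + i);
        writer.append(key, value);
      }
    } finally {
      IOUtils.closeStream(writer);
    }
  }
}
```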

Executing program in Hadoop environment

To execute the above Java program in a Hadoop environment, you will need to add the directory containing the .class file for the Java program to Hadoop's classpath.

export HADOOP_CLASSPATH='/huser/eclipse-workspace/knpcode/bin'

I have my SFWrite.class file in location /huser/eclipse-workspace/knpcode/bin so I have exported that path.

Then you can run the program using the following command-
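The command itself was lost during extraction; assuming SFWrite is in the default package and takes the output path as its only argument, the invocation would be along these lines:

```shell
# assuming SFWrite (default package) is on HADOOP_CLASSPATH
hadoop SFWrite /user/output/item.seq
```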

Here /user/output/item.seq is the output path in HDFS.

If you try to display the file content in HDFS, the content will not be readable as SequenceFile is a binary file format. That brings us to the second part: how to read a sequence file.

Reading a sequence file Java program

To read a SequenceFile in Hadoop you need to get an instance of SequenceFile.Reader, which can read any of the SequenceFile formats.
Using this reader instance you can iterate over the records using the next() method. The variant of next() used here takes a key and a value, both of type Writable, as arguments and assigns the next (key, value) pair read from the sequence file to these variables.
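Since the original listing is missing here as well, the following is a sketch of such a reader. The class name SFRead is illustrative; the key and value classes are read from the file header, so the reader does not need to hard-code them.

```java
// Sketch of a SequenceFile reader; class name SFRead is illustrative.
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.util.ReflectionUtils;

public class SFRead {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    Path path = new Path(args[0]);
    SequenceFile.Reader reader = null;
    try {
      reader = new SequenceFile.Reader(conf, SequenceFile.Reader.file(path));
      // Key and value classes are stored in the SequenceFile header
      Writable key = (Writable) ReflectionUtils.newInstance(
          reader.getKeyClass(), conf);
      Writable value = (Writable) ReflectionUtils.newInstance(
          reader.getValueClass(), conf);
      // next(key, value) fills both variables and returns false at end of file
      while (reader.next(key, value)) {
        System.out.println(key + "\t" + value);
      }
    } finally {
      IOUtils.closeStream(reader);
    }
  }
}
```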

Writing SequenceFile using MapReduce Job

You can also write a sequence file in Hadoop using a MapReduce job. That is helpful when you have a big file and you want to take advantage of parallel processing.

The MapReduce job in this case will be a simple one where you don't even need a reducer, and your map tasks just need to write the (key, value) pairs.

In the MapReduce job for writing a SequenceFile, the important part is the job configuration for the output format and compression.
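With the original listing missing, the driver below is a sketch of such a map-only job. The class name SFMRWrite, the pass-through mapper, and the LongWritable/Text types are illustrative assumptions; the output-format and compression settings are the part the text describes.

```java
// Sketch of a map-only MapReduce job writing a compressed SequenceFile.
// Class names and the LongWritable/Text types are assumptions.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile.CompressionType;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class SFMRWrite {
  // Pass-through mapper: writes each (offset, line) pair out unchanged
  public static class SFMapper
      extends Mapper<LongWritable, Text, LongWritable, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws java.io.IOException, InterruptedException {
      context.write(key, value);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "SFMRWrite");
    job.setJarByClass(SFMRWrite.class);
    job.setMapperClass(SFMapper.class);
    job.setNumReduceTasks(0); // map-only job, no reducer needed
    job.setOutputKeyClass(LongWritable.class);
    job.setOutputValueClass(Text.class);
    // The important settings: SequenceFile output with Gzip block compression
    job.setOutputFormatClass(SequenceFileOutputFormat.class);
    FileOutputFormat.setCompressOutput(job, true);
    FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);
    SequenceFileOutputFormat.setOutputCompressionType(job, CompressionType.BLOCK);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```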

Reading SequenceFile using MapReduce Job

If you want to read a sequence file using a MapReduce job, the code will be very similar to the code for writing one.
One main change is the input and output formats.
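As a sketch of that change, the driver below reads a sequence file and writes its records back out as plain text. It assumes the sequence file holds LongWritable keys and Text values; the class names are illustrative.

```java
// Sketch of a map-only job reading a SequenceFile and emitting text.
// Assumes LongWritable/Text pairs in the input file.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class SFMRRead {
  // Pass-through mapper: emits each (key, value) pair read from the file
  public static class SFMapper
      extends Mapper<LongWritable, Text, LongWritable, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws java.io.IOException, InterruptedException {
      context.write(key, value);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "SFMRRead");
    job.setJarByClass(SFMRRead.class);
    job.setMapperClass(SFMapper.class);
    job.setNumReduceTasks(0);
    job.setOutputKeyClass(LongWritable.class);
    job.setOutputValueClass(Text.class);
    // The main change from the write job: SequenceFile in, plain text out
    job.setInputFormatClass(SequenceFileInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

The decompression needs no extra configuration: SequenceFileInputFormat detects the codec from the file itself.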

That’s all for the topic How to Read And Write SequenceFile in Hadoop. If something is missing or you have something to share about the topic please write a comment.
