How to Read And Write Parquet File in Hadoop

In this post we’ll see how to read and write a Parquet file in Hadoop using the Java API. We’ll also see how to write and read Parquet files using MapReduce.

Rather than using ParquetWriter and ParquetReader directly, AvroParquetWriter and AvroParquetReader are used to write and read Parquet files.

The AvroParquetWriter and AvroParquetReader classes take care of the conversion from the Avro schema to the Parquet schema, as well as the type conversion.

Required Jars

To write Java programs that read and write Parquet files you will need to put the following jars in the classpath. You can add them as Maven dependencies or copy the jars.

  • avro-1.8.2.jar
  • parquet-hadoop-bundle-1.10.0.jar
  • parquet-avro-1.10.0.jar
  • jackson-mapper-asl-1.9.13.jar
  • jackson-core-asl-1.9.13.jar
  • slf4j-api-1.7.25.jar

Java program to write a Parquet file

Since Avro is used, you’ll need an Avro schema.

schema.avsc
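The exact fields depend on your data; a minimal schema along the following lines works for this example (the record and field names here are illustrative).

{
    "type": "record",
    "name": "EmployeeRecord",
    "doc": "Employee records for the Parquet example",
    "fields": [
        {"name": "id", "type": "int"},
        {"name": "empName", "type": "string"},
        {"name": "age", "type": "int"}
    ]
}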

Java code
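A minimal sketch of the writer using AvroParquetWriter; the HDFS output path and record values are illustrative.

import java.io.File;
import java.io.IOException;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;
import org.apache.parquet.hadoop.util.HadoopOutputFile;

public class ExampleParquetWriter {

  public static void main(String[] args) {
    // Illustrative HDFS path for the Parquet file to be written
    Path path = new Path("/user/out/data.parquet");
    Configuration conf = new Configuration();
    try {
      // Parse the Avro schema file (expected in the working directory here)
      Schema schema = new Schema.Parser().parse(new File("schema.avsc"));
      try (ParquetWriter<GenericRecord> writer =
          AvroParquetWriter.<GenericRecord>builder(HadoopOutputFile.fromPath(path, conf))
              .withSchema(schema)
              .withConf(conf)
              .withCompressionCodec(CompressionCodecName.SNAPPY)
              .build()) {
        // Write a couple of records; the values are illustrative
        for (int i = 1; i <= 2; i++) {
          GenericRecord record = new GenericData.Record(schema);
          record.put("id", i);
          record.put("empName", "Employee-" + i);
          record.put("age", 30 + i);
          writer.write(record);
        }
      }
    } catch (IOException e) {
      e.printStackTrace();
    }
  }
}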

Executing program in Hadoop environment

Before running this program in a Hadoop environment you will need to put the above-mentioned jars in HADOOP_INSTALLATION_DIR/share/hadoop/mapreduce/lib.

Also put the current version of the Avro jar (avro-1.x.x.jar) in HADOOP_INSTALLATION_DIR/share/hadoop/common/lib if there is a version mismatch.

To execute the above Java program in a Hadoop environment, you will also need to add the directory containing the .class file of the Java program to Hadoop’s classpath.

I have my ExampleParquetWriter.class file in the location /huser/eclipse-workspace/knpcode/bin, so I have exported that path.
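export HADOOP_CLASSPATH='/huser/eclipse-workspace/knpcode/bin'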

Then you can run the program using the following command-
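hadoop ExampleParquetWriter

(Prefix the class name with its package if the class is not in the default package.)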

Java program to read a Parquet file

To read the Parquet file created in HDFS using the above program you can use the following method.
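A sketch of the reader, assuming the illustrative path used by the writer above.

import java.io.IOException;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetReader;
import org.apache.parquet.hadoop.ParquetReader;
import org.apache.parquet.hadoop.util.HadoopInputFile;

public class ExampleParquetReader {

  public static void main(String[] args) throws IOException {
    readParquetFile(new Path("/user/out/data.parquet"), new Configuration());
  }

  private static void readParquetFile(Path path, Configuration conf) throws IOException {
    // HadoopInputFile is used because the builder taking a Path is deprecated
    try (ParquetReader<GenericRecord> reader =
        AvroParquetReader.<GenericRecord>builder(HadoopInputFile.fromPath(path, conf))
            .build()) {
      GenericRecord record;
      while ((record = reader.read()) != null) {
        System.out.println(record);
      }
    }
  }
}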

Note that the builder that takes an org.apache.hadoop.fs.Path instance as argument is deprecated, which is why HadoopInputFile is used instead.

You can also use the parquet-tools jar to see the content or schema of a Parquet file.

Once you have downloaded parquet-tools-1.10.0.jar, you can use the following command to see the content of the file.
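Assuming the illustrative output path used by the writer above-

hadoop jar parquet-tools-1.10.0.jar cat /user/out/data.parquet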

To see the schema of a Parquet file use the schema command-
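hadoop jar parquet-tools-1.10.0.jar schema /user/out/data.parquet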

MapReduce to write a Parquet file

In this example a text file is converted to a Parquet file using MapReduce. It’s a map-only job, so the number of reducers is set to zero.

For this program a simple text file (stored in HDFS) with only two lines is used.
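The input used here looks something like this (the contents are illustrative)-

Hello wordcount MapReduce Hadoop program.
This is a simple text file for the Parquet conversion example.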

MapReduce Java code
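A sketch of the map-only job; each input line is wrapped in a GenericRecord with a single line field (the schema, class name, and paths are illustrative).

import java.io.IOException;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.parquet.avro.AvroParquetOutputFormat;

public class ParquetFileWriteMR {
  // Avro schema with a single "line" field holding one line of text
  private static final Schema AVRO_SCHEMA = new Schema.Parser().parse(
      "{\"type\": \"record\", \"name\": \"textFile\","
      + " \"doc\": \"text file in Parquet\","
      + " \"fields\": [{\"name\": \"line\", \"type\": \"string\"}]}");

  public static class ParquetMapper
      extends Mapper<LongWritable, Text, Void, GenericRecord> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      GenericRecord record = new GenericData.Record(AVRO_SCHEMA);
      record.put("line", value.toString());
      context.write(null, record);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "parquetwrite");
    job.setJarByClass(ParquetFileWriteMR.class);
    job.setMapperClass(ParquetMapper.class);
    // Mapper-only job, so no reducers
    job.setNumReduceTasks(0);
    job.setOutputKeyClass(Void.class);
    job.setOutputValueClass(GenericRecord.class);
    job.setOutputFormatClass(AvroParquetOutputFormat.class);
    AvroParquetOutputFormat.setSchema(job, AVRO_SCHEMA);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}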

Running the MapReduce program
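With the job classes packed in a jar (jar name and paths are illustrative)-

hadoop jar /path/to/parquetdemo.jar ParquetFileWriteMR /user/input /user/mr_parquet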

Using parquet-tools you can again see the content of the Parquet file-
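hadoop jar parquet-tools-1.10.0.jar cat /user/mr_parquet/part-m-00000.parquet

(ParquetOutputFormat names the output files part-m-NNNNN.parquet for a map-only job.)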

MapReduce to read a Parquet file

This example shows how you can read a Parquet file using MapReduce. The example reads the Parquet file written in the previous example and writes the extracted lines to a text file.

A record in the Parquet file looks as follows (shown here with the illustrative input above)-
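{"line": "Hello wordcount MapReduce Hadoop program."}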

Since only the line part is needed in the output file, you have to pull the value of the line column out of each record.

MapReduce Java code
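A sketch of the map-only read job; rather than splitting the record’s string form, it fetches the line field directly with get("line") (class name and paths are illustrative).

import java.io.IOException;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.parquet.avro.AvroParquetInputFormat;

public class ParquetFileReadMR {

  public static class ParquetReadMapper
      extends Mapper<Void, GenericRecord, NullWritable, Text> {
    @Override
    protected void map(Void key, GenericRecord value, Context context)
        throws IOException, InterruptedException {
      // Keep only the text of the line column
      String line = value.get("line").toString();
      context.write(NullWritable.get(), new Text(line));
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "parquetread");
    job.setJarByClass(ParquetFileReadMR.class);
    job.setMapperClass(ParquetReadMapper.class);
    // Mapper-only job, so no reducers
    job.setNumReduceTasks(0);
    job.setInputFormatClass(AvroParquetInputFormat.class);
    job.setOutputKeyClass(NullWritable.class);
    job.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}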

Running the MapReduce program
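Again with illustrative jar name and paths, reading the Parquet file written by the previous job-

hadoop jar /path/to/parquetdemo.jar ParquetFileReadMR /user/mr_parquet/part-m-00000.parquet /user/mr_text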

File content
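You can view the output with hdfs dfs -cat; with the illustrative input above it contains the original text lines.

hdfs dfs -cat /user/mr_text/part-m-00000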

That’s all for the topic How to Read And Write Parquet File in Hadoop. If something is missing or you have something to share about the topic please write a comment.

