In this post we’ll see how to read and write Parquet files in Hadoop using the Java API. We’ll also see how to use MapReduce to write and read Parquet files in Hadoop.
Rather than using ParquetWriter and ParquetReader directly, AvroParquetWriter and AvroParquetReader are used to write and read the Parquet files.
The AvroParquetWriter and AvroParquetReader classes take care of converting the Avro schema to the Parquet schema, and also of mapping the types.
Required Jars
To write Java programs that read and write Parquet files you will need to put the following jars in the classpath. You can add them as Maven dependencies or copy the jars.
- avro-1.8.2.jar
- parquet-hadoop-bundle-1.10.0.jar
- parquet-avro-1.10.0.jar
- jackson-mapper-asl-1.9.13.jar
- jackson-core-asl-1.9.13.jar
- slf4j-api-1.7.25.jar
Java program to write a Parquet file
Since Avro is used, you’ll need an Avro schema.
schema.avsc

{
  "type": "record",
  "name": "testFile",
  "doc": "test records",
  "fields": [
    {"name": "id", "type": "int"},
    {"name": "empName", "type": "string"}
  ]
}

Java code
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;

public class ExampleParquetWriter {

  public static void main(String[] args) {
    Schema schema = parseSchema();
    List<GenericData.Record> recordList = createRecords(schema);
    writeToParquetFile(recordList, schema);
  }

  // Method to parse the schema
  private static Schema parseSchema() {
    Schema.Parser parser = new Schema.Parser();
    Schema schema = null;
    try {
      // Path to schema file
      schema = parser.parse(ClassLoader.getSystemResourceAsStream("resources/schema.avsc"));
    } catch (IOException e) {
      e.printStackTrace();
    }
    return schema;
  }

  private static List<GenericData.Record> createRecords(Schema schema) {
    List<GenericData.Record> recordList = new ArrayList<>();
    for (int i = 1; i <= 10; i++) {
      GenericData.Record record = new GenericData.Record(schema);
      record.put("id", i);
      record.put("empName", i + "a");
      recordList.add(record);
    }
    return recordList;
  }

  private static void writeToParquetFile(List<GenericData.Record> recordList, Schema schema) {
    // Output path for Parquet file in HDFS
    Path path = new Path("/user/out/data.parquet");
    ParquetWriter<GenericData.Record> writer = null;
    // Creating ParquetWriter using builder
    try {
      writer = AvroParquetWriter.<GenericData.Record>builder(path)
          .withRowGroupSize(ParquetWriter.DEFAULT_BLOCK_SIZE)
          .withPageSize(ParquetWriter.DEFAULT_PAGE_SIZE)
          .withSchema(schema)
          .withConf(new Configuration())
          .withCompressionCodec(CompressionCodecName.SNAPPY)
          .withValidation(false)
          .withDictionaryEncoding(false)
          .build();
      // writing records
      for (GenericData.Record record : recordList) {
        writer.write(record);
      }
    } catch (IOException e) {
      e.printStackTrace();
    } finally {
      if (writer != null) {
        try {
          writer.close();
        } catch (IOException e) {
          e.printStackTrace();
        }
      }
    }
  }
}
Executing the program in a Hadoop environment
Before running this program in a Hadoop environment you will need to put the above-mentioned jars in HADOOP_INSTALLATION_DIR/share/hadoop/mapreduce/lib.
Also put the current avro-1.x.x jar in HADOOP_INSTALLATION_DIR/share/hadoop/common/lib if there is a version mismatch.
To execute the above Java program in the Hadoop environment, you will need to add the directory containing the .class file of the Java program to Hadoop’s classpath.
$ export HADOOP_CLASSPATH='/huser/eclipse-workspace/knpcode/bin'
I have my ExampleParquetWriter.class file in the location /huser/eclipse-workspace/knpcode/bin, so I have exported that path.
Then you can run the program using the following command:
$ hadoop org.knpcode.ExampleParquetWriter

18/06/06 12:15:35 INFO compress.CodecPool: Got brand-new compressor [.snappy]
18/06/06 12:15:35 INFO hadoop.InternalParquetRecordWriter: Flushing mem columnStore to file. allocated memory: 2048
Java program to read a Parquet file
To read the Parquet file created in HDFS using the above program you can use the following method.
private static void readParquetFile() {
  ParquetReader<GenericData.Record> reader = null;
  Path path = new Path("/user/out/data.parquet");
  try {
    reader = AvroParquetReader.<GenericData.Record>builder(path)
        .withConf(new Configuration())
        .build();
    GenericData.Record record;
    while ((record = reader.read()) != null) {
      System.out.println(record);
    }
  } catch (IOException e) {
    e.printStackTrace();
  } finally {
    if (reader != null) {
      try {
        reader.close();
      } catch (IOException e) {
        e.printStackTrace();
      }
    }
  }
}
$ hadoop org.knpcode.ExampleParquetWriter

18/06/06 13:33:47 INFO hadoop.InternalParquetRecordReader: RecordReader initialized will read a total of 10 records.
18/06/06 13:33:47 INFO hadoop.InternalParquetRecordReader: at row 0. reading next block
18/06/06 13:33:47 INFO compress.CodecPool: Got brand-new decompressor [.snappy]
18/06/06 13:33:47 INFO hadoop.InternalParquetRecordReader: block read in memory in 44 ms. row count = 10
{"id": 1, "empName": "1a"}
{"id": 2, "empName": "2a"}
{"id": 3, "empName": "3a"}
{"id": 4, "empName": "4a"}
{"id": 5, "empName": "5a"}
{"id": 6, "empName": "6a"}
{"id": 7, "empName": "7a"}
{"id": 8, "empName": "8a"}
{"id": 9, "empName": "9a"}
{"id": 10, "empName": "10a"}
Note that the builder that takes an org.apache.hadoop.fs.Path instance as argument is deprecated.
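As a minimal sketch of the non-deprecated variant on the writer side, assuming parquet-avro and parquet-hadoop 1.10.0 or later where AvroParquetWriter.builder(OutputFile) and HadoopOutputFile are available (the class name OutputFileWriterSketch is just illustrative), the writer could be created like this:

import java.io.IOException;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;
import org.apache.parquet.hadoop.util.HadoopOutputFile;

public class OutputFileWriterSketch {
  // Sketch: build the writer from an OutputFile instead of the deprecated Path-based builder
  static ParquetWriter<GenericData.Record> createWriter(Schema schema) throws IOException {
    Configuration conf = new Configuration();
    Path path = new Path("/user/out/data.parquet");
    return AvroParquetWriter
        .<GenericData.Record>builder(HadoopOutputFile.fromPath(path, conf))
        .withSchema(schema)
        .withConf(conf)
        .withCompressionCodec(CompressionCodecName.SNAPPY)
        .build();
  }
}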
You can also use the parquet-tools jar to see the content or schema of a Parquet file.
Once you have downloaded parquet-tools-1.10.0.jar, you can use the following command to see the content of the file.
$ hadoop jar /path/to/parquet-tools-1.10.0.jar cat /user/out/data.parquet
To see the schema of a Parquet file:
$ hadoop jar /path/to/parquet-tools-1.10.0.jar schema /user/out/data.parquet

message testFile {
  required int32 id;
  required binary empName (UTF8);
}
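If you would rather inspect the schema from Java instead of parquet-tools, here is a minimal sketch using ParquetFileReader.readFooter, which is present in parquet-hadoop 1.10.0 though marked deprecated in later versions; the class name PrintParquetSchema is just illustrative.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.metadata.ParquetMetadata;
import org.apache.parquet.schema.MessageType;

public class PrintParquetSchema {
  public static void main(String[] args) throws IOException {
    // Read only the file footer, which carries the Parquet schema
    ParquetMetadata footer = ParquetFileReader.readFooter(
        new Configuration(), new Path("/user/out/data.parquet"));
    MessageType schema = footer.getFileMetaData().getSchema();
    System.out.println(schema);
  }
}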
MapReduce to write a Parquet file
In this example a text file is converted to a Parquet file using MapReduce. It’s a mapper-only job, so the number of reducers is set to zero.
For this program a simple text file (stored in HDFS) with only two lines is used.
This is a test file.
This is a Hadoop MapReduce program file.

MapReduce Java code
import java.io.IOException;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
import org.apache.parquet.avro.AvroParquetOutputFormat;
import org.apache.parquet.example.data.Group;

public class ParquetFile extends Configured implements Tool {

  public static void main(String[] args) throws Exception {
    int exitFlag = ToolRunner.run(new ParquetFile(), args);
    System.exit(exitFlag);
  }

  // Schema
  private static final Schema AVRO_SCHEMA = new Schema.Parser().parse(
      "{\n" +
      "  \"type\": \"record\",\n" +
      "  \"name\": \"testFile\",\n" +
      "  \"doc\": \"test records\",\n" +
      "  \"fields\":\n" +
      "    [\n" +
      "      {\"name\": \"byteofffset\", \"type\": \"long\"},\n" +
      "      {\"name\": \"line\", \"type\": \"string\"}\n" +
      "    ]\n" +
      "}\n");

  // Map function
  public static class ParquetMapper extends Mapper<LongWritable, Text, Void, GenericRecord> {
    private GenericRecord record = new GenericData.Record(AVRO_SCHEMA);

    public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      record.put("byteofffset", key.get());
      record.put("line", value.toString());
      context.write(null, record);
    }
  }

  @Override
  public int run(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "parquet");
    job.setJarByClass(ParquetFile.class);
    job.setMapperClass(ParquetMapper.class);
    job.setNumReduceTasks(0);
    job.setOutputKeyClass(Void.class);
    job.setOutputValueClass(Group.class);
    job.setOutputFormatClass(AvroParquetOutputFormat.class);
    // setting schema to be used
    AvroParquetOutputFormat.setSchema(job, AVRO_SCHEMA);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    return job.waitForCompletion(true) ? 0 : 1;
  }
}

Running the MapReduce program
hadoop jar /path/to/jar org.knpcode.ParquetFile /user/input/count /user/out/parquetFile
Using parquet-tools you can see the content of the Parquet file.
hadoop jar /path/to/parquet-tools-1.10.0.jar cat /user/out/parquetFile/part-m-00000.parquet

18/06/06 17:15:04 INFO hadoop.InternalParquetRecordReader: RecordReader initialized will read a total of 2 records.
18/06/06 17:15:04 INFO hadoop.InternalParquetRecordReader: at row 0. reading next block
18/06/06 17:15:04 INFO hadoop.InternalParquetRecordReader: block read in memory in 20 ms. row count = 2

byteofffset = 0
line = This is a test file.

byteofffset = 21
line = This is a Hadoop MapReduce program file.
MapReduce to read a Parquet file
This example shows how you can read a Parquet file using MapReduce. It reads the Parquet file written in the previous example and writes the result to a text file.
The records in the Parquet file look as follows.
byteofffset: 0
line: This is a test file.

byteofffset: 21
line: This is a Hadoop MapReduce program file.
Since only the line part is needed in the output file, the mapper first splits the record on newlines and then splits the value of the line column on ": " to extract the text.
MapReduce Java code
import java.io.IOException;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
import org.apache.parquet.example.data.Group;
import org.apache.parquet.hadoop.example.ExampleInputFormat;

public class ParquetFileRead extends Configured implements Tool {

  public static void main(String[] args) throws Exception {
    int exitFlag = ToolRunner.run(new ParquetFileRead(), args);
    System.exit(exitFlag);
  }

  // Map function
  public static class ParquetMapper1 extends Mapper<LongWritable, Group, NullWritable, Text> {
    public static final Log log = LogFactory.getLog(ParquetMapper1.class);

    public void map(LongWritable key, Group value, Context context)
        throws IOException, InterruptedException {
      NullWritable outKey = NullWritable.get();
      String line = value.toString();
      String[] fields = line.split("\n");
      String[] record = fields[1].split(": ");
      context.write(outKey, new Text(record[1]));
    }
  }

  @Override
  public int run(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "parquet1");
    job.setJarByClass(getClass());
    job.setMapperClass(ParquetMapper1.class);
    job.setNumReduceTasks(0);
    job.setMapOutputKeyClass(LongWritable.class);
    job.setMapOutputValueClass(Text.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    job.setInputFormatClass(ExampleInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    return job.waitForCompletion(true) ? 0 : 1;
  }
}

Running the MapReduce program
hadoop jar /path/to/jar org.knpcode.ParquetFileRead /user/out/parquetFile/part-m-00000.parquet /user/out/dataFile

Output file content
$ hdfs dfs -cat /user/out/dataFile/part-m-00000

This is a test file.
This is a Hadoop MapReduce program file.
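As a design note, instead of splitting the Group’s toString() output, the mapper could read the line column directly from the Group. Here is a rough sketch of such a mapper, reusing the imports of the ParquetFileRead class above and assuming the same schema with a column named line (ParquetMapper2 is just an illustrative name):

// Sketch of an alternative mapper that reads the "line" column directly
// from the Group instead of parsing its toString() output.
public static class ParquetMapper2 extends Mapper<LongWritable, Group, NullWritable, Text> {

  public void map(LongWritable key, Group value, Context context)
      throws IOException, InterruptedException {
    // getString(fieldName, index) returns the value at the given repetition index
    String line = value.getString("line", 0);
    context.write(NullWritable.get(), new Text(line));
  }
}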
That's all for the topic How to Read And Write Parquet File in Hadoop. If something is missing or you have something to share about the topic, please write a comment.