In this post we’ll see how to read and write Avro files in Hadoop using the Java API.
Required JARs
To write Java programs that read and write Avro files you will need to put the following JARs in the classpath. You can add them as a Maven dependency (a sample fragment follows the list) or copy the JARs.
- avro-1.8.2.jar
- avro-tools-1.8.2.jar
- jackson-mapper-asl-1.9.13.jar
- jackson-core-asl-1.9.13.jar
- slf4j-api-1.7.25.jar
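If you use Maven, adding the core Avro dependency is usually enough for the code in this post, since it pulls in the Jackson and SLF4J JARs transitively (avro-tools is only needed for the command-line tools). A minimal pom.xml fragment, with the version matching the JAR list above:

<dependency>
  <groupId>org.apache.avro</groupId>
  <artifactId>avro</artifactId>
  <version>1.8.2</version>
</dependency>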
Java program to write an Avro file
Since Avro is used, you'll need an Avro schema.

schema.avsc

{
  "type": "record",
  "name": "EmployeeRecord",
  "doc": "employee records",
  "fields": [
    {"name": "id", "type": "int"},
    {"name": "empName", "type": "string"},
    {"name": "age", "type": "int"}
  ]
}
Java code
import java.io.File;
import java.io.IOException;

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.DatumWriter;

public class ExampleAvroWriter {

  public static void main(String[] args) {
    Schema schema = parseSchema();
    writeToAvroFile(schema);
  }

  // Method to parse the schema file bundled on the classpath
  private static Schema parseSchema() {
    Schema.Parser parser = new Schema.Parser();
    Schema schema = null;
    try {
      // Path to schema file
      schema = parser.parse(ClassLoader.getSystemResourceAsStream("resources/schema.avsc"));
    } catch (IOException e) {
      e.printStackTrace();
    }
    return schema;
  }

  private static void writeToAvroFile(Schema schema) {
    GenericRecord emp1 = new GenericData.Record(schema);
    emp1.put("id", 1);
    emp1.put("empName", "Batista");
    emp1.put("age", 45);

    GenericRecord emp2 = new GenericData.Record(schema);
    emp2.put("id", 2);
    emp2.put("empName", "Jigmi");
    emp2.put("age", 23);

    DatumWriter<GenericRecord> datumWriter = new GenericDatumWriter<>(schema);
    DataFileWriter<GenericRecord> dataFileWriter = null;
    try {
      // Local file system - output file path
      File file = new File("/home/knpcode/emp.avro");
      dataFileWriter = new DataFileWriter<>(datumWriter);
      // For compression
      //dataFileWriter.setCodec(CodecFactory.snappyCodec());
      dataFileWriter.create(schema, file);
      dataFileWriter.append(emp1);
      dataFileWriter.append(emp2);
    } catch (IOException e) {
      e.printStackTrace();
    } finally {
      if (dataFileWriter != null) {
        try {
          dataFileWriter.close();
        } catch (IOException e) {
          e.printStackTrace();
        }
      }
    }
  }
}
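As a side note, DataFileWriter implements Closeable, so on Java 7+ the explicit finally block can be replaced with try-with-resources. A minimal sketch of the same write (one record shown, same local path as above):

private static void writeToAvroFile(Schema schema) {
  GenericRecord emp1 = new GenericData.Record(schema);
  emp1.put("id", 1);
  emp1.put("empName", "Batista");
  emp1.put("age", 45);

  DatumWriter<GenericRecord> datumWriter = new GenericDatumWriter<>(schema);
  // The writer is closed automatically, even if create() or append() throws
  try (DataFileWriter<GenericRecord> dataFileWriter = new DataFileWriter<>(datumWriter)) {
    dataFileWriter.create(schema, new File("/home/knpcode/emp.avro"));
    dataFileWriter.append(emp1);
  } catch (IOException e) {
    e.printStackTrace();
  }
}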
Note that in this code the output Avro file is created in the local file system. If you want to create the output file in HDFS, you need to pass the path using the following changes.
// Additional imports needed for HDFS
import java.io.OutputStream;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// For HDFS - output file path
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(URI.create("hdfs://hostname:port/user/out/emp.avro"), conf);
OutputStream out = fs.create(new Path("hdfs://hostname:port/user/out/emp.avro"));
Then pass this OutputStream object to the create method-
dataFileWriter.create(schema, out);
Executing the program in a Hadoop environment
Before running this program in a Hadoop environment you will need to put the above-mentioned JARs in $HADOOP_INSTALLATION_DIR/share/hadoop/mapreduce/lib.
Also put the current version Avro-1.x.x JAR in $HADOOP_INSTALLATION_DIR/share/hadoop/common/lib if there is a version mismatch (example commands below).
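For example, the copy commands may look like the following, assuming the JARs have been downloaded to the current directory:

$ cp avro-1.8.2.jar avro-tools-1.8.2.jar jackson-mapper-asl-1.9.13.jar \
     jackson-core-asl-1.9.13.jar slf4j-api-1.7.25.jar \
     $HADOOP_INSTALLATION_DIR/share/hadoop/mapreduce/lib/
$ cp avro-1.8.2.jar $HADOOP_INSTALLATION_DIR/share/hadoop/common/lib/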
To execute the above Java program in the Hadoop environment, you will need to add the directory containing the .class file of the Java program to Hadoop's classpath.
export HADOOP_CLASSPATH='/huser/eclipse-workspace/knpcode/bin'
I have my ExampleAvroWriter.class file in location /huser/eclipse-workspace/knpcode/bin so I have exported that path.
Then you can run the program using the following command-
$ hadoop org.knpcode.ExampleAvroWriter
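Once the program has run, you can quickly verify the output by dumping the file as JSON with the avro-tools JAR (shown here for the local output path used earlier; adjust the path if you wrote to HDFS):

$ java -jar avro-tools-1.8.2.jar tojson /home/knpcode/emp.avro

Each record is printed as a line of JSON, similar to the output of the read example below.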
Java program to read an Avro file
In order to read the Avro file stored in HDFS in the previous example, you can use the following method. Provide values for HOSTNAME and PORT as per your configuration.
// Requires: org.apache.avro.file.DataFileReader, org.apache.avro.generic.GenericDatumReader,
// org.apache.avro.generic.GenericRecord, org.apache.avro.io.DatumReader,
// org.apache.avro.mapred.FsInput, org.apache.hadoop.conf.Configuration, org.apache.hadoop.fs.Path
private static void readFromAvroFile(Schema schema) {
  Configuration conf = new Configuration();
  DataFileReader<GenericRecord> dataFileReader = null;
  try {
    // FsInput wraps an HDFS path as a seekable Avro input
    FsInput in = new FsInput(new Path("hdfs://HOSTNAME:PORT/user/out/emp.avro"), conf);
    DatumReader<GenericRecord> datumReader = new GenericDatumReader<>(schema);
    dataFileReader = new DataFileReader<>(in, datumReader);
    GenericRecord emp = null;
    while (dataFileReader.hasNext()) {
      // Reuse the record object to avoid per-record allocation
      emp = dataFileReader.next(emp);
      System.out.println(emp);
    }
  } catch (IOException e) {
    e.printStackTrace();
  } finally {
    if (dataFileReader != null) {
      try {
        dataFileReader.close();
      } catch (IOException e) {
        e.printStackTrace();
      }
    }
  }
}

Output
{"id": 1, "empName": "Batista", "age": 45} {"id": 2, "empName": "Jigmi", "age": 23}
If you want to read the Avro file from the local file system, then you can use the following method.
private static void readFromAvroFile(Schema schema) {
  DataFileReader<GenericRecord> dataFileReader = null;
  try {
    File file = new File("/home/knpcode/emp.avro");
    DatumReader<GenericRecord> datumReader = new GenericDatumReader<>(schema);
    dataFileReader = new DataFileReader<>(file, datumReader);
    GenericRecord emp = null;
    while (dataFileReader.hasNext()) {
      emp = dataFileReader.next(emp);
      System.out.println(emp);
    }
  } catch (IOException e) {
    e.printStackTrace();
  } finally {
    if (dataFileReader != null) {
      try {
        dataFileReader.close();
      } catch (IOException e) {
        e.printStackTrace();
      }
    }
  }
}
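For completeness, a minimal driver class for the reader, reusing the same schema parsing as ExampleAvroWriter (the class name ExampleAvroReader is illustrative):

import java.io.File;
import java.io.IOException;

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.DatumReader;

public class ExampleAvroReader {

  public static void main(String[] args) {
    // Same schema parsing as in ExampleAvroWriter
    Schema schema = parseSchema();
    readFromAvroFile(schema);
  }

  // Same helper as in ExampleAvroWriter
  private static Schema parseSchema() {
    Schema.Parser parser = new Schema.Parser();
    Schema schema = null;
    try {
      schema = parser.parse(ClassLoader.getSystemResourceAsStream("resources/schema.avsc"));
    } catch (IOException e) {
      e.printStackTrace();
    }
    return schema;
  }

  // readFromAvroFile(Schema) as defined above
}

Note that an Avro data file embeds the schema it was written with, so the reader would also work with new GenericDatumReader<>() and no explicit schema.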
That's all for the topic How to Read And Write Avro Files in Hadoop. If something is missing or you have something to share about the topic please write a comment.