In this post we'll see two Java programs that use the PDFBox library: one reads text from a PDF document and the other extracts images from a PDF document.
To know more about the PDFBox library and other PDF examples in Java using PDFBox, check this post: Generating PDF in Java Using PDFBox Tutorial
Reading PDFs using PDFBox
To read text from a PDF using PDFBox, perform the following steps.
- Load the PDF that has to be read using the PDDocument.load method.
- Use the PDFTextStripper class to read the text. This class takes a PDF document and strips out all of its text. The getText() method of the PDFTextStripper class returns the text of the document as a String.
import java.io.File;
import java.io.IOException;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

public class ReadPDF {
  public static final String CONTENT_PDF = "F://knpcode//result//PDFBox//Content.pdf";

  public static void main(String[] args) {
    // try-with-resources closes the document even if reading fails
    try (PDDocument document = PDDocument.load(new File(CONTENT_PDF))) {
      PDFTextStripper textStripper = new PDFTextStripper();
      // Get total page count of the PDF document
      int numberOfPages = document.getNumberOfPages();
      // set the first page to be extracted (page numbers start at 1)
      textStripper.setStartPage(1);
      // set the last page to be extracted
      textStripper.setEndPage(numberOfPages);
      String text = textStripper.getText(document);
      System.out.println(text);
    } catch (IOException e) {
      e.printStackTrace();
    }
  }
}
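The same setStartPage() and setEndPage() calls can also be used to read a PDF one page at a time. The following is a minimal sketch, assuming PDFBox 2.x (the version where PDDocument.load is available). The class name ReadPDFByPage, the helper textPerPage and the use of the first command-line argument as the PDF path are hypothetical names chosen for this illustration.

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

public class ReadPDFByPage {

  // Hypothetical helper: returns one String per page of the document
  static List<String> textPerPage(PDDocument document) throws IOException {
    PDFTextStripper textStripper = new PDFTextStripper();
    List<String> pages = new ArrayList<>();
    // PDFTextStripper page numbers are 1-based and the end page is inclusive
    for (int page = 1; page <= document.getNumberOfPages(); page++) {
      textStripper.setStartPage(page);
      textStripper.setEndPage(page);
      pages.add(textStripper.getText(document));
    }
    return pages;
  }

  public static void main(String[] args) throws IOException {
    if (args.length < 1) {
      System.err.println("Usage: ReadPDFByPage <path-to-pdf>");
      return;
    }
    // Assumption: the PDF path is passed as the first argument
    try (PDDocument document = PDDocument.load(new File(args[0]))) {
      List<String> pages = textPerPage(document);
      for (int i = 0; i < pages.size(); i++) {
        System.out.println("--- Page " + (i + 1) + " ---");
        System.out.println(pages.get(i));
      }
    }
  }
}
```

Note that PDFTextStripper counts pages from 1, whereas PDDocument.getPage(), used in the image example, counts pages from 0.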
Extracting images from a PDF using PDFBox
If you want to extract images from a PDF document, that can be done using the PDResources class in the PDFBox library. Using this class you can get all the resources available at page level.
From those resources you can check whether a resource is an image by verifying that the resource object is of type PDImageXObject.
import java.io.File;
import java.io.IOException;

import javax.imageio.ImageIO;

import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDResources;
import org.apache.pdfbox.pdmodel.graphics.PDXObject;
import org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject;

public class ReadPDF {
  public static final String CONTENT_PDF = "F://knpcode//result//PDFBox//Image.pdf";

  public static void main(String[] args) {
    // try-with-resources closes the document even if extraction fails
    try (PDDocument document = PDDocument.load(new File(CONTENT_PDF))) {
      // get resources for the first page (page index starts at 0)
      PDResources pdResources = document.getPage(0).getResources();
      int i = 0;
      for (COSName csName : pdResources.getXObjectNames()) {
        System.out.println(csName);
        PDXObject pdxObject = pdResources.getXObject(csName);
        // only image XObjects are written out
        if (pdxObject instanceof PDImageXObject) {
          PDImageXObject image = (PDImageXObject) pdxObject;
          i++;
          // image storage location and image name
          File imgFile = new File("F://knpcode//result//PDFBox//img" + i + ".png");
          ImageIO.write(image.getImage(), "png", imgFile);
        }
      }
    } catch (IOException e) {
      e.printStackTrace();
    }
  }
}
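The program above looks only at the first page. A possible extension that visits every page is sketched below, again assuming PDFBox 2.x. The class name ExtractAllImages, the helper findImages, the command-line arguments and the output file names are hypothetical choices made for this illustration. The sketch checks only the XObjects listed directly in each page's resources, so images nested inside form XObjects are not found.

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import javax.imageio.ImageIO;

import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDResources;
import org.apache.pdfbox.pdmodel.graphics.PDXObject;
import org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject;

public class ExtractAllImages {

  // Hypothetical helper: collects the image XObjects found in the resources of every page
  static List<PDImageXObject> findImages(PDDocument document) throws IOException {
    List<PDImageXObject> images = new ArrayList<>();
    for (PDPage page : document.getPages()) {
      PDResources resources = page.getResources();
      if (resources == null) {
        continue; // a page without a resource dictionary has no images
      }
      for (COSName name : resources.getXObjectNames()) {
        PDXObject xObject = resources.getXObject(name);
        if (xObject instanceof PDImageXObject) {
          images.add((PDImageXObject) xObject);
        }
      }
    }
    return images;
  }

  public static void main(String[] args) throws IOException {
    if (args.length < 2) {
      System.err.println("Usage: ExtractAllImages <path-to-pdf> <output-directory>");
      return;
    }
    // Assumption: args[0] is the PDF path and args[1] is an existing output directory
    try (PDDocument document = PDDocument.load(new File(args[0]))) {
      int i = 0;
      for (PDImageXObject image : findImages(document)) {
        i++;
        File imgFile = new File(args[1], "img" + i + ".png");
        ImageIO.write(image.getImage(), "png", imgFile);
      }
      System.out.println("Images written: " + i);
    }
  }
}
```

Collecting the images in a separate method keeps the page traversal apart from the file writing, which makes it easier to change where and in which format the images are saved.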
That's all for the topic Java PDFBox Example - Read Text And Extract Image From PDF. If something is missing or you have something to share about the topic please write a comment.