Read All Text from PDF Document using PDFBox 2.0
In this tutorial, we shall learn to read all the text from a PDF document using the PDFBox 2.0 library in a Java program.
A PDF document may contain text, embedded images and other content. The PDFTextStripper class in PDFBox provides methods to extract all the text from a PDF document.
Steps to Extract All Text from PDF
Follow these steps to extract the text from a PDF document.
Step 1: Load PDF
Load the PDF file into a PDDocument object.
PDDocument doc = PDDocument.load(new File("sample.pdf"));
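A loaded PDDocument holds resources that should be released with close() once you are done with it. The following is a minimal sketch, assuming PDFBox 2.0 is on the classpath; the class name LoadAndClose and the helper pageCount are made up for illustration. It loads a file inside a try-with-resources block so the document is closed automatically, and it writes a small two-page PDF to a temporary file first so that it runs without sample.pdf.

```java
import java.io.File;
import java.io.IOException;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;

public class LoadAndClose {

    // Hypothetical helper: loads the PDF, reads its page count, then closes it.
    static int pageCount(File pdf) throws IOException {
        try (PDDocument doc = PDDocument.load(pdf)) {
            return doc.getNumberOfPages();
        }
    }

    public static void main(String[] args) throws IOException {
        // Create a two-page blank PDF in a temporary file (stand-in for sample.pdf).
        File tmp = File.createTempFile("sample", ".pdf");
        tmp.deleteOnExit();
        try (PDDocument doc = new PDDocument()) {
            doc.addPage(new PDPage());
            doc.addPage(new PDPage());
            doc.save(tmp);
        }
        System.out.println("Pages: " + pageCount(tmp));
    }
}
```

To use it with your own file, call pageCount(new File("sample.pdf")) instead of creating the temporary file.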
Step 2: Use PDFTextStripper.getText method
Get the text from doc using PDFTextStripper.
String text = new PDFTextStripper().getText(doc);
PDFTextStripper discards formatting such as fonts, sizes and colors, and by default it does not reorder text chunks by their position on the page. It simply extracts the text from every page of the PDF document.
getText returns the extracted text as a single String.
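PDFTextStripper also has setters that change what getText returns. The sketch below assumes PDFBox 2.0 on the classpath; the class name PageRangeText and the helpers textOfPages and buildSample are hypothetical and exist only for this illustration. It limits extraction to a range of pages with setStartPage and setEndPage (both 1-based and inclusive) and enables setSortByPosition so that text is ordered by its position on the page. The sample document is built in memory, so no input file is needed.

```java
import java.io.IOException;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDPageContentStream;
import org.apache.pdfbox.pdmodel.font.PDType1Font;
import org.apache.pdfbox.text.PDFTextStripper;

public class PageRangeText {

    // Hypothetical helper: returns the text of pages first..last (1-based, inclusive).
    static String textOfPages(PDDocument doc, int first, int last) throws IOException {
        PDFTextStripper stripper = new PDFTextStripper();
        stripper.setSortByPosition(true); // order text by position on the page
        stripper.setStartPage(first);
        stripper.setEndPage(last);
        return stripper.getText(doc);
    }

    // Hypothetical helper: builds an in-memory PDF with one line of text per page.
    static PDDocument buildSample(String... lines) throws IOException {
        PDDocument doc = new PDDocument();
        for (String line : lines) {
            PDPage page = new PDPage();
            doc.addPage(page);
            try (PDPageContentStream cs = new PDPageContentStream(doc, page)) {
                cs.beginText();
                cs.setFont(PDType1Font.HELVETICA, 12);
                cs.newLineAtOffset(50, 700);
                cs.showText(line);
                cs.endText();
            }
        }
        return doc;
    }

    public static void main(String[] args) throws IOException {
        try (PDDocument doc = buildSample("First page", "Second page", "Third page")) {
            // Only the text of pages 2 and 3 is extracted.
            System.out.println(textOfPages(doc, 2, 3));
        }
    }
}
```

Running main prints the text of the second and third pages only. With a document loaded from disk, pass the loaded PDDocument to textOfPages in place of the one returned by buildSample.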
Example 1 – Read All Text from PDF
In this example, we will take a PDF file and read all the text present in it using PDFTextStripper.
ExtractText.java
import java.io.File;
import java.io.IOException;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;
public class ExtractText {
    public static void main(String[] args) {
        // try-with-resources closes the document once the text has been read
        try (PDDocument doc = PDDocument.load(new File("sample.pdf"))) {
            // Extract the text of every page as a single String
            String text = new PDFTextStripper().getText(doc);
            System.out.println("Text in PDF\n---------------------------------");
            System.out.println(text);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
Output
Text in PDF
------------------
This is a sample PDF.
The PDF file used in this example is sample.pdf.
Conclusion
In this PDFBox tutorial, we learned how to read all the text from a PDF document using PDFBox 2.0.