Read All Text from PDF Document using PDFBox 2.0

In this tutorial, we shall learn to read all the text from a PDF document using the PDFBox 2.0 library in a Java program.

A PDF document may contain text, embedded images and other content. The PDFTextStripper class in PDFBox provides methods to extract all the text from a PDF document.

Steps to Extract All Text from PDF

Follow these steps to extract the text from a PDF document.

Step 1: Load PDF

Load the PDF file into a PDDocument object.

PDDocument doc = PDDocument.load(new File("sample.pdf"));

Step 2: Use PDFTextStripper.getText method

Get the text from doc using PDFTextStripper.

String text = new PDFTextStripper().getText(doc);

PDFTextStripper ignores the formatting and placement of text chunks in the PDF document. It simply strips out the text from all the pages of the document.
getText returns that text as a single String.
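By default the stripper covers every page, but the range can be narrowed. The following is a minimal sketch, assuming PDFBox 2.0 is on the classpath; the class name ExtractPageRange and the helper extractPages are hypothetical names chosen for illustration. It uses the setStartPage and setEndPage methods of PDFTextStripper, which take 1-based, inclusive page numbers.

```java
import java.io.File;
import java.io.IOException;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

// Hypothetical helper class, not part of the original tutorial.
public class ExtractPageRange {

	// Returns the text of pages startPage..endPage (1-based, inclusive).
	static String extractPages(File pdf, int startPage, int endPage) throws IOException {
		// try-with-resources closes the document when extraction is done
		try (PDDocument doc = PDDocument.load(pdf)) {
			PDFTextStripper stripper = new PDFTextStripper();
			stripper.setStartPage(startPage);
			stripper.setEndPage(endPage);
			return stripper.getText(doc);
		}
	}

	public static void main(String[] args) throws IOException {
		// Assumes a file named sample.pdf exists in the working directory.
		System.out.println(extractPages(new File("sample.pdf"), 1, 1));
	}
}
```

Leaving both setters out gives the behaviour described above, where text from all pages is returned.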

Example 1 – Read All Text from PDF

In this example, we will take a PDF file and read all the text present in it using PDFTextStripper.

ExtractText.java

import java.io.File;
import java.io.IOException;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

public class ExtractText {

	public static void main(String[] args) {
		// try-with-resources closes the document even if extraction fails
		try (PDDocument doc = PDDocument.load(new File("sample.pdf"))) {
			// extract the text of every page as a single string
			String text = new PDFTextStripper().getText(doc);
			System.out.println("Text in PDF\n---------------------------------");
			System.out.println(text);
		} catch (IOException e) {
			e.printStackTrace();
		}
	}
}

Output

Text in PDF
---------------------------------
This is a sample PDF.

The PDF file used in this example is sample.pdf.
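If you do not have a suitable file at hand, a one-page PDF with the same text can be generated with PDFBox itself. This is a sketch under stated assumptions rather than the way the original sample.pdf was produced: the class name CreateSamplePdf, the helper write, the font, and the text position are all illustrative choices, and PDFBox 2.0 is assumed to be on the classpath.

```java
import java.io.File;
import java.io.IOException;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDPageContentStream;
import org.apache.pdfbox.pdmodel.font.PDType1Font;

// Hypothetical helper class, not part of the original tutorial.
public class CreateSamplePdf {

	// Writes a single-page PDF containing one line of text to target.
	static void write(File target, String line) throws IOException {
		try (PDDocument doc = new PDDocument()) {
			PDPage page = new PDPage();
			doc.addPage(page);
			// The content stream must be closed before the document is saved.
			try (PDPageContentStream content = new PDPageContentStream(doc, page)) {
				content.beginText();
				content.setFont(PDType1Font.HELVETICA, 12);
				// Position is an arbitrary choice: 72 points from the left, 700 from the bottom.
				content.newLineAtOffset(72, 700);
				content.showText(line);
				content.endText();
			}
			doc.save(target);
		}
	}

	public static void main(String[] args) throws IOException {
		write(new File("sample.pdf"), "This is a sample PDF.");
	}
}
```

Running this once in the project directory creates sample.pdf, which ExtractText.java can then read.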

Conclusion

In this PDFBox tutorial, we have learnt to read all the text from a PDF document using PDFBox 2.0.