Extract Words from PDF Document
To extract words from a PDF document (from all of its pages), we override the writeString method of PDFTextStripper.
The class org.apache.pdfbox.text.PDFTextStripper strips out all of the text from a PDF document.
To extract words, we extend the PDFTextStripper class and override its writeString(String str, List<TextPosition> textPositions) method.
The first argument to writeString is a string of text, typically a line or part of a line. This string can be split into words using the word separator.
Steps to Extract Words from PDF Document
Following is a step-by-step process to extract words from a PDF:
1. Extend PDFTextStripper
Create a Java class that extends PDFTextStripper.
public class GetWordsFromPDF extends PDFTextStripper {
. . .
}
2. Call writeText method
Set the page range (from the first page to the last page) and call the writeText() method. The Writer passed to writeText() is only a throwaway sink here: the words are collected in writeString, so whatever is written to it is never read.
PDFTextStripper stripper = new GetWordsFromPDF();
stripper.setSortByPosition( true );
stripper.setStartPage( 1 ); // page numbers are 1-based
stripper.setEndPage( document.getNumberOfPages() );
Writer dummy = new OutputStreamWriter(new ByteArrayOutputStream());
stripper.writeText(document, dummy);
3. Override writeString
The writeString method receives a string of text as its first argument. It is called repeatedly as the stripper works through the document, typically once per line or part of a line.
@Override
protected void writeString(String string, List<TextPosition> textPositions) throws IOException {
. . .
}
4. Get Words
Split the string received by the writeString method on the word separator, which is available through getWordSeparator().
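The splitting step can be sketched with the standard library alone. The helper below is a hypothetical illustration: the class name WordSplitter and the method splitWords are not part of PDFBox. It assumes the separator should be matched as literal text, so it is quoted before being handed to String.split, which otherwise interprets its argument as a regular expression. Empty tokens, which appear when two separators are adjacent, are skipped.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class WordSplitter {

    // Splits a line into words on a literal separator, dropping empty tokens.
    public static List<String> splitWords(String line, String separator) {
        List<String> words = new ArrayList<>();
        // Pattern.quote makes separators such as "|" or "." match literally.
        for (String token : line.split(Pattern.quote(separator))) {
            if (!token.isEmpty()) {
                words.add(token);
            }
        }
        return words;
    }

    public static void main(String[] args) {
        // Sample line with a doubled space between "to" and "The".
        for (String word : splitWords("Welcome to  The Apache", " ")) {
            System.out.println(word);
        }
    }
}
```

Inside writeString, the same idea would apply to the string argument and the value returned by getWordSeparator().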
Example 1 – Extract Words from PDF
In this example, we take a PDF document and extract all of the words from it.
GetWordsFromPDF.java
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;
import org.apache.pdfbox.text.TextPosition;
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.util.ArrayList;
import java.util.List;
/**
 * This is an example of how to extract words from a PDF document.
 */
public class GetWordsFromPDF extends PDFTextStripper {

    // Words collected across all writeString() calls
    static List<String> words = new ArrayList<String>();

    public GetWordsFromPDF() throws IOException {
    }

    /**
     * @throws IOException If there is an error parsing the document.
     */
    public static void main( String[] args ) throws IOException {
        PDDocument document = null;
        String fileName = "apache.pdf"; // replace with your PDF file name
        try {
            document = PDDocument.load( new File(fileName) );
            PDFTextStripper stripper = new GetWordsFromPDF();
            stripper.setSortByPosition( true );
            stripper.setStartPage( 1 ); // page numbers are 1-based
            stripper.setEndPage( document.getNumberOfPages() );
            // The writer is a throwaway sink; the words are collected in writeString()
            Writer dummy = new OutputStreamWriter(new ByteArrayOutputStream());
            stripper.writeText(document, dummy);
            // print words
            for (String word : words) {
                System.out.println(word);
            }
        }
        finally {
            if( document != null ) {
                document.close();
            }
        }
    }

    /**
     * Override the default functionality of PDFTextStripper.writeString()
     */
    @Override
    protected void writeString(String str, List<TextPosition> textPositions) throws IOException {
        // split() never returns null, so no null check is needed
        String[] wordsInStream = str.split(getWordSeparator());
        for (String word : wordsInStream) {
            words.add(word);
        }
    }
}
Output
2017-8-6
Welcome
to
The
Apache
Software
Foundation!
Custom
Search
The
Apache
Way
(/foundation/governance/)
(http://apache.org/foundation/contributing.html)
If you would like to use the same PDF file, download apache.pdf. Otherwise, assign the path of your own PDF file to fileName in the Java program.
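The output above shows that the extracted tokens keep their surrounding punctuation, for example "Foundation!" and "(/foundation/governance/)". If bare words are needed, one possible post-processing step is sketched below. The class WordCleaner and its methods are hypothetical names introduced for illustration, and the rule used (strip leading and trailing characters that are neither letters nor digits) is an assumption about what counts as punctuation.

```java
import java.util.ArrayList;
import java.util.List;

public class WordCleaner {

    // Removes leading and trailing characters that are not letters or digits.
    public static String trimPunctuation(String token) {
        int start = 0;
        int end = token.length();
        while (start < end && !Character.isLetterOrDigit(token.charAt(start))) {
            start++;
        }
        while (end > start && !Character.isLetterOrDigit(token.charAt(end - 1))) {
            end--;
        }
        return token.substring(start, end);
    }

    // Cleans every token and drops those that become empty.
    public static List<String> clean(List<String> tokens) {
        List<String> cleaned = new ArrayList<>();
        for (String token : tokens) {
            String word = trimPunctuation(token);
            if (!word.isEmpty()) {
                cleaned.add(word);
            }
        }
        return cleaned;
    }
}
```

Characters inside a token are left untouched, so a date such as 2017-8-6 passes through unchanged.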
Conclusion
In this Apache PDFBox tutorial, we learned how to extract words from a PDF document. You may also refer to the tutorial on extracting the coordinates or position of characters in a PDF.