How to Extract Words from PDF Document?

Extract Words from PDF Document

To extract words from a PDF document (from all the pages), we shall override the writeString method of PDFTextStripper.

The class org.apache.pdfbox.text.PDFTextStripper strips out all of the text from a PDF document.

To extract words from a PDF document, we shall extend this PDFTextStripper class and override its writeString(String str, List<TextPosition> textPositions) method.

The first argument to the writeString method is a line. This line can be split into words using the word separator.

Steps to Extract Words from PDF Document

Following is a step-by-step process to extract words from a PDF:

1. Extend PDFTextStripper

Create a Java class that extends PDFTextStripper.

public class GetWordsFromPDF extends PDFTextStripper {
  . . .
}

2. Call writeText method

Set page boundaries (from first page to last page) to strip text and call the method writeText().

PDFTextStripper stripper = new GetWordsFromPDF();
stripper.setSortByPosition( true );
stripper.setStartPage( 1 ); // page numbers are 1-based
stripper.setEndPage( document.getNumberOfPages() );

Writer dummy = new OutputStreamWriter(new ByteArrayOutputStream());
stripper.writeText(document, dummy);

3. Override writeString

The writeString method receives a line of text as its first argument. It is called for each line of text in the PDF document.

@Override
protected void writeString(String string, List<TextPosition> textPositions) throws IOException {
    . . .
}

4. Get Words

Split the string received by the writeString method on the word separator, which getWordSeparator() returns.
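The splitting step can be tried on its own, without a PDF. The following is a minimal sketch using only the Java standard library; the class name `WordSplitSketch` and the method `splitWords` are hypothetical and not part of PDFBox, and a single space is assumed as the separator. Because `String.split` treats its argument as a regular expression, the sketch quotes the separator with `Pattern.quote`, and it skips the empty strings that consecutive separators produce.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of PDFBox.
class WordSplitSketch {

    // Splits a line on a literal separator and drops empty tokens.
    static List<String> splitWords(String line, String separator) {
        List<String> result = new ArrayList<>();
        // Pattern.quote makes split() treat the separator literally,
        // even if it contains regex metacharacters such as "|" or ".".
        for (String token : line.split(Pattern.quote(separator))) {
            if (!token.isEmpty()) {
                result.add(token);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // A single space is assumed as the separator (an assumption of this sketch).
        System.out.println(splitWords("Welcome to  The Apache Software Foundation!", " "));
    }
}
```

Inside an overridden writeString, the same loop could be applied to the received string with getWordSeparator() as the separator.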

Example 1 – Extract Words from PDF

In this example, we will take a PDF document, and extract all words from this PDF.

GetWordsFromPDF.java

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;
import org.apache.pdfbox.text.TextPosition;
 
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.util.ArrayList;
import java.util.List;
 
/**
* This is an example on how to extract words from PDF document
*/
public class GetWordsFromPDF extends PDFTextStripper {
    
    static List<String> words = new ArrayList<String>();
 
    public GetWordsFromPDF() throws IOException {
    }
 
    /**
     * @throws IOException If there is an error parsing the document.
     */
    public static void main( String[] args ) throws IOException {
        String fileName = "apache.pdf"; // replace with your PDF file name
        // PDDocument.load is the PDFBox 2.x API; PDFBox 3.x uses Loader.loadPDF instead
        try (PDDocument document = PDDocument.load( new File(fileName) )) {
            PDFTextStripper stripper = new GetWordsFromPDF();
            stripper.setSortByPosition( true );
            stripper.setStartPage( 1 ); // page numbers are 1-based
            stripper.setEndPage( document.getNumberOfPages() );
 
            // the written output is discarded; words are collected in writeString()
            Writer dummy = new OutputStreamWriter(new ByteArrayOutputStream());
            stripper.writeText(document, dummy);
            
            // print words
            for (String word : words) {
                System.out.println(word);
            }
        }
    }
 
    /**
     * Override the default functionality of PDFTextStripper.writeString()
     */
    @Override
    protected void writeString(String str, List<TextPosition> textPositions) throws IOException {
        // split() treats the separator as a regular expression and never returns null
        for (String word : str.split(getWordSeparator())) {
            words.add(word);
        }
    }
}

Output

2017-8-6
Welcome
to
The
Apache
Software
Foundation!
Custom
Search
The
Apache
Way
(/foundation/governance/)

(http://apache.org/foundation/contributing.html)

Download the PDF document apache.pdf if you would like to use the same PDF file. Otherwise, set fileName in the Java program to the path of your own PDF file.
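The output above contains tokens with surrounding punctuation, such as `Foundation!` and `(/foundation/governance/)`, as well as empty entries. If plain words are wanted, the collected list can be post-processed. The following is a rough sketch under that assumption; the class `WordCleanupSketch` and its methods are hypothetical names, not part of PDFBox, and the rule used (strip leading and trailing characters that are neither letters nor digits) is only one possible choice.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical post-processing helper for illustration; not part of PDFBox.
class WordCleanupSketch {

    // Removes leading and trailing characters that are neither letters nor digits.
    static String trimPunctuation(String token) {
        return token.replaceAll("^[^\\p{L}\\p{N}]+|[^\\p{L}\\p{N}]+$", "");
    }

    // Trims every token and drops those that end up empty.
    static List<String> clean(List<String> tokens) {
        List<String> cleaned = new ArrayList<>();
        for (String token : tokens) {
            String word = trimPunctuation(token);
            if (!word.isEmpty()) {
                cleaned.add(word);
            }
        }
        return cleaned;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("Foundation!", "(/foundation/governance/)", "", "2017-8-6");
        // prints [Foundation, foundation/governance, 2017-8-6]
        System.out.println(clean(tokens));
    }
}
```

Characters inside a token, such as the hyphens in a date or the slash in a path, are left untouched by this rule.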

Conclusion

In this Apache PDFBox tutorial, we have learnt how to extract words from a PDF document. You may also refer to the tutorial on extracting the coordinates or positions of characters in a PDF.
