Get Coordinates of Characters in PDF

To extract the coordinates (location) and size of characters in a PDF, we extend the PDFTextStripper class and override its writeString(String string, List<TextPosition> textPositions) method.

The class org.apache.pdfbox.text.PDFTextStripper extracts all of the text from a PDF document.

The List<TextPosition> passed to writeString() holds one entry per character, with information such as the character's Unicode value, X coordinate, Y coordinate, height, width, x-scaling value, y-scaling value, font size, and space width.
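When reading these coordinates, it helps to know which corner they are measured from. PDF user space puts the origin at the bottom-left of the page, while the direction-adjusted Y value returned by TextPosition.getYDirAdj() (used in the example later in this tutorial) is, for unrotated text, measured from the top of the page. The sketch below shows the conversion between the two conventions. It uses plain Java only; toBottomLeftY is a hypothetical helper, not a PDFBox method, and the page height of 792 points (US Letter) is an assumption.

```java
// Sketch: convert a Y value measured from the top of the page
// into a Y value measured from the bottom (PDF user space).
final class YAxisFlip {

    // Hypothetical helper, not part of PDFBox.
    // Assumes unrotated text and a page whose visible area starts at the origin.
    static float toBottomLeftY(float yFromTop, float pageHeight) {
        return pageHeight - yFromTop;
    }

    public static void main(String[] args) {
        float pageHeight = 792f;      // assumption: US Letter height in points
        float yFromTop = 22.003723f;  // a Y value measured from the top of the page
        System.out.println("Y from bottom = " + toBottomLeftY(yFromTop, pageHeight));
    }
}
```

For a real document, the page height would come from the page itself rather than from a constant.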

Steps to Extract Coordinates of Characters in PDF

The following steps extract the coordinates (position) of characters in a PDF.

1. Extend PDFTextStripper

Create a Java class that extends PDFTextStripper.

public class GetCharLocationAndSize extends PDFTextStripper {
  . . .
}

2. Call writeText method

Set the page range to strip text from (first page to last page; page numbers are 1-based) and call the writeText() method.

PDFTextStripper stripper = new GetCharLocationAndSize();
stripper.setSortByPosition( true );
stripper.setStartPage( 1 );
stripper.setEndPage( document.getNumberOfPages() );

Writer dummy = new OutputStreamWriter(new ByteArrayOutputStream());
stripper.writeText(document, dummy);

3. Override writeString

The writeString method receives a string of text together with the positions of its characters. We override it as shown below.

@Override
protected void writeString(String string, List<TextPosition> textPositions) throws IOException {
    . . .
}

4. Print Locations and Size

Each TextPosition in the list describes an individual character. For each one, print its coordinates and size.

Example 1 – Extract Coordinates or Position of Characters in PDF

In this example, we will take a PDF with text, and extract the (X, Y) coordinates of characters.

GetCharLocationAndSize.java

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;
import org.apache.pdfbox.text.TextPosition;
 
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.util.List;
 
/**
* This is an example on how to get the x/y coordinates and size of each character in PDF
*/
public class GetCharLocationAndSize extends PDFTextStripper {
 
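    // In PDFBox 2.x the PDFTextStripper constructor is declared to throw IOException,
    // so the subclass constructor must declare it as well.
    public GetCharLocationAndSize() throws IOException {
    }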
 
    /**
     * @throws IOException If there is an error parsing the document.
     */
    public static void main( String[] args ) throws IOException {
        String fileName = "apache.pdf";
        // try-with-resources closes the document even if extraction fails
        try (PDDocument document = PDDocument.load( new File(fileName) )) {
            PDFTextStripper stripper = new GetCharLocationAndSize();
            stripper.setSortByPosition( true );
            // Page numbers are 1-based
            stripper.setStartPage( 1 );
            stripper.setEndPage( document.getNumberOfPages() );

            // The extracted text is discarded; writeString() prints the positions
            Writer dummy = new OutputStreamWriter(new ByteArrayOutputStream());
            stripper.writeText(document, dummy);
        }
    }
 
    /**
     * Override the default functionality of PDFTextStripper.writeString()
     */
    @Override
    protected void writeString(String string, List<TextPosition> textPositions) throws IOException {
        for (TextPosition text : textPositions) {
            System.out.println(text.getUnicode()+ " [(X=" + text.getXDirAdj() + ",Y=" +
                    text.getYDirAdj() + ") height=" + text.getHeightDir() + " width=" +
                    text.getWidthDirAdj() + "]");
        }
    }
}

Output

2 [(X=26.004425,Y=22.003723) height=5.833024 width=5.0907116]
0 [(X=31.095137,Y=22.003723) height=5.833024 width=5.0907116]
1 [(X=36.18585,Y=22.003723) height=5.833024 width=5.0907097]
7 [(X=41.276558,Y=22.003723) height=5.833024 width=5.0907097]
- [(X=46.367268,Y=22.003723) height=5.833024 width=2.8872108]
8 [(X=49.25448,Y=22.003723) height=5.833024 width=5.0907097]
- [(X=54.34519,Y=22.003723) height=5.833024 width=2.8872108]
6 [(X=57.2324,Y=22.003723) height=5.833024 width=5.0907097]
W [(X=226.4448,Y=22.003723) height=5.833024 width=7.911499]
e [(X=233.88747,Y=22.003723) height=5.833024 width=4.922714]
l [(X=238.81018,Y=22.003723) height=5.833024 width=2.2230377]
c [(X=241.03322,Y=22.003723) height=5.833024 width=4.399185]
o [(X=245.4324,Y=22.003723) height=5.833024 width=4.895355]
m [(X=250.32776,Y=22.003723) height=5.833024 width=7.7943115]
e [(X=258.12207,Y=22.003723) height=5.833024 width=4.922699]
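Per-character values like these are often combined into a single box around a word. The following is a minimal sketch of that idea in plain Java, using the values printed above for "Welcome". Glyph and bounds are hypothetical names introduced for illustration and are not part of PDFBox. The sketch also assumes that Y marks the baseline measured from the top of the page and that the height extends upward from it, so the top edge is Y minus height.

```java
import java.util.ArrayList;
import java.util.List;

final class WordBounds {

    // Hypothetical holder for one printed line: X, Y, height and width of a character.
    static final class Glyph {
        final float x, y, height, width;

        Glyph(float x, float y, float height, float width) {
            this.x = x;
            this.y = y;
            this.height = height;
            this.width = width;
        }
    }

    // Returns {left, top, right, bottom} of the smallest box enclosing all glyphs.
    // Assumption: y is the baseline measured from the top, height extends upward.
    static float[] bounds(List<Glyph> glyphs) {
        if (glyphs.isEmpty()) {
            throw new IllegalArgumentException("at least one glyph is required");
        }
        float left = Float.POSITIVE_INFINITY;
        float top = Float.POSITIVE_INFINITY;
        float right = Float.NEGATIVE_INFINITY;
        float bottom = Float.NEGATIVE_INFINITY;
        for (Glyph g : glyphs) {
            left = Math.min(left, g.x);
            top = Math.min(top, g.y - g.height);
            right = Math.max(right, g.x + g.width);
            bottom = Math.max(bottom, g.y);
        }
        return new float[] { left, top, right, bottom };
    }

    public static void main(String[] args) {
        // Values copied from the sample output for the characters of "Welcome"
        List<Glyph> word = new ArrayList<>();
        word.add(new Glyph(226.4448f, 22.003723f, 5.833024f, 7.911499f));   // W
        word.add(new Glyph(233.88747f, 22.003723f, 5.833024f, 4.922714f));  // e
        word.add(new Glyph(238.81018f, 22.003723f, 5.833024f, 2.2230377f)); // l
        word.add(new Glyph(241.03322f, 22.003723f, 5.833024f, 4.399185f));  // c
        word.add(new Glyph(245.4324f, 22.003723f, 5.833024f, 4.895355f));   // o
        word.add(new Glyph(250.32776f, 22.003723f, 5.833024f, 7.7943115f)); // m
        word.add(new Glyph(258.12207f, 22.003723f, 5.833024f, 4.922699f));  // e

        float[] b = bounds(word);
        System.out.println("left=" + b[0] + " top=" + b[1]
                + " right=" + b[2] + " bottom=" + b[3]);
    }
}
```

Inside writeString(), the same calculation could be fed from the TextPosition list instead of hard-coded values.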

Download the PDF document apache.pdf if you would like to use the same PDF file. Otherwise, set fileName in the Java program to the path of your own PDF file.

Conclusion

In this PDFBox tutorial, we learned how to extract the coordinates (position) and size of characters in a PDF document. The same TextPosition objects also provide the Unicode value, x-scaling value, y-scaling value, font size, space width, and more.