PDFDomTree not detecting white spaces while converting a pdf file to html

vsbehere :

I am using PDFDomTree with pdfbox-2.0.9 in my Java application to convert a PDF file to an HTML file. I use the following code for the conversion:

// try-with-resources closes the writer and the document even if the conversion fails
try (PDDocument document = PDDocument.load(new File("some path"));
     Writer output = new PrintWriter(new File("some output path"), "utf-8")) {
    PDFDomTree parser = new PDFDomTree(PDFDomTreeConfig.createDefaultConfig());
    parser.writeText(document, output);
}

When I analysed the output HTML, I realised that the converter did not detect the whitespace between some words, so those words ended up concatenated.

Check the comparison below: [image]

The corresponding PDF file can be accessed from here if needed.

Can anyone please help me with this?

mkl :

The text extractor at hand, Pdf2Dom's PDFDomTree, is based on PDFBox's PDFTextStripper, but it only uses it to parse the PDF drawing instructions into characters with style and position; it does all the analysis of these rich characters itself.

In particular it ignores all incoming white space characters in its PDFBoxTree parent class:

protected void processTextPosition(TextPosition text)
{
    if (text.isDiacritic())
    {
        lastDia = text;
    }
    else if (!text.getUnicode().trim().isEmpty())
    {
        [...process character...]
    }
}

(org.fit.pdfdom.PDFBoxTree override processTextPosition)

In that [...process character...] block it tries to recognize word gaps by hard-coded distances:

        //should we split the boxes?
        boolean split = lastText == null || distx > 1.0f || distx < -6.0f || Math.abs(disty) > 1.0f
                            || isReversed(getTextDirectionality(text)) != isReversed(getTextDirectionality(lastText));

(inside the [...process character...] block above)

As the text in your PDF is small to start with (9pt determined by Pdf2Dom) and in many lines very tightly set, gaps between words usually are smaller than the 1.0 assumed above (distx > 1.0f).

In my eyes there are two issues here:

  • dropping white spaces means throwing away information; (In some situations this might be advantageous, I've seen PDFs with the same line drawn twice with either drawing string argument containing spaces where the other contains visible characters; but these are exceptions.)

  • having hard-coded distance limits distx > 1.0f, distx < -6.0f, etc. even though the font sizes (and with them the gap sizes) can vary considerably.

These issues should be fixed in the code. Two possible work-arounds for PDFs like your demo.pdf:

Choosing different distance limits

A true fix should try and make the distance limits dynamic, depending on the font size and probably even the average character distance in the current line up to the current position. A work-around for your PDF would be to replace the hard-coded distance by a smaller hard-coded one.

For example, use .5f instead of 1.0f as the word distance, i.e. replace the test above with:

        //should we split the boxes?
        boolean split = lastText == null || distx > .5f || distx < -6.0f || Math.abs(disty) > 1.0f
                            || isReversed(getTextDirectionality(text)) != isReversed(getTextDirectionality(lastText));

This results in Pdf2Dom recognizing the word gaps in your document (or at least many more, I have not checked all of them).
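
The dynamic limit described above as the true fix could be sketched roughly as follows. This is only an illustration and not Pdf2Dom code: the class WordGapHeuristic, the method isWordGap and the ratio 0.05 are hypothetical, and the ratio was merely picked so that 9pt text gets a limit of about 0.45, close to the .5f that worked for this document. It also assumes that distx and the font size are expressed in the same unit; wiring it into PDFBoxTree is left out.

```java
/** Hypothetical helper, not part of Pdf2Dom or PDFBox. */
final class WordGapHeuristic {
    /** Assumed fraction of the font size that counts as a word gap; needs tuning. */
    static final float DEFAULT_GAP_RATIO = 0.05f;

    /**
     * Decides whether a horizontal distance is large enough to be a word gap.
     *
     * @param distx    horizontal distance to the end of the previous character
     * @param fontSize font size of the current text, in the same unit as distx
     * @param gapRatio fraction of the font size used as the limit
     */
    static boolean isWordGap(float distx, float fontSize, float gapRatio) {
        // The limit grows with the font size instead of being fixed at 1.0f.
        return distx > fontSize * gapRatio;
    }

    public static void main(String[] args) {
        // The same 0.6 distance is a gap in 9pt text but not in 20pt text.
        System.out.println(isWordGap(0.6f, 9f, DEFAULT_GAP_RATIO));  // true
        System.out.println(isWordGap(0.6f, 20f, DEFAULT_GAP_RATIO)); // false
    }
}
```

A more robust variant would probably also take the average character distance in the current line into account, as mentioned above; that part is not sketched here.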

Interpreting white spaces as splits

Instead of ignoring white spaces, you can explicitly interpret them as word gaps, e.g. by enhancing the processTextPosition override like this:

protected void processTextPosition(TextPosition text)
{
    if (text.isDiacritic())
    {
        lastDia = text;
    }
    else if (!text.getUnicode().trim().isEmpty())
    {
        [...process character...]
    }
    else
    {
        //!! process white spaces here
        //finish current box (if any)
        if (lastText != null)
        {
            finishBox();
        }
        //start a new box
        curstyle = new BoxStyle(style);
        lastText = null;
    }
}

I have not analyzed the code in depth, so I can only call this a work-around. To make it a real fix, you have to test it for side effects and also extend it to look into the exact nature of the white space: There are other white space characters than the normal space, some of them zero-width, some non-breaking, etc. All these different types of white space deserve special treatment.
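
To illustrate the point about different kinds of white space: Java's String.trim() only strips characters up to U+0020, so a no-break space (U+00A0) does not even count as empty in the check used above. A possible starting point for a finer distinction is sketched below. The class WhiteSpaceKinds, its Kind enum and the choice of categories are assumptions made for this example and are not part of Pdf2Dom or PDFBox; the list of code points is deliberately short and not complete.

```java
/** Hypothetical helper, not part of Pdf2Dom or PDFBox. */
final class WhiteSpaceKinds {
    enum Kind { NONE, ZERO_WIDTH, NON_BREAKING, BREAKING }

    /** Classifies a single code point; the categories are an assumption of this sketch. */
    static Kind classify(int codePoint) {
        switch (codePoint) {
            case 0x200B: // ZERO WIDTH SPACE
            case 0xFEFF: // ZERO WIDTH NO-BREAK SPACE
                return Kind.ZERO_WIDTH;
            case 0x00A0: // NO-BREAK SPACE
            case 0x2007: // FIGURE SPACE
            case 0x202F: // NARROW NO-BREAK SPACE
                return Kind.NON_BREAKING;
            default:
                // isWhitespace covers space, tab, line breaks; isSpaceChar covers Unicode separators.
                return (Character.isWhitespace(codePoint) || Character.isSpaceChar(codePoint))
                        ? Kind.BREAKING
                        : Kind.NONE;
        }
    }

    public static void main(String[] args) {
        String nbsp = "\u00A0";
        System.out.println(nbsp.trim().isEmpty());         // false
        System.out.println(classify(nbsp.codePointAt(0))); // NON_BREAKING
    }
}
```

Which of these kinds should actually end the current box is a separate decision; a zero-width character, for instance, probably should not be treated as a visible word gap.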


PS: As many PDFBoxTree members are protected (and not private), it is easily possible to apply the second work-around without having to patch Pdf2Dom:

PDDocument document = PDDocument.load(SOURCE);

PDFDomTree parser = new PDFDomTree(PDFDomTreeConfig.createDefaultConfig()) {
    @Override
    protected void processTextPosition(TextPosition text) {
        if (text.getUnicode().trim().isEmpty()) {
            //finish current box (if any)
            if (lastText != null)
            {
                finishBox();
            }
            //start a new box
            curstyle = new BoxStyle(style);
            lastText = null;
        } else {
            super.processTextPosition(text);
        }
    }
};
Writer output = new PrintWriter(TARGET, "utf-8");

parser.writeText(document, output);
output.close();

(ExtractText test testDemoImproved)
