Problem with merged lines while extracting text from PDF using PDFBox 2.x

mi0 :

I have a problem extracting text from a PDF using PDFTextStripper from PDFBox 2.0.13. More specifically, lines that are too close to each other are merged together.

For example, the first line contains the text "signfieldbig" and the second line contains underscores, but PDFTextStripper parses them as "s_i_g_n_fi_e_ld_b_ig_ _______" (it merges both lines into one). I tried multiple settings (different lineSeparator values, thresholds, etc.) but nothing helped. The two lines were merged every time, and I cannot simply strip the unnecessary characters from the text, because I need the position of this placeholder to create a signature field.

UPDATE: I just realized what caused this problem: the original file does not contain two normal lines separated by a line separator, but one line of underscores with a manually placed text area containing the text "placeholder" above it. Still, a PDF viewer (viewing it as text) or another PDF library (iText 2.x) parses it as two separate lines...

mkl :

There are different strategies for text extraction. One can take the text chunks as they come and only add a new line (or something similar) when the next chunk's coordinates are not right after the previous one. Alternatively, one can collect all chunks, sort them by coordinates, and extract the text from the sorted chunks.

(Obviously both strategy types can be combined with a certain degree of analysis of text layout.)

In your case sorting is active, causing the underscores and the text above to be joined as "s_i_g_n_fi_e_ld_b_ig_ _______".
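To make the difference concrete, here is a minimal sketch of both strategies in plain Java. It does not use PDFBox and is not PDFBox's actual algorithm: the Chunk class, the LINE_TOLERANCE value, the coordinates, and the shortened sample text "sign" are all assumptions made up for illustration. The underscores are drawn first, and the letters are drawn afterwards at almost the same height, similar to the manually placed text area in the question.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ChunkOrderDemo {

    /** Hypothetical text chunk: one glyph plus the position where it is drawn. */
    static final class Chunk {
        final String text;
        final double x;
        final double y;

        Chunk(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    // Assumed value: chunks whose y falls into the same band of this height count as one line.
    static final double LINE_TOLERANCE = 5.0;

    static long lineOf(Chunk c) {
        return Math.round(c.y / LINE_TOLERANCE);
    }

    /** Strategy 1: keep stream order; break the line when a chunk does not follow the previous one. */
    static String extractInStreamOrder(List<Chunk> chunks) {
        StringBuilder out = new StringBuilder();
        Chunk prev = null;
        for (Chunk c : chunks) {
            if (prev != null && (c.y != prev.y || c.x < prev.x)) {
                out.append('\n');
            }
            out.append(c.text);
            prev = c;
        }
        return out.toString();
    }

    /** Strategy 2: sort all chunks by line band, then by x, and only then join them. */
    static String extractSortedByPosition(List<Chunk> chunks) {
        List<Chunk> sorted = new ArrayList<>(chunks);
        sorted.sort((a, b) -> {
            int byLine = Long.compare(lineOf(a), lineOf(b));
            return byLine != 0 ? byLine : Double.compare(a.x, b.x);
        });
        StringBuilder out = new StringBuilder();
        Chunk prev = null;
        for (Chunk c : sorted) {
            if (prev != null && lineOf(c) != lineOf(prev)) {
                out.append('\n');
            }
            out.append(c.text);
            prev = c;
        }
        return out.toString();
    }

    /** Underscores first, then the letters placed slightly higher and shifted horizontally. */
    static List<Chunk> sampleChunks() {
        return Arrays.asList(
                new Chunk("_", 5, 100), new Chunk("_", 15, 100),
                new Chunk("_", 25, 100), new Chunk("_", 35, 100),
                new Chunk("s", 0, 102), new Chunk("i", 10, 102),
                new Chunk("g", 20, 102), new Chunk("n", 30, 102));
    }

    public static void main(String[] args) {
        System.out.println(extractInStreamOrder(sampleChunks()));
        System.out.println(extractSortedByPosition(sampleChunks()));
    }
}
```

With these sample coordinates, the stream-order variant keeps the underscores and the letters on separate lines, while the sorted variant interleaves them into "s_i_g_n_", which is the same kind of merge as in the question.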

You can disable sorting in the PDFBox text stripper by calling setSortByPosition(false) on the PDFTextStripper instance.


There is no universally best approach; depending on the document in question, one or the other might work better.
