Red areas around text when converting a pdf to png with pdfbox

linoor :

I'm trying to convert a PDF to a PNG file using PDFBox. Unfortunately, the output has weird red areas in some places. I'm not sure what the problem is; it only happens with some PDF files.

Here's some of the code that I'm using:

    public static BufferedImage generateFromPdf(String ref, InputStream stream, int pageIndex, PreviewMode mode) throws IOException {
        PDDocument doc = null;
        try (InputStream buffered = new BufferedInputStream(stream)) {
            doc = PDDocument.load(buffered, PDF_LOADING_MEMORY_SETTING);
            if (pageIndex > doc.getNumberOfPages()) {
                return null;
            }
            PDFRenderer renderer = new PDFRenderer(doc);
            return rasterizePdfBox(ref, pageIndex, renderer, mode);
        } finally {
            if (doc != null) {
                doc.close();
            }
        }
    }

and then:

    private static BufferedImage rasterizePdfBox(String ref, int pageIndex, PDFRenderer renderer, PreviewMode mode) throws IOException {
        Future<BufferedImage> result = executorService.submit(() -> {
            LOGGER.info(String.format("Generate preview for ref: %s, page: %s, mode: %s ", ref, pageIndex, mode.name()));
            return renderer.renderImageWithDPI(pageIndex - 1, mode.getDpi(), ImageType.RGB);
        });

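        try {
            return result.get();
        } catch (InterruptedException e) {
            // Restore the interrupt flag only when this thread was actually interrupted.
            Thread.currentThread().interrupt();
            LOGGER.error(String.format("Interrupted when generating preview: %s", e.getMessage()));
            throw new IOException(e.getMessage(), e);
        } catch (ExecutionException e) {
            // The rendering task itself failed; keep the cause for the stack trace.
            LOGGER.error(String.format("Error when generating preview: %s", e.getMessage()));
            throw new IOException(e.getMessage(), e);
        }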
    }

So far I've only figured out that the places which are red in the output are blank when I open the file in Master PDF Editor on Linux. They look normal, though, when I open it with Document Viewer.

Some hints:

- The PDFs with problems have been scanned. I can select text around the working parts, but not in the places that have the red overlay. Maybe it has something to do with OCR issues?
- If I run the Linux tool convert not-working-pdf.pdf converted.pdf and then convert the resulting file to PNG, the issue is gone.

[Image: PNG output after converting the PDF]

Here's an example file: https://ufile.io/3or9l

PDFBox version: 2.0.13

Tilman Hausherr :

This was a PDFBox bug. The cause was a bitonal image with a mask, which is unusual. The raster of such an image has only one color component, so only the "R" band of the RGB destination was written instead of all three. Because of that, white appeared as red.
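To illustrate the effect (this is not the PDFBox code, only a minimal sketch using plain java.awt.image), the following copies a one-band gray pixel into an RGB image in two ways. The class name SingleBandDemo and both method names are made up for this example. Writing only band 0 turns a white source pixel red; writing the same sample to all three bands keeps it white.

```java
import java.awt.image.BufferedImage;
import java.awt.image.WritableRaster;

public class SingleBandDemo {

    // Hypothetical helper: copies the single gray band into band 0 (red) only,
    // leaving green and blue at their default of 0.
    static BufferedImage copyFirstBandOnly(BufferedImage gray) {
        return copy(gray, 1);
    }

    // Hypothetical helper: copies the single gray band into all three RGB bands.
    static BufferedImage copyToAllBands(BufferedImage gray) {
        return copy(gray, 3);
    }

    private static BufferedImage copy(BufferedImage gray, int bandsToFill) {
        int w = gray.getWidth();
        int h = gray.getHeight();
        BufferedImage rgb = new BufferedImage(w, h, BufferedImage.TYPE_INT_RGB);
        WritableRaster src = gray.getRaster();
        WritableRaster dst = rgb.getRaster();
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                int sample = src.getSample(x, y, 0);
                // For TYPE_INT_RGB, bands 0, 1, 2 are red, green, blue.
                for (int band = 0; band < bandsToFill; band++) {
                    dst.setSample(x, y, band, sample);
                }
            }
        }
        return rgb;
    }

    public static void main(String[] args) {
        // One white pixel in a one-band gray image.
        BufferedImage gray = new BufferedImage(1, 1, BufferedImage.TYPE_BYTE_GRAY);
        gray.getRaster().setSample(0, 0, 0, 255);

        int redOnly = copyFirstBandOnly(gray).getRGB(0, 0) & 0xFFFFFF;
        int allBands = copyToAllBands(gray).getRGB(0, 0) & 0xFFFFFF;
        System.out.printf("first band only: %06X%n", redOnly);
        System.out.printf("all bands:       %06X%n", allBands);
    }
}
```

The first value is pure red and the second is white, which matches the symptom in the question: areas that should be white show up red.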

More details about this bug are in issue PDFBOX-4470. It will be fixed in release 2.0.14. Until then, you can work with a snapshot build.
