How to reduce the size of merged PDF/A-1b files with pdfbox or other java library

hab :

Input: A list of (e.g. 14) PDF/A-1b files with embedded fonts.
Processing: Doing a simple merge with Apache PDFBOX.
Result: 1 PDF/A-1b file with large (too large) file size. (It is almost the sum of the size of all the source files).

Question: Is there a way to reduce the file size of the resulting PDF?
Idea: Remove redundant embedded fonts. But how to? And is it the right way to do?

Unfortunately the following code is not doing the job, but is highlighting the obvious problem.

try (PDDocument document = PDDocument.load(new File("E:/tmp/16189_ZU_20181121195111_5544_2008-12-31_Standardauswertung.pdf"))) {
    List<COSName> collectedFonts = new ArrayList<>();
    PDPageTree pages = document.getDocumentCatalog().getPages();
    int pageNr = 0;
    for (PDPage page : pages) {
        pageNr++;
        Iterable<COSName> names = page.getResources().getFontNames();
        System.out.println("Page " + pageNr);
        for (COSName name : names) {
            collectedFonts.add(name);
            System.out.print("\t" + name + " - ");
            PDFont font = page.getResources().getFont(name);
            System.out.println(font + ", embedded: " + font.isEmbedded());
            page.getCOSObject().removeItem(COSName.F);
            page.getResources().getCOSObject().removeItem(name);
        }
    }
    document.save("E:/tmp/output.pdf");
}

The code produces an output like that:

Page 1
    COSName{F23} - PDTrueTypeFont ArialMT-Bold, embedded: true
    COSName{F27} - PDTrueTypeFont ArialMT-Regular, embedded: true
Page 2
    COSName{F23} - PDTrueTypeFont ArialMT-Bold, embedded: true
    COSName{F33} - PDTrueTypeFont ArialMT-BoldItalic, embedded: true
    COSName{F25} - PDTrueTypeFont ArialMT-Italic, embedded: true
    COSName{F27} - PDTrueTypeFont ArialMT-Regular, embedded: true
Page 3
    COSName{F23} - PDTrueTypeFont ArialMT-Bold, embedded: true
    COSName{F25} - PDTrueTypeFont ArialMT-Italic, embedded: true
    COSName{F27} - PDTrueTypeFont ArialMT-Regular, embedded: true
Page 4
    COSName{F23} - PDTrueTypeFont ArialMT-Bold, embedded: true
    COSName{F25} - PDTrueTypeFont ArialMT-Italic, embedded: true
    COSName{F27} - PDTrueTypeFont ArialMT-Regular, embedded: true
Page 5
    COSName{F23} - PDTrueTypeFont ArialMT-Bold, embedded: true
    COSName{F33} - PDTrueTypeFont ArialMT-BoldItalic, embedded: true
    COSName{F27} - PDTrueTypeFont ArialMT-Regular, embedded: true
Page 6
    COSName{F23} - PDTrueTypeFont ArialMT-Bold, embedded: true
    COSName{F33} - PDTrueTypeFont ArialMT-BoldItalic, embedded: true
    COSName{F27} - PDTrueTypeFont ArialMT-Regular, embedded: true
Page 7
    COSName{F23} - PDTrueTypeFont ArialMT-Bold, embedded: true
    COSName{F33} - PDTrueTypeFont ArialMT-BoldItalic, embedded: true
    COSName{F27} - PDTrueTypeFont ArialMT-Regular, embedded: true
Page 8
    COSName{F23} - PDTrueTypeFont ArialMT-Bold, embedded: true
    COSName{F25} - PDTrueTypeFont ArialMT-Italic, embedded: true
    COSName{F27} - PDTrueTypeFont ArialMT-Regular, embedded: true
Page 9
    COSName{F23} - PDTrueTypeFont ArialMT-Bold, embedded: true
    COSName{F33} - PDTrueTypeFont ArialMT-BoldItalic, embedded: true
    COSName{F25} - PDTrueTypeFont ArialMT-Italic, embedded: true
    COSName{F27} - PDTrueTypeFont ArialMT-Regular, embedded: true
Page 10
    COSName{F23} - PDTrueTypeFont ArialMT-Bold, embedded: true
    COSName{F33} - PDTrueTypeFont ArialMT-BoldItalic, embedded: true
    COSName{F25} - PDTrueTypeFont ArialMT-Italic, embedded: true
    COSName{F27} - PDTrueTypeFont ArialMT-Regular, embedded: true
Page 11
    COSName{F23} - PDTrueTypeFont ArialMT-Bold, embedded: true
    COSName{F33} - PDTrueTypeFont ArialMT-BoldItalic, embedded: true
    COSName{F27} - PDTrueTypeFont ArialMT-Regular, embedded: true
Page 12
    COSName{F23} - PDTrueTypeFont ArialMT-Bold, embedded: true
    COSName{F25} - PDTrueTypeFont ArialMT-Italic, embedded: true
    COSName{F27} - PDTrueTypeFont ArialMT-Regular, embedded: true
Page 13
    COSName{F23} - PDTrueTypeFont ArialMT-Bold, embedded: true
    COSName{F25} - PDTrueTypeFont ArialMT-Italic, embedded: true
    COSName{F27} - PDTrueTypeFont ArialMT-Regular, embedded: true
Page 14
    COSName{F23} - PDTrueTypeFont ArialMT-Bold, embedded: true
    COSName{F25} - PDTrueTypeFont ArialMT-Italic, embedded: true
    COSName{F27} - PDTrueTypeFont ArialMT-Regular, embedded: true

Any help appreciated ...

schowave :

When debugging in the file, I recognized that the font files for the same fonts were referenced several times. So replacing the actual font file item in the dictionary with an already viewed font file item, the reference was removed and compression could be done. By that, I was able to shrink a 30 MB File to around 6 MB.

    File file = new File("test.pdf");

    PDDocument doc = PDDocument.load(file);
    Map<String, COSBase> fontFileCache = new HashMap<>();
    for (int pageNumber = 0; pageNumber < doc.getNumberOfPages(); pageNumber++) {
        final PDPage page = doc.getPage(pageNumber);
        COSDictionary pageDictionary = (COSDictionary) page.getResources().getCOSObject().getDictionaryObject(COSName.FONT);
        for (COSName currentFont : pageDictionary.keySet()) {
            COSDictionary fontDictionary = (COSDictionary) pageDictionary.getDictionaryObject(currentFont);
            for (COSName actualFont : fontDictionary.keySet()) {
                COSBase actualFontDictionaryObject = fontDictionary.getDictionaryObject(actualFont);
                if (actualFontDictionaryObject instanceof COSDictionary) {
                    COSDictionary fontFile = (COSDictionary) actualFontDictionaryObject;
                    if (fontFile.getItem(COSName.FONT_NAME) instanceof COSName) {
                        COSName fontName = (COSName) fontFile.getItem(COSName.FONT_NAME);
                        fontFileCache.computeIfAbsent(fontName.getName(), key -> fontFile.getItem(COSName.FONT_FILE2));
                        fontFile.setItem(COSName.FONT_FILE2, fontFileCache.get(fontName.getName()));
                    }
                }
            }
        }
    }

    final ByteArrayOutputStream baos = new ByteArrayOutputStream();
    doc.save(baos);
    final File compressed = new File("test_compressed.pdf");
    baos.writeTo(new FileOutputStream(compressed));

Maybe this is not the most elegant way to do that, but it works and keeps the PDF/A-1b compatibility.

Guess you like

Origin http://43.154.161.224:23101/article/api/json?id=73475&siteId=1