iText7 study notes - Chapter 7

Preface

    In Chapters 1-4, we used iText7 to create PDF documents. In Chapters 5-6, we manipulated and reused existing PDF documents. All the PDF documents in those chapters conform to ISO 32000, the core PDF standard. ISO 32000 is not the only ISO standard for PDF, though; there are several substandards created for specific purposes. In this chapter, we focus on two:

  • ISO 14289, also called PDF/UA. UA means Universal Accessibility. Documents that comply with the PDF/UA standard can be consumed by everyone, including people who are visually impaired or blind. (Pretty amazing, isn't it?)
  • ISO 19005, also called PDF/A. A means Archiving. The goal is the long-term storage of digital documents.

    In this chapter, we will learn about PDF/A and PDF/UA by creating a series of PDF/A and PDF/UA files.

Create PDF/UA documents

    Before we start the PDF/UA example, let's take a look at the problem we want to solve. In Chapter 1, we created a document with images. In the sentence "Quick brown fox jumps over the lazy dog", we replaced "dog" and "fox" with the corresponding images. When this file is read aloud, a machine cannot know that the first image represents a fox and the second a dog, so the file is read as: "Quick brown jumps over the lazy".

    In a normal PDF, the content is drawn onto a canvas. We may use high-level objects such as List and Table, but once the PDF is created, the structure of these objects is not preserved. A List becomes a sequence of lines, and a text fragment inside a list item does not know that it is part of a list. A Table becomes a group of text fragments at specific positions, and a text fragment does not know that it belongs to a specific row and column.

    Unless we turn a PDF into a Tagged PDF, the document does not contain any semantic structure. When a document has no semantic structure, we say that the PDF is not accessible. To be accessible, a document must make it possible to distinguish which parts of a page are real content and which are not (such as headers and page numbers). If a line of text is not part of a paragraph, we need to know whether it is a title. Of course there are other requirements. We can add all this information to a page by creating a structure tree and marking the content as tagged content. This may sound complicated, but if we use iText7's high-level objects, calling setTagged() is enough to achieve this.
    Once the PdfDocument is defined as a tagged document, the structure introduced by objects such as List, Table and Paragraph is reflected in the Tagged PDF.
    Of course, this is only one of the requirements for an accessible PDF. The following code can help us understand the other requirements:

PdfDocument pdf =  new PdfDocument(new PdfWriter(dest, new WriterProperties().addXmpMetadata()));
Document document = new Document(pdf);
//Setting some required parameters
pdf.setTagged();
pdf.getCatalog().setLang(new PdfString("en-US"));
pdf.getCatalog().setViewerPreferences(
        new PdfViewerPreferences().setDisplayDocTitle(true));
PdfDocumentInfo info = pdf.getDocumentInfo();
info.setTitle("iText7 PDF/UA example");
//Fonts need to be embedded
PdfFont font = PdfFontFactory.createFont(FONT, PdfEncodings.WINANSI, true);
Paragraph p = new Paragraph();
p.setFont(font);
p.add(new Text("The quick brown "));
Image foxImage = new Image(ImageFactory.getImage(FOX));
//PDF/UA: Set alt text
foxImage.getAccessibilityProperties().setAlternateDescription("Fox");
p.add(foxImage);
p.add(" jumps over the lazy ");
Image dogImage = new Image(ImageFactory.getImage(DOG));
//PDF/UA: Set alt text
dogImage.getAccessibilityProperties().setAlternateDescription("Dog");
p.add(dogImage);
document.add(p);
document.close();

    We create a PdfDocument and a Document, but this time we use the addXmpMetadata() method of WriterProperties to add XMP metadata automatically. In PDF/UA, the same metadata must also be stored in the PDF in XML format. This XML may not be compressed: processors that are not familiar with the PDF format must be able to detect the XMP metadata and process it. The XMP stream is created automatically from the entries in the Info dictionary. The Info dictionary is a PDF object that contains data such as the document title. In addition to adding the XMP stream, we also need to do the following to comply with the PDF/UA standard:

  • Mark the PdfDocument as tagged (line 4).
  • Add a language specifier. In this example, the document declares that its main language is American English (line 5).
  • Change the viewer preferences so that the title of the document is always displayed in the top bar of the PDF viewer (lines 6-7). Then put the title into the metadata of the document (lines 8-9).
  • Embed all fonts (line 11). There are actually some other requirements for fonts, but it is too early for us to discuss them.
  • Tag all content. When we add an image, we need to provide alternate text that describes the image (line 17 and line 22).
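
    To make the effect of the last requirement concrete, here is a minimal plain-Java sketch of what a screen reader effectively has to work with. It does not use iText: the Node class and the readAloud and sentence methods are hypothetical names invented for this illustration, and a real structure tree holds far more than this. A span contributes its text, a figure contributes its alternate description, and a figure without one contributes nothing.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, heavily simplified model of a tagged paragraph (not the iText API).
final class ReadingOrderSketch {

    // One child of a <P> tag: a <Span> with text, or a <Figure> with optional alt text.
    static final class Node {
        final String role; // "Span" or "Figure"
        final String text; // text of a Span, or alt text of a Figure (null if missing)

        Node(String role, String text) {
            this.role = role;
            this.text = text;
        }
    }

    // Concatenates what could be read aloud; figures without alt text stay silent.
    static String readAloud(List<Node> paragraph) {
        StringBuilder sb = new StringBuilder();
        for (Node node : paragraph) {
            if (node.text != null) {
                sb.append(node.text);
            }
        }
        return sb.toString();
    }

    // Builds the example sentence; pass null to simulate a missing description.
    static List<Node> sentence(String foxAlt, String dogAlt) {
        List<Node> p = new ArrayList<>();
        p.add(new Node("Span", "The quick brown "));
        p.add(new Node("Figure", foxAlt));
        p.add(new Node("Span", " jumps over the lazy "));
        p.add(new Node("Figure", dogAlt));
        return p;
    }

    public static void main(String[] args) {
        System.out.println(readAloud(sentence(null, null)));   // images are lost
        System.out.println(readAloud(sentence("Fox", "Dog"))); // images are described
    }
}
```

    Without the descriptions, the two animals simply drop out of the sentence, which is exactly the problem described at the start of this section.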

    Now we have finished creating the PDF/UA document. The results are shown in Figures 1 and 2. The difference from the earlier document may not be obvious at first, but it shows when we open the Tags panel (this requires Adobe Acrobat Pro; Adobe Acrobat Reader DC will not work):

itext7-1

Figure 1. A PDF/UA document and its structure

itext7-2

Figure 2. Document properties (Ctrl+D)

    We can see that the <Document> tag contains a <P> tag, and that the <P> tag consists of two <Span> tags and two <Figure> tags. We will create more complex PDF/UA documents later in this chapter; for now, let's take a look at how to create PDF/A.

Create PDF/A documents: PDF/A-1

    Part 1 of ISO 19005 was released in 2005. It was defined as a subset of version 1.4 of Adobe's PDF specification (which was not yet an ISO standard at the time). ISO 19005-1 introduced a series of obligations and restrictions:

  • The document must be self-contained: all fonts need to be embedded; external movie, sound and other binary files are not allowed.
  • The document must store its metadata in XMP (eXtensible Metadata Platform) format: ISO 16684 (XMP) describes how to store metadata as XML inside a binary file, so that software that does not know how to read and interpret the binary format can still extract the metadata of the file.
  • Functionality that is not future-proof is not allowed: the PDF cannot contain JavaScript and cannot be encrypted.

    ISO 19005-1:2005 (PDF/A-1) defines two conformance levels:

  • Level B ("basic"): Ensures that the visual appearance of the file is preserved in the long term.
  • Level A ("accessible"): Not only ensures that the visual appearance is preserved in the long term, but also introduces structural and semantic requirements. The PDF needs to be a Tagged PDF. (Note that this is similar to PDF/UA but not identical; the reason is mentioned in the following example.)

    The following code shows how to make the "Quick brown fox" PDF we created earlier comply with the PDF/A-1b standard:

//Initialize PDFA document with output intent
PdfADocument pdf = new PdfADocument(new PdfWriter(dest),
    PdfAConformanceLevel.PDF_A_1B,
    new PdfOutputIntent("Custom", "", "http://www.color.org",
            "sRGB IEC61966-2.1", new FileInputStream(INTENT)));
Document document = new Document(pdf);
//Fonts need to be embedded
PdfFont font = PdfFontFactory.createFont(FONT, PdfEncodings.WINANSI, true);
Paragraph p = new Paragraph();
p.setFont(font);
p.add(new Text("The quick brown "));
Image foxImage = new Image(ImageFactory.getImage(FOX));
p.add(foxImage);
p.add(" jumps over the lazy ");
Image dogImage = new Image(ImageFactory.getImage(DOG));
p.add(dogImage);
document.add(p);
document.close();

    We can see that we no longer use a PdfDocument instance; instead, we create a PdfADocument instance. The first parameter of the PdfADocument constructor is a PdfWriter, the second is the conformance level (here PdfAConformanceLevel.PDF_A_1B), and the third is a PdfOutputIntent. The output intent tells the viewer how to interpret the colors stored in the document. On line 8, we make sure that the font is embedded.

    The resulting PDF looks like Figure 3:

itext7-3

Figure 3. A PDF/A-1B standard document

    In the figure above, we can see a small blue bar with the text "This file complies with the PDF/A standard and has been opened in read-only mode to prevent it from being modified". There are two things to note about this sentence:

  • The sentence does not mean that the PDF actually complies with the PDF/A standard; it only states that the file claims to. To confirm whether it really complies, we need to open the "Standards" panel in Adobe Acrobat and click the "Verify compliance" link. Acrobat then verifies whether the document is what it claims to be. In this example, the result is "verification successful", so we have indeed created a PDF/A-1b document.
  • The document has been opened in read-only mode, not because modification is forbidden (PDF/A does not protect a PDF against modification), but because any modification might turn it into a PDF that no longer complies with the PDF/A standard. Updating a PDF/A document is allowed as long as the update does not break its PDF/A status.

    Next, let's take a look at how to create a PDF/A-1a document. The code is as follows:

//Initialize PDFA document with output intent
PdfADocument pdf = new PdfADocument(new PdfWriter(dest),
    PdfAConformanceLevel.PDF_A_1A,
    new PdfOutputIntent("Custom", "", "http://www.color.org",
            "sRGB IEC61966-2.1", new FileInputStream(INTENT)));
Document document = new Document(pdf);
//Setting some required parameters
pdf.setTagged();
//Fonts need to be embedded
PdfFont font = PdfFontFactory.createFont(FONT, PdfEncodings.WINANSI, true);
Paragraph p = new Paragraph();
p.setFont(font);
p.add(new Text("The quick brown "));
Image foxImage = new Image(ImageFactory.getImage(FOX));
//Set alt text
foxImage.getAccessibilityProperties().setAlternateDescription("Fox");
p.add(foxImage);
p.add(" jumps over the lazy ");
Image dogImage = new Image(ImageFactory.getImage(DOG));
//Set alt text
dogImage.getAccessibilityProperties().setAlternateDescription("Dog");
p.add(dogImage);
document.add(p);
document.close();

    Let's walk through the code. In line 3, we changed PdfAConformanceLevel.PDF_A_1B to PdfAConformanceLevel.PDF_A_1A. In line 8, we turn the PdfADocument into a Tagged PDF, and then we add alternate descriptions for the images. The final result is shown in Figure 4 below:

itext7-4

Figure 4. A PDF/A-1A standard document

    When we open the Standards panel, we can see that Adobe Acrobat Pro considers this file to be PDF/A-1A and PDF/UA-1, but this time there is no link to verify compliance, so we need to resort to the Preflight tool. (It took me a long time to find it in the Chinese version of Acrobat, where it is listed as pre-press inspection, so here are the steps: open PDF Standards in the tools → Preflight (or open Preflight directly from the left-hand side) → find the PDF/A-1b profile under PDF/A compliance → Analyze.) The result is shown in Figure 5 below:

itext7-8

Figure 5. Using the Preflight tool to verify compliance

    Looking at the screenshot (taken from the English version), we can see that no errors were found. We cannot verify PDF/UA compliance in the same way, because PDF/UA involves some requirements that a machine cannot verify. For example, if we swapped the description of the fox image with the description of the dog image, a machine would not notice. The file would still be inaccessible, because it would give false information to people who rely on screen readers. In any case, the document we created here does not meet the PDF/UA standard, because we omitted some essential elements (such as the language, which we did set in the first example).
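
    The limitation can be shown with a few lines of plain Java. This is only a sketch of the idea, not what Acrobat or iText actually do; the class name, the method name and the image file names are made up for this illustration. A mechanical check can confirm that every figure has some description, but it has no way of telling whether the description is the right one:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch (not Acrobat's or iText's logic): a machine can check that
// alternate descriptions are present, but not that they are correct.
final class AltTextCheckSketch {

    // Returns true when every image has a non-blank alternate description.
    static boolean allFiguresDescribed(Map<String, String> altTextByImage) {
        for (String alt : altTextByImage.values()) {
            if (alt == null || alt.trim().isEmpty()) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        Map<String, String> swapped = new LinkedHashMap<>();
        swapped.put("fox.bmp", "Dog"); // wrong on purpose
        swapped.put("dog.bmp", "Fox"); // wrong on purpose
        // The check passes even though both descriptions are false.
        System.out.println(allFiguresDescribed(swapped));
    }
}
```

    Only a human reviewer (or something that can actually recognize the images) can catch the swapped descriptions.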

    From the beginning, it was decided that approved parts of ISO 19005 would never become invalid. New, subsequent parts only define new, useful features. These subsequent parts are PDF/A-2 and PDF/A-3, which we are about to introduce.

Create PDF/A documents: PDF/A-2 and PDF/A-3

    ISO 19005-2:2011 (PDF/A-2) was introduced so that the PDF/A standard refers to the ISO standard rather than to Adobe's PDF specification. PDF/A-2 also adds a number of features and improvements that were introduced in PDF 1.5, 1.6 and 1.7:

  • Useful additions include: JPEG2000 support, collections, object-level XMP and optional content.
  • Useful improvements include: better support for transparency, comment types and annotations, and digital signatures.

    In terms of conformance, in addition to the original Level A and Level B, PDF/A-2 defines one additional level:

  • Level U ("Unicode"): Ensures that the visual appearance of the document is preserved in the long term and that all text is stored in Unicode.

    ISO 19005-3:2012 (PDF/A-3) is almost the same as PDF/A-2. The only difference is that in PDF/A-3, attachments do not have to be in PDF/A format. Any file format can be attached to a PDF/A-3 document. For example, you can attach an Excel file containing the calculations whose results are used in the document, the Word file that was used to create the PDF document, and so on. The document itself needs to comply with all obligations and restrictions of the PDF/A specification, but these obligations and restrictions do not apply to its attachments.

    In the following example, we will create a document that complies with both the PDF/UA and the PDF/A-3A standards. We choose PDF/A-3 because we want to attach the CSV file that is used to create the PDF. The code is as follows:

PdfADocument pdf = new PdfADocument(new PdfWriter(dest),
    PdfAConformanceLevel.PDF_A_3A,
    new PdfOutputIntent("Custom", "", "http://www.color.org",
            "sRGB IEC61966-2.1", new FileInputStream(INTENT)));
Document document = new Document(pdf, PageSize.A4.rotate());
//Setting some required parameters
pdf.setTagged();
pdf.getCatalog().setLang(new PdfString("en-US"));
pdf.getCatalog().setViewerPreferences(
        new PdfViewerPreferences().setDisplayDocTitle(true));
PdfDocumentInfo info = pdf.getDocumentInfo();
info.setTitle("iText7 PDF/A-3 example");
//Add attachment
PdfDictionary parameters = new PdfDictionary();
parameters.put(PdfName.ModDate, new PdfDate().getPdfObject());
PdfFileSpec fileSpec = PdfFileSpec.createEmbeddedFileSpec(
    pdf, Files.readAllBytes(Paths.get(DATA)), "united_states.csv",
    "united_states.csv", new PdfName("text/csv"), parameters,
    PdfName.Data, false);
fileSpec.put(new PdfName("AFRelationship"), new PdfName("Data"));
pdf.addFileAttachment("united_states.csv", fileSpec);
PdfArray array = new PdfArray();
array.add(fileSpec.getPdfObject().getIndirectReference());
pdf.getCatalog().put(new PdfName("AF"), array);
//Embed fonts
PdfFont font = PdfFontFactory.createFont(FONT, true);
PdfFont bold = PdfFontFactory.createFont(BOLD_FONT, true);
// Create content
Table table = new Table(new float[]{4, 1, 3, 4, 3, 3, 3, 3, 1});
table.setWidthPercent(100);
BufferedReader br = new BufferedReader(new FileReader(DATA));
String line = br.readLine();
process(table, line, bold, true);
while ((line = br.readLine()) != null) {
    process(table, line, font, false);
}
br.close();
document.add(table);
//Close document
document.close();

    Let's explain the code line by line:

  • Lines 1-5: We create a PdfADocument (conformance level PdfAConformanceLevel.PDF_A_3A) and a Document.
  • Line 7: Make the PDF a Tagged PDF; required by both the PDF/UA and the PDF/A-3A standards.
  • Lines 8-12: Set the language, the document title and the viewer preferences; required by the PDF/UA standard.
  • Lines 14-24: Add an attachment with specific parameters; related to the PDF/A-3A standard.
  • Lines 26-27: Embed the fonts; required by both the PDF/UA and the PDF/A-3A standards.
  • Lines 28-38: Create the content in the same way as in our earlier code from Chapter 1.
  • Line 40: Close the document and save the content.
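
    The listing calls a process helper that is not shown here. As a rough, iText-free sketch of its parsing half, the code below splits one line of the CSV file into cell values. The semicolon delimiter and the names CsvRowSketch and splitRow are assumptions made for this illustration, not taken from the original helper. In the real helper, each value would presumably be wrapped in a Cell and added to the Table, as a header cell when the last argument is true.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Hypothetical sketch of the parsing step only; no iText objects are created here.
final class CsvRowSketch {

    // Assumption: the data file separates its columns with semicolons.
    static final String DELIMITER = ";";

    // Splits one line into trimmed cell values.
    // Note: StringTokenizer skips empty fields, so "a;;b" yields two values, not three.
    static List<String> splitRow(String line) {
        List<String> cells = new ArrayList<>();
        StringTokenizer tokenizer = new StringTokenizer(line, DELIMITER);
        while (tokenizer.hasMoreTokens()) {
            cells.add(tokenizer.nextToken().trim());
        }
        return cells;
    }

    public static void main(String[] args) {
        // The first line of the file would be treated as the header row.
        System.out.println(splitRow("name;abbr;capital"));
    }
}
```

    If the data can contain empty fields, String.split with a negative limit would be a safer choice than StringTokenizer, since it keeps the empty values and the columns stay aligned.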

    As shown in Figure 6 below, the Tags panel shows that the Table and Cell objects we added to the document are saved as a table structure, a bit like HTML:

itext7-5

Figure 6. A PDF/A-3A standard document

    If we open the Attachments panel, we can see the CSV source file, which can easily be extracted, as shown in Figure 7:

itext7-6

Figure 7. A PDF/A-3A standard document and its attachments

    As the examples above show, creating a PDF/UA or PDF/A document requires extra information compared with an ordinary PDF file. *"Can we use iText to convert an existing ordinary PDF document into one that conforms to the PDF/UA or PDF/A standards?"* is one of the most frequently asked questions in forums and consultations. We hope this chapter makes clear that iText cannot do this conversion automatically, for reasons such as the following:

  • If we have a document with images of a fox and a dog as before, iText cannot automatically add the missing alternate descriptions, because iText cannot recognize what the images depict (put bluntly, it has no machine learning or artificial intelligence module to identify content).
  • If a font is not embedded and the corresponding font program is not provided, iText does not know what the font looks like and cannot embed it into the document.

    Of course, these are just two of the reasons why automatic conversion is not possible. It is easy to make a PDF display a small blue bar saying that the document appears to comply with the PDF/A standard, but not every such claim is true.
    Finally, let's take a look at merging PDF/A documents.

Merging PDF/A documents

    When merging PDF/A files, the most important point is that every document we merge must be a PDF/A file; we cannot mix a PDF/A file with an ordinary one. The same goes for the conformance level: we cannot merge a level A document with a level B document, because one has a structure tree and the other does not, and the merged result would be incorrect.
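
    The rule can be written down as a small pre-check. This is a plain-Java sketch under stated assumptions, not an iText feature: the class and method names are hypothetical, and we assume that the conformance level letter of each source ("A", "B" or "U") has already been determined somehow, with null standing for an ordinary PDF that is not PDF/A at all.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical pre-check for the merging rule described above (not part of iText).
final class MergePreconditionSketch {

    // Each entry is the conformance level letter of one source document,
    // or null when that source is not a PDF/A file.
    static boolean canMerge(List<String> levels) {
        if (levels.isEmpty()) {
            return false; // nothing to merge
        }
        Set<String> distinct = new HashSet<>(levels);
        // Every source must be PDF/A (no null) and all must share one level.
        return !distinct.contains(null) && distinct.size() == 1;
    }

    public static void main(String[] args) {
        System.out.println(canMerge(Arrays.asList("A", "A")));  // two level A files
        System.out.println(canMerge(Arrays.asList("A", "B")));  // mixed levels
        System.out.println(canMerge(Arrays.asList("A", null))); // one ordinary PDF
    }
}
```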

    We merged the two level A PDF/A documents created earlier, and the resulting file is shown in Figure 8 below:

itext7-7

Figure 8. Merging two level A PDF/A documents

    In the Tags panel, we see a <P> followed by a <Table>. The following code shows how to create this document:

PdfADocument pdf = new PdfADocument(new PdfWriter(dest),
    PdfAConformanceLevel.PDF_A_1A,
    new PdfOutputIntent("Custom", "", "http://www.color.org",
            "sRGB IEC61966-2.1", new FileInputStream(INTENT)));
//Setting some required parameters
pdf.setTagged();
pdf.getCatalog().setLang(new PdfString("en-US"));
pdf.getCatalog().setViewerPreferences(
        new PdfViewerPreferences().setDisplayDocTitle(true));
PdfDocumentInfo info = pdf.getDocumentInfo();
info.setTitle("iText7 PDF/A-1a example");
//Create PdfMerger instance
PdfMerger merger = new PdfMerger(pdf);
//Add pages from the first document
PdfDocument firstSourcePdf = new PdfDocument(new PdfReader(SRC1));
merger.addPages(firstSourcePdf, 1, firstSourcePdf.getNumberOfPages());
//Add pages from the second pdf document
PdfDocument secondSourcePdf = new PdfDocument(new PdfReader(SRC2));
merger.addPages(secondSourcePdf, 1, secondSourcePdf.getNumberOfPages());
//Merge
merger.merge();
//Close the documents
firstSourcePdf.close();
secondSourcePdf.close();
pdf.close();

    Overall, this code can be said to be very similar to the previous example:

  • I won't go over lines 1-11, which are no different from the previous code.
  • Lines 12-25 resemble the Oscar awards merging example from the previous chapter. When creating the PdfMerger, we pass in a PdfADocument object, and then we add the source documents to the PdfMerger as PdfDocument objects. If a source is opened as a PdfADocument instead, it is checked for PDF/A conformance.

    There is much more to say about the PDF/UA and PDF/A standards, and there are other substandards as well. For example, there is a German standard called ZUGFeRD that builds on PDF/A-3, which will be described in another series (that is what the official documentation says; personally, it depends on demand, and if I have time I will dig into it).

Summary

    In this chapter, we discussed creating and merging documents that comply with other PDF standards, and learned how to create PDF/UA and PDF/A documents. This series ends here. Of course, other series are needed to learn about iText7 in more depth.

This is the end of the iText7 study notes, but there will be other iText7 series, such as the current iText7 study notes and talk series, and there are many examples and articles on the official iText7 website. I will keep improving the format and content of these articles. Please continue to support my iText7 series, and after reading, don't forget to follow and like~


Origin blog.csdn.net/u012397189/article/details/78882454