Generating a table of contents for a scanned PDF with Python and FreePic2Pdf

I downloaded a book from the Internet and found that its several hundred pages came without a table of contents. Finding the part I wanted to read meant paging through it by hand, and that is when my mood collapsed. You have probably run into this too: many downloaded PDFs are scanned copies, which usually have no table of contents (bookmarks), so some shortcut is needed to add one. Since not all of us can afford the original edition, we make do with a few small tricks.

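Generating a PDF table of contents with Python

First, a method suited to programmers. It only needs a Python environment; no third-party PDF software has to be installed.

Preparation: install the third-party Python library PyPDF2

First install the third-party Python library PyPDF2: pip install PyPDF2
The scripts in this post use the older PyPDF2 1.x interface (PdfFileReader, PdfFileWriter, and the pdf.py source file), which later PyPDF2 releases renamed and then removed, so a 1.x release is the safe choice, for example: pip install "PyPDF2<2"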

Note: the stock PyPDF2 has a bug when reading certain PDFs. Apply the simple source change described in https://github.com/mstamy2/PyPDF2/issues/368: in ${PYTHON_PATH}/site-packages/PyPDF2/pdf.py, change
dest = Destination(NameObject("/"+title + " bookmark"), pageRef, NameObject(fit), *zoomArgs)
to
dest = Destination(NullObject(), pageRef, NameObject(fit), *zoomArgs)
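
If you are unsure where ${PYTHON_PATH}/site-packages actually is on your machine, a small helper can look it up. This is only a sketch: locate_package_file is a hypothetical helper name, not part of PyPDF2, and it returns None when the package is not installed.

```python
import importlib.util
import os


def locate_package_file(package_name, file_name):
    """Return the full path of file_name inside an installed package, or None."""
    spec = importlib.util.find_spec(package_name)
    if spec is None or not spec.submodule_search_locations:
        return None  # package not installed, or not a package directory
    for location in spec.submodule_search_locations:
        candidate = os.path.join(location, file_name)
        if os.path.exists(candidate):
            return candidate
    return None


# With PyPDF2 1.x installed this prints the path of the file to edit;
# otherwise it prints None.
print(locate_package_file("PyPDF2", "pdf.py"))
```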

Then create a script containing the following code (let's say it is named PDFbookmark.py):

import re
import sys

from os.path import exists, splitext
from PyPDF2 import PdfFileReader, PdfFileWriter

# True under Python 2; sys.version_info avoids distutils, which was removed in Python 3.12
is_python2 = sys.version_info[0] < 3

def _get_parent_bookmark(current_indent, history_indent, bookmarks):
    """Return the most recent bookmark with a smaller indent, or None for a top-level entry."""
    assert len(history_indent) == len(bookmarks)
    if current_indent == 0:
        return None
    for i in range(len(history_indent) - 1, -1, -1):
        # len(history_indent) - 1   ===>   0
        if history_indent[i] < current_indent:
            return bookmarks[i]
    return None

def addBookmark(pdf_path, bookmark_txt_path, page_offset):
    if not exists(pdf_path):
        return "Error: No such file: {}".format(pdf_path)
    if not exists(bookmark_txt_path):
        return "Error: No such file: {}".format(bookmark_txt_path)

    with open(bookmark_txt_path, 'r', encoding='utf-8') as f:
        bookmark_lines = f.readlines()
    reader = PdfFileReader(pdf_path)
    writer = PdfFileWriter()
    writer.cloneDocumentFromReader(reader)

    maxPages = reader.getNumPages()
    bookmarks, history_indent = [], []
    # decide the level of each bookmark according to the relative indent size in each line
    #   no indent:          level 1
    #     small indent:     level 2
    #       larger indent:  level 3
    #   ...
    for line in bookmark_lines:
        line2 = re.split(r'\s+', unicode(line.strip(), 'utf-8')) if is_python2 else re.split(r'\s+', line.strip())
        if len(line2) == 1:
            continue

        indent_size = len(line) - len(line.lstrip())
        parent = _get_parent_bookmark(indent_size, history_indent, bookmarks)
        history_indent.append(indent_size)

        title, page = ' '.join(line2[:-1]), int(line2[-1]) - 1
        if page + page_offset >= maxPages:
            return "Error: page index out of range: %d >= %d" % (page + page_offset, maxPages)
        new_bookmark = writer.addBookmark(title, page + page_offset, parent=parent)
        bookmarks.append(new_bookmark)

    out_path = splitext(pdf_path)[0] + '-new.pdf'
    with open(out_path,'wb') as f:
        writer.write(f)

    return "The bookmarks have been added to %s" % out_path

if __name__ == "__main__":
    import sys
    args = sys.argv
    if len(args) != 4:
        print("Usage: %s [pdf] [bookmark_txt] [page_offset]" % args[0])
    else:
        print(addBookmark(args[1], args[2], int(args[3])))

Supplement:
The page_offset in the code above is normally 0. It is only needed when the page numbers printed in the book's table of contents differ from the actual page positions in the PDF (usually by a fixed amount); in that case, set page_offset to that difference.
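
As a rough illustration of how to pick that value, pick any page, compare the number printed on it with its position in the PDF viewer, and subtract. The helper name compute_page_offset is made up for this sketch and is not part of the script above.

```python
def compute_page_offset(printed_page, pdf_page):
    """Offset to add to printed page numbers to get positions in the PDF.

    printed_page: the number printed on the page (as listed in the book's contents)
    pdf_page:     the 1-based position of that same page in the PDF viewer
    """
    return pdf_page - printed_page


# Example: the page printed as "1" is the 10th page of the PDF file,
# so every page number in the txt file must be shifted by 9.
offset = compute_page_offset(printed_page=1, pdf_page=10)
print(offset)
```

The result is the value to pass as the third command-line argument of PDFbookmark.py.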

Add bookmark/table of contents

Prepare a plain-text file (say, toc.txt) and type in the table of contents to add, or copy it straight from a text-based PDF, in a format like this:

Introduction                                   14
I. Interview Questions                         99
    Data Structures                            100
    Chapter 1 | Arrays and Strings             100
        Hash Tables                            100
        StringBuilder                          101
    Chapter 2 | Linked Lists                   104
        Creating a Linked List                 104
        The "Runner" Technique                 105
    Additional Review Problems                 193

The only requirements are that the last item on each line is the page number (any amount of whitespace may precede it) and that bookmarks of the same level use the same indentation (spaces or tabs both work).
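
To check a toc.txt before running it through the script, the same rules can be sketched without PyPDF2. The function parse_toc_lines below is a hypothetical helper written for this illustration; it reports levels starting at 1 and, unlike the script, silently skips lines that do not end in a number.

```python
import re


def parse_toc_lines(lines):
    """Return a list of (level, title, page) tuples from toc.txt-style lines."""
    entries = []
    indent_stack = []  # indent widths of the current chain of ancestors
    for line in lines:
        parts = re.split(r"\s+", line.strip())
        if len(parts) < 2 or not parts[-1].isdigit():
            continue  # skip blank lines and lines without a trailing page number
        indent = len(line) - len(line.lstrip())
        # drop entries indented as much as, or more than, this line:
        # they cannot be its parent
        while indent_stack and indent_stack[-1] >= indent:
            indent_stack.pop()
        indent_stack.append(indent)
        entries.append((len(indent_stack), " ".join(parts[:-1]), int(parts[-1])))
    return entries


sample = [
    "Introduction            14",
    "I. Interview Questions  99",
    "    Data Structures     100",
    "        Hash Tables     100",
    "    Chapter 2 | Linked Lists  104",
]
for level, title, page in parse_toc_lines(sample):
    print(level, title, page)
```

Printing the parsed levels makes indentation mistakes easy to spot before any PDF is written.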

[Note] If you create the txt file on Windows, save it in UTF-8 encoding; otherwise any Chinese characters in it will not be read correctly by the PDFbookmark.py script. (Text files created directly on Windows are usually saved in ANSI encoding by default.)
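
If a file has already been saved in an ANSI code page, it can be re-saved as UTF-8 with a few lines of Python. This is a sketch under an assumption: the source encoding is taken to be "gbk", a common ANSI code page on Chinese-language Windows, so change src_encoding if your system uses another one. convert_to_utf8 is a name invented for this example.

```python
import os
import tempfile


def convert_to_utf8(src_path, dst_path, src_encoding="gbk"):
    """Re-save a text file as UTF-8 (src_encoding is an assumption, see above)."""
    with open(src_path, "r", encoding=src_encoding) as src:
        text = src.read()
    with open(dst_path, "w", encoding="utf-8") as dst:
        dst.write(text)


# Demo on a throwaway file: write GBK bytes, convert, read back as UTF-8.
demo_dir = tempfile.mkdtemp()
ansi_path = os.path.join(demo_dir, "toc_ansi.txt")
utf8_path = os.path.join(demo_dir, "toc.txt")
with open(ansi_path, "wb") as f:
    f.write("第一章 绪论 1\n".encode("gbk"))
convert_to_utf8(ansi_path, utf8_path)
with open(utf8_path, "r", encoding="utf-8") as f:
    converted = f.read()
print(converted)
```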

Then open a terminal/command prompt in the directory containing the PDFbookmark.py script created earlier and run the following command:

The last argument, 9, shifts every page number in toc.txt by +9 (that is, 9 is added to each one):
 python ./PDFbookmark.py '/Users/Emrys/Desktop/demo 2015.pdf' '/Users/Emrys/Desktop/toc.txt' 9

After it runs, a new PDF named [original PDF file name]-new.pdf is created in the same directory as the original PDF.

[Supplement]: Possible problems

Question 1: ValueError: {'/Type': '/Outlines', '/Count': 0} is not in list

If the PDF processed by the script above was previously modified by other PDF editing software, an error like the following may occur:

Traceback (most recent call last):
  File ".\PDFbookmark.py", line 70, in <module>
    print(addBookmark(args[1], args[2], int(args[3])))
  File ".\PDFbookmark.py", line 55, in addBookmark
    new_bookmark = writer.addBookmark(title, page + page_offset, parent=parent)
  File "C:\Anaconda3\lib\site-packages\PyPDF2\pdf.py", line 732, in addBookmark
    outlineRef = self.getOutlineRoot()
  File "C:\Anaconda3\lib\site-packages\PyPDF2\pdf.py", line 607, in getOutlineRoot
    idnum = self._objects.index(outline) + 1
ValueError: {'/Type': '/Outlines', '/Count': 0} is not in list

In this case, modify the getOutlineRoot() function in PyPDF2's pdf.py source file as follows
(the source file path is ${PYTHON_PATH}/site-packages/PyPDF2/pdf.py):

    def getOutlineRoot(self):
        if '/Outlines' in self._root_object:
            outline = self._root_object['/Outlines']
            try:
                idnum = self._objects.index(outline) + 1
            except ValueError:
                if not isinstance(outline, TreeObject):
                    def _walk(node):
                        node.__class__ = TreeObject
                        for child in node.children():
                            _walk(child)
                    _walk(outline)
                outlineRef = self._addObject(outline)
                self._addObject(outlineRef.getObject())
                self._root_object[NameObject('/Outlines')] = outlineRef
                idnum = self._objects.index(outline) + 1
            outlineRef = IndirectObject(idnum, 0, self)
            assert outlineRef.getObject() == outline
        else:
            outline = TreeObject()
            outline.update({ })
            outlineRef = self._addObject(outline)
            self._root_object[NameObject('/Outlines')] = outlineRef

        return outline

Question 2: RuntimeError: generator raised StopIteration

If you hit RuntimeError: generator raised StopIteration when running the script after making the changes above, check whether your Python version is 3.7 or later. From v3.7 on, a StopIteration raised inside a generator is converted into RuntimeError (see PEP 479 for details). If that is the case, try running the code with a Python version earlier than 3.7.
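
The behavior change itself is easy to reproduce in isolation. The snippet below is a generic illustration of PEP 479, not the actual PyPDF2 code path that triggers the error; first_items and first_items_fixed are names made up for this sketch.

```python
def first_items(iterators):
    """Yield the first item of each iterator (buggy on purpose)."""
    for it in iterators:
        # next() raises StopIteration when `it` is empty; inside a generator,
        # Python 3.7+ converts that into RuntimeError (PEP 479).
        yield next(it)


def first_items_fixed(iterators):
    """Same idea, but an empty iterator ends the generator explicitly."""
    for it in iterators:
        try:
            yield next(it)
        except StopIteration:
            return


data = [iter([1]), iter([])]
try:
    list(first_items(data))
    outcome = "no error"
except RuntimeError as exc:
    outcome = "RuntimeError: %s" % exc
print(outcome)
```

On Python 3.7 or later this prints RuntimeError: generator raised StopIteration, the same message as in the error above. Catching StopIteration and returning, as in first_items_fixed, is the usual way to make such code work on newer versions.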

Exporting the table of contents/bookmarks from a PDF with Python

import sys

from os.path import exists
from PyPDF2 import PdfFileReader

# True under Python 2; sys.version_info avoids distutils, which was removed in Python 3.12
is_python2 = sys.version_info[0] < 3


def _parse_outline_tree(reader, outline_tree, level=0):
    """Return List[Tuple[level(int), page(int), title(str)]]"""
    ret = []
    for heading in outline_tree:
        if isinstance(heading, list):
            # contains sub-headings
            ret.extend(_parse_outline_tree(reader, heading, level=level+1))
        else:
            # heading.page.idnum is the PDF object number, not the page number,
            # so resolve the 0-based page index and convert it to a 1-based page
            page = reader.getDestinationPageNumber(heading) + 1
            ret.append((level, page, heading.title))
    return ret

def extractBookmark(pdf_path, bookmark_txt_path):
    if not exists(pdf_path):
        return "Error: No such file: {}".format(pdf_path)
    if exists(bookmark_txt_path):
        print("Warning: Overwriting {}".format(bookmark_txt_path))

    reader = PdfFileReader(pdf_path)
    # List of ('Destination' objects) or ('Destination' object lists)
    #  [{'/Type': '/Fit', '/Title': u'heading', '/Page': IndirectObject(6, 0)}, ...]
    outlines = reader.outlines
    # List[Tuple[level(int), page(int), title(str)]]
    outlines = _parse_outline_tree(reader, outlines)
    max_length = max(len(item[-1]) + 2 * item[0] for item in outlines) + 1
    # print(outlines)
    with open(bookmark_txt_path, 'w') as f:
        for level, page, title in outlines:
            level_space = '  ' * level
            title_page_space = ' ' * (max_length - level * 2 - len(title))
            if is_python2:
                title = title.encode('utf-8')
            f.write("{}{}{}{}\n".format(level_space, title, title_page_space, page))
    return "The bookmarks have been exported to %s" % bookmark_txt_path


if __name__ == "__main__":
    import sys
    args = sys.argv
    if len(args) != 3:
        print("Usage: %s [pdf] [bookmark_txt]" % args[0])
    else:
        print(extractBookmark(args[1], args[2]))

Source of the code above (will be removed on request if it infringes):
Link: https://www.zhihu.com/question/344805337/answer/1116258929
Source: Zhihu

Generating a PDF table of contents with the FreePic2Pdf tool

1. Software Download

Download address ( https://pan.baidu.com/s/1kVHzVmf ) Password: at9e

[Note: this link comes from the VIP-only blog post "Adding tables of contents to PDFs in batches (the most complete and detailed method)"; it will be removed on request if it infringes.]


2. Steps to attach a book's table of contents to the PDF

2.1 Open FreePic2Pdf and click "Change PDF" in the lower right corner


2.2 Select "Get Bookmarks from PDF" → choose the PDF file to process → click Start


A bookmark configuration folder is now generated in the same directory as the PDF file; it contains the files FreePic2Pdf.itf and FreePic2Pdf_bkmk.txt.
(The .txt file stores the bookmark information the PDF already has. Since the PDF has no bookmarks, it will of course be empty or contain only incomplete bookmark information.)

2.3 Copy all entries from the PDF's contents page into a text editor

Next, copy all the entries on the PDF's contents page into a text editor. Here I use Notepad++.

The format must be correct: in particular, a tab must separate the section title from the page number, and the level of every entry must be right! (This is the most tedious step, but it is definitely faster than creating bookmarks one by one.)
Then copy the formatted text into the FreePic2Pdf_bkmk.txt file generated when the bookmarks were extracted.
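
If you already have a space-aligned list such as the toc.txt used in the Python method, the tab formatting can be produced by a short script instead of by hand. This is a sketch under an assumption: it takes the bookmark file layout to be one entry per line, one leading tab per nesting level, and a tab between title and page number, so compare the output with a bookmark file exported by FreePic2Pdf before relying on it. to_tab_format is a hypothetical helper name.

```python
import re


def to_tab_format(lines):
    """Convert 'title   page' lines with arbitrary indentation into
    '<tabs>title<TAB>page' lines (one leading tab per nesting level)."""
    out = []
    indent_stack = []  # indent widths of the current chain of ancestors
    for line in lines:
        match = re.match(r"^(\s*)(.*?)\s+(\d+)\s*$", line)
        if not match or not match.group(2):
            continue  # skip lines without both a title and a trailing page number
        indent = len(match.group(1))
        while indent_stack and indent_stack[-1] >= indent:
            indent_stack.pop()
        indent_stack.append(indent)
        depth = len(indent_stack) - 1  # top-level entries get no leading tab
        out.append("\t" * depth + match.group(2) + "\t" + match.group(3))
    return out


rows = to_tab_format([
    "Preface        1",
    "Chapter 1      3",
    "    1.1 Basics 4",
    "Chapter 2      20",
])
print("\n".join(rows))
```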

2.4 Attach the bookmarks to the PDF

Just click the option to attach the bookmarks to the PDF (the PDF document must be closed at this point), then open the PDF and check that the table of contents was added. If it failed or something looks wrong, the cause is almost always the format of the txt file; adjust it and try again.

The bookmarks were added successfully, hahaha~


Origin blog.csdn.net/ywsydwsbn/article/details/107699577