ChatGPT prompt-tuning guide: extracting titles from a PDF and saving them

1. Please write Python code that uses the pymupdf package to extract the titles from a PDF and save each title's name and page number.

I deliberately gave no hints about the answer, to see how GPT would react. It ought to know that a PDF carries no semantic information: titles and body text alike are just text boxes.

Okay, here is a code example that uses the pymupdf package to extract titles from a pdf and save the title name and page number:

import fitz

def extract_titles(pdf_path):
    doc = fitz.open(pdf_path)
    titles = []

    for page in doc:
        blocks = page.get_text("dict")["blocks"]
        
        for block in blocks:
            if block["type"] == 1 and block["bbox"][3] < 100:
                title = block["text"]
                page_num = page.number + 1
                titles.append({"title": title, "page": page_num})

    return titles

pdf_path = "example.pdf"
titles = extract_titles(pdf_path)

for title in titles:
    print("Title:", title["title"])
    print("Page:", title["page"])
    print()


There was one error in it: page.getText should be page.get_text, which I corrected.

This logic is completely wrong. I tried it on a text PDF I had on hand (Three Eyes Anthology.pdf, i.e. 三眼文集.pdf). The first line of the second page, doc[1], is a title, so let's see whether it can be extracted:

pdf_path = r"d:/docs/wealth-books-2023/权谋/三眼文集.pdf"
doc = fitz.open(pdf_path)
print(doc[1].get_text("dict")['blocks'][0])
'''
{'number': 0,
 'type': 0,
 'bbox': (48.0, 52.383140563964844, 531.0696411132812, 96.42223358154297),
 'lines': [{'spans': [{'size': 16.440000534057617,
     'flags': 4,
     'font': 'MicrosoftYaHeiUI',
     'color': 0,
     'ascender': 1.015625,
     'descender': -0.25439453125,
     'text': '赚钱的本质套路一共就3个:底层收智商税、中层找接盘侠、上层',
     'origin': (48.0, 69.08001708984375),
     'bbox': (48.0,
      52.383140563964844,
      531.0696411132812,
      73.26226043701172)}],
   'wmode': 0,
   'dir': (1.0, 0.0),
   'bbox': (48.0, 52.383140563964844, 531.0696411132812, 73.26226043701172)},
  {'spans': [{'size': 16.440000534057617,
     'flags': 4,
     'font': 'MicrosoftYaHeiUI',
     'color': 0,
     'ascender': 1.015625,
     'descender': -0.25439453125,
     'text': '玩金融!(完整版)',
     'origin': (48.0, 92.239990234375),
     'bbox': (48.0, 75.5431137084961, 199.65902709960938, 96.42223358154297)}],
   'wmode': 0,
   'dir': (1.0, 0.0),
   'bbox': (48.0, 75.5431137084961, 199.65902709960938, 96.42223358154297)}]}
'''

This text block is indeed a title, but its type is 0, not 1, so the code above cannot extract it at all.
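
In the dictionary returned by get_text("dict"), type 0 marks a text block and type 1 an image block, and a text block has no top-level "text" key: the text sits in lines -> spans, as the dump above shows. Below is a minimal sketch of reading a block that way. It runs on a hand-made dictionary of the same shape rather than a real PDF, and the helper name block_text_and_size is hypothetical.

```python
def block_text_and_size(block):
    """Return (joined text, largest span size) for a text block, else None."""
    if block.get("type") != 0:  # 0 = text block, 1 = image block
        return None
    spans = [span for line in block["lines"] for span in line["spans"]]
    if not spans:
        return None
    text = "".join(span["text"] for span in spans)
    return text, max(span["size"] for span in spans)


# Hand-made sample mirroring the structure of the dump above (values invented).
sample_block = {
    "type": 0,
    "lines": [
        {"spans": [{"size": 16.44, "text": "First line of a wrapped title, "}]},
        {"spans": [{"size": 16.44, "text": "second line"}]},
    ],
}
print(block_text_and_size(sample_block))
```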

2. Please write Python code that uses the pymupdf package to extract the titles from a PDF and save each title's name and page number. Be careful to judge by the size of the text block, not by its type.

We do know, however, that a title's font size clearly differs from that of the body text. If the body is set at size 12 and the titles at 16, say, we can steer GPT toward telling them apart by size.
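
The size idea can be tried without any PDF at all. The sketch below assumes the body text is the most frequent font size on a page and treats anything noticeably larger as a title. The function name, the 1.2 ratio, and the sample spans are illustrative assumptions, not part of the original experiment.

```python
from collections import Counter


def split_titles_by_size(spans, ratio=1.2):
    """spans: list of (text, size). Return texts whose size exceeds body size * ratio."""
    if not spans:
        return []
    # Round sizes so that float noise such as 16.440000534 groups together.
    counts = Counter(round(size, 1) for _, size in spans)
    body_size = counts.most_common(1)[0][0]  # assumed: body text is the most common size
    return [text for text, size in spans if size > body_size * ratio]


# Invented sample: one size-16 heading among size-12 body lines.
spans = [("Chapter heading", 16.0), ("body a", 12.0), ("body b", 12.0), ("body c", 12.0)]
print(split_titles_by_size(spans))
```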

You can use the following code to extract the title and page number from a PDF using the pymupdf package and save it in a list:

import fitz

def extract_titles(pdf_path):
    doc = fitz.open(pdf_path)
    titles = []

    for page_num in range(len(doc)):
        page = doc.load_page(page_num)
        blocks = page.get_text("dict", flags=11)["blocks"]

        for block in blocks:
            if block["type"] == 1 and block["bbox"][3] < 100:
                title = block["text"].strip()
                titles.append({"title": title, "page": page_num})

    return titles

pdf_path = "your_pdf_path.pdf"
titles = extract_titles(pdf_path)

for title in titles:
    print(f"Title: {title['title']}, Page: {title['page']}")

Please replace "your_pdf_path.pdf" in the code with the path to the PDF you want to extract titles from. The code uses the pymupdf package to open the PDF file and iterate through each page, then check the type and size of each text block. If the text block is of type 1 (i.e. title) and its bounding box height is less than 100, the title and page number will be extracted and added to the title list. Finally, print out all extracted titles and page numbers.

Nothing has really changed: the code still decides by the text box's type and by a bounding-box coordinate (bbox[3] is the bottom edge of the box, not a font size).

3. Please write Python code that uses the pymupdf package to extract titles from a PDF. The general method is to get every text box on every page; if a box's font size exceeds a certain multiple of the average, treat it as a title, then save the title name and page number.

You can use the pymupdf package to extract titles from PDF files. The following is a sample code that determines the title based on the font size and saves the title name and page number:

import fitz

def extract_titles_from_pdf(file_path, threshold=1.5):
    doc = fitz.open(file_path)
    titles = []

    for page in doc:
        text_blocks = page.getTextBlocks()
        font_sizes = [block[1] for block in text_blocks]
        avg_font_size = sum(font_sizes) / len(font_sizes)

        for block in text_blocks:
            font_size = block[1]
            text = block[4]

            if font_size > avg_font_size * threshold:
                titles.append((text, page.number))

    return titles

# Example call
pdf_path = "example.pdf"
titles = extract_titles_from_pdf(pdf_path)
for title, page_num in titles:
    print(f"Title: {title}\tPage: {page_num}")

There is an error in this code: the first four values of each block are the four coordinates of its bounding rectangle, not a font size. Still, the overall idea should be usable once that is fixed.
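
For reference, each tuple from PyMuPDF's "blocks" extraction has the layout (x0, y0, x1, y1, text, block_no, block_type), so block[1] is the top edge y0 of the rectangle. The small sketch below names the fields, using a made-up tuple rather than a real PDF; the TextBlock name is hypothetical.

```python
from collections import namedtuple

# Field names follow the documented order of one "blocks" tuple.
TextBlock = namedtuple("TextBlock", "x0 y0 x1 y1 text block_no block_type")

# Made-up values in the shape of one "blocks" tuple.
raw = (48.0, 52.38, 531.07, 96.42, "A title line\n", 0, 0)
block = TextBlock(*raw)

print(block.y0)             # what block[1] really is: a coordinate
print(block.y1 - block.y0)  # the block's height, still not a font size
print(block.text.strip())   # the text is the fifth field, block[4]
```

Font sizes do not appear in these tuples at all; in the "dict" output used in this post they live on each span.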

4. Modified code

'''
block: {'number': 1,
  'type': 0,
  'bbox': (158.25, 569.15625, 446.25, 663.375),
  'lines': [{'spans': [{'size': 72.0,
      'flags': 0,
      'font': 'SimSun',
      'color': 16776960,
      'ascender': 1.04296875,
      'descender': -0.265625,
      'text': '三眼文集',
      'origin': (158.25, 644.25),
      'bbox': (158.25, 569.15625, 446.25, 663.375)}],
    'wmode': 0,
    'dir': (1.0, 0.0),
    'bbox': (158.25, 569.15625, 446.25, 663.375)}]}
'''


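import fitz  # PyMuPDF


def extract_titles_from_pdf(file_path, thres=1.2):
    """Return (text, page_index) for every span larger than thres * page average."""
    doc = fitz.open(file_path)
    titles = []

    for i, page in enumerate(doc):
        # Keep text blocks only (type 0); image blocks (type 1) have no lines.
        blocks = [
            block for block in page.get_text("dict")["blocks"]
            if block['type'] == 0
        ]
        # The font size lives on each span, two levels below the block.
        sizes = [
            span['size']
            for block in blocks
            for line in block['lines']
            for span in line['spans']
        ]
        # The small epsilon avoids division by zero on pages without text.
        avg_size = sum(sizes) / (len(sizes) + 1e-9)

        # A span counts as a title when it is larger than thres * average size.
        page_titles = [
            (span['text'], i)  # i is the 0-based page index
            for block in blocks
            for line in block['lines']
            for span in line['spans']
            if span['size'] > avg_size * thres
        ]
        titles += page_titles

    return titles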

OK, then call:

res = extract_titles_from_pdf(pdf_path, 1.2)
print(res)
'''
[('赚钱的本质套路一共就3个:底层收智商税、中层找接盘侠、上层', 1),
 ('玩金融!(完整版)', 1),
 ('钱就是债!——金融家的秘密,老百姓的盲点!明白这个才能不被', 8),
 ('收割', 8),
 ('穷人才想赚快钱!教人致富多为骗局!想变富要明白一个逻辑:分', 11),
 ('配!', 11),
 ('历史观比财经观更重要!经济是政治的延伸,而今天是昨日的推', 14),
 ('演!', 14),
 ('大钱要靠分配!不是卖苦力赚的!人生是无数个局,看局方能破局', 17),
 ('为何啥都不好干了?为何经济放缓了?本质在于这一群体快被抽', 21),
 ('干!', 21),
 ('过去高增长的本质是什么?', 21),
 ...]
'''
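
The output shows that a title wrapped over two lines comes back as two entries. One possible refinement, sketched here under the assumption that the wrapped lines of a title share one block (as in the doc[1] dump earlier), is to join the large spans of each block into a single title. The function works on a dictionary in the shape of page.get_text("dict"); its name and the sample data are made up for illustration.

```python
def titles_from_page_dict(page_dict, page_index, thres=1.2):
    """Return [(title, page_index)] with one entry per block that has large spans."""
    blocks = [b for b in page_dict["blocks"] if b["type"] == 0]
    spans_per_block = [
        [span for line in b["lines"] for span in line["spans"]] for b in blocks
    ]
    sizes = [span["size"] for spans in spans_per_block for span in spans]
    if not sizes:
        return []
    avg_size = sum(sizes) / len(sizes)

    titles = []
    for spans in spans_per_block:
        # Join all large spans of one block so a wrapped title stays together.
        big = [s["text"] for s in spans if s["size"] > avg_size * thres]
        if big:
            titles.append(("".join(big), page_index))
    return titles


def _span(text, size):
    return {"text": text, "size": size}


# Invented page: a two-line title block, a body block, and an image block.
sample_page = {
    "blocks": [
        {"type": 0, "lines": [{"spans": [_span("A long title that ", 16.0)]},
                              {"spans": [_span("wraps", 16.0)]}]},
        {"type": 0, "lines": [{"spans": [_span("body text", 10.0)]} for _ in range(6)]},
        {"type": 1},  # image block, ignored
    ]
}
print(titles_from_page_dict(sample_page, 1))
```

If a PDF puts each wrapped line in its own block, this grouping will not help and a different rule (for example, comparing vertical gaps) would be needed.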

OK, that completes a first working version.
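
The original prompt also asked to save the titles, which the code above does not yet do. A minimal sketch of writing the (title, page) list to a JSON file follows. The file name titles.json, the temporary-directory location, and the helper name save_titles are assumptions; ensure_ascii=False keeps Chinese titles readable in the file.

```python
import json
import os
import tempfile


def save_titles(titles, out_path):
    """Write [(title, page), ...] to out_path as a JSON list of objects."""
    records = [{"title": title, "page": page} for title, page in titles]
    with open(out_path, "w", encoding="utf-8") as f:
        json.dump(records, f, ensure_ascii=False, indent=2)
    return records


# Sample data in the shape returned by extract_titles_from_pdf.
out_path = os.path.join(tempfile.gettempdir(), "titles.json")
save_titles([("玩金融!(完整版)", 1), ("收割", 8)], out_path)
print(out_path)
```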

Origin blog.csdn.net/wizardforcel/article/details/133465513