GROBID library: using the requests library to call the GROBID web service to improve the speed and accuracy of PDF parsing

(1) Problems with calling the GROBID client library directly

Over the past few weeks I used the GROBID library to batch-parse hundreds of PDF documents, as recorded in these two posts:

GROBID library: installation and use

GROBID library: a solution for running GROBID library to parse documents in Windows environment


The key code is as follows:

from grobid_client.grobid_client import GrobidClient

# config.json points the client at the running GROBID server
client = GrobidClient(config_path="./config.json")
# processHeaderDocument extracts only the header metadata; processFulltextDocument parses the whole paper
client.process("processHeaderDocument", "D:/pdf", output="D:/xml",
               consolidate_citations=True, tei_coordinates=True, force=True)

The general process is as follows:

  1. Installed the library from GitHub and found that it does not support Windows
  2. Switched to Docker to get GROBID running normally in a Windows environment
  3. Read the official documentation and called the API to complete the batch parsing

I thought this part of the task was over, but when I checked the parsed documents today, I found that some of the files were only about 1 KB in size.

Obviously these files were not parsed correctly. Opening one of them shows a message like this:

[BAD_INPUT_DATA] PDF to XML conversion failed with error code: 1

or a message like this:

[TIMEOUT] PDF to XML conversion timed out
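
Before deciding what to do next, it helps to see exactly which outputs failed. A minimal sketch of my own (the D:/xml folder matches the setup above; the 2 KB threshold is an assumption) that lists the suspiciously small XML files:

import glob
import os

# list XML outputs that are so small they probably contain only an error message
for xml_file in glob.glob("D:/xml/*.xml"):
    size = os.path.getsize(xml_file)
    if size < 2048:  # roughly the ~1 KB failure cases seen above
        print(xml_file, size, "bytes")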

So simply calling the Python client API as I did before is not advisable, for the following reasons:

  • It is extremely slow; even with multithreading it takes hours to run hundreds of documents.

  • The success rate is low; parsing frequently fails with an error and no XML document is generated.

  • The output is unreliable; even when an XML document is generated, its content is often wrong.

Of course, this may also be due to the limited hardware of my personal computer or some incorrect parameter settings. In any case, I decided to try a new approach.

(2) Web page analysis and packet capture

The approach is similar to a conventional crawler routine, although strictly speaking this is not crawling, just a simple web request.

First, observe the structure of the GROBID web page.


The user submits a paper PDF on the page, the page sends a request to the service behind it, and the parsing result comes back, yet the address in the browser bar never changes.

Clearly the XML data is loaded dynamically, which immediately points to AJAX.

Capturing the traffic with the browser's developer tools successfully reveals the relevant request.


(3) Sending the request with requests

Then the rest of the job is simple:

  • Send a POST request that uploads the PDF file
  • Take the returned XML data and write it to a local file.
import re
import requests
import glob

def getXml(filename, path):
    # upload one PDF to the local GROBID service and save the returned TEI XML
    url = "http://localhost:8070/api/processFulltextDocument"
    with open(path + filename + ".pdf", 'rb') as pdf:
        params = dict(input=pdf)  # GROBID expects the PDF in a form field named "input"
        response = requests.post(url, files=params, timeout=300)
    with open("D:/xml/" + filename + ".xml", "w", encoding="utf-8") as fh:
        fh.write(response.text)

def main():
    path = "D:/pdf/"
    inpdf = glob.glob(path + '*')  # all files in the input folder
    for num in range(len(inpdf)):
        # extract the bare file name between the Windows path separator and ".pdf"
        filename = re.findall(r'\\(.*?)\.pdf', str(inpdf[num]))[0]
        getXml(filename, path)

if __name__ == "__main__":
    main()

glob is used to get the list of files in the folder, and the re regular expression extracts the bare file name from each path.
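
For example, with a hypothetical path D:/pdf\paper1.pdf, as glob returns it on Windows, the expression pulls out the bare file name:

import re

# hypothetical Windows-style path as produced by glob.glob("D:/pdf/*")
re.findall(r'\\(.*?)\.pdf', "D:/pdf\\paper1.pdf")  # -> ['paper1']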

To reuse the script, just adjust the input PDF and output XML folder paths as needed.

Nearly a hundred PDFs can be parsed in about five minutes, with the XML documents saved locally.
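
If more speed or robustness is needed, the per-file requests can also be sent concurrently and the HTTP status code checked before writing, so a failed parse is reported instead of silently producing one of the tiny error files from section (1). This is my own sketch rather than part of the original script, reusing the same folder layout and endpoint:

from concurrent.futures import ThreadPoolExecutor
import glob
import re
import requests

def fetch(filename, path="D:/pdf/"):
    # same request as getXml above, but only save the XML when GROBID reports success
    url = "http://localhost:8070/api/processFulltextDocument"
    with open(path + filename + ".pdf", "rb") as pdf:
        r = requests.post(url, files=dict(input=pdf), timeout=300)
    if r.status_code == 200:
        with open("D:/xml/" + filename + ".xml", "w", encoding="utf-8") as fh:
            fh.write(r.text)
    else:
        print(filename, "failed with HTTP status", r.status_code)

names = [re.findall(r'\\(.*?)\.pdf', p)[0] for p in glob.glob("D:/pdf/*")]
with ThreadPoolExecutor(max_workers=4) as pool:  # 4 workers is an arbitrary, conservative choice
    pool.map(fetch, names)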


Origin blog.csdn.net/yt266666/article/details/127539343