GROBID library: installation and use

1. Grobid installation and use
(1) Introduction to grobid

Official documentation: Introduction - GROBID Documentation

GROBID (or Grobid) stands for GeneRation Of BIbliographic Data.

GROBID is a machine learning library for extracting, parsing, and restructuring raw documents (such as PDFs) into structured XML/TEI-encoded documents, with a particular focus on technical and scientific publications. Development started in 2008 as a hobby, and the tool was released as open source in 2011. Work on GROBID has continued steadily as a side project ever since and is expected to go on.

(2) Program installation

Following the official English documentation (Install GROBID - GROBID Documentation), we install the program as follows.

Go to the grobid repository on GitHub and download a release archive.

After extraction finishes, change into the extracted directory.

Run gradlew clean install on the command line.

The build failed with an invalid JAVA_HOME error: the path stored in the environment variable ended with a semicolon.

After removing the semicolon the problem was solved.
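The semicolon check described above can be automated before starting the build. This is a minimal sketch, not part of GROBID: the helper name clean_java_home and the example paths are made up for illustration.

```python
import os


def clean_java_home(value):
    """Return (cleaned_path, was_changed) for a JAVA_HOME value.

    Hypothetical helper: strips surrounding whitespace and any trailing
    semicolons, which is the problem described in the text above.
    """
    cleaned = value.strip().rstrip(";").strip()
    return cleaned, cleaned != value


if __name__ == "__main__":
    raw = os.environ.get("JAVA_HOME", "")
    cleaned, changed = clean_java_home(raw)
    if not raw:
        print("JAVA_HOME is not set")
    elif changed:
        print("JAVA_HOME should be changed to:", cleaned)
    else:
        print("JAVA_HOME looks fine:", cleaned)
```

The script only reports the problem. The variable itself still has to be corrected in the system settings.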


Gradle then downloads and installs the required jar files automatically.

A success message appears when the build is complete.

(3) Running from the command line

The procedure is described in the official documentation (GROBID service - GROBID Documentation).

Run gradlew run on the command line to start the service.

The progress bar does not need to reach 100%. Once it stops moving, open a browser to test the service.

By default the service listens on port 8070: http://localhost:8070/

Opening that address shows the GROBID web interface.
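Instead of watching the progress bar, a script can check whether the service is up. The sketch below assumes the default address above and uses GROBID's /api/isalive endpoint, which answers with the text true when the service is ready. The function name and the opener parameter (there only to make testing possible without a server) are my own additions.

```python
import urllib.request


def is_grobid_alive(base_url="http://localhost:8070", timeout=5.0,
                    opener=urllib.request.urlopen):
    """Return True if the GROBID service at base_url reports it is alive."""
    try:
        with opener(base_url + "/api/isalive", timeout=timeout) as response:
            return response.read().decode("utf-8").strip().lower() == "true"
    except OSError:
        # Covers refused connections, timeouts, and urllib's URLError.
        return False


if __name__ == "__main__":
    print("GROBID is running" if is_grobid_alive() else "GROBID is not reachable")
```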

(4) Use on the Web

The interface is fairly simple. As a test, I uploaded a PDF and parsed its header information, and the result came back as expected.
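The same header extraction can also be requested without the browser. The sketch below posts a PDF to the /api/processHeaderDocument endpoint as a multipart form with the field name input, using only the standard library. The helper names and the file name paper.pdf are made up for illustration, and the Accept header is my assumption for getting TEI XML back.

```python
import urllib.request
import uuid
from pathlib import Path

GROBID_URL = "http://localhost:8070"  # assumption: default local service


def build_multipart(field, filename, payload):
    """Encode one file as multipart/form-data; return (body, content_type)."""
    boundary = uuid.uuid4().hex
    head = (
        "--" + boundary + "\r\n"
        'Content-Disposition: form-data; name="' + field + '"; '
        'filename="' + filename + '"\r\n'
        "Content-Type: application/pdf\r\n\r\n"
    ).encode("utf-8")
    tail = ("\r\n--" + boundary + "--\r\n").encode("utf-8")
    return head + payload + tail, "multipart/form-data; boundary=" + boundary


def process_header(pdf_path, base_url=GROBID_URL):
    """Send one PDF to GROBID and return the header result as text."""
    pdf = Path(pdf_path)
    body, content_type = build_multipart("input", pdf.name, pdf.read_bytes())
    request = urllib.request.Request(
        base_url + "/api/processHeaderDocument",
        data=body,
        headers={"Content-Type": content_type, "Accept": "application/xml"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=60) as response:
        return response.read().decode("utf-8")


if __name__ == "__main__":
    print(process_header("paper.pdf"))  # hypothetical file name
```

This is fine for a single file. For many files, the client described in the next section is the better choice.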

2. Grobid call: PDF batch processing
(1) API script acquisition

If you don't want to parse PDFs through the web page, you need to call GROBID through its API.

The developers also recommend the API: the web page cannot do batch processing, which makes it far too inefficient for tens of thousands of PDF files.

The official site lists API clients in three languages: Python, Java, and Node.js. I chose Python.

Download the project from https://github.com/kermitt2/grobid-client-python

After unzipping, the project contains several .py files and some test cases.

The readme.md file it ships with can be summarized as follows:

  • This Python client can process a set of PDFs in a given directory concurrently and write the results to a given output directory. It can also be called from another Python program.
  • Before using the client, start the service with gradlew run on the command line and make sure the web port is reachable.
  • No other libraries are needed to use the client, but Python 3.5 or later is required.
(2) Call method

The official documentation describes two ways to use the client.

1. Command line

Open a cmd command prompt.

Change to the Python client directory:

cd grobid_client_python

Install the client:

python3 setup.py install

The parsing command takes an input path, an output path, and the name of the service to run:

grobid_client.py --input C:/Users/*/Desktop/in --output C:/Users/*/Desktop/out processFulltextDocument
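With a large batch it helps to check afterwards which PDFs produced no result. This is a minimal sketch under one assumption: that each output file is named after the PDF with a .tei.xml or .grobid.tei.xml suffix. The naming may differ between client versions, so adjust TEI_SUFFIXES to match what appears in your output directory. The function name is made up for illustration.

```python
import sys
from pathlib import Path

# Assumption: output naming of the client; adjust to your version.
TEI_SUFFIXES = (".tei.xml", ".grobid.tei.xml")


def missing_outputs(input_dir, output_dir, suffixes=TEI_SUFFIXES):
    """Return the names of PDFs in input_dir with no TEI file in output_dir."""
    out = Path(output_dir)
    missing = []
    for pdf in sorted(Path(input_dir).glob("*.pdf")):
        if not any((out / (pdf.stem + suffix)).exists() for suffix in suffixes):
            missing.append(pdf.name)
    return missing


if __name__ == "__main__":
    # Usage: python check_outputs.py <input_dir> <output_dir>
    for name in missing_outputs(sys.argv[1], sys.argv[2]):
        print("no output for", name)
```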

2. Python script

Alternatively, run the following code directly (mind where the script sits relative to the grobid_client package).

It reads all PDF files under resources/in and writes the corresponding XML files to resources/out.

from grobid_client.grobid_client import GrobidClient

if __name__ == "__main__":
    # config.json holds the GROBID server address and client settings
    client = GrobidClient(config_path="./config.json")
    # force=True reprocesses PDFs even if an output file already exists
    client.process("processFulltextDocument", "resources/in", output="resources/out/",
                   consolidate_citations=True, tei_coordinates=True, force=True)
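Once the XML files are in resources/out, they can be read with any XML library. The sketch below pulls the document title out of a TEI file using the standard library. It assumes the title sits in the titleStmt element of the TEI header, which is how GROBID output is normally laid out. The function name is made up for illustration.

```python
import xml.etree.ElementTree as ET
from pathlib import Path

# TEI documents use this XML namespace.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}


def extract_title(tei_xml):
    """Return the title found in a TEI document (str or bytes), or None."""
    root = ET.fromstring(tei_xml)
    node = root.find(".//tei:titleStmt/tei:title", TEI_NS)
    if node is None:
        return None
    text = "".join(node.itertext()).strip()
    return text or None


if __name__ == "__main__":
    for path in sorted(Path("resources/out").glob("*.xml")):
        print(path.name, "->", extract_title(path.read_bytes()))
```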
3. Cause of the error [BAD_INPUT_DATA] PDF to XML conversion failed with error code: 99
(1) Cause of error

Using the web interface with GROBID running on Windows produces the error [BAD_INPUT_DATA] PDF to XML conversion failed with error code: 99.

After an hour of searching GitHub and rereading the official documentation, I found that GROBID no longer supports the Windows platform.

In a community discussion, the developers said that supporting three platforms was too much work for them.

Official documentation:

Windows related issues
Grobid is developed and tested on Linux. macOS is also supported, although some components might behave slightly different due to the natural incompatibility of Apple with the rest of the world and the availability on some proprietary fonts on this platform.

Windows, unfortunately, is currently not anymore supported, due to lack of experience and time constraints. We recommend Windows users to use the Grobid Docker image (documented here) and call the system via API using one of the various grobid clients.

The developer replied:

Site link: [ BAD_INPUT_DATA ] PDF to XML conversion failed with error code: 99 · Issue #166 · kermitt2/grobid · GitHub

grobid is not supported to work on Windows. Unfortunately three platforms are too many for us, I recommend you to run it using docker.

(2) Suggested solutions

1. Use a Mac, since macOS is supported.

2. Set up a Linux environment with Docker or a virtual machine such as VMware Workstation.

See the blog post GROBID library: PDF to XML conversion failed with error code: 99 error resolution.

3. Look for another machine learning library with similar functionality for extracting and parsing PDFs.


Origin blog.csdn.net/yt266666/article/details/127452708