1. Grobid installation and use
(1) Introduction to grobid
Official documentation: Introduction - GROBID Documentation
GROBID (or Grobid) stands for GeneRation Of BIbliographic Data.
GROBID is a machine learning library for extracting, parsing, and restructuring raw documents (such as PDFs) into structured XML/TEI-encoded documents, with a particular focus on technical and scientific publications. Development started in 2008 as a hobby, and the tool was released as open source in 2011. Work on GROBID has continued steadily as a side project since the beginning and is expected to continue.
(2) Program installation
Following the official English documentation (Install GROBID - GROBID Documentation), install the program as follows.
Go to the grobid repository on GitHub and download a Release version.
After decompressing the archive, enter the decompressed directory.
Run gradlew clean install on the command line.
If the build reports an invalid JAVA_HOME, check the variable: in my case there was a semicolon at the end of the JAVA_HOME path.
After removing the semicolon, the problem was solved.
The build then automatically downloads and installs the required jar files.
A success message appears when the installation is complete.
(3) Running from the command line
The procedure is described in the document GROBID service - GROBID Documentation.
Enter gradlew run on the command line to start the program.
The progress bar does not need to reach 100%. Once it stops moving, open a browser to test the API.
The default address is http://localhost:8070/ (port 8070).
Opening this address in a browser shows the GROBID web interface.
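Instead of watching the progress bar, the service can also be checked from a script. The sketch below assumes the service exposes a health check at /api/isalive that answers with the text "true", as described in the GROBID service documentation. The function names isalive_url and is_grobid_alive are my own and not part of GROBID.

```python
import urllib.error
import urllib.request


def isalive_url(base_url: str) -> str:
    """Build the health-check URL from the service base address."""
    return base_url.rstrip("/") + "/api/isalive"


def is_grobid_alive(base_url: str = "http://localhost:8070", timeout: float = 5.0) -> bool:
    """Return True if the service answers the health check with 'true'.

    Assumption: GET /api/isalive returns the plain text 'true' when ready.
    Any connection problem is treated as "not alive yet".
    """
    try:
        with urllib.request.urlopen(isalive_url(base_url), timeout=timeout) as resp:
            return resp.status == 200 and resp.read().decode("utf-8").strip() == "true"
    except (urllib.error.URLError, OSError):
        return False
```

Calling is_grobid_alive() after gradlew run should return True once the server has finished starting, and False while it is still loading or if the port is different.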
(4) Use on the Web
The interface is fairly simple. As a test, I uploaded a PDF and asked it to parse the header information, and it returned a normal result.
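The same header parsing can be requested without the browser. The following is a minimal sketch using only the standard library, assuming the service accepts a multipart/form-data POST at /api/processHeaderDocument with the PDF in a form field named input (as the GROBID service documentation describes). The helper names build_multipart and process_header are hypothetical.

```python
import urllib.request
import uuid
from pathlib import Path
from typing import Tuple


def build_multipart(field: str, filename: str, data: bytes) -> Tuple[bytes, str]:
    """Encode one file as a multipart/form-data body.

    Returns the request body and the matching Content-Type header value.
    """
    boundary = uuid.uuid4().hex
    head = (
        "--{b}\r\n"
        'Content-Disposition: form-data; name="{f}"; filename="{n}"\r\n'
        "Content-Type: application/pdf\r\n\r\n"
    ).format(b=boundary, f=field, n=filename).encode("utf-8")
    tail = "\r\n--{b}--\r\n".format(b=boundary).encode("utf-8")
    return head + data + tail, "multipart/form-data; boundary=" + boundary


def process_header(pdf_path: str, base_url: str = "http://localhost:8070") -> str:
    """POST one PDF to the header service and return the response text.

    Assumption: the endpoint path and the 'input' field name follow the
    GROBID service documentation; adjust them if your version differs.
    """
    pdf = Path(pdf_path)
    body, content_type = build_multipart("input", pdf.name, pdf.read_bytes())
    request = urllib.request.Request(
        base_url.rstrip("/") + "/api/processHeaderDocument",
        data=body,
        headers={"Content-Type": content_type, "Accept": "application/xml"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=60) as resp:
        return resp.read().decode("utf-8")
```

With the server running, process_header("paper.pdf") should return the header as TEI XML text; paper.pdf is a placeholder file name.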
2. Grobid call: PDF batch processing
(1) API script acquisition
If you don't want to parse PDFs through the web interface, you need to call GROBID through its API.
In fact, the developers also recommend using the API. After all, the web interface cannot do batch processing,
which makes it far too inefficient for tens of thousands of PDF files.
The official site provides API clients in three languages: Python, Java, and Node.js. I chose Python.
Download the project from https://github.com/kermitt2/grobid-client-python
After unzipping, there are several .py files
and some test cases.
The readme.md file it provides can be summarized as follows:
- This Python client can concurrently process a set of PDFs in a given directory on the file system and write the results to a given output directory. It can also be called from another Python program.
- To use this client, you first need to run gradlew run on the command line so that the service is listening on its web port.
- According to the readme, no other dependencies are required to use this client, but Python 3.5 or above is required.
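The idea of processing a directory of PDFs concurrently can be illustrated with a minimal sketch. This is not the client's actual implementation: find_pdfs, process_batch, and the convert callback are hypothetical names, where convert stands for any function that sends one PDF to the service and returns XML text. The .tei.xml output suffix is also an assumption.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Callable, List


def find_pdfs(input_dir: str) -> List[Path]:
    """Return the PDF files directly inside input_dir, sorted for a stable order."""
    return sorted(
        p for p in Path(input_dir).iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"
    )


def process_batch(
    input_dir: str,
    output_dir: str,
    convert: Callable[[Path], str],
    workers: int = 10,
) -> List[Path]:
    """Run convert() on every PDF concurrently and write one XML file per PDF."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    pdfs = find_pdfs(input_dir)
    # pool.map keeps the results in the same order as the input list.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(convert, pdfs))
    written = []
    for pdf, xml in zip(pdfs, results):
        target = out / (pdf.stem + ".tei.xml")
        target.write_text(xml, encoding="utf-8")
        written.append(target)
    return written
```

Threads are a reasonable fit here because each worker spends most of its time waiting for the server's response rather than computing.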
(2) Call method
The official documentation describes two ways to use the client.
1. Command line
Open a cmd command line window.
Change to the directory of the Python client:
cd grobid_client_python
Install the client:
python3 setup.py install
The parsing command needs an input path, an output path, and the name of the service to call.
grobid_client.py --input C:/Users/*/Desktop/in --output C:/Users/*/Desktop/out processFulltextDocument
2. Python script
Alternatively, run the following code directly (make sure the grobid_client package can be imported from the location of your script).
It reads all PDF files under resources/in
and generates the corresponding XML files in resources/out.
from grobid_client.grobid_client import GrobidClient

if __name__ == "__main__":
    client = GrobidClient(config_path="./config.json")
    client.process("processFulltextDocument", "resources/in", output="resources/out/", consolidate_citations=True, tei_coordinates=True, force=True)
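Once the XML files exist, they can be read with the standard library. The sketch below assumes the output follows the usual TEI layout (namespace http://www.tei-c.org/ns/1.0, title under titleStmt, authors under sourceDesc/biblStruct/analytic). The element paths and the function names extract_title and extract_author_surnames are my assumptions and may need adjusting for your GROBID version.

```python
import xml.etree.ElementTree as ET
from typing import List, Optional

# TEI namespace used by the XML documents (assumed from the TEI standard).
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}


def extract_title(root: ET.Element) -> Optional[str]:
    """Return the first title found under titleStmt, or None if absent or empty."""
    node = root.find(".//tei:titleStmt/tei:title", TEI_NS)
    if node is None:
        return None
    text = "".join(node.itertext()).strip()
    return text or None


def extract_author_surnames(root: ET.Element) -> List[str]:
    """Return the surnames of the document's own authors, in document order."""
    path = ".//tei:sourceDesc//tei:analytic/tei:author/tei:persName/tei:surname"
    return [el.text.strip() for el in root.findall(path, TEI_NS) if el.text]
```

For a file on disk, load it with root = ET.parse(path).getroot(), where path points at one of the files in resources/out, and pass root to both functions.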
3. [BAD_INPUT_DATA] PDF to XML conversion failed with error code: 99: cause of the error
(1) Cause of the error
Running GROBID on Windows and parsing a PDF through the web interface fails with the error named in the heading above.
After an hour of searching GitHub and re-reading the official documentation, I found that GROBID no longer supports the Windows platform.
The developers said in a community discussion that supporting three platforms was too much work for them.
Official documentation:
Windows related issues
Grobid is developed and tested on Linux. macOS is also supported, although some components might behave slighly different due to the natural incompatibility of Apple with the rest of the world and the availability on some proprietary fonts on this platform.
Windows, unfortunately, is currently not anymore supported, due to lack of experience and time constraints. We recommend Windows users to use the Grobid Docker image (documented here) and call the system via API using one of the various grobid clients.
The developer replied in a GitHub issue:
Issue link: [BAD_INPUT_DATA] PDF to XML conversion failed with error code: 99 · Issue #166 · kermitt2/grobid · GitHub
grobid is not supported to work on Windows. Unfortunately three platforms are too many for us, I recommend you to run it using docker.
(2) Suggested solutions
1. Use a Mac, since macOS is still supported.
2. Set up a Linux environment with Docker or a virtual machine such as VMware Workstation. The official documentation recommends the Grobid Docker image for Windows users.
See the blog post "GROBID library: PDF to XML conversion failed with error code: 99 error resolution".
3. Look for another machine learning library with similar functionality that can extract and parse PDFs.