GROBID library: a solution for running GROBID on Windows to parse documents

1. Problem to be solved: PDF to XML conversion failed with error code: 99

The cause: GROBID no longer supports the Windows platform.

Official documentation:

Windows related issues
Grobid is developed and tested on Linux. macOS is also supported, although some components might behave slighly different due to the natural incompatibility of Apple with the rest of the world and the availability on some proprietary fonts on this platform.

Windows, unfortunately, is currently not anymore supported, due to lack of experience and time constraints. We recommend Windows users to use the Grobid Docker image (documented here) and call the system via API using one of the various grobid clients.

On Windows, opening the GROBID web page or calling the API interface directly produces the error named in the heading above (error code 99).

A different approach is therefore needed so that Windows can use this module too.

2. Docker Desktop

(1) Program introduction
  • Docker is a container-based platform that allows for highly portable workloads.
  • Docker containers can run on a developer's local machine, on a physical or virtual machine in a data center, on a cloud service, or in a hybrid environment.
  • Containers can be started from an image in seconds.
  • Each container is a complete operating environment, and containers are isolated from each other.
  • Docker's portability and lightweight nature allow applications and services to be scaled up or down in real time.
  • Docker containers are similar to virtual machines, but do not create an entire virtual operating system. Instead, Docker allows an application to use the same Linux kernel as the system it is running on. This enables application packages to require only components not already installed on the host computer, reducing package size and improving performance.
(2) Installation steps

1. Docker Desktop is built on WSL2, so install the WSL2 kernel update first (without it, the programs installed later will not run properly)

Link: https://wslstorestorage.blob.core.windows.net/wslblob/wsl_update_x64.msi

2. Go to the official Docker website and download Docker Desktop, the desktop application

3. The program takes up a lot of disk space, so redirect its installation to the D drive (if the C drive has enough space, you can skip this step)

Create the target folder on the D drive (D:\Program Files\Docker in the command below).

Then create a directory junction from the command line. Make sure that no Docker folder already exists under C:\Program Files; otherwise the command reports an error.

mklink /j "C:\Program Files\Docker" "D:\Program Files\Docker"

Also move the image files to the D drive. The following blog post describes how, and the method works well.

[Docker] Modify the storage location of the docker image file on win10 (nine) - modify through WSL2_jwensh's blog - CSDN blog

4. When the download is complete, run the exe installer and click OK

If step 3 was skipped, the installer uses the C drive by default, and the path cannot be changed during installation, so make sure there is enough space on the C drive.

5. Restart the computer to complete the initialization, and continue once the restart has finished

6. Check that the Docker system service is enabled: enter services.msc in cmd to open the system services list

7. Docker now works normally and takes up less than 300 MB of space on the C drive.
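
Besides looking at services.msc, the installation can be checked from a script. The following is only a minimal sketch: it assumes the docker command is on PATH after installation and that `docker info` exits with a non-zero code when the Docker engine cannot be reached. The function name docker_daemon_ready is my own and not part of Docker.

```python
import shutil
import subprocess


def docker_daemon_ready(run=subprocess.run, which=shutil.which):
    """Return True if the docker CLI is found and `docker info` succeeds.

    `run` and `which` are injectable only so the check can be tried
    without a real Docker installation (hypothetical helper, not a Docker API).
    """
    # Assumption: the Docker Desktop installer put `docker` on PATH.
    if which("docker") is None:
        return False
    try:
        result = run(
            ["docker", "info"],
            capture_output=True,
            text=True,
            timeout=30,
        )
    except (OSError, subprocess.TimeoutExpired):
        # The CLI could not be started or did not answer in time.
        return False
    # Exit code 0 is taken to mean that the engine answered.
    return result.returncode == 0
```

Calling `docker_daemon_ready()` in a Python shell should give True once Docker Desktop has started; if it gives False, checking the service in step 6 again is a reasonable next move.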

3. Image file

(1) Image introduction

Official documentation: GROBID with containers - GROBID Documentation

GROBID can be instantiated and run through Docker. To meet the needs of different users, the developers provide two types of Docker images.

  • A lightweight image with only the CRF models: it keeps the image size limited and offers the best runtime performance and memory usage.
  • A full image that can run both CRF and deep learning models: it includes all required Python and TensorFlow libraries, GPU support, and all DL model resources. It can give slightly more accurate results at the cost of slower runtime and higher memory usage. This image is also quite large (the Python and TensorFlow libraries take up over 2 GB, and the preloaded embeddings about 5 GB).
(2) Installation steps

(I installed the lightweight image)

Run the following command in cmd. The number at the end is the version number; check the page below for the latest version.

Lightweight image page: lfoppiano/grobid Tags | Docker Hub

docker pull lfoppiano/grobid:0.7.1

The corresponding image is downloaded automatically from Docker Hub.

Then run the image, using the same version number as the image you pulled:

docker run -t --rm -p 8070:8070 lfoppiano/grobid:0.7.1

When the GROBID startup output appears in the terminal, the container is running successfully.
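
Instead of watching the terminal, a script can wait for the service to answer. This is a rough sketch under a few assumptions: the service is reachable at http://localhost:8070 (from the -p 8070:8070 mapping above), and GROBID's /api/isalive route replies with the text true when it is ready, as I understand the GROBID service documentation. The names is_alive and wait_until_alive are my own.

```python
import http.client
import time
import urllib.request

# Assumption: port mapping from the docker run command above.
GROBID_URL = "http://localhost:8070"


def is_alive(base_url=GROBID_URL, opener=urllib.request.urlopen):
    """Return True if GET /api/isalive answers with status 200 and body 'true'."""
    try:
        with opener(f"{base_url}/api/isalive", timeout=5) as resp:
            body = resp.read().decode("utf-8").strip()
            return resp.status == 200 and body == "true"
    except (OSError, http.client.HTTPException):
        # Connection refused, timeout, HTTP error status, broken response.
        return False


def wait_until_alive(base_url=GROBID_URL, attempts=30, delay=2.0,
                     check=is_alive, sleep=time.sleep):
    """Poll the service up to `attempts` times, pausing `delay` seconds between tries."""
    for _ in range(attempts):
        if check(base_url):
            return True
        sleep(delay)
    return False
```

A batch script could call `wait_until_alive()` once before sending any PDF and stop early if it returns False. The 30 attempts and 2-second pause are arbitrary values, not recommendations from the GROBID project.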

4. Problem solved: parsing completes successfully

With the container above running, neither the web page nor the API interface reports the error any more.

Open the Grobid Web Application page (http://localhost:8070 with the port mapping above); the document is parsed successfully.

Calling the API interface from a Python program also succeeds.
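
For readers without the screenshots, here is a minimal sketch of such a call. It is not the exact program from my earlier article. It assumes the container from section 3 is listening on http://localhost:8070 and uses GROBID's /api/processFulltextDocument route, which takes the PDF as a multipart form field named input. Only the standard library is used, so the multipart body is built by hand; build_multipart, post_pdf_bytes and pdf_to_tei are names I made up for illustration.

```python
import urllib.request
import uuid
from pathlib import Path

# Assumption: port mapping from the docker run command in section 3.
GROBID_URL = "http://localhost:8070"


def build_multipart(field, filename, payload, boundary=None):
    """Encode a single file as multipart/form-data; return (body, content_type)."""
    boundary = boundary or uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        "Content-Type: application/pdf\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + payload + tail, f"multipart/form-data; boundary={boundary}"


def post_pdf_bytes(filename, payload, base_url=GROBID_URL,
                   opener=urllib.request.urlopen):
    """POST PDF bytes to GROBID and return the response text (TEI XML on success)."""
    body, content_type = build_multipart("input", filename, payload)
    request = urllib.request.Request(
        f"{base_url}/api/processFulltextDocument",
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    # Parsing a long PDF can take a while, so the timeout is generous.
    with opener(request, timeout=120) as resp:
        return resp.read().decode("utf-8")


def pdf_to_tei(pdf_path, base_url=GROBID_URL):
    """Read a PDF from disk and send it to GROBID."""
    pdf_path = Path(pdf_path)
    return post_pdf_bytes(pdf_path.name, pdf_path.read_bytes(), base_url)
```

With the container running, `pdf_to_tei("paper.pdf")` (paper.pdf being any local PDF) should return the XML as a string, which can then be written to a file. The sketch does not escape quotes in file names or retry on failure; for batch work, one of the GROBID clients mentioned in the official documentation is probably the better choice.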

For how to install Grobid and call it from a program to extract and parse PDF documents in batches, please refer to my previous article.

GROBID library installation and use: batch extraction and analysis of PDF metadata fields and full-text content

Origin blog.csdn.net/yt266666/article/details/127453067