Extracting table data from PDFs with the Python library Camelot, and some notes on installing and using it

1. Notes on installing and using the Python library Camelot

1) Camelot has two parsing modes, stream and lattice. Lattice preserves the full structure of the table and handles complex tables better than stream. Camelot uses lattice by default, and lattice requires Ghostscript. A Camelot installed with pip alone will therefore usually raise an error when the code runs; you need to download ghostscript.exe and install it separately. In my testing, once Ghostscript is installed there is no need to import ghostscript in the code.
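Before calling Camelot in lattice mode, it can help to confirm that a Ghostscript executable is actually on the PATH. The following is a minimal sketch, not part of Camelot: the helper name find_ghostscript is hypothetical, and the list of executable names (gs on Linux/macOS, gswin64c and gswin32c on Windows) is an assumption about a typical installation.

```python
import shutil

# Typical Ghostscript executable names; this list is an assumption
GHOSTSCRIPT_NAMES = ("gs", "gswin64c", "gswin32c")


def find_ghostscript(names=GHOSTSCRIPT_NAMES):
    """Return the full path of the first Ghostscript executable on PATH, or None."""
    for name in names:
        path = shutil.which(name)
        if path:
            return path
    return None


if __name__ == "__main__":
    print(find_ghostscript() or "Ghostscript not found on PATH")
```

If this prints "Ghostscript not found on PATH" on Windows right after installation, the Ghostscript bin directory may still need to be added to PATH.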

2) If you export Camelot's output as CSV, Chinese characters may come out garbled. To fix this, open the exported CSV file in a text editor and change its encoding to ANSI.
To save directly in Excel format, you need the xlwt module. After installing xlwt with pip, tables.export('file name.xls', f='excel') exports to Excel format.
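The re-encoding step can also be done from Python instead of by hand in a text editor. This is a minimal sketch under the assumption that the exported CSV is UTF-8 (worth checking for your setup); the helper name reencode_csv is hypothetical. By default it writes UTF-8 with a byte order mark (utf-8-sig), which Excel generally recognizes; passing target_encoding='gbk' gives the ANSI code page used on Simplified Chinese Windows.

```python
def reencode_csv(src_path, dst_path,
                 source_encoding="utf-8", target_encoding="utf-8-sig"):
    """Copy a text file to dst_path while changing its encoding."""
    # newline="" leaves the original line endings untouched
    with open(src_path, "r", encoding=source_encoding, newline="") as src:
        text = src.read()
    with open(dst_path, "w", encoding=target_encoding, newline="") as dst:
        dst.write(text)
```

For example, reencode_csv('E://eg.csv', 'E://eg_excel.csv') would write a copy that Excel should open without garbled characters (the output filename here is only an illustration).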

3) Later, a strange problem came up when I installed Camelot on another computer: the program reported an error at runtime. It took repeated checking to find the cause. On that computer I had typed pip install camelot from memory, and the installation succeeded, but the code failed to run. After looking it up, I found that the correct command is pip install camelot-py[cv] (camelot on PyPI is a different package).

So I uninstalled camelot and installed camelot-py[cv] with the correct command, but at runtime the code then reported a problem with import xlwt. The xlwt files in the Python library directory looked normal, and for a long time I could not find the cause. Eventually I uninstalled xlwt on its own and reinstalled it with pip; the xlwt version changed from 0.7 to 1.3 and everything worked. My guess is that the wrong camelot package had pulled in an xlwt version too old to be compatible with Python 3.6.5.
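A quick look at which distributions and versions are actually installed can shorten this kind of debugging. This is a sketch using the standard library's importlib.metadata, which requires Python 3.8 or later (so it would not run on the Python 3.6.5 mentioned above, where pip show gives the same information). The helper name installed_version is hypothetical.

```python
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    # "camelot" and "camelot-py" are separate distributions
    for name in ("camelot", "camelot-py", "xlwt"):
        print(name, installed_version(name))
```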

4) Camelot worked fine at first, but then suddenly reported an error while processing one PDF file: pdfminer.psparser.SyntaxError: Invalid dictionary construct: [/'Type', /'Font', /'Subtype', /'Type0', /' BaseFont', /b"b'", /"ABCDEE+\xcb\xce\xcc\xe5'", /'Encoding', /'Identity-H', /'DescendantFonts', PDFObjRef:11, /'ToUnicode', PDFObjRef:19]

After searching on Baidu, I found a solution: I modified three source files in the pandas and PyPDF2 modules, and everything returned to normal. For the specific modifications, see the article on processing online-preview PDF documents with a Python crawler: https://link.csdn.net/?target=https%3A%2F%2Fwww.cnblogs.com%2FEeyhan%2Farchive%2F2019%2F12%2F30%2F12111371.html

2. Extracting tabular data from PDFs with the Python library Camelot
Original link: https://blog.csdn.net/xc_zhou/article/details/99242995

Camelot: a friendly PDF table data extraction tool

A Python library and command-line tool that lets anyone easily extract tabular data from PDF files.

Install Camelot

Installation is very simple! After installing the related dependencies, you can directly use pip to install.

$ pip install camelot-py

How to use Camelot

Using Camelot to extract data from PDF documents is very simple


Why use Camelot

  • Camelot allows you to precisely control the data extraction process by adjusting the settings
  • Bad tables can be identified from their whitespace and accuracy metrics and discarded, instead of being checked by hand
  • Each table is a pandas DataFrame, which integrates easily into ETL and data analysis workflows
  • Data can be exported to various formats such as CSV, JSON, EXCEL, HTML
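The second point above can be sketched as a small filter. Each Camelot table exposes a parsing_report dictionary with accuracy and whitespace entries; the helper name keep_good_tables and the threshold values below are assumptions chosen for illustration, not recommended settings.

```python
def keep_good_tables(tables, min_accuracy=90.0, max_whitespace=50.0):
    """Return the tables whose parsing report passes both thresholds."""
    good = []
    for table in tables:
        # parsing_report is a dict with 'accuracy', 'whitespace', 'order', 'page'
        report = table.parsing_report
        if (report["accuracy"] >= min_accuracy
                and report["whitespace"] <= max_whitespace):
            good.append(table)
    return good
```

It would be called on the result of camelot.read_pdf(), for example good = keep_good_tables(tables), with thresholds tuned to the documents at hand.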

First, let us look at a simple example: eg.pdf, the entire file has only one page, and there is only one table on this page, as follows:

[Image: the table in eg.pdf]

Use the following Python code to extract the tables in the PDF file:

import camelot

# Extract tables from the PDF file
tables = camelot.read_pdf('E://eg.pdf', pages='1', flavor='stream')

# Table information
print(tables)
print(tables[0])

# Table data
print(tables[0].data)


The output is:

<TableList n=1>
<Table shape=(4, 4)>
[['ID', '姓名', '城市', '性别'], ['1', 'Alex', 'Shanghai', 'M'], ['2', 'Bob', 'Beijing', 'F'], ['3', 'Cook', 'New York', 'M']]
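The data attribute is a plain list of rows, with the header as the first row. As a small illustration (not a Camelot API; the helper name rows_to_records is hypothetical), the rows can be turned into one dictionary per record, assuming the first row really is the header:

```python
def rows_to_records(data):
    """Turn a list of rows (header first) into a list of dicts keyed by header."""
    header, *rows = data
    return [dict(zip(header, row)) for row in rows]


data = [['ID', '姓名', '城市', '性别'],
        ['1', 'Alex', 'Shanghai', 'M'],
        ['2', 'Bob', 'Beijing', 'F'],
        ['3', 'Cook', 'New York', 'M']]
records = rows_to_records(data)
print(records[0])
```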

 
  
  

Looking at the code, camelot.read_pdf() is Camelot's function for extracting table data. Its input parameters are the path of the PDF file, the page numbers (pages), and the table parsing method (flavor), which is either stream or lattice. The default is lattice. The stream method treats the entire PDF page as one table by default; to restrict parsing to a region of the page, use the table_area parameter (named table_areas in newer Camelot releases).

A convenient feature of the camelot module is that it can convert the extracted table data directly to pandas, CSV, JSON, or HTML, for example with tables[0].df and tables[0].to_csv(). Take CSV output as an example:

import camelot

# Extract tables from the PDF file
tables = camelot.read_pdf('E://eg.pdf', pages='1', flavor='stream')

# Write the table data to a CSV file
tables[0].to_csv('E://eg.csv')


The resulting csv file is as follows:

Example 2

In Example 2, we extract the data of a table located in a specific area of a PDF page. Part of the PDF page is shown below:

To extract the only table on the page, we first need to locate it. The coordinate system of a PDF differs from that of an image: the origin is the lower-left corner of the page, the x-axis points right, and the y-axis points up. The following Python code plots the coordinates of the text on the page:
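The difference between the two coordinate systems comes down to flipping the y-axis. This is a minimal sketch, assuming the image coordinates are already expressed in PDF points (no scaling) and that the page height is known; the helper name image_to_pdf_point and the 792-point page height (US Letter) are assumptions for illustration.

```python
def image_to_pdf_point(x, y, page_height):
    """Convert a point from image coordinates (origin top-left, y down)
    to PDF coordinates (origin bottom-left, y up)."""
    return x, page_height - y


# A point 172 units below the top of a 792-point-high page
print(image_to_pdf_point(50, 172, 792))
```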

import camelot

# Extract tables from the PDF
tables = camelot.read_pdf('G://Statistics-Fundamentals-Succinctly.pdf', pages='53',
                          flavor='stream')

# Plot the text coordinates of the PDF page to locate the table
# (newer Camelot releases use camelot.plot(tables[0], kind='text').show() instead)
tables[0].plot('text')


The output is:

UserWarning: No tables found on page-53 [stream.py:292]

 
  
  

The code did not find a table. This is because the stream method treats the entire PDF page as a single table by default. The plot of the page coordinates, however, looks like this:

Comparing the plot with the PDF page shown earlier, it is easy to see that the table region has its upper-left corner at (50, 620) and its lower-right corner at (500, 540). We add the table_area parameter to read_pdf(). The complete Python code is as follows:
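The area is passed as a string of the form "x1,y1,x2,y2", where (x1, y1) is the upper-left corner and (x2, y2) the lower-right corner in PDF coordinates. A small sketch for building that string and catching swapped corners early; the helper name table_area_string is hypothetical and not part of Camelot.

```python
def table_area_string(top_left, bottom_right):
    """Build an 'x1,y1,x2,y2' area string from two corners in PDF coordinates."""
    x1, y1 = top_left
    x2, y2 = bottom_right
    # In PDF coordinates y grows upward, so the top edge has the larger y
    if x1 >= x2 or y1 <= y2:
        raise ValueError("expected top-left and bottom-right corners in PDF coordinates")
    return f"{x1},{y1},{x2},{y2}"


print(table_area_string((50, 620), (500, 540)))
```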

import camelot

# Extract the table from the specified area
# (newer Camelot releases name this parameter table_areas)
tables = camelot.read_pdf('G://Statistics-Fundamentals-Succinctly.pdf', pages='53',
                          flavor='stream', table_area=['50,620,500,540'])

# Convert the table to a pandas DataFrame and show the first rows
table_df = tables[0].df

print(type(table_df))
print(table_df.head(n=6))


The output result is:

<class 'pandas.core.frame.DataFrame'>
         0               1                2           3
0  Student  Pre-test score  Post-test score  Difference
1        1              70               73           3
2        2              64               65           1
3        3              69               63          -6
4        …               …                …           …
5       34              82               88           6

 
  
  

Summary

When extracting tables from a PDF page, besides the parameter that specifies the area, there are also parameters for things such as subscripts and merged cells. For detailed usage, see the official Camelot documentation: https://camelot-py.readthedocs.io/en/master/

Reference: https://www.php.cn/python-tutorials-412223.html
https://mp.weixin.qq.com/s?__biz=MjM5NzU0MzU0Nw==&mid=2651380263&idx=1&sn=514485e8c4fe820834bacbccfccfbb4ae9=1164dc96dccfccf1c4a5398d1&sn=514485e8c4fe820834fccbccfbb4ae9&bdc 23&srcid=0520POo6Bt0M0FUTbhnwNptJ#rd


Origin blog.csdn.net/stay_foolish12/article/details/112506327