Table recognition with python-opencv

I recently learned opencv and built a small tool that recognizes a table in a picture and, once recognition is complete, writes the data to a csv file.

Environment preparation:

First, here is the environment we need:
1. Python installed (the author uses python3.7)
2. tesseract (Google's open source OCR engine)
3. pytesseract (a Python wrapper for tesseract; Python code calls tesseract through pytesseract)
4. Some Python libraries: numpy, matplotlib
5. The opencv package deserves a special mention: it is named opencv-python when installing and cv2 when importing.
The 2 in cv2 does not mean version two of opencv. It marks the newer Python binding that replaced the old cv module, and it has nothing to do with the opencv version.
One more thing to note here: if using cv2 reports an error, you may need to install opencv-python-headless instead. Be aware that the headless build has no GUI support, so calls such as cv2.imshow will not work with it.
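
As a quick sanity check of the list above, the following minimal sketch reports which dependencies Python can find. The helper name `check_environment` is made up for this illustration, and it only tests whether the modules and the tesseract executable can be located, not whether they work.

```python
import importlib.util
import shutil

# Import names, not pip package names: opencv-python is imported as cv2.
REQUIRED_MODULES = ["cv2", "pytesseract", "numpy", "matplotlib"]


def check_environment(modules=REQUIRED_MODULES):
    """Return a dict mapping each dependency to True (found) or False (missing)."""
    status = {name: importlib.util.find_spec(name) is not None for name in modules}
    # pytesseract only wraps the tesseract executable, so the binary must be on PATH
    # (or be configured through pytesseract.pytesseract.tesseract_cmd).
    status["tesseract (binary)"] = shutil.which("tesseract") is not None
    return status


if __name__ == "__main__":
    for name, found in check_environment().items():
        print(f"{name}: {'found' if found else 'MISSING'}")
```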

Recognition steps:

Let's briefly cover the principle first: how is the table recognized? If we run tesseract directly on the whole picture, it also tries to read the table's horizontal and vertical lines as text.
Many Chinese characters are then not recognized, and the result is poor. So the preliminary step is to find the cells of the table in the picture, recognize the content of each cell, and stitch the results
into a csv. To get an Excel file in the end, simply import the csv data in Excel.

Note: my skills are limited, and so far I can only handle relatively simple, axis-aligned tables. The code below can be run directly. However, pictures of tables taken at an angle, or pictures with interference
such as a pen lying in the frame or a complicated watermark, cannot be handled at present; I am still studying these cases. Anyone interested is welcome to get in touch on WeChat to discuss.

However, there are some approaches for these situations, which are still being worked on:
1. If the table is tilted, a perspective transformation can first warp it into a regular rectangle;
2. For a simple watermark, convert the picture to grayscale and choose a suitable threshold; after binarization the image is pure black and white, and the watermark is gone.

Here is a good introductory tutorial for python-opencv: github address

Okay, enough preamble. Let's start the tutorial:
original picture:

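1. Read the picture and convert it to grayscale:

This can be understood as turning a color picture into a gray one.

import cv2

# src is the path of the input picture; flag 1 loads it as a 3-channel BGR color image
raw = cv2.imread(src, 1)
# Grayscale picture
gray = cv2.cvtColor(raw, cv2.COLOR_BGR2GRAY)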
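
To make the conversion less of a black box: OpenCV documents BGR-to-gray as the weighted sum Y = 0.299 R + 0.587 G + 0.114 B. The numpy sketch below applies that formula by hand; `bgr_to_gray` is a name invented for this example, and its output may differ from cv2.cvtColor by a rounding step.

```python
import numpy as np


def bgr_to_gray(bgr):
    """Weighted sum of the B, G and R channels, rounded to uint8."""
    b, g, r = bgr[..., 0], bgr[..., 1], bgr[..., 2]
    gray = 0.114 * b + 0.587 * g + 0.299 * r
    return np.clip(np.rint(gray), 0, 255).astype(np.uint8)


if __name__ == "__main__":
    # Three pixels in BGR order: white, pure blue, pure red
    pixels = np.array([[[255, 255, 255], [255, 0, 0], [0, 0, 255]]], dtype=np.uint8)
    print(bgr_to_gray(pixels))
```

Green carries the largest weight and blue the smallest, so a pure blue pixel ends up much darker than a pure green one.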

2. Picture binarization:

Binarization turns the picture into pure black and white, which is easier to process; color is not needed for table recognition anyway. The gray picture is inverted first (~gray) so that the dark table lines and text end up white on a black background.

binary = cv2.adaptiveThreshold(~gray, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C, cv2.THRESH_BINARY, 35, -5)
# Show the picture (call cv2.waitKey(0) afterwards so the window stays open)
cv2.imshow("binary_picture", binary)

Binary image:

3. Identify horizontal and vertical lines:

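Next come erosion and dilation. Eroding with a long, thin kernel removes everything that is shorter than the kernel, so only lines in that direction survive; dilating with the same kernel then restores them to their original length. The same operations also help when the picture is not clear enough or contains small specks.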

rows, cols = binary.shape
scale = 40
# Derive the kernel size from the picture size
# Detect horizontal lines (wide kernel, 1 pixel high); note that dilated_col holds the horizontal lines
kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (cols // scale, 1))
eroded = cv2.erode(binary, kernel, iterations=1)
dilated_col = cv2.dilate(eroded, kernel, iterations=1)
cv2.imshow("excel_horizontal_line", dilated_col)

# Detect vertical lines (tall kernel, 1 pixel wide); dilated_row holds the vertical lines
scale = 20
kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (1, rows // scale))
eroded = cv2.erode(binary, kernel, iterations=1)
dilated_row = cv2.dilate(eroded, kernel, iterations=1)
cv2.imshow("excel_vertical_line:", dilated_row)

Diagram of horizontal and vertical lines:

4. Compute the intersections of the horizontal and vertical lines to get the coordinates of each cell:

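import numpy as np

# AND the two line pictures: only pixels on both a horizontal and a vertical line remain (the intersections)
bitwise_and = cv2.bitwise_and(dilated_col, dilated_row)
cv2.imshow("excel_bitwise_and", bitwise_and)

# Add the two line pictures to mark the table outline
merge = cv2.add(dilated_col, dilated_row)
cv2.imshow("entire_excel_contour:", merge)

# Subtract the outline from the binary picture to remove the table borders
merge2 = cv2.subtract(binary, merge)
cv2.imshow("binary_sub_excel_rect", merge2)

# Morphological opening with a small kernel removes leftover specks
new_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (2, 2))
erode_image = cv2.morphologyEx(merge2, cv2.MORPH_OPEN, new_kernel)
cv2.imshow('erode_image2', erode_image)
merge3 = cv2.add(erode_image, bitwise_and)
cv2.imshow('merge3', merge3)

# Extract the coordinates of the intersection pixels (row indices first, then column indices)
ys, xs = np.where(bitwise_and > 0)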

Intersection map:
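
The intersection step is easy to verify on a toy grid. This sketch uses numpy's bitwise_and instead of cv2.bitwise_and, which gives the same result for 0/255 masks of dtype uint8; the 6x8 grid is made up for illustration.

```python
import numpy as np

horizontal = np.zeros((6, 8), dtype=np.uint8)
vertical = np.zeros((6, 8), dtype=np.uint8)
horizontal[1, :] = 255   # horizontal lines at y = 1 and y = 4
horizontal[4, :] = 255
vertical[:, 2] = 255     # vertical lines at x = 2 and x = 6
vertical[:, 6] = 255

# A pixel is an intersection only if it is white in both masks
intersections = np.bitwise_and(horizontal, vertical)

# np.where returns row indices (y) first and column indices (x) second
ys, xs = np.where(intersections > 0)
points = list(zip(xs.tolist(), ys.tolist()))
print(points)   # [(2, 1), (6, 1), (2, 4), (6, 4)]
```

In a real picture each line is several pixels thick, so every corner yields a small cluster of points rather than a single one.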

5. Filter the coordinates, counting neighboring points as one:

Note that a picture array is indexed as (y, x), with y first.

# Arrays of y and x coordinates
y_point_arr = []
x_point_arr = []
# Sort the coordinates and drop neighboring pixels, keeping only the last point of each group.
# The 10 is the pixel distance that separates two groups. It is not fixed and needs adjusting for each picture,
# depending on the cell height (jump in y) and the cell width (jump in x).
sort_x_point = np.sort(xs)
for i in range(len(sort_x_point) - 1):
    if sort_x_point[i + 1] - sort_x_point[i] > 10:
        x_point_arr.append(sort_x_point[i])
# The last point must be added as well
x_point_arr.append(sort_x_point[-1])

sort_y_point = np.sort(ys)
# print(np.sort(ys))
for i in range(len(sort_y_point) - 1):
    if sort_y_point[i + 1] - sort_y_point[i] > 10:
        y_point_arr.append(sort_y_point[i])
y_point_arr.append(sort_y_point[-1])
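
Since the same filtering is done twice, it can be factored into one helper. The sketch below is a possible refactoring, not the author's code; `collapse_close` and `min_gap` are names chosen for this example, and the default of 10 is the tutorial's value.

```python
def collapse_close(values, min_gap=10):
    """Sort `values` and keep only the last value of each run of close neighbours.

    Two consecutive sorted values belong to the same run when they differ
    by at most `min_gap`.
    """
    ordered = sorted(values)
    if not ordered:
        return []
    kept = [cur for cur, nxt in zip(ordered, ordered[1:]) if nxt - cur > min_gap]
    kept.append(ordered[-1])  # the final run has no successor, so add it explicitly
    return kept


if __name__ == "__main__":
    # Three thick lines around x = 100, x = 250 and x = 400
    print(collapse_close([101, 100, 102, 250, 251, 400]))   # [102, 251, 400]
```

With this helper the two loops reduce to `x_point_arr = collapse_close(xs)` and `y_point_arr = collapse_close(ys)`. Unlike the loop version, it returns an empty list instead of failing when no intersections were found.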

6. Cut out each cell picture by its coordinates, then use pytesseract to recognize the text. After the special characters are removed, data holds the processed values:

import re
import pytesseract

# Loop over the y and x coordinates to split the table into cells
data = [[] for _ in range(len(y_point_arr))]
for i in range(len(y_point_arr) - 1):
    for j in range(len(x_point_arr) - 1):
        # When slicing, the first index is the y range and the second is the x range.
        # Slice raw (the loaded picture), not src (the file path).
        cell = raw[y_point_arr[i]:y_point_arr[i + 1], x_point_arr[j]:x_point_arr[j + 1]]
        cv2.imshow("sub_pic" + str(i) + str(j), cell)

        # Read the text; tesseract defaults to English, so Simplified Chinese is requested explicitly
        # pytesseract.pytesseract.tesseract_cmd = 'E:/Tesseract-OCR/tesseract.exe'
        text1 = pytesseract.image_to_string(cell, lang="chi_sim+eng")

        # Remove special characters
        text1 = re.findall(r'[^\*"/:?\\|<>″′‖ 〈\n]', text1, re.S)
        text1 = "".join(text1)
        print('Cell text: ' + text1)
        data[i].append(text1)
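
The cleanup step can be tested on its own, without running OCR. The sketch below uses the same character set as the findall call above, but removes the unwanted characters with re.sub, which has the same effect; `clean_cell_text` is a name invented for this example.

```python
import re

# Same characters as in the tutorial's pattern, here matched directly (no leading ^)
_UNWANTED = re.compile(r'[\*"/:?\\|<>″′‖ 〈\n]')


def clean_cell_text(text):
    """Strip spaces, newlines and the listed special characters from OCR output."""
    return _UNWANTED.sub("", text)


if __name__ == "__main__":
    print(clean_cell_text('Na me:\n"A|B"'))   # NameAB
    print(clean_cell_text("姓 名\n"))          # 姓名
```

Note that spaces are removed too. That suits Chinese text, but it also merges English words in a cell, so the character set may need adjusting for mixed-language tables.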

7. Finally, write all the information into a csv:

path is the file path to write to, and data is the recognized data.

import csv

with open(path, "w", newline='') as csv_file:
    writer = csv.writer(csv_file, dialect='excel')
    for index, item in enumerate(data):
        # Skip the first row and the last entry (the last list in data is never filled)
        if index != 0 and index != len(data) - 1:
            # Write the first six columns of the row
            writer.writerow(item[:6])
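
A slightly more defensive variant of the writing step is sketched below. It is an illustration under stated assumptions, not the project's code: `write_rows` and `n_cols` are invented names, and padding short rows with empty strings is a design choice so that a cell the OCR step missed does not abort the export.

```python
import csv
import io


def write_rows(csv_file, data, n_cols=6):
    """Write data to an open file object, skipping the first and last entry as the tutorial does."""
    writer = csv.writer(csv_file, dialect="excel")
    for row in data[1:-1]:
        # Pad short rows with empty strings and cut long ones, so every line has n_cols fields
        padded = (list(row) + [""] * n_cols)[:n_cols]
        writer.writerow(padded)


if __name__ == "__main__":
    demo = [["title"], ["a", "b"], ["1", "2", "3", "4", "5", "6", "7"], []]
    buffer = io.StringIO(newline="")
    write_rows(buffer, demo)
    print(buffer.getvalue())
```

For a real file, something like `open(path, "w", newline="", encoding="utf-8-sig")` is worth trying: the BOM written by utf-8-sig generally lets Excel detect the encoding, which matters for Chinese text.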

Write table data:

Summary:

1. After finishing this, I roughly understand the principle of table recognition and have become more familiar with the opencv api. Here is the address of github: project address. Feel free to star or fork it.
2. This introduction only covers the general process, and there are still many pitfalls along the way. Running the github project directly may produce results different from mine, for two reasons.
First, you need to download the Chinese language data for tesseract. Second, some numbers and a few characters in this picture are not recognized out of the box, so
extra training data has to be added to tesseract manually. Adding that training data should get its own article later.
3. Now that table recognition is done, I plan to look at picture rectification and watermark removal, which complex table recognition will also need.

Reference materials:

1.https://github.com/tesseract-ocr/tesseract
2.https://pypi.org/project/pytesseract/
3.https://blog.csdn.net/muxiong0308/article/details/80969355
4.https://www.cnblogs.com/HL-space/p/10547259.html

Origin blog.csdn.net/sc9018181134/article/details/104577247