See how AI salvages broken documents

  • 1. What is unstructured data
  • 2. Unstructured Data Analysis
  • 3. Document image analysis and preprocessing
    • Correct graphics offset
    • Eliminate moiré
  • 4. Eliminate reflection
    • Reflective principle
    • Python method to eliminate image reflection
  • 5. Layout analysis and document restoration
    • 5.1 Physical Layout & Logical Layout
    • 5.2 Layout element inspection
    • 5.3 Document Restoration
    • 5.4 Application of Document Restoration
  • 6. Overall summary

1. What is unstructured data

Unstructured data is data without a fixed format or schema, such as text, pictures, video and audio. With the rapid development of information technology, unstructured data is becoming more and more important, for the following main reasons:

Growth of social media and digital content: As social media and digital content become more popular, the amount of unstructured data people generate in daily life keeps increasing. The photos, posts and comments people publish on social media, for example, are all unstructured data.

The advent of the big data era: Organizations and enterprises need to process and analyze more data to achieve their business goals, and unstructured data often contains useful information that can bring new opportunities and value.

Development of artificial intelligence and machine learning: Artificial intelligence and machine learning need large amounts of data for training, and unstructured data provides more diverse and realistic samples that help algorithms understand and predict trends and behavior.

Demand for more comprehensive data analysis: Because unstructured data contains richer information, it supports more complete analysis and helps organizations better understand their customers, markets and business.

2. Unstructured Data Analysis

For structured data, acquisition only requires ETL (extract > transform > load). Unstructured data is much harder to deal with. Why is it difficult? Tomato will show you an example.

Pitfalls of unstructured data collection:

  • Varied scenes and formats
  • Uncertain acquisition equipment
  • Diverse user needs
  • Severely degraded document image quality
  • Difficult text detection and layout analysis
  • Low recognition accuracy for unconstrained text
  • Weak structured, intelligent understanding
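
For contrast with the difficulties listed above, the structured case really does reduce to extract > transform > load. Here is a minimal sketch using only the Python standard library; the sample rows, the column names and the `orders` table are made up for illustration.

```python
import csv
import io
import sqlite3

# Hypothetical structured input: every row follows the same fixed schema.
RAW_CSV = """order_id,customer,amount
1,alice,19.90
2,bob,5.00
3,alice,12.50
"""

def extract(text):
    # Extract: parse the CSV text into a list of dictionaries.
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    # Transform: cast fields to proper types and normalize the names.
    return [(int(r["order_id"]), r["customer"].title(), float(r["amount"]))
            for r in rows]

def load(records, conn):
    # Load: write the cleaned records into a database table.
    conn.execute(
        "CREATE TABLE orders (order_id INTEGER, customer TEXT, amount REAL)")
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", records)
    conn.commit()

conn = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), conn)
total = conn.execute(
    "SELECT SUM(amount) FROM orders WHERE customer = 'Alice'").fetchone()[0]
print(total)
```

Nothing comparable exists for a crumpled, shadowed photo of a page: there is no schema to parse until the image itself has been repaired and understood.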

3. Document image analysis and preprocessing

Next, Tomato will share a practical case.

First of all, we got the picture on the left. It has several problems: bending, shadows, moiré, blur and so on. It is hard to read with the naked eye, let alone for a machine.

But don't panic, I have a solution. The detailed steps follow.

Correct graphics offset

For a deformed image, the algorithm calculates the offset, performs deformation correction, and finally fills the edges to obtain a repaired image.
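
To make the three steps (calculate the offset, correct it, fill the edge) concrete, here is a deliberately simplified one-dimensional sketch in NumPy. It assumes a bright page on a dark background whose rows have slid sideways by different amounts; real dewarping estimates a full two-dimensional displacement field, and the function names here are my own, not from the talk.

```python
import numpy as np

def estimate_row_offsets(gray, threshold=128):
    # Offset of each row = column of its first bright (page) pixel,
    # measured relative to the row whose page edge starts furthest left.
    is_page = gray > threshold
    valid = is_page.any(axis=1)
    first_bright = is_page.argmax(axis=1)
    base = first_bright[valid].min() if valid.any() else 0
    return np.where(valid, first_bright - base, 0)

def correct_offsets(gray, offsets):
    # Shift each row left by its offset. Clamping the source column
    # fills the vacated pixels with the row's edge value.
    rows, cols = gray.shape
    src_cols = np.arange(cols)[None, :] + offsets[:, None]
    src_cols = np.clip(src_cols, 0, cols - 1)
    return gray[np.arange(rows)[:, None], src_cols]

# Synthetic "bent" page: a 6-pixel-wide bright band that drifts sideways.
shifts = [0, 1, 2, 2, 1, 0]
bent = np.zeros((6, 12), dtype=np.uint8)
for r, s in enumerate(shifts):
    bent[r, 2 + s:8 + s] = 200

offsets = estimate_row_offsets(bent)
fixed = correct_offsets(bent, offsets)
print(offsets)
print(fixed)
```

After correction every row of the synthetic page lines up with the first one, which is the toy equivalent of a flattened sheet.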

Eliminate moiré

Moiré is caused by interference effects between the sensor array in an image capture device (such as a camera) and details in the object being photographed.

  • Background extraction module
  • Interference removal module
  • Information fusion module

In order to eliminate moiré, the following Python code can be used:

import cv2
import numpy as np

def remove_moire(image):
    # Convert the image to grayscale
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

    # Move to the frequency domain with a 2-D Fourier transform and
    # shift the zero frequency to the center of the spectrum
    f = np.fft.fft2(gray)
    fshift = np.fft.fftshift(f)

    # Build a Gaussian low-pass filter that attenuates high frequencies
    rows, cols = gray.shape
    crow, ccol = rows // 2, cols // 2  # center of the spectrum
    radius = 20  # the smaller the radius, the stronger the filtering
    i, j = np.ogrid[:rows, :cols]
    distance = (i - crow) ** 2 + (j - ccol) ** 2
    gauss_filter = np.exp(-distance / (2 * radius ** 2)).astype(np.float32)

    # Apply the Gaussian filter to the spectrum
    filtered_fshift = fshift * gauss_filter

    # Transform back to the spatial domain with the inverse transform
    filtered_f = np.fft.ifftshift(filtered_fshift)
    filtered_image = np.fft.ifft2(filtered_f)
    filtered_image = np.abs(filtered_image)

    # Clip before casting so out-of-range values cannot wrap around
    return np.clip(filtered_image, 0, 255).astype(np.uint8)

How to use:

image = cv2.imread('input_tomato.jpg')
if image is None:
    raise FileNotFoundError('input_tomato.jpg')
filtered_image = remove_moire(image)

cv2.imshow('Filtered Image', filtered_image)
cv2.waitKey(0)
cv2.destroyAllWindows()

input_tomato.jpg is the image file name to be processed. After running the code, the moiré-removed image will be displayed.

Of course, the above example only shows how moiré can be reduced with open-source tools. Open-source Python packages alone are not enough to match the result Hehe Information demonstrated at VALSE 2023.
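
Because moiré comes from interference, it tends to show up as a regular, periodic pattern, and a periodic pattern appears in the spectrum as a few isolated peaks away from the center. A common alternative to blurring everything is therefore a notch filter that removes only those peaks. The sketch below is a swapped-in textbook technique, not the method from the talk; `keep_radius` and `peak_factor` are illustrative parameters I chose, and the test image is synthetic.

```python
import numpy as np

def suppress_periodic_peaks(gray, keep_radius=4, peak_factor=20.0):
    # Shifted spectrum: the zero frequency sits at the center.
    spectrum = np.fft.fftshift(np.fft.fft2(gray.astype(np.float64)))
    magnitude = np.abs(spectrum)

    rows, cols = gray.shape
    y, x = np.ogrid[:rows, :cols]
    dist2 = (y - rows // 2) ** 2 + (x - cols // 2) ** 2
    outside = dist2 > keep_radius ** 2  # never touch the low frequencies

    # A peak is a coefficient far above the typical magnitude
    # found outside the protected low-frequency disc.
    typical = np.median(magnitude[outside])
    peaks = outside & (magnitude > peak_factor * max(typical, 1e-9))

    spectrum[peaks] = 0  # notch out the periodic components
    restored = np.fft.ifft2(np.fft.ifftshift(spectrum)).real
    restored = np.clip(np.rint(restored), 0, 255).astype(np.uint8)
    return restored, int(peaks.sum())

# Synthetic test: a noisy gray page plus a diagonal sinusoidal pattern.
rng = np.random.default_rng(0)
base = 128 + 20 * rng.standard_normal((64, 64))
yy, xx = np.mgrid[:64, :64]
pattern = 30 * np.sin(2 * np.pi * (10 * xx + 6 * yy) / 64)
noisy = base + pattern

restored, n_peaks = suppress_periodic_peaks(noisy)
err_before = np.abs(noisy - base).mean()
err_after = np.abs(restored.astype(np.float64) - base).mean()
print(n_peaks, err_before, err_after)
```

Unlike the Gaussian low-pass above, this leaves fine detail such as text strokes mostly alone, but it only helps when the interference really is periodic.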

4. Eliminate reflection

Reflective principle

When I was in elementary school and studied in the evenings, students, especially those sitting in the front row, might see the blackboard like this, with the lights reflected on it.

When a strong light source shines on a smooth surface, the photo usually turns out poorly. Still, don't panic: CamScanner (Scanning Almighty King) is one of the leading products of Hehe Information, so let's learn how they do it.

The principle of anti-reflection is to reduce or remove the reflective area through image enhancement. The open-source routine in this section only covers the geometric cleanup around that (straightening and cropping the page), in the following steps:

  1. Read an image and convert it to a grayscale image.
  2. Smooth the image using a Gaussian filter to remove noise.
  3. Detect edges (the code uses the Canny detector).
  4. For the detected edges, use the Hough transform to identify straight lines.
  5. Calculate the angle between each line and the horizontal, and rotate the image so the lines are horizontal again.
  6. Crop the rotated image to remove any black borders.

Python method to eliminate image reflection

The open-source method cannot match the professional result of Hehe Information's CamScanner (Scanning Almighty King).

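import cv2
import numpy as np

def remove_glare(image):
    # Note: this routine straightens and crops the page; it does not
    # change the brightness of the reflective pixels themselves.

    # Convert to grayscale
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

    # Smooth to suppress noise
    blurred = cv2.GaussianBlur(gray, (5, 5), 0)

    # Detect edges
    edges = cv2.Canny(blurred, 50, 200)

    # Detect straight lines
    lines = cv2.HoughLines(edges, 1, np.pi / 180, 100)
    if lines is None:
        return image  # no lines found, nothing to straighten

    # theta is the angle of each line's normal, so theta - 90 degrees is
    # the line's tilt from the horizontal; keep near-horizontal lines only
    angles = []
    for line in lines:
        rho, theta = line[0]
        angle = theta * 180 / np.pi - 90
        if abs(angle) < 45:
            angles.append(angle)
    median_angle = float(np.median(angles)) if angles else 0.0

    # Rotate about the center so those lines become horizontal
    h, w = image.shape[:2]
    matrix = cv2.getRotationMatrix2D((w / 2, h / 2), median_angle, 1.0)
    rotated = cv2.warpAffine(image, matrix, (w, h))

    # Crop away the black borders introduced by the rotation
    gray_rotated = cv2.cvtColor(rotated, cv2.COLOR_BGR2GRAY)
    _, thresh = cv2.threshold(gray_rotated, 1, 255, cv2.THRESH_BINARY)
    contours, _ = cv2.findContours(thresh, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    if not contours:
        return rotated
    x, y, w, h = cv2.boundingRect(max(contours, key=cv2.contourArea))
    cropped = rotated[y:y+h, x:x+w]

    return cropped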

The above is the open-source Python method I found while studying on my own. The effect is average, nowhere near as good as the Hehe Information demonstration.
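
For the reflective area itself, the crudest possible treatment is to mask near-saturated pixels and paint them with the typical brightness of the rest of the page. The sketch below does exactly that in NumPy and nothing more; it is my own illustration under the assumption that glare shows up as pixels at or above a fixed `threshold`, and it cannot bring back text that the highlight has already washed out.

```python
import numpy as np

def flatten_highlights(gray, threshold=240):
    # Pixels at or above the threshold are treated as glare.
    glare = gray >= threshold
    if glare.all() or not glare.any():
        return gray.copy(), glare
    # Replace glare pixels with the median brightness of the other pixels.
    fill = int(np.median(gray[~glare]))
    result = gray.copy()
    result[glare] = fill
    return result, glare

# Synthetic page: paper at 180, two dark text rows, one saturated patch.
page = np.full((40, 60), 180, dtype=np.uint8)
page[10, 5:55] = 20
page[20, 5:55] = 20
page[25:35, 30:50] = 255  # reflection of a lamp

result, glare = flatten_highlights(page)
print(int(glare.sum()), int(result.max()))
```

A production tool has to do far better than this, for example by reconstructing the content under the highlight, which is where the gap to the demonstrated result comes from.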

5. Layout analysis and document restoration

5.1 Physical Layout & Logical Layout

Here, Tomato first introduces two very important concepts in layout analysis.

  • Physical layout refers to the elements that actually appear on the page, such as text blocks, pictures and tables, including their location, shape, size, etc.;
  • Logical layout refers to the relationships established between these physical elements, such as their roles (title, paragraph, caption) and their reading order.

Simply put, the physical layout emphasizes the position and attributes of the elements, while the logical layout emphasizes the connections between them. By analyzing a document at these two levels, we can understand it more fully and approach its restoration from different angles.
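
One way to picture the difference in document terms is a record that carries both kinds of information for each region of a page. The sketch below is a hypothetical data structure of my own (the field names and the two-column page are made up): sorting by position gives the physical view, sorting by reading order gives the logical one, and on a two-column page the two disagree.

```python
from dataclasses import dataclass

@dataclass
class LayoutElement:
    # Physical layout: where the region sits on the page, in pixels.
    x: int
    y: int
    width: int
    height: int
    # Logical layout: what the region is and when it should be read.
    role: str
    reading_order: int
    text: str = ""

# A two-column page: the left column (A, C) is read before the right (B, D).
page = [
    LayoutElement(40, 100, 250, 280, "paragraph", 0, "A"),
    LayoutElement(320, 100, 250, 280, "paragraph", 2, "B"),
    LayoutElement(40, 400, 250, 280, "paragraph", 1, "C"),
    LayoutElement(320, 400, 250, 280, "paragraph", 3, "D"),
]

# Physical view: top to bottom, then left to right.
by_position = [e.text for e in sorted(page, key=lambda e: (e.y, e.x))]
# Logical view: the order a reader would follow.
by_reading = [e.text for e in sorted(page, key=lambda e: e.reading_order)]

print(by_position)
print(by_reading)
```

Restoring a document needs both: the positions to rebuild the page, and the reading order to produce text that still makes sense.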

5.2 Layout element inspection

To conduct layout analysis, the first thing to do is layout element detection: locating elements such as erroneous text, watermarks and QR codes.

5.3 Document Restoration

Using the AI layout analysis from the first two steps (physical layout analysis and logical layout analysis) together with the layout element detection, we can restore the document.

In the end, the content of the picture on the far right is restored to a Word or Excel file.

6. Overall summary

The complete process is divided into 6 steps: image input > document extraction > finger removal > moiré removal > deformation correction > image enhancement
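
Structurally, a chain like this is just an ordered list of stage functions applied one after another. The sketch below shows only that skeleton: every stage is a placeholder that records its name and passes the image through untouched, and `make_stage` and `run_pipeline` are hypothetical helpers of mine rather than anything from the talk.

```python
def make_stage(name):
    # Placeholder stage: records its name and passes the image through
    # unchanged. A real stage would transform state["image"].
    def stage(state):
        return {"image": state["image"],
                "history": state["history"] + [name]}
    return stage

STAGES = [make_stage(name) for name in (
    "document extraction",
    "finger removal",
    "moiré removal",
    "deformation correction",
    "image enhancement",
)]

def run_pipeline(image, stages=STAGES):
    # Image input is the first step; the stages then run in order.
    state = {"image": image, "history": ["image input"]}
    for stage in stages:
        state = stage(state)
    return state

result = run_pipeline("raw photo placeholder")
print(" > ".join(result["history"]))
```

Keeping the stages in a list makes it easy to reorder them or swap one implementation for another while experimenting.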

Friends who are interested in the algorithms behind this process can study them on their own.

If you want a powerful tool to rescue documents, search your app store for CamScanner (Scan Almighty King). Its core underlying principles are those described above, and CamScanner has topped the download charts for free efficiency apps on the App Store in 120 countries.

Image source: Wuxi Vision and Learning Young Scholars Symposium - Hehe Information Speech

Shanghai Hehe Information Technology Co., Ltd. is an industry-leading artificial intelligence and big data technology company. It is committed to providing enterprises and individuals worldwide with core technologies in intelligent text recognition and commercial big data, consumer-facing (C-end) and business-facing (B-end) products, and industry solutions, bringing users innovative digital and intelligent services.

Of course, there is much more to intelligent document processing than what is covered above. Thanks to Hehe Information for the wonderful talk on "language recognition and understanding" at the VALSE 2023 Wuxi Vision and Learning Young Scholars Symposium; Tomato was greatly inspired by it. The following picture shows what INTSIG (Hehe Information) shared at the conference.

Every stage of intelligent document processing holds many challenging and interesting problems. Let us explore them together ~

Origin blog.csdn.net/weixin_39032019/article/details/131332043