Crawling Dianping (Public Comment) Ratings - Image Text Recognition (OCR)

I didn't go out over the holiday: my wife had to work overtime, so I stayed in to keep her company.
In the evening she said she wanted some rating data from the reviews. I figured a Scrapy request would be enough and that it should be easy, so I agreed.
But it's not that simple. I won't go into the pages that require human verification; I don't think I can solve that. What interests me more is the way the site displays its scores.
Dianping renders them as images positioned with CSS offsets, so the usual selector approach doesn't work. I used Tesseract image text recognition instead.

The approximate process is as follows.

Crawl the page


Here is the code snippet that opens the page with Selenium and then takes a screenshot:


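from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait

opt = Options()
opt.add_argument('--headless')
# executable_path is the Selenium 3 style; Selenium 4 expects a Service object instead
self.driver = webdriver.Chrome(executable_path='/Users/xiangc/bin/chromedriver', options=opt)
self.wait = WebDriverWait(self.driver, 10)
self.driver.get('http://www.dianping.com/shop/4227604')
self.driver.save_screenshot('image{}.png'.format(url_id))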

Screenshot page

Cut out the required part

The code snippet is below. The crop coordinates are hardcoded, I'm ashamed to say.


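from PIL import Image
im = Image.open('image{}.png'.format(url_id))  # assumed: the screenshot saved in the previous step
cropped_img = im.crop((239, 500, 239 + 780, 500 + 63))  # (left, upper, right, lower)
cropped_img.save('crop{}.png'.format(url_id))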
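
The hardcoded box could instead be derived from the position of the target element on the page. Below is a minimal sketch, assuming the rectangle comes from dictionaries shaped like Selenium's WebElement.location and WebElement.size and that the element lies inside the captured viewport. The helper name crop_box and the scale parameter (for high-DPI screenshots) are hypothetical and not part of the original crawler.

```python
def crop_box(location, size, scale=1.0):
    """Return a (left, upper, right, lower) box suitable for PIL's Image.crop.

    location: dict with 'x' and 'y' keys (shape of WebElement.location)
    size: dict with 'width' and 'height' keys (shape of WebElement.size)
    scale: screenshot pixels per CSS pixel (assumed 1.0 unless high-DPI)
    """
    left = int(location['x'] * scale)
    upper = int(location['y'] * scale)
    right = int((location['x'] + size['width']) * scale)
    lower = int((location['y'] + size['height']) * scale)
    return left, upper, right, lower


# The hardcoded values from the snippet above, expressed as an element rectangle.
box = crop_box({'x': 239, 'y': 500}, {'width': 780, 'height': 63})
print(box)  # (239, 500, 1019, 563)
```

With a box computed this way, a layout change on the page would move the crop along with the element instead of silently cutting the wrong region.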

Image preprocessing

The preprocessing steps are as follows:

  • Remove noise: a non-white pixel with at most one other non-white pixel in its 3x3 neighborhood is treated as noise and set to white
  • Binarize: pixels whose average color value is above 200 become white, and all other pixels become black
  • Increase the image contrast

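import os

import numpy as np
from PIL import Image, ImageEnhance


def get_color(image, x, y):
    # Accept either a PIL Image or the pixel-access object returned by Image.load()
    if isinstance(image, Image.Image):
        r, g, b = image.getpixel((x, y))[:3]
    else:
        r, g, b = image[x, y]
    return r, g, b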


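def is_noise(image, x, y):
    # Count white pixels in the 3x3 window centered on (x, y);
    # neighbors that fall outside the image count as white.
    w, h = image.size
    white_count = 0
    for i in range(x - 1, x + 2):
        for j in range(y - 1, y + 2):
            if not (0 <= i < w and 0 <= j < h):
                white_count += 1
            elif get_color(image, i, j) == (255, 255, 255):
                white_count += 1
    return white_count >= 7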


def clear_noise(image, new_pixels):
    w, h = image.size
    clear_count = 0
    for i in range(w):
        for j in range(h):
            r, g, b = get_color(image, i, j)

            # Chained comparison: r != g and g != b, so white and gray pixels are never candidates
            if r != g != b and is_noise(image, i, j):
                clear_count += 1
                new_pixels[i, j] = (255, 255, 255)
            else:
                new_pixels[i, j] = (r, g, b)
    return clear_count

def clear_color(new_pixels, w, h):
    for i in range(w):
        for j in range(h):
            r, g, b = get_color(new_pixels, i, j)
            if np.average((r, g, b)) > 200:
                new_pixels[i, j] = (255, 255, 255)
            else:
                new_pixels[i, j] = (0, 0, 0)

def pre_image(full_path):
    image = Image.open(full_path)
    w, h = image.size
    new_image = Image.new('RGB', (w, h), 'white')
    new_pixels = new_image.load()

    # The first pass copies the source into new_image; later passes clean new_image in place.
    # Pass the Image (not the pixel-access object), because clear_noise reads image.size.
    clear_count = clear_noise(image, new_pixels)
    while clear_count > 0:
        clear_count = clear_noise(new_image, new_pixels)
        print(clear_count)
    clear_color(new_pixels, w, h)

    # Contrast enhancement
    enh_img = ImageEnhance.Contrast(new_image)
    contrast = 3
    image_contrasted = enh_img.enhance(contrast)

    dir_name = os.path.dirname(full_path)
    file_name = os.path.basename(full_path)
    new_file_path = os.path.join(dir_name, 'sharped' + file_name)
    image_contrasted.save(new_file_path)
    return new_file_path
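
To see the noise rule and the binarization step in isolation, here is a toy version on a small grid of RGB tuples, with no image library involved. It is only a sketch: the names white_neighbors, remove_noise and binarize are hypothetical, and it treats any pixel whose channels are not all equal as a noise candidate, which is a simplification of the r != g != b check in the code above.

```python
WHITE = (255, 255, 255)
BLACK = (0, 0, 0)


def white_neighbors(grid, x, y):
    """Count white cells in the 3x3 window around (x, y); off-grid cells count as white."""
    h, w = len(grid), len(grid[0])
    count = 0
    for j in range(y - 1, y + 2):
        for i in range(x - 1, x + 2):
            if not (0 <= i < w and 0 <= j < h) or grid[j][i] == WHITE:
                count += 1
    return count


def remove_noise(grid):
    """Whiten colored cells whose 3x3 window holds at least 7 white cells."""
    cleaned = [row[:] for row in grid]
    for y, row in enumerate(grid):
        for x, (r, g, b) in enumerate(row):
            colored = not (r == g == b)  # gray and white cells are left alone
            if colored and white_neighbors(grid, x, y) >= 7:
                cleaned[y][x] = WHITE
    return cleaned


def binarize(grid, threshold=200):
    """Map cells whose average channel value exceeds threshold to white, others to black."""
    return [[WHITE if sum(c) / 3 > threshold else BLACK for c in row] for row in grid]


R = (200, 30, 30)     # an isolated colored speck (noise)
G = (90, 90, 90)      # part of a dark gray stroke (text)
L = (230, 230, 230)   # a light gray pixel
grid = [
    [WHITE, WHITE, WHITE, WHITE, WHITE],
    [WHITE, R,     WHITE, G,     WHITE],
    [WHITE, WHITE, L,     G,     WHITE],
    [WHITE, WHITE, WHITE, G,     WHITE],
]
result = binarize(remove_noise(grid))
for row in result:
    print(''.join('#' if cell == BLACK else '.' for cell in row))
```

Restricting candidates to colored pixels matters here: the ends of a one-pixel-wide gray stroke also have seven white neighbors, so without that restriction repeated passes would erode thin strokes.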

Image text recognition

Text recognition uses Tesseract. Note that a character whitelist is added here to improve accuracy.
chi is a recognition model I trained myself; the training set is 10


import pytesseract

new_file_path = imgutils.pre_image('crop{}.png'.format(url_id))
result = pytesseract.image_to_string(
    image=new_file_path,
    lang='chi',
    config='--psm 7 --oem 3 -c tessedit_char_whitelist=0123456789评论服务:费用设施环境条.元'
)

Result

The recognition result is acceptable.
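
Once the line has been recognized, the string still has to be turned into numbers. The following is a minimal sketch of that step, assuming the OCR output is a single line of label:value pairs built from the whitelist characters. The sample string, the function name parse_scores and the exact layout are hypothetical; the real page may order or format the fields differently.

```python
import re

# Labels taken from the whitelist: 费用 (cost), 设施 (facilities), 环境 (environment), 服务 (service).
# Both the ASCII and the full-width colon are accepted.
FIELD_RE = re.compile(r'(费用|设施|环境|服务)\s*[:：]\s*(\d+(?:\.\d+)?)')
REVIEWS_RE = re.compile(r'(\d+)\s*条评论')


def parse_scores(text):
    """Extract the review count and the labeled numbers from one OCR'd line."""
    parsed = {label: float(value) for label, value in FIELD_RE.findall(text)}
    match = REVIEWS_RE.search(text)
    if match:
        parsed['评论'] = int(match.group(1))
    return parsed


# Hypothetical OCR output; the real line layout on the page may differ.
sample = '1024条评论 费用:85元 设施:8.1 环境:8.3 服务:7.9'
print(parse_scores(sample))
```

Fields that the OCR misreads simply drop out of the returned dictionary, so a missing key is a cheap signal that a shop should be re-screenshotted or checked by hand.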

Training helper scripts

Below is a collection of some scripts

  • Generate box file
  • Batch image processing
  • Batch training to generate the training result files
  • Batch image format conversion (png -> tiff)

They are all JS and Python scripts, and fairly simple~

gitee link

I won't release the crawler code. It's too ugly, and I don't have time to clean it up at the moment.
Because Python comments clash with Markdown markup, most comments were stripped from the code. I trust everyone can still follow it~


Origin http://43.154.161.224:23101/article/api/json?id=324034234&siteId=291194637