Crawling Dianping Review Scores - Image Text Recognition (OCR)
It was eleven and I didn't go out, because my wife had to work overtime, so I kept her company.
In the evening she said she wanted some rating data from the reviews. I figured a few Scrapy requests would do the job, so I agreed. It didn't seem difficult.
But it wasn't that simple. I won't go into the human verification problem; I don't think I can solve that one. What interested me more was the way the site displays its scores.
Dianping displays these scores as images positioned with CSS offsets, so the usual selectors don't work. I used Tesseract image text recognition instead. The approximate process is as follows.
Crawl the page
Here is the code snippet for page access using Selenium and then taking a screenshot
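from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait

# Headless Chrome; executable_path works in Selenium 3, Selenium 4 deprecates it in favour of a Service object
opt = Options()
opt.add_argument('--headless')
self.driver = webdriver.Chrome(executable_path='/Users/xiangc/bin/chromedriver', options=opt)
self.wait = WebDriverWait(self.driver, 10)
self.driver.get('http://www.dianping.com/shop/4227604')
self.driver.save_screenshot('image{}.png'.format(url_id))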
Screenshot page
Cut out the required part
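The code snippet is as follows. The crop coordinates are hardcoded, I'm ashamed to say.
from PIL import Image
im = Image.open('image{}.png'.format(url_id))  # assumed: the screenshot saved in the previous step
cropped_img = im.crop((239, 500, 239 + 780, 500 + 63))  # box is (left, upper, right, lower)
cropped_img.save('crop{}.png'.format(url_id))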
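The hardcoded box can also be derived from a position and a size. The following is only a minimal sketch: crop_box and its scale parameter are names I made up, and the dictionaries mimic the shape of Selenium's WebElement.location and WebElement.size values, on the assumption that the score element can be located at all.

```python
def crop_box(location, size, scale=1.0):
    """Turn a top-left corner and dimensions into a (left, upper, right, lower) box.

    scale is a hypothetical knob for screenshots taken at a device pixel
    ratio other than 1, where CSS pixels and image pixels differ.
    """
    left = int(location['x'] * scale)
    upper = int(location['y'] * scale)
    right = int((location['x'] + size['width']) * scale)
    lower = int((location['y'] + size['height']) * scale)
    return left, upper, right, lower


# Same numbers as the hardcoded crop above
box = crop_box({'x': 239, 'y': 500}, {'width': 780, 'height': 63})
print(box)  # (239, 500, 1019, 563)
```

The resulting tuple has the shape that PIL's Image.crop expects, so it could be passed in place of the literal coordinates.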
Image preprocessing
The preprocessing steps are as follows:
- Remove noise: if at most one of the points surrounding a point is non-white, the point is noise and is removed
- Binarize: points whose average color value is greater than 200 become white, and all remaining points become black
- Increase the image contrast
import os

import numpy as np
from PIL import Image, ImageEnhance

WHITE = (255, 255, 255)


def get_color(image, x, y):
    # Accept either a PIL image or the pixel-access object returned by Image.load()
    if isinstance(image, Image.Image):
        r, g, b = image.getpixel((x, y))[:3]
    else:
        r, g, b = image[x, y][:3]
    return r, g, b


def is_noise(image, x, y, w, h):
    # Count white points among the 8 neighbours; points outside the image count as white
    white_count = 0
    for i in range(x - 1, x + 2):
        for j in range(y - 1, y + 2):
            if (i, j) == (x, y):
                continue
            if not (0 <= i < w and 0 <= j < h) or get_color(image, i, j) == WHITE:
                white_count += 1
    return white_count >= 7


def clear_noise(image, new_pixels, w, h):
    # Copy image into new_pixels, whitening isolated non-white points; return how many were cleared
    clear_count = 0
    for i in range(w):
        for j in range(h):
            r, g, b = get_color(image, i, j)
            if (r, g, b) != WHITE and is_noise(image, i, j, w, h):
                clear_count += 1
                new_pixels[i, j] = WHITE
            else:
                new_pixels[i, j] = (r, g, b)
    return clear_count


def clear_color(new_pixels, w, h):
    # Binarize: points with a mean channel value above 200 become white, the rest black
    for i in range(w):
        for j in range(h):
            r, g, b = get_color(new_pixels, i, j)
            if np.average((r, g, b)) > 200:
                new_pixels[i, j] = WHITE
            else:
                new_pixels[i, j] = (0, 0, 0)


def pre_image(full_path):
    image = Image.open(full_path).convert('RGB')
    w, h = image.size
    new_image = Image.new('RGB', (w, h), 'white')
    new_pixels = new_image.load()
    # Keep running passes until a pass clears nothing
    clear_count = clear_noise(image, new_pixels, w, h)
    while clear_count > 0:
        clear_count = clear_noise(new_pixels, new_pixels, w, h)
    clear_color(new_pixels, w, h)
    # Enhance contrast
    enh_img = ImageEnhance.Contrast(new_image)
    contrast = 3
    image_contrasted = enh_img.enhance(contrast)
    dir_name = os.path.dirname(full_path)
    file_name = os.path.basename(full_path)
    new_file_path = os.path.join(dir_name, 'sharped' + file_name)
    image_contrasted.save(new_file_path)
    return new_file_path
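To look at the noise rule on its own, here is a minimal pure-Python sketch on a tiny grayscale grid. The names isolated and remove_noise and the sample grid are invented for illustration; this is not the code above, which works on PIL pixels, only the same "at most one non-white neighbour" idea.

```python
WHITE = 255
BLACK = 0


def isolated(grid, x, y):
    """True if at most one of the 8 neighbours of (x, y) is non-white."""
    h, w = len(grid), len(grid[0])
    dark = 0
    for j in range(y - 1, y + 2):
        for i in range(x - 1, x + 2):
            inside = 0 <= i < w and 0 <= j < h
            if (i, j) != (x, y) and inside and grid[j][i] != WHITE:
                dark += 1
    return dark <= 1


def remove_noise(grid):
    """Whiten isolated non-white points, repeating until a pass changes nothing."""
    grid = [row[:] for row in grid]  # work on a copy
    changed = True
    while changed:
        changed = False
        for y, row in enumerate(grid):
            for x, value in enumerate(row):
                if value != WHITE and isolated(grid, x, y):
                    grid[y][x] = WHITE
                    changed = True
    return grid


W, B = WHITE, BLACK
noisy = [
    [W, W, W, W, W, W],
    [W, B, B, W, W, W],  # a 2x2 block stands in for a thick glyph stroke
    [W, B, B, W, W, B],  # the lone point on the right is the noise
    [W, W, W, W, W, W],
]
cleaned = remove_noise(noisy)
for row in cleaned:
    print(''.join('#' if v != WHITE else '.' for v in row))
```

One thing this sketch makes visible: the end of a one-pixel-wide line also has only one non-white neighbour, so repeated passes eat such a line away entirely. The rule is a better fit for thick strokes like the ones in the score image.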
Image text recognition
Text recognition uses Tesseract. Note that a character whitelist is added here to improve accuracy.
The lang value chi refers to a recognition model that I trained myself, and the training set is 10
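import pytesseract
import imgutils  # local module holding pre_image from the previous step

new_file_path = imgutils.pre_image('crop{}.png'.format(url_id))
result = pytesseract.image_to_string(
    image=new_file_path,
    lang='chi',
    # --psm 7 treats the image as a single text line; the whitelist limits the output characters
    config='--psm 7 --oem 3 -c tessedit_char_whitelist=0123456789评论服务:费用设施环境条.元'
)
print(result)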
The result is okay.
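Once the text is recognized, it still has to be turned into numbers. Below is a minimal sketch of that step. The label list is my guess from the characters in the whitelist, the sample string is made up rather than real OCR output, and parse_scores is a hypothetical helper, so adjust all three to what the page actually produces.

```python
import re

# Labels guessed from the whitelist characters; an assumption, not taken from the page
LABELS = ('服务', '费用', '设施', '环境')
PAIR = re.compile(r'({})\s*:\s*(\d+(?:\.\d+)?)'.format('|'.join(LABELS)))


def parse_scores(text):
    """Return {label: number} for every 'label:number' pair found in text."""
    return {label: float(value) for label, value in PAIR.findall(text)}


sample = '服务:8.9 环境:9.1 设施:8.7'  # made-up OCR output, for illustration only
scores = parse_scores(sample)
print(scores)
```

Matching only known labels keeps stray characters, such as a misread unit, from being glued onto a label name.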
Training helper scripts
Below is a collection of some scripts
- Generate box files
- Batch image processing
- Batch training to generate the training result files
- Batch image format conversion png->tiff
They are all JS and Python scripts, and relatively simple~
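As a rough idea of what the box file step can look like, here is a sketch that only builds the command lines rather than running them. It assumes Tesseract's classic makebox workflow and its lang.font.expN file naming; the font name dianping and the helper makebox_command are invented for this example, and my own scripts may well differ.

```python
def makebox_command(lang, font, index, tesseract='tesseract'):
    """Build the argv for generating one box file from one training image.

    Follows the lang.font.expN naming used by Tesseract's training tools.
    """
    base = '{}.{}.exp{}'.format(lang, font, index)
    return [tesseract, base + '.tif', base, 'batch.nochop', 'makebox']


# 'dianping' is a made-up font name; three images are an arbitrary example count
commands = [makebox_command('chi', 'dianping', i) for i in range(3)]
for cmd in commands:
    print(' '.join(cmd))
```

Each list can be handed to subprocess.run once the matching .tif files exist in the working directory.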
I won't release the crawler code. It's too ugly as written, and I don't have time to clean it up at the moment.
Because Python comments clash with Markdown markup, most of the original comments were stripped from the code. I trust everyone can still follow it~