Simple CAPTCHA handling for web crawlers

  Text recognition is a branch of machine vision. This article shows how to use a few Python libraries to recognize and use the text in online images. Converting an image into text is generally called Optical Character Recognition (OCR). Few low-level libraries implement OCR, so most higher-level libraries wrap or customize one of a handful of common OCR engines. Here we focus on just one: Tesseract.

  Tesseract is an OCR library currently sponsored by Google (a company well known for its OCR and machine learning work). It is widely regarded as one of the most accurate open source OCR systems. Besides its accuracy, Tesseract is also flexible: it can be trained to recognize any font and any Unicode character.

Windows system

Download the installer from https://code.google.com/p/tesseract-ocr/downloads/list and run it.

Linux system

It can be installed via apt-get: $ sudo apt-get install tesseract-ocr
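
Whichever system you are on, it can help to confirm that the install worked before going further. The following is a minimal sketch, assuming the tesseract executable was added to your PATH; the helper names find_tesseract and tesseract_version are made up for this illustration. It merges stderr into stdout because some Tesseract versions print their version banner to stderr.

```python
import shutil
import subprocess


def find_tesseract():
    """Return the full path of the tesseract executable, or None if it is not on PATH."""
    return shutil.which("tesseract")


def tesseract_version():
    """Return the first line printed by `tesseract --version`, or None if unavailable."""
    path = find_tesseract()
    if path is None:
        return None
    result = subprocess.run(
        [path, "--version"],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # some versions write the banner to stderr
        universal_newlines=True,
    )
    lines = result.stdout.splitlines()
    return lines[0] if lines else None


if __name__ == "__main__":
    print(tesseract_version() or "tesseract was not found on PATH")
```

If this prints the "not found" message on Windows, the install directory probably still needs to be added to the PATH environment variable.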

Tesseract is a command-line tool, not a Python library that you load with an import statement. After installation, you run the tesseract command outside of Python. To call it from Python, install the pytesseract wrapper via pip:

pip install pytesseract

import pytesseract
from PIL import Image

image = Image.open('test.jpg')
text = pytesseract.image_to_string(image)
print(text)

Cleaning up the image first makes it easier for Tesseract to read. With the PIL library, a Python script can apply a threshold filter that removes the gradient background color and leaves only the text:

from PIL import Image
import subprocess

def cleanFile(filePath, newFilePath):
    image = Image.open(filePath)

    # Apply a threshold filter (values below 143 become black, all others white)
    image = image.point(lambda x: 0 if x < 143 else 255)
    # Save the cleaned image
    image.save(newFilePath)

    # Call the system's tesseract command to run OCR on the image
    subprocess.call(["tesseract", newFilePath, "output"])

    # Open the output file and read the result
    with open("output.txt", 'r') as f:
        print(f.read())

if __name__ == "__main__":
    cleanFile("text2.png", "text2clean.png")
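
To see what the threshold filter does to individual pixel values, here is a small pure-Python sketch that needs no image library. The threshold function and the sample row are hypothetical and only illustrate the idea; the cutoff of 143 is the one used in the script above, and the best value will depend on the image.

```python
def threshold(pixels, cutoff=143):
    """Map each grayscale value (0-255) to pure black (0) or pure white (255)."""
    return [0 if value < cutoff else 255 for value in pixels]


# One row of grayscale pixels: dark text values mixed with a lighter background
row = [12, 90, 142, 143, 200, 255]
print(threshold(row))  # [0, 0, 0, 255, 255, 255]
```

Every value below the cutoff collapses to black and everything else to white, so a soft gradient background disappears as long as it is lighter than the text.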

Grab text from website images

The following program grabs the text from images on a website: it opens the reader, collects the URLs of the page images, downloads each image, runs OCR on it, and finally prints the text of each image. Because the program is fairly complex and reuses several code fragments from previous chapters, I've added comments to make the purpose of each part clearer:

import time
from urllib.request import urlretrieve
import subprocess
from selenium import webdriver

# Create a new Selenium driver
driver = webdriver.PhantomJS()

# Try Chrome with Selenium:
# driver = webdriver.Chrome()

driver.get("http://www.amazon.com/War-Peace-Leo-Nikolayevich-Tolstoy/dp/1427030200")
# Click the book preview button
driver.find_element_by_id("sitbLogoImg").click()
imageList = set()
# Wait for the page to finish loading
time.sleep(5)
# While the right arrow can be clicked, keep turning pages
while "pointer" in driver.find_element_by_id("sitbReaderRightPageTurner").get_attribute("style"):
    driver.find_element_by_id("sitbReaderRightPageTurner").click()
    time.sleep(2)
    # Get the newly loaded pages (several pages can load at once, but a set never stores duplicates)
    pages = driver.find_elements_by_xpath("//div[@class='pageImage']/div/img")
    for page in pages:
        image = page.get_attribute("src")
        imageList.add(image)
driver.quit()

# Use Tesseract to process the image URLs we collected
for image in sorted(imageList):
    # Save the image
    urlretrieve(image, "page.jpg")
    p = subprocess.Popen(["tesseract", "page.jpg", "page"], stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    # Wait for Tesseract to finish before reading its output file
    p.communicate()
    with open("page.txt", "r") as f:
        print(f.read())
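
Tesseract's output file is only safe to read once the process has exited. As a minimal sketch of that step, the hypothetical helpers below build the command line and use subprocess.run, which blocks until the command finishes. The names build_tesseract_command and ocr_image are made up for this illustration; the optional -l flag selects the recognition language and assumes that language's data is installed.

```python
import subprocess


def build_tesseract_command(image_path, output_base, lang=None):
    """Build the argument list for: tesseract <image> <outputbase> [-l <lang>]."""
    command = ["tesseract", image_path, output_base]
    if lang is not None:
        command += ["-l", lang]
    return command


def ocr_image(image_path, output_base, lang=None):
    """Run Tesseract on image_path and return the recognized text."""
    # subprocess.run returns only after the command has finished;
    # check=True raises CalledProcessError if Tesseract reports a failure.
    subprocess.run(
        build_tesseract_command(image_path, output_base, lang),
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        check=True,
    )
    # Tesseract appends ".txt" to the output base name
    with open(output_base + ".txt", "r", encoding="utf-8") as f:
        return f.read()
```

Inside the download loop this could be used as text = ocr_image("page.jpg", "page"), with the result printed or stored per page.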
