Captcha recognition 01 - recognizing image captchas
1. Preparations
- Download and install tesseract
- When the download finishes, double-click the installer. You can tick the "Additional language data (download)" option to install the OCR language packs, so that tesseract can recognize multiple languages
- Configure the tesseract environment variable
- Make the tesseract language packs visible through an environment variable: create a new system variable named TESSDATA_PREFIX. tessdata is the folder that holds the language packs, and it usually sits in the directory where tesseract was installed, so the installation directory is the parent directory of tessdata. Set TESSDATA_PREFIX to that installation directory
- Install tesserocr with pip. Note that pip install tesserocr almost always fails on Windows; in that case, download from GitHub the tesserocr .whl file that matches the installed tesseract version and install that file with pip instead
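Before moving on, it can help to confirm that TESSDATA_PREFIX points where the steps above expect. The following is a minimal sketch using only the standard library. The helper name `find_tessdata` is hypothetical, and it assumes the layout described above, where TESSDATA_PREFIX is the parent directory of the tessdata folder.

```python
import os


def find_tessdata(env=None):
    """Return the tessdata folder under TESSDATA_PREFIX, or None if it is missing.

    Hypothetical helper: assumes TESSDATA_PREFIX is the parent of tessdata.
    """
    env = os.environ if env is None else env
    prefix = env.get('TESSDATA_PREFIX')
    if not prefix:
        return None  # the variable is not set
    tessdata = os.path.join(prefix, 'tessdata')
    return tessdata if os.path.isdir(tessdata) else None


if __name__ == '__main__':
    folder = find_tessdata()
    if folder is None:
        print('TESSDATA_PREFIX is not set or has no tessdata folder')
    else:
        # .traineddata files are the language packs tesseract loads
        packs = [n for n in os.listdir(folder) if n.endswith('.traineddata')]
        print(f'{len(packs)} language pack(s) found in {folder}')
```

If the script reports no tessdata folder, recheck the value of the variable and open a new terminal so the change is picked up.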
2. Get the captcha
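    import os
    import requests
    from uuid import uuid4
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    browser = webdriver.Firefox()
    browser.get('http://my.cnki.net/elibregister/commonRegister.aspx')
    browser.implicitly_wait(2)  # wait up to 2 seconds when locating elements
    os.makedirs('picture', exist_ok=True)  # do not fail if the folder already exists
    for i in range(5):
        # find_element(By.XPATH, ...) replaces find_element_by_xpath,
        # which newer Selenium 4 releases no longer provide
        image = browser.find_element(By.XPATH, '//*[@id="checkcode"]')
        image_url = image.get_attribute('src')
        image_content = requests.get(image_url).content  # download the captcha image
        image_path = os.path.join('picture', f'{uuid4()}.jpg')  # unique file name
        with open(image_path, 'wb') as f:
            f.write(image_content)
        image.click()  # click the captcha to load a new one
        browser.implicitly_wait(2)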
3. Test the recognition
    import tesserocr
    from PIL import Image

    image = Image.open('picture/1.jpg')
    result = tesserocr.image_to_text(image)  # convert the Image object to text
    print(result)
    print(tesserocr.file_to_text('picture/1.jpg'))  # convert the image file to text
4. Process the captcha
Convert the image to grayscale, then binarize it.
    image = image.convert('L')  # convert the picture to a grayscale image
    image.show()
    image = image.convert('1')  # binarize the image
    image.show()
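To make the two conversions concrete, here is a rough pure-Python sketch of what happens to a single pixel. It assumes the luma formula that Pillow documents for mode 'L' (L = R * 299/1000 + G * 587/1000 + B * 114/1000) and a plain cutoff at 127. The function names are hypothetical, and Pillow's own rounding and its default dithering for mode '1' may give slightly different results on real images.

```python
def to_gray(r, g, b):
    """Approximate Pillow's 'L' conversion for one RGB pixel (integer arithmetic)."""
    return (r * 299 + g * 587 + b * 114) // 1000


def to_binary(gray, threshold=127):
    """Map a gray level to 0 (black) or 255 (white) using a fixed cutoff."""
    return 255 if gray > threshold else 0


# A few sample RGB pixels: white, black, a dark red and a dark blue
pixels = [(255, 255, 255), (0, 0, 0), (200, 30, 30), (40, 40, 200)]
for rgb in pixels:
    gray = to_gray(*rgb)
    print(rgb, '->', gray, '->', to_binary(gray))
```

Colored captcha characters usually end up as dark gray levels, which is why they survive binarization as black while a light background turns white.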
We can also choose the binarization threshold ourselves. The method above uses the default threshold of 127. A custom threshold cannot be applied to the original picture directly, though: first convert the original image to grayscale, then binarize it with the chosen threshold.
    import tesserocr
    from PIL import Image

    image = Image.open('picture/2.jpg')
    image = image.convert('L')  # grayscale first
    threshold = 105  # the smaller the value, the fewer black pixels remain and the more of the picture is blank
    table = []
    for i in range(256):
        if i < threshold:
            table.append(0)  # darker than the threshold: black
        else:
            table.append(1)  # otherwise: white
    image = image.point(table, '1')  # binarize with the lookup table
    image.show()
    result = tesserocr.image_to_text(image)
    print(result)
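The lookup table used for thresholding has 256 entries, one per gray level, and image.point(table, '1') replaces every pixel with the entry found at its gray value. Below is a minimal sketch of that idea without Pillow, applied to a made-up row of gray values. The helper name `build_table` and the sample data are illustrative only.

```python
def build_table(threshold):
    """Return a 256-entry table: 0 (black) below the threshold, 1 (white) otherwise."""
    return [0 if level < threshold else 1 for level in range(256)]


row = [12, 90, 104, 105, 180, 250]  # made-up gray values for one image row

for threshold in (105, 200):
    table = build_table(threshold)
    # Look up each pixel's gray value in the table, as image.point would
    binary = [table[value] for value in row]
    print(threshold, binary)
```

Raising the threshold turns more pixels black and lowering it leaves more of the picture white, so the value is worth tuning for each style of captcha.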