Verification code processing
Learning objectives
- Understand the basics of verification codes (CAPTCHAs)
- Master the use of an image recognition engine
- Get to know common coding platforms (captcha-solving services)
- Master the method of processing verification codes through a coding platform
1. Picture verification code
1.1 What is an image verification code
- CAPTCHA is the abbreviation of "Completely Automated Public Turing test to tell Computers and Humans Apart". It is a public, fully automated test that distinguishes whether a user is a computer or a human.
1.2 The role of verification code
- A verification code prevents malicious password cracking, vote rigging, forum spamming, and page flooding. In particular, it stops an attacker from using a program to make continuous brute-force login attempts against a specific registered user. Verification codes are a common measure on many websites (such as China Merchants Bank's online personal banking and the Baidu community) because they provide this protection in a relatively simple way. Although they make logging in slightly more troublesome, the protection is still necessary and important for users' password security.
1.3 Use scenarios of image verification codes in crawlers
- Registration
- Login
- When requests are sent too frequently, the server responds with a verification code that must be solved
1.4 Image verification code processing scheme
- Manual input (input): this method only suits sites where logging in once is enough to keep working afterwards
- Image recognition engine: an optical character recognition engine processes the data in the picture. At present it is mostly used to extract data from images and less often to process verification codes
- Coding platform: the verification code solution most commonly used by crawlers
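The manual-input scheme can be sketched as follows: save the verification code image to disk so a person can look at it, then read the answer from the keyboard. The function name `solve_manually`, the default file name, and the injectable `prompt` parameter are assumptions made for this illustration, not part of any library.

```python
from pathlib import Path


def solve_manually(image_bytes, save_path="captcha.jpg", prompt=input):
    """Save the captcha image, ask a human to read it, and return the answer.

    `prompt` defaults to the built-in input(); it is a parameter only so the
    function can be exercised without a keyboard.
    """
    path = Path(save_path)
    path.write_bytes(image_bytes)  # the person opens this file in an image viewer
    answer = prompt("Open %s and type the code you see: " % path)
    return answer.strip()
```

This only scales to a handful of logins, which is why it fits the "log in once, then keep working" case.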
2. Picture recognition engine
OCR (Optical Character Recognition) refers to capturing printed text as image files with a scanner or digital camera, then analyzing and processing those image files to automatically recognize the text and layout information they contain.
2.1 What is tesseract
- Tesseract is an OCR engine originally developed by HP Labs and later maintained with Google's support. It is open source, free, multi-language, and multi-platform.
- Project address: https://github.com/tesseract-ocr/tesseract
2.2 Installation of image recognition engine environment
1 Engine installation
- On macOS, run the command directly (older guides add the --with-training-tools option, which current Homebrew core formulae no longer accept):
brew install tesseract
- On Windows, install from the exe installation package; the download address can be found on the wiki of the GitHub project. After the installation is complete, add the directory that contains the Tesseract executable to PATH so that later calls can find it.
- On Linux (Debian/Ubuntu):
sudo apt-get install tesseract-ocr
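After installing the engine, it can help to confirm that the executable is actually reachable from Python before going further. A minimal sketch using only the standard library; the helper name `find_tesseract` is hypothetical.

```python
import shutil


def find_tesseract(command="tesseract"):
    """Return the full path of the Tesseract executable, or None if it is not on PATH."""
    return shutil.which(command)


path = find_tesseract()
if path is None:
    print("tesseract was not found on PATH; check the installation step above")
else:
    print("tesseract found at", path)
```

If the executable lives outside PATH (common on Windows), pytesseract can also be pointed at it explicitly by setting `pytesseract.pytesseract.tesseract_cmd` to the full path.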
2 Python library installation
# Pillow (PIL) is used to open image files
pip install Pillow  # or: pip3 install Pillow
# pytesseract is used to parse data out of the picture
pip install pytesseract  # or: pip3 install pytesseract
2.3 Use of image recognition engine
The image_to_string method of the pytesseract module extracts the text in an opened image file as a string. It is used as follows:
from PIL import Image
import pytesseract
im = Image.open('captcha.png')  # hypothetical path; point this at your own image file
result = pytesseract.image_to_string(im)
print(result)
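The string returned by image_to_string may carry trailing newlines, spaces, or stray punctuation that are not part of the verification code. A minimal cleanup sketch, assuming the code consists only of ASCII letters and digits; the helper name `clean_ocr_text` is hypothetical, and the assumption will not hold for every site.

```python
import re


def clean_ocr_text(raw):
    """Keep only ASCII letters and digits from raw OCR output."""
    return re.sub(r"[^0-9A-Za-z]", "", raw)


print(clean_ocr_text(" a B3 d\n\x0c"))  # prints: aB3d
```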
2.4 Expansion of the use of image recognition engine
- Other OCR platforms
Microsoft Azure image recognition: https://azure.microsoft.com/zh-cn/services/cognitive-services/computer-vision/
Youdao Zhiyun text recognition: http://aidemo.youdao.com/ocrdemo
Alibaba Cloud text recognition: https://www.aliyun.com/product/cdi/
Tencent OCR text recognition: https://cloud.tencent.com/product/ocr
3 Coding platform
Let's take Yundama ("cloud coding") as an example to understand how to use a coding platform
3.1 Official interface of cloud coding
The following code is provided by the cloud coding platform. It has been lightly modified to expose two functions:
- indetify: pass in the binary content of the image response to have it recognized
- indetify_by_filepath: pass in the path of the image file to have it recognized
The values you need to configure yourself are:
username = 'whoarewe'  # username
password = '***'  # password
appid = 4283  # appid
appkey = '02074c64f0d0bb9efb2df455537b01c3'  # appkey
codetype = 1004  # verification code type
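Instead of hard-coding the account details, one option is to read them from environment variables so that the password and appkey stay out of the source file. This is a sketch only: the variable names (YDM_USERNAME and so on) and the helper `load_ydm_config` are invented for this example and are not defined by the platform.

```python
import os


def load_ydm_config(env=os.environ):
    """Build the configuration values from environment variables (names are hypothetical)."""
    return {
        "username": env["YDM_USERNAME"],
        "password": env["YDM_PASSWORD"],
        "appid": int(env["YDM_APPID"]),
        "appkey": env["YDM_APPKEY"],
        # fall back to the 1004 type used in this section when no type is given
        "codetype": int(env.get("YDM_CODETYPE", "1004")),
    }
```

A missing required variable raises KeyError, which surfaces a configuration mistake before any request is sent.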
The official API provided by cloud coding is as follows:
# yundama.py
import json
import time

import requests


class YDMHttp:
    apiurl = 'http://api.yundama.com/api.php'
    username = ''
    password = ''
    appid = ''
    appkey = ''

    def __init__(self, username, password, appid, appkey):
        self.username = username
        self.password = password
        self.appid = str(appid)
        self.appkey = appkey

    def request(self, fields, files=None):
        # POST the form fields (and optional file) and decode the JSON reply
        response = self.post_url(self.apiurl, fields, files)
        response = json.loads(response)
        return response

    def balance(self):
        data = {'method': 'balance', 'username': self.username, 'password': self.password, 'appid': self.appid,
                'appkey': self.appkey}
        response = self.request(data)
        if response:
            # a negative 'ret' value is an error code
            if response['ret'] and response['ret'] < 0:
                return response['ret']
            else:
                return response['balance']
        else:
            return -9001

    def login(self):
        data = {'method': 'login', 'username': self.username, 'password': self.password, 'appid': self.appid,
                'appkey': self.appkey}
        response = self.request(data)
        if response:
            if response['ret'] and response['ret'] < 0:
                return response['ret']
            else:
                return response['uid']
        else:
            return -9001

    def upload(self, filename, codetype, timeout):
        # despite its name, 'filename' holds the image bytes (see post_url)
        data = {'method': 'upload', 'username': self.username, 'password': self.password, 'appid': self.appid,
                'appkey': self.appkey, 'codetype': str(codetype), 'timeout': str(timeout)}
        file = {'file': filename}
        response = self.request(data, file)
        if response:
            if response['ret'] and response['ret'] < 0:
                return response['ret']
            else:
                return response['cid']
        else:
            return -9001

    def result(self, cid):
        data = {'method': 'result', 'username': self.username, 'password': self.password, 'appid': self.appid,
                'appkey': self.appkey, 'cid': str(cid)}
        response = self.request(data)
        return response and response['text'] or ''

    def decode(self, filename, codetype, timeout):
        # upload the image, then poll once per second until a result arrives
        cid = self.upload(filename, codetype, timeout)
        if cid > 0:
            for i in range(0, timeout):
                result = self.result(cid)
                if result != '':
                    return cid, result
                else:
                    time.sleep(1)
            return -3003, ''
        else:
            return cid, ''

    def post_url(self, url, fields, files=None):
        # The original version opened each path here; it is disabled so that
        # image bytes can be passed in directly:
        # for key in files:
        #     files[key] = open(files[key], 'rb')
        res = requests.post(url, files=files, data=fields)
        return res.text
username = 'whoarewe'  # username
password = '***'  # password
appid = 4283  # appid
appkey = '02074c64f0d0bb9efb2df455537b01c3'  # appkey
filename = 'getimage.jpg'  # file location
codetype = 1004  # verification code type
timeout = 60  # timeout in seconds
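def indetify(response_content):
    """Recognize a verification code from the raw bytes of an image response."""
    if username == 'username':
        print('Please set the relevant parameters before testing')
    else:
        # initialize
        yundama = YDMHttp(username, password, appid, appkey)
        # log in to Yundama
        uid = yundama.login()
        print('uid: %s' % uid)
        # query the balance
        balance = yundama.balance()
        print('balance: %s' % balance)
        # start recognition: image bytes, verification code type ID, timeout (seconds)
        cid, result = yundama.decode(response_content, codetype, timeout)
        print('cid: %s, result: %s' % (cid, result))
        return result


def indetify_by_filepath(file_path):
    """Recognize a verification code stored in an image file."""
    if username == 'username':
        print('Please set the relevant parameters before testing')
    else:
        # initialize
        yundama = YDMHttp(username, password, appid, appkey)
        # log in to Yundama
        uid = yundama.login()
        print('uid: %s' % uid)
        # query the balance
        balance = yundama.balance()
        print('balance: %s' % balance)
        # read the file into bytes first: post_url sends the value it receives
        # as the file content, so passing the path itself would upload the path text
        with open(file_path, 'rb') as f:
            image_content = f.read()
        # start recognition: image bytes, verification code type ID, timeout (seconds)
        cid, result = yundama.decode(image_content, codetype, timeout)
        print('cid: %s, result: %s' % (cid, result))
        return result


if __name__ == '__main__':
    pass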
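The decode method in the listing uploads the image and then asks for the result once per second until an answer arrives or the timeout runs out. The same polling pattern in isolation, as a sketch: `poll_for_result`, `fetch`, and the injectable `sleep` parameter are names chosen for this illustration.

```python
import time


def poll_for_result(fetch, attempts, delay=1.0, sleep=time.sleep):
    """Call fetch() up to `attempts` times, pausing `delay` seconds after each empty reply.

    Returns the first non-empty string, or '' if every attempt came back empty.
    """
    for _ in range(attempts):
        text = fetch()
        if text != "":
            return text
        sleep(delay)
    return ""


# Simulated platform that needs two polls before the answer is ready.
replies = iter(["", "", "x7kq"])
print(poll_for_result(lambda: next(replies), attempts=5, sleep=lambda s: None))  # prints: x7kq
```

Bounding the number of attempts matters because recognition on the platform side can take several seconds, and an unbounded loop would hang the crawler if no answer ever arrives.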
4 Common types of verification codes
4.1 The url address remains unchanged, and the verification code remains unchanged
This is the simplest type of verification code. You only need to obtain the address of the verification code, request it, and have it recognized by the coding platform.
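For this type, fetching and recognizing can be done once and the answer reused, since the image behind the address does not change. A sketch under that assumption: `fetch` stands for any function that downloads the image bytes (for example `lambda url: requests.get(url).content`) and `recognize` for the platform call (such as indetify from section 3). Both parameter names, the helper `solve_fixed_captcha`, and the caching itself are illustrative additions.

```python
def solve_fixed_captcha(url, fetch, recognize, cache=None):
    """Return the code for a captcha whose URL and image never change.

    The result is stored in `cache` (a dict keyed by URL). Pass the same dict
    on every call so the platform is asked only once per address; with the
    default of None nothing is remembered between calls.
    """
    if cache is None:
        cache = {}
    if url not in cache:
        image_bytes = fetch(url)             # download the verification code image
        cache[url] = recognize(image_bytes)  # hand the bytes to the recognizer
    return cache[url]
```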
4.2 The url address remains unchanged, the verification code changes
This type of verification code is more common. For this type, you need to think about the following:
During the login process, assuming the verification code I enter is correct, how does the other party's server determine that it is the code displayed on my screen rather than some other code?
Across fetching the web page, requesting the verification code, and submitting the verification code, the server must have some way of confirming that the code I obtained earlier and the code I submitted are the same one. What is that method?
It is typically achieved through cookies, so the request for the page, the request for the verification code, and the submission of the verification code must all carry the same cookies. You can use requests.session to solve this problem.
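The three requests can be sketched as follows, with a single session object carrying the cookies from one step to the next. The function name `login_with_captcha`, its parameters, and the form field name `captcha` are assumptions for illustration; real sites use their own URLs and field names. `session` is expected to be a requests session object, and `recognize` any function that turns image bytes into text, such as indetify from section 3.

```python
def login_with_captcha(session, page_url, captcha_url, login_url, form, recognize,
                       captcha_field="captcha"):
    """Fetch the page, fetch the captcha, and submit the form on ONE session.

    Because all three calls go through the same session, the cookies set by the
    first response are sent with the later requests, which lets the server match
    the submitted code to the image it issued.
    """
    session.get(page_url)                           # the server sets its cookie here
    image_bytes = session.get(captcha_url).content  # captcha tied to that cookie
    data = dict(form)                               # copy so the caller's dict is untouched
    data[captcha_field] = recognize(image_bytes)
    return session.post(login_url, data=data)
```

A call might pass `requests.session()` as the first argument together with the site's actual addresses and form fields. Sending the three requests with separate plain `requests.get` and `requests.post` calls would drop the cookies between steps, and the server would treat the submitted code as belonging to a different visitor.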
Summary
- Understand the basics of verification codes (CAPTCHAs)
- Master the use of an image recognition engine
- Get to know common coding platforms (captcha-solving services)
- Master the method of processing verification codes through a coding platform