Crawler CAPTCHA recognition

Today's case study is CAPTCHA recognition on Gushiwen.com.
Anti-crawling mechanism: a CAPTCHA. To simulate the login operation, the crawler has to recognize the text in the CAPTCHA image.
I use Super Eagle (Chaojiying), a third-party automatic CAPTCHA recognition service.
After registering and logging in, follow Super Eagle's WeChat official account to get 1000 free points.
How to use: in User Center >> Software ID, generate a software ID and use it in place of 96001 in the sample code; replace a.jpg with the path of your local image file.
Crawling steps: first fetch the full login page of Gushiwen.com, then parse the CAPTCHA image address out of the page, download the image, and save it. Next, on the Super Eagle official website, choose Python under the development documentation, download the sample code, and modify it as described in the usage notes above.
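The parsing step can be tried offline before touching the network. The sketch below is only an illustration: it swaps lxml's XPath for the standard library's `html.parser` so it runs without extra packages, and the helper names (`CaptchaSrcParser`, `extract_captcha_url`) and the sample HTML are made up for this example. Only the `imgCode` id and the Gushiwen login URL come from the script in this post.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class CaptchaSrcParser(HTMLParser):
    """Remembers the src of the first <img> tag whose id matches img_id."""

    def __init__(self, img_id):
        super().__init__()
        self.img_id = img_id
        self.src = None

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if tag == "img" and attr_map.get("id") == self.img_id and self.src is None:
            self.src = attr_map.get("src")


def extract_captcha_url(page_text, page_url, img_id="imgCode"):
    """Return the absolute URL of the CAPTCHA image found in page_text."""
    parser = CaptchaSrcParser(img_id)
    parser.feed(page_text)
    if parser.src is None:
        raise ValueError("no <img id=%r> found in the page" % img_id)
    # urljoin resolves relative paths against the page URL and leaves
    # absolute URLs untouched.
    return urljoin(page_url, parser.src)


# Hypothetical markup; the real page may look different.
sample_html = '<html><body><img id="imgCode" src="/img/captcha-sample.jpg" /></body></html>'
captcha_url = extract_captcha_url(sample_html, "https://so.gushiwen.org/user/login.aspx")
print(captcha_url)
```

Resolving the address with `urljoin` instead of string concatenation means the result stays valid whether the page gives a relative or an absolute `src`.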
The full code is below:

from hashlib import md5

import requests          # fetches the page and the CAPTCHA image
from lxml import etree   # parses the page HTML

class Chaojiying_Client(object):

    def __init__(self, username, password, soft_id):
        self.username = username
        password =  password.encode('utf-8')
        self.password = md5(password).hexdigest()
        self.soft_id = soft_id
        self.base_params = {
            'user': self.username,
            'pass2': self.password,
            'softid': self.soft_id,
        }
        self.headers = {
            'Connection': 'Keep-Alive',
            'User-Agent': 'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0)',
        }

    def PostPic(self, im, codetype):
        """
        im: image bytes
        codetype: CAPTCHA type, see http://www.chaojiying.com/price.html
        """
        params = {
            'codetype': codetype,
        }
        params.update(self.base_params)
        files = {'userfile': ('ccc.jpg', im)}
        r = requests.post('http://upload.chaojiying.net/Upload/Processing.php', data=params, files=files, headers=self.headers)
        return r.json()

    def ReportError(self, im_id):
        """
        im_id: image ID of the CAPTCHA being reported as wrongly recognized
        """
        params = {
            'id': im_id,
        }
        params.update(self.base_params)
        r = requests.post('http://upload.chaojiying.net/Upload/ReportError.php', data=params, headers=self.headers)
        return r.json()


headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3947.100 Safari/537.36'
}
url = 'https://so.gushiwen.org/user/login.aspx?from=http://so.gushiwen.org/user/collect.aspx'
# Fetch the login page.
page_text = requests.get(url=url, headers=headers).text
# Parse the page and build the full address of the CAPTCHA image.
tree = etree.HTML(page_text)
new_code_src = 'https://so.gushiwen.org' + tree.xpath('//*[@id="imgCode"]/@src')[0]
# Download the image as bytes.
new_page_text = requests.get(url=new_code_src, headers=headers).content
# Save the image locally.
with open('./code.jpg', 'wb') as fp:
    fp.write(new_page_text)
print('CAPTCHA downloaded')
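if __name__ == '__main__':
    # User Center >> Software ID: generate an ID and use it in place of 96001.
    chaojiying = Chaojiying_Client('a1372431588', '970110yy', '905384')
    # Local image file path, used in place of a.jpg; on Windows the path sometimes needs //.
    with open('code.jpg', 'rb') as fp:
        im = fp.read()
    # 1902 is the codetype; see the price page linked in PostPic.
    print(chaojiying.PostPic(im, 1902))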

The printed result contains the text recognized in the downloaded CAPTCHA image.
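Since PostPic returns the parsed JSON response rather than a bare string, the recognized text still has to be pulled out of it. The following is a minimal sketch, assuming the response carries `err_no`, `err_str`, `pic_id` and `pic_str` fields; those field names are not shown in this post and should be checked against the service's documentation. The helper name `extract_code` and the sample dictionaries are hypothetical.

```python
def extract_code(result):
    """Return the recognized text from a PostPic result dict.

    Assumes (not confirmed by this post) that err_no == 0 means success
    and that pic_str holds the recognized characters.
    """
    if result.get("err_no") != 0:
        raise RuntimeError("recognition failed: %s" % result.get("err_str"))
    return result["pic_str"]


# Made-up responses for illustration only.
sample_ok = {"err_no": 0, "err_str": "OK", "pic_id": "0000", "pic_str": "a1b2"}
sample_bad = {"err_no": -1, "err_str": "sample error", "pic_id": "", "pic_str": ""}

print(extract_code(sample_ok))
```

If the site later rejects the recognized text, the image ID from the same response could presumably be passed to ReportError, again assuming the field is named `pic_id`.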

Origin blog.csdn.net/qwerty1372431588/article/details/106302880