Simulating a Zhihu login in three steps

First, request the login page to get cookies

An HTMLSession instance from the requests_html module automatically saves the cookies it receives and sends them with every later request, so there is no need to capture, store, or check them by hand.

def get_first_cookie(self):
    self.session.get(url=self.head_page_url)

Second, handle the captcha

Before a captcha can be verified, we first have to request it and save it, then type it in, and finally send the request that checks it.

Zhihu's login has three captcha cases:

  • No captcha

  • English captcha: enter a combination of 4 digits or letters; the URL carries lang=en. This kind is easier to handle, so it is the one used here
  • Chinese captcha: click the upside-down Chinese characters; the URL carries lang=cn

Three requests go to the same API; they differ in HTTP method and in the parameters they carry.

(1) GET request: checks whether a captcha is required. It carries no parameters; the response is { "show_captcha": true} if one is required, otherwise { "show_captcha": false}

(2) PUT request: fetches the captcha. It carries no parameters; the response is { "img_base64": "xxxxxx"}, the image as a base64-encoded string

(3) POST request: verifies the captcha. It carries the typed-in text; the response is { "success": true} if verification succeeds, otherwise { "success": false}
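The PUT response has to be decoded from base64 before the image can be saved or shown. A minimal sketch of that step using only the standard library: `decode_captcha` is a hypothetical helper name, and the sample payload is built from the 8-byte PNG signature as a stand-in, not taken from a real Zhihu response.

```python
import base64
import json

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"  # the first 8 bytes of every PNG file


def decode_captcha(response_text):
    """Return the raw image bytes from a {"img_base64": "..."} response body."""
    payload = json.loads(response_text)
    return base64.b64decode(payload["img_base64"])


# Stand-in response body: just the PNG signature, base64-encoded (not a real image).
sample = json.dumps({"img_base64": base64.b64encode(PNG_SIGNATURE).decode("ascii")})
image_bytes = decode_captcha(sample)
print(image_bytes.startswith(PNG_SIGNATURE))
```

With a real response, the returned bytes are what gets written to captcha.png.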

def handle_captcha(self):
    r = self.session.get(url=self.captcha_api)  # GET: check whether a captcha is required
    res = r.json()
    
    if res.get('show_captcha'):
        r = self.session.put(url=self.captcha_api)  # a captcha is required; PUT to fetch it
        img_base64 = r.json().get('img_base64')
        # Decode the captcha and save it as captcha.png; this needs the base64 module (b64decode)
        with open('captcha.png', 'wb') as f:
            f.write(base64.b64decode(img_base64))
        # Open the image and show it; this needs Image imported from PIL
        img = Image.open('captcha.png')
        img.show()
        
        # Type in the captcha and POST it; the parameter name was read from the browser
        self.captcha = input('Enter the captcha: ')
        r = self.session.post(url=self.captcha_api,data={"input_text":self.captcha})
        res = r.json()
        if res.get('success'):
            print('Captcha correct')
        else:
            self.handle_captcha()
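The method above retries by calling itself, so a run of wrong answers recurses without limit. One possible variation is a loop with a bounded number of attempts. The sketch below is an assumption-laden illustration, not the author's code: `verify_captcha`, `read_answer`, `max_attempts`, `FakeSession`, and `FakeResponse` are hypothetical names, and the fake session exists only so the flow can run without touching the network. Saving and displaying the image is left out.

```python
def verify_captcha(session, captcha_api, read_answer, max_attempts=3):
    """Return the accepted text, '' if no captcha is required, None if attempts run out."""
    if not session.get(url=captcha_api).json().get("show_captcha"):
        return ""
    for _ in range(max_attempts):
        session.put(url=captcha_api)  # fetch a fresh captcha for each attempt
        answer = read_answer()
        reply = session.post(url=captcha_api, data={"input_text": answer}).json()
        if reply.get("success"):
            return answer
    return None


class FakeResponse:
    def __init__(self, payload):
        self._payload = payload

    def json(self):
        return self._payload


class FakeSession:
    """Hypothetical stand-in for the real session; it accepts only 'ab12'."""

    def get(self, url):
        return FakeResponse({"show_captcha": True})

    def put(self, url):
        return FakeResponse({"img_base64": ""})

    def post(self, url, data):
        return FakeResponse({"success": data["input_text"] == "ab12"})


answers = iter(["zzzz", "ab12"])  # first guess is wrong, second is right
result = verify_captcha(FakeSession(), "https://example.invalid/captcha", lambda: next(answers))
print(result)  # prints ab12
```

In real use, `read_answer` would be something like `lambda: input('Enter the captcha: ')`.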

Third, submit the login

The browser's network capture tool shows that the front end encrypts the parameters it sends, so we need to find where the front-end code does the encryption and which parameters go into it.

  • Search the JS files for the word encrypt. It turns up in static.zhihu.com/heifetz/main.app.8b5cc380b705bb3ed141.js; find the line with the encryption function
var b = function(e) {
        return __g._encrypt(encodeURIComponent(e))  // encodeURIComponent() encodes a string as a URI component.
    };
  • It is not obvious what e is, so set a breakpoint on this line, run the login, and type e in the Console to see its value
client_id=c3cef7c66a1843f8b3a9e6a1e3160e20&grant_type=password&timestamp=1571675789070&source=com.zhihu.web&signature=652fa1c9bd1831abce73a5ee87d2dd9f748ce308&username=%2B86182000000001&password=xxxx&captcha=%7B%22img_size%22%3A%5B200%2C44%5D%2C%22input_points%22%3A%5B%5B93.33331298828125%2C21.052078247070312%5D%2C%5B158.33331298828125%2C21.052078247070312%5D%5D%7D&lang=cn&utm_source=&ref_source=other_https%3A%2F%2Fwww.zhihu.com%2Fsignin%3Fnext%3D%252F

It is a long string, but on closer inspection it is simply the request parameters, of which signature is itself an encrypted string. The next step is to find the function that produces signature.
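To see the individual parameters, the captured string can be split with the standard library. A small sketch: the string is the value of e shown above, and `captured`, `params`, and `captcha` are just local names chosen here.

```python
import json
from urllib.parse import parse_qs

captured = (
    "client_id=c3cef7c66a1843f8b3a9e6a1e3160e20&grant_type=password"
    "&timestamp=1571675789070&source=com.zhihu.web"
    "&signature=652fa1c9bd1831abce73a5ee87d2dd9f748ce308"
    "&username=%2B86182000000001&password=xxxx"
    "&captcha=%7B%22img_size%22%3A%5B200%2C44%5D%2C%22input_points%22%3A"
    "%5B%5B93.33331298828125%2C21.052078247070312%5D%2C"
    "%5B158.33331298828125%2C21.052078247070312%5D%5D%7D"
    "&lang=cn&utm_source="
    "&ref_source=other_https%3A%2F%2Fwww.zhihu.com%2Fsignin%3Fnext%3D%252F"
)

# keep_blank_values=True keeps empty fields such as utm_source.
params = {key: values[0] for key, values in parse_qs(captured, keep_blank_values=True).items()}

# The captcha field is itself JSON (this capture used the click-point captcha).
captcha = json.loads(params["captcha"])

for name, value in params.items():
    print(name, "=", value)
```

The decoded output makes it clear that only signature is derived; every other field is plain input.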

  • Searching for signature in the same way locates the function that computes it. What remains is to reproduce the JS in Python. The Python module execjs can run JS code from Python, so copy the encryption-related functions out of the JS file into a local file and call them through execjs

  • Send the login request, carrying the encrypted data

def login_in(self):
    # Parameters carried by the request
    formdata = {
        "client_id": "xxxxxxxx",  # the client id; print e in the Console and copy it from there
        "grant_type": "password",
        # should match the timestamp that went into the signature
        "timestamp": str(int(time.time() * 1000)),
        "source": "com.zhihu.web",
        "signature": self.signature,  # the JS function is an HMAC with a secret key; see get_signature(self) in the complete code
        "username": "+86182000000",  # user name; use your own
        "password": "xxxxx",  # password
        "captcha": self.captcha,
        "lang": "en",
        "utm_source": '',
        # decoded form of the captured value; urlencode() encodes it again
        "ref_source": "other_https://www.zhihu.com/signin?next=%2F"
    }
    with open('知乎加密.js','rt',encoding='utf-8') as f:
        # Read the JS code
        js = f.read()
    # Compile. cwd is the directory holding the dependencies; the JS normally runs under node, so run npm install jsdom first
    execjs_obj = execjs.compile(js,cwd='node_modules')
    # execjs_obj.call(function name, argument)
    res = execjs_obj.call('b',urlencode(formdata))

    # Request headers; the request fails without the fields below
    headers = {
        "user-agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.120 Safari/537.36",
        "content-type": "application/x-www-form-urlencoded",  # body type
        "x-zse-83": "3_2.0"  # version
    }
    r = self.session.post(url=self.sign_in_api, data=res,headers=headers)
    # The status code for a successful login is 201 here
    if r.status_code == 201:
        print('Login succeeded')
        r = self.session.get(url='https://www.zhihu.com/')  # after logging in, go to the Zhihu home page
        print(r.text)
    else:
        print('Wrong user name or password')
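Before encryption the form is serialized with urlencode, which percent-encodes every value. The dictionary therefore needs the decoded form of each captured field, or the result will not match what the browser produced. A short sketch with three of the fields; the ref_source value is the decoded form of the one in the captured string.

```python
from urllib.parse import urlencode

formdata = {
    "username": "+86182000000",
    "utm_source": "",
    "ref_source": "other_https://www.zhihu.com/signin?next=%2F",
}

body = urlencode(formdata)
print(body)
# username=%2B86182000000&utm_source=&ref_source=other_https%3A%2F%2Fwww.zhihu.com%2Fsignin%3Fnext%3D%252F
```

Note how the literal %2F in the value becomes %252F, exactly as in the string captured from the Console.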

Fourth, the complete code

import time,hmac
from hashlib import sha1
from requests_html import HTMLSession
import base64,execjs
from PIL import Image
from urllib.parse import urlencode

class Spider():
    def __init__(self):
        self.session = HTMLSession()
        self.captcha_api = "https://www.zhihu.com/api/v3/oauth/captcha?lang=en"
        self.sign_in_api = "https://www.zhihu.com/api/v3/oauth/sign_in"
        self.head_page_url = "https://www.zhihu.com/signin?next=%2F"
        self.headers = {
            "user-agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.120 Safari/537.36"
        }
        self.captcha = ''
        self.signature = ''

    def get_first_cookie(self):
        self.session.get(url=self.head_page_url)

    def handle_captcha(self):
        r = self.session.get(url=self.captcha_api)
        res = r.json()
        if res.get('show_captcha'):
            r = self.session.put(url=self.captcha_api)
            img_base64 = r.json().get('img_base64')
            with open('captcha.png', 'wb') as f:
                f.write(base64.b64decode(img_base64))
            img = Image.open('captcha.png')
            img.show()
            self.captcha = input('Enter the captcha: ')
            r = self.session.post(url=self.captcha_api,data={"input_text":self.captcha})
            res = r.json()
            if res.get('success'):
                print('Captcha correct')
            else:
                self.handle_captcha()

    def get_signature(self):
        # Compute the timestamp once so the value that is signed is also the value that is sent
        self.timestamp = str(int(time.time() * 1000))
        r = hmac.new(b'd1b964811afb40118a12068ff74a12f4',digestmod=sha1)  # the first argument is the key; digestmod is the hash algorithm
        r.update(b"password")
        r.update(b"c3cef7c66a1843f8b3a9e6a1e3160e20")
        r.update(b"com.zhihu.web")
        r.update(self.timestamp.encode('utf-8'))
        self.signature = r.hexdigest()

    def login_in(self):
        formdata = {
            "client_id": "xxxxxxxx",
            "grant_type": "password",
            "timestamp": self.timestamp,  # set in get_signature()
            "source": "com.zhihu.web",
            "signature": self.signature,
            "username": "+86182000000",  # user name; use your own
            "password": "xxxxx",  # password
            "captcha": self.captcha,
            "lang": "en",
            "utm_source": '',
            # decoded form of the captured value; urlencode() encodes it again
            "ref_source": "other_https://www.zhihu.com/signin?next=%2F"
        }
        with open('知乎加密.js','rt',encoding='utf-8') as f:
            js = f.read()
            execjs_obj = execjs.compile(js,cwd='node_modules')
        # execjs_obj.call(function name, argument)
        res = execjs_obj.call('b',urlencode(formdata))
        headers = {
            "user-agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.120 Safari/537.36",
            "content-type": "application/x-www-form-urlencoded",
            "x-zse-83": "3_2.0"
        }
        r = self.session.post(url=self.sign_in_api, data=res,headers=headers)
        if r.status_code == 201:
            print('Login succeeded')
            r = self.session.get(url='https://www.zhihu.com/')
            print(r.text)
        else:
            print('Wrong user name or password')

    def run(self):
        self.get_first_cookie()
        self.handle_captcha()
        self.get_signature()
        self.login_in()

if __name__ == '__main__':
    zhihu = Spider()
    zhihu.run()
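The signature step can also be looked at in isolation. The sketch below restates get_signature as a pure function that takes the timestamp as an argument, so the same value can be placed in the form. The key, field values, and field order are copied from the listing above; `make_signature` and its parameter names are hypothetical.

```python
import hmac
import time
from hashlib import sha1


def make_signature(timestamp_ms,
                   key=b"d1b964811afb40118a12068ff74a12f4",
                   grant_type="password",
                   client_id="c3cef7c66a1843f8b3a9e6a1e3160e20",
                   source="com.zhihu.web"):
    """HMAC-SHA1 over grant_type + client_id + source + timestamp, as a hex string."""
    message = f"{grant_type}{client_id}{source}{timestamp_ms}"
    return hmac.new(key, message.encode("utf-8"), sha1).hexdigest()


timestamp = str(int(time.time() * 1000))
signature = make_signature(timestamp)

# The form carries the exact timestamp that was signed.
form = {"timestamp": timestamp, "signature": signature}
print(form)
```

Passing the whole message to hmac.new at once gives the same digest as the successive update() calls in the listing, since update() simply appends to the message.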

Origin www.cnblogs.com/863652104kai/p/11717447.html