Learning Crawlers Secretly Under the Covers (6) --- Handling Cookies for Simulated Login and Proxy IPs

1. Handling cookies for simulated login

In the previous posts, the sites we crawled did not require login, but many websites, such as Qzone and 17k Novels, will not show your personal homepage unless you log in first. Does that mean crawlers are helpless here?

Definitely not. Let's simulate the login with code!!!

1. What is a cookie?

A cookie is a set of key-value pairs stored on the client, as shown in the QQ Zone cookies below.

[figure: cookies set by QQ Zone, viewed in the browser developer tools]

2. What is the relationship between cookies and crawlers?

Sometimes, if a request does not carry the right cookie values, we will not get back the correct page data.

Therefore, cookies are a common and typical anti-crawling mechanism that crawlers have to deal with.
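For example, if you already have a logged-in cookie copied from your browser, you can attach it to a request by hand. The sketch below is only an illustration: the URL and the cookie string are made-up placeholders, not real values.

import requests

# Minimal sketch: manually attaching a cookie copied from the browser.
# Both the URL and the cookie value are made-up placeholders.
headers = {
    "User-Agent": "Mozilla/5.0",
    # Paste here the Cookie header copied from the browser's developer tools
    "Cookie": "key1=value1; key2=value2"
}
resp = requests.get('https://example.com/my/homepage', headers=headers)
print(resp.status_code)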

3. Example: the 17k novel site

First, get the URL that the login request is sent to (https://passport.17k.com/ck/user/login).

#!/usr/bin/env python
# -*- coding:utf-8 -*-
# @Author:HePengLi
# @Time:2021-03-27

import requests


# Create a session object (it will hold on to any cookies the server sets)
session = requests.Session()

url1 = 'https://passport.17k.com/ck/user/login'
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36 Edg/89.0.774.54"
}
data = {
    'loginName': '15029494474',  # fill in your account here
    'password': 'woshinidaye'    # fill in your password here
}
# Log in
res = session.post(url=url1, headers=headers, data=data)
print(res)

<Response [200]>

A 200 response means the simulated login went through!
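Strictly speaking, a 200 status only tells us that the server answered the request. If you want to double-check, an optional sanity check like the sketch below just prints the reply body and the cookies the server stored on our session; the exact JSON fields 17k returns are not listed here.

# Optional sanity check after session.post(...):
print(res.status_code)                 # HTTP status of the login request
print(res.text)                        # the login API's JSON reply
print(session.cookies.get_dict())      # cookies the server stored on our session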

Next, let's grab the basic information of the books saved on my bookshelf, as shown in the figure below.

[figure: the bookshelf page on 17k.com]

Because this content is loaded dynamically, open the Network panel in the developer tools, select XHR, press F5 to refresh the page, find the request shown below, and copy the URL from its Headers tab.

[figure: the bookshelf XHR request and its Headers in the Network panel]

Look at the response carefully: all the page content we want is right here.

import requests
import json


# Create a session object
session = requests.Session()

url1 = 'https://passport.17k.com/ck/user/login'
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36 Edg/89.0.774.54"
}
data = {
    'loginName': '15029494474',
    'password': 'woshinidaye'
}
# Log in
res = session.post(url=url1, headers=headers, data=data)
# print(res)

# Fetch the books on my bookshelf
url2 = 'https://user.17k.com/ck/author/shelf?page=1&appKey=2406394919'
# Note: plain requests.get is used here, so the login cookie stored in the session is NOT sent
resp = requests.get(url=url2, headers=headers).text
print(resp)

{"status":{"code":10103,"msg":"用户登陆信息错误"},"time":1616827996000}

An error is returned ("用户登陆信息错误" means "user login information is wrong"): because the request carried no cookie, the site treats us as not logged in.

# Fetch the books on my bookshelf
url2 = 'https://user.17k.com/ck/author/shelf?page=1&appKey=2406394919'
# This time the request goes through the session, which carries the login cookie
resp = session.get(url=url2, headers=headers).text
print(resp)

[figure: the JSON response containing the bookshelf data]

Now parse the content and save it locally.

#!/usr/bin/env python
# -*- coding:utf-8 -*-
# @Author:HePengLi
# @Time:2021-03-27

import requests
import json


# Create a session object
session = requests.Session()

url1 = 'https://passport.17k.com/ck/user/login'
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36 Edg/89.0.774.54"
}
data = {
    'loginName': '15029494474',
    'password': 'woshinidaye'
}
# Log in
res = session.post(url=url1, headers=headers, data=data)
# print(res)

# Fetch the books on my bookshelf
url2 = 'https://user.17k.com/ck/author/shelf?page=1&appKey=2406394919'
resp = session.get(url=url2, headers=headers).text
# print(resp)

# Convert the JSON string into a Python dict we can work with
resp_data = json.loads(resp)
data_list = resp_data['data']
# print(data_list)
with open('./17k.txt', 'w', encoding='utf-8') as f:
    for data in data_list:
        # Book category
        category = data['bookCategory']['name']
        # Book title
        title = data['bookName']
        # Most recently updated chapter
        chapter = data['lastUpdateChapter']['name']
        # Author
        author = data['authorPenName']
        # print(category, title, chapter, author)
        # Simple formatting of one record
        content = 'Category: ' + category + " , " + 'Title: ' + title + " , " + 'Latest chapter: ' + chapter + " , " + 'Author: ' + author + '\n\n'
        f.write(content)

print('over!!!')


2. Using a proxy IP

1. What is a proxy IP?

In a word: a proxy server.

2. What is it used for?

It forwards requests and responses.

3. Why use one in a crawler?

If a crawler sends high-frequency requests to a server in a short period of time, the server will detect the anomaly and temporarily block our IP address, so we can no longer access it for the duration of the ban. That is where proxy IPs come in: when we send requests through a proxy, the IP the server sees is the proxy server's, not our real client's.
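In requests, routing a request through a proxy is just a matter of passing a proxies dictionary. The sketch below uses the public echo service httpbin.org/ip to show the difference; the proxy address in it is a made-up placeholder, not a working proxy.

import requests

# Made-up placeholder proxy address; replace it with a real one
proxies = {
    'http': 'http://123.45.67.89:8888',
    'https': 'http://123.45.67.89:8888'
}
# Without proxies= the echo service sees our real IP
print(requests.get('https://httpbin.org/ip', timeout=5).json())
# With proxies= it should see the proxy server's IP instead
print(requests.get('https://httpbin.org/ip', proxies=proxies, timeout=5).json())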

4. Anonymity levels of proxy servers

①Transparent proxy: as the name suggests, it hides almost nothing. With this proxy, the server knows that you are using a proxy and also knows your real IP.

②Anonymous proxy: With this proxy, the server knows that you use a proxy, but does not know your real IP.

③High-anonymity proxy: With this proxy, the server does not know that you are using a proxy, nor does it know your real IP.

5. The proxy provider I recommend (Zhilian HTTP)

URL address: http://http.zhiliandaili.cn/

Free proxies can really drive a person crazy, so I went for the "sell-a-kidney" paid plan, as shown in the picture below.

[figure: Zhilian HTTP's paid plans]

At 3 yuan a day for practice, even someone as poor as me can just about accept it!

After the purchase is complete, we click API extraction, as shown below
[figure: the API extraction page]

In my experience, an IP advertised as valid for 1~5 minutes actually stays usable for only about two minutes.
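Because the IPs expire so quickly, it can be worth testing each one before use. Below is a rough sketch of such a check; the test URL, the 3-second timeout and the helper name is_alive are all my own choices, not part of the provider's API.

import requests

def is_alive(proxy_dict, test_url='https://www.baidu.com', timeout=3):
    """Rough liveness check: True if the proxy answers within the timeout."""
    try:
        r = requests.get(test_url, proxies=proxy_dict, timeout=timeout)
        return r.status_code == 200
    except requests.RequestException:
        return False

# Usage sketch, assuming a list of proxy dicts like the https_ip list built below:
# live_proxies = [p for p in https_ip if is_alive(p)]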

6. Writing code to fetch proxy IP addresses

#!/usr/bin/env python
# -*- coding:utf-8 -*-
# @Author:HePengLi
# @Time:2021-03-27
import requests
from lxml import etree


# The API extraction URL generated by the proxy provider
url = 'http://ip.ipjldl.com/index.php/api/entry?method=proxyServer.generate_api_url&packid=1&fa=0&fetch_key=&groupid=0&qty=5&time=1&pro=&city=&port=1&format=html&ss=5&css=&dt=1&specialTxt=3&specialJson=&usertype=15'

page_content = requests.get(url).text
tree = etree.HTML(page_content)
# Each ip:port pair is a separate text node in the returned HTML body
all_ip = tree.xpath('//body//text()')
https_ip = []
for ip in all_ip:
    dic = {
        'https': ip
    }
    https_ip.append(dic)
print(https_ip)

[{'https': '123.73.63.67:46603'}, {'https': '220.161.32.108:45111'}, {'https': '183.159.83.169:45112'}, {'https': '222.37.78.253:32223'}, {'https': '114.99.11.51:23890'}]

7. Getting a website to block my IP

This part is the most fun, hahaha! Dear readers, please don't copy me; I'm only doing this as a demonstration.

Send high-frequency requests to Kuaidaili (https://www.kuaidaili.com/free/inha) so that my local IP gets blocked.

import requests
from lxml import etree

url = 'https://www.kuaidaili.com/free/inha/%s/'
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36 Edg/89.0.774.54"
}
all_ip = []
for i in range(1, 100):
    page_url = url % i
    page_content = requests.get(url=page_url, headers=headers).text
    tree = etree.HTML(page_content)
    tables = tree.xpath('//*[@id="list"]/table')
    for table in tables:
        # The first column of each row holds the IP address
        ips = table.xpath('./tbody/tr/td[1]/text()')
        for d in ips:
            all_ip.append(d)
print(len(all_ip))

My IP got blocked after barely touching the site; how cooperative of them (go refresh the page and see), as shown below.

[figure: the Kuaidaili page no longer loading after the block]

8. Use a proxy IP

#!/usr/bin/env python
# -*- coding:utf-8 -*-
# @Author:HePengLi
# @Time:2021-03-27
import requests
from lxml import etree
import random


# Fetch proxy IPs from the provider's API extraction URL
url = 'http://ip.ipjldl.com/index.php/api/entry?method=proxyServer.generate_api_url&packid=1&fa=0&fetch_key=&groupid=0&qty=5&time=1&pro=&city=&port=1&format=html&ss=5&css=&dt=1&specialTxt=3&specialJson=&usertype=15'

page_content = requests.get(url).text
tree = etree.HTML(page_content)
all_ip = tree.xpath('//body//text()')
https_ip = []
for ip in all_ip:
    dic = {
        'https': ip
    }
    https_ip.append(dic)
# print(https_ip)

# Make the request again, this time through a proxy
url = 'https://www.kuaidaili.com/free/inha/%s/'
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36 Edg/89.0.774.54"
}
all_ip = []
for i in range(1, 2):
    page_url = url % i
    # Send the request through a randomly chosen proxy IP
    page_content = requests.get(url=page_url, headers=headers, proxies=random.choice(https_ip)).text
    tree = etree.HTML(page_content)
    tables = tree.xpath('//*[@id="list"]/table')
    for table in tables:
        ips = table.xpath('./tbody/tr/td[1]/text()')
        for d in ips:
            all_ip.append(d)

print(len(all_ip))

Go back and refresh the page in the browser: it still can't be accessed from my own IP, yet the request sent through the proxy came back fine, which shows the proxy is working!
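If you plan to scrape more pages this way, a common pattern is to rotate through the proxy pool and retry with a different proxy whenever one fails. Here is a rough sketch of that idea; the function name and retry count are my own choices, and the error handling is kept deliberately simple.

import random
import requests

def fetch_with_retry(page_url, proxy_pool, headers, max_tries=3):
    """Try up to max_tries proxies from the pool; return the HTML text or None."""
    for _ in range(max_tries):
        proxy = random.choice(proxy_pool)
        try:
            resp = requests.get(page_url, headers=headers, proxies=proxy, timeout=5)
            if resp.status_code == 200:
                return resp.text
        except requests.RequestException:
            # This proxy failed or timed out; try another one
            continue
    return None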


Origin blog.csdn.net/hpl980342791/article/details/115260693