1. Handling cookies for simulated login
Previous articles crawled websites without logging in, but many sites, such as Qzone and 17k Novels, will not show the personal homepage unless you log in first. Does that leave the crawler helpless?
Definitely not: let's simulate the login in code!
1. What is a cookie?
A cookie is a set of key-value pairs stored on the client, such as the QQ Zone cookie shown in the figure.
2. What is the relationship between cookies and crawlers?
Sometimes, if the cookie value is not carried when requesting a web page, we will not be able to get the correct data from the page.
Therefore, the cookie is a common and typical anti-crawling mechanism!
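To see concretely what "carrying a cookie" means, here is a minimal offline sketch (the URL and cookie values are made up for illustration) showing how requests turns a cookies dict into the single Cookie header that travels with the request:

```python
import requests

# Build a request that carries cookies; prepare() lets us inspect the
# final headers without sending anything over the network.
req = requests.Request(
    'GET', 'https://example.com/home',       # hypothetical URL
    cookies={'uid': '123', 'skey': 'abc'}    # made-up cookie values
).prepare()

# The cookies dict becomes one Cookie request header on the wire
print(req.headers['Cookie'])
```

If a site checks this header and it is missing or wrong, the "correct data" simply never comes back.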
3. Example: the 17k Novels site
Get the URL address corresponding to the login request (https://passport.17k.com/ck/user/login)
# !/usr/bin/env python
# -*- coding:utf-8 -*-
# @Author:HePengLi
# @Time:2021-03-27
import requests

# Create a session object
session = requests.Session()
url1 = 'https://passport.17k.com/ck/user/login'
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36 Edg/89.0.774.54"
}
data = {
    'loginName': '15029494474',  # fill in the account here
    'password': 'woshinidaye'    # fill in the password here
}
# Log in
res = session.post(url=url1, headers=headers, data=data)
print(res)
<Response [200]>
A status code of 200 means the simulated login succeeded!
Next, let's fetch the basic information of the books saved in my bookshelf, as shown in the figure below.
Because this content is dynamically loaded, open Network in the developer tools, select XHR, press F5 to refresh the page, find the package shown in the picture below, and copy the URL address from its Headers.
Observe carefully: everything we want from the page is right there.
import requests
import json

# Create a session object
session = requests.Session()
url1 = 'https://passport.17k.com/ck/user/login'
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36 Edg/89.0.774.54"
}
data = {
    'loginName': '15029494474',
    'password': 'woshinidaye'
}
# Log in
res = session.post(url=url1, headers=headers, data=data)
# print(res)
# Fetch the books on my bookshelf (note: plain requests.get here, not the session)
url2 = 'https://user.17k.com/ck/author/shelf?page=1&appKey=2406394919'
resp = requests.get(url=url2, headers=headers).text
print(resp)
{"status": {"code": 10103, "msg": "用户登陆信息错误"}, "time": 1616827996000}
An error was reported (the msg means "user login information error"): when we fetch the page without carrying the cookie, the website decides that we are not logged in!
# Fetch the books on my bookshelf, this time through the logged-in session
url2 = 'https://user.17k.com/ck/author/shelf?page=1&appKey=2406394919'
resp = session.get(url=url2, headers=headers).text
print(resp)
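Why does session.get work where requests.get failed? A Session remembers the cookies the login response set and re-sends them on every later request. A minimal offline sketch of that behavior (the accessToken name and value are made up; in the real flow session.post stores the cookie for us):

```python
import requests

session = requests.Session()
# Pretend the login response set this cookie (made-up name/value)
session.cookies.set('accessToken', 'abc123')

# Prepare a follow-up request on the same session and inspect what
# would actually be sent: the stored cookie rides along automatically.
prepared = session.prepare_request(requests.Request('GET', 'https://example.com/shelf'))
print(prepared.headers.get('Cookie'))
```

A fresh requests.get call has no such memory, which is exactly why it was rejected above.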
Parse the content and save it locally.
# !/usr/bin/env python
# -*- coding:utf-8 -*-
# @Author:HePengLi
# @Time:2021-03-27
import requests
import json

# Create a session object
session = requests.Session()
url1 = 'https://passport.17k.com/ck/user/login'
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36 Edg/89.0.774.54"
}
data = {
    'loginName': '15029494474',
    'password': 'woshinidaye'
}
# Log in
res = session.post(url=url1, headers=headers, data=data)
# print(res)
# Fetch the books on my bookshelf
url2 = 'https://user.17k.com/ck/author/shelf?page=1&appKey=2406394919'
resp = session.get(url=url2, headers=headers).text
# print(resp)
# Convert the JSON string into a Python dict we can work with
resp_data = json.loads(resp)
data_list = resp_data['data']
# print(data_list)
f = open('./17k.txt', 'w', encoding='utf-8')
for data in data_list:
    # Book category
    category = data['bookCategory']['name']
    # Book title
    title = data['bookName']
    # Most recently updated chapter
    chapter = data['lastUpdateChapter']['name']
    # Author
    author = data['authorPenName']
    # print(category, title, chapter, author)
    # Simple formatting of the record
    content = 'Category: ' + category + " , " + 'Title: ' + title + " , " + 'Latest chapter: ' + chapter + " , " + 'Author: ' + author + '\n\n'
    f.write(content)
f.close()
print('over!!!')
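The field access above assumes the shelf API returns a particular JSON shape. Here is the same parsing logic run against a small inline sample, so it can be tried without logging in (the book values are invented; the field names mirror the ones used in the code above):

```python
import json

# Invented sample with the same structure the shelf API returned above
sample = '''{
  "status": {"code": 0},
  "data": [{
    "bookCategory": {"name": "Fantasy"},
    "bookName": "Example Book",
    "lastUpdateChapter": {"name": "Chapter 10"},
    "authorPenName": "SomePenName"
  }]
}'''

resp_data = json.loads(sample)
for book in resp_data['data']:
    line = 'Category: %s , Title: %s , Latest chapter: %s , Author: %s' % (
        book['bookCategory']['name'],
        book['bookName'],
        book['lastUpdateChapter']['name'],
        book['authorPenName'],
    )
    print(line)
```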
2. Using a proxy IP
1. What is a proxy IP?
In short, it is a proxy server!
2. What is it used for?
It forwards requests and responses.
3. Why use it in a crawler?
If a crawler sends high-frequency requests to a server within a short time, the server will detect the anomaly and temporarily ban our IP address, so we can no longer access the server during the ban. That is why we use a proxy IP: after switching to one, the IP the server sees belongs to the proxy server, not to our real client!
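Besides proxies, the simplest way to avoid triggering that high-frequency detection is to space requests out. A small helper sketch (the one-to-three-second bounds are an arbitrary choice, not a rule from any site):

```python
import random
import time

def polite_delay(min_s=1.0, max_s=3.0):
    """Sleep a random interval between requests so they look less like a burst."""
    delay = random.uniform(min_s, max_s)
    time.sleep(delay)
    return delay

# e.g. call polite_delay() once per page inside a crawling loop
```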
4. Degrees of anonymity of proxy servers
①Transparent proxy: as the name suggests, it hides almost nothing. With this proxy, the server knows you are using a proxy and also knows your real IP.
②Anonymous proxy: with this proxy, the server knows you are using a proxy, but does not know your real IP.
③High-anonymity (elite) proxy: with this proxy, the server neither knows you are using a proxy nor knows your real IP.
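A server typically tells these levels apart by the extra headers the proxy adds. Here is a rough heuristic sketch from the server's point of view (an assumption-laden simplification: real detection looks at many more signals than these three headers):

```python
def classify_proxy(request_headers):
    """Rough guess at a proxy's anonymity level from the headers a
    server receives. Simplified heuristic, not a complete detector."""
    if 'X-Forwarded-For' in request_headers:
        # Transparent proxies commonly forward the client's real IP
        return 'transparent'
    if 'Via' in request_headers or 'Proxy-Connection' in request_headers:
        # Anonymous proxies reveal a proxy is in play, but not who you are
        return 'anonymous'
    # High-anonymity proxies add neither
    return 'elite'

print(classify_proxy({'Via': '1.1 proxy-a'}))  # anonymous
```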
5. A proxy provider I recommend (Zhilian HTTP)
URL address: http://http.zhiliandaili.cn/
Free proxies are truly maddening to use, so I chose a paid plan, as shown in the picture below.
At 3 yuan a day for practice, even on a tight budget I can just about accept it!
After the purchase is complete, click API extraction, as shown below.
In my experience, an IP advertised as valid for 1 to 5 minutes actually works for only about two minutes.
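Since each IP only lives a couple of minutes, it is worth refreshing the extracted batch automatically. A sketch of a self-refreshing pool (the 120-second TTL and the injected fetch_ips callback are my assumptions, not part of the provider's API):

```python
import time

class ProxyPool:
    """Re-fetch the proxy list once the previous batch is older than ttl."""

    def __init__(self, fetch_ips, ttl=120):
        self.fetch_ips = fetch_ips   # callable returning a fresh list of 'ip:port'
        self.ttl = ttl               # seconds an extracted batch stays trusted
        self.ips = []
        self.fetched_at = 0.0

    def get(self):
        # Refresh when the pool is empty or the batch has expired
        if not self.ips or time.time() - self.fetched_at > self.ttl:
            self.ips = self.fetch_ips()
            self.fetched_at = time.time()
        return self.ips

# Usage: pool = ProxyPool(lambda: ['1.2.3.4:80'])  # stand-in fetcher
```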
6. Writing code to fetch proxy IP addresses
# !/usr/bin/env python
# -*- coding:utf-8 -*-
# @Author:HePengLi
# @Time:2021-03-27
import requests
from lxml import etree

url = 'http://ip.ipjldl.com/index.php/api/entry?method=proxyServer.generate_api_url&packid=1&fa=0&fetch_key=&groupid=0&qty=5&time=1&pro=&city=&port=1&format=html&ss=5&css=&dt=1&specialTxt=3&specialJson=&usertype=15'
page_content = requests.get(url).text
tree = etree.HTML(page_content)
# Each text node in the body is one 'ip:port' entry
all_ip = tree.xpath('//body//text()')
https_ip = []
for ip in all_ip:
    dic = {
        'https': ip
    }
    https_ip.append(dic)
print(https_ip)
[{'https': '123.73.63.67:46603'}, {'https': '220.161.32.108:45111'}, {'https': '183.159.83.169:45112'}, {'https': '222.37.78.253:32223'}, {'https': '114.99.11.51:23890'}]
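The xpath above grabs every text node in the body, so it can pick up whitespace or junk along with real addresses. A quick shape check before trusting an entry (IPv4 'ip:port' only, by assumption):

```python
import re

def is_valid_proxy(addr):
    """Check that addr looks like an IPv4 'ip:port' string."""
    m = re.fullmatch(r'(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3}):(\d{1,5})', addr.strip())
    if not m:
        return False
    octets = [int(x) for x in m.groups()[:4]]
    port = int(m.group(5))
    return all(o <= 255 for o in octets) and 1 <= port <= 65535

print(is_valid_proxy('123.73.63.67:46603'))  # True
print(is_valid_proxy('\n'))                  # False: a stray text node
```

Filtering with a check like this keeps dead entries out of the proxy pool before any request is made.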
7. Getting a website to ban my IP
This is the fun part, hahaha! Dear readers, please don't copy me; I'm only doing this as a demonstration.
Send high-frequency requests to Kuaidaili (https://www.kuaidaili.com/free/inha) until my local IP gets banned.
import requests
from lxml import etree

url = 'https://www.kuaidaili.com/free/inha/%s/'
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36 Edg/89.0.774.54"
}
all_ip = []
for i in range(1, 100):
    page_url = url % i
    page_content = requests.get(url=page_url, headers=headers).text
    tree = etree.HTML(page_content)
    tables = tree.xpath('//*[@id="list"]/table')
    for table in tables:
        # First column of each row is the IP address
        page = table.xpath('./tbody/tr/td[1]/text()')
        for one in page:
            all_ip.append(one)
print(len(all_ip))
My IP was banned almost immediately; how cooperative (go refresh the page to confirm), as shown in the figure below.
8. Using the proxy IP
# !/usr/bin/env python
# -*- coding:utf-8 -*-
# @Author:HePengLi
# @Time:2021-03-27
import requests
from lxml import etree
import random

# Code for the proxy extraction API
url = 'http://ip.ipjldl.com/index.php/api/entry?method=proxyServer.generate_api_url&packid=1&fa=0&fetch_key=&groupid=0&qty=5&time=1&pro=&city=&port=1&format=html&ss=5&css=&dt=1&specialTxt=3&specialJson=&usertype=15'
page_content = requests.get(url).text
tree = etree.HTML(page_content)
all_ip = tree.xpath('//body//text()')
https_ip = []
for ip in all_ip:
    dic = {
        'https': ip
    }
    https_ip.append(dic)
# print(https_ip)
# Send the request again, this time through a proxy
url = 'https://www.kuaidaili.com/free/inha/%s/'
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36 Edg/89.0.774.54"
}
all_ip = []
for i in range(1, 2):
    page_url = url % i
    # Attach a randomly chosen proxy IP
    page_content = requests.get(url=page_url, headers=headers, proxies=random.choice(https_ip)).text
    tree = etree.HTML(page_content)
    tables = tree.xpath('//*[@id="list"]/table')
    for table in tables:
        page = table.xpath('./tbody/tr/td[1]/text()')
        for one in page:
            all_ip.append(one)
print(len(all_ip))
Go back and refresh the page in the browser: the banned local IP still cannot access the site directly, yet the code above gets data through the proxy, which shows the proxy is working!
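Short-lived free proxies fail often, so in practice each request should retry with a different random IP before giving up. A sketch of that pattern (the get callable is injected here so the logic can be exercised without a network; with requests you would pass requests.get):

```python
import random

def fetch_with_retry(get, page_url, proxy_pool, retries=3):
    """Try up to `retries` random proxies from the pool before giving up."""
    last_err = None
    for _ in range(retries):
        proxy = random.choice(proxy_pool)
        try:
            # proxies maps scheme -> 'ip:port', same shape as https_ip above
            return get(page_url, proxies={'https': proxy}, timeout=5)
        except Exception as err:
            last_err = err  # dead proxy: try another one
    raise last_err
```

Usage would look like fetch_with_retry(requests.get, page_url, ['1.2.3.4:80', ...]); the timeout matters, because a dead proxy otherwise hangs the whole loop.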