I. Starting with crawler case studies

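In the contest between crawlers and anti-crawler defenses, the defending site appears to hold the initiative, yet the crawler almost always wins in the end; the only question is how much it costs. So who wins is not the important issue. What matters is that anyone who writes a crawler keeps to an ethical baseline: even when the technique makes it possible, do not scrape data that others do not want to disclose. Such data may have cost them a great deal to collect, and maliciously harvesting it can cause serious losses and sets a bad example.

This article covers only the technical basics of crawling. At its core, crawling is guesswork, a back-and-forth with the site, and accumulated experience; in practice it means analyzing requests and web interfaces, so learning to crawl starts with understanding the web. Any data a browser can obtain, a crawler can obtain too; it all depends on how thoroughly the crawler imitates a browser. Let's start with a few simple crawlers: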

1 Autohome (汽车之家)

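This site is about as easy to crawl as Baidu. It has no defenses at all: there is no need to pose as a browser, no cookie is required, and no login is needed: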

import requests
from bs4 import BeautifulSoup

response = requests.get("https://www.autohome.com.cn/news/")
response.encoding = 'gbk'  # decode the page as GBK before reading .text

soup = BeautifulSoup(response.text, 'html.parser')

# The news list is the <div> with this id; each <li> inside it is one entry
div = soup.find(name='div', attrs={'id': 'auto-channel-lazyload-article'})
li_list = div.find_all(name='li')

for li in li_list:
    title = li.find(name='h3')
    if not title:
        # skip list items that carry no article title
        continue
    p = li.find(name='p')
    a = li.find(name='a')

    print(title.text)
    print(a.attrs.get('href'))
    print(p.text)

    img = li.find(name='img')
    src = img.get('src')
    src = "https:" + src  # src is assumed to be protocol-relative ("//...")
    print(src)

    # Send a second request to download the image
    file_name = src.rsplit('/', maxsplit=1)[1]
    ret = requests.get(src)
    with open(file_name, 'wb') as f:
        f.write(ret.content)
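
The `"https:" + src` step above only works when every `src` value is protocol-relative. A minimal sketch of a more forgiving variant, using only the standard library, might look like this. The example `src` values and both helper names are made up for illustration and are not taken from the real page:

```python
import posixpath
from urllib.parse import urljoin, urlsplit

PAGE_URL = "https://www.autohome.com.cn/news/"


def absolute_image_url(src, page_url=PAGE_URL):
    """Resolve an <img src> value against the page it was found on."""
    return urljoin(page_url, src)


def file_name_from_url(url):
    """Take the last path segment, ignoring any query string."""
    return posixpath.basename(urlsplit(url).path)


# Hypothetical src values covering the three common forms
examples = [
    "//img.example.com/news/a.jpg",        # protocol-relative
    "https://img.example.com/news/b.jpg",  # already absolute
    "/static/c.jpg?x=1",                   # site-relative, with a query string
]
for src in examples:
    url = absolute_image_url(src)
    print(url, file_name_from_url(url))
```

`urljoin` leaves absolute URLs untouched and resolves the other two forms against the page URL, so one code path handles all three cases.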


2 Chouti Hot List (抽屉新热榜)

Step one is to fetch the page content, which requires posing as a browser:

import requests
from bs4 import BeautifulSoup

r1 = requests.get(
    url='https://dig.chouti.com/',
    headers={
        'user-agent':'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'
    }
)

soup = BeautifulSoup(r1.text,'html.parser')

# A single Tag object
content_list = soup.find(name='div',id='content-list')
# print(content_list)
# A list of Tag objects: [Tag, Tag, ...]
item_list = content_list.find_all(name='div',attrs={'class':'item'})
for item in item_list:
    a = item.find(name='a',attrs={'class':'show-content color-chag'})
    print(a.text.strip())


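Going a step further, let's upvote an article. There is a trap here: upvoting needs a cookie, but not the cookie returned in the login response. It is the cookie returned when the page is first loaded. The login request has to carry that first cookie, and the same cookie is then used for the upvote. A crawler that starts sending requests directly from the login step will therefore fail, because a normal browser always visits the page before logging in. The cookie used for upvoting shows that this is an anti-crawler measure, and it demands some analytical skill and experience from the crawler's author; otherwise the cookie you do obtain is just a smokescreen. In detail: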

import requests

# The same browser-like User-Agent is sent with every request
headers = {
    'user-agent':'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'
}

# 1. Load the home page; the cookie returned here is the one that matters
r1 = requests.get(
    url='https://dig.chouti.com/',
    headers=headers
)

# 2. Submit the phone number and password, carrying the cookie from step 1
r2 = requests.post(
    url='https://dig.chouti.com/login',
    headers=headers,
    data={
        'phone':'YOUR_PHONE',        # register an account first
        'password':'YOUR_PASSWORD',  # register an account first
        'oneMonth':1
    },
    cookies=r1.cookies.get_dict()
)

# 3. Upvote, again with the cookie from step 1 (not the one from the login response)
r3 = requests.post(
    url='https://dig.chouti.com/link/vote?linksId=20435396',
    headers=headers,
    cookies=r1.cookies.get_dict()
)
print(r3.text)
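
Passing the first-visit cookie by hand on every call is easy to get wrong. A `requests.Session` stores cookies from each response and sends them back on later requests, much as a browser does. The sketch below shows that behavior without touching the network: it plants a stand-in cookie and only prepares the upvote request. The cookie name `gpsd`, its value, and the User-Agent string are assumptions made for illustration:

```python
import requests

session = requests.Session()
session.headers.update({"user-agent": "Mozilla/5.0 (illustrative value)"})

# Stand-in for the cookie that the first page load would set.
# With a live site, session.get('https://dig.chouti.com/') would store it automatically.
session.cookies.set("gpsd", "cookie-from-first-visit", domain="dig.chouti.com", path="/")

# Build (but do not send) the upvote request to see what would go on the wire.
vote = requests.Request(
    "POST",
    "https://dig.chouti.com/link/vote",
    params={"linksId": "20435396"},
)
prepared = session.prepare_request(vote)

print(prepared.url)
print(prepared.headers["Cookie"])
print(prepared.headers["user-agent"])
```

With a session, the three steps become `session.get(...)`, `session.post(...login...)`, `session.post(...vote...)`, and the order of requests is what makes the cookie valid.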


3 Logging in to GitHub automatically

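Login here is an HTML form submission. A CSRF token is required to guard against cross-site request forgery, and for this login the token travels in the request body together with the username and password. This too was found by analysis: if a site requires a token for login or for submitting data, it may sit in the request headers, the request body, a cookie, or some other parameter, and the way to find out is to log in for real once and inspect the requests. In detail: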

# Plan:
# 1. GET the login page
#    - find the hidden <input> tag in the HTML and read the CSRF token
#    - keep the cookies
# 2. POST the username and password
#    - send: the CSRF token, the username, the password
#    - carry the cookies
# 3. GET a page that requires login, for example https://github.com/settings/emails
#    (the code below uses /settings/repositories)
#    - carry the cookies

import requests
from bs4 import BeautifulSoup
from bs4.element import Tag

# 1. Visit the login page and read authenticity_token
i1 = requests.get('https://github.com/login')
soup1 = BeautifulSoup(i1.text, features='lxml')
tag = soup1.find(name='input', attrs={'name': 'authenticity_token'})
authenticity_token = tag.get('value')
c1 = i1.cookies.get_dict()
i1.close()

# 2. Send authenticity_token together with the username and password
form_data = {
    "authenticity_token": authenticity_token,
    "utf8": "",
    "commit": "Sign in",
    "login": "YOUR_USERNAME",     # register an account first
    'password': 'YOUR_PASSWORD'   # register an account first
}

i2 = requests.post('https://github.com/session', data=form_data, cookies=c1)
c2 = i2.cookies.get_dict()
c1.update(c2)  # merge in the cookies set by the login response

# 3. Visit a page that requires login, carrying the merged cookies
i3 = requests.get('https://github.com/settings/repositories', cookies=c1)
soup3 = BeautifulSoup(i3.text, features='lxml')
list_group = soup3.find(name='div', class_='listgroup')

for child in list_group.children:
    if isinstance(child, Tag):
        project_tag = child.find(name='a', class_='mr-1')
        size_tag = child.find(name='small')
        # The link text is the project name and href is its path
        temp = "Project: %s (%s); path: %s" % (project_tag.string, size_tag.string, project_tag.get('href'))
        print(temp)
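
The token-extraction step can be tried offline before pointing it at a real site. The sketch below parses a trimmed, made-up login form (not GitHub's actual markup) and collects every hidden `<input>`, so the form data does not depend on knowing the token's field name in advance. `hidden_fields` is a hypothetical helper:

```python
from bs4 import BeautifulSoup

# Hypothetical login page, trimmed to the parts that matter
LOGIN_HTML = """
<form action="/session" method="post">
  <input type="hidden" name="authenticity_token" value="token-123">
  <input type="text" name="login">
  <input type="password" name="password">
  <input type="submit" name="commit" value="Sign in">
</form>
"""


def hidden_fields(html):
    """Return {name: value} for every hidden <input> in the page."""
    soup = BeautifulSoup(html, "html.parser")
    fields = {}
    for tag in soup.find_all("input", attrs={"type": "hidden"}):
        name = tag.get("name")
        if name:
            fields[name] = tag.get("value", "")
    return fields


# Start from the hidden fields, then add the credentials
form_data = hidden_fields(LOGIN_HTML)
form_data.update({"login": "YOUR_USERNAME", "password": "YOUR_PASSWORD"})
print(form_data)
```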


4 Logging in to Lagou automatically

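Here two parameters hidden in the page have to be extracted from the response to the login-page request and then sent as request headers before login succeeds. Note that whether you parse pages with bs4 or XPath, do not forget regular expressions: they can solve problems the other approaches cannot. In detail: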

import re
import requests

r1 = requests.get(
    url='https://passport.lagou.com/login/login.html',
    headers={
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36',
    }
)
# The two values appear in the page source as: X_Anti_Forge_Token = '...' and X_Anti_Forge_Code = '...'
X_Anti_Forge_Token = re.findall("X_Anti_Forge_Token = '(.*?)'", r1.text, re.S)[0]
X_Anti_Forge_Code = re.findall("X_Anti_Forge_Code = '(.*?)'", r1.text, re.S)[0]
# print(X_Anti_Forge_Token, X_Anti_Forge_Code)
# print(r1.text)

r2 = requests.post(
    url='https://passport.lagou.com/login/login.json',
    headers={
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36',
        'X-Anit-Forge-Code':X_Anti_Forge_Code,
        'X-Anit-Forge-Token':X_Anti_Forge_Token,
        'Referer': 'https://passport.lagou.com/login/login.html', # the URL of the previous request
    },
    data={
        "isValidate": True,
        'username': 'YOUR_USERNAME',  # register an account first
        'password': 'ab18d270d7126ea65915c50288c22c0d',
        'request_form_verifyCode': '',
        'submit': ''
    },
    cookies=r1.cookies.get_dict()
)
print(r2.text)
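
The regular-expression step can be exercised on its own. This is a small sketch run against a made-up page excerpt; the excerpt, the sample values, and the helper name `extract_js_string` are assumptions for illustration:

```python
import re

# Hypothetical excerpt of a login page's inline script
PAGE = """
<script>
    window.X_Anti_Forge_Token = 'aaaa-bbbb';
    window.X_Anti_Forge_Code = '12345678';
</script>
"""


def extract_js_string(name, text):
    """Return the single-quoted value assigned to `name`, or None if absent."""
    match = re.search(re.escape(name) + r"\s*=\s*'(.*?)'", text, re.S)
    return match.group(1) if match else None


token = extract_js_string("X_Anti_Forge_Token", PAGE)
code = extract_js_string("X_Anti_Forge_Code", PAGE)
missing = extract_js_string("X_Not_There", PAGE)
print(token, code, missing)
```

Using `re.search` with an explicit `None` check avoids the `IndexError` that `re.findall(...)[0]` raises when the page layout changes and the pattern no longer matches.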


II. The requests module

1 Request methods

requests.get

requests.post

requests.put

requests.delete

and so on: every common HTTP request method is available. For more detail, read the requests source code.

Equivalent form: requests.request(method='POST')
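
In the requests source, helpers such as `requests.get` and `requests.post` are thin wrappers that call `requests.request` with the method name filled in. The sketch below makes the method explicit by preparing requests without sending them; the URL is a placeholder:

```python
import requests

URL = "http://127.0.0.1:8000/test/"  # placeholder, nothing is sent

methods = []
for method in ("GET", "POST", "PUT", "DELETE"):
    # Request(...).prepare() builds the request exactly as it would be sent
    prepared = requests.Request(method=method, url=URL).prepare()
    methods.append(prepared.method)
    print(prepared.method, prepared.url)
```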

2 Request parameters

2.1 url

2.2 headers

2.3 cookies
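
As a quick illustration of `headers` and `cookies`, the sketch below prepares a request without sending it and prints what would go on the wire. The host `c1.com` is the placeholder used elsewhere in this section; the header values and the cookie name are made up:

```python
import requests

req = requests.Request(
    "GET",
    "http://c1.com/index",
    headers={"User-Agent": "Mozilla/5.0 (illustrative)", "Referer": "http://c1.com/"},
    cookies={"sessionid": "abc123"},  # hypothetical cookie
)
prepared = req.prepare()

# headers are sent as given; cookies are folded into a single Cookie header
print(prepared.headers["User-Agent"])
print(prepared.headers["Referer"])
print(prepared.headers["Cookie"])
```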

2.4 params
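
`params` becomes the query string of the URL. A minimal sketch, again preparing the request without sending it; the parameter names are illustrative:

```python
import requests

single = requests.Request(
    "GET",
    "http://c1.com/index",
    params={"user": "liuneng", "page": 2},
).prepare()

# A list value repeats the key once per item
repeated = requests.Request(
    "GET",
    "http://c1.com/index",
    params={"tag": ["a", "b"]},
).prepare()

print(single.url)
print(repeated.url)
```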

2.5 data

Sends the request body as form data:

requests.post(
    ...,
    data={'user':'liuneng','pwd':'123'}
)

POST /index HTTP/1.1\r\nHost: c1.com\r\nContent-Type: application/x-www-form-urlencoded\r\n\r\nuser=liuneng&pwd=123

2.6 json (also sends the request body)

requests.post(
...,
json={'user':'liuneng','pwd':'123'}
)

POST /index HTTP/1.1\r\nHost: c1.com\r\nContent-Type: application/json\r\n\r\n{"user": "liuneng", "pwd": "123"}
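
The difference between `data` and `json` is easiest to see side by side. The sketch below prepares the same payload both ways and prints the resulting `Content-Type` and body; nothing is sent, and `body_text` is a small hypothetical helper:

```python
import json
import requests

URL = "http://c1.com/index"  # placeholder host
payload = {"user": "liuneng", "pwd": "123"}

form = requests.Request("POST", URL, data=payload).prepare()
as_json = requests.Request("POST", URL, json=payload).prepare()


def body_text(prepared):
    """Return the prepared body as text (it may be str or bytes)."""
    body = prepared.body
    return body.decode("utf-8") if isinstance(body, bytes) else body


print(form.headers["Content-Type"], body_text(form))
print(as_json.headers["Content-Type"], body_text(as_json))
```

Which one to use depends on what the server expects: a classic HTML form handler reads the URL-encoded body, while a JSON API reads the JSON body.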

2.7 Proxies (proxies)

# Proxy without authentication (proxy URLs must include the scheme)
proxie_dict = {
    "http": "http://61.172.249.96:80",
    "https": "http://61.185.219.126:3128",
}
ret = requests.get("https://www.proxy360.cn/Proxy", proxies=proxie_dict)

# Proxy that requires authentication
from requests.auth import HTTPProxyAuth

proxyDict = {
    'http': 'http://77.75.105.165',
    'https': 'http://77.75.106.165'
}
auth = HTTPProxyAuth('username', 'password')

r = requests.get("http://www.google.com", data={'xxx':'ffff'}, proxies=proxyDict, auth=auth)
print(r.text)
The parameters above are the ones you must master.

2.8 File upload (files)

# Send a file
file_dict = {
    'f1': open('xxxx.log', 'rb')
}
requests.request(
    method='POST',
    url='http://127.0.0.1:8000/test/',
    files=file_dict
)
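
To see what `files` actually produces, the sketch below prepares an upload from an in-memory file and inspects the multipart body without sending it. The file contents, the content type, and the extra `note` field are made up for illustration:

```python
import io
import requests

log = io.BytesIO(b"line 1\nline 2\n")

prepared = requests.Request(
    "POST",
    "http://127.0.0.1:8000/test/",
    # (file name, file object, content type); the name and type are illustrative
    files={"f1": ("xxxx.log", log, "text/plain")},
    data={"note": "uploaded by script"},  # ordinary form fields can ride along
).prepare()

content_type = prepared.headers["Content-Type"]
print(content_type.split(";")[0])
print(b'filename="xxxx.log"' in prepared.body)
```

The tuple form lets you control the file name and content type the server sees, independent of where the bytes come from.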

2.9 Authentication (auth)

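How it works internally:
the username and password are Base64-encoded (an encoding, not encryption) and sent to the server in a request header.

- "user:password"
- base64("user:password")
- "Basic " + base64("user:password")
- Request header:
Authorization: "Basic " + base64("user:password")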

from requests.auth import HTTPBasicAuth, HTTPDigestAuth

ret = requests.get('https://api.github.com/user', auth=HTTPBasicAuth('wupeiqi', 'sdfasdfasdf'))
print(ret.text)
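
The header construction described above can be reproduced with the standard library alone. This is a sketch assuming ASCII credentials; the function name and the credentials are hypothetical:

```python
import base64


def basic_auth_header(user, password):
    """Build the value of the Authorization header for HTTP Basic auth."""
    raw = f"{user}:{password}".encode("utf-8")
    return "Basic " + base64.b64encode(raw).decode("ascii")


header = basic_auth_header("alice", "s3cret")  # hypothetical credentials
print(header)

# Base64 is reversible: anyone who sees the header can recover the password
decoded = base64.b64decode(header.split(" ", 1)[1]).decode("utf-8")
print(decoded)
```

Because the value can be decoded by anyone who sees it, Basic authentication should only be used over HTTPS.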

2.10 Timeout (timeout)

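# A single number applies to both connecting and reading
# ret = requests.get('http://google.com/', timeout=1)
# print(ret)

# A tuple sets them separately: (connect timeout, read timeout), in seconds
# ret = requests.get('http://google.com/', timeout=(5, 1))
# print(ret)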
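
A timeout is only useful if the resulting exception is handled. The sketch below simulates a server that hangs by opening a local socket that accepts connections but never replies, then shows which exception the read timeout raises. The local socket is a stand-in invented for this example:

```python
import socket
import requests

# A local socket that accepts connections but never answers,
# standing in for a server that hangs.
silent = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
silent.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
silent.listen(1)
port = silent.getsockname()[1]

session = requests.Session()
session.trust_env = False  # ignore proxy settings from the environment

try:
    # (connect timeout, read timeout) in seconds
    session.get(f"http://127.0.0.1:{port}/", timeout=(1, 0.5))
    outcome = "ok"
except requests.exceptions.ConnectTimeout:
    outcome = "connect timed out"
except requests.exceptions.ReadTimeout:
    outcome = "read timed out"
finally:
    silent.close()
    session.close()

print(outcome)
```

Both exceptions derive from `requests.exceptions.Timeout`, so a single `except requests.exceptions.Timeout` is enough when the distinction does not matter.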

2.11 Allowing redirects (allow_redirects)

ret = requests.get('http://127.0.0.1:8000/test/', allow_redirects=False)
print(ret.text)
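
The effect of `allow_redirects` is easiest to see against a server that redirects. The sketch below starts a throwaway local server (the `/old` and `/new` paths are invented for the example) and requests the same URL both ways:

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

import requests


class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/old":
            # Redirect /old to /new
            self.send_response(302)
            self.send_header("Location", "/new")
            self.send_header("Content-Length", "0")
            self.end_headers()
        else:
            body = b"arrived"
            self.send_response(200)
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the example output quiet


server = HTTPServer(("127.0.0.1", 0), Handler)
threading.Thread(target=server.serve_forever, daemon=True).start()
base = f"http://127.0.0.1:{server.server_port}"

session = requests.Session()
session.trust_env = False  # ignore proxy settings from the environment

followed = session.get(base + "/old", timeout=5)
not_followed = session.get(base + "/old", allow_redirects=False, timeout=5)

# Followed: final response plus the redirect chain in .history
print(followed.status_code, followed.text, [r.status_code for r in followed.history])
# Not followed: the 302 itself, with the target in the Location header
print(not_followed.status_code, not_followed.headers["Location"])

session.close()
server.shutdown()
server.server_close()
```

Turning redirects off is useful when a crawler needs the cookies or the `Location` header from the intermediate response, for example right after a login POST.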

2.12 Downloading large files (stream)

from contextlib import closing
with closing(requests.get('http://httpbin.org/get', stream=True)) as r1:
    # Process the response here, one chunk at a time
    for i in r1.iter_content():
        print(i)
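
Called without arguments, `iter_content()` uses a chunk size of 1 and yields one byte at a time, which is slow for real downloads. A sketch of writing a streamed response to disk in larger chunks follows; the local server and the 100,000-byte payload are stand-ins invented for the example:

```python
import os
import tempfile
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

import requests

PAYLOAD = b"x" * 100_000  # stand-in for a large file


class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Length", str(len(PAYLOAD)))
        self.end_headers()
        self.wfile.write(PAYLOAD)

    def log_message(self, *args):
        pass  # keep the example output quiet


server = HTTPServer(("127.0.0.1", 0), Handler)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = f"http://127.0.0.1:{server.server_port}/big.bin"

session = requests.Session()
session.trust_env = False  # ignore proxy settings from the environment

target = os.path.join(tempfile.mkdtemp(), "big.bin")
chunks = 0
with session.get(url, stream=True, timeout=5) as response:
    response.raise_for_status()
    with open(target, "wb") as f:
        # The body is read from the socket piece by piece, never all at once
        for chunk in response.iter_content(chunk_size=8192):
            f.write(chunk)
            chunks += 1

print(os.path.getsize(target), chunks)

session.close()
server.shutdown()
server.server_close()
```

With `stream=True` only the headers are fetched up front, so memory use stays bounded by the chunk size rather than by the size of the file.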

2.13 Certificates (cert)

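- Baidu, Tencent and similar public sites => no certificate needs to be supplied (the system handles it for you)
- Custom client certificate:
# a single file containing both the certificate and the private key
requests.get('https://127.0.0.1:8000/test/', cert="xxxx/xxx/xxx.pem")
# the certificate and the private key as separate files
requests.get('https://127.0.0.1:8000/test/', cert=("xxxx/xxx/xxx.pem","xxx.xxx.xx.key"))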

2.14 Verification (verify=False)

# Skip verification of the server's TLS certificate
requests.get('https://127.0.0.1:8000/test/', verify=False)

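Author: E-QUAL
Source: https://www.cnblogs.com/liujiajia_me/
Copyright of this article is shared by the author and cnblogs (博客园). Do not repost it; if you reference it without the author's consent, you must keep this notice and give a link to the original article in a prominent place on the page.

This article draws on the following online references and is for personal study. If it infringes your rights, please let me know and I will delete or revise it.

Reference links: https://www.cnblogs.com/linhaifeng/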

https://www.cnblogs.com/yuanchenqi/

https://www.cnblogs.com/Eva-J/

https://www.cnblogs.com/jin-xin/

https://www.cnblogs.com/liwenzhou/

https://www.cnblogs.com/wupeiqi/
