15. Web Scraping Preliminaries

1. urllib and requests

1.0. Virtual environment setup (switched to installing with Anaconda)

  • Setting up a virtual environment on Windows: https://blog.csdn.net/qq_33404767/article/details/86479820
  • Setting up a virtual environment on CentOS: https://jingyan.baidu.com/article/9080802216fee7fd91c80fe1.html

1.1. Real-world examples of crawlers

  • Search engines (Baidu, Google, 360)
  • Jobbole (伯乐在线)
  • The Huihui shopping assistant (a Chrome browser extension)
  • Data analysis and research (e.g. the Data Iceberg (数据冰山) Zhihu column)
  • Ticket-grabbing software, etc.

1.2. Anatomy of a URL

  • scheme: the protocol used for access, usually http, https, or ftp
  • host: the hostname
  • port: the port number
  • path: the path to the resource on the server
  • query-string: the query string
  • anchor: the anchor (fragment)
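
For example, the components can be pulled apart with urllib.parse (a minimal sketch; the URL below is made up purely for illustration):
from urllib import parse

result = parse.urlsplit("https://www.example.com:8080/path/to/page?wd=python#section1")
print(result.scheme)    # https
print(result.netloc)    # www.example.com:8080  (host and port)
print(result.path)      # /path/to/page
print(result.query)     # wd=python
print(result.fragment)  # section1  (the anchor)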

1.3. Common response status codes:

  • 200: the request succeeded and the server returned data normally
  • 301: permanent redirect
  • 302: temporary redirect
  • 400: bad request, the request sent to the server is malformed
  • 404: the requested URL was not found on the server
  • 403: the server refused access, insufficient permissions
  • 500: internal server error
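
These codes surface in urllib as follows (a quick sketch; httpbin.org/status/404 is just a convenient test endpoint that returns whatever status you ask for):
from urllib import request
from urllib import error

try:
    resp = request.urlopen("http://httpbin.org/status/404")
    print(resp.getcode())   # 2xx responses are returned normally
except error.HTTPError as e:
    print(e.code)           # 4xx/5xx responses raise HTTPError; prints 404 here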

1.4. The urllib library

  • urlopen
from urllib import request

resp = request.urlopen("http://www.baidu.com")
print(resp.read().decode())
  • urlretrieve: download files and images
from urllib import request
request.urlretrieve("http://www.baidu.com", "baidu.html")
  • Encoding and decoding parameters: urlencode and parse_qs
from urllib import request
from urllib import parse

params = {'name': '张三', 'age': 10, 'greet': 'hello world'}
result = parse.urlencode(params)
print(result)
origin = parse.parse_qs(result)
print(origin)
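# Expected output (roughly):
#   name=%E5%BC%A0%E4%B8%89&age=10&greet=hello+world
#   {'name': ['张三'], 'age': ['10'], 'greet': ['hello world']}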

url = 'http://www.baidu.com/s?'
qs = {'wd': '刘德华'}
url = url + parse.urlencode(qs)
resp = request.urlopen(url)
print(resp.read())

  • urlparse and urlsplit: extract the individual components of a URL
    • Note: urlparse extracts one more component than urlsplit, namely params
    • params refers to the part between the ';' and the '?'
from urllib import parse

url = 'http://www.baidu.com/s;one?wd=python&username=abc#1'
result1 = parse.urlparse(url)
result2 = parse.urlsplit(url)
print(result1)
print(result2)
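# Expected output (roughly):
#   ParseResult(scheme='http', netloc='www.baidu.com', path='/s', params='one', query='wd=python&username=abc', fragment='1')
#   SplitResult(scheme='http', netloc='www.baidu.com', path='/s;one', query='wd=python&username=abc', fragment='1')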

  • A crawler that sends request headers
from urllib import request

url = 'https://www.lagou.com/jobs/list_python?labelWords=$fromSearch=true&suginput='

headers = {
    "User-Agent":"Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36"
}

req = request.Request(url, headers=headers)
resp = request.urlopen(req)
print(resp.read())
  • ProxyHandler (proxy settings)
    • Commonly used proxy providers:
      • Xici proxy IPs: http://www.xicidaili.com/
      • Kuaidaili: http://www.kuaidaili.com/
      • Dailiyun: http://www.dailiyun.com/
from urllib import request

# Without a proxy
url = 'http://httpbin.org/ip'
resp = request.urlopen(url)
print(resp.read().decode())

# With a proxy
url = 'http://httpbin.org/ip'
# 1. Build a handler with ProxyHandler, passing in the proxy
handler = request.ProxyHandler({"http":"27.191.234.69:9999"})
# 2. Build an opener from the handler
opener = request.build_opener(handler)
# 3. Send the request through the opener
resp = opener.open(url)
print(resp.read().decode())
  • Cookies
  • Simulating login to Renren (人人网) and fetching the personal homepage with code
# Method 1: copy the cookie string from the browser and scrape with it directly
from urllib import request

# Cookie string copied directly from the browser
fan_url = "http://www.renren.com/446858319/profile"

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36",
    "Cookie": "anonymid=jqiro4mregwon5; _r01_=1; _ga=GA1.2.1323033723.1546650817; jebe_key=8616c544-6c0c-4851-9a1e-67e04ca822ed%7C6242683e14ff8ba7b426ec01935c7d64%7C1546650763280%7C1%7C1547516356327; _de=96BBC0985794124272FFEFECAA50653A696BF75400CE19CC; __utma=151146938.1323033723.1546650817.1546650817.1560070724.2; __utmz=151146938.1560070724.2.2.utmcsr=renren.com|utmccn=(referral)|utmcmd=referral|utmcct=/; depovince=BJ; jebecookies=fe87a6db-231c-4f6e-896f-cc441f935f98|||||; JSESSIONID=abc3YTsykrdomaTU-kFXw; ick_login=44c8b524-214d-44f0-9d5e-3f0a98de91f5; p=160ae9502e0b6a1e31f489c37440f8c99; first_login_flag=1; [email protected]; ln_hurl=http://hdn.xnimg.cn/photos/hdn121/20120307/2100/h_main_ApqK_5f420000a8982f75.jpg; t=4016e4713ca3cc89172852a277cdbe4d9; societyguester=4016e4713ca3cc89172852a277cdbe4d9; id=446858319; xnsid=a0d3b3b4; ver=7.0; loginfrom=null; wp_fold=0; jebe_key=8616c544-6c0c-4851-9a1e-67e04ca822ed%7Cd13250d8542a631144f02f62567eb379%7C1564964841392%7C1%7C1564964839344",
    "Referer": "http://www.renren.com/446858319/newsfeed/photo"
}

req = request.Request(fan_url, headers=headers)
resp = request.urlopen(req)

with open('renren.html', 'w', encoding="utf-8") as fp:
    fp.write(resp.read().decode("utf-8"))

# Method 2: log in first, then visit the personal homepage

from urllib import request
from urllib import parse
from http.cookiejar import CookieJar

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36",
    "Referer": "http://www.renren.com/446858319/newsfeed/photo"
}

data = {
    "email": "[email protected]",
    "password": "fanjianhaiabc123"
}

# Create a CookieJar object
cookiejar = CookieJar()
# Create an HTTPCookieProcessor from the cookiejar
handler = request.HTTPCookieProcessor(cookiejar)
# Build an opener from the handler
opener = request.build_opener(handler)

# Log in
login_url = "http://www.renren.com/PLogin.do"
req = request.Request(login_url, data=parse.urlencode(data).encode(), headers=headers)
# Opening the request through the cookiejar-backed opener stores the returned cookies in memory
opener.open(req)

fan_url = "http://www.renren.com/446858319/profile"
req = request.Request(fan_url, headers=headers)
# The personal homepage must be visited through the opener that carries the cookie information
resp = opener.open(req)

with open("renren.html", "w", encoding="utf-8") as fp:
    fp.write(resp.read().decode())
  • Loading and saving cookie information
from urllib import request
from http.cookiejar import MozillaCookieJar

cookiejar = MozillaCookieJar("cookie.txt")
# Load cookie information from the local file
# cookiejar.load(ignore_discard=True)
handler = request.HTTPCookieProcessor(cookiejar)
opener = request.build_opener(handler)

url = "http://httpbin.org/cookies/set?course=spider"
opener.open(url)
# Save to a local file; these session cookies would normally be discarded when the browser closes (here, when the script ends), so ignore_discard=True keeps them
cookiejar.save(ignore_discard=True)
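
On a later run the saved cookies can be loaded back before building the opener (a minimal sketch, assuming cookie.txt was produced by the save above):
from urllib import request
from http.cookiejar import MozillaCookieJar

cookiejar = MozillaCookieJar("cookie.txt")
# ignore_discard=True also loads session cookies that would otherwise be skipped
cookiejar.load(ignore_discard=True)
for cookie in cookiejar:
    print(cookie)
handler = request.HTTPCookieProcessor(cookiejar)
opener = request.build_opener(handler)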

1.5. The requests library

  • Chinese documentation: https://2.python-requests.org//zh_CN/latest/index.html
  • GitHub: https://github.com/psf/requests
  • requests: GET
import requests

params = {
    'wd':'中国'
}

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36",
}

response = requests.get("http://www.baidu.com/s", params=params, headers=headers)

with open("baidu.html", "w", encoding="utf-8") as fp:
    fp.write(response.content.decode("utf-8"))

print(response.url)
  • Using a proxy with requests
import requests

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36",
}

proxy = {
    # 'http':'163.204.242.225:8888'
}

response = requests.get("http://httpbin.org/ip", proxies=proxy,headers=headers)
print(response.content.decode())
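
requests sends POST data in the same style (a minimal sketch against httpbin.org/post, which simply echoes back what it receives):
import requests

data = {"name": "zhangsan", "age": 10}
response = requests.post("http://httpbin.org/post", data=data)
print(response.json())   # the echoed form data appears under the "form" key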

1.6. Scraping Lagou job listings with the requests module

  • Note: urllib is not used here; the requests module makes session handling simpler. The cookie does not need to be passed separately, since a session request carries it automatically (so the explicit cookies= argument below is optional).
  • Related link: https://blog.csdn.net/qq_40821402/article/details/88654259
import requests
import time

url1 = 'https://www.lagou.com/jobs/list_python?labelWords=$fromSearch=true&suginput='

url = 'https://www.lagou.com/jobs/positionAjax.json?needAddtionalResult=false'
# Request headers
headers = {
    "Accept": "application/json, text/javascript, */*; q=0.01",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36",
    "Referer": "https://www.lagou.com/jobs/list_python?labelWords=$fromSearch=true&suginput=",
    "Host": "www.lagou.com",
}
# Pagination is controlled through the data payload

for page in range(1, 10):
    data = {
        'first': 'false',
        'pn': page,
        'kd': 'python'
    }
    s = requests.Session()  # create a session
    response = s.get(url=url1, headers=headers, timeout=3)
    cookie = s.cookies  # grab the cookies set by the listing page
    respon = s.post(url=url, headers=headers, data=data, cookies=cookie, timeout=3)
    time.sleep(7)
    print(respon.text)
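    # Hedged sketch: extract job entries from the JSON response. The key path
    # content -> positionResult -> result follows the linked article and the response
    # format at the time of writing; it may change, so inspect respon.json() first.
    result = respon.json()
    for job in result.get("content", {}).get("positionResult", {}).get("result", []):
        print(job.get("positionName"), job.get("salary"))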

2. Mobile scraping

2.1. Installing and using Fiddler

https://blog.csdn.net/ychgyyn/article/details/82154433

2.2. Scraping Douyin videos

https://www.cnblogs.com/stevenshushu/p/9635097.html
