15. Web Scraping Preliminaries

1. urllib and requests

1.0. Virtual environment setup (switched to installing with Anaconda)

  • Setting up a virtual environment on Windows: https://blog.csdn.net/qq_33404767/article/details/86479820
  • Setting up a virtual environment on CentOS: https://jingyan.baidu.com/article/9080802216fee7fd91c80fe1.html

1.1. Real-world examples of crawlers

  • Search engines (Baidu, Google, 360)
  • Jobbole (伯乐在线)
  • The Huihui shopping assistant (a Chrome browser extension)
  • Data analysis and research (e.g. the Data Iceberg (数据冰山) Zhihu column)
  • Ticket-grabbing software, etc.

1.2. Anatomy of a URL

  • scheme: the protocol used for access, usually http, https, or ftp
  • host: the hostname
  • port: the port number
  • path: the path to the resource on the server
  • query-string: the query string
  • anchor: the anchor (fragment)
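
For example, the components can be pulled apart with urllib.parse (a minimal sketch; the URL below is made up purely for illustration):
from urllib import parse

result = parse.urlsplit("https://www.example.com:8080/path/to/page?wd=python#section1")
print(result.scheme)    # https
print(result.netloc)    # www.example.com:8080  (host and port)
print(result.path)      # /path/to/page
print(result.query)     # wd=python
print(result.fragment)  # section1  (the anchor)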

1.3. Common response status codes:

  • 200: the request succeeded and the server returned data normally
  • 301: permanent redirect
  • 302: temporary redirect
  • 400: bad request, the request sent to the server is malformed
  • 404: the requested URL was not found on the server
  • 403: the server refused access, insufficient permissions
  • 500: internal server error
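
These codes surface in urllib as follows (a quick sketch; httpbin.org/status/404 is just a convenient test endpoint that returns whatever status you ask for):
from urllib import request
from urllib import error

try:
    resp = request.urlopen("http://httpbin.org/status/404")
    print(resp.getcode())   # 2xx responses are returned normally
except error.HTTPError as e:
    print(e.code)           # 4xx/5xx responses raise HTTPError; prints 404 here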

1.4. The urllib library

  • urlopen
from urllib import request

resp = request.urlopen("http://www.baidu.com")
print(resp.read().decode())
  • urlretrieve: download files and images
from urllib import request
request.urlretrieve("http://www.baidu.com", "baidu.html")
  • Encoding and decoding parameters: urlencode and parse_qs
from urllib import request
from urllib import parse

params = {'name': '张三', 'age': 10, 'greet': 'hello world'}
result = parse.urlencode(params)
print(result)
origin = parse.parse_qs(result)
print(origin)
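# Expected output (roughly):
#   name=%E5%BC%A0%E4%B8%89&age=10&greet=hello+world
#   {'name': ['张三'], 'age': ['10'], 'greet': ['hello world']}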

url = 'http://www.baidu.com/s?'
qs = {'wd': '刘德华'}
url = url + parse.urlencode(qs)
resp = request.urlopen(url)
print(resp.read())

  • urlparse and urlsplit: extract the individual components of a URL
    • Note: urlparse extracts one more component than urlsplit, namely params
    • params refers to the part between the ';' and the '?'
from urllib import parse

url = 'http://www.baidu.com/s;one?wd=python&username=abc#1'
result1 = parse.urlparse(url)
result2 = parse.urlsplit(url)
print(result1)
print(result2)
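# Expected output (roughly):
#   ParseResult(scheme='http', netloc='www.baidu.com', path='/s', params='one', query='wd=python&username=abc', fragment='1')
#   SplitResult(scheme='http', netloc='www.baidu.com', path='/s;one', query='wd=python&username=abc', fragment='1')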

  • A crawler that sends request headers
from urllib import request

url = 'https://www.lagou.com/jobs/list_python?labelWords=$fromSearch=true&suginput='

headers = {
    "User-Agent":"Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36"
}

req = request.Request(url, headers=headers)
resp = request.urlopen(req)
print(resp.read())
  • ProxyHandler (proxy settings)
    • Commonly used proxy providers:
      • Xici proxy IPs: http://www.xicidaili.com/
      • Kuaidaili: http://www.kuaidaili.com/
      • Dailiyun: http://www.dailiyun.com/
from urllib import request

# Without a proxy
url = 'http://httpbin.org/ip'
resp = request.urlopen(url)
print(resp.read().decode())

# With a proxy
url = 'http://httpbin.org/ip'
# 1. Build a handler with ProxyHandler, passing in the proxy
handler = request.ProxyHandler({"http":"27.191.234.69:9999"})
# 2. Build an opener from the handler
opener = request.build_opener(handler)
# 3. Send the request through the opener
resp = opener.open(url)
print(resp.read().decode())
  • Cookies
  • Simulating login to Renren (人人网) and fetching the personal homepage with code
# Method 1: copy the cookie string from the browser and scrape with it directly
from urllib import request

# Cookie string copied directly from the browser
fan_url = "http://www.renren.com/446858319/profile"

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36",
    "Cookie": "anonymid=jqiro4mregwon5; _r01_=1; _ga=GA1.2.1323033723.1546650817; jebe_key=8616c544-6c0c-4851-9a1e-67e04ca822ed%7C6242683e14ff8ba7b426ec01935c7d64%7C1546650763280%7C1%7C1547516356327; _de=96BBC0985794124272FFEFECAA50653A696BF75400CE19CC; __utma=151146938.1323033723.1546650817.1546650817.1560070724.2; __utmz=151146938.1560070724.2.2.utmcsr=renren.com|utmccn=(referral)|utmcmd=referral|utmcct=/; depovince=BJ; jebecookies=fe87a6db-231c-4f6e-896f-cc441f935f98|||||; JSESSIONID=abc3YTsykrdomaTU-kFXw; ick_login=44c8b524-214d-44f0-9d5e-3f0a98de91f5; p=160ae9502e0b6a1e31f489c37440f8c99; first_login_flag=1; [email protected]; ln_hurl=http://hdn.xnimg.cn/photos/hdn121/20120307/2100/h_main_ApqK_5f420000a8982f75.jpg; t=4016e4713ca3cc89172852a277cdbe4d9; societyguester=4016e4713ca3cc89172852a277cdbe4d9; id=446858319; xnsid=a0d3b3b4; ver=7.0; loginfrom=null; wp_fold=0; jebe_key=8616c544-6c0c-4851-9a1e-67e04ca822ed%7Cd13250d8542a631144f02f62567eb379%7C1564964841392%7C1%7C1564964839344",
    "Referer": "http://www.renren.com/446858319/newsfeed/photo"
}

req = request.Request(fan_url, headers=headers)
resp = request.urlopen(req)

with open('renren.html', 'w', encoding="utf-8") as fp:
    fp.write(resp.read().decode("utf-8"))

# Method 2: log in first, then visit the personal homepage

from urllib import request
from urllib import parse
from http.cookiejar import CookieJar

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36",
    "Referer": "http://www.renren.com/446858319/newsfeed/photo"
}

data = {
    "email": "[email protected]",
    "password": "fanjianhaiabc123"
}

# Create a CookieJar object
cookiejar = CookieJar()
# Create an HTTPCookieProcessor from the cookiejar
handler = request.HTTPCookieProcessor(cookiejar)
# Build an opener from the handler
opener = request.build_opener(handler)

# Log in
login_url = "http://www.renren.com/PLogin.do"
req = request.Request(login_url, data=parse.urlencode(data).encode(), headers=headers)
# Opening the request through the cookiejar-backed opener stores the returned cookies in memory
opener.open(req)

fan_url = "http://www.renren.com/446858319/profile"
req = request.Request(fan_url, headers=headers)
# The personal homepage must be visited through the opener that carries the cookie information
resp = opener.open(req)

with open("renren.html", "w", encoding="utf-8") as fp:
    fp.write(resp.read().decode())
  • Loading and saving cookie information
from urllib import request
from http.cookiejar import MozillaCookieJar

cookiejar = MozillaCookieJar("cookie.txt")
# Load cookie information from the local file
# cookiejar.load(ignore_discard=True)
handler = request.HTTPCookieProcessor(cookiejar)
opener = request.build_opener(handler)

url = "http://httpbin.org/cookies/set?course=spider"
opener.open(url)
# Save to a local file; these session cookies would normally be discarded when the browser closes (here, when the script ends), so ignore_discard=True keeps them
cookiejar.save(ignore_discard=True)
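
On a later run the saved cookies can be loaded back before building the opener (a minimal sketch, assuming cookie.txt was produced by the save above):
from urllib import request
from http.cookiejar import MozillaCookieJar

cookiejar = MozillaCookieJar("cookie.txt")
# ignore_discard=True also loads session cookies that would otherwise be skipped
cookiejar.load(ignore_discard=True)
for cookie in cookiejar:
    print(cookie)
handler = request.HTTPCookieProcessor(cookiejar)
opener = request.build_opener(handler)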

1.5. The requests library

  • Chinese documentation: https://2.python-requests.org//zh_CN/latest/index.html
  • GitHub: https://github.com/psf/requests
  • requests: GET
import requests

params = {
    'wd':'中国'
}

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36",
}

response = requests.get("http://www.baidu.com/s", params=params, headers=headers)

with open("baidu.html", "w", encoding="utf-8") as fp:
    fp.write(response.content.decode("utf-8"))

print(response.url)
  • Using a proxy with requests
import requests

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36",
}

proxy = {
    # 'http':'163.204.242.225:8888'
}

response = requests.get("http://httpbin.org/ip", proxies=proxy,headers=headers)
print(response.content.decode())
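
requests sends POST data in the same style (a minimal sketch against httpbin.org/post, which simply echoes back what it receives):
import requests

data = {"name": "zhangsan", "age": 10}
response = requests.post("http://httpbin.org/post", data=data)
print(response.json())   # the echoed form data appears under the "form" key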

1.6. Scraping Lagou job listings with the requests module

  • Note: urllib is not used here; the requests module makes session handling simpler. The cookie does not need to be passed separately, since a session request carries it automatically (so the explicit cookies= argument below is optional).
  • Related link: https://blog.csdn.net/qq_40821402/article/details/88654259
import requests
import time

url1 = 'https://www.lagou.com/jobs/list_python?labelWords=$fromSearch=true&suginput='

url = 'https://www.lagou.com/jobs/positionAjax.json?needAddtionalResult=false'
# Request headers
headers = {
    "Accept": "application/json, text/javascript, */*; q=0.01",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36",
    "Referer": "https://www.lagou.com/jobs/list_python?labelWords=$fromSearch=true&suginput=",
    "Host": "www.lagou.com",
}
# Pagination is controlled through the data payload

for page in range(1, 10):
    data = {
        'first': 'false',
        'pn': page,
        'kd': 'python'
    }
    s = requests.Session()  # create a session
    response = s.get(url=url1, headers=headers, timeout=3)
    cookie = s.cookies  # grab the cookies set by the listing page
    respon = s.post(url=url, headers=headers, data=data, cookies=cookie, timeout=3)
    time.sleep(7)
    print(respon.text)
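    # Hedged sketch: extract job entries from the JSON response. The key path
    # content -> positionResult -> result follows the linked article and the response
    # format at the time of writing; it may change, so inspect respon.json() first.
    result = respon.json()
    for job in result.get("content", {}).get("positionResult", {}).get("result", []):
        print(job.get("positionName"), job.get("salary"))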

2. Mobile scraping

2.1. Installing and using Fiddler

https://blog.csdn.net/ychgyyn/article/details/82154433

2.2. Scraping Douyin videos

https://www.cnblogs.com/stevenshushu/p/9635097.html
