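Web Crawlers and Tornado

1 Introduction to Crawlers

1.1 Crawler Frameworks

Performance:

    Concurrency options: asynchronous I/O (gevent / Twisted / asyncio / aiohttp), a custom asynchronous I/O module, and I/O multiplexing with select.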
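As a rough illustration of the asynchronous I/O option, the sketch below runs several simulated downloads concurrently with asyncio. The fake_fetch coroutine, the URLs, and the delay are hypothetical stand-ins (asyncio.sleep replaces real network I/O), so it shows only the concurrency pattern, not a real crawler.

```python
import asyncio


async def fake_fetch(url, delay):
    """Hypothetical stand-in for a download: waits `delay` seconds without blocking."""
    await asyncio.sleep(delay)  # the event loop can run other tasks during this wait
    return f"fetched {url}"


async def crawl(urls, delay=0.1):
    # Schedule all fetches at once; gather returns results in input order.
    return await asyncio.gather(*(fake_fetch(u, delay) for u in urls))


# Placeholder URLs, never actually contacted.
urls = ["http://example.com/a", "http://example.com/b", "http://example.com/c"]
results = asyncio.run(crawl(urls))
print(results)
```

Because the three waits overlap, the total time is close to one delay rather than three. A real crawler would replace fake_fetch with an asynchronous HTTP client such as aiohttp.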

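The Scrapy framework

    Introduces asynchronous I/O with Twisted. Scrapy is built on Twisted; a custom crawler framework can then be defined by following the Scrapy source code.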

1.2 The Tornado Framework (asynchronous, non-blocking)

Tornado basic usage

Source code analysis

A custom asynchronous non-blocking framework
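A custom non-blocking framework is usually built around an I/O multiplexing loop of the kind listed in section 1.1. The following is a minimal sketch of that core idea using the standard-library selectors module. The function name collect_messages and the use of local socket pairs in place of real network connections are assumptions made for illustration; this is not how Tornado itself is implemented.

```python
import selectors
import socket


def collect_messages(messages):
    """Send each payload over its own socket pair and read them all back
    from a single selector loop (one thread watching many sockets)."""
    sel = selectors.DefaultSelector()
    senders = []
    for name, payload in messages.items():
        sender, receiver = socket.socketpair()
        receiver.setblocking(False)  # recv must never block the loop
        sel.register(receiver, selectors.EVENT_READ, data=name)
        sender.sendall(payload)
        senders.append(sender)

    results = {}
    while len(results) < len(messages):
        # select() returns only the sockets that are ready to read.
        for key, _events in sel.select(timeout=1):
            results[key.data] = key.fileobj.recv(1024)
            sel.unregister(key.fileobj)
            key.fileobj.close()

    for sender in senders:
        sender.close()
    sel.close()
    return results


results = collect_messages({"a": b"hello", "b": b"world"})
print(results)
```

The loop never waits on one particular socket; it asks the operating system which sockets are ready and handles only those.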

2 Basic Crawler Operations

2.1 The requests module: a basic template

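import requests
from bs4 import BeautifulSoup

url = "http://www.example.com"  # placeholder address; replace with the page to crawl
response = requests.get(url)
soup = BeautifulSoup(response.text, features="html.parser")
target = soup.find(id="xxx")  # first element whose id is "xxx"
print(target)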
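To try the parsing half of this template without network access, the sketch below feeds BeautifulSoup an inline HTML string in place of the downloaded page text. The HTML content and the id values are made up for illustration.

```python
from bs4 import BeautifulSoup

# Inline HTML stands in for the text of a downloaded page (hypothetical content).
html = (
    '<html><body>'
    '<div id="xxx">target content</div>'
    '<div id="other">ignored</div>'
    '</body></html>'
)

soup = BeautifulSoup(html, features="html.parser")
target = soup.find(id="xxx")  # first element whose id attribute is "xxx"
print(target)
print(target.get_text())
```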

* See the crawler framework course notes (Word file), "Web Crawlers and Information Retrieval".

2.2 requests in Detail

2.2.1 Getting past site firewalls

Set the User-Agent in headers by hand, for example:

'User-Agent': "Mozilla/5.0"
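A minimal sketch of setting this header with requests follows. The request is built with requests.Request(...).prepare() rather than sent, so the header can be inspected offline; the URL is a placeholder.

```python
import requests

headers = {"User-Agent": "Mozilla/5.0"}

# Build the request without sending it, so the result can be checked offline.
prepared = requests.Request("GET", "http://example.com/", headers=headers).prepare()
print(prepared.headers["User-Agent"])

# To actually send it (not run here):
#   response = requests.get("http://example.com/", headers=headers)
```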

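2.2.2 Additional options

- requests.post('url') / requests.get('url') accept the following parameters

url: the address the request is sent to

data: data passed in the request body, form-encoded as individual key-value pairs; it can be a dictionary, a list of tuples, bytes, a string, or a file-like object. A flat dictionary of key-value pairs can be sent with either json or data.

json: data passed in the request body; the whole object is serialized into a single JSON string and sent in one piece. A dictionary nested inside another dictionary has to be sent with json, because form encoding does not preserve the nesting.

params: parameters passed in the URL query string; given as a dictionary, so the number of parameters can vary

cookies: identify the user and support session tracking

headers: request headers; modify them to crawl sites that sit behind a firewall

files: used for file uploads

auth: adds the username and password to the headers (with HTTP Basic auth they are Base64-encoded, which is an encoding, not encryption)

timeout: timeout for the request and for the response

allow_redirects: whether to follow redirects, that is, whether the crawl may end up at an address other than the one requested

proxies: proxy servers

verify: whether to verify the server's certificate; False skips the check

cert: client certificate file

requests.Session(): keeps client-side state from earlier requests, such as cookies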
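The difference between data, json, and params is easiest to see by looking at the request that requests would send. The sketch below prepares three requests without sending them; the URL and field names are placeholders chosen for illustration.

```python
import json

import requests

url = "http://example.com/api"  # placeholder address, never contacted

# data: form-encoded key-value pairs in the body
form_req = requests.Request("POST", url, data={"user": "alice"}).prepare()

# json: the whole (possibly nested) object serialized as one JSON document
json_req = requests.Request("POST", url, json={"user": {"name": "alice"}}).prepare()

# params: appended to the URL as a query string
query_req = requests.Request("GET", url, params={"page": 1}).prepare()

print(form_req.headers["Content-Type"], form_req.body)
print(json_req.headers["Content-Type"], json.loads(json_req.body))
print(query_req.url)
```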

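- response = requests.post(): the returned object

response = requests.get('url'): requests.get returns the same kind of object

response.text: the body decoded as text

response.content: the body as raw bytes (not limited to text)

response.encoding: the encoding used to decode response.text

response.apparent_encoding: the encoding detected from the content; assign it to response.encoding to fix garbled text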
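Garbled text appears when the body bytes are decoded with the wrong encoding. The sketch below reproduces this with plain bytes standing in for response.content; the sample string and the choice of GBK are assumptions for illustration. ISO-8859-1 is used as the wrong guess because requests falls back to it for text responses whose headers name no charset.

```python
# Stand-in for response.content: the bytes a GBK-encoded page would return.
content = "网络爬虫".encode("gbk")

garbled = content.decode("iso-8859-1")  # wrong encoding: unreadable characters
readable = content.decode("gbk")        # right encoding: original text

print(repr(garbled))
print(readable)

# With a real response, the usual fix is (not run here):
#   response.encoding = response.apparent_encoding
#   print(response.text)
```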

- Request headers / request body

Referer: stores the address of the page the request came from

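- How data is submitted during interaction

1. Sent directly as a message: in the browser's Inspect > Network panel, the page does not refresh;

2. Submitted as a form: in the browser's Inspect > Network panel, the page refreshes after the data is uploaded.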

2.3 Features Provided by BeautifulSoup

Notes on the code below are written in the single-line comments.

from bs4 import BeautifulSoup

# Sample HTML document
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
    <div><a href='http://www.cunzhang.com'>剥夺老师<p>asdf</p></a></div>
    <a id='i1'>刘志超</a>
    <div>
        <p>asdf</p>
    </div>
    <p>asdfffffffffff</p>
</body>
</html>
"""

soup = BeautifulSoup(html_doc, features="html.parser")

tag = soup.find('a')
v = tag.unwrap()
print(soup)

from bs4.element import Tag
obj1 = Tag(name='div', attrs={'id': 'it'})
obj1.string = '我是一个新来的'

tag = soup.find('a')
v = tag.wrap(obj1)  # wrap the <a> tag inside obj1, the new <div>
print(soup)


tag = soup.find('body')  # the next line moves the first <a> to the end of <body>
tag.append(soup.find('a'))
print(soup)

from bs4.element import Tag
obj = Tag(name='i', attrs={'id': 'it'})
obj.string = '我是一个新来的'
tag = soup.find('body')
# tag.insert_before(obj)
tag.insert_after(obj)
print(soup)


tag = soup.find('p', recursive=True)  # search all descendants (the default)
print(tag)
tag = soup.find('body').find('p', recursive=False)  # search direct children of <body> only
print(tag)

tag = soup.find('a')
v = tag.get_text()
print(v)

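# Attribute operations
tag = soup.find('a')
tag.attrs['lover'] = '物理老师'
tag.attrs.pop('href', None)  # remove href if present; del would raise KeyError once the first <a> has been unwrapped above
print(soup)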

# children: direct children only
# (both tags and text nodes)
from bs4.element import Tag
tags = soup.find('body').children
for tag in tags:
    if isinstance(tag, Tag):
        print(tag, type(tag))
    else:
        print('text node ...')

tags = soup.find('body').descendants
print(list(tags))

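tag = soup.find('body')
# Convert the tag's contents to bytes
print(tag.encode_contents())    # encodes to bytes (UTF-8 by default); Chinese characters show as escaped bytes and the content of <body>...</body> prints on a single line
# Convert the tag's contents to a string
print(tag.decode_contents())    # returns str; Chinese characters are readable and the content of <body>...</body> prints across multiple lines
# print(str(tag))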
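Because the snippets above all modify the same soup object, each result depends on what ran before it. The sketch below repeats three of the operations (unwrap, wrap, and recursive=False), each on a small fresh document so the effect can be seen in isolation. The HTML fragments are made up for illustration, and soup.new_tag is used here in place of constructing Tag directly.

```python
from bs4 import BeautifulSoup

# unwrap(): remove the tag itself but keep its children in place
soup = BeautifulSoup("<div><a href='x'>link<p>text</p></a></div>", "html.parser")
soup.find("a").unwrap()
unwrapped = str(soup)

# wrap(): enclose the tag in a new parent element
soup = BeautifulSoup("<p>hello</p>", "html.parser")
soup.find("p").wrap(soup.new_tag("div"))
wrapped = str(soup)

# recursive=False: look at direct children only
soup = BeautifulSoup("<body><div><p>inner</p></div><p>outer</p></body>", "html.parser")
first_any = soup.find("p")                                   # any depth
first_child = soup.find("body").find("p", recursive=False)   # direct child of <body>

print(unwrapped)
print(wrapped)
print(first_any, first_child)
```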
