Spider_权威指南_ch01

import requests
from bs4 import BeautifulSoup


# 1: TypeError: object of type 'Response' has no len()
html = requests.get('https://www.pythonscraping.com/pages/page1.html')
bs = BeautifulSoup(html, 'html.parser')  # raises the TypeError: a Response object cannot be parsed by BeautifulSoup directly; pass html.text or html.content instead
print(bs.h1)  # never reached
# 2: The difference between .content and .text on a requests Response:

# Both requests.get() and requests.post() return a Response object, which holds everything the
# server sent back, including the response headers and the status code.
# The body of the page is exposed through two attributes, .content and .text.
# The difference: .content holds the raw bytes, while .text holds a string produced by decoding
# .content with the encoding that requests guesses from the response.
# Printing .content shows a leading b'...' marker, the sign of a bytes object; .text has no such
# prefix. For pure ASCII the two look identical, but any other text only displays correctly with
# the right encoding. In most cases .text is the convenient choice because it already renders the
# characters, but it can come out garbled when the guessed encoding is wrong; then decode manually
# with .content.decode('utf-8'). For Chinese pages the common encodings are utf-8, GBK, and
# GB2312, so the encoding can be picked by hand this way.
# In short: .text is a ready-made string, while .content still needs decoding; but .text is not
# always displayed correctly, and that is when decoding .content manually is needed.
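
# A minimal sketch of the difference, using the sample page from this chapter (the decode call
# assumes the page is UTF-8, which holds for this site):
import requests

resp = requests.get('http://www.pythonscraping.com/pages/page1.html')
print(type(resp.content))  # <class 'bytes'> -- raw bytes, prints with a b'...' prefix
print(type(resp.text))     # <class 'str'>  -- decoded with the encoding requests guessed
print(resp.encoding)       # the encoding requests guessed from the response
html = resp.content.decode('utf-8')  # manual decode, for when .text comes out garbled
print(html[:30])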
# 3: The four parsers BeautifulSoup supports:

# html.parser   lxml   xml   html5lib
# For the differences and usage see https://blog.csdn.net/huang1600301017/article/details/83474288
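
# A quick sketch of selecting each parser; lxml and html5lib are third-party packages
# (pip install lxml html5lib), and the 'xml' parser is provided by lxml as well:
from bs4 import BeautifulSoup

doc = '<html><body><h1>Hello</h1></body></html>'
soup1 = BeautifulSoup(doc, 'html.parser')  # built into the standard library, no extra install
soup2 = BeautifulSoup(doc, 'lxml')         # fast and lenient, needs the lxml package
soup3 = BeautifulSoup(doc, 'xml')          # lxml's XML mode, for XML rather than HTML documents
soup4 = BeautifulSoup(doc, 'html5lib')     # slowest, but parses exactly like a browser
print(soup1.h1.get_text())  # 'Hello' with any of the four parsers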
# 4: Exceptions in the requests library and how to handle them:
from requests import exceptions

# exceptions.ConnectTimeout      timed out while connecting to the remote server
# exceptions.ConnectionError     network connection error, e.g. DNS failure, refused connection, unknown host
# exceptions.ProxyError          proxy error
# exceptions.ReadTimeout         timed out while reading the response
# exceptions.HTTPError           HTTP error
# exceptions.URLRequired         a valid URL is required but missing
# exceptions.TooManyRedirects    the maximum number of redirects was exceeded
# exceptions.Timeout             the request timed out (parent class of ConnectTimeout and ReadTimeout)
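
# The class hierarchy matters when ordering except clauses, because a handler for a parent class
# also catches its subclasses. A small check of how these classes relate:
from requests import exceptions

print(issubclass(exceptions.ConnectTimeout, exceptions.ConnectionError))  # True
print(issubclass(exceptions.ConnectTimeout, exceptions.Timeout))          # True
print(issubclass(exceptions.ReadTimeout, exceptions.Timeout))             # True
print(issubclass(exceptions.ReadTimeout, exceptions.ConnectionError))     # False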

Explanations of a few common exceptions:

1--Timeout exception: requests.exceptions.ConnectTimeout

1). Connect timeout -- the server does not answer within the given time; raises requests.exceptions.ConnectTimeout
requests.get('http://github.com', timeout=0.001)


2). Separate connect and read timeouts -- when the timeout is given as a tuple, exceeding the first
value raises requests.exceptions.ConnectTimeout, while exceeding the second raises
requests.exceptions.ReadTimeout
- timeout=([connect timeout], [read timeout])
- connect: the client establishing the connection to the server and sending the HTTP request
- read: the time the client waits before the server sends the first byte of the response
requests.get('http://github.com', timeout=(6.05, 0.01))

3). The proxy server does not respond -- raises requests.exceptions.ConnectTimeout
requests.get('http://github.com', timeout=(6.05, 27.05), proxies={"http": "10.200.123.123:800"})
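
A sketch that makes the two timeout cases visible; the tiny timeout values are chosen just to
force each failure against the same github.com URL:

import requests
from requests import exceptions

try:
    requests.get('http://github.com', timeout=(0.001, 10))  # the connect phase times out
except exceptions.ConnectTimeout:
    print('connect timed out')

try:
    requests.get('http://github.com', timeout=(10, 0.001))  # the read phase times out
except exceptions.ReadTimeout:
    print('read timed out')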


2--Connection exception: requests.exceptions.ConnectionError

1). Unknown host -- raises requests.exceptions.ConnectionError
requests.get('http://github.comasf', timeout=(6.05, 27.05))

2). Possibly caused by a dropped network connection -- raises requests.exceptions.ConnectionError
requests.get('http://github.com', timeout=(6.05, 27.05))



3--The proxy server refuses to establish the connection (the port refuses or is not open) -- raises requests.exceptions.ProxyError
requests.get('http://github.com', timeout=(6.05, 27.05), proxies={"http": "192.168.10.1:800"})


4--Proxy read timeout
The connection to the proxy succeeds and the proxy forwards the request to the target site, but
the proxy times out while reading the target site's response. Even if the proxy itself is fast,
a target site that times out still shows up as a failure of the proxied request. Assuming the
proxy is usable, timeout covers the connect and read phases of the request sent to the proxy;
whether the proxy's own connection and read against the target succeed is not tracked separately.
requests.get('http://github.com', timeout=(2, 0.01), proxies={"http": "192.168.10.1:800"})



# Worked examples: https://blog.csdn.net/weixin_39198406/article/details/81482082

# Also note: requests.get() returns a Response object whenever the request completes, even when
# the status code signals an error (e.g. 404); only network-level failures raise the exceptions above.
# Example 1 -- using requests:
import requests

html = requests.get('http://pythonscraping.com/pages/page1.html')
print(html)  #  <Response [200]>
# print(html.content)  # byte string, prints with a b'...' prefix
print(html.text) 
<Response [200]>
<html>
<head>
<title>A Useful Page</title>
</head>
<body>
<h1>An Interesting Title</h1>
<div>
Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
</div>
</body>
</html>
# Example 2 -- using BeautifulSoup:

import requests
from bs4 import BeautifulSoup

html = requests.get('http://www.pythonscraping.com/pages/page1.html')
# bs = BeautifulSoup(html.content, 'html.parser')  # the byte string also works; BeautifulSoup sniffs the encoding itself
bs = BeautifulSoup(html.text, 'html.parser')
print(bs.h1)
<h1>An Interesting Title</h1>
# Example 3 -- exception handling:

import requests
from requests import exceptions

try:
#     html = requests.get("https://pythonscrapingthisurldoesnotexist.com")
    html = requests.get('http://www.pythonscraping.com/pages/page1.html', timeout=0.001)
except exceptions.ConnectionError as e:
    # ConnectTimeout subclasses ConnectionError, so this clause also catches the timeout raised
    # above, and the next clause is never reached -- hence the (misleading) output below
    print("The server returned an HTTP error")
except exceptions.ConnectTimeout as e:
    print("The server could not be found!")
else:
    print(html.text)
The server returned an HTTP error
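A sketch of the same logic with the handlers ordered most-specific-first, and with messages that
match what actually happens (same URLs as above):

import requests
from requests import exceptions

try:
    html = requests.get('http://www.pythonscraping.com/pages/page1.html', timeout=0.001)
except exceptions.ConnectTimeout:
    print("The request timed out")            # caught here, before the parent class below
except exceptions.ConnectionError:
    print("The server could not be reached")  # DNS failure, refused connection, ...
else:
    print(html.text)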
import requests
from requests import exceptions
from bs4 import BeautifulSoup



def getTitle(url, timeout):
    try:
        html = requests.get(url, timeout=timeout)
    except exceptions.ConnectionError as e:  # also catches ConnectTimeout, its subclass
        return None
    try:
        bsObj = BeautifulSoup(html.text, "lxml")
        title = bsObj.body.h1
    except AttributeError as e:  # raised when bsObj.body is None; parsing cannot raise ConnectTimeout
        return None
    return title


# title = getTitle("http://www.pythonscraping.com/pages/page1.html",0.001)
title = getTitle("https://pythonscrapingthisurldoesnotexist.com",(5,0.01))
if title is None:
    print("Title could not be found")
else:
    print(title)
Title could not be found
import requests
from requests import exceptions
from bs4 import BeautifulSoup


def getTitle(url):
    try:
        html = requests.get(url)
        print(type(html))  # <class 'requests.models.Response'>
    except exceptions.HTTPError as e:
        # note the spelling: HTTPError, not HttpError; also, a plain get() never raises it --
        # a 404 still comes back as a normal Response, which is why the parse below runs
        return None

    bsObj = BeautifulSoup(html.text, "lxml")
    title = bsObj.body.h1
    return title


title = getTitle("http://www.pythonscraping.com/pages/page1000.html")
print(type(title))  # <class 'bs4.element.Tag'> -- an object built by bs4, not the object requests returned
if title is None:
    print("Title could not be found")
else:
    print(title)
<class 'requests.models.Response'>
<class 'bs4.element.Tag'>
<h1 class="title" id="page-title">
                  Page not found                </h1>
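
The 404 page above is parsed like any other because requests never raised: the except branch is
dead code without raise_for_status(). A sketch of making the HTTPError branch actually fire, with
the same URL:

import requests
from requests import exceptions

try:
    html = requests.get('http://www.pythonscraping.com/pages/page1000.html')
    html.raise_for_status()  # turns a 4xx/5xx status code into requests.exceptions.HTTPError
except exceptions.HTTPError as e:
    print('HTTP error:', e)  # e.g. "404 Client Error: ... for url: ..."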

Reposted from www.cnblogs.com/Collin-pxy/p/13176807.html