Python web crawlers and information extraction (Part 1)

     Prerequisite: this part assumes some basic knowledge of Python. If you have not yet learned basic Python syntax, you can first browse the summary of basic Python grammar notes.

1. Web crawler rules

1. Getting started with the Requests library

Method Description
requests.request() Constructs a request; the base method that underlies all of the methods below
requests.get() The main method for fetching an HTML page, corresponding to HTTP GET
requests.head() Fetches the header information of an HTML page, corresponding to HTTP HEAD
requests.post() Submits a POST request to an HTML page, corresponding to HTTP POST
requests.put() Submits a PUT request to an HTML page, corresponding to HTTP PUT
requests.patch() Submits a partial-modification request to an HTML page, corresponding to HTTP PATCH
requests.delete() Submits a delete request to an HTML page, corresponding to HTTP DELETE
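
requests.head() only retrieves the response headers, which is often enough to check whether a page exists, while requests.get() retrieves the whole page. A minimal sketch of both, using the Baidu homepage that appears throughout these notes:

import requests

r = requests.head("http://www.baidu.com")   # fetch only the response headers
print(r.headers)                            # a dictionary-like object of header fields
print(r.text)                               # empty: HEAD returns no body

r = requests.get("http://www.baidu.com")    # fetch the whole page
print(r.status_code)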

1.1 The get() method

r = requests.get(url)	# construct a Request object asking the server for the resource; returns a Response object r

requests.get(url, params=None, **kwargs)

url : the URL of the page to fetch

params : extra parameters appended to the URL, as a dictionary or byte stream; optional

**kwargs : 12 optional keyword arguments that control access

Attribute Description
r.status_code The HTTP status code of the response; 200 means success, 404 (or another code) means failure
r.text The response body as a string, i.e. the page content at the URL
r.encoding The response encoding guessed from the HTTP headers
r.apparent_encoding The encoding inferred from the response content itself (a fallback encoding)
r.content The response body in binary form

Note on r.encoding: if the headers contain no charset, the encoding is assumed to be ISO-8859-1, which cannot decode Chinese text. When r.encoding does not decode the page correctly, use r.apparent_encoding: assign it to r.encoding and the page can then be decoded properly.

# Crawling example
import requests
r = requests.get("http://www.baidu.com")
print(r.status_code)        # status code 200
print(r.encoding)           # ISO-8859-1
# print(r.text)             # cannot be decoded correctly yet
print(r.apparent_encoding)  # utf-8
r.encoding = 'utf-8'        # switch to the correct encoding
print(r.text)               # the page is now fetched and decoded successfully

1.2 Requests library exceptions

Exception Description
requests.ConnectionError Network connection errors, such as a DNS lookup failure or a refused connection
requests.HTTPError HTTP error exception
requests.URLRequired Missing-URL exception
requests.TooManyRedirects The maximum number of redirects was exceeded
requests.ConnectTimeout Connecting to the remote server timed out
requests.Timeout The request timed out (from sending the request URL to receiving the response)
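
The general framework in 1.3 below simply catches every exception. As an illustrative sketch (the fetch() helper and its messages are not part of the original notes), the classes above can also be caught individually:

import requests

def fetch(url):
    """Fetch a page and report which kind of failure occurred, if any."""
    try:
        r = requests.get(url, timeout=10)
        r.raise_for_status()              # raises requests.HTTPError for 4xx/5xx status codes
        r.encoding = r.apparent_encoding
        return r.text
    except requests.Timeout:
        print("the request timed out")
    except requests.ConnectionError:
        print("network problem (DNS failure, refused connection, ...)")
    except requests.HTTPError as e:
        print("bad HTTP status:", e)
    return ""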

1.3 A general code framework for crawling web pages

import requests

def getHTMLText(url):
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()    # raise HTTPError if the status code is not 200
        r.encoding = r.apparent_encoding
        return r.text
    except:
        return "An exception occurred"

if __name__ == "__main__":
    url = "http://www.baidu.com"    # the page to crawl
    print(getHTMLText(url))

1.4 The HTTP protocol

HTTP is the Hypertext Transfer Protocol.

HTTP is a stateless application-layer protocol based on a request/response model.

Operations the HTTP protocol performs on resources:

Method Description
GET Requests the resource at the URL
HEAD Requests the response headers of the resource at the URL, i.e. the resource's header information
POST Appends new data to the resource at the URL
PUT Stores a resource at the URL, overwriting the resource originally there
PATCH Partially updates the resource at the URL, i.e. changes part of that resource's content
DELETE Deletes the resource stored at the URL
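
The POST and PUT semantics map onto the data and json keyword arguments introduced in 1.5 below. A small sketch, assuming the public echo service httpbin.org is reachable (it is not part of the original notes and simply echoes back what it receives):

import requests

payload = {"key1": "value1", "key2": "value2"}

r = requests.post("https://httpbin.org/post", data=payload)   # sent as a form-encoded body
print(r.json()["form"])                                        # httpbin echoes the form fields back

r = requests.put("https://httpbin.org/put", json=payload)     # sent as a JSON body
print(r.json()["json"])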

1.5 The main method of the Requests library

requests.request(method, url, **kwargs)

method : the request method, one of the six HTTP methods listed in 1.4

**kwargs : keyword arguments that control access, all optional

Parameter Description
params Dictionary or byte sequence, appended to the URL as query parameters
data Dictionary, byte sequence or file object, used as the body of the Request
json Data in JSON format, used as the body of the Request
headers Dictionary of custom HTTP headers
cookies Dictionary or CookieJar, the cookies of the Request
auth Tuple, enables HTTP authentication
files Dictionary, used for uploading files
timeout Timeout in seconds
proxies Dictionary of proxy servers to use; login credentials can be included
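
A short sketch that exercises a few of these keyword arguments through requests.request() directly; requests.get() and the other methods in 1.1 are thin wrappers around it. The query string and header values reuse the examples from section 3:

import requests

kv = {"wd": "Python"}
hd = {"user-agent": "Mozilla/5.0"}

r = requests.request("GET", "http://www.baidu.com/s",
                     params=kv, headers=hd, timeout=30)
print(r.request.url)      # the URL actually sent, with ?wd=Python appended
print(r.status_code)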

2. The Robots protocol

Function: the website tells crawlers which pages may be crawled and which may not.

Format: a robots.txt file in the root directory of the website.

Example: http://www.baidu.com/robots.txt
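
A crawler can fetch and read this file itself. A minimal sketch (the urllib.robotparser part is standard-library functionality, not covered in the original notes):

import requests
from urllib import robotparser

r = requests.get("http://www.baidu.com/robots.txt", timeout=10)
r.encoding = r.apparent_encoding
print(r.text[:300])       # the first few rules of the file

rp = robotparser.RobotFileParser("http://www.baidu.com/robots.txt")
rp.read()                 # download and parse the file
print(rp.can_fetch("*", "http://www.baidu.com/baidu"))   # may a generic crawler fetch this path?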

Use of the Robots protocol:

Web crawlers: identify robots.txt automatically or manually, then crawl accordingly.

Binding force: the Robots protocol is advisory rather than binding. A crawler may ignore it, but doing so carries legal risk.

3. Web crawling with the Requests library in practice

Simulating a browser when visiting a website:

kv = {"user-agent": "Mozilla/5.0"}
url = ""
r = requests.get(url, headers=kv)

3.1 Submitting keywords to Baidu search

import requests

# Submitting a search keyword to Baidu
# Baidu search URL format: http://www.baidu.com/s?wd=keyword
kv = {"wd": "Python"}    # the keyword to search for
r = requests.get("http://www.baidu.com/s", params=kv)
print(r.status_code)
print(r.request.url)     # check that the submitted URL is the one we intended
print(len(r.text))       # length of the returned page

3.2 Crawling and saving web images

import requests
import os

url = "http://img0.dili360.com/pic/2022/03/28/624109135e19b9603398103.jpg"
root = "D://pics//"
path = root + url.split('/')[-1]
try:
    if not os.path.exists(root):
        os.mkdir(root)
    if not os.path.exists(path):
        r = requests.get(url)
        with open(path, "wb") as f:    # the with block closes the file automatically
            f.write(r.content)         # r.content is the binary form of the response
        print("File saved successfully")
    else:
        print("File already exists")
except:
    print("Crawl failed")

2. Information extraction with web crawlers

1. Using the Beautiful Soup library

from bs4 import BeautifulSoup
soup = BeautifulSoup(mk, "html.parser")   # mk is the HTML document to parse, given as a string
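
A minimal sketch of where mk can come from, downloading the demo page that is used in the examples later in this section:

import requests
from bs4 import BeautifulSoup

mk = requests.get("http://python123.io/ws/demo.html").text
soup = BeautifulSoup(mk, "html.parser")
print(soup.title)          # the <title> tag of the parsed document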

1.1 Basic elements

Beautiful Soup library parsers

Parser Usage Requirement
HTML parser from bs4 BeautifulSoup(mk, "html.parser") install the bs4 library
lxml HTML parser BeautifulSoup(mk, "lxml") pip install lxml
lxml XML parser BeautifulSoup(mk, "xml") pip install lxml
html5lib parser BeautifulSoup(mk, "html5lib") pip install html5lib

Basic elements of the BeautifulSoup class

Element Description
Tag A tag, the most basic unit of information, delimited by <> and </>
Name The name of a tag; the name of <p>...</p> is 'p'. Accessed as <tag>.name
Attributes The attributes of a tag, organized as a dictionary. Accessed as <tag>.attrs
NavigableString The non-attribute string inside a tag, i.e. the string between <>...</>. Accessed as <tag>.string
Comment The comment part of a string inside a tag, a special Comment type
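
A short sketch of these elements on the demo page used later in this section:

import requests
from bs4 import BeautifulSoup

demo = requests.get("http://python123.io/ws/demo.html").text
soup = BeautifulSoup(demo, "html.parser")

tag = soup.a          # the first <a> tag in the document
print(tag.name)       # the tag's name: 'a'
print(tag.attrs)      # the tag's attributes as a dictionary
print(tag.string)     # the non-attribute string inside <a>...</a>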

1.2 Traversing HTML content with the bs4 library

Downward traversal of the tag tree

Attribute Description
.contents A list of the tag's child nodes (all children stored in a list)
.children An iterator over the child nodes, similar to .contents, used for looping over children
.descendants An iterator over all descendant nodes, used for looping over the whole subtree

Upward traversal of the tag tree

Attribute Description
.parent The parent tag of the node
.parents An iterator over the node's ancestor tags, used for looping over ancestors

Sideways (sibling) traversal of the tag tree

Attribute Description
.next_sibling Returns the next sibling node in HTML text order
.previous_sibling Returns the previous sibling node in HTML text order
.next_siblings An iterator over all following sibling nodes in HTML text order
.previous_siblings An iterator over all preceding sibling nodes in HTML text order
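
A minimal sketch of the three traversal directions on the demo page; note that siblings can be NavigableString objects (plain text between tags) rather than tags:

import requests
from bs4 import BeautifulSoup

demo = requests.get("http://python123.io/ws/demo.html").text
soup = BeautifulSoup(demo, "html.parser")

print(soup.head.contents)            # downward: list of the <head> tag's children
for child in soup.body.children:     # downward: iterate over <body>'s children
    print(child)
print(soup.a.parent.name)            # upward: the parent tag of the first <a>
print(soup.a.next_sibling)           # sideways: the next sibling of the first <a>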

1.3 HTML formatting and encoding with the bs4 library

prettify(): inserts a newline after each tag, making the tag tree easier to read

Default encoding: utf-8

2. Organizing and extracting information

2.1 Information markup

Marking up information:

Marked-up information forms an organizational structure, adding a dimension to the information.

Marked-up information can be used for communication, storage and display.

The markup structure is as valuable as the information itself.

Marked-up information is easier for programs to understand and use.

The three markup languages (compared in the small sketch after this list):

XML: the earliest general-purpose markup language; highly extensible but verbose.

JSON: typed information; well suited to program (JavaScript) processing; more concise than XML.

YAML: untyped information; the highest proportion of the text is the information itself; very readable.
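
To make the comparison concrete, here is the same small record marked up in the three languages (the record itself is only an illustration, not taken from the original notes):

xml_text = "<person><name>Alice</name><age>20</age></person>"

json_text = '{"person": {"name": "Alice", "age": 20}}'

yaml_text = """
person:
  name: Alice
  age: 20
"""

print(xml_text, json_text, yaml_text, sep="\n")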

2.2 General approaches to information extraction

Approach 1: fully parse the marked-up form of the information, then extract the key information.

Requires a markup parser, e.g. tag-tree traversal with the bs4 library.

Advantage: accurate parsing of the information.

Disadvantage: the extraction process is tedious and slow.

Approach 2: ignore the markup and search directly for the key information.

Only text-search functions over the information are needed.

Advantage: the extraction process is simple and fast.

Disadvantage: the accuracy of the result depends on the content of the information.

Combined approach: combine structural parsing with searching to extract the key information.

Requires both a markup parser and text-search functions.

Example: extract all URL links from an HTML page.

Idea: find all <a> tags, parse the format of each tag, and extract the link after href.

import requests
from bs4 import BeautifulSoup
r = requests.get("http://python123.io/ws/demo.html")
demo = r.text
soup = BeautifulSoup(demo, "html.parser")
for link in soup.find_all('a'):   # every <a> tag in the page
    print(link.get('href'))       # the value of its href attribute

2.3 Searching HTML content with the bs4 library

<>.find_all(name, attrs, recursive, string, **kwargs)

Returns a list of the matching results.

name: a search string matched against tag names.

attrs: a search string matched against tag attribute values; the search can be labelled with a specific attribute.

recursive: whether to search all descendants; defaults to True.

string: a search string matched against the text inside <>...</>.

Shorthand: <tag>(..) is equivalent to <tag>.find_all(..)

soup(..) is equivalent to soup.find_all(..)
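
A short sketch of the different parameters, assuming the demo page still contains a <p class="course"> block and links with id="link1", as in the original course material:

import re
import requests
from bs4 import BeautifulSoup

demo = requests.get("http://python123.io/ws/demo.html").text
soup = BeautifulSoup(demo, "html.parser")

print(soup.find_all("a"))                          # search by tag name
print(soup.find_all("p", "course"))                # <p> tags whose class attribute contains 'course'
print(soup.find_all(id="link1"))                   # search by attribute value (keyword form)
print(soup.find_all(string=re.compile("python")))  # strings that contain 'python'
print(soup("a"))                                   # shorthand for soup.find_all("a")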

3. Example: university ranking information

import bs4
import requests
from bs4 import BeautifulSoup

def getHTMLText(url):
    '''Fetch the text of the web page'''
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()  # raise HTTPError if the status code is not 200
        r.encoding = r.apparent_encoding
        return r.text
    except:
        return ""

def fillUnivList(ulist, html):
    '''Parse the text and fill the required fields into a list'''
    soup = BeautifulSoup(html, "html.parser")
    for tr in soup.find('tbody').children:
        if isinstance(tr, bs4.element.Tag):
            tds = tr('td')    # all <td> cells of this row
            ulist.append([tds[0].string.strip('\n '), tds[1].a.string, tds[4].string.strip('\n ')])

def printUnivList(ulist, num):
    '''Print the list in a formatted table'''
    tplt = "{0:^10}\t{1:{3}^10}\t{2:^10}"
    print(tplt.format("Rank", "University", "Score", chr(12288)))  # chr(12288): full-width space used to align Chinese text
    for i in range(num):
        u = ulist[i]
        print(tplt.format(u[0], u[1], u[2], chr(12288)))

def main():
    uinfo = []
    url = "https://www.shanghairanking.cn/rankings/bcur/2022"
    html = getHTMLText(url)
    fillUnivList(uinfo, html)
    printUnivList(uinfo, 30)

main()
