Web crawling with Python's built-in urllib library and the lxml library (fixed code patterns, quick start)

Crawl web page information using Python's built-in urllib library together with the third-party lxml library.

Libraries:

import urllib.request
from lxml import etree

Example:

1. Obtain the source code of the web page

import urllib.request

# URL + request headers. The headers tell the site that the request comes from a
# browser (or some other non-crawler client); without them you may get a 418 error.
url = 'http://www.baidu.com'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36'
}

# Wrap the URL and the request headers into a single request object
request = urllib.request.Request(url=url, headers=headers)

# Fetch the page source
response = urllib.request.urlopen(request)   # send the request
content = response.read().decode('utf-8')    # read the page source and decode it as UTF-8



These steps are fixed: they can fetch the source code of any web page, you only need to change the url.

2. Download files (images, videos) from a web page

# First argument: the file's URL. Second argument: where to store the file, including the file name and extension.
# The line below downloads an HTML file; by analogy you can download images or videos
# by changing the URL and the second argument accordingly (see the sketch below).
urllib.request.urlretrieve(url, 'baidu.html')   # download the file
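
By analogy, an image or a video is downloaded with the same call, only the URL and the target file name change; the URLs below are placeholders used purely for illustration:

import urllib.request

# Placeholder image URL, purely for illustration
img_url = 'http://www.example.com/logo.png'
urllib.request.urlretrieve(img_url, 'logo.png')   # save the image next to the script

# A video works the same way, e.g.:
# urllib.request.urlretrieve(video_url, 'movie.mp4')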

3. Get specific elements from a web page

from lxml import etree

# XPath parsing. There are two ways to build the element tree:
# 1. A local file:  etree.parse()
# 2. Data returned by the server (response.read().decode('utf-8')):  etree.HTML()

# For a page fetched from a server you would write: tree = etree.HTML(content)
# For this demo we use a small hand-written HTML file instead, treat it as the page
# we want, and parse it from disk (the file name demo.html is just an example):
tree = etree.parse('demo.html', etree.HTMLParser())

================================================================================
# The steps above are fixed; from here on we start extracting the elements we want.
# First, the basic syntax for using the lxml library (XPath syntax).
1. Path queries
# /  : find direct child nodes
# // : find all descendant nodes, regardless of nesting level
Example:
Suppose the HTML file looks like this:
<body>
	<h1>1</h1>
	<div>
    	<h2 id="2" class="3">2</h2>
    </div>
</body>
1. Get text values
# To get the value inside h1 (which is 1), using the / syntax:
li_list = tree.xpath('//body/h1/text()')  # text() returns the text content of the tag
# To get the value inside h2 (which is 2), using the / syntax (spell out the full path through div):
li_list = tree.xpath('//body/div/h2/text()')

# To get the value inside h1 (which is 1), using the // syntax:
li_list = tree.xpath('//body/h1/text()')
# To get the value inside h2 (which is 2), using the // syntax (// skips the div level):
li_list = tree.xpath('//body//h2/text()')
    
2. Get an attribute value (the class attribute of h2, whose value is 3)
li_list = tree.xpath('//body//h2/@class')

3. Get the text of h2 (which is 2) by filtering on an attribute value (id="2")
li_list = tree.xpath('//body//h2[@id="2"]/text()')

4. Fuzzy / conditional query: get the text of every element whose id is 2 or 3
li_list = tree.xpath('//body//h2[@id="2" or @id="3"]/text()')
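
To make the demo runnable end to end, here is a minimal self-contained sketch that parses the sample markup above from an inline string (instead of a saved file) and prints the result of each query:

from lxml import etree

# The sample markup from above, kept inline so the sketch runs on its own
html = '''
<body>
    <h1>1</h1>
    <div>
        <h2 id="2" class="3">2</h2>
    </div>
</body>
'''

tree = etree.HTML(html)

print(tree.xpath('//body/h1/text()'))                  # ['1']  direct child of body
print(tree.xpath('//body/div/h2/text()'))              # ['2']  full path with /
print(tree.xpath('//body//h2/text()'))                 # ['2']  // skips the div level
print(tree.xpath('//body//h2/@class'))                 # ['3']  attribute value
print(tree.xpath('//body//h2[@id="2"]/text()'))        # ['2']  filter by attribute
print(tree.xpath('//body//h2[@id="2" or @id="3"]/text()'))  # ['2']  logical or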

Concrete syntax:

Path queries
// : find all descendant nodes, regardless of nesting level
/  : find direct child nodes
Predicate queries
//div[@id]
//div[@id="maincontent"]
Attribute queries
//@class
Fuzzy queries
//div[contains(@id, "he")]
//div[starts-with(@id, "he")]
Content queries
//div/h1/text()
Logical operators
//div[@id="head" and @class="s_down"]
//title | //price
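
As a quick check of the predicate, fuzzy, and logical queries, here is a small sketch against made-up markup; the ids and classes are invented for illustration only:

from lxml import etree

# Made-up markup, purely to exercise the queries listed above
html = '''
<body>
    <div id="head" class="s_down">top</div>
    <div id="helper">menu</div>
    <div id="maincontent"><h1>hello</h1></div>
</body>
'''
tree = etree.HTML(html)

print(tree.xpath('//div[@id]'))                                    # every div that has an id attribute
print(tree.xpath('//div[@id="maincontent"]/h1/text()'))            # ['hello']
print(tree.xpath('//@class'))                                      # ['s_down']  all class values
print(tree.xpath('//div[contains(@id, "he")]/text()'))             # ['top', 'menu']
print(tree.xpath('//div[starts-with(@id, "he")]/text()'))          # ['top', 'menu']
print(tree.xpath('//div[@id="head" and @class="s_down"]/text()'))  # ['top']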

This covers roughly 80% of the element-extraction problems you will run into.
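
Putting the fixed steps together, a minimal end-to-end sketch could look like this; the XPath only pulls the page <title> text, which is just an assumed target, so adapt it to the page you actually crawl:

import urllib.request
from lxml import etree

url = 'http://www.baidu.com'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36'
}

# Fetch the page source (same fixed steps as above)
request = urllib.request.Request(url=url, headers=headers)
response = urllib.request.urlopen(request)
content = response.read().decode('utf-8')

# Parse it and extract an element; here just the <title> text as an example
tree = etree.HTML(content)
print(tree.xpath('//title/text()'))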

Additional notes:

1. A video may be too large to download in one go. You can loop instead, downloading 1024 bytes at a time until the file is complete (this is only the idea; a combined sketch for notes 1 and 2 appears after note 3's code).

2. You may still hit a 418 error while crawling, even with a request header set. Wrap the request in an exception handler so that the crawler does not stop when an error occurs, and/or sleep for a random interval between requests to imitate human use as much as possible (see the same combined sketch below).

3. Progress is not displayed while a video is downloading. You can borrow the approach below (adapted from other people's code) to show progress:

from urllib.request import urlretrieve
import socket

# Without a default socket timeout, socket.timeout is almost never raised,
# so set one here to make the retry logic below meaningful
socket.setdefaulttimeout(30)

# Callback (reporthook) for urlretrieve() that shows the current download progress
# a: number of data blocks downloaded so far
# b: size of one data block in bytes
# c: total size of the remote file in bytes
myper = 0


def jindu(a, b, c):
    global myper
    if not a:
        print("connection opened")
    if c <= 0:
        print("the size of the file to download is unknown")
        return
    per = 100 * a * b / c
    if per > 100:
        per = 100
    myper = per
    # Print roughly every 10%: only when the percentage sits just below a multiple of 10
    if 0.95 < (per % 10) < 1:
        print("progress: " + '%.2f%%' % per, end=" ")
    if per == 100:
        return True


# Work around urlretrieve downloading a file incompletely, and retry on timeouts
# so an overly long download does not hang forever
def auto_down(url, filename):
    try:
        urlretrieve(url, filename, jindu)
    except socket.timeout:
        count = 1
        while count <= 5:
            try:
                urlretrieve(url, filename, jindu)
                break
            except socket.timeout:
                err_info = 'Reloading for %d time' % count if count == 1 else 'Reloading for %d times' % count
                print(err_info)
                count += 1
        if count > 5:
            print("download failed")

# Usage (hypothetical URL): auto_down('http://www.example.com/video.mp4', 'video.mp4')

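For notes 1 and 2 above, here is a minimal combined sketch of the chunked-download idea plus exception handling and a random sleep; the video URL is a placeholder, not a real resource:

import random
import time
import urllib.error
import urllib.request

# Placeholder URL and headers, purely for illustration
video_url = 'http://www.example.com/video.mp4'
headers = {'User-Agent': 'Mozilla/5.0'}


def download_in_chunks(url, filename, chunk_size=1024):
    # Read the response 1024 bytes at a time so a large file never has to fit in memory
    request = urllib.request.Request(url=url, headers=headers)
    with urllib.request.urlopen(request) as response, open(filename, 'wb') as f:
        while True:
            chunk = response.read(chunk_size)
            if not chunk:        # an empty read means the download is complete
                break
            f.write(chunk)


try:
    download_in_chunks(video_url, 'video.mp4')
except urllib.error.URLError as e:
    # Catch errors (including HTTP 418) so the crawler keeps running instead of crashing
    print('download failed:', e)

# Sleep a random amount between requests to look less like a crawler
time.sleep(random.uniform(1, 3))
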
Origin blog.csdn.net/qq_43483251/article/details/127539924