Python 3 crawler, urllib edition + data processing (using bs4)

Crawling requires at least a basic knowledge of HTML tags and CSS. It is recommended to learn those first and then come back to crawlers; it is
very simple, and you can understand what HTML and CSS do in a day.
If you don't want to learn them first, you'll have to read my explanation and search Baidu for whatever you don't understand as you go.

See the last code block for the complete code. It is not recommended to use it directly; it is better to work through the principles slowly and write your own code.

First, a few reference documents:
urllib documentation: https://docs.python.org/3.7/library/urllib.html
BeautifulSoup Chinese documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html
re documentation: https://docs.python.org/3.7/library/re.html

1. Get text information


urllib is a package that ships with Python 3; bs4 needs to be installed separately.
After configuring the environment variables, run pip install bs4 in cmd.
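If you want to confirm the install worked, a quick optional check like the following will do (this is just a convenience sketch, not part of the crawler):

# Optional: confirm that bs4 can be imported and show its version
try:
    import bs4
    print("bs4 version:", bs4.__version__)
except ImportError:
    print("bs4 is not installed; run: pip install bs4")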

from urllib import request
from bs4 import BeautifulSoup
# The link to crawl
url = "https://blog.csdn.net/qq_36376711/article/details/86614578"
# The response object returned by the request
content = request.urlopen(url)
# Read the raw page source
encode_html = content.read()
#print(encode_html)  # printing shows the html is still encoded (bytes)
# Decode it, usually as utf-8
html = encode_html.decode("utf-8")
#print(html)  # printing now shows the Chinese text correctly

That is the simplest form of access (access itself only needs the urllib library).

Now let's analyze the data we obtained (using bs4).
Here, take the read count of this article as an example.
Press F12, or right-click -> Inspect Element, and patiently look for the class of the element
holding the read count.
You will find that the class corresponding to the read count is read-count, and that it sits inside a span tag pair.

# Use BeautifulSoup to convert the fetched data into a form that is easy to process
soup = BeautifulSoup(html,"html.parser")
# Use the same method to find the tag holding the article title: it turns out to be h1
# (if you don't know what class, h1, or a tag pair is, go read up on HTML tags first, then come back)
# At a glance, only the title uses a font of that size
print("Article:", soup.h1)  # access directly through the tag attribute
# Find every tag pair in soup whose tag name is span and whose class is read-count
print("Read count:", soup.find_all("span","read-count"))  # tag name plus class

Printing the result, we find that the tag pairs themselves, which we did not want to see,
are printed as well, so the code above is modified as follows:

print("文章:",soup.h1.get_text())
#因为count是个列表,所以不能像上面一样直接get_text()
count = soup.find_all("span","read-count")
for c in count:
    print("阅读数:",c.get_text())

Written this way, the program breaks as soon as the article is deleted, the CSDN server goes down, and so on.
That may not look like a big problem for this one program, but when you crawl information from many URLs and
aggregate it, or repeatedly access different pages by changing the URL,
you don't want the whole program to terminate because one URL has a problem.
During high-frequency access, if no interval between requests is set, the anti-crawler mechanism will recognize you.
There is also no disguised request header here, so by default the request header tells the target site that it is a Python crawler.
Some websites require a captcha, some require login, and some need header information before they respond normally; overseas links that have to go through a proxy may also stall unexpectedly.
You may also want the data somewhere other than the console, for example written to a txt or csv file.
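For example, here is a minimal sketch of how some of these points can be handled with urllib alone; the URL list, the User-Agent string, the 2-second pause, and the titles.txt file name are all illustrative choices of mine, not part of the original code:

from urllib import request, error
from bs4 import BeautifulSoup
import time

# Illustrative list of pages to crawl
urls = [
    "https://blog.csdn.net/qq_36376711/article/details/86614578",
    "https://blog.csdn.net/qq_36376711/article/details/85712738",
]
# A browser-like User-Agent so the default "Python-urllib" header is not sent
headers = {"User-Agent": "Mozilla/5.0"}

with open("titles.txt", "w", encoding="utf-8") as f:
    for url in urls:
        try:
            req = request.Request(url, headers=headers)
            html = request.urlopen(req).read().decode("utf-8")
        except (error.URLError, UnicodeDecodeError) as e:
            # One bad URL should not terminate the whole run
            print("Skipping", url, "because of:", e)
            continue
        soup = BeautifulSoup(html, "html.parser")
        if soup.h1 is not None:
            # Write the article title to a txt file instead of only printing it
            f.write(soup.h1.get_text() + "\n")
        # Pause between requests so high-frequency access does not trigger anti-crawler checks
        time.sleep(2)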

Incidentally, the earlier find_all call can also be written as
soup.find_all("span", {"class": "read-count"})
find_all can also be written as findAll; find_all is more in line with Python naming conventions.
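For reference, these calls select the same tags; the class_ keyword form is a third variant not shown above:

# Positional shorthand: the second argument is treated as the class to match
count = soup.find_all("span", "read-count")
# Explicit attribute dictionary
count = soup.find_all("span", {"class": "read-count"})
# Keyword form (class is a reserved word in Python, hence the trailing underscore)
count = soup.find_all("span", class_="read-count")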

Getting other text works by analogy with the above, so I won't go into detail; 95% of what you fail to extract comes down to HTML knowledge.
Most people who can't get the information they want are stuck on how to analyze the HTML page,
not on Python.

2. Get pictures

Continuing from the code above, we add some content.
$ is part of a regular expression; you can refer to my other article, or look it up yourself:
https://blog.csdn.net/qq_36376711/article/details/86505332
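As a quick standalone illustration of what a pattern such as .jpg$ matches ($ anchors the match to the end of the string; the URLs below are made up):

import re

pattern = re.compile(r'.jpg$')
print(bool(pattern.search("https://example.com/photo.jpg")))      # True, ends with .jpg
print(bool(pattern.search("https://example.com/photo.jpg?x=1")))  # False, .jpg is not at the end
print(bool(pattern.search("https://example.com/photo.png")))      # False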

import re  # a package for string matching
# Get the image links: 'img' selects all img tag pairs,
# and the regex .jpg$ keeps only src values ending in .jpg
links = soup.find_all('img',"",src=re.compile(r'.jpg$'))
# Print the src attribute of every matched link
for link in links:
    print(link.attrs["src"])

Download the images:

import time  # time-related package
# Set the directory where the images are saved, otherwise they end up in the program's current path.
# The path must already exist, so if you hit errors while testing, it is most likely because
# you did not create this folder. Change the value of path or create the folder on your E: drive;
# those who copy-paste and run without reading the code will probably blame the tutorial.
path = r'E:\pystest\images'
# The r prefix keeps the string raw, i.e. the backslashes in it are not treated as escapes
for link in links:
    # Print the src attribute of the link being downloaded
    print("Downloading:", link.attrs['src'])
    # Save and name each file; time.time() returns the current timestamp to avoid name clashes
    request.urlretrieve(link.attrs['src'], path + '\\%s.jpg' % time.time())
    # request.urlretrieve downloads the remote resource straight to the local disk

Result: the downloaded jpg images.

3. Complete code

Improving the code and making it more robust:

from urllib import request
from bs4 import BeautifulSoup
import re
import time
import os
# chardet is used to detect the encoding
import chardet

def get_html_content(url):
    try:
        '''
        The object returned by urlopen provides:
        1. geturl: the url of the request object
        2. info: meta information about the object, including the http response headers
        3. getcode: the http status code that was returned
        e.g. print("URL:{0}".format(page.getcode()))
        '''
        xhtml = request.urlopen(url).read()
        # Decode the bytes content into a string.
        # The html source usually declares its encoding;
        # here we detect it with chardet instead.
        charset = chardet.detect(xhtml)
    except Exception as e:
        print(e)
        # urlopen may raise an HTTPError, e.g. 404
        return None
    try:
        # Use the detected encoding if there is one, otherwise fall back to utf-8;
        # "or" also covers the case where chardet reports the encoding as None
        html = xhtml.decode(charset.get("encoding") or "utf-8")
    except (LookupError, UnicodeDecodeError) as e:
        print(e)
        return None
    soup = BeautifulSoup(html, "html.parser")
    return soup
               
url = "https://blog.csdn.net/qq_36376711/article/details/85712738"

soup = get_html_content(url)
# Exit immediately if nothing was retrieved
if soup is None:
    exit(0)

print("Article:", soup.h1.get_text())

count = soup.find_all("span","read-count")

for c in count:
    print("阅读量:",c.get_text())

links = soup.find_all('img',"",src=re.compile(r'.jpg$'))

path = r'E:\pystest\images'
# Make sure the folder exists; creating it directly avoids the "path not found" error
# (see also https://www.cnblogs.com/monsteryang/p/6574550.html)
os.makedirs(path, exist_ok=True)
print("The following images will be saved to:", path)

for link in links:
    print("Downloading:", link.attrs['src'])
    request.urlretrieve(link.attrs['src'], os.path.join(path, '%s.jpg' % time.time()))
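If you also want the results in a file rather than only on the console (the txt/csv point mentioned earlier), a minimal sketch continuing from the complete code above could look like this; the file name read_count.csv is just an example:

import csv

# Write the article title and its read counts to a csv file
with open("read_count.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["title", "read_count"])
    for c in count:
        writer.writerow([soup.h1.get_text(), c.get_text()])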

As for collecting information from different webpages in a loop, dealing with anti-crawling measures and captchas, or crawling with scrapy or the requests library: when I have time I will write another article and add the links here, or link to better tutorials.

This blog post is original, written based on what the author learned from NetEase Cloud Classroom, Shiyanlou, and the book "Python Network Data Collection".

You are free to modify and republish this blog post; you can even say you wrote it yourself.

How to view HTML pages and HTTP connection information: https://blog.csdn.net/qq_36376711/article/details/86679266
For more detailed analysis: the Wireshark packet capture tool
Follow-up details: https://blog.csdn.net/qq_36376711/article/details/86675208
