I want to learn Python secretly, and then shock everyone (day eight)

[image: mind map]

No offense intended by the title; I just think this ad is fun. Take the mind map above if you like it; I can't learn that much anyway.

Preface

Early review: I want to secretly learn Python (day 7)

In the previous article, we met crawlers for the first time, and after getting up close we found there isn't that much to learn. We only crawled a small picture, but we could also fetch entire webpages; the catch is that what came back was just someone else's raw source code, far too messy to use.

So what should we do? When we look at the output of other people's crawlers, there is none of that messy source code. That calls for webpage parsing. Okay, today we will parse the webpage and extract the useful parts from it.

Note: this won't teach you front-end development. We only need a basic understanding of webpage structure so that our crawler can locate what it wants.

A quick plug (if you are a beginner, the following paragraph may help):

Welcome to our circle

I have set up a Python Q&A group; friends who are interested can check it out: What kind of group is this?

There are already more than 200 friends in the group!!!

Portal into the group: Portal


This series assumes you have some C or C++ background, since I picked up Python after learning a bit of C++. Thanks also to senior Qi Feng for his support.
This series assumes you know how to search for answers. For the "modules" topic in particular, I still suggest having your own editor and interpreter; I made recommendations in the previous article.

I don't ask for much; a follow will do.
As for this series' table of contents: honestly, I personally lean toward the two Primer Plus books, so we will follow their structure.

This series also emphasizes building your own hands-on ability. I can't possibly cover every knowledge point, so the ability to solve problems yourself is especially important. Please don't see the holes I leave in the text as traps; they are practice opportunities I have left for you. Use whatever means you have and work them out yourselves.

HTML basics

What is HTML?

HTML (HyperText Markup Language) is the language used to describe webpages; the name literally means hypertext markup language.
What it is exactly matters less here. The key point, and the major premise, is this: don't shy away from an unfamiliar language.

View the HTML code of the page

Let's open any webpage, for example day seven's article: https://blog.csdn.net/qq_43762191/article/details/109320746

Right-click in a blank area of the webpage and choose "View page source" to display the page's source code.
The code is too long to copy here. If you can't find that menu item, you can open the source directly from the address bar: view-source:https://blog.csdn.net/qq_43762191/article/details/109320746

There is another way: right-click -> Inspect. This one puts the source code and the webpage on the same screen.
Which to use is personal preference; each has its advantages.


What are we looking at?

Labels and elements

Well, let's look at the HTML document first. You can see lots of words sandwiched between angle brackets <>; these are called tags.

Tags usually appear in pairs: first comes the start tag, such as <p>; then the end tag, such as </p>.

However, there are also tags that appear alone, such as the <meta charset="utf-8"> on the fifth line of the HTML code (it declares the page's encoding as utf-8). That is all you need to know; in most cases, tags come in pairs.

In fact, the start tag, the end tag, and everything in between together form an element.

The following table lists several common elements:
[image: table of common HTML elements]

Now go back and look at the source code of the webpage. Even without this table you could probably have guessed most of it, but isn't it a lot clearer now!
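As a minimal sketch of tags versus elements, we can feed bs4 a tiny hand-written snippet (the HTML below is made up purely for illustration):

```python
from bs4 import BeautifulSoup  # third-party library: pip install beautifulsoup4

# A toy HTML snippet, invented just to show the idea
html = '<html><body><h1>Day 8</h1><p>Hello, crawler!</p></body></html>'
soup = BeautifulSoup(html, 'html.parser')

p = soup.find('p')   # the whole <p> element: start tag + content + end tag
print(p.name)        # the tag's name: p
print(p.text)        # the text sandwiched between the tags: Hello, crawler!
```

The element is the whole `<p>...</p>` unit; the tag is just its name in angle brackets.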

HTML basic structure

If that doesn't feel like enough, here is another one:
[image: basic HTML structure]

HTML attributes

You will also notice things like class appearing repeatedly in the code, right after body. These are called attributes. Let's take a look:

[image: examples of HTML attributes]

So far, we have covered what HTML is made of: tags, elements, structure (head and body), and attributes.
Okay, now go back and read the page source with fresh eyes; once it no longer feels like pressure, move on.
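As a small sketch, attributes can be read off a found element much like dictionary keys (the snippet below is invented for illustration):

```python
from bs4 import BeautifulSoup

# Hypothetical element with attributes, like the class="..." seen in real page source
html = '<div class="article-title" id="main">day eight</div>'
soup = BeautifulSoup(html, 'html.parser')

div = soup.find('div')
print(div['id'])      # the id attribute: main
print(div['class'])   # class may hold several values, so bs4 returns a list: ['article-title']
```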


Crawl webpage text

Review

The day before, we left everyone with a regret: we could only crawl the raw source code of the webpage. That was actually just the first step. Next, we parse that code and extract the content we want.

The code at the time was like this:

import requests  # import the requests library
res = requests.get('https://blog.csdn.net/qq_43762191')  # crawl my own homepage
# fetch the page source; res is a Response object
html = res.text
# return the content of res as a string
print('Response status code:', res.status_code)  # check that the request got a proper response
print(html)  # print the page source

BeautifulSoup

Here comes a module: BeautifulSoup. Not a bad name, beautiful soup.

I couldn't install a package named simply BeautifulSoup in my PyCharm. I don't know whether that's because I use Python 3.9 or because it just isn't distributed under that name, much like what happened when I made a word cloud earlier. But I can install beautifulsoup4. Whatever you install, just make sure you can import the bs4 package afterwards.
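For reference, a typical command-line install looks like this (a sketch; the exact invocation depends on your Python setup). Note the PyPI package name is beautifulsoup4, while the import name is bs4:

```shell
# Install bs4 under its PyPI name, beautifulsoup4
pip install beautifulsoup4

# Sanity check: the import name is bs4, not beautifulsoup4
python -c "import bs4; print(bs4.__name__)"
```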

Web page data analysis

soup = BeautifulSoup('data to be parsed', 'parser')

Parameter definitions: the first is the text to be parsed; note that it must be a string.

The second parameter names the parser. We will use one built into Python: html.parser. (It is not the only parser, but it is relatively simple.)
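A minimal sketch of that call on a throwaway string, just to see what comes back:

```python
from bs4 import BeautifulSoup

# First argument: the text to parse (a string); second: the parser name
soup = BeautifulSoup('<p>data to be parsed</p>', 'html.parser')
print(type(soup))  # <class 'bs4.BeautifulSoup'>
```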

Well, let’s look at:

import requests
from bs4 import BeautifulSoup
res = requests.get('https://blog.csdn.net/qq_43762191')
soup = BeautifulSoup(res.text, 'html.parser')
print(type(soup))  # check soup's type
wt = open('test4.txt', 'w', encoding='utf-8')
wt.write(res.text)
wt.close()

OK, go try it. You will find that CSDN does not let you succeed so easily, which is fair enough, even though the page still gets crawled.
Let's switch to another website:
https://mp.weixin.qq.com/s?__biz=MzI4NDY5Mjc1Mg==&mid=2247489783&idx=1&sn=09d76423b700620f80c9da9e4d8a8536&chksm=ebf6c088dc81499e3d5a0febeb67fec27ba52f233b6a0e6fda37221a2c497dee82f2de29e567&scene=21#wechat_redirect

That article has some highlights; I crawled it down and read it myself, then spent a whole afternoon studying it. For safety, of course. Yes, I am a good person.

Look at the running result: the data type of soup is <class 'bs4.BeautifulSoup'>, indicating that soup is a BeautifulSoup object.


Extract data

This step breaks into two pieces of knowledge: find() and find_all(), and the Tag object.

[image: find() and find_all() usage]

import requests
from bs4 import BeautifulSoup
# Note: it seems I could only crawl URLs like this one; even my blog homepage would not work
res = requests.get('https://mp.weixin.qq.com/s?__biz=MzI4NDY5Mjc1Mg==&mid=2247489783&idx=1&sn=09d76423b700620f80c9da9e4d8a8536&chksm=ebf6c088dc81499e3d5a0febeb67fec27ba52f233b6a0e6fda37221a2c497dee82f2de29e567&scene=21#wechat_redirect')
soup = BeautifulSoup(res.text, 'html.parser')
item = soup.find('div')
#item = soup.find_all('div')
print(type(item))  # check item's type
wt = open('test6.txt', 'w', encoding='utf-8')
wt.write(str(item))
wt.close()
print(item)

Facts prove that what comes from paper is always shallow; you only really know once you do it yourself. I almost believed the nonsense in the picture above.

(Actually, if you look back and study it carefully, there will be new discoveries)

About the parameters in the brackets: you can pass a tag, an attribute, or both together, depending on the content we want to extract from the webpage.
If a single parameter pinpoints the content accurately, use just that one for retrieval. If both the tag and the attribute must be satisfied to locate exactly what we are looking for, use the two together.
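A tiny made-up sketch of the three combinations (the HTML and class names here are invented for illustration):

```python
from bs4 import BeautifulSoup

# Hypothetical page fragment: two <div>s, only one carrying the class we care about
html = ('<div class="content">keep me</div>'
        '<div class="ad">skip me</div>')
soup = BeautifulSoup(html, 'html.parser')

first = soup.find('div')                    # tag only: the FIRST matching <div>
every = soup.find_all('div')                # tag only: ALL <div>s, as a list-like ResultSet
exact = soup.find('div', class_='content')  # tag + attribute together, for precise targeting
print(first.text, len(every), exact.text)
```

find() returns one Tag (or None); find_all() returns every match, so you usually loop over its result.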


Tag objects

Feel like victory is in sight, only to run it and find the output is not what you expected? Don't worry, keep reading; we won't have watched for nothing:

We have been busy for most of the day and have produced what looks like yet another pile of source code. Has really nothing changed? Don't believe your eyes; they can sometimes deceive you. You may wish to print the type of each return value instead: the types are in fact steadily evolving in a direction that benefits us.

We can see that their data type is <class 'bs4.element.Tag'>, which is a Tag object.

Remember what we said earlier, that extracting data splits into find()/find_all() and Tag objects? Now it's time for the final sprint. That's right, rush!!!

First, a Tag object can itself be searched with find() and find_all().

import requests
from bs4 import BeautifulSoup
res = requests.get('https://mp.weixin.qq.com/s?__biz=MzI4NDY5Mjc1Mg==&mid=2247489783&idx=1&sn=09d76423b700620f80c9da9e4d8a8536&chksm=ebf6c088dc81499e3d5a0febeb67fec27ba52f233b6a0e6fda37221a2c497dee82f2de29e567&scene=21#wechat_redirect')
# returns a Response object, assigned to res
html = res.text
# return the content of res as a string
soup = BeautifulSoup(html, 'html.parser')
# parse the page into a BeautifulSoup object
items = soup.find_all('div')  # locate tags to extract the data we want
for item in items:
    kind = item.find(class_='rich_media_area_primary')  # within each element, match the class rich_media_area_primary
    print(kind, '\n')  # print the extracted data
    print(type(kind))  # print the type of the extracted data

We use Tag.text to pull the text out of a Tag object, and Tag['href'] to extract a URL from it.
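A minimal sketch of these two accessors on a made-up link element:

```python
from bs4 import BeautifulSoup

# Hypothetical <a> element, just to show the two accessors
soup = BeautifulSoup('<a href="https://example.com">example site</a>', 'html.parser')
a = soup.find('a')
print(a.text)      # the text between the tags: example site
print(a['href'])   # the href attribute: https://example.com
```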

import requests  # import the requests library
from bs4 import BeautifulSoup  # import the BeautifulSoup library
res = requests.get('https://mp.weixin.qq.com/s?__biz=MzI4NDY5Mjc1Mg==&mid=2247489783&idx=1&sn=09d76423b700620f80c9da9e4d8a8536&chksm=ebf6c088dc81499e3d5a0febeb67fec27ba52f233b6a0e6fda37221a2c497dee82f2de29e567&scene=21#wechat_redirect')
# returns a Response object, assigned to res
html = res.text
# parse res into a string
soup = BeautifulSoup(html, 'html.parser')
# parse the page into a BeautifulSoup object
items = soup.find_all(class_='rich_media')  # extract the elements we want by matching class='rich_media'
for item in items:  # iterate over the list items
    kind = item.find('h2')  # within each element, match the <h2> tag to extract data
    title = item.find(class_='profile_container')  # within each element, match class_='profile_container' to extract data
    print(kind.text, '\n', title.text)  # print the extracted text

Review

To put it plainly: from using the requests library to fetch the data at the very beginning, to using the BeautifulSoup library to parse it, and then using BeautifulSoup again to extract from it, what we keep experiencing is the type conversion of the objects we operate on.
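That chain of conversions can be sketched locally, substituting a plain string for requests.get(...).text (the HTML here is invented for illustration):

```python
from bs4 import BeautifulSoup

html = '<html><body><div class="t">day 8</div></body></html>'  # str (what res.text gives us)
soup = BeautifulSoup(html, 'html.parser')                      # -> BeautifulSoup object
items = soup.find_all('div')                                   # -> ResultSet of Tag objects
tag = items[0]                                                 # -> Tag
text = tag.text                                                # -> back to a plain str
for obj in (html, soup, items, tag, text):
    print(type(obj).__name__)
```

str in, then BeautifulSoup, ResultSet, Tag, and finally str back out: that is the whole journey.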

Please see the picture below:

[image: the chain of object type conversions]


Origin blog.csdn.net/qq_43762191/article/details/109399104