When scraping data, we usually need to parse what we capture.
In this article we introduce BeautifulSoup, a Python library for parsing HTML and XML.
The official BeautifulSoup documentation can be found here:
https://www.crummy.com/software/BeautifulSoup/bs4/doc/
BeautifulSoup can extract data from structured HTML and XML documents, and it provides a variety of methods for searching, extracting, and modifying a document, which can greatly improve the efficiency of data mining.
Let's install BeautifulSoup first.
It is very simple: just pip install plus the package name.

pip3 install bs4

(I had already installed it, so no progress bar was displayed.)
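If you want to confirm the install worked, a quick sanity check is to parse a tiny snippet from the interpreter (using Python's built-in html.parser so no extra parser is needed):

```python
# Quick sanity check that bs4 is importable and can parse a snippet
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>hello</p>", "html.parser")
print(soup.p.text)  # hello
```

If this prints "hello" without an ImportError, the package is ready to use.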
Now let's start learning this module properly.
First, we need a target URL; here I use my personal website:

http://www.susmote.com

Next, we fetch the page source with the get method of requests and save it to a file.
import requests

urls = "http://www.susmote.com"
resp = requests.get(urls)
resp.encoding = "utf8"
content = resp.text
with open("Bs4_test.html", 'w', encoding="utf8") as f:
    f.write(content)
Run it, and we immediately have the source code of the page saved to Bs4_test.html.
The program below parses this source code with BeautifulSoup.
First, let's extract the href link and the corresponding text of every a tag.
The code is as follows:
from bs4 import BeautifulSoup

with open("Bs4_test.html", 'r', encoding='utf8') as f:
    bs = BeautifulSoup(f.read(), "lxml")
    a_list = bs.find_all('a')
    for a in a_list:
        if a.text != "":
            print(a.text.strip(), a["href"])
First we import BeautifulSoup from bs4.
Then we open the file in read-only mode and pass f.read() to BeautifulSoup as the string to parse, recording the returned object as bs.
Now we can call BeautifulSoup's methods. The most commonly used are find and find_all, which look for elements in the document that match given conditions; the difference is that find returns only the first match, while find_all returns all of them.
Here we use the find_all method; its common form is

list_of_elements = bs.find_all(element_name, attrs={attribute_name: attribute_value})

Finally the loop prints each element found in turn, so there is not much more to say here.
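The difference between find and find_all mentioned above can be sketched with a toy snippet (the HTML here is invented for illustration, not taken from the blog page):

```python
from bs4 import BeautifulSoup

# A small made-up snippet with two links
html = '<p><a href="/a">first</a><a href="/b">second</a></p>'
bs = BeautifulSoup(html, "html.parser")

first = bs.find('a')       # only the first matching element (a Tag)
all_a = bs.find_all('a')   # a list of every matching element

print(first["href"])                # /a
print([a["href"] for a in all_a])   # ['/a', '/b']
```

So find is convenient when you expect exactly one element, and find_all when you want to loop over every match, as we do with the a tags above.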
We run this code on the command line.
The output is as follows.
There are too many results to show them all one by one.
You can see that the crawled links follow clear patterns,
for example the tag links.
With a slight change to the code, we can get all of the site's tag links; that is, we add a filter.
The code is as follows:
from bs4 import BeautifulSoup

with open("Bs4_test.html", 'r', encoding='utf8') as f:
    bs = BeautifulSoup(f.read(), "lxml")
    a_list = bs.find_all('a')
    for a in a_list:
        if a.text != "" and 'tag' in a["href"]:
            print(a.text.strip(), a["href"])
The overall code is unchanged; we only added a condition before the output to do the filtering.
We run this program on the command line.
The result is as follows.
Apart from this, there are several other ways to achieve the same goal.
Use the attrs = {attribute_name: attribute_value} parameter
Anyone who has learned HTML knows what attributes are; for example, "class", "id", and "style" are all attributes. Let's go a step further and use attributes to dig deeper into the data:
getting the title of each article on my blog site.
Using the browser's developer tools, we can easily find the attributes of the title section of my blog pages,
as shown below.
Each title is wrapped in a <header class="post-header">, a very simple attribute to match on.
Below we use code to fetch the article titles in batch.
# coding=utf-8
__Author__ = "susmote"

from bs4 import BeautifulSoup

n = 0
with open("Bs4_test.html", 'r', encoding='utf8') as f:
    bs = BeautifulSoup(f.read(), "lxml")
    header_list = bs.find_all('header', attrs={'class': 'post-header'})
    for header in header_list:
        n += 1
        if header.text != "":
            print(str(n) + ": " + header.text.strip() + "\n")
There is basically no difference from the previous code, except that one more parameter, attrs, is passed to find_all to filter by attribute; to make the results clearer, I also added a counter n.
Run it on the command line; the result is as follows.
Use a regular expression to describe the pattern of an attribute value
This amounts to passing a regular expression as the attribute value, so that matching is done by pattern instead of exact string. I won't explain it at length here; if you want to know more, you can search for it yourself.
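As a minimal sketch of this approach, a compiled pattern from the re module can be passed as the attribute value in find_all; BeautifulSoup then keeps only elements whose attribute matches the pattern. The HTML snippet below is invented for illustration, echoing the tag-link filtering done earlier:

```python
import re
from bs4 import BeautifulSoup

# A made-up snippet: one tag link and one ordinary link
html = '''
<a href="http://www.susmote.com/tag/python">python</a>
<a href="http://www.susmote.com/about">about</a>
'''
bs = BeautifulSoup(html, "html.parser")

# Keep only <a> tags whose href matches the pattern "/tag/"
tag_links = bs.find_all('a', attrs={'href': re.compile(r'/tag/')})
for a in tag_links:
    print(a.text, a["href"])
```

This prints only the first link, giving the same result as the 'tag' in a["href"] check used earlier, but with the full flexibility of regular expressions.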
My blog site www.susmote.com