Python data scraping: the BeautifulSoup module

When scraping data, we usually need to parse and process the raw pages we capture

In this article, we introduce BeautifulSoup, a Python library for parsing HTML and XML

 

The official documentation site for BeautifulSoup is as follows

https://www.crummy.com/software/BeautifulSoup/bs4/doc/ 

 

 

BeautifulSoup can extract data from structured HTML and XML documents. It also provides a variety of methods for searching, extracting, and modifying documents, which can greatly improve the efficiency of our data mining
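As a quick preview of what that looks like, here is a minimal sketch that searches, extracts, and modifies a document (the HTML string is made up purely for illustration, and it assumes the lxml parser is installed):

from bs4 import BeautifulSoup

html = "<html><body><h1>Title</h1><p class='intro'>Hello</p></body></html>"
bs = BeautifulSoup(html, "lxml")
print(bs.h1.text)                              # extract the text of the first <h1>: Title
print(bs.find('p', attrs={'class': 'intro'}))  # search for an element by attribute
bs.h1.string = "New Title"                     # modify the document in place
print(bs.h1)                                   # <h1>New Title</h1>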

Let's install BeautifulSoup


It is very simple: just pip install plus the package name

pip3 install bs4
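Note that bs4 on PyPI is just a small shim that pulls in the real package; as far as I know, you can also install it directly under its official name:

pip3 install beautifulsoup4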

 

 

Now let's formally start learning this module

First, we need a target URL. I will use my personal website:

http://www.susmote.com

 

Next, we fetch the page with the get method of requests and save its source code to a file

import requests

url = "http://www.susmote.com"

# fetch the page and decode it as UTF-8
resp = requests.get(url)
resp.encoding = "utf8"
content = resp.text

# save the page source to a local file for later parsing
with open("Bs4_test.html", 'w', encoding="utf8") as f:
    f.write(content)

Run it, and we get the source code of the page right away

 

The program below analyzes this source code using BeautifulSoup

First, let's get the href link and the corresponding text of every a tag in the page.

The code is as follows

from bs4 import BeautifulSoup

# parse the saved page; specifying the "lxml" parser avoids a warning
with open("Bs4_test.html", 'r', encoding='utf8') as f:
    bs = BeautifulSoup(f.read(), "lxml")
    # find_all returns a list of every <a> element in the document
    a_list = bs.find_all('a')
    for a in a_list:
        if a.text != "":
            print(a.text.strip(), a["href"])

First we import BeautifulSoup from bs4

Then we open the file in read mode and pass f.read() as the argument to BeautifulSoup, that is, we initialize it with the file's contents as a string, and call the returned object bs

Then we can call BeautifulSoup's methods. The most commonly used are find and find_all, which locate elements in the document that meet the given conditions; the difference is that find returns the first match while find_all returns all of them

Here we use the find_all method; its common form is

list_of_elements = bs.find_all(element_name, attrs={attribute_name: attribute_value})
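To make the difference between find and find_all concrete, here is a minimal sketch (the HTML string is made up for illustration):

from bs4 import BeautifulSoup

html = "<p><a href='/one'>one</a> <a href='/two'>two</a></p>"
bs = BeautifulSoup(html, "lxml")
print(bs.find('a'))       # first match only: <a href="/one">one</a>
print(bs.find_all('a'))   # every match, returned as a list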

 

The loop then prints each element that was found, so there is not much more to explain here.

Run this code from the command line.

There are too many results to show them one by one here.

 

You can see that the crawled links follow a number of patterns

for example, tag links

With a slight change to the code, we can grab all the tag links on the site; in other words, we apply a filter

The code is as follows

from bs4 import BeautifulSoup

with open("Bs4_test.html", 'r', encoding='utf8') as f:
    bs = BeautifulSoup(f.read(), "lxml")
    a_list = bs.find_all('a')
    for a in a_list:
        # keep only non-empty links whose href contains "tag"
        if a.text != "" and 'tag' in a["href"]:
            print(a.text.strip(), a["href"])

The general content has not changed; we just added a condition before the output to do the filtering

Run this program from the command line to see the filtered results.

 

Apart from this, there are several other ways to achieve the same goal

Using the attrs = {attribute_name: attribute_value} parameter

Anyone who has learned HTML will know attribute names: for example, "class", "id", and "style" are all attributes. Let's go deeper and use them to dig further into the data.

Let's get the title of each article on my blog.

After inspecting the page with the browser's developer tools, we can easily find the element that wraps each article title on my blog.

Each title sits inside a <header class="post-header"> element, a very simple attribute to match.

Below we use code to fetch the article titles in batch

# coding=utf-8
__Author__ = "susmote"

from bs4 import BeautifulSoup

n = 0
with open("Bs4_test.html", 'r', encoding='utf8') as f:
    bs = BeautifulSoup(f.read(), "lxml")
    # match only <header> elements with class "post-header"
    header_list = bs.find_all('header', attrs={'class': 'post-header'})
    for header in header_list:
        if header.text != "":
            n += 1  # number only the headers we actually print
            print(str(n) + ":  " + header.text.strip() + "\n")

Basically this is no different from the previous code, except that the find_all method takes one more parameter, attrs, to filter by attribute; to make the results clearer, I also number each title with n

Run it on the command line to see the numbered list of titles.

 

 

Using regular expressions to match attribute value patterns

The idea is simply to pass a regular expression as the attribute value instead of a literal string. I won't explain it in depth here; if you want to know more, you can look it up on Baidu yourself.
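Just to sketch the idea, here is how the earlier 'tag' filter could be rewritten by passing a compiled regular expression as the attribute value (reusing the Bs4_test.html file we saved above):

from bs4 import BeautifulSoup
import re

with open("Bs4_test.html", 'r', encoding='utf8') as f:
    bs = BeautifulSoup(f.read(), "lxml")
    # a compiled regex as the attribute value matches any href containing "tag"
    a_list = bs.find_all('a', attrs={'href': re.compile(r'tag')})
    for a in a_list:
        print(a.text.strip(), a["href"])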

 

My blog site www.susmote.com

 
