Advanced Python crawling: using the BeautifulSoup library to scrape the content of a web article (hands-on demo)

Let's take a Fox News article as an example and scrape the whole article.

First, the title. The page structure shows that the title is the content of the h1 inside the header node whose class is article-header; the text of that DOM node can be read through its string attribute.

# Get the article title
alert_header = soup.find('header', class_="article-header").find('h1')
print(alert_header.string)
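
One caveat with string: it returns None when the tag has more than one child node, for example when the h1 contains a nested tag. The following is a minimal sketch of that difference, run on a made-up HTML fragment rather than the real Fox News page (the markup and the variable names are assumptions for illustration only).

```python
from bs4 import BeautifulSoup

# Hypothetical markup, loosely modeled on the structure described above.
html = """
<header class="article-header">
  <h1>Mom received <em>dirty</em> diapers</h1>
</header>
"""
soup = BeautifulSoup(html, "html.parser")

header = soup.find("header", class_="article-header")
# find() returns None when nothing matches, so guard before chaining.
h1 = header.find("h1") if header is not None else None

if h1 is not None:
    # This h1 has three children (text, <em>, text), so .string is None.
    print(h1.string)
    # get_text() joins the text of all descendants instead.
    title = h1.get_text(" ", strip=True)
    print(title)
```

If the h1 on the real page contains only plain text, string and get_text() give the same result, so the choice only matters for nested markup.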

Next, the body text. The page structure shows that the p elements under the div whose class is article-body make up the body text. All direct child nodes of that div can be obtained through contents; iterating over them and printing the text of each p element prints the article.
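
The name check during the traversal matters because contents returns every direct child, including the whitespace text nodes between tags, which have no tag name. A small sketch on made-up markup (the HTML below is an assumption, not the real page) shows this, together with find_all('p', recursive=False) as one possible alternative that returns only the direct p children.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the real article body.
html = """<div class="article-body"><p>First paragraph.</p>
<div class="ad">Advertisement</div>
<p>Second paragraph.</p></div>"""
soup = BeautifulSoup(html, "html.parser")
body = soup.find("div", class_="article-body")

# .contents mixes tags and text nodes; text nodes carry no tag name.
child_names = [getattr(child, "name", None) for child in body.contents]
print(child_names)

# Restrict the search to direct children that are <p> tags.
paragraphs = [p.get_text() for p in body.find_all("p", recursive=False)]
print(paragraphs)
```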

from urllib.request import urlopen
from bs4 import BeautifulSoup

response = urlopen('https://www.foxnews.com/tech/mom-received-dirty-diapers-amazon')
soup = BeautifulSoup(response, 'html.parser')   # parse with Python's built-in HTML parser

# Get the article title
alert_header = soup.find('header', class_="article-header").find('h1')
print("Title:")
print(alert_header.string)

# Get the article body
alert_body = soup.find('div', class_="article-body").contents   # all direct child nodes of the body

# Print the article body
print("Body:")
for node in alert_body:
    if node.name == "p":
        print(node.get_text())
        print()

Running the script prints the title followed by the body paragraphs.
If advertisements show up in the middle of the text, inspect the structure of the article body and of the ad elements, then filter the ads out as well.
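
How to eliminate the advertisements depends on how the page marks them up, which this article does not show. As a rough sketch, assuming the ads sit in elements with a class named ad-container (a hypothetical name; check the real page in the browser's developer tools), they could be removed with decompose() before the paragraphs are collected.

```python
from bs4 import BeautifulSoup

# Made-up markup; "ad-container" is an assumed class name, not Fox News's.
html = """<div class="article-body">
<p>First paragraph.</p>
<div class="ad-container"><p>Buy now!</p></div>
<p>Second paragraph.</p>
</div>"""
soup = BeautifulSoup(html, "html.parser")
body = soup.find("div", class_="article-body")

# Remove every assumed ad element (and its children) from the tree.
for ad in body.find_all(class_="ad-container"):
    ad.decompose()

# With the ads gone, collect the text of the remaining <p> tags.
clean_paragraphs = [p.get_text(strip=True) for p in body.find_all("p")]
print(clean_paragraphs)
```

Removing the ad elements first keeps the collection step simple, at the cost of modifying the parsed tree in place.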

Origin blog.csdn.net/qq_38161040/article/details/104021581