Crawling static web pages with requests and BeautifulSoup

1. Case description

This case uses requests and BeautifulSoup to crawl the title, publisher, and date of each news item on the first 2 pages of the Hubei University of Economics news site.
2. Crawler approach

First open the list page (http://news.hbue.edu.cn/jyyw/list.htm), right-click and choose "Inspect" to bring up the browser's developer tools.

It turns out that the URL of each list page follows the pattern http://news.hbue.edu.cn/jyyw/list<N>.htm, where <N> is the page number, so the remaining pages can be crawled by substituting in the number.

It also turns out that each article's link sits inside a span with class="Article_Title", so the article URLs can be extracted from those spans, as the sketch below shows.
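Before writing the full crawler, the selector can be sanity-checked in a few lines. This is a minimal sketch, assuming the markup observed above (the list-page URL and the Article_Title class come from this section's observations):

import requests
from bs4 import BeautifulSoup

# Fetch the first list page and print the href of every link
# wrapped in a span with class "Article_Title"
page = requests.get('http://news.hbue.edu.cn/jyyw/list1.htm')
soup = BeautifulSoup(page.content, 'lxml')
for span in soup.find_all('span', class_='Article_Title'):
    print(span.a.attrs['href'])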
3. Code

import requests
from bs4 import BeautifulSoup
import re

def getnews(newurl):
    html = requests.get(newurl)
    bs = BeautifulSoup(html.content, 'lxml')
    the_title = bs.find(name='h1', class_='arti_title')
    # Remove spaces from the title with a regular expression
    title = re.sub(' ', '', the_title.string)
    publisher = bs.find(name='span', attrs={'class': 'arti_publisher'})
    date = bs.find(class_='arti_update')
    # .string returns the text content of the filtered node
    print(title)
    print(publisher.string)
    print(date.string)

for i in range(1, 3):
    url = 'http://news.hbue.edu.cn/jyyw/list' + str(i) + '.htm'
    # Fetch the page with requests.get; pass .content to BeautifulSoup
    html = requests.get(url)
    bs = BeautifulSoup(html.content, 'lxml')
    newurlset = bs.find_all(name='span', attrs={'class': 'Article_Title'})
    # find_all returns a collection of tag objects, so it can be iterated
    for span in newurlset:
        # One news item links to an absolute URL, unlike the others,
        # so check before prepending the site prefix
        if 'http://news.hbue.edu.cn' in span.a.attrs['href']:
            newurl = span.a.attrs['href']
        else:
            newurl = 'http://news.hbue.edu.cn' + span.a.attrs['href']
        # The <a> tag is a child of the span node, so span.a selects it directly
        getnews(newurl)
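The scraped fields are only printed, not saved anywhere. As a follow-up, here is a minimal sketch of collecting the rows and writing them to a CSV file with pandas; the return-value variant of getnews, the column names, and the output filename hbue_news.csv are assumptions, not part of the original script:

import requests
from bs4 import BeautifulSoup
import re
import pandas as pd

def getnews(newurl):
    # Same parsing as above, but return the fields instead of printing them
    bs = BeautifulSoup(requests.get(newurl).content, 'lxml')
    title = re.sub(' ', '', bs.find(name='h1', class_='arti_title').string)
    publisher = bs.find(name='span', attrs={'class': 'arti_publisher'}).string
    date = bs.find(class_='arti_update').string
    return title, publisher, date

rows = []
for i in range(1, 3):
    url = 'http://news.hbue.edu.cn/jyyw/list' + str(i) + '.htm'
    bs = BeautifulSoup(requests.get(url).content, 'lxml')
    for span in bs.find_all(name='span', attrs={'class': 'Article_Title'}):
        href = span.a.attrs['href']
        if 'http://news.hbue.edu.cn' not in href:
            href = 'http://news.hbue.edu.cn' + href
        rows.append(getnews(href))

# Column names and the filename are illustrative assumptions
df = pd.DataFrame(rows, columns=['title', 'publisher', 'date'])
df.to_csv('hbue_news.csv', index=False, encoding='utf-8-sig')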
 

Source: blog.csdn.net/sgsdsdd/article/details/109325059