Scraping Static Web Pages with requests and BeautifulSoup

Category: Other · 2021-03-19 21:25:21

1. Case Description

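This example uses requests and BeautifulSoup to scrape the first two pages of the "经院要闻" (campus headlines) section of the Hubei University of Economics (湖北经济学院) news site. The goal is each news item's title, date, publisher, and content; the code below prints the title, publisher, and date.
2. Scraping Approach
First open the listing page (http://news.hbue.edu.cn/jyyw/list.htm), right-click and choose "Inspect" to open the browser's developer tools.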

Every listing page has a URL of the form http://news.hbue.edu.cn/jyyw/list + number + .htm, so the different listing pages can be fetched by changing the number.
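As a minimal sketch of that URL pattern, the addresses of the first two listing pages could be generated as follows. The helper name `list_page_url` is hypothetical and was introduced only for this illustration.

```python
# Hypothetical helper (not part of the original post): build the URL of
# listing page n from the "list<n>.htm" pattern described above.
BASE = 'http://news.hbue.edu.cn/jyyw/list'


def list_page_url(n):
    return f'{BASE}{n}.htm'


# range(1, 3) yields 1 and 2, i.e. the first two listing pages
page_urls = [list_page_url(n) for n in range(1, 3)]
print(page_urls)
```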

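On each listing page, the link to every news article sits inside a span with class="Article_Title", so these spans can be used to collect the URLs of the individual article pages.
3. Code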
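The idea can be tried offline on a small HTML fragment first. The sketch below assumes a simplified, made-up version of the listing markup (the real page may be structured differently, and the href values are invented) and uses Python's built-in html.parser so that no extra parser has to be installed.

```python
from bs4 import BeautifulSoup

# Made-up fragment imitating the listing page; the real markup may differ
# and these href values are placeholders, not real article paths.
sample = '''
<ul>
  <li><span class="Article_Title"><a href="/example/a.htm">News A</a></span></li>
  <li><span class="Article_Title"><a href="/example/b.htm">News B</a></span></li>
</ul>
'''

soup = BeautifulSoup(sample, 'html.parser')

# find_all returns every matching <span>; span.a is the first <a> inside it
links = [span.a['href'] for span in soup.find_all('span', class_='Article_Title')]
print(links)
```

Checking the selector against a saved fragment like this makes it easier to tell a wrong class name apart from a network problem.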

import re

import requests
from bs4 import BeautifulSoup

BASE_URL = 'http://news.hbue.edu.cn'


def getnews(newurl):
    """Fetch one article page and print its title, publisher and date."""
    html = requests.get(newurl, timeout=10)
    bs = BeautifulSoup(html.content, 'lxml')
    the_title = bs.find(name='h1', class_='arti_title')
    # Remove the spaces from the title with a regular expression
    title = re.sub(' ', '', the_title.string)
    publisher = bs.find(name='span', attrs={'class': 'arti_publisher'})
    date = bs.find(class_='arti_update')
    print(title)
    # .string returns the text content of the selected node
    print(publisher.string)
    print(date.string)


for page in range(1, 3):
    url = BASE_URL + '/jyyw/list' + str(page) + '.htm'
    # Fetch the listing page with requests' get method
    html = requests.get(url, timeout=10)
    # Pass .content (raw bytes) so BeautifulSoup works out the encoding itself
    bs = BeautifulSoup(html.content, 'lxml')
    # find_all returns a collection of Tag objects, so it can be looped over
    newurlset = bs.find_all(name='span', attrs={'class': 'Article_Title'})
    for span in newurlset:
        # span.a selects the first <a> tag inside the span
        href = span.a.attrs['href']
        # One article's link already includes the site prefix while the
        # others are site-relative, hence this check
        if BASE_URL in href:
            newurl = href
        else:
            newurl = BASE_URL + href
        getnews(newurl)
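The check that tells absolute links from site-relative ones can also be handed to the standard library. This is an alternative sketch, not the original author's method: `urllib.parse.urljoin` leaves an absolute URL unchanged and resolves a relative one against the base. The href values here are illustrative placeholders.

```python
from urllib.parse import urljoin

BASE_URL = 'http://news.hbue.edu.cn'

# Illustrative hrefs only: one site-relative, one already absolute
hrefs = ['/example/page.htm', 'http://news.hbue.edu.cn/example/other.htm']

# urljoin resolves relative paths against BASE_URL and keeps absolute URLs as-is
full_urls = [urljoin(BASE_URL, href) for href in hrefs]
print(full_urls)
```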

Reposted from blog.csdn.net/sgsdsdd/article/details/109325059
