BeautifulSoup Syntax Notes (Scraping Sina News)

Category: Database | Posted: 2018-07-19 02:09:12

These notes use scraping Sina News as the running example.

import requests
from bs4 import BeautifulSoup

def getSoup(newsurl):
    # Download the page and parse it into a BeautifulSoup object
    res=requests.get(newsurl)
    # Decode the response as UTF-8 so the Chinese text is not garbled
    res.encoding='utf-8'
    soup=BeautifulSoup(res.text,'html.parser')
    return soup
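
The line res.encoding='utf-8' matters because requests falls back to ISO-8859-1 for text responses whose Content-Type header names no charset. Whether Sina's servers actually omit the charset is an assumption here; the sketch below only shows, offline, what decoding UTF-8 bytes with the wrong codec does to Chinese text.

```python
# UTF-8 bytes as they might arrive over the wire (the sample string is illustrative).
raw_bytes = '新浪新闻'.encode('utf-8')

# Decoding with the wrong codec yields one junk character per byte.
wrong = raw_bytes.decode('iso-8859-1')

# Decoding with the right codec restores the original text.
right = raw_bytes.decode('utf-8')

print(len(wrong), len(right))
```

Setting res.encoding before reading res.text tells requests which of these two decodings to use.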

newsurl is the link to a single article taken from the Sina News home page.

Print soup to inspect the structure of the page.
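
A minimal sketch of that inspection step, run on a small made-up HTML string rather than a live page (the markup is a hypothetical stand-in; the real Sina page is far larger and may be structured differently). prettify() returns the parse tree as an indented string, which is usually easier to read than the raw print.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for a downloaded article page.
sample_html = """
<html><head><title>Sample headline</title></head>
<body><div id="article"><p>First paragraph.</p></div></body></html>
"""

soup = BeautifulSoup(sample_html, 'html.parser')

# prettify() puts each tag on its own line, indented by nesting depth.
pretty = soup.prettify()
print(pretty)
```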

# The page title is the text of the first <title> element
title=soup.select('title')[0].text
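
For reference, select() always returns a list, which is why the [0] is needed, while select_one() returns the first match or None. The sketch below compares the two on a made-up HTML string; the 'h9' selector is just a tag that does not exist in the sample.

```python
from bs4 import BeautifulSoup

# Hypothetical page used only for illustration.
sample_html = "<html><head><title>Sample headline</title></head><body></body></html>"
soup = BeautifulSoup(sample_html, 'html.parser')

# select() returns a list of matches, so index 0 picks the first one.
title_from_select = soup.select('title')[0].text

# select_one() returns the first match directly, or None when nothing matches.
first = soup.select_one('title')
title_from_select_one = first.text if first is not None else ''

# A selector with no match gives None instead of raising IndexError.
missing = soup.select_one('h9')

print(title_from_select, title_from_select_one, missing)
```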

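def getArtcle(soup):
    article=[]
    # Take every <p> inside the element with id="article", skipping the last one
    for p in soup.select('#article p')[:-1]:
        article.append(p.text.strip())
    # Join the paragraphs into a single string
    return ' '.join(article)

print(getArtcle(getSoup('http://news.sina.com.cn/w/2018-07-17/doc-ihfkffam0100205.shtml')))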
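
To see what the '#article p' selector and the [:-1] slice do, here is a small offline sketch. The HTML is invented for illustration; in particular, the assumption that the last paragraph inside #article is the editor line is mine and is not confirmed by the notes.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the real article page may be laid out differently.
sample_html = """
<div id="article">
  <p>  First paragraph.  </p>
  <p>Second paragraph.</p>
  <p class="show_author">责任编辑：张三 </p>
</div>
<p>Outside the article.</p>
"""
soup = BeautifulSoup(sample_html, 'html.parser')

# '#article p' matches only <p> tags nested inside the element with id="article".
paragraphs = soup.select('#article p')

# [:-1] drops the final match (here, the assumed editor line) before joining.
body = ' '.join(p.text.strip() for p in paragraphs[:-1])
print(body)
```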

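Note that the editor line cannot be selected with soup.select('#show_author')[0].text.strip('责任编辑：'): the # prefix matches an id, but show_author is a class on the <p> element, so the selector finds nothing. Use the following syntax instead: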

editor=soup.select('p[class="show_author"]')[0].text.strip('责任编辑：')
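
The sketch below checks the three selector forms against a made-up editor line (the name 张三 and the markup are illustrative). It also shows one caveat: str.strip() treats its argument as a set of characters, not a prefix, so a name that starts with one of the characters in 责任编辑： would lose it. Slicing off the label is one possible alternative; it is not what the original notes use.

```python
from bs4 import BeautifulSoup

# Hypothetical editor line for illustration.
sample_html = '<p class="show_author">责任编辑：张三 </p>'
soup = BeautifulSoup(sample_html, 'html.parser')

by_id = soup.select('#show_author')               # id selector: no match
by_attr = soup.select('p[class="show_author"]')   # attribute selector: matches
by_class = soup.select('p.show_author')           # class selector: also matches

# Remove the label as a prefix instead of as a character set.
label = '责任编辑：'
raw = by_attr[0].text.strip()
editor = raw[len(label):] if raw.startswith(label) else raw

# strip() removes any leading/trailing characters found in its argument.
over_stripped = '责任编辑：任某'.strip('责任编辑：')

print(len(by_id), len(by_attr), len(by_class), editor, over_stripped)
```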

Finally, here is the complete code.

import requests
from bs4 import BeautifulSoup

def getSoup(newsurl):
    res=requests.get(newsurl)
    res.encoding='utf-8'
    soup=BeautifulSoup(res.text,'html.parser')
    return soup

def getTitle(soup):
    title=soup.select('title')[0].text
    return title

def getArtcle(soup):
    article=[]
    for p in soup.select('#article p')[:-1]:
        article.append(p.text.strip())
    return ' '.join(article)

def getEditor(soup):
    editor=soup.select('p[class="show_author"]')[0].text.strip('责任编辑：')
    return editor

def getTime(soup):
    time=soup.select('span[class="date"]')[0].text
    return time

def catch_all(newsurl):
    soup=getSoup(newsurl)
    print('Title:',getTitle(soup),'\n'
          'Time:',getTime(soup),'\n'
          'Content:',getArtcle(soup),'\n'
          'Editor:',getEditor(soup),'\n')

# Pass the link of any article from the Sina News home page to catch_all
# to print that article's title, time, content, and editor

catch_all('http://news.sina.com.cn/w/2018-07-17/doc-ihfkffam0100205.shtml')

catch_all('http://news.sina.com.cn/c/2018-07-17/doc-ihfkffak9643422.shtml')
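
getTime returns the date as plain text. If a real datetime object is wanted (for sorting or storing), something like the sketch below could work. The sample string and the format '%Y年%m月%d日 %H:%M' are assumptions about what the page shows, and parse_news_time is a hypothetical helper, not part of the original code.

```python
from datetime import datetime

# Assumed example of the text returned by getTime(); the real format may differ.
sample_time = '2018年07月17日 09:04'

def parse_news_time(text, fmt='%Y年%m月%d日 %H:%M'):
    """Hypothetical helper: convert the page's date string into a datetime."""
    return datetime.strptime(text.strip(), fmt)

parsed = parse_news_time(sample_time)
print(parsed.isoformat())
```

If the page uses a different layout, strptime raises ValueError, so the format string would need adjusting to match.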


Reposted from blog.csdn.net/u014165082/article/details/81083120
