Learning Python: News Crawler (Part 5)

Category: Other | 2019-05-13 14:30:54

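Goal: download every news article linked from the Sina News homepage (https://news.sina.com.cn/) to local disk.
First fetch the homepage, extract all news links with a regular expression, then fetch each article in turn and save it locally.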

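import urllib.request
import re

# Download the homepage; "ignore" skips bytes that cannot be decoded as UTF-8.
data = urllib.request.urlopen("https://news.sina.com.cn/").read()
data2 = data.decode("utf-8", "ignore")
# The parentheses form a capture group, so findall returns only the URL inside href="...".
pat = 'href="(https://news.sina.com.cn/.*?)"'
allurl = re.compile(pat).findall(data2)
for i, thisurl in enumerate(allurl):
    # The target directory must already exist.
    file = "f:/pytest/unit/news/" + str(i) + ".html"
    try:
        urllib.request.urlretrieve(thisurl, file)
        print(thisurl)
    except OSError as err:
        # URLError and HTTPError are subclasses of OSError; skip the link and continue.
        print("failed:", thisurl, err)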
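
The pattern above can return the same link several times, and its unescaped dots match any character. A minimal sketch of a stricter extraction step, assuming a hypothetical helper named `extract_links` and a small inline HTML sample standing in for the real homepage (neither is part of the original post):

```python
import re

def extract_links(html, prefix):
    """Return unique links in html that start with prefix, in first-seen order."""
    # re.escape keeps the dots in the prefix literal; [^"]* stops at the closing quote.
    pattern = re.compile(r'href="(' + re.escape(prefix) + r'[^"]*)"')
    # dict.fromkeys drops duplicates while keeping insertion order.
    return list(dict.fromkeys(pattern.findall(html)))

# Hypothetical sample standing in for the downloaded homepage.
sample = (
    '<a href="https://news.sina.com.cn/c/1.shtml">A</a>'
    '<a href="https://news.sina.com.cn/c/2.shtml">B</a>'
    '<a href="https://news.sina.com.cn/c/1.shtml">A again</a>'
    '<a href="https://example.com/other">C</a>'
)
links = extract_links(sample, "https://news.sina.com.cn/")
print(links)
```

Deduplicating before the download loop avoids fetching the same article twice, which is likely on a homepage that links a story from several places.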

Crawling the CSDN blog homepage

import urllib.request
import re

url1 = 'https://blog.csdn.net/'

# Send a browser-like User-Agent header with every request.
header = ('User-Agent', "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.26 Safari/537.36 Core/1.63.5221.400 QQBrowser/10.0.1125.400")
opener = urllib.request.build_opener()
opener.addheaders = [header]
# Install the opener globally so that urlretrieve below sends the header as well.
urllib.request.install_opener(opener)
data = opener.open(url1).read()
data2 = data.decode('utf-8', 'ignore')

pat = 'href="(https://blog.csdn.net/.*?)"'
url = re.compile(pat).findall(data2)
# Open the link list once, before the loop, and write one URL per line.
with open('f:/pytest/unit/blog.txt', 'w') as fh:
    for i, thisurl in enumerate(url):
        fh.write(thisurl + '\n')
        file = 'f:/pytest/unit/blog' + str(i) + '.html'
        urllib.request.urlretrieve(thisurl, file)
        print(thisurl)
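
`urlretrieve` goes through the globally installed opener, so a header set on a private opener only reaches it after `urllib.request.install_opener`. One alternative that avoids global state is to attach the header to each request. A minimal sketch, assuming a hypothetical helper named `build_request` and a shortened placeholder User-Agent string; no network request is made here:

```python
import urllib.request

# Shortened placeholder value, not the exact string used in the post.
USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; WOW64)"

def build_request(url, user_agent=USER_AGENT):
    """Build a Request that carries a browser-like User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})

req = build_request("https://blog.csdn.net/")
# urllib stores header names capitalized, hence "User-agent".
print(req.get_header("User-agent"))
# To actually fetch the page: urllib.request.urlopen(req).read()
```

The resulting `Request` object can be passed to `urllib.request.urlopen` in place of a plain URL string, and the response body written to a file by hand.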

Reposted from blog.csdn.net/weixin_39892788/article/details/89855582
