Learning Python: Crawling Sina News and CSDN Blog Posts

Other 2018-12-07 10:31:09

Copyright notice: This is an original article by the author and may not be reproduced without the author's permission. https://blog.csdn.net/VABTC/article/details/84784033

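import os
import re  # regular expression module
import urllib.error
import urllib.request

# Read the Sina News front page and bind the raw bytes to data.
data = urllib.request.urlopen("http://news.sina.com.cn/").read()
# headers = ("", "")  # request headers that imitate a browser can be set here; see the earlier posts
# opener...
# opener
data2 = data.decode("utf-8", "ignore")  # decode the bytes to text, ignoring invalid sequences
pat = 'href="(http://news.sina.com.cn/.*?)"'
allurl = re.compile(pat).findall(data2)  # every link captured by the group in pat
os.makedirs("E:/practice/sinanews/", exist_ok=True)  # urlretrieve does not create the folder
for i, thisurl in enumerate(allurl):
    try:
        print("Crawl #" + str(i))
        file = "E:/practice/sinanews/" + str(i) + ".html"
        urllib.request.urlretrieve(thisurl, file)  # download the page
        print("---success---")
    except urllib.error.URLError as e:
        # HTTPError (a subclass of URLError) carries a status code
        if hasattr(e, "code"):
            print(e.code)
        if hasattr(e, "reason"):
            print(e.reason)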
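
The first script downloads every link the pattern matches, and a front page usually links to the same article more than once. As a rough sketch (not part of the original script), the links could be deduplicated before downloading. The helper name `extract_links` and the HTML fragment `sample_html` are made up for illustration; the pattern is the one from the script.

```python
import re

# Same pattern as in the script: capture absolute links under news.sina.com.cn.
LINK_PATTERN = re.compile(r'href="(http://news.sina.com.cn/.*?)"')


def extract_links(html):
    """Return the matching links in first-seen order, without duplicates."""
    seen = set()
    links = []
    for link in LINK_PATTERN.findall(html):
        if link not in seen:
            seen.add(link)
            links.append(link)
    return links


# Hypothetical HTML fragment, used here instead of a live request.
sample_html = (
    '<a href="http://news.sina.com.cn/china/">China</a>'
    '<a href="http://news.sina.com.cn/world/">World</a>'
    '<a href="http://news.sina.com.cn/china/">China again</a>'
    '<a href="http://sports.sina.com.cn/">Sports</a>'
)

print(extract_links(sample_html))
# ['http://news.sina.com.cn/china/', 'http://news.sina.com.cn/world/']
```

Because the pattern has one capturing group, `findall` returns only the URL inside the quotes, and the link on another subdomain is skipped because it does not start with the required prefix.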

import os
import re
import urllib.error
import urllib.request

url = "http://blog.csdn.net/"
# Send a browser User-Agent so the request looks like it comes from a browser.
headers=("User-Agent","Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.53 Safari/537.36")
opener = urllib.request.build_opener()
opener.addheaders = [headers]
urllib.request.install_opener(opener)  # later urlopen and urlretrieve calls use this opener
data = urllib.request.urlopen(url).read().decode("utf-8", "ignore")
pat = 'href="(https://blog.csdn.net/.*?)"'
result = re.compile(pat).findall(data)
os.makedirs("E:/practice/csdntitle/", exist_ok=True)  # urlretrieve does not create the folder
for i, thisurl in enumerate(result):
    try:
        print("Crawl #" + str(i))
        file = "E:/practice/csdntitle/" + str(i) + ".html"
        urllib.request.urlretrieve(thisurl, file)  # download the page
        print("success")
    except urllib.error.URLError as e:
        # HTTPError (a subclass of URLError) carries a status code
        if hasattr(e, "code"):
            print(e.code)
        if hasattr(e, "reason"):
            print(e.reason)
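
The second script sets the User-Agent by installing a global opener, which affects every later `urllib.request` call in the process. One possible alternative, sketched here under my own assumptions rather than taken from the original post, is to attach the header to each request with `urllib.request.Request`. The helper names `build_request` and `fetch_text` are hypothetical; the User-Agent string is the one from the script.

```python
import urllib.request

# User-Agent string copied from the script.
USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/71.0.3578.53 Safari/537.36"
)


def build_request(url, user_agent=USER_AGENT):
    """Create a Request that carries its own User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch_text(url, timeout=10):
    """Download url and decode it as UTF-8, ignoring invalid bytes."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as resp:
        return resp.read().decode("utf-8", "ignore")


# Offline check: inspect the request without sending it.
req = build_request("http://blog.csdn.net/")
print(req.full_url)
print(req.get_header("User-agent"))  # Request stores header names capitalized
```

With this approach, a page would be saved by calling `fetch_text(thisurl)` and writing the result to a file with ordinary file I/O, since `urlretrieve` takes a URL string and not a `Request` object.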

Reposted from blog.csdn.net/VABTC/article/details/84784033
