The Basic Framework of a Python Web Crawler - 代码天地

Category: Other, posted 2018-05-03 16:43:30

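1. The basic crawler workflow:

Fetch the page's HTML from its URL with the get method of the requests library

Open the page source in a browser and inspect the element nodes

Extract the data you want with BeautifulSoup or regular expressions

Save the data to local disk or to a database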

2. Let's get started

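import requests
from bs4 import BeautifulSoup

url = "http://www.jianshu.com"

page = requests.get(url)  # page.status_code comes back as 403, so the request was refused (2xx means success; 4xx and 5xx indicate errors)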
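To make the status-code remark concrete, here is a small sketch that uses only the standard library's `http.HTTPStatus`. The helper name `describe_status` is hypothetical and not part of requests; it simply groups codes by their hundreds digit, so any 2xx code counts as success rather than 200 alone.

```python
from http import HTTPStatus


def describe_status(code):
    """Return a label such as '403 Forbidden (client error)'."""
    # HTTPStatus raises ValueError for codes it does not know.
    phrase = HTTPStatus(code).phrase
    if 100 <= code < 200:
        kind = "informational"
    elif 200 <= code < 300:
        kind = "success"
    elif 300 <= code < 400:
        kind = "redirect"
    elif 400 <= code < 500:
        kind = "client error"
    else:
        kind = "server error"
    return f"{code} {phrase} ({kind})"


print(describe_status(200))
print(describe_status(403))
```

With requests, the number to pass in is `page.status_code`, and `page.raise_for_status()` raises an exception for 4xx and 5xx responses if you would rather fail fast.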

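# A 403 here most likely means the server is refusing the default User-Agent that requests sends. Check the site's robots.txt to confirm that crawling is permitted, then send a browser-like User-Agent:

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36'}

# Fetch the HTML page

page = requests.get(url, headers=headers)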
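Since the robots protocol comes up above, the following sketch shows one way to check a URL against robots.txt rules with the standard library's `urllib.robotparser`. The rules in `SAMPLE_ROBOTS_TXT` are made up for illustration and are not the real rules of any site; `build_parser` is a hypothetical helper name.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only. A real crawler
# would download the file from the site's /robots.txt instead.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /search
Disallow: /settings
"""


def build_parser(robots_txt):
    """Parse robots.txt text that is already in memory."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


rules = build_parser(SAMPLE_ROBOTS_TXT)

# can_fetch(user_agent, url) reports whether the rules permit the request.
home_allowed = rules.can_fetch("*", "http://www.jianshu.com/")
search_allowed = rules.can_fetch("*", "http://www.jianshu.com/search")

print(home_allowed, search_allowed)
```

To load a site's actual rules, call `set_url()` with the robots.txt address followed by `read()` on the same parser object. Note that robots.txt is advisory: it states what the site permits, while a 403 is the server actively refusing a request.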

# Note: the encoding is sometimes detected incorrectly, so set it before reading page.text

page.encoding = page.apparent_encoding

demo = page.text
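The encoding note is easier to see with a tiny offline example. The sketch below decodes the same UTF-8 bytes twice: once with ISO-8859-1 (the fallback requests uses for text responses whose headers declare no charset) and once with UTF-8. The sample string is just an illustration.

```python
# Bytes as they might arrive over the network: UTF-8 encoded Chinese text.
raw = "简书".encode("utf-8")

# Wrong guess: ISO-8859-1 never fails, but it yields one garbage
# character per byte instead of the original text.
wrong = raw.decode("iso-8859-1")

# Correct encoding: the original text comes back.
right = raw.decode("utf-8")

print(repr(wrong))
print(right)
```

This is why `page.encoding` should be set before `page.text` is read: `text` decodes the raw bytes using whatever encoding is configured at that moment.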

# Parse the content into a BeautifulSoup object, using html.parser as the parser (brewing the soup)

soup = BeautifulSoup(demo, 'html.parser')

# Pretty-print the HTML, which makes the element nodes easier to inspect

print(soup.prettify())

# Find all <a> tags whose class is 'title'

titles = soup.find_all('a', 'title')
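To show what this selection picks out without needing a network connection, here is a stand-in written with the standard library's `html.parser` instead of BeautifulSoup. It is not how BeautifulSoup is implemented; it only mimics the effect of selecting `<a>` tags that carry the class `title`. The class name `TitleLinkParser` and the markup in `SAMPLE_HTML` are hypothetical.

```python
from html.parser import HTMLParser


class TitleLinkParser(HTMLParser):
    """Collect (text, href) pairs for every <a> tag with class 'title'."""

    def __init__(self):
        super().__init__()
        self.links = []         # collected (text, href) pairs
        self._in_title = False  # True while inside a matching <a>
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # A tag may carry several space-separated classes.
        classes = (attrs.get("class") or "").split()
        if tag == "a" and "title" in classes:
            self._in_title = True
            self._href = attrs.get("href")
            self._text = []

    def handle_data(self, data):
        if self._in_title:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._in_title:
            self.links.append(("".join(self._text).strip(), self._href))
            self._in_title = False


# Made-up markup for illustration; the real page will differ.
SAMPLE_HTML = """
<ul>
  <li><a class="title" href="/p/111">First post</a></li>
  <li><a class="avatar" href="/u/42">someone</a></li>
  <li><a class="title featured" href="/p/222">Second post</a></li>
</ul>
"""

parser = TitleLinkParser()
parser.feed(SAMPLE_HTML)
parser.close()
links = parser.links
print(links)
```

The second link is skipped because its class is `avatar`, while the third still matches because `title` is one of its classes.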

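# Print the text and the article link of each tag found

for title in titles:
    print(title.string)  # the tag's text (None if the tag contains nested tags)
    print("http://www.jianshu.com" + title.get('href'))  # get() reads the href attribute; dir(title) lists the available methods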
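The loop above builds each link by string concatenation, which works when every href is a site-relative path. As a more forgiving alternative, the standard library's `urllib.parse.urljoin` can be used; the sketch below assumes the same base URL, and the helper name `absolute_link` and the sample hrefs are hypothetical.

```python
from urllib.parse import urljoin

BASE_URL = "http://www.jianshu.com"


def absolute_link(href):
    """Resolve an href found on the page against the site's base URL."""
    return urljoin(BASE_URL, href)


# A site-relative path is attached to the base URL.
print(absolute_link("/p/abc123"))

# An href that is already absolute is returned unchanged.
print(absolute_link("https://example.com/post"))
```

Plain concatenation would turn the second href into a broken URL, which is the case `urljoin` guards against.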

# Write the results to local disk

with open('aa.txt', 'w', encoding='utf-8') as f:
    for title in titles:
        f.write(title.string + '\n')
        f.write('http://www.jianshu.com' + title.get('href') + '\n\n')
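The workflow in section 1 also mentions saving to a database, which the post does not demonstrate. A minimal sketch with the standard library's `sqlite3` might look like the following. The table name `articles`, its columns, and the sample rows are assumptions made for illustration; an in-memory database is used so that nothing is written to disk.

```python
import sqlite3

# Hypothetical scraped results as (title, url) pairs; the last row repeats
# an earlier URL to show how duplicates are handled.
rows = [
    ("First post", "http://www.jianshu.com/p/111"),
    ("Second post", "http://www.jianshu.com/p/222"),
    ("First post again", "http://www.jianshu.com/p/111"),
]

# ":memory:" keeps the database in RAM; pass a file name to persist it.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE IF NOT EXISTS articles (title TEXT, url TEXT UNIQUE)"
)

# Parameterized inserts avoid quoting problems in titles; OR IGNORE skips
# rows whose url is already stored.
conn.executemany(
    "INSERT OR IGNORE INTO articles (title, url) VALUES (?, ?)", rows
)
conn.commit()

saved = conn.execute(
    "SELECT title, url FROM articles ORDER BY url"
).fetchall()
print(saved)
conn.close()
```

The `UNIQUE` constraint on `url` means that running the crawler repeatedly does not pile up duplicate entries.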


Reposted from www.cnblogs.com/lmt921108/p/8986153.html
