Scrapy
. requests + selenium --> cover roughly 90% of scraping needs
. Scrapy --> the remaining 10% --> faster, more powerful crawlers
. What is Scrapy? A framework
. re / bs4 / lxml are modules; a module is a hand, a framework is the whole body
. Built on Twisted, an asynchronous networking framework that speeds up downloading; makes heavy use of closures (callbacks)
Scrapy tutorial:
https://docs.scrapy.org/en/latest/intro/tutorial.html
Scrapy workflow (important)
The Spider hands the URLs to be requested, via the ScrapyEngine (engine), to the Scheduler.
The Scheduler sorts and enqueues the requests, then passes them back through the ScrapyEngine to the DownloaderMiddleware (user-agent, cookie, proxy) and on to the Downloader, which sends the request to the internet.
The Downloader makes the request to the internet and receives the response; the response goes back through the ScrapyEngine and the SpiderMiddleware to the Spider.
The Spider processes the response, extracts the data, and passes it through the ScrapyEngine to the ItemPipeline, which saves it.
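The flow above can be sketched as a toy event loop. This is NOT Scrapy's real implementation, just plain-Python stand-ins for the Scheduler, Downloader, Spider and ItemPipeline to make the hand-offs concrete (all names and the example URL are hypothetical):

```python
from collections import deque

def spider_start():
    # Spider: yields the initial request URLs
    yield "https://example.com/page1"

def downloader(url):
    # Downloader: would fetch the URL from the internet; faked here
    return f"<response for {url}>"

def spider_parse(response):
    # Spider: extracts an item from a response
    return {"data": response}

def pipeline(item, store):
    # ItemPipeline: persists the extracted item
    store.append(item)

def engine():
    scheduler = deque()          # Scheduler: queues and orders requests
    for url in spider_start():   # engine passes spider requests to the scheduler
        scheduler.append(url)
    items = []
    while scheduler:
        url = scheduler.popleft()
        response = downloader(url)    # engine -> Downloader
        item = spider_parse(response) # engine -> Spider
        pipeline(item, items)         # engine -> ItemPipeline
    return items

print(engine())
```

In real Scrapy the engine is asynchronous (Twisted), and the Spider's parse callbacks can yield new requests back into the scheduler as well as items.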
Setup
D:\2020st\pycharm.professional\xxx
$ cd scrapy框架
# Create a scrapy project
$ scrapy startproject mySpider
$ cd mySpider
# Generate a spider
$ scrapy genspider db douban.com
# After creating, run it from the project directory
scrapy框架\mySpider> scrapy crawl db
Scrapy
How did we handle pagination before, with plain requests?
Find the pattern in the page numbers
Call requests.get(url) for each page
Pagination in Scrapy
Find the URL of the next page (the pattern)
Build a Request for the next page's URL and pass it to the Scheduler
Goal: scrape each job posting's responsibilities and requirements
# list-page urls [pages 1 2 3 4]
Page 1:
https://careers.tencent.com/tencentcareer/api/post/Query?timestamp=1615465802464&countryId=1&cityId=&bgIds=&productId=&categoryId=40001001&parentCategoryId=&attrId=&keyword=&pageIndex=1&pageSize=10&language=zh-cn&area=us
Page 3:
https://careers.tencent.com/tencentcareer/api/post/Query?timestamp=1615467778990&countryId=1&cityId=&bgIds=&productId=&categoryId=40001001&parentCategoryId=&attrId=&keyword=&pageIndex=3&pageSize=10&language=zh-cn&area=us
# detail-page url
https://careers.tencent.com/tencentcareer/api/post/ByPostId?timestamp=1615466673257&postId=1369980726107185152&language=zh-cn
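Comparing the page-1 and page-3 list URLs with the standard library confirms the pattern: only the timestamp and pageIndex query parameters change between pages, so pageIndex is the slot to fill in when building the next-page request:

```python
from urllib.parse import urlsplit, parse_qs

page1 = "https://careers.tencent.com/tencentcareer/api/post/Query?timestamp=1615465802464&countryId=1&cityId=&bgIds=&productId=&categoryId=40001001&parentCategoryId=&attrId=&keyword=&pageIndex=1&pageSize=10&language=zh-cn&area=us"
page3 = "https://careers.tencent.com/tencentcareer/api/post/Query?timestamp=1615467778990&countryId=1&cityId=&bgIds=&productId=&categoryId=40001001&parentCategoryId=&attrId=&keyword=&pageIndex=3&pageSize=10&language=zh-cn&area=us"

# parse_qs drops blank parameters (cityId=, bgIds=, ...) from both URLs alike
q1 = parse_qs(urlsplit(page1).query)
q3 = parse_qs(urlsplit(page3).query)

# parameters whose values differ between the two pages
diff = {k for k in q1 if q1.get(k) != q3.get(k)}
print(diff)
```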
1. Open the Terminal pane in the IDE
2. cd into the target directory (copy the path)
3. Create the project: scrapy startproject tencent
4. Enter the project directory: cd tencent
5. Create the spider file: scrapy genspider hr2 tencent.com
6. Write the following code in the generated files
hr2.py
import scrapy
import json
from tencent.items import TencentItem

class HrSpider(scrapy.Spider):
    name = 'hr2'
    allowed_domains = ['tencent.com']
    # list-page url (pageIndex is filled in with .format())
    one_url = 'https://careers.tencent.com/tencentcareer/api/post/Query?timestamp=1615465802464&countryId=1&cityId=&bgIds=&productId=&categoryId=40001001&parentCategoryId=&attrId=&keyword=&pageIndex={}&pageSize=10&language=zh-cn&area=us'
    # detail-page url (postId is filled in with .format())
    two_url = 'https://careers.tencent.com/tencentcareer/api/post/ByPostId?timestamp=1615466673257&postId={}&language=zh-cn'
    start_urls = [one_url.format(1)]

    def parse(self, response):
        for page in range(1, 3):
            url = self.one_url.format(page)
            yield scrapy.Request(
                url=url,
                callback=self.parse_one
            )

    def parse_one(self, response):
        data = json.loads(response.text)
        for job in data['Data']['Posts']:
            item = TencentItem()
            item['zhi_jishu'] = job['CategoryName']
            item['zhi_type'] = job['RecruitPostName']
            # with the fields declared in items.py, a misspelled field name
            # fails fast, e.g.
            # KeyError: 'TencentItem does not support field: zhi_type111'
            post_id = job['PostId']
            # build the detail-page url
            detail_url = self.two_url.format(post_id)
            yield scrapy.Request(
                url=detail_url,
                meta={'item': item},
                callback=self.parse_two
            )

    def parse_two(self, response):
        # one way to receive the value passed via meta:
        # item = response.meta['item']
        # another way:
        item = response.meta.get('item')
        data = json.loads(response.text)  # response.text is a JSON string
        item['zhi_duty'] = data['Data']['Responsibility']
        item['zhi_require'] = data['Data']['Requirement']
        print(item)
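hr2.py imports TencentItem from tencent.items, but the notes never show that file. A plausible sketch of items.py, inferred from the field names used above (the file skeleton itself is generated by scrapy startproject; the comments are assumptions):

```python
import scrapy

class TencentItem(scrapy.Item):
    # Only declared fields may be assigned; any other key raises
    # KeyError: 'TencentItem does not support field: ...'
    zhi_type = scrapy.Field()     # job title (RecruitPostName)
    zhi_jishu = scrapy.Field()    # job category (CategoryName)
    zhi_duty = scrapy.Field()     # responsibilities (detail page)
    zhi_require = scrapy.Field()  # requirements (detail page)
```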
Run script
start.py
from scrapy import cmdline
cmdline.execute(['scrapy', 'crawl', 'hr2'])
Modify the config file: add LOG_LEVEL = 'WARNING'
settings.py
BOT_NAME = 'tencent'
LOG_LEVEL = 'WARNING'
SPIDER_MODULES = ['tencent.spiders']
NEWSPIDER_MODULE = 'tencent.spiders'
Comment out (so robots.txt rules are ignored):
# ROBOTSTXT_OBEY = True
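The workflow says the ItemPipeline saves the data, but the spider above only prints each item. A minimal pipeline sketch, saving items as JSON lines (the class name TencentPipeline and the output filename are assumptions; enable it via the ITEM_PIPELINES setting):

```python
import json

class TencentPipeline:
    # Enable in settings.py with:
    # ITEM_PIPELINES = {'tencent.pipelines.TencentPipeline': 300}

    def open_spider(self, spider):
        # called once when the spider starts
        self.f = open('tencent_jobs.jsonl', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # append each item as one JSON line
        self.f.write(json.dumps(dict(item), ensure_ascii=False) + '\n')
        return item  # pass the item on to any later pipelines

    def close_spider(self, spider):
        # called once when the spider finishes
        self.f.close()
```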