Create the project
scrapy startproject tencent
cd tencent  # enter the project directory
scrapy genspider hr tencent.com  # hr is the spider's name, tencent.com is the allowed crawl domain
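The two commands above produce the standard Scrapy project layout, with the generated spider placed under spiders/:

```
tencent/
    scrapy.cfg          # deploy configuration
    tencent/
        __init__.py
        items.py        # item definitions
        middlewares.py
        pipelines.py    # item pipelines
        settings.py     # project settings
        spiders/
            __init__.py
            hr.py       # created by scrapy genspider
```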
hr.py
- Set the initial URL.
- Open the page source and locate the fields you need with XPath.
- Use text() to take a tag's text value; use @ to take an attribute value, e.g. the href of an a tag:
response.xpath("//a[@id='next']/@href").extract_first()
- Append extract_first() to each expression so it returns the first match as a string (or None if nothing matches).
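A minimal sketch of these XPath expressions evaluated directly with lxml (assumed available; Scrapy's selectors are built on it), against a made-up HTML snippet shaped like the job-listing page:

```python
# Hypothetical HTML snippet shaped like the tencent HR listing page,
# to show what text() and @href select.
from lxml import html

doc = html.fromstring("""
<table class="tablelist">
  <tr><th>heading row</th></tr>
  <tr><td><a href="position_detail.php?id=1">Backend Engineer</a></td></tr>
</table>
<a id="next" href="position.php?start=10">next page</a>
""")

# text() takes a tag's text value; @href takes an attribute value
titles = doc.xpath("//table[@class='tablelist']//a/text()")
next_href = doc.xpath("//a[@id='next']/@href")

print(titles[0])     # → Backend Engineer
print(next_href[0])  # → position.php?start=10
```

Scrapy's extract_first() does the same as taking element [0] here, except it returns None instead of raising when the list is empty.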
# -*- coding: utf-8 -*-
import scrapy


class HrSpider(scrapy.Spider):
    name = 'hr'
    allowed_domains = ['tencent.com']
    start_urls = ['https://hr.tencent.com/position.php']

    def parse(self, response):
        # Skip the header row and the trailing pagination row
        tr_list = response.xpath("//table[@class='tablelist']/tr")[1:-1]
        for tr in tr_list:
            item = {}
            item["title"] = tr.xpath("./td[1]/a/text()").extract_first()
            item["position"] = tr.xpath("./td[2]/text()").extract_first()
            item["publish_data"] = tr.xpath("./td[5]/text()").extract_first()
            yield item
        # First find the URL of the next page
        next_url = response.xpath("//a[@id='next']/@href").extract_first()
        if next_url != "javascript:;":  # the last page's "next" link is javascript:;
            next_url = "https://hr.tencent.com/" + next_url
            yield scrapy.Request(
                next_url,
                callback=self.parse,
                # meta={"item": item}
            )

    # def parse1(self, response):
    #     item = response.meta["item"]
    #     yield item
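The spider above completes the relative next-page href by string concatenation, which works because this site's hrefs are always relative to the root. The stdlib urljoin (which Scrapy's response.urljoin wraps) handles the general case and is harder to get wrong:

```python
# urljoin resolves a relative href against the page's own URL,
# so it works for both relative and already-absolute links.
from urllib.parse import urljoin

page_url = "https://hr.tencent.com/position.php"

# A relative href like the one in the "next" link
print(urljoin(page_url, "position.php?start=10"))
# → https://hr.tencent.com/position.php?start=10

# An already-absolute href passes through unchanged
print(urljoin(page_url, "https://hr.tencent.com/about.php"))
# → https://hr.tencent.com/about.php
```

Inside a spider you would write response.urljoin(next_url) and get the same result without hard-coding the base URL.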
Yield the item into the pipeline
pipelines.py
- Initialize MongoDB
- Create the tencent database and the hr collection
- Insert via the collection
# -*- coding: utf-8 -*-
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
from pymongo import MongoClient

client = MongoClient()
collection = client["tencent"]["hr"]


class TencentPipeline(object):
    def process_item(self, item, spider):
        # insert() was removed in pymongo 4; insert_one() replaces it
        collection.insert_one(dict(item))
        with open("./position.text", "a") as f:
            f.write(str(item))
        print(item)
        return item
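As the comment at the top of pipelines.py says, the pipeline only runs once it is registered in settings.py. For this project the entry would look like the following (300 is an arbitrary priority between 0 and 1000; lower numbers run earlier when several pipelines are enabled):

```python
# settings.py (fragment)
ITEM_PIPELINES = {
    "tencent.pipelines.TencentPipeline": 300,
}
```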
# Back to hr.py
- Get the href value of the next-page a tag with XPath.
- That href is incomplete, so it must be joined with the base URL before requesting it.
The yield scrapy.Request(next_url, callback=self.parse) call schedules the crawl of the next page; here the callback is the parse method itself. The callback could also be a different parse function, such as the commented-out parse1. In that case, to pass item between the two parse functions you need the meta parameter, done exactly as in the commented-out lines.
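A Scrapy-free sketch of that hand-off, using hypothetical minimal stand-ins for Request and Response: parse yields a request carrying the item in meta, and the second callback retrieves and extends it, just as parse1 would:

```python
# Hypothetical minimal stand-ins for scrapy's Request/Response,
# only to show how meta carries an item between callbacks.
class Request:
    def __init__(self, url, callback, meta=None):
        self.url = url
        self.callback = callback
        self.meta = meta or {}

class Response:
    def __init__(self, request):
        self.url = request.url
        self.meta = request.meta  # scrapy exposes request.meta on the response

def parse(response):
    item = {"title": "engineer"}
    # Hand the partially built item to the next callback via meta
    yield Request("https://hr.tencent.com/position_detail.php",
                  callback=parse1, meta={"item": item})

def parse1(response):
    item = response.meta["item"]       # retrieve what parse() stored
    item["detail_url"] = response.url  # enrich it with data from this page
    yield item

# Drive the two callbacks by hand, the way the scrapy engine would
request = next(parse(Response(Request("https://hr.tencent.com/position.php", parse))))
item = next(request.callback(Response(request)))
print(item)
```

In a real spider the engine performs the driving step: it downloads request.url and calls request.callback with the real response, whose meta is copied from the request.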