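1. Open the 51job home page, enter Python as the keyword, and select Shenzhen as the location to get the search results page:
2. Create a new Scrapy project named jobs, following the same steps as in: https://blog.csdn.net/qq523176585/article/details/82955675
3. What differs from that project:
Add the following code to items.py: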
from scrapy import Item, Field

class JobsItem(Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    job = Field()
    company = Field()
    area = Field()
    salary = Field()
    datetime = Field()
Add the following code to settings.py:
ROBOTSTXT_OBEY = False
# Imitate a browser to get past anti-scraping checks
USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36'
# Avoid garbled characters in exported feeds
FEED_EXPORT_ENCODING = 'gbk'
ITEM_PIPELINES = {
    'jobs.pipelines.MongoPipeline': 300,
}
MONGO_URL = 'localhost'
MONGO_DB = '51job'
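The settings above register jobs.pipelines.MongoPipeline, but the pipeline itself is not shown in this post. The following is a minimal sketch of what pipelines.py might look like, not the author's actual code. It assumes that MONGO_URL and MONGO_DB are read from the settings, and that each item is stored in a collection named after its item class (an inference from the later use of db.JobsItem).

```python
# pipelines.py (hypothetical sketch; the original pipeline is not shown)


class MongoPipeline:
    """Store every scraped item in MongoDB."""

    def __init__(self, mongo_url, mongo_db):
        self.mongo_url = mongo_url
        self.mongo_db = mongo_db
        self.client = None
        self.db = None

    @classmethod
    def from_crawler(cls, crawler):
        # Scrapy calls this hook and hands over the crawler, which
        # exposes the values defined in settings.py.
        return cls(
            mongo_url=crawler.settings.get('MONGO_URL'),
            mongo_db=crawler.settings.get('MONGO_DB'),
        )

    def open_spider(self, spider):
        # Imported here so the module can be loaded even when pymongo
        # is not installed (e.g. when only inspecting the class).
        import pymongo
        self.client = pymongo.MongoClient(self.mongo_url)
        self.db = self.client[self.mongo_db]

    def process_item(self, item, spider):
        # Assumption: the collection name is the item's class name,
        # e.g. "JobsItem".
        self.db[type(item).__name__].insert_one(dict(item))
        return item

    def close_spider(self, spider):
        if self.client is not None:
            self.client.close()
```

With this layout, a JobsItem yielded by the spider ends up in the JobsItem collection of the 51job database.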
Add the following code to the .py file in the spiders folder:
# -*- coding: utf-8 -*-
import scrapy
import time
from jobs.items import JobsItem

class A51jobSpider(scrapy.Spider):
    name = 'a51job'
    allowed_domains = ['search.51job.com']
    # start_urls = ['http://www.51job.com/']
    start_urls = ["https://search.51job.com/list/040000,000000,0000,00,9,99,python,2,1.html?lang=c&stype=1&postchannel=0000&workyear=99&cotype=99&degreefrom=99&jobterm=99&companysize=99&lonlat=0%2C0&radius=-1&ord_field=0&confirmdate=9&fromType=&dibiaoid=0&address=&line=&specialarea=00&from=&welfare="]

    def parse(self, response):
        infos = response.css('.el')
        for info in infos:
            item = JobsItem()
            job = info.css('a::attr("title")')
            # Skip rows without a job title (e.g. the table header)
            if len(job) == 0:
                continue
            item['job'] = info.css('a::attr("title")').extract_first().strip()
            item['company'] = info.css('span a::attr("title")').extract()[-1].strip()
            item['area'] = info.css('.t3::text').extract_first().strip()
            item['datetime'] = info.css('.t5::text').extract_first().strip()
            salary = info.css('.t4::text')
            # Some postings list no salary; yield the item without that field
            if len(salary) == 0:
                yield item
                continue
            item['salary'] = info.css('.t4::text').extract_first().strip()
            yield item
        time.sleep(1)
        url = response.css('.bk a::attr("href")').extract()[-1]  # find the link to the next page
        yield scrapy.Request(url=url, callback=self.parse)  # parse the next page
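The spider pauses with time.sleep(1) between pages. Scrapy runs on a single event loop, so sleeping inside a callback stalls all other work for that second. A common alternative is to let Scrapy pace the requests itself through the DOWNLOAD_DELAY setting. This is a suggestion and not part of the original project:

```python
# settings.py (suggested alternative, not part of the original project)
# Wait roughly 1 second between consecutive requests to the same site.
DOWNLOAD_DELAY = 1
```

By default Scrapy randomizes the actual wait to between 0.5 and 1.5 times this value, which also makes the request pattern look less mechanical.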
Run the spider; the results are saved to the MongoDB database, as shown in the figure:
Analysis:
In [1]: import pymongo
In [2]: client = pymongo.MongoClient(host = 'localhost',port = 27017)
In [3]: db = client['51job']
In [4]: collection = db.JobsItem
In [5]: collection.find().count()
C:\Users\Administrator\AppData\Local\Programs\Python\Python37\Scripts\ipython:1: DeprecationWarning: count is deprecated. Use Collection.count_documents instead.
Out[5]: 5326
There are 5326 job postings in total.
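The DeprecationWarning above recommends Collection.count_documents in place of the cursor's count(). A small sketch of the replacement call follows; the helper name count_postings is made up for illustration.

```python
def count_postings(collection, query=None):
    """Count matching documents with the non-deprecated pymongo API.

    `collection` is expected to be a pymongo Collection; an empty
    filter ({}) counts every document.
    """
    return collection.count_documents(query or {})
```

In the session above this would be called as count_postings(collection), or with a filter such as count_postings(collection, {'area': '深圳'}) (the filter value here is only an example).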
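# Query the first 50 postings published today
# (salaries are the raw strings from the site: 千 = thousand yuan, 万 = ten thousand yuan, 月 = month)
In [6]: results = collection.find({'datetime':{'$gt':'10-06'}}).limit(50)
In [7]: for result in results:
   ...:     print("Company: {}\tSalary: {}".format(result.get('company'), result.get('salary')))
   ...:
Company: 深圳市度点科技有限公司 Salary: 0.8-2万/月
Company: 深圳市恒牛科技有限公司 Salary: 3-4万/月
Company: 深圳市光速度科技有限公司 Salary: 6-9千/月
Company: 深圳市德梅寒科技有限公司 Salary: 2-2.5万/月
Company: 深圳市卓达电子有限公司 Salary: 1-1.6万/月
Company: 深圳市易思博酷客科技有限公司 Salary: 1-1.8万/月
Company: 睿思商业智能(深圳)有限公司 Salary: 4.5-7千/月
Company: 达观数据 Salary: 1.8-3.6万/月
Company: 深圳德聚企业管理咨询有限公司 Salary: 2-4万/月
Company: 深圳飞豹航天航空科技有限公司 Salary: 1-1.5万/月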
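The salaries are stored as raw strings such as 0.8-2万/月, which makes them awkward to sort or average. The helper below is one possible way to turn them into numbers, not part of the original analysis. It handles only the two monthly formats visible in the output above (千/月 and 万/月) and returns None for anything else; the name parse_salary is hypothetical.

```python
import re

# Multipliers for the units seen in the output above.
_UNIT = {'千': 1_000, '万': 10_000}

# Matches strings like "0.8-2万/月" or "6-9千/月".
_PATTERN = re.compile(r'^(\d+(?:\.\d+)?)-(\d+(?:\.\d+)?)(千|万)/月$')


def parse_salary(text):
    """Return (low, high) monthly salary in yuan, or None if unrecognized."""
    if not text:
        return None
    match = _PATTERN.match(text.strip())
    if match is None:
        return None
    low, high, unit = match.groups()
    scale = _UNIT[unit]
    return round(float(low) * scale), round(float(high) * scale)


if __name__ == '__main__':
    for raw in ['0.8-2万/月', '6-9千/月', '4.5-7千/月']:
        print(raw, parse_salary(raw))
```

Postings without a salary field can be passed through unchanged, since parse_salary(None) returns None.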