Python 3: Scraping 51job Data, Storing It in MongoDB, and Analyzing It

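1. Open the 51job home page, enter Python as the search keyword, and select Shenzhen as the location. This gives the search results page: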

https://search.51job.com/list/040000,000000,0000,00,9,99,Python,2,1.html?lang=c&stype=&postchannel=0000&workyear=99&cotype=99&degreefrom=99&jobterm=99&companysize=99&providesalary=99&lonlat=0%2C0&radius=-1&ord_field=0&confirmdate=9&fromType=&dibiaoid=0&address=&line=&specialarea=00&from=&welfare=

2. Create a new Scrapy project named jobs, following the same steps as in: https://blog.csdn.net/qq523176585/article/details/82955675

3. What differs from that post:

Add the following code to items.py:

from scrapy import Item,Field

class JobsItem(Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    job = Field()
    company = Field()
    area = Field()
    salary = Field()
    datetime = Field()

Add the following code to settings.py:

ROBOTSTXT_OBEY = False
# Imitate a browser to get past anti-scraping checks
USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36'
# Avoid garbled characters in exported feed files
FEED_EXPORT_ENCODING = 'gbk'

ITEM_PIPELINES = {
    'jobs.pipelines.MongoPipeline': 300,
}

MONGO_URL = 'localhost'
MONGO_DB = '51job'
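
The settings above register jobs.pipelines.MongoPipeline, but the post does not show pipelines.py. Below is a minimal sketch of what such a pipeline might look like, modeled on the MongoDB example in the Scrapy documentation. It is an assumption, not the author's code. In particular, the collection is named after the item class, which would be consistent with the db.JobsItem collection queried in the analysis session later on.

```python
# jobs/pipelines.py (hypothetical sketch; the original post does not show this file)


class MongoPipeline:
    """Store each scraped item in MongoDB, one collection per item class."""

    def __init__(self, mongo_url, mongo_db):
        self.mongo_url = mongo_url
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        # Read the two custom settings defined in settings.py
        return cls(
            mongo_url=crawler.settings.get('MONGO_URL'),
            mongo_db=crawler.settings.get('MONGO_DB'),
        )

    def open_spider(self, spider):
        # Imported here so the class can be loaded without pymongo installed;
        # a top-level import works equally well
        import pymongo
        self.client = pymongo.MongoClient(self.mongo_url)
        self.db = self.client[self.mongo_db]

    def process_item(self, item, spider):
        # Collection name = item class name, e.g. "JobsItem"
        collection_name = item.__class__.__name__
        self.db[collection_name].insert_one(dict(item))
        return item

    def close_spider(self, spider):
        self.client.close()
```

Scrapy calls from_crawler once to build the pipeline, open_spider and close_spider at the start and end of the crawl, and process_item for every item the spider yields.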

Add the following code to the .py file in the spiders folder:

# -*- coding: utf-8 -*-
import scrapy
import time
from jobs.items import JobsItem


class A51jobSpider(scrapy.Spider):
    name = 'a51job'
    allowed_domains = ['search.51job.com']
    # start_urls = ['http://www.51job.com/']
    start_urls = ["https://search.51job.com/list/040000,000000,0000,00,9,99,python,2,1.html?lang=c&stype=1&postchannel=0000&workyear=99&cotype=99&degreefrom=99&jobterm=99&companysize=99&lonlat=0%2C0&radius=-1&ord_field=0&confirmdate=9&fromType=&dibiaoid=0&address=&line=&specialarea=00&from=&welfare="]

    def parse(self, response):
        infos = response.css('.el')
        for info in infos:
            item = JobsItem()
            job = info.css('a::attr("title")')
            if len(job) == 0:
                continue
            item['job'] = info.css('a::attr("title")').extract_first().strip()
            item['company'] = info.css('span a::attr("title")').extract()[-1].strip()
            item['area'] = info.css('.t3::text').extract_first().strip()
            item['datetime'] = info.css('.t5::text').extract_first().strip()
            salary = info.css('.t4::text')
            if len(salary) == 0:
                yield item
                continue
            item['salary'] = info.css('.t4::text').extract_first().strip()
            yield item

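        time.sleep(1)  # crude throttle; this blocks Scrapy's event loop (the DOWNLOAD_DELAY setting is the usual alternative)
        links = response.css('.bk a::attr("href")').extract()  # pagination links; the last one is taken as the next page
        if links:  # avoid an IndexError when the page has no pagination link
            yield scrapy.Request(url=links[-1], callback=self.parse)  # parse the next page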
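
The spider pauses between pages with time.sleep(1), which stalls Scrapy's event loop while it waits. As a hedged alternative, the delay could be left to Scrapy itself through settings.py. DOWNLOAD_DELAY and AUTOTHROTTLE_ENABLED are standard Scrapy settings; the values below are illustrative and were not part of the original project.

```python
# settings.py (optional additions; values are illustrative)
DOWNLOAD_DELAY = 1           # seconds to wait between consecutive requests to the same site
AUTOTHROTTLE_ENABLED = True  # let Scrapy adapt the delay to how quickly the server responds
```

With these in place, the time.sleep(1) call and the import time line in the spider would no longer be needed.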

Run the spider (scrapy crawl a51job); the scraped items are saved to the MongoDB database.

Analysis:

In [1]: import pymongo

In [2]: client = pymongo.MongoClient(host = 'localhost',port = 27017)

In [3]: db = client['51job']

In [4]: collection = db.JobsItem

In [5]: collection.find().count()
C:\Users\Administrator\AppData\Local\Programs\Python\Python37\Scripts\ipython:1:
 DeprecationWarning: count is deprecated. Use Collection.count_documents instead
.
Out[5]: 5326

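There are 5326 job postings in total. As the warning says, the call in In [5] is deprecated (and it was removed in PyMongo 4); collection.count_documents({}) is the replacement.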

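# Query the first 50 postings published today (datetime is stored as an 'MM-DD' string, so $gt compares strings)
In [6]: results = collection.find({'datetime':{'$gt':'10-06'}}).limit(50)

In [7]: for result in results:
    ...:     print ("Company: {}\tSalary: {}".format(result.get('company'),result.get(
    ...: 'salary')))
    ...:
Company: 深圳市度点科技有限公司     Salary: 0.8-2万/月
Company: 深圳市恒牛科技有限公司     Salary: 3-4万/月
Company: 深圳市光速度科技有限公司   Salary: 6-9千/月
Company: 深圳市德梅寒科技有限公司   Salary: 2-2.5万/月
Company: 深圳市卓达电子有限公司     Salary: 1-1.6万/月
Company: 深圳市易思博酷客科技有限公司       Salary: 1-1.8万/月
Company: 睿思商业智能(深圳)有限公司       Salary: 4.5-7千/月
Company: 达观数据   Salary: 1.8-3.6万/月
Company: 深圳德聚企业管理咨询有限公司       Salary: 2-4万/月
Company: 深圳飞豹航天航空科技有限公司       Salary: 1-1.5万/月

In the salary strings, 千/月 means thousand yuan per month and 万/月 means ten thousand yuan per month.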
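
The salaries are stored as text such as 0.8-2万/月, so any numeric analysis first needs them converted to numbers. The following is a minimal sketch of one way to do that, assuming every value follows the low-high-unit pattern seen in the output above. The function name parse_monthly_salary is hypothetical and not from the original post, and anything that does not match the pattern is returned as None rather than guessed at.

```python
import re

# Multipliers for the two units seen in the sample output
UNIT_TO_YUAN = {'千': 1_000, '万': 10_000}

# Matches strings such as "0.8-2万/月" or "6-9千/月"
SALARY_PATTERN = re.compile(r'^(\d+(?:\.\d+)?)-(\d+(?:\.\d+)?)(千|万)/月$')


def parse_monthly_salary(text):
    """Return (low, high) in yuan per month, or None if the format is not recognized."""
    match = SALARY_PATTERN.match(text.strip())
    if match is None:
        return None
    low, high, unit = match.groups()
    factor = UNIT_TO_YUAN[unit]
    # round() removes floating-point noise from products such as 0.8 * 10000
    return round(float(low) * factor), round(float(high) * factor)


print(parse_monthly_salary('0.8-2万/月'))  # (8000, 20000)
print(parse_monthly_salary('4.5-7千/月'))  # (4500, 7000)
```

Applied to each document's salary field, this would give numeric ranges that can be averaged or sorted. Records with no salary field, or with a different format, would need separate handling.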

Reposted from blog.csdn.net/qq523176585/article/details/82958509