Scraping Boss Zhipin (boss直聘) job listings with Python 3 + Scrapy

Preface: this post records an engineering walkthrough and cites other articles; if anything is unclear, consult the originals. The focus is on recording the process, so steps the author is already familiar with may be performed without explanation. If you have questions, leave a comment or use a search engine.

References:

Installing Scrapy on Windows

Creating your first Scrapy project

1. Install Scrapy

Open PowerShell as administrator and run:

pip install scrapy

Note: pip must be installed before this step; search for instructions if needed.

2. Create a Scrapy project in a directory of your choice

scrapy startproject boss

3. Enter the project directory

cd boss

4. Generate the spider

scrapy genspider bosszhipin www.zhipin.com
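After the two commands above, the generated project has the typical Scrapy 1.x layout (bosszhipin.py is the file added by `genspider`):

```
boss/
├── scrapy.cfg
└── boss/
    ├── __init__.py
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders/
        ├── __init__.py
        └── bosszhipin.py
```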

5. Import the project into PyCharm and edit settings.py

Change ROBOTSTXT_OBEY = True

to ROBOTSTXT_OBEY = False, so Scrapy does not skip pages disallowed by the site's robots.txt.

6. Write bosszhipin.py and run.py

# -*- coding: utf-8 -*-
import scrapy


class BosszhipinSpider(scrapy.Spider):
    name = 'bosszhipin'
    allowed_domains = ['www.zhipin.com']
    start_urls = ['https://www.zhipin.com/c101270100-p100101/?page=1&ka=page-1']

    def parse(self, response):
        # For now, just dump the raw HTML to verify the request succeeds
        print(response.text)

Put run.py in the project root (next to scrapy.cfg):

from scrapy.cmdline import execute

# Equivalent to running `scrapy crawl bosszhipin` on the command line
execute(['scrapy', 'crawl', 'bosszhipin'])

Running it produced this error:

2018-11-04 13:03:36 [scrapy.core.engine] INFO: Spider opened
2018-11-04 13:03:36 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-11-04 13:03:36 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2018-11-04 13:03:37 [scrapy.core.engine] DEBUG: Crawled (403) <GET https://www.zhipin.com/c101270100-p100101/?page=1&ka=page-1> (referer: None)
2018-11-04 13:03:37 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <403 https://www.zhipin.com/c101270100-p100101/?page=1&ka=page-1>: HTTP status code is not handled or not allowed
2018-11-04 13:03:37 [scrapy.core.engine] INFO: Closing spider (finished)

The request came back 403, so the site is most likely blocking the default crawler; fix the request headers through a downloader middleware.

Add the following to middlewares.py (note the `random` import, which the middleware needs):

import random


class UserAgentMiddleware(object):

    def __init__(self, user_agent_list):
        self.user_agent = user_agent_list

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        # read the MY_USER_AGENT list from settings.py
        middleware = cls(crawler.settings.get('MY_USER_AGENT'))
        return middleware

    def process_request(self, request, spider):
        # pick a random user-agent for each outgoing request
        request.headers['user-agent'] = random.choice(self.user_agent)

Then enable the middleware and define MY_USER_AGENT in settings.py:

USER_AGENT = 'boss (+http://www.yourdomain.com)'
...
MY_USER_AGENT = [
    # one or more real browser User-Agent strings
]

DOWNLOADER_MIDDLEWARES = {
   'boss.middlewares.UserAgentMiddleware': 543,
}

(A commented-out DOWNLOADER_MIDDLEWARES block already exists in the generated settings.py; try uncommenting it and pointing it at the middleware above before hunting for other solutions.)
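The rotation logic can be checked without launching a crawl by exercising the middleware against a stand-in request object. A minimal sketch: `FakeRequest` and the UA strings below are illustrative stand-ins, not part of the original project.

```python
import random


# Copy of the middleware from middlewares.py, runnable outside Scrapy
class UserAgentMiddleware(object):
    def __init__(self, user_agent_list):
        self.user_agent = user_agent_list

    def process_request(self, request, spider):
        # pick a random user-agent for each outgoing request
        request.headers['user-agent'] = random.choice(self.user_agent)


class FakeRequest:
    """Stand-in for scrapy.Request: only needs a headers mapping."""
    def __init__(self):
        self.headers = {}


# Example MY_USER_AGENT values; any real browser UA strings work
MY_USER_AGENT = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/70.0.3538.77",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) Safari/605.1.15",
]

mw = UserAgentMiddleware(MY_USER_AGENT)
req = FakeRequest()
mw.process_request(req, spider=None)
assert req.headers['user-agent'] in MY_USER_AGENT
```

In a real crawl, Scrapy builds the middleware via `from_crawler` and calls `process_request` for every request, so each page is fetched with a different header.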

Run run.py again; this time the page HTML comes back.

Full code for the first stage is below. The next step will be storing results in MongoDB, since printing scraped text straight to stdout isn't much use on its own.

# -*- coding: utf-8 -*-
import scrapy


class BosszhipinSpider(scrapy.Spider):
    name = 'bosszhipin'
    allowed_domains = ['www.zhipin.com']
    start_urls = ['https://www.zhipin.com/c101270100-p100101/?page=1&ka=page-1']

    def parse(self, response):
        # The job list is a <ul> under #main; each <li> is one posting
        job_node_table = response.xpath('//*[@id="main"]/div/div[2]/ul')
        job_node_list = job_node_table.xpath("./li")
        for job_node in job_node_list:
            enterprise_node = job_node.xpath("./div/div[2]/div/h3/a")
            salary_node = job_node.xpath("./div/div[1]/h3/a/span")
            requirement_node = job_node.xpath("./div/div[1]/p")
            time_node = job_node.xpath("./div/div[3]/p")

            # string(.) concatenates all text nodes inside the element
            enterprise = enterprise_node.xpath('string(.)')
            salary = salary_node.xpath('string(.)')
            requirement = requirement_node.xpath('string(.)')
            update_time = time_node.xpath('string(.)')

            print("Company:", enterprise.extract_first().strip())
            print("Salary:", salary.extract_first().strip())
            print("Requirements:", requirement.extract_first().strip())
            print("Updated:", update_time.extract_first().strip())
            print()
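For the planned MongoDB step, Scrapy feeds each yielded item through an item pipeline, which is just a class with a `process_item` hook. The in-memory sketch below shows the shape; the class name and the pymongo swap noted in the comments are assumptions, not code from this post.

```python
class InMemoryPipeline:
    """Collects items in a list; a MongoDB version would hold a pymongo client instead."""

    def open_spider(self, spider):
        # MongoDB version: open a pymongo.MongoClient and pick a db/collection here
        self.items = []

    def process_item(self, item, spider):
        # MongoDB version: self.collection.insert_one(dict(item))
        self.items.append(dict(item))
        return item  # must return the item so later pipelines still see it

    def close_spider(self, spider):
        # MongoDB version: close the client here
        pass
```

To use it, the spider's print calls become `yield {...}` dicts and the pipeline is registered in settings.py, e.g. `ITEM_PIPELINES = {'boss.pipelines.InMemoryPipeline': 300}`.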


Reposted from www.cnblogs.com/huzhongyu/p/9903622.html