Using Scrapy to crawl the 51job website

Let's talk about the approach for crawling 51job (前程无忧):
1. The first idea that comes to mind is to use an XPath expression on the response to extract the values from the page.
2. But the value extracted for class='e' is not what I want, and when I go down to the span under class='t', I can't get a value at all: it returns an empty list.
3. Right-clicking the page to view the source shows where the class='e' value actually comes from. Looking further down, the values we want are embedded in JavaScript.
4. I then clicked to the second page and inspected the requests, hoping to find a URL that returns pure JSON. But the request just loads the second page's main interface, and no URL returns plain JSON text.
5. So the only option is to extract the content from the page source we already have; this is 51job's anti-scraping measure. I puzzled over it for a long time without finding a way, until I saw another blogger's method: save the HTML file and use a regular expression to match the content we want. That content is actually a JSON string assigned to the variable window.__SEARCH_RESULT__. Converting the string into a dictionary with the json module makes all the subsequent steps much easier.
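
To make that step concrete, here is a minimal standalone sketch of the extraction, assuming the search page has already been saved to job.html (the same file name the spider below uses); the regex pattern and the field names mirror the ones in the attached spider.

import json
import re

# Minimal sketch: read the saved search page and pull out the JSON string
# that 51job assigns to window.__SEARCH_RESULT__ inside a <script> tag.
with open("job.html", encoding="utf-8") as f:
    html = f.read()

# Same pattern as in the spider below: capture everything after the "=" and
# re-append the closing "}" that the pattern consumes.
raw = re.findall(r'window\.__SEARCH_RESULT__ =(.+)}</script>', html)[0] + "}"
data = json.loads(raw)  # now a normal Python dict

for job in data["engine_search_result"]:
    print(job["company_name"], job["providesalary_text"], job["job_href"])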




6. From that same data we can also get the URL of each job's detail page.
7. Then, through the callback parameter of scrapy.Request, a newly defined method requests the detail URL and parses the response text.
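
As a minimal illustration of that pattern (a stripped-down, hypothetical spider with placeholder URLs, not the full one attached below): the partially filled item is passed along in meta and picked up again in the callback.

import scrapy

class DetailSketchSpider(scrapy.Spider):
    # Hypothetical minimal spider, only to illustrate callback + meta;
    # the real spider with the actual URLs is attached below.
    name = "detail_sketch"
    start_urls = ["https://search.51job.com/"]   # placeholder

    def parse(self, response):
        item = {"name": "some company"}            # partially filled item
        detail_url = "https://jobs.51job.com/"     # placeholder detail URL
        # Request the detail page; parser_detail receives the response and
        # can read the item back out of response.meta.
        yield scrapy.Request(detail_url, callback=self.parser_detail,
                             meta={"item": item})

    def parser_detail(self, response):
        item = response.meta["item"]
        item["content"] = response.xpath("string(//body)").get()
        yield item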



8. Then get all the text content of the child tags through the parent tag (it took a long time to find a method for this, and it is very practical): call XPath's string(.) on the parent node, as parser_detail does below. See the small demo after this step.


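A small self-contained demo of that trick, using a made-up HTML fragment with the same div class the detail pages use: string(.) on the parent node returns the concatenated text of all of its children.

from scrapy.http import HtmlResponse

# Made-up HTML fragment just to demonstrate the string(.) trick.
html = b"""
<div class="bmsg job_msg inbox">
  <p>Job description</p>
  <p>Requirement: <span>experience with crawlers</span></p>
</div>
"""
response = HtmlResponse(url="http://example.com", body=html, encoding="utf-8")

# Select the parent div, then take the string value of the whole node:
# this returns the text of the div and every child tag as one string.
text = response.xpath("//div[@class='bmsg job_msg inbox']").xpath("string(.)").extract()[0]
print(text)
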
9. The last step is to construct the URL of the next page to implement pagination.
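
For reference, next-page construction is just string formatting on the search URL template (the same template the spider below uses); the page number is incremented and substituted in.

# The search URL template used by the spider below; {} is the page number.
url = 'https://search.51job.com/list/040000,000000,0000,00,9,99,java,2,{}.html?'

page = 1
next_url = url.format(str(page + 1))
print(next_url)  # the spider yields a new Request for this URL with callback=self.parse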

Earlier my brain wasn't quite working: I kept forgetting to enable the pipeline, so no data was being saved at all. Be sure to enable the pipeline in the settings.
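
For reference, enabling the pipeline is just this entry in settings.py (the full settings are attached below); without it, process_item is never called and nothing reaches MongoDB.

# settings.py -- the pipeline must be registered here, or no data is saved.
ITEM_PIPELINES = {
    'Job.pipelines.JobPipeline': 300,
}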


Attached code:
The spider's main py file

# -*- coding: utf-8 -*-
import scrapy
import json
import re

class Job51Spider(scrapy.Spider):
    name = 'Job51'
    allowed_domains = ['jobs.51job.com', 'search.51job.com']
    url = 'https://search.51job.com/list/040000,000000,0000,00,9,99,java,2,{}.html?'
    page = 1
    start_urls = [url.format(str(page))]

    def parse(self, response):
        try:
            body = response.body.decode("GBK")
            with open("job.html", "w", encoding='utf-8') as f:
                f.write(body) # 此处需要拿到网页通过正则匹配
            data = re.findall('window.__SEARCH_RESULT__ =(.+)}</script>', str(body))[0] + "}"
            data = json.loads(data)
            for job in data["engine_search_result"]:
                item = {}
                item["name"] = job["company_name"]
                item["price"] = job["providesalary_text"]
                item["workarea"] = job["workarea_text"]
                item["updatedate"] = job["updatedate"]
                item["companytype"] = job["companytype_text"]
                item["jobwelf"] = job["jobwelf"]
                item["companyind"] = job["companyind_text"]
                item["href"] = job["job_href"]
                # print(type(item["href"]))
                detail_url = item["href"]
                yield scrapy.Request(
                    detail_url,
                    callback=self.parser_detail,
                    meta={"item": item}
                )
            self.page += 1
            next_url = self.url.format(str(self.page))
            yield scrapy.Request(
                next_url,
                callback=self.parse
            )

        except (IndexError, TypeError):
            # findall() no longer matches the __SEARCH_RESULT__ JSON once we run past the last page
            print("crawling finished")



    def parser_detail(self,response):
        item = response.meta["item"]
        welfare = response.xpath("//div[@class='bmsg job_msg inbox']").xpath('string(.)').extract()[0]
        item["content"] = welfare
        yield item

pipelines.py file

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html

from pymongo import MongoClient

client = MongoClient(host="127.0.0.1", port=27017)

collection = client["Job"]["result"]


class JobPipeline(object):
    def process_item(self, item, spider):
        # insert_one() replaces the deprecated insert() from older pymongo versions
        collection.insert_one(dict(item))
        return item

settings.py file (only the settings that were changed, including enabling the item pipeline)

BOT_NAME = 'Job'

SPIDER_MODULES = ['Job.spiders']
NEWSPIDER_MODULE = 'Job.spiders'
ROBOTSTXT_OBEY = True
COOKIES_ENABLED = False
DEFAULT_REQUEST_HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.92 Safari/537.36',
    'Cookie': 'guid=e93e7f6213a66f3dde7a0e1; _ujz=MTY0MTc3NDk3MA%3D%3D; ps=needv%3D0; 51job=cuid%3D164177497%26%7C%26cusername%3Dfuwei_fw123%2540163.com%26%7C%26cpassworB6%25CE%25B0%26%7C%26cemail%3Dfuwei_fw123%2540163.com%26%7',
}

ITEM_PIPELINES = {
    'Job.pipelines.JobPipeline': 300,  # must match the class defined in pipelines.py
}

Attached is a screenshot of MongoDB (the crawl speed is still slow; a later optimization will be to turn this into a distributed crawler).


Origin blog.csdn.net/Python_BT/article/details/108212572